Level: Bantu, Yoruba (37 subjects & 643,862 SNPs)
Previous level: West Africa
Our goal
We seek Ancestry Informative Markers (AIMs) that may be used to separate the following two populations:
- Bantu (18 subjects)
- Yoruba (19 subjects)
Principal Components Analysis
To achieve the above separation via Principal Components Analysis (PCA) and a simple five-nearest-neighbor (5-NN) classifier, our analysis indicated that we should retain only the top principal component (eigenSNP) of the data. The following plot shows the projection of the data onto the top eigenSNP.
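The projection and classification step can be sketched as follows. This is a minimal illustration, not the pipeline used for the analysis: it assumes genotypes are encoded as 0/1/2 allele counts in a subjects-by-SNPs matrix, and the function names `top_component_projection` and `knn_predict` are hypothetical.

```python
import numpy as np

def top_component_projection(genotypes):
    """Return each subject's coordinate on the top principal component.

    genotypes: (subjects x SNPs) array; 0/1/2 allele counts are an
    assumed encoding, not something the source specifies.
    """
    centered = genotypes - genotypes.mean(axis=0)
    # Thin SVD of the column-centered matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def knn_predict(train_x, train_y, query_x, k=5):
    """Majority vote among the k nearest training subjects (1-D coordinates).

    train_y must hold non-negative integer population labels.
    """
    predictions = []
    for q in query_x:
        nearest = np.argsort(np.abs(train_x - q))[:k]
        votes = np.bincount(train_y[nearest])
        predictions.append(int(np.argmax(votes)))
    return np.array(predictions)

# Toy coordinates (made up) standing in for projections of labeled subjects.
train_x = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4])
train_y = np.array([0] * 5 + [1] * 5)
print(knn_predict(train_x, train_y, np.array([0.05, 5.05])))
```

Because only one component is retained, each subject is reduced to a single coordinate, so the nearest-neighbor search is just a comparison of absolute differences.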
Selecting panels of AIMs using PCA scores
We first identified the top 5,000 PCA Informative Markers (PCAIMs for short). These SNPs were selected using a simple metric of correlation with the top eigenSNP (see Paschou et al. 2007 for details). Significant redundancy is observed within the top 5,000 markers (for example, a considerable number of pairs of markers are in linkage disequilibrium). To better summarize the top 5,000 PCAIMs, we clustered them into 50 clusters (the number of clusters was indicated by our analysis) and chose one representative SNP from each cluster: the SNP with the highest PCA score, i.e., the one most correlated with the top eigenSNP. We provide the list of all top 5,000 PCAIMs, as well as their cluster assignments and their respective PCA scores. To some extent, SNPs within the same cluster should be interchangeable in constructing and/or interpreting panels of ancestry informative markers. (The file is tab-delimited; the first column corresponds to cluster number, the second column corresponds to SNP rs number, the third column corresponds to the SNP chromosome, the fourth column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs, and the fifth column corresponds to the PCA score of the SNP. The SNPs are sorted in descending order with respect to their PCA score within each cluster, and the clusters are sorted in descending order with respect to the highest PCA score among all SNPs in the cluster.)
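As an illustration of the file layout just described, the following sketch reads the five-column list and keeps the highest-scoring SNP of each cluster. It assumes the file has no header row and that cluster numbers are integers; the function names and the sample rows are hypothetical placeholders, not real markers or scores.

```python
import csv
import io

def read_clustered_markers(handle):
    """Read the five-column, tab-delimited marker list into dictionaries."""
    rows = []
    for cluster, rs_id, chrom, gene, score in csv.reader(handle, delimiter="\t"):
        rows.append({
            "cluster": int(cluster),
            "rs": rs_id,
            "chrom": chrom,  # kept as text so non-numeric labels survive
            "gene": None if gene == "--" else gene,  # "--" marks intergenic SNPs
            "score": float(score),
        })
    return rows

def representatives(rows):
    """Keep the highest-scoring SNP per cluster, sorted by descending score."""
    best = {}
    for row in rows:
        current = best.get(row["cluster"])
        if current is None or row["score"] > current["score"]:
            best[row["cluster"]] = row
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

# Placeholder rows, NOT real data: identifiers, genes, and scores are made up.
SAMPLE = (
    "1\trs_example_a\t2\tGENE_X\t0.91\n"
    "1\trs_example_b\t2\t--\t0.88\n"
    "2\trs_example_c\t11\t--\t0.85\n"
)
panel = representatives(read_clustered_markers(io.StringIO(SAMPLE)))
print([row["rs"] for row in panel])
```

Since the file is already sorted by score within each cluster, the first row of every cluster is its representative; the explicit comparison above simply avoids relying on that ordering.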
The following 50-SNP panel, consisting of the SNP with the highest PCA score in each of the 50 clusters, was determined to be sufficient to classify an individual into one of the aforementioned two populations. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their PCA score, i.e., their correlation coefficient with the top eigenSNP. Thus, the first SNP is the one most highly correlated with the top eigenSNP.)
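To make the notion of "correlation with the top eigenSNP" concrete, here is a minimal sketch that takes the description literally: the score of a SNP is the absolute Pearson correlation between its genotype column and the top left singular vector of the centered data. This is an assumption for illustration; the exact scoring metric is defined in Paschou et al. 2007 and may differ. The name `pca_scores` and the 0/1/2 genotype encoding are likewise hypothetical.

```python
import numpy as np

def pca_scores(genotypes):
    """Absolute correlation of every SNP column with the top eigenSNP.

    genotypes: (subjects x SNPs) array of allele counts (assumed encoding).
    SNPs that do not vary across subjects receive a score of 0.
    """
    centered = genotypes - genotypes.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    top = u[:, 0]  # unit norm and zero mean, since the columns are centered
    col_norms = np.linalg.norm(centered, axis=0)
    scores = np.zeros(genotypes.shape[1])
    varying = col_norms > 0
    scores[varying] = np.abs(centered[:, varying].T @ top) / col_norms[varying]
    return scores

# Toy genotypes (made up): two SNPs that split the subjects, one constant SNP.
toy = np.array([[0, 0, 1], [0, 0, 1], [2, 2, 1], [2, 2, 1]], dtype=float)
ranking = np.argsort(-pca_scores(toy))  # SNP indices, highest score first
```

Sorting SNPs by this score in descending order gives the kind of ordering described for the panel files.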
We also provide two more panels of AIMs that emerged by clustering the top 5,000 PCAIMs into 100 and 150 clusters, respectively. (Each file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their PCA score.)
Selecting panels of AIMs using the Informativeness for Assignment (In) metric
We also identified the top 5,000 Informative Markers (INFAIMs for short) according to the Informativeness for Assignment (In) metric described in Rosenberg et al. 2003. Again, significant redundancy is observed within the top 5,000 markers. To better summarize the top 5,000 INFAIMs, we clustered them into 50 clusters (see comment above) and chose one representative SNP from each cluster: the SNP with the highest In score. We provide the list of all top 5,000 INFAIMs, as well as their cluster assignments and their respective In scores. To some extent, SNPs within the same cluster should be interchangeable in constructing and/or interpreting panels of ancestry informative markers. (The file is tab-delimited; the first column corresponds to cluster number, the second column corresponds to SNP rs number, the third column corresponds to the SNP chromosome, the fourth column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs, and the fifth column corresponds to the In score of the SNP. The SNPs are sorted in descending order with respect to their In score within each cluster, and the clusters are sorted in descending order with respect to the highest In score among all SNPs in the cluster.)
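For readers unfamiliar with the In statistic, the sketch below computes it for a single marker from per-population allele frequencies, following the definition in Rosenberg et al. 2003: the entropy of the average allele frequencies minus the average of the per-population entropies. It assumes equally weighted populations and natural logarithms; the function name `informativeness` is illustrative, and how the frequencies were estimated for this analysis is not stated here.

```python
import math

def informativeness(freqs):
    """In for one marker.

    freqs[i][j] is the frequency of allele j in population i; each row
    should sum to 1. Terms with zero frequency contribute 0 (0 * log 0 = 0).
    """
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        mean_freq = sum(pop[j] for pop in freqs) / k
        if mean_freq > 0:
            total -= mean_freq * math.log(mean_freq)
        for pop in freqs:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total

# A biallelic SNP fixed for different alleles in two populations reaches the
# two-population maximum, log(2); identical frequencies give 0.
fixed_difference = informativeness([[1.0, 0.0], [0.0, 1.0]])
no_difference = informativeness([[0.3, 0.7], [0.3, 0.7]])
```

Ranking SNPs by this value in descending order and keeping the first 5,000 would yield a list analogous to the INFAIMs.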
Finally, we provide a panel of 50 SNPs, i.e., the SNPs with the highest In score in each of the 50 clusters described above. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their In score.)
We also provide two more panels of AIMs that emerged by clustering the top 5,000 INFAIMs into 100 and 150 clusters, respectively. (Each file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their In score.)
Overlap between the panels
There were 968 common SNPs between the top 5,000 PCAIMs and the top 5,000 INFAIMs (a 19.4% overlap). The following two plots show the distributions of the PCA scores and the In scores; scores to the right of the vertical blue line correspond to the top 5,000 AIMs.
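The overlap figure is the number of shared rs numbers divided by the panel size (968 / 5,000 = 19.36%, reported as 19.4%). A small sketch of that computation, with a hypothetical helper `overlap` and made-up identifiers, could look like this:

```python
def overlap(panel_a, panel_b):
    """Return the shared markers and their share of panel_a, in percent."""
    unique_a = set(panel_a)
    common = unique_a & set(panel_b)
    return common, 100.0 * len(common) / len(unique_a)

# Placeholder identifiers, not real markers.
common, percent = overlap(["a", "b", "c", "d"], ["c", "d", "e", "f"])
print(sorted(common), percent)

# The arithmetic behind the reported figure.
reported = round(100.0 * 968 / 5000, 1)
```

Both lists contain 5,000 markers, so the percentage is the same whichever panel is used as the denominator.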