Level: Africa
(120 subjects & 643,862 SNPs)
Previous level: World



Our goal

We seek Ancestry Informative Markers (AIMs) that may be used to separate the following four groups of populations:
Principal Components Analysis


In order to achieve the above separation via Principal Components Analysis (PCA) and a simple five Nearest Neighbor (5-NN) algorithm, our analysis indicated that we should retain the top two (2) principal components (eigenSNPs) in the data. The following plot shows the projection of the data on the two eigenSNPs.
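As a rough illustration of the projection and classification steps described above, the following minimal sketch computes the top two principal components with an SVD and classifies points by a majority vote among their five nearest neighbors. The 0/1/2 genotype coding, the function names, and the toy data (four equally sized groups with random allele frequencies) are assumptions made for illustration only; they are not the data or the code used in our analysis.

```python
import numpy as np

def top_eigensnp_projection(genotypes, k=2):
    """Project subjects onto the top-k principal components.

    genotypes: (subjects x SNPs) array, assumed to be coded 0/1/2.
    Returns (subjects x k) coordinates and (SNPs x k) loadings.
    """
    X = genotypes - genotypes.mean(axis=0)          # center each SNP column
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k].T

def knn_predict(train_xy, train_labels, query_xy, n_neighbors=5):
    """Majority vote among the n_neighbors closest training subjects."""
    train_xy = np.asarray(train_xy, dtype=float)
    train_labels = np.asarray(train_labels)
    preds = []
    for q in np.asarray(query_xy, dtype=float):
        dist = np.linalg.norm(train_xy - q, axis=1)
        nearest = np.argsort(dist)[:n_neighbors]
        values, counts = np.unique(train_labels[nearest], return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# Toy data only: 4 hypothetical groups of 30 subjects each, 200 SNPs.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 30)
freqs = rng.uniform(0.05, 0.95, size=(4, 200))
genotypes = rng.binomial(2, freqs[labels])
coords, loadings = top_eigensnp_projection(genotypes, k=2)
```

In practice the classifier would be evaluated on subjects held out from the training set, for example by leave-one-out cross-validation on `coords`.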

Selecting panels of AIMs using PCA scores

We first identified the top 5,000 PCA Informative Markers (PCAIMs for short). These SNPs were selected using a simple metric of correlation with the top two eigenSNPs (see Paschou et al 2007 for details). Significant "redundancies" are observed within the top 5,000 markers (for example, a considerable number of pairs of markers are in linkage disequilibrium). In order to better summarize the top 5,000 PCAIMs, we clustered them into 30 clusters (the number of clusters was indicated by our analysis), and chose one representative SNP from each cluster. (For each cluster, we chose the representative SNP to be the one that has the highest PCA score, i.e., the one that is most correlated with the top two eigenSNPs.)
We provide the list of all top 5,000 PCAIMs, as well as their cluster assignments and their respective PCA scores. To some extent, SNPs within the same cluster should be interchangeable in constructing and/or interpreting panels of ancestry informative markers. (The file is tab-delimited; the first column corresponds to the cluster number, the second column corresponds to the SNP rs number, the third column corresponds to the SNP chromosome, the fourth column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs, and the fifth column corresponds to the PCA score of the SNP. The SNPs are sorted in descending order with respect to their PCA score within each cluster, and the clusters are sorted in descending order with respect to the highest PCA score among all SNPs in the cluster.)
The following 30-SNP panel, consisting of the SNP with the highest PCA score in each of the 30 clusters, was determined to be sufficient to classify an individual into one of the aforementioned four groups of populations. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their PCA score, i.e., their correlation coefficient with the top two eigenSNPs. Thus, the first SNP is the one most highly correlated with the top two eigenSNPs.)
We also provide two more panels of AIMs that emerged by clustering the top 5,000 PCAIMs into 60 and 90 clusters respectively. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their PCA score.)
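The scoring and representative-selection steps above can be sketched as follows. The score used here (the sum of a SNP's squared loadings on the top two eigenSNPs) is our reading of the approach in Paschou et al 2007 and should be treated as an assumption; the function names are hypothetical, and the cluster assignments are taken as given, since the clustering procedure itself is not described here.

```python
import numpy as np

def pca_scores(genotypes, k=2):
    """Score each SNP by its squared loadings on the top-k eigenSNPs.

    Assumed scoring rule; genotypes is a (subjects x SNPs) array.
    """
    X = genotypes - genotypes.mean(axis=0)          # center each SNP column
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0)

def cluster_representatives(scores, cluster_ids):
    """Pick, for each cluster, the index of its highest-scoring SNP.

    The result is ordered by descending score, mirroring the ordering
    described for the panel files.
    """
    best = {}
    for idx, (score, cid) in enumerate(zip(scores, cluster_ids)):
        if cid not in best or score > scores[best[cid]]:
            best[cid] = idx
    return sorted(best.values(), key=lambda i: -scores[i])

# Tiny hand-made example: four SNPs in two clusters.
example_scores = np.array([0.9, 0.2, 0.5, 0.7])
example_clusters = [0, 0, 1, 1]
representatives = cluster_representatives(example_scores, example_clusters)
```

With 30 cluster labels over the top 5,000 SNPs, `cluster_representatives` would return 30 indices, one per cluster.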
Selecting panels of AIMs using the Informativeness for Assignment (In) metric

We also identified the top 5,000 Informative Markers (INFAIMs for short) according to the Informativeness for Assignment (In) metric described in Rosenberg et al 2003. Again, significant "redundancies" are observed within the top 5,000 markers. In order to better summarize the top 5,000 INFAIMs, we clustered them into 30 clusters (see comment above) and chose one representative SNP from each cluster. (For each cluster, we chose the representative SNP to be the one that has the highest In score.)
We provide the list of all top 5,000 INFAIMs, as well as their cluster assignments and their respective In scores. To some extent, SNPs within the same cluster should be interchangeable in constructing and/or interpreting panels of ancestry informative markers. (The file is tab-delimited; the first column corresponds to the cluster number, the second column corresponds to the SNP rs number, the third column corresponds to the SNP chromosome, the fourth column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs, and the fifth column corresponds to the In score of the SNP. The SNPs are sorted in descending order with respect to their In score within each cluster, and the clusters are sorted in descending order with respect to the highest In score among all SNPs in the cluster.)
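For readers unfamiliar with the In metric, the following minimal sketch computes it for a single biallelic SNP from the allele frequencies in each population, using the formula In = sum over alleles j of ( -p_j log p_j + (1/K) sum over populations i of p_ij log p_ij ), where p_j is the average frequency of allele j across the K populations. The function name and the input format are assumptions for illustration; this is not the code used to produce the lists.

```python
import numpy as np

def informativeness_for_assignment(ref_allele_freqs):
    """In for one biallelic SNP.

    ref_allele_freqs: frequency of one allele in each of K populations.
    Natural logarithms are used, with the convention 0 * log(0) = 0.
    """
    p = np.asarray(ref_allele_freqs, dtype=float)
    p_ij = np.stack([p, 1.0 - p])        # shape (2 alleles, K populations)
    p_j = p_ij.mean(axis=1)              # average frequency of each allele

    def xlogx(x):
        x = np.asarray(x, dtype=float)
        logs = np.zeros_like(x)
        np.log(x, out=logs, where=x > 0)  # leave 0 where x == 0
        return x * logs

    return float(-xlogx(p_j).sum() + xlogx(p_ij).mean(axis=1).sum())

# Identical frequencies in every population carry no information.
uninformative = informativeness_for_assignment([0.5, 0.5, 0.5, 0.5])
# Two populations fixed for different alleles reach the maximum, log(2).
fully_informative = informativeness_for_assignment([1.0, 0.0])
```

Ranking all SNPs by this value and keeping the 5,000 largest would correspond to the selection step described above.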
Finally, we provide a panel of 30 SNPs, i.e., the SNPs with the highest In score in each of the 30 clusters described above. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their In score.)
We also provide two more panels of AIMs that emerged by clustering the top 5,000 INFAIMs into 60 and 90 clusters respectively. (The file is tab-delimited; the first column corresponds to the SNP rs number, the second column corresponds to the SNP chromosome, and the third column links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order with respect to their In score.)
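A panel file in the three-column layout described above could be read along the following lines. This is a sketch only: the function name is hypothetical, the sample rows are placeholders rather than real markers, and we assume the file has no header line.

```python
import csv
import io

def read_panel(handle):
    """Parse a three-column, tab-delimited panel file.

    Columns: SNP rs number, chromosome, gene ('--' marks intergenic SNPs).
    The chromosome is kept as a string so that labels such as 'X' survive.
    """
    panel = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        rs, chrom, gene = (field.strip() for field in row[:3])
        panel.append({"rs": rs, "chrom": chrom,
                      "gene": None if gene == "--" else gene})
    return panel

# Placeholder rows, not real markers from the panels.
sample = "rsPLACEHOLDER1\t2\tGENE_A\nrsPLACEHOLDER2\t11\t--\n"
panel = read_panel(io.StringIO(sample))
intergenic = [snp["rs"] for snp in panel if snp["gene"] is None]
```

Because the rows are already sorted by score, the order of `panel` preserves the ranking of the SNPs.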

Overlap between the panels

There were 745 common SNPs between the top 5,000 PCAIMs and the top 5,000 INFAIMs (a 14.9% overlap). The following two plots show the distribution of the PCA scores as well as the In scores; scores to the right of the vertical blue line correspond to the top 5,000 AIMs.
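The overlap figure is simply the size of the intersection of the two marker lists divided by the list size (745 / 5,000 = 14.9%). A minimal sketch, using synthetic identifiers arranged to reproduce the reported counts rather than the actual rs numbers:

```python
def panel_overlap(panel_a, panel_b):
    """Return the shared SNP ids and the overlap as a fraction of panel_a."""
    set_a = set(panel_a)
    shared = set_a & set(panel_b)
    return shared, len(shared) / len(set_a)

# Synthetic ids only: two lists of 5,000 that share exactly 745 entries.
pcaims = [f"snp{i}" for i in range(5000)]
infaims = [f"snp{i}" for i in range(4255, 9255)]
shared, fraction = panel_overlap(pcaims, infaims)
print(len(shared), f"{fraction:.1%}")  # prints: 745 14.9%
```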