
Level: Middle East
(172 subjects & 643,862 SNPs)
Previous level: Central South Asia, Europe, Middle East

Our goal

We seek Ancestry Informative Markers (AIMs) that can be used to separate the following four populations:

Bedouin (47 subjects)

Druze (46 subjects)

Mozabite (28 subjects)

Palestinian (51 subjects)

Principal Components Analysis

To achieve this separation using Principal Components Analysis (PCA) and a simple five-nearest-neighbor (5-NN) classifier, our analysis indicated that we should retain the top six principal components (eigenSNPs) of the data. The following plots show the projections of the data onto the top two eigenSNPs, onto the top three eigenSNPs, and onto each of the top six eigenSNPs individually.

Projection on top two eigenSNPs

Projection on top three eigenSNPs

Projection on first eigenSNP only

Projection on second eigenSNP only

Projection on third eigenSNP only

Projection on fourth eigenSNP only

Projection on fifth eigenSNP only

Projection on sixth eigenSNP only
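
The PCA-plus-5-NN procedure described above can be illustrated with a minimal sketch. This is not the pipeline used to produce the plots: the genotype matrix is simulated, the function names are hypothetical, and leave-one-out evaluation is an assumption made here for illustration (the PCA is also fitted on all subjects, including the held-out one, to keep the example short).

```python
import numpy as np

def top_component_projection(genotypes, k=6):
    """Project subjects onto the top-k principal components of a
    subjects x SNPs genotype matrix (entries 0, 1, or 2)."""
    centered = genotypes - genotypes.mean(axis=0)  # mean-center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]  # one row of k coordinates per subject

def knn_predict(train_x, train_y, query_x, n_neighbors=5):
    """Majority vote among the n_neighbors closest training subjects
    (Euclidean distance in the projected space)."""
    predictions = []
    for q in query_x:
        dist = np.linalg.norm(train_x - q, axis=1)
        nearest = train_y[np.argsort(dist)[:n_neighbors]]
        labels, counts = np.unique(nearest, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Simulated data (an assumption): 4 populations, 20 subjects each, 200 SNPs.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 20, 200
freqs = rng.uniform(0.05, 0.95, size=(4, n_snps))
genotypes = np.vstack(
    [rng.binomial(2, f, size=(n_per_pop, n_snps)) for f in freqs]
)
labels = np.repeat(np.arange(4), n_per_pop)

coords = top_component_projection(genotypes, k=6)

# Leave-one-out 5-NN classification in the six-dimensional projection.
correct = 0
for i in range(len(labels)):
    keep = np.arange(len(labels)) != i
    pred = knn_predict(coords[keep], labels[keep], coords[i:i + 1])
    correct += int(pred[0] == labels[i])
loo_accuracy = correct / len(labels)
print(f"leave-one-out accuracy: {loo_accuracy:.2f}")
```

The simulated populations differ far more than the four Middle Eastern populations do, so the accuracy printed here says nothing about the real data; the sketch only shows the mechanics of projecting and voting.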

Selecting panels of AIMs using PCA scores

We first identified the top 5,000 PCA Informative Markers (PCAIMs for short). These SNPs were selected using a simple metric of correlation with the top six eigenSNPs (see Paschou et al. 2007 for details). Significant redundancy is observed within the top 5,000 markers; for example, a considerable number of marker pairs are in linkage disequilibrium. To better summarize the top 5,000 PCAIMs, we clustered them into 300 clusters (the number of clusters was indicated by our analysis) and chose one representative SNP from each cluster. For each cluster, the representative is the SNP with the highest PCA score, i.e., the one most correlated with the top six eigenSNPs. To some extent, SNPs within the same cluster should be interchangeable when constructing or interpreting panels of ancestry informative markers.

We provide the list of all top 5,000 PCAIMs, together with their cluster assignments and PCA scores. The file is tab-delimited: the first column is the cluster number, the second is the SNP rs number, the third is the SNP chromosome, the fourth links to the corresponding gene for intragenic SNPs (with -- denoting intergenic SNPs), and the fifth is the PCA score of the SNP. Within each cluster, SNPs are sorted in descending order of PCA score, and the clusters are sorted in descending order of the highest PCA score among their SNPs.

The top 5,000 PCAIMs, with hyperlinks to NCBI's Entrez Gene database.

The top 5,000 PCAIMs (text only).
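
As an illustration of the file layout described above, the following sketch reads the five tab-delimited columns and keeps the highest-scoring SNP of each cluster. The rows, rs identifiers, gene names, and scores are invented placeholders, and `cluster_representatives` is a hypothetical helper, not part of the distributed files.

```python
import csv
import io

# Invented rows in the five-column layout:
# cluster, rs number, chromosome, gene (or -- if intergenic), score.
SAMPLE = (
    "1\trs_example_1\t2\tGENE_A\t0.0150\n"
    "1\trs_example_2\t2\t--\t0.0110\n"
    "2\trs_example_3\t15\t--\t0.0140\n"
    "2\trs_example_4\t15\tGENE_B\t0.0090\n"
)

def cluster_representatives(handle):
    """Return (rs number, score) for the top-scoring SNP of each cluster,
    sorted by descending score."""
    best = {}
    for cluster, rs_id, chrom, gene, score in csv.reader(handle, delimiter="\t"):
        score = float(score)
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (rs_id, score)
    return sorted(best.values(), key=lambda item: item[1], reverse=True)

panel = cluster_representatives(io.StringIO(SAMPLE))
print(panel)
```

Because the file is already sorted by score within each cluster, taking the first row of each cluster would give the same result; comparing scores explicitly simply avoids relying on that ordering.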

The following 300-SNP panel was determined to be sufficient to assign an individual to one of the four populations listed above. (The file is tab-delimited; the first column is the SNP rs number, the second is the SNP chromosome, and the third links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order of PCA score, i.e., their correlation with the top six eigenSNPs, so the first SNP is the one most highly correlated with the top six eigenSNPs.)

A panel of 300 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 300 AIMs (text only).

We also provide two more panels of AIMs, obtained by clustering the top 5,000 PCAIMs into 600 and 900 clusters, respectively. (Each file is tab-delimited; the first column is the SNP rs number, the second is the SNP chromosome, and the third links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order of PCA score.)

A panel of 600 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 600 AIMs (text only).

A panel of 900 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 900 AIMs (text only).

Selecting panels of AIMs using the Informativeness metric

We also identified the top 5,000 Informative Markers (INFAIMs for short) according to the Informativeness for Assignment (I_n) metric described in Rosenberg et al. 2003. Again, significant redundancy is observed within the top 5,000 markers. To better summarize the top 5,000 INFAIMs, we clustered them into 300 clusters (as with the PCAIMs above, the number of clusters was indicated by our analysis) and chose one representative SNP from each cluster. For each cluster, the representative is the SNP with the highest I_n score. To some extent, SNPs within the same cluster should be interchangeable when constructing or interpreting panels of ancestry informative markers.

We provide the list of all top 5,000 INFAIMs, together with their cluster assignments and I_n scores. The file is tab-delimited: the first column is the cluster number, the second is the SNP rs number, the third is the SNP chromosome, the fourth links to the corresponding gene for intragenic SNPs (with -- denoting intergenic SNPs), and the fifth is the I_n score of the SNP. Within each cluster, SNPs are sorted in descending order of I_n score, and the clusters are sorted in descending order of the highest I_n score among their SNPs.

The top 5,000 INFAIMs, with hyperlinks to NCBI's Entrez Gene database.

The top 5,000 INFAIMs (text only).
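
For readers unfamiliar with the I_n metric mentioned above, here is a small sketch of its definition from Rosenberg et al. 2003 for a single marker: with K populations, allele frequencies p_ij (population i, allele j), and p_j the average of p_ij over populations, I_n is the sum over alleles of -p_j log p_j + (1/K) * sum_i p_ij log p_ij. The function name and the example frequencies are assumptions for illustration; the values are not taken from the data set.

```python
import numpy as np

def informativeness_for_assignment(allele_freqs):
    """I_n for one marker.

    allele_freqs: K x N array, one row per population, one column per
    allele; each row sums to 1. Natural logarithms are used.
    """
    p = np.asarray(allele_freqs, dtype=float)
    mean_p = p.mean(axis=0)  # average frequency of each allele across populations

    def xlogx(x):
        # Convention: 0 * log(0) is treated as 0.
        safe = np.where(x > 0, x, 1.0)
        return x * np.log(safe)

    return float(np.sum(-xlogx(mean_p) + xlogx(p).mean(axis=0)))

# Two populations fixed for different alleles: fully informative.
fixed = informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]])

# Four populations with identical frequencies: uninformative.
flat = informativeness_for_assignment([[0.3, 0.7]] * 4)

print(fixed, flat)
```

A SNP whose alleles are fixed differently in two populations reaches log 2, while a SNP with identical frequencies everywhere scores zero; ranking all SNPs by this value and keeping the top 5,000 corresponds to the selection step described above.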

Finally, we provide a panel of 300 SNPs, i.e., the SNPs with the highest I_n score in each of the 300 clusters described above. (The file is tab-delimited; the first column is the SNP rs number, the second is the SNP chromosome, and the third links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order of I_n score.)

A panel of 300 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 300 AIMs (text only).

We also provide two more panels of AIMs, obtained by clustering the top 5,000 INFAIMs into 600 and 900 clusters, respectively. (Each file is tab-delimited; the first column is the SNP rs number, the second is the SNP chromosome, and the third links to the corresponding gene for intragenic SNPs, with -- denoting intergenic SNPs. The SNPs are sorted in descending order of I_n score.)

A panel of 600 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 600 AIMs (text only).

A panel of 900 AIMs, with hyperlinks to NCBI's Entrez Gene database.

A panel of 900 AIMs (text only).

Overlap between the panels

The top 5,000 PCAIMs and the top 5,000 INFAIMs share 851 SNPs (a 17% overlap). The following two plots show the distributions of the PCA scores and of the I_n scores; scores to the right of the vertical blue line correspond to the top 5,000 AIMs.

Distribution of PCA scores for all 643,862 SNPs.

Distribution of I_n scores for all 643,862 SNPs.
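
The overlap figure quoted above is a simple set intersection. The sketch below shows the computation on placeholder identifiers and then checks the reported percentage; `panel_overlap` is a hypothetical helper and the rs names are invented.

```python
def panel_overlap(panel_a, panel_b):
    """Return the number of shared SNPs and that number as a
    fraction of the first panel's size."""
    a, b = set(panel_a), set(panel_b)
    shared = a & b
    return len(shared), len(shared) / len(a)

# Placeholder identifiers, not real rs numbers.
pcaims = ["rs_example_1", "rs_example_2", "rs_example_3", "rs_example_4"]
infaims = ["rs_example_3", "rs_example_4", "rs_example_5", "rs_example_6"]
n_shared, fraction = panel_overlap(pcaims, infaims)
print(n_shared, fraction)

# The figure reported in the text: 851 shared SNPs out of 5,000.
reported_percent = 100 * 851 / 5000
print(round(reported_percent))
```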