Interdisciplinary study ties India’s genetic diversity to language, not geography


WEST LAFAYETTE, Ind. — The popularity of genetic and ancestry services like 23andMe attests that people care about where their ancestors originated. The underlying assumption is that the geography of one’s forebears affects one’s genes today.

Historically, scientists have found that geography is the biggest driver behind the genetic diversity of a population. Now, new research from Purdue University indicates that while that may be true for European countries, it is not true for all other parts of the world – especially places like India, where language and social systems have strongly affected how and where people live. The model the researchers developed to analyze India’s population genetics will allow other researchers to analyze populations where genetics are not as closely tied to geography. Understanding the genetics of human populations helps scientists understand the history of human movement and cultures, and paves the way to understanding human health and susceptibility to disease.

Peristera Paschou, a population geneticist and associate professor of biological sciences at Purdue, studies human genetic variation all around the world and led the study with Petros Drineas, associate head of Purdue's Department of Computer Science.

“Our genome carries the signature of our ancestors, and the genetic structure of modern populations has been shaped by the forces of evolution. What we are looking for is what led different groups of people to come together and what drove them apart,” Paschou said. “To understand the genetics of human populations, we created a model that allows us to consider jointly many different factors that may have shaped genetics. Interdisciplinary research bringing together genetics and computer science was key to our work, as well as analyzing a comprehensive dataset that represents the diversity of the Indian subcontinent.”

Most population analyses rely on datasets from individuals of European ancestry living in Europe or North America; genomic data for populations from other parts of the world is lacking. The data from European samples showed that genetics correlates very closely with geography: If you know someone’s genetics, you can guess where they are from, to within a few kilometers in some cases, and if you know where someone’s ancestors came from, you have a close approximation of their genetic makeup.

Aritra Bose earned his doctorate at Purdue in computer science. His research spanned both data science and genetics. Reading studies about how European genomes map onto geography, Bose, who was born and raised in Calcutta, thought, “Huh. That wouldn’t work in India.” India is home to more than 800 languages as well as a millennia-old caste system that regulates who can marry – and have children with – whom.

“I read these papers, and I thought, ‘How can I use this concept in a stratified population like India?’” Bose said. “I grew up there, I have an understanding of the castes and the languages, and the intricacies of the society that can affect genetics.”

Earlier studies of the Indian population had shown that the European model linking population genetics to geography failed to explain Indian population genetics. Bose wondered if he could come up with a model that would take into account other factors affecting the Indian population, including the caste system, culture and language.

The model, and the conclusions the team of geneticists and data scientists reached using it, were just published in a study in the journal Molecular Biology and Evolution. Their study revealed that shared language, not geography, is the most powerful force in shaping gene flow in India.

Developing the model was not easy. Early on, Bose hit a roadblock with some of his equations and mentioned the problem to his mentor at IBM Research, where he was an intern at the time. Working with both of his doctoral advisers and several computer scientists from IBM Research, Bose was able to craft a robust, flexible model.

Drineas, one of Bose’s doctoral advisers, said: “I was intrigued by the interplay between genetics and socio-demographic factors in shaping the population structure of the Indian continent. It was exciting to see that our model detected spoken language as a major force in bringing people together in India, across geographic and social barriers. We were fortunate to have Aritra Bose, our former doctoral student (jointly advised with professor Paschou) work on this project, since he has extensive background in both the algorithmic and the human genetic sides of our research, as well as the expertise to interpret our findings in the context of human genetic diversity within India.”

The resulting model, the first to be able to take into account so many different variables, has been highly successful at analyzing the genetics of the Indian population, giving scientists a lens into how the Indian people moved into India and how various groups of people commingled. People who speak the same language – or even similar languages – tended to be much more closely related, even if they lived far apart geographically.
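The idea of weighing several factors against a genetic signal can be illustrated with a minimal Python sketch. It is not the published model: it assumes synthetic data, a single hypothetical genetic axis, and ordinary least squares, which yields the linear combination of covariates with the highest Pearson correlation to a target. All names here (`max_correlation`, `simulate`, the language-group effects) are invented for illustration.

```python
import numpy as np


def max_correlation(target, covariates):
    """Highest Pearson correlation between `target` and any linear
    combination of the columns of `covariates`, found by least squares."""
    # Center both sides so the intercept is handled implicitly.
    y = target - target.mean()
    X = covariates - covariates.mean(axis=0)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.corrcoef(X @ weights, y)[0, 1])


def simulate(n=200, seed=0):
    """Synthetic (made-up) data: a genetic axis driven mostly by language group."""
    rng = np.random.default_rng(seed)
    geography = rng.normal(size=(n, 2))   # stand-ins for latitude and longitude
    group = rng.integers(0, 4, size=n)    # four hypothetical language groups
    language = np.eye(4)[group][:, 1:]    # one-hot encoding, first group as baseline
    effect = np.array([-1.5, -0.5, 0.5, 1.5])[group]
    genetic_axis = 0.2 * geography[:, 0] + effect + rng.normal(scale=0.3, size=n)
    return genetic_axis, geography, language


genetic_axis, geography, language = simulate()
r_geo = max_correlation(genetic_axis, geography)
r_all = max_correlation(genetic_axis, np.hstack([geography, language]))
print(f"geography only:       r = {r_geo:.2f}")
print(f"geography + language: r = {r_all:.2f}")
```

Because the data are simulated so that language dominates, the second correlation comes out higher by construction. With real data, the comparison between covariate sets is the point, not these particular numbers.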

“It sheds light on how genetics work in our society,” Bose said. “This is the first model that can take into account social, cultural, environmental and linguistic factors that shape the gene flow of populations. It helps us to understand what factors contribute to the genetic puzzle that is India. It disentangles the puzzle.”

The data helps place India in context with the rest of the globe genetically. Indians who speak Indo-European and Dravidian languages were more closely tied to Europeans, while Indians who speak Tibeto-Burman languages were more closely related to East Asians.

This type of interdisciplinary research, pairing data science with population genetics, and this model in particular, will help researchers understand the genetics of the human world, especially non-European countries with rich histories of diversity and migrations.

About Purdue University

Purdue University is a top public research institution developing practical solutions to today’s toughest challenges. Ranked the No. 5 Most Innovative University in the United States by U.S. News & World Report, Purdue delivers world-changing research and out-of-this-world discovery. Committed to hands-on and online, real-world learning, Purdue offers a transformative education to all. Committed to affordability and accessibility, Purdue has frozen tuition and most fees at 2012-13 levels, enabling more students than ever to graduate debt-free. See how Purdue never stops in the persistent pursuit of the next giant leap at

Writer, Media contact: Brittany Steff, 765-494-7833

Sources: Aritra Bose, Peristera Paschou, Petros Drineas



Integrating linguistics, social structure, and geography to model genetic diversity within India

Aritra Bose, Daniel E. Platt, Laxmi Parida, Petros Drineas, and Peristera Paschou

DOI: 10.1093/molbev/msaa321

India represents an intricate tapestry of population substructure shaped by geography, language, culture and social stratification. While geography closely correlates with genetic structure in other parts of the world, the strict endogamy imposed by the Indian caste system and the large number of spoken languages add further levels of complexity to understand Indian population structure. To date, no study has attempted to model and evaluate how these factors have interacted to shape the patterns of genetic diversity within India. We merged all publicly available data from the Indian subcontinent into a dataset of 891 individuals from 90 well-defined groups. Bringing together geography, genetics and demographic factors, we developed COGG (Correlation Optimization of Genetics and Geodemographics) to build a model that explains the observed population genetic substructure. We show that shared language along with social structure have been the most powerful forces in creating paths of gene flow in the subcontinent. Furthermore, we discover the ethnic groups that best capture the diverse genetic substructure using a ridge leverage score statistic. Integrating data from India with a dataset of additional 1,323 individuals from 50 Eurasian populations we find that Indo-European and Dravidian speakers of India show shared genetic drift with Europeans, whereas the Tibeto-Burman speaking tribal groups have maximum shared genetic drift with East Asians.
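The abstract mentions a ridge leverage score statistic for finding the groups that best capture genetic substructure. As a rough illustration, the Python sketch below computes ridge leverage scores under one common definition from the numerical linear algebra literature: for row a_i of a matrix A, the score is a_i^T (A^T A + lam I)^(-1) a_i, with lam = ||A - A_k||_F^2 / k, where A_k is the best rank-k approximation of A. Whether the study uses exactly this form is an assumption, and the matrix layout (one row per group) and the function name are hypothetical.

```python
import numpy as np


def ridge_leverage_scores(A, k):
    """Rank-k ridge leverage score of every row of A.

    Assumed definition: tau_i = a_i^T (A^T A + lam * I)^{-1} a_i,
    with lam = ||A - A_k||_F^2 / k.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if not 1 <= k < len(s):
        raise ValueError("k must satisfy 1 <= k < min(A.shape)")
    # Squared Frobenius norm of the rank-k residual, spread over k directions.
    # (If A has rank <= k, lam is 0 and the shrink factors are undefined.)
    lam = np.sum(s[k:] ** 2) / k
    # In the SVD basis the score reduces to sum_j U_ij^2 * s_j^2 / (s_j^2 + lam).
    shrink = s**2 / (s**2 + lam)
    return (U**2) @ shrink


# Hypothetical input: 30 groups (rows) described by 8 features (columns).
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 8))
scores = ridge_leverage_scores(A, k=3)
top_rows = np.argsort(scores)[::-1][:5]  # rows with the largest scores
print(top_rows)
```

Rows with high scores are the ones that contribute most to the dominant structure of the matrix. Under this definition the scores sum to at most 2k, which is one quick sanity check on an implementation.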

Last Updated: Feb 9, 2021 2:32 PM
