Molecular Biology major Mayisha Sultana uses statistics to probe the human genome

Written by
Sharon Adarlo, Center for Statistics and Machine Learning, Princeton University
April 6, 2020

Mayisha Sultana’s senior-year independent research focuses on the human genome, specifically on using data science techniques to study population genetics.

Since the completion of the mapping of the human genome in 2003, researchers have been making steady progress in finding links between certain genes and associated diseases (or traits) in studies called genome-wide association studies. These studies are important because they can help spur new genetic research or lead to the development of effective treatments for illnesses, said Sultana.

But these studies often assume that their subjects come from the same overall population. This assumption can generate false correlations due to the phenomenon of population structure, said Sultana. “Structure” in population genetics refers to shared ancestry between individuals that arises from people marrying within the same ethnic group or sharing common ancestors. These interpersonal relations give rise to systematic errors that confound scientific findings about the human genome.

To glean information from this noisy data and correct for false positives, Sultana is developing a new statistical estimator of how genetically mixed an individual is. She uses the Dirichlet distribution to model that mixture, and she is working on a method to accurately estimate the underlying parameter of this distribution.
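
For readers curious about what estimating a Dirichlet parameter can look like in practice, here is a minimal sketch. It is not Sultana's estimator, which this article does not describe. It assumes that each individual's ancestry proportions are observed directly and uses a textbook method-of-moments fit. The function name, the three ancestral populations, and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def estimate_dirichlet_moments(proportions):
    """Method-of-moments estimate of Dirichlet parameters.

    `proportions` has one row per individual and one column per
    ancestral population; each row sums to 1.
    """
    mean = proportions.mean(axis=0)
    var = proportions.var(axis=0, ddof=1)
    # For a Dirichlet, Var[x_k] = mean_k * (1 - mean_k) / (alpha_0 + 1),
    # where alpha_0 is the sum of all parameters. Solve for alpha_0 per
    # component and average the results.
    concentration = np.mean(mean * (1.0 - mean) / var - 1.0)
    return mean * concentration


# Simulated data: hypothetical parameters for three ancestral populations.
rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 1.0, 0.5])
proportions = rng.dirichlet(true_alpha, size=20000)

estimated_alpha = estimate_dirichlet_moments(proportions)
print(estimated_alpha)
```

In real genetic data the proportions are not observed and must themselves be inferred from genotypes, which is part of what makes the estimation problem difficult.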

“Overall, I am coming up with a more accurate way to estimate how genetically mixed an individual is,” continued Sultana. “It’s kind of like 23andMe (the commercial genetic test kit) -- it breaks down the ancestral background of people. But the goal is geared towards using it to correct for false positives in association studies, or to understand the roots of human evolution.”

For her project, Sultana has found that the CSML certificate curriculum has served her well.

“I have really enjoyed my classes. CSML has given me the skills to apply data science in biology, but it has also allowed me to take classes like econometrics and machine learning which have opened my eyes to how statistics is applied in disciplines like economics or computer science,” she said. “The framework of thinking that I have learned in statistics has been extremely broad and therefore super useful.”

After graduation, Sultana is planning to be a strategy analyst for ClearView Healthcare Partners, a consultancy firm serving companies in healthcare and life sciences.