POP-GWAS: Valid inference for machine learning-assisted GWAS
Dr. Qiongshi Lu
Dr. Qiongshi Lu is an Associate Professor in the Department of Biostatistics and Medical Informatics at the University of Wisconsin-Madison. His research focuses on developing statistical and computational methods to study complex trait genetics. In particular, he is interested in noncoding genome annotation, genetic risk prediction, genetic correlation estimation, and gene-environment interaction.
Dr. Jiacheng Miao
Dr. Jiacheng Miao is a fifth-year Ph.D. candidate in Biomedical Data Science at the University of Wisconsin–Madison advised by Drs. Qiongshi Lu and Lauren Schmitz. His research interests lie at the intersection of human genetics, statistics, machine learning, and their applications to medicine. His work focuses on machine learning-assisted statistical inference, heterogeneous treatment effect estimation based on genetic variation, and genetically-informed risk prediction across diverse contexts.
Regarding the research background and significance, does this work discover new knowledge or solve existing problems within the field? Please elaborate in detail.
A common challenge in modern scientific research is the scarcity of gold-standard data, which can be expensive and time-consuming to collect. With the rapid development of machine learning (ML), scientists have realized that it is now possible to leverage ML/AI to predict gold-standard outcomes from variables that are much easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring the noise and bias introduced by the prediction procedure. One notable example of this type of research is the ML-assisted genome-wide association study (GWAS), which applies advanced ML to predict phenotypes that are difficult or expensive to measure and then conducts GWAS on the ML-imputed outcomes. In our work, we first demonstrate the risk of pervasive false positive associations in existing ML-assisted GWAS. To address this issue, we introduce POP-GWAS, a principled statistical framework that ensures valid and efficient statistical inference in GWAS of ML-predicted outcomes, irrespective of the quality of the ML prediction. Applying POP-GWAS to data from UK Biobank, we performed the largest GWAS to date on bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, achieving a 9.7-50.7% increase in effective sample size and identifying 89 novel loci. Our work exemplifies how to leverage ML prediction to facilitate rigorous scientific discovery.
How did the reviewers evaluate (praise) it?
The reviewers stated that our paper “illustrates a key issue with using machine learning predictions to augment sample sizes in genome-wide association studies.” Additional comments included “The issue is nicely summarized and indeed credible and critical. The motivating example of HbA1c intended as a glycemic control metric but inevitably reflecting erythrocytic levels is an excellent example.” The reviewers also noted that the paper “contributes an important new result (assumption free ML augmentation of GWAS)”.
If this achievement has potential applications, what are some specific applications it might have in a few years?
A notable application of our work is to understand the genetic basis of disease outcomes or human traits that are challenging or costly to measure. Examples include but are not limited to undiagnosed diseases in biobank cohorts, imaging-derived outcomes, and molecular traits in rare tissues. As of early 2024, various data types in the UK Biobank, including proteomics, brain magnetic resonance imaging (MRI), heart MRI, dual-energy X-ray absorptiometry imaging, electrocardiograms, and metabolomics, exhibit missing rates ranging from 45% to 94%. Advanced ML can be employed to predict these missing variables, allowing POP-GWAS to facilitate valid and efficient inference in downstream genetic association studies.
Can you recount the specific steps or stages from setting the research topic to the successful completion of the research?
Our research was initially prompted by the observation that most current ML-assisted GWAS directly perform genetic association analysis on ML-predicted phenotypes. Given that ML predictions may be imprecise or biased, we wanted to investigate whether current approaches could lead to false positive findings. We came across a study suggesting that HbA1c, a clinical diagnostic criterion for type 2 diabetes (T2D), is influenced by genetic variants through both glycemic and non-glycemic pathways. As a result, individuals who carry many HbA1c-increasing variants in non-glycemic pathways may be misdiagnosed with T2D. With these biological insights, we compared GWAS on ground-truth and ML-imputed T2D outcomes, and found that 82% of the associations from the ML-imputed GWAS could not be replicated in the ground-truth GWAS. Notably, many of these associations involved non-glycemic pathways affecting HbA1c. This finding reinforced our belief that current ML-assisted GWAS methods are prone to generating false-positive results.
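The mechanism behind such false positives can be illustrated with a toy simulation. This is only a sketch on synthetic data, not the analysis performed in the study: the function names, the effect size, and the choice to use the proxy biomarker itself as the "ML prediction" are all assumptions made for illustration. A variant that has no effect on the true trait, but shifts a biomarker the predictor relies on, ends up strongly associated with the imputed outcome.

```python
import numpy as np

def marginal_z(genotype, phenotype):
    """Z-score of the slope from a simple linear regression of phenotype on genotype."""
    g = genotype - genotype.mean()
    y = phenotype - phenotype.mean()
    beta = (g @ y) / (g @ g)
    resid = y - beta * g
    se = np.sqrt((resid @ resid) / (len(g) - 2) / (g @ g))
    return beta / se

def simulate(n=20_000, proxy_effect=0.15, seed=0):
    """Hypothetical toy model: one variant, one true trait, one proxy biomarker."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # genotype coded 0/1/2
    true_trait = rng.normal(size=n)                 # the variant has NO effect here
    # The biomarker tracks the true trait, but the variant also shifts it
    # through an unrelated pathway (analogous to a non-glycemic effect on HbA1c).
    biomarker = true_trait + proxy_effect * g + 0.5 * rng.normal(size=n)
    imputed = biomarker  # assumption: the ML prediction is driven by the biomarker
    return marginal_z(g, true_trait), marginal_z(g, imputed)

z_true, z_imputed = simulate()
print(f"z on true outcome: {z_true:.2f}; z on imputed outcome: {z_imputed:.2f}")
```

In this setup the association with the imputed outcome reflects only the variant's effect on the proxy, so running GWAS directly on the prediction would report a signal that the ground-truth trait does not support.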
To tackle this issue, we applied a recently introduced method for ML-assisted inference to UK Biobank data and encountered several challenges. First, this method identified fewer significant associations than conventional GWAS based solely on labeled data. After some investigation, we realized that current ML-assisted inference methods can be less statistically powerful than conventional GWAS approaches when the ML prediction is of poor quality. To overcome this issue, we developed a weighting scheme that enables ML-assisted GWAS to adapt to the accuracy of the ML prediction. This approach worked well in our analysis, enabling us to identify many more significant associations through the integration of ML prediction. We also proved theoretically that our method is always more powerful than GWAS that does not incorporate ML. In fact, we demonstrated that our method is the best linear unbiased estimator for GWAS coefficients based on observed and ML-imputed phenotypes.
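One way to see why an accuracy-adaptive weight cannot lose power is the generic control-variate argument sketched below. The notation is an assumption introduced here and may differ from the paper's: "lab" and "unlab" denote the samples with and without the measured phenotype, $\sigma_Y^2$, $\sigma_{\mathrm{lab}}^2$ and $\sigma_{\mathrm{unlab}}^2$ are the sampling variances of the three GWAS estimates, $c$ is the covariance between the two labeled-sample estimates, and the two samples are taken to be independent draws from the same population.

```latex
\begin{aligned}
\hat\beta(\omega) &= \hat\beta_{Y}^{\mathrm{lab}}
  - \omega\left(\hat\beta_{\hat Y}^{\mathrm{lab}} - \hat\beta_{\hat Y}^{\mathrm{unlab}}\right) \\
\operatorname{Var}\hat\beta(\omega) &= \sigma_Y^2 - 2\omega c
  + \omega^2\left(\sigma_{\mathrm{lab}}^2 + \sigma_{\mathrm{unlab}}^2\right) \\
\omega^\ast &= \frac{c}{\sigma_{\mathrm{lab}}^2 + \sigma_{\mathrm{unlab}}^2},
\qquad
\operatorname{Var}\hat\beta(\omega^\ast)
  = \sigma_Y^2 - \frac{c^2}{\sigma_{\mathrm{lab}}^2 + \sigma_{\mathrm{unlab}}^2}
  \le \sigma_Y^2
\end{aligned}
```

The term in parentheses has expectation zero however good or bad the prediction is, so the estimate stays unbiased for any weight. When the prediction is uninformative, $c$ is near zero, the weight shrinks toward zero, and the estimator falls back to the conventional labeled-data GWAS.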
Another challenge is that existing ML-assisted inference methods struggle to handle the technical factors that are crucial for real-world scientific applications. In our GWAS application, these include sample relatedness and population structure, which the field has learned to account for in order to ensure valid GWAS results. Additionally, existing software does not scale well to the sample size and data dimensionality of a typical GWAS, which often involves hundreds of thousands of individuals and tens of millions of linear regressions. We proposed a solution to this problem, drawing inspiration from two observations: the GWAS field has developed sophisticated and computationally efficient algorithms to address these complexities, and summary statistics are sufficient for ML-assisted inference. Our solution employs a two-step approach. First, we use highly optimized GWAS software to generate summary statistics for both observed and ML-predicted outcomes. Second, we directly integrate these summary statistics to produce statistically valid ML-assisted inference results. This method has proven highly effective in biobank-scale genetic analyses. Through the use of POP-GWAS, we have identified many new genetic associations for bone mineral density and provided new insights into the genetic basis of osteoporosis and fracture risk.
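The second step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the POP-GWAS software itself: the function name `combine_three_gwas` is hypothetical, the three inputs are assumed to come from separate GWAS runs (observed outcome in labeled samples, predicted outcome in labeled samples, predicted outcome in unlabeled samples), the labeled and unlabeled samples are assumed independent, and `r` is a user-supplied correlation between the two labeled-sample estimates. The weight is a generic variance-minimizing choice and may differ from the exact formula in the paper.

```python
import numpy as np

def combine_three_gwas(beta_y_lab, se_y_lab,
                       beta_yhat_lab, se_yhat_lab,
                       beta_yhat_unlab, se_yhat_unlab, r):
    """Combine per-variant summary statistics from three GWAS runs.

    All inputs are scalars or equal-length arrays (one entry per variant).
    r is the assumed correlation between the labeled-sample estimates for
    the observed and the predicted outcome.
    """
    beta_y_lab, se_y_lab = np.asarray(beta_y_lab, float), np.asarray(se_y_lab, float)
    beta_yhat_lab, se_yhat_lab = np.asarray(beta_yhat_lab, float), np.asarray(se_yhat_lab, float)
    beta_yhat_unlab, se_yhat_unlab = np.asarray(beta_yhat_unlab, float), np.asarray(se_yhat_unlab, float)

    # Covariance between the two estimates that share the labeled samples
    cov = r * se_y_lab * se_yhat_lab
    # Variance of the zero-mean correction term (independent samples assumed)
    var_diff = se_yhat_lab**2 + se_yhat_unlab**2
    # Variance-minimizing weight: shrinks to 0 when the prediction is uninformative
    weight = cov / var_diff

    beta = beta_y_lab - weight * (beta_yhat_lab - beta_yhat_unlab)
    se = np.sqrt(se_y_lab**2 - cov**2 / var_diff)
    return beta, se, weight

# Toy numbers for two variants (illustrative only, not real results)
beta, se, weight = combine_three_gwas(
    beta_y_lab=[0.05, -0.02], se_y_lab=[0.10, 0.10],
    beta_yhat_lab=[0.04, -0.01], se_yhat_lab=[0.10, 0.10],
    beta_yhat_unlab=[0.03, -0.03], se_yhat_unlab=[0.05, 0.05],
    r=0.8,
)
print(beta, se, beta / se)
```

Because only summary statistics enter this step, relatedness and population structure are handled entirely by whichever GWAS software produced the inputs, and the combination itself is a vectorized operation over variants.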
Were there any memorable events during the research? You can tell a story about anything related to people, events, or objects.
There are two particularly memorable events. The first is the time we proved an ML-assisted version of the Gauss-Markov theorem. Traditionally, the Gauss-Markov theorem states that the ordinary least squares estimator is the best linear unbiased estimator in linear regressions of observed outcomes. In our work, we demonstrated that our POP-GWAS estimator is the best linear unbiased estimator given both the observed and ML-predicted outcomes. This result was unexpected: our estimator began as a heuristic weighted sum of three sets of GWAS results, and we had not anticipated that it would have such strong statistical properties. The second memorable event was when we identified novel head-specific genetic associations for bone mineral density, including LGR5. This discovery was backed by evidence showing that Lgr5-deleted mice exhibit various craniofacial abnormalities, suggesting potential new avenues for diagnosing and treating head-related skeletal disorders.
Is there a follow-up plan based on this research? If so, please elaborate.
We are actively extending the POP-GWAS framework beyond genome-wide association studies typically based on linear or logistic regressions. Our goal is to develop general methodology that enables valid and efficient statistical inference following machine learning predictions across diverse types of statistical applications.
Without a doubt, AI is one of the hot topics of 2024, requiring extensive data support in its development. What assistance can biostatistics offer to the development of AI?
ML/AI is quickly gaining popularity in scientific research. It is therefore crucial to develop modern statistical methods that enable integrative analysis based on both statistical ideas and ML/AI techniques while ensuring the reliability of scientific findings. It would be a missed opportunity if statisticians and biostatisticians ignored the strengths of ML/AI. Instead, as a field, we should critically introduce ML/AI into statistics and develop new methods to facilitate new applications. It is our view (and experience) that many new and exciting data science problems will emerge once we start doing that.
Proofread by: Hongtu Zhu