Interview with award-winning statistician Dr. Xu Shi on the interface between statistics and AI

1. Can you summarize your award-winning research and its significance for statistics and data science?
Thank you for the opportunity to discuss my research. I am honored to receive the 2025 IMS Thelma and Marvin Zelen Emerging Women Leaders in Data Science Award. My research focuses on developing statistical methods for electronic health record (EHR) data. Routinely collected EHR data are rich but come with unique challenges because they are collected for clinical and administrative purposes, not specifically designed for research studies. As such, traditional statistical and causal assumptions often break down. I think about novel ways to utilize information from EHR data and develop better designs and methods to provide robust and efficient analysis. Real-world healthcare data constantly reveal new complexities, and I am privileged to work with interdisciplinary collaborators who bring diverse expertise to make such data more trustworthy for decision-making.
2. How does your work address challenges at the intersection of statistics and data science?
In causal inference, skepticism about the interpretation of observational studies often stems from the potential for unmeasured confounding bias. My team develops negative control and proximal causal inference methods, which utilize variables without a direct causal relationship with either the exposure or the outcome to identify and correct unmeasured confounding bias, and ultimately improve the reliability of observational studies. These methods are particularly beneficial in EHR-based studies, where unmeasured confounding is common and candidate negative controls abound. They have been applied to post-market effectiveness and safety surveillance studies for medical products using EHR and claims data and are now part of the FDA’s PDUFA VII (Prescription Drug User Fee Act) commitment, which aims to improve the rigor of real-world evidence in regulatory decisions.
We also address EHR data heterogeneity across healthcare institutions, work that is essential for improving data quality and enabling more accurate, transferable statistical analyses in multi-institutional studies. For example, by leveraging natural language processing techniques, we develop statistical methods that help translate medical codes between institutions, thus minimizing differences in how clinical concepts are captured across healthcare systems.
3. What impact do you foresee your research having on the future of data science systems?
I hope our work will make incremental progress in integrating quality checks into data analysis pipelines to enhance the validity and reproducibility of scientific research. For example, prior to any statistical analysis, our automated data quality checks and harmonization methods tackle the inherent heterogeneity in EHR data, a long-standing issue that traditionally consumes 90% of the time in EHR-based research, especially in multi-institutional studies with data-sharing constraints. After completing an analysis, we envision using negative controls (pairs of variables known to have no causal relationship) to run falsification checks on the results. This approach can help identify hidden biases, including unmeasured confounding, measurement error, and selection bias. Together, these efforts aim to advance data science by ensuring that conclusions drawn from analyses are more reliable in real-world scenarios.
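The falsification idea can be illustrated with a minimal simulated sketch (not Dr. Shi's actual methods, and the variable names and effect sizes here are invented for illustration): if an exposure is regressed on a negative control outcome it cannot causally affect, a clearly nonzero association signals hidden bias such as unmeasured confounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: an unmeasured confounder U drives both the
# exposure A and a negative control outcome W, which A cannot
# causally affect (so the true causal effect of A on W is zero).
u = rng.normal(size=n)
a = (u + rng.normal(size=n) > 0).astype(float)  # binary exposure
w = 0.8 * u + rng.normal(size=n)                # negative control outcome

# Falsification check: regress W on A (with an intercept).
# Absent hidden bias, the slope should be near zero.
x = np.column_stack([np.ones(n), a])
beta = np.linalg.lstsq(x, w, rcond=None)[0]
slope = beta[1]
print(f"slope of W on A: {slope:.3f}")  # clearly nonzero here, flagging bias
```

Because U was deliberately left out of the regression, the estimated slope is far from zero, and the check correctly raises a red flag about the analysis pipeline.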
4. What emerging trends in statistics do you believe are crucial for advancing AI?
An emerging trend is the effort to close the gap between statistical innovation and real-world implementation, as evidenced by the increasing number of tutorials and communications on advancements in statistics published in medical journals. Such efforts are crucial because, while statistical methods and theories rapidly evolve in academia, their adoption in regulatory and clinical settings remains limited. For example, breakthroughs in AI/ML algorithmic bias are published daily; however, these issues are rarely reflected in FDA-approved AI/ML-based medical devices. The challenge is no longer about developing new methods but ensuring that existing ones are validated, understood, trusted, and integrated into practice. Therefore, our field could benefit from prioritizing the translation of advancements into ethical, reliable, and impactful solutions for real-world problems, as well as redefining incentives to reward implementation science.
Another notable trend is the shift in statisticians’ role from “playing in everyone’s backyard” to leading at the forefront of data science and scientific initiatives. This is reflected in the growing number of scientific publications led by statisticians and in statisticians becoming more involved in data collection rather than merely adopting existing data. This is critical because leading interdisciplinary research enables statisticians to enhance the rigor and quality of scientific innovation, ensure the responsible use of data, and drive meaningful changes across diverse fields.
Edited by: Shan Gao
Proofread by: Hongtu Zhu, Jian Kang