2025 Workshop at BIRS: Day 2 Recordings

Tuesday, August 19 · Day 2 of the 2025 BIRS Workshop “Foundation Models and Their Biomedical Applications: Bridging the Gap”

Published: August 20, 2025

Visit the Stats Up AI Channel for More

🏁 2025 Workshop at BIRS: Overview

Foundation Models and Their Biomedical Applications: Bridging the Gap

📍 Banff International Research Station (BIRS), Banff, Alberta, Canada
Event Website: 2025 Workshop Homepage · Dates: Aug 17–22, 2025

🎬 Talks — Quick Looks, Full Notes & Recordings

⮕ Full program: 2025 Workshop Schedule

🔗 2025 Workshop at BIRS: Day 1 Recordings (Monday, Aug 18)
📖 2025 Workshop at BIRS: Day 2 Recordings (Tuesday, Aug 19) 👈
🔗 2025 Workshop at BIRS: Day 3 Recordings (Wednesday, Aug 20)
🔗 2025 Workshop at BIRS: Day 4 Recordings (Thursday, Aug 21)

↩︎ Read more on Stats Up AI 📰 Community News

▶️ Day 2 Recordings: Morning Session

🎤 Tianxi Cai: Unlocking the Potential of EHR Data for Discovery: Opportunities and Challenges

📅 Tuesday, August 19, 2025 • 🕘 09:02 - 09:38
🏛️ Harvard University

Keywords: longitudinal data, rare disease diagnosis, data heterogeneity
Summary: New strategies for leveraging large-scale EHR data are outlined to accelerate precision medicine and translational research by overcoming data heterogeneity and enabling actionable clinical insights.

📖 Read more

Introduction: Electronic Health Record (EHR) data offers unprecedented opportunities for advancing precision medicine, enhancing clinical decision support, and accelerating translational research. By leveraging rich, longitudinal clinical data, we can uncover patient-specific insights, identify novel risk factors, and tailor interventions in real time. Large-scale EHR data, linked with other data sources, also opens opportunities for drug discovery, repurposing, and rare disease diagnosis. However, realizing this potential requires addressing significant challenges, including high dimensionality, data sparsity, and heterogeneity across healthcare systems. This presentation will explore strategies for harnessing EHR data effectively, with a focus on methodological innovations, collaborative infrastructure, and the importance of clinical context in transforming raw data into actionable knowledge.

🎬Open the video directly

🎤 Xihong Lin: Navigate the Crossroad of Statistics, Generative AI and Genomic Health

📅 Tuesday, August 19, 2025 • 🕘 09:39 - 10:19
🏛️ Harvard University

Keywords: synthetic likelihood, diffusion models, whole-genome sequencing (WGS), biobank data
Summary: Generative AI is combined with robust statistical methods to unlock large-scale genomic discovery, enabling valid inference and scalable analysis of biobank and whole-genome sequencing data.

📖 Read more

Introduction: Scalable and robust statistical methods empowered by generative AI offer unprecedented potential for trustworthy science, as they empower statistical analysis, quantify uncertainty, enhance interpretability, and accelerate scientific discovery. In this talk, I will discuss the challenges and opportunities as we navigate the crossroad of statistics, generative AI, and genomic health science. I will discuss robust and powerful statistical analysis using the synthetic likelihood that leverages synthetic data generated by generative AI models, such as diffusion models and transformers, while ensuring valid statistical inference when the generative AI models are misspecified. I will illustrate key points using the analysis of large-scale biobanks, whole-genome sequencing data, and electronic health records, and demonstrate the power of scientific discovery by integrating statistics and generative AI using synthetic data. I will also discuss how to conduct scalable and interpretable analysis of large-scale whole-genome sequencing (WGS) data, and illustrate the WGS analysis ecosystem using the TOPMed WGS samples of 200,000, the UK Biobank data of 500,000 subjects in the cloud platform RAP, as well as the All of Us data of 400,000 subjects in the NIH cloud platform AnVIL.

🎬Open the video directly

🎤 Annie Qu: Representation Retrieval Learning for Heterogeneous Data Integration

📅 Tuesday, August 19, 2025 • 🕘 10:47 - 11:20
🏛️ University of California, Irvine

Keywords: multi-task learning, optimal transport, latent representation, transferability
Summary: A new representation retrieval learning framework improves prediction and inference by borrowing partially shared structures across heterogeneous datasets.

📖 Read more

Introduction: In this presentation, I will showcase advanced statistical machine learning techniques and tools designed for the seamless integration of information from multi-source datasets. These datasets may originate from various sources, encompass distinct studies with different variables, and exhibit unique dependence structures. One of the greatest challenges in investigating research findings is the systematic heterogeneity across individuals, which could significantly undermine the power of existing machine learning methods to identify the underlying true signals. This talk will investigate the advantages and drawbacks of current data integration methods such as multi-task learning, optimal transport, missing data imputation, matrix completion, and transfer learning. Additionally, we will introduce a new representation retriever learning framework aimed at mapping heterogeneous observed data to a latent space, facilitating the extraction of shared information and knowledge and the disentanglement of source-specific information and knowledge. The key idea is to project heterogeneous raw observations onto a representation retriever library, and the novelty of our method is that we can retrieve partial representations from the library for a target study. The main advantage of the proposed method is that it can increase statistical power by borrowing partially shared representation retrievers from multiple sources of data. This approach ultimately allows one to extract information from heterogeneous data sources, transfer generalizable knowledge beyond the observed data, and enhance the accuracy of prediction and statistical inference.

🎬Open the video directly

▶️ Day 2 Recordings: Afternoon Session 1

🎤 Hongzhe Li: Decoding Population Diversity at Single-Cell Resolution with AI and Machine Learning

📅 Tuesday, August 19, 2025 • 🕘 13:05 - 13:40
🏛️ University of Pennsylvania

Keywords: single-cell omics, trajectory inference, population variability, precision health
Summary: AI methods link single-cell profiles with population-level traits, revealing how molecular variation drives human diversity in health and disease.

📖 Read more

Introduction: Understanding population diversity at the single-cell level is essential for uncovering the biological basis of development, disease, and therapeutic response. Recent advances in single-cell technologies enable high-resolution profiling of cellular states across individuals, but analyzing these large, heterogeneous datasets remains challenging. AI and machine learning offer powerful approaches to characterize inter-individual variability, identify cell types and states, and infer trajectories from complex single-cell data. This talk will highlight recent methods that integrate single-cell profiles with individual-level metadata—such as genotype, ancestry, environmental exposure, and clinical outcomes—to link molecular variation with population-level traits. These approaches provide new insights into the cellular basis of human diversity and hold promise for advancing precision health.

🎬Open the video directly

🎤 Tianming Liu: Quantum Transformers

📅 Tuesday, August 19, 2025 • 🕘 13:50 - 14:05
🏛️ University of Georgia

Keywords: quantum computing, transformer architecture, foundation models
Summary: Quantum-based transformer architectures promise to accelerate AI capabilities and broaden applications across science and biomedicine.

📖 Read more

Introduction: Transformers have transformed the AI field, particularly in recent large language models and other foundation models. In parallel, quantum computing holds extraordinary promise for accelerating AI development and application. This talk will introduce a quantum‑computing‑based implementation of the transformer architecture and showcase its potential across various applications. Quantum transformers are envisioned to offer significant benefits in many domains — including biomedicine — in the years ahead.

🎬Open the video directly

🎤 Tao Wang: Profiling antigen-binding affinity of B cell repertoires in tumors by deep learning

📅 Tuesday, August 19, 2025 • 🕘 14:06 - 14:22
🏛️ MD Anderson Cancer Center

Keywords: contrastive learning, antibody–antigen interactions, immune checkpoint inhibitors, risk scoring
Summary: A deep learning framework (Cmai) profiles antigen–antibody binding affinities from BCR repertoires, enabling prediction of immunotherapy response and risk of immune-related adverse events.

📖 Read more

Introduction: The capability to profile the landscape of antigen-binding affinities of a vast number of antibodies (B cell receptors, BCRs) will provide a powerful tool to reveal biological insights. However, experimental approaches for detecting antibody–antigen interactions are costly and time-consuming and can only achieve low-to-mid throughput. In this work, we developed Cmai (contrastive modeling for antigen–antibody interactions) to address the prediction of binding between antibodies and antigens that can be scaled to high-throughput sequencing data. We devised a biomarker based on the output from Cmai to map the antigen-binding affinities of BCR repertoires. We found that the abundance of tumor antigen-targeting antibodies is predictive of immune-checkpoint inhibitor (ICI) treatment response. We also found that, during immune-related adverse events (irAEs) caused by ICI, humoral immunity is preferentially responsive to intracellular antigens from the organs affected by the irAEs. We used Cmai to construct a BCR-based irAE risk score, which predicted the timing of the occurrence of irAEs.

🎬Open the video directly

🎤 Rong Ma: Two recent methods for nonlinear joint embedding of high-dimensional data

📅 Tuesday, August 19, 2025 • 🕘 14:28 - 14:48
🏛️ Harvard University

Keywords: optimal transport, integral operators, manifold alignment, multi-omics integration
Summary: New methods based on entropic optimal transport and duo-landmark integral operators achieve consistent nonlinear joint embedding, recovering shared manifold structures while removing dataset-specific noise.

📖 Read more

Introduction: In this talk, I will present two recent and closely related methods for nonlinear joint embedding of high-dimensional datasets. The first builds on ideas from entropic optimal transport, while the second is based on duo-landmark integral operators. Both are principled approaches for aligning and jointly embedding multiple datasets, supported by rigorous theoretical guarantees. We show that for a pair of noisy, high-dimensional datasets, these methods consistently recover the shared underlying manifold structure while mitigating dataset-specific nuisance variation. I will provide an intuitive geometric explanation of each methodology, along with the theoretical foundations that justify their performance. I will demonstrate their effectiveness in analyzing a single-cell multiomic dataset comprising snRNA-seq and snATAC-seq for human brain cells, which uncovers interesting cell-type-specific interactions between transcription and epigenomic regulation. This talk is based on recent work in collaboration with Xiucai Ding, and on work with Boris Landa and Yuval Kluger.
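
For readers unfamiliar with entropic optimal transport, the following is a minimal sketch of the generic Sinkhorn iteration that computes an entropy-regularized transport plan between two small point clouds. It illustrates only the standard building block named in the abstract, not the speaker's joint embedding method. The function names, the toy points, the squared Euclidean cost, and the values of `eps` and `n_iter` are all assumptions chosen for illustration.

```python
import math

def sinkhorn_plan(cost, a, b, eps=0.5, n_iter=500):
    """Entropy-regularized transport plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    eps is the regularization strength (an illustrative default).
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Toy data (hypothetical): Y is a slightly perturbed copy of X.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Y = [(0.1, 0.0), (1.1, 0.1), (0.0, 0.9)]
cost = [[sq_dist(x, y) for y in Y] for x in X]
a = [1 / 3] * 3  # uniform mass on the source points
b = [1 / 3] * 3  # uniform mass on the target points

plan = sinkhorn_plan(cost, a, b)
for row in plan:
    print([round(p, 4) for p in row])
```

In this toy setting each source point sends most of its mass to its perturbed counterpart, which is the kind of soft correspondence that alignment methods can build on. A smaller `eps` gives a sharper plan but needs more care numerically.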

🎬Open the video directly

🎤 Yang Ning: Optimal Variable Clustering for High-Dimensional Matrix Valued Data

📅 Tuesday, August 19, 2025 • 🕘 14:52 - 15:14
🏛️ Cornell University

Keywords: matrix-valued data, latent variable models, covariance clustering, minimax optimality
Summary: A latent variable model with covariance-based dissimilarities yields a hierarchical clustering algorithm that is minimax optimal for high-dimensional matrix-valued data and effective in genomic applications.

📖 Read more

Introduction: Matrix-valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model in terms of some cluster separation metric. Given these results, we identify the optimal weight in the sense that using this weight guarantees that our algorithm is minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.

🎬Open the video directly

▶️ Day 2 Recordings: Afternoon Session 2

🎤 Yingqi Zhao: Early classification of time series with constraint

📅 Tuesday, August 19, 2025 • 🕘 15:32 - 15:55
🏛️ Fred Hutch Cancer Center

Keywords: sequential decision making, early classification, trade-off optimization, biomarkers
Summary: A constrained early-classification framework for time series balances sensitivity, specificity, and earliness, providing tractable optimal solutions with strong real-world applicability.

📖 Read more

Introduction: Biomarker levels are associated with adverse events among patients. These adverse events present serious health risks to affected patients and are associated with significant financial costs. Thus, a high-quality predictive model that could identify high-risk patients has the potential to improve patient outcomes while reducing healthcare costs. From the perspective of sequential decision making, we propose a novel approach for early classification of time series incorporating various constraints. The classifier either concludes positively/negatively based on the series or waits for further information from the next time step. We characterize the trade-off among multiple criteria, such as sensitivity, specificity, and earliness. We also explicitly formulate the optimal solution, which is tractable via plug-in estimators. Experimental studies demonstrate its potential in real-world applications.
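
The "conclude or wait" structure described above can be pictured with a minimal sketch. This is not the optimal constrained rule from the talk; it is a simple two-threshold stopping rule applied to a sequence of risk scores, with a forced decision at the final time step. The names `early_classify` and `evaluate`, the thresholds `lower` and `upper`, and the toy series are all hypothetical, and each series is assumed to be non-empty.

```python
def early_classify(scores, lower=0.2, upper=0.8):
    """Return (label, stop_time) for one series of risk scores in [0, 1].

    At each step the rule declares positive (1) if the score reaches
    `upper`, negative (0) if it falls to `lower`, and otherwise waits.
    If no threshold is crossed, a decision is forced at the last step.
    """
    for t, p in enumerate(scores, start=1):
        if p >= upper:
            return 1, t
        if p <= lower:
            return 0, t
    return (1 if scores[-1] >= 0.5 else 0), len(scores)

def evaluate(series_list, labels, lower=0.2, upper=0.8):
    """Sensitivity, specificity, and mean stopping time over labeled series.

    Assumes at least one positive and one negative label.
    """
    tp = tn = pos = neg = 0
    total_time = 0
    for scores, y in zip(series_list, labels):
        pred, t = early_classify(scores, lower, upper)
        total_time += t
        if y == 1:
            pos += 1
            tp += pred == 1
        else:
            neg += 1
            tn += pred == 0
    return tp / pos, tn / neg, total_time / len(labels)

# Toy risk-score series (hypothetical) with true labels.
series = [
    [0.5, 0.6, 0.85, 0.9],
    [0.4, 0.15],
    [0.5, 0.55],
    [0.3, 0.7, 0.1],
]
labels = [1, 0, 1, 0]

print(evaluate(series, labels))            # wide thresholds: later decisions
print(evaluate(series, labels, 0.4, 0.6))  # narrow thresholds: earlier decisions
```

Moving the thresholds closer together makes decisions earlier on average, typically at some cost in sensitivity or specificity on harder data; that tension is the trade-off the abstract refers to.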

🎬Open the video directly

🎤 Xiaowu Dai: Training-Free Multi-Agent Language Models

📅 Tuesday, August 19, 2025 • 🕘 15:56 - 16:22
🏛️ University of California, Los Angeles

Keywords: multi-agent systems, peer elicitation, game theory, truthful equilibrium
Summary: A training-free game-theoretic framework (Peer Elicitation Games) aligns LLMs through multi-agent interaction, achieving provably truthful equilibria and improved factual accuracy without supervision.

📖 Read more

Introduction: Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where rewards are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret, in the sense that its cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by agents converge to stable and truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.
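
To give a feel for a determinant-based score, here is a minimal sketch of one form known from the peer prediction literature: split the tasks into two halves, build the matrix of joint report counts between two agents on each half, and multiply the two determinants. Whether PEG uses exactly this form is an assumption; the abstract does not specify it. The function names and the toy binary reports are hypothetical, and the sketch handles only two labels.

```python
def joint_counts(reports_a, reports_b, n_labels=2):
    """Matrix M where M[x][y] counts tasks with reports (x, y)."""
    M = [[0] * n_labels for _ in range(n_labels)]
    for ra, rb in zip(reports_a, reports_b):
        M[ra][rb] += 1
    return M

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def dmi_score(reports_a, reports_b):
    """Product of determinants of joint-count matrices on two task halves.

    No ground-truth labels are used; only agreement structure between
    the two agents' reports matters.
    """
    half = len(reports_a) // 2
    M1 = joint_counts(reports_a[:half], reports_b[:half])
    M2 = joint_counts(reports_a[half:], reports_b[half:])
    return det2(M1) * det2(M2)

# Toy binary reports on eight tasks (hypothetical).
peer = [0, 1, 0, 1, 1, 0, 0, 1]
informative = list(peer)   # reports that track the peer's reports
uninformative = [0] * 8    # same answer on every task

print(dmi_score(informative, peer))
print(dmi_score(uninformative, peer))
```

An agent that gives the same answer on every task produces a rank-deficient count matrix and therefore a zero score, which is the intuition for why such a score rewards informative reports without needing labels.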

🎬Open the video directly

📌 Watch All Recordings

StatsUpAI YouTube Channel: Subscribe for updates
BIRS Official Videos Page: 2025 Workshop Videos
Direct Video Downloads: BIRS Video Server

AI is rapidly reshaping biomedical research by integrating diverse data, accelerating discovery, and supporting decision-making under uncertainty. With statisticians at the forefront, these applications gain the depth, rigor, and reliability needed to truly transform science and medicine.