Decoding Tumor Heterogeneity: GBCD on Transforming Single-Cell Cancer Analysis

Published: April 25, 2025

Article: Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition

Interviewee: Dr. Yusha Liu

Dr. Yusha Liu received her PhD in Statistics from Rice University in 2019 and completed postdoctoral training in the Department of Human Genetics at the University of Chicago before joining UNC. Her research lies at the intersection of statistics and biology. She is particularly interested in developing and applying flexible, scalable statistical approaches to large-scale, complex genomics data, such as single-cell data, with the ultimate goal of contributing to the understanding of complex diseases like cancer and to the development of targeted therapy and prevention strategies.

Regarding the research background and significance, does this work discover new knowledge or solve existing problems within the field? Please elaborate in detail.

Profiling tumors with single-cell RNA sequencing has the potential to identify recurrent patterns of transcription variation related to cancer progression, and to produce therapeutically relevant insights. However, strong inter-tumor heterogeneity can obscure more subtle patterns that are shared across tumors. In this work, we introduce a statistical method, generalized binary covariance decomposition (GBCD), to address this problem. We show that GBCD can decompose transcriptional heterogeneity into interpretable components—including patient-specific, dataset-specific and shared components relevant to disease subtypes—and that, in the presence of strong inter-tumor heterogeneity, it can produce more interpretable results than existing methods. Applied to data on pancreatic ductal adenocarcinoma, GBCD produced a refined characterization of existing tumor subtypes, and identified a gene expression program prognostic of poor survival independent of tumor stage and subtype. This gene expression program is enriched for genes involved in stress responses, and suggests a role for the integrated stress response in pancreatic ductal adenocarcinoma.

How did the reviewers evaluate (praise) it?

The review process for this paper was quite smooth and involved just one round of revision. The reviewers recognized the importance and potential impact of our work, which aimed to identify shared gene expression patterns that could reveal valuable insights about cancer etiology from multi-tumor scRNA-seq data dominated by inter-tumor heterogeneity. Quotes from the reviewers include: “There have been numerous publications describing matrix decomposition approaches to analyzing scRNA-seq data. What is unique about the method is its ability to simultaneously identify signatures that explain both inter- and intra-sample variation in gene expression, which can be particularly powerful for subtyping patient cohorts.”

If this achievement has potential applications, what are some specific applications it might have in a few years?

GBCD is a new computational tool for unified analysis of multi-tumor scRNA-seq data, whose integration is much more challenging than that of non-tumor tissue samples because of the extensive inter-patient heterogeneity. The tool is readily available for jointly analyzing the vast amount of scRNA-seq data in curated cancer cell atlases, in order to extract recurrent gene expression programs that represent biologically relevant cell subtypes or states in a particular type of cancer.

To date, multiple scRNA-seq studies have been conducted for a given cancer type, contributing to a rich resource of scRNA-seq data. While these studies revealed much more structure in the tumor transcriptional landscape than we used to observe with bulk transcriptomics, they often reported findings that were non-replicable or even divergent from one another. For example, in pancreatic ductal adenocarcinoma (PDAC), although consensus has coalesced around two major subtypes, classical and basal-like, different studies have proposed further subclassifications beyond classical/basal-like that are inconsistent with one another. Such inconsistencies across studies can be attributed, at least in part, to two reasons. (1) The number of tumors sequenced per study is usually very limited because of the high cost of scRNA-seq. (2) Even within a single study, the literature lacks an appropriate approach to identifying common cell types and states by integrating scRNA-seq data across multiple tumors. Our work fills this methodological gap. By integrating scRNA-seq data for a particular type of cancer from various studies and patients, we expect GBCD to provide an in-depth analysis of transcriptional heterogeneity and to identify biologically relevant transcriptional states that are also replicable across studies. These analyses will greatly advance our understanding of cancer biology and have the potential to motivate novel therapeutic strategies.

Can you recount the specific steps or stages from setting the research topic to the successful completion of the research?

I started working on this project when I was a postdoc in Dr. Matthew Stephens’ lab. I have a long-standing interest in cancer research and Matthew has extensive expertise in statistical modeling of scRNA-seq data, so he suggested that I review the literature on existing challenges in modeling single-cell transcriptomic data in cancer. I quickly noticed the lack of computational methods for unified analysis of scRNA-seq data collected from multiple patients and studies, and set this as the research topic. To separate the shared patterns of gene expression variation from the strong patient effects, we considered many modeling strategies, but they were either overly complicated or did not work well. After many attempts, we landed on the unique set of modeling assumptions behind GBCD. A key feature is that it assumes gene expression programs to be orthogonal to one another, which helps avoid absorbing shared components of expression variability into patient-specific expression programs. This approach is simple and mathematically elegant, yet it has superior performance in simulation studies and leads to biologically interesting and clinically relevant findings when applied to real data. Throughout this process, we overcame many challenges in model development and refinement. We also engaged with cancer researchers at the University of Chicago, who helped interpret our results and significantly enhanced the impact of this work.
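As a rough illustration of why an orthogonality assumption on gene expression programs can help, the toy simulation below builds an expression matrix Y = L F^T + E, where L holds binary program memberships for each cell and F holds gene signatures with orthonormal columns. This is a sketch under stated assumptions, not the GBCD software or its fitting procedure: the function name `simulate`, the numbers of patients, cells and genes, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate(noise_sd, seed=0, n_patients=3, cells_per_patient=40, n_genes=500):
    """Toy data Y = L F^T + E (all settings here are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    n_cells = n_patients * cells_per_patient
    k = n_patients + 1  # one program per patient plus one shared program

    # L (cells x programs): every cell carries its own patient's program,
    # and every second cell, in all patients, also carries the shared program.
    L = np.zeros((n_cells, k))
    patient = np.repeat(np.arange(n_patients), cells_per_patient)
    L[np.arange(n_cells), patient] = 1.0
    L[::2, -1] = 1.0

    # F (genes x programs): orthonormal columns obtained from a QR factorization.
    F, _ = np.linalg.qr(rng.standard_normal((n_genes, k)))

    # Expression matrix with optional Gaussian noise.
    Y = L @ F.T + noise_sd * rng.standard_normal((n_cells, n_genes))
    return L, F, Y


# Without noise, orthonormal programs make the cell-by-cell cross-product
# depend on the memberships alone: Y Y^T = L F^T F L^T = L L^T.
L, F, Y = simulate(noise_sd=0.0)
exact_match = np.allclose(Y @ Y.T, L @ L.T)

# With mild noise, the off-diagonal entries stay close to L L^T.
L_noisy, _, Y_noisy = simulate(noise_sd=0.01)
err = np.abs(Y_noisy @ Y_noisy.T - L_noisy @ L_noisy.T)
np.fill_diagonal(err, 0.0)
max_offdiag_error = err.max()
```

In this toy setting the patient blocks and the shared program appear as separate additive pieces of L L^T, so the shared program is not forced into any one patient's block. Fitting L from real data, including the priors that encourage near-binary memberships, is the part the actual method handles and is not shown here.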

Were there any memorable events during the research? You can tell a story about anything related to people, events, or objects.

The discussion of GBCD results with our collaborators from the pancreas working group at the University of Chicago was particularly memorable. Our analyses identified a gene expression program that recurs across multiple PDAC patients and remains significantly prognostic of patients’ overall survival after accounting for known prognostic factors of PDAC, including tumor stage and subtype. When we presented this result to our collaborators, we were excited to learn that this gene program is related to the integrated stress response and that many of its driving genes are validated targets of a transcription factor called ATF4, which is exactly the research focus of our collaborators’ labs. This pleasant coincidence led to the most exciting biological discovery and the most interesting case study in this work.

Is there a follow-up plan based on this research? If so, please elaborate.

We will use this powerful computational tool to analyze the human PDAC scRNA-seq atlas to identify biologically relevant gene expression programs that are replicable across studies in PDAC. We have limited our analyses to gene expression data from malignant cells so far. Notably, malignant cells usually constitute a minority of PDAC tumors; the diverse populations of cancer-associated fibroblasts and immune cells from the immunosuppressive tumor microenvironment also play a crucial role in driving the aggressiveness and therapy resistance of PDAC. Thus, we will conduct integrative analysis separately for neoplastic, stromal and immune compartments to identify the biologically meaningful and replicable transcriptional states in each compartment, and further study the between-compartment crosstalk patterns and their prognostic relevance. Our PDAC collaborators will conduct follow-up experiments to validate the prognostic significance of ATF4 in mouse models.

Without a doubt, AI is one of the hot topics of 2024, requiring extensive data support in its development. What assistance can biostatistics offer to the development of AI?

In my understanding, AI tools are often black boxes that offer excellent predictive performance (e.g., predicting the presence of cancer from a patient’s sample) but are less interpretable than traditional statistical models (used, e.g., to identify the key genomic and epigenomic drivers of cancer initiation and progression). As biostatisticians, our statistical modeling expertise and computing skills uniquely position us to integrate statistical ideas with AI techniques and to fully realize the potential of AI in advancing biomedical research.

Edited by: Shan Gao
Proofread by: Hongtu Zhu