MaxFuse: Unleashing the Full Potential of Single Cell and Spatial Genomics Integration

Interviews
Published

June 19, 2024


The Article Link:

Integration of spatial and single-cell data across modalities with weakly linked features
Interviewee Name

Dr. Zongming Ma

Dr. Ma is a Professor of Statistics and Data Science at Yale University. His current research interests include multi-modal data analysis, network data analysis, and statistical and machine learning methods in genomics and imaging. Dr. Ma obtained his PhD in Statistics from Stanford University in 2010 and joined the Department of Statistics and Data Science at the Wharton School of the University of Pennsylvania as a tenure-track assistant professor in the same year. After working at the University of Pennsylvania for thirteen years, he moved to Yale University in 2023. He is the recipient of an NSF CAREER award in 2014, a Sloan Research Fellowship in Mathematics in 2016, and an Institute of Mathematical Statistics Fellowship in 2023.

Interviewee Name

Dr. Nancy Ruonan Zhang

Dr. Zhang is the Ge Li and Ning Zhao Professor of Statistics in The Wharton School at the University of Pennsylvania. Her research focuses primarily on the development of statistical methods and computational algorithms for the analysis of data from high-throughput biological experiments. She has made contributions to copy number and structural variant detection, to the modeling and estimation of intra-tumor genetic heterogeneity, and to the modeling and analysis of single-cell and spatial genomic data. In Statistics, she has made contributions to change-point analysis, variable selection, and model selection. Dr. Zhang obtained her Ph.D. in Statistics in 2005 from Stanford University. After one year of postdoctoral training at the University of California, Berkeley, she returned to the Department of Statistics at Stanford University as an Assistant Professor in 2006. She received the Sloan Fellowship in 2011, and formally moved to the University of Pennsylvania with tenure in 2012. She was awarded the Medallion Lectureship by the Institute of Mathematical Statistics in 2021. Her work has been funded by grants from the NSF and NIH. At Penn, she is a member of the Graduate Group in Genomics and Computational Biology, and currently serves as the Vice Dean of the Wharton Doctoral Program.

Regarding the research background and significance, does this work discover new knowledge or solve existing problems within the field? Please elaborate in detail.

The work solves existing problems within the field. To introduce the work, we first need to say a little about what single cell and spatial genomics are. The last decade brought about a complete paradigm shift in biology, in which genome-wide measurements can be made on single cells, across many cells in parallel. By “genome-wide”, we mean, for example, measuring the expression of tens of thousands of genes at once, for each of thousands to millions of cells. Such technologies are called “single cell technologies”. Even more exciting, the past couple of years have brought about “spatial genomic technologies”, which can make such measurements for cells in situ. “In situ” means “fixed in space”, so that we know not only the cell’s internal molecular state but also its tissue neighborhood. Since cells form tight-knit communities and “collaborate” and “compete” with other cells in their tissue niche, spatial genomic technologies provide really important information for understanding cell and tissue biology.

For such single cell and spatial data, a primary unsolved challenge is how to perform diagonal integration. What is diagonal integration? To thoroughly characterize a cell, we would like to measure the expression of all of its genes, as well as to quantify what proteins and metabolites it is making. We perhaps would also like to know the conformation of its chromatin (that is, which regions of its DNA are accessible, and which are “closed” and not available for transcription). Each of these modalities, on its own, provides valuable information but never the complete picture. Ideally, we would like to measure everything for each cell, but no technology is yet capable of that. Thus we are left with having to apply different technologies, each profiling a subset of the genes/modalities of interest. Sometimes, for example, we have a spatial technology recording the tissue position of cells and their expression of a panel of proteins, and a single cell technology that does not provide the spatial coordinates but measures the full transcriptome. In such cases, we would really like to have all three: spatial position, full transcriptome measurement, and protein measurement. The subsequent analysis challenge of “aligning” cells across technologies is often called diagonal integration. Diagonal integration is extremely challenging when the data modalities are weakly linked, that is, when there are few or no shared features between data modalities.
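To make the alignment problem concrete, here is a minimal sketch of a naive baseline that matches cells across two data sets using only their shared features, through a minimum-cost one-to-one assignment. It is not MaxFuse itself. The function name `match_on_shared_features`, the simulated data, and the use of added noise as a stand-in for weak linkage are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_on_shared_features(shared_a, shared_b):
    """Return, for each cell in A, the index of its matched cell in B.

    Both inputs are (cells x shared features) arrays with the same number
    of cells; the matching minimizes the total Euclidean distance.
    """
    cost = cdist(shared_a, shared_b)          # pairwise distances
    rows, cols = linear_sum_assignment(cost)  # one-to-one assignment
    match = np.empty(len(shared_a), dtype=int)
    match[rows] = cols
    return match


# Toy data (hypothetical): 50 cells seen by two technologies, B's rows shuffled.
rng = np.random.default_rng(0)
n_cells, n_shared = 50, 3
signal = rng.normal(size=(n_cells, n_shared))
perm = rng.permutation(n_cells)               # row j of B is cell perm[j]


def simulate(noise_sd):
    a = signal + noise_sd * rng.normal(size=signal.shape)
    b = signal[perm] + noise_sd * rng.normal(size=signal.shape)
    return a, b


def accuracy(match):
    # Cell i of A is correctly matched when row match[i] of B is cell i.
    return float(np.mean(perm[match] == np.arange(n_cells)))


strong_a, strong_b = simulate(noise_sd=0.001)  # informative shared features
weak_a, weak_b = simulate(noise_sd=2.0)        # noisy, weakly informative ones
acc_strong = accuracy(match_on_shared_features(strong_a, strong_b))
acc_weak = accuracy(match_on_shared_features(weak_a, weak_b))
print(f"strong linkage: {acc_strong:.2f}, weak linkage: {acc_weak:.2f}")
```

In this toy setting the matching is reliable only while the shared features carry a clear signal, which is the regime the answer above describes as the easy case.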

In this paper, we provide a new method, MaxFuse, which, through iterative co-embedding, data smoothing and cell matching, uses all information in each modality to achieve high quality integration even in extremely challenging scenarios. MaxFuse is, to our knowledge, the only method that can reliably integrate single cell RNA sequencing data with spatial proteomic data, which is a common study design in current tissue atlasing efforts.

How did the reviewers evaluate (praise) it?

The reviewers immediately recognized the importance of this problem and the impact of the method, but they made us do more benchmarking and exploration of model parameters before letting us publish it. We believe that the reviewers were very interested in how this method can standardize the computational pipelines for analysis of spatial proteomic and transcriptomic data.

If this achievement has potential applications, what are some specific applications it might have in a few years?

As mentioned, single cell and spatial genome profiling technologies are now the bread-and-butter of biomedical research, but we are still at the early stages in realizing the potential of these technologies. We believe MaxFuse can greatly increase the flexibility of study designs and allow for more comprehensive characterization of tissues through integration across data modalities. In addition, the design of the algorithm is modality agnostic, and so the method is very likely to continue to work for new modalities.

Can you recount the specific steps or stages from setting the research topic to the successful completion of the research?

In applied problems, the initial step of defining the scope of the problem, and giving it a precise formulation, is always very important. We recognized that diagonal integration was an important unsolved problem, and that the then-existing methods being applied to this problem did not utilize anything except for the first moments of the shared features. They did not work well except for cases where there are lots of overlapping features with strong correlation between modalities. The correlation structures within each modality also contain valuable information about the geometry of the data, but were not used. We wanted to use all of the information within each modality, not just the shared features. This and other insights, such as data denoising via graph smoothing, allowed us to separate the signal from the noise.
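The idea of denoising by graph smoothing can be sketched with one generic variant: averaging each cell with its k nearest neighbours, where the neighbour graph is built from all features of a single modality. This is a simplified illustration under stated assumptions, not the procedure used in the paper. The function name `knn_smooth`, the values of `k` and `weight`, and the simulated two-population data are hypothetical.

```python
import numpy as np


def knn_smooth(X, k=10, weight=0.5):
    """Shrink each cell toward the mean of its k nearest neighbours.

    X is a (cells x features) array from ONE modality; the neighbour graph
    is built from all of its features, not only the shared ones.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist2, np.inf)              # a cell is not its own neighbour
    neighbours = np.argsort(dist2, axis=1)[:, :k]
    neighbour_mean = X[neighbours].mean(axis=1)  # (cells x features)
    return weight * X + (1.0 - weight) * neighbour_mean


# Toy data (hypothetical): two well-separated cell populations plus noise.
rng = np.random.default_rng(1)
centers = np.repeat(np.array([[0.0], [10.0]]), 100, axis=0) * np.ones((1, 20))
noisy = centers + rng.normal(size=centers.shape)
smoothed = knn_smooth(noisy)

mse_before = float(np.mean((noisy - centers) ** 2))
mse_after = float(np.mean((smoothed - centers) ** 2))
print(f"MSE before: {mse_before:.3f}, after: {mse_after:.3f}")
```

Because neighbours are found using every feature in the modality, the smoothing draws on the within-modality geometry that shared-feature-only approaches leave unused.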

The project was also boosted by a collaboration with Garry Nolan’s lab, especially with an extremely capable graduate student, Bokai Zhu, in the driver’s seat. As an interesting fact, the two co-first authors, Shuxiao Chen and Bokai, have been close friends since college, and the collaboration was originally initiated by them. Bokai brought a lot of insights about the different spatial technologies, and selected data sets to illustrate the power of the approach. Often, for scientific papers, the selection of data sets is just as important as the development of the method. We identified data sets that can be used for benchmarking (that is, data sets with known ground truth for how cells should be aligned) and data sets that can be used to tell an interesting story. These steps are extremely important and took up most of our time working on the paper.

Were there any memorable events during the research? You can tell a story about anything related to people, events, or objects.

It is always memorable when your own method lets you uncover something new in data. We applied MaxFuse to the analysis of CODEX tonsil data. CODEX is a spatial technology that measures only a small panel of proteins. We aligned the cells in CODEX to cells from scRNA-seq of the same tissue, which enabled the precise definition of expression patterns of any gene in space. We then checked how the genes were distributed within B cell germinal centers, and found patterns that were biologically meaningful. This was not possible with any previous method.
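Once spatial cells have been matched to scRNA-seq cells, reading a gene's expression "in space" amounts to an index lookup. The sketch below shows that step on made-up numbers; the function `transfer_expression`, the coordinates, the expression values, and the `match` array are all hypothetical and do not come from the tonsil data.

```python
import numpy as np


def transfer_expression(rna_expr, match):
    """Give each spatial cell the RNA profile of its matched scRNA-seq cell.

    rna_expr: (scRNA-seq cells x genes) array.
    match:    match[i] is the scRNA-seq row matched to spatial cell i.
    """
    return rna_expr[match]


# Hypothetical inputs: 4 spatial cells, 3 scRNA-seq cells, 2 genes.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
rna_expr = np.array([[9.0, 0.0],    # scRNA-seq cell 0
                     [8.0, 1.0],    # scRNA-seq cell 1
                     [0.0, 7.0]])   # scRNA-seq cell 2
match = np.array([0, 1, 2, 2])      # would come from a cross-modal matching

spatial_expr = transfer_expression(rna_expr, match)

# Mean transferred expression of gene 0 inside vs. outside a region of interest.
in_region = coords[:, 0] < 1.0
mean_inside = float(spatial_expr[in_region, 0].mean())
mean_outside = float(spatial_expr[~in_region, 0].mean())
print(mean_inside, mean_outside)
```

With real data, the region of interest would be an annotated tissue structure such as a germinal center, and the quality of the result depends entirely on the quality of the matching.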

Is there a follow-up plan based on this research? If so, please elaborate.

We tried many different ideas and have other backup ideas up our sleeve in case things fail. But we didn’t really deviate significantly from our original planned course.

Without a doubt, AI is one of the hot topics of 2023, requiring extensive data support in its development. What assistance can biostatistics offer to the development of AI?

Although AI is a big, exciting direction, it is not the best solution to everything. AI has been applied to the problem of diagonal integration, and a few methods have worked reasonably well. However, none so far has worked on the challenging cases of weak linkage, which motivated MaxFuse. We showed in this paper that simpler but more thoughtful methods development can sometimes beat black box AI algorithms. As we have elaborated earlier, the statistical ideas of denoising, local smoothing, and canonical correlations have played important roles in MaxFuse. Successful applications of AI require understanding of the problem domain, and a good grasp of the signal and noise structures of the data. In this aspect, successful application of AI is no different from successful application of Statistics. We believe that whatever question we are addressing, the key to successful methods development is working closely with field scientists to understand the scientific problem and the data.
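For readers unfamiliar with canonical correlations, the following is a minimal textbook computation on paired data, using QR decompositions followed by a singular value decomposition. It illustrates the statistical concept only and makes no claim about how MaxFuse applies it. The function name `canonical_correlations` and the simulated one-factor data are assumptions.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between paired (cells x features) arrays.

    Assumes row i of X and row i of Y describe the same (or matched) cell
    and that both centred matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)   # orthonormal basis for the column space of Xc
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # i.e. the sample canonical correlations, in decreasing order.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)


# Toy data (hypothetical): two modalities driven by one shared latent factor.
rng = np.random.default_rng(2)
n_cells = 500
latent = rng.normal(size=(n_cells, 1))
X = latent @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(n_cells, 6))
Y = latent @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(n_cells, 4))

rho = canonical_correlations(X, Y)
print(np.round(rho, 3))
```

In this simulation only the leading canonical correlation reflects shared structure; the remaining ones come from noise alone, which is why it matters to understand the signal and noise structure of the data before interpreting them.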

Edited by: Shan Gao
Proofread by: Hongtu Zhu