Summary

Computational biology, the main focus of my research, is inherently interdisciplinary and lies at the intersection of computing, life sciences, and statistics. My goal is to bridge the gap between these traditional disciplines and to bring computational innovations to wet labs and, eventually, to clinical practice in order to make a difference in people's lives.

While much of the recent work in the computational biology community has focused on an organism-level understanding of genes, proteins, and their interactions, the overarching theme of my work has been specializing these datasets to tissue-, cell type-, and pathology-specific contexts. Scaling from models for a single ``canonical'' cell to models that can handle assays of billions of potentially distinct, interacting cells in different states of disease progression or maturation can easily overwhelm current computational paradigms and challenge statistical models. My work builds on, and significantly advances, the areas of probabilistic modeling, machine learning, and complex network analysis. The scale and scope of the emerging data violate many of the assumptions underlying traditional machine learning techniques, including assumptions about statistical independence, underlying distributions, and required sampling rates. New techniques must account for strong correlations, heavily skewed distributions, and significant undersampling, while supporting well-characterized notions of statistical significance, correlation, and causality.

Current Research

The focus of my recent work is to develop efficient computational tools, coupled with rigorous statistical models, to:

  1. Identify cell types and tumor subclasses from single-cell gene expression profiles, in order to dissect the heterogeneity of the tumor microenvironment. Single-cell transcriptomic data has the potential to radically redefine our view of cell type identity. Cells that were previously believed to be homogeneous are now clearly distinguishable in terms of their expression phenotype. Developing methods to automatically identify de novo cell types and their functional identity aids in prioritizing combination therapies that simultaneously target different tumor subclasses, including rare subpopulations.
  2. Deconvolve the expression profiles of individual cell types from tumor biopsies, to gain a mechanistic understanding of tumor-immune interactions. Biological samples are typically heterogeneous in terms of their constituent cell types. Dynamic changes in the relative cell type proportions under different conditions can highlight the underlying biological response and are indicative of the progression of disorders ranging from neurodegenerative diseases to cancers. Deconvolving the relative fractions of the constituent cell types in a given complex mixture has immense diagnostic, prognostic, and pharmaceutical applications.
  3. Construct robust tissue/cell type-specific networks, with the goal of uncovering pathways that are involved in drug resistance in targeted therapies. The majority of human proteins do not work in isolation but take part in pathways, complexes, and other functional modules. These physical interactions are commonly modeled using an undirected network, also referred to as the interactome. However, this global network does not provide any information about the spatiotemporal context in which the interactions occur. Computational techniques to construct a reliable model of active interactions in a cell can potentially unlock the pathways responsible for the unique susceptibility of cell types/tissues to pathologies and therapeutic agents.
Overarching theme of my recent work

Identify cell types from single-cell transcriptomic profiles

The emergence of ultra-high-throughput technologies to assay organisms all the way down to a single cell provides an unprecedented opportunity to uncover transcriptional, functional, and pathological characteristics of cells. I have developed a novel technique, called ACTION, to bridge the gap between the transcriptional profile of cells and their functional identity. At the core of this method is a metric space for representing the functional relationship between cells. This space is defined according to a biologically inspired kernel. The fundamental assumption here is that cellular functions are embedded within each other, with the outermost layer representing housekeeping functions. The transcriptional profile of cells is dominated by generic functions, whereas their functional identity is determined by a combination of a small number of weakly expressed cell type-specific genes. Under the ``pure cell assumption'', in which there exist cells that are specialized to perform a unique set of functions and the remaining cells can be represented in terms of these functions, this metric induces a convex topology. The corners, or archetypes, of this topology characterize the ``principal functions'' of cells, which we can use to estimate the underlying cell type of each cell. Finally, I have developed a statistical framework to identify transcription factors (TFs) that play a key role in defining cell type identities. I show, through extensive evaluation, that ACTION is more effective than state-of-the-art methods for identifying novel cell types and cancer subtypes. More importantly, ACTION identifies novel cancer subtypes in melanoma patients, constructs their underlying transcriptional regulatory network (TRN), and suggests novel therapeutic targets and biomarkers.
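
To make the ``pure cell assumption'' concrete, the following is a minimal sketch, not the ACTION implementation. It assumes that the cells are already embedded as columns of a matrix, that a generic successive projection step is an acceptable stand-in for locating the corners of the convex topology, and that a clipped least-squares fit is good enough to express each cell in terms of those corners. The function names `find_pure_cells` and `mixing_weights`, and the synthetic data, are hypothetical.

```python
import numpy as np


def find_pure_cells(X, k):
    """Return indices of k columns of X (features x cells) lying at corners of the data.

    Successive projection: repeatedly take the column with the largest
    residual norm, then project that direction out of every column.
    """
    R = np.asarray(X, dtype=float).copy()
    corners = []
    for _ in range(k):
        norms = np.sum(R * R, axis=0)
        j = int(np.argmax(norms))
        corners.append(j)
        u = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(u, u @ R)
    return corners


def mixing_weights(X, corners):
    """Express every cell as an approximate convex combination of the corner cells."""
    W = X[:, corners]
    H, *_ = np.linalg.lstsq(W, X, rcond=None)
    H = np.clip(H, 0.0, None)                 # crude non-negativity (assumption)
    return H / H.sum(axis=0, keepdims=True)   # each column sums to one


# Synthetic example: cells 0..2 are "pure", the other 20 are mixtures of them.
rng = np.random.default_rng(0)
profiles = rng.random((30, 3))
weights = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=20).T])
X = profiles @ weights

corners = find_pure_cells(X, 3)
H = mixing_weights(X, corners)
```

In this noiseless toy setting the selected corners coincide with the three pure cells, and each column of `H` gives the contribution of every ``principal function'' to one cell. On noisy data a properly constrained solver would be preferable to the clip-and-rescale step.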

Deconvolving heterogeneous mixture of expression profiles

Deconvolving a mixture of expression profiles into its individual components is an important and computationally hard problem. A multitude of approaches have been proposed in the literature to address it. These methods differ greatly in their underlying assumptions, pre-processing, problem formulation, and post-processing steps. To perform an unbiased study, I systematically assessed the effect of each component on the overall quality of deconvolution. I identified robust configurations, not previously proposed in the literature, that perform well when reference cell types are cultured under different microenvironmental conditions. Additionally, I proposed novel pre-processing steps, and a new constraint, to select invariant marker genes that significantly enhance deconvolution quality in all cases. I show that with the right combination of loss function, regularizer, and feature selection, we can bound the deconvolution error rates within a range of 4-7% across different datasets. Finally, I summarized these findings in a prescriptive step-by-step process, which can be applied to a wide range of deconvolution problems.
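
As an illustration of how a loss function and a regularizer enter the problem, here is a minimal sketch of one possible formulation. It is not one of the configurations evaluated in the study: the squared-error loss, the ridge penalty `lam`, the projected gradient solver, and the function name `deconvolve` are all assumptions chosen for brevity.

```python
import numpy as np


def deconvolve(mixture, signatures, lam=0.0, n_iter=5000):
    """Estimate cell type fractions f >= 0 such that signatures @ f ~= mixture.

    Minimizes 0.5 * ||signatures @ f - mixture||^2 + 0.5 * lam * ||f||^2
    by projected gradient descent, then rescales f to sum to one.
    """
    S = np.asarray(signatures, dtype=float)
    m = np.asarray(mixture, dtype=float)
    gram = S.T @ S + lam * np.eye(S.shape[1])
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]   # 1 / largest eigenvalue
    Stm = S.T @ m
    f = np.full(S.shape[1], 1.0 / S.shape[1])   # start from equal fractions
    for _ in range(n_iter):
        grad = gram @ f - Stm
        f = np.maximum(f - step * grad, 0.0)    # project onto f >= 0
    return f / f.sum()


# Synthetic example: 200 marker genes, 3 reference cell types.
rng = np.random.default_rng(1)
signatures = rng.random((200, 3))
true_fractions = np.array([0.5, 0.3, 0.2])
mixture = signatures @ true_fractions

estimate = deconvolve(mixture, signatures)
```

Swapping the loss, the penalty, or the set of marker genes used as rows of `signatures` corresponds to the components whose effect is assessed above.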

Constructing cell type/tissue-specific interactomes

Proteins often perform their functions as part of larger functional units. Thus, we can infer the functional activity of proteins from their interaction context. I formulated this intuitive hypothesis as a suitably regularized convex optimization problem and used it to identify tissue-specific networks. The objective function of this problem has two terms: the first is a diffusion kernel that propagates the activity of genes through interactions (network links), and the second is a sparsifier that penalizes changes to enhance the robustness of the results. I use the estimated functional activity scores to compute the tissue-specificity of each edge in the global interactome. Using this technique, and RNA-Seq datasets from the GTEx project, I have created a comprehensive dataset of tissue-specific networks. I show that these tissue-specific networks, compared to the global interactome, excel in predicting genes involved in tissue-specific pathologies. Moreover, they can identify disease-related pathways that link susceptibility factors from GWAS. Finally, they can predict pairs of genes that have similar tissue-specific functions. These networks are publicly available online.
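
The two-term objective can be sketched as follows. This is one plausible reading rather than the formulation used in the work: a graph Laplacian smoothness term stands in for the diffusion kernel, an L1 penalty on the deviation from the observed expression stands in for the sparsifier, and the problem is solved by proximal gradient descent. The function name `infer_activity` and the parameter `lam` are hypothetical, and the graph is assumed to contain at least one edge.

```python
import numpy as np


def infer_activity(adjacency, expression, lam=1.0, n_iter=500):
    """Minimize 0.5 * x^T L x + lam * ||x - z||_1 over activity scores x.

    L is the graph Laplacian of the interactome and z the observed expression.
    The first term spreads activity along edges; the second keeps most scores
    unchanged unless the network context strongly suggests otherwise.
    """
    A = np.asarray(adjacency, dtype=float)
    z = np.asarray(expression, dtype=float)
    degree = A.sum(axis=1)
    laplacian = np.diag(degree) - A
    step = 1.0 / (2.0 * degree.max())   # largest Laplacian eigenvalue <= 2 * max degree
    x = z.copy()
    for _ in range(n_iter):
        v = x - step * (laplacian @ x)  # gradient step on the smoothness term
        d = v - z
        # Proximal step: soft-threshold the change relative to z.
        x = z + np.sign(d) * np.maximum(np.abs(d) - step * lam, 0.0)
    return x


# Toy interactome: a path of three genes; the middle gene is not observed as expressed.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
expression = np.array([1.0, 0.0, 1.0])

smoothed = infer_activity(adjacency, expression, lam=1.0)
unchanged = infer_activity(adjacency, expression, lam=3.0)
```

With `lam=1.0` the middle gene inherits activity from its two expressed neighbors and settles at 0.5, whereas with `lam=3.0` the penalty dominates and the scores stay at the observed values. Edge-level tissue-specificity could then be derived from the scores of the two endpoints.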

Future Directions

In the short term, I plan to build on my current efforts and extend them along multiple dimensions. In terms of single-cell analysis, I am interested in developing methods to identify complex relationships among cell types, infer an ordering among them, and establish a ``history'' of changes between cells. I am also interested in developing new techniques that utilize a structural prior on the relationship between the cell types to significantly scale the deconvolution problem. Finally, I aim to combine single-cell analysis with deconvolution. Single-cell transcriptomics can provide a reference panel, which can be used to perform supervised deconvolution. On the other hand, deconvolution techniques can estimate the underlying fractions of cell types in the mixture, which can be incorporated into the single-cell analysis to correct for sampling biases, among other confounding factors.

In terms of application, I plan to expand the scope of my work in two ways. First, I will collaborate closely with experimentalists to validate and cross-examine my findings. Second, because most of the problems I am working on are motivated by clinical applications, I am keenly interested in testing my methods on real data from patients.

My long-term vision is to build a comprehensive set of methods and models that extend the traditional computational paradigm along spatial, temporal, and pathological dimensions. These methods have a direct impact on dissecting the heterogeneity of the tumor microenvironment, knowledge of which significantly affects the effectiveness of targeted therapies and immunotherapies. These challenges can only be tackled through a large-scale collaborative effort that brings scientists from different disciplines together to ask the right questions and seek the right answers. During my Ph.D., I initiated lasting collaborations with colleagues across different institutions and disciplines. In the future, I plan to expand these collaborations to establish a multi-institutional, interdisciplinary effort.