Computational Genomics

We develop new computational methods to interpret human genomes using machine learning grounded in biological and genetic mechanisms. Ongoing projects include predicting the functional and genetic effects of missense or noncoding variants using deep learning and graphical models, integrating single-cell functional genomics data into genetic analysis, developing new statistical genetics methods to identify candidate risk genes from rare variants, and building automated end-to-end methods to identify structural variants from exome or genome sequencing data.

Selected papers

  1. Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning. bioRxiv. 2022.
  2. Imputing cognitive impairment in SPARK, a large autism cohort. Autism Research. 2022.
  3. Predicting localized affinity of RNA binding proteins to transcripts with convolutional neural networks. bioRxiv. 2021.
  4. Predicting functional effect of missense variants using graph attention neural networks. bioRxiv. 2021.
  5. MVP predicts pathogenicity of missense variants by deep learning. Nature Communications. 2021.
  6. Template-based prediction of protein structure with deep learning. BMC Genomics. 2020.
  7. Dissecting Autism Genetic Risk Using Single-cell RNA-seq Data. bioRxiv. 2020.
  8. EM-mosaic detects mosaic point mutations that contribute to congenital heart disease. Genome medicine. 2020.
  9. Whole Genome De Novo Variant Identification with FreeBayes and Neural Network Approaches. bioRxiv. 2020.
  10. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nature communications. 2018.
  11. A Cell Type-Specific Expression Signature Predicts Haploinsufficient Autism-Susceptibility Genes. Human mutation. 2016.
  12. CANOES: detecting rare copy number variants from whole exome sequencing data. Nucleic acids research. 2014.
  13. A hidden Markov model for copy number variant prediction from whole genome resequencing data. BMC bioinformatics. 2011.
  14. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome research. 2009.

Human Genetics

Identifying the genetic causes of human conditions provides the foundation for precise diagnosis and risk prediction, in-depth understanding of disease mechanisms, and effective targets for intervention. However, the genetic basis of most complex conditions is still unknown. To improve the ability to identify risk variants and genes, we develop new computational methods that integrate gene expression and epigenomic data with genetic data and leverage large-scale population genome data. We apply these methods in genetic studies of a broad range of human diseases and conditions. Recently we have been working on autism, congenital diaphragmatic hernia, congenital heart disease, pulmonary hypertension, tracheoesophageal defects, and breast cancer.

Selected papers

  1. Identification and validation of novel candidate risk genes in endocytic vesicular trafficking associated with esophageal atresia and tracheoesophageal fistulas. HGG Advances. 2022.
  2. Integrating de novo and inherited variants in over 42,607 autism cases identifies mutations in new moderate risk genes. medRxiv. 2021.
  3. Imputing cognitive impairment in SPARK, a large autism cohort. Autism Research. 2022.
  4. Rare variant analysis of 4,241 pulmonary arterial hypertension cases from an international consortium implicate FBLN2, PDGFD and rare de novo variants in PAH. Genome Medicine. 2021.
  5. Penetrance of breast cancer genes from the eMERGE III Network. JNCI Cancer Spectrum. 2021.
  6. Functional interrogation of DNA damage response variants with base editing screens. Cell. 2021.
  7. Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nature Genetics. 2020.
  8. Likely damaging de novo variants in congenital diaphragmatic hernia patients are associated with worse clinical outcomes. Genetics in Medicine. 2020.
  9. Novel Candidate Genes in Esophageal Atresia/Tracheoesophageal Fistula Identified by Exome Sequencing. EJHG. 2020.
  10. Dissecting Autism Genetic Risk Using Single-cell RNA-seq Data. bioRxiv. 2020.
  11. EM-mosaic detects mosaic point mutations that contribute to congenital heart disease. Genome medicine. 2020.
  12. Exome sequencing of 457 autism families recruited online provides evidence for autism risk genes. NPJ genomic medicine. 2019.
  13. De novo variants in congenital diaphragmatic hernia identify MYRF as a new syndrome and reveal genetic overlaps with other developmental disorders. PLoS genetics. 2018.
  14. Rare variants in SOX17 are associated with pulmonary arterial hypertension with congenital heart disease. Genome medicine. 2018.
  15. A Cell Type-Specific Expression Signature Predicts Haploinsufficient Autism-Susceptibility Genes. Human mutation. 2016.
  16. Deep Genetic Connection Between Cancer and Developmental Disorders. Human mutation. 2016.
  17. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science. 2016.
  18. Increased burden of de novo predicted deleterious variants in complex congenital diaphragmatic hernia. Human molecular genetics. 2015.

Computational Immunology

We use computational analysis and mathematical modeling to understand the dynamics of immune cells in humans. In particular, we are interested in predictive and generative models of T cell receptor recognition of antigens.

Selected papers

  1. Single cell RNA-Seq reveals pre-cDCs fate determined by transcription factor combinatorial dose. BMC molecular and cell biology. 2019.
  2. Quantifying size and diversity of the human T cell alloresponse. JCI insight. 2018.
  3. Human Tissue-Resident Memory T Cells Are Defined by Core Transcriptional and Functional Signatures in Lymphoid and Mucosal Sites. Cell reports. 2017.
  4. Lineage specification of human dendritic cells is marked by IRF8 expression in hematopoietic stem cells and multipotent progenitors. Nature immunology. 2017.
  5. Diversity and divergence of the glioma-infiltrating T-cell receptor repertoire. PNAS. 2016.
  6. Tracking donor-reactive T cells: Evidence for clonal deletion in tolerant kidney transplant patients. Science translational medicine. 2015.
  7. Spatial map of human T cell compartmentalization and maintenance over decades of life. Cell. 2014.

* * * * * *

Software

Data

Homsy et al, Science, 2015: de novo mutations from 1120 congenital heart disease cases

All supplementary tables (a zip file):

  1. S1: Phenotypes for each case proband, including cardiac, neurodevelopmental disorders and extra-cardiac congenital anomalies.
  2. S2: List of de novo Mutations in CHD case cohort.
  3. S3: List of de novo Mutations in Control cohort.
  4. S4: List of de novo probabilities for each variant class in each protein-coding gene on the Nimblegen V2 exome, adjusted for depth in Cases.
  5. S5: List of de novo probabilities for each variant class in each protein-coding gene on the Nimblegen V2 exome, adjusted for depth in Controls.
  6. S6: Functional term enrichment analysis of all Genes with Damaging (loss of function + deleterious missense) de novo mutations in all cases.
  7. S7: Functional term enrichment analysis of all Genes with Loss of Function de novo mutations in 860 new cases.
  8. S8: List of 1,563 variants (1,161 unique genes) with damaging de novo mutations from 7 independent NDD cohorts.
  9. S9: Functional term enrichment analysis among 69 genes with Damaging de novo mutations overlapping between CHD cases and the published NDD (P-NDD) cohort.
  10. S10: Percentile ranks of genes by expression in the developing mouse heart and brain.

GWAS of drug adverse reactions

GWAS data sets from the Serious Adverse Event Consortium (SAEC) related to these papers (Daly et al 2009; Shen et al 2011; Lucena et al 2011; Overby et al 2014) can be found at the SAEC Data Portal (registration required). Processed files ready for PLINK are available upon request (yshen@c2b2.columbia.edu).

Detecting CNVs from exome sequencing

The exome sequencing and genotyping data used in CANOES (Backenroth et al 2014) are from the NHLBI Pediatric Cardiac Genomics Consortium (PCGC) and are available through dbGaP.