Yufeng Shen

Principal investigator, Associate Professor
College: Peking University
PhD: Baylor College of Medicine

I’m an Associate Professor at Columbia University in the Department of Systems Biology and Department of Biomedical Informatics.

I received my B.Sc. in biochemistry and molecular biology from Peking University and my Ph.D. in computational biology from Baylor College of Medicine. At Baylor, I led the analysis of the first human genome sequenced by next-generation technologies.

I currently direct National Institutes of Health–funded research programs that develop statistical and computational methods to integrate genomic data and predict the consequences of genetic variation, and that apply genomics and computational biology to genetic studies of human diseases. My lab developed CANOES (Backenroth et al., Nucleic Acids Res., 2014), a method to identify rare copy number variants from exome sequencing data. We discovered that epigenomic patterns under normal conditions are associated with risk genes for developmental disorders (Han et al., Nat. Commun., 2018). In addition, our research led to the discovery of a number of novel risk genes for common birth defects such as congenital heart disease (Homsy et al., Science, 2015) and congenital diaphragmatic hernia (Qi et al., PLoS Genetics, 2018).

Papers
  1. Understanding Language Model Scaling on Protein Fitness Prediction Nature Computational Science. 2026
  2. Molecular dynamics simulations of intrinsically disordered protein regions enable biophysical interpretation of variant effect predictors HGG Advances. 2026
  3. Protein language models trained on biophysical dynamics inform mutation effects PNAS. 2026
  4. Antisense oligonucleotides to KIF1A polymorphisms expand targets and rescue patient-derived neurons in vitro Nature Communications. 2025
  5. MotifAE Reveals Functional Motifs from Protein Language Model: Unsupervised Discovery and Interpretability Analysis bioRxiv. 2025
  6. Disrupted endosomal trafficking of the Vangl-Celsr polarity complex underlies congenital anomalies in trachea-esophageal morphogenesis Developmental Cell. 2025
  7. LONP1 Variants Are Associated With Clinically Diverse Phenotypes Clin. Genet. 2025
  8. PreMode predicts mode of action of missense variants by deep graph representation learning of protein sequence and structural context Nature Communications. 2025
  9. Genome Sequencing is Critical for Forecasting Outcomes following Congenital Cardiac Surgery Nature Communications. 2025
  10. Distinct Clinical Phenotypes in KIF1A-Associated Neurological Disorders Result from Different Amino Acid Substitutions at the Same Residue in KIF1A Biomolecules. 2025
  11. G2VTCR: predicting antigen binding specificity by Weisfeiler-Lehman graph embedding of T cell receptor sequences bioRxiv. 2025
  12. A probabilistic graphical model for estimating selection coefficients of nonsynonymous variants from human population sequence data Nature Communications. 2025
  13. Genomic analysis of 11,555 probands identifies 60 dominant congenital heart disease genes PNAS. 2025
  14. Common variants increase risk for congenital diaphragmatic hernia within the context of de novo variants AJHG. 2024
  15. Benchmarking Machine Learning Missing Data Imputation Methods in Large-Scale Mental Health Survey Databases medRxiv. 2024
  16. Heterogeneity of comprehensive clinical phenotype and longitudinal adaptive function and correlation with computational predictions of severity of missense genotypes in KIF1A-associated neurological disorder Genetics in Medicine. 2024
  17. Return of genetic research results in 21,532 individuals with autism Genetics in Medicine. 2024
  18. Functional dissection of human cardiac enhancers and noncoding de novo variants in congenital heart disease Nature Genetics. 2024
  19. Rescuing lung development through embryonic inhibition of histone acetylation Science Translational Medicine. 2024
  20. SABRE: Self-Attention Based model for predicting T-cell Receptor Epitope Specificity bioRxiv. 2023
  21. VBASS enables integration of single cell gene expression data in Bayesian association analysis of rare variants Communications Biology. 2023
  22. Association of Predicted Damaging De Novo Variants on Ventricular Function in Individuals With Congenital Heart Disease Circulation Genomic and Precision Medicine. 2023
  23. Tissue adaptation and clonal segregation of human memory T cells in barrier sites Nature Immunology. 2023
  24. Rare predicted loss of function alleles in Bassoon (BSN) are associated with obesity npj Genomic Medicine. 2023
  25. SHINE: Protein Language Model based Pathogenicity Prediction for Inframe Insertion and Deletion Variants Briefings in Bioinformatics. 2023
  26. Predicting functional effect of missense variants using graph attention neural networks Nature Machine Intelligence. 2022
  27. Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning Nucleic acids research. 2022
  28. AlphaCluster: Coevolutionary driven residue-residue interaction models enable quantifiable clustering analysis of de novo variants to enhance predictions of pathogenicity Research Square. 2022
  29. Integrating de novo and inherited variants in over 42,607 autism cases identifies mutations in new moderate risk genes Nature Genetics. 2022
  30. Statistical models of the genetic etiology of congenital heart disease Current Opinion in Genetics & Development. 2022
  31. Newborn screening for neurodevelopmental diseases: Are we there yet? Am J Med Genet. 2022
  32. Identification and validation of novel candidate risk genes in endocytic vesicular trafficking associated with esophageal atresia and tracheoesophageal fistulas HGG Advances. 2022
  33. The genetic architecture of pediatric cardiomyopathy AJHG. 2022
  34. Rare and de novo variants in 827 congenital diaphragmatic hernia probands implicate LONP1 as candidate risk gene AJHG. 2021
  35. Imputing cognitive impairment in SPARK, a large autism cohort Autism Research. 2022
  36. Gene expression atlas of energy balance brain regions JCI Insight. 2021
  37. Predicting localized affinity of RNA binding proteins to transcripts with convolutional neural networks bioRxiv. 2021
  38. Rare variant analysis of 4,241 pulmonary arterial hypertension cases from an international consortium implicate FBLN2, PDGFD and rare de novo variants in PAH Genome Medicine. 2021
  39. Penetrance of breast cancer genes from the eMERGE III Network JNCI Cancer Spectrum. 2021
  40. Genotype and defects in microtubule-based motility correlate with clinical severity in KIF1A-associated neurological disorder HGG Advances. 2021
  41. Human plasmacytoid dendritic cells mount a distinct antiviral response to virus-infected cells Science Immunology. 2021
  42. Association of Damaging Variants in Genes With Increased Cancer Risk Among Patients With Congenital Heart Disease JAMA Cardiology. 2021
  43. Lymphohematopoietic graft-versus-host responses promote mixed chimerism in patients receiving intestinal transplantation JCI. 2021
  44. Functional interrogation of DNA damage response variants with base editing screens Cell. 2021
  45. MVP predicts pathogenicity of missense variants by deep learning Nature Communications. 2021
  46. Template-based prediction of protein structure with deep learning BMC Genomics. 2020
  47. Genomic analyses implicate noncoding de novo variants in congenital heart disease Nature Genetics. 2020
  48. Likely damaging de novo variants in congenital diaphragmatic hernia patients are associated with worse clinical outcomes Genetics in Medicine. 2020
  49. Novel Candidate Genes in Esophageal Atresia/Tracheoesophageal Fistula Identified by Exome Sequencing EJHG. 2020
  50. Dissecting Autism Genetic Risk Using Single-cell RNA-seq Data bioRxiv. 2020
  51. EM-mosaic detects mosaic point mutations that contribute to congenital heart disease. Genome medicine. 2020
  52. Whole Genome De Novo Variant Identification with FreeBayes and Neural Network Approaches bioRxiv. 2020
  53. Novel Mutations and Decreased Expression of the Epigenetic Regulator TET2 in Pulmonary Arterial Hypertension. Circulation. 2020
  54. Tissue Determinants of Human NK Cell Development, Function, and Residence. Cell. 2020
  55. Novel risk genes and mechanisms implicated by exome sequencing of 2572 individuals with pulmonary arterial hypertension. Genome medicine. 2019
  56. De novo and recessive forms of congenital heart disease have distinct genetic and phenotypic landscapes. Nature communications. 2019
  57. Deletion of donor-reactive T cell clones after human liver transplant. American journal of transplantation. 2019
  58. Pathway analysis of genomic pathology tests for prognostic cancer subtyping. Journal of biomedical informatics. 2019
  59. Exome sequencing of 457 autism families recruited online provides evidence for autism risk genes. NPJ genomic medicine. 2019
  60. Single cell RNA-Seq reveals pre-cDCs fate determined by transcription factor combinatorial dose. BMC molecular and cell biology. 2019
  61. Shared Genetic Risk Factors Across Carbamazepine-Induced Hypersensitivity Reactions. Clinical pharmacology and therapeutics. 2019
  62. Crossreactive public TCR sequences undergo positive selection in the human thymic repertoire. The Journal of clinical investigation. 2019
  63. Recessive Rare Variants in Deoxyhypusine Synthase, an Enzyme Involved in the Synthesis of Hypusine, Are Associated with a Neurodevelopmental Disorder. American journal of human genetics. 2019
  64. Drug-Induced Liver Injury due to Flucloxacillin: Relevance of Multiple Human Leukocyte Antigen Alleles. Clinical pharmacology and therapeutics. 2019
  65. De novo variants in congenital diaphragmatic hernia identify MYRF as a new syndrome and reveal genetic overlaps with other developmental disorders. PLoS genetics. 2018
  66. Human Intestinal Allografts Contain Functional Hematopoietic Stem and Progenitor Cells that Are Maintained by a Circulating Pool. Cell stem cell. 2018
  67. Early expansion of donor-specific Tregs in tolerant kidney transplant recipients. JCI insight. 2018
  68. A pan-cancer analysis of driver gene mutations, DNA methylation and gene expressions reveals that chromatin remodeling is a major mechanism inducing global changes in cancer epigenomes. BMC medical genomics. 2018
  69. Loss-of-Function ABCC8 Mutations in Pulmonary Arterial Hypertension. Circulation. Genomic and precision medicine. 2018
  70. Human Lymph Nodes Maintain TCF-1(hi) Memory T Cells with High Functional Potential and Clonal Diversity throughout Life. Journal of immunology. 2018
  71. Quantifying size and diversity of the human T cell alloresponse. JCI insight. 2018
  72. Rare variants in SOX17 are associated with pulmonary arterial hypertension with congenital heart disease. Genome medicine. 2018
  73. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nature communications. 2018
  74. Exome Sequencing in Children With Pulmonary Arterial Hypertension Demonstrates Differences Compared With Adults. Circulation. Genomic and precision medicine. 2018
  75. Robust identification of mosaic variants in congenital heart disease. Human genetics. 2018
  76. SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. Neuron. 2018
  77. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nature genetics. 2017
  78. Human Tissue-Resident Memory T Cells Are Defined by Core Transcriptional and Functional Signatures in Lymphoid and Mucosal Sites. Cell reports. 2017
  79. Congenital diaphragmatic hernias: from genes to mechanisms to therapies. Disease models & mechanisms. 2017
  80. Contrasting Determinants of Mutation Rates in Germline and Soma. Genetics. 2017
  81. Lineage specification of human dendritic cells is marked by IRF8 expression in hematopoietic stem cells and multipotent progenitors. Nature immunology. 2017
  82. Longterm maintenance of human naive T cells through in situ homeostasis in lymphoid tissue sites. Science immunology. 2017
  83. Genome-wide enrichment of damaging de novo variants in patients with isolated and complex congenital diaphragmatic hernia. Human genetics. 2017
  84. Bidirectional intragraft alloreactivity drives the repopulation of human intestinal allografts and correlates with clinical outcome. Science immunology. 2017
  85. Association of Liver Injury From Specific Drugs, or Groups of Drugs, With Polymorphisms in HLA and Other Genes in a Genome-Wide Association Study. Gastroenterology. 2017
  86. A Cell Type-Specific Expression Signature Predicts Haploinsufficient Autism-Susceptibility Genes. Human mutation. 2016
  87. Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nature communications. 2016
  88. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nature communications. 2016
  89. Variants in HNRNPH2 on the X Chromosome Are Associated with a Neurodevelopmental Disorder in Females. American journal of human genetics. 2016
  90. De novo missense variants in HECW2 are associated with neurodevelopmental delay and hypotonia. Journal of medical genetics. 2016
  91. Deep Genetic Connection Between Cancer and Developmental Disorders. Human mutation. 2016
  92. Long-read sequencing and de novo assembly of a Chinese genome. Nature communications. 2016
  93. Diversity and divergence of the glioma-infiltrating T-cell receptor repertoire. PNAS. 2016
  94. A recurrent de novo CTBP1 mutation is associated with developmental delay, hypotonia, ataxia, and tooth enamel defects. Neurogenetics. 2016
  95. HLA-DRB1*16:01-DQB1*05:02 is a novel genetic risk factor for flupirtine-induced liver injury. Pharmacogenetics and genomics. 2016
  96. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science. 2015
  97. A New Window into the Human Alloresponse. Transplantation. 2016
  98. SeqMule: automated pipeline for analysis of human exome/genome sequencing data. Scientific reports. 2015
  99. RNA sequencing from human neutrophils reveals distinct transcriptional differences associated with chronic inflammatory states. BMC medical genomics. 2015
  100. Mutations in ARID2 are associated with intellectual disabilities. Neurogenetics. 2015
  101. Next-Generation Sequencing of Pulmonary Sarcomatoid Carcinoma Reveals High Frequency of Actionable MET Gene Mutations. Journal of clinical oncology. 2015
  102. The support of human genetic evidence for approved drug indications. Nature genetics. 2015
  103. ABC transporters and the proteasome complex are implicated in susceptibility to Stevens-Johnson syndrome and toxic epidermal necrolysis across multiple drugs. PloS one. 2015
  104. Genomic signatures of cooperation and conflict in the social amoeba. Current biology. 2015
  105. Increased burden of de novo predicted deleterious variants in complex congenital diaphragmatic hernia. Human molecular genetics. 2015
  106. Tracking donor-reactive T cells: Evidence for clonal deletion in tolerant kidney transplant patients. Science translational medicine. 2015
  107. Comparable frequencies of coding mutations and loss of imprinting in human pluripotent cells derived by nuclear transfer and defined factors. Cell stem cell. 2014
  108. Spatial map of human T cell compartmentalization and maintenance over decades of life. Cell. 2014
  109. Coding mutations in SORL1 and Alzheimer disease. Annals of neurology. 2014
  110. Increased frequency of de novo copy number variants in congenital heart disease by integrative analysis of single nucleotide polymorphism array and exome sequence data. Circulation research. 2014
  111. Estimating heritability of drug-induced liver injury from common variants and implications for future study designs. Scientific reports. 2014
  112. CANOES: detecting rare copy number variants from whole exome sequencing data. Nucleic acids research. 2014
  113. Whole exome sequencing identifies de novo mutations in GATA6 associated with congenital diaphragmatic hernia. Journal of medical genetics. 2014
  114. Novel association of early onset hepatocellular carcinoma with transaldolase deficiency. JIMD reports. 2013
  115. Defining a comprehensive verotype using electronic health records for personalized medicine. Journal of the American Medical Informatics Association. 2013
  116. A collaborative approach to developing an electronic health record phenotyping algorithm for drug-induced liver injury. Journal of the American Medical Informatics Association. 2013
  117. Whole-exome sequencing identifies novel LEPR mutations in individuals with severe early onset obesity. Obesity. 2013
  118. Position effect on FGF13 associated with X-linked congenital generalized hypertrichosis. PNAS. 2013
  119. Variants in GATA4 are a rare cause of familial and sporadic congenital diaphragmatic hernia. Human genetics. 2012
  120. Limited contribution of common genetic variants to risk for liver injury due to a variety of drugs. Pharmacogenetics and genomics. 2012
  121. Genomewide pharmacogenetics of bisphosphonate-induced osteonecrosis of the jaw: the role of RBMS3. The oncologist. 2012
  122. A hidden Markov model for copy number variant prediction from whole genome resequencing data. BMC bioinformatics. 2011
  123. Coverage tradeoffs and power estimation in the design of whole-genome sequencing experiments for detecting association. Bioinformatics. 2011
  124. Susceptibility to amoxicillin-clavulanate-induced liver injury is influenced by multiple HLA class I and II alleles. Gastroenterology. 2011
  125. Genome-wide association study of serious blistering skin rash caused by drugs. The pharmacogenomics journal. 2011
  126. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome research. 2009
  127. HLA-B*5701 genotype is a major determinant of drug-induced liver injury due to flucloxacillin. Nature genetics. 2009
  128. Bos taurus genome assembly. BMC genomics. 2009
  129. Comparing platforms for C. elegans mutant identification using high-throughput whole-genome sequencing. PloS one. 2008
  130. A high-throughput percentage-of-binding strategy to measure binding energies in DNA-protein interactions: application to genome-scale site discovery. Nucleic acids research. 2008
  131. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008
  132. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007
  133. Shedding genomic light on Aristotle's lantern. Developmental biology. 2006
  134. Conformational pathways in the gating of Escherichia coli mechanosensitive channel. PNAS. 2002
  135. Intrinsic flexibility and gating mechanism of the potassium channel KcsA. PNAS. 2002