Assistant Professor
Department of Biostatistics
Mailman School of Public Health
Columbia University
722 West 168th Street, 6th Floor
New York, NY 10032
Email: qc2138@cumc.columbia.edu
Phone: 212-342-1245
Research Interests:
Survey sampling; missing data; Bayesian statistics; latent class analysis; population health; environmental health sciences.
Education:
2009, Ph.D. in Biostatistics, University of Michigan
2004, M.S. in Applied Statistics, Bowling Green State University
2002, B.A. in Economics, Nankai University
Professional Services:
Associate Editor for the Journal of the Royal Statistical Society: Series C.
Honors and Awards:
2016, Career Development Award, the NIEHS Center for Environmental Health in Northern Manhattan
2011, Best Contributed Paper Award, the Statistics and Data Analysis Section of the SAS Global Forum
2010, Calderone Research Prize for Junior Faculty, Columbia University
2010, Department of Biostatistics Teaching Award, Columbia University
2009, Edward C. Bryant Scholarship, American Statistical Association
2008, Student Paper Competition Award, Social Statistics Section/Section on Government Statistics/Survey Research Methods Section, Joint Statistical Meetings
2007, Otto Hutzinger Award, 27th International Symposium on Halogenated Persistent Organic Pollutants, Tokyo, Japan
Papers
^{‡} denotes students or mentees.
Statistical Methods Papers
Research
Bayesian model-based survey inference
My research focuses on the development of statistical methods for the analysis of complex survey data. I am interested in developing Bayesian model-based methods that include design variables as covariates in the regression models for survey outcomes. The example below shows the estimation of the finite population cumulative distribution function (CDF) and associated quantiles under probability-proportional-to-size sampling.
A Bayesian probit penalized spline regression was used to model a smooth relationship between the CDF and the probability of selection. (a) The posterior mean and 95% credible interval of the population CDF estimate were obtained for each of 20 selected sample units. (b) A smoothed CDF was estimated by smoothing the CDF estimates in (a) using monotonic smooth cubic regression. (c) The population median was obtained by inverting the estimated smoothed CDF. See the paper [PDF].
Analysis of data with missing values
My research on missing data is broad and has been motivated by real-world problems emerging from my collaborative research. The example below shows an application of my MI-lasso method, a variable selection method for multiply-imputed data, to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that were associated with human serum dioxin concentrations in Midland, Michigan.
The MI-lasso treated the regression coefficients of the same variable across all imputed datasets as a group and applied the group lasso penalty to yield a consistent variable selection across all imputed datasets. The graphic shows the profiles of the MI-lasso coefficients and the BIC value as the shrinkage factor changes. See the paper [PDF].
Latent class analysis
I have a great interest in developing new statistical methods and in the novel application of statistical methods to population health research, especially in environmental health sciences. The example below shows the use of latent class growth analysis (LCGA) to define phenotypes of wheeze using repeated questionnaires in the Columbia Center for Children's Environmental Health birth cohort study.
The LCGA identified four wheeze phenotypes: never/infrequent (47.1%), early-transient (37.5%), early-persistent (7.6%), and late-onset (7.8%). See the paper [PDF].
National and local community-based health surveys
I have also been actively involved in the design and analysis of several national and local community-based health surveys, including the University of Michigan Dioxin Exposure Study, the National Drug Abuse Treatment System Survey, and the Ohio National Guard Study.
Software
MI-lasso for multiply-imputed data
Available as the %MI_lasso SAS macro and MI.lasso R function.
An implementation of the MI-lasso variable selection method that extends the lasso method to multiply-imputed data. The MI-lasso treats the regression coefficients of the same covariate across all imputed datasets as a group and applies the group lasso penalty to yield a consistent variable selection across all imputed datasets.
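As a rough sketch of the grouping idea described above, the penalized least-squares criterion can be written as follows. The notation is an assumption made for illustration (D imputed datasets, n subjects, p covariates, tuning parameter \lambda); the referenced paper gives the exact formulation.

```latex
\min_{\beta}\;
\sum_{d=1}^{D} \sum_{i=1}^{n}
\Bigl( y_{di} - \beta_{0d} - \sum_{j=1}^{p} x_{dij}\,\beta_{dj} \Bigr)^{2}
\;+\;
\lambda \sum_{j=1}^{p} \sqrt{\sum_{d=1}^{D} \beta_{dj}^{2}}
```

Because the penalty acts on the Euclidean norm of the D coefficients belonging to covariate j, those coefficients are shrunk to zero together, so a covariate is either selected in every imputed dataset or in none.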
Reference:
Chen, Q. and Wang, S. (2013). Variable selection for multiply-imputed data with application to dioxin exposure study,
Statistics in Medicine, 32, 3646-59.
[PDF]
Logistic partially linear models with missing covariates
Available in this supplement.
We propose a new kernel-assisted estimating equation method for logistic partially linear models with missing covariates. We replace the conditional expectation in the doubly robust estimating function with an unbiased estimating function constructed using the conditional mean of the outcome given the observed data, and impute the missing covariates using so-called link-preserving imputation models to simplify the estimation.
Reference:
Chen, Q., Paik, M. C., Kim, M., and Wang, C. (2016). Using link-preserving imputation for logistic partially linear models with missing covariates,
Computational Statistics and Data Analysis,
101, 174-185.
[PDF]
Studying missing data patterns
Available as the %missingPattern SAS macro.
The macro is designed to look at missing data in four ways: the proportion of units for each pattern of missing data, the number and percentage of missing data for each individual variable, the concordance of missingness in any pair of variables, and possible unit nonresponse. The user can customize these analyses by specifying which variables to include or exclude, and which output should be produced.
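The four summaries listed above can be illustrated with a short Python sketch. This is not the %missingPattern macro and does not reproduce its options or output; the function name `missing_summary`, the variable names, and the toy records are hypothetical. Two definitions are assumptions made here: concordance is taken as the share of units in which two variables are either both missing or both observed, and possible unit nonresponse is taken as a unit with every listed variable missing.

```python
from collections import Counter
from itertools import combinations


def missing_summary(rows, variables):
    """Summarize missingness; rows is a list of dicts and None marks a missing value."""
    n = len(rows)
    # Missingness indicators per unit: 1 = missing, 0 = observed
    flags = [tuple(int(r.get(v) is None) for v in variables) for r in rows]

    # (1) Proportion of units showing each missing-data pattern
    patterns = {p: c / n for p, c in Counter(flags).items()}

    # (2) Number and percentage of missing values for each variable
    per_var = {}
    for j, v in enumerate(variables):
        k = sum(f[j] for f in flags)
        per_var[v] = (k, 100.0 * k / n)

    # (3) Concordance of missingness for each pair of variables (assumed definition)
    concordance = {}
    for (a, va), (b, vb) in combinations(enumerate(variables), 2):
        concordance[(va, vb)] = sum(f[a] == f[b] for f in flags) / n

    # (4) Possible unit nonresponse: every listed variable is missing (assumed definition)
    unit_nonresponse = [i for i, f in enumerate(flags) if all(f)]

    return patterns, per_var, concordance, unit_nonresponse


# Hypothetical toy data with three variables
rows = [
    {"x1": 30, "x2": 22.0, "x3": 1.5},
    {"x1": None, "x2": None, "x3": 2.0},
    {"x1": 41, "x2": None, "x3": None},
    {"x1": None, "x2": None, "x3": None},
]
patterns, per_var, concordance, unit_nonresponse = missing_summary(rows, ["x1", "x2", "x3"])
```

Restricting the `variables` argument plays the role of choosing which variables to include in the analysis.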
Reference:
Schwartz, T.^{‡}, Chen, Q., and Duan, N. (2011).
Studying missing data patterns using a SAS macro,
SAS Global Forum 2011 proceedings. [PDF] (Best Contributed Paper Award, Statistics and Data Analysis Section, 2011 SAS Global Forum) ^{‡} Chen's student.
Backward selection for survey linear regression
Available as the %backward SAS macro.
A macro that performs backward selection for survey linear regression using PROC SURVEYREG.
Reference:
Chen, Q. and Gillespie, B. (2006). A SAS macro for performing backward selection in PROC SURVEYREG,
SAS Conference Proceedings: Midwest SAS Users Group. (Best Statistical Paper Award, 17th Annual Conference of the Midwest SAS Users Group)
Survey regression for multiply-imputed data
Available as the %MI_SREG SAS macro for survey weighted linear regression and the %MI_SLOGIT SAS macro for survey weighted logistic regression.
These macros use Rubin's rules to combine the regression coefficient estimates from survey linear and logistic regression models (fitted with PROC SURVEYREG and PROC SURVEYLOGISTIC) across multiply-imputed datasets.
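For readers unfamiliar with Rubin's rules, the combining step for a single coefficient can be sketched in Python as below. This is an illustration only, not the SAS macros: the function name `rubin_combine` is hypothetical, and it assumes the per-imputation point estimates and their design-based variances have already been obtained from the survey regression fits.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine one coefficient across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b        # total variance

    # Rubin's degrees of freedom; infinite when the estimates do not vary
    if b > 0:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf
    return q_bar, math.sqrt(t), df


# Hypothetical estimates from three imputed datasets
pooled, se, df = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error reflects both the average sampling variance within each imputed dataset and the extra uncertainty from the disagreement between imputations.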
Forward stepwise selection for multiply-imputed data
Available as the %MI_SREG_STEPWISE SAS macro for survey weighted linear regression and %MI_SLOGIT_STEPWISE SAS macro for survey weighted logistic regression.
These macros implement the stepwise selection method for multiply-imputed data, using Rubin's rules for survey linear and logistic regression models. To use these two macros, the user also needs to download the %MI_SREG and %MI_SLOGIT macros.
Reference:
Chen, Q. and Wang, S. (2013). Variable selection for multiply-imputed data with application to dioxin exposure study,
Statistics in Medicine, 32, 3646-59.
[PDF]
