abstract:
We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences (“motifs”) in the gene’s regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment (“parents”). Thus our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurement to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on heldout experiments. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks.
abstract:
Much of recent systems biology has been motivated by discussions of modularity in biological networks. In order to provide a parameter-free algorithm for quantifying ``modularity," we present an information-theoretic definition applicable to biological networks. This definition draws on methods developed by Tishby, Pereira, and Bialek in the context of clustering. In the case of regulatory networks inferred from gene expression data, this allows a measure of the extent to which these networks may be described in terms of transcriptional modules, a type of dimensionality reduction useful both for suggesting novel experiments and for simplifying the number of parameters which must be inferred. We apply these techniques to protein-protein interaction data, genetic regulatory networks, and neuronal networks.
abstract:
The physicist's desire to analyze in terms of local structures, breaking systems into fundamental parts, is uniquely thwarted by idealized networks: they are composed of identical nodes, differentiated only by the combinatorial explosion of possible connections. The ideal reduced degrees of freedom -- the correct "local substructures" -- are not at all obvious; we strive here to develop a systematic and principled algorithm for their discovery. Functional genomics and the development of modular biology motivate a systematic, statistical approach to identifying the most important features, functionals of the adjacency matrix representation of the graph. The features are global, involving all nodes in each feature; although they can be related to subgraph enumeration, the analysis does not require hypothetical "most important" subgraphs. The resulting algorithm provides an automated tool for graph drawing and decomposition, and suggests novel machine-learning techniques for network classification.
abstract:
Traditional models of wormlike chains in shear flows at finite
temperature approximate the equation of motion via finite
differences and approximate its solution via discretization (bead
and rod models). We introduce here a new method, including
stochastic forcing and driving by shear, based on a spectral
representation in terms of the natural eigenfunctions.
This formulation separates tumbling and bending dynamics, clearly
showing their interrelation, naturally orders the bending dynamics
according to the characteristic decay rate of its modes, and
displays coupling among bending modes in a general flow.
This hierarchy naturally yields a low dimensional
stochastic dynamical system which recovers and extends previous
numerical results and which leads to a fast and efficient numerical
method for studying the stochastic nonlinear dynamics of
semiflexible polymers in general flows. It is anticipated that this
formulation will be of utility in other problems involving spatially
extended systems subject to stochastic dynamics in the presence of
constraints.
abstract:
We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules -- sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. Starting with transcription factor binding site information -- data that can now be readily obtained, for example, for S. cerevisiae -- and a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course datasets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated to known cell cycle transcription factors.
abstract:
Recent genomic and bioinformatic advances have motivated the development
of numerous
random network models purporting to describe graphs of biological,
technological,
and sociological origin. The success of a model has been evaluated by
how well it reproduces a few key features of the real-world data,
such as degree distributions, mean geodesic lengths, and clustering
coefficients.
Often pairs of models can reproduce these features with indistinguishable
fidelity despite being generated by vastly different mechanisms.
In such cases, these few target features are insufficient to distinguish
which of the different models best describes real world networks of interest;
moreover, it is not
clear a priori that any of the presently-existing algorithms for network generation
offers a predictive description of networks inspiring them.
To this end, we construct a
mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional)
data
``word space.''
This map defines
an input space for
classification schemes which allow us for the first time to state unambiguously which models are
most descriptive of the networks of biological, sociological, and
technological interest they purport to describe.
Our training sets include networks generated from \nummod
models drawn from the literature, source code for which is freely available.
We anticipate that this new approach to network analysis will be of broad impact to a
number of communities.