CW: Codes

`codes`

`mac os x hacks:`

click here

`research codes:`

Length restrictions force authors to be more and more telegrammatic about details of their work, leading to more and more confusion among researchers about what was actually done. Hoping to counter this and promote transparency in research, I am making source codes used in my papers freely available. I can not promise I will respond to all questions, but please do email me with any questions about the codes. Subject line should include "[Codes]".

vbmod, cf. A Bayesian Approach to Network Modularity, Phys. Rev. Lett. 100, 258701 (2008)
infomod, cf. An Information-Theoretic Approach to Network Modularity , PRE 2005
cellmap, cf. Phys. Rev. Lett. 97 038102 , 2006., among others
specmark, cf. A stochastic spectral analysis of transcriptional regulatory cascades Walczak et al PNAS 2009
Sourceforge open-source MATLAB code for MatStat, NetBoost, and InfoMod
Supplementary material associated with ``Predicting genetic regulatory response using classification" with M. Middendorf, A. Kundaje, Y. Freund, and C. Leslie, Proceedings of the Twelfth International Conference on Intelligent Systems for Molecular Biology (ISMB 2004), 2004, To appear.
abstract:
We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences (“motifs”) in the gene’s regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment (“parents”). Thus our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurement to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on heldout experiments. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks.
Code associated with ``Information-theoretic measures of biological network modularity" with Manuel Middendorf and Etay Ziv
abstract:
Much of recent systems biology has been motivated by discussions of modularity in biological networks. In order to provide a parameter-free algorithm for quantifying ``modularity," we present an information-theoretic definition applicable to biological networks. This definition draws on methods developed by Tishby, Pereira, and Bialek in the context of clustering. In the case of regulatory networks inferred from gene expression data, this allows a measure of the extent to which these networks may be described in terms of transcriptional modules, a type of dimensionality reduction useful both for suggesting novel experiments and for simplifying the number of parameters which must be inferred. We apply these techniques to protein-protein interaction data, genetic regulatory networks, and neuronal networks.
Code associated with ``Novel systematic discovery of statistically significant network features'' (cond-mat/0306610) (with Etay Ziv and Robin Koytcheff)
abstract:
The physicist's desire to analyze in terms of local structures, breaking systems into fundamental parts, is uniquely thwarted by idealized networks: they are composed of identical nodes, differentiated only by the combinatorial explosion of possible connections. The ideal reduced degrees of freedom -- the correct "local substructures" -- are not at all obvious; we strive here to develop a systematic and principled algorithm for their discovery. Functional genomics and the development of modular biology motivate a systematic, statistical approach to identifying the most important features, functionals of the adjacency matrix representation of the graph. The features are global, involving all nodes in each feature; although they can be related to subgraph enumeration, the analysis does not require hypothetical "most important" subgraphs. The resulting algorithm provides an automated tool for graph drawing and decomposition, and suggests novel machine-learning techniques for network classification.
Code associated with ``The Spectral Stochastic Dyanamics of Bending and Tumbling'' (cond-mat/0307551) (with Alberto Montesi and Matteo Pasquali)
abstract:
Traditional models of wormlike chains in shear flows at finite temperature approximate the equation of motion via finite differences and approximate its solution via discretization (bead and rod models). We introduce here a new method, including stochastic forcing and driving by shear, based on a spectral representation in terms of the natural eigenfunctions. This formulation separates tumbling and bending dynamics, clearly showing their interrelation, naturally orders the bending dynamics according to the characteristic decay rate of its modes, and displays coupling among bending modes in a general flow. This hierarchy naturally yields a low dimensional stochastic dynamical system which recovers and extends previous numerical results and which leads to a fast and efficient numerical method for studying the stochastic nonlinear dynamics of semiflexible polymers in general flows. It is anticipated that this formulation will be of utility in other problems involving spatially extended systems subject to stochastic dynamics in the presence of constraints.
Code associated with ``Combining Sequence and Time Series Expression Data to Learn Transcriptional Modules'' (submitted) (with Anshul Kundaje, Manuel Middendorf, and Christina Leslie)
abstract:
We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules -- sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. Starting with transcription factor binding site information -- data that can now be readily obtained, for example, for S. cerevisiae -- and a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course datasets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated to known cell cycle transcription factors.
Code associated with ``Classification of Biological Networks Via Walks and Words" with Manuel Middendorf, Etay Ziv, Carter Adams, Jen Hom, Robin Koytcheff, Chaya Levovitz, Gregory Woods, and Linda Chen;
abstract:
Recent genomic and bioinformatic advances have motivated the development of numerous random network models purporting to describe graphs of biological, technological, and sociological origin. The success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real world networks of interest; moreover, it is not clear a priori that any of the presently-existing algorithms for network generation offers a predictive description of networks inspiring them. To this end, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) data ``word space.'' This map defines an input space for classification schemes which allow us for the first time to state unambiguously which models are most descriptive of the networks of biological, sociological, and technological interest they purport to describe. Our training sets include networks generated from \nummod models drawn from the literature, source code for which is freely available. We anticipate that this new approach to network analysis will be of broad impact to a number of communities.