# Foundations of Data Science Seminar

The Foundations of Data Science seminar series is organized through the Data Science Institute at Columbia University, with support from the Center for Foundations of Data Science and the Columbia University Department of Statistics.

In the Fall 2015 Semester, the seminar will occur on Friday afternoons. The Fall '15 seminar is organized by Daniel Hsu and John Wright.

## Fall 2015 Schedule

 9/25/15 1-2 PM 750 CEPSR (Schapiro) Sujay Sanghavi University of Texas, Austin Title: Convex Shmonvex: Dropping Convexity for Faster Matrix Estimation Abstract: Fitting a low-rank matrix to data is a fundamental and widely used primitive in machine learning; this often takes the form of minimizing a loss function over the set of low-rank matrices. For most problems beyond the very basic PCA, theoretically sound methods have overwhelmingly combined statistical models of the data with convex optimization. As the size and dimensionality of data increases, this approach is overly computationally wasteful, not least because it represents an $$O(nr)$$ dimensional object with $$O(n^2)$$ parameters. In this talk we present several of our recent results in understanding and designing much faster non-convex algorithms, and characterizing their statistical performance. 10/2/15 1-2 PM 413 Kent Hall Percy Liang Stanford University Title: Learning Hidden Computational Processes Abstract: We are interested in prediction problems in which evaluating the learned function requires multiple intermediate steps of computation. One motivation is building a system that can answer complex questions: here the function would need to map "How many countries have held the Summer Olympics more than once?" to "3" by applying a sequence of aggregation and filtering operations on a database. In this talk, we examine two key machine learning problems that arise in this setting. First, how do we model the computational process? We argue that the classic paradigm of decoupling modeling from inference is inadequate, and we propose techniques that directly model the inference procedure. Second, learning is very difficult: in our example, the supervision "3" constrains the hidden computational process in a very indirect way. We propose methods that relax the output supervision in a parameterized way, and learn both the relaxation and model parameters jointly subject to an explicit computational constraint. Finally, we show some empirical progress on a new challenging question answering task. 10/16/15 1-2 PM 413 Kent Hall Kyle Luh Yale University Title: Random Matrices: $$\ell^1$$ Concentration and Dictionary Learning with Few Samples. Abstract: Let $$X$$ be a sparse random matrix of size $$n \times p$$ $$(p \gg n)$$. We prove that if $$p \ge C n \log^4 n$$, then with probability $$1 - o(1)$$, $$||X^Tv||_1$$ is close to its expectation for all vectors $$v \in \mathbb{R}^n$$ (simultaneously). The bound on $$p$$ is sharp up to the polylogarithmic factor. The study of this problem is directly motivated by an algorithmic application. Let $$A$$ be an $$n \times n$$ matrix, $$X$$ be an $$n \times p$$ matrix and $$Y = AX$$. A challenging and important problem in data analysis, motivated by dictionary leraning and other practical problems, is to recover both $$A$$ and $$X$$, given $$Y$$. Under normal circumstances, it is clear that this problem is underdetermined. However, in the case when $$X$$ is sparse and random, Spielman, Wang and Wright showed that one can recover both $$A$$ and $$X$$ efficiently from $$Y$$ with high probability, given that $$p$$ (the number of samples) is sufficiently large. Their method works for $$p \ge C n^2 \log^2 n$$, and they conjectured that $$p \ge C n \log n$$ suffices. The bound $$n \log n$$ is sharp for an obvious information theoretical reason. The matrix concentration result verifies the Spielman et. al. conjecture up to a $$\log^3 n$$ factor. Our proof of the concentration result is based on two useful observations. The first is an economical way to apply the union bound. The second is a refined version of Bernstein's concentration inequality for sum of independent variables. Both have nothing to do with random matrices and are applicable in general settings This is joint work with Van Vu. 10/23/15 1-2 PM 413 Kent Hall David Dunson Duke University Title: Probabilistic modeling of big tables and sequences Abstract: High-dimensional discrete data are collected in many application areas, but have seen limited consideration in the literature. We focus in particular on two related problems: (i) high-dimensional categorical data that can be organized as a many way contingency table; and (ii) sequential categorical data having a complex dependence structure. The first problem arises in numerous applications ranging from survey research in social sciences and epidemiology to genomics and marketing. The second is similarly common in engineering, signal processing, and (our motivation) studies of the neurological and genetic basis of communication. We propose a general approach for these problems based on new classes of probabilistic tensor factorizations, which are low rank, sparse and have various optimality properties. As illustration, we consider inference on associations in many way contingency tables, classification from high-dimensional features, and estimation of higher order Markov chain models. In each case, substantial advantages are illustrated over competitors in various applications. 11/6/15 1-2 PM 413 Kent Hall Jun Zhu Tshingua University, Carnegie Mellon University Title: Fast and Discriminative Training of Probabilistic Generative Models Abstract: Probabilistic generative models are flexible tools on revealing the latent structures underlying complex data and performing “top-down” inference to generate observations. However, the “bottom-up” recognition ability is often not fully explored. An MLE estimate often leads to inferior prediction accuracy than discriminative methods. In this talk, I will introduce max-margin learning for probabilistic generative models, under a general framework of regularized Bayesian inference (RegBayes), where a posterior regularization term is defined to encourage a large-margin separation between true categories and alternatives, without sacrificing the generative capability. The basic idea is illustrated by examples on learning topic models and deep generative models. I will also talk about an online learning version of RegBayes for dealing with streaming data and large datasets.