Statistics 4240: Data Mining

Summer 2013


This is a master's course in data mining/machine learning.

Course goals: Data mining is the computational process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, numerical linear algebra, and optimization. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
The aim of the course is for students to master the concepts of numerical linear algebra and optimization needed to understand supervised and unsupervised learning approaches, such as linear models for regression/classification, mixture models and EM, continuous latent variables, clustering, and methods for combining models.
Programming aspects of data mining, however, will not be covered in the lectures.
NOTE: The course covers a full semester's worth of material in six weeks. The pace is therefore fast, and not all students will be able to keep up. Furthermore, the material is cumulative: almost every lecture builds on previously discussed concepts, so students who fall behind will find themselves in a very uncomfortable position.

Time: MTWR 6:15pm-7:50pm, May 28 - July 5 (Friday, May 31 is the makeup date for the Memorial Day holiday, and Friday, July 5 is the makeup date for the Independence Day holiday)
Place: 313 Fayerweather
Professor: Kamiar Rahnama Rad; Office: 1255 Amsterdam Ave, Rm 901. Email: stat.s4240 at gmail dot com
Office Hours: Thursdays 4-5pm (if you need more time, request an appointment by email).
Textbook: Pattern Recognition and Machine Learning, 2006, by Bishop, Christopher M. (ISBN 9780387310732).
Teaching/Homework assistant: Susanna Makela; office: Rm 1020, 1255 Amsterdam Ave; email: smm2253 at columbia dot edu; office hours: Wednesdays 1-3pm.
Prerequisites: A good working knowledge of probability, multivariate calculus, and linear algebra, as well as proficiency in a programming language such as R or Matlab, is necessary.

Evaluation (subject to change): There will be a problem set due each Monday, with the exception of the midterm date (June 17). The midterm examination will be an in-class exam covering the material in chapters 1-4 of the text. There will be a quiz on June 6. Exam problems will be similar to those given in the problem sets and worked out in the lectures. For the midterm and final, you will be permitted to use one handwritten page of notes (front and back). The final grade is a weighted average of the four homeworks (32%), the quiz (8%), the midterm (25%), and the final (35%).
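
For concreteness, the weighting works out as in the short R sketch below. Only the weights come from this syllabus; the component scores are hypothetical and assumed to lie on a common 0-100 scale.

    # Illustrative grade computation. Scores are hypothetical and assumed
    # to be on a 0-100 scale; only the weights come from the syllabus.
    hw      <- c(88, 92, 75, 81)  # four homework sets, 32% total
    quiz    <- 84                 # June 6 quiz, 8%
    midterm <- 79                 # in-class midterm, 25%
    final   <- 90                 # final exam, 35%

    grade <- 0.32 * mean(hw) + 0.08 * quiz + 0.25 * midterm + 0.35 * final
    round(grade, 1)  # weighted average on the same 0-100 scale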

Old homeworks will be deposited in Room 904 of the Statistics Department building.

Final Exam: July 5, 6:15pm-7:50pm.
No makeup midterm or final will be given.
Homework must be uploaded by midnight of the due date. No late homework will be accepted.
Students are encouraged to work together on the homework assignments but should write up solutions on their own. Of course, all work on the exams absolutely must be each student's alone.
Solutions to the homework assignments will be posted on Courseworks each week.


The outline and summary of topics to be covered are subject to change.


Date | Topic | Notes
Tuesday, May 28 | polynomial curve fitting, probability theory, model selection, curse of dimensionality | read 1.1-1.4
Wednesday, May 29 | decision theory, information theory, properties of matrices, the Gaussian distribution | read 1.5, 1.6, 2.3.1-2.3.4, Appendix C
Thursday, May 30 | mixtures of Gaussians, nonparametric methods | read 2.3.9, 2.5
Friday, May 31 | linear basis function models, the bias-variance decomposition | read 3.1, 3.2
Monday, June 3 | Bayesian linear regression, Bayesian model comparison | read 3.3, 3.4; homework 1 due
Tuesday, June 4 | evidence approximation, empirical Bayes | read 3.5, 3.6
Wednesday, June 5 | discriminant functions | read 4.1
Thursday, June 6 | quiz (chapters 1-2); probabilistic generative models | read 4.2
Monday, June 10 | probabilistic discriminative models | read 4.3
Tuesday, June 11 | Laplace approximation, Bayesian logistic regression | read 4.4, 4.5
Wednesday, June 12 | kernel methods | read 6.1-6.3
Thursday, June 13 | Gaussian processes | read 6.4.1-6.4.6; homework 2 due
Monday, June 17 | midterm (chapters 1-4)
Tuesday, June 18 | maximum margin classifiers | read 7.1
Wednesday, June 19 | maximum margin classifiers (continued) | read 7.1
Thursday, June 20 | relevance vector machines | read 7.2
Monday, June 24 | mixture models and EM | read 9.1, 9.2; homework 3 due
Tuesday, June 25 | mixture models and EM (continued) | read 9.3, 9.4
Wednesday, June 26 | continuous latent variables, PCA | read 12.1, 12.2
Thursday, June 27 | probabilistic PCA | read 12.2
Monday, July 1 | combining models, Bayesian model averaging, committees, boosting | read 14.1, 14.2, 14.3; homework 4 due
Tuesday, July 2 | tree-based models, conditional mixture models | read 14.4, 14.5
Wednesday, July 3 | TBA
Friday, July 5 | final exam