If you’ve registered for the Columbia University Mailman School of Public Health course Application of Epidemiologic Research Methods II (P9489), you’ve reached the right place. We’ll be meeting on Wednesdays from 1:00 to 3:50 PM in Room HSC 410. You can find the syllabus, course correspondence, announcements and administrative information on CourseWorks. I may make some (minor) changes to the topics in the latter half of the course after getting feedback from the class.

This is an R course for epidemiologists insomuch as I am an epidemiologist who uses R. This material is intended for students or practitioners who want to use R to apply the basic epidemiological or biostatistical methods they have learned or are currently learning. As much as possible, I emphasize the kinds of data and computational issues epidemiologists are likely to confront, and how they can be solved with R. I plan to allow you some time to play with R. To that end, I’ll spend the first hour or so of each session speaking and demonstrating code, then I’ll give you some time to work on exercises and self-study material I’ve prepared, and then I’ll end with some more lecture. There is no homework. It is, after all, summer. But I will be posting additional material on this site, so you can return and add to your repertoire.

There’s a bit of a learning curve with R, but it is very much worth the effort. You don’t need to be a computer programmer (though I imagine it helps), but you should be familiar with your operating system (Windows, OS X, or Linux), its file structure, and how to download, install and transfer files.

My overarching goal is to introduce participants to R programming skills so they can (1) use those skills to answer epidemiological questions, and (2) develop their own tools to apply epidemiological methods. By the end of the course you will, at a minimum, be able to:

  • Understand R data objects and how they are used for epidemiologic analysis
  • Enter and manipulate data in R in a way that makes epidemiologic sense
  • Use R to calculate risks and rates, analyze survival data, and calculate confidence intervals
  • Write your own simple functions to calculate rates, ratios, and measures of association
  • Plot data using R base graphics capabilities, and be aware of additional capabilities such as ggplot2
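As a rough illustration of the kind of simple function mentioned in the list above, here is a minimal sketch in R. The function names (rate_ci, risk_ratio), their arguments, and the counts are all made up for this example and are not taken from the course material. Both intervals use the common normal approximation on the log scale, which is only one of several ways to compute them.

```r
# Incidence rate with an approximate confidence interval.
# Hypothetical helper: assumes SE(log rate) = 1 / sqrt(cases).
rate_ci <- function(cases, person_time, conf_level = 0.95) {
  rate <- cases / person_time
  z <- qnorm(1 - (1 - conf_level) / 2)
  se_log <- 1 / sqrt(cases)
  c(rate = rate,
    lower = rate * exp(-z * se_log),
    upper = rate * exp(z * se_log))
}

# Risk ratio (exposed vs. unexposed) with an approximate confidence interval.
# Hypothetical helper: uses the usual large-sample SE of log(RR).
risk_ratio <- function(cases_exp, n_exp, cases_unexp, n_unexp,
                       conf_level = 0.95) {
  rr <- (cases_exp / n_exp) / (cases_unexp / n_unexp)
  z <- qnorm(1 - (1 - conf_level) / 2)
  se_log <- sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
  c(rr = rr,
    lower = rr * exp(-z * se_log),
    upper = rr * exp(z * se_log))
}

# Made-up counts, purely for illustration
rate_ci(cases = 30, person_time = 1000)
risk_ratio(cases_exp = 20, n_exp = 100, cases_unexp = 10, n_unexp = 100)
```

Each function returns a named numeric vector, so you can pull out a single element by name, for example rate_ci(30, 1000)["rate"].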

We will also be addressing some additional topics and epidemiologic applications of R, including Bayesian and spatial analyses, power calculations, and using R to scrape or download data from websites.

Links to slides, exercises and data sets are on the sidebar to the right of this page. I will also be lecturing from some of the HTML material on this site.

The primary software tool we will be using is R, a free, open-source, multi-platform (Windows, Mac, Linux), user-maintained statistical and scientific computing platform based on the S language. In addition to the thousands of user-contributed packages, the language is easily extensible through simple object-based programming and can interface with other sophisticated free, open-source statistical programs.

Download and Install R

  1. Go to the R Project home page.
  2. Click the CRAN (Comprehensive R Archive Network) link from the left-hand menu, under Download Packages.
  3. Click on a link to one of the mirror servers listed on this page. I usually choose a site that is geographically close.
  4. From the box at the top of the page titled “Download and Install R. Precompiled binary distributions....” select your operating system.
  5. On the Windows page, choose “base”. On the Mac page, choose “R-2.15.0.pkg (latest version)”. (If you’re a Linux user, I’m assuming you’re already smarter than me about the software repository that came with your distro.)
  6. Save the downloaded file (R-X.X.X-win32.exe on Windows or R-X.X.X-mini.dmg on Mac OS X) to your desktop.
  7. Run the installation program from your desktop, accepting the defaults.
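Once the installer finishes, one quick way to confirm that R works is to start it and type a few commands at the prompt. This is only a suggested sanity check, not part of the official installation steps, and the exact output will depend on your version and platform.

```r
# Print the version string of the R installation that is running
R.version.string

# A trivial calculation to confirm the interpreter responds
1 + 1

# Summary of the platform, locale, and currently attached packages
sessionInfo()
```

If all three commands return without an error, the installation is in working order.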

Download and Install RStudio

Though not necessary for the course, take a moment to download and install RStudio. It is a (relatively) recent addition to the R world that acts as a kind of wrapper, creating a neat working environment that looks and acts the same across platforms. I’ve used it and liked it, but am happy enough with my own program editor as an interface to R. I find, though, that RStudio can smooth the transition from more integrated programs like SAS and Stata to the command-line world of R.

Download and Install JAGS

In the second half of the course I’d like to introduce you to Bayesian MCMC analysis. There are a couple of very good programs available, of which I find JAGS (Just Another Gibbs Sampler) to be the easiest to implement across different platforms in R.
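JAGS is a separate program that runs outside of R, so R needs an interface package to talk to it. The sketch below assumes you will use the rjags package from CRAN, which is one common choice and is my assumption here rather than something specified above. It only checks whether that package is already installed and does not run any model.

```r
# Check whether the rjags interface package is installed (assumed choice of
# interface; JAGS itself must be installed separately from R).
has_rjags <- requireNamespace("rjags", quietly = TRUE)

if (has_rjags) {
  message("rjags is installed.")
} else {
  message("rjags is not installed; after installing JAGS, ",
          "try install.packages(\"rjags\").")
}
```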


I am indebted to these authors and practitioners from whom I’ve borrowed extensively and shamelessly:

An Introduction to R. This deceptively short overview by Venables and Smith continues to amaze me for its ability to answer almost every question about how to use R.
Applied Epidemiology Using R. Tomas Aragon is one of my new heroes. He is a physician and epidemiologist who also has the ability to demystify even the most arcane aspects of statistical computing. I have based large chunks of this course (particularly the material on R objects) on Dr. Aragon’s book. He tends to update his website, so you may need to search for this book. Do so. While you’re at it, look for a short book he wrote on the use of the LaTeX typesetting language called “EpiTex”.
An Introduction to Statistical Computing in R by John Fox is another wonderful resource from which I’ve stolen liberally.
Finally, Data Manipulation with R by Phil Spector (the only reference here that you would have to pay for) is well worth the price if you intend to work in R on a regular basis.