Bayesian Analysis for Epidemiologists Part II: Monte Carlo and BUGS

Introduction to Monte Carlo
Introduction to Markov Chain Monte Carlo
Assessing MCMC Results
Running MCMC in BUGS
Airline fatality data example

Up until the late 1980s, Bayesian inference consisted primarily of the conjugate models we considered in the previous section. While the prior and the likelihood could usually be described in closed form, for most reasonably realistic models the posterior was not analytically tractable. The introduction of theory and computing power to employ Monte Carlo methods opened the way to more complex problems, and has resulted in something of a revolution in Bayesian statistics. Not only can Monte Carlo be used to simulate values from closed forms of prior and posterior distributions, it is the basis for Markov chain algorithms that sample from arbitrary (typically high-dimensional) posterior distributions that would otherwise not be amenable to analysis. These Monte Carlo samples allow approximations to the integrals of otherwise intractable distributions. The basic concepts of sampling and simulation are the same as simple Monte Carlo, but we sample from distributions that are not available in closed form. Theorems exist which prove convergence to the integral as the sample size approaches infinity, even if the sampling is not independent. Monte Carlo algorithms form the backbone of much of modern Bayesian analysis.

Introduction to Monte Carlo

BUGS programs (of which WinBUGS, OpenBUGS and JAGS are the most popular) use a Monte Carlo approach to estimating probabilities, summary statistics and tail areas of probability distributions. Monte Carlo itself is basically simulation. Say you are interested in the probability of 8 or more heads in 10 tosses of a fair coin. You could use the binomial formula to get an exact answer (it is 0.0547). Or you could take a physical approach and toss 10 coins repeatedly into the air, counting how many times out of how many tosses you get 8 or more heads. Simulation uses a computer to “toss the coins”. Here is some R code that does just that:

coins<-rbinom(10000,10,.5)             # simulate 10,000 sets of 10 coin tosses
length(coins[coins>7])/length(coins)   # proportion of sets with 8 or more heads

Ten thousand simulations gets close to the exact answer. And a computer will not complain if you ask for a hundred thousand tosses. This is the basis of Monte Carlo approaches: simulate by sampling from some distribution and summarize the results as a probability distribution. You can also calculate a so-called Monte Carlo error on your results, which is \(\frac{\sigma_{sample}}{\sqrt{n_{iterations}}}\), the sample standard deviation divided by the square root of the number of iterations. The MC error decreases with the number of iterations, and can be used as an estimate of how close your estimate is to the exact answer.
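As a rough illustration (the cutoff and sample size are just the ones used above), you can compute the MC error for the coin-toss estimate directly from the simulated values:

coins <- rbinom(10000, 10, .5)             # 10,000 sets of 10 tosses
hit <- as.numeric(coins >= 8)              # indicator: 8 or more heads in a set
p.hat <- mean(hit)                         # Monte Carlo estimate of the probability
mc.error <- sd(hit)/sqrt(length(hit))      # sample SD / sqrt(number of iterations)
c(p.hat, mc.error)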

Monte Carlo Calculation of Pi

Let’s look at another simple simulation example. 1 Imagine you are throwing darts, but for reasons known only to folks who make up probability examples, you are just interested in hitting the dartboard in the upper right corner, as shown in the following figures.

Dart Board 1

Dart Board 2

Further imagine (for the same obscure reasons) that you are a very, very poor dart player. In fact, your darts are essentially random tosses. If you throw enough darts, eventually the number of darts that hit the upper right corner will be proportional to that area. The ratio of the number of darts that hit the shaded area to the total number of darts thrown will, in fact, equal one-fourth the value of pi.

We can create a “virtual” dart board by generating x and y coordinate values and calculating the distance from the origin (0,0) using the Pythagorean theorem. If the circle’s radius is 1, then darts whose distance is less than 1 fall in the shaded area, and those whose distance is more than 1 fall outside it. Do this enough times, and we can get a reasonable approximation of pi.

Here’s some R code that accomplishes that.

x<-runif(1000)                          # random x coordinates on (0,1)
y<-runif(1000)                          # random y coordinates on (0,1)
distance<-sqrt(x^2+y^2)                 # distance of each "dart" from the origin
hits<-distance[distance<1]              # darts landing inside the quarter circle
quart.pi<-length(hits)/length(distance)
4*quart.pi                              # estimate of pi

Run this code and I think you will find that just 1,000 “throws” gets us pretty close. This, then, is the essence of a Monte Carlo simulation.

Introduction to Markov Chain Monte Carlo

While we may be able to get a lot of mileage out of the simple conjugate analyses we considered in the first section (and, in fact, even the most sophisticated modern software will resort to conjugacy if it applies), we will very likely run out of gas when we try to address more realistic and complex problems. There are a couple of options for when we have a likelihood function and prior distribution that can’t be handled by formal, mathematical (read conjugate) analysis. We could take a so-called “grid” approach by specifying a prior with a dense grid of, say, 1,000 values spanning all possible values of \(\theta\). This may be reasonable when dealing with one parameter, but imagine the case of 6 parameters. We would need \(1000^6\) combinations of values. That’s 1,000,000,000,000,000,000 (for those of you keeping score).

Many of the limitations on the use of Bayesian statistics were solved when theoretical and computing advances allowed the use of sampling algorithms that produce a large number of representative values of most any posterior distribution of \(\theta\). We can then calculate things like means and variances using those representative values. Independent sampling from a joint posterior distribution like \(p(\theta | y)\) is difficult, but dependent sampling from a Markov chain, where each value is conditional on the previous value, is easier. There are several standard ‘recipes’ available for designing Markov chains for a stationary target distribution. The basic version is the Metropolis algorithm (Metropolis et al, 1953), which was generalized by Hastings (1970). Gibbs sampling is a special case of the Metropolis-Hastings algorithm which generates a Markov chain by sampling from the full set of conditional distributions. Hopefully that will make more sense soon.

There are two key ingredients for an MCMC recipe. First, our prior probability must be straightforward enough that for each value of \(\theta\) we can evaluate \(Pr[\theta]\). Second, we need a likelihood that can be calculated for any data value given \(\theta\). It turns out these two criteria are usually reasonable. And given these two criteria, we can apply the so-called Metropolis algorithm to create a representative sample of the posterior distribution.

The Metropolis Algorithm

It can be easy to get caught up in the terminology surrounding “MCMC”. John Kruschke, I think, puts it most simply and clearly: any simulation that samples a lot of random values from a distribution is a Monte Carlo simulation; any process in which each step has no memory for states before the current one is a Markov process, and a succession of such steps is a Markov chain. The Metropolis algorithm is an example of such a process.

We can introduce the Metropolis algorithm with a simple, albeit somewhat farfetched, example. 2

Imagine you are a politician campaigning in seven districts arrayed one adjacent to the other. You want to spend time in each district, but because of limited resources you want to spend the most time in those districts with the most voters. Ideally, the time you spend in a district would be proportional to the number of likely voters, but (again, because of limited resources) you only have information on the number of likely voters in the district you are currently in, and in those directly adjacent to it on either side. Each day, you must decide whether to campaign in the district in which you find yourself, move to the immediately adjacent eastern district, or move to the immediately adjacent western district. So (like many people, not just politicians) you don’t have the big picture (the “target distribution”). All you know is what is on your immediate horizon. But, good news: the Metropolis algorithm can give you the benefits of knowing the big picture or target distribution with only local information. On any given day, here’s how you decide whether to move or stay put:

  1. First, flip a coin. Heads to move east, tails to move west.
  2. If the district indicated by the coin (east or west) has more likely voters than the district in which you are currently staying, you move there.
  3. If the district indicated by the coin has fewer likely voters, you make your decision based on a probability calculation:
    • calculate the probability of moving as the ratio of the number of likely voters in the proposed district to the number of voters in the current district:
      • \(Pr[move] = \frac{pop_{proposed}}{pop_{current}}\)
    • Draw a random number between 0 and 1.
    • If that number is between 0 and the probability of moving, you move. Otherwise, you stay put.

This is basically a “random walk”, and in the long run, the time you spend in each district will be proportional to the number of voters in a district.

Let’s put some numbers to this example. Say the seven districts have the following relative proportions of likely voters.

The random walk would look like this:

  • At time=1, you are in district 4.
  • Flip a coin. The proposed move is to district 5.
  • Accept the proposed move because the voter population in district 5 is greater than that in district 4.
  • New day. Flip a coin. The proposed move is to district 6. Greater population, so you move.
  • New day. Flip a coin. The proposed move is to district 7. Greater population, so you move.
  • At time=4, you are in district 7. Flip a coin. If the proposal is to move to district 6, you base your decision to move on the probability criterion of 6/7. Draw a random sample between 0 and 1, if the value is between 0 and 6/7, you move. Otherwise you remain in district 7.
  • Repeat many, many times.

Note in the figure above how the density of the moves begins to mirror (upside down) the distribution of voters. This next figure represents the amount of time you spend in any one district.

And the following figure illustrates the evolution of this distribution over time.

You can see it gradually approaches the target distribution, even though the only information you have is from one step in either direction. Notice, also, the initial “bulge” of probability around the starting value of district 4. This is a feature of MCMC, and why you will need to carefully consider initial or starting values for your simulations, and allow adequate “burn in” time.
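Here is a rough R sketch of this random walk, assuming (consistent with the 6/7 step above) that district k has a relative likely-voter population of k; the number of days is just an arbitrary choice for illustration:

pop <- 1:7                                   # relative likely-voter populations of districts 1-7
n.days <- 50000
district <- numeric(n.days)
district[1] <- 4                             # start in district 4
for (day in 2:n.days) {
  current <- district[day - 1]
  proposal <- current + sample(c(-1, 1), 1)  # coin flip: propose east or west
  if (proposal < 1 || proposal > 7) {
    district[day] <- current                 # proposed district doesn't exist; stay put
  } else if (runif(1) < min(pop[proposal]/pop[current], 1)) {
    district[day] <- proposal                # accept the move
  } else {
    district[day] <- current                 # reject; stay another day
  }
}
table(district)/n.days                       # time in each district is roughly pop/sum(pop)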

The basis of the Metropolis algorithm, then, consists of: (1) proposing a move, and (2) accepting or rejecting that move. Proposing a move involves a proposal distribution. Accepting or rejecting the move involves an acceptance decision.

The proposal distribution is the range of possible moves. In the simple example with which we’ve been working, the proposal distribution consists of two possible moves: to the east or to the west, each with a probability of 50% based on the flip of a coin.

The acceptance decision for the proposal distribution is based on a probability. It is really quite clever:

\[ Pr[move] = \min\left(\frac{P(\theta_{proposed})}{P(\theta_{current})}, 1\right) \]

So, if the population of the proposed district is greater than the current population, the minimum is 1, and you move with 100% certainty. If, on the other hand, the proposed population is less than the current population, you move with a probability equal to that ratio. The process works because the relative transition probabilities match the relative values of the target distribution. You can summarize the process, from coin flipping to acceptance probability, as

\[ 0.5 \times \min\left(\frac{P(\theta_{proposed})}{P(\theta_{current})}, 1\right) \]

Because we are only interested in the ratio of the proposed to the current values, the final (target or posterior) distribution does not need to be normalized; it does not have to integrate or sum to 1. The consequence is that we are not restricted to closed-form solutions. We can use this approach to get at complex posterior distributions of the kind \(Pr[data|\theta] \times Pr[\theta]\).

From discrete Metropolis algorithm to continuous

The Metropolis algorithm can be generalized from the discrete (e.g. “move east vs. move west”) case to a continuous, multi-dimensional parameter space while remaining essentially the same. As mentioned earlier, the main criterion is to be able to calculate \(Pr[\theta]\) for any candidate value of \(\theta\). Since the target doesn’t have to be normalized, i.e., all we need is the ratio of the proposed to the current value, we can use the product of the likelihood and the prior to calculate the value of the posterior at any point.

A normally distributed proposal distribution makes it easier to generate proposal values, and the proposed move will typically be near the current position. The decision to move is based on the same probability as that in the simple discrete version:

\[ Pr[move] = \min\left(\frac{P(\theta_{proposed})}{P(\theta_{current})}, 1\right) \]

Some care needs to be taken in specifying a proposal distribution. If the proposal distribution is too narrow, it will take a long time to traverse the entire target distribution, especially if the initial or starting value is off somewhere in the hinterlands of a large multi-dimensional parameter space. That’s one reason that the operationalization of a Metropolis algorithm should involve more than one starting value. It’s also a potential problem if the proposal distribution is too wide. Then \(\frac{P(\theta_{proposed})}{P(\theta_{current})}\) will rarely be large enough to accept a move, and your sample will be “stuck” in a localized area of the target distribution.
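As a minimal sketch of this continuous, one-parameter version (the target here, a binomial likelihood with a uniform prior, and the proposal standard deviation are assumptions chosen just for illustration):

# One-parameter Metropolis sketch: normal proposal, unnormalized target = likelihood * prior
target <- function(theta) {
  if (theta < 0 || theta > 1) return(0)        # outside the support of theta
  dbinom(8, 10, theta) * dunif(theta, 0, 1)    # likelihood (8 of 10 successes) * uniform prior
}
n.iter <- 10000
prop.sd <- 0.1                                 # standard deviation of the normal proposal
theta <- numeric(n.iter)
theta[1] <- 0.5                                # starting value
for (i in 2:n.iter) {
  proposed <- rnorm(1, theta[i - 1], prop.sd)  # propose a move near the current position
  p.move <- min(target(proposed)/target(theta[i - 1]), 1)
  theta[i] <- if (runif(1) < p.move) proposed else theta[i - 1]
}
hist(theta[-(1:1000)], breaks = 50)            # posterior sample after discarding burn-in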

Metropolis algorithm with two parameters

Let’s review and extend the Metropolis algorithm to the two-parameter setting. The mechanics of applying the Metropolis algorithm to a two-parameter bivariate normal model proceed similarly to the simple one-parameter setting. As a quick recap, our random walk starts at some arbitrary starting point, hopefully not too far away from the meatiest part of the posterior distribution. We propose a move from our current position using a proposal distribution from which (we assume) it is easy to generate values. We accept the move with certainty if the target or posterior distribution is more dense (of greater value) at the proposed position than at the current position. If the target or posterior distribution is less dense at the proposed vs. the current position, we accept the move with probability set to the ratio of the proposed to the current value.

The algorithm, then, needs only a few simple tools:

  • a value sampled randomly from the proposal distribution
  • another value sampled randomly from a \(Unif(0,1)\) distribution to accept or reject probabilistic moves
  • the ability to calculate the likelihood and prior for any given parameter values
    • conveniently, since we are only interested in their ratio, these values don’t have to be normalized

For a two-parameter problem, we can use the product of beta distributions as a prior. As an example (again, from John Kruschke’s text) of a two-parameter setting, we will work with a bivariate normal proposal distribution. Below, you’ll find the proposal distribution and the R code used to create it, then the scatter plot of samples (excluding a burn-in period), where \(N_{pro}\) is the number of accepted proposals. Note that for more complex parameter spaces, it may be difficult or impossible to visualize your proposal distribution, but the underlying concepts remain the same.

 # R code to create bivariate figure
mu1<-0 # set mean x1
mu2<-0 # set mean x2
s11<-10 # set variance x1
s22<-10 # set variance x2
s12<-5 # set covariance x1 and x2 (rho*sqrt(s11*s22); not used directly below)
rho<-0.5 # set correlation coefficient  x1 and x2
x1<-seq(-10,10,length=41) # generate vector  x1
x2<-x1 # copy x1 to x2


f<-function(x1,x2) # multivariate function
{
term1<-1/(2*pi*sqrt(s11*s22*(1-rho^2)))
term2<--1/(2*(1-rho^2))
term3<-(x1-mu1)^2/s11
term4<-(x2-mu2)^2/s22
term5<-2*rho*((x1-mu1)*(x2-mu2))/(sqrt(s11)*sqrt(s22)) # cross term; subtracted below
term1*exp(term2*(term3+term4-term5))
} 

z<-outer(x1,x2,f) # calculate density values


persp(x1, x2, z, # 3-D plot
main="Two dimensional Normal Distribution",
sub=expression(italic(f)~(bold(x))==frac(1,2~pi~sqrt(sigma[11]~
sigma[22]~(1-rho^2)))~phantom(0)^bold(.)~exp~bgroup("{",
list(-frac(1,2(1-rho^2)),
bgroup("[", frac((x[1]~-~mu[1])^2, sigma[11])~-~2~rho~frac(x[1]~-~mu[1],
sqrt(sigma[11]))~ frac(x[2]~-~mu[2],sqrt(sigma[22]))~+~
frac((x[2]~-~mu[2])^2, sigma[22]),"]")),"}")),
col="lightgreen",
theta=30, phi=20,
r=50,
d=0.1,
expand=0.5,
ltheta=90, lphi=180,
shade=0.75,
ticktype="detailed",
nticks=5) 

bivariate normal proposal distribution

Metropolis scatter plot
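To make the mechanics concrete, here is a minimal sketch of such a two-parameter Metropolis sampler. This is not Kruschke’s code: the target is taken, purely for illustration, to be proportional to a product of two beta densities, and the bivariate normal random-walk proposal is implemented as two independent normal increments.

# Two-parameter Metropolis sketch (target and tuning values are illustrative assumptions)
target <- function(theta) {
  if (any(theta <= 0) || any(theta >= 1)) return(0)    # outside the unit square
  dbeta(theta[1], 3, 3) * dbeta(theta[2], 4, 2)        # stand-in for likelihood * prior
}
n.iter <- 20000
prop.sd <- 0.2                                         # proposal "width" (tuning parameter)
chain <- matrix(NA, n.iter, 2)
chain[1, ] <- c(0.5, 0.5)                              # starting value
for (i in 2:n.iter) {
  current <- chain[i - 1, ]
  proposed <- current + rnorm(2, 0, prop.sd)           # bivariate normal random-walk proposal
  if (runif(1) < min(target(proposed)/target(current), 1)) {
    chain[i, ] <- proposed                             # accept the move
  } else {
    chain[i, ] <- current                              # reject; stay put
  }
}
plot(chain[-(1:1000), ], pch = ".",                    # scatter plot after burn-in
     xlab = expression(theta[1]), ylab = expression(theta[2]))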

Gibbs Sampling

As we have noted, for the Metropolis sampler to work well, the proposal distribution needs to be “tuned” to the posterior distribution. If the proposal distribution is too narrow, the chain takes tiny steps and can get bogged down in a local region of the parameter space; if it is too wide, too many proposed moves will be rejected. Gibbs sampling (Geman and Geman, 1984) is an alternative algorithm that does not require a separate proposal distribution, so it is not dependent on tuning a proposal distribution to the posterior distribution.

The main difference between Gibbs sampling and Metropolis sampling is how the proposal value is chosen. Instead of choosing a proposed value from a proposal distribution that spans the entire multi-parameter density, the Gibbs sampler chooses a random value for a single parameter, holding all the other parameters constant. The randomly proposed single parameter is combined with the unchanged values of all the other parameters to give the proposed new position in the random walk. The Gibbs sampler then cycles through all the parameters in turn, drawing a proposal value for each and accepting or rejecting it. You can think of it as a special case of the more general Metropolis algorithm, where the proposal distribution is determined, at each step, by the conditioning variables. Because the process is set up so that the proposal distribution exactly mirrors the posterior probability for that parameter, the proposed move is always accepted, which is why the Gibbs sampler can be more efficient than the Metropolis sampler. You may see references to the Metropolis-Hastings algorithm; this is the generalization of the Metropolis algorithm needed to prove that the Gibbs sampler “works”.

To sum up Gibbs sampling:

  • Select a parameter
  • Choose a random value for that parameter from a conditional distribution based on all the other parameters
  • The new point is determined by the values of the other parameters and the proposed value of the parameter conditional on them

Gibbs sampling is useful when the complete joint posterior distribution cannot be sampled directly, but the conditional distribution of each parameter given all the others can be. This may be the case in complex models. The following figure illustrates the concept for the two-parameter bivariate normal model.

The top figure shows a randomly generated value of \(\theta_1|\theta_2\). The dark line is a slice through the posterior at the conditional value of \(\theta_2\). The dot is the value of \(\theta_1\) from that conditional distribution. The bottom figure shows the same process for \(\theta_2|\theta_1\). As opposed to the Metropolis algorithm, the Gibbs sampler results in the following scatterplot.
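Here is a minimal sketch of that process for a standard bivariate normal target, where the conditional of each \(\theta\) given the other is itself normal; the correlation (0.5) and the deliberately remote starting value are assumptions chosen for illustration:

# Gibbs sampler sketch for a bivariate normal with means 0, variances 1, correlation rho
rho <- 0.5
n.iter <- 10000
theta <- matrix(NA, n.iter, 2)
theta[1, ] <- c(-3, 3)                                        # remote starting value
for (i in 2:n.iter) {
  theta1 <- rnorm(1, rho * theta[i - 1, 2], sqrt(1 - rho^2))  # theta1 | current theta2
  theta2 <- rnorm(1, rho * theta1, sqrt(1 - rho^2))           # theta2 | new theta1
  theta[i, ] <- c(theta1, theta2)
}
plot(theta[-(1:500), ], pch = ".", xlab = expression(theta[1]), ylab = expression(theta[2]))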

Programs like WinBUGS, OpenBUGS and JAGS, even though they are named for Gibbs sampling, often use Metropolis algorithms, or even conjugate results if available, depending on the model specification. You can override the default behaviors, but will rarely have reason to.

Reporting Results

In classical statistical inference, we can’t make direct probability statements about parameters. A p value is not the probability that the null hypothesis is true, but rather the probability of observing data as or more extreme than we obtained given that the null hypothesis is true. In Bayesian inference we can directly calculate the probability that a parameter exceeds some value by calculating the area of the posterior distribution to the right of that value, which is simply the proportion of values in the posterior sample of the parameter that are greater than it. We can use that information to report the results of Bayesian analyses as means with so-called 95% credible intervals for parameter estimates.
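For example, given a vector of posterior draws, these summaries are one-liners; the sample below is a made-up stand-in for a real posterior sample:

post <- rnorm(10000, mean = 2, sd = 0.5)   # stand-in vector of posterior draws
mean(post > 2.5)                           # Pr(parameter > 2.5 | data): proportion of draws above 2.5
mean(post)                                 # posterior mean
quantile(post, c(0.025, 0.975))            # 95% credible interval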

Assessing MCMC Runs

There are issues to consider with MCMC methods that don’t arise in classical statistical inference. In addition to the more familiar aspects of statistical analysis, the process of conducting an MCMC analysis involves assessing graphs and diagnostics for convergence, mixing, acceptance rates or efficiency, and correlation. Specifically, you’ll explore convergence (whether your sample of \(\theta\)’s approaches the posterior distribution \(Pr(\theta | y)\)), “mixing” (whether the sampling algorithm explored the entire posterior distribution), efficiency (how well \(Pr(\theta | y)\) is estimated by the sample of \(\theta\)’s you’ve generated, or whether you need a larger sample size), and autocorrelation of the values in your sample.

Convergence

Convergence in MCMC refers to the final, stable set of samples from the target or “stationary” posterior distribution. A series of samples “converges” to a distribution, not to a particular value. The series of samples or iterations that precedes convergence is sometimes referred to as “burn in”. Once convergence is achieved, samples should look like random scatter around a stable mean.

You have an obligation when conducting MCMC to assess convergence. But there is no fool-proof means of assuring it. Sometimes all that is needed is a graphical assessment of trace plots by iteration number for multiple chains started from dispersed initial values. For example, look for quick, jerky, spiky, overlapping chains on a trace plot, which indicate that the chains are all sampling from the same space.

Convergence

Some statistical tests (available in the CODA and BOA R software packages) may help diagnose convergence. One popular test is the Gelman-Rubin statistic, which compares multiple chains, each with widely differing starting points, by quantifying whether sequences are farther apart than expected based on their internal variability.

Problems with convergence

You may find that it takes an inordinately long period of time for your samples to converge to a stable distribution. Or that the samples never converge. This may be due to correlation between variables in your model. In the figure on the left (below), you see that small changes in one variable are associated with large changes in the other variable. The MCMC algorithm will have to sample a widely dispersed probability space, and it may not do this quickly or perhaps at all. One possible fix is to “center” the variables, e.g. by subtracting the mean, or standardizing. Centering doesn’t change the relationship between variables, it just moves the intercept, resulting in a situation like that on the right side of the figure. Now, the MCMC algorithm will run and converge more quickly because it has only to sample from a relatively tight probability space.

NonConverge
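A quick sketch of the centering idea with made-up data, before handing anything off to BUGS or JAGS (the variables here are hypothetical):

x <- rnorm(100, 50, 10)                    # made-up covariate
y <- 2 + 0.5 * x + rnorm(100)              # made-up outcome
x.c <- x - mean(x)                         # centered covariate
coef(lm(y ~ x))                            # original intercept and slope
coef(lm(y ~ x.c))                          # same slope; only the intercept moves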

Efficiency

After your sampling algorithm has converged to an acceptably stable, stationary posterior distribution, you will need further iterations to obtain samples for posterior inference. The more iterations you sample, the more accurate the posterior estimates. But since the sample is not independent, you can’t use traditional standard error estimates. Instead, you can assess the efficiency or accuracy of your estimates with a Monte Carlo standard error, the standard error associated with the posterior sample mean for a given parameter. As noted above, the MC error is calculated as \(\frac{\sigma_{sample}}{\sqrt{n_{iterations}}}\), and depends on:

  • the true variance of the posterior distribution,
  • the posterior sample size (i.e. the number of MCMC iterations)
  • autocorrelation in the MCMC sample.

As a rule of thumb, you’d like to see an MC error less than 1% to 5% of your posterior standard deviation.

Autocorrelation

Since a sample value at time t is based on the previous value at time t-1, the samples will invariably be correlated. It is, actually, less of a concern than you might expect, because your inferences and summary measures are based on the entire sample taken together, not on the series. Still, you’d like the correlation to be as small as possible, with a relatively steep drop-off as the lag number increases, indicating convergence toward independent samples. Low autocorrelation also increases efficiency, because you can explore the entire probability space with as small a sample as possible. You can assess correlation in your sample chain with autocorrelation plots (autocorrelation by lag), with the goal of the autocorrelation decreasing to zero in as few steps as possible.
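A small sketch of how you might check this in R, using a simulated autocorrelated series as a stand-in for a chain of MCMC draws:

library(coda)
theta <- as.numeric(arima.sim(list(ar = 0.8), n = 5000))  # stand-in for an MCMC chain
acf(theta)                                 # autocorrelation by lag; you want a quick drop toward zero
effectiveSize(as.mcmc(theta))              # effective number of independent draws in the chain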

Running MCMC in BUGS

The BUGS Program

BUGS stands for Bayesian inference Using Gibbs Sampling. It’s a language for specifying complex Bayesian models. WinBUGS and its open-source cousin OpenBUGS are perhaps the first and most well-known implementations, but since I started using a Mac I’ve migrated over to JAGS (Just Another Gibbs Sampler) because of its cross-platform ease of use. To follow along with the material below you will need to download the software at the links above, and install the R2WinBUGS, R2OpenBUGS and/or R2jags packages from CRAN.

WinBUGS was originally distinguished from other implementations of Gibbs sampling by its use of graphical, object-oriented internal representations of models called “Doodles”. The “Doodle Editor” allows you to create models graphically and can be very nice to work with, though I don’t know many folks who use it. Most folks find themselves using the point-and-click approach built into Win/OpenBUGS. This consists of:

  1. Specifying a model (We’ll talk about this in a moment)
  2. Double clicking the word “Model” to select it
  3. Clicking “Check Model” in the menu (under “Model > Specification…”)
  4. Look for the message in the lower left hand side of your screen that your “model is syntactically correct”
  5. Clicking “Compile” (again under under “Model > Specification…”, and again checking the message that your “model compiled”)
  6. Clicking “load inits” (or “gen inits” if you are allowing WinBUGS to generate initial values for you) under the same menu to load or generate initial or starting values for your MCMC algorithm, and again monitoring the response, this time that your “model initialized”
  7. Loading your data (under the same menu)
  8. Clicking “Samples” under the “Inference” menu to specify the beginning (“beg”) and ending (“end”) of the iterations or samples on which you want to base your results (here is where you can specify your “burn-in” period)
  9. Typing the names of the “nodes” (parameters) you want results for in the “Sample Monitor” window, clicking “Set” after entering each node
  10. Clicking on “Update” in the menu to specify the total number of iterations and begin the MCMC process
  11. Clicking (again) on “Samples” under the “Inference” menu
  12. Typing “*” in the “Sample Monitor”and clicking “Stats” to see results
  13. Typing “*” in the “Sample Monitor”and clicking “Trace” to see sampled values
  14. Clicking on various diagnostics and tools like “bgr statistic” for the Gelman-Rubin statistic
  15. Clicking on various plots and graphs to assess convergence and autocorrelation

If this seems like a lot of pointing and clicking, well, it is. Most folks quickly tire of it. The good news is that you can run the entire process from R with some fairly simple interfaces to your BUGS program of choice. Perhaps the most popular syntax-driven approach is to interface BUGS with R using the R2WinBUGS package, although, as I say above, my preference is the R2jags package.

Specifying a Model

Before going over how to specify a model in R, we need to discuss how to specify a model. We’ll begin with a simple example using the point-and-click approach described above. It will allow you to see native BUGS software, and likely convince you that you no longer wish to work with native BUGS software. We’ll recreate the coin toss example from the top of the page that we coded in R. We are interested in the probability of 8 or more heads in 10 tosses of a fair coin. Our outcome, Y, is the number of heads, distributed \(Binomial(10, 0.5)\). The model statement for this problem in BUGS is:

Model{
Y ~ dbin(0.5, 10)
P8 <- step(Y-7.5)
}

BUGS doesn’t have the capability for “if-then” statements, so we use a step function instead. If its argument is greater than or equal to zero, it returns a value of 1; otherwise it returns a value of 0. Here we are saying that if Y-7.5 is greater than or equal to zero (i.e. Y is 8 or more), P8 takes on the value 1; otherwise P8 takes on the value 0. We assign the result of the step function to an indicator variable called P8 so we can monitor it and get results on it. Cut and paste this model into a new model frame in Win or OpenBUGS and go through the point-and-click approach outlined above. Let WinBUGS generate initial values. And, since this is pure Monte Carlo, you don’t need to load any data. Run 5000 iterations and monitor the results of the P8 variable. When you (finally) click on “stats” in the “sample monitor tool” you should get something like:

WinBUGS Screen Shot

The BUGS Language

In some statistical software programs, you tell the package where the data is and it applies pre-packaged statistical models. Like R, WinBUGS is much more flexible. You can specify just about any model you like, and then apply your own model to the data. The model language itself is similar to R. So, <- represents logical dependence, e.g. m <- a + b*x. The tilde symbol (~) represents a stochastic or probability dependence, e.g. x ~ dunif(a,b). One very important point about specifying probability distributions in BUGS: you define normal distributions in terms of precision, not variance, so ~ dnorm(0.0,1.0E-6) means a very small precision, and hence a very large variance (an almost flat, essentially uninformative prior).
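For example, the dnorm(0.0, 1.0E-6) prior above corresponds to:

tau <- 1.0E-6            # precision, as specified in BUGS
1/tau                    # variance = 1,000,000
1/sqrt(tau)              # standard deviation = 1,000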

Arrays and loops are also similar to R language:

for (i in 1:n){
r[i] ~ dbin(p[i], n[i])
p[i] ~ dunif(0,1)
}

BUGS comes with functions, some of which may be familiar, e.g.

mean(p[])

returns the mean of a whole array,

mean(p[m:n]) 

returns the mean of elements m to n.

Some (but not all) functions can appear on the left side of an expression, e.g.

logit (p[i]) <- a + b*x[i]
log(m[i]) <- c + d*y

Here are some useful BUGS functions:

p ~ dnorm(0,1)I(0,) 

Truncates the variable, restricting it to the range 0 to infinity (i.e. only the positive side of a normal distribution). This can be useful when specifying a prior distribution that cannot, say, have negative values.

p<- step(x-1) 

Returns a 1 if x-1 is greater than or equal to zero (i.e. if x is greater than or equal to 1), and a 0 otherwise. Monitoring p and recording its mean will give the probability that x is greater than or equal to 1.

p<- equals(x, .7)

Does the same thing for equality: p takes the value 1 if x equals 0.7, and 0 otherwise.

tau <- 1/pow(s,2)

Sets \(\tau\) to \(\frac{1}{s^2}\)

Complex functions as parameters

Since in a Bayesian model both variables and parameters are allowed to vary and are associated with errors, you can construct and calculate errors for even the most complex functions. You just calculate the function as a node at each iteration and summarize the posterior samples as you would for any node. Practically, this involves adding extra lines of code to monitor what you’re interested in. For example, suppose you are interested in the dose of a drug that will provide 95% efficacy (“ED95”). In BUGS you can add a line of code like this:

ED95 <- (logit(0.95) - alpha) / beta

Here, alpha is the intercept or baseline response, and beta is the change in response per unit dose.

Similarly, you could enter a missing or predicted value using your model, and monitor results for that. For example, say you model the effect of distance from a disaster (x1), local deaths from that disaster (x2) and median household income (x3) on psychological distress. Your BUGS statement for the model could be:

y[i] <- beta0 + beta1 * x1[i] + beta2*x2[i]  + beta3*x3[i] 

You can enter and model the results of the effect of 2, 4 and 6 mile distances with a few lines of code, substituting the mean values for the other two variables (x2 and x3), and monitoring and summarizing the new variables y2, y4, y6.

y2 <- beta0 + beta1 * 2 + beta2*1.26  + beta3*41064 
y4 <- beta0 + beta1 * 4 + beta2*1.26  + beta3*41064 
y6 <- beta0 + beta1 * 6 + beta2*1.26  + beta3*41064 

Airline fatality data example

Now, let’s look at an example of running a Bayesian MCMC analysis by interfacing R to a BUGS program. As I mentioned above, I prefer to use the JAGS program and interface with the R2jags package, but you can do the same thing (and should get the same answers) using WinBUGS with R2WinBUGS, or OpenBUGS with R2OpenBUGS.

This example comes from Lyle Gurrin and Bendix Carstensen. We are going to analyze 26 years’ worth of airline fatality data using a simple Poisson model that will allow us to compare the exact form of the posterior to the simulated MCMC results.

Begin by reading in the airline fatality data, which you can find here.

airline<-read.csv("http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/resources/Bayes/Bayes2/airline.csv", header=T, stringsAsFactors=F)
airline

Explore the airline data, then take advantage of conjugacy to get the exact results for the posterior distribution of the mean number of fatalities. The model for the data, or likelihood, is \(y_i|\mu \sim Poisson(\mu)\). An uninformative prior for these data is \(\mu \sim \Gamma(0,0)\). For computational purposes we will use \(\mu \sim \Gamma(0.01,0.01)\), which is nearly flat on \((0, +\infty)\). The conjugate posterior is then \(\mu \sim \Gamma(0.01 + \Sigma y_i, 0.01 + n)\), where \(\Sigma y_i\) is the total number of fatalities, and \(n\) is the number of observations.
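Since the mean of a \(\Gamma(a, b)\) distribution is \(a/b\), the exact posterior mean is

\[ E[\mu | y] = \frac{0.01 + \Sigma y_i}{0.01 + n} \]

which, for the values used in the code below (634.01 and 26.01), works out to about 24.4.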

Our result is the mean of the distribution, which we plot.

nrow(airline)
sum(airline$fatal)
head(airline)

(mn <- mean( xx <- rgamma( 10000, 634.01, 26.01 ) ) )
curve(dgamma(x,634.01,26.01), from=20, to=30, lwd=2)
abline( v=mn, col="red", lwd=3 )

Now, let’s compare this same value using MCMC simulation and BUGS (JAGS flavor in this case). We begin by creating a list object to hold the data, and another list object to hold the initial values for the simulation (notice that we need three values because we will be running three chains). We then write a model statement to file using the cat() function.

a.dat<-list(fatal=c(airline$fatal,NA), I=27) # the appended NA becomes an unobserved node that JAGS will sample (a predicted value)

a.inits<-list(
    list(mu=20),
    list(mu=23),
    list(mu=26)
    )

cat(
"model
{
for(i in 1:I)
{
fatal[i]~dpois(mu)
}
mu~dgamma(0.1,0.1)
}",
file="m1.jag")

We run the simulation in JAGS via the R2jags package function jags(). We are essentially specifying with syntax, using the R objects we created, everything we had to point to and click on in WinBUGS. Note that we are only monitoring the parameter “mu”. Type ?jags to get more information on options and use.

library(R2jags)

args(jags)

res<-jags(data=a.dat, 
    inits=a.inits, 
    parameters.to.save="mu", 
    model.file="m1.jag")

print(res)
plot(res)
traceplot(res)

res.mcmc<-as.mcmc(res)
summary(res.mcmc)
plot(res.mcmc)

autocorr.plot(res.mcmc)

res.list<-mcmc.list(res.mcmc[[1]],res.mcmc[[2]],res.mcmc[[3]])
gelman.diag(res.list)

The summary, print, plot and traceplot functions operate directly on the bugs object R2jags produces. You’ll see that print() returns the statistics for the nodes we are monitoring. The plot() method returns some minimalist graphics summarizing the results. The traceplot() method returns plots of the three chains for each variable, here deviance and mu. Each chain is in a different color. Notice how the chains are spiky, dense and overlapping, indicating convergence and good mixing.

We can get more tools by using the “coda” package. To access the tools in the coda package, we first have to convert the jags object to an mcmc object. With the mcmc object, we can get the Gelman-Rubin statistic and autocorrelation plots, both of which look acceptable.


  1. This example comes from Joy Woller and is well worth a look.

  2. This little probability parable, and much of the material, as well as many of the images in this section, comes from John Kruschke’s excellent book, “Doing Bayesian Data Analysis”. It is one of the clearest, most lucid introductions to this topic that I have come across. Buy it.