NOTES ON SOFTWARE, SAS PROCEDURES AND METHODS
Statistical software packages of interest to disaster and injury epidemiologists
There are many statistical software packages available, each with their advantages and disadvantages. Public health practitioners tend to use the software they first trained on, but there are advantages to learning more than one program. Some procedures are most easily and effectively conducted by certain packages. It is also becoming increasingly easier to translate data, procedures and commands across packages. What follows is a (by no means exhaustive or complete) annotated list of some popular programs.
Microsoft Excel – This ubiquitious spreadsheet program has a quick learning curve, many useful statiscial functions and the advantage of being readily understood and shared for collaborative purposes. Many researchers will find, though, that they will soon outgrow it when they develop a need for for more advanced and specialized procedures, data manipulation, and graphics.
SPSS – A flexible and reasonable choice if you prefer graphical user interfaces and a menu-driven approach to data analysis. Depending on the type of work you do, this may very well be the only statistical software package you will need. It is particularly well-suited to procedures such as factor analysis and is very popular among social sciences. (In fact, SPSS stands for Statiscial Package for Social Science). The spreadsheet-like interface provides a relatively painless transition from a program like Excel. An added advantage for many users is that the package can be purchased during one’s graduate education obviating the need for costly yearly licensing fees.
SAS – A powerful suite of programs that is well known, well-documented and flexible enough to handle most any statistical procedure. While a menu driven version is available, most researchers and epidemiologic analysts will use the syntax-driven SAS-language based approach. Writing syntax presents a greater learning curve, but is amply rewarded by the increased control, flexibility and ability to document and recall analyses. SAS is configured to work across most computer platforms, perhaps contributing to its popularity among US government agencies. Perhaps its principle disadvantage is the yearly licensing fee which may be prohibitively expensive for some researchers. Mutliple-user site licence agreements with academic institutions may offset the expense to some extent. It is particularly powerful at handling the extremely large datasets that are increasingly available to practitioners and that constitute the databases for many surveillance activities. The examples that follow utilize SAS exclusively.
SUDAAN – A software program developed expressly for the purpose of analyzing survey data. The procedures themselves are familiar, but the analyses take account of complex sampling strategies, nesting of variables within strata, and correlations among data elements. Available in either a stand-alone version or as a SAS-callable version, the program is utilized for the analysis of many large, ongoing US government health surveys such as CDC’s Behavioral Risk Factor Surveillance System.
R – A popular open-source freeware alternative to proprietary statistical software packages. The language is similar to S-Plus. As more researchers and statisticians contribute to its development, and as documentation proliferates, it will be an increasingly viable option for researchers and epidemiologists. There are few, if any, procedures, it can not handle. The graphics are as good as or (in many cases) better than those found in expensive packages. It even includes mapping and spatial analytic tools to rival those found in packages costing thousands of dollars. It has a lot to recommend it.
EpiINFO – Developed by CDC for their own investigators, this program has something for everyone, including sample size calculators, simple intuitive 2x2 table calculators, and questionnaire-development and analysis programs for case-control studies. It even comes with a mapping module. Every epidemiologist in practice owes it to themselves to download this free program.
ArcGIS – A suite of software packages from ESRI that is the standard among government planners. If your work calls on you to do more than create simple choropleths of health-related or risk data, you will undoubtedly come across this software. The program comes bundled with more cartographic and geostatistical tools than you will ever have time or need to learn. One can also marry the statistical power of SAS to ArcGIS through the seamless integration offered by the SAS bridge for ESRI add on module.
SatScan – This program was initially developed by Martin Kuldorf and his colleagues for use in detecting cancer clusters. The easy to use software will identify locations of clusters of any outcome of interest across both time and space. An excellent public health surveillance tool available for free download after registration.
WinBUGS – BUGS stands for Bayesian Inference Using Gibbs Sampling. This freeware program allows sophisticated iterative sample-based analyses that were unfeasible just a few years ago. It comes with mapping tools that allow hierarchical Bayesian modeling to control for and address many of the difficulties and drawbacks of performing spatial analyses using a frequentist approach.
SAS Notes for Disaster and Injury Epidemiologists
You may find the following bits and pieces of SAS code useful tools to have in your injury epidemiology toolbox. The following section is not a review of basic statistics and assumes knowledge of biostatistics (there is no substitute for taking courses and reading textbooks) and familiarity with SAS programming such as one might obtain from a good book. The SAS Press homepage has numerous recommendations. Better yet, attend one of SAS’s many short courses. Much of this material has been taken directly from standard SAS sources. The topics are more or less in order of increasing complexity. Use your browser’s search function (usually Control-F) to find a particular test or procedure.
Utility Procedures
Find help on SAS procedures:
Help > SAS Help and Documentation > SAS Products > Base SAS > SAS Procedures > Procedures
Find help on SAS functions;
Help > SAS help and documentation > SAS products > base SAS > SAS language dictionary > dictionary of language elements > SAS data set options
PROC TABULATE to create a table with sums, medians, etc…
proc tabulate data=your.data;
var cont_var; /* continuous variable you want stats on*/
class cat_var; /*categorical variable you want stats by*/
format cat_var cat.; /*formatting statement using a previously
created format for your categorical variable*/
table cat_var, sum* cont_var pctsum* cont_var median* cont_var;
/* requests the sum, percent of the sum, and median for the
continuous variable for each value of the categorical
variable;*/
run;
INPUT function to convert a SAS character value to a numeric value
data new; /* creating a new data set with the numeric value */
set old; /* your existing data set with the character value */
new_numeric_var=input(old_char_var, informat); /* note that you need to specify
a numeric informat that best describes how to read
the character into the numeric value */
run;
ODS HTML to output a table directly into Excel
ods html file="C:\Folder\file.xls"
style = minimal; /* open ods, specify the drive, folder and name of the file,
note the .xls extension, minimal style removes formats */
proc ; /* procedure call */
run;
ods html close;
Univariate and Graphing Procedures
PROC GPLOT to graph one variable against another:
goptions reset=all; /* remove any previous settings */
goptions cback=white device=win; /* set background color, specify windows */
axis1 major = (h=2.0 c=black) /* define your vertical axis */
minor = (h=1.0 c=black)
order = 0 to 3000 by 100
label = (h=1.5 a=90 f=arial c=black 'Label for the Axis');
axis2 major = (h=2.0 c=black) /* define your horizontal axis */
minor = (h=1.0 c=black)
order = 1 to 132 by 6
label = (h=1.5 f=arial c=black ‘Label’);
proc gplot data= your.data;
plot var1 * var2 /* specify variables to be plotted */
/ vaxis = axis1 /* invoke previously defined variables */
haxis = axis2;
symbol v='.' f=arial h=2 i=sm30 c=black; /*define symbol to use for plot
note i= statement defines 30% smoothing
via interpolation*/
title h=3.0 c=black f=arial j=center ‘Main Title of Plot ‘;
title2 h=2.0 c=black f=arial j=center ‘Subtitle of Plot’;
run;
quit;
PROC FREQ to calculate an odds ratio and its 95% confidence interval for a 2x2 table:
proc freq data = your.data;
tables var_a * var_b / measures cl; /* variables are both dichotomous, measures
returns the odds ratio, cl returns the
confidence limits*/
run;
PROC BOXPLOT as convenient graphical way to get a 5-number summary and sense of normality or symmetry across groups of continuous data
proc boxplot data = your.data;
plot continous_var * category_var / boxstyle schematic cboxes=black; /* specify type and color of boxplot, note, if only one category create dummy variable against which to plot */
run;
PROBPLOT option in PROC UNIVARIATE to graphically assess assumptions of normality for continuous data.
proc univariate data = your.data;
var ; */ specify your variable */
id idnumber; /* useful to have a way to identify outliers or illogical observations */
probplot continuous_var / normal (mu=est sigma= est color=blue w=1);
/* invoke the probability plot option, overlay a normal like against
which to judge, set parameters */
title;
run;
STERR and CLM in PROC MEANS to get the standard error of the sample mean and its confidence limits.
proc means data=your.data n mean stderr clm; /* invoke procedure, specify statistics */
var continuous_var;
title ‘Your Title’;
run;
Regression Procedures
PROC CORR to get correlation coefficient, a summary measure of linear relationship.*
Proc corr data= your.data rank; /* ‘rank’ orders correlations from high to low */
Var predictor1 predictor2 predictor3 predictor4
With outcome_var; /* omit the ‘with’ statement to get a correlation matrix */
Title ‘correlation of outcome with predictors’;
Run;
*NB: Caution, not a measure of a causal relationship, thrown off by extreme outliers, mulitcollinearity, etc…always graph the variables first (See PROC GPLOT)
PROC REG for simple linear regression.
Proc reg data= your.data;
Model outcome_var = pred_var;
Title ‘simple linear regression’;
Run;
Quit;
PROC LOGISTIC for categorical outcomes.
Proc logistic data = your.data;
Class cat_var (param=ref ref=”high”); /* class statement to define the variable
as a categorical predictor, use reference
cell coding with “high” as the referent */
Model outcome_var (event=’1’) = cat_var / clodds=pl /* pl= profile likelihood
which will give you better CI’s for small sample
sizes */
title1 ‘logistic regression’ ;
run;
PROC LOGISTIC for ordinal outcomes e.g. Likert Scales seen in many surveys*
proc logistic data=your.data desc; /* SAS will invoke ordinal analyses if outcome
variable contains more than 2 responses. The desc
option defines cumulative probability of being in one of ordered categories or lower */
model ordinal_outcome=var;
title 'Ordinal Logistic Regression Model’;
run;
*NB: if complex sampling strategies or correlated responses consider using SUDAAN or a general estimating equation
PROC VARCLUS to cut down on redundant predictor variables by grouping based on correlations and choosing one from each highly correlated group
proc varclus data=your.data maxeigen=.70 short; /* maxeigen sets the threshold for the
eigenvalue below which will no
longer cluster variables; short cuts
down on the voluminous output */
var var_1 var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12
var_13 var_14 var_15 var_16 var_17 var_18 var_19 var_20 var_21 ;
title 'Variable Clustering’';
run;
PROC GENMOD to conduct Poisson Regression for count data (non-negative, highly skewed and rare)*
proc genmod data=your.data;
class catvar1 (param=ref ref=first) catvar2 (param=ref ref=first); /* define
categorical variables in the model as
type with first level as the referent */
model depvar = catcar1 catvar2 / dist=poi link=log type3; /* specify model and the
distribution as poisson with log link type 3
requests likelihood statistics */
title ;
run;
* Consider negative binomial model (dist=nb) if overdispersed (μ > var)
PROC GENMOD to conduct Poisson Regression for rate data by specifying a population as an offset variable
proc genmod data=your.data;
class catvar1 (param=ref ref=first) catvar2 (param=ref ref=first);
model depvar = catcar1 catvar2 / offset=log_pop dist=poi link=log type3;
/* note must be log of rate denominator
e.g. population; result incidence density
ratio*/
title ;
run;
Time Series Analysis Tools
Time series analyses can be useful for evaluating health outcomes over time. You might, for example, be interested in determining if a disaster or other event had an effect on the occurance of some outcome and whether one could expect future occurances to change in pattern or frequency. Time series analysis can be conducted by invoking SAS’s PROC ARIMA and related procedures, but most recent versions of SAS come packaged with the menu-driven SAS Time Series Forecasting System (TSFS). The following notes present a very brief overview of an approach to times series data using SAS’s TSFS.
Typing dm ‘forecast’ in the editor window invokes the TSFS and leads to a screen where you specify a SAS data set that includes a time variable. If your data set does not appear as a time series, there is likely a problem with the time variable such as a missing observation. After specifying the file, click the Develop Models button to access most all tools. A Left Click of your mouse on a white area will bring up an interactive menu, including an option to fit models automatically which is useful to establish a benchmark against which to measure your own efforts. Before fitting a model, it is always useful to view the raw time series with an overlaid line (see the section on plotting above).
I. Descriptive and Exploratory Analysis:
You will need to account for trend and seasonality to fit an appropriate model. Look for trend and seasonality first by plotting the data. A smoothed time plot may be more instructive than some of the diagnostic tools. Use autocorrelation plots and statistics such as the Ljung-Box test for white noise and the Dickey-Fuller unit root test for stationarity to help you diagnose trend and seasonality.
A trend may be evident in a plot of in an autocorrelation function (ACF) partial autocorrelation function (PACF) or inverse autocorrelation function (IACF) with a significant lag at 1. The ACF will decay slowly from lag 1 and will have few significant values after first differencing. A non-significant unit-root test will become significant after differencing.
Seasonality presents as repetitive behavior on a plot at time S. There will be a significant ACF, PAC and/or IACF at lag S. There may be continued significant ACF values at lags that are multiples of S. The ACF will become non-significant after seasonal differencing. The seasonal unit root test becomes significant after seasonal differencing.
Note that a while a stochastic trend will generally convert to to white noise with simple differencing, a linear trend will not.
II. Fitting a Simple Model
Left click the Develop Models button. Review the plot, autocorrelations, and statistical test by clicking on them. Close out of the viewer. Left click a blank space of the Develop Models window. Start with Fit Model Automatically to get a benchmark against which to work
To fit a linear trend, left click the blank area again, and choose Fit Custom Model. Click the down arrow next to Trend Model and select Linear Trend > click OK to fit the model.
To specify a model with, say a quadratic trend, proceed as above, except select Trend Curve… rather than Linear Trend Other trend types are available and their choice will be driven by the form of any apparent trend in your data. Fit a number of models to compare.
To model seasonality, the two basic choices are seasonal dummy variables or differencing. In a simple model, you will likely choose a seasonal dummy or linear trend with a seasonal term. Select Develop Models è Select Fit Models from List è Select Seasonal Dummy or Linear Trend with Seasonal Terms
Interupted time series analyses using transfer functions to model events such as disasters or other impacts are conceptually complex and require knowledge of the underlying process and statistical theory. Interpret them cautiously. To analyze an interrupt term in SAS’s TSFS Select Fit Custom Model > Add > Interventions. Specify a date and whether the event has a point, step or ramp, specify decay pattern. The choice is informed by your knowledge of the event and its expected effect as well as how it appears on the time series plot. Choices include an Abrupt, Temporary Effect (a point transfer function with exponential decay), an Abrupt, Permanent Effect (a step function with no decay) and a Gradual, Permanent Effect (a step function with exponential decay that‘decays up’). When reviewing the parameter estimates associated with the event, look at the intercept estimate which reflects the level or average per month prior to the event, the point, step or ramp estimate which is the increase or decrease in the outcome due to event and at all the associated p values to see if any of the coefficients were statistically significant.
To obtain forecast values, first select your best-fitting model based on indices such as Root Mean Square Error (RMSE), Mean Absolute Percent Error (MAPE) or Akaike Information Criteria (AIC). Then close out of the developer window and click on Produce Forecastè Run è Output (Was previously grayed out, should now be active)
III. Box-Jenkins Methodology:
Box-Jenkins represents a powerful methodology that addresses trend and seasonality well. ARIMA models have a strong theoretical foundation and can closely approximate any stationary process. The process consists of model identification by using autocorrelation functions, evaluation by assessing the fit of the possible models and forecasting using the best model. Caution, these methods are applied to stationary time series. See the following section for how to proceed with the usual situation of non-stationary time series.
The steps in ARMA(p,q) identification are: (1) Ensure a stationary series by accounting for trend and seasonality (2) Determine the highest possible value of q from the highest significant ACF lag (3) Determine the highest possible value of p from highest significant PACF/IACF lag (4) Determine all ordered pairs (j, k) such that 0<j<p and 0<k<q (5) Fit an ARMA model for each ordered pair, and (6) Select model with smallest value of RMSE on a holdout sample or AIC on a fit sample
To fit an ARMA model left click in an empty section of the Develop Models window and choose Fit ARIMA Model…Specfiy the p and q values. To specify a holdout sample, click the Set Ranges… button on the Develop Models window. Type in the number of periods next to Hold-Out Sample .
To evaluate a model, first click the View Selected Models Graphically button found below the Browse button. You will see a plot of predicted values and original data. Select Prediction Errors for plot of residuals. Select Prediction Errors Autocorrelation . A good model will not have any spikes significantly different from zero. Under Prediction Error you will want to see white noise and significant unit root tests. In the Parameter Estimates window, the intercept represent the mean (μ ) of the series. Check if any of the parameters are significantly different from zero. Finally, review the Statistics of Fit (you will be presented with a number of different fit statistics) the Forecast Graph (to plot the forecast and its 95% CI) and the Forecast Table (to get same information in tabular fashion).
Once you have developed and assessed a model, you can use it to produce forecasts and save the forecasts to a data set. Close out of the Develop Models window. Select the Produce Forecasts… button on the Intro window. The default is to forecast all series in a data set and save the forecast SAS data set as work.forecast. Accept this simple format to forecast 12 time periods into the future. Click on Select… to forecast only a subset of the series. Format will allow you to change the format from simple to interleaved or concatenated. Horizon will allow you to change how many time periods into the future you want to forecast. Run starts the process. The Output button allows you to view the table.
IV. Modeling Non-Stationary Time Series:
It is rare to encounter purely stationary time series, so before applying Box-Jenkins methods you will need to (1) diagnose trend and seasonality (2) determine appropriate trend/seasonal components to use for modeling (3) check residuals after applying trend/seasonal components (4) verify that residuals appear stationary only then (5) apply methods from previous section
Box-Jenkins emphasizes differencing to achieve stationarity. You will rarely require greater than 2nd order differing for non-seasonal trends and 1st order differencing for seasonal trends. The TSFS is designed to make differencing as easy as possible. Begin by looking at both the ACF and PACF. Click the differencing buttons. Do the ACF and PACF graphs look ‘better’ (smaller bars)? If there is no trend or seasonality in the raw data, first differencing will create a trend so you will see bigger bars. If you believe there is a quadratic trend, a 2nd order differencning may be necessary. If first differencing does not work, you can apply linear trend modeling through the custom model window as described above.
The mechanics of seasonal differencing for a ARIMA(p,d,q)(P,D,Q) model proceeds similarly to trend differencing, where P is seasonal AR term, D is the seasonal differencing term and Q is the seasonal MA term.
Note that when you are specifying an ARIMA model in SAS’s TSFS specify the model both with and without an intercept term because when TSFS applies differencing the default is to omit the intercept term. You can duplicate and edit a model by left clicking the white are in the Develop Model window. This allows you to do things like adding a linear or quadratic trend to an ARIMA model. You can compare models head to head by using Tools Menu > Compare Models… feature in the Tools Menu. The objective is to select the best model using a robust list of accuracy criteria.
V. Adding Interrupt or Event Terms to ARIMA models
Interventions can be added through the Custom Model Specification window or the ARIMA Model Specification window if seasonal AR or MA terms are required. The ARIMA window has the most general model specification tools. Click Add è Interventions . Proceed as previously described, by specifying the Date of the onset of the event term, the Type and Effect Decay Pattern (where Exp indicates an order 1 decay and Wave describes an order 2 decay) |