Abstract: Stata is a general-purpose statistical package. This document offers a brief introduction to Stata on CUNIX, with several examples of reading ASCII data.
Stata 9 and Stata 9 SE (Special Edition) are currently available on the CUNIX cluster along with their X-Windows versions. Stata 10 and Stata 10 SE are also available. The commands to invoke them are:
- stata -> Stata 9
- xstata -> X-Windows version of Stata 9
- stata-se -> Stata 9 SE
- xstata-se -> X-Windows version of Stata 9 SE
- stata10 -> Stata 10
- stata10-se -> Stata 10 SE
The X-Windows versions of Stata 10 are not currently available.
Individual copies of Stata for Windows, Macintosh, and Linux are available to Columbia University students, faculty, and staff at a significant discount from their normal prices. See this URL for site license information.
Sections:
- Documentation
- Stata Help
- Using and Saving "Stata Datasets"
- Some Useful Commands
- Reading Ascii Data into Stata
- How to Increase Memory
- Creating a Log File
- Stata in Batch Mode
Documentation:
There are many Stata manuals. The most important are:
- Getting Started: a brief overview.
- Stata User's Guide: a thorough step-by-step overview of Stata's features.
- Stata Reference Manual: a logically organized description of all Stata commands in 3 volumes.
See the Stata Bookstore for a list of other documentation and books on Stata and statistics.
A set of current manuals is in the Electronic Data Service in 215 Lehman Library. Older versions are also available for reference in the Lehman Library Permanent Reserves section.
NOTE: Just about everything in the printed manuals, and a lot more, is available with the "help" command.
Stata Help:
Stata has extensive interactive help.
- The "help" command. Use help when you know the Stata word or phrase you need help on.
- The "search" command. Use search when you are not sure of the name of the command or are looking for information on a topic. It searches a keyword database and the Internet.
- The "findit" command. This is like "search" but searches for information on a topic across all sources including the online help, the FAQs at the Stata web site, the Stata Journal, and all Stata-related internet sources including user-written additions. From findit, you can click to go to a source or to install additions.
- The "describe" command. This tells you about your active dataset and its variables. Use "describe, short" if you don't want to see the list of variables. If you want to find a certain variable but are unsure of its name, use the wildcard symbol "*", e.g., "describe in*" will list all the variables starting with "in" and "describe *9" will list all the variables ending in "9".
- The "lookfor" command. Use this to search through variable names and variable labels for a string, e.g., "lookfor in" will find the variable "income" as well as variables whose variable labels are "ethnic minority" and "moves since age 16".
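As a rough illustration of these commands, the short session below uses "auto", an example dataset that ships with Stata and is loaded with sysuse. The dataset, the search words, and the variable patterns are assumptions chosen only for this sketch.

```stata
* Load an example dataset shipped with Stata (an assumption for this sketch)
sysuse auto, clear

* Help on a command whose name you already know
help summarize

* Keyword search when you are unsure of the command name
search fixed format

* Overview of the dataset without the list of variables
describe, short

* Describe only the variables whose names start with "m"
describe m*

* Look for "weight" in variable names and variable labels
lookfor weight
```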
Using and Saving Stata Datasets:
A "Stata Dataset" is one in the special Stata format. Stata datasets have the extension ".dta". To open a Stata dataset, type the use command followed by the name of the dataset.
use survey1
If you only need some of the variables from a Stata Dataset, you can just read in those variables with this variant of the use command:
use age sex status using survey1
To save a Stata Dataset, type the save command plus a filename. You do not have to type the file extension. The extension will be ".dta" by default. If the file already exists, you will need the replace option.
save survey2, replace
Note about versions of Stata: a dataset created by the most recent version of Stata cannot be read by versions earlier than version 8. To create a file that can be read by version 7 of Stata, use the saveold command.
saveold survey2
If you are using Stata/SE and want to save the dataset for use in the smaller, Intercooled version of Stata, use the intercooled option on the save command:
save survey1, intercooled
String variables must be less than 80 characters long to be saved in Intercooled format.
Some Useful Stata Commands:
- describe - Describes the currently active data file, showing the number of observations and variables, size of file, names and types of variables. describe, short gives info about the file but not the variables.
- summarize - Gives summary statistics. You can give it an argument of a list of variables.
- codebook - Creates a simple codebook describing your data.
- clear - Clear everything from memory, including your data, value labels, equations, etc. In effect, it resets Stata.
- memory - Check memory allocations.
- list - Lists all or part of the currently active data. The command can be quite complex. You can give it a list of variables, a range of observations, and conditional matches. For example, the command:
list age status in 1/100 if age>14
lists the variables "age" and "status" for those aged over 14 among the first 100 observations. Be careful about using list. If you have a very large dataset, the listing will go on and on.
- tab1 [varname] - Do simple frequencies on a variable. It is always worth knowing what the data you're working with looks like.
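The commands above can be tried in a short session like the following. This is only a sketch: it assumes the example dataset "auto" that ships with Stata, and the variable names (make, price, mpg, weight, rep78, foreign) belong to that dataset, not to your own data.

```stata
* Load an example dataset shipped with Stata (an assumption for this sketch)
sysuse auto, clear

* File-level information only, without the variable list
describe, short

* Summary statistics for a few variables
summarize price mpg weight

* Codebook entry for a single variable
codebook rep78

* List three variables for cars over $5,000 among the first 10 observations
list make price mpg in 1/10 if price > 5000

* One-way frequency tables for each listed variable
tab1 rep78 foreign
```

Restricting list with "in" and "if", as here, keeps the output short even on a large dataset.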
Reading ASCII Data into Stata:
The two most common commands to read data from an ASCII file into Stata are insheet and infile:
- insheet - Use insheet if the file was created by a spreadsheet or a database program with one observation per line and the variable delimiter is a comma or a tab character. If you are coming from Excel, create a .csv file. The first line can be a list of variable names. A period (.) is understood to mean a numeric missing value; double quotes ("") to mean a missing string value.
- infile - Use infile to read fixed-format raw data without delimiters using a dictionary file. See the example below.
The syntax and examples are below.
Insheet:
Syntax: insheet using filename, options
where "filename" is the name of the ASCII file created by the spreadsheet or database program. By default, Stata will assign the names v1, v2, ..., vn to the variables. If you saved the spreadsheet file with variable names in the first row, Stata can use them if you specify the option "names", for example:
insheet using mystuff.dat, names
If you didn't save the spreadsheet file with variable names, you can assign them later with the rename command and describe them with the label variable command. If you have a lot of variables, put all of these commands in a .do file.
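A minimal sketch of reading a file without a header row and then naming the variables by hand might look like this. It assumes, purely for illustration, that the first two columns of mystuff.dat hold an ID and an age; the new names and labels are hypothetical.

```stata
* Read the delimited file; with no header row the variables are named v1, v2, ...
insheet using mystuff.dat, clear

* Give the first two variables meaningful names (hypothetical names)
rename v1 id
rename v2 age

* Attach descriptive variable labels (hypothetical labels)
label variable id "Respondent ID"
label variable age "Age in years"

* Check the result
describe
```

Saving these lines in a .do file lets you rerun them whenever the raw file changes.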
Infile with a Dictionary File:
Syntax: infile using dictionary-file
where "dictionary-file" is the dictionary file containing the specifications for reading the variables. If the variables in your data file are not delimited, you need a dictionary file to describe the positions of your variables to Stata. Here's an example:
dictionary using dump.dat {
    _column(1) id %5f
    _column(6) age %2f
    _column(8) str1 sex %1s
    _column(9) str1 status %1s
}
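To use the example dictionary, you would save it in its own file and then point infile at that file. The sketch below assumes the dictionary was saved as dump.dct in the same directory as dump.dat; both the dictionary file name and the name of the saved dataset are hypothetical.

```stata
* Read dump.dat according to the layout in dump.dct (hypothetical file name)
infile using dump.dct, clear

* Confirm that the variables and their types came in as intended
describe

* Inspect the first few observations
list in 1/5

* Keep the result as a Stata dataset (hypothetical file name)
save dump, replace
```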
Increasing Memory:
If you get the message "No more room for observations" (as opposed to variables), you don't have enough memory to read in your entire Stata dataset. The command, "memory", gives a report on memory usage. To increase memory, give the command:
set memory #m
where "#" is a number and "m" stands for megabytes. For example, "set memory 100m" requests 100 megabytes.
You may be able to reduce your memory requirements by storing your data more efficiently. Stata's default numeric storage type is float (4 bytes), and some data files use 8-byte doubles. Both are unnecessarily large for most social science data. Use Stata's "compress" command to reduce your data to its most efficient format and then resave your file.
Note: The "compress" command does not create a compressed version of your file in the way that compression utilities such as gzip or pkzip do. Rather, the Stata "compress" command changes each variable to the smallest storage type that can hold its values without losing information. See the Stata manual for more information on Stata variable types.
Further note: If you get the message "No more room for variables" (as opposed to observations), you have too many variables. Intercooled Stata has an absolute limit of 2,047 variables (2**11 - 1). Stata SE has a limit of 32,767 (2**15 - 1). If you know the names of the variables, you can read in only the ones you need. Since Stata works almost entirely in memory, the fewer the variables (and observations), the faster it runs.
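Putting these steps together, a session might look like the sketch below. It assumes the Stata 9 and 10 versions described in this document, where the memory setting can only be changed while no data are loaded, and it reuses the example dataset name survey1; the 200 megabyte figure is arbitrary.

```stata
* Memory can only be changed when no data are in memory
clear

* Request 200 megabytes (an arbitrary figure for this sketch)
set memory 200m

* Load the dataset, then shrink each variable to its smallest lossless type
use survey1
compress

* Resave so the smaller storage types are kept for next time
save survey1, replace

* Report current memory usage
memory
```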
The log File:
To start logging your session on cunix, give the Stata command:
log using filename
where "filename" is the name of the file. It can be any name. ".smcl" will be the file extension. This is a Stata-proprietary format. To log to an ordinary ASCII file use the t option:
log using filename, t
This will save the log to "filename.log".
If the file already exists, Stata stops with an error unless you add the append option (to add to the existing file) or the replace option (to overwrite it). Logging can be turned on and off any number of times during a Stata session. The log file closes automatically when you exit Stata. If you use the t option, the log file is an ASCII file, so you can edit it with any Unix editor (pine, emacs, vi).
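A logged session could be sketched as follows. The log name mysession is hypothetical, the option is spelled out in full as text, and survey1 is the example dataset used earlier in this document.

```stata
* Start a plain-text log, overwriting mysession.log if it already exists
log using mysession, text replace

use survey1, clear
summarize

* Suspend logging; commands issued now are not recorded
log off
describe

* Resume logging
log on
codebook

* Close the log file explicitly
log close
```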
Stata in Batch Mode:
It is possible to run Stata in batch mode on CUNIX. Prepare a file with the Stata commands you want executed. Be sure to turn paging off (set more off). Save the file with the extension ".do". Then run Stata in the background with this command:
stata -q -b do mybatch > /dev/null &
where "mybatch" is the filename of the file containing your Stata commands. Output will be in a file with the same name as the input "do" file and with the file extension "log". (The "> /dev/null" gets rid of the "running on computer" message.)
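As an illustration, a small batch file might contain the lines below. This is only a sketch: the contents are hypothetical and reuse the example dataset survey1 from earlier in this document.

```stata
* mybatch.do: a hypothetical batch job

* Turn paging off so the job never waits for a keypress
set more off

* Load the data and produce some output
use survey1, clear
describe
summarize
```

Following the description above, running this file in batch mode would leave its output in mybatch.log.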

