Abstract: This is a brief guide to the essentials you need to know to write a Stata program file to read raw data. This sort of program is also referred to as a Stata dictionary file.
Important Note: If your data exists in the format of some other statistical or database application, e.g., SPSS, SAS, Excel, dBase, it may be simpler to convert that file to Stata using the application Stat/Transfer. This application is available in EDS.
Part 1 - Information You Need Before You Start:
Before you start to write a Stata program to read raw data, you need to know the following about your data. Most of this information can be found in the codebook.
- Your variable names:
Pick the variables you need from the codebook and note their mnemonic names, e.g., "sex", "status", "age_sex", "yr8st2a", "aidsknow", etc. Note that case counts: "YEAR" and "year" are two different variables.
If the codebook only has a description of each variable, you will have to make up variable names yourself. Stata variable names must meet the following criteria:
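- Eight characters or fewer in length. (This limit applies to older releases of Stata; newer releases accept names of up to 32 characters.)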
- Combinations of letters (A-Z or a-z), digits (0-9), and the underscore character (_) only!
- The first character of a variable name must be a letter or the underscore character (ex: _var1 or OCCUP).
- A variable name must not be one of Stata's reserved names. See the Stata User's Guide for a list. Most of these start with an underscore or are the names of data types, so if you start your variable names with a letter you are pretty safe.
- The start position of each variable, i.e., the first column of the variable's position in the record.
- Each variable's end position or its length, i.e., the number of columns the variable takes up.
- Whether the variable is string or numeric. String variables may contain numbers or may actually be numbers. Hint: If you know that all of the values of a string variable are actually numbers, define that variable as numeric.
- The number of decimal places desired, if any, for numeric variables.
- The input file name and location of the raw data file you want to read in. The file extensions ".raw" or ".dat" work best for raw data, although Stata will read in files of other extensions.
- You may want to label your variables with something longer than the 8-character variable name. Variable labels can be added later, however.
- If you are very ambitious, you can also add value labels to individual variables, e.g., "Male", "Female", "Unknown". These can also be added later.
Part 2 - Putting it Together:
The Stata dictionary program can be written with any editor or word processor. Be sure to save it as an ASCII file with the file extension .dct.
- A Stata dictionary program begins with a line that looks like this:
dictionary using mydata.dat {
where "mydata.dat" is the name of your raw data file. If the file is somewhere besides the Stata directory (or your home directory on CUNIX), add the path name, e.g.:
dictionary using d:\MyDocs\sz2\mydata.dat {
dictionary using ~sz2/data/mydata.dat {
You don't need quotes around the name of the data file unless the file or directory name contains spaces or other odd characters (a bad practice).
- Then come the definitions of the individual variables. Each variable is defined by a line with the following 5 items:
- The directive _column (note the leading underscore) followed by the starting column of the variable in parentheses.
- The variable type (usually you only need to indicate string variables).
- The mnemonic name of the variable
- The variable input format, which consists of:
- a "%" sign
- a number stating the variable width
- a "." (period) followed by a number indicating the number of decimal places (omitted for integers and string variables)
- a letter indicating the format: f for numbers and s for strings.
Some examples of input formats:
%2f      2-column integer variable
%12s     12-column string variable
%8.2f    8-column number with 2 implied decimal places
(Note: periods actually typed in the data override formats declared in the program.)
- You can add a label (optional). Labels can be up to 80 characters long.
- The program ends with a "}" (closing brace). The last line must also end with a return character: before you save the file, move your cursor to the beginning of the line below the "}". Finally, save your file with the file extension .dct, e.g., test.dct.
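Put together, the pieces described above give a file of roughly the following shape. This is only a sketch: the file name "my data.dat", the variable names resp and income, their column positions, and their labels are all made up for illustration. Lines beginning with * are comments.

```stata
dictionary using "my data.dat" {
    * quotes are needed here only because the file name contains a space
    * _column(start)  type   name    format  "optional label"
    _column(1)        str10  resp    %10s    "Respondent name"
    _column(11)              income  %8.2f   "Annual income"
}
```

The type is left off the second definition because, as noted above, it usually only needs to be stated for string variables.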
Part 3 - Example:
Here's the codebook:
Variable  Description         Columns  Format
IDNUM     Assigned ID Number  1-3
FNAME     First Name          4-15     String
LNAME     Last Name           16-27    String
AGE       Age at Death        28-29
SEX       Sex                 30
            1=male
            2=female
BYEAR     Birth Year          31-34
DYEAR     Death Year          35-38
STATUS    Status              39
            1=poor
            2=middle class
            3=rich
INTAX     Inheritance Tax     40-47    2 implied decimals
And here's what the Stata dictionary program for the above data looks like:
dictionary using test.dat {
_column(1) idnum %3f
_column(4) str12 fname %12s
_column(16) str12 lname %12s
_column(28) age %2f
_column(30) sex %1f
_column(31) byear %4f "Year of Birth"
_column(35) dyear %4f "Year of Death"
_column(39) status %1f "Socioeconomic Status"
_column(40) intax %8.2f "Inheritance Tax"
}
In the example above, "fname" and "lname" are 12-column string variables and "intax" has 2 decimal places. Only "byear", "dyear", "status", and "intax" have labels, as the other mnemonic variable names are obvious. Also note that the names of the variables are in lower case. This just makes for easier typing when you get to the analysis stage. You could have used upper case, but remember that case matters: "IDNUM" is not the same as "idnum".
Note that you don't have to write a definition line for every variable in the dataset. You can skip the ones you don't need.
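For instance, a dictionary that reads only the ID number, age, and inheritance tax from the same raw file could look like the sketch below. The columns in between are simply never read, because each _column() jumps directly to the stated position.

```stata
dictionary using test.dat {
    * only three of the nine codebook variables are defined
    _column(1)   idnum  %3f
    _column(28)  age    %2f
    _column(40)  intax  %8.2f  "Inheritance Tax"
}
```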
Part 4 - Executing the Program:
To run the Stata dictionary program, start up Stata and give the command:
infile using filename
where "filename" is the name of your file, e.g., test.dct. You don't have to type the .dct extension. If all is well, you will see the program appear on the screen followed by the message that it has read N observations. Check that N is the right number of observations in your dataset. Check on the variables with the describe command:
. describe
Contains data
obs: 15
vars: 9
size: 840 (99.8% of memory free)
-------------------------------------------------------------
1. idnum float %9.0g
2. fname str12 %12s
3. lname str12 %12s
4. age float %9.0g
5. sex float %9.0g
6. byear float %9.0g Year of Birth
7. dyear float %9.0g Year of Death
8. status float %9.0g Socioeconomic Status
9. intax float %9.0g Inheritance Tax
-------------------------------------------------------------
Sorted by:
Note that all of the numeric variables have the type "float". This is inefficient, since most of them hold small integers that fit in a smaller storage type. To change them to their most efficient type, give the command compress.
. compress
idnum was float now byte
age was float now byte
sex was float now byte
byear was float now int
dyear was float now int
status was float now byte
fname was str12 now str8
lname was str12 now str9
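This is also a convenient point to add the labels that were left out of the dictionary. The following is a minimal sketch using the codes from the codebook in Part 3; the value-label names sexlbl and statlbl are arbitrary choices made for this example.

```stata
* variable labels for the variables left unlabeled in the dictionary
label variable age "Age at Death"
label variable sex "Sex"

* value labels: define the mapping once, then attach it to the variable
label define sexlbl 1 "Male" 2 "Female"
label values sex sexlbl

label define statlbl 1 "Poor" 2 "Middle class" 3 "Rich"
label values status statlbl
```

Labels added this way are stored with the dataset when it is saved.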
Now, save this as a Stata dataset with the command:
save mydata
where "mydata" is the name of the Stata dataset. You don't have to add the .dta extension. The default location is the Stata directory. If you want to save it elsewhere, add the path information.
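As an illustration, the commands below save the dataset under the Windows directory used in the earlier path example and then load it back. The replace option (overwrite an existing file of the same name) and the clear option (discard the data currently in memory) are additions not discussed above; leave them off if you would rather have Stata stop with an error.

```stata
* save to an explicit location; Stata adds the .dta extension itself
save "d:\MyDocs\sz2\mydata", replace

* later, load the saved dataset back into memory
use "d:\MyDocs\sz2\mydata", clear
```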
As always, CHECK YOUR DATA. Stata has a nice summarize command to give you summary statistics, but there is no substitute for doing frequencies (tab1) on the variables you will be using in analysis.
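A first pass at these checks, using the variables from the Part 3 example, might look like this. Which variables to tabulate is a judgment call; the ones chosen here are simply the categorical ones from the codebook.

```stata
* number of observations; should match the count reported by infile
count

* means, minima, and maxima help reveal out-of-range values
summarize age byear dyear intax

* one-way frequency tables for the categorical variables
tab1 sex status

* compare the first few records against the raw file by eye
list in 1/5
```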
Example of a Program to Read Data with Multiple Records/Case:
Here's the data layout:
Variable  Description            Record  Columns  Format
idnum     Assigned ID Number     1       1-4
treetype  Type of Tree           1       5-6
idnum     Assigned ID Number     2       1-4
soilphn   Soil PH - North Side   2       5-7      (2 decimals)
soilphe   Soil PH - East Side    2       8-10     (2 decimals)
soilphs   Soil PH - South Side   2       11-13    (2 decimals)
soilphw   Soil PH - West Side    2       14-16    (2 decimals)
idnum     Assigned ID Number     3       1-4
height    Height of Tree         3       5-9      (1 decimal)
circ      Circumference of Tree  3       10-14    (1 decimal)
And here's what the Stata dictionary program for the above data looks like:
dictionary using tree.dat {
_lines(3)
_line(1)
_column(1) idnum %4f
_column(5) treetype %2f
_line(2)
_column(5) soilphn %3.2f "Soil PH - North Side"
_column(8) soilphe %3.2f "Soil PH - East Side"
_column(11) soilphs %3.2f "Soil PH - South Side"
_column(14) soilphw %3.2f "Soil PH - West Side"
_line(3)
_column(5) height %5.1f
_column(10) circ %5.1f
}
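Reading the file works the same way as in Part 4. The sketch below assumes the dictionary above was saved as tree.dct, a file name chosen here for illustration. Because of _lines(3), every three lines of tree.dat become one observation, so the number of observations reported should be one third of the number of lines in the raw file.

```stata
* read the three-records-per-case file through its dictionary
infile using tree

* confirm that each tree produced a single observation with all variables
describe
list in 1/3
```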

