Stata: How to Write a Dictionary Program to Read Raw Data

Abstract: This is a brief guide to the essentials you need to know to write a Stata program file to read raw data. This sort of program is also referred to as a Stata dictionary file.

Important Note: If your data exists in the format of some other statistical or database application, e.g., SPSS, SAS, Excel, dBase, it may be simpler to convert that file to Stata using the application Stat/Transfer. This application is available in EDS.


Part 1 - Information You Need Before You Start:

Before you start to write a Stata program to read raw data, you need to know the following about your data. Most of this information can be found in the codebook.

  1. Your variable names:

    Pick the variables you need from the codebook and note their mnemonic names, e.g., "sex", "status", "age_sex", "yr8st2a", "aidsknow", etc. Note that case counts: "YEAR" and "year" are two different variables.

    If the codebook only has a description of each variable, you will have to make up variable names yourself. Stata variable names must meet the following criteria:

    • Eight characters or fewer in length (Stata 7 and later allow up to 32).
    • Combinations of letters (A-Z or a-z), digits (0-9), and the underscore character (_) only!
    • The first character of a variable name must be a letter or the underscore character (ex: _var1 or OCCUP).
    • A variable name must not be one of Stata's reserved names. See the Stata User's Guide for a list. Most of these start with an underscore or are the names of data types, so if you start your variable names with a letter you are pretty safe.

  2. The start position of each variable, i.e., the first column of the variable's position in the record.

  3. Each variable's end position or its length, i.e., the number of columns the variable takes up.

  4. Whether the variable is string or numeric. String variables may contain numbers or may actually be numbers. Hint: If you know that all of the values of a string variable are actually numbers, define that variable as numeric.

  5. The number of decimal places desired, if any, for numeric variables.

  6. The input file name and location of the raw data file you want to read in. The file extensions ".raw" or ".dat" work best for raw data, although Stata will read in files with other extensions.

  7. You may want to label your variables with something longer than the short variable name. These labels can be added later, however.

  8. If you are very ambitious, you can also add value labels to individual variables, e.g., "Male", "Female", "Unknown". These can be added later, however.
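
If you decide to add labels later, after the data have been read in, a minimal sketch looks like the following. The numeric codes 1, 2, and 9 and the label name "sexlbl" are assumptions made for illustration; use the codes given in your own codebook.

```stata
* Attach a descriptive label to a variable
label variable sex "Sex of respondent"

* Define a set of value labels once (codes here are hypothetical) ...
label define sexlbl 1 "Male" 2 "Female" 9 "Unknown"

* ... then attach that set to the variable
label values sex sexlbl
```
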

Part 2 - Putting it Together:

The Stata dictionary program can be written with any editor or word processor. Be sure to save it as an ASCII file with the file extension .dct.

  1. A Stata dictionary program begins with a line that looks like this:
         dictionary using mydata.dat  {
    
    where "mydata.dat" is the name of your raw data file. If the file is somewhere besides the Stata directory (or your home directory on CUNIX), add the path name, e.g.:
         dictionary using d:\MyDocs\sz2\mydata.dat  {
         dictionary using ~sz2/data/mydata.dat  {
    
    You don't need quotes around the name of the data file unless it has spaces or other odd characters in the file or directory name (a bad practice).

  2. Then come the definitions of the individual variables. Each variable is defined by a line with the following items, plus an optional label:
    • The directive _column (note the leading underscore) followed by the starting column of the variable in parentheses.
    • The variable type (usually you only need to indicate string variables).
    • The mnemonic name of the variable.
    • The variable input format, which consists of
      • a "%" sign
      • a number stating the variable width
      • a "." (period) followed by a number indicating the number of decimal places (omitted for integers and string variables)
      • a letter indicating the format. The format is f for numbers and s for strings.

        Some examples of input formats:

             %2f         2 column integer variable
             %12s        12 column string variable
             %8.2f       8 column number with 2 implied decimal places.  
        
             (Note:  periods actually typed in the data override formats
                             declared in the program.)
        
    The format statement is actually more complicated than the above, but this will do for most data. See "infile" in the Reference Manual for more information if you have numbers in scientific notation or numbers with commas.

  3. You can add a label (optional) at the end of the line, enclosed in double quotes. Labels can be up to 80 characters long.

  4. The program ends with a "}" (closing brace). The file must also end with a return character: before you save the file, move your cursor to the beginning of the line below the "}". Finally, save your file with the file extension .dct, e.g., test.dct.
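
The pieces described above fit together roughly as in the sketch below. The path, the file name, and the variables "name" and "score" are made up for illustration; the quotes are there only because this hypothetical path contains spaces.

```stata
dictionary using "d:\My Docs\survey 2001.dat" {
    * Lines that begin with an asterisk are comments.
    * Items per line: _column(start), type (strings only), name, input format, optional label
    _column(1)   str10  name    %10s   "Respondent name"
    _column(11)         score   %3f    "Test score"
}
```
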

Part 3 - Example:

Here's the codebook:

     Variable  Description         Columns    Format        
                                                       
     IDNUM     Assigned ID Number     1-3                
     FNAME     First Name             4-15    String
     LNAME     Last Name             16-27    String
     AGE       Age at Death          28-29                  
     SEX       Sex                   30                  
                 1=male                                     
                 2=female                                   
     BYEAR     Birth Year            31-34                  
     DYEAR     Death Year            35-38                  
     STATUS    Status                39                  
                  1=poor                                    
                  2=middle class
                  3=rich
     INTAX     Inheritance Tax       40-47   2 implied decimals       

And here's what the Stata dictionary program for the above data looks like:

     dictionary using test.dat  {
         _column(1)         idnum    %3f
         _column(4)   str12 fname    %12s
         _column(16)  str12 lname    %12s
         _column(28)        age      %2f
         _column(30)        sex      %1f
         _column(31)        byear    %4f      "Year of Birth"
         _column(35)        dyear    %4f      "Year of Death"
         _column(39)        status   %1f      "Socioeconomic Status"
         _column(40)        intax    %8.2f    "Inheritance Tax"
     }

In the example above, "fname" and "lname" are 12-column string variables and "intax" has 2 decimal places. Only "byear", "dyear", "status", and "intax" have labels, as the other mnemonic variable names are self-explanatory. Also note that the names of the variables are in lower case, which makes for easier typing when you get to the analysis stage. You could have used upper case, but remember that case matters: "IDNUM" is not the same as "idnum".

Note that you don't have to write a definition line for every variable in the dataset. You can skip the ones you don't need.
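
For instance, a dictionary that reads only three of the variables from the codebook above might look like this sketch. It reuses the columns and formats from the full example; the choice of variables is arbitrary.

```stata
dictionary using test.dat {
    * Only idnum, sex, and intax are read; all other columns are ignored.
    _column(1)         idnum    %3f
    _column(30)        sex      %1f
    _column(40)        intax    %8.2f    "Inheritance Tax"
}
```
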

Part 4 - Executing the Program:

To run the Stata dictionary program, start up Stata and give the command:

     infile using filename

where "filename" is the name of your file, e.g., test.dct. You don't have to type the .dct extension. If all is well, you will see the program appear on the screen followed by the message that it has read N observations. Check that N is the right number of observations in your dataset. Check on the variables with the describe command:

     . describe
     
     Contains data
       obs:            15
      vars:             9
      size:           840 (99.8% of memory free)
     -------------------------------------------------------------
        1. idnum     float  %9.0g
        2. fname     str12  %12s
        3. lname     str12  %12s
        4. age       float  %9.0g
        5. sex       float  %9.0g
        6. byear     float  %9.0g                  Year of Birth
        7. dyear     float  %9.0g                  Year of Death
        8. status    float  %9.0g                  Socioeconomic Status
        9. intax     float  %9.0g                  Inheritance Tax
     -------------------------------------------------------------
     Sorted by:

Note that all of the numeric variables have the type "float". This is inefficient. To change them to their most efficient type, give the command compress.

    . compress
    idnum was float now byte
    age was float now byte
    sex was float now byte
    byear was float now int
    dyear was float now int
    status was float now byte
    fname was str12 now str8
    lname was str12 now str9

Now, save this as a Stata dataset with the command:
     save mydata
where "mydata" is the name of the Stata dataset. You don't have to add the .dta extension. The default location is the Stata directory. If you want to save it elsewhere, add the path information.

As always, CHECK YOUR DATA. Stata has a nice summarize command to give you summary statistics but there is no substitute for doing frequencies (tab1) on the variables you will be using in analysis.
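
The steps in this part can be collected into a short sequence of commands, sketched below. It assumes the dictionary was saved as test.dct in the current directory; the "clear" and "replace" options are additions not discussed above and are only needed when data are already in memory or mydata.dta already exists.

```stata
* Read the raw data using the dictionary test.dct
infile using test, clear

* Inspect variable names, types, and labels
describe

* Store each variable in its most efficient type
compress

* Check the data: summary statistics, then frequencies
summarize
tab1 sex status

* Save as a Stata dataset (mydata.dta)
save mydata, replace
```
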

Example of a Program to Read Data with Multiple Records per Case:

Here's the data layout:

     Variable  Description             Record     Columns    Format        
                                                                            
     idnum     Assigned ID Number        1          1-4                   
     treetype  Type of Tree              1          5-6
     idnum     Assigned ID Number        2          1-4                   
     soilphn   Soil PH - North Side      2          5-7      (2 decimals)
     soilphe   Soil PH - East Side       2          8-10     (2 decimals)
     soilphs   Soil PH - South Side      2         11-13     (2 decimals)
     soilphw   Soil PH - West Side       2         14-16     (2 decimals)
     idnum     Assigned ID Number        3          1-4
     height    Height of Tree            3          5-9      (1 decimal)
     circ      Circumference of Tree     3         10-14     (1 decimal)

And here's what the Stata dictionary program for the above data looks like:

     dictionary using tree.dat  {
         _lines(3)
         _line(1)
             _column(1)         idnum       %4f
             _column(5)         treetype    %2f
         _line(2)
             _column(5)         soilphn     %3.2f   "Soil PH - North Side"
             _column(8)         soilphe     %3.2f   "Soil PH - East Side"
             _column(11)        soilphs     %3.2f   "Soil PH - South Side"
             _column(14)        soilphw     %3.2f   "Soil PH - West Side"
         _line(3)
             _column(5)         height      %5.1f
             _column(10)        circ        %5.1f
     }
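
Running this dictionary works the same way as in Part 4. The sketch below assumes the dictionary was saved under the hypothetical name tree.dct; because each case spans three records, the number of observations read should be the number of lines in tree.dat divided by three.

```stata
* Read three raw records per observation, as declared by _lines(3)
infile using tree, clear

* Number of observations read
count

* Look at the first case to confirm the records were combined correctly
list idnum treetype soilphn height circ in 1
```
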

Sue Zayac
Electronic Data Service