Stata: How to Read Hierarchical Files in Stata

EDS > Statistical Software > Stata > Reading Hierarchical Files
printer friendly version Print
Page

Abstract: Hierarchical datasets are datasets organized on two levels, eg. a dataset may have information at both the household and the individual level, where the individuals are also part of the household. This document comes from the StataCorp as part of their NetCourse program and briefly describes how to read hierarchical files using Stata.


Can infile read in a hierarchical dataset?

Title: Reading a hierarchical dataset with infile.
Author: William Gould, StataCorp.
Date: January 1996.
(c) Copyright 1997 StataCorp.

In hierarchical datasets, the records do not all have the same format nor do they contain the same type of information. In computer jargon, the file is organized so that the number and types of fields depend on the record types. For example, you may have a dataset that has records for families and records for persons within family.

In order to read these types of datasets, you need to create a dictionary for each type of record. You then read in all of the data using each of the dictionaries and keep only the appropriate data. Once you have these two Stata datasets, you can merge them together. This is a difficult data management problem, but manageable if you break the problem down into several steps. We cover this in an extended example in our NetCourses.

The following answer is 3 pages (out of 141 pages) from NetCourse 151.


Reading a hierarchical dataset

Now let's consider reading a hierarchical dataset with Stata. This is tricky.

You have data on families and persons within families. The data has the format of family record followed by one or more person records:

                  family record
                  person record
                  ...
                  person record
                  family record
                  etc...

Let's assume:

          family record:                          person record:
          col. 1-5   family id                    col. 1-5  person id
          col. 7     "1"                          col. 7    "2"
          col. 9     dwelling type code           col. 8-9  age
                                                  col. 11   sex code

The data probably contains more information than this, but this is enough for illustration. Note that if column 7 contains a "1", it is a family record and if it contains a "2", it is a person record. This is called the record-type indicator. I want to create a Stata .dta dataset containing

            1. family id
            2. dwelling code
            3. person id
            4. age
            5. sex code

My dataset will contain one observation per person and the family information will be repeated for persons in the same family.

First, I will create separate dictionaries for reading the family and person information for family.dct:

          dictionary using hier.raw {
                          long    famid   %5f     "family id"
          _column(7)      byte    rectype %1f     "record type"
          _column(9)      byte    dwell   %1f     "dwelling code"
          }

and for person.dct:

          dictionary using hier.raw {
                          long    perid   %5f     "person id"
          _column(7)      byte    rectype %1f     "record type"
          _column(8)      byte    age     %2f     "age (years)"
          _column(11)     byte    sex     %1f     "sex code"
          }

I will then test each one of these dictionaries interactively, just to be sure they work:

          . clear
          . infile using family if rectype==1 in 1/100
          . list in 1/5
          . type hier.raw                 <- I'll press Break to stop this

          . clear
          . infile using person if rectype==2 in 1/100
          . list in 1/5
          . type hier.raw                 <- I'll press Break to stop this

What I'm doing above is reading a small amount of data, typing the original, and comparing them. Satisfied I have good dictionaries, I then create a do-file to read the entire dataset. The basic plan of my do-file is

  1. read the family records and save them in a data file
  2. read the person records
  3. merge the person and family records

This problem would be easy if the person records contained the family id to which they belonged --step 1 would be an infile ... if rectype==1 followed by a sort and save, step 2 would be an infile ... if rectype==2 followed by a sort, and step 3 would be a merge. My whole do-file would be

          clear                                           /* step 1 */
          infile using family if rectype==1
          sort famid
          save tmph, replace
                                                          /* step 2 */
          clear
          infile using person if rectype==2
          sort famid
                                                          /* step 3 */
          merge famid using tmph

How easy it would be. In my example, however, the famid does not appear on the person records (just as it does not in data released by the U.S. Census and the U.S. Bureau of Labor Statistics).

This adds significantly to the complication. I'm going to read the family records the same way as above, but in addition, I'm going to manufacture my own family id variable, labeling the first family 1, the second 2, and so on:

          clear                                           /* modified step 1 */
          infile using family if rectype==1
          gen long id = _n
          sort id
          save tmph, replace

Next, when I read the person data, I am going to read the family records as if they were person records, too. The result will be that I have a placeholder observation for the family record:

               perid  rectype        age      sex
          1.                1       junk     junk
          2.       1        2         32        0
          3.       2        2         30        1
          4.    junk        1       junk     junk
          5.       1        2         40        1
          etc.

I will then remanufacture my temporary family id variable and then discard the misread family records.

To regenerate the id variable, I will first gen id=1 if rectype==1,

               perid  rectype        age      sex     id
          1.    junk        1       junk     junk      1
          2.       1        2         32        0      .
          3.       2        2         30        1      .
          4.    junk        1       junk     junk      1
          5.       1        2         40        1      .
          etc.

and then replace id=sum(id),

               perid  rectype        age      sex     id
          1.    junk        1       junk     junk      1
          2.       1        2         32        0      1
          3.       2        2         30        1      1
          4.    junk        1       junk     junk      2
          5.       1        2         40        1      2
          etc.

and finally drop if rectype==1:

               perid  rectype        age      sex     id
          1.       1        2         32        0      1
          2.       2        2         30        1      1
          3.       1        2         40        1      2

I will then be able to merge my family data with my person data. So the outline of my do-file is

          clear                                           /* modified step 1 */
          infile using family if rectype==1
          gen long id = _n
          sort id
          save tmph, replace
                                                          /* modified step 2 */
          clear
          infile using person /* no if! */
          gen long id = 1 if rectype==1
          replace id = sum(id)
          sort id
                                                          /* modified step 3 */
          merge id using tmph
          drop id

Some final details:

  1. The infile using person will generate lots of warning messages --every time it tries to (mis)read a family record with the person format, it will complain. A quietly in front of the infile will patch this problem.
  2. My outline just assumes rectype takes on the values 1 and 2 and the documentation claims and that everything merges. I need to include some checks.

Thus, my final do-file is in crhier.do

          capture log close
          log using crhier, replace
          clear
          infile using family if rectype==1
          drop rectype
          gen long id = _n                /* make my own temporary id var */
          sort id                         /* to set sort markers */
          save tmph, replace

          clear
          quietly infile using person     /* no matter what the rectype */
          assert rectype==1 | rectype==2  /* just to be safe --see note */
          gen long id = 1 if rectype==1
          replace id = sum(id)
          drop if rectype==1
          drop rectype
          sort id perid

          merge id using tmph
          assert _merge==3                /* they are supposed to match */
          drop _merge id

          sort famid perid
          save hier, replace
          erase tmph.dta
          log close

End quotation from NetCourse 151
© Copyright 1997 StataCorp LT.

Sue Zayac
Academic Information Systems