This page describes the Public Use Microdata Sample (PUMS) products that accompany the population and housing summary files produced from Census 2000 or from the American Community Survey (ACS). It is meant to help you decide which product to use. Much of the information was taken from the Census Bureau's publication Public Use Microdata Sample 2000 Census of Population and Housing (large pdf file). The available products include:
- PUMS - Census 2000
- PUMS - ACS
- Restricted Access Files
An Introduction to Microdata
The following list of questions and answers serves as an introduction to the understanding of microdata.
- What is microdata?
Microdata are the individual records that contain information collected about each person and housing unit. They are computerized versions of the questionnaires collected from households, as coded and edited during census processing.
- How does microdata relate to Summary File data?
The Census Bureau uses these confidential microdata to produce the summary data that go into the various reports, summary files, and special tabulations. The individual responses are tabulated, often cross tabulated on the values of more than one variable, and the totals are statistically adjusted to give counts that are representative of the entire population. These tabulated results are what the Summary Files contain.
- Why would you use microdata?
Microdata samples are useful for research that does not require the identification of specific small geographic areas or detailed cross tabulations for small populations. Use microdata to study relationships among census variables that are not shown in existing census tabulations. This is often done when studying the characteristics of specially defined population groups other than the race, Hispanic origin, and age groups analyzed in the summary files.
- What is the difference between the PUMS files and restricted files?
The restricted data files contain all the survey questionnaire responses, while the PUMS files are subsets of the responses selected to represent people and housing units (for the 2000 data there are a 1% sample and a 5% sample). For PUMS files, the subsetting and reporting are done in a manner that avoids disclosure of information about households and individuals. The techniques used to accomplish this are: a unique geographic reporting structure based on relatively large reporting areas called PUMAs and super-PUMAs, the use of reporting thresholds, and a variety of statistical procedures to mask identifiable persons or households. With restricted files, the microdata are not modified; confidentiality is ensured by screening access.
Public Use Microdata Samples (PUMS)
Public Use Microdata Sample (PUMS) files contain records representing samples of the occupied and vacant housing units and the people in the occupied units. Persons in group quarters are also included. The files contain individual weights for each person and housing unit, which when applied to the individual records, expand the sample to the relevant total.
For the 2000 data, the 1% file provides a fuller range of detailed characteristics, and the 5% file provides greater geographic detail but less characteristic detail.
Below is a summary of the characteristics of each file and a review of what options there are for access. Although the detailed information refers to Census 2000 products, much also applies to ACS, particularly with files produced starting with 2006.
| | 1% PUMS | 5%/ACS PUMS | More Information |
| --- | --- | --- | --- |
| Levels of geographic reporting | Lowest level is the super-PUMA, with a minimum population threshold of 400,000. Super-PUMA boundaries encompass one or more contiguous PUMAs (no PUMA codes on the 1% file). Super-PUMAs are defined within states, and state codes are reported. Codes showing the relationship to MSAs are reported. | Lowest level is the PUMA, with a minimum population threshold of 100,000. Super-PUMA and state codes are reported. Codes showing the relationship to MSAs are reported. For NYC, PUMA boundaries approximate Community District boundaries. | Files are hierarchical: each housing unit record is followed by a variable number of person records, one for each occupant. The serial number on both record types allows the data to be processed either sequentially or hierarchically. For each state there is a geographic equivalency file, PUMEQ1-XX.TXT or PUMEQ5-XX.TXT, that shows the relationship of PUMS geography to standard census geography. The MABLE/Geocorr2K: Geographic Correspondence Engine with Census 2000 Geography is another way to look up geographic equivalents for the 5% file. |
| Data variables | Maximum amount of social, economic, and housing information available. The only threshold for identifying a variable category is a national minimum population of 8,000 for race and Hispanic origin. | A national minimum threshold of 10,000 is set for identifying categories within categorical variables. | Use the 2000 PUMS documentation (large pdf file) for details about variables and the reporting categories used within each one: Chapter 6 - Data Dictionary (1%); Chapter 7 - Data Dictionary (5%); Appendices, which give code values and definitions for variables. Use ACS subject documentation for ACS files. |
| Data available via FTP | 2000 Census Bureau | Census Bureau ACS PUMS site; DataGate: study 2000-PA for 2000, study 1992 for ACS | Note that files are published by state. Working with them requires handling large (sometimes zipped) files and using statistical software. For 2000, DataGate has NYC subsets in Stata and SPSS formats. |
| Data available via online extraction | IPUMS site. | IPUMS site. | Extraction is done with the web-based IPUMS interface, which creates subsets of individual microdata records. Weighting and tabulating the results must be done with statistical software. Use the documentation at this site, since IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic characteristics over time. You can obtain data across states in one download. |
| Data available on DVD (2000 only) | Ask for the DVD to use on the EDS PC network; the application runs from the disc, no installation necessary. | Ask for the DVD to use on the EDS PC network; the application runs from the disc, no installation necessary. | Beyond 20/20 software is designed to perform basic cross tabulations of any desired set of variables on the PUMS file. Easy to use; no software skills needed. Only one cross tabulation can be extracted at a time. You can use geographic codes in cross tabulations, which lets you analyze data from multiple geographic areas, including totals for the nation, in one pass. Extracts done with B20/20 software are reported as tabulated results, not as records of individual responses. You can choose to produce weighted or unweighted extracts. |
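The hierarchical layout noted in the table (each housing unit record followed by its person records, with a shared serial number) can be read in a single sequential pass. The following is a minimal sketch on made-up lines, not real PUMS records. The record-type flag in column 1 ("H" or "P") and the serial number in columns 2-8 are assumptions for this toy example; check the data dictionary for the actual positions before adapting it.

```python
from collections import OrderedDict

# Toy lines, NOT real PUMS records. Assumed (hypothetical) layout:
#   column 1     record type: "H" = housing unit, "P" = person
#   columns 2-8  serial number shared by a housing unit and its occupants
SAMPLE_LINES = [
    "H0000001 housing fields",
    "P0000001 person fields A",
    "P0000001 person fields B",
    "H0000002 vacant unit fields",
    "H0000003 housing fields",
    "P0000003 person fields C",
]


def group_by_housing_unit(lines):
    """Return {serial: {"housing": line, "persons": [lines]}} in file order."""
    units = OrderedDict()
    for line in lines:
        rectype, serial = line[0], line[1:8]
        if rectype == "H":
            units[serial] = {"housing": line, "persons": []}
        elif rectype == "P":
            # A person record is expected after the housing record
            # that carries the same serial number.
            if serial not in units:
                raise ValueError(f"person record without housing record: {serial}")
            units[serial]["persons"].append(line)
        else:
            raise ValueError(f"unknown record type: {rectype!r}")
    return units


units = group_by_housing_unit(SAMPLE_LINES)
persons_per_unit = {serial: len(u["persons"]) for serial, u in units.items()}
print(persons_per_unit)
```

A vacant unit simply ends up with an empty person list, which matches the description of housing records being followed by a variable number of person records.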
Weight variables are applied to the variables in the PUMS file during data analysis to create results that are representative of the population. There is a person weight variable for use with person characteristics and a housing weight variable for use with housing characteristics. Using the weights within a software application usually requires only that you designate which weight variable you want to use. Understanding how they are applied will help you choose, and the examples below serve that purpose.
To use the educational attainment variable to determine the number of persons whose level of attainment is a high school diploma, select all records where the category value for educational attainment is "HS diploma." The unweighted count is the total number of selected records, and the weighted count (the one that is representative of the total population) is the sum of the person weights on those records. If the variable being analyzed is a housing characteristic, use the housing weight. To create estimates of households or families, use the person weight of the householder.
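As an illustration of the counting rule just described, here is a small sketch on invented records. The field names `educ` and `pweight` and all the values are hypothetical stand-ins, not actual PUMS variable names or data.

```python
# Toy person records, NOT real PUMS data. "educ" and "pweight" are
# hypothetical names for the educational attainment and person weight variables.
persons = [
    {"educ": "HS diploma", "pweight": 20},
    {"educ": "Bachelor's degree", "pweight": 25},
    {"educ": "HS diploma", "pweight": 18},
    {"educ": "No diploma", "pweight": 22},
    {"educ": "HS diploma", "pweight": 31},
]

# Select the records in the category of interest.
selected = [p for p in persons if p["educ"] == "HS diploma"]

# Unweighted count: number of sample records selected.
unweighted_count = len(selected)

# Weighted count: sum of the person weights on the selected records.
weighted_count = sum(p["pweight"] for p in selected)

print(unweighted_count, weighted_count)
```

In this toy sample three records are selected, and their weights add up to an estimate of 69 persons.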
To get estimates of aggregate characteristics, such as the total number of related children in households, multiply the PUMS weight by the value of the characteristic on each household record and sum the products across all household records. If the desired estimate is the number of households with at least one related child in the household, add the PUMS person weight of the householder for all households where the value of the characteristic is not equal to zero.
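The two household estimates can be sketched the same way. The records and the field names `weight` and `related_children` below are invented for illustration, and the sketch assumes that `weight` holds the householder's person weight for each household.

```python
# Toy household records, NOT real PUMS data. "weight" is assumed to be the
# householder's person weight; "related_children" is a hypothetical field name.
households = [
    {"serial": "0000001", "weight": 20, "related_children": 2},
    {"serial": "0000002", "weight": 15, "related_children": 0},
    {"serial": "0000003", "weight": 30, "related_children": 1},
    {"serial": "0000004", "weight": 25, "related_children": 3},
]

# Aggregate estimate: sum of weight * value across all household records.
total_related_children = sum(
    h["weight"] * h["related_children"] for h in households
)

# Household estimate: sum of weights where the characteristic is nonzero.
households_with_children = sum(
    h["weight"] for h in households if h["related_children"] != 0
)

print(total_related_children, households_with_children)
```

With these toy values the estimated total is 145 related children, living in an estimated 75 households.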
Restricted Access Files
To access all the responses from the long form rather than a subset of the responses, users can choose one of two options offered by the Census Bureau. These options are very different, yet both are designed so that researchers can base an analysis on all responses while safeguarding the confidentiality of the information.
- Census Research Data Centers (CRDC)
The CRDCs are locations that provide a secure environment where researchers have limited access to confidential economic and demographic microdata, with appropriate safeguards to protect data confidentiality. The controlled environment ensures that the Census Bureau's standards for maintaining the confidentiality of data provided by its census and survey respondents are rigorously maintained. Users must submit their research design for approval before being granted access and must use the data resources on-site at one of the Census RDC sites.
The Census Bureau's Center for Economic Studies operates the Research Data Center program. Because of our membership in the New York Census Research Data Center consortium, researchers affiliated with Columbia can use the NYCRDC at Baruch College, CUNY, without incurring the standard usage fee (use the contact information at the NYCRDC site). This privilege also extends to the Michigan CRDC at the University of Michigan through our membership in ICPSR.

