Columbia University Libraries Digital Program

Oral History Transcriptions: Digitization and Markup Specs


Specifications, rev. 9/5/2006

These specifications were originally developed for the 2005 Notable New Yorkers project (see http://www.columbia.edu/cgi-bin/cul/resolve?AH-5AU1DPS05514). They were updated in 2006 for use in an ongoing project to convert other frequently requested oral history transcriptions.

Note that these markup specs were developed for digital transcriptions of existing paper transcriptions, not for transcriptions from audio. The specs were intended to support basic text display, not any specialized types of search and retrieval. They do not attempt to reflect best practices, since no established body of practice appeared to exist in 2006 when these requirements were developed.

Note, too, that the typescript transcriptions for which these specs were developed were often decades old and frequently annotated by hand. Because of this, we also defined certain special markup accommodations to identify inserted and deleted text, gaps, etc.


  1. Scanning
    1. Bitonal text-page scanning. Bit depth: 1-bit; file format: TIFF; TIFF compression: Group IV; digital resolution: 600 dpi

    2. Scan all pages for each interview. Consult the Scanning Manifest for the starting filename for each session: http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/oh_scanning_manifest1.xls

    3. For title pages, use the first filename listed for the interview and append "tp" to the basename, e.g., ldpd.3452627.001.000.000001tp.tif

    4. In addition to TEI XML files (specified below), page image deliverables will consist of:

      1. Full set of original unmodified tiffs

      2. Full set of PDFs generated from the tiffs, one per session as identified in the Scanning Manifest

      3. Tiffs and PDFs will be delivered to Columbia on CD-Rs, one set for tiffs and another for PDFs

  2. Markup: General
    1. Use P4 version of TEI-Lite
    2. Create one XML file for each Session
      See also the sample marked-up interview (Oakes), at:
      http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/oakes_sample.xml
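
    As a rough orientation, the overall shape of one session file might look like the sketch below. It assumes the standard TEI P4 root element (TEI.2) and the usual front/body/back division of the text; the DOCTYPE declaration is omitted, and the comments stand in for content covered in the sections that follow. It is not taken from the sample file.

    ```xml
    <!-- Illustrative skeleton only; placeholders are comments, so this is not a valid file as-is -->
    <TEI.2>
      <teiHeader>
        <!-- minimal header data, see "TEI Header" -->
      </teiHeader>
      <text>
        <front>
          <!-- listing of Interviewer, Interviewee, etc., see "TEI Frontmatter" -->
        </front>
        <body>
          <!-- questions and answers for this session, see "TEI Body" -->
        </body>
        <back>
          <!-- index, if any, see "Backmatter" -->
        </back>
      </text>
    </TEI.2>
    ```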

  3. TEI Header
    1. Use minimum amount of data in teiHeader needed to validate file. CUL will create authoritative data at a later stage of the project.
    2. Make sure to include information on date of creation of file in profileDesc/creation
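
    A minimal header along these lines might look like the following sketch. It assumes the usual TEI P4 minimum (a fileDesc containing titleStmt, publicationStmt, and sourceDesc) plus the required profileDesc/creation date. All text values, including the date, are placeholders and not authoritative data.

    ```xml
    <!-- Sketch of a minimal teiHeader; every value below is a placeholder -->
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Placeholder title for this session</title>
        </titleStmt>
        <publicationStmt>
          <p>Placeholder publication statement</p>
        </publicationStmt>
        <sourceDesc>
          <p>Placeholder description of the source typescript</p>
        </sourceDesc>
      </fileDesc>
      <profileDesc>
        <!-- date the file was created; hypothetical value -->
        <creation><date>2006-09-05</date></creation>
      </profileDesc>
    </teiHeader>
    ```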

  4. TEI Frontmatter
    1. Use <front> element to encode the listing of Interviewer, Interviewee, etc., at the beginning of each session transcript.
    2. Tag each entry in list as an <item>, in a <list>, itself enclosed by a <div>. Encode the name of the interviewer and interviewee with a <name> element.
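
    The front matter rules above could be applied roughly as follows. The names and the entry labels are invented for illustration; the actual entries should follow whatever appears at the head of the typescript.

    ```xml
    <!-- Sketch only; names and labels are hypothetical -->
    <front>
      <div>
        <list>
          <item>Interviewee: <name>Jane Doe</name></item>
          <item>Interviewer: <name>John Roe</name></item>
          <item>Session: 1</item>
        </list>
      </div>
    </front>
    ```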

  5. TEI Body
    1. Use <milestone> element preceding each <sp> with <speaker> "Q".
      1. Use "id" attribute with value in format "q-[sequential number of milestone element]"
      2. Use "n" attribute with integer representing sequence number of milestone element

    2. Use <sp> element for each question or answer. These can be distinguished by being preceded by either Q: or a [name]:
      1. Use "id" attribute with value in form of "sp-[sequential number of sp element]"

    3. Use <speaker> element around the "speech prefix" (either Q: or [name]:) preceding each question or answer.

    4. Use <p> for each block of text inside question/answer

    5. Use <emph rend="italic"> to tag italicized text

    6. Use <pb> element at point in text where a new page begins

      1. Use "id" attribute with value in format "pb-[sequential number of page break element]"
      2. Use "n" attribute with value representing page number as printed in interview

    7. Use <add> for any non-typescript text (i.e., written by hand) which can be read, e.g.,

      And so the <add>Jelliffes</add> naturally talked a great deal about Langston [where Jelliffes is a handwritten insert]

    8. Use <unclear/> for any text, both manuscript and typescript, which cannot be read, e.g.,

      a) ... and people at <unclear/> knew a great deal about him ... [where the word(s) following at are not legible.]

      b) ... accepted the nomination on the basis <add>that the outcome in <unclear>N.H.</unclear> represented the sentiment of the county</add> ...   [where there are unclear words within the addendum]

    9. Use <gap> for spaces left in the typescript, e.g.

      So music was something <gap/> was always in the background... [where there is whitespace between something and was.]
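
    Taken together, the body rules above might produce markup like the sketch below. The speaker name, page numbers, and the first question are invented; the remaining sentences reuse the examples given above. Two points are assumptions rather than part of the spec: the unit value on <milestone> (the TEI DTD requires a unit attribute, but no value is specified here), and the numbering of sp ids by order of <sp> elements.

    ```xml
    <!-- Sketch only; speaker name, page numbers, and the question text are hypothetical -->
    <body>
      <pb id="pb-1" n="1"/>
      <!-- unit="question" is an assumed value, not specified above -->
      <milestone id="q-1" n="1" unit="question"/>
      <sp id="sp-1">
        <speaker>Q:</speaker>
        <p>What do you remember about the <emph rend="italic">early</emph> years?</p>
      </sp>
      <!-- sp ids are numbered here by order of sp elements (assumed reading of the id rule) -->
      <sp id="sp-2">
        <speaker>Doe:</speaker>
        <p>So music was something <gap/> was always in the background.
          And so the <add>Jelliffes</add> naturally talked a great deal about Langston,
          <pb id="pb-2" n="2"/>
          and people at <unclear/> knew a great deal about him.</p>
      </sp>
      <milestone id="q-2" n="2" unit="question"/>
      <sp id="sp-3">
        <speaker>Q:</speaker>
        <p>And after that?</p>
      </sp>
    </body>
    ```

    Note that the id prefixes (q-, sp-, pb-) keep the three numbering sequences from colliding, since id values must be unique within a file.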

  6. Backmatter

    1. Do not mark up backmatter unless it is an index (usually of names)

    2. For indices, use CUL's RelaxNG schema:
      http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/index-culdpd.rng

    3. Example:
      http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/test_index.xml

 


Last revision: 10/21/10
© Columbia University Libraries