Specifications, rev. 9/5/2006
These specifications were originally developed for the 2005 Notable New Yorkers project (see http://www.columbia.edu/cgi-bin/cul/resolve?AH-5AU1DPS05514). They were updated again in 2006 for use in an ongoing project to convert other frequently requested oral history transcriptions.
Note that these markup specs were developed for digital transcriptions of existing paper transcriptions, not for transcriptions from audio. The specs were intended to support basic text display, not any specialized types of search and retrieval. They do not attempt to reflect any best practices, since there did not appear to be any body of practice at all in 2006 when these requirements were developed.
Note, too, that the typescript transcriptions for which these specs were developed were often decades old and frequently annotated by hand. Because of this, we also defined certain special markup accommodations to identify inserted and deleted text, gaps, etc.
- Scanning
- Bitonal text pages. Bit depth: 1-bit;
file format: TIFF; TIFF compression:
Group IV; digital resolution: 600 dpi
- Scan all pages for each interview.
Consult the Scanning Manifest for
the starting filename for each session: http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/oh_scanning_manifest1.xls
- For title pages, use the first
filename listed for the interview
and append "tp" to the basename,
e.g., ldpd.3452627.001.000.000001tp.tif
- In addition to TEI XML files (specified
below), page image deliverables
will consist of:
- Full set of original unmodified
tiffs
- Full set of PDFs generated
from the tiffs, one per
session as identified in the
Scanning Manifest
- Tiffs and PDFs will be delivered
to Columbia on CD-Rs, one set
for tiffs and another for PDFs
- Markup: General
- Use P4 version of TEI-Lite
- Create one XML file for each session.
See also the sample marked-up interview
(Oakes), at:
http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/oakes_sample.xml
- TEI Header
- Use minimum amount of data in
teiHeader needed to validate file.
CUL will create authoritative data
at a later stage of the project.
- Make sure to include information
on date of creation of file in profileDesc/creation
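As an illustration of the header rules above, here is a minimal sketch of a teiHeader that should satisfy the TEI P4 requirement for a fileDesc containing titleStmt, publicationStmt and sourceDesc, plus the creation date in profileDesc/creation. All bracketed text is placeholder content, not wording prescribed by these specs.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- placeholder: CUL supplies authoritative data later -->
      <title>[interviewee name, session number]</title>
    </titleStmt>
    <publicationStmt>
      <p>[placeholder publication statement]</p>
    </publicationStmt>
    <sourceDesc>
      <p>[placeholder description of the paper transcript]</p>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <creation>
      <!-- date on which this XML file was created -->
      <date>[file creation date]</date>
    </creation>
  </profileDesc>
</teiHeader>
```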
- TEI Frontmatter
- Use <front> element to encode
the listing of Interviewer, Interviewee,
etc. at beginning of each session
transcript.
- Tag each entry in list as an <item>,
in a <list>, itself enclosed
by a <div>. Encode the name
of the interviewer and interviewee
with a <name> element.
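The front matter rules above might look like the following sketch. The labels and bracketed names are placeholders; the actual entries depend on what the typescript lists at the start of the session.

```xml
<front>
  <div>
    <list>
      <!-- one <item> per entry in the typescript's opening listing -->
      <item>Interviewee: <name>[interviewee name]</name></item>
      <item>Interviewer: <name>[interviewer name]</name></item>
      <item>[any further entry, as it appears in the typescript]</item>
    </list>
  </div>
</front>
```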
- TEI Body
- Use <milestone> element
preceding each <sp> with <speaker> "Q".
- Use "id" attribute
with value in format "q-[sequential
number of milestone element]"
- Use "n" attribute
with integer representing sequence
number of milestone element
- Use <sp> element for each
question or answer. These can be
distinguished by being preceded
by either Q: or a [name]:
- Use "id" attribute
with value in format "sp-[sequential
number of sp element]"
- Use <speaker> element around
the "speech prefix" (either
Q: or [name]:) preceding each question
or answer.
- Use <p> for each block of
text inside question/answer
- Use <emph rend="italic"> to
tag italicized text
- Use <pb> element at point
in text where a new page begins
- Use "id" attribute
with value in format "pb-[sequential
number of page break element]"
- Use "n" attribute
with value representing page
number as printed in interview
- Use <add> for any non-typescript
text (i.e., written by hand) which
can be read, e.g.,
And so the <add>Jelliffes</add> naturally
talked a great deal about Langston [where Jelliffes is
a handwritten insert]
- Use <unclear/> for any text,
both manuscript and typescript,
which cannot be read, e.g.,
a) ...
and people at <unclear/> knew a
great deal about him ... [where
the word(s) following at are
not legible.]
b) ...
accepted the nomination on the
basis <add>that
the outcome in <unclear>N.H.</unclear> represented
the sentiment of the county</add> ... [where
there are unclear words within
the addendum]
- Use <gap> for spaces left
in the typescript,
e.g.
So music
was something <gap/> was
always in the background...
[where
there is whitespace between something and was.]
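The body rules above can be combined in one short fragment. This is a sketch, not an excerpt from a delivered file, and it rests on several assumptions. The "unit" attribute on <milestone> is expected by TEI P4 but is not specified here, so its value is a guess. The <sp> elements are numbered by their own sequence so that every id is unique. The question text, page number and speaker name are bracketed placeholders. The answer text reuses the examples given above.

```xml
<body>
  <!-- n carries the page number as printed in the interview -->
  <pb id="pb-1" n="[printed page number]"/>
  <!-- unit value is an assumption; these specs do not name one -->
  <milestone id="q-1" n="1" unit="question"/>
  <sp id="sp-1">
    <speaker>Q:</speaker>
    <p>[question text, with <emph rend="italic">italicized words</emph> tagged]</p>
  </sp>
  <sp id="sp-2">
    <speaker>[name]:</speaker>
    <p>And so the <add>Jelliffes</add> naturally talked a great deal
    about Langston, and people at <unclear/> knew a great deal about him.</p>
    <p>So music was something <gap/> was always in the background...</p>
  </sp>
</body>
```

Only the question is preceded by a <milestone>; the answer that follows it is a plain <sp>.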
- Backmatter
- Do not mark up backmatter unless
it is an index (usually of names)
- For indices, use CUL's RelaxNG
schema:
http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/index-culdpd.rng
- Example:
http://www.columbia.edu/cu/libraries/inside/projects/digitization/specifications/test_index.xml