 CLiMB Toolkit
Inputs

Input Formats

  1. Target Object Authority List

    See: Target Object Identifier -- Interface Format (TOI-IF)

    NB: Each entry in the authority list requires a unique Target Object Identifier -- e.g., an accession number, metadata record ID or filename -- that serves as a link to the image (or other) files.
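
    As a minimal sketch of the uniqueness requirement, the following Python checks an authority list for repeated identifiers. The two-field (identifier, filename) layout and the sample values are assumptions made for illustration only; the actual layout is the one defined by TOI-IF.

```python
from collections import Counter


def find_duplicate_ids(entries):
    """Return the identifiers that occur more than once, sorted.

    `entries` is an iterable of (identifier, filename) pairs. This
    two-field layout is hypothetical, not the TOI-IF definition.
    """
    counts = Counter(identifier for identifier, _ in entries)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical sample entries: accession number and image file.
sample = [
    ("1995.12.3", "img_0001.tif"),
    ("1995.12.4", "img_0002.tif"),
    ("1995.12.3", "img_0003.tif"),  # repeated identifier
]

duplicates = find_duplicate_ids(sample)
```

    An empty result means every entry can serve as an unambiguous link to its files.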

  2. Initial Target Text Format
    1. "TEI Lite" XML minimum
    2. Marked-up text only, no images or other multimedia content
    3. Any image captions should be marked up as: <figure><head> etc.
    4. Back-of-book indexes should be marked up at least to the level of:
      <div1 type="index"><list><item> etc.
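
    The markup levels listed above might be checked as follows. This is a sketch only: the sample fragment is invented, it is not a complete or validated TEI Lite document, and it uses Python's standard xml.etree.ElementTree module rather than any CLiMB component.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the element names come from the list above.
SAMPLE = """
<text>
  <body>
    <div1 type="chapter">
      <p>Marked-up text only.</p>
      <figure><head>Plate 1. North facade</head></figure>
    </div1>
  </body>
  <back>
    <div1 type="index">
      <list>
        <item>Arches, 12</item>
        <item>Facades, 3, 14</item>
      </list>
    </div1>
  </back>
</text>
"""

root = ET.fromstring(SAMPLE)

# Captions: every <head> that is a direct child of a <figure>.
captions = [head.text for head in root.findall(".//figure/head")]

# Index entries: <item> elements under <div1 type="index"><list>.
index_items = [
    item.text for item in root.findall(".//div1[@type='index']/list/item")
]
```

    A target text whose captions or index entries come back empty here is probably marked up below the minimum level described.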

  3. External Thesaurus Format
    1. The AAT format will be pre-processed for the CLiMB Toolkit; permission from the Getty is required to use this resource
    2. Other imported or manually created thesauri or subject lists need to be formatted according to simple NISO thesaurus terminology, e.g., Use, Use For, Broader Term, Narrower Term
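
    One way to picture a thesaurus entry carrying these four relationships is sketched below. The Term class, the preferred() helper and the sample terms are all hypothetical; they are not taken from the AAT or from any defined CLiMB import format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One thesaurus entry with the four relationships named above."""
    label: str
    use: Optional[str] = None  # preferred term, if this one is non-preferred
    use_for: List[str] = field(default_factory=list)
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)


def preferred(label: str, thesaurus: Dict[str, Term]) -> str:
    """Follow Use references until a preferred term is reached."""
    seen = set()
    while label in thesaurus and thesaurus[label].use is not None:
        if label in seen:  # guard against circular Use references
            raise ValueError(f"circular Use reference at {label!r}")
        seen.add(label)
        label = thesaurus[label].use
    return label


# Invented sample entries, for illustration only.
thesaurus = {
    "churches": Term(
        "churches",
        use_for=["church buildings"],
        broader=["religious buildings"],
        narrower=["basilicas"],
    ),
    "church buildings": Term("church buildings", use="churches"),
}

resolved = preferred("church buildings", thesaurus)
```

    Labels that are absent from the thesaurus are returned unchanged, so the helper can be applied to arbitrary candidate terms.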

  4. Stop List Format
    Once a stop list has been created or loaded, the administrator may, if desired, assign simple rules to its terms. These rules specify the attributes according to which words or phrases are dropped from or retained in the intermediate processed text, e.g., capitalization; occurrence in first position in a sentence; English or non-English language text. (No structure for this type of list has been determined.)

    In addition, attributes of words and phrases may be used to include them in or exclude them from the intermediate processed text. This might be applied, e.g., to proper names, geographic names, dates, and arabic or roman numerals.
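
    Since no structure for the stop list has been determined, the following is only a sketch of how one such rule might behave. The stop words, the rule itself and the function names are assumptions: a stop-list word is dropped unless it is capitalized somewhere other than first position in the sentence.

```python
import re

STOP_WORDS = {"the", "a", "of", "may", "will"}  # hypothetical stop list


def keep_token(token, position, stop_words=STOP_WORDS):
    """Decide whether a token is retained in the processed text.

    Hypothetical rule: a stop-list word is dropped unless it is
    capitalized at a position other than the start of the sentence,
    where capitalization suggests a name rather than a function word.
    """
    if token.lower() not in stop_words:
        return True
    return token[0].isupper() and position > 0


def filter_sentence(sentence):
    """Return the retained tokens of one sentence, in order."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for i, t in enumerate(tokens) if keep_token(t, i)]


result = filter_sentence("The curator May Ellis will open the gallery")
```

    Here "May" is kept as a probable name, while sentence-initial "The" and lowercase "will" and "the" are dropped.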

  5. Intermediate & Final Processed Text Format
    CLiMB text analysis will make iterative and dynamic markup changes to the target text(s), using valid TEI linguistic markup and extensions.

Output Formats

  1. Metadata Export Formats
    1. USMARC fields (e.g., 600-651 fields with $2aat, for bona fide AAT headings)
    2. USMARC-like fields that parallel the categories of 6xx fields, e.g., for local personal name headings used as subjects, local topical subjects, local geographic subjects, local genre terms
    3. USMARC 653 field "Index Term -- Uncontrolled"
    4. USMARC-like fields that distinguish between CLiMB metadata terms of differing rank or type, e.g., terms with high confidence of association with the Target Object; terms with intermediate or low confidence; terms believed to be of high "importance ranking"; noun phrases, keywords, etc.

    NB: Metadata output will be in the form of tab-delimited files that can be correlated with image metadata records and loaded into the target database system.
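
    A tab-delimited export of this kind might look like the sketch below. The column names and the sample rows are hypothetical; the actual field layout has not been specified here.

```python
import csv
import io

# Hypothetical column layout: Target Object Identifier, field tag,
# source code, term, and confidence level.
COLUMNS = ["toi", "field", "source", "term", "confidence"]

# Invented sample rows, for illustration only.
rows = [
    {"toi": "1995.12.3", "field": "650", "source": "aat",
     "term": "arches", "confidence": "high"},
    {"toi": "1995.12.3", "field": "653", "source": "",
     "term": "north facade", "confidence": "low"},
]


def to_tab_delimited(records):
    """Serialize records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


output = to_tab_delimited(rows)
```

    Keeping the Target Object Identifier in the first column is what would allow each row to be correlated with the matching image metadata record at load time.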

  2. Project Profile / Rulesets / Logs
    Project- and process-related information will be stored and output in a simple, documented format, to be determined, that will facilitate later review and analysis of settings, parameters and decision rationales.

Last revision: 05/26/04
© Columbia University Libraries