Input Formats
- Target Object Authority
List
See: Target Object Identifier
-- Interface Format (TOI-IF)
NB: Each entry in the
authority list requires a
unique Target Object
Identifier -- e.g.,
an accession number, metadata
record ID, or filename
-- that serves as a link
to the image (or other)
files.
- Initial Target Text
Format
- "TEI Lite"
XML minimum
- Marked-up text
only, no images
or other multimedia
content
- Any image captions
should be marked
up as: <figure><head>
etc.
- Back of book indexes
should be marked
up at least to
the level of:
<div1 type="index"><list><item>
etc.
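As an illustration of the caption and index markup described above, the following minimal Python sketch parses a small sample and pulls out the figure captions and index entries. The sample text, its wrapper elements, and the function name extract_captions_and_index are assumptions made for this example only; they are not part of the CLIMB specification, and a real TEI Lite file would carry a header and possibly namespaces that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; only the <figure><head> and
# <div1 type="index"><list><item> patterns come from the requirements above.
SAMPLE = """<text>
  <body>
    <div1 type="chapter">
      <p>Some running text.</p>
      <figure><head>Plate 1. West facade.</head></figure>
    </div1>
  </body>
  <back>
    <div1 type="index">
      <list>
        <item>Arches, 12, 40</item>
        <item>Buttresses, 18</item>
      </list>
    </div1>
  </back>
</text>"""


def extract_captions_and_index(xml_text):
    """Return (captions, index_items) found in a marked-up text."""
    root = ET.fromstring(xml_text)
    # Captions are the <head> children of <figure> elements.
    captions = [(h.text or "").strip() for h in root.findall(".//figure/head")]
    # Index entries are <item> elements inside <div1 type="index"><list>.
    index_items = [
        (i.text or "").strip()
        for i in root.findall(".//div1[@type='index']/list/item")
    ]
    return captions, index_items


captions, index_items = extract_captions_and_index(SAMPLE)
print(captions)
print(index_items)
```

A text whose captions or index are marked up differently would simply yield empty lists here, which makes the sketch usable as a rough pre-flight check before loading.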
- External Thesaurus
Format
- The AAT format
will be pre-processed
for the CLIO Toolkit;
permission from
the Getty is required
to use this resource
- Other imported
or manually created
thesauri or subject
lists need to
be formatted according
to simple NISO
thesaurus terminology,
e.g., Use, Use
For, Broader Term,
Narrower Term
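One possible in-memory shape for such a thesaurus, using the four relationship types named above, is sketched below in Python. The class name ThesaurusEntry, the helper preferred_term, and the sample terms are assumptions for illustration; they are not quoted from the AAT and do not describe the CLIO Toolkit's actual import format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ThesaurusEntry:
    """One term with the simple NISO-style relationships listed above."""
    term: str
    use: Optional[str] = None                           # Use: the preferred term
    used_for: List[str] = field(default_factory=list)   # Use For
    broader: List[str] = field(default_factory=list)    # Broader Term
    narrower: List[str] = field(default_factory=list)   # Narrower Term


def preferred_term(thesaurus: Dict[str, ThesaurusEntry], term: str) -> str:
    """Follow a Use reference; return the term unchanged if none applies."""
    entry = thesaurus.get(term)
    if entry is not None and entry.use:
        return entry.use
    return term


# Illustrative entries only.
THESAURUS = {
    "buttresses": ThesaurusEntry("buttresses", narrower=["flying buttresses"]),
    "flying buttresses": ThesaurusEntry(
        "flying buttresses",
        used_for=["arc-boutants"],
        broader=["buttresses"],
    ),
    "arc-boutants": ThesaurusEntry("arc-boutants", use="flying buttresses"),
}

print(preferred_term(THESAURUS, "arc-boutants"))
```

Keeping Use and Use For as separate fields mirrors the reciprocal references of a printed thesaurus, at the cost of having to keep the two sides consistent when entries are edited by hand.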
- Stop List Format
Once the stop list is
created or loaded, the
administrator may, if
desired, assign simple
rules to terms to specify
the attributes according
to which words or phrases
are dropped from or
retained in the intermediate
processed text, e.g., based
on capitalization; occurrence
in first position in a
sentence; English or
non-English language text;
etc. (No structure for this
type of list has been
determined.)
In addition, word and
phrase attributes may be
used to include words or
phrases in, or exclude them
from, the intermediate
processed text. This might
be applied, e.g., to proper
names, geographic names,
dates, Arabic or Roman
numerals, etc.
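Since no structure for the stop list has been determined, the Python sketch below is purely hypothetical. It shows one way a per-term rule based on capitalization and sentence position could decide whether a word is dropped or retained. The dictionary layout, the rule name keep_if_capitalized, and the function filter_tokens are all assumptions made for this example.

```python
import re

# Hypothetical stop list: each stop word maps to a dict of optional rules.
# An empty dict means the word is always dropped.
STOP_LIST = {
    "the": {},
    "in": {},
    # Retain "May" (the month) but drop "may" (the verb). Capitalization in
    # first position in a sentence is treated as uninformative.
    "may": {"keep_if_capitalized": True},
}


def filter_tokens(sentence, stop_list):
    """Return the words of one sentence that survive the stop list."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    kept = []
    for position, token in enumerate(tokens):
        rule = stop_list.get(token.lower())
        if rule is None:
            kept.append(token)  # not a stop word: always retained
            continue
        capitalized_mid_sentence = token[0].isupper() and position > 0
        if rule.get("keep_if_capitalized") and capitalized_mid_sentence:
            kept.append(token)  # the rule rescues this occurrence
    return kept


print(filter_tokens("May the chapel open in May", STOP_LIST))
```

Other attributes mentioned above, such as language or term type (proper names, dates, numerals), could be added as further rule keys in the same way.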
- Intermediate &
Final Processed Text Format
The process of CLIMB text
analysis will result in
iterative and dynamic
markup changes to the
target text(s), using
valid TEI linguistic markup
and extensions.
Output Formats
- Metadata Export Formats
- USMARC fields
(e.g., 600-651
fields with $2aat,
for bona fide
AAT headings)
- USMARC-like fields
that parallel
the categories
of 6xx fields,
e.g., for local
personal name
headings used
as subjects, local
topical subjects,
local geographic
subjects, local
genre terms
- USMARC 653 field
"Index Term
-- Uncontrolled"
- USMARC-like fields
that distinguish
between CLIMB
metadata terms
with differing
ranking or type,
e.g., terms with
high confidence
of their association
with the Target
Object; terms
with intermediate
or low confidence;
terms believed
to be of high
"importance
ranking";
noun phrases,
keywords, etc.
NB: Metadata output
will be in the form
of tab-delimited files
that can be correlated
with image metadata
records and loaded
into the target database
system.
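A minimal Python sketch of such a tab-delimited export follows. The column names, the sample identifier IMG-0001, and the function write_metadata_tsv are assumptions for illustration; the actual column layout is not specified here. The field tags 650 and 653 and the "aat" source code follow the USMARC usage listed above.

```python
import csv
import io

# Hypothetical rows: (Target Object Identifier, field tag, term, source code).
ROWS = [
    ("IMG-0001", "650", "flying buttresses", "aat"),
    ("IMG-0001", "653", "west facade", ""),
]


def write_metadata_tsv(rows, out):
    """Write metadata rows as tab-delimited text to a file-like object."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["target_object_id", "field", "term", "source"])
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_metadata_tsv(ROWS, buffer)
print(buffer.getvalue(), end="")
```

Repeating the Target Object Identifier on every row is what allows the file to be correlated with the image metadata records at load time.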
- Project Profile / Rulesets
/ Logs
Project- and process-related
information will be stored
and output in a simple,
documented format, to be
determined, that will
facilitate the later review
and analysis of settings,
parameters, and decision
rationales.