Columbia
Columbia University Libraries Digital Program
Lehman Papers Digitization


Lehman Special Correspondence Files
Technical & Operational Overview


Project Teams

  • Libraries Digital Program Division:  Robbie Blitz, Terry Catapano, Joanna DiPasquale, Stuart Marquis, Stephen Davis

  • Preservation & Digital Conversion Division:  Dave Ortiz, Dina Sokolova, Emily Holmes, Janet Gertz

  • Curatorial & Administrative Staff:  Jean Ashton, Tamar Dougherty, Susan Hamson, Michael Ryan, Janet Gertz, Jane Winland

  • Pre-Processing: Ann Young (2006), Annie Grunow (2006)

  • Scanning Vendor: Backstage Library Works, Provo Utah (1/2007 - 7/2007)


Collection Statistics

  • Collection Size:
    • ca. 500 linear feet scanned
    • 32,890 complete documents scanned
    • 43,479 page images scanned

  • Authorship: 
    • HH Lehman & staff = 15,362 (47%)
    • Lehman & family   =  1,095 (3%)
    • All other         = 16,433 (50%)
      TOTAL             = 32,890
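
As a quick check on the authorship figures above, the following minimal sketch recomputes the total and the rounded percentage shares from the three listed counts. The dictionary keys simply reuse the labels from the list; nothing here comes from the project's own tooling.

```python
# Document counts by authorship, copied from the list above.
counts = {
    "HH Lehman & staff": 15362,
    "Lehman & family": 1095,
    "All other": 16433,
}

# The total should match the number of complete documents scanned.
total = sum(counts.values())

# Share of each group, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<20} {n:>7,} ({shares[label]}%)")
print(f"{'TOTAL':<20} {total:>7,}")
```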
       

Pre-Processing & Metadata

  • Item Numbering: All files, documents and pages were collated & numbered, e.g.,

        [file #]-[document #]-[page #]

        0002-0001-001

  • Collection Reprocessing:  Duplicates were marked as not to be scanned; poor-quality photocopies were re-copied; items needing conservation were identified and referred to the Conservation Lab; the entire collection was relabeled and refoldered; and 'separated material' was reintegrated. Because of security and operational concerns, original documents in the "VIP Files" were not scanned directly; photocopies of them were scanned instead.

  • Descriptive Metadata: Recorded file ID, file title, folder ID, document ID, document date, number of pages in document, genre, author type (i.e., HHL / Staff, HHL Family, Other).  (See master project spreadsheet.)

  • Conservation Information:  Pre-scanning conservation needs, photocopy status (if not original document)

  • Technical Metadata Recorded: EXIF standard data plus
    • Image Producer (vendor name)
    • OS version
    • Scanner or Digital Camera
    • Scanner/Digital Camera Software
    • Lens (if applicable)
    • Focal Length (if applicable)
    • Scene Illuminant (if applicable)
    • Sampling Frequency Plane (in this case it is direct capture)
    • Sampling Frequency Unit (in this case, inches)
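
The item numbering scheme described above ([file #]-[document #]-[page #]) lends itself to simple parsing and validation. The sketch below is illustrative only: the field widths (4, 4 and 3 digits) are inferred from the single example 0002-0001-001, and the names `ItemId`, `parse_item_id` and `format_item_id` are hypothetical, not part of the project's software.

```python
import re
from typing import NamedTuple


class ItemId(NamedTuple):
    """One page image, identified by file, document and page number."""
    file_no: int
    document_no: int
    page_no: int


# Assumed layout: 4-digit file, 4-digit document, 3-digit page,
# inferred from the example 0002-0001-001.
_ITEM_ID = re.compile(r"(\d{4})-(\d{4})-(\d{3})")


def parse_item_id(text: str) -> ItemId:
    """Split an identifier such as '0002-0001-001' into its numeric parts."""
    match = _ITEM_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid item identifier: {text!r}")
    return ItemId(*(int(group) for group in match.groups()))


def format_item_id(item: ItemId) -> str:
    """Render the numeric parts back into the zero-padded identifier."""
    return f"{item.file_no:04d}-{item.document_no:04d}-{item.page_no:03d}"


print(parse_item_id("0002-0001-001"))
print(format_item_id(ItemId(2, 1, 1)))
```

Zero-padded identifiers of this kind sort correctly as plain strings, which keeps files, documents and pages in collation order without any numeric parsing.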

Scanning Information

  • Scanning Equipment:
    • BSLW Hasselblad H2D-39

  • Scanning Specifications:
    • Items measuring up to 10” x 13.5” scanned at 400 ppi, 24-bit color
    • Items measuring 13.5” x 18” to 18” x 24” scanned at 300 ppi, 24-bit color

  • Scanning Deliverables:
    • One set of unaltered original TIFF images on DVD
    • One set of cropped and de-skewed 24-bit TIFF images on DVD
    • One set of Macbeth scanned color charts for each scanning session
    • One set of text-searchable PDF files
    • OCR converted text (Raw OCR)
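
To make the scanning specifications above concrete, here is a rough sketch that picks the resolution for an item of a given size and computes the resulting pixel dimensions. It rests on stated assumptions: the size test ignores orientation, and sizes that fall outside the two listed ranges return `None` because the specification does not cover them. The function names are hypothetical.

```python
from typing import Optional, Tuple


def scan_ppi(width_in: float, height_in: float) -> Optional[int]:
    """Return the specified resolution for an item, or None if unspecified."""
    short, long = sorted((width_in, height_in))
    # Items up to 10" x 13.5": 400 ppi.
    if short <= 10 and long <= 13.5:
        return 400
    # Items from 13.5" x 18" up to 18" x 24": 300 ppi.
    if 13.5 <= short <= 18 and 18 <= long <= 24:
        return 300
    # Anything else is not covered by the listed specifications.
    return None


def pixel_dimensions(width_in: float, height_in: float, ppi: int) -> Tuple[int, int]:
    """Pixel width and height of an item scanned at the given resolution."""
    return round(width_in * ppi), round(height_in * ppi)


print(pixel_dimensions(10, 13.5, 400))  # largest item in the 400 ppi class
print(pixel_dimensions(18, 24, 300))    # largest item in the 300 ppi class
```

Under these assumptions, the largest item in each size class needs at most 5,400 pixels on its short side (4,000 x 5,400 and 5,400 x 7,200 respectively).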

Rights & Permissions

Application & Web Presentation

Lucene and SOLR
  • Metadata indexed for search capabilities
  • Information queried via SOLR
  • ISO-compliant dates allow date-range searching
  • SOLR's PHP and serialized-PHP output formats let us use a LAMP server with server-side processing, resulting in a very fast application
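
The points above can be illustrated with a sketch of how a date-range query might be assembled. The range syntax `field:[start TO end]`, the ISO 8601 timestamp form and the `wt=phps` (serialized PHP) response writer are standard SOLR features; the base URL, the field name `doc_date` and the function name are assumptions made for illustration and are not taken from the project's actual schema or code.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, text: str, start_date: str, end_date: str,
                    date_field: str = "doc_date") -> str:
    """Build a SOLR select URL with a full-text query and a date-range filter.

    start_date and end_date are ISO dates (YYYY-MM-DD); date_field is a
    hypothetical field name.
    """
    params = {
        "q": text,
        # Inclusive range over ISO 8601 timestamps in UTC.
        "fq": f"{date_field}:[{start_date}T00:00:00Z TO {end_date}T23:59:59Z]",
        # Ask SOLR for serialized PHP, which a PHP front end can unserialize().
        "wt": "phps",
    }
    return f"{base_url}/select?{urlencode(params)}"


print(solr_select_url("http://localhost:8983/solr", "relief",
                      "1933-01-01", "1942-12-31"))
```

Because ISO-formatted dates order chronologically, a single range filter of this kind is enough to express both free date-range searches and fixed spans such as ten-year periods.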

Web Interface
  • Hosted on LAMP server and written in PHP, with some content pulled from the Libraries' web site
  • Included functionality:
    • Interface has "smart" lookups for correspondent, date, and document type, so only relevant queries appear in the field

    • Searching is faceted by date (10-year increments), correspondence file, and document type; limits are removable

    • Document pagination via DLO queries

    • Documents have look-up feature and session setting so that users can go to previous/next document and return to last search

    • Toggle OCR and page image

  • Site reflects new name for Suite, "The Lehman Collections," and somewhat scaled-down template
  • Site sets time-sensitive cookie ("lehman") that records whether the user has agreed to the terms and conditions. Once the user agrees, the cookie lasts 15 minutes and is renewed each time the user navigates through the system, so it expires 15 minutes after the last Lehman document is viewed. No user information is collected.
  • Bookmarkable resolver URL created
  • Mnemonic URL created
  • Chicago Manual of Style citation offered
  • Brief help on searching offered
  • R&D work on text analysis to offer better metadata (in development)
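
The sliding 15-minute expiry of the "lehman" terms cookie described above can be sketched as follows. This is a language-neutral illustration of the renewal logic written in Python, not the site's PHP code; the type and function names are hypothetical, and times are plain seconds so that the behavior is easy to follow.

```python
from typing import NamedTuple, Optional

TERMS_TTL_SECONDS = 15 * 60  # cookie lifetime after each page view


class TermsCookie(NamedTuple):
    """Records only that terms were accepted; holds no user information."""
    name: str
    expires_at: float


def accept_terms(now: float) -> TermsCookie:
    """Issue the cookie when the user agrees to the terms and conditions."""
    return TermsCookie("lehman", now + TERMS_TTL_SECONDS)


def on_document_view(cookie: Optional[TermsCookie], now: float) -> Optional[TermsCookie]:
    """Renew a valid cookie; return None if the user must agree again."""
    if cookie is None or now >= cookie.expires_at:
        return None
    # Each view pushes the expiry another 15 minutes into the future.
    return TermsCookie(cookie.name, now + TERMS_TTL_SECONDS)


cookie = accept_terms(now=0)
cookie = on_document_view(cookie, now=600)   # viewed after 10 minutes: renewed
print(cookie)
print(on_document_view(cookie, now=1500))    # 15 minutes of inactivity: expired
```

Renewing on every view means an active reader is never interrupted, while an idle session falls back to the terms page after 15 minutes.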

User testing
  • 4 students for formal, task-based study
  • 4 students for focus group
  • All students use archival collections for research

  Results:
  • Corrected issues with citation format
  • Lengthy discussion regarding searching of OCRed text led to more prominent message regarding what is being searched and removal of "subject" facet
  • Added "correspondence file" facet to sidebar of results overview
  • Described "contact" information more clearly
  • Added more descriptive citation information around actual citation
  • Made sure date range search could be implemented (this was very popular)

Statistics

  • Google Analytics, from 4/15/2008

 Planning Documents


Project Timeline

  • Original proposal: 12/11/2003
  • Planning & "hiatus": 2004
  • Budget approved: 12/2/2004
  • Preprocessing begins: 1/2006
  • Preprocessing complete: 8/2006
  • Vendor selected: 8/2006
  • Scanning begins: 1/2007
  • Scanning completed: 6/2007
  • Post-processing completed: 1/2008
  • Application / web site development begins: 1/2008
  • Application / web site launched: 4/15/2008

Columbia Libraries    Digital Program
Last revision: 04/10/10
© Columbia University Libraries