Columbia University Libraries Digital Program

Web Archiving - Tech Plan
Rev. 12/14/2011

A. Human Rights Web Portal

  • Test tools for web harvesting, access
    1. Install and test Heritrix software locally [completed 12/2010]
    2. Install and test Wayback Machine locally [completed 12/2010]
    3. Improve scripting and configurations for CUL / IA harvesting [ongoing]

  • Create prototype search interface to Columbia IA HR collection
    1. Obtain snapshots of Columbia HR collection content [completed 4/2011]
    2. Explore methods for metadata + full-text indexing of content [completed 10/2011]
    3. Create preliminary prototype keyword + metadata search interface to Columbia IA HR collection [completed 11/15/2011]

  • Develop and Launch Beta 1 prototype
    1. Migrate preliminary prototype to new Web Application Framework [January 2012]
    2. Launch Beta 1 prototype [Feb. 1, 2012]
    3. Internal CUL/IS testing [Feb. - Mar. 2012]
    4. Incorporate results of internal testing [March 2012]

  • Launch Beta 2 prototype [April 2012]
    • Conduct formal user testing with outside groups
    • Incorporate results of testing into product

  • Add selected non-Web resources to local search interface
    1. Identify other data sources to integrate into portal [in process 12/2011]
    2. Configure and do beta testing

  • Create new local HR document repository and integrate into local search interface [in planning]
    1. Set up new Fedora collection
    2. Set up workflow for selectors to create metadata and upload documents
    3. Index metadata and full text
    4. Merge into composite index
    5. Perform user testing and assess results

  • Launch Public Beta 3 [July 2012]
    • Conduct formal user testing
    • Incorporate results of testing into product

  • Launch Public Portal Version 1.0 [Sept. 2012]
    1. Complete HR Portal site public interface
    2. Launch Version 1.0

B. Search Enhancements

(These features may not be included in the main public portal release; they may instead be made available separately for ongoing development and testing.)

  • Explore / Implement Enhanced Searching / Semantic Web / Linked Data technologies
    1. Semantically analyze human rights web collection (English-language content)

      Explore text mining, named entity recognition, frequency analysis, clustering; analyze internal links & 'anchor windows'; extend analysis to relevant non-Web resources; evaluate DBpedia and Tagpedia approaches

    2. Develop / leverage defined human rights ontology / concept map

      Investigate existing research and practice; coordinate with other institutions working with human rights content; explore applicability of semantically aware discovery and query engines; explore relevant existing taxonomies

    3. Semantically characterize web collection

      E.g., use ORE, resource maps, linked data, RDF/XML

    4. Develop prototype semantically-generated research guides and other tools

      Develop prototype content guides / content overviews; perform user testing; revise and extend prototype; explore functional integration with other related projects; explore enrichment of basic search and retrieval interface with semantically-generated metadata

  • Complete and Release Portal Version 2.0
    1. Integrate enhanced searching, semantic web and linked data functionality
    2. Develop sustainable update strategy based on automated processes
    3. Explore extending local web archive searching to other relevant Archive-It and CDL content
    4. Launch Version 2.0

  • Ingest local Web archives into Fedora for long-term preservation
    1. Develop / implement Fedora data model for Web content
    2. Ingest archived Web content into Columbia's Fedora repository
    3. Develop / implement content curation / migration strategy



Change log:

2011-12-14:

  • Updated completed and in-process tasks
  • Integrated planned testing / beta version release schedule
  • Broke out Search Enhancements as separate track

2011-10-03:

  • Updated completed tasks
  • Renamed releases Version 1.0, 2.0
  • Added additional user testing tasks

2011-03-22:  

  • Removed task of local implementation of Wayback software; the local interface will now access IA content remotely
  • Added milestones for testing tools for web harvesting/access
  • Added milestones for merging non-Web metadata
  • Added milestone for creating and merging local document repository
  • Added milestone for releasing Phase 1 portal prototype
    (Previous version of project plan.)

2010-07-09:

  • Changed to show local implementation of access-oriented Wayback software as a separate task from Fedora ingest and long-term preservation of Web content
  • Generalized description to include creating local web collections in areas other than human rights
  • Added task of enabling selective document-oriented archiving and access


Last revision: 12/14/11
© Columbia University Libraries