Project Plan - Web Archiving

Columbia University Libraries Digital Program

Web Archiving - Draft Tech Plan
10/8/2010


Implement local Web archive access system
1. Implement / customize local instance of "Wayback" software
2. Transfer selected content collections from Internet Archive
3. Implement capacity for selective document-oriented archiving with local access
4. Explore / implement indexing alternatives to NutchWAX
5. Release local web archive search system with one or more collections (e.g., human rights collection, Columbia University website collection, historic preservation collection)
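
As an illustration of the access tasks above, the following is a minimal sketch of how a replay address for an archived page might be formed, assuming the local Wayback instance follows the common convention of a base address, a 14-digit capture timestamp, and the original URL. The host name and the replay_url helper are hypothetical and not part of this plan.

```python
from datetime import datetime, timezone

# Hypothetical base address for a local Wayback instance (an assumption,
# not a host named anywhere in this plan).
WAYBACK_BASE = "http://wayback.example.edu/wayback"


def replay_url(original_url, captured_at, base=WAYBACK_BASE):
    """Build a Wayback-style replay URL: <base>/<YYYYMMDDhhmmss>/<original URL>."""
    # Capture timestamps are conventionally expressed in UTC as 14 digits.
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{stamp}/{original_url}"


capture_time = datetime(2010, 7, 9, 12, 0, 0, tzinfo=timezone.utc)
print(replay_url("http://www.example.org/report.html", capture_time))
```

The actual address layout would depend on how the local instance and its collections are configured.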
Explore / implement selected Semantic Web technologies
1. Semantically analyze human rights web collection (English-language content)
  
  Explore text mining, named entity recognition, frequency analysis, clustering; analyze internal links & 'anchor windows'; extend analysis to relevant non-Web resources; evaluate DBpedia and Tagpedia approaches
2. Develop / leverage defined human rights ontology / concept map
  
  Investigate existing research and practice; coordinate with other institutions working with human rights content; explore applicability of semantically aware discovery and query engines; explore relevant existing taxonomies
3. Semantically characterize web collection
  
  E.g., use ORE, resource maps, linked data, RDF/XML
4. Develop prototype semantically-generated research guides and other tools
  
  Develop prototype content guides / content overviews; perform user testing; revise and extend prototype; explore functional integration with other related projects; explore enrichment of basic search and retrieval interface with semantically-generated metadata
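
Two of the analysis techniques named above, frequency analysis and 'anchor windows', can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not a chosen implementation: the stopword list, the window size, and the names term_frequencies and AnchorWindowParser are all hypothetical, and an 'anchor window' is taken here to mean the link text plus a few words on either side of it. Named entity recognition and clustering are not covered.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; a real list would be larger).
STOPWORDS = {"the", "and", "for", "was", "are", "with", "that", "this", "from"}


def term_frequencies(text, top=5):
    """Return the most common terms in English text, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)


class AnchorWindowParser(HTMLParser):
    """Collect each link's target, anchor text, and surrounding words."""

    def __init__(self, window=5):
        super().__init__()
        self.window = window  # words kept on each side of the anchor text
        self.words = []       # all text words in document order
        self.links = []       # (href, start index, end index) into self.words
        self._open = None     # (href, start index) of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        # Simplification: script and style text is not filtered out.
        self.words.extend(data.split())

    def anchor_windows(self):
        """Return (href, anchor text, window text) for every link seen."""
        result = []
        for href, start, end in self.links:
            lo = max(0, start - self.window)
            hi = min(len(self.words), end + self.window)
            result.append((href,
                           " ".join(self.words[start:end]),
                           " ".join(self.words[lo:hi])))
        return result


parser = AnchorWindowParser(window=2)
parser.feed('<p>The annual report on '
            '<a href="http://example.org/r">press freedom</a> '
            'was published today.</p>')
print(parser.anchor_windows())
print(term_frequencies("Rights groups document rights abuses; "
                       "rights monitors report abuses.", top=2))
```

Words near a link often describe the target page better than the link text alone, which is why the window is kept alongside the anchor text. In practice the input would come from archived page captures rather than inline strings.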
Create Human Rights Reference Portal
1. Integrate web archive searching, reference tools, research guides
2. Develop sustainable update strategy based on automated processes
3. Explore extending local web archive searching to other relevant Archive-It and CDL content
Ingest local Web archives into Fedora for long-term preservation
1. Develop / implement Fedora data model for Web content
2. Ingest archived Web content into Columbia's Fedora repository
3. Develop / implement content curation / migration strategy

Change log:

2010-07-09: Changed to show local implementation of access-oriented Wayback software as separate task from Fedora ingest and long-term preservation of Web content; generalized description to include creating local web collections in areas other than human rights; added task of enabling selective document-oriented archiving and access.
