Columbia's Long-Term Digital Preservation Archive is a key component of Columbia's Digital Library and Institutional Repository infrastructure. It consists of a robust asset management system that manages the digital resources of Columbia Libraries/Information Services (CUL/IS) for a variety of applications while providing the set of features and services needed for long-term preservation of relevant digital assets. The need for a comprehensive architecture was identified when the new Digital Programs and Technology Services group (DPTS) was created in July 2007. Planning for the storage system began in the first quarter of 2008; implementation began in 2009.
The technical architecture has been designed with four main components:
- Digital preservation storage system
- Fedora software platform
- Application and authentication middleware
- Applications to support the Long-Term Digital Preservation Archive and other programs
1. Digital Preservation Storage System
CUL/IS stores four copies of each digital preservation asset, two on disk and two on tape. One copy on disk and a second copy on tape will be located in an automated system in Columbia’s main data center. A third copy, on disk, is located in the NYSERNet Data Center in Syracuse, New York. A fourth copy, on offline tape, is sent to Iron Mountain to provide an additional offsite location.
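The copy layout described above can be written down as a small data structure and checked mechanically. This is only an illustrative sketch: the `Copy` class, the `PRESERVATION_COPIES` list and the `summarize` function are hypothetical names introduced here, not part of the SAM software or of the CUL/IS configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One replica of the preservation assets (hypothetical model)."""
    medium: str     # "disk" or "tape"
    location: str   # where the replica is held
    offsite: bool   # True if held outside Columbia's main data center


# The four copies as described in the text.
PRESERVATION_COPIES = [
    Copy("disk", "Columbia main data center", offsite=False),
    Copy("tape", "Columbia main data center", offsite=False),
    Copy("disk", "NYSERNet Data Center, Syracuse, NY", offsite=True),
    Copy("tape", "Iron Mountain (offline tape)", offsite=True),
]


def summarize(copies):
    """Count copies by medium, offsite copies, and distinct locations."""
    by_medium = {}
    for copy in copies:
        by_medium[copy.medium] = by_medium.get(copy.medium, 0) + 1
    return {
        "total": len(copies),
        "by_medium": by_medium,
        "offsite": sum(1 for copy in copies if copy.offsite),
        "locations": len({copy.location for copy in copies}),
    }


if __name__ == "__main__":
    print(summarize(PRESERVATION_COPIES))
```

A summary like this makes the redundancy properties explicit: two media types, three distinct locations, and two copies held away from the main data center.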
To manage multiple copies, automate migration and replication, and provide a policy-based model for long-term retention of and access to digital assets, CUL/IS chose the Sun StorageTek Storage Archive Manager (SAM) software along with Sun hardware as a single-vendor solution. SAM is “tried and true,” with over a decade of proven use in managing large data repositories at corporations, supercomputer centers and libraries. It provides a self-protecting, automated data migration and recovery model that enables us to populate and incrementally expand the preservation storage to meet current and future needs. To support long-term sustainability and end-of-life data migration, SAM stores data on disk in portable, nonproprietary formats, its source code has been published as open source, and it uses open standards to provide data retrieval and access.
A total of 280 terabytes (TB) of disk and tape storage has been purchased to support the Digital Preservation Storage System. After this storage has been configured to support four copies of the digital assets, the system will have an effective storage capacity of approximately 70TB. As purchased, the system may be expanded incrementally to an effective storage capacity of up to 400TB. A high-speed, 10TB local disk cache provides increased access performance for commonly accessed digital assets and ensures that CUL/IS can rapidly load the system as required by large digital preservation efforts.
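The capacity figures above follow from dividing raw storage by the number of copies. The sketch below reproduces that arithmetic under a simplifying assumption that raw capacity is split evenly across the four copies; it ignores the disk cache, file system overhead and compression. The function names are hypothetical.

```python
def effective_capacity_tb(raw_tb, copies):
    """Effective capacity when every asset is stored `copies` times."""
    return raw_tb / copies


def raw_needed_tb(effective_tb, copies):
    """Raw capacity required to hold `effective_tb` of assets `copies` times."""
    return effective_tb * copies


if __name__ == "__main__":
    # 280TB purchased, four copies of each asset.
    print(effective_capacity_tb(280, 4))
    # Raw storage implied by the 400TB expansion ceiling, same assumption.
    print(raw_needed_tb(400, 4))
```

Under the same even-split assumption, expanding to an effective capacity of 400TB would correspond to about 1,600TB of raw disk and tape.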
2. Fedora Software Platform
CUL/IS has chosen the Fedora Commons software platform to manage Columbia’s digital repository, long-term archive and a variety of other applications. Fedora version 3 has been installed on CUL/IS production servers, managed by Columbia University's central IT group. Fedora runs in a “leader/follower” configuration to provide replication and failover support.
3. Application and Authentication Middleware
CUL/IS has configured middleware to support common application needs, including but not limited to search and authentication/authorization. Authentication and authorization build on the University’s identity management system, which is based on Kerberos, LDAP and Shibboleth.
4. Long-Term Digital Archive Content and Applications
In 2009 we began loading content into our long-term digital preservation archive on the Fedora platform. As part of this ingest process, descriptive, structural and rights metadata will be reformulated according to current standards (e.g., MODS, METS, PREMIS, AES), and technical metadata will be generated automatically (e.g., using JHOVE) when feasible. New institutional policies and procedures required for meeting the standard of a “trusted digital repository” will be substantially complete by July 2011. The digital preservation repository will also be functionally integrated with our evolving digital asset management system, so that digital content of all kinds can be repurposed for use by both the Columbia community and external users.
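As a rough illustration of what automatically generated technical metadata can look like, the sketch below records a file's name, size and SHA-256 checksum using only the Python standard library. It is a stand-in for illustration and does not use or reproduce JHOVE, which additionally identifies and validates file formats; the function name and the dictionary keys are assumptions made here, not a PREMIS or METS serialization.

```python
import hashlib
import os


def basic_technical_metadata(path, chunk_size=1024 * 1024):
    """Return file name, size in bytes and SHA-256 digest for one file.

    The file is read in chunks so that large preservation masters
    do not need to fit in memory.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": size,
        "sha256": digest.hexdigest(),
    }


if __name__ == "__main__":
    import sys

    # Usage: python technical_metadata.py FILE [FILE ...]
    for file_path in sys.argv[1:]:
        print(basic_technical_metadata(file_path))
```

A checksum recorded at ingest can later be recomputed and compared to detect whether any stored copy of the file has changed.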
At present the range of content we will be preserving in our long-term archive (LTA) includes: digitized library collections (largely special and rare materials), content deposited in our local institutional repository (Academic Commons) and, on a test basis, web sites harvested as part of a grant-funded pilot program.
In implementing our local applications and policies supporting the LTA, we will be guided by the May 2002 report "Trusted Digital Repositories, Attributes and Responsibilities: An RLG-OCLC Report" along with other more recent research and best practices.
NYSERNet Data Center: http://www.nysernet.org/services/bcc/
Iron Mountain: http://www.ironmountain.com/dataprotection/vault/
Sun StorageTek SAM Software: http://www.sun.com/storagetek/management_software/data_management/sam/