Columbia University Digital Library
Architecture and Services

David Millman
Academic Information Systems
January, 2000

The Columbia digital library program has grown since the early 1990s into a comprehensive systems infrastructure and a set of support services. Projects using these systems and services are created in many areas of the University, including the Libraries, academic departments and centers, Columbia University Press, and Academic Information Systems (AcIS). Development and operations are closely coordinated among units of AcIS and the Libraries.

Systems Architecture

Columbia's strategy for digital library architecture may be thought of as a series of discrete layers of technology services, ranging from the storage of bits to specialized interactive applications. The layers are ordered: each successive layer depends upon the services of all prior layers.

1. Repository.

This facility is responsible for the storage of digital data, independent of media type, intellectual content or provenance. At the moment all data is stored on networked disk systems with traditional tape backup. Data is migrated as necessary to take advantage of the most current technology. The AcIS Computing Systems group operates the facility. Work is now underway to build additional redundancy through mirroring and off-site storage. Guiding priorities at the repository level are preservation and integrity. We plan to establish a system to better ensure authenticity of data through cryptographically signed documents. We hope ultimately to adopt a robust preservation strategy; a coordinating group from AcIS, the Library Systems Office and the Library Preservation Division meets periodically to review the current state of the art.
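
The integrity goal can be illustrated with a simple fixity check: a digest is recorded when an object is stored and recomputed later to detect change. The following is a minimal sketch using only the Python standard library, not a description of the repository's actual mechanism. The function names and the key are hypothetical. Note that an HMAC is a keyed digest rather than a public-key signature, so it shows tamper detection only, not the signed-document system planned above.

```python
import hashlib
import hmac


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest to record when an object is stored."""
    return hashlib.sha256(data).hexdigest()


def seal(data: bytes, key: bytes) -> str:
    """Return a keyed digest (HMAC-SHA256); only a key holder can produce it."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, expected_seal: str) -> bool:
    """Recompute the seal and compare it in constant time."""
    return hmac.compare_digest(seal(data, key), expected_seal)


if __name__ == "__main__":
    key = b"hypothetical-repository-key"  # assumption: key management is out of scope
    original = b"page image bytes"
    stored_seal = seal(original, key)
    print(verify(original, key, stored_seal))         # object unchanged
    print(verify(original + b"x", key, stored_seal))  # object altered
```

A plain digest detects accidental corruption; the keyed variant additionally detects alteration by anyone who does not hold the key.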

2. Naming.

Access to most of Columbia's digital collections is through web protocols, with distinct items referenced directly by the web servers using simple Unix file system or database semantics and operations. But we have also introduced mechanisms to reference items with a new form of identifier. These identifiers are location independent and, we hope, more persistent than traditional URLs. Two collections now use this form of naming exclusively. Development has taken place in the AcIS R&D unit in consultation with the Library Systems Office.

Our strategy is to move entirely to a system of persistent identifiers in order to create a reliable citation framework, to enable more flexible hybrid and aggregate digital objects, and to improve scaling and maintenance of the collections.

We are encouraged by related efforts such as the Handle System, DOIs and PURLs. We do not feel any of these yet meet our needs, although we do hope for eventual standardization of a persistent identifier architecture for the Internet or, at least, for the scholarly community. The goal of our current implementation is to better understand semantics, hierarchy and granularity issues, as well as the relationship of the naming facility to other parts of the system infrastructure.
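
The idea of a location-independent identifier can be sketched as a small resolver that maps each identifier to the item's current location. This is an illustration only: the class, the identifier syntax and the paths are hypothetical and do not describe Columbia's implementation.

```python
from typing import Dict


class Resolver:
    """Maps location-independent identifiers to current storage locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, location: str) -> None:
        """Bind an identifier to a location, replacing any earlier binding."""
        self._locations[identifier] = location

    def resolve(self, identifier: str) -> str:
        """Return the current location; raise KeyError if the name is unknown."""
        return self._locations[identifier]


if __name__ == "__main__":
    resolver = Resolver()
    name = "cu:example/0001"  # hypothetical identifier syntax
    resolver.register(name, "/data/old-disk/0001.tif")
    # The item is migrated to new storage; the cited name does not change.
    resolver.register(name, "/data/new-disk/0001.tif")
    print(resolver.resolve(name))
```

Because citations carry only the identifier, migrating data requires updating one binding rather than every reference to the item.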

3. Access Control.

The communities of content consumers and providers have each grown rapidly in size and complexity. Because the University as a whole is sometimes provider and sometimes consumer, the technical infrastructure plays a mediating role among these communities in several ways.

Access rights for the many and overlapping constituencies of the Columbia University community have been codified in directory and authentication services, developed by the AcIS Computing Systems group in collaboration with offices such as the registrar, Human Resources, alumni relations, the Libraries and the medical center. AcIS R&D is responsible for access control to all online services, including digital library content, and has developed mechanisms to authenticate and authorize access to services housed locally or remotely, based on one's membership in appropriate Columbia constituencies.

To provide access to Columbia's content for individuals outside our community, AcIS R&D has developed a rudimentary mechanism enabling the owner of the content to grant access rights.
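
The two authorization paths described above, membership in a constituency and an explicit grant by the content owner, can be sketched as follows. This is a minimal illustration that assumes the user has already been authenticated; every user, constituency and service name in it is hypothetical.

```python
from typing import Dict, Set

# Hypothetical directory data: the constituencies each person belongs to.
MEMBERSHIPS: Dict[str, Set[str]] = {
    "alice": {"student", "library-patron"},
    "bob": {"alumni"},
}

# Hypothetical policy: the constituencies allowed to use each service.
SERVICE_POLICY: Dict[str, Set[str]] = {
    "licensed-journals": {"student", "faculty", "staff"},
    "project-images": {"faculty"},
}

# Hypothetical grants made by a content owner to people outside the community.
OWNER_GRANTS: Dict[str, Set[str]] = {
    "project-images": {"guest@example.org"},
}


def is_authorized(user: str, service: str) -> bool:
    """Allow access through constituency membership or an owner's grant."""
    allowed = SERVICE_POLICY.get(service, set())
    if MEMBERSHIPS.get(user, set()) & allowed:
        return True
    return user in OWNER_GRANTS.get(service, set())


if __name__ == "__main__":
    print(is_authorized("alice", "licensed-journals"))
    print(is_authorized("guest@example.org", "project-images"))
```

Keeping memberships, policy and grants in separate tables lets each be maintained by the office responsible for it.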

Strategically, we are developing a more general architecture in collaboration with several other institutions and under the sponsorship and guidance of the Digital Library Federation. This method allows content-consuming organizations to certify the access rights of members of their own community for services at content-providing organizations. Fine gradations of control are possible in defining the capabilities and privacy of individual consumers, and the model has good scaling properties. Because the underlying technology uses digital certificates, this method raises issues of institutional encryption policy and inter-organizational trust, which may have significant impact outside of the digital library realm. Development is ongoing in AcIS R&D.

Fig 1. Layered architecture

4. Index.

Classification, retrieval and collation are key library requirements with many mixed metaphors in digital settings. This "indexing" function is currently the most resource intensive in computational requirements, software costs and labor. Columbia balances these costs through a number of efforts.

Full Text Retrieval. While Columbia was a pioneer in the 1980s, with high performance homegrown software, it has migrated over the years to, e.g., Harvest, Glimpse, Webinator and most recently to Ultraseek. These systems are increasingly fast, precise and popular. But they are designed to optimize performance at the expense of flexibility, often presuming particular styles of user interface, security and content partitioning.

Structured Text Retrieval. Using an Opentext product, "Pat", we have implemented several retrieval systems for specific areas of scholarly research. Some of the above restrictions are relaxed by the innovative design of this product and through additional local development.

Specialized Indexes. A number of locally developed systems are used. Most often these create indexes based on bibliographic characteristics through, e.g., parsing MARC records or extracting tagged metadata from within documents.

Master Metadata File (MMF). This facility, developed by Library Systems and AcIS R&D, is a relational database application holding bibliographic and structural information. It is able to represent multiple versions, collections, aggregations such as pages in a book, and hierarchies of digital objects. While still under development, it is currently being used in several projects. Information may be imported and exported in several formats, and the database may be queried interactively.
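
To make the kind of structure described for the MMF concrete, the sketch below models versions, aggregations such as pages in a book, and hierarchies in a small relational schema. It uses Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for illustration and are not the MMF's actual schema.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE digital_object (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES digital_object(id),  -- NULL for top-level items
    kind      TEXT NOT NULL,                          -- 'collection', 'book', 'page'
    title     TEXT NOT NULL,
    sequence  INTEGER                                 -- order among siblings
);
CREATE TABLE object_version (
    object_id INTEGER NOT NULL REFERENCES digital_object(id),
    label     TEXT NOT NULL,                          -- 'master', 'thumbnail'
    location  TEXT NOT NULL
);
"""


def build_sample(conn: sqlite3.Connection) -> None:
    """Create the hypothetical schema and load a few illustrative rows."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO digital_object VALUES (?, ?, ?, ?, ?)",
        [
            (1, None, "collection", "Sample collection", None),
            (2, 1, "book", "Sample book", 1),
            (3, 2, "page", "Page 2", 2),
            (4, 2, "page", "Page 1", 1),
        ],
    )
    conn.executemany(
        "INSERT INTO object_version VALUES (?, ?, ?)",
        [(4, "master", "p1.tif"), (4, "thumbnail", "p1.gif")],
    )
    conn.commit()


def children(conn: sqlite3.Connection, parent_id: int) -> List[str]:
    """Return the titles of an object's parts in sequence order."""
    rows = conn.execute(
        "SELECT title FROM digital_object WHERE parent_id = ? ORDER BY sequence",
        (parent_id,),
    )
    return [row[0] for row in rows]


def path_to(conn: sqlite3.Connection, object_id: int) -> List[str]:
    """Return titles from the top of the hierarchy down to the given object."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, title, depth) AS (
            SELECT id, parent_id, title, 0 FROM digital_object WHERE id = ?
            UNION ALL
            SELECT d.id, d.parent_id, d.title, chain.depth + 1
            FROM digital_object AS d JOIN chain ON d.id = chain.parent_id
        )
        SELECT title FROM chain ORDER BY depth DESC
        """,
        (object_id,),
    )
    return [row[0] for row in rows]


def versions(conn: sqlite3.Connection, object_id: int) -> List[str]:
    """Return the version labels recorded for one object, alphabetically."""
    rows = conn.execute(
        "SELECT label FROM object_version WHERE object_id = ? ORDER BY label",
        (object_id,),
    )
    return [row[0] for row in rows]
```

A single self-referencing table keeps collections, books and pages in one hierarchy, while a separate version table lets one object have several stored forms.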

Fig 2. Sample applications

5. Interface.

People experience the digital library through the presentation of content and through tools for navigating and manipulating it. Many styles of interface have been developed, as projects have traditionally been conceived and described from a vision of how particular content should be experienced.

While acknowledging the benefits of that approach, our strategy is to provide a framework in which content may be used in as many ways, by as many different interfaces, as possible. As our architecture is becoming more stable, it has begun to offer such a set of services.

Interfaces continue to be developed by project teams working with library collections and, most recently, by the new EPIC and CCNMTL centers, which are creating large amounts of content and have an expressed need to use material in multiple settings.

The systems architecture is explicitly designed to enable many kinds of collaborations, including widely distributed ones. For example, the Digital Scriptorium project uses an interface at Berkeley to integrate Berkeley's own material with Columbia's. In the Advanced Papyrological Information System, the Columbia interface integrates material from several institutions. In neither case are remote web pages presented; rather, remote content is seamlessly presented by a consistent interface.

Support Processes

Our strategy is informed by ongoing collaborations with many colleagues in different ways. The systems architecture indicates clear directions for development and we believe it is becoming a strong infrastructure, offering flexible access to organized and persistent collections. But new acquisitions, projects and partners continue to introduce challenging variables; and new technologies are still more the rule than the exception. We therefore engage actively in individual projects through several processes.

1. Consultation and Tools.

We continue to perform systems analysis, bibliographic analysis, programming and content reformatting functions. These range from activities for particular projects, such as the development of interface software to query the Master Metadata File, to more general tools, such as software to filter metadata records from one standard format to another, or to create derivative images in bulk. Library Systems and AcIS R&D do most of this work, often in close collaboration with per-project participants.
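
A metadata filter of the kind mentioned above can be sketched as a crosswalk from one record format to another. The example below maps a few MARC fields to Dublin Core elements. It is a simplified illustration, not the software described here: the in-memory record layout is an assumption, the crosswalk is deliberately partial, and real MARC records need a proper parser.

```python
from typing import Dict, List, Tuple

# Assumed layout: a record is a list of (tag, subfields) pairs, where
# subfields maps a one-letter subfield code to its value.
Field = Tuple[str, Dict[str, str]]

# Partial crosswalk for illustration: (MARC tag, subfield) -> Dublin Core element.
CROSSWALK: Dict[Tuple[str, str], str] = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_dc(record: List[Field]) -> Dict[str, List[str]]:
    """Convert the mapped fields of a simplified MARC record to Dublin Core."""
    dc: Dict[str, List[str]] = {}
    for tag, subfields in record:
        for code, value in subfields.items():
            element = CROSSWALK.get((tag, code))
            if element is not None:
                # Drop trailing punctuation that MARC values often carry.
                dc.setdefault(element, []).append(value.rstrip(" /:;,"))
    return dc


if __name__ == "__main__":
    sample = [
        ("100", {"a": "Doe, Jane,"}),
        ("245", {"a": "Sample title /", "c": "Jane Doe."}),
        ("650", {"a": "Papyri"}),
        ("650", {"a": "Manuscripts"}),
    ]
    print(marc_to_dc(sample))
```

Fields without an entry in the crosswalk are silently dropped, so a production filter would also need to report what it could not map.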

AcIS R&D staff design interfaces, specify structural markup, and oversee markup production for a number of projects. Most recently, R&D and the AcIS Computing Systems group have begun to develop standard methods to store, process and deliver multimedia content. And, working with Columbia's Electronic Data Service, we have developed a capacity to analyze usage data and to correlate it with the demographics of the local community. An explicit goal of our security research is to permit such measurement on a much wider scale.

As Columbia's digital collection development has become less opportunistic, we have increasingly applied our technology research to the acquisitions process. For example, we are able to be more proactive in specifying metadata requirements. Also, our security and persistent naming research has raised the possibility of improved services from the vendors of licensed resources. We have begun to express these security and naming needs to vendors in a more articulate and specific way. In both of these cases related efforts in the Digital Library Federation have increased the range and strength of the message.

2. Partnerships.

In addition to the many collaborations above, a few others merit particular attention. Columbia's Center for Research on Information Access (CRIA) develops systems in a pure research environment and also facilitates technology transfer from Columbia research projects generally. Through CRIA, we are working with the Computer Science Department to adopt their work on harmonization and correlation among large numerical data sets. We hope to bring CRIA's own "Significant Topics" retrieval system into widespread use as part of our infrastructure. A computer science researcher is currently working with us to digitize theatrical set models from Special Collections and render them in three dimensions. And our own security systems will contribute to the Patient Care Digital Library project, in collaboration with Computer Science and Medical Informatics.

Externally, we have a long history of collaboration, including the Museum Educational Site Licensing project, with seven museums, six other universities and the Library of Congress; and the papyrus project mentioned above, currently with six university partners. We have also been active participants in the highly collaborative projects of the Digital Library Federation since its founding.