Columbia Digital Library Initiatives, Fall 2000

I. Collections, Services, and Systems
II. Projects and Programs
III. Publications
IV. Specific Digital Library Challenges

I. Collections, Services, and Systems

A. Collections

1. Primarily Text

Columbia International Affairs Online (CIAO)
Launched in Fall 1997 with seed funding from the Andrew W. Mellon Foundation, Columbia International Affairs Online (CIAO) includes a mix of scholarly materials in the field of international affairs, including working papers, conference proceedings, journal abstracts, books, maps, links to a wide variety of related sites, and a sophisticated indexing and search system that allows scholars and students to use the publication in a number of different ways.
      The project is now self-sustaining through subscription sales, after three years of development. Our experience with CIAO suggests that scholars are interested in exploring the potential of online publication and research, and that enough librarians and end-users find this material sufficiently valuable to purchase it at a price that permits cost-recovery over the long term.
      CIAO is part of the Electronic Publishing Initiative at Columbia (EPIC), described below.

The American Historical Association's Gutenberg-e Dissertation Prize
This innovative new project explores the impact of awarding a prize backed by the prestige of the American Historical Association for electronic publication of monographs in several areas of history that are considered to be "endangered" in the publishing arena. The project will examine the reaction of the scholarly community (authors, academic review committees, administrators, and end users) to the electronic publication of history dissertations, as well as the cost-recovery model that would make such an enterprise feasible over the long term.
      Columbia, through its EPIC center, has been selected as the publication site for this project and has received funding from the Andrew W. Mellon Foundation for three years to cover the costs of publishing these materials online. EPIC is holding a series of seminars for the winners of the prize designed to guide the authors in how to create online publications from their print dissertations. The first set of online versions will become available in early 2001.

Online Books Evaluation Project (1995–99)
With support from the Andrew W. Mellon Foundation, the Libraries and Academic Information Systems (AcIS) collaborated with publishers to assess costs, user preferences, and potential delivery “packaging” for monographs in online electronic form. Participating publishers included Columbia University Press, Oxford University Press, Garland, and Simon and Schuster Higher Education.

Humanities Texts
The Libraries and AcIS provide classic literary and philosophical texts in ASCII, HTML, and SGML formats for study, searching and analysis.

Virtual Reading Room
This is a joint project of Columbia College, AcIS, the Libraries, and the Columbia Center for New Media Teaching and Learning (CCNMTL, described below) to create annotated versions of works studied in Core Curriculum courses. These versions will be integrated with instructional support software facilitating student annotation and discussion.

Commercial Collections
Columbia is also actively building a collection from commercial information sources. In 2000-01 it expects to spend more than 1.3 million dollars on networked reference tools; CD-ROMs and e-journals are largely paid for with other funds. Commercial digital collections include suites of databases from ABC-CLIO, Bowker, Cambridge Scientific, Chadwyck-Healey, Dialog, Gale, OCLC FirstSearch, Ovid, RLG, SilverPlatter, and others. Columbia also has comprehensive title access agreements with most major e-journal publishers, e.g., Blackwell Science, Elsevier, and Springer. Columbia's users also have access to a variety of integrated full-text services such as Academic Universe, IDEAL, JSTOR, Project Muse, ProQuest, and ScienceDirect. Columbia, with Cornell, Dartmouth, and Middlebury College, provides access to nearly 20,000 NetLibrary e-books under an unusual arrangement in which users themselves select what is to be purchased. Columbia's Business, Law, and Health Sciences libraries also provide access to sizable specialized reference, data, and e-journal collections.

2. Primarily Images

Ling Lung Women's Magazine
Ling Lung Women’s Magazine was published in the 1930s in Shanghai, China, at a time when women’s role in society, at least in that sophisticated and foreign-influenced metropolis, was in rapid transition. This pocket-sized, slender, and inexpensive weekly boldly ventured to meet these new needs by encouraging women to advance toward the good life through socially high-minded entertainment. It was filled with articles on fashion, interior decoration, pop psychology, and new careers; advice columns on love, sex, and marriage; and lavish illustrations of local and Hollywood celebrities. The wide array of advertisements for women’s products is often just as revealing of life and aspirations as the words of the text.
      The first issue came out on March 18, 1931 and the magazine ceased publication in 1937. As far as we know, Columbia University's Starr East Asian Library is the only library outside China to hold a nearly complete set.

Art Humanities Reserve Collection (1995–)
The Libraries and AcIS have created an image library of several thousand images. The fully-cataloged collection focuses on material from a core course, required of all undergraduates, and is used both in electronic classrooms and for study.

Advanced Papyrological Information System (APIS)
With support from the National Endowment for the Humanities, APIS is a multi-institutional project to create a digital library of papyri, transcriptions and related bibliographical information. The Libraries and AcIS are collaborating with faculty both in implementing the project at Columbia and in coordinating the overall effort.

Digital Scriptorium
With funding from the Andrew W. Mellon Foundation, the Libraries and AcIS are collaborating with the University of California, Berkeley Libraries to create a digital library of dated and datable medieval manuscripts.

Judging a Book by Its Cover: Gold-Stamped Publishers' Bindings of the 19th Century
The advent of gold-stamped decoration, circa 1832, was the most important factor in the acceptance of publishers' bindings. Gold stamping brought to the mass-produced book some of the prestige associated with gold-tooled leather bindings of the pre-industrial era. In fact, stamping often imitated the decorative styles and motifs of the hand-finished book. However, gold stamping also developed its own styles and imagery that reflected the period's taste and culture.

Museum Educational Site Licensing Project (1994–97)
Columbia participated in this J. Paul Getty Trust-coordinated project to explore the use and economics of digital images in university research and teaching. The project developed mechanisms, both technical and procedural, to deliver 10,000 images from eight U.S. museums to seven research universities. A project to evaluate the process and economics of this distribution was supported by the Andrew W. Mellon Foundation.

Oversized Color Images (1994–96)
This joint project of the Libraries and AcIS evaluated options and recommended best practices for preserving and accessing brittle textual materials containing oversized color images. (Funded by CLIR.)

3. Mixed Media

Columbia Earthscape: An Online Resource on the Global Environment
This publication, launched in December 1999, has received three years of funding from the Scholarly Publishing and Academic Resources Coalition (SPARC) and the National Science Foundation for development and launch. Columbia Earthscape seeks to explore how the collaborative model of content development in the digital environment can transform both scholarship and education in the rapidly developing field of earth systems science.
      As part of the Electronic Publishing Initiative at Columbia (EPIC), and based on the design and development experience of our other EPIC projects, we have created a fully-integrated, interdisciplinary, interactive online publication for both research and teaching resources in this field and will evaluate the ongoing educational value and economic viability of providing these services on a cost-recovery model.

Online Music Collections
Each semester, instructors of the Music Humanities course, required of all undergraduates, make audio selections of reserve material available to their students over the network. A "Sonic Glossary," an online audio dictionary of musical terms, is also available. Students may access this material from residence hall rooms or from several campus laboratory facilities.

Asian Topics
Asian Topics is a joint project of the East Asian Curriculum Project (EACP), for teachers and students at the pre-collegiate level, and the Project on Asia in the Core Curriculum, for teachers and students at the undergraduate level. It is designed to be a digital library of multi-media presentations that bring leading scholar-teachers in Asian studies, from Columbia and other institutions, into classrooms, libraries, and homes, speaking on topics in Asian literature, history, religion, and contemporary society. The Topics are interdisciplinary and draw on Columbia's long experience in the intellectual design and production of content materials on Asia for teachers.
      Each Topic features a cameo audio-visual presentation by a leading scholar-teacher, designed to engage the teacher or student new to the study of Asia with the Topic they have chosen to explore. Each entry also provides a bibliography, background essays, other web links, and curricular links for the viewer to pursue the topic in more depth. The site is designed to provide a "first look" at the Topic for teachers who come to the web to gain background on a new subject they plan to introduce in class and for students seeking initial direction for a research report.

South Asia Resources
With support from the Dharam Hinduja Indic Research Center and the Department of Education, Columbia is collaborating with the University of Chicago and the North Carolina Triangle South Asia Consortium to create networked versions of the twenty-six modern literary languages of South Asia.
      The Hinduja Center is supporting preservation of and networked access to Indic manuscripts: 325 important manuscripts from Columbia's Rare Books and Manuscripts Library, in various languages of India. The project includes detailed metadata creation; online publication of an in-depth study of these particular manuscripts (by the world's greatest Indic codicologist, Prof. David Pingree, Brown University); and microfilming and scanning from microfilm to create full-page images of the entire collection, hyperlinked to the metadata.
      Columbia's South Asia Resource Access on the Internet (SARAI) is designated by the WWW-Virtual Library as the official World Wide Web Virtual Library for South Asia.

B. Services

Electronic Data Service
Jointly operated by the Libraries and AcIS, the EDS is the University’s numerical data archive. The index for much of this collection is provided over the network through the EDS Datagate, which has become a recognized national resource. Gigabytes of raw data are also available to scholars locally.

Direct Borrowing
Borrow Direct, developed by RLG, Columbia, Penn and Yale, went live in November 1999. This direct consortial borrowing pilot project offers faculty, staff and students at the three institutions the ability to request and borrow circulating items from the combined stack collections.
      Borrow Direct provides a combined Z39.50 search of Penn's Endeavor catalog and Columbia's and Yale's NOTIS catalogs, then determines availability, handles request management, and connects to the local circulation systems. Requesters receive e-mail notification at each stage of the request process. A commitment to priority handling with a maximum of four days from request to receipt is central to the design of the service.

CLIONotify
CLIONotify is a user-profile-driven new-book notification service currently in pre-release testing for a likely Fall 2000 implementation. This service provides weekly e-mail notification to users about newly cataloged books, electronic resources, and other media that match patrons' study and research interests. User profile editing and viewing are secured via the campus Kerberos-based authentication system. User interests are specified using CU's prototype "Hierarchical Interface to LC Classification" (HILCC), enabling MARC catalog records to be matched on LC Classification numbers. Keywords may also be used to qualify the broader HILCC categories.

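      The profile matching that CLIONotify performs can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the Profile and Record classes, the matches and weekly_digest functions, the use of plain LC class ranges in place of HILCC categories, and the reduction of a MARC record to a title plus one LC class number. It is not a description of the production service.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A user's interests (hypothetical structure)."""
    email: str
    # (class letters, low, high) ranges standing in for HILCC categories
    ranges: list
    # optional keywords that narrow a broad category; empty means no filter
    keywords: list = field(default_factory=list)


@dataclass
class Record:
    """A newly cataloged item, reduced to the fields the sketch needs."""
    title: str
    lc_letters: str
    lc_number: float


def matches(profile, record):
    """True if the record falls in one of the profile's LC ranges and,
    when keywords are given, its title contains at least one of them."""
    in_range = any(
        letters == record.lc_letters and low <= record.lc_number <= high
        for letters, low, high in profile.ranges
    )
    if not in_range:
        return False
    if not profile.keywords:
        return True
    title = record.title.lower()
    return any(keyword.lower() in title for keyword in profile.keywords)


def weekly_digest(profiles, new_records):
    """Map each user's e-mail address to the titles matching the profile.
    Users with no matches are left out of the digest."""
    digest = {}
    for profile in profiles:
        hits = [r.title for r in new_records if matches(profile, r)]
        if hits:
            digest[profile.email] = hits
    return digest
```

      In this sketch a keyword only narrows a class range; it never widens it, which mirrors the statement that keywords qualify the broader categories.
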
Electronic Text Service (ETS)
ETS is a research and instructional facility of the Columbia University Libraries designed to help Columbia faculty and students incorporate computer-based textual and bibliographic information into their research, study, and teaching. ETS has machine-readable primary source texts, software programs for textual analysis and critical editing, hypermedia and database research tools in the humanities, bibliographic database management programs, IBM and Macintosh microcomputers, X terminals, and optical scanning equipment for the creation of machine-readable text. The ETS staff will provide demonstrations, workshops, and classes for students and faculty, as well as individual consultations.

C. Systems

Access Management
Several systems manage access to documents and services, whether locally or remotely provided. AcIS has created systems to integrate the campus identity infrastructure, based on the Kerberos technology, and the campus authorization service, built from distributed databases and delivered by an LDAP directory. These systems control access to restricted resources locally, on secure web servers, and remotely, routed through secure proxy servers.
      Over the last two years, we have been active in the DLF pilot project to use X.509 digital certificates and LDAP to provide access to remote services more effectively without using proxies (see also under Publications, below). As of Fall 2000, we have begun using certificate technology to create a secure "identity hand-off" mechanism to third-party vendors in the context of individualized web services, often known as "portals." Related to this work, we are developing prototypes for identity hand-off among applications ("channels") within individualized services.

Master Metadata File
This facility, developed by Library Systems and AcIS, is a relational database application holding bibliographic and structural information for collections held locally or remotely. It is able to represent multiple versions, collections, aggregations such as pages in a book, and hierarchies of digital objects. While still under development, it is currently being used in several projects. Information may be imported and exported in several formats. The database may be used as an intermediate architectural component and may be queried interactively.
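      As a rough illustration of how a relational database can represent hierarchies of digital objects, aggregations such as pages in a book, and multiple versions, the following Python sketch uses an in-memory SQLite database. The table name digital_object, its columns, the demo rows, and the functions build_demo and descendants are all hypothetical; the actual Master Metadata File schema is not described here.

```python
import sqlite3

# Hypothetical single-table design: each row points at its parent,
# so collections, books, pages, and versions form one tree.
SCHEMA = """
CREATE TABLE digital_object (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES digital_object(id),
    kind      TEXT NOT NULL,   -- 'collection', 'book', 'page', 'version'
    label     TEXT NOT NULL,
    seq       INTEGER          -- order among siblings, e.g. page order
);
"""


def build_demo():
    """Create an in-memory database with a small made-up hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        (1, None, "collection", "Demo collection", None),
        (2, 1, "book", "Demo book", 1),
        (3, 2, "page", "Page 1", 1),
        (4, 2, "page", "Page 2", 2),
        (5, 3, "version", "Page 1, thumbnail image", 1),
        (6, 3, "version", "Page 1, full-size image", 2),
    ]
    conn.executemany("INSERT INTO digital_object VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def descendants(conn, root_id):
    """Return (id, kind, label, depth) for root_id and everything below it,
    using a recursive common table expression to walk the tree."""
    sql = """
    WITH RECURSIVE tree(id, kind, label, depth) AS (
        SELECT id, kind, label, 0 FROM digital_object WHERE id = ?
        UNION ALL
        SELECT d.id, d.kind, d.label, tree.depth + 1
        FROM digital_object d JOIN tree ON d.parent_id = tree.id
    )
    SELECT id, kind, label, depth FROM tree ORDER BY depth, id
    """
    return conn.execute(sql, (root_id,)).fetchall()
```

      A single self-referencing table is only one possible design; it keeps interactive queries simple at the cost of storing every level of the hierarchy in the same shape.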

Hierarchical Interface to LCC
Columbia's "Hierarchical Interface to LC Classification" (HILCC) is a key component of CU digital library metadata planning. It is based on a detailed mapping of the LC Classification schedules to language-based subject categories ("entry vocabulary") organized hierarchically, e.g.,
    QD415.000 - QD436.999 = Sciences -- Chemistry -- Biochemistry
Preliminary versions of HILCC are already used in production to create browsable Web subject listings for the "reference tools and indexes" portion of CU digital library collections, and also for user profile creation/MARC record matching in the CLIONotify application (see above). Columbia plans to continue to develop and refine this tool for campus digital library use and also circulate it to other research libraries for discussion and possible development as a prototype standard.
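      A minimal Python sketch of how such a range-to-category mapping might be consulted is given here. Only the QD415 to QD436.999 row comes from the example above; the table layout, the names Range, HILCC_TABLE, parse_lc_class, and lookup, and the parsing rule are assumptions made for illustration, not a description of the actual HILCC implementation.

```python
import re
from typing import NamedTuple


class Range(NamedTuple):
    letters: str     # LC class letters, e.g. "QD"
    low: float       # lowest class number in the range
    high: float      # highest class number in the range
    category: tuple  # hierarchical entry vocabulary, broadest term first


# Only this row is taken from the text; a real table would hold many rows.
HILCC_TABLE = [
    Range("QD", 415.000, 436.999, ("Sciences", "Chemistry", "Biochemistry")),
]

# Class letters followed by a whole or decimal class number; anything after
# that (Cutter number, year) is ignored.
_CLASS_RE = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def parse_lc_class(call_number):
    """Split a call number into (letters, class number), or return None."""
    m = _CLASS_RE.match(call_number.upper())
    if m is None:
        return None
    return m.group(1), float(m.group(2))


def lookup(call_number, table=HILCC_TABLE):
    """Return the category path whose range contains the call number,
    or None when no listed range applies."""
    parsed = parse_lc_class(call_number)
    if parsed is None:
        return None
    letters, number = parsed
    for row in table:
        if row.letters == letters and row.low <= number <= row.high:
            return row.category
    return None
```

      Under these assumptions, a hypothetical call number such as "QD431 .S63 1998" resolves to the Sciences, Chemistry, Biochemistry path, while a call number outside every listed range returns None.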

LOCKSS Alpha Test Site
Columbia is an alpha test site for the LOCKSS system prototype. LOCKSS preserves access to scientific journals published on the web. The system ensures that hyperlinks continue to resolve and appropriate content is delivered, even when the content is no longer available from the original source. Libraries running LOCKSS cooperate to detect and repair preservation failures.
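      The cooperative detect-and-repair idea can be illustrated with a deliberately simplified Python sketch. It uses a plain majority vote over content hashes, which is an assumption made for illustration and not the actual LOCKSS polling protocol; the function names digest and audit_and_repair and the peer names in any example data are likewise hypothetical.

```python
import hashlib
from collections import Counter


def digest(content):
    """Hash a copy's bytes so that peers can compare copies cheaply."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(copies):
    """copies maps a peer name to the bytes that peer holds for one document.

    Returns (repaired_copies, repaired_peers). A copy counts as damaged when
    its hash disagrees with the hash held by a strict majority of peers, and
    it is repaired by taking the bytes from a peer in that majority.
    """
    if not copies:
        raise ValueError("no copies to audit")
    votes = Counter(digest(c) for c in copies.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(copies) / 2:
        raise ValueError("no majority; cannot decide which copy is good")
    good = next(c for c in copies.values() if digest(c) == winner)
    repaired_peers = [p for p, c in copies.items() if digest(c) != winner]
    repaired_copies = {peer: good for peer in copies}
    return repaired_copies, repaired_peers
```

      The sketch refuses to repair anything when no strict majority exists, since in that case it has no basis for deciding which copy is intact.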

Significant Topics
This research investigates the relationship between the occurrence of significant topics in a document and the structure of the document. The unique contribution of this research lies in the innovative combination of statistical and rule-based methods to identify a list of significant topics as a function of the distribution of terms in documents. To the extent that our techniques are based on linguistically-motivated patterns and not on domain-dependent vocabularies, our patterns apply to general text.

University-wide Bulletin
A comprehensive University-wide schedule of classes, updated daily, has been available for a number of years. The current schedule draws on data extracted from the Registrar's administrative database and on information from individual departments and instructors, who can supplement the central data with links to, for example, instructional material ("course home pages") and information on instructors' research. Beginning in Fall 2000, links to instructional material will be enhanced and links to library reserve materials will be added.

MultiGen
MultiGen, a multi-document summarization system, automatically generates a concise summary by identifying similarities and differences across a set of related documents. Input to the system is a set of related documents, such as those retrieved by a search engine in response to a particular query. Our work to date has focused on generating a summary of the similarities across documents. Our approach uses machine learning over linguistic features extracted from the input documents to identify several groups of paragraph-sized text units so that all units in each group convey approximately the same information. Shallow linguistic analysis and comparison between phrases of these units are used to select the phrases that can adequately convey the similar information. This task is performed by the content planner of the language generation component and results in determination of summary content. Sentence planning and generation are then used to combine the phrases to form a coherent whole.
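      The grouping step, in which text units that convey roughly the same information are collected together, can be sketched in Python as follows. The sketch substitutes simple word-overlap (Jaccard) similarity and greedy grouping for the machine-learned similarity over linguistic features described above, so it shows the shape of the step rather than the MultiGen method itself; the function names and the 0.5 threshold are assumptions.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens in a text unit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_units(units, threshold=0.5):
    """Greedily assign each unit to the first group whose founding unit is
    similar enough; otherwise start a new group.

    Returns a list of groups, each a list of indices into `units`.
    """
    groups = []
    for i, unit in enumerate(units):
        unit_tokens = tokens(unit)
        for group in groups:
            founder_tokens = tokens(units[group[0]])
            if jaccard(unit_tokens, founder_tokens) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

      Each resulting group would then be handed to later stages, which select the phrases that express the shared information.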

PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video And Language resources)
This project will provide personalized access to a distributed patient care digital library through the development of the PERSIVAL system. PERSIVAL will tailor search, presentation, and summarization of online medical literature and consumer health information to the end user, whether patient or healthcare provider. PERSIVAL will utilize the online patient records available at Columbia Presbyterian Medical Center (CPMC) as a sophisticated, pre-existing user model that can aid in predicting users' information needs and interests. For those patients with no CPMC patient record, PERSIVAL will ask specific questions, depending upon the user query and clinical context, to build and maintain a skeletal user health information model.
      Key features of the proposed work include personalized access to distributed, multimedia resources available both locally and over the Internet, fusion of repetitive information and identification of conflicting information from multiple relevant sources, and presentation of information in concise multimedia summaries that cross-link images, video, and text. Given the widely varying nature of online resources, research in retrieval and search methodology will focus on automatically identifying source type (e.g., journal articles vs. self-help groups), quality, and level of intended audience. In addition to fusing information from multiple sources, summaries must also express facts in terms the user can understand, regardless of background. Video sources range from diagnostic test results to educational video, each of which requires search based on image characteristics and identification of significant events to aid in finding appropriate clips.

Content-Based Multimedia Search and Retrieval
This research focuses on development of next-generation image technologies for efficient multimedia content creation, manipulation, searching, and distribution. Multimedia content, especially images and video, is becoming increasingly important thanks to rapid advances in multimedia computing and communication. In particular, the ubiquitous impact of the Internet has made it possible for general users to disseminate and access multimedia content online easily and quickly.

CARDGIS: Center for Applied Research in Digital Government Information Systems
CARDGIS brings together a strong team of researchers and developers with interests and experience in databases, human-computer interaction, knowledge representation, data mining, and other areas of computer science and information systems. The center's mission is research in the design and development of advanced information systems with capabilities for generating, sharing, and interacting with knowledge in a networked environment. Participants are drawn from Columbia University's Department of Computer Science and from the University of Southern California's Information Sciences Institute. Technical assistance is provided by experts from several Federal government agencies.

II. Projects and Programs

Electronic Publishing Initiative at Columbia (EPIC)
The Electronic Publishing Initiative at Columbia (EPIC) is a groundbreaking new initiative in digital publishing at Columbia University that involves Columbia University Press, the Libraries, and Academic Information Systems. Its mission is to create new kinds of scholarly and educational publications through the use of new media technologies in an integrated research and production environment. Working with the producers of intellectual property at Columbia University and other leading academic institutions, it aims to make these digital publications self-sustaining through subscription sales to institutions and individual users.
      EPIC is committed to pursuing the highest standards in the development of content, use of technology, handling of issues of intellectual property and copyright, development of business plans, and evaluation of use. Its publications are designed to be innovative, efficient and cost-effective.
      Current constituent projects within EPIC are described above: Columbia International Affairs Online (CIAO), Columbia Earthscape and Gutenberg-e.

Columbia Center for New Media Teaching and Learning
In partnership with the faculty as content experts, the Center is committed to advancing the purposeful use of new media and digital technologies in the educational programs of Columbia University. It is committed to ongoing evaluation of the efficacy of its work within the University.
      The Center is committed to extending the population of involved faculty by providing them with a broad range of points of access: workshops, forums, individual consultations, as well as ongoing and sustaining support in the development of projects. The Center begins with anyone who is willing to bring a syllabus. In developing more advanced projects, it is committed to building “visible heuristics,” that is, projects in collaboration with faculty that act as demonstrations and explorations of pedagogical and curricular possibility.
      The Center is committed to building what is called the Columbia Educational Operating System, a suite of integrated applications that extends the capacity of students and faculty to capture, analyze, and integrate data in new ways. CEOS will also provide equally powerful communications tools to facilitate dialogue and the exchange of ideas.
      The Center is about building partnerships and providing the motivation and venue for the integration of disparate efforts in digital development. In this capacity, the Center is presently working not only with individual faculty members but also with other entities of the University committed to similar goals, such as CIESIN, CERC, AcIS, CME, and others. The Center is also active in contributing to strategic planning at the school, college, and university levels.

Center for Research on Information Access
This center was established in early 1995 to act as a vehicle for linking different projects on the Columbia campus involved in developing and using digital technology. Initial funding comes from the Office of the Vice-Provost under the Virtual Information Initiative. CRIA has received funding from national agencies, foundations, and industry.
      CRIA is committed to facilitating connections between projects for exploring new technologies, developing new electronically available resources, and improving instruction, which are all essential facets of information access. CRIA is housed within the Columbia University Libraries with links to the Computer Science Department.

III. Publications

Columbia University Digital Library: Architecture and Services.
This paper describes Columbia's technology framework for digital collections generally, including strategies for distributed portions of infrastructure, independent development tracks, and following moving targets.

Policy for Preservation of Digital Resources
Digital resources are part of the Columbia Libraries collections and subject to the same criteria for selection and retention decisions as other media. As such, they are included under the central preservation policy: ensuring that the collections remain available over the long term, through prevention of damage and deterioration; reversing damage where possible; and, when necessary, changing the format of materials to preserve their intellectual content.

Cross-Organizational Access Management
This November 1999 D-Lib Magazine article describes the DLF authentication and authorization architecture pilot project. The architecture uses X.509 digital certificates and LDAP directory services to create a highly effective access management system.

IV. Specific Digital Library Challenges

How can we best balance integration into decentralized campus organizations (such as libraries, computing, schools, departments, and new media learning, teaching, and publishing centers) with the need for central coordination, planning, architecture, and standards?


What is the most appropriate critical path to effect more interoperable architectures? What is the best mix between standardization and development among the many dimensions of scalability?