Columbia University Libraries/Academic Information Systems

OVERSIZE COLOR IMAGES PROJECT PHASE II

FINAL REPORT

to the

Commission on Preservation and Access

Janet Gertz
Director for Preservation, Libraries

Robert Cartolano
Lead Consultant, Academic Information Systems

Susan Klimley
Science and Engineering Digital Projects Librarian

November 1996


Table of Contents

  1. Introduction
  2. Creating the Film Intermediaries
  3. Metadata
  4. Vendor Scanning
  5. Preservation of the Digital Files
  6. Columbia User Presentation
  7. User Reactions
  8. The Future
  9. Endnotes

Introduction

In 1994 Columbia University undertook the "Oversize Color Images Scanning Project," funded by a contract from the Commission on Preservation and Access. The goal of the project has been to investigate digitization as a method of providing access to brittle volumes in which the text is accompanied by color and oversize illustrations, by testing the hybrid approach to creating microforms for long-term storage and then scanning the microforms to create an on-line digital use version. Microforms provide a long-term preservation storage medium but are awkward at best as a use medium, especially with oversize materials; normal high-contrast black-and-white film is inadequate for color materials.

In Phase I of the project comparison was made between direct digital scans of five original color, oversize maps from four numbers of the New York State Museum Bulletin and scans of single-frame color microfiche of the same maps. It was determined that scanning the microfiche could produce files with resolution equal to that of the scanned originals. The smallest print on the original maps could be read both in the online versions and on the paper printouts produced from the scans of the microfiche and from the scans of the original maps. In other words, the project proved that scanning and digitization are now capable of capturing the level of information needed to provide preservation-quality surrogates for oversize, color printed documents such as maps.

Phase II of the project is now complete. Accomplishments were:

Creating the Film Intermediaries

The four brittle numbers of the Museum Bulletin used for this test project are those which contain the maps scanned in Phase I. They date from 1905-1906. Each consists of one or more articles, and most of the articles include illustrations of six possible types.

Many of the color plates are maps. Click here for a detailed listing of the illustrations.

Film intermediaries were created for the text and illustrations.

Metadata

Broadly defined, metadata is information about information, more specifically information about data that enables intelligent, efficient access and management of data. It may include information about the item which has been scanned (size, date), about film intermediaries (film stock, lighting), or about the capture process (equipment used, settings). Some would also include here any bibliographic record corresponding to the scanned item, finding aids, and so forth.

The metadata for this project is contained in four spreadsheets created by Preservation Resources at Columbia's request. Each spreadsheet contains information about all of the files pertaining to one of the four numbers of the Bulletin. A record consisting of defined fields was created for each scanned page, whether illustration or text, to identify the page uniquely and locate it within the structural hierarchy of the volume. Record fields include the Bulletin number, chapter name, page number, short plate title, file type, and so forth, as well as the 8-character file name assigned to the original scan. That file name serves as the core of the unique identifier for each subsequently derived file, in any of a variety of formats and resolutions. Programs were written to display elements of the metadata in HTML pages which frame the bit-mapped page and illustration images with an area that includes variable information such as the page number and navigational tools. All the pages were then linked together to mimic the published organization of the Bulletin.
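As a minimal sketch of how such a record might tie every derived file back to the original scan, the Python below models one spreadsheet row and builds derived file names around the 8-character base name. The field names, the sample values, and the suffix scheme are hypothetical; the report does not give the actual spreadsheet layout or naming rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One metadata row per scanned page (field names are illustrative)."""
    bulletin_number: str
    chapter_name: str
    page_number: str
    plate_title: str
    file_type: str
    base_name: str  # 8-character name assigned to the original scan

    def derived_name(self, extension: str, variant: str = "") -> str:
        """Build a derived file name that keeps the base name as its core."""
        if len(self.base_name) != 8:
            raise ValueError(f"expected an 8-character base name, got {self.base_name!r}")
        suffix = f"_{variant}" if variant else ""
        return f"{self.base_name}{suffix}.{extension}"


# Made-up sample values, used only to exercise the sketch.
record = PageRecord("00", "Sample chapter", "12", "", "text", "nb000012")
print(record.derived_name("tif"))         # master scan
print(record.derived_name("gif", "scr"))  # hypothetical screen version
print(record.derived_name("pdf"))         # print version
```

Keeping the base name unchanged in every derived name is what allows a program to find all versions of a page from a single metadata field.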

The exact content and form of the metadata and file names had a direct effect on manipulation of the files for display and access. Whatever was included in or referred to by the metadata could be manipulated automatically. When the metadata did not include needed information, it became necessary to make changes manually, file-by-file. For instance, because some plates had no author-assigned plate number and it was decided not to assign them numbers as part of the metadata, uniformity of display required manual intervention.

Vendor Scanning

Figure I lays out the various film versions and file formats employed by the project.

Illustrations

Nineteen color microfiche of oversize illustrations were scanned to Photo CD by MicroColor International on a drum scanner at approximately 200 dpi and 24-bit color, the same methodology MicroColor had employed in Phase I of the project. (Compare the maps in Set 2B of the Phase I report.) There were also 74 slides of the page-sized plates and figures that required grayscale/color scanning. They were converted to Photo CD by MicroColor. When mounted and viewed, the images had a strong pink tone. The color bars on the slides were carefully examined on a light box to confirm that no pink tone was present. The slides were then returned to MicroColor for rescanning, which eliminated the problem.

Text

The black-and-white microfilm was scanned twice, by different vendors. Since the Preservation Department of the Yale University Library has served as the leader in conversion of preservation microfilm to digital form through its Open Book Project, they were asked to scan the Bulletin film in order to provide a basis for comparison with other vendors. Yale scanned the film with the same equipment and following the same methods used in the Open Book Project. Click here for a description. A TIFF image was produced for each page of the Bulletin numbers; conversion was bitonal, 600 dots per inch for text pages and 300 dpi for full page and oversize illustrations. The TIFF images were made available to Columbia via FTP. Click here to view samples.

The microfilm of the text and the CDs of the slides and microfiche scans were then sent to Preservation Resources. They provided two services. First, they made 600 dpi bitonal scans of the text from the microfilm. For each page of the Bulletin, one TIFF image was created and compressed using CCITT Fax 4. Preservation Resources also created the corresponding metadata for each page and illustration image to provide hierarchical access to the scanned pages and illustrations for each number as well as to the chapter/article level within each number. Second, they mounted sample images on their own web page as an example of one access and display mechanism for bit-mapped volumes. Click here to see a description of the hardware and software employed. It should be noted that while their proposal lists a Mekel scanner, the work was actually carried out on a SunRise scanner which they had acquired in the meantime.

Quality control

The experience with the pink images points up both the importance of including color and grayscale bars in the images, and the necessity for quality control inspection, even with vendors whose products are usually up to standard. Quality control (that is, verification file-by-file that every image has been scanned and that the files meet designated levels of resolution, bit depth, and so forth) should be applied at three stages: by the vendor at the time of scanning; by library staff at the time of receipt of the master image files; and after manipulation and mounting of the files at the site, to assure that all links work properly.
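The completeness part of such a check can be sketched in a few lines of Python. The sketch below only compares the file names listed in the metadata against the names actually delivered; checking resolution and bit depth would require reading the image headers and is not shown. The function name and the sample file names are hypothetical.

```python
def find_delivery_problems(expected_names, delivered_names):
    """Compare file names listed in the metadata with the files received."""
    expected = set(expected_names)
    delivered = set(delivered_names)
    return {
        "missing": sorted(expected - delivered),      # listed but not received
        "unexpected": sorted(delivered - expected),   # received but not listed
    }


# Made-up example: one delivered file name has two digits transposed.
expected = ["nb000001.tif", "nb000002.tif", "nb000003.tif"]
delivered = ["nb000001.tif", "nb000002.tif", "nb000030.tif"]
report = find_delivery_problems(expected, delivered)
print(report)
```

A transposed name shows up twice, once as a missing file and once as an unexpected one, which makes this kind of error easy to spot at the time of receipt.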

Preservation of the Digital Files

The 600 dpi bitonal scans of the microfilm and the 200 dpi 24-bit color scans of the illustrations constitute the "master" digital copies to be archived permanently by Columbia; all versions provided to users are derived from these masters. Columbia maintains its digital library collection in production (including stringent back-up and archiving procedures) as a central part of University operations. Issues of long-term archiving and migration, as raised in the RLG/CPA Task Force on Archiving Digital Information report, are under review, and initial requirements, such as the use of preservation-related metadata and non-proprietary data formats, are being implemented.

Columbia User Presentation

In order to place the New York State Museum Bulletins online for users, Columbia set several goals with regard to accessibility and readability:

Online Text

In order to make online texts available to a wide audience, they should be able to be viewed from a wide variety of computer platforms and should not require expensive computer hardware or software. Columbia has chosen to place the information on a web server, to support web browsers such as Netscape Navigator, and to tailor the information to work well on low-cost personal computers with standard 14" computer displays.

The master 600 dpi TIFF file format, while suitable for preservation capture of the content of the page, is not ideal as a format for online access or for printing. The file sizes (which range from 75 to 175 K) can cause unacceptable delays in accessing the files. Further, TIFF is not supported by today's Internet web browsers, and CCITT Fax 4 compression is not supported by many graphics viewers. Even if a web browser such as Netscape Navigator could open the TIFF document, the 600 dpi document would be essentially unreadable on most average computers. Common computer displays have approximately 72 dpi resolution, and since Netscape opens images at screen resolution, a 600 dpi image would require a display over 40 inches across and over 60 inches down. If image viewing software is used to scale the document to its actual size, the result of scaling renders the page unreadable. Additional conversion steps are therefore needed to make the page readable.

The lowest common denominator for displays today is 640 by 480 pixels and 256 colors. The image must be reduced in size from 600 dpi resolution to approximately 120 dpi in order to fit the screen so that the entire width of the page can be seen horizontally. Essentially,

original page:
5 inches across x 600 dpi = 3000 pixels
8 inches down x 600 dpi = 4800 pixels.
 
reduce to:
5 inches across x 120 dpi = 600 pixels
8 inches down x 120 dpi = 960 pixels
 
computer display at:
72 dpi, 640 pixels across - entire width of page displays horizontally
72 dpi, 480 pixels down - 2 screens needed to display entire length of page.
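
The same arithmetic can be written as a short Python sketch. It uses the page size, resolutions, and display dimensions given above; the function names are illustrative only.

```python
def pixels(inches, dpi):
    """Pixel count for a given physical length and scanning resolution."""
    return round(inches * dpi)


def screens_needed(image_pixels, screen_pixels):
    """Number of full screens required to show image_pixels (ceiling division)."""
    return -(-image_pixels // screen_pixels)


PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 5, 8
SCREEN_DPI = 72
SCREEN_WIDTH_PX, SCREEN_HEIGHT_PX = 640, 480

master = (pixels(PAGE_WIDTH_IN, 600), pixels(PAGE_HEIGHT_IN, 600))
display = (pixels(PAGE_WIDTH_IN, 120), pixels(PAGE_HEIGHT_IN, 120))

# Size of the unreduced master if shown pixel-for-pixel on a 72 dpi screen.
master_on_screen_in = (master[0] / SCREEN_DPI, master[1] / SCREEN_DPI)

screens = (
    screens_needed(display[0], SCREEN_WIDTH_PX),
    screens_needed(display[1], SCREEN_HEIGHT_PX),
)

print("master pixels:", master)
print("display pixels:", display)
print("master at 72 dpi, inches: %.1f x %.1f" % master_on_screen_in)
print("screens across, down:", screens)
```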


Several tests were made at various pixel depths. Best results were obtained when the image was converted from a 600 dpi bitonal (1 bit) image to a 120 dpi grayscale image using 4 bits per pixel. This conversion anti-aliased the typeface, smoothing out "jaggies" and producing a page that was still fully legible using today's web browsers.
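Why reduction produces gray values can be illustrated with a small Python sketch. Going from 600 dpi to 120 dpi maps each 5 x 5 block of black-or-white pixels to one output pixel, and averaging the block gives an intermediate gray along the edges of letters. This block-averaging approach is an assumption chosen for illustration; the report does not say which resampling method the conversion software applied.

```python
def downsample_bitonal(rows, factor=5, levels=16):
    """Average factor x factor blocks of 0/1 pixels (1 = ink) into gray levels.

    Returns rows of integers from 0 (black) to levels - 1 (white).
    """
    height = len(rows) // factor
    width = len(rows[0]) // factor
    block_area = factor * factor
    result = []
    for by in range(height):
        out_row = []
        for bx in range(width):
            ink = sum(
                rows[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # Scale the white fraction of the block to the available gray levels.
            out_row.append(round((block_area - ink) * (levels - 1) / block_area))
        result.append(out_row)
    return result


# Three blocks side by side: solid ink, an edge (10 of 25 pixels inked), blank paper.
sample = [[1] * 5 + [1, 1, 0, 0, 0] + [0] * 5 for _ in range(5)]
gray = downsample_bitonal(sample)
print(gray)
```

The middle block, which straddles the edge of a stroke, comes out as a mid-gray rather than being forced to black or white, and that is what softens the stair-step edges on screen.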

The display image was approximately 600 pixels across by 900 pixels down. It was stored at 72 dpi and 4 bits per pixel in GIF file format, which is supported by all major web browsers for displaying images. The file size was under 100 kilobytes, which makes it acceptable for lower speed connections to the Internet. A UNIX package called ImageMagick was used to convert the over 800 TIFF images to GIF images. Widely available on the Internet, ImageMagick includes a program called "convert", which can be used to batch process the images. A Sparc 20 was used for conversion. The resulting images were given file names that corresponded to the original TIFF file names.
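A batch run of this kind might be scripted roughly as follows. The Python sketch only builds the command lines and does not execute them. The options shown (-resize and -colors) are taken from current ImageMagick releases and are an assumption; the report does not record the exact options used in 1996, and the file names are made up.

```python
from pathlib import PurePosixPath


def gif_name(tiff_name):
    """Keep the original base name and change only the extension."""
    return str(PurePosixPath(tiff_name).with_suffix(".gif"))


def convert_command(tiff_name, scale_percent=20, gray_levels=16):
    """Build one 'convert' command line: 600 dpi -> 120 dpi is a 20% reduction."""
    return [
        "convert", tiff_name,
        "-resize", f"{scale_percent}%",
        "-colors", str(gray_levels),  # 16 levels fit in 4 bits per pixel
        gif_name(tiff_name),
    ]


for name in ["nb000001.tif", "nb000002.tif"]:
    print(" ".join(convert_command(name)))
```

Each command list could be passed to subprocess.run on a machine with ImageMagick installed.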

Printable Text

Printing the original 600 dpi TIFF files to a 600 dpi laser printer produces high-quality, fully legible output that is an adequate substitution for the original page. Unfortunately, this format is not widely supported and an alternative had to be provided for the majority of users. Adobe Portable Document Format, or PDF, is a well-accepted document format that runs on a wide variety of systems and can be printed to both low-cost ink-jet printers and to industry-standard laser printers. Adobe Acrobat Reader reads and prints PDF documents, and is freely available for Mac, Windows, and UNIX platforms. After testing both 600 dpi and 300 dpi PDF documents, it was decided to use 300 dpi, since the resulting printout was still quite acceptable, and the file size was cut to approximately 50 kilobytes.

Several steps were required to convert from TIFF to PDF. To center the text and avoid improper placement on the lower-left side of the page, the TIFF image was first downscaled to 300 dpi and converted to PostScript Level 2 format, using the ImageMagick convert software previously mentioned. One line of PostScript was then added, centering the image both horizontally and vertically on an 8.5"x11" page. To convert from PostScript to PDF format, Adobe's Distiller software was used on a Power Macintosh 8100 system. All of this is invisible to the user, who simply issues the print command. It would be possible to create a similar PDF file for each article to make it easier to download and print the article as an entity.
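The centering step comes down to simple arithmetic in PostScript units (72 points per inch). The Python sketch below computes the offsets for a page image and formats them as a line using the standard PostScript translate operator. The report does not reproduce the line that was actually added, so the output here is an assumption about its general form, and the 5 x 8 inch page size is borrowed from the Online Text example.

```python
POINTS_PER_INCH = 72
LETTER_WIDTH_PT = 8.5 * POINTS_PER_INCH   # 612 points
LETTER_HEIGHT_PT = 11 * POINTS_PER_INCH   # 792 points


def centering_translate(width_px, height_px, dpi=300):
    """Return a PostScript line that shifts the origin so the image is centered."""
    width_pt = width_px / dpi * POINTS_PER_INCH
    height_pt = height_px / dpi * POINTS_PER_INCH
    tx = (LETTER_WIDTH_PT - width_pt) / 2
    ty = (LETTER_HEIGHT_PT - height_pt) / 2
    return f"{tx:g} {ty:g} translate"


# A 5 x 8 inch page scanned at 300 dpi is 1500 x 2400 pixels.
line = centering_translate(1500, 2400)
print(line)
```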

In sum, every page of text from the Bulletins corresponds to three files, each suited to a particular type of use:

Plates and Figures

Preservation Resources followed Columbia's instructions to crop the color bars out of the illustration images and save the files in TIFF format 24-bit color (uncompressed) at thumbnail and 600x980 screen resolution, as well as uncropped versions at 1024x1536 screen resolution. Columbia later decided to make higher resolution (2048x3072) uncropped versions instead of the 1024x1536 versions available for access for users whose equipment can handle the very large file sizes. The color and grayscale bars were removed from the lower resolution files to reduce file size but retained in the higher resolution files on the assumption that viewers with high-end equipment would be more likely to have a concern with color accuracy and the capability to calibrate monitors and printers. It should be noted that careful, consistent placement of the color and grayscale bars greatly facilitates automation of the cropping process.

All the display images except the thumbnails were then converted by Columbia to JPEG for easier viewing by web browsers, and to reduce the file sizes for downloading. Thumbnails were converted to GIF. Conversion was performed using Kodak Access Plus software on a Power Macintosh, converting the images from Photo CD to TIFF format. TIFF to JPEG conversion was performed using a Graphic Converter on the PowerMac. The JPEG files can be printed by most currently available color printers. The quality of printout naturally depends on the quality of the printer. Accurate, fully legible facsimiles can be printed out from the high resolution TIFF files by high-end equipment, as demonstrated in Phase I.

Like the text pages, then, the illustrations each correspond to a variety of files for specialized purposes:

Integrating Text, Plates and Figures

The project had as a goal creation of online facsimiles of the Bulletin numbers to serve as preservation surrogates for the originals. The original organization of the Bulletins as published was therefore retained as closely as possible. While pages would obviously display in numerical order, placement and display of illustrations required more choices.

There were three categories of illustration: in-text, separate plate bound with the text, and folded plates located in pockets at the end of the volume. Pages with in-text illustrations were scanned twice, once from the microfilm as black-and-white text images and once from slides as color/grayscale images. The black-and-white bitonal versions display in sequence with the other text pages, and are linked to the color/grayscale versions which the user can call up at will. Plates bound into the text had individual numbers but were not paginated. Some were scattered throughout the text and others clustered at the ends of volumes. A decision was made to display the plate images in the positions they occupied in the bound volumes rather than displaying them in conjunction with page(s) whose text referred to them. Illustrations from pockets are displayed following the last text page of the volume. Aside from the desire to produce a facsimile of the original volumes as published, any effort to link illustrations to appropriate text pages would have required someone with knowledge of the subject matter to determine which parts of the text referred to a given illustration, since there was not always an explicit mention of a plate number or title.

In order to integrate text pages, plates, and figures so that they would display as just described, several hundred HTML pages needed to be created. The ideal was to generate these pages automatically, using the available metadata. Fortunately, most of the metadata provided by Preservation Resources was accurate and provided enough information so that the Bulletins could be reassembled. Errors consisted primarily of transposition of numbers in file names, and amounted to less than 5% of the whole.

Four Perl programs were written to create the online Bulletins.

Using software programs to generate the on-line books from the metadata proved to be invaluable. The programs quickly pointed out errors in the metadata, speeding correction of typographical errors and omissions. Once these errors were corrected, it became easy to modify the presentation of the entire book. By changing the program in one place, for example, additional navigational aids were provided automatically to all of the pages. The programs take only a few minutes to process the files, making it easy to experiment with new ideas in real time. The ability to create a book automatically means that this process can easily scale to a very large number of volumes without a correspondingly high increase in processing time.
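The general idea can be sketched as follows. The project's programs were written in Perl and are not reproduced in this report; the Python below is only an illustration, with hypothetical record fields and file names, of how an ordered list of metadata records can be turned into linked HTML pages.

```python
from html import escape


def build_pages(records):
    """Return {file name: HTML text} for an ordered list of page records."""
    pages = {}
    last = len(records) - 1
    for i, rec in enumerate(records):
        title = f"Bulletin {rec['bulletin']}, {rec['chapter']}, page {rec['page']}"
        links = []
        if i > 0:
            links.append(f'<a href="{records[i - 1]["base"]}.html">Previous page</a>')
        if i < last:
            links.append(f'<a href="{records[i + 1]["base"]}.html">Next page</a>')
        nav = " | ".join(links)
        pages[f"{rec['base']}.html"] = (
            f"<html><head><title>{escape(title)}</title></head>\n"
            f"<body>\n<h1>{escape(title)}</h1>\n<p>{nav}</p>\n"
            f'<img src="{rec["base"]}.gif" alt="{escape(title)}">\n'
            f"<p>{nav}</p>\n</body></html>\n"
        )
    return pages


# Made-up records standing in for rows of the metadata spreadsheet.
sample_records = [
    {"base": "nb000001", "bulletin": "00", "chapter": "Sample chapter", "page": "1"},
    {"base": "nb000002", "bulletin": "00", "chapter": "Sample chapter", "page": "2"},
    {"base": "nb000003", "bulletin": "00", "chapter": "Sample chapter", "page": "3"},
]
pages = build_pages(sample_records)
print(pages["nb000002.html"])
```

Because the navigation is produced in one place, adding a further link means changing one function and regenerating every page.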

Navigational Aids

The use of the web interface provides an easy way to link one page to another. Small arrows indicating next page and previous page are placed at the top and bottom of each displayed document. The Bulletin number, chapter, and page number clearly label the top of each display. To facilitate page jumping, a set of links to a range of next and previous pages is available at the top of each display; text pages accompanied by plates are so designated. The URL of the current file is visible at the bottom of the display, as well as a link to instructions for viewing and printing. Links to the original TIFF and PDF documents are also at the bottom of each page display for anyone wanting to print or otherwise work with the other versions. The top of the display includes a link to the table of contents, list of plates, and list of pages. Each image of an illustration is linked to the text page which precedes it in the bound volume. When the image for the text page is displayed, the plate number is also displayed and links to both the medium and high resolution versions of the plate appear.

Search Capabilities

Providing search capability was not part of the original plan for the project. However, since Columbia already had made Glimpse available as one of its search engines for HTML text, it was decided to take advantage of the function and provide a search capability across all the Bulletins on the project homepage. Glimpse indexes the information in the HTML files that were created from the metadata: the title of each HTML document, the Bulletin number, chapter name, page number, and plate label. A successful search displays the title field of each document, and provides a link directly to that page or plate.

User Reactions

The strong relationship between text and images, and the need to keep the images in proximity to the relevant text, had a profound effect on the design of the online Bulletin, as described above. Because we chose to link the plates to the text pages they were adjacent to in the printed version, we felt the need to provide additional ways to locate and view the plates easily.

To enhance the online version, a plate list was created as described above. It was designed in particular to facilitate the scientific researcher's habit of browsing by flipping through books and serials looking at the illustrations or charts to determine which articles might be interesting to read. In a printed book such a list would be simply the titles of the plates. In the online volume the plate list consists of the thumbnail images with links to different resolutions of the images and to the text page adjacent to the image. One of the researchers [2] who evaluated the electronic version of the Bulletin commented:

I like the plates page. It corresponds to the way that I often look at geological books and papers, by glancing through the figures, and then checking out the text that goes along with any interesting figures. This serves the purpose, not only of the printed book index but also a way for users to browse through the images of a Bulletin much in the way they would rifle through the pages of the printed Bulletin.

Even with care taken to retain the form and order of the original books, however, it is difficult to achieve the same functionality the book has. The same researcher remarked that during reading:

It is not obvious that the plates appear with a page to which they are logically and intellectually attached. Certainly, there are no (plate 3) or (figure 4) parenthetical comments in the text, and in a few of the cases where I actually read the text, the plate and the text varied in their degree of connectedness. I know in a lot of those old books, the plates had to be bound in specific places, not necessarily adjacent to the text they referred to.

The researcher then reflected:

I guess in an archiving project, you want to retain the original order of pages in the original book. To a modern user, it might be more useful to put the pictures next to the text they refer to. That would be more work, because a geologically knowledgeable person would have to read the text and make decisions, and there would certainly be room for disagreement about some of the decisions.

These comments suggest that there are text/image aspects of the printed book which depend upon the physical organization of the bound volume. Perhaps one of the reasons passages of text do not always specify the plates referred to is that the bound volume not only permits but requires the user to turn page by page, passing the relevant plates "naturally" in the process of reading the text.

Another reviewer [3] of the electronic Bulletin remarked on a short paleontological observation accompanied by three plates. The reader noted and pulled up the first plate and, finding it uninteresting, skipped over the second plate, acknowledging impatience at having to pull up a separate image. Later, when reviewing the thumbnails in the plates list, the reader realized that the third plate had been far more important to understanding the article and yet it had been missed during a reading of the text because the on-line version does not automatically display all illustrations, but instead allows the reader to choose whether to take the time needed to open the files. Fortunately, once the image file is open, the windowing capabilities of the web viewers enable a reader to alternate between the text and image views, referring back and forth between the two.

The importance of text/image interaction has begun to be discussed in the light of contemporary electronic publishing. Stewart's [4] recent study of chemistry journals contains data on the importance of images to users of current science literature. "Browsing graphics to determine the value of an article" ranked second in importance only to printing: 73% of the users listed it as "very important" and an additional 15% indicated it was "important" (p.341). The author questioned users on their preferences for exact bit-mapped page images versus extracted graphics. Some 43% of the respondents indicated that extracted images were adequate, 27% indicated that an original page image was needed, and 16% said that the text and graphics should be presented together but not necessarily as an exact image (p.345). The variation in responses may indicate a lack of the experience needed to judge the best placement of text and image, as well as differences in individuals' needs and preferences.

Other user issues are pertinent specifically to maps. The first concerns scale. Images on the screen and in printouts change sizes. No printout is a perfect match to the size of the original map. Most of the maps included a printed scale which changed size proportionally as the map changed, but several simply had statements such as "1 mile = 1 inch". The color bars contain a ruler which is very useful in making it clear to users what the actual scale on the displayed or printed map is. The lower resolution images, however, lack this information because the color bars were cropped out. Removal of the color bars to improve access times and allow display on lower-end monitors was thus a compromise which has ramifications for use completely apart from color, and should be re-evaluated after more use of the online maps to see if it causes problems for readers.
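The effect on a verbal scale statement can be shown with a short calculation. In the Python sketch below, the map widths are made-up example values; only the "1 mile = 1 inch" statement comes from the text above.

```python
def effective_miles_per_inch(stated_miles_per_inch, original_width_in, output_width_in):
    """Scale that actually applies once a map is displayed or printed at a new size."""
    return stated_miles_per_inch * original_width_in / output_width_in


# Hypothetical example: a map 20 inches wide printed on a sheet 8 inches wide.
scale = effective_miles_per_inch(1, original_width_in=20, output_width_in=8)
print(f"1 inch on the printout = {scale:g} miles")
```

A printed scale bar or ruler shrinks along with the map and so stays correct, while the verbal statement silently becomes wrong.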

Another issue, discussed in the Phase I report, which continues to concern map users is the importance of peripheral vision in orienting oneself while viewing a map, and the related issue of being able to see all of a large feature on a map at one time (for instance, the entire course of a river) while still being able to read the fine print. Files with resolutions high enough for legibility do not fit on a small screen, and can only be viewed by scrolling from place to place. Flipping back and forth between a lower resolution view which contains the entire map and a high resolution view of a detail proves to be both time-consuming and disorienting. Only a screen large enough to display the entire map at high resolution would eliminate the problem of trying to work with small excerpted portions of a map. It will be interesting to study changes in how maps are used on-line as scholars increasingly make use of both vectored and bit-mapped products.

The Future

Many further enhancements to the sample Bulletins are possible as experience is gained with what is successful and useful in presenting bit-mapped versions of texts and images to readers. An obvious next step would be to improve the search capability through providing a "dirty" OCR version of the bit-mapped text purely for searching. The variety in sizes and styles of type fonts implies that there would be numerous errors in the OCR process, but manual conversion or proofreading would be very expensive. Even without correction, however, the converted text could still prove to be useful for searching purposes. A compromise might be to expend the time and money needed for accurate conversion just of the authors' indices. Further improvements in display and access are easily suggested by many of the web sites currently available.

Throughout, a major factor in decisions about presentation of the Bulletins to readers has been the attempt to balance between the desire to view high-quality images and the need for reasonably fast and wide-spread accessibility. Certainly there are grounds to hope that improved, and less expensive, equipment and software will mitigate this conflict, based on the swift progress to date in increased memory and general functionality of commonly available machines. At the same time, it is clear that capture at high resolution now for preservation purposes is extremely important, even though we may need to wait some years before display and access technology develop to allow common use of the high resolution images.

The basic question is whether the preservation and access methodology investigated in the two phases of this project will prove to be viable over the long run. Certainly significant interest in the methodology has been expressed by librarians and geologists. The Geoscience Information Society has recently put forth an action plan for preservation in which they document their deep concern about oversize and color materials. The results of this project have been presented to them, and they hope to be able to utilize this methodology as they work to preserve their core literature.

Columbia will actively continue to evaluate this method of preserving and making accessible the combination of text and images. True evaluation of the site as a research tool will not become possible until a large enough body of material is available in digital format to potentially serve researchers' needs. Until the materials are put under the stress of everyday use we are not going to find what works and what proves to be a serious impediment to sustained use.

Endnotes

1. The Bulletin was filmed by Challenge Industries (Ithaca, NY) for Cornell University as part of the three-year Preserving the Literature of Natural History of the Northeast Bio-Region Project, funded by a grant from the New York State Program for the Conservation and Preservation of Library Research Materials (1994/97). The film conforms to national standards for preservation-quality negatives.

2. Kim Kastens, personal e-mail communication, September 5, 1996.

3. Susan Klimley, personal e-mail communication, September 6, 1996.

4. Linda Stewart, "User Acceptance of Electronic Journals: Interviews with Chemists at Cornell University". College and Research Libraries, 1996, v.57, p.339-49.



Revised: 2/20/97 URL: http://www.columbia.edu/dlc/nysmb/reports/phase2.html