Special 21stC
home page

Genome data in the public domain: unleashed synergies

By LAURA NEWMAN

A PUBLIC DATABASE is fast becoming a vital tool in the push to annotate the human genome. Within six months after Columbia placed information about some 200,000 gene fragments in the database, biologists worldwide were already reporting breakthroughs. The impact has been "almost impossible to estimate," says Mark S. Boguski, senior investigator at NIH's National Center for Biotechnology Information (NCBI), GenBank division. While "remarkable" findings such as cloning of an Alzheimer's gene are relatively rare, Boguski adds, "discoveries on a lesser scale are made every day."

Advocates of the open-access database hope it will expedite the discovery, complete mapping, and sequencing of genes. Before obtaining clones and sequences, users must agree to contribute any new sequence, map, or expression data to a public map. New information cannot be disclosed or transferred to commercial entities until it has entered the public database. This requirement has added many more markers to the database.

At the cutting edge of online database technology, the database can be queried by e-mail, file transfer protocol, or the World Wide Web. A biologist who discovers that a gene matches an existing sequence in the database, for example, can obtain a clone within days.

Proprietary databases once housed the bulk of gene sequence information. As a result, limited access made it impossible for researchers to extract genetic information. For example, the large database amassed by Human Genome Sciences (HGS) and the Institute for Genomic Research, in Rockville, Md., limited the information that an individual or corporate entity could extract in any given year. HGS also retained the right to preview any material intended for publication and had first options regarding patentable material.

Impetus for expanding the public database came after prominent geneticists claimed that privately held database restrictions hindered research. Many scientists came out against patenting of partial gene fragments or portions of the genome before their biological significance was known. Today, the scientific, academic, and legal communities "are over the hump," says Columbia professor of law Harold Edgar, an expert on genes and patent law. "As people have thought about the problem," adds Edgar, "the consensus is that you shouldn't be able to control a gene based solely on identifying a fragment of it."

A leading voice was Francis Collins, director of NIH's National Center for Human Genome Research. He describes private genome databases as "unmappable," "unworkable," and unnecessarily "constrained by intellectual properties" (referring in this context to proprietary attempts to patent partial sequence data). "Unless you have a public database of expressed sequences," Collins adds, "it is impossible to contemplate the public genome project putting them on the map."

Before Columbia's contribution, the internationally available public database was more significant in concept than in practice, since it contained relatively little genetic material. In April 1995, clone libraries compiled by Columbia neurogeneticist Bento Soares were given to the Integrated Molecular Analysis of Genome Expression Consortium at Lawrence Livermore National Laboratory. The libraries are arrayed (assigned identification codes) by Lawrence Livermore, sequenced at Washington University in St. Louis, and entered into the GenBank database at NCBI, with funding by Merck & Co.

What Columbia donated was Soares' libraries of complementary DNA (cDNA): actual copies of gene fragments. Soares prepared the libraries using normalization, a novel technology that makes it easier to find rare clones, thus saving time in the gene discovery process. "What normalization tries to achieve," says Soares, "is to minimize the redundancy of the prevalent cDNAs while increasing the representation of the rarer mRNAs." (While the university donated the cDNA libraries, it has a patent on the normalization technique.)

ACCORDING TO COLUMBIA Vice Provost Michael Crow, the university decided to forgo potential profits from licensing the clones. Proponents of the public effort "convinced us that Soares' materials were needed to fulfill the NIH objective" of mapping and sequencing the genome, Crow explains. Columbia had to consider whether the library was "merely a tactic by Merck dealing with their competition from SmithKline Beecham," he continues. SmithKline Beecham had previously invested $125 million in the proprietary database held by HGS. Columbia felt confident, Crow adds, "that the public domain library would lead to continued scientific and medical developments and was not some maneuver by Merck to deal with their competition."

Today, the clone libraries compiled by Soares account for 95 percent of the material being sequenced at Washington University, says Rick Wilson, co-director of the Genome Sequencing Center. Large-scale sequencing is occurring at the rate of 5,000 clones each week. More than 150,000 sequences have been released, with GenBank adding approximately 1,500 sequences per day. Boguski reports that database usage has skyrocketed: NCBI's expressed sequence tag database has processed more than 262,000 queries, and each month some 4,500 people connect to the Web database and 1,500 download data.

"The database was key to cloning an Alzheimer's gene," asserts molecular biologist David Galas, who reported cloning a gene that, when mutated, causes a predisposition to a variant of early-onset Alzheimer's disease. "It would have taken us another year to get there" without access to such a library, says Galas, chief scientific officer of Darwin Molecular in Bothell, Wash.

According to Galas and other observers of genome research, there's no turning back to restricted private databases that house raw sequence information. "Pretty soon," Galas says, "everything you can find in the private database will be in the public database. The time of the private database is almost over."

What's more, biologists hope this public map is a prototype for other, more complex biologic maps on the horizon. "Biology is so potentially complex," notes Galas, that information density in the field will increase spectacularly. He envisages integrated maps that link information on biological function and biochemical pathways, for example, back to specific genes, diseases, or enzymes. He calls such resources "public libraries for biologists of the future," which will be increasingly critical to "the health and vigor of the science."


LAURA NEWMAN is a writer and health policy analyst specializing in academic medicine, public health, and public policy. She has been a contributing editor to Dermatology Times and Urology Times, written for Ocular Surgery News and numerous other publications, conducted policy research for the New York City Health and Hospitals Corp., and conducted studies for the Health Care Financing Administration. Her e-mail address is lnewman@panix.com.

PHOTO CREDITS: Jonathan Smith, Howard Roberts

