Bioinformatics--Trying to Swim in a Sea of Data
David S. Roos
http://www.sciencemag.org/cgi/content/full/291/5507/1260
Advances in many areas of genomics research are heavily rooted in engineering
technology, from the capillary electrophoresis units used in large-scale DNA
sequencing projects, to the photolithography and robotics technology used in
chip manufacture, to the confocal imaging systems used to read those chips, to
the beam and detector technology driving high-throughput mass spectrometry.
Further advances in (for example) materials science and nanotechnology promise
to greatly improve the sensitivity and reduce the cost of these technologies in the near
future. Genomic research makes it possible to look at biological phenomena on
a scale not previously possible: all genes in a genome, all transcripts in a
cell, all metabolic processes in a tissue.
One feature that all of these approaches share is the production of massive
quantities of data. GenBank, for example, now accommodates >10^10 nucleotides
of nucleic acid sequence data and continues to more than double in size every
year. New technologies for assaying gene expression patterns, protein
structure, protein-protein interactions, etc., will provide even more data.
How to handle these data, make sense of them, and render them accessible to
biologists working on a wide variety of problems is the challenge facing
bioinformatics--an emerging field that seeks to integrate computer science
with applications derived from molecular biology. We are swimming in a rapidly
rising sea of data...how do we keep from drowning?
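To make the scale concrete, here is a back-of-the-envelope sketch in Python of
what the figures above imply. The starting size (~10^10 nucleotides) and the
doubling rate are taken from the text; the five-year horizon is purely
illustrative.

    # Project database growth under the at-least-annual doubling cited above.
    size_nt = 1e10                  # ~current GenBank size, in nucleotides
    for year in range(6):
        print(f"year {year}: ~{size_nt:.1e} nucleotides")
        size_nt *= 2                # doubling every year (a lower bound)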
Bioinformatics faces its share of growing pains, many of which presage
problems that all biologists will soon encounter as we focus on large-scale
science projects. For starters, few scientists can claim a strong background
on both sides of the divide separating computer science from biomedical
research. This shortage means a lack of mentors who might train the next
generation of "bioinformaticians." Lack of familiarity with the intellectual q
uestions that motivate each side can also lead to misunderstandings. For
example, writing a computer program that assembles overlapping expressed
sequence tag (EST) sequences may be of great importance to the biologist
without breaking any new ground in computer science. Similarly, proving that
it is impossible to determine a globally optimal phylogenetic tree under
certain conditions may constitute a significant finding in computer science,
while being of little practical use to the biologist. Identifying problems of
intellectual value to all concerned is an important goal for the maturation of
computational biology as a distinct discipline. "Real" biology is increasingly
carried out in front of a computer, and a growing number of projects in
computer science will be driven by biological problems.
Further difficulties stem from the fact that bioinformatics is an inherently
integrative discipline, requiring access to data from a wide range of sources.
Without the underlying data, and the ability to combine these data in new and
interesting ways, the field of bioinformatics would be very much limited in
scope. For example, the widespread utility of BLAST for the identification of
gene similarity (1) is attributable not only to the algorithm itself (and its
implementation), but also to the availability of databases such as GenBank,
the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of
Japan (DDBJ), which pool genomic data from a variety of sources. BLAST would
be of limited utility without a broad-based database to query.
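As a hedged illustration of this workflow, the Python sketch below submits a
query to NCBI's Web BLAST service using Biopython, a tool chosen here for
convenience rather than anything specified in the article; the query sequence
is a made-up placeholder, and network access to NCBI is assumed.

    # Query the pooled nucleotide database (nt) over the Web with Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # hypothetical sequence
    handle = NCBIWWW.qblast("blastn", "nt", query)     # remote BLAST search
    record = NCBIXML.read(handle)                      # parse the XML result
    for alignment in record.alignments[:5]:            # top few database hits
        print(alignment.title, alignment.hsps[0].expect)

The point of the example is the second argument: without a broad, pooled
database such as nt behind it, the same algorithm would return little of
interest.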
One core aspect of research in computational biology focuses on database
development: how to integrate and optimally query data from (for example)
genomic DNA sequence, spatial and temporal patterns of mRNA expression,
protein structure, immunological reactivity, clinical outcomes, publication
records, and other sources. A second focus involves pattern recognition
algorithms for such areas as nucleic acid or protein sequence assembly,
sequence alignment for similarity comparisons or phylogeny reconstruction,
motif recognition in linear sequences or higher-order structure, and common
patterns of gene expression. Both database integration and pattern recognition
depend absolutely on accessing data from diverse sources, and being able to
integrate, transform, and reproduce these data in new formats.
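Sequence alignment, one of the pattern-recognition problems listed above, is
classically solved by dynamic programming. The sketch below is a minimal
Needleman-Wunsch global-alignment scorer; the scoring values and test
sequences are illustrative assumptions, not taken from the article.

    # Toy Needleman-Wunsch: score the best global alignment of two sequences.
    def nw_score(a, b, match=1, mismatch=-1, gap=-2):
        rows, cols = len(a) + 1, len(b) + 1
        dp = [[0] * cols for _ in range(rows)]  # dp[i][j]: best score, a[:i] vs b[:j]
        for i in range(1, rows):
            dp[i][0] = i * gap                  # align a[:i] against gaps
        for j in range(1, cols):
            dp[0][j] = j * gap                  # align b[:j] against gaps
        for i in range(1, rows):
            for j in range(1, cols):
                diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
        return dp[-1][-1]

    print(nw_score("GATTACA", "GCATGCT"))       # prints the optimal score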
As noted above, computational biology is a fundamentally collaborative
discipline, owing its very existence to the availability of rich and extensive
data sets for analysis, integration, and manipulation. Data accessibility and
usability are therefore critical, raising concerns about data release
policies--what constitutes primary data, who owns this resource, when and how
data should be released, and what restrictions may be placed on further use.
Two challenges have emerged that could potentially restrict the advancement of
bioinformatics research: (i) questions related to the appropriate use of data
released before publication and (ii) restrictions on the reposting of
published data.
The first challenge to bioinformatics research relates to the analysis of data
posted on the Web in advance of publication. Recognizing the value of early
data release for a wide range of studies, the Human Genome Project adopted a
policy of prepublication data release (2), and many genome projects (and the
funding agencies that support them) now adhere to similar rules. Because
bioinformatics depends absolutely on the ability to integrate data from a wide
variety of sources, it is to be hoped that other projects that generate
genomic-scale data (including expression analysis and proteomics research)
will follow a similar policy (3), because immensely valuable results can
emerge from large-scale comparative studies of genome structure, microarray
data, protein interactions, and so on (4-6). The success of such altruistic
data release policies, however, requires that those who generate primary
sequence data (often on behalf of the community at large) receive appropriate
recognition and are able to derive intellectual satisfaction from their work.
Rowen et al. (7) have recently proposed treating unpublished data available on
the Web as analogous to "personal communication," thereby establishing some
degree of intellectual property protection.
The difficulty with this approach comes in determining what types of analysis
should require permission from the submitters, and what types of analysis can
reasonably be prohibited. Clearly, the identification of individual genes of
interest for further experimental analysis must be acceptable--perhaps even
without the need for formal permission--otherwise, early data release serves
no purpose at all. Conversely, second-party publication of raw, unpublished
sequence data posted on the Web must be viewed as violating ethical
standards--analogous to the verbatim plagiarism of unpublished results from a
meeting presentation. Where to draw the line in intermediate cases will
ultimately depend on the intellectual contributions provided by the manuscript
in question, and whether such work might reasonably have been expected to
emerge in due course from those who generated the original data (7). Such
considerations of "value added" are not terribly different from those normally
applied during manuscript review, but require special consideration by
reviewers and editors of the anticipated contributions from the original
submitter.
Experience with the Plasmodium falciparum genome project (8-15) suggests that
disagreements over what kinds of data and analyses are permissible for
publication are sometimes attributable to the failure of second parties to
adequately consider the interests and involvement of those generating the
primary data. More often, however, disputes are attributable to a lack of
understanding: either on the part of biologists, who do not fully appreciate
the long lag that may reasonably be expected between (for example) the first
appearance of shotgun sequencing results and final sequence closure and
annotation, or on the part of those generating the primary data, who may not
fully appreciate the intellectual contributions of
biologists/bioinformaticians. One hopes that as mutual understanding bridges
the gulf separating those engaged in applying genomic technologies, in
bioinformatics research, and in laboratory analysis, these problems will
diminish in importance. Increased acceptance of Web-based release as a form of
publication (for hiring, promotion, tenure decisions, etc.), as well as
increased understanding of the nature of "big science" projects in biology,
will also reduce tensions.
The second challenge to bioinformatics research derives not from restrictions
on data access but from restrictions on downstream use, such as incorporation
into new or existing databases. This challenge is of a more fundamental
nature, involving not just when bioinformatic analysis is permissible, but
what kinds of analyses can be carried out. Today's publication of a draft
analysis of the human genome by Celera Genomics (16) focuses a spotlight on
this question, because the primary data themselves are being released only
through a private company that places restrictions on the reposting and
redistribution of their data. Other genome-scale projects, including a recent
analysis of protein-protein interactions in Helicobacter pylori (17), have
placed similar restrictions on the reposting of primary data.
As described in the accompanying editorial (18), Science has taken care to
craft a policy that guarantees that the data on which Celera's analyses are
based will be available for examination. But the purpose of insisting that
primary scientific data be released is not merely to ensure that the published
conclusions are correct, but also to permit building on these results, to
allow further scientific advancement. Bioinformatics research is particularly
dependent on unencumbered access to data, including the ability to reanalyze
and repost results. Thus the statement that "... any scientist can examine and
work with Celera's sequence in order to verify or confirm the conclusions of
the paper, perform their own basic research, and publish the results" (19) is
inaccurate with respect to research in bioinformatics. For example, a
genome-wide analysis and reannotation of additional features identified in
Celera's database could not be published or posted on the Web without
compromising the proprietary nature of the underlying data. Nor could this
information be combined with the resources available from other
databases--such as the information from additional species necessary for
cross-species comparisons, or data from microarray and proteomics resources
that would permit queries based on a combination of genome sequence data,
expression patterns, and structural information. It is certainly true that the
present state of genomics research would never have been achieved without the
freedom to use (properly attributed) information from GenBank/EMBL/DDBJ.
Permitting restrictions on downstream analysis offers the prospect of making a
wealth of proprietary data generated by private companies accessible to the
research community at large, but this access comes at a very great
cost. Imagine, for example, genomics research in a world where
GenBank/EMBL/DDBJ did not exist and could not be assembled because of
ownership restrictions. Five years ago, the Bermuda Conventions (2)
established a standard for the release of genome sequence data that has served
biologists very well; we should consider carefully what precedent to establish
for the next 5 years, as considerations of data-release and data-use policy
are likely to have far-reaching implications for all of biomedical research.
The "postgenomic era" holds phenomenal promise for identifying the mechanistic
bases of organismal development, metabolic processes, and disease, and we can
confidently predict that bioinformatics research will have a dramatic impact
on improving our understanding of such diverse areas as the regulation of gene
expression, protein structure determination, comparative evolution, and drug
discovery. The availability of virtually complete data sets also makes
negative data informative: by mapping entire pathways, for example, it becomes
interesting to ask not only what is present, but also what is absent (a toy
sketch of this idea appears below). As the potential of genomics-scale studies
becomes more fully appreciated, it is
likely that genomics research will increasingly come to be viewed as
indistinguishable from biology itself. But such research is only possible if
data remain available not only for examination, but also to build upon. It is
hard to swim in a sea of data while bound and gagged!
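As promised above, a toy sketch of how negative data become informative once
gene catalogs are essentially complete. All pathway and genome contents here
are made-up placeholders, not real annotations.

    # With a complete gene catalog, absence from a pathway is itself a finding.
    glycolysis = {"pfkA", "pyk", "eno", "gapA"}      # hypothetical pathway genes
    genome     = {"pfkA", "eno", "gapA", "tpiA"}     # hypothetical gene catalog

    missing = glycolysis - genome                    # set difference: what is absent
    print("pathway genes absent from this genome:", missing)   # {'pyk'}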