September 8, 2003
FEATURE: Data mining
Data mining Open Access research
By Matthew Cockerill, Technical Director, BioMed Central
In the words of Sydney Brenner:
"The definition of data mining:
what's my data is mine and what's yours is also mine."
The need for data mining tools
The biomedical research literature is
growing at a daunting rate, with more
than 400,000 new research articles listed
each year in PubMed alone. The
power of post-genomic research technologies
and increases in biomedical
research funding suggest that we can
expect the number of articles to continue
to expand at an overwhelming rate.
The sheer scale of the scientific literature
poses a formidable challenge to
research scientists in their attempt to
locate published results that are relevant
to their research interests.
Text searching of abstracts and full-text
articles has helped. But much
information is missed by simple text
searches. More sophisticated tools are
needed to identify relationships, connections,
and patterns that are hidden
within the data reported in the scientific
literature. Data mining researchers
and computer scientists are developing
such tools using statistical, linguistic
and artificial intelligence approaches.
The search engines used to find information
on the Internet provide a simple
example of how such techniques can
be helpful. The well-known search
engine Google exploits the pattern of
links between different websites in
order to generate a 'page rank' for each
page that it indexes. Pages that are frequently
linked to by other pages are
given a higher page rank. And it turns
out that page rank is an extremely helpful
predictor of how useful a page is to
Internet surfers.
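
The link-based ranking idea described above can be sketched in a few lines of Python. This is only a simplified power iteration over a toy link graph, not Google's production algorithm; the page names, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(toy_links)
best = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest rank, which is the intuition the article describes.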
But new approaches are needed to
tackle both the expanding Internet and
the scientific literature. The World
Wide Web was originally built for
human consumption, and although
everything on it is machine-readable,
this data is not machine-understandable.
Part of the problem comes from the
very nature of natural language, which
makes it hard for computational
algorithms to extract meaningful
information. For example,
English has both synonymy (multiple
ways to refer to the same object) and
polysemy (words with multiple meanings),
which complicate text searching.
Scientific literature has numerous
examples of such ambiguities. For
example, individual genes often have
several different names that relate to
their historical discovery. And names
such as RAS and MYC are used to
refer to both genes and their
protein products.
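
A minimal Python sketch shows why synonymy defeats a plain text search and how a synonym table can help. The table below is a tiny hand-written example (TP53 is also widely known as p53, and ERBB2 as HER2); a real tool would need a curated list, and the function name and sample sentences are assumptions made for illustration.

```python
import re

# Tiny, hand-written synonym table: lower-cased alias -> canonical symbol.
SYNONYMS = {
    "tp53": "TP53",
    "p53": "TP53",
    "erbb2": "ERBB2",
    "her2": "ERBB2",
}


def canonical_genes(text):
    """Return the set of canonical gene symbols mentioned in text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {SYNONYMS[t.lower()] for t in tokens if t.lower() in SYNONYMS}


query = "HER2 amplification"
abstract = "ERBB2 is overexpressed in a subset of tumours."

# A literal search for the query term misses the abstract...
plain_match = "HER2" in abstract
# ...but comparing canonical symbols finds the shared gene.
normalized_match = bool(canonical_genes(query) & canonical_genes(abstract))
```

A lookup table of this kind addresses synonymy only. It cannot tell whether a name such as RAS refers to the gene or to its protein product; that requires looking at the surrounding context.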
More sophisticated approaches are
required that can extract meaning taking
into account the context within the text.
Furthermore, data mining experts hope
to develop tools that can identify knowledge
that lurks beneath the surface and
can only be found by connecting the
'semantics' (meaning) reported in one
article with findings reported in another.
Semantics, mark-up languages and ontologies
The parallels between finding information
on the Internet and finding relevant
information in the scientific literature
can offer instructive hints for
data mining approaches.
Jim Hendler, a computer scientist at the
University of Maryland, is a pioneer of
the Semantic Web, a project developing
Internet resources for the future.
"The Semantic Web is not a separate
web, but an extension of the current
one, in which information is given
well-defined meaning, better enabling
computers and people to work in cooperation,"
explained Hendler in a recent
article. "For the Semantic Web to function,
computers must have access to
structured collections of information
and sets of inference rules that they
can use to conduct automated reasoning,"
says Hendler. That requires a computational language
that can be used to express both data and
rules for reasoning about the data. The
advent of extensible markup language
(XML) as a standard for information
interchange holds great promise,
as XML makes it possible to express
mathematical equations and chemical
structures in a way that a computer can
easily recognize and process. XML
gives structure to a document by creating
tags, hidden labels attached to annotated
information. The Resource Description
Framework (RDF) is then used to
express meaning, which is encoded in
sets of triples, each triple being like the
subject, verb, and object of an elementary
sentence.
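
The subject-verb-object structure of a triple can be illustrated with plain Python tuples. This is a sketch only: real RDF identifies each part of a triple with a URI and is normally handled by a dedicated library, and the identifiers and predicate names below (article:1, gene:A, encodes, and so on) are hypothetical.

```python
# Each triple is (subject, predicate, object), like an elementary sentence.
triples = [
    ("article:1", "mentions", "gene:A"),
    ("gene:A", "encodes", "protein:A"),
    ("protein:A", "interacts_with", "protein:B"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# "What does gene:A encode?"
encoded = match(triples, subject="gene:A", predicate="encodes")
# "Which statements have protein:A as their subject?"
about_protein_a = match(triples, subject="protein:A")
```

Because every statement has the same three-part shape, a program can chain simple queries like these to follow a path from an article to a gene to a protein.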
The third key component of a functional
Semantic Web is a system that
allows algorithms to define the relationships
between different terms. This
is referred to as an ontology. The use of
ontologies is illustrated by the Gene
Ontology (GO) Consortium, which is
developing three structured, controlled
vocabularies (ontologies) that describe
gene products in terms of their associated
biological processes, cellular
components, and molecular functions
in a species-independent manner.
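
How a controlled vocabulary with defined relationships helps a search can be sketched in Python. The hierarchy below is a simplified, hand-written chain of "is a" relationships in the style of a molecular function vocabulary, not the actual structure of GO, and the protein identifiers and annotations are hypothetical.

```python
# Simplified "is a" hierarchy: term -> list of more general parent terms.
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}

# Hypothetical annotations: gene product -> most specific term assigned.
ANNOTATIONS = {
    "protein:A": "protein kinase activity",
    "protein:B": "catalytic activity",
}


def ancestors(term, is_a):
    """Return every more general term reachable from term."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def annotated_with(term, annotations, is_a):
    """Return products annotated with term or with any more specific term."""
    return sorted(
        product for product, assigned in annotations.items()
        if assigned == term or term in ancestors(assigned, is_a)
    )


catalytic = annotated_with("catalytic activity", ANNOTATIONS, IS_A)
kinases = annotated_with("kinase activity", ANNOTATIONS, IS_A)
```

A query for the general term also returns products annotated with a more specific one, something a plain keyword match on the annotation text would not do.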
Once biological research information
is structured in this way, using appropriate
mark-up languages and ontologies,
researchers will have a realistic
opportunity to mine the data. This
paves the way for computer-assisted
discovery.
"The main benefits are realized by
using these techniques to create curated
machine-readable datasets," says
Ian Donaldson at Samuel Lunenfeld
Research Institute in Toronto.
"Biomolecular knowledge is really
contained in the collective mind of the
research community. The biomedical
literature is an imperfect reflection of
this knowledge that is obfuscated by
natural language and its use in describing
models of consensus that are constantly
changing. Data mining techniques
facilitate one step in creating
reliable models that can be computationally
accessed," says Donaldson.
Donaldson has developed PreBIND, a
data mining tool that helps researchers
locate biomolecular protein-protein
interaction information in the scientific
literature. Users can enter the name or
accession number of a protein and
PreBIND will return a list of potentially
interacting proteins. This list is compiled using
a list of protein synonyms, in combination
with a supervised learning algorithm
known as a support vector
machine (SVM). The SVM determines
whether there is an indication in the
text that two proteins interact. As a
result, PreBIND can extract information
that would be missed by a simple
literature search for keywords such as
'interaction'. "Information extraction
is about finding the relationship
between molecules, something that it is
difficult to do with straight text searches,"
says Donaldson.
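
The first half of the approach described above, using a synonym list to find sentences that mention two different proteins, can be sketched in Python. This is not PreBIND's code: the protein names and synonym table are hypothetical, and the classification step, for which the article says PreBIND uses a support vector machine, is deliberately left out.

```python
import re
from itertools import combinations

# Hypothetical synonym list: lower-cased alias -> canonical protein name.
PROTEIN_SYNONYMS = {
    "prota": "PROT_A",
    "pa1": "PROT_A",
    "protb": "PROT_B",
    "protc": "PROT_C",
}


def candidate_pairs(text):
    """Return (protein, protein, sentence) for each co-mention in a sentence."""
    pairs = []
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        found = sorted({
            PROTEIN_SYNONYMS[t.lower()]
            for t in tokens if t.lower() in PROTEIN_SYNONYMS
        })
        for first, second in combinations(found, 2):
            pairs.append((first, second, sentence))
    return pairs


text = "PA1 binds ProtB in vitro. ProtC was not examined."
pairs = candidate_pairs(text)
```

Each candidate sentence would then be passed to a trained classifier that judges whether the text actually describes an interaction. Note that the example sentence never uses the word 'interaction', so a keyword search for that term would miss it.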
"There are few limits to the types of
information we can extract," says Bob
Futrelle, who works on automated
analysis of technical documents and
diagrams at Northeastern University in
Boston. "As data mining research progresses
and as we learn more about the
nature and the structure of scientific
knowledge, we will be able to extract
more complex and subtle information."
Futrelle runs the BIONLP website
which brings together resources relating
to the natural language processing of
biology text. His group has built viewers
that look at text and text comparisons
in much the same way that people
look at DNA sequence homology using
on-screen viewers. They are also focusing
on automated analysis of diagrams:
"We've shown that the interplay
between figures and text is critical to the
understanding of papers. Each needs the
other in order to tell the whole story."
The importance of Open Access for data mining
"The most important reason for Open
Access is for data mining," said Gerry
Rubin of the Howard Hughes Medical
Institute in the first issue of Open
Access Now (July 14, 2003). One of the
key benefits of true Open Access is that
articles are not only viewed on the publisher's
site, but they can also be downloaded,
aggregated, analysed in new
ways, and redistributed freely. This
freedom is vital if the potential of data
mining is to be realized. The Internet
as a whole demonstrates that Open
Access can be a powerful stimulus to
the development of innovative tools
(for example, search engines such as
Google).
"The success of the Semantic Web will
be significantly limited if content and
tools are not widely shared," writes
Hendler. "Much of the original World
Wide Web grew from an open-source,
open-content model, so too must the
Semantic Web. Research scientists
must team with their computer science
brethren and fight against the intellectual
property policies and runaway
patent madness that make free dissemination
of our products impossible."
"Too much of the published knowledge
and too many of the connections are
held in publishers' hands, so our hands
are tied," says Futrelle. "Traditional
publishers allow us to do the analysis,
but copyright restrictions mean we cannot
give readers full access to the articles
which we have enhanced. This
means that the results of our data mining
have up to now been rather useless in
practical terms, being confined to our
lab alone. Open Access promises to
change all that."
Data mining researchers are now taking
advantage of the freedom offered
by published Open Access research to
turn the tools they have developed to
practical use.
Donaldson says that there is no doubt
that Open Access to full-text articles is
essential if the promise of data mining
is to be fully realized. Only then will
data mining lead to the discovery of
new horizons. "The human mind has
an attention horizon. While you're able
to go anywhere in the world, you're
unlikely to visit that place without
knowing of its existence and how it
relates to something near you. Data
mining expands the circumference of
that horizon."
BioMed Central is leading the way in
encouraging researchers to do data
mining research, by making its
entire corpus of Open Access
research available for download via FTP
in XML form (see www.biomedcentral.com/info/about/datamining). We
hope other publishers will follow.