
September 8, 2003

FEATURE: Data mining

Data mining Open Access research

By Matthew Cockerill, Technical Director, BioMed Central

In the words of Sydney Brenner
The definition of data mining:
what's my data is mine and what's yours is also mine.

The need for data mining tools

The biomedical research literature is growing at a daunting rate, with more than 400,000 new research articles listed each year in PubMed alone. The power of post-genomic research technologies and increases in biomedical research funding suggest that we can expect the number of articles to continue to expand at an overwhelming rate. The sheer scale of the scientific literature poses a formidable challenge to research scientists in their attempt to locate published results that are relevant to their research interests.

Text searching of abstracts and full-text articles has helped, but simple text searches miss much information. More sophisticated tools are needed to identify relationships, connections, and patterns that are hidden within the data reported in the scientific literature. Data mining researchers and computer scientists are developing such tools using statistical, linguistic and artificial intelligence approaches.

The search engines used to find information on the Internet provide a simple example of how such techniques can be helpful. The well-known search engine Google exploits the pattern of links between different websites in order to generate a 'page rank' for each page that it indexes. Pages that are frequently linked to by other pages are given a higher page rank. And it turns out that page rank is an extremely helpful predictor of how useful a page is to Internet surfers.
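The link-counting idea can be sketched in a few lines of Python. This is a minimal sketch of the simplified textbook formulation of page rank, not Google's production system; the damping factor of 0.85, the fixed number of iterations, and the four-page toy web are all assumptions made for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Estimate page rank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    `damping` and `iterations` are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: pages a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy web, page c, which three other pages link to, receives the highest rank, and page a ranks above b and d because its single incoming link comes from the highly ranked c.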


But new approaches are needed to tackle both the expanding Internet and the scientific literature. The World Wide Web was originally built for human consumption, and although everything on it is machine-readable, this data is not machine-understandable.

Part of the problem comes from the very nature of natural language, which makes it a considerable challenge for computational algorithms to extract meaningful information. For example, English has both synonymy (multiple ways to refer to the same object) and polysemy (words with multiple meanings), which complicate text searching. Scientific literature has numerous examples of such ambiguities: individual genes often have several different names that relate to their historical discovery, and names such as RAS and MYC are used to refer to both genes and their protein products.
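The synonymy half of the problem can be illustrated with a small lookup table that maps the names found in text onto one canonical symbol. This is only a sketch: the synonym table is a tiny hand-made assumption, the function names are hypothetical, and real tools draw on curated gene-name databases. It also does nothing about polysemy, such as deciding whether "MYC" in a given sentence means the gene or the protein.

```python
import re

# Illustrative synonym table (an assumption, not a curated resource):
# canonical symbol -> other names used in the literature.
SYNONYMS = {
    "TP53": ["p53"],
    "MYC": ["c-Myc"],
}


def build_pattern(synonyms):
    """Compile one regular expression that matches every known name."""
    lookup = {}
    for canonical, names in synonyms.items():
        for name in [canonical, *names]:
            lookup[name.lower()] = canonical
    # Try longer names first so that "c-Myc" wins over "Myc".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(a) for a in alternatives) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return pattern, lookup


def canonical_mentions(text, synonyms=SYNONYMS):
    """Return the canonical symbol for each gene name found, in text order."""
    pattern, lookup = build_pattern(synonyms)
    return [lookup[match.group(1).lower()] for match in pattern.finditer(text)]


mentions = canonical_mentions(
    "Loss of p53 cooperates with c-Myc; MYC and TP53 were sequenced."
)
# → ['TP53', 'MYC', 'MYC', 'TP53']
```

A plain keyword search for "TP53" would have found only one of the two mentions of that gene in the example sentence.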

More sophisticated approaches are required that can extract meaning by taking into account the context within the text. Furthermore, data mining experts hope to develop tools that can identify knowledge that lurks beneath the surface and can only be found by connecting the 'semantics' (meaning) reported in one article with findings reported in another.

Semantics, mark-up languages and ontologies

The parallels between finding information on the Internet and finding relevant information in the scientific literature can offer instructive hints for data mining approaches.

Jim Hendler, a computer scientist at the University of Maryland, is a pioneer of the Semantic Web, a project developing Internet resources for the future. "The Semantic Web is not a separate web, but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation," explained Hendler in a recent article. "For the Semantic Web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning," says Hendler.

That requires a computational language that can be used to express both data and rules for reasoning about the data. The advent of extensible markup language (XML) as a standard for information interchange holds great promise, as XML makes it possible to express mathematical equations and chemical structures in a way that a computer can easily recognize and process. XML gives structure to a document by creating tags, hidden labels attached to annotated information. The Resource Description Framework (RDF) is then used to express meaning, which is encoded in sets of triples, each triple being like the subject, verb, and object of an elementary sentence.
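To make the idea of tags and triples concrete, here is a minimal Python sketch. The XML fragment, its tag names, and the predicate names "hasTitle" and "mentions" are all invented for illustration, and the triples are plain Python tuples rather than real RDF, which identifies subjects and predicates with URIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged article; the tag and attribute names are assumptions.
ARTICLE_XML = """
<article id="art1">
  <title>A hypothetical study</title>
  <sentence>
    <protein>RAS</protein> binds <protein>RAF</protein>.
  </sentence>
</article>
"""


def triples_from_article(xml_text):
    """Turn tagged content into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text.strip())
    article_id = root.get("id")
    triples = [(article_id, "hasTitle", root.findtext("title"))]
    # The tags tell the program which words are protein names.
    for protein in root.iter("protein"):
        triples.append((article_id, "mentions", protein.text))
    return triples


def objects(triples, subject, predicate):
    """Answer a simple question by matching on subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


triples = triples_from_article(ARTICLE_XML)
mentioned = objects(triples, "art1", "mentions")
# → ['RAS', 'RAF']
```

Because the protein names are tagged rather than buried in running text, the program does not have to guess which words refer to proteins.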

The third key component of a functional Semantic Web is a system that allows algorithms to define the relationships between different terms - this is referred to as an ontology. The use of ontologies is illustrated by the Gene Ontology (GO) Consortium, which is developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner.
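A structured vocabulary pays off when a query for a general term should also find items annotated with more specific terms. The sketch below shows this with a toy "is a" hierarchy in Python. The hierarchy is heavily simplified and is not the real Gene Ontology structure, and the gene names and annotations are hypothetical.

```python
# Toy "is a" hierarchy (a simplified assumption, not the real GO graph):
# each term maps to its more general parent terms.
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "binding": ["molecular function"],
}

# Hypothetical annotations of gene products with vocabulary terms.
ANNOTATIONS = {
    "geneA": ["protein kinase activity"],
    "geneB": ["DNA binding"],
}


def ancestors(term, is_a=IS_A):
    """Collect every more general term reachable from `term`."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def genes_annotated_to(term, annotations=ANNOTATIONS):
    """Find genes annotated with `term` or with any more specific term."""
    hits = []
    for gene, terms in annotations.items():
        if any(t == term or term in ancestors(t) for t in terms):
            hits.append(gene)
    return sorted(hits)


catalytic_genes = genes_annotated_to("catalytic activity")
# → ['geneA']
```

The query for "catalytic activity" finds geneA even though that phrase never appears in its annotation, which is something a plain text search could not do.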

 

Once biological research information is structured in this way, using appropriate mark-up languages and ontologies, researchers will have a realistic opportunity to mine the data. This paves the way for computer-assisted discovery.

"The main benefits are realized by using these techniques to create curated machine-readable datasets," says Ian Donaldson at the Samuel Lunenfeld Research Institute in Toronto. "Biomolecular knowledge is really contained in the collective mind of the research community. The biomedical literature is an imperfect reflection of this knowledge that is obfuscated by natural language and its use in describing models of consensus that are constantly changing. Data mining techniques facilitate one step in creating reliable models that can be computationally accessed," says Donaldson.


Donaldson has developed PreBIND, a data mining tool that helps researchers locate biomolecular protein-protein interaction information in the scientific literature. Users can enter the name or accession number of a protein and PreBIND will return a list of potentially interacting proteins. The list of potential interacting proteins is determined using a list of protein synonyms, in combination with a supervised learning algorithm known as a support vector machine (SVM). The SVM determines whether there is an indication in the text that two proteins interact. As a result, PreBIND can extract information that would be missed by a simple literature search for keywords such as 'interaction'. "Information extraction is about finding the relationship between molecules, something that it is difficult to do with straight text searches," says Donaldson.
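The classification step can be illustrated with a toy linear support vector machine written in plain Python. This is a sketch of the general technique only and is not PreBIND's implementation: the bag-of-words features, the handful of training sentences, the placeholder names PROTA and PROTB, and the training settings are all assumptions chosen to keep the example short.

```python
def tokenize(sentence):
    """Lower-case the sentence and split it into words."""
    return sentence.lower().replace(".", " ").split()


def train_linear_svm(examples, epochs=200, learning_rate=0.1, reg=0.01):
    """Train a linear SVM by subgradient descent on the hinge loss.

    `examples` is a list of (sentence, label) pairs, where label is
    +1 (the sentence describes an interaction) or -1 (it does not).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            words = tokenize(sentence)
            score = bias + sum(weights.get(w, 0.0) for w in words)
            # Regularization: shrink all weights slightly on every step.
            for w in weights:
                weights[w] *= 1.0 - learning_rate * reg
            # Hinge loss: only adjust when the margin is smaller than 1.
            if label * score < 1.0:
                for w in words:
                    weights[w] = weights.get(w, 0.0) + learning_rate * label
                bias += learning_rate * label
    return weights, bias


def predict(sentence, weights, bias):
    """Return +1 if the sentence is scored as an interaction, else -1."""
    score = bias + sum(weights.get(w, 0.0) for w in tokenize(sentence))
    return 1 if score >= 0 else -1


# Hypothetical training sentences with protein names replaced by placeholders.
TRAINING = [
    ("PROTA binds PROTB.", 1),
    ("PROTA interacts with PROTB.", 1),
    ("PROTA forms a complex with PROTB.", 1),
    ("PROTA and PROTB were sequenced.", -1),
    ("PROTA was cloned before PROTB.", -1),
]

weights, bias = train_linear_svm(TRAINING)
label = predict("PROTA binds to PROTB.", weights, bias)
```

After training, words seen only in interaction sentences, such as "binds", carry positive weights, while words seen only in the other sentences, such as "sequenced", carry negative ones. A new sentence can therefore be scored even when it never uses the word "interaction".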

"There are few limits to the types of information we can extract," says Bob Futrelle, who works on automated analysis of technical documents and diagrams at Northeastern University in Boston. "As data mining research progresses and as we learn more about the nature and the structure of scientific knowledge, we will be able to extract more complex and subtle information." Futrelle runs the BioNLP website, which brings together resources relating to the natural language processing of biology text. His group has built viewers that look at text and text comparisons in much the same way that people look at DNA sequence homology using on-screen viewers. They are also focusing on automated analysis of diagrams. "We've shown that the interplay between figures and text is critical to the understanding of papers. Each needs the other in order to tell the whole story."

The importance of Open Access for data mining

"The most important reason for Open Access is for data mining," said Gerry Rubin of the Howard Hughes Medical Institute in the first issue of Open Access Now (July 14, 2003). One of the key benefits of true Open Access is that articles are not only viewed on the publisher's site, but they can also be downloaded, aggregated, analysed in new ways, and redistributed freely. This freedom is vital if the potential of data mining is to be realized - the Internet as a whole demonstrates that Open Access can be a powerful stimulus to the development of innovative tools (for example, search engines such as Google).


"The success of the Semantic Web will be significantly limited if content and tools are not widely shared," writes Hendler. "Much of the original World Wide Web grew from an open-source, open-content model, so too must the Semantic Web. Research scientists must team with their computer science brethren and fight against the intellectual property policies and runaway patent madness that make free dissemination of our products impossible."

"Too much of the published knowledge and too many of the connections are held in publishers' hands, so our hands are tied," says Futrelle. "Traditional publishers allow us to do the analysis, but copyright restrictions mean we cannot give readers full access to the articles which we have enhanced. This means that the results of our data mining have up to now been rather useless in practical terms, being confined to our lab alone. Open Access promises to change all that."

Data mining researchers are now taking advantage of the freedom offered by published Open Access research to turn the tools they have developed to practical use.

Donaldson says that there is no doubt that Open Access to full-text articles is essential if the promise of data mining is to be fully realized. Only then will data mining lead to the discovery of new horizons. "The human mind has an attention horizon. While you're able to go anywhere in the world, you're unlikely to visit that place without knowing of its existence and how it relates to something near you. Data mining expands the circumference of that horizon."

BioMed Central is leading the way in encouraging researchers to do data mining research, by making its entire corpus of Open Access research available for download via FTP in XML form (see www.biomedcentral.com/info/about/datamining). We hope other publishers will follow.


Glossary of data mining

Data mining
Using computers to extract information that is hidden within a large set of data. Data mining researchers who work with text use statistical, linguistic, and artificial intelligence techniques to go beyond simple text searching.
BioNLP.org - natural language processing of biological text.
www.ccs.neu.edu/home/futrelle/bionlp
BioMed Central's data mining information page
www.biomedcentral.com/info/about/datamining

Semantic Web
The vision of Tim Berners-Lee for an improved version of the World Wide Web in which content is annotated in a machine-readable way, to allow its meaning to be analysed by automated 'agents'.
www.semanticweb.org
www.mindswap.org/Science

XML
eXtensible Markup Language (XML) is a standard text format that allows information to be represented in a structured way, thereby facilitating automatic processing. Different dialects of XML (known as schemas) are used to describe different types of content (for example, CML describes chemical structures, MathML describes equations).
www.w3.org/XML
www.xml-cml.org
www.w3.org/Math
 

Ontology
In artificial intelligence research, an ontology is a structured collection of concepts relevant to a particular domain of knowledge. For example, an ontology might incorporate the concept of engrailed, which would be an instance of the concept gene. And like all genes, engrailed would be associated with a specific organism (Drosophila) and chromosome (2).
Gene Ontology Consortium

PreBIND
PreBIND is a data mining tool that helps researchers locate biomolecular protein-protein interaction information in the scientific literature.
PreBIND home page
Article on PreBIND

Open Access Now is published by BioMed Central.
Editor: Jonathan B Weitzman.