Similarity Searches and Domain Vocabularies
By Ken North
When you become involved in database development and the Web, it is easy to feel overwhelmed by the volume of material available about products, research, standards, and leading-edge technologies. The curse of the Web is that, even when we're doing goal-directed Web-surfing, we often follow links that take us in unanticipated directions. The beauty of the Web is that, by doing this, we often uncover nuggets. If you're interested in the latest R&D on information retrieval, indexing, data mining, parallel processing, content-based queries, or other research topics, you'll find volumes on the Web.
The Association for Computing Machinery posts the proceedings of its Special Interest Group on Management of Data (SIGMOD) conference, which typically include interesting papers on data mining and information retrieval techniques. Database research long ago moved past the premise that we'll always deal with queries against structured data. Web search engines are a concrete example of applied research into information retrieval techniques for non-structured data. Researchers continue to develop technologies for querying the diverse mix of data sources that constitute the Web.
The WHIRL search engine was the product of a research project at AT&T Research Labs (Machine Learning and Information Retrieval Research department). The project's purpose was to advance the technology for searching with queries based on textual similarity. Object-relational DBMS (ORDBMS) products from Oracle and IBM offer extenders for text searching. For example, IBM DB2's Text Extender supports text indexing using linguistic, precise, dual, and ngram indexes. Linguistic indexes support synonym and variant searches, and ngram indexes support fuzzy searches.
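The idea behind an ngram index can be sketched in a few lines: break strings into overlapping character n-grams and compare the resulting sets, so that misspellings and variants still score as near matches. The sketch below uses trigrams and Jaccard similarity for illustration; it is not DB2 Text Extender's actual algorithm.

```python
# Trigram-based fuzzy matching, the core idea behind an ngram text index.
# (Illustrative sketch only -- not the Text Extender implementation.)

def ngrams(text, n=3):
    """Return the set of character n-grams in a normalized string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two strings' n-gram sets (0.0 to 1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# A misspelled query term still scores well above an unrelated one:
print(ngram_similarity("retrieval", "retreival"))
print(ngram_similarity("retrieval", "storage"))
```

Because matching is done on character fragments rather than whole words, no dictionary or stemmer is needed, which is what makes ngram indexes suitable for fuzzy searches.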
WHIRL differs from current object-relational DBMS extensions in its ability to query heterogeneous data sources and treat the Web as a unified database that supports fuzzy-text searching. WHIRL treats Web pages as data, not hypertext, and it supports joins between Web pages. The difference between the WHIRL search engine and engines such as Google or AltaVista is one of scale: WHIRL indexes Web pages at fewer sites, but it keeps more information about each page in its indexes.
WHIRL is useful technology for a fuzzy search across heterogeneous sources, when there are differences in vocabulary. It retrieves information even when data sources use different names for the same entity. WHIRL does this by modeling information sources as relations and matching on names. Instead of joining on identical keys (hard joins), it finds tuples with similar names (soft joins) and produces a ranked list.
Using WHIRL, a query of the following type returns results even when one Web page refers to a software publisher as "Microsoft Kids" and another page refers to the same publisher by a slightly different name:

SELECT s.game, s.pub, p.name, p.href
FROM publisher as p, superkid as s
WHERE similar (s.pub, p.name)
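The soft-join idea behind that query can be sketched in Python: instead of pairing tuples on equal keys, pair them when their name fields are textually similar, and return the pairs ranked by similarity score. WHIRL itself uses TF-IDF-weighted cosine similarity; the sketch below substitutes plain word overlap for brevity, and the table rows are hypothetical.

```python
# Minimal sketch of a WHIRL-style soft join: match on textual similarity
# of name fields rather than key equality, and rank the results.
# (WHIRL uses TF-IDF cosine similarity; word-set Jaccard is used here.)

def word_similarity(a, b):
    """Fraction of distinct words the two strings share (Jaccard)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def soft_join(left, right, key_l, key_r, threshold=0.3):
    """Return (left_row, right_row, score) triples, best matches first."""
    matches = [
        (l, r, word_similarity(l[key_l], r[key_r]))
        for l in left for r in right
    ]
    matches = [m for m in matches if m[2] >= threshold]
    return sorted(matches, key=lambda m: m[2], reverse=True)

# Hypothetical rows standing in for the superkid and publisher relations:
superkid = [{"game": "Creative Writer", "pub": "Microsoft Kids"}]
publisher = [{"name": "Microsoft", "href": "www.microsoft.com"},
             {"name": "Broderbund", "href": "www.broderbund.com"}]

for l, r, score in soft_join(superkid, publisher, "pub", "name"):
    print(l["game"], r["name"], round(score, 2))
```

A hard join on `s.pub = p.name` would return nothing here; the soft join surfaces the likely match and lets the ranking, not an exact key, decide.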
William Cohen, PhD, the principal WHIRL researcher, is a member of the faculty at Carnegie Mellon University. The WHIRL software (C++ source code) is available for download.
WHIRL is a powerful solution for searches in the absence of a controlled, domain-specific vocabulary. (Domain in this context refers to a knowledge domain such as biochemistry, not a Web domain such as nih.gov or GridSummit.com).
Imprecise matches can uncover nuggets, but often we are overwhelmed by the sheer volume of results. Applying a domain context addresses that problem. For example, an AltaVista search turned up 922 hits for "impedance mismatch." The first link was to a page about selecting coaxial cable for baseband video. The second was to a page that discussed object orientation, software, and SGML. The first page used vocabulary such as resistance, ground-loop interference, signal attenuation, and "ground-leak-through." The domain of the second page was software engineering, not video engineering, so its vocabulary included instances, abstraction, and inheritance. The search would have been more precise if I'd used exclusion terms and Boolean logic, or if I'd been able to tell the search engine to use a domain thesaurus to constrain the search to either software engineering or video engineering. A domain-specific vocabulary lets us use a dictionary or thesaurus to restrict querying, indexing, and abstracting to a standard vocabulary. This simplifies the task of the query processor and makes it easier to produce precise matches.
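The domain-thesaurus idea can be sketched directly: score each hit for an ambiguous term by how many words from a chosen domain vocabulary it contains, and keep only the hits that match the domain the user cares about. The vocabularies and page snippets below are invented for illustration.

```python
# Sketch of constraining a search with a domain vocabulary: pages that
# share no terms with the chosen domain's thesaurus are filtered out.
# (Vocabularies and page text are invented, not real thesauri.)

SOFTWARE_VOCAB = {"instances", "abstraction", "inheritance", "object"}
VIDEO_VOCAB = {"resistance", "attenuation", "coaxial", "baseband"}

def domain_score(page_text, vocabulary):
    """Count how many domain-vocabulary terms appear in the page."""
    words = set(page_text.lower().split())
    return len(words & vocabulary)

pages = [
    "selecting coaxial cable for baseband video signal attenuation",
    "impedance mismatch between object abstraction and inheritance models",
]

# Constrain the search to the software-engineering domain:
hits = [p for p in pages if domain_score(p, SOFTWARE_VOCAB) > 0]
print(hits)
```

With `VIDEO_VOCAB` substituted, the same query would return only the coaxial-cable page, which is exactly the disambiguation the "impedance mismatch" example needed.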
The U.S. National Library of Medicine (NLM) pioneered controlled vocabularies in the biomedical field. NLM's first controlled vocabulary was the Medical Subject Headings (MeSH), used for indexing, abstracting, cataloging, and retrieving bibliographic citations to medical literature. Since the 1960s, MeSH has been used for computer queries related to medicine, nursing, dentistry, and veterinary medicine.
Years ago, I worked on a query processor for NLM's Medical Literature Analysis and Retrieval System (MEDLARS). It queried on Boolean combinations of MeSH headings and subheadings, an approach that is still used today. By 1999:
- MeSH contained 18,000 subject headings and 96,000+ chemical records
- MEDLARS contained about 18 million references to medical literature
- NLM added 31,000 new citations each month.
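The MEDLARS approach described above, Boolean combinations of controlled MeSH headings, can be sketched as set operations over an inverted index of citations. The headings and citation records below are invented for illustration; they are not actual MEDLARS data.

```python
# Sketch of MEDLARS-style retrieval: citations are indexed under
# controlled MeSH headings and queried with Boolean combinations
# (AND via subset test, NOT via exclusion). Data is invented.

citations = {
    1: {"Diabetes Mellitus", "Insulin"},
    2: {"Diabetes Mellitus", "Diet Therapy"},
    3: {"Hypertension", "Diet Therapy"},
}

def search(required, excluded=frozenset()):
    """Return IDs whose headings include all required and none excluded."""
    return {
        cid for cid, headings in citations.items()
        if required <= headings and not (excluded & headings)
    }

# "Diabetes Mellitus" AND "Diet Therapy":
print(search({"Diabetes Mellitus", "Diet Therapy"}))  # {2}
# "Diabetes Mellitus" NOT "Insulin":
print(search({"Diabetes Mellitus"}, {"Insulin"}))     # {2}
```

Because the headings come from a controlled vocabulary, the query processor never has to guess at synonyms or spelling variants; that work is done once, when the citation is indexed.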
The successors to the original MEDLARS application include MEDLINE, Internet Grateful Med, and PubMed, which provides literature searches such as the example shown in Figure 1. Internet Grateful Med and PubMed are examples of modern technology and the Web making information widely accessible. (In the 1960s and 1970s, MEDLARS access was restricted to authorized users such as medical libraries.)
MeSH is an example of a controlled domain vocabulary, but NLM has been involved in other medical-language research and development projects. More recently, NLM sponsored the development of the Unified Medical Language System (UMLS) for retrieving information from diverse sources. The objective of UMLS is to support diverse applications involving digital libraries, patient data, decision support, bibliographies, and Web information retrieval. Besides MeSH, UMLS includes other vocabularies, lexical programs, and the UMLS Knowledge Sources. The Knowledge Sources include a semantic network, a meta-thesaurus, and an information sources map. The UMLS knowledge base includes a dictionary of concepts that defines core concepts and constraints that guide the browsing of the UMLS Knowledge Sources.

Research projects have been measuring the effectiveness of concept-matching algorithms and using canonical concepts, rather than word indexes, to index documents. Results have been promising in machine-learning situations, where systems automatically learned connections between words and concepts in MEDLINE documents. (For more information about machine learning, visit www.aic.nrl.navy.mil/~aha/research/machine-learning.html.) Research continues on new approaches to automated indexing and retrieval from medical documents.

Searches would be more efficient if we had a standard thesaurus for each domain. For example, a Cajun-cooking thesaurus would draw from the expertise and works of people such as Marcelle Bienvenu and Paul Prudhomme. If I were coding a page containing Cajun recipes, I might include information about their origin and nutritional content, and use a tag such as:
<thesaurus> "Cajun Cooking, Nutrition, American History"
If I were coding a page about diabetes, I might use:
<thesaurus> "Medical Subject Headings, ICD-9CM"
Perhaps one day we will see hybrid search engines that are capable of using a domain-specific vocabulary and name-matching capabilities such as WHIRL. When asked (in 1999) about integrating domain-specific vocabularies with WHIRL, Dr. Cohen replied, "So far I've used only general-purpose similarity metrics, although you can obviously set up the interface so that the queries themselves are suited to a domain."
Perhaps we will update each thesaurus through a combination of domain-expert input, automated indexing, and machine intelligence. Publishers might routinely submit new books, magazines, journals, and other publications to indexing engines. The engines would assist in generating an index for each publication and contribute to the statistical databases used to update the standard thesaurus for each domain. When a term or concept consistently appears in the literature, it would pass some qualitative measure of acceptance that can be detected without human analysis. In addition, domain experts would periodically review the thesaurus for completeness and accuracy.
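The automated "measure of acceptance" mentioned above could be as simple as a frequency threshold: a candidate term graduates into the domain thesaurus once it appears in enough distinct publications. The threshold value and publication data below are invented for illustration.

```python
# Sketch of automated thesaurus updating: count how many distinct
# publications mention each candidate term, and accept terms that
# clear a threshold. (Threshold and data are invented.)
from collections import Counter

ACCEPTANCE_THRESHOLD = 3  # distinct publications mentioning the term

def accepted_terms(publication_terms, threshold=ACCEPTANCE_THRESHOLD):
    """publication_terms: one set of index terms per publication."""
    counts = Counter()
    for terms in publication_terms:
        counts.update(terms)  # sets, so at most one count per publication
    return {t for t, n in counts.items() if n >= threshold}

pubs = [{"data mining", "ontology"},
        {"data mining", "xml"},
        {"data mining", "ontology"},
        {"xml"}]
print(accepted_terms(pubs))  # {'data mining'}
```

A real system would weight sources and involve the periodic expert review described above, but the statistical core is just this kind of counting.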
An earlier version of this article was supplemental reading for the January 1999 "Database Developer" column in Web Techniques. Since then, there have been changes to the bibliographic databases at the National Library of Medicine. Their scope has been increased to supplement biomedical topics with more life sciences information. NLM retired Internet Grateful Med in September 2001 but Internet users can now search MEDLINE bibliographic citations via PubMed. The PubMed databases contain approximately 12 million citations.
WHIRL, and before that, MEDLARS, made significant contributions to the body of knowledge about information retrieval strategies. There has been additional work on searching semi-structured data and the W3C published specifications for XQuery and the Resource Description Framework (RDF). Research continues on distributed queries, machine learning, knowledge representation, and ontology-based computing.
Ken North is an author and consultant. He teaches Expert Series seminars. Contact him at