Looking at Search as Information Retrieval

Abstract:
IR deals with document-based information discovery. In the Web developer's mind, IR may end with the question of whether a site needs an index server for its static pages. Database-driven Web sites often have few static pages, and in those cases it is not uncommon to find search implemented in the relational database. Depending on the content of the tables and the expected queries, this may be the right answer. Some questions to ask are:
The difference between IR systems and databases is a difference in data: IR handles unstructured data. Because the data is unstructured and, to the system, unintelligible, results are generated probabilistically. That is, results are not necessarily correct in the sense of being exactly what the human searcher wanted to find. A query like 'Microsoft and browser' may well return an article profiling Netscape Communications, Inc. However, the query will also return other documents that may fit the searcher's needs better. And the query will return in a handful of milliseconds (think of how fast Google.com searches 3 billion pages of text), where a relational solution based on a text field might return after many minutes of scanning (depending, of course, on the size of the text field and the number of records). Clearly a Google-sized corpus would simply not work in a relational context.
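To make the relational approach described above concrete, here is a small sketch using SQLite (the table and sample rows are purely illustrative): a substring match over a text column forces a scan of every row, which is why cost grows with the number and size of the records.

```python
import sqlite3

# Illustrative only: a tiny table and a LIKE-based text search.
# SQLite cannot use an ordinary index for '%term%' patterns, so this
# query scans every row's body column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("Microsoft ships a new browser",),
     ("Netscape Communications profiled",),
     ("Databases and search",)],
)
rows = conn.execute(
    "SELECT id FROM articles WHERE body LIKE ? AND body LIKE ?",
    ("%Microsoft%", "%browser%"),
).fetchall()
# rows now holds the ids of articles containing both substrings.
```

On a few rows this is instantaneous; on millions of long text fields it is the "many minutes" case the paragraph above describes.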
The world of IR is full of fascinating and tricky algorithms. First among these is the inverted file. An inverted file is essentially an index (usually a hash, trie, or B-tree) of terms pointing to documents. Alongside the pointers (or postings) to the source documents, a term's record may hold other term-level information (e.g. lexicon-level term frequency). Down at the pointer level there is likely to be much more information still; examples include term frequency, positions, and weight. (Note: between terms, pointers, frequencies, positions, weights, and other measures, an uncompressed index may turn out to be as large as or larger than the corpus itself.) Our search for 'Microsoft and browser' will very quickly pull the 'Microsoft' and 'browser' terms from the index and immediately know which documents to return.

Another important concept in IR is normalization, especially through stemming. During query processing, before the query is handed to the index, it is worth doing a few things to improve the quality of results. First we break the terms out of the query: Microsoft, browser. Then we fold all characters to upper or lower case (depending on how the documents were processed at index time): MICROSOFT, BROWSER. Then we remove punctuation and collapse any whitespace in the terms (in our example query there is none). The next step is to drop any statistically useless words, like 'and', 'the', 'some', 'if', etc.; again we have none. Finally we stem the terms. Stemming is the process of finding the root of a word by removing suffixes for tense, plurality, and so on. In our example MICROSOFT remains the same, but stemming converts BROWSER to BROWSE. There is no perfect stemmer, so many language-dependent variations on the process exist. The important thing is that the corpus be indexed under the same stemmer the queries are transformed by.
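The two ideas above, the inverted file and the shared index-time/query-time normalization pipeline, can be sketched in a few lines of Python. The stop-word list and the crude suffix-stripping stemmer here are toy stand-ins for real components such as the Porter stemmer; they exist only to show the shape of the machinery.

```python
import re

STOP_WORDS = {"and", "the", "some", "if", "a", "of", "in"}

def stem(term):
    # Toy stemmer: repeatedly strip common suffixes. Real systems use
    # language-aware algorithms (e.g. Porter); this only shows the idea.
    changed = True
    while changed:
        changed = False
        for suffix in ("ing", "ers", "er", "ed", "s"):
            if term.endswith(suffix) and len(term) - len(suffix) >= 3:
                term = term[: -len(suffix)]
                changed = True
                break
    return term

def normalize(text):
    # Tokenize, lower-case, drop stop words, stem. The SAME pipeline
    # must run at index time and at query time.
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in terms if t not in STOP_WORDS]

def build_index(docs):
    # Inverted file: term -> {doc_id: [positions]}.
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(normalize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    # AND semantics: only documents containing every query term.
    postings = [set(index.get(t, {})) for t in normalize(query)]
    return set.intersection(*postings) if postings else set()
```

A query such as search(index, "Microsoft and browser") touches only the postings for its two surviving terms, which is why the lookup is fast regardless of how much text the corpus holds.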
Beyond increasing the effectiveness of queries, stemming has the side benefit of compressing the index and therefore somewhat improving performance. Beyond inverted files and normalization there is a host of other specialized IR infrastructure. Ranking is a big attraction of IR. We can see from the discussion of postings above how word frequency and position could be used to order the document pointers in result sets. Clustering is another approach. Clustered indexes group documents by similarity measures and enable searches to return individual documents or to access a particular cluster of related works. Automatic thesaurus generation is another way of finding similarity, but at the term level. Thesauri can be used to process queries for improved results, to supplement clustering measures, or to support browsing by term group. Yet another way of grouping terms is by sound, with the Soundex algorithm. Although these algorithms may seem like a lot, they are really just the tip of the IR iceberg.

Back in the real world, application developers have a couple of choices in IR tools. An application will query unstructured text in either a database-centric system, a structured file format (e.g. XML), or a hierarchical or flat file store. For many simpler applications with file stores (e.g. Web sites) a packaged index server is likely to be the right option. These index servers are relatively familiar and frequently very cost effective. Good open source and commercial products exist, and some of the commercial products are almost trivially easy to set up and operate. For an Internet site, an alternative to running your own index server is to have a remote third party provide the index and search service; Atomz is one such example. For a more complex application where the data is in a database or structured file format, the most likely tool will be an index engine framework or the database vendor's embedded IR tools, if any.
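The ranking idea mentioned above, using term frequency in the postings to order results, can be sketched with a simple tf-idf-style scorer. The weighting scheme here is one common textbook choice, not the only one used in practice, and the postings shape (term to per-document frequency) is assumed for illustration.

```python
from math import log

def rank(index, query_terms, total_docs):
    # index: term -> {doc_id: term_frequency}
    # Score each document by summing tf * idf over the query terms.
    # idf down-weights terms that appear in many documents; tf rewards
    # documents where a term occurs often. (One scheme among many.)
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = log(total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Best-scoring documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

Position data in the postings could refine this further, for instance by boosting documents where the query terms occur near each other.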
Both Microsoft and Oracle provide IR subsystems with their flagship database products, and several of the major XML database vendors provide at least some IR support. If, for reasons of unavailability, separation of concerns, or a need for greater functionality, you choose not to use a database vendor's product, the best choice is likely to be a field-based index toolkit. Well-known examples are the Verity product line and the open source Lucene framework, now an Apache project. These toolkits provide strong and flexible indexing support but rely on the development team to do more work integrating indexing and search into the application. Since the application feeds the indexer (instead of relying on the generic crawlers that packaged index servers use), the team can take advantage of field structure: documents are structured as sets of named indexed fields. Breaking documents into fields, or using fields to hold metadata, provides some of the functionality of a relational or XML-based structure, but in a search-effective IR context.

The arguments for using IR methods in an application dealing with unstructured content are straightforward. The field of IR has a 30-plus-year history of improving information discovery, and today's tools are sophisticated, cost effective, and productive. In discussing the fundamentals of IR implementation we have just scratched the surface (see the notes below for pointers to more information). However, it is not necessary to delve into the algorithms of IR to deliver successful text search features, only to recognize both the need for specialized tools and their availability.
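The field-based document model these toolkits share can be sketched generically; this is illustrative Python, not Verity's or Lucene's actual API. Each document is a set of named fields, and the index key pairs a field name with a term, so a query can target, say, the title field alone.

```python
def index_fielded(docs):
    # Each document is a dict of named fields,
    # e.g. {"title": "...", "body": "..."}.
    # Index key is (field, term), so queries can be scoped per field.
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index.setdefault((field, term), set()).add(doc_id)
    return index

def search_field(index, field, term):
    # Look the term up in one named field only.
    return index.get((field, term.lower()), set())
```

A metadata field such as author or date indexed this way gives some of the selectivity of a relational column while staying inside the IR engine.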
© 1999-2002, d.kershaw. all rights reserved.