Looking at Search as Information Retrieval

Abstract:
IR deals with document-based information discovery. In the Web developer's mind, IR may end with the question of whether a site needs an index server for its static pages. Database-driven Web sites often have few static pages, and in those cases it is not uncommon to find search implemented in the relational database. Depending on the content of the tables and the expected queries, this may be the right answer. Some questions to ask are:
The difference between IR systems and databases is a difference in data: IR handles unstructured data. Because the data is unstructured and, to the system, unintelligible, results are generated probabilistically. That is, results are not necessarily correct in the sense of being exactly what the human searcher wanted to find. A query like 'Microsoft and browser' may well return an article profiling Netscape Communications, Inc. However, the query will also return other documents that may fit the searcher's needs better. And the query will return in a handful of milliseconds (think of how fast Google.com searches 3 billion pages of text), where a relational solution based on a text field might return after many minutes of scanning (depending, of course, on the size of the text field and the number of records). Clearly a Google-sized corpus would simply not work in a relational context.
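To make the relational approach described above concrete, here is a small sketch using SQLite (the table and sample rows are purely illustrative): a substring match over a text column forces a scan of every row, which is why cost grows with the number and size of the records.

```python
import sqlite3

# Illustrative only: a tiny table and a LIKE-based text search.
# SQLite cannot use an ordinary index for '%term%' patterns, so this
# query scans every row's body column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("Microsoft ships a new browser",),
     ("Netscape Communications profiled",),
     ("Databases and search",)],
)
rows = conn.execute(
    "SELECT id FROM articles WHERE body LIKE ? AND body LIKE ?",
    ("%Microsoft%", "%browser%"),
).fetchall()
# rows now holds the ids of articles containing both substrings.
```

On a few rows this is instantaneous; on millions of long text fields it is the "many minutes" case the paragraph above describes.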
The world of IR is full of fascinating and tricky algorithms. First among these is the inverted file. An inverted file is essentially an index (usually a hash, trie, or B-tree) of terms pointing to documents. Alongside the pointers (or postings) to the source documents, a term's record may hold other term-level information (e.g. lexicon-level term frequency). Down at the pointer level there is likely to be much more information still; examples include term frequency, positions, and weight. (Note: between terms, pointers, frequencies, positions, weights, and other measures, an uncompressed index may turn out to be as large as or larger than the corpus itself.) Our search for 'Microsoft and browser' will very quickly pull the 'Microsoft' and 'browser' terms from the index and immediately know which documents to return.

Another important concept in IR is normalization, especially through stemming. During query processing, before the query is handed to the index, it is worth doing a few things to improve the quality of results. First we break the terms out of the query: Microsoft, browser. Then we fold all characters to upper or lower case (depending on how the documents were processed at index time): MICROSOFT, BROWSER. Then we remove punctuation and collapse any whitespace in the terms (in our example query there is none). The next step is to drop any statistically useless words, like 'and', 'the', 'some', 'if', etc.; again we have none. Finally we stem the terms. Stemming is the process of finding the root of a word by removing suffixes for tense, plurality, and so on. In our example MICROSOFT remains the same, but stemming converts BROWSER to BROWSE. There is no perfect stemmer, so many language-dependent variations on the process exist. The important thing is that the corpus be indexed under the same stemmer the queries are transformed by.
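The two ideas above, the inverted file and the shared index-time/query-time normalization pipeline, can be sketched in a few lines of Python. The stop-word list and the crude suffix-stripping stemmer here are toy stand-ins for real components such as the Porter stemmer; they exist only to show the shape of the machinery.

```python
import re

STOP_WORDS = {"and", "the", "some", "if", "a", "of", "in"}

def stem(term):
    # Toy stemmer: repeatedly strip common suffixes. Real systems use
    # language-aware algorithms (e.g. Porter); this only shows the idea.
    changed = True
    while changed:
        changed = False
        for suffix in ("ing", "ers", "er", "ed", "s"):
            if term.endswith(suffix) and len(term) - len(suffix) >= 3:
                term = term[: -len(suffix)]
                changed = True
                break
    return term

def normalize(text):
    # Tokenize, lower-case, drop stop words, stem. The SAME pipeline
    # must run at index time and at query time.
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in terms if t not in STOP_WORDS]

def build_index(docs):
    # Inverted file: term -> {doc_id: [positions]}.
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(normalize(text)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    # AND semantics: only documents containing every query term.
    postings = [set(index.get(t, {})) for t in normalize(query)]
    return set.intersection(*postings) if postings else set()
```

A query such as search(index, "Microsoft and browser") touches only the postings for its two surviving terms, which is why the lookup is fast regardless of how much text the corpus holds.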
Beyond increasing the effectiveness of queries, stemming has the side benefit of compressing the index and therefore somewhat improving performance. Beyond inverted files and normalization there is a host of other specialized IR infrastructure. Ranking is a big attraction of IR. We can see from the discussion of postings above how word frequency and position could be used to order the document pointers in result sets. Clustering is another approach. Clustered indexes group documents by similarity measures and enable searches to return individual documents or to access a particular cluster of related works. Automatic thesaurus generation is another way of finding similarity, but at the term level. Thesauri can be used to process queries for improved results, to supplement clustering measures, or to support browsing by term group. Yet another way of grouping terms is by sound, with the Soundex algorithm. Although these algorithms may seem like a lot, they are really just the tip of the IR iceberg.

Back in the real world, application developers have a couple of choices in IR tools. An application will query unstructured text in either a database-centric system, a structured file format (e.g. XML), or a hierarchical or flat file store. For many simpler applications with file stores (e.g. Web sites) a packaged index server is likely to be the right option. These index servers are relatively familiar and frequently very cost effective. Good open source and commercial products exist, and some of the commercial products are almost trivially easy to set up and operate. For an Internet site, an alternative to running your own index server is to have a remote third party provide the index and search service; Atomz is one such example. For a more complex application where the data is in a database or structured file format, the most likely tool will be an index engine framework or the database vendor's embedded IR tools, if any.
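The ranking idea mentioned above, using term frequency in the postings to order results, can be sketched with a simple tf-idf-style scorer. The weighting scheme here is one common textbook choice, not the only one used in practice, and the postings shape (term to per-document frequency) is assumed for illustration.

```python
from math import log

def rank(index, query_terms, total_docs):
    # index: term -> {doc_id: term_frequency}
    # Score each document by summing tf * idf over the query terms.
    # idf down-weights terms that appear in many documents; tf rewards
    # documents where a term occurs often. (One scheme among many.)
    scores = {}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = log(total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Best-scoring documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

Position data in the postings could refine this further, for instance by boosting documents where the query terms occur near each other.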
Both Microsoft and Oracle provide IR subsystems with their flagship database products, and several of the major XML database vendors provide at least some IR support. If, for reasons of unavailability, separation of concerns, or a need for greater functionality, you choose not to use a database vendor's product, the best choice is likely to be a field-based index toolkit. Well-known examples are the Verity product line and the open source Lucene framework, now an Apache project. These toolkits provide strong and flexible indexing support but rely on the development team to do more work integrating indexing and search into the application. Since the application feeds the indexer (instead of relying on the generic crawlers that packaged index servers use), the team can take advantage of field structure: documents are structured as sets of named indexed fields. Breaking documents into fields, or using fields to hold metadata, provides some of the functionality of a relational or XML-based structure, but in a search-effective IR context.

The arguments for using IR methods in an application dealing with unstructured content are straightforward. The field of IR has a 30-plus-year history of improving information discovery, and today's tools are sophisticated, cost effective, and productive. In discussing the fundamentals of IR implementation we have just scratched the surface (see the notes below for pointers to more information). However, it is not necessary to delve into the algorithms of IR to deliver successful text search features, only to recognize both the need for specialized tools and their availability.
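The field-based document model these toolkits share can be sketched generically; this is illustrative Python, not Verity's or Lucene's actual API. Each document is a set of named fields, and the index key pairs a field name with a term, so a query can target, say, the title field alone.

```python
def index_fielded(docs):
    # Each document is a dict of named fields,
    # e.g. {"title": "...", "body": "..."}.
    # Index key is (field, term), so queries can be scoped per field.
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index.setdefault((field, term), set()).add(doc_id)
    return index

def search_field(index, field, term):
    # Look the term up in one named field only.
    return index.get((field, term.lower()), set())
```

A metadata field such as author or date indexed this way gives some of the selectivity of a relational column while staying inside the IR engine.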
© 1999-2002, d.kershaw. all rights reserved.