KinoSearch::Docs::IRTheory(3)  User Contributed Perl Documentation
NAME
KinoSearch::Docs::IRTheory - Crash course in information retrieval.
ABSTRACT
Just enough Information Retrieval theory to find your way around
KinoSearch.
Terminology
KinoSearch uses some terminology from the field of information
retrieval which may be unfamiliar to many users. "Document" and "term"
mean pretty much what you'd expect them to, but others such as
"posting" and "inverted index" need a formal introduction:
· document - An atomic unit of retrieval.
· term - An attribute which describes a document.
· posting - One term indexing one document.
· term list - The complete list of terms which describe a document.
· posting list - The complete list of documents which a term indexes.
· inverted index - A data structure which maps from terms to
documents.
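The definitions above can be made concrete with a small sketch. It is
written in Python purely for illustration and is not KinoSearch code:
the function build_inverted_index, the variable names, and the naive
whitespace tokenizer are all assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return (term_lists, index) for a dict of doc id -> text."""
    term_lists = {}               # doc id -> set of terms describing it
    index = defaultdict(list)     # term -> posting list of doc ids
    for doc_id in sorted(docs):
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        terms = set(docs[doc_id].lower().split())
        term_lists[doc_id] = terms
        for term in terms:
            # One posting: this term indexes this document.
            index[term].append(doc_id)
    return term_lists, dict(index)

docs = {
    1: "skate park",
    2: "park bench",
    3: "skate shop",
}
term_lists, index = build_inverted_index(docs)
print(sorted(term_lists[2]))   # term list for document 2: ['bench', 'park']
print(index["park"])           # posting list for "park": [1, 2]
```

Looking up a term in the index yields its posting list directly, which
is what makes an inverted index suitable for search: there is no need
to scan every document.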
Since KinoSearch is a practical implementation of IR theory, it loads
these abstract, distilled definitions down with useful traits. For
instance, a "posting" in its most rarefied form is simply a term-
document pairing; in KinoSearch, the class
KinoSearch::Index::Posting::MatchPosting fills this role. However, by
associating additional information with a posting, such as the number
of times the term occurs in the document, we can turn it into a
ScorePosting, making it possible to rank documents by relevance rather
than merely list the matching documents in no particular order.
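The difference between the two kinds of posting can be sketched as
follows. This is an illustrative Python model, not the KinoSearch
classes themselves: the namedtuples, the postings_for helper, and the
choice to rank by raw frequency alone are assumptions made to keep the
example short.

```python
from collections import Counter, namedtuple

# Hypothetical stand-ins for the two posting flavors.
MatchPosting = namedtuple("MatchPosting", ["doc_id"])
ScorePosting = namedtuple("ScorePosting", ["doc_id", "freq"])

def postings_for(term, docs):
    """Build both kinds of posting list for one term."""
    match, score = [], []
    for doc_id in sorted(docs):
        # Count occurrences of the term using a naive tokenizer.
        freq = Counter(docs[doc_id].lower().split())[term]
        if freq:
            match.append(MatchPosting(doc_id))        # term-document pair only
            score.append(ScorePosting(doc_id, freq))  # pair plus frequency
    return match, score

docs = {
    1: "park",
    2: "park by the park near the park",
    3: "skate shop",
}
match, score = postings_for("park", docs)

# Match postings can only say which documents matched.
print([p.doc_id for p in match])     # [1, 2]

# Score postings carry enough information to order the matches.
ranked = sorted(score, key=lambda p: p.freq, reverse=True)
print([p.doc_id for p in ranked])    # [2, 1]
```

Ranking by raw frequency is a deliberate simplification; a real
scoring scheme also weighs how rare the term is and how long the
document is.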
TF/IDF ranking algorithm
KinoSearch uses a variant of the well-established "Term Frequency /
Inverse Document Frequency" weighting scheme. A thorough treatment of
TF/IDF is too ambitious for our present purposes, but in a nutshell, it
means that...
· in a search for "skate park", documents which score well for the
comparatively rare term "skate" will rank higher than documents
which score well for the more common term "park".
· a 10-word text which has one occurrence each of both "skate" and
"park" will rank higher than a 1000-word text which also contains
one occurrence of each.
A web search for "tf idf" will turn up many excellent explanations of
the algorithm.
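The two effects described above can be reproduced with a minimal
sketch. The formula used here (square root of term frequency, times a
logarithmic inverse document frequency, divided by the square root of
the document length) is a common textbook variant chosen for
illustration; it is an assumption and not necessarily the exact
formula KinoSearch applies. The tokenize and score functions and the
sample corpus are likewise hypothetical.

```python
import math

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()

def score(query, doc, corpus):
    """Textbook-style TF/IDF with length normalization."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    total = 0.0
    for term in tokenize(query):
        freq = tokens.count(term)                       # term frequency
        doc_freq = sum(1 for d in corpus if term in tokenize(d))
        if freq == 0 or doc_freq == 0:
            continue
        idf = 1.0 + math.log(len(corpus) / doc_freq)    # rarer term, larger idf
        total += math.sqrt(freq) * idf
    return total / math.sqrt(len(tokens))               # favor shorter texts

corpus = [
    "skate shop downtown",
    "city park map",
    "park bench repair",
    "dog park rules",
    "skate park " + "filler " * 8,      # 10-word text
    "skate park " + "filler " * 998,    # 1000-word text
]
query = "skate park"

# "skate" occurs in 3 documents, "park" in 5, so "skate" weighs more.
print(score(query, corpus[0], corpus) > score(query, corpus[1], corpus))  # True

# One occurrence of each term counts for more in the shorter text.
print(score(query, corpus[4], corpus) > score(query, corpus[5], corpus))  # True
```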
COPYRIGHT AND LICENSE
Copyright 2007-2010 Marvin Humphrey
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
perl v5.14.1 2011-06-20 KinoSearch::Docs::IRTheory(3)