KinoSearch::Docs::FileFormat(3)  User Contributed Perl Documentation  KinoSearch::Docs::FileFormat(3)

NAME
       KinoSearch::Docs::FileFormat - Overview of index file format.

OVERVIEW
       It is not necessary to understand the current implementation details of
       the index file format in order to use KinoSearch effectively, but it
       may be helpful if you are interested in tweaking for high performance,
       exotic usage, or debugging and development.

       On a file system, an index is a directory.  The files inside have a
       hierarchical relationship: an index is made up of "segments", each of
       which is an independent inverted index with its own subdirectory; each
       segment is made up of several component parts.

	   [index]--|
		    |--snapshot_XXX.json
		    |--schema_XXX.json
		    |--write.lock
		    |
		    |--seg_1--|
		    |	      |--segmeta.json
		    |	      |--cfmeta.json
		    |	      |--cf.dat-------|
		    |			      |--[lexicon]
		    |			      |--[postings]
		    |			      |--[documents]
		    |			      |--[highlight]
		    |			      |--[deletions]
		    |
		    |--seg_2--|
		    |	      |--segmeta.json
		    |	      |--cfmeta.json
		    |	      |--cf.dat-------|
		    |			      |--[lexicon]
		    |			      |--[postings]
		    |			      |--[documents]
		    |			      |--[highlight]
		    |			      |--[deletions]
		    |
		    |--[...]--|

Write-once philosophy
       All segment directory names consist of the string "seg_" followed by a
       number in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher
       numbers indicating more recent segments.  Once a segment is finished
       and committed, its name is never re-used and its files are never
       modified.
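
       The naming rule above can be sketched in a few lines of Python.  This
       is an illustration only, not KinoSearch code: the helper names are
       invented, and the pattern assumes lowercase base-36 digits as in the
       examples given.

```python
import re

# Assumption: a segment directory is "seg_" plus lowercase base-36 digits.
SEG_NAME = re.compile(r"seg_([0-9a-z]+)")

def segment_number(name):
    """Return the integer encoded in a segment directory name, or None."""
    match = SEG_NAME.fullmatch(name)
    return int(match.group(1), 36) if match else None

def oldest_to_newest(names):
    """Keep only segment directories and sort them by base-36 number."""
    segs = [n for n in names if segment_number(n) is not None]
    return sorted(segs, key=segment_number)

# Plain string sorting would put "seg_10" before "seg_z"; numerically,
# "z" is 35 and "10" is 36, so seg_z is the older segment.
print(oldest_to_newest(["seg_10", "write.lock", "seg_z", "seg_1"]))
```

       Comparing decoded numbers rather than raw strings matters as soon as
       names differ in length.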

       Old segments become obsolete and can be removed when their data has
       been consolidated into new segments during the process of segment
       merging and optimization.  A fully-optimized index has only one
       segment.

Top-level entries
       There are a handful of "top-level" files and directories which belong
       to the entire index rather than to a particular segment.

   snapshot_XXX.json
       A "snapshot" file, e.g. "snapshot_m7p.json", is a list of index files and
       directories.  Because index files, once written, are never modified,
       the list of entries in a snapshot defines a point-in-time view of the
       data in an index.

       Like segment directories, snapshot files also utilize the
       unique-base-36-number naming convention; the higher the number, the
       more recent the file.  The appearance of a new snapshot file within the
       index directory constitutes an index update.  While a new segment is
       being written, new files may be added to the index directory, but until
       a new snapshot file gets written, a Searcher opening the index for
       reading won't know about them.
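
       A rough Python sketch of that visibility rule follows.  The function
       names are hypothetical, and the snapshot's contents are represented as
       an already-decoded list of entry names, because the JSON layout of a
       snapshot file is not described here.

```python
import re

SNAPSHOT_NAME = re.compile(r"snapshot_([0-9a-z]+)\.json")

def latest_snapshot(entries):
    """Return the snapshot file name with the highest base-36 number."""
    best_name, best_num = None, -1
    for entry in entries:
        match = SNAPSHOT_NAME.fullmatch(entry)
        if match and int(match.group(1), 36) > best_num:
            best_name, best_num = entry, int(match.group(1), 36)
    return best_name

def visible_entries(on_disk, snapshot_entries):
    """Entries a reader would consider: only those the snapshot lists."""
    listed = set(snapshot_entries)
    return sorted(e for e in on_disk if e in listed)

# seg_3 is still being written, so no snapshot mentions it yet.
on_disk = ["snapshot_m7p.json", "snapshot_m7q.json", "seg_1", "seg_2", "seg_3"]
print(latest_snapshot(on_disk))
print(visible_entries(on_disk, ["seg_1", "seg_2"]))
```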

   schema_XXX.json
       The schema file is a Schema object describing the index's format,
       serialized as JSON.  It, too, is versioned, and a given snapshot file
       will reference one and only one schema file.

   locks
       By default, only one indexing process may safely modify the index at
       any given time.  Processes reserve an index by laying claim to the
       "write.lock" file within the "locks/" directory.  A smattering of other
       lock files may be used from time to time, as well.

A segment's component parts
       By default, each segment has up to five logical components: lexicon,
       postings, document storage, highlight data, and deletions.  Binary data
       from these components gets stored in virtual files within the "cf.dat"
       compound file; metadata is stored in a shared "segmeta.json" file.

   segmeta.json
       The segmeta.json file is a central repository for segment metadata.  In
       addition to information such as document counts and field numbers, it
       also warehouses arbitrary metadata on behalf of individual index
       components.

   Lexicon
       Each indexed field gets its own lexicon in each segment.  The exact
       files involved depend on the field's type, but generally speaking there
       will be two parts.  First, there's a primary "lexicon-XXX.dat" file
       which houses a complete term list associating terms with corpus
       frequency statistics, postings file locations, etc.  Second, one or
       more "lexicon index" files may be present which contain periodic
       samples from the primary lexicon file to facilitate fast lookups.
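
       The idea of sampling a sorted term list can be sketched as below.  This
       is not the on-disk format: a Python list stands in for the primary
       lexicon file, list positions stand in for byte offsets, and the
       sampling interval and function names are assumptions made for the
       illustration.

```python
import bisect

def build_sample_index(terms, interval):
    """Copy every `interval`-th term, with its position, into an index."""
    return [(terms[i], i) for i in range(0, len(terms), interval)]

def lookup(terms, index, interval, target):
    """Return the position of `target` in the sorted `terms`, or None."""
    keys = [term for term, _ in index]
    # Find the last sampled term that sorts at or before the target.
    slot = bisect.bisect_right(keys, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    start = index[slot][1]
    # Scan only the short run of terms between two samples.
    for pos in range(start, min(start + interval, len(terms))):
        if terms[pos] == target:
            return pos
        if terms[pos] > target:
            break
    return None

terms = ["apple", "banana", "cherry", "freedom", "grape", "liberty", "zebra"]
index = build_sample_index(terms, 3)
print(lookup(terms, index, 3, "grape"))
```

       The small index narrows the search to one run of terms, so the primary
       list is scanned for at most `interval` entries per lookup.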

   Postings
       "Posting" is a technical term from the field of information retrieval,
       defined as a single instance of one term indexing one document.  If
       you are looking at the index in the back of a book, and you see that
       "freedom" is referenced on pages 8, 86, and 240, that would be three
       postings, which taken together form a "posting list".  The same
       terminology applies to an index in electronic form.

       Each segment has one postings file per indexed field.  When a search is
       performed for a single term, first that term is looked up in the
       lexicon.  If the term exists in the segment, the record in the lexicon
       will contain information about which postings file to look at and where
       to look.

       The first thing any posting record tells you is a document id.  By
       iterating over all the postings associated with a term, you can find
       all the documents that match that term, a process which is analogous to
       looking up page numbers in a book's index.  However, each posting
       record typically contains other information in addition to document id,
       e.g. the positions at which the term occurs within the field.
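
       As a minimal in-memory model of a posting list, the sketch below stores
       a document id and term positions per posting.  The "Posting" record,
       the sample data, and the function names are made up for illustration
       and say nothing about how postings are encoded on disk.

```python
from collections import namedtuple

# Hypothetical record: one term occurring in one document of one field.
Posting = namedtuple("Posting", ["doc_id", "positions"])

postings = {
    "freedom": [Posting(40, [7]), Posting(44, [0, 3, 9, 15, 21, 30, 41, 52])],
    "of": [Posting(40, [2]), Posting(44, [4, 16])],
}

def matching_docs(term):
    """Return (doc_id, occurrence count) for every posting of `term`."""
    return [(p.doc_id, len(p.positions)) for p in postings.get(term, [])]

def docs_with_adjacent(first, second):
    """Doc ids in which `second` occurs immediately after `first`."""
    second_by_doc = {p.doc_id: set(p.positions) for p in postings.get(second, [])}
    hits = []
    for p in postings.get(first, []):
        following = second_by_doc.get(p.doc_id, set())
        if any(pos + 1 in following for pos in p.positions):
            hits.append(p.doc_id)
    return hits

print(matching_docs("freedom"))
print(docs_with_adjacent("freedom", "of"))
```

       Document ids alone answer "which documents match"; the positions are
       what make phrase matching possible.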

   Documents
       The document storage section is a simple database, organized into two
       files:

       ·   documents.dat - Serialized documents.

       ·   documents.ix - Document storage index, a solid array of 64-bit
	   integers where each integer location corresponds to a document id,
	   and the value at that location points at a file position in the
	   documents.dat file.
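
       The lookup through documents.ix amounts to fixed-width array indexing,
       sketched here in Python.  The byte order (big-endian) and the
       assumption that slot 0 is the first slot are guesses made for the
       example; the text above specifies only that the entries are 64-bit
       integers.

```python
import struct

POINTER_SIZE = 8  # 64-bit integers, as described above

def build_ix(offsets):
    """Pack file positions into a solid array of 64-bit integers."""
    # Assumption: big-endian unsigned values.
    return b"".join(struct.pack(">Q", off) for off in offsets)

def doc_offset(ix_bytes, doc_id):
    """Read the documents.dat file position stored in slot `doc_id`."""
    start = doc_id * POINTER_SIZE
    if doc_id < 0 or start + POINTER_SIZE > len(ix_bytes):
        raise IndexError("doc id outside the index array")
    (offset,) = struct.unpack_from(">Q", ix_bytes, start)
    return offset

ix = build_ix([0, 120, 377, 1024])
print(doc_offset(ix, 2))
```

       Because every slot has the same width, finding a document's file
       position takes a single multiplication and one read.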

   Highlight data
       The files which store data used for excerpting and highlighting are
       organized similarly to the files used to store documents.

       ·   highlight.dat - Chunks of serialized highlight data, one per doc
	   id.

       ·   highlight.ix - Highlight data index -- as with the "documents.ix"
	   file, a solid array of 64-bit file pointers.

   Deletions
       When a document is "deleted" from a segment, it is not actually purged
       right away; it is merely marked as "deleted" via a deletions file.
       Deletions files contain bit vectors with one bit for each document in
       the segment; if bit #254 is set then document 254 is deleted, and if
       that document turns up in a search it will be masked out.

       It is only when a segment's contents are rewritten to a new segment
       during the segment-merging process that deleted documents truly go
       away.
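
       Masking deleted documents with a bit vector might look like the
       following sketch.  The bit order within each byte (least-significant
       bit first) and the helper names are assumptions; only the
       one-bit-per-document idea comes from the description above.

```python
def mark_deleted(bits, doc_id):
    """Set the bit for `doc_id` in a bytearray-backed bit vector."""
    byte_index, bit_index = divmod(doc_id, 8)
    bits[byte_index] |= 1 << bit_index

def is_deleted(bits, doc_id):
    """Return True if the bit for `doc_id` is set."""
    # Assumption: bit n lives in byte n // 8, least-significant bit first.
    byte_index, bit_index = divmod(doc_id, 8)
    if byte_index >= len(bits):
        return False
    return bool(bits[byte_index] & (1 << bit_index))

def filter_hits(hits, bits):
    """Drop deleted doc ids from a list of search hits."""
    return [doc_id for doc_id in hits if not is_deleted(bits, doc_id)]

deletions = bytearray(64)  # room for 512 documents
mark_deleted(deletions, 254)
print(filter_hits([40, 254, 300], deletions))
```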

Compound Files
       If you peer inside an index directory, you won't actually find any
       files named "documents.dat", "highlight.ix", etc. unless there is an
       indexing process underway.  What you will find instead is one "cf.dat"
       and one "cfmeta.json" file per segment.

       To minimize the need for file descriptors at search-time, all
       per-segment binary data files are concatenated together in "cf.dat" at
       the close of each indexing session.  Information about where each file
       begins and ends is stored in "cfmeta.json".  When the segment is opened
       for reading, a single file descriptor per "cf.dat" file can be shared
       among several readers.
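
       The compound-file idea can be sketched as below.  The metadata layout
       used here, a JSON object mapping each virtual file name to "offset" and
       "length" keys, is invented for the example; the actual structure of
       "cfmeta.json" is not spelled out in this document.

```python
import io
import json

def pack_compound(files):
    """Concatenate named byte strings; return (data, metadata as JSON)."""
    data = bytearray()
    meta = {}
    for name, content in files.items():
        # Hypothetical metadata keys, chosen for this sketch only.
        meta[name] = {"offset": len(data), "length": len(content)}
        data += content
    return bytes(data), json.dumps(meta)

def read_virtual_file(cf_handle, meta_json, name):
    """Read one virtual file's bytes through a shared file handle."""
    entry = json.loads(meta_json)[name]
    cf_handle.seek(entry["offset"])
    return cf_handle.read(entry["length"])

cf_dat, cfmeta = pack_compound({
    "documents.dat": b"doc bytes",
    "highlight.ix": b"\x00" * 8,
})
handle = io.BytesIO(cf_dat)  # stands in for one open descriptor on cf.dat
print(read_virtual_file(handle, cfmeta, "documents.dat"))
```

       Every virtual file is reached by seeking within the same handle, which
       is how one descriptor can serve several readers.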

A Typical Search
       Here's a simplified narrative, dramatizing how a search for "freedom"
       against a given segment plays out:

       1.  The searcher asks the relevant Lexicon Index, "Do you know anything
	   about 'freedom'?"  Lexicon Index replies, "Can't say for sure, but
	   if the main Lexicon file does, 'freedom' is probably somewhere
	   around byte 21008".

       2.  The main Lexicon tells the searcher "One moment, let me scan our
	   records...  Yes, we have 2 documents which contain 'freedom'.
	   You'll find them in seg_6/postings-4.dat starting at byte 66991."

       3.  The Postings file says "Yep, we have 'freedom', all right!
	   Document id 40 has 1 'freedom', and document 44 has 8.  If you need
	   to know more, like if any 'freedom' is part of the phrase 'freedom
	   of speech', ask me about positions!"

       4.  If the searcher is only looking for 'freedom' in isolation, that's
	   where it stops.  It now knows enough to assign the documents scores
	   against "freedom", with the 8-freedom document likely ranking
	   higher than the single-freedom document.

COPYRIGHT AND LICENSE
       Copyright 2005-2010 Marvin Humphrey

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.14.1			  2011-06-20   KinoSearch::Docs::FileFormat(3)