INDEXER.CONF(5)		 mnoGoSearch reference manual	       INDEXER.CONF(5)

NAME
       indexer.conf - configuration file for indexer

DESCRIPTION
       This is the configuration file for indexer(1). The configuration file
       consists of commands and their arguments. All commands are
       case-insensitive. You can use # to comment out lines.

VARIABLES
       Global parameters

	      These  commands  should be used only once and take global effect
	      for the whole configuration file.

       DBType type
	      Database type. Currently supported values are mysql, pgsql,
	      msql, solid, mssql, oracle, ibase, and sqlite. The value does
	      not matter when native library support is used, but ODBC users
	      must specify one of the supported values. If your database type
	      is not supported, use unknown instead.

       DBHost host
	      SQL host name (Not required for ODBC)

	      Default: localhost

       DBName mnogosearch
	      SQL database name or ODBC DSN

	      Default: mnogosearch

       DBUser foo
	      Database username to connect to database

	      Default: no user

       DBPass bar
	      Database password to connect to database

	      Default: no password

       DBMode single/multi/crc/crc-multi
	      SQL database word storage mode. Does not apply to the built-in
	      database. When single is specified, all words are stored in the
	      same table. multi means that words are stored in different
	      tables depending on word length. multi mode is usually faster,
	      but it requires more tables in the database. In crc mode,
	      mnoGoSearch stores 32-bit integer word IDs calculated by the
	      CRC32 algorithm instead of the words themselves. crc mode
	      requires less disk space and is faster than single and multi
	      modes. crc-multi mode shares its storage structure with crc
	      mode, but stores words in different tables depending on word
	      length, like multi mode.

	      Default: single
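
	      As an illustration, the database commands above might be
	      combined as follows. This is a sketch rather than a recommended
	      setup: the host, database name, user and password are the
	      placeholder values used elsewhere in this page, and the choice
	      of mysql with crc-multi is arbitrary.

```
	      # Placeholder connection settings; adjust to your own database
	      DBType  mysql
	      DBHost  localhost
	      DBName  mnogosearch
	      DBUser  foo
	      DBPass  bar
	      # Store CRC32 word IDs, split into tables by word length
	      DBMode  crc-multi
```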

       LocalCharset charset
	      Defines charset for local file system. It is required if you are
	      using  8	bit characters and is not applicable for 7 bit charac‐
	      ters.  This command is to be used once and takes	global	effect
	      for the whole configuration file.

	      Example:
	      LocalCharset windows-1250

       CrossWords yes|no
	      Enables or disables building of the CrossWords index. Crosswords
	      are the words used in a link to the present page. The default
	      value is no.

       StopWordFile filename
	      This command indicates which file contains the stopword list to
	      load. You may specify either an absolute file name or a file
	      name relative to the mnoGoSearch /etc directory. You may use
	      several StopWordFile commands.

       MinWordLength characters
       MaxWordLength characters
	      With these commands you can change the default length range of
	      words stored in the database. By default mnoGoSearch stores
	      words that are longer than 1 and shorter than 32 characters.

	      Example:
	      MaxWordLength 35

       MaxDocSize bytes
	      Specifies the maximum size, in bytes, of a document that can be
	      indexed. The default value is 1048576 (1 MB). This command takes
	      global effect for the whole config file.

       HTTPHeader header
	      You may add custom HTTP headers to the indexer's HTTP requests.
	      Do not use the "If-modified-since" and "Accept-Charset" headers,
	      since they are composed by indexer itself. "User-Agent:
	      mnoGoSearch/version" is sent too, although you may override it.
	      The command has global effect for the whole configuration file.

       ServerTable table_name
	      This command works only with an SQL database and is not
	      applicable to the built-in database mode. It loads servers with
	      all their parameters from the table table_name. For an example
	      of such a table structure, please refer to the file
	      create/mysql/server.txt. You may use several arguments with this
	      command:

	      ServerTable my_servers1 my_servers2 my_servers3

	      or just a single argument:

	      ServerTable server

       DeleteNoServer yes|no
	      Use this command to specify whether to delete URLs that have no
	      corresponding Server commands. Default value is yes.

       VarDir /path/to/my/var/dir
	      Specify a custom path to the directory where indexer stores data
	      when used with the built-in database and in cache mode. By
	      default, the /var directory of the mnoGoSearch installation is
	      used.

URL Control Configuration
       Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
	      Use this command to allow URLs that match (or, with NoMatch, do
	      not match) the given arguments. The first three optional
	      parameters describe the type of comparison. Default values are
	      Match, NoCase, String. Use NoCase or Case to choose
	      case-insensitive or case-sensitive comparison. Use Regex to
	      choose regular expression comparison. Use String to choose
	      string comparison with wildcards. Wildcards are * for any number
	      of characters and ? for one character. Because * and ? have
	      special meaning in the String match type, use Regex to describe
	      documents with ? and * signs in the URL. String match is much
	      faster than Regex, so use String where possible. You may use
	      several arguments for one Allow command and use this command any
	      number of times. It takes global effect for the config file.
	      Note that mnoGoSearch automatically adds one Allow regex .*
	      command after reading the config file. That command means that
	      everything that is not disallowed is allowed.
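
	      The following sketch illustrates the three optional parameters
	      described above. The patterns are hypothetical examples, and it
	      is assumed here that a String pattern is compared against the
	      whole URL.

```
	      # Default comparison (Match, NoCase, String): wildcard patterns
	      Allow *.html *.txt
	      # Case-sensitive regular expression comparison
	      Allow Case Regex \.html$
	      # Allow URLs that do NOT match the regular expression
	      Allow NoMatch Regex \.gif$
```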

       Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] [<arg> ...]
	      Use this to disallow indexing documents  with  URLs  that	 match
	      given argument.  The meaning of the first three optional parame‐
	      ters is exactly the same as with the Allow command. You can  use
	      several  arguments for one Disallow command. Takes global effect
	      for config file.

       Example:
	      #Exclude cgi-bin and non-parsed-headers
	      Disallow /cgi-bin/ \.cgi /nph

	      #Exclude some known extensions
	      Disallow \.b$  \.sh$     \.md5$
	      Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
	      Disallow \.lha$ \.lzh$ \.tar\.Z$	\.rar$	\.zoo$
	      Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
	      Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
	      Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
	      Disallow \.vrml$ \.wrl$
	      Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
	      Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
	      Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
	      Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
	      Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
	      Disallow \.rpm$

	      #Exclude Apache directory list in different sort order
	      Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$

	      #Exclude ./. and ./.. from Apache and Squid directory list
	      Disallow /[.]{1,2} /\%2e /\%2f

       CheckOnly regexp [regexp [...] ]
	      Indexer will use the HEAD instead of the GET HTTP method for
	      URLs that match regexp. This means that the file will only be
	      checked and will not be downloaded. Useful for zip, exe, arj,
	      etc. files. You can use several arguments for one CheckOnly
	      command. You can use this command any number of times, but not
	      more than MAXFILTER in indexer.h. Takes global effect for the
	      config file.

       Examples:
	      #Use HEAD method for some known non-text extensions:
	      CheckOnly \.b$ \.sh$     \.md5$
	      CheckOnly \.arj$	\.tar$	\.zip$	\.tgz$	\.gz$
	      CheckOnly \.lha$ \.lzh$ \.tar\.Z$	 \.rar$	 \.zoo$
	      CheckOnly \.gif$	\.jpg$	\.jpeg$ \.bmp$	\.tiff$
	      CheckOnly \.vdo$	\.mpeg$ \.mpe$	\.mpg$	\.avi$	\.movie$
	      CheckOnly \.mid$	\.mp3$	\.rm$	\.ram$	\.wav$	\.aiff$
	      CheckOnly \.vrml$ \.wrl$
	      CheckOnly \.exe$	\.cab$	\.dll$	\.bin$	\.class$
	      CheckOnly \.tex$	\.texi$ \.xls$	\.doc$	\.texinfo$
	      CheckOnly \.rtf$	\.pdf$	\.cdf$	\.ps$
	      CheckOnly \.ai$	\.eps$	\.ppt$	\.hqx$
	      CheckOnly \.cpt$	\.bms$	\.oda$	\.tcl$
	      CheckOnly \.rpm$

       HrefOnly regexp [regexp [...] ]
	      Indexer scans HTML documents that match regexp as it would scan
	      any other URLs, except that it will not index the contents. It
	      will add any URLs it finds in the HTML document to the database.
	      Useful when indexing mail list archives with big index pages
	      which contain mostly URLs. You can use several arguments for one
	      HrefOnly command. You can use this command any number of times,
	      but not more than MAXFILTER in indexer.h. Takes global effect
	      for the config file.

       Examples:
	      #Scan these files for href tags only, but do not index their
	      #contents.
	      HrefOnly mail.*\.html$ thr.*\.html$

MIME types and external parsers
       UseRemoteContentType yes|no
	      This command specifies whether the indexer should get the
	      content type from HTTP server headers (yes) or from its AddType
	      settings (no). If set to no, and the indexer could not determine
	      the content type with its AddType settings, it uses the HTTP
	      server header instead.

       SyslogFacility facility
	      Useful only if indexer is compiled with syslog support and if
	      you do not like the default. The argument is the same as used in
	      the syslog.conf file (for example: local7, daemon). For a list
	      of possible facilities see syslog.conf(5). Takes global effect
	      and should be used only once. Default: depends on compilation.

       LogdAddr host[:port]
	      Use cachelogd at given host and port if specified. Required  for
	      cache mode only. Default values are localhost and port 7000

       FollowOutside yes|no
	      Allow/disallow  indexer  to walk outside current server.	Should
	      be used carefully (see MaxHops command).

	      Default: no

       Period seconds
	      Reindex period in seconds, 604800 = 1 week.  May be used	before
	      every  Server  command  and  takes effect till the end of config
	      file or till next Period command.

       Tag number
	      Use this parameter for your own purposes. For example for group‐
	      ing  some	 servers  into	one  group, etc.  May be used multiple
	      times before every Server command and takes effect till the  end
	      of config file or till next Tag command.

       MaxHops number
	      Maximum distance in "mouse clicks" from the start URL given in
	      the Server command. May be used multiple times before every
	      Server command and takes effect till the end of config file or
	      till next MaxHops command.

	      Default: 256
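
	      Commands such as Period, Tag and MaxHops apply to the Server
	      commands that follow them, until they are set again. A minimal
	      sketch, assuming two hypothetical hosts and arbitrary values:

```
	      # Reindex weekly (604800 s), group 1, at most 3 hops from start URL
	      Period  604800
	      Tag     1
	      MaxHops 3
	      Server  http://www.example.com/

	      # From here on: reindex daily (86400 s), group 2; MaxHops 3 still applies
	      Period  86400
	      Tag     2
	      Server  http://docs.example.com/
```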

       MaxNetErrors number
	      Maximum network errors for each server.  If there are  too  many
	      network  errors on some server (server is down, host unreachable
	      etc.)  indexer will try not to do more than number  attempts  to
	      connect  to  this	 server.   May	be  used multiple times before
	      Server command and takes effect till the end of config  file  or
	      till next MaxNetErrors command.

	      Default: 16

       TitleWeight number
	      Weight of the words in the <title>...</title>. Can be set
	      multiple times before Server command and takes effect till the
	      end of config file or till next TitleWeight command.

	      Default: 2

       BodyWeight number
	      Weight  of  the  words in the <body>...</body> of the html docu‐
	      ments and in the contents of the text/plain documents.   Can  be
	      set  multiple  times before Server command and takes effect till
	      the end of config file or till next BodyWeight command.

	      Default: 1

       DescWeight number
	      Weight of the words in the <META NAME="Description"
	      Content="...">. Can be set multiple times before Server command
	      and takes effect till the end of config file or till next
	      DescWeight command.

	      Default: 2

       KeywordWeight number
	      Weight of the words in the <META NAME="Keywords" Content="...">.
	      Can be set multiple times before Server command and takes effect
	      till the end of config file or till next KeywordWeight command.

	      Default: 2

       UrlWeight number
	      Weight  of  the  words  in the URL of the documents.  Can be set
	      multiple times before Server command and takes effect  till  the
	      end of config file or till next UrlWeight command.

	      Default: 0
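
	      As described above, each weight command applies to the Server
	      commands that follow it. In this sketch the numbers and host
	      names are arbitrary illustrations, not recommended values:

```
	      # Give title words more weight than body words on the first server
	      TitleWeight 4
	      BodyWeight  1
	      Server      http://www.example.com/

	      # Also count words in the URL on the second server;
	      # TitleWeight 4 and BodyWeight 1 remain in effect
	      UrlWeight   1
	      Server      http://docs.example.com/
```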

       DeleteBad yes|no
	      Controls whether indexer deletes bad (not found, forbidden,
	      etc.) URLs from the database. If you set it to no, "bad" URLs
	      will remain in the database, which is useful if you want to
	      check the integrity of your server(s). Can be set multiple times
	      before Server command and takes effect till the end of config
	      file or till next DeleteBad command.

	      Default: yes

       Robots yes|no
	      Allows/disallows using robots.txt and <META NAME="robots">
	      exclusions. Useful if you want to check the integrity of your
	      server(s). Can be set multiple times before Server command and
	      takes effect till the end of config file or till next Robots
	      command.

	      Default: yes.

       Section <string> <number>
	      Here <string> is a section name and <number> is a section ID
	      between 0 and 255. Use 0 if you don't want to index some of
	      these sections. It is better to use different section IDs for
	      different document parts. In this case, at search time you'll be
	      able to give a different weight to each part or even disallow
	      some sections.

       Index yes|no
	      Index no prevents indexer from storing words in the database.
	      Useful if you want to check the integrity of your server(s).
	      Can be set multiple times before Server command and takes effect
	      till the end of config file or till next Index command.

	      Note: Instead of Index no you can use the alternate form NoIndex

	      Default: yes

       Follow yes|no
	      Allow/disallow indexer to store <a  href="...">  into  database.
	      Can be set multiple times before Server command and takes effect
	      till the end of config file or till next Follow command.

	      Note: Instead of Follow no you can use the alternate form NoFol‐
	      low

	      Default: yes

       MaxDocSize size

	      Limits the maximum document size. size is in bytes. If a
	      document is larger than size, indexer will parse only the first
	      size bytes of the document.

	      Default: 1048576 (which is 1 megabyte)

       Mime   <from_mime> <to_mime>[;charset] ["command line [$1]"]

	      This is used to add support for parsing documents with mime
	      types other than text/plain and text/html. It can be done via an
	      external parser (which should provide output in plain or HTML
	      text) or just by substituting the mime type so indexer can
	      understand it directly.

	      <from_mime> and <to_mime> are standard mime types. <to_mime>
	      should be either text/plain or text/html, because these are the
	      only types that indexer understands.

	      The external parser is assumed to write its results to stdout
	      (if it does not, you have to write a little script that cats the
	      results to stdout).

	      The optional charset parameter is used to change the charset if
	      needed.

	      The command line parameter is optional. If there is no command
	      line, the command only changes the mime type. The command line
	      may also contain a $1 parameter, which stands for a temporary
	      file name. Some parsers cannot operate on stdin, so indexer
	      creates a temporary file for the parser and passes its name in
	      place of $1.
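
	      The uses of Mime described above might look like this. The
	      sketch assumes that the external programs catdoc and deroff are
	      installed and on the search path (neither ships with indexer),
	      and application/x-example is a made-up type used only to show
	      substitution.

```
	      # Parser invoked on a temporary file: its name replaces $1
	      Mime application/msword       text/plain "catdoc $1"
	      # Parser that reads the document on stdin and writes to stdout
	      Mime application/x-troff-man  text/plain "deroff"
	      # No command line: only substitute the mime type
	      Mime application/x-example    text/plain
```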

       CharSet charset
	      Useful  for 8 bit character sets.	 WWW-servers send data in dif‐
	      ferent character sets.  charset  is  default  character  set  of
	      server  in  next	Server	command(s).   May be used before every
	      Server command and takes effect till the end of config  file  or
	      till next CharSet command.

	      Currently indexer supports the Cyrillic koi8-r, cp1251, cp866,
	      iso8859-5 and x-mac-cyrillic, Arabic cp1256, Western iso-8859-1,
	      and Central Europe iso-8859-2 and cp1250 character sets.

	      This parameter is the default character set for "bad" servers
	      that do not send information about the charset in the header
	      (just "Content-type: text/html" instead of, for example,
	      "Content-type: text/html; charset=koi8-r") and do not send
	      charset information in META tags.

       Examples:

	      CharSet koi8-r
	      CharSet windows-1250
	      CharSet ISO-8859-1

       ForceIISCharset1251 yes/no
	      This option is useful for users dealing with Cyrillic content
	      and broken (or misconfigured?) Microsoft IIS web servers, which
	      tend to report the charset incorrectly. This is a really dirty
	      hack, but if this option is turned on it is assumed that all
	      servers that are reported as 'Microsoft' or 'IIS' have content
	      in the Windows-1251 codepage. This command should be used only
	      once in the configuration file and takes global effect.

	      Default: no

       AuthBasic login:passwd
	      Use  basic  http	authorization.	Can be set before every Server
	      command and takes effect only for next Server command.

       Examples:

	      AuthBasic somebody:something

	      If you have password protected directory(ies), but whole	server
	      is open, use:

	      AuthBasic login1:passwd1
	      Server http://my.server.com/my/secure/directory1/
	      AuthBasic login2:passwd2
	      Server http://my.server.com/my/secure/directory2/
	      Server http://my.server.com/

       ProxyAuthBasic login:passwd
	      Use HTTP proxy basic authorization. Can be used before every
	      Server command and takes effect only for the next Server
	      command. It should also come before the Proxy command.

       Example:
	      ProxyAuthBasic somebody:smth

       Proxy your.proxy.host[:port]
	      Connect via a proxy rather than directly. You can index ftp
	      servers (only) when using a proxy. If port is not specified, it
	      is set to the default value of 3128 (Squid). If the proxy host
	      is not specified, a direct connection will be performed. Can be
	      set before every Server command and takes effect till the end of
	      config file or till next Proxy command.

       Examples:
	      Proxy atoll.anywhere.com
	       - proxy on atoll.anywhere.com, port 3128

	      Proxy lota.anywhere.com:8090
	       - proxy on lota.anywhere.com, port 8090

	      Proxy
	       - turn off proxy usage (direct connection)
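
	      Putting ProxyAuthBasic, Proxy and Server together, the ordering
	      described above can be sketched as follows. The proxy host and
	      credentials are the placeholders used in this page, and the
	      server URL is hypothetical:

```
	      # Proxy credentials first; they apply only to the next Server command
	      ProxyAuthBasic somebody:smth
	      # Proxy stays in effect until the next Proxy command
	      Proxy  lota.anywhere.com:8090
	      Server http://www.example.com/
```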

       Server URL
	      This is the main configuration command. Use it to add the start
	      URL of a server to be indexed. You may use many Server commands
	      in the same indexer.conf file.

       Examples:

	      Server http://localhost/
	      Server http://www.yoursite.com/
	      Server http://www.yoursite.com/~yourname/
	      Server ftp://ftp.yourdomain.com/pub/

EXAMPLE
       This is a minimal sample indexer config file

	      DBHost	     localhost
	      DBName	     udmsearch
	      DBUser	     foo
	      DBPass	     bar
	      Server	     http://localhost/
	      Disallow /cgi-bin/ \.cgi /nph
	      Disallow \.b$  \.sh$     \.md5$
	      Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
	      Disallow \.lha$ \.lzh$ \.tar\.Z$	\.rar$	\.zoo$
	      Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
	      Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
	      Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
	      Disallow \.vrml$ \.wrl$
	      Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
	      Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
	      Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
	      Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
	      Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
	      Disallow \.rpm$
	      Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
	      Disallow /[.]{1,2} /\%2e /\%2f

SEE ALSO
       indexer(1), syslog.conf(5)

mnoGoSearch 3.1			 23 March 2001		       INDEXER.CONF(5)