mailcross man page on DragonFly


MAILCROSS(1)							  MAILCROSS(1)

NAME
       mailcross - a cross-validation simulator for use with dbacl.

SYNOPSIS
       mailcross command [ command_arguments ]

DESCRIPTION
       mailcross  automates  the  task of cross-validating email filtering and
       classification programs such as dbacl(1).  Given a set  of  categorized
       documents,  mailcross initiates simulation runs to estimate the classi‐
       fication errors and thereby permits fine tuning of  the	parameters  of
       the classifier.

       Cross-validation	 is a method which is widely used to compare the qual‐
       ity of classification and learning  algorithms,	and  as	 such  permits
       rudimentary  comparisons	 between  those	 classifiers which make use of
       dbacl(1) and bayesol(1), and other competing classifiers.

       The mechanics of cross-validation are as follows: A set of  pre-classi‐
       fied email messages is first split into a number of roughly equal-sized
       subsets.	 For each subset, the filter (by default, dbacl(1)) is used to
       classify each message within this subset, based upon having learned the
       categories from the remaining  subsets.	The  resulting	classification
       errors are then averaged over all subsets.
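
       The procedure above can be sketched in a few lines of shell.  This is
       an illustration only, not the code used by mailcross: kfold_plan is a
       hypothetical function, and it assigns messages to subsets round-robin
       so that the output is reproducible, whereas mailcross add distributes
       messages randomly.

```sh
#!/bin/sh
# Hypothetical illustration of the subset bookkeeping in cross-validation.
# Messages are numbered 1..N and assigned to K subsets round-robin
# (an assumption made for determinism; mailcross itself assigns randomly).
kfold_plan() {
    n=$1
    k=$2
    f=0
    while [ "$f" -lt "$k" ]; do
        test_set=""
        learn_set=""
        i=1
        while [ "$i" -le "$n" ]; do
            if [ $(( (i - 1) % k )) -eq "$f" ]; then
                # Message i belongs to subset f: it is classified, not learned.
                test_set="$test_set $i"
            else
                # Message i belongs to another subset: it is used for learning.
                learn_set="$learn_set $i"
            fi
            i=$((i + 1))
        done
        echo "fold $f: test=[${test_set# }] learn=[${learn_set# }]"
        f=$((f + 1))
    done
}
```

       For example, kfold_plan 6 3 prints one line per subset, the first
       being "fold 0: test=[1 4] learn=[2 3 5 6]".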

       The results obtained by cross-validation essentially do not depend upon
       the ordering of the sample emails. Other methods (see mailtoe(1) and
       mailfoot(1)) attempt to capture the behaviour of classification errors
       over time.

       mailcross uses the environment variables	 MAILCROSS_LEARNER  and	 MAIL‐
       CROSS_FILTER  when  executing,  which  permits  the cross-validation of
       arbitrary filters, provided these satisfy the compatibility  conditions
       stated in the ENVIRONMENT section below.

       For convenience, mailcross implements a testsuite framework with prede‐
       fined wrappers for several open source classifiers.  This  permits  the
       direct  comparison  of  dbacl(1) with competing classifiers on the same
       set of email samples. See the USAGE section below.

       During preparation, mailcross builds a subdirectory  named  mailcross.d
       in  the	current	 working  directory.  All needed calculations are per‐
       formed inside this subdirectory.

EXIT STATUS
       mailcross returns 0 on success, 1 if a problem occurred.
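
       Because every command reports failure through a nonzero status, the
       stages can be chained so that a failed step stops the sequence.  The
       helper run_step below is hypothetical and is only one possible way to
       do this.

```sh
#!/bin/sh
# Hypothetical helper: run a command, report on stderr if it fails,
# and propagate the failure to the caller.
run_step() {
    if "$@"; then
        return 0
    fi
    echo "step failed: $*" >&2
    return 1
}

# Usage (requires mailcross and the mbox file to exist):
#   run_step mailcross prepare 10 &&
#   run_step mailcross add spam spam.mbox &&
#   run_step mailcross learn
```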

COMMANDS
       prepare size
	      Prepares a subdirectory named mailcross.d in the current working
	      directory,  and  populates  it  with  empty  subdirectories  for
	      exactly size subsets.

       add category [FILE]...
	      Takes a set of emails from either FILE if specified,  or	STDIN,
	      and  associates  them with category.  All emails are distributed
	      randomly into the subdirectories of mailcross.d for  later  use.
	      For  each	 category, this command can be repeated several times,
	      but should be executed at least once.

       clean  Deletes the directory mailcross.d and all its contents.

       learn  For every previously built subset of email messages,  pre-learns
	      all  the	categories  based  on  the contents of all the subsets
	      except this one.	The  command_arguments	are  passed  to	 MAIL‐
	      CROSS_LEARNER.

       run    For  every  previously  built subset of email messages, performs
	      the classification based upon the pre-learned categories associ‐
	      ated with all but this subset.  The command_arguments are passed
	      to MAILCROSS_FILTER.

       summarize
	      Prints statistics for the latest cross-validation run.

       review truecat predcat
	      Scans the last run statistics  and  extracts  all	 the  messages
	      which  belong  to category truecat but have been classified into
	      category predcat.	 The extracted	messages  are  copied  to  the
	      directory mailcross.d/review for perusal.

       testsuite list
	      Shows  a	list of available filters/wrapper scripts which can be
	      selected.

       testsuite select [FILTER]...
	      Prepares the filter(s) named FILTER to be used  for  simulation.
	      The  filter  name is the name of a wrapper script located in the
	      directory /usr/local/share/dbacl/testsuite.  Each filter	has  a
	      rigid  interface	documented  below, and the act of selecting it
	      copies it to the	mailcross.d/filters  directory.	 Only  filters
	      located there are used in the simulations.

       testsuite deselect [FILTER]...
	      Removes  the named filter(s) from the directory mailcross.d/fil‐
	      ters so that they are not used in the simulation.

       testsuite run
	      Invokes every selected filter on the datasets added  previously,
	      and calculates misclassification rates.

       testsuite status
	      Describes the scheduled simulations.

       testsuite summarize
	      Shows the cross-validation results for all filters.  This is only
	      meaningful after testsuite run has completed.

USAGE
       The normal usage pattern is the following: first, you  should  separate
       your  email collection into several categories (manually or otherwise).
       Each category should be associated with one or more folders,  but  each
       folder  should  not  contain  more  than one category. Next, you should
       decide how many subsets to use, say 10.  Note that using too many
       subsets slows down the calculations considerably. Now you can type

       % mailcross prepare 10

       Next,  for  every  category,  you must add every folder associated with
       this category. Suppose you have three categories named spam, work,  and
       play,  which  are  associated with the mbox files spam.mbox, work.mbox,
       and play.mbox respectively. You would type

       % mailcross add spam spam.mbox
       % mailcross add work work.mbox
       % mailcross add play play.mbox

       You can now perform as many simulations as desired. Every cross valida‐
       tion  consists  of a learning, a running and a summarizing stage. These
       operations are performed on  the	 classifier  specified	in  the	 MAIL‐
       CROSS_FILTER  and  MAILCROSS_LEARNER  variables. By setting these vari‐
       ables appropriately, you can compare classification performance as  you
       vary the command line options of your classifier(s).

       % mailcross learn
       % mailcross run
       % mailcross summarize

       The  testsuite  commands	 are  designed to simplify the above steps and
       allow comparison of a wide range of email  classifiers,	including  but
       not  limited  to	 dbacl.	  Classifiers  are  supported  through wrapper
       scripts, which  are  located  in	 the  /usr/local/share/dbacl/testsuite
       directory.

       The  first stage when using the testsuite is deciding which classifiers
       to compare.  You can view a list of available wrappers by typing:

       % mailcross testsuite list

       Note that the wrapper scripts are NOT  the  actual  email  classifiers,
       which must be installed separately by your system administrator or oth‐
       erwise.	Once this is done, you can select one or more wrappers for the
       simulation by typing, for example:

       % mailcross testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are not selected. Note also that some wrappers can have hard-coded cat‐
       egory  names,  e.g.  if the classifier only supports binary classifica‐
       tion. Heed the warning messages.

       It remains only to run the simulation. Beware, this  can	 take  a  long
       time (several hours depending on the classifier).

       % mailcross testsuite run
       % mailcross testsuite summarize

       Once  you  are  all  done  with simulations, you can delete the working
       files, log files etc. by typing

       % mailcross clean

       The progress of the cross validation is written silently in various log
       files  which  are located in the mailcross.d/log directory. Check these
       in case of problems.

SCRIPT INTERFACE
       mailcross testsuite takes care of learning and  classifying  your  pre‐
       pared  email  corpora  for  each selected classifier. Since classifiers
       have widely varying interfaces, this is only possible by wrapping those
       interfaces individually into a standard form which can be used by mail‐
       cross testsuite.

       Each wrapper script is a command line tool which accepts a single  com‐
       mand followed by zero or more optional arguments, in the standard form:

       wrapper command [argument]...

       Each  wrapper  script  also  makes  use	of  STDIN and STDOUT in a well
       defined way. If no behaviour is described,  then	 no  output  or	 input
       should be used.	The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
	      category filenames is expected in $2, $3, etc. The script writes
	      the category name corresponding to the input email on STDOUT. No
	      trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
	      a	 suitable  category  file name is expected in $2. No output is
	      written to STDOUT.

       clean  In this case, a directory is expected in $2, which  is  examined
	      for  old	database  information. If any old databases are found,
	      they are purged or reset. No output is written to STDOUT.

       describe
	      In this case, a single  line  of	text  is  written  to  STDOUT,
	      describing  the  filter's functionality. The line should be kept
	      short to prevent line wrapping on a terminal.

       bootstrap
	      In this case, a directory is expected in $2. The wrapper	script
	      first checks for the existence of its associated classifier, and
	      other prerequisites. If the check is successful, then the	 wrap‐
	      per is cloned into the supplied directory.  A courtesy notifica‐
	      tion should be given on STDOUT to express	 success  or  failure.
	      It is also permissible to give longer descriptions and caveats.

       toe    Used by mailtoe(1).

       foot   Used by mailfoot(1).
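
       A minimal sketch of a wrapper following the interface above is given
       here.  It is not one of the shipped testsuite wrappers.  The function
       name toy_wrapper, the word-overlap "classifier", and the .toydb side
       files are all hypothetical, and it is assumed that the category name
       to print is the base name of the category file.  The toe and foot
       commands are not covered.

```sh
#!/bin/sh
# Hypothetical wrapper sketch for an imaginary word-overlap classifier.
# Each category keeps its distinct words in <category file>.toydb
# (an assumed naming convention, not part of mailcross).

# Split STDIN into distinct lowercase words, one per line.
words() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sort -u | sed '/^$/d'
}

toy_wrapper() {
    cmd=$1
    case "$cmd" in
    filter)
        # Email on STDIN, category file names in $2, $3, ...
        shift
        msg=$(words)
        best=""
        best_score=-1
        for catfile in "$@"; do
            # Count the message words already seen in this category.
            score=$(printf '%s\n' "$msg" |
                grep -c -x -F -f "$catfile.toydb" 2>/dev/null || true)
            score=${score:-0}
            if [ "$score" -gt "$best_score" ]; then
                best=$catfile
                best_score=$score
            fi
        done
        # Category name on STDOUT, without a trailing newline.
        printf '%s' "$(basename "$best")"
        ;;
    learn)
        # mbox stream on STDIN, category file name in $2; no output.
        words > "$2.toydb"
        ;;
    clean)
        # Directory in $2: purge old toy databases; no output.
        rm -f "$2"/*.toydb
        ;;
    describe)
        echo "toy word-overlap classifier (illustration only)"
        ;;
    bootstrap)
        # Directory in $2: check prerequisites, then clone this script.
        if command -v grep >/dev/null 2>&1 && cp "$0" "$2"/; then
            echo "toy wrapper: selected"
        else
            echo "toy wrapper: prerequisites missing, not selected"
            return 1
        fi
        ;;
    *)
        # toe and foot are not handled in this sketch.
        echo "unsupported command: $cmd" >&2
        return 1
        ;;
    esac
}
```

       In a standalone wrapper file, the last line would be toy_wrapper "$@"
       so that the script can be called as: wrapper command [argument]...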

ENVIRONMENT
       Right  after  loading,  mailcross reads the hidden file .mailcrossrc in
       the $HOME directory, if it exists, so this would be  a  good  place  to
       define custom values for environment variables.

       MAILCROSS_FILTER
	      This variable contains a shell command to be executed repeatedly
	      during the running stage.	 The command should  accept  an	 email
	      message on STDIN and output a resulting category name. It should
	      also accept a list of category file names on the	command	 line.
	      If  undefined,  mailcross	 uses the default value MAILCROSS_FIL‐
	      TER="dbacl -T email -T xml -v" (and also magically adds  the  -c
	      option before each category).

       MAILCROSS_LEARNER
	      This variable contains a shell command to be executed repeatedly
	      during the learning stage. The command should accept a mbox type
	      stream of emails on STDIN for learning, and the file name of the
	      category on the command line.  If undefined, mailcross uses  the
	      default  value  MAILCROSS_LEARNER="dbacl	-H  19 -T email -T xml
	      -l".

       TEMPDIR
	      This directory is exported for the benefit of  wrapper  scripts.
	      Scripts which need to create temporary files should place them at
	      the location given in TEMPDIR.
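
       A .mailcrossrc file might look like the sketch below.  This assumes
       that the file is read as Bourne shell code, which this page does not
       state explicitly.  The values shown simply make the documented
       defaults explicit, as a starting point for local changes.

```sh
# Hypothetical $HOME/.mailcrossrc, assumed to be read as shell code.
# These are the documented default values, written out explicitly.
MAILCROSS_LEARNER="dbacl -H 19 -T email -T xml -l"
MAILCROSS_FILTER="dbacl -T email -T xml -v"
export MAILCROSS_LEARNER MAILCROSS_FILTER
```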

NOTES
       The subdirectory mailcross.d can grow quite large. It contains  a  full
       copy  of the training corpora, as well as learning files for size times
       all the added categories, and various log files.

WARNING
       Cross-validation is a widely used, but  ad-hoc  statistical  procedure,
       completely  unrelated  to  Bayesian theory, and subject to controversy.
       Use this at your own risk.

SOURCE
       The source code for the latest version of this program is available  at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A. Breyer <laird@lbreyer.com>

SEE ALSO
       bayesol(1), dbacl(1), mailinspect(1), mailtoe(1), mailfoot(1), regex(7)

Version 1.14.1	      Bayesian Text Classification Tools	  MAILCROSS(1)