lsm man page on MacOSX

lsm man page on MacOSX
Printed from http://www.polarhome.com/service/man/?qf=lsm&af=0&tf=2&of=MacOSX
LSM(1)			    Latent Semantic Mapping			LSM(1)

NAME
       lsm - Latent Semantic Mapping tool

SYNOPSIS
       lsm lsm_command [command_options] map_file [input_files]

DESCRIPTION
       The Latent Semantic Mapping framework is a language independent,
       Unicode based technology that builds maps and uses them to classify
       texts into one of a number of categories.

       lsm is a tool to create, manipulate, test, and dump Latent Semantic
       Mapping maps. It is designed to provide access to a large subset of the
       functionality of the Latent Semantic Mapping API, mainly for rapid
       prototyping and diagnostic purposes, but possibly also for simple shell
       script based applications of Latent Semantic Mapping.

COMMANDS
       lsm provides a variety of commands (lsm_command in the Synopsis), each
       of which often has a wealth of options (see the Command Options below).
       Command names may be abbreviated to unambiguous prefixes.

       lsm create map_file input_files
	   Create a new LSM map from the specified input_files.

       lsm update map_file input_files
	   Add the specified input_files to an existing LSM map.

       lsm evaluate map_file input_files
	   Classify the specified input_files into the categories of the LSM
	   map.

       lsm cluster [--k-means=N | --agglomerative=N] [--apply]
	   Compute clusters for the map, and, if the --apply option is
	   specified, transform the map accordingly. Multiple levels of
	   clustering may be applied for faster performance on large maps,
	   e.g.

	      lsm cluster --k-means=100 --each --agglomerative=100 --agglomerative=1000 my.map

	   first computes 100 clusters using (fast) k-means clustering,
	   computes 100 subclusters for each first stage cluster using
	   agglomerative clustering, and finally reduces those 10000 clusters
	   to 1000 using agglomerative clustering.

       lsm dump map_file [input_files]
	   Without input_files, dumps all words in the map with their counts.
	   With input_files, dump, for each file, the words that appear in the
	   map, their counts in the map, and their relative frequencies in the
	   input file.

       lsm info map_file
	   Bypass the Latent Semantic Mapping framework to extract and print
	   information about the file and perform a number of consistency
	   checks on it. (NOT IMPLEMENTED YET)

COMMAND OPTIONS
       This section describes the command_options that are available for the
       lsm commands. Not all commands support all of these options; each
       option is only supported for commands where it makes sense.  However,
       when a command has one of these options you can count on the same
       meaning for the option as in other commands.

       --append-categories
	   Directs the update command to put the data into new categories
	   appended after the existing ones, instead of adding the data to the
	   existing categories.

       --categories count
	   Directs the evaluate command to only list the top count categories.

       --category-delimiter delimiter
	   Specify the delimiter to be used to between categories in the
	   input_files passed to the create and update commands.

	   group   Categories are separated by a `;' argument.

	   file	   Each input_file represents a separate category. This is the
		   default if the --category-delimiter option is not given.

	   line	   Each line represents a separate category.

	   string  Categories are separated by the specified string.

       --clobber
	   When creating a map, overwrite an existing file at the path, even
	   if it's not an LSM map.  By default, create will only overwrite an
	   existing file if it's believed to be an LSM map, which guards
	   against frequent operator errors such as:

	      lsm create /usr/include/*.h

       --dimensions dim
	   Direct the create and update commands to use the given number of
	   dimensions for computing the map (Defaults to the number of
	   categories). This option is useful to manage the size and
	   computational overhead of maps with large number of categories.

       --discard-counts
	   Direct the create and update commands to omit the raw word / token
	   counts when writing the map. This results in a map that is more
	   compact, but cannot be updated any further.

       --hash
	   Direct the create and update commands to write the map in a format
	   that is not human readable with default file manipulation tools
	   like cat or hexdump. This is useful in applications such as junk
	   mail filtering, where input data may contain naughty words and
	   where the contents of the map may tip off spammers what words to
	   avoid.

       --help
	   List an overview of the options available for a command. Available
	   for all commands.

       --html
	   Strip HTML codes from the input_files. Useful for mail and web
	   input. Available for the create, update, evaluate, and dump
	   commands.

       --junk-mail
	   When parsing the input files, apply heuristics to counteract common
	   methods used by spammers to disguise incriminating words such as:

	      Zer0 1nt3rest l0ans     Substituting letters with digits
	      W E A L T H	      Adding spaces between letters
	      m.o.r.t.g.a.g.e	      Adding punctuation between letters

	   Available for the create, update, evaluate, and dump commands.

       --pairs
	   If specified with the create command when building the map, store
	   counts for pairs of words as well as the words themselves. This can
	   increase accuracy for certain classes of problems, but will
	   generate unreasonably large maps unless the vocabulary is fairly
	   limited.

       --stop-words stop_word_file
	   If specified with the create command, stop_word_file is parsed and
	   all words found are excluded from texts evaluated against the map.
	   This is useful for excluding frequent, semantically meaningless
	   words.

       --sweep-cutoff threshold
       --sweep-frequency days
	   Available for the create and update commands. Every specified
	   number of days (by default 7), scan the map and remove from it any
	   entries that have been in the map for at least 2 previous scans and
	   whose total counts are smaller than threshold.  threshold defaults
	   to 0, so by default the map is not scanned.

       --text-delimiter delimiter
	   Specify the delimiter to be used to between texts in the
	   input_files passed to the create, update, evaluate, and dump
	   commands.

	   file	   Each input_file represents a separate text. This is the
		   default if the --text-delimiter option is not given.

	   line	   Each line represents a separate text.

	   string  Texts are separated by the specified string.

       --triplets
	   If specified with the create command when building the map, store
	   counts for triplets and pairs of words as well as the words
	   themselves. This can increase accuracy for certain classes of
	   problems, but will generate unreasonably large maps unless the
	   vocabulary is fairly limited.

       --weight weight
	   Scale counts of input words for the create and update commands by
	   the specified weight, which may be a positive or negative floating
	   point number.

       --words
	   Directs the evaluate or cluster commands to apply to words, instead
	   of categories.

       --words=count
	   Directs the evaluate command to list the top count words, instead
	   of categories.

EXAMPLES
       "lsm evaluate --html --junk-mail ~/Library/Mail/V2/MailData/LSMMap2
       msg*.txt"
	   Simulate the Mail.app junk mail filter by evaluating the specified
	   files (assumed to each hold the raw text of one mail message)
	   against the user's junk mail map.

       "lsm dump ~/Library/Mail/V2/MailData/LSMMap2"
	   Dump the words accumulated in the junk mail map and their counts.

       "lsm create --category-delimiter=group c_vs_h *.c ';' *.h"
	   Create an LSM map trained to distinguish C header files from C
	   source files.

       "lsm update --weight 2.0 --cat=group c_vs_h ';' ../xy/*.h"
	   Add some additional header files with an increased weight to the
	   training.

       "lsm create --help"
	   List the options available for the lsm create command.

1.0				  2011-11-07				LSM(1)
[top]

List of man pages available for MacOSX

Copyright (c) for man pages and the logo by the respective OS vendor.

For those who want to learn more, the polarhome community provides shell access and support.

[legal] [privacy] [GNU] [policy] [cookies] [netiquette] [sponsors] [FAQ]
Polarhome, production since 1999.
Member of Polarhome portal.
Based on Fawad Halim's script.
....................................................................
Vote for polarhome