huge-count.pl man page on DragonFly

HUGE-COUNT(1)	      User Contributed Perl Documentation	 HUGE-COUNT(1)

NAME
       huge-count.pl - Count all the bigrams in a huge text without using huge
       amounts of memory.

SYNOPSIS
       huge-count.pl --tokenlist --split 100 destination-dir input

DESCRIPTION
       Runs count.pl efficiently on large amounts of data by splitting the
       data into separate files, counting each file separately, and then
       merging the counts to get overall results.

       Two output files are created. destination-dir/huge-count.output
       contains the bigram counts after applying --remove and --uremove.
       destination-dir/complete-huge-count.output provides the bigram counts
       as if no --remove or --uremove cutoff were provided.

USAGE
       huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

INPUT
   Required Arguments:
       [SOURCE]+

       Input to huge-count.pl should be one of the following:

       1. A single plain text file

       2. A single flat directory containing multiple plain text files

       3. A list of multiple plain text files
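
       The three accepted forms of SOURCE can be sketched in Python as
       follows. This is only an illustration of the rule described above,
       not huge-count.pl's own code; the helper name resolve_sources is
       hypothetical.

```python
from pathlib import Path


def resolve_sources(sources):
    """Expand SOURCE arguments into a list of plain files (hypothetical helper)."""
    paths = [Path(s) for s in sources]
    if len(paths) == 1 and paths[0].is_dir():
        # Flat directory: take its regular files and skip sub-directories.
        return sorted(p for p in paths[0].iterdir() if p.is_file())
    # A single file, or an explicit list of files, is used as given.
    return paths
```

       A single directory argument expands to the files directly inside
       it; any other argument list is passed through unchanged.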

       DESTINATION

       A complete path to a writable directory to which huge-count.pl can
       write all intermediate and final output files. If DESTINATION does not
       exist, a new directory is created; otherwise, the existing directory
       is used for writing the output files.

       NOTE: If DESTINATION already exists and the names of some of its
       files clash with the names of the output files created by huge-count,
       those files will be overwritten without prompting the user.

       --tokenlist

       This parameter is required. huge-count will call count.pl and print
       all the bigrams that count.pl finds.

       --split N

       This parameter is required. huge-count divides the bigram token list
       generated by count.pl into parts, sorts each part, and recombines the
       bigram counts from these intermediate result files into a single
       output that shows the bigram counts in SOURCE.

       Each part created with --split N will contain N lines. The value of N
       should be chosen so that huge-sort.pl can run efficiently on any part
       containing N lines of the complete bigram list.

       We suggest setting N to the number of KB of memory available. If the
       computer has 8 GB of RAM, which is roughly 8,000,000 KB, N should be
       set to 8000000. If N is too small, splitting can fail because the
       split output file suffixes are exhausted.

   Optional Arguments:
       --token TOKENFILE

       Specify a file containing Perl regular expressions that define the
       tokenization scheme for counting. This will be provided to count.pl's
       --token option.

       --nontoken NONTOKENFILE

       Specify a file containing Perl regular expressions of non-token
       sequences that are removed prior to tokenization. This will be provided
       to count.pl's --nontoken option.

       --stop STOPFILE

       Specify a file of Perl regular expressions that list the stop words to
       be omitted from the output bigrams. The stop list can be used in two
       modes:

       AND mode, declared with '@stop.mode = AND' on the first line of the
       STOPFILE

       or

       OR mode, declared with '@stop.mode = OR' on the first line of the
       STOPFILE.

       In AND mode, bigrams in which both words are stop words are removed.
       In OR mode, bigrams in which either or both words are stop words are
       removed from the output.
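
       The two stop modes can be sketched in Python as follows. This is a
       minimal illustration of the AND/OR rule described above, not
       count.pl's implementation; the function name is_stopped, the two
       patterns, and the sample bigrams are assumptions made for the
       example.

```python
import re


def is_stopped(bigram, stop_patterns, mode):
    """Return True if the bigram should be dropped under the given stop mode."""
    hits = [any(p.search(word) for p in stop_patterns) for word in bigram]
    if mode == "AND":
        return all(hits)  # both words are stop words
    if mode == "OR":
        return any(hits)  # at least one word is a stop word
    raise ValueError(f"unknown stop mode: {mode}")


# Hypothetical stop list and bigrams, for illustration only.
stop_patterns = [re.compile(r"^the$"), re.compile(r"^of$")]
bigrams = [("the", "cat"), ("of", "the"), ("black", "cat")]

kept_and = [b for b in bigrams if not is_stopped(b, stop_patterns, "AND")]
kept_or = [b for b in bigrams if not is_stopped(b, stop_patterns, "OR")]
```

       Here AND mode drops only ("of", "the"), while OR mode also drops
       ("the", "cat") because one of its words is a stop word.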

       --window W

       Tokens appearing within W positions of each other (with at most W-2
       intervening words) will form bigrams. Same as count.pl's --window
       option.
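
       The window rule can be sketched in Python as follows. This is an
       illustration of "at most W-2 intervening words", not count.pl's
       code; the function name window_bigrams is hypothetical, and the
       default of 2 (adjacent tokens only) is an assumption.

```python
def window_bigrams(tokens, window=2):
    """Pair each token with the tokens up to window-1 positions to its right."""
    pairs = []
    for i, first in enumerate(tokens):
        # The slice holds the next window-1 tokens, so at most
        # window-2 words sit between the two members of a pair.
        for second in tokens[i + 1 : i + window]:
            pairs.append((first, second))
    return pairs


tokens = ["a", "b", "c", "d"]
adjacent = window_bigrams(tokens)        # window of 2
wider = window_bigrams(tokens, window=3)
```

       With a window of 3 the pairs ("a", "c") and ("b", "d") are formed in
       addition to the adjacent pairs.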

       --remove L

       Bigrams with counts less than L in the entire SOURCE data are removed
       from the sample. The counts of the removed bigrams are not counted in
       any marginal totals. This has the same effect as count.pl's --remove
       option.

       --uremove L

       Bigrams with counts more than L in the entire SOURCE data are removed
       from the sample. The counts of the removed bigrams are not counted in
       any marginal totals. This has the same effect as count.pl's --uremove
       option.

       --frequency F

       Bigrams with counts less than F in the entire SOURCE are not displayed.
       The counts of the skipped bigrams ARE counted in the marginal totals.
       In other words, --frequency in huge-count.pl has the same effect as
       count.pl's --frequency option.

       --ufrequency F

       Bigrams with counts more than F in the entire SOURCE are not displayed.
       The counts of the skipped bigrams ARE counted in the marginal totals.
       In other words, --ufrequency in huge-count.pl has the same effect as
       count.pl's --ufrequency option.
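
       The difference between the --remove and --frequency families can be
       sketched in Python as follows. This is an illustration only, not
       count.pl's code; the function name summarize, the sample counts, and
       the simplified totals (an overall total plus a per-first-word total)
       are assumptions.

```python
from collections import Counter


def summarize(counts, remove=None, frequency=None):
    """Apply a --remove-style or --frequency-style lower cutoff (illustrative)."""
    if remove is not None:
        # Removed bigrams leave the sample before any totals are computed.
        counts = {bg: n for bg, n in counts.items() if n >= remove}
    total = sum(counts.values())
    left = Counter()
    for (w1, _), n in counts.items():
        left[w1] += n
    shown = counts
    if frequency is not None:
        # Skipped bigrams are hidden but still contribute to the totals.
        shown = {bg: n for bg, n in counts.items() if n >= frequency}
    return total, dict(left), shown


counts = {("new", "york"): 5, ("new", "car"): 1, ("old", "car"): 2}
removed = summarize(counts, remove=2)
hidden = summarize(counts, frequency=2)
```

       Both calls display the same two bigrams, but only the --remove-style
       cutoff lowers the totals. The upper cutoffs (--uremove and
       --ufrequency) would mirror this with the comparison reversed.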

       --newLine

       Switches ON the --newLine option in count.pl. This will prevent bigrams
       from spanning across lines.

       Other Options:

       --help

       Displays this message.

       --version

       Displays the version information.

PROGRAM LOGIC
       ·   STEP 1

	    # create the output directory if it does not already exist
	    if (!-e $DESTINATION) {
	        mkdir $DESTINATION;
	    }

       ·   STEP 2

	   1. SOURCE is a single plain file -
	      huge-count.pl with the --tokenlist option calls count.pl on the
	      single plain file and prints all bigrams into one file. The
	      count outputs are also created in DESTINATION.

	   2. SOURCE is a single flat directory containing multiple plain
	   files -
	      huge-count.pl with the --tokenlist option calls count.pl on
	      each file present in the SOURCE directory. All files in SOURCE
	      are treated as data files. If SOURCE contains sub-directories,
	      these are simply skipped. Intermediate bigram outputs are
	      written in DESTINATION.

	   3. SOURCE is a list of multiple plain files -
	      If #arg > 2, all arguments specified after the first argument
	      are considered to be SOURCE file names. count.pl is run
	      separately on each of the SOURCE files specified by argv[1],
	      argv[2], ... argv[n] (skipping argv[0], which should be
	      DESTINATION). Intermediate results are created in DESTINATION.

	   In summary, a large datafile can be provided to huge-count in the
	   form of

	   a. A single plain file

	   b. A directory containing several plain files

	   c. Multiple plain files directly specified as command line
	   arguments

	   In all these cases, count.pl with --tokenlist is run separately on
	   the SOURCE files or on parts of a SOURCE file, and intermediate
	   results are written in the DESTINATION directory.

       ·   STEP 3

	   Split the output file generated by count.pl with --tokenlist into
	   smaller files of N bigrams each.

       ·   STEP 4

	   huge-sort.pl counts the unique bigrams and sorts them in
	   alphabetical order.

       ·   STEP 5

	   huge-merge.pl merges the bigram counts from the sorted bigram files.
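
       Steps 3 to 5 can be sketched in Python as follows. This is a minimal
       in-memory illustration of the split, sort and merge idea, not the
       code of huge-sort.pl or huge-merge.pl; the function names and the
       one-bigram-per-line token list format shown here are assumptions.

```python
import heapq
from collections import Counter
from itertools import groupby, islice


def split_lines(lines, n):
    """Yield successive chunks of at most n lines (STEP 3)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def count_and_sort(chunk):
    """Count the unique bigrams in one chunk and sort them (STEP 4)."""
    return sorted(Counter(chunk).items())


def merge_sorted(parts):
    """Merge sorted (bigram, count) lists, summing equal bigrams (STEP 5)."""
    merged = heapq.merge(*parts)
    return [(bg, sum(n for _, n in group))
            for bg, group in groupby(merged, key=lambda item: item[0])]


# Hypothetical token list with one bigram per line.
token_list = ["a<>b", "b<>c", "a<>b", "c<>d", "b<>c", "a<>b"]
parts = [count_and_sort(chunk) for chunk in split_lines(token_list, 2)]
totals = merge_sorted(parts)
```

       Because every part is already sorted, the merge only needs to hold
       one entry per part at a time, which is what keeps memory use small
       when the parts live in files rather than in lists.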

OUTPUT
       After huge-count finishes successfully, DESTINATION will contain:

       ·   Final bigram count file (huge-count.output) showing bigram counts
	   in the entire SOURCE after --remove and --uremove are applied.

       ·   Final bigram count file (complete-huge-count.output) showing bigram
	   counts in the entire SOURCE without --remove and --uremove.

BUGS
       huge-count.pl doesn't consider bigrams at file boundaries. In other
       words, the results of count.pl and huge-count.pl on the same data file
       will differ if --newLine is not used, because huge-count.pl runs
       count.pl on multiple files separately and thus loses track of the
       bigrams at file boundaries. With --window not specified, one bigram
       is lost at each file boundary, while W bigrams are lost with
       --window W.
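
       The loss at a file boundary can be illustrated with a short Python
       sketch for the case where --window is not specified. The sample
       text, the split point, and the function name adjacent_bigrams are
       assumptions made for the example.

```python
def adjacent_bigrams(tokens):
    """Return bigrams of neighbouring tokens (no --window)."""
    return list(zip(tokens, tokens[1:]))


text = "the quick brown fox jumps".split()
whole = adjacent_bigrams(text)

# The same tokens split across two hypothetical files.
file_a, file_b = text[:3], text[3:]
separate = adjacent_bigrams(file_a) + adjacent_bigrams(file_b)

# Bigrams present in the whole text but missing after the split.
lost = [bg for bg in whole if bg not in separate]
```

       Counting the two files separately misses the single bigram
       ("brown", "fox") that straddles the boundary.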

       The functionality of huge-count with --tokenlist is the same as that
       of count.pl only if --newLine is used and all files start and end on
       sentence boundaries. In other words, no sentence may be broken across
       the start or end of any file given to huge-count.

AUTHOR
       Amruta Purandare, University of Minnesota, Duluth

       Ted Pedersen, University of Minnesota, Duluth tpederse at umn.edu

       Ying Liu, University of Minnesota, Twin Cities liux0395 at umn.edu

COPYRIGHT
       Copyright (c) 2004-2010, Amruta Purandare, Ted Pedersen, and Ying Liu

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program; if not, write to

       The Free Software Foundation, Inc., 59 Temple Place - Suite 330,
       Boston, MA  02111-1307, USA.

perl v5.20.2			  2011-03-31			 HUGE-COUNT(1)