rawtextFreq.pl man page on DragonFly

RAWTEXTFREQ(1)	      User Contributed Perl Documentation	RAWTEXTFREQ(1)

NAME
       rawtextFreq.pl - Compute Information Content from Raw / Plain Text

SYNOPSIS
	rawtextFreq.pl --outfile OUTFILE [--stopfile=STOPFILE]
		      {--stdin | --infile FILE [--infile FILE ...]}
		       [--wnpath WNPATH] [--resnik] [--smooth=SCHEME]
		       | --help | --version

OPTIONS
       --outfile=filename

	   The name of a file to which output should be written

       --stopfile=filename

	   A file containing a list of stop-listed words that will not be
	   considered in the frequency counts.  A sample file can be
	   downloaded from
	   http://www.d.umn.edu/~tpederse/Group01/WordNet/words.txt

       --wnpath=path

	   Location of the WordNet data files (e.g.,
	   /usr/local/WordNet-3.0/dict)

       --resnik

	   Use Resnik (1995) frequency counting

       --smooth=SCHEME

	   Apply smoothing to the probabilities that are computed.  SCHEME
	   can only be ADD1 at this time.
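	   The ADD1 scheme is standard add-one (Laplace) smoothing.  A toy
	   sketch in Python (names are illustrative, not taken from
	   rawtextFreq.pl itself):

	   ```python
	   # Add-one (Laplace) smoothing: every synset gets at least a count
	   # of one, so that log(probability) is defined for all of them.
	   def add1_smooth(counts, all_synsets):
	       return {s: counts.get(s, 0) + 1 for s in all_synsets}

	   counts = {"bank#n#1": 3, "bank#n#2": 0}
	   smoothed = add1_smooth(counts, ["bank#n#1", "bank#n#2", "bank#n#3"])
	   # smoothed: {"bank#n#1": 4, "bank#n#2": 1, "bank#n#3": 1}
	   ```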

       --help

	   Show a help message

       --version

	   Display version information

       --stdin

	   Read from the standard input the text that is to be used for
	   counting the frequency of words.

       --infile=PATTERN

	   The name of a raw text file to be used to count word frequencies.
	   This can actually be a filename, a directory name, or a pattern (as
	   understood by Perl's glob() function).  If the value is a directory
	   name, then all the files in that directory and its subdirectories will
	   be used.

	   If you are looking for some interesting files to use, check out
	   Project Gutenberg: <http://www.gutenberg.org>.

	   This option may be given more than once (if more than one file
	   should be used).
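       The file/directory/pattern resolution can be sketched as follows (in
       Python rather than the program's Perl, and with an invented helper
       name, purely to illustrate the behavior described above):

       ```python
       import glob
       import os

       def resolve_infile(value):
           """Resolve an --infile value: a file, a directory (searched
           recursively), or a glob pattern."""
           if os.path.isfile(value):
               return [value]
           if os.path.isdir(value):
               found = []
               for root, _dirs, files in os.walk(value):
                   found.extend(os.path.join(root, f) for f in files)
               return found
           # otherwise treat it as a glob pattern
           return [p for p in glob.glob(value) if os.path.isfile(p)]
       ```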

DESCRIPTION
       This program reads a corpus of plain text and computes frequency counts
       from that corpus and then uses those to determine the information
       content of each synset in WordNet. In brief it does this by first
       assigning counts to each synset for which it obtains a frequency count
       in the corpus, and then those counts are propagated up the WordNet
       hierarchy. More details on this process can be found in the
       documentation of the lin, res, and jcn measures in WordNet::Similarity
       and in the publication by Patwardhan et al. (2003) referred to
       below.

       The utility programs BNCFreq.pl, SemCorRawFreq.pl, treebankFreq.pl, and
       brownFreq.pl all function in exactly the same way as this plain text
       program (rawtextFreq.pl), except that they include the ability to deal
       with the format of the corpus with which they are used.

       None of these programs requires sense-tagged text; instead they simply
       distribute the counts of each observed word form to all the synsets
       with which it could be associated. The different forms of a word are
       found via the validForms and querySense methods of WordNet::QueryData.

       For example, if the observed word is 'bank', then a count is given to
       the synsets associated with the financial institution, a river shore,
       the act of turning a plane, etc.

   Distributing Counts to Synsets
       If the corpus is sense-tagged, then distributing the counts of
       sense-tagged words to synsets is trivial; you increment the count of each
       synset for which you have a sense tagged instance. It is very hard to
       obtain large quantities of sense tagged text, so in general it is not
       feasible to obtain information content values from large sense-tagged
       corpora.

       As such, this program and the related *Freq.pl utilities all try to
       increment the counts of synsets based on the occurrence of raw,
       untagged word forms. In this case it is less obvious how to proceed.
       This program supports two methods for distributing the counts of
       observed word forms in untagged text to synsets.

       One is our default method, and we refer to the other as Resnik
       counting. In our default counting scheme, each synset receives the
       total count of each word form associated with it.

       Suppose the word 'bank' can be associated with six different synsets. In
       our default scheme each of those synsets would receive a count for each
       occurrence of 'bank'. In Resnik counting, the count would be divided
       between the possible synsets, so in this case each synset would get one
       sixth (1/6) of the total count.
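       The two schemes can be sketched in Python with a hand-made mapping from
       word forms to synsets (the real program obtains this mapping from
       WordNet::QueryData; the names here are invented for illustration):

       ```python
       from collections import defaultdict

       # toy mapping: the word form 'bank' has six candidate synsets
       SYNSETS = {"bank": ["bank#n#1", "bank#n#2", "bank#n#3",
                           "bank#n#4", "bank#n#5", "bank#n#6"]}

       def count(tokens, resnik=False):
           counts = defaultdict(float)
           for token in tokens:
               synsets = SYNSETS.get(token, [])
               # default: each synset gets the full count;
               # Resnik: the count is split evenly among the synsets
               share = 1.0 / len(synsets) if (resnik and synsets) else 1.0
               for s in synsets:
                   counts[s] += share
           return dict(counts)

       default = count(["bank"])        # each synset receives 1.0
       resnik  = count(["bank"], True)  # each synset receives 1/6
       ```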

   How are These Counts Used?
       This program maps word forms to synsets. These synset counts are then
       propagated up the WordNet hierarchy to arrive at Information Content
       values for each synset, which are then used by the Lin (lin), Resnik
       (res), and Jiang & Conrath (jcn) measures of semantic similarity.

       By default these measures use counts derived from the cntlist file
       provided by WordNet, which is based on frequency counts from the sense-
       tagged SemCor corpus. This consists of approximately 200,000 sense
       tagged tokens taken from the Brown Corpus and the Red Badge of Courage.

       A file called ic-semcor.dat is created during installation of
       WordNet::Similarity from cntlist. In fact, the util program
       semCorFreq.pl is used to do this. This is the only one of the *Freq.pl
       utility programs that uses sense tagged text, and in fact it only uses
       the counts from cntlist, not the actual sense tagged text.

       This program simply creates an alternative version of the ic-semcor.dat
       file based on counts obtained from raw untagged text.

   Why Use This Program?
       The default information content file (ic-semcor.dat) is based on
       SemCor, which includes sense tagged portions of the Brown Corpus and
       the Red Badge of Courage. It has the advantage of being sense tagged,
       but is from a rather limited domain and is somewhat small in size
       (200,000 sense tagged tokens).

       If you are working in a different domain or have access to a larger
       quantity of corpora, you might find that this program provides
       information content values that better reflect your underlying domain
       or problem.

   How can these counts be reliable if they aren't based on sense tagged text?
       Remember that once the counts are given to a synset, those counts are
       propagated upwards, so that each synset receives the counts of its
       children. These are then used in the calculation of the information
       content of each synset, which is simply:

	       information content (synset) = - log [probability (synset)]

       More details on this calculation and how they are used in the res, lin,
       and jcn measures can be found in the WordNet::Similarity module
       documentation, and in the following publication:

	Using Measures of Semantic Relatedness for Word Sense Disambiguation
	(Patwardhan, Banerjee and Pedersen) - Appears in the Proceedings of
	the Fourth International Conference on Intelligent Text Processing and
	Computational Linguistics, pp. 241-257, February 17-21, 2003, Mexico City.
	<http://www.d.umn.edu/~tpederse/Pubs/cicling2003-3.pdf>
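       The propagation and information content calculation can be sketched in
       Python; the hierarchy and counts below are invented for illustration,
       whereas the real program walks the full WordNet is-a hierarchy:

       ```python
       import math

       # a toy is-a hierarchy: child -> parent (None marks the root)
       PARENT = {"savings_bank": "financial_institution",
                 "financial_institution": "institution",
                 "institution": None}

       counts = {"savings_bank": 2.0,
                 "financial_institution": 1.0,
                 "institution": 1.0}

       # each synset receives the counts of its descendants
       propagated = dict(counts)
       for synset, c in counts.items():
           node = PARENT[synset]
           while node is not None:
               propagated[node] += c
               node = PARENT[node]

       total = propagated["institution"]  # the root holds the total mass
       ic = {s: -math.log(c / total) for s, c in propagated.items() if c > 0}
       # the root is maximally general, so its information content is zero;
       # more specific synsets receive higher values
       ```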

       We believe that a propagation effect will result in concentrations or
       clusters of information content values in the WordNet hierarchy. For
       example, if you have a text about banking, while the different counts
       of "bank" will be dispersed around WordNet, there will also be other
       financial terms that occur with bank that will occur near the financial
       synset in WordNet, and lead to a concentration of counts in that region
       of WordNet. It is best to view this as a conjecture or hypothesis at
       this time. Evidence for or against would be most interesting.

       You can use raw text of any kind in this program. We sometimes use text
       from Project Gutenberg, for example the Complete Works of Shakespeare,
       available from <http://www.gutenberg.org/ebooks/100>

BUGS
       Report to WordNet::Similarity mailing list :
	<http://groups.yahoo.com/group/wn-similarity>

SEE ALSO
       utils.pod

       WordNet home page :
	<http://wordnet.princeton.edu>

       WordNet::Similarity home page :
	<http://wn-similarity.sourceforge.net>

AUTHORS
	Ted Pedersen, University of Minnesota, Duluth
	tpederse at d.umn.edu

	Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
	banerjee+ at cs.cmu.edu

	Siddharth Patwardhan, University of Utah, Salt Lake City
	sidd at cs.utah.edu

	Jason Michelizzi

COPYRIGHT
       Copyright (c) 2005-2008, Ted Pedersen, Satanjeev Banerjee, Siddharth
       Patwardhan and Jason Michelizzi

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.  This program is distributed in the hope
       that it will be useful, but WITHOUT ANY WARRANTY; without even the
       implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.	 See the GNU General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program; if not, write to

	Free Software Foundation, Inc.
	59 Temple Place - Suite 330
	Boston, MA  02111-1307, USA

perl v5.20.2			  2015-08-31			RAWTEXTFREQ(1)