KOCOS(1)	      User Contributed Perl Documentation	      KOCOS(1)

NAME
       kocos.pl - Find the Kth order co-occurrences of a word

SYNOPSIS
       This program finds the Kth order co-occurrences of a given word.

DESCRIPTION
   1. What are Kth order co-occurrences?
       Co-occurrences are words that occur together in the same context.
       All words that co-occur with a given target word are called its
       co-occurrences.  The concept of 2nd order co-occurrences is explained
       in the paper Automatic Word Sense Discrimination [Schutze98]. According
       to this paper, the words that co-occur with the co-occurring words of
       a target word are called the 2nd order co-occurrences of that word.

       So with each increasing order of co-occurrences, we introduce an extra
       level of indirection and find words co-occurring with the previous
       order co-occurrences.

       We generalize the concept of 2nd order co-occurrences from [Schutze98]
       to find the Kth order co-occurrences of a word. These are the words
       that co-occur with the (K-1)th order co-occurrences of a given target
       word.

       We have also found [Niwa&Nitta94] to be related to kocos. While we do
       not exactly reimplement the co-occurrence vectors they propose, we feel
       that kocos is at least similar in spirit.

   2. Usage
       Usage: kocos.pl [OPTIONS] BIGRAM

   3. Input
       3.1 BIGRAM

       Specify the BIGRAM file name on the command line after the program name
       and options (if any) as shown in the usage note.

       BIGRAM should be bigram output (normal or extended) created by the NSP
       programs count.pl, statistic.pl or combig.pl. When count.pl and
       statistic.pl are run to create bigrams (--ngram set to 2 or not
       specified), they list the bigrams of all words that co-occur within a
       certain window. So if a bigram 'word1<>word2<>' is listed in the output
       of count.pl or statistic.pl, the words word1 and word2 are
       co-occurrences of each other.

       In general you may want to use a stop list (--stop option in count.pl)
       to remove very common words such as "the" and "for", and to eliminate
       low frequency bigrams (--remove option in count.pl). The stop list is
       particularly important because high frequency words such as "the" or
       "for" co-occur with many different words and greatly expand the search
       needed to find Kth order co-occurrences.
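
       For input that is not produced by count.pl, the same two clean-up
       steps can be approximated before the BIGRAM file is written. The
       following is only a sketch under stated assumptions: the function
       name filter_bigrams and its parameters are hypothetical, a pair is
       dropped if either word is a stop word, and a pair is kept only if it
       occurs at least min_count times. It does not claim to reproduce how
       count.pl implements --stop or --remove.

```python
from collections import Counter

def filter_bigrams(pairs, stop_words, min_count):
    """Count word pairs, skipping stop words and rare pairs.

    Assumption: a pair is dropped when EITHER word is a stop word.
    Assumption: a pair is kept only if it occurs at least min_count times.
    """
    counts = Counter(
        (w1, w2)
        for w1, w2 in pairs
        if w1 not in stop_words and w2 not in stop_words
    )
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical sample data: five observed word pairs.
pairs = [
    ("the", "line"),
    ("print", "line"),
    ("print", "line"),
    ("text", "for"),
    ("text", "line"),
]
kept = filter_bigrams(pairs, stop_words={"the", "for"}, min_count=2)
print(kept)
```

       Here only the pair (print, line) survives: two pairs contain a stop
       word and (text, line) occurs just once.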

       If you want to run kocos.pl on a source file not created by the count.pl
       or statistic.pl programs of this package, make sure that each line of
       the BIGRAM file lists two words WORD1 and WORD2 as

               WORD1<>WORD2<>

       The program minimally requires exactly two words separated by the
       delimiter '<>', with an extra delimiter '<>' after the second word. So
       you may convert any non-NSP input to this format, where two words
       occurring in the same context are '<>' separated, and provide it to
       kocos.
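
       To make the format concrete, here is a minimal sketch of how such
       lines could be read into a co-occurrence map. It is an illustration
       only, not the parser used by kocos.pl: the function name read_bigrams
       is hypothetical, and it assumes that anything after the second '<>'
       can be ignored and that lines without two '<>'-separated words
       should be skipped.

```python
from collections import defaultdict

def read_bigrams(lines):
    """Build a symmetric co-occurrence map from WORD1<>WORD2<> lines."""
    graph = defaultdict(set)
    for line in lines:
        fields = line.strip().split("<>")
        # Assumption: a usable line has two non-empty words and a trailing
        # '<>'; whatever follows the second '<>' is ignored here.
        if len(fields) < 3 or not fields[0] or not fields[1]:
            continue
        w1, w2 = fields[0], fields[1]
        # Each word is a co-occurrence of the other, so store both directions.
        graph[w1].add(w2)
        graph[w2].add(w1)
    return graph

# Hypothetical sample: one line without '<>' and three bigram lines.
sample = ["7", "print<>in<>1 2 2", "print<>line<>", "line<>file<>"]
graph = read_bigrams(sample)
print(sorted(graph["line"]))
```

       The map is symmetric because a bigram 'word1<>word2<>' makes each
       word a co-occurrence of the other, regardless of which comes first.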

       Controlling scope of the context

       You may want to treat two words as co-occurrences of each other only if
       they occur within a specific distance of each other. In this case we
       encourage you to use the --window w option of the NSP program count.pl
       when creating the BIGRAM file. This creates bigrams of all words that
       co-occur within a distance w of each other. Thus --window w sets the
       maximum distance allowed between two words for them to be called
       co-occurrences of each other.

       Note that if the --window option is not used while creating BIGRAM
       input for kocos, only those words which come immediately next to each
       other will be considered as the co-occurrences (default window size
       being 2 for bigrams).
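
       The effect of the window can be pictured with a small sketch that
       writes bigrams in the format kocos expects. This is not count.pl:
       the function name window_bigrams is hypothetical, and the sketch
       assumes that the window size counts both words of the pair, so that
       a window of 2 pairs only adjacent words (consistent with the default
       described above) and a window of 3 allows one intervening word.

```python
def window_bigrams(tokens, window=2):
    """Yield 'w1<>w2<>' for each ordered pair that fits inside the window.

    Assumption: the window size includes both words of the pair, so the
    second word is at most window-1 positions after the first.
    """
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            yield f"{left}<>{right}<>"

tokens = "print the first line".split()
adjacent = list(window_bigrams(tokens))           # default window of 2
wider = list(window_bigrams(tokens, window=3))    # one intervening word allowed

print("\n".join(adjacent))
print("\n".join(wider))
```

       With the wider window, 'print' and 'first' become co-occurrences even
       though 'the' stands between them.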

   4. Options
       4.1 --literal WORD

       With this option, the target WORD whose kth order co-occurrences are to
       be found can be directly specified on the command line.

       e.g.
               kocos.pl --literal line test.input

       will find the 1st order co-occurrences (by default) of the word 'line'
       using the bigrams listed in file test.input.

               kocos.pl --literal , --order 3 test.input

       will find 3rd order co-occurrences of ',' from file test.input.

       4.2 --regex REGEXFILE

       With this option, the target word can be specified using one or more
       Perl regular expressions.  The regexes should be written in a file;
       multiple regexes should either appear on separate lines or be separated
       by the Perl 'OR' operator (|).

       We provide this option to allow the user to specify various
       morphological variants of the target word, e.g. line, lines, Line,
       Lines etc.

       e.g.  (1) Let test.regex contain a regular expression for the target
       word, which is

               /^[Ll]ines?$/

       To use this for finding kocos, run kocos.pl with command

	       kocos.pl --regex test.regex --order K test.input

       (2) To find, say, 2nd order co-occurrences of any general target word
       that occurs in the data in <head> tags, as in Senseval format, we use
       the regular expression

               /^<head.*>\w+</head>$/

       in our regex file, say test.regex, and run kocos.pl using the command

	       kocos.pl --regex test.regex --order 2 eng-lex-sample.training.xml

       (3) To find 3rd order co-occurrences of any word that contains a period
       '.', run kocos.pl using

	       kocos.pl --literal . --order 3 test.input

       Or write a regex /\./ in a file, say test.regex, and run kocos using

	       kocos.pl --regex test.regex --order 3 test.input

       (4) To find 2nd order co-occurrences of all words that are numbers,
       write a regex like /^\d+$/ to a regex file, say test.regex, and run
       kocos using

	       kocos.pl --regex test.regex --order 2 test.input

       Note: the regex /\d+/ will also match words that merely include
       numbers, such as line20.1.cord or art%10.fine456.

       Regexes that should match the target word exactly should be delimited
       by ^ and $, as in /^[Ll]ines?$/. An unanchored regex such as
       /[Ll]ines?/ will also match 'incline'.

       Note: kocos.pl requires that the target word be specified using either
       the --literal or the --regex option.
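
       The difference between anchored and unanchored regexes can be checked
       quickly outside kocos. The sketch below uses Python's re module
       rather than Perl, on the assumption that these particular patterns
       behave the same in both; the word list and the helper name targets
       are made up for illustration.

```python
import re

# Hypothetical candidate words, including the examples mentioned above.
words = ["line", "Lines", "incline", "line20.1.cord", "42"]

anchored = re.compile(r"^[Ll]ines?$")    # must match the whole word
unanchored = re.compile(r"[Ll]ines?")    # may match anywhere inside a word
digits_only = re.compile(r"^\d+$")       # the whole word is a number
any_digit = re.compile(r"\d+")           # the word merely contains digits

def targets(pattern):
    """Return the words that the pattern would select as targets."""
    return [w for w in words if pattern.search(w)]

print(targets(anchored))
print(targets(unanchored))
print(targets(digits_only))
print(targets(any_digit))
```

       Only the anchored patterns restrict the targets to 'line'/'Lines' and
       to pure numbers; the unanchored ones also pick up 'incline' and
       'line20.1.cord'.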

       4.3 --order K

       If the value of K is specified using the command line option --order K,
       kocos.pl will find the Kth order co-occurrences of the target word. K
       can take any integer value greater than 0. If the value of K is not
       specified, the program will set K to 1 and will simply find the co-
       occurrences of the target (the word co-occurrence generally means first
       order co-occurrences).

       4.4 --trace TRACEFILE

       To see a detailed report of how each Kth order co-occurrence is reached
       as a sequence of K words, specify the name of a TRACEFILE on the
       command line using --trace TRACEFILE option.

       TRACEFILE will show chains of K+1 words where the first word is the
       TARGET word and every ith word in the chain is an (i-1)th order
       co-occurrence of the target that co-occurs with the (i-1)th word in
       the chain. So a chain of K+1 words,

	TARGET->COC1->COC2->COC3->...->COC(K-1)->COC(K)

       shows that COC1 is a first order co-occurrence of the TARGET.

       COC2 is a second order co-occurrence such that COC2 co-occurs with
       COC1, which in turn co-occurs with the TARGET.

       COC3 is a third order co-occurrence such that COC3 co-occurs with
       COC2, which in turn co-occurs with COC1, which co-occurs with the
       TARGET,

       and so on.
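
       One way to picture how such chains arise is sketched below. It is a
       reading of the description above, not the code of kocos.pl: the
       function name trace_chains is hypothetical, the graph is typed in by
       hand from the bigrams of the test.input example in Section 8, and the
       order in which chains are listed is not meant to match a real
       TRACEFILE.

```python
def trace_chains(graph, target, k):
    """Return chains target->c1->...->ck where ci is at distance i from target."""
    # Breadth-first pass: record each word's minimum distance from the target.
    dist = {target: 0}
    frontier = [target]
    for order in range(1, k + 1):
        next_frontier = []
        for word in frontier:
            for other in sorted(graph.get(word, ())):
                if other not in dist:
                    dist[other] = order
                    next_frontier.append(other)
        frontier = next_frontier
    # Extend every chain by one word per order, stepping only to words
    # whose minimum distance equals that order.
    chains = [[target]]
    for order in range(1, k + 1):
        chains = [
            chain + [other]
            for chain in chains
            for other in sorted(graph.get(chain[-1], ()))
            if dist.get(other) == order
        ]
    return ["->".join(chain) for chain in chains]

# Symmetric co-occurrence map for the test.input example of Section 8.
graph = {
    "line": {"print", "text", "file"},
    "print": {"in", "line"},
    "text": {"the", "line"},
    "file": {"the", "in", "line"},
    "the": {"text", "file"},
    "in": {"print", "file"},
}
chains = trace_chains(graph, "line", 2)
print("\n".join(chains))
```

       For K = 2 this yields the same four chains as the test.trace example
       in Section 8, possibly in a different order.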

       4.5 --help

       This option will display the help message.

       4.6 --version

       This option will display version information of the program.

   5. Output
       The program will display a list of Kth order co-occurrences to standard
       output such that each co-occurrence occurs on a separate line and is
       followed by '<>' (just to be compatible with other programs in NSP).

       Note that the output of kocos.pl could be directly used by the program
       nsp2regex of the SenseTools Package (by Satanjeev Banerjee and Ted
       Pedersen) to convert Senseval data instances into feature vectors in
       ARFF format where our Kth order co-occurrences are used as features.

       For more information on SenseTools you can refer to its README:
       http://www.d.umn.edu/~tpederse/sensetools.html

				       IMPORTANT NOTE

       If some Kth order co-occurrences are also ith order co-occurrences
       (0<i<K) of the target word, kocos.pl will not display them as Kth
       order co-occurrences. kocos.pl displays only those words as Kth order
       co-occurrences whose minimum distance from the target word is K in the
       co-occurrence graph.  [The co-occurrence graph is a network of words
       in which each word is connected to all words it co-occurs with.]
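
       The minimum-distance rule corresponds to a breadth-first search over
       the co-occurrence graph. The following sketch illustrates that idea
       under stated assumptions; it is not taken from kocos.pl. The function
       name kth_order is hypothetical, and the word pairs are those of the
       test.input example in Section 8.

```python
from collections import defaultdict

def kth_order(pairs, target, k):
    """Return the words whose minimum distance from target is exactly k."""
    graph = defaultdict(set)
    for w1, w2 in pairs:
        graph[w1].add(w2)
        graph[w2].add(w1)
    seen = {target}
    frontier = {target}
    for _ in range(k):
        # Words already reached at a lower order stay in `seen`, so they
        # are never reported again at a higher order.
        frontier = {n for w in frontier for n in graph[w]} - seen
        seen |= frontier
    return frontier

# Word pairs from the test.input example of Section 8.
pairs = [
    ("print", "in"), ("print", "line"), ("text", "the"), ("text", "line"),
    ("file", "the"), ("file", "in"), ("line", "file"),
]
first = kth_order(pairs, "line", 1)
second = kth_order(pairs, "line", 2)
third = kth_order(pairs, "line", 3)
print(sorted(first), sorted(second), sorted(third))
```

       In this small graph the third order set is empty: every word that
       co-occurs with 'the' or 'in' was already reached at order 1.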

   6. Usage examples
       (a)  Using the default value of order

       To find the (1st order) co-occurrences of the word 'line' from the
       BIGRAM file test.input, run kocos.pl using the following command.

               kocos.pl --literal line test.input

       (b)  Using the order option

       To find the 2nd order co-occurrences of the word 'line' from the BIGRAM
       file test.input, run kocos.pl using the following command.

               kocos.pl --literal line --order 2 test.input

       (c)  Using the trace option

       To see how the 4th order co-occurrences of the word 'line' are reached
       as a sequence of words that form a co-occurrence chain, run kocos.pl
       using the following command.

               kocos.pl --literal line --order 4 --trace test.trace test.input

       (d)  Using a regex to specify the target word

       To find the Kth order co-occurrences of the target word 'line', which
       is specified as a Perl regular expression, say /^[Ll]ines?$/, in a file
       test.regex, run kocos.pl using

               kocos.pl --regex test.regex --order K test.input

       (e)  Using a generic regex for data like Senseval-2

       To find the 2nd order co-occurrences of a target word that occurs in
       <head> tags in the data file eng-lex-sample.training.xml, use a regular
       expression like /<head>\w+</head>/ from a file, say test.regex, and run
       kocos.pl using

               kocos.pl --regex test.regex --order 2 test.input

   7. General Recommendations
       (a) Create a BIGRAM file using the programs count.pl, statistic.pl or
           combig.pl of the NSP package.

       (b) Use the --window W option of count.pl to specify the scope of the
           context. Any word that occurs within a distance W from a target
           word will be treated as its co-occurrence.

       (c) Use either the --literal or the --regex option to specify the
           target word. We recommend the use of regex support to detect forms
           of the target word other than its base form.

   8. Examples of Kth order co-occurrences
       In all the following examples, we assume that the input comes from the
       file test.input and word 'line' is a target word.

	test.input =>
	----------------
	print<>in<>    |
	print<>line<>  |
	text<>the<>    |
	text<>line<>   |
	file<>the<>    |
	file<>in<>     |
	line<>file<>   |
	----------------

       (Note that test.input doesn't look like a valid count/statistic output
       because kocos.pl will minimally require two words WORD1 and WORD2
       separated by '<>' with an extra '<>' after WORD2 as described in
       Section 3.1 of this README)

       (a)  The 1st order co-occurrences of word 'line' can be found by
	    running kocos.pl with either of the following commands -

	       kocos.pl --literal line test.input
		       OR
	       kocos.pl --order 1 --literal line test.input

       This will display the co-occurrences of 'line' to standard output as
       shown below in the box.

	--------
	text<> |
	file<> |
	print<>|
	--------

       This is because the program finds the bigrams

	print<>line<>
	text<>line<>
	line<>file<>

       where word 'line' co-occurs with the words print, text and file which
       become the 1st order co-occurrences.

       (b)     The 2nd order co-occurrences of word 'line' can be found by
	    running kocos.pl with the following command -
	       kocos.pl --literal line --order 2 test.input

       This will display the 2nd order co-occurrences of 'line' to standard
       output as shown below in the box.

	--------
	the<>  |
	in<>   |
	--------

       This is because the program finds the words print, text and file as the
       first order co-occurrences (as explained in case a) and finds bigrams

	print<>in<>
	text<>the<>
	file<>the<>
	file<>in<>

       where 'the' and 'in' co-occur with the words print, text, file.

       (c)     To see how the 2nd order co-occurrences of word 'line' are
       reached, run the program using the following command -
	       kocos.pl --literal line --order 2 --trace test.trace test.input

       This will display the 2nd order co-occurrences of 'line' to standard
       output as shown below in the box.

	--------
	the<>	|
	in<>	|
	--------

       and a detailed report of co-occurrence chains in test.trace file as
       shown in the box below.

	test.trace =>

	----------------
	line->text->the|
	line->file->the|
	line->file->in |
	line->print->in|
	----------------

       where the first line shows that the word 'line' co-occurred with 'text'
       which co-occurred with 'the'. Hence 'the' became a 2nd order co-
       occurrence.  Similarly, 'line' co-occurred with 'file' which in turn
       co-occurred with 'the' and 'in' which are therefore the 2nd order co-
       occurrences of 'line'.

   9. References
       [Niwa&Nitta94] Y. Niwa and Y. Nitta. Co-occurrence vectors from corpora
       vs. distance vectors from dictionaries. COLING-1994.

       [Schutze98] H. Schutze. Automatic word sense discrimination.
       Computational Linguistics, 24(1):97-123, 1998.

AUTHORS
	Amruta Purandare, University of Minnesota, Duluth, pura0010@umn.edu
	Ted Pedersen, University of Minnesota, Duluth, tpederse@umn.edu

	Last updated on 12/05/2003 by TDP

       This work has been partially supported by a National Science Foundation
       Faculty Early CAREER Development award (#0092784).

BUGS
SEE ALSO
       http://www.d.umn.edu/~tpederse/nsp.html

COPYRIGHT
       Copyright (C) 2002-2003, Amruta Purandare and Ted Pedersen

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program; if not, write to

       The Free Software Foundation, Inc., 59 Temple Place - Suite 330,
       Boston, MA  02111-1307, USA.

perl v5.20.2			  2008-03-24			      KOCOS(1)