Lingua::EN::Sentence man page on OpenSuSE

Man page or keyword search:  
man Server   25941 pages
apropos Keyword Search (all sections)
Output format
OpenSuSE logo
[printable version]

Lingua::EN::Sentence(3User Contributed Perl DocumentatiLingua::EN::Sentence(3)

NAME
       Lingua::EN::Sentence - Module for splitting text into sentences.

SYNOPSIS
	       use Lingua::EN::Sentence qw( get_sentences add_acronyms );

	       add_acronyms('lt','gen');	       ## adding support for 'Lt. Gen.'
	       my $sentences=get_sentences($text);     ## Get the sentences.
	       foreach my $sentence (@$sentences) {
		       ## do something with $sentence
	       }

DESCRIPTION
       The "Lingua::EN::Sentence" module contains the function get_sentences,
       which splits text into its constituent sentences, based on a regular
       expression and a list of abbreviations (built in and given).

       Certain well know exceptions, such as abreviations, may cause incorrect
       segmentations.  But some of them are already integrated into this code
       and are being taken care of.  Still, if you see that there are words
       causing the get_sentences() to fail, you can add those to the module,
       so it notices them.

ALGORITHM
       Basically, I use a 'brute' regular expression to split the text into
       sentences.  (Well, nothing is yet split - I just mark the end-of-
       sentence).  Then I look into a set of rules which decide when an end-
       of-sentence is justified and when it's a mistake. In case of a mistake,
       the end-of-sentence mark is removed.

       What are such mistakes? Cases of abbreviations, for example. I have a
       list of such abbreviations (Please see `Acronym/Abbreviations list'
       section), and more general rules (for example, the abbreviations 'i.e.'
       and '.e.g.' need not to be in the list as a special rule takes care of
       all single letter abbreviations).

FUNCTIONS
       All functions used should be requested in the 'use' clause. None is
       exported by default.

       get_sentences( $text )
	   The get sentences function takes a scalar containing ascii text as
	   an argument and returns a reference to an array of sentences that
	   the text has been split into.  Returned sentences will be trimmed
	   (beginning and end of sentence) of white-spaces.  Strings with no
	   alpha-numeric characters in them, won't be returned as sentences.

       add_acronyms( @acronyms )
	   This function is used for adding acronyms not supported by this
	   code.  Please see `Acronym/Abbreviations list' section for the
	   abbreviations already supported by this module.

       get_acronyms(  )
	   This function will return the defined list of acronyms.

       set_acronyms( @my_acronyms )
	   This function replaces the predefined acroym list with the given
	   list.

       get_EOS(	 )
	   This function returns the value of the string used to mark the end
	   of sentence. You might want to see what it is, and to make sure
	   your text doesn't contain it. You can use set_EOS() to alter the
	   end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
	   This function alters the end-of-sentence string used to mark the
	   end of sentences.

       set_locale( $new_locale ) Revceives language locale in the form
       language.country.character-set for example: "fr_CA.ISO8859-1" for
       Canadian French using character set ISO8859-1.
	   Returns a reference to a hash containing the current locale
	   formatting values.  Returns undef if got undef.

	   The following will set the LC_COLLATE behaviour to Argentinian
	   Spanish. NOTE: The naming and avail ability of locales depends on
	   your operating sys tem. Please consult the perllocale manpage for
	   how to find out which locales are available in your system.

	   $loc = set_locale( "es_AR.ISO8859-1" );

	   This actually does this:

	   $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );

Acronym/Abbreviations list
       You can use the get_acronyms() function to get acronyms.	 It has become
       too long to specify in the documentation.

       If I come across a good general-purpose list - I'll incorporate it into
       this module.  Feel free to suggest such lists.

FUTURE WORK [1] Object Oriented like usage [2] Supporting more than just
       English/French [3] Code optimization. Currently everything is RE based
       and not so optimized RE [4] Possibly use more semantic heuristics for
       detecting a beginning of a sentence
SEE ALSO
	       Text::Sentence

AUTHOR
       Shlomo Yona shlomo@cs.haifa.ac.il

COPYRIGHT
       Copyright (c) 2001, 2002 Shlomo Yona. All rights reserved.

       This library is free software.  You can redistribute it and/or modify
       it under the same terms as Perl itself.

POD ERRORS
       Hey! The above document had some coding errors, which are explained
       below:

       Around line 40:
	   '=item' outside of any '=over'

       Around line 76:
	   Non-ASCII character seen before =encoding in 'avail'. Assuming
	   ISO8859-1

       Around line 84:
	   You forgot a '=back' before '=head1'

perl v5.18.1			  2002-09-24	       Lingua::EN::Sentence(3)
[top]

List of man pages available for OpenSuSE

Copyright (c) for man pages and the logo by the respective OS vendor.

For those who want to learn more, the polarhome community provides shell access and support.

[legal] [privacy] [GNU] [policy] [cookies] [netiquette] [sponsors] [FAQ]
Tweet
Polarhome, production since 1999.
Member of Polarhome portal.
Based on Fawad Halim's script.
....................................................................
Vote for polarhome
Free Shell Accounts :: the biggest list on the net