Lingua::EN::Sentence(3User Contributed Perl DocumentatiLingua::EN::Sentence(3)NAMELingua::EN::Sentence - Module for splitting text into sentences.
SYNOPSIS
use Lingua::EN::Sentence qw( get_sentences add_acronyms );
add_acronyms('lt','gen'); ## adding support for 'Lt. Gen.'
my $sentences=get_sentences($text); ## Get the sentences.
foreach my $sentence (@$sentences) {
## do something with $sentence
}
DESCRIPTION
The "Lingua::EN::Sentence" module contains the function get_sentences,
which splits text into its constituent sentences, based on a regular
expression and a list of abbreviations (built in and given).
Certain well know exceptions, such as abreviations, may cause incorrect
segmentations. But some of them are already integrated into this code
and are being taken care of. Still, if you see that there are words
causing the get_sentences() to fail, you can add those to the module,
so it notices them.
ALGORITHM
Basically, I use a 'brute' regular expression to split the text into
sentences. (Well, nothing is yet split - I just mark the end-of-
sentence). Then I look into a set of rules which decide when an end-
of-sentence is justified and when it's a mistake. In case of a mistake,
the end-of-sentence mark is removed.
What are such mistakes? Cases of abbreviations, for example. I have a
list of such abbreviations (Please see `Acronym/Abbreviations list'
section), and more general rules (for example, the abbreviations 'i.e.'
and '.e.g.' need not to be in the list as a special rule takes care of
all single letter abbreviations).
FUNCTIONS
All functions used should be requested in the 'use' clause. None is
exported by default.
get_sentences( $text )
The get sentences function takes a scalar containing ascii text as
an argument and returns a reference to an array of sentences that
the text has been split into. Returned sentences will be trimmed
(beginning and end of sentence) of white-spaces. Strings with no
alpha-numeric characters in them, won't be returned as sentences.
add_acronyms( @acronyms )
This function is used for adding acronyms not supported by this
code. Please see `Acronym/Abbreviations list' section for the
abbreviations already supported by this module.
get_acronyms( )
This function will return the defined list of acronyms.
set_acronyms( @my_acronyms )
This function replaces the predefined acroym list with the given
list.
get_EOS( )
This function returns the value of the string used to mark the end
of sentence. You might want to see what it is, and to make sure
your text doesn't contain it. You can use set_EOS() to alter the
end-of-sentence string to whatever you desire.
set_EOS( $new_EOS_string )
This function alters the end-of-sentence string used to mark the
end of sentences.
set_locale( $new_locale ) Revceives language locale in the form
language.country.character-set for example: "fr_CA.ISO8859-1" for
Canadian French using character set ISO8859-1.
Returns a reference to a hash containing the current locale
formatting values. Returns undef if got undef.
The following will set the LC_COLLATE behaviour to Argentinian
Spanish. NOTE: The naming and avail ability of locales depends on
your operating sys tem. Please consult the perllocale manpage for
how to find out which locales are available in your system.
$loc = set_locale( "es_AR.ISO8859-1" );
This actually does this:
$loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
Acronym/Abbreviations list
You can use the get_acronyms() function to get acronyms. It has become
too long to specify in the documentation.
If I come across a good general-purpose list - I'll incorporate it into
this module. Feel free to suggest such lists.
FUTURE WORK [1] Object Oriented like usage [2] Supporting more than just
English/French [3] Code optimization. Currently everything is RE based
and not so optimized RE [4] Possibly use more semantic heuristics for
detecting a beginning of a sentence
SEE ALSO
Text::Sentence
AUTHOR
Shlomo Yona shlomo@cs.haifa.ac.il
COPYRIGHT
Copyright (c) 2001, 2002 Shlomo Yona. All rights reserved.
This library is free software. You can redistribute it and/or modify
it under the same terms as Perl itself.
POD ERRORS
Hey! The above document had some coding errors, which are explained
below:
Around line 40:
'=item' outside of any '=over'
Around line 76:
Non-ASCII character seen before =encoding in 'avail'. Assuming
ISO8859-1
Around line 84:
You forgot a '=back' before '=head1'
perl v5.18.1 2002-09-24 Lingua::EN::Sentence(3)