SGML::Parser::OpenSP(3) User Contributed Perl Documentation SGML::Parser::OpenSP(3)
NAME
SGML::Parser::OpenSP - Parse SGML documents using OpenSP
SYNOPSIS
use SGML::Parser::OpenSP;
my $p = SGML::Parser::OpenSP->new;
my $h = ExampleHandler->new;
$p->catalogs(qw(xhtml.soc));
$p->warnings(qw(xml valid));
$p->handler($h);
$p->parse("example.xhtml");
DESCRIPTION
This module provides an interface to the OpenSP SGML parser. OpenSP and
this module are event based: as the parser recognizes parts of the
document (say, the start or end of an element), any handlers
registered for that type of event are called with suitable
parameters.
COMMON METHODS
new()
Returns a new SGML::Parser::OpenSP object. Takes no arguments.
parse($file)
Parses the file passed as an argument. Note that this must be a
filename and not a filehandle. See "PROCESSING FILES" below for
details.
parse_string($data)
Parses the data passed as an argument. See "PROCESSING FILES" below
for details.
halt()
Halts processing before parsing the entire document. Takes no
arguments.
split_message()
Splits OpenSP's error messages into their component parts. See
"POST-PROCESSING ERROR MESSAGES" below for details.
get_location()
See "POSITIONING INFORMATION" below for details.
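As a minimal sketch of how halt() might be used: a handler that only needs the name of the document element can stop the parser as soon as the first "start_element" event arrives. The RootFinder package and its "parser" field are assumptions made for illustration; the handler simply needs some reference to the parser object in order to call halt().

```perl
use strict;
use warnings;

{
    # Hypothetical handler: records the document element, then stops parsing.
    package RootFinder;

    # Keep a reference to the parser so the handler can call halt() on it.
    sub new {
        my ($class, $parser) = @_;
        return bless { parser => $parser, root => undef }, $class;
    }

    sub start_element {
        my ($self, $elem) = @_;
        $self->{root} = $elem->{Name};   # generic identifier of the element
        $self->{parser}->halt;           # no further events are needed
    }

    # Accessor for the recorded element name (undef if none was seen).
    sub root { $_[0]{root} }
}
```

With a real parser object $p, this would be wired up by creating the handler with RootFinder->new($p), passing it to $p->handler, and then calling $p->parse on a file name.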
CONFIGURATION
BOOLEAN OPTIONS
$p->handler([$handler])
Report events to the blessed reference $handler.
ERROR MESSAGE FORMAT
$p->show_open_entities([$bool])
Describe open entities in error messages. Error messages always
include the position of the most recently opened external entity.
The default is false.
$p->show_open_elements([$bool])
Show the generic identifiers of open elements in error messages.
The default is false.
$p->show_error_numbers([$bool])
Show message numbers in error messages.
GENERATED EVENTS
$p->output_comment_decls([$bool])
Generate "comment_decl" events. The default is false.
$p->output_marked_sections([$bool])
Generate marked section events ("marked_section_start",
"marked_section_end", "ignored_chars"). The default is false.
$p->output_general_entities([$bool])
Generate "general_entity" events. The default is false.
IO SETTINGS
$p->map_catalog_document([$bool])
"parse" arguments specify catalog files rather than the document
entity. The document entity is specified by the first DOCUMENT
entry in the catalog files. The default is false.
$p->restrict_file_reading([$bool])
Restrict file reading to the specified directories (see the
"search_dirs" method and the "SGML_SEARCH_PATH" environment
variable). You should turn this option on and configure the search
paths accordingly if you intend to process untrusted resources. The
default is false.
$p->catalogs([@catalogs])
Map public identifiers and entity names to system identifiers using
the specified catalog entry files. Multiple catalogs are allowed.
If there is a catalog entry file called "catalog" in the same place
as the document entity, it will be searched for immediately after
those specified.
$p->search_dirs([@search_dirs])
Search the specified directories for files specified in system
identifiers. Multiple values are allowed. See the
description of the osfile storage manager in the OpenSP
documentation for more information about file searching.
$p->pass_file_descriptor([$bool])
Instruct "parse_string" to pass the input data down to the guts of
OpenSP using the "OSFD" storage manager (if true) or the "OSFILE"
storage manager (if false). This amounts to the difference between
passing a file descriptor and a (temporary) file name.
The default is true except on platforms, such as Win32, which are
known to not support passing file descriptors around in this
manner. On platforms which support it you can call this method with
a false parameter to force use of temporary file names instead.
In general, this will do the right thing on its own so it's best to
consider this an internal method. If your platform is such that you
have to force use of the OSFILE storage manager, please report it
as a bug and include the values of $^O, $Config{archname}, and a
description of the platform (e.g. "Windows Vista Service Pack 42").
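The advice given for "restrict_file_reading" above could be applied along the following lines. This is only a sketch: configure_for_untrusted_input is a hypothetical helper, not part of this module, and the directory and catalog names passed to it are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: configure a parser object for untrusted input.
# $p is expected to be an SGML::Parser::OpenSP object.
sub configure_for_untrusted_input {
    my ($p, %opt) = @_;

    # Directories in which system identifiers may be resolved.
    $p->search_dirs(@{ $opt{dirs} || [] });

    # Catalogs that map public identifiers to system identifiers.
    $p->catalogs(@{ $opt{catalogs} || [] });

    # Restrict file reading to the directories configured above.
    $p->restrict_file_reading(1);

    return $p;
}
```

A caller might use it as configure_for_untrusted_input($p, dirs => ['/usr/share/sgml'], catalogs => ['/usr/share/sgml/catalog']), where both paths are examples only.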
PROCESSING OPTIONS
$p->include_params([@include_params])
For each name in @include_params pretend that
<!ENTITY % name "INCLUDE">
occurs at the start of the document type declaration subset in the
SGML document entity. Since repeated definitions of an entity are
ignored, this definition will take precedence over any other
definitions of this entity in the document type declaration.
Multiple names are allowed. If the SGML declaration replaces the
reserved name INCLUDE then the new reserved name will be the
replacement text of the entity. Typically the document type
declaration will contain
<!ENTITY % name "IGNORE">
and will use %name; in the status keyword specification of a marked
section declaration. In this case the effect of the option will be
to cause the marked section not to be ignored.
$p->active_links([@active_links])
???
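To illustrate "include_params", the following sketch guards one entity declaration with a marked section whose status keyword comes from a parameter entity. The element and entity names (doc, draft, status) and the parse_as_draft helper are invented for this example. Following the description above, the guarded declaration of "status" should take effect when 'draft' is passed to include_params, and the later one should take effect otherwise.

```perl
use strict;
use warnings;

# A small document whose DTD subset contains a conditional declaration.
my $doc = <<'SGML';
<!DOCTYPE doc [
<!ELEMENT doc - - (#PCDATA)>
<!ENTITY % draft "IGNORE">
<![%draft;[
<!ENTITY status "draft">
]]>
<!ENTITY status "final">
]>
<doc>&status;</doc>
SGML

# Hypothetical helper: $p is expected to be an SGML::Parser::OpenSP object
# on which a handler has already been set.
sub parse_as_draft {
    my ($p, $data) = @_;
    $p->include_params('draft');   # as if <!ENTITY % draft "INCLUDE"> came first
    $p->parse_string($data);
    return;
}
```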
ENABLING WARNINGS
Additional warnings can be enabled using
$p->warnings([@warnings])
The following values can be used to enable warnings:
xml Warn about constructs that are not allowed by XML.
mixed
Warn about mixed content models that do not allow #pcdata anywhere.
sgmldecl
Warn about various dubious constructions in the SGML declaration.
should
Warn about various recommendations made in ISO 8879 that the
document does not comply with. (Recommendations are expressed with
``should'', as distinct from requirements which are usually
expressed with ``shall''.)
default
Warn about defaulted references.
duplicate
Warn about duplicate entity declarations.
undefined
Warn about undefined elements: elements used in the DTD but not
defined.
unclosed
Warn about unclosed start and end-tags.
empty
Warn about empty start and end-tags.
net Warn about net-enabling start-tags and null end-tags.
min-tag
Warn about minimized start and end-tags. Equivalent to the combination
of the unclosed, empty and net warnings.
unused-map
Warn about unused short reference maps: maps that are declared with
a short reference mapping declaration but never used in a short
reference use declaration in the DTD.
unused-param
Warn about parameter entities that are defined but not used in a
DTD. Unused internal parameter entities whose text is "INCLUDE" or
"IGNORE" won't get the warning.
notation-sysid
Warn about notations for which no system identifier could be
generated.
all Warn about conditions that should usually be avoided (in the
opinion of the author). Equivalent to: "mixed", "should",
"default", "undefined", "sgmldecl", "unused-map", "unused-param",
"empty" and "unclosed".
DISABLING WARNINGS
A warning can be disabled by using its name prefixed with "no-". Thus
calling warnings(qw(all no-duplicate)) will enable all warnings except
those about duplicate entity declarations.
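When the sets of enabled and disabled warnings are kept separately, the "no-" prefix can be added programmatically. The warning_flags helper below is a hypothetical convenience function, not part of this module; it lists the enabled names first and the disabled ones afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper: build an argument list for $p->warnings() from
# separate lists of warnings to enable and warnings to disable.
sub warning_flags {
    my (%arg) = @_;
    my @enable  = @{ $arg{enable}  || [] };
    my @disable = map { "no-$_" } @{ $arg{disable} || [] };
    return (@enable, @disable);
}

my @flags = warning_flags(
    enable  => [qw(xml all)],
    disable => [qw(unused-param)],
);
# @flags is now ('xml', 'all', 'no-unused-param')
```

The resulting list can then be passed on with $p->warnings(@flags).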
The following values for "warnings()" disable errors:
no-idref
Do not give an error for an ID reference value which no element has
as its ID. The effect will be as if each attribute declared as an
ID reference value had been declared as a name.
no-significant
Do not give an error when a character that is not a significant
character in the reference concrete syntax occurs in a literal in
the SGML declaration. This may be useful in conjunction with
certain buggy test suites.
no-valid
Do not require the document to be type-valid. This has the effect
of changing the SGML declaration to specify "VALIDITY NOASSERT" and
"IMPLYDEF ATTLIST YES ELEMENT YES". An option of "valid" has the
effect of changing the SGML declaration to specify "VALIDITY TYPE"
and "IMPLYDEF ATTLIST NO ELEMENT NO". If neither "valid" nor
"no-valid" is specified, then the "VALIDITY" and "IMPLYDEF" settings
from the SGML declaration will be used.
XML WARNINGS
The following warnings are turned on for the "xml" warning described
above:
inclusion
Warn about inclusions in element type declarations.
exclusion
Warn about exclusions in element type declarations.
rcdata-content
Warn about RCDATA declared content in element type declarations.
cdata-content
Warn about CDATA declared content in element type declarations.
ps-comment
Warn about comments in parameter separators.
attlist-group-decl
Warn about name groups in attribute declarations.
element-group-decl
Warn about name groups in element type declarations.
pi-entity
Warn about PI entities.
internal-sdata-entity
Warn about internal SDATA entities.
internal-cdata-entity
Warn about internal CDATA entities.
external-sdata-entity
Warn about external SDATA entities.
external-cdata-entity
Warn about external CDATA entities.
bracket-entity
Warn about bracketed text entities.
data-atts
Warn about attribute definition list declarations for notations.
missing-system-id
Warn about external identifiers without system identifiers.
conref
Warn about content reference attributes.
current
Warn about current attributes.
nutoken-decl-value
Warn about attributes with a declared value of NUTOKEN or NUTOKENS.
number-decl-value
Warn about attributes with a declared value of NUMBER or NUMBERS.
name-decl-value
Warn about attributes with a declared value of NAME or NAMES.
named-char-ref
Warn about named character references.
refc
Warn about omitted refc delimiters.
temp-ms
Warn about TEMP marked sections.
rcdata-ms
Warn about RCDATA marked sections.
instance-include-ms
Warn about INCLUDE marked sections in the document instance.
instance-ignore-ms
Warn about IGNORE marked sections in the document instance.
and-group
Warn about AND connectors in model groups.
rank
Warn about ranked elements.
empty-comment-decl
Warn about empty comment declarations.
att-value-not-literal
Warn about attribute values which are not literals.
missing-att-name
Warn about omitted attribute names in start tags.
comment-decl-s
Warn about spaces before the MDC in comment declarations.
comment-decl-multiple
Warn about comment declarations containing multiple comments.
missing-status-keyword
Warn about marked sections without a status keyword.
multiple-status-keyword
Warn about marked sections with multiple status keywords.
instance-param-entity
Warn about parameter entities in the document instance.
min-param
Warn about minimization parameters in element type declarations.
mixed-content-xml
Warn about cases of mixed content which are not allowed in XML.
name-group-not-or
Warn about name groups with a connector different from OR.
pi-missing-name
Warn about processing instructions which don't start with a name.
instance-status-keyword-s
Warn about spaces between DSO and status keyword in marked
sections.
external-data-entity-ref
Warn about references to external data entities in the content.
att-value-external-entity-ref
Warn about references to external data entities in attribute
values.
data-delim
Warn about occurrences of `<' and `&' as data.
explicit-sgml-decl
Warn about an explicit SGML declaration.
internal-subset-ms
Warn about marked sections in the internal subset.
default-entity
Warn about a default entity declaration.
non-sgml-char-ref
Warn about numeric character references to non-SGML characters.
internal-subset-ps-param-entity
Warn about parameter entity references in parameter separators in
the internal subset.
internal-subset-ts-param-entity
Warn about parameter entity references in token separators in the
internal subset.
internal-subset-literal-param-entity
Warn about parameter entity references in parameter literals in the
internal subset.
PROCESSING FILES
In order to start processing of a document and receive events, the
"parse" method must be called. It takes one argument specifying the
path to a file (not a file handle). You must set an event handler using
the "handler" method prior to using this method. The return value of
"parse" is currently undefined.
EVENT HANDLERS
In order to receive data from the parser you need to write an event
handler. For example,
package ExampleHandler;
sub new { bless {}, shift }
sub start_element
{
    my ($self, $elem) = @_;
    printf " * %s\n", $elem->{Name};
}
This handler prints each element name as it is found in the
document. For a typical XHTML document this might result in something
like
* html
* head
* title
* body
* p
* ...
The events closely match those in the generic interface to OpenSP; see
<http://openjade.sf.net/doc/generic.htm> for more information.
Event names are lowercase with underscores separating words, and
property names are capitalized. Arrays are represented
as Perl array references. "Position" information is not passed to the
handler but is available through the "get_location" method, which can
be called from event handlers. Some redundant information has also been
stripped, and the generic identifier of an element is stored in the
"Name" hash entry.
For example, for an EndElementEvent the "end_element" handler gets
called with a hash reference
{
Name => 'gi'
}
The following events are defined:
* appinfo
* processing_instruction
* start_element
* end_element
* data
* sdata
* external_data_entity_ref
* subdoc_entity_ref
* start_dtd
* end_dtd
* end_prolog
* general_entity # set $p->output_general_entities(1)
* comment_decl # set $p->output_comment_decls(1)
* marked_section_start # set $p->output_marked_sections(1)
* marked_section_end # set $p->output_marked_sections(1)
* ignored_chars # set $p->output_marked_sections(1)
* error
* open_entity_change
If the documentation of the generic interface to OpenSP states that
certain data is not valid, it will not be available through this
interface (i.e., the respective key does not exist in the hash ref).
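A slightly larger handler might combine several of these events. The sketch below counts elements by name and accumulates character data. The ElementCounter package is hypothetical, and the "Data" key used for the "data" event is an assumption derived from the capitalization rule described above; only the "Name" key is taken directly from this document. Keys are checked with "exists" because some may be absent.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: counts elements and collects character data.
    package ElementCounter;

    sub new { bless { counts => {}, text => '' }, shift }

    # Called for each start tag; "Name" holds the generic identifier.
    sub start_element {
        my ($self, $elem) = @_;
        $self->{counts}{ $elem->{Name} }++;
    }

    # Called for character data; the "Data" key is an assumption.
    sub data {
        my ($self, $event) = @_;
        $self->{text} .= $event->{Data} if exists $event->{Data};
    }

    # Accessors for the collected results.
    sub count { $_[0]{counts}{ $_[1] } || 0 }
    sub text  { $_[0]{text} }
}
```

Events for which the handler defines no method, such as "end_element" here, are simply not handled by this sketch.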
POSITIONING INFORMATION
Event handlers can call the "get_location" method on the parser object
to retrieve positioning information. The method returns a hash
reference with the following properties:
LineNumber => ..., # line number
ColumnNumber => ..., # column number
ByteOffset => ..., # number of preceding bytes
EntityOffset => ..., # number of preceding bit combinations
EntityName => ..., # name of the external entity
FileName => ..., # name of the file
These can be "undef" or an empty string.
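Because any of these values can be "undef" or empty, a handler that reports positions may want to substitute a placeholder. The following sketch assumes a hypothetical LocatingHandler package that holds a reference to the parser in its "parser" field.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: records where each element starts.
    package LocatingHandler;

    sub new {
        my ($class, $parser) = @_;
        return bless { parser => $parser, seen => [] }, $class;
    }

    # Replace undefined or empty values with a visible placeholder.
    sub _or_unknown {
        my ($value) = @_;
        return (defined $value && length $value) ? $value : '?';
    }

    sub start_element {
        my ($self, $elem) = @_;
        my $loc = $self->{parser}->get_location;
        push @{ $self->{seen} }, sprintf '%s at %s:%s',
            $elem->{Name},
            _or_unknown($loc->{LineNumber}),
            _or_unknown($loc->{ColumnNumber});
    }

    # List of "name at line:column" strings recorded so far.
    sub seen { @{ $_[0]{seen} } }
}
```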
POST-PROCESSING ERROR MESSAGES
OpenSP returns error messages as a single string rather than as
individual components such as line numbers or message text.
The "split_message" method on the parser object can be used to
post-process these error message strings as reliably as possible. It can be
called, for example, from an error event handler if the parser object is
accessible:
sub error
{
    my $self = shift;
    my $erro = shift;
    my $mess = $self->{parser}->split_message($erro);
}
See the documentation of "split_message" in the
SGML::Parser::OpenSP::Tools documentation.
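One possible way to make the parser object accessible is to hand it to the handler's constructor. The sketch below collects the results of "split_message" for later inspection. The ErrorCollector package and its field names are hypothetical, and the structure of the value returned by "split_message" is deliberately left untouched here.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: collects post-processed error messages.
    package ErrorCollector;

    sub new {
        my ($class, $parser) = @_;
        return bless { parser => $parser, errors => [] }, $class;
    }

    # Called for each error event reported by the parser.
    sub error {
        my ($self, $event) = @_;
        my $message = $self->{parser}->split_message($event);
        push @{ $self->{errors} }, $message;
    }

    # All messages collected so far, in the order they were reported.
    sub errors { @{ $_[0]{errors} } }
}
```

After parsing, the caller can iterate over $handler->errors and format the messages as needed.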
UNICODE SUPPORT
All strings returned from event handlers and helper routines are UTF-8
encoded with the UTF-8 flag turned on. Helper functions like
"split_message" expect (but don't check) that string arguments are
UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper
functions is undefined when you pass unexpected input, so such input
should be avoided.
"parse" has limited support for binary input, but the binary input must
be compatible with OpenSP's generic interface requirements and you must
specify the encoding through means available to OpenSP to enable it to
properly decode the binary input. Any encoding meta data about such
binary input specific to Perl (such as encoding disciplines for file
handles when you pass a file descriptor) will be ignored. For more
specific information refer to the OpenSP manual.
· <http://openjade.sourceforge.net/doc/sysid.htm>
· <http://openjade.sourceforge.net/doc/charset.htm>
ENVIRONMENT VARIABLES
OpenSP supports a number of environment variables to control specific
processing aspects such as "SGML_SEARCH_PATH" or "SP_CHARSET_FIXED".
Portable applications need to ensure that these are set prior to
loading the OpenSP library into memory which happens when the XS code
is loaded. This means you need to wrap the code into a "BEGIN" block:
BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
use SGML::Parser::OpenSP;
# ...
Otherwise changes to the environment might not propagate to OpenSP.
This applies specifically to Win32 systems.
SGML_SEARCH_PATH
See <http://openjade.sourceforge.net/doc/sysid.htm>.
SP_HTTP_USER_AGENT
The "User-Agent" header for HTTP requests.
SP_HTTP_ACCEPT
The "Accept" header for HTTP requests.
SP_MESSAGE_FORMAT
Enable run time selection of message format. The value is one of "XML",
"NONE", "TRADITIONAL". Whether this will have an effect depends on
a compile time setting which might not be enabled in your OpenSP
build. This module assumes that no such support was compiled in.
SGML_CATALOG_FILES
SP_USE_DOCUMENT_CATALOG
See <http://openjade.sourceforge.net/doc/catalog.htm>.
SP_SYSTEM_CHARSET
SP_CHARSET_FIXED
SP_BCTF
SP_ENCODING
See <http://openjade.sourceforge.net/doc/charset.htm>.
Note that you can use the "search_dirs" method instead of using
"SGML_SEARCH_PATH" and the "catalogs" method instead of using
"SGML_CATALOG_FILES" and attributes on storage object specifications
for "SP_BCTF" and "SP_ENCODING" respectively. For example, if
"SP_CHARSET_FIXED" is set to 1 you can use
$p->parse("<OSFILE encoding='UTF-8'>example.xhtml");
to process "example.xhtml" using the "UTF-8" character encoding.
KNOWN ISSUES
OpenSP must be compiled with "SP_MULTI_BYTE" defined and with
"SP_WIDE_SYSTEM" undefined; otherwise this module will break at runtime
or fail to compile.
BUG REPORTS
Please report bugs in this module via
<http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>
Please report bugs in OpenSP via
<http://sf.net/tracker/?group_id=2115&atid=102115>
Please send comments and questions to the spo-devel mailing list; see
<http://lists.sf.net/lists/listinfo/spo-devel> for details.
SEE ALSO
· <http://openjade.sf.net/doc/generic.htm>
· <http://openjade.sf.net/>
· <http://sf.net/projects/spo/>
AUTHORS
Terje Bless <link@cpan.org> wrote version 0.01.
Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02+.
COPYRIGHT AND LICENSE
Copyright (c) 2006-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
This module is licensed under the same terms as Perl itself.
perl v5.14.1 2008-06-29 SGML::Parser::OpenSP(3)