fst-compiler man page on DragonFly

Man page or keyword search:  
man Server   44335 pages
apropos Keyword Search (all sections)
Output format
DragonFly logo
[printable version]

fst-compiler(1)			 fst-compiler		       fst-compiler(1)

NAME
       fst-compiler fst-compiler-utf8 - Two compilers for SFST programs

SYNOPSIS
       fst-compiler grammar-file [ output-file ]
       fst-compiler-utf8 grammar-file [ output-file ]

OPTIONS
       -c     Store  the  transducer  in  compact format which is used by fst-
	      infl2.

       -l     Store the transducer in lowmem format.

       -s     Switch surface and analysis layer of the transducer. You have to
	      use  this switch in order to use fst-infl (fst-infl2, fst-infl3)
	      for generation rather than analysis.

DESCRIPTION
       fst-compiler is a compiler for  finite-state  transducer	 programs.  It
       generates  a  minimized	finite state transducer which can be used with
       fst-mor, fst-infl, fst-print, fst-compare, fst-parse, and  fst-lattice.
       The  compact  transducer	 representation which is generated with the -c
       flag, is supported by fst-infl2, fst-train, and fst-match.  The memory-
       efficient  transducer  representation  which  is	 generated with the -l
       flag, is only supported by fst-infl3.

       The first program argument is the name of a  file  which	 contains  the
       transducer  program.  The  programming language is described below. The
       second argument is the name of the file to which the  resulting	trans‐
       ducer  will be written in binary form. If a second argument is missing,
       the output will be written to stdout.

       fst-compiler-utf8 differs  from	fst-compiler  only  in	the  character
       encoding.  fst-compiler-utf8 supports UTF8 encoding of the source files
       whereas fst-compiler is to be  used  for	 8-Bit	character  codes  like
       latin1  which are an extension of the ASCII code. Information about the
       encoding is stored in the transducer files and used by the  other  SFST
       programs.

FILE FORMATS
       A transducer program consists of an (optional) sequence of alphabet and
       variable definitions followed by a single transducer  expression	 which
       defines the result transducer.

       Alphabet

       An  alphabet  definition consists of the keyword ALPHABET followed by =
       and some transducer expression e.g.

	   ALPHABET = [a-z]:[A-Z]

       This command redefines the alphabet as the set of symbol	 pairs	occur‐
       ring  on	 the  transitions  of the transducer. Occurrences of two-level
       operators, negation operators and unquoted periods always  have	to  be
       preceded by an alphabet definition.

       Variables

       There  are  two different types of variables.  Symbol set variables are
       enclosed by hash signs (#) and take symbol  sequences  (see  below)  as
       values:

	   #UC# = A-Z
	   #LC# = a-z

       Transducer  variables  are enclosed by dollar signs and take transducer
       expressions as values:

	   $MAP$ = [a-z]:[A-Z]+
	   $MAP$ = [#LC#]:[#UC#]+

       Variables whose name starts with the symbol `=' are  special  agreement
       variables.  If  an agreement variable occurs more than once in a trans‐
       ducer expression, it will always have the same value. Consider the fol‐
       lowing transducer program:

	   $=1$ = [abc]
	   $=1$ X $=1$

       The  result  transducer	recognizes the strings aXa, bXb, and cXc. Only
       acyclic transducers (i.e. transducers with a finite set of string  map‐
       pings) can be assigned to agreement variables.

       Symbols

       A symbol is either

       - a single character like A s 5,

       - a quoted character like \* or \_,

       - a multi-character symbol like <X> or <ab.c5> (which is always
	 enclosed in angle brackets) or

       - a backslash followed by a number which is the numeric code of the
	 designated character

       - the null symbol <>.

       Symbol sequence

       A  symbol sequence is a sequence of characters, multi-character symbols
       and character ranges, e.g. a-z \. <x>.

       symbol range

       A symbol range is either

       - a single symbol

       - a symbol sequence enclosed in square brackets like [A-Za-z] or

       - a symbol sequence starting with ^ and	enclosed  in  square  brackets
       like [^A-Za-z] (designating the complement of [a-zA-Z]) or

       - the period (which represents any symbol from the alphabet)

       Transducer expressions

       A transducer expression (TE) is recursively defined as follows:

       - A pair of two symbol ranges separated by a colon is a TE.

	 [a-z]:[a-Z]

       - A single symbol range like [a-z] is a TE.
	 It is a short form for [a-z]:[a-z].

       - Two symbol sequences enclosed in braces and separated by a colon are
	a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d c:e <>:f.

       - X Y is a TE if X and Y are TEs.
	 (Blanks are ignored unless they are quoted.)

       - (X) is a TE if X is a TE.

       -  X  op	 is a TE is X is a TE and op is either * (Kleene's star opera‐
       tor), +
	(Kleene's plus operator), or ? (optionality operator)

       - op X is a TE is X is a TE and op is either ! (negation operator), ^
	(target language extraction operator), _ (source  language  extraction
	operator), or ^_ (source and target switch operator).

       - X op Y is a TE is X and Y are TEs and op is either & (conjunction
	operator),  |  (disjunction operator), || (composition operator), or -
	(subtraction operator)

       - L x op y R is a TE if L and R are TEs, x and y are symbol ranges and
	op is either => (two-level restriction), <= (two-level	coercion),  or
	<=> (two-level restriction and coercion).

       - X op L__R is a TE if X, L and R are TEs and op is either ^-> (upward
	replacement),  _->  (downward replacement), /-> (leftward replacement)
	or \-> (rightward replacement). Furthermore, L and R must  define  au‐
	tomata (i.e. which map their strings onto themselves). These operators
	correspond to Karttunen's replace operators. If the arrow is  followed
	by a question mark (?), the replacement becomes optional.

       - X << l is a TE if X is a TE, and l is either of the form
	a  or the form a:b where a and b are single characters or symbols. The
	result is a transducer where l was freely inserted into X. The	trans‐
	ducer ab << c for instance is equivalent to c*ac*bc*.

       - X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN and R1
	through	 RN  are  TEs,	and  op is either => (general restriction), <=
	(general coercion), ^=> (general surface  restriction),	 ^<=  (general
	surface	 coercion),  ^<=>  (general surface restriction and coercion),
	_=> (general deep restriction),	 _<=  (general	deep  coercion),  _<=>
	(general  deep restriction and coercion). (These operators were imple‐
	mented following a suggestion by Anssi Yli-Jyra.)

       - "fname" is a TE. The compiler reads the file named fname and turns
	it into a transducer of the form line1|line2|line3|... where linex  is
	the  x-th  line	 of  the  file.	 All characters other than : and \ are
	interpreted literally (i.e. not as operators). This  TE	 is  typically
	used e.g. to read morpheme list from a file.

       - "<fname>" is a TE. The compiler reads a pre-compiled transducer from
	the file named fname. This

       Further Features

       Comments	 start with the symbol % and extend up to the end of the line.
       Blanks are ignored unless they are quoted. Expressions terminate at the
       end  of	a line unless the end of line is preceded by a backslash.  The
       command

       #include "fname"

       can be used to insert source code from a file named fname.  The command

       RE >> "fname"

       stores the regular expression RE in the file fname.  The command

       #use hopcroft

       tells the compiler to use the Hopcroft minimisation algorithm from  now
       on, and

       #use default

       switsches back to the default minimisation algorithm (Brzozowski).  The
       command

EXAMPLE
       Here is an example of a simple transducer program.  Assuming  that  the
       file "adj-stems" contains the two lines

	  easy
	  late
	  big

       this  transducer	 will correctly analyze the adjective forms easy, eas‐
       ier, easiest and late, later, and latest.

       ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>

       $R$ = y<=>i (<ADJ>:<> e)

       $R2$ = e<=><> (<ADJ>:<> e)

       $R$ = $R$ & $R2$

       $Stems$ = "adj-stems"

       $S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

       $S$ || $R$

EXIT STATUS
       fst-compiler returns 0 unless some error occurs.

BUGS
       The compiler gets the operator precedence wrong in  case	 of  two-level
       rules and interprets the expression "ab c<=>d ef" as "a(b c<=>d (ef))".
       Therefore, you should always surround the  left	context	 of  two-level
       rules with parenthesis: (ab) c<=>d (ef)

SEE ALSO
       fst-mor,	 fst-infl,  fst-infl2, fst-infl3, fst-print, fst-compact, fst-
       parse, fst-compare, fst-compact, fst-lowmem, fst-lattice, fst-train

AUTHOR
       Helmut Schmid, Institute for Computational Linguistics,	University  of
       Stuttgart,  Email: schmid@ims.uni-stuttgart.de, This software is avail‐
       able under the GNU Public License.

				 December 2004		       fst-compiler(1)
[top]

List of man pages available for DragonFly

Copyright (c) for man pages and the logo by the respective OS vendor.

For those who want to learn more, the polarhome community provides shell access and support.

[legal] [privacy] [GNU] [policy] [cookies] [netiquette] [sponsors] [FAQ]
Tweet
Polarhome, production since 1999.
Member of Polarhome portal.
Based on Fawad Halim's script.
....................................................................
Vote for polarhome
Free Shell Accounts :: the biggest list on the net