fstcompiler, fstcompilerutf8  Two compilers for SFST programs
fstcompiler grammarfile [
outputfile ]
fstcompilerutf8 grammarfile [
outputfile ]
 c
 Store the transducer in compact format which is used by
fstinfl2.
 l
 Store the transducer in lowmem format.
 s
 Switch surface and analysis layer of the transducer. You
have to use this switch in order to use fstinfl (fstinfl2,
fstinfl3) for generation rather than analysis.
fstcompiler is a compiler for finitestate transducer programs. It
generates a minimized finite state transducer which can be used with
fstmor, fstinfl, fstprint, fstcompare,
fstparse, and
fstlattice. The compact transducer
representation which is generated with the c flag, is supported by
fstinfl2, fsttrain, and
fstmatch. The memoryefficient
transducer representation which is generated with the l flag, is only
supported by
fstinfl3.
The first program argument is the name of a file which contains the transducer
program. The programming language is described below. The second argument is
the name of the file to which the resulting transducer will be written in
binary form. If a second argument is missing, the output will be written to
stdout.
fstcompilerutf8 differs from
fstcompiler only in the character
encoding.
fstcompilerutf8 supports UTF8 encoding of the source files
whereas
fstcompiler is to be used for 8Bit character codes like
latin1 which are an extension of the ASCII code. Information about the
encoding is stored in the transducer files and used by the other SFST
programs.
A transducer program consists of an (optional) sequence of
alphabet and
variable definitions followed by a single
transducer expression
which defines the result transducer.
Alphabet
An alphabet definition consists of the keyword ALPHABET followed by = and some
transducer expression e.g.

ALPHABET = [az]:[AZ]
This command redefines the alphabet as the set of symbol pairs occurring on the
transitions of the transducer. Occurrences of twolevel operators, negation
operators and unquoted periods always have to be preceded by an alphabet
definition.
Variables
There are two different types of variables.
Symbol set variables are
enclosed by hash signs (#) and take symbol sequences (see below) as values:

#UC# = AZ

#LC# = az
Transducer variables are enclosed by dollar signs and take transducer
expressions as values:

$MAP$ = [az]:[AZ]+

$MAP$ = [#LC#]:[#UC#]+
Variables whose name starts with the symbol `=' are special
agreement
variables. If an agreement variable occurs more than once in a transducer
expression, it will always have the same value. Consider the following
transducer program:

$=1$ = [abc]

$=1$ X $=1$
The result transducer recognizes the strings aXa, bXb, and cXc. Only acyclic
transducers (i.e. transducers with a finite set of string mappings) can be
assigned to agreement variables.
Symbols
A symbol is either
 a single character like A s 5,
 a quoted character like \* or \_,
  a multicharacter symbol like <X> or <ab.c5>
(which is always
 enclosed in angle brackets) or
  a backslash followed by a number which is the numeric
code of the
 designated character
 the null symbol <>.
Symbol sequence
A symbol sequence is a sequence of characters, multicharacter symbols and
character ranges, e.g. az \. <x>.
symbol range
A symbol range is either
 a single symbol
 a symbol sequence enclosed in square brackets like [AZaz] or
 a symbol sequence starting with ^ and enclosed in square brackets like
[^AZaz] (designating the complement of [azAZ]) or
 the period (which represents any symbol from the alphabet)
Transducer expressions
A transducer expression (TE) is recursively defined as follows:
  A pair of two symbol ranges separated by a colon is a
TE.

[az]:[aZ]
  A single symbol range like [az] is a TE.
 It is a short form for [az]:[az].
  Two symbol sequences enclosed in braces and separated by
a colon are
 a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f 
a:d c:e <>:f.
  X Y is a TE if X and Y are TEs.
 (Blanks are ignored unless they are quoted.)
  (X) is a TE if X is a TE.
  X op is a TE is X is a TE and op is either * (Kleene's
star operator), +
 (Kleene's plus operator), or ? (optionality operator)
  op X is a TE is X is a TE and op is either ! (negation
operator), ^
 (target language extraction operator), _ (source language
extraction operator), or ^_ (source and target switch operator).
  X op Y is a TE is X and Y are TEs and op is either &
(conjunction
 operator),  (disjunction operator),  (composition
operator), or  (subtraction operator)
  L x op y R is a TE if L and R are TEs, x and y are symbol
ranges and
 op is either => (twolevel restriction), <=
(twolevel coercion), or <=> (twolevel restriction and
coercion).
  X op L__R is a TE if X, L and R are TEs and op is either
^> (upward
 replacement), _> (downward replacement), />
(leftward replacement) or \> (rightward replacement). Furthermore, L
and R must define automata (i.e. which map their strings onto themselves).
These operators correspond to Karttunen's replace operators. If the arrow
is followed by a question mark (?), the replacement becomes optional.
  X << l is a TE if X is a TE, and l is either of the
form
 a or the form a:b where a and b are single characters or
symbols. The result is a transducer where l was freely inserted into X.
The transducer ab << c for instance is equivalent to c*ac*bc*.
  X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN
and R1
 through RN are TEs, and op is either => (general
restriction), <= (general coercion), ^=> (general surface
restriction), ^<= (general surface coercion), ^<=> (general
surface restriction and coercion), _=> (general deep restriction),
_<= (general deep coercion), _<=> (general deep restriction and
coercion). (These operators were implemented following a suggestion by
Anssi YliJyra.)
  "fname" is a TE. The compiler reads the file
named fname and turns
 it into a transducer of the form line1line2line3...
where linex is the xth line of the file. All characters other than : and
\ are interpreted literally (i.e. not as operators). This TE is typically
used e.g. to read morpheme list from a file.
  "<fname>" is a TE. The compiler reads a
precompiled transducer from
 the file named fname. This
Further Features
Comments start with the symbol % and extend up to the end of the line. Blanks
are ignored unless they are quoted. Expressions terminate at the end of a line
unless the end of line is preceded by a backslash. The command
 #include "fname"
can be used to insert source code from a file named fname. The command
 RE >> "fname"
stores the regular expression RE in the file fname. The command
 #use hopcroft
tells the compiler to use the Hopcroft minimisation algorithm from now on, and
 #use default
switches back to the default minimisation algorithm (Brzozowski). The command
Here is an example of a simple transducer program. Assuming that the file
"adjstems" contains the two lines
easy late big
this transducer will correctly analyze the adjective forms easy, easier, easiest
and late, later, and latest.
ALPHABET = [azAZ] y:i e:<> <ADJ>:<>
$R$ = y<=>i (<ADJ>:<> e)
$R2$ = e<=><> (<ADJ>:<> e)
$R$ = $R$ & $R2$
$Stems$ = "adjstems"
$S$ = $Stems$ <ADJ>
(<pos>:<><cmp>:{er}<sup>:{est})
$S$  $R$
fstcompiler returns 0 unless some error occurs.
The compiler gets the operator precedence wrong in case of twolevel rules and
interprets the expression "ab c<=>d ef" as "a(b
c<=>d (ef))". Therefore, you should always surround the left
context of twolevel rules with parenthesis: (ab) c<=>d (ef)
fstmor, fstinfl, fstinfl2, fstinfl3, fstprint, fstcompact, fstparse,
fstcompare, fstcompact, fstlowmem, fstlattice, fsttrain
Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart,
Email: schmid@ims.unistuttgart.de, This software is available under the GNU
Public License.