<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from ../festival.texi on 2 August 2001 -->

<TITLE>Festival Speech Synthesis System - 15  Text analysis</TITLE>
</HEAD>
<BODY bgcolor="#ffffff">
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_14.html">previous</A>, <A HREF="festival_16.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC56" HREF="festival_toc.html#TOC56">15  Text analysis</A></H1>



<H2><A NAME="SEC57" HREF="festival_toc.html#TOC57">15.1  Tokenizing</A></H2>

<P>
<A NAME="IDX236"></A>
<A NAME="IDX237"></A>
<A NAME="IDX238"></A>
A crucial stage in text processing is the initial tokenization of text.
A <EM>token</EM> in Festival is an atom separated by whitespace in a
text file (or string).  If punctuation for the current language is
defined, characters matching that punctuation are removed from the
beginning and end of a token and held as features of the token.  The
default list of characters to be treated as whitespace is defined as

<PRE>
(defvar token.whitespace " \t\n\r")
</PRE>

<P>
The default sets of punctuation and prepunctuation characters are

<PRE>
(defvar token.punctuation "\"'`.,:;!?(){}[]")
(defvar token.prepunctuation "\"'`({[")
</PRE>

<P>
These are declared in <TT>`lib/token.scm'</TT> but may be changed
for different languages, text modes etc.

</P>
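<P>
To make the stripping rule concrete, here is a minimal sketch in Python
(not Festival's own code) of splitting on the default whitespace set and
peeling punctuation off both ends of each atom.  The function names and
the dictionary keys are illustrative choices, and how Festival treats an
atom made up entirely of punctuation is not modelled here.
</P>

<PRE>
```python
WHITESPACE = " \t\n\r"
PUNCTUATION = "\"'`.,:;!?(){}[]"
PREPUNCTUATION = "\"'`({["


def split_on(text, whitespace):
    """Split text into atoms at any run of the given whitespace characters."""
    atoms, current = [], ""
    for ch in text:
        if ch in whitespace:
            if current:
                atoms.append(current)
            current = ""
        else:
            current += ch
    if current:
        atoms.append(current)
    return atoms


def tokenize(text):
    """Return one dict per token: bare name plus the punctuation removed."""
    tokens = []
    for atom in split_on(text, WHITESPACE):
        start = 0
        while start != len(atom) and atom[start] in PREPUNCTUATION:
            start += 1
        end = len(atom)
        while end != start and atom[end - 1] in PUNCTUATION:
            end -= 1
        tokens.append({
            "name": atom[start:end],
            "prepunctuation": atom[:start],  # illustrative key name
            "punc": atom[end:],              # illustrative key name
        })
    return tokens
```
</PRE>

<P>
For the input <CODE>He said, "stop!"</CODE> this yields three tokens named
<CODE>He</CODE>, <CODE>said</CODE> and <CODE>stop</CODE>, with the comma and
the quotation marks kept alongside the names rather than discarded.
</P>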


<H2><A NAME="SEC58" HREF="festival_toc.html#TOC58">15.2  Token to word rules</A></H2>

<P>
<A NAME="IDX239"></A>
Tokens are further analysed into lists of words.  A word
is an atom that can be given a pronunciation by the lexicon (or
letter to sound rules).  A token may give rise to a number
of words or none at all.

</P>
<P>
For example, the basic tokens

<PRE>
This pocket-watch was made in 1983.
</PRE>

<P>
would give a word relation of

<PRE>
this pocket watch was made in nineteen eighty three
</PRE>

<P>
Because the relationship between tokens and words is in some cases
complex, a user function may be specified for translating tokens into
words.  This is designed to deal with things like numbers, email
addresses, and other non-obvious pronunciations of tokens as zero or
more words.  Currently a builtin function
<CODE>builtin_english_token_to_words</CODE> offers much of the necessary
functionality for English but a user may further customize this.

</P>
<P>
If the user defines a function <CODE>token_to_words</CODE> which takes two
arguments: a token item and a token name, it will be called by the
<CODE>Token_English</CODE> and <CODE>Token_Any</CODE> modules.  A substantial
example is given as <CODE>english_token_to_words</CODE> in
<TT>`festival/lib/token.scm'</TT>.

</P>
<P>
The <CODE>english_token_to_words</CODE> function is quite elaborate and
covers most of the common multi-word tokens in English including numbers,
money symbols, Roman numerals, dates, times, plurals of symbols, number
ranges, telephone numbers and various other symbols.

</P>
<P>
Let us look at the treatment of one particular phenomenon which shows
the use of these rules.  Consider the expression "$12 million" which
should be rendered as the words "twelve million dollars".  Note that the
word "dollars", which is introduced by the "$" sign, ends up after the end
of the expression.  There are two cases we need to deal with as there are
two tokens.  The first clause in the <CODE>cond</CODE> handles the first
token: it checks that the current token name is a money value and that
the following token is a magnitude (million, billion, trillion, zillion
etc.).  If that is the case the "$" is removed and the remaining number
is pronounced by calling the builtin token to word function.  The
second clause deals with the second token.  It confirms the previous token
is a money value (the same regular expression as before) and then
returns the word followed by the word "dollars".  If the token is in
neither of these forms then the builtin function is called.

<PRE>
(define (token_to_words token name)
"(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
 (cond
  ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches (item.feat token "n.name") ".*illion.?"))
   (builtin_english_token_to_words token (string-after name "$")))
  ((and (string-matches (item.feat token "p.name")
                          "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches name ".*illion.?"))
   (list 
    name
    "dollars"))
  (t
   (builtin_english_token_to_words token name))))
</PRE>

<P>
It is valid to make some conditions return no words, though some care
should be taken with that, as punctuation information may no longer be
available to later processing if there are no words related to
a token.

</P>


<H2><A NAME="SEC59" HREF="festival_toc.html#TOC59">15.3  Homograph disambiguation</A></H2>

<P>
<A NAME="IDX240"></A>
Not all tokens can be rendered as words easily.  Their context may affect
the way they are to be pronounced.  For example in the 
utterance

<PRE>
On May 5 1985, 1985 people moved to Livingston.
</PRE>

<P>
the two "1985" tokens should be pronounced differently: the first as
a year, "nineteen eighty five", and the second as a quantity, "one
thousand nine hundred and eighty five".  Numbers may also be pronounced
as ordinals: the "5" above should be "fifth" rather than
"five".

</P>
<P>
Also, the pronunciation of certain words cannot simply be found from
their orthographic form alone.  Linguistic part of speech tags help to
disambiguate a large class of homographs, e.g. "lives".  A part of
speech tagger is included in Festival and discussed in section <A HREF="festival_16.html#SEC62">16  POS tagging</A>.  But even part of speech isn't sufficient in a number of
cases.  Words such as "bass", "wind", "bow" etc. cannot be distinguished
by part of speech alone; some semantic information is also required.  As
full semantic analysis of text is outwith the realms of Festival's
capabilities, some other method for disambiguation is required.

</P>
<P>
Following the work of <CITE>yarowsky96</CITE> we have included a method
for identified tokens to be further labelled with extra tags to
help identify their type.  Yarowsky uses <EM>decision lists</EM> to
identify different types for homographs.  Decision lists are
a restricted form of decision trees which have some advantages
over full trees: they are easier to build, and Yarowsky has shown
them to be adequate for typical homograph resolution.

</P>


<H3><A NAME="SEC60" HREF="festival_toc.html#TOC60">15.3.1  Using disambiguators</A></H3>

<P>
Festival offers a method for assigning a <CODE>token_pos</CODE> feature to
each token.  It does so using Yarowsky-type disambiguation techniques.
A list of disambiguators can be provided in the variable
<CODE>token_pos_cart_trees</CODE>.  Each disambiguator consists of a regular
expression and a CART tree (which may be a decision list as they have the
same format).  If a token matches the regular expression the CART tree
is applied to the token and the resulting class is assigned 
to the token via the feature <CODE>token_pos</CODE>.  This is done
by the <CODE>Token_POS</CODE> module.

</P>
<P>
For example, the following disambiguator distinguishes the two readings
of "St" (street and saint) and of "Dr" (doctor and drive).

<PRE>
   ("\\([dD][Rr]\\|[Ss][tT]\\)"
    ((n.name is 0)
     ((p.cap is 1)
      ((street))
      ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
       ((street))
       ((title))))
     ((punc matches ".*,.*")
      ((street))
      ((p.punc matches ".*,.*")
       ((title))
       ((n.cap is 0)
        ((street))
        ((p.cap is 0)
         ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
          ((street))
          ((title)))
         ((pp.name matches "[1-9][0-9]+")
          ((street))
          ((title)))))))))
</PRE>

<P>
Note that these only assign values for the feature <CODE>token_pos</CODE> and
do nothing more.  You must have a related token to word rule that
interprets this feature value and does the required translation.  For
example the corresponding token to word rule for the above disambiguator
is

<PRE>
  ((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
   (if (string-equal (item.feat token "token_pos") "street")
       (if (string-matches name "[dD][rR]")
           (list "drive")
           (list "street"))
       (if (string-matches name "[dD][rR]")
           (list "doctor")
           (list "saint"))))
</PRE>
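
<P>
How a tree of this shape is walked, and how the resulting class then
selects the word, can be sketched outside Festival.  The Python below is
an illustrative model only, not Festival's implementation.  It makes
three assumptions: a feature that is absent reads as <CODE>0</CODE> (as
the test <CODE>n.name is 0</CODE> above suggests), <CODE>matches</CODE>
must match the whole value, and the tree is abridged so that the deepest
<CODE>p.cap</CODE> and <CODE>pp.name</CODE> branches collapse to
<CODE>title</CODE>.  The names <CODE>apply_tree</CODE> and
<CODE>to_words</CODE> are hypothetical.
</P>

<PRE>
```python
import re

# Ordinal-looking previous token such as "42nd".  Python regex syntax, so the
# groups are written ( | ) rather than the escaped form used in the Scheme tree.
ORDINAL = r"[0-9]*(1[sS][tT]|2[nN][dD]|3[rR][dD]|[0-9][tT][hH])"

# A question node is (question, yes_subtree, no_subtree); a leaf is (class,).
# Abridged: the deepest p.cap / pp.name branches are collapsed to "title".
TREE = (
    ("n.name", "is", "0"),
    (("p.cap", "is", "1"),
     ("street",),
     (("p.name", "matches", ORDINAL), ("street",), ("title",))),
    (("punc", "matches", ".*,.*"),
     ("street",),
     (("p.punc", "matches", ".*,.*"),
      ("title",),
      (("n.cap", "is", "0"), ("street",), ("title",)))),
)


def apply_tree(tree, feats):
    """Walk the tree for one token and return the class at the leaf reached."""
    if len(tree) == 1:
        return tree[0]
    (name, op, value), yes, no = tree
    actual = str(feats.get(name, "0"))  # assumption: absent features read as 0
    if op == "is":
        hit = actual == value
    elif op == "matches":
        # assumption: the pattern must match the whole value
        hit = re.fullmatch(value, actual) is not None
    else:
        raise ValueError("unsupported operator: " + op)
    return apply_tree(yes if hit else no, feats)


def to_words(name, token_pos):
    """Mirror of the token to word rule: pick the word from name and class."""
    is_dr = name.lower() == "dr"
    if token_pos == "street":
        return ["drive" if is_dr else "street"]
    return ["doctor" if is_dr else "saint"]
```
</PRE>

<P>
A decision list is the special case in which every "yes" branch is a
leaf, so the questions are simply tried in order until one succeeds.
</P>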



<H3><A NAME="SEC61" HREF="festival_toc.html#TOC61">15.3.2  Building disambiguators</A></H3>

<P>
Festival offers some support for building disambiguation trees.  The
basic method is to find all occurrences of a homographic token in a large
text database, label each occurrence into classes, extract appropriate
context features for these tokens and finally build a classification tree
or decision list based on the extracted features.

</P>
<P>
The extraction and building of trees is not yet a fully automated
process in Festival but the file <TT>`festival/examples/toksearch.scm'</TT>
shows some basic Scheme code we use for extracting tokens from very
large collections of text.

</P>
<P>
The function <CODE>extract_tokens</CODE> does the real work.  It reads the
given file, token by token into a token stream.  Each token is tested
against the desired tokens and if there is a match the named features
are extracted.  The token stream will be extended to provide the
necessary context.  Note that only some features will make any sense in
this situation.  There is only a token relation so referring to words,
syllables etc. is not productive.

</P>
<P>
In this example, databases are identified by a file that lists all
the files in the text database.  Its name is expected to be
<TT>`bin/DBNAME.files'</TT> where <CODE>DBNAME</CODE> is the name of
the database.  The file should contain a list
of filenames in the database, e.g. for the Gutenberg texts the
file <TT>`bin/Gutenberg.files'</TT> contains

<PRE>
gutenberg/etext90/bill11.txt
gutenberg/etext90/const11.txt
gutenberg/etext90/getty11.txt
gutenberg/etext90/jfk11.txt
...
</PRE>

<P>
Extracting the tokens is typically done in two passes.  The first pass
extracts the context (I've used 5 tokens either side).  It records
the file and position, so that the token can be identified, together
with the word in context.

</P>
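<P>
The first pass can be pictured with a small Python sketch.  It is not the
code in <TT>`festival/examples/toksearch.scm'</TT>.  The function name
<CODE>extract_contexts</CODE>, the use of a token index as the position,
and the simple whitespace split are all assumptions made for
illustration.
</P>

<PRE>
```python
import re


def extract_contexts(filename, text, pattern, width=5):
    """Return (file, position, left context, token, right context) per hit."""
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        bare = tok.strip("\"'`.,:;!?(){}[]")  # compare without punctuation
        if re.fullmatch(pattern, bare):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((filename, i, left, tok, right))
    return hits
```
</PRE>

<P>
Each returned tuple carries enough to find the occurrence again (file and
position) and enough surrounding text for a person to decide its class.
</P>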
<P>
Next, those examples should be labelled with a small set of classes
which identify the type of the token.  For example, for a token
like "Dr" the classes are whether it is a person's title or a street
identifier.  Note that hand-labelling can be laborious, though it is
surprising how few tokens of particular types actually exist in 62 million
words.

</P>
<P>
The next task is to extract the tokens with the features that will best
distinguish the particular token.  In our "Dr" case this will involve
punctuation around the token, capitalisation of surrounding tokens etc.
After extracting the distinguishing tokens you must line up the labels
with these extracted features.  It would be easier to extract both the
context and the desired features at the same time but experience shows
that in labelling, more appropriate features come to mind that will
distinguish classes better and you don't want to have to label twice.

</P>
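<P>
Lining up the hand labels with the separately extracted features amounts
to a join on file and position.  The following Python sketch shows one
way to do it.  The function name <CODE>line_up</CODE> and the output
layout (one whitespace-separated line per example, label first) are
assumptions, so check the <TT>`wagon'</TT> documentation for the data
format it actually expects.
</P>

<PRE>
```python
def line_up(labels, features):
    """Join labels and features on (file, position).

    labels:   dict mapping (file, position) to a class name
    features: list of (file, position, list of feature values)
    Returns the joined lines and the keys that had no label.
    """
    rows, missing = [], []
    for filename, pos, feats in features:
        key = (filename, pos)
        if key in labels:
            fields = [labels[key]] + [str(f) for f in feats]
            rows.append(" ".join(fields))
        else:
            missing.append(key)  # extracted but never labelled
    return rows, missing
```
</PRE>

<P>
Reporting the unlabelled keys separately makes it easy to spot
occurrences that the second pass found but the labelling pass missed.
</P>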
<P>
Once a set of feature vectors, each consisting of the label and features,
is created it is easy to use <TT>`wagon'</TT> to create the corresponding
decision tree or decision list.  <TT>`wagon'</TT> supports both decision
trees and decision lists, so it may be worth experimenting to find out
which gives the best results on some held out test data.  It appears that
decision trees are typically better, but are often much larger, and the
size does not always justify the sometimes only slightly better results.

</P>
</BODY>
</HTML>