This is festival.info, produced by Makeinfo version 3.12h from
festival.texi.

   This file documents the `Festival' Speech Synthesis System, a
general text to speech system for making your computer talk and for
developing new synthesis techniques.

   Copyright (C) 1996-2001 University of Edinburgh

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the authors.


File: festival.info,  Node: Emacs interface,  Next: Phonesets,  Prev: XML/SGML mark-up,  Up: Top

Emacs interface
***************

   One easy method of using Festival is via an Emacs interface that
allows selection of text regions to be sent to Festival for rendering
as speech.

   `festival.el' provides a new minor mode which offers an extra menu
(in emacs-19 and 20) with options for saying a selected region, or a
whole buffer, as well as various general control functions.  To use
this you must install `festival.el' in a directory where Emacs can
find it, then add the following lines to the `.emacs' in your home
directory.
     (autoload 'say-minor-mode "festival" "Menu for using Festival." t)
     (say-minor-mode t)
   Successive calls to `say-minor-mode' will toggle the minor mode,
switching the `say' menu on and off.

   Note that the optional voice selection offered by the language
sub-menu is not sensitive to the actual voices supported by your
Festival installation.  Hand customization of the `festival.el' file is
required.  Thus some voices may appear in your menu that your Festival
doesn't support, and some voices supported by your Festival may not
appear in the menu.

   When the Emacs Lisp function `festival-say-buffer' or the menu
equivalent is used, the Emacs major mode is passed to Festival as the
text mode.


File: festival.info,  Node: Phonesets,  Next: Lexicons,  Prev: Emacs interface,  Up: Top

Phonesets
*********

   The notion of phonesets is important to a number of different
subsystems within Festival.  Festival supports multiple phonesets
simultaneously and allows mapping between sets when necessary.  The
lexicons, letter to sound rules, waveform synthesizers, etc. all require
the definition of a phoneset before they will operate.

   A phoneset is a set of symbols which may be further defined in terms
of features, such as vowel/consonant, place of articulation for
consonants, type of vowel etc.  The set of features and their values
must be defined with the phoneset.  The definition is used to ensure
compatibility between sub-systems as well as to allow groups of phones
to be identified in various prediction systems (e.g. duration).

   A phoneset definition has the form
       (defPhoneSet
          NAME
          FEATUREDEFS
          PHONEDEFS )
   The NAME is any unique symbol, e.g. `mrpa', `darpa', etc.
FEATUREDEFS is a list of definitions each consisting of a feature name
and its possible values.  For example
        (
          (vc + -)             ;; vowel consonant
          (vlength short long diphthong schwa 0)  ;; vowel length
          ...
        )
   The third section is a list of phone definitions themselves.  Each
phone definition consists of a phone name and the values for each
feature in the order the features were defined in the above section.

   A typical example of a phoneset definition can be found in
`lib/mrpa_phones.scm'.
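   The structure just described can be modelled as plain data.  The
following Python sketch (illustrative only: the feature and phone
names are made up and this is not Festival's implementation) checks
that every phone supplies a legal value for every declared feature, in
declaration order:

```python
# A phoneset as data: feature definitions in order, and for each phone
# one value per feature, in that same order.  All names are illustrative.
feature_defs = [
    ("vc", ["+", "-"]),                                 # vowel/consonant
    ("vlength", ["short", "long", "diphthong", "schwa", "0"]),
]

phone_defs = {
    "a":  ["+", "short"],   # short vowel
    "ii": ["+", "long"],    # long vowel
    "t":  ["-", "0"],       # consonant, no vowel length
}

def check_phoneset(features, phones):
    """Verify every phone gives a legal value for every declared feature."""
    for phone, values in phones.items():
        if len(values) != len(features):
            raise ValueError(f"{phone}: wrong number of feature values")
        for (fname, legal), value in zip(features, values):
            if value not in legal:
                raise ValueError(f"{phone}: {value!r} is not a legal {fname}")
    return True

check_phoneset(feature_defs, phone_defs)
```

   It is this kind of consistency between declared features and phone
definitions that lets the sub-systems rely on every phone having a
full feature vector.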

   Note the phoneset should also include a definition for any silence
phones.  In addition to the definition of the set the silence phone(s)
themselves must also be identified to the system.  This is done through
the command `PhoneSet.silences'.  In the mrpa set this is done by the
command
     (PhoneSet.silences '(#))
   There may be more than one silence phone (e.g. breath, start silence
etc.)  in any phoneset definition.  However the first phone in this set
is treated specially and should be the canonical silence.  Among other
things, it is this phone that is inserted by the pause prediction
module.

   In addition to declaring phonesets, alternate sets may be selected
by the command `PhoneSet.select'.

   Phones in different sets may be automatically mapped between using
their features.  This mapping is not yet as general as it could be, but
is useful when mapping between various phonesets of the same language.
When a phone needs to be mapped from one set to another the phone with
matching features is selected.  This allows, at least to some extent,
lexicons, waveform synthesizers, duration modules etc.  to use
different phonesets (though in general this is not advised).

   A list of currently defined phonesets is returned by the function
     (PhoneSet.list)
   Note phonesets are often not defined until a voice is actually
loaded, so this list is not the list of sets that are distributed but
the list of sets used by currently loaded voices.

   The name, phones, features and silences of the current phoneset may
be accessed with the function
     (PhoneSet.description nil)
   If the argument to this function is a list, only those parts of the
phoneset description named are returned.  For example
     (PhoneSet.description '(silences))
     (PhoneSet.description '(silences phones))


File: festival.info,  Node: Lexicons,  Next: Utterances,  Prev: Phonesets,  Up: Top

Lexicons
********

   A _Lexicon_ in Festival is a subsystem that provides pronunciations
for words.  It can consist of three distinct parts: an addenda,
typically short, consisting of hand added words; a compiled lexicon,
typically large (10,000s of words), which sits on disk somewhere; and a
method for dealing with words not in either list.

* Menu:

* Lexical entries::                Format of lexical entries
* Defining lexicons::              Building new lexicons
* Lookup process::                 Order of significance
* Letter to sound rules::          Dealing with unknown words
* Building letter to sound rules:: Building rules from data
* Lexicon requirements::           What should be in the lexicon
* Available lexicons::             Current available lexicons
* Post-lexical rules::             Modification of words in context


File: festival.info,  Node: Lexical entries,  Next: Defining lexicons,  Up: Lexicons

Lexical entries
===============

   Lexical entries consist of three basic parts: a head word, a part of
speech and a pronunciation.  The head word is what you might normally
think of as a word, e.g. `walk', `chairs' etc., but it might be any
token.

   The part-of-speech field currently consists of a simple atom (or nil
if none is specified).  Of course there are many part of speech tag sets
and whatever you mark in your lexicon must be compatible with the
subsystems that use that information.  You can optionally set a part of
speech tag mapping for each lexicon.  The value should be a reverse
assoc-list of the following form
     (lex.set.pos.map
        '((( punc fpunc) punc)
          (( nn nnp nns nnps ) n)))
   All part of speech tags not appearing in the left hand side of a pos
map are left unchanged.
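   The map thus behaves as a many-to-one table consulted at lookup
time.  A rough Python sketch of this behaviour (illustrative only, not
Festival's implementation):

```python
# A reverse assoc-list POS map as (source-tags, target) pairs; any tag
# in a left-hand list maps to the right-hand tag.
pos_map = [
    (["punc", "fpunc"], "punc"),
    (["nn", "nnp", "nns", "nnps"], "n"),
]

def map_pos(tag):
    for sources, target in pos_map:
        if tag in sources:
            return target
    return tag  # tags not in any left-hand side are left unchanged
```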

   The third field contains the actual pronunciation of the word.  This
is an arbitrary Lisp S-expression.  In many of the lexicons distributed
with Festival this entry has internal format, identifying syllable
structure, stress markings and of course the phones themselves.  In
some of our other lexicons we simply list the phones with stress marking
on each vowel.

   Some typical example entries are

     ( "walkers" n ((( w oo ) 1) (( k @ z ) 0)) )
     ( "present" v ((( p r e ) 0) (( z @ n t ) 1)) )
     ( "monument" n ((( m o ) 1) (( n y u ) 0) (( m @ n t ) 0)) )

   Note you may have two entries with the same headword; different
part of speech fields allow them to be differentiated.  For example

     ( "lives" n ((( l ai v z ) 1)) )
     ( "lives" v ((( l i v z ) 1)) )

   *Note Lookup process:: for a description of how multiple entries
with the same headword are used during lookup.

   By current conventions, single syllable function words should have no
stress marking, while single syllable content words should be stressed.

   _NOTE:_ the POS field may change in future to contain more complex
formats.  The same lexicon mechanism (but different lexicon) is used
for holding part of speech tag distributions for the POS prediction
module.


File: festival.info,  Node: Defining lexicons,  Next: Lookup process,  Prev: Lexical entries,  Up: Lexicons

Defining lexicons
=================

   As stated above, lexicons consist of three basic parts (compiled
form, addenda and unknown word method) plus some other declarations.

   Each lexicon in the system has a name which allows different
lexicons to be selected from efficiently when switching between voices
during synthesis.  The basic steps involved in a lexicon definition are
as follows.

   First a new lexicon must be created with a new name
     (lex.create "cstrlex")
   A phone set must be declared for the lexicon, to allow both checks
on the entries themselves and to allow phone mapping between different
phone sets used in the system
     (lex.set.phoneset "mrpa")
   The phone set must be already declared in the system.

   A compiled lexicon, the construction of which is described below,
may be optionally specified
     (lex.set.compile.file "/projects/festival/lib/dicts/cstrlex.out")
   The method for dealing with unknown words, *Note Letter to sound
rules::, may be set
     (lex.set.lts.method 'lts_rules)
     (lex.set.lts.ruleset 'nrl)
   In this case we are specifying the use of a set of letter to sound
rules originally developed by the U.S. Naval Research Laboratories.  The
default method is to give an error if a word is not found in the addenda
or compiled lexicon.  (This and other options are discussed more fully
below.)

   Finally addenda items may be added for words that are known to be
common, but not in the lexicon and cannot reasonably be analysed by the
letter to sound rules.
     (lex.add.entry
       '( "awb" n ((( ei ) 1) ((d uh) 1) ((b @ l) 0) ((y uu) 0) ((b ii) 1))))
     (lex.add.entry
       '( "cstr" n ((( s ii ) 1) (( e s ) 1) (( t ii ) 1) (( aa ) 1)) ))
     (lex.add.entry
       '( "Edinburgh" n ((( e m ) 1) (( b r @ ) 0)) ))
   Using `lex.add.entry' again for the same word and part of speech
will redefine the current pronunciation.  Note these add entries to the
_current_ lexicon so it's a good idea to explicitly select the lexicon
before you add addenda entries, particularly if you are doing this in
your own `.festivalrc' file.

   For large lists, compiled lexicons are best.  The function
`lex.compile' takes two filename arguments, a file name containing a
list of lexical entries and an output file where the compiled lexicon
will be saved.

   Compilation can take some time and may require lots of memory, as all
entries are loaded in, checked and then sorted before being written out
again.  During compilation if some entry is malformed the reading
process halts with a not very useful message.  Note that if any of your
entries include single or double quotes the entries will probably be
misparsed and cause such a weird error.  In such cases try setting
     (debug_output t)
before compilation.  This will print out each entry as it is read in,
which should help to narrow down where the error is.


File: festival.info,  Node: Lookup process,  Next: Letter to sound rules,  Prev: Defining lexicons,  Up: Lexicons

Lookup process
==============

   When looking up a word, either through the C++ interface, or Lisp
interface, a word is identified by its headword and part of speech.  If
no part of speech is specified, `nil' is assumed which matches any part
of speech tag.

   The lexicon look up process first checks the addenda; if there is a
full match (head word plus part of speech) it is returned.  If there is
an addenda entry whose head word matches and whose part of speech is
`nil' that entry is returned.

   If no match is found in the addenda, the compiled lexicon, if
present, is checked.  Again a match occurs when both head word and part
of speech tag match, or when either the word being searched for or the
entry has a part of speech of `nil'.  Unlike the addenda, if no full
head word and part of speech tag match is found, the first word in the
lexicon whose head word matches is returned.  The rationale is that the
letter to sound rules (the next line of defence) are unlikely to do
better than a given alternate pronunciation for the same word with a
different part of speech.  Moreover, the fact that there is an entry
with this head word but a different part of speech suggests the word
may have an unusual pronunciation that the letter to sound rules have
no chance of producing.

   Finally if the word is not found in the compiled lexicon it is
passed to whatever method is defined for unknown words.  This is most
likely a letter to sound module.  *Note Letter to sound rules::.
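   The precedence described above, including the fall-through to the
unknown word method, can be sketched in Python.  This is an
illustrative model only (the entries shown are made up); Festival's
actual lookup is implemented in C++:

```python
# Illustrative model of the lookup order: addenda first (full match or
# nil-POS wildcard), then the compiled lexicon (where a headword-only
# match is accepted as a fallback), then the unknown-word method.
addenda = [("lives", "v", "l-i-v-z")]
compiled = [("lives", "n", "l-ai-v-z"), ("lives", "v", "l-i-v-z")]

def pos_match(wanted, tag):
    # nil (None) on either side acts as a wildcard
    return wanted is None or tag is None or wanted == tag

def lookup(word, pos=None):
    # 1. addenda: full match, or an entry whose POS matches as wildcard
    for w, tag, pron in addenda:
        if w == word and pos_match(pos, tag):
            return (w, tag, pron)
    # 2. compiled lexicon: prefer a POS match...
    for w, tag, pron in compiled:
        if w == word and pos_match(pos, tag):
            return (w, tag, pron)
    # ...but fall back to the first entry whose headword matches
    for w, tag, pron in compiled:
        if w == word:
            return (w, tag, pron)
    # 3. otherwise hand the word to the unknown-word (LTS) method
    return None
```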

   Optional pre- and post-lookup hooks can be specified for a lexicon,
as a single Lisp function or a list of functions.  The pre-hooks will
be called with two arguments (word and features) and should return a
pair (word and features).  The post-hooks will be given a lexical
entry and should return a lexical entry.  The pre- and post-hooks do
nothing by default.

   Compiled lexicons may be created from lists of lexical entries.  A
compiled lexicon is _much_ more efficient for look up than the addenda.
Compiled lexicons use a binary search method while the addenda is
searched linearly.  Also it would take a prohibitively long time to
load in a typical full lexicon as an addenda.  If you have more than a
few hundred entries in your addenda you should seriously consider
adding them to your compiled lexicon.

   Because many publicly available lexicons do not have syllable
markings for entries the compilation method supports automatic
syllabification.  Thus for lexicon entries for compilation, two forms
for the pronunciation field are supported: the standard full
syllabified and stressed form and a simpler linear form found in at
least the BEEP and CMU lexicons.  If the pronunciation field is a flat
atomic list it is assumed syllabification is required.

   Syllabification is done by finding the minimum sonorant position
between vowels.  It is not guaranteed to be accurate but does give a
solution that is sufficient for many purposes.  A little work would
probably improve this significantly.  Of course syllabification
requires the entry's phones to be in the current phone set.  The
sonorant values are calculated from the _vc_, _ctype_, and _cvox_
features for the current phoneset.  See
`src/arch/festival/Phone.cc:ph_sonority()' for actual definition.
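   The idea can be sketched in Python as follows.  The sonority ranks
and the toy phone classification below are illustrative only; Festival
derives sonority from the phoneset features as described above:

```python
# Rough sketch of sonority-based syllabification: rank each phone by
# sonority, then cut at the sonority minimum between successive vowels.
SONORITY = {"vowel": 5, "liquid": 4, "nasal": 3, "fricative": 2, "stop": 1}
PHONE_CLASS = {  # toy classification for the example only
    "i": "vowel", "@": "vowel", "a": "vowel",
    "l": "liquid", "r": "liquid", "n": "nasal",
    "s": "fricative", "t": "stop", "k": "stop", "b": "stop",
}

def syllabify(phones):
    vowels = [i for i, p in enumerate(phones) if PHONE_CLASS[p] == "vowel"]
    cuts = []
    for v1, v2 in zip(vowels, vowels[1:]):
        # boundary before the least sonorous phone between the two vowels
        between = range(v1 + 1, v2 + 1)
        cut = min(between, key=lambda i: SONORITY[PHONE_CLASS[phones[i]]])
        cuts.append(cut)
    out, prev = [], 0
    for c in cuts + [len(phones)]:
        out.append(phones[prev:c])
        prev = c
    return out
```

   As the text notes, this is a heuristic: it gives a workable answer,
not a guaranteed-correct one.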

   Additionally in this flat structure vowels (atoms starting with a,
e, i, o or u) may have 1, 2 or 0 appended marking stress.  This again
follows the form found in the BEEP and CMU lexicons.

   Some example entries in the flat form (taken from BEEP) are
     ("table" nil (t ei1 b l))
     ("suspicious" nil (s @ s p i1 sh @ s))

   Also if syllabification is required there is an opportunity to run a
set of "letter-to-sound"-rules on the input (actually an arbitrary
re-write rule system).  If the variable `lex_lts_set' is set, the lts
ruleset of that name is applied to the flat input before
syllabification.  This allows simple predictable changes, such as the
conversion of a final `r' into a longer vowel when deriving English RP
from American labelled lexicons.

   A list of all matching entries in the addenda and the compiled
lexicon may be found by the function `lex.lookup_all'.  This function
takes a word and returns all matching entries irrespective of part of
speech.

   You can optionally intercept words as they are looked up, and after
they have been found, through `pre_hooks' and `post_hooks' for each
lexicon.  These allow a function or list of functions to be applied to
a word and its features before lookup, or to the resulting entry after
lookup.  The following example shows how to add voice specific entries
to a general lexicon without affecting other voices that use that
lexicon.

   For example suppose we were trying to use a Scottish English voice
with the US English (cmu) lexicon.  A number of entries will be
inappropriate but we can redefine some entries thus
     (set! cmu_us_awb::lexicon_addenda
           '(
     	("edinburgh" n (((eh d) 1) ((ax n) 0) ((b r ax) 0)))
     	("poem" n (((p ow) 1) ((y ax m) 0)))
     	("usual" n (((y uw) 1) ((zh ax l) 0)))
     	("air" n (((ey r) 1)))
     	("hair" n (((hh ey r) 1)))
     	("fair" n (((f ey r) 1)))
     	("chair" n (((ch ey r) 1)))))
   We can then define a function that checks to see if the word looked
up is in the speaker specific exception list and use that entry instead.
     (define (cmu_us_awb::cmu_lookup_post entry)
       "(cmu_us_awb::cmu_lookup_post entry)
     Speaker specific lexicon addenda."
       (let ((ne
     	 (assoc_string (car entry) cmu_us_awb::lexicon_addenda)))
         (if ne
     	ne
     	entry)))
   And then for the particular voice set up we need to add both a
selection part _and_ a reset part, following the FestVox conventions
for voice set up.
     (define (cmu_us_awb::select_lexicon)
     
         ...
         (lex.select "cmu")
         ;; Get old var for reset, and to append our function to it
         (set! cmu_us_awb::old_cmu_post_hooks
            (lex.set.post_hooks nil))
         (lex.set.post_hooks
            (append cmu_us_awb::old_cmu_post_hooks
                    (list cmu_us_awb::cmu_lookup_post)))
         ...
     )
     
     ...
     
     (define (cmu_us_awb::reset_lexicon)
     
       ...
       ;; reset CMU's post_hooks back to original
       (lex.set.post_hooks cmu_us_awb::old_cmu_post_hooks)
       ...
     
     )
   The above isn't the most efficient method, as the word is looked up
first and only then checked against the speaker specific list.

   The `pre_hooks' functions are called with two arguments, the word
and features; they should return a pair of word and features.


File: festival.info,  Node: Letter to sound rules,  Next: Building letter to sound rules,  Prev: Lookup process,  Up: Lexicons

Letter to sound rules
=====================

   Each lexicon may define what action to take when a word cannot be
found in the addenda or the compiled lexicon.  There are a number of
options, and more will hopefully be added as more general letter to
sound rule systems are developed.

   The method is set by the command
     (lex.set.lts.method METHOD)
   Where METHOD can be any of the following
`Error'
     Throw an error when an unknown word is found (default).

`lts_rules'
     Use externally specified set of letter to sound rules (described
     below).  The name of the rule set to use is defined with the
     `lex.set.lts.ruleset' function.  This method runs one set of rules on
     an exploded form of the word and assumes the rules return a list
     of phonemes (in the appropriate set).  If multiple instances of
     rules are required use the `function' method described next.

`none'
     This returns an entry with a `nil' pronunciation field.  This will
     only be valid in very special circumstances.

`FUNCTIONNAME'
     Call the named Lisp function.  The function is given two
     arguments: the word and the part of speech.  It should return a
     valid lexical entry.

   The basic letter to sound rule system is very simple but is powerful
enough to build reasonably complex letter to sound rules.  Although
we've found trained LTS rules better than hand written ones (for
complex languages), where no data is available and rules must be hand
written the following rule formalism is much easier to use than that
generated by the LTS training system (described in the next section).

   The basic form of a rule is as follows
     ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
   The interpretation is that if ITEMS appear in the specified left
and right context then the output string is to contain NEWITEMS.  Any of
LEFTCONTEXT, RIGHTCONTEXT or NEWITEMS may be empty.  Note that NEWITEMS
is written to a different "tape" and hence cannot feed further rules
(within this ruleset).  An example is
     ( # [ c h ] C = k )
   The special character `#' denotes a word boundary, and the symbol
`C' denotes the set of all consonants; sets are declared before rules.
This rule states that a `ch' at the start of a word followed by a
consonant is to be rendered as the `k' phoneme.  Symbols in contexts
may be followed by the symbol `*' for zero or more occurrences, or `+'
for one or more occurrences.

   The symbols in the rules are treated as set names if they are
declared as such or as symbols in the input/output alphabets.  The
symbols may be more than one character long and the names are case
sensitive.

   The rules are tried in order until one matches the first (or more)
symbols of the tape.  That rule is applied, adding its right hand side
to the output tape, and the rules are then applied again from the
start of the list at the next position on the input tape.

   The function used to apply a set of rules will, if given an atom,
explode it into a list of single characters; if given a list, it will
use it as is.  This reflects the common usage of wishing to re-write the
individual letters in a word to phonemes but without excluding the
possibility of using the system for more complex manipulations, such as
multi-pass LTS systems and phoneme conversion.
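   The matching procedure can be sketched in Python.  This is a
deliberately simplified model of the algorithm described above: the
`*' and `+' operators are omitted, and the rules and sets shown are
made up for illustration:

```python
# Simplified rewrite-rule interpreter: rules are (leftcontext, items,
# rightcontext, newitems); a context symbol may name a declared set,
# '#' marks a word boundary, and output goes to a separate tape so it
# cannot feed later rules within the ruleset.
sets = {"C": {"b", "c", "d", "f", "g", "h", "k", "l", "r", "s", "t"}}

rules = [
    (["#"], ["c", "h"], ["C"], ["k"]),   # ( # [ c h ] C = k )
    ([], ["c", "h"], [], ["ch"]),        # ( [ c h ] = ch )
    ([], ["c"], [], ["k"]),
    ([], ["r"], [], ["r"]),
    ([], ["i"], [], ["i"]),
    ([], ["s"], [], ["s"]),
    ([], ["t"], [], ["t"]),
]

def matches(pattern, tape):
    # each context symbol must match the corresponding tape symbol,
    # either literally or via set membership
    if len(pattern) > len(tape):
        return False
    for sym, actual in zip(pattern, tape):
        if actual not in sets.get(sym, {sym}):
            return False
    return True

def lts_apply(word):
    tape = ["#"] + list(word) + ["#"]
    out, pos = [], 1
    while pos < len(tape) - 1:
        for left, items, right, new in rules:
            if (tape[pos:pos + len(items)] == items
                    and matches(left, tape[pos - len(left):pos])
                    and matches(right, tape[pos + len(items):])):
                out.extend(new)          # written to the output tape
                pos += len(items)        # consume the matched items
                break
        else:
            raise ValueError(f"no rule matches at {tape[pos]}")
    return out
```

   For example, `lts_apply("christ")' uses the first rule (word-initial
`ch' before a consonant) while `lts_apply("chi")' falls through to the
plain `ch' rule.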

   From Lisp there are three basic access functions; there are
corresponding functions in the C/C++ domain.

`(lts.ruleset NAME SETS RULES)'
     Define a new set of lts rules, where `NAME' is the name for this
     ruleset, `SETS' is a list of set definitions of the form `(SETNAME
     e0 e1 ...)'  and `RULES' is a list of rules as described above.

`(lts.apply WORD RULESETNAME)'
     Apply the set of rules named `RULESETNAME' to `WORD'.  If `WORD'
     is a symbol it is exploded into a list of the individual
     characters in its print name.  If `WORD' is a list it is used as
     is.  If the rules cannot be successfully applied an error is
     given.  The result of (successful) application is returned in a
     list.

`(lts.check_alpha WORD RULESETNAME)'
     The symbols in `WORD' are checked against the input alphabet of the
     rules named `RULESETNAME'.  If they are all contained in that
     alphabet `t' is returned, else `nil'.  Note this does not
     necessarily mean the rules will successfully apply (contexts may
     restrict the application of the rules), but it allows general
     checking, e.g. of numerals or punctuation, so that an appropriate
     rule set can be applied.

   The letter to sound rule system may be used directly from Lisp and
can easily be used to do relatively complex operations for analyzing
words without requiring modification of the C/C++ system.  For example
the Welsh letter to sound rule system consists of three rule sets: the
first explicitly identifies epenthesis, the second identifies stressed
vowels, and the third rewrites this augmented letter string to
phonemes.  This is achieved by the following function
     (define (welsh_lts word features)
       (let (epen str wel)
         (set! epen (lts.apply (downcase word) 'newepen))
         (set! str (lts.apply epen 'newwelstr))
         (set! wel (lts.apply str 'newwel))
         (list word
               nil
               (lex.syllabify.phstress wel))))
   The LTS method for the Welsh lexicon is set to `welsh_lts', so this
function is called when a word is not found in the lexicon.  The above
function first downcases the word and then applies the rulesets in
turn, finally calling the syllabification process and returning a
constructed lexical entry.


File: festival.info,  Node: Building letter to sound rules,  Next: Lexicon requirements,  Prev: Letter to sound rules,  Up: Lexicons

Building letter to sound rules
==============================

   As writing letter to sound rules by hand is hard and very time
consuming, an alternative method is also available where a letter to
sound system may be built from a lexicon of the language.  This
technique has been used successfully for English (British and
American), French and German.  The difficulty and appropriateness of
using letter to sound rules is very language dependent.

   The following outlines the processes involved in building a letter to
sound model for a language given a large lexicon of pronunciations.
This technique is likely to work for most European languages (including
Russian) but doesn't seem particularly suitable for languages with very
large alphabets such as Japanese and Chinese.  The process described
here is not (yet) fully automatic but the hand intervention required is
small and may easily be done even by people with only a very little
knowledge of the language being dealt with.

   The process involves the following steps
   * Pre-processing lexicon into suitable training set

   * Defining the set of allowable pairing of letters to phones.  (We
     intend to do this fully automatically in future versions).

   * Constructing the probabilities of each letter/phone pair.

   * Aligning letters to an equal set of phones/_epsilons_.

   * Extracting the data by letter suitable for training.

   * Building CART models for predicting phone from letters (and
     context).

   * Building additional lexical stress assignment model (if necessary).

   All except the first two stages of this are fully automatic.

   Before building a model it's wise to think a little about what you
want it to do.  Ideally the model is an auxiliary to the lexicon so only
words not found in the lexicon will require use of the letter to sound
rules.  Thus only unusual forms are likely to require the rules.  More
precisely, the most common words, often having the most non-standard
pronunciations, should probably always be explicitly listed.  It is
possible to reduce the size of the lexicon (sometimes drastically) by
removing all entries that the trained LTS model correctly predicts.

   Before starting it is wise to consider removing some entries from the
lexicon before training.  I typically remove words under 4 letters and,
if part of speech information is available, all function words, ideally
training only from nouns, verbs and adjectives as these are the most
likely forms to be unknown in text.  It is useful to have
morphologically inflected and derived forms in the training set as it is
often such variant forms that are not found in the lexicon even though
their root morpheme is.  Note that in many forms of text, proper names
are the most common form of unknown word and even the technique
presented here may not adequately cater for that form of unknown word
(especially if the unknown words are non-native names).  This is all to
say that this may or may not be appropriate for your task, but the
rules generated by this learning process have, in the examples we've
done, been much better than what we could produce by hand writing
rules of the form described in the previous section.

   First preprocess the lexicon into a file of lexical entries to be
used for training, removing function words and changing the head words
to all lower case (this may be language dependent).  The entries should
be of the form used as input for Festival's lexicon compilation.
Specifically the pronunciations should be simple lists of phones (no
syllabification).  Depending on the language, you may wish to remove
the stressing--for the examples here we have, though later tests
suggest that we should keep it in even for English.  Thus the training
set should look something like
     ("table" nil (t ei b l))
     ("suspicious" nil (s @ s p i sh @ s))
   It is best to split the data into a training set and a test set if
you wish to know how well your training has worked.  In our tests we
remove every tenth entry and put it in a test set.  Note this will mean
our test results are probably better than if we removed say the last
ten in every hundred.

   The second stage is to define the set of allowable letter to phone
mappings irrespective of context.  This can sometimes be initially done
by hand then checked against the training set.  Initially construct a
file of the form
     (require 'lts_build)
     (set! allowables
           '((a _epsilon_)
             (b _epsilon_)
             (c _epsilon_)
             ...
             (y _epsilon_)
             (z _epsilon_)
             (# #)))
   All letters that appear in the alphabet should (at least) map to
`_epsilon_', including any accented characters that appear in that
language.  Note the last two hashes.  These are used to denote the
beginning and end of word and are automatically added during training;
they must appear in the list and should only map to themselves.

   To incrementally add to this allowable list run festival as
     festival allowables.scm
   and at the prompt type
     festival> (cummulate-pairs "oald.train")
   with your train file.  This will print out each lexical entry that
couldn't be aligned with the current set of allowables.  At the start
this will be every entry.  Looking at these entries, add to the
allowables to make alignment work.  For example if the following word
fails
     ("abate" nil (ah b ey t))
   Add `ah' to the allowables for letter `a', `b' to `b', `ey' to `a'
and `t' to letter `t'.  After doing that restart festival and call
`cummulate-pairs' again.  Incrementally add to the allowable pairs
until the number of failures becomes acceptable.  Often there are
entries for which there is no real relationship between the letters and
the pronunciation, such as abbreviations and foreign words (e.g.
"aaa" as "t r ih p ax l ey").  For the lexicons I've used this
technique on, fewer than 10 per thousand fail in this way.
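   The alignment step itself can be sketched in Python.  This
illustrative version (the allowables shown are a made-up fragment)
merely finds one alignment or reports failure; the real
`cummulate-pairs' also accumulates pair counts over all alignments:

```python
# Each letter must map to one of its allowable phones or to _epsilon_;
# an entry fails when no assignment consumes the whole phone string.
allowables = {
    "a": ["_epsilon_", "ah", "ey"],
    "b": ["_epsilon_", "b"],
    "e": ["_epsilon_"],
    "t": ["_epsilon_", "t"],
}

def align(letters, phones):
    """Return one letter/phone alignment as pairs, or None if impossible."""
    if not letters:
        return [] if not phones else None
    letter, rest = letters[0], letters[1:]
    for cand in allowables.get(letter, []):
        if cand == "_epsilon_":
            sub = align(rest, phones)            # letter consumes nothing
        elif phones and phones[0] == cand:
            sub = align(rest, phones[1:])        # letter consumes one phone
        else:
            continue
        if sub is not None:
            return [(letter, cand)] + sub
    return None
```

   With the allowables above, `("abate" nil (ah b ey t))' aligns as
a/ah, b/b, a/ey, t/t, e/_epsilon_; a word whose phones are outside the
allowables fails, which is exactly what `cummulate-pairs' reports.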

   It is worthwhile being consistent in defining your set of
allowables.  (At least) two mappings are possible for the letter
sequence `ch'--having letter `c' go to phone `ch' and letter `h' go to
`_epsilon_', or letter `c' go to phone `_epsilon_' and letter `h' go
to `ch'.  However only one should be allowed; we preferred mapping `c'
to `ch'.

   It may also be the case that some letters give rise to more than one
phone.  For example the letter `x' in English is often pronounced as
the phone combination `k' and `s'.  To allow this, use the multiphone
`k-s'.  Thus the multiphone `k-s' will be predicted for `x' in some
contexts and the model will separate it into two phones, while also
ignoring any predicted `_epsilon_'s.  Note that multiphone units are
relatively rare but do occur.  In English, letter `x' gives rise to a
few: `k-s' in `taxi', `g-s' in `example', and sometimes `g-zh' and
`k-sh' in `luxury'.  Others are `w-ah' in `one', `t-s' in `pizza',
`y-uw' in `new' (British), `ah-m' in `-ism' etc.  Three-phone
multiphones are much rarer but may exist; they are not supported by
this code as is, and such entries should probably be ignored.  Note the
`-' sign in the multiphone examples is significant and is used to
identify multiphones.

   The allowables for OALD end up being
     (set! allowables
            '
           ((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
            (b _epsilon_ b )
            (c _epsilon_ k s ch sh @-k s t-s)
            (d _epsilon_ d dh t jh)
            (e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa oi y y-u@ o)
            (f _epsilon_ f v )
            (g _epsilon_ g jh zh th f ng k t)
            (h _epsilon_ h @ )
            (i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
            (j _epsilon_ h zh jh i y )
            (k _epsilon_ k ch )
            (l _epsilon_ l @-l l-l)
            (m _epsilon_ m @-m n)
            (n _epsilon_ n ng n-y )
            (o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
            (p _epsilon_ f p v )
            (q _epsilon_ k )
            (r _epsilon_ r @@ @-r)
            (s _epsilon_ z s sh zh )
            (t _epsilon_ t th sh dh ch d )
            (u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
            (v _epsilon_ v f )
            (w _epsilon_ w uu v f u)
            (x _epsilon_ k-s g-z sh z k-sh z g-zh )
            (y _epsilon_ i ii i@ ai uh y @ ai-@)
            (z _epsilon_ z t-s s zh )
            (# #)
            ))
   Note this is an exhaustive list and (deliberately) says nothing
about the contexts or frequencies with which these letter to phone
pairs appear.
That information will be generated automatically from the training set.

   Once the number of failed matches is sufficiently low, let
`cummulate-pairs' run to completion.  This counts the number of times
each letter/phone pair occurs in allowable alignments.

   Next call
     festival> (save-table "oald-")
   with the name of your lexicon.  This converts the cumulation table
into probabilities and saves it.
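The conversion is a straightforward normalisation of pair counts into
conditional probabilities P(phone | letter).  This Python sketch is
illustrative only; the table format that `save-table' actually writes
is Festival's own Scheme format.

```python
from collections import defaultdict

def counts_to_probs(pair_counts):
    """Convert letter/phone pair counts to P(phone | letter)."""
    totals = defaultdict(float)
    for (letter, phone), n in pair_counts.items():
        totals[letter] += n
    return {(l, p): n / totals[l] for (l, p), n in pair_counts.items()}

# Invented counts for illustration.
counts = {("c", "k"): 300, ("c", "s"): 80, ("c", "ch"): 20}
probs = counts_to_probs(counts)
print(probs[("c", "k")])   # 0.75
```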

   Restart festival loading this new table
     festival allowables.scm oald-pl-table.scm
   Now each word can be aligned to an equal-length string of phones,
epsilons, and multiphones.
     festival> (aligndata "oald.train" "oald.train.align")
   Do this also for your test set.

   This will produce entries like
     aaronson _epsilon_ aa r ah n s ah n
     abandon ah b ae n d ah n
     abate ah b ey t _epsilon_
     abbe ae b _epsilon_ iy

   The next stage is to build features suitable for `wagon' to build
models.  This is done by
     festival> (build-feat-file "oald.train.align" "oald.train.feats")
   Again the same for the test set.

   Now you need to construct a description file for `wagon' for the
given data.  This can be done using the script `make_wgn_desc' provided
with the speech tools.

   Here is an example script for building the models; you will need to
modify it for your particular database, but it shows the basic processes.
     for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
     do
        # Stop value for wagon
        STOP=2
        echo letter $i STOP $STOP
        # Find training set for letter $i
        cat oald.train.feats |
         awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
        # split training set to get heldout data for stepwise testing
        traintest ltsdataTRAIN.$i.feats
        # Extract test data for letter $i
        cat oald.test.feats |
         awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
        # run wagon to predict model
        wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
               -stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
        # Test the resulting tree against the test data
        wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
                   -tree lts.$i.tree
     done
   The script `traintest' splits the given file `X' into `X.train' and
`X.test' with every tenth line in `X.test' and the rest in `X.train'.
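The behaviour of `traintest' is easy to reproduce; here is an
equivalent Python sketch (the distributed `traintest' is a separate
shell utility, not this code).

```python
def traintest(lines):
    """Split lines: every tenth line goes to test, the rest to train."""
    train, test = [], []
    for i, line in enumerate(lines, 1):
        (test if i % 10 == 0 else train).append(line)
    return train, test

train, test = traintest([f"entry{i}" for i in range(1, 21)])
print(len(train), len(test))   # 18 2
```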

   This script can take a significant amount of time to run, about 6
hours on a Sun Ultra 140.

   Once the models are created they must be collected together into a
single list structure.  The trees generated by `wagon' contain full
probability distributions at each leaf; at this stage this information
can be removed, as only the most probable phone will actually be
predicted.  This substantially reduces the size of the trees.
     (merge_models 'oald_lts_rules "oald_lts_rules.scm")
   (`merge_models' is defined within `lts_build.scm'.)  The given file
will contain a `set!' for the given variable name to an assoc list of
letter to trained tree.  Note the above function naively assumes that
the letters in the alphabet are the 26 lower case letters of the
English alphabet; you will need to edit this, adding accented letters,
if required.  Note that adding "'" (single quote) as a letter is a
little tricky in Scheme but can be done--the command `(intern "'")'
will give you the symbol for single quote.

   To test a set of LTS models, load the saved model and call the
following function with the test align file
     festival oald-table.scm oald_lts_rules.scm
     festival> (lts_testset "oald.test.align" oald_lts_rules)
   The result (after showing all the failed entries) will be a table
showing the results for each letter, for all letters, and for complete
words.  The failed entries may give some notion of how good or bad the
result is: sometimes it will be simple vowel differences, long versus
short, schwa versus full vowel; other times it may be whole consonants
missing.  Remember the ultimate measure of the quality of the letter to
sound rules is how adequate they are at providing _acceptable_
pronunciations rather than how good the numeric score is.
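The per-letter and whole-word scoring idea can be sketched as follows.
This Python fragment is illustrative only (the report format of
`lts_testset' differs) and assumes predicted and actual pronunciations
are already aligned phone by phone.

```python
def score(pairs):
    """pairs: list of (predicted, actual) aligned phone lists per word.

    Returns (per-phone accuracy, whole-word accuracy)."""
    phones_right = phones_total = words_right = 0
    for pred, actual in pairs:
        phones_right += sum(p == a for p, a in zip(pred, actual))
        phones_total += len(actual)
        words_right += pred == actual
    return phones_right / phones_total, words_right / len(pairs)

# Invented test pairs: one word fully correct, one with a vowel error.
pairs = [(["ah", "b", "ey", "t"], ["ah", "b", "ey", "t"]),
         (["ae", "b"], ["ah", "b"])]
print(score(pairs))   # roughly (0.83, 0.5)
```

This makes the point in the text concrete: a high per-phone score
(5 of 6 here) can still mean only half the words are fully correct.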

   For some languages (e.g. English) it is necessary also to find a
stress pattern for unknown words.  Ultimately, for this to work well
you need to know the morphological decomposition of the word.  At
present we provide a CART-trained system to predict stress patterns for
English.  It does get 94.6% correct for an unseen test set, but that
isn't really very good.  Later tests suggest that predicting stressed
and unstressed phones directly is actually better for getting whole
words correct, even though the models do slightly worse on a per-phone
basis `black98'.

   As the lexicon may be a large part of the system, we have also
experimented with removing entries from the lexicon if the letter to
sound rule system (and stress assignment system) can correctly predict
them.  For OALD this allows us to halve the size of the lexicon; it
could possibly allow more if a certain amount of fuzzy acceptance were
allowed (e.g. with schwa).  For other languages the gain here can be
very significant: for German and French we can reduce the lexicon by
over 90%.  The function `reduce_lexicon' in
`festival/lib/lts_build.scm' was used to do this.  The use of the above
technique as a dictionary compression method is discussed in `pagel98'.
A morphological decomposition algorithm, like that described in
`black91', may help even more.
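The idea behind `reduce_lexicon' can be sketched in a few lines of
Python.  This is an illustration only, not the Scheme function itself;
`predict' stands in for the trained letter to sound model, and the
entries are invented.

```python
def reduce_lexicon(lexicon, predict):
    """Keep only the entries the letter-to-sound model gets wrong.

    lexicon: dict mapping word -> pronunciation (tuple of phones)
    predict: function mapping word -> predicted pronunciation
    """
    return {w: pron for w, pron in lexicon.items() if predict(w) != pron}

# Hypothetical model that handles regular words but not 'one'.
lexicon = {"cat": ("k", "ae", "t"), "one": ("w", "ah", "n")}
model = {"cat": ("k", "ae", "t"), "one": ("ow", "n")}
print(reduce_lexicon(lexicon, model.get))   # {'one': ('w', 'ah', 'n')}
```

Words the rules predict correctly need never be stored; at lookup time
a miss in the reduced lexicon simply falls through to the rules.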

   The technique described in this section, and its relative merits
with respect to a number of languages/lexicons and tasks, is discussed
more fully in `black98'.


File: festival.info,  Node: Lexicon requirements,  Next: Available lexicons,  Prev: Building letter to sound rules,  Up: Lexicons

Lexicon requirements
====================

   For English there are a number of assumptions made about the lexicon
which are worthy of explicit mention.  If you are basically going to use
the existing token rules you should try to include at least the
following in any lexicon that is to work with them.

   * The letters of the alphabet: when a token is identified as an
     acronym it is spelled out.  The tokenization assumes that the
     individual letters of the alphabet are in the lexicon with their
     pronunciations.  They should be identified as nouns.  (This is to
     distinguish `a' as a determiner which can be schwa'd from `a' as a
     letter which cannot.)  The part of speech should be `nn' by
     default, but the value of the variable `token.letter_pos' is used
     and may be changed if this is not what is required.

   * One-character symbols such as dollar, at-sign, percent, etc.  It
     is difficult to get a complete list, and to know what the
     pronunciations of some of these are (e.g. hash or pound sign).  But
     the letter to sound rules cannot deal with them, so they need to be
     explicitly listed.  See the list in the function `mrpa_addend' in
     `festival/lib/dicts/oald/oaldlex.scm'.  This list should also
     contain the control characters and eight bit characters.

   * The possessive `'s' should be in your lexicon as schwa and voiced
     fricative (`z').  It should be in twice: once with part of speech
     type `pos' and once as `n' (used in plurals of numbers, acronyms,
     etc., e.g. 1950's).  `'s' is treated as a word and is separated from the
     tokens it appears with.  The post-lexical rule (the function
     `postlex_apos_s_check') will delete the schwa and devoice the `z'
     in appropriate contexts.  Note this post-lexical rule brazenly
     assumes that the unvoiced fricative in the phoneset is `s'.  If it
     is not in your phoneset copy the function (it is in
     `festival/lib/postlex.scm') and change it for your phoneset and
     use your version as a post-lexical rule.

   * Numbers as digits (e.g. "1", "2", "34", etc.) should normally
     _not_ be in the lexicon.  The number conversion routines convert
     numbers to words (i.e. "one", "two", "thirty four", etc.).

   * The word "unknown" or whatever is in the variable
     `token.unknown_word_name'.  This is used in a few obscure cases
     when there just isn't anything that can be said (e.g. single
     characters which aren't in the lexicon).  Some people have
     suggested it should be possible to make this a sound rather than a
     word.  I agree, but Festival doesn't support that yet.
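The number conversion referred to above (digits to words) can be
sketched as follows.  This Python fragment is purely illustrative and
is not Festival's actual routine, which is written in Scheme and
handles a far wider range of cases.

```python
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen",
         "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
         "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0-99 as words, as a token-to-word rule would."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + (" " + UNITS[unit] if unit else "")

print(number_to_words(34))   # thirty four
```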


File: festival.info,  Node: Available lexicons,  Next: Post-lexical rules,  Prev: Lexicon requirements,  Up: Lexicons

Available lexicons
==================

   Currently Festival supports a number of different lexicons.  They
are all defined in the file `lib/lexicons.scm', each with a number of
common extra words added to their addenda.  They are
`CUVOALD'
     The Computer Users Version of Oxford Advanced Learner's Dictionary
     is available from the Oxford Text Archive
     `ftp://ota.ox.ac.uk/pub/ota/public/dicts/710'.  It contains about
     70,000 entries and is a part of the BEEP lexicon.  It is more
     consistent in its marking of stress, though its syllable marking is
     not what works best for our synthesis methods.  Many syllabic
     `l''s, `n''s, and `m''s mess up the syllabification algorithm,
     making results sometimes appear over-reduced.  It is however our
     current default lexicon.  It is also the only lexicon with part of
     speech tags that can be distributed (for non-commercial use).

`CMU'
     This is automatically constructed from `cmu_dict-0.4' available
     from many places on the net (see `comp.speech' archives).  It is
     not in the mrpa phone set because it is American English
     pronunciation.  Although mappings exist between its phoneset
     (`darpa') and `mrpa' the results for British English speakers are
     not very good.  However this is probably the biggest, most
     carefully specified lexicon available.  It contains just under
     100,000 entries.  Our distribution has been modified to include
     part of speech tags on words we know to be homographs.

`mrpa'
     A version of the CSTR lexicon which has been floating about for
     years.  It contains about 25,000 entries.  A new updated free
     version of this is due to be released soon.

`BEEP'
     A British English rival for the `cmu_lex'.  BEEP has been made
     available by Tony Robinson at Cambridge and is available in many
     archives.  It contains 163,000 entries and has been converted to
     the `mrpa' phoneset (which was a trivial mapping).  Although
     large, it suffers from a certain randomness in its stress
     markings, making use of it for synthesis dubious.

   All of the above lexicons have some distribution restrictions (though
mostly pretty light), but as they are mostly freely available we provide
programs that can convert the originals into Festival's format.

   The MOBY lexicon has recently been released into the public domain
and will be converted into our format soon.


File: festival.info,  Node: Post-lexical rules,  Prev: Available lexicons,  Up: Lexicons

Post-lexical rules
==================

   It is the lexicon's job to produce a pronunciation of a given word.
However in most languages the most natural pronunciation of a word
cannot be found in isolation from the context in which it is to be
spoken.  This includes such phenomena as reduction, phrase final
devoicing and r-insertion.  In Festival this is done by post-lexical
rules.

   `PostLex' is a module which is run after accent assignment but
before duration and F0 generation.  This is because knowledge of accent
position is necessary for vowel reduction and other post lexical
phenomena and changing the segmental items will affect durations.

   The `PostLex' module first applies a set of built-in rules (which could be
done in Scheme but for historical reasons are still in C++).  It then
applies the functions set in the hook `postlex_rules_hook'.  These
should be a set of functions that take an utterance and apply
appropriate rules.  This should be set up on a per voice basis.

   Although a rule system could be devised for post-lexical sound rules
it is unclear what the scope of them should be, so we have left it
completely open.  Our vowel reduction model uses a CART decision tree to
predict which syllables should be reduced, while the "'s" rule is very
simple (shown in `festival/lib/postlex.scm').

   The `'s' in English may be pronounced in a number of different ways
depending on the preceding context.  If the preceding consonant is a
fricative or affricate, and not a palatal, labio-dental, or dental, a
schwa is required (e.g. `bench's'); otherwise no schwa is required
(e.g. `John's').  Also, if the previous phoneme is unvoiced the "s" is
rendered as an "s", while in all other cases it is rendered as a "z".

   For our English voices we have a lexical entry for "'s" as a schwa
followed by a "z".  We use a post-lexical rule function called
`postlex_apos_s_check' to modify the basic given form when required.
After lexical lookup the Segment relation contains the concatenation of
segments directly from lookup in the lexicon.  Post-lexical rules are
applied after that.

   In the following rule we check each segment to see if it is part of
a word labelled "'s"; if so, we check whether we are currently looking
at the schwa or the `z' part, and test whether modification is required
     (define (postlex_apos_s_check utt)
       "(postlex_apos_s_check UTT)
     Deal with possessive s for English (American and British).  Delete
     schwa of 's if previous is not a fricative or affricative, and
     change voiced to unvoiced s if previous is not voiced."
       (mapcar
        (lambda (seg)
          (if (string-equal "'s" (item.feat
                                  seg "R:SylStructure.parent.parent.name"))
              (if (string-equal "a" (item.feat seg 'ph_vlng))
                  (if (and (member_string (item.feat seg 'p.ph_ctype)
                                          '(f a))
                           (not (member_string
                                 (item.feat seg "p.ph_cplace")
                                 '(d b g))))
                      t;; don't delete schwa
                      (item.delete seg))
                  (if (string-equal "-" (item.feat seg "p.ph_cvox"))
                      (item.set_name seg "s")))));; from "z"
        (utt.relation.items utt 'Segment))
       utt)


File: festival.info,  Node: Utterances,  Next: Text analysis,  Prev: Lexicons,  Up: Top

Utterances
**********

   The utterance structure lies at the heart of Festival.  This chapter
describes its basic form and the functions available to manipulate it.

* Menu:

* Utterance structure::         internal structure of utterances
* Utterance types::             Type defined synthesis actions
* Example utterance types::     Some example utterances
* Utterance modules::
* Accessing an utterance::      getting the data from the structure
* Features::                    Features and features names
* Utterance I/O::               Saving and loading utterances