<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from ../festival.texi on 2 August 2001 -->

<TITLE>Festival Speech Synthesis System - 24  Voices</TITLE>
</HEAD>
<BODY bgcolor="#ffffff">
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_23.html">previous</A>, <A HREF="festival_25.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC97" HREF="festival_toc.html#TOC97">24  Voices</A></H1>

<P>
This chapter gives some general suggestions about adding new voices to
Festival.  Festival attempts to offer an environment where new voices
and languages can easily be slotted in to the system.

</P>



<H2><A NAME="SEC98" HREF="festival_toc.html#TOC98">24.1  Current voices</A></H2>

<P>
<A NAME="IDX310"></A>
Currently there are a number of voices available in Festival and we
expect that number to increase. Each is selected via a function of the
name <SAMP>`voice_*'</SAMP> which sets up the waveform synthesizer, phone set,
lexicon, duration and intonation models (and anything else necessary)
for that speaker.  These voice setup functions are defined in
<TT>`lib/voices.scm'</TT>.

</P>
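<P>
For example, once Festival is running, any of the voices listed below can be
selected at the interpreter prompt simply by calling its setup function
(assuming that voice is installed); subsequent synthesis then uses that voice:

<PRE>
festival&#62; (voice_rab_diphone)
festival&#62; (SayText "Hello world")
</PRE>
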
<P>
The current voice functions are
<DL COMPACT>

<DT><CODE>voice_rab_diphone</CODE>
<DD>
A British English male RP speaker, Roger.  This uses the UniSyn residual
excited LPC diphone synthesizer.  The lexicon is the computer users'
version of the Oxford Advanced Learners' Dictionary, with letter to sound
rules trained from that lexicon.  Intonation is provided by a ToBI-like
system using a decision tree to predict accent and end tone position.
The F0 itself is predicted as three points on each syllable, using
linear regression trained from the Boston University FM database (f2b)
and mapped to Roger's pitch range.  Duration is predicted by decision
tree, predicting zscore durations for segments trained from the 460
TIMIT sentences spoken by another British male speaker.
<DT><CODE>voice_ked_diphone</CODE>
<DD>
An American English male speaker, Kurt.  Again this uses the UniSyn
residual excited LPC diphone synthesizer.  This uses the CMU lexicon,
and letter to sound rules trained from it.  Intonation as with Roger is
trained from the Boston University FM Radio corpus.  Duration for this
voice also comes from that database.
<DT><CODE>voice_kal_diphone</CODE>
<DD>
An American English male speaker.  Again this uses the UniSyn residual
excited LPC diphone synthesizer.  And like ked, uses the CMU lexicon,
and letter to sound rules trained from it.  Intonation as with Roger is
trained from the Boston University FM Radio corpus.  Duration for this
voice also comes from that database.  This voice was built in two days'
work and is at least as good as ked, thanks to our better understanding
of the process.  The diphone labels were autoaligned with hand correction.
<DT><CODE>voice_don_diphone</CODE>
<DD>
Steve Isard's LPC based diphone synthesizer, Donovan diphones.  The
other parts of this voice, lexicon, intonation, and duration are the
same as <CODE>voice_rab_diphone</CODE> described above.  The
quality of the diphones is not as good as the other voices because it
uses spike excited LPC.  Although the quality is not as good it
is much faster and the database is much smaller than the others.
<DT><CODE>voice_el_diphone</CODE>
<DD>
A male Castilian Spanish speaker, using the Eduardo Lopez diphones.
Alistair Conkie and Borja Etxebarria did much to make this.  It has
improved recently but is not as comprehensive as our English voices.
<DT><CODE>voice_gsw_diphone</CODE>
<DD>
This offers a male RP speaker, Gordon, famed for many previous CSTR
synthesizers, using the standard diphone module.  Its higher
levels are very similar to the Roger voice above.  This voice
is not in the standard distribution, and is unlikely to be added
for commercial reasons, even though it sounds better than Roger.
<DT><CODE>voice_en1_mbrola</CODE>
<DD>
The Roger diphone set using the same front end as <CODE>voice_rab_diphone</CODE>,
but with the MBROLA diphone synthesizer for waveform synthesis.  The
MBROLA synthesizer and the Roger diphone database (called <CODE>en1</CODE>)
are not distributed by CSTR but are available free for non-commercial use
from <A HREF="http://tcts.fpms.ac.be/synthesis/mbrola.html">http://tcts.fpms.ac.be/synthesis/mbrola.html</A>.
We do however provide the Festival part of the voice in 
<TT>`festvox_en1.tar.gz'</TT>.
<DT><CODE>voice_us1_mbrola</CODE>
<DD>
A female American English voice using our standard US English front end and the
<CODE>us1</CODE> database for the MBROLA diphone synthesizer for waveform
synthesis.  The MBROLA synthesizer and the <CODE>us1</CODE> diphone database
are not distributed by CSTR but are available free for
non-commercial use from
<A HREF="http://tcts.fpms.ac.be/synthesis/mbrola.html">http://tcts.fpms.ac.be/synthesis/mbrola.html</A>.  We
provide the Festival part of the voice in <TT>`festvox_us1.tar.gz'</TT>.
<DT><CODE>voice_us2_mbrola</CODE>
<DD>
A male American English voice using our standard US English front end and the
<CODE>us2</CODE> database for the MBROLA diphone synthesizer for waveform
synthesis.  The MBROLA synthesizer and the <CODE>us2</CODE> diphone database
are not distributed by CSTR but are available free for
non-commercial use from
<A HREF="http://tcts.fpms.ac.be/synthesis/mbrola.html">http://tcts.fpms.ac.be/synthesis/mbrola.html</A>.  We
provide the Festival part of the voice in <TT>`festvox_us2.tar.gz'</TT>.
<DT><CODE>voice_us3_mbrola</CODE>
<DD>
Another male American English voice using our standard US English front
end and the <CODE>us3</CODE> database for the MBROLA diphone synthesizer for
waveform synthesis.  The MBROLA synthesizer and the <CODE>us3</CODE> diphone
database are not distributed by CSTR but are available free for non-commercial
use from <A HREF="http://tcts.fpms.ac.be/synthesis/mbrola.html">http://tcts.fpms.ac.be/synthesis/mbrola.html</A>.
We provide the Festival part of the voice in <TT>`festvox_us3.tar.gz'</TT>.
</DL>
<P>
<A NAME="IDX311"></A>
Other voices will become available through time.  Groups other than CSTR
are working on new voices.  In particular, OGI's CSLU has released a
number of American English voices, two Mexican Spanish voices and two German
voices.  All use OGI's own residual excited LPC
synthesizer, which is distributed as a plug-in for Festival
(see <A HREF="http://www.cse.ogi.edu/CSLU/research/TTS">http://www.cse.ogi.edu/CSLU/research/TTS</A> for
details).

</P>
<P>
Other languages are being worked on: voices for German, Basque, Welsh,
Greek and Polish have already been developed and could be released soon.
CSTR has a set of Klingon diphones, though the text analysis for Klingon
still requires some work.  (If anyone has access to a good Klingon
continuous speech corpus please let us know.)

</P>
<P>
Pointers and examples of voices developed at CSTR and elsewhere will
be posted on the Festival home page.

</P>


<H2><A NAME="SEC99" HREF="festival_toc.html#TOC99">24.2  Building a new voice</A></H2>

<P>
<A NAME="IDX312"></A>
This section runs through the definition of a new voice in Festival.
Although this voice is simple (it is a simplified version of the
distributed spanish voice) it shows all the major parts that must be
defined to get Festival to speak in a new voice.  Thanks go to Alistair
Conkie for helping me define this, but as I don't speak Spanish there are
probably many mistakes.  Hopefully its pedagogical use is better than
its ability to be understood in Castile.

</P>
<P>
A much more detailed document on building voices in Festival has been
written and is recommended reading for anyone attempting to add a new
voice to Festival <CITE>black99</CITE>.  The information here is a little
sparse, though it gives the basic requirements.

</P>
<P>
The general method for defining a new voice is to define the
parameters for all the various sub-parts, e.g. phone set, duration
parameters, intonation parameters, etc., then define a function
of the form <CODE>voice_NAME</CODE> which when called will actually
select the voice.

</P>


<H3><A NAME="SEC100" HREF="festival_toc.html#TOC100">24.2.1  Phoneset</A></H3>

<P>
<A NAME="IDX313"></A>
For most new languages and often for new dialects, a new
phoneset is required.  It is really the basic building
block of a voice and most other parts are defined in terms
of this set, so defining it first is a good start.

<PRE>
(defPhoneSet
  spanish
  ;;;  Phone Features
  (;; vowel or consonant
   (vc + -)  
   ;; vowel length: short long diphthong schwa
   (vlng s l d a 0)
   ;; vowel height: high mid low
   (vheight 1 2 3 -)
   ;; vowel frontness: front mid back
   (vfront 1 2 3 -)
   ;; lip rounding
   (vrnd + -)
   ;; consonant type: stop fricative affricative nasal liquid
   (ctype s f a n l 0)
   ;; place of articulation: labial alveolar palatal labio-dental
   ;;                         dental velar
   (cplace l a p b d v 0)
   ;; consonant voicing
   (cvox + -)
   )
  ;; Phone set members (features are not! set properly)
  (
   (#  - 0 - - - 0 0 -)
   (a  + l 3 1 - 0 0 -)
   (e  + l 2 1 - 0 0 -)
   (i  + l 1 1 - 0 0 -)
   (o  + l 3 3 - 0 0 -)
   (u  + l 1 3 + 0 0 -)
   (b  - 0 - - + s l +)
   (ch - 0 - - + a a -)
   (d  - 0 - - + s a +)
   (f  - 0 - - + f b -)
   (g  - 0 - - + s p +)
   (j  - 0 - - + l a +)
   (k  - 0 - - + s p -)
   (l  - 0 - - + l d +)
   (ll - 0 - - + l d +)
   (m  - 0 - - + n l +)
   (n  - 0 - - + n d +)
   (ny - 0 - - + n v +)
   (p  - 0 - - + s l -)
   (r  - 0 - - + l p +)
   (rr - 0 - - + l p +)
   (s  - 0 - - + f a +)
   (t  - 0 - - + s t +)
   (th - 0 - - + f d +)
   (x  - 0 - - + a a -)
  )
)
(PhoneSet.silences '(#))
</PRE>

<P>
<A NAME="IDX314"></A>
Note some phonetic features may be wrong.

</P>


<H3><A NAME="SEC101" HREF="festival_toc.html#TOC101">24.2.2  Lexicon and LTS</A></H3>

<P>
Spanish is a language whose pronunciation can almost completely be
predicted from its orthography so in this case we do not need a
list of words and their pronunciations and can do most of the work
with letter to sound rules.

</P>
<P>
<A NAME="IDX315"></A>
Let us first make a lexicon structure as follows

<PRE>
(lex.create "spanish")
(lex.set.phoneset "spanish")
</PRE>

<P>
However if we did just want a few entries to test our system without
building any letter to sound rules we could add entries directly to
the addenda.  For example

<PRE>
(lex.add.entry
   '("amigos" nil (((a) 0) ((m i) 1) (g o s))))
</PRE>
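<P>
Once that lexicon has been selected (the <CODE>lex.select</CODE> call appears
in the voice definition function below), such an addenda entry can be checked
interactively; a quick sanity check might look like

<PRE>
festival&#62; (lex.select "spanish")
festival&#62; (lex.lookup "amigos" nil)
</PRE>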

<P>
A letter to sound rule system for Spanish is quite simple 
in the format supported by Festival.  The following is a good
start to a full set.
<A NAME="IDX316"></A>

<PRE>
(lts.ruleset
;  Name of rule set
 spanish
;  Sets used in the rules
(
  (LNS l n s )
  (AEOU a e o u )
  (AEO a e o )
  (EI e i )
  (BDGLMN b d g l m n )
)
;  Rules
(
 ( [ a ] = a )
 ( [ e ] = e )
 ( [ i ] = i )
 ( [ o ] = o )
 ( [ u ] = u )
 ( [ "'" a ] = a1 )   ;; stressed vowels
 ( [ "'" e ] = e1 )
 ( [ "'" i ] = i1 )
 ( [ "'" o ] = o1 )
 ( [ "'" u ] = u1 )
 ( [ b ] = b )
 ( [ v ] = b )
 ( [ c ] "'" EI = th )
 ( [ c ] EI = th )
 ( [ c h ] = ch )
 ( [ c ] = k )
 ( [ d ] = d )
 ( [ f ] = f )
 ( [ g ] "'" EI = x )
 ( [ g ] EI = x )
 ( [ g u ] "'" EI = g )
 ( [ g u ] EI = g )
 ( [ g ] = g )
 ( [ h u e ] = u e )
 ( [ h i e ] = i e )
 ( [ h ] =  )
 ( [ j ] = x )
 ( [ k ] = k )
 ( [ l l ] # = l )
 ( [ l l ] = ll )
 ( [ l ] = l )
 ( [ m ] = m )
 ( [ ~ n ] = ny )
 ( [ n ] = n )
 ( [ p ] = p )
 ( [ q u ] = k )
 ( [ r r ] = rr )
 ( # [ r ] = rr )
 ( LNS [ r ] = rr )
 ( [ r ] = r )
 ( [ s ] BDGLMN = th )
 ( [ s ] = s )
 ( # [ s ] C = e s )
 ( [ t ] = t )
 ( [ w ] = u )
 ( [ x ] = k s )
 ( AEO [ y ] = i )
 ( # [ y ] # = i )
 ( [ y ] = ll )
 ( [ z ] = th )
))
</PRE>

<P>
We could simply set our lexicon to use the above letter to sound
system with the following command

<PRE>
(lex.set.lts.ruleset 'spanish)
</PRE>

<P>
But this would not deal with upper case letters.  Instead of
writing new rules for upper case letters we can arrange for a Lisp
function to be called when looking up a word and intercept
the lookup with our own function.  First we state that unknown
words should
call a function, and then define the function we wish to be called.
The actual link to ensure our function will be called is made
below at lexicon selection time

<PRE>
(define (spanish_lts word features)
  "(spanish_lts WORD FEATURES)
Using letter to sound rules build a spanish pronunciation of WORD."
  (list word
        nil
        (lex.syllabify.phstress (lts.apply (downcase word) 'spanish))))
(lex.set.lts.method spanish_lts)
</PRE>

<P>
In the function we downcase the word and apply the LTS rule to it.
Next we syllabify it and return the created lexical entry.

</P>
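<P>
Once the rules and the function above are loaded, the conversion can be tried
directly at the Festival prompt (the exact syllabification returned will of
course depend on your rule set):

<PRE>
festival&#62; (spanish_lts "Amigos" nil)
</PRE>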


<H3><A NAME="SEC102" HREF="festival_toc.html#TOC102">24.2.3  Phrasing</A></H3>

<P>
Without detailed labelled databases we cannot build statistical models
of phrase breaks, but we can simply build a phrase break model based
on punctuation.  The following is a CART tree to predict simple breaks
from punctuation.

<PRE>
(set! spanish_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((NB))))))
</PRE>



<H3><A NAME="SEC103" HREF="festival_toc.html#TOC103">24.2.4  Intonation</A></H3>

<P>
For intonation there are a number of simple options that do not require
training data.  For this example we will simply use a hat pattern on all
stressed syllables in content words and on single syllable content
words (i.e. the <CODE>Simple</CODE> intonation method).  Thus we need an
accent prediction CART tree.

<PRE>
(set! spanish_accent_cart_tree
 '
  ((R:SylStructure.parent.gpos is content)
   ((stress is 1)
    ((Accented))
    ((position_type is single)
     ((Accented))
     ((NONE))))
   ((NONE))))
</PRE>

<P>
We also need to specify the pitch range of our speaker.  We will
be using a male Spanish diphone database with the following range

<PRE>
(set! spanish_el_int_simple_params
    '((f0_mean 120) (f0_std 30)))
</PRE>



<H3><A NAME="SEC104" HREF="festival_toc.html#TOC104">24.2.5  Duration</A></H3>

<P>
We will use the trick mentioned above for duration prediction: although
we use the zscore CART tree method, we will actually use it to
predict factors rather than zscores.

</P>
<P>
The tree predicts longer durations in stressed syllables and in
clause initial and clause final syllables.

<PRE>
(set! spanish_dur_tree
 '
   ((R:SylStructure.parent.R:Syllable.p.syl_break &#62; 1 ) ;; clause initial
    ((R:SylStructure.parent.stress is 1)
     ((1.5))
     ((1.2)))
    ((R:SylStructure.parent.syl_break &#62; 1)   ;; clause final
     ((R:SylStructure.parent.stress is 1)
      ((2.0))
      ((1.5)))
     ((R:SylStructure.parent.stress is 1)
      ((1.2))
      ((1.0))))))
</PRE>

<P>
In addition to the tree we need durations for each phone in the
set

<PRE>
(set! spanish_el_phone_data
'(
   (# 0.0 0.250)
   (a 0.0 0.090)
   (e 0.0 0.090)
   (i 0.0 0.080)
   (o 0.0 0.090)
   (u 0.0 0.080)
   (b 0.0 0.065)
   (ch 0.0 0.135)
   (d 0.0 0.060)
   (f 0.0 0.100)
   (g 0.0 0.080)
   (j 0.0 0.100)
   (k 0.0 0.100)
   (l 0.0 0.080)
   (ll 0.0 0.105)
   (m 0.0 0.070)
   (n 0.0 0.080)
   (ny 0.0 0.110)
   (p 0.0 0.100)
   (r 0.0 0.030)
   (rr 0.0 0.080)
   (s 0.0 0.110)
   (t 0.0 0.085)
   (th 0.0 0.100)
   (x 0.0 0.130)
))
</PRE>
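<P>
This is why the tree above can predict multiplicative factors: since the
second field of each entry (the "mean") is 0.0 and the third field (the
"standard deviation") holds the average duration of the phone, the
<CODE>Tree_ZScores</CODE> prediction of mean plus factor times standard
deviation reduces to the tree's factor multiplied by that average duration.
For example a stressed, clause-final <CODE>a</CODE> would get
2.0 * 0.090 = 0.180 seconds, while the same phone in an unstressed,
non-boundary syllable would get 1.0 * 0.090 = 0.090 seconds.

</P>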



<H3><A NAME="SEC105" HREF="festival_toc.html#TOC105">24.2.6  Waveform synthesis</A></H3>

<P>
There are a number of choices for waveform synthesis currently
supported.  MBROLA supports Spanish, so we could use that.  But their
Spanish diphones in fact use a slightly different phoneset so we would
need to change the above definitions to use it effectively.  Here we will
use a diphone database for Spanish recorded by Eduardo Lopez when he was
a Masters student some years ago.

</P>
<P>
Here we simply load our pre-built diphone database

<PRE>
(us_diphone_init
   (list
    '(name "el_lpc_group")
    (list 'index_file 
          (path-append spanish_el_dir "group/ellpc11k.group"))
    '(grouped "true")
    '(default_diphone "#-#")))
</PRE>
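<P>
Note that <CODE>spanish_el_dir</CODE> is assumed to have been set elsewhere to
point at the directory containing the diphone database; a minimal definition
(the path here is purely illustrative) would be something like

<PRE>
(set! spanish_el_dir "/home/me/diphones/spanish_el/")
</PRE>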



<H3><A NAME="SEC106" HREF="festival_toc.html#TOC106">24.2.7  Voice selection function</A></H3>

<P>
The standard way to define a voice in Festival is to define
a function of the form <CODE>voice_NAME</CODE> which selects
all the appropriate parameters.  Because the definition
below follows the above definitions we know that everything
appropriate has been loaded into Festival, and hence we
just need to select the appropriate parameters.

</P>

<PRE>
(define (voice_spanish_el)
"(voice_spanish_el)
Set up synthesis for Male Spanish speaker: Eduardo Lopez"
  (voice_reset)
  (Parameter.set 'Language 'spanish)  
  ;; Phone set
  (Parameter.set 'PhoneSet 'spanish)
  (PhoneSet.select 'spanish)
  (set! pos_lex_name nil)
  ;; Phrase break prediction by punctuation
  (set! pos_supported nil)
  ;; Phrasing
  (set! phrase_cart_tree spanish_phrase_cart_tree)
  (Parameter.set 'Phrase_Method 'cart_tree)
  ;; Lexicon selection
  (lex.select "spanish")
  ;; Accent prediction
  (set! int_accent_cart_tree spanish_accent_cart_tree)
  (set! int_simple_params spanish_el_int_simple_params)
  (Parameter.set 'Int_Method 'Simple)
  ;; Duration prediction
  (set! duration_cart_tree spanish_dur_tree)
  (set! duration_ph_info spanish_el_phone_data)
  (Parameter.set 'Duration_Method 'Tree_ZScores)
  ;; Waveform synthesizer: diphones
  (Parameter.set 'Synth_Method 'UniSyn)
  (Parameter.set 'us_sigpr 'lpc)
  (us_db_select 'el_lpc_group)

  (set! current-voice 'spanish_el)
)

(provide 'spanish_el)
</PRE>
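<P>
Following the convention described in section 24.3 below, the same file would
typically also proclaim the voice so that the system knows its basic
properties.  A minimal sketch (the dialect value and wording are only
illustrative) might be

<PRE>
(proclaim_voice
 'spanish_el
 '((language spanish)
   (gender male)
   (dialect castilian)
   (description
    "Male Castilian Spanish speaker Eduardo Lopez, using residual
     excited LPC diphone synthesis.")))
</PRE>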



<H3><A NAME="SEC107" HREF="festival_toc.html#TOC107">24.2.8  Last remarks</A></H3>

<P>
We save the above definitions in a file <TT>`spanish_el.scm'</TT>.  Now we
can declare the new voice to Festival.  See section <A HREF="festival_24.html#SEC109">24.3  Defining a new voice</A>
for a description of methods for adding new voices.  For testing
purposes we can explicitly load the file <TT>`spanish_el.scm'</TT>.

</P>
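<P>
For example, assuming <TT>`spanish_el.scm'</TT> is in the current directory or
somewhere on <CODE>load-path</CODE>:

<PRE>
festival&#62; (load "spanish_el.scm")
</PRE>
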
<P>
The voice is now available for use in festival.

<PRE>
festival&#62; (voice_spanish_el)
spanish_el
festival&#62; (SayText "hola amigos")
&#60;Utterance 0x04666&#62;
</PRE>

<P>
As you can see, adding a new voice is not very difficult.  Of course
there is quite a lot more than the above involved in adding a high quality
robust voice to Festival.  But as we can see, many of the basic tools that we
wish to use already exist.  The main difference between the above voice
and the English voices already in Festival is that their models are
better trained from databases.  This produces, in general, better
results, but the concepts behind them are basically the same.  All
of those trainable methods may be parameterized with data for
new voices.

</P>
<P>
As Festival develops, more modules will be added with better support for
training new voices, so in the end we hope that adding high quality
new voices is actually as simple as (or indeed simpler than) the above
description.

</P>


<H3><A NAME="SEC108" HREF="festival_toc.html#TOC108">24.2.9  Resetting globals</A></H3>

<P>
<A NAME="IDX317"></A>
<A NAME="IDX318"></A>
Because the version of Scheme used in Festival only has a single flat
name space it is unfortunately too easy for voices to set some global
which accidentally affects all other voices selected after it.  Because
of this we have introduced a convention to try to minimise the
possibility of it becoming a problem.  Each voice function
defined should always call <CODE>voice_reset</CODE> at the start.  This
will reset any globals and also call a tidy up function provided by
the previous voice function.  

</P>
<P>
Likewise in your new voice function you should provide a tidy up
function to reset any non-standard global variables you set.  The
function <CODE>current_voice_reset</CODE> will be called by
<CODE>voice_reset</CODE>.  If the value of <CODE>current_voice_reset</CODE> is
<CODE>nil</CODE> then it is not called.  <CODE>voice_reset</CODE> sets
<CODE>current_voice_reset</CODE> to <CODE>nil</CODE>, after calling it.

</P>
<P>
For example suppose some new voice requires the audio device to
be directed to a different machine.  In this example we make
the giant's voice go through the netaudio machine <CODE>big_speakers</CODE>
while the standard voices go through <CODE>small_speakers</CODE>.

</P>
<P>
Although we can easily select the machine <CODE>big_speakers</CODE> as output
when our <CODE>voice_giant</CODE> is called, we also need to set it back when
the next voice is selected, and don't want to have to modify every other
voice defined in the system.  Let us first define two functions to
select the audio output.

<PRE>
(define (select_big)
  (set! giant_previous_audio (getenv "AUDIOSERVER"))
  (setenv "AUDIOSERVER" "big_speakers"))

(define (select_normal)
  (setenv "AUDIOSERVER" giant_previous_audio))
</PRE>

<P>
Note we save the previous value of <CODE>AUDIOSERVER</CODE> rather than simply
assuming it was <CODE>small_speakers</CODE>.

</P>
<P>
Our definition of <CODE>voice_giant</CODE>
will look something like

<PRE>
(define (voice_giant)
"comment comment ..."
   (voice_reset)  ;; get into a known state
   (select_big)
   ;;; other giant voice parameters
   ...

   (set! current_voice_reset select_normal)
   (set! current-voice 'giant))
</PRE>

<P>
The obvious question is which variables should a voice reset.
Unfortunately there is not a definitive answer to that.  To a certain
extent I don't want to define that list, as there will be many variables
set by various people in Festival which are not in the original
distribution and we don't want to restrict them.  The longer term answer
is some form of partitioning of the Scheme name space, perhaps having
voice local variables (cf. Emacs buffer local variables).  But
ultimately a voice may set global variables which could redefine the
operation of later selected voices and there seems no real way to stop
that and keep the generality of the system.

</P>
<P>
<A NAME="IDX319"></A>
Note the convention of setting the global <CODE>current-voice</CODE> at
the end of any voice definition file.  We do not enforce this
but probably should.  The variable <CODE>current-voice</CODE> at
any time should identify the current voice; the voice
description information (described below) will relate this name
to properties identifying it.

</P>


<H2><A NAME="SEC109" HREF="festival_toc.html#TOC109">24.3  Defining a new voice</A></H2>

<P>
<A NAME="IDX320"></A>
As there are a number of voices available for Festival and they may or
may not exist in different installations, we have tried to make it
as simple as possible to add new voices to the system without having
to change any of the basic distribution.  In fact, if the voices
use the following standard method for describing themselves it
is merely a matter of unpacking them in order for them to be used
by the system.

</P>
<P>
<A NAME="IDX321"></A>
The variable <CODE>voice-path</CODE> contains a list of directories where
voices will be automatically searched for.  If this is not set it is set
automatically by appending <TT>`/voices/'</TT> to all paths in Festival's
<CODE>load-path</CODE>.  You may add new directories explicitly to this
variable in your <TT>`sitevars.scm'</TT> file or your own <TT>`.festivalrc'</TT>
as you wish.

</P>
<P>
Each voice directory is assumed to be of the form

<PRE>
LANGUAGE/VOICENAME/
</PRE>

<P>
Within the <CODE>VOICENAME/</CODE> directory itself it is assumed there is a
file <TT>`festvox/VOICENAME.scm'</TT> which when loaded will define the
voice itself.  The actual voice function should be called
<CODE>voice_VOICENAME</CODE>.

</P>
<P>
For example the voices distributed with the standard Festival
distribution all unpack in <TT>`festival/lib/voices'</TT>.  The American
voice <TT>`ked_diphone'</TT> unpacks into

<PRE>
festival/lib/voices/english/ked_diphone/
</PRE>

<P>
Its actual definition file is in 

<PRE>
festival/lib/voices/english/ked_diphone/festvox/ked_diphone.scm
</PRE>

<P>
Note the name of the directory and the name of the Scheme definition
file must be the same.  

</P>
<P>
Alternative voices using perhaps a different encoding of the database but
the same front end may be defined in the same way by using symbolic
links in the language directory to the main directory.  For example
a PSOLA version of the ked voice may be defined in

<PRE>
festival/lib/voices/english/ked_diphone/festvox/ked_psola.scm
</PRE>

<P>
Adding a symbolic link in <TT>`festival/lib/voices/english/'</TT> to
<TT>`ked_diphone'</TT> called <TT>`ked_psola'</TT> will allow that voice
to be automatically registered when Festival starts up.

</P>
<P>
Note that this method doesn't actually load the voices it finds, as that
could be prohibitively time consuming at start up.  It
blindly assumes that there is a file
<TT>`VOICENAME/festvox/VOICENAME.scm'</TT> to load.  An autoload definition
is given for <CODE>voice_VOICENAME</CODE> which when called will load that
file and call the real definition if it exists in the file.

</P>
<P>
This is only a recommended method to make adding new voices easier; it
may be ignored if you wish.  However we still recommend that even if you
use your own conventions for adding new voices you consider using the autoload
function to define them in, for example, the <TT>`siteinit.scm'</TT> file or
<TT>`.festivalrc'</TT>.  The autoload function takes three arguments:
a function name, a file containing the actual definition and a comment.
For example a definition of a voice can be done explicitly by

<PRE>
(autoload voice_f2b  "/home/awb/data/f2b/ducs/f2b_ducs"
     "American English female f2b")
</PRE>

<P>
Of course you can also load the definition file explicitly if you
wish.

</P>
<P>
<A NAME="IDX322"></A>
In order to allow the system to start making intelligent use of voices
we recommend that all voice definitions include a call to the function
<CODE>proclaim_voice</CODE>.  This allows the system to know some properties
about the voice such as language, gender and dialect.  The
<CODE>proclaim_voice</CODE> function takes two arguments: a name (e.g.
<CODE>rab_diphone</CODE>) and an assoc list of features and names.  Currently
we require <CODE>language</CODE>, <CODE>gender</CODE>, <CODE>dialect</CODE> and
<CODE>description</CODE>, the last being a textual description of the voice
itself.  An example proclamation is

<PRE>
(proclaim_voice
 'rab_diphone
 '((language english)
   (gender male)
   (dialect british)
   (description
    "This voice provides a British RP English male voice using a
     residual excited LPC diphone synthesis method.  It uses a 
     modified Oxford Advanced Learners' Dictionary for pronunciations.
     Prosodic phrasing is provided by a statistically trained model
     using part of speech and local distribution of breaks.  Intonation
     is provided by a CART tree predicting ToBI accents and an F0 
     contour generated from a model trained from natural speech.  The
     duration model is also trained from data using a CART tree.")))
</PRE>

<P>
There are functions to access a description.  <CODE>voice.description</CODE>
will return the description for a given voice and will load that voice
if it is not already loaded.  <CODE>voice.describe</CODE> will describe the
given voice by synthesizing the textual description using the
current voice.  It would be nice to use the voice itself to give a self
introduction, but unfortunately that introduces the problem of deciding
which language the description should be in; we are not all as fluent in
Welsh as we'd like to be.

</P>
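<P>
For example, assuming the <CODE>rab_diphone</CODE> voice is installed:

<PRE>
festival&#62; (voice.description 'rab_diphone)
festival&#62; (voice.describe 'rab_diphone)
</PRE>
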
<P>
The function <CODE>voice.list</CODE> will list the <EM>potential</EM> voices in
the system.  These are the names of voices which have been found in the
<CODE>voice-path</CODE>.  As they have not actually been loaded they can't
be confirmed as usable voices.  One solution to this would be
to load all voices at start up time, which would allow confirmation that they
exist and to get their full description through <CODE>proclaim_voice</CODE>.
But start up is already too slow in Festival so we have to accept this
state for the time being.  Splitting the description of the voice from
the actual definition is a possible solution to this problem but we have
not yet looked into this.

</P>
<P><HR><P>
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_23.html">previous</A>, <A HREF="festival_25.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
</BODY>
</HTML>