<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from ../festival.texi on 2 August 2001 -->

<TITLE>Festival Speech Synthesis System - 21  Diphone synthesizer</TITLE>
</HEAD>
<BODY bgcolor="#ffffff">
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_20.html">previous</A>, <A HREF="festival_22.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC85" HREF="festival_toc.html#TOC85">21  Diphone synthesizer</A></H1>

<P>
<EM>NOTE:</EM> use of this diphone synthesizer is deprecated and it
will probably be removed from future versions; all of its functionality
has been replaced by the UniSyn synthesizer.  It is not
compiled by default; if required, add <CODE>ALSO_INCLUDE += diphone</CODE>
to your <TT>`festival/config/config'</TT> file.

</P>
<P>
<A NAME="IDX278"></A>
A basic diphone synthesizer offers a method
for making speech from segments, durations and intonation
targets.  This module was mostly written by Alistair Conkie
but the base diphone format is compatible with previous CSTR 
diphone synthesizers.

</P>
<P>
The synthesizer offers residual excited LPC based synthesis (<CITE>hunt89</CITE>)
and PSOLA (TM) (<CITE>moulines90</CITE>) (PSOLA is not available for
distribution).

</P>



<H2><A NAME="SEC86" HREF="festival_toc.html#TOC86">21.1  Diphone database format</A></H2>

<P>
A diphone database consists of a <EM>dictionary file</EM>, a set
of <EM>waveform files</EM>, and a set of <EM>pitch mark files</EM>.  These
files are in the same format as those of the previous CSTR (Osprey) synthesizer.

</P>
<P>
The dictionary file consists of one entry per line.  Each entry
consists of five fields: a diphone name of the form <VAR>P1-P2</VAR>, a
filename (without extension), a floating point start position in the
file in milliseconds, a mid position in milliseconds (change in phone),
and an end position in milliseconds.  Lines starting with a semi-colon
and blank lines are ignored.  The list may be in any order.

</P>
<P>
For example, a partial list of diphones may look like this:

<PRE>
ch-l  r021   412.035  463.009  518.23  
jh-l  d747   305.841  382.301  446.018 
h-l   d748   356.814  403.54   437.522 
#-@   d404   233.628  297.345  331.327 
@-#   d001   836.814  938.761  1002.48 
</PRE>
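<P>
The five-field entry format described above can be read with a few lines of
Python.  This is only a sketch of the format, not the reader used by
Festival: the names <CODE>DiphoneEntry</CODE> and <CODE>parse_dictionary</CODE>
are hypothetical, and fields are assumed to be separated by whitespace as in
the listing above.
</P>

```python
from typing import NamedTuple


class DiphoneEntry(NamedTuple):
    name: str        # diphone name of the form P1-P2
    filename: str    # file name without extension
    start_ms: float  # start position in milliseconds
    mid_ms: float    # phone boundary in milliseconds
    end_ms: float    # end position in milliseconds


def parse_dictionary(text):
    """Parse dictionary text into a dict keyed by diphone name."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with a semi-colon are ignored.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError("expected 5 fields, got: " + line)
        start, mid, end = (float(f) for f in fields[2:])
        entries[fields[0]] = DiphoneEntry(fields[0], fields[1], start, mid, end)
    return entries


sample = """\
; partial diphone list
ch-l  r021   412.035  463.009  518.23
#-@   d404   233.628  297.345  331.327

@-#   d001   836.814  938.761  1002.48
"""
index = parse_dictionary(sample)
```

<P>
Because entries are keyed by name, the order of lines in the file does not
matter, which matches the format description.
</P>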

<P>
Waveform files may be in any format, headered or unheadered, as long as
every file is of the same type and the format is supported by the
speech tools wave reading functions.  These may be standard linear PCM
waveform files in the case of PSOLA, or LPC coefficients and residuals
when using the residual LPC synthesizer.  See section <A HREF="festival_21.html#SEC87">21.2  LPC databases</A>.

</P>
<P>
Pitch mark files consist of a simple list of the positions of each pitch
mark in the file, in milliseconds (with decimal places), in order, one per
line.  For high quality diphone synthesis these should be derived
from laryngograph data.  During unvoiced sections pitch marks should be
artificially created at reasonable intervals (e.g. 10 ms).  In the
current format there is no way to distinguish the "real" pitch marks from
the "unvoiced" pitch marks.

</P>
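<P>
Filling unvoiced stretches with artificial pitch marks might be sketched as
follows.  The 10 ms step comes from the text above; the 20 ms threshold for
deciding that a gap is unvoiced, the three decimal places in the output, and
the function names are all assumptions made for illustration.
</P>

```python
def artificial_marks(start_ms, end_ms, step_ms):
    """Evenly spaced marks inside the gap, at least step_ms from each end."""
    marks = []
    t = start_ms + step_ms
    while t <= end_ms - step_ms:
        marks.append(round(t, 3))
        t += step_ms
    return marks


def fill_unvoiced(voiced_ms, file_end_ms, max_gap_ms=20.0, step_ms=10.0):
    """Add artificial marks wherever consecutive marks are too far apart."""
    filled = []
    previous = 0.0
    for mark in sorted(voiced_ms):
        if mark - previous > max_gap_ms:
            filled.extend(artificial_marks(previous, mark, step_ms))
        filled.append(mark)
        previous = mark
    # Handle a trailing unvoiced stretch up to the end of the file.
    if file_end_ms - previous > max_gap_ms:
        filled.extend(artificial_marks(previous, file_end_ms, step_ms))
    return filled


def format_pitch_marks(marks_ms):
    """One position in milliseconds per line, in order."""
    return "".join("%.3f\n" % m for m in marks_ms)


marks = fill_unvoiced([50.0, 58.0, 66.0, 110.0], file_end_ms=130.0)
pm_text = format_pitch_marks(marks)
```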
<P>
It is normal to hold a diphone database in a directory with
a number of sub-directories, namely <TT>`dic/'</TT> containing
the dictionary file, <TT>`wave/'</TT> for the waveform files, typically
of whole nonsense words (sometimes this directory is called 
<TT>`vox/'</TT> for historical reasons), and <TT>`pm/'</TT> for
the pitch mark files.  The filename in the dictionary entry should
be the same for the waveform file and the pitch mark file (with different
extensions).

</P>


<H2><A NAME="SEC87" HREF="festival_toc.html#TOC87">21.2  LPC databases</A></H2>

<P>
The standard method for diphone resynthesis in the released system is
residual excited LPC (<CITE>hunt89</CITE>).  The actual method of resynthesis
isn't important to the database format, but if residual LPC synthesis
is to be used then it is necessary to make the LPC coefficient
files and their corresponding residuals.

</P>
<P>
Previous versions of the system used a "host of hacky little scripts"
to do this, but now that the Edinburgh Speech Tools supports LPC analysis
we can provide a walk-through for generating these.

</P>
<P>
We assume that the waveform files of nonsense words are in a directory
called <TT>`wave/'</TT>.  The LPC coefficients and residuals will be, in
this example, stored in <TT>`lpc16k/'</TT> with extensions <TT>`.lpc'</TT> and
<TT>`.res'</TT> respectively.

</P>
<P>
Before starting it is worth considering power normalization.  We have
found this important on all of the databases we have collected so far.
The <CODE>ch_wave</CODE> program, part of the speech tools, with the option
<CODE>-scaleN 0.4</CODE>, may be used if a more complex method is not
available.

</P>
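<P>
As a rough illustration of the simple approach, the sketch below scales a
waveform so that its largest sample sits at 0.4 of full scale.  It assumes
that <CODE>-scaleN</CODE> amounts to peak normalisation followed by scaling
with the given factor; it is not the <CODE>ch_wave</CODE> implementation, and
<CODE>scale_normalised</CODE> is a hypothetical name.
</P>

```python
def scale_normalised(samples, factor=0.4, full_scale=32767):
    """Scale so the largest absolute sample equals factor * full_scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = factor * full_scale / peak
    return [int(round(s * gain)) for s in samples]


normalised = scale_normalised([0, 1000, -2000])
```

<P>
Applying the same target level to every file keeps the recorded nonsense
words at comparable levels, which is the purpose of the normalisation step.
</P>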
<P>
The following shell commands generate the files:

<PRE>
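# Create the output directory for coefficients and residuals
mkdir -p lpc16k
for i in wave/*.wav
do
   fname=$(basename "$i" .wav)
   echo "$i"
   # Reflection coefficients, 10 ms frame shift, order 18
   lpc_analysis -reflection -shift 0.01 -order 18 -o "lpc16k/$fname.lpc" \
       -r "lpc16k/$fname.res" -otype htk -rtype nist "$i"
done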
</PRE>

<P>
A common rule of thumb is that the LPC order should be the sample rate in Hz
divided by one thousand, plus 2 (e.g. 18 for 16 kHz data).  This may or may
not be appropriate, and if you are particularly worried about the database
size it is worth experimenting.

</P>
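<P>
The rule of thumb and its effect on storage can be written out as a small
calculation.  The function names are hypothetical, and the count of values
per second ignores any file header or quantisation.
</P>

```python
def lpc_order(sample_rate_hz):
    """Rule of thumb: sample rate divided by one thousand, plus 2."""
    return sample_rate_hz // 1000 + 2


def coefficients_per_second(sample_rate_hz, frame_shift_s):
    """Number of coefficient values stored per second of speech."""
    frames_per_second = round(1.0 / frame_shift_s)
    return lpc_order(sample_rate_hz) * frames_per_second


order_16k = lpc_order(16000)
values_16k = coefficients_per_second(16000, 0.01)
```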
<P>
The program <TT>`lpc_analysis'</TT>, found in <TT>`speech_tools/bin'</TT>,
can be used to generate the LPC coefficients and residual.  Note that
these should be reflection coefficients so they may be quantised
(as they are in group files).

</P>
<P>
The coefficient and residual files produced by different LPC analysis
programs may start at different offsets.  For example, Entropic's ESPS
functions generate LPC coefficients that are offset by one frame shift
(e.g. 0.01 seconds).  Our own <TT>`lpc_analysis'</TT> routine has no offset.
The <CODE>Diphone_Init</CODE> parameter list allows these offsets to be
specified.  When the above commands are used to generate the LPC files, the
description parameters should include

<PRE>
  (lpc_frame_offset 0)
  (lpc_res_offset 0.0)
</PRE>

<P>
When generating with ESPS routines, the description should instead include

<PRE>
  (lpc_frame_offset 1)
  (lpc_res_offset 0.01)
</PRE>

<P>
If they are not explicitly given, the defaults follow the ESPS form:
<CODE>lpc_frame_offset</CODE> is 1 and <CODE>lpc_res_offset</CODE> is equal
to the frame shift.

</P>
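<P>
One reading of the two offset parameters is sketched below.  The mapping is
an assumption drawn from the description of the offsets as frames or seconds
"missing" from the start of a file, not taken from the Festival source, and
the function names are hypothetical.
</P>

```python
def lpc_frame_index(time_s, frame_shift_s, lpc_frame_offset):
    """Index in the coefficient file of the frame nearest to time_s.

    Assumes file frame k describes time (k + lpc_frame_offset) * frame_shift_s.
    """
    index = int(round(time_s / frame_shift_s)) - lpc_frame_offset
    return max(index, 0)  # times before the first stored frame map to frame 0


def residual_sample_index(time_s, samp_freq, lpc_res_offset):
    """Index in the residual file of the sample at time_s.

    Assumes file sample n lies at time lpc_res_offset + n / samp_freq.
    """
    index = int(round((time_s - lpc_res_offset) * samp_freq))
    return max(index, 0)


# A pitch mark at 250 ms in 16 kHz data with a 10 ms frame shift
no_offset = (lpc_frame_index(0.25, 0.01, 0),
             residual_sample_index(0.25, 16000, 0.0))
esps_style = (lpc_frame_index(0.25, 0.01, 1),
              residual_sample_index(0.25, 16000, 0.01))
```

<P>
Under this reading, a wrong offset shifts the residual by a whole frame
relative to the coefficients while still producing plausible output.
</P>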
<P>
Note that the biggest problem we had in implementing the residual excited
LPC resynthesizer was getting the right part of the residual to line up
with the right LPC coefficients describing the pitch mark.  Errors
in this alignment degrade the synthesized waveform noticeably, but not
seriously, which makes it difficult to determine whether the cause is an
offset problem or some other bug.

</P>
<P>
Although we have started investigating if extracting pitch synchronous
LPC parameters rather than fixed shift parameters gives better
performance, we haven't finished this work.  <TT>`lpc_analysis'</TT>
supports pitch synchronous analysis but the raw "ungrouped"
access method does not yet.  At present the LPC parameters are
extracted at a particular pitch mark by interpolating over the 
closest LPC parameters.  The "group" files hold these interpolated
parameters pitch synchronously.

</P>
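<P>
Extracting parameters at a pitch mark from fixed shift frames could look
like the following sketch.  Linear interpolation between the two nearest
frames is an assumption; the text only says that the closest parameters are
interpolated.  Frame <VAR>k</VAR> is taken to lie at <VAR>k</VAR> times the
frame shift (no offset), and <CODE>interpolate_frame</CODE> is a hypothetical
name.
</P>

```python
def interpolate_frame(frames, frame_shift_s, time_s):
    """Linearly interpolate coefficient vectors at time_s."""
    position = time_s / frame_shift_s
    last = len(frames) - 1
    lower = min(max(int(position), 0), last)
    upper = min(lower + 1, last)
    # Weight of the upper frame, clamped so times past the end reuse the last frame
    weight = min(max(position - lower, 0.0), 1.0)
    return [(1.0 - weight) * a + weight * b
            for a, b in zip(frames[lower], frames[upper])]


frames = [[0.0, 1.0], [1.0, 0.0], [2.0, -1.0]]
halfway = interpolate_frame(frames, 0.01, 0.005)
on_frame = interpolate_frame(frames, 0.01, 0.01)
past_end = interpolate_frame(frames, 0.01, 0.05)
```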
<P>
The American English voice <TT>`kd'</TT> was created using the speech
tools <TT>`lpc_analysis'</TT> program, and its setup should
be consulted if you intend to copy it.  The British English voice
<TT>`rb'</TT> was constructed using ESPS routines.

</P>


<H2><A NAME="SEC88" HREF="festival_toc.html#TOC88">21.3  Group files</A></H2>

<P>
<A NAME="IDX279"></A>
Databases may be accessed directly but this is usually too inefficient
for any purpose except debugging.  It is expected that <EM>group
files</EM> will be built which contain a binary representation of the
database.  A group file is a compact efficient representation of the
diphone database.  Group files are byte order independent, so may be
shared between machines of different byte orders and word sizes.
Certain information in a group file may be changed at load time so a
database name, access strategy etc. may be changed from what was set
originally in the group file.

</P>
<P>
A group file contains the basic parameters, the diphone index, the
signal (original waveform or LPC residual), LPC coefficients, and the
pitch marks.  It is all you need for a run-time synthesizer.  
Various compression mechanisms are supported to allow smaller databases
if desired.  A full English LPC plus residual database at 8 kHz ulaw
is about 3 megabytes, while a full 16 bit version at 16 kHz is about
8 megabytes.

</P>
<P>
Group files are created with the <CODE>Diphone.group</CODE> command, which
takes a database name and an output filename as arguments.  Making
group files can take some time, especially if they are large.  The
<CODE>group_type</CODE> parameter specifies <CODE>raw</CODE> or <CODE>ulaw</CODE>
encoding for signal files.  Choosing <CODE>ulaw</CODE> can significantly
reduce the size of databases.

</P>
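<P>
The saving from <CODE>ulaw</CODE> encoding comes from storing one byte per
sample instead of two.  The sketch below uses the continuous mu-law
companding formula with uniform 8 bit quantisation; it illustrates the idea
only, and is not the G.711 segment layout or necessarily the encoding the
group file writer uses.  The function names are hypothetical.
</P>

```python
import math

MU = 255.0  # standard mu-law constant for 8 bit companding


def ulaw_encode(sample):
    """Map a 16 bit linear sample to an 8 bit code (0..255)."""
    x = max(-1.0, min(1.0, sample / 32768.0))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def ulaw_decode(code):
    """Map an 8 bit code back to an approximate 16 bit linear sample."""
    y = code / 255.0 * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return int(round(x * 32768.0))


samples = [0, 1000, -1000, 20000, -32768]
codes = bytes(ulaw_encode(s) for s in samples)  # one byte per sample
raw_size = 2 * len(samples)                     # 16 bit linear: two bytes per sample
```

<P>
The logarithmic mapping keeps more resolution for small amplitudes, which is
why halving the sample size costs less quality than simply dropping the low
eight bits.
</P>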
<P>
Group files may be partially loaded (see access strategies) at
run time for quicker start up and to minimise run-time
memory requirements.

</P>


<H2><A NAME="SEC89" HREF="festival_toc.html#TOC89">21.4  Diphone_Init</A></H2>

<P>
The basic method for describing a database is through the <CODE>Diphone_Init</CODE>
command.  This function takes a single argument, a list of
pairs of parameter name and value.  The parameters are
<DL COMPACT>

<DT><CODE>name</CODE>
<DD>
An atomic name for this database.
<DT><CODE>group_file</CODE>
<DD>
The filename of a group file, which may itself contain parameters
describing itself.
<DT><CODE>type</CODE>
<DD>
The default value is <CODE>pcm</CODE>, but for distributed voices
this is always <CODE>lpc</CODE>.
<DT><CODE>index_file</CODE>
<DD>
A filename containing the diphone dictionary.
<DT><CODE>signal_dir</CODE>
<DD>
A directory (slash terminated) containing the pcm waveform files.
<DT><CODE>signal_ext</CODE>
<DD>
A dot prefixed extension for the pcm waveform files.
<DT><CODE>pitch_dir</CODE>
<DD>
A directory (slash terminated) containing the pitch mark files.
<DT><CODE>pitch_ext</CODE>
<DD>
A dot prefixed extension for the pitch mark files.
<DT><CODE>lpc_dir</CODE>
<DD>
A directory (slash terminated) containing the LPC coefficient files
and residual files.
<DT><CODE>lpc_ext</CODE>
<DD>
A dot prefixed extension for the LPC coefficient files.
<DT><CODE>lpc_type</CODE>
<DD>
The type of LPC file (as supported by the speech tools).
<DT><CODE>lpc_frame_offset</CODE>
<DD>
The number of frames "missing" from the beginning of the file.
Often LPC parameters are offset by one frame.
<DT><CODE>lpc_res_ext</CODE>
<DD>
A dot prefixed extension for the residual files.
<DT><CODE>lpc_res_type</CODE>
<DD>
The type of the residual files; this is a standard waveform type
as supported by the speech tools.
<DT><CODE>lpc_res_offset</CODE>
<DD>
Number of seconds "missing" from the beginning of the residual file.
Some LPC analysis techniques do not generate a residual until after one
frame.
<DT><CODE>samp_freq</CODE>
<DD>
Sample frequency of signal files.
<DT><CODE>phoneset</CODE>
<DD>
Phoneset used, must already be declared.
<DT><CODE>num_diphones</CODE>
<DD>
Total number of diphones in database.  If specified, this must be
equal to or greater than the number of entries in the index file.
If it is not specified, the square of the number of phones in the
phoneset is used.
<DT><CODE>sig_band</CODE>
<DD>
Number of sample points around the actual diphone to take from the file.
This should be larger than any windowing used on the signal,
and/or up to the pitch marks outside the diphone signal.
<DT><CODE>alternates_after</CODE>
<DD>
List of pairs of phones stating replacements for the second
part of diphone when the basic diphone is not found in the 
diphone database.
<DT><CODE>alternates_before</CODE>
<DD>
List of pairs of phones stating replacements for the first
part of diphone when the basic diphone is not found in the 
diphone database.
<DT><CODE>default_diphone</CODE>
<DD>
When unexpected combinations occur and no appropriate diphone can be
found this diphone should be used.  This should be specified for all
diphone databases that are to be robust.  We usually use the silence to
silence diphone.  No matter how carefully you design your diphone set,
situations in which an unknown diphone is requested <EM>always</EM> seem to arise.
If this is not set and a diphone is requested that is not in the
database, an error occurs and synthesis will stop.
</DL>
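<P>
The rule for <CODE>num_diphones</CODE> can be stated as a small check.  This
sketch follows the parameter description only; the function name and the use
of an exception are illustrative assumptions, not Festival behaviour.
</P>

```python
def resolve_num_diphones(phoneset, index_entries, num_diphones=None):
    """Return the table size implied by the num_diphones description."""
    if num_diphones is None:
        # Default: the square of the number of phones in the phoneset
        return len(phoneset) ** 2
    if num_diphones < index_entries:
        raise ValueError("num_diphones is smaller than the index file")
    return num_diphones


phones = ["#", "a", "l", "ll"]
default_size = resolve_num_diphones(phones, index_entries=10)
explicit_size = resolve_num_diphones(phones, index_entries=10, num_diphones=12)
```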

<P>
Examples of general setup, making group files, and general
use are in 

<PRE>
<TT>`lib/voices/english/rab_diphone/festvox/rab_diphone.scm'</TT>
</PRE>



<H2><A NAME="SEC90" HREF="festival_toc.html#TOC90">21.5  Access strategies</A></H2>

<P>
<A NAME="IDX280"></A>
Three basic accessing strategies are available when using
diphone databases.  They are designed to optimise access time, start up
time and space requirements.

</P>
<DL COMPACT>

<DT><CODE>direct</CODE>
<DD>
Load all signals at database init time.  This gives the slowest start up but
the fastest access.  It is ideal for servers, and is also useful
for small databases that can be loaded quickly.  It is reasonable for
many group files.
<DT><CODE>dynamic</CODE>
<DD>
Load signals as they are required.  This has much faster
start up and will only gradually use up memory as the diphones
are actually used.  Useful for larger databases, and for non-group
file access.
<DT><CODE>ondemand</CODE>
<DD>
Load the signals as they are requested but free them if they are not
required again immediately.  This gives slower access but keeps
memory usage low.  In group files the re-reads are quite cheap as the
database is well cached and a file descriptor is already open for the
file.
</DL>
<P>
Note that in group files pitch marks (and LPC coefficients) are
always fully loaded (cf. <CODE>direct</CODE>), as they are typically
smaller.  Only signals (waveform files or residuals) are potentially
dynamically loaded.

</P>
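<P>
The trade-off between the three strategies can be seen in a toy model that
counts how often a signal is read.  The class <CODE>SignalStore</CODE> and
the loader are hypothetical, and treating <CODE>ondemand</CODE> as "keep only
the most recently used signal" is one interpretation of the description
above.
</P>

```python
class SignalStore:
    """Toy model of the direct, dynamic and ondemand access strategies."""

    def __init__(self, names, load, strategy="direct"):
        if strategy not in ("direct", "dynamic", "ondemand"):
            raise ValueError("unknown access strategy: " + strategy)
        self.load = load
        self.strategy = strategy
        self.cache = {}
        if strategy == "direct":
            # Load every signal at init time
            for name in names:
                self.cache[name] = load(name)

    def get(self, name):
        if name in self.cache:
            return self.cache[name]
        signal = self.load(name)
        if self.strategy == "dynamic":
            self.cache[name] = signal    # keep everything used so far
        else:
            self.cache = {name: signal}  # ondemand: keep only the latest
        return signal


load_log = []

def fake_load(name):
    load_log.append(name)  # record each read from "disk"
    return "signal:" + name

names = ["a-b", "b-c"]
direct = SignalStore(names, fake_load, "direct")
direct_loads = len(load_log)

dynamic = SignalStore(names, fake_load, "dynamic")
dynamic.get("a-b")
dynamic.get("a-b")
dynamic_loads = len(load_log) - direct_loads

ondemand = SignalStore(names, fake_load, "ondemand")
for name in ["a-b", "a-b", "b-c", "a-b"]:
    ondemand.get(name)
ondemand_loads = len(load_log) - direct_loads - dynamic_loads
```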


<H2><A NAME="SEC91" HREF="festival_toc.html#TOC91">21.6  Diphone selection</A></H2>

<P>
<A NAME="IDX281"></A>
<A NAME="IDX282"></A>
<A NAME="IDX283"></A>
<A NAME="IDX284"></A>
The appropriate diphone is selected based on the name of the phone
identified in the segment stream.  However for better diphone synthesis
it is useful to augment the diphone database with other diphones in
addition to the ones directly from the phoneme set, for example to
distinguish dark and light l's, or consonants in a consonant cluster
from the same consonants in isolation.  There are two methods for
specifying this modification of the basic name.

</P>
<P>
<A NAME="IDX285"></A>
<A NAME="IDX286"></A>
When the diphone module is called the hook <CODE>diphone_module_hooks</CODE>
is applied.  That is a function or list of functions which will be
applied to the utterance.  Its main purpose is to allow the conversion
of the basic name into an augmented one.  For example converting a basic
<CODE>l</CODE> into a dark l, denoted by <CODE>ll</CODE>.  The functions given in
<CODE>diphone_module_hooks</CODE> may set the feature
<CODE>diphone_phone_name</CODE> which if set will be used rather than the
<CODE>name</CODE> of the segment.

</P>
<P>
For example, suppose we wish to use a dark l (<CODE>ll</CODE>) rather than
a normal l for all l's that appear in the coda of a syllable.
First we would define a function which identifies this condition
and adds the additional feature <CODE>diphone_phone_name</CODE> to identify
the name change.  The following function would
achieve this:

<PRE>
(define (fix_dark_ls utt)
"(fix_dark_ls UTT)
Identify ls in coda position and relabel them as ll."
  (mapcar
   (lambda (seg) 
     (if (and (string-equal "l" (item.name seg))
              (string-equal "+" (item.feat seg "p.ph_vc"))
              (item.relation.prev seg "SylStructure"))
      (item.set_feat seg "diphone_phone_name" "ll")))
   (utt.relation.items utt 'Segment))
  utt)
</PRE>

<P>
Then when we wish to use this for a particular voice we need to
add

<PRE>
(set! diphone_module_hooks (list fix_dark_ls))
</PRE>

<P>
in the voice selection function.

</P>
<P>
For a more complex example including consonant cluster identification
see the American English voice <TT>`ked'</TT> in
<TT>`festival/lib/voices/english/ked/festvox/kd_diphone.scm'</TT>.  The
function <CODE>ked_diphone_fix_phone_name</CODE> carries out a number of
mappings.

</P>
<P>
The second method for changing a name is during actual look up of a
diphone in the database.  The list of alternates is given by the
<CODE>Diphone_Init</CODE> function.  These are used when the specified diphone
can't be found.  For example, we often allow mappings of dark l,
<CODE>ll</CODE>, to <CODE>l</CODE>, as sometimes the dark l diphone doesn't actually
exist in the database.

</P>
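<P>
The look up with alternates and a default diphone might proceed as in the
sketch below.  The order in which the replacements are tried is an
assumption, as is the function name <CODE>find_diphone</CODE>; only the roles
of <CODE>alternates_before</CODE>, <CODE>alternates_after</CODE> and
<CODE>default_diphone</CODE> come from the parameter descriptions.
</P>

```python
def find_diphone(index, left, right, alternates_before=(),
                 alternates_after=(), default_diphone=None):
    """Return the name of the diphone to use for the phone pair left-right."""
    before = dict(alternates_before)  # replacements for the first phone
    after = dict(alternates_after)    # replacements for the second phone
    alt_left = before.get(left, left)
    alt_right = after.get(right, right)
    candidates = [
        (left, right),          # the basic diphone
        (alt_left, right),      # replace the first part
        (left, alt_right),      # replace the second part
        (alt_left, alt_right),  # replace both parts
    ]
    for first, second in candidates:
        name = first + "-" + second
        if name in index:
            return name
    if default_diphone is not None:
        return default_diphone
    raise KeyError("no diphone for " + left + "-" + right)


index = {"l-a", "a-l", "#-#"}
dark_first = find_diphone(index, "ll", "a", alternates_before=[("ll", "l")])
dark_second = find_diphone(index, "a", "ll", alternates_after=[("ll", "l")])
fallback = find_diphone(index, "x", "y", default_diphone="#-#")
```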
</BODY>
</HTML>