<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from ../festival.texi on 2 August 2001 -->

<TITLE>Festival Speech Synthesis System - 5  Overview</TITLE>
</HEAD>
<BODY bgcolor="#ffffff">
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_4.html">previous</A>, <A HREF="festival_6.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC9" HREF="festival_toc.html#TOC9">5  Overview</A></H1>

<P>
Festival is designed as a speech synthesis system for at least three
levels of user.  First, those who simply want high-quality speech from
arbitrary text with the minimum of effort.  Second, those who are
developing language systems and wish to include synthesis output; in
this case a certain amount of customization is desired, such as
different voices, specific phrasing, dialogue types, etc.  Third, those
developing and testing new synthesis methods.

</P>
<P>
This manual is not designed as a tutorial on converting text to speech
but as documentation of the processes and use of our system.  We do not
discuss the detailed algorithms involved in converting text to speech or
the relative merits of the various methods, though we will often give
references to relevant papers when describing the use of each module.

</P>
<P>
For more general information about text to speech we recommend Dutoit's
<TT>`An introduction to Text-to-Speech Synthesis'</TT> <CITE>dutoit97</CITE>.  For
more detailed research issues in TTS see <CITE>sproat98</CITE> or
<CITE>vansanten96</CITE>.

</P>



<H2><A NAME="SEC10" HREF="festival_toc.html#TOC10">5.1  Philosophy</A></H2>

<P>
One of the biggest problems in the development of speech synthesis, and
of speech and language processing systems in general, is that many
simple, well-known techniques exist that can help you realise your goal,
but in order to improve one part of a system you need a complete system
in which to test and improve that part.  Festival is intended as that
complete system, in which you may work on your small part and so improve
the whole.  Without a system like Festival, before you could even start
to test your new module you would need to spend significant effort
building a whole system, or adapting an existing one.

</P>
<P>
Festival is specifically designed to allow the addition of new modules,
easily and efficiently, so that development need not get bogged down in
reinventing the wheel.

</P>
<P>
But there is another aspect of Festival that makes it more useful than
an environment for research into new synthesis techniques alone.  It is
a fully usable text-to-speech system suitable for embedding in other
projects that require speech output.  Providing a fully working,
easy-to-use speech synthesizer in addition to a testing environment is
good for two specific reasons.  First, it offers a conduit for our
research, in that our experiments can quickly and directly benefit users
of our synthesis system.  Second, by ensuring we have a fully working,
usable system we can immediately see what problems exist and where our
research should be directed, rather than where our whims take us.

</P>
<P>
These concepts are not unique to Festival.  ATR's CHATR system
(<CITE>black94</CITE>) follows very much the same philosophy, and
Festival benefits from the experience gained in the development of that
system.  Festival also draws on other previous work: as well as CHATR,
CSTR's earlier synthesizers, Osprey and the Polyglot projects,
influenced many design decisions.  On software engineering issues we are
also influenced by more general programs, especially GNU Octave and
Emacs, on which the basic script model was based.

</P>
<P>
Unlike some other speech and language systems, Festival treats software
engineering as very important to its development.  Too often research
systems consist of random collections of hacky little scripts and code.
No one person can confidently describe the algorithms such a system
performs, as parameters are scattered throughout it, with tricks and
hacks making it impossible to really evaluate why the system is good
(or bad).  Such systems do not help the advancement of speech
technology, except perhaps in pointing at ideas that should be
investigated further.  If the algorithms and techniques cannot be
described independently of the program, <EM>such that</EM> they can be
reimplemented by others, what is the point of doing the work?

</P>
<P>
Festival offers a common framework where multiple techniques may be 
implemented (by the same or different researchers) so that they may
be tested more fairly in the same environment.

</P>
<P>
As a final word, we'd like to make two short statements which both
achieve the same end but unfortunately perhaps not for the same reasons:

<BLOCKQUOTE>
<P>
Good software engineering makes good research easier
</BLOCKQUOTE>

<P>
But the following also seems to be true:

<BLOCKQUOTE>
<P>
If you spend enough effort on something it can be shown to be better
than its competitors.
</BLOCKQUOTE>



<H2><A NAME="SEC11" HREF="festival_toc.html#TOC11">5.2  Future</A></H2>

<P>
Festival is still very much in development, and hopefully this state
will continue for a long time.  It is never possible to complete
software; there are always new things that can make it better.  However,
as time goes on Festival's core architecture will stabilise and few or
no changes will be made to it.  Other aspects of the system will then
gain greater attention, such as waveform synthesis modules, intonation
techniques, text-type-dependent analysers, etc.

</P>
<P>
Festival will improve, so don't expect it to be the same six months
from now.

</P>
<P>
A number of new modules and enhancements are under consideration or at
various stages of implementation.  The following is a non-exhaustive
list of what we may (or may not) add to Festival over the
next six months or so.

<UL>
<LI>Selection-based synthesis:

Moving away from diphone technology to more generalized selection
of units from a speech database.
<LI>New structure for linguistic content of utterances:

Using techniques from Metrical Phonology, we are building more
structured representations of utterances that better reflect their
linguistic significance.  This will allow improvements in prosody and
unit selection.
<LI>Non-prosodic prosodic control:

For language generation systems and custom tasks where the speech
to be synthesized is generated by some program, more information
about text structure will probably exist, such as phrasing, contrast,
key items, etc.  We are investigating the relationship of such
high-level tags to prosodic information through the Sole project
(<A HREF="http://www.cstr.ed.ac.uk/projects/sole.html">http://www.cstr.ed.ac.uk/projects/sole.html</A>).
<LI>Dialect independent lexicons:

At present each new dialect needs a new lexicon.  We are investigating
a dialect-independent form of lexical specification that allows the
core form to be mapped to different dialects.  This will make the
generation of voices in different dialects much easier.
</UL>

<P><HR><P>
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_4.html">previous</A>, <A HREF="festival_6.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
</BODY>
</HTML>