Formant synthesis

Formant synthesis was the first genuine synthesis technique to be developed and was the dominant technique until the early 1980s. Formant synthesis is often called synthesis by rule, a term invented to make clear that this was synthesis from scratch (at the time, the term synthesis was more commonly used for the process of reconstructing a waveform that had been parameterised for speech-coding purposes). As we shall see, most formant-synthesis techniques do in fact use rules of the traditional form, but data-driven techniques have also been used. [Pg.388]

The first point of note regarding the formant synthesiser is that it is not an accurate model of the vocal tract. Even taking into account the crude assumptions regarding [Pg.388]

That said, formant synthesis does share much in common with the all-pole vocal-tract model. As with the tube model, the formant synthesiser is modular with respect to the source and vocal-tract filter. The oral-cavity component is formed from the connection of between three and six individual formant resonators in series, as predicted by the vocal-tract model, and each formant resonator is a second-order filter of the type discussed in Section 10.5.3. [Pg.389]
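The series connection of second-order formant resonators described above can be sketched as follows. This is a minimal illustration, not the filter from Section 10.5.3: it assumes the commonly used Klatt-style resonator difference equation y[n] = A x[n] + B y[n-1] + C y[n-2], and the function names, the sample rate and the three formant frequency/bandwidth pairs are hypothetical values chosen only for the example.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a second-order all-pole resonator (assumed Klatt-style form).

    The gain term A is chosen so that the filter has unity gain at 0 Hz.
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs):
    """Run the samples in x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def cascade(x, formants, fs):
    """Connect the resonators in series: each one filters the previous output."""
    for freq_hz, bw_hz in formants:
        x = resonate(x, freq_hz, bw_hz, fs)
    return x

def impulse_train(f0, fs, n):
    """Very crude voiced source: one unit impulse per pitch period."""
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

FS = 16000  # sample rate in Hz (illustrative)
# Hypothetical (frequency, bandwidth) pairs in Hz, not taken from the text.
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

source = impulse_train(100.0, FS, FS // 10)  # 100 ms of a 100 Hz source
speech = cascade(source, FORMANTS, FS)
```

Because the source and the filter are separate functions, either can be swapped out without touching the other, which is the modularity the paragraph refers to.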

Formant synthesis originated from research into human voice simulation, but its methods have been adapted to model other types of sounds. Human voice simulation is of great interest to the industrial, scientific and artistic communities alike. In telecommunications, for example, speech synthesis is important because it is comparatively inexpensive, and also safer, to transmit synthesis parameters rather than the voice signal. In fact, music sound synthesis owes many of its techniques to research in telecommunications systems. [Pg.66]

In music research, voice simulation is important because the human ability to distinguish between different timbres is closely related to our capacity to recognise vocal sounds, especially vowels. Since the nineteenth century, a large body of scientific studies has linked speech and timbre perception, from Helmholtz (1885) to current research in cognitive science (McAdams and Deliege, 1985). [Pg.66]

The generator works by producing a sequence of damped sinewave bursts. A single note contains a number of these bursts. Each burst has its own local envelope with either a steep or a smooth attack, and an exponential decay curve. The formant is the result of this local envelope. As each FOF burst lasts for only a few milliseconds, the envelope produces sidebands around the sinewave, as in amplitude modulation. Note that this mechanism resembles the granular synthesis technique, with the difference that the envelope of the FOF grain was especially designed to facilitate the production of formants. [Pg.67]

The decay time of the local envelope defines the bandwidth of the formant at -6 dB, and the rise time defines the skirtwidth of the formant at -40 dB. The relationships are as follows: the longer the decay, the sharper the resonance peak, and the longer the rise time, the narrower the skirtwidth. Reference values for singing sounds are approximately 7 milliseconds for the rise time and 3 milliseconds for the decay time. [Pg.68]
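A single FOF burst of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the commonly cited form of the FOF envelope (a raised-cosine attack lasting the rise time, multiplied by an exponential decay whose rate is pi times the -6 dB bandwidth), the function name fof_grain and its parameters are hypothetical, and the numeric values are illustrative rather than the reference values quoted in the text.

```python
import math

def fof_grain(freq_hz, bandwidth_hz, rise_s, dur_s, fs):
    """One damped sinewave burst with a local envelope.

    freq_hz      -- centre frequency of the formant
    bandwidth_hz -- controls the decay rate (a longer decay gives a sharper peak)
    rise_s       -- attack time of the local envelope (controls the skirtwidth)
    dur_s        -- total length of the burst in seconds
    """
    alpha = math.pi * bandwidth_hz  # exponential decay rate (assumed relation)
    beta = math.pi / rise_s         # angular rate of the raised-cosine attack
    omega = 2.0 * math.pi * freq_hz
    n = int(round(dur_s * fs))
    grain = []
    for i in range(n):
        t = i / fs
        env = math.exp(-alpha * t)
        if t < rise_s:
            # Smooth attack: rises from 0 to 1 over rise_s seconds.
            env *= 0.5 * (1.0 - math.cos(beta * t))
        grain.append(env * math.sin(omega * t))
    return grain

# Illustrative values only: a 600 Hz formant, 80 Hz bandwidth, 3 ms attack.
grain = fof_grain(600.0, 80.0, 0.003, 0.02, 16000)
```

Shortening rise_s makes the attack steeper and widens the skirts of the formant, while lowering bandwidth_hz lengthens the decay and sharpens the peak, which matches the relationships stated above.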

This technique was originally designed for a synthesis system developed by Ircam in Paris, called Chant (in French, chant means sing). Ircam's Diphone system (in the folder diphone on the accompanying CD-ROM) provides a Chant module whereby the composer can either specify the FOF parameters manually or infer them automatically from the analysis of given samples. Also, Virtual Waves for PC-compatibles running Windows (in the folder virwaves) provides an FOF unit generator as part of its repertoire of synthesis modules. [Pg.68]

Figure 13.1. Filter view of additive formant synthesis, showing individual impulse responses of parallel formant filters.

Figure 13.3. Wavetable view of additive formant synthesis. [Pg.151]

In a source/filter vocal model such as LPC or parallel/cascade formant synthesis, periodic impulses are used to excite resonant filters to produce vowels. We could construct a simple alternative model using three, four, or more tables storing the impulse responses of the individual vowel formants. Note that it isn't necessary for the tables to contain pure exponentially decaying sinusoids. We could include aspects of the voice source, etc., as long as those effects are periodic. FOFs (originally introduced in French as Formant Onde Functions, which translates to Formant Wave Functions in English) were created for... [Pg.151]
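The table-based alternative described above can be sketched as follows: each table stores the impulse response of one formant, and a copy of every table is added into the output at the start of each pitch period. This is only an illustration; the helper names, table length, amplitudes and formant values are hypothetical, and the tables here hold plain exponentially decaying sinusoids even though, as noted, they need not.

```python
import math

def decaying_sinusoid_table(freq_hz, bw_hz, amp, fs, length):
    """Table holding the impulse response of one formant (assumed shape)."""
    return [
        amp * math.exp(-math.pi * bw_hz * i / fs) * math.sin(2.0 * math.pi * freq_hz * i / fs)
        for i in range(length)
    ]

def overlap_add(tables, f0, fs, n):
    """Add a copy of every table at the start of each pitch period."""
    period = int(round(fs / f0))
    out = [0.0] * n
    for table in tables:
        for start in range(0, n, period):
            # Truncate the copy if it would run past the end of the output.
            for k in range(min(len(table), n - start)):
                out[start + k] += table[k]
    return out

FS = 16000       # sample rate in Hz (illustrative)
TABLE_LEN = 512  # samples per table (illustrative)
# Hypothetical (frequency Hz, bandwidth Hz, amplitude) triples for three formants.
tables = [
    decaying_sinusoid_table(f, bw, amp, FS, TABLE_LEN)
    for f, bw, amp in [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]
]
voice = overlap_add(tables, 100.0, FS, 3200)  # 200 ms at a 100 Hz pitch
```

Since no filtering happens at run time, the per-sample cost is just table lookups and additions; changing the pitch only changes the spacing between table copies, not the tables themselves.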

The formant synthesis technique just described is of course only half the problem: in addition to generating waveforms from formant parameters, we have to be able to generate formant parameters from the discrete pronunciation representations of the type represented by the synthesis specification. It is useful therefore to split the overall process into separate parameter-to-speech (i.e. the formant synthesiser just described) and specification-to-parameter components. [Pg.406]

Before going into this, we should ask: how good does the speech sound if we give the formant synthesiser perfect input? The specification-to-parameter component may produce errors, and if we are interested in assessing the quality of the formant synthesis itself, it may be difficult to do this from the specification directly. Instead we can use the technique of copy synthesis, where we forget about automatic text-to-speech conversion, and instead artificially generate the best possible parameters for the synthesiser. This test is in fact one of the cornerstones of speech synthesis research: it allows us to work on one part of the system in a modular fashion, but more importantly it acts as a proof of concept as to the synthesiser's eventual suitability for inclusion in the full TTS system. The key point is that if the synthesis sounds bad with the best possible input, then it will only sound worse when potentially errorful input is given instead. In effect, copy synthesis sets the upper limit on expected quality from any system. [Pg.406]

An alternative to using formants as the primary means of control is to use the parameters of the vocal tract transfer function directly. The key here is that if we assume the all-pole tube model, we can in fact determine these parameters automatically by means of linear prediction, performed by the covariance or autocorrelation technique described in Chapter 12. In the following section we will explain in detail the commonality between linear prediction and formant synthesis, where the two techniques diverge, and how linear prediction can be used to generate speech. [Pg.410]
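To make the idea of determining all-pole parameters automatically more concrete, here is a minimal sketch of the autocorrelation technique solved with the Levinson-Durbin recursion. It is not the formulation from Chapter 12; the function names are hypothetical, no windowing or pre-emphasis is applied, and the test signal is simply the impulse response of a known two-pole filter so that the recovered coefficients can be compared with the true ones.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the signal x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err), where a[k-1] weights x[n-k] in the predictor
    x[n] ~ sum_k a[k-1] * x[n-k], and err is the residual energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

def two_pole_impulse_response(b, c, n):
    """Impulse response of y[n] = x[n] + b*y[n-1] + c*y[n-2]."""
    y1 = y2 = 0.0
    out = []
    for i in range(n):
        y = (1.0 if i == 0 else 0.0) + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# Hypothetical single resonance: 500 Hz centre, 100 Hz bandwidth, 8 kHz sampling.
FS = 8000
radius = math.exp(-math.pi * 100.0 / FS)
true_b = 2.0 * radius * math.cos(2.0 * math.pi * 500.0 / FS)
true_c = -radius * radius

signal = two_pole_impulse_response(true_b, true_c, 2000)
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
```

In this idealised case coeffs comes out very close to [true_b, true_c]. With real speech the analysis would be run frame by frame with a higher order, and the resulting coefficients would not map one-to-one onto individual formants.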

Beyond this, the formant synthesiser and the LP model start to diverge. Firstly, with the LP model, we use a single all-pole transfer function for all sounds. In the formant synthesiser, there are separate transfer functions for the oral and nasal cavities. In addition, a further separate resonator is used in formant synthesis to create a voiced source signal from the impulse train; in the LP model, the filter that does this is included in the all-pole filter. Hence the formant synthesiser is fundamentally more modular in that it separates these components. This lack of modularity in the LP model adds to the difficulty in providing physical interpretations of the coefficients. [Pg.411]

It should be clear from our exposition that each technique has inherent tradeoffs with respect to the above wish list. For example, we make many assumptions in order to use the lossless all-pole linear prediction model for all speech sounds. In doing so, we achieve a model whose parameters we can measure easily and automatically, but find that these are difficult to interpret in a useful sense. While the general nature of the model is justified, the assumptions we make to achieve automatic analysis mean that we can't modify, manipulate and control the parameters in as direct a way as we can with formant synthesis. Following on from this, it is difficult to produce a simple and elegant phonetics-to-parameter model, as it is difficult to interpret these parameters in higher-level phonetic terms. [Pg.418]

On the other hand, with formant synthesis, we can in some sense relate the parameters to the phonetics, in that we know for instance that the typical formant values for an /iy/ vowel are 300 Hz, ... [Pg.418]

Klatt's 1987 article, Review of Text-to-Speech Conversion for English [255], is an excellent source for further reading on first-generation systems. Klatt documents the history of the entire TTS field, and then explains the (then) state-of-the-art systems in detail. While his account is heavily biased towards formant synthesis, rather than LP or articulatory synthesis, it nonetheless remains a very solid account of the technology before and at that time. [Pg.420]

Formant synthesis works by using individually controllable formant filters which can be set to produce accurate estimations of the vocal tract transfer function... [Pg.421]

In general, formant synthesis produces intelligible but not natural-sounding speech. [Pg.421]

In terms of production, it is very similar to formant synthesis with regard to the source and vowels. It differs in that all sounds are generated by an all-pole filter, whereas parallel filters are common in formant synthesis. [Pg.421]

Throughout the book, we have made statements to the effect that statistical text analysis outperforms rule methods, or that unit selection is more natural than formant synthesis. But how do we know this? In one way or another, we have evaluated our systems and come to these conclusions. How we go about this is the topic of this section. [Pg.534]

Hogberg, J. Data driven formant synthesis. In Proceedings of Eurospeech 1997 (1997). [Pg.584]
