HMM synthesis systems

We now describe some complete HMM synthesis systems. In the HMM system described by Zen and Tokuda [513], five streams are used, one for the MFCCs and one each for [Pg.465]

Figure 15.17 Spectral examples extracted from a 39th-order cepstrum calculated from a DFT and STRAIGHT. [Pg.466]

Taylor [438] has performed some preliminary work on HMM topologies other than the three-state left-to-right models used above. He shows that a standard unit-selection system of the type described in Chapter 16 can in fact be modelled by a more general [Pg.466]

The purpose of all the prosody algorithms described in this chapter is to provide part of the specification that will act as input to the synthesizer proper. In the past, the provision of F0 and timing information was uncontested as a vital part of the synthesis specification, and most of today's systems still use them. As we shall see in Chapters 15 and 16, however, some third-generation systems do not require any acoustic prosody specification at all, making use of higher-level prosodic representations instead. Rather than use F0 directly, stress, phrasing, and discourse information are used. While such an approach completely bypasses all the problems described in this chapter, it does have the consequence of increasing the dimensionality of the feature space used in unit selection or HMM synthesis. It is therefore a practical question of tradeoff whether such an approach is better than the traditional approach. [Pg.262]

HMM synthesis started with the now classic paper by Tokuda et al. [453], which explained the basic principles of generating observations that obey the dynamic constraints. Papers explaining the basic principle include [452], [454], [305]. From this, a gradual series of improvements has been proposed, resulting in today's high-quality synthesis systems. Enhancements include more powerful observation modelling [509], duration modelling in HMMs [504], [515], trended HMMs [134], [133], trajectory HMMs [515], and HMM studies on emotion and voice transformation [232], [429], [404], [505]. [Pg.483]

Zen, H., and Toda, T. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proceedings of Interspeech 2005 (2005). [Pg.603]

The output of the HMM synthesis process is a sequence of cepstral vectors and F0 values, and so the final task is to convert these into a speech waveform. This can be accomplished in a number of ways; see, for example, Section 14.6. In general, though, the approach is to use the generated cepstral output to create a spectral envelope, and to use the generated F0 output to create an impulse train. The impulses are then fed into a filter whose coefficients are derived from the cepstral parameters. While reasonably effective, this vocoder-style approach is essentially the same as that used in first-generation systems and so can suffer from the buzz or metallic sound characteristic of those systems (see Section 13.3.5). A major focus of current research is to improve on this. [Pg.464]
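The source-filter step described above can be illustrated with a minimal sketch. It is not the method of any particular system: it assumes a plain (not mel-warped) cepstrum under the convention H(z) = exp(sum_m c_m z^-m), one F0 value per frame with 0 marking an unvoiced frame (left silent here), and a fixed frame shift. The function names `impulse_train`, `cepstrum_to_impulse_response` and `synthesize` are hypothetical.

```python
import numpy as np

def impulse_train(f0_per_frame, frame_shift, sample_rate):
    """One unit impulse per pitch period in voiced frames (f0 > 0)."""
    hop = int(round(frame_shift * sample_rate))
    n_samples = len(f0_per_frame) * hop
    excitation = np.zeros(n_samples)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * hop, (i + 1) * hop
        if f0 <= 0:                      # assumed marker for unvoiced frames
            next_pulse = float(end)
            continue
        period = sample_rate / f0        # pitch period in samples
        next_pulse = max(next_pulse, float(start))
        while next_pulse < end:
            excitation[int(next_pulse)] = 1.0
            next_pulse += period
    return excitation

def cepstrum_to_impulse_response(cepstrum, n_fft=512):
    """Filter impulse response for H(z) = exp(sum_m c_m z^-m) (assumed convention)."""
    log_spectrum = np.fft.fft(cepstrum, n_fft)     # zero-pads the cepstrum to n_fft
    return np.fft.ifft(np.exp(log_spectrum)).real  # real because the cepstrum is real

def synthesize(cepstra, f0_per_frame, frame_shift=0.005,
               sample_rate=16000, n_fft=512):
    """Filter the impulse train frame by frame and overlap-add the results."""
    hop = int(round(frame_shift * sample_rate))
    excitation = impulse_train(f0_per_frame, frame_shift, sample_rate)
    out = np.zeros(len(excitation) + n_fft - 1)
    for i, c in enumerate(cepstra):
        h = cepstrum_to_impulse_response(c, n_fft)
        segment = excitation[i * hop:(i + 1) * hop]
        out[i * hop:i * hop + hop + n_fft - 1] += np.convolve(segment, h)
    return out[:len(excitation)]
```

Calling `synthesize(cepstra, f0)` with an array of shape (frames, coefficients) and a matching F0 list returns a waveform of `frames * hop` samples. Because the excitation is a bare impulse train, the output would be expected to show the buzzy quality the text mentions. Switching filters abruptly at frame boundaries is a further simplification of this sketch.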

HMM that has many hundreds of states for each model. He shows that one can reduce the number of states by an arbitrary amount, allowing one to scale the size of a synthesis system in a principled manner. [Pg.467]

An alternative approach is to train an HMM system solely on the synthesis data, ensuring consistency. [Pg.484]
