Are HMMs a good model of speech?

It is often asked whether HMMs are in fact a good model of speech [63], [102], [317]. This is not so straightforward a question to answer. On the surface, the answer is no because we can identify many ways in which HMM behaviour differs from that of real speech, and we shall list some of these differences below. However, in their defence, it is important to realise that many of the additions to the canonical HMM alleviate [Pg.454]

Independence of observations. This is probably the most-cited weakness of HMMs. While an observation depends on its state, and the state depends on the previous states, one observation is not statistically dependent on the previous one once the state is given. To see the effect of this, consider the probability of a sequence of observations generated from a single state. If that state has a Gaussian density with a mean of 0.5, we would find that a sequence of observations of 0.3, 0.3, 0.3, 0.3 has a certain probability and, since the Gaussian is symmetrical, this is the same probability as that of a sequence of 0.7, 0.7, 0.7, 0.7. But, since the observations are independent of one another, this is also the same probability as that of a sequence that oscillates, such as 0.3, 0.7, 0.3, 0.7. Such a pattern would be extremely unlikely in real speech, a fact not reflected in the model. [Pg.455]
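The point can be checked numerically. The following is a minimal sketch, assuming a single state with a one-dimensional Gaussian of mean 0.5; the standard deviation of 0.1 and the function names are illustrative assumptions, not values taken from the text.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def sequence_log_prob(observations, mean=0.5, std=0.1):
    """Log probability of a sequence emitted from one state.

    Because observations are independent given the state, the sequence
    log probability is simply the sum of the per-frame log densities.
    std=0.1 is an assumed value chosen only for illustration.
    """
    return sum(gaussian_log_pdf(x, mean, std) for x in observations)


low = [0.3, 0.3, 0.3, 0.3]
high = [0.7, 0.7, 0.7, 0.7]
oscillating = [0.3, 0.7, 0.3, 0.7]

for name, seq in [("low", low), ("high", high), ("oscillating", oscillating)]:
    print(name, sequence_log_prob(seq))
```

All three sequences receive the same score up to floating-point rounding, because each frame lies the same distance from the mean and the order of the frames plays no part in the sum.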

Discrete states. This problem arises because, whereas real speech evolves more or less smoothly, the HMM switches characteristics suddenly when moving from one state to the next. Both this and the independent-observation problems are largely solved by the use of dynamic features. When these are included, we can distinguish between the stable and oscillating sequences. In fact, as we shall see in Section 15.2, very natural trajectories can be generated by an HMM that uses dynamic features. [Pg.455]
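To illustrate how dynamic features separate the stable and oscillating cases, here is a minimal sketch under simplifying assumptions: the delta is taken as a plain first difference between neighbouring frames (practical systems typically compute it over a window of several frames), the static and delta coefficients are scored by independent Gaussians, and all means, standard deviations and function names are illustrative rather than taken from the text.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def simple_deltas(observations):
    """First-difference dynamic features; the first frame gets a delta of 0.

    This is an assumed simplification of the usual windowed delta computation.
    """
    return [0.0] + [observations[t] - observations[t - 1]
                    for t in range(1, len(observations))]


def log_score_with_deltas(observations, static_mean=0.5, static_std=0.1,
                          delta_mean=0.0, delta_std=0.1):
    """Score each frame on its static value and on its delta.

    All parameter values are assumptions for illustration. A delta mean of 0
    models a state in which the signal is expected to be stable.
    """
    total = 0.0
    for x, d in zip(observations, simple_deltas(observations)):
        total += gaussian_log_pdf(x, static_mean, static_std)
        total += gaussian_log_pdf(d, delta_mean, delta_std)
    return total


stable = [0.3, 0.3, 0.3, 0.3]
oscillating = [0.3, 0.7, 0.3, 0.7]

print("stable:", log_score_with_deltas(stable))
print("oscillating:", log_score_with_deltas(oscillating))
```

With the delta term included, the stable sequence scores well above the oscillating one, since the large frame-to-frame jumps of the latter fall far from the delta mean.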

Linearity of the model. HMMs enforce a strictly linear model of speech, whereby a sentence is composed of a list of words, which are composed of a list of phones, a list of states and, finally, a list of frames. While there is little problem with describing an utterance as a list of words, or as a list of frames, we know that the strict linear model of phones within words and, worse, states within phones, differs from the more complex interactions in real speech production (see Section 7.3.5). There we showed that it was wrong to think that we could identify phones or phonemes in a waveform; rather, we should think of speech production as a process whereby phonemes are input and speech is output, and we can only approximately identify where the output of a phone lies in the waveform. When we consider the alignment of states and frames in HMMs, we see that we have in fact insisted that every frame belongs to exactly one state and one phone, which is a gross simplification of the speech-production process. [Pg.455]

While many of these points are valid, there are often solutions that help alleviate any problems. As just explained, the use of dynamic features helps greatly with the problems of observation independence and discrete states. As we shall see, the linearity issue is potentially more of a problem in speech synthesis. Models such as neural networks that perform classification directly have been proposed [375] and have produced reasonable [Pg.455]
