Determining the phone sequence

Given the word sequence, our next job is to determine the phone sequence. In considering this, we should recall the discussions of Sections 7.3.2, 7.3.6 and 8.1.3. There we stated that there was considerable choice in how we mapped from the words to a sound representation, with the main issue being whether to choose a representation that is close to the lexicon (e.g. phonemes) or one that is closer to the signal (e.g. phones with allophonic variation marked). [Pg.468]

Another important consideration is that, if we are going to use an HMM-based labeller, then it makes sense to use a sound-representation system that is amenable to HMM labelling. In general this means adopting a system that represents speech sounds as a linear list, and unfortunately precludes some of the more-sophisticated non-linear phonologies described in Section 7.4.3. [Pg.468]

Instead of, or in addition to, this technique it is possible to use a human labeller to listen to the speech. This labeller would either correct the words when the speaker deviates from the script, or, if no script is available, simply record what words were spoken. While this use of a human labeller means of course that the technique is not fully automatic, this labelling can be done quickly and accurately compared with determining the phone sequence or either of the boundary-type locations. [Pg.467]

We are often confronted with the problem of determining the correct sequence of words, phonemes or phones from a given waveform of speech. The general name given to this process is transcription. Transcription is involved in virtually every aspect of analysing real speech data, as we nearly always want to relate the data we have to the linguistic message that it encodes. [Pg.171]

Given the durations, we can calculate the source signal for the utterance. We use an impulse source for sonorant phones, a noise source for unvoiced consonants, and a combined impulse and noise source for voiced obstruents. The source characteristics are switched at phone boundaries. For voiced sounds, the impulse sequence is created by placing impulses at a separation distance determined by 1/F0 at that point. Finally we feed the source signal into the filter coefficients to generate the final speech waveform for the sentence. [Pg.414]
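
The source-generation step described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name make_source, the three phone-class labels, the NOISE_GAIN value and the use of a single constant F0 per phone are all assumptions made for the example. The filtering stage is not shown.

```python
import random

NOISE_GAIN = 0.1  # assumed noise amplitude; not specified in the text


def make_source(phones, sample_rate=16000, seed=0):
    """Build a source signal from (kind, duration_s, f0_hz) tuples.

    kind is one of "sonorant", "unvoiced" or "voiced_obstruent"
    (hypothetical labels). f0_hz is ignored for unvoiced phones.
    """
    rng = random.Random(seed)
    signal = []
    next_pulse = 0  # absolute sample index of the next impulse
    for kind, duration, f0 in phones:
        n = int(round(duration * sample_rate))
        start = len(signal)
        voiced = kind in ("sonorant", "voiced_obstruent")
        noisy = kind in ("unvoiced", "voiced_obstruent")
        # Restart the pulse train if voicing resumes after a gap.
        if voiced and next_pulse < start:
            next_pulse = start
        for i in range(start, start + n):
            sample = 0.0
            if voiced and i == next_pulse:
                sample += 1.0
                # Impulse spacing is 1/F0 seconds, expressed in samples.
                next_pulse = i + max(1, int(round(sample_rate / f0)))
            if noisy:
                sample += NOISE_GAIN * rng.uniform(-1.0, 1.0)
            signal.append(sample)
    return signal
```

With a sample rate of 8000 Hz and F0 of 100 Hz, the impulses of a sonorant phone fall 80 samples apart. Because the source type is chosen per phone, the characteristics switch exactly at phone boundaries, as the text describes.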

This gives us the most likely observation sequence for a given state sequence. The state sequence is partly determined by the synthesis specification, in that we know the words and phones but not the state sequence within each phone. Ideally, we would search every possible state sequence to find the observation sequence that maximises Equation 15.31, but this is too expensive, firstly because the number of possible state sequences is very large and secondly because the solution of Equation 15.31 for each possible state sequence is expensive. [Pg.472]

Given the word and phone sequence, we can construct an HMM network that can recognise just those words. Recognition is obviously performed with perfect accuracy, but in doing the recognition search we also determine the most likely state sequence, and this gives us the phone and word boundaries. Often this operation is called forced alignment. [Pg.479]
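
A simplified sketch of forced alignment is given below. It assumes one HMM state per phone, a strict left-to-right topology with no transition probabilities, and per-frame log-likelihoods supplied as dictionaries. The function name forced_align and the input format are hypothetical, and a real labeller would normally use several states per phone with trained acoustic models. The search is a Viterbi search restricted to the known phone sequence.

```python
def forced_align(phones, frame_scores):
    """Align a known phone sequence to frames by Viterbi search.

    phones: phone labels in their known order.
    frame_scores: one dict per frame, mapping phone -> log-likelihood.
    Returns (phone, start_frame, end_frame_exclusive) tuples.
    """
    T, N = len(frame_scores), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    best = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_scores[0][phones[0]]  # must start in the first phone
    for t in range(1, T):
        for j in range(N):
            stay = best[t - 1][j]
            move = best[t - 1][j - 1] if j > 0 else neg_inf
            emit = frame_scores[t][phones[j]]
            if move > stay:
                best[t][j], back[t][j] = move + emit, j - 1
            else:
                best[t][j], back[t][j] = stay + emit, j
    # Trace back from the final phone at the last frame.
    j = N - 1
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    # Collapse the state path into segments; their edges are the boundaries.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[t - 1]:
            segments.append((phones[path[start]], start, t))
            start = t
    return segments
```

Since the phone sequence is fixed in advance, the search cannot misrecognise anything; its only freedom is where each phone begins and ends, and those positions are the phone boundaries.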
