Mel-scale

The mel-scale was the product of experiments with sinusoids in which subjects were required to divide frequency ranges into sections. From this, a new scale was defined in which one mel equalled one thousandth of the pitch of a 1 kHz tone [415]. The mapping from linear... [Pg.360]
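As a rough illustration of such a mapping, the sketch below uses one widely quoted analytic approximation, m = 2595 log10(1 + f/700). This particular formula is an assumption made here for illustration; the excerpt above is truncated before it states its own mapping, and other variants exist. The function names `hz_to_mel` and `mel_to_hz` are hypothetical.

```python
import math

# Assumption: the common 2595 * log10(1 + f/700) approximation of the mel scale.
# The excerpt does not confirm that this is the mapping it goes on to give.

def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping, from mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

if __name__ == "__main__":
    for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0):
        print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.2f} mel")
```

With this approximation a 1 kHz tone maps to very nearly 1000 mels, in line with the definition above, while equal steps in Hz above 1 kHz cover progressively fewer mels.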

A very popular representation in speech recognition is the mel-frequency cepstral coefficient, or MFCC. This is one of the few popular representations that does not use linear prediction. It is formed by first performing a DFT on a frame of speech, then performing a filter bank analysis (see Section 12.2) in which the frequency bin locations are defined to lie on the mel-scale. This is set up to give, say, 20-30 coefficients. These are then transformed to the cepstral domain by the discrete cosine transform (we use this rather than the DFT as we only require the real part to be calculated) ... [Pg.379]

It is common to ignore the higher cepstral coefficients, and often in ASR only the bottom 12 MFCCs are used. This representation is very popular in ASR for two reasons. Firstly, it has the basic desirable property that the coefficients are largely independent, allowing probability densities to be modelled with diagonal covariance matrices (see Section 15.1.3). Secondly, the mel-scaling has been shown to offer better discrimination between phones, which is an obvious help in recognition. [Pg.379]
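The pipeline described above (DFT, mel-spaced filter bank, cosine transform, truncation) can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions, not the book's implementation: the Hamming window, the triangular filter shape, the log compression of the filter-bank outputs, the 2595 log10(1 + f/700) mel formula, the unnormalised DCT-II, and all function names are choices made here for illustration.

```python
import numpy as np

# Assumed mel mapping (a common approximation, not confirmed by the text).
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters (assumed shape) with centres evenly spaced in mels."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = mel_to_hz(np.linspace(0.0, top_mel, n_filters + 2))
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, centre, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct_ii(x, n_out):
    """First n_out coefficients of an unnormalised DCT-II of x."""
    n = len(x)
    k = np.arange(n_out)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * k * (j + 0.5) / n) @ x

def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=12):
    """DFT -> mel filter bank -> log -> DCT, keeping the lowest n_ceps terms."""
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    fbank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_energies = np.log(fbank @ magnitude + 1e-10)  # small floor avoids log(0)
    return dct_ii(log_energies, n_ceps)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(320) / sr                 # one 20 ms frame at 16 kHz
    frame = np.sin(2 * np.pi * 440.0 * t)
    print(mfcc_frame(frame, sr))
```

Here `n_filters=24` falls within the 20-30 range mentioned above and `n_ceps=12` reflects the common truncation. Whether the zeroth coefficient counts among the 12 varies between systems; this sketch simply keeps the lowest 12 DCT outputs.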

It is possible to perform a similar operation with LP coefficients. In the normal calculation of these, spectral representations aren't used, and so scaling the frequency domain (as in the case of the mel-scaled cepstrum) isn't possible. Recall, however, that in the autocorrelation technique of LP we used the set of autocorrelation functions to find the predictor coefficients. In Section... [Pg.379]

Currently, mel-scale cepstral coefficients, and perceptual linear prediction coefficients transformed into cepstral coefficients, are popular choices for the above reasons. Specifically, they are chosen because they are robust to noise, can be modelled with diagonal covariance, and, with the aid of the perceptual scaling, are more discriminative than they would otherwise be. From a speech synthesis point of view, these points are worth making, not because the same requirements exist for synthesis, but rather to make the reader aware that MFCCs and PLPs are so often used in ASR systems for the above reasons, and not because they are intrinsically better in any general-purpose sort of way. This also helps explain why there are so many speech representations in the first place: each has strengths in certain areas, and will be used as the application demands. In fact, as we shall see in Chapter 16, the application requirements which make, say, MFCCs so suitable for speech recognition are almost entirely absent for our purposes. We shall leave a discussion as to what representations really are suited for speech synthesis purposes until Chapter 16. [Pg.395]

Remove the pre-emphasis and the influence that the mel-scaling operation has on spectral tilt. This can be performed by creating cepstral vectors for each process, and then simply... [Pg.442]

Perform an inverse liftering operation by padding the mel-scale cepstrum with zeros. This... [Pg.442]

Perform an inverse cosine transform, which gives us a mel-scaled spectrum. This differs from the analysis mel-scale cepstrum because of the liftering, but in fact the differences in the envelopes of the original and reconstructed spectra have been shown to be minor, particularly with respect to the important formant locations. [Pg.442]
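The zero-padding and inverse cosine transform steps above can be sketched as follows. This is a minimal sketch under stated assumptions: it uses an orthonormal DCT-II (so that the inverse is simply the transpose), it treats the result as a log-magnitude value per mel band, and the names `dct_matrix` and `cepstrum_to_mel_spectrum` are hypothetical. In practice the scaling convention must match whatever was used in analysis, and the tilt-removal step is not shown because its details are elided in the excerpt.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * k * (j + 0.5) / n)
    m[0] /= np.sqrt(2.0)  # first row scaled so that m @ m.T is the identity
    return m

def cepstrum_to_mel_spectrum(ceps, n_bands):
    """Zero-pad a truncated cepstrum to n_bands and invert the cosine transform."""
    padded = np.zeros(n_bands)
    padded[:len(ceps)] = ceps          # discarded high coefficients become zero
    return dct_matrix(n_bands).T @ padded

if __name__ == "__main__":
    n_bands = 24
    # A made-up smooth envelope over 24 mel bands, for illustration only.
    envelope = np.cos(np.linspace(0.0, np.pi, n_bands))
    ceps = dct_matrix(n_bands) @ envelope
    smooth = cepstrum_to_mel_spectrum(ceps[:12], n_bands)
    print(np.max(np.abs(smooth - envelope)))
```

Keeping all coefficients reproduces the input exactly; keeping only the lowest ones yields a smoothed version of the mel-scaled envelope, which is the effect of the liftering discussed above.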

The frames are processed so as to remove phase and source information. The most common representation is based on mel-scale cepstral coefficients (MFCCs), introduced in Section 12.5.7. Hence each observation o_j is a vector of continuous values. For each phone we build a probabilistic model which tells us the probability of observing a particular acoustic input. [Pg.448]
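To make the idea of scoring a continuous observation vector concrete, the sketch below evaluates the log density of one MFCC vector under a single Gaussian with diagonal covariance. This is an assumption made for illustration: the excerpt does not specify the form of the per-phone model, and real systems typically use richer models. The function name and the example numbers are hypothetical.

```python
import numpy as np

def diag_gaussian_log_density(obs, mean, var):
    """Log density of obs under a Gaussian with diagonal covariance.

    With a diagonal covariance the dimensions are treated as independent,
    so the log density is a sum of one-dimensional Gaussian terms.
    """
    obs, mean, var = (np.asarray(a, dtype=float) for a in (obs, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (obs - mean) ** 2 / var)

if __name__ == "__main__":
    # Made-up 3-dimensional example, not data from any real system.
    observation = [1.2, -0.3, 0.8]
    phone_mean = [1.0, 0.0, 1.0]
    phone_var = [0.5, 0.4, 0.6]
    print(diag_gaussian_log_density(observation, phone_mean, phone_var))
```

Comparing this score across the models for different phones gives a simple picture of how an acoustic model ranks candidate phones for a frame.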

Figure 12.10 Filter-bank analysis on magnitude spectra (a) with evenly spaced bins and (b) with bins spaced according to the mel-scale.

Transform the representation into a space that has more desirable properties: log magnitude spectra follow the ear's dynamic range; mel-scaled cepstra scale according to the frequency sensitivity of the ear; log area ratios are amenable to simple interpolation; and line-spectral frequencies show the formant patterns robustly. [Pg.386]

In front-end processing, the speech signals need to be preprocessed. Following [6], several processing steps occur. First, the speech is segmented into frames using a 20-ms window at a 10-ms frame rate. Next, mel-scale cepstrum feature vectors are extracted from the speech frames. The discrete... [Pg.560]
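The framing step can be sketched as below. It assumes that the "10-ms frame rate" means a new frame starts every 10 ms, so consecutive 20-ms frames overlap by half, and that trailing samples which do not fill a whole frame are dropped. The function name `frame_signal` and its parameters are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_ms=20.0, shift_ms=10.0):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * window_ms / 1000.0))   # samples per frame
    hop = int(round(sample_rate * shift_ms / 1000.0))    # samples between frame starts
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

if __name__ == "__main__":
    sr = 16000
    one_second = np.arange(sr, dtype=float)   # placeholder samples
    frames = frame_signal(one_second, sr)
    print(frames.shape)
```

Each row of the result is one frame, ready for windowing and feature extraction.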
