
Acoustic representations

The acoustic waveform itself is rarely studied directly. This is because phase differences, which significantly affect the shape of the waveform, are in fact not relevant for speech perception. We will deal with phase properly in Chapter 10, but for now let us take it for granted that unless we normalise the speech with respect to phase we will find it very difficult to discern the necessary patterns. Luckily we have a well-established technique for removing phase, known as spectral analysis. [Pg.159]
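To make this concrete, here is a minimal sketch (in Python with NumPy, an assumption since the text itself shows no code) of how taking the magnitude spectrum normalises away phase: the two sinusoids below carry an arbitrary phase offset, yet the magnitude spectrum is virtually unaffected by it.

import numpy as np

fs = 16000                          # sampling rate in Hz (assumed)
t = np.arange(1024) / fs            # one analysis frame
# Two sinusoids; the 1.3 rad offset stands in for arbitrary phase
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t + 1.3)

spectrum = np.fft.rfft(x * np.hanning(len(x)))   # windowed DFT
magnitude = np.abs(spectrum)        # phase is discarded at this step
# `magnitude` is virtually unchanged however the phase offset is set,
# even though the time-domain waveform shape changes completely.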

In this figure we can see the harmonics quite clearly: they are shown as the vertical spikes which occur at even intervals. In addition to this, we can discern a spectral envelope, which is the pattern of amplitude of the harmonics. From our previous sections, we know that the position of the harmonics is dependent on the fundamental frequency and the glottis, whereas the spectral envelope is controlled by the vocal tract and hence contains the information required for vowel and consonant identity. By various other techniques, it is possible to further separate the harmonics from the envelope, so that we can determine the fundamental frequency (useful for prosodic analysis) and envelope shape. [Pg.160]
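One standard such technique is cepstral liftering, sketched below under the assumption of a Python/NumPy setting: low quefrencies capture the slowly varying envelope, while the strongest high-quefrency peak gives the fundamental frequency. The frame length, lifter cut-off and pitch range are illustrative assumptions.

import numpy as np

def spectral_envelope(frame, n_coeffs=30):
    """Keep only low quefrencies (the slowly varying envelope)."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    cepstrum[n_coeffs:-n_coeffs] = 0.0          # discard harmonic fine structure
    return np.exp(np.fft.rfft(cepstrum).real)   # smooth envelope magnitude

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Pick the cepstral peak within a plausible pitch range."""
    cepstrum = np.fft.irfft(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + np.argmax(cepstrum[lo:hi]))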

It is important to realise that the spectrogram is an artificial representation of the speech signal that has been produced by software so as to highlight the salient features that a... [Pg.157]
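For illustration, the following sketch shows the short-time analysis that spectrogram software typically performs, using scipy.signal.spectrogram; the chirp stand-in signal and the window and overlap settings are assumptions, and choices such as window length and dynamic-range compression are exactly the kind of deliberate "artificial" decisions referred to above.

import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs
x = signal.chirp(t, f0=100, f1=4000, t1=1.0)   # stand-in for speech

f, times, Sxx = signal.spectrogram(
    x, fs=fs,
    nperseg=400,    # ~25 ms window; length sets the wideband/narrowband trade-off
    noverlap=300,   # 75% overlap for a smooth time axis
)
# Plotting 10*log10(Sxx) on a grey scale gives the familiar picture;
# the dynamic-range compression is one of the choices that makes the
# salient features visible.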


This chapter is concerned with the issue of synthesising acoustic representations of prosody. The input to the algorithms described here varies, but in general takes the form of the phrasing, stress, prominence and discourse patterns which we introduced in Chapter 6. Hence the complete process of synthesis of prosody can be seen as one where we first extract a prosodic-form representation from the text, described in Chapter 6, and then synthesise an acoustic representation of this form, described here. [Pg.227]

The majority of this chapter focuses on the synthesis of intonation. The main acoustic representation of intonation is fundamental frequency (F0), such that intonation is often defined as the manipulation of F0 for communicative or linguistic purposes. As we shall see, techniques for synthesising F0 contours are inherently linked to the model of intonation used, and so the whole topic of intonation, including theories, models and F0 synthesis, is dealt with here. In addition, we cover the topic of predicting intonation form from text, which was deferred from Chapter 6, as we first require an understanding of intonational phenomena, theories and models before explaining this. [Pg.227]
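As a toy illustration of generating an F0 contour, the sketch below superimposes bell-shaped accent bumps on a declining phrase line, loosely in the spirit of superposition models; the function shapes and all constants are assumptions, not any specific published model.

import numpy as np

def f0_contour(duration_s, accents, rate=100, start_hz=130.0, end_hz=100.0):
    """accents: list of (time_s, amplitude_hz) prominence peaks."""
    t = np.arange(int(duration_s * rate)) / rate
    contour = np.linspace(start_hz, end_hz, len(t))   # declination line
    for centre, amp in accents:
        # Gaussian accent shape; 0.08 s width is an arbitrary choice
        contour = contour + amp * np.exp(-0.5 * ((t - centre) / 0.08) ** 2)
    return t, contour

t, f0 = f0_contour(2.0, accents=[(0.4, 30.0), (1.3, 20.0)])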

Timing is considered the second most important acoustic representation of prosody. Timing is used to indicate stress (stressed phones are longer than normal), phrasing (phones get noticeably longer immediately prior to a phrase break) and rhythm. [Pg.227]
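A hedged sketch of how these timing effects can be modelled with simple multiplicative duration rules follows; the intrinsic durations and factor values are illustrative assumptions, not values from any particular system.

# Toy intrinsic phone durations in milliseconds (illustrative assumptions)
INTRINSIC_MS = {"a": 120, "i": 100, "t": 60, "n": 70}

def phone_duration(phone, stressed=False, pre_boundary=False):
    """Apply multiplicative factors for stress and pre-boundary lengthening."""
    d = INTRINSIC_MS.get(phone, 80)
    if stressed:
        d *= 1.3        # stress: phones longer than normal
    if pre_boundary:
        d *= 1.5        # phrasing: lengthening just before a phrase break
    return d

print(phone_duration("a", stressed=True, pre_boundary=True))   # 234.0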

F0/pitch and timing are the two main acoustic representations of prosody, but this is largely because they are quite easy to identify and measure; other aspects of the signal, most notably voice quality, are also heavily influenced by prosody. [Pg.263]

As we shall see in Sections 15.2.4 and 16.4.1, decision-tree clustering is a key component of a number of synthesis systems. In these, the decision tree is seen not so much as a clustering algorithm, but rather as a mapping or function from the discrete feature space to the acoustic space. As this is fully defined for every possible feature combination, it provides a general mechanism for generating acoustic representations from linguistic ones. [Pg.468]
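The sketch below illustrates this view of the tree as a total function, using scikit-learn's DecisionTreeRegressor on made-up features and targets (both assumptions; real systems use rich context features and spectral/F0 parameter vectors).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy linguistic context features: [is_vowel, is_stressed, phrase_final]
X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
# Toy acoustic targets: [duration_ms, mean_f0_hz]
y = np.array([[130, 140], [90, 120], [70, 100], [110, 115]])

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# Every leaf stores an acoustic value, so the mapping is defined for
# *every* feature combination, including ones never seen in training:
print(tree.predict([[0, 1, 0]]))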

The weakness of the left-out-data approach is that we are heavily reliant on a distance function which we know only partially corresponds to human perception. A number of approaches have therefore been designed to create target functions that more directly mimic human perception. One way of improving the distance metric is to use an acoustic representation based on models of human perception; for example, Tsuzaki [457] uses an auditory-modelling approach. But as we shall see in Section 16.7.2, even a distance function that perfectly models human auditory perception is only part of the story: the categorical boundaries in perception between one discrete feature value and another mean that no such measure on its own can sufficiently mimic human perception in the broader sense. [Pg.501]
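As one concrete example of a perceptually-motivated measure, the sketch below computes a mel-cepstral-distortion-style distance between two frames; this is a generic formulation, not Tsuzaki's auditory model, and the coefficient convention is an assumption.

import numpy as np

def mel_cepstral_distortion(mc1, mc2):
    """mc1, mc2: mel-cepstral coefficient vectors for two frames
    (conventionally c1..cN, excluding the energy term c0)."""
    diff = np.asarray(mc1) - np.asarray(mc2)
    # Standard MCD scaling: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.sqrt(np.sum(diff ** 2))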

A more extreme view is to consider the middle region very small, such that we take the view that units either join together well (we can't hear the join) or they don't (we can hear the join). The join cost then becomes a function which returns a binary value. Another way of stating this is that the join function is not returning a cost at all, but is in fact a classifier which simply returns true if two units will join and false if they won't. To our knowledge, Pantazis et al. [345] is the only published study that has examined this approach in detail. In that study, listeners were asked to say whether they could hear a join or not, and this was used to build a classifier which made a decision based on acoustic features. In their study, they used a harmonic model as the acoustic representation (see... [Pg.515]
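A hedged sketch of such a binary join classifier follows, using logistic regression on made-up join features; both the features and the model choice are assumptions, not the harmonic-model features or classifier of Pantazis et al.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy join features: [F0 jump in Hz, spectral distance, energy gap in dB]
X = np.array([[2.0, 0.1, 0.5],
              [40.0, 1.2, 6.0],
              [5.0, 0.3, 1.0],
              [25.0, 0.9, 4.0]])
y = np.array([0, 1, 0, 1])   # 0 = join inaudible, 1 = listeners heard it

clf = LogisticRegression().fit(X, y)
# Returns 1 if the join is predicted audible, 0 otherwise
print(clf.predict([[10.0, 0.4, 2.0]]))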

