Semiotic classification

Semiotic classification can be performed by any of the techniques described in Section 5.2.1 above. First, we define which classes we believe our system will encounter. We can immediately identify some which crop up regardless of text type and genre, and these include [Pg.93]

In addition, we have a number of classes which are more genre specific; for example, we may have street addresses in a telephone directory application, computer programs in a screen reader for the blind, and any number of systems in specialist areas such as medicine, engineering or construction. [Pg.93]

Chapter 5. Text Decoding: Finding the words from the text [Pg.94]

So it is important to have a good knowledge of the application area in order to ensure a good and accurate coverage of the type of semiotic systems that will be encountered. [Pg.94]

It should be clear that a simple classifier based on token observations will only be sufficient in the simplest of cases. If we take the case of cardinal numbers, we see that it would be absurd to list the possible tokens that a cardinal class can give rise to; the list would literally be infinite. To handle this, we make use of specialist sub-classifiers which use regular expression generators to identify likely tokens from a given class. One way of doing this is to run regular expression matches on tokens and use the result of this as a feature in the classifier itself. So we might have [Pg.94]
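The idea of feeding regular-expression matches into a classifier as features can be sketched roughly as follows. This is a minimal illustration, not the source's own feature set: the pattern names, the patterns themselves and the function `regex_features` are all assumptions made for the example.

```python
import re
from typing import Dict

# Hypothetical patterns, assumed for illustration only; a real system
# would need patterns tuned to its language and application area.
PATTERNS = {
    "looks_like_cardinal": re.compile(r"\d+"),
    "looks_like_date": re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "looks_like_time": re.compile(r"\d{1,2}:\d{2}"),
}


def regex_features(token: str) -> Dict[str, bool]:
    """Return one boolean feature per pattern.

    A feature is True only when the pattern matches the whole token,
    so "10/12/67" does not count as a cardinal.
    """
    return {name: bool(p.fullmatch(token)) for name, p in PATTERNS.items()}


if __name__ == "__main__":
    for tok in ["1967", "10/12/67", "hello"]:
        print(tok, regex_features(tok))
```

The resulting dictionary would then be passed, alongside any other features, to whichever classifier is in use; a pattern covers an unbounded set of tokens that could never be listed explicitly.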

The process of semiotic classification can therefore be summarised as follows. [Pg.93]

Semiotic classification is therefore a question of assigning the correct class to each of these tokens. This can be done based on the patterns within the tokens themselves (e.g. three numbers divided by slashes, as in 10/12/67, is indicative of a date) and optionally the tokens surrounding the one in question (so that if we find 1967 preceded by "in" there is a good chance that this is a year). [Pg.45]
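A small rule-based sketch shows how token-internal patterns and the neighbouring token might be combined. The class labels, the rule order and the function `classify` are assumptions chosen for illustration; they are not taken from the source, and a practical system would need far broader coverage.

```python
import re

# Hypothetical token-internal patterns (assumed for this sketch).
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
FOUR_DIGITS_RE = re.compile(r"\d{4}")
DIGITS_RE = re.compile(r"\d+")


def classify(tokens, i):
    """Assign an illustrative semiotic class to tokens[i].

    Uses the shape of the token itself and, for years, the
    immediately preceding token as context.
    """
    token = tokens[i]
    prev = tokens[i - 1].lower() if i > 0 else ""

    # Three numbers divided by slashes suggest a date.
    if DATE_RE.fullmatch(token):
        return "date"
    # Four digits preceded by "in" suggest a year.
    if FOUR_DIGITS_RE.fullmatch(token) and prev == "in":
        return "year"
    # Any other run of digits is treated as a cardinal number.
    if DIGITS_RE.fullmatch(token):
        return "cardinal"
    return "natural_language"


if __name__ == "__main__":
    words = "born in 1967 on 10/12/67 with 1967 sheep".split()
    print([(w, classify(words, k)) for k, w in enumerate(words)])
```

Here the same token, 1967, receives different labels depending on the token before it, which is the behaviour the context rule is meant to capture.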

In both semiotic classification and homograph disambiguation, our goal is to assign a label to each token, where the label is drawn from a pre-defined set. This process is one of classification, which, as we shall see, crops up all the time in TTS, to the extent that we use a number of basic approaches again and again for various TTS problems. [Pg.80]

While we like to imagine that beautiful, well-constructed theories underlie our algorithms, we frequently find that when it comes to text classification many systems in practice simply use a hodge-podge of rules to perform the task. This is particularly common in approaches to semiotic classification in genres where the amount of non-natural language text is very low, and so only a few special cases (say dealing with numbers) are required. [Pg.84]

For text genres dominated by natural language text, and in situations where very little training data is available, this approach may be adequate for semiotic classification. This is less likely to work for homographs, but again depending on the language, the level of errors may be acceptable. [Pg.84]

The first stage in text analysis is therefore to perform semiotic classification, which assigns a class to each token. This is performed by one of the above-mentioned classification algorithms... [Pg.110]

The take-home message from this discussion is that the theoretical difficulties with prosodic models and the underspecification of prosody in text mean that this is an inherently difficult problem and should be approached with a degree of caution. While it is possible to continually drive down the error rate with regard to semiotic classification, homograph disambiguation and so on, it is unrealistic to assume the same can be done with prosody. [Pg.145]

The main problem in natural-language text decoding is dealing with homographs, whether they be accidental, part-of-speech or abbreviation homographs. Many systems choose to handle each of these separately, and in particular abbreviation homographs are often dealt with at the same time or in the same module as semiotic classification. [Pg.99]

Typically this is done in systems that don't have an explicit semiotic-classification stage. An alternative approach is to handle these in a single module. Regardless, the general approach is to use one of the token-disambiguation techniques described in Section 5.2.1. Here we discuss some of the specifics of homograph disambiguation. [Pg.100]

In commercial TTS, text analysis is seen as vitally important since any error can cause an enormous negative impression with a listener. Despite this, the problem has received only sporadic attention amongst academic researchers. Notable exceptions include various works by Richard Sproat [410], [411], [412], [413], who tackles nearly all the problems in text analysis (and many in prosody, explained in the next chapter). The Bell Labs system that Sproat worked on is particularly well known for its very high accuracy with regard to text processing. In a separate line of work Sproat et al. investigated a number of machine-learning approaches to semiotic classification and verbalisation. [Pg.108]
