Non-natural-language text

The task of text decoding is to take a tokenised sentence and determine the best sequence of words. In many situations this is a classical disambiguation problem: there is one, and only one, correct sequence of words which gave rise to the text, and it is our job to determine this. In other situations, especially where we are dealing with non-natural-language text such as numbers and dates, there may be a few different acceptable word sequences. [Pg.79]

While we like to imagine that beautiful, well-constructed theories underlie our algorithms, we frequently find that when it comes to text classification many systems in practice simply use a hodge-podge of rules to perform the task. This is particularly common in approaches to semiotic classification in genres where the amount of non-natural-language text is very low, and so only a few special cases (say, dealing with numbers) are required. [Pg.84]
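
A hodge-podge of rules of this kind can be sketched as an ordered list of patterns in which the first match wins. The class names, the patterns and the `classify` function below are illustrative assumptions, not rules taken from the text.

```python
import re

# Hypothetical ordered rules: the first pattern that matches wins.
# Both the class labels and the patterns are assumptions for illustration.
RULES = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
]


def classify(token: str) -> str:
    """Return the label of the first matching rule, else 'natural_language'."""
    for label, pattern in RULES:
        if pattern.match(token):
            return label
    return "natural_language"


if __name__ == "__main__":
    for tok in ["walk", "10:45", "12/10/1999", "3.14", "a.b@example.com"]:
        print(tok, "->", classify(tok))
```

Because the rules are tried in order, more specific patterns need to come before more general ones; here that ordering is a design choice of the sketch, and a system with more special cases would need more care.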

The final step in handling non-natural-language text is to convert it into words; this process is often called verbalisation. If we take our decoded date example, we see that we have values for... [Pg.96]
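
As a rough illustration of verbalisation, the sketch below turns a decoded date into words. The excerpt does not say which fields the decoded date holds, so the `DecodedDate` structure (day and month only), the `style` parameter and the English wordings are all assumptions.

```python
from dataclasses import dataclass

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# Index 0 is unused so that ORDINALS[n] is the ordinal word for n (1..19).
ORDINALS = [
    "", "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]


@dataclass
class DecodedDate:
    """Hypothetical underlying form of a date: just a day and a month."""
    day: int
    month: int


def ordinal(n: int) -> str:
    """Spell out an ordinal for a day of the month (1..31)."""
    if not 1 <= n <= 31:
        raise ValueError(f"day out of range: {n}")
    if n < 20:
        return ORDINALS[n]
    if n == 20:
        return "twentieth"
    if n == 30:
        return "thirtieth"
    tens = "twenty" if n < 30 else "thirty"
    return f"{tens} {ORDINALS[n % 10]}"


def verbalise_date(date: DecodedDate, style: str = "day_first") -> str:
    """Convert the decoded date into one of two acceptable word sequences."""
    if not 1 <= date.month <= 12:
        raise ValueError(f"month out of range: {date.month}")
    day, month = ordinal(date.day), MONTHS[date.month - 1]
    if style == "day_first":
        return f"the {day} of {month}"
    return f"{month} the {day}"


if __name__ == "__main__":
    d = DecodedDate(day=10, month=10)
    print(verbalise_date(d))                 # the tenth of october
    print(verbalise_date(d, "month_first"))  # october the tenth
```

The `style` switch reflects the point that a date may have more than one acceptable word sequence; which one to produce is a choice for the system, not something the decoded form determines.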

Having dealt with non-natural-language text, we now turn our attention to the issue of how to find the words from the text when that text does in fact encode natural language. For many words, this process is relatively straightforward: if we take a word such as walk, we will have its textual form walk listed in the lexicon, and upon finding a token walk in the text we can be fairly sure that this is in fact a written representation of walk. At worst we might be presented with a capitalised version Walk or an upper case version WALK, but these are not too difficult to deal with either. If all words and all text behaved in this manner the problem would be solved, but of course not all cases are so simple, and we can identify the main reasons for this ... [Pg.99]
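
The straightforward case can be sketched as a lexicon lookup that tries the token as written and then a lower-cased version. The toy `LEXICON` and the `find_word` function are assumptions for illustration, and the sketch deliberately ignores the harder cases the passage alludes to.

```python
from typing import Dict, Optional

# Toy lexicon mapping textual forms to word identifiers (an assumption;
# a real lexicon would hold far more information per entry).
LEXICON: Dict[str, str] = {
    "walk": "WORD:walk",
    "the": "WORD:the",
}


def find_word(token: str, lexicon: Dict[str, str] = LEXICON) -> Optional[str]:
    """Look up the token as written, then lower-cased; None if unknown."""
    for candidate in (token, token.lower()):
        if candidate in lexicon:
            return lexicon[candidate]
    return None


if __name__ == "__main__":
    for tok in ["walk", "Walk", "WALK", "xyzzy"]:
        print(tok, "->", find_word(tok))
```

Trying the exact form first leaves room for entries whose capitalisation is significant, while the lower-cased fallback covers sentence-initial and all-capitals tokens.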

Having dealt with non-natural-language text, we now turn our attention to the issue of how to find the words from the text when that text does in fact encode natural language. [Pg.97]

Text is often full of tokens which encode non-natural-language systems such as dates, times, email addresses and so on. [Pg.110]

There are two main ways to deal with this problem. The first is the text-normalisation approach, which sees the text as the input to the synthesiser and tries to rewrite any non-standard text as proper linguistic text. The second is to classify each section of text according to one of the known semiotic classes. From there, a parser specific to each class is used to analyse that section of text and uncover the underlying form. For natural language the text analysis job is now done, but for the other systems an additional stage is needed, where the underlying form is translated into words. [Pg.44]
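
The second approach can be sketched as a table of semiotic classes, each with its own parser and its own verbaliser. Only a single "time" class is shown; the class name, the `hh:mm` pattern, the helper functions and the chosen English wordings are all assumptions made for this sketch.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

TIME_RE = re.compile(r"^(\d{1,2}):(\d{2})$")

Form = Dict[str, int]


def number_words(n: int) -> str:
    """Spell out an integer in the range 0..59."""
    if n < 20:
        return ONES[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else f"{TENS[tens]} {ONES[unit]}"


def parse_time(text: str) -> Optional[Form]:
    """Class-specific parser: return the underlying form, or None."""
    m = TIME_RE.match(text)
    if not m:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    if hour > 23 or minute > 59:
        return None
    return {"hour": hour, "minute": minute}


def verbalise_time(form: Form) -> str:
    """Translate the underlying form of a time into words."""
    hour = number_words(form["hour"])
    if form["minute"] == 0:
        return f"{hour} o'clock"
    if form["minute"] < 10:
        return f"{hour} oh {number_words(form['minute'])}"
    return f"{hour} {number_words(form['minute'])}"


# One entry per semiotic class: (label, parser, verbaliser).
CLASSES: List[Tuple[str, Callable[[str], Optional[Form]], Callable[[Form], str]]] = [
    ("time", parse_time, verbalise_time),
]


def analyse(token: str) -> Tuple[str, str]:
    """Classify a token, parse it, and verbalise it if it is not natural language."""
    for label, parse, verbalise in CLASSES:
        form = parse(token)
        if form is not None:
            return label, verbalise(form)
    # Natural language needs no extra verbalisation stage.
    return "natural_language", token


if __name__ == "__main__":
    for tok in ["10:45", "9:05", "7:00", "walk"]:
        print(tok, "->", analyse(tok))
```

In this sketch classification and parsing are folded together: a token belongs to a class if that class's parser accepts it. Supporting another semiotic class means adding one more row to `CLASSES` rather than changing `analyse`.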
