Big Chemical Encyclopedia


Supervised statistical learning

A central problem in computational chemistry is to find empirical relationships between the structures of organic compounds and their experimental physicochemical, biological or pharmaceutical properties. This is necessary whenever the functional dependence of a property on molecular structure is not known, or its calculation requires extraordinary effort. In particular, one may be interested in predicting properties from a molecular structure or, conversely, in deriving a molecular structure from known properties. [Pg.221]

The first problem is known as the application of quantitative structure-property relationships (QSPRs). The analogous term quantitative structure-activity relationship (QSAR) is used if a biological activity, rather than a physicochemical property, is to be modeled. [Pg.221]

Inverse QSPR/QSAR is the search for compounds exhibiting prescribed property/activity values. Molecular structure elucidation aims at identifying an unknown compound, i.e. deducing its molecular structure from measured physicochemical properties, most often spectra. [Pg.221]

The aim of QSPR/QSAR work is to develop mathematical models based on known cases to allow predictions for unknown cases. Furthermore, such models may lead to a better understanding of the often complex causal dependence of a property on structure. Important mathematical tools in the search for QSPRs are the statistical methods of supervised learning. Application of such methods requires a sufficiently large database of the appropriate structure-property pairs. [Pg.221]
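The idea of learning a model from known structure-property pairs can be sketched with a toy example. Everything below is hypothetical: the two descriptors and the property values are invented, and ordinary least squares stands in for whatever supervised method a real QSPR study would use (NumPy is assumed to be available).

```python
import numpy as np

# Hypothetical training set: each row holds two invented molecular
# descriptors (e.g. molar mass, count of polar groups) for one compound.
X = np.array([[100.0, 1.0], [120.0, 2.0], [150.0, 0.0], [180.0, 3.0]])
y = np.array([2.1, 1.5, 3.4, 1.0])  # invented measured property values

# Append a constant column so the linear model includes an intercept.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(descriptors):
    """Predict the property for an unseen compound from its descriptors."""
    return float(np.dot(np.append(descriptors, 1.0), coef))
```

Once fitted on the known cases, `predict` can be applied to any compound for which the same descriptors can be computed, which is the "prediction for unknown cases" the passage describes.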


In Chapter 6 we describe the basic principles of supervised statistical learning and show how it can be used in computer chemistry when a causal connection between structure and property is not known, or can only be calculated with extremely high effort. Such problems occur quite often in combinatorial chemistry as well as in molecular structure elucidation. [Pg.10]

The methods of discrete mathematics, introduced in this chapter, were sufficient for describing chemical compounds as discrete structures. However, once the relationships between properties and structures have to be modelled, non-discrete methods are required. Methods from supervised statistical learning theory and machine learning are particularly useful and thus some of these will be introduced in the next chapter. [Pg.220]

While we focused on the methods of supervised statistical learning in this chapter, the next chapter builds on this with a specific focus on applying them to quantitative structure-property relationships (QSPRs). [Pg.239]

Spectrum interpretation is the extraction of structural properties from spectroscopic data. Methods of pattern recognition and of supervised statistical learning are used (Section 8.5). [Pg.299]

Classification describes the process of assigning an instance or property to one of several given classes. The classes are defined beforehand and this class assignment is used in the learning process, which is therefore supervised. Statistical methods and decision trees (cf. Section 9.3) are also widely used for classification tasks. [Pg.473]
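As a minimal illustration of supervised classification with classes defined beforehand, the sketch below assigns an instance to the class whose training-set centroid is nearest. The class names and data points are invented, and NumPy is assumed; real applications would use the statistical methods or decision trees mentioned above.

```python
import numpy as np

# Classes are defined beforehand; their labelled training instances
# drive the (supervised) learning step, here just computing centroids.
train = {
    "class_a": np.array([[1.0, 1.0], [1.2, 0.8]]),
    "class_b": np.array([[4.0, 4.0], [3.8, 4.2]]),
}
centroids = {c: pts.mean(axis=0) for c, pts in train.items()}

def classify(x):
    """Assign an instance to the class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
```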

The support vector machine (SVM) is originally a binary supervised classification algorithm, introduced by Vapnik and his co-workers [13, 32] and based on statistical learning theory. Instead of the traditional empirical risk minimization (ERM) performed by artificial neural networks, the SVM algorithm is based on the structural risk minimization (SRM) principle. In its simplest form, a linear SVM for a two-class problem finds an optimal hyperplane that maximizes the separation between the two classes. The optimal separating hyperplane can be obtained by solving the following quadratic optimization problem ... [Pg.145]
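Since the quoted quadratic program is not reproduced here, the sketch below instead approximates a linear soft-margin SVM by subgradient descent on the equivalent primal hinge-loss objective. The toy data, penalty `C`, learning rate, and iteration count are arbitrary illustrative choices, and NumPy is assumed; a real solver would work on the quadratic (dual) problem directly.

```python
import numpy as np

# Toy linearly separable two-class data with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize 0.5*||w||^2 + C * sum of hinge losses by subgradient descent;
# the exact SVM solves the corresponding quadratic program instead.
w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                    # points inside the margin
    grad_w = w - C * (X[viol] * y[viol, None]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

def predict(x):
    """Classify by the side of the separating hyperplane."""
    return np.sign(x @ w + b)
```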

Multiple linear regression is strictly a parametric supervised learning technique. A parametric technique is one which assumes that the variables conform to some distribution (often the Gaussian distribution); the properties of the distribution are assumed in the underlying statistical method. A non-parametric technique does not rely upon the assumption of any particular distribution. A supervised learning method is one which uses information about the dependent variable to derive the model. An unsupervised learning method does not. Thus cluster analysis, principal components analysis and factor analysis are all examples of unsupervised learning techniques. [Pg.719]
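To contrast this with the unsupervised techniques the passage lists, here is a minimal principal components analysis via singular value decomposition: no dependent variable enters the computation at any point. The random data are purely illustrative, and NumPy is assumed.

```python
import numpy as np

# Unsupervised: the principal components are derived from X alone,
# with no dependent variable involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # 50 objects, 3 features
Xc = X - X.mean(axis=0)                # centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                     # coordinates along the components
explained = s**2 / (s**2).sum()        # variance fraction per component
```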

Classification methods, which divide data into groups, can be broadly of two types: supervised and unsupervised. The primary difference is that for supervised methods, prior information about the classes into which the data fall is known and representative samples from these classes are available. The supervised and unsupervised approaches loosely lend themselves to problems that have prior hypotheses and to those in which discovery of the classes of data may be needed, respectively. The division is purely for organizational purposes; in many applications, a combination of both methods can be very powerful. In general, biomedical data analysis will require multiple spectral features and will have stochastic variations. Hence, the field of statistical pattern recognition [88] is of primary importance, and we use the term recognition with our learning and classification method descriptions below. [Pg.191]

Analysis of variance in general serves as a statistical test of the influence of random or systematic factors on measured data (a test for random or fixed effects). One wants to test whether the feature mean values of two or more classes are different. Classes of objects or clusters of data may be given a priori (supervised learning) or found in the course of a learning process (unsupervised learning; see Section 5.3, cluster analysis). In the first case, variance analysis is used for class pattern confirmation. [Pg.182]
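For the supervised case with classes given a priori, a one-way analysis of variance can be computed directly from the between-class and within-class sums of squares. The measured feature values below are invented for illustration, and NumPy is assumed.

```python
import numpy as np

# Two a-priori classes of measured feature values (invented data).
groups = [np.array([4.9, 5.1, 5.0, 4.8]),
          np.array([5.6, 5.8, 5.5, 5.7])]

grand = np.concatenate(groups).mean()
k, n = len(groups), sum(len(g) for g in groups)

# Between-class and within-class sums of squares.
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F statistic: ratio of between- to within-class mean squares.
F = (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value, compared against the F distribution with (k-1, n-k) degrees of freedom, indicates that the class mean values differ, confirming the class pattern.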

This approach employs statistical methods that use no obvious theory-derived basis, but which derive usable relationships from realistic inputs. It is beyond the scope of this review to describe the methods and their validation in detail. Useful reviews are available (Livingstone, 2000, 2003) and more details are provided in Chapter 3. The methods may be divided into two classes, often referred to as those derived from supervised and unsupervised learning. In the latter, the techniques used are freer to explore relationships between variables, and are therefore less likely to produce chance effects. [Pg.58]

There are dozens of neural network architectures in use. A simple taxonomy divides them into two types based on their learning algorithms (supervised or unsupervised) and into subtypes based on whether they are feed-forward or feedback networks. In this chapter, two other commonly used architectures, radial basis functions and Kohonen self-organizing architectures, will be discussed. Additionally, variants of multilayer perceptrons that have enhanced statistical properties will be presented. [Pg.41]

Neural networks can be used for supervised and unsupervised learning, as we know from the statistical methods of pattern recognition (Chapter 5). [Pg.311]





© 2024 chempedia.info