Correlation multiple-descriptor data sets

For multiple-descriptor data sets, one could use the methods in Section 17.3 to derive a correlation between y and x1, between y and x2, between y and x3, etc., to find the xk data set giving the best correlation with y. It is very likely, however, that one of the other x variables can describe part of the y variation that is not described by xk, and that a third x variable can describe some of the remaining variation, etc. Since the x vectors may be internally correlated, however, the second most important x vector found in a one-to-one correlation is not necessarily the most important once the xk vector has been included. [Pg.555]
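The effect described above can be illustrated with a small sketch. It is not the method of Section 17.3, only a minimal example under stated assumptions: the three descriptors and the y values are made-up numbers, the function names are hypothetical, and the importance of a descriptor after the best one has been included is measured by correlating what remains of it with what remains of y (a partial correlation).

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def residual(target, x):
    """Part of `target` that a straight-line fit on x does not describe."""
    n = len(x)
    mx, mt = sum(x) / n, sum(target) / n
    slope = (sum((u - mx) * (v - mt) for u, v in zip(x, target))
             / sum((u - mx) ** 2 for u in x))
    return [v - mt - slope * (u - mx) for u, v in zip(x, target)]

def rank_descriptors(X, y):
    """Return (one-to-one ranking, ranking of the rest once the best is included)."""
    one_to_one = sorted(X, key=lambda k: abs(pearson(X[k], y)), reverse=True)
    best = one_to_one[0]
    y_left = residual(y, X[best])          # y variation not described by the best x
    partial = {}
    for name in X:
        if name == best:
            continue
        x_left = residual(X[name], X[best])  # remove what the best x already carries
        partial[name] = abs(pearson(x_left, y_left))
    after = sorted(partial, key=partial.get, reverse=True)
    return one_to_one, after

# Made-up data: x2 is nearly a copy of x1, x3 is unrelated to x1, y = x1 + x3.
X = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5],
    "x3": [1, -1, -1, 1, 1, -1, -1, 1],
}
y = [2, 1, 2, 5, 6, 5, 6, 9]

one_to_one, after = rank_descriptors(X, y)
print("one-to-one ranking:", one_to_one)
print("ranking after including", one_to_one[0] + ":", after)
```

In this constructed case x2 ranks second in the one-to-one correlations only because it is internally correlated with x1; once x1 is in the model, x3 describes the remaining variation and x2 adds essentially nothing.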

The multiple linear regression models are validated using standard statistical techniques. These techniques include inspection of residual plots, standard deviation, and multiple correlation coefficient. Both regression and computational neural network models are validated using external prediction. The prediction set is not used for descriptor selection, descriptor reduction, or model development, and it therefore represents a true unknown data set. To ascertain the predictive power of a model, the rms error is computed for the prediction set. [Pg.113]
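External prediction as described here can be sketched as follows. This is only an illustration under assumptions: the training and prediction values are invented, the helper names are hypothetical, and the regression is fitted by solving the normal equations directly, which is adequate for a tiny well-conditioned example but not necessarily how a production QSAR package would do it.

```python
from math import sqrt

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coef, X):
    """Apply the fitted linear model to each row of descriptor values."""
    return [coef[0] + sum(b * v for b, v in zip(coef[1:], r)) for r in X]

def rms_error(observed, predicted):
    """Root-mean-square difference between observed and predicted values."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Made-up training set: two descriptors per compound, y = 1 + 2*a - b exactly.
X_train = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y_train = [1, 3, 0, 2, 4, 1]

# Made-up prediction set, never touched during fitting.
X_pred = [(3, 1), (2, 2)]
y_pred_observed = [6.5, 2.5]

coef = fit_mlr(X_train, y_train)
rms = rms_error(y_pred_observed, predict(coef, X_pred))
print("coefficients:", coef)
print("rms error on prediction set:", rms)
```

The point of the design is that `X_pred` and `y_pred_observed` appear only in the final error calculation, so the rms value reflects performance on compounds the model has not seen.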

QSAR data sets are stored in a table where columns are descriptors and rows are compounds (or conformers). They contain separate columns for the measured target property (Y), attributed to the training set, as well as computed descriptors for (external) reference compounds on which the QSAR model is tested (the test set). Statistical procedures, e.g., multiple linear regression (MLR), projection to latent structures (PLS), or neural networks (NN) [38], are then used to establish a mathematical soft model relating the observed measurement(s) in the Y column(s) with some combination of the properties represented in the subsequent columns. PLS, NN, and AI (artificial intelligence) techniques have been explored by Green and Marshall in the context of 3D-QSAR models [39], and were shown to extract similar information. Collinearity between descriptors (cross-correlation), a problem that may lead to spurious (chance) correlations when using MLR techniques, is usually dealt with in PLS [40],... [Pg.573]

A neural network contains input units, layers of neurons, and an output. Each neuron carries out arithmetic operations on its input to produce an output signal. The type of arithmetic operation is defined by the user; often it is sigmoidal and restricted to values between 0 and 1. The input to a QSAR neural network is the matrix of descriptor values for each compound. One input unit represents the properties of one compound, which is one row of the matrix. In the first layer, each neuron usually represents one molecular descriptor, corresponding to one column of the matrix. However, if the input data have internal correlations, the network is set up with a reduced number of neurons (such as the number of significant principal components). The output signal from a neuron has a value that describes the relationship between all input signals and the property represented by that neuron. In multiple regression terms, this is the coefficient of the property. Some advocate... [Pg.193]
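The arithmetic a sigmoidal neuron performs can be sketched as a generic feed-forward pass. This is a minimal illustration, not the specific architecture of the cited text: the weights, biases, and descriptor values are arbitrary made-up numbers, the function names are hypothetical, and no training step is shown.

```python
from math import exp

def sigmoid(z):
    """Sigmoidal transfer function; its output is restricted to (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """Each neuron forms a weighted sum of all inputs and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def forward(descriptors, hidden_w, hidden_b, out_w, out_b):
    """Pass one compound (one row of the descriptor matrix) through the network."""
    hidden = layer(descriptors, hidden_w, hidden_b)
    return layer(hidden, [out_w], [out_b])[0]

# Arbitrary illustrative parameters: 3 descriptors, 2 hidden neurons, 1 output.
hidden_w = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.8]]
hidden_b = [0.0, 0.1]
out_w = [1.2, -0.7]
out_b = 0.05

output = forward([0.2, 1.5, -0.7], hidden_w, hidden_b, out_w, out_b)
neutral = forward([0.2, 1.5, -0.7], [[0.0] * 3] * 2, [0.0, 0.0], [0.0, 0.0], 0.0)
print("network output:", output)
print("output with all weights zero:", neutral)
```

Because the output neuron is itself sigmoidal, a measured property would normally need to be scaled into the 0 to 1 range before such a network could be fitted to it.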

The pool of descriptors that is calculated must be winnowed down to a manageable set before constructing a statistical or neural network model. This operation is called feature selection. The first step of feature selection is to use a battery of objective statistical methods. Descriptors that contain little information, descriptors that have little variation across the data set, or descriptors that are highly correlated with other descriptors are candidates for elimination. Multivariate correlations among descriptors can also be discovered with multiple linear regression analysis, and these relationships can be broken down by elimination of descriptors. [Pg.2325]
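Two of the objective filters mentioned above (little variation, and high pairwise correlation with another descriptor) might look like the sketch below. The thresholds `min_variance` and `max_corr`, the descriptor names, and the values are all assumptions chosen for illustration; the source does not specify cutoffs, and the multivariate check by multiple linear regression is not shown.

```python
from math import sqrt

def variance(values):
    """Population variance of a descriptor across the data set."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def pearson(a, b):
    """Pearson correlation coefficient between two descriptors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def objective_filter(descriptors, min_variance=1e-3, max_corr=0.95):
    """Keep descriptors that vary and are not near-duplicates of one already kept."""
    kept = {}
    for name, values in descriptors.items():
        if variance(values) < min_variance:
            continue  # little variation across the data set
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue  # highly correlated with a descriptor already kept
        kept[name] = values
    return list(kept)

# Hypothetical descriptor pool for five compounds.
pool = {
    "mw": [100, 150, 200, 250, 300],
    "const": [1, 1, 1, 1, 1],
    "mw_scaled": [10.1, 15.0, 20.0, 25.2, 30.0],
    "logp": [1.2, -0.5, 2.0, 0.3, 1.1],
}

kept = objective_filter(pool)
print("descriptors kept:", kept)
```

Note that this greedy pass keeps whichever member of a correlated pair comes first in the pool, so the result depends on descriptor order; a real workflow might instead keep the member that is easier to interpret or cheaper to compute.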

Elguero et al. [141] have reduced Palm's analogous tetraparametric model for the multiple correlation of solvent effects [cf. Eq. (7-48) in Section 7.7] to a triparametric one, with two factors explaining 94% of the data variance given in an original set of four descriptors [Y, P, E, and B of Eq. (7-48) in Section 7.7] for 51 solvents. [Pg.87]
