Multiple-descriptor data sets

Multiple-descriptor data sets and quality analysis... [Pg.553]

For multiple-descriptor data sets, one could use the methods in Section 17.3 to derive a correlation between y and x1, between y and x2, between y and x3, etc., to find the xk data set giving the best correlation with y. It is very likely, however, that one of the other x variables can describe part of the y variation that is not described by xk, and that a third x variable can describe some of the remaining variation, etc. Since the x vectors may be internally correlated, however, the second most important x vector found in a one-to-one correlation is not necessarily the most important once the xk vector has been included. [Pg.555]
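The point about internally correlated x vectors can be illustrated with a small sketch. It is not the procedure of Section 17.3, which is not reproduced here; it only assumes Pearson correlation for the one-to-one ranking and the first-order partial correlation for the ranking once the best descriptor is in the model. The data, the names x1 to x3, and the helper functions are made up for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)

def partial_corr(y, xj, xk):
    """Correlation between y and xj after removing what xk explains in both."""
    r_yj, r_yk, r_jk = pearson(y, xj), pearson(y, xk), pearson(xj, xk)
    return (r_yj - r_yk * r_jk) / math.sqrt((1 - r_yk ** 2) * (1 - r_jk ** 2))

# Made-up toy data: x2 is nearly a copy of x1, x3 is uncorrelated with x1.
descriptors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.5, 1.5, 3.5, 3.5, 4.5, 6.5, 6.5, 8.5],
    "x3": [1, -1, -1, 1, 1, -1, -1, 1],
}
y = [2, 1, 2, 5, 6, 5, 6, 9]

# Ranking from one-to-one correlations with y.
univariate_order = sorted(
    descriptors,
    key=lambda name: abs(pearson(y, descriptors[name])),
    reverse=True,
)

# Ranking of the remaining descriptors once the best one is included.
best = univariate_order[0]
conditional_order = sorted(
    (name for name in descriptors if name != best),
    key=lambda name: abs(partial_corr(y, descriptors[name], descriptors[best])),
    reverse=True,
)

print("one-to-one ranking:", univariate_order)
print("ranking given", best, ":", conditional_order)
```

On this toy data x2 is second in the one-to-one ranking, but because it carries almost the same information as x1, x3 becomes the most useful descriptor once x1 has been included.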

When compounds are selected according to SMD, their structures must be adequately described by means of quantitative variables, "structure descriptors". This description can then be used after the compound selection, synthesis, and biological testing to formulate quantitative models between structural variation and activity variation, so-called Quantitative Structure-Activity Relationships (QSARs). For extensive reviews, see references 3 and 4. With multiple structure descriptors and multiple biological activity variables (responses), these models are necessarily multivariate (M-QSAR) in nature, making the Partial Least Squares Projections to Latent Structures (PLS) approach suitable for the data analysis. PLS is a statistical method that relates a multivariate descriptor data set (X) to a multivariate response data set (Y). PLS is well described elsewhere and will not be described any further here [42, 43]. [Pg.214]

Figures 11 and 12 illustrate the performance of the pR2 compared with several of the currently popular criteria on a specific data set resulting from one of the drug hunting projects at Eli Lilly. This data set has IC50 values for 1289 molecules. There were 2317 descriptors (or covariates), and a multiple linear regression model was used with forward variable selection. The linear model was trained on half the data (selected at random) and evaluated on the other (hold-out) half. The root mean squared error of prediction (RMSE) for the hold-out test set is minimized when the model has 21 parameters. Figure 11 shows the model size chosen by several criteria applied to the training set in a forward selection. For example, the pR2 chose 22 descriptors, the Bayesian Information Criterion (BIC) chose 49, leave-one-out cross-validation chose 308, the adjusted R2 chose 435, and the Akaike Information Criterion (AIC) chose 512 descriptors in the model. Although the pR2 criterion selected considerably fewer descriptors than the other methods, it had the best prediction performance. Also, only pR2 and BIC had better prediction on the test data set than the null model.
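As a rough sketch of how such criteria choose a model size along a forward-selection path, the following uses the common Gaussian linear-model forms of AIC and BIC (additive constants dropped) and the usual adjusted R2. The pR2 criterion is not defined in this excerpt and is therefore not implemented. The residual sums of squares are made-up numbers, not values from the Eli Lilly data set, and the function names are illustrative.

```python
import math

def adjusted_r2(rss, tss, n, k):
    """Adjusted R2 for a model with k descriptors plus an intercept."""
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))

def aic(rss, n, k):
    """AIC for a Gaussian linear model, additive constants dropped."""
    return n * math.log(rss / n) + 2 * (k + 1)

def bic(rss, n, k):
    """BIC for a Gaussian linear model, additive constants dropped."""
    return n * math.log(rss / n) + (k + 1) * math.log(n)

# Made-up training-set residual sums of squares after adding 0..5 descriptors.
n = 100
rss_by_size = {0: 100.0, 1: 60.0, 2: 50.0, 3: 48.5, 4: 47.9, 5: 47.8}
tss = rss_by_size[0]  # with only an intercept, RSS equals the total sum of squares

# AIC and BIC are minimized; adjusted R2 is maximized.
size_aic = min(rss_by_size, key=lambda k: aic(rss_by_size[k], n, k))
size_bic = min(rss_by_size, key=lambda k: bic(rss_by_size[k], n, k))
size_adj = max(rss_by_size, key=lambda k: adjusted_r2(rss_by_size[k], tss, n, k))

print("BIC size:", size_bic, "AIC size:", size_aic, "adjusted R2 size:", size_adj)
```

In this toy path the criteria disagree about the model size because each penalizes an added descriptor differently, which is the kind of disagreement the comparison in Figure 11 reports.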

On the basis of the origin of molecular descriptors used in calculations, QSAR methods can be divided into three groups. One group is based on a relatively small number (usually many times smaller than the number of compounds in a data set) of physicochemical properties and parameters describing, for example, hydrophobic, steric, and electrostatic effects. Usually, these descriptors are used as independent variables in multiple regression approaches (18). In the literature, these methods are typically referred to as Hansch analysis (8). These types of descriptors and the corresponding linear optimization methods used in traditional QSAR analyses are discussed extensively in the chapter by Celassie (7) and therefore are not reviewed here. [Pg.52]

Step 1. Multiple descriptors such as molecular connectivity indices or atom pair descriptors (cf. Section 2.1) are generated initially for every compound in a data set. [Pg.61]

The multiple linear regression models are validated using standard statistical techniques. These techniques include inspection of residual plots, standard deviation, and multiple correlation coefficient. Both regression and computational neural network models are validated using external prediction. The prediction set is not used for descriptor selection, descriptor reduction, or model development, and it therefore represents a true unknown data set. To ascertain the predictive power of a model, the rms error is computed for the prediction set. [Pg.113]
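The rms error for a prediction set can be computed as in the minimal sketch below. The function name and the observed and predicted values are illustrative assumptions; the predictions are taken to come from a model that never saw the prediction set.

```python
import math

def rms_error(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prediction-set values; every prediction is off by 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
prediction_rms = rms_error(observed, predicted)
print(prediction_rms)
```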

Molecule data in ARC are represented in a data set structure, which stores the relationships between molecular data and subsequent dialogs, windows, or routines. Each time single or multiple files are opened, a data set node appears in a tree view that contains subnodes of molecules as well as data calculated for the entire molecule set. Nevertheless, each window contains its own data and can be manipulated independently from the corresponding data set. Each file entry consists of a single compound and may include several subwindows (e.g., molecule, spectrum, descriptor). Depending on the configuration, multiple selected files either are loaded as individual entries or are collected as a single data set. [Pg.153]

Stored in a table where columns are descriptors and rows are compounds (or conformers), QSAR data sets contain separate columns for the measured target property (Y), attributed to the training set, as well as computed descriptors for (external) reference compounds on which the QSAR model is tested (the test set). Statistical procedures, e.g., multiple linear regression (MLR), projection to latent structures (PLS), or neural networks (NN) [38], are then used to establish a mathematical soft model relating the observed measurement(s) in the Y column(s) to some combination of the properties represented in the subsequent columns. PLS, NN, and AI (artificial intelligence) techniques have been explored by Green and Marshall in the context of 3D-QSAR models [39], and were shown to extract similar information. Collinearity (cross-correlation) between descriptors, a problem that may lead to spurious (chance) correlations when using MLR techniques, is usually dealt with in PLS [40],... [Pg.573]
