Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...


Descriptors matrix

Hierarchical cluster analysis (Section 6.4)—with the result represented by a dendrogram—is a complementary, nonlinear, and widely used method for cluster analysis. The distance measure used was the Tanimoto distance dTani (Equation 6.5), and the cluster mode was average linkage. The dendrogram in Figure 6.6 (without the chemical structures) was obtained from the descriptor matrix X by... [Pg.273]

Principal Component Analysis (PCA) is the most popular technique of multivariate analysis used in environmental chemistry and toxicology [313-316]. Both PCA and factor analysis (FA) aim to reduce the dimensionality of a set of data, but the approaches to do so differ between the two techniques. Each provides a different insight into the data structure, with PCA concentrating on explaining the diagonal elements of the covariance matrix and FA the off-diagonal elements [313, 316-319]. Theoretically, PCA corresponds to a mathematical decomposition of the descriptor matrix, X, into means (x̄k), scores (tia), loadings (pak), and residuals (eik), which can be expressed as... [Pg.268]
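In standard chemometric notation, this decomposition is commonly written elementwise as (a sketch; A denotes the number of components retained):

```latex
x_{ik} = \bar{x}_k + \sum_{a=1}^{A} t_{ia}\,p_{ak} + e_{ik}
```

where $t_{ia}$ are the scores, $p_{ak}$ the loadings, and $e_{ik}$ the residuals for compound $i$ and descriptor $k$.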

Descriptor matrix coordinates can be preprocessed using the recently developed filtering technique (37) known as OSC, which is also implemented in... [Pg.223]

The descriptor matrix formed by the uniform descriptor vectors for all the training set structures, as well as the experimental activity values, serves as the source data for the statistical analysis and construction of the QSAR model. [Pg.158]

The first characteristic of this geometry representation that must be emphasized is that it is completely alignment independent, since the values assigned to every variable depend only on the mutual distance of the nodes and not on the position of the nodes in space. Therefore, if we carry out this analysis for a series of molecules not aligned in space, the variables obtained for every compound do nevertheless have the same meaning and the values obtained can be combined to build a consistent descriptor matrix, without the need to align or otherwise superimpose their structures. [Pg.125]
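A minimal sketch of this alignment independence, using sorted pairwise internode distances as the descriptor vector (the node coordinates, rotation, and translation below are invented for illustration):

```python
import numpy as np

def distance_descriptors(coords):
    """Sorted pairwise internode distances: depend only on mutual geometry,
    not on position or orientation in space."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)   # each pair once
    return np.sort(d[iu])

coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [1.5, 1.2, 0.0],
                   [0.3, 0.8, 1.1]])

# the same structure, arbitrarily rotated and translated (no alignment done)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
moved = coords @ R.T + np.array([5.0, -2.0, 3.0])
```

Because only mutual distances enter the descriptor, the two coordinate sets yield identical variables even though they were never superimposed.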

The PLS multivariate data analysis of the training set was carried out on the descriptors matrix to correlate the complete set of variables with the activity data. From a total of 710 variables, 559 active variables remained after filtering descriptors with no variability by the ALMOND program. The PLS analysis resulted in four latent variables (LVs) with r² = 0.76. The cross-validation of the model using the leave-one-out (LOO) method yielded a q² value of 0.72. As shown in Table 9.2, the GRIND descriptors 11-36, 44-49, 12-28, 13-42, 14-46, 24-46 and 34-45 were found to correlate with the inhibition activity in terms of high coefficients. [Pg.205]

PLS multivariate data analysis of the training set was carried out on the descriptors matrix to correlate the complete set of variables (640) with the activity data. [Pg.211]

There are many ways of characterizing different statistical machine-learning methods and protocols, but in this section, they will be organized into linear and nonlinear methods (even though the descriptor matrix they operate on may contain higher order terms and cross-terms) as well as rule-based and Bayesian methods. [Pg.388]

The statistically focused methods for defining ADs are related to the information content of the investigated descriptors, for example, the variance of the descriptor matrix. They calculate the amount of unexplained variance for the training set objects (the model) and compare it with the corresponding amount for new objects to be predicted. If the amount of unexplained variance for a new object is much greater, typically more than two standard deviations from the training set compounds (≈95% confidence interval), the object is designated as outside the AD of the model. [Pg.397]
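A minimal sketch of such a variance-based AD check, assuming a PCA model with two components (the training data, the component count, and the helper name `outside_ad` are illustrative; the 2-standard-deviation cutoff follows the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))           # invented training descriptors

# two-component PCA model on centred training data
mu = X_train.mean(axis=0)
Xc = X_train - mu
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2]                                   # loadings of the model

# unexplained (residual) variance per training object
resid = Xc - (Xc @ P.T) @ P
ss_train = (resid ** 2).sum(axis=1)
threshold = ss_train.mean() + 2 * ss_train.std()

def outside_ad(x_new):
    """Flag an object whose unexplained variance exceeds the 2-SD cutoff."""
    xc = x_new - mu
    r = xc - (xc @ P.T) @ P
    return (r ** 2).sum() > threshold
```

An object lying far off the model plane is flagged as outside the AD, while one that the model reconstructs well is not.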

Scaling converts the original descriptor matrix to a normalized data matrix, which hereafter will be denoted by X. The columns in X are the scaled descriptor variables; see Fig. 15.9. [Pg.355]

Fig. 15.9 Scaling converts the original descriptor matrix into the normalized data matrix X.
Compute the mean, x̄k, of each descriptor k over the set of compounds and subtract this value from each element xik. This yields the scaled and mean-centred descriptor matrix (X − X̄). [Pg.362]
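A small sketch of the two steps, mean centring plus unit-variance scaling (the descriptor values below are invented):

```python
import numpy as np

# invented descriptor matrix: 4 compounds x 3 descriptors on very different scales
X = np.array([[1.0, 200.0, 0.1],
              [2.0, 240.0, 0.3],
              [3.0, 180.0, 0.2],
              [4.0, 260.0, 0.4]])

X_centred = X - X.mean(axis=0)                 # subtract each descriptor's mean
X_scaled = X_centred / X.std(axis=0, ddof=1)   # scale each descriptor to unit variance
```

After this, every column has zero mean and unit variance, so no descriptor dominates merely through its units.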

For the descriptor matrix X used in principal components and factor analysis, the matrix XᵀX is symmetric. Its eigenvalues are related to the sum of squares of the descriptors. The corresponding eigenvectors are the loading vectors of the principal component model. [Pg.517]
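This correspondence is easy to check numerically; a sketch with random centred data (the eigenvectors of XᵀX match the SVD's right singular vectors up to sign, and the eigenvalues sum to the total sum of squares):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))
X = X - X.mean(axis=0)                 # centred descriptor matrix

# eigendecomposition of the symmetric matrix X'X, largest eigenvalue first
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# singular value decomposition of X itself for comparison
U, s, Vt = np.linalg.svd(X, full_matrices=False)
```

The eigenvalues equal the squared singular values, their sum equals the total sum of squares of the descriptors, and the eigenvectors are the loading vectors (defined only up to sign).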

FIGURE 29.10 Dendrogram obtained after hierarchical clustering of principal components 1-3 calculated from the CoMFA descriptor matrix. Compounds are numbered according to Figure 29.8. Graphics taken from Ref. [107]... [Pg.600]

We assumed that the observations along a single row in the descriptor matrix are independent (different compounds) but that some values within a column vector are... [Pg.88]

If we do this for all components of our descriptors, we will find the first principal component corresponding to the linear combination of the data rows in the descriptor matrix that shows the largest variation in the data. The next principal component is another linear combination of data rows that accounts for the next largest variation in the data set and so forth. The procedure for performing a principal component analysis is described in the following sections. [Pg.89]
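The defining property of the first principal component, that no other direction carries more variance, can be sketched like this (the data and the diagonal stretching are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# invented data with one strongly dominant direction of variation
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)           # covariance matrix of the descriptors
evals, evecs = np.linalg.eigh(C)
pc1 = evecs[:, -1]                     # eigenvector of the largest eigenvalue
var_pc1 = (Xc @ pc1).var(ddof=1)       # variance along the first PC
```

Projecting the data onto any other unit direction gives at most this variance, which is exactly the maximization property described above.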

To make the principal components comparable, we center the data to find the first principal component as a direction from the origin of our coordinate system (Figure 4.4). We achieve this by calculating the mean of the variations in distance in each row to get m mean values (one for each descriptor), and subtract them from the descriptor matrix to get... [Pg.89]

PCA and SVD are strongly related: whereas SVD provides a factorization of a descriptor matrix, PCA provides a nearly parallel factoring through the analysis of eigenvalues of the covariance matrix. [Pg.93]

Singular value decomposition is a valuable approach to matrix analysis. Compared to the original matrix, the SVD of a descriptor matrix reveals its geometric aspects and is more robust to numerical errors. [Pg.93]
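One concrete geometric aspect the SVD reveals is optimal low-rank approximation (the Eckart-Young property): truncating the decomposition gives the closest rank-k matrix, with an error read off directly from the discarded singular values. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 6))           # stand-in descriptor matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]      # best rank-2 approximation of A

# Frobenius error equals the root sum of squares of the dropped singular values
err = np.linalg.norm(A - A_k)
```

This is one sense in which the SVD exposes the geometry of the matrix and degrades gracefully under numerical noise: small singular values carry little of the structure and can be discarded.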

Common chemometric tools may be applied to deal with similarity matrices. In particular, partial least squares (PLS) [73,74] stands as an ideal technique for obtaining a generalized regression that models the association between the matrices X (descriptors) and Y (responses). In computational chemistry, its main use is to model the relationship between computed variables, which together characterize the structural variation of a set of N compounds, and any property of interest measured on those N substances [75-77]. This variation of the molecular skeleton is condensed into the matrix X, whereas the analyzed properties are recorded in Y. In PLS, the matrix X is commonly built up from nonindependent data, as it usually has more columns than rows; hence it is not called the independent matrix, but the predictor or descriptor matrix. A good review, as well as its practical application in QSAR, is found in Ref. 78 and a detailed tutorial in Ref. 79. [Pg.372]
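As a sketch of the PLS idea, here is a compact NIPALS-style PLS1 for a single response (the data are invented; with an exactly linear response and all components retained, the fit reproduces the response):

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal NIPALS-style PLS1 (single response); returns scores T and y-loadings q."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xr, yr = X.copy(), y.copy()
    T, Q = [], []
    for _ in range(n_components):
        w = Xr.T @ yr                      # weight vector from X-y covariance
        nw = np.linalg.norm(w)
        if nw < 1e-12:                     # response fully explained
            break
        w /= nw
        t = Xr @ w                         # scores (latent variable)
        tt = t @ t
        p = Xr.T @ t / tt                  # X-loadings
        q = (yr @ t) / tt                  # y-loading
        Xr = Xr - np.outer(t, p)           # deflate X
        yr = yr - t * q                    # deflate y
        T.append(t)
        Q.append(q)
    return np.column_stack(T), np.array(Q)

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # exactly linear response
T, Q = pls1(X, y, 5)
y_hat = y.mean() + T @ Q
```

Each latent variable is a direction in X chosen for its covariance with y, which is what distinguishes PLS from regression on plain principal components.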

In some cases it is possible to go from more computationally demanding descriptors to more rapidly computed ones while preserving the information content from one descriptor matrix to the other. [Pg.1018]

PPs have been applied for describing amino acids in peptides [9,17], aromatic substituents in general organic series [20-22] and heteroaromatic systems [23]. When the design affects many chemical elements simultaneously, e.g. amino acids in a peptide sequence or substituents in a polysubstituted organic skeleton, each element can be described by a block of PPs. The blocks of PPs are then collected in a descriptor matrix. This is the matrix that will be analyzed by PLS in the last step in order to find out its relationship with the y-vector, or the Y-matrix, measuring the biological activity. [Pg.22]
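A sketch of assembling such a block descriptor matrix for dipeptides (the three PP values per amino acid below are invented z-scale-like numbers, not taken from any published table):

```python
# invented principal-property (PP) blocks, three values per amino acid
pp = {
    "Ala": [0.07, -1.73, 0.09],
    "Gly": [2.23, -5.36, 0.30],
    "Leu": [-4.19, -1.03, -0.98],
}

# each dipeptide row is the concatenation of the PP block of each position
peptides = [("Ala", "Gly"), ("Leu", "Ala"), ("Gly", "Leu")]
X = [[value for residue in pep for value in pp[residue]] for pep in peptides]
```

Each row of X then has one block of PPs per varied position, and this matrix is what PLS relates to the activity vector.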

3D QSAR methods are based on the underlying mechanistic assumptions that the modelled compounds bind in a similar mode and with a similar bioactive conformation. Moreover, the underlying assumptions on molecular description are the congruency of the descriptor matrix... [Pg.413]

Thirty-five compounds provided exact numerical I50 values which could be used in the computations. There are 15 different substituents appearing in five possible substituent positions (numbered I-V in Structure A) in these compounds. In a program written in BASIC, a descriptor matrix was created by copying the appropriate substituent parameters into the corresponding substitution-position compartments of the descriptor matrix if the actual substituent is present in the molecule in the actual position. From the Hammett constants, the relation σm = σp was assumed (31), thus σp was applied for positions I, III and V, and σm for positions II and IV. Therefore, 10 parameters were used to enter the descriptor matrix for each substituent position. Two additional parameters, the total lipophilicity [Σπ] and its square value [(Σπ)²], were also calculated for each compound and were added to the data set. [Pg.173]

PCA is not only used as a method on its own but also as part of other mathematical techniques such as SIMCA classification (see section on parametric classification methods), principal component regression analysis (PCRA) and partial least-squares modelling with latent variables (PLS). Instead of original descriptor variables (x-variables), PCs extracted from a matrix of x-variables (descriptor matrix X) are used in PCRA and PLS as independent variables in a regression model. These PCs are called latent variables in this context. [Pg.61]
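A sketch of PCR in this spirit, with invented low-rank data: the PCs extracted from the descriptor matrix serve as the latent-variable regressors.

```python
import numpy as np

rng = np.random.default_rng(7)
latent = rng.normal(size=(30, 2))                      # two underlying factors
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(30, 6))
y = latent @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=30)

# extract PCs from the descriptor matrix and regress on the scores
Xc = X - X.mean(axis=0)
yc = y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
T = Xc @ Vt[:k].T                                      # scores = latent variables
b = np.linalg.lstsq(T, yc, rcond=None)[0]
y_hat = y.mean() + T @ b
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

With the two PCs capturing the underlying variation, the regression on the scores explains most of the response while using far fewer regressors than original x-variables.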

