Descriptor variable

If a large number of descriptor variables are utilized, reduce the number by variable elimination before modeling. [Pg.474]

To demonstrate the use of binary substructure descriptors and Tanimoto indices for cluster analysis of chemical structures, we consider the 20 standard amino acids (Figure 6.3) and characterize each molecular structure by eight binary variables describing the presence or absence of eight substructures (Figure 6.4). Note that in most practical applications (for instance, evaluation of results from searches in structure databases) more diverse molecular structures have to be handled, and usually several hundred different substructures are considered. Table 6.1 contains the binary substructure descriptors (variables), with value 0 if the substructure is absent and 1 if the substructure is present in the amino acid; these numbers form the A-matrix. Binary substructure descriptors have been calculated by the software SubMat (Scsibrany and Varmuza 2004), which requires as input the molecular structures in one file and the substructures in another file; all structures are in Molfile format (Gasteiger and Engel 2003). The output is an ASCII file with the binary descriptors. [Pg.270]
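As a minimal sketch of the similarity measure named above, the Tanimoto index of two binary descriptor vectors is the number of substructures present in both molecules divided by the number present in at least one. The two 8-bit vectors below are hypothetical and are not rows of Table 6.1; the function name `tanimoto` and the convention for two all-zero vectors are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto index of two equal-length binary vectors (sequences of 0/1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # present in both molecules
    either = sum(1 for x, y in zip(a, b) if x or y)   # present in at least one
    # Convention (an assumption): two all-zero vectors are treated as identical.
    return 1.0 if either == 0 else both / either


# Hypothetical descriptor rows, one bit per substructure.
mol_a = [1, 1, 0, 0, 1, 0, 0, 1]
mol_b = [1, 0, 0, 0, 1, 0, 1, 1]

similarity = tanimoto(mol_a, mol_b)   # 3 shared bits out of 5 set in either
distance = 1.0 - similarity           # a common choice of distance for clustering
```

Using `1 - similarity` as the distance passed to a clustering routine is a common choice, not necessarily the one used in the source.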

The measure of chemical diversity of a set of compounds clearly depends on the descriptor variables chosen to characterize their chemical structures. Similarly, the utility of a structure-activity analysis will depend on how well the descriptor variables capture the important chemical features. [Pg.302]

One method of choosing a diverse set of compounds from a molecular database is to first cluster them (see, for example ref. 12) and then select one molecule from each cluster. Clustering ideally groups the molecules into well-separated, compact groups with respect to the descriptor variables. If this is the case, then each cluster or group can be represented by one of its members. This method of choosing a diverse set of objects goes back at least to Zemroch (13) in the statistics literature and has been widely used for chemical databases (14,15). [Pg.303]
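The cluster-then-select idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the cited references: clusters are formed by simple single-linkage grouping with a hypothetical distance `threshold`, and the representative of each cluster is taken to be its medoid (the member with the smallest total distance to the other members). The descriptor values are invented.

```python
import math


def cluster_representatives(points, threshold):
    """Group points closer than `threshold` (single linkage) and
    return the index of the medoid of each cluster."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    reps = []
    for members in clusters.values():
        # Medoid: member with the smallest summed distance to its cluster.
        medoid = min(
            members,
            key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
        )
        reps.append(medoid)
    return sorted(reps)


# Hypothetical two-descriptor data: three well-separated groups.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
               (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
selected = cluster_representatives(descriptors, threshold=1.0)
```

With well-separated, compact groups, as the text assumes, the choice of representative within a cluster matters little; the medoid is just one reasonable option.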

As it is essentially impossible to cover a high-dimensional space finely with a modest number of compounds, Lam and co-workers proposed a cell-based method that uniformly covers all low-dimensional subspaces formed by subsets of descriptors (19,20). Typically, they would consider all one-dimensional (1D), 2D, and 3D subspaces. In addition to practical feasibility, this is consistent with Pearlman and Smith's notion of a relevant subspace (21): a particular activity mechanism will likely involve only a few relevant descriptor variables. [Pg.304]

The method that satisfies these conditions is partial least squares (PLS) regression analysis, a relatively recent statistical technique (18, 19). The basis of the PLS method is that given k objects, characterised by i descriptor variables, which form the X-matrix, and j response variables, which form the Y-matrix, it is possible to relate the two blocks (or data matrices) by means of the respective latent variables u and t in such a way that the two data sets are linearly dependent ... [Pg.103]

For redox reactions of a series of closely related compounds, redox potentials and rate constants often correlate to descriptor variables that reflect the electron donor or electron acceptor properties of P. Such correlations can be used to derive quantitative structure-activity relationships (QSARs), and these QSARs provide the basis for predicting properties of environmental contaminants that have not previously been measured (173). [Pg.428]

Commonly used descriptor variables for QSARs involving redox reactions include substituent constants (σ), ionization potential, electron affinity, energy of the highest occupied molecular orbital (E_HOMO) or lowest unoccupied molecular orbital (E_LUMO), one-electron reduction or oxidation potential (E1), and half-wave potential (E1/2). One descriptor variable (D), fit to a log-linear model, is usually sufficient to describe a redox property of P. Such a QSAR will have the form... [Pg.428]

Descriptor variable for QSAR; subscripts i distinguish congeners. Potential under non-standard conditions... [Pg.430]

Regression on principal components (PCR) is another form of regression modeling that may be used for continuous response data. Here, the independent variables (the x set) are computed from the descriptor variables using PCA as shown in Equation 7.1. These are the principal component scores and they have several advantages ... [Pg.173]

There is a smaller number of PC scores than the original descriptor variables, sometimes a much smaller number. [Pg.173]

The prioritization of QSAR models for validation is likely to take account of regulatory needs. The selection of QSARs could also take into account the mechanistic basis of the QSAR. In the case of QSARs, this would involve a qualitative assessment of the relevance of descriptor variables to the endpoint being modeled. A strong mechanistic basis could be regarded as a desirable criterion, rather than an essential criterion, since mechanistic relevance cannot always be established; this is especially the case since the fundamental processes underlying the expression of biological endpoints are often unknown. [Pg.433]

This quantitative description has a number of requirements to be suitable for CombC. It has to be fast to derive or calculate. It has to be relevant, i.e. capture the essential structural properties that influence the biological activities of interest. The structure descriptor variables have to be chemically interpretable. If possible, the description should be reversible, i.e. it should also be possible to translate backwards from descriptor values to structure. The description should be consistent with similarity/dissimilarity, so that compounds that are closely similar have very similar values for the descriptors, and dissimilar compounds have widely different values for at least most of the descriptors. [Pg.202]

Figure 4 shows the distribution of the ketones in the two-dimensional score space (t1, t2), resulting from the principal component analysis (PCA) of the table of 78 ketones described by the 11 structure descriptor variables derived from IR and NMR spectra and other properties such as density, molecular weight and so on [31]. The figure also shows 9 compounds selected by a D-optimal design to well span this score space. Figure 5 shows the same score space but with another selection of 12 compounds, claimed to be superior. [Pg.206]

The first step is to compute the average of each descriptor variable. This yields the average vector, $\bar{\mathbf{x}} = [\bar{x}_1\ \bar{x}_2\ \dots\ \bar{x}_k]$, which gives the average matrix, $\bar{\mathbf{X}}$, after multiplication by the $(n \times 1)$ vector $\mathbf{1} = [1\ 1\ \dots\ 1]^T$. ... [Pg.37]
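A small numerical sketch of this first step, using a hypothetical data matrix with n = 3 objects and k = 2 descriptor variables. The average matrix has n identical rows, each equal to the average vector. The final centering line is an assumption about the usual next step, since the excerpt breaks off before stating it.

```python
def column_means(X):
    """Average of each descriptor variable (each column of X)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


# Hypothetical data: 3 objects (rows) described by 2 descriptors (columns).
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]

x_bar = column_means(X)             # average vector, one entry per descriptor
X_bar = [list(x_bar) for _ in X]    # product of the (n x 1) ones vector and x_bar

# Assumed next step (not stated in the excerpt): subtract the averages.
X_centered = [[v - m for v, m in zip(row, x_bar)] for row in X]
```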

PLS is a modelling and computational method for establishing quantitative relations between blocks of variables. Such blocks may, for instance, comprise a block of descriptor variables of a set of test systems (X block) and a block of measured responses obtained with these systems (Y block). A quantitative relation between these blocks will make it possible to enter data, x, for new systems and make predictions of the expected responses, y, for these systems. [Pg.52]

A stepwise selection procedure is performed to search for QSPR/QSAR models after the preliminary exclusion of constant and near-constant variables. The pair correlation cutoff selection of variables is then performed to avoid highly correlated descriptor variables within the model. [Pg.75]
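The two preliminary filters can be sketched as below. This is an illustration only: the thresholds `min_var` and `max_abs_r`, the descriptor names, and the rule of keeping whichever of two correlated descriptors comes first are all assumptions, and the stepwise model search itself is not shown.

```python
import math


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)


def pearson(a, b):
    """Pearson correlation coefficient of two non-constant columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm


def filter_descriptors(columns, min_var=1e-8, max_abs_r=0.95):
    """columns: dict mapping descriptor name -> list of values.
    Returns the names that survive both filters, in input order."""
    kept = []
    for name, col in columns.items():
        if variance(col) <= min_var:
            continue  # constant or near-constant variable
        if any(abs(pearson(col, columns[k])) > max_abs_r for k in kept):
            continue  # too strongly correlated with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds.
columns = {
    "mw": [100.0, 120.0, 140.0, 160.0],
    "mw_x2": [200.0, 240.0, 280.0, 320.0],   # perfectly correlated with "mw"
    "flag": [1.0, 1.0, 1.0, 1.0],            # constant
    "logp": [1.2, 0.3, 2.5, 0.9],
}
kept = filter_descriptors(columns)
```

Which member of a correlated pair to retain is a design choice; other schemes keep the descriptor that correlates better with the response.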

The general problem of excluding variables from data, i.e. of estimating the best I vector, can be divided into two main blocks: methods for variable reduction and methods for variable selection. The first group of methods evaluates the variable exclusion by inner relationships among the p descriptor variables, i.e. ... [Pg.295]

Description: A QSAR toolkit with descriptor generation (topological, geometrical, electronic, and physicochemical descriptors), variable selection, regression and artificial neural network modelling. [Pg.521]

It is necessary to resist the temptation to analyze tables like the one below by considering one variable at a time. The risk of finding spurious correlations between the variables increases rapidly with an increasing number of variables. If we wish to be sure at a probability level of 95 % that a descriptor variable is correlated to a chemical phenomenon, this also means that we accept a risk of 5 % that this variable is by pure chance correlated to the phenomenon. This risk with k variables will be... [Pg.340]
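To illustrate how quickly this risk grows, the sketch below uses the standard expression for the probability of at least one chance correlation among k variables, 1 - (1 - α)^k, which assumes the k tests are independent. The excerpt is cut off before its own formula, so this expression is an assumption rather than a quotation of the source; the function name is hypothetical.

```python
def chance_correlation_risk(k, alpha=0.05):
    """Probability that at least one of k independent variables appears
    correlated purely by chance, each tested at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k


# Risk for an increasing number of descriptor variables.
for k in (1, 5, 10, 20, 50):
    print(k, round(chance_correlation_risk(k), 3))
```

Under this assumption the risk already exceeds 50 % with 14 variables.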

Big Chemical Encyclopedia