Descriptor reduction

The principal reason that a test set is necessary for validation is that empirical model-building methods cannot readily distinguish between noise and information in data sets, so the methods are prone to adjusting the model parameters to reduce error beyond the point warranted by the information contained in the data. This problem is called overtraining and can be countered by a variety of techniques such as descriptor reduction and early stopping. Readers interested in those topics are referred to the more detailed reviews of numerical methods cited in each of the following sections. [Pg.366]

The multiple linear regression models are validated using standard statistical techniques. These techniques include inspection of residual plots, standard deviation, and multiple correlation coefficient. Both regression and computational neural network models are validated using external prediction. The prediction set is not used for descriptor selection, descriptor reduction, or model development, and it therefore represents a true unknown data set. In order to ascertain the predictive power of a model the rms error is computed for the prediction set. [Pg.113]
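
The external prediction step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the single-descriptor least-squares fit, the function names `fit_line` and `rms_error`, and all data values are assumptions made for the example. The point it shows is that the prediction set enters only at the final rms error calculation.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one descriptor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rms_error(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical training set: used for model development only.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.0, 4.0, 6.0, 8.0]

# Hypothetical prediction set: never touched during fitting.
pred_x = [5.0, 6.0]
pred_y = [10.5, 11.5]

intercept, slope = fit_line(train_x, train_y)
predicted = [intercept + slope * xi for xi in pred_x]
prediction_rmse = rms_error(pred_y, predicted)
print(f"prediction-set rms error: {prediction_rmse:.3f}")
```

A prediction-set rms error that is much larger than the training-set error is one common sign of overtraining.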

More typically the process of building up the QSAR models requires more complex chemical information. For a set of compounds with known property values, the descriptors are calculated. The process of model building proceeds through a reduction of the molecular descriptors, in order to identify the most important ones. Then, using these selected chemical descriptors and a suitable algorithm, the model is developed. Finally, the model so obtained has to be validated. [Pg.83]
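
One simple way to carry out the descriptor reduction step is an objective filter that discards near-constant descriptors and one member of each highly correlated pair. The sketch below is an assumption for illustration only; the excerpt does not say which reduction method is used. The function name `reduce_descriptors`, both thresholds, and the descriptor table are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def reduce_descriptors(table, min_variance=1e-8, max_abs_r=0.95):
    """Return the names of descriptors that survive both filters.

    table maps a descriptor name to its values over all compounds.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for name, values in table.items():
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        if variance <= min_variance:
            continue  # near-constant: carries no information
        if any(abs(pearson(values, table[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table for four compounds.
descriptors = {
    "mol_weight": [100.0, 150.0, 200.0, 250.0],
    "mol_weight_x2": [200.0, 300.0, 400.0, 500.0],  # duplicates mol_weight
    "ring_count": [1.0, 1.0, 1.0, 1.0],             # constant
    "log_p": [1.2, 0.4, 2.9, 1.1],
}
selected = reduce_descriptors(descriptors)
print(selected)
```

The filter keeps the first descriptor of each correlated pair it encounters, so the result depends on the order of the table. In practice a chemically more interpretable descriptor is often preferred.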

The relationship between the herbicidal activity of 1,2,5-oxadiazole N-oxides and some physicochemical properties potentially related to this bioactivity, such as polarity, molecular volume, proton acceptor ability, lipophilicity, and reduction potential, was studied. The semi-empirical MO method AM1 was used to calculate theoretical descriptors such as dipole moment, molecular volume, Mulliken's charge, and the octanol/water partition coefficients (log Po/w) <2005MOL1197>. [Pg.319]

The reduction of latent variables is an effective method to reduce the number of possible models, yet in PLS, variable reduction is not needed. The reduction of the number of variables in traditional regression techniques will lead to models with improved predictive ability and, in the case of PLS, a model that is easier to understand. Attempts to reduce the number of variables for PLS have only resulted in simpler models that fit the Training Set better yet do not have the predictive abilities of the complete PLS model (111). The reduction of latent variables with respect to the descriptors is possible with no apparent decrease in the model's ability to predict bioactivities, yet the remaining descriptor-based variables are considered to be more important before reduction, which introduces bias (111). [Pg.175]

Key Words: Biological activity; chemical features; chemical space; cluster analysis; compound databases; dimension reduction; molecular descriptors; molecule classification; partitioning algorithms; partitioning in low-dimensional spaces; principal component analysis; visualization. [Pg.279]

How is dimension reduction of chemical spaces achieved? There are a number of different concepts and mathematical procedures to reduce the dimensionality of descriptor spaces with respect to a molecular dataset under investigation. These techniques include, for example, linear mapping, multidimensional scaling, factor analysis, or principal component analysis (PCA), as reviewed in ref. 8. Essentially, these techniques either try to identify those descriptors among the initially chosen ones that are most important to capture the chemical information encoded in a molecular dataset or, alternatively, attempt to construct new variables from original descriptor contributions. A representative example will be discussed below in more detail. [Pg.282]
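
The second strategy, constructing new variables from the original descriptors, can be illustrated with a bare-bones PCA. The sketch below extracts only the first principal component, using power iteration on the covariance matrix. It is an illustrative assumption rather than the method of the cited work: the function names and data are hypothetical, and a production analysis would normally rely on an established linear algebra library.

```python
import math

def center(rows):
    """Subtract the column mean from every descriptor value."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first principal component.

    Assumes the data are not constant and that the all-ones start vector
    is not orthogonal to the dominant eigenvector.
    """
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each molecule onto the new variable.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores

# Hypothetical data: four molecules, two perfectly correlated descriptors.
molecules = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading, scores = first_principal_component(molecules)
print(loading, scores)
```

Because the two descriptors here are perfectly correlated, a single new variable captures all of the variance, so the two-dimensional descriptor space collapses to one dimension without loss.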

Despite the conceptual elegance of partitioning in low-dimensional descriptor spaces, dimensional reduction is not essential for effective partitioning, as has been shown, for example, by application of statistical partitioning methods (4). [Pg.287]

In contrast to partitioning methods that involve dimension reduction of chemical reference spaces, MP is best understood as a direct space method. However, n-dimensional descriptor space is simplified here by transforming property descriptors with continuous or discrete value ranges into a binary classification scheme. Essentially, this binary space transformation assigns less complex n-dimensional vectors to test molecules, with each dimension having unit length and taking a value of either 0 or 1. Thus, although MP analysis proceeds in n-dimensional descriptor space, its dimensions are scaled and its complexity is reduced. [Pg.295]
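
A binary space transformation of this kind can be sketched as below. The excerpt does not state where each descriptor is split, so thresholding at the per-descriptor median is an assumption made for this example. The function name `binarize_descriptors` and the data are likewise hypothetical.

```python
from statistics import median

def binarize_descriptors(rows):
    """Map each descriptor value to 1 if above that descriptor's median, else 0.

    The median threshold is an assumption of this sketch.
    """
    d = len(rows[0])
    thresholds = [median(r[j] for r in rows) for j in range(d)]
    return [[1 if r[j] > thresholds[j] else 0 for j in range(d)] for r in rows]

# Hypothetical data: four molecules described by two continuous descriptors.
molecules = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
binary = binarize_descriptors(molecules)

# Molecules that share a bit pattern fall into the same partition.
partitions = {}
for index, bits in enumerate(binary):
    partitions.setdefault(tuple(bits), []).append(index)

print(binary)
print(partitions)
```

The number of dimensions is unchanged, but each molecule is now described by a bit vector, so at most 2^n distinct partitions are possible for n descriptors.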

This example is also intended as a warning. Even experts forget when preparing tables of results, e.g., of enantioselective reductions, that such "descriptor inversions" occur. [Pg.22]

This chapter provides a brief overview of chemoinformatics and its applications to chemical library design. It is meant to be a quick starter and to serve as an invitation to readers for more in-depth exploration of the field. The topics covered in this chapter are chemical representation, chemical data and data mining, molecular descriptors, chemical space and dimension reduction, quantitative structure-activity relationship, similarity, diversity, and multiobjective optimization. [Pg.27]

In this chapter, we will give a brief introduction to the basic concepts of chemoinformatics and their relevance to chemical library design. In Section 2, we will describe chemical representation, molecular data, and molecular data mining in the computer; we will introduce some of the chemoinformatics concepts such as molecular descriptors, chemical space, dimension reduction, similarity, and diversity; and we will review the most useful methods and applications of chemoinformatics: the quantitative structure-activity relationship (QSAR), the quantitative structure-property relationship (QSPR), multiobjective optimization, and virtual screening. In Section 3, we will outline some of the elements of library design and connect chemoinformatics tools, such as molecular similarity, molecular diversity, and multiple objective optimizations, with designing optimal libraries. Finally, we will put library design into perspective in Section 4. [Pg.28]
