Training Set Composition

One of the most important concerns regarding QSAR modeling is the number of available compounds. For any 3D-QSAR paradigm to be applied, at least 20 compounds (preferably more) should be individually characterized (i.e., synthesized, purified, and tested for the target property). These compounds constitute the training set. Whereas a smaller number of compounds may be used to derive a QSAR, results from such sets are regarded with less confidence. This constitutes a serious limitation for the use of QSAR techniques in the very early stages of a project, when limited or no information is available about active compounds. [Pg.140]

The outcome of any 3D-QSAR model ultimately depends on the data set. Computational chemists are often tempted to blame the failure of their modeling efforts on the quality of the biological data supplied. In certain cases, this is justified. Quality and choice of biological data to be modeled by QSAR, briefly discussed in the preceding section, are critical to the successful development of any model. The target property is often represented as an exact number for statistical QSAR analyses. Flowever, intra- and interexperimental variations occur in most pharmacological assays, as well as variations due to procedural modifications. Such problems are difficult to avoid, and therefore contextual judgment needs to be applied on an individual basis. [Pg.140]

Whereas CoMFA and other 3D-QSAR techniques have been used to model a variety of biological and physical properties of chemicals, by far the [Pg.140]

Utilizing this complex methodology, we obtain decision rules and results of activity prediction that are context-independent from the training set composition, the methods of compound structure representation, and the methods of regularity detection. [Pg.379]

In practice, of course, there can be many more than two classes within the pattern cloud. A striking example of this is afforded by the British bumblebee fauna as illustrated in Figure 7.12. With appropriate training set composition, DAISY is able to deal effectively with situations where the pattern clusters are non-convex or even situations where a single class maps to multiple clusters in morphospace (Figure 7.13). [Pg.108]

The data in the training set are used to derive the calibration which we use on the spectra of unknown samples (i.e. samples of unknown composition) to predict the concentrations in those samples. In order for the calibration to be valid, the data in the training set which is used to find the calibration must meet certain requirements. Basically, the training set must contain data which, as a group, are representative, in all ways, of the unknown samples on which the analysis will be used. A statistician would express this requirement by saying, "The training set must be a statistically valid sample of the population... [Pg.13]

The predictive performance of majority of the log BB models developed so far is 0.4 log units, despite the great diversity of molecular descriptors employed and the variations in the composition of the training sets. This seems like a large error in comparison to the range of log BB determined by experiment ( -2 to +1.5, i.e. 3.5 log units). However, it should be remembered that the experimental error in log BB measurements can be around 0.3 log units, and so this value provides a limit to the accuracy of in silico methods. [Pg.544]

The PLS-2 technique is a typical full spectmm method where the data are fitted to many data points, thereby improving the precision and requires a carefully experimental design of the Standard composition of the calibration set order the provide good predictions. In this study training set of 27 representative ternary mixtures was constmcted and the absorption spectra were recorded. In Table 33.1, the compositions of the ternary mixtures employed are summarized. [Pg.309]

This strategy of integrating neural networks with genetic algorithms has been used to search for the optimal composition of a catalyst for the ammoxidation of propane [62]. In that case, no experiments were performed the network was trained with data published earlier by other authors [63]. However, those data were for only 26 catalysts, thus forming a quite small training set. Even more importantly, the predicted performance of the optimal catalyst, expressed by means of acrylonitrile yield, was not experimentally verified. [Pg.167]

HY) as possible features. Two additional constraints were set (i) because of the molecular flexibility and functional complexity of the training set, only pharmacophores containing five features should be considered and (ii) the program was forced to include a positive ionizable feature in the composition of hypotheses, on the basis of the literature reporting a basic atom (usually a nitrogen) as a critical structural determinant for arAR antagonistic activity. [Pg.257]

Although rather similar in their composition, hypotheses 1 and 2 would probably show different performances depending on their use. The preliminary statistics are clearly in favor of hypothesis 2 (see Fig. 15.11, top), making this model a good candidate for activity prediction. A more thorough validation would be required if it were to be used for this purpose. In particular, further assessment with compounds external to the training set would be necessary. [Pg.358]

In the ideal case, a QSAR model should be developed in four stages data preparation, model generation, model validation, and assessment of the applicability domain [85]. Data preparation, i.e. a careful composition of activity data and molecular descriptors in a training set, and the establishment of a statistically significant relationship between the biological activity and the molecular descriptors (usually characterized by a correlation coefficient, are part of almost each QSAR study. [Pg.67]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...