Predictions in QSAR

In this chapter, we briefly review the history of QSAR, the types of descriptors, the key methodologies and various aspects of model building and prediction in QSAR. A few examples are given to exemplify appropriate use of the available tools. [Pg.492]

Pal, D.K., Sengupta, C. and De, A.U. (1989) Introduction of a novel topochemical index and exploitation of group connectivity concept to achieve predictability in QSAR and RDD. Indian J. Chem., 28, 261-267. [Pg.1135]

Neural networks are commonly used for activity prediction in QSAR. So and Karplus have used a genetic algorithm for selection of molecular descriptors. These are then used as input to a neural net which predicts activity. An alternative approach is suggested by Kyngäs and Valjakka, who encode the neural net size and descriptors within a GA. The performance of the encoded network is then used as the GA fitness function. [Pg.1133]

The field points must then be fitted to predict the activity. There are generally far more field points than known compound activities to be fitted. The least-squares algorithms used in QSAR studies do not function for such an underdetermined system. A partial least squares (PLS) algorithm is used for this type of fitting. This method starts with matrices of field data and activity data. These matrices are then used to derive two new matrices containing a description of the system and the residual noise in the data. Earlier studies used a similar technique, called principal component analysis (PCA). PLS is generally considered to be superior. [Pg.248]
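The fitting step described above can be illustrated with a minimal sketch of single-response PLS (PLS1, NIPALS-style deflation) in plain Python. It is not the CoMFA implementation of any particular package. The function names `pls1_fit` and `pls1_predict`, the tolerance, and the toy data in the usage note are assumptions made for illustration only.

```python
import math

def pls1_fit(X, y, n_components=1, tol=1e-12):
    """Fit a PLS1 model. X: n rows of p field values, y: n activities (p may exceed n)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Work on mean-centred copies; E and f are deflated after each component.
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    f = [yi - y_mean for yi in y]
    components = []
    for _ in range(n_components):
        # Weight vector: direction in field space with maximal covariance with activity.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm < tol:  # nothing left to explain
            break
        w = [wj / norm for wj in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(ti * ti for ti in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt  # activity loading
        # Deflate: remove the part of X and y described by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict the activity of one compound from its field values x."""
    x_mean, y_mean, components = model
    e = [xj - xm for xj, xm in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return y_hat
```

For example, with four compounds and six field points (more columns than rows, so ordinary least squares has no unique solution), `pls1_fit(X, y, n_components=1)` still returns a usable model because the fit is carried out on a small number of latent components rather than on the individual field points.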

In QSAR equations, n is the number of data points, r is the correlation coefficient between observed values of the dependent variable and the values predicted from the equation, r² is the square of the correlation coefficient and represents the goodness of fit, q² is the cross-validated r² (a measure of the quality of the QSAR model), and s is the standard deviation. The cross-validated r² (q²) is obtained by using the leave-one-out (LOO) procedure [33]. Q is the quality factor (quality ratio), where Q = r/s. Chance correlation, due to the excessive number of parameters (which increases the r and s values also), can ... [Pg.47]
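As a rough illustration of the statistics defined above, the sketch below computes them for a one-descriptor linear model in plain Python. The function names are hypothetical. Two conventions are assumed rather than taken from the source: s is computed with n - 2 degrees of freedom, and q² is taken as 1 - PRESS/SS_tot with SS_tot measured about the mean of all observations.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single descriptor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def qsar_statistics(x, y):
    """Return n, r, r2, q2 (leave-one-out), s and Q = r/s for lists x and y."""
    n = len(x)
    a, b = fit_line(x, y)
    pred = [a + b * xi for xi in x]
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    r2 = max(0.0, 1.0 - ss_res / ss_tot)  # goodness of fit
    r = math.sqrt(r2)                     # correlation of observed vs. predicted
    s = math.sqrt(ss_res / (n - 2))       # assumed: two fitted parameters
    # Leave-one-out: refit without compound i, then predict compound i.
    press = 0.0
    for i in range(n):
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a_i + b_i * x[i])) ** 2
    q2 = 1.0 - press / ss_tot
    return {"n": n, "r": r, "r2": r2, "q2": q2, "s": s, "Q": r / s}
```

Calling `qsar_statistics([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1])` returns a dictionary in which q² is smaller than r², as expected for a least-squares fit with nonzero residuals. A perfectly fitted data set gives s = 0 and therefore no defined Q.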

Descriptors used to characterize molecules in QSAR studies should be as independent of each other (orthogonal) as possible. When using correlated parameters there is an increased danger of obtaining non-predictive, chance correlation [56]. To examine the correlation between PSA (calculated according to the fragment-based protocol [10]) and other descriptors, we studied a collection of 7010 bioactive molecules from the PubChem database [57]. In addition to PSA, the following parameters were used ... [Pg.121]

Partial Least Squares (PLS) regression (Section 35.7) is one of the more recent advances in QSAR which has led to the now widely accepted method of Comparative Molecular Field Analysis (CoMFA). This method makes use of local physicochemical properties such as charge, potential and steric fields that can be determined on a three-dimensional grid that is laid over the chemical structures. The determination of steric conformation, by means of X-ray crystallography or NMR spectroscopy, and the quantum mechanical calculation of charge and potential fields are now performed routinely on medium-sized molecules [10]. Modern optimization and prediction techniques such as neural networks (Chapter 44) also have found their way into QSAR. [Pg.385]

While principal components models are used mostly in an unsupervised or exploratory mode, models based on canonical variates are often applied in a supervisory way for the prediction of biological activities from chemical, physicochemical or other biological parameters. In this section we discuss briefly the methods of linear discriminant analysis (LDA) and canonical correlation analysis (CCA). Although there has been an early awareness of these methods in QSAR [7,50], they have not been widely accepted. More recently they have been superseded by the successful introduction of partial least squares analysis (PLS) in QSAR. Nevertheless, the early pattern recognition techniques have prepared the minds for the introduction of modern chemometric approaches. [Pg.408]

In a first attempt to derive characterization factors with QSARs, the entire dataset of plastics additives was included, and aquatic ecotoxicity was predicted for two different trophic levels. This generated characterization factors that did not correspond well with the ones derived from experimental data [30]. This is hardly surprising, but it is a clear indication that two trophic levels are insufficient. A second attempt to derive characterization factors with QSARs is currently being performed [31]. In this second attempt, substances that are difficult to model in QSAR models have been removed from the dataset and the ecotoxicity has been predicted for three different trophic levels instead of two. However, results have not yet been obtained from this second attempt. If the results show that it is possible to derive reliable characterization factors by the use of QSARs, the current data gap regarding characterization factors for human toxicity and ecotoxicity could be... [Pg.16]

Keywords Alternatives to animal testing, Computational toxicology, In silico, In vitro, Predictive models, QSAR models, Regulation... [Pg.74]

Results in Table 31.3 indicate that the combination of TS and TC descriptors resulted in a highly predictive RR model (= 0.895); the addition of three-dimensional and QC indices to the set of independent variables did not result in significant improvement in model quality. It may be noted that we have observed such results for various other physicochemical and biological properties including mutagenicity [25,54], boiling point [55], blood air partition coefficient [37], tissue air partition coefficient [46], etc. [24,30,45,56]. Only in limited cases, e.g., halocarbon toxicity [12], did the addition of QC indices after TS and TC parameters result in significant improvement in QSAR model quality. [Pg.488]

The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. But it is noteworthy that this has exacerbated another problem: rank deficiency. This occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, which are popularly used when there is a rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models would fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step as opposed to outside of the model validation step, thus providing an honest measure of the predictive ability of the model, i.e., the true q² [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q², which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]
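The difference between a naive and an honest q² can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure from the cited studies: the "subset selection" here is reduced to picking the single descriptor with the highest squared correlation with activity, the data are pure random noise, and all function names and parameter values are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def select_descriptor(X, y, rows):
    """Toy subset selection: best single descriptor, judged on the given rows only."""
    ys = [y[i] for i in rows]
    return max(range(len(X[0])),
               key=lambda j: pearson_r2([X[i][j] for i in rows], ys))

def loo_q2(X, y, select_inside):
    """LOO q2. select_inside=True repeats the selection in every fold (honest);
    False selects once on all data before validation (naive)."""
    n = len(y)
    rows = list(range(n))
    fixed = None if select_inside else select_descriptor(X, y, rows)
    my = sum(y) / n
    ss_tot = sum((v - my) ** 2 for v in y)
    press = 0.0
    for i in rows:
        train = [k for k in rows if k != i]
        j = select_descriptor(X, y, train) if select_inside else fixed
        a, b = fit_line([X[k][j] for k in train], [y[k] for k in train])
        press += (y[i] - (a + b * X[i][j])) ** 2
    return 1.0 - press / ss_tot

def demo(n=15, p=300, seed=0):
    """Random descriptors and random activities: no real relationship exists."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return loo_q2(X, y, select_inside=False), loo_q2(X, y, select_inside=True)
```

`demo()` returns the pair (naive q², honest q²). Because the data contain no signal, any apparent predictivity in the naive value comes from letting the held-out compound influence the descriptor choice, which is the inflation the passage warns about.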

Basak, S. C., Mills, D., Hawkins, D. M., Kraker, J. J. Proper statistical modeling and validation in QSAR: A case study in the prediction of rat fat air partitioning. In Computation in Modern Science and Engineering, Proceedings of the International Conference on Computational Methods in Science and Engineering 2007 (ICCMSE 2007), Simos, T. E., Maroulis, G., Eds., American Institute of Physics, Melville, New York, 2007, pp. 548-551. [Pg.501]

QSAR are useful in the design of pesticides and medicinal drugs, and in environmental problems such as the prediction of toxicity and biodegradability. An empirical relationship can be properly used only for interpolation whereas one based solidly on well-established theory can be used at least to some extent for extrapolation as well. It seems of real importance, then, to determine the nature and significance of steric and bulk parameters in QSAR. [Pg.249]

Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK. (2004) Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comput. Sci., 44, 1912-1928. [Pg.154]

In 2008, Weaver [64] utilized PPB as an example to demonstrate the concept of "domain of applicability" in QSAR research. The PLS model was constructed using 17 1D and 2D molecular descriptors. The performance of the model was reasonable for such a large data set for PPB modeling (n = 685, q² = 0.56, RMSE = 0.55, AUE = 0.42, ntest = 210, q² = 0.58, RMSEtest = 0.54, AUEtest = 0.41). How the domain selection protocol affects the prediction performance will be discussed in Section 3. [Pg.117]

Once a QSAR model is constructed, it must be validated using the external test set. The data points in the test set should not appear in the training set. There are two approaches to improve the prediction accuracy for a given QSAR model. The first approach utilizes the concept of "the domain of applicability," which is used to estimate the uncertainty in the prediction for a particular molecule based on how similar it is to the compounds used to build the model. To make a more accurate prediction for a given molecule in the test set, the structurally similar compounds in the training set are used to construct a model, and that model is used to make the prediction. In some cases, the domain similarity is measured using molecular descriptor similarity, rather than the structural similarity. The... [Pg.120]
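A similarity-based domain-of-applicability check of the kind described above can be sketched as follows. The sketch assumes each molecule is represented as a set of fragment or fingerprint-bit identifiers and uses the Tanimoto coefficient as the similarity measure. The function names, the 0.5 threshold and the toy fragment sets are illustrative assumptions, not values from the source.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two collections of fragment identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def nearest_training_similarity(query, training):
    """Similarity of the query molecule to its closest training-set molecule."""
    return max(tanimoto(query, t) for t in training)

def in_domain(query, training, threshold=0.5):
    """Flag a prediction as inside the domain if some training molecule is
    at least `threshold` similar (the threshold is an assumed value)."""
    return nearest_training_similarity(query, training) >= threshold

def nearest_neighbors(query, training, k):
    """Indices of the k most similar training molecules, most similar first.
    These could be used to build a local model for the query."""
    order = sorted(range(len(training)),
                   key=lambda i: tanimoto(query, training[i]),
                   reverse=True)
    return order[:k]
```

With `training = [{1, 2, 3, 4}, {2, 3, 5}]`, the query `{1, 2, 3}` has a nearest-neighbor similarity of 0.75 and falls inside the domain at the assumed threshold, whereas `{7, 8, 9}` shares no fragments with any training molecule and would be flagged as outside it.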

A standard assumption in QSAR studies is that the models describing the data are linear. It is from this standpoint that transformations are performed on the bioactivities to achieve linearity before construction of the models. The assumption of linearity is made for each case based on theoretical considerations or the examination of scatter plots of experimental values plotted against each predicted value where the relationship between the data points appears to be nonlinear. The transformation of the bioactivity data may be necessary if theoretical considerations specify that the relationship between the two variables... [Pg.142]

The chapter is divided into three sections: the first part is concerned with the derivation of the 3D-LogP descriptor and the selection of suitable parameters for the computation of the MLP values. This study was performed on a set of rigid molecules in order, at least initially, to avoid the issue of conformation-dependence. In the second part, both the information content and conformational sensitivity of the 3D-LogP description were established using a set of flexible acetylated amino acids and dipeptides. This initial work was carried out using log P as the property to be estimated/predicted. However, it should be made clear that, while the 3D-LogP descriptor can be used for the prediction of log P, this was not the primary intention behind its development. Rather, as previously indicated, the rationale for this work was the development of a conformationally sensitive but alignment-free lipophilicity descriptor for use in QSAR model development. The use of log P as the property to be estimated/predicted enables one to establish the extent of information loss, if any, in the process used to transform the results of MLP calculations into a descriptor suitable for use in QSAR analyses. [Pg.218]

The final part of the chapter is devoted to a demonstration of the effectiveness of the 3D-LogP approach as a descriptor in QSAR analysis through the modeling and prediction of pIC50 values for a set of 49 structurally diverse HIV-1 protease inhibitors taken from the literature (14). [Pg.218]

Ekins et al. (201) used the MS-WHIM descriptors to construct 3D and 4D QSAR models for the log(1/Ki) of 14 competitive inhibitors of CYP3A. The 3D QSAR of the CYP3A4-mediated midazolam 1'-hydroxylation was shown to be predictive, yielding a leave-one-out (LOO) q² value of 0.32. Although the 4D QSAR methodology includes conformational changes, it did not provide a significant improvement over the 3D QSAR (LOO q² = 0.44). Two other datasets (242,243) were used to create 3D and 4D QSAR models. In both datasets, it was not possible to build predictive 3D QSAR models; however, 4D QSAR models were constructed (LOO q² = 0.41-0.56). [Pg.486]

Froehlich, H., Wegner, J.K., Sieker, F. and Zell, A. (2006) Kernel functions for attributed molecular graphs - a new similarity-based approach to ADME prediction in classification and regression. QSAR Combinatorial Science, 25, 317-326. [Pg.40]

This strategy was successfully applied in QSAR [62], and preliminary results have demonstrated an increased accuracy in log Poct prediction when consensus models were derived by a neural network using as input eight well-known prediction values [63]. [Pg.97]
