
Applicability Domain of Models

In order to be applied, the models should be predictive. Unfortunately, models frequently fail and demonstrate significantly lower predictive ability than estimated when they are applied to new, unseen data [100-103, 106]. One of the main reasons for such failures is the lack of available experimental data and the difficulties in calculating log D, as discussed in Section 16.4.2. Another cause of the low predictive ability of log D models is the difference in chemical diversity between the molecules in in-house databases and those in the training sets used to develop the programs. [Pg.429]

The models developed for log D prediction usually aim at being global ones. In practice, however, this does not work. Sheridan et al. [109] noticed that the accuracy of log D prediction decreased approximately two- to threefold (RMSE = 0.75 versus 1.5-2) as the similarity of the test molecule to the molecules in the training set (using the Dice definition with atom pair descriptors) changed from 1 to 0 (most to least similar). Thus, if a test set molecule had a very similar molecule in the training set, it was possible to accurately predict its log D value. A detailed overview of state-of-the-art methods to address the same problem was published elsewhere [117]. [Pg.429]
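This nearest-neighbour similarity check is straightforward to reproduce. The minimal sketch below assumes RDKit is available and uses its atom-pair fingerprints with the Dice coefficient; the exact descriptor settings of Sheridan et al. [109] may differ, and the training structures shown are placeholders.

```python
# Highest Dice similarity between a query molecule and a log D training set,
# computed with atom-pair fingerprints (illustrative settings only).
from rdkit import Chem, DataStructs
from rdkit.Chem.AtomPairs import Pairs

def max_dice_similarity(query_smiles, training_smiles):
    """Return the Dice similarity of the query to its closest training-set neighbour."""
    query_fp = Pairs.GetAtomPairFingerprint(Chem.MolFromSmiles(query_smiles))
    best = 0.0
    for smi in training_smiles:
        fp = Pairs.GetAtomPairFingerprint(Chem.MolFromSmiles(smi))
        best = max(best, DataStructs.DiceSimilarity(query_fp, fp))
    return best

# A similarity close to 0 would flag the log D prediction as less reliable.
training = ["CCO", "CCN", "c1ccccc1O"]          # hypothetical training structures
print(max_dice_similarity("CCOC(=O)c1ccccc1", training))
```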

However, there is still a strong need to develop new methods that can quantitatively, or at least qualitatively, estimate the prediction accuracy of log D models. Such methods would allow the computational chemist to distinguish reliable from unreliable predictions and to decide whether the available model is sufficiently accurate or whether experimental measurements should be performed instead. For example, when applying ALOGPS in the LIBRARY mode it was possible to predict more than 50% and 30% of the compounds with a mean absolute error (MAE) of 0.35 for the Pfizer and AstraZeneca collections, respectively [117]. This precision approximately corresponds to the experimental accuracy, s = 0.4, of potentiometric lipophilicity determinations [15]. Thus, depending on the required precision, one could skip experimental measurements for some of the accurately predicted compounds. [Pg.429]

The low accuracy of models for predicting log D at an arbitrary pH does not encourage the use of these models for practical applications in industry. Thus, it is likely that methods for log D prediction at fixed pH developed in-house by pharmaceutical companies will dominate in industry. However, log D measurements [Pg.429]


The availability of data will dramatically transform the field and boost the development of new, reliable methods for physicochemical property prediction. The development of methods to estimate the accuracy of a prediction and the applicability domain of models will make it possible to obtain results with greater confidence and will support their wider use in environmental and pharmaceutical studies. [Pg.267]

In the following, we discuss a few important machine learning techniques. There exists a wealth of methods [69], and we focus here on the ones primarily applied to virtual screening. The list, however, is not exhaustive, and the interested reader is referred to the excellent machine learning literature for the full picture, where details of the applicability domain of models and error estimation are also discussed [69-72]. [Pg.77]

The tools for in silico toxicology are broadly applied in the drug development process. The particular use of these tools is clearly context-dependent and depends, among other things, on the quality of the prediction and the applicability domain of the model. [Pg.475]

All predictions must be taken for what they are, namely, generalizations based on current knowledge and understanding. There is a temptation for a user to assume that a computer-generated answer must be correct. To determine whether this is in fact the case, a number of factors concerning the model must be addressed. The statistical evaluation of a model was addressed above. Another very important criterion is to ensure that a prediction is an interpolation within the model space, and not an extrapolation outside of it. To determine this, the concept of the applicability domain of a model has been introduced [106]. [Pg.487]

Extent of Extrapolation For a regression-like QSAR, a simple measure of a chemical being too far from the applicability domain of the model is its leverage, h_i [36], which is defined as...
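The excerpt breaks off here; for reference, the leverage is conventionally defined as h_i = x_i^T (X^T X)^(-1) x_i, where X is the training-set descriptor matrix and x_i the descriptor vector of compound i, with a commonly used warning threshold of h* = 3(p + 1)/n. The sketch below is a minimal numpy illustration of this textbook definition and may differ in detail from the formulation in ref. [36]; the data are placeholders.

```python
# Conventional leverage h_i = x_i^T (X^T X)^{-1} x_i for a query compound, with the
# common warning threshold h* = 3(p + 1)/n (textbook definition; details may differ
# from ref. [36]).
import numpy as np

def leverage(X_train, x_query):
    XtX_inv = np.linalg.pinv(X_train.T @ X_train)   # pseudo-inverse for numerical stability
    return float(x_query @ XtX_inv @ x_query)

X = np.random.rand(50, 4)                  # hypothetical training descriptor matrix (n x p)
x = np.random.rand(4)                      # hypothetical query compound
h_star = 3 * (X.shape[1] + 1) / X.shape[0]
print(leverage(X, x) > h_star)             # True => likely outside the applicability domain
```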

Several attempts have been made to determine the accuracy of in silico prediction tools developed for lipophilicity (for a recent review, see [34]). The main factor limiting the accuracy of all predictive methods is the training set used to generate the model, in terms of the size and quality of the experimental data it contains. Since most of the methods offered in commercial software were built with data available in the public domain, their accuracies can be expected to be comparable. Thus, in order to select the most suitable prediction tool, criteria other than accuracy have to be used, such as the speed of calculation for large databases, the price of the commercial software, or the application domain of the model. [Pg.96]

Every model has limitations. Even the most robust and best-validated regression model will not predict the outcome for all catalysts. Therefore, you must define the application domain of the model. Usually, interpolation within the model space will yield acceptable results. Extrapolation is more dangerous, and should be done only in cases where the new catalysts or reaction conditions are sufficiently close to the model. There are several statistical parameters for measuring this closeness, such as the distance to the nearest neighbor within the model space (see the discussion on catalyst diversity in Section 6.3.5). Another approach uses the effective prediction domain (EPD), which defines the prediction boundaries of regression models with correlated variables [105]. [Pg.266]
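As an illustration of the nearest-neighbour criterion mentioned above, the sketch below computes the distance of a new point to its closest training neighbour in descriptor space and compares it with a threshold derived from the training set itself. The threshold (mean plus two standard deviations of the training set's own nearest-neighbour distances) is an arbitrary illustrative choice, not the EPD criterion of ref. [105], and the data are placeholders.

```python
# Distance-to-nearest-neighbour check in the model's descriptor space (illustrative
# threshold only; not the effective prediction domain of ref. [105]).
import numpy as np

def nn_distance(X_train, x_query):
    return np.linalg.norm(X_train - x_query, axis=1).min()

def inside_domain(X_train, x_query):
    # Nearest-neighbour distances within the training set (self-distances excluded).
    d = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    train_nn = d.min(axis=1)
    threshold = train_nn.mean() + 2 * train_nn.std()
    return nn_distance(X_train, x_query) <= threshold

X = np.random.rand(30, 5)                  # hypothetical catalyst/reaction descriptors
print(inside_domain(X, np.random.rand(5)))
```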

The descriptor was a product of the correlation weights, CW(I_k), calculated by the Monte Carlo method for each k-th element of a special SMILES-like notation introduced by the authors. The notation codes the following characteristics: the atom composition, the type of substance (bulk or not, ceramic or not), and the temperature of synthesis. The QSAR model constructed in this way was validated with the use of many different splits into training (n = 21) and validation (n = 8) sets. Individual sub-models are characterized by high goodness-of-fit (0.972). As for the applicability domain of the model, it is not known whether all the compounds (metal oxides, nitrides, mullite, and silicon carbide) can be truly modeled together. [Pg.211]
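The sketch below illustrates, under stated assumptions, the general idea described above: each attribute of the SMILES-like notation receives a correlation weight, the per-molecule descriptor is the product of the weights of its attributes, and the weights are tuned by a simple Monte Carlo search to maximize correlation with the endpoint. All attribute names, data, and optimization details are placeholders, not those of the cited work (requires Python 3.10+ for statistics.correlation).

```python
# Correlation-weight descriptor tuned by a simple Monte Carlo search (illustrative only).
import random
import statistics

def descriptor(attrs, cw):
    d = 1.0
    for a in attrs:            # descriptor = product of the weights of the attributes present
        d *= cw[a]
    return d

def pearson(x, y):
    if len(set(x)) < 2:        # guard against a constant descriptor vector
        return 0.0
    return statistics.correlation(x, y)

def fit_weights(molecules, endpoints, n_steps=5000, seed=0):
    rng = random.Random(seed)
    attrs = {a for m in molecules for a in m}
    cw = {a: 1.0 for a in attrs}
    best = -1.0
    for _ in range(n_steps):
        a = rng.choice(sorted(attrs))
        old = cw[a]
        cw[a] = max(0.05, old + rng.uniform(-0.1, 0.1))   # random perturbation of one weight
        r = abs(pearson([descriptor(m, cw) for m in molecules], endpoints))
        if r > best:
            best = r                                       # keep the move
        else:
            cw[a] = old                                    # reject the move
    return cw, best

# Hypothetical attribute lists (atoms, bulk/nano flag, synthesis temperature class).
mols = [["Zn", "O", "bulk", "T1"], ["Ti", "O", "nano", "T2"],
        ["Si", "C", "bulk", "T1"], ["Al", "O", "nano", "T2"]]
y = [2.3, 5.1, 1.8, 4.4]
weights, r = fit_weights(mols, y)
print(round(r, 3))
```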

Applicability Domain Criteria. The applicability domain of a (Q)SAR is the physico-chemical, structural, or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds. Ideally, QSARs should only be used to make predictions within the applicability domain of the training set (i.e., interpolation versus extrapolation). Many chemicals and chemical classes do not conform to current QSAR models, as they extend beyond the domain inherent in the training data sets. These include polymers, reaction products, mixtures and inorganics. [Pg.2681]

Applicability Domain for DT-Based Models We describe the applicability domain for QSAR models as being determined by two parameters: (1) prediction confidence, or the certainty of a prediction for an unknown chemical, and (2) domain extrapolation, or the prediction accuracy for an unknown chemical that lies beyond the chemical space of the training set [60]. Both parameters can be quantitatively estimated in consensus tree approaches, where the individual models are constructed as DTs. Taken together, prediction confidence and domain extrapolation assess the applicability domain of a model for each prediction. [Pg.164]
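One way to obtain a per-compound confidence from a consensus of decision trees is to take the fraction of trees that agree with the majority vote, as in the minimal scikit-learn sketch below. This illustrates the idea only and is not the exact procedure of ref. [60]; domain extrapolation would additionally require assessing accuracy as a function of distance from the training chemical space. Data and labels are synthetic placeholders.

```python
# Prediction confidence from a consensus of decision trees: the fraction of trees
# agreeing with the majority vote for a query compound (illustrative sketch only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(200, 10)             # hypothetical descriptor matrix
y_train = (X_train[:, 0] > 0.5).astype(int)   # hypothetical toxic / non-toxic labels

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

x_query = np.random.rand(1, 10)
votes = np.array([tree.predict(x_query)[0] for tree in forest.estimators_])
majority_class = int(round(votes.mean()))
confidence = max(votes.mean(), 1.0 - votes.mean())   # fraction of trees agreeing
print(majority_class, round(confidence, 2))
```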

Another important area of QSAR research is the determination of the reliability of QSAR models. Current research in this area includes the development of methods to define the applicability domain of a QSAR model [91] and to calculate the expected prediction accuracies for individual compounds [70]. [Pg.232]

OntoCAPE is the core of the integrated model; it describes the concepts of chemical engineering, the specific application domain of interest in the IMPROVE project. The product data in OntoCAPE are well integrated with the Document Model by direct references from the document content description to product data elements in OntoCAPE. Likewise, the elements of the document content description link the product model in OntoCAPE to the decision and work process documentation in the Process and Decision Ontologies. [Pg.748]

Important steps of this process are (a) selection of the set of molecules the modeling procedure is applied to, and the set of molecular descriptors that will define the model chemical space; (b) selection of the training set for the model estimation and the test set for model validation; (c) application of the validated model(s) to design new molecules with desirable properties and/or predict the response of interest for future molecules, paying attention to the applicability domain of the model. [Pg.749]
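A compact end-to-end sketch of steps (b) and (c) is given below, using scikit-learn with synthetic placeholder data: split the data into training and test sets, validate the fitted model externally, and flag whether a "future" molecule falls inside a crude bounding sphere of the training chemical space. The domain check is deliberately simplistic and only illustrates the final caveat about the applicability domain.

```python
# Training/test split, external validation, and a crude applicability-domain flag
# for a new molecule (synthetic data; the domain check is a simple bounding sphere).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 8)                                   # molecules x descriptors
y = X @ np.random.rand(8) + 0.1 * np.random.randn(100)       # synthetic response

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = Ridge().fit(X_tr, y_tr)
print("external R2:", round(r2_score(y_te, model.predict(X_te)), 3))

x_new = np.random.rand(1, 8)                                  # a "future" molecule
centroid = X_tr.mean(axis=0)
radius = np.linalg.norm(X_tr - centroid, axis=1).max()
in_domain = np.linalg.norm(x_new - centroid) <= radius
print("prediction:", round(model.predict(x_new)[0], 3), "in domain:", bool(in_domain))
```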

However, even when the predictive ability of a model is high, the estimated property should be interpreted with caution, because a molecule might be far from the model chemical space; in that case the response would be the result of a strong extrapolation and hence an unreliable prediction. To cope with this problem, the concept of the applicability domain of a model emerged as a relevant aspect of the evaluation of prediction reliability. [Pg.1253]

In order to establish whether a query chemical compound fits within the applicability domain of the QSAR model (i.e., that the training set for the model contained molecules chemically similar to the query molecule), the similarity of the uploaded structure (and of its predicted metabolites, if the metabolite QSAR prediction option is selected) to the structures used in the training set of the QSAR model can be calculated (maximal Tanimoto coefficient (15)). The higher the similarity value, the greater the applicability of the model. Results of QSAR modeling for... [Pg.233]

It is essential that the applicability domain of a derived model can be evaluated so that outliers to the model can be flagged. If an established statistical model is to be regarded as poor from a predictive point of view, this should be for the correct reason, namely that the model has truly poor predictive ability, and not because the model cannot estimate outliers with acceptable accuracy. The latter case is probably the most common reason why statistical (ADMET) models fall from fame, especially those accessed through internal or external web services. In many cases it is difficult, if not impossible, to find out which compounds were used in the training set and/or which chemical description was used in the model. Thus, many compounds outside the applicability domain of the model will be submitted. It is therefore of great importance to report, together with the prediction, an indication of whether the compound is considered to fall inside or outside this domain, that is, whether the compound is an outlier or not. The outlier information, and possibly also how far from the model the compound in question lies, can in many cases be used more proactively than simply noting that a number of the compounds submitted for prediction are, in fact, outliers to the present model. By analyzing the outliers, perhaps virtual compounds, from various points of view, for example structural or synthetic, some of these compounds may later be synthesized and tested experimentally. The same compounds may then be incorporated into a revised model with a broader applicability domain. There are different methods available to determine whether a particular compound should be labeled as an outlier. In this section, we will describe two of these methods ... [Pg.1015]

Sahigara F, Mansouri K, Ballabio D, et al. (2012) Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17:4791-4810 [Pg.192]

Dynamic range of the data
Experimental error
Applicability domain of the model
Confidence intervals on correlation coefficients [Pg.17]

