Outliers in prediction

As a result, it is critical to evaluate process samples in real time for their appropriateness of use with the empirical model. For models built using PCA, PLS, PCR and other factor-based methods, the mechanism for such a model health monitor is already built into the model. Equations 12.21 and 12.22 can be used to calculate Hotelling T² and Q residual statistics for each process sample using the sample's on-line analytical profile (xp) and the loadings (P) and scores (T) of the model. An abnormally high value would indicate [Pg.430]

The use of T² and Q prediction outlier metrics as described above is an example of a model-specific health monitor, in that the metrics refer to the specific analyzer response space that was used to develop a PLS, PCR or PCA prediction model. However, many PAT applications involve the deployment of multiple prediction models on a single analyzer. In such cases, one can also develop an analyzer-specific health monitor, where the T² and Q outlier metrics refer to a wider response space that covers all normal analyzer operation. This would typically be done by building a separate PCA model using a set of data that covers all normal analyzer responses. Of course, one could extend this concept further, and deploy multiple PCA health monitor models that are designed to detect different specific abnormal conditions. [Pg.431]

The T² and Q metrics are the highest-level metrics for abnormality detection. For any given prediction sample that produces a high or interesting T² or Q value, one can also obtain the contributions of each analyzer variable to the T² and Q value, aptly named the t- and q-contributions, which are defined below [Pg.431]

These contributions can be particularly useful for assessing the nature or source of a specific process sample's abnormality. [Pg.431]
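The excerpt's equations for T², Q and their contributions are not reproduced on this page, so the following is only a minimal sketch under common conventions, not the book's own definitions. It assumes mean-centered data, orthonormal PCA loadings from an SVD, T² scaled by the calibration score variances, Q as the sum of squared residuals, and a t-contribution convention chosen so that the per-variable contributions sum to T². The function names `fit_pca` and `t2_q_contributions` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Mean-center X and return (mean, loadings P, scores T) from an SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T   # (n_vars, A), orthonormal columns
    T = Xc @ P                # (n_samples, A) calibration scores
    return mean, P, T

def t2_q_contributions(x_p, mean, P, T):
    """Return T2, Q and per-variable t- and q-contributions for one sample.

    Assumed conventions (not taken from the source text):
      T2        = sum_a t_a^2 / var(calibration score a)
      Q         = sum_j e_j^2, with e the residual after projection
      t_contrib = x_j * sum_a (t_a / var_a) * p_ja   (sums to T2)
      q_contrib = e_j^2                              (sums to Q)
    """
    xc = x_p - mean
    t_p = xc @ P                              # scores of the new sample
    score_var = T.var(axis=0, ddof=1)         # calibration score variances
    t2 = float(np.sum(t_p ** 2 / score_var))
    resid = xc - t_p @ P.T                    # part not described by the model
    q_contrib = resid ** 2
    q = float(q_contrib.sum())
    t_contrib = xc * (P @ (t_p / score_var))
    return t2, q, t_contrib, q_contrib

# Synthetic calibration data (illustrative only): 2 latent factors, 10 variables.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
X_cal += 0.01 * rng.normal(size=(50, 10))
mean, P, T = fit_pca(X_cal, n_components=2)

x_normal = X_cal[0]
x_spiked = X_cal[0].copy()
x_spiked[3] += 5.0            # disturb a single analyzer variable

t2_n, q_n, _, _ = t2_q_contributions(x_normal, mean, P, T)
t2_s, q_s, t_contrib, q_contrib = t2_q_contributions(x_spiked, mean, P, T)
print(f"normal: T2={t2_n:.2f}  Q={q_n:.4f}")
print(f"spiked: T2={t2_s:.2f}  Q={q_s:.4f}")
```

Inspecting `q_contrib` and `t_contrib` for a flagged sample shows which analyzer variables account for the high statistic. Other scalings of the t-contributions are in use, so the values should be read as relative indicators.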

Once a chemometric model is built, and it is used to produce concentration or property values in real time from on-line analyzer profiles, the detection of outliers is a particularly critical task. This is the case for two reasons [Pg.283]

As a result, it is very important to evaluate process samples in real time for their appropriateness of use with the empirical model. Historically, this task has often been overlooked. This is very unfortunate, not only because it is relatively easy to do, but also because it can effectively prevent the misuse of quantitative results obtained from a multivariate model. I would go so far as to say that it is irresponsible to implement a chemometric model without prediction outlier detection. [Pg.283]

Such real-time evaluation of process samples can be done by developing a PCA model of the calibration data, and then using this model in real time to generate prediction residuals (RESp) and leverages for each sample. Given a PCA model of the analytical profiles in the calibration data (conveyed by T and P), and the analytical profile of the prediction sample (xp), the scores of the prediction sample can be calculated [Pg.283]

The prediction leverage (LEVp) can then be calculated using the scores of the prediction sample, along with the scores of the PCA model [Pg.283]

The calculation of the RESp requires that the model estimate of the analytical profile of the prediction sample first be computed [Pg.283]
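The three steps described above (prediction scores, prediction leverage, prediction residual) can be sketched as follows. The source equations are not reproduced on this page, so the forms used here are assumptions based on common practice: scores from projection onto orthonormal loadings, leverage as 1/N plus the quadratic form in the calibration scores, and RESp as an unnormalized sum of squared residuals. The helper name `leverage_and_residual` and the synthetic data are hypothetical.

```python
import numpy as np

def leverage_and_residual(x_p, mean, P, T):
    """Return (scores, LEVp, RESp) for one prediction sample.

    Assumed forms (the source equations are not shown on this page):
      t_p   = (x_p - mean) P
      LEVp  = 1/N + t_p (T'T)^-1 t_p'
      x_hat = t_p P'
      RESp  = sum of squared elements of (x_p - mean - x_hat)
    """
    n = T.shape[0]
    xc = x_p - mean
    t_p = xc @ P                                   # scores of the sample
    lev = 1.0 / n + float(t_p @ np.linalg.solve(T.T @ T, t_p))
    x_hat = t_p @ P.T                              # model estimate of the profile
    res = float(np.sum((xc - x_hat) ** 2))
    return t_p, lev, res

# Synthetic calibration set (illustrative only): 2 factors, 8 variables.
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 8))
X_cal += 0.01 * rng.normal(size=(40, 8))

# PCA model of the calibration data via SVD of the mean-centered matrix.
mean = X_cal.mean(axis=0)
_, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
P = Vt[:2].T                 # loadings, orthonormal columns
T = (X_cal - mean) @ P       # calibration scores

# Leverage of every calibration sample, and of an extrapolated sample.
cal_lev = np.array([leverage_and_residual(x, mean, P, T)[1] for x in X_cal])
x_far = mean + 10.0 * (X_cal[0] - mean)   # same direction, 10x further out
_, lev_far, res_far = leverage_and_residual(x_far, mean, P, T)
print(f"max calibration leverage: {cal_lev.max():.3f}")
print(f"extrapolated sample:      LEVp={lev_far:.3f}  RESp={res_far:.4f}")
```

A high LEVp indicates a sample far from the calibration center within the model space, while a high RESp indicates variation the model does not describe. Thresholds for either statistic are application-specific and are not suggested here.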

Reasons for errors and outliers in prediction models are summarized with respect to cross-validation methods, such as leave-one-out. Furthermore, some case studies are discussed which make use of support vector regression, an emerging technique in QSAR. [Pg.112]

Ideally, the results should be validated somehow. One of the best methods for doing this is to make predictions for compounds known to be active that were not included in the training set. It is also desirable to eliminate compounds that are statistical outliers in the training set. Unfortunately, some studies, such as drug activity prediction, may not have enough known active compounds to make this step feasible. In this case, the estimated error in prediction should be increased accordingly. [Pg.248]

Figures 1 to 4 illustrate the results of the reconciliation for the four variables involved. As can be seen, this approach does not completely eliminate the influence of the outliers. For some of the variables, the prediction after reconciliation actually deteriorates because of the presence of outliers in some of the other measurements. This is in agreement with the findings of Albuquerque and Biegler (1996), in the sense that the results of this approach can be very misleading if the gross error distribution is not well characterized.

With the molecular descriptors as the X-block, and the sensory scores for sweet as the Y-block, PLS was used to calculate a predictive model using the Unscrambler program version 3.1 (CAMO A/S, Jarleveien 4, N-7041 Trondheim, Norway). When the full set of 17 phenols was used, optimal prediction of sweet odour was shown with 1 factor. Loadings of variables and scores of compounds on the first two factors are shown in Figures 1 and 2 respectively. Figure 3 shows predicted sweet odour score plotted against that provided by the sensory panel. Vanillin, with a sensory score of 3.3, was an obvious outlier in this set, and so the model was recalculated without it. Again 1 factor was required for optimal prediction, shown in Figure 4. [Pg.105]

However, the outliers in Eq. 5.1 were tetracaine and procaine which showed higher activities than predicted. [Pg.225]

Examination of the residuals plot shows isopropylbenzene as an outlier, in that the predicted value is too low. Recalculation of the data after removing isopropylbenzene resulted in the model ... [Pg.383]

There is strong evidence for making the assumption that the increase in luminescence observed is caused by induction of a metabolite for most of the compounds tested. First, outliers in QSAR regressions can be used to determine the limits of applicability of a QSAR (Lipnick, 1991). If the biosensor response to all compounds other than naphthalene was a non-specific response with no relationship to biotransformation, then it would be expected that the value for naphthalene would be a clear outlier. Instead, the value for naphthalene is close to the predicted value, as shown in Figure 17.3. A dose-response behavior is indicative of a specific mechanism. [Pg.386]

Prediction functions, based on the principal components, were used to see how much of the variation in the copigmentation measures could be accounted for by the vineyard and winery practices, using PLS methodologies. When all of the vineyard measures and winery conditions were included, 66% of the variation could be accounted for. The predicted color due to copigmentation is plotted against the actual values in Figure 4. Four components were used and the correlation coefficient is 0.82. Note that wine number 50 seems to be an outlier in this correlation. [Pg.45]

It is essential that the applicability domain of a derived model can be evaluated so that outliers to the model may be indicated. If an established statistical model is to be regarded as poor from a predictive point of view, this should be done for the correct reasons, that is, because the model has truly poor predictive ability and not because the model cannot estimate outliers to the model with acceptable accuracy. The latter case is probably the most common cause for statistical (ADMET) models to fall from fame, especially those that can be accessed through internal or external web services. In many cases it is difficult, if not impossible, to find out about the compounds used as the training set and/or the chemical description used in the model. Thus, many compounds outside the applicability domain of the model will be submitted. It is therefore of great importance to have an indication, together with the prediction, of whether the compound is considered to fall inside or outside of this domain, that is, whether the compound is an outlier or not. The outlier information, and possibly also how far from the model the compound in question is, may in many cases be utilized in a more proactive way than just realizing that a number of compounds submitted to the model for prediction are, in fact, outliers to the present model. Thus, by analyzing the outliers, perhaps virtual compounds, from various points of view, for example, structural or synthetic, some of these compounds may later be synthesized and tested experimentally. The same compounds may then be incorporated into a revised model that will have a broader applicability domain. There are different methods available to determine whether a particular compound is to be labeled as an outlier. In this section, we will describe two of these methods ... [Pg.1015]

Provided the limitations of QSARs in environmental sciences are considered and the respective application criteria are complied with, structure-activity relationships can be a powerful tool for elucidating modes of interaction, screening the validity and the plausibility of experimental data, detecting outliers and predicting activity parameters for product development, identification of priority pollutants and range finding for further testing. [Pg.10]

The applicability domain is one of the most important factors which should be taken into consideration while building mathematical models or while applying the prebuilt models for predictive studies [7]. Explaining outliers in the training set, test set and predicted set is one of the requirements in modern structure-property-... [Pg.135]

Compound pairs detected as informative activity cliffs often illustrate key chemical features for activity. These pairs, however, may also often be detected as apparent statistical outliers in quantitative SAR analysis methods [56], since the assumption of SAR continuity is fundamental for QSAR model building and affinity prediction. [Pg.210]

There is one additional method to use in determining outliers in discriminant analysis models: look at a plot of the predicted Mahalanobis distances (either from a cross-validation or self-prediction) to see if any samples stand out (Fig. 13). [Pg.188]
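As a rough illustration of the self-prediction variant, the sketch below computes each sample's Mahalanobis distance to its own class centroid. It assumes a pooled within-class covariance matrix, which is one common choice in discriminant analysis and not necessarily the one used in the source. The function name `class_mahalanobis` and the synthetic data are hypothetical, and the plotting step is left out.

```python
import numpy as np

def class_mahalanobis(X, y):
    """Mahalanobis distance of each sample to its own class centroid.

    Uses a pooled within-class covariance (an assumption of this sketch)
    and self-prediction, i.e. every sample is also used to fit the model.
    """
    classes = np.unique(y)
    centered = np.empty_like(X, dtype=float)
    for c in classes:
        mask = y == c
        centered[mask] = X[mask] - X[mask].mean(axis=0)
    pooled_cov = centered.T @ centered / (len(X) - len(classes))
    solved = np.linalg.solve(pooled_cov, centered.T).T   # rows: cov^-1 d_i
    return np.sqrt(np.sum(centered * solved, axis=1))

# Synthetic two-class data (illustrative only), 3 variables per sample.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
               rng.normal(3.0, 1.0, size=(20, 3))])
y = np.array([0] * 20 + [1] * 20)
X[5, 0] += 20.0                       # plant one gross outlier in class 0

d = class_mahalanobis(X, y)
print("distance of planted sample:", round(float(d[5]), 2))
print("median distance:           ", round(float(np.median(d)), 2))
```

Because the planted sample also inflates the pooled covariance in self-prediction, its distance is smaller than it would be under cross-validation, which is one reason the cross-validated distances mentioned above can make outliers stand out more clearly.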
