Big Chemical Encyclopedia


Model development outliers

To construct the reference model, the interpretation system required routine process data collected over a period of several months. Cross-validation was applied to detect and remove outliers. Only data corresponding to normal process operations (that is, when top-grade product is made) were used in the model development. As stated earlier, the system ultimately involved two analysis approaches, both reduced-order models that capture dominant directions of variability in the data. A PLS analysis using two loadings explained about 60% of the variance in the measurements. A subsequent PCA analysis on the residuals showed that five principal components explain 90% of the residual variability. [Pg.85]
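The two-stage analysis described above (a reduced-order model for the dominant variability, followed by PCA on the residuals) can be sketched as follows. This is an illustrative simulation only: an SVD truncation stands in for the PLS step (no y-block is simulated), and the data, component counts, and variance fractions are invented, not the plant data from the excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "routine process data": 200 samples x 20 measurements with a
# few dominant directions of variability plus measurement noise.
scores = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 20))
X = scores @ loadings + 0.3 * rng.normal(size=(200, 20))
Xc = X - X.mean(axis=0)

# Reduced-order model: keep the first k principal directions via SVD.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
explained = s[:k] ** 2 / np.sum(s ** 2)   # variance captured per component

# Residuals after the k-component model, then PCA on those residuals.
X_hat = (U[:, :k] * s[:k]) @ Vt[:k]
R = Xc - X_hat
_, sr, _ = np.linalg.svd(R - R.mean(axis=0), full_matrices=False)
resid_explained = np.cumsum(sr ** 2) / np.sum(sr ** 2)
```

`resid_explained` plays the role of the residual-variance profile in the excerpt: one reads off how many residual components are needed to reach a target fraction such as 90%.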

Outliers demand special attention in chemometrics for several reasons. During model development, their extremeness often gives them unduly high influence on the calculated calibration model; if they represent erroneous readings, they therefore add disproportionately more error to the model. Even if they carry genuine information, that information may be irrelevant to the problem at hand. Outliers are also very important during model deployment, because they can be informative indicators of specific failures or abnormalities in the process being sampled, or in the measurement system itself. This use of outlier detection is discussed in the Model Deployment section (12.10), later in this chapter. [Pg.413]

The fact that not all outliers are erroneous leads to the following suggested practice for handling outliers during calibration development: (1) detect, (2) assess, and (3) remove if appropriate. In practice, however, there can be hundreds or thousands of calibration samples and x variables, making individual detection and assessment of all outliers a rather time-consuming process. Time-consuming as it may be, outlier detection is one of the most important processes in model development, and the tools described below enable one to accomplish it as efficiently and effectively as possible. [Pg.413]
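A minimal sketch of the detect-assess-remove practice, using an ad hoc standardized-residual rule on synthetic calibration data. The helper `flag_outliers`, the 3-sigma cutoff, and the injected gross error are all illustrative assumptions, not a prescribed method:

```python
import numpy as np

def flag_outliers(X, y, z_cut=3.0):
    """Detect step: fit ordinary least squares, then flag samples whose
    y-residual exceeds z_cut standard deviations (hypothetical rule)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    z = (resid - resid.mean()) / resid.std()
    return np.abs(z) > z_cut

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 2.0 + 0.5 * X[:, 1] + 0.05 * rng.normal(size=50)
y[10] += 5.0                      # inject one gross (erroneous) reading

mask = flag_outliers(X, y)
# Assess each flagged sample, then remove only if judged erroneous:
X_clean, y_clean = X[~mask], y[~mask]
```

The assess step is deliberately manual here: as the excerpt notes, a flagged sample may carry real but irrelevant information, so removal should be a judgment, not an automatic deletion.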

Figure 12.29 Time-series plot of the y-residuals obtained from a PLS model developed using the process spectroscopy calibration data set (solid line), after removal of sample and variable outliers as discussed earlier. The measured y-values (dashed line) are also provided for reference.
Once a suitable covariate model is identified and no further model development will be done, the next step is to examine the dataset for outliers and influential observations. It may be that a few subjects are driving the inclusion of a covariate in a model, or that a few observations are biasing the parameter estimates. Examination of the weighted residuals under Eq. (9.14), with the model estimates given in Table 9.15, showed that the distribution was skewed, with two observations outside the acceptable limits of ±5. Patient 54 had an observed concentration of 4.05 mg/L 6-h postdose but a predicted concentration of 1.22 mg/L, a difference of 2.83 mg/L and a corresponding weighted residual of +5.4. Patient 84 had an observed concentration of 1.57 mg/L 7.5-h postdose but had a... [Pg.328]
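The weighted-residual check can be reproduced numerically. The standard deviations below are assumed values chosen so that the first entry reproduces the +5.4 weighted residual quoted for patient 54; the remaining observations are invented:

```python
import numpy as np

# Weighted residuals: (observed - predicted) / sd, flagged when
# |WRES| exceeds the acceptable limit of 5 quoted in the text.
obs  = np.array([4.05, 1.57, 2.10, 1.95])   # mg/L (first value: patient 54)
pred = np.array([1.22, 1.50, 2.00, 2.05])   # model predictions, mg/L
sd   = np.array([0.52, 0.40, 0.45, 0.50])   # assumed residual sd per point

wres = (obs - pred) / sd
flagged = np.abs(wres) > 5.0
```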

Cropley made general recommendations for developing kinetic models with complicated rate expressions. His approach first formulates a hyperbolic non-linear model in dimensionless form by linear statistical methods. In this way, essential terms are identified and others are rejected, reducing the number of unknown parameters. Only toward the end, when the model is reduced to its essential parts, is non-linear estimation of the parameters involved. His ten steps are summarized below. Their basis is a set of rate data measured in a recycle reactor using a sixteen-experiment fractional factorial design at two levels in five variables, with three additional repeated centerpoints. To these are added two outlier... [Pg.140]
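A sketch of the linearize-first, estimate-nonlinearly-last idea. The Langmuir-Hinshelwood form r = kC/(1 + KC) is a stand-in (the excerpt does not give Cropley's actual rate expression), and the sixteen "experiments" are simulated:

```python
import numpy as np
from scipy.optimize import curve_fit

# Stand-in hyperbolic rate expression (illustrative, not Cropley's).
def rate(C, k, K):
    return k * C / (1.0 + K * C)

rng = np.random.default_rng(2)
C = np.linspace(0.1, 2.0, 16)                  # 16 simulated experiments
r = rate(C, 3.0, 1.5) * (1 + 0.02 * rng.normal(size=C.size))

# Step 1: linearize, 1/r = 1/(k*C) + K/k, and fit by ordinary least
# squares to screen terms and obtain starting values.
A = np.column_stack([1.0 / C, np.ones_like(C)])
(b1, b0), *_ = np.linalg.lstsq(A, 1.0 / r, rcond=None)
k0, K0 = 1.0 / b1, b0 / b1

# Step 2: only at the end, estimate the surviving parameters nonlinearly.
(k_hat, K_hat), _ = curve_fit(rate, C, r, p0=[k0, K0])
```

The linear screening stage decides which terms survive; the nonlinear stage then refines only those, which is the point of the ten-step procedure.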

The development of a calibration model is a time-consuming process. Not only must the samples be prepared and measured, but the modelling itself, including data pre-processing, outlier detection, estimation and validation, is not an automated procedure. Once the model is in place, changes may occur in the instrumentation or in other conditions (temperature, humidity) that require recalibration. Another situation is where a model has been set up for one instrument in a central location and one would like to distribute this model to other instruments within the organization without having to repeat the entire calibration process for each individual instrument. One wonders whether it is possible to translate the model from one instrument (old, parent or master, A) to the others (new, children or slaves, B). [Pg.376]
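Model translation between instruments can be illustrated with a simple direct-standardization sketch: a transform F, estimated from transfer samples measured on both instruments, maps slave (B) responses onto the master (A) response space so the master's calibration model can be reused. The gain, offset, noise level, and sample counts below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n_transfer, n_chan = 60, 20        # transfer-set size and channels (illustrative)

S_master = rng.normal(size=(n_transfer, n_chan))
# The slave instrument differs by a gain, a baseline offset, and noise.
S_slave = 1.05 * S_master + 0.1 + 0.01 * rng.normal(size=S_master.shape)

# Augment with an intercept column and solve aug(S_slave) @ F ~ S_master.
aug = lambda S: np.hstack([S, np.ones((S.shape[0], 1))])
F, *_ = np.linalg.lstsq(aug(S_slave), S_master, rcond=None)

# A new slave spectrum is mapped into the master's space, so the
# master's calibration model is reused without full recalibration.
z = rng.normal(size=(1, n_chan))   # underlying "true" master response
new_slave = 1.05 * z + 0.1
corrected = aug(new_slave) @ F     # approximately recovers z
```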

For modeling BBB penetration, the authors used Abraham's data set of 57 compounds as the training set. The test set consisted of 13 compounds, 7 of which were taken from Abraham's data set and 6 from the data set of Lombardo and co-workers. In addition to the lipoaffinity descriptor, the other descriptors they used include molecular weight and TPSA. Two models were developed: one based on stepwise MLR and the other based on an ANN. To test the performance of the different descriptors, they first carried out a simple LR on the 55 training-set compounds (two outliers were removed) using TPSA as the only descriptor (Eq. 41). The equation was comparable to Clark's model (Eq. 33). [Pg.526]
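The screening step, a simple linear regression against TPSA alone, looks like this in outline. The TPSA and logBB values below are invented for illustration; they are not Abraham's data set, and the fitted coefficients are not those of Eq. 41:

```python
import numpy as np

# Invented descriptor/activity pairs: logBB tends to fall as TPSA rises.
tpsa  = np.array([12.0, 30.1, 45.5, 60.2, 78.9, 95.3, 110.0])
logbb = np.array([0.85, 0.55, 0.30, 0.05, -0.30, -0.60, -0.85])

A = np.column_stack([tpsa, np.ones_like(tpsa)])
(slope, intercept), *_ = np.linalg.lstsq(A, logbb, rcond=None)
pred = A @ np.array([slope, intercept])
r2 = 1 - np.sum((logbb - pred) ** 2) / np.sum((logbb - logbb.mean()) ** 2)
```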

A training set of 57 compounds studied by Lombardo et al., Norinder et al., and Clark was used to develop the following model using stepwise multiple regression analysis, after removing one outlier (Eq. 65) ... [Pg.535]

The use of T² and Q prediction outlier metrics as described above is an example of a model-specific health monitor, in that the metrics refer to the specific analyzer response space that was used to develop a PLS, PCR or PCA prediction model. However, many PAT applications involve the deployment of multiple prediction models on a single analyzer. In such cases, one can also develop an analyzer-specific health monitor, where the T² and Q outlier metrics refer to a wider response space that covers all normal analyzer operation. This would typically be done by building a separate PCA model using a set of data that covers all normal analyzer responses. Of course, one could extend this concept further, and deploy multiple PCA health monitor models designed to detect different specific abnormal conditions. [Pg.431]
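An analyzer-specific health monitor of the kind described can be sketched with a PCA model of normal responses and a Q (squared-residual) statistic. The simulated responses are invented, and the empirical 99th-percentile limit is ad hoc, standing in for the usual theoretical Q control limit:

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated "all normal analyzer responses": 300 samples x 25 channels
# lying near a 3-dimensional plane, plus small noise.
normal = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 25)) \
         + 0.05 * rng.normal(size=(300, 25))

mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
P = Vt[:3].T                      # loadings spanning normal operation

def q_stat(x):
    """Squared residual of x outside the PCA model plane."""
    r = (x - mu) - (x - mu) @ P @ P.T
    return float(r @ r)

# Ad hoc alarm limit: 99th percentile of Q over the training responses.
q_limit = np.quantile([q_stat(x) for x in normal], 0.99)

# A faulted response with energy off the normal plane trips the alarm.
abnormal_probe = normal[0] + 2.0 * rng.normal(size=25)
```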

The two models chosen by the USNA team are clearly outliers from the family of available models. The Canadian Climate Centre model (acronymed by the USNA as CGCM1) is one of the very few that produces a substantially exponential (rather than linear) change in temperature. The other model used by the team is the Hadley Centre model (acronymed by the USNA as HadCM2), developed at the United Kingdom's Meteorological Office.6... [Pg.189]

Although it is desirable to include as many data (compounds) as possible in developing a model, it is wise initially to omit compounds with unique characteristics, e.g. the one acid in a set of basic compounds, an insoluble derivative, or an unstable or readily metabolized analogue. Such compounds will be recognizable as outliers once a robust equation has been generated, but until then their inclusion might obscure a relationship. [Pg.103]

Cross-validation of the DFA model is conducted by casewise deletion, reestimation of functions, and classification. In other words, for each observation in the data set, that observation is omitted and the discriminant functions are re-estimated using the full data set minus that observation. Then that observation is classified based on the re-estimated functions. The accuracy of the cross-validation can be used to evaluate the reliability of the DFA and the potential impact of group outliers. In essence, the cross-validation process is the same process used to determine provenance of an unsourced archaeological sample where the discriminant functions are developed independently of the sample and then used to determine its most likely source. [Pg.466]
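The casewise-deletion loop can be sketched as follows, with a nearest-centroid classifier standing in for the discriminant functions (the omit, re-estimate, classify structure is the same as in the DFA procedure described above; the two-group data are simulated):

```python
import numpy as np

rng = np.random.default_rng(5)
# Two simulated provenance groups of 30 samples x 4 variables each.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
               rng.normal(3.0, 1.0, size=(30, 4))])
y = np.repeat([0, 1], 30)

def classify(x, X_train, y_train):
    """Assign x to the group with the nearest centroid (stand-in for
    re-estimated discriminant functions)."""
    cents = [X_train[y_train == g].mean(axis=0) for g in (0, 1)]
    return int(np.argmin([np.linalg.norm(x - c) for c in cents]))

correct = 0
for i in range(len(y)):                       # omit one observation ...
    keep = np.arange(len(y)) != i
    pred = classify(X[i], X[keep], y[keep])   # ... re-estimate, classify it
    correct += (pred == y[i])

cv_accuracy = correct / len(y)
```

As the excerpt notes, classifying an unsourced archaeological sample works the same way: the functions are estimated without the sample, then applied to it.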

A preliminary and essential step in a QSAR study is to evaluate the database to identify any outliers, hidden patterns, trends, and major groupings. Outliers are members of the database whose mechanistic behavior is so different that they cannot be grouped with the bulk of the data. Selecting suitable molecular descriptors, whether theoretical, empirical, or derived from readily available experimental characteristics of the structures, is an important step in the development of sound QSAR models. Many descriptors reflect simple molecular properties and can thus provide insight into the physicochemical nature of the activity or property under consideration. [Pg.139]
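One common way to make this preliminary screen concrete is a Mahalanobis-distance check in descriptor space. The descriptor matrix, the injected deviant member, and the 3.5 cutoff below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
D = rng.normal(size=(80, 5))      # 80 compounds x 5 descriptors (simulated)
D[7] += 6.0                       # one mechanistically different member

# Mahalanobis distance of each compound from the centroid of the data.
mu = D.mean(axis=0)
cov = np.cov(D, rowvar=False)
inv = np.linalg.inv(cov)
md = np.sqrt(np.einsum('ij,jk,ik->i', D - mu, inv, D - mu))

suspects = np.where(md > 3.5)[0]  # ad hoc cutoff for a first look
```

Compounds flagged this way are candidates for the outlier assessment described above, not automatic deletions.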

It should be noted that there is no global PCA prediction outlier model for a given application. In fact, several different prediction outlier models can be developed, using different subsets of calibration samples and/or X-variables, to provide alarms for several different outlier conditions. [Pg.285]

