Big Chemical Encyclopedia


Outlier sample detection

By calculating the sum of the squares of the spectral residuals across all the wavelengths, an additional representative value can be generated for each spectrum. The spectral residual is effectively a measure of the amount of each spectrum left over in the secondary or noise vectors. This value is the basis of another type of discrimination method known as SIMCA (Refs. 13, 36). This is similar to performing an F test on the spectral residual to determine outliers in a training set (see Outlier Sample Detection in Chapter 4). In fact, one group combined the PCA-Mahalanobis distance method with SIMCA to provide a biparametric method of discriminant analysis (Ref. 41). In this method, both the Mahalanobis distance and the SIMCA test on the spectral residual had to pass in order for a sample to be classified as a match. [Pg.177]
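A minimal numpy sketch of this idea (the data, the number of factors and all names are invented for illustration, not taken from the text): fit a PCA model, reconstruct each spectrum from the retained factors, and sum the squared residuals across wavelengths. The sample with structure the model cannot reproduce shows the largest residual, which is the quantity a SIMCA-style F test would then examine.

```python
import numpy as np

def q_residuals(X, n_components):
    """Sum of squared spectral residuals (Q statistic) for each row of X,
    using a PCA model with the given number of components."""
    Xc = X - X.mean(axis=0)                 # mean-centre the spectra
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                 # loadings of the retained factors
    residual = Xc - Xc @ P @ P.T            # part of each spectrum left in the noise vectors
    return np.sum(residual**2, axis=1)

# Toy data: 20 spectra that live on a 2-factor model, plus one sample
# carrying a spectral artefact the model cannot reproduce.
rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 2))
loadings = rng.normal(size=(2, 50))
X = scores @ loadings + 0.01 * rng.normal(size=(20, 50))
X[7] += np.sin(np.linspace(0, 6, 50))       # artefact on sample 7

Q = q_residuals(X, n_components=2)
print(int(np.argmax(Q)))                    # sample 7 has the largest residual
```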

Figure 4.24 Studentised residuals leverage plot to detect outlier samples on the calibration set: (a) general rules displayed on a hypothetical case and (b) diagnostics for the data from the worked example (Figure 4.9, mean centred, four factors in the PLS model).
Typically, two types of extreme values can exist in our experimentally measured results, namely stragglers and outliers. The difference between the two is the confidence level required to distinguish them: statistically, stragglers are detected between the 95% and 99% confidence levels, whereas outliers are detected at a confidence level above 99%. It is important to note that no matter how extreme a data point may be, it could in fact be correct. Remember also that, when using the 95% confidence limit, one in every 20 samples we examine will be classified incorrectly. [Pg.34]
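The two confidence levels can be made operational with, for example, Grubbs' test; the following is an illustrative sketch (the choice of test, the data and the function names are ours, not the source's), using the standard two-sided critical-value formula built from Student's t distribution.

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha):
    """Two-sided Grubbs critical value for n observations at level alpha."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def classify_extreme(x):
    """Label the most extreme value: 'outlier' (>99% confidence),
    'straggler' (between 95% and 99%) or 'retain' otherwise."""
    x = np.asarray(x, dtype=float)
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    if G > grubbs_critical(len(x), 0.01):
        return "outlier"
    if G > grubbs_critical(len(x), 0.05):
        return "straggler"
    return "retain"

print(classify_extreme([10.0, 10.1, 9.9, 10.0, 10.1, 9.95, 10.05, 50.0]))  # outlier
```

Note that the critical value at 99% is always larger than at 95%, which is exactly the straggler window the excerpt describes.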

Another powerful tool for seeking out outlier samples is the spectral residual, discussed briefly in an earlier section. As with concentration outliers, spectral outliers are detected by using a model for which the optimum number of factors has been determined by cross-validation. [Pg.135]

The detection of outliers, particularly when working with a small number of samples, is discussed in the following paper: Efstathiou, C. Stochastic Calculation of Critical Q-Test Values for the Detection of Outliers in Measurements, J. Chem. Educ. 1992, 69, 773-736. [Pg.102]
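A sketch of the Q-test the paper discusses (the data set is invented; the 95% critical values are the commonly tabulated r10 values for small n, quoted here for illustration only):

```python
# Dixon's Q test for a single suspect value in a small data set.
# 95% critical values from the standard r10 table for n = 3..7.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568}

def dixon_q(data):
    """Return (Q, suspect): gap between the suspect value and its nearest
    neighbour, divided by the total range of the data."""
    x = sorted(data)
    gap_low = x[1] - x[0]
    gap_high = x[-1] - x[-2]
    if gap_low > gap_high:
        return gap_low / (x[-1] - x[0]), x[0]
    return gap_high / (x[-1] - x[0]), x[-1]

values = [0.189, 0.167, 0.187, 0.183, 0.186]
Q, suspect = dixon_q(values)
print(round(Q, 3), suspect, Q > Q_CRIT_95[len(values)])
```

Here Q = 0.016/0.022 ≈ 0.727, which exceeds the tabulated 0.710 for five observations, so the low value 0.167 would be rejected at the 95% level.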

The development of a calibration model is a time-consuming process. Not only do the samples have to be prepared and measured, but the modelling itself, including data pre-processing, outlier detection, estimation and validation, is not an automated procedure. Once the model is there, changes may occur in the instrumentation or other conditions (temperature, humidity) that require recalibration. Another situation is where a model has been set up for one instrument in a central location and one would like to distribute this model to other instruments within the organization without having to repeat the entire calibration process for all these individual instruments. One wonders whether it is possible to translate the model from one instrument (old or parent or master, A) to the others (new or children or slaves, B). [Pg.376]

For identifying outliers, it is crucial how the center and covariance are estimated from the data. Since the classical estimators, the arithmetic mean vector x and the sample covariance matrix C, are very sensitive to outliers, they are not useful for the purpose of outlier detection by taking Equation 2.19 for the Mahalanobis distances. Instead, robust estimators have to be taken for the Mahalanobis distance, such as the center and covariance from the minimum covariance determinant (MCD) estimator. [Pg.61]
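A numpy sketch of the classical computation (data and names invented for illustration); a robust variant would simply substitute MCD estimates, e.g. from `sklearn.covariance.MinCovDet`, for the mean and covariance passed in:

```python
import numpy as np

def mahalanobis_distances(X, center, cov):
    """Mahalanobis distance of each row of X from the given center and
    covariance (the role played by Equation 2.19)."""
    Ci = np.linalg.inv(cov)
    D = X - center
    return np.sqrt(np.einsum('ij,jk,ik->i', D, Ci, D))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[0] = [8.0, 8.0, 8.0]                      # gross outlier

# Classical estimates: the outlier inflates the mean and covariance,
# partially masking its own distance.
d_classical = mahalanobis_distances(X, X.mean(axis=0),
                                    np.cov(X, rowvar=False))
print(int(np.argmax(d_classical)))
```

With the classical estimates the squared distances always sum to exactly (n − 1)·p, so a gross outlier shrinks everyone's distances, including its own; this is one way to see why robust estimates are needed.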

Outliers demand special attention in chemometrics for several different reasons. During model development, their extremeness often gives them an unduly high influence in the calculation of the calibration model. Therefore, if they represent erroneous readings, they will add disproportionately more error to the calibration model. Furthermore, even if they represent real, informative measurements, it might be determined that this specific information is irrelevant to the problem. Outliers are also very important during model deployment, because they can be informative indicators of specific failures or abnormalities in the process being sampled, or in the measurement system itself. This use of outlier detection is discussed in the Model Deployment section (12.10), later in this chapter. [Pg.413]

The fact that not all outliers are erroneous leads to the following suggested practice for handling outliers during calibration development: (1) detect, (2) assess and (3) remove if appropriate. In practice, however, there could be hundreds or thousands of calibration samples and x variables, rendering individual detection and assessment of all outliers rather time-consuming. Nonetheless, outlier detection is one of the most important processes in model development, and the tools described below enable one to accomplish it in the most efficient and effective manner possible. [Pg.413]

Cook's Distance Plot (Model Diagnostic) A statistic known as Cook's distance can be used to detect calibration data outliers by identifying which samples are most influential on the model. Now that the selected variables have been finalized, it is good practice to examine the calibration data for influential samples. These samples should be investigated and removed if it is determined that they have an unusual effect on the model. [Pg.313]
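A sketch of the statistic for an ordinary least-squares fit (the data, perturbation and names are invented; full-spectrum calibration software computes the analogous quantity from the regression's leverage and residuals):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each sample of an ordinary least-squares fit,
    from the hat-matrix leverages and the residuals."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat (leverage) matrix
    h = np.diag(H)
    e = y - H @ y                             # residuals
    s2 = e @ e / (n - p)                      # residual variance estimate
    return e**2 / (p * s2) * h / (1 - h)**2

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 15)
y = 2.0 + 3.0 * x + 0.05 * rng.normal(size=15)
y[14] += 1.0                                  # perturb a high-leverage sample
X = np.column_stack([np.ones(15), x])

D = cooks_distance(X, y)
print(int(np.argmax(D)))                      # flags the perturbed sample
```

Samples with a distance well above the rest (a common rough cutoff is D > 1) are the influential ones worth investigating.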

Raw Measurement Plot In multivariate calibration, it is normally not necessary to plot the prediction data if the outlier detection technique has not flagged the sample as an outlier. However, with MLR, the outlier detection methods are not as robust as with the full-spectrum techniques (e.g., CLS, PLS, PCR) because few variables are considered. Figure 5.75 shows all of the prediction data with the variables used in the modeling noted by vertical lines. One sample appears to be unusual, with an extra peak centered at variable 140. The prediction of this sample might be acceptable because the peak is not located on the variables used for the models. However, it is still suspect because the new peak is not expected and can be an indication of other problems. [Pg.317]

As in PCA, outliers may influence modeling and should be detected. In regression, there are many ways a sample can be defined as an outlier. It may be an outlier according to X variables only or to Y variables only, or to both. It may also not be an outlier for either separate set of variables but become an outlier for (X, Y) regression. [Pg.401]

If a sample has been analyzed by k laboratories n times each, the sample variances of the n results from each laboratory can be tested for homogeneity —that is, any variance outliers among the laboratories can be detected. The ISO-recommended test is the Cochran test. The statistic that is tested is... [Pg.45]
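The statistic itself is simple to compute, as in this sketch (the laboratory data are invented; the decision requires comparing C against tabulated critical values, e.g. from ISO 5725, which are not reproduced here):

```python
import numpy as np

def cochran_c(groups):
    """Cochran's C statistic: the largest within-laboratory variance
    divided by the sum of all laboratory variances. Returns (C, index
    of the laboratory with the largest variance)."""
    variances = np.array([np.var(g, ddof=1) for g in groups])
    return variances.max() / variances.sum(), int(np.argmax(variances))

# Five laboratories, four replicates each; lab 2 scatters far more.
labs = [
    [10.1, 10.2, 10.0, 10.1],
    [10.0, 10.1, 10.2, 10.1],
    [9.0, 11.5, 10.8, 8.9],
    [10.2, 10.1, 10.0, 10.2],
    [10.1, 10.0, 10.1, 10.2],
]
C, worst = cochran_c(labs)
print(worst, round(C, 3))
```

If C exceeds the tabulated critical value for k laboratories and n replicates, the flagged laboratory's variance is declared an outlier.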

In any case, the cross-validation process is repeated a number of times and the squared prediction errors are summed. This leads to a statistic [the predicted residual sum of squares (PRESS), i.e. the sum of the squared errors] that varies as a function of model dimensionality. Typically a graph (PRESS plot) is used to draw conclusions. The best number of components is the one that minimises the overall prediction error (see Figure 4.16). Sometimes it is possible (depending on the software at hand) to visualise in detail how the samples behaved in the LOOCV process and, thus, detect whether some sample can be considered an outlier (see Figure 4.16a). Although Figure 4.16b is close to an ideal situation because the first minimum is very well defined, two different situations frequently occur ... [Pg.206]
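A leave-one-out sketch of a PRESS curve, using polynomial degree as a stand-in for the number of latent variables (the data, the quadratic model and the function names are ours, purely for illustration):

```python
import numpy as np

def press_loocv(x, y, max_degree):
    """PRESS (sum of squared leave-one-out prediction errors) for models
    of increasing complexity; the analogue of varying the number of
    factors in a PLS/PCR model."""
    press = []
    for degree in range(1, max_degree + 1):
        errors = []
        for i in range(len(x)):
            mask = np.arange(len(x)) != i          # leave sample i out
            coef = np.polyfit(x[mask], y[mask], degree)
            errors.append(y[i] - np.polyval(coef, x[i]))
        press.append(np.sum(np.square(errors)))
    return np.array(press)

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.1 * rng.normal(size=30)
press = press_loocv(x, y, max_degree=6)
print(int(np.argmin(press)) + 1)               # dimensionality at the minimum
```

Underfitted models (here, degree 1) show a sharply higher PRESS, and the curve flattens or rises again once the model starts fitting noise, which is exactly what the PRESS plot is read for.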

Equation (4.20) was proposed by Höskuldsson [65] many years ago and has been adopted by the American Society for Testing and Materials (ASTM) [59]. It generalises the univariate expression to the multivariate context and concisely describes the error propagated from three uncertainty sources to the standard error of the predicted concentration: calibration concentration errors, errors in calibration instrumental signals and errors in test sample signals. Equations (4.19) and (4.20) assume that the calibration standards are representative of the test or future samples. However, if the test or future (real) sample presents uncalibrated components or spectral artefacts, the residuals will be abnormally large. In this case, the sample should be classified as an outlier and the analyte concentration cannot be predicted by the current model. This constitutes the basis of the excellent outlier detection capabilities of first-order multivariate methodologies. [Pg.228]







© 2024 chempedia.info