Outlier variance

In the case of trace elements, results were provided only for Cd, Cu, and Pb (Table 5.2.5). In the RM05 exercise (spring water), the z-score was calculated using a fit-for-purpose unit of deviation corresponding to 10% of Xref. Satisfactory results were obtained for Cu and Pb, and doubtful results for Cd; for this last analyte the result was identified as a Cochran outlier (variance) amongst the results submitted for RM05, but not as a Hampel outlier (mean). [Pg.360]
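A z-score with a fit-for-purpose unit of deviation of 10% of the reference value can be sketched as follows. The thresholds used for classification follow the usual proficiency-testing convention (|z| ≤ 2 satisfactory, 2 < |z| < 3 doubtful, |z| ≥ 3 unsatisfactory); the numerical values in the example are hypothetical, not the RM05 data.

```python
def z_score(x, x_ref, rel_sigma=0.10):
    """z = (x - Xref) / (rel_sigma * Xref), with the unit of deviation
    set as a fixed fraction of the reference value."""
    return (x - x_ref) / (rel_sigma * x_ref)

def classify(z):
    az = abs(z)
    if az <= 2.0:
        return "satisfactory"
    if az < 3.0:
        return "doubtful"
    return "unsatisfactory"

# Hypothetical Cd result: reference value 5.0 ug/L, lab reports 6.2 ug/L
z = z_score(6.2, 5.0)      # (6.2 - 5.0) / 0.5 = 2.4
print(z, classify(z))      # 2.4 doubtful
```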

To construct the reference model, the interpretation system required routine process data collected over a period of several months. Cross-validation was applied to detect and remove outliers. Only data corresponding to normal process operations (that is, when top-grade product is made) were used in the model development. As stated earlier, the system ultimately involved two analysis approaches, both reduced-order models that capture dominant directions of variability in the data. A PLS analysis using two loadings explained about 60% of the variance in the measurements. A subsequent PCA analysis on the residuals showed that five principal components explain 90% of the residual variability. [Pg.85]

Most techniques for process data reconciliation start with the assumption that the measurement errors are random variables obeying a known statistical distribution, and that the covariance matrix of measurement errors is given. In Chapter 10 direct and indirect approaches for estimating the variances of measurement errors are discussed, as well as a robust strategy for dealing with the presence of outliers in the data set. [Pg.26]

A multivariate normal distribution data set was generated by the Monte Carlo method using the values of variances and true flowrates in order to simulate the process sampling data. The data, of sample size 1000, were used to investigate the performance of the robust approach in the two cases, with and without outliers. [Pg.212]
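A Monte Carlo generation of this kind can be sketched with the standard library alone; the true flowrates, standard deviations, and contamination fraction below are illustrative assumptions, not the values used in the study.

```python
import random

random.seed(0)

# Hypothetical true flowrates and measurement-error standard deviations
true_flows = [10.0, 25.0, 40.0]
sigmas     = [0.2, 0.5, 0.8]

def simulate(n, outlier_frac=0.0, outlier_scale=10.0):
    """Generate n measurement vectors around the true flowrates;
    optionally replace a fraction of rows with gross errors
    (outlier_scale times the nominal sigma)."""
    data = []
    for _ in range(n):
        gross = random.random() < outlier_frac
        row = [mu + random.gauss(0.0, s * (outlier_scale if gross else 1.0))
               for mu, s in zip(true_flows, sigmas)]
        data.append(row)
    return data

clean = simulate(1000)                       # no outliers
dirty = simulate(1000, outlier_frac=0.05)    # roughly 5% gross errors
```

The contaminated set lets one check how strongly a non-robust estimate of the error variance is inflated relative to the clean set.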

If the errors are normally distributed, the OLS estimates are the maximum likelihood estimates of 9 and are unbiased and efficient (minimum variance estimates) in the statistical sense. However, if there are outliers in the data, the underlying distribution is not normal and the OLS estimates will be biased. To solve this problem, a more robust estimation method is needed. [Pg.225]
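The effect can be shown in a minimal sketch contrasting the OLS slope with a robust alternative. The robust estimator used here is the Theil–Sen (median-of-pairwise-slopes) estimator, chosen for illustration only; it is not necessarily the method the source has in mind. The data are hypothetical, with a true slope near 2 and one gross outlier.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 4.0, 6.1, 7.9, 10.0, 40.0]   # last point is a gross outlier

def ols_slope(x, y):
    """Ordinary least-squares slope."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def theil_sen_slope(x, y):
    """Median of all pairwise slopes: robust to a minority of outliers."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))]
    return statistics.median(slopes)

print(ols_slope(x, y), theil_sen_slope(x, y))
```

The OLS slope is dragged well above the true value by the single outlier, while the median-based slope stays close to 2.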

PCA is sensitive to outliers. Outliers unduly inflate classical (nonrobust) measures of variance, and since the PCs follow the directions of maximum variance, they will be attracted by outliers. Figure 3.8 (left) shows this effect for classical PCA. In Figure 3.8 (right), a robust version of PCA was used (the method is described in Section 3.5). Here the PCs are defined as directions maximizing a robust measure of variance (see Section 2.3), which is not inflated by the outlier group. As a result, the PCs explain the variability of the nonoutliers, which represent the reliable data information. [Pg.80]
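The inflation of a classical variance by an outlier group, versus the stability of a robust scale, can be shown directly. The robust measure below is the MAD (median absolute deviation), scaled by 1.4826 to be consistent with the standard deviation under normality; the data are illustrative.

```python
import statistics

# Seven regular observations plus an outlier group of two
data = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 30.0, 31.0]

var_classical = statistics.variance(data)

med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)
var_robust = (1.4826 * mad) ** 2   # MAD-based robust variance estimate

print(var_classical, var_robust)
```

The classical variance is dominated by the two outliers, while the MAD-based estimate reflects the spread of the regular observations only.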

As already noted in Section 3.4, outliers can strongly influence PCA. They can artificially increase the variance in an otherwise uninformative direction, which is then selected as a PCA direction. Especially for the goal of dimension reduction this is an undesirable feature, and it mainly occurs with classical estimation of the PCs. Robust estimation determines the PCA directions in such a way that a robust measure of variance is maximized instead of the classical variance. Essential features of robust PCA can be summarized as follows ... [Pg.81]

There are several important issues for PCA, such as the explained variance of each PC, which determines the number of components to select. Moreover, it is of interest whether outliers have influenced the PCA calculation, and how well the objects are represented in the PCA space. These and several other questions will be treated below. [Pg.89]
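The explained variance of each PC is its covariance-matrix eigenvalue divided by the total variance. For two variables the eigenvalues have a closed form, so the calculation can be sketched with the standard library only; the two-variable data below are a common toy example, not taken from the source.

```python
import math
import statistics

x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = statistics.variance(x)
syy = statistics.variance(y)
mx, my = statistics.fmean(x), statistics.fmean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix
d = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
lam1 = (sxx + syy + d) / 2
lam2 = (sxx + syy - d) / 2

explained = [lam1 / (lam1 + lam2), lam2 / (lam1 + lam2)]
print(explained)   # fraction of variance explained by PC1 and PC2
```

Here PC1 alone explains most of the variance, which is the usual basis for selecting the number of components.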

Methods of robust PCA are less sensitive to outliers and visualize the main data structure. One approach to robust PCA uses a robust estimation of the covariance matrix; another searches for a direction with a maximum of a robust variance measure (projection pursuit). [Pg.114]

Both assumptions are mainly needed for constructing confidence intervals and tests for the regression parameters, as well as prediction intervals for new observations in x. The assumption of normally distributed errors additionally excludes skewness and outliers, and a mean of 0 guarantees a linear relationship. The constant variance, also called homoscedasticity, is likewise needed for inference (confidence intervals and tests). This assumption is violated if the variance of y (which is equal to the residual variance σ2, see below) depends on the value of x, a situation called heteroscedasticity; see Figure 4.8. [Pg.135]

The objects span a nominal urea concentration range of 85.5 to 91.4%. The PLS model will be established on experimental Y reference values (crystallization temperature), corresponding to 92-107 °C. A model for urea concentrations can also be established following appropriate laboratory data [14,15]. The crystallization temperature model (no outliers) is able to describe 87% of the Y variance with three PLS components. [Pg.287]

Major steps in this type of analysis include initial data scaling and transformation, outlier detection, determination of the underlying factors, and evaluation of the effect that experimental procedures may have on the variance of the results. Most of the calculations were performed with the ARTHUR software package. [Pg.35]

An increase from 5 to 10 in the number of factors representing the original data results in a substantial reduction in the error. Because of the data pretreatment used, the spectral error cannot be directly compared to the experimental error determined from the data set. When five factors were used, two different lignite samples were flagged as possible outliers based on their spectral variances relative to the rest of the data set. With ten factors, one of the lignites was accommodated within the factor model (although ten factors may not have been required to accommodate it). With thirteen factors, both lignites were accommodated. [Pg.58]

Because principal component analysis attempts to account for all of the variance within a molecular dataset, it can be negatively affected by outliers, i.e., compounds having at least some descriptor values that are very different from others. Therefore, it is advisable to scale principal component axes or, alternatively, pre-process compound collections using statistical filters to identify and remove such outliers prior to the calculation of principal components. [Pg.287]
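One simple statistical filter of the kind mentioned above flags compounds whose value in any descriptor lies far from that descriptor's median on a robust z-scale. The threshold of 3 and the small descriptor matrix are illustrative assumptions, not a prescription from the source.

```python
import statistics

def flag_outliers(X, threshold=3.0):
    """X: rows = compounds, columns = descriptors.
    Flag row i if |value - column median| / (1.4826 * MAD) > threshold
    in any column."""
    flagged = set()
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        med = statistics.median(col)
        mad = statistics.median(abs(v - med) for v in col)
        if mad == 0:
            continue   # degenerate (near-constant) column; skip it
        for i, v in enumerate(col):
            if abs(v - med) / (1.4826 * mad) > threshold:
                flagged.add(i)
    return flagged

# Five hypothetical compounds, two descriptors; the last compound
# has an extreme value in the first descriptor
X = [[1.0, 0.20], [1.1, 0.10], [0.9, 0.30], [1.0, 0.25], [9.0, 0.20]]
print(flag_outliers(X))
```

Flagged compounds would then be removed (or inspected) before the principal components are calculated.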

PLS is related to principal components analysis (PCA) (20). This is a method used to project the matrix of the X-block, with the aim of obtaining a general survey of the distribution of the objects in the molecular space. PCA is recommended as an initial step before other multivariate analysis techniques, to help identify outliers and delineate classes. The data are randomly divided into a training set and a test set. Once the principal components model has been calculated on the training set, the test set may be applied to check the validity of the model. PCA differs most obviously from PLS in that it is optimized with respect to the variance of the descriptors. [Pg.104]

If a sample has been analyzed by k laboratories n times each, the sample variances of the n results from each laboratory can be tested for homogeneity, that is, any variance outliers among the laboratories can be detected. The ISO-recommended test is the Cochran test. The statistic that is tested is... [Pg.45]
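The Cochran statistic is the largest single laboratory variance divided by the sum of all laboratory variances, C = s²max / Σ s²i; the computed value is compared against a tabulated critical value that depends on k and n (not reproduced here). A minimal sketch with hypothetical replicate results:

```python
import statistics

def cochran_C(groups):
    """groups: list of k result lists, one per laboratory.
    Returns C = max variance / sum of variances."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / sum(variances)

labs = [
    [10.1, 10.2, 10.0],   # hypothetical replicate results, lab 1
    [10.0, 10.1, 10.1],   # lab 2
    [9.0, 11.0, 10.2],    # lab 3: noticeably larger spread
]
C = cochran_C(labs)
print(C)   # close to 1: lab 3 dominates the total variance
```

A value of C near 1/k indicates comparable variances, while a value near 1 points to a single variance outlier, as for the Cd result discussed above.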

The above analysis assumes that the results are normally distributed and without outliers. A Cochran test for homogeneity of variance and Grubbs's tests for single and paired outliers are recommended (see chapter 2). Data from... [Pg.146]
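Grubbs's single-outlier statistic is the largest absolute deviation from the mean in units of the sample standard deviation, G = max|xi − x̄| / s, compared against a t-based critical value depending on n and the significance level (not reproduced here). A sketch with hypothetical results:

```python
import statistics

def grubbs_G(values):
    """Grubbs's statistic for a single outlier: G = max|x - mean| / s."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / s

results = [10.1, 10.0, 10.2, 10.1, 9.9, 12.5]   # last value is suspect
print(grubbs_G(results))
```

The suspect value produces the maximum deviation, so G measures how far it sits from the bulk of the data relative to the overall spread.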

Finally, turning to the outliers of these series, the element lead seems to have been added almost at random to all the series of coins, though over time the use of additives seems to have become more consistent. There is certainly more lead present than is required to effect lower working temperatures for molten alloys. Other than this variance in lead, however, these coins have a notably consistent composition. It appears that the mintmasters and workmen of the foundries had a good understanding and control of copper, lead, and tin in the production of alloys. [Pg.244]

The box-whisker plot is a tool for representing the distribution parameters of variables. It can be used to provide a visual impression in the sense of the F- and t-tests: group means may be compared in relation to the group variances. A box-whisker plot may also be formed from robust statistical parameters: the median as the centre line of the box, percentiles as the edges of the box, and the minimum and maximum as horizontal lines outside the box. Outliers are points outside this structure. An example of measurements of trichloroethene in river water is given in Fig. 5-10. The five groups refer to... [Pg.150]
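The robust parameters behind a box-whisker plot can be computed directly; the Tukey convention of flagging points beyond 1.5 × IQR from the box edges is one common choice, used here for illustration on hypothetical data.

```python
import statistics

data = [1.2, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.1, 2.3, 6.0]  # 6.0: suspect value

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles; q2 is the median
iqr = q3 - q1

# Tukey fences: points beyond 1.5*IQR from the box edges count as outliers
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lo_fence or v > hi_fence]

print(q1, q2, q3, outliers)
```

The box spans q1 to q3 with the median as centre line, and the single extreme value falls outside the fences, exactly as it would appear as an isolated point in the plot.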

Lowess normalization methods are based on lowess (loess) scatterplot smoothing algorithms. The lowess smoother attempts to smooth contours within a dataset. Typically the lowess fit is robust to genes that are active in the treatment, as these appear as outliers (45). Some normalization methods include a print-tip normalization (46), since the physical location on the array and the print-tip may contribute some effect and variance beyond the biological and treatment variation. [Pg.539]
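A deliberately simplified lowess-style smoother is sketched below: for each point, a tricube-weighted local mean over a fixed-fraction neighbourhood. Real lowess fits local linear regressions and adds robustness iterations that down-weight outliers; this sketch only conveys the local-weighting idea, on hypothetical data.

```python
def simple_lowess(x, y, frac=0.5):
    """Tricube-weighted local mean over the nearest frac*n points.
    A simplified stand-in for lowess, for illustration only."""
    n = len(x)
    k = max(2, int(frac * n))
    smoothed = []
    for i in range(n):
        # indices of the k nearest neighbours of x[i]
        nbrs = sorted(range(n), key=lambda j: abs(x[j] - x[i]))[:k]
        dmax = max(abs(x[j] - x[i]) for j in nbrs) or 1.0
        w = [(1 - (abs(x[j] - x[i]) / dmax) ** 3) ** 3 for j in nbrs]
        sw = sum(w) or 1.0
        smoothed.append(sum(wj * y[j] for wj, j in zip(w, nbrs)) / sw)
    return smoothed

x = list(range(10))
y = [0.0, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0]
print(simple_lowess(x, y))
```

Because each smoothed value is a weighted average of nearby observations, isolated extreme values (such as treatment-active genes) pull the fitted curve only locally and weakly.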

