Outlier selection

The goal of exploratory data analysis (EDA) is to reveal structures, peculiarities and relationships in data. EDA can thus be seen as detective work by the data analyst. On the basis of its findings, methods of data preprocessing, outlier selection and statistical data analysis can be chosen. EDA is especially suited to interactive work on a computer (Buja et al. [1996]). Although graphical methods cannot substitute for statistical methods, they can play an essential role in recognizing relationships. An informative example regarding bivariate relationships has been shown by Anscombe [1973] (see also Danzer et al. [2001], p. 99). [Pg.268]

Another type of classification is outlier selection or contamination identification. As an example, in Fig. 4.23(b), the butter is the desired material and bacteria the contamination. An arbitrary threshold for this image would be 0.02, with all pixels > 0.02 considered suspect; because this is a food product, decontamination procedures would then hopefully be pursued. In these two examples of classification, only arbitrary thresholds have been defined and, as such, confidence in these classifications is lacking. This confidence can be achieved through statistical methods. Although this chapter is not the appropriate place for an involved discussion of the application of statistics to data analysis, we will give one example often used in chemometric classification. [Pg.108]

When the laboratory value is plotted against the NIR predicted value for the calibration sample set, it may well be noted that some points lie well away from the computed regression line. This will, of course, reduce the correlation between laboratory and NIR data and increase the SEC or SEP. These samples may be outliers. The statistic h_i describes the leverage or effect of an individual sample upon a regression. If a particular value of h_i is exceeded, this may be used to identify an outlier sample. Evaluation criteria for selecting outliers, however, are somewhat subjective, so expertise in multivariate methods is required to make outlier selection effective. [Pg.2249]
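As a rough illustration of the leverage statistic h_i, the following Python sketch computes the diagonal of the hat matrix for a simple straight-line regression. The data values, the function name `leverages` and the cut-off 2p/n are assumptions made for this example only; the excerpt does not state which limit for h_i should be used.

```python
import numpy as np

def leverages(X):
    """Return the leverages h_i, i.e. the diagonal of H = X (X^T X)^-1 X^T."""
    # With the reduced QR decomposition X = QR, the hat matrix is H = Q Q^T,
    # so h_i is the squared length of the i-th row of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

# Hypothetical NIR predicted values; the last sample lies far from the rest.
nir = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(nir), nir])  # intercept + slope columns

h = leverages(X)
n, p = X.shape
cutoff = 2 * p / n                 # assumed rule of thumb, not from the source
flagged = np.where(h > cutoff)[0]  # indices of high-leverage samples
print(h.round(3), flagged)
```

The leverages always sum to the number of fitted parameters p, which gives a quick sanity check. A high leverage alone does not prove that a sample is an outlier; it only marks the sample as worth a closer look.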

Once the quality of the dataset is defined, the next task is to improve it. Again, one has to remove outliers, identify and remove redundant objects (as they deliver no additional information), and finally, select the optimal subset of descriptors. [Pg.205]

Calculate all deviations (x(i) - x_m), and sort according to absolute size; calculate the median absolute deviation MAD; calculate cut-off limits for outliers according to Huber by assuming the recommended value for Huber's k (3.5). Different values for this multiplier can be selected. Display the cut-off limits and the clipped data set. [Pg.373]
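The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the deviations are taken from the median, the MAD is used unscaled (no normal-consistency factor), and the data values and the function name `huber_cutoffs` are hypothetical.

```python
import numpy as np

def huber_cutoffs(x, k=3.5):
    """Return (lower, upper) cut-off limits: median -/+ k * MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)                # central value (assumed: the median)
    mad = np.median(np.abs(x - med))  # median of the absolute deviations
    return med - k * mad, med + k * mad

# Hypothetical measurements with one suspicious value.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5])

lower, upper = huber_cutoffs(data)    # k = 3.5 as recommended in the text
clipped = data[(data >= lower) & (data <= upper)]
print(lower, upper)
print(clipped)
```

Passing a different `k` changes the multiplier, as the text allows. Because both the median and the MAD are insensitive to a few extreme values, the limits are not inflated by the very outliers they are meant to catch.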

A set of N VLE experimental data points have been made available. These data are the measurements of the state variables (T, P, x, y) at each of the N performed experiments. Prior to the estimation, one should plot the data and look for potential outliers as discussed in Chapter 8. In addition, a suitable EoS with the corresponding mixing rules should be selected. [Pg.242]

Several issues are important for PCA, such as the explained variance of each PC, which determines the number of components to select. It is also of interest whether outliers have influenced the PCA calculation, and how well the objects are represented in the PCA space. These and several other questions will be treated below. [Pg.89]
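To make the link between explained variance and the number of components concrete, here is a small Python sketch that computes the explained variance ratio of each PC from a singular value decomposition of the mean-centered data. The simulated data, the function name `explained_variance_ratio` and the 95 % cumulative-variance target are assumptions for illustration; the excerpt does not prescribe a selection rule.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance explained by each principal component."""
    Xc = X - X.mean(axis=0)                  # mean-center each variable
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s**2 / (X.shape[0] - 1)            # variance along each PC
    return var / var.sum()

# Simulated data: two strongly correlated variables plus one noise variable.
rng = np.random.default_rng(0)
t = rng.normal(size=50)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=50),
                     0.1 * rng.normal(size=50)])

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of PCs whose cumulative share reaches the assumed 95 % target.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(ratio.round(4), n_components)
```

Since classical PCA is based on variances, a single extreme object can dominate the first component, which is one reason to check for outliers before interpreting these ratios.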

Leardi, R. J. Chemom. 8, 1994, 65-79. Application of a genetic algorithm for feature selection under full validation conditions and to outlier detection. [Pg.206]

Cook's Distance Plot (Model Diagnostic) A statistic known as Cook's distance can be used to detect calibration data outliers by identifying which samples are most influential on the model. Now that the selected variables have been finalized, it is good practice to examine the calibration data for influential samples. These samples should be investigated and removed if it is determined that they have an unusual effect on the model. [Pg.313]
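A minimal Python sketch of Cook's distance for an ordinary least-squares fit is given below, using the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2. The calibration data, the function name `cooks_distance` and the 4/n flagging limit are assumptions for illustration; the excerpt itself relies on a plot rather than on a fixed limit.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each sample of an ordinary least-squares fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta                 # residuals e_i
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)             # leverages h_i (hat-matrix diagonal)
    s2 = resid @ resid / (n - p)         # residual variance estimate
    return resid**2 / (p * s2) * h / (1 - h)**2

# Hypothetical calibration data; the last response is far off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.0, 9.1, 10.9, 13.0, 15.1, 30.0])
X = np.column_stack([np.ones_like(x), x])

d = cooks_distance(X, y)
flagged = np.where(d > 4 / len(y))[0]    # assumed rule of thumb
print(d.round(3), flagged)
```

Cook's distance combines the size of the residual with the leverage, so a sample is flagged only when it both fits poorly and has enough pull to change the model. Flagged samples should still be investigated before removal, as the text advises.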

For this reason, it is of interest to learn the diverse types of calibration, together with their mathematical/statistical assumptions, the methods for validating these models and the possibilities of outlier detection. The objective is to select the calibration method that will be most suited for the type of analysis one is carrying out. [Pg.161]

Our objective in re-examining the 1446 pieces of Small Boy data has been to extract the maximum amount of information and the minimum amount of misinformation with the least amount of tampering. Our method has turned out to be a loop which we have traveled innumerable times. The first step of the loop was to choose, for a given laboratory, the best substantiated correlation available and select the outliers. We next traced the outliers through every other meaningful correlation to corroborate their spuriousness. The procedure was repeated for the next best substantiated correlation, and so on, as far as we could carry it. The data from the other laboratories were treated similarly in turn. We have thus examined the data exhaustively for mutual consistency. In many cases we were able to show that a datum violated more than one criterion, and we rejected it on that basis. In other cases, data were so far out of line that there was no question as to their abnormality. In still other situations we found that correlations could be established with the data from one laboratory but not with the data from another. We then rejected the irregular data in toto. [Pg.316]

A preliminary and essential step in a QSAR study is to evaluate the database to identify any outliers and hidden patterns, trends, and major groupings. Outliers refer to certain members of the database exhibiting mechanistic behaviors so different that the outlier cannot belong to the bulk of the data. Selecting suitable molecular descriptors, whether they are theoretical or empirical or are derived from readily available experimental characteristics of the structures, is an important step in the development of sound QSAR models. Many descriptors reflect simple molecular properties and thus can provide insight into the physicochemical nature of the activity or property under consideration. [Pg.139]

The criteria considered the distribution of the samples in the PCA score plot, and 10 samples were selected from among the outliers and from those representing the bulk of the analysed samples. [Pg.1086]

Selected Critical Values for DIXON Outlier Tests... [Pg.378]

Big Chemical Encyclopedia
