Outliers

The quality may suffer from the presence of so-called outliers, i.e., compounds that have low similarity to the rest of the dataset. The opposite problem may also occur: the dataset may contain too many highly similar objects. [Pg.205]

Once the quality of the dataset is defined, the next task is to improve it. Again, one has to remove outliers, find and remove redundant objects (as they deliver no additional information), and finally select the optimal subset of descriptors. [Pg.205]

An inspection of the cross-validation results revealed that all but one of the compounds in the dataset had been modeled well. The last (31st) compound behaved anomalously. When we looked at its chemical structure, we saw that it was the only compound in the dataset that contained a fluorine atom. What would happen if we removed the compound from the dataset? The quality of learning improved substantially: the cross-validation coefficient increased from 0.82 to 0.92, while the error decreased from 0.65 to 0.44. Another learning method, Kohonen's Self-Organizing Map, also failed to classify this 31st compound correctly. Hence, we had to conclude that the compound containing a fluorine atom was an obvious outlier of the dataset. [Pg.206]
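The effect described above can be illustrated with a minimal sketch. It assumes a one-descriptor linear model and takes the "cross-validation coefficient" to be the leave-one-out q² = 1 - PRESS/SS_total; neither assumption comes from the source, and the data are synthetic, with the last point deliberately placed off the trend.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model with object i held out, then predict object i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Synthetic data (not from the source): eight well-behaved objects
# plus one object, the last, that does not follow the trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 5.0]

q2_all = loo_q2(xs, ys)
q2_clean = loo_q2(xs[:-1], ys[:-1])  # same data without the suspect object
print(f"q2 with suspect:    {q2_all:.3f}")
print(f"q2 without suspect: {q2_clean:.3f}")
```

A large jump in q² after removing a single object is only a hint; as in the passage above, the decision is more convincing when it is backed by a structural reason or by a second, independent method.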

Returning to datasets from chemoinformatics, we may conclude that the first case stands for the complete degeneracy of the dataset, when all but one data object are redundant. The second case corresponds to the weird situation in which all the objects of the dataset are outliers. We have thus arrived at the extreme extents of the dataset complexity. [Pg.208]

We have already mentioned that real-world data have drawbacks which must be detected and removed. We have also mentioned outliers and redundancy. So far, only intuitive definitions have been given. Now, armed with information theory, we are going from the verbal model to an algebraic one. [Pg.212]

Equation (4) provides a value of AICO = 0.00852. As we see, the populations of classes 3 and 8 bear redundancy in information, as their JCOs are quite low here. In contrast, classes 6 and 7 are clearly outliers. Their JCOs are too high in compar-... [Pg.212]

We must now mention that traditionally, especially in chemometrics, outliers have a different definition and even a different interpretation. Suppose that we have a k-dimensional characteristic vector, i.e., k different molecular descriptors are used. If we imagine a k-dimensional hyperspace, then the dataset objects will occupy different places in it. Some of them will tend to group together, while others will be located in more remote regions. One can by convention define a margin beyond which the realm of strong outliers starts. "Moderate" outliers stay near this margin. [Pg.213]
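The geometric picture above can be sketched in a few lines. This is only an illustration under stated assumptions: distance is measured as Euclidean distance from the dataset centroid, the margin is supplied by the caller, and "near the margin" is taken to mean the outer 20% of the distance below the margin (the `band` parameter). None of these choices is prescribed by the source.

```python
import math


def centroid_distances(points):
    """Euclidean distance of each k-dimensional point from the centroid."""
    n = len(points)
    k = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(k)]
    return [math.dist(p, centroid) for p in points]


def classify_by_margin(points, margin, band=0.8):
    """Label each object relative to a conventional distance margin.

    `margin` and `band` are illustrative parameters, not values
    taken from the source text.
    """
    labels = []
    for d in centroid_distances(points):
        if d > margin:
            labels.append("strong outlier")
        elif d > band * margin:
            labels.append("moderate outlier")
        else:
            labels.append("inlier")
    return labels


# Hypothetical two-descriptor dataset: four clustered objects, one remote.
descriptors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(classify_by_margin(descriptors, margin=5.0))
```

Because the centroid itself is pulled toward remote objects, a robust center (for example, the coordinate-wise median) may be preferable in practice; the plain centroid is used here only to keep the sketch short.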

Another way to identify correlations is to plot the values of the parameters in graphical form; this can help to identify any correlations and the presence of outliers. A Craig plot is a two-dimensional scatterplot of one parameter against another; ideally, the molecules should sample from all four quadrants of the plot. [Pg.697]
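Whether a set of molecules samples all four quadrants of a Craig plot can be checked without drawing it. The sketch below simply counts points per quadrant; the parameter values are hypothetical, and points lying exactly on an axis are skipped by assumption.

```python
def quadrant_counts(pairs):
    """Count (x, y) parameter pairs in each quadrant of a Craig plot.

    Keys are sign patterns: "++" means x > 0 and y > 0, "-+" means
    x < 0 and y > 0, and so on. Points on an axis are ignored.
    """
    counts = {"++": 0, "-+": 0, "--": 0, "+-": 0}
    for x, y in pairs:
        if x == 0 or y == 0:
            continue
        key = ("+" if x > 0 else "-") + ("+" if y > 0 else "-")
        counts[key] += 1
    return counts


# Hypothetical values of two substituent parameters for six molecules.
parameters = [(0.2, 0.7), (0.5, -0.3), (-0.2, 0.6), (-0.7, -1.2),
              (0.1, 0.9), (0.0, 0.0)]
counts = quadrant_counts(parameters)
print(counts)
print("all quadrants sampled:", all(n > 0 for n in counts.values()))
```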

Ideally, the results should be validated somehow. One of the best methods for doing this is to make predictions for compounds known to be active that were not included in the training set. It is also desirable to eliminate compounds that are statistical outliers in the training set. Unfortunately, some studies, such as drug activity prediction, may not have enough known active compounds to make this step feasible. In this case, the estimated error in prediction should be increased accordingly. [Pg.248]

On occasion, a data set appears to be skewed by the presence of one or more data points that are not consistent with the remaining data points. Such values are called outliers. The most commonly used significance test for identifying outliers is Dixon's Q-test. The null hypothesis is that the apparent outlier is taken from the same population as the remaining data. The alternative hypothesis is that the outlier comes from a different population, and, therefore, should be excluded from consideration. [Pg.93]

The Q-test compares the difference between the suspected outlier and its nearest numerical neighbor to the range of the entire data set. Data are ranked from smallest to largest so that the suspected outlier is either the first or the last data... [Pg.93]
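The comparison described above, gap to the nearest neighbor divided by the range, can be sketched as follows. The function name and the sample data are illustrative. The critical value is passed in by the caller, and the value used in the usage lines is an assumption to be checked against a published Q table for the chosen sample size and confidence level.

```python
def dixon_q(values):
    """Return (suspect, Q) for the more extreme end of the ranked data.

    Q = |suspect - nearest neighbor| / (largest - smallest).
    """
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("Dixon's Q-test needs at least three values")
    spread = data[-1] - data[0]
    if spread == 0:
        raise ValueError("all values are identical; Q is undefined")
    gap_low = data[1] - data[0]
    gap_high = data[-1] - data[-2]
    # Test whichever end of the ranked data is further from its neighbor.
    if gap_high >= gap_low:
        return data[-1], gap_high / spread
    return data[0], gap_low / spread


# Made-up measurements; 12.0 is the suspected outlier.
measurements = [10.3, 10.1, 12.0, 10.5, 10.2, 10.4]
suspect, q_exp = dixon_q(measurements)

# Assumed critical value for n = 6 at the 95% level; verify it against
# the table you actually use before relying on the outcome.
q_crit = 0.625
print(f"suspect = {suspect}, Q = {q_exp:.3f}")
print("reject null hypothesis:", q_exp > q_crit)
```

If Q exceeds the critical value, the null hypothesis is rejected and the suspect value may be excluded; otherwise it must be retained.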

Statistical test for deciding if an outlier can be removed from a set of data. [Pg.93]

This experiment uses the change in the mass of a U.S. penny to create data sets with outliers. Students are given a sample of ten pennies, nine of which are from one population. The Q-test is used to verify that the outlier can be rejected. Class data from each of the two populations of pennies are pooled and compared with results predicted for a normal distribution. [Pg.97]

The absorbance of solutions of food dyes is used to explore the treatment of outliers and the application of the t-test for comparing means. [Pg.98]

Determine if there are any potential outliers in Sample 1, Sample 2, or Sample 3 at a significance level of α = 0.05. [Pg.101]

The detection of outliers, particularly when working with a small number of samples, is discussed in the following papers. Efstathiou, C. Stochastic Calculation of Critical Q-Test Values for the Detection of Outliers in Measurements, J. Chem. Educ. 1992, 69, 733-736. [Pg.102]

Kelly, P. C. Outlier Detection in Collaborative Studies, Anal. Chem. 1990, 73, 58-64. [Pg.102]

Dixon's Q-test: statistical test for deciding if an outlier can be removed from a set of data. (p. 93) Dropping mercury electrode: an electrode in which successive drops of Hg form at the end of a capillary tube as a result of gravity, with each drop providing a fresh electrode surface. (p. 509)... [Pg.771]

Usually, 10 to 20 measurements are made of the isotope ratio for one substance. Sometimes, one or more of these measurements appears to be sufficiently different from the mean value that the question arises as to whether or not it should be included in the set at all. Several statistical criteria are available for reaching an objective assessment of the reliability of the apparently rogue result (Figure 48.10). Such odd results are often called outliers, and ignoring them gives a more precise mean value (lower standard deviation). It is not advisable to remove such data more than once in any one set of measurements. [Pg.361]

Chauvenet t tables can be used to decide whether one measurement in a series of measurements is a true outlier or can be rejected as statistically not significant. [Pg.364]
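For orientation, the classic form of Chauvenet's criterion can be sketched without tables. This version assumes normally distributed measurements and flags a value when the expected number of readings at least that far from the mean, n times the two-sided tail probability, falls below 0.5. It is the common textbook formulation, not necessarily the tabulated procedure the source refers to, and the data are made up.

```python
import math
import statistics


def chauvenet_flags(values):
    """Return one boolean per value: True if Chauvenet's criterion flags it.

    Assumes a normal distribution with the sample mean and sample
    standard deviation as its parameters.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [False] * n
    flags = []
    for v in values:
        z = abs(v - mean) / sd
        tail = math.erfc(z / math.sqrt(2))  # two-sided P(|Z| >= z)
        flags.append(n * tail < 0.5)
    return flags


# Made-up repeated measurements of a single quantity.
readings = [10.1, 10.2, 10.3, 10.4, 10.5, 12.0]
for value, flagged in zip(readings, chauvenet_flags(readings)):
    print(value, "suspect" if flagged else "ok")
```

The criterion is meant to be applied once to a data set; re-running it on the trimmed data tends to strip away legitimate values.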

It is usually advisable to plot the observed pairs of y versus r, to support the linearity assumption and to detect potential outliers. Suspected outliers can be omitted from the least-squares fit and then subsequently tested on the basis of the least-squares fit. [Pg.502]
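The omit-then-test procedure can be sketched as follows. The line is fitted without the suspected point, and the suspect's residual is then compared with the residual standard deviation of that fit. The threshold of three standard deviations (`k=3.0`) is an illustrative assumption; a stricter treatment would use a prediction interval based on the t distribution. The data are synthetic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def check_suspect(xs, ys, suspect_index, k=3.0):
    """Fit without the suspect point, then test it against that fit.

    Returns (suspect_residual, residual_sd, is_outlier).
    """
    keep_x = [x for i, x in enumerate(xs) if i != suspect_index]
    keep_y = [y for i, y in enumerate(ys) if i != suspect_index]
    if len(keep_x) < 3:
        raise ValueError("need at least three retained points")
    a, b = fit_line(keep_x, keep_y)
    residuals = [y - (a + b * x) for x, y in zip(keep_x, keep_y)]
    # Two parameters were estimated, hence n - 2 degrees of freedom.
    residual_sd = math.sqrt(sum(r * r for r in residuals) / (len(keep_x) - 2))
    suspect_residual = ys[suspect_index] - (a + b * xs[suspect_index])
    return suspect_residual, residual_sd, abs(suspect_residual) > k * residual_sd


# Synthetic data: the point at index 4 lies well above the trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 2.9, 5.2, 6.8, 15.0, 10.9, 13.2, 14.8]
residual, sd, is_outlier = check_suspect(xs, ys, suspect_index=4)
print(f"residual = {residual:.2f}, residual SD = {sd:.2f}, outlier = {is_outlier}")
```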

Cropley made general recommendations to develop kinetic models for complicated rate expressions. His approach includes first formulating a hyperbolic non-linear model in dimensionless form by linear statistical methods. This way, essential terms are identified and others are rejected, to reduce the number of unknown parameters. Only toward the end, when the model is reduced to its essential parts, is non-linear estimation of parameters involved. His ten steps are summarized below. Their basis is a set of rate data measured in a recycle reactor using a sixteen-experiment fractional factorial experimental design at two levels in five variables, with three additional repeated centerpoints. To these are added two outlier... [Pg.140]

The root-mean-square error is the square root of the mean square error. Note that since the root-mean-square error involves the square of the differences, outliers have more influence on this statistic than on the mean absolute error. [Pg.333]
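A small numerical sketch makes the difference concrete. The two error lists are made up; they differ only in one large value.

```python
import math


def mean_absolute_error(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_square_error(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up prediction errors: identical except for one outlier.
without_outlier = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, 9.0]

for name, errors in (("no outlier", without_outlier), ("one outlier", with_outlier)):
    mae = mean_absolute_error(errors)
    rmse = root_mean_square_error(errors)
    print(f"{name}: MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Without the outlier both statistics equal 1. With it, the mean absolute error rises to 3, while the root-mean-square error rises to the square root of 21, about 4.58, because the single large error is squared before averaging.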

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...