
Mahalanobis distance methods

On large data sets (greater than 14 samples) the SIMCA and Mahalanobis distance methods perform better than the wavelength distance method because they use estimates of the underlying population variance/covariance matrix for sample classification. As the size of the training set decreases, the accuracy of these variance/covariance estimates decreases, and the performance of the Mahalanobis distance and SIMCA methods is reduced. When only small training sets are available, the wavelength distance method may perform better because its univariate means and standard deviations are likely estimated with more certainty than the multivariate means and variance/covariance matrices used by the Mahalanobis distance and SIMCA methods. [Pg.61]

The actual mathematics of the Mahalanobis distance calculation has been known for some time. In fact, this method has been applied successfully for spectral discrimination in a number of cases (Refs. 33-36). One of the main reasons the Mahalanobis distance method was chosen is that it is very sensitive to intervariable changes in the calibration data. In addition, the distance is measured in terms of standard deviations from the mean of the training samples. Not only does the calculation give very sensitive discrimination, but the reported matching values also provide a statistical measure of how well the spectrum of the unknown sample matches (or does not match) the original training spectra. [Pg.171]
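A minimal sketch of this kind of calculation on illustrative data (the matrices below are invented for demonstration; base R's mahalanobis() returns squared distances, so the square root is taken to express the result in standard-deviation-like units):

# Mahalanobis distance of unknown spectra to a training class (illustrative data)
set.seed(1)
train   <- matrix(rnorm(50 * 5), ncol = 5)             # 50 training spectra, 5 variables
unknown <- matrix(rnorm(3 * 5, mean = 0.5), ncol = 5)  # 3 unknown spectra

ctr <- colMeans(train)   # centroid of the training samples
S   <- cov(train)        # variance/covariance matrix of the training set

d <- sqrt(mahalanobis(unknown, center = ctr, cov = S))
d                        # distances in units analogous to standard deviations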

By calculating the sum of the squares of the spectral residuals across all the wavelengths, an additional representative value can be generated for each spectrum. The spectral residual is effectively a measure of the amount of each spectrum left over in the secondary or noise vectors. This value is the basis of another type of discrimination method known as SIMCA (Refs. 13, 36), which is similar to performing an F test on the spectral residual to determine outliers in a training set (see Outlier Sample Detection in Chapter 4). In fact, one group combined the PCA-Mahalanobis distance method with SIMCA to provide a biparametric method of discriminant analysis (Ref. 41). In this method, both the Mahalanobis distance test and the SIMCA test on the spectral residual had to be passed for a sample to be classified as a match. [Pg.177]
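A minimal sketch of such a spectral residual, computed from a PCA of illustrative training spectra (the data and the number of retained components k are assumptions for demonstration):

# SIMCA-style spectral residual: part of a spectrum not reproduced by the first k PCs
set.seed(2)
train   <- matrix(rnorm(40 * 20), ncol = 20)   # 40 training spectra, 20 wavelengths
unknown <- rnorm(20)                           # one unknown spectrum
k <- 3                                         # number of primary PCs retained

pca      <- prcomp(train, center = TRUE, scale. = FALSE)
P        <- pca$rotation[, 1:k]                # loadings of the primary PCs
xc       <- unknown - pca$center               # center the unknown spectrum
recon    <- P %*% (t(P) %*% xc)                # reconstruction from the k PCs
residual <- xc - recon                         # spectral residual vector
Q        <- sum(residual^2)                    # sum of squared residuals across wavelengths
Q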

Mark, H., Use of Mahalanobis distances to evaluate sample preparation methods for near-infrared reflectance analysis, Anal. Chem., 59, 790-795, 1987. [Pg.195]

Points with a constant Euclidean distance from a reference point (such as the center) are located on a hypersphere (in two dimensions, on a circle); points with a constant Mahalanobis distance from the center are located on a hyperellipsoid (in two dimensions, on an ellipse) that envelops the cluster of object points (Figure 2.11). That means the Mahalanobis distance depends on the direction. Mahalanobis distances are used in classification methods, by measuring the distances of an unknown object to prototypes (centers, centroids) of object classes (Chapter 5). A drawback of the Mahalanobis distance is that it requires the inverse of the covariance matrix, which cannot be calculated when variables are highly correlated. A similar approach without this drawback is the classification method SIMCA, based on PCA (Section 5.3.1, Brereton 2006; Eriksson et al. 2006). [Pg.60]

The Mahalanobis distance used for multivariate outlier detection relies on the estimation of a covariance matrix (see Section 2.3.2), in this case preferably a robust covariance matrix. However, robust covariance estimators like the MCD estimator need more objects than variables, and thus for many applications with m > n this approach is not possible. For this situation, other multivariate outlier detection techniques can be used, such as a method based on robustified principal components (Filzmoser et al. 2008). The R code to apply this method to a data set X is as follows ... [Pg.64]
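A minimal sketch of such a call, assuming the pcout() function of the R package mvoutlier (an implementation of the Filzmoser et al. 2008 approach); the function and the wfinal01 element named below come from that package rather than from the text above:

# Robust multivariate outlier detection via robustified principal components
library(mvoutlier)   # provides pcout()

res <- pcout(X)      # X: data matrix, objects in rows, variables in columns
res$wfinal01         # 0/1 flag per object: 0 indicates a potential multivariate outlier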

The distance between object points can be considered an inverse measure of the similarity of the objects. This similarity depends on the variables used and on the distance measure applied. The distances between the objects can be collected in a distance matrix. Most widely used is the Euclidean distance, the familiar distance measure extended to more than two or three dimensions. Other distance measures (city block distance, correlation coefficient) can be applied; of special importance is the Mahalanobis distance, which considers the spatial distribution of the object points (the correlation between the variables). Based on the Mahalanobis distance, multivariate outliers can be identified. The Mahalanobis distance is based on the covariance matrix of X; this matrix plays a central role in multivariate data analysis and should be estimated by appropriate methods, and in most cases robust methods are adequate. [Pg.71]

HCA is a common tool that is used to determine the natural grouping of objects, based on their multivariate responses [75]. In PAT, this method can be used to determine natural groupings of samples or variables in a data set. Like the classification methods discussed above, HCA requires the specification of a space and a distance measure. However, unlike those methods, HCA does not involve the development of a classification rule, but rather a linkage rule, as discussed below. For a given problem, the selection of the space (e.g., original x variable space, PC score space) and distance measure (e.g., Euclidean, Mahalanobis) depends on the specific information that the user wants to extract. For example, for a spectral data set, one can choose PC score space with Mahalanobis distance measure to better reflect separation that originates from both strong and weak spectral effects. [Pg.405]

Once the general library has been constructed, the products requiring the second identification step are pinpointed and the most suitable method for constructing each required sublibrary is chosen. Sublibraries can be constructed using various algorithms, including the Mahalanobis distance or the residual variance. The two are complementary, so which is better for the intended purpose should be determined on a case-by-case basis. [Pg.469]

To generate the dendrogram, HCA methods form clusters of samples based on their nearness in row space. A common approach is to initially treat every sample as a cluster and join the closest clusters together. This process is repeated until only one cluster remains. Variations of HCA use different approaches to measure distances between clusters (e.g., single vs. centroid linking, Euclidean vs. Mahalanobis distance). The two methods discussed below use single and centroid linking with Euclidean distances. [Pg.216]
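A minimal sketch of the two linkage variants with Euclidean distances, using base R's hclust() on illustrative data (hclust expects squared Euclidean distances for centroid linkage):

# Hierarchical cluster analysis with single and centroid linkage (illustrative data)
set.seed(3)
X <- matrix(rnorm(30 * 4), ncol = 4)      # 30 samples, 4 variables

D <- dist(X, method = "euclidean")        # pairwise Euclidean distances in row space

hc_single   <- hclust(D,   method = "single")    # single linkage
hc_centroid <- hclust(D^2, method = "centroid")  # centroid linkage on squared distances

plot(hc_single,   main = "Single linkage")       # dendrograms
plot(hc_centroid, main = "Centroid linkage")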

Moreover, because the squared Mahalanobis distance follows a chi-square distribution, as does the SIMCA distance used to define the class space in the SIMCA method (Sect. 4.3), it is possible to use Coomans diagrams (Sect. 4.3) both to visualize the results of modelling and classification (distance from two category centroids) and to compare two different methods (Mahalanobis distance from the centroids versus SIMCA distance). [Pg.119]
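A minimal sketch of a chi-square-based class boundary on the squared Mahalanobis distance (the data and the 97.5% quantile are illustrative choices, not taken from the text above):

# Chi-square cutoff for squared Mahalanobis distances to a class centroid
set.seed(4)
class1 <- matrix(rnorm(60 * 3), ncol = 3)   # 60 training objects of one class, p = 3
p      <- ncol(class1)

d2     <- mahalanobis(class1, colMeans(class1), cov(class1))  # squared distances
cutoff <- qchisq(0.975, df = p)                               # class boundary

inside <- d2 <= cutoff     # TRUE = object lies within the class space
table(inside)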

KNN)13,14 and potential function methods (PFMs).15,16 Modeling methods establish volumes in the pattern space with different bounds for each class. The bounds can be based on correlation coefficients, distances (e.g. the Euclidean distance in the Pattern Recognition by Independent Multicategory Analysis [PRIMA] method17 or the Mahalanobis distance in the Unequal [UNEQ] method18), the residual variance19,20 or supervised artificial neural networks (e.g. in the Multi-layer Perceptron21). [Pg.367]

Gemperline, P.J. and Boyer, N.R., Classification of near-infrared spectra using wavelength distances: comparison to the Mahalanobis distance and residual variance methods, Anal. Chem., 67, 160-166, 1995. [Pg.68]

A very useful method of discriminating between samples from different classes is to plot PCA or PLS scores in two or three dimensions. This is very similar to the Mahalanobis distance discussed earlier in Fig. 5-11, except that it is limited to two or three dimensions, whereas the Mahalanobis distance can be constructed for n dimensions. Score plots do provide a good visual understanding of the underlying differences between data from samples belonging to different classes. [Pg.289]
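A minimal sketch of such a score plot, using prcomp() on illustrative two-class data:

# Two-dimensional PCA score plot for visual class discrimination (illustrative data)
set.seed(5)
X     <- rbind(matrix(rnorm(25 * 6), ncol = 6),
               matrix(rnorm(25 * 6, mean = 1), ncol = 6))
class <- rep(c("A", "B"), each = 25)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(as.factor(class)), pch = 19,
     xlab = "PC 1 score", ylab = "PC 2 score",
     main = "Score plot for class discrimination")
legend("topright", legend = unique(class), col = 1:2, pch = 19)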

There are dimensionality issues. Later we propose Mahalanobis distance (Section 4.5) as a good metric for diversity analysis. With p descriptors in the data set, this metric effectively, if not explicitly, computes a covariance matrix with p(p + 1)/2 parameters. In order to obtain accurate estimates of the elements of the covariance matrix, one rule of thumb is that at least five observations per parameter should be made. This suggests that a data set with n observations can only investigate approximately √(2n/5) descriptors for the Mahalanobis distance computation. Thus, some method for subset selection of descriptors is needed. [Pg.80]

Mahalanobis distance. This method is popular with many chemometricians and, whilst superficially similar to the Euclidean distance, it takes into account that some variables may be correlated and so measure more or less the same properties. The distance between objects k and l is best defined in matrix terms by... [Pg.227]
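A minimal sketch of this matrix form for two illustrative objects, weighting the difference vector by the inverse of the sample covariance matrix:

# Mahalanobis distance between objects k and l in matrix terms (illustrative data)
set.seed(6)
X    <- matrix(rnorm(50 * 4), ncol = 4)
Sinv <- solve(cov(X))                 # inverse variance/covariance matrix

k <- 1; l <- 2
dvec    <- X[k, ] - X[l, ]
d_mahal <- sqrt(drop(t(dvec) %*% Sinv %*% dvec))
d_mahal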

The method can be extended to any number of classes. It is easy to calculate the Mahalanobis distance to several classes, and determine which is the most appropriate classification of an unknown sample, simply by finding the smallest class distance. More discriminant functions are required, however, and the computation can be rather complicated. [Pg.242]
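A minimal sketch of this nearest-class assignment for three illustrative classes (all names and data are invented for demonstration):

# Assign an unknown to the class with the smallest Mahalanobis distance
set.seed(7)
classes <- list(A = matrix(rnorm(40 * 3),            ncol = 3),
                B = matrix(rnorm(40 * 3, mean = 2),  ncol = 3),
                C = matrix(rnorm(40 * 3, mean = -2), ncol = 3))
unknown <- c(1.8, 2.1, 1.9)

d2 <- sapply(classes, function(Xc)
  mahalanobis(matrix(unknown, nrow = 1), colMeans(Xc), cov(Xc)))

names(which.min(d2))   # class with the smallest squared distance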

Identification Quality Assessment. In a first-of-its-kind paper, Rose in 1982 showed that a number of structurally similar penicillin-type drugs could be identified and determined by NIR. At Sandoz, Ciurczak in 1984 reported using Mahalanobis distance-based algorithms for the identification of raw materials. Ciurczak also reported the use of spectral matching (SM) and principal component analysis (PCA) for raw materials and, in 1986, suggested a method for introducing variations into samples for more robust equation development. NIR has been in use for raw material identification in companies worldwide ever since. [Pg.3437]

In a semiquantitative method, six wavelengths were used in a Mahalanobis distance calculation, and it was possible to distinguish the ETH extracts at concentrations below 0.05%. For quantitative analysis, multiple linear regression (MLR) was employed. The correlations obtained were r² = 0.85 (ETH) and r² = 0.86 (NOR). With low drug concentrations and a narrow range of values, the SECs were high. [Pg.93]

