Big Chemical Encyclopedia


Similarity measures: Mahalanobis distance

So far we have been considering leverage with respect to a point's Euclidean distance from an origin. But this is not the only measure of distance, nor is it necessarily the optimum measure in this context. Consider the data set shown in Figure E4. Points C and D are located at approximately equal Euclidean distances from the centroid of the data set. However, while point C is clearly a typical member of the data set, point D may well be an outlier. It would be useful to have a measure of distance that relates more closely to the similarity or difference of a data point with respect to a set of data points than simple Euclidean distance does. The various Mahalanobis distances are one family of such measures. Thus, while the Euclidean distances of points C and D from the centroid of the data set are equal, the Mahalanobis distances from the centroid are larger for point D than for point C. [Pg.185]
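A minimal sketch of this contrast, using made-up correlated data in place of the data set of Figure E4 (the points labelled C and D here are hypothetical stand-ins):

```python
# Contrast Euclidean and Mahalanobis distances from a centroid for a
# correlated two-variable data cloud (illustrative data, not from the source).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0],
                            cov=[[1.0, 0.9], [0.9, 1.0]], size=200)
centroid = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def euclidean(x, c):
    return float(np.linalg.norm(x - c))

def mahalanobis(x, c, cov_inv):
    d = x - c
    return float(np.sqrt(d @ cov_inv @ d))

# Point C lies along the correlation direction, point D across it;
# both are equally far from the centroid in Euclidean terms.
point_C = centroid + np.array([1.5, 1.5])
point_D = centroid + np.array([1.5, -1.5])

for name, p in [("C", point_C), ("D", point_D)]:
    print(name, round(euclidean(p, centroid), 2),
          round(mahalanobis(p, centroid, cov_inv), 2))
# Equal Euclidean distances, but a much larger Mahalanobis distance for D,
# which lies against the correlation structure of the cloud.
```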

Points with a constant Euclidean distance from a reference point (such as the center) lie on a hypersphere (in two dimensions, a circle); points with a constant Mahalanobis distance to the center lie on a hyperellipsoid (in two dimensions, an ellipse) that envelops the cluster of object points (Figure 2.11). That means the Mahalanobis distance depends on the direction. Mahalanobis distances are used in classification methods by measuring the distances of an unknown object to prototypes (centers, centroids) of object classes (Chapter 5). A drawback of the Mahalanobis distance is that it requires the inverse of the covariance matrix, which cannot be calculated when variables are highly correlated. A similar approach without this drawback is the classification method SIMCA, based on PCA (Section 5.3.1; Brereton 2006; Eriksson et al. 2006). [Pg.60]
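A hedged sketch of such a centroid-based classification; the two classes and the unknown object are invented, and the pseudo-inverse is shown only as one pragmatic workaround for a near-singular covariance matrix (SIMCA avoids the inversion altogether, as noted above):

```python
# Assign an unknown object to the class whose centroid is nearest in
# Mahalanobis terms (illustrative data).
import numpy as np

def mahalanobis_to_centroid(x, X_class):
    centroid = X_class.mean(axis=0)
    cov = np.cov(X_class, rowvar=False)
    # With highly correlated variables cov may be (near-)singular;
    # the pseudo-inverse is used here as a simple fallback.
    cov_inv = np.linalg.pinv(cov)
    d = x - centroid
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
class_A = rng.normal([0, 0], 1.0, size=(50, 2))
class_B = rng.normal([4, 4], 1.0, size=(50, 2))
unknown = np.array([3.2, 3.5])

distances = {"A": mahalanobis_to_centroid(unknown, class_A),
             "B": mahalanobis_to_centroid(unknown, class_B)}
print(distances, "->", min(distances, key=distances.get))
```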

The distance between object points can be considered an inverse similarity of the objects. This similarity depends on the variables used and on the distance measure applied. The distances between the objects can be collected in a distance matrix. Most widely used is the Euclidean distance, the common notion of distance extended to more than two or three dimensions. Other distance measures (city block distance, correlation coefficient) can be applied; of special importance is the Mahalanobis distance, which considers the spatial distribution of the object points (the correlation between the variables). Based on the Mahalanobis distance, multivariate outliers can be identified. The Mahalanobis distance is based on the covariance matrix of X; this matrix plays a central role in multivariate data analysis and should be estimated by appropriate methods, mostly robust ones. [Pg.71]
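A short sketch along these lines, assuming SciPy and scikit-learn are available; the data, the planted outlier, and the 97.5% chi-squared cutoff are all illustrative choices:

```python
# Distance matrices under different metrics, and robust Mahalanobis distances
# (MCD covariance estimate) for flagging multivariate outliers.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0],
                            [[1, .8, .5], [.8, 1, .6], [.5, .6, 1]], size=30)
X[0] = [3, -3, 3]                                   # plant one multivariate outlier

D_euclid = squareform(pdist(X, metric="euclidean"))
D_city   = squareform(pdist(X, metric="cityblock"))
VI = np.linalg.inv(np.cov(X, rowvar=False))
D_mahal  = squareform(pdist(X, metric="mahalanobis", VI=VI))

# Robust covariance gives outlier-resistant squared Mahalanobis distances.
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)
cutoff = chi2.ppf(0.975, df=X.shape[1])
print("suspected multivariate outliers:", np.where(d2_robust > cutoff)[0])
```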

Initially, cluster analysis defines a measure of similarity given by a distance, a correlation, or the information content. Distance can be measured as Euclidean distance, Mahalanobis distance, or Minkowski distance. Objects separated by a short distance are recognized as very similar, while objects separated by a great distance are dissimilar. The overall result of cluster analysis is reported as a dendrogram of the similarities obtained by many procedures. [Pg.130]
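A minimal hierarchical-clustering sketch of this workflow, with invented objects and average linkage chosen only for illustration:

```python
# Build a distance matrix and a dendrogram for two well-separated groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
objects = np.vstack([rng.normal(0, 1, (5, 4)),     # one group of similar objects
                     rng.normal(5, 1, (5, 4))])    # a clearly separated group

# Euclidean distances here; 'cityblock' or a precomputed Mahalanobis
# distance matrix could be substituted as the similarity measure.
Z = linkage(pdist(objects, metric="euclidean"), method="average")
info = dendrogram(Z, labels=[f"obj{i+1}" for i in range(10)], no_plot=True)
print(info["ivl"])   # leaf order: similar objects end up adjacent in the tree
```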

The leverage, h_ii, of the ith calibration sample is the ith diagonal element of the hat matrix, H. The leverage is a measure of how far the ith calibration sample lies from the other n - 1 calibration samples in X-space. The matrix H is called the hat matrix because it is a projection matrix that projects the vector y into the space spanned by the X matrix, thus producing y-hat. Notice the similarity between leverage and the Mahalanobis distance described in Chapter 4. [Pg.128]
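A small sketch of the hat matrix and the leverages it yields, using an assumed calibration set with an intercept column:

```python
# Hat (projection) matrix H = X (X'X)^(-1) X'; leverages are its diagonal.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))               # 20 calibration samples, 3 variables
X = np.column_stack([np.ones(20), X])      # add an intercept column

H = X @ np.linalg.inv(X.T @ X) @ X.T       # projects y onto the column space of X
leverage = np.diag(H)                      # h_ii for each calibration sample

print(leverage.round(3))
print("sum of leverages = number of model parameters:", round(leverage.sum(), 6))
```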

Mahalanobis distance. This method is popular with many chemometricians and, whilst superficially similar to the Euclidean distance, it takes into account that some variables may be correlated and so measure more or less the same properties. The distance between objects k and l is best defined in matrix terms by... [Pg.227]
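The equation itself is truncated in this excerpt. For reference, the standard matrix form of the squared Mahalanobis distance between objects k and l (not necessarily the exact expression the excerpt goes on to give), with C the covariance matrix of the variables, is:

```latex
d_{kl}^{2} = (\mathbf{x}_k - \mathbf{x}_l)^{\top}\, \mathbf{C}^{-1}\, (\mathbf{x}_k - \mathbf{x}_l)
```

When C is the identity matrix this reduces to the squared Euclidean distance, which is why the two measures look superficially similar.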

It is important to understand the applicability domain (AD) of any model. For example, care needs to be taken when applying models over time, as new compounds can be quite different from the compounds used to build the original predictive model. Such shifts in chemotype typically result in less accurate predictions. One way to gauge how well a model is expected to perform is to calculate a distance-to-model statistic. This uses the model descriptors to measure how similar a new compound is to the compounds used to create the model. For example, it can be based on the Euclidean or Mahalanobis distance from the compound to the model space. Apart from regularly rebuilding the model to take new chemical space into account, an elegant method to get the best possible predictions is to use associative or correction libraries. These use the most recent experimental data to adjust for errors in prediction. [Pg.503]
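A hedged sketch of such a distance-to-model check; the descriptor matrix, the candidate compound, and the 99% chi-squared cutoff are all assumptions made for the example:

```python
# Flag a new compound as outside the applicability domain if its Mahalanobis
# distance to the training-set descriptor space exceeds a chi-squared bound.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
train_descriptors = rng.normal(size=(200, 6))      # training-set descriptor matrix
new_compound = rng.normal(size=6) * 3.0            # candidate, deliberately far out

centroid = train_descriptors.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_descriptors, rowvar=False))
d = new_compound - centroid
d2 = float(d @ cov_inv @ d)                        # squared Mahalanobis distance

cutoff = chi2.ppf(0.99, df=train_descriptors.shape[1])
print("inside applicability domain:", d2 <= cutoff)
```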

Simplified equations result because A'A equals the identity matrix I, B'B equals I, and C'C equals I. The same equations are valid for PARAFAC models, but the middle cross-product is not the identity. To summarize the properties of the squared Mahalanobis distances, the following can be said. Leverage can be defined for multiple linear regression as an influence measure; it is related to a specific Mahalanobis distance. The term leverage is sometimes also used for similar Mahalanobis distances in low-rank regression methods such as PCR and PLS, where it becomes dependent on the rank of the model. Squared Mahalanobis distances can also be defined for PCA and multi-way models and can be calculated for both variables and objects. [Pg.173]
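For the multiple-linear-regression case, the link between leverage and the squared Mahalanobis distance can be checked numerically. The sketch below assumes an intercept in the model and distances taken from the column means, under which the familiar relation h_i = 1/n + d_i^2/(n - 1) holds:

```python
# Numerical check of h_i = 1/n + d_i^2 / (n - 1) for MLR with an intercept.
import numpy as np

rng = np.random.default_rng(6)
n, p = 25, 3
X = rng.normal(size=(n, p))

# Leverages from the hat matrix of the intercept-augmented design matrix.
X1 = np.column_stack([np.ones(n), X])
h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)

# Squared Mahalanobis distances from the centroid of the X variables.
Xc = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)   # row-wise quadratic forms

print(np.allclose(h, 1.0 / n + d2 / (n - 1)))  # True
```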

This distance in descriptor space measures how similar the investigated compound is to the training set compounds. The Mahalanobis distance is superior to the corresponding, and more familiar, Euclidean distance because it takes into account the correlations between the variables that normally exist; that is, the Mahalanobis distance does not assume orthogonal descriptors, as the Euclidean distance does. [Pg.1015]

In this matrix the most similar pair of (different) objects is (1,4), while the most divergent pair is (3,4). Apart from the classical Euclidean distance defined by Equation 8.2, some further relevant measures exist, such as the Mahalanobis or Manhattan distance. The Mahalanobis distance, for instance, which is important in classification (see Chapter 3.10), is computed according to... [Pg.53]
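A sketch of locating the most similar and most divergent pairs in a small distance matrix; the four object coordinates below are hypothetical, chosen only so that the result matches the pairs named in the passage:

```python
# Find the closest and the most distant object pairs in a Euclidean
# distance matrix of four objects.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],
              [1.5, 3.0],
              [5.0, 1.0],
              [0.8, 2.2]])                    # objects 1..4 (rows)

D = squareform(pdist(X, metric="euclidean"))
iu = np.triu_indices_from(D, k=1)             # each off-diagonal pair once
pairs = [(int(i) + 1, int(j) + 1) for i, j in zip(*iu)]   # 1-based labels
dists = D[iu]

print("most similar pair:", pairs[int(np.argmin(dists))])    # (1, 4)
print("most divergent pair:", pairs[int(np.argmax(dists))])  # (3, 4)
```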

Despite all the methods available, the remainder of this chapter will focus on a method that uses a technique called the Mahalanobis distance to measure the spectral similarity. [Pg.171]

By calculating the sum of the squares of the spectral residuals across all the wavelengths, an additional representative value can be generated for each spectrum. The spectral residual is effectively a measure of the amount of each spectrum left over in the secondary or noise vectors. This value is the basis of another type of discrimination method known as SIMCA (Refs. 13, 36). This is similar to performing an F test on the spectral residual to determine outliers in a training set (see Outlier Sample Detection in Chapter 4). In fact, one group combined the PCA-Mahalanobis distance method with SIMCA to provide a biparametric method of discriminant analysis (Ref. 41). In this method, both the Mahalanobis distance and the SIMCA test on the spectral residual had to pass in order for a sample to be classified as a match. [Pg.177]
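A hedged sketch in the spirit of such a biparametric check: a candidate spectrum is accepted only if both its Mahalanobis distance in PCA score space and its spectral residual fall below their limits. The spectra, the number of retained components, and both limits are illustrative assumptions, not the cited method's actual parameters.

```python
# Combine a score-space Mahalanobis distance with a residual-based check.
import numpy as np

rng = np.random.default_rng(7)
training = rng.normal(size=(40, 100))          # 40 training spectra, 100 wavelengths
candidate = rng.normal(size=100)

# PCA on mean-centered training spectra via SVD.
mean = training.mean(axis=0)
Xc = training - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3                                          # retained principal components (assumed)
scores_train = Xc @ Vt[:k].T

# Mahalanobis distance of the candidate in score space.
t = (candidate - mean) @ Vt[:k].T
S_inv = np.linalg.inv(np.cov(scores_train, rowvar=False))
d2 = float(t @ S_inv @ t)

# Spectral residual: what is left after projection onto the k components.
residual = (candidate - mean) - t @ Vt[:k]
q = float(residual @ residual)                 # sum of squared residuals
q_train = np.sum((Xc - scores_train @ Vt[:k]) ** 2, axis=1)

passes_mahalanobis = d2 < 3.0 ** 2             # illustrative distance limit
passes_residual = q < 3.0 * q_train.mean()     # crude F-test-like residual limit
print("classified as a match:", passes_mahalanobis and passes_residual)
```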

In the first step of HCA, a distance matrix is calculated that contains the complete set of interspectral distances. The distance matrix is symmetric about its diagonal and has dimension n x n, with n the number of patterns. Spectral distance can be obtained in different ways, depending on how the similarity of two patterns is calculated. Popular distance measures are the Euclidean distance, the city-block distance (Manhattan distance), the Mahalanobis distance, and so-called differentiation indices (D-values, see also Appendix B). [Pg.211]

