Euclidean distance with correlated variables

The problem lies in the model. The Euclidean distance calculation is inappropriate for correlated variables because it is based only on pairwise comparisons, without regard to the elongation of data point swarms along particular axes. In effect, Euclidean distance imposes a spherical constraint on the data set (18). When correlation has been removed from the data (by derivation of standardized characteristic vectors), Euclidean distance and average-linkage cluster analysis return the three groups. [Pg.66]
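
The decorrelation step mentioned above can be sketched for two variables. This is a minimal illustration, assuming that "standardized characteristic vectors" means principal component scores scaled to unit variance; the function name `standardized_pc_scores` and the sample points are hypothetical and not taken from the source.

```python
import math

def standardized_pc_scores(points):
    """Rotate 2-D points onto their principal axes and scale each axis to unit variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form principal-axis angle for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Variances along the two principal axes (the eigenvalues).
    l1 = sxx * c * c + 2 * sxy * s * c + syy * s * s
    l2 = sxx * s * s - 2 * sxy * s * c + syy * c * c
    scores = []
    for x, y in points:
        u, v = x - mx, y - my
        scores.append(((u * c + v * s) / math.sqrt(l1),
                       (-u * s + v * c) / math.sqrt(l2)))
    return scores

# Hypothetical, strongly correlated sample (an elongated swarm).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 5.9)]
scores = standardized_pc_scores(points)
for t1, t2 in scores:
    print(round(t1, 3), round(t2, 3))
```

After this transformation the two score columns are uncorrelated with unit variance, so the swarm is roughly spherical and the Euclidean distance between scores no longer favors the elongated axis.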

Points with a constant Euclidean distance from a reference point (like the center) are located on a hypersphere (in two dimensions, on a circle); points with a constant Mahalanobis distance to the center are located on a hyperellipsoid (in two dimensions, on an ellipse) that envelops the cluster of object points (Figure 2.11). That means the Mahalanobis distance depends on the direction. Mahalanobis distances are used in classification methods to measure the distances of an unknown object to prototypes (centers, centroids) of object classes (Chapter 5). A drawback of the Mahalanobis distance is that it requires the inverse of the covariance matrix, which cannot be calculated reliably when the variables are highly correlated. A similar approach without this drawback is the classification method SIMCA, based on PCA (Section 5.3.1; Brereton 2006; Eriksson et al. 2006). [Pg.60]
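
The direction dependence and the inversion problem can both be seen in a small two-variable sketch. The helper names (`mean_and_covariance`, `invert_2x2`, `mahalanobis`, `euclidean`) and the data are hypothetical, chosen only for illustration; the covariance matrix is the ordinary sample covariance.

```python
import math

def mean_and_covariance(points):
    """Return the mean and the 2x2 sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def invert_2x2(m):
    """Invert a 2x2 matrix; fail if it is singular (perfectly correlated variables)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return ((d / det, -b / det), (-c / det, a / det))

def mahalanobis(x, y, cov_inv):
    """Square root of the quadratic form (x - y) C^-1 (x - y)^T."""
    d0, d1 = x[0] - y[0], x[1] - y[1]
    q = (d0 * (cov_inv[0][0] * d0 + cov_inv[0][1] * d1)
         + d1 * (cov_inv[1][0] * d0 + cov_inv[1][1] * d1))
    return math.sqrt(q)

def euclidean(x, y):
    return math.hypot(x[0] - y[0], x[1] - y[1])

# Hypothetical cluster elongated along the diagonal.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 5.9)]
center, cov = mean_and_covariance(points)
cov_inv = invert_2x2(cov)

# Two probes at the same Euclidean distance from the center:
along = (center[0] + 1, center[1] + 1)   # along the cluster's long axis
across = (center[0] + 1, center[1] - 1)  # perpendicular to it

print(euclidean(along, center), euclidean(across, center))
print(mahalanobis(along, center, cov_inv), mahalanobis(across, center, cov_inv))
```

Both probes lie on the same circle around the center, yet the probe perpendicular to the cluster's long axis has a much larger Mahalanobis distance. With perfectly correlated data, for example `[(1, 2), (2, 4), (3, 6)]`, the determinant is zero and `invert_2x2` raises an error; with nearly perfect correlation the inverse exists but becomes numerically unstable.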

Mahalanobis distance. This method is popular with many chemometricians and, whilst superficially similar to the Euclidean distance, it takes into account that some variables may be correlated and so measure more or less the same properties. The distance between objects k and l is best defined in matrix terms by... [Pg.227]
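
The excerpt breaks off before the formula. For orientation only, the standard textbook definition is given here as an assumption about what followed, not as a quotation from the source. The symbols are also assumptions: $\mathbf{x}_k$ and $\mathbf{x}_l$ are the row vectors of measurements for objects k and l, and $\mathbf{C}$ is the variance-covariance matrix of the variables.

```latex
d_{kl} = \sqrt{(\mathbf{x}_k - \mathbf{x}_l)\,\mathbf{C}^{-1}\,(\mathbf{x}_k - \mathbf{x}_l)^{\mathsf{T}}}
```

Replacing $\mathbf{C}$ with the identity matrix reduces this expression to the Euclidean distance, which is why the two measures look superficially similar.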

Scaling of data is not necessary if the Mahalanobis distance is used. In addition, with this measure distortion occasioned by correlations of features or feature groups is avoided. In contrast, if the Euclidean distance were applied in the case of two highly correlated variables, these variables would be used as two independent features although they provide identical information. [Pg.173]
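
The scaling claim can be checked numerically. The sketch below assumes the covariance matrix is re-estimated from the rescaled data; the function names `mahalanobis_2d` and `rescale`, the data, and the factor of 1000 are hypothetical.

```python
import math

def mahalanobis_2d(x, y, points):
    """Mahalanobis distance between x and y using the sample covariance of `points`."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form d C^-1 d^T with the 2x2 inverse written out explicitly.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

def rescale(p, factor):
    """Multiply the first variable by `factor`, e.g. a change of units."""
    return (p[0] * factor, p[1])

points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
a, b = points[0], points[4]
factor = 1000.0  # hypothetical unit change for the first variable
scaled = [rescale(p, factor) for p in points]

d_raw = mahalanobis_2d(a, b, points)
d_scaled = mahalanobis_2d(rescale(a, factor), rescale(b, factor), scaled)
e_raw = math.dist(a, b)
e_scaled = math.dist(rescale(a, factor), rescale(b, factor))

print(d_raw, d_scaled)  # Mahalanobis distance: unchanged by the rescaling
print(e_raw, e_scaled)  # Euclidean distance: dominated by the rescaled variable
```

The Mahalanobis distance is the same before and after the unit change, while the Euclidean distance is driven almost entirely by the variable with the larger numerical range.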

Fluorescence spectra are collected under excitation conditions that are optimized to correlate the emission spectral features with parameters of interest. Principal components analysis (PCA) is then used to extract the desired spectral descriptors from the spectra. The PCA method provides a pattern recognition model that correlates the features of fluorescence spectra with chemical properties, such as polymer molecular weight and the concentration of the formed branched side product, also known as Fries's product, that are in turn related to process conditions. The correlation of variation in these spectral descriptors with variation in the process conditions is obtained by analyzing the PCA scores. The scores are analyzed for their Euclidean distances between different process conditions as a function of catalyst concentration. Reaction variability is similarly assessed by analyzing the variability between groups of scores under identical process conditions. As a result, the most appropriate process conditions are those that provide the largest differentiation between materials as a function of catalyst concentration and the smallest variability in materials between replicate polymerization reactions. [Pg.103]

Big Chemical Encyclopedia
