Similarity measures distance-based

Before delving into the specific similarity calculation, we start our discussion with the characteristics of attributes in multidimensional data objects. The attributes can be quantitative or qualitative, continuous or binary, nominal or ordinal, which determines the corresponding similarity calculation (Xu and Wunsch, 2005). Typically, distance-based similarity measures are used to measure continuous features, while matching-based similarity measures are more suitable for categorical variables. [Pg.90]

Distance-Based Similarity Measures Similarity measures determine the proximity (or distance) between two data objects. Multidimensional objects can be formalized as numerical vectors O_i = {o_ij | 1 ≤ j ≤ p}, where i indexes the data object and p is the number of dimensions of the data object O_i. Figure 5.1 provides an intuitive view of multidimensional data. The similarity between two objects can be measured by a distance function of the corresponding vectors O_i and O_j. ... [Pg.90]
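As an illustration of a distance function on two p-dimensional vectors, the following minimal sketch computes the Minkowski distance, which reduces to the city-block distance for r = 1 and to the Euclidean distance for r = 2. The excerpt does not say which distance functions the source goes on to define, so the function name `minkowski_distance` and the example vectors are assumptions made here for illustration only.

```python
def minkowski_distance(a, b, r=2.0):
    """Minkowski distance of order r between two equal-length numeric vectors.

    r = 1 gives the city-block (Manhattan) distance, r = 2 the Euclidean distance.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    if r < 1:
        raise ValueError("r must be >= 1 for the result to be a metric")
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


# Two hypothetical data objects with p = 3 attributes each.
o_i = [1.0, 2.0, 3.0]
o_j = [4.0, 6.0, 3.0]

print(minkowski_distance(o_i, o_j, r=1))  # city-block: 3 + 4 + 0 = 7.0
print(minkowski_distance(o_i, o_j, r=2))  # Euclidean: sqrt(9 + 16) = 5.0
```

A smaller distance means the two objects are closer, and therefore more similar, in the chosen attribute space.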

Matching-Based Similarity Measures For categorical attributes, distance-based similarity measures cannot be performed directly. The most straightforward way is to compare the similarity of objects for pairs of categorical attributes (Zhao and Karypis, 2005). For two objects that contain simple binary attributes, ... [Pg.92]
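The excerpt breaks off before giving a formula for binary attributes. As a hedged sketch only, two widely used matching-based coefficients for binary vectors are the simple matching coefficient (which counts shared 1s and shared 0s) and the Jaccard coefficient (which ignores shared 0s). Whether the source uses exactly these is an assumption; the function names are hypothetical.

```python
def binary_match_counts(a, b):
    """Return (n11, n00, mismatches) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of attributes")
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return n11, n00, len(a) - n11 - n00


def simple_matching(a, b):
    """Fraction of attributes on which the two objects agree (both 1 or both 0)."""
    n11, n00, mismatches = binary_match_counts(a, b)
    total = n11 + n00 + mismatches
    return (n11 + n00) / total if total else 1.0


def jaccard(a, b):
    """Shared 1s divided by attributes that are 1 in at least one object."""
    n11, _, mismatches = binary_match_counts(a, b)
    denominator = n11 + mismatches
    return n11 / denominator if denominator else 1.0


# Two hypothetical objects described by five binary attributes.
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(simple_matching(x, y))  # (2 + 1) / 5 = 0.6
print(jaccard(x, y))          # 2 / (2 + 2) = 0.5
```

The two coefficients differ only in how they treat joint absences, which matters when most attributes are 0 for most objects.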

The effectiveness of the BNB and NBN measures was assessed by simulated property-prediction experiments. These experiments involved the QSAR data sets studied previously by Pepperrell and Willett for the evaluation of distance-based similarity measures and a large set of 6-deoxyhexopyranose carbohydrates, which had previously been classified into 14 shape classes using numerical clustering methods based on torsional dissimilarity coefficients. The comparison encompassed the Bemis-Kuntz and Lederle measures, including not just the atom-triplet but also the atom-pair and atom-quadruplet versions of the former measure. The results were equivocal, in that it was impossible to... [Pg.36]

Fig. 30.5. The point i is equidistant to i' and i" according to the Euclidean distances (D_ii' and D_ii") but much closer to i' (cos θ_ii') than to i" (cos θ_ii") when a correlation-based similarity measure is applied.
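The situation in the caption can be reproduced with a small numerical sketch. The coordinates below are invented for illustration, and the cosine of the angle between vectors measured from the origin is used as a stand-in for the correlation-based measure; both choices are assumptions, not values taken from the figure.

```python
from math import dist, sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, measured from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical points: both neighbours lie at Euclidean distance 3 from i.
i = (3.0, 0.0)
i_prime = (6.0, 0.0)         # same direction as i, larger magnitude
i_double_prime = (3.0, 3.0)  # same distance from i, different direction

print(dist(i, i_prime), dist(i, i_double_prime))       # 3.0 3.0
print(cosine_similarity(i, i_prime))                   # 1.0
print(round(cosine_similarity(i, i_double_prime), 4))  # 0.7071
```

The Euclidean distance cannot separate the two neighbours, while the angular measure ranks the point lying in the same direction as the closer one.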

Distances in these spaces should be based upon an L1 or city-block metric (see Eq. 2.18) and not the L2 or Euclidean metric typically used in many applications. The reasons for this are the same as those discussed in Subheading 2.2.1. for binary vectors. Set-based similarity measures can be adapted from those based on bit vectors using an ansatz borrowed from fuzzy set theory (41,42). For example, the Tanimoto similarity coefficient becomes... [Pg.17]

Several types of similarity measures can be used to compare such objects [4]. The two most often used types are distance-based similarity coefficients and correlation coefficients. Representative examples are as follows ... [Pg.128]

Spectra and chemical structure searches are based on distance and similarity measures as introduced in Section 5.2. Different strategies are known: sequential search, search based on inverted lists, and hierarchical search trees. The strategies are explained for the search of spectra. [Pg.286]

A similarity measure, s_AB, between objects A and B, based on any distance measure, d_AB, can be defined as... [Pg.60]
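The defining formula is cut off in the excerpt and is not reconstructed here. Purely as an illustration of how a distance can be turned into a similarity, the sketch below shows two common conversions: scaling by a maximum distance and a reciprocal form. Neither is claimed to be the source's definition, and both function names are hypothetical.

```python
def similarity_scaled(d, d_max):
    """Map a distance in [0, d_max] to a similarity in [0, 1] as 1 - d / d_max.

    Assumption: d_max is the largest distance observed in the data set.
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    if not 0 <= d <= d_max:
        raise ValueError("d must lie between 0 and d_max")
    return 1.0 - d / d_max


def similarity_reciprocal(d):
    """Map a non-negative distance to a similarity in (0, 1] as 1 / (1 + d)."""
    if d < 0:
        raise ValueError("a distance cannot be negative")
    return 1.0 / (1.0 + d)


print(similarity_scaled(0.0, 10.0))   # identical objects: 1.0
print(similarity_scaled(10.0, 10.0))  # most distant pair: 0.0
print(similarity_reciprocal(1.0))     # 0.5
```

Both conversions decrease monotonically with distance, so they preserve the ranking of neighbours; they differ in whether the result depends on the rest of the data set through d_max.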

Spectral similarity search is a routine method for identification of compounds, and is similar to k-NN classification. For molecular spectra (IR, MS, NMR), more complicated, problem-specific similarity measures are used than criteria based on the Euclidean distance (Davies 2003; Robien 2003; Thiele and Salzer 2003). If the unknown is contained in the data base used (spectral library), identification is often possible; for compounds not present in the data base, k-NN classification may give hints as to which compound classes the unknown belongs. [Pg.231]

Samples joined at small distances are similar based on the measurement system. [Pg.43]

Distance-based methods require a definition of molecular similarity (or distance) in order to be able to select subsets of molecules that are maximally diverse with respect to each other or to select a subset that is representative of a larger chemical database. Ideally, to select a diverse subset of size k, all possible subsets of size k would be examined and a diversity measure of a subset (for example, average near neighbor similarity) could be used to select the most diverse subset. Unfortunately, this approach suffers from a combinatoric explosion in the number of subsets that must be examined and more computationally feasible approximations must be considered, a few of which are presented below. [Pg.81]
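One commonly used approximation that avoids enumerating every subset is greedy MaxMin selection: start from one item and repeatedly add the item whose nearest already-selected neighbour is farthest away. The sketch below is an assumption made for illustration and is not claimed to be one of the approximations the source presents; the Euclidean distance, the seed choice, and the example coordinates are all hypothetical.

```python
from math import dist


def maxmin_select(points, k, seed_index=0):
    """Greedily choose k point indices that are spread far apart (MaxMin)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [seed_index]
    # nearest[i] is the distance from point i to its closest selected point.
    nearest = [dist(p, points[seed_index]) for p in points]
    while len(selected) < k:
        candidates = (i for i in range(len(points)) if i not in selected)
        # Pick the candidate that is farthest from everything chosen so far.
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[best]))
    return selected


# Hypothetical 2D descriptors: two tight pairs and one outlying point.
molecules = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
print(maxmin_select(molecules, k=3))  # [0, 3, 4]
```

Each step costs time linear in the number of points, so selecting k items needs on the order of n·k distance evaluations instead of an examination of all subsets of size k. The result depends on the seed and is not guaranteed to be the most diverse subset.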

When used for relative similarity and diversity, only potential pharmacophores that contain the defined special centre-type are used. The frame of reference for similarity/diversity studies is thus changed to one that is focused on the feature of interest; distances are now measured relative to this special centre. For example, the special centre could be the centroid of a substructure [10] such as biphenyl tetrazole or diphenylmethane, enabling the calculation and comparison of all 3D pharmacophoric shapes that contain this substructure; the substructure is said to be "privileged". For structure-based design, the potential pharmacophores in a site can be restricted to those that contain a specific site point (e.g. in a pocket, or at the entrance to a pocket). In the context of combinatorial library design, the relative measure can be those pharmacophoric shapes that contain a special site-point that represents where the attachment point for a reagent would be. In figure 1, the special point would be centre-type number 3, which can be reserved for this purpose. [Pg.69]

The choices of representation, of similarity measure and of selection method are not independent of each other. For example, some types of similarity measure (specifically the association coefficients as exemplified by the well-known Tanimoto coefficient) seem better suited than others (such as Euclidean distance) to the processing of fingerprint data [12]. Again, the partition-based methods for compound selection that are discussed below can only be used with low-dimensionality representations, thus precluding the use of fingerprint representations (unless some drastic form of dimensionality reduction is performed, as advocated by Agrafiotis [13]). Thus, while this chapter focuses upon selection methods, the reader should keep in mind the representations and the similarity measures that are being used; recent, extended reviews of these two important components of diversity analysis are provided by Brown [14] and by Willett et al. [15]. [Pg.116]

However, these descriptor-based similarity definitions present only one class of available similarity and distance measures. Approaches to molecular... [Pg.126]

Analysis of molecular similarity is based on the quantitative determination of the overlap between fingerprints of the query structure and all database members. Because the descriptors of a given molecule can be regarded as a vector of real or binary attributes, most similarity measures are derived as vectorial distances. The Tanimoto and cosine coefficients are the most popular measures of similarity. Definitions of similarity metrics are collected in Table 3. [Pg.4017]
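For binary fingerprints, the Tanimoto and cosine coefficients can both be computed from three counts: the bits set in each fingerprint and the bits set in both. The sketch below represents a fingerprint as a Python set of on-bit positions; that representation, the bit positions, and the tiny database are assumptions for illustration and are not taken from Table 3.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient: common on-bits / on-bits present in either fingerprint."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def cosine(a, b):
    """Cosine coefficient: common on-bits / geometric mean of the on-bit counts."""
    if not a or not b:
        return 0.0  # convention assumed here when either fingerprint is empty
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical query and database fingerprints (sets of on-bit positions).
query = {1, 4, 7, 9}
database = {"mol_a": {1, 4, 8, 9}, "mol_b": {2, 3, 7}, "mol_c": {1, 4, 7, 9}}

print(tanimoto(query, database["mol_a"]))  # 3 / (4 + 4 - 3) = 0.6
print(cosine(query, database["mol_a"]))    # 3 / sqrt(16) = 0.75

# Rank database members by decreasing Tanimoto similarity to the query.
ranking = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
print(ranking)  # ['mol_c', 'mol_a', 'mol_b']
```

Both coefficients equal 1 for identical fingerprints and 0 when no bits are shared; for the same pair the cosine value is never smaller than the Tanimoto value.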

The simplest similarity measures s_st between two objects s and t are obtained from the distance measures, based on the two following similarity functions ... [Pg.399]

Recently, a number of modifications of the classical methods have appeared that incorporate the spatial distance among pixels as an additional criterion in the clustering schemes. Thus, similarity measures based on spectral distances, such as p, can be weighted by incorporating pixel neighboring information; for example, the Euclidean distance can be redefined as ... [Pg.84]

Big Chemical Encyclopedia