Similarity searching distance distribution

The investigations reported in this paper are only preliminary. They are driven by the obvious importance of similarity analyses in large databases and by a perceived need for a very flexible approach to the problem for both 2-D and 3-D molecular structures. The results obtained so far indicate that similarity searches can be effective at various levels of query definition (chemical, 2-D bonding pattern, and 3-D distance distribution), and that these levels can be combined or integrated depending upon the information coded in the query. What matters is (a) the selection of suitable attribute sets for each level of discrimination, (b) the compact storage of attributes for each database entry, and (c) the selection of effective metrics for assessing similarity at the various levels. [Pg.376]

The USR (Ultrafast Shape Recognition) Method. This method was reported by Ballester and Richards (53) for compound database search on the basis of molecular shape similarity. It was reportedly capable of screening billions of compounds for similar shapes on a single computer. The method is based on the notion that the relative positions of the atoms in a molecule are completely determined by the inter-atomic distances. Instead of using all inter-atomic distances, USR uses a subset of distances, which reduces the computational cost. Specifically, the distances from every atom of a molecule to each of four strategic points are calculated. Each set of distances forms a distribution, and the first three moments (mean, variance, and skewness) of each of the four distributions are calculated. Thus, for each molecule, 12 USR descriptors are calculated. The inverse of the translated and scaled Manhattan distance between two descriptor vectors is used to measure the similarity between the two molecules. A value of 1 corresponds to maximum similarity and a value of 0 corresponds to minimum similarity. [Pg.124]
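The USR procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the four strategic points are those of the original USR publication (the molecular centroid, the atom closest to the centroid, the atom farthest from the centroid, and the atom farthest from that atom), and it assumes the similarity takes the form 1 / (1 + mean absolute difference of the 12 descriptors). The function names `usr_descriptors` and `usr_similarity` are hypothetical, and published implementations differ in how they normalize the second and third moments.

```python
import numpy as np


def usr_descriptors(coords):
    """Return 12 USR-style descriptors for an (n_atoms, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)

    # Reference point 1: molecular centroid.
    ctd = coords.mean(axis=0)
    d_ctd = np.linalg.norm(coords - ctd, axis=1)

    # Reference points 2 and 3: atoms closest to and farthest from the centroid.
    cst = coords[np.argmin(d_ctd)]
    fct = coords[np.argmax(d_ctd)]

    # Reference point 4: atom farthest from reference point 3.
    d_fct = np.linalg.norm(coords - fct, axis=1)
    ftf = coords[np.argmax(d_fct)]

    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mean = d.mean()
        var = d.var()
        std = d.std()
        # Standardized skewness; defined as 0 for a degenerate distribution.
        skew = ((d - mean) ** 3).mean() / std ** 3 if std > 0 else 0.0
        descriptors.extend([mean, var, skew])
    return np.array(descriptors)


def usr_similarity(desc_a, desc_b):
    """Inverse of the translated (+1) and scaled (/12) Manhattan distance."""
    return 1.0 / (1.0 + np.abs(desc_a - desc_b).mean())


# Toy coordinates (arbitrary values, for illustration only).
mol = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 1.2, 0.0],
    [0.0, 1.2, 0.8],
    [2.3, 0.1, 1.1],
])
desc = usr_descriptors(mol)
print(len(desc), usr_similarity(desc, desc))
```

Because the descriptors depend only on distances, they do not change when the molecule is translated or rotated, so no alignment step is needed before comparing two molecules.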

The similarity between two structures is calculated by means of an association coefficient, usually the Tanimoto coefficient. Valence angles are 2D, and interatomic distances are 1D. Most 3D database searches have concentrated on interatomic distance information, e.g., hash codes for all distinct triplets of nonhydrogen atoms in pairs of isomeric structures. If one uses frequency distributions instead of hash codes, one can also compare nonisomeric molecules. [Pg.16]
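As a brief illustration of the Tanimoto coefficient mentioned above, the sketch below computes it for two structures represented as sets of features (for example, the positions of set bits in a fingerprint). The set representation, the function name `tanimoto`, and the convention of returning 1.0 for two empty sets are assumptions made for this example; the excerpt does not specify them.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient c / (a + b - c) for two feature collections.

    a and b are the numbers of distinct features in each structure,
    and c is the number of features they share.
    """
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty descriptions are identical
    common = len(a & b)
    return common / (len(a) + len(b) - common)


# Two hypothetical fingerprints that share 2 of 4 distinct features.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```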

Cluster analysis is a method for dividing a group of objects into classes so that similar objects are in the same class. As in PCA, the groups are not known prior to the mathematical analysis and no assumptions are made about the distribution of the variables. Cluster analysis searches for objects which are close together in the variable space. The distance, d, between two points in n-dimensional space with coordinates (x1, x2, ..., xn) and (y1, y2, ..., yn) is usually taken as the Euclidean distance defined by... [Pg.220]
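The excerpt breaks off before giving the formula. The standard Euclidean distance between two n-dimensional points is the square root of the sum of squared coordinate differences, and a minimal sketch of it follows. The function name `euclidean_distance` is chosen for this example and does not come from the source.

```python
import math


def euclidean_distance(x, y):
    """d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two points in 2-dimensional variable space.
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

In cluster analysis the variables are often scaled before distances are computed, because a variable with a large numerical range would otherwise dominate the sum.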

Big Chemical Encyclopedia
