Similarity Search Results, Ranking

Although machine learning models are usually built on a large number of descriptors, simple similarity searches in most cases first rank the compounds according to a single descriptor (for combining the results obtained with different descriptors, see below). To support the selection of the most appropriate descriptors, many studies have been published that assess the performance of different descriptors or descriptor combinations in similarity searches. Selected publications can be found in the reference list [16, 30, 41-47]. Most of them discuss the behavior of descriptors on individual test data sets in retrospective studies. They often focus on the comparison of 2D versus 3D descriptors and/or on the potential to detect new chemotypes. A detailed discussion of all these studies is beyond the scope of this contribution, but some general observations are presented. [Pg.72]

The relationship between IR and chemical similarity searching is discussed in detail by Edgar et al. (11) who summarize the various effectiveness measures in terms of the 2 x 2 contingency table shown in Table 1. In this table, it is assumed that a search has been carried out resulting in the retrieval of the n nearest neighbors at the top of the ranked output. Assume that these n nearest neighbors include a of the A active molecules in the complete database, which contains a total of N molecules. Then the recall, R, is defined to be the fraction of the active molecules that are retrieved, i.e.,... [Pg.54]
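
The quantities in this excerpt can be illustrated with a short Python sketch. It follows the verbal definition above (recall is the fraction a/A of all actives that appear among the n retrieved neighbors). Precision is taken here as a/n, the standard information-retrieval definition; it is not quoted from the excerpt. The function name and the example counts are hypothetical.

```python
def recall_precision(a: int, n: int, A: int) -> tuple[float, float]:
    """Return (recall, precision) for a ranked similarity search.

    a: actives among the n retrieved nearest neighbors
    n: number of nearest neighbors retrieved
    A: actives in the complete database
    """
    if a < 0 or a > n or a > A:
        raise ValueError("a must satisfy 0 <= a <= min(n, A)")
    recall = a / A if A else 0.0     # fraction of all actives that were retrieved
    precision = a / n if n else 0.0  # fraction of retrieved compounds that are active
    return recall, precision


# Hypothetical counts: 20 of the 50 actives appear in the top 100.
recall, precision = recall_precision(a=20, n=100, A=50)
print(f"recall={recall:.2f} precision={precision:.2f}")
```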

It is inconvenient to have to specify two measures, i.e., recall and precision, to quantify the effectiveness of a search. The Merck group have made extensive use of the enrichment factor, i.e., the number of actives retrieved relative to the number that would have been retrieved if compounds had been picked from the database at random (12). Thus, using the notation of Table 1, the enrichment factor at some point, n, in the ranking resulting from a similarity search is given by... [Pg.55]
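
As a rough illustration of the verbal definition above, the enrichment factor can be computed as the number of actives retrieved, a, divided by the number expected from a random selection of n compounds, n·A/N. The sketch below assumes exactly this reading of the text; the function name and the example counts are hypothetical.

```python
def enrichment_factor(a: int, n: int, A: int, N: int) -> float:
    """Actives retrieved in the top n, relative to random picking.

    a: actives among the n top-ranked compounds
    n: number of top-ranked compounds considered
    A: actives in the complete database
    N: total number of molecules in the database
    """
    if n <= 0 or A <= 0 or N <= 0:
        raise ValueError("n, A and N must be positive")
    expected_random = n * A / N  # actives expected when picking n compounds at random
    return a / expected_random


# Hypothetical counts: 20 actives in the top 100 of a 10,000-compound
# database that contains 50 actives in total.
print(enrichment_factor(a=20, n=100, A=50, N=10_000))
```

A value above 1 means the ranking places more actives in the top n than random selection would; a value of 1 means no improvement over random.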

Fig. 4.6 (a) Feature Tree search result based on query (left) retrieves a highly ranked (rank 5) active compound with low 2D similarity. The corresponding feature trees are also... [Pg.93]

After the selection of an active query ligand and a search database, it has to be decided which descriptor(s) and which similarity measure shall be employed. The result of a similarity search is a list of compounds (so-called nearest neighbors) that is ranked according to the similarity of the reference compound and the searched structures. The list can be limited to a certain number of nearest neighbors to the reference compound or to compounds with a similarity above a certain threshold. There is no general rule to determine this threshold, especially since there is no similarity value that clearly discriminates between active and inactive compounds. The number... [Pg.70]
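
The ranking and the two ways of limiting the hit list can be sketched in Python as follows. The excerpt does not fix a descriptor or a similarity measure, so this sketch assumes binary fingerprints represented as sets of feature indices and the Tanimoto coefficient as the measure. The compound names, fingerprints, and function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple


def tanimoto(x: FrozenSet[int], y: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two feature sets (assumed similarity measure)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def rank_neighbors(
    query: FrozenSet[int],
    database: Dict[str, FrozenSet[int]],
    top_k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Rank database compounds by decreasing similarity to the query.

    The list can be cut to the top_k nearest neighbors and/or to
    compounds whose similarity is at least `threshold`.
    """
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))  # ties broken by name
    if threshold is not None:
        scored = [item for item in scored if item[1] >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored


# Hypothetical fingerprints for illustration only.
query = frozenset({1, 2, 3, 4})
database = {
    "cpd_a": frozenset({1, 2, 3, 4, 5}),
    "cpd_b": frozenset({1, 2}),
    "cpd_c": frozenset({7, 8}),
    "cpd_d": frozenset({1, 2, 3}),
}

print(rank_neighbors(query, database, top_k=2))
print(rank_neighbors(query, database, threshold=0.5))
```

Both cutoffs are left as parameters because, as the text notes, no general rule determines a suitable threshold.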

The product-moment correlation coefficient is widely used in bivariate data analysis to measure the extent of the correlation between two variables, but it is not clear that it is necessarily appropriate to measure the extent of the similarity between two objects. Other sorts of correlation coefficient are available, such as the Spearman rank correlation coefficient which has been used by Manaut et al. as a measure of electrostatic similarity, but these have not found extensive application in similarity searching systems. Similar comments apply to probabilistic coefficients, which are calculated from the frequency distribution of descriptors in a database, and which Adamson and Bush found to give poor results when applied to 2D chemical structures. [Pg.21]
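
To make the difference between the two coefficients concrete, the following Python sketch computes the product-moment (Pearson) coefficient and the Spearman rank coefficient for two descriptor vectors. Spearman is implemented here as the Pearson coefficient of the rank-transformed values, with tied values given their average rank. The vectors are hypothetical and do not come from any of the cited studies.

```python
from statistics import mean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero variance.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in order[i : j + 1]:
            result[k] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical descriptor values with a monotonic but non-linear relation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(x, y), spearman(x, y))
```

For these values the rank coefficient reaches 1 because the ordering is identical, while the product-moment coefficient stays below 1 because the relation is not linear.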

Visual inspection of the top-ranking structures from similarity searches using the triangle counts suggests that these features are sufficiently discriminating to permit the detection of both shape and size similarity and that they result in the retrieval of substances that can be very different from those re-... [Pg.38]

The test query structures were selected from this subset to display a range of pattern attributes. The chemical diagrams of the structures were entered into the similarity search system using the Version 4 graphics interface, to imitate the normal situation when a user presents a structure to the system for searching. Using structures known to be in the database will allow the effectiveness of the search procedures to be monitored, since they should appear at the top of the ranked hit lists. Due to limited space, the results from only three of the test structures that were chosen are reported here. [Pg.360]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...