Dissimilarity-based selection

A number of different algorithms have been proposed to select a maximally diverse subset from a set of descriptors. Normally, the diverse subset is generated by selecting an initial molecule at random from the database, and then repeatedly selecting the molecule that is as different as possible from those that have already been selected [106-109]. [Pg.589]

The method can be improved by removing all compounds from the dataset that are similar to already selected ones in each iteration [110]. [Pg.589]
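The exclusion step described above can be sketched as follows. This is a minimal illustration, not the method of reference [110]: it assumes compounds are points in a numeric descriptor space, that Euclidean distance serves as the dissimilarity, and that "similar" means "within a user-chosen radius". The function name, the `radius` parameter, and the rule of picking the remaining compound farthest from its nearest selected neighbour are all assumptions made for the example.

```python
import math


def select_with_exclusion(points, radius, seed_index=0, dist=math.dist):
    """Pick diverse compounds, discarding near neighbours of each pick.

    points     -- sequence of descriptor vectors (tuples of floats)
    radius     -- hypothetical similarity cutoff; anything this close to a
                  selected compound is removed from further consideration
    seed_index -- index of the first compound (chosen at random in practice)
    """
    if not points:
        return []
    remaining = list(range(len(points)))
    selected = []
    current = seed_index
    while True:
        selected.append(current)
        # Exclusion step: drop the pick itself and everything similar to it.
        remaining = [i for i in remaining
                     if dist(points[i], points[current]) > radius]
        if not remaining:
            break
        # Next pick: the compound farthest from its nearest selected compound.
        current = max(
            remaining,
            key=lambda i: min(dist(points[i], points[j]) for j in selected),
        )
    return selected
```

A larger radius removes more of the dataset at each iteration, so the loop terminates sooner and returns a smaller, more widely spaced subset.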

In dissimilarity-based compound selection the required subset of molecules is identified directly, using an appropriate measure of dissimilarity (often taken to be the complement of the similarity). This contrasts with the two-stage procedure in cluster analysis, where it is first necessary to group together the molecules and then decide which to select. Most methods for dissimilarity-based selection fall into one of two categories: maximum dissimilarity algorithms and sphere exclusion algorithms [Snarey et al. 1997]. [Pg.699]
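As a small illustration of dissimilarity taken as the complement of similarity, the sketch below assumes that each molecule is represented by the set of bit positions switched on in a binary fingerprint and that the Tanimoto coefficient is the similarity measure. Both choices are assumptions for the example; the text above does not prescribe a particular coefficient.

```python
def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty fingerprints match
    return len(a & b) / len(a | b)


def dissimilarity(a, b):
    """Dissimilarity as the complement of similarity: 1 - S(a, b)."""
    return 1.0 - tanimoto_similarity(a, b)


# Two hypothetical fingerprints sharing 2 of 4 distinct bits.
mol_a = frozenset({1, 2, 3})
mol_b = frozenset({2, 3, 4})
print(dissimilarity(mol_a, mol_b))
```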

Since diversity is a collective property, its precise quantification requires a mathematical description of the distribution of the molecular collection in a chemical space. When one set of molecules is considered more diverse than another, its molecules cover more chemical space and/or are distributed more evenly in chemical space. Historically, diversity analysis is closely linked to compound selection and combinatorial library design. In reality, library design is also a selection process, selecting compounds from a virtual library before synthesis. There are three main categories of selection procedures for building a diverse set of compounds: cluster-based selection, partition-based selection, and dissimilarity-based selection. [Pg.39]

Dissimilarity analysis plays a major role in compound selection. Typical tasks include the selection of a maximally dissimilar subset of compounds from a large set or the identification of compounds that are dissimilar to an existing collection. Such issues have played a major role in compound acquisition in the pharmaceutical industry. A typical task would be to select a subset of maximally dissimilar compounds from a data set containing n molecules. This represents a non-trivial challenge because of the combinatorial problem involved in exploring all possible subsets. Therefore, other dissimilarity-based selection algorithms have been developed (Lajiness 1997). The basic idea of such approaches is to initially select a seed compound (either randomly or, better, based on dissimilarity to others), then calculate dissimilarity between the seed compound and all others and select the most dissimilar one. In the next step, the database compound most dissimilar to these two compounds is selected and added to the subset, and the process is repeated until a subset of desired size is obtained. [Pg.9]
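The iterative procedure just described can be sketched as below. The passage does not say how dissimilarity to several already selected compounds is combined, so this example assumes the common "MaxMin" rule: a candidate is scored by its distance to the nearest selected compound, and the highest-scoring candidate is taken. Euclidean distance over descriptor vectors and all function and parameter names are likewise assumptions for illustration.

```python
import math
import random


def max_dissimilarity_select(points, n_select, seed_index=None,
                             dist=math.dist, rng=None):
    """Greedy selection of n_select mutually dissimilar compounds (MaxMin)."""
    if n_select > len(points):
        raise ValueError("cannot select more compounds than are available")
    if n_select <= 0:
        return []
    rng = rng or random.Random(0)
    first = seed_index if seed_index is not None else rng.randrange(len(points))
    selected = [first]
    chosen = {first}
    # nearest[i]: distance from compound i to its closest selected compound.
    nearest = [dist(p, points[first]) for p in points]
    while len(selected) < n_select:
        # Take the unselected compound farthest from the selected set.
        nxt = max((i for i in range(len(points)) if i not in chosen),
                  key=nearest.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        # Only distances to the newest pick need to be computed.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[nxt]))
    return selected
```

Caching the nearest-selected distance keeps the cost at one pass over the dataset per selected compound, instead of recomputing every pairwise distance in each iteration.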

Clark [46] has recently described a subset selection algorithm called OptiSim which includes maximum and minimum dissimilarity based selection as special cases. A parameter is used to adjust the balance between representativeness and diversity in the compounds that are selected. [Pg.354]
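The sketch below is a simplified illustration in the spirit of OptiSim, not a reproduction of Clark's algorithm. It assumes the adjustable parameter is the size of a random subsample drawn at each iteration, from which the candidate farthest from its nearest selected compound is taken. It also assumes an exclusion radius for discarding candidates too close to the selected set, and it returns unpicked subsample members to the candidate pool, which is a simplification. The names `optisim_select`, `subsample_size`, and `radius` are hypothetical.

```python
import math
import random


def optisim_select(points, n_select, subsample_size, radius=0.0,
                   dist=math.dist, rng=None):
    """Subsample-based dissimilarity selection (simplified sketch)."""
    if not points or n_select <= 0:
        return []
    rng = rng or random.Random(0)
    candidates = list(range(len(points)))
    selected = [candidates.pop(rng.randrange(len(candidates)))]
    while candidates and len(selected) < n_select:
        subsample = []  # (distance to nearest selected compound, index)
        while candidates and len(subsample) < subsample_size:
            i = candidates.pop(rng.randrange(len(candidates)))
            d = min(dist(points[i], points[j]) for j in selected)
            if d > radius:  # candidates too close to the selection are dropped
                subsample.append((d, i))
        if not subsample:
            break
        _, best_index = max(subsample)
        selected.append(best_index)
        # Simplification: unpicked subsample members rejoin the pool.
        candidates.extend(i for _, i in subsample if i != best_index)
    return selected
```

In this sketch, `subsample_size=1` reduces to random picking subject to the exclusion radius, which favours representativeness, while a subsample as large as the dataset reduces to greedy maximum dissimilarity selection, which favours diversity.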

Figure 5.15 Dissimilarity-based selection methods: maximum dissimilarity approach.

Many different methods have been developed for compound selection. They include selection based on clustering compounds, dissimilarity-based selection, selection based on partitioning a collection of compounds into some multidimensional space, experimental design methods and the use of stochastic methods such as simulated annealing and genetic algorithms. Filtering techniques are often employed prior to compound selection to remove undesirable compounds. [Pg.259]

Maximum Dissimilarity-Based Selection. The original algorithm for dissimilarity ranking in the chemical structure context seems to have been proposed by Bawden, although the basic algorithm may be due to Kennard and Stone. The basic operation of a dissimilarity selection algorithm is to start with a compound selected at random and make this the first selected compound. Subsequent compounds are selected so that they are maximally dissimilar to all those in the currently selected set. Dissimilarity may be measured by... [Pg.23]

Fig. 1.17 Comparison of subset selection procedures based on compounds in the Diverse collection depicted in cyan (see Table 1.4 and Sect. 3.6.1 for details). Yellow dots represent compounds obtained by the subset selection procedures: (a) dissimilarity-based selection, (b) cell-based subset selection. (Figure kindly provided by Veer Shanmugasundaram)...
