Diverse subsets selection

When a chemistry space has been defined, a database can be mapped onto the space by assigning each molecule to a cell according to its properties, and a diverse subset can be selected by taking one or more molecules from each cell. Alternatively, a focused subset can be selected by choosing compounds from a limited number of cells, for example, from the cells adjacent to a cell occupied by a known active. The partitioning scheme is defined independently of... [Pg.201]
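A minimal sketch of cell-based selection as described above, assuming a uniform grid with the same number of bins on every axis. The function names, the toy property values, and the choice to include the active's own cell in the focused subset are illustrative assumptions, not taken from the source.

```python
import random
from collections import defaultdict
from itertools import product

def cell_of(props, lows, widths, bins):
    """Map a property vector to a tuple of bin indices, one per axis."""
    idx = []
    for x, lo, w in zip(props, lows, widths):
        i = int((x - lo) // w)
        idx.append(min(max(i, 0), bins - 1))  # clamp values onto the grid
    return tuple(idx)

def partition(mols, lows, widths, bins):
    """Assign every molecule to the cell that contains its properties."""
    cells = defaultdict(list)
    for name, props in mols.items():
        cells[cell_of(props, lows, widths, bins)].append(name)
    return cells

def diverse_subset(cells, per_cell=1, seed=0):
    """Take up to per_cell molecules from every occupied cell."""
    rng = random.Random(seed)
    picked = []
    for key in sorted(cells):
        members = cells[key]
        picked.extend(rng.sample(members, min(per_cell, len(members))))
    return picked

def focused_subset(cells, active_cell, bins):
    """Collect molecules from the active's cell and all adjacent cells."""
    picked = []
    for offset in product((-1, 0, 1), repeat=len(active_cell)):
        key = tuple(c + o for c, o in zip(active_cell, offset))
        if all(0 <= k < bins for k in key):
            picked.extend(cells.get(key, []))
    return picked

# Hypothetical two-descriptor data, each descriptor scaled to [0, 1).
mols = {"m1": (0.10, 0.10), "m2": (0.15, 0.12), "m3": (0.90, 0.90),
        "m4": (0.40, 0.10), "m5": (0.60, 0.60)}
cells = partition(mols, lows=(0.0, 0.0), widths=(0.25, 0.25), bins=4)
diverse = diverse_subset(cells)               # one molecule per occupied cell
focused = focused_subset(cells, (0, 0), bins=4)
```

Because the grid is defined by `lows`, `widths`, and `bins` alone, the same partition can be reused for any database mapped onto the space.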

Despite the many different approaches to diversity analysis, little has yet been done to determine which methods are the best. The studies carried out so far to validate the effectiveness of different structural descriptors in diversity analysis have normally used simulated property prediction experiments and examined the coverage of different bioactivity types in the diverse subsets selected. The most extensive studies have been performed by Brown and Martin [13,39] and by Matter [45],... [Pg.51]

The axes of a corporate standard chemistry space are intended to represent all aspects of molecular structure. Thus, all axes of the corporate chemistry space must be considered for purposes such as general diverse subset selection or rational compound acquisition. How-... [Pg.203]

Alternatively, the number of desired compounds can be predefined and a stochastic algorithm used to maximize the diversity of the selected set, although these methods are even slower than addition methods. Sphere-exclusion methods, which Pearlman calls "elimination" algorithms because the diverse subset is created by eliminating compounds from the superset, have been implemented in Diverse-Solutions (31) (see Section 2.2.1.1), providing a rapid distance-based diverse subset selection method. The minimum distance between nearest neighbors within the diverse subset is first defined; a compound is chosen at... [Pg.207]
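The excerpt is cut off before it says how each compound is chosen, so the following is only a generic sphere-exclusion sketch, not the Diverse-Solutions implementation. It assumes Euclidean distance on descriptor vectors and, as a labeled assumption, always picks the first remaining compound.

```python
import math

def sphere_exclusion(points, min_dist):
    """Return indices of a subset whose members are all >= min_dist apart."""
    remaining = list(range(len(points)))
    selected = []
    while remaining:
        pick = remaining[0]  # assumption: take the first remaining compound
        selected.append(pick)
        # Eliminate the pick and everything inside its exclusion sphere.
        remaining = [
            i for i in remaining
            if i != pick and math.dist(points[i], points[pick]) >= min_dist
        ]
    return selected

# Hypothetical 2-D descriptor vectors.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (2.0, 2.0)]
subset = sphere_exclusion(points, min_dist=0.5)
```

The size of the subset is not set in advance; it follows from the chosen minimum distance.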

Schmuker, M., Givehchi, A. and Schneider, G. (2004) Impact of different software implementations on the performance of the Maxmin method for diverse subset selection. Mol. Div., 8, 421–425. [Pg.1165]

Step 1 starts with a large set of seed solutions, which may be created by heuristics or by random generation. One possible implementation then generates a diverse subset of these by choosing some initial seed solution, then selecting a second one that maximizes the distance from the initial one. The third one maximizes the distance from the nearest of the first two, and so on. [Pg.408]
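The selection rule described above (each new solution maximizes the distance from the nearest of those already chosen) can be sketched as follows. Euclidean distance, the function name, and the toy points are assumptions made for illustration.

```python
import math

def maxmin_select(points, k, first=0):
    """Greedy max-min selection: return indices of k mutually distant points."""
    selected = [first]
    chosen = {first}
    # nearest[i] is the distance from point i to its closest selected point.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: nearest[i],
        )
        selected.append(nxt)
        chosen.add(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

# Hypothetical seed solutions embedded as 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
picked = maxmin_select(points, k=3)
```

Keeping the running `nearest` list means each step needs only one pass over the points instead of recomputing every pairwise distance.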

In chemoinformatics research, partitioning algorithms are applied in diversity analysis of large compound libraries, subset selection, or the search for molecules with specific activity (1-4). Widely used partitioning methods include cell-based partitioning in low-dimensional chemical spaces (1,3) and decision tree methods, in particular, recursive partitioning (RP) (5-7). Partitioning in low-dimensional chemical spaces is based on various dimension reduction methods (4,8) and often permits simplified three-dimensional representation of... [Pg.291]

Library design methods can be divided into reactant-based or product-based design. In reactant-based design, reactants are chosen without consideration of the products that will result. For example, diverse subsets of reactants are selected in the hope they will give rise to a diverse library of products. In product-based design, the selection of reactants is determined by analyzing the products that will be produced. [Pg.337]

Our approach to selecting a diverse subset is based on utilizing a minimum similarity between each molecule and all other molecules in the virtual library. For the 2-D fingerprints, the similarity is measured by a Tanimoto coefficient [20], which measures similarity on a pair-wise basis. A Tanimoto coefficient for any pair of molecular structures lies in the range of zero (dissimilar) to one (similar). It is defined as the ratio of the number of common bits (in this case molecular fragments) set in two molecules to the number of bits set in either. [Pg.229]
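As a small illustration of the definition above, the sketch below represents each fingerprint as the set of its on-bit positions. The handling of two empty fingerprints is an assumption, since the excerpt does not cover that case.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits common to both / bits set in either."""
    a, b = set(fp_a), set(fp_b)
    either = len(a | b)
    if either == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / either

# Hypothetical fingerprints given as on-bit positions.
fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5, 6}
sim = tanimoto(fp1, fp2)  # 2 common bits out of 6 set in either
```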

Chapman [44] describes a method for selecting a diverse set of compounds that is based on 3-D similarity. The diversity of a set of compounds is computed from the similarities between all conformers in the dataset, where multiple conformers are generated for each structure. The similarity between two conformers is determined by aligning them and measuring how well they can be superimposed in terms of steric bulk and polar functionalities. A diverse subset is built by adding one compound at a time and the compound that would contribute the most diversity to the subset is chosen in each step. The high computational cost of this method restricts its use to small datasets. [Pg.353]

Clark [46] has recently described a subset selection algorithm called OptiSim which includes maximum and minimum dissimilarity based selection as special cases. A parameter is used to adjust the balance between representativeness and diversity in the compounds that are selected. [Pg.354]

The compound selection methods described thus far can be used to select compounds for screening from an in-house collection, or to select which compounds to purchase from an external supplier. In combinatorial library design, however, it is necessary to select subsets of reactants for actual synthesis. The two main strategies for combinatorial library design are reactant-based selection and product-based selection. In reactant-based selection, optimized subsets of reactants are selected without consideration of the products that will result and any of the compound selection methods already identified can be used. An early example of reactant-based design is that already described by Martin and colleagues which is based on experimental design and where diverse subsets of reactants were selected for the synthesis of peptoid libraries [1]. [Pg.358]

In product-based selection, the properties of the resulting product molecules are taken into account when selecting the reactants. Typically this is done by enumerating the entire virtual library that could potentially be made. Any of the subset selection methods described previously could be used to select a diverse subset of products; however, the resulting subset is very unlikely to represent a combinatorial subset. This process is known as cherry-picking and is synthetically inefficient as far as combinatorial synthesis is concerned. Synthetic efficiency is maximized by taking the combinatorial... [Pg.358]
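To make the distinction concrete, the sketch below treats each product as a tuple of reactant labels and calls a product set combinatorial when it equals the full cross-product of the reactants it uses. That reading, and all names and labels, are assumptions for illustration.

```python
from itertools import product

def reactant_pools(products):
    """Return the set of reactants used at each position of the products."""
    products = list(products)
    if not products:
        return []
    width = len(products[0])
    return [{p[i] for p in products} for i in range(width)]

def is_combinatorial(products):
    """True if the products are exactly the cross-product of their reactants."""
    products = set(products)
    return products == set(product(*reactant_pools(products))) if products else True

# Hypothetical two-component library (amine label, acid label).
cherry_picked = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}
combinatorial = set(product(["A1", "A2"], ["B1", "B2"]))

# Number of distinct reactants needed for each selection.
cherry_cost = sum(len(pool) for pool in reactant_pools(cherry_picked))
combi_cost = sum(len(pool) for pool in reactant_pools(combinatorial))
```

In this toy case the cherry-picked set needs six reactants to make three products, while the combinatorial set yields four products from four reactants.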

There are dimensionality issues. Later we propose Mahalanobis distance (Section 4.5) as a good metric for diversity analysis. With p descriptors in the data set, this metric effectively, if not explicitly, computes a covariance matrix with p(p + 1)/2 parameters. In order to obtain accurate estimates of the elements of the covariance matrix, one rule of thumb is that at least five observations per parameter should be made. This suggests that a data set with n observations can only investigate approximately √(2n/5) descriptors for the Mahalanobis distance computation. Thus, some method for subset selection of descriptors is needed. [Pg.80]
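The rule of thumb can be checked numerically. The sketch below assumes the standard count of p(p + 1)/2 distinct entries in a symmetric p-by-p covariance matrix, from which the approximation sqrt(2n/5) follows by solving 5 * p^2 / 2 ≈ n; the function name is hypothetical.

```python
import math

def max_descriptors(n_obs, obs_per_param=5):
    """Largest p with obs_per_param * p * (p + 1) / 2 <= n_obs."""
    p = 0
    # (p + 1) * (p + 2) is always even, so the integer division is exact.
    while obs_per_param * (p + 1) * (p + 2) // 2 <= n_obs:
        p += 1
    return p

n = 1000
exact = max_descriptors(n)       # exact bound from the parameter count
approx = math.sqrt(2 * n / 5)    # closed-form approximation
```

For 1000 observations the exact bound and the approximation differ by about one descriptor, which is why the text says "approximately".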

Distance-based methods require a definition of molecular similarity (or distance) in order to be able to select subsets of molecules that are maximally diverse with respect to each other or to select a subset that is representative of a larger chemical database. Ideally, to select a diverse subset of size k, all possible subsets of size k would be examined and a diversity measure of a subset (for example, average near neighbor similarity) could be used to select the most diverse subset. Unfortunately, this approach suffers from a combinatorial explosion in the number of subsets that must be examined, so more computationally feasible approximations must be considered, a few of which are presented below. [Pg.81]
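A brief sketch of why exhaustive search is impractical. It scores every size-k subset by mean nearest-neighbor distance, used here as a stand-in for the average near-neighbor similarity mentioned above, and counts how many subsets would have to be examined. The Euclidean metric, function names, and toy data are assumptions.

```python
import math
from itertools import combinations

def mean_nn_distance(points, subset):
    """Average distance from each subset member to its nearest other member."""
    total = 0.0
    for i in subset:
        total += min(math.dist(points[i], points[j]) for j in subset if j != i)
    return total / len(subset)

def exhaustive_best(points, k):
    """Examine every size-k subset (k >= 2) and keep the most diverse one."""
    return max(combinations(range(len(points)), k),
               key=lambda s: mean_nn_distance(points, s))

# Tiny hypothetical data set: exhaustive search is feasible here.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
best = exhaustive_best(points, k=2)

# The number of subsets grows far too quickly for realistic databases.
small_count = math.comb(20, 5)
large_count = math.comb(1000, 10)
```

Even choosing 10 compounds from 1000 gives more than 10^23 candidate subsets, which is why greedy and elimination-style approximations are used instead.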
