Compound selection, dissimilarity-based

Much research has been done into similarity searching, and it is often assumed that diversity, or dissimilarity, is the converse (i.e., 1 − similarity). Any similarity measure involves three main components: the structural descriptors that are used to characterize the molecules, the weighting scheme used to differentiate important from less important characteristics, and the similarity coefficient that is used to quantify the degree of similarity between pairs of molecules. Three types of descriptor have been used: fragment substructures, topological indexes, and global physical properties. One of the most common similarity coefficients is that due to Tanimoto. There are three main methods of compound selection: cluster-based, dissimilarity-based, and partition-based. [Pg.416]
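
As a minimal sketch of the Tanimoto coefficient and its complement, the following assumes binary fingerprints represented as Python sets of "on" bit positions. The representation, the function names, and the example fingerprints are illustrative assumptions rather than anything specified in the excerpt.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two binary fingerprints given as sets of on-bits.

    Computed as c / (a + b - c), where a and b are the numbers of bits set in
    each fingerprint and c is the number of bits they have in common.
    """
    if not fp_a and not fp_b:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def dissimilarity(fp_a, fp_b):
    """Dissimilarity taken as the complement of similarity (1 - Tanimoto)."""
    return 1.0 - tanimoto(fp_a, fp_b)


# Hypothetical fingerprints: two shared bits out of five distinct bits.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
print(tanimoto(fp_a, fp_b))       # 2 / (4 + 3 - 2) = 0.4
print(dissimilarity(fp_a, fp_b))
```

The weighting scheme mentioned in the text is not modelled here; every bit counts equally.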

In dissimilarity-based compound selection the required subset of molecules is identified directly, using an appropriate measure of dissimilarity (often taken to be the complement of the similarity). This contrasts with the two-stage procedure in cluster analysis, where it is first necessary to group together the molecules and then decide which to select. Most methods for dissimilarity-based selection fall into one of two categories: maximum dissimilarity algorithms and sphere exclusion algorithms [Snarey et al. 1997]. [Pg.699]

A major potential drawback with cluster analysis and dissimilarity-based methods for selecting diverse compounds is that there is no easy way to quantify how completely one has filled the available chemical space or to identify whether there are any holes. This is a key advantage of the partition-based approaches (also known as cell-based methods). A number of axes are defined, each corresponding to a descriptor or some combination of descriptors. Each axis is divided into a number of bins. If there are N axes and each is divided into b bins then the number of cells in the multidimensional space so created is ... [Pg.701]
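
The cell-based idea can be sketched as below. The example assumes equal-width bins, the same number of bins on every axis, and a small set of made-up two-descriptor compounds; the function names, value ranges, and data are hypothetical. With the same bin count on every axis, the total number of cells is the bin count raised to the number of axes, which is a standard counting fact rather than a quotation from the excerpt.

```python
from collections import Counter


def bin_index(value, low, high, bins):
    """Map a descriptor value onto an equal-width bin index in [0, bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    # Clamp so that values at or beyond the range edges fall in the end bins.
    return min(bins - 1, max(0, int(fraction * bins)))


def occupied_cells(descriptor_rows, axis_ranges, bins):
    """Count compounds per cell; each cell is a tuple of per-axis bin indices."""
    counts = Counter()
    for row in descriptor_rows:
        cell = tuple(
            bin_index(value, low, high, bins)
            for value, (low, high) in zip(row, axis_ranges)
        )
        counts[cell] += 1
    return counts


# Hypothetical compounds described by two descriptors each.
compounds = [(120.0, 1.2), (135.0, 1.4), (410.0, 4.8), (300.0, 2.5)]
axis_ranges = [(100.0, 500.0), (0.0, 5.0)]
bins = 4

counts = occupied_cells(compounds, axis_ranges, bins)
total_cells = bins ** len(axis_ranges)   # same bin count on each axis
empty_cells = total_cells - len(counts)  # the "holes" in chemical space
print(f"{len(counts)} of {total_cells} cells occupied, {empty_cells} empty")
```

Because every cell has an explicit address, empty cells can be listed directly, which is the coverage information that the clustering and dissimilarity-based methods do not provide.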

Dissimilarity-based compound selection (DBCS) methods involve selecting a subset of compounds directly based on pairwise dissimilarities [37]. The first compound is selected, either at random or as the one that is most dissimilar to all others in the database, and is placed in the subset. The subset is then built up stepwise by selecting one compound at a time until it is of the required size. In each iteration, the next compound to be selected is the one that is most dissimilar to those already in the subset, with the dissimilarity normally being computed by the MaxMin approach [38]. Here, each database compound is compared with each compound in the subset and its nearest neighbor is identified; the database compound that is selected is the one that has the maximum dissimilarity to its nearest neighbor in the subset. [Pg.199]
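
The stepwise MaxMin procedure described above might look like the following sketch. It assumes set-based binary fingerprints and a Tanimoto-complement dissimilarity; the function names, the incremental nearest-neighbor bookkeeping, and the example data are illustrative choices, not details given in the excerpt.

```python
import random


def tanimoto_dissimilarity(fp_a, fp_b):
    """1 - Tanimoto for fingerprints stored as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints are identical
    return 1.0 - len(fp_a & fp_b) / union


def maxmin_select(fingerprints, subset_size, seed_index=None, rng=None):
    """Return indices of a subset built by MaxMin dissimilarity selection."""
    n = len(fingerprints)
    if not 0 < subset_size <= n:
        raise ValueError("subset_size must be between 1 and len(fingerprints)")
    if seed_index is None:
        # Random seed compound; the text also allows a "most dissimilar" seed.
        seed_index = (rng or random.Random()).randrange(n)

    selected = [seed_index]
    chosen = {seed_index}
    # nearest[i]: dissimilarity of compound i to its nearest neighbor in the subset.
    nearest = [
        tanimoto_dissimilarity(fp, fingerprints[seed_index]) for fp in fingerprints
    ]
    while len(selected) < subset_size:
        candidates = [i for i in range(n) if i not in chosen]
        # Pick the candidate whose nearest subset neighbor is farthest away.
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        chosen.add(best)
        # Only the newly added compound can shrink a nearest-neighbor distance.
        for i in range(n):
            d = tanimoto_dissimilarity(fingerprints[i], fingerprints[best])
            if d < nearest[i]:
                nearest[i] = d
    return selected


# Hypothetical fingerprints: two loose groups of similar compounds.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
print(maxmin_select(fps, subset_size=3, seed_index=0))  # [0, 2, 1]
```

Keeping a running nearest-neighbor value for each compound avoids recomparing every candidate with the whole subset in each iteration.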

Snarey M, Terrett NK, Willett P, Wilton DJ. Comparison of algorithms for dissimilarity-based compound selection. J Mol Graph Model 1997, 15, 372-85. [Pg.206]

Willett, P. Dissimilarity-based algorithms for selecting structurally diverse sets of compounds. J. Comput. Biol. 1999, 6, 447-457. [Pg.172]

Lajiness, M.S. Dissimilarity-based compound selection techniques. Perspect. Drug Discov. Des., 1997, 7/8, 65-84. [Pg.332]

The D-score is computed using the maximum dissimilarity algorithm of Lajiness (20). This method utilizes a Tanimoto-like similarity measure defined on a 360-bit fragment descriptor used in conjunction with the Cousin/ChemLink system (21). The important feature of this method is that it starts with the selection of a seed compound, with subsequent compounds selected based on the maximum diversity relative to all compounds already selected. Thus, the most obvious seed to use in the current scenario is the compound that has the best profile based on the already computed scores. One therefore needs to compute a preliminary consensus score based on the Q-score and the B-score using weights as defined previously. To summarize this, one needs to... [Pg.121]

Gillet, V. J. and Willett, P. (2001) Dissimilarity-based compound selection for library design. In Combinatorial library design and evaluation. Principles, software tools and applications in drug discovery, Ghose, A. K. and Viswanadhan, V. N. (eds.), Marcel Dekker, New York, pp. 379-398. [Pg.352]

Since diversity is a collective property, its precise quantification requires a mathematical description of the distribution of the molecular collection in a chemical space. When one set of molecules is considered to be more diverse than another, the molecules in this set cover more chemical space and/or are distributed more evenly in chemical space. Historically, diversity analysis is closely linked to compound selection and combinatorial library design. In reality, library design is also a selection process, selecting compounds from a virtual library before synthesis. There are three main categories of selection procedures for building a diverse set of compounds: cluster-based selection, partition-based selection, and dissimilarity-based selection. [Pg.39]

The DIVSEL program was developed by Pickett et al. for combinatorial reagent selection using three-point pharmacophores as the descriptor for similarity calculations [2]. The algorithm starts by selecting the compound most dissimilar to the others in the set and then iteratively selects compounds most dissimilar to those already selected. DIVSEL was used to select a set of carboxylic acids from a collection of 1100 monocarboxylic acids for an amide library, based on the pharmacophoric diversity of the products. Eleven diverse amines were selected based on pharmacophoric diversity. A virtual library of 12100 amides was constructed from the 11 amines and 1100 carboxylic acids. The DIVSEL program used the pharmacophore fingerprints for the product virtual library to select a diverse set of the carboxylic acids. The products of 90 acids with the 11 amines selected with DIVSEL covered 85% of the three-point pharmacophores represented by the entire 12100 compound virtual library. [Pg.194]

Dissimilarity analysis plays a major role in compound selection. Typical tasks include the selection of a maximally dissimilar subset of compounds from a large set or the identification of compounds that are dissimilar to an existing collection. Such issues have played a major role in compound acquisition in the pharmaceutical industry. A typical task would be to select a subset of maximally dissimilar compounds from a data set containing n molecules. This represents a non-trivial challenge because of the combinatorial problem involved in exploring all possible subsets. Therefore, other dissimilarity-based selection algorithms have been developed (Lajiness 1997). The basic idea of such approaches is to initially select a seed compound (either randomly or, better, based on dissimilarity to others), then calculate dissimilarity between the seed compound and all others and select the most dissimilar one. In the next step, the database compound most dissimilar to these two compounds is selected and added to the subset, and the process is repeated until a subset of desired size is obtained. [Pg.9]

A variety of different compound selection methods have been developed. These techniques depend on the use of molecular descriptors, which are numerical values that characterize the properties of molecules. In addition, many compound selection methods are based on quantifying the degree of similarity or dissimilarity of compounds based on molecular descriptors. This requires the use of similarity or distance coefficients. [Pg.347]

Many different methods have been developed for compound selection. They include clustering, dissimilarity-based compound selection, partitioning a collection of compounds into a low-dimensional space and the use of optimization methods such as simulated annealing and genetic algorithms. Filtering techniques are often employed prior to compound selection to remove undesirable compounds. [Pg.351]

Dissimilarity-Based Compound Selection (DBCS) involves identifying directly the subset comprising the n most dissimilar compounds in a database containing N compounds, where typically n ≪ N. Identification of the most dissimilar subset is not computationally feasible since it requires consideration... [Pg.352]
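
To give a feel for why an exhaustive search is impractical, the sketch below counts the possible n-compound subsets of an N-compound database with the binomial coefficient and compares that with a rough estimate of the work in a stepwise selection. The database and subset sizes are hypothetical, and the stepwise estimate assumes one dissimilarity calculation per database compound per selected compound, which is an assumption about an incremental implementation rather than a figure from the excerpt.

```python
import math


def subset_count(database_size, subset_size):
    """Number of distinct subsets of the given size (binomial coefficient)."""
    return math.comb(database_size, subset_size)


def stepwise_cost_estimate(database_size, subset_size):
    """Rough count of dissimilarity calculations for an incremental stepwise
    selection: each selected compound is compared once with every database
    compound."""
    return database_size * subset_size


# Hypothetical sizes: pick 10 compounds from a database of 1000.
N, n = 1000, 10
print(f"subsets to score exhaustively: {subset_count(N, n):.3e}")
print(f"stepwise dissimilarity calculations: {stepwise_cost_estimate(N, n)}")
```

Even at these small sizes the exhaustive count exceeds 10^23, while the stepwise estimate is on the order of 10^4.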

Clark [46] has recently described a subset selection algorithm called OptiSim which includes maximum and minimum dissimilarity based selection as special cases. A parameter is used to adjust the balance between representativeness and diversity in the compounds that are selected. [Pg.354]
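
The following is a simplified, OptiSim-like sketch and should not be read as Clark's published algorithm. It assumes that the adjustable parameter is a subsample size k, adds an exclusion radius, and reshuffles rejected subsample members back into the pool; these details, together with the function names and the set-based fingerprints, are assumptions made for illustration.

```python
import random


def tanimoto_dissimilarity(fp_a, fp_b):
    """1 - Tanimoto for fingerprints stored as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints are identical
    return 1.0 - len(fp_a & fp_b) / union


def optisim_like_select(fingerprints, subset_size, k, radius, rng):
    """Pick compounds by drawing random subsamples of up to k candidates and
    keeping the subsample member farthest from the current subset."""
    pool = list(range(len(fingerprints)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # random seed compound
    while len(selected) < subset_size and pool:
        subsample = []  # (distance to nearest selected compound, index)
        while pool and len(subsample) < k:
            candidate = pool.pop()
            nearest = min(
                tanimoto_dissimilarity(fingerprints[candidate], fingerprints[s])
                for s in selected
            )
            if nearest >= radius:
                subsample.append((nearest, candidate))
            # Candidates inside the radius are dropped permanently.
        if not subsample:
            break  # nothing left outside the exclusion radius
        best = max(subsample)
        selected.append(best[1])
        # Return the unselected subsample members to the pool.
        pool.extend(index for _, index in subsample if index != best[1])
        rng.shuffle(pool)
    return selected


# Hypothetical fingerprints and parameter values.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
print(optisim_like_select(fps, subset_size=3, k=2, radius=0.3, rng=random.Random(0)))
```

In this sketch, setting k to the size of the pool makes each step a MaxMin-style pick over all remaining candidates, while k = 1 accepts the first random candidate that lies outside the radius, so k shifts the balance between diversity and representativeness.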

Given the variety of different descriptors and subset selection methods that are available, several studies have been carried out in an attempt to validate both the compound selection methods and the various descriptors. To some extent the choice of descriptors and subset selection methods are interlinked. For example, partitioning schemes are restricted to low-dimensional descriptors such as physicochemical descriptors, whereas clustering and dissimilarity-based methods can be used with high dimensional descriptors such as fingerprints. [Pg.357]

Matter [58] has also validated a range of 2-D and 3-D structural descriptors on their ability to predict biological activity and on their ability to sample structurally and biologically diverse datasets effectively. The compound selection techniques used were maximum dissimilarity and clustering. Their results also showed the 2-D fingerprint-based descriptors to be the most effective in selecting representative subsets of bioactive compounds. [Pg.358]

Holliday JD, Willett P, Definitions of dissimilarity for dissimilarity-based compound selection, J. Biomolec. Screening, 1, 145-151, 1996. [Pg.365]

Big Chemical Encyclopedia
