Dissimilarity, compound selection

Another important feature of the Tanimoto coefficient when used with bitstring data is that small molecules, which tend to have fewer bits set, will have only a small number of bits in common and so tend to give inherently low similarity values. This can be important when selecting dissimilar compounds, as a bias towards small molecules can result. [Pg.693]
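The size effect can be illustrated with a minimal sketch. The fingerprints below are hypothetical sets of "on" bit positions, not taken from any real descriptor, and the value returned for two empty fingerprints is an assumed convention.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient c / (|a| + |b| - c) for two sets of 'on' bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    # Assumption: two empty fingerprints are treated as identical.
    return common / union if union else 1.0


# Hypothetical fingerprints: two small molecules (4 bits each) and
# two large molecules (40 bits each). Each pair differs by two bits per molecule.
small_1, small_2 = {1, 2, 3, 4}, {3, 4, 5, 6}
large_1, large_2 = set(range(1, 41)), set(range(3, 43))

sim_small = tanimoto(small_1, small_2)  # 2 common bits out of 6 in the union
sim_large = tanimoto(large_1, large_2)  # 38 common bits out of 42 in the union

print(f"small pair: similarity {sim_small:.3f}, dissimilarity {1 - sim_small:.3f}")
print(f"large pair: similarity {sim_large:.3f}, dissimilarity {1 - sim_large:.3f}")
```

Both pairs differ by the same number of bits, yet the small pair appears far more dissimilar, which is why a selection driven by the complement of the Tanimoto coefficient can favour small molecules.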

In dissimilarity-based compound selection the required subset of molecules is identified directly, using an appropriate measure of dissimilarity (often taken to be the complement of the similarity). This contrasts with the two-stage procedure in cluster analysis, where it is first necessary to group together the molecules and then decide which to select. Most methods for dissimilarity-based selection fall into one of two categories: maximum dissimilarity algorithms and sphere exclusion algorithms [Snarey et al. 1997]. [Pg.699]

The maximum dissimilarity algorithm works in an iterative manner: at each step one compound is selected from the database and added to the subset [Kennard and Stone 1969]. The compound selected is chosen to be the one most dissimilar to the current subset. There are many variants on this basic algorithm, which differ in the way in which the first compound is chosen and how the dissimilarity is measured. Three possible choices for the initial compound are (a) select it at random, (b) choose the molecule which is most representative (e.g. has the largest sum of similarities to the other molecules) or (c) choose the molecule which is most dissimilar (e.g. has the smallest sum of similarities to the other molecules). [Pg.699]
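The three seed choices can be sketched as follows. This is an illustration only: the function name choose_seed, the strategy labels and the similarity matrix are hypothetical, and the matrix is assumed to be symmetric with pairwise similarities already computed.

```python
import random


def choose_seed(sim, strategy="random", rng=None):
    """Return the index of the first compound for a maximum dissimilarity run.

    sim is a square matrix (list of lists) of pairwise similarities.
    """
    n = len(sim)
    if strategy == "random":  # choice (a)
        return (rng or random).randrange(n)
    # Sum of similarities of each molecule to all the others (self excluded).
    totals = [sum(row[j] for j in range(n) if j != i) for i, row in enumerate(sim)]
    if strategy == "representative":  # choice (b): largest sum of similarities
        return max(range(n), key=totals.__getitem__)
    if strategy == "dissimilar":  # choice (c): smallest sum of similarities
        return min(range(n), key=totals.__getitem__)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical similarity matrix for four molecules.
sim = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.7, 0.5, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

print(choose_seed(sim, "representative"))  # molecule 0 has the largest sum
print(choose_seed(sim, "dissimilar"))      # molecule 3 has the smallest sum
```

In this toy matrix molecule 0 is close to most of the others and molecule 3 is an outlier, so choices (b) and (c) start the selection from opposite ends of the data set.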

Compound selection methods usually involve selecting a relatively small set of a few tens or hundreds of compounds from a large database that could consist of hundreds of thousands or even millions of compounds. Identifying the n most dissimilar compounds in a database containing N compounds, when typically n ≪ N, is computationally infeasible because it requires consideration of all possible n-member subsets of the database, and therefore approximate methods have been developed as described below. [Pg.199]
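The scale of the problem can be made concrete by counting the n-member subsets, which is the binomial coefficient N! / (n! (N - n)!). The sketch below uses Python's standard math.comb; the database sizes are illustrative assumptions, not figures from the text.

```python
import math


def count_subsets(N: int, n: int) -> int:
    """Number of distinct n-member subsets of a database of N compounds."""
    return math.comb(N, n)


# Illustrative sizes only: even a tiny database yields an enormous search space.
for N, n in [(5, 2), (1000, 10), (100_000, 100)]:
    digits = len(str(count_subsets(N, n)))
    print(f"N={N}, n={n}: a {digits}-digit number of candidate subsets")
```

An exhaustive search would have to score every one of these subsets, whereas the stepwise methods only compare each remaining database compound with the compounds already chosen at each step.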

Dissimilarity-based compound selection (DBCS) methods involve selecting a subset of compounds directly based on pairwise dissimilarities [37]. The first compound is selected, either at random or as the one that is most dissimilar to all others in the database, and is placed in the subset. The subset is then built up stepwise by selecting one compound at a time until it is of the required size. In each iteration, the next compound to be selected is the one that is most dissimilar to those already in the subset, with the dissimilarity normally being computed by the MaxMin approach [38]. Here, each database compound is compared with each compound in the subset and its nearest neighbor is identified; the database compound that is selected is the one that has the maximum dissimilarity to its nearest neighbor in the subset. [Pg.199]
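A minimal sketch of the MaxMin step is given below, assuming a precomputed square matrix of pairwise dissimilarities and a seed chosen by the caller. The function name maxmin_select and the one-dimensional toy data are hypothetical; in practice the dissimilarities would typically come from fingerprints (e.g. one minus the Tanimoto coefficient).

```python
def maxmin_select(dist, subset_size, seed=0):
    """Greedy MaxMin selection on a square matrix of pairwise dissimilarities."""
    n = len(dist)
    selected = [seed]
    # nearest[i]: dissimilarity from compound i to its nearest neighbor in the subset.
    nearest = list(dist[seed])
    while len(selected) < min(subset_size, n):
        candidates = [i for i in range(n) if i not in selected]
        # Pick the candidate whose nearest subset member is farthest away.
        chosen = max(candidates, key=nearest.__getitem__)
        selected.append(chosen)
        # Update nearest-neighbor dissimilarities with the newly added compound.
        nearest = [min(old, d) for old, d in zip(nearest, dist[chosen])]
    return selected


# Toy data: five "compounds" placed on a line; dissimilarity is the absolute gap.
coords = [0.0, 0.1, 0.5, 0.9, 1.0]
dist = [[abs(a - b) for b in coords] for a in coords]

picked = maxmin_select(dist, subset_size=3, seed=0)
print(picked)  # [0, 4, 2]
```

Starting from compound 0, the far end of the line (compound 4) is picked first and then the midpoint (compound 2), so the subset spreads across the range. Keeping the running nearest list avoids recomputing every comparison with the whole subset at each step.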

Snarey M, Terrett NK, Willett P, Wilton DJ. Comparison of algorithms for dissimilarity-based compound selection. J Mol Graph Model 1997, 15, 372-85. [Pg.206]

Lajiness, M.S. Dissimilarity-based compound selection techniques. Perspect. Drug Discov. Des., 1997, 7/8, 65-84. [Pg.332]

The D-score is computed using the maximum dissimilarity algorithm of Lajiness (20). This method utilizes a Tanimoto-like similarity measure defined on a 360-bit fragment descriptor used in conjunction with the Cousin/ChemLink system (21). The important feature of this method is that it starts with the selection of a seed compound with subsequent compounds selected based on the maximum diversity relative to all compounds already selected. Thus, the most obvious seed to use in the current scenario is the compound that has the best profile based on the already computed scores. Thus, one needs to compute a preliminary consensus score based on the Q-score and the B-score using weights as defined previously. To summarize this, one needs to... [Pg.121]

Gillet, V. J. and Willett, P. (2001) Dissimilarity-based compound selection for library design. In Combinatorial library design and evaluation. Principles, software tools and applications in drug discovery, Ghose, A. K. and Viswanadhan, V. N. (eds.), Marcel Dekker, New York, pp. 379-398. [Pg.352]

Since diversity is a collective property, its precise quantification requires a mathematical description of the distribution of the molecular collection in a chemical space. When one set of molecules is considered to be more diverse than another, the molecules in this set cover more chemical space and/or are distributed more evenly in chemical space. Historically, diversity analysis is closely linked to compound selection and combinatorial library design. In reality, library design is also a selection process, selecting compounds from a virtual library before synthesis. There are three main categories of selection procedures for building a diverse set of compounds: cluster-based selection, partition-based selection, and dissimilarity-based selection. [Pg.39]

Dissimilarity analysis plays a major role in compound selection. Typical tasks include the selection of a maximally dissimilar subset of compounds from a large set or the identification of compounds that are dissimilar to an existing collection. Such issues have played a major role in compound acquisition in the pharmaceutical industry. A typical task would be to select a subset of maximally dissimilar compounds from a data set containing n molecules. This represents a non-trivial challenge because of the combinatorial problem involved in exploring all possible subsets. Therefore, other dissimilarity-based selection algorithms have been developed (Lajiness 1997). The basic idea of such approaches is to initially select a seed compound (either randomly or, better, based on dissimilarity to others), then calculate dissimilarity between the seed compound and all others and select the most dissimilar one. In the next step, the database compound most dissimilar to these two compounds is selected and added to the subset, and the process is repeated until a subset of desired size is obtained. [Pg.9]

A variety of different compound selection methods have been developed. These techniques depend on the use of molecular descriptors, which are numerical values that characterize the properties of molecules. In addition, many compound selection methods are based on quantifying the degree of similarity or dissimilarity of compounds based on molecular descriptors. This requires the use of similarity or distance coefficients. [Pg.347]

Many different methods have been developed for compound selection. They include clustering, dissimilarity-based compound selection, partitioning a collection of compounds into a low-dimensional space and the use of optimization methods such as simulated annealing and genetic algorithms. Filtering techniques are often employed prior to compound selection to remove undesirable compounds. [Pg.351]

In the Maximum Dissimilarity (MD) selection method described by Lajiness [40] the first compound is selected at random and subsequent compounds are then chosen iteratively, such that the distance to the nearest of the compounds already chosen is a maximum. This method is known as MaxMin. In this study, the compounds were represented by COUSIN 2-D fragment-based bitstrings. Polinsky et al. [41] use a similar algorithm in the LiBrain system. In this case, the molecules are represented by a feature vector that contains information about the following affinity types: aliphatic hydrophobic, aromatic hydrophobic, basic, acidic, hydrogen bond donor, hydrogen bond acceptor and polarizable heteroatom. [Pg.353]

Clark [46] has recently described a subset selection algorithm called OptiSim which includes maximum and minimum dissimilarity based selection as special cases. A parameter is used to adjust the balance between representativeness and diversity in the compounds that are selected. [Pg.354]

Given the variety of different descriptors and subset selection methods that are available, several studies have been carried out in an attempt to validate both the compound selection methods and the various descriptors. To some extent the choice of descriptors and subset selection methods are interlinked. For example, partitioning schemes are restricted to low-dimensional descriptors such as physicochemical descriptors, whereas clustering and dissimilarity-based methods can be used with high dimensional descriptors such as fingerprints. [Pg.357]

Matter [58] has also validated a range of 2-D and 3-D structural descriptors on their ability to predict biological activity and on their ability to sample structurally and biologically diverse datasets effectively. The compound selection techniques used were maximum dissimilarity and clustering. The results also showed the 2-D fingerprint-based descriptors to be the most effective in selecting representative subsets of bioactive compounds. [Pg.358]

Holliday JD, Willett P, Definitions of dissimilarity for dissimilarity-based compound selection, J. Biomolec. Screening, 1, 145-151, 1996. [Pg.365]
