Descriptor-based diversity selection

Burden, CAS, and University of Texas (BCUT) descriptors are well suited to, and widely used for, describing the diversity of a chemical population in a low-dimensional Euclidean space, and they allow for fast cell-based diversity selection algorithms (Pearlman and Smith, 1998). The DiverseSolutions... [Pg.255]

Fig. 1. Median partitioning and compound selection. In this schematic illustration, a two-dimensional chemical space is shown as an example. The axes represent the medians of two uncorrelated (and, therefore, orthogonal) descriptors and dots represent database compounds. In A, a compound database is divided into equal subpopulations in two steps and each resulting partition is characterized by a unique binary code (shared by the molecules occupying this partition). In B, diversity-based compound selection is illustrated. From the center of each partition, a compound is selected to obtain a representative subset. By contrast, C illustrates activity-based compound selection. Here, a known active molecule (gray dot) is added to the source database prior to MP, and compounds that ultimately occur in the same partition as this bait molecule are selected as candidates for testing. Finally, D illustrates the effects of descriptor correlation. In this case, the two applied descriptors are significantly correlated and the dashed line represents a diagonal of correlation that affects the compound distribution. As can be seen, descriptor correlation leads to over- and underpopulated partitions.
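The partitioning in panels A and C can be sketched in a few lines of Python. This is a minimal illustration, not the published MP implementation: it assumes each descriptor is split once at its database-wide median, and the function names `median_partition` and `partition_mates` are hypothetical.

```python
from statistics import median

def median_partition(descriptors):
    """Return one binary code per compound, one bit per descriptor.

    descriptors: list of equal-length tuples, one tuple per compound.
    Assumption: a bit is '1' when the value lies above the median of that
    descriptor over the whole database, and '0' otherwise.
    """
    n_desc = len(descriptors[0])
    medians = [median(row[j] for row in descriptors) for j in range(n_desc)]
    return [
        "".join("1" if row[j] > medians[j] else "0" for j in range(n_desc))
        for row in descriptors
    ]

def partition_mates(codes, bait_index):
    """Indices of compounds that share a partition with the bait (panel C)."""
    bait_code = codes[bait_index]
    return [i for i, c in enumerate(codes) if c == bait_code and i != bait_index]

# Toy two-descriptor database; compound 0 plays the role of the bait molecule.
compounds = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.7), (0.9, 0.2),
             (0.3, 0.8), (0.4, 0.3), (0.7, 0.6), (0.6, 0.4)]
codes = median_partition(compounds)
candidates = partition_mates(codes, bait_index=0)
```

With uncorrelated descriptors, as in this toy data, every partition receives the same number of compounds. Correlated descriptors would leave some codes overrepresented and others rare or absent, which is the effect shown in panel D.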

Key Words: Biological activity; cell-based partitioning; chemical descriptors; classification; clustering; distance-based design; diversity selection; high-throughput screening; quantitative structure-activity relationship. [Pg.301]

This approach is particularly efficient when combined with the Cosine coefficient (69) and was used by Pickett et al. in combination with pharmacophore descriptors (70). In lower-dimensional spaces, the maxsum measure tends to force selection from the corners of diversity space (6b, 71), and hence maxmin is the preferred function in these cases. A similar conclusion was drawn from a comparison of algorithms for dissimilarity-based compound selection (72). [Pg.208]
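A greedy maxmin selection can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Euclidean distance rather than the Cosine coefficient mentioned above, starts from an arbitrary seed compound, and the name `maxmin_select` is hypothetical.

```python
import math

def maxmin_select(points, k, seed_index=0):
    """Greedily pick k points, each maximizing its distance to the nearest pick.

    points: list of descriptor tuples; seed_index: assumed starting compound.
    """
    k = min(k, len(points))
    selected = [seed_index]
    # nearest[i] holds the distance from point i to its closest selected point.
    nearest = [math.dist(p, points[seed_index]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

# Toy two-dimensional descriptor space.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
picked = maxmin_select(points, k=3)
```

Keeping the running `nearest` list avoids recomputing all distances to the selected set at every step, so each step costs one pass over the database.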

Fig. 6-20 Forward chemical-genetic screen for inhibitors of protein deacetylation (data from Ref. 80). (a) Overview of cell-based screens of the 1,3-dioxane-based, diversity-oriented synthesis-derived library using antibodies to measure tubulin and histone acetylation, (b) Relative position of selected active compounds in a three-dimensional principal component model computed from five cell-based assay descriptors. AcTubulin-selective (red),...

In principle, there are three main steps required to carry out diversity-based subset selections: (1) the calculation of descriptors representing the compound structures, (2) a quantitative method to describe the similarity or dissimilarity of molecules in relationship to each other, and (3) selection methods to identify compounds based on their similarity or dissimilarity values that best represent the entire compound set. In the following, the three steps are described in more detail. [Pg.13]

All methods require compound characterization by multiple molecular descriptors, and appropriate dissimilarity scoring functions must be used. The purpose of the diversity selection can be formulated as follows: select a subset of n representative compounds from a database S containing N compounds, which is the most diverse in terms of chemical structure. The key to each of the different methods is the mathematical function that measures diversity. Since each molecule is represented by a vector of molecular descriptors, it is geometrically mapped to a point in a multidimensional space. The distance between two points, such as the Euclidean distance, measures the dissimilarity between two molecules. Thus, the diversity function should be based on all pairwise distances between molecules in the subset... [Pg.79]
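Two simple diversity functions built on all pairwise distances can be sketched as below. They are common illustrative choices and not necessarily the function the truncated excerpt goes on to define; the names `sum_diversity` and `min_diversity` are hypothetical, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

def pairwise_distances(subset):
    """Euclidean distances between every unordered pair of descriptor vectors."""
    return [math.dist(a, b) for a, b in combinations(subset, 2)]

def sum_diversity(subset):
    """Sum of all pairwise distances (a maxsum-style score)."""
    return sum(pairwise_distances(subset))

def min_diversity(subset):
    """Smallest pairwise distance (a maxmin-style score)."""
    return min(pairwise_distances(subset))

# Three compounds lying on a line in a two-descriptor space.
subset = [(0, 0), (3, 4), (6, 8)]
total = sum_diversity(subset)
closest = min_diversity(subset)
```

A subset of n compounds has n(n-1)/2 pairs, so scoring every candidate subset exhaustively quickly becomes impractical for large databases. This is why selection methods typically rely on greedy or stochastic search.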

As we mentioned earlier, the time that is available for each diversity task will likely depend on the nature of the task. Reagent selection may need to be done in a hurry, whereas compound acquisition studies may be afforded rather more time. In the former case, it is clear that the computer time required for diversity analysis/library design must not exceed that available (possibly only days if the library chemistry is already developed, longer if the chemistry is new). For many product-based reagent selection approaches, CPU time is at present a very real obstacle to what might be done. It is to be hoped that more efficient algorithms and exploitation of parallel computation techniques will help alleviate the current difficulties. More fundamentally, the development of approaches based on Markush representations may offer a solution in instances where only simple 2-D descriptors are employed. ... [Pg.39]

A major potential drawback with cluster analysis and dissimilarity-based methods for selecting diverse compounds is that there is no easy way to quantify how completely one has filled the available chemical space or to identify whether there are any holes. This is a key advantage of the partition-based approaches (also known as cell-based methods). A number of axes are defined, each corresponding to a descriptor or some combination of descriptors. Each axis is divided into a number of bins. If there are N axes and each is divided into b bins then the number of cells in the multidimensional space so created is ... [Pg.701]
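The cell-based idea can be sketched as follows. This is a minimal illustration assuming equal-width bins and a known value range on each axis; the names `cell_index`, `occupancy`, and `empty_cells` are hypothetical. With the same number of bins on every axis, simple counting gives `bins ** n_axes` cells in total, which is what the sketch enumerates.

```python
from itertools import product

def cell_index(value, lo, hi, bins):
    """Bin number in 0..bins-1 for a value on an axis spanning [lo, hi]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    return min(max(idx, 0), bins - 1)  # clamp values on or beyond the edges

def occupancy(points, ranges, bins):
    """Map each occupied cell (tuple of bin numbers) to its compound indices."""
    cells = {}
    for i, point in enumerate(points):
        key = tuple(cell_index(v, lo, hi, bins)
                    for v, (lo, hi) in zip(point, ranges))
        cells.setdefault(key, []).append(i)
    return cells

def empty_cells(cells, n_axes, bins):
    """Cells containing no compound, i.e. holes in the covered space."""
    return [c for c in product(range(bins), repeat=n_axes) if c not in cells]

# Two axes, two bins each: four cells in total.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.6, 0.1)]
ranges = [(0.0, 1.0), (0.0, 1.0)]
cells = occupancy(points, ranges, bins=2)
holes = empty_cells(cells, n_axes=2, bins=2)
```

The fraction of occupied cells gives a direct coverage figure, and the list of holes points to regions from which compounds could be acquired. Because the cell count grows exponentially with the number of axes, such schemes are usually applied to low-dimensional descriptor spaces.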

Looking ahead, I am optimistic that we will see continued growth of our knowledge about this and other conceptual DFT-based reactivity and selectivity descriptors as well as broadening applications in understanding a diverse class of biophysicochemical properties and processes. [Pg.189]

A set of n = 209 polycyclic aromatic compounds (PAC) was used in this example. The chemical structures have been drawn manually with structure editor software; approximate 3D structures including all H atoms have been generated with the software Corina (Corina 2004), and the software Dragon, version 5.3 (Dragon 2004), has been applied to compute 1630 molecular descriptors. These descriptors cover a great diversity of chemical structures, and therefore many descriptors are irrelevant for a selected class of compounds such as the PACs in this example. By a simple variable selection, descriptors which are constant or almost constant (all but a maximum of five values constant), and descriptors with a correlation coefficient >0.95 to another descriptor, have been eliminated. The resulting m = 467 descriptors have been used as x-variables. The y-variable to be modeled is the Lee retention index (Lee et al. 1979), which is based on the reference values 200, 300, 400, and 500 for the compounds naphthalene, phenanthrene, chrysene, and picene, respectively. [Pg.187]
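The simple variable selection described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that "almost constant" means at most five values differ from the most frequent value, that the absolute Pearson correlation is compared with 0.95, and that the first descriptor of a correlated pair is the one kept. The excerpt does not specify the last two points, and all function names are hypothetical.

```python
import math
from collections import Counter

def near_constant(column, max_deviating=5):
    """True if at most max_deviating values differ from the most common value."""
    most_common_count = Counter(column).most_common(1)[0][1]
    return len(column) - most_common_count <= max_deviating

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_descriptors(columns, max_deviating=5, r_max=0.95):
    """Names of descriptors that survive both filters, in input order."""
    kept = []
    for name, col in columns.items():
        if near_constant(col, max_deviating):
            continue  # constant or almost constant
        if any(abs(pearson(col, columns[k])) > r_max for k in kept):
            continue  # redundant with a descriptor already kept (assumption)
        kept.append(name)
    return kept

# Toy descriptor table: eight compounds, four hypothetical descriptors.
columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 16],  # perfectly correlated with "a"
    "c": [0, 0, 0, 0, 0, 0, 0, 1],      # almost constant
    "d": [5, 1, 4, 2, 8, 3, 7, 6],
}
kept = select_descriptors(columns)
```

Because the result depends on the order in which descriptors are visited, a different ordering can keep a different member of each correlated group.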

Narayanan and Gunturi [33] developed QSPR models based on in vivo blood-brain permeation data of 88 diverse compounds, 324 descriptors, and a systematic variable selection method called Variable Selection and Modeling method based on the Prediction (VSMP). VSMP efficiently explored all... [Pg.541]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...