Descriptor space

In the multidimensional descriptor space, each compound can be described by a point, with the coordinates along the descriptor axes equal to the measured values of the corresponding property descriptor, see Fig. 15.2. [Pg.342]

A series of N different compounds, characterized by the same descriptors, will then define a swarm of points in the descriptor space (Fig. 15.3). [Pg.342]

If the compounds should happen to be very similar to each other, there will only be minor variations in the values of the different property descriptors. The data points would then be close to each other in the descriptor space, and the average values of the descriptors would give an adequate approximation of the distribution of the data points in the descriptor space. This would correspond to a point with the coordinates defined by the average values (Fig. 15.4). [Pg.342]

In most cases, however, the compounds differ from each other and there is a spread in their properties, so the average values will not give a sufficient description of the data. The distribution of the data points will then have an extension in the descriptor space, and this extension will be outside the range of random error variation around the average point. [Pg.344]

The next step is to determine the direction through the swarm of points along which the data show the largest variation in their distribution in the space. This will be the direction of the first principal component vector, denoted p1. If this vector is anchored in the average point, it is possible to make a perpendicular projection of all the points in the space on this vector. A projected point will then have a coordinate, measured from the average point, along the first principal component vector, p1. The distribution of data points in the p1 direction of the descriptor space can thus be described by their coordinates along p1. If all descriptors should happen to be correlated to each other, the swarm of points would describe a linear structure [Pg.345]
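
The projection described above can be sketched with NumPy. This is a minimal illustration, not the source's own procedure: the function name `first_principal_component` and the five-compound data matrix are hypothetical, and the direction of largest variation is obtained here from a singular value decomposition of the mean-centered data.

```python
import numpy as np

def first_principal_component(X):
    """Return the average point, the unit vector p1, and the coordinates along p1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                 # the average point in descriptor space
    centered = X - mean                   # anchor the vector at the average point
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    p1 = Vt[0]                            # direction of largest variation
    t1 = centered @ p1                    # perpendicular projection onto p1
    return mean, p1, t1

# Hypothetical data: 5 compounds, 2 perfectly correlated descriptors.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]])
mean, p1, t1 = first_principal_component(X)

# Because the descriptors are perfectly correlated, the swarm is a line and
# the coordinates along p1 reproduce the data exactly.
reconstructed = mean + np.outer(t1, p1)
```

The sign of p1 is arbitrary: flipping the vector flips the sign of every coordinate, and the description of the data is unchanged.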

Cluster sampling methods, which first identify a set of compound clusters, followed by the selection of several compounds from each cluster [73]. Grid-based sampling, which places all the compounds into a low-dimensional descriptor space divided into many cells and then chooses a few compounds from each cell [74]. [Pg.364]

In contrast to SOMs, nonlinear maps (NLMs) represent relative distances between all pairs of compounds in the descriptor space of a 2D map. The distance between two points on the map directly reflects the similarity of the... [Pg.362]

To further analyze the relationships within descriptor space we performed a principal component analysis of the whole data matrix. Descriptors have been normalized before the analysis to have a mean of 0 and standard deviation of 1. The first two principal components explain 78% of variance within the data. The resultant loadings, which characterize contributions of the original descriptors to these principal components, are shown in Fig. 5.8. On the plot we can see that PSA, Hhed and Uhba are indeed closely grouped together. Calculated octanol-water partition coefficient CLOGP is located in the opposite corner of the property space. This analysis also demonstrates that CLOGP and PSA are the two parameters with... [Pg.122]
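
The normalization, explained-variance, and loadings steps mentioned above might look like the following sketch. The function name `autoscaled_pca` and the random data matrix are hypothetical and stand in for the descriptor table of the excerpt, which is not reproduced here. The sketch also assumes that no descriptor is constant, since a zero standard deviation would make the scaling undefined.

```python
import numpy as np

def autoscaled_pca(X):
    """Scale each descriptor to mean 0 and standard deviation 1, then run PCA."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # fraction of variance per component
    loadings = Vt.T                   # rows: descriptors, columns: components
    return Z, explained, loadings

# Hypothetical data: 50 compounds, 3 descriptors, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([base,
               2.0 * base + 0.1 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])
Z, explained, loadings = autoscaled_pca(X)

# Cumulative variance explained by the first two components.
first_two = explained[:2].sum()
```

Descriptors whose rows in `loadings[:, :2]` are close to each other would plot close together on a loadings plot of the first two components.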

Some sort of estimation of the statistical distance to the overall model (see Section 16.5.4) should be reported for each compound to provide an estimate of how much an intra- or extrapolation in multivariate descriptor space the prediction actually is. [Pg.398]

To be able to use extrapolations from the present model in a constructive manner to expand it to cover a larger descriptor space. [Pg.400]

It is worth noting that when relatively few compounds are screened and the descriptor space is very large or the neighborhoods very small, each compound covers very little... [Pg.399]

Fig. 18.4 An ideal compound library. Compounds (black dots) are uniformly distributed in 2D descriptor space, a plane defined by two molecular descriptors that correlate with biological activity but not with one another. The compounds are surrounded by nonoverlapping neighborhoods (circles)...

Fig. 18.5 A typical compound library. Compounds are clustered in descriptor space (for example, around synthetically accessible scaffolds). For clusters with overlapping neighborhoods, efficiency may be increased by testing representative compounds from each cluster analogs of hits...

Dissimilarity and clustering methods only describe the compounds that are in the input set; voids in diversity space are not obvious, and if compounds are added then the set must be re-analyzed. Cell-based partitioning methods address these problems by dividing descriptor space into cells, and then populating those cells with compounds [67, 68]. The library is chosen to contain representatives from each cell. The use of a partition-based method with BCUT descriptors [69] to design an NMR screening library has recently been described [70]. [Pg.401]
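
A minimal sketch of cell-based partitioning follows. It assumes equal-width bins between the observed minimum and maximum of each descriptor and keeps the first compound encountered in each occupied cell; both choices, and the names `cell_ids` and `pick_representatives`, are illustrative assumptions rather than the scheme used in the cited work.

```python
import numpy as np

def cell_ids(X, bins_per_dim):
    """Assign each compound to a cell of an equal-width grid over the data range."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)           # avoid division by zero
    idx = np.floor((X - lo) / span * bins_per_dim).astype(int)
    idx = np.clip(idx, 0, bins_per_dim - 1)          # the maximum goes in the last bin
    return [tuple(int(v) for v in row) for row in idx]

def pick_representatives(X, bins_per_dim):
    """Map each occupied cell to the index of the first compound found in it."""
    chosen = {}
    for i, cell in enumerate(cell_ids(X, bins_per_dim)):
        chosen.setdefault(cell, i)
    return chosen

# Hypothetical 2D descriptor values for five compounds.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 0.0], [0.0, 1.0]])
reps = pick_representatives(X, bins_per_dim=2)
```

Because the grid is fixed by the bin edges rather than by the compounds, cells that stay empty are visible as voids, and new compounds can be assigned to cells without re-analyzing the existing set, provided the bin edges are kept fixed.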

The similarity of samples can be evaluated by using geometrical constructs based on the standard deviation of the objects modeled by SIMCA. By enclosing classes in volume elements in descriptor space, the SIMCA method provides information about the existence of similarities among the members of the defined classes. Relations among samples, when visualized in this way, increase one's ability to formulate questions or hypotheses about the data being examined. The selection of variables on the basis of MPOW also provides clues as to how samples within a class are similar, and the derived class model describes how the objects are similar, with regard to the internal variation of these variables. [Pg.208]

Fig. 7.3 Representation of interesting or promising cells. Each cube represents a cell in the partitioned descriptor space. The shaded cell contains one or more active compounds. The virtual compounds that fall into cells that are immediately adjacent to the interesting cells are also likely to be active, so one may add a degree of fuzziness to the design so as to capture these cells. One layer is shown, but the number of layers controls the focus or fuzziness of the region defined as interesting.
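
The layers of adjacent cells described in the caption can be enumerated as in the sketch below. It assumes that cells are identified by integer grid indices and that "adjacent" includes diagonal neighbors; the function name `neighbor_cells` is hypothetical, and indices that fall outside the grid are not filtered out.

```python
from itertools import product

def neighbor_cells(cell, layers=1):
    """Return all cells within `layers` steps of `cell` along every axis, excluding `cell`."""
    offsets = product(range(-layers, layers + 1), repeat=len(cell))
    return {
        tuple(c + o for c, o in zip(cell, offset))
        for offset in offsets
        if any(offset)                      # skip the all-zero offset (the cell itself)
    }

# One layer around a cell in a 3D grid, then two layers.
one_layer = neighbor_cells((1, 1, 1), layers=1)
two_layers = neighbor_cells((1, 1, 1), layers=2)
```

Raising `layers` widens the region treated as interesting, which corresponds to a fuzzier, less focused design.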

Here y is the average and σ is the standard deviation of the Euclidean distances of the k nearest neighbors of each compound in the training set in the chemical descriptor space, and Z is an empirical parameter to control the significance level, with the default value of 0.5. If the distance from an external compound to its nearest neighbor in the training set is above Dc, we label its prediction unreliable. [Pg.443]
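
The equation defining Dc is not included in this excerpt. The sketch below assumes the form Dc = y + Zσ, which is consistent with the symbol definitions given but should be checked against the source. The function names and the four-compound training set are hypothetical.

```python
import numpy as np

def distance_cutoff(train, k=1, z=0.5):
    """Assumed cutoff Dc = y + Z*sigma over the k-nearest-neighbor distances."""
    train = np.asarray(train, dtype=float)
    # Pairwise Euclidean distances between all training compounds.
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)             # a compound is not its own neighbor
    knn = np.sort(d, axis=1)[:, :k]         # k smallest distances per compound
    return knn.mean() + z * knn.std()

def is_reliable(x, train, cutoff):
    """True if the nearest training compound lies within the cutoff distance."""
    d = np.linalg.norm(np.asarray(train, dtype=float) - np.asarray(x, dtype=float), axis=1)
    return bool(d.min() <= cutoff)

# Hypothetical training set: four compounds evenly spaced along one descriptor.
train = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
dc = distance_cutoff(train, k=1, z=0.5)
near = is_reliable([3.5, 0.0], train, dc)
far = is_reliable([10.0, 0.0], train, dc)
```

In this toy set every nearest-neighbor distance is 1, so the standard deviation is 0 and the cutoff equals 1 regardless of Z. With real data a larger Z widens the region in which predictions are accepted.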

How is dimension reduction of chemical spaces achieved? There are a number of different concepts and mathematical procedures to reduce the dimensionality of descriptor spaces with respect to a molecular dataset under investigation. These techniques include, for example, linear mapping, multidimensional scaling, factor analysis, or principal component analysis (PCA), as reviewed in ref. 8. Essentially, these techniques either try to identify those descriptors among the initially chosen ones that are most important to capture the chemical information encoded in a molecular dataset or, alternatively, attempt to construct new variables from original descriptor contributions. A representative example will be discussed below in more detail. [Pg.282]

Despite the conceptual elegance of partitioning in low-dimensional descriptor spaces, dimensional reduction is not essential for effective partitioning, as has been shown, for example, by application of statistical partitioning methods (4). [Pg.287]

Partitioning in Binary-Transformed Chemical Descriptor Spaces... [Pg.291]

In contrast to partitioning methods that involve dimension reduction of chemical reference spaces, MP is best understood as a direct space method. However, n-dimensional descriptor space is simplified here by transforming property descriptors with continuous or discrete value ranges into a binary classification scheme. Essentially, this binary space transformation assigns less complex n-dimensional vectors to test molecules, with each dimension having unity length of either 0 or 1. Thus, although MP analysis proceeds in n-dimensional descriptor space, its dimensions are scaled and its complexity is reduced. [Pg.295]
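
A binary transformation of this kind can be illustrated as follows. The excerpt does not state the threshold used, so this sketch assumes that each descriptor is split at its median over the dataset; that choice, the function name `binarize_by_median`, and the data are assumptions for illustration only.

```python
import numpy as np

def binarize_by_median(X):
    """Replace each descriptor value by 1 if above the descriptor's median, else 0."""
    X = np.asarray(X, dtype=float)
    medians = np.median(X, axis=0)          # one threshold per descriptor (assumed)
    return (X > medians).astype(int), medians

# Hypothetical data: 4 molecules, 2 continuous descriptors.
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]])
B, medians = binarize_by_median(X)

# Molecules with identical bit vectors fall into the same partition.
partitions = {}
for i, bits in enumerate(map(tuple, B.tolist())):
    partitions.setdefault(bits, []).append(i)
```

The number of dimensions is unchanged, but each axis now takes only two values, so n descriptors define at most 2^n partitions.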

The method is straightforward, but runs into a major problem. For similar biological activity, two molecules must have fairly similar values of all critical descriptors (18), so the number of bins per dimension, m, should be large. With a high-dimensional descriptor space (large k), there will be so many cells that most are empty in the database. [Pg.304]
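
The scale of the problem follows from simple counting: m bins on each of k descriptors give m^k cells. The sketch below works through one hypothetical case (10 bins, 6 descriptors, 100,000 compounds); the numbers and function names are illustrative and not taken from the source.

```python
def number_of_cells(m, k):
    """Cells in a grid with m bins on each of k descriptor axes."""
    return m ** k

def max_occupied_fraction(m, k, n_compounds):
    """Upper bound on the fraction of cells that can hold at least one compound."""
    return min(1.0, n_compounds / number_of_cells(m, k))

cells = number_of_cells(10, 6)                       # 1,000,000 cells
fraction = max_occupied_fraction(10, 6, 100_000)     # at most 10% of cells occupied
```

Even if every compound landed in its own cell, at least 90% of the cells in this example would be empty, and the bound tightens rapidly as k grows.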

Fig. 2. Cells and selected molecules in a 2D descriptor space formed by variables x1 and x2. There are four bins for each 1D descriptor (solid or dashed lines) and four cells in a 2 × 2 arrangement in two dimensions (solid lines). Solid and open circles represent four molecules selected from 20 and the remaining unselected molecules, respectively. Panels A and B show poor and good selections, respectively, according to the UCC criterion.

This simulated dataset is generated as follows. First, nine cluster centers are defined at different locations in a 2D space with coordinate values within [-3.0, 3.0]. Second, a random number (between 1 and 100) of points is generated around each cluster center within a distance of 0.5. Finally, additional points are generated, which are randomly distributed in the 2D space so that a total of 1000 points is obtained. This dataset simulates the situation where clusters of molecules exist in their descriptor space, and the number of members for each cluster is different, i.e., some regions are more densely populated than other regions. [Pg.386]
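
The three generation steps can be sketched as below. Several details are not specified in the excerpt and are assumed here: cluster centers are drawn uniformly at random, cluster members are spread uniformly over a disk of radius 0.5, and a fixed seed is used for reproducibility. The function name `simulate_clustered_points` is hypothetical.

```python
import numpy as np

def simulate_clustered_points(n_total=1000, n_clusters=9, radius=0.5,
                              low=-3.0, high=3.0, seed=0):
    """Generate clustered points plus uniform background points in a 2D space."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster centers with coordinates within [low, high] (assumed random).
    centers = rng.uniform(low, high, size=(n_clusters, 2))
    # Step 2: between 1 and 100 points within `radius` of each center.
    blocks = []
    for center in centers:
        n = rng.integers(1, 101)                     # upper bound is exclusive
        r = radius * np.sqrt(rng.uniform(size=n))    # sqrt gives uniform density on a disk
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
        blocks.append(center + np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
    clustered = np.vstack(blocks)
    # Step 3: fill up to n_total with points spread randomly over the whole space.
    background = rng.uniform(low, high, size=(n_total - len(clustered), 2))
    return centers, np.vstack([clustered, background]), len(clustered)

centers, points, n_clustered = simulate_clustered_points()
```

With at most 9 × 100 cluster members, at least 100 background points are always added. Members of a cluster whose center lies near the edge may fall slightly outside [-3.0, 3.0], which this sketch does not correct.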

Reducing the dimensionality of the descriptor space not only facilitates model building with molecular descriptors but also makes data visualization and identification of key variables in various models possible. Notice that while a low dimension mathematically simplifies a problem such as model development or data visualization, it is usually more difficult to correlate trends directly with physical descriptors, and hence the data become less interpretable, after the dimension transformation. Trends directly linked with physical descriptors provide simple guidance for molecular modifications during potency/property optimizations. [Pg.38]
