Choice of descriptors

In contrast to the example in the previous section, our real library now contains compounds with several molecular formulas. It therefore makes sense to use arithmetical descriptors. In MOLGEN-QSPR there are 48 arithmetical descriptors available (see Appendix A.1). Since there are no compounds containing iodine, no free radicals and no charged species in the library, N_I, rel. N_I, rad and charge are useless. Further, there is only one compound containing a triple bond. Therefore the triple-bond descriptors n and rel. n are also excluded. Fluorine-, bromine-, and sulfur-containing compounds are scarce, so an even distribution over LS and TS is questionable. Therefore, we also exclude N_F, rel. N_F, N_S, rel. N_S, N_Br, rel. N_Br. Thus, 35 arithmetical indices are retained. [Pg.273]
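The exclusion rule described above (drop descriptors that are constant across the library or nonzero for too few compounds) can be sketched in a few lines of Python. This is only an illustration and not MOLGEN-QSPR code: the function name, the descriptor labels, the toy values and the `min_nonzero` threshold are all assumptions, since the excerpt decides qualitatively what counts as "scarce".

```python
from typing import Dict, List


def retain_descriptors(table: Dict[str, List[float]], min_nonzero: int = 2) -> List[str]:
    """Return the names of descriptors worth keeping.

    A descriptor is dropped if it is constant over the library, or if
    fewer than `min_nonzero` compounds have a nonzero value (hypothetical
    threshold, chosen here only for illustration).
    """
    kept = []
    for name, values in table.items():
        is_constant = len(set(values)) <= 1
        nonzero = sum(1 for v in values if v != 0)
        if not is_constant and nonzero >= min_nonzero:
            kept.append(name)
    return kept


# Hypothetical toy library of five compounds (one value per compound).
toy = {
    "N_C": [2, 3, 4, 6, 8],
    "N_I": [0, 0, 0, 0, 0],       # no iodine anywhere: constant, dropped
    "n_triple": [0, 0, 1, 0, 0],  # a single triple-bond compound: dropped
    "N_O": [1, 0, 2, 1, 0],
}

print(retain_descriptors(toy))  # ['N_C', 'N_O']
```

A descriptor that is nonzero for only one compound cannot be represented in both the learning and the test set, which is the reason such columns are removed before modelling.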

For topological indices, we start with the 30 indices listed at the beginning of Section 7.4 and again exclude ... and M2 for the reasons given there. Among the remaining 28 TIs there is no pairwise complete correlation within our library. The function twc should be used with caution. Its values increase exponentially with increasing number of bonds and with increasing topological density, i.e. the quotient of bond count and atom count of a molecular graph. In our real library the highest twc values are ... [Pg.273]
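A check for pairwise complete correlation among index columns could look like the following Python sketch. It assumes that "complete correlation" means a Pearson coefficient of magnitude 1 (within a small tolerance); the function names, the tolerance and the toy values are hypothetical and not taken from the source.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant descriptor")
    return sxy / sqrt(sxx * syy)


def completely_correlated_pairs(table, tol=1e-9):
    """List descriptor pairs whose |r| equals 1 within `tol`."""
    pairs = []
    for (name_a, xa), (name_b, xb) in combinations(table.items(), 2):
        if abs(abs(pearson(xa, xb)) - 1.0) <= tol:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical index values for four compounds.
toy_tis = {
    "TI_a": [1.0, 2.0, 3.0, 4.0],
    "TI_b": [3.0, 5.0, 7.0, 9.0],  # equals 2 * TI_a + 1, so r = 1
    "TI_c": [1.0, 4.0, 2.0, 8.0],
}

print(completely_correlated_pairs(toy_tis))  # [('TI_a', 'TI_b')]
```

If such a pair is found, one member carries no additional information for a linear model and can be dropped.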

The highest twc value is an order of magnitude higher than the next highest, and so on. This may have fatal consequences for linear (and nonlinear) models. If there are no compounds with highest twc in the LS, completely unrealistic predictions for such ... [Pg.273]
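The extrapolation risk described here can be made visible with a simple range check before prediction. The Python sketch below assumes that LS and TS denote the learning and test sets; it flags test compounds whose descriptor value lies outside the range seen in the LS. The function name and the numbers are hypothetical and are not twc values from the source.

```python
def outside_learning_range(ls_values, ts_values):
    """Return indices of TS values that fall outside [min(LS), max(LS)]."""
    lo, hi = min(ls_values), max(ls_values)
    return [i for i, v in enumerate(ts_values) if v < lo or v > hi]


# Hypothetical values of a descriptor that grows roughly exponentially.
ls_twc = [12.0, 150.0, 1.8e3, 2.1e4]
ts_twc = [90.0, 2.5e5]  # the second compound exceeds every LS value

print(outside_learning_range(ls_twc, ts_twc))  # [1]
```

Any flagged compound would be predicted by extrapolation, so its predicted value should be treated with suspicion or the LS/TS split should be redrawn.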

Obtaining a good-quality QSAR model depends on many factors, such as the quality of biological data and the choice of descriptors and statistical methods. As a consequence, the uncertainty of the QSAR predictions is a combination of experimental uncertainties and model uncertainties. QSAR methods have to be applied to individual chemicals, not to mixtures. If the QSAR demands it, the components of the mixture have to be addressed separately and individually. In the case of unknown compounds, QSAR cannot identify the toxicity risk and is therefore not useful. [Pg.468]

Given the variety of different descriptors and subset selection methods that are available, several studies have been carried out in an attempt to validate both the compound selection methods and the various descriptors. To some extent the choices of descriptors and subset selection methods are interlinked. For example, partitioning schemes are restricted to low-dimensional descriptors such as physicochemical descriptors, whereas clustering and dissimilarity-based methods can be used with high-dimensional descriptors such as fingerprints. [Pg.357]
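Why partitioning is tied to low-dimensional descriptors becomes clear from a small sketch: each descriptor axis is cut into bins, and every compound falls into one cell of the resulting grid, so the number of cells is `n_bins ** n_descriptors`. The Python code below is a minimal illustration under the assumption of equal-width bins; the function name and the two-descriptor toy data are hypothetical.

```python
from collections import defaultdict


def assign_cells(points, n_bins):
    """Map each compound (a tuple of descriptor values) to a grid cell.

    Every descriptor axis is split into `n_bins` equal-width bins between
    its minimum and maximum. Returns {cell_key: [compound indices]}.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = []
        for d in range(dims):
            span = highs[d] - lows[d]
            if span == 0:
                key.append(0)
            else:
                # Clamp so the maximum value lands in the last bin.
                b = int((p[d] - lows[d]) / span * n_bins)
                key.append(min(b, n_bins - 1))
        cells[tuple(key)].append(idx)
    return dict(cells)


# Hypothetical compounds described by two physicochemical descriptors.
compounds = [(1.0, 100.0), (1.2, 110.0), (4.8, 400.0), (5.0, 390.0), (2.0, 350.0)]
cells = assign_cells(compounds, n_bins=2)
print(cells)  # {(0, 0): [0, 1], (1, 1): [2, 3], (0, 1): [4]}

# One representative per occupied cell gives a diverse subset.
print([members[0] for members in cells.values()])  # [0, 2, 4]
```

With two bins per axis, a 1024-bit fingerprint would already define 2**1024 cells, almost all of them empty, which is why such data are handled by clustering or dissimilarity-based methods instead.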

An additional consideration on the choice of descriptor for diversity analyses is the speed with which the descriptor can be calculated since diversity studies are often applied to the huge numbers of compounds (potentially millions) that characterise combinatorial libraries. Thus, some computationally expensive descriptors such as field-based descriptors [28] or descriptors derived from quantum mechanics [29] are not appropriate for diversity studies. [Pg.46]

Many different methods have been developed both to measure diversity and to select diverse sets of compounds; however, there is currently no clear picture of which methods are best. To date, some work has been done on comparing the various methods; however, there is a great need for more validation studies to be performed both on the structural descriptors used and on the different compound selection strategies that have been devised. In some cases, the characteristics of the library itself might determine the choice of descriptors and the compound selection methods that can be applied. For example, computationally expensive methods such as 3D pharmacophore methods are limited in the size of libraries that can be handled. Thus, for product-based selection, they are currently restricted to handling libraries of tens of thousands of compounds rather than the millions that can be handled using 2D-based descriptors. [Pg.61]

Arbitrariness. One hears comments that graph descriptors, being arbitrary, do not have a deep physicochemical significance. A misconception here is in confusing an input with an output of a mathematical treatment of a problem. Surely the choice of descriptors selected to characterize graphs is at the disposal of an investigator, hence arbitrary. But the same is true of a choice of coordinate systems used to describe a system of classical physics, or the choice of basis functions to compute a molecular structure in quantum chemistry! In each case we speak of input information, and the outcome of the analysis undertaken (when a complete basis is taken) does not depend on the choice of descriptors (coordinates). However, the amount of work, even its mere feasibility, will differ. A poor selection of basis functions may... [Pg.247]

The choice of descriptors may be, in itself, relevant to pharmacophore elucidation - if founded on an information-rich training set. [Pg.69]

The choice of descriptors is not always clear-cut. The time required to calculate elaborate descriptors by quantum methods is not always justified compared to the results obtained with simpler and more rapidly calculated descriptors. For example, Bergstrom [12] compared 2D polar surface area (PSA) with 3D PSA and static, instead of dynamic, calculations. No definitive gain was obtained by using the most sophisticated method(s) of calculating PSA descriptors. [Pg.58]

The choice of descriptors to use in a QSAR analysis is dependent on several factors ... [Pg.495]

The examples also show that the benefit of a theoretical investigation depends on the methods used - the choice of descriptors is especially crucial. Different descriptors must be applied for different design tasks, and all scientists who use computer-assisted library design methods should know the characteristics of basic... [Pg.610]

Conclusions about human carcinogenic potential (choice of descriptor(s), ...) ... [Pg.202]

When comparing priority setting methodologies it is important to identify additional external information. For all the methods considered a choice is made by choosing the descriptors. The choice of descriptors is however often based on key parameters from risk assessment schemes or environmental fate models, which makes it less subjective. In the present chapter exposure (given by production volume), aquatic toxicity, bioaccumulation and persistence were chosen as descriptive parameters. For HDT the selection of the descriptors is the main contribution of subjectivity. Additionally, some indirect weighting can be added if the data are separated into classes. [Pg.253]

Given the large variety of molecular descriptors that exist, several approaches have been taken to evaluate which descriptors are most suitable for virtual screening although, to some extent, the choice of descriptors and subset selection method is interlinked. For example, as already mentioned, cell-based approaches are only appropriate for low-dimensional chemistry space whereas both DBCS and clustering can be used with high-dimensional data such as fingerprints. [Pg.624]
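Dissimilarity-based compound selection (DBCS) on fingerprints can be illustrated with a MaxMin-style picker. The sketch below is one common variant and not a specific published implementation. It makes several assumptions: fingerprints are represented as Python sets of "on" bit positions, dissimilarity is 1 minus the Tanimoto coefficient, and the first compound serves as the seed. All names and the toy data are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto coefficient for two fingerprints given as sets of bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_select(fps, k, seed=0):
    """Pick up to `k` compounds, each time adding the compound whose
    nearest already-selected neighbour is farthest away."""
    if k <= 0 or not fps:
        return []
    selected = [seed]
    while len(selected) < min(k, len(fps)):
        best_idx, best_dist = None, -1.0
        for i in range(len(fps)):
            if i in selected:
                continue
            d = min(tanimoto_distance(fps[i], fps[j]) for j in selected)
            if d > best_dist:
                best_idx, best_dist = i, d
        selected.append(best_idx)
    return selected


# Hypothetical fingerprints: two close pairs and one compound in between.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}, {6, 7, 9}, {1, 6, 10, 11}]
print(maxmin_select(fps, k=3))  # [0, 2, 4]
```

Because only pairwise distances are needed, the dimensionality of the fingerprint does not affect the procedure, in contrast to cell-based partitioning.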

The focus of this chapter is ligand-based virtual screening. The underlying principle is the assumption that similar molecules should exhibit similar binding properties with respect to a given target [11]. Molecular similarities are based on descriptors and similarity measures. Section 3.2 introduces the calculation of descriptors. Some selected descriptors are discussed in detail. The choice of descriptors does not imply an assessment; however, it is driven by the experience of the authors. [Pg.62]

An important advantage of linear models over non-linear ones is that they are usually easier to interpret. However, the value of a model, expressed in terms of predictivity and interpretability, crucially depends on the type and number of descriptors which are incorporated. 3D-QSAR models based on MIFs allow one to localize regions with favorable and less favorable interactions. Very often, predictive models are obtained by using multivariate approaches with many descriptors. The choice of descriptors determines the chemical variability which can be captured by a QSAR model [135]. During the model building process one has to find an appropriate balance between few, easily interpretable variables and a broader multivariate characterization which may be more difficult to interpret. [Pg.74]

So, the only firm conclusions that can be drawn from these various investigations of different descriptor types are that combinations can be good and that the choice of descriptor, including the selection of subsets and possible combinations, is a vital part of the overall model building process. [Pg.236]

Another problem with the choice of descriptors is accessibility. Some types of proprietary parameters are only available through the licensing of commercial software. There are, however, some web-based resources, such as Chembench (http://chembench.mml.unc.edu) and the Virtual Computational Chemistry Laboratory (www.vcclab.org), which both provide not only descriptor calculation facilities but also access to statistical analysis routines. The molecular descriptors website (www.moleculardescriptors.eu) and the QSAR World website (www.qsarworld.com/qsar-web-based-programs.php) also provide useful links to resources such as databases and programs. [Pg.237]

This chapter has presented an introduction to the general problem of selecting and using a molecular shape descriptor. It should be apparent that there is an enormous range of descriptors among which we can choose. However, our first decision in shape analysis is not the choice of descriptor. As we have seen, the nature of the descriptor can change with the molecular properties under study, and their model representations. Therefore, our first true choice must be the selection of a molecular model relevant to the problem. [Pg.238]

The choice of descriptor will depend on a number of factors, including any personal biases of the modeler. Perhaps the most important considerations are the amount of information available about the target and whether lead compounds have been discovered. There are several possible scenarios ... [Pg.16]

Clustering by 2D fingerprints is a very common procedure, and yet there are still many unanswered questions, principally around the choice of descriptor and the selection of a statistically appropriate number of clusters. Wild and Blankley have looked at these issues and have come to some interesting conclusions: MACCS-like keys are best for general diverse sets (e.g. corporate databases), whereas Daylight-like keys are the best for similar sets (e.g. combinatorial libraries); the best method for cluster-level selection is dataset-dependent. However, the Kelley method seems to have the best worst-case performance across the different datasets and descriptors. Xue et al. have developed a method for using consensus... [Pg.284]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...