Dataset characteristics

We must now mention that traditionally, especially in chemometrics, outliers have a different definition and even a different interpretation. Suppose that we have a k-dimensional characteristic vector, i.e., k different molecular descriptors are used. If we imagine a k-dimensional hyperspace, then the dataset objects will occupy different places in it. Some of them will tend to group together, while others will be located in more remote regions. One can by convention define a margin beyond which the realm of strong outliers starts. Moderate outliers stay near this margin. [Pg.213]
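As a rough illustration of the margin idea, the sketch below measures each object's Euclidean distance from the centroid of the descriptor space and labels it against a margin. The function names, the choice of Euclidean distance to the centroid, and the `band` parameter (how close to the margin counts as "moderate") are assumptions made for this example; the excerpt does not prescribe a particular distance measure or threshold.

```python
import math

def centroid(points):
    """Mean value of each descriptor over all objects."""
    n = len(points)
    k = len(points[0])
    return [sum(p[j] for p in points) / n for j in range(k)]

def distances_to_centroid(points):
    """Euclidean distance of every object from the dataset centroid."""
    c = centroid(points)
    return [math.dist(p, c) for p in points]

def classify_outliers(points, margin, band=0.1):
    """Label each object 'inlier', 'moderate', or 'strong'.

    margin: conventional distance threshold (hypothetical, user-chosen).
    band:   fractional width around the margin treated as 'moderate'
            (an assumption of this sketch, not taken from the source).
    """
    labels = []
    for d in distances_to_centroid(points):
        if d > margin * (1 + band):
            labels.append("strong")
        elif d >= margin * (1 - band):
            labels.append("moderate")
        else:
            labels.append("inlier")
    return labels

# Made-up two-descriptor dataset: four clustered objects and one remote one.
example = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
example_labels = classify_outliers(example, margin=3.5)
```

Note that the centroid itself is pulled toward remote objects, so in practice a more robust centre (for example, a median) or scaled descriptors may be preferable.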

This asymmetry may have an effect on the development of the map. If there are few examples of a particular class in the dataset or if the characteristics of some sample patterns are markedly different from the characteristics of most other samples, development of the map may be eased if these unusual samples find their way to the edge of the map where they have fewer neighbors. The remaining samples, which share a wider range of characteristics, then have the whole of the rest of the map to themselves and they can spread out widely to reveal the differences between them to the maximum degree permitted by the size of the map. [Pg.86]

The finished network automatically reflects the characteristics of the data domain. Not only do the network weights evolve so that they describe the data as fully as possible, but so also does the network geometry. The size of the network is not chosen in advance, and as the topology is determined by the algorithm and the dataset in combination, it is more likely to be appropriate than the geometry used for a SOM, especially in the hands of an inexperienced user, who might find it difficult to choose an appropriate size of network or suitable values for the adjustable parameters in the SOM. [Pg.109]

Thus, it is still uncommon to test QSAR models (characterized by a reasonably high q²) for their ability to predict accurately biological activities of compounds not included in the training set. In contrast to such expectations, it has been shown that if a test set with known values of biological activities is available for prediction, there exists no correlation between the LOO cross-validated q² and the correlation coefficient between the predicted and observed activities for the test set (Figure 16.1). In our experience [17, 28], this phenomenon is characteristic of many datasets and is independent of the descriptor types and optimization techniques used to develop training set models. In a recent review, we emphasized the importance of external validation in developing reliable models [18]. [Pg.440]
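To make the two quantities being compared concrete, here is a minimal sketch for a one-descriptor linear model. It assumes the common definition of the leave-one-out statistic, 1 − PRESS/SS_tot with SS_tot taken about the training-set mean, and uses the squared Pearson correlation for the external test set; the function names and the numeric data are invented for illustration and do not come from the cited studies.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def loo_q2(xs, ys):
    """Leave-one-out cross-validated statistic: 1 - PRESS / SS_tot."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without object i, then predict object i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot

def pearson_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Invented training and external test data (descriptor value, activity).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
test_x = [1.5, 3.5, 5.5]
test_y = [1.4, 3.6, 5.3]

slope, intercept = fit_line(train_x, train_y)
q2 = loo_q2(train_x, train_y)
r2_external = pearson_r2(test_y, [slope * x + intercept for x in test_x])
```

The two numbers are computed from different compounds, which is why a high internal q² by itself says little about performance on the external set.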

As expected, Equations 3 and 4 convey the same message as did Equations 1 and 2 and have improved t-ratios on term coefficients; they are less confounded by the noise characteristic of the full datasets and are thereby more suitable for predictive work. [Pg.331]

In this section, two types of structure-metal binding ability relationships will be described. The first one concerns empirical linear correlations between equilibrium constants of complexation or extraction and some descriptors. In most cases, these correlations are obtained for relatively small datasets (fewer than 20 molecules) without any validation. We do not intend to analyze them in detail; only their general characteristics will be reported. The second type of relationship was obtained in regular QSPR studies involving the selection of pertinent descriptors from large initial pools and a stage of model validation on external test set(s). [Pg.329]

The dataset in Table 6.2 is of the same size but represents three partially overlapping peaks. The profile (Figure 6.5) appears to be slightly more complex than that for dataset A, and the PC scores plot presented in Figure 6.6 definitely appears to contain more features. Each turning point represents a pure compound, so it appears that there are three compounds, centred at times 9, 13 and 17. In addition, the spectral characteristics... [Pg.344]
