Classified Datasets

LSVM, http://www.cs.wisc.edu/dmi/lsvm/. LSVM (Lagrangian Support Vector Machine) is a very fast SVM implementation in MATLAB by Mangasarian and Musicant. It can classify datasets containing several million patterns. [Pg.390]

There are finer details to be extracted from such Kohonen maps that directly reflect chemical information and have chemical significance. A more extensive discussion of the chemical implications of the mapping of the entire dataset can be found in the original publication [28]. Clearly, such a map can now be used for the assignment of a reaction to a certain reaction type. Calculating the physicochemical descriptors of a reaction allows it to be input into this trained Kohonen network. If this reaction is mapped, say, in the area of Friedel-Crafts reactions, it can safely be classified as a feasible Friedel-Crafts reaction. [Pg.196]
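
The assignment step described above can be sketched as a best-matching-unit lookup. This is only an illustration, not the network from the original publication: the 2x2 grid, the weight vectors, the two-component descriptors, and the class labels are all hypothetical.

```python
import math

# Hypothetical trained map: each neuron holds a weight vector and the label
# of the training reactions that were mapped onto it.
TRAINED_MAP = {
    (0, 0): ([0.9, 0.1], "Friedel-Crafts"),
    (0, 1): ([0.8, 0.3], "Friedel-Crafts"),
    (1, 0): ([0.2, 0.8], "other"),
    (1, 1): ([0.1, 0.9], "other"),
}


def best_matching_unit(descriptor, trained_map):
    """Return the grid position whose weights are closest to the descriptor."""
    return min(
        trained_map,
        key=lambda pos: math.dist(descriptor, trained_map[pos][0]),
    )


def classify(descriptor, trained_map):
    """Map a descriptor onto the grid and read off the label of that neuron."""
    pos = best_matching_unit(descriptor, trained_map)
    return pos, trained_map[pos][1]


# Hypothetical descriptor vector of a new reaction.
position, label = classify([0.85, 0.15], TRAINED_MAP)
print(position, label)
```

In practice the descriptor vector would have as many components as the physicochemical descriptors used to train the map, and a neuron that received no training reactions would need its own "unassigned" label.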

Inspection of the cross-validation results revealed that all but one of the compounds in the dataset had been modeled well. The last (31st) compound behaved anomalously. When we looked at its chemical structure, we saw that it was the only compound in the dataset that contained a fluorine atom. What would happen if we removed the compound from the dataset? The quality of learning improved substantially: the cross-validation coefficient increased from 0.82 to 0.92, while the error decreased from 0.65 to 0.44. Another learning method, Kohonen's Self-Organizing Map, also failed to classify this 31st compound correctly. Hence, we had to conclude that the compound containing a fluorine atom was an obvious outlier of the dataset. [Pg.206]
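
The effect of removing a single outlier on a cross-validated coefficient can be illustrated with a small leave-one-out sketch. The passage does not state which model was used, so a one-variable linear regression is swapped in here as an assumption, and the data values are invented for illustration; they are not the 31 compounds discussed above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(xs, ys):
    """Return (q2, residuals), where q2 = 1 - PRESS / total sum of squares."""
    residuals = []
    for i in range(len(xs)):
        # Refit without sample i, then predict sample i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(ys[i] - (slope * xs[i] + intercept))
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    press = sum(r * r for r in residuals)
    return 1.0 - press / ss_total, residuals


# Hypothetical data: the last point departs from the trend of the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 9.0]

q2_all, residuals = leave_one_out(xs, ys)
suspect = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
q2_clean, _ = leave_one_out(xs[:suspect] + xs[suspect + 1:],
                            ys[:suspect] + ys[suspect + 1:])
print(f"suspect index: {suspect}, q2 with: {q2_all:.2f}, without: {q2_clean:.2f}")
```

A large leave-one-out residual only flags a candidate. As in the passage, the decision to drop the compound should rest on a chemical reason (here, the lone fluorine atom) and not on the statistics alone.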

One can find more details on the algorithm in Section 4.3.4. This time the learning yielded substantially improved results. With the primary dataset, only 21 of the 91 compounds were classified correctly, whereas with the optimized dataset (i.e., the one with no redundancy) the number of correct classifications rose to 65 out of 91. [Pg.207]

Initially the dataset contained 818 compounds, among which 31 were active (high TA, low USE), 157 inactive (low TA, high USE), and the rest intermediate. When the complete dataset was employed, none of the active compounds and 47 of the inactives were correctly classified by using Kohonen self-organizing maps (KSOM). [Pg.221]
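
The per-class success rates implied by these counts can be worked out directly. The short calculation below uses only the numbers quoted in the passage; the variable names are illustrative.

```python
# Counts taken from the passage above.
total = 818
class_sizes = {"active": 31, "inactive": 157}
class_sizes["intermediate"] = total - sum(class_sizes.values())

# Correctly classified counts reported for the complete dataset.
correct = {"active": 0, "inactive": 47}

# Fraction of each class that was recovered (per-class recall).
recall = {name: correct[name] / class_sizes[name] for name in correct}
for name, value in recall.items():
    print(f"{name}: {correct[name]}/{class_sizes[name]} = {value:.1%}")
```

The actives make up fewer than 4% of the 818 compounds, which is the kind of class imbalance under which a map can ignore the minority class entirely.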

The statistical difference in performance of models was estimated with a bootstrap test using 10000 replicas (see details in Ref. [91]). The significance level of p<0.05 was used. For each dataset all methods were classified in four categories. [Pg.395]
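
A bootstrap comparison of two models can be sketched as follows. The exact procedure is described in Ref. [91] and is not reproduced here; this is a generic paired bootstrap on per-sample errors, assumed for illustration, with invented error values and hypothetical function and parameter names.

```python
import random


def bootstrap_p_value(errors_a, errors_b, n_replicas=10000, seed=0):
    """Two-sided paired bootstrap p-value for the mean error difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    not_above = not_below = 0
    for _ in range(n_replicas):
        # Resample the paired differences with replacement.
        replica = rng.choices(diffs, k=n)
        mean_diff = sum(replica) / n
        if mean_diff <= 0:
            not_above += 1
        if mean_diff >= 0:
            not_below += 1
    # Fraction of replicas on the less frequent side of zero, doubled.
    return min(1.0, 2 * min(not_above, not_below) / n_replicas)


# Hypothetical absolute errors of two models on the same eight test samples.
errors_a = [0.52, 0.61, 0.48, 0.70, 0.55, 0.66, 0.59, 0.63]
errors_b = [0.40, 0.45, 0.39, 0.51, 0.47, 0.50, 0.44, 0.49]

p = bootstrap_p_value(errors_a, errors_b)
print("significant at p<0.05:", p < 0.05)
```

Resampling the paired differences, instead of each model's errors separately, keeps the comparison on the same test samples in every replica.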

Two-dimensional SOMs are more widely used than those of one dimension because the extra flexibility that is provided by a second dimension allows the map to classify a greater number of classes. Figure 3.14 shows the result of applying a 2D SOM to the same dataset used to create Figure 3.12 and Figure 3.13; the clustering of similar node weights is very clear. [Pg.69]
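
How a two-dimensional map comes to separate classes can be sketched with a minimal SOM training loop. This is not the code behind Figures 3.12 to 3.14: the 3x3 grid, the linear decay schedules, the evenly spaced starting weights, and the two-feature data are all assumptions chosen to keep the example small and deterministic.

```python
import math


def find_bmu(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(x, weights[pos]))


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a small 2D SOM on two-feature samples scaled to [0, 1]."""
    # Deterministic start: spread the node weights evenly over the unit square.
    weights = {
        (r, c): [r / (rows - 1), c / (cols - 1)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        lr = lr0 * remaining                # learning rate decays linearly
        sigma = sigma0 * remaining + 0.3    # neighbourhood radius shrinks
        for x in data:
            bmu = find_bmu(weights, x)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                # Pull the node toward the sample, scaled by grid proximity.
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Hypothetical samples from two well-separated groups, interleaved.
DATA = [[0.10, 0.12], [0.90, 0.88], [0.08, 0.15],
        [0.92, 0.85], [0.12, 0.09], [0.87, 0.91]]

som = train_som(DATA)
print(find_bmu(som, [0.1, 0.1]), find_bmu(som, [0.9, 0.9]))
```

The neighbourhood term `h` depends on distance in the grid, not in data space, which is what causes nearby nodes to end up with similar weights.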

Yoshida and Topliss compiled a dataset of 232 structurally diverse drugs (Table 16.4) and evaluated the possibility of constructing a predictive model for human oral bioavailability on categorical data [28]. The bioavailability data were classified into four categories ... [Pg.363]

Biomedical spectra are often extremely complex. Hyphenated techniques such as MS-MS can generate databases that contain hundreds of thousands or millions of data points. Reduction of dimensionality is then a common step preceding data analysis because of the computational overheads associated with manipulating such large datasets [9]. To classify the very large datasets provided by biomedical spectra, some form of feature selection [10] is almost essential. In sparse data, many combinations of attributes may separate the samples, but not every combination is plausible. [Pg.363]
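
One simple form of feature selection is a filter that scores each attribute by how well it separates two classes and keeps only the best ones. The passage does not say which method was used, so the Fisher-style score below is an assumption for illustration, and the three-feature "spectra" are invented.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Score each feature as (difference of class means)^2 / summed variance.

    Assumes exactly two distinct labels.
    """
    first, second = sorted(set(labels))
    group_a = [s for s, lab in zip(samples, labels) if lab == first]
    group_b = [s for s, lab in zip(samples, labels) if lab == second]
    scores = []
    for j in range(len(samples[0])):
        col_a = [s[j] for s in group_a]
        col_b = [s[j] for s in group_b]
        spread = pvariance(col_a) + pvariance(col_b)
        # Small constant avoids division by zero for constant features.
        scores.append((mean(col_a) - mean(col_b)) ** 2 / (spread + 1e-12))
    return scores


def select_top_k(samples, labels, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_scores(samples, labels)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical intensities at three spectral positions for six samples.
samples = [[0.5, 0.10, 0.3], [0.6, 0.20, 0.7], [0.4, 0.15, 0.5],
           [0.5, 0.90, 0.4], [0.6, 0.80, 0.6], [0.4, 0.85, 0.5]]
labels = [0, 0, 0, 1, 1, 1]

selected = select_top_k(samples, labels, k=1)
print(selected)
```

A filter like this scores attributes one at a time, so it cannot judge whether a combination of attributes is plausible; that still calls for domain knowledge.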

Degner et al. (1993) used the model with a dataset of 774 chemicals with MITI test data and found 80.7% correct estimates for easily degradable/non-degradable classifications. They were unable to classify 18 chemicals due to a lack of indicator variables. [Pg.319]

Another method that can be used to quickly extract useful chemical information from an infrared image dataset is MCR [50-52,54,56-58]. In some cases, this method can be used to obtain the concentration and absorbance spectra for each constituent in the original dataset. However, if the goal is not necessarily to resolve the constituents' spectra, but rather to empirically classify them, a regression method may be more appropriate [53,59,60]. The most prominent multivariate regression methods include PLS and ANNs. [Pg.271]

Fig. 2 NMR-based metabolomics can be used to quickly identify changes in the global NMR pattern. In this case, the red peaks between 2.5-0.5 ppm are indicative of metabolic differences that are specific to the disease state. Actual data is not nearly as clear as this schematic. The analysis of typical NMR metabolomics datasets requires the use of multivariate analysis methods, such as principal component analysis (PCA), in order to use the metabolome to classify samples...
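
The role of PCA in separating sample groups can be illustrated with a small sketch that extracts the first principal component by power iteration. This is not the pipeline behind the figure: the binned intensities are invented, and power iteration from an all-ones starting vector is an implementation choice that assumes the start is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first PC via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical binned intensities: three control and three disease samples.
spectra = [[1.0, 2.0, 0.5], [1.1, 2.1, 0.5], [0.9, 1.9, 0.5],
           [3.0, 4.0, 0.5], [3.1, 4.1, 0.5], [2.9, 3.9, 0.5]]

direction, scores = first_principal_component(spectra)
print([round(s, 2) for s in scores])
```

In this toy dataset the two groups fall on opposite sides of zero along the first component. Real spectra have thousands of bins and far more overlap, so several components and a proper linear algebra library would normally be used.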

Later Sweredoski et al. (37) incorporated a combination of amino acid propensity scores and half-sphere exposure values at multiple distances to form the BEpro tool (formerly called PEPITO). Using the Epitopia algorithm, Rubinstein et al. (38) for the first time truly exploited an extensive set of physicochemical and structural geometrical features from an antigen's primary or tertiary structures. They trained the Naive Bayes classifier using a benchmark dataset of 66 and 194 validated nonredundant epitopes derived from antibody-antigen structures and antigen sequences,... [Pg.133]
