Descriptors selection procedure

Another extension is the descriptor selection procedure designed to enhance the stability and predictivity of the PLSR models. Its aim is to minimize the info-noise that can dilute and distort the true structure-activity relationship. The procedure involves two phases. The first phase eliminates the low-variable descriptors, that is, those that have the same value for all but a few (2-3) compounds in the training set. Such descriptors cannot provide useful statistical information and instead simply help to fit these particular compounds into a model, thus decreasing its predictivity. This filtering is performed entirely in the X-space, without regard for the activity values. In the optional second phase, the descriptor... [Pg.160]
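The first-phase filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by the authors: the function name `drop_low_variable`, the parameter `max_minority`, and the toy data are hypothetical, and descriptor values are compared for exact equality.

```python
from collections import Counter


def drop_low_variable(X, max_minority=3):
    """Return the indices of descriptor columns worth keeping.

    X is a list of rows (one per training compound), each a list of
    descriptor values. A column is dropped when all but at most
    `max_minority` compounds share the same value. Activity values are
    never consulted, so the filter works purely in the X-space.
    """
    n_rows = len(X)
    keep = []
    for j in range(len(X[0])):
        counts = Counter(row[j] for row in X)
        most_common_count = counts.most_common(1)[0][1]
        # Number of compounds that deviate from the dominant value.
        if n_rows - most_common_count > max_minority:
            keep.append(j)
    return keep


# Toy training set: 6 compounds, 3 descriptors (hypothetical values).
X = [
    [0.0, 1.2, 5],
    [0.0, 3.4, 5],
    [0.0, 2.2, 5],
    [0.0, 0.7, 5],
    [1.0, 1.9, 5],
    [0.0, 2.8, 7],
]
kept = drop_low_variable(X, max_minority=2)
print(kept)
```

In this toy set the first and third descriptors differ from their dominant value for only one compound each, so only the second descriptor survives.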

A descriptor selection procedure was then applied to identify the most promising descriptor set out of the following available ... [Pg.1797]

PLS VCCLAB Partial Least Squares (PLS). The original two-step descriptor selection procedure http://www.vcclab.org/... [Pg.337]

Using a descriptor selection procedure, we found that only three descriptors (HOMO, LUMO, and Q) are essential for the SVM model. To exemplify the shape of the classification hyperplane for polar and nonpolar narcotic pollutants, we selected 20 compounds (Table 7) as a test set (nonpolar compounds, class +1; polar compounds, class −1). [Pg.353]

In the NN method, the property F of the target compound is calculated as an average (or weighted average) of the values for its nearest neighbors (NN) in the space of descriptors selected for the model. Different metrics (Euclidean distances, Tanimoto similarity coefficients, etc.) can be used to identify the neighbors. Their number k is optimized using a cross-validation procedure for the training set. [Pg.325]
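A minimal sketch of such a nearest-neighbor estimate is given below, assuming Euclidean distance and inverse-distance weights. The weighting scheme, the function name `knn_predict`, and the toy data are assumptions made for illustration; the source does not specify them. Optimizing k by cross-validation is not shown.

```python
import math


def knn_predict(train_X, train_y, query, k=3, weighted=True):
    """Estimate a property as the (weighted) average over the k nearest
    training compounds in descriptor space (Euclidean metric)."""
    # Sort training compounds by distance to the query and keep the k closest.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if not weighted:
        return sum(y for _, y in neighbors) / len(neighbors)
    # An exact match would give an infinite weight, so return its value.
    for d, y in neighbors:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


# Toy data: four compounds described by two descriptors (hypothetical).
train_X = [[0, 0], [1, 0], [0, 2], [5, 5]]
train_y = [1.0, 2.0, 4.0, 10.0]

plain = knn_predict(train_X, train_y, [0, 1], k=2, weighted=False)
exact = knn_predict(train_X, train_y, [1, 0], k=3)
print(plain, exact)
```

A Tanimoto-based variant would replace `math.dist` with a similarity function and sort in descending order.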

A stepwise selection procedure is performed to search for QSPR/QSAR models after the preliminary exclusion of constant and near-constant variables. The pair correlation cutoff selection of variables is then performed to avoid highly correlated descriptor variables within the model. [Pg.75]
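The pair correlation cutoff can be illustrated with a short sketch. The greedy rule used here (keep the first descriptor of each highly correlated pair), the threshold `r_max = 0.95`, and all names are assumptions; the source does not say which member of a correlated pair is discarded. Only strictly constant columns are excluded here, and a near-constant test would need its own threshold.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_cutoff(columns, r_max=0.95):
    """Greedy filter: skip constant columns, then keep a column only if
    |r| with every column already kept does not exceed r_max."""
    kept = []
    for j, col in enumerate(columns):
        if len(set(col)) == 1:  # constant descriptor, no information
            continue
        if all(abs(pearson(col, columns[i])) <= r_max for i in kept):
            kept.append(j)
    return kept


# Four descriptors over five compounds (hypothetical values).
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.1],   # almost a multiple of the first descriptor
    [5, 3, 4, 1, 2],      # r = -0.8 with the first descriptor
    [7, 7, 7, 7, 7],      # constant
]
kept = correlation_cutoff(columns)
print(kept)
```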

The second modification was to bias the descriptors selected for the subset in order to speed the optimization procedure. The descriptor selection is achieved by calculating a quality value for each descriptor. The quality value for a particular descriptor is the cost... [Pg.120]

Figure 6.6 Two procedures for descriptor selection in validation processes: (a) descriptor selection occurs before dataset splitting (selection bias); (b) descriptor selection occurs after dataset splitting (correct procedure). The solid line illustrates the external validation process, and together the solid and dashed lines constitute the cross-validation process.
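Procedure (b) in the caption can be sketched as a cross-validation loop in which selection is repeated inside every fold, so the held-out compounds never influence which descriptors are chosen. Everything below is a hypothetical illustration: the selector (one descriptor with the largest absolute covariance with the response), the one-variable least-squares model, and the synthetic data are assumptions, not the method of the cited text.

```python
def cross_validate(X, y, select, fit, predict, n_folds=5):
    """k-fold CV with descriptor selection done on the training part only."""
    n = len(X)
    squared_errors = []
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        cols = select(X_train, y_train)  # never sees the held-out rows
        model = fit([[row[j] for j in cols] for row in X_train], y_train)
        for i in test_idx:
            y_hat = predict(model, [X[i][j] for j in cols])
            squared_errors.append((y_hat - y[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


seen_sizes = []  # records how many compounds the selector was shown


def select_top1(X_train, y_train):
    """Toy selector: the single column with the largest |covariance| with y."""
    seen_sizes.append(len(X_train))
    mean_y = sum(y_train) / len(y_train)

    def abs_cov(j):
        col = [row[j] for row in X_train]
        mean_c = sum(col) / len(col)
        return abs(sum((c - mean_c) * (t - mean_y) for c, t in zip(col, y_train)))

    return [max(range(len(X_train[0])), key=abs_cov)]


def fit_line(X_train, y_train):
    """One-variable least squares; returns (slope, intercept)."""
    xs = [row[0] for row in X_train]
    mean_x, mean_y = sum(xs) / len(xs), sum(y_train) / len(y_train)
    slope = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y_train)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept


# Synthetic data: the first descriptor determines y, the second is noise-like.
X = [[i, (i * 7) % 5] for i in range(10)]
y = [2 * i + 1 for i in range(10)]
mse = cross_validate(X, y, select_top1, fit_line, predict_line, n_folds=5)
print(mse, seen_sizes)
```

Procedure (a) would correspond to calling `select_top1(X, y)` once on all ten compounds before the loop, which lets the held-out compounds leak into the choice of descriptors.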

Descriptors were selected using the CART algorithm (mindev = 0.04). Hence, the number of descriptors may vary from SP to SP. As a second procedure for variable selection, the Z-fold stepwise selection procedure was used in connection with MLR, with subsets of 13 descriptors selected by 50-fold stepwise selection. [Pg.352]

Table 8.15 contains the misclassification rates for variable selection by the Z-fold stepwise procedure in connection with MLR. Here the linear models are markedly better than CT and are slightly inferior to SVM with radial kernel only. Compared with descriptor selection by CT, LDA, ANN and SVM look better for stepwise selection. Again, Figure 8.34 illustrates the result in the form of a box plot. Here the boxes are positioned lower than in Figure 8.33. [Pg.355]

Among the 1790 substructures there are 301 for which such sets can be obtained. For each of these 301 substructures we calculate MS classifiers. For descriptor selection we use, as described above, the 50-fold stepwise procedure within MLR. By so doing we obtain 13 MS descriptors relevant for modeling for each substructure. For classification CT, LDA, ANN with one, two, or three HN, and SVM with linear, radial, polynomial (degree = 2), and sigmoid kernel are used. [Pg.355]

Selecting relevant input parameters is both important and difficult for any machine learning method. For example, in QSAR, one can compute thousands of structural descriptors with software like CODESSA or Dragon, or with various molecular field methods. Many procedures have been developed in QSAR to identify a set of structural descriptors that retain the important characteristics of the chemical compounds. These methods can be extended to SVM models. Another source of inspiration is represented by the algorithms proposed in the machine learning literature, which can be readily applied to cheminformatics problems. We present here several literature pointers for algorithms on descriptor selection. [Pg.347]

Selecting an optimum group of descriptors is both an important and time-consuming phase in developing a predictive QSAR model. Frohlich, Wegner, and Zell introduced the incremental regularized risk minimization procedure for SVM classification and regression models, and they compared it with recursive feature elimination and with the mutual information procedure. Their first experiment considered 164 compounds that had been tested for their human intestinal absorption, whereas the second experiment modeled the aqueous solubility prediction for 1297 compounds. Structural descriptors were computed by those authors with JOELib and MOE, and full cross-validation was performed to compare the descriptor selection methods. The incremental... [Pg.374]

Five methods of feature selection (information gain, mutual information, χ²-test, odds ratio, and GSS coefficient) were compared by Liu for their ability to discriminate between thrombin inhibitors and noninhibitors. The chemical compounds were provided by DuPont Pharmaceutical Research Laboratories as a learning set of 1909 compounds containing 42 inhibitors and 1867 noninhibitors, and a test set of 634 compounds containing 150 inhibitors and 484 noninhibitors. Each compound was characterized by 139,351 binary features describing its 3-D structure. In this comparison of naive Bayesian and SVM classifiers, all compounds were considered together, and a leave-10%-out (L10%O) cross-validation procedure was applied. Based on information gain descriptor selection,... [Pg.375]
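Information gain for a binary feature, the first of the five criteria listed above, can be computed as the drop in class entropy after splitting the compounds by feature value. The sketch below uses the textbook definition with base-2 logarithms; the function names and the four-compound toy data are hypothetical, and this is not a reproduction of Liu's implementation.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(feature, labels):
    """Class entropy minus the entropy remaining after splitting on `feature`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


# Toy example: 1 = inhibitor, 0 = noninhibitor (hypothetical data).
labels = [1, 1, 0, 0]
features = {
    "bit_a": [1, 1, 0, 0],  # separates the classes perfectly
    "bit_b": [1, 0, 1, 0],  # carries no class information
}
gains = {name: information_gain(bits, labels) for name, bits in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains, ranked)
```

Descriptor selection then amounts to keeping the top-ranked features, with the cutoff chosen by the modeler.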

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...