Big Chemical Encyclopedia


Training set selection

Successful discriminant analysis is based on the assumption that the data in each class are normally distributed and that all classes have the same covariance matrix (McFarland and Gans, 1990). Discriminant analysis is extremely sensitive to collinearities among descriptors, and the ratio of the number of chemicals in the data set to the number of descriptors in the model should exceed 10:1. With regard to training-set selection (data quality, statistical significance, the type of descriptors and the limitations of the model range), the same restrictions apply to discriminant analysis as explained for classical regression analysis (Section 3.2.1) ... [Pg.82]
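The two practical checks mentioned above (a chemicals-to-descriptors ratio above 10:1 and the absence of strong collinearities) are easy to automate. Below is a minimal sketch; the function name, thresholds and the use of a plain pairwise correlation matrix are illustrative assumptions, not part of the cited text.

```python
import numpy as np

def check_training_set(X, ratio_threshold=10.0, corr_threshold=0.9):
    """Sketch of two pre-checks for discriminant analysis: the
    chemicals-to-descriptors ratio and pairwise collinearity.
    X is an (n_chemicals, n_descriptors) descriptor matrix."""
    n, p = X.shape
    ratio_ok = n / p >= ratio_threshold  # rule of thumb: at least 10:1

    # Flag descriptor pairs whose absolute correlation is suspiciously high.
    corr = np.corrcoef(X, rowvar=False)
    collinear = [(i, j, corr[i, j])
                 for i in range(p) for j in range(i + 1, p)
                 if abs(corr[i, j]) > corr_threshold]
    return ratio_ok, collinear
```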

In Eq. (12), SE is the standard error, c is the number of selected variables, p is the total number of variables (which can differ from c), and d is a smoothing parameter to be set by the user. As mentioned above, there is a certain threshold beyond which an increase in the number of variables results in a decrease in the quality of modeling. In effect, the smoothing parameter reflects the user's guess of how much detail is to be modeled in the training set. [Pg.218]
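Eq. (12) itself is not reproduced in this excerpt, so the sketch below only illustrates the general idea of a user-smoothed penalty on the standard error; the multiplicative form and the function name are assumptions, not the published equation.

```python
def smoothed_error(se, c, p, d):
    """Hypothetical penalized error of the kind described above.
    Eq. (12) is not given in the excerpt, so this multiplicative
    penalty is only an illustrative assumption. se is the standard
    error, c the number of selected variables, p the total number of
    variables, d the user-chosen smoothing parameter."""
    return se * (1.0 + d * c / p)  # larger d penalizes extra variables more
```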

A data set can be split into a training set and a test set randomly or according to a specific rule. The 1293 compounds were divided into a training set of 741 compounds and a test set of 552 compounds, based on their distribution in a Kohonen neural network (KNN) map. From each occupied neuron, one compound was selected and taken into the training set, and the other compounds were put into the test set. This selection ensured that both the training set and the test set contained as much information as possible and covered the chemical space as widely as possible. [Pg.500]
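A minimal sketch of this map-based split, assuming the assignment of compounds to neurons (e.g. from an already trained Kohonen map) is available as a dictionary; the function name and the random choice of the per-neuron representative are illustrative assumptions.

```python
import random
from collections import defaultdict

def split_by_map(neuron_of, seed=0):
    """neuron_of maps each compound index to the map neuron (cell) it
    falls in; how that assignment is obtained is outside this sketch.
    One compound per occupied neuron goes into the training set and
    the remaining compounds into the test set."""
    random.seed(seed)
    cells = defaultdict(list)
    for compound, neuron in neuron_of.items():
        cells[neuron].append(compound)

    train, test = [], []
    for members in cells.values():
        pick = random.choice(members)   # one representative per neuron
        train.append(pick)
        test.extend(m for m in members if m != pick)
    return train, test
```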

Figure 10.2-5. Procedure for spectra simulation: the query structure is coded, a training set of structure-spectra pairs is selected from the database, and the counterpropagation network is trained.
There are three rules of thumb to guide us in selecting the number of calibration samples we should include in a training set. They are all based on the number of components in the system with which we are working. Remember that components should be understood in the widest sense as "independent sources of significant variation in the data." For example, a... [Pg.19]

Select K objects from the training set most similar to object u, according to the calculated distances (K is usually an odd number). [Pg.314]
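A short sketch of this step, assuming the distances from object u to every training object have already been computed:

```python
import numpy as np

def k_nearest(distances, k=5):
    """Return indices of the K training objects closest to the unknown,
    given precomputed distances (K odd to avoid ties in a later
    majority vote)."""
    return np.argsort(distances)[:k]
```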

Aqueous solubility is selected to demonstrate the E-state application in QSPR studies. Huuskonen et al. modeled the aqueous solubility of 734 diverse organic compounds with multiple linear regression (MLR) and artificial neural network (ANN) approaches [27]. The set of structural descriptors comprised 31 E-state atomic indices and three indicator variables for pyridine, aliphatic hydrocarbons and aromatic hydrocarbons, respectively. The dataset of 734 chemicals was divided into a training set (n = 675), a validation set (n = 38) and a test set (n = 21). A comparison of the MLR results (training, r² = 0.94, s = 0.58; validation, r² = 0.84, s = 0.67; test, r² = 0.80, s = 0.87) and the ANN results (training, r² = 0.96, s = 0.51; validation, r² = 0.85, s = 0.62; test, r² = 0.84, s = 0.75) indicates a small improvement for the neural network model with five hidden neurons. These QSPR models may be used for a fast and reliable computation of the aqueous solubility for diverse organic compounds. [Pg.93]
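A hedged sketch of such an MLR-versus-ANN comparison using scikit-learn; the random stand-in data, descriptor count and training settings are illustrative assumptions and do not reproduce the published study or its data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Illustrative stand-ins for the descriptor matrix (e.g. E-state indices
# plus indicator variables) and measured log solubilities; real scores
# would require the actual data set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(675, 34)), rng.normal(size=675)
X_test, y_test = rng.normal(size=(21, 34)), rng.normal(size=21)

mlr = LinearRegression().fit(X_train, y_train)
ann = MLPRegressor(hidden_layer_sizes=(5,),  # five hidden neurons, as above
                   max_iter=2000, random_state=0).fit(X_train, y_train)

for name, model in [("MLR", mlr), ("ANN", ann)]:
    r2 = model.score(X_test, y_test)  # coefficient of determination
    print(f"{name}: test R^2 = {r2:.2f}")
```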

Two models of practical interest using quantum chemical parameters were developed by Clark et al. [26, 27]. Both studies were based on 1085 molecules and 36 descriptors calculated with the AM1 method following structure optimization and electron density calculation. An initial set of descriptors was selected with a multiple linear regression model and further optimized by trial-and-error variation. The second study reported a standard error of 0.56 for 1085 compounds, and it also estimated the reliability of the neural network prediction by analyzing the standard deviation of the error over an ensemble of 11 networks trained on different randomly selected subsets of the initial training set [27]. [Pg.385]
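A sketch of this ensemble-based reliability estimate: train several networks on different random subsets of the training set and use the spread of their predictions as an error bar. The network architecture, subset fraction and function name are assumptions, not the settings of the cited work.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def ensemble_prediction(X_train, y_train, x_query, n_models=11, frac=0.8):
    """Train n_models networks on random subsets of the training set
    and return the mean prediction for x_query together with the
    standard deviation across the ensemble as a reliability estimate."""
    rng = np.random.default_rng(0)
    n = len(X_train)
    preds = []
    for i in range(n_models):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                           random_state=i).fit(X_train[idx], y_train[idx])
        preds.append(net.predict(x_query.reshape(1, -1))[0])
    preds = np.asarray(preds)
    return preds.mean(), preds.std()  # prediction and its spread
```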

A mathematically very simple classification procedure is the nearest neighbour method. In this method one computes the distance between an unknown object u and each of the objects of the training set. Usually one employs the Euclidean distance D (see Section 30.2.2.1), but for strongly correlated variables one should prefer correlation-based measures (Section 30.2.2.2). If the training set consists of n objects, then n distances are calculated and the lowest of these is selected. If this is D(u, l), where u represents the unknown and l an object from learning class L, then one classifies u in group L. A three-dimensional example is given in Fig. 33.11. Object u is closest to an object of class L and is therefore considered to be a member of that class. [Pg.223]
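A minimal sketch of the nearest neighbour rule just described, using the Euclidean distance:

```python
import numpy as np

def nearest_neighbour_class(X_train, labels, x_u):
    """Compute the Euclidean distance from the unknown object u to
    every training object and assign u to the class of the closest
    one (the 1-nearest-neighbour rule)."""
    d = np.linalg.norm(X_train - x_u, axis=1)  # n distances
    return labels[int(np.argmin(d))]           # class of nearest object
```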

Most of the supervised pattern recognition procedures permit stepwise selection, i.e. selecting first the most important feature, then the second most important, and so on. One way to do this is by prediction using e.g. cross-validation (see next section): we first select the variable that best classifies objects of known classification that are not part of the training set, then the variable that most improves the classification already obtained with the first selected variable, etc. The results for the linear discriminant analysis of the EU/HYPER classification of Section 33.2.1 are that with all 5 or with 4 variables a selectivity of 91.4% is obtained, and with 3 or 2 variables 88.6% [2], as a measure of classification success. Selectivity is used here in the sense of Chapter... [Pg.236]
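A sketch of such forward stepwise selection with cross-validated classification, here using scikit-learn's linear discriminant analysis; the 5-fold split and the stopping rule are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_keep):
    """At each step, add the variable that most improves the
    cross-validated classification of objects left out of the
    training folds, until n_keep variables are selected."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        def cv_score(j):
            cols = selected + [j]
            return cross_val_score(LinearDiscriminantAnalysis(),
                                   X[:, cols], y, cv=5).mean()
        best = max(remaining, key=cv_score)   # most informative next variable
        selected.append(best)
        remaining.remove(best)
    return selected
```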

