Big Chemical Encyclopedia


Dataset training

The three network structures introduced above were trained with the training dataset and tested with the test dataset. The backpropagation network reaches its best classification result after 70,000 training iterations ... [Pg.465]
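The passage above describes training a backpropagation classifier for a fixed number of iterations while scoring it on a held-out test set. The sketch below illustrates that iteration-count monitoring with a deliberately tiny logistic model on synthetic data; the data, model, and iteration counts are illustrative stand-ins, not the networks or dataset from the cited study.

```python
import math
import random

random.seed(0)

# Toy binary-classification data: x > 0.5 -> class 1 (stand-in for real descriptors).
train = [(random.random(),) for _ in range(80)]
test = [(random.random(),) for _ in range(20)]
label = lambda x: 1 if x[0] > 0.5 else 0

w, b = 0.0, 0.0          # single-weight logistic model
lr = 0.5
best_acc, best_iter = 0.0, 0

def predict(x):
    z = w * x[0] + b
    return 1.0 / (1.0 + math.exp(-z))

for it in range(1, 2001):
    x = random.choice(train)           # one gradient step per iteration
    p = predict(x)
    err = p - label(x)
    w -= lr * err * x[0]
    b -= lr * err
    if it % 100 == 0:                  # periodically score the held-out test set
        acc = sum((predict(x) > 0.5) == bool(label(x)) for x in test) / len(test)
        if acc > best_acc:
            best_acc, best_iter = acc, it

print(f"best test accuracy {best_acc:.2f} at iteration {best_iter}")
```

The key point the excerpt makes is that "best classification result" is defined on the test set, not the training set: the iteration count reported (70,000 in the original) is the point where held-out performance peaked.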

There are finer details to be extracted from such Kohonen maps that directly reflect chemical information, and have chemical significance. A more extensive discussion of the chemical implications of the mapping of the entire dataset can be found in the original publication [28]. Clearly, such a map can now be used for the assignment of a reaction to a certain reaction type. Calculating the physicochemical descriptors of a reaction allows it to be input into this trained Kohonen network. If this reaction is mapped, say, in the area of Friedel-Crafts reactions, it can safely be classified as a feasible Friedel-Crafts reaction. [Pg.196]
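The classification step described above — compute descriptors, find the best-matching node on a trained Kohonen map, and read off the reaction type from that map region — can be sketched with a minimal self-organizing map. The 2-D "descriptors", grid size, and two synthetic clusters below are hypothetical stand-ins for the real physicochemical descriptors and reaction classes.

```python
import math
import random

random.seed(1)

GRID = 5                          # 5 x 5 Kohonen map
DIM = 2                           # toy 2-D "reaction descriptors"

# Initialise each map node with a random weight vector.
nodes = {(i, j): [random.random() for _ in range(DIM)]
         for i in range(GRID) for j in range(GRID)}

# Two toy "reaction classes" clustered in descriptor space.
data = ([[random.gauss(0.2, 0.05), random.gauss(0.2, 0.05)] for _ in range(50)] +
        [[random.gauss(0.8, 0.05), random.gauss(0.8, 0.05)] for _ in range(50)])

def bmu(x):
    """Best-matching unit: the node whose weight vector is closest to x."""
    return min(nodes, key=lambda n: sum((w - v) ** 2 for w, v in zip(nodes[n], x)))

for t in range(2000):
    x = random.choice(data)
    win = bmu(x)
    lr = 0.5 * (1 - t / 2000)              # decaying learning rate
    radius = 2.0 * (1 - t / 2000) + 0.5    # shrinking neighbourhood
    for n, w in nodes.items():             # update winner and its neighbours
        d = math.dist(n, win)
        if d <= radius:
            h = math.exp(-d * d / (2 * radius * radius))
            nodes[n] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]

# A new reaction is assigned to whichever map region its BMU falls in.
print("query [0.75, 0.85] maps to node", bmu([0.75, 0.85]))
```

After training, each region of the grid ends up dominated by one class, so locating a new descriptor vector's best-matching unit amounts to the reaction-type assignment the excerpt describes.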

The final stage of compiling a maximally refined dataset is to split it into a training dataset and a test dataset. Defining a test dataset is essential during learning: it is the best way to validate the results of that learning. [Pg.205]
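The split described above can be done with a simple shuffle-and-hold-out routine. The fields, fraction, and seed below are illustrative choices, not prescriptions from the source.

```python
import random

# Hypothetical dataset: 100 records, each with an id, descriptors, and a target.
dataset = [{"id": i, "descriptors": [i * 0.1], "y": i % 2} for i in range(100)]

def split(dataset, test_fraction=0.2, seed=42):
    """Shuffle a copy, then hold out a fixed fraction as the test set."""
    items = dataset[:]
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]     # (training, test)

train_set, test_set = split(dataset)
print(len(train_set), len(test_set))           # -> 80 20
```

Fixing the seed makes the split reproducible, which matters when later studies want to compare models on exactly the same test compounds.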

Let us start with a classic example. We had a dataset of 31 steroids. The spatial autocorrelation vector (more about autocorrelation vectors can be found in Chapter 8) served as the set of molecular descriptors. The task was to model the Corticosteroid Binding Globulin (CBG) affinity of the steroids. A feed-forward multilayer neural network trained with the back-propagation learning rule was employed as the learning method. The dataset itself was available in electronic form. More details can be found in Ref. [2]. [Pg.206]
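A feed-forward network with one hidden layer, trained by back-propagation as in the example above, can be written out in a few dozen lines. Everything here is a toy stand-in: the 31 random 3-component vectors and the synthetic "affinity" target are not the real autocorrelation descriptors or CBG data from Ref. [2].

```python
import math
import random

random.seed(3)

# Toy stand-in: 31 "steroids", each a 3-component autocorrelation-style
# vector; the target is a synthetic linear "affinity", not the real CBG data.
X = [[random.random() for _ in range(3)] for _ in range(31)]
y = [0.5 * x[0] - 0.3 * x[1] + 0.2 * x[2] for x in X]

H = 4                                # hidden neurons
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
lr = 0.1

def forward(x):
    h = [math.tanh(sum(w * v for w, v in zip(W1[j], x)) + b1[j]) for j in range(H)]
    return h, sum(w * v for w, v in zip(W2, h)) + b2

for epoch in range(2000):            # plain back-propagation, one pattern at a time
    for x, t in zip(X, y):
        h, out = forward(x)
        d_out = out - t              # output-layer error
        for j in range(H):
            d_h = d_out * W2[j] * (1 - h[j] ** 2)   # tanh derivative
            W2[j] -= lr * d_out * h[j]
            for k in range(3):
                W1[j][k] -= lr * d_h * x[k]
            b1[j] -= lr * d_h
        b2 -= lr * d_out

mse = sum((forward(x)[1] - t) ** 2 for x, t in zip(X, y)) / len(X)
print(f"training MSE: {mse:.4f}")
```

With only 31 patterns and several dozen weights, a network like this can easily memorize the training set — which is exactly why the surrounding text insists on a held-out test set.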

Although the problem of compiling training and test datasets is crucial, unfortunately no de facto standard technique has been introduced. Nevertheless, we discuss here a method that was designed within our group, and that is used quite successfully in our studies. The method mainly addresses the task of finding and removing redundancy. [Pg.220]
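The excerpt does not spell out the group's redundancy-removal method, so the sketch below shows only the generic idea: drop any descriptor vector that lies too close to one already kept. The greedy strategy and the distance threshold are assumptions for illustration, not the cited method.

```python
import math

def remove_redundant(vectors, min_dist=0.1):
    """Greedy redundancy filter: keep a vector only if it is at least
    min_dist (Euclidean) from every vector already kept.
    A generic illustration, not the specific method of the cited group."""
    kept = []
    for v in vectors:
        if all(math.dist(v, k) >= min_dist for k in kept):
            kept.append(v)
    return kept

data = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.05], [0.5, 0.5]]
print(remove_redundant(data))   # -> [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Removing near-duplicates this way keeps a model from being rewarded for memorizing clusters of almost-identical compounds, which is the motivation the text gives for the refinement step.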

The quality of a model should be validated by compilation of training, test and control datasets... [Pg.224]

Aqueous solubility is selected to demonstrate the E-state application in QSPR studies. Huuskonen et al. modeled the aqueous solubility of 734 diverse organic compounds with multiple linear regression (MLR) and artificial neural network (ANN) approaches [27]. The set of structural descriptors comprised 31 E-state atomic indices, and three indicator variables for pyridine, aliphatic hydrocarbons and aromatic hydrocarbons, respectively. The dataset of 734 chemicals was divided into a training set (n = 675), a validation set (n = 38) and a test set (n = 21). A comparison of the MLR results (training: r² = 0.94, s = 0.58; validation: r² = 0.84, s = 0.67; test: r² = 0.80, s = 0.87) and the ANN results (training: r² = 0.96, s = 0.51; validation: r² = 0.85, s = 0.62; test: r² = 0.84, s = 0.75) indicates a small improvement for the neural network model with five hidden neurons. These QSPR models may be used for a fast and reliable computation of the aqueous solubility for diverse organic compounds. [Pg.93]
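The r² and s statistics quoted above can be computed from observed and predicted values as below. The six hypothetical log S values are invented for illustration; they are not from the Huuskonen dataset.

```python
import math

def r2_and_s(observed, predicted, n_params=1):
    """Coefficient of determination and residual standard error of a QSPR fit.
    n_params is the number of fitted model parameters (here an assumed value)."""
    n = len(observed)
    mean = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    s = math.sqrt(ss_res / (n - n_params - 1))   # standard error of estimate
    return r2, s

obs = [-2.1, -1.3, -0.4, 0.2, 1.1, 1.8]          # hypothetical log S values
pred = [-2.0, -1.5, -0.3, 0.4, 1.0, 1.9]
r2, s = r2_and_s(obs, pred)
print(f"r2 = {r2:.3f}  s = {s:.3f}")
```

Reporting both statistics separately for the training, validation and test sets, as the study does, is what reveals the gap between fit quality and true predictive power.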

Refinement of a QSPR model requires experimental solubilities to train the model. Several models have used the dataset of Huuskonen [44] who sourced experimental data from the AQUASOL [45] and PHYSPROP [46] databases. The original set had a small number of duplicates, which have been removed in most subsequent studies using this dataset, leaving 1290 compounds. When combined, the log Sw... [Pg.302]

In an attempt to compensate for poor long-term reproducibility in a long-term identification study, Chun et al. [128] applied ANNs to PyMS spectra collected from strains of Streptomyces six times over a 20-month period. Direct comparison of the six datasets, by the conventional approach of HCA, was unsuccessful for strain identification, but a neural network trained on spectra from each of the first three datasets was able to identify isolates in those three datasets and in the three subsequent datasets. [Pg.333]

PFOS | Non-carcinogen | Non-carcinogen | Non-carcinogen | Non-carcinogen | Not enough similar compounds in training dataset | Non-carcinogen [Pg.189]

Table 7 shows the predictions relating to mutagenicity. The models gave homogeneous and consistent results. The only exceptions are the predictions of ToxSuite on HBCDD and methylisothiazolinone, which are in contrast with the predictions of the other models and with experimental data. Moreover, no predictions are provided by Lazar for isothiazolinones, because there were not enough similar compounds in the training dataset. [Pg.197]

The evaluation of aquatic toxicity on daphnids and fish is reported in Tables 12 and 13. Bold values indicate that compounds are outside the model applicability domain (ECOSAR) or that the prediction is not reliable. ECOSAR and ToxSuite are able to predict all the selected compounds, while T.E.S.T. fails to predict the daphnia toxicity of the perfluorinated compounds (PFOS and PFOA). Tables 12 and 13 also include a limited number of experimental results provided by the model training dataset (some data are extracted from the USEPA Ecotox database). Predicted results agree for only five compounds (2, 3, 5, 13 and 14) for both endpoints, while the predictions for the other compounds are highly variable. [Pg.200]

Results for the prediction of oral toxicity in rat are shown in Table 14. Bold values indicate that the prediction is not reliable. The table also includes a limited number of experimental results provided by the model training dataset (some data are extracted from the ChemID database). Globally, the values show a high degree of agreement. The only exceptions are the predictions of triclosan and methylisothiazolinone toxicity: for triclosan, the prediction made by T.E.S.T. is more conservative than the ToxSuite prediction and the experimental values, while for methylisothiazolinone the predicted values from T.E.S.T. and ToxSuite are variable. [Pg.201]

Since equal numbers of disordered and ordered residues were used for training and testing, prediction success would be about 50% if disordered and ordered sequences were the same. In contrast to this 50% value, prediction success rates for the short, medium, long, and merged datasets were 69% ± 3%, 74% ± 2%, 73% ± 2%, and 60% ± 3%, respectively (Romero et al., 1997b), where the standard errors were determined over about 2200, 2600, 2000, and 6800 individual predictions, respectively. [Pg.50]

Overfitting is a potentially serious problem in neural networks. It is tackled in two ways: (1) by continually monitoring the quality of training as it occurs, using a test set, and (2) by ensuring that the geometry of the network (its size and the way the nodes are connected) is appropriate for the size of the dataset. [Pg.38]

As the network learns, connection weights are adjusted so that the network can model general rules that underlie the data. If there are some general rules that apply to a large proportion of the patterns in the dataset, the network will repeatedly see examples of these rules and they will be the first to be learned. Subsequently, it will turn its attention to more specialized rules of which there are fewer examples in the dataset. Once it has learned these rules as well, if training is allowed to continue, the network may start to learn specific samples within the data. This is undesirable for two reasons. Firstly, since these particular patterns may never be seen when the trained network is put to use, any time spent learning them is wasted. Secondly,... [Pg.38]
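The progression described above — general rules learned first, then rarer patterns, then undesirable memorization of individual samples — is what test-set monitoring is meant to catch: training is stopped once held-out error stops improving. The sketch below shows that early-stopping idea with a deliberately minimal one-weight "network" on synthetic data; the model, patience value, and data are illustrative assumptions.

```python
import random

random.seed(7)

# Toy 1-D regression: y = 2x + noise; the "network" has a single weight.
train = [(x / 20, 2 * x / 20 + random.gauss(0, 0.1)) for x in range(20)]
test = [(x / 10 + 0.025, 2 * (x / 10 + 0.025)) for x in range(10)]

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, lr = 0.0, 0.05
best_w, best_test, patience = w, mse(w, test), 0
for epoch in range(500):
    for x, y in train:                     # gradient steps on the training set
        w -= lr * 2 * (w * x - y) * x
    e = mse(w, test)                       # monitor held-out error every epoch
    if e < best_test:
        best_w, best_test, patience = w, e, 0
    else:
        patience += 1
        if patience >= 10:                 # test error stopped improving: stop
            break

print(f"stopped at epoch {epoch}, best test MSE {best_test:.4f}")
```

Keeping the weights from the best-monitored epoch (`best_w` here) rather than the final epoch is what prevents the time spent memorizing individual training samples from degrading the deployed model.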





