Why Does the Quality of Data Matter

The question is rhetorical. Once we learn from data, the quality of learning and the reliability of the obtained knowledge are conditioned by the quahty of the data. However, some examples of how data affect the results and the conclusions we draw from them may be of interest. [Pg.206]

Let us start with a classic example. We had a dataset of 31 steroids. The spatial autocorrelation vector (more about autocorrelation vectors can be found in Chapter 8) stood as the set of molecular descriptors. The task was to model the Corticosteroid Ringing Globulin (CBG) affinity of the steroids. A feed-forward multilayer neural network trained with the back-propagation learning rule was employed as the learning method. The dataset itself was available in electronic form. More details can be found in Ref. [2]. [Pg.206]

An observation of the results of cross-validation revealed that all but one of the compounds in the dataset had been modeled pretty well. The last (31st) compound behaved weirdly. When we looked at its chemical structure, we saw that it was the only compound in the dataset which contained a fluorine atom. What would happen if we removed the compound from the dataset The quahty ofleaming became essentially improved. It is sufficient to say that the cross-vahdation coefficient in-CTeased from 0.82 to 0.92, while the error decreased from 0.65 to 0.44. Another learning method, the Kohonen s Self-Organizing Map, also failed to classify this 31st compound correctly. Hence, we had to conclude that the compound containing a fluorine atom was an obvious outlier of the dataset. [Pg.206]

Another misleading feature of a dataset, as mentioned above, is redundancy. This means that the dataset contains too many similar objects contributing no [Pg.206]

One can find more details on the algorithm in Section 4.3.4. This time the learning yielded essentially improved results. It is sufficient to say that if in the case of the primary dataset, only 21 compoimds from 91 were classified correctly, whereas in the optimized dataset (i.e., that with no redundancy) the correctness of classification was increased to 65 out of 91. [Pg.207]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...