Dataset redundancy

Once the quality of the dataset is defined, the next task is to improve it. Again, one has to remove outliers, identify and remove redundant objects (as they deliver no additional information), and finally select the optimal subset of descriptors. [Pg.205]

Another misleading feature of a dataset, as mentioned above, is redundancy. This means that the dataset contains too many similar objects contributing no... [Pg.206]

One can find more details on the algorithm in Section 4.3.4. This time the learning yielded essentially improved results. It is sufficient to say that whereas only 21 of the 91 compounds in the primary dataset were classified correctly, in the optimized dataset (i.e., the one with no redundancy) the number of correctly classified compounds increased to 65 out of 91. [Pg.207]

Although the problem of compiling training and test datasets is crucial, unfortunately no de facto standard technique has been introduced. Nevertheless, we discuss here a method that was designed within our group and that has been used quite successfully in our studies. The method mainly addresses the task of finding and removing redundancy. [Pg.220]
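The excerpt does not spell out the group's procedure, so the following is only a minimal, generic sketch of how redundant objects are often removed in practice: rows of the descriptor matrix are dropped when they fall within a distance threshold of a row that has already been kept. The function name, the Euclidean metric on standardized descriptors, and the threshold value are all assumptions for illustration, not the authors' method.

```python
# Illustrative only: a generic near-duplicate filter, not the method from the text.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def remove_redundant_objects(X, threshold=0.05):
    """Greedily keep rows of X that are farther than `threshold`
    (Euclidean distance on standardized descriptors) from every row kept so far."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize descriptors
    d = squareform(pdist(Xs))                       # pairwise distance matrix
    keep = []
    for i in range(len(Xs)):
        if all(d[i, j] > threshold for j in keep):  # far enough from kept rows?
            keep.append(i)
    return np.asarray(keep)

# Example: 100 objects, 5 descriptors, the second half near-duplicates of the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[50:] = X[:50] + rng.normal(scale=1e-3, size=(50, 5))
print(len(remove_redundant_objects(X)))             # roughly 50 objects survive
```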

A widely used approach to establishing model robustness is the randomization of the response [25] (i.e., in our case, of the activities). It consists of repeating the calculation procedure with randomized activities and subsequent probability assessment of the resultant statistics. Frequently, it is used along with cross-validation. Sometimes, models based on the randomized data have high q² values, which can be explained by chance correlation or structural redundancy [26]. If all QSAR models obtained in the Y-randomization test have relatively high values for both R² and LOO q², it implies that an acceptable QSAR model cannot be obtained for the given dataset by the current modeling method. [Pg.439]
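As an illustration of the Y-randomization idea described above, here is a minimal sketch using scikit-learn: the same model is refitted on permuted activities, and the leave-one-out q² values obtained for the scrambled responses are compared with the q² of the real model. The synthetic dataset, the linear model, and the number of permutations are assumptions made purely for the example.

```python
# Y-randomization sketch: refit on shuffled activities and compare LOO q2 values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_q2(model, X, y):
    """Leave-one-out cross-validated q2."""
    y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))                        # 40 compounds, 4 descriptors
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=40)

model = LinearRegression()
q2_real = loo_q2(model, X, y)
q2_random = [loo_q2(model, X, rng.permutation(y)) for _ in range(20)]

print(f"q2 with real activities    : {q2_real:.2f}")
print(f"best q2 with permuted ones : {max(q2_random):.2f}")
# If the permuted-activity models also reach high q2, the apparent fit is
# likely due to chance correlation or structural redundancy in the dataset.
```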

Furthermore, real datasets are often redundant, too small to rely on this idea, or drawn from disparate in vitro assays. Once again, the know-how of the expert is key when it comes to deciding which molecules should be included in the training set. [Pg.337]

Data redundancy is minimized. It is kept to a minimum by normalizing data into simple datasets that can then point to related datasets. This saves disk storage space and speeds up storage and modification operations. [Pg.31]
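A minimal sketch of this normalization idea, using Python's built-in sqlite3 module: compound details are stored once in their own table and referenced by an id from a measurement table, instead of being repeated in every measurement row. The table and column names are invented for the example.

```python
# Normalization sketch: store compound details once, reference them by id.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE compound (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    smiles TEXT NOT NULL
);
CREATE TABLE measurement (
    id          INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    assay       TEXT NOT NULL,
    value       REAL NOT NULL
);
""")
con.execute("INSERT INTO compound VALUES (1, 'aspirin', 'CC(=O)OC1=CC=CC=C1C(=O)O')")
con.executemany(
    "INSERT INTO measurement (compound_id, assay, value) VALUES (?, ?, ?)",
    [(1, "logP", 1.19), (1, "solubility_mg_ml", 3.0)],
)

# The related datasets are joined back together only when needed.
query = """SELECT c.name, m.assay, m.value
           FROM measurement m JOIN compound c ON c.id = m.compound_id"""
for row in con.execute(query):
    print(row)
```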

A complementary series of methods for determining the number of significant factors is based on cross-validation. It is assumed that significant components 'model data', whilst later (and redundant) components 'model noise'. Autopredictive models involve fitting PCs to the entire dataset and always provide a closer fit to the data the more components are calculated. Hence the residual error will be smaller if 10 rather than nine PCs are calculated. This does not necessarily indicate that it is correct to retain all 10 PCs; the later PCs may model noise which we do not want. [Pg.199]
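The contrast can be made concrete with a small sketch: the autopredictive residual (RSS) always falls as components are added, whereas a cross-validated PRESS typically stops improving once the additional components only model noise. The element-wise leave-out scheme below is one common variant of PCA cross-validation; the simulated three-factor data and all numerical choices are assumptions for illustration.

```python
# Autopredictive RSS versus cross-validated PRESS for increasing numbers of PCs.
import numpy as np
from numpy.linalg import lstsq, svd

def press_pca(X, k):
    """Element-wise PRESS: leave one row out, fit PCA on the rest, then predict
    each element of the held-out row from its remaining elements."""
    press = 0.0
    for i in range(X.shape[0]):
        Xtr = np.delete(X, i, axis=0)
        mean = Xtr.mean(axis=0)
        _, _, Vt = svd(Xtr - mean, full_matrices=False)
        P = Vt[:k].T                                 # loadings, shape (variables, k)
        x = X[i] - mean
        for j in range(X.shape[1]):
            P_j = np.delete(P, j, axis=0)            # loadings without variable j
            x_j = np.delete(x, j)
            t, *_ = lstsq(P_j, x_j, rcond=None)      # scores from the other variables
            press += (x[j] - P[j] @ t) ** 2
    return press

rng = np.random.default_rng(2)
scores = rng.normal(size=(30, 3))                    # three genuine factors
loadings = rng.normal(size=(3, 8))
X = scores @ loadings + rng.normal(scale=0.3, size=(30, 8))   # plus noise

Xc = X - X.mean(axis=0)
_, s, _ = svd(Xc, full_matrices=False)
for k in range(1, 7):
    rss = (s[k:] ** 2).sum()                         # autopredictive residual error
    print(f"PCs={k}  RSS={rss:8.2f}  PRESS={press_pca(X, k):8.2f}")
```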

However, we should not forget that there are other small subsets of experiments that are just as good as the one used above to define X. This implies that the dataset is very redundant. An inevitable conclusion seems to be that the information we can extract from the totality of the microarray experiments can actually be obtained from a much smaller number of experiments. Furthermore, there is no need to analyze all the genes: it has been shown that 3000 genes are more than enough, and we suspect that far fewer genes actually suffice. [Pg.348]

In QSAR and QSPR studies, the standard ways of removing redundancy from large numbers of topological and topographical indices include principal component analysis (PCA), chi-squared analysis, and multiple regression analysis (MRA). Most QSAR and QSPR applications deal with very small datasets, so the dimensionality does not cause a problem for PCA or chi-squared analysis. MRA does not impose any restrictions on the type and number of descriptors. The selection process is based on two principles: to cover as much of the parametric space as possible (principle of variance) while choosing independent descriptors (principle of orthogonality). [Pg.530]
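As a rough illustration of these two principles, the sketch below ranks descriptors by variance and accepts each candidate only if it is not strongly correlated with a descriptor that has already been selected. The correlation cut-off and the descriptor names are assumptions; the code is not tied to any specific QSAR package.

```python
# Variance/orthogonality sketch: prefer high-variance descriptors, reject
# candidates that correlate strongly with descriptors already chosen.
import numpy as np

def select_descriptors(X, names, r_max=0.8):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Xs, rowvar=False)             # descriptor intercorrelations
    order = np.argsort(X.var(axis=0))[::-1]          # highest variance first
    selected = []
    for j in order:
        if all(abs(corr[j, k]) < r_max for k in selected):
            selected.append(j)
    return [names[j] for j in selected]

rng = np.random.default_rng(3)
logp = rng.normal(size=200)
mw = rng.normal(size=200)
near_copy = 0.95 * logp + 0.05 * rng.normal(size=200)   # almost identical to logP
X = np.column_stack([logp, mw, near_copy])
print(select_descriptors(X, ["logP", "MW", "logP_copy"]))
# The near-duplicate of logP is rejected as non-orthogonal.
```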

Non-identical (N): Domains that have a sequence identity of greater than or equal to 95% have almost identical structures, so this level is often used to provide non-redundant datasets. [Pg.136]

Take an example: imagine a new system administrator notices that, within the drug administration frequency reference dataset, someone has accidentally created two data items, '3 times a day' and 'three times a day'. He also notices that there is no entry for 'five times a day'. He can kill two birds with one stone: he accesses the redundant '3 times a day' option and replaces the text with 'five times a day'. Being a diligent individual, he navigates to the prescribing screen and, sure enough, all options are appropriately present and both issues have been fixed. [Pg.96]

The second reason for variable removal is redundancy. Redundant variables develop in a dataset for two reasons: (1) because more variables (p) exist than objects (n), and (2) because of multicollinearity. [Pg.296]
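A minimal sketch of detecting the second kind of redundancy, multicollinearity, via variance inflation factors (VIF): each variable is regressed on all the others, and a large VIF flags a variable that is (nearly) a linear combination of the rest. The commonly quoted cut-off of about 10 and the synthetic data are assumptions.

```python
# Multicollinearity sketch: variance inflation factors from auxiliary regressions.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(4)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b + rng.normal(scale=0.05, size=100)      # almost a linear combination of a and b
X = np.column_stack([a, b, c])
print(np.round(vif(X), 1))
# All three VIFs are large, because c is (nearly) a + b: the set is multicollinear.
```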

In this tutorial, we have shown that variable selection can be divided into three subtasks: dimension reduction, variable elimination and variable selection. The first of these tasks is reasonably well understood, and many standard and not-so-standard methods can process most types of datasets. Variable elimination is also relatively straightforward. Examination of the distributions of individual variables allows the easy identification of descriptors containing little or no information. Distributions also allow the analyst to recognize properties that are associated with a particular compound or subset of compounds, either because of the underlying chemical rationale behind the descriptor or because of some division of the dataset into, say, training and test sets. The calculation of multicollinearity allows for the identification of redundancy and of sets of variables containing essentially the same information, in pairs or as linear combinations of three or more descriptors. Algorithms for variable elimination have been described in this chapter, and software is available commercially and free from the web. [Pg.341]
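A minimal sketch of the variable-elimination step just described: each descriptor's distribution is inspected, and descriptors that are (nearly) constant, or that deviate from their most common value for only a handful of compounds, are flagged as carrying little or no information. The 5% cut-off and the helper name are assumptions for illustration.

```python
# Variable-elimination sketch: flag descriptors whose distributions are uninformative.
import numpy as np

def uninformative_columns(X, min_fraction=0.05):
    """Indices of columns where fewer than `min_fraction` of the rows
    differ from the column's most common value."""
    flagged = []
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        if 1 - counts.max() / X.shape[0] < min_fraction:
            flagged.append(j)
    return flagged

rng = np.random.default_rng(5)
n = 200
informative = rng.normal(size=(n, 2))             # two well-spread descriptors
constant = np.zeros((n, 1))                       # identical for every compound
rare_flag = np.zeros((n, 1)); rare_flag[:3] = 1   # non-zero for only 3 compounds
X = np.hstack([informative, constant, rare_flag])
print(uninformative_columns(X))                   # -> [2, 3]
```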

In a manner reminiscent of the self-organising maps, the methodology has been applied to produce a subset of the database that contains the best representative of each unique crystal structure [53]. Thus, a compound with 10 CSD entries, comprising one polymorph determined seven times and a second polymorph determined three times, will be reduced to two entries, which are considered to represent the two unique structure types. The details of the applied quality tests are extensive [53], but the result is a list of 231,918 structures (derived from 353,666 structures in the November 2005 release) that are considered to be the best representative examples of all unique high-quality structures in the CSD [54]. In this way, the complete contents of the CSD are reduced to a set of representative structures that contain an equivalent amount of structural information, but without any redundancy. This dataset forms an especially convenient basis for structural searches, since it is free of any duplication. [Pg.32]

