Big Chemical Encyclopedia


Dataset

As described above, 2314 ROIs were cut out of several radiographs of austenitic welds and the corresponding features were calculated. The true data for each radiograph are known from destructive testing and evaluation. The ROIs represent the following dataset ... [Pg.465]

The three network structures introduced were trained with the training dataset and tested with the test dataset. The backpropagation network reaches its best classification result after 70000 training iterations ... [Pg.465]

Furthermore, VGInsight [16], a high-end tool for multi-volume and time-dependent volume raytracing, is used for Virtual Reality inspection of the measured dataset. The package runs on a Quad Pentium Pro system under the Linux operating system. [Pg.495]

3D-CT dataset of a power saw cabinet in a reverse engineering software package (STL format)... [Pg.499]

Calibration of the ultrasonic instrument, including plotting of a recording curve (DAC), or a reference reflector for a DGS evaluation, or loading of the existing test dataset... [Pg.778]

Intensive data reduction is an efficient method of managing large datasets. Generally, hash codes are used within chemical information processes such as molecule identification and recognition of identical atoms [98]. [Pg.74]
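The idea of a hash-code-based identity check can be sketched in a few lines. The SHA-256 digest and the canonical-string input below are illustrative choices for this sketch, not the specific hash scheme of Ref. [98]:

```python
import hashlib

def molecule_hash(canonical_repr: str) -> str:
    """Reduce a canonical structure string to a short fixed-length
    hash code usable as a fast identity key (sketch only; real systems
    tune the hash scheme to their identifiers)."""
    return hashlib.sha256(canonical_repr.encode("utf-8")).hexdigest()[:16]

# identical canonical representations collapse to the same code
print(molecule_hash("C1=CC=CC=C1") == molecule_hash("C1=CC=CC=C1"))  # -> True
print(molecule_hash("C1=CC=CC=C1") == molecule_hash("CCO"))          # -> False
```

Because the code is short and fixed-length, comparing two molecules reduces to comparing two small strings, which is what makes hash codes attractive for large datasets.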

The first line of the file (see Figure 2-110) - the HEADER record - holds the molecule's classification string (columns 11-50), the deposition date (the date when the data were received by the PDB) in columns 51-59, and the PDB ID code for the molecule, which is unique within the Protein Data Bank, in columns 63-66. The second line - the TITLE record - contains the title of the experiment or the analysis that is represented in the entry. The subsequent records contain a more detailed description of the macromolecular content of the entry (COMPND), the biological and/or chemical source of each biological molecule in the entry (SOURCE), a set of keywords relevant to the entry (KEYWDS), information about the experiment (EXPDTA), a list of people responsible for the contents of this entry (AUTHOR), a history of modifications made to this entry since its release (REVDAT), and finally the primary literature citation that describes the experiment which resulted in the deposited dataset (JRNL). [Pg.115]
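The fixed-column layout of the HEADER record described above can be sliced directly; a minimal sketch (the record shown is a made-up illustrative example, not a real PDB entry):

```python
def parse_pdb_header(line: str) -> dict:
    """Slice the fixed columns of a PDB HEADER record.
    Columns are 1-based in the format description, hence the offsets."""
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "dep_date": line[50:59].strip(),        # columns 51-59
        "id_code": line[62:66].strip(),         # columns 63-66
    }

# illustrative record (the ID code 1ABC is invented for this example)
hdr = "HEADER    " + "HYDROLASE".ljust(40) + "21-JAN-98" + "   " + "1ABC"
print(parse_pdb_header(hdr))
```

Fixed-column slicing like this is why the PDB format survives simple text tooling: no delimiter parsing is needed, only the documented column positions.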

imgCIF/CBF: Efficient storage of 2D area detector data and other large datasets. [Pg.121]

Previous work in our group had shown the power of self-organizing neural networks for the projection of high-dimensional datasets into two dimensions while preserving clusters present in the high-dimensional space even after projection [27]. In effect, 2D maps of the high-dimensional data are obtained that can show clusters of similar objects. [Pg.193]
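A minimal self-organizing map illustrating this kind of cluster-preserving projection can be written from scratch. The grid size, learning-rate schedule, and the diagonal weight initialization below are arbitrary illustrative choices, not the setup of Ref. [27]:

```python
import math
import random

def bmu_of(som, x):
    """Best-matching unit: the grid cell whose weight vector is closest to x."""
    return min(som, key=lambda c: sum((w - v) ** 2 for w, v in zip(som[c], x)))

def train_som(data, rows=4, cols=4, epochs=300, lr0=0.5, sigma0=2.0, seed=0):
    """Train a minimal Kohonen map: high-dimensional vectors are projected
    onto a rows x cols grid while neighboring cells learn similar weights.
    Weights start along the grid diagonal so the run is reproducible."""
    rng = random.Random(seed)
    dim = len(data[0])
    som = {(i, j): [(i + j) / (rows + cols - 2)] * dim
           for i in range(rows) for j in range(cols)}
    for t in range(epochs):
        frac = 1.0 - t / epochs
        lr, sigma = lr0 * frac, sigma0 * frac + 0.3   # both decay over time
        x = rng.choice(data)
        bi, bj = bmu_of(som, x)
        for (i, j), w in som.items():
            # neighborhood function: cells near the BMU learn the most
            h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2) / (2 * sigma * sigma))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return som

# two well-separated clusters in 5-D should land on distant grid cells
cluster_a = [[0.02 * i] * 5 for i in range(5)]
cluster_b = [[1.0 - 0.02 * i] * 5 for i in range(5)]
som = train_som(cluster_a + cluster_b)
print(bmu_of(som, [0.0] * 5), bmu_of(som, [1.0] * 5))
```

Each object is reduced to the 2-D coordinates of its best-matching unit, which is exactly the kind of cluster-revealing map the text describes.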

We will show here the classification procedure with a specific dataset [28]. A reaction center, the addition of a C-H bond to a C=C double bond, was chosen that comprised a variety of different reaction types, such as Michael additions, Friedel-Crafts alkylations of aromatic compounds by alkenes, or photochemical reactions. We wanted to see whether these different reaction types could be discerned by this... [Pg.193]

Previous studies with a variety of datasets had shown the importance of charge distribution, qσ (inductive effect), of π-electronegativity, χπ (resonance effect), and of effective polarizability, αeff (polarizability effect) (for details on these methods see Section 7.1). All four of these descriptors on all three carbon atoms were calculated. However, in the final study a reduced set of descriptors, shown in Table 3-4, was chosen, obtained both by statistical methods and by chemical intuition. [Pg.194]

Figure 3-19. Reaction center of the dataset of 120 reactions (reacting bonds are indicated by broken lines), and some reaction instances of this dataset.
Figure 3-20. Distribution of the dataset of 120 reactions in the Kohonen network: a) the neurons were patterned on the basis of intellectually assigned reaction types; b) in addition, empty neurons were patterned on the basis of their k nearest neighbors.
There are finer details to be extracted from such Kohonen maps that directly reflect chemical information and have chemical significance. A more extensive discussion of the chemical implications of the mapping of the entire dataset can be found in the original publication [28]. Clearly, such a map can now be used for the assignment of a reaction to a certain reaction type. Calculating the physicochemical descriptors of a reaction allows it to be input into this trained Kohonen network. If this reaction is mapped, say, in the area of Friedel-Crafts reactions, it can safely be classified as a feasible Friedel-Crafts reaction. [Pg.196]

To understand what datasets are and how to estimate their quality... [Pg.203]

To become familiar with dataset optimization techniques... [Pg.203]

Once we have defined this, we have to compile the initial dataset. First of all, we decide upon the composition of the dataset. Usually, we initially take as many compounds as possible. [Pg.205]

First, we have an initial, and probably utterly crude, dataset. Genuine data pre-processing has only just started: the task is to assess the quality of the data. One of the topics for discussion in this chapter is the set of methods by which one uncovers the potential drawbacks of a dataset. [Pg.205]

The quality may suffer from the presence of so-called outliers, i.e., compounds that have low similarity to the rest of the dataset. Another negative feature may be just the contrary: the dataset may contain too many highly similar objects. [Pg.205]
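One simple (and deliberately naive) way to detect such low-similarity compounds is to flag objects whose mean distance to all other objects is unusually large; the z-score cutoff of 2 below is an arbitrary illustrative choice:

```python
import statistics

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def flag_outliers(vectors, z_cut=2.0):
    """Flag objects whose mean distance to all other objects is more
    than z_cut standard deviations above the dataset average."""
    avg = [statistics.mean(euclid(v, u) for u in vectors if u is not v)
           for v in vectors]
    mu, sd = statistics.mean(avg), statistics.stdev(avg)
    return [i for i, d in enumerate(avg) if sd and (d - mu) / sd > z_cut]

# 30 tightly packed "compounds" in descriptor space plus one distant object
data = [[0.1 * (i % 5), 0.1 * (i // 5)] for i in range(30)] + [[5.0, 5.0]]
print(flag_outliers(data))  # -> [30]
```

The same machinery, inverted (flagging objects whose mean distance is unusually *small*), points at the opposite problem of near-duplicate objects.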

Another problem is to determine the optimal number of descriptors for the objects (patterns), such as for the structure of a molecule. A widespread observation is that one has to keep the number of descriptors below about 20% of the number of objects in the dataset. However, this is correct only in the case of ordinary multilinear regression analysis. Some more advanced methods, such as Projection to Latent Structures (or Partial Least Squares, PLS), use so-called latent variables to achieve both modeling and prediction. [Pg.205]
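A bare-bones PLS1 (NIPALS) sketch shows how latent variables cope with more descriptors than ordinary regression could tolerate. The synthetic data at the bottom (ten descriptors that are noisy copies of two hidden factors) are purely illustrative:

```python
import random

def pls1(X, y, n_comp=2):
    """Minimal PLS1 via NIPALS: X is compressed into n_comp latent
    variables, so the model stays stable even when the number of
    descriptors exceeds the 20%-of-objects rule of thumb."""
    n, k = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(k)]
    ym = sum(y) / n
    E = [[X[i][j] - xm[j] for j in range(k)] for i in range(n)]  # centered X
    f = [v - ym for v in y]                                      # centered y
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]                        # weight vector
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]
        tt = sum(v * v for v in t)                       # scores
        p = [sum(t[i] * E[i][j] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        for i in range(n):                               # deflate X and y
            for j in range(k):
                E[i][j] -= t[i] * p[j]
            f[i] -= q * t[i]
        comps.append((w, p, q))

    def predict(x):
        e = [x[j] - xm[j] for j in range(k)]
        yhat = ym
        for w, p, q in comps:
            t = sum(e[j] * w[j] for j in range(k))
            e = [e[j] - t * p[j] for j in range(k)]
            yhat += q * t
        return yhat
    return predict

# 8 objects, 10 descriptors: noisy copies of two hidden factors a and b
rng = random.Random(1)
a = [rng.random() for _ in range(8)]
b = [rng.random() for _ in range(8)]
X = [[a[i] + 0.01 * rng.random() for _ in range(5)] +
     [b[i] + 0.01 * rng.random() for _ in range(5)] for i in range(8)]
y = [2 * a[i] - b[i] for i in range(8)]
model = pls1(X, y, n_comp=2)
print(max(abs(model(row) - yi) for row, yi in zip(X, y)))  # small residual
```

With eight objects the 20% rule would allow only one or two descriptors in ordinary regression; the two latent variables absorb all ten correlated descriptors without ill-conditioning.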

Once the quality of the dataset is defined, the next task is to improve it. Again, one has to remove outliers, find and remove redundant objects (as they deliver no additional information), and finally select the optimal subset of descriptors. [Pg.205]

The final stage of compiling a maximally refined dataset is to split it into a training and a test dataset. Defining a test dataset is an absolute must during learning, as it is, in fact, the best way to validate the results of that learning. [Pg.205]
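Such a split can be as simple as a seeded shuffle; the 25% test fraction below is an arbitrary illustrative choice:

```python
import random

def split_dataset(items, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed, then hold out a test set that is never
    used during learning, only for validating the trained model."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

compounds = ["cpd_%03d" % i for i in range(100)]
train, test = split_dataset(compounds)
print(len(train), len(test))  # -> 75 25
```

A fixed seed keeps the split reproducible, which matters when several learning methods are compared on the same training/test partition.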

And, last but by far not least, we must mention a very important part of data preprocessing. It is up to the researcher to decide when to employ these techniques. Figure 4-2 displays the step-by-step preparation of a dataset. [Pg.205]

Figure 4-2. A high-quality dataset has to be prepared step-by-step, and often iteratively.
Let us start with a classic example. We had a dataset of 31 steroids. The spatial autocorrelation vector (more about autocorrelation vectors can be found in Chapter 8) served as the set of molecular descriptors. The task was to model the Corticosteroid-Binding Globulin (CBG) affinity of the steroids. A feed-forward multilayer neural network trained with the back-propagation learning rule was employed as the learning method. The dataset itself was available in electronic form. More details can be found in Ref. [2]. [Pg.206]

An examination of the cross-validation results revealed that all but one of the compounds in the dataset had been modeled quite well. The last (31st) compound behaved strangely. When we looked at its chemical structure, we saw that it was the only compound in the dataset which contained a fluorine atom. What would happen if we removed this compound from the dataset? The quality of learning improved substantially: the cross-validation coefficient increased from 0.82 to 0.92, while the error decreased from 0.65 to 0.44. Another learning method, the Kohonen Self-Organizing Map, also failed to classify this 31st compound correctly. Hence, we had to conclude that the fluorine-containing compound was an obvious outlier of this dataset. [Pg.206]
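The cross-validation coefficient mentioned here is typically obtained by leave-one-out cross-validation; a generic sketch, with a trivial one-descriptor linear model standing in for the neural network of the study:

```python
def loo_q2(X, y, fit, predict):
    """Leave-one-out cross-validation: refit the model without each
    object in turn, predict the left-out object, and compare the sum of
    squared prediction errors (PRESS) with the variance of y."""
    n = len(y)
    press = 0.0
    for i in range(n):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (predict(model, X[i]) - y[i]) ** 2
    ym = sum(y) / n
    return 1.0 - press / sum((v - ym) ** 2 for v in y)

def fit_line(X, y):
    """Ordinary least squares on a single descriptor (stand-in model)."""
    n = len(y)
    xs = [row[0] for row in X]
    xm, ym = sum(xs) / n, sum(y) / n
    slope = (sum((x - xm) * (v - ym) for x, v in zip(xs, y))
             / sum((x - xm) ** 2 for x in xs))
    return slope, ym - slope * xm

def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept

X = [[float(i)] for i in range(10)]
y = [2.0 * i + 1.0 for i in range(10)]
print(round(loo_q2(X, y, fit_line, predict_line), 6))  # exact data -> 1.0
```

An outlier like the fluorinated steroid inflates PRESS for its own left-out fold, which is exactly why removing it raises the coefficient from 0.82 to 0.92.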

Another misleading feature of a dataset, as mentioned above, is redundancy. This means that the dataset contains too many similar objects contributing no... [Pg.206]

One can find more details on the algorithm in Section 4.3.4. This time the learning yielded substantially improved results: whereas with the primary dataset only 21 compounds out of 91 were classified correctly, with the optimized dataset (i.e., the one with no redundancy) the number of correct classifications increased to 65 out of 91. [Pg.207]

As another example, we shall consider the influence of the number of descriptors on the quality of learning. Lucic et al. [3] performed a study on QSPR models employing connectivity indices as descriptors. The dataset contained 18 isomers of octane. The physical property modeled was the boiling point. The authors were among those who introduced the technique of orthogonalization of descriptors. [Pg.207]
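Descriptor orthogonalization in this spirit can be sketched with classical Gram-Schmidt, where each descriptor keeps only the part not already explained by its predecessors (a generic sketch, not the authors' exact procedure):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(columns):
    """Classical Gram-Schmidt over descriptor columns: project out of
    each descriptor everything already explained by earlier ones."""
    ortho = []
    for col in columns:
        v = list(col)
        for u in ortho:
            coef = dot(v, u) / dot(u, u)
            v = [vi - coef * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho

# two nearly collinear descriptor columns over four compounds
d1 = [1.0, 2.0, 3.0, 4.0]
d2 = [1.1, 2.1, 2.9, 4.2]
o1, o2 = orthogonalize([d1, d2])
print(abs(dot(o1, o2)) < 1e-9)  # -> True
```

After orthogonalization the second column carries only the information that the first one lacked, so regression coefficients no longer shift when descriptors are added or removed.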

Returning to datasets from chemoinformatics, we may conclude that the first case stands for the complete degeneracy of the dataset, when all but one of the data objects are redundant. The second case corresponds to the weird situation in which all the objects of the dataset are outliers. We have thus arrived at the two extreme ends of dataset complexity. [Pg.208]

As oversimplified criteria to be used for the clustering of datasets, we may consider high-quality Kohonen maps, PCA plots, or hierarchical clustering. [Pg.208]

Let us illustrate the approach with examples. Suppose we have a dataset with some 500 compounds. First, we apply a set of descriptors. Employing one procedure or another, we obtain, say, ten equivalence classes in which the elements are distributed as given in Table 4-1. [Pg.212]

We must now mention that, traditionally, especially in chemometrics, outliers have a different definition, and even a different interpretation. Suppose that we have a k-dimensional characteristic vector, i.e., k different molecular descriptors are used. If we imagine a k-dimensional hyperspace, then the dataset objects will occupy different places in it. Some of them will tend to group together, while others will be allocated to more remote regions. One can by convention define a margin beyond which the realm of strong outliers starts; moderate outliers stay near this margin. [Pg.213]
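This margin-based view can be sketched by measuring each object's distance to the dataset centroid in the k-dimensional descriptor space and comparing it with the spread of those distances; the margins of 2 and 3 standard deviations below are arbitrary illustrative conventions:

```python
import statistics

def classify_outliers(vectors, moderate=2.0, strong=3.0):
    """Label each object by how far it lies from the dataset centroid,
    in units of the spread of all such distances.  Objects beyond the
    'strong' margin are strong outliers; those between the two margins
    are moderate ones."""
    k = len(vectors[0])
    centroid = [statistics.mean(v[j] for v in vectors) for j in range(k)]
    dists = [sum((v[j] - centroid[j]) ** 2 for j in range(k)) ** 0.5
             for v in vectors]
    mu, sd = statistics.mean(dists), statistics.stdev(dists)
    labels = []
    for d in dists:
        z = (d - mu) / sd if sd else 0.0
        labels.append("strong" if z > strong
                      else "moderate" if z > moderate else "normal")
    return labels

# 20 clustered objects plus one far-away compound (the last one)
data = [[0.1 * (i % 4), 0.1 * (i // 4)] for i in range(20)] + [[10.0, 10.0]]
labels = classify_outliers(data)
print(labels[-1], labels[:-1].count("normal"))  # -> strong 20
```

Where exactly the two margins are drawn is, as the text says, a matter of convention; chemometric practice often replaces the plain Euclidean distance with a covariance-aware one such as the Mahalanobis distance.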

It may be of interest to readers that all three methods mentioned above resulted in the same optimal subset of descriptors for the well-known Selwood dataset, which has become a de facto standard for testing new approaches in this field [20]. [Pg.219]

Although the problem of compiling training and test datasets is crucial, unfortunately no de facto standard technique has been established. Nevertheless, we discuss here a method that was designed within our group and that has been used quite successfully in our studies. The method mainly addresses the task of finding and removing redundancy. [Pg.220]

Let us outline one of our approaches with the following simple example. Suppose we have a dataset of compounds and two experimental biological activities, of which one is a target activity (TA) and the other is an undesirable side effect (USE). Naturally, those with high TA and low USE form the first subclass, those with low TA and high USE the second, and the rest go into the third, intermediate subclass. [Pg.221]
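The three-way partition can be expressed as a tiny rule; the activity cutoffs of 0.5 are hypothetical values chosen only for illustration:

```python
def assign_subclass(ta, use, ta_cut=0.5, use_cut=0.5):
    """High target activity and low side effect -> subclass 1;
    the reverse -> subclass 2; everything else -> the intermediate
    subclass 3.  The 0.5 cutoffs are hypothetical illustration values."""
    if ta >= ta_cut and use < use_cut:
        return 1
    if ta < ta_cut and use >= use_cut:
        return 2
    return 3

# (TA, USE) pairs: clearly active, clearly inactive, intermediate
compounds = [(0.9, 0.1), (0.2, 0.8), (0.7, 0.7)]
print([assign_subclass(ta, use) for ta, use in compounds])  # -> [1, 2, 3]
```

In practice the cutoffs would be chosen from the experimental activity distributions rather than fixed a priori.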

The functionality of the algorithm can be exemplified with the help of a real-world dataset. [Pg.221]

Initially the dataset contained 818 compounds, among which 31 were active (high TA, low USE), 157 inactive (low TA, high USE), and the rest intermediate. When the complete dataset was employed, none of the active compounds and 47 of the inactives were correctly classified by using Kohonen self-organizing maps (KSOM). [Pg.221]

In our experience, another important advantage of the method is that one can use all the available descriptors without taking care over their choice. The method does not require any significant CPU resources, even when applied to a large dataset. [Pg.221]


See other pages where Dataset is mentioned: [Pg.465]    [Pg.496]    [Pg.499]    [Pg.778]    [Pg.45]    [Pg.102]    [Pg.195]    [Pg.195]    [Pg.207]    [Pg.221]    [Pg.222]    [Pg.222]   
See also in sourсe #XX -- [ Pg.35 ]

See also in sourсe #XX -- [ Pg.35 ]

See also in sourсe #XX -- [ Pg.168 , Pg.171 , Pg.172 ]

See also in sourсe #XX -- [ Pg.274 , Pg.276 , Pg.280 , Pg.282 , Pg.283 , Pg.284 , Pg.285 , Pg.286 , Pg.287 ]







Analysis Dataset Models

Array dataset

Boxplots, datasets

Classified Datasets

Combining datasets

Consistency of a reaction dataset

Dataset characteristics

Dataset dimensionality

Dataset redundancy

Dataset training

Datasets

Datasets common types

Datasets defining elements

Datasets merging

Datasets windowing data

Datasets, for virtual screening

Experimental Solubility Datasets

Gene expression datasets

Huuskonen dataset

Looking at some feasible sets from GRI-Mech dataset

Molecular datasets

Multi-dataset analyses

Multiclass dataset

Multivariate dataset

Overview of dataset

Pharmacophore discovery datasets

Prevalence in Classified Datasets

Pseudo-datasets

Selwood dataset

Solubility datasets

Steroids dataset

Toxicological dataset

Training and Test Datasets
