Dataset dimensionality

Previous work in our group had shown the power of self-organizing neural networks for projecting high-dimensional datasets into two dimensions while preserving the clusters present in the high-dimensional space [27]. In effect, 2D maps of the high-dimensional data are obtained that can show clusters of similar objects. [Pg.193]

We must now mention that traditionally, especially in chemometrics, outliers have a different definition, and even a different interpretation. Suppose that we have a k-dimensional characteristic vector, i.e., k different molecular descriptors are used. If we imagine a k-dimensional hyperspace, then the dataset objects will occupy different places in it. Some of them will tend to group together, while others will be allocated to more remote regions. One can by convention define a margin beyond which the realm of strong outliers starts. Moderate outliers stay near this margin. [Pg.213]
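The margin convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is measured from the dataset centroid with a plain Euclidean metric, and the names `classify_outliers`, `margin`, and `band` are hypothetical choices, not definitions taken from the source.

```python
import math

def centroid(points):
    """Mean position of the objects in k-dimensional descriptor space."""
    n, k = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(k)]

def classify_outliers(points, margin, band=0.1):
    """Label each object by its Euclidean distance from the centroid.

    margin: conventional distance beyond which strong outliers start (assumed).
    band:   relative width around the margin treated as "moderate" (assumed).
    """
    c = centroid(points)
    labels = []
    for p in points:
        d = math.dist(p, c)
        if d > margin * (1 + band):
            labels.append("strong")
        elif d >= margin * (1 - band):
            labels.append("moderate")
        else:
            labels.append("inlier")
    return labels

# k = 3 descriptors; the set is symmetric, so the centroid is the origin.
points = [
    (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
    (3, 0, 0), (-3, 0, 0),   # close to the margin
    (0, 8, 0), (0, -8, 0),   # far beyond the margin
]
labels = classify_outliers(points, margin=3.0)
```

Note that the centroid itself is pulled toward remote objects, so in practice a more robust centre (for example a median) might be preferred; that choice is outside what the excerpt specifies.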

Fig. 7. Results of linear and nonlinear methods for analyzing the underlying structure of some data sets: (a) data points randomly distributed, (b) data points on a curved line, and (c) data points on a circle, corresponding to Datasets I, II, and III, respectively. Dimensionality was found by dataset, principal...

In addition to looking for data trends in physical property space using PCA and PLS, trends in chemical structure space can be delineated by viewing nonlinear maps (NLM) of two-dimensional structure descriptors, such as Unity Fingerprints or topological atom pairs, using tools such as Benchware DataMiner [42]. Two-dimensional NLM plots provide an overview of chemical structure space, and biological activity or molecular properties are mapped in a third and/or fourth dimension to look for trends in the dataset. [Pg.189]

In the heart-cutting mode of operation, one or several discrete zones are collected from the first-dimension column and reinjected into the second-dimension separation system. The resulting data are one or more individual one-dimensional datasets and are useful for resolving fused peaks from specific region(s) of the first-dimension separation system. An example of zone reinjection is shown in Fig. 5.1; clearly, the second column provides the selectivity for the three peaks that the first column did not have. [Pg.94]

We saw in chapter 1 that Artificial Intelligence algorithms incorporate a memory. In the ANN the memory of the system is stored in the connection weights, but in the SOM the links are inactive and the vector of weights at each node provides the memory. This vector is of the same length as the dimensionality of points in the dataset (Figure 3.6). [Pg.57]
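To make the point about node weights concrete, here is a minimal sketch of a one-dimensional SOM in which every node holds a weight vector exactly as long as the dimensionality of the data points. The function names, the learning rate, and the fixed neighbourhood radius are assumptions made for illustration, not the book's implementation.

```python
import math
import random

def make_som(n_nodes, dim, seed=0):
    """Create n_nodes weight vectors, each of length dim (the data dimensionality)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]

def best_matching_unit(weights, sample):
    """Index of the node whose weight vector is closest to the sample."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], sample))

def train_step(weights, sample, rate=0.5, radius=1):
    """Move the winning node and its map neighbours toward the sample."""
    bmu = best_matching_unit(weights, sample)
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:          # neighbourhood on the 1D map
            for j in range(len(w)):
                w[j] += rate * (sample[j] - w[j])
    return bmu

# Four-dimensional data points, so each node stores four weights.
data = [[0.1, 0.9, 0.3, 0.5], [0.8, 0.2, 0.7, 0.4]]
som = make_som(n_nodes=5, dim=len(data[0]))
for sample in data:
    train_step(som, sample)
```

All of the learned state lives in `som`, the list of weight vectors; there are no trainable connections between nodes, which is the contrast with a feedforward network drawn in the paragraph above.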

Two-dimensional SOMs are more widely used than those of one dimension because the extra flexibility that is provided by a second dimension allows the map to classify a greater number of classes. Figure 3.14 shows the result of applying a 2D SOM to the same dataset used to create Figure 3.12 and Figure 3.13; the clustering of similar node weights is very clear. [Pg.69]

A one-dimensional SOM is less effective at filling the space defined by input data that cover a two-dimensional space (Figure 3.22) and is rather vulnerable to entanglement, where the ribbon of nodes crosses itself. It does, however, make a reasonable attempt to cover the sample dataset. [Pg.76]

Two-dimensional maps are easy to visualize and are readily programmed. If the dataset is complicated, a larger two-dimensional map might be replaced... [Pg.87]

The GCS shows a great deal of potential as a tool for visualization. Wong and Cartwright have investigated the use of GCS to help in the visualization of large high-dimensionality datasets [2], and Walker and co-workers have used the method to analyze biomedical data [3]. Applications in the field are starting to increase in number, but at present the potential of the method far exceeds its use. [Pg.110]

Wong, J.W.H. and Cartwright, H.M., Deterministic projection by growing cell structure networks for visualization of high-dimensionality datasets. J. Biomed. Inform., 38, 322, 2005. [Pg.111]

Biomedical spectra are often extremely complex. Hyphenated techniques such as MS-MS can generate databases that contain hundreds of thousands or millions of data points. Reduction of dimensionality is then a common step preceding data analysis because of the computational overheads associated with manipulating such large datasets [9]. To classify the very large datasets provided by biomedical spectra, some form of feature selection [10] is almost essential. In sparse data, many combinations of attributes may separate the samples, but not every combination is plausible. [Pg.363]
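As a rough illustration of feature selection before analysis, the sketch below keeps only the spectral channels with the highest variance across samples. A variance filter is just one simple option, chosen here as an assumption; the excerpt does not say which selection method was used, and the function name `select_by_variance` is hypothetical.

```python
from statistics import pvariance

def select_by_variance(spectra, n_keep):
    """Keep the n_keep channels (columns) with the largest variance.

    spectra: list of samples, each a list of intensities of equal length.
    Returns the retained column indices and the reduced dataset.
    """
    n_features = len(spectra[0])
    variances = [pvariance([row[j] for row in spectra]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_keep])          # restore original channel order
    reduced = [[row[j] for j in keep] for row in spectra]
    return keep, reduced

# Three toy "spectra" with four channels each.
spectra = [
    [1.0, 5.0, 0.2, 9.0],
    [1.0, 7.0, 0.1, 3.0],
    [1.0, 6.0, 0.3, 6.0],
]
keep, reduced = select_by_variance(spectra, n_keep=2)
```

A filter like this is unsupervised, so it cannot tell whether the retained channels are chemically or biologically plausible; that check, which the paragraph above warns about, still has to be made separately.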

The same sample was subsequently used to measure /nhch dipole-dipole CCR. In this case the pulse sequence proposed by Yang and Kay [43] was applied. The experiment is also based on an NH(CO)CA experiment. The zero and double quantum coherences result in two two-dimensional (2D) datasets, and the 2D spectra (black and red cross peaks in Fig. 4) are obtained by pairwise addition and subtraction of the measured 2D datasets. The signals are detected at the frequency and split by the J(Cα,Hα) coupling. The black and red cross peaks are shifted by Jnhhq . In this case, too, the CCR rate can be obtained directly from the intensities of the individual peaks ... [Pg.9]

The second dataset consists of 50 N-acetyl peptide amides (Table 2); these peptides have un-ionizable side chains and have previously been studied by Buchwald and Bodor (28). The three-dimensional structures of the di-peptides were built using the force field and partial charges of Kollman (29) as implemented in Sybyl 6.5.3. The initial random starting conformations were energy minimized in vacuo. For all calculations described herein, the dielectric of the medium was set to unity and the electrostatic cut-off distance was set to 16 Å. For each molecule, the Sybyl Genetic Algorithm-based conformational search,... [Pg.221]

Once an n-dimensional chemical reference space has been defined, the descriptor values are calculated for all compounds in a dataset, thereby assigning a coordinate vector to each molecule. In principle, partitioning analysis could proceed in n-dimensional space, but it is often attempted to reduce its dimensionality in order to generate a low-dimensional representa-... [Pg.281]

How is dimension reduction of chemical spaces achieved? There are a number of different concepts and mathematical procedures to reduce the dimensionality of descriptor spaces with respect to a molecular dataset under investigation. These techniques include, for example, linear mapping, multidimensional scaling, factor analysis, or principal component analysis (PCA), as reviewed in ref. 8. Essentially, these techniques either try to identify those descriptors among the initially chosen ones that are most important to capture the chemical information encoded in a molecular dataset or, alternatively, attempt to construct new variables from original descriptor contributions. A representative example will be discussed below in more detail. [Pg.282]

Chen, X., Rusinko, A., Young, S.S., Recursive partitioning analysis of a large structure-activity dataset using three-dimensional descriptors. J. Chem. Inf. Comput. Sci. 1998, 38, 1054-1062. [Pg.205]

Figure 6.22 PCA reduces the dimensionality of the problem by projecting the original dataset onto a lower-dimension PC model, in which the new variables are orthogonal to each other. The distance from point A to the PCA model space equals the residual value for catalyst A.
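The projection and residual described in this caption can be sketched in plain Python. The example below is an illustration under assumptions: it finds the principal components by power iteration with deflation (a simple stand-in for a library eigen-solver), and all function names and the toy data are hypothetical. The residual is the Euclidean distance from a point to the subspace spanned by the retained components.

```python
import math

def mean_center(data):
    n, k = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(k)]
    return mean, [[row[j] - mean[j] for j in range(k)] for row in data]

def covariance(centered):
    n, k = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
            for a in range(k)]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration; the start vector is an arbitrary non-symmetric choice."""
    k = len(matrix)
    v = [float(j + 1) for j in range(k)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pca(data, n_components):
    """Return the mean and the first n_components principal directions."""
    mean, centered = mean_center(data)
    cov = covariance(centered)
    k = len(cov)
    components = []
    for _ in range(n_components):
        v = leading_eigenvector(cov)
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(k) for b in range(k))
        components.append(v)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(k)] for a in range(k)]
    return mean, components

def scores_and_residual(x, mean, components):
    """Project x onto the PC model and measure its distance from the model space."""
    d = [xj - mj for xj, mj in zip(x, mean)]
    scores = [sum(dj * vj for dj, vj in zip(d, v)) for v in components]
    recon = [sum(s * v[j] for s, v in zip(scores, components)) for j in range(len(d))]
    residual = math.sqrt(sum((dj - rj) ** 2 for dj, rj in zip(d, recon)))
    return scores, residual

# Toy 3D dataset lying in the plane z = 0, reduced to a 2-component model.
data = [(0, 0, 0), (2, 0, 0), (4, 0, 0), (0, 1, 0), (2, 1, 0), (4, 1, 0)]
mean, components = pca(data, n_components=2)
point_a = (2, 0.5, 3)   # hypothetical "catalyst A", displaced 3 units off the plane
scores, residual = scores_and_residual(point_a, mean, components)
```

Because the two retained components span the plane of the data and are orthogonal to each other, the residual for `point_a` is simply its out-of-plane displacement.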
