Computer principal components

A common practice is to scale each descriptor to have a standard deviation of 1. Another is to compute principal components and confine the analysis to the first h components, where h may range from 1 to 20. This is an ad hoc form of dimension reduction that does not remove irrelevant information from the analysis. At Lilly, we prefer a careful descriptor validation to avoid including many irrelevant descriptors in the analysis, combined with a dimension reduction criterion using... [Pg.81]
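As a minimal sketch of the two practices described above (unit-variance scaling followed by truncation to the first h principal components), the following uses NumPy's SVD. The function names `autoscale` and `first_h_scores` and the random descriptor table are illustrative assumptions, not anything taken from the quoted text.

```python
import numpy as np


def autoscale(X):
    """Centre each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # assumption: constant descriptors are left unscaled
    return (X - X.mean(axis=0)) / std


def first_h_scores(X, h):
    """Scores on the first h principal components of the autoscaled data."""
    Z = autoscale(X)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :h] * s[:h]          # n x h score matrix
    explained = s**2 / np.sum(s**2)    # fraction of variance per component
    return scores, explained[:h]


# Hypothetical descriptor table: 50 compounds, 8 descriptors on different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * np.arange(1, 9)
scores, explained = first_h_scores(X, h=3)
print(scores.shape, explained.round(3))
```

The truncation looks only at variance within the descriptor table, so nothing in this procedure checks whether the retained components are relevant to any response.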

One way to do this is to compute principal components models for each block and then determine the correlation between the components from each block. This method is usually called canonical correlation [3]. From a philosophical point of view, canonical correlation is appealing since it does not assume anything about the direction of the interdependencies of the two blocks. From a practical point of view, it is, however, not very useful as a tool for problem solving. It is not certain that variables which describe a large variation in the X block are related to large variations in the Y block. It may be that only a small part of the variation in the X block is strongly related to a large and systematic variation of the responses in the Y block. [Pg.454]
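The procedure described above can be sketched as follows, assuming each block is column-centred before its principal components are extracted. The helper names and the toy blocks are hypothetical. The cross-correlation matrix holds the correlation between every X-block component and every Y-block component; its singular values are the canonical correlations between the two retained subspaces.

```python
import numpy as np


def orthonormal_scores(B, h):
    """First h principal component scores of a block, each scaled to unit length."""
    Bc = B - B.mean(axis=0)
    U, s, Vt = np.linalg.svd(Bc, full_matrices=False)
    return U[:, :h]


def block_component_correlations(X, Y, hx, hy):
    """Correlations between block components, plus the canonical correlations."""
    Ux = orthonormal_scores(X, hx)
    Uy = orthonormal_scores(Y, hy)
    cross = Ux.T @ Uy  # entry (i, j): correlation of X score i with Y score j
    canonical = np.linalg.svd(cross, compute_uv=False)
    return cross, canonical


# Toy blocks (assumption): Y is an exact linear mixture of X, so the two
# blocks span the same space and every canonical correlation should be 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
M = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
Y = X @ M
cross, canonical = block_component_correlations(X, Y, hx=3, hy=3)
print(canonical.round(6))
```

The calculation treats both blocks symmetrically, which matches the remark that no direction of dependence is assumed.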

A much more serious problem inherent in the PCA approach is the need to recompute the transform from the high-dimensional pattern space to a low-dimensional component space each time material is added to or removed from the system. This is clearly a very expensive option if the system is not closed and training data are liable to be added and/or removed at regular intervals. Worse still, the time the system takes to compute principal components (i.e. to train) is O(n²), where n is the number of taxa in the system. Thus, the time taken to train the system increases in proportion to the square of the number of species it contains. This is clearly not ideal if we want to build a general-purpose system capable of scaling to tens of thousands of taxa. [Pg.103]

The concept of property space, which was coined to quantitatively describe phenomena in the social sciences [11, 12], has found many applications in computational chemistry to characterize chemical space, i.e. the range in structure and properties covered by a large collection of different compounds [13]. The usual method to approach a quantitative description of chemical space is first to calculate a number of molecular descriptors for each compound and then to use multivariate analyses such as principal component analysis (PCA) to build a multidimensional hyperspace where each compound is characterized by a single set of coordinates. [Pg.10]

The computational implementation of principal components regression is very straightforward. [Pg.329]
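One common way to carry out principal components regression is sketched below: regress the centred response on the first h score vectors, then map the result back to coefficients for the original variables. The function name `pcr_fit` and the simulated data are assumptions for illustration; the quoted source does not show its own implementation here.

```python
import numpy as np


def pcr_fit(X, y, h):
    """Principal components regression on the first h components.

    Returns the intercept b0 and the coefficient vector b for the
    original (uncentred) variables.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:h].T                 # loadings, p x h
    T = Xc @ V                   # scores, n x h (mutually orthogonal columns)
    q = (T.T @ yc) / s[:h]**2    # least squares on scores; T'T is diag(s^2)
    b = V @ q
    b0 = y_mean - x_mean @ b
    return b0, b


# Simulated data (assumption): 40 samples, 5 predictors, known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = 2.0 + X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + 0.1 * rng.normal(size=40)
b0, b = pcr_fit(X, y, h=3)
print(b0.round(3), b.round(3))
```

When h equals the number of predictors and the centred X has full column rank, this reduces to ordinary least squares, which is a convenient sanity check.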

PLS has been introduced in the chemometrics literature as an algorithm with the claim that it finds simultaneously important and related components of X and of Y. Hence the alternative explanation of the acronym PLS: Projection to Latent Structure. The PLS factors can loosely be seen as modified principal components. The deviation from the PCA factors is needed to improve the correlation at the cost of some decrease in the variance of the factors. The PLS algorithm effectively mixes two PCA computations, one for X and one for Y, using the NIPALS algorithm. It is assumed that X and Y have been column-centred as usual. The basic NIPALS algorithm can best be demonstrated as an easy way to calculate the singular vectors of a matrix, viz. via the simple iterative sequence (see Section 31.4.1) ... [Pg.332]
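The iterative sequence itself is elided in the excerpt, so the following is only a sketch of the usual single-matrix NIPALS iteration, assuming a column-centred X: alternate between projecting onto the current score vector and the current loading vector until the scores stop changing. The starting vector, tolerance and function name are illustrative choices.

```python
import numpy as np


def nipals_first_component(X, tol=1e-10, max_iter=1000):
    """Leading score vector t and unit-length loading vector p of X via NIPALS."""
    X = np.asarray(X, dtype=float)
    # Assumption: start from the column with the largest variance.
    t = X[:, np.argmax(X.var(axis=0))].copy()
    for _ in range(max_iter):
        p = X.T @ t
        p /= np.linalg.norm(p)   # loading vector, normalised to length 1
        t_new = X @ p            # updated score vector
        converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p


# Toy matrix (assumption): columns with clearly different variances.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)              # column-centre, as the text assumes
t, p = nipals_first_component(X)
print(np.linalg.norm(t))         # estimate of the largest singular value
```

At convergence, p is the first right singular vector (up to sign), the length of t is the largest singular value, and t divided by its length is the first left singular vector. Further components would be obtained by subtracting the outer product of t and p from X and repeating.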

Fig. 37.2. Principal components loading plot of 7 physicochemical substituent parameters, as obtained from the correlations in Table 37.5 [39,40]. The horizontal and vertical axes account for 46 and 31%, respectively, of the correlations. Most of the residual correlation is along the perpendicular to the plane of the diagram. The line segments define clusters of parameters that have been computed by means of cluster analysis.

A topical subject is the study of receptor proteins and enzymes for which databases with crystallographic information are now available. Computer modelling of the active sites of receptors and enzymes is an important tool in rational drug design. Principal components and cluster analysis can be applied to the primary... [Pg.416]

Dong, D., and McAvoy, T. J., Nonlinear principal component analysis—based on principal curves and neural networks, Comput. Chem. Eng. 20(1), 65 (1996). [Pg.99]

A reconstruction of the original data matrix A is computed by using the preselected number of principal components (i.e., columns in our T and V matrices) as... [Pg.110]
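The formula is cut off in the excerpt. A minimal sketch of a truncated reconstruction, assuming T holds the scores and V the loadings for the retained components so that the reconstruction is the product of T and the transpose of V, might look like this. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def reconstruct(A, h):
    """Rank-h reconstruction of A from its first h principal components."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    T = U[:, :h] * s[:h]   # scores, one column per retained component
    V = Vt[:h].T           # loadings, one column per retained component
    return T @ V.T, s


# Toy data matrix (assumption): 20 samples, 6 variables.
rng = np.random.default_rng(3)
A = rng.normal(size=(20, 6))
A_hat, s = reconstruct(A, h=2)
residual = np.linalg.norm(A - A_hat)   # Frobenius norm of what was discarded
print(round(residual, 4))
```

The residual norm equals the square root of the sum of the squared singular values that were left out, so it shrinks to zero as h approaches the full rank.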

Computational methods have been applied to determine the connections in systems that are not well-defined by canonical pathways. This is either done by semi-automated and/or curated literature causal modeling [1] or by statistical methods based on large-scale data from expression or proteomic studies (a mostly theoretical approach is given by reference [2] and a more applied approach is in reference [3]). Many methods, including clustering, Bayesian analysis and principal component analysis have been used to find relationships and "fingerprints" in gene expression data [4]. [Pg.394]

Tong, H., and Crowe, C. M. (1996). Detecting persistent gross errors by sequential analysis of principal components. Comput. Chem. Eng. 20, S733-S738. [Pg.244]

Principal component analysis (PCA) can be considered as the mother of all methods in multivariate data analysis. The aim of PCA is dimension reduction, and PCA is the most frequently applied method for computing linear latent variables (components). PCA can be seen as a method to compute a new coordinate system formed by the latent variables, which is orthogonal, and where only the most informative dimensions are used. Latent variables from PCA optimally represent the distances between the objects in the high-dimensional variable space (remember, the distance between objects is considered as an inverse similarity of the objects). PCA considers all variables and accommodates the total data structure; it is a method for exploratory data analysis (unsupervised learning) and can be applied to practically any X-matrix; no y-data (properties) are considered and therefore none are necessary. [Pg.73]

The second principal component (PC2) is defined as an orthogonal direction to PC1 and again possessing the maximum possible variance of the scores. For two-dimensional data, only one direction, orthogonal to PC1, is possible for PC2. In general further PCs can be computed up to the number of variables. Subsequent PCs are orthogonal to all previous PCs, and their direction has to cover the maximum... [Pg.74]
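These properties (mutually orthogonal directions, successively smaller score variances, at most as many PCs as variables) can be checked numerically. The sketch below obtains the PCs from an eigendecomposition of the covariance matrix; the function name and the simulated correlated data are assumptions for illustration.

```python
import numpy as np


def principal_components(X):
    """Eigenvalues (variances) and loading vectors, sorted by decreasing variance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order], Xc


# Simulated correlated data (assumption): 100 objects, 4 variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
eigvals, P, Xc = principal_components(X)
scores = Xc @ P

print(np.round(P.T @ P, 6))             # identity: the PCs are orthonormal
print(scores.var(axis=0, ddof=1))       # score variances, largest first
```

Each column of P is one PC direction, and the variance of the corresponding score column equals the matching eigenvalue.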

A principal components multivariate statistical approach (SIMCA) was evaluated and applied to interpretation of isomer specific analysis of polychlorinated biphenyls (PCBs) using both a microcomputer and a mainframe computer. Capillary column gas chromatography was employed for separation and detection of 69 individual PCB isomers. Computer programs were written in ANSI MUMPS to provide a laboratory data base for data manipulation. This data base greatly assisted the analysts in calculating isomer concentrations and data management. Applications of SIMCA for quality control, classification, and estimation of the composition of multi-Aroclor mixtures are described for characterization and study of complex environmental residues. [Pg.195]

To illustrate the environmental application of the SIMCA method we examined a set of isomer specific analyses of sediment samples. The data examined were derived from more than 200 sediment samples taken from a study site on the Upper Mississippi River (41). These analytical data were transferred via magnetic tape from the laboratory data base to the Cyber 175 computer where principal component analyses were conducted on the isomer concentration data (µg/g each isomer). [Pg.223]

The first principal component values (Theta 1) for each sample were determined and these values were correlated with the total PCB concentration (Figure 14) recorded for each sample in a separate computer data base that contained other environmental data such as hydrology and sediment texture. The results indicated that certain samples deviated by factors of about two. Upon examining the sample records, the recorded dilution values... [Pg.223]
