Principal components statistical properties

Multiple linear regression is strictly a parametric supervised learning technique. A parametric technique is one which assumes that the variables conform to some distribution (often the Gaussian distribution); the properties of that distribution are assumed in the underlying statistical method. A non-parametric technique does not rely upon the assumption of any particular distribution. A supervised learning method is one which uses information about the dependent variable to derive the model. An unsupervised learning method does not. Thus cluster analysis, principal components analysis and factor analysis are all examples of unsupervised learning techniques. [Pg.719]

We now consider a type of analysis in which the data (which may consist of solvent properties or of solvent effects on rates, equilibria, and spectra) again are expressed as a linear combination of products as in Eq. (8-81), but now the statistical treatment yields estimates of both a_i and x_i. This method is called principal component analysis or factor analysis. A key difference between multiple linear regression analysis and principal component analysis (in the chemical setting) is that regression analysis adopts chemical models a priori, whereas in factor analysis the chemical significance of the factors emerges (if desired) as a result of the analysis. We will not explore the statistical procedure, but will cite some results. We have already encountered examples in Section 8.2 on the classification of solvents and in the present section in the form of the Swain et al. treatment leading to Eq. (8-74). [Pg.445]

Principal components (PCs) are linear combinations of random or statistical variables, which have special properties in terms of variances. [Pg.268]
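
The "special properties in terms of variances" can be written out compactly. The block below is a sketch in standard textbook notation, not notation taken from the excerpt: x is the vector of p original variables, Σ its covariance matrix, α_k the k-th unit-length coefficient vector and z_k the k-th principal component. All of these symbols are assumptions introduced here for illustration.

```latex
\begin{align}
z_k &= \alpha_k^{\mathsf{T}} x, & \alpha_k^{\mathsf{T}} \alpha_k &= 1, \qquad k = 1, \dots, p \\
\Sigma \alpha_k &= \lambda_k \alpha_k, & \lambda_1 &\ge \lambda_2 \ge \dots \ge \lambda_p \ge 0 \\
\operatorname{var}(z_k) &= \lambda_k, & \operatorname{cov}(z_j, z_k) &= 0 \quad (j \ne k)
\end{align}
```

In words: each component is a linear combination of the original variables, its variance equals an eigenvalue of the covariance matrix, the components are mutually uncorrelated, and they are ordered by decreasing variance.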

Evaluation of the statistical properties is a fundamental part of any statistical analysis and here we concentrated on the distribution of each variable. To reduce the dimensionality of this data set we used principal component analysis (PCA) to explore the covariance structure of these data and to reduce the variables to a more manageable number (PA1 method with no rotation, 21). [Pg.150]

The orthogonality of a set of molecular descriptors is a very desirable property. Classification methodologies such as CART (11) (or other decision-tree methods) are not invariant to rotations of the chemistry space. Such methods may encounter difficulties with correlated descriptors (e.g., production of larger decision trees). Often, correlated descriptors necessitate the use of principal components transforms that require a set of reference data for their estimation (at worst, the transforms depend only on the data at hand and, at best, they are trained once from some larger collection of compounds). In probabilistic methodologies, such as Binary QSAR (12), approximation of statistical independence is simplified when uncorrelated descriptors are used. In addition,... [Pg.267]
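
The idea of estimating a principal components transform from reference data and then applying it to other compounds can be sketched for the simplest case of two correlated descriptors. This is a minimal illustration in plain Python, not the procedure used by the authors of the excerpt: the descriptor values, the function names `fit_pc_rotation` and `apply_pc_rotation`, and the restriction to two descriptors are all assumptions made here. For two variables the transform reduces to centering followed by a rotation onto the major axis of the covariance ellipse.

```python
import math

def fit_pc_rotation(reference):
    """Estimate centering and rotation angle from reference data (2 descriptors)."""
    n = len(reference)
    mean_x = sum(x for x, _ in reference) / n
    mean_y = sum(y for _, y in reference) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in reference) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in reference) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in reference) / (n - 1)
    # Angle of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), theta

def apply_pc_rotation(points, mean, theta):
    """Center with the reference mean, then rotate onto the principal axes."""
    c, s = math.cos(theta), math.sin(theta)
    mx, my = mean
    return [((x - mx) * c + (y - my) * s, -(x - mx) * s + (y - my) * c)
            for x, y in points]

def covariance(pairs):
    """Sample covariance of the two coordinates in a list of pairs."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    return sum((a - ma) * (b - mb) for a, b in pairs) / (n - 1)

# Hypothetical reference set: two correlated descriptors per compound.
reference = [(1.2, 2.1), (2.0, 2.9), (2.9, 4.2), (3.8, 4.8), (5.1, 6.1), (6.0, 6.8)]
mean, theta = fit_pc_rotation(reference)
reference_scores = apply_pc_rotation(reference, mean, theta)

# New compounds are transformed with the parameters trained on the reference set.
new_compounds = [(2.5, 3.6), (4.4, 5.0)]
new_scores = apply_pc_rotation(new_compounds, mean, theta)

print("covariance of original descriptors:", covariance(reference))
print("covariance of rotated descriptors: ", covariance(reference_scores))
```

The rotated descriptors are uncorrelated on the reference set by construction. On new compounds they are only approximately uncorrelated, which is why the choice of reference collection matters.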

Chemists and statisticians use the term mixture in different ways. To a chemist, any combination of several substances is a mixture. In more formal statistical terms, however, a mixture involves a series of factors whose total is a constant sum; this property is often called closure and will be discussed in completely different contexts in the area of scaling data prior to principal components analysis (Chapter 4, Section 4.3.6.5 and Chapter 6, Section 6.2.3.1). Hence in statistics (and chemometrics) a solvent system in HPLC or a blend of components in products such as paints, drugs or food is considered a mixture, as each component can be expressed as a proportion and the total adds up to 1 or 100%. The response could be a chromatographic separation, the taste of a foodstuff or physical properties of a manufactured material. Often the aim of experimentation is to find an optimum blend of components that tastes best, or provides the best chromatographic separation, or the material that is most durable. [Pg.84]
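
One consequence of closure for covariance-based methods such as PCA can be checked numerically. The sketch below uses made-up solvent amounts and a hypothetical helper `close`; it only illustrates the constant-sum property and is not taken from the cited chapters. Because every row sums to 1, each row of the covariance matrix sums to zero, so the matrix is singular and a three-component mixture has at most two components with non-zero variance.

```python
def close(amounts):
    """Rescale raw amounts to proportions that sum to 1 (closure)."""
    total = sum(amounts)
    return [a / total for a in amounts]

# Hypothetical raw amounts (e.g. mL) of three solvents in five blends.
raw_blends = [
    [60.0, 30.0, 10.0],
    [25.0, 25.0, 50.0],
    [10.0, 45.0, 35.0],
    [40.0, 40.0, 20.0],
    [15.0, 5.0, 30.0],
]
blends = [close(row) for row in raw_blends]

n = len(blends)
p = len(blends[0])
means = [sum(col) / n for col in zip(*blends)]
centered = [[x - m for x, m in zip(row, means)] for row in blends]

# Sample covariance matrix of the three proportions.
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
       for i in range(p)]

# Each row sums to (numerically) zero: the direction (1, 1, 1) carries
# no variance, so the covariance matrix is singular.
row_sums = [sum(row) for row in cov]
print("row sums of covariance matrix:", row_sums)
```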

Near-infrared (NIR) spectroscopy is becoming an important technique for pharmaceutical analysis. This spectroscopy is simple and easy because no sample preparation is required and samples are not destroyed. In the pharmaceutical industry, NIR spectroscopy has been used to determine several pharmaceutical properties, and a growing literature exists in this area. A variety of chemoinfometric and statistical techniques have been used to extract pharmaceutical information from raw spectroscopic data. Calibration models generated by multiple linear regression (MLR) analysis, principal component analysis, and partial least squares regression analysis have been used to evaluate various parameters. [Pg.74]

Because of their fixed length, descriptors are valuable representations of molecules for use in further statistical calculations. The most important methods used to compare chemical descriptors are linear and nonlinear regression, correlation methods, and correlation matrices. Since patterns can be hard to find in data of high dimension, where graphical representation is not available, principal component analysis (PCA) is a powerful tool for analyzing data. PCA can be used to identify patterns in data and to express the data in such a way as to highlight their similarities and differences. Similarities or diversities in data sets and their property data can be identified with the aid of these techniques. [Pg.337]

With a large data set of structures and properties, it is possible to use multivariate statistical methods such as principal components analysis to try to identify patterns. The application of statistics to chemistry is known as chemometrics, and an internet search for this keyword will lead to a variety of useful sources of information on the field. Some QSAR methods are based on the molecular orbitals of the compounds in the... [Pg.313]

In the literature, a large number of substituent descriptors have been reported. In order to use this information for substituent selection, appropriate statistical methods may be used. Pattern recognition or data reduction techniques, such as principal component analysis (PCA) or cluster analysis (CA) are good choices. As explained in Section V in more detail, PCA consists of condensing the information in a data table into a few new descriptors made of linear combinations of the original ones. These new descriptors are called principal components or latent variables. This technique has been applied to define new descriptors for amino acids, as well as for aromatic or aliphatic substituents, which are called principal properties (PPs). The principal properties can be used in factorial design methods or as variables in QSAR analysis. [Pg.357]

A later chapter will discuss these methods in more detail. For example, support vector machines and traditional neural networks are analogs of multiple regression or discriminant analysis that provide more flexibility in the form of the relationship between molecular properties and bioactivity. Kohonen neural nets are a more flexible analog to principal component analysis. Various Bayesian approaches are alternatives to the statistical methods described earlier. A freely available program offers many of these capabilities. ... [Pg.81]

For a QSAR analysis, a training set of compounds with known descriptor properties (e.g., pKa values, surface area, dipole moment, etc.), including the property of interest, is required. The required dataset can be derived from experimental results or from high-level ab-initio or DFT data. The Hansch analysis [81] is a statistical method used to analyze and correlate these data in order to determine the magnitude of the target property [Eq. (2.15)]; principal component analysis (PCA) is a more recent alternative [82-84]. [Pg.18]

In this section we shall consider the rather general case where for a series of chemical compounds measurements are made in a number of parallel biological tests and where a set of descriptor variables is believed to be related to the biological potencies observed. In order to understand the data in their entirety and to deal adequately with the mathematical properties of such data, methods of multivariate statistics are required. A variety of such methods is available as, for example, multivariate regression, canonical correlation, principal component analysis, principal component regression, partial least squares analysis, and factor analysis, which have all been applied to biological or chemical problems (for reviews, see [1-11]). Which method to choose depends on the ultimate objective of an analysis and the properties of the data. We have found principal component and factor analysis particularly useful. For this reason and also since many multivariate methods make use of components or factors we will start with these methods in some detail, while the discussion of other approaches will be less extensive. [Pg.44]

Principal components are linear combinations of random or statistical variables, which have special properties in terms of variances. The central idea of PCA is to reduce the dimensionality of a data set that may consist of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the PCs, which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables [292-295]. [Pg.357]
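
The transformation described above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the method of the cited references: the data values are invented, the function name `pca` is hypothetical, and the eigenvectors of the covariance matrix are found by power iteration with deflation simply to keep the example dependency-free. In practice a library eigendecomposition or singular value decomposition routine would normally be used.

```python
import math

def pca(rows, n_components, iters=1000):
    """Return (variances, loadings, scores) for the leading components."""
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    variances, loadings = [], []
    for _component in range(n_components):
        # Power iteration: repeated multiplication by the covariance matrix
        # converges to the eigenvector with the largest remaining eigenvalue.
        start = [float(k + 1) for k in range(p)]
        scale = math.sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance along direction v.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        variances.append(lam)
        loadings.append(v)
        # Deflation: remove the variance already explained by this component.
        for i in range(p):
            for j in range(p):
                cov[i][j] -= lam * v[i] * v[j]
    # Scores: each centered sample projected onto each loading vector.
    scores = [[sum(x * a for x, a in zip(row, v)) for v in loadings]
              for row in centered]
    return variances, loadings, scores

# Hypothetical data: the second variable is roughly twice the first,
# the third is small noise around 1.0.
rows = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.8],
    [4.0, 7.9, 1.1],
    [5.0, 10.1, 0.9],
    [6.0, 12.0, 1.2],
    [7.0, 13.8, 1.0],
]
variances, loadings, scores = pca(rows, 3)
fractions = [v / sum(variances) for v in variances]
print("fraction of total variance per component:", fractions)
```

Because the first two variables are strongly interrelated, the first component captures almost all of the variation, so the three original variables could be replaced by one or two scores with little loss.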
