Chemical structure multivariate data analysis

Multivariate data analysis usually starts with generating a set of spectra and the corresponding chemical structures as a result of a spectrum similarity search in a spectrum database. The peak data are transformed into a set of spectral features and the chemical structures are encoded into molecular descriptors [80]. A spectral feature is a property that can be automatically computed from a mass spectrum. Typical spectral features are the peak intensity at a particular mass/charge value, or logarithmic intensity ratios. The goal of transformation of peak data into spectral features is to obtain descriptors of spectral properties that are more suitable than the original peak list data. [Pg.534]
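
The transformation from a peak list to spectral features can be sketched in a few lines. The following is only a minimal illustration, not the procedure of reference [80]: the function name `spectral_features`, the base-peak normalization to 100, and the intensity floor used to avoid taking the logarithm of zero are all assumptions made for this example.

```python
import math

def spectral_features(peaks, masses, ratio_pairs, floor=1.0):
    """Turn a peak list into a feature vector (illustrative sketch).

    peaks       -- dict mapping m/z value to raw intensity
    masses      -- m/z values whose normalized intensity becomes a feature
    ratio_pairs -- (m1, m2) pairs for logarithmic intensity ratios
    floor       -- assumed lower intensity bound, avoids log of zero
    """
    base = max(peaks.values())
    # Normalize to the base peak (base peak = 100), an assumed convention.
    norm = {mz: 100.0 * inten / base for mz, inten in peaks.items()}
    # Feature type 1: intensity at a particular m/z value (0 if no peak).
    features = [norm.get(m, 0.0) for m in masses]
    # Feature type 2: logarithmic intensity ratio of two m/z values.
    for m1, m2 in ratio_pairs:
        i1 = max(norm.get(m1, 0.0), floor)
        i2 = max(norm.get(m2, 0.0), floor)
        features.append(math.log10(i1 / i2))
    return features

# Made-up peak list, used only to exercise the function.
peaks = {43: 500.0, 57: 1000.0, 71: 250.0}
features = spectral_features(peaks, masses=[43, 57, 85], ratio_pairs=[(43, 57)])
print(features)
```

Each spectrum then contributes one row of features to the data matrix used in the subsequent multivariate analysis.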

Despite the broad definition of chemometrics, the most important part of it is the application of multivariate data analysis to chemistry-relevant data. Chemistry deals with compounds, their properties, and their transformations into other compounds. Major tasks of chemists are the analysis of complex mixtures, the synthesis of compounds with desired properties, and the construction and operation of chemical technological plants. However, chemical/physical systems of practical interest are often very complicated and cannot be described sufficiently by theory. Actually, a typical chemometrics approach is not based on first principles (that is, scientific laws and rules of nature) but is data driven. Multivariate statistical data analysis is a powerful tool for analyzing and structuring data sets that have been obtained from such systems, and for making empirical mathematical models that are, for instance, capable of predicting the values of important properties that are not directly measurable (Figure 1.1). [Pg.15]

Varmuza, K. Analytical Sciences 17(suppl.), 2001, i467-i470. Recognition of relationships between mass spectral data and chemical structures by multivariate data analysis. [Pg.306]

Most of the applications with infrared spectroscopic data involve classification, calibration, and prediction. This is because the infrared profiles describe chemical structure that is directly related to the physical and/or chemical properties of the systems analyzed. Changes in the infrared profile of a system imply variation in the chemical structure and hence variations in the properties of the system. These data profiles contain information, and extraction of this information requires multivariate data analysis. [Pg.146]

Hawkins, D.M., The problem of overfitting. J. Chem. Inf. Comput. Sci., 44, 1-12 (2004). Varmuza, K., Penchev, P., Stancl, F. and Werther, W., Systematic structure elucidation of organic compounds by mass spectra classification. J. Mol. Struct., 408/409, 91-96 (1997). Varmuza, K., Recognition of relationships between mass spectral data and chemical structures by multivariate data analysis. Anal. Sci., 17(Suppl.), i467-i470 (2001). [Pg.167]

Chemical structures can be described by binary molecular descriptors (used as the Y-matrix in multivariate data analysis). In the case of yes/no-classifications a single binary y-variable can be used to indicate whether a particular structural property is present or not. The type of molecular descriptors (small or large fragments, atom-centered fragments, functional groups or classes of compounds) is essential to obtain a close relationship between structures and spectra. [Pg.360]

For example, the objects may be chemical compounds. The individual components of a data vector are called features and may, for example, be molecular descriptors (see Chapter 8) specifying the chemical structure of an object. For statistical data analysis, these objects and features are represented by a matrix X, which has a row for each object and a column for each feature. In addition, each object will have one or more properties that are to be investigated, e.g., a biological activity of the structure or a class membership. This property or these properties are collected in a matrix Y. Thus, the data matrix X contains the independent variables, whereas the matrix Y contains the dependent ones. Figure 9-3 shows a typical multivariate data matrix. [Pg.443]
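
As a small illustration of this layout, the sketch below builds X and Y as plain nested lists, with one row per compound. The choice of three phenols, the descriptor names, and the single class-membership column ("contains a halogen") are assumptions chosen for the example; they are not taken from Figure 9-3.

```python
# Objects (rows): three compounds, chosen only for illustration.
compounds = ["phenol", "4-chlorophenol", "4-nitrophenol"]

# Features (columns of X): simple molecular descriptors.
descriptor_names = ["molecular_mass", "n_chlorine", "n_nitro"]

# X: one row per object, one column per feature (independent variables).
X = [
    [94.11, 0, 0],
    [128.56, 1, 0],
    [139.11, 0, 1],
]

# Y: one row per object, one column per investigated property
# (dependent variables). Here: class membership "contains a halogen".
Y = [[0], [1], [0]]

def shape(matrix):
    """Return (number of rows, number of columns) of a nested list."""
    return (len(matrix), len(matrix[0]))

def column(matrix, j):
    """Return column j, i.e. one feature across all objects."""
    return [row[j] for row in matrix]

print(shape(X), shape(Y))
print(dict(zip(compounds, column(X, 0))))
```

X and Y must have the same number of rows, because row i of both matrices refers to the same object.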

On the other hand, factor analysis involves other manipulations of the eigenvectors and aims to gain insight into the structure of a multidimensional data set. The use of this technique was first proposed in biological structure-activity relationships (i.e., SAR) and illustrated with an analysis of the activities of 21 diphenylaminopropanol derivatives in 11 biological tests [116-119, 289]. This method has been more commonly used to determine the intrinsic dimensionality of certain experimentally determined chemical properties, that is, the number of fundamental factors required to account for the variance. One of the best FA techniques is the Q-mode, which groups a multivariate data set according to the data structure defined by the similarity between samples [1, 313-316]. It is devoted exclusively to the interpretation of the inter-object relationships in a data set, rather than to the inter-variable (or covariance) relationships explored with R-mode factor analysis. The measure of similarity used is the cosine theta matrix, i.e., the matrix whose elements are the cosines of the angles between all sample pairs [1, 313-316]. [Pg.269]
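
The cosine theta matrix itself is straightforward to compute. The sketch below shows only this similarity step, not a full Q-mode factor analysis; the function name `cosine_theta_matrix` and the toy data are assumptions for this example, and samples are assumed to have non-zero length.

```python
import math

def cosine_theta_matrix(X):
    """Cosine of the angle between every pair of samples (rows of X).

    Assumes no row is an all-zero vector, since its angle is undefined.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    return [
        [
            sum(a * b for a, b in zip(X[i], X[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

# Toy data: rows are samples, columns are variables.
X = [
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as sample 0, so cosine 1
    [0.0, 3.0],   # orthogonal to samples 0 and 1, so cosine 0
]
C = cosine_theta_matrix(X)
for row in C:
    print(row)
```

The result is a symmetric samples-by-samples matrix, which is why the method addresses inter-object rather than inter-variable relationships.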

Using multivariable linear regression, a set of equations can be derived from the parameterized data. Statistical analysis yields the "best" equations to fit the empirical data. This mathematical model forms a basis to correlate the biological activity to the chemical structures. [Pg.152]
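
A least-squares fit of this kind can be sketched as follows. This is a minimal illustration that solves the normal equations by Gaussian elimination; it is not the specific procedure of the source, the helper names `solve` and `fit_mlr` are invented for the example, and the data are synthetic values generated from a known equation so that the result can be checked.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    Xa = [[1.0] + list(row) for row in X]  # prepend intercept column
    p = len(Xa[0])
    XtX = [[sum(r[i] * r[j] for r in Xa) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(p)]
    return solve(XtX, Xty)

# Synthetic data: two structural parameters per compound, and an
# "activity" generated exactly from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

coef = fit_mlr(X, y)
print(coef)  # approximately [1, 2, 3]
```

With real, noisy data the fit will not be exact, and strongly correlated descriptors make the normal equations ill-conditioned, which is one reason latent-variable methods such as PLS are often preferred.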

In comparison with NMR, mass spectrometry is more sensitive and, thus, can be used for compounds of lower concentration. While it is easily possible to measure picomoles of compounds, detection limits at the attomole level can be reached. Mass spectrometry also has the ability to identify compounds through elucidation of their chemical structure by MS/MS and determination of their exact masses. This is true at least for compounds below 500 Da, the limit at which very high-resolution mass spectrometry can unambiguously determine the elemental composition. In 2005, this could only be done by FTICR. Orbitrap appears to be a good alternative, with a more limited mass range but a better signal-to-noise ratio. Furthermore, mass spectrometry allows relative concentration determinations to be made between samples with a dynamic range of about 10000. Absolute quantification is also possible but requires reference compounds. It should be mentioned that, although mass spectrometry is an important technique for metabolome analysis, another key tool is specific software to manipulate, summarize and analyse the complex multivariate data obtained. [Pg.388]

Multiple intercorrelations between descriptors of chemical structures are illustrated best using multivariate statistics (section 3.2.2). A principal component analysis of the data set of 18 descriptors (Table 1.6, Figure 1.11) revealed that > 80% of the information content of these descriptors is expressed by four factors that explain 54.7%, 15.8%, 8.1% and 5.6% of the total variance, respectively. [Pg.44]
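
The arithmetic behind the "> 80%" statement can be checked by accumulating the four percentages quoted above. The snippet uses only those quoted values; the variable names and the 80% threshold lookup are illustrative.

```python
from itertools import accumulate

# Percent of total variance explained by factors 1 to 4, as quoted above.
explained = [54.7, 15.8, 8.1, 5.6]

# Running total after each additional factor.
cumulative = [round(total, 1) for total in accumulate(explained)]
print(cumulative)

# Smallest number of factors whose cumulative share exceeds 80%.
n_needed = next(i + 1 for i, c in enumerate(cumulative) if c > 80.0)
print(n_needed)
```

Three factors together account for 78.6% of the variance, so the fourth factor is needed to pass the 80% mark, giving 84.2% in total.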

Vectors. A series of scalars can be arranged in a column or in a row; they are then called a column vector or a row vector. If the elements of a column vector can be attributed to special characteristics, e.g., to compounds, then data analysis can be completed. The chemical structures of compounds can be characterized with different numbers called descriptors, variables, predictors, or factors. For example, toxicity data were measured for a series of aromatic phenols. The toxicity values can be arranged in a column in arbitrary order; each row corresponds to a phenolic compound. Many descriptors can be calculated for each compound (e.g., molecular mass, van der Waals volume, polarity parameters, quantum chemical descriptors, etc.). After building a multivariate model (generally one variable cannot encode the toxicity properly), we will be able to predict toxicity values for phenolic compounds for which no toxicity has been measured yet. This approach is generally called searching for quantitative structure-activity relationships, or simply the QSAR approach. [Pg.144]

Exploratory analysis of spectral data by PCA, PLS, cluster analysis, or Kohonen mapping tries to gain insight into the spectral data structure and into hidden factors, as well as to find clusters of similar spectra that can be interpreted in terms of similar chemical structures. Classification methods, such as LDA, PLS, SIMCA, KNN classification, and neural networks, have been used to generate spectral classifiers for the automatic recognition of structural properties from spectral data. The multivariate methods most often used for spectra prediction (mainly NMR, rarely IR) are neural networks. Table 6 contains a summary of recent works in this field (see Infrared Data Correlations with Chemical Structure). [Pg.360]
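
Of the classification methods listed, KNN is the simplest to sketch. The example below is a minimal illustration with Euclidean distance and majority voting; the function name `knn_predict`, the value k = 3, and the two-dimensional toy "spectral features" are assumptions made for this example, not data from the cited works.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a class label by majority vote among the k nearest
    training objects (Euclidean distance in feature space)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors; label 1 means "structural property present",
# label 0 means "absent". All values are invented for illustration.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]

pred_absent = knn_predict(train_X, train_y, [0.5, 0.5])
pred_present = knn_predict(train_X, train_y, [5.5, 5.5])
print(pred_absent, pred_present)
```

In practice the features would be spectral features computed from the spectra, and k would be chosen by cross-validation rather than fixed in advance.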

Artificial Intelligence in Chemistry; Chemometrics: Multivariate View on Chemical Problems; Fuzzy Methods in Chemistry; Infrared Spectra Interpretation by the Characteristic Frequency Approach; Machine Learning Techniques in Chemistry; Neural Networks in Chemistry; Quality Control, Data Analysis; Structural Similarity Measures for Database Searching; Structure and Substructure Searching; Structure Determination by Computer-based Spectrum Interpretation; Structure Generators; Structure Representation. [Pg.1306]

Chemometrics: Multivariate View on Chemical Problems; Comparative Molecular Field Analysis (CoMFA); Environmental Chemistry: QSAR; Genetic Algorithms: Introduction and Applications; Genetic and Evolutionary Algorithms; Linear Free Energy Relationships (LFER); Octanol/Water Partition Coefficients; Partial Least Squares Projections to Latent Structures (PLS) in Chemistry; Quality Control, Data Analysis. [Pg.2319]

Chapter 2 (Statistical Space for Multivariate Correlations) aims to prepare the conceptual-computational ground for correlating chemical structure with biological activity by the celebrated quantitative structure-activity relationships (QSARs). Additionally, the fundamental and advanced statistical frameworks are detailed to best understand the classical multilinear regression analysis, generalized by an algebraic (in quantum Hilbert space) reformulation in terms of data vectors and orthogonality conditions (explained in Chapter 3). [Pg.604]
