Multivariate data chemical structures

For example, the objects may be chemical compounds. The individual components of a data vector are called features and may, for example, be molecular descriptors (see Chapter 8) specifying the chemical structure of an object. For statistical data analysis, these objects and features are represented by a matrix X which has a row for each object and a column for each feature. In addition, each object will have one or more properties that are to be investigated, e.g., a biological activity of the structure or a class membership. This property or these properties are merged into a matrix Y. Thus, the data matrix X contains the independent variables whereas the matrix Y contains the dependent ones. Figure 9-3 shows a typical multivariate data matrix. [Pg.443]
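The layout of X and Y described above can be sketched in a few lines of Python. This is only an illustration: the compound names, descriptor names, and all numbers are hypothetical and do not come from the cited book.

```python
# Hypothetical toy data: three compounds, two descriptors, one property.
compounds = ["compound_A", "compound_B", "compound_C"]
feature_names = ["descriptor_1", "descriptor_2"]

# X: one row per object (compound), one column per feature (descriptor).
X = [
    [1.2, 0.5],
    [0.8, 1.9],
    [2.4, 0.1],
]

# Y: one row per object, one column per property (here a 0/1 class membership).
Y = [
    [1],
    [0],
    [1],
]


def shape(matrix):
    """Return (number of rows, number of columns) of a list-of-lists matrix."""
    return len(matrix), len(matrix[0])


n_objects, n_features = shape(X)
# X and Y must describe the same objects, row by row.
assert n_objects == len(compounds) == shape(Y)[0]
assert n_features == len(feature_names)
```

Keeping X and Y as separate matrices with matching row order is what allows a model to relate the independent variables in X to the dependent ones in Y.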

Multivariate data analysis usually starts with generating a set of spectra and the corresponding chemical structures as a result of a spectrum similarity search in a spectrum database. The peak data are transformed into a set of spectral features and the chemical structures are encoded into molecular descriptors [80]. A spectral feature is a property that can be automatically computed from a mass spectrum. Typical spectral features are the peak intensity at a particular mass/charge value, or logarithmic intensity ratios. The goal of transformation of peak data into spectral features is to obtain descriptors of spectral properties that are more suitable than the original peak list data. [Pg.534]
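As a rough sketch of the two kinds of spectral features named above, the following Python functions compute a peak intensity at a given mass/charge value and a logarithmic intensity ratio from a peak list. The function names, the base-10 logarithm, the `floor` value that guards against missing peaks, and the example spectrum are all assumptions made for illustration; the cited text does not specify them.

```python
import math


def intensity_at(peaks, mz):
    """Peak intensity at an integer m/z value; 0.0 if no peak is listed there."""
    return peaks.get(mz, 0.0)


def log_intensity_ratio(peaks, mz_a, mz_b, floor=1.0):
    """Base-10 log of the intensity ratio of two peaks.

    Intensities below `floor` are raised to `floor` so that a missing
    peak does not cause a division by zero or a log of zero (assumption).
    """
    a = max(intensity_at(peaks, mz_a), floor)
    b = max(intensity_at(peaks, mz_b), floor)
    return math.log10(a / b)


# Hypothetical peak list: m/z -> intensity in percent of the base peak.
peaks = {43: 100.0, 57: 10.0, 71: 25.0}

# One feature vector for this spectrum (one row of a feature matrix).
features = [
    intensity_at(peaks, 43),
    intensity_at(peaks, 44),
    log_intensity_ratio(peaks, 43, 57),
]
```

Computing the same list of features for every spectrum in the hit list gives one row per spectrum, which is the matrix form that multivariate methods expect.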

Despite the broad definition of chemometrics, the most important part of it is the application of multivariate data analysis to chemistry-relevant data. Chemistry deals with compounds, their properties, and their transformations into other compounds. Major tasks of chemists are the analysis of complex mixtures, the synthesis of compounds with desired properties, and the construction and operation of chemical technological plants. However, chemical/physical systems of practical interest are often very complicated and cannot be described sufficiently by theory. Actually, a typical chemometrics approach is not based on first principles (that is, scientific laws and rules of nature) but is data driven. Multivariate statistical data analysis is a powerful tool for analyzing and structuring data sets that have been obtained from such systems, and for making empirical mathematical models that are, for instance, capable of predicting the values of important properties that are not directly measurable (Figure 1.1). [Pg.15]

This example belongs to the area of quantitative structure-property relationships (QSPR), in which chemical-physical properties of chemical compounds are modeled by chemical structure data. The models are mostly built by multivariate calibration methods as described in this chapter and using molecular descriptors (Todeschini and Consonni... [Pg.186]

Varmuza, K. Analytical Sciences 17(suppl.), 2001, i467-i470. Recognition of relationships between mass spectral data and chemical structures by multivariate data analysis. [Pg.306]

On the other hand, factor analysis involves other manipulations of the eigenvectors and aims to gain insight into the structure of a multidimensional data set. The use of this technique was first proposed in biological structure-activity relationships (i.e., SAR) and illustrated with an analysis of the activities of 21 diphenylaminopropanol derivatives in 11 biological tests [116-119, 289]. This method has been more commonly used to determine the intrinsic dimensionality of certain experimentally determined chemical properties, that is, the number of fundamental factors required to account for the variance. One of the best FA techniques is the Q-mode, which groups a multivariate data set according to the data structure defined by the similarity between samples [1, 313-316]. It is devoted exclusively to the interpretation of the inter-object relationships in a data set, rather than to the inter-variable (or covariance) relationships explored with R-mode factor analysis. The measure of similarity used is the cosine theta matrix, i.e., the matrix whose elements are the cosines of the angles between all sample pairs [1, 313-316]. [Pg.269]
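The cosine theta matrix mentioned above can be computed directly from the rows of a data matrix. The following is a minimal sketch in plain Python; the function name and the small example matrix are hypothetical, and rows with zero length are assumed not to occur, since they would cause a division by zero.

```python
import math


def cosine_theta_matrix(X):
    """Return the n x n matrix of cosines of the angles between all row pairs of X."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            # cos(theta_ij) = (x_i . x_j) / (|x_i| |x_j|)
            C[i][j] = dot / (norms[i] * norms[j])
    return C


# Hypothetical samples (rows) described by two variables (columns).
X = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
]
C = cosine_theta_matrix(X)
```

The result is symmetric with ones on the diagonal. Here the first two samples are orthogonal (cosine 0), and the third lies at 45 degrees to each of them (cosine of about 0.707). Because the matrix compares samples rather than variables, it is the natural starting point for a Q-mode analysis.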

Using multivariable linear regression, a set of equations can be derived from the parameterized data. Statistical analysis yields the "best" equations to fit the empirical data. This mathematical model forms a basis to correlate the biological activity to the chemical structures. [Pg.152]
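A minimal sketch of such a fit is given below, assuming ordinary least squares solved through the normal equations; the cited text does not say which fitting procedure was used. The function names and the data (generated from activity = 1 + 2*d1 - d2) are hypothetical, and only the Python standard library is used.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear_regression(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via the normal equations."""
    Z = [[1.0] + list(row) for row in X]  # prepend a column of ones
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve_linear_system(ZtZ, Zty)


def predict(coefs, row):
    """Predicted activity for one compound described by its descriptor values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))


# Hypothetical data: two descriptors (d1, d2) and one activity per compound.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0, 4.0]

coefs = fit_linear_regression(X, y)
```

Because the toy data are noise-free, the fitted coefficients reproduce the generating equation up to rounding error. With real data, highly correlated descriptors make the normal equations ill-conditioned, which is one reason other multivariate calibration methods are often preferred.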

Molecular connectivity indices are desirable as potential explanatory variables because they can be calculated for a nominal cost (fractions of a second by computer) and they describe fundamental relationships about chemical structure. That is, they describe how non-hydrogen atoms of a molecule are "connected". Here we are most concerned with the statistical properties of molecular connectivity indices for a large set of chemicals in TSCA and the presentation of the results of multivariate analyses using these indices as explanatory variables to understand several properties important to environmental chemists. We will focus on two properties for which we have a relatively large data base: (1) biodegradation as measured by the percentage of theoretical 5-day biochemical oxygen demand (BOD) (11), and (2) the n-octanol/water partition coefficient, hereafter termed log P (12). [Pg.149]

In comparison with NMR, mass spectrometry is more sensitive and, thus, can be used for compounds of lower concentration. While it is easily possible to measure picomoles of compounds, detection limits at the attomole level can be reached. Mass spectrometry also has the ability to identify compounds through elucidation of their chemical structure by MS/MS and determination of their exact masses. This is true at least for compounds below 500 Da, the limit at which very high-resolution mass spectrometry can unambiguously determine the elemental composition. In 2005, this could only be done by FTICR. Orbitrap appears to be a good alternative, with a more limited mass range but a better signal-to-noise ratio. Furthermore, mass spectrometry allows relative concentration determinations to be made between samples with a dynamic range of about 10000. Absolute quantification is also possible but requires reference compounds. It should be mentioned that, while mass spectrometry is an important technique for metabolome analysis, another key tool is specific software to manipulate, summarize and analyse the complex multivariate data obtained. [Pg.388]

Multiple intercorrelations between descriptors of chemical structures are illustrated best using multivariate statistics (section 3.2.2). A principal component analysis of the data set of 18 descriptors (Table 1.6, Figure 1.11) revealed that > 80% of the information content of these descriptors is expressed by four factors that explain 54.7%, 15.8%, 8.1% and 5.6% of the total variance, respectively. [Pg.44]
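The "> 80%" figure follows from adding up the four variance shares quoted above. The short Python sketch below does this bookkeeping; only the four percentages come from the text, and the variable names are illustrative.

```python
# Percent of total variance explained by the first four factors (from the text).
explained = [54.7, 15.8, 8.1, 5.6]

# Running total: variance explained by the first k factors together.
cumulative = []
total = 0.0
for share in explained:
    total += share
    cumulative.append(round(total, 1))

# Variance left to the remaining 14 of the 18 dimensions.
unexplained = round(100.0 - total, 1)
```

The four factors together account for 84.2% of the total variance, leaving 15.8% to the remaining dimensions.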

Most of the applications with infrared spectroscopic data involve classification, calibration, and prediction. This is because the infrared profiles describe chemical structure that is directly related to the physical and/or chemical properties of the systems analyzed. Changes in the infrared profile of a system imply variation in the chemical structure and hence variations in the properties of the system. These data profiles contain information, and extraction of this information requires multivariate data analysis. [Pg.146]

Vectors. A series of scalars can be arranged in a column or in a row. Then, they are called a column or a row vector. If the elements of a column vector can be attributed to special characteristics, e.g., to compounds, then data analysis can be completed. The chemical structures of compounds can be characterized with different numbers called descriptors, variables, predictors, or factors. For example, toxicity data were measured for a series of aromatic phenols. Their toxicity values can be arranged in a column in arbitrary order. Each row corresponds to a phenolic compound. Many descriptors can be calculated for each compound (e.g., molecular mass, van der Waals volume, polarity parameters, quantum chemical descriptors, etc.). After building a multivariate model (generally one variable cannot encode the toxicity properly), we will be able to predict toxicity values for phenolic compounds for which no toxicity has been measured yet. The above approach is generally called searching for quantitative structure-activity relationships, or simply the QSAR approach. [Pg.144]

Hawkins, D.M., The problem of overfitting. J. Chem. Inf. Comput. Sci., 44, 1-12 (2004). Varmuza, K., Penchev, P., Stancl, F. and Werther, W., Systematic structure elucidation of organic compounds by mass spectra classification. J. Mol. Struct., 408/409, 91-96 (1997). Varmuza, K., Recognition of relationships between mass spectral data and chemical structures by multivariate data analysis. Anal. Sci., 17(Suppl.), i467-i470 (2001). [Pg.167]

Multivariate data contain objects and features, and sometimes also properties. An object is any real or abstract item such as a sample, a spectrum, a chemical structure, or a process. An object is characterized by a set of features. A feature is a numerical variable (measured or computed data) such as a concentration, a spectral peak height, or a molecular descriptor. The application of statistical methods requires a reasonable number of objects and features; typical for chemical problems are 20-1000 objects and 3-500 features. Such data are best described by an n × p matrix X, containing a row for each of the n objects, and a column for each of the p features or properties (Figure 2). [Pg.348]

Multivariate data in chemistry often contain a rather large number of features. The reasons for this situation are as follows: (a) a priori it is not known which features are relevant and which are irrelevant (remember that statistical methods are usually applied to systems that are not sufficiently understood); (b) automated analytical instruments easily allow the production of large data sets; (c) typical chemical data, for instance molecular spectra or the description of chemical structures, are complex and therefore actually require many features. [Pg.350]

The relationships between spectral data (NMR, IR, UV-VIS, MS) and molecular chemical structures cannot be described sufficiently by existing theoretical concepts. Since the start of chemometrics it has always been a challenge to model these hidden relationships at least partially by a multivariate statistical approach. 13C NMR data exhibit rather strict relationships between chemical shifts and atom-centered molecular fragments. In other fields of spectroscopy, however, the widely used correlation tables are less successful for spectra interpretation and spectra prediction. The correspondence between spectroscopic data (key fragments in MS or band frequencies in IR) and... [Pg.359]

Chemical structures can be described by binary molecular descriptors (used as the Y-matrix in multivariate data analysis). In the case of yes/no-classifications a single binary y-variable can be used to indicate whether a particular structural property is present or not. The type of molecular descriptors (small or large fragments, atom-centered fragments, functional groups or classes of compounds) is essential to obtain a close relationship between structures and spectra. [Pg.360]

Exploratory analysis of spectral data by PCA, PLS, cluster analysis, or Kohonen mapping tries to gain insight into the spectral data structure and into hidden factors, as well as to find clusters of similar spectra that can be interpreted in terms of similar chemical structures. Classification methods, such as LDA, PLS, SIMCA, KNN classification, and neural networks, have been used to generate spectral classifiers for an automatic recognition of structural properties from spectral data. The multivariate methods mostly used for spectra prediction (mainly NMR, rarely IR) are neural networks. Table 6 contains a summary of recent works in this field (see Infrared Data Correlations with Chemical Structure). [Pg.360]

Chernoff faces are simple glyphs that associate variables with facial features, such as the size and shape of the mouth, eyes, nose, and facial outline. The motivation was the recognition that humans are extremely capable of discriminating faces, and that traditional visualization methods seemed to be less valuable in producing an emotional response. While this may be true for most multivariate data, it is certainly not the case with chemical structure diagrams. [Pg.755]

Artificial Intelligence in Chemistry; Chemometrics: Multivariate View on Chemical Problems; Fuzzy Methods in Chemistry; Infrared Spectra Interpretation by the Characteristic Frequency Approach; Machine Learning Techniques in Chemistry; Neural Networks in Chemistry; Quality Control, Data Analysis; Structural Similarity Measures for Database Searching; Structure and Substructure Searching; Structure Determination by Computer-based Spectrum Interpretation; Structure Generators; Structure Representation. [Pg.1306]

Chemometrics: Multivariate View on Chemical Problems; Combined Quantum Mechanics and Molecular Mechanics Approaches to Chemical and Biochemical Reactivity; Environmental Chemistry: QSAR; Infrared Data Correlations with Chemical Structure; Quantitative Structure-Activity Relationships in Drug Design; Quantitative Structure-Property Relationships (QSPR). [Pg.1495]
