
Variable selection

Variable selection is particularly important in LC-MS and GC-MS. Raw data form what is sometimes called a sparse data matrix, in which the majority of data points are zero or represent noise. In fact, only a small percentage (perhaps 5% or less) of the measurements are of any interest. The trouble with this is that if multivariate methods are applied to the raw data, often the results are nonsense, dominated by noise. Consider the case of performing LC-MS on two closely eluting isomers, whose fragment ions are of principal interest. The most intense peak might be the molecular ion. [Pg.360]

Figure: scores and loadings of PC2 versus PC1 after the data in Table 6.3 have been standardised. [Pg.361]

Some simple methods, often used as an initial filter of irrelevant variables, are as follows; note that it is often important first to have performed baseline correction (Section 6.2.1). [Pg.362]

Remove variables outside a given region: in mass spectrometry these may be at low or high m/z values; in UV/vis spectroscopy there may be a significant wavelength range where there is no absorbance. [Pg.362]

Sometimes it is possible to measure the noise content of the chromatograms for each variable simply by looking at the standard deviation of the noise region. The higher the noise, the less significant is the mass. This technique is useful in combination with other methods, often as a first step. [Pg.362]
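As an illustration of this filter, the sketch below estimates each variable's noise level from a user-specified signal-free region and discards channels whose overall spread does not rise well above it. The function name, the 3x signal-to-noise cut-off, and the data layout (scans in rows, m/z channels in columns) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def noise_filter(X, noise_rows, snr=3.0):
    """Keep only variables whose overall spread exceeds their noise level.

    X          : (n_scans, n_variables) raw LC-MS or GC-MS data matrix
    noise_rows : indices of scans known to contain only baseline noise
    snr        : hypothetical cut-off; a variable is kept when its total
                 standard deviation exceeds snr times its noise standard
                 deviation
    """
    noise_sd = X[noise_rows].std(axis=0)   # per-variable noise level
    total_sd = X.std(axis=0)               # per-variable overall spread
    return total_sd > snr * np.maximum(noise_sd, 1e-12)  # boolean keep-mask
```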

Variable selection is a technique whereby those variables which are considered to be most important in the creation of a model are used, whereas others are discarded. There are many arguments for selecting variables and a number of ways of selecting them, some of which are described in this chapter. To date, comparatively few researchers have used variable selection methods, most concentrating on improving prediction by using all the variables. [Pg.358]

The principle of parsimony (de Noord, 1994; Flury and Riedwyl, 1988; Seasholtz and Kowalski, 1993) states that if a simple model (that is, one with relatively few parameters or variables) fits the data then it should be preferred to a model that involves redundant parameters. A parsimonious model is likely to be better at prediction of new data and to be more robust against the effects of noise (de Noord, 1994). Despite this, the use of variable selection is still rare in chromatography and spectroscopy (Brereton and Elbergali, 1994). Note that the terms variable selection and variable reduction are used by different researchers to mean essentially the same thing. [Pg.359]

Some variable selection techniques for classification. It can easily be shown that certain variables in a data set not only contribute little to the model but actually detract from the optimum model. Let us take a simple, imaginary case to demonstrate this. If we have a set of data describing two different varieties of olive oil, Leccino and Frantoio, and only three variables, we can look at the data and see which of the variables is most valuable for discriminating between the two varieties (Table 10.3). [Pg.359]

If we take the standard deviation (StDev) of the Leccino and Frantoio oils for variables 1, 2, and 3 and then calculate the average of these, we have a value which represents the 'inner variance', or reproducibility. The higher ... [Pg.359]

If w (the ratio of the inner variance to the outer variance) has a value greater than 1, then the inner variance is greater than the outer variance; this variable is therefore a hindrance to correct discrimination and should definitely be discarded. [Pg.360]
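A minimal sketch of this criterion for one variable is given below. The text defines the inner variance as the average of the within-class standard deviations; the outer variance is not defined in this excerpt, so taking it as the standard deviation of the pooled samples is an assumption made here for illustration.

```python
import numpy as np

def discrimination_ratio(x_class1, x_class2):
    """Inner/outer variance ratio w for a single variable and two classes.

    inner: average of the two within-class standard deviations (per the text).
    outer: standard deviation of all samples pooled together (an assumption;
           the excerpt does not state how the outer variance is computed).
    A variable with w > 1 hinders discrimination and can be discarded.
    """
    inner = 0.5 * (np.std(x_class1, ddof=1) + np.std(x_class2, ddof=1))
    outer = np.std(np.concatenate([x_class1, x_class2]), ddof=1)
    return inner / outer
```

For the olive oil example, one would compute w for each of the three variables and discard any with w > 1.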

The subject of X-variable selection was discussed earlier in this chapter, in the context of data compression methods (Section 8.2.6.1). It was mentioned that selection can be done [Pg.313]

As for sample selection, I will present two different methods for variable selection: one that is relatively simple and one that is more computationally intensive. The simpler method [68] involves a series of linear regressions of each X-variable against the property of interest. The relevance of an X-variable is then expressed by the ratio of the linear regression slope (b) to the variance of the elements in the linear regression model residual. [Pg.314]
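A sketch of this univariate screen is shown below: each column of X is regressed against y, and the variable is scored by the slope magnitude divided by the residual variance. The exact scaling used in the cited method [68] is not reproduced in this excerpt, so treat this as an illustrative variant.

```python
import numpy as np

def univariate_relevance(X, y):
    """Score each X-variable by |slope| / var(residuals) of a simple
    linear regression of y on that variable alone."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        b, a = np.polyfit(X[:, j], y, deg=1)   # slope b, intercept a
        resid = y - (a + b * X[:, j])
        scores[j] = abs(b) / np.var(resid, ddof=2)
    return scores  # higher score = more relevant variable
```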

The algorithm can be terminated using several different conditions: for example, when a specified number of iterations has occurred, or when agreement between the different variable subsets reaches a certain level. Details on the GA method can be found in several references [69, 70]. [Pg.315]

The advantage of the GA variable selection approach over the univariate approach discussed earlier is that it is a true search for an optimal multivariate regression solution. One disadvantage of the GA method is that one must enter several parameters before it can be run. [Pg.315]

As with the univariate method, there are several ways in which GA results can be used to select variables: one could select only those variables that appear in all of the best models obtained at the end of the algorithm, or remove all variables that were not selected in any of the best models. It is also important to note that the GA can give different results from the same input data and algorithm parameters; as a result, it is often useful to run the GA several times on the same data in order to obtain a consensus on the selection of useful variables. [Pg.316]
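To make the GA procedure concrete, here is a compact, generic sketch of GA-based variable selection; it is not the specific implementation of references [69, 70]. Chromosomes are boolean masks over the columns of X, and `fitness` is any user-supplied error to be minimized (for example, a cross-validated RMSE of a regression built on the selected columns).

```python
import numpy as np

def ga_select(X, y, fitness, n_pop=30, n_keep=10, n_gen=50, p_mut=0.01, seed=None):
    """Toy genetic algorithm for variable selection.

    Each chromosome is a boolean mask over the columns of X; `fitness`
    returns a value to MINIMIZE for the selected columns. Illustrative
    sketch only, not the cited implementation.
    """
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]
    pop = rng.random((n_pop, n_var)) < 0.2            # start with sparse masks
    for _ in range(n_gen):
        scores = np.array([fitness(X[:, m], y) if m.any() else np.inf
                           for m in pop])
        parents = pop[np.argsort(scores)[:n_keep]]    # keep the best subsets
        children = []
        while len(children) < n_pop - n_keep:
            pa, pb = parents[rng.integers(n_keep, size=2)]
            cut = rng.integers(1, n_var)              # single-point crossover
            child = np.concatenate([pa[:cut], pb[cut:]])
            child ^= rng.random(n_var) < p_mut        # random bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    scores = np.array([fitness(X[:, m], y) if m.any() else np.inf for m in pop])
    return pop[np.argsort(scores)[:n_keep]]           # the final best masks
```

Running `ga_select` several times with different seeds and counting how often each variable appears in the surviving masks gives the consensus selection described above.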

This subject has already been discussed in the context of the MLR regression method, which for most PAT applications (where the number of x variables greatly exceeds the number of calibration standards) requires variable selection. [Pg.421]

I suggest that the answer to the above question is 'yes', for the following reasons. [Pg.423]

With these arguments as a backdrop, I will review some empirical variable selection methods in addition to the prior-knowledge-based, stepwise, and all-possible-combinations methods discussed earlier in the MLR section (Section 12.3.2). [Pg.423]

The above paragraph describes the forward option of the interval methods, where one starts with no variables selected and sequentially adds intervals of variables until the stop criterion is reached. Alternatively, one could operate the interval methods in reverse mode, where one starts with all available x variables and sequentially removes intervals of variables until the stop criterion is reached. Being stepwise selection methods, the interval methods have the potential to select local rather than global optima, and they require careful selection of the interval size (number of variables per interval), based on prior knowledge of the spectroscopy, to balance computation time against performance improvement. However, these methods are rather straightforward, relatively simple to implement, and efficient. [Pg.423]
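A sketch of the forward mode is given below. The stop criterion used (terminate when no remaining interval improves the error) and the default interval size are illustrative choices, and `cv_error` is a hypothetical user-supplied function returning a cross-validated error for a model built on the candidate columns.

```python
import numpy as np

def forward_interval_selection(X, y, cv_error, interval_size=20):
    """Greedy forward interval selection over blocks of adjacent variables.

    Intervals are added one at a time; the loop stops when no remaining
    interval improves the cross-validated error (one possible stop
    criterion). Reverse mode would start from all intervals and remove
    them one at a time instead.
    """
    n_var = X.shape[1]
    intervals = [np.arange(i, min(i + interval_size, n_var))
                 for i in range(0, n_var, interval_size)]
    selected, best_err = [], np.inf
    while intervals:
        errs = [cv_error(X[:, np.concatenate(selected + [iv])], y)
                for iv in intervals]
        j = int(np.argmin(errs))
        if errs[j] >= best_err:
            break          # stop: adding any remaining interval makes things worse
        best_err = errs[j]
        selected.append(intervals.pop(j))
    return np.concatenate(selected) if selected else np.array([], dtype=int)
```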

A general alternative to stepwise searching methods for variable selection would be methods that attempt to explore as much of the solution space as possible. An exhaustive search of all possible combinations of variables is feasible only for problems that involve relatively few x variables. However, it [Pg.423]

There was a time when one could use only a few molecular descriptors, which were simple topological indices. The 1990s brought myriads of new descriptors [11]. Now it is difficult even to estimate how many molecular descriptors are at one's disposal. The crucial problem is therefore the choice of the optimal subset among those available. [Pg.217]

Much labor has been dedicated to establishing a common technique for solving this problem of choice. Some of the solutions suggested are useful; others are less efficient. Below we examine the most prominent ones. [Pg.217]

The idea behind this approach is simple. First, we compose the characteristic vector from all the descriptors we can compute. Then we define the maximum length of the optimal subset, i.e., the input vector we shall actually use during modeling. As mentioned in Section 9.7, there is always some threshold beyond which an increase in the dimensionality of the input vector decreases the predictive power of the model. Note that the correlation coefficient, by contrast, will always improve with an increase in the input vector dimensionality. [Pg.218]

Let us see how the approach works in practice. One of the first studies dedicated to the application of GAs to this task was that by Rogers and Hopfinger [12]. However, the pioneering efforts are due to the Nijmegen chemometrics research group led by Buydens [13, 14]. [Pg.218]

SE is the standard error, c is the number of selected variables, p is the total number of variables (which can differ from c), and d is a smoothing parameter to be set by the user. As mentioned above, there is a certain threshold beyond which an increase in the number of variables results in a decrease in the quality of modeling. In effect, the smoothing parameter reflects the user's guess of how much detail in the training set is to be modeled. [Pg.218]
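The fitness expression itself is not reproduced in this excerpt. One plausible form consistent with the description, a standard error inflated by a penalty that grows with the fraction of variables selected and is weighted by the smoothing parameter, would be the following; this is an illustrative reconstruction, not the authors' published formula:

```latex
\mathrm{fitness} = SE \left( 1 + d \, \frac{c}{p} \right)
```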


Cruciani G, Clementi S and Baroni M 1993. Variable Selection in PLS Analysis. In Kubinyi H (Editor), 3D QSAR in Drug Design. Leiden: ESCOM, pp. 551-564. [Pg.737]

Kubinyi H. Variable selection in QSAR studies. I. An evolutionary algorithm. Quant Struct-Act Relat 13:285-294, 1994. [Pg.367]

Kimura T, Hasegawa K, Funatsu K. GA strategy for variable selection in QSAR studies: GA-based region selection for CoMFA modeling. J Chem Inf Comput Sci 38:276-282, 1998. [Pg.367]

For nonlinear systems, however, the evaluation of the flow rates is not straightforward. Morbidelli and co-workers developed a complete design of the binary separation by SMB chromatography in the framework of Equilibrium Theory for various adsorption equilibrium isotherms: the constant-selectivity stoichiometric model [21, 22], the constant-selectivity Langmuir adsorption isotherm [23], the variable-selectivity modified Langmuir isotherm [24], and the bi-Langmuir isotherm [25]. The region for complete separation was defined in terms of the flow rate ratios in the four sections of the equivalent TMB unit ... [Pg.233]

Deep catalytic cracking (DCC) is a catalytic cracking process which selectively cracks a wide variety of feedstocks into light olefins. The reactor and regenerator systems are similar to those of FCC. However, innovations in catalyst development, severity, and process variable selection enable DCC to produce more olefins than FCC. In this mode of operation, combined propylene and ethylene yields can exceed 25%. In addition, a high yield of amylenes (C5 olefins) is possible. Figure 3-7 shows the DCC process and Table 3-10 compares the olefins produced by the DCC and FCC processes. [Pg.77]

The type and extent of automatic processing to be carried out immediately after a chromatogram has been measured is controlled by a "processing" variable selected from a table displayed by the SETUP program. In our implementation, the choices are... [Pg.24]

Variable selection methods have also been adopted for region selection in 3D QSAR. For example, GOLPE [31] was developed on chemometric principles, and q2-GRS [32] was developed based on independent CoMFA analyses of small areas of near-molecular space, to address the issue of optimal region selection in CoMFA analysis. Both of these methods have been shown to improve QSAR models compared with the original CoMFA technique. [Pg.313]

When applied to QSAR studies, the activity of molecule u is calculated simply as the average activity of the K nearest neighbors of molecule u. An optimal K value is selected by optimization against the classification of a test set of samples, or by leave-one-out cross-validation. Many variations of the kNN method have been proposed in the past, and new and fast algorithms have continued to appear in recent years. The automated variable selection kNN QSAR technique optimizes the selection of descriptors to obtain the best models [20]. [Pg.315]
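A minimal sketch of the basic kNN QSAR prediction and the leave-one-out selection of K follows; Euclidean distance in descriptor space and the candidate range for K are assumptions made here for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict the activity of a query molecule as the mean activity of its
    k nearest neighbours in descriptor space (Euclidean distance assumed)."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argsort(d)[:k]].mean()

def select_k_loo(X, y, k_values=range(1, 11)):
    """Choose K by leave-one-out cross-validation, as described in the text."""
    def loo_mse(k):
        preds = [knn_predict(np.delete(X, i, axis=0), np.delete(y, i), X[i], k)
                 for i in range(len(y))]
        return float(np.mean((np.asarray(preds) - y) ** 2))
    return min(k_values, key=loo_mse)
```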

Zheng W, Tropsha A. Novel variable selection quantitative structure-property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci 2000;40(1):185-94. [Pg.317]

Narayanan R, Gunturi SB. In silico ADME modelling: prediction models for blood-brain barrier permeation using a systematic variable selection method. Bioorg Med Chem 2005;13:3017-28. [Pg.510]

To benchmark our learning methodology against alternative conventional approaches, we used the same 500 (x, y) data records and followed the usual regression analysis steps (including stepwise variable selection, examination of residuals, and variable transformations) to find an approximate empirical model, f(x), with a coefficient of determination R² = 0.79. This model is given by... [Pg.127]

D.M. Allen, The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16 (1974) 125-127. [Pg.380]

Ordinal tab: establishes a counting variable for the selected variable. [Pg.73]

Partition-based modeling methods are also called subset selection methods because they select a smaller subset of the most relevant inputs. The resulting model is often physically interpretable because the model is developed by explicitly selecting the input variable that is most relevant to approximating the output. This approach works best when the variables are independent (De Veaux et al., 1993). The variables selected by these methods can be used as the analyzed inputs for the interpretation step. [Pg.41]

Wold, S., Kettaneh, N., and Tjessem, K., Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection, J. Chemometrics 10, 463 (1996). [Pg.104]

The variables selected as design variables (fixed by the designer) cannot therefore be assigned as output variables from an f node. They are inputs to the system and their edges must be oriented into the system of equations. [Pg.22]

If, for instance, variables r3 and u4 are selected as design variables, then Figure 1.11 shows one possible order of solution of the set of equations. Different types of arrows are used to distinguish between input and output variables, and the variables selected as design variables are enclosed in a double circle. [Pg.22]



Related topics



Best subset variable selection

Controlled variables selection

Correcting variables selection

Degree of Freedom Analysis and Variable Selection

Degree of Freedom Selection State Variables, Order Parameters and Configurational Coordinates

Design variables selection

Feature selection with latent variable methods

Forward Selection Predictor Variables Added into the Model

GOLPE variable selection

Genetic variable selection

Independent variables selection

Initial Input-output Variable Selection

Input Variable Selection

Material selection variabilities

Measured variables selection

Near infrared spectroscopy selection, variables

Optically variable devices, selective

Process selection variabilities

Published Variable Selection Methods

Quantitative structure-activity variable selection

Regression variable selection

Relationship variable selection

Selecting the Number of Independent Variables (Factors)

Selection of Controlled (Output) Variables

Selection of Controlled Variables

Selection of Independent Model Variables

Selection of Manipulated Variables

Selection of Measured Variables

Selection of Variables for Regression

Selection of design variables

Selection of the dependent variable

Selection of the x Predictor Variables

Selection of variables

Some Methods of Variable Selection

Step 3 View Histograms of Selected Variables

Stepwise variable selection

Stochastic search variable selection

Supervised and Unsupervised Variable Selection

Supervised variable selection

Unsupervised variable selection

Variable Selection and Modeling

Variable Selection and Modeling method

Variable selection and modeling method based

Variable selection and modeling method based on the prediction

Variable selection methods

Variable selection problem

Variable subset selection


Which Variables Should Be Selected as Adaptive Parameters
