
Independent variable subsets

RR is similar to PCR in that the independent variables are transformed to their principal components (PCs). However, while PCR utilizes only a subset of the PCs, RR retains them all but downweighs them based on their eigenvalues. With PLS, a subset of the PCs is also used, but the PCs are selected by considering both the independent and dependent variables. Statistical theory suggests that RR is the best of the three methods, and this has been generally borne out in multiple comparative studies [30,36-38]. Thus, some of our published studies report RR results only. [Pg.486]
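
The relationship among the three methods can be made concrete with a small sketch (not from the source), assuming scikit-learn; the synthetic data, component counts, and regularization strength are illustrative choices only.

```python
# Hedged sketch comparing RR, PCR, and PLS on synthetic collinear data.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))                 # 60 observations, 20 descriptors
X[:, 10:] = X[:, :10] + 0.1 * rng.normal(size=(60, 10))   # make them correlated
y = X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.5, size=60)

models = {
    "RR  (all PCs, shrunk by eigenvalue)": Ridge(alpha=1.0),
    "PCR (subset of PCs, X only)":         make_pipeline(PCA(n_components=5), LinearRegression()),
    "PLS (subset of PCs, X and y)":        PLSRegression(n_components=5),
}
for name, model in models.items():
    q2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated r2 = {q2:.3f}")
```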

The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. But it is noteworthy that this has exacerbated another problem: rank deficiency. This occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, which are popularly used when there is a rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models would fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step rather than outside it, thus providing an honest measure of the predictive ability of the model, i.e., the true q2 [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q2, which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]
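
One way to see the difference between a naive q2 and a true q2 is to place the descriptor-thinning step inside the cross-validation loop. The sketch below is not taken from the source; the selection rule (top-k absolute correlation with y) and the data are placeholders. The point is that the held-out compound never influences which descriptors are chosen.

```python
# Hedged sketch: subset selection nested inside LOO cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_q2(X, y, k=5):
    preds = np.empty_like(y, dtype=float)
    for train, test in LeaveOneOut().split(X):
        # Rank descriptors using the training fold ONLY.
        r = np.abs([np.corrcoef(X[train, j], y[train])[0, 1] for j in range(X.shape[1])])
        keep = np.argsort(r)[-k:]
        model = LinearRegression().fit(X[train][:, keep], y[train])
        preds[test] = model.predict(X[test][:, keep])
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot          # true q2: selection repeated per fold

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 100))            # rank-deficient: 100 descriptors, 40 compounds
y = X[:, 0] + rng.normal(scale=0.5, size=40)
print("true q2 =", round(loo_q2(X, y), 3))
```

Selecting descriptors once on the full data set and only then running LOO would reuse every compound in the selection step and report the inflated, naive q2 described above.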

The large number of TIs, and the fact that many of them are highly correlated, confounds the development of predictive models. Therefore, we attempted to reduce the number of TIs to a smaller set of relatively independent variables. Variable clustering was used to divide the TIs into disjoint subsets (clusters) that are essentially unidimensional. Each cluster defines a new variable, the first principal component derived from the members of the cluster. From each cluster of indexes, a single index was selected, namely the one most correlated with the cluster variable. In some cases, a member of a cluster showed poor group membership relative to the other members, i.e., its correlation with the cluster variable was much lower than that of the other members. Any variable showing poor cluster membership was also selected for further studies. A correlation of less than 0.7 between a TI and the cluster variable was used as the definition of poor cluster membership. [Pg.107]
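
A rough sketch of this descriptor-thinning scheme follows. It is not the authors' code: the clustering method (hierarchical clustering on 1 − |correlation|), the number of clusters, and the function name are assumptions; only the "first principal component as cluster variable" and the 0.7 rule come from the text.

```python
# Hedged sketch of variable clustering for descriptor reduction.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA

def thin_descriptors(X, n_clusters=10, poor=0.7):
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)                              # correlated indexes are "close"
    condensed = dist[np.triu_indices_from(dist, k=1)]
    labels = fcluster(linkage(condensed, method="average"),
                      n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        pc1 = PCA(n_components=1).fit_transform(X[:, members]).ravel()   # cluster variable
        r = np.abs([np.corrcoef(X[:, j], pc1)[0, 1] for j in members])
        keep.append(members[np.argmax(r)])                 # index most correlated with PC1
        keep.extend(members[r < poor])                     # poor members kept for further study
    return sorted(set(keep))
```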

Using all subsets regression with the selected TIs and HBi as independent variables resulted in a nine-parameter model ... [Pg.109]
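
All-subsets (best-subset) regression simply fits a model for every candidate combination of predictors and retains the best one by some criterion. The following is a generic sketch, not the source's analysis: the data, the subset-size limit, and the use of adjusted R² as the criterion are assumptions.

```python
# Hedged sketch of all-subsets regression with an adjusted-R2 criterion.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y, max_size=3):
    n = len(y)
    best = (None, -np.inf)
    for k in range(1, max_size + 1):
        for cols in combinations(range(X.shape[1]), k):
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # penalize model size
            if adj > best[1]:
                best = (cols, adj)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
y = 2 * X[:, 1] - X[:, 4] + rng.normal(scale=0.3, size=50)
print(best_subset(X, y))      # typically recovers columns (1, 4)
```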

The main goal of feature selection can be formulated as selection of a subset of the candidate variables to obtain a final model that provides accurate and reliable prediction of future values of the dependent variable Y (e.g. a concentration) for a given set of independent variables X (e.g. optical absorbance at a set of wavelengths). [Pg.324]

Many of the compounds listed in Table III were the same as those evaluated by Koga in his earlier QSAR study (21). This provided a means of comparing the two LPER models using somewhat differently defined parameters. The 36 compounds in common that were tested against E. coli were A3, A18, A32, A34, A36-A42, A44-A45, A48, A51-A54, A56-A62, A67-A76, A79. No statistically valid model could be obtained on this subset using the independent variables in this study. When the variables from the earlier QSAR study (Es(6), Es(6), 2x(6,7,8), 2x(6,7,8), and I(7NCO)) were used to develop the model, an r = 0.812 was obtained, corresponding to approximately 66 percent of the variance explained. Thus the inability to... [Pg.330]

In Sects. 3.8 and 4.7, certain properties of concave functions [13-15] are needed. At the same time, properties of convex functions are obtained, because convex functions differ from concave ones only in sign, so their properties mostly follow by reversing all the inequalities below. For simplicity, we do not restrict the domain of the independent variables (although a possible restriction to some concave subset, i.e., one having property (A.71) below, is nearly obvious), and we confine ourselves to strictly concave functions, which suffice for our purposes. [Pg.293]
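
For reference, a minimal statement of the defining inequality (the property numbering here is generic and not tied to the source's (A.71)):

```latex
% Strict concavity of f on a convex domain D: for all x \neq y in D and
% all 0 < \lambda < 1,
\[
  f\bigl(\lambda x + (1-\lambda)y\bigr) \;>\; \lambda f(x) + (1-\lambda) f(y),
  \qquad x \neq y,\quad 0 < \lambda < 1 .
\]
% f is (strictly) convex exactly when -f is (strictly) concave, so every
% concave-function inequality is obtained by reversing the corresponding
% convex-function inequality.
```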

Depending on the distribution of values it may make sense to perform nonlinear transformations on the data, such as the n-th root or the logarithm. Nonlinearly transformed independent variables may be used as additional predictors. Further, new variables may be obtained by applying arithmetic operators to pairs or larger subsets of the predictors Xj, j ∈ n. This is called a base extension. Quadratic base extensions are often used, where the squares Xj², j ∈ n, and the products Xk·Xl, k ≠ l, k, l ∈ n, are used as predictors along with the Xj. ... [Pg.228]
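
A quadratic base extension can be generated mechanically. The sketch below is an illustration rather than the source's procedure; it uses scikit-learn's PolynomialFeatures to append squares and pairwise products to the original predictors, with arbitrary example data.

```python
# Hedged sketch: quadratic base extension of a predictor matrix.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(3).normal(size=(5, 3))         # 3 original predictors X1..X3
expand = PolynomialFeatures(degree=2, include_bias=False)
X_ext = expand.fit_transform(X)                           # originals, squares, cross products
print(expand.get_feature_names_out(["X1", "X2", "X3"]))
# ['X1' 'X2' 'X3' 'X1^2' 'X1 X2' 'X1 X3' 'X2^2' 'X2 X3' 'X3^2']
print(X_ext.shape)                                        # (5, 9)
```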

The independent variables x include a subset of the process and key reaction characterisation variables, such as conversions and temperature approaches. Relative catalyst activities and heat transfer coefficients can also be used as independent variables in data reconciliation. The choice is hence between ... [Pg.156]

As an example, 4 components (independent variables) will be examined at 3 levels each in an experiment in media optimization for lactic acid production. The components are an ammonia nitrogen source, a phosphate source, sugar, and yeast. The goal is to obtain an optimum level combination of the 4 factor constituents in a mixture for lactic acid production. An examination of 4 factors at 3 levels each would dictate 3⁴, or 81, possible combinations. A traditional, full factorial ANOVA would require at least 3 replications from different batches of the mixture constituents at each level. Furthermore, there are equipment, lab personnel, and time limitations in performing the experiment. They are only able to run a maximum of 20 sample combinations per day in their laboratory. The researchers would like to use this experiment to determine a subset of the components that are influential in lactic acid production, and to ascertain the most effective concentrations of these components. They are not very interested in the 3- or 4-way interactions of the components and would only like to examine some of the possible 2-way interactions. The researchers decide to perform a... [Pg.155]
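
To make the combinatorics concrete, the small sketch below (illustrative only; the factor names and level labels are placeholders, and the crude fraction shown is not the design the researchers ultimately chose) enumerates the full 3⁴ factorial and contrasts it with the 20-run-per-day constraint.

```python
# Hedged sketch: size of a full 3^4 factorial versus a small fraction of it.
from itertools import product

factors = {
    "ammonia_N": ["low", "mid", "high"],
    "phosphate": ["low", "mid", "high"],
    "sugar":     ["low", "mid", "high"],
    "yeast":     ["low", "mid", "high"],
}
full = list(product(*factors.values()))
print(len(full))                       # 81 combinations = 3**4

# Taking every 4th run is NOT a statistically designed fraction; it only
# illustrates that some subset of <= 20 runs must be chosen, e.g. via a
# fractional factorial or orthogonal-array design.
fraction = full[::4][:20]
print(len(fraction))                   # 20 runs, within the daily limit
```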

The LC50 data were subjected to regression analysis using SA and ASA as independent variables. For all three subsets, 40 intuitively meaningful combinations of ASA contributions from various atoms were tried as independent variables. [Pg.128]

Given the dependencies of the analyzed subsets on the various independent variables, as shown in equations 5 to 9, it is apparent that the following groupings can be made. [Pg.166]

Function separability is a more general notion than sparsity, since all sparse systems are separable but the reverse is not true. It is also another area where algorithmic growth can be expected. Separable functions are composites of subfunctions, each of which depends only on a small subset of the independent variables. Therefore, efficient schemes can be devised in this case to compute the search vector, function curvature, etc., much more cheaply by exploiting the invariant subspaces of the objective function. [Pg.1155]
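
A minimal formal statement (notation mine, not the source's): a partially separable objective is a sum of element functions, each depending only on a small index subset of the variables.

```latex
% Partially separable objective: each element function f_i depends only on the
% small index subset S_i of the n variables.  The gradient and Hessian then
% decompose into contributions living on the low-dimensional subspaces indexed
% by the S_i, which is what cheaper schemes for search vectors and curvature exploit.
\[
  f(x) \;=\; \sum_{i=1}^{m} f_i\!\left(x_{S_i}\right),
  \qquad S_i \subset \{1,\dots,n\},\quad |S_i| \ll n .
\]
```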

Many reliability methods, including Subset Simulation, assume that the input variables x are independent. This assumption, however, is not a limitation, since in simulation one always starts from independent variables to generate the dependent input variables. Furthermore, for convenience, it is often assumed that x are i.i.d. Gaussian. If this is not the case, a preprocessing step that transforms x to i.i.d. Gaussian variables z must be undertaken. The transformation from x to z can be performed in several ways, depending on the available information about the input variables. In the simplest case, when the x are independent Gaussians, Xk ∼ N(μk, σk²), where μk and σk² are... [Pg.3676]
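
The preprocessing step can be sketched as an isoprobabilistic transformation, shown below under the assumption of known, independent marginals (an illustration, not necessarily the specific transformation used by the source): each input is mapped through its own CDF and then through the inverse standard-normal CDF.

```python
# Hedged sketch: transform independent non-Gaussian inputs x to i.i.d. standard
# Gaussian variables z via z_k = Phi^{-1}(F_k(x_k)), assuming known marginals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x_lognormal = rng.lognormal(mean=0.0, sigma=0.5, size=1000)   # example marginal F_1
x_uniform   = rng.uniform(2.0, 5.0, size=1000)                # example marginal F_2

def to_std_gaussian(x, marginal):
    return stats.norm.ppf(marginal.cdf(x))                    # Phi^{-1}(F_k(x_k))

z1 = to_std_gaussian(x_lognormal, stats.lognorm(s=0.5, scale=np.exp(0.0)))
z2 = to_std_gaussian(x_uniform,   stats.uniform(loc=2.0, scale=3.0))
print(round(z1.mean(), 2), round(z1.std(), 2))                # approx 0 and 1
print(round(z2.mean(), 2), round(z2.std(), 2))                # approx 0 and 1
```

In the simplest case mentioned above, where the x are already independent Gaussians, this reduces to standardizing each variable, zk = (xk − μk)/σk.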

Partition-based modeling methods are also called subset selection methods because they select a smaller subset of the most relevant inputs. The resulting model is often physically interpretable because the model is developed by explicitly selecting the input variable that is most relevant to approximating the output. This approach works best when the variables are independent (De Veaux et al., 1993). The variables selected by these methods can be used as the analyzed inputs for the interpretation step. [Pg.41]
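
A minimal sketch of this kind of explicit input selection follows. It is generic greedy forward selection, offered as an illustration rather than the specific partition-based algorithm discussed: at each step the input that most improves the fit is added, yielding a readily interpretable list of selected variables.

```python
# Hedged sketch: greedy forward selection of the most relevant inputs.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, n_select=3):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        scores = [(LinearRegression().fit(X[:, selected + [j]], y)
                   .score(X[:, selected + [j]], y), j) for j in remaining]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected               # indices of the inputs judged most relevant

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 12))
y = 3 * X[:, 2] - X[:, 7] + 0.5 * X[:, 9] + rng.normal(scale=0.2, size=80)
print(forward_select(X, y))       # typically [2, 7, 9], in order of relevance
```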

