Subset selection

You can use the semi-empirical and ab initio Orbitals dialog box in HyperChem to request a contour plot of any molecular orbital. When requested, the orbital is contoured for a plane that is parallel to the screen and which is specified by a subset selection and a plane offset, as described above. The index of the orbital and its orbital energy (in electron volts, eV) appear in the status line. [Pg.244]

HyperChem assumes that it is easiest for you to just use subset selection to select that portion of the molecular system that is to be treated quantum mechanically. You can then extend the initial selection to form a convenient and universally acceptable boundary. Thus, you make a simple selection of atoms for the first pass at selecting the quantum mechanical portion. The selected atoms are quantum atoms and the unselected atoms are classical atoms. [Pg.246]

As might be expected, established optimisation techniques such as simulated annealing and genetic algorithms have been used to tackle the subset selection problem. These methods... [Pg.733]

When a chemistry space has been defined, a database can be mapped onto the space by assigning each molecule to a cell according to its properties, and a diverse subset can be selected by taking one or more molecules from each cell. Alternatively, a focused subset can be selected by choosing compounds from a limited number of cells, for example, from the cells adjacent to a cell occupied by a known active. The partitioning scheme is defined independently of... [Pg.201]
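
The cell assignment and the two selection modes described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the function names, the equal-width binning, the fixed axis ranges, and the choice to include the active's own cell in the focused subset are all assumptions made for the example.

```python
from collections import defaultdict


def cell_index(values, lows, highs, bins):
    """Map a descriptor vector to a tuple of bin indices (equal-width bins, assumed)."""
    idx = []
    for v, lo, hi in zip(values, lows, highs):
        k = int((v - lo) / (hi - lo) * bins)
        idx.append(min(max(k, 0), bins - 1))  # clamp values on or beyond the edges
    return tuple(idx)


def partition(molecules, lows, highs, bins):
    """Group molecule names by the cell their descriptors fall into."""
    cells = defaultdict(list)
    for name, values in molecules.items():
        cells[cell_index(values, lows, highs, bins)].append(name)
    return cells


def diverse_subset(cells):
    """Take one molecule (here simply the first listed) from every occupied cell."""
    return [members[0] for _, members in sorted(cells.items())]


def focused_subset(cells, active_cell):
    """Take all molecules from the active's cell and the cells adjacent to it."""
    picked = []
    for cell, members in sorted(cells.items()):
        if all(abs(a - b) <= 1 for a, b in zip(cell, active_cell)):
            picked.extend(members)
    return picked


if __name__ == "__main__":
    # Hypothetical two-descriptor data, scaled to the range 0..1.
    mols = {"a": (0.1, 0.1), "b": (0.15, 0.12), "c": (0.9, 0.9), "d": (0.4, 0.1)}
    cells = partition(mols, (0.0, 0.0), (1.0, 1.0), bins=4)
    print(diverse_subset(cells))
    print(focused_subset(cells, active_cell=(0, 0)))
```

Because the cell boundaries depend only on the axis ranges and the bin count, new molecules can be assigned to cells without recomputing anything for the existing ones.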

Partition-based modeling methods are also called subset selection methods because they select a smaller subset of the most relevant inputs. The resulting model is often physically interpretable because the model is developed by explicitly selecting the input variable that is most relevant to approximating the output. This approach works best when the variables are independent (De Veaux et al., 1993). The variables selected by these methods can be used as the analyzed inputs for the interpretation step. [Pg.41]

The literature of the past three decades has witnessed a tremendous explosion in the use of computed descriptors in QSAR. But it is noteworthy that this has exacerbated another problem: rank deficiency. This occurs when the number of independent variables is larger than the number of observations. Stepwise regression and other similar approaches, which are popularly used when there is a rank deficiency, often result in overly optimistic and statistically incorrect predictive models. Such models would fail in predicting the properties of future, untested cases similar to those used to develop the model. It is essential that subset selection, if performed, be done within the model validation step as opposed to outside of the model validation step, thus providing an honest measure of the predictive ability of the model, i.e., the true q² [39,40,68,69]. Unfortunately, many published QSAR studies involve subset selection followed by model validation, thus yielding a naive q², which inflates the predictive ability of the model. The following steps outline the proper sequence of events for descriptor thinning and LOO cross-validation, e.g.,... [Pg.492]
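
The difference between selecting descriptors inside and outside the validation loop can be sketched as follows. This is an illustrative toy, not the procedure from the cited references: it assumes a deliberately simple selection rule (keep the single descriptor most correlated with the response), a one-variable least-squares fit, and q² computed as 1 - PRESS/SS about the overall mean. All function names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def best_descriptor(X, y):
    """Index of the column with the largest absolute correlation with y."""
    return max(range(len(X[0])),
               key=lambda j: abs(pearson([row[j] for row in X], y)))


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def loo_q2(X, y, select_inside):
    """Leave-one-out q2; selection is redone per fold only if select_inside is True."""
    fixed = None if select_inside else best_descriptor(X, y)  # naive: uses all data
    y_mean = mean(y)
    press = 0.0
    for i in range(len(y)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        j = best_descriptor(X_train, y_train) if select_inside else fixed
        slope, intercept = fit_line([row[j] for row in X_train], y_train)
        press += (y[i] - (slope * X[i][j] + intercept)) ** 2
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)


if __name__ == "__main__":
    # Pure-noise data with many more descriptors than observations.
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(12)]
    y = [rng.gauss(0, 1) for _ in range(12)]
    print("naive q2 :", loo_q2(X, y, select_inside=False))
    print("honest q2:", loo_q2(X, y, select_inside=True))
```

In the naive variant the left-out compound has already influenced which descriptor was chosen, so it is not truly unseen. Running both variants on noise data of this kind is a quick way to see how much optimism the selection step alone can contribute.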

Miller, A. J. Subset Selection in Regression. CRC Press, Boca Raton, FL, 2002. [Pg.206]

These descriptors were calculated for all compounds in the set of interest. Finally, cell-based subset selection was performed using uniform sampling of one compound per cell, with the choice of compound within the cell weighted by activity in the primary assay. The number of bins per axis was varied to achieve the closest possible match to the desired selection size. [Pg.161]
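
One way to picture the two steps in this passage (tuning the number of bins per axis, then drawing one activity-weighted compound per cell) is the sketch below. It is only an illustration under stated assumptions: descriptors are taken to be pre-scaled to the range 0 to 1, the same bin count is used on every axis, activities are positive weights, and the function names are hypothetical rather than taken from the source.

```python
import random
from collections import defaultdict


def occupied_cells(points, bins):
    """Group compound ids by cell; descriptors are assumed scaled to [0, 1]."""
    cells = defaultdict(list)
    for pid, coords in points.items():
        cell = tuple(min(int(c * bins), bins - 1) for c in coords)
        cells[cell].append(pid)
    return cells


def choose_bins(points, target_size, max_bins=50):
    """Bin count whose number of occupied cells is closest to the target size."""
    return min(range(1, max_bins + 1),
               key=lambda b: abs(len(occupied_cells(points, b)) - target_size))


def select_weighted(points, activity, bins, rng):
    """Pick one compound per occupied cell, weighted by primary-assay activity."""
    chosen = []
    for _, members in sorted(occupied_cells(points, bins).items()):
        weights = [activity[m] for m in members]  # must not all be zero
        chosen.append(rng.choices(members, weights=weights, k=1)[0])
    return chosen


if __name__ == "__main__":
    # Hypothetical one-descriptor data and assay activities.
    pts = {"a": (0.05,), "b": (0.10,), "c": (0.55,), "d": (0.95,)}
    act = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.5}
    bins = choose_bins(pts, target_size=3)
    print(bins, select_weighted(pts, act, bins, random.Random(1)))
```

Since one compound is taken per occupied cell, the selection size equals the number of occupied cells, which is why the bin count is the natural knob for hitting a desired subset size.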

The external subset selection trap: This occurs when the selected subset for a particular subvalidation is not representative of the samples in the rest of the data. This usually leads to overly pessimistic cross-validation results, because the selected subset is outside of the space of the remaining samples. [Pg.411]
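
A small sketch can make the trap concrete. It assumes the samples are stored in a systematic order (for example, sorted by concentration) and uses a deliberately crude one-dimensional test for "outside the space": a held-out value lying beyond the minimum or maximum of the remaining samples. The function names and the test itself are illustrative assumptions, not part of the source.

```python
import random


def contiguous_folds(n, k):
    """Split indices 0..n-1 into k consecutive blocks (last block may be shorter)."""
    size = -(-n // k)  # ceiling division
    return [list(range(i, min(i + size, n))) for i in range(0, n, size)]


def shuffled_folds(n, k, rng):
    """Split indices 0..n-1 into k folds after randomizing the object order."""
    order = list(range(n))
    rng.shuffle(order)
    return [sorted(order[i::k]) for i in range(k)]


def outside_remaining(values, fold):
    """True if any held-out value lies beyond the range of the remaining samples."""
    held = set(fold)
    rest = [v for i, v in enumerate(values) if i not in held]
    return any(values[i] < min(rest) or values[i] > max(rest) for i in fold)


if __name__ == "__main__":
    values = [float(v) for v in range(12)]  # samples stored in sorted order
    for name, folds in (("contiguous", contiguous_folds(12, 3)),
                        ("shuffled", shuffled_folds(12, 3, random.Random(0)))):
        print(name, [outside_remaining(values, f) for f in folds])
```

With sorted data, the first and last contiguous blocks necessarily contain the extremes, so the model for those subvalidations must extrapolate. Randomizing the object order before splitting makes this less likely, although it cannot rule it out for the folds that happen to receive the extreme samples.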

Small data sets (< 20 objects); + Relatively quick; + Relatively quick; - Can be slow, if m or number of iterations large; - Selection of subsets unknown; + OK, if many iterations done; - Avoid using if N > 20; + Good choice; - unless designed/DOE data; - Requires time to determine/construct cross validation array; + often needed to avoid the external subset selection trap... [Pg.412]

Experiment (DOE) data; order is randomized; object order is randomized (external subset selection trap); avoid the external subset selection trap... [Pg.412]

In chemoinformatics research, partitioning algorithms are applied in diversity analysis of large compound libraries, subset selection, or the search for molecules with specific activity (1-4). Widely used partitioning methods include cell-based partitioning in low-dimensional chemical spaces (1,3) and decision tree methods, in particular, recursive partitioning (RP) (5-7). Partitioning in low-dimensional chemical spaces is based on various dimension reduction methods (4,8) and often permits simplified three-dimensional representation of... [Pg.291]

There are many advantages in using this approach to feature selection. First, chance classification is not a serious problem because the bulk of the variance or information content of the feature subset selected is about the classification problem of interest. Second, features that contain discriminatory information about a particular classification problem are usually correlated, which is why feature selection methods using principal component analysis or other variance-based methods are generally preferred. Third, the principal component plot... [Pg.413]

Clark [46] has recently described a subset selection algorithm called OptiSim which includes maximum and minimum dissimilarity based selection as special cases. A parameter is used to adjust the balance between representativeness and diversity in the compounds that are selected. [Pg.354]
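
The idea of a single parameter that moves a selection between the two extremes can be illustrated with a simplified, OptiSim-like sketch. This is not Clark's published implementation: it assumes the tuning parameter is the size of a random subsample drawn at each step, it uses Euclidean distance on hypothetical descriptor tuples, and it omits any minimum-dissimilarity threshold or other bookkeeping the real algorithm may use.

```python
import math
import random


def optisim_like(points, n_select, subsample_size, rng):
    """Pick n_select ids; each step keeps the most dissimilar member of a random subsample.

    subsample_size = 1 behaves like random selection of each new compound;
    subsample_size >= number of candidates behaves like maximum-dissimilarity
    (MaxMin) selection. Intermediate values trade representativeness for diversity.
    """
    remaining = sorted(points)
    first = rng.choice(remaining)  # arbitrary starting compound
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < n_select:
        sample = rng.sample(remaining, min(subsample_size, len(remaining)))
        # Score each sampled candidate by its distance to the nearest selected compound.
        best = max(sample, key=lambda c: min(math.dist(points[c], points[s])
                                             for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pts = {"a": (0.0,), "b": (1.0,), "c": (10.0,), "d": (5.0,)}
    print(optisim_like(pts, 3, subsample_size=1, rng=random.Random(0)))
    print(optisim_like(pts, 3, subsample_size=4, rng=random.Random(0)))
```

Small subsamples tend to follow the density of the data set, while large subsamples push the selection toward outlying compounds.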

Given the variety of different descriptors and subset selection methods that are available, several studies have been carried out in an attempt to validate both the compound selection methods and the various descriptors. To some extent the choice of descriptors and subset selection methods are interlinked. For example, partitioning schemes are restricted to low-dimensional descriptors such as physicochemical descriptors, whereas clustering and dissimilarity-based methods can be used with high dimensional descriptors such as fingerprints. [Pg.357]

In product-based selection, the properties of the resulting product molecules are taken into account when selecting the reactants. Typically this is done by enumerating the entire virtual library that could potentially be made. Any of the subset selection methods described previously could be used to select a diverse subset of products; however, the resulting subset is very unlikely to represent a combinatorial subset. This process is known as cherry-picking and is synthetically inefficient as far as combinatorial synthesis is concerned. Synthetic efficiency is maximized by taking the combinatorial... [Pg.358]
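
To make the distinction concrete, the sketch below checks whether a set of selected products is a combinatorial subset, meaning every chosen reactant of one type is combined with every chosen reactant of the other. It assumes a hypothetical two-component library in which each product is identified by a pair of reactant labels; the function names are illustrative only.

```python
from itertools import product


def is_combinatorial(products):
    """True if the (reactant_a, reactant_b) pairs form a full Cartesian product."""
    a_set = {a for a, _ in products}
    b_set = {b for _, b in products}
    return set(products) == set(product(a_set, b_set))


def reactants_needed(products):
    """Number of distinct reactants required to make the given products."""
    return len({a for a, _ in products}) + len({b for _, b in products})


if __name__ == "__main__":
    combinatorial = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
    cherry_picked = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")]
    for subset in (combinatorial, cherry_picked):
        print(is_combinatorial(subset), reactants_needed(subset))
```

Both example subsets contain four products, but the combinatorial one needs four reactants in total while the cherry-picked one needs eight, which is the synthetic inefficiency the passage refers to.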

Hoerl, R.W., Schuenemeyer, J.H., and Hoerl, A.E., A simulation of biased estimation and subset selection regression techniques, Technometrics, 28, 369-380, 1986. [Pg.163]

A set of molecules is commonly described with anywhere from 4 to 10,000 descriptors. It is also possible to represent molecules with sparse descriptors numbering up to 2 million. Variable selection, or descriptor subset selection, or descriptor validation, is important, whether the context is supervised or unsupervised learning (Section 6). [Pg.79]

Table 2. Subset selection strategies for primary screening at Lilly.

There are dimensionality issues. Later we propose Mahalanobis distance (Section 4.5) as a good metric for diversity analysis. With p descriptors in the data set, this metric effectively, if not explicitly, computes a covariance matrix with p(p + 1)/2 parameters. In order to obtain accurate estimates of the elements of the covariance matrix, one rule of thumb is that at least five observations per parameter should be made. This suggests that a data set with n observations can only investigate approximately √(2n/5) descriptors for the Mahalanobis distance computation. Thus, some method for subset selection of descriptors is needed. [Pg.80]
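
The arithmetic behind this rule of thumb can be worked through in a short sketch. It relies on one standard fact, that a symmetric covariance matrix for p descriptors has p(p + 1)/2 distinct entries, and on the five-observations-per-parameter rule quoted above. The function names are hypothetical, and the square-root form is the large-p approximation obtained by treating p(p + 1)/2 as roughly p²/2.

```python
import math


def covariance_parameters(p):
    """Distinct entries in a symmetric p x p covariance matrix."""
    return p * (p + 1) // 2


def max_descriptors(n, obs_per_param=5):
    """Largest p such that obs_per_param * p(p + 1)/2 does not exceed n observations."""
    p = 0
    while obs_per_param * covariance_parameters(p + 1) <= n:
        p += 1
    return p


def approx_max_descriptors(n, obs_per_param=5):
    """Approximation from p**2 / 2 parameters: p is about sqrt(2n / obs_per_param)."""
    return math.sqrt(2 * n / obs_per_param)


if __name__ == "__main__":
    for n in (100, 1000, 100000):
        print(n, max_descriptors(n), round(approx_max_descriptors(n), 1))
```

The supportable number of descriptors grows only with the square root of the number of compounds, so even large data sets justify a modest descriptor count for this metric.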

A requirement for any subset selection method is the ability to accommodate a set of previously selected molecules, where augmentation of the pre-existing set is desired. For example, when purchasing compounds, the goal is to augment what is already owned so that the current corporate collection would be used in the analysis as the pre-existing set of molecules. The goal then is to select a subset of the candidate molecules that optimizes a specified criterion with reference to the molecules in both the candidate set and the previously selected set. [Pg.82]
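
As an illustration of this requirement, the sketch below augments a pre-existing collection using a greedy MaxMin rule: each new pick is the candidate farthest from its nearest neighbour among the compounds already owned or already picked. MaxMin with Euclidean distance is only one possible choice of the "specified criterion" and is an assumption of this example, as are the function name and the toy descriptor values.

```python
import math


def augment(existing, candidates, k):
    """Greedily pick up to k candidate ids that best extend the existing set.

    existing:   list of descriptor tuples for compounds already owned
    candidates: dict mapping candidate id -> descriptor tuple
    """
    reference = list(existing)  # owned compounds plus picks made so far
    pool = dict(candidates)
    picked = []
    for _ in range(min(k, len(pool))):
        # Distance to the nearest reference compound; 0.0 if there is no reference yet.
        best = max(sorted(pool),
                   key=lambda name: min((math.dist(pool[name], r) for r in reference),
                                        default=0.0))
        picked.append(best)
        reference.append(pool.pop(best))
    return picked


if __name__ == "__main__":
    owned = [(0.0, 0.0)]
    for_sale = {"near": (0.1, 0.0), "mid": (2.0, 2.0), "far": (5.0, 5.0)}
    print(augment(owned, for_sale, k=2))
```

Seeding the reference list with the owned compounds is what makes the selection an augmentation: a candidate that duplicates something already in the collection scores poorly even if it is unlike every other candidate.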
