
Bootstrap optimism

Cross-validation and bootstrap techniques can be applied for a statistically based estimation of the optimum number of PCA components. The idea is to randomly split the data into training and test data. PCA is then applied to the training data, and the observations from the test data are reconstructed using 1 to m PCs. The prediction error with respect to the actual test data can then be computed. Repeating this procedure many times indicates the distribution of the prediction errors when using 1 to m components, which allows a decision on the optimal number of components. For more details see Section 3.7.1. [Pg.78]
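
A minimal sketch of this scheme, assuming synthetic data; the split size, number of repetitions, and maximum number of components m are all placeholder choices, not values from the source:

```python
# Repeatedly split the data, fit PCA on the training part, and measure
# how well 1..m components reconstruct the held-out test part.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # placeholder data matrix
m = 8                                   # maximum number of PCs to try
errors = np.zeros((50, m))              # one row per random split

for rep, (train, test) in enumerate(
        ShuffleSplit(n_splits=50, test_size=0.3, random_state=0).split(X)):
    mu = X[train].mean(axis=0)
    pca = PCA(n_components=m).fit(X[train] - mu)
    for k in range(1, m + 1):
        # Project the test data onto the first k loadings and back.
        scores = (X[test] - mu) @ pca.components_[:k].T
        recon = scores @ pca.components_[:k] + mu
        errors[rep, k - 1] = np.mean((X[test] - recon) ** 2)

# The distribution of errors for each k guides the choice of components.
print(errors.mean(axis=0))
```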

CV or bootstrap is used to split the data set into different calibration sets and test sets. A calibration set is used as described above to create an optimized model, and this model is applied to the corresponding test set. In principle, all objects are used in the training set, validation set, and test set; however, an object is never simultaneously used for model creation and for testing. This strategy (double CV, double bootstrap, or a combination of CV and bootstrap) is applicable to a relatively small number of objects; furthermore, the process can be repeated many times with different random splits, resulting in a high number of test-set-predicted values (Section 4.2.5). [Pg.123]

Using the calibration set, a model is created; in the case of PLS or PCR, an optimization of the number of components additionally has to be done, for instance by CV or by an inner bootstrap within the calibration set. The resulting optimized model is then applied to the objects not contained in the calibration set, giving predicted values... [Pg.132]
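
A minimal sketch of the double-CV strategy described in the two excerpts above, assuming synthetic data and PLS as the model; the inner component grid (1 to 10) and the split counts are illustrative choices only:

```python
# Outer loop: split objects into calibration and test sets.
# Inner loop: CV within each calibration set picks the number of
# PLS components; the optimized model predicts only the test objects.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 15))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=60)

y_pred = np.empty_like(y)
for calib, test in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    # Inner CV on the calibration set only: choose the component count.
    scores = [cross_val_score(PLSRegression(n_components=k),
                              X[calib], y[calib], cv=4,
                              scoring='neg_mean_squared_error').mean()
              for k in range(1, 11)]
    best_k = int(np.argmax(scores)) + 1
    model = PLSRegression(n_components=best_k).fit(X[calib], y[calib])
    y_pred[test] = model.predict(X[test]).ravel()

# Every object receives a test-set-predicted value exactly once per run.
print('test-set RMSE:', np.sqrt(np.mean((y - y_pred) ** 2)))
```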

An approach that is sometimes helpful, particularly for recent pesticide risk assessments, is to use the parameter values that result in the best fit (in the LS sense), comparing the fitted cdf to the cdf of the empirical distribution. In some cases, such as when fitting a log-normal distribution, formulae from linear regression can be used after transformations are applied to linearize the cdf. In other cases, the residual SS is minimized using numerical optimization, i.e., one uses nonlinear regression. This approach seems reasonable for point estimation. However, the statistical assumptions that would often be invoked to justify LS regression will not be met in this application, so the use of any additional regression results (beyond the point estimates) is questionable. If there is a need to provide standard errors or confidence intervals for the estimates, bootstrap procedures are recommended. [Pg.43]
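
A minimal sketch of such a bootstrap, assuming a log-normal model and synthetic data; the parameters here are simply the mean and standard deviation of the log-transformed resamples, a deliberately simple stand-in for whatever fitting procedure is used in practice:

```python
# Refit the distribution to resampled data many times and read standard
# errors / percentile confidence intervals off the parameter distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = stats.lognorm.rvs(s=0.5, scale=np.exp(1.0), size=80, random_state=2)

B = 1000
estimates = np.empty((B, 2))            # (mu, sigma) of log(data)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)
    logs = np.log(resample)
    estimates[b] = logs.mean(), logs.std(ddof=1)

se = estimates.std(axis=0, ddof=1)
ci = np.percentile(estimates, [2.5, 97.5], axis=0)
print('bootstrap SEs:', se)
print('95% percentile CIs:', ci.T)
```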

MR spectra from 33 patients with breast cancer with vascular invasion and 52 without were subjected to the SCS-based analysis. Maximally discriminatory subregions were 0.47-0.55, 0.57-0.62, 0.86-0.92, 1.00-1.03, 1.69-1.71, 1.99-2.05, 2.55-2.56 and 2.63-2.72 ppm for the first derivatives of the spectra, and 0.75-0.81, 0.90-0.94, 1.03-1.12, 1.21-1.24, 1.59-1.63, 2.00-2.04, 2.24-2.27 and 2.70-2.74 ppm for rank-ordered spectra. Using LDA and bootstrap-based cross-validation, two separate classifiers, A (using the optimal regions from the first derivatives of the spectra) and B (using the optimal regions from the rank-ordered spectra), were developed. The final classifier was the Wolpert-combined A + B classifier [61]. [Pg.102]

This optimism represented the underestimation of the squared prediction error (SPE) that was expected to occur when the model was applied to the data from which it was derived. In a final step, the average optimism across all bootstrap iterations was estimated and added to the SPE obtained when the original-data model (Mo) was applied to the original data set (Do). This resulted in an improved estimate of the absolute prediction error (SPEimp). [Pg.416]
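
A minimal sketch of this optimism-corrected bootstrap, with ordinary least squares standing in for the model and a synthetic data set in place of Do; the source concerns a pharmacokinetic model, so everything below is illustrative:

```python
# The optimism of each bootstrap model Mb is its error on Do minus its
# apparent error on its own bootstrap sample Db; the average optimism is
# added to the apparent SPE of Mo on Do to give SPEimp.
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = np.c_[np.ones(n), rng.normal(size=(n, 3))]
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def spe(beta, X, y):
    return np.mean((y - X @ beta) ** 2)   # squared prediction error

beta_o = fit(X, y)
apparent = spe(beta_o, X, y)              # Mo applied to Do

optimism = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)      # bootstrap sample Db
    beta_b = fit(X[idx], y[idx])          # model Mb
    optimism.append(spe(beta_b, X, y) - spe(beta_b, X[idx], y[idx]))

spe_imp = apparent + np.mean(optimism)
print('apparent SPE:', apparent, 'improved SPE:', spe_imp)
```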

A population pharmacokinetic (PPK) model was developed to construct a dosing strategy for a Phase 3 study and therefore needed some form of validation. Obtaining a separate test data set would have been expensive and time consuming, so the bootstrap was used instead to validate the model by estimating the bias/SE and the optimism for the dependent variable. The optimism was small compared with the SPE of the original sample, thus validating the model. [Pg.417]

The above are examples of nonparametric resampling (RS) methods without replacement. The deservedly popular bootstrap method is also a resampling method, but with replacement. Because of replacement, some objects may appear several times in any of the B bootstrap samples. This could cause difficulties for those classifiers that we optimize by inverting covariance matrices. A simple remedy is to perform the sampling with replacement only after we have selected for the current training set a particular subset of N - d distinct samples, as in the sketch below. Again, we secure the end result by averaging the B individual classifier outcomes. [Pg.275]
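
A minimal sketch of this remedy, under one plausible reading of the passage: each training set consists of N - d distinct objects drawn without replacement plus d further objects drawn with replacement, so the class covariance matrices stay well conditioned. The data, the LDA classifier, and all sizes are illustrative:

```python
# Build B training sets with a guaranteed core of N-d distinct objects,
# fit a classifier on each, and average (majority-vote) the B outcomes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
N, d, B = 60, 6, 100
X = np.r_[rng.normal(size=(30, 4)), rng.normal(loc=1.0, size=(30, 4))]
y = np.r_[np.zeros(30), np.ones(30)].astype(int)

votes = np.zeros((B, N))
for b in range(B):
    distinct = rng.choice(N, size=N - d, replace=False)  # no duplicates
    extra = rng.choice(N, size=d, replace=True)          # with replacement
    train = np.concatenate([distinct, extra])
    lda = LinearDiscriminantAnalysis().fit(X[train], y[train])
    votes[b] = lda.predict(X)

# Secure the end result by averaging over the B individual classifiers.
final = (votes.mean(axis=0) > 0.5).astype(int)
print('resubstitution agreement:', np.mean(final == y))
```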

