
Cross-validated prediction error

Cross-validation is an alternative to the split-sample method of estimating prediction accuracy (5). Molinaro et al. describe and evaluate many variants of cross-validation and bootstrap re-sampling for classification problems in which the number of candidate predictors vastly exceeds the number of cases (13). The cross-validated prediction error is an estimate of the prediction error that results from applying the model-building algorithm to the entire dataset. [Pg.334]

It can be employed as a fairly realistic estimate of predictive ability. The minimum cross-validated prediction error for acenaphthylene, 0.040 mg L-1, equals 33.69%. This compares with an autopredictive error of 0.014 mg L-1 (11.64%) using ten components and PLS1, which is a very over-optimistic estimate. [Pg.21]

TABLE 7.3.17. Full Cross-Validation Prediction Error (RMSEP in Weight-%) Results ... [Pg.262]

NIR measurements on 100 live salmon were compared to postrigor measurements (42). The live salmon were anesthetized in 12°C seawater containing 0.02% m-aminobenzoic acid ethyl ester methanesulfonate. The salmon were measured both with a fiber-optic probe as described above and with a noncontact, diffuse-reflectance, fixed-grating diode array spectrophotometer (DA 7000 Flexi-mode, Perten Instruments, Huddinge, Sweden). The diode array instrument scanned from 400 to 1700 nm in 5-nm steps. The salmon were measured at one spot as described above (53). The salmon ranged from 0.73 to 10.4 kg in carcass weight and from 8.2 to 23.2% in fat. The cross-validated prediction error results were very similar for both instruments, namely, 1.4% (R = 0.90) fat for live salmon and 1.3-1.5% (R = 0.88-0.91) fat for postrigor salmon (Fig. 7.3.11). [Pg.272]

FIGURE 34.5 Cross-validated predicted versus measured plots for rapeseed methyl ester (RME) in jet fuel (ppm). The root mean square error (RMSE) is 2.2 ppm. Adapted from Eide et al. [13]. [Pg.759]

FIGURE 21.8 Error bars for the cross-validation prediction of 24 standard simulated natural gas samples as a function of resolution. [Pg.446]

The maximum number of latent variables is the smaller of the number of x values or the number of molecules. However, there is an optimum number of latent variables in the model beyond which the predictive ability of the model does not increase. A number of methods have been proposed to decide how many latent variables to use. One approach is to use a cross-validation method, which involves adding successive latent variables; both leave-one-out and group-based methods can be applied. As the number of latent variables increases, the cross-validated r² will first increase and then either reach a plateau or even decrease. Another parameter that can be used to choose the appropriate number of latent variables is the standard deviation of the error of the predictions, S_PRESS ... [Pg.725]
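To make the procedure concrete, here is a minimal Python sketch (a hypothetical illustration using scikit-learn and synthetic data in place of a real calibration set; all names and parameter values are my own) that tracks PRESS by leave-one-out cross-validation as latent variables are added:

    # Sketch: choosing the number of PLS latent variables by leave-one-out
    # cross-validation; synthetic data stand in for a real calibration set.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import LeaveOneOut

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 50))          # 30 molecules, 50 x variables
    y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.1, size=30)

    n_lv_max = 10                          # candidate numbers of latent variables
    press = np.zeros(n_lv_max)
    for n_lv in range(1, n_lv_max + 1):
        for train, test in LeaveOneOut().split(X):
            model = PLSRegression(n_components=n_lv).fit(X[train], y[train])
            press[n_lv - 1] += (y[test][0] - model.predict(X[test]).ravel()[0]) ** 2

    # PRESS typically falls, then plateaus or rises once extra latent
    # variables only fit noise; the minimum suggests the optimum size.
    print("PRESS per model size:", np.round(press, 3))
    print("suggested number of latent variables:", int(np.argmin(press)) + 1)

The plateau-or-rise in PRESS mirrors the behavior of the cross-validated r² described above, seen from the error side rather than the correlation side.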

Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained from leave-one-out cross-validation using PCR (o) and PLS ( ) regression.
The residuals can be calculated from a given set of calibration samples in a different way. Cross-validation is an important procedure for estimating a realistic prediction error such as PRESS. The data for k samples are removed from the data matrix and then predicted by the model. The residual errors of prediction from cross-validation are in this case given by... [Pg.189]
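For reference, the usual definition of the cross-validated PRESS statistic consistent with this description (a standard textbook formula, not the source's elided equation):

    \[ \mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^{2} \]

where \hat{y}_{(i)} is the prediction for sample i from a model fitted with that sample (or its group of k samples) removed from the data matrix.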

Cross-validation and bootstrap techniques can be applied for a statistically based estimation of the optimum number of PCA components. The idea is to randomly split the data into training and test data. PCA is then applied to the training data, and the observations from the test data are reconstructed using 1 to m PCs. The prediction error against the real test data can then be computed. Repeating this procedure many times indicates the distribution of the prediction errors when using 1 to m components, which then allows one to decide on the optimal number of components. For more details see Section 3.7.1. [Pg.78]
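A minimal sketch of this idea, assuming scikit-learn and synthetic correlated data: the data are split repeatedly at random, PCA is fitted to each training part, and the reconstruction error on the test part is recorded for 1 to m components.

    # Sketch: distribution of test-set reconstruction errors for 1..m PCs,
    # from repeated random training/test splits.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import ShuffleSplit

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 20)) @ rng.normal(size=(20, 20))  # correlated data

    m = 10                                    # largest model considered
    splits = ShuffleSplit(n_splits=50, test_size=0.3, random_state=1)
    errors = np.zeros((50, m))                # one row per random split

    for rep, (train, test) in enumerate(splits.split(X)):
        mu = X[train].mean(axis=0)            # center on the training data only
        pca = PCA(n_components=m).fit(X[train] - mu)
        for k in range(1, m + 1):
            P = pca.components_[:k]           # loadings of the first k PCs
            Xc = X[test] - mu
            resid = Xc - Xc @ P.T @ P         # part not reconstructed by k PCs
            errors[rep, k - 1] = np.sqrt((resid ** 2).mean())

    # The spread over repetitions approximates the distribution of
    # prediction errors per model size mentioned in the text.
    print("median error per number of PCs:", np.round(np.median(errors, axis=0), 3))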

Efron, B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78, 1983, 316-331. [Pg.205]

SEP: standard error of prediction; SEP_CV for cross-validation, SEP_test for an... [Pg.307]

For time-series data, the contiguous block method can provide a good assessment of the temporal stability of the model, whereas the Venetian blinds method can better assess nontemporal errors. For batch data, one can either specify custom subsets where each subset is assigned to a single batch (i.e., leave-one-batch-out cross-validation), or use Venetian blinds or contiguous blocks to assess within-batch and between-batch prediction errors, respectively. For blocked data that contains replicates, one must be very careful with the Venetian blinds and contiguous block methods to select parameters such that the replicate sample trap and the external subset trap, respectively, are avoided. [Pg.411]
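As an illustration of the two subset schemes, here is a short Python sketch that generates the corresponding cross-validation indices (the function names are my own, not from any particular package):

    # Sketch: subset indices for the venetian blinds (interleaved) and
    # contiguous block (consecutive) cross-validation schemes.
    import numpy as np

    def venetian_blinds(n_samples, n_splits):
        # Every n_splits-th sample joins the same subset (interleaved).
        return [np.arange(start, n_samples, n_splits) for start in range(n_splits)]

    def contiguous_blocks(n_samples, n_splits):
        # Consecutive runs of samples form each subset (block-wise).
        return np.array_split(np.arange(n_samples), n_splits)

    print(venetian_blinds(12, 3))    # [0 3 6 9], [1 4 7 10], [2 5 8 11]
    print(contiguous_blocks(12, 3))  # [0..3], [4..7], [8..11]

With replicated data, one would check that no replicate of a left-out sample remains in the training subset, per the caution above.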

Infrared data in the 1575-400 cm-1 region (1218 points/spectrum) from LTAs from 50 coals (large data set) were used as input data to both PLS and PCR routines. This is the same spectral region used in the classical least-squares analysis of the small data set. Calibrations were developed for the eight ASTM ash fusion temperatures and the four major ash elements as oxides (determined by ICP-AES). The program uses PLS1 models, in which only one variable at a time is modeled. Cross-validation was used to select the optimum number of factors in the model. In this technique, a subset of the data (in this case five spectra) is omitted from the calibration, but predictions are made for it. The sum-of-squares residuals are computed from those samples left out. A new subset is then omitted, the first set is included in the new calibration, and additional residual errors are tallied. This process is repeated until predictions have been made and the errors summed for all 50 samples (in this case, 10 calibrations are made). This entire set of... [Pg.55]


Root Mean Square Error of Cross Validation for PCA Plot (Model Diagnostic) As described above, the residuals from a standard PCA calculation indicate how well the PCA model fits the samples that were used to construct it. Specifically, they are the portion of the sample vectors that is not described by the model. Cross-validation residuals are computed in a different manner. A subset of samples is removed from the data set and a PCA model is constructed. Then the residuals for the left-out samples are calculated (cross-validation residuals). The subset of samples is returned to the data set, and the process is repeated for different subsets of samples until each sample has been excluded from the data set one time. These cross-validation residuals are the portion of the left-out sample vectors that is not described by the PCA model constructed from an independent sample set. In this sense they are like prediction residuals (vs. fit). [Pg.230]
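A minimal numpy/scikit-learn sketch of this procedure with leave-one-out subsets and synthetic data; the cross-validation residual of each left-out sample is its portion not described by the loadings of a PCA model built without it:

    # Sketch: cross-validation residuals for PCA, one left-out sample at a time.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    X = rng.normal(size=(25, 8)) @ rng.normal(size=(8, 8))  # correlated data

    k = 3                                    # number of PCs in the model
    cv_resid = np.empty_like(X)
    for i in range(X.shape[0]):
        train = np.delete(np.arange(X.shape[0]), i)
        mu = X[train].mean(axis=0)
        P = PCA(n_components=k).fit(X[train] - mu).components_
        xc = X[i] - mu
        cv_resid[i] = xc - xc @ P.T @ P      # portion not described by the model

    rmsecv = np.sqrt((cv_resid ** 2).mean())
    print("RMSECV for a", k, "component PCA model:", round(rmsecv, 3))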

Root Mean Square Error of Prediction (RMSEP) Plot (Model Diagnostic) The number of variables to include is finalized using a validation procedure that accounts for predictive ability. There are two approaches for calculating the prediction error: internal cross-validation (e.g., leave-one-out cross-validation with the calibration data) or external validation (i.e., perform prediction... [Pg.311]
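For reference, the standard definition of RMSEP over n validation samples (a textbook formula, not quoted from this source):

    \[ \mathrm{RMSEP} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2} } \]

where \hat{y}_i is the predicted and y_i the measured value for validation sample i; the same formula applied to cross-validation predictions gives the cross-validated RMSEP (RMSECV) discussed above.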

Simon et al. (14) also showed that cross-validating the prediction rule after selection of differentially expressed genes from the full data set does little to correct the bias of the re-substitution estimator: 90.2% of simulated data sets with no true relationship between expression data and class still result in zero misclassifications. When feature selection was also re-done in each cross-validated training set, however, appropriate estimates of misclassification error were obtained: the median estimated misclassification rate was approximately 50%. [Pg.334]
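A sketch of the bias Simon et al. describe, assuming scikit-learn and pure-noise data (all parameter choices here are illustrative): selecting genes on the full data set before cross-validation yields optimistic accuracy, while re-doing the selection inside each training fold returns accuracy near the 50% chance level.

    # Sketch: feature selection outside vs. inside the cross-validation loop.
    # Labels are unrelated to the data, so honest accuracy should be ~50%.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 2000))          # many candidate genes, few cases
    y = rng.integers(0, 2, size=40)          # class labels independent of X

    # Biased: genes chosen once, on all samples, before cross-validation.
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
    biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5).mean()

    # Honest: selection is repeated inside every cross-validation training set.
    pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))
    honest = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"biased accuracy ~{biased:.2f}, honest accuracy ~{honest:.2f}")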

Independent (external) test-set validation and cross-validation are the most commonly used methods of estimating prediction error. External test-set validation is based on testing the model on a subset of available samples, which will not be involved in... [Pg.402]

