Multiple linear regression selection

One can determine the b in equations (2)-(4) that best fit the experimental data by multiple linear regression. Multiple linear regression selects the set of coefficients that minimizes the sum of squares of the residuals between the response y and the line described by the model. [Pg.972]
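As a minimal sketch of the least-squares criterion described above, the following Python/NumPy example fits coefficients to synthetic data. The function name `fit_mlr`, the synthetic data and the "true" coefficients are illustrative assumptions; they do not come from equations (2)-(4) of the source.

```python
import numpy as np

def fit_mlr(X, y):
    """Return least-squares coefficients (intercept first) for y ~ X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    # lstsq minimizes the sum of squared residuals ||y - A b||^2.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_b = np.array([1.0, 2.0, -0.5, 0.3])
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=50)

b = fit_mlr(X, y)
residuals = y - (b[0] + X @ b[1:])
rss = float(residuals @ residuals)  # the quantity the fit minimizes
```

Any other coefficient vector gives a residual sum of squares at least as large as `rss`, which is the sense in which the fitted coefficients are "best".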

Aqueous solubility is selected to demonstrate the E-state application in QSPR studies. Huuskonen et al. modeled the aqueous solubility of 734 diverse organic compounds with multiple linear regression (MLR) and artificial neural network (ANN) approaches [27]. The set of structural descriptors comprised 31 E-state atomic indices and three indicator variables for pyridine, aliphatic hydrocarbons and aromatic hydrocarbons, respectively. The dataset of 734 chemicals was divided into a training set (n=675), a validation set (n=38) and a test set (n=21). A comparison of the MLR results (training, r=0.94, s=0.58; validation, r=0.84, s=0.67; test, r=0.80, s=0.87) and the ANN results (training, r=0.96, s=0.51; validation, r=0.85, s=0.62; test, r=0.84, s=0.75) indicates a small improvement for the neural network model with five hidden neurons. These QSPR models may be used for a fast and reliable computation of the aqueous solubility for diverse organic compounds. [Pg.93]

There are many different methods for selecting those descriptors of a molecule that capture the information that somehow encodes the compound's solubility. Currently, the most often used are multiple linear regression (MLR), partial least squares (PLS) or neural networks (NN). The former two methods provide a simple linear relationship between several independent descriptors and the solubility, as given in Eq. (14). This equation yields the independent contribution, hi, of each descriptor, Di, to the solubility ... [Pg.302]

Two models of practical interest using quantum chemical parameters were developed by Clark et al. [26, 27]. Both studies were based on 1085 molecules and 36 descriptors calculated with the AM1 method following structure optimization and electron density calculation. An initial set of descriptors was selected with a multiple linear regression model and further optimized by trial-and-error variation. The second study calculated a standard error of 0.56 for 1085 compounds and it also estimated the reliability of neural network prediction by analysis of the standard deviation error for an ensemble of 11 networks trained on different randomly selected subsets of the initial training set [27]. [Pg.385]

The set of selected wavelengths (i.e. the experimental design) affects the variance-covariance matrix, and thus the precision of the results. For example, the set 22, 24 and 26 (Table 41.5) gives a less precise result than the set 22, 32 and 24 (Table 41.7). The best set of wavelengths can be derived in the same way as for multiple linear regression, i.e. the determinant of the dispersion matrix (h h) which contains the absorptivities, should be maximized. [Pg.587]
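The determinant criterion can be sketched in Python/NumPy as follows. This is an illustration under stated assumptions: the absorptivity matrix `K` (rows are wavelengths, columns are components) holds made-up values rather than the data of Tables 41.5 and 41.7, the function name `best_wavelength_set` is hypothetical, and the matrix whose determinant is maximized is taken here to be the product of the transposed sub-matrix with itself.

```python
import numpy as np
from itertools import combinations

def best_wavelength_set(K, n_select):
    """Exhaustively pick the rows of K that maximize det(K_sub^T K_sub)."""
    best_det, best_idx = -np.inf, None
    for idx in combinations(range(K.shape[0]), n_select):
        sub = K[list(idx)]                  # absorptivities at the chosen wavelengths
        d = np.linalg.det(sub.T @ sub)      # larger determinant -> smaller variance
        if d > best_det:
            best_det, best_idx = d, idx
    return best_idx, best_det

# Hypothetical absorptivities: 5 candidate wavelengths x 2 components.
K = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.1, 0.9],
])

best_idx, best_det = best_wavelength_set(K, n_select=3)
```

An exhaustive search is only practical for a small number of candidate wavelengths; for larger sets, exchange-type algorithms are commonly used instead.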

Besides mid-IR, near-IR spectroscopy has been used to quantitate polymorphs at the bulk and dosage product level. For SC-25469 [34], two polymorphic forms were discovered (α and β), and the β-form was selected for use in the solid dosage form. Since the β-form can be transformed to the α-form under pressure by enantiotropy, quantitation of the β-form in the solid dosage formulation was necessary. Standard mixtures of both forms in the formulation matrix were prepared, and spectra were measured in the near-IR via diffuse reflectance. Using a standard near-IR multiple linear regression statistical approach, the α- and β-forms could be predicted to within 1% of theoretical. This extension of the diffuse reflectance IR technique shows that quantitation of polymorphic forms at the bulk and/or dosage product level can be performed. [Pg.74]

Walters [24] examined the effect of chloride on the use of bromide and iodide solid state membrane electrodes, and he calculated selectivity constants. Multiple linear regression analysis was used to determine the concentrations of bromide, fluorine, and iodide in geothermal brines, and indicated high interferences at high salt concentrations. The standard curve method was preferred to the multiple standard addition method because of ... [Pg.65]

Simple and valence indices up to sixth order were computed for all the PAHs used in the present study database. The program MOLCONN2 [133, 152,154, 156] performed these calculations using the chemical structural formula as input. SAS [425] was used on a mainframe computer to perform statistical analyses. First, indices were selected which explained the greatest amount of variance in the data (i.e., R2 procedure). These indices were then used in a multiple linear regression analysis (REG procedure). [Pg.289]

Like MLR, PCR [63] is an inverse calibration method. However, in PCR, the compressed variables (or PCs) from PCA are used as variables in the multiple linear regression model, rather than selected original X variables. In PCR, PCA is first done on the calibration x data, thus generating PCA scores (T) and loadings (P) (see Section 12.2.5), then a multiple linear regression is carried out according to the following model ... [Pg.383]
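The two steps of PCR described above (PCA on the calibration x data, then regression on the scores) can be sketched in Python/NumPy. This is a minimal illustration, not the model elided in the excerpt: the function names `pcr_fit` and `pcr_predict`, the mean-centering choice and the synthetic two-factor data are all assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR: PCA of mean-centered X, then least squares of y on the scores."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # SVD of the centered data gives the PCA loadings as rows of Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T        # loadings, shape (variables, components)
    T = Xc @ P                     # scores, shape (samples, components)
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return x_mean, y_mean, P, q

def pcr_predict(model, X_new):
    """Project new samples onto the loadings and apply the regression."""
    x_mean, y_mean, P, q = model
    return y_mean + (np.asarray(X_new, dtype=float) - x_mean) @ P @ q

# Synthetic calibration set (assumed): 10 measured variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 10))
y = latent @ np.array([1.0, -2.0]) + 3.0

model = pcr_fit(X, y, n_components=2)
y_hat = pcr_predict(model, X)
```

Because the scores are mutually orthogonal, the regression step is well conditioned even when the original x variables are strongly collinear, which is the usual motivation for PCR over MLR on selected variables.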

In the chapter, we report a successful application of the Free-Wilson (26-30) methodology to model structure-activity/selectivity relationships. The Fujita-Ban (31-34) modification of Free-Wilson coupled with multiple linear regression... [Pg.93]

ILS is a least-squares method that assumes the inverse calibration model given in eqn (3.4). For this reason it is often also termed multiple linear regression (MLR). In this model, the concentration of the analyte of interest, k, in sample i is regressed as a linear combination of the instrumental measurements at J selected sensors [5,16-19] ... [Pg.172]

Some of the earliest applications of chemometrics in PAC involved the use of an empirical variable selection technique commonly known as stepwise multiple linear regression (SMLR).8,26,27 As the name suggests, this is a technique in which the relevant variables are selected sequentially. This method works as follows ... [Pg.243]

For the biotransformation assay results, concentration-normalized maximum induction values were modeled. Stepwise multiple linear regression analysis was used to select the most suitable parameters from log Kow, EHOMO, ELUMO, and the difference in EHOMO and ELUMO (ELUMO - EHOMO). [Pg.381]

Figures 11 and 12 illustrate the performance of the pR2 compared with several of the currently popular criteria on a specific data set resulting from one of the drug hunting projects at Eli Lilly. This data set has IC50 values for 1289 molecules. There were 2317 descriptors (or covariates), and a multiple linear regression model was used with forward variable selection; the linear model was trained on half the data (selected at random) and evaluated on the other (hold-out) half. The root mean squared error of prediction (RMSE) for the hold-out test set is minimized when the model has 21 parameters. Figure 11 shows the model size chosen by several criteria applied to the training set in a forward selection; for example, the pR2 chose 22 descriptors, the Bayesian Information Criterion chose 49, Leave One Out cross-validation chose 308, the adjusted R2 chose 435, and the Akaike Information Criterion chose 512 descriptors in the model. Although the pR2 criterion selected considerably fewer descriptors than the other methods, it had the best prediction performance. Also, only pR2 and BIC had better prediction on the test data set than the null model.
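The evaluation protocol described above (forward selection on a random training half, RMSE on the hold-out half, comparison with the null model) can be sketched in Python/NumPy. This is an illustration only: the data are synthetic, the function names are hypothetical, and the pR2 criterion itself is not implemented here because the excerpt does not define it. The sketch simply reports hold-out RMSE for each model size.

```python
import numpy as np

def _rss(A, y):
    """Residual sum of squares of the least-squares fit of y on A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_selection(X, y, max_terms):
    """Greedily add the descriptor that most reduces the training RSS."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected = []
    for _ in range(min(max_terms, p)):
        remaining = [j for j in range(p) if j not in selected]
        scores = [(_rss(np.hstack([ones, X[:, selected + [j]]]), y), j)
                  for j in remaining]
        selected.append(min(scores)[1])
    return selected

def holdout_rmse(X_tr, y_tr, X_te, y_te, cols):
    """Fit on the training half with the given columns; return hold-out RMSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_te = np.column_stack([np.ones(len(X_te)), X_te[:, cols]])
    return float(np.sqrt(np.mean((y_te - A_te @ coef) ** 2)))

# Synthetic data (assumed): 20 descriptors, only columns 4 and 11 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 4] - 2.0 * X[:, 11] + rng.normal(scale=0.5, size=200)

perm = rng.permutation(200)            # random half/half split
tr, te = perm[:100], perm[100:]
order = forward_selection(X[tr], y[tr], max_terms=10)

# Index 0 is the null (intercept-only) model; index k uses the first k descriptors.
rmse_by_size = [holdout_rmse(X[tr], y[tr], X[te], y[te], order[:k])
                for k in range(len(order) + 1)]
best_size = int(np.argmin(rmse_by_size))
```

Criteria such as BIC, AIC or adjusted R2 would each pick a model size from the training half alone; comparing their choices against `best_size` mirrors the comparison reported in Figure 11.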

Chemometrics is the discipline concerned with the application of statistical and mathematical methods to chemical data [2.18]. Multiple linear regression, partial least squares regression and principal component analysis are methods that can be used to design or select optimal measurement procedures and experiments, or to extract the maximum relevant chemical information from chemical data analysis. Common areas addressed by chemometrics include multivariate calibration, visualisation of data and pattern recognition. Biometrics is concerned with the application of statistical and mathematical methods to biological or biochemical data. [Pg.31]

Big Chemical Encyclopedia
