Chance correlation

Their advantages are that they are simple to use and transparent; that is, the descriptors that best model the biological activity can be seen and, hopefully, understood. Their disadvantages are that they work best when restricted to congeneric series of compounds, they assume that the biological activity is a rectilinear function of each descriptor, and they carry a high risk of chance correlations, especially when a large pool of descriptors is used. [Pg.477]

Concerning the last point, Topliss and Costello [42] proposed that, to minimize the risk of chance correlations, a QSAR developed with MLR should utilize at least five data points (compounds) for each descriptor included in the equation. Later work [17] showed that it was necessary to take into account not only the number of descriptors in the QSAR (usually several) but also the whole of the descriptor pool (often several hundred) from which the best descriptors were selected. [Pg.477]
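The effect of the descriptor pool can be illustrated with a small simulation. The sketch below is not the procedure of either cited study; it assumes purely random descriptors and activities, a greedy forward selection, and hypothetical function names (`fit_r2`, `forward_select_r2`, `mean_chance_r2`). It compares the average r² obtained by chance when the descriptors in the equation are picked from a small or a large pool.

```python
import numpy as np

def fit_r2(X, y):
    """r^2 of an ordinary least-squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_tot

def forward_select_r2(X, y, n_selected):
    """Greedily add the descriptor that raises r^2 most; return the final r^2."""
    chosen = []
    best = 0.0
    for _ in range(n_selected):
        scores = {}
        for j in range(X.shape[1]):
            if j not in chosen:
                scores[j] = fit_r2(X[:, chosen + [j]], y)
        j_best = max(scores, key=scores.get)
        chosen.append(j_best)
        best = scores[j_best]
    return best

def mean_chance_r2(n_compounds, pool_size, n_selected, n_trials=20, seed=0):
    """Average r^2 reached on random data (no real structure-activity relation)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        X = rng.normal(size=(n_compounds, pool_size))  # random descriptor pool
        y = rng.normal(size=n_compounds)               # random "activities"
        values.append(forward_select_r2(X, y, n_selected))
    return float(np.mean(values))

# 15 compounds and 3 descriptors in the equation (five compounds per descriptor)
print("pool of 3 :", mean_chance_r2(15, 3, 3))
print("pool of 60:", mean_chance_r2(15, 60, 3))
```

Both runs use the same number of compounds per descriptor in the final equation; only the size of the pool searched differs, which is the factor the later work points to.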

Topliss JG, Costello RJ. Chance correlations in structure-activity studies using multiple regression analysis. J Med Chem 1972; 15: 1066-9. [Pg.490]

In QSAR equations, n is the number of data points, r is the correlation coefficient between the observed values of the dependent variable and the values predicted from the equation, r² is the square of the correlation coefficient and represents the goodness of fit, q² is the cross-validated r² (a measure of the quality of the QSAR model), and s is the standard deviation. The cross-validated r² (q²) is obtained by using the leave-one-out (LOO) procedure [33]. Q is the quality factor (quality ratio), where Q = r/s. Chance correlation, due to the excessive number of parameters (which also increases the r and s values), can ... [Pg.47]
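The statistics named above can be computed for a multiple linear regression as in the following minimal sketch. Several details are assumptions rather than statements from the source: s is taken as the residual standard deviation with n - k - 1 degrees of freedom (k descriptors), the cross-validated value is computed as 1 - PRESS/SS_tot, and the function names `ols_predict` and `qsar_statistics` are hypothetical.

```python
import numpy as np

def ols_predict(X_train, y_train, X_new):
    """Fit least squares with an intercept on the training data, predict X_new."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def qsar_statistics(X, y):
    """Return n, r, r^2, leave-one-out q^2, s and Q = r/s for an MLR model."""
    n, k = X.shape
    y_fit = ols_predict(X, y, X)
    r = float(np.corrcoef(y, y_fit)[0, 1])      # observed vs. fitted values
    ss_res = float(np.sum((y - y_fit) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    s = float(np.sqrt(ss_res / (n - k - 1)))    # assumed definition of s

    # Leave-one-out: refit without compound i, then predict compound i
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        y_hat = ols_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - y_hat) ** 2
    q2 = 1.0 - press / ss_tot

    return {"n": n, "r": r, "r2": r * r, "q2": q2, "s": s, "Q": r / s}

# Synthetic example: 20 "compounds", 2 descriptors, small noise
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=20)
print(qsar_statistics(X, y))
```

For a least-squares fit, q² cannot exceed r², so a large gap between the two is one sign that the fit is less robust than r² alone suggests.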

Descriptors used to characterize molecules in QSAR studies should be as independent of each other (orthogonal) as possible. When correlated parameters are used, there is an increased danger of obtaining a non-predictive chance correlation [56]. To examine the correlation between PSA (calculated according to the fragment-based protocol [10]) and other descriptors, we studied a collection of 7010 bioactive molecules from the PubChem database [57]. In addition to PSA, the following parameters were used ... [Pg.121]

Strongly collinear variables must be eliminated by removing all but one variable from each group of strongly correlated variables. Otherwise, spurious chance correlations may result. [Pg.398]
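One simple way to apply this rule is to filter the descriptor matrix on pairwise correlation. The sketch below is an illustration only: the cutoff of 0.9, the keep-the-first-seen policy, and the function name `drop_collinear` are assumptions, not values or methods given in the source. It also assumes that no descriptor column is constant, since a constant column has no defined correlation.

```python
import numpy as np

def drop_collinear(X, names, threshold=0.9):
    """Keep a descriptor only if |r| with every already kept one is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # descriptor-by-descriptor matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic example: d2 is almost a rescaled copy of d1, d3 is independent
rng = np.random.default_rng(0)
d1 = rng.normal(size=50)
d2 = 2.0 * d1 + 0.01 * rng.normal(size=50)
d3 = rng.normal(size=50)
X = np.column_stack([d1, d2, d3])
print(drop_collinear(X, ["d1", "d2", "d3"]))
```

Which member of a correlated group is kept depends on column order here; in practice the choice is often made on interpretability instead.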

In order to obtain reliable models (minimize the probability of chance correlations) it is necessary to consider the ratio /Jdf/v ... [Pg.715]

A widely used approach to establish model robustness is the randomization of the response [25] (i.e., in our case, of the activities). It consists of repeating the calculation procedure with randomized activities and subsequently assessing the probability of the resulting statistics. Frequently, it is used along with cross-validation. Sometimes, models based on the randomized data have high q² values, which can be explained by a chance correlation or structural redundancy [26]. If all QSAR models obtained in the Y-randomization test have relatively high values for both R² and LOO q², it implies that an acceptable QSAR model cannot be obtained for the given dataset by the current modeling method. [Pg.439]
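A minimal sketch of such a randomization test for a linear model follows. It makes several simplifying assumptions: the descriptor set is fixed (a full test would also repeat any descriptor selection for each shuffled response), the model is ordinary least squares, and the function names `loo_q2` and `y_randomization` are hypothetical. The LOO residuals are obtained from the hat matrix, which for least squares gives the same result as refitting with each compound left out.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out q^2 of a least-squares model via the hat-matrix identity."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                    # hat (projection) matrix
    resid = y - H @ y
    loo_resid = resid / (1.0 - np.diag(H))       # residual of each left-out point
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_permutations=200, seed=0):
    """Compare q^2 of the real activities with q^2 of shuffled activities."""
    rng = np.random.default_rng(seed)
    q2_true = loo_q2(X, y)
    q2_random = np.array(
        [loo_q2(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    # Fraction of shuffled models that do at least as well as the real one
    p_value = (np.sum(q2_random >= q2_true) + 1) / (n_permutations + 1)
    return q2_true, q2_random, p_value

# Synthetic example with a genuine relation between descriptors and activity
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = 0.5 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.2 * rng.normal(size=30)
q2_true, q2_random, p_value = y_randomization(X, y)
print("q2 (real):", q2_true)
print("q2 (shuffled, max):", q2_random.max())
print("p-value:", p_value)
```

If the shuffled models reach q² values close to that of the real model, the apparent fit is likely to be a chance correlation.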

Calculated descriptors have generally fallen into two broad categories: those that seek to model an experimentally determined or physical descriptor (such as ClogP or CpKa) and those that are purely mathematical [such as the Kier and Hall connectivity indices (4)]. Not surprisingly, the latter category has been heavily populated over the years, so much so that QSAR/QSPR practitioners have had to rely on model validation procedures (such as leave-k-out cross-validation) to avoid models built upon chance correlation. Of course, such procedures are far less critical when very few descriptors are used (such as with the Hansch, Leo, and Abraham descriptors); it can even be argued that they are unnecessary. [Pg.262]

To avoid chance correlations, three series of hydrocarbons with wide structural variations were considered: 37 alkanes, 36 polyalkylcyclohexanes and 48 monocyclic structures (the full results are reported elsewhere [93a]). [Pg.51]

Although PCR and PLS are powerful methods, they are not without their limitations. In particular, implicit calibration methods can be susceptible to chance correlations. Thus, when the calculated b is applied to a future spectrum in which those correlations are not present, increased error is likely. It may be possible to improve implicit calibration and limit spurious correlations by incorporating additional information about the system or analytes. This combination of features from implicit and explicit calibration methods is termed hybrid calibration. [Pg.338]

As for all statistical approaches, there is a need for robust validation to ensure that the resulting models are not misleading due to chance correlation and that these models show predictive power for novel molecules. Hence, successful application depends on the chosen validation strategy, which combines internal validation with evaluation on an external test set. Furthermore, the choice of a set of appropriate descriptors leading to directly interpretable 3D-QSAR models is crucial for interpretation and discussions with medicinal chemists. [Pg.422]

From an inspection of the RSQUARE output, the five-variable equation with the highest correlation was selected for a more complete regression analysis. The five-variable equation (Equation 3) represents the best balance between high correlation and economy in the number of variable parameters. A serious disadvantage of having numerous independent variables in an empirical equation is the increased risk of a chance correlation (12). Consequently, the number of experimental observations required to establish statistical significance increases rapidly with the number of independent variables. In this study, l i experimental determinations were required to obtain statistical significance at the 95% confidence level. [Pg.111]

To obtain a statistically sound QSAR, it is important that certain caveats be kept in mind. One needs to be cognizant of collinearity between variables and of chance correlations. Use of a correlation matrix helps ensure that variables of significance and/or interest are orthogonal to each other. With the rapid proliferation of parameters, caution must be exercised against amassing too many variables for a QSAR analysis. Topliss has elegantly demonstrated that there is a high risk of ending up with a chance correlation when too many variables are tested (62). [Pg.10]

It has been shown that the more independent variables are involved in MLR QSAR analysis, the higher the probability of a chance correlation between predicted and observed activities, even if only a small portion of variables is included in the final QSAR equation (16). This conclusion is true not only for MLR QSAR, but also for any QSAR approach when the number of variables (descriptors) is comparable to or higher than the number of compounds in a data set. Thus, model validation is one of the most important aspects of QSAR analysis. [Pg.64]
