Experimental Solubility Datasets

Refinement of a QSPR model requires experimental solubilities to train the model. Several models have used the dataset of Huuskonen [44], who sourced experimental data from the AQUASOL [45] and PHYSPROP [46] databases. The original set had a small number of duplicates, which have been removed in most subsequent studies using this dataset, leaving 1290 compounds. When combined, the log Sw [Pg.302]

It is usual in developing a QSPR to split the database into two. One part is used for training the model, while the other part is used to validate the model. This goes to the predictability of the model: the model is assumed to be predictive if it can predict the solubility of the validation set. Since the validation set is used intimately with the training set to refine the model, it is questionable whether this partitioning is warranted [50]. This partitioning is particularly questionable where the available experimental data are sparse. [Pg.303]
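The partitioning described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 20% validation fraction, the fixed seed and the placeholder records are assumptions, not details taken from the cited studies.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle records and split them into (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical records: (compound identifier, experimental log S).
records = [(f"compound_{i}", -0.01 * i) for i in range(1290)]
training, validation = split_dataset(records, validation_fraction=0.2, seed=42)
print(len(training), len(validation))  # 1032 258
```

If the validation part is also consulted while the model is being refined, it no longer gives an independent estimate of predictivity, which is the concern raised in the passage.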

With all QSPR studies it is not possible to separate the influence of the data used to train the model and the computational approach used to derive the model from the final model. Ideally, the QSPR should be sufficiently general to be applied to any compound that is reasonably represented by the data used to derive the model. [Pg.303]

NN models for the three datasets contained the same number of descriptors as the MLR models, yet no more than two descriptors in each model were the same in both NN and MLR models. No descriptor was common to all models, although each model contained a descriptor that relied on H-bonding in some manner. Nonlinear modeling from the NN approach gave better representation of the data than the linear models from MLR; the value for the three datasets was 0.88, 0.98 and 0.90, respectively. [Pg.304]

From the results described above it is clear that a different QSPR model can be obtained depending on what data is used to train the model and on the method used to derive the model. This state of affairs is not so much a problem if, when using the model to predict the solubility of a compound, it is clear which model is appropriate to use. The large disparity between models also highlights the difficulty in extrapolating any physical significance from the models. Common to all models described above is the influence of H-bonding, a feature that does at least have a physical interpretation in the process of aqueous solvation. [Pg.304]

In addition to confounding experimental factors, a number of published solubility models are somewhat misleading due to a lack of proper computational controls. While we sometimes have limited control over the experimental data used to build models, we have complete control over the way models are evaluated and should always employ appropriate means of evaluating our models. In subsequent sections, we will use solubility datasets to examine some of these control strategies. [Pg.3]

The Huuskonen Dataset. This set of 1274 experimental solubility values (Log S) was one of the first large solubility datasets published [15,16] and has subsequently been used in a number of other publications [14,17]. The data in this set were extracted from the AQUASOL [18,19] database, compiled by the Yalkowsky group at the... [Pg.3]

The JCIM Dataset. This is a set of 94 experimental solubility values that were published as the training set for a blind challenge published in 2008 [21]. All of the solubility values reported in this paper were measured by a single group under a consistent set of conditions. The objective of this challenge was for groups to use a consistently measured set of solubility values to build a model that could subsequently be used to predict the solubility of a set of test compounds. Results of the challenge were reported in a subsequent paper in 2009 [22]. [Pg.4]

Delaney [4,14] and Klamt [16] argued that for drug-like compound datasets only about 20% of the variance of log S arises from ΔGfus. This is further confirmed by the study of Wassvik et al. [15], in which 77% of the variance is due to the solubility of the supercooled liquid. Hence, applying crude estimates by mean values or by QSAR approaches, we can reasonably expect that the inaccuracies introduced in drug solubility prediction by our theoretical ignorance of ΔGfus are less than, or at least not much bigger than, the inaccuracies introduced by the estimates of the larger part, i.e. the liquid solubility, and by the experimental difficulties in solubility measurement. [Pg.291]
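The variance argument can be made explicit with the standard thermodynamic split of solid solubility into a supercooled-liquid part and a fusion part. The relation below is a sketch under stated assumptions: the free-energy term in the passage is taken to be the free energy of fusion, written here as \Delta G_{\mathrm{fus}} and counted as positive for a solid below its melting point, S_{\mathrm{liq}} denotes the solubility of the supercooled liquid, and logarithms are base 10. None of these symbols are defined in the excerpt itself.

```latex
\log S = \log S_{\mathrm{liq}} - \frac{\Delta G_{\mathrm{fus}}}{RT \ln 10}

\operatorname{Var}(\log S)
  = \operatorname{Var}(\log S_{\mathrm{liq}})
  + \operatorname{Var}\!\left(\frac{\Delta G_{\mathrm{fus}}}{RT \ln 10}\right)
  - 2\,\operatorname{Cov}\!\left(\log S_{\mathrm{liq}},\ \frac{\Delta G_{\mathrm{fus}}}{RT \ln 10}\right)
```

The quoted shares of roughly 20% and 77% can be read as the two variance terms only to the extent that the covariance term is small.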

As a key first step towards oral absorption, considerable effort has been directed towards the development of computational solubility prediction [26-30]. However, partly due to a lack of large experimental datasets measured under identical conditions, today's methods are not sufficiently robust for reliable predictions [31]. Nonetheless, further fine-tuning of these models can be expected since high-throughput data have become available for their construction. [Pg.7]

Are the equilibrium constants for the important reactions in the thermodynamic dataset sufficiently accurate? The collection of thermodynamic data is subject to error in the experiment, chemical analysis, and interpretation of the experimental results. Error margins, however, are seldom reported and never seem to appear in data compilations. Compiled data, furthermore, have generally been extrapolated from the temperature of measurement to that of interest (e.g., Helgeson, 1969). The stabilities of many aqueous species have been determined only at room temperature, for example, and mineral solubilities many times are measured at high temperatures where reactions approach equilibrium most rapidly. Evaluating the stabilities and sometimes even the stoichiometries of complex species is especially difficult and prone to inaccuracy. [Pg.24]

Jouyban et al. (2004) applied ANN to calculate the solubility of drugs in water-cosolvent mixtures, using 35 experimental datasets. The networks employed were feedforward networks trained by back-propagation of errors, with one hidden layer. The topology of the neural network was optimized to a 6-5-1 architecture. All data points in each set were used to train the ANN and the solubilities were back-calculated employing the trained networks. The difference between calculated solubilities and experimental... [Pg.55]
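To make the 6-5-1 topology concrete, the following sketch builds a feedforward network with six inputs, five hidden units and one output, and runs a single forward pass. It is not the published model: the sigmoid hidden activation, the linear output unit, the bias terms, the random initial weights and the input values are all assumptions made for illustration, and the back-propagation training step is omitted.

```python
import math
import random


def make_network(layer_sizes, seed=0):
    """Create randomly initialised (weights, biases) pairs for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate inputs through the network: sigmoid hidden layers, linear output."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_output_layer = index == len(layers) - 1
        activations = sums if is_output_layer else [
            1.0 / (1.0 + math.exp(-s)) for s in sums]
    return activations


def count_parameters(layers):
    """Count all weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = make_network([6, 5, 1])
print(count_parameters(network))  # 41 = (6*5 + 5) + (5*1 + 1)
# Hypothetical, already-scaled values for the six inputs.
prediction = forward(network, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(prediction)
```

With bias terms included, such a network has 41 adjustable parameters. That count is worth comparing with the number of data points in each set when all points are used for training and the solubilities are then back-calculated.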

Experimental values or van Krevelen's group contributions were used when available, and correlations developed in terms of connectivity indices in Chapter 3 were used otherwise, for the molar volume at room temperature. It is important to note, however, that it would not have been possible to calculate the solubility parameters of a very large number of the polymers in our dataset in the absence of the new correlations developed in Chapters 3 and 5. [Pg.236]

Tucson, AZ, 1990) and SRC's PHYSPROP Database (Syracuse Research Corporation. Physical/Chemical Property Database (PHYSPROP) SRC Environmental Science Center Syracuse, NY, 1994). The experimental aqueous solubility values for the investigated compounds were measured between 20 and 25 °C. The logS values of the dataset range from −11.62 to +1.58. [Pg.1038]

TABLE 9 J Comparison of COSMOquick Solubility Predictions for the Paracetamol Dataset [36] Using Runs without Any Reference (Relative), with One to Six Experimental Solubilities... [Pg.220]

The PubChem Dataset. A randomly selected subset of 1000 measured solubility values selected from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND) by the Sanford-Burnham Medical Research Institute and deposited in the PubChem database (AID 1996) [23]. This dataset is composed primarily of screening compounds from the NIH Molecular Libraries initiative and can be considered representative of the types of compounds typically found in early stage drug discovery programs. Values in this dataset were reported with a qualifier < to indicate whether the values were below,... [Pg.4]
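Drawing a random subset of this kind is simple to reproduce. The sketch below assumes the 58,000 deposited measurements are represented by integer indices and uses an arbitrary seed. The function name `sample_subset` is hypothetical, and the code does not reflect how the original authors made their selection.

```python
import random


def sample_subset(values, subset_size, seed=0):
    """Return subset_size distinct items drawn at random, without replacement."""
    if subset_size > len(values):
        raise ValueError("subset_size exceeds the number of available values")
    # A seeded generator makes the selection repeatable.
    return random.Random(seed).sample(values, subset_size)


# Placeholder indices standing in for the 58,000 deposited measurements.
all_indices = list(range(58000))
subset = sample_subset(all_indices, 1000, seed=1)
print(len(subset), len(set(subset)))  # 1000 1000
```

Recording the seed alongside the subset lets others rebuild exactly the same 1000 compounds.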
