Binary descriptor database

Wavelet transforms are useful for compressing descriptors for searches in binary descriptor databases and as alternative representations of molecules for neural networks in classification tasks. [Pg.97]

If training has been performed in reverse mode, a descriptor command is available instead of a property command; it opens a chart comparing two descriptors. In contrast to a property vector, a descriptor can be searched for directly in a binary descriptor database (e.g., to find corresponding structures). The result window then contains a hit list and two three-dimensional molecule models: one displaying the original molecule of the test set entry (if available), and the other showing the molecule of the currently selected entry in the hit list of similar molecules. [Pg.157]

A molecule with an RDF descriptor most similar to the one retrieved from the neural network is searched in the binary descriptor database using the minimum RMS error or the highest correlation coefficient between the descriptors. [Pg.184]
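
The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the database is modelled as a plain dictionary of descriptor lists, and the names `rms_error`, `correlation`, `best_match`, and the example entries are hypothetical, not taken from the software discussed in the source.

```python
import math

def rms_error(a, b):
    """Root mean square difference between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length descriptors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant descriptor counts as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def best_match(query, database, criterion="rms"):
    """Return the key of the database entry most similar to the query."""
    if criterion == "rms":
        return min(database, key=lambda k: rms_error(query, database[k]))
    if criterion == "correlation":
        return max(database, key=lambda k: correlation(query, database[k]))
    raise ValueError("criterion must be 'rms' or 'correlation'")

# Hypothetical four-component descriptors standing in for database entries.
database = {
    "mol_a": [0.0, 1.0, 0.5, 0.2],
    "mol_b": [0.9, 0.1, 0.0, 0.3],
    "mol_c": [0.1, 0.8, 0.6, 0.1],
}
query = [0.1, 0.9, 0.5, 0.2]  # e.g., a descriptor returned by the network

hit_rms = best_match(query, database, "rms")
hit_corr = best_match(query, database, "correlation")
```

The two criteria need not agree in general: RMS error is sensitive to absolute intensity, whereas the correlation coefficient compares only the shape of the descriptors.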

The query compound is treated as unknown; that is, only its infrared spectrum is used for prediction. The molecule is predicted by searching a binary descriptor database for the most similar descriptors. The database contains compressed, low-pass filtered, D20-transformed RDF descriptors of 64 components each. The descriptors originally used for training (Cartesian RDF, 128 components) were compressed in the same way before the search. [Pg.184]

To demonstrate the use of binary substructure descriptors and Tanimoto indices for cluster analysis of chemical structures, we consider the 20 standard amino acids (Figure 6.3) and characterize each molecular structure by eight binary variables describing the presence or absence of eight substructures (Figure 6.4). Note that most practical applications (for instance, the evaluation of results from searches in structure databases) involve more diverse molecular structures, and usually several hundred different substructures are considered. Table 6.1 contains the binary substructure descriptors (variables), with value 0 if the substructure is absent and 1 if it is present in the amino acid; these numbers form the A-matrix. The binary substructure descriptors were calculated with the software SubMat (Scsibrany and Varmuza 2004), which requires as input the molecular structures in one file and the substructures in another. All structures are in Molfile format (Gasteiger and Engel 2003); the output is an ASCII file with the binary descriptors. [Pg.270]
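
As a minimal sketch of the Tanimoto index for binary substructure descriptors: the two eight-bit vectors below are invented for illustration and are not rows of Table 6.1, and the function name `tanimoto` is an assumption.

```python
def tanimoto(a, b):
    """Tanimoto index of two equal-length binary vectors.

    Computed as (bits set in both) / (bits set in at least one).
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 1.0  # convention chosen here: two empty vectors are identical
    return both / either

# Hypothetical presence/absence vectors for eight substructures.
mol_1 = [1, 1, 0, 0, 1, 0, 0, 0]
mol_2 = [1, 0, 0, 0, 1, 1, 0, 0]

similarity = tanimoto(mol_1, mol_2)  # 2 shared bits / 4 bits set in either
distance = 1.0 - similarity          # a common dissimilarity for clustering
```

For cluster analysis, the pairwise values `1 - tanimoto(...)` over all rows of the binary matrix give a distance matrix that a hierarchical clustering routine can take as input.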

The Aspen NRTL-SAC solvent database was identified from the list of solvents presented in the pharmaceutical-based International Committee on Harmonization's guidelines for residual solvents in API [28]. Hexane, acetonitrile, and water were selected as the basis for the X, Y, and Z segments, respectively. The binary interaction parameters for the segments, together with molecular descriptors in terms of X, Y, and Z segments, were then regressed from experimental vapour-liquid and liquid-liquid equilibrium data from the Dechema database. The solvent parameters used in the case study are given in Table 13. [Pg.54]

Fig. 1. Median partitioning and compound selection. In this schematic illustration, a two-dimensional chemical space is shown as an example. The axes represent the medians of two uncorrelated (and, therefore, orthogonal) descriptors and dots represent database compounds. In A, a compound database is divided into equal subpopulations in two steps and each resulting partition is characterized by a unique binary code (shared by molecules occupying this partition). In B, diversity-based compound selection is illustrated. From the center of each partition, a compound is selected to obtain a representative subset. By contrast, C illustrates activity-based compound selection. Here, a known active molecule (gray dot) is added to the source database prior to MP and compounds that ultimately occur in the same partition as this bait molecule are selected as candidates for testing. Finally, D illustrates the effects of descriptor correlation. In this case, the two applied descriptors are significantly correlated and the dashed line represents a diagonal of correlation that affects the compound distribution. As can be seen, descriptor correlation leads to over- and underpopulated partitions.
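
Panels A and C of the figure can be sketched roughly as follows. The code assumes that every descriptor is split once at its median over the whole compound set and that values above the median map to bit 1; the compound names, descriptor values, and the function `median_partition` are hypothetical.

```python
from statistics import median

def median_partition(compounds):
    """Group compounds by a binary code with one bit per descriptor.

    compounds maps a name to a tuple of descriptor values. A bit is '1'
    if the value lies above that descriptor's median, otherwise '0'.
    """
    n_desc = len(next(iter(compounds.values())))
    medians = [median(v[i] for v in compounds.values()) for i in range(n_desc)]
    partitions = {}
    for name, values in compounds.items():
        code = "".join("1" if values[i] > medians[i] else "0"
                       for i in range(n_desc))
        partitions.setdefault(code, []).append(name)
    return partitions

# Hypothetical two-descriptor database; "bait" is a known active molecule
# added before partitioning (activity-based selection, panel C).
compounds = {
    "c1": (1.0, 5.0),
    "c2": (2.0, 1.0),
    "c3": (3.0, 6.0),
    "c4": (4.0, 2.0),
    "c5": (5.0, 8.0),
    "bait": (4.5, 7.0),
}

partitions = median_partition(compounds)
bait_code = next(code for code, names in partitions.items() if "bait" in names)
candidates = [name for name in partitions[bait_code] if name != "bait"]
```

With uncorrelated descriptors the partitions are populated roughly evenly; correlated descriptors concentrate compounds along the diagonal, which is the over- and underpopulation shown in panel D.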

We use assay data from a National Cancer Institute HIV/AIDS database in our study (http://dtp.nci.nih.gov/docs/aids/aids_data.html). As descriptors, we apply a set of six BCUT descriptors and a set of 46 constitutional descriptors computed by the Dragon software. These descriptors could be computed for 29,374 of the compounds in the database. The assay classifies each compound as confirmed inactive (CI), moderately active (CM), or confirmed active (CA). We treat the data as a binary classification problem with two classes: inactive (CI) and active (CM or CA). According to this classification, 542 (about 1.8%) of the compounds are active. [Pg.308]

Analysis of molecular similarity is based on the quantitative determination of the overlap between the fingerprints of the query structure and those of all database members. Because the descriptors of a given molecule can be regarded as a vector of real or binary attributes, most similarity measures are derived as vectorial distances. The Tanimoto and cosine coefficients are the most popular measures of similarity. Definitions of similarity metrics are collected in Table 3. [Pg.4017]
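
For descriptor vectors with real-valued attributes, the two coefficients named above are commonly written in terms of dot products. The sketch below uses these widely used forms; the definitions in Table 3 of the source are not reproduced here and may differ in notation, and the function names and example vectors are assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine coefficient: a.b / (|a| * |b|)."""
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0  # assumption: zero vector -> 0

def tanimoto_continuous(a, b):
    """Tanimoto coefficient for real vectors: a.b / (a.a + b.b - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0  # assumption: zero vectors -> 0

# Hypothetical fingerprints; binary vectors are a special case of real ones.
query = [1.0, 0.0, 1.0]
entry = [1.0, 1.0, 0.0]

cos_value = cosine(query, entry)
tan_value = tanimoto_continuous(query, entry)
```

For vectors containing only 0 and 1, the continuous form reduces to the ratio of shared bits to bits set in either vector.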

Low-pass, or coarse-filtered, wavelet transforms are valuable as compressed representations of RDF descriptors for fast similarity searches in binary databases. Decreasing the resolution generally reduces the size of an RDF descriptor. However, coarse-filtered wavelet transforms, which are already half the size of a nontransformed descriptor, conserve more information than the corresponding RDF descriptors. Figure 5.24 shows a comparison of a filtered transform and an RDF descriptor. Even though both functions have the same size (i.e., same number of components), the RDF used for transform originally had a higher resolution and size. This is the reason why additional peaks appear in the transform that do not occur in the RDF descriptor. [Pg.151]
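
The halving of descriptor size by a low-pass wavelet filter can be illustrated with a single step of the Haar transform. Haar is swapped in here purely for brevity; it is the simplest wavelet and not the D20 filter mentioned elsewhere on this page. The Gaussian test signal and the name `haar_lowpass` are hypothetical.

```python
import math

def haar_lowpass(signal):
    """One level of the Haar low-pass (coarse) filter.

    Each pair of neighbouring components is combined into one scaled sum,
    so the output has half as many components as the input.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling
    return [(signal[i] + signal[i + 1]) * scale
            for i in range(0, len(signal), 2)]

# Hypothetical 128-component descriptor: a single Gaussian peak.
rdf = [math.exp(-((i - 40) / 6.0) ** 2) for i in range(128)]

compressed = haar_lowpass(rdf)  # 64 coarse coefficients
```

Longer filters such as Daubechies wavelets follow the same filter-and-downsample scheme but use more neighbouring components per output coefficient, which generally gives a smoother coarse representation.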

An entry in a data set table can be compared directly, via root mean square or correlation coefficient, with one of the default binary databases. A new window opens that contains two areas with three-dimensional molecule models: the left one displays the original molecule of the selected data set entry. The window shows the molecule and a match list containing the most similar entries of the database according to the selected similarity criterion (descriptor), with their sequential numbers, names, and calculated similarity measures. [Pg.155]

Another type of binary file is the binary database. This type is similar to the binary molecule set but is especially designed for fast searches of similar descriptors and the retrieval of the corresponding molecule. An example of the use of binary databases is the prediction of a descriptor from the neural network after a reverse training. [Pg.155]

Descriptors based on pattern functions are helpful tools for a quick recognition of substructures. A pattern-search algorithm based on binary pattern descriptors can then be used for substructure search. However, patterns and other characteristics of descriptors that seem to indicate unique features should be investigated carefully. With these descriptors 3D similarity searches for complete structures or substructures in large databases are possible and computationally very efficient. In addition, descriptors can serve as the basis for a measure for the diversity of compounds in large data sets, a topic that is of high interest in combinatorial chemistry. [Pg.162]

The computation times are much higher than in the database approach: the recalculations in the modeling process must be performed on each relevant initial model found in the database. Depending on the number of operations, this leads to between approximately 500 and 5000 recalculations of new 3D models and RDF descriptors for each initial model. With about 100,000 compounds in the binary database for the initial models, this can result in several million calculations per prediction if several initial models are to be considered. The method can be improved by implementing a fast 3D structure generator in the prediction software. In this case, a reliable 3D structure is calculated directly after each modeling operation. [Pg.190]

Compressed descriptor for similarity searches in binary databases. Classification and prediction tasks. [Pg.397]

The development of 3D pharmacophore methods has spawned a new type of descriptor, the pharmacophore key. This is an extension of the binary keys used to facilitate 3D database... [Pg.674]
