Data type SMILES

The following SQL defines a domain data type smiles. [Pg.86]

Chapter 7 introduces ways in which RDBMS can be used to handle chemical structural information using SMILES and SMARTS representations. It shows how extensions to relational databases allow chemical structural information to be stored and searched efficiently. In this way, chemical structures themselves can be stored in data columns. Once chemical structures become proper data types, many search and computational options become available. Conversion between different chemical structure formats is also discussed, along with input and output of chemical structures. [Pg.2]

Operators such as + and ||, and functions such as sqrt, round, and upper, can be used with these data types. SQL has the ability to search data, using comparison operators such as =, <, and the like. The goal of the SQL extensions is to enable SMILES to be handled as readily as any standard data type. This requires that SQL be extended to validate and standardize, or canonicalize, SMILES. In addition, these SQL extensions provide functions and operators to allow comparisons and searches of molecular structures stored as SMILES. [Pg.73]

The recommendation here is to use SMILES to store the molecular structure itself. If other features of the molecule or atoms need to be stored, other data types and columns can be added to the row describing the molecule. It is the "SQL way" to not encode a lot of information into one data type. When using a molfile as the structural data type, too much data is encoded in a single data type. The individual data items must be parsed and validated. Errors creep into the data, due to missing, extra, or invalid portions of the molfile. Ways of storing atomic coordinates, atom types, and molecular properties are discussed in Chapter 11. [Pg.84]

The standard SQL data type Text has been used to store SMILES. This is appropriate because every SMILES is a valid text string. But not every text string is a valid SMILES. Without additional information about SMILES, the RDBMS cannot enforce any rules about which text strings ought to be in a column intended to contain SMILES. [Pg.86]

The SQL domain allows one to define which values are to be allowed in a particular column of a table. A domain is created by stating the underlying built-in SQL data type used to store the domain data type. In addition, a check constraint function may be used to allow or forbid certain values. This can be used to great advantage for SMILES and canonical SMILES. Using a domain improves the ability of the RDBMS to maintain the integrity of the data contained in its tables. [Pg.86]

The use of the keyword Value is required. Value refers to the value of the data element, here the SMILES. Once this domain is created, it can be used as a data type in the creation of a table. For example ... [Pg.86]

Using a domain like this, the smiles data type behaves much like a standard data type. When one attempts to insert an invalid number into a numeric column, an SQL error is reported and the value is not inserted. This fundamental behavior of an RDBMS is readily extended to SMILES using a domain. [Pg.86]
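
As a rough illustration of the behavior described above, the sketch below uses Python's built-in sqlite3 module. SQLite has no CREATE DOMAIN statement, so the domain is emulated here with a column CHECK constraint that calls a user-defined function. The function valid_smiles, the table structure, and its columns are hypothetical names, and the check itself is only a toy syntax test (allowed characters, balanced parentheses and brackets), not a real SMILES validator.

```python
import sqlite3

# Characters accepted by this toy check (an assumption, far from the full SMILES grammar).
ALLOWED = set("BCNOPSFIHbcnopslr0123456789()[]=#+-@/\\%.:")


def valid_smiles(text):
    """Return 1 if text passes a crude syntax check, else 0.

    Hypothetical stand-in for a real validator: it only checks the
    alphabet and that parentheses and square brackets are balanced.
    """
    if not isinstance(text, str) or not text:
        return 0
    if any(ch not in ALLOWED for ch in text):
        return 0
    depth = 0            # current nesting level of branch parentheses
    in_bracket = False   # True while inside a [ ... ] atom
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return 0
        elif ch == "[":
            if in_bracket:
                return 0
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return 0
            in_bracket = False
    return 1 if depth == 0 and not in_bracket else 0


def open_db():
    """Create an in-memory database whose smiles column rejects invalid values."""
    conn = sqlite3.connect(":memory:")
    # Registered as deterministic so it can be used in a schema-level CHECK.
    conn.create_function("valid_smiles", 1, valid_smiles, deterministic=True)
    conn.execute(
        "CREATE TABLE structure ("
        " name TEXT,"
        " smiles TEXT CHECK (valid_smiles(smiles))"
        ")"
    )
    return conn


if __name__ == "__main__":
    conn = open_db()
    conn.execute("INSERT INTO structure VALUES (?, ?)", ("benzene", "c1ccccc1"))
    try:
        conn.execute("INSERT INTO structure VALUES (?, ?)", ("junk", "not a smiles!"))
    except sqlite3.IntegrityError as err:
        # The invalid value is refused, just as a bad number would be in a numeric column.
        print("rejected:", err)
```

In an RDBMS that supports domains, the same check would be attached once to the domain rather than repeated on each column that stores SMILES.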

Why use the domain to define a smiles data type, but use a trigger for canonical SMILES? First, SMILES is either valid or not. It is not feasible to... [Pg.87]

The use of the Simplified Molecular Input Line Entry System (SMILES) as a string representation of chemical structure makes possible much of what has been discussed in earlier chapters of this book. A chemical reaction could be represented as a collection of SMILES, some identified as reactants and some as products. It is possible to define a table to do this, or perhaps use some arrays of character data types, but a syntax extension of standard SMILES allows reactions to be expressed easily. SMIRKS is an extension of SMILES and SMiles ARbitrary Target Specification (SMARTS). It is used to represent chemical transformations. SMIRKS can also be used in a transformation function to combine SMILES reactants to produce SMILES products. [Pg.99]
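
A minimal sketch of how such a reaction string could be taken apart, assuming the Daylight-style reaction SMILES layout reactants>agents>products, with molecules inside each part separated by ".". The names Reaction and parse_reaction_smiles are hypothetical, and no chemical validation of the individual SMILES is attempted.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    """Hypothetical container for the three parts of a reaction SMILES."""
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of SMILES strings.

    Simplification: every '.'-separated component is treated as its own
    molecule, so multi-component species such as salts are split apart.
    """
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % rxn)

    def split_molecules(field: str) -> List[str]:
        # An empty field (e.g. no agents) yields an empty list.
        return [mol for mol in field.split(".") if mol]

    return Reaction(*(split_molecules(p) for p in parts))


if __name__ == "__main__":
    # Acetic acid + ethanol -> ethyl acetate + water, with no agents listed.
    rxn = parse_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
    print(rxn.reactants, rxn.agents, rxn.products)
```

Storing the whole reaction as one string keeps it in a single column, while a function like this one can recover the reactant and product lists when they are needed.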

In order to keep the appropriate SMILES associated with the corresponding coordinates, consider using a new data type. For example ... [Pg.116]

Although we have made use of SD files up to this point, at this stage we switch to SMILES files (19). This becomes necessary because even for small libraries the file size for a fully enumerated set can be quite large. For example, a sample library of just 2500 compounds resulted in a 4.85 MB SD file, while the SMILES file was only 384 KB. The one caveat with the SMILES format is that there is no standard for handling data fields. Our solution was to reformat the SD file type data field tags into the SMILES file,... [Pg.81]

SMILES strings are very concise and hence are suitable for storing and transporting a large number of molecular structures, while MOLfiles and their extension, SDFiles, have the option to store more complicated molecular data such as 3D molecular conformational information and biological data associated with the molecules. There are many other file formats not discussed here. Interested readers can find a list of file types at the following web site: http://www.ch.ic.ac.uk/chemime/. [Pg.32]

Fig. 2. (a) A part of the ISIS base data-entry tool describing structure information through various fields. The fields in the structure panel are structural information, molecular formula, molecular weight, smiles, reference, authors and their affiliation, target name, and therapeutic use. (b) The activity panel of the tool describes the data-entry fields related to molecular activity in terms of activity type, measured value, enzyme or cell assay, and references from where the information was captured. [Pg.164]

Fig. 3. A mechanism-based toxicity (MBT) curation tool was divided into four panels with (a) general information, (b) metabolism of the molecule, (c) biological data, and (d) toxicity data. The general information panel describes the structure, IUPAC name, and SMILES of the molecule, and the information on metabolites, their structures, and any references. The panel on metabolism describes the toxicity information on individual metabolites and the references from which the information was curated. The biological data panel gives information on which proteins, cells, or organs are affected and the measurement of the effect. The toxicity panel gives information on the effects of the molecule and its metabolites on the species, exposure times, measurement types and their values, effects, and the references.

The software now uses structurally intrinsic parameters for only one QSAR model (LSER), and the results are used to predict one property (acute toxicity) to four aquatic species by one mechanism (nonreactive, non-polar narcosis); however, we intend to continue to refine our equations as databases grow, incorporate other models, predict other properties, and include other organisms. We will attempt to differentiate between modes of toxic action and improve our estimates accordingly. For the widely divergent classes of chemicals and types of environmental behavior, no one model will best describe every situation and no one species is the optimal organism to monitor. As the software evolves, the expert system should choose the best model based on the contaminant, the species, and the property to be predicted (e.g., toxicity or bioaccumulation). In addition, we envision an interactive screen system for data entry that will bypass the SMILES notation and allow the user to describe the molecule by posing a series of questions about the compound's backbone and functional groups. The responses will translate directly into values of LSER variables. [Pg.110]

Another choice for the internal representation of molecular structure is a molfile. It would be possible to construct SQL functions like those described in this chapter that would operate on this type of data. One disadvantage of molfiles is their greater size compared with SMILES. One advantage is that it is possible to store atomic coordinates, which is not possible with SMILES. There are other molecular file formats, but these are substantially the same as a molfile, except perhaps for specific atom types that may be of use in some database applications. [Pg.84]

The MassBank records have a one-to-one relation to a specific mass spectrum. Each record has specific information such as accession number, record file, license, and author, apart from information on the chemical compound regarding its formula, mass, SMILES, InChI identifier, etc. The analytical information available is the instrument type and make, and MSn type data. A typical MassBank record is shown here (Fig. 7.28). [Pg.401]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...