Canonical SMILES table

The cansmiles function can also be used to enforce an SQL constraint that the cansmi column must contain valid canonical SMILES. SQL constraints like this are commonly used to maintain data integrity. For example, the SQL clause check (cansmi = cansmiles(cansmi)) can be used in the initial creation of the table. One might also consider using an SQL trigger to handle an insert or update to a column that is required to contain canonical SMILES. [Pg.74]
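The check clause above can be tried out in a small, self-contained sketch. It assumes Python's sqlite3 module in place of the RDBMS, and the cansmiles function here is a hypothetical lookup table covering three spellings of phenol, not a real canonicalizer. The form it treats as canonical is chosen for illustration only.

```python
import sqlite3

# Hypothetical stand-in for a real cansmiles function. A production database
# would call a cheminformatics toolkit here; this table only knows phenol.
TOY_CANONICAL = {
    "Oc1ccccc1": "Oc1ccccc1",     # form this sketch treats as canonical
    "c1ccccc1O": "Oc1ccccc1",     # same molecule, different atom order
    "C1=CC=CC=C1O": "Oc1ccccc1",  # same molecule, Kekule form
}

def cansmiles(smiles):
    # Unknown input raises KeyError, which SQLite surfaces as an error.
    return TOY_CANONICAL[smiles]

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call cansmiles(...).
conn.create_function("cansmiles", 1, cansmiles, deterministic=True)
conn.execute(
    "CREATE TABLE structure ("
    " id INTEGER PRIMARY KEY,"
    " cansmi TEXT NOT NULL CHECK (cansmi = cansmiles(cansmi)))"
)

def try_insert(smiles):
    """Return True if the row is accepted, False if the check rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO structure (cansmi) VALUES (?)", (smiles,))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Oc1ccccc1")  # already canonical in the toy table
rejected = try_insert("c1ccccc1O")  # valid SMILES, but not the canonical form
```

In this sketch the constraint rejects a valid but non-canonical SMILES, so the column can only ever hold strings that the canonicalizer returns unchanged.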

If canonical SMILES are used in a table to facilitate direct lookup of molecular structure, it is necessary that only one unique name be used for any one structure. Similarly, if one is searching for structures containing nitro groups, it is necessary that all nitro groups be represented using the same valence conventions. For these reasons, it is essential to make a decision about the use of SMILES in certain cases, such as nitro groups. Sulfur and phosphorus atoms also must be considered carefully since they are commonly found with "unusual" valence. [Pg.80]

It is possible to represent chirality in SMILES. This is essential to correctly define the appropriate enantiomer or stereoisomer. Many databases will contain isomers. It is possible to relate the various isomers of a structure by using their common canonical SMILES. This might be done by relaxing the uniqueness constraint on the cansmi column in a structure table, or by adding another table of stereoisomers that is related to the master table. Chirality may be used in SMARTS as well. [Pg.80]
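The second option, a separate stereoisomer table related to the master table, might look like the following sketch. It assumes Python's sqlite3 module. The table and column names (structure, stereoisomer, isosmi) are hypothetical, and the SMILES strings for the two alanine enantiomers are written by hand rather than produced by a canonicalizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure (
    id     INTEGER PRIMARY KEY,
    cansmi TEXT NOT NULL UNIQUE          -- stereo-free canonical SMILES
);
CREATE TABLE stereoisomer (
    id           INTEGER PRIMARY KEY,
    structure_id INTEGER NOT NULL REFERENCES structure (id),
    isosmi       TEXT NOT NULL UNIQUE    -- SMILES with chirality marks
);
""")

# One master row for alanine, two related rows for its enantiomers.
conn.execute("INSERT INTO structure (id, cansmi) VALUES (1, 'CC(N)C(=O)O')")
conn.executemany(
    "INSERT INTO stereoisomer (structure_id, isosmi) VALUES (?, ?)",
    [(1, "C[C@H](N)C(=O)O"), (1, "C[C@@H](N)C(=O)O")],
)

def isomers_of(cansmi):
    """Return the set of stereo SMILES that share one canonical SMILES."""
    cur = conn.execute(
        "SELECT s.isosmi FROM stereoisomer AS s"
        " JOIN structure AS m ON m.id = s.structure_id"
        " WHERE m.cansmi = ?",
        (cansmi,),
    )
    return {row[0] for row in cur}

alanine_isomers = isomers_of("CC(N)C(=O)O")
```

With this layout the uniqueness constraint on cansmi stays in place in the master table, and the isomers are found through a join on the shared canonical SMILES.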

The SQL domain allows one to define which values are to be allowed in a particular column of a table. A domain is created by stating the underlying built-in SQL data type used to store the domain data type. In addition, a check constraint function may be used to allow or forbid certain values. This can be used to great advantage for SMILES and canonical SMILES. Using a domain improves the ability of the RDBMS to maintain the integrity of the data contained in its tables. [Pg.86]

Simplified Molecular Input Line Entry System (SMILES) is a simple, yet complete description of molecular structure that considers the atoms and bonds in a molecule. Using unique canonical SMILES, an indexed table lookup of a structure can be done quickly. For example, the SQL to look up phenol is ... [Pg.91]

When the table contains unique canonical SMILES in an indexed column cansmi, and the cansmiles function provides the proper canonical SMILES for phenol, this lookup is extremely fast. [Pg.91]
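The original lookup statement is not reproduced in this excerpt. As an illustration only, an indexed lookup of this kind could be sketched as follows, assuming Python's sqlite3 module, a hypothetical structure table with name and cansmi columns, and a toy cansmiles function that is a hand-written lookup table rather than a real canonicalizer.

```python
import sqlite3

# Hypothetical stand-in for cansmiles: maps a few phenol spellings to the
# single form this sketch treats as canonical.
TOY_CANONICAL = {
    "Oc1ccccc1": "Oc1ccccc1",
    "c1ccccc1O": "Oc1ccccc1",
    "C1=CC=CC=C1O": "Oc1ccccc1",
}

def cansmiles(smiles):
    return TOY_CANONICAL[smiles]

conn = sqlite3.connect(":memory:")
# UNIQUE makes SQLite build an index on cansmi automatically.
conn.execute(
    "CREATE TABLE structure ("
    " id INTEGER PRIMARY KEY, name TEXT, cansmi TEXT NOT NULL UNIQUE)"
)
conn.executemany(
    "INSERT INTO structure (name, cansmi) VALUES (?, ?)",
    [("phenol", "Oc1ccccc1"), ("ethanol", "CCO")],
)

LOOKUP = "SELECT id, name FROM structure WHERE cansmi = ?"

def lookup(smiles):
    """Canonicalize the query SMILES, then do an exact-match lookup."""
    return conn.execute(LOOKUP, (cansmiles(smiles),)).fetchone()

hit = lookup("c1ccccc1O")  # a non-canonical spelling still finds phenol

# Ask SQLite how it would run the query; the last field describes the step.
plan = " ".join(
    str(row[-1])
    for row in conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, ("Oc1ccccc1",))
)
```

The query SMILES is canonicalized before the comparison, so the database only has to do an exact string match against the indexed column.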

This type could be used in a table that might also contain a canonical SMILES column, or even other variants of SMILES if desired. [Pg.116]

In a molecular structure file, an atom record typically contains all of the information about that atom: the atomic number or symbol, the charge, coordinates, etc. When such a file is parsed into a SMILES string and an array of coordinates, it is important to be able to associate the proper coordinate with the proper atom. The use of canonical SMILES ensures this. Because canonical SMILES defines a unique order of the atoms in a molecule, that order is used to store the coordinates. Later sections of this chapter will discuss ways in which atomic coordinates might be stored in columns of a table. [Pg.125]
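The bookkeeping described here can be sketched in a few lines of Python. The function name reorder_to_canonical, the ethanol example, and its coordinates are all hypothetical. The canonical order is assumed to be supplied by whatever toolkit generated the canonical SMILES.

```python
def reorder_to_canonical(coords, canonical_order):
    """Rearrange file-order coordinates into canonical SMILES atom order.

    coords[i] holds the coordinates of the i-th atom in the structure file.
    canonical_order[k] is the file index of the atom that comes k-th in the
    canonical SMILES.
    """
    if sorted(canonical_order) != list(range(len(coords))):
        raise ValueError("canonical_order must be a permutation of file indices")
    return [coords[file_index] for file_index in canonical_order]

# Heavy atoms of ethanol as they might appear in a file: O, C, C.
# The coordinates are made up for the example.
file_coords = [
    (1.20, 0.50, 0.00),  # file atom 0: O
    (0.00, 0.00, 0.00),  # file atom 1: C bonded to O
    (-1.30, 0.70, 0.00), # file atom 2: methyl C
]

# Suppose the canonical SMILES is "CCO": methyl C first, then C, then O.
canonical_order = [2, 1, 0]

canonical_coords = reorder_to_canonical(file_coords, canonical_order)
```

Storing canonical_order alongside the coordinates also keeps the mapping back to the original file atom numbers available.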

The column structure.id is a unique integer relating the structure, sdf and property tables. The sdf.molfile column contains the molfile for each structure as defined by the vendor. The structure.name and structure.cansmiles columns contain the name and canonical SMILES parsed and computed from the molfile. The structure.coord column will contain an array of atomic coordinates. The structure.atom column will contain an array of atom numbers from the file in canonical order to correspond to the atom order in the canonical SMILES. The OpenBabel/plpythonu extension functions molfile_mol and molfile_properties will be used to parse the vendor SDF molfiles and populate these tables. The molfile column of the sdf table is first populated from the SDF file, using the following Perl script. [Pg.126]

Every chemical company or research organization has a collection of compounds of interest. These may be compounds synthesized by chemists employed at the company, compounds purchased from chemical vendors, compounds on which research has been carried out, or any other collection of compounds. When a new compound becomes of interest, it is important to know whether that compound has already been entered into the system, or a new entry needs to be made. The use of canonical SMILES as a unique name for each structure makes this an easy task. One essential table in a compound registration system is a table of unique structures. Such a table could be defined as follows. [Pg.155]

The next statement defines a trigger function that will be used whenever data is inserted or updated in this table. This function performs three important tasks. First, it modifies the SMILES to be inserted into the smi column so that it contains the result of the isosmiles function. The isosmiles function is similar to the cansmiles function, except that it retains any stereochemistry that might be contained within the SMILES. If two stereoisomers are entered into this table, each will have a unique isosmiles value, but the same cansmiles value. In this way, they can be kept distinct, but their identical canonical SMILES shows them to be stereoisomers. The trigger function also computes the fingerprint and inserts it into the table when the SMILES is inserted or updated. [Pg.156]

The id column is defined as a primary key. This causes an index to be created, which will facilitate joining the structure table with other tables yet to be created. The smiles column is defined to be unique, which also automatically creates an index. This column will not be used as a key, but the unique index will allow fast lookups on this table if a particular structure is desired. The final definition of this schema creates an index on the cansmiles column. This will not be a unique index, but it will allow fast lookup of structures by canonical SMILES. [Pg.157]
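A rough sketch of how such a table supports registration follows. It assumes Python's sqlite3 module, and a Python register function doing the work the text assigns to a trigger. The TOY table stands in for the isosmiles and cansmiles functions, is written by hand for four inputs only, and the fingerprint column is left out.

```python
import sqlite3

# Hypothetical stand-in for isosmiles/cansmiles:
# input SMILES -> (stereo SMILES, stereo-free canonical SMILES).
TOY = {
    "Oc1ccccc1":        ("Oc1ccccc1", "Oc1ccccc1"),
    "c1ccccc1O":        ("Oc1ccccc1", "Oc1ccccc1"),
    "C[C@H](N)C(=O)O":  ("C[C@H](N)C(=O)O", "CC(N)C(=O)O"),
    "C[C@@H](N)C(=O)O": ("C[C@@H](N)C(=O)O", "CC(N)C(=O)O"),
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure (
    id        INTEGER PRIMARY KEY,   -- indexed, used for joins
    smiles    TEXT NOT NULL UNIQUE,  -- stereo SMILES, unique index
    cansmiles TEXT NOT NULL          -- may repeat across stereoisomers
);
CREATE INDEX structure_cansmiles ON structure (cansmiles);
""")

def register(input_smiles):
    """Return (id, is_new) for a compound, inserting it only if unseen."""
    isosmi, cansmi = TOY[input_smiles]
    row = conn.execute(
        "SELECT id FROM structure WHERE smiles = ?", (isosmi,)
    ).fetchone()
    if row is not None:
        return row[0], False
    with conn:
        cur = conn.execute(
            "INSERT INTO structure (smiles, cansmiles) VALUES (?, ?)",
            (isosmi, cansmi),
        )
    return cur.lastrowid, True

first = register("Oc1ccccc1")
again = register("c1ccccc1O")        # same compound, different spelling
ala_a = register("C[C@H](N)C(=O)O")
ala_b = register("C[C@@H](N)C(=O)O") # enantiomer: new row, same cansmiles
```

Here the second phenol spelling resolves to the existing row, while the two alanine enantiomers get separate rows that share one cansmiles value.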

One purpose of canonical numbering is the construction of a unique name of a compound. The canonically numbered connection table representation is uniquely defined. However, it usually contains a lot of redundancy and perceived information. The final unique name is normally a compressed form of the connection table which contains just enough information to reconstruct the connection table in its canonical form. Examples of such codes are the SEMA (stereochemically extended Morgan algorithm) name and canonical SMILES. [Pg.2735]

There is some overhead in the use of indexes, constraints, triggers, etc. as discussed here. The overhead is incurred when rows are inserted or updated in the table. However, the value of this approach is that the data in the table are well validated and can be searched more reliably and efficiently. Direct lookup of canonical or stereo SMILES is simple and quick because of the indexes on these columns. Using the fingerprint column speeds up substructure search. Tautomers can be readily selected using the column of simple graphs. [Pg.162]

The first solution uses some algorithm that transforms any connection table of a molecule into a unique, canonical form. The best known of these, the Morgan algorithm, chooses the numbering based on the numbers and properties of the neighbors of each atom of the structure. It is the basis of the Chemical Abstracts Service (CAS) Registry System. There is also a canonicalization scheme for the SMILES notation of a chemical structure. ... [Pg.220]
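The neighbor-based idea behind the Morgan algorithm can be illustrated with its first stage, the extended connectivity values. The sketch below is a simplified version under stated assumptions: it uses only heavy-atom connectivity, applies the classic stopping rule (iterate while the number of distinct values increases), and omits the later numbering and tie-breaking by atom and bond properties. The function name and the n-pentane example are chosen for illustration.

```python
def morgan_ec_values(adjacency):
    """Compute Morgan extended connectivity (EC) values.

    adjacency maps each atom index to the list of its neighbor indices.
    Start from the degree of each atom, then repeatedly replace each value
    by the sum of the neighbors' values. Stop as soon as an iteration fails
    to increase the number of distinct values, keeping the previous set.
    """
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {
            atom: sum(ec[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes

# n-Pentane carbon chain: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

ec_values = morgan_ec_values(pentane)
# Degrees 1,2,2,2,1 give two classes; one iteration gives 2,3,4,3,2
# (three classes); the next gives 3,6,6,6,3 (two classes), so it stops.
```

For n-pentane the resulting values separate the terminal, inner, and central carbons, which matches the symmetry of the chain. A full canonical numbering would then start from the atom with the highest value.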

Big Chemical Encyclopedia