InChI SMILES

Table 1 shows an example of markup generated using the OSCAR 3 system. The abstract of a polymer research paper has been parsed by OSCAR, and the resulting markup for the first sentence of the abstract is shown in-line with the text (Table 1B). The first chemical entity encountered in the sentence is oleic acid, which has been marked up as type = CM (Chemical Moiety), and a number of other annotations, such as in-line representations of chemical structure (InChI, SMILES), have been attached. [Pg.128]

Exports of structural descriptors, SMILES and InChI, provide chemical structure information in a simple tab-delimited text file containing CID or SID and either the isomeric SMILES or InChI strings. Given the very nature of the formats of SMILES and InChI, not all chemical structure information can be identically represented. For example, SMILES encodes only covalent bonds, while PubChem supports the additional concepts of ionic, complex, and dative bonds. Most small molecules in PubChem can be reproducibly interconverted between InChI, SMILES, and PubChem ASN.1 formats without loss of chemical structure information. [Pg.232]
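
A minimal sketch of reading such an export, assuming a two-column, tab-separated layout (identifier, then structure string) with no header row. The function name `read_structure_export` and the sample rows are illustrative and are not taken from a PubChem file specification.

```python
import csv
import io


def read_structure_export(handle):
    """Map each identifier (CID or SID) to its structure string.

    Assumes one record per line: identifier, a tab, then SMILES or InChI.
    """
    records = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        identifier, structure = row
        records[int(identifier)] = structure
    return records


# Illustrative sample: CID 702 is ethanol, CID 2244 is aspirin.
sample = io.StringIO(
    "702\tCCO\n"
    "2244\tCC(=O)OC1=CC=CC=C1C(=O)O\n"
)
structures = read_structure_export(sample)
print(structures[702])
```

Quoting is switched off because neither SMILES nor InChI uses quote characters as delimiters, so every field is taken literally.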

Integration of select Internet resources, such as the public chemical databases mentioned above, provides a very practical approach to structure searching the Internet and internal resources (Dong et al. 2007). Chapter 8 elaborates on this concept. As summarized in Chapter 2, another facet of chemical structure mining involves finding information within full text documents that do not traditionally contain identifiers like InChI or SMILES strings. Chapter 5 contains an in-depth discussion of these identifiers. [Pg.6]

The most commonly used identifiers today include line notation identifiers (e.g., Simplified Molecular Input Line Entry System [SMILES] and International Chemical Identifier [InChI]), tabular identifiers (e.g., Molfile and Structure-Data [SD] file types), and portable mark-up language identifiers (e.g., Chemical Markup Language [CML] and FlexMol). Each identifier has its strengths and weaknesses as detailed in Chapter 5. Chapters 5 and 6 provide enough information to guide researchers in choosing the most appropriate formats for their individual use. [Pg.14]

Even though InChI (discussed in the next section) is quickly gaining support as the linear format of choice, the fact that SMILES can be read and written by humans... [Pg.85]

Another important feature of InChI is its layered structure. Unlike in SMILES, where all data related to one atom are stored in one place, in InChI different properties of the structure are encoded in different parts of the identifier. This organization of the data has one very important advantage: molecules with the same basic structure that differ only in some minor property, such as stereochemistry or isotopic composition, have the same InChI except for the corresponding layer. This makes it possible not only to compare two InChIs to find out whether they represent exactly the same structure, but also to use a more intelligent comparison of two InChI strings to reveal molecules with the same basic structure that differ only in some detail. It is then up to the user to decide which deviations in the InChI are significant for his or her purpose and which are not. [Pg.87]
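
The layer-wise comparison described here can be sketched with plain string handling. This is a simplified illustration, assuming standard InChI strings whose layers are separated by `/` and whose stereochemistry layers start with `b`, `t`, `m`, or `s`; the helper names `split_layers` and `without_stereo` are hypothetical, and the stripped result is meant only as a comparison key.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_layers(inchi):
    """Return (version, formula, layers) for an InChI string."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    return version, formula, layers


def without_stereo(inchi):
    """Drop the stereochemistry layers (/b, /t, /m, /s) from an InChI."""
    version, formula, layers = split_layers(inchi)
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "InChI=" + "/".join([version, formula, *kept])


# The two enantiomers of alanine differ only in the /m layer.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(L_ALANINE == D_ALANINE)                                   # exact match
print(without_stereo(L_ALANINE) == without_stereo(D_ALANINE))   # match ignoring stereo
```

The exact comparison distinguishes the enantiomers, while the stereo-insensitive comparison treats them as the same basic structure. Which of the two is appropriate depends on the purpose of the search.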

The hash origin of InChIKey also means that it is not convertible back to the original InChI or molecular structure, because for each InChIKey there is an unlimited number of possible matching input values. Although this might seem to be a drawback of the format, it is simply the price of the fixed length of the identifier. When a readable identifier with no possible collisions is needed, InChI (or canonical SMILES) should be used. [Pg.91]
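
The trade-off can be illustrated with any hash function. The sketch below uses a truncated SHA-256 digest from Python's `hashlib` as a stand-in; it is not the InChIKey algorithm, and `toy_key` is a hypothetical name used only for this illustration.

```python
import hashlib


def toy_key(inchi, length=14):
    """Hypothetical fixed-length key: a truncated SHA-256 hex digest.

    This is NOT the InChIKey algorithm, only a stand-in that shares its
    fixed-length, one-way character.
    """
    return hashlib.sha256(inchi.encode("ascii")).hexdigest()[:length]


short_inchi = "InChI=1S/CH4/h1H4"  # methane
long_inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"  # L-alanine

# Inputs of any length map to keys of the same length.
print(len(toy_key(short_inchi)), len(toy_key(long_inchi)))

# Only finitely many keys exist (16**14 for this toy key), while the set of
# possible input strings is unbounded, so distinct inputs must eventually
# share a key and the key cannot be turned back into a unique input.
print(16 ** 14)
```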

When publishing chemical structures for search engines, all file-based formats are usually out of the question. The reason is simple: search engines work on the scale of characters and words and are not prepared to compare whole files. Thus, we are left with a selection from the widespread linear formats: SMILES, InChI, and InChIKey. [Pg.93]
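
Because each of these formats fits in a single word-like token, a search front end can guess the format from the token alone. The following is a rough sketch under two assumptions: InChI strings start with the `InChI=` prefix, and InChIKeys are 27 characters arranged as 14, 10, and 1 uppercase letters separated by hyphens. The fallback label is only a guess, since no SMILES parsing is attempted, and `guess_format` is a hypothetical helper.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter: 27 characters in total.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def guess_format(token):
    """Guess which linear format a query token is written in."""
    if token.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.fullmatch(token):
        return "inchikey"
    return "smiles (unverified)"


# Three spellings of ethanol as a search query.
for query in ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
              "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
              "CCO"):
    print(query, "->", guess_format(query))
```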

Both InChI and InChIKey are canonical by default, and there is no other option for them. The situation is different for SMILES, where both canonical and noncanonical forms exist. For this reason InChI (or InChIKey) is probably better, because the... [Pg.93]

Lastly, we should take into account the probability that the user will use our particular format for searching. This depends not only on how well established the format is, in which respect SMILES would probably win and InChIKey would be an outsider, but also on the user's perception of which format is suitable for this task. Because InChI, and InChIKey even more so, were marketed from the beginning as suitable for online searches, it is very likely that the user will use one of these formats for searching, rather than SMILES, which does not have such an association. Of course, we could solve this problem by simply publishing our data in all three formats. The decision should be made based on our resources, but it is in general very easy to convert between the formats automatically. [Pg.94]

SMILES has the additional feature of being human readable, but this is not very important in our model case. InChI, and InChIKey by inheritance, feature a much better and more robust normalization of structures; for example, two different tautomeric forms have the same InChI but different canonical SMILES. Also, the layered structure of InChI gives us the possibility of excluding some particularity of a structure, such as its stereochemistry, from the search when needed. This is not possible using InChIKey or SMILES. [Pg.97]

Chemical structure representation formats (e.g., InChI, Simplified Molecular Input Line Entry System [SMILES])... [Pg.129]

IUPAC-like expressions, true IUPAC nomenclature names, and InChI and SMILES representations of chemical compounds are well suited for detection by machine learning approaches. Conditional random fields (CRFs) [41] and support vector machines have been used for the detection of IUPAC expressions in scientific literature [42]. Other approaches are based on rule sets [43, 44] or combinations of machine learning with rule-based approaches [45]. All these approaches have in common that they face one significant problem: the name-to-structure problem. [Pg.129]

The two molecules share the same InChI because it cannot represent non-sp2 or sp3 coordination environments, unlike SMILES. [Pg.157]

Currently, the included metadata are used to create additional functionality for the reader within an enhanced HTML view of an article. The ontology terms link to pop-up pages with the ontology definitions, further links, and related articles, whereas the compounds bring up a pop-up containing a two-dimensional structure, the InChI and SMILES strings for the compound, names, synonyms, and related articles. This is best shown in Figure 8.1. [Pg.161]

To date it includes data on 37 million fully characterised compounds as well as mixtures, complexes, and uncharacterised substances. It provides information on chemical properties, structures (including InChI and SMILES strings; see Section 9.3), synonyms, and bioactivity. [Pg.17]

ChemSpider was launched in 2007. It is an open-access service in which constituent databases, the largest of which is Web of Science, are linked on a free-access basis, and which uses algorithms to identify and extract chemical names from documents and web pages and convert them to structures and InChI and SMILES identifiers. Access to the core service is free, but the user may be routed to charging component databases. At launch, ChemSpider contained 21 million compounds. At the time of writing, it was too early to assess the success of the service. It was bought by the Royal Society of Chemistry in 2009. [Pg.23]

SMILES was designed, however, such that it could be written or read without the use of a computer. Its advantage is that it is easier to interpret in this way than InChI. [Pg.166]

BOX 10.1 SMILES, WLN, and InChI Notation of (2E)-3-cyclohexyl-2-[(R)-hydroxy(phenyl)methyl]acrylonitrile... [Pg.213]

Canonicalize chemical structures, i.e., make all chemical structures quickly comparable for a computer. For example, canonical SMILES or InChIs can be used. [Pg.215]
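
As a small illustration of why a canonical form makes structures quickly comparable, the sketch below groups records by their InChI string. It assumes the InChI strings were already computed by a cheminformatics toolkit; the record layout and names are made up for the example.

```python
from collections import defaultdict

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

# (name, SMILES as entered, precomputed InChI); illustrative records only.
records = [
    ("ethanol", "CCO", ETHANOL),
    ("ethyl alcohol", "OCC", ETHANOL),
    ("methanol", "CO", METHANOL),
]

# The two ethanol entries use different, equally valid SMILES spellings,
# so comparing the SMILES strings as entered would miss the duplicate.
print(records[0][1] == records[1][1])

# Grouping on the canonical identifier reduces comparison to string equality.
by_structure = defaultdict(list)
for name, smiles, inchi in records:
    by_structure[inchi].append(name)

for inchi, names in by_structure.items():
    print(inchi, names)
```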

Recently, a universal string representation method was proposed and published. The International Chemical Identifier [17], or InChI, is a definition and set of methods maintained by the International Union of Pure and Applied Chemistry. It promises to provide a truly universal character string representation of molecular structure. Whether it will replace the widely used SMILES is yet to be seen. [Pg.82]

The supported query input formats for the structure search tool are SMILES, SMARTS [17], InChI, CID (PubChem Compound identifier), molecular formula, and SDF [18]. There is also an online JavaScript-based chemical structure sketcher through which a query may be manually drawn, edited, or imported. The sketcher is compatible with modern web browsers and does not require special software to be downloaded or installed. [Pg.230]

This service takes as input a chemical structure and (if standardization is possible) outputs a chemical structure. Allowed structural input and output formats include SMILES, InChI, or SDF file; however, the input and output formats need not be the same. As with structure search, the standardization service is queued on PubChem servers, meaning a request may not start right away or may not complete immediately. One may also import and export standardization requests to a local XML file to serve as an example for constructing queries for the PUG interface (described in detail later). [Pg.231]

A number of formats are available for data export. These formats include SDF, image, small image, SMILES, InChI, XML, and either text or binary ASN.1. The PubChem native archive data format is ASN.1; all other formats are converted from the original ASN.1. The XML formatted data is exactly equivalent to the ASN.1 in content. SDF format is the industry standard for conveyance of chemical structure information and is readily imported into a large number of chemistry programs. Unfortunately, the SDF format is unable to handle all aspects of the ASN.1 data and may not contain all archived information. The PubChem ASN.1 specification, XML schema, and a description of PubChem SDF structure data (SD) tags are all found on the PubChem FTP site in the "specifications" directory. [Pg.231]

Deposition of substance information is performed using the industry standard SDF format, which may include using the SMILES or InChI formats as the chemical structure. Depositing properly formatted substance data into PubChem is as simple as uploading a file via HTTP or FTP. [Pg.238]

Standardisation of representations has not had a happy history, partly as a result of the differing requirements of different groups of users. Though the InChI does seem to be establishing itself in some quarters as a widely used standard, under international control, different flavours of and extensions to established formats (such as SMILES) continue to appear. [Pg.187]

Big Chemical Encyclopedia

Chemical substances, components, reactions, process design ...