Similarity searching information retrieval

Current chemical information systems offer three principal types of search facility. Structure search involves the search of a file of compounds for the presence or absence of a specified query compound, for example, to retrieve physicochemical data associated with a particular substance. Substructure search involves the search of a file of compounds for all molecules containing some specified query substructure of interest. Finally, similarity search involves the search of a file of compounds for those molecules that are most similar to an input query molecule, using some quantitative definition of structural similarity. [Pg.189]
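To make the idea of a "quantitative definition of structural similarity" concrete, the following minimal sketch ranks a small file of compounds against a query using the Tanimoto coefficient on fragment-based fingerprints. The coefficient is a common choice for this purpose, but the excerpt does not name a specific measure, so treat it as an assumption. The fingerprints (sets of integer fragment identifiers), the compound names, and the function names are all hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared fragments / fragments present in either molecule."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(query: frozenset, library: dict, top_k: int = 3) -> list:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    # Sort by descending score; break ties by name for a stable ranking.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical fingerprints: each integer stands for one structural fragment.
library = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 5}),
    "mol_c": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

hits = similarity_search(query, library, top_k=2)
print(hits)
```

Unlike structure or substructure search, every compound in the file receives a score, so the output is a ranked list rather than a yes/no answer.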

The amino acid sequences can be searched and retrieved from integrated retrieval sites such as Entrez (Schuler et al., 1996), SRS of EBI (http://srs.ebi.ac.uk/), and DDBJ (http://srs.ddbj.nig.ac.jp/index-e.html). From the Entrez home page (http://www.ncbi.nlm.nih.gov/Entrez), select Protein to open the protein search page. Follow the same procedure described for the Nucleotide sequence (Chapter 9) to retrieve amino acid sequences of proteins in two formats: GenPept and FASTA. The GenPept format is similar to the GenBank format with annotated information, reference(s), and features. The amino acid sequences of the EBI are derived from the SWISS-PROT database. The retrieval system of the DDBJ consists of PIR, SWISS-PROT, and DAD, which returns sequences in the GenPept format. [Pg.223]

PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. More information about using each component database may be found using the links above. [Pg.206]

Two methods are commonly applied for library searches. Identity or retrieval searches assume that the spectrum of the unknown compound is present in the reference library, and that only experimental variability prevents a perfect match of the unknown and reference spectra. When no similar spectra are retrieved, the only information provided is that the unknown spectrum is not in the library. Similarity or interpretive searches assume that the reference library does not contain a spectrum of the unknown compound, and are designed to produce structural information from which identity might be inferred. Interpretive methods typically employ a predetermined set of spectral features, designed to correlate with the presence of chemical substructures. Searching identifies the library spectra that have features most similar to those of the unknown spectrum. The frequency of occurrence of a substructure in the hit list is used to estimate the probability that it is present in the unknown compound. Two well-developed interpretive search algorithms are SISCOM (Search of Identical and Similar Compounds) and STIRS (the Self-Training Interpretive and Retrieval System) [174-177]. Normally a retrieval search is performed first, and when the results are inconclusive, an interpretive search is implemented. In both cases, success depends on the availability of comprehensive libraries of high-quality reference spectra [178]. [Pg.764]
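The interpretive step described above (rank library spectra by feature similarity, then count how often each substructure appears in the hit list) can be sketched as follows. This is an illustration of the general idea only, not an implementation of SISCOM or STIRS. The feature labels, the substructure annotations, the Jaccard feature score, and the use of a raw hit-list fraction as the probability estimate are all assumptions made for the example.

```python
from collections import Counter


def feature_similarity(a: set, b: set) -> float:
    """Jaccard overlap between two sets of spectral features (assumed scoring rule)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def interpretive_search(unknown: set, library: list, top_k: int = 3) -> dict:
    """Estimate substructure presence from the top_k most similar library spectra."""
    ranked = sorted(
        library,
        key=lambda entry: (-feature_similarity(unknown, entry["features"]), entry["name"]),
    )
    hits = ranked[:top_k]
    if not hits:
        return {}
    counts = Counter(s for entry in hits for s in entry["substructures"])
    # Fraction of hits containing each substructure, used as a crude probability.
    return {sub: n / len(hits) for sub, n in counts.items()}


# Hypothetical reference library: features f1..f9 stand for predefined spectral features.
library = [
    {"name": "ref1", "features": {"f1", "f2", "f3"}, "substructures": {"carbonyl", "phenyl"}},
    {"name": "ref2", "features": {"f1", "f2"}, "substructures": {"carbonyl"}},
    {"name": "ref3", "features": {"f2", "f3", "f4"}, "substructures": {"carbonyl", "hydroxyl"}},
    {"name": "ref4", "features": {"f9"}, "substructures": {"nitro"}},
]
unknown = {"f1", "f2", "f3"}

estimates = interpretive_search(unknown, library, top_k=3)
print(estimates)
```

Here the carbonyl group occurs in all three hits, so it is the substructure most strongly suggested for the unknown, while the nitro group never enters the hit list.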

In NCBI's view, the main purpose of listing patented sequences in GenBank is to be able to retrieve sequences by similarity searches that may serve to locate patents related to a given sequence. To make a legal determination in the case, however, one would still have to examine the full text of the patent. To evaluate the biology of the sequence, one generally must locate information other than that contained in the patent. Thus, the critical linkage is between the sequence and its patent number. [Pg.26]

It will be clear from the material presented earlier that 2D similarity searching (either in its basic form or in its subsimilarity or supersimilarity forms) has become a standard retrieval option in chemical information systems... [Pg.31]

Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A (2006) New methods for ligand-based virtual screening: use of data-fusion and machine-learning techniques to enhance the effectiveness of similarity searching. J Chem Inf Model 46:462-470
Miyamoto S (1990) Fuzzy sets in information retrieval and cluster analysis. Kluwer Academic, Dordrecht... [Pg.76]

An important question in connection with accident investigations is whether the same type of accident has occurred before. This question is easily answered if a computerised accident database is available. The most efficient means of searching for similar occurrences is to utilise the free-text search facilities of the information retrieval programme. There are computerised tools for pattern recognition that may help identify accident repeaters. It is important to note that different persons may use different words to describe the same work operations, equipment, etc. There are also spelling errors and misconceptions that hinder efficient free-text search. The analyst must be well acquainted with the workplace in question and the different dialects in use. [Pg.209]
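The two obstacles named above, differing vocabulary and spelling errors, can be partly handled in a free-text search by combining a synonym table with approximate string matching. The sketch below uses Python's standard `difflib` module for the approximate matching. The accident reports, the synonym table, and the 0.8 similarity cutoff are invented for illustration, and in practice the synonym table would come from an analyst who knows the workplace vocabulary.

```python
import difflib
import re


def tokenize(text: str) -> list:
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_matches(term: str, tokens: list, synonyms: dict, cutoff: float = 0.8) -> bool:
    """True if the term, or one of its synonyms, approximately matches any token."""
    candidates = {term} | synonyms.get(term, set())
    return any(
        difflib.get_close_matches(candidate, tokens, n=1, cutoff=cutoff)
        for candidate in candidates
    )


def search(reports: dict, terms: list, synonyms: dict, cutoff: float = 0.8) -> list:
    """Return the ids of reports in which every search term is matched."""
    found = []
    for report_id, text in reports.items():
        tokens = tokenize(text)
        if all(term_matches(t, tokens, synonyms, cutoff) for t in terms):
            found.append(report_id)
    return found


# Hypothetical accident reports; report 1 contains a misspelling, report 2 a local term.
reports = {
    1: "Worker fell from scafold during painting",
    2: "Operator slipped on staging while welding",
    3: "Forklift collided with pallet rack",
}
synonyms = {"scaffold": {"staging"}}

result = search(reports, ["scaffold"], synonyms)
print(result)
```

A lower cutoff tolerates more severe misspellings but also raises the number of false matches, so the value needs tuning against real reports.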

Spectroscopic databases are a very valuable tool for the identification of known and unknown substances. They are available and frequently used in most spectroscopic laboratories. Retrieval of data and spectral similarity searches are established tools for the fast identification of unknown compounds. The spectroscopic information stored in the databases also allows the generation of structure-spectra correlations, which can be used for predicting spectral features of new compounds. Effective spectrum prediction tools are available for ¹³C NMR and ¹H NMR, and will become available for IR spectroscopy in the near future. The prediction of mass spectra is still a challenge. [Pg.2645]
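A spectral similarity search of the kind mentioned above can be sketched by treating each spectrum as a vector of intensities on a shared grid and comparing vectors with the cosine measure. The excerpt does not say which measure real systems use, so the cosine score is an assumption, as are the four-point spectra and the names in the example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two intensity vectors sampled on the same grid."""
    if len(a) != len(b):
        raise ValueError("spectra must be sampled on the same grid")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(unknown: list, library: dict) -> tuple:
    """Return (name, score) of the library spectrum most similar to the unknown."""
    name = max(library, key=lambda key: cosine_similarity(unknown, library[key]))
    return name, cosine_similarity(unknown, library[name])


# Hypothetical spectra: intensities at four grid points.
library = {
    "ref_a": [0.0, 10.0, 0.0, 5.0],
    "ref_b": [8.0, 0.0, 3.0, 0.0],
}
unknown = [0.0, 9.0, 1.0, 5.0]

name, score = best_match(unknown, library)
print(name, round(score, 3))
```

Because the cosine measure ignores overall scale, two spectra recorded at different concentrations can still score as a close match.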
