Text Mining for Scientific Information

Automatic text mining is an important source of knowledge, with many applications in generating databases from the scientific literature. Examples include databases of protein-disease associations, gene expression patterns, subcellular localization, and protein-protein interactions. [Pg.384]

The NLProt system developed by Mika and Rost combines four support vector machines (SVMs), each trained for a distinct task. The first SVM is trained to recognize protein names, whereas the second learns the environment in which a protein name appears. The third SVM is trained on both protein names and their environments. The outputs of these three SVMs and a score from a protein dictionary are fed into the fourth SVM, which outputs the protein whose name was identified in the text. A dictionary of protein names was generated from SwissProt and TrEMBL, whereas the Merriam-Webster Dictionary was used as a source of common words. Other terms were added to the dictionary, such as medical terms, species names, and tissue [Pg.384]
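
The data flow of such a stacked classifier can be sketched in a few lines of Python. This is only an illustration of the architecture described above, not the NLProt implementation. The three first-stage scorers are hand-written stand-ins for trained SVMs, and the cue words, the dictionary scoring rule, the weights, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Set

# A scorer sees the token list and a position and returns a real-valued score,
# standing in for the decision value of a trained SVM (assumption).
Scorer = Callable[[Sequence[str], int], float]


def toy_name_scorer(tokens: Sequence[str], i: int) -> float:
    """Stand-in for SVM 1: looks only at the candidate token itself."""
    tok = tokens[i]
    looks_like_name = any(c.isdigit() for c in tok) or any(c.isupper() for c in tok[1:])
    return 1.0 if looks_like_name else -1.0


def toy_context_scorer(tokens: Sequence[str], i: int) -> float:
    """Stand-in for SVM 2: looks only at the two tokens on either side."""
    window = list(tokens[max(0, i - 2):i]) + list(tokens[i + 1:i + 3])
    cues = {"protein", "kinase", "binds", "expression"}  # hypothetical cue words
    return 1.0 if any(w.lower() in cues for w in window) else -1.0


def toy_joint_scorer(tokens: Sequence[str], i: int) -> float:
    """Stand-in for SVM 3: uses both the token and its context."""
    return 0.5 * (toy_name_scorer(tokens, i) + toy_context_scorer(tokens, i))


def dictionary_score(token: str, protein_names: Set[str], common_words: Set[str]) -> float:
    """Hypothetical rule: +1 for a known protein name, -1 for a common word, else 0."""
    t = token.lower()
    if t in protein_names:
        return 1.0
    if t in common_words:
        return -1.0
    return 0.0


def tag_proteins(tokens: Sequence[str], scorers: Sequence[Scorer],
                 protein_names: Set[str], common_words: Set[str],
                 weights: Sequence[float], bias: float) -> List[str]:
    """Final stage: a linear decision w.x + b > 0 over the first-stage outputs."""
    hits = []
    for i, tok in enumerate(tokens):
        features = [score(tokens, i) for score in scorers]
        features.append(dictionary_score(tok, protein_names, common_words))
        if sum(w * x for w, x in zip(weights, features)) + bias > 0:
            hits.append(tok)
    return hits


tokens = "The p53 protein binds strongly".split()
found = tag_proteins(
    tokens,
    scorers=[toy_name_scorer, toy_context_scorer, toy_joint_scorer],
    protein_names={"p53"},                       # would come from a protein database
    common_words={"the", "protein", "binds"},    # would come from a general dictionary
    weights=[1.0, 1.0, 1.0, 1.0],                # hypothetical; learned in a real system
    bias=0.0,
)
print(found)
```

In a trained system the stand-in scorers and the fixed weights would be replaced by models fitted to annotated text. The point of the sketch is only that the final classifier sees four numbers per candidate: three first-stage scores and one dictionary score.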

Big Chemical Encyclopedia