Protein family classification

There are two different approaches to the protein sequence classification problem. One can use an unsupervised neural network to group proteins when the number and composition of the final clusters are not known in advance (e.g., Ferran and Ferrara, 1992). Alternatively, one can use supervised networks to classify sequences into known (existing) protein families (e.g., Wu et al., 1992). [Pg.136]

Wu et al. (1992) devised a neural network system for the automatic classification of protein sequences according to superfamilies. It was extended into a full-scale system for classification of more than 3,300 PIR protein superfamilies (Wu et al., 1995). The basic input information was encoded with the n-gram and SVD methods. The implementation [Pg.136]
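As a rough illustration of the n-gram encoding mentioned above, the following Python sketch turns a sequence into a fixed-length vector of n-gram frequencies. The 20-letter alphabet, the choice of n = 2, the normalization, and the function name `ngram_vector` are assumptions made for illustration, not details taken from Wu et al. The SVD compression step is not shown.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ngram_vector(sequence, n=2):
    """Return n-gram frequencies of `sequence` in a fixed key order.

    The vector has len(AMINO_ACIDS) ** n entries. N-grams containing
    letters outside the assumed alphabet are ignored.
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]

# Hypothetical toy sequence, used only to show the call.
vec = ngram_vector("MKVLAAGIVKV")
```

Every sequence maps to a vector of the same length regardless of its own length, which is what lets sequences of different sizes feed a network with a fixed number of input units.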

As a full-scale family classification system, more than 1200 MOTIFIND neural networks were implemented, one for each ProSite protein group. The training set for each neural network consisted of both positive (ProSite family members) and negative (randomly selected non-members) sequences at a ratio of 1 to 2. ProClass groups non-redundant SwissProt and PIR protein sequence entries into families as defined collectively by PIR superfamilies and ProSite patterns. By joining global and motif similarities in a single classification scheme, ProClass helps to reveal domain and family relationships and to classify multi-domain proteins. [Pg.138]
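The 1-to-2 ratio of positive to negative training sequences described above can be sketched as follows. This is a minimal illustration, not the MOTIFIND code. The function name `build_training_set`, the label convention (1 for members, 0 for non-members), and the fixed random seed are assumptions.

```python
import random

def build_training_set(members, non_members, neg_per_pos=2, seed=0):
    """Pair every family member with randomly drawn non-members.

    Returns a list of (sequence, label) tuples: label 1 for members,
    label 0 for non-members, with neg_per_pos negatives per positive
    (fewer if the non-member pool is too small).
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    n_neg = min(len(non_members), neg_per_pos * len(members))
    negatives = rng.sample(non_members, n_neg)
    return [(s, 1) for s in members] + [(s, 0) for s in negatives]

# Hypothetical toy sequences, used only to show the call.
members = ["MKVLA", "MKILA", "MRVLA"]
non_members = ["GGSGG", "PPTPP", "WWYWW", "CCHCC", "DDEDD", "NNQNN", "FFLFF"]
training = build_training_set(members, non_members)
```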

GeneFIND uses a multi-level filter system, with MOTIFIND and BLAST (Altschul et al., 1997) as the first-level filters to quickly eliminate query sequences that have very low probabilities of being a family member. After searching through all neural networks, the query sequence is considered a potential PROSITE family member if it ranks within the top 3% of hits of the corresponding network. [Pg.138]
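A top-3% cutoff of this kind can be sketched as a simple rank test. The source does not say how GeneFIND computes the ranking, so the approach below (ranking the query score against a set of reference scores from the same network) and the name `in_top_fraction` are assumptions.

```python
def in_top_fraction(query_score, reference_scores, fraction=0.03):
    """Return True if the query ranks within the top `fraction` of scores.

    The query is ranked together with the reference scores, highest
    first. Tied scores all receive the best rank among the tie.
    """
    all_scores = sorted(list(reference_scores) + [query_score], reverse=True)
    rank = all_scores.index(query_score) + 1  # 1-based rank
    return rank <= fraction * len(all_scores)

# Hypothetical network outputs for 99 reference sequences.
reference = [i / 100 for i in range(99)]  # 0.00 ... 0.98
keep = in_top_fraction(0.99, reference)   # best of 100 scores
```

A filter like this is cheap to evaluate, which fits its role as a first pass before slower alignment-based checks.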

To assist the interpretation of family memberships, overall probability scores for both global and motif matches are provided for top hit families. The global score is computed from the BLAST e-value, the SSEARCH score, and the percentage of sequence identity at overlapped length ratio in SSEARCH alignment. The motif score is computed from the ratio of mismatched amino acids to ProSite patterns, and the hidden Markov motif match score. Family information from ProClass, with hypertext links to all other major family [Pg.139]

Under this framework, Fig. 6 shows the basic pieces for constructing annotated chemical libraries. On the one hand, proteins should be stored using the appropriate annotation under their respective protein-family classification schemes (in this case, nuclear receptors). On the other hand, molecules should be stored using a unique hierarchical identifier. The link between the two entities (molecules and proteins) would be defined by pharmacological data (activity). Applying a chosen criterion to these data would then allow a binary annotation matrix to be constructed, from which the mapping of the chemogenomic space is established. [Pg.51]
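The step from activity data to a binary annotation matrix might look like the sketch below. The criterion used here (an activity value at or above a fixed threshold), the threshold of 6.0, and all identifiers are hypothetical. The source does not specify which criterion is applied.

```python
def annotation_matrix(activities, molecules, proteins, threshold=6.0):
    """Build a molecules-by-proteins matrix of 0/1 annotations.

    `activities` maps (molecule_id, protein_id) to an activity value.
    A cell is 1 when a value exists and is >= threshold, otherwise 0.
    """
    missing = float("-inf")  # unmeasured pairs never pass the threshold
    return [
        [1 if activities.get((m, p), missing) >= threshold else 0
         for p in proteins]
        for m in molecules
    ]

# Hypothetical identifiers and activity values.
activities = {("mol1", "NR_A"): 7.2, ("mol1", "NR_B"): 4.9, ("mol2", "NR_B"): 6.0}
matrix = annotation_matrix(activities, ["mol1", "mol2"], ["NR_A", "NR_B"])
```

Note that an unmeasured pair and a measured inactive pair both become 0 in this sketch. A real annotation scheme may need to keep those two cases apart.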

Figure 11.1 MOTIFIND neural network design for protein family classification.

Henikoff, S. & Henikoff, J. G. (1994). Protein family classification based on searching a database of blocks. Genomics 19, 97-107. [Pg.219]

Selection of databases on protein structure and protein family classification ... [Pg.280]

The protein sequence database is also a text-numeric database with bibliographic links. It is the largest public domain protein sequence database. The current PIR-PSD release 75.04 (March, 2003) contains more than 280 000 entries of partial or complete protein sequences with information on functionalities of the protein, taxonomy (description of the biological source of the protein), sequence properties, experimental analyses, and bibliographic references. Queries can be started as a text-based search or a sequence similarity search. PIR-PSD contains annotated protein sequences with a superfamily/family classification. [Pg.261]

Selected entries from Methods in Enzymology [vol, page(s)]: General: Protein kinase classification, 200, 3; protein kinase catalytic domain sequence database: identification of conserved features of primary structure and classification of family members, ... [Pg.579]

One of the main problems for the annotation and classification of the biological space is the lack of a standard scheme for all protein families. Even within families, different classification schemes coexist and are used by different research communities. This greatly hampers any chemogenomic initiative aimed at integrating chemical and biological spaces with novel computational techniques. The following provides an overview of the classification schemes currently in use for the main therapeutically relevant protein families. [Pg.41]

Structural motifs become especially important in defining protein families and superfamilies. Improved classification and comparison systems for proteins lead inevitably to the elucidation of new functional relationships. Given the central role of proteins in living systems, these structural comparisons can help illuminate every aspect of biochemistry, from the evolution of individual proteins to the evolutionary history of complete metabolic pathways. [Pg.144]

PALI (Phylogeny and Alignment of Homologous Protein Domains) Database. The PALI (v 2.6) database provides three-dimensional structure-based sequence alignments for homologous proteins of known three-dimensional structure (24-26). The protein families have been derived from the SCOP (Structural Classification of Proteins) database (27). There are 2,518 protein families, and using more than one sequence as reference, 37,986 profiles have been generated. [Pg.157]

Current trends in neural networks favor smaller networks with minimal architecture. Two major advantages of smaller networks previously discussed are better generalization capability (8.4) and easier rule extraction (13.2). Another advantage is better predictive accuracy, seen when a large network is replaced by many smaller networks, each for a subtask or a subset of data. A typical example is the protein classification problem, where n individual networks can be used to classify n different protein families and increase the prediction accuracy obtained by one large network with n output units. The improvement is especially significant when there is sufficient data for fine-tuning individual neural networks to the particularity of the data subsets. The use of ensembles of small, customized neural networks to improve predictive accuracy has been shown in numerous cases. [Pg.156]
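The "n individual networks for n families" design can be sketched as a dictionary of per-family scorers whose outputs are compared at prediction time. The scorers below are trivial stand-in functions, not trained networks, and the names `classify` and `min_score` are assumptions for illustration.

```python
def classify(sequence, family_scorers, min_score=0.5):
    """Score a sequence with one model per family and pick the best.

    `family_scorers` maps a family name to a callable that returns a
    score in [0, 1]. Returns (best_family_or_None, all_scores), where
    the family is None if no scorer reaches min_score.
    """
    scores = {fam: scorer(sequence) for fam, scorer in family_scorers.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= min_score else None), scores

# Stand-in scorers: each would be a separately trained small network.
family_scorers = {
    "cys_rich": lambda s: s.count("C") / len(s),
    "gly_rich": lambda s: s.count("G") / len(s),
}
family, scores = classify("CCCCGA", family_scorers)
```

Because each scorer is independent, one family's model can be retrained or tuned on its own data subset without touching the others.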

Close inspection of currently available sequences of proteins carrying BCB domains clearly indicated that they can be classified into four major classes, which are described below. This classification is based on their ability to bind copper and the specific features of their domain organization. Members of the first three classes harbor single or multiple type 1 blue copper-binding sites, while members of the fourth class do not appear to bind copper. Domain organizations of the precursors of all currently known protein families that contain a BCB domain are shown in Fig. 1. [Pg.272]

Big Chemical Encyclopedia
