Clustering analysis

When building a protein model, it is common to construct many (25 or more) initial structures and then either select the best model or create an averaged structure from that collection (discussed earlier). [Pg.146]

NMRCLUST, NMRCORE, and OLDERADO are programs that were initially developed for scientists solving protein structures by NMR spectroscopy. These programs aid in the superposition and clustering of protein structures. OLDERADO is a combination of the NMRCLUST and NMRCORE methods. The methodologies employed in OLDERADO are discussed below as separate entities devoted to a common task. [Pg.146]

OLDERADO clusters the structures of proteins derived from MD simulations into subfamilies based on the conformations of loops and the movement of different protein regions. This clustering is used to select the most representative model from the ensemble of target models constructed. NMRCLUST and NMRCORE can be used individually, but by combining their power and comparing their results with a database of experimentally determined protein family folds, OLDERADO provides additional information about the quality of the final protein model(s). We reiterate that OLDERADO does not construct an average structure; instead, it selects the most representative structure from an ensemble of structures. Clusters of conformationally similar models are created, and the core atoms of protein domains are selected automatically without intervention from the modeler. [Pg.147]
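The idea of selecting a representative structure rather than averaging can be illustrated with a medoid: the ensemble member with the smallest summed distance to all the others. This is only a minimal sketch and not OLDERADO's actual procedure; the function name and the pairwise RMSD values are hypothetical.

```python
def most_representative(distances):
    """Return the index of the medoid of an ensemble.

    `distances` is a symmetric matrix of pairwise distances (for
    example RMSD values). The medoid is the member whose summed
    distance to every other member is smallest, so the result is
    always a real structure and never an average.
    """
    totals = [sum(row) for row in distances]
    return min(range(len(totals)), key=totals.__getitem__)


# Hypothetical pairwise RMSD values (in angstroms) for four models.
pairwise_rmsd = [
    [0.0, 1.2, 0.9, 3.1],
    [1.2, 0.0, 1.0, 2.8],
    [0.9, 1.0, 0.0, 2.9],
    [3.1, 2.8, 2.9, 0.0],
]
representative = most_representative(pairwise_rmsd)
```

Because the medoid is an actual ensemble member, it keeps physically sensible bond lengths and angles, which an averaged structure may not.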

The methodologies discussed here differ in how they evaluate both the overall quality and the validity of a protein structure. PROCHECK does this by assessing stereochemistry, Verify3D evaluates the probability of a side chain occupying a specific region, ERRAT assesses the distribution of nonbonded atom-atom interactions for key atoms, ProSa uses PMFs, PROVE ... [Pg.147]

Abstract. A smooth empirical potential is constructed for use in off-lattice protein folding studies. Our potential is a function of the amino acid labels and of the distances between the Cα atoms of a protein. The potential is a sum of smooth surface potential terms that model solvent interactions and of pair potentials that are functions of a distance, with a smooth cutoff at 12 Å. Techniques include the use of a fully automatic and reliable estimator for smooth densities, of cluster analysis to group together amino acid pairs with similar distance distributions, and of quadratic programming to find appropriate weights with which the various terms enter the total potential. For nine small test proteins, the new potential has local minima within 1.3-4.7 Å of the PDB geometry, with one exception that has an error of S.SA. [Pg.212]

Keywords: protein folding, tertiary structure, potential energy surface, global optimization, empirical potential, residue potential, surface potential, parameter estimation, density estimation, cluster analysis, quadratic programming... [Pg.212]

Other methods consist of algorithms based on multivariate classification techniques or neural networks; they are constructed for automatic recognition of structural properties from spectral data, or for simulation of spectra from structural properties [83]. Multivariate data analysis for spectrum interpretation is based on the characterization of spectra by a set of spectral features. A spectrum can be considered as a point in a multidimensional space with the coordinates defined by spectral features. Exploratory data analysis and cluster analysis are used to investigate the multidimensional space and to evaluate rules to distinguish structure classes. [Pg.534]

There is no correct method of performing cluster analysis and a large number of algorithms have been devised from which one must choose the most appropriate approach. There can also be a wide variation in the efficiency of the various cluster algorithms, which may be an important consideration if the data set is large. [Pg.507]

A cluster analysis requires a measure of the similarity (or dissimilarity) between pairs of objects. When comparing conformations, the RMSD would be an obvious measure to use. [Pg.507]
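As a minimal sketch of RMSD used as a pairwise dissimilarity, the functions below compute the root-mean-square deviation between two conformations and collect the values into a matrix. The sketch assumes the conformations have the same atom ordering and have already been superimposed; the function names and coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two conformations.

    Each conformation is a list of (x, y, z) tuples in the same atom
    order. No superposition is performed here (an assumption of this
    sketch), so the structures should be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformations must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def rmsd_matrix(conformations):
    """Symmetric matrix of pairwise RMSD values for a set of conformations."""
    n = len(conformations)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(conformations[i], conformations[j])
    return matrix


# Two hypothetical two-atom conformations, shifted by 1 unit along z.
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
pairwise = rmsd_matrix([conf_a, conf_b])
```

A matrix of this kind is the usual input to a clustering algorithm that works on dissimilarities rather than on raw coordinates.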

The aim of cluster analysis is to group together similar objects. [Pg.508]

The dimensionality of a data set is the number of variables that are used to describe each object. For example, a conformation of a cyclohexane ring might be described in terms of the six torsion angles in the ring. However, it is often found that there are significant correlations between these variables. Under such circumstances, a cluster analysis is often facilitated by reducing the dimensionality of a data set to eliminate these correlations. Principal components analysis (PCA) is a commonly used method for reducing the dimensionality of a data set. [Pg.513]
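To illustrate how correlated variables collapse onto fewer dimensions, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix. It is a standard-library illustration under stated assumptions, not a production PCA: the function name and data are hypothetical, only the leading component is extracted, and power iteration would fail if the starting vector happened to be orthogonal to that component.

```python
import math


def first_principal_component(data, iterations=200):
    """Return (direction, explained_fraction) for the leading component.

    `data` is a list of equal-length rows (one row per object, at
    least two rows). The direction is a unit vector; explained_fraction
    is the share of total variance captured along it.
    """
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    vec = [1.0] * dim  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along this direction
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the variance along the final direction.
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)
    )
    total = sum(cov[i][i] for i in range(dim))
    return vec, (eigenvalue / total if total else 0.0)


# Two perfectly correlated variables: the second is twice the first.
samples = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, explained = first_principal_component(samples)
```

Here a single component carries all of the variance, so the two original variables can be replaced by one without losing information.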

Although knowledge-based potentials are most popular, it is also possible to use other types of potential function. Some of these are more firmly rooted in the fundamental physics of interatomic interactions whereas others do not necessarily have any physical interpretation at all but are able to discriminate the correct fold from decoy structures. These decoy structures are generated so as to satisfy the basic principles of protein structure such as a close-packed, hydrophobic core [Park and Levitt 1996]. The fold library is also clearly important in threading. For practical purposes the library should obviously not be too large, but it should be as representative of the different protein folds as possible. To derive a fold database one would typically first use a relatively fast sequence comparison method in conjunction with cluster analysis to identify families of homologues, which are assumed to have the same fold. A sequence identity threshold of about 30% is commonly... [Pg.562]

Selection of Diverse Sets Using Cluster Analysis... [Pg.698]

In dissimilarity-based compound selection the required subset of molecules is identified directly, using an appropriate measure of dissimilarity (often taken to be the complement of the similarity). This contrasts with the two-stage procedure in cluster analysis, where it is first necessary to group together the molecules and then decide which to select. Most methods for dissimilarity-based selection fall into one of two categories: maximum dissimilarity algorithms and sphere exclusion algorithms [Snarey et al. 1997]. [Pg.699]
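A maximum dissimilarity selection can be sketched as follows. This uses the MaxMin rule, one common variant, in which each step adds the molecule whose nearest already-selected neighbour is farthest away. The dissimilarity is taken as the complement of the Tanimoto similarity, in line with the text; the set-based fingerprints, function names, and seed choice are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dissimilarity(a, b):
    """Complement of the similarity, as described in the text."""
    return 1.0 - tanimoto(a, b)


def maxmin_select(fingerprints, k, seed=0):
    """Pick up to k diverse items, starting from index `seed`.

    At each step the candidate with the largest minimum dissimilarity
    to the items already selected is added to the subset.
    """
    selected = [seed]
    while len(selected) < min(k, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: min(
                dissimilarity(fingerprints[i], fingerprints[j]) for j in selected
            ),
        )
        selected.append(best)
    return selected


# Hypothetical fingerprints: each set holds the features present in a molecule.
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
subset = maxmin_select(fingerprints, k=3)
```

No clustering step is needed: the subset is built directly from the dissimilarities, which is the contrast with cluster-based selection drawn above.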

A major potential drawback with cluster analysis and dissimilarity-based methods for selecting diverse compounds is that there is no easy way to quantify how completely one has filled the available chemical space or to identify whether there are any holes. This is a key advantage of the partition-based approaches (also known as cell-based methods). A number of axes are defined, each corresponding to a descriptor or some combination of descriptors. Each axis is divided into a number of bins. If there are axes and each is divided into b bins then the number of cells in the multidimensional space so created is ... [Pg.701]
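The cell-based idea can be sketched in a few lines: assign each compound to a cell by binning its descriptor values, then compare the number of occupied cells with the total. The sketch assumes equal-width bins and the same number of bins on every axis, so the total is the bin count raised to the power of the number of axes; the descriptor ranges, names, and data points are hypothetical.

```python
def cell_index(values, lows, highs, bins):
    """Map one compound's descriptor values to a tuple of bin indices."""
    index = []
    for value, low, high in zip(values, lows, highs):
        fraction = (value - low) / (high - low)
        # Clamp so values on or beyond the edges land in the end bins.
        index.append(min(bins - 1, max(0, int(fraction * bins))))
    return tuple(index)


def coverage(points, lows, highs, bins):
    """Return (occupied_cells, total_cells) for a set of compounds."""
    occupied = {cell_index(p, lows, highs, bins) for p in points}
    total = bins ** len(lows)  # equal bin count on every axis (assumption)
    return occupied, total


# Hypothetical data: two descriptor axes, each spanning 0 to 3, three bins each.
compounds = [(0.5, 0.5), (0.7, 0.2), (2.5, 2.5), (1.5, 0.1)]
occupied, total = coverage(compounds, lows=(0.0, 0.0), highs=(3.0, 3.0), bins=3)
empty_cells = total - len(occupied)
```

The empty cells are the holes in chemical space that clustering and dissimilarity-based methods cannot report directly.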

Multiple linear regression is strictly a parametric supervised learning technique. A parametric technique is one which assumes that the variables conform to some distribution (often the Gaussian distribution); the properties of the distribution are assumed in the underlying statistical method. A non-parametric technique does not rely upon the assumption of any particular distribution. A supervised learning method is one which uses information about the dependent variable to derive the model. An unsupervised learning method does not. Thus cluster analysis, principal components analysis and factor analysis are all examples of unsupervised learning techniques. [Pg.719]

In a typical application of hierarchical cluster analysis, measurements are made on the samples and used to calculate interpoint distances using an appropriate distance metric. The general distance is given by... [Pg.422]
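A hierarchical cluster analysis built on interpoint distances might be sketched as below. Because the general distance formula is elided in the excerpt, this sketch assumes Euclidean distance and single linkage, where the distance between two clusters is that of their closest pair of members; both choices, along with the function name and sample measurements, are illustrative. `math.dist` requires Python 3.8 or later.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering with single linkage.

    Starts with one cluster per point and repeatedly merges the two
    clusters whose closest members are nearest, until `n_clusters`
    remain. Returns a list of clusters as sorted lists of point indices.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical two-variable measurements on five samples.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
groups = single_linkage(samples, n_clusters=3)
```

Recording the merge distance at each step, rather than stopping at a fixed cluster count, would give the information needed to draw a dendrogram.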

Fig. 10. R-mode cluster analysis of the Pacific Northwest rainwater study (24). Reprinted with permission.

The PLS calibration set was built by mixing, in an agate mortar, different amounts of Mancozeb standard with kaolin, a coadjuvant usually formulated in agrochemicals. Cluster analysis was employed for sample classification and to select the adequate PLS model according to the characteristics of the sample matrix and the presence of other components. [Pg.93]

To evaluate possible classes among the samples considered, a clustering analysis was carried out before PFS treatment in order to select a reduced but representative calibration set. [Pg.142]

It has been shown that similar conformations that belong to adjacent energy basins separated by high energy barriers are incorrectly grouped together by the straightforward cluster analysis [29]. [Pg.86]
