Identifying structural domains of proteins using clustering

BMC Bioinformatics. 2012 Nov 1:13:286. doi: 10.1186/1471-2105-13-286.

Abstract

Background: Protein structures are comprised of modular elements known as domains. These units are used and re-used over and over in nature, and usually serve some particular function in the structure. Thus it is useful to be able to break up a protein of interest into its component domains, prior to similarity searching for example. Numerous computational methods exist for doing so, but most operate only on a single protein chain and many are limited to making a series of cuts to the sequence, while domains can and do span multiple chains.

Results: This study presents a novel clustering-based approach to domain identification, which works equally well on individual chains or entire complexes. The method is simple and fast, taking only a few milliseconds to run, and works by clustering either vectors representing secondary structure elements, or buried alpha-carbon positions, using average-linkage clustering. Each resulting cluster corresponds to a domain of the structure. The method is competitive with others, achieving 70% agreement with SCOP on a large non-redundant data set, and 80% on a set more heavily weighted in multi-domain proteins on which both SCOP and CATH agree.

Conclusions: It is encouraging that a basic method such as this performs nearly as well or better than some far more complex approaches. This suggests that protein domains are indeed for the most part simply compact regions of structure with a higher density of buried contacts within themselves than between each other. By representing the structure as a set of points or vectors in space, it allows us to break free of any artificial limitations that other approaches may depend upon.

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • Computational Biology / methods*
  • Protein Structure, Secondary
  • Protein Structure, Tertiary*
  • Proteins / chemistry*
  • Sequence Analysis, Protein / methods*
  • Sequence Analysis, Protein / statistics & numerical data*

Substances

  • Proteins