Evaluation and improvements of clustering algorithms for detecting remote homologous protein families

Juliana S Bernardes; Fabio Rj Vieira; Lygia Mm Costa; Gerson Zaverucha

doi:10.1186/s12859-014-0445-4

Journal article, BMC Bioinformatics, Year: 2015

Juliana S Bernardes

Role: Corresponding author
PersonId : 965731
ORCID : 0000-0003-1341-4256
IdRef : 166015423

Biologie Computationnelle et Quantitative = Laboratory of Computational and Quantitative Biology

Programa de Engenharia de Sistemas e Computação

Fabio Rj Vieira

Role: Author

Programa de Engenharia de Sistemas e Computação

Lygia Mm Costa

Role: Author

Engenharia de computacao e informacao

Gerson Zaverucha

Role: Author

Programa de Engenharia de Sistemas e Computação

Abstract

Background: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been developed for this task, and they perform reasonably well (F-measure above 0.88) when grouping proteins with high sequence identity. However, for highly diverged proteins the performance of these methods can be much lower, mainly because a common evolutionary origin cannot be deduced directly from sequence similarity. To the best of our knowledge, a systematic evaluation of clustering methods over distant homologous proteins is still lacking.

Results: We performed a comparative assessment of four clustering algorithms: Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix), considering several datasets with different levels of sequence similarity. Two types of similarity measures, required as input by the sequence clustering methods, were used to evaluate the performance of the algorithms: the standard measure obtained from sequence-sequence comparisons, and a novel measure based on profile-profile comparisons, used here for the first time.

Conclusions: The results reveal low clustering performance for the highly divergent datasets when the standard measure was used. However, the novel measure based on profile-profile comparisons substantially improved the performance of the four methods, especially when datasets with very low sequence identity were evaluated. We also performed a parameter optimization step to determine the best configuration for each clustering method. We found that TransClust clearly outperformed the other methods for most datasets. This work also provides guidelines for the practical application of sequence clustering methods aimed at accurately detecting groups of related protein sequences.
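The abstract reports clustering quality as an F-measure but does not define it here. As an illustration only, the sketch below computes one common variant, a size-weighted best-match F-measure between reference families and predicted clusters. The function name `clustering_f_measure` and the choice of this variant are assumptions made for the example; the paper's exact definition may differ.

```python
from collections import defaultdict


def clustering_f_measure(true_labels, pred_labels):
    """Size-weighted best-match F-measure (an assumed variant, not
    necessarily the definition used in the paper).

    For every reference family, the best F1 score over all predicted
    clusters is taken, and the scores are averaged with weights
    proportional to family size.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    total = len(true_labels)
    if total == 0:
        return 0.0

    # Group sequence indices by reference family and by predicted cluster.
    families = defaultdict(set)
    clusters = defaultdict(set)
    for idx, (fam, clu) in enumerate(zip(true_labels, pred_labels)):
        families[fam].add(idx)
        clusters[clu].add(idx)

    score = 0.0
    for members in families.values():
        best = 0.0
        for cluster in clusters.values():
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)
            recall = overlap / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (len(members) / total) * best
    return score


# Hypothetical toy data: five sequences, two reference families.
reference = ["A", "A", "A", "B", "B"]
predicted = [0, 0, 1, 1, 1]
toy_score = clustering_f_measure(reference, predicted)
```

With this variant, a prediction that reproduces the reference families exactly scores 1, and splitting or merging families lowers the score.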

Keywords

Remote homology detection; Clustering sequence algorithms; Sequence analysis

Domains

Bioinformatics, Systems Biology [q-bio.QM]

Main file

Bernardes_2015_Evaluation_and.pdf (935.64 KB)

Origin: Publication funded by an institution

Contributor: HAL SU Administrateur

https://hal.sorbonne-universite.fr/hal-01143450

Submitted on: Friday, April 17, 2015, 15:37:50

Last modified on: Monday, April 15, 2024, 15:16:03

Long-term archiving on: Tuesday, April 18, 2017, 22:54:02

Dates and versions

hal-01143450, version 1 (17-04-2015)

Identifiers

HAL Id: hal-01143450, version 1
DOI: 10.1186/s12859-014-0445-4
PUBMED: 25651949

Cite

Juliana S Bernardes, Fabio Rj Vieira, Lygia Mm Costa, Gerson Zaverucha. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics, 2015, 16 (1), pp.34. ⟨10.1186/s12859-014-0445-4⟩. ⟨hal-01143450⟩

Collections

UPMC CNRS GRID5000 LCQB LCQB-AG LCQB-SGBP IBPS SORBONNE-UNIVERSITE SU-SCIENCES SILECS
