“One code to find them all”: a perl tool to conveniently parse RepeatMasker output files

Marc Bailly-Bechet; Annabelle Haudry; Emmanuelle Lerat

doi:10.1186/1759-8753-5-13

Journal article, Mobile DNA, Year: 2014

Marc Bailly-Bechet

Role: Author

Atelier de BioInformatique

Bioinformatique, phylogénie et génomique évolutive [LBBE]

Annabelle Haudry

Role: Author
PersonId: 169546
IdHAL: annabelle-haudry
ORCID: 0000-0001-6088-0909
IdRef: 083544887

Eléments transposables, évolution, populations [LBBE]

Emmanuelle Lerat

Role: Author
PersonId: 21696
IdHAL: emmanuelle-lerat
ORCID: 0000-0001-6757-8796
IdRef: 069851107

Eléments transposables, évolution, populations [LBBE]

Abstract

Background: Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used is the homology-based approach implemented in the RepeatMasker program. RepeatMasker generates several output files, including the .out file, which provides annotations for all detected repeats in a query sequence. However, a remaining challenge is to identify the individual TE copies to which the reported hits correspond. This step is essential for any evolutionary or comparative analysis of the different copies within a family. Several situations can produce multiple hits for a single copy of an element, such as large deletions or insertions, undetermined bases, and distinct consensus sequences that together correspond to a single full-length element (as for long terminal repeat (LTR)-retrotransposons). These situations must be taken into account to determine the exact number of TE copies.

Results: We have developed a Perl tool that parses the RepeatMasker .out file to better determine the number and positions of TE copies in the query sequence, in addition to computing quantitative information for the different families. To determine the accuracy of the program, we tested it on several RepeatMasker .out files corresponding to two organisms (Drosophila melanogaster and Homo sapiens) for which the TE content has already been largely described and which differ greatly in genome size, TE content, and TE families.

Conclusions: Our tool provides access to detailed information concerning the TE content in a genome at the family level from the .out file of RepeatMasker. This information includes the exact position and orientation of each copy, its proportion in the query sequence, and its quality compared to the reference element. In addition, our tool allows a user to directly retrieve the sequence of each copy and obtain the same detailed information at the family level when a local library with incomplete TE class/subclass information was used with RepeatMasker. We hope that this tool will be helpful for people working on the distribution and evolution of TEs within genomes.
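
To make the problem stated in the abstract concrete, the following Python sketch reads .out-style lines and merges neighbouring hits of the same repeat into a single putative copy. It is only an illustration under stated assumptions and is not the authors' Perl tool or its algorithm: the names `Hit`, `parse_out`, `merge_hits`, and the `max_gap` threshold are hypothetical, the merging rule is a deliberately naive distance heuristic, and the sample rows (`seq1`, `ElemA`, `ElemB`) are invented for the example. The column order assumed here follows the usual RepeatMasker .out layout (score, divergence, deletions, insertions, query name, query begin, query end, bases left, strand, repeat name, class/family, then positions in the repeat and an ID).

```python
from dataclasses import dataclass, replace

# Invented sample in the usual .out column layout (not real annotation data).
SAMPLE = """\
   SW  perc perc perc  query  position in query     matching repeat      position in repeat
score  div. del. ins.  sequence  begin  end  (left)  repeat  class/family  begin  end  (left)  ID

 1200   2.1  0.0  0.0  seq1  1000  1500  (8500)  +  ElemA  LTR/Gypsy       1   500  (1500)  1
 1100   3.0  0.5  0.0  seq1  1520  2000  (8000)  +  ElemA  LTR/Gypsy     520  1000  (1000)  1
  900   5.0  1.0  0.0  seq1  5000  5300  (4700)  C  ElemB  DNA/hAT       (10)  300     1    2
 1300   1.0  0.0  0.0  seq1  9000  9400  (600)   +  ElemA  LTR/Gypsy       1   400  (1600)  3
"""


@dataclass
class Hit:
    query: str    # name of the query sequence
    start: int    # first position of the hit in the query
    end: int      # last position of the hit in the query
    strand: str   # "+" or "C" (complement)
    repeat: str   # name of the matching repeat
    family: str   # class/family label


def parse_out(text):
    """Parse .out-style text into Hit records, skipping header and blank lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Data lines have at least 15 columns and start with an integer score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        hits.append(Hit(fields[4], int(fields[5]), int(fields[6]),
                        fields[8], fields[9], fields[10]))
    return hits


def merge_hits(hits, max_gap=100):
    """Naive heuristic (an assumption, not the published method): hits of the
    same repeat, on the same query and strand, separated by at most max_gap
    bases are treated as fragments of one copy."""
    copies = []
    ordered = sorted(hits, key=lambda h: (h.query, h.repeat, h.strand, h.start))
    for hit in ordered:
        last = copies[-1] if copies else None
        same_element = (last is not None and
                        (last.query, last.repeat, last.strand) ==
                        (hit.query, hit.repeat, hit.strand))
        if same_element and hit.start - last.end <= max_gap:
            last.end = max(last.end, hit.end)   # extend the current copy
        else:
            copies.append(replace(hit))         # start a new copy
    return copies


copies = merge_hits(parse_out(SAMPLE))
for c in copies:
    print(c.query, c.repeat, c.strand, c.start, c.end)
```

With the invented sample, the two adjacent `ElemA` fragments are reported as one copy, while the distant `ElemA` hit and the `ElemB` hit remain separate. A real analysis would also need to handle the cases the abstract lists, such as LTR and internal regions annotated under different consensus names, which this distance rule does not address.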

Keywords

RepeatMasker; Transposable elements; Annotation

Domains

Bioinformatics [q-bio.QM]

Main file

onecode.pdf (2.25 MB)

Origin: Publication funded by an institution

Contributor: Gestionnaire HAL 2 Sorbonne Université

https://hal.sorbonne-universite.fr/hal-01332104

Submitted on: Wednesday, June 15, 2016, 11:32:05

Last modified on: Saturday, May 18, 2024, 03:15:13

Dates and versions

hal-01332104, version 1 (15-06-2016)

License

Attribution

Identifiers

HAL Id: hal-01332104, version 1
DOI: 10.1186/1759-8753-5-13

Cite

Marc Bailly-Bechet, Annabelle Haudry, Emmanuelle Lerat. “One code to find them all”: a perl tool to conveniently parse RepeatMasker output files. Mobile DNA, 2014, 5, pp.13. ⟨10.1186/1759-8753-5-13⟩. ⟨hal-01332104⟩

Collections

UPMC CNRS UNIV-LYON1 BIOENVIS LBBE SORBONNE-UNIVERSITE SU-SCIENCES UDL

425 views

223 downloads
