Genome annotation across species using deep convolutional neural networks

Distributed under Creative Commons CC-BY 4.0

Ghazaleh Khodabandelou; Etienne Routhier; Julien Mozziconacci

doi:10.7717/peerj-cs.278

Journal article, PeerJ Computer Science, Year: 2020


Ghazaleh Khodabandelou

Role: Author
PersonId : 938382
ORCID : 0000-0002-8078-8461
IdRef : 26330373X

Laboratoire de Physique Théorique de la Matière Condensée

Laboratoire Images, Signaux et Systèmes Intelligents

Etienne Routhier

Role: Author

Laboratoire de Physique Théorique de la Matière Condensée

Julien Mozziconacci

Role: Author
PersonId : 1093826

Laboratoire de Physique Théorique de la Matière Condensée

Structure et Instabilité des Génomes

Abstract

The application of deep neural networks is a rapidly expanding field that now reaches many disciplines, including genomics. In particular, convolutional neural networks have been used to identify the functional role of short genomic sequences. These approaches rely on gathering large sets of sequences with a known functional role, extracted from whole-genome annotations. These sets are then split into training, test and validation sets in order to train the networks. While the resulting networks perform well on validation sets, they often perform poorly when applied to whole genomes, in which the ratio of positive to negative examples can be very different from that in the training set. Here, we address this issue by assessing the genome-wide performance of networks trained with sets exhibiting different ratios of positive to negative examples. As a case study, we use sequences encompassing gene starts from the RefGene database as positive examples and random genomic sequences as negative examples. We then demonstrate that models trained using data from one organism can be used to predict gene-start sites in a related species, provided the training sets used give good genome-wide performance. This cross-species application of convolutional neural networks provides a new way to annotate any genome from existing high-quality annotations in a related reference species. It also provides a way to determine whether the sequence motifs recognised by chromatin-associated proteins in different species are conserved or not.
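The construction of training sets with different ratios of negative to positive examples, as described in the abstract, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function names (`one_hot`, `sample_negatives`, `build_training_set`), the one-hot encoding, the toy genome string and the ratio value are all hypothetical, and a real workflow would read gene-start windows from an annotation file rather than from a hard-coded list.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows over (A, C, G, T).

    Symbols outside ACGT (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def sample_negatives(genome, window, count, rng):
    """Draw `count` random windows of length `window` from `genome`.

    Assumption: windows are drawn uniformly and are not filtered against
    known gene starts, so a few may overlap true positives.
    """
    last_start = len(genome) - window
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [genome[s:s + window] for s in starts]


def build_training_set(positives, genome, ratio, seed=0):
    """Return (sequences, labels) with `ratio` negatives per positive."""
    rng = random.Random(seed)
    window = len(positives[0])
    negatives = sample_negatives(genome, window, ratio * len(positives), rng)
    examples = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(examples)
    sequences, labels = zip(*examples)
    return list(sequences), list(labels)


# Toy usage: two hypothetical "gene start" windows, three negatives per positive.
toy_genome = "ACGTTGACCA" * 10
seqs, labels = build_training_set(["ACGT", "TTGA"], toy_genome, ratio=3)
encoded = [one_hot(s) for s in seqs]
```

Varying `ratio` yields training sets whose class balance moves closer to, or further from, the balance found genome-wide, which is the quantity the abstract identifies as driving genome-wide performance.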

Keywords

Bioinformatics; Computational Biology; Data Mining and Machine Learning; Promoters; Genome annotation; Deep learning; DNA motifs; Sequence evolution; Unbalanced datasets; Transcription start sites

Domains

Physics [physics]

Main file

peerj-cs-278.pdf (9.85 MB)

Origin: Publication funded by an institution


https://hal.sorbonne-universite.fr/hal-02883334

Submitted on: Monday, 29 June 2020, 09:10:12

Last modified on: Wednesday, 30 October 2024, 22:17:30

Dates and versions

hal-02883334 , version 1 (29-06-2020)

Identifiers

HAL Id : hal-02883334 , version 1
DOI : 10.7717/peerj-cs.278

Cite

Ghazaleh Khodabandelou, Etienne Routhier, Julien Mozziconacci. Genome annotation across species using deep convolutional neural networks. PeerJ Computer Science, 2020, 6, pp.e278. ⟨10.7717/peerj-cs.278⟩. ⟨hal-02883334⟩


Collections

INSERM MNHN CNRS LISSI UPEC LPTMC INC-CNRS SORBONNE-UNIVERSITE SU-SCIENCES ANR

117 Views

84 Downloads
