Conference papers

Un modèle multimodal d’apprentissage de représentations de phrases qui préserve la sémantique visuelle

Abstract: In this paper, we tackle visual grounding, an active field aiming to enrich textual representations with visual information, at the sentence level. Our model transfers the structure of a visual representation space to the textual space without using inter-modal projections, which are inherently problematic since modalities do not have a one-to-one correspondence. Our new multimodal approach can build upon any sentence representation model and can be implemented in a simple fashion by using objectives ensuring that (1) sentences associated with the same visual content should be close in the textual space and (2) similarities between related elements should be preserved across modalities. We demonstrate the quality of the learned representations on semantic relatedness, classification and cross-modal retrieval tasks.
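To make objectives (1) and (2) concrete, the sketch below shows one way they could be written as loss terms. It is a minimal illustration under assumed conventions, not the paper's implementation: it assumes PyTorch, a batch of sentence embeddings text_emb, visual embeddings image_emb of the paired images, and per-sentence image identifiers image_ids (all hypothetical names), and it uses cosine similarity in each space.

```python
import torch
import torch.nn.functional as F

def grounding_losses(text_emb, image_emb, image_ids):
    """Illustrative loss terms for the two objectives in the abstract.

    text_emb  : (N, d_t) sentence embeddings from any sentence encoder
    image_emb : (N, d_v) embeddings of the image paired with each sentence
    image_ids : (N,)     id of the image each sentence describes
    """
    text_emb = F.normalize(text_emb, dim=1)
    image_emb = F.normalize(image_emb, dim=1)

    text_sim = text_emb @ text_emb.t()      # intra-textual cosine similarities
    visual_sim = image_emb @ image_emb.t()  # intra-visual cosine similarities

    # (1) Cluster objective: sentences describing the same visual content
    #     should be close to each other in the textual space.
    same_image = image_ids.unsqueeze(0) == image_ids.unsqueeze(1)
    off_diag = ~torch.eye(image_ids.numel(), dtype=torch.bool,
                          device=image_ids.device)
    pos_pairs = same_image & off_diag
    cluster_loss = (1.0 - text_sim[pos_pairs]).mean() if pos_pairs.any() \
        else text_sim.new_zeros(())

    # (2) Structure transfer: pairwise similarities computed in the visual
    #     space should be preserved among the corresponding sentences,
    #     without projecting one modality onto the other.
    structure_loss = F.mse_loss(text_sim, visual_sim)

    return cluster_loss, structure_loss
```

In a training loop, a weighted sum of the two terms would presumably be back-propagated through the sentence encoder, so that the textual space inherits the structure of the (fixed) visual space; the weighting and encoder choices are assumptions here, not details given in the abstract.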

https://hal.sorbonne-universite.fr/hal-02351045
Contributor: Benjamin Piwowarski
Submitted on: Wednesday, November 6, 2019 - 11:39:18 AM
Last modification on: Saturday, June 6, 2020 - 6:16:04 PM


Citation

Patrick Bordes, Éloi Zablocki, Laure Soulier, Benjamin Piwowarski. Un modèle multimodal d’apprentissage de représentations de phrases qui préserve la sémantique visuelle. COnférence en Recherche d'Informations et Applications, May 2019, Lyon, France. ⟨10.24348/coria.2019.CORIA_2019_paper_9⟩. ⟨hal-02351045⟩
