J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Learning to compose neural networks for question answering, NAACL HLT 2016, pp.1545-1554, 2016.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, ICCV, 2015.

M. Charikar, K. Chen, and M. Farach-Colton, Finding frequent items in data streams, International Colloquium on Automata, Languages and Programming, pp.693-703, 2002.

K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp.103-111, 2014.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, 2016.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, 2015.

J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim, J. Ha, and B. Zhang, Multimodal Residual Learning for Visual QA, NIPS, pp.361-369, 2016.

J. Kim, K. On, J. Kim, J. Ha, and B. Zhang, Hadamard Product for Low-rank Bilinear Pooling, 5th International Conference on Learning Representations, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal neural language models, ICML, pp.595-603, 2014.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba et al., Skip-thought vectors, NIPS, pp.3294-3302, 2015.

T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev, vol.51, issue.3, pp.455-500, 2009.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.

R. Li and J. Jia, Visual question answering with question representation update (QRU), NIPS, pp.4655-4663, 2016.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, ECCV, 2014.

T. Lin, A. Roychowdhury, and S. Maji, Bilinear CNN models for fine-grained visual recognition, ICCV, 2015.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, NIPS, pp.289-297, 2016.

L. Ma, Z. Lu, L. Shang, and H. Li, Multimodal convolutional neural networks for matching image and sentence, ICCV, pp.2623-2631, 2015.

M. Malinowski and M. Fritz, Towards a visual Turing challenge, Learning Semantics (NIPS workshop), 2014.

M. Malinowski, M. Rohrbach, and M. Fritz, Ask your neurons: A deep learning approach to visual question answering, 2016.

M. Ren, R. Kiros, and R. S. Zemel, Exploring models and data for image question answering, NIPS, pp.2953-2961, 2015.

K. J. Shih, S. Singh, and D. Hoiem, Where to look: Focus regions for visual question answering, CVPR, 2016.

R. Socher, A. Karpathy, Q. Le, C. Manning, and A. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, vol.2, pp.207-218, 2014.

L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, vol.31, issue.3, pp.279-311, 1966.
DOI : 10.1007/bf02289464

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, CVPR, pp.3156-3164, 2015.
DOI : 10.1109/cvpr.2015.7298935
URL : http://arxiv.org/pdf/1411.4555

Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel, Ask me anything: Free-form visual question answering based on knowledge from external sources, CVPR, 2016.
DOI : 10.1109/cvpr.2016.500
URL : http://arxiv.org/pdf/1511.06973

H. Xu and K. Saenko, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, ECCV, pp.451-466, 2016.
DOI : 10.1007/978-3-319-46478-7_28
URL : http://arxiv.org/pdf/1511.05234

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, ICML, pp.2048-2057, 2015.

F. Yan and K. Mikolajczyk, Deep correlation for matching images and text, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7298966

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked attention networks for image question answering, CVPR, pp.21-29, 2016.
DOI : 10.1109/cvpr.2016.10
URL : http://arxiv.org/pdf/1511.02274

Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, Visual7W: Grounded Question Answering in Images, CVPR, 2016.
DOI : 10.1109/cvpr.2016.540
URL : http://arxiv.org/pdf/1511.03416