Learning to compose neural networks for question answering, NAACL HLT 2016, pp.1545-1554, 2016. ,
VQA: Visual Question Answering, ICCV, 2015. ,
Finding frequent items in data streams, International Colloquium on Automata, Languages and Programming, pp.693-703, 2002. ,
On the properties of neural machine translation: Encoderdecoder approaches, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp.103-111, 2014. ,
, Multimodal compact bilinear pooling for visual question answering and visual grounding, 2016.
Deep residual learning for image recognition, 2015. ,
,
Multimodal Residual Learning for Visual QA, NIPS, pp.361-369, 2016. ,
Hadamard Product for Low-rank Bilinear Pooling, 5th International Conference on Learning Representations, 2017. ,
Adam: A method for stochastic optimization, 2014. ,
Multimodal neural language models, ICML, pp.595-603, 2014. ,
Skip-thought vectors, NIPS, pp.3294-3302, 2015. ,
Tensor decompositions and applications, SIAM Rev, vol.51, issue.3, pp.455-500, 2009. ,
, Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
Visual question answering with question representation update (qru), NIPS, pp.4655-4663, 2016. ,
Microsoft coco: Common objects in context, ECCV, 2014. ,
Bilinear cnn models for fine-grained visual recognition, ICCV, 2015. ,
Hierarchical question-image co-attention for visual question answering, NIPS, pp.289-297, 2016. ,
Multimodal convolutional neural networks for matching image and sentence, ICCV, pp.2623-2631, 2015. ,
Towards a visual turing challenge, Learning Semantics (NIPS workshop), 2014. ,
Ask your neurons: A deep learning approach to visual question answering, 2016. ,
Exploring models and data for image question answering, NIPS, pp.2953-2961, 2015. ,
Where to look: Focus regions for visual question answering, CVPR, 2016. ,
Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, vol.2, pp.207-218, 2014. ,
Some mathematical notes on three-mode factor analysis, Psychometrika, vol.31, issue.3, pp.279-311, 1966. ,
DOI : 10.1007/bf02289464
Show and tell: A neural image caption generator, CVPR, pp.3156-3164, 2015. ,
DOI : 10.1109/cvpr.2015.7298935
URL : http://arxiv.org/pdf/1411.4555
Ask me anything: free-form visual question answering based on knowledge from external sources, CVPR, 2016. ,
DOI : 10.1109/cvpr.2016.500
URL : http://arxiv.org/pdf/1511.06973
Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, ECCV, pp.451-466, 2016. ,
DOI : 10.1007/978-3-319-46478-7_28
URL : http://arxiv.org/pdf/1511.05234
Show, attend and tell: Neural image caption generation with visual attention, ICML, pp.2048-2057, 2015. ,
Deep correlation for matching images and text, CVPR, 2015. ,
DOI : 10.1109/cvpr.2015.7298966
Stacked attention networks for image question answering, CVPR, pp.21-29, 2016. ,
DOI : 10.1109/cvpr.2016.10
URL : http://arxiv.org/pdf/1511.02274
Visual7W: Grounded Question Answering in Images, CVPR, 2016. ,
DOI : 10.1109/cvpr.2016.540
URL : http://arxiv.org/pdf/1511.03416