Towards Goal-Driven Visually Grounded Dialog Agents

2 November 2017

Time: Thursday, 2 November 2017, 12:00 pm
Location: Garda (Povo 2), Polo scientifico e tecnologico "Fabio Ferrari", Building Povo 1, via Sommarive 5, Povo (Trento)


Stefan Lee, School of Interactive Computing at Georgia Tech


One goal of AI is to develop artificial agents that can perceive their surroundings and communicate this understanding to humans in natural language to accomplish cooperative tasks. For example, a user might talk with an expert agent in order to learn about some visually grounded topic (e.g. User: "What kind of bird is that?" AI: "It is a blue jay." User: "How can you tell?" AI: "Its blue crown and wings give it away!") or to sift through large quantities of visual data (e.g. User: "Has anyone entered this hallway in the last month?" AI: "Yes, 127 instances are logged on camera." User: "Were any of them carrying a black bag?"). In this talk, I will focus on a recent line of work on such question-answer based dialogs grounded in natural images, a task we call Visual Dialog. First, I will provide an overview of the Visual Dialog task and the data collection effort culminating in the VisDial dataset of over 1.2 million rounds of visually grounded dialog. I will then describe a number of deep agent architectures trained for this task and some of the challenges faced by these supervised-learning based models. Finally, I will discuss follow-up work in which we address some of these challenges by modeling Visual Dialog as a cooperative game between agents in a reinforcement learning setting, learning dialog agent policies end-to-end: from pixels, through multi-agent, multi-round dialog, to game reward.

About the Speaker

Stefan Lee is a Research Scientist in the School of Interactive Computing at Georgia Tech, collaborating closely with Dhruv Batra and Devi Parikh at the intersection of computer vision and natural language processing. He received his PhD from Indiana University in 2016 under David Crandall and was awarded the Bradley Postdoctoral Fellowship at Virginia Tech for 2016-2017, working with Dhruv Batra. He has published at NIPS, ICCV, CVPR, WACV, EMNLP, and ICCP and has held visiting research positions at Virginia Tech, INRIA Willow, and UC Berkeley.

Contact person regarding this talk: Raffaella Bernardi

For more info see:
Visual Dialog [CVPR 2017]
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model [NIPS 2017]
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning [ICCV 2017]
More information and a live demo of one of our Visual Dialog agents are also available online.