Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding

PhD candidate Dan Xu

8 May 2018

Time: 10:30 am
Location: Room Garda, Polo Ferrari 1 - Via Sommarive 5, Povo (TN)

Abstract of Dissertation

With the explosive growth of visual data, it is particularly important to develop intelligent visual understanding techniques that can cope with data at this scale. Many efforts have been made in recent years to build highly effective, large-scale visual processing algorithms and systems. A core question in this line of research is how to learn robust representations that better describe the data. In this thesis we study the problem of visual image and video understanding. Specifically, we address it by designing and implementing novel multi-modal and structured representation learning approaches, both of which are active, fundamental research topics in machine learning. Multi-modal representation learning relates information from multiple input sources, while structured representation learning exploits the rich structural information hidden in the data for robust feature learning. We investigate both shallow representation learning frameworks, such as dictionary learning, and deep representation learning frameworks, such as deep neural networks, and present the different modules devised in our work: cross-paced representation learning, cross-modal feature learning and transfer, multi-scale structured prediction and fusion, and multi-modal prediction and distillation. These techniques are further applied to several visual understanding tasks, namely sketch-based image retrieval (SBIR), video pedestrian detection, monocular depth estimation and scene parsing, where they show superior performance.