Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

By Ramprasaath R. Selvaraju

Many popular and well-performing models for multi-modal, vision and language tasks exhibit poor visual grounding -- failing to appropriately associate words or phrases with the image regions they denote and relying instead on superficial linguistic correlations. For example, answering the question “What color are the bananas?” with “yellow” regardless of their …

Embodied Amodal Recognition: Learning to Move to Perceive Objects

By Jianwei Yang and Zhile Ren

With the rapid development of computer vision, technologies such as object detection and image classification are becoming mature and effective. These vision algorithms play important roles in many real-world systems, enabling applications ranging from augmented reality to self-driving cars. The pipeline for designing a typical computer vision system …

Overcoming Large-scale Annotation Requirements for Understanding Videos in the Wild

By Min-Hung Chen, Zsolt Kira and Ghassan AlRegib

Videos have become an increasingly important medium from which we obtain valuable information and knowledge. This motivates the development of video analysis techniques, which could, for example, provide recommendations or support discovery for different objectives. Given the recent …