By Ramprasaath R. Selvaraju
Many popular and well-performing models for multi-modal vision-and-language tasks exhibit poor visual grounding: they fail to appropriately associate words or phrases with the image regions they denote and rely instead on superficial linguistic correlations. For example, a model may answer the question “What color are the bananas?” with “yellow” regardless of the ripeness evident in the image. When challenged with datasets that penalize reliance on these sorts of biases, state-of-the-art models show significant drops in performance, even though the set of visual and linguistic concepts they must reason about is unchanged.
In addition to these diagnostic datasets, another powerful class of tools for observing this shortcoming is gradient-based explanation techniques, which allow researchers to examine which portions of the input a model relies on when making decisions. Applying these techniques has shown that vision-and-language models often focus on seemingly irrelevant or contextual image regions that differ significantly from where human subjects fixate when asked to perform the same tasks, e.g., focusing on a produce stand rather than the bananas in our example.
While somewhat dissatisfying, these findings are not entirely surprising. After all, standard training protocols do not provide any guidance for visual grounding. Instead, models are trained on input-output pairs and must resolve grounding from co-occurrences, which is a challenging task, especially in the presence of more direct and easier-to-learn correlations in language. Consider our previous example question: the words ‘color’, ‘banana’, and ‘yellow’ are given as discrete tokens that trivially match in every occurrence where these underlying concepts are referenced. In contrast, actually grounding this question requires dealing with all visual variations of bananas and learning the common feature of things described as ‘yellow’.
To address this, we explore whether giving a small hint in the form of human attention demonstrations can help improve grounding and reliability.
For the dominant paradigm of vision-and-language models that compute explicit question-guided attention over image regions, a seemingly straightforward solution is to provide explicit grounding supervision, that is, to train models to attend to the appropriate image regions. While prior work has shown that this approach results in more human-like attention maps, our experiments show it to be ineffective at reducing language bias.
Crucially, attention mechanisms are bottom-up processes that feed final classification models such that even when attending to appropriate regions, models can ignore visual content in favor of language bias. In response, we introduce a generic, second-order approach that instead aligns gradient-based explanations with human attention.
A number of works use top-down attention mechanisms to support fine-grained and multi-stage reasoning, which has been shown to be very important for vision-and-language tasks. Anderson et al. propose a variant of the traditional attention mechanism: instead of attending over convolutional features, they show that attending over objects and other salient image regions gives significant improvements in VQA and captioning performance. This is shown in the left half of the figure below.
We now turn to our approach for training deep networks to rely on the same regions as humans, which we call Human Importance-aware Network Tuning (HINT). In summary, HINT estimates the importance of input regions through gradient-based explanations and tunes the network parameters so as to align this importance with the regions deemed important by humans.
To do this, we first convert the expert knowledge obtained from human attention maps into a form that corresponds to the network inputs. The Bottom-up Top-down model takes region proposals as input. For a given instance, we compute an importance score for each proposal based on the normalized human attention map energy inside the proposal box relative to the normalized energy outside the box.
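As a rough illustration of the inside-versus-outside scoring described above, here is a minimal sketch. It assumes that "normalized" means energy divided by area and that the score is the ratio inside / (inside + outside); both choices, along with the function name `human_importance` and the box format, are assumptions made for this example rather than details stated in this post.

```python
import numpy as np

def human_importance(attention_map, boxes):
    """Score each proposal box by attention energy inside vs. outside it.

    attention_map: 2-D array (H x W) of non-negative human attention values.
    boxes: iterable of (x1, y1, x2, y2) pixel boxes, with x2 and y2 exclusive.
    Returns one score in [0, 1] per box.
    """
    attention_map = np.asarray(attention_map, dtype=float)
    total_energy = attention_map.sum()
    total_area = attention_map.size
    scores = []
    for x1, y1, x2, y2 in boxes:
        inside_energy = attention_map[y1:y2, x1:x2].sum()
        inside_area = (y2 - y1) * (x2 - x1)
        outside_energy = total_energy - inside_energy
        outside_area = total_area - inside_area
        # Assumption: normalize energy by area so big boxes do not win on size alone.
        e_in = inside_energy / inside_area if inside_area > 0 else 0.0
        e_out = outside_energy / outside_area if outside_area > 0 else 0.0
        denom = e_in + e_out
        scores.append(e_in / denom if denom > 0 else 0.0)
    return np.array(scores)
```

With this definition, a box that captures all of the attention mass scores 1, a box that captures none scores 0, and a box whose attention density matches the rest of the image scores 0.5.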
We define Network Importance as the importance that the given trained network places on spatial regions of the input when making a particular prediction. In earlier work, we proposed Grad-CAM, an approach to compute the importance of the last convolutional layer’s neurons. Since proposals usually cover objects and salient or semantic regions of interest while providing good spatial resolution, we extend Grad-CAM to compute importance over proposals.
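One way to picture gradient-based importance over proposals is to take the gradient of the ground-truth answer score with respect to each proposal's feature vector and sum it over the feature dimensions. The sketch below follows that reading, which is an assumption about the details. To stay dependency-free it approximates the gradient with finite differences, whereas a real model would use backpropagation; `score_fn` is a hypothetical stand-in for the network.

```python
import numpy as np

def proposal_importance(score_fn, features, eps=1e-5):
    """Approximate d(score)/d(features), summed over each proposal's dimensions.

    score_fn: hypothetical callable mapping a (K x D) feature array to the
              scalar score of the ground-truth answer.
    features: (K x D) array with one feature row per region proposal.
    Returns a length-K array of signed, unbounded importance values.
    """
    features = np.asarray(features, dtype=float)
    grad = np.zeros_like(features)
    for idx in np.ndindex(features.shape):
        bumped_up = features.copy()
        bumped_down = features.copy()
        bumped_up[idx] += eps
        bumped_down[idx] -= eps
        # Central finite difference stands in for backpropagation here.
        grad[idx] = (score_fn(bumped_up) - score_fn(bumped_down)) / (2 * eps)
    # Collapse the feature dimension to get one importance value per proposal.
    return grad.sum(axis=1)
```

For a toy linear model that pools proposals with fixed attention weights `a` and scores the pooled vector with weights `w`, this returns `a[k] * w.sum()` for proposal `k`, so proposals with larger attention weight receive proportionally larger importance.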
At this stage, we now have two sets of importance scores — one computed from the human attention and another from network importance — that we would like to align. Each set of scores is calibrated within itself; however, absolute values are not comparable between the two as human importance lies in [0,1] while network importance is unbounded. Consequently, we focus on the relative rankings of the proposals, applying a ranking loss — specifically, a variant of Weighted Approximate Rank Pairwise (WARP) loss.
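To make the rank-alignment idea concrete, the following sketch penalizes every pair of proposals that humans order one way and the network orders the other way. It is a simplified pairwise ranking penalty written for illustration, not the exact WARP variant used in the paper; the function name and the choice of penalty (the size of the network's score gap on misordered pairs) are assumptions.

```python
import numpy as np

def pairwise_rank_loss(human_scores, network_scores):
    """Penalize proposal pairs the network orders differently from humans.

    For every pair (i, j) where humans rate i above j but the network rates
    i at or below j, add the gap network[j] - network[i] to the loss.
    """
    human = np.asarray(human_scores, dtype=float)
    network = np.asarray(network_scores, dtype=float)
    # human_gap[i, j] > 0 means humans consider proposal i more important than j.
    human_gap = human[:, None] - human[None, :]
    # network_deficit[i, j] >= 0 means the network does not rank i above j.
    network_deficit = network[None, :] - network[:, None]
    misranked = (human_gap > 0) & (network_deficit >= 0)
    return float(network_deficit[misranked].sum())
```

Because only the ordering of the human scores is used to select pairs, the differing scales of the two score sets do not matter: the loss is zero whenever the network ranks the proposals in the same order as humans, whatever the absolute values.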
We evaluate on the standard VQA split and the VQA-CP split. VQA-CP is a restructuring of VQA v2 designed so that the answer distribution in the training set differs significantly from that of the test set. For example, the most popular answer in training for “What sport …” questions might be “tennis”, while in testing it might be “volleyball”. Without proper visual grounding, models trained on this dataset will generalize poorly to the test distribution. Understandably, the Bottom-up Top-down model, which achieves 62% validation accuracy when trained and evaluated on VQA v2, only achieves 39.5% test accuracy when trained and evaluated on the VQA-CP v2 split.
For VQA-CP, our HINTed UpDn model significantly improves over its base architecture, gaining about 7 percentage points in overall accuracy while using human attention for just 6% of the training data. Unlike previous approaches for language-bias reduction, which cite trade-offs in performance between the VQA and VQA-CP splits, we find that our HINTed UpDn model actually improves on standard VQA as well, making HINT the first approach to show simultaneous improvement on both the standard VQA and VQA-CP splits.
Qualitative comparison of models on the validation set before and after applying HINT. For each example, the left column shows the input image along with the question and the ground-truth (GT) answer from the VQA-CP val split. In the middle column, for the base model, we show the explanation visualization for the GT answer along with the model’s answer. Similarly, we show the explanations and predicted answer for the HINTed models in the third column. We see that the HINTed model looks at more appropriate regions and answers more accurately. For the example above, the base model only looks at the boy, and after we apply HINT, it looks at both the boy and the skateboard in order to answer ‘Yes’. After applying HINT, the model also changes its answer from ‘No’ to ‘Yes’.
We also show the results of our approach on image captioning.
For each example, the left column shows the input image along with the ground-truth caption from the COCO robust split. We see that the HINTed model looks at more appropriate regions. Note how the HINTed model correctly localizes the fork, apple, and the orange when generating the corresponding visual words, but the base model fails to do so.
Interestingly, the model is able to ground even the shadow of a cat!
About the Authors
If you’re interested in learning more about HINT, please refer to our full paper.
This research is joint work by Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh, and will be presented at the International Conference on Computer Vision (ICCV) Oct. 27 – Nov. 2.