By: Prithvijit Chattopadhyay and Ramprasaath R. Selvaraju
(Paper authors include Ramprasaath R. Selvaraju, Prithvijit Chattopadhyay, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, and Stefan Lee)
Deep neural networks have pushed the boundaries of standard image-classification tasks in the past few years, with performance on many challenging benchmarks reaching near human-level accuracy. One of the major limitations of these models is that they require massive amounts of labeled data to learn new concepts. To eventually deploy such models in the wild, it is desirable for them to generalize from a few examples or descriptions of novel concepts, as humans can.
As humans, much of the way we acquire and transfer knowledge about novel concepts is in reference to or via composition of concepts which are already known to us. For instance, upon hearing that “A Red Bellied Woodpecker is a small, round bird with a white breast, red crown, and spotted wings.” we can compose our understanding of colors and birds to imagine how we might distinguish such an animal from other birds. Along these lines, the task of learning classifiers (deep or otherwise) for novel concepts from external summarized domain knowledge alone — termed Zero Shot Learning (ZSL) — has been a topic of increased interest within the computer vision community.
In our work, we explore how to incorporate similar human-like compositional learning strategies into deep neural networks, enabling them to recognize novel/unseen concepts based on the models’ understanding of already seen concepts.
Prior evidence suggests that neurons within a deep network do indeed capture localized, semantic concepts. Unfortunately, for the most part these captured concepts lack referable groundings – i.e. even if a network contains units sensitive to “white breast” and “red crown”, there is no explicit mapping of these neurons to the relevant neuron names or descriptions in natural language. Previous attempts have focused on utilizing crowd-sourced annotations to gather such model-dependent neuron names. The expensive process of collecting annotations renders this approach impractical.
Our proposed approach — Neuron Importance-based Weight Transfer (NIWT) — not only provides a solution to the above problem but also addresses how to leverage this neuron-level descriptive supervision to train novel classifiers. At the heart of our approach is grounding class descriptions (including attributes and/or free-form text) to the importance of lower-layer neurons to final network decisions.
Neuron Importance-based Weight Transfer (NIWT)
As the name suggests, NIWT relies on identifying the neurons in a deep network that are important for a specific decision. We quantify neuron importance as the sensitivity of the learned network's predictions to individual neurons in a particular layer. Following this, we learn a simple linear mapping that associates the importance scores of the relevant neurons for the seen (training) classes with external domain-specific descriptions (attributes or fine-grained sentences) of those same classes. This learned transformation allows us to ground the class-level semantic descriptions in the neurons.
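The two steps above can be sketched on a toy network. This is a minimal illustration, not the implementation from the paper: it takes "sensitivity" to mean the gradient of a class score with respect to the activations, and it fits the linear mapping by ordinary least squares as a stand-in for whatever training objective is actually used. All sizes, weights, activations and attribute vectors are random, hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: neurons in the chosen layer, hidden units above it,
# seen classes, and attributes per class description.
D, H, C, A = 8, 16, 5, 4

U = rng.normal(size=(H, D))  # fixed weights above the chosen layer (stand-in)
W = rng.normal(size=(C, H))  # seen-class classifier weights (stand-in)

def class_score(a, c):
    """Score of class c for activations a in a toy ReLU network."""
    return W[c] @ np.maximum(U @ a, 0.0)

def neuron_importance(a, c):
    """Gradient of class_score(a, c) with respect to the activations a."""
    active = (U @ a > 0).astype(float)  # ReLU derivative
    return U.T @ (active * W[c])

# Average the importance over a few stand-in "images" per seen class.
acts = rng.normal(size=(C, 20, D))
importances = np.stack([
    np.mean([neuron_importance(a, c) for a in acts[c]], axis=0)
    for c in range(C)
])  # shape (C, D)

# Stand-in class descriptions, one attribute vector per seen class.
attributes = rng.uniform(size=(C, A))

# Linear mapping from description to importance: attributes @ M ~ importances.
M, *_ = np.linalg.lstsq(attributes, importances, rcond=None)

# For an unseen class, only the description is needed to predict importances.
unseen_description = rng.uniform(size=A)
predicted_importance = unseen_description @ M  # shape (D,)
```

In a real network the gradients would come from automatic differentiation on actual images rather than from a hand-written derivative, but the shape of the computation is the same: one importance vector per seen class, and one linear map from descriptions to those vectors.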
Upon encountering a novel/unseen class, the learned mapping lets us predict, from the class description alone, the importance scores of the neurons associated with that class (that is, identify which neurons should be important). Given these predicted importances, we optimize the weights of a classifier for the novel class so that the neurons deemed important actually end up being important when the network predicts that class. From the network’s perspective, this amounts to learning how to weight the different sub-level compositional concepts, captured while training on seen classes, in order to compose the novel/unseen class. In this manner, we connect the description of a previously unseen class to the weights of a classifier that can predict it at test time, all without having seen a single image from this novel class.
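A minimal sketch of this weight-optimization step, under simplifying assumptions: the layers between the chosen neurons and the classifier are collapsed into one fixed linear map, so the importance of each neuron for the new class is a linear function of the new classifier weights, and the mismatch with the predicted importance is measured by squared error. The objective in the paper may differ (for example, it may include regularization), and every name and size below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 32  # hypothetical sizes: neurons in the chosen layer, classifier inputs

U = rng.normal(size=(H, D))  # fixed (already trained) layers, linear in this toy
target = rng.normal(size=D)  # stand-in for the importance predicted from the description

def importance(w):
    # score = w @ (U @ a), so d(score)/d(a) = U.T @ w for every input a
    return U.T @ w

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

w = 0.01 * rng.normal(size=H)  # new-class classifier weights, small random init
initial_match = cosine(importance(w), target)

# Gradient descent on 0.5 * ||U.T @ w - target||^2. The step size is
# 1 / (largest eigenvalue of U @ U.T), which keeps the iteration stable.
step = 1.0 / np.linalg.norm(U, 2) ** 2
for _ in range(2000):
    residual = importance(w) - target
    w -= step * (U @ residual)

final_match = cosine(importance(w), target)
print(f"cosine match before: {initial_match:.3f}, after: {final_match:.3f}")
```

No image of the new class appears anywhere in the loop: the only supervision is the target importance vector, which in NIWT would come from the description through the learned mapping.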
We evaluate on the task of Generalized Zero-Shot Learning (GZSL), which serves as a natural test-bed for our approach. Compared to just measuring classification accuracy on unseen classes, GZSL requires measuring the accuracy of classifying a test instance (of a seen or unseen class) in the presence of both seen and unseen classes, with the former acting as distractors for the latter or vice versa. Given a test set, the harmonic mean of the accuracies on seen-class and unseen-class instances, both measured in this setting, provides a robust metric for benchmarking different approaches. Empirically, we observe that NIWT outperforms the previous best approach by a margin of 10% and 2% on the Animals with Attributes and Caltech-UCSD Birds datasets respectively. In addition, because the approach grounds neuron importances in a semantic, human-interpretable domain, the improvement comes with the benefit of being able to automatically explain the decisions made by the network at the level of neurons (see below).
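The harmonic-mean metric can be sketched as follows. This is an illustrative helper, not the evaluation code behind the reported numbers: it uses plain instance-level accuracy (benchmarks often average per class instead), assumes the test set contains both seen-class and unseen-class instances, and the labels below are made up.

```python
import numpy as np

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of two accuracies; 0.0 if both are zero."""
    total = acc_seen + acc_unseen
    return 0.0 if total == 0 else 2.0 * acc_seen * acc_unseen / total

def gzsl_score(y_true, y_pred, seen_classes):
    """Accuracy on seen and unseen instances, predicted over the joint label space."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_seen = np.isin(y_true, list(seen_classes))
    correct = y_true == y_pred
    acc_seen = float(correct[is_seen].mean())
    acc_unseen = float(correct[~is_seen].mean())
    return acc_seen, acc_unseen, harmonic_mean(acc_seen, acc_unseen)

# Made-up labels: classes 0 and 1 are seen, classes 2 and 3 are unseen.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 0, 2, 1, 0, 3]
acc_seen, acc_unseen, h = gzsl_score(y_true, y_pred, seen_classes={0, 1})
print(acc_seen, acc_unseen, h)
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a model that does well on seen classes but ignores unseen ones (or the reverse) scores poorly.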
Qualitative examples containing explanations for NIWT: (a) the ground truth class and image, (b) Grad-CAM visual explanations for the GT category, (c) textual explanations obtained using the inverse mapping from neuron-importance to domain knowledge, (d) most important neurons for this decision, their names and activation maps.
Notice that in the second row, for the image correctly classified as a yellow-headed blackbird, the visualizations for the class focus specifically on the union of attributes that comprise the class: black eye, yellow throat, and black wing. In addition, the textual explanations pick out these same attributes based on the neuron-importance scores (has throat color yellow, has wing color black, etc.). When we look at the individual neurons with relatively higher importance, we see that each one focuses on the visual regions characterized by its assigned 'name'. This shows that our neuron names are indeed representative of the concepts learned by the network and are well grounded in the image.
If you’re interested in learning more about NIWT, please refer to our full paper.