By Jianwei Yang and Jiasen Lu
The following post breaks down Graph R-CNN for Scene Graph Generation, which will be presented at the European Conference on Computer Vision 2018 (ECCV). The conference takes place September 8th through 14th in Munich, Germany.
Visual scene understanding has traditionally focused on identifying objects in images: learning to predict their presence (image classification), spatial configuration (object detection), or silhouette (image segmentation). These object-centric techniques have matured significantly in recent years. However, representing scenes as collections of objects fails to capture the relationships between them, which may be essential for scene understanding.
Recent work (Johnson et al., 2015) has instead proposed representing visual scenes as graphs containing objects, their attributes, and the relationships between them. These scene graphs form an interpretable, structured representation of the image that can support higher-level visual intelligence tasks such as image captioning, visual question answering, and visually grounded dialog.
While scene graph representations hold tremendous promise, extracting scene graphs from images efficiently and accurately is challenging. In this post, we discuss our recent work, which proposes an efficient and effective approach to scene graph generation, i.e., jointly detecting objects and the relationships between them.
What is our motivation?
Scene graph generation requires jointly detecting the objects and relationships in an image. A natural choice is to connect every pair of object proposals as a potential edge and reason over the fully connected graph, as shown in Fig. 1(b). However, this scales poorly (quadratically) as the number of object proposals grows and quickly becomes impractical. In the real world, relationships are naturally sparse: most pairs of objects have no relationship with each other.
Take Fig. 1(c) as an example: a 'car' (red node) and a 'wheel' (yellow node) are much more likely to have a relationship than a 'wheel' (yellow node) and a 'building' (green node). Furthermore, the types of relationships are highly correlated with the types of objects. As shown in Fig. 1(d), a 'wheel' (yellow node) is likely to be 'on' a 'car' (red node), while a 'car' is not necessarily 'behind' a 'building'. Our model is motivated by these observations.
To this end, we propose a new framework called Graph R-CNN, which effectively leverages object-relationship regularities through two mechanisms to intelligently sparsify and reason over candidate scene graphs. In our framework, the scene graph generation pipeline is factorized into three logical stages: 1) object node extraction, 2) graph edge pruning, and 3) graph context integration.
To extract the object nodes, we use an off-the-shelf object detector such as Faster R-CNN to detect the objects in an image. To prune the unlikely connections between objects, we propose a relation proposal network (RePN), which learns to prune edges from the fully connected scene graph. Finally, attentional graph convolutional networks (aGCNs) are introduced to refine the scene graph and obtain the final labels. The detailed framework is depicted in Fig. 2.
Pruning the Graph
It is well known that learning a structure is much harder than learning parameters. This is mainly due to missing supervision and the non-differentiability of structure learning. Fortunately, when training a scene graph generation model, human annotations supply supervision indicating which edges should be connected and which should be cut. This supervision motivated us to propose a module called the relation proposal network (RePN).
The core of RePN is computing the relationship-ness between objects, a score that indicates how likely a pair of objects is to have a relationship: the higher the score, the more likely the pair is related. A straightforward way to compute it is to enumerate all object pairs and pass each one to a score function. However, this is inefficient because the number of pairs is quadratic in the number of objects.
For example, 300 object proposals yield about 90,000 pairs. To address this computational issue, we propose a simple approach: we decompose the score function into two asymmetric kernel functions. With these two kernel functions, scoring a quadratic number of object pairs reduces to projecting the objects and computing similarities with simple matrix multiplications. As a result, RePN has a low computational cost and can be trained end-to-end from the structure supervision.
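The factorized scoring described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the actual RePN implementation: the two kernel functions are stood in for by plain linear projections (`w_subj` and `w_obj`, both hypothetical names), the input features are random, and the number of pairs kept (`k=128`) is an arbitrary choice.

```python
import numpy as np


def relatedness_scores(features, w_subj, w_obj):
    """Score every ordered (subject, object) pair with one matrix product.

    Assumption: each kernel function is a single linear projection here.
    """
    subj = features @ w_subj          # (n, d) subject-side projection
    obj = features @ w_obj            # (n, d) object-side projection
    logits = subj @ obj.T             # (n, n) all pairwise scores at once
    return 1.0 / (1.0 + np.exp(-logits))


def top_k_pairs(scores, k):
    """Keep the k highest-scoring pairs, ignoring self-pairs."""
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)
    flat = np.argsort(masked, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, masked.shape)
    return list(zip(rows.tolist(), cols.tolist()))


rng = np.random.default_rng(0)
n, c, d = 300, 64, 32                 # proposals, input dim, projected dim
features = rng.normal(size=(n, c))    # stand-in for per-object features
w_subj = rng.normal(size=(c, d)) * 0.1
w_obj = rng.normal(size=(c, d)) * 0.1

scores = relatedness_scores(features, w_subj, w_obj)
pairs = top_k_pairs(scores, k=128)
print(scores.shape, len(pairs))
```

Because the subject and object projections use different weights, the score for (i, j) generally differs from the score for (j, i), which is what makes the decomposition asymmetric.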
Integrate the graph context
Once the graph structure has been determined by RePN, the next step is to infer the scene graph labels. At this stage, context plays an important role. To exploit it, we propose attentional graph convolutional networks (aGCNs).
The graph convolutional network (GCN) is not new; it was first proposed for semi-supervised learning (Kipf and Welling, 2016). We extended it to scene graph generation and added an attention mechanism. The aGCNs operate at two levels: 1) the feature level and 2) the semantic level.
At the feature level, we extract representations for each object node and edge in the pruned graph and then compute attention weights over the neighbors of each node. These attention weights are used to update the input object and relationship representations through GCNs. At the semantic level, we predict class distributions over the graph from the updated features. These class distributions are then passed to another aGCN, whose outputs are used to infer the final scene graph labels. In this way, aGCNs integrate context at both the feature level and the semantic level.
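A single attention-weighted graph convolution step might look roughly like the NumPy sketch below. It is a simplified illustration, not the aGCN from the paper: the attention scores come from a small two-layer network over concatenated node pairs, each node keeps its own features with weight 1, and all weight matrices (`w`, `w_a`, `w_h`) and the toy adjacency matrix are made up for the example.

```python
import numpy as np


def attentional_gcn_layer(z, adj, w, w_a, w_h):
    """One attention-weighted message-passing step over a pruned graph.

    z:   (n, d) node features
    adj: (n, n) adjacency of the pruned graph, zero diagonal
    """
    n = z.shape[0]
    # Build [z_i, z_j] for every pair of nodes: shape (n, n, 2d).
    zi = np.repeat(z[:, None, :], n, axis=1)
    zj = np.repeat(z[None, :, :], n, axis=0)
    pair = np.concatenate([zi, zj], axis=-1)
    # Two-layer scoring network gives one attention logit per pair.
    logits = np.maximum(pair @ w_a, 0.0) @ w_h          # (n, n)
    # Softmax restricted to each node's neighbors.
    masked = np.where(adj > 0, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    exp = np.exp(masked - row_max)
    denom = exp.sum(axis=1, keepdims=True)
    alpha = np.divide(exp, denom, out=np.zeros_like(exp), where=denom > 0)
    # Aggregate neighbors with attention weights, keep self features.
    z_next = np.maximum((z + alpha @ z) @ w, 0.0)
    return z_next, alpha


rng = np.random.default_rng(0)
n, d, h = 4, 8, 16
z = rng.normal(size=(n, d))
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)   # node 3 has no neighbors
w = rng.normal(size=(d, d)) * 0.1
w_a = rng.normal(size=(2 * d, h)) * 0.1
w_h = rng.normal(size=h) * 0.1

z_next, alpha = attentional_gcn_layer(z, adj, w, w_a, w_h)
print(alpha.round(3))
```

In this sketch the attention weights of each node sum to 1 over its neighbors, and an isolated node simply passes its own features through the layer. Applying the same kind of layer to class distributions instead of features would correspond to the semantic level described above.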
How do we evaluate scene graph generation?
Prior work has evaluated scene graph generation with a simple triplet-recall metric. Under this metric, which we will refer to as SGGen, the ground truth scene graph is represented as a set of <object, relationship, subject> triplets, and recall is computed via exact match. A triplet is considered 'matched' in a generated scene graph if all three elements have been correctly labeled and both the subject and object nodes have been properly localized (i.e., bounding box IoU > 0.5).
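To make the matching rule concrete, here is a minimal sketch of exact-match triplet recall. The tuple layout `(subject_label, subject_box, predicate, object_label, object_box)`, the box format `(x1, y1, x2, y2)`, and the toy data are assumptions made for this example; a real evaluation script would typically also restrict the predictions to a top-ranked subset.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_matches(gt, pred, iou_thresh=0.5):
    """All three labels must agree and both boxes must overlap enough."""
    g_subj, g_sbox, g_pred, g_obj, g_obox = gt
    p_subj, p_sbox, p_pred, p_obj, p_obox = pred
    return (
        (g_subj, g_pred, g_obj) == (p_subj, p_pred, p_obj)
        and iou(g_sbox, p_sbox) > iou_thresh
        and iou(g_obox, p_obox) > iou_thresh
    )


def triplet_recall(gt_triplets, pred_triplets, iou_thresh=0.5):
    """Fraction of ground-truth triplets matched by some prediction."""
    if not gt_triplets:
        return 0.0
    hits = sum(
        any(triplet_matches(g, p, iou_thresh) for p in pred_triplets)
        for g in gt_triplets
    )
    return hits / len(gt_triplets)


gt_triplets = [
    ("man", (0, 0, 10, 10), "riding", "horse", (5, 5, 20, 20)),
    ("man", (0, 0, 10, 10), "wearing", "hat", (2, 0, 6, 3)),
]
pred_triplets = [
    ("man", (0, 0, 10, 9), "riding", "horse", (5, 5, 20, 20)),
    ("boy", (0, 0, 10, 10), "wearing", "hat", (2, 0, 6, 3)),  # wrong subject label
]
recall = triplet_recall(gt_triplets, pred_triplets)
print(recall)  # 0.5
```

The second prediction gets the predicate, the object, and both boxes right, yet it contributes nothing because the subject label differs.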
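Because matching is exact, a triplet with a single mislabeled element earns no credit, even if its other elements are correct. To mitigate this issue, we propose a new metric called SGGen+ as an augmentation of SGGen. SGGen+ considers not only the triplets in the graph but also the singletons (objects and predicates). An illustrative comparison is shown in Figure 3.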
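This post does not spell out how SGGen+ is computed, so the sketch below rests on an assumed formulation: count the correctly recognized objects, predicates, and triplets, then divide by the total number of such entries in the ground truth graph. It also assumes localization has already been resolved, so that each ground-truth object is paired with the label of the detection that overlaps it (or `None` if nothing does). The function and argument names are hypothetical.

```python
def sggen_plus_recall(gt_labels, matched_labels, gt_edges, pred_edges):
    """Recall over objects, predicates, and triplets (assumed formulation).

    gt_labels[i]:      class of ground-truth object i
    matched_labels[i]: predicted class of the detection localized on
                       object i, or None if no detection overlaps it
    gt_edges, pred_edges: dicts mapping (subject_idx, object_idx) to a
                       predicate, indexed by ground-truth objects
    """
    localized = [m is not None for m in matched_labels]
    correct = [m == g for m, g in zip(matched_labels, gt_labels)]

    objects_hit = sum(correct)
    predicates_hit = 0
    triplets_hit = 0
    for (s, o), predicate in gt_edges.items():
        # A predicate counts once both of its endpoints are localized.
        if localized[s] and localized[o] and pred_edges.get((s, o)) == predicate:
            predicates_hit += 1
            # A full triplet also needs both endpoint labels to be right.
            if correct[s] and correct[o]:
                triplets_hit += 1

    total = len(gt_labels) + 2 * len(gt_edges)  # objects + predicates + triplets
    if total == 0:
        return 0.0
    return (objects_hit + predicates_hit + triplets_hit) / total


gt_labels = ["man", "horse", "hat"]
matched_labels = ["boy", "horse", "hat"]        # 'man' mislabeled as 'boy'
gt_edges = {(0, 1): "riding", (0, 2): "wearing"}
pred_edges = {(0, 1): "riding", (0, 2): "wearing"}

score = sggen_plus_recall(gt_labels, matched_labels, gt_edges, pred_edges)
print(round(score, 3))  # 0.571
```

In this toy case a triplet-only metric would score zero, since both triplets involve the mislabeled 'man', while the singleton-aware count still credits the two correct objects and the two correct predicates (4 of 7 entries).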
What are the merits of Graph R-CNN?
To support these claims, we evaluate our scene graph generation model on Visual Genome (Krishna et al.), a large-scale dataset, using both previous metrics and our proposed metric. Our model outperforms previous models by a significant margin.
Beyond the performance improvement, the model has two other merits: 1) Object detection performance improves thanks to RePN. With the scene graph structure as supervision, the parameters of the object detection components are learned from both node annotations and edge annotations. 2) Common sense emerges in our aGCNs. By visualizing the parameters of the semantic-level GCNs, we observe strong co-occurrence patterns between objects, and between objects and relationships.
In summary, we introduced a new model called Graph R-CNN for scene graph generation. Our model includes a relation proposal network (RePN) that efficiently and intelligently prunes out pairs of objects that are unlikely to be related and an attentional graph convolutional network (aGCN) that effectively propagates contextual information across the graph.
We also introduce a novel evaluation metric for scene graph generation that is more fine-grained and realistic than existing metrics. Our approach outperforms existing methods for scene graph generation under both existing metrics and our proposed metric. We believe this simple but effective framework provides a novel scene graph generation technique and can also shed light on other areas that involve graph structure learning and graph labeling.
Justin Johnson et al. “Image retrieval using scene graphs.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Thomas N. Kipf and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907. 2016.
Ranjay Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.”