By Jianwei Yang and Zhile Ren

With the rapid development of computer vision, technologies such as object detection and image classification have become mature and effective. These vision algorithms play important roles in many real-world systems, enabling applications ranging from augmented reality to self-driving cars.

The typical pipeline for designing a computer vision system is to download or curate a large dataset of annotated images from the internet, then train deep neural networks to generate desired outputs such as bounding boxes, segmentations, etc. Those images usually contain well-posed objects viewed from prototypical angles. However, when we deploy the trained system on robots to perform scene understanding tasks, a robot's vision is very different from "internet vision": objects are usually occluded in cluttered environments. This is inevitable, because cameras and depth sensors only capture the visible part of the scene.

In this paper, we aim to develop AI agents that can understand an entire scene when only parts of it are visible. This is called Amodal Recognition. For instance, an amodal object detection and segmentation system would output not only the correct object label, but also the full extent of the object's shape and bounding box in the image (see Fig. 1).

To perceive an occluded object, humans can move in the scene to gather information from new viewpoints. A recent study [1] shows that toddlers are capable of actively varying their viewpoints to learn about objects, even when they are only 4 to 7 months old.

Inspired by human vision, the key thesis of this work is that agents should also learn to move to perceive occluded objects. Specifically, agents should learn to move in the scene to gather information about the occluded object, and then perform amodal perception tasks. As shown in Figure 1, to recognize the class and shape of a target object indicated by the red bounding box, the agent learns to actively move toward the target object to unveil the occluded region behind the stump.


Figure 1. An illustration of Embodied Amodal Recognition, where the robot learns to move to perceive an occluded object (the sofa in the red bounding box).

What is the new task?

In this paper, we introduce a new task called Embodied Amodal Recognition (EAR), where agents actively move in a 3D environment for amodal recognition of a target object, i.e., predicting both its category and its amodal shape. We aim to systematically study whether embodiment (movement) helps amodal recognition. Below, we highlight three design choices for the EAR task:

  1. Three sub-tasks. In EAR, we aim to recover both semantics and shape for the target object. EAR consists of three sub-tasks: object recognition, 2D amodal localization (a 2D bounding box enclosing the full extent of the object), and 2D amodal segmentation (a 2D mask enclosing the full shape of the object). With these three sub-tasks, we provide a new testbed for vision systems. A sample prediction output can be found in Figure 1. 
  2. Single target object. When spawned in a 3D environment, an agent may see multiple objects in the field-of-view. We specify one instance as the target, and denote it using a bounding box encompassing its visible region. The agent's goal is then to move to perceive this single target object.
  3. Predict for the first frame. The agent performs amodal recognition for the target object observed at the spawning point. If the agent does not move, EAR degrades to passive amodal recognition. Both passive and embodied algorithms are trained using the same amount of supervision and evaluated on the same set of images.


Figure 2. Comparison of passive amodal recognition pipeline and embodied amodal recognition pipeline (Ours).

Based on the above choices, we propose the general pipeline for EAR shown in Figure 2. When the agent does not move in the scene (Fig. 2a), object recognition algorithms cannot fully recover the shape of the object due to heavy occlusion. However, when the agent learns to move in the 3D environment (Fig. 2b), the predicted output is much more reasonable.

What is our model?

We propose a new model called Embodied Mask R-CNN. The perception module extends work presented in Mask R-CNN [2] by adding a recurrent network to aggregate temporal features. The policy module takes the current observation and features from the past frames to predict the action. The full formulations and experimental results can be found in our paper. We highlight some of the key designs in our model.

Amodal Recognition Module

The amodal recognition module is responsible for predicting the object category, amodal bounding box, and amodal mask at each navigational time step. Our amodal recognition module has a similar goal to Mask R-CNN [2]. In our task, since the agent is already provided with the visible location of the target object in the first frame, we remove the region proposal network from Mask R-CNN and feed the visible bounding box directly into the second stage. In our implementation, we use ResNet-50 [3] pre-trained on ImageNet as the backbone.

Given the sequential data {I_0, I_1, …, I_t} along the agent's trajectory, aggregating the information is challenging, especially when the 3D structure of the scene and the locations of the target object in later frames are unknown. To address this, we propose a model called Temporal Mask R-CNN that aggregates visual features across multiple frames, as shown in Figure 3.
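The recurrent aggregation step can be sketched as below, under the assumption (ours, for illustration) that each frame has already been reduced to a fixed-size feature vector; a GRU then fuses the sequence, and the hidden state after the latest frame is what the prediction heads consume.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Sketch of recurrent feature aggregation for Temporal Mask R-CNN:
    per-frame features from I_0..I_t are fused by a GRU, and the final
    hidden state summarizes the trajectory for amodal prediction on the
    first frame. Dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (N, T, feat_dim), one feature vector per frame.
        fused, _ = self.gru(frame_feats)
        return fused[:, -1]  # aggregated feature after the latest frame
```

A recurrent fusion like this avoids having to explicitly register the target object across frames, which is exactly the difficulty described above.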


Figure 3. The pipeline of the amodal recognition module.

Learn to Move

The goal of the policy network is to propose the next moves in order to acquire useful information for amodal recognition. We disentangle it from the perception network, so that the learned policy does not overfit to a specific perception model.

Similar to the perception network, the policy network receives a visible bounding box of the target object and the raw images as inputs, and outputs probabilities over the action space. As shown in Figure 4, the policy network has three components. At step t, its inputs consist of the first frame, the current frame, and a mask representing the visible bounding box of the target object in the initial view.

Besides image features, we also encode the last action at each step, using a multi-layer perceptron (MLP) to obtain an action feature. We then concatenate the image feature and action feature and pass them to a single-layer GRU that integrates history information. The output is sent to a linear layer with softmax to produce a probability distribution over the action space, from which the next action is sampled. We train the policy network via reinforcement learning.
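These steps can be sketched as a small PyTorch module. This is our own illustrative reconstruction, with assumed dimensions and a generic discrete action space; the interface names are hypothetical.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of the action module: embed the last action with an MLP,
    concatenate it with the image feature, fuse with a single-layer GRU,
    and map the hidden state to a softmax distribution over actions.
    Sizes are illustrative assumptions."""

    def __init__(self, img_dim=512, num_actions=4, hidden_dim=256):
        super().__init__()
        self.action_mlp = nn.Sequential(nn.Linear(num_actions, 32), nn.ReLU())
        self.gru = nn.GRUCell(img_dim + 32, hidden_dim)
        self.actor = nn.Linear(hidden_dim, num_actions)

    def forward(self, img_feat, last_action_onehot, hidden):
        a = self.action_mlp(last_action_onehot)          # encode last action
        hidden = self.gru(torch.cat([img_feat, a], dim=1), hidden)
        probs = torch.softmax(self.actor(hidden), dim=1)
        action = torch.multinomial(probs, 1)             # sample next move
        return action, probs, hidden
```

Because the action is sampled rather than taken greedily, this module slots naturally into policy-gradient training, matching the reinforcement learning setup described above.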


Figure 4. The pipeline of the action module.

What do we learn?

Embodiment helps amodal recognition. In our experiments, we find that agents that move in the environment consistently outperform agents that stay still. Interestingly, even an agent that moves randomly still performs better than the passive one.

Our model learns a good moving strategy. An intuitive moving strategy is to follow the shortest path to the target object. However, using the same recognition model, our agent finds a better moving strategy, with performance on par with or slightly better than that of the shortest-path policy.

Improvements over action steps. In general, performance improves as more steps are taken and more information is aggregated, but it eventually saturates. This is because the agent's location and viewpoint may change significantly after a number of steps, making it more difficult to aggregate information.

To summarize, in this work we introduced a new task called Embodied Amodal Recognition, in which an agent spawned in a 3D environment is free to move in order to perform object classification, amodal localization, and amodal segmentation of an occluded target object. As a first step toward this task, we proposed a new model that learns to move strategically to improve visual recognition performance. Through comparisons with various baselines, we demonstrated the importance of embodiment for visual recognition. We also showed that our agents develop strategic movements, different from the shortest path, to recover the semantics and shape of occluded objects. The details can be found in our paper here.

This work will be presented at the 2019 International Conference on Computer Vision (ICCV). 

[1] Bambach et al. Toddler-inspired  visual object learning. NeurIPS 2018.

[2] He et al. Mask R-CNN. ICCV 2017.

[3] He et al. Deep Residual Learning for Image Recognition. CVPR 2016.

About the Authors

Jianwei Yang is a fifth-year Ph.D. student studying computer vision, machine learning, vision and language, and robot learning. Zhile Ren is a postdoc under Dhruv Batra, Devi Parikh, and Irfan Essa. His research interest is in computer vision where he mainly works on 3D scene understanding applications, embodied artificial intelligence, optical flow, and image manipulation.
