By Samarth Brahmbhatt and Charlie Kemp

Paper (CVPR 2019 oral) | bib | Explore ContactDB

Paper by Samarth Brahmbhatt, Cusuh Ham, Charlie Kemp, and James Hays

Georgia Institute of Technology

Many times a day, people effortlessly grasp objects, yet human grasping is a complex phenomenon that has proven challenging to emulate and analyze. If robots could better grasp objects, they could be more useful in homes and factories. If AI systems could better perceive human grasping, they could more naturally interact and collaborate with people. If researchers better understood human grasping, they might find new ways to help people whose hands are impaired. A key aspect of human grasping that has been hidden from view is the contact that occurs between the human hand and the object. We present a novel method that reveals this hidden interaction, opening up a new path for research on human grasping.

This blog post introduces ContactDB, the first large-scale dataset of contact maps from human grasps of household objects.

A contact map is a textured mesh of the object, where the texture indicates contact. A typical contact map looks like this (note the detailed, continuous and real-world nature of our data, which distinguishes ContactDB from previous attempts at capturing contact):

contactmap_example

(above) A contact map for ‘binoculars’.
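In data terms, a contact map boils down to a per-vertex texture value that can be thresholded to decide where the hand touched the object. The sketch below illustrates the idea with synthetic numbers; the threshold values and temperatures are illustrative, not calibration constants from the paper.

```python
import numpy as np

def contact_mask(vertex_temps, ambient=22.0, delta=1.5):
    """Mark a vertex as 'contacted' if its thermal texture value
    is noticeably above ambient temperature.

    vertex_temps: (N,) per-vertex temperatures from the thermal texture.
    ambient, delta: illustrative thresholds, not values from the paper.
    """
    return vertex_temps > ambient + delta

# Synthetic example: 5 vertices, two of them warmed by a grasp.
temps = np.array([22.0, 22.1, 27.5, 28.0, 21.9])
mask = contact_mask(temps)
print(mask.sum())  # number of contacted vertices
```

In practice one would read the per-vertex texture values from the textured mesh file before thresholding.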

Why Capture Contact Maps?

Human grasping has traditionally been captured in the hand-pose space. Contact maps capture grasping from the novel perspective of contact. In addition, we envision that this data can be used to design ergonomic tools and soft robotic grippers capable of executing human contact patterns, and to develop a deeper understanding of hands in action from images and videos.

How We Capture Contact Maps

Heat transfers from warm human hands to the object surface during grasping. The resulting contact pattern remains clearly visible through a thermal camera even after the object is released.

To capture this pattern, we 3D print a set of household objects and abstract shapes to ensure uniform thermal properties. We invited 50 participants to our laboratory to hold the objects. All objects are grasped with the functional intent of handing them off, and more than half are also grasped with the intent of using them.

Once grasped, the objects are put on a turntable and scanned with a Kinect V2 RGB-D camera and a FLIR Boson 640 thermal camera. The thermal images are texture-mapped to the object 3D mesh to generate a contact map. A typical scan from the RGB, depth and thermal cameras looks like this:

rgb_capture.gif depth_capture thermal_capture

(above) Data stream from the RGB (left), depth (middle), and thermal (right) cameras while scanning an object.
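Texture-mapping the thermal images onto the mesh amounts to projecting each mesh vertex into the thermal image and sampling the temperature at that pixel. A minimal pinhole-projection sketch follows; the intrinsics and pose below are made-up illustrative values, not our actual camera calibration.

```python
import numpy as np

def project_vertices(verts, K, R, t):
    """Project 3D mesh vertices (N, 3) into thermal-image pixel
    coordinates with a standard pinhole camera model.

    K: 3x3 thermal-camera intrinsics; R, t: object-to-camera pose.
    """
    cam = verts @ R.T + t            # object frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Illustrative intrinsics and pose (not real calibration values).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 256.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])  # object 1 m in front
verts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
px = project_vertices(verts, K, R, t)
print(px)  # pixel coordinates of the two vertices
```

The temperature sampled at each projected pixel then becomes that vertex's texture value, with visibility checks to avoid painting occluded vertices.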

What Insights Does This Data Reveal?

Grasps are significantly influenced by the functional intent:

Screen Shot 2019-05-16 at 3.06.11 PM

(above) Influence of functional intent on contact maps.

The soft tissue of the human hand, in the palm and the distal parts of the fingers, plays a large role in grasping. This is shown by the following figure, which plots the average contact area for each object, calculated from the observed grasps. The red line indicates an upper bound on the contact area if the grasp were fingertip-only. The average contact area for many objects is significantly higher than this upper bound.

handoff_contact_areas

(above) Blue bars: average contact areas. Red line: Upper bound on fingertip-only contact area.
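A contact area like the ones plotted above can be computed by summing the areas of mesh triangles whose vertices are marked as contacted. The numpy sketch below shows one way to do this; the toy mesh and the all-three-vertices-contacted convention are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def contact_area(verts, faces, contacted):
    """Sum the area of triangles whose vertices are all contacted.

    verts: (N, 3) vertex positions; faces: (M, 3) vertex indices;
    contacted: (N,) boolean per-vertex contact mask.
    """
    keep = contacted[faces].all(axis=1)
    tri = verts[faces[keep]]
    # Triangle area = half the norm of the edge cross product.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    return 0.5 * np.linalg.norm(cross, axis=1).sum()

# Toy mesh: two unit right triangles, only one fully contacted.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], float)
faces = np.array([[0, 1, 2], [1, 3, 2]])
contacted = np.array([True, True, True, False])
print(contact_area(verts, faces, contacted))  # 0.5
```

One could instead count any face with at least one contacted vertex; either convention gives a per-object area that can be averaged over grasps.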

This motivates the inclusion of non-fingertip areas in grasp prediction and modeling algorithms, and presents an opportunity to inform the design of soft robotic manipulators.

See the paper for more analysis and insights.

Predicting Contact Maps from Object Shape

Robots are required to manipulate known objects in many situations. For example, teams had access to object 3D models in the DARPA Robotics Challenge. Models that predict optimal contact regions from a known object shape can help position the robot before manipulation.

We have also developed the ContactGrasp algorithm, which synthesizes human-like functional grasps for diverse robotic end-effectors from ContactDB contact maps.

Since each participant’s contact map represents a valid way to grasp the object, we need to learn a one-to-many mapping from object shape to contact map.

We adopted two approaches from the literature for this: DiverseNet and Stochastic Multiple Choice Learning (sMCL). Our experiments used two 3D object shape representations: a voxel occupancy grid (processed by a CNN with 3D convolutions) and a point cloud (processed by the PointNet architecture).

We found that the voxel occupancy grid representation is better suited for this task, probably because the CNN with 3D convolutions learns a hierarchical spatial representation, while PointNet does not.
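The core idea behind sMCL is a winner-take-all loss: the network makes several predictions, but only the one closest to the observed contact map is penalized, so different prediction heads specialize to different valid grasps. A minimal numpy sketch of that loss (the per-vertex squared error here stands in for the actual training loss, and the numbers are synthetic):

```python
import numpy as np

def smcl_loss(preds, target):
    """Winner-take-all loss from sMCL: only the best of the M
    predictions is penalized (and hence trained), letting heads
    specialize to different plausible grasps.

    preds: (M, N) candidate contact maps over N mesh vertices;
    target: (N,) one observed contact map.
    """
    errors = ((preds - target) ** 2).mean(axis=1)  # one loss per head
    winner = errors.argmin()
    return errors[winner], winner

preds = np.array([[0.9, 0.1, 0.0],    # head 0: close to the target
                  [0.0, 0.9, 0.9]])   # head 1: a different grasp
target = np.array([1.0, 0.0, 0.0])
loss, winner = smcl_loss(preds, target)
print(winner)  # 0
```

DiverseNet takes a different route to the same goal, conditioning a single network on an extra input that indexes which of the M predictions to produce.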

We evaluated our models on three unseen test object classes (mug, pan, and wine-glass) to test their generalization ability across objects. Some example predictions are shown below; of the 10 predictions made by our model, we show the 3 most realistic.

This particular model was trained with the voxel occupancy grid shape representation using DiverseNet for predicting contact maps for the ‘use’ intent. For more detailed quantitative evaluations of all our models, read our paper.

mug2_use_voxnet_diversenet  mug6_use_voxnet_diversenetmug7_use_voxnet_diversenet

pan3_use_voxnet_diversenet pan5_use_voxnet_diversenet pan6_use_voxnet_diversenet

wine_glass5_use_voxnet_diversenet wine_glass7_use_voxnet_diversenet wine_glass8_use_voxnet_diversenet

(above) Contact map predictions for unseen objects: mug (top), pan (middle), and wine-glass (bottom). The contact maps show plausible grasps for the objects.

To test the generalization ability across object shapes, we also evaluated our model on new shapes of objects seen during training:

camera_shapes camerav2_0_use_voxnet_diversenet camerav2_4_use_voxnet_diversenet camerav2_9_use_voxnet_diversenet

hammer_shapeshammerv2_3_use_voxnet_diversenet hammerv2_5_use_voxnet_diversenet hammerv2_1_use_voxnet_diversenet

(above) Contact map predictions for unseen shapes of seen training objects: camera (top), hammer (middle).

The camera is grasped from the side with contact at the shutter button, and the hammer is grasped at the handle; both are natural ways to grasp these objects.

Dataset, Code and Trained Models

You can download the entire ContactDB dataset along with code to perform deep-learning experiments on contact maps at this GitHub repository. If you want to record your own contact maps or access our raw data, we have also open-sourced our data collection and processing code at this GitHub repository (data collection requires ROS).
