Learning Rigidity and Scene Flow Estimation

By Zhaoyang Lv

We live in a three-dimensional (3D), dynamic world every day. Perceiving dense, high-resolution 3D motion is a fundamental ability of our visual system, one that enables us to perform a wide variety of tasks. In an age when we are building intelligent robots, autonomous vehicles, and augmented reality toolkits, how can we give them a similar ability to perceive the world as humans do?

In this post, we describe our work Learning Rigidity in Dynamic Scenes for 3D Motion Field Estimation. We explore scene flow, the technical name for this 3D motion representation. Although the representation can take different forms, we are primarily interested in scene flow as the 3D motion field of the physical world observed from a moving camera. Such motions are commonly seen and may be the most challenging for a self-moving agent to estimate.

Optical Flow, Camera Motion, and Scene Flow

Two types of motion are commonly introduced in a computer vision class. The first is 2D optical flow, which computes a motion vector for every pixel in image space (where does a point move to in the next image?). The second is Structure-from-Motion (SfM), which can be used to infer the relative movement of the camera or of an object when one of them remains static.

However, when both the camera and the scene are moving, traditional SfM no longer applies directly. Optical flow describes relative motion in an egocentric 2D image space, so it always entangles camera motion with 3D scene motion. We instead want to understand scene flow as an absolute motion representation in world coordinates. It is debatable which of the two is more analogous to raw human vision, but recovering absolute motion is something we can do, and it is quite valuable in robotics, mixed reality, and beyond: things become much easier when a robot can plan its actions in absolute world coordinates using Newtonian physics.

Many researchers have shown that learned optical flow can generalize well. If we can disentangle scene flow from optical flow, we should be able to learn scene flow too. Once depth is available, the relationship between the different flows becomes quite clear if we project all three types of motion into the second view. We plot them as a triangle relationship, as shown below:

[Figure: the triangle relationship between optical flow, ego-motion flow, and projected scene flow]

Given the above constraints, solving any two of the three correspondences is enough to recover the third. If we know which regions are static, the ego-motion flow vectors in those regions coincide with the optical flow, which tells us how to subtract the camera's contribution. The following animations, rendered with ground-truth optical flow and camera motion, show what these motions look like:

[Animations, from top to bottom: raw video, optical flow, ego-motion flow, and projected scene flow (into the new view)]
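To make the triangle relation concrete, here is a minimal NumPy sketch (with an assumed pinhole intrinsic matrix K and a ground-truth relative camera pose R, t; all numeric values are illustrative, not from the paper). It computes the ego-motion flow of a pixel by back-projecting its depth and re-projecting it through the camera motion; the projected scene flow then falls out as the residual between the optical flow and the ego-motion flow.

```python
import numpy as np

# Assumed pinhole intrinsics (fx, fy, cx, cy) -- illustrative values only.
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])

def backproject(u, v, depth, K):
    """Lift a pixel (u, v) with depth into a 3D camera-frame point."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

def project(p, K):
    """Project a 3D camera-frame point back to pixel coordinates."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def ego_motion_flow(u, v, depth, K, R, t):
    """2D flow of a *static* point induced purely by camera motion (R, t)."""
    p1 = backproject(u, v, depth, K)   # 3D point in the frame-1 camera
    p2 = R @ p1 + t                    # same static point in the frame-2 camera
    return project(p2, K) - np.array([u, v])

# Example: a static point at 2 m depth, camera translating 5 cm to the right.
R = np.eye(3)
t = np.array([-0.05, 0.0, 0.0])        # camera moves +x => point moves -x
flow_ego = ego_motion_flow(320.0, 240.0, depth=2.0, K=K, R=R, t=t)

# Triangle relation: optical flow = ego-motion flow + projected scene flow,
# so the scene-flow component in the image is simply the residual.
optical_flow = np.array([-13.0, 0.0])  # e.g. measured by a flow method
projected_scene_flow = optical_flow - flow_ego
```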

Rigidity for the Dynamic Scene

Disambiguating the optical flow induced by ego-motion from scene flow requires correctly identifying the static structure of a scene, which has been called rigidity (J. Wulff et al. 2017). In the top scene below, rigidity might seem easy to determine with some semantic knowledge: humans can move, while the background buildings (in green) should form the static region. One hypothesis, and one way rigidity has been tackled, is that rigidity carries semantic meaning. However, this prior often does not hold on its own. We can easily find a scene like the bottom video, where the bamboo (in red) suddenly moves under external forces, even though most of the time it remains static.

[Animations: top, humans moving against static buildings (green); bottom, bamboo (red) suddenly set in motion]

Traditionally, we use relative motion and 3D information to solve this. Random Sample Consensus (RANSAC) and its variants are among the most widely used techniques for extracting motion inliers statistically, but used alone they often fail when the motions are complex and ambiguous.
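For reference, the classic baseline might look like the following sketch: repeatedly sample minimal sets of 3D-3D correspondences (e.g. obtained from depth plus optical flow), fit a rigid transform with the Kabsch/Procrustes solution, and keep the model with the most inliers. The thresholds and iteration counts below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_rigid_transform(P, Q):
    """Kabsch/Procrustes: least-squares R, t such that R @ p + t ~= q.
    P, Q are (N, 3) arrays of corresponding 3D points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def ransac_rigid(P, Q, iters=500, thresh=0.05, rng=np.random.default_rng(0)):
    """RANSAC over 3-point samples; keeps the transform with the most inliers."""
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = fit_rigid_transform(P[idx], Q[idx])
        residual = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = residual < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best hypothesis for the final estimate.
    R, t = fit_rigid_transform(P[best_inliers], Q[best_inliers])
    return R, t, best_inliers
```

The failure mode the paragraph alludes to is visible here: when moving objects cover most of the frame, or their motion happens to be near-rigid, the largest consensus set no longer corresponds to the static background, and no threshold fixes that.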

One takeaway is that rigidity is highly correlated with relative motion, but motion alone can be ambiguous. Because of this, we unconsciously rely on priors such as semantics or relative depth distributions to resolve it. If this knowledge is too hard to hand-code, why not simply learn it from data? That is what we do in this work. But first, we have to face another non-trivial problem.

Where do we acquire the 3D dynamic training data?

We need data to learn such concepts. The reality is that, unlike category-labeling tasks, dense motion ground truth cannot be collected easily. One option is an expensive motion capture system in a single constrained room. Another is to collect trajectories of completely rigid objects. A third is to create fully synthetic scenes with computer graphics, as movies do, which can consume far more time and money than expected once a large variety of scenes is needed. None of these options sounds scalable or unbiased, and even with such data, the learned rigidity and scene flow are still unlikely to generalize to motions in the wild.

We want a data generation process that is simple to run, scalable, potentially diverse, and as realistic as possible. Nowadays, dense 3D reconstructions of rooms (A. Dai et al. 2017) can be created at large scale almost automatically, which gives us an opportunity to rethink what synthetic data we can mix with reality. The reconstruction process is purely static, but it provides realistic moving camera trajectories and a rigid, dense 3D world. If we add dynamic objects, say, synthetic humans (G. Varol et al. 2017), to the scene, we can easily create rich dynamic scenarios.

This concept is what we propose for creating training data for the dynamic 3D world: REal 3D from REconstruction with Synthetic Humans (REFRESH). The resulting data provides a real photometric background, realistic 3D geometry, and per-pixel motion ground truth that could never be annotated by hand. The proposed toolkit makes data creation scalable: scenes are not limited to indoors or outdoors, and the inserted objects can be anything. You could even DIY training data for your specific applications or even your own rooms. The final footage still looks unnatural where the humans are composited, but we find this does not hurt much, and it may even help generalization, as we will see later. Moreover, with progress in semantic 3D reconstruction, placing humans or objects more plausibly on the ground should not be hard to overcome.
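To illustrate the compositing idea, here is a minimal NumPy sketch of how a REFRESH-style frame could be assembled once the static reconstruction and a synthetic human have each been rendered into the same camera (the rendering itself is outside the snippet, and the function is a simplified assumption, not the toolkit's actual implementation). A per-pixel depth test resolves occlusion and directly yields a ground-truth rigidity mask.

```python
import numpy as np

def composite_frame(bg_rgb, bg_depth, fg_rgb, fg_depth):
    """Composite a rendered dynamic foreground (synthetic human) over a
    rendered static reconstruction using a per-pixel depth test.

    bg_*: renders of the reconstructed static scene, (H, W, 3) and (H, W)
    fg_*: renders of the synthetic human; fg_depth is np.inf wherever
          no human is present, so the background wins the depth test there.
    Returns the composite image, composite depth, and a rigidity mask that
    is True wherever the pixel comes from the static reconstruction.
    """
    fg_wins = fg_depth < bg_depth                      # human closer than scene
    rgb = np.where(fg_wins[..., None], fg_rgb, bg_rgb)
    depth = np.where(fg_wins, fg_depth, bg_depth)
    rigidity = ~fg_wins                                # static (rigid) region
    return rgb, depth, rigidity
```

Because the camera poses, the background geometry, and the human's articulated motion are all known, per-pixel optical flow and 3D scene flow ground truth follow analytically from the same quantities, with no manual annotation.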

[Figure: the REFRESH data generation pipeline]

3D Scene Flow from Rigidity and Optical Flow

The rigidity mask provides the region of interest for the static structure. Given RGB-D data, we can then compute 3D scene flow with our favorite optical flow and two-view geometry algorithms. Recall that in static regions, where there is no 3D scene flow, the ego-motion flow is equivalent to the optical flow, so we can use the optical flow there directly as correspondences to estimate the relative camera transform.
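Putting the pieces together, the final 3D scene flow computation might look like the following sketch (a dense version of the earlier back-projection helpers, with assumed inputs: two depth maps, a dense optical flow field, the camera intrinsics K, and the relative camera pose R, t estimated from correspondences inside the predicted rigid region).

```python
import numpy as np

def backproject_map(depth, K):
    """Back-project a depth map (H, W) into camera-frame points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def scene_flow(depth1, depth2, flow, K, R, t):
    """Per-pixel 3D scene flow, expressed in the first camera's coordinates.

    flow: dense optical flow (H, W, 2) from frame 1 to frame 2.
    R, t: relative camera pose (frame 1 -> frame 2), e.g. estimated from
          optical flow correspondences inside the predicted rigid region.
    """
    H, W = depth1.shape
    p1 = backproject_map(depth1, K)                    # 3D points at time 1

    # Follow the optical flow to the corresponding pixel, read its depth,
    # and back-project to get the 3D point at time 2 (in the camera-2 frame).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u2 = np.clip(np.rint(u + flow[..., 0]).astype(int), 0, W - 1)
    v2 = np.clip(np.rint(v + flow[..., 1]).astype(int), 0, H - 1)
    z2 = depth2[v2, u2]
    x2 = (u2 - K[0, 2]) * z2 / K[0, 0]
    y2 = (v2 - K[1, 2]) * z2 / K[1, 1]
    p2_cam2 = np.stack([x2, y2, z2], axis=-1)

    # Undo the camera motion so both point sets live in the same frame,
    # then the scene flow is simply the per-pixel 3D displacement.
    p2_cam1 = (p2_cam2 - t) @ R                        # R^T @ (p - t), vectorized
    return p2_cam1 - p1
```

For pixels inside the predicted rigid region the result should be close to zero, which also serves as a sanity check on the estimated camera pose.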

[Figure: the inference pipeline]

Although our network is quite simple and our training data uses only seven indoor scenes, we were surprised to find that the predicted rigidity generalizes well to various types of scenes with different challenges. We were asked whether the network merely learns human semantics rather than general motion, since only human models are used in data creation. The answer is no. Even for rigid cars in outdoor scenes, we still find reasonable motion cues. For a highly non-rigid, non-human-like object, e.g. a dragon, the prediction remains consistently good across a video, using only frame-to-frame prediction.

[Figure: qualitative results]

We have no solid theoretical grounding for why the network should be guaranteed to generalize. One possible explanation is that our "lazily" placed dynamic humans create enough randomness to keep the network from simply memorizing human appearance patterns. This kind of "random" data generation is also termed "domain randomization", which researchers often find useful for transferring from simulation to real-world data.

Conclusion

Inferring motion from images requires resolving challenges such as ambiguity and occlusion, and data can play an important role. The method we present for learning scene flow and rigidity is a promising approach to addressing motion ambiguity. Rigidity can be a core building block of many applications, including scene flow, tracking, SLAM, and 3D reconstruction pipelines.

We do not claim that the proposed network and our trained model generalize to any scene or solve the ultimate problem. Further work on better models or data will be needed to make it more robust to different objects, motion types, and domain discrepancies. However, we believe our findings on scene flow and our approach to creating the data can inspire more people to drive this forward.

Learning Rigidity in Dynamic Scenes for 3D Motion Field Estimation will be presented at the European Conference on Computer Vision (ECCV) 2018 in Munich (The full paper can be accessed here).
