Overcoming Large-scale Annotation Requirements for Understanding Videos in the Wild

By Min-Hung Chen, Zsolt Kira and Ghassan AlRegib

Videos have become an increasingly important medium from which we obtain valuable information and knowledge, motivating the development of video analysis techniques that can, for example, power recommendations or support discovery for different objectives. Given the recent progress in deep neural networks, many approaches for video understanding and action recognition have been developed. However, the majority of the performance gains have come from massive amounts of labeled data fed to supervised learning, which demands a great deal of time for manual human annotation. Therefore, effectively and efficiently generalizing trained models to different datasets has become an important problem for enabling real-world applications.

Current Challenges

Unfortunately, countless variations in our world make this generalization task very challenging.

For example, if we want to classify two actions, basketball and walk, in a video dataset, we can easily train a model with good performance using a current state-of-the-art method. However, this model will not always work on other datasets. As shown in the left half of the following figure, the target dataset may have a very different data distribution, making it difficult for the source model to classify basketball and walk correctly.


(above) The overview of the domain shift problem and the goal of domain alignment.

This problem is called the Domain Shift problem (the different datasets in the above example can be considered different domains), and it exists across many tasks (e.g. classification, segmentation), modalities (e.g. video, audio), and real-world applications (e.g. autonomous vehicles). In other words, domain shift problems arise everywhere, every day.

To alleviate this problem, we aim to project the features of different domains into the same feature space so that we can train a model that works on both domains. This approach is known as Domain Adaptation (DA), also called Domain Alignment. In the above example, after domain adaptation, we can train a model that effectively classifies basketball and walk in both domains, as shown in the right half of the above figure.
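To make "projecting features of different domains into the same feature space" concrete, here is a toy NumPy sketch (not from the paper): it measures the gap between the feature means of two domains and shows how a simple alignment step, centering each domain, removes it. Real DA methods learn this mapping, often adversarially.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features: source and target domains share class structure
# but differ by a domain-specific offset (the "domain shift").
source = rng.normal(loc=0.0, scale=1.0, size=(100, 16))
target = rng.normal(loc=3.0, scale=1.0, size=(100, 16))

def domain_gap(a, b):
    """Distance between the two domain means, a crude discrepancy measure."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

print(f"gap before alignment: {domain_gap(source, target):.2f}")

# A minimal "alignment": map both domains to a shared, zero-mean
# feature space. Real DA methods learn this mapping instead.
source_aligned = source - source.mean(axis=0)
target_aligned = target - target.mean(axis=0)

print(f"gap after alignment:  {domain_gap(source_aligned, target_aligned):.2f}")
```

After alignment the two distributions overlap, so a classifier trained on source features has a chance of working on target features as well.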

In this post, we would like to introduce how to achieve DA for videos with our recent work “Temporal Attentive Alignment for Large-Scale Video Domain Adaptation (ICCV 2019 Oral)”.

Why is this problem important? 

This work bridges the gap between different environments in order to remove the need for additional annotation, especially for video data, when adapting models to new environments.

Let’s say we want to develop pedestrian activity analysis techniques for autonomous vehicles. Ideally, we need to collect and annotate data in various environments, such as different cities and weather conditions. However, this is time-consuming and impractical. To solve this problem, we can develop a model using a game development platform (e.g. Unity) where we can quickly collect large-scale data with auto-generated annotations. We can then apply this work to adapt the model for real-world usage. 

This work can also be applied to retail store industries to quickly generalize the customer analysis system to different branches with various environmental settings, like different lighting conditions, encouraging the progress of autonomous retail stores. In other words, this work can facilitate the process of converting academic research to industrial products that directly affect human life.

Domain Shift in Videos

Since many image-based DA approaches exist, you may ask, “Why not directly apply those methods to deal with video domain shift? What makes video-based DA different from image-based DA? If different, how can one achieve video-based DA?” 

To answer these questions, we first investigate: What makes video different from image?

A video is composed of a set of frames, and its information can be divided into two parts:

  • Spatial: visual structural information within each frame
  • Temporal: changes in visual information across time
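The two information sources above can be sketched in NumPy (a toy illustration; it assumes each frame has already been encoded into a feature vector, e.g. by a frame-level CNN):

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "video": T frames, each already encoded into a D-dimensional
# spatial feature vector (e.g. by a frame-level CNN).
T, D = 8, 32
frame_features = rng.normal(size=(T, D))

# Spatial information: per-frame features (what each frame shows).
spatial = frame_features

# Temporal information: how visual content changes across time,
# crudely summarized here as frame-to-frame differences.
temporal = np.diff(frame_features, axis=0)   # shape (T-1, D)

# A naive video-level feature that ignores time entirely; this is
# essentially all that a purely image-based method can align.
pooled = frame_features.mean(axis=0)         # shape (D,)

print(spatial.shape, temporal.shape, pooled.shape)
```

The key point: the `temporal` part carries information that the time-agnostic `pooled` feature throws away, which is exactly what image-based DA fails to align.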

As shown below, video domain shift exists along both of these directions:


(above) Video domain shift needs to be addressed by domain alignment in both spatial and temporal directions.

Image-based DA methods only focus on aligning the spatial feature space, leaving the temporal domain shift unaddressed. However, representing temporal dynamics is critical for understanding videos. This suggests that the temporal domain shift accounts for a large portion of the overall domain shift, which cannot be resolved by image-based DA alone. Therefore, in this work we aim to develop DA approaches that diminish the domain shift for videos.

First, We Need Datasets

Unfortunately, existing datasets are all small-scale with limited domain shift, so a standard deep convolutional neural network (CNN) can achieve nearly perfect performance even without any DA method. Therefore, we propose two large-scale datasets for investigating video DA, including videos from both virtual and real domains. Please check our paper for dataset statistics and the GitHub repository for usage details.


(above) Example walk videos from our datasets, showing obvious variations across domains.

Domain Alignment for Videos

Besides the datasets, we propose the Temporal Attentive Adversarial Adaptation Network (TA3N), built on the following ideas:

  1. Simply modifying existing image-based DA approaches does not fully exploit the temporal nature of videos. Therefore, we first learn relation-based temporal dynamics, motivated by the fact that humans recognize actions by reasoning about the relations between observations across time.
  2. The overall temporal dynamics of a video consist of multiple local temporal dynamics corresponding to different motion characteristics. Therefore, we perform DA on all the local temporal dynamics to jointly align and learn the overall temporal dynamics.
  3. Given the above, a video-level feature representation can be considered a combination of multiple local temporal features. However, not all local temporal features contribute equally to the overall domain shift, a fact ignored by prior DA work. Therefore, we propose to pay more attention to aligning the features that contribute most to the overall domain shift, leading to more effective domain alignment.
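The three ideas above can be sketched as a toy NumPy example. This is a deliberate simplification, not the actual TA3N implementation: the pairwise-mean relation features and the mean-distance discrepancy below are stand-ins for the learned relation module and the domain-discriminator-based attention used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 5, 16

def relation_features(frames):
    """Local temporal dynamics: one feature per ordered frame pair,
    crudely modeled as the mean of the two frame features
    (a stand-in for a learned relation module)."""
    pairs = [(i, j) for i in range(len(frames)) for j in range(i + 1, len(frames))]
    return np.stack([(frames[i] + frames[j]) / 2 for i, j in pairs])

# Idea 1: relation-based local temporal features for each domain.
source = relation_features(rng.normal(0.0, 1.0, size=(T, D)))
target = relation_features(rng.normal(2.0, 1.0, size=(T, D)))

# Idea 2: estimate the domain discrepancy of every local dynamic
# (here a simple mean-feature distance per relation; TA3N uses the
# entropy of a domain discriminator instead).
discrepancy = np.linalg.norm(source - target, axis=1)

# Idea 3: attend more to the relations contributing more domain shift.
attention = np.exp(discrepancy) / np.exp(discrepancy).sum()

# Video-level feature: attention-weighted combination of local dynamics.
video_feature = (attention[:, None] * source).sum(axis=0)
print(video_feature.shape, round(float(attention.sum()), 6))
```

In the full model these attention weights reweight an adversarial alignment loss, so the discriminator's gradient focuses on the local dynamics that are hardest to align.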


(above) The illustration of our main approach. Thicker arrows correspond to larger attention values.

To demonstrate how TA3N achieves domain alignment, we evaluate our approaches on the existing small-scale datasets (left two columns) and our self-collected datasets (right two columns). "Source only" means directly applying the model trained on source data to the target dataset without any DA method; its high performance on the existing datasets indicates their limited domain shift. "Video-based DANN" is a simple extension of a popular image-based DA method to videos. On all the investigated datasets, TA3N improves significantly over the other methods, as shown in the following table:

(Source → Target)    H → U (small)    O → U    U → H (full)    K → G
Source only          97.35            87.08    70.28           17.22
Video-based DANN     98.41            90.00    71.11           20.56
TA3N                 99.47            92.92    78.33           27.50

(above) The comparison of different approaches in various cross-domain settings (H: HMDB, U: UCF, O: Olympic, K: Kinetics, G: Gameplay).

In addition, we use t-SNE, a common visualization technique, to show that TA3N groups source data (blue dots) into denser clusters and generalizes the distribution to the target domain (orange dots) as well, as shown below:
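For readers who want to reproduce this kind of plot, t-SNE is available in scikit-learn. Below is a minimal sketch on toy features; the domain offset and feature dimensions are assumptions for illustration, not values from our experiments.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)

# Toy high-dimensional features for two domains: shared structure
# plus a domain offset (stand-ins for learned video features).
source = rng.normal(0.0, 1.0, size=(100, 32))
target = rng.normal(1.5, 1.0, size=(100, 32))

features = np.vstack([source, target])

# Project to 2-D for visualization; each row becomes an (x, y) point,
# which can then be scatter-plotted and colored by domain.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

print(embedding.shape)  # (200, 2)
```

Well-aligned features produce overlapping source and target point clouds; poorly aligned features show two separated clouds, as in the "Source only" panel of the figure.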


(above) The comparison of t-SNE visualization. Here we show the “UCF → HMDB” setting.

For more detailed qualitative and quantitative experimental results, please check our paper and GitHub repository.


To summarize, we present two large-scale cross-domain video datasets covering both real and virtual domains, and we propose an effective spatio-temporal domain alignment approach that achieves state-of-the-art performance on all of the investigated cross-domain video datasets. This work tackles the major challenge of deploying video understanding models in various settings without requiring large-scale data annotation for every setting.


This work was mainly done in OLIVES@GT (GitHub) in collaboration with RIPL@GT (GitHub). Feel free to check the websites and GitHub repositories for other interesting work!

This post is based on the following paper:

Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng

ICCV 2019 (Paper (Oral) | Code & Data)
