By Minshuo Chen

Background and Motivation

Deep learning has made significant breakthroughs in various real-world applications, such as computer vision, natural language processing, healthcare, and robotics. In image classification, the winner of the 2017 ImageNet challenge achieved a top-5 error rate of 2.25% [1], where the dataset consists of about 1.2 million labeled high-resolution images in 1,000 categories.

The empirical success of deep learning, however, is not well explained by existing theories [2, 3]. Existing results suggest that, to achieve an $\epsilon$ predictive error, the number of samples needs to grow as $\epsilon^{-cD}$, where $D$ is the dimension of the data and $c$ is a constant. For ImageNet, which consists of RGB images with a resolution of 224 x 224, the sample complexity scales as $\epsilon^{-c \cdot 224 \times 224 \times 3} = \epsilon^{-150528c}$. Setting $\epsilon = 0.1$ and $c = 1$ gives rise to $10^{150528}$ samples, which far exceeds the 1.2 million labeled images in ImageNet.
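
As a quick back-of-the-envelope check, here is a minimal Python sketch using the illustrative values $\epsilon = 0.1$ and $c = 1$ from above (the number itself is too large to print, so we report its base-10 exponent):

    import math

    D = 224 * 224 * 3      # ambient dimension of an ImageNet RGB image
    eps, c = 0.1, 1.0      # illustrative target error and constant from the text

    # Predicted sample complexity n ~ eps^(-c * D); report its base-10 exponent.
    log10_n = -c * D * math.log10(eps)
    print(f"D = {D}")                  # 150528
    print(f"n ~ 10^{log10_n:.0f}")     # 10^150528
    print("ImageNet has ~1.2e6 labeled images")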

To bridge the aforementioned gap between theory and practice, we take the low-dimensional geometric structures of datasets into consideration. This is motivated by the fact that in real-world applications, high-dimensional data often depend on a small number of intrinsic parameters. 

In Figure 1, the left panel shows images of the same teapot obtained under rotations, where the intrinsic parameter is the angle of rotation; the right panel shows the Yale Face dataset, where the intrinsic parameters are the poses of the faces. These images are stored as large matrices, but each one depends on only one or two intrinsic parameters. It is therefore reasonable to assume that data lie in $\mathbb{R}^D$ but are concentrated on or near a $d$-dimensional manifold with $d \ll D$.


Figure 1. Left panel: images of the same teapot obtained under rotations [4]. Right panel: clusters in Yale Face dataset [5].

Efficient Statistical Recovery

Our research focuses on learning a regression model. Assume that $f^*$ is our target regression function and the response is $y = f^*(x) + \xi$, where $x$ is sampled on a low-dimensional manifold and $\xi$ represents noise. We answer the following question of central interest in deep learning and statistical learning:

Can we gain benefits by exploiting low-dimensional geometric structures of data, when learning regression models using deep neural networks?

Our theory provides a positive answer by establishing an efficient statistical recovery theory with a sample complexity depending on the intrinsic data dimension d.
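
To make this setup concrete, here is a small synthetic instance (our own illustrative construction, not from the paper): the covariate $x$ lives in a high-dimensional space $\mathbb{R}^D$ but is generated from a single intrinsic parameter, and the response follows $y = f^*(x) + \xi$.

    import numpy as np

    rng = np.random.default_rng(0)
    D, n, sigma = 1024, 500, 0.1            # ambient dimension, sample size, noise level

    # Intrinsic parameter: an angle theta, so the data manifold is a closed curve.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)

    # Embed the curve (cos theta, sin theta) into R^D with a fixed linear map,
    # mimicking how a rotating object produces high-dimensional images.
    A = rng.standard_normal((D, 2))
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ A.T   # covariates, shape (n, D)

    # The target f* varies only along the manifold; responses are noisy evaluations.
    f_star = np.sin(2.0 * theta) + 0.5 * np.cos(theta)
    y = f_star + sigma * rng.standard_normal(n)                  # y = f*(x) + xi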

Main Result [6]: Suppose $f^*$ is a $C^s$ function supported on a low-dimensional manifold satisfying mild assumptions. There exists a properly designed ReLU network architecture which gives rise to an estimate of $f^*$ up to error $\epsilon$, with sample complexity $\tilde{O}(\epsilon^{-\frac{2s+d}{s}})$. Here $\tilde{O}$ hides logarithmic factors in $D$ and $\epsilon$, and polynomial factors in $d$. Note that

  1. the sample complexity depends on the intrinsic dimension $d \ll D$, as illustrated numerically after this list;
  2. we consider the rectified linear unit (ReLU) activation, since it is commonly used in deep networks and alleviates the notorious vanishing gradient problem [7];
  3. our result reveals the adaptability of neural networks to low-dimensional geometric structures of data.
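
As a quick numerical illustration of item 1 (our own illustrative values, with all hidden constants and logarithmic factors dropped), take smoothness $s = 2$, intrinsic dimension $d = 2$, and target error $\epsilon = 0.1$:

    import math

    eps, s, d = 0.1, 2.0, 2     # illustrative error level, smoothness, intrinsic dimension

    # Sample complexity from the main result, ignoring constants and log factors:
    # n ~ eps^{-(2s + d)/s}.
    log10_n = (2 * s + d) / s * math.log10(1.0 / eps)
    print(f"n ~ 10^{log10_n:.0f}")   # 10^3, versus 10^150528 from the ambient-dimension bound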

Key Ingredients in Our Analysis

A natural question: How is the network architecture properly designed? At a coarse level, it crucially depends on 

  1. the smoothness of $f^*$: on a fixed manifold, a smoother function is easier to approximate, in the sense that a smaller network suffices;
  2. the curvature of the manifold: a highly curved manifold can make the approximation substantially harder (see the bottom row of Figure 2).


Figure 2. Upper row: easier approximations for smoother functions on the same manifold. Bottom row: easier approximations for functions on a flat manifold.

An Efficient Approximation Theory

Existing theories of neural networks [8, 9] show that, to approximate $C^s$ functions up to error $\epsilon$, the ReLU network size scales as $\epsilon^{-\frac{D}{s}}$. Recall that $D$ is the dimension of the data. This implies that the network size needs to grow exponentially in the data dimension, which is in sharp contrast to the neural networks used in practice (see Figure 3).
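
For concreteness, plugging in illustrative values (our choice, not taken from [8, 9]) of $s = 2$ and $\epsilon = 0.1$ for ImageNet-sized inputs with $D = 150528$ gives a network size of $\epsilon^{-D/s} = (0.1)^{-75264} = 10^{75264}$, astronomically larger than the parameter counts of the practical ImageNet models shown in Figure 3.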

Figure 3. Network size for the ImageNet dataset [10]. Existing theories suggest a huge number of parameters to achieve a desired accuracy.

We exploit the intrinsic data structures to bridge this gap by establishing an efficient approximation theory for functions on low-dimensional manifolds using deep neural networks. To approximate $C^s$ functions with error $\epsilon$, we design a network architecture with size $\tilde{O}(\epsilon^{-\frac{d}{s}})$. Given our approximation theory, we choose a network architecture consisting of $\tilde{O}(\log n)$ layers, each with width bounded by $\tilde{O}(n^{\frac{d}{2s+d}})$, where $n$ is the sample size. This gives rise to an error of estimating $f^*$ of the order $\tilde{O}(n^{-\frac{s}{2s+d}})$.
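
To illustrate how these rates translate into an architecture, here is a minimal PyTorch sketch under our own simplifications: a plain fully connected ReLU network whose depth and width follow $\tilde{O}(\log n)$ and $\tilde{O}(n^{\frac{d}{2s+d}})$ with all constants set to one. The architecture analyzed in the paper is more structured than a plain multilayer perceptron, so treat this only as a rough stand-in.

    import math
    import torch.nn as nn

    def relu_network(n, d, s, D):
        """Fully connected ReLU network whose depth/width follow the stated rates.

        n: sample size, d: intrinsic dimension, s: smoothness of f*,
        D: ambient data dimension. All constants are set to 1 for illustration.
        """
        depth = max(1, math.ceil(math.log(n)))             # O(log n) hidden layers (natural log)
        width = max(1, math.ceil(n ** (d / (2 * s + d))))  # width O(n^{d/(2s+d)})

        layers = [nn.Linear(D, width), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers.append(nn.Linear(width, 1))                 # scalar regression output
        return nn.Sequential(*layers)

    # Example: n = 10^6 samples, intrinsic dimension d = 2, smoothness s = 2, D = 1024.
    net = relu_network(n=10**6, d=2, s=2, D=1024)
    print(sum(p.numel() for p in net.parameters()))

With these example values the sketch yields roughly 14 hidden layers of width about 100, i.e., on the order of $10^5$ parameters, rather than anything exponential in $D$.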

If you are interested in reading more about nonparametric regression using deep neural networks on low-dimensional manifolds, please refer to our full paper [6]. This research is joint work by Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao.

References

[1] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[2] Györfi, László, Kohler, Michael, Krzyzak, Adam, and Walk, Harro. “A distribution-free theory of nonparametric regression.” Springer Science & Business Media, 2006.

[3] Schmidt-Hieber, Johannes. “Nonparametric regression using deep neural networks with ReLU activation function.” arXiv preprint arXiv:1708.06633 (2017).

[4] Ham, Ji Hun, Daniel D. Lee, and Lawrence K. Saul. “Learning high dimensional correspondences from low dimensional manifolds.” (2003).

[5] Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. “A global geometric framework for nonlinear dimensionality reduction.” Science 290.5500 (2000): 2319-2323.

[6] Chen, Minshuo, Jiang, Haoming, Liao, Wenjing, and Zhao, Tuo. “Efficient approximation of deep ReLU networks for functions on low dimensional manifolds.” arXiv preprint arXiv:1908.01842 (2019).

[7] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks.” International conference on artificial intelligence and statistics. 2011.

[8] Cybenko, George. “Approximation by superpositions of a sigmoidal function.” Mathematics of control, signals and systems 2.4 (1989): 303-314.

[9] Yarotsky, Dmitry. “Error bounds for approximations with deep ReLU networks.” Neural Networks 94 (2017): 103-114.

[10] Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” International conference on machine learning. 2019.
