By Minshuo Chen

Background and Motivation

Deep learning has made significant breakthroughs in various real-world applications, such as computer vision, natural language processing, healthcare, and robotics. In image classification, the winner of the 2017 ImageNet challenge achieved a top-5 error rate of 2.25% [1], where the dataset consists of about 1.2 million labeled high-resolution images in 1,000 categories.

The empirical success of deep learning, however, is not well explained by existing theories [2, 3]. Existing results suggest that, to achieve an $\epsilon$ predictive error, the number of samples needs to grow as $\epsilon^{-cD}$, where $D$ is the dimension of the data and $c$ is a constant. For ImageNet, which consists of RGB images with a resolution of 224 x 224, the sample complexity scales as $\epsilon^{-c \cdot 224 \times 224 \times 3} = \epsilon^{-150528c}$. Setting $\epsilon = 0.1$ and $c = 1$ gives rise to $10^{150528}$ samples, which far exceeds the 1.2 million labeled images in ImageNet.
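
As a quick back-of-the-envelope check, here is a minimal Python sketch using the illustrative values $\epsilon = 0.1$ and $c = 1$ from above (the number itself is too large to print, so we report its base-10 exponent):

    import math

    D = 224 * 224 * 3      # ambient dimension of an ImageNet RGB image
    eps, c = 0.1, 1.0      # illustrative target error and constant from the text

    # Predicted sample complexity n ~ eps^(-c * D); report its base-10 exponent.
    log10_n = -c * D * math.log10(eps)
    print(f"D = {D}")                  # 150528
    print(f"n ~ 10^{log10_n:.0f}")     # 10^150528
    print("ImageNet has ~1.2e6 labeled images")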

To bridge the aforementioned gap between theory and practice, we take the low-dimensional geometric structures of datasets into consideration. This is motivated by the fact that in real-world applications, high-dimensional data often depend on a small number of intrinsic parameters. 

In Figure 1, the left panel shows images of the same teapot obtained under rotations, where the intrinsic parameter is the angle of rotation; the right panel shows the Yale Face dataset, where the intrinsic parameters are the poses of the faces. These images are stored as large matrices, but each one depends on only one or two intrinsic parameters. It is therefore reasonable to assume that data lie in $\mathbb{R}^D$ but are concentrated on or near a $d$-dimensional manifold with $d \ll D$.


Figure 1. Left panel: images of the same teapot obtained under rotations [4]. Right panel: clusters in Yale Face dataset [5].

Efficient Statistical Recovery

Our research focuses on learning a regression model. Assume that $f^*$ is our target regression function and the response is $y = f^*(x) + \xi$, where $x$ is sampled on a low-dimensional manifold and $\xi$ represents noise. We answer the following question of central interest in deep learning and statistical learning:

Can we gain benefits by exploiting low-dimensional geometric structures of data, when learning regression models using deep neural networks?

Our theory provides a positive answer by establishing an efficient statistical recovery theory with a sample complexity depending on the intrinsic data dimension d.
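
To make this setup concrete, here is a small synthetic instance (our own illustrative construction, not from the paper): the covariate $x$ lives in a high-dimensional space $\mathbb{R}^D$ but is generated from a single intrinsic parameter, and the response follows $y = f^*(x) + \xi$.

    import numpy as np

    rng = np.random.default_rng(0)
    D, n, sigma = 1024, 500, 0.1            # ambient dimension, sample size, noise level

    # Intrinsic parameter: an angle theta, so the data manifold is a closed curve.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)

    # Embed the curve (cos theta, sin theta) into R^D with a fixed linear map,
    # mimicking how a rotating object produces high-dimensional images.
    A = rng.standard_normal((D, 2))
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ A.T   # covariates, shape (n, D)

    # The target f* varies only along the manifold; responses are noisy evaluations.
    f_star = np.sin(2.0 * theta) + 0.5 * np.cos(theta)
    y = f_star + sigma * rng.standard_normal(n)                  # y = f*(x) + xi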

Main Result [6]: Suppose $f^*$ is a $C^s$ function supported on a low-dimensional manifold satisfying mild assumptions. There exists a properly designed ReLU network architecture which gives rise to an estimate of $f^*$ up to error $\epsilon$, with sample complexity $\tilde{O}(\epsilon^{-\frac{2s+d}{s}})$. Here $\tilde{O}$ hides logarithmic factors in $D$ and $\epsilon$, and polynomial factors in $d$. Note that

  1. the sample complexity depends on the intrinsic dimension $d \ll D$, as illustrated numerically after this list;
  2. we consider the rectified linear unit (ReLU) activation, since it is commonly used in deep networks and alleviates the notorious vanishing gradient problem [7];
  3. our result reveals the adaptability of neural networks to low-dimensional geometric structures of data.
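
As a quick numerical illustration of item 1 (our own illustrative values, with all hidden constants and logarithmic factors dropped), take smoothness $s = 2$, intrinsic dimension $d = 2$, and target error $\epsilon = 0.1$:

    import math

    eps, s, d = 0.1, 2.0, 2     # illustrative error level, smoothness, intrinsic dimension

    # Sample complexity from the main result, ignoring constants and log factors:
    # n ~ eps^{-(2s + d)/s}.
    log10_n = (2 * s + d) / s * math.log10(1.0 / eps)
    print(f"n ~ 10^{log10_n:.0f}")   # 10^3, versus 10^150528 from the ambient-dimension bound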

Key Ingredients in Our Analysis

A natural question: How is the network architecture properly designed? At a coarse level, it crucially depends on 

  1. the smoothness of $f^*$: on a fixed manifold, a smoother function is easier to approximate, in the sense that a smaller network suffices;
  2. the curvature of the manifold: a highly curved manifold can make the approximation substantially harder (see the bottom row of Figure 2).


Figure 2. Upper row: easier approximations for smoother functions on the same manifold. Bottom row: easier approximations for functions on a flat manifold.

An Efficient Approximation Theory

Existing theories of neural networks [8, 9] show that, to approximate $C^s$ functions up to error $\epsilon$, the ReLU network size scales as $\epsilon^{-\frac{D}{s}}$. Recall that $D$ is the dimension of the data. This implies that the network size needs to grow exponentially in the data dimension, which is in sharp contrast to the neural networks used in practice (see Figure 3).
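
For concreteness, plugging in illustrative values (our choice, not taken from [8, 9]) of $s = 2$ and $\epsilon = 0.1$ for ImageNet-sized inputs with $D = 150528$ gives a network size of $\epsilon^{-D/s} = (0.1)^{-75264} = 10^{75264}$, astronomically larger than the parameter counts of the practical ImageNet models shown in Figure 3.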

Figure 3. Network size for the ImageNet dataset [10]. Existing theories suggest a huge number of parameters to achieve a desired accuracy.

We exploit the intrinsic data structures to bridge this gap by establishing an efficient approximation theory for functions on low-dimensional manifolds using deep neural networks. To approximate $C^s$ functions with error $\epsilon$, we design a network architecture with size $\tilde{O}(\epsilon^{-\frac{d}{s}})$. Given our approximation theory, we choose a network architecture consisting of $\tilde{O}(\log n)$ layers, each with width bounded by $\tilde{O}(n^{\frac{d}{2s+d}})$, where $n$ is the sample size. This gives rise to an error of estimating $f^*$ of the order $\tilde{O}(n^{-\frac{s}{2s+d}})$.
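
To illustrate how these rates translate into an architecture, here is a minimal PyTorch sketch under our own simplifications: a plain fully connected ReLU network whose depth and width follow $\tilde{O}(\log n)$ and $\tilde{O}(n^{\frac{d}{2s+d}})$ with all constants set to one. The architecture analyzed in the paper is more structured than a plain multilayer perceptron, so treat this only as a rough stand-in.

    import math
    import torch.nn as nn

    def relu_network(n, d, s, D):
        """Fully connected ReLU network whose depth/width follow the stated rates.

        n: sample size, d: intrinsic dimension, s: smoothness of f*,
        D: ambient data dimension. All constants are set to 1 for illustration.
        """
        depth = max(1, math.ceil(math.log(n)))             # O(log n) hidden layers (natural log)
        width = max(1, math.ceil(n ** (d / (2 * s + d))))  # width O(n^{d/(2s+d)})

        layers = [nn.Linear(D, width), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers.append(nn.Linear(width, 1))                 # scalar regression output
        return nn.Sequential(*layers)

    # Example: n = 10^6 samples, intrinsic dimension d = 2, smoothness s = 2, D = 1024.
    net = relu_network(n=10**6, d=2, s=2, D=1024)
    print(sum(p.numel() for p in net.parameters()))

With these example values the sketch yields roughly 14 hidden layers of width about 100, i.e., on the order of $10^5$ parameters, rather than anything exponential in $D$.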

If you are interested in reading more about nonparametric regression using deep neural networks on low-dimensional manifolds, please refer to our full paper [6]. This research is joint work by Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao.

References

[1] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[2] Györfi, László, Kohler, Michael, Krzyzak, Adam, and Walk, Harro. “A distribution-free theory of nonparametric regression.” Springer Science & Business Media, 2006.

[3] Schmidt-Hieber, Johannes. “Nonparametric regression using deep neural networks with ReLU activation function.” arXiv preprint arXiv:1708.06633 (2017).

[4] Ham, Ji Hun, Daniel D. Lee, and Lawrence K. Saul. “Learning high dimensional correspondences from low dimensional manifolds.” (2003).

[5] Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. “A global geometric framework for nonlinear dimensionality reduction.” Science 290.5500 (2000): 2319-2323.

[6] Chen, Minshuo, Jiang, Haoming, Liao, Wenjing, and Zhao, Tuo. “Efficient approximation of deep ReLU networks for functions on low dimensional manifolds.” arXiv preprint arXiv:1908.01842 (2019).

[7] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks.” International conference on artificial intelligence and statistics. 2011.

[8] Cybenko, George. “Approximation by superpositions of a sigmoidal function.” Mathematics of control, signals and systems 2.4 (1989): 303-314.

[9] Yarotsky, Dmitry. “Error bounds for approximations with deep ReLU networks.” Neural Networks 94 (2017): 103-114.

[10] Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” International conference on machine learning. 2019.
