By Minshuo Chen

#### Background and Motivation

Deep learning has made significant breakthroughs in various real-world applications, such as computer vision, natural language processing, healthcare, robotics, etc. In image classification, the winner of the 2017 ImageNet challenge retained a top-5 error rate of 2.25% [1], while the data set consists of about 1.2 million labeled high-resolution images in 1000 categories.

The empirical success of deep learning, however, is not well explained by existing theories [2, 3]. Existing results suggest that, to achieve an predictive error, the number of samples needs to grow as where *D* is the dimension of data and *c* is a constant. For ImageNet which consists of RGB images with a resolution of 224 x 224, the sample complexity scales as . Setting =0.1 and c = 1 gives rise to , which well exceeds 1.2 million labeled images in ImageNet.

To bridge the aforementioned gap between theory and practice, we take the *low-dimensional geometric structures* of datasets into consideration. This is motivated by the fact that in real-world applications, high-dimensional data often depend on a small number of intrinsic parameters.

In Figure 1, the left panel shows images of the same teapot obtained under rotations, where the intrinsic parameter is the angle of the rotation; The right panel shows the Yale Face dataset where the intrinsic parameters are the poses of the faces. These images are stored as large-size matrices, but each one only depends on one or two intrinsic parameters. It is, therefore, reasonable to assume that data are in but concentrated on or near a *d*-dimensional manifold with

Figure 1. Left panel: images of the same teapot obtained under rotations [4]. Right panel: clusters in Yale Face dataset [5].

#### Efficient Statistical Recovery

Our research focuses on learning a regression model. Assume that f* is our target regression function and the response is , where x is sampled on a low-dimensional manifold and represents noise. We answer the following question of central interest in deep learning and statistical learning:

*Can we gain benefits by exploiting low-dimensional geometric structures of data, when learning regression models using deep neural networks？*

Our theory provides a positive answer by establishing an efficient statistical recovery theory with a sample complexity depending on the intrinsic data dimension *d*.

**Main Result [6]:** Suppose f* is a function supported on a low-dimensional manifold with mild assumptions. There exists a *properly designed ReLU network architecture* which gives rise to an estimation of f* up to error , with sample complexity . Here Õ hides logarithmic factors in D, , and polynomial factors in *d*. Note that

- the sample complexity depends on the intrinsic dimension ;
- we consider rectified linear unit (ReLU) activation, since it is commonly used in deep networks, and eases the notorious vanishing gradient problem [7];
- our result reveals the adaptability of neural networks to low-dimensional geometric structures of data.

#### Key Ingredients in Our Analysis

A natural question: How is the network architecture properly designed? At a coarse level, it crucially depends on

- the smoothness of f*: on a fixed manifold, it is easier — smaller network suffices — to approximate a smoother function;
- the curvature of the manifold: highly curved manifold may skyrocket the difficulty of the approximation (see the bottom row of Figure 2).

Figure 2. Upper row: easier approximations for smoother functions on the same manifold. Bottom row: easier approximations for functions on a flat manifold.

#### An Efficient Approximation Theory

Existing theories of neural networks [8, 9] show that, to approximate functions up to error , the ReLU network size scales as . Recall that *D* is the dimension of data. This implies that the network size needs to grow exponentially in the dimension of data, which is in sharp contrast to the neural networks used in practice (see Figure 3).

Figure 3. Network size for the ImageNet dataset [10]. Existing theories suggest a huge number of parameters to achieve a desired accuracy.

We exploit the *intrinsic data structures* to bridge such a gap, by establishing an efficient approximation theory for functions on low-dimensional manifolds using deep neural networks. To approximate functions with error , we design a network architecture with size . Given our approximation theory, we choose the network architecture consisting of layers, each with width bounded by , where *n* is the sample size. This gives rise to the error of estimating f*in the order .

If you are interested in reading more about nonparametric regression using deep neural networks on low dimensional manifolds, please refer to our full-paper. This research is joint work by Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao.

#### References

[1] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” *Proceedings of the IEEE conference on compu*ter vision and pattern recognition. 2018.

[2] Györfi, László, Kohler, Michael, Krzyzak, Adam, and Walk, Harro. “A distribution-free theory of nonparametric regression.”* Springer Science & Business Media*, 2006.

[3] Schmidt-Hieber, Johannes. “Nonparametric regression using deep neural networks with ReLU activation function.” *arXiv preprint arXiv:1708.06633* (2017).

[4] Ham, Ji Hun, Daniel D. Lee, and Lawrence K. Saul. “Learning high dimensional correspondences from low dimensional manifolds.” (2003).

[5] Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. “A global geometric framework for nonlinear dimensionality reduction.” *Science 290.5500 *(2000): 2319-2323.

[6] Chen, Minshuo, Jiang, Haoming, Liao, Wenjing, and Zhao, Tuo. “Efficient approximation of deep relu networks for functions on low dimensional manifolds.” *arXiv preprint arXiv:1908.01842* (2019).

[7] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks.” *International conference on artificial intelligence and statistics*. 2011.

[8] Cybenko, George. “Approximation by superpositions of a sigmoidal function.” *Mathematics of control, signals and systems 2.4* (1989): 303-314.

[9] Yarotsky, Dmitry. “Error bounds for approximations with deep ReLU networks.” *Neural Networks 94* (2017): 103-114.

[10] Tan, Mingxing, and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” *International conference on machine learning*. 2019.