This is an informal summary of our recent paper Blended Matching Pursuit with my advisor Sebastian Pokutta. It will be presented at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver, British Columbia, Dec. 8-14, 2019. In this post, we motivate and explain the main ideas behind the design of our algorithm.
Introduction
Most applications in machine learning involve the minimization of a loss function at the centerpiece of their problem, whether it be in, e.g., market prediction (helping banks make better decisions), recommender systems (a streaming app finding new music for you), or object detection (detecting breast cancer from mammograms). To this end, we address the problem of minimizing a convex function f using linear combinations of points from a fixed set D, which is specific to the given problem. We refer to these points as and they represent the parametrization of our problem. Hence, we are interested in finding an ε-approximate minimizer of f that can be expressed with as little atoms as possible. When such a small representation is available, it tells us which atoms are most important to our problem. This property, known as sparsity, is essential in many machine learning applications for its numerous benefits, including higher interpretability and better performance. We summarize our general setup in Problem 1.
1 In typical machine learning applications, the atoms are often called the features of the problem.
Matching Pursuit Algorithms
The most popular method for minimizing a convex function is the gradient descent algorithm, which origins can be traced back to Cauchy [1847]. At each iteration, it descends in the optimal direction and performs the update
, where
is a step size. However, this procedure potentially yields very poor sparsity, as
may be a combination of many atoms if it is the case for
, even if
is sparse. Hence, in order to ensure good sparsity in the solution, the formulation of the problem is often made more complex by using sparsity-inducing constraints (e.g., the lasso), which can require the tedious tuning of hyper-parameters.
In the signal processing community, a popular method that avoids such a reformulation by inherently producing a sparse solution is the Matching Pursuit algorithm [Mallat and Zhang, 1993]. It starts from an arbitrary extremely sparse point (e.g., an atom) and sequentially adds one atom
to
to form
until it becomes a sufficiently good quality approximate solution. Thus, after T iterations at most T atoms have been added, which is very significant in high-dimensional spaces. In recent work, Locatello et al. [2017] proposed the Generalized Matching Pursuit algorithm (GMP) and the Orthogonal Matching Pursuit algorithm (OMP) to solve Problem 1. These algorithms are presented in Algorithm 1, where Line 3 shows that
is selected so as to maximize the alignment with the negative gradient, hereby mimicking gradient descent, however preserving sparsity since only 1 atom is added in a given iteration. The set
is the set of atoms collected in
and its cardinality is the sparsity of
.
Below we provide an illustration of the GMP procedure.
GMP performs the update where
is obtained by line search (Line 5 in Algorithm 1). However, this does not prevent GMP to select atoms that have already been added in previous iterations or that are redundant. This is resolved in OMP, which reoptimizes f at each iteration over the linear subspace spanned by all the selected atoms (Line 6 in Algorithm 1), but this is very expensive. Thus, GMP and OMP have very opposite behaviors as GMP converges much faster while OMP produces iterates with much higher sparsity. That is, GMP finds a point
satisfying
faster (in time) than OMP, but the point that OMP will end up finding has a much smaller m. Therefore, it is not possible to enjoy both properties of speed and sparsity simultaneously, and, in practice, we must trade one for the other.
Blended Matching Pursuit
In our paper, we unify the best of both algorithms and design a Blended Matching Pursuit algorithm (BMP). Noting that an OMP iteration is actually a sequence of projected gradient steps (PG), i.e., a sequence of gradient descent steps with projections back onto , our key observation is that this sequence is overkill and a sweet spot exists where it can be truncated and replaced with one GMP step. After some number of PG steps, additional PG steps will not decrease the function value significatively and we can afford to add 1 atom and take a GMP step. Essentially, by proceeding this way one can achieve the same sparsity as in OMP while converging much faster. This idea is at the core of the design of our BMP algorithm.
In order to combine the PG and GMP steps in a smooth and efficient manner, we define the dual gap estimates . Since we want to favor PG steps as they do not add a new atom and hence preserve the sparsity of the iterates, the dual gap estimates monitor the blending of steps by establishing at each iteration a threshold on the progress required to take a PG step: BMP takes a PG step if its progress guarantee satisfies this requirement, else BMP takes a GMP step. We refer the interested reader to our paper for technical details.
We establish convergence results for BMP and we validate in experiments that BMP converges faster than GMP while producing iterates with sparsity equivalent to that of OMP. This is of fundamental interest for practitioners. In particular, we establish linear convergence rate for a wide range of non-strongly convex functions via a sharpness property around the set of minimizers of f. Interestingly, most well-behaved convex functions satisfy this property [Bolte et al., 2007], which subsumes that of strong convexity. We further compare BMP to state-of-the-art algorithms, including Conditional Gradient with Enhancement and Truncation (CoGEnT) [Rao et al., 2015], Blended Conditional Gradients (BCG) [Braun et al., 2019], and Accelerated Matching Pursuit (accMP) [Locatello et al., 2018]. The results are identical, namely, that BMP has the fastest speed of convergence while its sparsity is close-to-optimal, with the optimal sparsity being that of OMP.
Computational Experiments
To conclude this post, we present two computational experiments. The first one compares BMP vs. GMP, OMP, BCG, and CoGEnT on the minimization of an arbitrarily chosen function over the set of canonical vectors
, where
. The plot on the left shows the convergence speed in number of iterations, the plot in the middle shows the convergence speed in time, and the plot on the right shows the convergence speed in number of atoms, i.e., the sparsity of the algorithm. We can see in the middle plot that GMP (blue) converges faster than OMP (green) while the plot on the right shows that OMP achieves higher sparsity, as described earlier. BMP (orange) converges the fastest in time and achieves sparsity equivalent to that of OMP.
2 Strong convexity is a standard requirement to obtain linear convergence rates, however this considerably restricts the set of candidate functions. For BMP, it is possible to obtain these rates without the strong convexity requirement.
The second experiment compares BMP vs. accMP on an experiment from Locatello et al. [2018], consisting in minimizing over a random dictionary D of 200 atoms. Our code framework is consistent with their implementation.
References
J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205‒1223, 2007.
G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients. In Proceedings of the 36th International Conference on Machine Learning, pages 735‒743, 2019.
A. Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées. In Comptes rendus hebdomadaires des séances de l’Académie des sciences, pages 536‒538, 1847.
C. W. Combettes and S. Pokutta. Blended matching pursuit. In Advances in Neural Information Processing Systems 32, 2019. To appear.
F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 860–868, 2017.
F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. In Proceedings of the 35th International Conference on Machine Learning, pages 3198‒3207, 2018.
S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397‒3415, 1993.
N. Rao, S. Shah, and S. Wright. Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798‒5811, 2015.