Minimizing Regret


Hazan Lab @ Princeton University

26 May 2016

The complexity zoo and reductions in optimization

by zeyuan, Elad Hazan

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.

This is especially acute for courses that do not deal directly with optimization, where it is presented as a tool for learning or as a basic building block for other algorithms. Some approaches:

All of these variants have different proofs whose connections are perhaps not immediate. If one wishes to go into more depth, usually in convex optimization courses one covers the full spectrum of different smoothness/strong-convexity/acceleration/stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)

This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these results by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of this book.

Classical Strong-Convexity and Smoothness Reductions

Given any optimization algorithm \(A\) for the well-conditioned case (i.e., the strongly convex and smooth case), we can derive an algorithm for smooth but not strongly convex functions as follows.

Given a non-strongly convex but smooth objective \(f\), define a new objective by the strong-convexity reduction: \(f_1(x) = f(x) + \epsilon \|x\|^2\). It is straightforward to see that \(f_1\) differs from \(f\) by at most \(\epsilon\) times a distance factor, and in addition it is \(\epsilon\)-strongly convex. Thus, one can apply \(A\) to minimize \(f_1\) and get a solution which is not too far from the optimal solution for \(f\) itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.
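As a minimal sketch of this reduction, the Python snippet below uses plain gradient descent as the algorithm \(A\) and a toy objective \(f(x) = \frac{1}{2}(x_1 + x_2 - 2)^2\), which is smooth but flat along one direction and hence not strongly convex. The function names, the toy objective, the step size \(1/(L + 2\epsilon)\), and the iteration count are all assumptions chosen for illustration; they do not come from the course or the paper.

```python
def f(x):
    # Toy objective: smooth (L = 2), convex, but not strongly convex,
    # because it is constant along the direction (1, -1).
    return 0.5 * (x[0] + x[1] - 2.0) ** 2


def grad_f(x):
    r = x[0] + x[1] - 2.0
    return [r, r]


def gradient_descent(grad, x0, step, iters):
    # Stand-in for the well-conditioned solver A (an assumption of this sketch).
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def classical_reduction(grad, x0, eps, smoothness, iters):
    # Minimize f_1(x) = f(x) + eps * ||x||^2 instead of f.
    def grad_f1(x):
        return [gi + 2.0 * eps * xi for gi, xi in zip(grad(x), x)]

    # f_1 is (smoothness + 2 * eps)-smooth, so this step size is safe.
    step = 1.0 / (smoothness + 2.0 * eps)
    return gradient_descent(grad_f1, x0, step, iters)


x = classical_reduction(grad_f, [3.0, -1.0], eps=0.01, smoothness=2.0, iters=5000)
print(x, f(x))
```

For this toy problem the minimizer of \(f_1\) is \(x_1 = x_2 = 1/(1+\epsilon)\), so the returned point has a small but nonzero value of \(f\): the regularizer pulls the solution slightly away from the true minimum of \(f\), and the gap only shrinks if \(\epsilon\) is chosen smaller.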

Similar simplistic reductions can be derived for (finite-sum forms of) non-smooth but strongly-convex functions (via randomized smoothing or Fenchel duality), and for functions that are neither smooth nor strongly-convex by just applying both reductions simultaneously. Notice that such classes of functions include famous machine learning problems such as SVM, Logistic Regression, L1-SVM, Lasso, and many others.

Necessity of Reductions

This is not only a pedagogical question. In fact, very few algorithms apply to the entire spectrum of strong-convexity / smoothness regimes, and thus reductions are very often intrinsically necessary. To name a few examples,

Optimality and Practicality of Reductions

The folklore strong-convexity and smoothing reductions are suboptimal. Focusing on the strong-convexity reduction for instance:

These theoretical concerns also translate into running time losses and parameter tuning difficulties in practice. For such reasons, researchers usually put their effort into designing unbiased methods instead.

Academic papers often derive an optimization improvement for only one of the settings, leaving the other settings unaddressed. An optimal and unbiased black-box reduction is thus a tool to extend optimization algorithms from one domain to the rest.

Optimal, Unbiased, and Practical Reductions

In this paper, we give optimal and unbiased reductions. For instance, the new reduction, when applied to SVRG, implies the same running time as SVRG++ up to constants, and is unbiased, so it converges to the global minimum. Perhaps more surprisingly, these new reductions imply new theoretical results that were not previously known by direct methods. To name two such results:

These reductions are surprisingly simple. In the language of strong-convexity reduction, the new algorithm starts with a regularizer \(\lambda \|x\|^2\) of some large weight \(\lambda\), and then keeps halving it throughout the convergence. Here, the time to decrease \(\lambda\) can be either decided by theory or by practice (such as by computing duality gap).
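The halving idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm as specified in the paper: it uses plain gradient descent as the inner solver, a toy objective \(f(x) = \frac{1}{2}(x_1 + x_2 - 2)^2\), and a fixed number of inner iterations per epoch as the (hypothetical) rule for when to halve \(\lambda\). All names and parameter values are made up for the example.

```python
def f(x):
    # Toy objective: smooth (L = 2) and convex, but not strongly convex.
    return 0.5 * (x[0] + x[1] - 2.0) ** 2


def grad_f(x):
    r = x[0] + x[1] - 2.0
    return [r, r]


def halving_reduction(grad, x0, lam0, smoothness, epochs, inner_iters):
    # Each epoch approximately minimizes f(x) + lam * ||x||^2 with gradient
    # descent, warm-started from the previous epoch, and then halves lam.
    x = list(x0)
    lam = lam0
    for _ in range(epochs):
        # The regularized objective is (smoothness + 2 * lam)-smooth.
        step = 1.0 / (smoothness + 2.0 * lam)
        for _ in range(inner_iters):
            g = grad(x)
            x = [xi - step * (gi + 2.0 * lam * xi) for xi, gi in zip(x, g)]
        lam /= 2.0  # assumption: halve after a fixed inner budget
    return x


x = halving_reduction(grad_f, [3.0, -1.0], lam0=1.0, smoothness=2.0,
                      epochs=20, inner_iters=50)
print(x, f(x))
```

Because the regularizer weight keeps shrinking, the bias it introduces shrinks with it, so no final accuracy has to be fixed in advance through the choice of a single weight. In practice the fixed inner budget used here could be replaced by a stopping test such as a duality gap, as mentioned above.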

The figure below demonstrates the practical performance of our new reduction (red dotted curve) as compared to the classical biased reduction (blue curves, with different regularizer weights).

As a final word: if you were ever debating whether to post your paper on arXiv, here is yet another example of how quickly it helps research propagate. Only a few weeks after our paper was made available online, Woodworth and Srebro have already made use of our reductions in their new paper.
