Blog | Minimizing Regret

The Statistical Orthogonality Thesis

June 27, 2026

The orthogonality thesis holds that intelligence and goals are independent: an agent could be arbitrarily intelligent while pursuing almost any objective. As a possibility this is hard to dispute. But perhaps a more useful question is not about what is...

AI Alignment via Incentives and Correction

May 7, 2025

It seems clear that the most important technology of today is artificial intelligence. Many methods to align agents often rely on simply instructing them to be aligned, e.g. Anthropic’s Constitutional AI. One well-known issue, though, is that even with these...

Spectral Transformers

June 4, 2024

One of the biggest challenges for the Transformer architecture, powering modern large language models, is their computational inefficiencies on long sequences. The computational complexity of inference of the attention module scales quadratically with the context length, which can be extremely...

Meta-optimization, adaptive gradient methods and parameter-free optimization

May 15, 2023

The study of mathematical optimization is a hallmark of the application of the scientific method to almost all engineering fields. With the rise of machine learning and large scale problems, attention shifted from interior point methods based on Newton’s method...

Blackwell approachability meets Online Conve Optimization

October 6, 2020

David Blackwell was still roaming the corridors of UC Berkeley’s stat department during my postdoc years, circa 2009. Jake Abernethy, Sasha Rakhlin, Peter Bartlett and myself were discussing his results, and his seminal contributions to prediction theory were already well...

New Methods in Control: The Gradient Perturbation Controller

October 31, 2019

The following post is based on this paper and this paper.

Boosting for Dynamical Systems

July 17, 2019

In a famous 1906 competition at a local fair in Plymouth, England, participants were asked to guess the weight of an ox. Out of a crowd of hundreds, no one came close to the ox’s actual weight, but the average...

Lecture notes: optimization for ML

June 23, 2019

Spring semester is over, yay! To celebrate summer, I’ve compiled lecture notes from the graduate course COS 598D, a.k.a. “optimization for machine learning”.

Reinforcement learning without rewards

April 15, 2019

The following post is based on this paper.

Taking control by convex optimization

May 29, 2018

The picture to the right depicts Rudolf Kalman, a pioneer of modern control, receiving the presidential medal of freedom from president Barack Obama.

Unsupervised learning III: worst-case compression-based metrics that generalize

March 1, 2018

This post is a long-overdue sequel to parts I and II of previous posts on unsupervised learning. The long time it took me to publish this one is not by chance – I am still ambivalent on the merit, or...

Hyperparameter optimization and the analysis of boolean functions

June 8, 2017

I’ve always admired the beautiful theory on the analysis of boolean functions (see the excellent book of Ryan O’Donnell, as well Gil Kalai’s lectures). Heck, it used to be my main area of investigation back in the early grad-school days we were studying...

Unsupervised Learning II: the power of improper convex relaxations

October 30, 2016

In this continuation post I’d like to bring out a critical component of learning theory that is essentially missing in today’s approaches for unsupervised learning: improper learning by convex relaxations.

More than a decade of online convex optimization

July 7, 2016

This nostalgic post is written after a tutorial in ICML 2016 as a recollection of a few memories with my friend Satyen Kale.

The complexity zoo and reductions in optimization

May 26, 2016

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and...

The two cultures of optimization

March 3, 2016

The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school...

Making second order methods practical for machine learning

March 2, 2016

First-order methods such as Gradient Descent, AdaGrad, SVRG, etc. dominate the landscape of optimization for machine learning due to their extremely low per-iteration computational cost. Second order methods have largely been ignored in this context due to their prohibitively large...

Worst-case vs. average-case learning

February 21, 2011

Classical statistical learning theory deals with finding a consistent hypothesis given sufficiently many examples. Such a hypothesis may be a rule for classifying emails into spam/not-spam, and examples are classified emails.

Duality for undergrads

February 3, 2011

The duality theorem of linear programming and its extensions is a fundamental pillar of optimization. It lies at the heart of an elegant theory which allows us not only to reason about the structure of continuous optimization problems, but also design combinatorial...