Hazan Lab @ Princeton University
It seems clear that the most important technology of today is artificial intelligence. Many methods to align agents often rely on simply instructing them to be aligned, e.g. Anthropic’s Constitutional AI. One well-known issue, though, is that even with these...
One of the biggest challenges for the Transformer architecture, powering modern large language models, is their computational inefficiencies on long sequences. The computational complexity of inference of the attention module scales quadratically with the context length, which can be extremely...
The study of mathematical optimization is a hallmark of the application of the scientific method to almost all engineering fields. With the rise of machine learning and large scale problems, attention shifted from interior point methods based on Newton’s method...
David Blackwell was still roaming the corridors of UC Berkeley’s stat department during my postdoc years, circa 2009. Jake Abernethy, Sasha Rakhlin, Peter Bartlett and myself were discussing his results, and his seminal contributions to prediction theory were already well...
The following post is based on this paper and this paper.
In a famous 1906 competition at a local fair in Plymouth, England, participants were asked to guess the weight of an ox. Out of a crowd of hundreds, no one came close to the ox’s actual weight, but the average...
Spring semester is over, yay! To celebrate summer, I’ve compiled lecture notes from the graduate course COS 598D, a.k.a. “optimization for machine learning”.
The picture to the right depicts Rudolf Kalman, a pioneer of modern control, receiving the presidential medal of freedom from president Barack Obama.
This post is a long-overdue sequel to parts I and II of previous posts on unsupervised learning. The long time it took me to publish this one is not by chance – I am still ambivalent on the merit, or...
I’ve always admired the beautiful theory on the analysis of boolean functions (see the excellent book of Ryan O’Donnell, as well Gil Kalai’s lectures). Heck, it used to be my main area of investigation back in the early grad-school days we were studying...
In this continuation post I’d like to bring out a critical component of learning theory that is essentially missing in today’s approaches for unsupervised learning: improper learning by convex relaxations.
This nostalgic post is written after a tutorial in ICML 2016 as a recollection of a few memories with my friend Satyen Kale.
The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and...
The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school...
First-order methods such as Gradient Descent, AdaGrad, SVRG, etc. dominate the landscape of optimization for machine learning due to their extremely low per-iteration computational cost. Second order methods have largely been ignored in this context due to their prohibitively large...
Classical statistical learning theory deals with finding a consistent hypothesis given sufficiently many examples. Such a hypothesis may be a rule for classifying emails into spam/not-spam, and examples are classified emails.
The duality theorem of linear programming and its extensions is a fundamental pillar of optimization. It lies at the heart of an elegant theory which allows us not only to reason about the structure of continuous optimization problems, but also design combinatorial...