The optimizer that outlived its proof
A case study on Adam
Quick note: Since this is my first post, I recommend reading the about page for this blog to get a sense for why I am writing this!
Adam is a popular deep learning optimizer that often just works, especially when training language models. In this post, I want to use Adam as a case study for a broader question: what is the value of deep learning theory when its guarantees do not cleanly explain the methods practitioners actually use?
The original Adam paper gave a convergence proof [1] that turned out to be flawed, and later work showed that Adam can fail to converge even in convex settings [2]. That same work proposed AMSGrad, a variant for which the convergence proof goes through [2]. The surprising part is that Adam did not disappear, and, as far as I know, AMSGrad is rarely used, even as a baseline. Adam, and especially AdamW [3], became one of the default ways people train modern neural networks. In fact, many practitioners are not even aware that the original Adam paper has a flaw in it. So what are we supposed to make of this? If the proof was wrong, why did the optimizer survive? And if Adam survived the failure of one of its theoretical justifications, was there any point to doing the theory in the first place?
The so-called instrumentalist answer is pretty compelling: Adam is a tool, and tools are justified by whether they work or not. The incorrect proof is embarrassing, but it does not erase the fact that Adam made training easier, worked well in many settings, and had reasonable defaults. I find this view hard to dismiss because deep learning is full of methods that were useful before they were understood. We found tricks and recipes, then tried to explain/refine them afterward. In that sense, Adam is a very clean example of something that happens all the time.
The instrumentalist answer also starts to feel incomplete once you leave the regime where you have already tested the method. “It works” tells us something about the experiments we have run. It says much less about what will happen when we change the model scale, data distribution, architecture, batch size, precision, learning-rate schedule, or training objective. It actually does not even say much about changing the random seed, though we can try to run more experiments to estimate the variance. In small experiments, maybe we can just try a few optimizers and pick the best one. At LLM scale, that gets expensive very quickly. We cannot test every plausible optimizer variant on every plausible frontier-scale run, so at some point we want more than a leaderboard result.
This is where the realist impulse comes in. Realists, in a philosophical sense, would want to know what Adam is actually doing. Is it adapting to coordinate-wise gradient scales? Is it smoothing noise? Is it changing the implicit bias of training? Is it acting like some crude geometry on parameter space? Which of these stories, if any, tracks the real training dynamics? And how do any of these dynamical properties relate to model performance? The practical success of Adam motivates these questions without answering them.
The realism-versus-instrumentalism lens is one I often use to understand what a given paper’s theory section is trying to accomplish. But the scope of Adam’s impact is not constrained to these few papers (and their theorems)…
Adam as a research programme
Another way to view the Adam saga, which I suspect many optimization researchers would find natural, is to place it inside the broader line of work on adaptive optimization. This is where Lakatos’s language of research programmes [5] feels useful, at least as a loose analogy. A research programme has some central commitments that people keep coming back to, plus a bunch of surrounding details that can be adjusted as people learn more. For adaptive optimization, the central commitment might be something like:
Modifying the gradient using prior gradient information before applying it to update the network is useful.
Adam is one way to instantiate this hypothesis: it chooses different effective step sizes for different parameters and scales the gradient accordingly. This idea is intuitive, easy to implement, and often works, but the details ultimately matter. Reddi, Kale, and Kumar showed that the exponential moving average in Adam could cause convergence failures [2]. Loshchilov and Hutter pointed out that L2 regularization and weight decay behave differently for adaptive optimizers, which led to the popular AdamW optimizer [3]. In practice, people also care about beta values, epsilon placement, warmup, clipping, schedules, batch size, and many other recipe details.
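To make the L2-versus-weight-decay distinction concrete, here is a minimal sketch of a single first step from zero optimizer state. It sets the loss gradient to zero so that only the decay term is visible. The function name, the hyperparameter values, and the use of bias correction with a small eps are my illustrative assumptions, not anything specified in [3].

```python
import math

def first_adam_step(theta, grad, lr=0.1, wd=0.01, beta1=0.9, beta2=0.999,
                    eps=1e-8, decoupled=False):
    """One Adam step from zero optimizer state (t = 1), coordinate by coordinate.

    decoupled=False: L2 regularization, wd * theta is added to the gradient
                     and therefore passes through Adam's normalization.
    decoupled=True:  AdamW-style decay, applied to the weights directly.
    """
    new_theta = []
    for x, g in zip(theta, grad):
        if not decoupled:
            g = g + wd * x                   # L2 term enters the gradient
        m = (1 - beta1) * g                  # first moment, starting from m_0 = 0
        v = (1 - beta2) * g * g              # second moment, starting from v_0 = 0
        m_hat = m / (1 - beta1)              # bias correction at t = 1
        v_hat = v / (1 - beta2)
        x_new = x - lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            x_new -= lr * wd * x             # decay bypasses the normalization
        new_theta.append(x_new)
    return new_theta

theta = [10.0, 0.1]          # one large weight, one small weight
zero_grad = [0.0, 0.0]       # zero loss gradient isolates the decay term
l2_theta = first_adam_step(theta, zero_grad, decoupled=False)
adamw_theta = first_adam_step(theta, zero_grad, decoupled=True)
```

In this toy setting the L2 version moves both weights by roughly the same absolute amount (about lr), because the normalization divides out the size of the decay term. The decoupled version shrinks both weights by the same multiplicative factor, 1 - lr * wd.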
This framing makes it easier to distinguish different kinds of progress. AMSGrad feels like one kind: theory found a failure mode and suggested a concrete repair. AdamW feels like another: it clarified a mismatch between weight decay and adaptive preconditioning that mattered in practice. Other optimization ideas feel more like recipe-building. They may be important, but it is harder to say what general lesson they are teaching us.
What went wrong with the proof?
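Adam’s update, ignoring bias correction and constants, looks like
\theta_{t+1} = \theta_t - \alpha \, \frac{m_t}{\sqrt{v_t}}
where \theta_t are the parameters, α is the step size, m_t is an exponential moving average of the gradients g_t, and v_t is an exponential moving average of squared gradients (all operations are coordinate-wise):
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2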
So each coordinate has its own effective learning rate. This is the whole point of Adam: if a coordinate has consistently large gradients, the denominator gets large and Adam takes smaller steps in that coordinate; if a coordinate has small or sparse gradients, Adam can take larger steps.
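As a rough illustration of the per-coordinate effective learning rate, here is a minimal sketch of the simplified update (no bias correction). The function names, the default hyperparameters, the small eps added for numerical safety, and the toy gradients are all my assumptions for the example.

```python
import math

def adam_update(theta, grad, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step (no bias correction); returns new lists."""
    new_theta, new_m, new_v = [], [], []
    for x, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # EMA of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g      # EMA of squared gradients
        x = x - alpha * m_i / (math.sqrt(v_i) + eps) # per-coordinate scaled step
        new_theta.append(x)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

def effective_lr(v, alpha=1e-3, eps=1e-8):
    """Per-coordinate effective learning rate alpha / sqrt(v)."""
    return [alpha / (math.sqrt(v_i) + eps) for v_i in v]

# Coordinate 0 always sees a large gradient, coordinate 1 a small one.
theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(100):
    theta, m, v = adam_update(theta, [10.0, 0.1], m, v)

lrs = effective_lr(v)
```

Here the gradients differ by a factor of 100, so the effective learning rates differ by about the same factor in the opposite direction: the large-gradient coordinate gets the smaller rate.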
The proof problem is that the convergence argument needs these effective learning rates to behave monotonically in the right direction. In AdaGrad, this is true [4]. AdaGrad accumulates all past squared gradients, so the denominator only grows. The effective learning rate only shrinks. That monotonicity is very convenient for the regret proof because certain weighted distance terms telescope cleanly.
Adam does something different. Since v_t is an exponential moving average, old gradients decay out of memory. If a coordinate has a large gradient and then stays quiet for a while, v_t can shrink. When v shrinks, the effective learning rate α / \sqrt{v} can grow. In other words, Adam can increase the effective learning rate in a coordinate because it has forgotten that the coordinate used to have large gradients.
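The difference in memory is easy to see numerically. The sketch below tracks the effective learning rate of a single coordinate under an AdaGrad-style running sum and an Adam-style moving average, on a made-up gradient sequence with one large gradient followed by quiet ones. The value beta2 = 0.9 is chosen only so that the forgetting shows up within a few dozen steps; the function name and eps are likewise illustrative.

```python
import math

def effective_lr_traces(grads, alpha=0.01, beta2=0.9, eps=1e-8):
    """Effective learning rates for one coordinate under both accumulators."""
    adagrad_sum, adam_v = 0.0, 0.0
    adagrad_lrs, adam_lrs = [], []
    for g in grads:
        adagrad_sum += g * g                           # sum of all past squares
        adam_v = beta2 * adam_v + (1 - beta2) * g * g  # old squares decay away
        adagrad_lrs.append(alpha / (math.sqrt(adagrad_sum) + eps))
        adam_lrs.append(alpha / (math.sqrt(adam_v) + eps))
    return adagrad_lrs, adam_lrs

grads = [10.0] + [0.1] * 50      # one spike, then a quiet stretch
adagrad_lrs, adam_lrs = effective_lr_traces(grads)
```

On this sequence the AdaGrad-style rate never increases, while the Adam-style rate ends up well above its value right after the spike.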
Reddi, Kale, and Kumar formalize this using a quantity that measures the change in the inverse effective learning rate [2]:
\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t}
Here V_t is the diagonal matrix of second-moment estimates and \alpha_t is the step size at step t. The original proof effectively needs \Gamma_{t+1} to be positive semidefinite. For AdaGrad-like methods this is natural, because the denominator grows over time, but for Adam, it can fail. The denominator can go down, so this matrix difference can have the wrong sign. Reddi et al. point out that the Adam proof erroneously assumes this positivity property.
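To see concretely how the sign can flip, here is a one-coordinate worked case. It takes the definition \Gamma_{t+1} = \sqrt{V_{t+1}}/\alpha_{t+1} - \sqrt{V_t}/\alpha_t from [2], and assumes a decaying step size \alpha_t = \alpha/\sqrt{t} and a coordinate whose gradient is exactly zero at step t+1. Both assumptions are mine, chosen to keep the algebra short.

```latex
% Adam, one coordinate, g_{t+1} = 0: the moving average simply decays.
\begin{align*}
v_{t+1} &= \beta_2 v_t, \\
\Gamma_{t+1}
  &= \frac{\sqrt{v_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{v_t}}{\alpha_t}
   = \frac{\sqrt{v_t}}{\alpha}\left(\sqrt{\beta_2 (t+1)} - \sqrt{t}\right),
\end{align*}
% which is negative exactly when \beta_2 < t/(t+1).

% AdaGrad-style running sum G_t = \sum_{i \le t} g_i^2 with step \alpha/\sqrt{G_t}:
% the inverse effective learning rate is \sqrt{G_t}/\alpha, and
\begin{align*}
\frac{\sqrt{G_{t+1}}}{\alpha} - \frac{\sqrt{G_t}}{\alpha} \ge 0
\quad \text{since } G_{t+1} = G_t + g_{t+1}^2 \ge G_t .
\end{align*}
```

Under these assumptions, for any fixed \beta_2 < 1 the quiet coordinate's term turns negative once t > \beta_2/(1-\beta_2), for example after about a thousand steps when \beta_2 = 0.999.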
So the problem is that the proof treats Adam as if it has the long-term memory of AdaGrad, while Adam’s design uses short-term memory. This is also why the counterexample is useful: it shows that the proof gap corresponds to a real failure mode. You can construct simple convex sequences where Adam keeps increasing the effective learning rate at the wrong times and drifts toward the wrong point.
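Here is a small simulation in the spirit of the convex counterexample in [2], as I understand it: a one-dimensional problem on [-1, 1] with linear losses, where the gradient is C on every third step and -1 otherwise, with β1 = 0, β2 = 1/(1 + C^2), and step size α/\sqrt{t}. Since C > 2, the average loss is minimized at x = -1. The specific values C = 3 and α = 0.5, the horizon, and the function name are my illustrative choices, and the AMSGrad branch is a simplified version that just keeps the running maximum of v_t.

```python
import math

def run_counterexample(T=3000, C=3.0, alpha=0.5, amsgrad=False):
    """Projected Adam (beta1 = 0) or AMSGrad on the periodic linear problem."""
    beta2 = 1.0 / (1.0 + C * C)
    x, v, v_max = 1.0, 0.0, 0.0
    for t in range(1, T + 1):
        # Gradient of f_t: f_t(x) = C*x when t % 3 == 1, else f_t(x) = -x.
        g = C if t % 3 == 1 else -1.0
        v = beta2 * v + (1 - beta2) * g * g
        v_max = max(v_max, v)                       # only used by AMSGrad
        denom = math.sqrt(v_max if amsgrad else v)
        x = x - (alpha / math.sqrt(t)) * g / denom  # beta1 = 0, so m_t = g
        x = max(-1.0, min(1.0, x))                  # project back onto [-1, 1]
    return x

adam_x = run_counterexample(amsgrad=False)
amsgrad_x = run_counterexample(amsgrad=True)
```

With these settings, Adam should end pinned at the wrong end of the interval, near x = +1, while the AMSGrad variant should drift to near x = -1. The rare large gradient is quickly forgotten, so the two small gradients that follow it get a larger effective step and more than undo it.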
There are later developments that make this story less binary. The argument in [2] essentially works like this: the user picks β1 and β2 (satisfying some constraint), and the proof then constructs a problem on which Adam with those hyperparameters does not converge. This is different from how we normally use an optimizer, where we first have a problem and then choose and tune the optimizer for it. In [6], the authors show that if you fix the problem first, Adam will converge to a neighborhood of critical points as long as β2 is large enough (and a few other assumptions are satisfied). So even here, the theory is not a cut-and-dried verdict like “Adam works” or “Adam fails.” Instead, it clarifies what the interesting question is: do we want an optimizer that survives adversarially chosen problems for fixed hyperparameters, or do we want to understand how to tune an optimizer for the problem we actually have?
What did theory do here?
The question I keep coming back to is: what kind of work did theory do in the case of the Adam optimizer? Many practitioners can and do use the algorithm without knowing any of this backstory about its theoretical justification. So was there any point to even including a convergence proof to begin with?
The theory helped identify failure modes, motivate variants, and give us language for asking better questions. It also made the claim precise enough to be wrong. A vague story like “Adam adapts learning rates and works well” is hard to refute or improve. A convergence argument, even a failed one, creates a target: here is what we thought Adam should guarantee, here are the assumptions under which we thought it worked, and here is where the argument breaks. That is a more modest role for theory than we sometimes advertise when trying to write papers and make talks, but it still seems worth taking seriously.
Here is another way to think about it. Every time you run a deep learning experiment, you are making many small bets on the right optimizer, data, architecture, etc. to use. You are usually also making one large bet on “your idea,” the novel part of this experiment. You usually have a reason to think the large bet is correct, but these small bets feel like just pesky details that could muddy the waters. Mathematical language around how to think about the optimizer, for example, makes it clear to you what bet you are actually making when you choose to use Adam instead of, say, AdaGrad. In the Adam case, it tells you that you are trading long-term memory in the effective learning rate for a more responsive, short-memory estimate of gradient scale. When or why is that useful? Well, that’s what tons of theory (or theory-adjacent) papers in simpler settings aim to understand…
So the lesson I take from Adam is that theory should do something more specific than add mathematical decoration to a useful trick. It should formalize a claim, expose a failure mode we did not see before, clarify which parts of the recipe matter, or tell us when success in one regime should transfer to another.
Along these lines, we may even begin to wonder what the point is of a convergence proof anymore. We don’t usually train models to convergence, and even if we do, when/how do asymptotics actually provide a meaningful window into finite-horizon behaviors? One of the newest optimization darlings, Muon, had no convergence proof in its original proposal, though people have since proven convergence results and used them to design new variants. This will be a topic that I return to later on, both in the context of optimization and in the context of scaling neural networks, where ideas like infinite width and depth have become commonplace despite their obvious impracticality.
Acknowledgements
Thank you (in alphabetical order) to Taylor Berg-Kirkpatrick, Angelica Chen, Tianyu Gao, Sayash Kapoor, Will Merrill, Smitha Milli, Naomi Saphra, Weijia Shi, Kaiyue Wen, Mengzhou Xia, Guangxuan Xiao, and Chunting Zhou.
References
[1] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR 2015.
[2] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. ICLR 2018.
[3] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. ICLR 2019.
[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
[5] Imre Lakatos. Falsification and the Methodology of Scientific Research Programmes. In Criticism and the Growth of Knowledge, edited by Imre Lakatos and Alan Musgrave. Cambridge University Press, 1970.
[6] Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam Can Converge Without Any Modification On Update Rules. NeurIPS 2022.

