Yet Another ICML Award Fiasco

Disclaimer: I deliberated extensively on whether writing this blog post was a good idea. Some kind of action was necessary because this award was just too unfair. I consulted with many senior people in my field. Ultimately, I concluded that this path was the most appropriate one to take because critiques are an integral part of any serious scientific discourse.

1. Who Am I?

Hi, in case you don’t know me, let me first introduce my academic self: I am an Associate Professor and I have been working on parameter-free optimization for 10 years. Indeed, I single-handedly founded this topic of research. That’s right, this topic did not exist before my papers! Nothing like proposing ConvNets, but still the thing I am most proud of in my academic life.

You don't know what parameter-free optimization is? Well, if you have ever used Stochastic Gradient Descent, you know how painful it is to set the learning rate. Too small? It will take forever to converge. Too large? It will diverge. Why is it so difficult? For non-smooth functions, the optimal learning rate should be proportional to the distance between the initial point and the optimal one, but of course, this distance is typically unknown. Without knowing this distance, any optimization algorithm is provably suboptimal. Parameter-free optimization algorithms are defined as the ones that achieve the optimal (up to logarithmic factors) convergence rate with respect to this unknown distance for Lipschitz convex functions, in the deterministic, stochastic, and adversarial settings. They are called "parameter-free" because often, but not always, they achieve this by completely removing the learning rate. Note that this is not only a theoretical topic: parameter-free algorithms do work! From winning Kaggle competitions to large-scale linear predictors, these algorithms have been proven multiple times to work very well in practice. Are they the state of the art on LLMs? No, but we are actively working to make it happen!
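
To see where that distance comes from, here is the standard calculation (a textbook bound, nothing specific to any of the papers discussed here). For a convex, $G$-Lipschitz function $f$, $T$ steps of subgradient descent with a constant learning rate $\eta$ guarantee

$$f(\bar{{\boldsymbol x}}_T) - f({\boldsymbol x}^\star) \leq \frac{\|{\boldsymbol x}_0-{\boldsymbol x}^\star\|^2}{2\eta T} + \frac{\eta G^2}{2},$$

where $\bar{{\boldsymbol x}}_T$ is the average of the iterates. Minimizing the right-hand side over $\eta$ gives $\eta^\star = \frac{\|{\boldsymbol x}_0-{\boldsymbol x}^\star\|}{G\sqrt{T}}$ and the optimal rate $\frac{G\|{\boldsymbol x}_0-{\boldsymbol x}^\star\|}{\sqrt{T}}$: the best learning rate is proportional to exactly the distance you do not know.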

In 2013, exactly 10 years ago, I started this topic. That is, I showed that you can design an algorithm completely without learning rates and still obtain optimal performance, in theory and in practice. Of course, I didn't do it completely alone: I was standing on the shoulders of two giants; see this seminal paper by Streeter and McMahan.

Magic? No, just a bit of math, not even particularly difficult. The key idea is that the algorithm can estimate on the fly the distance between the initial point and the optimal solution, and use it to decide the optimal steps to take. This is done implicitly in the dual, by observing the (stochastic) gradients: if they are aligned, we are far from the solution. Moreover, the algorithm can change its mind in case the noise in the stochastic gradients fooled it, decreasing the length of its jumps if necessary. People also do much more with this basic idea, like differentially private SGD without learning rates, connecting this idea to concentration inequalities, parameter-free sampling, etc. It took us some time, but by now we know essentially everything about designing SGD without learning rates. The theory is so clear that we even gave a tutorial at ICML 2020 on parameter-free online convex optimization, and it takes only a few slides to explain the key ideas.
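
To make the mechanism concrete, here is a minimal 1-D sketch of one such construction, a coin-betting scheme of the kind covered in the tutorial (a Krichevsky-Trofimov bettor). The function names and the unit Lipschitz bound are illustrative assumptions, not anyone's production code:

```python
def kt_parameter_free(grad, x0=0.0, eps=1.0, T=1000):
    """Sketch of a 1-D parameter-free update via coin betting.

    Assumes |grad(x)| <= 1. `eps` is the initial "wealth", not a learning
    rate: the final guarantee depends on it only logarithmically, so
    there is nothing to tune.
    """
    wealth = eps   # money available to bet
    sum_c = 0.0    # running sum of the "coin" outcomes c_s = -g_s
    x = x0
    for t in range(1, T + 1):
        beta = sum_c / t          # betting fraction, always in (-1, 1)
        x = x0 + beta * wealth    # the iterate is a signed bet around x0
        g = grad(x)
        wealth -= g * (x - x0)    # gain/lose money based on the gradient
        sum_c -= g                # aligned gradients grow the fraction:
                                  # "we are still far from the solution"
    return x
```

The wealth implicitly tracks the unknown distance: a long run of aligned gradients makes the bets, and hence the steps, grow exponentially, while sign flips shrink them back.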

By now, I have written roughly 15 papers on this topic, and I also teach it in my Online Learning class every year.

Soooo, I happen to know quite a bit about this topic.

Let me also add that I am a bit obsessed with the history of Science: credit is essentially the only thing researchers get for what they do, so I am very passionate about always crediting the right people. You can find me on Twitter constantly bitching about missing citations, or you can read my previous blog posts, where I always try to trace the exact history of the ideas.

2. What is D-Adaptation?

Now that you know a bit about me and parameter-free optimization, let's move to the content of this post: the ICML Outstanding Paper Award to the D-Adaptation paper. What is D-Adaptation? It is a new method to obtain parameter-free rates in the deterministic setting only. It is largely based on the theory developed by Carmon and Hinder (COLT 2023) (IMO, a breakthrough paper).

The first thing one notices about the D-Adaptation paper is that it is presented as a new problem. In fact, there is not a single word on the problem of designing parameter-free algorithms (which is exactly what they describe) until halfway through the paper. Then, they do mention previous parameter-free results, but they still do not say that the problem itself is a known and very studied one. This is particularly weird given that there was a tutorial on this topic at ICML, the same conference where this paper was submitted. Even weirder is the fact that their Figure 1 is basically the same one we used to explain parameter-free algorithms in our tutorial. Why do I care about this? Such a writing style can easily induce the average reviewer/Area Chair/Senior Area Chair to think that this is a new topic and happily ignore the possible presence of previous results.

Now, let’s take a look at the result: They design AdaGrad-like algorithms that estimate on the fly the distance between the initial point and the optimal solution so that it does not require manual tuning of learning rates. Caveat: It only works in the deterministic setting.

Well, Nemirovski obtained a slightly better guarantee 10 years ago (!) with a very simple reduction. In fact, in 2013 I sent Nemirovski an email about my first parameter-free results. Nemirovski quickly (less than 2 hours between my email and his reply) designed a reduction to solve the deterministic parameter-free problem, even without knowledge of the Lipschitz constant (through normalized gradients). I explained his reduction (with his permission) in the ICML tutorial we gave in 2020 (slides 5-7 in part 2): essentially, you run a bunch of copies of GD, each one with a different learning rate and a number of iterations chosen so that the total number of iterations is roughly the same as using a single learning rate. Very easy. This will also work decently in practice because it is essentially a smart version of grid search. Nemirovski also pointed out that, given the existence of this simple reduction, this was definitely not a problem worth pursuing, adding that only the stochastic setting was worth consideration. Nemirovski, in a polite way, was essentially telling me: "This is trivial, I can solve it in this straightforward way. Kid, I hope you have something better!". This is not even an argument from authority: the deterministic setting makes this problem amenable to many, many, many easy solutions. So, it is easy to come up with tens of variants of Nemirovski's idea, some even marginally better, for example removing the need for a function-value oracle. Should we write a paper on each of them? Probably not.
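
To show just how simple this is, here is a sketch of the reduction in Python. This is my reading of the construction from the tutorial slides; the grid, the budget split, and all the names are illustrative choices, not Nemirovski's exact scheme:

```python
import numpy as np

def deterministic_parameter_free(f, grad, x0, total_iters=1000, num_guesses=10):
    """Run normalized GD once per geometric guess of the unknown distance
    D = ||x0 - x*||, splitting the iteration budget evenly, and keep the
    best point seen (the only place the function-value oracle is used).
    """
    budget = total_iters // num_guesses
    x_best = np.array(x0, dtype=float)
    f_best = f(x_best)
    for k in range(num_guesses):
        D_guess = 2.0 ** k                       # geometric grid over D
        x = np.array(x0, dtype=float)
        for _ in range(budget):
            g = grad(x)
            g_norm = np.linalg.norm(g)
            if g_norm == 0.0:                    # exact minimizer found
                break
            # normalized gradient step: no Lipschitz constant needed
            x = x - (D_guess / np.sqrt(budget)) * (g / g_norm)
            fx = f(x)
            if fx < f_best:
                x_best, f_best = x.copy(), fx
    return x_best
```

One of the guesses in the geometric grid is within a factor of 2 of the true distance, so one of the runs enjoys almost the optimally tuned rate, at the price of only a logarithmic factor in the total budget.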

But it gets worse: the parameter-free community discovered how to solve the same problem in the harder stochastic and adversarial settings 9 years ago. The authors do mention prior work in the related work section, but they never say explicitly that their results are way weaker because they hold only in the deterministic setting: if you know stochastic optimization, you know this is a huge difference! Also, let me mention that people smarter than me have these days way stronger results than the ones from 9 years ago, like this one and that one. So, the D-Adaptation results are literally old, besides holding in a very easy setting that admits many easy solutions.

However, is the paper substantially different from previous work? Unfortunately, no. Technically speaking, the idea in the D-Adaptation paper is the same as in all other parameter-free algorithms; it is only the specific way to estimate the distance that is different. This is due to regret-reward duality: any parameter-free algorithm will necessarily guarantee a certain growth rate of the dual function, exactly as in the previous work. However, their approach is fundamentally less robust than previous ones because they never decrease their estimate. In turn, their estimate is correct only if the convex inequality $\langle {\boldsymbol g}_t, {\boldsymbol x}_t-{\boldsymbol x}^\star\rangle \geq 0$ (where ${\boldsymbol g}_t$ is the stochastic gradient, ${\boldsymbol x}_t$ the iterate, and ${\boldsymbol x}^\star$ the optimal solution) holds at every time step, which is clearly false in the stochastic setting. This motivates their need for the deterministic assumption.
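
To spell out why the deterministic assumption saves them: with exact gradients, convexity alone gives

$$\langle {\boldsymbol g}_t, {\boldsymbol x}_t-{\boldsymbol x}^\star\rangle \geq f({\boldsymbol x}_t) - f({\boldsymbol x}^\star) \geq 0 \quad \text{for every } t,$$

while with stochastic gradients this holds only in expectation. Individual terms can be negative, so a distance estimate that is never decreased can lock in an overestimate.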

They also have another result, which they seem to claim as their main one: they prove that, asymptotically, they can shave a log factor. This is cute, but hardly useful because i) they give no clue as to when the asymptotic regime kicks in; ii) again, it holds only in the easy deterministic setting.

Let me summarize what we have so far: this is the "Outstanding Paper Award" at a top conference in machine learning, and the theoretical results are weak and/or known. To add context, there were at least 3 other papers on parameter-free algorithms (1, 2, 3) at the same conference, and I believe they all have deeper results than this one.

So, let’s now move to the empirical evaluation. It must be the case that they compared with the 10+ years of research on this topic and crushed all the other parameter-free algorithms: That would be an exciting discovery!

This is not the case.

The submitted version does not have a single parameter-free baseline. Allow me to stress this point: 10+ years of research on parameter-free algorithms, 1 tutorial, 1 guy who won a Kaggle competition using these algorithms, code in the major machine learning libraries, and not even a single parameter-free baseline in a parameter-free paper.

The situation was so painful to me that, after the acceptance, I literally begged Aaron: "Aaron, you are canceling 10+ years of research on this topic for absolutely no reason. Could you please not do that? I don't care which algorithm you pick for your comparison and I don't care if your algorithm is better, just add at least one baseline!". So, they added one parameter-free baseline, my algorithm COCOB. In their experiments, they beat COCOB, and that's actually great! But they also added the claim that COCOB is the only parameter-free algorithm for deep learning. This is blatantly false: the final aim of this field was always to solve deep learning problems. So, apart from COCOB, there are deep learning experiments here, in this other paper, and even in the tutorial in 2020, with code here and an explicit comparison to Adam. This information is also easily available: it is literally on the webpage of our tutorial.

Moreover, there is nothing in these algorithms (and even in this paper) that is specific to deep learning. In fact, the only thing "specific to deep learning" is the use of AdaGrad-style updates, which have been implemented in essentially all parameter-free papers, starting from the very first proof-of-concept algorithm by Streeter and McMahan.
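
For reference, an "AdaGrad-style update" just means per-coordinate step sizes scaled by the accumulated squared gradients, i.e., the textbook update

$$x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2}}\, g_{t,i},$$

which is agnostic to the model being trained.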

That said, let's take a closer look at the experiments. For each experiment, they use the learning rate decay schedule prescribed by best practices. Think for a moment about what this implies: someone already tried a bunch of learning rate decays and selected the best one. This step is not free! You are not really "learning-rate-free" if you depend on this step to obtain good results. Obtaining good results on known neural network architectures with known optimal settings is fine, but the real test is obtaining good results on something new, from scratch. That is where you check whether a parameter-free algorithm actually saves you time or not. This is why I like to stress that someone won a Kaggle competition with one of our algorithms: there you really see if an algorithm is useful or just yet another paper.

Moreover, as usually happens, the entire architecture was the end result of a tuning process that involved hyperparameters, learning rate decay, and network architecture, keeping the choice of the optimizer fixed to Adam. That is, best-practice settings are usually selected to make Adam shine. This is a well-known idea in the field, and one I first proposed on this very blog.
So, let's consider a setting where this effect is not present: the linear experiments that they show. Instead of plotting the training accuracy as they do (?), let's plot something more meaningful, like the training loss, the testing loss, and the testing accuracy, on those of their datasets that have a test set and more than 1k training points (because we are not in the 90s anymore).

In their paper, they always use a learning rate decay. Here, we will run all the algorithms with and without one. In fact, despite what they claim, we can safely and easily use a learning rate decay in parameter-free algorithms too: it is enough to multiply the gradients by the decay rate before passing them to the algorithm, a 1-line change in PyTorch (a sketch follows below). In some cases, this will even give you convergence of the last iterate on some non-convex problems. Once again, to be really fair, let's use a learning rate decay without hyperparameters, like the cosine decay, my favorite one and the most used one these days. Let's pick 100 epochs, again without any tuning, because if you tune you cheat. Let's also run my ancient parameter-free algorithm COCOB. Again, there are better variants of COCOB, but I am just proving a point, not chasing the best performance. This is what we get. (The algorithms ending with 'lr' use the learning rate decay.)
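
As promised, here is a sketch of that 1-line change in PyTorch. The training-loop scaffolding (model, loader, loss function) is generic and assumed, not taken from anyone's code; the only point is rescaling the gradients before the parameter-free optimizer sees them:

```python
import math
import torch

def train_with_gradient_decay(model, loss_fn, loader, optimizer, epochs=100):
    """Apply a cosine decay to the *gradients* instead of the learning rate,
    so that even a learning-rate-free optimizer can benefit from a schedule.
    """
    total_steps = epochs * len(loader)
    step = 0
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            # cosine decay factor, going from 1 to 0 over training
            decay = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
            with torch.no_grad():
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad.mul_(decay)  # the actual 1-line change
            optimizer.step()
            step += 1
```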

There is a lot going on in these plots, and I'll let everyone parse them on their own. But whatever you think of these results, I feel it is really difficult to argue for an award for the D-Adaptation results.

3. Stochastic Awards?

Overall, this award is one of the most unfair things I have witnessed in my academic life. I am super happy to see new parameter-free results, I am especially thrilled when they beat my algorithms, and everybody knows that I actively advertise new results by other people in this field. I do all this because I deeply care about this field! But not only is this known stuff, their results are even weaker than the previous ones! And it is not only me: the reaction of people in this field (i.e., the ones who should have judged this paper!) ranged from being "surprised" by this award to plainly considering it "particularly ridiculous". Like, deterministic setting? Asymptotic bound? What is going on? Are we 8 years back in time? Also, are we sure this paper is the best example we want to give to new generations of how to credit past work?

For me, there is no doubt that this award is a glaring mistake and a slap in the face of all the people who worked in this subfield.

Whose fault is it, then?

From my side, starting in December 2022 I communicated with the authors many times: in person, on Zoom, through emails, and on Twitter. I was even at the workshop at NeurIPS 2022 in New Orleans where they presented it for the first time, and I literally asked after the talk: "So, what is new with respect to prior work?" I am sorry to say that I didn't get a very precise answer. Essentially, I told them many times that this is a big topic, that they should study and compare with the previous papers, that what they did is essentially known, etc. Also, I did not push too hard because I was 99% sure the paper would be rejected, with the reviewers explaining how well known all of this was.

That said, people are completely free to write their papers the way they like, not study prior work, and not use baselines. So, to me, the people actually responsible for this fiasco are the ones in the reviewing pipeline of ICML. Everyone who looked at this paper and thought it was worthy of an award is clearly responsible, from the reviewers to the AC/SAC, from the Program Chairs to the members of the Award Committee. These people bear the majority of the blame because it was literally their job to prevent giving awards to known and weak results. And some of the problems I outlined above are so evident that even a 1st-year PhD student should have seen them. It would be fantastic if some of these people had the courage to come forward and just say: "I didn't know anything about this topic; I probably made a mistake."

However, this is hardly surprising. We all know the process is almost random; we repeat it each year like a mantra, and we even measured its randomness, twice. It is our best excuse when our papers are rejected, but never the reason our papers are accepted. We insist on relying on this very noisy process to decide what science is worth pursuing in machine learning. Even worse, we amplify the noise with "awards". Like, random noise at volume 11! F*ck Yeah!

This is also hardly the first time that awards have been given to questionable papers. Do you remember last year, when one of the awards went to a paper with wrong mathematical statements? There is an entire arXiv paper explaining what is wrong in that paper. Guess what happened after it? Nothing at all: not only does the paper still have the award, but the authors did not even update the arXiv version. Luckily, some people had the guts to say something. However, how is it possible that the ICML committee decided to do nothing? And, even more importantly, how is it possible that we as a community allowed this inaction?

To me, the lack of action from the ICML committee is just disrespectful. I was a co-chair of ALT 2023 and, together with my co-chair, I was ultimately responsible for the assignment of the awards. It was very stressful because we were fully aware of the responsibility, and we took a lot of time and consulted many people before making our decision. But mistakes can happen, and if the awarded papers were ever found to be wrong or to contain known results, that would be completely our fault; there is no way to sugarcoat it. So, I would own my errors and I would insist on immediately retracting the awards, and possibly the papers too. Moreover, I would assign the awards to other people. In fact, as I see it, this is first of all a matter of basic respect towards the other authors.

They will tell you that it cannot be done. Apparently, faulty journal papers are retracted all the time, and even awards to people can be retracted, but ICML paper awards cannot be. You can also write a response paper to a faulty journal paper, while there is no way to officially respond to an ICML award.

The only way for me to logically explain such a lack of action is that ICML, NeurIPS, and ICLR are now so big that plainly admitting errors in the reviewing system, and even worse in the assignment of awards, would make the system collapse. The reason is that admitting this kind of mistake would open the floodgates to too many complaints, which they know very well they cannot manage.

Moreover, probably nobody wants something like the reproducibility crisis to happen in the machine learning community, or an article in the New York Times titled "Critical AI Research Decided by a Lottery". So, it seems that we keep doing it, one mistake at a time, always without correction, and always praising how much better our community is than the others.

4. Now What?

Nothing beats the publicity that an award at a big conference can give, especially in our community, where people often do not go past the latest 2 years of prior work. Example: there is already a citation to a non-existent stochastic D-Adaptation (see eq. (11)), instead of a citation to prior work that actually works in the stochastic setting. Two mistakes in one citation.

Also, I can hardly wait for the wave of parameter-free papers focusing only on the deterministic setting and using this paper as the only baseline. In fact, I have already seen them on arXiv, and I expect them at the next conference.

Am I exaggerating? Consider that the wrong paper that got the Outstanding Paper Award last year already has 39 citations. Not bad for a paper that should have been rejected.

Let me state it very clearly: These mistakes can and should be fixed.

Moreover, we should prevent such mistakes from happening again. Are you sure you want another award lottery? I don’t.

I think we have a much better way to decide awards: given enough time, the community is usually capable of recognizing good work. So, here is my proposal: let's stick only to test-of-time awards, possibly one per topic, otherwise only LLM papers would win in the next few years. Sure, it is not perfect, but it cannot be worse than the current flawed system.

EDIT 10/22/2023
George Lan made me aware of the fact that he has a line of work on parameter-free optimization algorithms for Hölder smooth functions in the deterministic setting; see this and that paper. Note that his definition is different: they achieve optimal accelerated rates without knowing the smoothness constant. This is great work, but not to be confused with the paper in question. In fact, the oracle is different (they also assume access to function values), the rates are accelerated, and the functions are smooth. (Let me add that the different oracle seems to be the key to allowing stronger guarantees without the logarithmic term.)
So, absolutely nothing changes in the above post: the paper in question uses the same oracle and the same assumptions on the function as previous work. Also, there are no accelerated rates. The paper still contains a known result under stronger assumptions than previous ones. Moreover, under these premises, the deterministic regime is still too easy to be seriously considered.