Papers with Code
paperswithcode.com

It is weird to see Papers with Code on the front page of HN.
This site is the bread and butter of every Research Engineer and Scientist working in Deep Learning. You use the site almost every day.
Advanced learners also use the site regularly.
You would just think that "everyone knows" it and never think of sharing the site on HN.
I'm concerned that this "everyone knows" attitude is increasingly becoming a real social problem, unsolved by current technology - in fact, worsened by it.
Knowledge about a field transfers best by hands-on association with people who practice it. Before widespread IT, communities of practice were local and relatively homogeneous; so it was easy to share the essentials of a field quickly, and get newcomers up and running with best practices.
Nowadays, however, communities of practice are dispersed across the world, with members from very different backgrounds communicating through low-bandwidth channels, and we're flooded with so much information that it's difficult to ascertain what is essential and what is accessory.
It is much more difficult for an outsider to grasp the essential qualities of a field they want to enter, as there are usually no guides comprehensive enough to detail everything you need to know.
Why is it a problem? People should put in at least a minimum of effort to research what might interest them. Not everything has to be spoon-fed to people.
I've never found a subject that needed more than, say, 10 minutes of internet searching to know whether it's worth pursuing.
It was much harder before the web. I remember as a kid seeing books about C++ in the local shop, but even after looking inside I didn't understand what C++ was. Nowadays I would get my answer almost instantly.
> I'm concerned that this "everyone knows" attitude is increasingly becoming a real social problem, unsolved by current technology - in fact, worsened by it.
You couldn't possibly believe this if you were old enough to remember what preceded the internet.
Good lord, no, today is not worse than microfiche and card catalogs.
I was a young adult when the web became widespread, and the problem I'm talking about was milder: precisely because there was a shortage of documentation, what was available limited the number of topics you could learn about, and being flooded by different sources was less of a problem.
It was still possible to define a Library Science where books were classified by hand, not by some secret algorithm counting links as votes or learning and regurgitating a corpus of loosely related documents without understanding any of it. I.e., it was possible to make sense of information sources, and whatever you learned of a field came with a single consistent narrative. Nowadays, information gathering has become an exercise in picking and choosing unconnected fragments from which you must infer your own understanding.
In some ways you can still try to emulate the old way, by limiting yourself to a small set of publishers who try to compile and organize a small part of a field of knowledge - yet it is much more likely that the teachings of that source will be deeply contradicted by some other seemingly authoritative source, without a clear way to know which one you should rely upon, and with the whole exercise feeling like it provides an incomplete perspective.
I think you overestimate how many HN readers are "Research Engineers and Scientists working in Deep Learning".
He also overestimates the importance of that site for “Research Engineers and Scientists working in Deep Learning".
Let me guess - everyone - means /r/machinelearning and a curated list of people on Twitter?
I’m a DL researcher, I’ve known of this site for a few years, and while the original motivation behind it was good I personally never extracted much value from it. Usually googling the paper title or a model name plus ‘github’ and/or ‘pytorch’ will produce all relevant links to code.
“Bread and butter” for me is http://arxiv-sanity.com
A couple previous discussions for those interested:
https://news.ycombinator.com/item?id=19054501 (Feb 1, 2019) 411 points, 23 comments
https://news.ycombinator.com/item?id=23391934 (June 2, 2020) 304 points, 21 comments
Not everyone works in machine learning, which seems to be the only subject the site handles, and those who don't work in it may still be interested in hearing about the existence of such a website.
I’m curious, how does it fit your daily workflow as an engineer? Is it somewhere where you get the “news” for the day? Or do you use it for getting information relevant to your current work projects?
I never tracked it like news. I used it for two main things:
- Checking the state of the art (SotA) for a given problem. For some problems, 2-year-old solutions are still close to SotA; for others there is a huge difference. And if there is a huge difference - is it because of architecture and parameter tuning, or because of totally different architectures and training modes?
- Running code - to be used somewhere, or as a reference. Papers never have all the details, and they do not compile.
Context: I used to work in the field, as a consultant. Though I do cite Papers with Code in one overview paper.
You use it to find the code and data of a paper - since it also lists other implementations - to run additional baselines on ImageNet in order to appease reviewer #3 (who has no idea that your paper on convex optimization has nothing to do with this, but it's easier to run them than to argue).
The pre-parenthesis part is dead serious; the parenthesis part is slightly hyperbolic due to accumulated trauma with bad reviewers.
It also gives an overview of the current state of the art for thousands of tasks and indexes current research by methods used, so you can quickly bootstrap research on a topic.
Another great resource is the HuggingFace model zoo. So many trained models easy to deploy.
While there's probably some value in posting the resources "everyone knows", Papers with Code was submitted multiple times in the past few years, which is a pretty common HN thing (whether this behavior is desired is up to dang and the community I guess).
I've never heard of it. Glad to have made its acquaintance.
After reading your comment, I now feel embarrassed as to why I haven't heard of this site.
Don’t worry, it’s not true.
Question -
I get that your run-of-the-mill paper saying "Here we present a novel algorithm for xyz" will usually have the algorithm defined in simple pseudo-code, maybe with an implementation in a "real" language as a proof of concept.
But for the many papers describing novel ML models, how does that work? They seem to use images that diagram out the different layers of the model. But is that truly "universal" the way that a pseudo-code algorithm is universal? As in, if the authors use PyTorch (or whatever), can I take the exact model they describe in their paper and apply it in MyFavoriteMLToolkit and achieve similar results?
I guess my question is, what are the "primitives" of papers describing ML models? Is saying "convolutional layer" enough, or do they also describe the dozens of hyper-parameters, etc?
So porting between ML frameworks was my job for a while, and the short answer is yes: common layers can be quite simple to describe and reproduce in different frameworks. E.g. "Conv2D(2,3)" is enough info, in code or text, to describe a 2D convolution layer with 2 outputs and a 3x3 kernel.
The longer answer is that the rest of the Conv2D configuration can then be easily overlooked, unless changed from the defaults. And those defaults can differ across frameworks and potentially break things, if they even exist in your preferred framework. You can always create custom layers, though, if needed.
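To make the "overlooked defaults" point concrete, here is a minimal sketch (my own illustrative example, with assumed channel counts, not from any particular paper) of declaring the "same" convolution in PyTorch and Keras; everything not spelled out in the call is a place where a port can silently diverge.

    import torch.nn as nn
    import tensorflow as tf

    # PyTorch: expects NCHW input, padding=0 by default,
    # weights initialised with Kaiming-uniform.
    pt_conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

    # Keras: expects NHWC input by default, padding="valid" (also no padding),
    # but weights initialised with Glorot-uniform and the bias handled by its
    # own default initializer.
    tf_conv = tf.keras.layers.Conv2D(filters=2, kernel_size=3)

Neither call is wrong, but training these two from scratch won't give identical curves unless you also align the data layout, initialisers, and any other defaults.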
But many papers also seem to do a bad job of describing the actual structure of their own ML network. They can be vague, confusing, or simply inaccurate. That can be because the paper describes a general concept with flexible details, or because the authors struggle to put their model into clear words and diagrams. Or simply because they know the code is going to do the heavy lifting.
It's a good question which might yield a very complex answer depending on how far down the rabbit hole of reproducible science/computation/machine learning you're willing to go.
To keep things simple, I'd say the true "primitives" of ML models can be reduced to mathematical formulas. For example, a plain old feed-forward network is implemented as matrix multiplication. Sprinkle in a bit of calculus to analytically derive the formula for back-propagating errors (aka training), and you have the basic building blocks of modern deep learning. Convolutions, Transformers, etc. are just slightly fancier spins on the same mathematical foundations.
Hyper-parameters are essentially tunable variables in a formula. I'd say your instinct is spot on - they are absolutely necessary to capture for reproducible results.
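As a toy illustration of that reduction (a minimal NumPy sketch with made-up data, not taken from any paper): the forward pass of a single linear layer is one matrix multiply, the update rule comes from the calculus, and the learning rate is exactly the kind of hyper-parameter you have to report for anyone to reproduce the run.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 4))        # 32 samples, 4 features (toy data)
    y = rng.normal(size=(32, 1))        # regression targets
    W = 0.1 * rng.normal(size=(4, 1))   # weights of one linear layer
    lr = 0.01                           # hyper-parameter: learning rate

    for step in range(100):
        pred = X @ W                    # forward pass: matrix multiplication
        err = pred - y
        grad = 2 * X.T @ err / len(X)   # analytic gradient of mean squared error
        W -= lr * grad                  # gradient descent update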
If you have the code and the data the answer should be yes. You should be able to take that PyTorch code and translate it to MyFavoriteMLToolkit to obtain numerically identical results.
In practice, we face the same universal difficulties as other computer science based research: fighting inconsistencies in software, hardware, all the way down to the physics of the universe with cosmic ray induced bit flips, etc.
> But for the many papers describing novel ML models, how does that work? They seem to use images that diagram out the different layers of the model. But is that truly "universal" the way that a pseudo-code algorithm is universal? As in, if the authors use PyTorch (or whatever), can I take the exact model they describe in their paper and apply it in MyFavoriteMLToolkit and achieve similar results?
Generally, yes.
If they are standard, well-known layers that exist in both PyTorch and TF, you can take a paper that was implemented in one, implement it in the other, and expect similar results (assuming you know a reasonable number of details[1]).
If they are non-standard layers it can be hard. There are lots of details that you need to port and even with access to the source code it can be easy to miss things.
[1] Here's an example of how things are implemented differently - you can still get the same result, but you need to know what you are doing: https://stackoverflow.com/questions/60079783/difference-betw...
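For a concrete flavour of what "implemented differently" can mean (an assumed, commonly cited example - I'm not claiming it's the exact one behind the link): TF/Keras' padding="same" pads asymmetrically when the stride doesn't divide the input evenly, while PyTorch's padding argument always pads symmetrically, so a naively ported strided convolution can come out shifted.

    import numpy as np
    import tensorflow as tf
    import torch

    x = np.arange(36, dtype="float32").reshape(1, 6, 6, 1)      # NHWC

    # Keras: "same" padding with stride 2 on a 6x6 input pads only the
    # bottom/right edge.
    k = tf.keras.layers.Conv2D(1, 3, strides=2, padding="same",
                               use_bias=False, kernel_initializer="ones")
    y_tf = k(x)

    # PyTorch: the closest-looking equivalent pads one pixel on every side.
    c = torch.nn.Conv2d(1, 1, 3, stride=2, padding=1, bias=False)
    torch.nn.init.ones_(c.weight)
    y_pt = c(torch.from_numpy(x).permute(0, 3, 1, 2))           # NCHW

    # Same output shape, different values - the kind of silent mismatch
    # that shows up when porting.
    print(y_tf.numpy().squeeze())
    print(y_pt.detach().numpy().squeeze())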
In my experience there are many less significant hyperparameters that can impact performance when going from the released code to your personal favorite framework.
Nothing you can't figure out by reading source code of the two frameworks or by reading the documentation closely.
Generally, people don't seem to care about reproducing exact metrics - as long as it is close enough they're happy. You need to dig a bit deeper if you want the full quality.
>But is that truly "universal" the way that a pseudo-code algorithm is universal?
My experience has been that pseudo-code is anything but universal.
In fact, having had to implement actual working code from research papers' pseudo-code many times, I would posit that pseudo-code is nothing but a license for academics to hand the reader stuff that simply doesn't work and get away with it. Thanks to pseudo-code, they get to gently skip over the hard bits and get the paper out the door as quickly as possible.
Papers with actual, git-clonable, working code should be the standard for CS academic publishing.
It depends. Usually a paper doesn't have enough room to mention all of the possible choices in preprocessing, architecture, optimiser, etc. You can usually get pretty close with details just in the paper, but it's not always possible.
That's why a large number of journals now have requirements for publishing code and/or pretrained models (if applicable).
An annoying trend I've noticed in a number of SotA ML papers in video classification: they present multiple models but only publish the exact architecture & weights for the smaller models, which are only as good as SotA (see Tiny Video Networks, X3D for examples).
I implemented the Wang–Müller algorithm, described it, and embedded the code in the PDF, along with the tooling to generate the paper's example diagrams (and the whole paper). Everything is in the PDF[1].
Arxiv.org won't accept a PDF with attachments though, so only a stripped-down version will end up there (once/if I get an endorsement, fingers crossed).
I copied this concept from Joe Armstrong, who suggested distributing Erlang modules as PDFs with the code files (*.erl) as attachments: "Documentation comes first, and the distribution should prioritize humans".
[1]: See Section A.1 of https://github.com/motiejus/wm/blob/main/mj-msc-full.pdf
I don’t get it, why not just include a link to github in your pdf?
Links to external sites have a tendency to rot.
I've stumbled upon a number of scientific papers from 2000s that include links to sourceforge for code listings. Most of those are dead now.
GitHub will not be there forever.
This is a great site. It's pretty ML-focused, which lands a bit outside my interest range; does anyone know of a similar site that has papers from CS as a whole?
Not exactly what you need, but this page lists best paper awards (with links) from all major conferences:
https://jeffhuang.com/best_paper_awards/
And here's PapersWeLove Repo with similar sauce
Yes. I just resubmitted Papers We Love. Submission activity has dropped off, but it's still a Scrooge McDuck's embarrassment of riches.
Glad you like Papers with Code. Please check [1] for the list of scientific domains we currently support and [2] for CS in particular.
What are your interests? This one has some extra content like company blogs and conferences, though it's still AI-centric: https://www.zeta-alpha.com/
Amazing website. As some noted in comments, widely used in AI research community, but I expect this website will be useful to the broader developer community as well!
Transformer-based architectures and unsupervised pre-training are achieving state-of-the-art results across multiple modalities including NLP, CV, speech recognition, genomics, physics, etc. - so here's my must-read list of recent papers on these topics (along with some of my notes). Happy holidays!
[1] Attention Is All You Need (2017) https://paperswithcode.com/paper/attention-is-all-you-need
Introduced the Transformer architecture and applied it to NLP tasks.
[2] The Annotated Transformer (2018) https://nlp.seas.harvard.edu/2018/04/03/attention.html
An “annotated” version of [1] in the form of a line-by-line Pytorch implementation. Super helpful for learning how to implement Transformers in practice!
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://paperswithcode.com/paper/bert-pre-training-of-deep-b...
One of the most highly cited papers in machine learning! Proposed an unsupervised pre-training objective called masked language modeling; learned bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Bonus: https://nlp.stanford.edu/seminar/details/jdevlin.pdf
See the above slideshow from the primary author, noting the remarkably prescient conclusion: "With [unsupervised] pre-training, bigger == better, without clear limits (so far)"
[4] Conformer: Convolution-augmented Transformer for Speech Recognition (2020) https://paperswithcode.com/paper/conformer-convolution-augme...
Proposed an architecture combining aspects of CNNs and Transformers; performed data augmentation in frequency domain (spectral augmentation).
[5] Scaling Laws for Neural Language Models (2020) https://paperswithcode.com/paper/scaling-laws-for-neural-lan...
Arguably one of the most important papers published in the last 5 years! Studies empirical scaling laws for (Transformer) language models; performance scales as a power-law with model size, dataset size, and amount of compute used for training; trends span more than seven orders of magnitude.
[6] Language Models are Few-Shot Learners (May 2020, NeurIPS 2020 Best Paper) https://paperswithcode.com/paper/language-models-are-few-sho...
Introduced GPT-3, a Transformer model with 175 billion parameters, 10x more than any previous non-sparse language model. Trained on Azure's AI supercomputer, with training costs rumored to be over 12 million USD. Presented evidence that the average person cannot distinguish between real and GPT-3-generated news articles that are ~500 words long.
[7] CvT: Introducing Convolutions to Vision Transformers (May 2020) https://paperswithcode.com/paper/cvt-introducing-convolution...
Introduced the Convolutional vision Transformer (CvT) which has alternating layers of convolution and attention; used supervised pre-training on ImageNet-22k.
[8] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Oct 2020) https://paperswithcode.com/paper/pushing-the-limits-of-semi-...
Scaled up the Conformer architecture to 1B parameters; used both unsupervised pre-training and iterative self-training. Observed through ablative analysis that unsupervised pre-training is the key to enabling growth in model size to transfer to model performance.
[9] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2021) https://paperswithcode.com/paper/switch-transformers-scaling...
Introduced the Switch Transformer architecture, a sparse Mixture of Experts model advancing the scale of language models by pre-training up to 1 trillion parameter models. The sparsely-activated model has an outrageous number of parameters, but a constant computational cost. 1T parameter model was distilled (shrunk) by 99% while retaining 30% of the performance benefit of the larger model. Findings were consistent with [5].
[10] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (August 2021) https://paperswithcode.com/paper/prottrans-towards-cracking-...
Applied Transformer based NLP models to classify & predict properties of protein structure for a given amino acid sequence, using supercomputers at Oak Ridge National Laboratory. Proved that unsupervised pre-training captured useful features; used learned representation as input to small CNN/FNN models, yielding results challenging state of the art methods, notably without using multiple sequence alignment (MSA) and evolutionary information (EI) as input. Highlighted a remarkable trend across an immense diversity of protein LMs and corpus: performance on downstream supervised tasks increased with the number of samples presented during unsupervised pre-training.
[11] CoAtNet: Marrying Convolution and Attention for All Data Sizes (December 2021) https://paperswithcode.com/paper/coatnet-marrying-convolutio...
Current state of the art Top-1 Accuracy on ImageNet.
Thanks for this thoughtful list. I try not to flood my ML dev team with too much academic reading, but obviously some papers are too important to skip. Seeing another person's take on what's important helps me refine what I give to newcomers to get them up to speed.
The ViT paper doesn't make your list?
Good suggestion, it was tough to narrow down the list! Here is a link to the ViT paper in case others are interested [1].
According to the latest ImageNet standings [2], ViT appears to have slipped to second place in Top-1 Accuracy. CoAtNet-7 is the new leader, but only by a slight margin and at the cost of what appears to be a significantly larger model.
[1] Scaling Vision Transformers https://paperswithcode.com/paper/scaling-vision-transformers
[2] https://paperswithcode.com/sota/image-classification-on-imag...
That isn't the ViT paper, this one is https://paperswithcode.com/paper/an-image-is-worth-16x16-wor...