TabPFN: Transformer Solves Small Tabular Classification in a Second (automl.org)
It's exciting to see a novel approach to applying NNets to tabular ML. Definitely need to call out the 1000-row limitation. It will be interesting to see if this approach stands the test of time. Other algorithms (SAINT cough, cough) made big claims, but AFAIK no one actually uses them. It's still an "XGBoost is all you need" world in tabular ML (unless you've discovered AutoGluon).
On the other hand, XGBoost tends to overfit on small datasets, so TabPFN is a great complement.
From https://twitter.com/FrankRHutter/status/1583410845307977733 :
> This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (better than hyperparameter-optimized gradient boosting in 1h). Current limits: up to 1k data points, 100 features, 10 classes. 1/6
[Faster and more accurate than gradient boosting for tabular data: Catboost, LightGBM, XGBoost]
Hey there! I am one of the authors on this paper. If there are any questions, I am happy to answer :)
You are very clear about the current limitations on data size, which I find refreshingly honest! How sensible do you find the idea of fine-tuning the model to a specific problem that has more than 1000 observations, by resampling the data (similar to bootstrapping) and retraining on the subsamples? As I understand it, one could fine-tune the algorithm that TabPFN has learned to the specific problem.
Many thanks also for open-sourcing your work and making the colab notebook, I've been playing around with that a bit.
Edit: spelling
We did try this a bit a while back, but did not get conclusive results. I expect you can bend it to perform better on larger datasets too, but exactly how, I can't say for sure. Bootstrapping is definitely a good candidate for this.
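For anyone who wants to experiment with that idea, here is a minimal bagging-style sketch (my own, not anything from the paper): fit several TabPFN models on random sub-samples of a larger training set and average their predicted probabilities. It assumes the open-source tabpfn package's TabPFNClassifier with its scikit-learn-style fit/predict_proba interface, numpy-array inputs, and that every sub-sample contains all classes.

    import numpy as np
    from tabpfn import TabPFNClassifier  # open-source tabpfn package

    def subsample_ensemble_predict_proba(X_train, y_train, X_test,
                                         n_members=8, subsample_size=1000,
                                         seed=0):
        """Average TabPFN predictions over several random sub-samples."""
        rng = np.random.default_rng(seed)
        probas = []
        for _ in range(n_members):
            # Draw a sub-sample small enough for TabPFN's ~1k-row limit.
            size = min(subsample_size, len(X_train))
            idx = rng.choice(len(X_train), size=size, replace=False)
            clf = TabPFNClassifier()
            clf.fit(X_train[idx], y_train[idx])
            probas.append(clf.predict_proba(X_test))
        # Simple bagging: average class probabilities across members.
        # Assumes every sub-sample saw all classes, so shapes match.
        return np.mean(probas, axis=0)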
Thanks. Looks very interesting!
My main observation, just looking at your example pictures, is that its closest competitor is Gaussian Processes, which I've long been a fan of.
Just looking at those pictures, GP and TabPFN appear very similar where there is data, but TabPFN is more willing to extrapolate while GP stays localised around the data (look at the top row, for example).
I can't decide whether that's a feature or a bug. I guess it's good to have a choice whether you want to show that you're uncertain in regions where you've never seen data before or be able to extrapolate on what you have seen.
Yeah, we also find this interesting. Arguably, though, GP is far worse in all metrics. Thus, it is not really our closest competition.
I'd be interested to hear what you see as your closest competition?
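For anyone curious about the GP-vs-TabPFN extrapolation behaviour mentioned above, here is a rough toy sketch of my own (not the paper's figure, and it assumes the tabpfn package's scikit-learn-style TabPFNClassifier): fit both models on a small 2-D dataset and compare their predicted probabilities far from the data.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.gaussian_process import GaussianProcessClassifier
    from tabpfn import TabPFNClassifier

    # Small 2-D toy problem, roughly in the spirit of the paper's figures.
    X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

    gp = GaussianProcessClassifier(random_state=0)
    gp.fit(X, y)
    pfn = TabPFNClassifier()
    pfn.fit(X, y)

    # Probe both models on a grid that extends well beyond the data.
    xs = np.linspace(-3.0, 4.0, 50)
    grid = np.array([[a, b] for a in xs for b in xs])
    p_gp = gp.predict_proba(grid)[:, 1]
    p_pfn = pfn.predict_proba(grid)[:, 1]

    # Per the observation above, the GP tends to fall back towards 0.5
    # far from the data, while TabPFN keeps extrapolating a more
    # confident boundary.
    far = np.abs(grid).max(axis=1) > 2.5
    print("GP     mean |p - 0.5| far from data:", np.abs(p_gp[far] - 0.5).mean())
    print("TabPFN mean |p - 0.5| far from data:", np.abs(p_pfn[far] - 0.5).mean())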
As a lazy non-ML simpleton, is there a simple explanation of its usage?
Would tabular classification usually refer to, say, extracting tabular data from a picture into text?
I tried googling and looking through the site but it wasn't obvious to me what this actually does.
No, it's meant for taking something like a CSV file and deciding which category each row belongs to. A common example: you have a CSV whose columns correspond to different features of flowers (e.g. number of petals, size of petals), and the output is the type of flower.
No, it's really just tabular CSV data, like a typical spreadsheet would hold. Deep learning rarely outperforms standard ML on these kinds of datasets.
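To make the flower example above concrete, here's a minimal sketch using scikit-learn's built-in Iris data in place of a CSV (assuming the open-source tabpfn package's scikit-learn-style interface):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    # Petal/sepal measurements as features, species as the class label.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()
    clf.fit(X_train, y_train)             # fit() mostly just stores the data
    preds = clf.predict(X_test)           # the transformer predicts in one pass
    print(preds[:5])                      # predicted species labels
    print(accuracy_score(y_test, preds))  # classification accuracy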
Doesn’t it being a NN automate some of the feature engineering?