TabPFN: Transformer Solves Small Tabular Classification in a Second (automl.org)
It's exciting to see a novel approach to applying NNets to tabular ML. Definitely need to call out the 1000-row limitation. It will be interesting to see if this approach stands the test of time. Other algorithms (SAINT cough, cough) made big claims, but AFAIK no one actually uses them. It's still an "XGBoost is all you need" world in tabular ML (unless you've discovered AutoGluon).
On the other hand, XGBoost tends to overfit on small datasets, so TabPFN is a great complement.
From https://twitter.com/FrankRHutter/status/1583410845307977733 :
> This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (better than hyperparameter-optimized gradient boosting in 1h). Current limits: up to 1k data points, 100 features, 10 classes. 1/6
[Faster and more accurate than gradient boosting for tabular data: Catboost, LightGBM, XGBoost]
Hey there! I am one of the authors on this paper. If there are any questions, I am happy to answer :)
You are very clear about the current limitations on data size, which I find refreshingly honest! How sensible do you find the idea of fine-tuning the model to a specific problem that has more than 1000 observations, by resampling the data (similar to bootstrapping) and retraining on the subsamples? As I understand it, one could fine-tune the algorithm that TabPFN has learned to the specific problem.
Many thanks also for open-sourcing your work and making the colab notebook, I've been playing around with that a bit.
Edit: spelling
We did try this a bit a while back, but did not get conclusive results. I expect you can bend it to perform better on larger datasets too, but exactly how, I can't say for sure. Bootstrapping is definitely a good candidate for this.
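For anyone who wants to experiment with that idea, here is a minimal bagging-style sketch (my own, not anything from the paper): fit several TabPFN models on random sub-samples of a larger training set and average their predicted probabilities. It assumes the open-source tabpfn package's TabPFNClassifier with its scikit-learn-style fit/predict_proba interface, numpy-array inputs, and that every sub-sample contains all classes.

    import numpy as np
    from tabpfn import TabPFNClassifier  # open-source tabpfn package

    def subsample_ensemble_predict_proba(X_train, y_train, X_test,
                                         n_members=8, subsample_size=1000,
                                         seed=0):
        """Average TabPFN predictions over several random sub-samples."""
        rng = np.random.default_rng(seed)
        probas = []
        for _ in range(n_members):
            # Draw a sub-sample small enough for TabPFN's ~1k-row limit.
            size = min(subsample_size, len(X_train))
            idx = rng.choice(len(X_train), size=size, replace=False)
            clf = TabPFNClassifier()
            clf.fit(X_train[idx], y_train[idx])
            probas.append(clf.predict_proba(X_test))
        # Simple bagging: average class probabilities across members.
        # Assumes every sub-sample saw all classes, so shapes match.
        return np.mean(probas, axis=0)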
Thanks. Looks very interesting!
My main observation, just looking at your example pictures, is that its closest competitor is Gaussian Processes, which I've long been a fan of.
Just looking at those pictures, GP and TabPFN appear very similar where there is data, but TabPFN is more willing to extrapolate while GP stays localised around the data (look at the top row, for example).
I can't decide whether that's a feature or a bug. I guess it's good to have a choice whether you want to show that you're uncertain in regions where you've never seen data before or be able to extrapolate on what you have seen.
Yeah, we also find this interesting. Arguably, though, GP is far worse in all metrics. Thus, it is not really our closest competition.
I'd be interested to hear what you see as your closest competition?
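For anyone curious about the GP-vs-TabPFN extrapolation behaviour mentioned above, here is a rough toy sketch of my own (not the paper's figure, and it assumes the tabpfn package's scikit-learn-style TabPFNClassifier): fit both models on a small 2-D dataset and compare their predicted probabilities far from the data.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.gaussian_process import GaussianProcessClassifier
    from tabpfn import TabPFNClassifier

    # Small 2-D toy problem, roughly in the spirit of the paper's figures.
    X, y = make_moons(n_samples=100, noise=0.2, random_state=0)

    gp = GaussianProcessClassifier(random_state=0)
    gp.fit(X, y)
    pfn = TabPFNClassifier()
    pfn.fit(X, y)

    # Probe both models on a grid that extends well beyond the data.
    xs = np.linspace(-3.0, 4.0, 50)
    grid = np.array([[a, b] for a in xs for b in xs])
    p_gp = gp.predict_proba(grid)[:, 1]
    p_pfn = pfn.predict_proba(grid)[:, 1]

    # Per the observation above, the GP tends to fall back towards 0.5
    # far from the data, while TabPFN keeps extrapolating a more
    # confident boundary.
    far = np.abs(grid).max(axis=1) > 2.5
    print("GP     mean |p - 0.5| far from data:", np.abs(p_gp[far] - 0.5).mean())
    print("TabPFN mean |p - 0.5| far from data:", np.abs(p_pfn[far] - 0.5).mean())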
As a lazy non-ML simpleton, is there a simple explanation of its usage?
Would tabular classification usually refer to, say, extracting tabular data from a picture into text?
I tried googling and looking through the site but it wasn't obvious to me what this actually does.
No, it's meant for taking something like a CSV file and deciding which category each row belongs to. A common example: you have a CSV whose columns correspond to different features of flowers (e.g. number of petals, size of petals), and the output is the type of flower.
No, it's really just tabular CSV data, like a typical spreadsheet would hold. Deep learning rarely outperforms standard ML on these kinds of datasets.
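To make the flower example above concrete, here's a minimal sketch using scikit-learn's built-in Iris data in place of a CSV (assuming the open-source tabpfn package's scikit-learn-style interface):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    # Petal/sepal measurements as features, species as the class label.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()
    clf.fit(X_train, y_train)             # fit() mostly just stores the data
    preds = clf.predict(X_test)           # the transformer predicts in one pass
    print(preds[:5])                      # predicted species labels
    print(accuracy_score(y_test, preds))  # classification accuracy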
Doesn’t it being a NN automate some of the feature engineering?