In this day and age, influencers have a pretty big impact on what analytical approaches get used by data scientists. Social platforms are filled with posts recommending package X, Y, or Z. It seems almost certain, however, that few of those recommendations are by people who have actually used the packages, much less a collection of their alternatives.
When I published the little joke above about this situation (and probably got myself banned from Towards Data Science for life as a result), the reaction assured me that many people recognize how seemingly intractable, yet important, the issue is. A microcosm of larger societal information-curation challenges, no doubt.
The Echo Chamber Orchestra
Some influencers don’t code at all, never mind examine the packages they suggest others use. They simply see something posted and feel a need to inform their legions of followers.
That’s not always a bad thing. After all, if nobody ever hears about package X, it might never even enter consideration. Recommendations take the form of posts or Medium articles. Those might, in fairness, help someone get started with a particular package — and what’s not to love about soon-to-be-stale documentation?
And yet there’s also a tremendous echo-chamber effect, which I poked fun at in this petition. Publications like Towards Data Science have a mildly perverse incentive to boost articles that are completely uncritical, and, conversely, to bury critical ones. If you are under the impression that Towards Data Science performs a scientific curation service of any kind (or if you believe that the data science cult has maintained its roots in statistical methodology), let me quickly disabuse you of that notion with these examples:
- Time-Series Forecasting: Predicting Stock Prices Using Facebook’s Prophet Model
- Forecasting Stock Prices using Prophet
- How to Use Facebook’s Prophet and Why it is So Powerful
- Predicting Apple Inc. Stock Prices Using Facebook’s Prophet
- Forecasting Stock Prices using Prophet
- Time Series Modeling of Bitcoin Prices
- Prophet-able Forecasting
And one might go on.
Those articles are entirely uncritical and usually completely unconcerned with whether better alternatives exist. Often they seem equally heedless of suitability of application — as when seasonal forecasting methods are applied to the stock market.
Popularity and Proficiency
And then there is the random reinforcement with ChatGPT-equivalent intelligence (a bit unkind to ChatGPT actually). Recently I saw a “booster” post a list of “hidden gem” Python packages, all of which had hundreds and mostly thousands of stars. Not so hidden, it would seem?
There is also a huge dispersion in developer brashness versus shyness. I’ve come across packages with only three stars on GitHub that are completely invisible — their authors have better things to do than spruik them, so nobody else does either. Yet they can be really quite good. Here’s one with 14 stars in the optimization area.
The net result: the popularity of Python packages seems to be very weakly related to their efficacy — at least in the niches that I have taken the time to evaluate in a quantitative fashion. Of course, performance is sometimes hard to define, and yet the overall problem is not.
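To make the claim concrete, here is a toy sketch of how one might test it: compute a Spearman rank correlation between GitHub stars and out-of-sample benchmark skill. The star counts and skill scores below are entirely made up for illustration; nothing here is a real measurement.

```python
# Toy check of popularity vs. efficacy: Spearman rank correlation by hand.
# All numbers below are fabricated for illustration only.

def ranks(xs):
    """Return 0-based ranks of xs (no tie handling; fine for this toy data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho via the classic sum-of-squared-rank-differences formula."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

stars = [14, 9000, 350, 22000, 120]          # hypothetical GitHub stars
skill = [0.60, 0.65, 0.55, 0.72, 0.80]       # hypothetical benchmark scores

rho = spearman(stars, skill)                 # a weak correlation, as claimed
```

With these made-up numbers the correlation comes out weakly positive, which is the shape of the claim: popularity carries some signal about efficacy, but not much.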
One need only note that the most downloaded Python package for time series (by far) rests on a dubious generative model (as I discussed here at length), that empirical justification for it is completely elusive, and that the package author agrees with the critique. Sean Taylor recently penned something of a Prophet mea culpa in which he wrote:
I do actually feel a bit of shame about the project at this point — of course I wish it worked better … a large number of the citations on the Forecasting at scale paper are showing it underperforming other approaches. I see no flaws with these studies, Prophet is not a reasonable model in many settings.
Naturally it is absurd for Sean to take any blame. Anyone is free to evaluate open-source software themselves, and to improve it with pull requests. Authors offer no warranties.
Academic mores
There is an old method of dealing with poor recommendations, employed for centuries: the peer review system. And right now, through inaction, Towards Data Science is doing a great job of demonstrating its value.
Peer review has its issues — a long digression — but in any case, it does not scale to the blogosphere. It is entirely irrelevant, as far as the issue I have raised is concerned. In the Wild West it is left to individuals to self-enforce whatever long-forgotten codes of behavior might put the “science” back into “data science”.
But unfortunately nothing will stop influencers from recommending packages multiple times a day, without first using them, never mind benchmarking or offering analysis that goes beyond an echoing of claimed features.
Then there’s the how-to article whose author — I’ve accumulated plenty of evidence — fails to perform so much as a simple Google search before tapping out the advice. You can forget about more elaborate qualitative exercises, consideration of proximate work, or the academic practices of most kinds that have evolved to deal with entropy. That’s really the end of academic-inspired remedies, if there were ever any plans to lift them out of the academic setting.
It is natural that there should be some blurring of lines in data science writing, given that many people (myself included) find it useful to propose concepts in the early stages, seek feedback, and exchange ideas in a more fluid form than strict adherence to the paper mills demands.
To be frank, I’m mildly irritated by academic ceremony, and I’m mostly waiting for large language models to advance to the point where the entire layer of superficiality (chasing down a journal’s LaTeX templates, conforming to scholarly style, etc.) is simply blown to pieces. Then the conveying of ideas in peer-reviewed settings, including technical ones, will feel efficient. But that’s a minor quibble.
The more profound problem is that when the currency is eyeballs, and when there isn’t even the faintest regard for “science” and what that ought to mean, it isn’t even possible to have a conversation. My objection is not so much the practice of bad science, but of not even trying.
Consider for example an exchange I had with someone who entirely agreed with my view on the scientific merit of a particular methodology (or lack thereof, as it happens) but still wanted to push back on account of the fact that the software had other compensating features — such as clean documentation, ease of use, etc. Performing worse than a moving average was merely a point deduction for this judge, not a disqualification.
I’m not against docs or ease of use, but I find that position to be outrageous. You won’t get me onto a boat that isn’t fundamentally seaworthy just because it is well upholstered and has a fresh coat of paint. But that’s the way it is when algorithms are just handbags or other consumer items.
So, whether we like it or not, scientific merit and the popularity of methods have no strong reason to go together these days. They are, in fact, becoming increasingly unrelated. While statistical efficacy is still a weak signal when it comes to predicting usage, it is probably dwarfed by other exogenous features like FAANG-fetish.
Nature abhors a vacuum
How did we get here? Data scientists will not take their information solely from journals, and nor should they. Journals are often paywalled. They have other goals, of course, and few are intended as software manuals — the medium itself is not particularly well suited. Why would I shove documentation into an open-source journal, or any PDF, when I can render LaTeX directly on GitHub? The latter is much more likely to stay in sync with the code.
It’s no surprise that data scientists reject journals and look elsewhere for educational resources and quick explanations of open-source code, including algorithms therein and their empirical or theoretical properties. Or they choose to follow people they hope will provide cool things to look at. There’s nothing wrong with engineers choosing their own path.
And yet the matter of open-source code might be a point of regret for some editors who look upon the dumpster fires like Towards Data Science today and wonder what has become of their statistical or applied mathematical field. Could they have done more? Might they have better served the new tinkering, eager community? Might they have recognized the urgency of industry? The desire to build something in an afternoon?
Could journals have enforced the provision of open-source algorithms much earlier? It is easy to criticize with perfect hindsight (and hypocritical in my case). Researchers have been exhorted to make their code public for 40 years or more, however, and it was a slow take. Academics mostly keep code to themselves, away from other researchers who they fear might steal their idea, or take credit for a minor derivative work.
Although it was popularized in the 1990s (The Cathedral and the Bazaar comes to mind), the argument for open source goes back further. Search for An Invitation to Reproducible Computational Research and you’ll be led back to Claerbout and Karrenbach’s 1992 paper on its virtues.
Interestingly, the code-hoarding is probably against self-interest, a kind of behavioral market inefficiency in scholarship; call it privacy bias. But Claerbout’s advice to David Donoho was taken to heart, and Donoho went on to become one of the most-cited authors in science. (He is also very smart in other ways, but might there be a lesson there?) Donoho’s view:
An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.
Despite examples like WaveLab, the lack of a strong academic norm left a void: a relatively small and awkward supply of haphazardly provided, half-provided, or reluctantly provided combinations of code and mathematics, where there was rarely a genuine desire to help the reader get on with the job (as Mesnard and Barba discovered, for example).
Much later, initiatives such as Papers with Code came along, and they certainly help, providing an indirect incentive since authors presumably wish to be indexed therein. But coverage of time series, to pick one example, is particularly spotty at present.
So, for the most part, if we choose to take advantage of the benefits of peer curation, we are still left glancing at papers, searching them for keywords like “github” or “open-source”, and trying to discern whether the paper’s author has a genuine interest in providing us something useful. (Honestly, if a paper doesn’t contain the words “github” or “code” I’m unlikely even to scan it these days, however profound the title might sound.)
Influencer Elo ratings?
And that takes us back to the echo chamber — the vast expanse of unreviewed statistical writing. There’s nothing evil about it. But it is a mess, and it seems to me it could be marginally improved by influencing the influencers just a little. Rules and norms don’t work so well, according to Jimmy Wales, but rewards and penalties sometimes do.
Right now there’s no penalty for facile recommendation, but could there be? Along those lines I recently threw out the idea of influencer Elo ratings. It sounds simple in principle, though I admit there are some challenges. The idea is that we tie recommendations to actual benchmarking exercises, the same way we tie analysts to the subsequent performance of the stocks they recommend.
It could be done now, even in real-time, in a couple of small domains.
So, for instance, if influencer A suggests that all their followers try out NeuralProphet because it has a list of seemingly nice qualities, whereas influencer B suggests pydlm, then the next time the former faces off against the latter in a benchmarking match, influencers A and B will also have their own personal Elo ratings updated.
It would have to be done carefully, though, because recommendations might be application-specific. Perhaps a minimax approach would work: compute the maximum rating a package achieves on any test, then, for each influencer, the minimum of those maximum ratings across the packages they have recommended. That would at least pick out the people who tout complete junk.
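A minimal sketch of how such a scheme might run, purely under the assumptions above: standard Elo updates after each benchmarking “match”, then a minimum-of-maximums score per influencer. The package names, match outcomes, and starting ratings are all hypothetical.

```python
# Hypothetical "influencer Elo" scheme (no such system exists yet).
# Packages gain/lose Elo in head-to-head benchmark matches; an influencer's
# score is then the minimum, over packages they touted, of each package's
# best-ever rating.

def elo_update(r_winner, r_loser, k=32):
    """Standard Elo update after one benchmark match."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# All packages start at 1500; track the best (max) rating each achieves.
ratings = {"neuralprophet": 1500.0, "pydlm": 1500.0, "junkpkg": 1500.0}
best = dict(ratings)

# Fictional match results: (winner, loser) per benchmarking exercise.
matches = [("pydlm", "neuralprophet"), ("pydlm", "junkpkg"),
           ("neuralprophet", "junkpkg")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    best[winner] = max(best[winner], ratings[winner])
    best[loser] = max(best[loser], ratings[loser])

# Minimax influencer score: min over recommended packages of best rating.
recommendations = {"influencer_A": ["neuralprophet", "junkpkg"],
                   "influencer_B": ["pydlm"]}
scores = {name: min(best[p] for p in pkgs)
          for name, pkgs in recommendations.items()}
```

In this toy run influencer B, who backed the package that keeps winning, ends up ahead of influencer A, who touted the junk; the minimum-of-maximums makes a single dud recommendation drag the whole score down, which is precisely the intended penalty.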
This would provide a small incentive for people to not uncritically recommend things. Though facing realities, it seems to me that a more promising approach — for now — is feeding influencers a long list of benchmarking efforts and keeping the threat of influencer Elo ratings in the back pocket. At least that way, they can simply work through a list of approaches for which there is some evidence, somewhere, that there is actually some reason to believe in present or future utility.
I’m keeping the influencer Elo threat hanging out there, however, and if you’re a random or popularity-weighted booster of packages (i.e., the equivalent of a poorly guided ChatGPT session) I suggest you consult Elo ratings, leaderboards, or careful academic work with rigorous, ongoing, diverse benchmarking. Because if you don’t, you might end up at the bottom of a leaderboard yourself.
AI from the future will solve the problem
I’m the least of your worries, actually. What the LMGTFY influencers should really fear is the super-intelligent AI sent back from the future — just as in that movie. Don’t believe me? Let’s switch from packages to the market for a moment, and hold my beer.
[Screenshot: ChatGPT exchange]
Notice the deliberate grammatical error to slip past some gaslighting … stick with me and you’ll learn a few things, grasshoppers :-)
[Screenshot: ChatGPT exchange]
I will not bore you with a long chain of partially computer generated prompts, but I think you get the drift. I was able to get past the coyness pretty easily and ultimately write a loop to extract Cramer’s general positioning over the course of his amazing tenure as influencer.
Large language models are noisy, just like Jim, and sometimes flat out false. But ask a few different ways, with different lead ups, and it is still a decent statistical metric — certainly better than nothing at all. I had little problem reproducing, more or less, the results of academic studies of Cramer — which you can ask ChatGPT about yourself.
I further point out that ChatGPT, or its API kin, are also perfectly capable of doing the calculations too — computing various measures of alpha or correlation between future outcomes and opinions expressed, for instance, and so on. And yes, Cramer is famous so this is easy. But in the future, LLMs will be able to extract the views that anyone has had, on just about anything, large or small — and form quantitative longitudinal opinions.
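For example, once an LLM has been coaxed into emitting a time series of bullish/bearish calls, scoring the pundit is elementary statistics. The calls and subsequent returns below are fabricated purely for illustration:

```python
# Scoring an extracted opinion series against subsequent outcomes.
# calls[t] is +1 (bullish) or -1 (bearish) as extracted by an LLM;
# next_returns[t] is the asset's return over the following period.
# Both series below are made up for illustration.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

calls        = [+1, -1, +1, +1, -1, +1]
next_returns = [0.02, 0.01, -0.03, 0.01, -0.02, 0.00]

# Fraction of calls pointing the same way as the subsequent move.
hit_rate = sum(1 for c, r in zip(calls, next_returns) if c * r > 0) / len(calls)
corr = pearson(calls, next_returns)
```

A hit rate of one half, as in this toy series, is exactly what a coin flip delivers; the point is that once the opinions are warehoused as data, the accountability calculation takes a dozen lines.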
It is easy to train them on sub-domains, after all. It is relatively easy to warehouse your own corpus of opinions, even if sites like LinkedIn aren’t going to make it completely trivial (hmmm, who should I ask about writing scraping code?). Am I currently warehousing opinions spouted by people about the tiny fraction of science I know a tiny bit about? Or am I bluffing? Ask yourself, punk, do you feel lucky today?
The future is coming for all you people who think there is no accountability for your bullish or bearish opinions on this and that, be it software or technologies, LLMs, blockchain, quantum computing or what have you.