Anthropic's latest model was released today. Here's a first look at how it performs and what it feels like in Plotly Studio. We'll have a more detailed post once we put it through the full set of benchmarks, but for now here are our field notes.
We ran through five real-world, personal, and relatable use cases that allow for a wide range of analytical depth:
1. Housing Rent or Buy Analysis with FRED Data
2. SF Mayor Effectiveness via 311 Data
3. Workout Impact Analysis from Apple Health - See our previous deep dive on this analysis.
4. SF Water Temp Buoy Dashboard App - Examining the water temp in San Francisco for open water swimming
5. Iran Conflict Economic Impact via FRED
Remarkable World Knowledge
Others have commented that this model "feels big" in its world knowledge.
We noticed this right away: every step of the analysis presents much more detailed contextual information about the data.
For example, it presents station information about each buoy with more detailed location information.
I've written previously about LLMs' curious and remarkable knowledge of public datasets, and this release is no different. Zillow Research data is a dataset that I haven't seen any model surface before for this type of analysis.
Pauses Appropriately
Fable is positioned as a highly autonomous model that can work for hours or days at a time. From its knowledge card (emphasis mine):
Claude Fable 5 [...] It is suited for long-running, complex, and asynchronous tasks that previously required frequent human check-ins.
It is particularly strong at end-to-end work that would otherwise take a person hours, days, or weeks - taking on problems that are long-running, ambiguous, or highly multi-step. It executes well-scoped tasks with few mistakes, automatically self-correcting through verification loops, and ships with robust safeguards.
In data work, we want our agentic loop to recognize ambiguity and, if the ambiguity is consequential, pause and ask the operator for clarification rather than just plough ahead. See more in our essay about designing agentic analytic benchmarks. I have been concerned that the over-emphasis on autonomy and long-running tasks would end up in conflict with this behavior.
However, Fable handles ambiguity with grace and curiosity within Plotly Studio's agentic loop. It raises good questions and seeks clarification with appropriate context under the scenarios we presented.
Autonomy and Long Horizon Tasks
By default, the model still appears to work for up to about 10 steps (about 15 minutes) on any given data analytics task. This is a reasonable default, as it prevents cost overruns and unnecessary depth of analysis.
It can be steered toward longer-horizon analytics tasks if you tell it to work longer in open exploration or if you give it a more detailed specification of the analytics task.
However, I find that it is still difficult for it to really go into open exploration on its own, where it might come up with new questions and ideas as it works or follow different rabbit holes - as is common with exploratory data work. I suspect that this is RL-trained behavior to prevent the loops from going "off the rails".
Solid visualizations with room for improvement
I was delighted to see it build out these physiological subplots similar to what we designed by hand when working through the Apple Health data.
It does a better job at handling labels than I've seen previously as well.
But some of the charts it creates are pretty dense by default, and difficult to interpret at a glance:
But it takes direction well. I asked it to update the chart to display rolling averages and subplots instead of fixed aggregations and stacked bars and it had no problem:
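As a rough illustration of the difference between a fixed aggregation and the rolling average requested above, here is a minimal standard-library sketch. It is not the code Plotly Studio generated; the `rolling_mean` helper and the `daily_minutes` values are made up for the example.

```python
from collections import deque


def rolling_mean(values, window):
    """Trailing rolling mean; yields None until a full window is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be evicted by append(), so remove it from the sum.
            total -= buf[0]
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


# Hypothetical daily workout minutes (illustrative values, not real data).
daily_minutes = [30, 0, 45, 60, 0, 0, 90, 30, 45, 0]
smoothed = rolling_mean(daily_minutes, window=7)
```

A fixed weekly bucket produces one value per calendar week, while the rolling version produces one value per day once the first seven days are in, which is what makes the resulting line chart read as a trend rather than a set of bars.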
It defaults to clean line and bar charts, but if you ask it for a wider range of charts it does well across the Plotly visualization stack:
The Dash apps are very nice and clean as well:
The first shot of graphs and reports is remarkably good across the board. That said, it's still not uncommon to see a few visual quirks here and there that need follow-up, like large number formatting issues:
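For context on what a large-number formatting fix typically looks like, here is a small sketch that abbreviates values with K/M/B/T suffixes. The `humanize` helper is hypothetical and written for this post, not something Studio or the model produced. Plotly axes also accept d3-format strings through `tickformat`, which is often the simpler route for tick labels.

```python
def humanize(value, decimals=1):
    """Abbreviate a number with K/M/B/T suffixes, e.g. 1234567 -> '1.2M'."""
    suffixes = ["", "K", "M", "B", "T"]
    magnitude = abs(value)
    index = 0
    while magnitude >= 1000 and index < len(suffixes) - 1:
        magnitude /= 1000.0
        index += 1
    # Rounding can push a value like 999.96 up to 1000, so bump the suffix if needed.
    rounded = round(magnitude, decimals)
    if rounded >= 1000 and index < len(suffixes) - 1:
        rounded /= 1000.0
        index += 1
    text = f"{rounded:.{decimals}f}"
    if "." in text:
        # Drop trailing zeros so 2.0K reads as 2K.
        text = text.rstrip("0").rstrip(".")
    sign = "-" if value < 0 else ""
    return f"{sign}{text}{suffixes[index]}"


labels = [humanize(v) for v in (950, 1234567, -2500, 3_200_000_000)]
```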
Stronger Analysis
The analysis was notably stronger across the board. It handled time series and lagging correlations better. In the Apple Health analysis, it correctly identified the different regimes of training and the hiking-to-VO2 max correlation that we found when steering the sessions in Plotly Studio more manually.
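To make "lagging correlations" concrete, here is a minimal sketch of one common approach: shift one series against the other and compute the Pearson correlation at each lag. This is an assumption about the general technique, not a reconstruction of what the model ran, and the `driver` and `response` series are synthetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(vx * vy)


def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for each lag in 0..max_lag."""
    if len(driver) != len(response):
        raise ValueError("series must be the same length")
    results = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]
        y = response[lag:]
        results[lag] = pearson(x, y)
    return results


# Synthetic example: the response simply echoes the driver two steps later.
driver = [1, 3, 2, 5, 4, 6, 8, 7]
response = [0, 0, 1, 3, 2, 5, 4, 6]
by_lag = lagged_correlations(driver, response, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
```

On real data the peak is rarely this clean, and each extra lag shortens the overlapping window, so correlations at long lags rest on fewer points and deserve more skepticism.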
It does a better job at highlighting some of the core assumptions in the financial models as well:
The approach is strong as well: it produces solid plans and handles considerations like data size and server-side aggregation.
It demonstrates the capability to identify and investigate data quality issues autonomously during its analysis:
It also demonstrates the ability to self-correct its analysis as it works:
Nice backtesting:
Strong, neutral analysis:
Thorough deep dives into data quality issues that can really distort a story. Rather than trying to make a story out of bad numbers, Studio discovered the data quality issues and then investigated and surfaced them in its report.
Tone
In Plotly Studio, we steer the tone of the writing a fair amount in order to keep the analysis focussed on the data and remove some of the enthusiastic or sycophantic behavior. This model does a pretty good job of adhering to this, but we still see a fair amount of AI's "hard hitting journalistic" tone and tropes ("Here's the honest answer") in the reports that it creates, especially if you ask it to editorialize.
As an aside - Plotly Studio lets you control how much editorializing a write-up should provide, so you can create reports that just show the numbers.
Methodology
Surfacing the "Methodology" is a first-class feature in Plotly Studio, and this model does a fine and thorough job of reporting it.
Cost
A normal, reasonable session costs about $0.50-$2. Tonight's exploration cost about $80 in tokens. It was 5 sessions in parallel over about 2 hours, and I was definitely steering the analysis to go deep to see how far I could get it to work in a long-running autonomous mode. These sessions also generated many reports and full-fledged Dash apps - about 10-15 in all.
Stay tuned
We'll be exploring Fable more over the coming weeks as we tune our agentic loop to work better with the model and compare it more rigorously across our benchmarks.