Why do Startups Power Perplexity Finance?


A common “party trick” I observed while compiling and maintaining my list of hedge-fund-oriented AI startups is a relatively basic demo. Users are presented with a chat bar; they type “What are NVDA’s revenues the past 5 years?” and the interface returns a bar chart or table with the data. This demo is so common that it forms the basis of some financial evaluation benchmarks for LLMs.

The demo is a red herring with an intriguing wrinkle. Accurately parsing SEC filings to retrieve this data is legitimately difficult. However, consuming structured financial data this way is impractical and unnecessary. In practice, quants use structured financial data feeds, while non-technical investors use interfaces like the Bloomberg Terminal. Pulling NVDA revenues in Bloomberg is far faster than any inference or parsing. So, while this demo showcases LLM capabilities, it is a poor pitch to customers. At the same time, it suggests that offline, batch-based extraction and cleaning of public domain data should become a commodity.
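
To make the contrast concrete, here is a minimal sketch of what the "party trick" query looks like once the data has already been extracted and cleaned offline. Everything in it is an assumption for illustration: the `AnnualFact` record, the `last_n_years` helper, the fictional tickers, and the invented values are not taken from any vendor's feed or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnualFact:
    """One cleaned, structured data point (hypothetical schema)."""
    ticker: str
    fiscal_year: int
    metric: str
    value_usd_millions: float


# Placeholder rows for fictional tickers; all values are invented.
FACTS = [
    AnnualFact("XYZ", 2020, "revenue", 100.0),
    AnnualFact("XYZ", 2021, "revenue", 120.0),
    AnnualFact("XYZ", 2022, "revenue", 150.0),
    AnnualFact("XYZ", 2023, "revenue", 140.0),
    AnnualFact("XYZ", 2024, "revenue", 180.0),
    AnnualFact("XYZ", 2024, "net_income", 30.0),
    AnnualFact("ABC", 2024, "revenue", 999.0),
]


def last_n_years(facts, ticker, metric, n):
    """Return {fiscal_year: value} for the n most recent years, oldest first."""
    rows = [f for f in facts if f.ticker == ticker and f.metric == metric]
    rows.sort(key=lambda f: f.fiscal_year)
    start = max(len(rows) - n, 0)  # avoids the rows[-0:] pitfall when n == 0
    return {f.fiscal_year: f.value_usd_millions for f in rows[start:]}


revenues = last_n_years(FACTS, "XYZ", "revenue", 5)
for year, value in revenues.items():
    print(year, value)
```

With structured rows in hand, the query is a filter and a sort. The hard part, producing those rows accurately from filings, happens once in a batch job rather than at question time.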

Surprisingly, incumbents have served this new AI market poorly. Consider Perplexity – a little digging suggests that they have a partnership with FactSet, but the data you see on perplexity.com/finance comes from startups[1]. So the question is: why? What are the opportunities to outcompete the incumbents and become the new data standard for the AI era? Here are some ideas:

  1. Avoiding (alleged) Anticompetitive Identifiers

    1. I’ve written about regulatory capture stemming from government agencies using proprietary identifiers[2]. Beyond this, owners of proprietary identifiers commonly extract fees from users by inserting clauses into data contracts that users may not realize are optional. Lawsuits continue in this area[3].

    2. OpenFIGI is a step in the right direction[4].

  2. Modern Delivery Mechanisms

    1. Modern data delivery means delivering query results via REST APIs and bulk data via S3, Snowflake, Databricks, or a similar bulk-sharing service.

    2. Many incumbent data providers have tried (and largely failed) to create their own delivery mechanisms or white-labelled data environments. Others charge a premium to deliver data via modern interfaces (the equivalent, in my opinion, of a SaaS vendor charging extra to access their website with the latest version of Chrome).

  3. Transparent Contracts

    1. Almost all frontier AI companies combine usage-based billing for API access with seat-based pricing. Usage-based pricing schemes are naturally more complex and more difficult to forecast than fixed licenses. Still, the complexity of AI companies’ pricing pales in comparison to what is common among incumbent data vendors.

    2. Anyone familiar with an incumbent data vendor’s usage-based contract knows it takes a combination of lawyers, accountants, and engineers to scope its exact cost.

    3. A good litmus test for reasonable contracts is public availability of contracts and rate cards. Data purchase agreements contain no proprietary secrets a competitor could steal. Even if they did, protecting this information in a tight-knit industry would be impractical.

  4. Good and Public Documentation

    1. Lack of transparent documentation is a huge problem. Browsing proprietary portals for answers is simply annoying.
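
As a rough illustration of the Transparent Contracts point above: when a rate card is public and simple, anyone can scope the cost of a usage-based contract in a few lines, with no lawyers required. The sketch below assumes a graduated (tiered) scheme; the `RATE_CARD` tiers, the prices, and the `FIXED_LICENSE` fee are all invented for illustration and do not reflect any real vendor's pricing.

```python
# Hypothetical rate card: (tier ceiling in API calls, price per call in USD).
# A ceiling of None means the tier has no upper bound. All numbers are invented.
RATE_CARD = [
    (100_000, 0.010),
    (1_000_000, 0.005),
    (None, 0.002),
]

FIXED_LICENSE = 5_000.0  # hypothetical flat monthly fee, for comparison


def usage_cost(calls, rate_card):
    """Cost under graduated tiers: each tier prices only the calls inside it."""
    cost = 0.0
    floor = 0  # number of calls already priced by lower tiers
    for ceiling, price in rate_card:
        if calls <= floor:
            break
        top = calls if ceiling is None else min(calls, ceiling)
        cost += (top - floor) * price
        if ceiling is None:
            break
        floor = ceiling
    return cost


for calls in (50_000, 500_000, 2_000_000):
    print(calls, round(usage_cost(calls, RATE_CARD), 2), FIXED_LICENSE)
```

Even this toy version shows why usage-based pricing is harder to forecast than a fixed license: the bill depends on a usage estimate that is usually uncertain. A published rate card at least makes the arithmetic checkable.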

Speaking with hedge fund data executives, I'm often met with head nods followed by shrugs. Few funds are large enough to truly change the practices of one of the major data incumbents. Too few competitors exist to exert enough competitive pressure. This presents an opportunity for startups to meet these needs – especially since the barriers to entry in this market are coming down, not up.

Those who've followed my blog may sense a bit of regret. Cybersyn, my now-defunct startup[5], provided both public domain and proprietary data, focusing mostly on the latter. On the public domain data side, my biggest question remains the long-term defensibility of this business. Nonetheless, Cybersyn had blue-chip users (and customers) that could have bought the same data from Bloomberg, FactSet, or S&P. Cybersyn’s datasets remain among the most popular on Snowflake Marketplace today. Being back in a data-buying role, I now need my own previous products. So, there is some thread to pull here.

My List of AI Data Startups

Some excellent new data providers follow these principles, including Financial Modeling Prep, Quartr, and Databento. Open source projects, like Datamule, also show promise. It's no coincidence that innovative AI startups like Perplexity use these vendors as data sources. A paradigm change in technology will create opportunities for startups in adjacencies. Perhaps some of these startups will also find answers to the defensibility question.

It's worth noting some exceptions beyond DaaS providers’ control, especially when data comes from a very limited set of vendors. For instance, real-time market data is relatively centrally controlled by exchanges. Data vendors are subject to data owners' controls. So, if exchanges mandate a data governance regime incompatible with modern data-sharing practices, data vendors can do little. That said, some startups like Databento have navigated this quagmire to offer more modern data products.
