An open source DuckDB text to SQL LLM (motherduck.com)
I see so many business leaders touting the promise of LLMs allowing businesses to "talk" to their data. The promise does sound enticing, but it's actually kind of hard to get working in practice.
A lot of our databases at work have columns with custom types and enums, and getting the LLM (Llama2) to write SQL queries that robustly answer natural language questions about the data is tough. It requires a lot of instruction prompting, context, and question-SQL examples (few-shot learning), and it still fails in unexpected ways. It's a tough ask for people to use a tool like this if they can't trust the results all the time. It also seems infeasible to scale this to tens or hundreds of tables across our data warehouse.
It's great that a lot of people are trying to crack this problem; I'm curious to try this model out. I'd also love to hear whether other people have tried solving this problem and made any headway.
I've been working on the DuckDB-NSQL model on the MotherDuck side. I fully agree that general text-2-sql, in the sense of "give me a question, and I'll produce an arbitrarily complex query for you", is a very tough problem, and I actually believe that that's not the right problem to solve. Not necessarily because models are not (going to be) capable enough, but rather because it's way too hard for humans to express in a single prompt what they actually want. Furthermore, it's simply not the right UX for most SQL users. A much better approach IMO is to keep SQL analysts in the driver's seat and provide nuanced support wherever/whenever they need it most. The FixIt feature we recently launched goes in the same direction: https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
In that sense, I emphasized in our blog post that users should think of it as a documentation oracle that always gives you the exact DuckDB SQL query snippet you are looking for, which is a tremendous time-saver if you have an abstract idea of the query you want to write but you're just not sure about the syntax, especially with DuckDB having so many functions and SQL extensions.
Here are a few examples (see the sketch of corresponding snippets after the list):
- create tmp table from test.csv
- load aws credentials from 'test' profile
- get max of all columns in rideshare table
- show query plan with runtimes for 'SELECT * FROM rideshare'
- cast hvfhs_license_num column to int
- get all columns ending with _amount from taxi table
- show summary statistics of rideshare table
- get a 10% reservoir sample of rideshare table
- get length of drivers array in taxi table
- get violation_type field from other_violations json column in taxi table
- get passenger count, trip distance and fare amount from taxi table and order by all of them
- list all tables in current database
- get all databases starting with test_
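For a rough sense of what that looks like, here are the kinds of snippets I'd expect for a few of these (based on the DuckDB docs rather than verbatim model output; table names as in the prompts):

    -- create tmp table from test.csv
    CREATE TEMP TABLE tmp AS SELECT * FROM 'test.csv';

    -- get max of all columns in rideshare table
    SELECT max(COLUMNS(*)) FROM rideshare;

    -- show query plan with runtimes for 'SELECT * FROM rideshare'
    EXPLAIN ANALYZE SELECT * FROM rideshare;

    -- get all columns ending with _amount from taxi table
    SELECT COLUMNS('.*_amount$') FROM taxi;

    -- show summary statistics of rideshare table
    SUMMARIZE rideshare;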
Thanks for replying; that's a perspective I hadn't considered. The capability to "talk to your data" just seems so enticing as a solution that I was tunnel-visioned into that UX. If I'm understanding correctly, what you're suggesting is more of a SQL assistant that helps people write the correct SQL queries, instead of writing the entire SQL query from scratch to answer a generic natural-language question?
I believe that is what they are saying.
I have found LLMs to be extremely helpful in mapping between schemas, as well as in helping me formulate queries where, because of decay, the data in tables, column names, etc. don't map to what you'd think they would.
You need to provide as much context as you can to the LLM: full schema definitions, histogram-style summaries, and samples from the tables themselves.
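With DuckDB specifically, most of that context can be pulled straight from the database itself; a minimal sketch of the kind of statements I mean (table name illustrative):

    -- full CREATE TABLE definitions for every table
    SELECT sql FROM duckdb_tables();

    -- per-column summary stats (min, max, approximate distinct count, null %, ...)
    SUMMARIZE taxi;

    -- a handful of sample rows to show what the values actually look like
    SELECT * FROM taxi USING SAMPLE 5;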
I hear what you're saying, and I actually agree that, unfortunately, this is the current state of LLMs today. But I think OP still has a point, and I'd argue as well that that is the aim of general AI: to be able to take a simple question, understand it, and produce the desired results.
As it stands, and as you mentioned, the current generation of LLMs seems a long way from there. Thus the need for prompt engineers. It'll be nice to see how long they take to crack this problem.
I agree. I think full text-to-results (even bypassing text-to-SQL) is akin to L5 self-driving cars. It's easy to get to some reasonable level, say 90%. But getting to the point where folks can fully trust the system and don't need a steering wheel (or to know SQL) may take decades.
We at MotherDuck took an incremental approach. We launched something more akin to lane-assist. We're calling it FixIt - an in-flow visual aid to help you identify and fix errors in your SQL [0].
I think there are gobs of opportunities to improve the analytics experience without resorting to "L5 self-driving" (e.g. full text-to-results).
[0] https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
I've worked on a similar problem and we have pretty much the same issues as you. An idea that makes things better is having an intermediate representation that exposes the key tables of your dwh and gets compiled to SQL. This allows you to have somewhat better security (even if the model outputs garbage it can't do too much damage because nothing except exactly what you want is even representable), and somewhat better explainability (you can't guarantee the model will get it right, but you can see what SQL gets executed rather than "magic 8-ball says no").
But as you say, custom types and encoded domain knowledge are extremely tough, and as a result it's very hard to "transfer" the system to different databases.
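To make the intermediate-representation idea concrete, a toy sketch (the IR and all names are made up; the point is that only whitelisted tables and measures are expressible, and you can always inspect the SQL that actually runs):

    -- hypothetical IR emitted by the model:
    --   measure:   total_fare
    --   dimension: pickup_zone
    --   filter:    year = 2023
    -- which the compiler deterministically turns into:
    SELECT pickup_zone,
           sum(fare_amount) AS total_fare
    FROM taxi
    WHERE year(pickup_datetime) = 2023
    GROUP BY pickup_zone;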
Lots of folks are taking this approach, and it feels like the wrong entry point, e.g., similar to asking LLMs to generate bytecode when compilers exist, or 3D-printing a whole machine vs. building a machine from 3D-printed components.
1. Business users aren’t prepared to talk to their data in meaningful ways and this is an opportunity for LLMs to enhance the way users ask questions.
2. SQL modeling languages exist (although I'm not sure there are good, well-maintained open source ones, and this is the biggest obstacle to what I'm working on now) and LLMs can extend these effectively by adding components such as dimensions, metrics, relationships, filters, etc. with less chance of hallucination
3. Deterministic SQL generation from a modeling repository is easier to troubleshoot and to provide guarantees for than end-to-end generation.
4. Existing SQL can be parsed to generate modeling components that can be committed to the model repository with LLM assistance
5. Much of the richness of going to data to answer questions is context, e.g., how does this dept compare to others, this month to the same month last year, etc. Modeling languages are good at expressing these transformations (see the sketch below for what a compiled month-over-month comparison might look like), but business users, and often analysts, aren't good at capturing all the permutations of these types of comparisons. Another area where LLMs can help apply tooling.
IMO, LLMs are more effective at using tools than generating complex outputs.
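To illustrate point 5, this is roughly the SQL a modeling layer could emit for "revenue by region, this month vs. the same month last year" (table and column names are hypothetical):

    SELECT
        region,
        sum(amount) FILTER (WHERE date_trunc('month', order_date) = date_trunc('month', current_date))
            AS revenue_this_month,
        sum(amount) FILTER (WHERE date_trunc('month', order_date) = date_trunc('month', current_date - INTERVAL 1 YEAR))
            AS revenue_same_month_last_year
    FROM orders
    GROUP BY region;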
> but it's actually kind of hard to get working in practice
One of the biggest challenges I've personally seen in this space is business "leaders" pushing teams to ship products asap lest they lose face among their fellow CEOs for not pushing out "AI" products before everyone else.
I'm fairly optimistic about LLMs being able to be truly transformative, but it's not going to come from the brain-dead UX of foisting yet another slightly re-imagined chat interface on users.
The idea of "talking to your data" is a promising one, and anyone who has worked for a large data driven org will quickly agree that organizing and searching in-house data is not a solved problem from the UX end of things. But to truly solving these problems, even/especially with LLMs, is going to require thought and experimentation. Something few "business leaders" have patience for.
My biggest concern is that this will allow people to type a question and get a number back from the database, without being able to tell if the query is right or if the LLM just made up something.
It can work to support business analysts in cranking out more reports, but I wouldn't roll it out to all my staff.
I've actually done exactly what @qsort suggested and output the intermediate SQL query and the raw data generated by that query when generating the response back to the user. That definitely helps in establishing more trust with the customer, since they can verify the response. My approach right now is to just be honest with our customers about the capabilities of the tool, acknowledge its shortcomings, and keep iterating over time to make it better and better. That's what the team in charge of our company-wide custom LLM has done, and it's gained a surprising amount of traction and trust over the last few months.
Llama2 wasn't trained on code, so it's no surprise that you're having trouble generating SQL.
I assume that "Talk to your data" presupposes the existence of some simple view that has been created and that embeds 99% of the required business logic.
"What was the average order size per customer for product XYZ in the West region?"
Imagine turning that one loose against the typical legacy system.
I wouldn't trust an LLM to figure out the joins or aggregate calculation, LET ALONE the definition of a customer, a product, or a region.
But it would almost certainly generate AN answer.
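To make that concrete, the query behind that one innocent-looking question might look something like this (schema entirely made up, but representative of a legacy system), and nearly every line of it is business logic the LLM can only guess at:

    SELECT c.cust_nm,
           avg(ol.qty * ol.unit_prc) AS avg_order_size
    FROM ord_hdr oh
    JOIN ord_ln  ol ON ol.ord_id  = oh.ord_id
    JOIN cust    c  ON c.cust_id  = oh.cust_id
    JOIN prod    p  ON p.prod_id  = ol.prod_id
    JOIN terr    t  ON t.terr_cd  = c.terr_cd
    WHERE p.prod_cd   = 'XYZ'
      AND t.region_nm = 'West'
      AND oh.stat_cd NOT IN ('CANC', 'TEST')  -- which orders "count"? the model has to guess
    GROUP BY c.cust_nm;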
I'm trying to solve for this with my project using RAG, and (at least based on what people say in Discord) it's working really well for them: https://github.com/vanna-ai/vanna
Co-founder and Head of Produck at MotherDuck here, happy to answer any questions or go nag the amazing engineers [0] who worked on this :)
1. First of all, thanks for outlining how you trained the model here in the repo: https://github.com/NumbersStationAI/DuckDB-NSQL?tab=readme-o...! I did not know about `sqlglot`; that's a pretty cool lib. Which part of the project was the most challenging or time-consuming: generating the training data, the actual training, or testing? How did you iterate, improve, and test the model?
2. How would you suggest using this model effectively if we have custom data in our DBs? For example, we might have a column called `purpose` that's a custom-defined enum (i.e., not a well-known concept outside of our business). Currently, we've fed it in as context by defining all the possible values it can have (roughly the kind of DDL sketched at the end of this comment). Do you have any other recs on how to tune our prompts so that this model is just as effective with our own custom data?
3. Similar to the above, do you know whether the same model can work effectively across tens or even hundreds of tables? I've used multiple question-SQL example pairs as context, but I've found that I need 15-20 for it to be effective for even one table, let alone tens of tables.
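For reference, the enum context in question 2 looks roughly like this (values are made up for this comment):

    -- custom enum whose values mean nothing outside our business
    CREATE TYPE purpose AS ENUM ('ONBOARDING', 'RENEWAL', 'UPSELL', 'CHURN_SAVE');

    CREATE TABLE accounts (
        account_id BIGINT,
        purpose    purpose,    -- the model only learns what this means from the DDL above
        created_at TIMESTAMP
    );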
Hi, Till here, worked on the DuckDB-NSQL model on the MotherDuck side.
1. Definitely the training data (for me); we explored about 10 different directions before settling on the current approach. It's easy to underestimate the effect of training data on the quality of the model. The starting point was the benchmark dataset, though, which we assembled manually (to avoid data pollution, and also because there was simply no text2sql benchmark that covers anything other than plain old SQL SELECT statements with a handful of aggregate functions). And training is also not a one-off thing: with large datasets it is hard to evaluate the quality of the dataset without actually training a few epochs on it and running the benchmark.
2. I left a comment about my view on where such models are effective in a previous comment: https://news.ycombinator.com/item?id=39133155
3. No way. I see a common stack emerging (take a look at companies like https://vanna.ai/, https://www.dataherald.com/, or https://www.waii.ai) that is mainly centered around foundation models like GPT-4 with strong in-context learning capabilities (that's kind of a must to make these approaches work, and it comes with long inference times and higher costs). These solutions wrap things like embedding-based schema filtering, options for users to enrich metadata about tables and columns, inclusion of previous related queries in the context, etc. around the model. I'd say it's a bit of a different problem from what we aimed at solving.
Thanks for taking the time to answer the questions and link those resources, really appreciate it and the work your team did!
I didn't see this in the blog post, but did you train this from scratch or finetune an existing base model?
If from scratch, quite impressive that the model is capable of understanding natural language prompts (English presumably) from such a small, targeted training set.
Founder of Vanna AI here -- appreciate the link and I agree, we're solving slightly different problems.
Congratulations! FixIt seems like a pragmatic feature.
Any other ideas about where you plan to utilize LLM? E.g.
- migrate between different SQL dialects (e.g. date functions; see the sketch after this list)
- make my queries nicer (e.g., rewrite them in shorter form or use better variable names)
- warn about potential bugs (e.g. SQL fanout)
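On the first point, date functions are where dialects diverge the most in my experience; a trivial example of the kind of rewrite I mean (monthly rollup, illustrative):

    -- MySQL
    SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month, count(*) AS n
    FROM orders GROUP BY 1;

    -- DuckDB
    SELECT date_trunc('month', created_at) AS month, count(*) AS n
    FROM orders GROUP BY 1;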
Love these! We do want to deliver more features like FixIt! [0]
What's really exciting is what you can do with DuckDB, MotherDuck, and WASM. A powerful in-browser storage and execution engine tethered to a central serverless data warehouse using hybrid mode [1] opens the door to unprecedented experiences. Imagine the possibilities if you have metadata, data, query logic, or even LLMs in the client, 0ms away from the user and on the user's own hardware.
So we're doing this in our UI of course, but we also released a WASM SDK so that developers can take advantage of this new architecture in their own apps! [2]
[0] https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
[1] https://motherduck.com/docs/architecture-and-capabilities
Produck. Amazing.
The core issue of text to SQL is that your data has to be good for the generated queries to be correct. The queries may run and return good looking results, but if the data requires domain knowledge ("Don't count people in the customer table without filtering out records with the test flag in the customer attributes table and at least one order in the orders table") you'll get results that don't actually answer your question.
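Concretely, the gap between what gets generated and what the quoted rule actually requires might look like this (table and column names illustrative):

    -- what a naive generation produces
    SELECT count(*) FROM customer;

    -- what the quoted rule actually requires
    SELECT count(*)
    FROM customer c
    JOIN customer_attributes ca USING (customer_id)
    WHERE NOT ca.is_test
      AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);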
This is awesome, congratulations. I'm glad to see some text-to-sql models being created. Shameless plug: I also just realized you used NSText2SQL[1] which itself contains my text-to-sql dataset, sql-create-context[2], so I'm honored. I used sqlglot pretty heavily on it as well.
Do you think a 3B model might also be in the future, or something small enough that can be loaded up in Transformers.js?
[1] https://huggingface.co/datasets/NumbersStation/NSText2SQL
[2] https://huggingface.co/datasets/b-mc2/sql-create-context
I'd love to see how it performs in some benchmarks, specifically against Spider (https://yale-lily.github.io/spider) and BIRD (https://bird-bench.github.io/)
Looks great. Most text-to-SQL attempts I've tried fall short; hoping this is different.
I do this a lot with GPT4 and I can't recall it ever generating an invalid query, and I did have some complicated ones.
The problem isn't the SQL generation per se, it's the mapping between natural language and the data the columns actually hold. If you have a clean schema with well-named fields, LLMs are very good at generating valid SQL, but if you have an old schema that has been patched and modified for years by a dozen DBAs, each with their own naming quirks, shorthands, and hacks, it can barely string a valid query together.
Yeah, blame the user for poorly named columns and whatever "clean" means; that's it. That's what's blocking the feature.
The issue is that SQL is not composable. It's necessary to articulate the messy operation all at once and it's not possible to iterate on it.
Good luck!