Show HN: I built a local data lake for AI powered data engineering and analytics

stream-sock-3f5.notion.site

14 points by vpfaiz 3 months ago · 10 comments · 1 min read

I got tired of the overhead required to run even a simple data analysis - cloud setup, ETL pipelines, orchestration, cost monitoring - so I built a fully local data-stack/IDE where I can write SQL/Py, run it, see results, and iterate quickly and interactively.

You get a data-lake-like catalog, zero-ETL, lineage, versioning, and analytics running entirely on your machine. You can import from a database, webpage, CSV, etc. and query in natural language or do your own work in SQL/PySpark. Connect to local models like Gemma or cloud LLMs like Claude for querying and analysis. You don’t have to set up local LLMs yourself; they come built in.

This is completely free. No cloud account required.

Download the software - https://getnile.ai/downloads

Watch a demo - https://www.youtube.com/watch?v=C6qSFLylryk

Check the code repo - https://github.com/NileData/local

This is still early and I'd genuinely love your feedback on what's broken, what's missing, and if you find this useful for your data and analytics work.

ramkiz 3 months ago

Very cool idea. The part I would love to hear more about is how you are thinking about the boundary between notebook/IDE convenience and actual data lake guarantees. For example, what exactly is versioned, how reproducible are transformations, and how much lineage visibility do I get once I start mixing SQL, PySpark, natural language queries, and imported web/DB data?

  • vpfaizOP 3 months ago

    Everything, including the actual data, schema, and transforms, is versioned and tracked at the job-run level.

    You get job-run-level lineage for any dataset created in the system.
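
As a rough illustration of what versioning and lineage at the job-run level could look like, here is a minimal Python sketch. It is not Nile's implementation: `Catalog`, `JobRun`, `fingerprint`, and the dataset names are all hypothetical, and content hashes stand in for whatever storage format the tool actually uses.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(obj) -> str:
    """Stable short hash of any JSON-serializable value (hypothetical helper)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class JobRun:
    """One run of one transform: what it read, what it wrote, and hashes of each part."""
    run_id: int
    output: str
    inputs: list
    transform_hash: str
    schema_hash: str
    data_hash: str


@dataclass
class Catalog:
    """Append-only log of job runs; every run is a new version of its output dataset."""
    runs: list = field(default_factory=list)

    def record_run(self, output, inputs, transform, rows):
        # The schema here is just the sorted set of column names seen in the rows.
        schema = sorted({key for row in rows for key in row})
        run = JobRun(
            run_id=len(self.runs) + 1,
            output=output,
            inputs=list(inputs),
            transform_hash=fingerprint(transform),
            schema_hash=fingerprint(schema),
            data_hash=fingerprint(rows),
        )
        self.runs.append(run)
        return run

    def lineage(self, dataset, before=None):
        """Run ids behind the latest version of `dataset`, upstream runs first."""
        # Only consider runs that happened earlier than `before`, so the walk terminates.
        candidates = self.runs if before is None else self.runs[: before - 1]
        latest = next((r for r in reversed(candidates) if r.output == dataset), None)
        if latest is None:
            return []
        upstream = []
        for name in latest.inputs:
            for run_id in self.lineage(name, before=latest.run_id):
                if run_id not in upstream:
                    upstream.append(run_id)
        return upstream + [latest.run_id]


catalog = Catalog()
catalog.record_run("raw_orders", [], "import orders.csv",
                   [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}])
catalog.record_run("order_totals", ["raw_orders"],
                   "SELECT SUM(amount) AS total FROM raw_orders",
                   [{"total": 15}])
print(catalog.lineage("order_totals"))  # [1, 2]
```

Because each run stores separate hashes for the transform, schema, and data, comparing two runs of the same dataset shows which of the three changed, which is one way to reason about reproducibility.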

jazarine 3 months ago

What's the difference between this and asking Claude to do data analysis?

  • vpfaizOP 3 months ago

    Two things:

    1. You may not want to expose bits and pieces of your data and metadata to an LLM, and you don't want your data to be used for training. If you are using an LLM running on your machine, as in this case, you are covered there.

    2. Claude can do a lot of stuff, but doing multi-step analysis consistently and reliably is not guaranteed due to the non-deterministic nature of LLMs. It may take a different route every time. Nile Local offers a set of data primitives like query, build-pipe, discover, etc. that reduce the non-determinism and bring reliability and transparency (how the answer was derived) to the data analysis.

revv00 3 months ago

Great work! What is the difference between writing a shell script to solve "cloud setup, ETL pipelines, orchestration, cost monitoring", and using a local app?

  • vpfaizOP 3 months ago

    I used to write shell scripts to do all this. Then the number of scripts started adding up and things got overly complex. This local app is what evolved from all that, but with more reliable query running, compute management, a Spark environment, and above all a UI and AI that make the process more seamless than a cluttered CLI UX.

am__ 3 months ago

When you say local, do you mean I could run it without wifi? I have some work files I could use some help on, but I can’t connect to other LLMs.

  • vpfaizOP 3 months ago

    Yes, absolutely. You can turn off wifi and still work on your data, with AI to help you out. The LLM runs locally on your box.

sdhruv93 3 months ago

Can I run it on my MacBook? Do I need to set up the LLM myself?

  • vpfaizOP 3 months ago

    Yes. I would recommend a model with at least 16 GB of RAM. I was able to run it on a MacBook Air with 8 GB, but the LLM assist lagged.

    You don't need to set up the LLM locally; the tool does that. You can choose which model to go with. Gemma and Qwen are supported for now.
