Kor: a half-baked prototype that "helps" you extract structured data using LLMs

github.com

123 points by BorisWilhelms 3 years ago · 16 comments

anotherpaulg 3 years ago

Does this take advantage of the new OpenAI functions api? From a quick look, I can't find any indication that it does. Although I find it tricky to disentangle the langchain abstractions, so I might be missing it. Kor's last release predates the announcement of OpenAI functions, so probably not.

Seems like this is now best done via functions, if you're using OpenAI's models? They call out "extracting structured data from text" as a key use case in their announcement.

https://openai.com/blog/function-calling-and-other-api-updat...
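
A minimal sketch of what extraction via function calling might look like, assuming the shape described in the announcement: a function is described by a name, a description, and a JSON Schema for its parameters, and the model replies with a JSON string of arguments. No API call is made here. The `extract_person` schema, its fields, and the simulated argument string are all hypothetical, and the validation is a hand-rolled check rather than a full JSON Schema validator.

```python
import json

# Function description in the name/description/parameters shape.
# The function name and field names are illustrative, not from Kor or OpenAI.
EXTRACT_PERSON = {
    "name": "extract_person",
    "description": "Extract a person's details from free text.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name"],
    },
}

# Only the two JSON Schema types used above are handled.
TYPE_MAP = {"string": str, "integer": int}


def parse_arguments(arguments_json, schema):
    """Decode the model's argument string and check it against the schema."""
    data = json.loads(arguments_json)  # raises ValueError on invalid JSON
    params = schema["parameters"]
    missing = [key for key in params["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, value in data.items():
        spec = params["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected field: {key}")
        if not isinstance(value, TYPE_MAP[spec["type"]]):
            raise ValueError(f"field {key!r} is not of type {spec['type']}")
    return data


# Stand-in for the argument string a model might return.
simulated_arguments = '{"name": "Ada Lovelace", "age": 36}'
person = parse_arguments(simulated_arguments, EXTRACT_PERSON)
```

The check after decoding matters because the returned arguments are model output: they can be malformed or omit fields, so the schema constrains the format without guaranteeing it.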

kiernanmcgowan 3 years ago

Another tool like this is Marvin. My experience is that these work pretty well, but the world of prompt “engineering” is a very squishy one and getting the exact output format you want is not guaranteed.

https://www.askmarvin.ai/

captainmuon 3 years ago

Neat, I was just looking for something like this today, I think I'll give it a spin.

Does anybody here have experience with metadata extraction using LLMs? I've been thinking about it recently and wonder if just making a big prompt and putting that into OpenGPT or even ChatGPT is really the way to go, or if there is a "cleverer" way. Maybe you could train specifically for certain fields, or use the LLM in a different way (like you can use the embeddings directly to do similarity search)?

Another idea was, if you have a lot of similar HTML documents, to not ask the LLM for the metadata, but to ask it for CSS selectors that contain the metadata fields - assuming it can deal with HTML and the data is verbatim in there. Then you should be able to get much more consistent results.
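
The selector idea above can be sketched roughly as follows: ask the model once for the locations of the fields, then apply those locations deterministically to every similar page. To stay within the standard library, ElementTree's limited XPath stands in for CSS selectors, and the page is assumed to be well-formed markup. The `SELECTORS` mapping is a hypothetical example of what a model might return; no model is called.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical output of a single LLM call made for a family of similar pages.
# ElementTree path syntax is used here in place of CSS selectors.
SELECTORS = {
    "title": ".//h1[@class='title']",
    "author": ".//span[@class='author']",
}


def extract(page: str, selectors: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply stored paths to one page; no model call happens per page."""
    root = ET.fromstring(page)  # requires well-formed markup
    record = {}
    for field, path in selectors.items():
        node = root.find(path)
        # The value is copied verbatim from the page, so it cannot be hallucinated.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record


PAGE = (
    "<html><body>"
    "<h1 class='title'>Kor</h1>"
    "<span class='author'>Jane Doe</span>"
    "</body></html>"
)
record = extract(PAGE, SELECTORS)
```

A field whose path no longer matches comes back as `None`, which gives a cheap signal that the stored selectors need to be regenerated.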

  • hubraumhugo 3 years ago

    We're using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using an LLM for every data extraction, as most comparable tools do, is expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.

    Try it out https://kadoa.com
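
A minimal sketch of the generate-once, regenerate-on-breakage pattern described above, not Kadoa's actual implementation. `generate_pattern` is a hypothetical stub standing in for an LLM call that writes extraction code; here it only tells two hard-coded layouts apart so the example runs offline. The extractor is a cached regex that is rebuilt only when it stops matching.

```python
import re

llm_calls = 0  # counts how often the stubbed model is consulted


def generate_pattern(sample_page):
    """Hypothetical stand-in for an LLM that writes an extractor for a page."""
    global llm_calls
    llm_calls += 1
    if "data-price" in sample_page:
        return r'data-price="([\d.]+)"'
    return r'<span class="price">\$([\d.]+)</span>'


class AdaptiveScraper:
    """Reuse a generated extractor until it fails, then regenerate it once."""

    def __init__(self):
        self.pattern = None  # no extractor generated yet

    def _run(self, page):
        if self.pattern is None:
            return None
        match = self.pattern.search(page)
        return match.group(1) if match else None

    def extract(self, page):
        value = self._run(page)
        if value is None:
            # Extractor is missing or no longer matches: ask the "model" again.
            self.pattern = re.compile(generate_pattern(page))
            value = self._run(page)
        return value


scraper = AdaptiveScraper()
old_layout = ['<span class="price">$10.00</span>', '<span class="price">$12.50</span>']
new_layout = ['<div data-price="11.00"></div>']
values = [scraper.extract(page) for page in old_layout + new_layout]
```

In this run the stub is consulted twice for three pages: once up front and once after the layout changes. That is the cost argument in miniature, since the per-page work is a plain regex search.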

    • jshmrsn 3 years ago

      That is a very thought-provoking use case and optimization for LLMs, thanks for sharing.

  • nerpderp82 3 years ago

    I gave it some CSS paths extracted from devtools and some sample elements with data that needed extraction, and had it write a Beautiful Soup + regex routine to do the extractions. Worked fine. Also thousands of times faster.
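
A rough sketch of the kind of routine described above, under some assumptions: the standard library's `html.parser` stands in for Beautiful Soup so the example has no dependencies, and the `price` class, the sample page, and the price regex are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so depth tracking stays correct.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # text of each finished match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(page):
    """Locate elements by class, then pull the number out with a regex."""
    collector = ClassTextCollector("price")
    collector.feed(page)
    collector.close()
    matches = (PRICE_RE.search(text) for text in collector.results)
    return [float(m.group(1)) for m in matches if m]


PAGE = """
<ul>
  <li class="item"><b>Widget</b> <span class="price">Price: $19.99</span></li>
  <li class="item"><b>Gadget</b> <span class="price">Price: $5.00</span></li>
</ul>
"""
prices = extract_prices(PAGE)
```

Once written, a routine like this runs without any model in the loop, which is where the speedup comes from.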

mark_l_watson 3 years ago

I have experimented with Kor several times, cool idea.

dennisy 3 years ago

Have you tried this on HTML?
