Kor: a half-baked prototype that "helps" you extract structured data using LLMs

123 points by BorisWilhelms 3 years ago · 16 comments

Reader

Does this take advantage of the new OpenAI functions api? From a quick look, I can't find any indication that it does. Although I find it tricky to disentangle the langchain abstractions, so I might be missing it. Kor's last release predates the announcement of OpenAI functions, so probably not.

Seems like this is now best done via functions, if you're using OpenAI's models? They call out "extracting structured data from text" as a key use case in their announcement.

https://openai.com/blog/function-calling-and-other-api-updat...

BorisWilhelmsOP 3 years ago

No, it is not using openai functions. Since it is on top of langchain it uses the LLM abstraction of it and it can be used with other models as well.
- anotherpaulg 3 years ago
  
  Yup, the flexibility of running against any model via langchain is super helpful.
  - reissbaker 3 years ago
    
    FYI, the upcoming version of gpt4 does considerably worse emulating function calls / generating code-like strings, but gets better again if you switch to the function call API: https://twitter.com/reissbaker/status/1671361372092010497
    (My guess is the same is true of gpt-3.5, although I haven't tested it.)
    That being said, Langchain has a nearly drop-in replacement if you want to start using the function call API: https://python.langchain.com/docs/modules/agents/agent_types...
  - contravariant 3 years ago
    
    Are there any useful alternative models though? Most I've found weren't particularly good at following instructions or using tools in the way langchain provides them.

kiernanmcgowan 3 years ago

Another tool like this is Marvin. My experience this that these work pretty well, but the world of prompt “engineering” is a very squishy one and getting the exact output format you want is not guaranteed.

https://www.askmarvin.ai/

captainmuon 3 years ago

Neat, I was just looking for something like this today, I think I'll give it a spin.

Does anybody here have experience with metadata extraction using LLMs? I've been thinking about it recently. and wonder if just making a big prompt and putting that into OpenGPT or even ChatGPT is really the way to go, or if there is a "cleverer" way. Maybe you could train specifically for certain fields, or use the LLM in a different way (like you can use the embeddings directly to do simularity search)?

Another idea was, if you have a lot of similar HTML documents, to not ask the LLM for the metadata, but to ask it for CSS selectors that contain the metadata fields - assuming it can deal with HTML and the data is verbatim in there. Then you should be able to get much more consistent results.

hubraumhugo 3 years ago

We're using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using an LLM for every data extraction, as most comparable tools do, is expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
Try it out https://kadoa.com
- jshmrsn 3 years ago
  
  That is a very thought provoking use case and optimization for LLMs, thanks for sharing.
nerpderp82 3 years ago

I gave it some css paths extracted from devtools, and some sample elements with data that needed extraction and had it write a beautiful soup + regex routine to do the extractions. worked fine. Also thousands of times faster.

mark_l_watson 3 years ago

I have experimented with Kor several times, cool idea.

dennisy 3 years ago

Have you tried this on HTML?

BorisWilhelmsOP 3 years ago

Yes, tried it on HTML to get "metadata" that was not present in the HTML meta tags, such as author, publish date, etc. Works good.
- BorisWilhelmsOP 3 years ago
  
  Actually not on raw HTML, but with the WebBaseLoader from Langchain which strips away HTML tags.
  - dennisy 3 years ago
    
    Ahh cool thank you!

Settings

Kor: a half-baked prototype that "helps" you extract structured data using LLMs

Keyboard Shortcuts