# Extrai
## 📖 Description
extrai extracts structured data from text documents using LLMs, formats the output into a given SQLModel, and persists it in a database.
The library uses a Consensus Mechanism to improve accuracy: it issues the same request multiple times, to the same or different providers, and then selects the values that meet a configured agreement threshold.
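The idea can be pictured as field-level majority voting over the JSON each revision produces. The sketch below is illustrative only and is not extrai's implementation; the function name and threshold semantics are assumptions:

```python
from collections import Counter

def consensus(revisions: list[dict], threshold: float = 0.5) -> dict:
    """Keep, per field, the value appearing in at least `threshold`
    of the revisions (illustrative sketch, not extrai's internals)."""
    result = {}
    fields = {key for rev in revisions for key in rev}
    for field in sorted(fields):
        values = [rev[field] for rev in revisions if field in rev]
        value, count = Counter(values).most_common(1)[0]
        if count / len(revisions) >= threshold:
            result[field] = value
    return result

revisions = [
    {"name": "SuperWidget", "price": 99.99},
    {"name": "SuperWidget", "price": 99.99},
    {"name": "Super Widget", "price": 99.99},
]
print(consensus(revisions, threshold=0.6))
# {'name': 'SuperWidget', 'price': 99.99} -- 'name' agrees 2/3, 'price' 3/3
```

Fields that fall below the threshold are simply dropped, which is one plausible way to trade recall for precision.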
extrai also has other features, like generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, breaking down the extraction into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.
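Hierarchical Extraction can be pictured as extracting the top-level entity first, then running a separate, smaller extraction per nested relationship. This is a schematic sketch with a stub LLM, not extrai's actual API:

```python
def extract_order(text: str, llm) -> dict:
    # Step 1: extract the parent object without its nested items.
    order = llm(f"Extract the order header from: {text}")
    # Step 2: a follow-up extraction for each nested relationship.
    order["items"] = llm(f"Extract the line items from: {text}")
    return order

# Stub LLM: returns canned JSON for each of the two passes.
fake_llm_calls = iter([
    {"order_id": 42},             # header pass
    [{"sku": "SW-1", "qty": 2}],  # items pass
])
result = extract_order("Order 42: 2x SuperWidget (SW-1)",
                       lambda prompt: next(fake_llm_calls))
print(result)
# {'order_id': 42, 'items': [{'sku': 'SW-1', 'qty': 2}]}
```

Splitting the work this way keeps each prompt small and focused, which is the usual motivation for hierarchical extraction of deeply nested schemas.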
## ✨ Key Features
- Consensus Mechanism: Consolidates multiple LLM outputs to improve extraction accuracy.
- Dynamic SQLModel Generation: Generates SQLModel schemas from natural language descriptions.
- Hierarchical Extraction: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.
- Extensible LLM Support: Integrates with various LLM providers through a client interface.
- Built-in Analytics: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.
- Workflow Orchestration: A central orchestrator to manage the extraction pipeline.
- Example JSON Generation: Automatically generate few-shot examples to improve extraction quality.
- Customizable Prompts: Customize prompts at runtime to tailor the extraction process to specific needs.
- Rotating LLM Providers: Generate JSON revisions from multiple LLM providers in rotation.
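Rotating providers can be approximated by cycling through a pool of clients, one per revision. The sketch below uses a stand-in client class; the real extrai client interface may differ:

```python
from itertools import cycle

class StubClient:
    """Stand-in for an LLM client; a real one would call a provider API."""
    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"{self.name} response"

# Round-robin over the provider pool: each revision uses the next client.
providers = cycle([StubClient("openai"), StubClient("anthropic")])
revisions = [next(providers).complete("extract product data") for _ in range(4)]
print(revisions)
# ['openai response', 'anthropic response', 'openai response', 'anthropic response']
```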
## 📚 Documentation
For a complete guide, please see the full documentation. Here are the key sections:
- Getting Started
- How-to Guides
- Core Concepts
- Reference
- API Reference
- Community
## ⚙️ Workflow Overview
The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview):
```mermaid
graph TD
    %% Define styles for different stages for better colors
    classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
    classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
    classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
    classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
    classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87

    subgraph "Inputs (Static Mode)"
        A["📄<br/>Documents"]
        B["🏛️<br/>SQLAlchemy Models"]
        L1["🤖<br/>LLM"]
    end

    subgraph "Inputs (Dynamic Mode)"
        C["📋<br/>Task Description<br/>(User Prompt)"]
        D["📚<br/>Example Documents"]
        L2["🤖<br/>LLM"]
    end

    subgraph "Model Generation<br/>(Optional)"
        MG("🔧<br/>Generate SQLModels<br/>via LLM")
    end

    subgraph "Data Extraction"
        EG("📝<br/>Example Generation<br/>(Optional)")
        P("✍️<br/>Prompt Generation")
        subgraph "LLM Extraction Revisions"
            direction LR
            E1("🤖<br/>Revision 1")
            H1("💧<br/>SQLAlchemy Hydration 1")
            E2("🤖<br/>Revision 2")
            H2("💧<br/>SQLAlchemy Hydration 2")
            E3("🤖<br/>...")
            H3("💧<br/>...")
        end
        F("🤝<br/>JSON Consensus")
        H("💧<br/>SQLAlchemy Hydration")
    end

    subgraph Outputs
        SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
        O["✅<br/>Hydrated Objects"]
        DB("💾<br/>Database Persistence<br/>(Optional)")
    end

    %% Connections for Static Mode
    L1 --> P
    A --> P
    B --> EG
    EG --> P
    P --> E1
    P --> E2
    P --> E3
    E1 --> H1
    E2 --> H2
    E3 --> H3
    H1 --> F
    H2 --> F
    H3 --> F
    F --> H
    H --> O
    H --> DB

    %% Connections for Dynamic Mode
    L2 --> MG
    C --> MG
    D --> MG
    MG --> EG
    MG --> SM

    %% Apply styles
    class A,B,C,D,L1,L2 inputStyle;
    class P,E1,E2,E3,H,EG processStyle;
    class F consensusStyle;
    class O,DB,SM outputStyle;
    class MG modelGenStyle;
```
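In code, the static-mode flow amounts to: build a prompt from the documents, collect N LLM revisions, consolidate them, and hydrate the winning JSON into objects. The sketch below uses stand-in helpers and a stub LLM to trace those stages; none of the names correspond to extrai's internals:

```python
from collections import Counter

def build_prompt(documents: list[str]) -> str:
    # Prompt Generation: fold the documents into one extraction prompt.
    return "Extract fields from: " + " ".join(documents)

def consolidate(revisions: list[dict]) -> dict:
    # JSON Consensus: keep the most common value per field (simplified).
    return {
        field: Counter(rev[field] for rev in revisions).most_common(1)[0][0]
        for field in revisions[0]
    }

def hydrate(data: dict) -> dict:
    # SQLAlchemy Hydration: stands in for building (and optionally
    # persisting) model instances; here it just returns the dict.
    return data

def run_pipeline(documents, llm, num_revisions=3):
    prompt = build_prompt(documents)                         # Prompt Generation
    revisions = [llm(prompt) for _ in range(num_revisions)]  # LLM Revisions
    return hydrate(consolidate(revisions))                   # Consensus + Hydration

# Stub LLM that always "extracts" the same JSON.
def fake_llm(prompt: str) -> dict:
    return {"name": "SuperWidget", "price": 99.99}

print(run_pipeline(["The new SuperWidget costs $99.99."], fake_llm))
# {'name': 'SuperWidget', 'price': 99.99}
```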
## ▶️ Getting Started
### 📦 Installation
Install the library from PyPI:
```bash
pip install extrai-workflow
```
### ✨ Usage Example
For a more detailed guide, please see the Getting Started Tutorial.
Here is a minimal example:
```python
import asyncio
from typing import Optional

from sqlmodel import Field, SQLModel, Session, create_engine

from extrai.core import WorkflowOrchestrator
from extrai.llm_providers.huggingface_client import HuggingFaceClient

# 1. Define your data model
class Product(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    price: float

# 2. Set up the orchestrator
llm_client = HuggingFaceClient(api_key="YOUR_HF_API_KEY")
engine = create_engine("sqlite:///:memory:")
orchestrator = WorkflowOrchestrator(
    llm_client=llm_client,
    db_engine=engine,
    root_model=Product,
)

# 3. Run the extraction and verify
text = "The new SuperWidget costs $99.99."
with Session(engine) as session:
    asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))
    product = session.query(Product).first()
    print(product)  # Expected: name='SuperWidget' price=99.99 id=1
```
## 🚀 More Examples
For more in-depth examples, see the /examples directory in the repository.
## 🙌 Contributing
We welcome contributions! Please see the Contributing Guide for details on how to set up your development environment, run tests, and submit a pull request.
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
