davenporten/nrc-regulatory-embeddings · Datasets at Hugging Face


NRC Regulatory Embeddings

37,734 embedded chunks of NRC nuclear regulatory documents, ready for use in RAG pipelines.

Built for the nrc-licensing-rag project, an AI system for analyzing nuclear Combined License Applications (COLAs).

Contents

| Source | Documents |
|---|---|
| NUREG-0800 (Standard Review Plan) chapters 1-19 | 2,436 sections |
| 10 CFR Parts 20, 50, 51, 52, 72, 73, 100 | ~504 sections |
| Regulatory Guide Division 1 (1.1-1.262) | 242 guides |
| Regulatory Guide Division 4 (4.1-4.28) | 27 guides |
| Total chunks | 37,734 |

Schema

| Column | Type | Description |
|---|---|---|
| id | string | Unique chunk ID |
| text | string | Document chunk text |
| embedding | list[float] | 1536-dim OpenAI text-embedding-3-small vector |
| source | string | Source identifier (e.g. nureg_0800, 10cfr50, reg_guide) |
| document_type | string | srp, cfr, or reg_guide |
| document_id | string | Document identifier |
| title | string | Section or guide title |
| section_id | string | NRC section number |
| chapter | string | Chapter (for SRP documents) |
| chunk_index | int | Position of chunk within source document |
| source_url | string | NRC.gov URL where available |
| guide_id | string | Regulatory Guide number (e.g. 1.1) |
| cfr_part | string | CFR part number |
| division | string | Regulatory Guide division |
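As a rough illustration of how these columns fit together, the sketch below checks that a row carries a handful of core fields and a 1536-dimensional vector, then counts rows per document_type. It works on plain dicts, which is the shape a row takes when iterating a `datasets` split. The helper names (`check_row`, `count_by_type`), the sample rows, and the choice of which fields count as "core" are assumptions made for illustration, not part of the dataset.

```python
from collections import Counter

EMBEDDING_DIM = 1536  # text-embedding-3-small, per the schema
# Assumption: these five columns are populated on every row.
CORE_FIELDS = ("id", "text", "embedding", "source", "document_type")


def check_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = [f"missing {name}" for name in CORE_FIELDS if row.get(name) is None]
    emb = row.get("embedding")
    if emb is not None and len(emb) != EMBEDDING_DIM:
        problems.append(f"embedding has {len(emb)} dims, expected {EMBEDDING_DIM}")
    return problems


def count_by_type(rows):
    """Count rows per document_type (srp, cfr, or reg_guide)."""
    return Counter(row["document_type"] for row in rows)


# Hypothetical sample rows, for illustration only.
rows = [
    {"id": "a", "text": "...", "embedding": [0.0] * 1536,
     "source": "nureg_0800", "document_type": "srp"},
    {"id": "b", "text": "...", "embedding": [0.0] * 1536,
     "source": "reg_guide", "document_type": "reg_guide", "guide_id": "1.1"},
]
bad = {row["id"]: check_row(row) for row in rows if check_row(row)}
counts = count_by_type(rows)
```

Columns such as guide_id, cfr_part, and chapter apply only to some document types, so a check like this deliberately leaves them optional.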

Usage

from datasets import load_dataset

ds = load_dataset("davenporten/nrc-regulatory-embeddings")
df = ds["train"].to_pandas()

Or load directly with pandas:

import pandas as pd
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")

Load into ChromaDB

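import chromadb
import pandas as pd

df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")

# Connect to a Chroma server that is already running on localhost:8000.
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("regulations")

# Every column other than id, text, and embedding is stored as metadata.
meta_cols = [c for c in df.columns if c not in ("id", "text", "embedding")]

batch_size = 500
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    col.add(
        ids=batch["id"].tolist(),
        documents=batch["text"].tolist(),
        embeddings=batch["embedding"].tolist(),
        # Drop null fields (e.g. guide_id on SRP rows); some Chroma versions reject None values.
        metadatas=[
            {k: v for k, v in rec.items() if pd.notna(v)}
            for rec in batch[meta_cols].to_dict("records")
        ],
    )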
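Once the collection is populated, it can be queried with a precomputed query embedding. The following is a minimal sketch, not part of the original project: `top_chunks` is a hypothetical helper, and it assumes `col` is the collection created above and `query_vec` is a 1536-dimensional vector produced by text-embedding-3-small. It relies on Chroma's `Collection.query` with its `query_embeddings`, `n_results`, and `where` parameters.

```python
def top_chunks(col, query_vec, k=5, document_type=None):
    """Query a Chroma collection with a precomputed query embedding.

    Returns a list of (id, text, distance) tuples, nearest first.
    """
    # Optional metadata filter on the document_type column (srp, cfr, reg_guide).
    where = {"document_type": document_type} if document_type else None
    res = col.query(query_embeddings=[query_vec], n_results=k, where=where)
    # Chroma returns one inner list per query embedding; exactly one was sent.
    return list(zip(res["ids"][0], res["documents"][0], res["distances"][0]))
```

For example, `top_chunks(col, query_vec, k=5, document_type="srp")` would restrict results to Standard Review Plan chunks.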

Embeddings

Generated with OpenAI text-embedding-3-small (1536 dimensions). To search these vectors without re-embedding the documents, embed your queries with the same model.
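The reason the model must match is that similarity scores are only meaningful between vectors from the same embedding space. The sketch below shows the comparison itself using cosine similarity in plain Python. The function names and the toy 3-dimensional vectors are illustrative assumptions; real rows carry 1536-dimensional vectors, and the query vector would come from text-embedding-3-small.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=3):
    """Return the k (score, id) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, r["embedding"]), r["id"]) for r in rows]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.
rows = [
    {"id": "chunk-a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "chunk-b", "embedding": [0.0, 1.0, 0.0]},
    {"id": "chunk-c", "embedding": [0.7, 0.7, 0.0]},
]
results = top_k([1.0, 0.1, 0.0], rows, k=2)
```

A brute-force scan like this is fine for small experiments; for all 37,734 chunks a vector store or a vectorized matrix product is likely the more practical choice.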

License

MIT. The documents are sourced from publicly available NRC publications.
