NRC Regulatory Embeddings
37,734 embedded text chunks from NRC nuclear regulatory documents, ready for use in RAG pipelines.
Built for the nrc-licensing-rag project, an AI system for analyzing nuclear Combined License Applications (COLAs).
Contents
| Source | Count |
|---|---|
| NUREG-0800 (Standard Review Plan) chapters 1-19 | 2,436 sections |
| 10 CFR Parts 20, 50, 51, 52, 72, 73, 100 | ~504 sections |
| Regulatory Guide Division 1 (1.1-1.262) | 242 guides |
| Regulatory Guide Division 4 (4.1-4.28) | 27 guides |
| Total chunks | 37,734 |
Schema
| Column | Type | Description |
|---|---|---|
| id | string | Unique chunk ID |
| text | string | Document chunk text |
| embedding | list[float] | 1536-dim OpenAI text-embedding-3-small vector |
| source | string | Source identifier (e.g. nureg_0800, 10cfr50, reg_guide) |
| document_type | string | srp, cfr, or reg_guide |
| document_id | string | Document identifier |
| title | string | Section or guide title |
| section_id | string | NRC section number |
| chapter | string | Chapter (for SRP documents) |
| chunk_index | int | Position of chunk within source document |
| source_url | string | NRC.gov URL where available |
| guide_id | string | Regulatory Guide number (e.g. 1.1) |
| cfr_part | string | CFR part number |
| division | string | Regulatory Guide division |
Usage
from datasets import load_dataset
ds = load_dataset("davenporten/nrc-regulatory-embeddings")
df = ds["train"].to_pandas()
Or load directly with pandas:
import pandas as pd
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
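Once the frame is loaded, the schema columns can be used to narrow it to one document type and to stack the embeddings into a NumPy matrix for local search. The following is a minimal sketch: the helper name `select_embeddings` is hypothetical, and it assumes every `embedding` cell holds a list or array of the same length.

```python
import numpy as np
import pandas as pd


def select_embeddings(df: pd.DataFrame, document_type: str):
    """Return (subset, matrix) for rows whose document_type matches.

    Hypothetical helper for illustration; not part of the dataset or any library.
    """
    # Keep only rows of the requested type (srp, cfr, or reg_guide).
    subset = df[df["document_type"] == document_type].reset_index(drop=True)
    if subset.empty:
        return subset, np.empty((0, 0), dtype=np.float32)
    # Stack the per-row vectors into a 2-D array of shape (rows, dims).
    matrix = np.array(subset["embedding"].tolist(), dtype=np.float32)
    return subset, matrix
```

Calling `select_embeddings(df, "srp")` would return the Standard Review Plan rows together with a matrix whose row order matches the returned frame.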
Load into ChromaDB
import chromadb
import pandas as pd
df = pd.read_parquet("hf://datasets/davenporten/nrc-regulatory-embeddings/data/nrc-regulatory-embeddings.parquet")
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("regulations")
batch_size = 500
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    col.add(
        ids=batch["id"].tolist(),
        documents=batch["text"].tolist(),
        embeddings=batch["embedding"].tolist(),
        metadatas=batch.drop(columns=["id", "text", "embedding"]).to_dict("records"),
    )
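Columns such as guide_id, cfr_part, and chapter apply to only one document type, so they may be empty for other rows. That is an assumption about the data, not something stated here. Some ChromaDB versions reject None metadata values, so a sketch like the following could drop empty fields before the records are passed to `col.add`. The helper name `clean_metadata` is hypothetical.

```python
import math


def clean_metadata(records):
    """Drop None and NaN values from a list of metadata dicts.

    Hypothetical helper; `records` is the list returned by
    DataFrame.to_dict("records").
    """
    cleaned = []
    for record in records:
        kept = {}
        for key, value in record.items():
            # Skip missing values, whether stored as None or as float NaN.
            if value is None:
                continue
            if isinstance(value, float) and math.isnan(value):
                continue
            kept[key] = value
        cleaned.append(kept)
    return cleaned
```

In the loop above, the `metadatas` argument would then wrap the existing expression: `clean_metadata(batch.drop(columns=["id", "text", "embedding"]).to_dict("records"))`.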
Embeddings
Generated with OpenAI text-embedding-3-small (1536 dimensions). To search the dataset without re-embedding its documents, embed your queries with the same model.
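Because the stored vectors all come from one model, a query vector from that model can be compared against them directly. Below is a minimal sketch of cosine-similarity search over a matrix of stored embeddings. `top_k_cosine` is a hypothetical helper, it assumes no vector is all zeros, and obtaining the query vector from the embedding model is left out.

```python
import numpy as np


def top_k_cosine(query, matrix, k=5):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`.

    Hypothetical helper; assumes every vector has a non-zero norm.
    """
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    # Normalise both sides so the dot product equals cosine similarity.
    q_norm = query / np.linalg.norm(query)
    m_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m_norm @ q_norm
    # Sort by descending similarity and keep at most k results.
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

The returned indices refer to rows of the matrix, so they can be used with `df.iloc` when the matrix was built from `df["embedding"]` in the same row order.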
License
MIT. The documents are sourced from publicly available NRC publications.