GitHub - eliask/farchive: Local content-addressed archive with observation history. Stores bytes by SHA-256, preserves locator state as contiguous spans, compresses with zstd and corpus-trained dictionaries. SQLite-backed.

farchive

farchive (far archive) — a local, history-preserving archive for opaque bytes observed at named locators.

Farchive stores raw bytes once by SHA-256 digest, preserves each locator's observation history as contiguous spans, and optimizes physical storage with zstd compression, corpus-trained dictionaries, locator-local delta encoding, and content-defined chunking. Everything lives in one SQLite file, so the corpus packs efficiently while the archive stays directly queryable with SQL.

$ farchive stats finlex.farchive

DB path:          finlex.farchive
Schema version:   3
Generated by:     farchive 3.1.0
Locators:         346,070
Blobs:            391,943
Spans:            385,854
Dictionaries:     3
Raw bytes:        33,072,048,565
Stored bytes:     5,198,041,734
Compression:      6.36x

Codec distribution:
  raw                 6 blobs           137 raw           137 stored  (1.0x)
  zstd           49,707 blobs  1,545,588,831 raw   254,529,608 stored  (6.1x)
  zstd_delta     39,817 blobs  8,277,115,278 raw    55,313,750 stored  (149.6x)
  zstd_dict     302,413 blobs  23,249,344,319 raw  4,887,903,327 stored  (4.8x)

Storage class distribution:
  pdf             4,931 blobs  4,066,712,995 raw  3,534,337,585 stored  (1.2x)
  xml           248,586 blobs  8,301,639,278 raw   842,189,428 stored  (9.9x)
  html           90,229 blobs  19,429,007,458 raw   593,778,823 stored  (32.7x)
  text           45,229 blobs  1,180,927,761 raw   134,384,957 stored  (8.8x)
  gif             2,964 blobs    93,742,557 raw    93,046,078 stored  (1.0x)
  unknown             3 blobs         6,002 raw         6,012 stored  (1.0x)
  json                1 blobs        12,514 raw         3,939 stored  (3.2x)

Why

Most tools make you choose between a cache, a blob store, a version-control system, and a web archive. Farchive is the boring local thing in the middle: you record what bytes you observed at a locator and when, read them back exactly, resolve the current state or the state at a past time, and keep repetitive corpora compact.

  • Preserve what was observed. If a locator goes A -> B -> A, that is three spans, not one collapsed record.
  • Store bytes once. Identical payloads deduplicate by digest.
  • Query it simply. Latest, as-of, history, freshness.
  • Keep it small. Repetitive corpora benefit from trained zstd dictionaries, delta encoding for similar versions, and chunk-level dedup for large blobs.
  • Keep it local and boring. One SQLite file, no server, no daemon.

I wanted a local tool that combined content-addressed dedup, locator history, and corpus-adaptive compression in one queryable file.
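
The content-addressed core is easy to picture with the standard library alone. This is a toy sketch of the idea, not farchive's actual schema: bytes are keyed by their SHA-256 digest, so storing the same payload twice costs nothing.

```python
import hashlib

class ToyBlobStore:
    """Toy content-addressed store: one physical copy per distinct payload."""

    def __init__(self):
        self.blobs = {}  # digest (hex) -> raw bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical payloads dedupe here
        return digest

    def read(self, digest: str) -> bytes:
        return self.blobs[digest]

store = ToyBlobStore()
d1 = store.put(b"<doc>same bytes</doc>")
d2 = store.put(b"<doc>same bytes</doc>")
# Two observations, one stored blob: d1 == d2 and len(store.blobs) == 1
```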

Use cases

Web scraping with change detection. Archive pages as you crawl. Query what changed between observations. Detect when a page reverted. Freshness checks avoid redundant fetches.

import httpx  # HTTP client used in this example
from farchive import Farchive

with Farchive("scrape.farchive") as fa:
    for url in urls:  # urls: iterable of pages to crawl
        if not fa.has(url, max_age_hours=24):
            resp = httpx.get(url)
            fa.store(url, resp.content, storage_class="html")
    # Later: what changed?
    for span in fa.history("https://example.com/pricing"):
        print(f"{span.observed_from}  {span.digest[:12]}")

API response archival. Store every response from a REST or SOAP API. Dedup means identical responses cost nothing. Point-in-time queries let you reconstruct what you knew at any moment.

Legal/regulatory corpus management. Archive legislation, regulations, court decisions. Track amendments over time. Corpus-trained zstd dictionaries compress thousands of structurally similar XML documents at 5-10x ratios. Delta encoding captures small amendments efficiently. (This is the use case farchive was extracted from.)

ML dataset versioning. Store training data snapshots at locators like dataset://v3/train.jsonl. Content-addressed storage means identical data across versions is stored once. History shows the full lineage. Large datasets benefit from chunk-level dedup.

Configuration/infrastructure snapshots. Periodically archive config files, terraform state, DNS records. Spans show exactly when each change was first observed.

Install

Farchive is published on PyPI: pypi.org/project/farchive.

Requires Python 3.11+ and zstandard>=0.21.

For content-defined chunking (large-blob dedup):

pip install farchive[chunking]

This adds pyfastcdc for FastCDC-based content-defined chunking. The archive works without it — chunking is an optional optimization.

Quick start

from farchive import Farchive

with Farchive("my_archive.farchive") as fa:
    # Store content at a locator
    fa.store("https://example.com/page", page_bytes, storage_class="html")

    # Retrieve latest content
    data = fa.get("https://example.com/page")

    # Track changes over time
    fa.store("https://example.com/page", new_page_bytes, storage_class="html")
    for span in fa.history("https://example.com/page"):
        print(f"{span.digest[:12]}  {span.observed_from}..{span.observed_until}")

Status

The near-term priorities below are now implemented:

  • importer-facing drift detection via compare_current()
  • machine-readable history and provenance ergonomics
  • better event filtering and locator metadata workflows
  • richer batch import ergonomics without changing core archive semantics

Core concepts

  • Blob: Immutable raw bytes identified by SHA-256 digest. Stored once, deduped by content.
  • Locator: Opaque string naming where content was observed (URL, path, any string).
  • State span: A contiguous run where one locator resolved to one blob. If the same content returns after an interruption, that's a new span — history is preserved.
  • Event (optional): Append-only audit log of archive operations, including observations and maintenance.
  • Storage class: A freeform string label (e.g. "html", "xml", "pdf", "bin", "json", whatever you want) that guides compression strategy. There is no fixed set — any string is valid. Blobs in the same class share dictionaries and local candidate strategy. Common convention is to use MIME-like names, but the archive does not enforce or validate them.
  • Series key: An optional opaque lineage hint used only to widen delta candidate selection across locators in the same version family. It is advisory and has no read-time semantic meaning. If a later confirmation of an open same-digest span supplies a new non-null series_key, the latest non-null value wins.
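
The span model can be sketched in a few lines of plain Python. This is a conceptual model, not farchive's storage layout: consecutive observations of the same digest extend one span, and any change of content, including a revert, opens a new one.

```python
from dataclasses import dataclass

@dataclass
class Span:
    digest: str
    observed_from: int
    observed_until: int  # timestamp of the last confirming observation

def observations_to_spans(observations):
    """Fold (timestamp, digest) observations into contiguous state spans."""
    spans = []
    for ts, digest in observations:
        if spans and spans[-1].digest == digest:
            spans[-1].observed_until = ts  # same content: extend the open span
        else:
            spans.append(Span(digest, ts, ts))  # content changed: new span
    return spans

def resolve_at(spans, ts):
    """Point-in-time resolution: the span covering timestamp ts, if any."""
    for span in reversed(spans):
        if span.observed_from <= ts <= span.observed_until:
            return span
    return None

# A -> B -> A yields three spans, not one collapsed record:
spans = observations_to_spans([(1, "A"), (2, "A"), (3, "B"), (4, "A")])
```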

API

Write

fa.put_blob(data, storage_class="xml")                      # store blob, return digest
fa.observe(locator, digest)                                 # record observation
fa.observe(locator, digest, observed_at=ts, metadata={"k": "v"}, series_key="series/doc-123")  # with time, metadata, and lineage hint
fa.store(locator, data)                                     # put_blob + observe (atomic)
fa.store(locator, data, observed_at=ts, storage_class="html", series_key="series/doc-123", metadata={"k": "v"})       # with time, class, lineage hint, and metadata
fa.store_batch([(loc, data), ...], progress=callback, series_key="series/doc-123")       # shared defaults for batch
fa.store_batch(
    [BatchItem(locator=loc, data=data, observed_at=ts, storage_class="html", series_key="series/doc-123", metadata={"k": "v"})],
    progress=callback,
)                                                           # per-item metadata/timestamps

observe(), store(), and store_batch() may use prior blobs from the same locator as delta candidates when beneficial. If series_key is provided, delta candidates can also come from other locators in the same lineage key. put_blob() has no locator context and skips delta encoding. store_batch() accepts both legacy tuples and BatchItem. Shared observed_at / storage_class / series_key defaults are used only when an item does not set its own value.

Series key contract:

  • optional, typed hint; advisory and additive only
  • one value per span; never changes archive semantics
  • additive and explicit: no profile object, no profile switching
  • does not affect reads, span identity, resolver results, or history semantics
  • implemented to widen delta candidate lookup only
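
The widening can be illustrated with a stdlib-only sketch (hypothetical names; the real selection also ranks and caps candidates): without a series_key, delta bases come only from the locator's own history; with one, spans sharing that key contribute too. Delta-encoded blobs are excluded either way, keeping deltas depth-1.

```python
def delta_candidates(spans, locator, series_key=None):
    """Collect delta-base digests: locator-local history, widened by series_key.

    spans: dicts with 'locator', 'digest', 'codec', and optional 'series_key'.
    """
    out = []
    for span in spans:
        if span["codec"] == "zstd_delta":
            continue  # depth-1 rule: a delta base is never itself a delta
        same_locator = span["locator"] == locator
        same_series = series_key is not None and span.get("series_key") == series_key
        if same_locator or same_series:
            out.append(span["digest"])
    return out

spans = [
    {"locator": "a", "digest": "d1", "codec": "zstd"},
    {"locator": "b", "digest": "d2", "codec": "zstd", "series_key": "s1"},
    {"locator": "c", "digest": "d3", "codec": "zstd_delta", "series_key": "s1"},
]
# delta_candidates(spans, "a")                  -> ["d1"]
# delta_candidates(spans, "a", series_key="s1") -> ["d1", "d2"]
```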

Machine-readable span outputs:

  • in the API, StateSpan includes series_key when present
  • machine-readable CLI outputs (resolve --json, history --json, and ls spans --json) now include series_key when present

Read

fa.read(digest)                    # exact bytes by digest
fa.compare_current(locator, data=bytes)   # locator state vs candidate bytes (status: absent/same/changed)
fa.compare_current(locator, digest=digest)   # locator state vs candidate digest
fa.resolve(locator)                # current StateSpan
fa.resolve(locator, at=timestamp)  # point-in-time span
fa.get(locator)                    # convenience: resolve + read
fa.get(locator, at=timestamp)      # bytes at a point in time
fa.history(locator)                # all spans, newest first
fa.has(locator, max_age_hours=24)  # freshness check
fa.locators(pattern="https://%")   # list locators (LIKE pattern)
fa.events(locator)                 # audit log (if event history exists)
fa.events(locator, since=ts)       # events since timestamp

fa.compare_current() requires exactly one of data or digest.

Maintenance

fa.train_dict(storage_class="xml")          # train zstd dictionary, returns dict_id
fa.repack(storage_class="xml")                                  # recompress with trained dict, returns RepackStats
fa.repack(storage_class="xml", series_key="series/doc-123")       # recompress one lineage cohort
fa.rechunk(storage_class="bin")                                 # convert large blobs to chunked form, returns RechunkStats
fa.rechunk(storage_class="bin", series_key="series/doc-123")      # target one lineage cohort for chunking maintenance
fa.purge(["loc/a", "loc/b"])                # remove locators and unreachable blobs, returns PurgeStats
fa.stats()                                  # archive statistics, returns ArchiveStats
fa.close()                                  # close connection (automatic with context manager)

Data types

All types are importable from farchive:

  • StateSpan — one contiguous run of a locator resolving to one blob, including optional series_key
  • Event — one audit record (event_id, occurred_at, locator, digest, kind, metadata)
  • CompressionPolicy — configurable storage optimization knobs
  • ImportStats — results from store_batch()
  • BatchItem — input envelope for richer batch ingestion (series_key optional lineage hint)
  • LocatorHeadComparison — result of compare_current()
  • RepackStats — results from repack() (blobs_repacked, bytes_saved)
  • RechunkStats — results from rechunk() (blobs_rewritten, chunks_added, bytes_saved)
  • PurgeStats — results from purge() (locators_requested, locators_purged, spans_deleted, blobs_deleted, chunks_deleted, dry_run)
  • ArchiveStats — snapshot of archive state (locator_count, blob_count, span_count, dict_count, total_raw_bytes, total_stored_bytes, compression_ratio, codec_distribution, db_path, schema_version, chunk_count, db_file_bytes)

Constructor

Farchive(path, compression=CompressionPolicy(), enable_events=False)
Parameter      Type               Default         Notes
path           str | Path         required        SQLite file path
compression    CompressionPolicy  defaults below  Policy knobs
enable_events  bool               False           Creates event table on first use

Compression

Farchive uses layered storage optimization. Phase 1 and Phase 2 are automatic write-path strategies. Phase 3 is an explicit maintenance transform.

Phase 1 — Inline compression (write path)

  1. Raw — blobs under the raw threshold (default 64 bytes) are stored uncompressed
  2. Vanilla zstd — standard compression
  3. Dictionary zstd — corpus-trained dictionaries for configured storage classes

Storage classes are freeform strings — any value is valid ("html", "xml", "bin", "my-app/v2", whatever). The archive does not validate or enforce any convention. They are optimization buckets: dictionaries are trained per-class.

Phase 2 — Delta encoding (write path)

When storing a blob at a locator that has prior versions, farchive may encode it as a zstd_delta against a similar prior blob. This captures small changes (edits, patches, amendments) very efficiently.

Delta is depth-1 (delta bases are never themselves deltas), and only used when it beats the best inline frame by a configurable margin. Candidate selection includes locator-local history and the optional same-series_key lane for related streams. Delta candidates remain inline-only (raw, zstd, zstd_dict) — chunked blobs are excluded to maintain a clean separation between the delta path (small changes between similar inline blobs) and the chunking path (large-blob dedup via maintenance). Disabled by setting delta_enabled=False.
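
The acceptance margin reduces to simple arithmetic. Here is a hedged sketch of the gating, assuming (our reading of the knobs) that delta_min_gain_ratio bounds the delta's size relative to the best inline frame and delta_min_gain_bytes sets an absolute floor on the savings:

```python
def accept_delta(inline_size: int, delta_size: int,
                 min_gain_ratio: float = 0.95, min_gain_bytes: int = 128) -> bool:
    """Keep a delta frame only if it beats the best inline frame by a margin,
    both relatively (ratio gate) and absolutely (byte gate)."""
    return (delta_size <= inline_size * min_gain_ratio
            and inline_size - delta_size >= min_gain_bytes)

# 10 KiB inline vs 1 KiB delta clears both gates easily.
# A 2% improvement fails the 5% ratio gate even though it saves > 128 bytes.
# Saving only 120 bytes fails the byte gate even when the ratio gate passes.
```

Requiring both gates keeps deltas from accumulating where they barely pay for the extra read-path indirection.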

Phase 3 — Content-defined chunking (maintenance only)

Large blobs (default ≥ 1 MiB) can be split into content-defined chunks via FastCDC. Chunks are deduplicated archive-wide by their own SHA-256 digest. This is most effective when many large blobs share substantial regions — different versions of a dataset, VM images, etc.
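
Why content-defined (rather than fixed-size) chunking survives insertions can be shown with a toy rolling-hash chunker. This is not FastCDC, just the underlying idea: boundaries depend on local content, so a one-byte edit disturbs only nearby chunks and the rest still dedupe by digest.

```python
import hashlib
import random

def toy_cdc(data: bytes, mask: int = 0x3F, min_size: int = 32):
    """Cut a chunk when a content-derived rolling hash hits a boundary pattern."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF  # depends only on recent bytes
        if i - start + 1 >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
original = bytes(random.randrange(256) for _ in range(20_000))
edited = original[:100] + b"X" + original[100:]  # single-byte insertion

digests = lambda cs: {hashlib.sha256(c).hexdigest() for c in cs}
shared = digests(toy_cdc(original)) & digests(toy_cdc(edited))
# Most chunk digests survive the edit; with fixed-size chunks every chunk
# after offset 100 would shift and miss the dedup index.
```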

Chunking is not applied automatically on write. Use rechunk() to rewrite eligible inline blobs into chunked form when beneficial. Requires the chunking extra (pip install farchive[chunking]).

All compression is transparent: read() and get() always return exact raw bytes regardless of physical representation.

Dictionary training is policy-driven. Defaults auto-train for xml (at 1000 blobs), html (at 500), and pdf (at 16). Other classes can use dictionaries trained manually via train_dict(). After training, new blobs use the dictionary immediately. Run repack() to recompress older blobs.

CompressionPolicy

All knobs are configurable at construction time:

from farchive import CompressionPolicy

policy = CompressionPolicy(
    # Phase 1: inline
    raw_threshold=64,
    compression_level=3,
    auto_train_thresholds={"xml": 1000, "html": 500, "pdf": 16},
    dict_target_sizes={"xml": 112*1024, "html": 112*1024, "pdf": 64*1024},

    # Phase 2: delta
    delta_enabled=True,
    delta_min_size=4*1024,
    delta_max_size=8*1024*1024,
    delta_candidate_count=4,
    delta_size_ratio_min=0.5,
    delta_size_ratio_max=2.0,
    delta_min_gain_ratio=0.95,
    delta_min_gain_bytes=128,

    # Phase 3: chunking
    chunk_enabled=True,
    chunk_min_blob_size=1*1024*1024,
    chunk_avg_size=256*1024,
    chunk_min_size=64*1024,
    chunk_max_size=1*1024*1024,
    chunk_min_gain_ratio=0.95,
    chunk_min_gain_bytes=4096,
)

rechunk()

Explicit maintenance operation that converts eligible inline blobs into chunked representation for cross-blob dedup. Not applied automatically on write.

stats = fa.rechunk()                                    # all eligible blobs
stats = fa.rechunk(storage_class="bin")                 # only one class
stats = fa.rechunk(series_key="series/doc-123")         # one lineage cohort only
stats = fa.rechunk(batch_size=50)                       # cap rewrites
stats = fa.rechunk(min_blob_size=2*1024*1024)           # override threshold
Parameter      Type        Default      Notes
storage_class  str | None  None         Restrict candidates
series_key     str | None  None         Restrict to one lineage cohort
batch_size     int         100          Max blobs rewritten per call
min_blob_size  int | None  from policy  Minimum raw size

Returns RechunkStats(blobs_rewritten, chunks_added, bytes_saved). Preserves digests, raw bytes, spans, and query results.

CLI

farchive stats [db_path]
farchive cat [db_path] [locator ...] [--digest DIGEST] [--pattern PAT] [--at TS] [--prefix-with-locator] [--unique]
farchive history [db_path] <locator> [--json]
farchive locators [db_path] [--pattern PAT]
farchive find [db_path] <query> [--prefix]
farchive events [db_path] [--locator LOC] [--locator-prefix LOC] [--kind KIND] [--digest DIGEST] [--since TS] [--limit N]
farchive resolve [db_path] <locator> [--at TS] [--json]
farchive meta [db_path] <locator> [--at TS] [--json]
farchive inspect [db_path] <digest>
farchive train-dict [db_path] [--storage-class xml]
farchive repack [db_path] [--storage-class xml] [--series-key key] [--batch-size 1000]
farchive rechunk [db_path] [--storage-class bin] [--series-key key] [--batch-size 100] [--min-blob-size N]
farchive purge [db_path] <locator> [<locator> ...] [--dry-run] [--confirm] [--json]

cat writes raw bytes to stdout. Accepts multiple locators to concatenate, --pattern to select locators by LIKE pattern, --digest for a single digest lookup, and --at for point-in-time retrieval. --prefix-with-locator prepends ==> locator <== headers (like head -v) when streaming multiple locators. --unique deduplicates by digest, skipping locators that resolve to already-emitted content.

inspect shows blob metadata, including chunk references and unique stored size for chunked blobs. history --json, resolve --json, and ls spans --json return machine-readable span records including series_key when present. ls spans --series-key filters spans to one lineage cohort for operator inspection. events --kind, events --digest, and events --locator-prefix provide locator-oriented event filters mirroring the API's event queries. find searches locators by substring, or by prefix with --prefix. meta is a thin CLI alias for resolve.

Design

  • Single SQLite file, WAL mode
  • SHA-256 content identity
  • Positive-observation model (records what was seen, not what was absent)
  • Span-based history (A->B->A creates 3 spans, not 1 collapsed record)
  • Monotone observation time enforced per locator
  • Optional event audit log with public read API
  • Layered compression: inline zstd, trained dictionaries, locator-local deltas, chunk dedup
  • rechunk() for explicit large-blob chunking maintenance
  • Configurable CompressionPolicy (training is automatic, repack is explicit)
  • File-based write lock for multi-process safety (POSIX fcntl; no-lock fallback on Windows)
  • Not thread-safe (one instance per thread, enforced by SQLite)
  • No HTTP, no domain-specific logic — the caller brings bytes