GitHub - hephaistos-io/alexandria: Free open source OSINT platform for gathering, ingesting, and analyzing open-source intelligence data

An OSINT platform for gathering, ingesting, and analyzing open-source intelligence data. Named after the Library of Alexandria — the ambition is to collect and organize knowledge from diverse sources into a unified, queryable system.

Actively developed as a learning project for Python, data pipelines, and NLP. Built with AI coding assistants.

Core Idea

  • Data Ingestion: Pull data from multiple open sources (APIs, feeds, scraped content)
  • Processing Pipeline: Clean, normalize, and enrich raw data into structured intelligence
  • On-Demand Model Training: Fine-tune small ML models on collected data for domain-specific analysis (only manual labelling is implemented so far; the training pipeline does not exist yet)
  • Search & Analysis: Query the knowledge base and surface patterns across sources

Current Status

Active development — pipeline is functional end-to-end.

Newest updates (08/04/2026):

  • Added NASA EONET for climate-based events (wildfires and the like)
  • Show the paths of cyclones and icebergs on the map, based on available data
  • Improved active conflict data location mapping through filtering and nearest-country verification

  • Data Ingestion: Generic RSS scraper can be used for various sources that provide one
  • Conflict Data Ingestion: Three independent fetcher services (osint-geo-fetcher, ucdp-fetcher, gdelt-fetcher) pull geolocated armed conflict events from OSINT sources (Bellingcat, Texty, etc.), the UCDP Candidate Events API, and GDELT
  • Natural Disaster Ingestion: A dedicated fetcher pulls geolocated natural events (wildfires, severe storms, volcanoes, sea ice, floods) from NASA EONET every 30 minutes, preserving the full geometry timeline so hurricane and iceberg tracks can be replayed on the map
  • World Map Influence: See which news stories affect which countries, overlaid with a conflict event heatmap, a natural disasters layer with magnitude-driven marker sizing and movement trails, and toggleable map layers
  • Processing Pipeline: Fetches articles, extracts entities, and classifies them (topic labels via the topic-tagger, entity roles via the role-classifier)
  • On-Demand Model Training: Not implemented yet
  • Search & Analysis: A basic graph database and viewer show current events, persons, and their relations (the default relations are a bit wonky)

Frontend Usability

The frontend is a React SPA at http://localhost:5173. The sidebar has seven main sections:

| Menu Item | What it shows |
| --- | --- |
| INTERCEPT_FEED | World map with article, conflict event, and natural disaster markers, clustered by location. A heatmap layer visualizes conflict density (amber → red gradient). Natural disasters render as green markers whose size scales with magnitude (wildfire area, hurricane wind speed, sea-ice extent); hovering or selecting a moving disaster draws a fading directional trail from its earliest observation to its current position. Layer toggles (Articles / Conflicts / Heatmap / Events / Disasters) let you show or hide each data source. Clicking a marker opens the corresponding detail card in the right-hand feed panel; disasters get a dedicated card with magnitude tier, active/closed status, and source links. A floating status widget shows live pipeline health. |
| INFRASTRUCTURE | Interactive pipeline topology (React Flow diagram auto-generated from Docker Compose labels), container health, queue metrics, uptime stats, and a live terminal log. |
| LABELLING | Two tabs. LABEL_ASSIGNMENT: table of articles with filters and manual label editing. LABEL_SCHEMA: create, edit, and delete the classification labels that the topic-tagger uses. |
| ATTRIBUTION | Two tabs. ROLE_ASSIGNMENT: article list with entity role assignments and inline editing. ROLE_SCHEMA: manage the entity role types (name, description, color) used by the role-classifier. |
| AFFILIATION_GRAPH | Two tabs. RELATION_GRAPH: force-directed graph of entities and their relations from Neo4j, with temporal decay controls (lambda slider, min-strength filter). RELATION_TYPES: manage relation type definitions (name, description, color, directed/undirected). |
| SIGNAL_ARCHIVE | Searchable, paginated card grid of all ingested articles. Click through to the detail page showing full text, extracted entities (with Wikidata IDs and coordinates), and metadata. |
| TERMINAL_LOG | Real-time log stream from all services via WebSocket, with per-service filtering, search, and an error panel with acknowledge buttons. |
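
The temporal decay controls in RELATION_GRAPH (lambda slider, min-strength filter) can be pictured with a small sketch. It assumes exponential decay, strength = weight * exp(-lambda * age_days); the README does not state the actual formula, and `Relation`, `decayed_strength`, and `visible_relations` are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical edge record; the real Neo4j schema may differ."""
    source: str
    target: str
    weight: float    # base strength when the relation was last observed
    age_days: float  # days since the relation was last observed


def decayed_strength(rel: Relation, lam: float) -> float:
    """Assumed exponential decay: weight * exp(-lam * age_days)."""
    return rel.weight * math.exp(-lam * rel.age_days)


def visible_relations(rels, lam, min_strength):
    """Keep only relations whose decayed strength passes the filter."""
    return [r for r in rels if decayed_strength(r, lam) >= min_strength]
```

Under this assumption, a larger lambda makes old relations fade faster, and lambda = 0 disables decay entirely.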

Architecture

flowchart LR
    FETCH["article-fetcher"] -- articles.rss --> SCRAPE["article-scraper"]

    SCRAPE --> FO1{{"articles.scraped (fanout)"}}
    FO1 -- articles.raw --> NER["ner-tagger"]
    FO1 -- articles.training --> STORE["article-store"]

    NER -- articles.tagged --> RESOLVE["entity-resolver"]
    RESOLVE -- articles.resolved --> ROLE["role-classifier"]
    ROLE -- articles.role-classified --> TOPIC["topic-tagger"]

    TOPIC --> FO2{{"articles.classified (fanout)"}}
    FO2 -- articles.classified.store --> LABEL["label-updater"]
    FO2 -- articles.classified.relation --> RELEXT["relation-extractor"]

    OSINT["osint-geo-fetcher"] -- conflict_events.raw --> CSTORE["conflict-store"]
    UCDP["ucdp-fetcher"] -- conflict_events.raw --> CSTORE
    GDELT["gdelt-fetcher"] -- conflict_events.raw --> CSTORE

    EONET["nasa-eonet-fetcher"] -- natural_disasters.raw --> DSTORE["disaster-store"]

    FETCH -.- RED[("Redis")]
    OSINT -.- RED
    UCDP -.- RED
    GDELT -.- RED
    EONET -.- RED
    RESOLVE -.- RED
    STORE -.- PG[("PostgreSQL")]
    LABEL -.- PG
    ROLE -.- PG
    TOPIC -.- PG
    RELEXT -.- PG
    CSTORE -.- PG
    DSTORE -.- PG
    RELEXT -.- NEO[("Neo4j")]

    PG -.- API["monitoring-api"]
    NEO -.- API
    API -.- FE["Frontend"]

All services communicate via RabbitMQ queues, and queue names are shown on each edge. Fanout exchanges split the stream to multiple consumers. Dashed lines (-.-) show store connections: PostgreSQL for articles, conflict events, and natural disasters; Redis for dedup, scheduling, and entity-resolver lookups; Neo4j for the knowledge graph.
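
As a rough illustration of the fanout step, the sketch below declares a fanout exchange and binds one queue per consumer, using the exchange and queue names from the diagram. The `declare_fanout` helper, the `durable=True` setting, and the `RecordingChannel` stand-in are assumptions; the method names and keyword arguments match pika's `BlockingChannel`, but this is not taken from the project's code.

```python
def declare_fanout(channel, exchange, queues):
    """Declare a fanout exchange and bind each consumer queue to it.

    `channel` is expected to behave like a pika BlockingChannel.
    """
    channel.exchange_declare(exchange=exchange, exchange_type="fanout", durable=True)
    for queue in queues:
        channel.queue_declare(queue=queue, durable=True)
        # Fanout exchanges ignore routing keys, so none is given here.
        channel.queue_bind(queue=queue, exchange=exchange)


class RecordingChannel:
    """Stand-in that records calls so the sketch runs without a broker."""

    def __init__(self):
        self.calls = []

    def exchange_declare(self, **kwargs):
        self.calls.append(("exchange_declare", kwargs))

    def queue_declare(self, **kwargs):
        self.calls.append(("queue_declare", kwargs))

    def queue_bind(self, **kwargs):
        self.calls.append(("queue_bind", kwargs))


channel = RecordingChannel()
declare_fanout(channel, "articles.scraped", ["articles.raw", "articles.training"])
```

With a real broker, `channel` would come from `pika.BlockingConnection(...).channel()`; every message published to the exchange is then copied to both bound queues.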

The conflict data pipeline runs in parallel to the article pipeline. Three independent fetcher services (osint-geo-fetcher, ucdp-fetcher, and gdelt-fetcher) publish geolocated conflict events to a shared conflict_events.raw queue.

The conflict-store consumer writes events to PostgreSQL with dedup on (source, source_id). The frontend renders these as red markers and an aggregated heatmap layer on the world map.
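
Dedup on (source, source_id) can be sketched with a unique constraint plus `ON CONFLICT ... DO NOTHING`. The example uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere; the table layout, column names, and `store_event` helper are hypothetical, and the real conflict-store may do this differently.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; table and column names are hypothetical.
SCHEMA = """
CREATE TABLE conflict_events (
    source     TEXT NOT NULL,
    source_id  TEXT NOT NULL,
    latitude   REAL,
    longitude  REAL,
    UNIQUE (source, source_id)
)
"""

INSERT = """
INSERT INTO conflict_events (source, source_id, latitude, longitude)
VALUES (?, ?, ?, ?)
ON CONFLICT (source, source_id) DO NOTHING
"""


def store_event(conn, event):
    """Insert one event; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        INSERT, (event["source"], event["source_id"], event["lat"], event["lon"])
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
event = {"source": "ucdp", "source_id": "12345", "lat": 50.45, "lon": 30.52}
first = store_event(conn, event)   # new row
second = store_event(conn, event)  # same (source, source_id), skipped
```

Pushing dedup into the database keeps the consumer idempotent: a redelivered message simply inserts nothing.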

The natural disasters pipeline is the third parallel ingest track. A single fetcher service polls NASA's Earth Observatory Natural Event Tracker (EONET) and publishes to its own queue:

  • nasa-eonet-fetcher: NASA EONET v3 events endpoint covering wildfires, severe storms, volcanoes, sea and lake ice, and floods (every 30 min)
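
One way a 30-minute cadence could be enforced with Redis is a `SET` with `NX` and `EX`: the key is written only if absent and expires after the interval. This is a sketch under assumptions, not the project's scheduler; the `schedule:<name>` key layout, `claim_poll_slot`, and the `FakeRedis` stand-in are hypothetical, while the `set(name, value, nx=..., ex=...)` call shape follows redis-py.

```python
import time

POLL_INTERVAL_S = 30 * 60  # the 30-minute cadence mentioned above


def claim_poll_slot(client, fetcher_name, interval_s=POLL_INTERVAL_S):
    """Return True if this fetcher is due to poll.

    Uses SET with NX + EX, so at most one poll is claimed per interval.
    `client` is expected to behave like redis.Redis.
    """
    key = f"schedule:{fetcher_name}"  # hypothetical key layout
    return bool(client.set(key, "1", nx=True, ex=interval_s))


class FakeRedis:
    """Tiny in-memory stand-in supporting only SET with nx/ex."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        live = name in self._expiry and self._expiry[name] > now
        if nx and live:
            return None  # redis-py returns None when NX blocks the write
        self._expiry[name] = now + ex if ex is not None else float("inf")
        return True
```

A fetcher loop would call `claim_poll_slot` before each request and skip the poll when it returns False.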

The disaster-store consumer writes events to the natural_disasters table with the full EONET geometry timeline preserved as a JSONB column, so moving events (hurricanes, drifting icebergs) can be rendered with directional track overlays on the map. See doc/natural-disasters.md for the full design rationale.
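
To illustrate how a stored geometry timeline might become a map track, the sketch below sorts point observations by date and flips coordinates into (lat, lon) order for the map. It assumes each timeline entry has `date`, `type`, and GeoJSON-style `[lon, lat]` `coordinates`; `track_from_geometry` is a hypothetical helper, not the project's code.

```python
from datetime import datetime


def track_from_geometry(geometry):
    """Turn an EONET-style geometry timeline into an ordered (lat, lon) track.

    Assumes each entry looks like
    {"date": ISO-8601 string, "type": "Point", "coordinates": [lon, lat]}.
    Non-point entries (e.g. polygons) are skipped.
    """
    points = [g for g in geometry if g.get("type") == "Point"]
    points.sort(key=lambda g: datetime.fromisoformat(g["date"].replace("Z", "+00:00")))
    # GeoJSON stores [lon, lat]; Leaflet expects [lat, lon].
    return [(g["coordinates"][1], g["coordinates"][0]) for g in points]
```

The resulting list can be handed to a polyline layer, with the last element being the event's most recent position.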

Running Locally

Important: Running the full stack locally needs a fair amount of resources, and it will still be a bit slow: the local NLP categorization is not optimized and runs on CPU only.

# Start the full stack
docker compose -f docker/local/docker-compose.yml up --build -d

# Include all RSS feeds (default runs BBC, Swissinfo + UN News)
docker compose -f docker/local/docker-compose.yml --profile all-feeds up --build -d

# Frontend (`open` is the macOS command; on Linux use xdg-open)
open http://localhost:5173

# RabbitMQ management
open http://localhost:15672    # guest / guest

# Neo4j browser
open http://localhost:7474     # neo4j / alexandria

# PostgreSQL
psql postgresql://alexandria:alexandria@localhost:5432/alexandria
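
Once the stack is up, a quick way to see which services are reachable is to probe the ports listed above. This is a minimal sketch using only the standard library; the `SERVICES` names, `port_open`, and `stack_status` are hypothetical, and an open port does not guarantee the service behind it is healthy.

```python
import socket

# Ports taken from the commands above; the service names are just labels.
SERVICES = {
    "frontend": 5173,
    "rabbitmq-management": 15672,
    "neo4j-browser": 7474,
    "postgresql": 5432,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in stack_status().items():
        print(f"{name:22s} {'up' if up else 'down'}")
```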

Tooling

Languages & Runtimes

Backend Python 3.13+
Frontend TypeScript 5.9 / React 19
Containers Docker & Docker Compose

Backend

| Tool | Role |
| --- | --- |
| uv | Package management & dependency locking |
| FastAPI | REST API (monitoring-api) |
| uvicorn | ASGI server (monitoring-api) |
| pika | RabbitMQ client (most services) |
| psycopg 3 | PostgreSQL driver |
| redis-py | Redis client (dedup, scheduling, caching) |
| neo4j | Neo4j driver (relation-extractor, monitoring-api) |
| httpx | Async HTTP client |
| websockets | Real-time log streaming (monitoring-api) |
| docker | Docker SDK for container health queries (monitoring-api) |
| Ruff | Linting & formatting |
| pytest | Testing |
| osint-geo-extractor | OSINT conflict event data (Bellingcat, Texty, GeoConfirmed, DefMon, CenInfoRes) |

NLP / ML

| Tool | Role |
| --- | --- |
| spaCy | Named-entity recognition (ner-tagger) |
| Hugging Face Transformers | Zero-shot classification (role-classifier, topic-tagger, relation-extractor) |
| PyTorch | Inference runtime (CPU-only) |
| trafilatura | Article text extraction (article-scraper) |
| feedparser | RSS/Atom parsing (article-fetcher) |

Frontend

| Tool | Role |
| --- | --- |
| Vite | Build tool & dev server |
| React | UI framework |
| React Router | Client-side routing |
| Tailwind CSS | Styling |
| Leaflet / react-leaflet | World map |
| react-leaflet-cluster | Map marker clustering |
| leaflet.heat | Conflict event heatmap layer |
| @xyflow/react | Pipeline topology diagrams |
| @dagrejs/dagre | Graph layout algorithms (pipeline topology) |
| react-force-graph-2d | Entity relation graphs |
| ESLint | Linting |

Infrastructure

| Tool | Role |
| --- | --- |
| RabbitMQ 4 | Message broker (inter-service queues & fanout exchanges) |
| PostgreSQL 17 | Primary datastore (articles, conflict events, labels, roles, relations) |
| Neo4j 5 | Graph database (entity relations) |
| Redis 7 | Cache & scheduling (entity-resolver lookups, feed dedup, fetcher scheduling) |

Design/UX

Design and UX are managed with Google's Stitch AI UX tool.