An OSINT platform for gathering, ingesting, and analyzing open-source intelligence data. Named after the Library of Alexandria — the ambition is to collect and organize knowledge from diverse sources into a unified, queryable system.
Actively developed as a learning project for Python, data pipelines, and NLP. Built with AI coding assistants.
Core Idea
- Data Ingestion: Pull data from multiple open sources (APIs, feeds, scraped content)
- Processing Pipeline: Clean, normalize, and enrich raw data into structured intelligence
- On-Demand Model Training: Fine-tune small ML models on collected data for domain-specific analysis (only manual labelling is implemented so far; the training pipeline does not exist yet)
- Search & Analysis: Query the knowledge base and surface patterns across sources
Current Status
Active development — pipeline is functional end-to-end.
Newest updates (08/04/2026):
- added NASA EONET for climate-based events (wildfires and the like)
- show the paths of cyclones and icebergs on the map, based on available data
- improved active conflict data location mapping through filtering and nearest-country verification
- Data Ingestion: Generic RSS scraper can be used for various sources that provide one
- Conflict Data Ingestion: Three independent fetcher services pull geolocated armed conflict events from OSINT sources (Bellingcat, Texty, etc.), the UCDP Candidate Events API, and GDELT 2.0
- Natural Disaster Ingestion: A dedicated fetcher pulls geolocated natural events (wildfires, severe storms, volcanoes, sea ice, floods) from NASA EONET every 30 minutes, preserving the full geometry timeline so hurricane and iceberg tracks can be replayed on the map
- World Map Influence: See which news items affect which countries, overlaid with a conflict event heatmap, a natural disasters layer with magnitude-driven marker sizing and movement trails, and toggleable map layers
- Processing Pipeline: Fetches articles, extracts entities, and categorizes them for different goals (roles, topics, relations)
- On-Demand Model Training: Not implemented yet
- Search & Analysis: Basic graph database & viewer show current events, persons and their relations (The default relations are a bit wonky)
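The nearest-country verification mentioned in the updates above could look roughly like the following. This is a minimal sketch, not the project's implementation: the centroid table, the function names, and the idea of comparing against single centroid points are all assumptions made for illustration.

```python
import math

# Rough, illustrative centroids as (lat, lon). These values are assumptions;
# a real check would use proper country geometries instead of single points.
CENTROIDS = {
    "UA": (49.0, 32.0),
    "PL": (52.0, 19.0),
    "SD": (15.0, 30.0),
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_country(lat: float, lon: float) -> str:
    """Return the code of the country whose centroid is closest."""
    return min(CENTROIDS, key=lambda code: haversine_km(lat, lon, *CENTROIDS[code]))


def location_is_plausible(lat: float, lon: float, claimed_country: str) -> bool:
    """Flag events whose coordinates do not match the claimed country."""
    return nearest_country(lat, lon) == claimed_country
```

An event reported for one country but located closest to another would fail `location_is_plausible` and could be filtered out or re-assigned.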
Frontend Usability
The frontend is a React SPA at http://localhost:5173. The sidebar has seven main sections:
| Menu Item | What it shows |
|---|---|
| INTERCEPT_FEED | World map with article, conflict event, and natural disaster markers, clustered by location. A heatmap layer visualizes conflict density (amber → red gradient). Natural disasters render as green markers whose size scales with magnitude (wildfire area, hurricane wind speed, sea-ice extent); hovering or selecting a moving disaster draws a fading directional trail from its earliest observation to its current position. Layer toggles (Articles / Conflicts / Heatmap / Events / Disasters) let you show or hide each data source. Clicking a marker opens the corresponding detail card in the right-hand feed panel — disasters get a dedicated card with magnitude tier, active/closed status, and source links. A floating status widget shows live pipeline health. |
| INFRASTRUCTURE | Interactive pipeline topology (React Flow diagram auto-generated from Docker Compose labels), container health, queue metrics, uptime stats, and a live terminal log. |
| LABELLING | Two tabs: LABEL_ASSIGNMENT — table of articles with filters and manual label editing. LABEL_SCHEMA — create, edit, and delete the classification labels that the topic-tagger uses. |
| ATTRIBUTION | Two tabs: ROLE_ASSIGNMENT — article list with entity role assignments and inline editing. ROLE_SCHEMA — manage the entity role types (name, description, color) used by the role-classifier. |
| AFFILIATION_GRAPH | Two tabs: RELATION_GRAPH — force-directed graph of entities and their relations from Neo4j, with temporal decay controls (lambda slider, min-strength filter). RELATION_TYPES — manage relation type definitions (name, description, color, directed/undirected). |
| SIGNAL_ARCHIVE | Searchable, paginated card grid of all ingested articles. Click through to the detail page showing full text, extracted entities (with Wikidata IDs and coordinates), and metadata. |
| TERMINAL_LOG | Real-time log stream from all services via WebSocket, with per-service filtering, search, and an error panel with acknowledge buttons. |
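The temporal decay controls in AFFILIATION_GRAPH (lambda slider, min-strength filter) can be illustrated with a small sketch. It assumes exponential decay of the form `weight * exp(-lambda * age_days)`; the actual formula, the field names, and the time unit are assumptions, not taken from the project code.

```python
import math
from dataclasses import dataclass


@dataclass
class Relation:
    source: str
    target: str
    weight: float    # hypothetical base strength of the relation
    age_days: float  # hypothetical age of the latest supporting article


def decayed_strength(weight: float, age_days: float, lam: float) -> float:
    """Exponential decay: the older the relation, the weaker it counts."""
    return weight * math.exp(-lam * age_days)


def visible_relations(relations: list[Relation], lam: float, min_strength: float) -> list[Relation]:
    """Keep only relations whose decayed strength passes the filter."""
    return [
        r for r in relations
        if decayed_strength(r.weight, r.age_days, lam) >= min_strength
    ]
```

A larger lambda makes old relations fade faster, while the min-strength threshold decides when a faded relation disappears from the graph.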
Architecture
flowchart LR
FETCH["article-fetcher"] -- articles.rss --> SCRAPE["article-scraper"]
SCRAPE --> FO1{{"articles.scraped (fanout)"}}
FO1 -- articles.raw --> NER["ner-tagger"]
FO1 -- articles.training --> STORE["article-store"]
NER -- articles.tagged --> RESOLVE["entity-resolver"]
RESOLVE -- articles.resolved --> ROLE["role-classifier"]
ROLE -- articles.role-classified --> TOPIC["topic-tagger"]
TOPIC --> FO2{{"articles.classified (fanout)"}}
FO2 -- articles.classified.store --> LABEL["label-updater"]
FO2 -- articles.classified.relation --> RELEXT["relation-extractor"]
OSINT["osint-geo-fetcher"] -- conflict_events.raw --> CSTORE["conflict-store"]
UCDP["ucdp-fetcher"] -- conflict_events.raw --> CSTORE
GDELT["gdelt-fetcher"] -- conflict_events.raw --> CSTORE
EONET["nasa-eonet-fetcher"] -- natural_disasters.raw --> DSTORE["disaster-store"]
FETCH -.- RED[("Redis")]
OSINT -.- RED
UCDP -.- RED
GDELT -.- RED
EONET -.- RED
RESOLVE -.- RED
STORE -.- PG[("PostgreSQL")]
LABEL -.- PG
ROLE -.- PG
TOPIC -.- PG
RELEXT -.- PG
CSTORE -.- PG
DSTORE -.- PG
RELEXT -.- NEO[("Neo4j")]
PG -.- API["monitoring-api"]
NEO -.- API
API -.- FE["Frontend"]
All services communicate via RabbitMQ queues. Queue names are shown on each edge. Fanout exchanges split the stream to multiple consumers. Dashed lines (-.-) show store connections (PostgreSQL for articles, conflict events, and natural disasters; Redis for dedup and scheduling; Neo4j for the knowledge graph).
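As an illustration of the fanout pattern, the sketch below declares an exchange and binds one queue per consumer. The exchange and queue names come from the diagram; the helper name, the `durable=True` setting, and the stand-in channel class are assumptions. The `channel` argument is expected to behave like a pika channel.

```python
def declare_fanout(channel, exchange: str, queues: list[str]) -> None:
    """Declare a fanout exchange and bind one queue per consumer."""
    channel.exchange_declare(exchange=exchange, exchange_type="fanout", durable=True)
    for queue in queues:
        channel.queue_declare(queue=queue, durable=True)
        # Fanout exchanges ignore the routing key, so none is set here.
        channel.queue_bind(queue=queue, exchange=exchange)


class RecordingChannel:
    """Stand-in for a pika channel so the sketch runs without a broker."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict]] = []

    def exchange_declare(self, **kwargs) -> None:
        self.calls.append(("exchange_declare", kwargs))

    def queue_declare(self, **kwargs) -> None:
        self.calls.append(("queue_declare", kwargs))

    def queue_bind(self, **kwargs) -> None:
        self.calls.append(("queue_bind", kwargs))


channel = RecordingChannel()
declare_fanout(channel, "articles.scraped", ["articles.raw", "articles.training"])
```

Against a real broker, the channel would come from `pika.BlockingConnection(...).channel()`; every message published to `articles.scraped` would then be copied to both bound queues.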
The conflict data pipeline runs in parallel to the article pipeline. Three independent fetcher services publish geolocated conflict events to a shared conflict_events.raw queue:
- osint-geo-fetcher: Bellingcat, Texty, GeoConfirmed, DefMon, CenInfoRes via osint-geo-extractor (every 3h)
- ucdp-fetcher: UCDP Candidate Events API (weekly)
- gdelt-fetcher: GDELT 2.0 material conflict events filtered by CAMEO codes 18/19/20 (every 15 min)
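The CAMEO filter used for GDELT can be sketched as a simple root-code check. The record shape and the `event_root_code` key are hypothetical; only the three root codes come from the description above.

```python
# CAMEO root codes named above: 18 (assault), 19 (fight),
# 20 (unconventional mass violence).
CONFLICT_ROOT_CODES = {"18", "19", "20"}


def is_conflict_event(event: dict) -> bool:
    """True when the event's CAMEO root code is one of 18/19/20.

    `event_root_code` is a hypothetical key name for this sketch.
    """
    return str(event.get("event_root_code", "")).strip() in CONFLICT_ROOT_CODES


def filter_conflict_events(events: list[dict]) -> list[dict]:
    """Drop every event that is not a material conflict event."""
    return [event for event in events if is_conflict_event(event)]
```

Comparing codes as strings keeps zero-padded CAMEO codes such as `"04"` distinct from integers and avoids accidental matches.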
The conflict-store consumer writes events to PostgreSQL with dedup on (source, source_id). The frontend renders these as red markers and an aggregated heatmap layer on the world map.
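Dedup on (source, source_id) can be expressed as a unique constraint plus an insert that skips conflicts. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere; the table layout and the `store_event` helper are assumptions, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conflict_events ("
    " source TEXT NOT NULL,"
    " source_id TEXT NOT NULL,"
    " latitude REAL,"
    " longitude REAL,"
    " UNIQUE (source, source_id))"
)

INSERT_SQL = (
    "INSERT INTO conflict_events (source, source_id, latitude, longitude) "
    "VALUES (:source, :source_id, :latitude, :longitude) "
    "ON CONFLICT (source, source_id) DO NOTHING"
)


def store_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Insert an event; return False if (source, source_id) already exists."""
    cursor = conn.execute(INSERT_SQL, event)
    return cursor.rowcount == 1


event = {"source": "ucdp", "source_id": "12345", "latitude": 15.5, "longitude": 32.5}
first = store_event(conn, event)   # new row
second = store_event(conn, event)  # duplicate, skipped
```

PostgreSQL accepts the same `ON CONFLICT ... DO NOTHING` clause, so redelivered messages or overlapping fetch windows do not create duplicate rows.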
The natural disasters pipeline is the third parallel ingest track. A single fetcher service polls NASA's Earth Observatory Natural Event Tracker (EONET) and publishes to its own queue:
- nasa-eonet-fetcher: NASA EONET v3 events endpoint covering wildfires, severe storms, volcanoes, sea and lake ice, and floods (every 30 min)
The disaster-store consumer writes events to the natural_disasters table with the full EONET geometry timeline preserved as a JSONB column, so moving events (hurricanes, drifting icebergs) can be rendered with directional track overlays on the map. See doc/natural-disasters.md for the full design rationale.
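To show how a preserved geometry timeline becomes a track, the sketch below orders the point observations of one event by time. It assumes each timeline entry carries `date`, `type`, and GeoJSON-style `coordinates` (longitude first); the function name and the sample values are illustrative.

```python
from datetime import datetime


def build_track(geometry: list[dict]) -> list[tuple[datetime, float, float]]:
    """Return (timestamp, lat, lon) points ordered from oldest to newest."""
    points = []
    for entry in geometry:
        if entry.get("type") != "Point":
            continue  # polygons and other shapes are skipped in this sketch
        lon, lat = entry["coordinates"]  # GeoJSON order: longitude first
        timestamp = datetime.fromisoformat(entry["date"].replace("Z", "+00:00"))
        points.append((timestamp, lat, lon))
    return sorted(points)


# Made-up sample timeline, deliberately out of order.
sample_geometry = [
    {"date": "2024-09-02T00:00:00Z", "type": "Point", "coordinates": [-60.0, 20.0]},
    {"date": "2024-09-01T00:00:00Z", "type": "Point", "coordinates": [-58.0, 18.0]},
]
track = build_track(sample_geometry)
```

The first element of `track` is the earliest observation and the last is the current position, which is the order a directional trail needs.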
Running Locally
Important: Running the full stack locally requires a fair amount of resources. Even then it will be a bit slow, because the local NLP categorization is not optimized and runs on CPU only.
    # Start the full stack
    docker compose -f docker/local/docker-compose.yml up --build -d

    # Include all RSS feeds (default runs BBC, Swissinfo + UN News)
    docker compose -f docker/local/docker-compose.yml --profile all-feeds up --build -d

    # Frontend
    open http://localhost:5173

    # RabbitMQ management
    open http://localhost:15672 # guest / guest

    # Neo4j browser
    open http://localhost:7474 # neo4j / alexandria

    # PostgreSQL
    psql postgresql://alexandria:alexandria@localhost:5432/alexandria
Tooling
Languages & Runtimes
| Area | Technology |
|---|---|
| Backend | Python 3.13+ |
| Frontend | TypeScript 5.9 / React 19 |
| Containers | Docker & Docker Compose |
Backend
| Tool | Role |
|---|---|
| uv | Package management & dependency locking |
| FastAPI | REST API (monitoring-api) |
| uvicorn | ASGI server (monitoring-api) |
| pika | RabbitMQ client (most services) |
| psycopg 3 | PostgreSQL driver |
| redis-py | Redis client (dedup, scheduling, caching) |
| neo4j | Neo4j driver (relation-extractor, monitoring-api) |
| httpx | Async HTTP client |
| websockets | Real-time log streaming (monitoring-api) |
| docker | Docker SDK for container health queries (monitoring-api) |
| Ruff | Linting & formatting |
| pytest | Testing |
| osint-geo-extractor | OSINT conflict event data (Bellingcat, Texty, GeoConfirmed, DefMon, CenInfoRes) |
NLP / ML
| Tool | Role |
|---|---|
| spaCy | Named-entity recognition (ner-tagger) |
| Hugging Face Transformers | Zero-shot classification (role-classifier, topic-tagger, relation-extractor) |
| PyTorch | Inference runtime (CPU-only) |
| trafilatura | Article text extraction (article-scraper) |
| feedparser | RSS/Atom parsing (article-fetcher) |
Frontend
| Tool | Role |
|---|---|
| Vite | Build tool & dev server |
| React | UI framework |
| React Router | Client-side routing |
| Tailwind CSS | Styling |
| Leaflet / react-leaflet | World map |
| react-leaflet-cluster | Map marker clustering |
| leaflet.heat | Conflict event heatmap layer |
| @xyflow/react | Pipeline topology diagrams |
| @dagrejs/dagre | Graph layout algorithms (pipeline topology) |
| react-force-graph-2d | Entity relation graphs |
| ESLint | Linting |
Infrastructure
| Tool | Role |
|---|---|
| RabbitMQ 4 | Message broker (inter-service queues & fanout exchanges) |
| PostgreSQL 17 | Primary datastore (articles, conflict events, labels, roles, relations) |
| Neo4j 5 | Graph database (entity relations) |
| Redis 7 | Cache & scheduling (entity-resolver lookups, feed dedup, fetcher scheduling) |
Design/UX
Design and UX are managed using Google's Stitch AI UX tool.