dbyter/sphere-embed: Visualize embeddings and LLM relationships on a sphere's surface

Disclaimer: This project and documentation were mostly vibe coded with Claude. Proceed accordingly.

Interactive 3D visualization of OpenAI text embeddings. 1002 words across 15 categories are embedded with text-embedding-3-small, reduced with PCA followed by UMAP, and placed directly on a sphere using UMAP's haversine output metric.

Live at nofone.io/experiment/3dembed

Quick start

Option A — just run the frontend (data included)

data.json is committed to the repo, so you can visualize immediately without an API key or running any Python.

cd frontend
npm install
npm run dev
# → http://localhost:5173

Option B — generate your own embeddings

Use this if you want to change the word list, tweak UMAP parameters, or regenerate from scratch. Requires Python 3.11, uv, and an OpenAI API key.

cd backend

# Add your OpenAI API key
echo "OPENAI_API_KEY=sk-..." > .env

# Embed 1002 words (~30s, costs <$0.001)
uv run python embed.py

# Reduce to sphere coordinates (~90s) and overwrite data.json
uv run python reduce.py
# → writes ../frontend/public/data.json

embed.py is idempotent — it skips words already in the database, so you can safely re-run it after adding new words. Re-run reduce.py anytime to regenerate coordinates without re-embedding.
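
The skip-existing behaviour can be sketched roughly as follows. This is a minimal illustration, not the actual embed.py: only the table name embeddings appears in this README, while the word and vector columns and the helper missing_words are assumptions made for the example.

```python
import sqlite3


def missing_words(conn: sqlite3.Connection, words: list[str]) -> list[str]:
    """Return the words that have no row in the embeddings table yet."""
    # Assumed schema: embeddings(word TEXT PRIMARY KEY, vector BLOB).
    stored = {row[0] for row in conn.execute("SELECT word FROM embeddings")}
    return [w for w in words if w not in stored]


# Demo against an in-memory database; no API call is made here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, vector BLOB)")
conn.execute("INSERT INTO embeddings VALUES (?, ?)", ("otter", b""))

todo = missing_words(conn, ["otter", "quark", "tensor"])
print(todo)  # ['quark', 'tensor']
```

Only the words returned by such a check would need to be sent to the embedding API, which is what makes a re-run cheap after adding a few new words.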

Verify embeddings were stored:

sqlite3 embeddings.db "SELECT count(*) FROM embeddings"   # → 1002

How it works

1002 words × 15 categories
        ↓
OpenAI text-embedding-3-small  →  1536-dim vectors  (stored in SQLite)
        ↓
PCA  →  50 dims
        ↓
UMAP (output_metric="haversine")  →  (lat, lon) on S²
        ↓
Spherical → Cartesian  →  (x, y, z) on unit sphere
        ↓
React Three Fiber  →  interactive 3D visualization
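
The Spherical → Cartesian step is a short piece of trigonometry. The sketch below uses one common convention, where the first angle is the polar angle and the second is the azimuth, both in radians. The function name sphere_to_cartesian and the choice of convention are assumptions; reduce.py may order or interpret the two angles differently.

```python
import math


def sphere_to_cartesian(theta: float, phi: float) -> tuple[float, float, float]:
    """Map two angles in radians to a point on the unit sphere."""
    # Assumed convention: theta is the polar angle (from +z), phi the azimuth.
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(theta) * math.sin(phi)
    z = math.cos(theta)
    return x, y, z


# A point on the equator at azimuth 0 lands on the +x axis.
x, y, z = sphere_to_cartesian(math.pi / 2, 0.0)
print(round(x, 6), round(y, 6), round(z, 6))  # 1.0 0.0 0.0
```

Whatever the convention, the result always has unit length, so every word sits exactly on the sphere's surface.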

Why haversine output? UMAP with output_metric="haversine" treats the output space as a 2-sphere (S²), so points are embedded directly on the sphere's surface rather than in flat 3D space. This avoids the clustering artifacts that come from L2-normalizing flat UMAP output, which confines all points to a single octant of the sphere when the UMAP output is all-positive.
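
The normalization problem is easy to reproduce without UMAP. The sketch below uses random all-positive 3D points as a stand-in for flat UMAP output (an assumption for illustration, not real pipeline data) and L2-normalizes them. Every normalized point keeps strictly positive coordinates, so the points never leave one octant of the sphere, and therefore one hemisphere.

```python
import math
import random


def l2_normalize(point: tuple[float, ...]) -> tuple[float, ...]:
    """Scale a point to unit length, projecting it onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in point))
    return tuple(c / norm for c in point)


# Stand-in for flat 3D output whose coordinates are all positive.
random.seed(0)
flat = [tuple(random.uniform(0.1, 10.0) for _ in range(3)) for _ in range(1000)]
on_sphere = [l2_normalize(p) for p in flat]

# Dividing by a positive norm cannot change a coordinate's sign.
print(all(min(p) > 0 for p in on_sphere))  # True
```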

Project structure

sphere-embed/
├── backend/                  # Python pipeline (uv)
│   ├── words.py              # 1002 words × 15 categories
│   ├── embed.py              # OpenAI → SQLite (multithreaded, idempotent)
│   ├── reduce.py             # PCA + UMAP → data.json
│   └── embeddings.db         # generated, gitignored
└── frontend/                 # Vite + React + TypeScript
    ├── src/
    │   ├── App.tsx
    │   ├── components/
    │   │   ├── Scene.tsx         # R3F Canvas + OrbitControls
    │   │   ├── SpherePoints.tsx  # InstancedMesh per category
    │   │   ├── WireframeSphere.tsx
    │   │   ├── Controls.tsx      # category toggles + search
    │   │   └── Tooltip.tsx
    │   └── hooks/
    │       └── useEmbeddingData.ts
    └── public/
        └── data.json             # pre-computed, committed to repo

Category	Count
Animals	67
Biology	67
Chemistry	67
Physics	67
Mathematics	67
Philosophy	67
History	67
Politics	67
Business	67
Technology	67
Geography	67
Fashion	67
Food	66
Sports	66
Psychology	66

Tech stack