GitHub - dbyter/sphere-embed: Visualize embeddings and LLM relationships on a sphere's surface

Disclaimer: This project and documentation were mostly vibe coded with Claude. Proceed accordingly.

Interactive 3D visualization of OpenAI text embeddings. 1002 words across 15 domains are embedded with text-embedding-3-small, reduced with PCA and UMAP, and placed on the surface of a sphere using UMAP's haversine output metric. The resulting (lat, lon) pairs are then converted to 3D coordinates for rendering.

Live at nofone.io/experiment/3dembed

Quick start

Option A — just run the frontend (data included)

data.json is committed to the repo, so you can visualize immediately without an API key or running any Python.

cd frontend
npm install
npm run dev
# → http://localhost:5173

Option B — generate your own embeddings

Use this if you want to change the word list, tweak UMAP parameters, or regenerate from scratch. Requires Python 3.11 and an OpenAI API key.

cd backend

# Add your OpenAI API key
echo "OPENAI_API_KEY=sk-..." > .env

# Embed 1002 words (~30s, costs <$0.001)
uv run python embed.py

# Reduce to sphere coordinates (~90s) and overwrite data.json
uv run python reduce.py
# → writes ../frontend/public/data.json

embed.py is idempotent — it skips words already in the database, so you can safely re-run it after adding new words. Re-run reduce.py anytime to regenerate coordinates without re-embedding.
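
The skip-if-present behavior can be sketched roughly as follows. This is a minimal illustration, not the code in embed.py. The table layout (a `word` primary key plus a `vector` blob), the helper names `open_db`, `missing_words` and `store`, and the in-memory database are all assumptions made for the example; only the `embeddings` table name appears in this README.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open the database and create the (assumed) embeddings table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (word TEXT PRIMARY KEY, vector BLOB)"
    )
    return conn

def missing_words(conn, words):
    """Return the words that have no stored embedding yet, preserving order."""
    stored = {row[0] for row in conn.execute("SELECT word FROM embeddings")}
    return [w for w in words if w not in stored]

def store(conn, word, vector_bytes):
    """Insert one embedding; INSERT OR IGNORE makes repeated inserts harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO embeddings (word, vector) VALUES (?, ?)",
        (word, vector_bytes),
    )
    conn.commit()

conn = open_db()
store(conn, "zebra", b"\x00")  # placeholder bytes standing in for a real vector
todo = missing_words(conn, ["zebra", "quark"])
print(todo)  # only "quark" would still need an API call
```

Filtering before calling the API is what keeps re-runs cheap: words that are already stored never trigger a new request.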

Verify embeddings were stored:

sqlite3 embeddings.db "SELECT count(*) FROM embeddings"   # → 1002

How it works

1002 words in 15 categories
        ↓
OpenAI text-embedding-3-small  →  1536-dim vectors  (stored in SQLite)
        ↓
PCA  →  50 dims
        ↓
UMAP (output_metric="haversine")  →  (lat, lon) on S²
        ↓
Spherical → Cartesian  →  (x, y, z) on unit sphere
        ↓
React Three Fiber  →  interactive 3D visualization
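
The "Spherical → Cartesian" step can be sketched as below. This assumes the geographic convention (latitude measured from the equator, longitude around the z-axis, both in radians); the function name `latlon_to_xyz` is hypothetical. Other conventions exist, for example measuring the first angle from the pole, so reduce.py may differ.

```python
import math

def latlon_to_xyz(lat, lon):
    """Map latitude/longitude in radians to a point on the unit sphere."""
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)

x, y, z = latlon_to_xyz(math.radians(30.0), math.radians(45.0))
radius = math.sqrt(x * x + y * y + z * z)
print(round(radius, 6))  # distance from the origin; the point lies on the unit sphere
```

Because cos² + sin² = 1, every output has unit length, so no separate normalization pass is needed before handing the points to the renderer.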

Why haversine output? UMAP with output_metric="haversine" treats the output space as a 2-sphere (S²), so points are embedded directly on the sphere's surface rather than in flat 3D space. This avoids the clustering artifacts that come from L2-normalizing flat UMAP output: when every UMAP coordinate is positive, normalization confines all points to a single octant of the sphere, which is well inside one hemisphere.
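
The normalization artifact is easy to reproduce without UMAP. In this sketch, random numbers stand in for UMAP output in both cases, and the helper names are hypothetical; it illustrates the geometry only and is not taken from reduce.py.

```python
import math
import random

def l2_normalize(p):
    """Scale a 3D point to unit length, projecting it radially onto the sphere."""
    norm = math.sqrt(sum(c * c for c in p))
    return tuple(c / norm for c in p)

def latlon_to_xyz(lat, lon):
    """Geographic (lat, lon) in radians to unit-sphere Cartesian coordinates."""
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

rng = random.Random(0)

# Case 1: flat 3D coordinates that are all positive, then L2-normalized.
flat = [tuple(rng.uniform(0.1, 10.0) for _ in range(3)) for _ in range(1000)]
normalized = [l2_normalize(p) for p in flat]
all_in_one_octant = all(x > 0 and y > 0 and z > 0 for x, y, z in normalized)

# Case 2: angles spread over the full (lat, lon) range, then converted.
angles = [
    (rng.uniform(-math.pi / 2, math.pi / 2), rng.uniform(-math.pi, math.pi))
    for _ in range(1000)
]
on_sphere = [latlon_to_xyz(lat, lon) for lat, lon in angles]
octants = {(x > 0, y > 0, z > 0) for x, y, z in on_sphere}

print(all_in_one_octant)  # True: dividing by a positive norm cannot flip a sign
print(len(octants))       # how many distinct octants the angular points reach
```

Normalization preserves the sign of each coordinate, so all-positive input can never leave the positive octant. Points parameterized by angles have no such restriction.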

Project structure

sphere-embed/
├── backend/                  # Python pipeline (uv)
│   ├── words.py              # 1002 words in 15 categories
│   ├── embed.py              # OpenAI → SQLite (multithreaded, idempotent)
│   ├── reduce.py             # PCA + UMAP → data.json
│   └── embeddings.db         # generated, gitignored
└── frontend/                 # Vite + React + TypeScript
    ├── src/
    │   ├── App.tsx
    │   ├── components/
    │   │   ├── Scene.tsx         # R3F Canvas + OrbitControls
    │   │   ├── SpherePoints.tsx  # InstancedMesh per category
    │   │   ├── WireframeSphere.tsx
    │   │   ├── Controls.tsx      # category toggles + search
    │   │   └── Tooltip.tsx
    │   └── hooks/
    │       └── useEmbeddingData.ts
    └── public/
        └── data.json             # pre-computed, committed to repo

Categories

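| Category | Count |
| --- | --- |
| Animals | 67 |
| Biology | 67 |
| Chemistry | 67 |
| Physics | 67 |
| Mathematics | 67 |
| Philosophy | 67 |
| History | 67 |
| Politics | 67 |
| Business | 67 |
| Technology | 67 |
| Geography | 67 |
| Fashion | 67 |
| Food | 66 |
| Sports | 66 |
| Psychology | 66 |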

Tech stack

| Layer | Stack |
| --- | --- |
| Embedding | OpenAI text-embedding-3-small |
| Dim reduction | scikit-learn PCA + umap-learn |
| Storage | SQLite |
| Visualization | React + Vite + TypeScript |
| 3D rendering | React Three Fiber + Three.js |
| Python tooling | uv |