Vector Rosetta: CLIP → SigLIP Image Embedding Translation
Translate image embeddings from OpenAI CLIP ViT-B/32 to Google SigLIP ViT-B/16-224 without re-running inference on original images.
Note: This model translates image embeddings only. Both source (CLIP) and target (SigLIP) are vision models trained on image-text pairs.
Why?
- 41x faster than re-embedding images with SigLIP
- Works when original images are unavailable
- Enables embedding space migration without reprocessing your image corpus
Performance
| Metric | Value |
|---|---|
| Cosine Similarity | 90.9% |
| Rank@1 (10K pool) | 94.3% |
| Rank@1 (100K pool) | 84.4% |
| Cross-domain (COCO photos) | 90.1% |
| Cross-domain (WikiArt paintings) | 85.7% |
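For reference, these retrieval metrics can be approximated with a few lines of NumPy, assuming the `translator` object loaded in the Quick Start below and a held-out pool of images embedded with both models (the `clip_embs` / `siglip_true` arrays here are random placeholders, not the released evaluation data):
import numpy as np
# Placeholders: replace with a held-out pool embedded by BOTH models
clip_embs = np.random.randn(1_000, 512).astype(np.float32)    # CLIP ViT-B/32 image embeddings
siglip_true = np.random.randn(1_000, 768).astype(np.float32)  # ground-truth SigLIP embeddings
siglip_pred = translator.translate(clip_embs)
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)
pred, true = l2_normalize(siglip_pred), l2_normalize(siglip_true)
# Mean cosine similarity between translated and ground-truth embeddings
cosine = float(np.mean(np.sum(pred * true, axis=1)))
# Rank@1: does each translated vector retrieve its own ground-truth vector first in the pool?
sims = pred @ true.T
rank1 = float(np.mean(sims.argmax(axis=1) == np.arange(len(pred))))
print(f"cosine={cosine:.3f}  rank@1={rank1:.3f}")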
Installation
pip install torch numpy huggingface_hub
# The full example further below also needs: pip install transformers pillow requests
Quick Start
from huggingface_hub import hf_hub_download
import torch
import sys
import os
# Download the module
module_path = hf_hub_download(
"vulturelabs/vector-rosetta-clip-vit-base-patch32-to-siglip-vit-base-patch16-224",
"vector_rosetta.py",
token=os.environ.get("HF_TOKEN")
)
sys.path.insert(0, os.path.dirname(module_path))
from vector_rosetta import VectorRosetta
# Load model
translator = VectorRosetta.from_pretrained(
"vulturelabs/vector-rosetta-clip-vit-base-patch32-to-siglip-vit-base-patch16-224",
token=os.environ.get("HF_TOKEN")
)
# Translate CLIP image embeddings to SigLIP
import numpy as np
clip_image_embeddings = np.random.randn(100, 512).astype(np.float32) # Your CLIP image embeddings
siglip_image_embeddings = translator.translate(clip_image_embeddings)
print(siglip_image_embeddings.shape) # (100, 768)
# With confidence scores (lower = better translation)
siglip_embeddings, confidence = translator.translate(clip_image_embeddings, return_confidence=True)
Full Example: Translate Real Image Embeddings
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
# Get a CLIP image embedding
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = clip_processor(images=image, return_tensors="pt")
with torch.no_grad():
    clip_emb = clip_model.get_image_features(**inputs).numpy()
# Translate to SigLIP image embedding space
siglip_emb = translator.translate(clip_emb)
# Now use siglip_emb with any SigLIP-based image retrieval system!
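For instance, scoring the translated vector against an already SigLIP-indexed corpus reduces to a cosine-similarity lookup; the corpus_embeddings array below is a random stand-in for your own index:
import numpy as np
# Stand-in for SigLIP image embeddings already stored in your index
corpus_embeddings = np.random.randn(1000, 768).astype(np.float32)
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)
query = l2_normalize(siglip_emb)              # (1, 768) translated query
corpus = l2_normalize(corpus_embeddings)      # (N, 768) indexed corpus
scores = (query @ corpus.T).ravel()           # cosine similarity per corpus item
top5 = np.argsort(-scores)[:5]
print("Top-5 matches:", top5, scores[top5])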
Model Details
- Source: openai/clip-vit-base-patch32 (512-dim image embeddings)
- Target: google/siglip-base-patch16-224 (768-dim image embeddings)
- Architecture: VectorTranslationAdapter (50M params)
  - Generative Projector (cross-attention-based dimension alignment)
  - Residual MLP (manifold refinement)
  - Confidence Estimator (translation quality prediction)
- Training: 1M ImageNet images, 50 epochs
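The shipped adapter lives in vector_rosetta.py; purely to illustrate how the three components listed above fit together, here is a toy PyTorch sketch. All layer sizes and module names are assumptions, and the cross-attention projector is simplified to a plain MLP, so this is not the released implementation:
import torch
import torch.nn as nn

class ToyTranslationAdapter(nn.Module):
    """Illustrative only: projector + residual MLP + confidence head; sizes are guesses."""
    def __init__(self, src_dim=512, tgt_dim=768, hidden=1024):
        super().__init__()
        # Dimension alignment from the 512-dim CLIP space to the 768-dim SigLIP space
        self.projector = nn.Sequential(nn.Linear(src_dim, hidden), nn.GELU(), nn.Linear(hidden, tgt_dim))
        # Residual MLP refining the projected vector on the target manifold
        self.refiner = nn.Sequential(nn.Linear(tgt_dim, hidden), nn.GELU(), nn.Linear(hidden, tgt_dim))
        # Scalar confidence ("drift") head per embedding
        self.confidence = nn.Sequential(nn.Linear(tgt_dim, 128), nn.GELU(), nn.Linear(128, 1))

    def forward(self, x):
        projected = self.projector(x)
        translated = projected + self.refiner(projected)   # residual refinement
        drift = self.confidence(translated).squeeze(-1)    # lower = better translation
        return translated, drift

adapter = ToyTranslationAdapter()
clip_batch = torch.randn(4, 512)
siglip_batch, drift = adapter(clip_batch)
print(siglip_batch.shape, drift.shape)  # torch.Size([4, 768]) torch.Size([4])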
Confidence Score
The model outputs a "drift score" where lower = better translation:
- < 0.25: Excellent translation
- 0.25-0.35: Good translation
- > 0.35: May need verification
translated, confidence = translator.translate(image_embeddings, return_confidence=True)
good_translations = translated[confidence < 0.3]
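A common pattern is to accept confident translations and fall back to full SigLIP inference only for the flagged items. Continuing the Quick Start snippet (translator, clip_image_embeddings), with reembed_with_siglip as a hypothetical helper standing in for your own re-embedding pipeline:
import numpy as np
THRESHOLD = 0.35  # above this, re-embed from the source images instead of trusting the translation
translated, confidence = translator.translate(clip_image_embeddings, return_confidence=True)
ok = confidence < THRESHOLD
final_embeddings = translated.copy()
# Hypothetical helper: re-run SigLIP on the original images (only possible if they still exist)
# final_embeddings[~ok] = reembed_with_siglip(image_paths[np.flatnonzero(~ok)])
print(f"accepted {ok.sum()}/{len(ok)} translations; {np.count_nonzero(~ok)} flagged for verification")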
Use Cases
- Vector database migration: Move from CLIP-indexed to SigLIP-indexed image search
- Model upgrades: Upgrade your image retrieval system without re-embedding
- Cross-system compatibility: Bridge systems using different image embedding models
- Historical data: Translate old CLIP embeddings when original images are deleted
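As a concrete sketch of the vector database migration use case above: stream CLIP vectors out of the old index in batches, translate them, and write them into a new 768-dim SigLIP index. load_clip_batches and new_index are placeholders for your own vector-store client, not part of this package:
import numpy as np
BATCH = 4096
def load_clip_batches():
    # Placeholder: yield (ids, clip_vectors) batches from the existing CLIP-indexed store
    for start in range(0, 3 * BATCH, BATCH):
        yield np.arange(start, start + BATCH), np.random.randn(BATCH, 512).astype(np.float32)
migrated = 0
for ids, clip_vecs in load_clip_batches():
    siglip_vecs, drift = translator.translate(clip_vecs, return_confidence=True)
    # new_index.upsert(ids, siglip_vecs)   # write into the SigLIP-indexed store
    # optionally queue ids with high drift for re-embedding from the source images
    migrated += len(ids)
print(f"migrated {migrated} vectors")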
Limitations
- Image embeddings only - does not translate text embeddings
- Trained on natural images (ImageNet). Performance may degrade on:
  - Medical/scientific imagery
  - Synthetic/AI-generated images
  - Heavy text overlays
- Only supports this specific model pair (CLIP ViT-B/32 → SigLIP ViT-B/16-224)
Citation
@misc{vector-rosetta-2025,
  title={Vector Rosetta: Cross-Model Image Embedding Translation},
  year={2025}
}