Vector Rosetta: CLIP → SigLIP Image Embedding Translation

Translate image embeddings from OpenAI CLIP ViT-B/32 to Google SigLIP ViT-B/16-224 without re-running inference on original images.

Note: This model translates image embeddings only. Both the source (CLIP) and the target (SigLIP) are vision encoders from models trained on image-text pairs.

Why?

  • 41x faster than re-embedding images with SigLIP
  • Works when original images are unavailable
  • Enables embedding space migration without reprocessing your image corpus

Performance

Metric                            Value
Cosine Similarity                 90.9%
Rank@1 (10K pool)                 94.3%
Rank@1 (100K pool)                84.4%
Cross-domain (COCO photos)        90.1%
Cross-domain (WikiArt paintings)  85.7%
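Rank@1 here means that the translated embedding's nearest neighbour (by cosine similarity) in a pool of genuine SigLIP embeddings is its own counterpart. The exact evaluation harness is not included in this card; the snippet below is only a minimal reconstruction of that metric, assuming you hold paired NumPy arrays of translated and directly computed SigLIP embeddings.

import numpy as np

def rank_at_1(translated, ground_truth):
    """Fraction of translated embeddings whose nearest neighbour in the
    ground-truth SigLIP pool is their own counterpart (cosine similarity)."""
    t = translated / np.linalg.norm(translated, axis=1, keepdims=True)
    g = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    sims = t @ g.T                                   # (N, N) cosine similarities
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(t))))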

Installation

pip install torch numpy huggingface_hub
pip install transformers pillow requests  # only needed for the "Full Example" section below

Quick Start

from huggingface_hub import hf_hub_download
import torch
import sys
import os

# Download the module
module_path = hf_hub_download(
    "vulturelabs/vector-rosetta-clip-vit-base-patch32-to-siglip-vit-base-patch16-224",
    "vector_rosetta.py",
    token=os.environ.get("HF_TOKEN")
)
sys.path.insert(0, os.path.dirname(module_path))  # make the downloaded module importable

from vector_rosetta import VectorRosetta

# Load model
translator = VectorRosetta.from_pretrained(
    "vulturelabs/vector-rosetta-clip-vit-base-patch32-to-siglip-vit-base-patch16-224",
    token=os.environ.get("HF_TOKEN")
)

# Translate CLIP image embeddings to SigLIP
import numpy as np
clip_image_embeddings = np.random.randn(100, 512).astype(np.float32)  # Your CLIP image embeddings
siglip_image_embeddings = translator.translate(clip_image_embeddings)
print(siglip_image_embeddings.shape)  # (100, 768)

# With confidence scores (lower = better translation)
siglip_embeddings, confidence = translator.translate(clip_image_embeddings, return_confidence=True)

Full Example: Translate Real Image Embeddings

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests

# Get a CLIP image embedding
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = clip_processor(images=image, return_tensors="pt")
clip_emb = clip_model.get_image_features(**inputs).detach().numpy()

# Translate to SigLIP image embedding space
siglip_emb = translator.translate(clip_emb)

# Now use siglip_emb with any SigLIP-based image retrieval system!
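To sanity-check a translation, you can compare it against the embedding SigLIP itself produces for the same image. This is not part of the card's own examples, just a hedged follow-up sketch that reuses `image` and `siglip_emb` from the block above and assumes a transformers release with SigLIP support:

from transformers import SiglipModel, SiglipProcessor
import numpy as np

siglip_model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
siglip_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")

siglip_inputs = siglip_processor(images=image, return_tensors="pt")
true_emb = siglip_model.get_image_features(**siglip_inputs).detach().numpy()

# Cosine similarity between the translated and the directly computed embedding
cos = float(np.dot(siglip_emb[0], true_emb[0]) /
            (np.linalg.norm(siglip_emb) * np.linalg.norm(true_emb)))
print(f"cosine(translated, true SigLIP): {cos:.3f}")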

Model Details

  • Source: openai/clip-vit-base-patch32 (512-dim image embeddings)
  • Target: google/siglip-base-patch16-224 (768-dim image embeddings)
  • Architecture: VectorTranslationAdapter (50M params)
    • Generative Projector (cross-attention based dimension alignment)
    • Residual MLP (manifold refinement)
    • Confidence Estimator (translation quality prediction)
  • Training: 1M ImageNet images, 50 epochs
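The card names these components but does not publish the adapter's internals. The sketch below is purely illustrative of that three-part shape; the layer sizes, number of queries, and module names are assumptions, and it will not match the released 50M-parameter weights.

import torch
import torch.nn as nn

class IllustrativeAdapter(nn.Module):
    """Illustrative only: mirrors the three components listed above, not the
    actual VectorTranslationAdapter implementation."""

    def __init__(self, src_dim=512, tgt_dim=768, hidden=1024, n_queries=8):
        super().__init__()
        # "Generative projector": learned queries cross-attend to the source embedding
        self.queries = nn.Parameter(torch.randn(n_queries, tgt_dim))
        self.src_proj = nn.Linear(src_dim, tgt_dim)
        self.cross_attn = nn.MultiheadAttention(tgt_dim, num_heads=8, batch_first=True)
        # "Residual MLP": refines the projected vector toward the target manifold
        self.mlp = nn.Sequential(nn.Linear(tgt_dim, hidden), nn.GELU(), nn.Linear(hidden, tgt_dim))
        # "Confidence estimator": predicts a drift score per embedding
        self.confidence = nn.Sequential(nn.Linear(tgt_dim, hidden // 4), nn.GELU(), nn.Linear(hidden // 4, 1))

    def forward(self, x):                              # x: (batch, 512)
        kv = self.src_proj(x).unsqueeze(1)             # (batch, 1, 768)
        q = self.queries.unsqueeze(0).expand(x.shape[0], -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)       # (batch, n_queries, 768)
        projected = attended.mean(dim=1)               # pool queries -> (batch, 768)
        refined = projected + self.mlp(projected)      # residual refinement
        drift = self.confidence(refined).squeeze(-1)   # (batch,)
        return refined, drift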

Confidence Score

The model outputs a "drift score" where lower = better translation:

  • < 0.25: Excellent translation
  • 0.25-0.35: Good translation
  • > 0.35: May need verification

translated, confidence = translator.translate(image_embeddings, return_confidence=True)
good_translations = translated[confidence < 0.3]
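The same documented thresholds can be used to triage a whole batch, for example to flag low-quality translations for re-embedding from the source images. A small sketch over the NumPy arrays returned above:

excellent = confidence < 0.25
good = (confidence >= 0.25) & (confidence <= 0.35)
needs_review = confidence > 0.35

print(f"excellent: {excellent.sum()}  good: {good.sum()}  review: {needs_review.sum()}")

# Keep translations in the first two bands; re-embed the flagged items if the images exist
keep = translated[~needs_review]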

Use Cases

  • Vector database migration: Move from CLIP-indexed to SigLIP-indexed image search (see the batched sketch after this list)
  • Model upgrades: Upgrade your image retrieval system without re-embedding
  • Cross-system compatibility: Bridge systems using different image embedding models
  • Historical data: Translate old CLIP embeddings when original images are deleted
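For the migration case, the translator can be run over an exported CLIP index in batches before the vectors are written to the new SigLIP-indexed store. The sketch below is hypothetical: the `clip_index.npy` export file is a placeholder, writing results back is left to your own vector-database client, and `translator` is the instance loaded in the Quick Start.

import numpy as np

def migrate_index(clip_vectors, batch_size=1024):
    """Translate an exported (N, 512) float32 CLIP index to SigLIP space in batches."""
    batches = []
    for start in range(0, len(clip_vectors), batch_size):
        batches.append(translator.translate(clip_vectors[start:start + batch_size]))
    return np.concatenate(batches, axis=0)                    # (N, 768)

siglip_vectors = migrate_index(np.load("clip_index.npy"))      # hypothetical export file
# Upsert siglip_vectors into the SigLIP-indexed store using your database's own client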

Limitations

  • Image embeddings only - does not translate text embeddings
  • Trained on natural images (ImageNet). Performance may degrade on:
    • Medical/scientific imagery
    • Synthetic/AI-generated images
    • Heavy text overlays
  • Only supports this specific model pair (CLIP ViT-B/32 → SigLIP ViT-B/16)

Citation

@misc{vector-rosetta-2025,
  title={Vector Rosetta: Cross-Model Image Embedding Translation},
  year={2025}
}