Show HN: Multimodal Search over the National Gallery of Art

mxp.co

2 points by Beefin a month ago · 0 comments

We indexed 120K images from the National Gallery of Art for visual search. Text queries, image uploads, and "find similar" all in one retriever, fused with RRF.

Demo: https://mxp.co/r/nga

Stack: SigLIP (768-dim embeddings), Ray on 2× L4 GPUs, Qdrant. ~2 hours to process, <100ms queries.
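For anyone who wants to reproduce the indexing side, here is a minimal sketch rather than the production pipeline. It assumes the google/siglip-base-patch16-224 checkpoint (768-dim), a local Qdrant instance, and a collection named "nga"; none of those specifics come from the post.

```python
# Minimal indexing sketch, not the production pipeline. Assumptions: the
# google/siglip-base-patch16-224 checkpoint (768-dim), a local Qdrant
# instance, and a collection named "nga".
import torch
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
client = QdrantClient("localhost", port=6333)

client.recreate_collection(
    collection_name="nga",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def embed_image(path: str) -> list[float]:
    """SigLIP image embedding, unit-normalized for cosine kNN."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)[0].tolist()

def index_images(paths: list[str]) -> None:
    """Upsert one point per image; the payload keeps the source path."""
    points = [
        PointStruct(id=i, vector=embed_image(p), payload={"path": p})
        for i, p in enumerate(paths)
    ]
    client.upsert(collection_name="nga", points=points)
```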

Why SigLIP over CLIP: sigmoid loss instead of softmax means embeddings live in a global semantic space—similarity scores stay consistent at scale instead of being batch-relative.
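Roughly, the difference looks like this. A toy contrast of the two training losses, not the actual training code:

```python
# Toy contrast between CLIP's softmax loss and SigLIP's sigmoid loss.
# img and txt are L2-normalized (N, D) embedding batches; t and b are learned scalars.
import torch
import torch.nn.functional as F

def clip_loss(img, txt, t):
    logits = img @ txt.t() * t
    labels = torch.arange(len(img))
    # softmax over the batch: each pair is scored relative to the other pairs
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def siglip_loss(img, txt, t, b):
    logits = img @ txt.t() * t + b
    signs = 2 * torch.eye(len(img)) - 1  # +1 on matched pairs, -1 elsewhere
    # every pair is an independent binary decision, so scores are absolute
    return -F.logsigmoid(signs * logits).mean()
```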

The interesting part is the retriever. One stage, three optional inputs:

- text → encode → kNN
- image → encode → kNN
- document_id → lookup stored embedding → kNN

Pass any combination. If multiple, fuse with reciprocal rank fusion (RRF). No score normalization needed—RRF only cares about rank position.
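RRF itself fits in a few lines. A sketch of the fusion step (the k=60 constant is the common default, my assumption rather than the post's config):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with reciprocal rank fusion.

    Each hit contributes 1 / (k + rank). Raw similarity scores never enter,
    so the text kNN and image kNN results don't need to be on the same scale.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Usage is just rrf_fuse([text_knn_ids, image_knn_ids]), which returns the fused ordering.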

Killer query: pass a document_id + text like "but wearing blue." RRF combines structural similarity with the text constraint.
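A hypothetical version of that query on top of Qdrant, reusing the model, processor, client, and rrf_fuse from the sketches above; the collection name and helper are my assumptions, not the post's actual API:

```python
import torch
import torch.nn.functional as F

def similar_but(doc_id: int, text: str, limit: int = 20) -> list[str]:
    # 1) look up the stored SigLIP embedding for the reference artwork
    stored = client.retrieve("nga", ids=[doc_id], with_vectors=True)[0].vector

    # 2) encode the text constraint with the same model
    inputs = processor(text=[text], return_tensors="pt", padding="max_length")
    with torch.no_grad():
        text_vec = F.normalize(model.get_text_features(**inputs), dim=-1)[0].tolist()

    # 3) run both kNN searches, then fuse by rank position only
    by_image = client.search("nga", query_vector=stored, limit=limit)
    by_text = client.search("nga", query_vector=text_vec, limit=limit)
    return rrf_fuse([[str(h.id) for h in by_image],
                     [str(h.id) for h in by_text]])[:limit]
```

Something like similar_but(42, "but wearing blue") would then rank results by agreement between the image-similarity list and the text-constraint list.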

Blog with full config: https://mixpeek.com/blog/visual-search-rrf/
