Show HN: Multimodal Search over the National Gallery of Art (mxp.co)
We indexed 120K images from the National Gallery of Art for visual search. Text queries, image uploads, and "find similar" all in one retriever, fused with RRF.
Demo: https://mxp.co/r/nga
Stack: SigLIP (768-dim embeddings), Ray on 2× L4 GPUs, Qdrant. ~2 hours to process, <100ms queries.
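For context, indexing is roughly this shape: embed each image with SigLIP and upsert the 768-dim vector into a cosine collection in Qdrant. A minimal sketch, not our exact code; the checkpoint, URL, and collection name are placeholders:

    # Sketch: SigLIP image embeddings into a 768-dim cosine collection in Qdrant.
    import torch
    from PIL import Image
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct
    from transformers import AutoModel, AutoProcessor

    CKPT = "google/siglip-base-patch16-224"  # placeholder; any 768-dim SigLIP variant
    model = AutoModel.from_pretrained(CKPT).eval()
    processor = AutoProcessor.from_pretrained(CKPT)

    client = QdrantClient(url="http://localhost:6333")
    client.recreate_collection(
        collection_name="nga_artworks",  # placeholder name
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

    @torch.no_grad()
    def embed_image(path: str) -> list[float]:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
        vec = model.get_image_features(**inputs)
        return torch.nn.functional.normalize(vec, dim=-1)[0].tolist()

    @torch.no_grad()
    def embed_text(query: str) -> list[float]:
        inputs = processor(text=[query], padding="max_length", return_tensors="pt")
        vec = model.get_text_features(**inputs)
        return torch.nn.functional.normalize(vec, dim=-1)[0].tolist()

    def index_artwork(point_id: int, image_path: str, payload: dict) -> None:
        client.upsert(
            collection_name="nga_artworks",
            points=[PointStruct(id=point_id, vector=embed_image(image_path), payload=payload)],
        )

(Ray's role, presumably, is fanning embed_image out over batches of files across the two GPUs.)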
Why SigLIP over CLIP: the sigmoid loss scores each image-text pair independently instead of softmax-normalizing across the batch, so embeddings land in a globally consistent space and similarity scores stay meaningful at scale rather than being batch-relative.
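To make that concrete, a schematic of the two losses (standard formulations from the papers, not our training code; img and txt are L2-normalized batch embeddings of shape [B, 768]):

    import torch
    import torch.nn.functional as F

    def clip_softmax_loss(img, txt, t=100.0):
        # CLIP: each pair competes against every other item in the batch,
        # so a pair's effective score depends on what else was sampled.
        logits = img @ txt.T * t
        labels = torch.arange(img.size(0))
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    def siglip_sigmoid_loss(img, txt, t=10.0, b=-10.0):
        # SigLIP: every (image, text) pair is an independent binary decision,
        # so similarity values don't shift with batch composition.
        logits = img @ txt.T * t + b
        z = 2 * torch.eye(img.size(0)) - 1  # +1 on the diagonal, -1 elsewhere
        return -F.logsigmoid(z * logits).sum() / img.size(0)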
The interesting part is the retriever. One stage, three optional inputs:
- text → encode → kNN
- image → encode → kNN
- document_id → lookup stored embedding → kNN
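In code the stage is roughly this, reusing the embed_* helpers and Qdrant client from the indexing sketch above (k and the collection name are again placeholders):

    # Every optional input becomes one query vector; every vector runs the
    # same kNN against the collection and yields a ranked list of ids.
    def candidate_lists(text=None, image_path=None, document_id=None, k=50):
        vectors = []
        if text is not None:
            vectors.append(embed_text(text))            # text -> encode
        if image_path is not None:
            vectors.append(embed_image(image_path))     # image -> encode
        if document_id is not None:
            stored = client.retrieve("nga_artworks", ids=[document_id], with_vectors=True)
            vectors.append(stored[0].vector)            # id -> stored embedding
        return [
            [hit.id for hit in client.search("nga_artworks", query_vector=v, limit=k)]
            for v in vectors
        ]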
Pass any combination. If multiple, fuse with reciprocal rank fusion (RRF). No score normalization needed—RRF only cares about rank position.
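RRF itself is only a few lines: each list contributes 1/(k + rank) per id, and you sort by the summed score (k=60 is the conventional constant, not necessarily our exact config):

    def rrf_fuse(ranked_lists, k=60):
        # Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d).
        scores = {}
        for ranked in ranked_lists:
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)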
Killer query: pass a document_id + text like "but wearing blue." RRF combines structural similarity with the text constraint.
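In terms of the sketches above, that's just two candidate lists fused by rank (the id below is a made-up placeholder):

    # "This painting, but wearing blue": structural similarity from the stored
    # embedding plus a text constraint, combined purely by rank position.
    lists = candidate_lists(text="but wearing blue", document_id=12345)  # placeholder id
    top = rrf_fuse(lists)[:20]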
Blog with full config: https://mixpeek.com/blog/visual-search-rrf/