Show HN: Using LLMs and >1k 4090s to visualize 100k scientific research articles

twitter.com

5 points by funfunfunction 4 months ago · 2 comments

ashvardanian 4 months ago

Congrats on the release, Sam - the preview looks great!

I'm curious about the technical side: how are you handling the dimensionality reduction and visualization? Also noticed you mentioned "custom-trained LLMs" in the tweet - how large are those models, and what motivated using custom ones instead of existing open models?

  • funfunfunction (OP) 4 months ago

    We'll release the full data explorer soon, with more info.

    At the core of this project is a structured-extraction task using a custom Qwen 14B model, which we distilled from larger closed-source models. We needed a model we could run at scale on https://devnet.inference.net, which consists mostly of idle consumer-grade NVIDIA GPUs.
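
    Roughly, the extraction call looks like the sketch below. The endpoint URL, model name, and schema are placeholders, not our production values:

        # Hypothetical structured-extraction call against an OpenAI-compatible
        # endpoint; the base_url, model name, and schema are illustrative only.
        import json
        from openai import OpenAI

        client = OpenAI(base_url="https://api.example-devnet.net/v1", api_key="...")

        SYSTEM_PROMPT = (
            "Extract these fields from the paper and reply with JSON only: "
            '{"executive_summary": str, "research_context": str, "key_takeaways": [str]}'
        )

        def extract(paper_text: str) -> dict:
            resp = client.chat.completions.create(
                model="qwen-14b-distilled",  # placeholder for the custom model
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": paper_text},
                ],
                temperature=0.0,  # extraction should be deterministic
            )
            return json.loads(resp.choices[0].message.content)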

    Embeddings were generated using SPECTER2, a transformer model from AllenAI specifically designed for scientific documents. The model processes each paper's title, executive summary, and research context to generate 768-dimensional embeddings optimized for semantic search over scientific literature.
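
    The embedding step follows the standard SPECTER2 recipe from the model card (base model plus adapter); concatenating the summary and context after the title is the only part specific to us:

        # SPECTER2 embeddings via Hugging Face, per the allenai/specter2 model card.
        import torch
        from transformers import AutoTokenizer
        from adapters import AutoAdapterModel

        tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
        model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
        model.load_adapter("allenai/specter2", source="hf",
                           load_as="specter2", set_active=True)

        def embed(papers: list[dict]) -> torch.Tensor:
            texts = [
                p["title"] + tokenizer.sep_token + p["summary"] + " " + p["context"]
                for p in papers
            ]
            inputs = tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt", max_length=512)
            with torch.no_grad():
                out = model(**inputs)
            return out.last_hidden_state[:, 0, :]  # CLS token, 768-dim per paper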

    The visualization uses UMAP to reduce the 768D embeddings to 3D coordinates, preserving local and global structure. K-Means clustering groups papers into ~100 clusters based on semantic similarity in the embedding space. Cluster labels are automatically generated using TF-IDF analysis of paper fields and key takeaways, identifying the most distinctive terms for each cluster.
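
    Condensed, the layout-and-labeling pipeline looks roughly like this; hyperparameters beyond the ones mentioned above are illustrative defaults:

        # 768-d embeddings -> 3-D UMAP coordinates, ~100 K-Means clusters,
        # TF-IDF-derived cluster labels. Unstated hyperparameters are guesses.
        import numpy as np
        import umap
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        def layout_and_label(embeddings: np.ndarray, texts: list[str], k: int = 100):
            coords = umap.UMAP(n_components=3, metric="cosine").fit_transform(embeddings)
            clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

            # Build one pseudo-document per cluster, then take its top TF-IDF
            # terms as the cluster label.
            docs = [" ".join(t for t, c in zip(texts, clusters) if c == i)
                    for i in range(k)]
            tfidf = TfidfVectorizer(stop_words="english", max_features=20000)
            scores = tfidf.fit_transform(docs)
            vocab = np.array(tfidf.get_feature_names_out())
            labels = [", ".join(vocab[np.asarray(scores[i].todense()).ravel()
                                      .argsort()[-3:][::-1]]) for i in range(k)]
            return coords, clusters, labels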
