Latent Dictionary: 3D map of Oxford3000 + search words via DistilBERT embeddings

latentdictionary.com

89 points by pps 2 years ago · 42 comments

minimaxir 2 years ago

Some notes on how embeddings/DistilBERT embeddings work since the other comments are confused:

1) There are two primary ways to have models generate embeddings: implicitly, from an LLM, by mean-pooling its last hidden state, since the model has to learn to map text into a distinct latent space anyway to work correctly (this is the DistilBERT case); or explicitly, from a model trained with something like triplet loss to directly incentivise learning similarity/dissimilarity. Popular text-embedding models like BAAI/bge-large-en-v1.5 tend to use the latter approach.
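
A minimal sketch of the first approach, mean-pooling DistilBERT's last hidden state (the checkpoint name and pooling details here are illustrative, not the demo's exact code):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")

    def embed(text: str) -> torch.Tensor:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 768)
        # average over the sequence, ignoring padding tokens
        mask = inputs["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    vec = embed("butterfly")  # shape: (1, 768)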

2) The famous word2vec examples, e.g. king - man + woman = queen, only work because word2vec is a shallow network that learns the word embeddings directly, instead of the structure being emergent. The latent space still maps such words closely, as this demo shows, but there isn't any algebraic intuition: you can get close with the algebra, but no cigar.
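
For the curious, the classic analogy is usually run through gensim's most_similar, which does the vector arithmetic and then a nearest-neighbor lookup - a sketch, assuming the pretrained Google News vectors (a large download on first use):

    import gensim.downloader

    w2v = gensim.downloader.load("word2vec-google-news-300")

    # king - man + woman = ?
    print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # "queen" typically tops the list, but only via nearest-neighbor search;
    # the raw arithmetic does not land exactly on "queen"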

3) DistilBERT is pretty old (2019) and based on a 2018 model trained on Wikipedia and books, so there will be significant text drift, and it lacks the newer modeling techniques and more robust datasets of current models. I do not recommend using it for production applications nowadays.

4) There is an under-discussed opportunity for dimensionality reduction techniques like PCA (which this demo uses to get the data into 3D) to improve both signal-to-noise and distinctiveness. I am working on a blog post about a new technique for handling dimensionality reduction for text embeddings better, which may have interesting and profound usability implications.
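
For reference, the reduction step the demo describes is a couple of lines of scikit-learn (a sketch; the random matrix is a stand-in for real (n_words, 768) embeddings):

    import numpy as np
    from sklearn.decomposition import PCA

    embeddings = np.random.randn(3000, 768)  # stand-in for mean-pooled vectors
    coords_3d = PCA(n_components=3).fit_transform(embeddings)
    print(coords_3d.shape)  # (3000, 3)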

  • pyinstallwoes 2 years ago

    I’ve been ruminating on the postulation of a universal signature for every entity across sensory complexes (per sense-organ reality: vision, touch, mind), which translates to the problem of entities represented in binary needing to be related across modalities, as in “butterfly” vs a picture of a butterfly vs the audio of “butterfly” vs the thought pointing to one of those or another.

    I was wondering if there was a universal signal that can be used as the identity and then based on that signal one could measure the distance to any other signal based on the principle relation of not(other). That is to say the identity would be precisely not all else for any X. Said another way, every thing is because it is exactly not everything else.

    So, thinking as close to first principles as possible, I wondered if it were possible to represent everything as some frequency? A Fourier transform analog for every “time slice” of a thing? This is where it gets slightly slippery.

    So the idea was trying to build relationship and identity and labeling from a simple rule set of things arising out of relation of not being other things.

    In my mind I saw nodes on a graph forming in higher dimensions as halfway points for any comparison. Comparisons create new nodes and implicitly have a distance metric to all other things. It made sense in my mind that there was an algorithmic annealing of new nodes in a “low density, higher energetic state”, allowing them to move faster in this universal emergent ontology/spatial space, eventually getting denser and slower as it gets cold.

    So the system implicitly also has a snapshot of events or interactions based on that where every comparison has a “tick” that encodes a particular density relation for some set of nodes it’s in association with.

    The idea that cemented it all together was to treat each node like an address:chord. Similar to chording keys like a-b-c in some ux programs, but also exactly like chords in music too.

    The idea being that when multiple “things” are dialed in at the same time, the combination becomes its own emergent label, by proximity and association of those things being triggered as new information comes in, classified as a distance to not(signal).

    I didn’t really realize how close this idea was to what encoders/decoders seem to be doing although I do know I’m trying to think myself towards a universal solution that doesn’t require special encoders for every media type. Hence the Fourier transform path.

    Know anything like this or am I spitting idiocy?

    • pyinstallwoes 2 years ago

      So the alphabet, a to z… on their own the symbols mean nothing, but when each is compared to every other letter, meaning arises. Then iterate recursively outward for every growth in structure: letters to letters, words to words, paragraphs to paragraphs. Each one has a “dependent arising” of meaning based precisely on its relation to the others.

      Which is more or less word2vec as far as I understand but then trying to extrapolate that as a universal principle to all things that can be represented by using a “common signature : hash based off a signal like a complex waveform” and then doing a difference on signal composition and its shape/bandwidth to compare its properties to other things and when they reference similar objects even in different modalities they’d be associated by being triggered together.

      So “dog” vs image of dog would both translate to a primordial signal : identity representation and in the domain of frequency do the comparison and project a coordinate in the spatial sense and eventually those two nodes would more likely be triggered at the same time due to the likelihood of “dog” being next to image of dog when parsing information across future events.

      Whew. Maybe I’m just talking to myself. At least it’s out there if it makes sense to anyone else.

      • minimaxir 2 years ago

        > So “dog” vs image of dog would both translate to a primordial signal : identity representation and in the domain of frequency do the comparison and project a coordinate in the spatial sense and eventually those two nodes would more likely be triggered at the same time due to the likelihood of “dog” being next to image of dog when parsing information across future events.

        That is how CLIP embeddings work, and how they were trained to work.

        Hugging Face transformers now has get_image_features() and get_text_features() functions for CLIP models to make getting embeddings for different modalities easy: https://huggingface.co/docs/transformers/model_doc/clip#tran...
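
        Roughly like this (a sketch; the checkpoint name and image path are illustrative):

            from PIL import Image
            from transformers import CLIPModel, CLIPProcessor

            model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

            # text and images land in the same embedding space
            text_inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
            text_emb = model.get_text_features(**text_inputs)

            image_inputs = processor(images=Image.open("dog.jpg"), return_tensors="pt")
            image_emb = model.get_image_features(**image_inputs)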

        • pyinstallwoes 2 years ago

          Yeah, but it doesn’t use a universal method, does it? And it requires labeling.

          The method I’m describing requires no labeling. Labeling would be a local only translation (alias). Labels emerge based on meaning. But the labels are more of an interface - not the actual nodes themselves which arise off the not identity principle * event proximity * comparisons.

          • warkdarrior 2 years ago

            Labeling (which is typically manual and thus not scalable) is a proxy for comparisons. Two things are the same if they have the same label. The question is how else to encode the comparison information.

            • pyinstallwoes 2 years ago

              Right, one is manual and the other is automatic, and my hypothesis is that you can have automatic universal labeling the way I describe.

      • pyinstallwoes 2 years ago

        The key requirement in my mind here is that the “universal identifier” is a form of attempting something like a deterministic signature for all things. The hunch is based on the hypothesis that the primordial representation of any and all things is frequency.

        But of course each “ontological capable system” would still need to process the identity function to start making sense of things based on signals being unlike other signals, so deterministic is shallow but concrete.

tikimcfee 2 years ago

Edit: I think this is fascinating. If you use words like dog, electric, life, and human, all of them appear in one mass; however, words like greet, chicken, and “a” appear in a section of different mass density. I think it’s interesting that the words have diverged in location, with some apparent relationship to the way the words are used. If this were truly random, I would expect those words to be mixed in with the others.

I have this, except you can see every single word in any dictionary at once in space; it renders individual glyphs. It can show an entire dictionary of words - definitions and roots - and let you fly around in them. It’s fun. I built a sample that “plays” a sentence and its definitions: GitHub.com/tikimcfee/LookAtThat. The more I see stuff like this, the more I want to complete it. It’s heartening to see so many people fascinated with seeing words… I just wish I knew where to find these people to, like, befriend and get better. I’m getting the feeling I just kind of exist between worlds of lofty ideas, and of people that are incredibly smart sticking around other people that are incredibly smart.

wrsh07 2 years ago

I wish there were more context and maybe the ability to do math on the vectors

E.g. what is the real distance between the two vectors? That should be easy to compute.

Similarly: what do I get from summing two vectors and what are some nearby vectors?

Maybe just generally: what are some nearby vectors?

Without any additional context it's just a point cloud with a couple of randomly labeled elements
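
For anyone who wants to try: with the raw vectors in hand, all of those are one-liners in numpy (a sketch, with made-up vectors standing in for real embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["king", "queen", "dog", "cat"]
    vecs = rng.standard_normal((4, 768))  # stand-ins for real embeddings

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(vecs[0], vecs[1]))  # similarity between two words

    summed = vecs[0] + vecs[2]       # "king" + "dog"
    sims = vecs @ summed / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(summed))
    print([vocab[i] for i in np.argsort(-sims)])  # nearest words to the sum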

  • refulgentis 2 years ago

    You're actually kinda hitting the nail on the head. _Generally_, the word2vec woman + king = queen thing was cute but not very real.

    People rarely have to get down to the real true metal on the embedding models, and they're not what people think they are from their memory of word2vec. E.g. there's actually one vector emitted _per token_; the final vector is the mean. And cosine distance for similarity is the only metric anyone is training for.

    In summary, there's ~no reason to think a visualization trying to show multiple vectors will ever be meaningful. Even just starting from "they have way way way more dimensions than we can represent visually" is enough to rule it out.

    MiniLM v2, the foundation of most vector DBs, is 384 dims.

    n.b. dear reader, if you've heard of that: you should be using v3! V3 is for asymmetric search, aka query => result docs. V2 is for symmetric search, aka chunk of text => similarly worded chunks of text. It's very, very funny how few people read the docs - in this case, the sentence-transformers site.
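
    For the unfamiliar, that model is a one-liner via sentence-transformers (a sketch; "all-MiniLM-L6-v2" is the common v2 checkpoint name):

        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(["a chunk of text", "a similarly worded chunk"])
        print(emb.shape)  # (2, 384)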

  • tikimcfee 2 years ago

    If I gave you a live GPU shader that let you arbitrarily position any of, say, a few million words with simple Cartesian coordinates, what would you do with it? Whole words expressed as individual letters - not symbols, representations, or abstractions. Just letters arranged in a specific order to form words.

    • theaussiestew 2 years ago

      I would want to have multilingual embeddings so I can learn languages more efficiently. Being able to see clouds of different words in different languages would allow me to contextualize them more easily. Same for phrases and sentences.

      • tikimcfee 2 years ago

        I’m very much with you. Since most languages I know of are written with combinations of ordered glyphs, they all get rendered the same (although I don’t handle about 20k current Unicode characters of the full 400k+, and none of the grouping really works, so RTL languages would be a mess for individual words).

        However, this is exactly where I want to go. A dictionary is a cyclic graph of words mapping to words. That means there’s at least one finite way to visit every single node and give it a position, with a direct relationship to the words that define it, and those words that define them, and so on.

        This creates an arbitrary and unique geometric structure per language, and if you get fancy and create modifiers for an individual’s vocabulary, you can even create transforms for a “base” dictionary, and the way someone chooses to use certain words differently. You would be able to see, but likely not understand, the “structure” of types of text - poetry, storytelling, instructional writing, etc.
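
        As a rough sketch of that traversal idea (toy dictionary data and a placeholder layout rule):

            from collections import deque

            # toy dictionary: word -> words used in its definition
            defs = {
                "dog": ["domestic", "animal"],
                "domestic": ["animal", "home"],
                "animal": ["living", "organism"],
            }

            def assign_positions(root):
                """BFS over the definition graph; depth becomes one spatial axis."""
                pos, queue = {root: (0, 0)}, deque([root])
                while queue:
                    word = queue.popleft()
                    depth, _ = pos[word]
                    for i, w in enumerate(defs.get(word, [])):
                        if w not in pos:  # cycles are fine: each node is visited once
                            pos[w] = (depth + 1, i)
                            queue.append(w)
                return pos

            print(assign_positions("dog"))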

granawkins 2 years ago

Hey guys, I'm the bored SOB who built this. Thanks for the awesome discussion, a lot of you know more about this than I do!

I hadn't planned to keep building this but if I do, what should I add/change?

  • tikimcfee 2 years ago

    Hey there! Sincerely cool stuff, I’m glad it’s fun for you.

    It’s actually quite approachable to play with, and some of the “wut?” comments may be best answered by a little more experimentation on the user’s side, haha. I think the content itself is tricky, which may trip people up.

    Something I’ve seen before that may be interesting is doing something with the definitions of words. ATM, you’re using a source list of words and using the embedded vectors to visualize. But what if you visualized not just the words themselves, but also the ordered list of words that makes up the definition(s) of each word, visible in some spatial relationship? This would look interesting because each word would have its meaning around it (connected to it?); changing the definition (the contextual use of the word) would also change the connected word nodes in the graph. I envision ordered lines and colored words in this style.

    If you end up doing something like that, start with like.. a “sentence player”. At the moment, you show all the words at once. What would it look like to “animate” the appearance of the words and their relationships by definition?

    Anyway. Thanks for getting this far, haha. This is a really fascinating project and I’m glad you shared it. Please do tell if any of this is close or far off from something you might be interested in!

  • bravura 2 years ago

    t-SNE, not PCA - and with t-SNE parameters that can be configured.
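
    Something like this (a sketch; perplexity and the seed are the knobs worth exposing):

        import numpy as np
        from sklearn.manifold import TSNE

        embeddings = np.random.randn(3000, 768)  # stand-in for real word vectors
        coords_3d = TSNE(n_components=3, perplexity=30,
                         random_state=42).fit_transform(embeddings)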

chaxor 2 years ago

Typically these types of single-word embedding visualizations work much better with non-contextualized models, such as the more traditional gensim or w2v approaches, since contextual encoder-based embedding models like BERT don't 'bake in' as much to the token (word) itself, and instead rely on its context to define it. Also, PCA on contextual models like BERT often ends up with $PC_0$ aligned with the length of the document.
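
A quick way to sanity-check that last claim on your own vectors (a sketch; embed() is any mean-pooling function like the one sketched upthread, and the sample texts are illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    texts = [
        "dog",
        "the quick brown fox",
        "a somewhat longer sentence about dogs",
        "a considerably longer sentence about dogs, foxes, and embeddings",
    ]
    embs = np.vstack([embed(t).numpy() for t in texts])  # embed() from upthread

    pc0 = PCA(n_components=1).fit_transform(embs).ravel()
    lengths = [len(t.split()) for t in texts]
    print(np.corrcoef(pc0, lengths)[0, 1])  # large |r| supports the claim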

kvakkefly 2 years ago

Running the same words multiple times, I get different visualizations. I don't really understand what's going on, but I like the idea of visualizing embeddings.

  • wrsh07 2 years ago

    I think the PCA dimensionality reduction is non-deterministic, but I say this with really low confidence.

    • soVeryTired 2 years ago

      PCA is purely deterministic (and might not give great results). My guess is that this is done by t-SNE or UMAP, both of which depend on a seed.

      • minimaxir 2 years ago

        The About page explicitly says it's PCA.

        PCA is fine enough to get it into 3D.

thom 2 years ago

Seems mostly nonsensical, not sure if that's a bug or some deeper point I'm missing.

pamelafox 2 years ago

I’m looking for more resources like this that attempt to visually explain vectors, as I’ll be giving some talks around vector search. Does anyone have related suggestions?

tetris11 2 years ago

Interesting that "cromulent" and "hentai" seem to map right next to each other, as well as the words "decorate" and "spare".

  • tikimcfee 2 years ago

    Similarly, words like “I” and “am” appear in a slightly different dense section of the map by default.

eurekin 2 years ago

I added those in succession:

> man woman king queen ruler force powerful care

and couldn't reliably determine the position of any of them.

smrtinsert 2 years ago

I would love a quickest path between two words. For example between color and colour

  • tudorw 2 years ago

    I think that's going to be a geodesic in a hyper-dimensional manifold. There was an article here about 'wordlets' on a hypersphere, and a piece on time, LLMs, and the related manifold. Visualising LLM topology (multi-dimensional topological manifolds) is a very rich area for exploration. I'm waiting for someone to use PHATE to do the dimension reduction; it's used in neuroscience to reduce dimensionality, providing information not visible using PCA, t-SNE, LDA, or UMAP.
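
    For anyone who wants to try it, the phate package exposes a scikit-learn-style API (a sketch; untested here on word embeddings specifically):

        import numpy as np
        import phate

        embeddings = np.random.randn(3000, 768)  # stand-in for real word vectors
        coords_3d = phate.PHATE(n_components=3).fit_transform(embeddings)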

larodi 2 years ago

Is this with some sort of dimensionality reduction of the embedding space?

  • kaoD 2 years ago

    In the bottom left "?" button it says it performs PCA down to 3 dimensions. That's going to lose a ton of information, rendering the space mostly useless.

    • smrtinsert 2 years ago

      It is lossy, but doesn't PCA function as a grouping even when forced like this?

    • behnamoh 2 years ago

      Yeah, it's a fun useless project.

      • tikimcfee 2 years ago

        I don’t know about useless. I think there is some real magic waiting to be discovered in mapping language to some specific and enlightening visualization, and I think it involves something like this: using rich statistics and simple spatial relationships to create a “mapping” of a single individual’s wordspace.

        Imagine walking around the world and seeing everyone’s slightly unique relationship space of words. This is something I have envisioned for a very long time.

        • kaoD 2 years ago

          What you're describing exists. It's called "embeddings"; it's one of the first steps ChatGPT performs to do its magic, and it's indeed very useful.

          What renders this useless is reducing the dimensionality from thousands to just 3.

          • tikimcfee 2 years ago

            Thank you.. sincerely, thank you.

            You just unlocked something in my brain about how this works in a certain way and I.. am incredibly excited to explore this more.

            If I’m intuitively understanding that, I really really really feel like I can… get this. Do something with this.

            Thank you. Now I need to go understand embeddings more. And use them.

            If you have any resources or ideas or keywords to look for (I think I expect a long YouTube trek for myself…), I’m happy to read and share my thoughts.

            • kaoD 2 years ago

              People are talking about embeddings in these comments, you'll get plenty of info and keywords here.

cuttysnark 2 years ago

edge of the galaxy: 'if when that then wherever where while for'

  • tikimcfee 2 years ago

    I wonder what arrangements of the word glyphs would produce meaningful patterns. You said “edge of the galaxy” based on the position of the words. I wonder what else you’d come up with given different embeddings and organizations.
