Ask HN: Embeddings as "Semantic Hashes"
As I understand it, embeddings are semantic representations of input data, such as text or images, in a vector space where conceptual similarity is reflected in distance. However, this vector space is only meaningful to the model that produced it.
To draw an analogy, can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256?
How can we make embeddings future-proof and exchangeable between independent parties?

You can already use embeddings as features (as input) to another model that is then trained only on the embedding vectors. In this sense, they are exchangeable. It goes even further: a model sophisticated enough to capture a probability distribution will produce embeddings that encode this distribution (to some extent), so any two models of that kind produce "equivalent" embeddings that can be transformed into each other. This is an area of active research (in fact, I've just been to a seminar talk about it). So the answer to the "How can we...?" would be: by capturing the distribution, by making the embedding big enough and the training task difficult enough. Examples of embeddings that are reused are variants of word2vec, CLIP and CLAP. As others have already mentioned, the hash analogy would be correct if you think of non-cryptographic hashes, but I doubt that this clarifies anything.

> can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256?

No, and there is no equivalent; the goals are different. Cryptographic hashes are designed to give a unique representation that changes completely even when a single bit changes. Their purpose is to detect whether data has changed and to protect it from change (intentional or not). Vector representations are designed to make similarities easy to find, so many pieces of data that differ at the bit level will have equal, or very close, vector representations. A good vector representation also allows computationally efficient measurement of the distance between different pieces of data.

The main compatibility problem with vector representations is that there are several different algorithms, each using very large parameter sets, and so far I have not seen any attempt to standardize those parameter sets, because they are very large and expensive to create, and there are copyright issues. There may also be patented algorithms, though I don't know for sure.

As an example, take some legal text in English and a good translation of it into French (or any other language): the two are completely different at the binary level, but equal in some vector representation. Unfortunately, converting from one vector space to another is impossible in the general case, because the spaces do not overlap 100%, so some cases that are possible in one space are impossible in the other. A second problem is that conversions between high-dimensional vector spaces are computationally expensive and inexact. As an illustration of how hard such conversions are, there is the old anecdote about an early machine translator that rendered "the spirit is strong but the flesh is weak" into Russian and back into English as "the vodka is good, but the meat is rotten".

Parameter sets are basically the contexts in which a representation exists. Fortunately, we have reached the point where nearly everyone can have their own copy of a large subset of human knowledge (Wikipedia). So I think that in the near future we will see widely used, at least Wikipedia-based, vector representations, and an organization like Mozilla could define a standard context. But for AGI we need more: a 3D representation of the world (at least geography and buildings, not all exact, but adequate); an unrestricted knowledge base of pictures (video), sounds and, ideally, tactile representations of a large enough subset of the world's objects; and some anthropocentric representation of how living things, such as humans, trees and some animals, move.
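To make the hash-versus-embedding contrast above concrete, here is a minimal sketch, assuming the sentence-transformers package is installed; the multilingual model name is just one plausible choice, not a recommendation. SHA-256 gives unrelated digests for an English sentence, its French translation and a one-character edit, while the embeddings of the semantically similar inputs should land close together.

    # Contrast a cryptographic hash with an embedding for three inputs.
    import hashlib

    import numpy as np
    from sentence_transformers import SentenceTransformer  # assumed dependency

    a = "The contract terminates on 31 December."
    b = "Le contrat prend fin le 31 décembre."      # French translation of a
    c = "The contract terminates on 30 December."   # one character changed

    # SHA-256: any change at all yields a completely unrelated digest.
    for text in (a, b, c):
        print(hashlib.sha256(text.encode("utf-8")).hexdigest()[:16], text)

    # Embeddings: semantically close inputs should land close together.
    # The model name is an illustrative choice of multilingual sentence encoder.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    va, vb, vc = model.encode([a, b, c])

    def cosine(x, y):
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    print("cos(english, french)  =", cosine(va, vb))  # expected to be high
    print("cos(original, edited) =", cosine(va, vc))  # expected to be high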
Your suggestion that embeddings are only meaningful to the (presumably generating) model isn't quite right. You can pass them through any suitable model you like (e.g. a logistic regression, k-means) and get decent results. You couldn't do that with a hash, as far as I understand it, since hashing doesn't attempt to put similar things together -- quite the opposite.

Serialize the embeddings via ASCII characterization (0-127 only) using gradient descent[0] (or simplify with LaTeX source). Do a stereoscopic embedding: one eye for meaning, the other for distance. Put GIS coordinates as g-code/GRBL[0] code in a docstore database as a 3D-printable bias relief[1].

[0] g-code/GRBL: https://www.libhunt.com/compare-Universal-G-Code-Sender-vs-G...

What's the goal behind making them exchangeable? The usual plan is to recalculate embeddings whenever the model changes, or when moving between systems. You'll almost certainly want to update the model over time, as the input distribution changes, to maintain good accuracy. So you need to keep the original source data and recalculate embeddings as needed.

Thanks for the insight! A use case could be that anyone can provide a piece of content with an accompanying embedding, which can then be used for semantic search, i.e. the search engine does not have to compute the embedding of everything, just the query.

The problem with this line of thinking is that the specific embedding chosen has a big impact on task performance[1], and the person producing the content often doesn't know what you're going to want to use it for. The second thing is that certain embedding formats are very specific to particular model architectures. From a practical perspective, there are some standard embedding formats that people use a lot, so if you're performing a normal sort of task there's probably a standard format for it (e.g. it's worth checking out spaCy's embeddings, which work with a lot of different libraries[2]).

[1] For example, see this paper for a comparison of the performance of different code embeddings: https://arxiv.org/pdf/2109.07173.pdf

If you want to keep the embeddings relevant, you could train new models with the cosine distance to the current embeddings as part of the cost function; training would then pull the new embeddings as close to the current ones as possible, like curve fitting. (A rough sketch of this idea appears at the end of the thread.)

Hash functions are an analogy that falls apart very quickly. You can recover the original word from the embedding, but not from the hash. A hash function will return very distant values for very similar inputs; an embedding will return similar ones.

Pardon the pedantry, but this reflects the casual/conversational use of "hash function", not the more general definition. To be a hash function, it just has to map a set to another set of fixed-size values (usually some finite subset of the natural numbers). Returning unrelated (distant) hashes for similar inputs is a possible property of a hash function, and often a desirable one (especially in cryptography), but there are in fact use cases where one wants similar inputs to map to similar (or the same) hashes (see the toy sketch below). https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing

Not for perceptual hashes! Not all hashes are cryptographic hashes.
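As a toy illustration of the locality-sensitive hashing mentioned above, here is a minimal random-hyperplane (SimHash-style) sketch in plain NumPy; the dimension and bit count are arbitrary, and it is only meant to show that a hash can be designed so that similar inputs collide.

    # Random-hyperplane LSH: hash a vector to a short bit signature such that
    # nearby vectors agree on most bits (i.e. collisions are the point).
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 384, 16                        # arbitrary sizes for the demo
    planes = rng.standard_normal((n_bits, dim))  # one random hyperplane per bit

    def lsh_signature(vec):
        """Bit i records which side of hyperplane i the vector falls on."""
        bits = planes @ vec > 0
        return sum(1 << i for i, b in enumerate(bits) if b)

    v = rng.standard_normal(dim)
    v_close = v + 0.01 * rng.standard_normal(dim)  # small perturbation of v
    v_far = rng.standard_normal(dim)               # unrelated vector

    for name, u in [("close", v_close), ("far", v_far)]:
        hamming = bin(lsh_signature(v) ^ lsh_signature(u)).count("1")
        print(name, "-> bits differing from v's signature:", hamming)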
[1] : https://www.yeggi.com/q/bias/

Strictly in the virtual realm, it can be shortened to s-expressions/m-expressions as part of an L-system equation. Or just stick with the traditional math equation(s).
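Going back to the earlier suggestion of training new models with the cosine distance to the current embeddings as part of the cost function: a rough sketch of that idea in PyTorch might look like the following, with placeholder encoders and a dummy task loss standing in for the real ones.

    # Train a new encoder while penalising drift from the frozen current one,
    # so embeddings already stored stay approximately usable.
    import torch
    import torch.nn.functional as F

    old_encoder = torch.nn.Linear(768, 256)  # stand-in for the current model
    new_encoder = torch.nn.Linear(768, 256)  # stand-in for the model in training
    old_encoder.requires_grad_(False)

    optimizer = torch.optim.Adam(new_encoder.parameters(), lr=1e-3)
    anchor_weight = 0.5                      # strength of the pull to the old space

    for _ in range(100):                     # toy loop on random "data"
        x = torch.randn(32, 768)
        new_emb = new_encoder(x)
        with torch.no_grad():
            old_emb = old_encoder(x)

        task_loss = new_emb.pow(2).mean()    # placeholder for the real task loss
        anchor_loss = (1 - F.cosine_similarity(new_emb, old_emb, dim=-1)).mean()
        loss = task_loss + anchor_weight * anchor_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()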