Show HN: Semantic Search on AWS Docs

github.com

38 points by antti909 3 years ago · 15 comments

mdaniel 3 years ago

I can appreciate that they want folks to deploy more AWS resources, but without a demo it's hard to know if it's worth the energy.

And by demo, I mean: they actually ingested the AWS documentation, so it's in their best interest to wire this up to docs-staging.aws.amazon.com or some such, with any necessary "this is not supported, it may go away at any time" disclaimer. They're playing with house money, after all.

  • Nathanba 3 years ago

    Yes, this requires setting up a GPU server, which is very expensive, and the whole process involves a lot of steps. It would have to be far better than something like barebones Lucene, but I see no proof of concept.

nextworddev 3 years ago

A couple of observations: 1) It uses AWS OpenSearch rather than any of the more popular vector DBs du jour (Pinecone, Weaviate, Milvus, etc.); I've never used OpenSearch for ANN. 2) It obviously doesn't support the OpenAI or Cohere embedding models; understandably, they want to promote the OSS / HuggingFace ones. 3) The best AWS doc search has actually been ChatGPT (GPT-4 specifically), even with the knowledge cutoff.

  • jillesvangurp 3 years ago

    The nice thing about using OpenSearch is that you can combine vector search with normal search.

    OpenSearch actually does support some of the OpenAI embedding models. Basically, as long as your embeddings fit in the vector field type, you can use them. OpenSearch has an advantage over Elasticsearch here in that it supports higher-dimensional vectors, which means you can use the fancier, newer models provided by e.g. OpenAI.
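
    To make that concrete, a hybrid query is just a bool query mixing a BM25 match clause with an approximate k-NN clause. Here's a minimal sketch with the opensearch-py client (the index name and the "text"/"embedding" field names are made up, and it assumes an index created with a knn_vector mapping):

      from opensearchpy import OpenSearch

      client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
      query_vector = [0.1] * 384  # stand-in for a real query embedding

      body = {
          "size": 10,
          "query": {
              "bool": {
                  "should": [
                      # Lexical relevance, scored with BM25.
                      {"match": {"text": "restrict s3 bucket access"}},
                      # Approximate nearest neighbours on the vector field.
                      {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
                  ]
              }
          },
      }

      hits = client.search(index="aws-docs", body=body)["hits"]["hits"]
      for hit in hits:
          print(hit["_score"], hit["_source"].get("title"))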

    I've been diving into vector search lately, from the perspective of someone who isn't necessarily interested in (or skilled at) creating bespoke AI models, but who is interested in sticking off-the-shelf pieces of technology together to implement search functionality. Basically, there are all these vector search engines out there, and they all loosely do the same things: 1) take some blob of content and some chunk of extremely-expensive-to-run software that creates vectors for it, 2) store those vectors, and 3) let people do distance search on those vectors against a second vector computed from the query, typically using ANN.

    That's it. There's a lot of hand-waviness around creating those embedding vectors, which is not what most of these products solve. Not even a little bit. You typically need to provide your own embeddings. There are many ways to do that: the simplest are using some Docker container (e.g. easybert), writing a simple Python script, or calling something like the OpenAI embeddings API with a suitable model.
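
    For instance, the "bring your own embeddings" step can be a few lines of Python with the off-the-shelf sentence-transformers package (the model name here is just one common choice, not necessarily what this project uses):

      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

      docs = [
          "Amazon S3 stores objects in buckets.",
          "IAM policies control access to AWS resources.",
      ]
      doc_vectors = model.encode(docs)                  # shape (2, 384)
      query_vector = model.encode("restrict bucket access")

      # doc_vectors go into the vector field at index time;
      # query_vector is what you'd plug into the knn clause above.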

    The hard part is picking the right model and evaluating its performance. Mostly the performance tends to be underwhelming, especially for short queries. And there's a trade-off: all the fancy models produce huge vectors, which are expensive to query and store.

    • kacperlukawski 3 years ago

      I'm still wondering why OpenSearch and ES have those limits on the dimensionality of the embeddings while vector databases such as Qdrant do not.

      • jillesvangurp 3 years ago

        The limitations are very practical. High-dimensional vectors have a huge size cost, and that limits scalability; it blows up pretty quickly when you index millions of documents. Query cost also explodes quickly, to the point of being impractical in terms of resource usage. Lucene's vector field support has some limitations here; OpenSearch uses native libraries to work around them, while Elasticsearch only supports whatever Lucene supports, for now. I'd say that will probably evolve over time, since it's a relatively new feature for both.
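
        A quick back-of-envelope illustrates the size cost (illustrative numbers, raw float32 storage only; ANN graph overhead and replicas come on top):

          dims, docs = 1536, 10_000_000   # e.g. an OpenAI-sized embedding
          bytes_per_vector = dims * 4     # float32
          print(docs * bytes_per_vector / 1e9)  # ~61 GB for the vectors alone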

        Dedicated products have similar limitations, of course, but they have the advantage that they can be optimized for this one use case and use some native trickery (GPUs, etc.) to mitigate some of the effects. So there's a bit of a trade-off in what you need and what you can afford. Ultimately, handling millions or billions of huge vectors has a cost.

  • PaulHoule 3 years ago

    I think their strategy is to use a cheap search algorithm to cut down the candidate results and then a more expensive network to filter those results. The head-end search could be replaced by an embedding-based search using a vector search engine, but it may work well enough with the conventional one.
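
    If that's right, it's the classic retrieve-then-rerank pattern. A sketch of the second stage with a cross-encoder from sentence-transformers (the model name is just an example; I'm not claiming this is what the project ships):

      from sentence_transformers import CrossEncoder

      query = "restrict s3 bucket access"
      candidates = [  # whatever the cheap first-stage search returned
          "Amazon S3 stores objects in buckets.",
          "IAM policies control access to AWS resources.",
      ]

      # Score each (query, candidate) pair, then sort best-first.
      reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
      scores = reranker.predict([(query, doc) for doc in candidates])
      reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]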

    • gymbeaux 3 years ago

      I think their strategy is to promote and sell their proprietary *aaS solutions

thefourthchime 3 years ago

Can’t normal ChatGPT do most of this already?

tinyhouse 3 years ago

Very cool. Thanks for sharing! I like how you put everything together with open source libraries and cloud tools, showing how one can build a robust search app fairly quickly. Well done.

d4rkp4ttern 3 years ago

It’s using the Haystack library, whose functionality seems to overlap with that of LangChain. Anyone know what the trade-offs are between the two?

  • antti909 (OP) 3 years ago

    Heh, you seem to keep asking :) You could also ask in our community Discord, tbh; there are people there who have been trying both. There's definitely a ton of great stuff in LangChain, so I'd be curious myself!

simlevesque 3 years ago

Why does it use Terraform instead of CDK?
