With typical lexical search methods, when you search your computer, the Internet, or most documentation systems, you see results that contain the exact words you searched for. This is often useful. But sometimes the results you actually need don't contain your search terms at all, and you miss them completely.
For example, finding exactly what you want in a project’s documentation isn’t always easy. You might know what you need, but the organization and naming of the docs don’t always reflect that structure. And even if you find what you’re looking for, applying it to your specific problem isn’t always straightforward.
Enter semantic search. Semantic search is search based on meaning. For example, maybe you search for "Learning how to use a database." There are docs called Getting Started and SQL Crash Course that would both be relevant. With traditional keyword search, neither will show up, because neither title contains your search terms.
With semantic search, on the other hand, those two articles are the first results.
Semantic search is often conducted using vector embeddings, which are frequently stored in vector databases. Vector databases are the raison d'être for several companies, and some of those focus specifically on the problem of semantically-searchable docs. But you can do it for free in bit.io. It doesn't take much code or much work. And it's Postgres: there's likely no need to integrate your project with a bespoke vector database solution.
Follow along and we’ll walk you through all of the steps. In this article, we’ll briefly cover how semantic search works, and then we’ll show how to quickly and easily implement it in bit.io with the pgvector PostgreSQL extension.
How it Works
Semantic search goes beyond the traditional keyword-based approach, focusing on understanding the meaning and context of a query to provide more relevant results. It takes advantage of advanced language processing techniques, which include large language models (LLMs), embeddings, and similarity metrics.
- Large Language Models (LLMs): These models, such as GPT-4, are trained on vast amounts of text data, enabling them to understand context and relationships between words. LLMs form the foundation for capturing the semantics of language.
- Embeddings: LLMs generate continuous vector representations, called embeddings, of words, phrases, or documents, encapsulating their semantic meaning. These embeddings map textual data to a high-dimensional space, where items with similar meanings are closer together, represented as vectors.
- Similarity metrics: These are methods to measure the similarity between vectors, such as the embeddings generated by LLMs. Cosine similarity, for example, is a commonly-used metric that calculates the cosine of the angle between two embeddings. This yields a similarity score ranging from -1 (pointing in opposite directions) to 1 (identical direction), with 0 indicating unrelated (orthogonal) vectors. In semantic search, similarity metrics help determine the relevance of search results by comparing the query's vector representation (embedding) to the embeddings of documents in the search index.
Semantic search understands the meaning behind your query, going beyond exact keyword matches to offer more relevant results. By leveraging LLMs, embeddings, and similarity metrics such as cosine similarity, semantic search provides a more effective way to find the information you need.
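To make the similarity metric concrete, here is a minimal sketch of cosine similarity using NumPy. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": doc_a points in nearly the same direction as the
# query; doc_b is nearly orthogonal to it.
query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]
doc_b = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_a))  # close to 1: semantically similar
print(cosine_similarity(query, doc_b))  # close to 0: unrelated
```

In a real search index, you would compute this score between the query's embedding and every document's embedding, then rank documents by score.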
Make Docs (Semantically) Searchable
The pgvector extension brings the vector data type and vector similarity metrics (specifically L2 distance, inner product, and cosine distance) to Postgres. This makes it easy to make product documentation — or any textual data — accessible via semantic search. The basic steps are:
- Export your docs
- Load the pgvector extension in your database
- Create embeddings from docs
- Load docs and embeddings to bit.io
- Generate embeddings from search string or question
- Find the most similar docs using a similarity metric
- (Optionally) summarize the results with an LLM.
We’ll go through each of these steps separately.
Export your Docs
In this example, we’re generating embeddings from our documentation. We extracted the text, title, and link for each document using the readme.io developer API. The specific method for extracting documentation will depend on your documentation system and is beyond the scope of this article. For now, we’ll assume you have, at the very least, a list of strings containing the documents from which you would like to generate embeddings.
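Whatever export method you use, the rest of this walkthrough assumes your docs end up in a simple structure like the following. This is a sketch with made-up content; the field names match the table we create later, and the text field is what gets passed to the embedding model.

```python
# Hypothetical export result: one dict per document.
docs = [
    {
        "title": "Getting Started",
        "text": "Connect to your database and run your first query...",
        "url": "https://example.com/docs/getting-started",
    },
    {
        "title": "SQL Crash Course",
        "text": "SELECT, WHERE, JOIN, and GROUP BY in fifteen minutes...",
        "url": "https://example.com/docs/sql-crash-course",
    },
]

# Sanity-check the structure before generating embeddings.
assert all({"title", "text", "url"} <= d.keys() for d in docs)
```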
Load the pgvector extension
The pgvector extension is already installed on bit.io. Enabling it is simple. Just run the following query in the database you’d like to use for semantic search.
CREATE EXTENSION vector;

Create embeddings from docs
There are many ways to create embeddings. You can train your own model or use a pre-trained model from Hugging Face or OpenAI. Here, we’ll show how to use the OpenAI text-embedding-ada-002 model API to generate embeddings from text.
The following python function sends each of your texts to the OpenAI API and returns the vector embeddings.
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

def get_embeddings(text):
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text.replace("\n", " ")
    )
    embedding = response["data"][0]["embedding"]
    return embedding

embeddings = [get_embeddings(d) for d in docs]
The resulting embeddings look like this (but with 1,536 dimensions):
[-0.013901988975703716, -0.014833222143352032, 0.002860217820852995,
-0.030517857521772385, -0.03490795940160751, 0.010210311971604824,
-0.016442926600575447, -0.010629367083311081, -0.004227135796099901,
-0.03179498016834259, ...]

How to divide your documents into sections from which to create embeddings may require some thought. In our case, it was fairly obvious: each document within our documentation made sense as its own vector embedding. You might need to get creative, though, for less structured documents or for long documents with many headers and subheaders.
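For instance, one simple chunking approach is to split a Markdown document into one section per header, and embed each section separately. This is only a sketch; real documents may need smarter splitting (by token count, or by nesting level).

```python
def split_by_headers(markdown_text):
    """Split a Markdown document into sections at lines starting with '#'."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            # A new header starts: close out the previous section.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections

doc = """# Connecting
Use a connection string.
## Python
Use psycopg.
## R
Use RPostgres."""

chunks = split_by_headers(doc)
# Three chunks, one per header; each could get its own embedding.
```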
Load the docs and embeddings to bit.io
Now that we've loaded the extension and created the embeddings, we need to get the embeddings (and text and any necessary metadata) into the database. We'll show how to do this with the psycopg Python library, but any Postgres library or client can be used for this purpose.
First, we create the table in our database. Note the vector data type for the embedding column. This type is made available through the pgvector extension. You need to specify the number of dimensions in the vector embeddings. You can typically find this in the documentation from the embeddings provider you’re using; in this case, the text-embedding-ada-002 model generates vectors with 1536 dimensions, so we specify the column type as vector(1536).
import psycopg_pool

# pg_string is your Postgres connection string
pool = psycopg_pool.ConnectionPool(pg_string)
table_sql = """
create table vector_search.docs (
title text not null,
text text not null,
url text not null,
embedding vector(1536)
);
"""
with pool.connection() as conn:
conn.execute(table_sql)
Next, we’ll load the data to this table. We stored the docs as a JSON array; each object in the array has keys for title, text, url, and embedding. Why these fields?
- embedding for vector similarity/semantic search.
- text so we can access and work with the full text of the article (e.g., to generate summaries with LLMs).
- url so we can refer users to the original docs online.
- title to uniquely identify the document.
import json

with open("bitio_docs.json", "r") as f:
    data = json.load(f)

with pool.connection() as conn:
    for doc in data:
        # Extract the values from each doc
        title = doc["title"]
        text = doc["text"]
        url = doc["url"]
        embedding = doc["embedding"]
        # Insert the row
        sql = "INSERT INTO vector_search.docs (title, text, url, embedding) VALUES (%s, %s, %s, %s)"
        conn.execute(sql, (title, text, url, embedding))
    # The pool's context manager commits the transaction and returns the
    # connection to the pool on exit, so no explicit commit() or close()
    # is needed.
You can also insert embeddings into an existing table with, e.g.,

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Vector literals are written as bracketed, comma-separated lists enclosed in single quotes; e.g., '[1,2,3,4,5]' is a five-dimensional vector.
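If you are building these literals from Python lists, a small helper can format an embedding in the bracketed syntax pgvector expects. The function name here is our own, not part of pgvector:

```python
def to_vector_literal(embedding):
    """Format a Python list of numbers as a pgvector literal, e.g. '[1,2,3]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

print(to_vector_literal([1, 2, 3]))      # [1,2,3]
print(to_vector_literal([0.5, -1.25]))   # [0.5,-1.25]
```

You can then pass to_vector_literal(embedding) as a query parameter and cast it with ::vector on the SQL side.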
Vector Search
Now that we have our vector embeddings on bit.io, it's time to use them! Let's start with the example from the beginning: "Learning how to use a database." Unfortunately, we can't just input the text directly: pgvector compares vectors, not strings. So we first need to create an embedding from the search string; then we can compute the cosine distance between that embedding and the embedding of each of the docs, and return the closest matches.
We get the embeddings from our search query the same way as we got the embeddings from the docs:
# See above for the definition of the get_embeddings function
query_embedding = get_embeddings("Learning how to use a database")

Note: embeddings are model-specific! You need to use the same model to generate embeddings for your search query as you used for the original text; otherwise, the search will not return useful results.
Semantic Search: Find the most relevant docs
Here’s where the magic happens: let’s find the two most similar docs to our search string. Remember, we’re using the vector embeddings generated from the query “Learning how to use a database.”
with pool.connection() as conn:
    # Pass the embedding in its string form ("[0.013, ...]") so it casts
    # cleanly to the vector type on the Postgres side.
    results = conn.execute(
        "SELECT title, text, url FROM vector_search.docs ORDER BY embedding <=> %s::vector LIMIT 2;",
        [str(query_embedding)],
    )
    out = results.fetchall()

titles = [row[0] for row in out]
print(titles)
Which returns
['SQL Crash Course', 'Getting Started with bit.io'].
Note that we needed to cast the input to the vector type with the :: operator. The <=> operator is introduced by the pgvector extension and computes cosine distance (1 minus cosine similarity), so ordering by it in ascending order returns the closest embeddings first. The query, then, returns the articles whose embeddings are most similar to the query's according to cosine similarity (which OpenAI recommends for use with text-embedding-ada-002 embeddings).
Let’s try some others. “How to query with Python” returns ['iPython Notebook', 'Python — SQLAlchemy', 'Python — psycopg2 and the bit.io Python SDK']. “Database connection termination” returns The Connection Lifecycle as the first result. “How do I translate SQLite to Postgres?” returns our Query Translation doc (have you tried our SQL dialect translation functionality yet?).
This demonstrates the usefulness of semantic search — the results capture something about the meaning of the search query, not just the exact words used.
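The query side of this workflow can be wrapped in a small reusable helper. This is a sketch with names of our own choosing, not an API from bit.io or pgvector; taking the embedding function and query executor as parameters also makes the helper easy to exercise with stubs before pointing it at a live database.

```python
def semantic_search(query, embed_fn, execute_fn, limit=2):
    """Return the docs most similar to `query`.

    embed_fn: maps a string to an embedding list
              (e.g. the get_embeddings function above).
    execute_fn: runs (sql, params) and returns the result rows, so any
                Postgres client -- or a stub in tests -- can be plugged in.
    """
    embedding = embed_fn(query)
    sql = (
        "SELECT title, text, url FROM vector_search.docs "
        "ORDER BY embedding <=> %s::vector LIMIT %s;"
    )
    # Pass the embedding as a string literal for the ::vector cast.
    return execute_fn(sql, [str(embedding), limit])
```

With psycopg, for example, execute_fn could be `lambda sql, params: conn.execute(sql, params).fetchall()` inside a pooled connection block.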
Summarize the results with ChatGPT
We can take this a step further. Even though we've located the most relevant docs, we'd rather not hunt through those docs for the most relevant information. That's why we included the text field, which contains the raw text of our docs. We can send the text of the most relevant doc or docs to an LLM and request a summary crafted to directly answer the user's question.
ChatGPT works great for this application. We’ve written a lot about ChatGPT for SQL translation already, but it’s a very flexible tool and is also very good for text summarization.
In this example, we’ll just try to obtain the answer from the first result. But there are other approaches: in some cases, it might make sense to take an excerpt from each of the top few results and generate an answer from that excerpt.
def summarize_doc(search_string, docs_text):
    messages = [{"role": "system",
                 "content": "You concisely answer questions based on text that is provided to you."}]
    prompt = """Answer the user's prompt or question:

{search_string}

by summarizing the following text:

{docs_text}

Keep your answer direct and concise. Provide code snippets where applicable.
The question is about a PostgreSQL database. You can enrich the answer with other
PostgreSQL-relevant details if applicable.""".format(search_string=search_string, docs_text=docs_text)
    messages.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0]["message"]["content"]
We can combine this with our vector search functionality as follows, starting from a user’s prompt and ending with a summary:
# 1. Get embeddings from the user's prompt
search_string = "How do I translate SQLite to Postgres?"
query_embedding = get_embeddings(search_string)

# 2. Get the text of the most similar doc
with pool.connection() as conn:
    results = conn.execute(
        "SELECT text FROM vector_search.docs ORDER BY embedding <=> %s::vector LIMIT 1;",
        [str(query_embedding)],
    )
    out = results.fetchall()
article_text = out[0][0]  # first row, first (and only) column

# 3. Summarize the text
summary = summarize_doc(search_string, article_text)
Which returns:
To translate SQLite syntax to Postgres, you can use bit.io’s SQL syntax translator. Start your query string with `#!translate:sqlite` on the first line followed by SQL in SQLite syntax on subsequent lines. Bit.io will attempt to translate it into PostgreSQL. Note that the first line will be removed. Function names may not always be translated and functions may not have the same semantics or exist in Postgres. Translation adds around 3–50ms latency, depending on query complexity and translator. Use the documentation of https://github.com/tobymao/sqlglot and https://github.com/PRQL/prql to understand translator support and limits.
Your Turn
In this article, we've shown how you can turn a collection of texts into a column of vector embeddings, enabling you to perform semantic search on those texts. Try it yourself! Again, your bit.io databases have the pgvector extension installed; you just need to enable it with CREATE EXTENSION vector;. From there, work through the code in this article, or follow along with the pgvector docs, to set up your own vector database.
We’ve also made the Postgres database with our docs and their vector embeddings publicly available—you can take a look here and try out some of the vector search functions. For example, suppose you want the docs most similar to the Python — Pandas docs. You can find them with the following query:
SELECT *
FROM vector_search.docs
WHERE title != 'Python—Pandas'
ORDER BY embedding <=>
(SELECT embedding
FROM vector_search.docs
WHERE title = 'Python—Pandas') LIMIT 2;

Why do this on Postgres when there are plenty of specialized (and expensive) vector databases out there? Depending on your use case, you might not need the special features those services offer. And the pgvector extension already offers several different vector search methods and indexing options. Especially if you are not dealing with exceptionally large vector embeddings or a massive text corpus, pgvector+bit.io offers a free, performant approach to vector databases. Give it a try!
Other Postgres + AI Stories
Interested in more of our writing on AI and databases? Check out the following articles: