Adding Intelligent Search to a Rails application with Weaviate



Andrei Bondarev

In this blog post we’ll add intelligent search capabilities to an application that stores a large collection of cooking recipes.

Semantic search differs from lexical search: the former takes meaning into consideration, while the latter performs literal keyword matching.

The modern way to implement semantic search is by leveraging Large Language Models (LLMs) and vectorization. Vectorization is a technique for mapping and grouping words, sentences, and paragraphs (forms of text) by meaning in a multi-dimensional coordinate space.¹
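To build some intuition, here is a toy illustration of ours (not how real embedding models work): imagine words plotted as small vectors, where related meanings sit close together and distance measures how related two words are.

```ruby
# Toy 2-D "embeddings", made up for illustration. Real models such as
# OpenAI's produce vectors with thousands of dimensions.
WORDS = {
  "cat"    => [0.9, 0.8],
  "kitten" => [0.85, 0.75],
  "car"    => [0.1, 0.2]
}

# Euclidean distance: smaller means closer in meaning on our toy map.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end
```

On this toy map, `distance(WORDS["cat"], WORDS["kitten"])` is much smaller than `distance(WORDS["cat"], WORDS["car"])`, which is exactly the property semantic search exploits.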

For the purposes of vector conversion, storage, and retrieval, vector search databases have been exploding in popularity this year. We're going to look at one such vector search database called Weaviate. For the LLM capabilities we'll use OpenAI's models. Let's dive right in.

Suppose we have an application that hosts a large PostgreSQL database of cooking recipes in a table called recipes, with the following schema:

# app/models/recipe.rb

# Table name: recipes
#
# id :bigint
# title :string
# description :text
# total_time :integer
# ingredients :text
# instructions :text
# nutrients :jsonb
# yields :string

We’d like to build semantic search capabilities over this dataset. Here are the high-level steps we’re going to follow:

  1. Register for and create a database instance on Weaviate Cloud Services (WCS).
  2. Obtain the URL and API key for the WCS instance.
  3. Register for and obtain the OpenAI API key.
  4. Add the Weaviate API client Ruby gem and create the schema.
  5. Index data from PostgreSQL to Weaviate.
  6. Test out Weaviate's search capabilities with the near_text: and ask: parameters.

So let’s create a database instance on WCS and set the WEAVIATE_URL and WEAVIATE_API_KEY environment variables to values copied from here:

(Image: Weaviate env vars needed to instantiate the API client)

Set the OPENAI_API_KEY environment variable after obtaining it here.

Let’s now add the Weaviate Ruby API client to our Gemfile and configure the client:

# Gemfile

gem "weaviate-ruby", "~> 0.8.0"

require "weaviate"

client = Weaviate::Client.new(
  url: ENV['WEAVIATE_URL'],
  api_key: ENV['WEAVIATE_API_KEY'],

  # Configure Weaviate to use OpenAI to create vectors and use it for querying.
  # You can also use Cohere, Hugging Face or Google PaLM and pass their API key here instead.
  model_service: :openai,
  model_service_api_key: ENV['OPENAI_API_KEY']
)

Next we need to create the schema in Weaviate that will hold our recipes’ data:

client.schema.create(
  class_name: "Recipes",                  # Name of the collection
  description: "A collection of recipes", # Description of the collection
  vectorizer: "text2vec-openai",          # OpenAI will be used to create vectors
  module_config: {
    "qna-openai": {                       # Weaviate's OpenAI Q&A module
      model: "text-davinci-003",          # OpenAI LLM to be used
      maxTokens: 3500,                    # Maximum number of tokens to generate in the completion
      temperature: 0.0,                   # How deterministic the output will be
      topP: 1,                            # Nucleus sampling
      frequencyPenalty: 0.0,
      presencePenalty: 0.0
    }
  },
  properties: [
    {
      dataType: ["int"],
      description: "Recipe ID",
      name: "recipe_id"                   # Our PostgreSQL recipes.id
    },
    {
      dataType: ["text"],
      description: "Recipe content",
      name: "content"                     # Recipes' concatenated content
    }
  ]
)

Now that we've created the Weaviate schema, we're ready to import our data. We'll create a text_blob method that formats and concatenates the data to produce a single long string:


recipe.text_blob

# =>
# "Title: Roasted-Vegetable Stock\n" +
# "Description: This delicious stock has a depth of flavor that comes from roasting the vegetables. Use whatever vegetables you have on hand, but avoid anything too strongly flavored, such as broccoli or cabbage, as they will overwhelm the stock.\n" +
# "Total time: 185 minutes\n" +
# "Ingredients: 1 whole head garlic, 4 carrots, cut into chunks, 4 stalks celery, cut into chunks, 3 onions, cut into chunks, 1 green pepper, quartered, 1 tomato, quartered, 0.33333334326744 cup olive oil, salt and pepper to taste, 8 cups water, 1.5 teaspoons dried thyme, 1.5 teaspoons dried parsley, 2 bay leaves\n" +
# "Instructions: Preheat oven to 400 degrees F (200 degrees C). Cut the top off the head of garlic. Arrange the garlic, carrots, celery, onion, pepper, and tomato on a large baking sheet in a single layer. Drizzle the olive oil over the vegetables; season with salt and pepper. Roast the vegetables in the preheated oven, turning every 20 minutes, until tender and browned, about 1 hour. Combine the water, thyme, parsley, and bay leaves in a large stock pot over medium-high heat. Squeeze the head of garlic into the stock pot, and discard the outer husk. Place the carrots, celery, onion, pepper, and tomato in the stock pot. Bring the water to a boil; reduce heat to low and simmer for 1 1/2 hours; strain and cool.\n" +
# "calories: 132 kcal, fatContent: 9 g, fiberContent: 3 g, sugarContent: 5 g, sodiumContent: 53 mg, proteinContent: 2 g, carbohydrateContent: 12 g, saturatedFatContent: 1 g, unsaturatedFatContent: 0 g\n" +
# "Serving size: 8 servings"
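Here's one way text_blob might be implemented (a sketch; the exact field formatting is our assumption, and we use a Struct so the example runs outside Rails, whereas in the app it would be an instance method on the ActiveRecord Recipe model):

```ruby
# Stand-in for the ActiveRecord model so this sketch runs outside Rails.
Recipe = Struct.new(:title, :description, :total_time, :ingredients,
                    :instructions, :nutrients, :yields) do
  # Formats and concatenates the record's fields into one long string.
  def text_blob
    [
      "Title: #{title}",
      "Description: #{description}",
      "Total time: #{total_time} minutes",
      "Ingredients: #{ingredients}",
      "Instructions: #{instructions}",
      nutrients.map { |key, value| "#{key}: #{value}" }.join(", "),
      "Serving size: #{yields}"
    ].join("\n")
  end
end
```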

We’ll now create a rake task to batch add the data to Weaviate:

# lib/tasks/import_recipes_to_weaviate.rake

require "weaviate"

task import_recipes_to_weaviate: :environment do
  client = Weaviate::Client.new(
    url: ENV['WEAVIATE_URL'],
    api_key: ENV['WEAVIATE_API_KEY'],
    model_service: :openai,
    model_service_api_key: ENV['OPENAI_API_KEY']
  )

  Recipe.find_in_batches do |group|
    objects = group.map do |recipe|
      {
        class: "Recipes",            # Each object must reference the collection name
        properties: {
          content: recipe.text_blob, # The formatted text blob shown above
          recipe_id: recipe.id       # Recipe ID from the recipes PostgreSQL table
        }
      }
    end

    client.objects.batch_create(
      objects: objects
    )
  end
end

We’ll run the rake task (rake import_recipes_to_weaviate) and then confirm that all records were successfully imported:

client.query.aggs(
  class_name: "Recipes",
  fields: "meta { count }"
)

# => [{ "meta" => { "count" => 100 }}]

We can now try out various ways of querying and searching our data stored in Weaviate. Let’s start by asking it a few questions:

client.query.get(
  class_name: "Recipes",
  limit: "1",
  fields: """
    recipe_id
    _additional {
      answer {
        result
      }
    }
  """,
  ask: "{ question: \"What is a vegan dish that could be made for Thanksgiving?\" }"
)

# => "result" => " A vegan dish that could be made for Thanksgiving is a Roasted Butternut Squash and Quinoa Salad. This dish is made with roasted butternut squash, quinoa, kale, cranberries, and a maple-balsamic vinaigrette. It is a nutritious and flavorful dish that is sure to be a hit at any Thanksgiving gathering."
# "recipe_id" => 80526

client.query.get(
  class_name: "Recipes",
  limit: "1",
  fields: """
    recipe_id
    _additional {
      answer {
        result
      }
    }
  """,
  ask: "{ question: \"What is a traditional French breakfast?\" }"
)

# => "result" => " A traditional French breakfast typically consists of a croissant or a piece of bread with butter and jam, accompanied by a hot beverage such as coffee or tea."
# "recipe_id" => 76795

We get our answer back, along with a recipe_id in case we wanted to fetch the full recipe.
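For example, a small helper (ours, not part of the gem; the response shape here mirrors the output shown above, so verify it against the actual client return value) could pull both fields out before loading the full record:

```ruby
# Extracts the generated answer and the matching recipe's ID from a
# Weaviate Q&A response shaped like the output shown above.
def extract_answer(response)
  hit = response.first
  {
    answer: hit.dig("_additional", "answer", "result"),
    recipe_id: hit["recipe_id"]
  }
end

# In the Rails app we could then fetch the full recipe:
#   Recipe.find(extract_answer(response)[:recipe_id])
```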

How does this actually work? You can read the under-the-hood details here, but in short: the vectorization process selects the documents that are most similar to your question and sends them as context in a prompt to OpenAI, which completes the answer.
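In miniature, that retrieve-then-ask flow looks something like this (a toy sketch of ours with made-up vectors, not Weaviate's actual implementation):

```ruby
# Hypothetical document "embeddings" -- in Weaviate these come from the
# text2vec-openai vectorizer.
DOCS = {
  "Roasted-Vegetable Stock: roast garlic, carrots, celery..." => [0.9, 0.1],
  "Chocolate Cake: cocoa, flour, sugar..."                    => [0.1, 0.9]
}

# Cosine similarity: higher means the two texts are closer in meaning.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# 1. Retrieve the document whose vector is closest to the question's, then
# 2. embed it as context in the prompt sent to the LLM for completion.
def build_prompt(question_vector, question_text)
  context, _vector = DOCS.max_by { |_text, vector| cosine(vector, question_vector) }
  "Answer using only the context below.\nContext: #{context}\nQuestion: #{question_text}"
end
```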

Moving past the Q&A style of searching — let’s execute actual semantic search queries:

client.query.get(
  class_name: "Recipes",
  limit: "3",
  fields: "recipe_id content",
  near_text: "{ concepts: [\"latin american pescado\"] }"
)

# =>
# [{"content"=>
# "Title: Chef John's Brazilian Fish Stew\n
# Description: Sweet and hot peppers with coconut milk make the sauce for poaching chunks of fish in Chef John's \"weeknight version\" of a classic Brazilian seafood stew....
# ...
# "recipe_id"=>174052},
# {"content"=>
# "Title: Soft Fish Tacos\n
# Description: Authentic taste of Mexico\n
# ...
# "recipe_id"=>80754},
# {"content"=>
# "Title: Javi's Really Real Mexican Ceviche\n
# ...
# "recipe_id"=>2203

We searched for “latin american pescado” and, even though none of those recipes explicitly includes those words, vector search matched the “Brazilian Fish Stew”, “Soft Fish Tacos… Authentic taste of Mexico”, and “Javi’s Really Real Mexican Ceviche” recipes.

YUM!