Implementing Semantic Search with Sequel and pgvector
In my previous post, An LLM-based AI Assistant for the FastRuby.io Newsletter, I introduced an AI-powered assistant we built with Sinatra to help our marketing team write summaries of blog posts for our newsletter.
In this post, I’ll go over how we implemented semantic search using pgvector and Sequel to fetch examples of previous summaries based on article content.
Semantic search allows our AI assistant to find the most relevant past examples, given meaning and context, when generating new summaries. This helps ensure consistency in tone and style while providing context-aware results that serve as better examples for the large language model (LLM), improving the quality of the generated output.
Brief Introduction to Semantic Search and Cosine Distance
Semantic search is a technique used to find items in a database that are similar, contextually or conceptually, to a given query. This means we don’t need to rely solely on exact keyword matches, and instead can find items that are related in meaning.
It “understands” meaning and context by converting text into high-dimensional vectors called embeddings. These embeddings capture semantic relationships, and allow us to find conceptually related items by calculating distances between vectors.
Cosine distance is one of the most popular metrics for measuring the similarity between two vectors. It measures the cosine of the angle between two vectors, capturing how similar their semantic directions are, regardless of their magnitudes.
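To make the idea concrete, here is a minimal Ruby sketch of cosine distance, our own illustration using toy two-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

# Cosine distance = 1 - cosine similarity, where cosine similarity is
# the dot product of the vectors divided by the product of their magnitudes.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  magnitude = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1 - dot / (magnitude.call(a) * magnitude.call(b))
end

cosine_distance([1.0, 0.0], [1.0, 0.0]) # => 0.0 (same direction)
cosine_distance([1.0, 0.0], [0.0, 1.0]) # => 1.0 (orthogonal, unrelated)
cosine_distance([1.0, 0.0], [3.0, 0.0]) # => 0.0 (magnitude is ignored)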
Other metrics supported by pgvector include Euclidean distance, inner product, taxicab (or Manhattan) distance, Hamming distance, and Jaccard distance.
So why not use one of those instead?
- Euclidean distance is sensitive to magnitude and can suffer from the curse of dimensionality, making it less effective for high-dimensional data like text embeddings.
- Inner product is better suited for recommendation systems; in our case, it could give too much weight to frequently-used topics, not yielding the best results for our summaries.
- Taxicab (Manhattan) distance is similar to Euclidean but uses absolute differences, which can be just as ineffective for high-dimensional data.
- Hamming distance is used for binary vectors, which is not our case: our embeddings are continuous, floating-point values.
- Jaccard distance is also designed for binary or categorical data, not continuous embeddings.
Therefore, cosine distance is the most appropriate choice for our use case, as it effectively captures the semantic similarity between text embeddings.
Getting Started with pgvector and Sequel
To implement semantic search, we used pgvector to store and query vector embeddings in our PostgreSQL database, and Sequel as our ORM to interact with the database.
For pgvector to work, you need the pgvector extension installed on your system and enabled in your PostgreSQL database. You can install the extension by following the instructions in the pgvector documentation, then make sure it is enabled in your database:
CREATE EXTENSION IF NOT EXISTS vector;
Now you can add it to your Sequel configuration as an extension:
require "sequel"
Sequel.extension :pgvector # Extends the main Sequel module with pgvector functionality.
DB = Sequel.connect(ENV["DATABASE_URL"])
DB.extension :pgvector # Extends the specific database connection with pgvector support.
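As a quick sanity check (our own addition, not strictly required), you can confirm the extension is enabled by querying PostgreSQL’s pg_extension catalog through Sequel:

# Returns 1 when the vector extension is enabled in this database.
DB[:pg_extension].where(extname: "vector").count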
With the setup complete, we can now create a table to store our articles with their embeddings.
Sequel.migration do
  change do
    create_table(:articles) do
      primary_key :id
      String :title, null: false
      String :content, text: true, null: false
      String :summary, text: true
      column :embedding, "vector(1536)" # pgvector column with 1536 dimensions.
      String :embedding_model
      DateTime :embedding_created_at
      foreign_key :link_id, :links, null: false
    end

    add_index :articles, :embedding, type: :ivfflat, opclass: :vector_cosine_ops
  end
end
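Assuming the migration file lives in a migrations/ directory (a hypothetical layout; adjust the path to your project), it can be applied with Sequel’s migration extension:

require "sequel"

Sequel.extension :migration # Loads Sequel's built-in migrator.

DB = Sequel.connect(ENV["DATABASE_URL"])
Sequel::Migrator.run(DB, "migrations") # Applies any pending migrations.

The sequel command-line tool can do the same with its -m flag.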
This migration creates our articles table with a column to store the vector embeddings, with a dimension of 1536. The dimension of the vector is determined by the embedding model we use; in this case, OpenAI’s ada-002 (text-embedding-ada-002) model, which produces 1536-dimensional embeddings.
The embedding_model and embedding_created_at columns are used to store metadata about the embedding.
For the index, pgvector supports both ivfflat and hnsw indexing methods. We chose ivfflat with the vector_cosine_ops operator class for cosine distance. The ivfflat index requires fewer resources and has lower build overhead. Our dataset is small (hundreds to low thousands of rows) and accuracy matters but is not critical, so it is a good fit.
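One tuning knob worth knowing about: ivfflat is an approximate index, and pgvector’s ivfflat.probes setting controls how many lists are scanned per query. Raising it improves recall at the cost of query speed. A small sketch of adjusting it through Sequel (10 is an arbitrary example value):

# Default is 1; higher values scan more lists, trading speed for recall.
# Note that SET applies to the current database session.
DB.run("SET ivfflat.probes = 10")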
For interacting with the database and the embeddings, we created a model class using Sequel:
require "pgvector"
class Article < Sequel::Model
plugin :pgvector, :embedding
def embedding
raw_value = self[:embedding]
return nil unless raw_value
raw_value
rescue StandardError => e
puts "Error retrieving embedding for article #{id}: #{e.message}"
nil
end
def embedding?
!embedding.nil?
rescue StandardError
false
end
end
Due to pgvector’s type casting, if the value of the embedding column is nil, an error is raised when trying to access it. The embedding method we defined ensures that, in those cases, we return nil instead of raising an error, allowing us to handle missing embeddings gracefully.
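Here is a hypothetical usage sketch showing how the guard method keeps calling code simple:

article = Article.first

if article.embedding?
  puts "Article #{article.id} already has an embedding"
else
  puts "Article #{article.id} still needs one"
end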
Embedding and Storing Content
As previously mentioned, we chose the OpenAI ada-002 model to generate embeddings. We are using the langchain.rb library to handle interactions with the OpenAI API, including generating embeddings.
With an OpenAI API key, the client can be initialized with the embedding model set:
@client = Langchain::LLM::OpenAI.new(
  api_key: ENV.fetch("OPENAI_API_KEY"),
  default_options: {
    temperature: 0.7,
    chat_model: "gpt-4o",
    embedding_model: "text-embedding-ada-002"
  }
)
Embedding a document then becomes a simple call to the client’s .embed method:
def embed(doc)
  response = @client.embed(text: doc)
  response.embeddings.first
end
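For example (with a made-up input and illustrative output values):

vector = embed("How to upgrade your Rails app with confidence")
vector.length # => 1536
vector.first(3) # => something like [-0.0123, 0.0045, 0.0217]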
The embedding can then be stored in the database along with the article’s other attributes:
embedding = embed(content)

Article.create(
  link_id: link_id,
  title: title,
  content: content,
  embedding: embedding,
  embedding_model: "text-embedding-ada-002",
  embedding_created_at: Time.now
)
Performing Semantic Search
We can now perform semantic search to find articles that are closely related to the one at hand. In my previous article, I showed a simple nearest neighbors search using cosine distance:
def fetch_examples(article)
  examples = article.nearest_neighbors(:embedding, distance: "cosine").limit(3)
  examples.map(:summary)
end
Here, article is an instance of the Article model, and we’re retrieving the three most similar articles based on their embeddings using the nearest_neighbors method.
This method works, but it is limited: we always get the three most similar articles, regardless of how similar they actually are. Similarity is calculated using a distance metric, as explained above, but we are not taking that distance score into account.
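To see those scores, we can ask PostgreSQL for the raw cosine distance of each neighbor. This is our own exploratory sketch, using the <=> operator introduced below together with Sequel placeholders:

# List each article with its cosine distance to the given one
# (0 = identical direction, 1 = unrelated).
distance = Sequel.lit("embedding <=> ?::vector", article.embedding.to_s)

Article.exclude(id: article.id)
       .select(:id, :title)
       .select_append(distance.as(:cosine_distance))
       .order(:cosine_distance)
       .limit(5)
       .each { |row| puts "#{row[:title]}: #{row[:cosine_distance]}" }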
To improve this, we can add a threshold to filter out articles that are not similar enough. The pgvector extension adds a specific operator for each distance metric, allowing us to filter results based on a minimum similarity score. For cosine distance, we can use the <=> operator:
def fetch_examples(article)
  # :summary must be in the select list for the map(:summary) call below.
  examples = Article.select(:id, :title, :summary,
                            Sequel.lit("1 - (embedding <=> '#{article.embedding}'::vector) AS similarity_score"))
                    .where(Sequel.lit("1 - (embedding <=> '#{article.embedding}'::vector) >= ?", 0.75))
                    .order(Sequel.lit("embedding <=> '#{article.embedding}'::vector"))
                    .limit(3)
  examples.map(:summary)
end
This query retrieves the three most similar articles to the given one, but only if their similarity score is above 0.75.
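Since cosine similarity is just 1 minus cosine distance, a similarity threshold of 0.75 is equivalent to a distance cutoff of 0.25. As an alternative sketch, the same query can be expressed directly on the distance, using Sequel placeholders instead of interpolating the vector into the SQL string:

def fetch_examples(article)
  # similarity >= 0.75 is the same as distance <= 0.25.
  distance = Sequel.lit("embedding <=> ?::vector", article.embedding.to_s)

  Article.exclude(id: article.id)
         .where(distance <= 0.25) # Drop poor matches.
         .order(distance)         # Closest first.
         .limit(3)
         .map(:summary)
end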
Now, we can reliably retrieve good examples of previous summaries that are contextually relevant to the article being summarized, guaranteeing that poor matches are filtered out. This allows our AI assistant to provide better output when generating new summaries, as the LLM has better examples to work with.
Conclusion
Semantic search is a powerful technique that allows us to find contextually relevant items in our database, improving the quality of AI-generated content.
By using pgvector and Sequel, we can easily store and query vector embeddings, enabling us to perform similarity searches based on semantic meaning rather than just keywords.
These tools are open source and easy to use, making them a great choice for implementing semantic search in Ruby applications.
Want to know how we can help you leverage AI for your business? Talk to us today!