Implementing Semantic Search with Sequel and pgvector

In my previous post, An LLM-based AI Assistant for the FastRuby.io Newsletter, I introduced an AI-powered assistant we built with Sinatra to help our marketing team write summaries of blog posts for our newsletter.

In this post, I’ll go over how we implemented semantic search using pgvector and Sequel to fetch examples of previous summaries based on article content.

Semantic search allows our AI assistant to find the most relevant past examples, given meaning and context, when generating new summaries. This helps ensure consistency in tone and style while providing context-aware results that serve as better examples for the large language model (LLM), improving the quality of the generated output.

Brief Introduction to Semantic Search and Cosine Distance

Semantic search is a technique used to find items in a database that are similar, contextually or conceptually, to a given query. This means we don’t need to rely solely on exact keyword matches, and instead can find items that are related in meaning.

It “understands” meaning and context by converting text into high-dimensional vectors called embeddings. These embeddings capture semantic relationships, and allow us to find conceptually related items by calculating distances between vectors.

Cosine distance is one of the most popular metrics for measuring the similarity between two vectors. It is defined as 1 minus the cosine of the angle between the two vectors, so it captures how closely their semantic directions align, regardless of their magnitudes.
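To make that concrete, here is a minimal Ruby sketch of cosine distance between two embedding vectors. pgvector computes this for us in SQL; the helper below is purely illustrative.

def cosine_distance(a, b)
  # Dot product of the two vectors.
  dot = a.zip(b).sum { |x, y| x * y }
  # Magnitude (Euclidean norm) of each vector.
  magnitude_a = Math.sqrt(a.sum { |x| x * x })
  magnitude_b = Math.sqrt(b.sum { |x| x * x })

  1 - dot / (magnitude_a * magnitude_b)
end

v = [1.0, 2.0, 3.0]
cosine_distance(v, v.map { |x| x * 2 })  # ≈ 0.0 — same direction, magnitude ignored
cosine_distance(v, [-1.0, -2.0, -3.0])   # ≈ 2.0 — opposite direction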

Other metrics supported by pgvector include Euclidean distance, inner product, taxicab (or Manhattan distance), Hamming distance, and Jaccard distance. So why not use one of those instead?

Euclidean distance is sensitive to magnitude and can suffer from the curse of dimensionality, making it less effective for high-dimensional data like text embeddings. Inner product is better suited for recommendation systems; in our case it could give too much weight to frequently covered topics, which would not yield the best results for our summaries. Taxicab (Manhattan) distance is similar to Euclidean but uses absolute differences, which is also less effective for high-dimensional data. Hamming distance applies to binary vectors, which is not our case: our embeddings are continuous, floating-point values. Jaccard distance is likewise designed for binary or categorical data, not continuous embeddings.

Therefore, cosine distance is the most appropriate choice for our use case, as it effectively captures the semantic similarity between text embeddings.

Getting Started with pgvector and Sequel

To implement semantic search, we used pgvector to store and query vector embeddings in our PostgreSQL database, and Sequel as our ORM to interact with the database.

For pgvector to work, you need to have the pgvector extension installed in your system and enabled in your PostgreSQL database.

You can install the pgvector extension by following the instructions in the pgvector documentation, then make sure it is enabled in your database:

CREATE EXTENSION IF NOT EXISTS vector;
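If you manage schema changes with Sequel migrations, the same statement can be run from a migration. This is a minimal sketch, assuming the connecting database role has permission to create extensions:

Sequel.migration do
  up do
    # Enable the pgvector extension for this database.
    run "CREATE EXTENSION IF NOT EXISTS vector"
  end

  down do
    run "DROP EXTENSION IF EXISTS vector"
  end
end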

Now you can add it to your Sequel configuration as an extension:

require "sequel"

Sequel.extension :pgvector  # Extends the main Sequel module with pgvector functionality.
DB = Sequel.connect(ENV["DATABASE_URL"])
DB.extension :pgvector  # Extends the specific database connection with pgvector support.

With the setup complete, we can now create a table to store our articles with their embeddings.

Sequel.migration do
  change do
    create_table(:articles) do
      primary_key :id
      String :title, null: false
      String :content, text: true, null: false
      String :summary, text: true
      column :embedding, "vector(1536)"
      String :embedding_model
      DateTime :embedding_created_at

      foreign_key :link_id, :links, null: false
    end
    add_index :articles, :embedding, type: :ivfflat, opclass: :vector_cosine_ops
  end
end

This migration creates our articles table with a column to store the vector embeddings with a dimension of 1536. The dimension of the vector is determined by the embedding model we use; in this case, OpenAI’s text-embedding-ada-002 model, which produces 1536-dimensional embeddings.

The embedding_model and embedding_created_at columns are used to store metadata about the embedding.

For the index, pgvector supports both ivfflat and hnsw indexing methods. We chose ivfflat with the vector_cosine_ops operator class for cosine distance. The ivfflat index requires fewer resources and has lower build overhead, and since our dataset is small (hundreds to low thousands of rows) and perfect recall is not critical, it is a good fit.
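If recall ever matters more than build time and memory, the index could be swapped for hnsw. Here is a hedged sketch of what that might look like, assuming Sequel passes the index type and operator class through to PostgreSQL the same way it does for the ivfflat index above:

Sequel.migration do
  change do
    # HNSW trades a slower, more memory-intensive build for better recall at query time.
    add_index :articles, :embedding, type: :hnsw, opclass: :vector_cosine_ops
  end
end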

For interacting with the database and the embeddings, we created a model class using Sequel:

require "pgvector"

class Article < Sequel::Model
  plugin :pgvector, :embedding

  def embedding
    raw_value = self[:embedding]
    return nil unless raw_value

    raw_value
  rescue StandardError => e
    puts "Error retrieving embedding for article #{id}: #{e.message}"
    nil
  end

  def embedding?
    !embedding.nil?
  rescue StandardError
    false
  end
end

Due to pgvector’s type casting, accessing the embedding column raises an error when its value is nil. The embedding method we defined ensures that, in those cases, we return nil instead of raising an error, allowing us to handle missing embeddings gracefully.
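As a small usage sketch, callers can rely on the embedding? predicate to skip records that have not been embedded yet (the loop below is illustrative):

Article.each do |article|
  # Skip records whose embedding is missing or cannot be read.
  next unless article.embedding?

  puts "#{article.title}: #{article.embedding.size} dimensions"
end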

Embedding and Storing Content

As previously mentioned, we chose to use the OpenAI ada-002 model to generate embeddings. We are using the langchain.rb library to handle interactions with the OpenAI API, including generating embeddings.

With an OpenAI API key, the client can be initialized with the embedding model value set:

@client = Langchain::LLM::OpenAI.new(
  api_key: ENV.fetch("OPENAI_API_KEY"),
  default_options: {
    temperature: 0.7,
    chat_model: "gpt-4o",
    embedding_model: "text-embedding-ada-002"
  }
)

Embedding a document then becomes a simple call to the client’s .embed method:

def embed(doc)
  response = @client.embed(text: doc)
  response.embeddings.first
end

The embedding can then be stored in the database along with the other attributes:

embedding = embed(content)

Article.create(
  link_id: link_id,
  title: title,
  content: content,
  embedding: embedding,
  embedding_model: "text-embedding-ada-002",
  embedding_created_at: Time.now
)

We can now perform semantic search to find articles that are closely related to the one at hand. In my previous article, I showed a simple nearest neighbors search using cosine distance:

def fetch_examples(article)
  examples = article.nearest_neighbors(:embedding, distance: "cosine").limit(3)
  examples.map(:summary)
end

Here, article is an instance of the Article model, and we’re retrieving the three most similar articles based on their embeddings using the nearest_neighbors method.

This method works, but it is limited. Here, we’ll get the three most similar articles, regardless of how similar they are. Similarity is calculated using a distance metric, as explained above, but we are not taking that distance score into account.

To improve this, we can add a threshold to filter out articles that are not similar enough. The pgvector extension adds a specific operator for each distance metric, allowing us to filter results based on a minimum similarity score. For cosine distance, we can use the <=> operator:

def fetch_examples(article)
  embedding = article.embedding.to_s

  examples = Article
    .exclude(id: article.id)  # Don't let the article match itself.
    .select(:id, :title, :summary,
      Sequel.lit("1 - (embedding <=> ?::vector) AS similarity_score", embedding))
    .where(Sequel.lit("1 - (embedding <=> ?::vector) >= ?", embedding, 0.75))
    .order(Sequel.lit("embedding <=> ?::vector", embedding))
    .limit(3)

  examples.map(:summary)
end

This query retrieves up to three of the most similar articles to the given one, but only those whose similarity score is at least 0.75.

Now, we can reliably retrieve good examples of previous summaries that are contextually relevant to the article being summarized, guaranteeing that poor matches are filtered out. This allows our AI assistant to provide better output when generating new summaries, as the LLM has better examples to work with.
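To give an idea of how these examples feed into summary generation, here is a hedged sketch of how the fetched summaries might be assembled into a prompt. The wording below is illustrative only, not the prompt our assistant actually uses:

examples = fetch_examples(article)

prompt = <<~PROMPT
  Write a short newsletter summary for the article below.

  Here are examples of summaries we liked in the past:
  #{examples.compact.map { |summary| "- #{summary}" }.join("\n")}

  Article:
  #{article.content}
PROMPT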

Conclusion

Semantic search is a powerful technique that allows us to find contextually relevant items in our database, improving the quality of AI-generated content.

By using pgvector and Sequel, we can easily store and query vector embeddings, enabling us to perform similarity searches based on semantic meaning rather than just keywords. These tools are open source and easy to use, making them a great choice for implementing semantic search in Ruby applications.

Want to know how we can help you leverage AI for your business? Talk to us today!