How to use PostgreSQL to store and query vector embeddings
What you will learn
- What pgvector is: a PostgreSQL extension for storing and querying vector embeddings efficiently.
- The prerequisites for following along: basic Python and SQL (specifically PostgreSQL's SQL), psycopg2, PyTorch, and Transformers in your Python environment, plus Docker and Docker Compose.
- How to set up a PostgreSQL database with the pgvector extension using Docker: a Dockerfile, a docker-compose.yml file, an initialization SQL script, and then building and running the container.
- How the generate_embeddings function works: it tokenizes the input text (plus optional metadata), passes it through a transformer model, average-pools and normalizes the last hidden states, and returns the result as a JSON string ready to be stored in the database.
- How to perform a cosine similarity search with pgvector's <=> operator, subtracting the cosine distance it returns from 1 to obtain the cosine similarity.
Vector embeddings are powerful representations of data points in a high-dimensional space, and they are widely used in Natural Language Processing (NLP) and Machine Learning. Storing and efficiently querying these embeddings is crucial for building scalable and performant applications.
In this blog post, I will explain how to use pgvector, a PostgreSQL extension, to store and query vector embeddings efficiently. All of the code for this article can be found in the companion GitHub repository.
Prerequisites
Here are the items that you need to be somewhat familiar with before continuing with this tutorial:
- Basic knowledge of Python and SQL (specifically PostgreSQL’s SQL).
- psycopg2, PyTorch and Transformers installed in your Python environment.
- Docker and Docker Compose installed on your system.
First, I’ll explain how to set up a Postgres database with the pgvector extension using Docker and Docker Compose.
Setting Up the Database
Step 1: Create a Dockerfile
In your project directory, create a file named Dockerfile
and add the following content to it:
# Use the official Postgres image as a base image
FROM postgres:latest
# Set environment variables for Postgres
ENV POSTGRES_USER=myuser
ENV POSTGRES_PASSWORD=mypassword
ENV POSTGRES_DB=mydb
# Install the build dependencies
USER root
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    postgresql-server-dev-all \
    && rm -rf /var/lib/apt/lists/*
# Clone, build, and install the pgvector extension
RUN cd /tmp \
    && git clone --branch v0.5.0 https://github.com/pgvector/pgvector.git \
    && cd pgvector \
    && make \
    && make install
Step 2: Create a docker-compose.yml File
In your project directory, create a file named docker-compose.yml
and add the following content to it:
version: "3"
services:
  postgres:
    build: .
    ports:
      - "5432:5432"
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./init_pgvector.sql:/docker-entrypoint-initdb.d/init_pgvector.sql
    environment:
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
Step 3: Create Initialization SQL Script
Next, create a file called init_pgvector.sql
and add the following to it:
-- Install the extension we just compiled
CREATE EXTENSION IF NOT EXISTS vector;
/*
For simplicity, we store the content directly in this table as a TEXT column
alongside its embedding. It could just as easily be a foreign key pointing to
another table that holds the content you want to vectorize for semantic
search, with only the vectorized content stored here in our "items" table.

The "768" dimensions for our vector embedding column is important: it matches
the number of dimensions output by the open-source embedding model we use
later in this post.
*/
CREATE TABLE items (id bigserial PRIMARY KEY, content TEXT, embedding vector(768));
This SQL script will be used in the next section about building and running the docker container.
Step 4: Build and Run the Docker Container
Navigate to your project directory in the terminal and run the following command to build and run the Docker container using the Dockerfile and docker-compose.yml
you just created:
docker compose up
With our pgvector extension compiled and loaded into our database, and our database up and accepting connections, let’s move on to the Python app.
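Before we do, here is an optional sanity check I like to run: a few lines of Python (using the psycopg2 package from the prerequisites) that connect with the credentials from docker-compose.yml and confirm the vector extension is registered. This is just a sketch for verification and is not part of the companion code:
import psycopg2

# Connect with the credentials defined in docker-compose.yml
conn = psycopg2.connect(user="myuser", password="mypassword",
                        host="localhost", port=5432, database="mydb")
cur = conn.cursor()

# The pg_extension catalog lists every installed extension
cur.execute("SELECT extname, extversion FROM pg_extension WHERE extname = 'vector'")
print(cur.fetchone())  # expect something like ('vector', '0.5.0')

cur.close()
conn.close()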
Creating Embeddings with Transformers
For the sake of keeping this blog post shorter, I won’t cover in detail what vector embeddings are. I do offer a free newsletter, and I’ve written an introduction to vector embeddings that you are free to check out - and then come back!
Let’s move on to how we are creating embeddings.
Introduction to embedding_util
The embedding_util module is how we generate embeddings in this tutorial. It leverages a transformer model, thenlper/gte-base (more info here), to transform textual data into meaningful, semantically rich vector representations (our vector embeddings).
Code Overview
I’ll share the code that we are using to create the vector embeddings in our embedding_util.py
file, and then explain it.
import os
import json

from transformers import AutoTokenizer, AutoModel
import torch
from torch import Tensor
import torch.nn.functional as F

# We won't have competing threads in this example app
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize tokenizer and model for GTE-base
tokenizer = AutoTokenizer.from_pretrained('thenlper/gte-base')
model = AutoModel.from_pretrained('thenlper/gte-base')


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def generate_embeddings(text, metadata={}):
    combined_text = " ".join(
        [text] + [v for k, v in metadata.items() if isinstance(v, str)])
    inputs = tokenizer(combined_text, return_tensors='pt',
                       max_length=512, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    attention_mask = inputs['attention_mask']
    embeddings = average_pool(outputs.last_hidden_state, attention_mask)
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return json.dumps(embeddings.numpy().tolist()[0])
1. Importing Necessary Libraries
We start by importing the necessary libraries and modules: os for setting an environment variable, json for serialization, AutoTokenizer and AutoModel from the transformers library for tokenization and model loading, torch and its Tensor type for tensor operations, and torch.nn.functional for additional functionality such as normalization.
2. Setting Environment Variable
os.environ["TOKENIZERS_PARALLELISM"] = "false"
This line disables tokenizers parallelism to prevent any threading issues in this example app.
NOTE: If you are using this inside an asynchronous web framework like FastAPI, you’ll want to set this to "true" instead.
3. Initializing Tokenizer and Model
tokenizer = AutoTokenizer.from_pretrained('thenlper/gte-base')
model = AutoModel.from_pretrained('thenlper/gte-base')
Here, we initialize the tokenizer and model for GTE-base, which we then use to convert text into embeddings.
4. Average Pooling Function
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
This function performs average pooling on the last hidden states of the transformer model, using the attention mask to handle padded tokens. It returns the average pooled tensor, which represents the semantic essence of the input text.
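To make the pooling step concrete, here is a tiny, self-contained sketch with made-up numbers (it is not part of embedding_util.py): a batch of one sequence with three token positions, the last of which is padding, so only the two real tokens contribute to the average:
import torch

# Toy inputs: batch of 1, sequence length 3, hidden size 2
last_hidden_states = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
attention_mask = torch.tensor([[1, 1, 0]])  # the third position is padding

# Same operations as average_pool above
masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
pooled = masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
print(pooled)  # tensor([[2., 3.]]) -- the padded token is ignored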
5. Embedding Generation Function
def generate_embeddings(text, metadata={}):
    combined_text = " ".join(
        [text] + [v for k, v in metadata.items() if isinstance(v, str)])
    inputs = tokenizer(combined_text, return_tensors='pt',
                       max_length=512, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    attention_mask = inputs['attention_mask']
    embeddings = average_pool(outputs.last_hidden_state, attention_mask)
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return json.dumps(embeddings.numpy().tolist()[0])
This is where the main action happens. This function takes in text and optional metadata, tokenizes the input, passes it through the transformer model, and obtains the last hidden states. It then applies the average_pool
function to get the final embedding, normalizes it, and returns it as a JSON string, ready to be stored in the database.
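As a quick illustration of what the function returns, assuming the module above is saved as embedding_util.py next to your script, the JSON string decodes back into a list of 768 floats, matching the vector(768) column we defined earlier:
import json

from embedding_util import generate_embeddings

# The example sentence here is arbitrary
embedding = generate_embeddings("The cat sat on the mat")
vector = json.loads(embedding)
print(type(embedding), len(vector))  # <class 'str'> 768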
Understanding the embedding generation process is pivotal for implementing effective and efficient semantic searches. This module, powered by the transformers library, facilitates the creation of rich, meaningful vector representations of text, enabling nuanced similarity searches when integrated with pgvector
and PostgreSQL.
Integration with PostgreSQL for Storage and Querying of Embeddings
With the explanation of the code behind how we create the embeddings out of the way, let’s move on to the core of our Python app. Here, in app.py, we consolidate all of our components to insert example sentences into our PostgreSQL database and then execute a cosine similarity search using pgvector to rank and retrieve the items most similar to a given query.
Bringing it All Together: app.py
Now, I’ll share the whole app.py
file, then explain it part by part:
import psycopg2

from embedding_util import generate_embeddings


def run():
    # Establish a connection to the PostgreSQL database
    conn = psycopg2.connect(
        user="myuser",
        password="mypassword",
        host="localhost",
        port=5432,  # The port you exposed in docker-compose.yml
        database="mydb"
    )

    # Create a cursor to execute SQL commands
    cur = conn.cursor()

    try:
        sentences = [
            "A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.",
            "The mathematician found solace in numbers, deciphering the hidden patterns of the universe.",
            "The robot, with its intricate circuitry and precise movements, assembles the devices swiftly.",
            "The chef, with a sprinkle of spices and a dash of love, creates culinary masterpieces.",
            "The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.",
            "The detective, with keen observation and logical reasoning, unravels the intricate web of clues.",
            "The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.",
            "In the dense forest, the howl of a lone wolf echoes, blending with the symphony of the night.",
            "The dancer, with graceful moves and expressive gestures, tells a story without uttering a word.",
            "In the quantum realm, particles flicker in and out of existence, dancing to the tunes of probability.",
        ]

        # Insert sentences into the items table
        for sentence in sentences:
            embedding = generate_embeddings(sentence)
            cur.execute(
                "INSERT INTO items (content, embedding) VALUES (%s, %s)",
                (sentence, embedding)
            )

        # Commit the inserts so they persist beyond this session
        conn.commit()

        # Example query
        query = "Give me some content about the ocean"
        query_embedding = generate_embeddings(query)

        # Perform a cosine similarity search
        cur.execute(
            """SELECT id, content, 1 - (embedding <=> %s) AS cosine_similarity
               FROM items
               ORDER BY cosine_similarity DESC LIMIT 5""",
            (query_embedding,)
        )

        # Fetch and print the result
        print("Query:", query)
        print("Most similar sentences:")
        for row in cur.fetchall():
            print(
                f"ID: {row[0]}, CONTENT: {row[1]}, Cosine Similarity: {row[2]}")
    except Exception as e:
        print("Error executing query", str(e))
    finally:
        # Close communication with the PostgreSQL database server
        cur.close()
        conn.close()


# This check ensures that the function is only run when the script is
# executed directly, not when it's imported as a module.
if __name__ == "__main__":
    run()
1. Connection Setup
After importing our dependencies, we need to establish a connection to our Postgres database and create a cursor that we can use to execute queries:
# Establish a connection to the PostgreSQL database
conn = psycopg2.connect(
    user="myuser",
    password="mypassword",
    host="localhost",
    port=5432,  # The port you exposed in docker-compose.yml
    database="mydb"
)

# Create a cursor to execute SQL commands
cur = conn.cursor()
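Hard-coding the credentials keeps this demo short. In anything beyond a local experiment, you would typically read them from the environment instead; here is a minimal sketch of the same connection, with environment variable names that are purely illustrative:
import os

import psycopg2

# Fall back to the demo credentials if the variables are not set
conn = psycopg2.connect(
    user=os.getenv("DB_USER", "myuser"),
    password=os.getenv("DB_PASSWORD", "mypassword"),
    host=os.getenv("DB_HOST", "localhost"),
    port=int(os.getenv("DB_PORT", "5432")),
    database=os.getenv("DB_NAME", "mydb"),
)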
2. Embedding Insertion
Inside the try..except block, we iterate over a list of sentences, generate an embedding for each one, and insert it into the items table:
sentences = [
    "A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.",
    "The mathematician found solace in numbers, deciphering the hidden patterns of the universe.",
    "The robot, with its intricate circuitry and precise movements, assembles the devices swiftly.",
    "The chef, with a sprinkle of spices and a dash of love, creates culinary masterpieces.",
    "The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.",
    "The detective, with keen observation and logical reasoning, unravels the intricate web of clues.",
    "The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.",
    "In the dense forest, the howl of a lone wolf echoes, blending with the symphony of the night.",
    "The dancer, with graceful moves and expressive gestures, tells a story without uttering a word.",
    "In the quantum realm, particles flicker in and out of existence, dancing to the tunes of probability.",
]

# Insert sentences into the items table
for sentence in sentences:
    embedding = generate_embeddings(sentence)
    cur.execute(
        "INSERT INTO items (content, embedding) VALUES (%s, %s)",
        (sentence, embedding)
    )

# Commit the inserts so they persist beyond this session
conn.commit()
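Executing one INSERT per sentence is perfectly fine at this scale. If you ever need to load a large number of embeddings, batching the inserts into a single round trip is usually faster; here is a sketch using psycopg2's execute_values helper as a drop-in alternative to the loop above (this is not how the companion repository does it):
from psycopg2.extras import execute_values

# Generate all embeddings up front, then insert them in one statement
rows = [(sentence, generate_embeddings(sentence)) for sentence in sentences]
execute_values(
    cur,
    "INSERT INTO items (content, embedding) VALUES %s",
    rows
)
conn.commit()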
3. Embedding Retrieval and Similarity Search
Creating an embedding for a sample query and leveraging the <=>
operator provided by pgvector
allows us to perform a cosine similarity search.
It is crucial to note that pgvector’s <=> operator actually calculates the cosine distance, which is not the same thing as cosine similarity.
Cosine similarity measures the cosine of the angle between two non-zero vectors (how closely they point in the same direction), while cosine distance is simply 1 minus the cosine similarity. Here’s another explanation of cosine distance and cosine similarity.
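A tiny numeric example (with made-up vectors) makes the relationship concrete:
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 0.0])
b = torch.tensor([1.0, 1.0])

# Cosine similarity is the cosine of the angle between the vectors
cosine_similarity = F.cosine_similarity(a, b, dim=0).item()
cosine_distance = 1 - cosine_similarity
print(round(cosine_similarity, 4), round(cosine_distance, 4))  # 0.7071 0.2929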
To obtain the cosine similarity from the cosine distance calculated by pgvector, we subtract the cosine distance from 1 and order the results by that similarity, as illustrated below:
# Example query
query = "Give me some content about the ocean"
query_embedding = generate_embeddings(query)

# Perform a cosine similarity search
cur.execute(
    """SELECT id, content, 1 - (embedding <=> %s) AS cosine_similarity
       FROM items
       ORDER BY cosine_similarity DESC LIMIT 5""",
    (query_embedding,)
)

# Fetch and print the result
print("Query:", query)
print("Most similar sentences:")
for row in cur.fetchall():
    print(
        f"ID: {row[0]}, CONTENT: {row[1]}, Cosine Similarity: {row[2]}")
Our final output logged to the console should look something like this:
Query: Give me some content about the ocean
Most similar sentences:
ID: 7, CONTENT: The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea., Cosine Similarity: 0.8009044169301547
ID: 5, CONTENT: The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past., Cosine Similarity: 0.760971262397107
ID: 1, CONTENT: A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures., Cosine Similarity: 0.7582434704013556
ID: 8, CONTENT: In the dense forest, the howl of a lone wolf echoes, blending with the symphony of the night., Cosine Similarity: 0.7446566376294829
ID: 2, CONTENT: The mathematician found solace in numbers, deciphering the hidden patterns of the universe., Cosine Similarity: 0.7399669003526144
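One last note on performance before wrapping up: the query above does an exact scan over every row, which is instant for ten sentences but slows down as the items table grows. pgvector (v0.5.0) also supports approximate indexes such as IVFFlat with the cosine operator class. The sketch below is optional and not part of the companion code, and the lists value is only illustrative; an approximate index trades a little recall for speed, so benchmark it on your own data:
# Optional: add an approximate IVFFlat index for larger tables
cur.execute(
    "CREATE INDEX IF NOT EXISTS items_embedding_idx "
    "ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
)
conn.commit()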
Conclusion
This blog post illustrated how to efficiently store and query vector embeddings in PostgreSQL using pgvector, opening up advanced possibilities in semantic search and analysis for Natural Language Processing and Machine Learning.
The companion code repository, with all of the code explained in this tutorial in one place, can be found here.
Questions or comments? Feel free to contact me!