SQL interface for pgvector and pgvectorscale

Installing the pgvector and pgvectorscale extensions

If not already installed, install the vector and vectorscale extensions on your Timescale database.


CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale;

Creating the table for storing embeddings using pgvector

Vectors inside of the database are stored in regular PostgreSQL tables using vector columns. The vector column type is provided by the pgvector extension. A common way to store vectors is alongside the data they are embedding. For example, to store embeddings for documents, a common table structure is:


CREATE TABLE IF NOT EXISTS document_embedding  (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    document_id BIGINT FOREIGN KEY(document.id)
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1538)
)

This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the contents column) and the embedded vector.

You may ask why not just add an embedding column to the document table? The answer is that there is a limit on the length of text an embedding can encode and so there needs to be a one-to-many relationship between the full document and its embeddings.

The above table is just an illustration, it's totally fine to have a table without a foreign key and/or without a metadata column. The important thing is to have a column with the data being embedded and the vector in the same row, enabling you to return the raw data for a given similarity search query

The vector type can specify an optional number of dimensions (1,538) in the example above). If specified, it enforces the constraint that all vectors in the column have that number of dimensions. A plain VECTOR (without specifying the number of dimensions) column is also possible and allows a variable number of dimensions.

Query the vector embeddings

The canonical query is:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Which returns the 10 rows whose distance is the smallest. The distance function used here is cosine distance (specified by using the <=> operator). Other distance functions are available, see the discussion.

The available distance types and their operators are:

Distance type	Operator
Cosine/Angular	`<=>`
Euclidean	`<->`
Negative inner product	`<#>`

Note

If you are using an index, you need to make sure that the distance function used in index creation is the same one used during query (see below). This is important because if you create your index with one distance function but query with another, your index cannot be used to speed up the query.

Indexing the vector data using indexes provided by pgvector and pgvectorscale

Indexing helps speed up similarity queries of the basic form:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

The key part is that the ORDER BY contains a distance measure against a constant or a pseudo-constant.

Note that if performing a query without an index, you always get an exact result, but the query is slow (it has to read all of the data you store for every query). With an index, your queries are an order-of-magnitude faster, but the results are approximate (because there are no known indexing techniques that are exact see here for more).

Nevertheless, there are excellent approximate algorithms. There are 3 different indexing algorithms available on the Timescale platform: StreamingDiskANN, HNSW, and ivfflat. Below is the trade-offs between these algorithms:

Algorithm	Build Speed	Query Speed	Need to rebuild after updates
StreamingDiskANN	Fast	Fastest	No
HNSW	Fast	Fast	No
ivfflat	Fastest	Slowest	Yes

You can see benchmarks in the blog.

For most use cases, the StreamingDiskANN index is recommended.

Each of these indexes has a set of build-time options for controlling the speed/accuracy trade-off when creating the index and an additional query-time option for controlling accuracy during a particular query.

You can see the details of each index below.

StreamingDiskANN index

The StreamingDiskANN index is a graph-based algorithm that was inspired by the DiskANN algorithm. You can read more about it in How We Made PostgreSQL as Fast as Pinecone for Vector Data.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:


CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

StreamingDiskANN indexes only support cosine distance at this time, so you should use the <=> operator in your queries.

This creates the index with smart defaults for all the index parameters. These should be the right value for most cases. But if you want to delve deeper, the available parameters are below.

StreamingDiskANN index build-time parameters

These parameters can be set when an index is created.

Parameter name	Description	Default value
`storage_layout`	`memory_optimized` which uses SBQ to compress vector data or `plain` which stores data uncompressed	memory_optimized
`num_neighbors`	Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower.	50
`search_list_size`	This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds.	100
`max_alpha`	Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds.	1.2
`num_dimensions`	The number of dimensions to index. By default, all dimensions are indexed. But you can also index less dimensions to make use of Matryoshka embeddings	0 (all dimensions)
`num_bits_per_dimension`	Number of bits used to encode each dimension when using SBQ	2 for less than 900 dimensions, 1 otherwise

An example of how to set the num_neighbors parameter is:


CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding) WITH(num_neighbors=50);

StreamingDiskANN query-time parameters

You can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting diskann.query_rescore to fine-tune accuracy.

Parameter name	Description	Default value
`diskann.query_search_list_size`	The number of additional candidates considered during the graph search.	100
`diskann.query_rescore`	The number of elements rescored (0 to disable rescoring)	50

You can set the value by using SET before executing a query. For example:


SET diskann.query_rescore = 400;

Note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL which will be reset after the end of the transaction:


BEGIN;
SET LOCAL diskann.query_search_list_size= 10;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

StreamingDiskANN index-supported queries

You need to use the cosine-distance embedding measure (<=>) in your ORDER BY clause. A canonical query would be:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

pgvector HNSW

Pgvector provides a graph-based indexing algorithm based on the popular HNSW algorithm.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:


CREATE INDEX document_embedding_idx ON document_embedding
USING hnsw(embedding vector_cosine_ops);

This command creates an index for cosine-distance queries because of vector_cosine_ops. There are also "ops" classes for Euclidean distance and negative inner product:

Distance type	Query operator	Index ops class
Cosine / Angular	`<=>`	`vector_cosine_ops`
Euclidean / L2	`<->`	`vector_ip_ops`
Negative inner product	`<#>`	`vector_l2_ops`

Pgvector HNSW also includes several index build-time and query-time parameters.

pgvector HNSW index build-time parameters

These parameters can be set at index build time:

Parameter name	Description	Default value
`m`	Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size.	16
`ef_construction`	Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing `ef_construction` enables more accurate search results at the expense of lengthier index build times.	64

An example of how to set the m parameter is:


CREATE INDEX document_embedding_idx ON document_embedding
USING hnsw(embedding vector_cosine_ops) WITH (m = 20);

pgvector HNSW query-time parameters

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is called hnsw.ef_search. This parameter specifies the size of the dynamic candidate list used during search. Defaults to 40. Higher values improve query accuracy while making the query slower.

You can set the value by running:


SET hnsw.ef_search = 100;

Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL:


BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

pgvector HNSW index-supported queries

You need to use the distance operator (<=>, <->, or <#>) matching the ops class you used during index creation in your ORDER BY clause. A canonical query would be:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

pgvector ivfflat

Pgvector provides a clustering-based indexing algorithm. The blog post describes how it works in detail. It provides the fastest index-build speed but the slowest query speeds of any indexing algorithm.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:


CREATE INDEX document_embedding_idx ON document_embedding
USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);

This command creates an index for cosine-distance queries because of vector_cosine_ops. There are also "ops" classes for Euclidean distance and negative inner product:

Distance type	Query operator	Index ops class
Cosine / Angular	`<=>`	`vector_cosine_ops`
Euclidean / L2	`<->`	`vector_ip_ops`
Negative inner product	`<#>`	`vector_l2_ops`

Note: ivfflat should never be created on empty tables because it needs to cluster data, and that only happens when an index is first created, not when new rows are inserted or modified. Also, if your table undergoes a lot of modifications, you need to rebuild this index occasionally to maintain good accuracy. See the blog post for details.

Pgvector ivfflat has a lists index parameter that should be set. See the next section.

pgvector ivfflat index build-time parameters

Pgvector has a lists parameter that should be set as follows: For datasets with less than one million rows, use lists = rows / 1000. For datasets with more than one million rows, use lists = sqrt(rows). It is generally advisable to have at least 10 clusters.

You can use the following code to simplify creating ivfflat indexes:


def create_ivfflat_index(conn, table_name, column_name, query_operator="<=>"):
    index_method = "invalid"
    if query_operator == "<->":
        index_method = "vector_l2_ops"
    elif query_operator == "<#>":
        index_method = "vector_ip_ops"
    elif query_operator == "<=>":
        index_method = "vector_cosine_ops"
    else:
        raise ValueError(f"unrecognized operator {query_operator}")

    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};")
        num_records = cur.fetchone()[0]

        num_lists = num_records / 1000
        if num_lists < 10:
            num_lists = 10
        if num_records > 1000000:
            num_lists = math.sqrt(num_records)

        cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
        conn.commit()

pgvector ivfflat query-time parameters

You can also set a parameter to control the accuracy vs. query speed tradeoff at query time. The parameter is called ivfflat.probes. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to sqrt(lists) where lists is the parameter used above during index creation. Higher values improve query accuracy while making the query slower.

You can set the value by running:


SET ivfflat.probes = 100;

Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL:


BEGIN;
SET LOCAL ivfflat.probes = 100;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

pgvector ivfflat index-supported queries

You need to use the distance operator (<=>, <->, or <#>) matching the ops class you used during index creation in your ORDER BY clause. A canonical query would be:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Note

Related Content