If not already installed, install the vector and timescale_vector extensions on your Timescale database.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS timescale_vector;

Vectors inside of the database are stored in regular PostgreSQL tables using vector columns. The vector column type is provided by the pgvector extension. A common way to store vectors is alongside the data they are embedding. For example, to store embeddings for documents, a common table structure is:

CREATE TABLE IF NOT EXISTS document_embedding (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
document_id BIGINT FOREIGN KEY(document.id)
metadata JSONB,
contents TEXT,
embedding VECTOR(1538)
)

This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the contents column) and the embedded vector.

You may ask why not just add an embedding column to the document table? The answer is that there is a limit on the length of text an embedding can encode and so there needs to be a one-to-many relationship between the full document and its embeddings.

The above table is just an illustration, it's totally fine to have a table without a foreign key and/or without a metadata column. The important thing is to have a column with the data being embedded and the vector in the same row, enabling you to return the raw data for a given similarity search query

The vector type can specify an optional number of dimensions (1,538) in the example above). If specified, it enforces the constraint that all vectors in the column have that number of dimensions. A plain VECTOR (without specifying the number of dimensions) column is also possible and allows a variable number of dimensions.

The canonical query is:

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Which returns the 10 rows whose distance is the smallest. The distance function used here is cosine distance (specified by using the <=> operator). Other distance functions are available, see the discussion.

The available distance types and their operators are:

Distance typeOperator
Cosine/Angular<=>
Euclidean<->
Negative inner product<#>
Note

If you are using an index, you need to make sure that the distance function used in index creation is the same one used during query (see below). This is important because if you create your index with one distance function but query with another, your index cannot be used to speed up the query.

Indexing helps speed up similarity queries of the basic form:

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

The key part is that the ORDER BY contains a distance measure against a constant or a pseudo-constant.

Note that if performing a query without an index, you always get an exact result, but the query is slow (it has to read all of the data you store for every query). With an index, your queries are an order-of-magnitude faster, but the results are approximate (because there are no known indexing techniques that are exact see here for more).

Nevertheless, there are excellent approximate algorithms. There are 3 different indexing algorithms available on the Timescale platform: Timescale Vector Index, pgvector HNSW, and pgvector ivfflat. Below is the trade-offs between these algorithms:

AlgorithmBuild SpeedQuery SpeedNeed to rebuild after updates
Timescale VectorSlowFastestNo
pgvector HNSWSlowestFastNo
pgvector ivfflatFastestSlowestYes

You can see benchmarks in the blog.

For most use cases, the Timescale Vector index is recommended.

Each of these indexes has a set of build-time options for controlling the speed/accuracy trade-off when creating the index and an additional query-time option for controlling accuracy during a particular query.

You can see the details of each index below.

The Timescale Vector index is a graph-based algorithm that uses the DiskANN algorithm. You can read more about it on the blog announcing its release.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:

CREATE INDEX document_embedding_idx ON document_embedding
USING tsv (embedding);

Timescale Vector indexes only support cosine distance at this time, so you should use the <=> operator in your queries.

This creates the index with smart defaults for all the index parameters. These should be the right value for most cases. But if you want to delve deeper, the available parameters are below.

These parameters can be set when an index is created.

Parameter nameDescriptionDefault value
num_neighborsSets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower.50
search_list_sizeThis is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds.100
max_alphaIs the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds.1.0

An example of how to set the num_neighbors parameter is:

CREATE INDEX document_embedding_idx ON document_embedding
USING tsv (embedding) WITH(num_neighbors=50);

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is called tsv.query_search_list_size. This is the number of additional candidates considered during the graph search at query time. Defaults to 100. Higher values improve query accuracy while making the query slower.

You can set the value by running:

SET tsv.query_search_list_size = 120;

Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL:

BEGIN;
SET LOCAL tsv.query_search_list_size= 10;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

You need to use the cosine-distance embedding measure (<=>) in your ORDER BY clause. A canonical query would be:

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Pgvector provides a graph-based indexing algorithm based on the popular HNSW algorithm.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:

CREATE INDEX document_embedding_idx ON document_embedding
USING hnsw(embedding vector_cosine_ops);

This command creates an index for cosine-distance queries because of vector_cosine_ops. There are also "ops" classes for Euclidean distance and negative inner product:

Distance typeQuery operatorIndex ops class
Cosine / Angular<=>vector_cosine_ops
Euclidean / L2<->vector_ip_ops
Negative inner product<#>vector_l2_ops

Pgvector HNSW also includes several index build-time and query-time parameters.

These parameters can be set at index build time:

Parameter nameDescriptionDefault value
mRepresents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size.16
ef_constructionRepresents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times.64

An example of how to set the m parameter is:

CREATE INDEX document_embedding_idx ON document_embedding
USING hnsw(embedding vector_cosine_ops) WITH (m = 20);

You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is called hnsw.ef_search. This parameter specifies the size of the dynamic candidate list used during search. Defaults to 40. Higher values improve query accuracy while making the query slower.

You can set the value by running:

SET hnsw.ef_search = 100;

Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL:

BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

You need to use the distance operator (<=>, <->, or <#>) matching the ops class you used during index creation in your ORDER BY clause. A canonical query would be:

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Pgvector provides a clustering-based indexing algorithm. The blog post describes how it works in detail. It provides the fastest index-build speed but the slowest query speeds of any indexing algorithm.

To create an index named document_embedding_idx on table document_embedding having a vector column named embedding, run:

CREATE INDEX document_embedding_idx ON document_embedding
USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);

This command creates an index for cosine-distance queries because of vector_cosine_ops. There are also "ops" classes for Euclidean distance and negative inner product:

Distance typeQuery operatorIndex ops class
Cosine / Angular<=>vector_cosine_ops
Euclidean / L2<->vector_ip_ops
Negative inner product<#>vector_l2_ops

Note: ivfflat should never be created on empty tables because it needs to cluster data, and that only happens when an index is first created, not when new rows are inserted or modified. Also, if your table undergoes a lot of modifications, you need to rebuild this index occasionally to maintain good accuracy. See the blog post for details.

Pgvector ivfflat has a lists index parameter that should be set. See the next section.

Pgvector has a lists parameter that should be set as follows: For datasets with less than one million rows, use lists = rows / 1000. For datasets with more than one million rows, use lists = sqrt(rows). It is generally advisable to have at least 10 clusters.

You can use the following code to simplify creating ivfflat indexes:

def create_ivfflat_index(conn, table_name, column_name, query_operator="<=>"):
index_method = "invalid"
if query_operator == "<->":
index_method = "vector_l2_ops"
elif query_operator == "<#>":
index_method = "vector_ip_ops"
elif query_operator == "<=>":
index_method = "vector_cosine_ops"
else:
raise ValueError(f"unrecognized operator {query_operator}")
with conn.cursor() as cur:
cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};")
num_records = cur.fetchone()[0]
num_lists = num_records / 1000
if num_lists < 10:
num_lists = 10
if num_records > 1000000:
num_lists = math.sqrt(num_records)
cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
conn.commit()

You can also set a parameter to control the accuracy vs. query speed tradeoff at query time. The parameter is called ivfflat.probes. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to sqrt(lists) where lists is the parameter used above during index creation. Higher values improve query accuracy while making the query slower.

You can set the value by running:

SET ivfflat.probes = 100;

Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL:

BEGIN;
SET LOCAL ivfflat.probes = 100;
SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10
COMMIT;

You need to use the distance operator (<=>, <->, or <#>) matching the ops class you used during index creation in your ORDER BY clause. A canonical query would be:

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10

Keywords

Found an issue on this page?

Report an issue!