If not already installed, install the vector
and vectorscale
extensions on your Timescale database.
CREATE EXTENSION IF NOT EXISTS vector;CREATE EXTENSION IF NOT EXISTS vectorscale;
Vectors inside of the database are stored in regular PostgreSQL tables using vector
columns. The vector
column type is provided by the pgvector extension. A common way to store vectors is alongside the data they are embedding. For example, to store embeddings for documents, a common table structure is:
CREATE TABLE IF NOT EXISTS document_embedding (id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,document_id BIGINT FOREIGN KEY(document.id)metadata JSONB,contents TEXT,embedding VECTOR(1536))
This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the contents
column) and the embedded vector.
You may ask why not just add an embedding column to the document table? The answer is that there is a limit on the length of text an embedding can encode and so there needs to be a one-to-many relationship between the full document and its embeddings.
The above table is just an illustration, it's totally fine to have a table without a foreign key and/or without a metadata column. The important thing is to have a column with the data being embedded and the vector in the same row, enabling you to return the raw data for a given similarity search query
The vector type can specify an optional number of dimensions (1,538) in the example above). If specified, it enforces the constraint that all vectors in the column have that number of dimensions. A plain VECTOR
(without specifying the number of dimensions) column is also possible and allows a variable number of dimensions.
The canonical query is:
SELECT *FROM document_embeddingORDER BY embedding <=> $1LIMIT 10
Which returns the 10 rows whose distance is the smallest. The distance function used here is cosine distance (specified by using the <=>
operator). Other distance functions are available, see the discussion.
The available distance types and their operators are:
Distance type | Operator |
---|---|
Cosine/Angular | <=> |
Euclidean | <-> |
Negative inner product | <#> |
Note
If you are using an index, you need to make sure that the distance function used in index creation is the same one used during query (see below). This is important because if you create your index with one distance function but query with another, your index cannot be used to speed up the query.
Indexing helps speed up similarity queries of the basic form:
SELECT *FROM document_embeddingORDER BY embedding <=> $1LIMIT 10
The key part is that the ORDER BY
contains a distance measure against a constant or a pseudo-constant.
Note that if performing a query without an index, you always get an exact result, but the query is slow (it has to read all of the data you store for every query). With an index, your queries are an order-of-magnitude faster, but the results are approximate (because there are no known indexing techniques that are exact see here for more).
Nevertheless, there are excellent approximate algorithms. There are 3 different indexing algorithms available on the Timescale platform: StreamingDiskANN, HNSW, and ivfflat. Below is the trade-offs between these algorithms:
Algorithm | Build Speed | Query Speed | Need to rebuild after updates |
---|---|---|---|
StreamingDiskANN | Fast | Fastest | No |
HNSW | Fast | Fast | No |
ivfflat | Fastest | Slowest | Yes |
You can see benchmarks in the blog.
For most use cases, the StreamingDiskANN index is recommended.
Each of these indexes has a set of build-time options for controlling the speed/accuracy trade-off when creating the index and an additional query-time option for controlling accuracy during a particular query.
You can see the details of each index below.
The StreamingDiskANN index is a graph-based algorithm that was inspired by the DiskANN algorithm. You can read more about it in How We Made PostgreSQL as Fast as Pinecone for Vector Data.
To create an index named document_embedding_idx
on table document_embedding
having a vector column named embedding
, run:
CREATE INDEX document_embedding_idx ON document_embeddingUSING diskann (embedding);
StreamingDiskANN indexes only support cosine distance at this time, so you should use the <=>
operator in your queries.
This creates the index with smart defaults for all the index parameters. These should be the right value for most cases. But if you want to delve deeper, the available parameters are below.
These parameters can be set when an index is created.
Parameter name | Description | Default value |
---|---|---|
storage_layout | memory_optimized which uses SBQ to compress vector data or plain which stores data uncompressed | memory_optimized |
num_neighbors | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower. | 50 |
search_list_size | This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds. | 100 |
max_alpha | Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds. | 1.2 |
num_dimensions | The number of dimensions to index. By default, all dimensions are indexed. But you can also index less dimensions to make use of Matryoshka embeddings | 0 (all dimensions) |
num_bits_per_dimension | Number of bits used to encode each dimension when using SBQ | 2 for less than 900 dimensions, 1 otherwise |
An example of how to set the num_neighbors
parameter is:
CREATE INDEX document_embedding_idx ON document_embeddingUSING diskann (embedding) WITH(num_neighbors=50);
You can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting diskann.query_rescore
to fine-tune accuracy.
Parameter name | Description | Default value |
---|---|---|
diskann.query_search_list_size | The number of additional candidates considered during the graph search. | 100 |
diskann.query_rescore | The number of elements rescored (0 to disable rescoring) | 50 |
You can set the value by using SET
before executing a query. For example:
SET diskann.query_rescore = 400;
Note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL
which will
be reset after the end of the transaction:
BEGIN;SET LOCAL diskann.query_search_list_size= 10;SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10COMMIT;
You need to use the cosine-distance embedding measure (<=>
) in your ORDER BY
clause. A canonical query would be:
SELECT *FROM document_embeddingORDER BY embedding <=> $1LIMIT 10
Pgvector provides a graph-based indexing algorithm based on the popular HNSW algorithm.
To create an index named document_embedding_idx
on table document_embedding
having a vector column named embedding
, run:
CREATE INDEX document_embedding_idx ON document_embeddingUSING hnsw(embedding vector_cosine_ops);
This command creates an index for cosine-distance queries because of vector_cosine_ops
. There are also "ops" classes for Euclidean distance and negative inner product:
Distance type | Query operator | Index ops class |
---|---|---|
Cosine / Angular | <=> | vector_cosine_ops |
Euclidean / L2 | <-> | vector_ip_ops |
Negative inner product | <#> | vector_l2_ops |
Pgvector HNSW also includes several index build-time and query-time parameters.
These parameters can be set at index build time:
Parameter name | Description | Default value |
---|---|---|
m | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size. | 16 |
ef_construction | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times. | 64 |
An example of how to set the m parameter is:
CREATE INDEX document_embedding_idx ON document_embeddingUSING hnsw(embedding vector_cosine_ops) WITH (m = 20);
You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is called hnsw.ef_search
. This parameter specifies the size of the dynamic candidate list used during search. Defaults to 40. Higher values improve query accuracy while making the query slower.
You can set the value by running:
SET hnsw.ef_search = 100;
Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL
:
BEGIN;SET LOCAL hnsw.ef_search = 100;SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10COMMIT;
You need to use the distance operator (<=>
, <->
, or <#>
) matching the ops class you used during index creation in your ORDER BY
clause. A canonical query would be:
SELECT *FROM document_embeddingORDER BY embedding <=> $1LIMIT 10
Pgvector provides a clustering-based indexing algorithm. The blog post describes how it works in detail. It provides the fastest index-build speed but the slowest query speeds of any indexing algorithm.
To create an index named document_embedding_idx
on table document_embedding
having a vector column named embedding
, run:
CREATE INDEX document_embedding_idx ON document_embeddingUSING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
This command creates an index for cosine-distance queries because of vector_cosine_ops
. There are also "ops" classes for Euclidean distance and negative inner product:
Distance type | Query operator | Index ops class |
---|---|---|
Cosine / Angular | <=> | vector_cosine_ops |
Euclidean / L2 | <-> | vector_ip_ops |
Negative inner product | <#> | vector_l2_ops |
Note: ivfflat should never be created on empty tables because it needs to cluster data, and that only happens when an index is first created, not when new rows are inserted or modified. Also, if your table undergoes a lot of modifications, you need to rebuild this index occasionally to maintain good accuracy. See the blog post for details.
Pgvector ivfflat has a lists
index parameter that should be set. See the next section.
Pgvector has a lists
parameter that should be set as follows:
For datasets with less than one million rows, use lists = rows / 1000.
For datasets with more than one million rows, use lists = sqrt(rows).
It is generally advisable to have at least 10 clusters.
You can use the following code to simplify creating ivfflat indexes:
def create_ivfflat_index(conn, table_name, column_name, query_operator="<=>"):index_method = "invalid"if query_operator == "<->":index_method = "vector_l2_ops"elif query_operator == "<#>":index_method = "vector_ip_ops"elif query_operator == "<=>":index_method = "vector_cosine_ops"else:raise ValueError(f"unrecognized operator {query_operator}")with conn.cursor() as cur:cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};")num_records = cur.fetchone()[0]num_lists = num_records / 1000if num_lists < 10:num_lists = 10if num_records > 1000000:num_lists = math.sqrt(num_records)cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')conn.commit()
You can also set a parameter to control the accuracy vs. query speed tradeoff at query time. The parameter is called ivfflat.probes
. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to sqrt(lists)
where lists is the parameter used above during index creation. Higher values improve query accuracy while making the query slower.
You can set the value by running:
SET ivfflat.probes = 100;
Before executing the query, note the SET command applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using LOCAL
:
BEGIN;SET LOCAL ivfflat.probes = 100;SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10COMMIT;
You need to use the distance operator (<=>
, <->
, or <#>
) matching the ops class you used during index creation in your ORDER BY
clause. A canonical query would be:
SELECT *FROM document_embeddingORDER BY embedding <=> $1LIMIT 10
Keywords
Found an issue on this page?Report an issue or Edit this page in GitHub.