hyperloglog() and approx_count_distinct() functions
ToolkitTimescaleDB Toolkit functions are available under Timescale Community Edition. They are automatically included with Timescale, but must be installed separately for self-hosted TimescaleDB. Click to learn more.Introduction
Estimate the number of distinct values in a dataset. This is also known as
cardinality estimation. For large datasets and datasets with high cardinality
(many distinct values), this can be much more efficient in both CPU and memory
than an exact count using count(DISTINCT)
.
The estimation uses the hyperloglog++
algorithm. If you aren't
sure what parameters to set for the hyperloglog
, try using the
approx_count_distinct
aggregate, which sets some
reasonable default values.
Aggregate
- hyperloglog
- Aggregate data into a hyperloglog for approximate counting
Alternate aggregate
- approx_count_distinct
- Aggregate data into a hyperloglog for approximate counting without specifying the number of buckets
Accessor
- distinct_count
- Estimate the number of distinct values from a hyperloglog
- stderror
- Estimate the relative standard error of a hyperloglog
Rollup
- rollup
- Roll up multiple hyperloglogs
hyperloglog(buckets INTEGER,value AnyElement) RETURNS Hyperloglog
This is the first step for estimating the approximate number of distinct
values using the hyperloglog
algorithm. Use hyperloglog
to create an
intermediate aggregate from your raw data. This intermediate form can then
be used by one or more accessors in this group to compute final results.
Optionally, multiple such intermediate aggregate objects can be combined
using rollup()
before an accessor is applied.
If you're not sure what value to set for buckets
, try using the alternate
aggregate function, approx_count_distinct()
.
approx_count_distinct
also creates a hyperloglog
, but it sets a
default bucket value that should work for many use cases.
Required arguments
Name | Type | Description |
---|---|---|
buckets | INTEGER | Number of buckets in the hyperloglog. Increasing the number of buckets improves accuracy but increases memory use. Value is rounded up to the next power of 2, and must be between 2^4 (16) and 2^18. Setting a value less than 2^10 (1,024) may result in poor accuracy if the true cardinality is high and is not recommended. If unsure, start experimenting with 8,192 (2^13) which has an approximate error rate of 1.15%. |
value | AnyElement | The column containing the elements to count. The type must have an extended, 64-bit, hash function. |
Returns
Column | Type | Description |
---|---|---|
hyperloglog | Hyperloglog | A hyperloglog object which can be passed to other hyperloglog APIs for rollups and final calculation |
Examples
Given a table called samples
, with a column called weights
, return a hyperloglog
over the weights
column:
SELECT hyperloglog(32768, weights) FROM samples;
Using the same data, build a view from the aggregate that you can pass to other hyperloglog
functions:
CREATE VIEW hll AS SELECT hyperloglog(32768, data) FROM samples;
approx_count_distinct(value AnyElement) RETURNS Hyperloglog
This is an alternate first step for approximating the number of distinct
values. It provides some added convenience by using some sensible default
parameters to create a hyperloglog
.
Use approx_count_distinct
to create an intermediate aggregate from your raw data.
This intermediate form can then be used by one or more accessors in this
group to compute final results.
Optionally, multiple such intermediate aggregate objects can be combined
using rollup()
before an accessor is applied.
Required arguments
Name | Type | Description |
---|---|---|
value | AnyElement | The column containing the elements to count. The type must have an extended, 64-bit, hash function. |
Returns
Column | Type | Description |
---|---|---|
hyperloglog | Hyperloglog | A hyperloglog object which can be passed to other hyperloglog APIs for rollups and final calculation |
Examples
Given a table called samples
, with a column called weights
, return a hyperloglog
over the weights
column::
SELECT toolkit_experimental.approx_count_distinct(weights) FROM samples;
Using the same data, build a view from the aggregate that you can pass to other hyperloglog
functions:
CREATE VIEW hll AS SELECT toolkit_experimental.approx_count_distinct(data) FROM samples;
distinct_count(hyperloglog Hyperloglog) RETURNS BIGINT
Estimate the number of distinct values from a hyperloglog
Required arguments
Name | Type | Description |
---|---|---|
hyperloglog | Hyperloglog | The hyperloglog to extract the count from. |
Returns
Column | Type | Description |
---|---|---|
distinct_count | BIGINT | The number of distinct elements counted by the hyperloglog. |
Examples
Estimate the number of distinct values from a hyperloglog named hyperloglog
. The expected output is 98,814:
SELECT distinct_count(hyperloglog(8192, data))FROM generate_series(1, 100000) data
distinct_count----------------98814
stderror(hyperloglog Hyperloglog) RETURNS DOUBLE PRECISION
Estimate the relative standard error of a Hyperloglog
. For approximate relative errors by number of buckets, see the relative errors section.
Required arguments
Name | Type | Description |
---|---|---|
hyperloglog | Hyperloglog | The hyperloglog to estimate the error of. |
Returns
Column | Type | Description |
---|---|---|
stderror | DOUBLE PRECISION | The approximate relative standard error of the hyperloglog. |
Examples
Estimate the relative standard error of a hyperloglog named hyperloglog
. The expected output is 0.011490485194281396:
SELECT stderror(hyperloglog(8192, data))FROM generate_series(1, 100000) data
stderror----------------------0.011490485194281396
rollup(hyperloglog Hyperloglog) RETURNS Hyperloglog
Combine multiple intermediate hyperloglog aggregates, produced by hyperloglog, into a single intermediate hyperloglog aggregate. For example, you can use rollup
to combine hyperloglog from 15-minute buckets into daily buckets.
Required arguments
Name | Type | Description |
---|---|---|
hyperloglog | Hyperloglog | The hyperloglog aggregates to roll up. |
Returns
Column | Type | Description |
---|---|---|
rollup | Hyperloglog | A new hyperloglog aggregate created by combining the input hyperloglog aggregates. |
Roll up two hyperloglogs. The first hyperloglog buckets the integers from 1 to 100,000, and the second hyperloglog buckets the integers from 50,000 to 150,000. Accounting for overlap, the exact number of distinct values in the combined set is 150,000.
Calling distinct_count
on the rolled-up hyperloglog yields a final value of
150,552, so the approximation is off by only 0.368%:
SELECT distinct_count(rollup(logs))FROM ((SELECT hyperloglog(4096, v::text) logs FROM generate_series(1, 100000) v)UNION ALL(SELECT hyperloglog(4096, v::text) FROM generate_series(50000, 150000) v)) hll;
Output:
distinct_count----------------150552
These are the approximate errors for each bucket size:
precision | registers (bucket size) | error | column size (in bytes) |
---|---|---|---|
4 | 16 | 0.2600 | 12 |
5 | 32 | 0.1838 | 24 |
6 | 64 | 0.1300 | 48 |
7 | 128 | 0.0919 | 96 |
8 | 256 | 0.0650 | 192 |
9 | 512 | 0.0460 | 384 |
10 | 1024 | 0.0325 | 768 |
11 | 2048 | 0.0230 | 1536 |
12 | 4096 | 0.0163 | 3072 |
13 | 8192 | 0.0115 | 6144 |
14 | 16384 | 0.0081 | 12288 |
15 | 32768 | 0.0057 | 24576 |
16 | 65536 | 0.0041 | 49152 |
17 | 131072 | 0.0029 | 98304 |
18 | 262144 | 0.0020 | 196608 |
Keywords
Found an issue on this page?Report an issue or Edit this page in GitHub.