# hyperloglog() and approx_count_distinct() functions

ToolkitTimescaleDB Toolkit functions are available under Timescale Community Edition. They are automatically included with Timescale, but must be installed separately for self-hosted TimescaleDB. Click to learn more.## Introduction

Estimate the number of distinct values in a dataset. This is also known as
cardinality estimation. For large datasets and datasets with high cardinality
(many distinct values), this can be much more efficient in both CPU and memory
than an exact count using `count(DISTINCT)`

.

The estimation uses the `hyperloglog++`

algorithm. If you aren't
sure what parameters to set for the `hyperloglog`

, try using the
`approx_count_distinct`

aggregate, which sets some
reasonable default values.

### Aggregate

- hyperloglog
- Aggregate data into a hyperloglog for approximate counting

### Alternate aggregate

- approx_count_distinct
- Aggregate data into a hyperloglog for approximate counting without specifying the number of buckets

### Accessor

- distinct_count
- Estimate the number of distinct values from a hyperloglog
- stderror
- Estimate the relative standard error of a hyperloglog

### Rollup

- rollup
- Roll up multiple hyperloglogs

hyperloglog(buckets INTEGER,value AnyElement) RETURNS Hyperloglog

This is the first step for estimating the approximate number of distinct
values using the `hyperloglog`

algorithm. Use `hyperloglog`

to create an
intermediate aggregate from your raw data. This intermediate form can then
be used by one or more accessors in this group to compute final results.

Optionally, multiple such intermediate aggregate objects can be combined
using `rollup()`

before an accessor is applied.

If you're not sure what value to set for `buckets`

, try using the alternate
aggregate function, `approx_count_distinct()`

.
`approx_count_distinct`

also creates a `hyperloglog`

, but it sets a
default bucket value that should work for many use cases.

## Required arguments

Name | Type | Description |
---|---|---|

`buckets` | `INTEGER` | Number of buckets in the hyperloglog. Increasing the number of buckets improves accuracy but increases memory use. Value is rounded up to the next power of 2, and must be between 2^4 (16) and 2^18. Setting a value less than 2^10 (1,024) may result in poor accuracy if the true cardinality is high and is not recommended. If unsure, start experimenting with 8,192 (2^13) which has an approximate error rate of 1.15%. |

`value` | `AnyElement` | The column containing the elements to count. The type must have an extended, 64-bit, hash function. |

## Returns

Column | Type | Description |
---|---|---|

`hyperloglog` | `Hyperloglog` | A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation |

## Examples

Given a table called `samples`

, with a column called `weights`

, return a `hyperloglog`

over the `weights`

column:

SELECT hyperloglog(32768, weights) FROM samples;

Using the same data, build a view from the aggregate that you can pass to other `hyperloglog`

functions:

CREATE VIEW hll AS SELECT hyperloglog(32768, data) FROM samples;

approx_count_distinct(value AnyElement) RETURNS Hyperloglog

This is an alternate first step for approximating the number of distinct
values. It provides some added convenience by using some sensible default
parameters to create a `hyperloglog`

.

Use `approx_count_distinct`

to create an intermediate aggregate from your raw data.
This intermediate form can then be used by one or more accessors in this
group to compute final results.

Optionally, multiple such intermediate aggregate objects can be combined
using `rollup()`

before an accessor is applied.

## Required arguments

Name | Type | Description |
---|---|---|

`value` | `AnyElement` | The column containing the elements to count. The type must have an extended, 64-bit, hash function. |

## Returns

Column | Type | Description |
---|---|---|

`hyperloglog` | `Hyperloglog` | A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation |

## Examples

Given a table called `samples`

, with a column called `weights`

, return a `hyperloglog`

over the `weights`

column::

SELECT toolkit_experimental.approx_count_distinct(weights) FROM samples;

Using the same data, build a view from the aggregate that you can pass to other `hyperloglog`

functions:

CREATE VIEW hll AS SELECT toolkit_experimental.approx_count_distinct(data) FROM samples;

distinct_count(hyperloglog Hyperloglog) RETURNS BIGINT

Estimate the number of distinct values from a hyperloglog

## Required arguments

Name | Type | Description |
---|---|---|

`hyperloglog` | `Hyperloglog` | The hyperloglog to extract the count from. |

## Returns

Column | Type | Description |
---|---|---|

`distinct_count` | `BIGINT` | The number of distinct elements counted by the hyperloglog. |

## Examples

Estimate the number of distinct values from a hyperloglog named `hyperloglog`

. The expected output is 98,814:

SELECT distinct_count(hyperloglog(8192, data))FROM generate_series(1, 100000) data

distinct_count----------------98814

stderror(hyperloglog Hyperloglog) RETURNS DOUBLE PRECISION

Estimate the relative standard error of a `Hyperloglog`

. For approximate relative errors by number of buckets, see the relative errors section.

## Required arguments

Name | Type | Description |
---|---|---|

`hyperloglog` | `Hyperloglog` | The hyperloglog to estimate the error of. |

## Returns

Column | Type | Description |
---|---|---|

`stderror` | `DOUBLE PRECISION` | The approximate relative standard error of the hyperloglog. |

## Examples

Estimate the relative standard error of a hyperloglog named `hyperloglog`

. The expected output is 0.011490485194281396:

SELECT stderror(hyperloglog(8192, data))FROM generate_series(1, 100000) data

stderror----------------------0.011490485194281396

rollup(hyperloglog Hyperloglog) RETURNS Hyperloglog

Combine multiple intermediate hyperloglog aggregates, produced by hyperloglog, into a single intermediate hyperloglog aggregate. For example, you can use `rollup`

to combine hyperloglog from 15-minute buckets into daily buckets.

## Required arguments

Name | Type | Description |
---|---|---|

`hyperloglog` | `Hyperloglog` | The hyperloglog aggregates to roll up. |

## Returns

Column | Type | Description |
---|---|---|

`rollup` | `Hyperloglog` | A new hyperloglog aggregate created by combining the input hyperloglog aggregates. |

Roll up two hyperloglogs. The first hyperloglog buckets the integers from 1 to 100,000, and the second hyperloglog buckets the integers from 50,000 to 150,000. Accounting for overlap, the exact number of distinct values in the combined set is 150,000.

Calling `distinct_count`

on the rolled-up hyperloglog yields a final value of
150,552, so the approximation is off by only 0.368%:

SELECT distinct_count(rollup(logs))FROM ((SELECT hyperloglog(4096, v::text) logs FROM generate_series(1, 100000) v)UNION ALL(SELECT hyperloglog(4096, v::text) FROM generate_series(50000, 150000) v)) hll;

Output:

distinct_count----------------150552

These are the approximate errors for each bucket size:

precision | registers (bucket size) | error | column size (in bytes) |
---|---|---|---|

4 | 16 | 0.2600 | 12 |

5 | 32 | 0.1838 | 24 |

6 | 64 | 0.1300 | 48 |

7 | 128 | 0.0919 | 96 |

8 | 256 | 0.0650 | 192 |

9 | 512 | 0.0460 | 384 |

10 | 1024 | 0.0325 | 768 |

11 | 2048 | 0.0230 | 1536 |

12 | 4096 | 0.0163 | 3072 |

13 | 8192 | 0.0115 | 6144 |

14 | 16384 | 0.0081 | 12288 |

15 | 32768 | 0.0057 | 24576 |

16 | 65536 | 0.0041 | 49152 |

17 | 131072 | 0.0029 | 98304 |

18 | 262144 | 0.0020 | 196608 |

Keywords

Found an issue on this page?

Report an issue!