freq_agg() and mcv_agg() functions
ToolkitTimescaleDB Toolkit functions are available under Timescale Community Edition. They are automatically included with Timescale, but must be installed separately for self-hosted TimescaleDB. Click to learn more.Introduction
Get the most common elements of a set and their relative frequency. The estimation uses the SpaceSaving algorithm.
This group of functions contains two aggregate functions, which let you set the
cutoff for keeping track of a value in different ways. freq_agg
allows you to specify a minimum frequency, and mcv_agg
allows
you to specify the target number of values to keep.
To estimate the absolute number of times a value appears, use count_min_sketch
.
Related hyperfunction groups
Warning
This function group includes some experimental functions. Experimental functions might change or be removed in future releases. We do not recommend using them in production. Experimental functions are marked with an Experimental tag.
Aggregate
- freq_agg
- ExperimentalAggregate data into a space-saving aggregate for further frequency analysis
Alternate aggregate
- mcv_agg
- Aggregate data into a space-saving aggregate for further calculation of most-frequent values
Accessor
- into_values
- Get a table of all frequency estimates from a space-saving aggregate
- max_frequency
- Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate
- min_frequency
- Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate
- topn
- Get the top N most common values from a space-saving aggregate
Rollup
- rollup
- Combine multiple frequency aggregates
freq_agg(min_freq DOUBLE PRECISION,value AnyElement) RETURNS SpaceSavingAggregate
Aggregate data into a space-saving aggregate object, which stores frequency information in an intermediate form. You can then use any of the accessors in this group to return estimated frequencies or the most common elements.
Required arguments
Name | Type | Description |
---|---|---|
min_freq | DOUBLE PRECISION | Frequency cutoff for keeping track of a value. Values that occur less frequently than the cutoff are not stored. |
value | AnyElement | The column to store frequencies for |
Returns
Column | Type | Description |
---|---|---|
agg | SpaceSavingAggregate | An object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result. |
Examples
Create a space-saving aggregate over a field ZIP
in a HomeSales
table. This aggregate tracks any ZIP
value that occurs in at least 5% of rows:
SELECT toolkit_experimental.freq_agg(0.05, ZIP) FROM HomeSales;
mcv_agg (n INTEGER,value AnyElement[, skew DOUBLE PRECISION]) RETURNS SpaceSavingAggregate
Aggregate data into a space-saving aggregate, which stores frequency information in an intermediate form.
You can then use any of the accessors in this group to return estimated frequencies or the most common elements.
This differs from freq_agg
in that you can specify a target number of values to keep, rather than a frequency cutoff.
Required arguments
Name | Type | Description |
---|---|---|
n | INTEGER | The target number of most-frequent values |
value | AnyElement | The column to store frequencies for |
Optional arguments
Name | Type | Description |
---|---|---|
skew | DOUBLE PRECISION | The estimated skew of the data, defined as the s parameter of a zeta distribution. Must be greater than 1.0 . Defaults to 1.1 . For more information, see the section on skew. |
Returns
Column | Type | Description |
---|---|---|
agg | SpaceSavingAggregate | An object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result. |
Examples
Create a topN aggregate over the country
column of the users
table. Targets the top 10 most-frequent values:
SELECT mcv_agg(10, country) FROM users;
Create a topN aggregate over the type
column of the devices
table. Estimates the skew of the data to be 1.05, and targets the 5 most-frequent values:
SELECT mcv_agg(5, 1.05, type) FROM devices;
into_values(agg SpaceSavingAggregate) RETURNS (AnyElement, DOUBLE PRECISION, DOUBLE PRECISION)
Returns the data from a space-saving aggregate as a table. The table lists the stored values with the minimum and maximum bounds for their estimated frequencies.
Required arguments
Returns
Column | Type | Description |
---|---|---|
value | AnyElement | A commonly seen value in the original dataset |
min_freq | DOUBLE PRECISION | The minimum bound for the estimated frequency |
max_freq | DOUBLE PRECISION | The maximum bound for the estimated frequency |
max_frequency (agg SpaceSavingAggregate,value AnyElement) RETURNS DOUBLE PRECISION
Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate
Required arguments
Returns
Column | Type | Description |
---|---|---|
max_frequency | DOUBLE PRECISION | The maximum bound for the value's estimated frequency. The maximum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see freq_agg . |
Examples
Find the maximum frequency of the value 3
in a column named value
within the table value_test
:
SELECT max_frequency((SELECT mcv_agg(20, value) FROM value_test),3);
min_frequency (agg SpaceSavingAggregate,value AnyElement) RETURNS DOUBLE PRECISION
Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate
Required arguments
Returns
Column | Type | Description |
---|---|---|
min_frequency | DOUBLE PRECISION | The minimum bound for the value's estimated frequency. The minimum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see freq_agg . |
Examples
Find the minimum frequency of the value 3
in a column named value
within the table value_test
:
SELECT min_frequency((SELECT mcv_agg(20, value) FROM value_test),3);
topn (agg SpaceSavingAggregate,n INTEGER) RETURNS AnyElement
Get the top N most common values from a space-saving aggregate. The space-saving aggregate can be created from either freq_agg
or mcv_agg
.
Required arguments
Name | Type | Description |
---|---|---|
agg | SpacingsavingAggregate | A space-saving aggregate created using either freq_agg or mcv_agg |
n | INTEGER | The number of values to return. Required only for frequency aggregates. For top N aggregates, defaults to target N of the aggregate itself, and requests for a higher N return an error. In some cases, the function might return fewer than N values. This might happen if a frequency aggregate doesn't contain N values above the minimum frequency, or if the data isn't skewed enough to support N values from a top N aggregate. |
Returns
Column | Type | Description |
---|---|---|
topn | AnyElement | The N most-frequent values in the aggregate |
Examples
Get the 20 most frequent zip_codes
from an employees
table:
SELECT topn(mcv_agg(20, zip_code)) FROM employees;
rollup(agg SpaceSavingAggregate) RETURNS SpaceSavingAggregate
This will combine multiple aggregates created with freq_agg
or
mcv_agg
functions. This function does require that the source
aggregates have been created with the same parameters (same min_freq
for freq_agg
, same n-factor and skew
, if used, for a mcv_agg
).
This will produce a very similar aggregate to running the same
aggregate function over all the source data. In most cases, any
difference will be no more than what you might get from simply reordering
the input. However, if the source data for the different aggregates is
very differently distributed, the rollup result may have looser frequency
bounds.
Required arguments
Name | Type | Description |
---|---|---|
agg | SpaceSavingAggregate | The aggregates to roll up. These must have been created with the same parameters. |
Returns
Column | Type | Description |
---|---|---|
rollup | SpaceSavingAggregate | An aggregate containing the most common elements from all of the underlying data for all of the aggregates. |
This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range 0 to 400.
CREATE TABLE value_test(value INTEGER);INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);
This returns the 5 most common values seen in the table:
SELECT topn(toolkit_experimental.freq_agg(0.05, value),5)FROM value_test;
The output for this query:
topn------1918171615
This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range (0,400).
CREATE TABLE value_test(value INTEGER);INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);
Return values that represent more than 5% of the input:
SELECT value, min_freq, max_freqFROM into_values((SELECT toolkit_experimental.freq_agg(0.05, value) FROM value_test));
The output for this query looks like this, with some variation due to randomness:
value | min_freq | max_freq-------+----------+----------19 | 0.09815 | 0.0981518 | 0.09169 | 0.0916917 | 0.08804 | 0.0880416 | 0.08248 | 0.0824815 | 0.07703 | 0.0770314 | 0.07157 | 0.0715713 | 0.06746 | 0.0674612 | 0.06378 | 0.0637811 | 0.05565 | 0.0559510 | 0.05286 | 0.05289
You can adjust the following advanced options to suit your data distribution:
mcv_agg
assumes that the data is skewed. In other words, some values are more
frequent than others. The degree of skew is defined by the s
parameter of a
zeta distribution.
The default value of 1.1
works on data with this distribution or a more
extreme one:
N | Minimum percentage of all values represented by the top N (approximate) |
---|---|
5 | 20% |
10 | 25% |
20 | 30% |
50 | 36% |
100 | 40% |
Keywords
Found an issue on this page?Report an issue or Edit this page in GitHub.