freq_agg() and mcv_agg() functions

ToolkitTimescaleDB Toolkit functions are available under Timescale Community Edition. They are automatically included with Timescale, but must be installed separately for self-hosted TimescaleDB. Click to learn more.

Get the most common elements of a set and their relative frequency. The estimation uses the SpaceSaving algorithm.

This group of functions contains two aggregate functions, which let you set the cutoff for keeping track of a value in different ways. freq_agg allows you to specify a minimum frequency, and mcv_agg allows you to specify the target number of values to keep.

To estimate the absolute number of times a value appears, use count_min_sketch.

Related hyperfunction groups

This group of functions uses the two-step aggregation pattern.

Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.

Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.

The two-step aggregation pattern has several advantages:

More efficient because multiple accessors can reuse the same aggregate
Easier to reason about performance, because aggregation is separate from final computation
Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result

To learn more, see the blog post on two-step aggregates.

Functions in this group

Warning

This function group includes some experimental functions. Experimental functions might change or be removed in future releases. We do not recommend using them in production. Experimental functions are marked with an Experimental tag.

freq_agg: ExperimentalAggregate data into a space-saving aggregate for further frequency analysis

mcv_agg: Aggregate data into a space-saving aggregate for further calculation of most-frequent values

into_values: Get a table of all frequency estimates from a space-saving aggregate
max_frequency: Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate
min_frequency: Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate
topn: Get the top N most common values from a space-saving aggregate

rollup: Combine multiple frequency aggregates

Function details

freq_agg(
  min_freq DOUBLE PRECISION,
  value AnyElement
) RETURNS SpaceSavingAggregate

Aggregate data into a space-saving aggregate object, which stores frequency information in an intermediate form. You can then use any of the accessors in this group to return estimated frequencies or the most common elements.

Required arguments

Name	Type	Description
`min_freq`	`DOUBLE PRECISION`	Frequency cutoff for keeping track of a value. Values that occur less frequently than the cutoff are not stored.
`value`	`AnyElement`	The column to store frequencies for

Returns

Column	Type	Description
`agg`	`SpaceSavingAggregate`	An object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result.

Examples

Create a space-saving aggregate over a field ZIP in a HomeSales table. This aggregate tracks any ZIP value that occurs in at least 5% of rows:


SELECT toolkit_experimental.freq_agg(0.05, ZIP) FROM HomeSales;

mcv_agg (
    n INTEGER,
    value AnyElement
    [, skew DOUBLE PRECISION]
) RETURNS SpaceSavingAggregate

Aggregate data into a space-saving aggregate, which stores frequency information in an intermediate form. You can then use any of the accessors in this group to return estimated frequencies or the most common elements.

This differs from freq_agg in that you can specify a target number of values to keep, rather than a frequency cutoff.

Required arguments

Name	Type	Description
`n`	`INTEGER`	The target number of most-frequent values
`value`	`AnyElement`	The column to store frequencies for

Optional arguments

Name	Type	Description
`skew`	`DOUBLE PRECISION`	The estimated skew of the data, defined as the `s` parameter of a zeta distribution. Must be greater than `1.0`. Defaults to `1.1`. For more information, see the section on skew.

Returns

Column	Type	Description
`agg`	`SpaceSavingAggregate`	An object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result.

Examples

Create a topN aggregate over the country column of the users table. Targets the top 10 most-frequent values:


SELECT mcv_agg(10, country) FROM users;

Create a topN aggregate over the type column of the devices table. Estimates the skew of the data to be 1.05, and targets the 5 most-frequent values:


SELECT mcv_agg(5, 1.05, type) FROM devices;

into_values(
    agg SpaceSavingAggregate
) RETURNS (AnyElement, DOUBLE PRECISION, DOUBLE PRECISION)

Returns the data from a space-saving aggregate as a table. The table lists the stored values with the minimum and maximum bounds for their estimated frequencies.

Required arguments

Name	Type	Description
`agg`	`SpaceSavingAggregate`	A space-saving aggregate created using either `freq_agg` or `mcv_agg`

Returns

Column	Type	Description
`value`	`AnyElement`	A commonly seen value in the original dataset
`min_freq`	`DOUBLE PRECISION`	The minimum bound for the estimated frequency
`max_freq`	`DOUBLE PRECISION`	The maximum bound for the estimated frequency

max_frequency (
    agg SpaceSavingAggregate,
    value AnyElement
) RETURNS DOUBLE PRECISION

Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate

Required arguments

Name	Type	Description
`agg`	`SpaceSavingAggregate`	A space-saving aggregate created using either `freq_agg` or `mcv_agg`
`value`	`AnyElement`	The value to get the frequency of

Returns

Column	Type	Description
`max_frequency`	`DOUBLE PRECISION`	The maximum bound for the value's estimated frequency. The maximum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see `freq_agg`.

Examples

Find the maximum frequency of the value 3 in a column named value within the table value_test:


SELECT max_frequency(
    (SELECT mcv_agg(20, value) FROM value_test),
    3
);

min_frequency (
    agg SpaceSavingAggregate,
    value AnyElement
) RETURNS DOUBLE PRECISION

Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate

Required arguments

Name	Type	Description
`agg`	`SpaceSavingAggregate`	A space-saving aggregate created using either `freq_agg` or `mcv_agg`
`value`	`AnyElement`	The value to get the frequency of

Returns

Column	Type	Description
`min_frequency`	`DOUBLE PRECISION`	The minimum bound for the value's estimated frequency. The minimum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see `freq_agg`.

Examples

Find the minimum frequency of the value 3 in a column named value within the table value_test:


SELECT min_frequency(
    (SELECT mcv_agg(20, value) FROM value_test),
    3
);

topn (
    agg SpaceSavingAggregate,
    n INTEGER
) RETURNS AnyElement

Get the top N most common values from a space-saving aggregate. The space-saving aggregate can be created from either freq_agg or mcv_agg.

Required arguments

Name	Type	Description
`agg`	`SpacingsavingAggregate`	A space-saving aggregate created using either `freq_agg` or `mcv_agg`
`n`	`INTEGER`	The number of values to return. Required only for frequency aggregates. For top N aggregates, defaults to target N of the aggregate itself, and requests for a higher N return an error. In some cases, the function might return fewer than N values. This might happen if a frequency aggregate doesn't contain N values above the minimum frequency, or if the data isn't skewed enough to support N values from a top N aggregate.

Returns

Column	Type	Description
`topn`	`AnyElement`	The N most-frequent values in the aggregate

Examples

Get the 20 most frequent zip_codes from an employees table:


SELECT topn(mcv_agg(20, zip_code)) FROM employees;

rollup(
    agg SpaceSavingAggregate
) RETURNS SpaceSavingAggregate

This will combine multiple aggregates created with freq_agg or mcv_agg functions. This function does require that the source aggregates have been created with the same parameters (same min_freq for freq_agg, same n-factor and skew, if used, for a mcv_agg).

This will produce a very similar aggregate to running the same aggregate function over all the source data. In most cases, any difference will be no more than what you might get from simply reordering the input. However, if the source data for the different aggregates is very differently distributed, the rollup result may have looser frequency bounds.

Required arguments

Name	Type	Description
`agg`	`SpaceSavingAggregate`	The aggregates to roll up. These must have been created with the same parameters.

Returns

Column	Type	Description
`rollup`	`SpaceSavingAggregate`	An aggregate containing the most common elements from all of the underlying data for all of the aggregates.

Extended examples

Get the 5 most common values from a table

This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range 0 to 400.


CREATE TABLE value_test(value INTEGER);
INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);

This returns the 5 most common values seen in the table:


SELECT topn(
    toolkit_experimental.freq_agg(0.05, value), 
    5) 
FROM value_test;

The output for this query:


topn 
------
   19
   18
   17
   16
   15

Generate a table with frequencies of the most commonly seen values in a dataset

This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range (0,400).


CREATE TABLE value_test(value INTEGER);
INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);

Return values that represent more than 5% of the input:


SELECT value, min_freq, max_freq
FROM into_values(
    (SELECT toolkit_experimental.freq_agg(0.05, value) FROM value_test));

The output for this query looks like this, with some variation due to randomness:


value | min_freq | max_freq 
-------+----------+----------
    19 |  0.09815 |  0.09815
    18 |  0.09169 |  0.09169
    17 |  0.08804 |  0.08804
    16 |  0.08248 |  0.08248
    15 |  0.07703 |  0.07703
    14 |  0.07157 |  0.07157
    13 |  0.06746 |  0.06746
    12 |  0.06378 |  0.06378
    11 |  0.05565 |  0.05595
    10 |  0.05286 |  0.05289

Advanced use

You can adjust the following advanced options to suit your data distribution:

Estimated skew

mcv_agg assumes that the data is skewed. In other words, some values are more frequent than others. The degree of skew is defined by the s parameter of a zeta distribution.

The default value of 1.1 works on data with this distribution or a more extreme one:

N	Minimum percentage of all values represented by the top N (approximate)
5	20%
10	25%
20	30%
50	36%
100	40%

freq_agg() and mcv_agg() functions

Related hyperfunction groups

Two-step aggregation

Functions in this group

Warning

Aggregate

Alternate aggregate

Accessor

Rollup

Function details

freq_agg()

mcv_agg()

into_values()

max_frequency()

min_frequency()

topn()

rollup()

Extended examples

Get the 5 most common values from a table

Generate a table with frequencies of the most commonly seen values in a dataset

Advanced use

Estimated skew

Introduction

Related hyperfunction groups

Two-step aggregation

Warning

Aggregate

Alternate aggregate

Accessor

Rollup

freq_agg()

mcv_agg()

into_values()

max_frequency()

min_frequency()

topn()

rollup()

Related Content