Introduction

Get the most common elements of a set and their relative frequency. The estimation uses the SpaceSaving algorithm.

This group of functions contains two aggregate functions, which let you set the cutoff for keeping track of a value in different ways. freq_agg allows you to specify a minimum frequency, and mcv_agg allows you to specify the target number of values to keep.

To estimate the absolute number of times a value appears, use count_min_sketch.

Related hyperfunction groups
Warning

This function group includes some experimental functions. Experimental functions might change or be removed in future releases. We do not recommend using them in production. Experimental functions are marked with an Experimental tag.

Aggregate

freq_agg
ExperimentalAggregate data into a space-saving aggregate for further frequency analysis

Alternate aggregate

mcv_agg
Aggregate data into a space-saving aggregate for further calculation of most-frequent values

Accessor

into_values
Get a table of all frequency estimates from a space-saving aggregate
max_frequency
Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate
min_frequency
Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate
topn
Get the top N most common values from a space-saving aggregate

Rollup

rollup
Combine multiple frequency aggregates
freq_agg(
min_freq DOUBLE PRECISION,
value AnyElement
) RETURNS SpaceSavingAggregate

Aggregate data into a space-saving aggregate object, which stores frequency information in an intermediate form. You can then use any of the accessors in this group to return estimated frequencies or the most common elements.

Required arguments
NameTypeDescription
min_freqDOUBLE PRECISIONFrequency cutoff for keeping track of a value. Values that occur less frequently than the cutoff are not stored.
valueAnyElementThe column to store frequencies for
Returns
ColumnTypeDescription
aggSpaceSavingAggregateAn object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result.
Examples

Create a space-saving aggregate over a field ZIP in a HomeSales table. This aggregate tracks any ZIP value that occurs in at least 5% of rows:

SELECT toolkit_experimental.freq_agg(0.05, ZIP) FROM HomeSales;
mcv_agg (
n INTEGER,
value AnyElement
[, skew DOUBLE PRECISION]
) RETURNS SpaceSavingAggregate

Aggregate data into a space-saving aggregate, which stores frequency information in an intermediate form. You can then use any of the accessors in this group to return estimated frequencies or the most common elements.

This differs from freq_agg in that you can specify a target number of values to keep, rather than a frequency cutoff.

Required arguments
NameTypeDescription
nINTEGERThe target number of most-frequent values
valueAnyElementThe column to store frequencies for
Optional arguments
NameTypeDescription
skewDOUBLE PRECISIONThe estimated skew of the data, defined as the s parameter of a zeta distribution. Must be greater than 1.0. Defaults to 1.1. For more information, see the section on skew.
Returns
ColumnTypeDescription
aggSpaceSavingAggregateAn object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result.
Examples

Create a topN aggregate over the country column of the users table. Targets the top 10 most-frequent values:

SELECT mcv_agg(10, country) FROM users;

Create a topN aggregate over the type column of the devices table. Estimates the skew of the data to be 1.05, and targets the 5 most-frequent values:

SELECT mcv_agg(5, 1.05, type) FROM devices;
into_values(
agg SpaceSavingAggregate
) RETURNS (AnyElement, DOUBLE PRECISION, DOUBLE PRECISION)

Returns the data from a space-saving aggregate as a table. The table lists the stored values with the minimum and maximum bounds for their estimated frequencies.

Required arguments
NameTypeDescription
aggSpaceSavingAggregateA space-saving aggregate created using either freq_agg or mcv_agg
Returns
ColumnTypeDescription
valueAnyElementA commonly seen value in the original dataset
min_freqDOUBLE PRECISIONThe minimum bound for the estimated frequency
max_freqDOUBLE PRECISIONThe maximum bound for the estimated frequency
max_frequency (
agg SpaceSavingAggregate,
value AnyElement
) RETURNS DOUBLE PRECISION

Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate

Required arguments
NameTypeDescription
aggSpaceSavingAggregateA space-saving aggregate created using either freq_agg or mcv_agg
valueAnyElementThe value to get the frequency of
Returns
ColumnTypeDescription
max_frequencyDOUBLE PRECISIONThe maximum bound for the value's estimated frequency. The maximum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see freq_agg.
Examples

Find the maximum frequency of the value 3 in a column named value within the table value_test:

SELECT max_frequency(
(SELECT mcv_agg(20, value) FROM value_test),
3
);
min_frequency (
agg SpaceSavingAggregate,
value AnyElement
) RETURNS DOUBLE PRECISION

Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate

Required arguments
NameTypeDescription
aggSpaceSavingAggregateA space-saving aggregate created using either freq_agg or mcv_agg
valueAnyElementThe value to get the frequency of
Returns
ColumnTypeDescription
min_frequencyDOUBLE PRECISIONThe minimum bound for the value's estimated frequency. The minimum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see freq_agg.
Examples

Find the minimum frequency of the value 3 in a column named value within the table value_test:

SELECT min_frequency(
(SELECT mcv_agg(20, value) FROM value_test),
3
);
topn (
agg SpaceSavingAggregate,
n INTEGER
) RETURNS AnyElement

Get the top N most common values from a space-saving aggregate. The space-saving aggregate can be created from either freq_agg or mcv_agg.

Required arguments
NameTypeDescription
aggSpacingsavingAggregateA space-saving aggregate created using either freq_agg or mcv_agg
nINTEGERThe number of values to return. Required only for frequency aggregates. For top N aggregates, defaults to target N of the aggregate itself, and requests for a higher N return an error. In some cases, the function might return fewer than N values. This might happen if a frequency aggregate doesn't contain N values above the minimum frequency, or if the data isn't skewed enough to support N values from a top N aggregate.
Returns
ColumnTypeDescription
topnAnyElementThe N most-frequent values in the aggregate
Examples

Get the 20 most frequent zip_codes from an employees table:

SELECT topn(mcv_agg(20, zip_code)) FROM employees;
rollup(
agg SpaceSavingAggregate
) RETURNS SpaceSavingAggregate

This will combine multiple aggregates created with freq_agg or mcv_agg functions. This function does require that the source aggregates have been created with the same parameters (same min_freq for freq_agg, same n-factor and skew, if used, for a mcv_agg).

This will produce a very similar aggregate to running the same aggregate function over all the source data. In most cases, any difference will be no more than what you might get from simply reordering the input. However, if the source data for the different aggregates is very differently distributed, the rollup result may have looser frequency bounds.

Required arguments
NameTypeDescription
aggSpaceSavingAggregateThe aggregates to roll up. These must have been created with the same parameters.
Returns
ColumnTypeDescription
rollupSpaceSavingAggregateAn aggregate containing the most common elements from all of the underlying data for all of the aggregates.

This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range 0 to 400.

CREATE TABLE value_test(value INTEGER);
INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);

This returns the 5 most common values seen in the table:

SELECT topn(
toolkit_experimental.freq_agg(0.05, value),
5)
FROM value_test;

The output for this query:

topn
------
19
18
17
16
15

This test uses a table of randomly generated data. The values used are the integer square roots of a random number in the range (0,400).

CREATE TABLE value_test(value INTEGER);
INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000);

Return values that represent more than 5% of the input:

SELECT value, min_freq, max_freq
FROM into_values(
(SELECT toolkit_experimental.freq_agg(0.05, value) FROM value_test));

The output for this query looks like this, with some variation due to randomness:

value | min_freq | max_freq
-------+----------+----------
19 | 0.09815 | 0.09815
18 | 0.09169 | 0.09169
17 | 0.08804 | 0.08804
16 | 0.08248 | 0.08248
15 | 0.07703 | 0.07703
14 | 0.07157 | 0.07157
13 | 0.06746 | 0.06746
12 | 0.06378 | 0.06378
11 | 0.05565 | 0.05595
10 | 0.05286 | 0.05289

You can adjust the following advanced options to suit your data distribution:

mcv_agg assumes that the data is skewed. In other words, some values are more frequent than others. The degree of skew is defined by the s parameter of a zeta distribution.

The default value of 1.1 works on data with this distribution or a more extreme one:

NMinimum percentage of all values represented by the top N (approximate)
520%
1025%
2030%
5036%
10040%

Keywords

Found an issue on this page?

Report an issue!