Advanced percentile aggregation
While the simple
interface will be sufficient for many users, we do provide more specific APIs for
advanced users who want more control of how their percentile approximation is
computed and how much space the intermediate representation uses. We currently
provide implementations of the following percentile approximation algorithms:
- T-Digest –
This algorithm buckets data more aggressively toward the center of the quantile range,
giving it greater accuracy near the tails (i.e. 0.001 or 0.995).
- UddSketch – T
his algorithm uses exponentially sized buckets to guarantee the approximation
falls within a known error range, relative to the true discrete percentile.
UddSketch algorithm underlies the
percentile_agg() interface, it offers
tunability for the size and maximum error target of the sketch, while
uses preset defaults.
Choosing the right algorithm for your use case
There are different tradeoffs that each algorithm makes, and different use cases
where each will shine. The doc pages above each link to the research papers fully
detailing the algorithms if you want all the details. However, at a higher level,
here are some of the differences to consider when choosing an algorithm:
- First off, it's interesting to note that the formal definition for a percentile
is actually imprecise, and there are different methods for determining what the
true percentile actually is. In Postgres, given a target percentile 'p',
will return the smallest element of a set such that 'p' percent of the set is
less than that element, while
will return an interpolated value between the two nearest matches for 'p'. The
difference here isn't usually that interesting in practice, but if it matters to
your use case, then keep in mind that TDigest will approximate the continuous
percentile while UddSketch provides an estimate of the discrete value.
- It's also important to consider the types of percentiles you're most interested
in. In particular, TDigest is optimized to trade off more accurate estimates at
the extremes with weaker estimates near the median. If your work flow involves
estimating 99th percentiles, this is probably a good trade off. However if you're
more concerned about getting highly accurate median estimates, UddSketch is
probably a better fit.
- UddSketch has a stable bucketing function, so it will always return the same
quantile estimate for the same underlying data, regardless of how it is ordered
or re-aggregated. TDigest, on the other hand, builds up incremental buckets based
on the average of nearby points, which will result in (usually subtle) differences
in estimates based on the same data, unless the order and batching of the
aggregation is strictly controlled (which can be difficult to do in Postgres).
Therefore, if having stable estimates is important to you, UddSketch will likely
- Trying to calculate precise error bars for TDigest can be difficult, especially
when merging multiple sub-digests into a larger one (this can come about either
through summary aggregation or just parallelization of the normal point aggregate).
If being able to tightly characterize your error is important, UddSketch will
likely be the desired algorithm.
- That being said, the fact that UddSketch uses exponential bucketing to provide
a guaranteed relative error can cause some wildly varying absolute errors if the
data set covers a large range. For instance if the data is evenly distributed
over the range [1,100], estimates at the high end of the percentile range would
have about 100 times the absolute error of those at the low end of the range.
This gets much more extreme if the data range is [0,100]. If having a stable
absolute error is important to your use case, consider TDigest.
- While both implementation will likely get smaller and/or faster with future
optimizations, in general UddSketch will end up with a smaller memory footprint
than TDigest, and a correspondingly smaller disk footprint for any continuous
aggregates. This is one of the main reasons that the default
uses UddSketch, and is a pretty good reason to prefer that algorithm if your use
case doesn't clearly benefit from TDigest. Regardless of the algorithm, the best
way to improve the accuracy of your percentile estimates is to increase the
number of buckets, and UddSketch gives you more leeway to do so.
Found an issue on this page?Report an issue!