Timescale uses approximation algorithms to calculate percentiles without requiring all of the underlying data, which also makes them well suited to continuous aggregates. By default, Timescale uses uddsketch, but you can also choose to use tdigest. This section describes the two methods, and helps you to decide which one you should use.
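
For example, an estimate using the default uddsketch-based aggregate might look like the following sketch. It assumes the timescaledb_toolkit extension is installed and uses a hypothetical measurements table:

```sql
-- Hedged sketch: assumes timescaledb_toolkit is installed and a
-- hypothetical table measurements(ts TIMESTAMPTZ, value DOUBLE PRECISION).
-- percentile_agg builds a uddsketch with default parameters, and
-- approx_percentile extracts an estimate from it.
SELECT approx_percentile(0.9, percentile_agg(value)) AS p90_estimate
FROM measurements;
```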

uddsketch is the default algorithm. It uses exponentially sized buckets to guarantee that the approximation falls within a known error range, relative to the true discrete percentile. It also lets you tune the number of buckets and the maximum error target of the sketch.
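
If the defaults don't suit your data, the two-step uddsketch aggregate exposes both parameters directly. The values below are illustrative, not recommendations, and the measurements table is again hypothetical:

```sql
-- Hedged sketch: uddsketch(size, max_error, value) lets you choose the
-- number of buckets and the maximum relative error target.
-- 200 buckets and a 0.1% error target are illustrative values only.
SELECT approx_percentile(
    0.5,
    uddsketch(200, 0.001, value)
) AS median_estimate
FROM measurements;
```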

tdigest buckets data more aggressively toward the center of the quantile range, giving it greater accuracy at the tails of the range, around the 0.001 or 0.999 percentiles.
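
A tdigest estimate looks similar. The bucket count here is an illustrative value, queried at a tail percentile where tdigest is most accurate:

```sql
-- Hedged sketch: tdigest(buckets, value) with an illustrative bucket
-- count, over the same hypothetical measurements table.
SELECT approx_percentile(
    0.99,
    tdigest(100, value)
) AS p99_estimate
FROM measurements;
```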

Each algorithm has different strengths, so the better choice depends on your use case. Here are some of the differences to consider when choosing an algorithm:

Before you begin, it is important to understand that the formal definition of a percentile is imprecise: there are different methods for determining what the true percentile actually is. In PostgreSQL, given a target percentile p, percentile_disc returns the smallest element of a set such that at least p percent of the set is less than or equal to that element. However, percentile_cont returns an interpolated value between the two nearest elements. In practice, the difference between these methods is very small but, if it matters to your use case, keep in mind that tdigest approximates the continuous percentile, while uddsketch provides an estimate of the discrete value.
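
You can see the difference with PostgreSQL's exact percentile functions over a small set:

```sql
-- PostgreSQL's exact percentile functions over {1, 2, 3, 4}:
-- percentile_disc picks an element of the set (here 2), while
-- percentile_cont interpolates between the two nearest values (2.5).
SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS disc_median,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS cont_median
FROM (VALUES (1.0::double precision), (2.0), (3.0), (4.0)) AS t(v);
```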

Think about the types of percentiles you're most interested in. tdigest is optimized for greater accuracy at the extremes of the range, at the cost of accuracy near the median. If your workflow involves estimating ninety-ninth percentiles, choose tdigest. If you're more concerned with getting highly accurate median estimates, choose uddsketch.
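
For instance, a hedged side-by-side comparison over the hypothetical measurements table, matching each algorithm to the percentile it handles best, might look like this:

```sql
-- Hedged sketch: the same hypothetical data queried both ways, with
-- illustrative parameters; tdigest for a tail percentile, the
-- uddsketch-based percentile_agg for the median.
SELECT
    approx_percentile(0.99, tdigest(100, value))   AS p99_tdigest,
    approx_percentile(0.5,  percentile_agg(value)) AS median_uddsketch
FROM measurements;
```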

The algorithms differ in the way they estimate data. uddsketch has a stable bucketing function, so it always returns the same percentile estimate for the same underlying data, regardless of how the data is ordered or re-aggregated. tdigest, on the other hand, builds its buckets incrementally from the averages of nearby points, so estimates over the same data can differ subtly unless the order and batching of the aggregation are strictly controlled, which is sometimes difficult to do in PostgreSQL. If stable estimates are important to you, choose uddsketch.
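
This stability matters most when partial sketches are combined later, as in a continuous aggregate. A hedged sketch of two-step re-aggregation, assuming the hypothetical measurements table and illustrative parameters:

```sql
-- Hedged sketch: partial uddsketches built per hour and rolled up.
-- Because the bucketing is stable, the final estimate does not depend
-- on how the partial sketches were ordered or batched.
WITH hourly AS (
    SELECT time_bucket('1 hour', ts) AS bucket,
           uddsketch(200, 0.001, value) AS sketch
    FROM measurements
    GROUP BY bucket
)
SELECT approx_percentile(0.5, rollup(sketch)) AS median_estimate
FROM hourly;
```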

Calculating precise error bars for tdigest can be difficult, especially when merging multiple sub-digests into a larger one. This can happen through summary aggregation, or through parallelization of the normal point aggregate. If you need to tightly characterize your errors, choose uddsketch.

However, because uddsketch uses exponential bucketing to provide a guaranteed relative error, its absolute error can vary wildly if the dataset covers a large range. For example, if the data is evenly distributed over the range [1, 100], estimates at the high end of the percentile range have about 100 times the absolute error of those at the low end of the range. This gets much more extreme if the data range is [0, 100]. If a stable absolute error is important to your use case, choose tdigest.
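
On the uddsketch side, the Toolkit's error accessor can report the guaranteed bound directly. A hedged sketch, again assuming the hypothetical measurements table:

```sql
-- Hedged sketch: error() reports the maximum relative error guaranteed
-- by a uddsketch, so every estimate drawn from the sketch can be bounded.
SELECT error(percentile_agg(value)) AS max_relative_error
FROM measurements;
```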

While both algorithms are likely to get smaller and faster with future optimizations, uddsketch generally requires a smaller memory footprint than tdigest, and a correspondingly smaller disk footprint for any continuous aggregates. Regardless of the algorithm you choose, the best way to improve the accuracy of your percentile estimates is to increase the number of buckets, which is simpler to do with uddsketch. If your use case does not get a clear benefit from using tdigest, the default uddsketch is your best choice.
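
To get a rough sense of the footprint difference on your own data, one hedged approach is PostgreSQL's pg_column_size(), using illustrative parameters and the hypothetical measurements table:

```sql
-- Hedged sketch: pg_column_size() gives a rough view of each sketch's
-- in-memory size; parameters and the table are illustrative only.
SELECT
    pg_column_size(uddsketch(200, 0.001, value)) AS uddsketch_bytes,
    pg_column_size(tdigest(200, value))          AS tdigest_bytes
FROM measurements;
```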

For more technical details and usage examples of the two algorithms, see the developer documentation for uddsketch and tdigest.
