Aggregates are summaries of raw data for a period of time. Some examples of aggregates are the average temperature per day, the maximum CPU utilization per 5 minutes, and the number of visitors on a website per day.
Calculating aggregates on time-series data can be computationally intensive given the large amounts of data which need to be processed in order to calculate and recalculate aggregates over a time period. Moreover, calculating aggregates while simultaneously ingesting data can cause a slowdown in ingest rate due to the computational resources being split between the two resource intensive processes. Continuous aggregates solve both these problems.
Continuous aggregates are automatically refreshed materialized views – so they massively speed up workloads that need to process large amounts of data. Unlike in other databases, your views are automatically refreshed in the background as new data is added, or as old data is modified according to a schedule.
For Postgres users who may be familiar with views and materialized views, here's how continuous aggregates differ:
Continuous aggregates speed up dashboards and visualizations, summarizing data sampled at high frequency, and querying downsampled data over long time periods.
Now that you're familiar with what Continuous Aggregates are, let's create our first continuous aggregate. Creating a continuous aggregate is a two step process: first we define our view and second, we create a policies which will refresh the continuous aggregate according to a schedule.
We'll use the example of creating a daily aggregation of all weather metrics.
Here's the SQL query which defines the query of which we want to maintain the results. In this case, we calculate the daily average for all weather metrics, as well as the maximum and minimum for temperature.
-- Continuous aggs -- define view CREATE MATERIALIZED VIEW weather_metrics_daily WITH (timescaledb.continuous) AS SELECT time_bucket('1 day', time) as bucket, city_name, avg(temp_c) as avg_temp, avg(feels_like_c) as feels_like_temp, max(temp_c) as max_temp, min(temp_c) as min_temp, avg(pressure_hpa) as pressure, avg(humidity_percent) as humidity_percent, avg(rain_3h_mm) as rain_3h, avg(snow_3h_mm) as snow_3h, avg(wind_speed_ms) as wind_speed, avg(clouds_percent) as clouds FROM weather_metrics GROUP BY bucket, city_name WITH NO DATA;
Notice how we use our friend
time_bucket from the previous section to specify that we want to bucket the data by 1 day time periods.
You can create multiple continuous aggregates on the same hypertable, but for simplicity sake we'll just create one for now.
Here's some ideas for other aggregates to create:
-- See info about continuous aggregates SELECT * FROM timescaledb_information.continuous_aggregates;
Right now we've created the continuous aggregate but it has not materialized any data. There are two ways to populate a continuous aggregate: via manual refresh or an automation policy. These methods enable you to refresh materialized data in your continuous aggregates when its most convenient for you (e.g during low query load times on your database).
Let's take a look at how to do each one.
You can manually refresh a continuous aggregate by means of a "one-shot refresh". This is useful for refreshing data for only a specific time-period in the past or if you want to materialize a lot of data at once on a once-off basis.
The example below refreshes the continuous aggregate
the time period starting 1 January 2010 and ending 1 January 2021:
-- manual refresh -- refresh data between 1 Jan 2010 and 2021 CALL refresh_continuous_aggregate('weather_metrics_daily','2010-01-01', '2021-01-01');
Querying our continuous aggregate for data older than 1 January 2009 shows that it is indeed populated with data from the above time frame, as the first row returned is for 1 January 2010:
-- Show that manual refresh worked SELECT * from weather_metrics_daily WHERE bucket > '2009-01-01' ORDER BY bucket ASC;
An automation policy can also be used to refresh continuous aggregates according to a schedule.
While many databases have ad-hoc capabilities for managing large amounts of time-series data, adapting these features to the rhythm of your data lifecycle often requires custom code and regular development time.
Enter TimescaleDB's built-in background jobs capability, often referred to as Automation Policies.
Using automation policies for updating continuous aggregates is the first of many automation policies you'll encounter in TimescaleDB. TimescaleDB also has policies to automate compression and data retention. You can also create custom policies and run them according to a schedule with the User Defined Actions feature. Using this automation feature allows you to "set it and forget it" on administration tasks so that you can spend time on feature development.
You'll see policies for compression and data retention later in this Getting started section.
Let's create a policy which will auto-update the continuous aggregate every two weeks:
-- create policy -- refresh the last 6 months of data every 2 weeks SELECT add_continuous_aggregate_policy('weather_metrics_daily', start_offset => INTERVAL '6 months', end_offset => INTERVAL '1 hour', schedule_interval => INTERVAL '14 days');
The policy above will run every 14 days (
schedule_interval). When it runs, it
will materialize data from between 6 months (
start_offset) and 1 hour (
of the time it executes, according to the query which defined the continuous
Automation policies are useful for continuous aggregates which need to return data from recent time frames, as they help them stay up to date.
When you query a continuous aggregate, the system reads and processes the much smaller materialized table, which has been refreshed at scheduled intervals. This makes querying continuous aggregates useful for speeding up dashboards, summarizing data sampled at high frequency, and querying downsampled data over long time periods.
Moreover, querying continuous aggregates does not slow down INSERT operations, since data is inserted into a different hypertable than the one underlying the continuous aggregate.
Let's examine an example of a query which would perform better on our continuous
weather_metrics_daily) than on our hypertable (
Here's a query which looks at how temperatures in New York have changed over the
past 6 years. It returns the returns the time as well as the daily maximum, minimum
and average temperatures for New York City between 2015 and 2021:
-- Continuous Aggregate query example -- Temperature in New York 2015-2021 SELECT bucket, max_temp, avg_temp, min_temp FROM weather_metrics_daily WHERE bucket >= '2015-01-01' AND bucket < '2021-01-01' AND city_name LIKE 'New York' ORDER BY bucket ASC;
Such a query executes quickly as the data for the time period in question is already populated in the continuous aggregate and aggregated by day. Queries like this are used in historical analysis, often to plot graphs about how temperature changes over time.
By default, continuous aggregates support real-time aggregation, which combines aggregated data and raw data at query time for the most up to date results. (You can turn this off if desired, but the majority of developers want this behaviour by default).
With real-time aggregation turned off, continuous aggregates only return results for data in the time period they have materialized (refreshed). If you query continuous aggregates for data newer than the last materialized time, it will not return it or return stale results.
With real-time aggregation turned on, you always receive up-to-date results, as querying a continuous aggregate returns data that is already materialized combined with the newest raw data from the hypertable available at query time.
Real-time aggregation gives you the best of both worlds: the performance of continuous aggregates and the most up to date data for real time queries, without the performance degradation of querying and aggregating all raw data from scratch.
For example, consider the results of this query, which selects daily aggregates of all weather metrics for the past two years:
-- Real-time aggregation SELECT * from weather_metrics_daily WHERE bucket > now() - 2 * INTERVAL '1 year' ORDER BY bucket DESC;
The first row returned has a time value newer than 1 January 2021, which is the end date for which we materialized data in our continuous aggregate (done via manual refresh earlier in this tutorial). Despite not materializing the latest data via manual refresh or our policy (since the policy created above hasn't yet run), we still get the latest data thanks to real-time aggregation.
This makes real-time aggregation an ideal fit for many near real-time, monitoring and analysis use-cases, especially for dashboarding or reporting that requires the most up to date numbers all the time. It is for this reason that we recommend keeping the setting on.
See how real TimescaleDB users leverage continuous aggregates in the blog posts How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana and How I power a (successful) crypto trading bot with TimescaleDB
Detailed information on continuous aggregates and real-time aggregation can be found in the Continuous Aggregates docs.
Found an issue on this page?Report an issue!