Introduction

Given a series of timestamped heartbeats and a liveness interval, determine the overall liveness of a system. This aggregate can be used to report total uptime or downtime as well as report the time ranges where the system was live or dead.

It's also possible to combine multiple heartbeat aggregates to determine the overall health of a service. For example, the heartbeat aggregates from a primary and standby server could be combine to see if there was ever a window where both machines were down at the same time.

Related hyperfunction groups

Aggregate

heartbeat_agg
Create a liveness aggregate from a set of heartbeats

Accessor

dead_ranges
Get the down intervals from a heartbeat_agg
downtime
Get the total time dead during a heartbeat aggregate
interpolate
Adjust a heartbeat aggregate with predecessor information
interpolated_downtime
Get the total time dead from a heartbeat aggregate and predecessor
interpolated_uptime
Get the total time live from a heartbeat aggregate and predecessor
live_at
Test if the aggregate has a heartbeat covering a given time
live_ranges
Get the live intervals from a heartbeat_agg
num_gaps
Count the number of gaps between live ranges
num_live_ranges
Count the number of live ranges
trim_to
Reduce the covered interval of a heartbeat aggregate
uptime
Get the total time live during a heartbeat aggregate

Rollup

rollup
Combine multiple heartbeat aggregates
heartbeat_agg(
heartbeat TIMESTAMPTZ,
agg_start TIMESTAMPTZ,
agg_duration INTERVAL,
heartbeat_liveness INTERVAL
) RETURNS HeartbeatAgg

This takes a set of heartbeat timestamps and aggregates the liveness state of the underlying system for the specified time range.

Required arguments
NameTypeDescription
heartbeatTIMESTAMPTZThe column containing the timestamps of the heartbeats.
agg_startTIMESTAMPTZThe start of the time range over which this aggregate is tracking liveness.
agg_durationINTERVALThe length of the time range over which this aggregate is tracking liveness. Any point in this range that doesn't closely follow a heartbeat is considered to be dead.
heartbeat_livenessINTERVALHow long the system is considered to be live after each heartbeat.
Returns
ColumnTypeDescription
heartbeat_aggHeartbeatAggThe liveness data for the heartbeated system over the provided interval.
Examples

Given a table called system_health with a ping_time column, construct an aggregate of system liveness for 10 days starting from Jan 1, 2022. This assumes a system is unhealthy if it hasn't been heard from in a 5 minute window:

SELECT heartbeat_agg(
ping_time,
'01-01-2022 UTC',
'10 days',
'5 min')
FROM system_health;
dead_ranges(
agg HEARTBEATAGG
) RETURNS TABLE (
start TIMESTAMPTZ,
end TIMESTAMPTZ
)

Given a heartbeat aggregate, this will return a set of (starttime, endtime) pairs representing when the underlying system did not have a valid heartbeat during the interval of the aggregate.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Returns
ColumnTypeDescription
dead_rangesTABLE (start TIMESTAMPTZ, end TIMESTAMPTZ)The (start, end) pairs of when the system was down.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, we can use the following to get the intervals where the system was down during the week of Jan 9, 2022:

SELECT dead_ranges(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
dead_ranges
-----------------------------------------------------
("2022-01-09 00:00:00+00","2022-01-09 00:00:30+00")
("2022-01-12 15:27:22+00","2022-01-12 15:31:17+00")
downtime(
agg HEARTBEATAGG
) RETURNS INTERVAL

Given a heartbeat aggregate, this will sum all the ranges where the system did not have a recent enough heartbeat.

There may appear to be some downtime between the start of the aggregate and the first heartbeat. If there is a heartbeat aggregage covering the previous period, you can use its last heartbeat to correct for this using interpolated_downtime().

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Returns
ColumnTypeDescription
downtimeINTERVALThe sum of all the dead ranges in the aggregate.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, we can use the following to get the total downtime of the system during the week of Jan 9, 2022:

SELECT downtime(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
downtime
--------
00:04:25
interpolate(
agg HEARTBEATAGG,
pred HEARTBEATAGG
) RETURNS HEARTBEATAGG

This takes a heartbeat aggregate and the aggregate immediately preceding it. It updates the aggregate to include any live ranges that should have been carried over from the last heartbeat in the predecessor, even if there aren't heartbeats for that range in the interval covered by this aggregate. It returns the updated aggregate, which can then be used with any of the heartbeat aggregate accessors.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate containing liveness data for a particular interval.
Optional arguments
NameTypeDescription
predHeartbeatAggThe heartbeat aggregate for the preceding interval, if one exists.
Returns
ColumnTypeDescription
interpolateHeartbeatAggA copy of agg which has been update to include any heartbeat intervals extending past the end of pred.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, we can use the following to get the intervals where the system was unhealthy during the week of Jan 9, 2022. This correctly excludes any ranges covered by a heartbeat at the end of the Jan 2 week:

SELECT dead_ranges(
interpolate(
health,
LAG(health) OVER (ORDER BY date)
)
)
FROM liveness
WHERE date = '01-9-2022 UTC'
dead_ranges
-----------------------------------------------------
("2022-01-12 15:27:22+00","2022-01-12 15:31:17+00")
interpolated_downtime(
agg HEARTBEATAGG,
pred HEARTBEATAGG
) RETURNS INTERVAL

This behaves very similarly to downtime(), but it also takes the heartbeat aggregate from the preceding interval. It checks when the last heartbeat in the predecessor was received and makes sure not to consider the heartbeat interval after that time as unhealthy, even if it extends into the current aggregate prior to the first heartbeat.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Optional arguments
NameTypeDescription
predHeartbeatAggThe heartbeat aggregate for the interval before the one being measured, if one exists.
Returns
ColumnTypeDescription
interpolated_downtimeINTERVALThe sum of all the unhealthy ranges in the aggregate, excluding those covered by the last heartbeat of the previous interval.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, you can use this command to get the total interpolated downtime of the system during the week of Jan 9, 2022:

SELECT interpolated_downtime(
health,
LAG(health) OVER (ORDER BY date)
)
FROM liveness
WHERE date = '01-9-2022 UTC'
interpolated_downtime
---------------------
00:03:55
interpolated_uptime(
agg HEARTBEATAGG,
pred HEARTBEATAGG
) RETURNS INTERVAL

This behaves very similarly to uptime(), but it also takes the heartbeat aggregate from the preceding interval. It checks when the last heartbeat in the predecessor was received and makes sure that the entire heartbeat interval after that is considered live. This addresses the issue where uptime would consider the interval between the start of the interval and the first heartbeat as dead.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Optional arguments
NameTypeDescription
predHeartbeatAggThe heartbeat aggregate for the interval before the one being measured, if one exists.
Returns
ColumnTypeDescription
interpolated_uptimeINTERVALThe sum of all the live ranges in the aggregate, including those covered by the last heartbeat of the previous interval.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, you can use this command to get the total interpolated uptime of the system during the week of Jan 9, 2022:

SELECT interpolated_uptime(
health,
LAG(health) OVER (ORDER BY date)
)
FROM liveness
WHERE date = '01-9-2022 UTC'
interpolated_uptime
-------------------
6 days 23:56:05
live_at(
agg HEARTBEATAGG,
test TIMESTAMPTZ
) RETURNS BOOL

Given a heartbeat aggregate and a timestamp, this returns whether the aggregate has a heartbeat indicating the system was live at the given time.

Note that this returns false for any time not covered by the aggregate.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
testTimestampTzThe time to test the liveness of.
Returns
ColumnTypeDescription
live_atboolTrue if the heartbeat aggregate had a heartbeat close before the test time.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, we can use the following to see if the sytem was live at a particular time:

SELECT live_at(health, '2022-01-12 15:30:00+00')
FROM liveness
WHERE date = '01-9-2022 UTC'
live_at
---------
f
live_ranges(
agg HEARTBEATAGG
) RETURNS TABLE (
start TIMESTAMPTZ,
end TIMESTAMPTZ
)

Given a heartbeat aggregate, this returns a set of (starttime, endtime) pairs representing when the underlying system was live during the interval of the aggregate.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Returns
ColumnTypeDescription
live_rangesTABLE (start TIMESTAMPTZ, end TIMESTAMPTZ)The (start, end) pairs of when the system was live.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, we can use the following to get the intervals where the system was live during the week of Jan 9, 2022:

SELECT live_ranges(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
live_ranges
-----------------------------------------------------
("2022-01-09 00:00:30+00","2022-01-12 15:27:22+00")
("2022-01-12 15:31:17+00","2022-01-16 00:00:00+00")
num_gaps(
agg HEARTBEATAGG
) RETURNS BIGINT

Given a heartbeat aggregate, this returns the number of gaps between the periods of liveness. Additionally, if the aggregate is not live at the start or end of its covered interval, these are also considered gaps.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the number of gaps from.
Returns
ColumnTypeDescription
num_gapsbigintThe number of gaps in the aggregate.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, use this query to see how many times the system was down in a particular week::

SELECT num_gaps(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
num_gaps
---------
4
num_live_ranges(
agg HEARTBEATAGG
) RETURNS BIGINT

Given a heartbeat aggregate, this returns the number of live periods.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the number of ranges from.
Returns
ColumnTypeDescription
num_live_rangesbigintThe number of live ranges in the aggregate.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, use this query to see how many intervals the system was up in a given week::

SELECT num_live_ranges(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
num_live_ranges
---------
5
trim_to(
agg HEARTBEATAGG,
start TIMESTAMPTZ,
duration INTERVAL
) RETURNS HEARTBEATAGG

Given a heartbeat aggregate, this reduces the time range covered by that aggregate. This can only be used to narrow the covered interval, passing arguments that would extend beyond the range covered by the initial aggregate gives an error.

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to trim down.
Optional arguments
NameTypeDescription
startTimestampTzThe start of the trimmed range. If not provided, the returned heartbeat aggregate starts from the same time as the starting one.
durationIntervalHow long the resulting aggregate should cover. If not provided, the returned heartbeat aggregate ends at the same time as the starting one.
Returns
ColumnTypeDescription
trim_toheartbeat_aggThe trimmed aggregate.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, use this query to roll up several weeks and trim the result to an exact month::

SELECT trim_to(rollup(health), '03-1-2022 UTC', '1 month')
FROM liveness
WHERE date > '02-21-2022 UTC' AND date < '3-7-2022 UTC'
uptime(
agg HEARTBEATAGG
) RETURNS INTERVAL

Given a heartbeat aggregate, this sums all the ranges where the system was live and returns the total.

There may appear to be some downtime between the start of the aggregate and the first heartbeat. If there is a heartbeat aggregage covering the previous period, you can use its last heartbeat to correct for this using interpolated_uptime().

Required arguments
NameTypeDescription
aggHeartbeatAggA heartbeat aggregate to get the liveness data from.
Returns
ColumnTypeDescription
uptimeINTERVALThe sum of all the live ranges in the aggregate.
Examples

Given a table called liveness containing weekly heartbeat aggregates in column health with timestamp column date, you can use this command to get the total uptime of the system during the week of Jan 9, 2022:

SELECT uptime(health)
FROM liveness
WHERE date = '01-9-2022 UTC'
uptime
-----------------
6 days 23:55:35
rollup(
heartbeatagg HEARTBEATAGG
) RETURNS HEARTBEATAGG

This combines multiple heartbeat aggregates into one. This can be used to combine aggregates into adjacent intervals into one larger interval, such as rolling daily aggregates into a weekly or monthly aggregate.

Another use for this is to combine heartbeat aggregates for redundant systems to determine if there were any overlapping failures. For instance, a master and standby system can have their heartbeats combined to see if there were any intervals where both systems were down at the same time. The result of rolling overlapping heartbeats together like this is a heartbeat aggregate which considers a time live if any of its component aggregates were live.

Required arguments
NameTypeDescription
heartbeataggHeartbeatAggThe heartbeat aggregates to roll up.
Returns
ColumnTypeDescription
rollupHeartbeatAggA heartbeat aggregate covering the interval from the earliest start time of its component aggregates to the latest end time. It combines the live ranges of all the components.

Keywords

Found an issue on this page?

Report an issue!