Sometimes there are gaps in our time-series data: because systems are offline, or devices lose power, etc. This causes problems when you want to aggregate data across a large time window, for example, computing the average temperature over the past 6 hours by 30 minute time intervals or analyzing today's CPU utilization by 15 minute intervals. Gaps in data can also have other negative consequences, e.g., breaking applications downstream.
In this tutorial, you'll see how to use Grafana (an open-source visualization tool) and TimescaleDB for handling missing time-series data (using the TimescaleDB/PostgreSQL data source natively available in Grafana).
To complete this tutorial, you will need a cursory knowledge of the Structured Query Language (SQL). The tutorial will walk you through each SQL command, but it will be helpful if you've seen SQL before.
You will also need:
Time-series dataset with missing data (Note: in case you don't have one handy, we include an optional step for creating one below.)
A working installation of TimescaleDB. Once your installation is complete, we can proceed to ingesting or creating sample data and finishing the tutorial.
Grafana dashboard connected to your TimescaleDB instance (setup instructions)
(Please skip this step if you already have TimescaleDB loaded with your time-series data.)
For this tutorial, we are going to load our TimescaleDB instance with simulated IoT sensor data (available in our How to explore TimescaleDB using simulated IoT sensor data tutorial).
This dataset simulates four sensors that each collect temperature and CPU data, in a hypertable structured like this:
CREATE TABLE sensor_data ( time TIMESTAMPTZ NOT NULL, sensor_id INTEGER, temperature DOUBLE PRECISION, cpu DOUBLE PRECISION, FOREIGN KEY (sensor_id) REFERENCES sensors (id) );
To simulate missing data, let's delete all data our sensors collected between 1 hour and 2 hours ago:
DELETE FROM sensor_data WHERE sensor_id = 1 and time > now() - INTERVAL '2 hour' and time < now() - INTERVAL '1 hour';
(For this and the following steps, we'll use the IoT dataset from Step 0, but the steps are the same if you use your own - real or simulated - dataset).
To confirm we're missing data values, let's create a simple graph that
calculates the average temperature readings from
sensor_1 over the past
6 hours (using
SELECT time_bucket('5 minutes', "time") as time, AVG(temperature) AS sensor_1 FROM sensor_data WHERE $__timeFilter("time") AND sensor_id = 1 GROUP BY time_bucket('5 minutes', time) ORDER BY 1
There is missing data from 17:05 to 18:10, as we can see by the lack of data points (flat line) during that time period.
For interpolating the missing data, we use
LOCF ("Last Observation Carried Forward").
This takes the last reading before the missing data began and plots it
(the last recorded value) at regular time intervals until new data is
SELECT time_bucket_gapfill('5 minutes', "time") as time, LOCF(AVG(temperature)) AS sensor_1 FROM sensor_data WHERE $__timeFilter("time") AND sensor_id = 1 GROUP BY time_bucket_gapfill('5 minutes', "time") ORDER BY 1
LOCF is a handy interpolation technique when you have missing data, but no additional context to determine what the missing data values might have been.
As you can see, the graph now plots data points at regular intervals for the times where we have missing data.
Now, we return to our original problem: wanting to aggregate data across a large time window with missing data.
Here we use our interpolated data and compute the average temperature by 30 minute windows over the past 6 hours.
SELECT time_bucket_gapfill('30 minutes', "time") as time, LOCF(AVG(temperature)) AS sensor_1 FROM sensor_data WHERE $__timeFilter("time") AND sensor_id = 1 GROUP BY time_bucket_gapfill('30 minutes', "time") ORDER BY 1
Let's compare this to what the aggregate would have looked like had we not interpolated the missing data, by adding a new series to the graph:
SELECT time_bucket('30 minutes', "time") as time, AVG(temperature) AS sensor_1 FROM sensor_data WHERE $__timeFilter("time") AND sensor_id = 1 GROUP BY time_bucket('30 minutes', time) ORDER BY 1
(Note that the interpolated average is now in ORANGE, while the average with missing data is GREEN.)
As you can see above, the GREEN plot is missing a data point at 17:30, giving us little understanding of what happened during that time period, and risking breaking applications downstream. In contrast, the ORANGE plot uses our interpolated data to create a datapoint for that time period.
This is just one way to use TimescaleDB with Grafana to solve data problems and ensure that your applications, systems, and operations don't suffer any negative consequences (e.g., downtime, misbehaving applications, or a degregraded customer experience). For more ways on how to use TimescaleDB, check out our other tutorials (which range from beginner to advanced).
Found an issue on this page?Report an issue!