Dual-write and backfill is a migration method in which your application writes to two databases at once, with tooling and guidance to move your existing data from one database to the other. It is specifically designed for, and relies on, your data being predominantly append-only time-series data. As such, it comes with caveats and prerequisites which live migration does not: dual-write and backfill does not support executing `UPDATE` or `DELETE` statements on your data, and it requires you to make changes to the ingest pipeline of your application.
The `timescaledb-backfill` tool is a command-line utility designed to support migrations from Timescale instances by copying historic data from one database to another ("backfilling"). `timescaledb-backfill` efficiently copies hypertable and continuous aggregate chunks directly, without the need for intermediate storage or decompressing compressed chunks. It operates transactionally, ensuring data integrity throughout the migration process. It is designed to be used in the dual-write and backfill migration procedure.
- The tool only supports backfilling of hypertables. Schema migrations and non-hypertable migrations should be handled separately before using this tool.
- The tool is optimized for append-only workloads. Other scenarios may not be fully supported.
- To prevent continuous aggregates from refreshing with incomplete data, turn off any refresh and retention policies targeting the tables that are going to be backfilled, as shown below.
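For example, on the target database, the relevant policy jobs can be paused with `alter_job` and re-enabled once the backfill is complete. This is a minimal sketch, assuming the policies were created with the standard TimescaleDB policy procedures; narrow the `WHERE` clause to the jobs that target the hypertables and continuous aggregates being backfilled:

```sql
-- Pause continuous aggregate refresh and retention policy jobs on the target.
-- Re-enable them later by running the same query with scheduled => true.
SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name IN ('policy_refresh_continuous_aggregate', 'policy_retention');
```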
The tool performs best when executed on an instance located close to the target database. The ideal scenario is an EC2 instance located in the same region as the Timescale service. Use a Linux-based distribution on x86_64.
With the instance that will run `timescaledb-backfill` ready, log in and download the tool's binary:
```bash
wget https://assets.timescale.com/releases/timescaledb-backfill-x86_64-linux.tar.gz
tar xf timescaledb-backfill-x86_64-linux.tar.gz
sudo mv timescaledb-backfill /usr/local/bin/
```
The `timescaledb-backfill` tool offers five commands: `stage`, `copy`, `verify`, `refresh-caggs`, and `clean`. The workflow involves creating tasks, copying chunks, verifying data integrity, refreshing continuous aggregates, and cleaning up the administrative schema after the migration.
Note
In the context of migrations, your existing production database is referred to as the "source" database, while the new Timescale database that you intend to migrate your data to is referred to as the "target" database.
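The commands in the following sections use `$SOURCE` and `$TARGET` to refer to connection strings for the source and target databases. As a sketch, assuming standard PostgreSQL connection URIs with placeholder values:

```bash
# Placeholders only: substitute your real users, passwords, hosts, ports, and database names.
export SOURCE="postgres://<user>:<password>@<source-host>:<port>/<dbname>"
export TARGET="postgres://<user>:<password>@<target-host>:<port>/<dbname>"
```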
Stage Command: creates copy tasks for hypertable chunks based on the specified completion point (`--until`). If a starting point (`--from`) is not specified, data is copied from the beginning of time up to the completion point (`--until`). An optional filter (`--filter`) can be used to refine the hypertables and continuous aggregates targeted for staging.

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET --until '2016-01-02T00:00:00'
```

The tables to be included in the stage can be controlled by providing filtering options:

`--filter`: this option accepts a POSIX regular expression to match schema-qualified hypertable names or continuous aggregate view names. Only hypertables and continuous aggregates matching the filter are staged. By default, the filter includes only the matching objects and does not take dependencies between objects into account. Depending on what is intended, this can be problematic for continuous aggregates, as they form a dependency hierarchy. This behaviour can be modified through the cascade options.
For example, assume a hierarchy of continuous aggregates for hourly, daily, and monthly rollups of data in an underlying hypertable called `raw_data` (all in the `public` schema). This could look as follows:

```
raw_data -> hourly_agg -> daily_agg -> monthly_agg
```

If the filter `--filter='^public\.raw_data$'` is applied, then no data from the continuous aggregates is staged. If the filter `--filter='^public\.daily_agg$'` is applied, then only materialized data in the continuous aggregate `daily_agg` is staged.

`--cascade-up`: when activated, this option ensures that any continuous aggregates which depend on the filtered object are included in the staging process. It is called "cascade up" because it cascades up the hierarchy. Using the example from before, if the filter `--filter='^public\.raw_data$' --cascade-up` is applied, the data in `raw_data`, `hourly_agg`, `daily_agg`, and `monthly_agg` is staged.

`--cascade-down`: when activated, this option ensures that any objects which the filtered object depends on are included in the staging process. It is called "cascade down" because it cascades down the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-down` is applied, the data in `daily_agg`, `hourly_agg`, and `raw_data` is staged.

The `--cascade-up` and `--cascade-down` options can be combined. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up --cascade-down` is applied, data in all objects in the example scenario is staged.

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --until '2016-01-02T00:00:00' \
    --filter '^public\.daily_agg$' \
    --cascade-up \
    --cascade-down
```

Copy Command: processes the tasks created during the staging phase and copies the corresponding hypertable chunks to the target Timescale service.
```bash
timescaledb-backfill copy --source $SOURCE --target $TARGET
```

In addition to the `--source` and `--target` parameters, the `copy` command takes one optional parameter: `--parallelism`, which specifies the number of `COPY` jobs run in parallel (the default is 8). It should ideally be set to the number of cores that the source and target databases have, and it is the most important parameter in dictating both how much load the source database experiences and how quickly data is transferred from the source to the target database.
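For example, a copy run with increased parallelism might look like the following; the value of 16 is illustrative only and should be chosen based on the cores available to your source and target databases:

```bash
timescaledb-backfill copy --source $SOURCE --target $TARGET --parallelism 16
```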
Verify Command: checks for discrepancies between the source and target chunks' data. It compares the results of the count for each chunk's table, as well as per-column count, max, min, and sum values (when applicable, depending on the column data type).

```bash
timescaledb-backfill verify --source $SOURCE --target $TARGET
```

In addition to the `--source` and `--target` parameters, the `verify` command takes one optional parameter: `--parallelism`, which specifies the number of verification jobs run in parallel (the default is 8). It should ideally be set to the number of cores that the source and target databases have, and it is the most important parameter in dictating both how much load the source and target databases experience during verification and how long verification takes to complete.

Refresh Continuous Aggregates Command: refreshes the continuous aggregates of the target system. It covers the period from the last refresh in the target to the last refresh in the source, solving the problem of continuous aggregates being outdated beyond the coverage of the refresh policies.
```bash
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
```

To refresh the continuous aggregates, the command executes the following SQL statement for all the matched continuous aggregates:

```sql
CALL refresh_continuous_aggregate({CAGG NAME}, {TARGET_WATERMARK}, {SOURCE_WATERMARK})
```

The continuous aggregates to be refreshed can be controlled by providing filtering options:
`--filter`: this option accepts a POSIX regular expression to match schema-qualified continuous aggregate view names. By default, the filter includes only the matching objects and does not take dependencies between objects into account. Depending on what is intended, this can be problematic, as continuous aggregates form a dependency hierarchy. This behaviour can be modified through the cascade options.
For example, assume a hierarchy of continuous aggregates for hourly, daily, and monthly rollups of data in an underlying hypertable called `raw_data` (all in the `public` schema). This could look as follows:

```
raw_data -> hourly_agg -> daily_agg -> monthly_agg
```

If the filter `--filter='^public\.daily_agg$'` is applied, only materialized data in the continuous aggregate `daily_agg` is updated. However, this approach can lead to potential issues. For example, if `hourly_agg` is not up to date, then `daily_agg` won't be either, as it requires the missing data from `hourly_agg`. Additionally, it's important to remember to refresh `monthly_agg` at some point to ensure its data remains current. In both cases, relying solely on refresh policies may result in data gaps if the policy doesn't cover the entire required period.

`--cascade-up`: when activated, this option ensures that any continuous aggregates which depend on the filtered object are refreshed. It is called "cascade up" because it cascades up the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up` is applied, `daily_agg` and `monthly_agg` are refreshed.

`--cascade-down`: when activated, this option ensures that any continuous aggregates which the filtered object depends on are refreshed. It is called "cascade down" because it cascades down the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-down` is applied, the data in `daily_agg` and `hourly_agg` is refreshed.

The `--cascade-up` and `--cascade-down` options can be combined. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up --cascade-down` is applied, then all the continuous aggregates are refreshed.

Clean Command: removes the administrative schema (`__backfill`) that was used to store the tasks once the migration is completed successfully.

```bash
timescaledb-backfill clean --target $TARGET
```
- Backfilling with a filter and until date:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '.*\.my_table.*' \
    --until '2016-01-02T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
timescaledb-backfill verify --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Running multiple stages with different filters and until dates:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema1\.table_with_time_as_timestampz$' \
    --until '2015-01-01T00:00:00'
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema1\.table_with_time_as_bigint$' \
    --until '91827364'
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema2\..*' \
    --until '2017-01-01T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
timescaledb-backfill verify --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Backfilling a specific period of time with from and until:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --from '2015-01-02T00:00:00' \
    --until '2016-01-02T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Refreshing a continuous aggregate hierarchy:

```bash
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET \
    --filter='^public\.daily_agg$' --cascade-up --cascade-down
```
The `copy` command can be safely stopped by sending an interrupt signal (SIGINT) to the process. This can be achieved by using the Ctrl-C keyboard shortcut from the terminal where the tool is currently running.
When the tool receives the first signal, it interprets it as a request for a graceful shutdown. It then notifies the copy workers that they should exit once they finish copying the chunk they are currently processing. Depending on the chunk size, this could take many minutes to complete.
When a second signal is received, it forces the tool to shut down immediately, interrupting all ongoing work. Due to the tool's usage of transactions, there is no risk of data inconsistency when using forced shutdown.
While a graceful shutdown waits for in-progress chunks to finish copying, a forced shutdown rolls back the in-progress copy transactions. Any data copied into those chunks is lost, but the database is left in a transactionally consistent state, and the backfill process can be safely resumed.
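If the tool is not attached to your terminal, for example because it was started in the background, the same signals can be delivered with `kill`. This is a rough sketch, assuming a single `timescaledb-backfill` process is running on the instance:

```bash
# First SIGINT: graceful shutdown, workers exit after finishing their current chunk.
# A second SIGINT forces an immediate shutdown; in-progress transactions are rolled back.
kill -INT $(pgrep -f 'timescaledb-backfill copy')
```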
Each hypertable chunk that's going to be backfilled has a corresponding task stored in the `__backfill.task` table of the target database. You can use this information to inspect the backfill's progress:
```sql
select
    hypertable_schema,
    hypertable_name,
    count(*) as total_chunks,
    count(worked) as finished_chunks,
    count(*) filter (where worked is null) as pending_chunks
from __backfill.task
group by 1, 2;
```