Dual-write and backfill is a migration method in which your application writes to two databases at once, with tooling and guidance to move your existing data from one database to the other. It is specifically designed for, and relies on, your data being predominantly append-only time-series data. As such, it comes with caveats and prerequisites which live migration does not: dual-write and backfill does not support executing `UPDATE` or `DELETE` statements on your data, and it requires you to make changes to the ingest pipeline of your application.
The `timescaledb-backfill` tool is a command-line utility designed to support migrations from Timescale instances by copying historic data from one database to another ("backfilling"). `timescaledb-backfill` efficiently copies hypertable and continuous aggregate chunks directly, without the need for intermediate storage or decompressing compressed chunks. It operates transactionally, ensuring data integrity throughout the migration process. It is designed to be used in the dual-write and backfill migration procedure.
- The tool only supports backfilling of hypertables. Schema migrations and non-hypertable migrations should be handled separately before using this tool.
- The tool is optimized for append-only workloads. Other scenarios may not be fully supported.
- To prevent continuous aggregates from refreshing with incomplete data, turn off any refresh and retention policies targeting the tables that are going to be backfilled, as shown below.
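For example, on the target database, the relevant policy jobs can be paused with `alter_job` and re-enabled once the backfill is complete. This is a minimal sketch, assuming the policies were created with the standard TimescaleDB policy procedures; narrow the `WHERE` clause to the jobs that target the hypertables and continuous aggregates being backfilled:

```sql
-- Pause continuous aggregate refresh and retention policy jobs on the target.
-- Re-enable them later by running the same query with scheduled => true.
SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name IN ('policy_refresh_continuous_aggregate', 'policy_retention');
```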
The tool performs best when executed on an instance located close to the target database. The ideal scenario is an EC2 instance located in the same region as the Timescale service. Use a Linux-based distribution on x86_64.
With the instance that will run `timescaledb-backfill` ready, log in and download the tool's binary:
```bash
wget https://assets.timescale.com/releases/timescaledb-backfill-x86_64-linux.tar.gz
tar xf timescaledb-backfill-x86_64-linux.tar.gz
sudo mv timescaledb-backfill /usr/local/bin/
```
The `timescaledb-backfill` tool offers five commands: `stage`, `copy`, `verify`, `refresh-caggs`, and `clean`. The workflow involves creating tasks, copying chunks, verifying data integrity, refreshing continuous aggregates, and cleaning up the administrative schema after the migration.
Note
In the context of migrations, your existing production database is referred to as the "source" database, while the new Timescale database that you intend to migrate your data to is referred to as the "target" database.
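The commands in the following sections use `$SOURCE` and `$TARGET` to refer to connection strings for the source and target databases. As a sketch, assuming standard PostgreSQL connection URIs with placeholder values:

```bash
# Placeholders only: substitute your real users, passwords, hosts, ports, and database names.
export SOURCE="postgres://<user>:<password>@<source-host>:<port>/<dbname>"
export TARGET="postgres://<user>:<password>@<target-host>:<port>/<dbname>"
```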
Stage Command: creates copy tasks for hypertable chunks based on the specified completion point (`--until`). If a starting point (`--from`) is not specified, data is copied from the beginning of time up to the completion point (`--until`). An optional filter (`--filter`) can be used to refine the hypertables and continuous aggregates targeted for staging.

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET --until '2016-01-02T00:00:00'
```

The tables to be included in the stage can be controlled by providing filtering options:

`--filter`: this option accepts a POSIX regular expression to match schema-qualified hypertable names or continuous aggregate view names. Only hypertables and continuous aggregates matching the filter are staged. By default, the filter includes only the matching objects and does not take dependencies between objects into account. Depending on what is intended, this can be problematic for continuous aggregates, as they form a dependency hierarchy. This behaviour can be modified through the cascade options.
For example, assume a hierarchy of continuous aggregates for hourly, daily, and monthly rollups of data in an underlying hypertable called `raw_data` (all in the `public` schema). This could look as follows:

```
raw_data -> hourly_agg -> daily_agg -> monthly_agg
```

If the filter `--filter='^public\.raw_data$'` is applied, then no data from the continuous aggregates is staged. If the filter `--filter='^public\.daily_agg$'` is applied, then only materialized data in the continuous aggregate `daily_agg` is staged.

`--cascade-up`: when activated, this option ensures that any continuous aggregates which depend on the filtered object are included in the staging process. It is called "cascade up" because it cascades up the hierarchy. Using the example from before, if the filter `--filter='^public\.raw_data$' --cascade-up` is applied, the data in `raw_data`, `hourly_agg`, `daily_agg`, and `monthly_agg` is staged.

`--cascade-down`: when activated, this option ensures that any objects which the filtered object depends on are included in the staging process. It is called "cascade down" because it cascades down the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-down` is applied, the data in `daily_agg`, `hourly_agg`, and `raw_data` is staged.

The `--cascade-up` and `--cascade-down` options can be combined. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up --cascade-down` is applied, data in all objects in the example scenario is staged.

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --until '2016-01-02T00:00:00' \
    --filter '^public\.daily_agg$' \
    --cascade-up \
    --cascade-down
```

Copy Command: processes the tasks created during the staging phase and copies the corresponding hypertable chunks to the target Timescale service.
```bash
timescaledb-backfill copy --source $SOURCE --target $TARGET
```

In addition to the `--source` and `--target` parameters, the `copy` command takes one optional parameter: `--parallelism`, which specifies the number of `COPY` jobs run in parallel (the default is 8). It should ideally be set to the number of cores that the source and target databases have, and it is the most important parameter in dictating both how much load the source database experiences and how quickly data is transferred from the source to the target database.
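For example, a copy run with increased parallelism might look like the following; the value of 16 is illustrative only and should be chosen based on the cores available to your source and target databases:

```bash
timescaledb-backfill copy --source $SOURCE --target $TARGET --parallelism 16
```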
Verify Command: checks for discrepancies between the source and target chunks' data. It compares the results of the count for each chunk's table, as well as per-column count, max, min, and sum values (when applicable, depending on the column data type).

```bash
timescaledb-backfill verify --source $SOURCE --target $TARGET
```

In addition to the `--source` and `--target` parameters, the `verify` command takes one optional parameter: `--parallelism`, which specifies the number of verification jobs run in parallel (the default is 8). It should ideally be set to the number of cores that the source and target databases have, and it is the most important parameter in dictating both how much load the source and target databases experience during verification and how long verification takes to complete.

Refresh Continuous Aggregates Command: refreshes the continuous aggregates of the target system. It covers the period from the last refresh in the target to the last refresh in the source, solving the problem of continuous aggregates being outdated beyond the coverage of the refresh policies.
```bash
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
```

To refresh the continuous aggregates, the command executes the following SQL statement for all the matched continuous aggregates:

```sql
CALL refresh_continuous_aggregate({CAGG NAME}, {TARGET_WATERMARK}, {SOURCE_WATERMARK})
```

The continuous aggregates to be refreshed can be controlled by providing filtering options:
`--filter`: this option accepts a POSIX regular expression to match schema-qualified continuous aggregate view names. By default, the filter includes only the matching objects and does not take dependencies between objects into account. Depending on what is intended, this can be problematic, as continuous aggregates form a dependency hierarchy. This behaviour can be modified through the cascade options.
For example, assume a hierarchy of continuous aggregates for hourly, daily, and monthly rollups of data in an underlying hypertable called `raw_data` (all in the `public` schema). This could look as follows:

```
raw_data -> hourly_agg -> daily_agg -> monthly_agg
```

If the filter `--filter='^public\.daily_agg$'` is applied, only materialized data in the continuous aggregate `daily_agg` is updated. However, this approach can lead to potential issues. For example, if `hourly_agg` is not up to date, then `daily_agg` won't be either, as it requires the missing data from `hourly_agg`. Additionally, it's important to remember to refresh `monthly_agg` at some point to ensure its data remains current. In both cases, relying solely on refresh policies may result in data gaps if the policy doesn't cover the entire required period.

`--cascade-up`: when activated, this option ensures that any continuous aggregates which depend on the filtered object are refreshed. It is called "cascade up" because it cascades up the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up` is applied, `daily_agg` and `monthly_agg` are refreshed.

`--cascade-down`: when activated, this option ensures that any continuous aggregates which the filtered object depends on are refreshed. It is called "cascade down" because it cascades down the hierarchy. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-down` is applied, the data in `daily_agg` and `hourly_agg` is refreshed.

The `--cascade-up` and `--cascade-down` options can be combined. Using the example from before, if the filter `--filter='^public\.daily_agg$' --cascade-up --cascade-down` is applied, then all the continuous aggregates are refreshed.

Clean Command: removes the administrative schema (`__backfill`) that was used to store the tasks once the migration is completed successfully.

```bash
timescaledb-backfill clean --target $TARGET
```
- Backfilling with a filter and until date:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '.*\.my_table.*' \
    --until '2016-01-02T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
timescaledb-backfill verify --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Running multiple stages with different filters and until dates:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema1\.table_with_time_as_timestampz$' \
    --until '2015-01-01T00:00:00'
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema1\.table_with_time_as_bigint$' \
    --until '91827364'
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --filter '^schema2\..*' \
    --until '2017-01-01T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET
timescaledb-backfill verify --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Backfilling a specific period of time with from and until:

```bash
timescaledb-backfill stage --source $SOURCE --target $TARGET \
    --from '2015-01-02T00:00:00' \
    --until '2016-01-02T00:00:00'
timescaledb-backfill copy --source $SOURCE --target $TARGET
timescaledb-backfill clean --target $TARGET
```

- Refreshing a continuous aggregate hierarchy:

```bash
timescaledb-backfill refresh-caggs --source $SOURCE --target $TARGET \
    --filter='^public\.daily_agg$' --cascade-up --cascade-down
```
The `copy` command can be safely stopped by sending an interrupt signal (SIGINT) to the process. This can be achieved by using the Ctrl-C keyboard shortcut from the terminal where the tool is currently running.
When the tool receives the first signal, it interprets it as a request for a graceful shutdown. It then notifies the copy workers that they should exit once they finish copying the chunk they are currently processing. Depending on the chunk size, this could take many minutes to complete.
When a second signal is received, it forces the tool to shut down immediately, interrupting all ongoing work. Due to the tool's usage of transactions, there is no risk of data inconsistency when using forced shutdown.
While a graceful shutdown waits for in-progress chunks to finish copying, a forced shutdown rolls back the in-progress copy transactions. Any data copied into those chunks is lost, but the database is left in a transactionally consistent state, and the backfill process can be safely resumed.
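If the tool is not attached to your terminal, for example because it was started in the background, the same signals can be delivered with `kill`. This is a rough sketch, assuming a single `timescaledb-backfill` process is running on the instance:

```bash
# First SIGINT: graceful shutdown, workers exit after finishing their current chunk.
# A second SIGINT forces an immediate shutdown; in-progress transactions are rolled back.
kill -INT $(pgrep -f 'timescaledb-backfill copy')
```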
Each hypertable chunk that's going to be backfilled has a corresponding task stored in the `__backfill.task` table of the target database. You can use this information to inspect the backfill's progress:
```sql
select
    hypertable_schema,
    hypertable_name,
    count(*) as total_chunks,
    count(worked) as finished_chunks,
    count(*) filter (where worked is null) as pending_chunks
from __backfill.task
group by 1, 2;
```