TimescaleDB compresses data by:
- Converting it from a row-based format to a hybrid row-columnar
- Ordering and segmenting the data according to the parameters you specify
- Compressing the columns, using type-specific compression algorithms where possible
- Data is stored on secondary disk pages
- Old indexes are removed and new indexes are created
Understanding compression architecture is important to improving how you work with compressed data.
In TimescaleDB, compression is done chunk by chunk. Each chunk is either compressed or uncompressed. You can't compress only part of the data in a chunk. Compression also doesn't change the number of chunks. One uncompressed chunk turns into one compressed chunk.
In an uncompressed chunk, data is stored in row format. For example, you might have a table that looks like this. Each entry occupies its own row:
When the chunk is compressed, the data is converted to a hybrid row-columnar format. Multiple records are grouped into a single row. The columns of this row hold an array-like structure that stores all the data. Your table looks like this:
|[12:00:02, 12:00:02, 12:00:01, 12:00:1]||[1, 2, 1, 2]||[88.2, 300.5, 88.6, 299.1]||[0.8, 0.9, 0.85, 0.95]|
Each column is then compressed. For information about compression algorithms, see the algorithms section.
For simplicity, this example shows only 4 entries. TimescaleDB can combine up to 1000 entries into a single row.
Behind the scenes, TimescaleDB actually creates a second, under-the-covers, hypertable with the compressed data. You shouldn't query this table directly. Run your queries against the original hypertable, and TimescaleDB finds any compressed data, decompresses it, and returns it to you.
By default, TimescaleDB orders rows by decreasing time value when compressing. Then it combines the rows into columns of up to 1000 entries.
At a first approximation, this means that a chunk with
N rows turns into a compressed chunk with
ceiling(N / 1000) rows. In practice, the number might differ slightly, depending on the start and end ranges of your data and chunks.
You can change the ordering and segmenting behavior to improve compression efficiency and query performance. The best choice depends on your application query patterns. For more information, see the how-to guide on compression.
If you often
ORDER BY a column other than time, you can change the compressed
chunk's data order by using
timescaledb.compress_orderby. For more
information, see the how-to guide for ordering entries.
TimescaleDB speeds up queries by storing the minimum and maximum values of the
orderby columns for each row. For example, when ordering by time, it stores
|[12:00:02, 12:00:02, 12:00:01, 12:00:1]||[1, 2, 1, 2]||[88.2, 300.5, 88.6, 299.1]||[0.8, 0.9, 0.85, 0.95]||12:00:01||12:00:02|
The query planner can quickly decide if it needs to read a row, without needing to decompress the data.
By default, TimescaleDB doesn't segment compressed chunks. It combines entries
in the order of your
orderby columns, without taking into account the value of
any other column.
In some applications, the values of those other columns are useful. For example,
you might use a
WHERE clause to filter by
location. You can
specify these columns as segmenting columns by using
timescaledb.compress_segmentby. For more information, see the how-to guide on
segmentby columns are not compressed. Instead, for each value of the
segmentby column, separate compressed rows are created. You can think of this
as working like a
GROUP BY clause for compression. In the following example,
the compressed table is segmented by
|[12:00:02, 12:00:01]||1||[88.2, 88.6]||[0.8, 0.85]|
|[12:00:02, 12:00:01]||2||[300.5, 299.1]||[0.9, 0.95]|
TimescaleDB uses type-specific compression algorithms to compress each column. This compresses data more efficiently than generic, type-agnostic algorithms. Here are the algorithms used for each data type:
|Data type||Compression method|
|Integers and integer-like types, including timestamps||Delta-of-delta and simple-8b with run-length encoding|
|Columns with low cardinality (columns with only a few possible values that often repeat)||Whole-row dictionary compression|
|Other data types||LZ compression|
Here are brief descriptions of the compression algorithms. For more information, see the compression algorithms blog post.
|Delta-of-delta||Delta encoding stores the difference between a value and a reference value. The reference value can be the previous value. Delta-of-delta applies delta encoding a second time over delta-encoded data. This works best for data that changes slowly, where delta values are small and often 0.|
|Simple-8b||Simple-8b stores integers efficiently inside fixed-size blocks. A set of integers is stored within each block, using the minimum bit-length needed to encode the largest integer in the set. The first bits of each block denote the integer bit-length for that block.|
|Run-length encoding||Run-length encoding compresses data that contains the same value repeated over and over. Values are stored in the form |
|Gorilla compression||Gorilla compression compresses floating point numbers by XORing each value with the previous value.|
|Dictionary compression||Dictionary compression uses a separate list of possible values. It then stores indexes into this dictionary, rather than storing the full repeated value multiple times. This works best for datasets with many repeated values.|
|LZ compression||Another type of dictionary compression, but not applied at the whole-row level. Regular PostgreSQL uses LZ compression for TOAST (The Oversized-Attribute Storage Technique) tables.|
When you compress a chunk, any indexes on that chunk are removed. New indexes
are created on each
segmentby column. The indexes are in the form
(<SEGMENT_BY_COLUMN>, _ts_meta_sequence_num), where
some internal metadata on the compressed chunk.
If you decompress a chunk, your old indexes are restored.
Your old indexes are removed because compressed data isn't indexable in the same way as uncompressed data. For example, say that you have an index on a column named
data. Before compression, the index points to individual rows with individual values for
data. After compression, the values for
data are stored in array-like structures, so the index can no longer point to individual values. Only columns used in
segment_by can be indexed, since they are stored as their original values.
For simplicity, the examples on this page show data arrays contained within the compressed tables. In practise, to speed up queries, data isn't stored within the tables. Instead, it is stored on secondary disk pages. The tables contain pointers to those pages:
This reduces the amount of data that must be read from disk. When querying a compressed chunk, the database reads the table of pointers, which is relatively lightweight. It then follows the pointers to read only the needed columns from disk.
TimescaleDB implements pointers and secondary pages by using PostgreSQL's TOAST feature. To learn more, see the PostgreSQL documentation on TOAST.
In many ways, you interact with compressed chunks just as you interact with uncompressed chunks. But there are some differences.
With TimescaleDB 2.3 and above, you can insert data into compressed chunks. You
INSERT statement as normal. Behind the scenes, TimescaleDB compresses
the inserted row as a single row. It then saves it within the appropriate chunk.
Periodically, it recompresses the chunk, which combines the individually
inserted rows with the previously compressed rows.
This minimizes the performance penalty at insert time, because recompression is batched and performed asynchronously.
To insert large batches of data, you can use a backfilling function.
You can't update or delete data in compressed chunks. You can work around this by decompressing the chunk, making your changes, and recompressing. You can also drop entire chunks at once without decompressing them, either manually or via data retention policy. For more information, see the sections on decompression and data retention.
To minimize the amount of decompressing and recompressing you need to do, set your compression policy to only compress data that is rarely updated.
You query compressed data just as you would query uncompressed data. Queries on compressed data are compatible with all the features of TimescaleDB.
Because features and query syntax remain the same, you can write your queries without thinking about whether the data is compressed. However, to optimize your queries, it helps to understand what is happening behind the scenes.
When you query data across both compressed and uncompressed chunks, TimescaleDB:
- Finds the compressed chunks and decompresses the requested columns
- Appends the decompressed data to more recent, uncompressed, data
- Returns the final results to you as if all the data were stored uncompressed
You incur a performance penalty to decompress the data. But you save on I/O, because you read only the required columns from the compressed chunks, rather than the full, uncompressed data. Overall, this can improve query performance.
Note that this only applies to data on disk, not in memory. Access to data in memory isn't I/O-bound. So, compression is most likely to improve performance for older, less-frequently accessed data.
Even for data on disk, not all queries perform faster on compressed data.
Queries that involve many columns, such as
SELECT * ... queries, incur
performance penalties to decompress and recombine all the columns. At the same
time, they don't benefit from the performance boost of reading fewer columns.
In addition, queries that benefit from particular indexes might be slower,
because compressed chunks can only be indexed on their
segmentby columns. But
such situations are rare. Choosing the right
segmentby column for your query
patterns can help. For more information, see the how-to guide on
Found an issue on this page?Report an issue!