You have now seen how to create a hypertable for your NYC taxi trip data and query it. When ingesting a dataset like this is seldom necessary to update old data and over time the amount of data in the tables grows. Over time you end up with a lot of data and since this is mostly immutable you can compress it to save space and avoid incurring additional cost.
It is possible to use disk-oriented compression like the support offered by ZFS and Btrfs but since TimescaleDB is build for handling event-oriented data (such as time-series) it comes with support for compressing data in hypertables.
TimescaleDB compression allows you to store the data in a vastly more efficient format allowing up to 20x compression ratio compared to a normal PostgreSQL table, but this is of course highly dependent on the data and configuration.
TimescaleDB compression is implemented natively in PostgreSQL and does not require special storage formats. Instead it relies on features of PostgreSQL to transform the data into columnar format before compression. The use of a columnar format allows better compression ratio since similar data is stored adjacently. For more details on how the compression format looks, you can look at the compression design section.
A beneficial side-effect of compressing data is that certain queries are significantly faster since less data has to be read into memory.
Connect to the Timescale database that contains the dataset using, for example
psql
.Enable compression on the table and pick suitable segment-by and order-by column using the
ALTER TABLE
command:ALTER TABLE ridesSET (timescaledb.compress,timescaledb.compress_segmentby='vendor_id',timescaledb.compress_orderby='pickup_datetime DESC');Depending on the choice if segment-by and order-by column you can get very different performance and compression ratio. To learn more about how to pick the correct columns, see here.
You can manually compress all the chunks of the hypertable using
compress_chunk
in this manner:SELECT compress_chunk(c) from show_chunks('rides') c;You can also automate compression by adding a compression policy which will be covered below.
Now that you have compressed the table you can compare the size of the dataset before and after compression:
SELECTpg_size_pretty(before_compression_total_bytes) as before,pg_size_pretty(after_compression_total_bytes) as afterFROM hypertable_compression_stats('rides');This shows a significant improvement in data usage:
before | after---------+--------1741 MB | 603 MB
To avoid running the compression step each time you have some data to compress you can set up a compression policy. The compression policy allows you to compress data that is older than a particular age, for example, to compress all chunks that are older than 8 days:
SELECT add_compression_policy('rides', INTERVAL '8 days');
Compression policies run on a regular schedule, by default once every day, which means that you might have up to 9 days of uncompressed data with the setting above.
You can find more information on compression policies in the add_compression_policy section.
Previously, compression was set up to be segmented by vendor_id
column value.
This means fetching data by filtering or grouping on that column will be
more efficient. Ordering is also set to time descending so if you run queries
which try to order data with that ordering, you should see performance benefits.
For instance, if you run the query example from previous section:
SELECT rate_code, COUNT(vendor_id) AS num_tripsFROM ridesWHERE pickup_datetime < '2016-01-08'GROUP BY rate_codeORDER BY rate_code;
You should see a decent performance difference when the dataset is compressed and when is decompressed. Try it yourself by running the previous query, decompressing the dataset and running it again while timing the execution time. You can enable timing query times in psql by running:
\timing
To decompress the whole dataset, run:
SELECT decompress_chunk(c) from show_chunks('rides') c;
On an example setup, speedup performance observed was pretty significant, 700 ms when compressed vs 1,2 sec when decompressed.
Try it yourself and see what you get!
Keywords
Found an issue on this page?Report an issue or Edit this page in GitHub.