Tick Data Compression Strategies: Storing Terabytes on a Budget
The Deluge of Data: Storing Tick History
A single active futures contract can generate millions of ticks per day. A global equities trading firm might subscribe to data for tens of thousands of symbols across dozens of exchanges. The result is a daily firehose of data that can easily reach multiple terabytes. Storing this raw tick data is a fundamental requirement for backtesting and research, but it presents a massive storage challenge. The cost of storing petabytes of historical data can be prohibitive if not managed intelligently. Effective data compression is therefore not an optimization, but a necessity.
This article explores various compression strategies, from general-purpose algorithms to domain-specific techniques tailored for the unique structure of financial time-series data. The goal is to find the optimal balance between compression ratio, compression speed, and decompression speed, as all three are important in a trading environment.
General-Purpose Compression Algorithms
Modern databases and file systems often support transparent compression using general-purpose algorithms. The most popular choices are:
- Snappy (or LZ4): Developed by Google, Snappy is designed for very high speed at the expense of a lower compression ratio. It is often the default choice for real-time systems and columnar databases where decompression speed is paramount: a query that scans a large amount of data can become CPU-bound if decompression is too slow.
- Zstandard (Zstd): Developed by Facebook, Zstandard offers a wide range of compression levels, allowing a user to trade compression ratio for speed. At its lower levels, Zstd can be nearly as fast as Snappy while achieving a significantly better compression ratio; at its higher levels, it matches or exceeds the compression ratios of older algorithms like Gzip while decompressing much faster. Zstd is quickly becoming the new standard for general-purpose compression.
- Gzip: One of the oldest and most widely used algorithms. It provides a good compression ratio but is significantly slower than modern alternatives like Snappy and Zstd, particularly at decompression. It is generally not recommended for new systems where performance is a concern.
For most trading databases, Zstandard offers the best all-around performance. Its flexibility allows for fine-tuning the compression level to match the specific access patterns of the data. For example, recent, frequently accessed data might be stored with a lower compression level for faster access, while older, archival data can be compressed at a higher level to save space.
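The level-versus-ratio trade-off is easy to see empirically. The sketch below uses the standard library's zlib as a stand-in for Zstandard's level parameter (zstd itself requires a third-party package such as `zstandard`); the synthetic tick data is purely illustrative:

```python
import zlib

# Synthetic CSV-like tick data: monotonic timestamps, a repeated price,
# and a small cycling size field.
ticks = b"".join(
    f"{1677609600123456789 + i * 150},4500.25,{i % 7 + 1}\n".encode()
    for i in range(10_000)
)

# Low levels favor speed; high levels favor ratio.
for level in (1, 6, 9):
    compressed = zlib.compress(ticks, level)
    print(f"level {level}: {len(ticks) / len(compressed):.1f}x")
```

The same pattern holds for Zstandard, only with a much wider level range and far faster decompression at every level.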
Domain-Specific Compression for Time-Series Data
While general-purpose algorithms are effective, we can achieve even better compression ratios by using techniques that exploit the specific structure of tick data. Tick data is a time series, and it has several properties that we can use to our advantage:
- Timestamps are monotonically increasing and have small deltas.
- Prices and sizes often do not change between consecutive ticks.
- Values often have a small number of significant digits.
Here are some of the most effective domain-specific techniques:
1. Delta Encoding: Instead of storing the absolute value of a field, we store the difference (delta) from the previous value. This is extremely effective for timestamps.
Original Timestamps (ns): 1677609600123456789, 1677609600123456999, 1677609600123457100
Deltas: +210, +101
These small integer deltas can be encoded using a variable-length integer format (like varint), which uses fewer bytes for smaller numbers.
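A minimal sketch of delta encoding with a LEB128-style varint, assuming monotonically increasing timestamps so that all deltas are non-negative (signed deltas would additionally need zigzag encoding):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def delta_encode(timestamps: list[int]) -> bytes:
    """Store the first timestamp in full, then varint-encoded deltas."""
    out = bytearray(encode_varint(timestamps[0]))
    for prev, cur in zip(timestamps, timestamps[1:]):
        out += encode_varint(cur - prev)
    return bytes(out)

ts = [1677609600123456789, 1677609600123456999, 1677609600123457100]
encoded = delta_encode(ts)
# Three raw 8-byte integers (24 bytes) shrink to 12 bytes:
# 9 for the first timestamp, 2 for delta 210, 1 for delta 101.
```

The savings grow with the number of ticks, since only the first value pays the full-width cost.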
2. Delta-of-Delta Encoding: We can take this a step further and store the delta of the deltas. This is effective if the rate of change is relatively constant.
Deltas: 210, 101, 105, 103
Delta-of-Deltas: -109, +4, -2
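The second difference is a one-liner over the first. The sketch below reproduces the deltas above (the absolute values fed in are hypothetical, chosen only to yield that delta sequence):

```python
def delta_of_delta(values: list[int]) -> list[int]:
    """Second-order differences: the deltas of the consecutive deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Values whose deltas are 210, 101, 105, 103:
print(delta_of_delta([0, 210, 311, 416, 519]))  # [-109, 4, -2]
```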
3. Run-Length Encoding (RLE):
RLE is used to compress sequences of identical values. Instead of storing [100, 100, 100, 100, 101], we store (100, 4), (101, 1). This is very effective for fields like the exchange or the trade condition, which often remain the same for many consecutive ticks.
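In Python, RLE falls out of `itertools.groupby`, which groups consecutive equal values. A minimal sketch using the example above:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of identical consecutive values into
    (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

print(rle_encode([100, 100, 100, 100, 101]))  # [(100, 4), (101, 1)]
```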
4. Dictionary Encoding:
This technique is used for columns with low cardinality (a small number of unique values), such as the symbol or exchange column. A dictionary is built that maps each unique string value to a small integer. The column is then stored as a sequence of these integers. This is a core feature of many columnar databases.
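A minimal dictionary-encoding sketch (the exchange names are illustrative):

```python
def dict_encode(column):
    """Map each unique value to a small integer code;
    return the code sequence plus the dictionary."""
    dictionary = {}
    codes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return codes, dictionary

codes, dictionary = dict_encode(["NYSE", "NYSE", "NASDAQ", "NYSE"])
# codes -> [0, 0, 1, 0]; dictionary -> {"NYSE": 0, "NASDAQ": 1}
```

Because consecutive codes often repeat, the code sequence is itself a good candidate for RLE, which is exactly the pairing described below for the symbol and exchange columns.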
Combining Techniques: The Columnar Approach
The most effective compression is achieved by combining these techniques in a columnar storage format. In a traditional row-based database, all the data for a single tick is stored together. In a columnar database (like Apache Parquet, ORC, or ClickHouse), all the values for a single column are stored together.
This has two major advantages for compression:
- Homogeneous Data: Storing column values together means that the data being compressed is of the same type and often has similar properties, which makes compression algorithms more effective.
- Per-Column Compression: Each column can be compressed with the algorithm best suited to its data type and distribution. For example:
- timestamp column: Delta-of-delta encoding + Zstandard
- price column: Delta encoding + Zstandard
- symbol column: Dictionary encoding + RLE
- exchange column: Dictionary encoding + RLE
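A toy end-to-end sketch ties the pieces together: ticks are split into columns, each column gets a suitable encoding, and a general-purpose compressor runs last. Everything here is illustrative and stdlib-only, with zlib again standing in for Zstandard and JSON as a crude serialization format:

```python
import json
import zlib

# Synthetic ticks: (timestamp_ns, price, symbol).
ticks = [
    (1677609600123456789 + i * 150, 4500 + (i % 3), "ES")
    for i in range(1000)
]

timestamps, prices, symbols = map(list, zip(*ticks))

# timestamp column: delta encoding (first value kept in full).
ts_col = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

# symbol column: dictionary encoding.
symtab = {s: i for i, s in enumerate(dict.fromkeys(symbols))}
sym_col = [symtab[s] for s in symbols]

row_bytes = json.dumps(ticks).encode()                    # row-oriented
col_bytes = json.dumps([ts_col, prices, sym_col]).encode()  # columnar + encoded

print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```

Even with this crude setup, the encoded columnar form compresses far smaller than the row-oriented form, because each column presents the compressor with homogeneous, highly repetitive data.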
This combination of columnar storage and domain-specific compression can lead to compression ratios of 10x to 50x or even higher for typical tick data, with minimal impact on query performance. For any firm serious about storing large amounts of historical market data, adopting a columnar storage format with advanced compression is not just an option, it is a financial and operational necessity.
