Downsampling

CrateDB is designed to handle large-scale analytical workloads efficiently. One common pattern is running GROUP BY queries over massive datasets—such as calculating AVG, MAX, and MIN values grouped by device or customer ID.

This guide shows how to improve performance when querying billions of records, using various downsampling techniques that balance accuracy and speed.

Example: GROUP BY Aggregation

Let’s say you want to calculate basic statistics over billions of measurements grouped by device:

cr> SELECT
   device_id,
   max(value),
   avg(value),
   min(value)
FROM
   measures
GROUP BY
   device_id
ORDER BY
   1 DESC;

+-----------+------------+-------------------+------------+
| device_id | max(value) |        avg(value) | min(value) |
+-----------+------------+-------------------+------------+
|         4 |      10000 | 5003.748816285036 |          0 |
|         3 |      10000 | 5005.297395430482 |          0 |
|         2 |      10000 | 5002.940588080021 |          0 |
|         1 |      10000 | 5002.216030711031 |          0 |
+-----------+------------+-------------------+------------+

By default, CrateDB scans all matching records, which may require substantial processing power depending on your dataset and cluster size.

Downsampling for Faster Queries

Some databases offer native downsampling to reduce the number of records processed during aggregation—trading off a bit of accuracy for significantly faster results. This is particularly useful for exploratory analysis or live dashboards where exact precision isn't critical.

CrateDB provides multiple ways to achieve downsampling:

Downsampling with _docid % N

CrateDB users can efficiently approximate downsampling by applying a modulo filter to the _docid system column:

cr> SELECT
   device_id,
   max(value),
   avg(value),
   min(value)
FROM
   measures
WHERE
   _docid % 10 = 0
GROUP BY
   device_id
ORDER BY
   1 DESC;

+-----------+------------+--------------------+------------+
| device_id | max(value) |         avg(value) | min(value) |
+-----------+------------+--------------------+------------+
|         4 |      10000 | 5013.052623224065  |          1 |
|         3 |      10000 | 4999.1253575025175 |          0 |
|         2 |      10000 | 5001.400379047543  |          0 |
|         1 |      10000 | 5035.220951927276  |          0 |
+-----------+------------+--------------------+------------+

As you can see, the results differ slightly but remain very close to the full-scan version.

Tip: The modulus 10 was chosen arbitrarily; _docid % 10 = 0 keeps roughly every 10th row. A larger modulus reduces query time further but also decreases result accuracy.
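For instance, a sketch of a coarser 1% sample over the same measures table; expect a proportionally faster but less accurate result:

SELECT
   device_id,
   max(value),
   avg(value),
   min(value)
FROM
   measures
WHERE
   _docid % 100 = 0   -- keep roughly every 100th row
GROUP BY
   device_id
ORDER BY
   1 DESC;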

Why _docid Works

  • _docid is an internal, segment-local document ID.

  • It is lightweight, always available, and doesn’t require additional disk reads.

  • While _docid isn’t globally unique, it’s sufficient for randomized sampling via modulo operations.

This technique is non-standard SQL and relies on internal mechanics of CrateDB. Use it only when approximate results are acceptable.
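One caveat worth illustrating: value-based aggregates like AVG, MIN, and MAX remain directly meaningful on a sample, but COUNT and SUM shrink by the sampling factor. A minimal sketch that scales a sampled count back up, assuming the 1-in-10 filter from above:

SELECT
   device_id,
   count(*) * 10 AS estimated_count   -- multiply by the sampling factor
FROM
   measures
WHERE
   _docid % 10 = 0
GROUP BY
   device_id
ORDER BY
   1 DESC;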

Downsampling with DATE_BIN (Time Bucketing)

For time-series data, it’s often better to downsample by aggregating data into fixed time intervals. This improves performance by reducing data granularity and volume—ideal for dashboards like Grafana.

SELECT ts_bin,
       battery_level,
       battery_status,
       battery_temperature
FROM (
  SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
         battery_level,
         battery_status,
         battery_temperature,
         ROW_NUMBER() OVER (PARTITION BY DATE_BIN('5 minutes'::INTERVAL, "time", 0) ORDER BY "time" DESC) AS "row_number"
  FROM doc.sensor_readings
) x
WHERE "row_number" = 1
ORDER BY 1 ASC;

This query:

  • Bins records into 5-minute intervals.

  • Selects the latest record from each bin.

  • Reduces total records retrieved, improving dashboard responsiveness.

For aggregations (e.g., average per bin), replace ROW_NUMBER() logic with aggregation functions like AVG, MAX, etc.
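For instance, a minimal sketch of a per-bin average, reusing doc.sensor_readings from above (the choice of aggregated columns is illustrative):

SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
       avg(battery_level) AS avg_battery_level,          -- mean per 5-minute bin
       max(battery_temperature) AS max_battery_temperature
FROM doc.sensor_readings
GROUP BY DATE_BIN('5 minutes'::INTERVAL, "time", 0)
ORDER BY 1 ASC;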

Downsampling with LTTB

The Largest-Triangle-Three-Buckets (LTTB) algorithm downsamples time-series data while preserving its visual shape, making it ideal for graphing applications where the accuracy of peaks and trends matters more than raw precision.

WITH downsampleddata AS (
  SELECT lttb_with_parallel_arrays(
           array(SELECT n FROM demo ORDER BY n),
           array(SELECT reading FROM demo ORDER BY n),
           100
         ) AS lttb
)
SELECT unnest(lttb['0']) AS n,
       unnest(lttb['1']) AS reading
FROM downsampleddata;

This query:

  • Reduces a large dataset to 100 points.

  • Keeps visual trends intact.

  • Ideal for mobile or frontend dashboards.

Summary: Choose the Right Strategy

+--------------------------------+---------------------+---------------------------------+
| Use Case                       | Technique           | Trade-off                       |
+--------------------------------+---------------------+---------------------------------+
| Fast but approximate stats     | _docid % N filter   | Slight accuracy loss            |
| Time-bucketed aggregates       | DATE_BIN + GROUP BY | Lower time resolution           |
| Dashboard-friendly time-series | LTTB algorithm      | Lower data volume, shape intact |
+--------------------------------+---------------------+---------------------------------+
