Downsampling
CrateDB is designed to handle large-scale analytical workloads efficiently. One common pattern is running GROUP BY queries over massive datasets, such as calculating AVG, MAX, and MIN values grouped by device or customer ID.
This guide shows how to improve performance when querying billions of records, using various downsampling techniques that balance accuracy and speed.
Example: GROUP BY Aggregation
Let’s say you want to calculate basic statistics over billions of measurements grouped by device:
cr> SELECT
device_id,
max(value),
avg(value),
min(value)
FROM
measures
GROUP BY
device_id
ORDER BY
1 DESC;
+-----------+------------+-------------------+------------+
| device_id | max(value) | avg(value) | min(value) |
+-----------+------------+-------------------+------------+
| 4 | 10000 | 5003.748816285036 | 0 |
| 3 | 10000 | 5005.297395430482 | 0 |
| 2 | 10000 | 5002.940588080021 | 0 |
| 1 | 10000 | 5002.216030711031 | 0 |
+-----------+------------+-------------------+------------+
Downsampling for Faster Queries
Some databases offer native downsampling to reduce the number of records processed during aggregation—trading off a bit of accuracy for significantly faster results. This is particularly useful for exploratory analysis or live dashboards where exact precision isn't critical.
CrateDB provides multiple ways to achieve downsampling:
Downsampling with _docid % N
CrateDB users can efficiently simulate downsampling by filtering on the _docid system column:
cr> SELECT
device_id,
max(value),
avg(value),
min(value)
FROM
measures
WHERE
_docid % 10 = 0
GROUP BY
device_id
ORDER BY
1 DESC;
+-----------+------------+--------------------+------------+
| device_id | max(value) | avg(value) | min(value) |
+-----------+------------+--------------------+------------+
| 4 | 10000 | 5013.052623224065 | 1 |
| 3 | 10000 | 4999.1253575025175 | 0 |
| 2 | 10000 | 5001.400379047543 | 0 |
| 1 | 10000 | 5035.220951927276 | 0 |
+-----------+------------+--------------------+------------+
As you can see, the results differ slightly but remain very close to the full-scan version.
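The modulo divisor controls the sampling rate, so you can tune the accuracy/speed trade-off per query: _docid % 10 = 0 keeps roughly one row in ten, _docid % 100 = 0 roughly one in a hundred. A minimal sketch of a coarser, roughly 1% sample over the same measures table:
cr> SELECT
      device_id,
      avg(value)
    FROM
      measures
    WHERE
      -- keep roughly 1 of every 100 rows
      _docid % 100 = 0
    GROUP BY
      device_id
    ORDER BY
      1 DESC;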
Why _docid Works
_docid is an internal, segment-local document ID. It is lightweight, always available, and doesn’t require additional disk reads.
While _docid isn’t globally unique, it’s sufficient for randomized sampling via modulo operations.
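Because the filter keeps roughly one row per N, additive aggregates such as count or sum can be scaled back up by the divisor to estimate full-table values (avg needs no scaling, though a sampled min or max can miss extremes). A minimal sketch, assuming the same measures table; the result is an estimate, not an exact count:
cr> SELECT
      device_id,
      -- multiply the sampled count by the modulo divisor (10)
      count(*) * 10 AS approx_row_count
    FROM
      measures
    WHERE
      _docid % 10 = 0
    GROUP BY
      device_id
    ORDER BY
      1 DESC;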
Downsampling with DATE_BIN (Time Bucketing)
For time-series data, it’s often better to downsample by aggregating data into fixed time intervals. This improves performance by reducing data granularity and volume, which makes it ideal for dashboards like Grafana.
SELECT ts_bin,
battery_level,
battery_status,
battery_temperature
FROM (
SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
battery_level,
battery_status,
battery_temperature,
ROW_NUMBER() OVER (PARTITION BY DATE_BIN('5 minutes'::INTERVAL, "time", 0) ORDER BY "time" DESC) AS "row_number"
FROM doc.sensor_readings
) x
WHERE "row_number" = 1
ORDER BY 1 ASC;
This query:
Bins records into 5-minute intervals.
Selects the latest record from each bin.
Reduces total records retrieved, improving dashboard responsiveness.
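If a dashboard needs one aggregated value per bucket rather than the latest raw reading, the same DATE_BIN expression combines with a plain GROUP BY, which is the pairing listed in the summary below. A minimal sketch against the same doc.sensor_readings table; the choice of avg and max is illustrative:
SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
       avg(battery_level) AS avg_battery_level,
       max(battery_temperature) AS max_battery_temperature
FROM doc.sensor_readings
GROUP BY DATE_BIN('5 minutes'::INTERVAL, "time", 0)
ORDER BY 1 ASC;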
Downsampling with LTTB
The LTTB (Largest-Triangle-Three-Buckets) algorithm downsamples time-series data while preserving its visual shape, making it perfect for graphing applications where the accuracy of peaks and trends matters more than raw precision.
WITH downsampleddata AS (
  SELECT lttb_with_parallel_arrays(
           array(SELECT n FROM demo ORDER BY n),
           array(SELECT reading FROM demo ORDER BY n),
           100) AS lttb
)
SELECT unnest(lttb['0']) AS n,
       unnest(lttb['1']) AS reading
FROM downsampleddata;
This query:
Reduces a large dataset to 100 points.
Keeps visual trends intact.
Ideal for mobile or frontend dashboards.
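The techniques also compose: on very large tables, a _docid-based sample can pre-thin the series before LTTB shapes it, cutting the amount of data the function has to process. A minimal sketch under the same assumptions as above (demo table, lttb_with_parallel_arrays available):
WITH downsampleddata AS (
  SELECT lttb_with_parallel_arrays(
           -- pre-thin to roughly 10% of rows before shape-preserving downsampling
           array(SELECT n FROM demo WHERE _docid % 10 = 0 ORDER BY n),
           array(SELECT reading FROM demo WHERE _docid % 10 = 0 ORDER BY n),
           100) AS lttb
)
SELECT unnest(lttb['0']) AS n,
       unnest(lttb['1']) AS reading
FROM downsampleddata;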
Summary: Choose the Right Strategy
+--------------------------------+---------------------+---------------------------------+
| Goal                           | Technique           | Trade-off                       |
+--------------------------------+---------------------+---------------------------------+
| Fast but approximate stats     | _docid % N filter   | Slight accuracy loss            |
| Time-bucketed aggregates       | DATE_BIN + GROUP BY | Lower time resolution           |
| Dashboard-friendly time-series | LTTB algorithm      | Lower data volume, shape intact |
+--------------------------------+---------------------+---------------------------------+