Large subsets

When working with very large tables and composite primary keys, you may need to retrieve thousands—or even tens of thousands—of specific records efficiently.

A common challenge: building a large WHERE clause with multiple AND/OR conditions for composite primary keys can cause:

  • Performance degradation

  • Errors like statement is too large (stack overflow while parsing) or StackOverflowError[null]

The Solution: Use CrateDB’s _id System Column

Every record in CrateDB has an internal _id column. This column is:

  • A unique, deterministic string representing all the primary key values of a row.

  • Automatically generated and indexed.

  • Perfect for high-performance lookups—even on large datasets.

How It Works

Let’s say your table sensor_data uses a composite primary key:

PRIMARY KEY (machine_id, sensor_type)

Now suppose you have tens of thousands of key combinations (from another process or table) and want to fetch all matching rows from sensor_data.

Instead of writing a massive WHERE (machine_id, sensor_type) IN (...) clause, you can do this:

SELECT *
FROM sensor_data
WHERE _id IN (
    SELECT _id FROM relevant_pk_values
);

Here’s what’s happening:

  • relevant_pk_values is a staging table (temporary or persistent) that holds _id values derived from your original key combinations.

  • Since _id is index-backed and deterministic, this lookup is extremely fast—even at scale.


Why This Works

  • CrateDB’s _id encodes primary key values in a deterministic format.

  • If two tables share the same primary key definition and values, their _id values will match.

  • You avoid the complexity and limits of deeply nested WHERE clauses or huge IN lists.

Bonus: This pattern is especially useful in data pipelines or microservices that precompute a list of relevant keys and need to fetch associated records efficiently.

Summary

Feature
Benefit

_id system column

Fast, deterministic primary key representation

WHERE _id IN (SELECT …)

Scales to tens of thousands of lookups easily

Shared PK logic across tables

Enables cross-table matching via _id

Avoids SQL parsing limits

Prevents StackOverflowError or large query errors

Last updated