Hybrid Index

CrateDB’s hybrid indexing architecture delivers lightning-fast query performance by automatically indexing all columns in a way that is optimized for their data type. You get real-time search and analytics performance out of the box—without needing to manage indexes manually.

Table of Content

Overview

Like search engines such as Lucene and Elasticsearch, CrateDB indexes all fields by default. This includes both structured and semi-structured data. The result is consistently fast performance, whether you're filtering, aggregating, or searching across millions (or billions) of records.

Unlike traditional relational databases that require manual index tuning, CrateDB's hybrid approach lets you simply write SQL—and rely on the engine to make queries fast.

How It Works

CrateDB uses the Apache Lucene storage engine under the hood and applies a per-field indexing strategy:

Text fields are indexed using inverted indexes (ideal for full-text search).
Numeric and geo fields use BKD trees, which support efficient range queries and spatial operations.
Boolean and keyword fields are indexed using optimized structures for equality filters.
DocValues are used for efficient sorting, aggregations, and columnar reads.

Each field type is stored in a format that is purpose-built for the types of queries you're likely to run on it.

This indexing strategy enables CrateDB to handle both analytical queries and text search in a unified way, using standard SQL.

Details: Lucene Index Structures Used in CrateDB

CrateDB leverages three main data structures from Lucene to provide high-performance indexing across different data types:

Inverted Index (for text fields)

Lucene’s inverted index maps content (words, numbers) to the locations (documents) in which they appear. This is the core structure enabling efficient text search.

A plain index (default) stores values as-is, without analyzing or tokenizing them.
A full-text index analyzes input text (using a specified analyzer), tokenizes it into words or phrases, and creates index entries for each token.

This structure allows CrateDB to perform extremely fast and accurate searches on textual content, including relevance scoring when used with MATCH() queries.

BKD Tree (for numeric and spatial fields)

For numeric and geospatial indexing, CrateDB uses Lucene’s Block KD Tree (BKD tree):

BKD trees are highly efficient for range queries, including date ranges (e.g., WHERE ts >= '2025-01-01' AND ts < '2025-02-01').
They are optimized for multi-dimensional point data, such as latitude/longitude pairs and multi-column numeric indexing.

BKD trees maintain excellent performance even after large numbers of updates and provide high space efficiency for large datasets.

DocValues (for sorting and aggregations)

While inverted indexes are optimized for finding documents based on terms, they are not efficient for retrieving values by document or performing columnar operations. That’s where DocValues come in.

DocValues are columnar data structures created at index time.
They store unanalyzed field values in a compact format, making them efficient for:
- Sorting
- Aggregations
- GROUP BY queries
- Columnar-style projections

CrateDB relies on DocValues heavily to execute fast analytics workloads.

Lucene: Indexing Foundation for CrateDB

CrateDB builds its hybrid index architecture on top of Lucene, a high-performance, scalable search engine library. Lucene provides the fundamental building blocks for indexing and querying at speed and scale.

Documents

A single record in Lucene is called a document. A document is a container for fields—each having a name and a value. A Lucene index can store an arbitrary number of documents, each with its own set of fields.

Append-only Segments

Lucene indexes are composed of segments, which are immutable sub-indexes. When new documents are added, they are written to a new segment rather than modifying existing ones.

This append-only model supports fast indexing and avoids locking.
Segments are merged over time in the background to optimize query performance and manage index size.

This approach allows CrateDB to ingest new data at high rates without reindexing existing data.

Column Store

In addition to traditional row-based storage, Lucene also supports a column-oriented store—primarily through DocValues.

For text values, Lucene stores both tokenized terms for searching and the original (or unanalyzed) values in a columnar format.
This column store significantly improves performance for:
- Global aggregations
- Sorting
- Groupings

In CrateDB:

The column store is enabled by default and applies to most primitive types.
It can only be disabled for TEXT fields.
Container types (e.g. arrays, objects) and geospatial types are not currently supported in the column store.

Performance Characteristics

CrateDB’s hybrid index provides:
- Low-latency aggregations
- Efficient filters and range scans
- Fast full-text search
- Native geospatial indexing
All of this is accessible using standard SQL, with no need to write or manage custom index logic.

Usage

CrateDB’s indexing works out of the box. Simply create your table and start querying:

CREATE TABLE products (
  id INTEGER,
  name TEXT,
  price DOUBLE PRECISION,
  tags TEXT[],
  location GEO_POINT
);

-- This query leverages full-text and numeric indexes
SELECT name
FROM products
WHERE match(name, 'laptop') AND price < 1000;

CrateDB’s query engine automatically chooses the most efficient index for each part of the query.

PreviousStorage & Indexing NextStorage and consistency

Last updated 3 months ago

Good evening