Full-text data (Brian - Live)

Moved to PR #270arrow-up-right. See previewarrow-up-right.

The content has been updated since!


CrateDB offers native full-text search with Apache Lucene and Okapi BM25 ranking, accessible via SQL for easy modeling and querying of large-scale textual data. It features fuzzy matching, multi-language analysis, and composite indexing while fully integrating with data types like JSON, time-series, geospatial, vectors, and more for comprehensive multi-model queries. Whether for document search, catalog lookup, or content analytics, CrateDB is ideal.


Data Types & Indexing

By default, all text columns are indexed as plain (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries.

To use full-text search, add a FULLTEXT index with an optional analyzer to the text columns you want to search:

CREATE TABLE documents (
  title       TEXT,
  body        TEXT,
  INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english')
);

You can also index multiple columns with composite full-text indices:

INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english');

For detailed options, check out the Reference Manualarrow-up-right.


Analyzers

An analyzer splits text into searchable terms and consists of the following components:

  • Tokenizer - splits on whitespace/characters

  • Token Filters - e.g. lowercase, stemming, stop‑word removal

  • Char Filters - pre-processing (e.g. stripping HTML).

CrateDB offers about 50 built-in analyzersarrow-up-right supporting more than 30 languagesarrow-up-right.

You can extend a built-in analyzer:

or create your own custom analyzer :

Learn more about the builtin analyzersarrow-up-right and how to define your ownarrow-up-right with custom tokenizersarrow-up-right and token filters.arrow-up-right


Querying: MATCH Predicate & Scoring

CrateDB uses the SQL MATCH predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score _score, ranked via BM25.

Basic usage:

Searching multiple indices with weighted ranking:

You can configure match options like:

  • using best_fields (default)

  • fuzziness = 1 (tolerate minor typos)

  • operator = 'AND' or OR

  • slop = N for phrase proximity

Example: Fuzzy Search

This matches similar words like ‘James’.

Example: Multi‑language Composite Search


Use Cases & Integration

CrateDB is ideal for searching semi-structured large text data—product catalogs, article archives, user-generated content, descriptions and logs.

Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure.

You can combine full-text search with other data domains, for example:

This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one.


Further Learning & Resources

Last updated