Full-text data (Brian - Live)

Moved to PR #270. See preview.

The content has been updated since!


CrateDB offers native full-text search with Apache Lucene and Okapi BM25 ranking, accessible via SQL for easy modeling and querying of large-scale textual data. It features fuzzy matching, multi-language analysis, and composite indexing while fully integrating with data types like JSON, time-series, geospatial, vectors, and more for comprehensive multi-model queries. Whether for document search, catalog lookup, or content analytics, CrateDB is ideal.


Data Types & Indexing

By default, all text columns are indexed as plain (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries.

To use full-text search, add a FULLTEXT index with an optional analyzer to the text columns you want to search:

CREATE TABLE documents (
  title       TEXT,
  body        TEXT,
  INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english')
);

You can also index multiple columns with composite full-text indices:

INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english');

For detailed options, check out the Reference Manual.


Analyzers

An analyzer splits text into searchable terms and consists of the following components:

  • Tokenizer - splits on whitespace/characters

  • Token Filters - e.g. lowercase, stemming, stop‑word removal

  • Char Filters - pre-processing (e.g. stripping HTML).

CrateDB offers about 50 built-in analyzers supporting more than 30 languages.

You can extend a built-in analyzer:

CREATE ANALYZER german_snowball
  EXTENDS snowball
  WITH (language = 'german');

or create your own custom analyzer :

CREATE ANALYZER myanalyzer (
  TOKENIZER whitespace,
  TOKEN_FILTERS (lowercase, kstem),
  CHAR_FILTERS (html_strip)
);

Learn more about the builtin analyzers and how to define your own with custom tokenizers and token filters.


Querying: MATCH Predicate & Scoring

CrateDB uses the SQL MATCH predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score _score, ranked via BM25.

Basic usage:

SELECT title, _score
FROM documents
WHERE MATCH(ft_body, 'search term')
ORDER BY _score DESC;

Searching multiple indices with weighted ranking:

MATCH((ft_title boost 2.0, ft_body), 'keyword')

You can configure match options like:

  • using best_fields (default)

  • fuzziness = 1 (tolerate minor typos)

  • operator = 'AND' or OR

  • slop = N for phrase proximity

Example: Fuzzy Search

SELECT title, _score
FROM documents
WHERE MATCH(ft_body, 'Jamse') USING best_fields WITH (fuzziness = 2)
ORDER BY _score DESC;

This matches similar words like ‘James’.

Example: Multi‑language Composite Search

CREATE TABLE documents (
  title       TEXT,
  body        TEXT,
  INDEX ft_en USING FULLTEXT(body) WITH (analyzer = 'english'),
  INDEX ft_de USING FULLTEXT(body) WITH (analyzer = 'german')
);

SELECT title, _score
FROM documents
WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1)
ORDER BY _score DESC;

Use Cases & Integration

CrateDB is ideal for searching semi-structured large text data—product catalogs, article archives, user-generated content, descriptions and logs.

Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure.

You can combine full-text search with other data domains, for example:

SELECT *
FROM listings
WHERE 
  MATCH(ft_desc, 'garden deck') AND
  price < 500000 AND
  within(location, :polygon);

This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one.


Further Learning & Resources

Last updated