Full-text data (Brian - Live)
Moved to PR #270. See preview.
The content has been updated since!
CrateDB offers native full-text search with Apache Lucene and Okapi BM25 ranking, accessible via SQL for easy modeling and querying of large-scale textual data. It features fuzzy matching, multi-language analysis, and composite indexing while fully integrating with data types like JSON, time-series, geospatial, vectors, and more for comprehensive multi-model queries. Whether for document search, catalog lookup, or content analytics, CrateDB is ideal.
Data Types & Indexing
By default, all text columns are indexed as plain
(raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries.
To use full-text search, add a FULLTEXT index with an optional analyzer to the text columns you want to search:
CREATE TABLE documents (
title TEXT,
body TEXT,
INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english')
);
You can also index multiple columns with composite full-text indices:
INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english');
For detailed options, check out the Reference Manual.
Analyzers
An analyzer splits text into searchable terms and consists of the following components:
Tokenizer - splits on whitespace/characters
Token Filters - e.g. lowercase, stemming, stop‑word removal
Char Filters - pre-processing (e.g. stripping HTML).
CrateDB offers about 50 built-in analyzers supporting more than 30 languages.
You can extend a built-in analyzer:
CREATE ANALYZER german_snowball
EXTENDS snowball
WITH (language = 'german');
or create your own custom analyzer :
CREATE ANALYZER myanalyzer (
TOKENIZER whitespace,
TOKEN_FILTERS (lowercase, kstem),
CHAR_FILTERS (html_strip)
);
Learn more about the builtin analyzers and how to define your own with custom tokenizers and token filters.
Querying: MATCH Predicate & Scoring
CrateDB uses the SQL MATCH
predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score _score
, ranked via BM25.
Basic usage:
SELECT title, _score
FROM documents
WHERE MATCH(ft_body, 'search term')
ORDER BY _score DESC;
Searching multiple indices with weighted ranking:
MATCH((ft_title boost 2.0, ft_body), 'keyword')
You can configure match options like:
using best_fields
(default)fuzziness = 1
(tolerate minor typos)operator = 'AND'
orOR
slop = N
for phrase proximity
Example: Fuzzy Search
SELECT title, _score
FROM documents
WHERE MATCH(ft_body, 'Jamse') USING best_fields WITH (fuzziness = 2)
ORDER BY _score DESC;
This matches similar words like ‘James’.
Example: Multi‑language Composite Search
CREATE TABLE documents (
title TEXT,
body TEXT,
INDEX ft_en USING FULLTEXT(body) WITH (analyzer = 'english'),
INDEX ft_de USING FULLTEXT(body) WITH (analyzer = 'german')
);
SELECT title, _score
FROM documents
WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1)
ORDER BY _score DESC;
Use Cases & Integration
CrateDB is ideal for searching semi-structured large text data—product catalogs, article archives, user-generated content, descriptions and logs.
Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure.
You can combine full-text search with other data domains, for example:
SELECT *
FROM listings
WHERE
MATCH(ft_desc, 'garden deck') AND
price < 500000 AND
within(location, :polygon);
This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one.
Further Learning & Resources
Full-text Search in Query Capabilities: Overview of full text search features
Reference Manual:
Full-text indices: Defining indices, extending builtin analyzers, custom analyzers.
Full-text analyzers: Builtin analyzers, tokenizers, token and char filters.
Full-text search: Details about MATCH predicate arguments and options.
Hands‑On Academy Course: explore FTS on real datasets (e.g. Chicago neighborhoods).
Last updated