Full-text indexes
Fulltext indexes allow you to store and search text data by splitting it into tokens that can be queried efficiently. This tokenization is performed by an analyzer, which transforms raw text into searchable units. To perform fulltext searches, you must define a fulltext index (with an analyzer) on the relevant column(s).
Fulltext Index with Analyzer
A fulltext index stores analyzed tokens instead of raw text, enabling advanced text search capabilities. The analyzer defines how text is broken into tokens, lowercased, stemmed, etc.
If no analyzer is specified, CrateDB uses the standard analyzer.
-- Using default analyzer
CREATE TABLE table_c (
first_column TEXT INDEX USING fulltext
);
-- Using a specific analyzer
CREATE TABLE table_d (
first_column TEXT INDEX USING fulltext WITH (analyzer = 'english')
);
Note: With fulltext indices, exact-match queries on the original text may not work as expected. See Built-in analyzers for available options.
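Once a fulltext index exists, it can be queried with the MATCH predicate. A minimal sketch against table_d above (search terms are illustrative); matching rows can be ranked by the _score system column:

```sql
-- Fulltext search on the analyzed column, ranked by relevance
SELECT first_column, _score
FROM table_d
WHERE MATCH(first_column, 'search terms')
ORDER BY _score DESC;
```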
Named Index Columns
You can create a separately named index based on an existing column’s data. This is useful if you want both exact-match (plain) and fulltext search capabilities on the same field.
CREATE TABLE table_e (
first_column TEXT,
INDEX first_column_ft USING fulltext (first_column)
);
-- With a specific analyzer
CREATE TABLE table_f (
first_column TEXT,
INDEX first_column_ft USING fulltext(first_column) WITH (analyzer = 'english')
);
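With a named index, the original column keeps its default (plain) index, so exact-match and fulltext queries work side by side. For example, against table_f (query values are illustrative):

```sql
-- Exact match on the raw column value
SELECT * FROM table_f WHERE first_column = 'Hello World';

-- Fulltext search via the named index
SELECT * FROM table_f WHERE MATCH(first_column_ft, 'hello');
```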
Composite Indices
A composite index combines multiple columns into a single fulltext index.
CREATE TABLE documents_a (
title TEXT,
body TEXT,
INDEX title_body_ft USING fulltext(title, body) WITH (analyzer = 'english')
);
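A composite index is queried through its name, so a single MATCH covers all indexed columns; a sketch against documents_a (query terms are illustrative):

```sql
-- One MATCH searches both title and body via the composite index
SELECT title, _score
FROM documents_a
WHERE MATCH(title_body_ft, 'fulltext search')
ORDER BY _score DESC;
```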
Composite indices can also include nested object columns:
CREATE TABLE documents_b (
title TEXT,
author OBJECT(DYNAMIC) AS (
name TEXT,
birthday TIMESTAMP WITH TIME ZONE
),
INDEX author_title_ft USING fulltext(title, author['name'])
);
Note: Using the plain index method in a composite index is equivalent to fulltext with analyzer = 'keyword'.
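For comparison, plain is the index method that keeps the unanalyzed value; a sketch (table name is illustrative):

```sql
-- Equivalent to fulltext with analyzer = 'keyword'
CREATE TABLE table_g (
    first_column TEXT INDEX USING plain
);
```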
Creating a Custom Analyzer
An analyzer is composed of:
Tokenizer — splits text into tokens.
Token filters — modify, add, or remove tokens.
Char filters — preprocess characters before tokenization.
Example of a simple custom analyzer:
CREATE ANALYZER myanalyzer (
TOKENIZER whitespace,
TOKEN_FILTERS (
lowercase,
kstem
),
CHAR_FILTERS (
html_strip
)
);
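A custom analyzer is referenced by name in a column definition, just like a built-in one; a sketch using myanalyzer from above (the table name is illustrative):

```sql
CREATE TABLE table_h (
    content TEXT INDEX USING fulltext WITH (analyzer = 'myanalyzer')
);
```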
Example with customization:
CREATE ANALYZER myanalyzer_customized (
TOKENIZER whitespace,
TOKEN_FILTERS (
lowercase,
kstem
),
CHAR_FILTERS (
mymapping WITH (
type = 'mapping',
mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
)
)
);
Rules:
Custom filter names must not conflict with built-ins.
type is required to identify the built-in filter being customized.
Tokenizers and token filters can also be customized similarly.
Changing an analyzer after creation is not supported.
Extending a Built-in Analyzer
You can extend an existing analyzer to create a new one with modified parameters.
-- Extending a built-in analyzer
CREATE ANALYZER german_snowball EXTENDS snowball WITH (
language = 'german'
);
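The extended analyzer can then be used like any other analyzer; a sketch (the table name is illustrative):

```sql
CREATE TABLE table_i (
    content TEXT INDEX USING fulltext WITH (analyzer = 'german_snowball')
);
```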
When extending built-ins:
You cannot redefine tokenizers, token filters, or char filters.
Only available parameters can be customized.
When extending custom analyzers:
Omitted parts are inherited from the base analyzer.
CREATE ANALYZER e2 EXTENDS myanalyzer (
TOKENIZER mypattern WITH (
type = 'pattern',
pattern = '.*'
)
);
Here, e2 inherits all filters from myanalyzer but overrides the tokenizer.