Tokenizers

Tokenizers split a stream of text into individual tokens, typically at word boundaries or other configurable break points.
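
Each tokenizer can be tried out on its own via the `_analyze` endpoint, which shows exactly how a sample string is split. The sketch below is a minimal example, assuming an Elasticsearch- or OpenSearch-compatible cluster reachable at `http://localhost:9200` without authentication; the host and sample text are placeholders.

```python
import requests

# Run only the "standard" tokenizer on a sample string (no index required).
# Assumes a local, unauthenticated cluster at localhost:9200 (adjust as needed).
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "standard",
        "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    },
)
resp.raise_for_status()

# Print each emitted token with its character offsets.
for token in resp.json()["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])
```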

| Tokenizer | Type | Description | Parameters |
|---|---|---|---|
| Standard | `standard` | Grammar-based, using Unicode text segmentation. | `max_token_length` |
| Classic | `classic` | Heuristics for English (acronyms, emails, hostnames). | `max_token_length` |
| Thai | `thai` | Properly splits Thai text; other languages are handled as in `standard`. | |
| Letter | `letter` | Splits at non-letter characters. | |
| Lowercase | `lowercase` | Letter tokenizer plus lowercasing. | |
| Whitespace | `whitespace` | Splits at whitespace. | `max_token_length` |
| UAX URL email | `uax_url_email` | Standard tokenizer, but keeps URLs and emails as single tokens. | `max_token_length` |
| N-gram | `ngram` | Generates character n-grams. | `min_gram`, `max_gram`, `token_chars` |
| Edge n-gram | `edge_ngram` | Generates n-grams anchored to the start of each token. | `min_gram`, `max_gram`, `token_chars` |
| Keyword | `keyword` | Emits the entire input as a single token. | `buffer_size` |
| Pattern | `pattern` | Regex-based token splitting. | `pattern`, `flags`, `group` |
| Simple pattern | `simple_pattern` | Faster, with limited regex support. | `pattern` |
| Simple pattern split | `simple_pattern_split` | Splits input on a limited regex. | `pattern` |
| Path hierarchy | `path_hierarchy` | Splits paths into hierarchical tokens. | `delimiter`, `replacement`, `buffer_size`, `reverse`, `skip` |
| Char group | `char_group` | Splits text on a predefined set of characters. | `tokenize_on_chars` |
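
To use a tokenizer with non-default parameters, it is normally given a name in the index settings and referenced from a custom analyzer. The sketch below shows one possible configuration, assuming the same local cluster as above; the index name `my-index` and the names `trigram_tokenizer` and `trigram_analyzer` are illustrative placeholders. It configures the `ngram` tokenizer with the `min_gram`, `max_gram`, and `token_chars` parameters from the table.

```python
import requests

BASE = "http://localhost:9200"  # assumed local, unauthenticated cluster

# Create an index whose custom analyzer wraps a parameterized ngram tokenizer.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                }
            },
        }
    }
}
requests.put(f"{BASE}/my-index", json=settings).raise_for_status()

# Run the custom analyzer on a sample string to inspect the generated n-grams.
resp = requests.post(
    f"{BASE}/my-index/_analyze",
    json={"analyzer": "trigram_analyzer", "text": "quick fox"},
)
print([t["token"] for t in resp.json()["tokens"]])
```

The same pattern applies to the other parameterized tokenizers, for example `pattern` with `pattern`, `flags`, and `group`, or `path_hierarchy` with `delimiter` and `reverse`: define a named tokenizer with a `type` and its settings, then reference it from an analyzer.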

