Tokenizers
Tokenizers split text into tokens. The available tokenizers, their configurable parameters, and brief descriptions are summarized in the table below.
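A quick way to see what a given tokenizer emits is to call an analyze endpoint directly. The sketch below assumes an Elasticsearch-style `_analyze` API reachable at `localhost:9200`; the host, port, and request shape are assumptions, not something stated on this page, so verify them against your engine's documentation.

```python
# Minimal sketch: inspect tokenizer output via an assumed Elasticsearch-style _analyze API.
# Assumes a node at localhost:9200; adjust host/port for your deployment.
import requests


def analyze(text, tokenizer="standard", host="http://localhost:9200"):
    """Return the tokens the given tokenizer produces for `text`."""
    resp = requests.post(f"{host}/_analyze", json={"tokenizer": tokenizer, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]


print(analyze("The 2 QUICK brown-foxes", tokenizer="standard"))
# e.g. ['The', '2', 'QUICK', 'brown', 'foxes']

print(analyze("john.doe@example.com visited https://example.com", tokenizer="uax_url_email"))
# uax_url_email keeps the email address and the URL as single tokens
```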
| Tokenizer | Name | Description | Configurable parameters |
|---|---|---|---|
| Standard | `standard` | Grammar-based tokenization using the Unicode text segmentation algorithm. | `max_token_length` |
| Classic | `classic` | English-oriented heuristics for acronyms, email addresses, and hostnames. | `max_token_length` |
| Thai | `thai` | Splits Thai text correctly; other languages are handled as by the standard tokenizer. | – |
| Letter | `letter` | Splits at any character that is not a letter. | – |
| Lowercase | `lowercase` | Letter tokenizer plus lowercasing of each token. | – |
| Whitespace | `whitespace` | Splits at whitespace. | `max_token_length` |
| UAX URL email | `uax_url_email` | Like standard, but keeps URLs and email addresses as single tokens. | `max_token_length` |
| N-gram | `ngram` | Generates character n-grams. | `min_gram`, `max_gram`, `token_chars` |
| Edge n-gram | `edge_ngram` | Generates character n-grams anchored to the start of each token. | `min_gram`, `max_gram`, `token_chars` |
| Keyword | `keyword` | Emits the entire input as a single token. | `buffer_size` |
| Pattern | `pattern` | Splits text into tokens using a regular expression. | `pattern`, `flags`, `group` |
| Simple pattern | `simple_pattern` | Like pattern, but faster, with a restricted regex subset. | `pattern` |
| Simple pattern split | `simple_pattern_split` | Splits the input on matches of a restricted regex. | `pattern` |
| Path hierarchy | `path_hierarchy` | Splits path-like values into hierarchical tokens. | `delimiter`, `replacement`, `buffer_size`, `reverse`, `skip` |
| Char group | `char_group` | Splits text on a configured set of characters. | `tokenize_on_chars` |
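Several of the tokenizers above take parameters (`min_gram`, `max_gram`, `delimiter`, and so on). Continuing the assumed Elasticsearch-style `_analyze` API from the earlier sketch, the example below passes an inline tokenizer definition so parameters can be tried without creating an index; treat the request shape and the sample outputs as assumptions to check against your engine's documentation.

```python
# Sketch: trying parameterized tokenizers via inline definitions in an assumed
# Elasticsearch-style _analyze API; verify the request shape before relying on it.
import requests

HOST = "http://localhost:9200"  # assumed local node


def analyze_with(tokenizer_def, text):
    """Return the tokens produced by an inline tokenizer definition."""
    resp = requests.post(f"{HOST}/_analyze", json={"tokenizer": tokenizer_def, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]


# edge_ngram: prefixes of 2 to 5 characters, splitting on anything that is
# not a letter or digit.
print(analyze_with(
    {"type": "edge_ngram", "min_gram": 2, "max_gram": 5, "token_chars": ["letter", "digit"]},
    "Quick Fox",
))
# e.g. ['Qu', 'Qui', 'Quic', 'Quick', 'Fo', 'Fox']

# path_hierarchy: emits each ancestor of a filesystem-like path.
print(analyze_with({"type": "path_hierarchy", "delimiter": "/"}, "/var/log/nginx/access.log"))
# e.g. ['/var', '/var/log', '/var/log/nginx', '/var/log/nginx/access.log']
```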