Tokenizers
Split text into tokens with Elasticsearch tokenizers in Sigmie — word boundaries, whitespace, pattern, simple pattern, path hierarchy, non-letter.
On this page
The tokenizer is the middle stage of the analysis pipeline. It takes a string and produces tokens — typically words, but the rules depend on which tokenizer you pick.
For text like:
"Make your user's search experience great"
A whitespace tokenizer produces:
"Make" "your" "user's" "search" "experience" "great"
Sigmie has tokenizers for word boundaries, whitespace, patterns, paths, non-letters, and a no-op that keeps the input as one token.
Word boundaries
Produces a token at every word boundary (handles punctuation):
use Sigmie\Index\Analysis\Tokenizers\WordBoundaries; $analyzer->tokenizer(new WordBoundaries(name: 'word_boundaries', maxTokenLength: 255)); // Or via the builder shortcut:$analyzer->tokenizeOnWordBoundaries(maxTokenLength: 255);
maxTokenLength defaults to 255.
Example:
"Aw shucks, pluto. I can't be mad at ya!" → "Aw"→ "shucks"→ "pluto"→ "I"→ "can't"→ "be"→ "mad"→ "at"→ "ya"
Punctuation is absorbed into the boundary.
Whitespace
Splits on whitespace characters only — punctuation stays attached to neighboring tokens:
use Sigmie\Index\Analysis\Tokenizers\Whitespace; $analyzer->tokenizer(new Whitespace(name: 'whitespace_tokenizer')); // Or:$analyzer->tokenizeOnWhitespaces();
Same input as above:
"Aw" "shucks," "pluto." "I" "can't" "be" "mad" "at" "ya!"
shucks,, pluto., and ya! keep their punctuation.
No-op
Treats the entire input as a single token. Useful when you want exact-match behavior on text fields:
use Sigmie\Index\Analysis\Tokenizers\Noop; $analyzer->tokenizer(new Noop(name: 'noop_tokenizer')); // Or:$analyzer->dontTokenize();
"If you ain't scared, you ain't alive." → "If you ain't scared, you ain't alive."
Pattern
Splits at every match of a regular expression. The matched text is not included in any token:
use Sigmie\Index\Analysis\Tokenizers\Pattern; $analyzer->tokenizer(new Pattern(name: 'pattern_tokenizer', ',')); // Or:$analyzer->tokenizeOnPattern(',');
"Though at times it may feel like the sky is falling around you, never give up, for every day is a new day" → "Though at times it may feel like the sky is falling around you"→ " never give up"→ " for every day is a new day"
Simple pattern
Outputs each match of the pattern as a token (the inverse of Pattern):
use Sigmie\Index\Analysis\Tokenizers\SimplePattern; $analyzer->tokenizer(new SimplePattern(name: 'simple_pattern', "'.*'")); // Or:$analyzer->tokenizeOnPatternMatch("'.*'");
"I remember daddy told me 'Fairytales can come true'." → "'Fairytales can come true'"
Only the quoted phrase becomes a token.
Path hierarchy
Produces a token at every level of a hierarchical path:
use Sigmie\Index\Analysis\Tokenizers\PathHierarchy; $analyzer->tokenizer(new PathHierarchy(delimiter: '/')); // Or:$analyzer->tokenizePathHierarchy(delimiter: '/');
Default delimiter is /.
"Disney/Movies/Musical/Sleeping Beauty" → "Disney"→ "Disney/Movies"→ "Disney/Movies/Musical"→ "Disney/Movies/Musical/Sleeping Beauty"
Useful for filtering on path prefixes — searching “Disney/Movies” matches every nested entry.
Non-letter
Splits on any character that isn’t a letter:
use Sigmie\Index\Analysis\Tokenizers\NonLetter; $analyzer->tokenizer(new NonLetter); // Or:$analyzer->tokenizeOnNonLetter();
"To infinity … and beyond!" → "To"→ "infinity"→ "and"→ "beyond"
See also
- Analysis — the full pipeline.
- Token Filters — transforming tokens after tokenization.
- Character Filters — pre-processing before tokenization.