Token Filters
Transform Elasticsearch tokens with Sigmie — stemming, synonyms, stopwords, lowercase, ASCII folding, decimal digit, truncate, and keyword guards.
Token filters run after the tokenizer. Each filter transforms or removes tokens — lowercasing, stemming, dropping stopwords, applying synonyms.
Filters run in the order you declare them, and the order matters. Lowercasing before removing stopwords (which are usually defined in lowercase) is correct; doing it the other way around leaves capitalized stopwords such as “But” in place, because they never match the lowercase list.
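The effect of ordering can be sketched in plain PHP. The helpers below (`sketchLowercase`, `sketchStopwords`) are hypothetical stand-ins written for this illustration, not Sigmie methods, and they assume the stopword check is a case-sensitive lookup.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for two filters; not part of Sigmie.

/** Lowercase every token (ASCII-only here for simplicity). */
function sketchLowercase(array $tokens): array
{
    return array_map('strtolower', $tokens);
}

/** Drop tokens that exactly match a stopword (case-sensitive lookup). */
function sketchStopwords(array $tokens, array $stopwords): array
{
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => !in_array($token, $stopwords, true)
    ));
}

$tokens = ['Not', 'today', 'But', 'soon'];
$stopwords = ['but', 'not'];

// Lowercase first: "Not" and "But" become "not" and "but", so both are dropped.
$lowercaseFirst = sketchStopwords(sketchLowercase($tokens), $stopwords);

// Stopwords first: the capitalized tokens never match, survive, and are only lowercased afterwards.
$stopwordsFirst = sketchLowercase(sketchStopwords($tokens, $stopwords));

print_r($lowercaseFirst); // today, soon
print_r($stopwordsFirst); // not, today, but, soon
```

The second result still contains both stopwords, which is the failure mode the ordering rule guards against.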
Stemming
Reduces words to a root form so “going” matches “go”:
$analyzer->stemming([
    ['go', ['going']],
]);

"Where" "are" "you" "going"
            │
            ▼ Stemming
"Where" "are" "you" "go"
Stopwords
Drop common words:
$analyzer->stopwords(['but', 'not']);

"Ladies" "do" "not" "start" "fights" "but" "they" "can" "finish" "them"
            │
            ▼ Stopwords ("not", "but")
"Ladies" "do" "start" "fights" "they" "can" "finish" "them"
Trim
Remove leading and trailing whitespace from each token:
$analyzer->trim();
Useful after pattern-based tokenization that can leave whitespace attached:
" never give up" → "never give up"
" for every day" → "for every day"
Unique
Remove duplicate tokens:
$analyzer->unique(onlyOnSamePosition: false);

"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
            │
            ▼ Unique
"I" "was" "hiding" "under" "your" "porch" "because" "love" "you"
Synonyms
One-way
Replace specified terms with a canonical form:
$analyzer->oneWaySynonyms([ 'ipod' => ['i-pod', 'i pod'],]);
Anywhere i-pod or i pod appears, it’s also indexed as ipod — but searches for i-pod don’t match documents containing ipod.
Two-way
Map a set of terms to each other:
$analyzer->synonyms([
    ['joy', 'fun'],
]);
fun and joy are interchangeable: either matches documents containing the other.

"It's" "kind" "of" "fun" "to" "do" "the" "impossible"
            │
            ▼ Synonyms (fun ↔ joy)
"It's" "kind" "of" "fun" "joy" "to" "do" "the" "impossible"
Lowercase / Uppercase
$analyzer->lowercase();
$analyzer->uppercase();
Lowercase is part of nearly every analyzer: without it, “Matrix” doesn’t match a query for “matrix”.

"You" "better" "be" "back" "ASAP"
            │
            ▼ Lowercase
"you" "better" "be" "back" "asap"
Decimal digit
Convert non-ASCII digits to ASCII:
$analyzer->decimalDigit();

"໑" "໒" "໓" "໔" "໕" (Lao digits)
            │
            ▼ Decimal Digit
"1" "2" "3" "4" "5"
ASCII folding
Strip diacritics:
$analyzer->asciiFolding();
"manténgase" → "mantengase"
Useful when users might or might not type accents.
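The point of folding is that accented and unaccented spellings end up as the same token on both the indexing and the query side. A minimal plain-PHP sketch follows; `sketchAsciiFold` is a hypothetical helper, not a Sigmie method, and it maps only a handful of characters, far fewer than a real folding filter covers.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for ASCII folding; not part of Sigmie.
 * Only a few precomposed characters are mapped here.
 */
function sketchAsciiFold(string $token): string
{
    return strtr($token, [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ñ' => 'n', 'ü' => 'u',
    ]);
}

// Fold both the indexed token and the query so the two spellings meet.
$indexed = sketchAsciiFold('manténgase');
$query = sketchAsciiFold('mantengase');

var_dump($indexed === $query); // bool(true)
```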
Token limit
Keep only the first N tokens:
$analyzer->tokenLimit(maxTokenCount: 5);

"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
            │
            ▼ Token Limit 5
"I" "was" "hiding" "under" "your"
Truncate
Limit each token’s length:
$analyzer->truncate(length: 10);

"Supercalifragilisticexpialidocious"
            │
            ▼ Truncate 10
"Supercalif"
Keywords
Protect specific terms from later filters — for example, prevent stemming on a brand name:
$analyzer
    ->keywords(['going'])
    ->stemming([
        ['go', ['going']],
    ]);

"Where" "are" "you" "going"
            │
            ▼ Keywords protect "going"
            ▼ Stemming would normally turn "going" into "go", but doesn't here
"Where" "are" "you" "going"
Custom token filters
Register your own filter classes by name:
use Sigmie\Index\Analysis\TokenFilter\TokenFilter;

TokenFilter::filterMap([
    'skroutz_greeklish' => SkroutzGreeklish::class,
    'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);
SkroutzGreeklish and SkroutzGreekStemmer are your classes implementing the token filter contract.
See also
- Tokenizers — splitting text into tokens before the filters run.
- Character Filters — preprocessing before tokenization.
- Analysis — the full pipeline.
- Languages — pre-built filter chains for English, German, Greek.