Token Filters

Transform Elasticsearch tokens with Sigmie — stemming, synonyms, stopwords, lowercase, ASCII folding, decimal digit, truncate, and keyword guards.

On this page

Token filters run after the tokenizer. Each filter transforms or removes tokens — lowercasing, stemming, dropping stopwords, applying synonyms.

Filters run in the order you declare them. The order matters: lowercasing before applying stopwords (which are usually defined in lowercase) is correct; doing it the other way around drops nothing.

Stemming

Reduces words to a root form so “going” matches “go”:

$analyzer->stemming([
['go', ['going']],
]);
"Where" "are" "you" "going"
▼ Stemming
"Where" "are" "you" "go"

Stopwords

Drop common words:

$analyzer->stopwords(['but', 'not']);
"Ladies" "do" "not" "start" "fights" "but" "they" "can" "finish" "them"
▼ Stopwords ("not", "but")
"Ladies" "do" "start" "fights" "they" "can" "finish" "them"

Trim

Remove leading and trailing whitespace from each token:

$analyzer->trim();

Useful after pattern-based tokenization that can leave whitespace attached:

" never give up" → "never give up"
" for every day" → "for every day"

Unique

Remove duplicate tokens:

$analyzer->unique(onlyOnSamePosition: false);
"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
▼ Unique
"I" "was" "hiding" "under" "your" "porch" "because" "love" "you"

Synonyms

One-way

Replace specified terms with a canonical form:

$analyzer->oneWaySynonyms([
'ipod' => ['i-pod', 'i pod'],
]);

Anywhere i-pod or i pod appears, it’s also indexed as ipod — but searches for i-pod don’t match documents containing ipod.

Two-way

Map a set of terms to each other:

$analyzer->synonyms([
['joy', 'fun'],
]);

fun and joy are interchangeable — either matches documents containing the other.

"It's" "kind" "of" "fun" "to" "do" "the" "impossible"
▼ Synonyms (fun ↔ joy)
"It's" "kind" "of" "fun" "joy" "to" "do" "the" "impossible"

Lowercase / Uppercase

$analyzer->lowercase();
$analyzer->uppercase();

Lowercase is part of nearly every analyzer — without it, “Matrix” doesn’t match a query for “matrix”.

"You" "better" "be" "back" "ASAP"
▼ Lowercase
"you" "better" "be" "back" "asap"

Decimal digit

Convert non-ASCII digits to ASCII:

$analyzer->decimalDigit();
"໑" "໒" "໓" "໔" "໕" (Lao digits)
▼ Decimal Digit
"1" "2" "3" "4" "5"

ASCII folding

Strip diacritics:

$analyzer->asciiFolding();
"manténgase" → "mantengase"

Useful when users might or might not type accents.

Token limit

Keep only the first N tokens:

$analyzer->tokenLimit(maxTokenCount: 5);
"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
▼ Token Limit 5
"I" "was" "hiding" "under" "your"

Truncate

Limit each token’s length:

$analyzer->truncate(length: 10);
"Supercalifragilisticexpialidocious"
▼ Truncate 10
"Supercalif"

Keywords

Protect specific terms from later filters — for example, prevent stemming on a brand name:

$analyzer
->keywords(['going'])
->stemming([
['go', ['going']],
]);
"Where" "are" "you" "going"
▼ Keywords protect "going"
▼ Stemming would normally turn "going" into "go" — but doesn't here
"Where" "are" "you" "going"

Custom token filters

Register your own filter classes by name:

use Sigmie\Index\Analysis\TokenFilter\TokenFilter;
 
TokenFilter::filterMap([
'skroutz_greeklish' => SkroutzGreeklish::class,
'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);

SkroutzGreeklish and SkroutzGreekStemmer are your classes implementing the token filter contract.

See also

  • Tokenizers — splitting text into tokens before the filters run.
  • Character Filters — preprocessing before tokenization.
  • Analysis — the full pipeline.
  • Languages — pre-built filter chains for English, German, Greek.