Languages
Language-specific Elasticsearch analyzers in Sigmie — English, German, and Greek stemmers, stopwords, lowercase, and umlaut normalizers.
On this page
Sigmie ships purpose-built analyzers for English, German, and Greek. Each language has its own stemmers, stopword lists, lowercase normalizer, and (where appropriate) script-specific normalizers.
To use a language, pass an instance to language() on the index builder, then chain the filters you want:
use Sigmie\Languages\English\English; $sigmie->newIndex('articles') ->language(new English) ->englishStemmer() ->englishStopwords() ->englishLowercase() ->create();
language() returns a builder typed to that language, so the chained methods are language-specific and discoverable.
English
use Sigmie\Languages\English\English; $sigmie->newIndex('articles') ->language(new English) ->englishStemmer() ->englishStopwords() ->englishLowercase() ->create();
Available filters
| Filter | Purpose |
|---|---|
englishStemmer() |
Standard English stemmer. |
englishPorter2Stemmer() |
Porter2 (Snowball) stemmer. More aggressive than the default. |
englishLightStemmer() |
Lighter stemming — keeps more of the original form. |
englishLovinsStemmer() |
The Lovins algorithm. |
englishMinimalStemmer() |
Minimal stemming for high-precision use cases. |
englishPossessiveStemming() |
Strip trailing 's and '. |
englishStopwords() |
Drop English stopwords. |
englishLowercase() |
Lowercase tokens. |
Pick one stemmer per analyzer — they overlap. Porter2 is the standard choice; the light/minimal variants help when stemming reduces precision too much.
$sigmie->newIndex('articles') ->language(new English) ->englishPorter2Stemmer() ->englishPossessiveStemming() ->englishStopwords() ->englishLowercase() ->create();
German
use Sigmie\Languages\German\German; $sigmie->newIndex('artikel') ->language(new German) ->germanNormalize() ->germanStemmer() ->germanStopwords() ->germanLowercase() ->create();
Available filters
| Filter | Purpose |
|---|---|
germanStemmer() |
Default German stemmer. |
germanStemmer2() |
Alternate German stemmer (variant 2). |
germanLightStemmer() |
Lighter stemming. |
germanMinimalStemmer() |
Minimal stemming. |
germanNormalize() |
Normalize umlauts and ß. |
germanStopwords() |
Drop German stopwords. |
germanLowercase() |
Lowercase tokens. |
germanNormalize() is usually worth including — it folds ü→u, ö→o, ä→a, ß→ss, so queries match regardless of how users type umlauts.
Greek
use Sigmie\Languages\Greek\Greek; $sigmie->newIndex('arthra') ->language(new Greek) ->greekLowercase() ->greekStemmer() ->greekStopwords() ->create();
Available filters
| Filter | Purpose |
|---|---|
greekStemmer() |
Greek stemmer. |
greekStopwords() |
Drop Greek stopwords. |
greekLowercase() |
Lowercase Greek tokens (handles σ → ς word-final form). |
greekLowercase() does more than ASCII lowercase — it handles the Greek-specific final-sigma form. Use it instead of the generic lowercase() for Greek text.
Multi-language indices
Per-field analysis lets you mix languages in one index. Define one analyzer per language-specific field:
use Sigmie\Index\NewAnalyzer;use Sigmie\Languages\English\English;use Sigmie\Languages\German\German; $props->text('description_en') ->withNewAnalyzer(function (NewAnalyzer $analyzer) { $analyzer->tokenizeOnWordBoundaries(); // language-specific filters via separate builder });
For most cases, simpler to keep one index per language and search across them:
$sigmie->newSearch("articles_de,articles_en") ->properties($props) ->queryString('Tür door') ->get();
See also
- Analysis — how text becomes searchable.
- Token Filters — generic, language-agnostic filters.
- Indices — index configuration.