Character Filters
Pre-process text before Elasticsearch tokenization with Sigmie's character filters — HTML strip, character mapping, and regex pattern replace.
Character filters run before the tokenizer. They operate on the raw string — stripping HTML, mapping characters, applying regex substitutions — so the tokenizer sees clean input.
Sigmie ships three: HTML strip, character mapping, and pattern replace.
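The ordering described above can be illustrated with a small plain-PHP sketch. It is a conceptual model only: `analyze()` is a hypothetical helper, not part of Sigmie or Elasticsearch, and the whitespace split is just a stand-in for a real tokenizer.

```php
<?php
// Conceptual sketch only; analyze() is a hypothetical helper, not a Sigmie or Elasticsearch API.

/**
 * Apply each character filter to the raw string, in order,
 * then hand the result to a simple whitespace tokenizer.
 *
 * @param callable[] $charFilters functions that take a string and return a string
 * @return string[] tokens
 */
function analyze(string $text, array $charFilters): array
{
    foreach ($charFilters as $filter) {
        $text = $filter($text);
    }

    // Stand-in tokenizer: split on runs of whitespace.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$raw = '<b>Let it go</b>';

// No character filters: markup stays glued to the tokens.
$unfiltered = analyze($raw, []);             // ['<b>Let', 'it', 'go</b>']

// One character filter (tag stripping) runs before tokenization.
$filtered = analyze($raw, ['strip_tags']);   // ['Let', 'it', 'go']
```

Without the filter, the tokenizer receives the markup as part of the text, which is why cleaning has to happen at the character level first.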
HTML strip
Remove HTML tags from input text:
use Sigmie\Index\Analysis\CharFilter\HTMLStrip;

$analyzer->charFilter(new HTMLStrip);

// Or:
$analyzer->stripHTML();

"<span>Some people are worth melting for.</span>"
│
▼ Strip HTML
"Some people are worth melting for."
Use this for crawled web content, rich-text fields, or anywhere user input might contain HTML.
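To get a feel for the effect outside an index, the transformation can be approximated in plain PHP. This is a rough sketch, not how Sigmie or Elasticsearch implement it: `approximateHtmlStrip()` is a hypothetical helper, and the entity-decoding step assumes the filter maps to Elasticsearch's `html_strip` character filter, which also decodes HTML entities.

```php
<?php
// Rough approximation in plain PHP; not the Sigmie or Elasticsearch implementation.

/** Hypothetical helper: remove tags, then decode HTML entities. */
function approximateHtmlStrip(string $text): string
{
    $withoutTags = strip_tags($text);

    // Turn entities such as &amp; back into the characters they represent.
    return html_entity_decode($withoutTags, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo approximateHtmlStrip('<span>Some people are worth melting for.</span>'), PHP_EOL;
// Some people are worth melting for.

echo approximateHtmlStrip('<p>Anna &amp; Elsa</p>'), PHP_EOL;
// Anna & Elsa
```

The real filter handles more cases than this sketch, so treat it only as a way to reason about what the tokenizer will receive.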
Character mapping
Replace specific substrings with fixed replacement strings:
use Sigmie\Index\Analysis\CharFilter\Mapping;

$analyzer->charFilter(new Mapping(
    name: 'mapping_char_filter',
    mappings: [
        ':)' => 'happy',
        ':(' => 'sad',
    ],
));

// Or:
$analyzer->mapChars([
    ':)' => 'happy',
    ':(' => 'sad',
]);

"Even miracles take a little time. :)"
│
▼ Map Chars (":)" → "happy")
"Even miracles take a little time. happy"
Mappings are literal substring substitutions — not regex.
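PHP's built-in `strtr()` behaves in a comparable way when given an array, which makes it a handy analogy for literal substitution. The sketch below is plain PHP standing in for the filter, not Sigmie code; the second call is an added illustration showing that regex metacharacters in a key are matched as ordinary characters.

```php
<?php
// Plain-PHP analogy; strtr() stands in for the mapping filter here.
$mappings = [
    ':)' => 'happy',
    ':(' => 'sad',
];

// With an array argument, strtr() replaces literal substrings
// and never interprets regex syntax.
$mapped = strtr('Even miracles take a little time. :)', $mappings);
echo $mapped, PHP_EOL; // Even miracles take a little time. happy

// The "." in the key matches only a literal dot, so "abc" is left alone.
$literal = strtr('a.c abc', ['a.c' => 'X']);
echo $literal, PHP_EOL; // X abc
```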
Pattern replace
Apply a regex substitution:
use Sigmie\Index\Analysis\CharFilter\Pattern;

$analyzer->charFilter(new Pattern(
    name: 'pattern_replace_char_filter',
    pattern: ':D|:\)',
    replace: 'happy',
));

// Or:
$analyzer->patternReplace(pattern: ':D|:\)', replace: 'happy');

"This is the perfect time to panic! :D :)"
│
▼ Pattern Replace (":D|:\\)" → "happy")
"This is the perfect time to panic! happy happy"
Use pattern replace for edge cases mapping can’t handle — variable-length matches, alternations, anchors.
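A pattern can be tried out locally with `preg_replace()` before it goes into an analyzer. This is a plain-PHP analogy under one caveat: PHP uses PCRE, while Elasticsearch evaluates patterns with Java's regex engine, so some syntax may differ. The second pattern is an added illustration of a variable-length match.

```php
<?php
// Plain-PHP analogy using PCRE; Elasticsearch uses Java's regex engine,
// so syntax can differ in places.
$input = 'This is the perfect time to panic! :D :)';

// Same alternation as the example above: ":D" or ":)" (the parenthesis is escaped).
$basic = preg_replace('/:D|:\)/', 'happy', $input);
echo $basic, PHP_EOL;
// This is the perfect time to panic! happy happy

// A variable-length match that a literal mapping could not express:
// ":" followed by one or more ")" characters.
$variable = preg_replace('/:\)+/', 'happy', 'So close :)))');
echo $variable, PHP_EOL;
// So close happy
```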
See also
- Tokenizers — what runs after the char filters.
- Token Filters — transforming individual tokens.
- Analysis — the full pipeline.