Character Filters

Pre-process text before Elasticsearch tokenization with Sigmie's character filters — HTML strip, character mapping, and regex pattern replace.

Character filters run before the tokenizer. They operate on the raw string — stripping HTML, mapping characters, applying regex substitutions — so the tokenizer sees clean input.

Sigmie ships three: HTML strip, character mapping, and pattern replace.
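
The ordering described above can be sketched in plain PHP. This is an illustration only, not Sigmie's or Elasticsearch's implementation: the closures and the `analyze` helper are hypothetical stand-ins, and a whitespace split stands in for a real tokenizer.

```php
<?php
// Hypothetical stand-ins for character filters: each takes the raw string
// and returns a rewritten string.
$charFilters = [
    fn (string $s): string => strip_tags($s),               // roughly "HTML strip"
    fn (string $s): string => strtr($s, [':)' => 'happy']), // roughly "character mapping"
];

// Hypothetical helper: run every character filter, then tokenize.
function analyze(string $text, array $charFilters): array
{
    // 1. Character filters rewrite the raw string, in the order they were added.
    foreach ($charFilters as $filter) {
        $text = $filter($text);
    }

    // 2. Only then does the (here: whitespace) tokenizer see the text.
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = analyze('<b>Take time</b> :)', $charFilters);
// $tokens is ['Take', 'time', 'happy']
```

Because the filters run first, the tokenizer never sees the `<b>` tags or the `:)` emoticon.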

HTML strip

Remove HTML tags from input text:

use Sigmie\Index\Analysis\CharFilter\HTMLStrip;

$analyzer->charFilter(new HTMLStrip);

// Or:
$analyzer->stripHTML();

"<span>Some people are worth melting for.</span>"
▼ Strip HTML
"Some people are worth melting for."

Use this for crawled web content, rich-text fields, or anywhere user input might contain HTML.
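
To get a feel for the effect outside an index, the behavior can be approximated in plain PHP. This is a rough sketch, assuming the filter behaves like Elasticsearch's `html_strip` character filter, which removes tags and also decodes HTML entities. The `stripHtmlSketch` function is hypothetical and PHP's `strip_tags` is only an approximation of what Elasticsearch does.

```php
<?php
// Hypothetical approximation of an HTML strip character filter.
function stripHtmlSketch(string $text): string
{
    // Drop the tags themselves, keeping the text between them.
    $withoutTags = strip_tags($text);

    // Decode entities such as &amp; so the tokenizer sees "&" instead.
    return html_entity_decode($withoutTags, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$plain  = stripHtmlSketch('<span>Some people are worth melting for.</span>');
$entity = stripHtmlSketch('<p>Tom &amp; Jerry</p>');
// $plain  is 'Some people are worth melting for.'
// $entity is 'Tom & Jerry'
```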

Character mapping

Replace specific substrings with fixed replacement strings:

use Sigmie\Index\Analysis\CharFilter\Mapping;

$analyzer->charFilter(new Mapping(
    name: 'mapping_char_filter',
    mappings: [
        ':)' => 'happy',
        ':(' => 'sad',
    ],
));

// Or:
$analyzer->mapChars([
    ':)' => 'happy',
    ':(' => 'sad',
]);

"Even miracles take a little time. :)"
▼ Map Chars (":)" → "happy")
"Even miracles take a little time. happy"

Mappings are literal substring substitutions — not regex.
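
The literal-versus-regex distinction can be illustrated with PHP's built-in `strtr`, which also performs plain substring substitution. This is an analogy for the semantics, not the code Sigmie or Elasticsearch runs, and the variable names are made up for the example.

```php
<?php
// Literal keys are found and replaced as-is.
$mappings = [
    ':)' => 'happy',
    ':(' => 'sad',
];
$mapped = strtr('Even miracles take a little time. :)', $mappings);
// $mapped is 'Even miracles take a little time. happy'

// A key that looks like a regex is still just a literal string: it only
// matches input that contains the exact characters `:D|:\)`.
$regexLikeKey = [':D|:\)' => 'happy'];
$unchanged = strtr('Panic! :D :)', $regexLikeKey);
// $unchanged is 'Panic! :D :)' because nothing matched
```

If you need the alternation to behave as "either `:D` or `:)`", list both keys separately in the mapping or switch to pattern replace.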

Pattern replace

Apply a regex substitution:

use Sigmie\Index\Analysis\CharFilter\Pattern;

$analyzer->charFilter(new Pattern(
    name: 'pattern_replace_char_filter',
    pattern: ':D|:\)',
    replace: 'happy',
));

// Or:
$analyzer->patternReplace(pattern: ':D|:\)', replace: 'happy');

"This is the perfect time to panic! :D :)"
▼ Pattern Replace (":D|:\)" → "happy")
"This is the perfect time to panic! happy happy"

Use pattern replace for edge cases mapping can’t handle — variable-length matches, alternations, anchors.
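
The three cases named above can be tried out locally with PHP's `preg_replace`. Treat this as a sketch: `patternReplaceSketch` is a hypothetical helper, and PHP uses PCRE while Elasticsearch uses Java regular expressions, so only simple patterns like these should be assumed to behave the same in both.

```php
<?php
// Hypothetical helper that mimics a regex-based character filter.
// Assumes the pattern does not contain the '~' delimiter character.
function patternReplaceSketch(string $pattern, string $replace, string $text): string
{
    // Fall back to the original text if the regex engine reports an error.
    return preg_replace('~' . $pattern . '~u', $replace, $text) ?? $text;
}

// Alternation: either emoticon becomes "happy".
$alternation = patternReplaceSketch(':D|:\)', 'happy', 'Time to panic! :D :)');
// $alternation is 'Time to panic! happy happy'

// Variable-length match: any run of closing parens after the colon.
$variable = patternReplaceSketch(':\)+', 'happy', 'Great :))))');
// $variable is 'Great happy'

// Anchor: strip a prefix only at the start of the text.
$anchored = patternReplaceSketch('^RE:\s*', '', 'RE: see RE: below');
// $anchored is 'see RE: below'
```

A literal mapping could cover the first case with two entries, but not the second or third.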

See also