Token filters

Filters are applied in the order in which you specify them. The sections below show how each filter works.
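Because filters run in order, swapping two of them can change the result. The following is a minimal plain-PHP sketch of that effect, not the library's implementation: `sketchLowercase` and `sketchStopwords` are hypothetical helpers, and the case-sensitive stopword match is an assumption made for illustration.

```php
<?php
// Plain-PHP sketch, independent of the library: each "filter" is a
// function that takes a list of tokens and returns a new list.

/** Hypothetical helper: lowercase every token. */
function sketchLowercase(array $tokens): array
{
    return array_map('strtolower', $tokens);
}

/** Hypothetical helper: drop tokens listed in $stopwords (case-sensitive by assumption). */
function sketchStopwords(array $tokens, array $stopwords): array
{
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => !in_array($token, $stopwords, true)
    ));
}

$tokens = ['But', 'they', 'can', 'finish', 'them'];

// Lowercase first, then stopwords: "But" becomes "but" and is removed.
$lowercaseFirst = sketchStopwords(sketchLowercase($tokens), ['but']);

// Stopwords first, then lowercase: "But" does not match "but", so it survives.
$stopwordsFirst = sketchLowercase(sketchStopwords($tokens, ['but']));

echo implode(' ', $lowercaseFirst), PHP_EOL; // they can finish them
echo implode(' ', $stopwordsFirst), PHP_EOL; // but they can finish them
```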

Stemming

The Stemming filter reduces words to their root form.

For example, “going” becomes “go”:

$newAnalyzer->stemming([
    ['go', ['going']],
    // add more [stem, [words]] pairs here
]);
"Where"
"are"
"you"
"going"
------------------------
Stemming "going" -> "go"
------------------------
"Where"
"are"
"you"
-"going"
+"go"

Stopwords

The Stopwords filter removes common words that do not contribute to the meaning of a phrase.

For example, “but” and “not” are removed from the phrase:

$newAnalyzer->stopwords(['but', 'not']);
"Ladies"
"do"
"not"
"start"
"fights"
"but"
"they"
"can"
"finish"
"them"
---------------------
Stopwords "but","not"
---------------------
"Ladies"
"do"
-"not"
"start"
"fights"
-"but"
"they"
"can"
"finish"
"them"

Unique

The Unique filter removes duplicate words.

For example, the second “I” is removed from the phrase:

$newAnalyzer->unique(onlyOnSamePosition: false);
"I"
"was"
"hiding"
"under"
"your"
"porch"
"because"
"I"
"love"
"you"
---------------------
Unique
---------------------
"I"
"was"
"hiding"
"under"
"your"
"porch"
"because"
-"I"
"love"
"you"

Trim

The Trim filter removes leading and trailing whitespace from words.

For example, “ never give up” becomes “never give up”:

$newAnalyzer->trim();
"Though at times it may feel like the sky is falling around you"
" never give up"
" for every day is a new day"
---------------------
Trim
---------------------
"Though at times it may feel like the sky is falling around you"
- " never give up"
+ "never give up"
- " for every day is a new day"
+ "for every day is a new day"

One-Way Synonyms

The One-Way Synonyms filter replaces a word with its synonym, but not the other way around.

For example, “fun” becomes “joy”:

$newAnalyzer->oneWaySynonyms([
    'ipod' => ['i-pod', 'i pod'],
    'joy' => ['fun'],
]);
"It’s"
"kind"
"of"
"fun"
"to"
"do"
"the"
"impossible"
---------------------
Synonyms "fun" -> "joy"
---------------------
"It’s"
"kind"
"of"
-"fun"
+"joy"
"to"
"do"
"the"
"impossible"

Two-Way Synonyms

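The Two-Way Synonyms filter treats a group of words as interchangeable, so a token that matches any word in the group is expanded with the others.

For example, “joy” is added alongside “fun”, and “fun” would likewise be added alongside “joy”: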

$newAnalyzer->synonyms([
    ['joy', 'fun'],
]);
"It’s"
"kind"
"of"
"fun"
"to"
"do"
"the"
"impossible"
---------------------
Synonyms "fun", "joy"
---------------------
"It’s"
"kind"
"of"
"fun"
+"joy"
"to"
"do"
"the"
"impossible"

Lowercase

The Lowercase filter converts all characters in a word to lowercase.

For example, “ASAP” becomes “asap”:

$newAnalyzer->lowercase();
"You"
"better"
"be"
"back"
"ASAP"
---------------------
Lowercase
---------------------
-"You"
+"you"
"better"
"be"
"back"
-"ASAP"
+"asap"

Uppercase

The Uppercase filter converts all characters in a word to uppercase.

For example, “Miserable” becomes “MISERABLE”:

$newAnalyzer->uppercase();
"Miserable"
"darling"
"as"
"usual"
"perfectly"
"wretched"
---------------------
Uppercase
---------------------
-"Miserable"
-"darling"
-"as"
-"usual"
-"perfectly"
-"wretched"
+"MISERABLE"
+"DARLING"
+"AS"
+"USUAL"
+"PERFECTLY"
+"WRETCHED"

Decimal Digit

The Decimal Digit filter converts non-ASCII digits to their ASCII equivalents.

For example, the Lao digits for 1 to 5 are converted to the ASCII digits 1 to 5:

$newAnalyzer->decimalDigit();
// Lao digits from 1 to 5
"໑"
"໒"
"໓"
"໔"
"໕"
---------------------
Decimal Digit
---------------------
- "໑"
- "໒"
- "໓"
- "໔"
- "໕"
+ "1"
+ "2"
+ "3"
+ "4"
+ "5"

ASCII Folding

The ASCII Folding filter converts characters outside the basic ASCII range to their ASCII equivalents where one exists, which includes stripping diacritics.

For example, “manténgase” becomes “mantengase”:

$newAnalyzer->asciiFolding();
"Por"
"favor"
"manténgase"
"alejado"
"de"
"las"
"puertas"
---------------------
ASCII Folding
---------------------
"Por"
"favor"
- "manténgase"
+ "mantengase"
"alejado"
"de"
"las"
"puertas"

Token Limit

The Token Limit filter limits the number of tokens in a phrase.

For example, only the first five words are kept in the phrase:

$newAnalyzer->tokenLimit(maxTokenCount: 5);
"I"
"was"
"hiding"
"under"
"your"
"porch"
"because"
"I"
"love"
"you"
---------------------
Token Limit 5
---------------------
"I"
"was"
"hiding"
"under"
"your"
-"porch"
-"because"
-"I"
-"love"
-"you"

Truncate

The Truncate filter limits the length of a word.

For example, “Supercalifragilisticexpialidocious” becomes “Supercalif”:

$newAnalyzer->truncate(length: 10);
"Supercalifragilisticexpialidocious"
---------------------
Truncate 10
---------------------
-"Supercalifragilisticexpialidocious"
+"Supercalif"

Keywords

The Keywords filter prevents certain words from being modified by other filters.

For example, “going” is not stemmed to “go”:

$newAnalyzer
    ->keywords(['going'])
    ->stemming([
        ['go', ['going']],
    ]);
"Where"
"are"
"you"
"going"
------------------------
Keywords "going"
------------------------
Stemming "going" -> "go"
------------------------
"Where"
"are"
"you"
"going"

Registering Custom Token Filters

You can register your own token filters with the TokenFilter::filterMap method. It accepts an associative array that maps each custom filter name to the class that implements it. For example:

TokenFilter::filterMap([
    'skroutz_greeklish' => SkroutzGreeklish::class,
    'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);

In this example, two custom token filters are registered: skroutz_greeklish and skroutz_stem_greek. The SkroutzGreeklish and SkroutzGreekStemmer classes define the behavior of these filters.