Token Filters
Filters are applied in the order that you specify them. Here are some examples of how each filter works:
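Order matters because each filter receives the output of the previous one. The following is a minimal plain-PHP sketch of that idea. It does not use the library: `applyFilters` and the two closures are hypothetical stand-ins written only for illustration.

```php
<?php

/**
 * Hypothetical helper: run each filter over the token list, in order.
 *
 * @param string[]   $tokens
 * @param callable[] $filters each one takes and returns a list of tokens
 * @return string[]
 */
function applyFilters(array $tokens, array $filters): array
{
    foreach ($filters as $filter) {
        $tokens = $filter($tokens);
    }

    return $tokens;
}

// Simplified stand-ins for a lowercase filter and a stopword filter.
$lowercase = fn (array $tokens): array => array_map('strtolower', $tokens);
$stopwords = fn (array $tokens): array => array_values(
    array_filter($tokens, fn (string $t): bool => !in_array($t, ['but'], true))
);

$tokens = ['Small', 'BUT', 'mighty'];

// Lowercase first: "BUT" becomes "but" and is then removed.
$a = applyFilters($tokens, [$lowercase, $stopwords]);

// Stopwords first: "BUT" does not match "but", so it survives
// and is only lowercased afterwards.
$b = applyFilters($tokens, [$stopwords, $lowercase]);
```

With the same two filters, `$a` no longer contains the stopword while `$b` still does, which is why the order of the calls on the analyzer is significant.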
Stemming
The Stemming filter reduces words to their root form.
For example, “going” becomes “go”:
$newAnalyzer->stemming([
    ['go', ['going']],
    // more
]);

"Where" "are" "you" "going"
------------------------
Stemming "going" -> "go"
------------------------
"Where" "are" "you"
-"going"
+"go"
Stopwords
The Stopwords filter removes common words that do not contribute to the meaning of a phrase.
For example, “but” and “not” are removed from the phrase:
$newAnalyzer->stopwords(['but', 'not']);

"Ladies" "do" "not" "start" "fights" "but" "they" "can" "finish" "them"
---------------------
Stopwords "but","not"
---------------------
"Ladies" "do"
-"not"
"start" "fights"
-"but"
"they" "can" "finish" "them"
Unique
The Unique filter removes duplicate words.
For example, the second “I” is removed from the phrase:
$newAnalyzer->unique(onlyOnSamePosition: false);

"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
---------------------
Unique
---------------------
"I" "was" "hiding" "under" "your" "porch" "because"
-"I"
"love" "you"
Trim
The Trim filter removes leading and trailing whitespace from words.
For example, “ never give up” becomes “never give up”:
$newAnalyzer->trim();

"Though at times it may feel like the sky is falling around you" " never give up" " for every day is a new day"
---------------------
Trim
---------------------
"Though at times it may feel like the sky is falling around you"
-" never give up"
+"never give up"
-" for every day is a new day"
+"for every day is a new day"
One-Way Synonym
The One-Way Synonym filter replaces a word with its synonym, but not the other way around.
For example, “fun” becomes “joy”:
$newAnalyzer->oneWaySynonyms([
    'ipod' => ['i-pod', 'i pod'],
]);

$newAnalyzer->synonyms([
    ['joy', ['fun']],
]);

"It’s" "kind" "of" "fun" "to" "do" "the" "impossible"
---------------------
Synonyms "fun" -> "joy"
---------------------
"It’s" "kind" "of"
-"fun"
+"joy"
"to" "do" "the" "impossible"
Two-Way Synonyms
The Two-Way Synonyms filter treats the listed words as interchangeable: each word also matches the other.
For example, “joy” is added alongside “fun”, and “fun” would likewise be added alongside “joy”:
$newAnalyzer->synonyms([
    ['joy', 'fun'],
]);

"It’s" "kind" "of" "fun" "to" "do" "the" "impossible"
---------------------
Synonyms "fun", "joy"
---------------------
"It’s" "kind" "of" "fun"
+"joy"
"to" "do" "the" "impossible"
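The difference between the one-way and two-way diagrams can be sketched in plain PHP. This only models what the diagrams show and is not the library's implementation: `oneWay` and `twoWay` are hypothetical helpers, and token positions are not modeled.

```php
<?php

/**
 * One-way: a token listed as a variant is replaced by its target.
 *
 * @param string[]             $tokens
 * @param array<string,string> $variantToTarget e.g. ['fun' => 'joy']
 * @return string[]
 */
function oneWay(array $tokens, array $variantToTarget): array
{
    return array_map(
        fn (string $t): string => $variantToTarget[$t] ?? $t,
        $tokens
    );
}

/**
 * Two-way: a token that belongs to a group is kept, and the other
 * members of that group are added right after it.
 *
 * @param string[]   $tokens
 * @param string[][] $groups e.g. [['joy', 'fun']]
 * @return string[]
 */
function twoWay(array $tokens, array $groups): array
{
    $out = [];
    foreach ($tokens as $token) {
        $out[] = $token;
        foreach ($groups as $group) {
            if (!in_array($token, $group, true)) {
                continue;
            }
            foreach ($group as $member) {
                if ($member !== $token) {
                    $out[] = $member;
                }
            }
        }
    }

    return $out;
}

$tokens = ['kind', 'of', 'fun'];

$replaced = oneWay($tokens, ['fun' => 'joy']);  // "fun" is replaced by "joy"
$expanded = twoWay($tokens, [['joy', 'fun']]);  // "joy" is added next to "fun"
$reverse  = twoWay(['joy'], [['joy', 'fun']]);  // works in the other direction too
```

In the one-way case “joy” never produces “fun”; in the two-way case either word brings in the other.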
Lowercase
The Lowercase filter converts all characters in a word to lowercase.
For example, “ASAP” becomes “asap”:
$newAnalyzer->lowercase();

"You" "better" "be" "back" "ASAP"
---------------------
Lowercase
---------------------
-"You"
+"you"
"better" "be" "back"
-"ASAP"
+"asap"
Uppercase
The Uppercase filter converts all characters in a word to uppercase.
For example, “Miserable” becomes “MISERABLE”:
$newAnalyzer->uppercase();

"Miserable" "darling" "as" "usual" "perfectly" "wretched"
---------------------
Uppercase
---------------------
-"Miserable"
-"darling"
-"as"
-"usual"
-"perfectly"
-"wretched"
+"MISERABLE"
+"DARLING"
+"AS"
+"USUAL"
+"PERFECTLY"
+"WRETCHED"
Decimal Digit
The Decimal Digit filter converts non-ASCII digits to their ASCII equivalents.
For example, the Lao digits ໑ to ໕ are converted to the ASCII digits 1 to 5:
$newAnalyzer->decimalDigit();

// Lao digits from 1 to 5
"໑" "໒" "໓" "໔" "໕"
---------------------
Decimal Digit
---------------------
-"໑"
-"໒"
-"໓"
-"໔"
-"໕"
+"1"
+"2"
+"3"
+"4"
+"5"
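To make the conversion concrete, here is a small plain-PHP sketch that handles Lao digits only. It is an illustration of the idea, not how the filter is implemented; `foldLaoDigits` and `LAO_TO_ASCII` are hypothetical names, and a real filter would cover every Unicode decimal digit rather than one script.

```php
<?php

// Lao digits occupy the code points U+0ED0 (zero) through U+0ED9 (nine).
const LAO_TO_ASCII = [
    "\u{0ED0}" => '0',
    "\u{0ED1}" => '1',
    "\u{0ED2}" => '2',
    "\u{0ED3}" => '3',
    "\u{0ED4}" => '4',
    "\u{0ED5}" => '5',
    "\u{0ED6}" => '6',
    "\u{0ED7}" => '7',
    "\u{0ED8}" => '8',
    "\u{0ED9}" => '9',
];

/**
 * Replace every Lao digit in the token with its ASCII counterpart.
 * Other characters are left untouched.
 */
function foldLaoDigits(string $token): string
{
    return strtr($token, LAO_TO_ASCII);
}

// The five tokens from the diagram, written as code point escapes.
$tokens = ["\u{0ED1}", "\u{0ED2}", "\u{0ED3}", "\u{0ED4}", "\u{0ED5}"];
$ascii  = array_map('foldLaoDigits', $tokens);
```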
Ascii Folding
The Ascii Folding filter removes diacritics from characters.
For example, “manténgase” becomes “mantengase”:
$newAnalyzer->asciiFolding();

"Por" "favor" "manténgase" "alejado" "de" "las" "puertas"
---------------------
Ascii Folding
---------------------
"Por" "favor"
-"manténgase"
+"mantengase"
"alejado" "de" "las" "puertas"
Token Limit
The Token Limit filter limits the number of tokens in a phrase.
For example, only the first five words are kept in the phrase:
$newAnalyzer->tokenLimit(maxTokenCount: 5);

"I" "was" "hiding" "under" "your" "porch" "because" "I" "love" "you"
---------------------
Token Limit 5
---------------------
"I" "was" "hiding" "under" "your"
-"porch"
-"because"
-"I"
-"love"
-"you"
Truncate
The Truncate filter limits the length of a word.
For example, “Supercalifragilisticexpialidocious” becomes “Supercalif”:
$newAnalyzer->truncate(length: 10);

"Supercalifragilisticexpialidocious"
---------------------
Truncate 10
---------------------
-"Supercalifragilisticexpialidocious"
+"Supercalif"
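Token Limit and Truncate are easy to confuse: the first caps how many tokens there are, the second caps how long each token is. A minimal plain-PHP sketch of the two behaviours, using the hypothetical helpers `limitTokens` and `truncateTokens` (not part of the library):

```php
<?php

/**
 * Token Limit acts on the token list: keep at most $max tokens.
 *
 * @param string[] $tokens
 * @return string[]
 */
function limitTokens(array $tokens, int $max): array
{
    return array_slice($tokens, 0, $max);
}

/**
 * Truncate acts on each token: keep at most $length characters.
 * substr() counts bytes, which equals characters for the ASCII input
 * used here; multibyte text would need mb_substr() instead.
 *
 * @param string[] $tokens
 * @return string[]
 */
function truncateTokens(array $tokens, int $length): array
{
    return array_map(
        fn (string $t): string => substr($t, 0, $length),
        $tokens
    );
}

$phrase = ['I', 'was', 'hiding', 'under', 'your', 'porch', 'because', 'I', 'love', 'you'];

$limited   = limitTokens($phrase, 5);
$truncated = truncateTokens(['Supercalifragilisticexpialidocious'], 10);
```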
Keywords
The Keywords filter prevents certain words from being modified by other filters.
For example, “going” is not stemmed to “go”:
$newAnalyzer
    ->keywords(['going'])
    ->stemming([
        ['go', ['going']],
    ]);

"Where" "are" "you" "going"
------------------------
Keywords "going"
------------------------
Stemming "going" -> "go"
------------------------
"Where" "are" "you" "going"
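In the example above, keywords is called before stemming, so the token is already protected when the stemming rule runs. The effect can be modeled in plain PHP as follows. This is a sketch under that assumption, with a hypothetical `stemExcept` helper, and it does not reflect how the library marks keywords internally.

```php
<?php

/**
 * Dictionary stemming that skips protected keywords.
 *
 * @param string[]             $tokens
 * @param array<string,string> $stems    variant => root, e.g. ['going' => 'go']
 * @param string[]             $keywords tokens that must stay unchanged
 * @return string[]
 */
function stemExcept(array $tokens, array $stems, array $keywords): array
{
    return array_map(
        function (string $token) use ($stems, $keywords): string {
            if (in_array($token, $keywords, true)) {
                return $token; // protected keyword: leave as is
            }

            return $stems[$token] ?? $token;
        },
        $tokens
    );
}

$tokens = ['Where', 'are', 'you', 'going'];

// Without keywords, "going" is stemmed to "go".
$stemmed = stemExcept($tokens, ['going' => 'go'], []);

// With "going" marked as a keyword, the stemming rule is skipped for it.
$protected = stemExcept($tokens, ['going' => 'go'], ['going']);
```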
Registering Custom Token Filters
You can register your own custom token filters with the TokenFilter::filterMap method. It accepts an associative array whose keys are the names of the custom filters and whose values are the corresponding class names. Here is an example:
TokenFilter::filterMap([
    'skroutz_greeklish' => SkroutzGreeklish::class,
    'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);
In this example, two custom token filters are registered: skroutz_greeklish and skroutz_stem_greek. The SkroutzGreeklish and SkroutzGreekStemmer classes define the behavior of these filters.