Sigmie

Token Filters
- Stemming
- Stopwords
- Unique
- Trim
- One-Way Synonym
- Two-Way Synonyms
- Lowercase
- Uppercase
- Decimal Digit
- Ascii Folding
- Token Limit
- Truncate
- Keywords
Registering Custom Token Filters

Token Filters

Filters are applied in the order that you specify them. Here are some examples of how each filter works:
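To see why the order matters, here is a minimal sketch in plain PHP. The helper functions `lowercaseTokens` and `removeStopwords` are hypothetical stand-ins written for this illustration. They are not part of Sigmie and only mimic the idea of chaining filters.

```php
<?php
// Illustrative only: plain PHP functions standing in for token filters.
// These are NOT Sigmie's implementation.

/** Lowercase every token (ASCII-only, which is enough for this sketch). */
function lowercaseTokens(array $tokens): array
{
    return array_map('strtolower', $tokens);
}

/** Drop tokens that exactly match a stopword (case-sensitive comparison). */
function removeStopwords(array $tokens, array $stopwords): array
{
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => !in_array($token, $stopwords, true)
    ));
}

$tokens = ['Ladies', 'do', 'NOT', 'start', 'fights'];

// Lowercase first: "NOT" becomes "not" and is then removed.
$lowerFirst = removeStopwords(lowercaseTokens($tokens), ['not']);

// Stopwords first: "NOT" does not match "not", so it survives
// and is only lowercased afterwards.
$stopFirst = lowercaseTokens(removeStopwords($tokens, ['not']));

print_r($lowerFirst); // ladies, do, start, fights
print_r($stopFirst);  // ladies, do, not, start, fights
```

The same two steps give different token streams depending on which one runs first, so it is worth deciding the order of your filters deliberately.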

Stemming

The Stemming filter reduces words to their root form, using the stem-to-words mappings that you pass in.

For example, “going” becomes “go”:

$newAnalyzer->stemming([
    ['go', ['going']]
    // more
]);

 "Where"
 "are"
 "you"
 "going"
 ------------------------
 Stemming "going" -> "go"
 ------------------------
 "Where"
 "are"
 "you"
-"going"                   
+"go"
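As a rough mental model, the mapping above behaves like a dictionary lookup. The sketch below is plain PHP written for this illustration. The functions `buildStemLookup` and `stemTokens` are hypothetical and do not come from Sigmie.

```php
<?php
// Illustrative sketch, not Sigmie's implementation.

/**
 * Build a word => stem lookup from [stem, [words...]] pairs,
 * the same shape that is passed to stemming() above.
 */
function buildStemLookup(array $rules): array
{
    $lookup = [];
    foreach ($rules as [$stem, $words]) {
        foreach ($words as $word) {
            $lookup[$word] = $stem;
        }
    }

    return $lookup;
}

/** Replace each token with its stem when a rule exists, otherwise keep it. */
function stemTokens(array $tokens, array $rules): array
{
    $lookup = buildStemLookup($rules);

    return array_map(
        fn (string $token): string => $lookup[$token] ?? $token,
        $tokens
    );
}

$stemmed = stemTokens(
    ['Where', 'are', 'you', 'going'],
    [['go', ['going']]]
);

print_r($stemmed); // Where, are, you, go
```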

Stopwords

The Stopwords filter removes common words that do not contribute to the meaning of a phrase.

For example, “but” and “not” are removed from the phrase:

$newAnalyzer->stopwords(['but', 'not']);

 "Ladies"
 "do"
 "not"
 "start"
 "fights"
 "but"
 "they"
 "can"
 "finish"
 "them"
 ---------------------
 Stopwords "but","not"
 ---------------------
 "Ladies"
 "do"
-"not"                  
 "start"
 "fights"
-"but"                  
 "they"
 "can"
 "finish"
 "them"

Unique

The Unique filter removes duplicate words.

For example, the second “I” is removed from the phrase:

$newAnalyzer->unique(onlyOnSamePosition: false);

 "I"
 "was"
 "hiding"
 "under"
 "your"
 "porch"
 "because"
 "I"
 "love"
 "you"
 ---------------------
 Unique
 ---------------------
 "I" 
 "was"
 "hiding"
 "under"
 "your"
 "porch"
 "because"
-"I"  
 "love"
 "you"
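The `onlyOnSamePosition` argument is easier to picture with token positions written out. The sketch below is plain PHP, not Sigmie's code. It assumes the flag behaves like the `only_on_same_position` option of Elasticsearch's unique token filter, and it models each token as a hypothetical `['term' => ..., 'position' => ...]` array.

```php
<?php
// Illustrative sketch, not Sigmie's implementation.
// The token shape ['term' => string, 'position' => int] is an assumption
// made for this example.

function uniqueTokens(array $tokens, bool $onlyOnSamePosition): array
{
    $seen = [];
    $result = [];

    foreach ($tokens as $token) {
        // With $onlyOnSamePosition, a duplicate must share term AND position.
        $key = $onlyOnSamePosition
            ? $token['position'] . ':' . $token['term']
            : $token['term'];

        if (isset($seen[$key])) {
            continue; // duplicate: drop it
        }

        $seen[$key] = true;
        $result[] = $token;
    }

    return $result;
}

$tokens = [
    ['term' => 'I', 'position' => 0],
    ['term' => 'love', 'position' => 1],
    ['term' => 'I', 'position' => 2],
];

// false: the second "I" is removed, wherever it appears.
$global = uniqueTokens($tokens, false);

// true: both "I" tokens are kept because their positions differ.
$samePosition = uniqueTokens($tokens, true);
```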

Trim

The Trim filter removes leading and trailing whitespace from words.

For example, “ never give up” becomes “never give up”:

$newAnalyzer->trim();

  "Though at times it may feel like the sky is falling around you"
  " never give up"
  " for every day is a new day"
 ---------------------
 Trim
 ---------------------
  "Though at times it may feel like the sky is falling around you"
- " never give up" 
+ "never give up" 
- " for every day is a new day" 
+ "for every day is a new day"

One-Way Synonym

The One-Way Synonym filter replaces a word with its synonym, but not the other way around.

For example, “i-pod” and “i pod” become “ipod”, and “fun” becomes “joy”:

$newAnalyzer->oneWaySynonyms([
    'ipod' => ['i-pod', 'i pod'],
]);

$newAnalyzer->synonyms([
    ['joy', ['fun']],
]);

 "It’s"
 "kind"
 "of"
 "fun"
 "to"
 "do"
 "the"
 "impossible"
 ---------------------
 Synonyms "fun" -> "joy"
 ---------------------
 "It’s"
 "kind"
 "of"
-"fun" 
+"joy" 
 "to"
 "do"
 "the"
 "impossible"

Two-Way Synonyms

The Two-Way Synonyms filter treats the listed words as equivalent, so each word also produces the others.

For example, “fun” is kept and “joy” is added alongside it; a “joy” token would likewise gain “fun”:

$newAnalyzer->synonyms([
    ['joy', 'fun'],
]);

 "It’s"
 "kind"
 "of"
 "fun"
 "to"
 "do"
 "the"
 "impossible"
 ---------------------
 Synonyms "fun", "joy"
 ---------------------
 "It’s"
 "kind"
 "of"
 "fun" 
+"joy" 
 "to"
 "do"
 "the"
 "impossible"
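The expansion shown in the diagram can be sketched in plain PHP as follows. The function `expandSynonyms` is hypothetical and written only for this illustration. It returns one list of terms per input position, which is an assumption about how to represent the result and not Sigmie's internal format.

```php
<?php
// Illustrative sketch, not Sigmie's implementation.

/**
 * Expand each token to every member of its synonym group.
 * Returns one list of terms per input position.
 */
function expandSynonyms(array $tokens, array $groups): array
{
    $expanded = [];

    foreach ($tokens as $token) {
        $terms = [$token];

        foreach ($groups as $group) {
            if (in_array($token, $group, true)) {
                // Keep the original term first, then add the other group members.
                $terms = array_values(array_unique(array_merge($terms, $group)));
                break;
            }
        }

        $expanded[] = $terms;
    }

    return $expanded;
}

$result = expandSynonyms(['kind', 'of', 'fun'], [['joy', 'fun']]);

print_r($result[2]); // fun, joy
```

Because the group is symmetric, passing `'joy'` as the token yields `joy, fun` in the same way.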

Lowercase

The Lowercase filter converts all characters in a word to lowercase.

For example, “ASAP” becomes “asap”:

$newAnalyzer->lowercase();

 "You"
 "better"
 "be"
 "back"
 "ASAP"
 ---------------------
 Lowercase
 ---------------------
-"You" 
+"you" 
 "better"
 "be"
 "back"
-"ASAP" 
+"asap"

Uppercase

The Uppercase filter converts all characters in a word to uppercase.

For example, “Miserable” becomes “MISERABLE”:

$newAnalyzer->uppercase();

 "Miserable"
 "darling"
 "as"
 "usual"
 "perfectly"
 "wretched"
  ---------------------
  Uppercase
  ---------------------
-"Miserable" 
-"darling" 
-"as" 
-"usual" 
-"perfectly" 
-"wretched" 
+"MISERABLE" 
+"DARLING" 
+"AS" 
+"USUAL" 
+"PERFECTLY" 
+"WRETCHED"

Decimal Digit

The Decimal Digit filter converts non-ASCII digits to their ASCII equivalents.

For example, the Lao digits for 1 to 5 are converted to the ASCII digits “1” to “5”:

$newAnalyzer->decimalDigit();

 // Lao Digits from 1 to 5
  "໑"
  "໒"
  "໓"
  "໔"
  "໕"
  ---------------------
  Decimal Digit
  ---------------------
- "໑" 
- "໒" 
- "໓" 
- "໔" 
- "໕" 
+ "1" 
+ "2" 
+ "3" 
+ "4" 
+ "5"

Ascii Folding

The Ascii Folding filter converts characters outside the basic ASCII range, such as letters with diacritics, to their ASCII equivalents where one exists.

For example, “manténgase” becomes “mantengase”:

$newAnalyzer->asciiFolding();

 "Por"
 "favor"
 "manténgase"
 "alejado"
 "de"
 "las"
 "puertas"
 ---------------------
 Ascii Folding
 ---------------------
  "Por"
  "favor"
- "manténgase" 
+ "mantengase"
  "alejado"
  "de"
  "las"
  "puertas"

Token Limit

The Token Limit filter limits the number of tokens in a phrase.

For example, only the first five words are kept in the phrase:

$newAnalyzer->tokenLimit(maxTokenCount: 5);

 "I"
 "was"
 "hiding"
 "under"
 "your"
 "porch"
 "because"
 "I"
 "love"
 "you"
 ---------------------
 Token Limit 5
 ---------------------
 "I"
 "was"
 "hiding"
 "under"
 "your"
-"porch" 
-"because" 
-"I"  
-"love" 
-"you"
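Conceptually, a limit of five keeps the head of the token stream and discards the rest. A minimal plain PHP sketch, using a hypothetical `limitTokens` helper that is not part of Sigmie:

```php
<?php
// Illustrative sketch, not Sigmie's implementation.

/** Keep at most $maxTokenCount tokens from the start of the stream. */
function limitTokens(array $tokens, int $maxTokenCount): array
{
    // max(0, ...) guards against a negative limit, which array_slice
    // would otherwise interpret as "stop that many items before the end".
    return array_slice($tokens, 0, max(0, $maxTokenCount));
}

$tokens = ['I', 'was', 'hiding', 'under', 'your', 'porch', 'because', 'I', 'love', 'you'];

$limited = limitTokens($tokens, 5);

print_r($limited); // I, was, hiding, under, your
```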

Truncate

The Truncate filter limits the length of a word.

For example, “Supercalifragilisticexpialidocious” becomes “Supercalif”:

$newAnalyzer->truncate(length: 10);

 "Supercalifragilisticexpialidocious"
 ---------------------
 Truncate 10
 ---------------------
-"Supercalifragilisticexpialidocious" 
+"Supercalif"

Keywords

The Keywords filter prevents certain words from being modified by other filters.

For example, “going” is not stemmed to “go”:

$newAnalyzer
    ->keywords(['going'])
    ->stemming([
        ['go', ['going']],
    ]);

"Where"
"are"
"you"
"going"
------------------------
Keywords "going"
------------------------
Stemming "going" -> "go"
------------------------
"Where"
"are"
"you"
"going"
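The interaction between the two filters can be sketched in plain PHP. The function `stemUnlessKeyword` is hypothetical and exists only for this illustration. It folds both steps into one pass, whereas a real analyzer marks keywords first and lets later filters skip them.

```php
<?php
// Illustrative sketch, not Sigmie's implementation.

/**
 * Apply a word => stem lookup to every token,
 * except tokens that are listed as protected keywords.
 */
function stemUnlessKeyword(array $tokens, array $stems, array $keywords): array
{
    return array_map(
        function (string $token) use ($stems, $keywords): string {
            if (in_array($token, $keywords, true)) {
                return $token; // protected keyword: leave untouched
            }

            return $stems[$token] ?? $token;
        },
        $tokens
    );
}

$tokens = ['Where', 'are', 'you', 'going'];
$stems = ['going' => 'go'];

$protected = stemUnlessKeyword($tokens, $stems, ['going']);
$unprotected = stemUnlessKeyword($tokens, $stems, []);

print_r($protected);   // Where, are, you, going
print_r($unprotected); // Where, are, you, go
```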

Registering Custom Token Filters

You can register your own custom token filters with the TokenFilter::filterMap method. It accepts an associative array whose keys are the names of the custom filters and whose values are the corresponding class names. For example:

TokenFilter::filterMap([
    'skroutz_greeklish' => SkroutzGreeklish::class,
    'skroutz_stem_greek' => SkroutzGreekStemmer::class,
]);

In this example, two custom token filters are registered: skroutz_greeklish and skroutz_stem_greek. The SkroutzGreeklish and SkroutzGreekStemmer classes define the behavior of these filters.
