Sigmie

Introduction
Analyzer
Query

Introduction

The analysis is the process that separates Elasticsearch from a traditional database. To accomplish this impressive speed when searching one word in thousands of records, work must be done before the search happens.

Every Document Text field is analyzed with its corresponding Analyzer at index time. This process is called Analysis and it consisted of the 3 steps below.

Analysis
├─ Char filters
├─ Tokenizer
├─ Token filters

Once the text passes through all the steps it has a form efficient for searching.

Analyzer

The Analyzer is responsible for performing the Analysis. It’s a group of Char filters, a Tokenizer, and Token filters.

The Document Text field either has its own Analyzer or uses the Index default one.

Let’s see an example of how this HTML Text

"<span>Some people are worth melting for</span>"

is analyzed by the below Analyzer.

Analyzer
├─ Char filters
│  ├─ Strip HTML
├─ Tokenizer
│  ├─ Whitespace
├─ Token filters
│  ├─ Lowercase
│  ├─ Stopwords

Char filter

The first step in the Analysis is to apply the configured Char Filters. In our case, the Strip HTML char filter removes all HTML from the text.

-"<span>Some people are worth melting for</span>"  
+"Some people are worth melting for"

Tokenize

After the Char Filters the resulting string is passed to the Tokenizer that split’s the text into terms called tokens.

In our example, we have the Word Boundaries tokenizer. This means that the tokenizer will produce a token every time it encounters a word boundary like this.

-"Some people are worth melting for"               
+"Some"                                             
+"people"                                           
+"are"                                              
+"worth"                                            
+"melting"                                          
+"for"

Token filters

The last step in the Analysis is to apply the Token Filters to all tokens produced by the tokenizer. Our example has the Lowercase token filter that converts all tokens to only contain lowercase letters.

-"Some"                                             
+"some"                                             
 "people"
-"are"                                              
 "worth"
 "melting"
-"for"

"some"
"people"
"worth"
"melting"

Query

Every time a query hits the Index, the query string goes through the same analysis process.

"Some people worth melting"

-"Some people worth melting"                        
+"some"                                             
+"people"                                           
+"worth"                                            
+"melting"

Once both the Query String and the Document attribute are analyzed in the same way, it’s easier for Elasticsearch to find where the incoming terms appear.

| Query Term   | Document 1  | Document 2  |
| -----------  | ----------- | ------------|
| "some"       | x           | x           |
| "people"     | x           | x           |
| "worth"      |             | x           | 
| "melting"    | x           | x           |

Analyisis

#Introduction

#Analyzer

#Char filter

#Tokenize

#Token filters