Friday, May 6, 2016

Notes from Elasticsearch - The Definitive Guide (suite 1)

I started reading Elasticsearch - The Definitive Guide a few weeks ago while working on an Elasticsearch client for Go.
The following are notes I've taken while reading this book.

Dealing with Human Language (Part III):

Chapter 18: Getting started with languages - examples
Elasticsearch comes with a set of analysers for most languages (e.g. Arabic, English, Japanese, Persian, Turkish, etc.). These analysers all apply the same kinds of rules: tokenize text into words, lowercase each word, remove stopwords, and stem tokens to their root form. Additionally, an analyser may perform some language-specific transformations to make words searchable.
Language analysers can be used as is, but it is also possible to configure them, for instance by defining stem-word exclusions (e.g. preventing the word organisation from being stemmed to organ) or a custom stop-word list (e.g. keeping no and not, as they invert the meaning of the subsequent words).
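As a sketch (the index name and the word lists are made up for illustration), a configured english analyser with stem-word exclusions and custom stopwords could be declared like this:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": ["organization", "organizations"],
          "stopwords": ["a", "an", "the"]
        }
      }
    }
  }
}
```

With this configuration, organization is indexed unstemmed, and no/not are no longer removed as stopwords.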
When there are multiple documents, each with one predominant language, it is more appropriate to use a different index per language (e.g. blogs-en, blogs-fr). It is also possible to gather all the translations in the same document (e.g. title, title_br, title_es).

Chapter 19: Identifying words - examples
Elasticsearch provides a set of tokenisers to extract tokens (i.e. words) from text. Examples of tokenisers that can be used regardless of language:
  • whitespace: simply breaks text on whitespace,
  • letter: breaks text on any character which is not a letter,
  • standard: uses the Unicode Text Segmentation algorithm to find the boundaries between words,
  • uax_url_email: similar to the standard tokeniser, except that it treats emails and URLs as single tokens.
The standard tokeniser is a good starting point to recognise words in most languages and is the basis of the language-specific analysers (e.g. spanish). However, it provides limited support for Asian languages; in such situations it is better to consider the icu_tokenizer.
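The tokenisers can be tried out with the analyze API; a minimal sketch in the query-string form used at the time of the book (newer versions accept a JSON body with tokenizer and text fields instead):

```json
GET /_analyze?tokenizer=standard
You're my co-worker, joe-bloggs@foo-bar.com
```

The standard tokeniser emits You're, my, co, worker, and splits the email address apart; the uax_url_email tokeniser would keep joe-bloggs@foo-bar.com as a single token.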
The ICU plugin needs to be installed manually in order to support languages other than English:
./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION
where $VERSION must match the installed Elasticsearch version. In newer versions of Elasticsearch:
./bin/plugin install analysis-icu

For tokenisers to work well, the input text has to be clean; character filters can be added to preprocess the text before tokenization. For instance, the html_strip character filter removes HTML tags and decodes HTML entities into the corresponding Unicode characters.
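A character filter is attached to a custom analyser in the index settings; a sketch (index and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Given the input `<p>caf&eacute;</p>`, this analyser would strip the `<p>` tags and decode `&eacute;` to é before tokenizing.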

Chapter 20: Normalising tokens - examples
After text is split into tokens, the latter are normalised (e.g. lowercased) so that similar tokens become searchable. One example is removing diacritics (e.g. ´, ^ and ¨) in Western languages with the asciifolding filter, which also converts other Unicode characters into a simpler ASCII representation.
Elasticsearch compares characters at the byte level, yet the same Unicode character may have different byte representations. In this case, it is possible to use the Unicode normalisation forms (nfc, nfd, nfkc, nfkd), which convert Unicode into a standard format that is comparable at the byte level.
Lowercasing a Unicode character is not straightforward; it has to be done by case folding, which may not result in the correct spelling but does allow case-insensitive comparisons.
The asciifolding token filter has a more capable equivalent for dealing with many languages: icu_folding, which extends the transformation to non-ASCII scripts like Greek and, for instance, folds Arabic numerals to their Latin equivalents.

We can protect particular characters from being folded using a ‘UnicodeSet’, which is a kind of character class in regular expressions.
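For example, Swedish treats å, ä and ö as distinct letters that should not be folded to a and o. A custom icu_folding filter with a UnicodeSet excluding them (filter and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": ["swedish_folding", "lowercase"]
        }
      }
    }
  }
}
```

Everything outside the protected set is folded as usual; the lowercase filter still lowercases Å, Ä and Ö to their lowercase forms.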

Chapter 21: Reducing words to their root form - examples
Stemming attempts to remove the differences between inflected forms of a word (number: fox and foxes, gender: waiter and waitress, aspect: ate and eaten) in order to reduce each word to its root form. English is a weakly inflected language (i.e. we can ignore inflection in words and still get good search results), but this is not the case for all languages; some may need extra work.
Stemming may suffer from understemming and overstemming. The former is failing to reduce words with the same meaning to the same root, which results in relevant documents not being returned (reduced recall). The latter is failing to separate words with different meanings, which reduces precision (i.e. irrelevant documents are returned).
Elasticsearch has two classes of stemmers: algorithmic and dictionary stemmers.
An algorithmic stemmer applies a sequence of rules to the given word to reduce it to its root form.
A dictionary stemmer uses a dictionary mapping words to their root forms, so that it only has to look up the word to be stemmed. These stemmers are only as good as their dictionaries: word meanings change over time and the dictionary has to be updated. Also, the size of the dictionary may hurt performance, as all words, suffixes and prefixes have to be loaded into RAM. A widely used example of a dictionary format is the one used by the Hunspell spell checker.
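An algorithmic stemmer is configured as a stemmer token filter with a language parameter; a sketch (index, filter and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      }
    }
  }
}
```

Other values of the language parameter (e.g. light_english, porter2) select different algorithmic stemmers with different levels of aggressiveness.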

Chapter 22: Stopwords performance vs precision - examples
Reducing index size can be achieved by indexing fewer words. Terms to index can be divided into low-frequency terms, which appear in few documents in the index and thus have a high weight, and high-frequency terms, which appear in many documents. Frequency depends on the type of indexed documents: e.g. 'and' would be a rare word in Chinese documents. For any language there are common words (also called stop words) that may be filtered out before indexing, but this brings some limitations, e.g. no longer distinguishing between 'happy' and 'not happy'.
To speed up query performance, we should avoid the default query behaviour that uses the 'or' operator.
1. One possible option is to use the 'and' operator in the match query, like {"match": {"my_field": {"query": "the quick brown fox", "operator": "and"}}}, which is rewritten to the bool query {"bool": {"must": [{"term": {"my_field": "the"}}, {"term": {"my_field": "quick"}}, {"term": {"my_field": "brown"}}, {"term": {"my_field": "fox"}}]}}. Elasticsearch executes the query on the least frequent term first, to immediately reduce the number of explored documents.
2. Another option for enhancing performance is to use the 'minimum_should_match' parameter in the match query.
3. It is also possible to divide the terms of a search query into a low-frequency group (relevant terms, used for filtering/matching) and a high-frequency group (less relevant terms, used for scoring only). This can be achieved with the 'cutoff_frequency' query parameter, e.g. {"match": {"text": {"query": "Quick and the dead", "cutoff_frequency": 0.01}}}. This results in a combined query with a must clause on the terms quick/dead and a should clause on the terms and/the.
The parameters 'cutoff_frequency' and 'minimum_should_match' can be combined together.
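A sketch of a combined query (index and field names are made up); minimum_should_match here applies to the low-frequency clause:

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick and the dead",
        "cutoff_frequency": 0.01,
        "minimum_should_match": "75%"
      }
    }
  }
}
```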
To effectively reduce the index size, use the appropriate 'index_options' in the Mapping API request. Possible values are:
  • ‘docs’ (default for ‘not_analyzed’ string fields): store only which documents contain which terms,
  • ‘freqs’: store the ‘docs’ information plus the frequency of each term in each document,
  • ‘positions’ (default for ‘analyzed’ string fields): store the ‘docs’ and ‘freqs’ information plus the position of each term in each document,
  • ‘offsets’: store ‘docs’, ‘freqs’ and ‘positions’ plus the start and end character offsets of each term in the original string.
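A sketch of a mapping using index_options, in the string-field syntax of the book's era (index, type and field names are made up):

```json
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english",
          "index_options": "freqs"
        }
      }
    }
  }
}
```

With "freqs", phrase queries on this field are no longer possible, since term positions are not stored.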

Chapter 23: Synonyms - examples
Synonyms are used to broaden the scope of matching documents. Like stemming or partial matching, this kind of search should be combined with a query on a field containing the original text. Synonyms can be defined inline in the ‘synonyms’ settings parameter of the Index API request, or in a file by specifying a path in the ‘synonyms_path’ parameter; the path can be absolute or relative to the Elasticsearch ‘config’ directory.
Synonym expansion can be done at index time or at search time. For instance, we can replace English with the terms ‘english’ and ‘british’ at index time, then at search time query for either of these terms. If synonyms are not applied at index time, then at search time a query for ‘english’ has to be rewritten into a query for ‘english OR british’.
Synonyms are listed as comma-separated values like ‘jump,leap,hop’. It is also possible to use the ‘=>’ syntax to specify on the left-hand side a list of terms to match (e.g. ‘gb,great britain’) and on the right-hand side one or many replacements (e.g. ‘britain,england,scotland,wales’). If several rules are specified for the same left-hand side, the tokens on the right-hand sides are merged.
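A sketch of an inline synonym filter wired into a custom analyser (index, filter and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "jump,leap,hop",
            "gb,great britain => britain,england,scotland,wales"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym_filter"]
        }
      }
    }
  }
}
```

A file-based alternative would replace the "synonyms" array with "synonyms_path": "analysis/synonyms.txt" (relative to the config directory).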
Replacing synonyms can be done with one of the following options:
1. Simple expansion: any of the listed synonyms is expanded into all of the listed synonyms (e.g. ‘jump,hop,leap’). This type of expansion can be applied either at index time or at search time.
2. Simple contraction: a group of synonyms on the left-hand side is mapped to a single value on the right-hand side (e.g. ‘leap,hop => jump’). This type of expansion must be applied at both index time and query time, to ensure that query terms are mapped to the same value.
3. Genre expansion: widens the meaning of a term to be more generic. It is applied at index time with rules like the following:
  • ‘cat => cat,pet’
  • ‘kitten => kitten,cat,pet’
  • ‘dog => dog,pet’
  • ‘puppy => puppy,dog,pet’

Then, querying for ‘kitten’ returns only documents about kittens; querying for ‘cat’ returns documents about kittens and cats; and querying for ‘pet’ returns documents about kittens, cats, puppies, dogs or pets.
Synonyms and the analysis chain:
It is appropriate to place the tokeniser first, then a stemmer filter, before the synonyms filter. In that case, instead of a rule like ‘jumps,jumped,leap,leaps,leaped => jump’ we only need ‘leap => jump’.
In some cases, the synonym filter cannot simply be placed after a lowercase filter, as it may have to deal with terms like CAT or PET (positron emission tomography) that conflict with common words when lowercased. A possibility would be to:
1. put the synonym filter before the lowercase filter and specify rules in both lowercase and uppercase forms,
2. or have two synonym filters, one for case-sensitive synonyms (with rules like ‘CAT,CAT scan => cat_scan’) and another one for case-insensitive synonyms (with rules like ‘cat => cat,pet’).
Multi-word synonyms and phrase queries: using synonyms with simple expansion (i.e. rules like ‘usa,united states,u s a,united states of america’) may lead to bizarre results for phrase queries; it is more appropriate to use simple contraction (i.e. rules like ‘united states,u s a,united states of america => usa’).

Symbol synonyms: used, for instance, to avoid emoticons (like ‘:)’) being stripped away by the standard tokeniser, as they may change the meaning of a phrase. The solution is to define a mapping character filter, which ensures that emoticons are kept in the index, for instance for sentiment analysis.
Note that the mapping character filter is useful for simple replacements of exact characters; for more complex patterns, regular expressions should be used.
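A sketch of a mapping character filter that rewrites emoticons into words before tokenization (index, filter and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => emoticon_happy",
            ":( => emoticon_sad"
          ]
        }
      },
      "analyzer": {
        "my_emoticons": {
          "char_filter": ["emoticons"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The replacement tokens (emoticon_happy, emoticon_sad) survive tokenization, so "I'm :)" and "I'm :(" are distinguishable at search time.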

Chapter 24: Typos and misspellings - examples
This chapter is about fuzzy matching at query time and sounds-like matching at index time for handling misspelled words.
Fuzzy matching treats words that are fuzzily similar as if they were the same word. It is based on the Damerau-Levenshtein edit distance, i.e. the number of operations (substitution, insertion, deletion, transposition of adjacent characters) to perform on a word until it becomes equal to the target word. Elasticsearch supports a maximum edit distance (‘fuzziness’) of 2 (the default is ‘AUTO’). Two can be overkill as a fuzziness value (most misspellings are at distance 1), especially for short words (e.g. hat is at distance 2 from mad).
A fuzzy query with an edit distance of two can perform very badly and match a large number of documents. The following parameters can be used to limit the performance impact:
  1. prefix_length: the number of initial characters that will not be fuzzified, as most typos occur at the end of words,
  2. max_expansions: limits the number of fuzzy options produced.
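A sketch of a fuzzy match query using these parameters (index and field names are made up):

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE",
        "fuzziness": "AUTO",
        "prefix_length": 1,
        "max_expansions": 50
      }
    }
  }
}
```

With prefix_length set to 1, only terms starting with the same first letter are considered, which drastically reduces the number of candidate terms.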
Scoring fuzziness: fuzzy matching should not be used for scoring, but only to widen the match results (i.e. to increase recall). For example, if we have 1000 documents containing the word ‘Algeria’ and one document with the misspelled word ‘Algeia’, then the misspelled word will be considered more relevant (thanks to TF/IDF) as it appears in fewer documents.

Phonetic matching: there are plenty of algorithms for dealing with phonetic errors, most of them specialisations of the Soundex algorithm. However, they are language-specific (mostly English or German). The phonetic analysis plugin needs to be installed manually.

Similarly, phonetic matching should not be used for scoring as it is intended to increase recall. Phonetic algorithms are useful when the search result will be processed by the machine and not by humans.
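A sketch of a phonetic token filter using the double_metaphone encoder, assuming the phonetic plugin is installed (index, filter and analyser names are made up):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": ["dbl_metaphone"]
        }
      }
    }
  }
}
```

A field analysed this way would match ‘Smith’ and ‘Smythe’, since both encode to the same phonetic token; as noted above, such a field should be used to widen recall, not for scoring.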

Notes for subsequent chapters can be found here.
