Monday, April 25, 2016

Notes from Elasticsearch - The Definitive Guide

I started reading Elasticsearch - The Definitive Guide a few weeks ago, while working on an Elasticsearch client for golang.
The following are notes I've taken while reading the book:

Chapter1: 
  • History: Lucene, Compass, Elasticsearch
  • Download and run a node, the Marvel plugin (management/monitoring), Elasticsearch vs relational DB
  • Employee directory example: create an index (db), index (store) documents, query (lite query-string & DSL), aggregations
Chapter 2: (about distribution)
  • Cluster health (green, yellow, red), creating an index with 3 shards (default is 5) and 1 replica, then scaling the number of replicas up or down, master re-election after the master fails
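For example, creating an index with explicit shard/replica settings and later scaling the number of replicas might look like this (the index name is illustrative):
PUT /blogs {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
PUT /blogs/_settings {"number_of_replicas": 2}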

Chapter 3: 
API for managing documents (create, retrieve, update, delete)
  • Document metadata (_index, _type, _id)
  • Index document: PUT /{index}/{type}/{id}, for auto-generated ids: POST /{index}/{type}/
  • Retrieve a document: GET /{index}/{type}/{id}, without metadata GET /{index}/{type}/{id}/_source, some fields: GET /{index}/{type}/{id}?_source={field1},{field2}
  • Check the existence of a document: curl -i -XHEAD http://elastic/{index}/{type}/{id}
  • Delete a document: DELETE /{index}/{type}/{id}
  • Update conflicts are handled with optimistic concurrency control: _version is used to ensure changes are applied in the correct order; to retry several times on conflict: POST /{index}/{type}/{id}/_update?retry_on_conflict=5
  • Update using scripts (in Groovy) or set an initial value (to avoid failures for a non-existing document): POST /{index}/{type}/{id}/_update -d '{"script": "ctx._source.views+=1", "upsert": {"views": 1}}'
  • Multi-GET: GET /_mget -d {"docs": [{"_index": "website", "_type": "blog", "_id": 2}, …]} or GET /{index}/{type}/_mget -d {"ids": ["2", "1"]}
  • Bulk operations (not atomic/transactional, i.e. if some fail, others may succeed): POST /_bulk -d {action: {metadata}}\n{request body}
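A minimal bulk request might look like the following (index/type names are illustrative); every line of the body, including the last one, must end with a newline:
POST /_bulk
{"index": {"_index": "website", "_type": "blog", "_id": "1"}}
{"title": "My first blog post"}
{"delete": {"_index": "website", "_type": "blog", "_id": "2"}}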

Chapter 4: 
How document management operations are executed by Elasticsearch
Chapter 5: 
Search basics (look for data sample in gist)
  • Search all types in all indices GET /_search
  • Search a type that contains a word in a field GET /_all/{type}/_search?q={field}:{word}
  • In query-string searches, + conditions (e.g. +{field}:{value}) must be satisfied, - conditions must not be satisfied, and conditions with neither prefix are optional.
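For example, to find documents by user john that mention elasticsearch (field names are illustrative; in practice the query string must be URL-encoded):
GET /_all/tweet/_search?q=+name:john +tweet:elasticsearch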

Chapter 6:
Core data types in Elasticsearch are indexed differently. To understand how Elasticsearch interpreted the indexed documents, and to avoid surprising query results (e.g. age mapped to string instead of integer), look at the mapping (i.e. schema definition) for the type and index: GET /{index}/_mapping/{type}
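The response has roughly the following shape (index, type and field names are illustrative):
{"website": {"mappings": {"blog": {"properties": {"title": {"type": "string"}, "views": {"type": "long"}, "date": {"type": "date"}}}}}}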
ES uses an inverted index, which consists of the list of unique words appearing in any document and, for each word, the list of documents it appears in.
Each document and query is passed through an analyser, which filters characters, tokenises the text, then filters the resulting tokens. ES ships with several analysers: the standard analyser (used by default), the simple analyser, the whitespace analyser and language analysers. Analysers are applied only to full-text searches and not to exact values.
To understand how documents are tokenised and stored in a given index, we can use the Analyze API and specify the analyser: GET /_analyze?analyzer=standard -d "Text to analyse". In the response, the value of each token is what will actually be stored in the index.
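For the request above, the response lists the emitted tokens, roughly:
{"tokens": [{"token": "text", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 1}, {"token": "to", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 2}, {"token": "analyse", "start_offset": 8, "end_offset": 15, "type": "<ALPHANUM>", "position": 3}]}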

Chapter 7:
Filter vs Query DSL: Elasticsearch has two DSLs which are similar but serve different purposes. The filter DSL asks a yes/no question of every document and is used for exact-value fields. The query DSL, on the other hand, asks "how well does this document match?" and assigns it a _score. In terms of performance, filters are much lighter and use caches to make future searches even faster; queries are heavier and should be used only for full-text search.
The most used filters and queries are: the term/terms, exists and bool filters; the match_all, match, multi_match (to run the same match on multiple fields) and bool queries.
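For example, a full-text match combined with an exact-value filter (field names are illustrative):
GET /_search {"query": {"filtered": {"query": {"match": {"title": "elasticsearch"}}, "filter": {"term": {"status": "published"}}}}}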

Queries can easily become very complex by combining multiple queries and filters, so Elasticsearch provides the _validate endpoint for query validation:
GET /{index}/{type}/_validate/query QUERY_BODY
Elasticsearch also provides a human-readable explanation for invalid queries: GET /{index}/{type}/_validate/query?explain QUERY_BODY
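For example, sending a malformed query (there is no 'title' query type; index/type names are illustrative) returns valid: false together with an explanation of the parse error:
GET /website/blog/_validate/query?explain -d '{"query": {"title": {"match": "hello world"}}}'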

Chapter 8: Sorting and relevance
By default, search results are sorted by relevance (i.e. the _score value) in descending order; however, for filter queries, which have no impact on the _score, it may be useful to sort by something else (e.g. date). Example of a sort query:
GET /_search {"query": {“filtered”: {“filter”: {“term”: {“user_id”: 1}}}}, “sort”: {“date”: {“order”: “desc"}}}

Chapter 10: Index Management 
A type in Elasticsearch consists of a name and a mapping (just like a database schema) that describes its fields, their data types, and how they are indexed and stored in Lucene. The JSON representation of a document is stored as-is in the _source field, which consumes disk space, so it can be a good idea to disable it when that is a concern.
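A minimal sketch of such a mapping with _source disabled (index, type and field names are illustrative):
PUT /my_index {"mappings": {"my_type": {"_source": {"enabled": false}, "properties": {"title": {"type": "string"}}}}}
Note that without _source, documents can no longer be retrieved as-is or re-indexed from Elasticsearch, so disabling it is a trade-off.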

Chapter 15: - examples
Phrase search (how to search for terms that appear in a specific order in the target documents) and proximity search with the 'slop' parameter, which gives more flexibility to the search request.
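For example, a phrase query that tolerates one out-of-order or intervening term (field name is illustrative):
GET /_search {"query": {"match_phrase": {"title": {"query": "quick fox", "slop": 1}}}}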

Chapter 16: - examples
Queries for matching parts of a term (not the whole). In many cases, it is sufficient to use a stemmer to index the root form of words, but there are cases where we need partial matching (e.g. matching a regex in not_analyzed values).
Example of such queries: the 'prefix' query works at the term level, doesn't analyse the query string before searching, and behaves like a filter (i.e. no relevance calculation). A shorter prefix means more terms have to be visited, so for better performance use longer prefixes.
Query-time search-as-you-type with match_phrase_prefix queries, and index-time search-as-you-type by indexing (edge) n-grams.
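Illustrative examples (field names are assumptions):
GET /_search {"query": {"prefix": {"postcode": "W1"}}}
GET /_search {"query": {"match_phrase_prefix": {"title": "quick brown f"}}}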

Chapter 17: Controlling relevance score - examples
Relevance scoring in Lucene (and thus Elasticsearch) is based on Term Frequency/Inverse Document Frequency (TF/IDF) and the Vector Space Model (to combine the weights of the terms of a search query), in addition to a coordination factor, field-length normalization and term/query clause boosting.
1. Boolean model: applies AND, OR and NOT conditions of the search query to find matching documents.
2. Term frequency/Inverse document frequency (TF/IDF): the matching documents then have to be sorted by relevance, which depends on the weight of the query terms appearing in those documents. The weight of a term is determined by the following factors:
  • Term frequency: how often the term appears in this document (the more often, the higher the weight). For a given term t and document d, it is the square root of the frequency, i.e. tf(t in d) = (frequency)^(1/2)
  • Inverse document frequency: how often the term appears across all documents of the collection (the more often, the lower the weight). It is calculated from the number of documents in the collection and the number of documents the term appears in: idf(t) = 1 + log(numDocs / (docFreq + 1))
  • Field-length norm: how long the field is (the shorter, the higher the weight); if a term appears in a short field (e.g. title), the content of that field is likely to be about this term. In some cases (e.g. logging) norms are not necessary (e.g. we don't care about the length of a user-agent field), and disabling them can save a lot of memory. It is calculated as the inverse square root of the number of terms in the field: norm(d) = 1 / (numTerms)^(1/2)
These factors are calculated and stored at index time; together they are used to calculate the weight of a single term in a document.
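A quick worked example with made-up numbers (taking the logarithm as the natural log, as in Lucene): if a term appears 4 times in a document, tf = (4)^(1/2) = 2; if the collection holds 100 documents and the term appears in 9 of them, idf = 1 + log(100 / 10) ≈ 3.3; if the field contains 16 terms, norm = 1 / (16)^(1/2) = 0.25.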
3. Vector space model:
A single score representing how well a document matches a query. It is calculated by first representing the search query and the document each as a vector, with one element per query term. Each element is the weight of that term, calculated with TF/IDF by default, although other techniques (e.g. Okapi BM25) can be used. Then the angle between the two vectors is measured (cosine similarity): the closer they are, the more relevant the document is to the query.
Lucene’s practical scoring function: Lucene combines multiple scoring functions:
1. Query coordination: rewards documents that contain most of the search query terms, i.e. the more query terms a document contains, the more relevant it is. Sometimes you may want to disable this factor (although most use cases for disabling query coord are handled automatically), for instance when the query contains synonyms.
2. Query-time boosting: a particular query clause can use the boost parameter to be given higher importance than clauses with a lower boost value or no boost at all. Boosting can also be applied to entire indices.
Note: not_analyzed fields have field-length norms disabled and index_options set to docs, which disables term frequencies; the IDF of each term is still taken into account.
Function score query: decay functions (linear, exp, gauss) can be used to incorporate a sliding scale (like publish_date, geo_location or price) into the _score to alter document relevance (e.g. recently published, near a given lat/lon or price point).
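A sketch of a function_score query with a gauss decay on a price field (field names and values are illustrative):
GET /_search {"query": {"function_score": {"query": {"match": {"title": "hotel"}}, "functions": [{"gauss": {"price": {"origin": "100", "scale": "20"}}}]}}}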

For some use cases of 'field_value_factor' in a function score query, directly using the value of a field (e.g. popularity) may not be appropriate (i.e. new_score = old_score * number_of_votes); in that case a modifier can be applied, for instance log1p, which changes the formula to new_score = old_score * log(1 + number_of_votes).
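A sketch of field_value_factor with the log1p modifier (field names are illustrative):
GET /_search {"query": {"function_score": {"query": {"match": {"content": "elasticsearch"}}, "field_value_factor": {"field": "votes", "modifier": "log1p"}}}}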


Notes for subsequent chapters can be found here.