Friday, January 13, 2017

Notes from Elasticsearch - The Definitive Guide (continued, part 3)

I started reading Elasticsearch - The Definitive Guide a few weeks ago, and began working on an Elasticsearch client for golang.
The following are notes I took while reading the book.

Aggregations (Part IV):

Chapter 35: Controlling memory use and latency
Aggregation queries rely on a data structure called “fielddata”, because inverted indices are not efficient at finding which unique terms exist in a single document. Understanding how fielddata works is important, as it is the primary consumer of memory in an Elasticsearch cluster.
Aggregations and Analysis: the “terms” bucket operates on string fields that may be analyzed or not_analyzed. For instance, running a terms aggregation on documents that hold state names (e.g. New York) will create a bucket for each token (e.g. new, york) and not one for each state name, as the field is analyzed by default. To fix this unwanted behaviour, the field should be explicitly mapped as not_analyzed when the index is first created.
Furthermore, the ngram analysis process can create a lot of tokens, which is memory-unfriendly.
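To make the effect of analysis on bucketing concrete, here is a minimal Python sketch. It does not talk to Elasticsearch: simple_analyze is a hypothetical stand-in for an analyzed field (lowercase, split on whitespace), keep_whole stands in for a not_analyzed field, ngrams is a naive character n-gram generator, and the state names are made-up sample data.

```python
from collections import Counter


def simple_analyze(text):
    """Hypothetical stand-in for an analyzed string field: lowercase, split on whitespace."""
    return text.lower().split()


def keep_whole(text):
    """Stand-in for a not_analyzed field: the whole value is a single term."""
    return [text]


def terms_buckets(values, analyzer):
    """Count documents per term, roughly what a terms aggregation reports."""
    buckets = Counter()
    for value in values:
        # A document counts once per distinct term it contains.
        buckets.update(set(analyzer(value)))
    return buckets


def ngrams(token, min_gram, max_gram):
    """Naive character n-grams of a single token."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]


states = ["New York", "New York", "New Jersey", "New Mexico"]  # made-up sample
analyzed = terms_buckets(states, simple_analyze)
not_analyzed = terms_buckets(states, keep_whole)
york_grams = ngrams("york", 1, 3)
```

With this sample, the analyzed variant produces a "new" bucket covering all four documents, while the not_analyzed variant produces one bucket per state name. The single token "york" expands to nine n-grams of length 1 to 3, which hints at how quickly ngram analysis multiplies the terms fielddata has to hold.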
Choosing the right heap size significantly impacts the performance of fielddata, and thus of Elasticsearch. The value can be set with the $ES_HEAP_SIZE env variable:
Choose no more than half the available memory, leaving the other half for Lucene, which relies on filesystem caches managed by the kernel
Stay below 32GB so the JVM can use compressed pointers and save memory; a larger heap forces the JVM to use pointers twice the size and makes garbage collection more expensive.
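The two rules above can be combined into a small calculation. This is only a sketch: recommended_heap_gb is a hypothetical helper, and the default ceiling of 31 GB is my own assumption for "a little under 32GB", not a value taken from the book.

```python
def recommended_heap_gb(total_ram_gb, ceiling_gb=31):
    """Half of the machine's RAM, capped at a ceiling kept under 32 GB.

    ceiling_gb=31 is an assumed safety margin, not an official value.
    """
    return min(total_ram_gb // 2, ceiling_gb)


def heap_env_line(total_ram_gb):
    """Format the result as an ES_HEAP_SIZE assignment."""
    return f"ES_HEAP_SIZE={recommended_heap_gb(total_ram_gb)}g"


small_box = heap_env_line(16)
large_box = heap_env_line(128)
```

A 16 GB machine ends up with an 8 GB heap, while a 128 GB machine is held at the ceiling and leaves the rest of its memory to the filesystem cache.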
To control the amount of memory allocated to fielddata, set ‘indices.fielddata.cache.size’ to a percentage of the heap or a concrete value (e.g. 40%, 5gb) in config/elasticsearch.yml. By default, this value is unbounded, which means ES will never evict data from fielddata.
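To see how the two forms of the setting relate, the sketch below converts a value such as 40% or 5gb into bytes for a given heap. Elasticsearch does this parsing itself; resolve_size is a hypothetical helper written only for illustration, and it assumes binary units (1gb = 1024^3 bytes).

```python
def resolve_size(setting, heap_bytes):
    """Turn a setting such as '40%' or '5gb' into a byte count (illustrative only)."""
    units = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}  # assumed binary units
    value = setting.strip().lower()
    if value.endswith("%"):
        # Percentages are relative to the heap size.
        return int(heap_bytes * float(value[:-1]) / 100)
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * factor)
    raise ValueError(f"unsupported size: {setting!r}")


heap = 10 * 1024 ** 3  # a hypothetical 10 GB heap
as_percentage = resolve_size("40%", heap)
as_absolute = resolve_size("5gb", heap)
```

On this hypothetical 10 GB heap, 40% is slightly less memory than the fixed 5gb value, and a percentage follows the heap if its size later changes.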
Fielddata usage can be monitored, broken down by field, per index, per node, or per index per node (a high number of evictions, for example, may indicate poor performance):
GET /_stats/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?level=indices&fields=*
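Once one of these responses has been fetched, the per-field numbers can be summed up in a few lines. The sketch below works on an already-parsed nodes-stats response; the response layout (nodes, then indices.fielddata with evictions and a fields map of memory_size_in_bytes) is my understanding of the format and should be checked against your version, and the sample data is entirely made up.

```python
def fielddata_by_field(nodes_stats):
    """Sum fielddata memory per field and total evictions across all nodes."""
    totals = {}
    evictions = 0
    for node in nodes_stats.get("nodes", {}).values():
        fielddata = node.get("indices", {}).get("fielddata", {})
        evictions += fielddata.get("evictions", 0)
        for field, stats in fielddata.get("fields", {}).items():
            totals[field] = totals.get(field, 0) + stats.get("memory_size_in_bytes", 0)
    return totals, evictions


# Hypothetical response, shaped like the assumed nodes-stats output.
sample_response = {
    "nodes": {
        "node_a": {"indices": {"fielddata": {
            "memory_size_in_bytes": 1500,
            "evictions": 0,
            "fields": {
                "tag": {"memory_size_in_bytes": 1000},
                "title": {"memory_size_in_bytes": 500},
            },
        }}},
        "node_b": {"indices": {"fielddata": {
            "memory_size_in_bytes": 2000,
            "evictions": 3,
            "fields": {"tag": {"memory_size_in_bytes": 2000}},
        }}},
    }
}

per_field, total_evictions = fielddata_by_field(sample_response)
```

Sorting per_field by value gives a quick view of which fields dominate fielddata memory across the cluster.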
To avoid an OutOfMemoryError when loading more data into fielddata, ES uses a circuit breaker that estimates the memory a query requires before loading any data. ES has several circuit breakers to ensure memory limits are not exceeded:
indices.breaker.fielddata.limit: limits the size of fielddata, by default to 60% of the heap
indices.breaker.request.limit: estimates the size of the structures required to complete other parts of a request, by default limited to 40% of the heap
indices.breaker.total.limit: wraps the request and fielddata circuit breakers, by default ensuring that their combination does not exceed 70% of the heap
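The interplay between these limits can be sketched as a simple check performed before loading. This is a simplified model and not Elasticsearch's actual implementation: BreakerTripped and check_fielddata_load are hypothetical names, and the percentages are the defaults listed above.

```python
DEFAULT_LIMITS_PCT = {"fielddata": 60, "request": 40, "total": 70}


class BreakerTripped(Exception):
    """Hypothetical error raised instead of loading data that would not fit."""


def check_fielddata_load(fielddata_used, request_used, estimated_load, heap,
                         limits_pct=DEFAULT_LIMITS_PCT):
    """Return the projected fielddata usage, or raise if a limit would be crossed."""
    projected = fielddata_used + estimated_load
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if projected * 100 > limits_pct["fielddata"] * heap:
        raise BreakerTripped("fielddata limit exceeded")
    if (projected + request_used) * 100 > limits_pct["total"] * heap:
        raise BreakerTripped("total limit exceeded")
    return projected


# Units are arbitrary; a heap of 1000 makes the percentages easy to read.
projected_usage = check_fielddata_load(500, 100, 50, heap=1000)
```

The point of estimating first is that the query is rejected with an error before any memory is consumed, instead of taking the node down partway through loading.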
For instance, the circuit breaker limits can be set dynamically on a live cluster:
PUT /_cluster/settings
{"persistent": {"indices.breaker.fielddata.limit": "40%"}}
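The same call can be prepared from a script using only the Python standard library. This is a sketch under assumptions: the cluster address http://localhost:9200 is hypothetical, breaker_settings_request is a made-up helper name, and the request is only built here, not sent.

```python
import json
import urllib.request


def breaker_settings_request(limit, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request updating the fielddata breaker limit."""
    body = json.dumps({"persistent": {"indices.breaker.fielddata.limit": limit}})
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = breaker_settings_request("40%")
# To apply it against a running cluster: urllib.request.urlopen(request)
```

Because the setting is sent under "persistent", it is meant to survive a full cluster restart, unlike a "transient" setting.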

Fielddata filtering: in some cases we may want to filter out terms that fall into the less interesting long tail, so that they are never loaded. This can be done in the document mapping by filtering terms on their frequency (or on whether they match a regular expression). The trade-off is that the filtered terms are effectively ignored by anything that relies on fielddata; for many applications, the memory saved matters more than keeping a long tail of rarely useful terms.
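As an illustration of frequency-based filtering, the sketch below first simulates which terms would survive a minimum document frequency, then builds a mapping body. The simulation is my own simplification (Elasticsearch applies the filter per segment), the tag field and sample documents are hypothetical, and the mapping keys follow my recollection of the 1.x/2.x-era syntax the book covers, so check them against your version.

```python
from collections import Counter


def frequency_filter(docs, min_freq=0.0, max_freq=1.0):
    """Keep terms whose share of documents lies within [min_freq, max_freq]."""
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    total = len(docs)
    return {term for term, count in doc_freq.items()
            if min_freq <= count / total <= max_freq}


def frequency_filter_mapping(field, min_freq, min_segment_size):
    """Mapping body with a fielddata frequency filter (assumed 1.x/2.x syntax)."""
    return {"properties": {field: {
        "type": "string",
        "fielddata": {"filter": {"frequency": {
            "min": min_freq,
            "min_segment_size": min_segment_size,
        }}},
    }}}


docs = [["rock", "pop"], ["rock"], ["rock", "jazz"], ["rock", "pop"]]  # made-up tags
kept = frequency_filter(docs, min_freq=0.5)
mapping = frequency_filter_mapping("tag", min_freq=0.01, min_segment_size=500)
```

In this sample, "jazz" appears in only a quarter of the documents and is dropped at a 50% threshold, while "rock" and "pop" are kept.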