Monday, 9 May 2016

Notes from Elasticsearch - The Definitive Guide (continued, part 2)

I started reading Elasticsearch - The Definitive Guide a few weeks ago, and I am working on an Elasticsearch client for golang.
The following are notes I took while reading the book.

Aggregations (Part IV):

Chapter 25: High-level concepts - examples
Aggregations in Elasticsearch are based on ‘buckets’, which are collections of documents that meet certain criteria (equivalent to grouping in SQL), and ‘metrics’, which are statistics calculated on the documents in a bucket (equivalent to count, avg, sum in SQL).
Buckets can be nested in other buckets, and there is a variety of them. Elasticsearch allows you to partition documents in many different ways (e.g. by hour, by most popular terms, by geographical location).
An aggregation is a combination of buckets and metrics, and because buckets can be nested inside other buckets, we can create very complex aggregations. For instance, to calculate the average salary for a combination of country, gender and age range:
  1. Partition documents by country (bucket),
  2. Partition each country bucket by gender (bucket),
  3. Partition each gender bucket by age range (bucket),
  4. Calculate the average salary for each age range (metric)
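
The four steps above can be sketched as a request body. This is only an illustration: the field names (country, gender, age, salary), the aggregation names and the age boundaries are my own assumptions, not taken from the book.

```python
import json

# Hypothetical field names and age boundaries, chosen for illustration only.
query = {
    "size": 0,  # return aggregation results only, no search hits
    "aggs": {
        "by_country": {                      # step 1: bucket per country
            "terms": {"field": "country"},
            "aggs": {
                "by_gender": {               # step 2: bucket per gender
                    "terms": {"field": "gender"},
                    "aggs": {
                        "by_age_range": {    # step 3: bucket per age range
                            "range": {
                                "field": "age",
                                "ranges": [
                                    {"to": 30},
                                    {"from": 30, "to": 50},
                                    {"from": 50},
                                ],
                            },
                            "aggs": {        # step 4: metric per age range
                                "avg_salary": {"avg": {"field": "salary"}}
                            },
                        }
                    },
                }
            },
        }
    },
}

print(json.dumps(query, indent=2))
```

Each nested ‘aggs’ key adds one level of partitioning, and the metric sits at the innermost level.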

Chapter 26: Aggregation Test-drive - examples
The ‘terms’ bucket in an aggregation query is a type of bucket definition that creates a new bucket for each unique term it encounters. In the result of this query, each bucket key corresponds to a term value.
An additional ‘aggs’ level can be nested inside another one in order to nest metrics. For example, to a first ‘count by colour’ aggregation we can add an ‘avg’ metric to calculate the average of the values of the ‘price’ field.
In addition to nesting metrics inside buckets, we can nest buckets inside other buckets.
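
To get a feel for what a terms bucket with a nested avg metric returns, here is a toy emulation in plain Python. It does not talk to Elasticsearch; the documents are made up, and the function name terms_with_avg is hypothetical.

```python
from collections import defaultdict

# Made-up sample documents, for illustration only.
docs = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "red", "price": 30000},
]


def terms_with_avg(docs, bucket_field, metric_field):
    """Group docs by bucket_field, then compute a count and an average per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    # Mirror the default ordering: largest doc_count first.
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=True)


buckets = terms_with_avg(docs, "color", "price")
for bucket in buckets:
    print(bucket)
```

One bucket is created per unique colour, the bucket key is the term itself, and the metric is computed only over the documents in that bucket.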

Chapter 28: Building Bar Charts - examples
The ‘histogram’ bucket is essential for bar charts. It works by specifying an interval and a numeric field (e.g. price) to calculate buckets on. The interval defines how wide each bucket will be; for instance, if it is set to 10 then a new bucket is created every 10 units. In the response to such an aggregation, the histogram keys correspond to the lower boundary of each bucket.
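
The way a value is mapped to a bucket key can be sketched in a few lines of Python. This is a local illustration with made-up values, assuming no offset is configured; the function names are hypothetical.

```python
import math
from collections import Counter


def histogram_key(value, interval):
    """Return the lower boundary of the bucket that value falls into."""
    return math.floor(value / interval) * interval


def histogram(values, interval):
    """Count values per bucket, keyed by each bucket's lower boundary."""
    counts = Counter(histogram_key(v, interval) for v in values)
    return dict(sorted(counts.items()))


prices = [3, 9, 10, 14, 27, 29]  # made-up values
print(histogram(prices, 10))
```

With an interval of 10, the values 10 and 14 both land in the bucket whose key is 10, which is the lower boundary of the range 10 to 20.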

Chapter 29: Looking at time - examples
The second most popular activity in Elasticsearch is building date histograms. Timestamps exist in many kinds of data, and on top of them we can build metrics that are expressed over time. Examples of time-based questions: how many cars were sold each month this year, what was the price of this stock over the last 12 hours.
The date_histogram bucket works similarly to the histogram bucket, but instead of building buckets based on a numeric field it is calendar-aware and uses time ranges. Each bucket is defined as a certain calendar size (e.g. a month).
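
A monthly date histogram could look like the request body below. The field name sold and the aggregation name are assumptions; the ‘interval’ parameter follows the Elasticsearch version covered by the book, and later releases may spell it differently (e.g. calendar_interval).

```python
import json

query = {
    "size": 0,  # aggregation results only
    "aggs": {
        "sales_per_month": {            # hypothetical aggregation name
            "date_histogram": {
                "field": "sold",        # hypothetical date field
                "interval": "month",    # one bucket per calendar month
                "format": "yyyy-MM-dd", # how bucket keys are rendered as strings
            }
        }
    },
}

print(json.dumps(query, indent=2))
```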

Chapter 30: Scoping Aggregations - examples
By default, when no query is specified in an aggregation request, Elasticsearch runs the aggregation on all documents. In fact, aggregations operate in the scope of the query, and if there is no query then the scope is a ‘match_all’ query.
Omitting ‘search_type=count’ from the aggregation URL forces the search hits to be returned, so we see both the search results and the aggregation results.
We can use the ‘global’ bucket to bypass the scope of the query and aggregate over all documents.
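
As a sketch of the difference in scope, the request body below computes the same metric twice: once inside the query scope and once inside a global bucket. The field names (make, price) and the aggregation names are assumptions.

```python
import json

avg_price = {"avg": {"field": "price"}}  # hypothetical numeric field

query = {
    "size": 0,
    "query": {"match": {"make": "ford"}},  # hypothetical field and value
    "aggs": {
        # Scoped: average over the documents matching the query only.
        "scoped_avg_price": avg_price,
        # Global: the empty 'global' bucket ignores the query scope,
        # so the nested metric is computed over all documents.
        "all": {
            "global": {},
            "aggs": {"avg_price": avg_price},
        },
    },
}

print(json.dumps(query, indent=2))
```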

Chapter 31: Filtering Queries and Aggregations - examples
Because an aggregation operates in the scope of the query, any filter added to the query is also applied to the aggregation.
We can use a ‘filter’ bucket so that only documents matching the filter (e.g. now - 1 month) are added to the bucket. When using a filter bucket, all nested buckets and metrics inherit the filter.
The post filter is a top-level search parameter. It is executed after the search query to filter the results (i.e. the search hits), but it affects neither the query scope nor the aggregations. Thus it doesn’t affect the categorical facets. Note that for performance reasons, post_filter should only be used in combination with aggregations, and only when differential filtering is needed. Recap:
  • filtered query affects both search results and aggregations
  • filter bucket: affects only aggregations
  • post_filter: affects only search results.
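
The recap can be sketched in a single request body that combines a query, a filter bucket and a post_filter. All field names (make, color, sold, price) and aggregation names here are assumptions for illustration.

```python
import json

query = {
    # The query defines the scope for both the hits and the aggregations.
    "query": {"match": {"make": "ford"}},
    # post_filter narrows the returned hits only; aggregations ignore it.
    "post_filter": {"term": {"color": "green"}},
    "aggs": {
        # Still counts every colour matched by the query, not just green.
        "all_colors": {"terms": {"field": "color"}},
        # Filter bucket: only documents sold in the last month enter it,
        # and the nested metric inherits that filter.
        "recent_sales": {
            "filter": {"range": {"sold": {"gte": "now-1M"}}},
            "aggs": {"average_price": {"avg": {"field": "price"}}},
        },
    },
}

print(json.dumps(query, indent=2))
```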

Chapter 32: Sorting multi-value buckets - examples
By default, Elasticsearch sorts aggregation buckets by doc_count in descending order. Elasticsearch provides several ways to customise the sorting:
1. Intrinsic sorts: operate on data generated by the bucket (e.g. doc_count). They use the ‘order’ object, which can take one of these values: _count (sort by the bucket’s document count), _term (sort alphabetically by the string value of a term, works only with terms buckets), _key (sort by the numeric value of the bucket’s key, works only with histogram and date_histogram buckets).
2. Sorting by a metric: set the sort order with any single-value metric by referencing its name. It is also possible to use multi-value metrics (e.g. extended_stats) by using a dot-path to the metric of interest.
3. Sorting based on a metric in deeper nested buckets (my_bucket>another_bucket>metric): the buckets in the path must be single-value buckets (e.g. filter, global). Multi-value buckets (e.g. terms) generate many dynamic buckets, which makes it impossible to specify a deterministic path.
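
The three sorting styles above can be sketched as aggregation definitions. The field names (color, price), the aggregation names and the interval are assumptions; the ‘order’ syntax follows the Elasticsearch version covered by the book.

```python
import json

# 1. Intrinsic sort: ascending document count.
intrinsic = {"terms": {"field": "color", "order": {"_count": "asc"}}}

# 2. Sort by a metric: dot-path into a multi-value metric (extended_stats).
by_metric = {
    "terms": {"field": "color", "order": {"stats.variance": "asc"}},
    "aggs": {"stats": {"extended_stats": {"field": "price"}}},
}

# 3. Sort by a metric in a deeper bucket: the path goes through a
#    single-value bucket (a filter), then into the metric.
by_nested_metric = {
    "histogram": {
        "field": "price",
        "interval": 20000,
        "order": {"red_green_cars>stats.variance": "asc"},
    },
    "aggs": {
        "red_green_cars": {
            "filter": {"terms": {"color": ["red", "green"]}},
            "aggs": {"stats": {"extended_stats": {"field": "price"}}},
        }
    },
}

for name, agg in [("intrinsic", intrinsic), ("by_metric", by_metric),
                  ("by_nested_metric", by_nested_metric)]:
    print(name, json.dumps(agg))
```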

Chapter 33: Approximate Aggregations - examples
Simple operations like ‘max’ scale linearly with the number of machines in the Elasticsearch cluster. They don’t need coordination between the machines (i.e. no data movement over the network) and their memory footprint is very small (for the sum function, all we need to keep is a single number). In contrast, more complex operations need algorithms that make tradeoffs between performance and memory utilisation.
Elasticsearch supports two approximate algorithms, ‘cardinality’ and ‘percentiles’, which are fast and provide accurate, but not exact, results.

Cardinality is the approximate equivalent of a distinct count: it counts the unique values of a field and is based on the HyperLogLog (HLL) algorithm. This algorithm has a configurable precision (through the ‘precision_threshold’ parameter, which accepts values from 0 to 40,000) that determines how much memory is used. If the field cardinality is below the threshold, then the returned cardinality is almost always 100% accurate.
On very large datasets, calculating hashes at query time can be painful; to speed up the cardinality calculation we can instead compute the hashes at index time.
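
A small helper can build the cardinality aggregation and check the threshold against the range quoted above. The helper name cardinality_agg, its default of 100 and the field name color are my own assumptions, not part of any client library.

```python
import json

MAX_PRECISION_THRESHOLD = 40000  # upper bound quoted in the notes above


def cardinality_agg(field, precision_threshold=100):
    """Build a cardinality aggregation, rejecting out-of-range thresholds."""
    if not 0 <= precision_threshold <= MAX_PRECISION_THRESHOLD:
        raise ValueError(
            f"precision_threshold must be between 0 and {MAX_PRECISION_THRESHOLD}"
        )
    return {
        "cardinality": {
            "field": field,
            "precision_threshold": precision_threshold,
        }
    }


query = {
    "size": 0,
    "aggs": {"distinct_colors": cardinality_agg("color", 100)},
}

print(json.dumps(query, indent=2))
```

A higher threshold trades memory for accuracy on fields with many distinct values.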

Percentiles is the other approximate algorithm provided by Elasticsearch. It shows the point at which a certain percentage of the values occur. For instance, the 95th percentile is the value that is greater than 95% of the data. Percentiles are often used to quickly eyeball the distribution of data, check for skew or bimodalities, and find outliers. By default, the percentiles metric returns an array of predefined percentiles: 1, 5, 25, 50, 75, 95, 99.
A companion metric is ‘percentile_ranks’, which returns, for a given value, the percentile it belongs to. For example: if the 50th percentile is 119ms, then the percentile rank of 119ms is 50.
The percentiles metric is based on Ted Dunning’s TDigest algorithm (paper Computing Accurate Quantiles using T-Digest).
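
To illustrate how percentiles and percentile ranks mirror each other, here is an exact computation in plain Python using the nearest-rank definition. This is not T-Digest and not what Elasticsearch runs; the data and the function names are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need a non-empty list and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def percentile_rank(values, value):
    """Percentage of the data that is at or below the given value."""
    at_or_below = sum(1 for v in values if v <= value)
    return 100 * at_or_below / len(values)


latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up latencies

median = percentile(latencies_ms, 50)
print(median, percentile_rank(latencies_ms, median))
```

Asking for the 50th percentile gives a value, and asking for the rank of that value gives back 50, which is the relationship described above.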

Chapter 34: Significant Terms - examples
Significant terms is an aggregation used for detecting anomalies. It is about finding uncommonly common patterns, i.e. things that suddenly become very common while in the past they were uncommon. For instance, when analysing logs we may be interested in finding servers that throw a certain type of error more often than they should.

An example of how to use this to recommend .. is to analyse the group of people enjoying a certain .. (the foreground group) and determine which .. are most popular, then construct a list of popular .. for everyone (the background group). Comparing the two lists, the statistical anomalies are the .. that are over-represented in the foreground compared to the background.
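
The foreground versus background comparison can be sketched with a deliberately naive score: how much more frequent a term is in the foreground than in the background. This is a simplification of my own, not the scoring Elasticsearch actually uses, and the counts and item names are made up.

```python
def overrepresentation(fg_count, fg_total, bg_count, bg_total):
    """Ratio of foreground frequency to background frequency (naive score)."""
    # (fg_count / fg_total) / (bg_count / bg_total), written as one division.
    return (fg_count * bg_total) / (fg_total * bg_count)


# Made-up counts: how many documents in each group contain each term.
fg_total, bg_total = 50, 1000
foreground = {"item_a": 40, "item_b": 45}
background = {"item_a": 100, "item_b": 900}

scores = {
    term: overrepresentation(foreground[term], fg_total, background[term], bg_total)
    for term in foreground
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Here item_b is the most popular term in the foreground, but it is just as popular everywhere else, so item_a is the one that stands out as significant.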

to be continued..
