Friday, May 13, 2016

Notes on Big Data related talks

Hadoop Summit 2016

Apache Eagle Monitor Hadoop in Real time
The talk was about Apache Eagle, a product developed at eBay to monitor activities on a Hadoop cluster from a security perspective. The talk started by describing the pillars of security in Hadoop: perimeter security, authorization & access control, discovery (e.g. classifying data according to its sensitivity), and activity monitoring. The talk focused mainly on the last part, to address info-sec questions: how many users are using the Hadoop cluster, what files are they accessing, etc. For this purpose Eagle was born, to be able to track events from different sources (e.g. accidentally deleting files from HDFS) and correlate them with user-defined policies.

Ingest and Stream Processing What will you choose
The talk was divided into two parts. The first one was about streaming patterns and how each one provides at-least-once or exactly-once message delivery.
The second part was a demo of how easily a streaming pipeline can be built with the StreamSets editor. The demo used land data from the city of San Francisco, streamed it and tried to find the land parcel with the maximum area. The generated data was then stored into two destinations: Kudu for analytics (e.g. top ten areas) and a Kafka topic whose events were used for rendering in Minecraft (which was pretty neat).

Real time Search on Terabytes of Data Per Day Lessons Learned
Lessons learned by the platform engineering team at Rocana (an Ops monitoring software vendor) while building a search engine on HDFS. They described the context and the amount of data they deal with on a daily basis (terabytes). Then, they talked about their initial use of SolrCloud as an enabler for their platform, how they struggled to scale it, and how they finally decided to build their own search engine based on Lucene with indexes stored on HDFS. The rest of the talk was about the architecture of this time-oriented search engine. In the Q&A, one question was about Elasticsearch: they didn't really test it but rather relied on an analysis made by the author of Jepsen (a tool for analysing distributed systems).

Spark Summit East 2016

Spark Performance: What's Next
The talk started with an observation on how I/O speed, network throughput and CPU speed have evolved since the Spark project started in 2010: the first two increased by roughly 10x while CPU clock speed remains stuck around 3GHz. The first attempt at CPU and memory optimization was project Tungsten. Then, the speaker described the two phases of performance enhancement:
  • Phase 1 (Spark 1.4 to 1.6): improved memory management by using the JVM's Unsafe API and off-heap memory instead of Java objects (which allocate unnecessary memory).
  • Phase 2 (Spark 2.0): instead of implementing operators (e.g. filter, projection, aggregation) with the Volcano iterator model, use whole-stage code generation to produce optimized code (and avoid virtual function calls), plus vectorization (i.e. a columnar in-memory representation) for efficient scans.

Then the speaker described the impact of these enhancements by comparing the performance of Spark 1.6 vs Spark 2.0 on different queries. These modifications are on master and under active development.
In the Q&A: the described techniques apply to DataFrames, since the engine has more information about the data schema, which is not the case with RDDs. With the Dataset API (which sits on top of the DataFrame API) you get the benefit of telling the engine the data schema as well as type safety (i.e. accessing items without having to cast them to their type). DataFrames give you index-based access, while Datasets give you object access.

Others

Ted Dunning on Kafka, MapR, Drill, Apache Arrow and More
Ted Dunning talked about why the Hadoop ecosystem succeeded over the NoSQL movement: a more stable API served as a standard way to build consensus in the community, while NoSQL projects tend to be isolated islands. As an example he gave the Kafka 0.9 release, which reached a new level of stability thanks to its API. He then described how well Kafka fits its goal and gave an example of a use case it would be hard to use it for: real-time tracking of shipment containers, where a dedicated Kafka topic is used to track each container, which would be hard to replicate effectively.
Then, he described MapR's approach to open source as a way to innovate in the underlying implementation while complying with a standard API (e.g. HDFS).
He also talked about Drill and how MapR is trying to involve more members of the community so that it doesn't look like the only supporter. He also talked about the in-memory movement, especially the Apache Arrow in-memory columnar format, and how it enabled the creator of pandas to build Feather, a new file format to store data frames on disk and send them over the wire with Apache Arrow without the need for serialization.

more to come.

Monday, May 9, 2016

Notes from Elasticsearch - The Definitive Guide (part 2)

I started reading Elasticsearch - The Definitive Guide a few weeks ago while working on an Elasticsearch client for golang.
Following are notes I've taken while reading this book.

Aggregations (Part IV):

Chapter 25: High-level concepts - examples
Aggregations in Elasticsearch are based on 'buckets', which are collections of documents that meet certain criteria (equivalent to grouping in SQL), and 'metrics', which are statistics calculated on the documents in a bucket (equivalent to COUNT, AVG or SUM in SQL).
Buckets can be nested in other buckets, and there is a variety of them: Elasticsearch allows you to partition documents in many different ways (e.g. by hour, by most popular terms, by geographical location).
An aggregation is a combination of buckets and metrics, and since buckets can be nested inside other buckets we can build very complex aggregations. For instance, to calculate the average salary by country, gender and age range (see the sketch after this list):
  1. Partition documents by country (bucket),
  2. Partition each country bucket by gender (bucket),
  3. Partition each gender bucket by age range (bucket),
  4. Calculate the average salary for each age range (metric)
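
A minimal sketch of what such a request could look like; the index name 'employees' and the fields 'country', 'gender', 'age' and 'salary' are made up for illustration:
GET /employees/_search?search_type=count
{
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "by_gender": {
          "terms": { "field": "gender" },
          "aggs": {
            "by_age": {
              "range": { "field": "age", "ranges": [ { "from": 20, "to": 30 }, { "from": 30, "to": 40 } ] },
              "aggs": {
                "avg_salary": { "avg": { "field": "salary" } }
              }
            }
          }
        }
      }
    }
  }
}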

Chapter 26: Aggregation Test-drive - examples
A terms bucket in an aggregation query is a type of bucket definition that creates a new bucket for each unique term it encounters. In the result of this query, a bucket key corresponds to the term value.
An additional 'aggs' level can be nested inside another one in order to nest metrics; for example, to a first count-by-colour aggregation we can add an 'avg' metric that calculates the average of the values of the 'price' field.
In addition to nesting metrics inside buckets, we can nest buckets inside other buckets.
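
As a sketch, a count-by-colour aggregation with a nested 'avg' metric on the 'price' field could look like this (the 'cars' index name is assumed):
GET /cars/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}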

Chapter 28: Building Bar Charts - examples
The 'histogram' bucket is essential for bar charts. It works by specifying an interval and a numeric field (e.g. price) to calculate buckets on. The interval defines how wide each bucket will be; for instance, if it is set to 10 then a new bucket is created every 10 units. In the response to such an aggregation, the histogram keys correspond to the lower boundary of each bucket.
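
A sketch of a histogram bucket on an assumed 'price' field with an interval of 10 (documents with prices 0-9.99 fall into the bucket with key 0, 10-19.99 into the bucket with key 10, and so on):
{
  "aggs": {
    "prices": {
      "histogram": { "field": "price", "interval": 10 }
    }
  }
}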

Chapter 29: Looking at time - examples
The second most popular activity in Elasticsearch is building date histograms. Timestamps exist in many kinds of data, and we can build on top of them metrics that are expressed over time. Examples of time-based questions: how many cars were sold each month this year, what was the price of this stock over the last 12 hours.
The date_histogram bucket works similarly to the histogram bucket, but instead of building buckets from a numeric field it is calendar-aware and uses time ranges: each bucket is defined by a certain calendar size (e.g. a month).
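
A sketch of a monthly date_histogram; the 'sold' date field is an assumption:
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}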

Chapter 30: Scoping Aggregations - examples
By default, when no query is specified, an aggregation runs over all documents. In fact, aggregations operate in the scope of the query, and if there is no query the scope is the 'match_all' query.
Omitting 'search_type=count' from the aggregation URL causes the search hits to be returned as well, so we see both the search results and the aggregation results.
We can use the global bucket to bypass the scope of the query and aggregate over all documents.
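
A sketch combining both scopes (field names are assumed): the first 'avg' runs in the scope of the query, while the copy nested under the global bucket runs over all documents:
{
  "query": { "match": { "make": "ford" } },
  "aggs": {
    "scoped_avg_price": { "avg": { "field": "price" } },
    "all_docs": {
      "global": {},
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}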

Chapter 31: Filtering Queries and Aggregations - examples
Because an aggregation operates in the scope of the query, any filter added to the query also applies to the aggregation.
We can use a filter bucket so that only documents matching the filter (e.g. sold since now - 1 month) are added to the bucket. When using a filter bucket, all nested buckets and metrics inherit the filter.
A post filter is a top-level search parameter; it is executed after the search query to filter the results (i.e. the search hits) but affects neither the query scope nor the aggregations, so it doesn't affect categorical facets. Note that for performance reasons, post_filter should only be used in combination with aggregations and only when differential filtering is needed. Recap:
  • filtered query affects both search results and aggregations
  • filter bucket: affects only aggregations
  • post_filter: affects only search results.
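
A sketch putting the three together (index and field names are assumed): the query scopes both hits and aggregations, the filter bucket narrows only the nested metric, and the post_filter trims only the returned hits:
{
  "query": { "match": { "make": "ford" } },
  "post_filter": { "term": { "color": "green" } },
  "aggs": {
    "recent_sales": {
      "filter": { "range": { "sold": { "gte": "now-1M" } } },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
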
Chapter 32: Sorting multi-value buckets - examples
By default Elasticsearch sorts the aggregation buckets by doc_count in descending order. Elasticsearch provides many ways to customise the sorting:
1. Intrinsic sorts: operate on data generated by the bucket (e.g. doc_count). They use the 'order' object, which can take one of these values: _count (sort by the bucket's document count), _term (sort alphabetically by the term's value), _key (sort by the bucket's key; works only with histogram and date_histogram buckets).
2. Sorting by a metric: set the sort order with any single-value metric by referencing its name. It is also possible to use multi-value metrics (e.g. extended_stats) by using a dot-path to the metric of interest.
3. Sorting based on a metric in deeper nested buckets (my_bucket>another_bucket>metric): this only works across single-bucket types (e.g. filter, global); multi-bucket types (e.g. terms) generate many dynamic buckets, which makes it impossible to determine a deterministic path.
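
For instance, a sketch of sorting colour buckets by a nested metric rather than by doc_count (ascending average price; field names are assumed):
{
  "aggs": {
    "colors": {
      "terms": {
        "field": "color",
        "order": { "avg_price": "asc" }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}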

Chapter 33: Approximate Aggregations - examples
Simple operations like 'max' scale linearly with the number of machines in the Elasticsearch cluster. They don't need coordination between machines (i.e. no data movement over the network) and their memory footprint is very small (for a sum, all we need to keep is an integer). On the contrary, more complex operations need algorithms that make tradeoffs between performance and memory utilisation.
Elasticsearch supports two approximate algorithms, 'cardinality' and 'percentiles', which are fast but provide an approximate result rather than an exact one.

Cardinality is the approximate version of a distinct count, i.e. counting the unique values of a field; it is based on the HyperLogLog (HLL) algorithm. This algorithm has a configurable precision (through the 'precision_threshold' parameter, which accepts values from 0 to 40,000) that impacts how much memory is used. If the field's cardinality is below the threshold, the returned cardinality is almost always exact.
To speed up the cardinality calculation on very large datasets, where computing hashes at query time can be painful, we can instead compute the hashes at index time.
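
A sketch of a cardinality aggregation with an explicit precision threshold (the 'color' field is assumed):
{
  "aggs": {
    "distinct_colors": {
      "cardinality": {
        "field": "color",
        "precision_threshold": 100
      }
    }
  }
}
For the index-time hashing approach, the book maps a murmur3 sub-field (e.g. 'color.hash') and points the cardinality aggregation at that sub-field instead; depending on the Elasticsearch version this field type may require the mapper-murmur3 plugin.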

Percentiles is the other approximation algorithm provided by Elasticsearch; it shows the point below which a certain percentage of the values fall. For instance, the 95th percentile is the value which is greater than 95% of the data. Percentiles are often used to quickly eyeball the distribution of the data, check for skew or bimodalities, and find outliers. By default, the percentiles metric returns an array of pre-defined percentiles: 1, 5, 25, 50, 75, 95, 99.
A companion metric is 'percentile_ranks', which returns, for a given value, the percentile it belongs to. For example: if the 50th percentile is 119ms, then the percentile rank of 119ms is 50.
The percentiles metric is based on Ted Dunning’s TDigest algorithm (paper Computing Accurate Quantiles using T-Digest).
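
A sketch of both metrics on an assumed 'latency' field; percentile_ranks answers the inverse question of percentiles:
{
  "aggs": {
    "load_times": {
      "percentiles": { "field": "latency" }
    },
    "load_rank": {
      "percentile_ranks": {
        "field": "latency",
        "values": [ 119 ]
      }
    }
  }
}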

Chapter 34: Significant Terms - examples
Significant terms is an aggregation used for detecting anomalies. It is about finding uncommonly common patterns, i.e. terms that have suddenly become very common while in the past they were uncommon. For instance, when analysing logs we may be interested in finding servers that throw a certain type of error more often than they should.


An example of how to use this for recommendations is to analyse the group of people enjoying a certain .. (the foreground group) and determine which .. are most popular within it, then construct the list of popular .. for everyone (the background group). Comparing the two lists, the statistical anomalies will be the .. which are over-represented in the foreground compared to the background.
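
As a hedged sketch of this recommendation use case, assuming a hypothetical index of user ratings with a 'movie' field: the query selects the foreground group (people who liked a given movie) and significant_terms surfaces the movies over-represented in that group compared to the whole index:
{
  "query": { "match": { "movie": "some movie title" } },
  "aggs": {
    "also_liked": {
      "significant_terms": { "field": "movie" }
    }
  }
}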

to be continued..

Friday, May 6, 2016

Notes from Elasticsearch - The Definitive Guide (part 1)



I started reading Elasticsearch - The Definitive Guide a few weeks ago while working on an Elasticsearch client for golang.
Following are notes I've taken while reading this book.

Dealing with Human Language (Part III):


Chapter 18: Getting started with languages - examples
Elasticsearch comes with a set of analysers for most languages (e.g. Arabic, English, Japanese, Persian, Turkish, etc.). Each of these analysers applies the same kinds of rules: tokenize the text into words, lowercase each word, remove stopwords, stem tokens to their root form. Additionally, these analysers may perform some language-specific transformations to make words searchable.
Language analysers can be used as is, but it is also possible to configure them, for instance by defining a stem word exclusion list (e.g. to prevent the word organisation from being stemmed to organ), or a custom stopword list (e.g. one that keeps no and not, as they invert the meaning of the words that follow).
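
A minimal sketch of such a configuration, creating a custom English analyzer with a stem-exclusion list and a custom stopword list (the index and analyzer names are made up):
PUT /my_blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [ "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "in", "into", "is", "it", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" ]
        }
      }
    }
  }
}
The stopword list above deliberately leaves out 'no' and 'not' so that they keep inverting the meaning of the words that follow them.
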
When there are multiple documents, each with one predominant language, it's more appropriate to use a different index per language (e.g. blogs-en, blogs-fr). It is also possible to gather all the translations in the same document (e.g. title, title_br, title_es).

Chapter 19: Identifying words - examples
Elasticsearch provides a set of tokenisers to extract tokens (i.e. words) from text. Examples of tokenisers that can be used regardless of the language:
  • whitespace: simply breaks text on whitespace,
  • letter: breaks text on any character which is not a letter,
  • standard: uses the Unicode Text Segmentation algorithm to find the boundaries between words,
  • uax_url_email: similar to the standard tokeniser except that it treats emails and URLs as single tokens.
The standard tokeniser is a good starting point for recognising words in most languages and is the basis of language-specific analysers (e.g. Spanish). However, it provides limited support for Asian languages; in that situation it's better to consider the 'icu_tokenizer'.
The ICU plugin needs to be installed manually in order to support languages other than English:
./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION
where $VERSION can be found at github.com/elasticsearch/elasticsearch-analysis-icu
or, in newer versions of Elasticsearch: ./bin/plugin install analysis-icu

For tokenisers to work well the input text has to be clean; character filters can be added to preprocess the text before tokenisation. For instance, the 'html_strip' character filter removes HTML tags and decodes entities into the corresponding Unicode characters.
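
A sketch of a custom analyzer wiring the 'html_strip' character filter in front of the standard tokenizer (names are made up), followed by an _analyze call to try it out (older Sense-style syntax with the text in the request body):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "clean_html": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=clean_html
<p>Some <a href="http://example.com">link</a> text</p>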

Chapter 20: Normalising tokens - examples
After text is split into tokens, the latter are normalised (e.g. lowercased) so that similar tokens become searchable. For instance, diacritics (e.g. ´, ^ and ¨) in Western languages can be removed with the asciifolding filter, which also converts Unicode characters into a simpler ASCII representation.
Elasticsearch compares characters at the byte level; however, the same Unicode character may have different byte representations. In this case, it is possible to use the Unicode normalisation forms (nfc, nfd, nfkc, nfkd), which convert Unicode into a standard format that is comparable at the byte level.
Lowercasing Unicode characters is not straightforward; it has to be done through case folding, which may not produce the correct spelling but does allow case-insensitive comparisons.
Similarly, the asciifolding token filter has an equivalent for dealing with many languages, icu_folding, which extends the transformation to non ASCII-based scripts like Greek and, for instance, folds Arabic numerals to their Latin equivalents.

We can protect particular characters from being folded using a 'UnicodeSet', which is a kind of character class as used in regular expressions.
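
A sketch from memory of how this looks with the ICU plugin installed: a folding filter that leaves the Swedish letters å, ä and ö untouched (index, filter and analyzer names are assumptions):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}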


Chapter 21: Reducing words to their root form - examples
Stemming attempts to remove the differences between inflected forms of a word (like number: fox and foxes, gender: waiter and waitress, aspect: ate and eaten) in order to reduce each word to its root form. English is a weakly inflected language (i.e. we can ignore inflection in words and still get good search results), but this is not the case for all languages; some need extra work.
Stemming may suffer from understemming and overstemming. The former is failing to reduce words with the same meaning to the same root, which results in relevant documents not being returned. The latter is failing to separate words with different meanings, which reduces precision (i.e. returns irrelevant documents).
Elasticsearch has two classes of stemmers: algorithmic and dictionary stemmers.
An algorithmic stemmer applies a sequence of rules to the given word to reduce it to its root form.
A dictionary stemmer uses a dictionary mapping words to their root form, so that it only has to look a word up to stem it. These stemmers are only as good as their dictionaries: word meanings change over time and the dictionary has to be updated, and the size of the dictionary may hurt performance since all words (with their suffixes and prefixes) have to be loaded into RAM. An example of a widely used dictionary format is the Hunspell spell checker's.
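
A sketch of both kinds of stemmer declared as token filters; the hunspell filter assumes an en_US dictionary has been copied under the Elasticsearch config/hunspell directory, and all names here are made up:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_hunspell": {
          "type": "hunspell",
          "language": "en_US"
        }
      }
    }
  }
}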

Chapter 22: Stopwords performance vs precision - examples
Reducing index size can be achieved by indexing fewer words. Terms to index can be divided into low-frequency terms, which appear in few documents of the index and thus carry a high weight, and high-frequency terms, which appear in many documents of the index. The frequency depends on the type of indexed documents, e.g. 'and' in Chinese documents will be a rare word. For any language there are common words (also called stopwords) that may be filtered out before indexing, but this brings some limitations, e.g. not being able to distinguish between 'happy' and 'not happy'.
To speed up query performance, we should avoid the default match query behaviour that uses the 'or' operator.
1. One possible option is to use the 'and' operator in the match query, like {"match": {"my_field": {"query": "the quick brown fox", "operator":"and"}}}, which is rewritten to a bool query {"bool":{"must":[{"term": {"my_field":"the"}}, {"term": {"my_field":"quick"}}, {"term": {"my_field":"brown"}}, {"term": {"my_field":"fox"}}]}}. Elasticsearch executes the query with the least frequent term first to immediately reduce the number of explored documents.
2. Another option for enhancing performance is to use the 'minimum_should_match' property in the match query.
3. It is possible to divide the terms of a search query into a low-frequency group (relevant terms used for filtering/matching) and a high-frequency group (less relevant terms used for scoring only). This can be achieved with the 'cutoff_frequency' query parameter, e.g. {"match": {"text": {"query": "Quick and the dead", "cutoff_frequency": 0.01}}}. The latter results in a combined "must" clause with the terms "quick/dead" and a "should" clause with the terms "and/the".
The parameters "cutoff_frequency" and "minimum_should_match" can be combined together.
To effectively reduce the index size, use the appropriate 'index_options' in a Mapping API request; possible values are:
  • 'docs' (default for 'not_analyzed' string fields): store only which documents contain which terms,
  • 'freqs': store the 'docs' information plus the frequency of each term in each document,
  • 'positions' (default for 'analyzed' string fields): store the 'docs' and 'freqs' information plus the position of each term in each document,
  • 'offsets': store 'docs', 'freqs' and 'positions' plus the start and end character offsets of each term in the original string.
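
A sketch of setting this in a mapping for a hypothetical 'title' field, keeping only docs and frequencies (index and type names are made up):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}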

Chapter 23: Synonyms - examples
Synonyms are used to broaden the scope of matching documents; this kind of search (like stemming or partial matching) should be combined with another query on a field containing the original text. Synonyms can be defined in the Index API request, inlined in the 'synonyms' settings parameter, or in a file by specifying its path in the 'synonyms_path' parameter. The latter can be absolute or relative to the Elasticsearch 'config' directory.
Synonym expansion can be done at index or search time. For instance, we can replace English with the terms 'english' and 'british' at index time, then at search time query for either of these terms. If synonyms are not used at index time, then at search time a query for 'english' has to be rewritten into a query for 'english OR british'.
Synonyms are listed as comma-separated values like 'jump,leap,hop'. It is also possible to use the '=>' syntax to specify on the left side a list of terms to match (e.g. gb,great britain) and on the right side one or many replacements (e.g. britain,england,scotland,wales). If several rules are specified for the same left side, the tokens on the right side are merged.
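
A sketch of how a synonym filter is declared inline and plugged into a custom analyzer (index, filter and analyzer names are made up):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "jump,leap,hop",
            "gb,great britain => britain,england,scotland,wales"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      }
    }
  }
}
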
Replacing synonyms can be done with one of the following options:
1. Simple expansion: any of the listed synonyms is expanded into all of the listed synonyms (e.g. 'jump,hop,leap'). This type of expansion can be applied either at index time or at search time.
2. Simple contraction: a group of synonyms on the left side is mapped to a single value on the right side (e.g. 'leap,hop => jump'). This type of expansion must be applied both at index time and at query time to ensure query terms are mapped to the same value.
3. Genre expansion: it widens the meaning of a term to be more generic. Applying this technique at index time with the following rules:
  • ‘cat      => cat,pet’
  • ‘kitten  => kitten,cat,pet’
  • ‘dog     => dog,pet’
  • ‘puppy => puppy,dog,pet’

then when querying for 'kitten' only documents about kittens will be returned; when querying for 'cat', documents about kittens and cats are returned; and when querying for 'pet', all documents about kittens, cats, puppies, dogs or pets will be returned.
Synonyms and the analysis chain: 
It is appropriate to place the tokeniser first, then a stemmer filter, before the synonym filter. This way, instead of having a rule like 'jumps,jumped,leap,leaps,leaped => jump' we can simply have 'leap => jump'.
In some cases, the synonym filter cannot simply be placed after a lowercase filter, as it has to deal with terms like CAT or PET (Positron Emission Tomography) which conflict with cat and pet when lowercased. One possibility is to:
1. put the synonym filter before the lowercase filter and specify rules with both lowercase and uppercase forms,
2. or have two synonym filters, one for case-sensitive synonyms (with rules like 'CAT,CAT scan => cat_scan') and another one for case-insensitive synonyms (with rules like 'cat => cat,pet').
Multi-word synonyms and phrase queries: using synonyms with 'simple expansion' (i.e. rules like 'usa,united states,u s a,united states of america') may lead to bizarre results for phrase queries; it's more appropriate to use 'simple contraction' (i.e. rules like 'united states,u s a,united states of america => usa').

Symbol synonyms: used for instance to prevent emoticons (like ':)') from being stripped away by the standard tokeniser, as they may change the meaning of the phrase. The solution is to define a mapping character filter, which ensures that emoticons are kept in the index, for instance for sentiment analysis.
Note that the mapping character filter is useful for simple replacements of exact characters; for more complex patterns, regular expressions should be used.

Chapter 24: Typos and misspellings - examples
This chapter is about fuzzy matching at query time and sounds-like matching at index time for handling misspelled words.
Fuzzy matching treats words which are fuzzily similar as if they were the same word. It is based on the Damerau-Levenshtein edit distance, i.e. the number of operations (substitution, insertion, deletion, transposition of adjacent characters) needed to turn one word into the other. Elasticsearch supports a maximum edit distance ('fuzziness') of 2 (the default is 'AUTO'). Two can be overkill as a fuzziness value (most misspellings are within an edit distance of 1), especially for short words (e.g. hat is at distance 2 from mad).
A fuzzy query with an edit distance of 2 can perform very badly and match a large number of documents; the following parameters can be used to limit the performance impact (see the sketch after this list):
  1. prefix_length: the number of initial characters that will not be fuzzified, as most typos occur at the end of words,
  2. max_expansions: limits the number of fuzzy options produced, i.e. fuzzy variations are generated until this limit is reached.
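
A sketch of a fuzzy match query combining these parameters (index and field names are assumed):
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
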
Scoring fuzziness: fuzzy matching should not be used for scoring but only to widen the match results (i.e. to increase recall). For example, if we have 1000 documents containing the word 'Algeria' and one document with the word 'Algeia', then the misspelled word will be considered more relevant (because of TF/IDF) as it has fewer occurrences.

Phonetic matching: there are plenty of algorithms for dealing with phonetic errors, most of them specialisations of the Soundex algorithm. However, they are language specific (either English or German). You need to install the phonetic plugin - see github.com/elasticsearch/elasticsearch-analysis-phonetic.

Similarly, phonetic matching should not be used for scoring as it is intended to increase recall. Phonetic algorithms are useful when the search results will be processed by a machine rather than by humans.
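
A sketch of a phonetic token filter using the double_metaphone encoder, once the phonetic plugin is installed (index, filter and analyzer names are made up):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "dbl_metaphone" ]
        }
      }
    }
  }
}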

Notes for subsequent chapters can be found here.