vendredi 13 janvier 2017

Notes from Elasticsearch - The Definitive Guide (suite 3)

I started reading Elasticsearch - The Definitive Guide a few weeks ago, while working on an Elasticsearch client for Go.
Following are notes I've taken while reading this book.

Aggregations (Part IV) :

Chapter 35: Controlling memory use and latency
Aggregation queries rely on a data structure called “fielddata”, because inverted indices are not efficient when it comes to finding which unique terms exist in a single document. Understanding how fielddata works is important as it is the primary consumer of memory in an Elasticsearch cluster.
Aggregations and Analysis: the “terms” bucket operates on string fields that may be analyzed or not_analyzed. For instance, running a terms aggregation on documents that contain state names (e.g. New York) will create a bucket for each token (e.g. new, york) and not one for each state name, because the field is analyzed by default. To fix this unwanted behaviour, the field should be explicitly mapped as not_analyzed when the index is first created, as in the sketch below.
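A minimal mapping sketch (index, type and field names are made up):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "state": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
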
Furthermore, the ngram analysis process can create a lot of tokens, which is memory unfriendly.
Choosing the right heap size significantly impacts the performance of fielddata and thus of Elasticsearch. The value can be set with the $ES_HEAP_SIZE env variable:
  • Choose no more than half of the available memory, leaving the other half to Lucene, which relies on filesystem caches managed by the kernel.
  • Choose no more than 32GB, which allows the JVM to use compressed pointers and save memory; a bigger value forces the JVM to use pointers of double the size and makes garbage collection more expensive.
To control the size of memory allocated to fielddata, set ‘indices.fielddata.cache.size’ to a percentage of the heap or a concrete value (e.g. 40%, 5gb) in config/elasticsearch.yml. By default this value is unbounded, which means ES will never evict data from fielddata.
Fielddata usage can be monitored (e.g. too many evictions may indicate poor performance), broken down per field:
GET /_stats/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?level=indices&fields=*
To avoid an OutOfMemoryError when trying to load more data into fielddata, ES uses circuit breakers that estimate the memory required to answer a query before loading any more data. ES has different circuit breakers to ensure memory limits are not exceeded:
  • indices.breaker.fielddata.limit: by default limits the size of fielddata to 60% of the heap,
  • indices.breaker.request.limit: estimates the size of the structures required to complete the other parts of a request, by default 40% of the heap,
  • indices.breaker.total.limit: wraps the ‘request’ and ‘fielddata’ circuit breakers and ensures, by default, that their combination does not exceed 70% of the heap.
For instance, circuit breaker limits can be set dynamically on a live cluster:
PUT /_cluster/settings -d '{"persistent": {"indices.breaker.fielddata.limit": "40%"}}'

Fielddata filtering: in some cases we may want to filter out terms that fall into a less interesting long tail and avoid loading them into fielddata at all. This can be done in the document mapping by filtering terms by their frequency (or by a regular expression they match). Terms filtered out of fielddata are no longer available to aggregations; for many applications the memory space gained is more important than keeping rarely used terms in memory. A sketch of a frequency filter in a mapping is shown below.
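A minimal sketch, assuming a string field named ‘tag’ (names are made up); the frequency thresholds are arbitrary:
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 0.01,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
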

vendredi 13 mai 2016

Notes on Big Data related talks

Hadoop Summit 2016

Apache Eagle Monitor Hadoop in Real time
The talk was about Apache Eagle, a Hadoop product developed by eBay to monitor activities on a Hadoop cluster from a security perspective. The talk started by describing the pillars of security in Hadoop: perimeter security, authorization & access control, discovery (e.g. classifying data according to its sensitivity), and activity monitoring. The talk was mainly about the last part, to address infosec questions: how many users are using the Hadoop cluster, what files are they accessing, etc. For this purpose Eagle was born, to track events from different sources (e.g. accidentally deleting files from HDFS) and correlate them with user-defined policies.

Ingest and Stream Processing What will you choose
The talk was divided into two parts; the first one was about streaming patterns and how each one provides at-least-once or exactly-once message delivery.
The second part was a demo of how easily a streaming pipeline can be built with the StreamSets editor. The demo used land data from the city of San Francisco, streamed it and tried to find the parcel with the maximum area. The generated data is then stored in two destinations: Kudu for analytics (e.g. top ten areas) and a Kafka topic whose events are used for rendering in Minecraft (which was pretty neat).

Real time Search on Terabytes of Data Per Day Lessons Learned
Lessons learned by the platform engineering team at Rocana (an ops monitoring software vendor) from building a search engine on HDFS. They described the context and the amount of data they deal with on a daily basis, measured in terabytes. Then they talked about their initial use of SolrCloud as an enabler for their platform, how they struggled to scale it, and how they finally decided to build their own search engine based on Lucene, with the indexes stored on HDFS. The rest of the talk was about the specifics of their time-oriented search engine architecture. In the Q&A, one question was about Elasticsearch: they hadn't really tested it but rather relied on an analysis made by the author of Jepsen (a tool for analysing distributed systems).

Spark Summit East 2016

Spark Performance: What's Next
The talk started with an observation about the evolution of I/O speed, network throughput and CPU speed since the Spark project started in 2010: the first two increased by a factor of 10x while CPU clock speed is stuck around 3GHz. The first attempt at CPU and memory optimization was project Tungsten. Then the speaker described the two phases of performance enhancement:
  • Phase 1 (Spark 1.4 to 1.6): enhanced memory management by using the java.unsafe API and off-heap memory instead of Java objects (which allocate memory unnecessarily).
  • Phase 2 (Spark 2): instead of using the Volcano iterator model to implement operators (i.e. filter, projection, aggregation), use whole-stage code generation to produce optimized code (and avoid virtual function calls), plus vectorization (i.e. a columnar in-memory representation) for efficient scans.

Then the speaker described the impact of these enhancements by comparing the performance of Spark 1.6 vs Spark 2 for different queries. These modifications are on master and under active development.
In the Q&A: the described techniques are applicable to DataFrames because the engine has more information about the data schema, which is not the case with RDDs. With the Dataset API (which sits on top of the DataFrame API) you get the benefit of telling the engine the data schema as well as type safety (i.e. accessing items without having to cast them to their type). DataFrames give you index-based access, while Datasets give you object access.

Others

Ted Dunning on Kafka, MapR, Drill, Apache Arrow and More
Ted Dunning talked about why the Hadoop ecosystem succeeded over the NoSQL movement: a more stable API acting as a standard way to build consensus in the community, whereas NoSQL tends to consist of isolated islands. As an example he gave the Kafka 0.9 release, which reached a new level of stability thanks to its API. He then described how Kafka fits its goal, and gave an example of a use case it would be hard to use it for: real-time tracking of shipment containers where a dedicated Kafka topic is used to track each container, in which case it becomes hard to replicate effectively.
Then he described MapR's approach to open source as a way to innovate in the underlying implementation while complying with a standard API (e.g. HDFS).
He also talked about Drill and how MapR is trying to involve more members of the community so that it doesn't appear to be the only supporter. He also talked about the in-memory movement, especially the Apache Arrow in-memory columnar format, and how it enabled the co-author of pandas to create Feather, a new file format for storing data frames on disk that can be sent over the wire with Apache Arrow without any need for serialization.

more to come.

lundi 9 mai 2016

Notes from Elasticsearch - The Definitive Guide (suite 2)

I started reading Elasticsearch - The Definitive Guide a few weeks ago, while working on an Elasticsearch client for Go.
Following are notes I've taken while reading this book.

Aggregations (Part IV) :

Chapter 25: High-level concepts - examples
Aggregations in Elasticsearch are based on ‘buckets’, which are collections of documents that meet certain criteria (equivalent to grouping in SQL), and ‘metrics’, which are statistics calculated on the documents in a bucket (equivalent to count, avg, sum in SQL).
Buckets can be nested in other buckets, and there is a variety of them: Elasticsearch allows you to partition documents in many different ways (e.g. by hour, by most popular terms, by geographical location).
An aggregation is a combination of buckets and metrics, and since buckets can be nested inside other buckets we can build very complex aggregations. For instance, to calculate the average salary per country, gender and age range (see the sketch after the list):
  1. Partition documents by country (bucket),
  2. Partition each country bucket by gender (bucket),
  3. Partition each gender bucket by age range (bucket),
  4. Calculate the average salary for each age range (metric)
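A sketch of such a nested aggregation (field names, ranges and the count search type are illustrative assumptions):
GET /_search?search_type=count
{
  "aggs": {
    "countries": {
      "terms": { "field": "country" },
      "aggs": {
        "genders": {
          "terms": { "field": "gender" },
          "aggs": {
            "age_ranges": {
              "range": {
                "field": "age",
                "ranges": [
                  { "to": 30 },
                  { "from": 30, "to": 50 },
                  { "from": 50 }
                ]
              },
              "aggs": {
                "avg_salary": { "avg": { "field": "salary" } }
              }
            }
          }
        }
      }
    }
  }
}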

Chapter 26: Aggregation Test-drive - examples
A terms bucket in an aggregation query is a type of bucket definition that creates a new bucket for each unique term it encounters. In the result of such a query, each bucket key corresponds to a term value.
An additional ‘aggs’ level can be nested inside another one in order to nest metrics; for example, to a first count-by-colour aggregation we can add an ‘avg’ metric to calculate the average of the values of the ‘price’ field.
In addition to nesting metrics inside buckets, we can nest buckets inside other buckets, as in the sketch below.
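A minimal sketch of a terms bucket with a nested avg metric (the index, type and field names are assumptions, not taken from the notes):
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}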

Chapter 28: Building Bar Charts - examples
The ‘histogram’ bucket is essential for bar charts. It works by specifying an interval and a numeric field (e.g. price) to calculate buckets on. The interval defines how wide each bucket will be; for instance, if it is set to 10 then a new bucket is created every 10 units. In the response to such an aggregation, the histogram keys correspond to the lower boundary of each bucket.
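A sketch of a histogram with a nested sum metric to get the revenue per price range (interval and field names are illustrative):
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "price_ranges": {
      "histogram": {
        "field": "price",
        "interval": 20000
      },
      "aggs": {
        "revenue": { "sum": { "field": "price" } }
      }
    }
  }
}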

Chapter 29: Looking at time - examples
The second most popular activity in Elasticsearch is building date histograms. Timestamps exist in many kinds of data, and on top of them we can build metrics that are expressed over time. Examples of time-based questions: how many cars were sold each month this year, what was the price of this stock over the last 12 hours.
The date_histogram bucket works similarly to the histogram bucket, but instead of building buckets on a numeric field it is calendar-aware and uses time ranges. Each bucket is defined as a certain calendar interval (e.g. a month).
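A minimal sketch (field name and format are assumptions):
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}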

Chapter 30: Scoping Aggregations - examples
By default, when no query is specified in an aggregation request, Elasticsearch runs it over all documents. In fact, aggregations operate in the scope of the query, and if there is no query then the scope is the ‘match_all’ query.
Omitting ’search_type=count’ from the aggregation URL forces the search hits to be returned as well, so we see both the search results and the aggregation results.
We can use the global bucket to bypass the scope of the query and aggregate over all documents, as in the sketch below.
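A sketch of a scoped aggregation next to a global one (query and field names are made up): ‘single_avg_price’ is computed only on the documents matching the query, while ‘avg_price’ inside the global bucket is computed on all documents.
GET /cars/transactions/_search?search_type=count
{
  "query": { "match": { "make": "ford" } },
  "aggs": {
    "single_avg_price": { "avg": { "field": "price" } },
    "all": {
      "global": {},
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}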

Chapter 31: Filtering Queries and Aggregations - examples
Because an aggregation operates in the scope of the query, any filter added to the query is also applied to the aggregation.
We can use a filter bucket so that only documents matching the filter (e.g. now - 1 month) are added to the bucket. When using a filter bucket, all nested buckets and metrics inherit the filter.
post_filter is a top-level search parameter; it is executed after the search query to filter the results (i.e. the search hits) but affects neither the query scope nor the aggregations, so it doesn't affect categorical facets. Note that for performance reasons post_filter should only be used in combination with aggregations, and only when differential filtering is needed. Recap (see the sketches after the list):
  • filtered query affects both search results and aggregations
  • filter bucket: affects only aggregations
  • post_filter: affects only search results.
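Two sketches, one with a filter bucket and one with a post_filter (index, field names and values are assumptions):
GET /cars/transactions/_search?search_type=count
{
  "query": { "match": { "make": "ford" } },
  "aggs": {
    "recent_sales": {
      "filter": { "range": { "sold": { "from": "now-1M" } } },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}

GET /cars/transactions/_search
{
  "query": { "match": { "make": "ford" } },
  "post_filter": { "term": { "color": "green" } },
  "aggs": {
    "all_colors": { "terms": { "field": "color" } }
  }
}
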
Chapter 32: Sorting multi-value buckets - examples
By default Elasticsearch sorts aggregation buckets by doc_count in descending order. Elasticsearch provides several ways to customise the sorting:
1. Intrinsic sorts: operate on data generated by the bucket itself (e.g. doc_count). They use the ‘order’ object, which can take one of these values: _count (sort by the bucket’s document count), _term (sort by the values of a field), _key (sort by the bucket’s key, works only with histogram and date_histogram buckets).
2. Sorting by a metric: set the sort order with any single-value metric by referencing its name (see the sketch after the list). It is also possible to use multi-value metrics (e.g. extended_stats) by using a dot path to the metric of interest.
3. Sorting based on a metric in deeper nested buckets (my_bucket>another_bucket>metric): only buckets that generate a single bucket (e.g. filter, global) can appear in the path; multi-value buckets (e.g. terms) generate many dynamic buckets, which makes it impossible to determine a deterministic path.
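A sketch of sorting a terms aggregation by a nested metric (names are made up):
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": {
        "field": "color",
        "order": { "avg_price": "asc" }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}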

Chapter 33: Approximate Aggregations - examples
Simple operations like ‘max’ scale linearly with the number of machines in the Elasticsearch cluster. They need no coordination between machines (i.e. no data movement over the network) and their memory footprint is very small (for a sum, all we need to keep is an integer). On the contrary, more complex operations need algorithms that make trade-offs between performance and memory utilisation.
Elasticsearch supports two approximate algorithms, ‘cardinality’ and ‘percentiles’, which are fast but provide an approximate rather than exact result.

Cardinality is the approximate equivalent of a distinct count: it counts the unique values of a field and is based on the HyperLogLog (HLL) algorithm. The algorithm has a configurable precision (through the ‘precision_threshold’ parameter, which accepts values from 0 to 40000) that determines how much memory is used. If the field’s cardinality is below the threshold, the returned count is almost always exact.
To speed up the cardinality calculation on very large datasets, where calculating hashes at query time can be painful, we can instead calculate the hashes at index time.
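A minimal sketch of a cardinality aggregation (field name and threshold are illustrative):
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "distinct_colors": {
      "cardinality": {
        "field": "color",
        "precision_threshold": 100
      }
    }
  }
}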

Percentiles is the other approximation algorithm provided by Elasticsearch; it shows the point below which a certain percentage of the values fall. For instance, the 95th percentile is the value which is greater than 95% of the data. Percentiles are often used to quickly eyeball the distribution of data, check for skew or bimodalities, and also to find outliers. By default, the percentiles query returns an array of pre-defined percentiles: 5, 25, 50, 75, 95, 99.
A companion metric is ‘percentile_ranks’, which returns, for a given value, the percentile it belongs to. For example: if the 50th percentile is 119ms, then the percentile rank of 119ms is 50.
The percentiles metric is based on Ted Dunning’s TDigest algorithm (paper Computing Accurate Quantiles using T-Digest).
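A sketch combining both metrics on a hypothetical ‘latency’ field (index and names are made up):
GET /website/logs/_search?search_type=count
{
  "aggs": {
    "load_times": {
      "percentiles": { "field": "latency" }
    },
    "load_ranks": {
      "percentile_ranks": {
        "field": "latency",
        "values": [ 119, 500 ]
      }
    }
  }
}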

Chapter 34: Significant Terms - examples
Significant terms is an aggregation used for detecting anomalies. It is about finding uncommonly common patterns, i.e. terms that suddenly become very common while they were uncommon in the past. For instance, when analysing logs we may be interested in finding servers that throw a certain type of error more often than they should.


An example of how to use this for recommendations is to analyse the group of people enjoying a certain .. (the foreground group) and determine which .. are most popular among them, then construct the list of popular .. for everyone (the background group). Comparing the two lists shows that the statistical anomalies will be the .. which are over-represented in the foreground compared to the background.
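A sketch of the log-analysis case mentioned above: the query selects error events (the foreground), and significant_terms reports the hosts that are over-represented in it compared to the whole index (the background). Index and field names are assumptions:
GET /logs/_search?search_type=count
{
  "query": { "match": { "level": "error" } },
  "aggs": {
    "suspicious_hosts": {
      "significant_terms": { "field": "host" }
    }
  }
}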

to be continued..

vendredi 6 mai 2016

Notes from Elasticsearch - The Definitive Guide (suite 1)



I started reading Elasticsearch - The Definitive Guide a few weeks ago, while working on an Elasticsearch client for Go.
Following are notes I've taken while reading this book.

Dealing with Human Language (Part III) :


Chapter 18: Getting started with languages - examples
Elasticsearch comes with a set of analyzers for most languages (e.g. Arabic, English, Japanese, Persian, Turkish, etc.). Each of these analyzers applies the same kinds of rules: tokenize the text into words, lowercase each word, remove stopwords, stem tokens to their root form. Additionally, these analyzers may perform language-specific transformations to make the words more searchable.
Language analyzers can be used as is, but it is also possible to configure them, for instance by defining stem-word exclusions (e.g. prevent the word organisation from being stemmed to organ) or custom stopword lists (e.g. keeping no and not out of the list as they invert the meaning of the subsequent words), as in the sketch below.
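A configuration sketch of a custom english analyzer with a stem exclusion list and a stopword list that deliberately leaves out ‘no’ and ‘not’ (index and analyzer names, and the exact word lists, are made up):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organisation", "organisations" ],
          "stopwords": [ "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "of", "on", "or", "the", "to", "was", "will", "with" ]
        }
      }
    }
  }
}
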
When there are multiple documents, each with a single predominant language, it’s more appropriate to use a different index per language (e.g. blogs-en, blogs-fr). It is also possible to gather all the translations in the same document (e.g. title, title_br, title_es).

Chapter 19: Identifying words - examples
Elasticsearch provides a set of tokenizers for extracting tokens (i.e. words) from text. Examples of tokenizers that can be used regardless of the language:
  • whitespace: simply breaks text on whitespace,
  • letter: breaks text on any character which is not a letter,
  • standard: uses the Unicode Text Segmentation algorithm to find the boundaries between words,
  • uax_url_email: similar to the standard tokenizer except that it treats emails and URLs as single tokens.
The standard tokenizer is a good starting point for recognising words in most languages and is the base tokenizer for the language-specific ones (e.g. Spanish). However, it provides limited support for Asian languages; in that case it’s better to consider the ‘icu_tokenizer’.
The ICU plugin needs to be installed manually in order to support languages other than English:
./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION
where $VERSION can be found in github.com/elasticsearch/elasticsearch-analysis-icu
or, in newer versions of Elasticsearch: ./bin/plugin install analysis-icu

For tokenizers to work well, the input text has to be cleaned; character filters can be added to preprocess the text before tokenization. For instance, the ‘html_strip’ character filter removes HTML tags and decodes HTML entities into the corresponding Unicode characters.

Chapter 20: Normalising tokens - examples
After the text is split into tokens, the latter are normalised (e.g. lowercased) so that similar tokens become searchable. For instance, diacritics (e.g. ´, ^ and ¨) in Western languages can be removed with the asciifolding filter, which also converts Unicode characters into a simpler ASCII representation.
Elasticsearch compares characters at the byte level; however the same Unicode character may have different byte representations. In this case, it is possible to use the Unicode normalisation forms (nfc, nfd, nfkc, nfkd), which convert Unicode into a standard format comparable at the byte level.
Lowercasing a Unicode character is not straightforward; it has to be done through case folding, which may not result in the correct spelling but does allow case-insensitive comparisons.
Similarly, the asciifolding token filter has an equivalent for dealing with many languages, icu_folding, which extends the transformation to non-ASCII-based scripts like Greek, for instance folding Arabic numerals to their Latin equivalents.

We can protect particular characters from being folded using a ‘UnicodeSet’, which is a kind of character class as in regular expressions.


Chapter 21: Reducing words to their root form - examples
Stemming attempts to remove the differences between inflected forms of a word (number: fox and foxes; gender: waiter and waitress; aspect: ate and eaten) in order to reduce each word to its root form. English is a weakly inflected language (i.e. we can ignore inflection in words and still get good search results), but this is not the case for all languages, some of which need extra work.
Stemming may suffer from understemming and overstemming. The former is failing to reduce words with the same meaning to the same root, which results in relevant documents not being returned. The latter is failing to keep words with different meanings apart, which reduces precision (i.e. irrelevant documents are returned).
Elasticsearch has two classes of stemmers: algorithmic and dictionary stemmers.
An algorithmic stemmer applies a sequence of rules to the given word to reduce it to its root form.
A dictionary stemmer uses a dictionary mapping words to their root form, so it only has to look the word up to stem it. These stemmers are only as good as their dictionaries: word meanings may change over time and the dictionary has to be updated, and the size of the dictionary may hurt performance as all words (with their suffixes and prefixes) have to be loaded into RAM. An example of a widely used dictionary is the spell checker Hunspell.

Chapter 22: Stopwords performance vs precision - examples
Reducing the index size can be achieved by indexing fewer words. Terms to index can be divided into low-frequency terms, which appear in few documents in the index and thus carry a high weight, and high-frequency terms, which appear in many documents in the index. Frequency depends on the type of documents indexed: e.g. ‘and’ would be a rare term in Chinese documents. For any language there are common words (also called stopwords) that may be filtered out before indexing, but this brings some limitations, such as no longer distinguishing between ‘happy’ and ‘not happy’.
To speed up query performance, we should avoid the default query behaviour which uses the ‘or’ operator.
1. One possible option is to use the ‘and’ operator in a match query, like {"match": {"my_field": {"query": "the quick brown fox", "operator": "and"}}}, which is rewritten to a bool query {"bool": {"must": [{"term": {"my_field": "the"}}, {"term": {"my_field": "quick"}}, {"term": {"my_field": "brown"}}, {"term": {"my_field": "fox"}}]}}. Elasticsearch executes the query with the least frequent term first to immediately reduce the number of explored documents.
2. Another option for enhancing performance is to use the ‘minimum_should_match’ property in the match query.
3. It is possible to divide the terms of a search query into a low-frequency group (relevant terms used for matching) and a high-frequency group (less relevant terms used for scoring only). This can be achieved with the ‘cutoff_frequency’ query parameter, e.g. {"match": {"text": {"query": "Quick and the dead", "cutoff_frequency": 0.01}}}. The latter results in a combined “must” clause with the terms quick/dead and a “should” clause with the terms and/the.
The parameters “cutoff_frequency” and “minimum_should_match” can be combined together.
To effectively reduce the index size, use the appropriate ’index_options’ in the mapping (see the sketch after the list); possible values are:
  • ‘docs’ (default for not_analyzed string fields): store only which documents contain which terms,
  • ‘freqs’: store the ‘docs’ information plus the frequency of each term in each document,
  • ‘positions’ (default for analyzed string fields): store the ‘docs’ and ‘freqs’ information plus the position of each term in each document,
  • ‘offsets’: store ‘docs’, ‘freqs’ and ‘positions’ plus the start and end character offsets of each term in the original string.
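A mapping sketch restricting a field to document and frequency information (index, type and field names are made up):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}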

Chapter 23: Synonyms - examples
Synonyms are used to broaden the scope of matching documents; this kind of search (like stemming or partial matching) should be combined with a query on a field containing the original text. Synonyms can be defined inline in the ’synonyms’ settings parameter of the Index API request, or in a file whose path is given in the ’synonyms_path’ parameter. The path can be absolute or relative to the Elasticsearch ‘config’ directory.
Synonym expansion can be done at index time or at search time. For instance, we can replace English with the terms ‘english’ and ‘british’ at index time, and then at search time query for either of these terms. If synonyms are not used at index time, then at search time we have to rewrite a query for ‘english’ into a query for ‘english OR british’.
Synonyms are listed as comma-separated values like ‘jump,leap,hop’. It is also possible to use the ‘=>’ syntax to specify on the left side a list of terms to match (e.g. gb, great britain) and on the right side one or many replacements (e.g. britain,england,scotland,wales). If several rules are specified for the same left side, the tokens on the right side are merged. A settings sketch is shown below.
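A sketch of an index with a synonym token filter wired into a custom analyzer (all names and rules are illustrative):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "jump,leap,hop",
            "gb,great britain => britain,england,scotland,wales"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      }
    }
  }
}
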
Replacing synonyms can be done with one of the following options:
1. Simple expansion: any of the listed synonyms is expanded into all of the listed synonyms (e.g. ‘jump,hop,leap’). This type of expansion can be applied either at index time or at search time.
2. Simple contraction: a group of synonyms on the left side is mapped to a single value on the right side (e.g. ‘leap,hop => jump’). This type of expansion must be applied both at index time and at query time to ensure that query terms are mapped to the same value.
3. Genre expansion: widens the meaning of terms to be more generic. Apply this technique at index time with the following rules:
  • ‘cat      => cat,pet’
  • ‘kitten  => kitten,cat,pet’
  • ‘dog     => dog,pet’
  • ‘puppy => puppy,dog,pet’

then when querying for ‘kitten’ only documents about kittens will be returned; when querying for ‘cat’, documents about kittens and cats are returned; and when querying for ‘pet’, all documents about kittens, cats, puppies, dogs or pets will be returned.
Synonyms and the analysis chain:
It is appropriate to put the tokenizer first and then a stemmer filter before the synonyms filter. In this case, instead of a rule like ‘jumps,jumped,leap,leaps,leaped => jump’ we can simply have ‘leap => jump’.
In some cases, the synonym filter cannot simply be placed after a lowercase filter, as it has to deal with terms like CAT or PET (Positron Emission Tomography) which conflict with ‘cat’ and ‘pet’ when lowercased. A possibility is to:
1. put the synonym filter before the lowercase filter and specify rules with both lowercase and uppercase forms,
2. or have two synonym filters, one for case-sensitive synonyms (with rules like ‘CAT,CAT scan => cat_scan’) and another one for case-insensitive synonyms (with rules like ‘cat => cat,pet’).
Multi-word synonyms and phrase queries: using synonyms with ‘simple expansion’ (i.e. rules like ‘usa,united states,u s a,united states of america’) may lead to some bizarre results for phrase queries; it’s more appropriate to use ’simple contraction’ (i.e. rules like ‘united states,u s a,united states of america => usa’).

Symbol synonyms: used for instance to prevent emoticons (like ‘:)’) from being stripped away by the standard tokenizer, as they may change the meaning of the phrase. The solution is to define a mapping character filter. This ensures that emoticons are included in the index, for instance to do sentiment analysis.
Note that the mapping character filter is useful for simple replacements of exact characters; for more complex patterns, regular expressions should be used. A sketch:
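A sketch of a mapping character filter that turns emoticons into words before tokenization (names and mappings are made up):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => emoticon_happy",
            ":( => emoticon_sad"
          ]
        }
      },
      "analyzer": {
        "my_emoticons": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}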

Chapter 24: Typoes and misspellings - examples
This chapter is about fuzzy matching at query time and sounds-like matching at index time, both for handling misspelled words.
Fuzzy matching treats words which are fuzzily similar as if they were the same word. It is based on the Damerau-Levenshtein edit distance, i.e. the number of operations (substitution, insertion, deletion, transposition) to perform on a word until it becomes equal to the target word. Elasticsearch supports a maximum edit distance (‘fuzziness’) of 2 (the default is ‘AUTO’). Two can be overkill as a fuzziness value (most misspellings are within an edit distance of 1), especially for short words (e.g. ‘hat’ is at distance 2 from ‘mad’).
A fuzzy query with an edit distance of two can perform very badly and match a large number of documents; the following parameters can be used to limit the performance impact (see the sketch after the list):
  1. prefix_length: the number of initial characters that will not be fuzzified, as most typos occur at the end of words,
  2. max_expansions: limits the number of options produced, i.e. fuzzy variations are generated only until this limit is reached.
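A sketch of a fuzzy match query using both parameters (index, field and values are illustrative):
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
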
Scoring fuzziness: fuzzy matching should not be used for scoring but only to widen the set of matching documents (i.e. to increase recall). For example, if we have 1000 documents containing the word ‘Algeria’ and one document with the word ‘Algeia’, then the misspelled word will be considered more relevant (thanks to TF/IDF) as it appears in fewer documents.

Phonetic matching: there are plenty of algorithms for dealing with phonetic errors, most of them specialisations of the Soundex algorithm. However, they are language specific (either English or German). You need to install the phonetic plugin - see github.com/elasticsearch/elasticsearch-analysis-phonetic.

Similarly, phonetic matching should not be used for scoring, as it is intended to increase recall. Phonetic algorithms are useful when the search results will be processed by a machine rather than by humans.

Notes for subsequent chapters can be found here.

lundi 25 avril 2016

Notes from Elasticsearch - The Definitive Guide

I started reading Elasticsearch - The Definitive Guide a few weeks ago, while working on an Elasticsearch client for Go.
Following are notes I've taken while reading this book:

Chapter1: 
  • history: Lucene, Compass, Elasticsearch
  • download/run a node, the Marvel management and monitoring plugin, Elasticsearch vs relational DB,
  • Employee directory example: create an index (db), index (store) documents, query (lightweight query-string search && query DSL), aggregations
Chapter 2: (about distribution)
  • Cluster health (green, yellow, red), create an index with 3 shards (default 5) and 1 replica, then scaling the number of replicas (up or down), master re-election after the master fails

Chapter 3: 
API for managing documents (create, retrieve, update, delete)
  • Document metadata (_index, _type, _id)
  • Index document: PUT /{index}/{type}/{id}, for auto-generated ids: POST /{index}/{type}/
  • Retrieve a document: GET /{index}/{type}/{id}, without metadata GET /{index}/{type}/{id}/_source, some fields: GET /{index}/{type}/{id}?_source={field1},{field2}
  • Check the existence of a document: curl -i -XHEAD http://elastic/{index}/{type}/{id}
  • Delete a document: DELETE /{index}/{type}/{id}, 
  • Update conflicts are handled with optimistic concurrency control, which uses _version to ensure changes are applied in the correct order; to retry several times in case of failure: POST /{index}/{type}/{id}/_update?retry_on_conflict=5
  • Update using scripts (in Groovy) or set an initial value (to avoid failures for non-existing documents): POST /{index}/{type}/{id}/_update -d '{"script": "ctx._source.views+=1", "upsert": {"views": 1}}'
  • Multi-GET: GET /_mget -d '{"docs": [{"_index": "website", "_type": "blog", "_id": 2}, …]}' or GET /{index}/{type}/_mget -d '{"ids": ["2", "1"]}'
  • Bulk operations (not atomic/transactional, i.e. if one fails, others may succeed): POST /_bulk -d '{action: {metadata}}\n{request body}'

Chapter 4: 
How document management operations are executed by Elasticsearch.
Chapter 5: 
Search basics (look for data sample in gist)
  • Search all types in all indices GET /_search
  • Search a type for documents that contain a word in a field: GET /_all/{type}/_search?q={field}:{word}
  • In query-string queries, conditions prefixed with + (e.g. +{field}:{value}) must be satisfied, conditions prefixed with - must not be satisfied, and conditions with no prefix are optional.

Chapter 6:
Core data types in Elasticsearch are indexed differently; to understand how Elasticsearch interpreted the indexed documents and to avoid surprising query results (e.g. age mapped to string instead of integer), look at the mapping (i.e. schema definition) for the type and index: GET /{index}/_mapping/{type}
ES uses inverted indexes, which consist of the list of unique words appearing in all documents and, for each word, the list of documents it appears in.
Each document and each query is passed through analyzers, which filter characters, tokenize words, then filter these tokens. ES ships with some analyzers: the standard analyzer (used by default), simple analyzer, whitespace analyzer, and language analyzers. Analyzers are applied only to full-text searches and not to exact values.
To understand how documents are tokenized and stored in a given index, we can use the Analyze API by specifying the analyzer: GET /_analyze?analyzer=standard -d "Text to analyse". In the response, the value of each token is what will be stored in the index.

Chapter 7:
Filter vs Query DSL: Elasticsearch has two DSLs which are similar but serve different purposes. The filter DSL asks a yes/no question of every document and is used for exact-value fields. On the other hand, the query DSL asks “how well does this document match?” and assigns it a _score. In terms of performance, filters are much lighter and use caches for even faster future searches. Queries are heavier and should be used only for full-text searches.
The most used queries and filters are: term/terms, exists, match_all, match, multi_match (to run the same match on multiple fields), and the bool query.

Queries can easily become very complex, combining multiple queries and filters; Elasticsearch provides the _validate endpoint for query validation:
GET /{index}/{type}/_validate/query QUERY_BODY
Elasticsearch also provides a human-readable explanation for invalid queries: GET /{index}/{type}/_validate/query?explain QUERY_BODY

Chapter 8: Sorting and relevance
By default, search result documents are sorted by relevance (i.e. by the _score value) in descending order; however for filter queries, which have no impact on the _score, it may be interesting to sort in other ways (e.g. by date). Example of a sort query:
GET /_search {"query": {"filtered": {"filter": {"term": {"user_id": 1}}}}, "sort": {"date": {"order": "desc"}}}

Chapter 10: Index Management 
A type in Elasticsearch consists of a name and a mapping (just like a database schema) that describes its fields, their data types, and how they are indexed and stored in Lucene. The JSON representation of a document is stored as-is in the ‘_source’ field, which consumes disk space, so a good idea may be to disable it.

Chapter 15: - examples
Phrase search (how to search for terms appearing in a specific order in the target documents) and proximity search with the ‘slop’ parameter, which gives more flexibility to the search request.
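A sketch of a phrase query with a slop (index, field and query are made up):
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick brown fox",
        "slop": 1
      }
    }
  }
}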

Chapter 16: - examples
Queries for matching parts of a term (not the whole term). In many cases it is sufficient to use a stemmer to index the root form of words, but there are cases where we need partial matching (e.g. matching a regex against not_analyzed values).
Example of such queries: the ‘prefix’ query works at the term level, doesn’t analyse the query string before searching, and behaves like a filter (i.e. no relevance calculation). A shorter prefix means more terms have to be visited, so for better performance use longer prefixes.
Query-time search-as-you-type is done with match_phrase_prefix queries, and index-time search-as-you-type by indexing n-grams. Sketches of both query types are shown below.
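Two sketches, a prefix query and a match_phrase_prefix query (indexes, fields and values are made up):
GET /my_index/address/_search
{
  "query": {
    "prefix": { "postcode": "W1" }
  }
}

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "quick brown f",
        "max_expansions": 50
      }
    }
  }
}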

Chapter 17: Controlling relevance score - examples
The relevance score in Lucene (and thus Elasticsearch) is based on Term Frequency/Inverse Document Frequency and the Vector Space Model (to combine the weights of the terms in a search query), in addition to a coordination factor, field-length normalization and term/query clause boosting.
1. Boolean model: applies AND, OR and NOT conditions of the search query to find matching documents.
2. Term frequency/Inverse document frequency (TF/IDF): the matching documents then have to be sorted by relevance, which depends on the weight of the query terms that appear in each document. The weight of a term is determined by the following factors:
  • Term frequency: how often a term appears in the document (the more often, the higher the weight). For a given term t and document d, it is the square root of the frequency: tf(t in d) = sqrt(frequency)
  • Inverse document frequency: how often a term appears in all documents of the collection (the more often, the lower the weight). It is calculated from the number of documents in the index and the number of documents the term appears in: idf(t) = 1 + log(numDocs / (docFreq + 1))
  • Field-length norm: how long the field is (the shorter, the higher the weight); if a term appears in a short field (e.g. title) then it is likely the content of that field is about this term. In some cases (e.g. logging) norms are not necessary (e.g. we don’t care about the length of a user-agent field), and disabling them can save a lot of memory. This metric is the inverse square root of the number of terms in the field: norm(d) = 1 / sqrt(numTerms)
These factors are calculated and stored at index time; together they determine the weight of a single term in a particular document.
3. Vector space model:
A single score represents how well a document matches a query. It is calculated by first representing both the search query and the document as one-dimensional vectors whose size is the number of query terms. Each element is the weight of a term, calculated with TF/IDF by default, although it is possible to use other techniques (e.g. Okapi BM25). Then the angle between the two vectors is measured (cosine similarity): the closer they are, the more relevant the document is to the query.
Lucene’s practical scoring function: Lucene combines multiple scoring functions:
1. Query coordination: rewards documents that contain most of the search query terms, i.e. the more query terms a document contains the more relevant it is. Sometimes you may want to disable this function (although most use cases for disabling query coord are handled automatically), for instance when the query contains synonyms.
2. Query-time boosting: a particular query clause can use the boost parameter to be given a higher importance than clauses with a lower boost value or none at all. Boosting can also be applied to entire indices.
Note: not_analyzed fields have ‘field length norms’ disabled and ‘index_options’ set to docs, which disables term frequencies, but the IDF of each term is still taken into account.
Function score query: can use decay functions (linear, exp, gauss) to incorporate a sliding scale (like publish_date, geo_location, price) into the _score and alter document relevance (e.g. recently published, near a given lat/lon or price point).

For some use cases of ‘field_value_factor’ in a function score query, directly using the value of a field (e.g. popularity) may not be appropriate (i.e. new_score = old_score * number_of_votes); in this case a modifier can be used, for instance log1p, which changes the formula to new_score = old_score * log(1 + number_of_votes), as in the sketch below.
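A sketch of a function_score query with field_value_factor and the log1p modifier (index, fields and query are made up):
GET /blogposts/post/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "content": "elasticsearch" } },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p"
      }
    }
  }
}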


Notes for subsequent chapters can be found here.

jeudi 6 août 2015

Adding functionalities to existing classes in Scala

New functionality can be added to existing classes by wrapping them in a value class and adding implicit methods for converting back and forth from the original class:

class TLong(val value: Long) {
  def +(other: TLong) = new TLong(value + other.value)
  def decrement =  new TLong(value - 1L)
  override def toString(): String = value.toString;
}

// implicit methods for conversions
implicit def toTLong(l: Long) = new TLong(l)
implicit def toLong(tl: TLong) = tl.value

// some tests
val l1: TLong = new TLong(1)
val l2: TLong = new TLong(2)
l1 + l2
1L + l2
l1 + 2L

Since Scala 2.10, you can use implicit classes so that you don't have to define conversion methods, as they are created automatically:
implicit class ImplicitLong(val l: Long) {
  def print = l.toString
}

1L.print

samedi 13 juin 2015

Running Java applications on CloudFoundry

Introduction

CloudFoundry v2 uses Heroku buildpacks to package the droplet on which an application will run. Before that, CF checks which of the locally available buildpacks can be used to prepare the application runtime. The buildpack contract is composed of the following scripts (which can be written in shell, Python, Ruby, etc.):
  • Detect: checks if this buildpack is suitable for the submitted application,
  • Compile: prepares the runtime environment of the application,
  • Release: finally launches the application

Applications with single file

Java applications, whether standalone or web, are managed by the java-buildpack. If a manifest.yml is used to submit the application, then for a web application or an executable jar it may look like:
---
applications:
- name: APP_NAME
  memory: 4G
  disk_quota: 2G
  timeout: 180
  instances: 1
  host: APP_NAME-${random-word}
  path: /path/to/war/file.war or /path/to/executable/file.jar

The java-buildpack will check whether the file is a .war, in which case it launches a Tomcat container, or an executable jar, in which case it looks for the main class in META-INF/MANIFEST.MF.

Applications with many files

In case the application is composed of multiple files (jars, assets, configs, etc.), the java-buildpack won't be able to automatically detect the appropriate container to use. We need:
1. For the Detect phase, to choose which container is appropriate (here the java-main container): clone the java-buildpack and set the java_main_class property in config/java_main.yml.

2. In the manifest: indicate the path to the folder containing all the artifacts that should be downloaded to the droplet during the Compile phase.

3. In the manifest: set the command that will be used at the Release phase to launch the application. 

An example of java_main.yml file:
---
java_main_class: package.name.ClassName

An example of a manifest.yml file:
---
applications:
- name: APP_NAME
  memory: 2G
  timeout: 180
  instances: 1
  host: APP_NAME-${random-word}
  path: ./
  buildpack: http://url/to/custom/java-buildpack
  command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:. -Djava.io.tmpdir=$TMPDIR package.name.ClassName

Application submission

$ cf push to submit an application
$ cf logs APP_NAME to access the application logs
$ cf events APP_NAME to access CF events related to this application
$ cf files APP_NAME to access the VCAP user home where the application files are stored

Troubleshooting

If the application fails to start for some reason (you may see no logs), you can check what command was used to launch the application as follows:
$ CF_TRACE=true cf app app_name | grep "detected_start_command"

Note 
  • Uploaded jar files are extracted under /home/vcap/app/APP_NAME in the droplet.
  • for an executable jar, we need to accept traffic on the port given by CF in the VCAP_APP_PORT environment variable; otherwise CF will think that the application has failed to start and will shut it down.
  • to check if a java program is running on CloudFoundry:
import org.cloudfoundry.runtime.env.CloudEnvironment;
...
CloudEnvironment cloudEnvironment = new CloudEnvironment();
if (cloudEnvironment.isCloudFoundry()) {
    // activate cloud profile      
    System.out.println("On A cloudfoundry environment");
}else {
    System.out.println("Not on A cloudfoundry environment");
}

Resources:
  • Standalone (non-web) applications on Cloud Foundry - link