<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 3)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> a few weeks ago, and have been working on an Elasticsearch client in <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<h3 style="text-align: left;">
Aggregations (Part IV) :</h3>
<b>Chapter 35: Controlling memory use and latency</b><br />
Aggregation queries rely on a data structure called “fielddata” because inverted indices are not efficient at finding which unique terms exist in a single document. Understanding how fielddata works is important as it is the primary consumer of memory in an Elasticsearch cluster.<br />
Aggregations and Analysis: the “terms” bucket operates on a string field that may be analyzed or not_analyzed. For instance, running a terms aggregation on documents containing state names (e.g. New York) will create a bucket for each token (e.g. new, york) rather than one per state name, because the field is analysed by default. To fix this unwanted behaviour, the field should be explicitly declared not_analyzed in the mapping when the index is first created.<br />
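For illustration, here is a minimal mapping sketch (index, type and field names are hypothetical; the syntax follows the Elasticsearch 1.x/2.x releases covered by the book) that keeps each state name as a single un-analysed term:<br />
<pre class="brush:java">PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "state": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
</pre>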
Furthermore, the ngram analysis process can create a lot of tokens which is memory unfriendly.<br />
Choosing the right heap size significantly impacts the performance of fielddata and thus Elasticsearch. The value can be set with the $ES_HEAP_SIZE env variable:<br />
Choose no more than half the available memory, leaving the other half to Lucene, which relies on filesystem caches managed by the kernel.<br />
Choose no more than 32GB, which allows the JVM to use compressed pointers and save memory; a bigger value forces the JVM to use pointers of double the size and makes garbage collection more expensive.<br />
To control the size of memory allocated to fielddata, set ‘indices.fielddata.cache.size’ to a percentage of the heap or a concrete value (e.g. 40%, 5gb) in config/elasticsearch.yml. By default this value is unbounded, which means ES will never evict data from fielddata.<br />
Fielddata usage can be monitored (e.g. too many evictions may indicate poor performance), broken down per field:<br />
GET /_stats/fielddata?fields=*<br />
GET /_nodes/stats/indices/fielddata?fields=*<br />
GET /_nodes/stats/indices/fielddata?level=indices&fields=*<br />
To avoid an OutOfMemoryException when loading more data into fielddata, ES uses circuit breakers that estimate the memory required to answer a query before loading any more data. ES has different circuit breakers to ensure memory limits are not exceeded:<br />
indices.breaker.fielddata.limit: by default limits the size of fielddata to 60% of the heap<br />
indices.breaker.request.limit: estimates the size of structures required to complete other parts of a request, by default 40% of the heap<br />
indices.breaker.total.limit: wraps the ‘request’ and ‘fielddata’ circuit breakers, by default ensuring their combination does not exceed 70% of the heap<br />
For instance, the fielddata circuit breaker limit can be set dynamically on a live cluster:<br />
PUT /_cluster/settings -d '{"persistent": {"indices.breaker.fielddata.limit": "40%"}}'<br />
<br />
Fielddata filtering: in some cases we may want to filter out terms that fall into a less interesting long tail and avoid loading them into fielddata at all. This can be done in the document mapping by filtering terms by their frequency (or by a regular expression). Terms filtered out of fielddata are no longer available for aggregations, but for many applications the memory saved matters more than keeping rarely used terms in memory.<br />
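As a sketch (index, type and field names are made up, and the threshold values are arbitrary), a mapping that only loads into fielddata the terms appearing in at least 1% of the documents of a segment could look like:<br />
<pre class="brush:java">PUT /my_index/_mapping/my_type
{
  "properties": {
    "tag": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 0.01,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
</pre>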
</div>
<h3 style="text-align: left;">
Notes on Big Data related talks</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Hadoop Summit 2016</h4>
<a href="https://www.youtube.com/watch?v=scJ4pOQvE1Q">Apache Eagle Monitor Hadoop in Real time</a><br />
The talk was about <a href="https://eagle.incubator.apache.org/">Apache Eagle</a>, a Hadoop product developed by eBay to monitor activities on a Hadoop cluster from a security perspective. The talk started by describing the pillars of security in Hadoop: perimeter security; authorization & access control; discovery (e.g. classifying data according to its sensitivity); and activity monitoring. The talk focused mainly on the last part, to address info-sec questions: how many users are using the Hadoop cluster, what files are they accessing, etc. For this purpose Eagle was born, able to track events from different sources (e.g. accidentally deleting files from HDFS) and correlate them with user-defined policies.<br />
<br />
<a href="https://www.youtube.com/watch?v=LTONR-L40Xg">Ingest and Stream Processing What will you choose</a> <br />
The talk was divided in two parts; the first one was about <a href="https://blog.cloudera.com/blog/2015/06/architectural-patterns-for-near-real-time-data-processing-with-apache-hadoop/">streaming patterns</a> and how each component provides at-least-once or exactly-once message delivery.<br />
The second part was a demo of how easily a streaming pipeline can be built with the <a href="https://streamsets.com/">streamsets</a> editor. The demo used the <a href="https://data.sfgov.org/Geographic-Locations-and-Boundaries/San-Francisco-City-Lands-Current-Zipped-Shapefile-/nt8v-6azn">land data</a> of the city of San Francisco, streaming it and computing the land parcel with the maximum area. The generated data is then stored into two destinations: Kudu for analytics (e.g. top ten areas) and Kafka for the events used for rendering in Minecraft (which was pretty neat).<br />
<br />
<a href="https://www.youtube.com/watch?v=8S_mFmMRijE">Real time Search on Terabytes of Data Per Day Lessons Learned</a><br />
Lessons learned from the platform engineering team at Rocana (an Ops monitoring software vendor) on building a search engine on HDFS. They described the context and the terabytes of data they deal with on a daily basis. Then they talked about their initial use of SolrCloud as an enabler for their platform, how they struggled to scale it, and how they finally decided to build their own search engine based on Lucene, with HDFS to store the indexes. The rest of the talk was about their time-oriented search engine architecture. In the Q&A, one question was about Elasticsearch: they didn't really test it but rather relied on an analysis made by the author of <a href="http://jepsen.io/">Jepsen</a> (a tool for analysing distributed systems).<br />
<h4 style="text-align: left;">
Spark Summit East 2016</h4>
<a href="https://www.youtube.com/watch?v=JX0CdOTWYX4">Spark Performance: What's Next</a><br />
The talk started with an observation about the evolution, since the Spark project started in 2010, of IO speed, network throughput and CPU speed: the first two increased by a factor of 10x while CPUs are stuck around 3GHz. The first attempt at CPU and memory optimization was project Tungsten. Then, the speaker described the two phases of performance enhancement:<br />
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><i><b>Phase 1</b></i> (Spark 1.4 to 1.6): enhanced memory management by using the <span style="color: blue; font-family: monospace;">java.unsafe</span> API and off-heap memory instead of Java objects (which allocate unnecessary memory).</li>
<li><b><i>Phase 2</i></b> (Spark 2): instead of using the <a href="http://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf">Volcano Iterator Model</a> to implement operators (i.e. filter, projection, aggregation), use <a href="https://issues.apache.org/jira/browse/SPARK-12795">Whole-stage Codegen</a> to generate optimized code (and avoid virtual function calls), plus vectorization (i.e. a columnar layout) to represent data in memory for efficient scans.</li>
</ul>
<br />
Then the speaker described the impact of these enhancements by comparing the performance of Spark 1.6 vs Spark 2 for different queries. These modifications are in master under active development.<br />
In the Q&A, it was noted that the described techniques are applicable to DataFrames, as the engine has more information about the data schema, which is not the case with RDDs. With the Dataset API (which sits on top of the DataFrame API) you get the benefit of telling the engine the data schema as well as type safety (i.e. accessing items without having to cast them to their type). DataFrames give you index-based access, while Datasets give you object access.<br />
<h4 style="text-align: left;">
Others</h4>
<a href="https://www.youtube.com/watch?v=l3mDDKjDjMk">Ted Dunning on Kafka, MapR, Drill, Apache Arrow and More</a> <br />
Ted Dunning talked about why the Hadoop ecosystem succeeded over the NoSQL movement thanks to its more stable APIs serving as a standard way to build consensus in the community, while NoSQL systems tend to be isolated islands. As an example he gave the Kafka 0.9 release, which reached a new level of stability thanks to its API. He then described what Kafka is good at and gave an example of a use case it would be hard to use it for: real-time tracking of shipment containers where a dedicated Kafka topic is used to track each container, which would be hard to replicate effectively.<br />
Then, he described MapR's approach to open source as a way to innovate in the underlying implementation while adhering to a standard API (e.g. HDFS).<br />
He also talked about Drill and how MapR is trying to involve more members of the community so that it doesn't appear to be the only supporter. He also talked about the in-memory movement, especially the Apache Arrow in-memory format, and how it enabled the author of pandas to create <a href="https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">Feather</a>, a new file format to store data frames on disk that can be sent over the wire with Apache Arrow without the need for serialization.<br />
<br />
more to come.</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 2)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a>I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> few weeks ago, and working on an Elasticsearch client for <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<br />
<h3 style="text-align: left;">
Aggregations (Part IV) :</h3>
<b>Chapter 25: High-level concepts -</b> examples<br />
<div class="p1">
<span class="s1">Aggregations in Elasticsearch are based on ‘<b><i>bucket</i></b>’ which is a collection of documents that meet a certain criteria (equivalent to grouping in SQL) and ‘<b><i>metrics</i></b>’ which are statistics calculated on documents in a bucket (equivalent to count, avg, sum in SQL). </span></div>
<div class="p1">
<span class="s1">Buckets can be nester in other buckets, and there is variety of them. Elasticsearch allows you to partition documents in many different ways (e.g. by hour, by most popular terms, by geographical location).</span></div>
<div class="p1">
<span class="s1">An aggregation is a combination of buckets and metrics, and buckets can be nested inside other buckets we can create very complex aggregations. For instance to calculate the average salary for a combination of <country age="" gender="">:</country></span></div>
<ol class="ol1">
<li class="li1"><span class="s1">Partition documents by country (bucket),</span></li>
<li class="li1"><span class="s1">Partition each country bucket by gender (bucket),</span></li>
<li class="li1"><span class="s1">Partition each gender bucket by age range (bucket),</span></li>
<li class="li1"><span class="s1">Calculate the average salary for each age range (metric)</span></li>
</ol>
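<div class="p1">
<span class="s1">As a sketch of such a nested aggregation (index, type, field names and age ranges are hypothetical), the request below buckets by country, then by gender, then by age range, and computes an average-salary metric at the deepest level:</span></div>
<pre class="brush:java">GET /company/employee/_search?search_type=count
{
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "by_gender": {
          "terms": { "field": "gender" },
          "aggs": {
            "by_age": {
              "range": {
                "field": "age",
                "ranges": [ { "from": 20, "to": 30 }, { "from": 30, "to": 40 }, { "from": 40, "to": 50 } ]
              },
              "aggs": {
                "avg_salary": { "avg": { "field": "salary" } }
              }
            }
          }
        }
      }
    }
  }
}
</pre>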
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 26: Aggregation Test-drive -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap26.go">examples</a></span></div>
<div class="p1">
<span class="s1">Terms bucket in an aggregation query is a type of bucket definition that will create a new bucket for each unique term in encounters. In the result of this query, a bucket key correspond to the term value. </span></div>
<div class="p1">
<span class="s1">An additional ‘aggs’ level can be added nested inside another one in order to nested metrics, for example to a first ‘count’ by colour aggregation we can add an ‘avg' metric to calculate average of the values of the price ‘field'.</span></div>
<div class="p1">
<span class="s1">In addition to nest metric inside bucket, we can nest buckets inside other buckets.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 28: Building Bar Charts -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap28.go">examples</a></span></div>
<div class="p1">
<span class="s1">The ‘<b><i>histogram</i></b>’ bucket is essential for bar charts. It works by specifying an interval and a numeric field (e.g. price) to calculate bucket on. The interval defines how wide each bucket will be, for instance if it is set to 10 then a new bucket will be created every 10. In the response to such aggregation, the histogram keys correspond to the lower boundary.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 29: Looking at time -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap29.go">examples</a></span></div>
<div class="p1">
<span class="s1">The second most popular activity in Elasticsearch is building date histograms. Timestamps exists in variety of type of data, we can build on top of it metrics which are expressed over time. Example of time-based questions: how many cars sold each month this year, what was the price of this stock for the last 12 hours.</span></div>
<div class="p1">
<span class="s1">The <b><i>date_histogram</i></b> bucket works similarly as the histogram bucket but instead of building buckets based on numeric field, it is calendar-aware and uses time ranges. Each bucket is defined as a certain calendar size (e.g. a month).</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 30: Scoping Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap30.go">examples</a></span></div>
<div class="p1">
<span class="s1">But default when no query parameter is specified in an aggregation, Elasticsearch runs the all document. In fact, aggregations operate in the scope of the query and if there is no query then the scope will be ‘match_all’ query.</span></div>
<div class="p1">
<span class="s1">Omitting ’search_type=count’ from the aggregation url forces the search hits to be returned, and thus seeing the search result and aggregation results.</span></div>
<div class="p1">
<span class="s1">We can use global bucket to by pass the scope of a query to all documents.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 31: Filtering Queries and Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap31.go">examples</a></span></div>
<div class="p1">
<span class="s1">Because the aggregation operates in the scope of a query, then any filter added to the query will be applied to the aggregation.</span></div>
<div class="p1">
<span class="s1">We can use filter bucket so that document matching the filter (e.g. now - 1Month) will be added to the bucket. When using Filter bucket, all nested buckets or metrics will inherent the filter.</span></div>
<div class="p1">
<span class="s1"><b><i>Post filter</i></b> is a top level search parameter, it is executed after the search query to filter the results (i.e. search hits) but does not affect the query scope neither the aggregation. Thus it doesn’t affect the categorial facets. Note that for performance considerations, the post_filter should only be used in combination of aggregations and only when differential filtering is needed. Recap:</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin: 0px; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
</div>
<ul class="ul1">
<li class="li1"><span class="s1">filtered query affects both search results and aggregations</span></li>
<li class="li1"><span class="s1">filter bucket: affects only aggregations</span></li>
<li class="li1"><span class="s1">post_filter: affects only search results.</span></li>
</ul>
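<div class="p1">
<span class="s1">A sketch of differential filtering with post_filter (names are illustrative): the colour aggregation sees every document matching the query, while the returned hits are narrowed down to green cars only.</span></div>
<pre class="brush:java">GET /cars/transactions/_search
{
  "query": { "match": { "make": "ford" } },
  "post_filter": { "term": { "color": "green" } },
  "aggs": {
    "all_colors": { "terms": { "field": "color" } }
  }
}
</pre>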
<b>Chapter 32: Sorting multi-value buckets -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap32.go">examples</a><br />
By default Elasticsearch sorts the aggregation buckets by doc_count in descending order. Elasticsearch provides many ways to customise the sorting:<br />
<b>1. <i>Intrinsic sorts</i></b>: operate on data generated by the bucket (e.g. doc_count). They use the ‘order’ object which can take one of these values: _count (sort by the bucket’s document count), _term (sort by the values of a field), _key (sort by the bucket’s key, works only with histogram and date histogram buckets).<b><br />2. <i>Sorting by a metric</i></b>: set the sort order with any single-value metric by referencing its name (see the sketch after these options). It is also possible to use multi-value metrics (e.g. extended_stats) by using a dot-path to the metric of interest.<br />
<b>3. <i>Sorting based on a metric in subsequent nested buckets</i></b> (my_bucket&gt;another_bucket&gt;metric): only for buckets generating a single value (e.g. filter, global); multi-value buckets (e.g. terms) generate many dynamic buckets, which makes it impossible to determine a deterministic path.<br />
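<div class="p1">
<span class="s1">For example, a sketch of sorting colour buckets by their average price in ascending order (names are illustrative):</span></div>
<pre class="brush:java">GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": {
        "field": "color",
        "order": { "avg_price": "asc" }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
</pre>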
<b><br /></b>
<div class="p1">
<span class="s1"><b>Chapter 33: Approximate Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap33.go">examples</a></span></div>
<div class="p1">
<span class="s1">Simple operations like ‘max’ scales linearly with the number of machines of the Elasticsearch cluster. They don’t need coordination between the machines (i.e. no need for data movement over the network) and the memory footprint is too small (for the sum function all we need is to keep an integer). In the contrary, more complex operations need algorithms that can make tradeoffs between performance and memory utilisation.</span></div>
<div class="p1">
<span class="s1">Elastisearch support two approximate algorithms ‘<i>cardinality</i>’ and ‘<i>percentiles</i>’ which are fast but does provide an accurate result not an exact.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Cardinality</i></b> is the approximation of the distinct query that counts unique values of a field, it is based on the HyperLogLog (HLL) algorithm. This algorithm has configurable precision (through the ‘precision_threshold’ field that accept values from 0 to 40k) that impact how much memory will be used. If the field cardinality is below the threshold than the returned cardinality is almost always 100%.</span></div>
<div class="p1">
<span class="s1">To speed up the cardinality calculation on very large datasets in which case calculating hashes at query time can be painful, we can instead calculate the hash at index time.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Percentiles</i></b> is the other approximation algorithm provided by Elasticsearch, it shows the point at which certain percentage of values occur. For instance, 95th percentile is the value which is greater than 95% of the data. Percentiles are often used to quickly eyeball the distribution of data, check for skew or bimodalities, and also to find outliers. By default, the percentiles query return an array of pre-defined percentiles: 5, 25, 50, 75, 95, 99.</span></div>
<div class="p1">
<span class="s1">A compagnon metric is the ‘<i>percentile_rank</i>’ metrics which return for a given value the percentiles it belongs to. For example: the 50th percentile is 119ms, and 119ms percentile rank is the 50th percentile. </span></div>
<div class="p1">
<span class="s1">The percentiles metric is based on Ted Dunning’s TDigest algorithm (paper Computing Accurate Quantiles using T-Digest).</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 34: Significant Terms -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap34.go">examples</a></span></div>
<div class="p1">
<span class="s1">Significant terms are aggregation queries used for detecting anomalies. It is about finding uncommonly common patters, i.e. cases there becomes suddenly very common while in the past were uncommon. For instance, when analysing logs we may be interested in finding servers that throws a certain type of errors more often then they should.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<br />
<div class="p1">
<span class="s1">An example of how to use this to recommend .. is by analysing the group of people enjoying a certain .. (the foreground group) and determine what .. are most popular, it will then construct a list of popular .. for everyone (the background group). Comparing the two lists shows that statistical anomalies will be the .. which are over represented in the foreground compared to the background.</span></div>
<br />
to be continued..</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 1)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><br /></a><a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a><br />
I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> a few weeks ago, and have been working on an Elasticsearch client in <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<br />
<h3 style="text-align: left;">
Dealing with Human Language (Part III) :</h3>
<br />
<div class="p1">
<span class="s1"><b>Chapter 18: Getting started with languages - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap18.go">examples</a></span></div>
<div class="p1">
<span class="s1">Elasticsearch comes with a set of analysers for most languages (e.g. Arabic, English, Japanese, Persian, Turkish, etc.). Each of these analysers perform the same kind of rules: tokenize text into words, lowercase each word, remove stopwords, stem tokens to their root. Additionally, these analysers may perform some language specific transformation to make the words searchable.</span></div>
<div class="p1">
<span class="s1">Language analysers can be used as is, but it is possible to configure them for instance by defining stem word exclusion (e.g. prevent word organisation from being stemmed to organ), or custom stop words (e.g. omitting no and not as they invert the meaning for the subsequent words).</span></div>
<div class="p1">
<span class="s1">In case there is multiple documents with predominant language in each one, it’s more appropriate to use different index for each language (e.g. blogs-en, blogs-fr). It is also possible to have all the translations gathered in the same document (e.g. title, title_br, title_es).</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 19: Identifying words - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap19.go">examples</a></span></div>
<div class="p1">
<span class="s1">Elasticsearch provides a set of tokenisers in order to extract tokens (i.e. words) from text. Example of tokenisers that can be used regardless of language:</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">whitespace: simply breaks text on whitespace,</span></li>
<li class="li1"><span class="s1">letter: breaks text on any character which is not letter,</span></li>
<li class="li1"><span class="s1">standard: uses Unicode Text Segmentation to find boundaries between words, </span></li>
<li class="li1"><span class="s1">tax_url_email: is similar to the standard tokeniser excepts it treats emails and urls as single words,</span></li>
</ul>
<div class="p1">
<span class="s1">The standard tokeniser is a good starting point to recognise words in most languages and is the basis tokeniser for specific one (e.g. spanish). However it provides a limited support for Asian languages, in such situation it’s better to consider the ‘icu_tokenizer'.</span></div>
<div class="p1">
<span class="s1">The ICU plugin need to be installed manually in order to have the support for other than english languages:</span></div>
<div class="p1">
<span style="color: blue; font-family: monospace;">./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION</span><br />
where <span style="color: blue; font-family: monospace;">$VERSION</span> can be found in <a href="http://github.com/elasticsearch/elasticsearch-analysis-icu">github.com/elasticsearch/elasticsearch-analysis-icu</a><br />
or in newer version of Elasticsearch <span style="color: blue; font-family: monospace;">./bin/plugin </span><span style="color: blue; font-family: monospace;">install analysis-icu</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">For tokenisers to work well the input text has to be cleaned, character filters can be added to preprocess text before tokenization. For instance, the ‘html_strip’ character filter removes HTML tags and decode entities into corresponding Unicode character.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 20: Normalising tokens - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap20.go">examples</a></span></div>
<div class="p1">
<span class="s1">After text is split into tokens, the later are normalised (e.g. to lowercase) in order for similar tokens to be searchable. For instance, removing diacritics (e.g. ‘,^ and ¨) in western languages with asciifolding filter which converts also Unicode characters into simpler ASCII representation.</span></div>
<div class="p1">
<span class="s1">Elasticsearch compares characters at the byte level, however the same Unicode characters may have different bytes representation. In this case, it is possible to use Unicode normalisation forms (nfc, nfd, nfkc, nfkd) that converts Unicode into standard format and comparable at byte level.</span></div>
<div class="p1">
<span class="s1">Lowercasing Unicode character is not straitforward, it has to be made by case folding that may not result in the correct spelling but does allow case-insensitive comparisons.</span></div>
<div class="p1">
<span class="s1">Similarly, asciifolding token filter has an equivalent for dealing with many languages which is icu_folding that extends the transformation to non ASCII-based scripts like Greek. For instance fold arabic numeral to latin equivalent.</span></div>
<br />
<div class="p1">
<span class="s1">We can protect particular characters from being folded using ‘UnicodeSet’ which is a kind of character class in regular expression.</span><br />
<span class="s1"><br /></span>
<br />
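<div class="p1">
<span class="s1">A sketch (assuming the ICU plugin is installed; index, filter and analyser names as well as the character set are illustrative) that folds everything except the Swedish letters å, ä and ö:</span></div>
<pre class="brush:java">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
</pre>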
<div class="p1">
<span class="s1"><b>Chapter 21: Reducing words to their root form -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap21.go">examples</a></span></div>
<div class="p1">
<span class="s1">Stemming attempts to remove the difference between inflected forms of a word (like number: fox and foxes, gender: waiter and waitress, aspect: ate and eaten) in order to reduce each word to its root form. English is a weak inflected language (i.e. we can ignore inflection in words and still having good search result), but this is not the case for all languages that may need an extra work.</span></div>
<div class="p1">
<span class="s1">Stemming may suffer from understemming and overstemming, the former is failing to reduce words with same meaning to the same root and result in relevant document not been returned. The latter is failling to separate words with different meaning which reduces precision (i.e. returning irrelevant documents).</span></div>
<div class="p1">
<span class="s1">Elasticsearch has two classes of stemmers that can be used: algorithmic and dictionary stemmer. </span></div>
<div class="p1">
<span class="s1"><b>Algorithmic stemmer</b> applies a sequence of rules to the given word to reduce it to its root form.</span></div>
<div class="p1">
<span class="s1"><b>Dictionary stemmer</b> uses a dictionary of words to their root format, so that it has only to lookup for the word to be stemmed. These stemmers are as good as their dictionaries, for instance words meaning may change over time and the dictionary have to be updated. Also, the size of the dictionary may hurt the performances as all words (suffixes and prefixes) have to be loaded into RAM. Example of widely used dictionary is the spell checker Hunspell.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 22: Stopwords performance vs precision -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap22.go">examples</a></span></div>
<div class="p1">
<span class="s1">Reducing index size can be achieved by indexing fewer words. Terms to index can be divided into Low frequency terms that appear in fewer index thus having high weight. And terms with high frequency that appear in many documents in the index. The frequency depends on the type of indexed documents, e.g. 'and’ in chinese documents will be a rare word. For any language there are common words (also called stop words) that may be filtered out before indexing but this may bring some limitations: distinguishing between ‘happy’ and 'not happy’.</span></div>
<div class="p1">
<span class="s1">To speedup query performance, we should avoid default query that uses the ‘or’ operator. </span></div>
<div class="p1">
<span class="s1">1. One possible option is to use ‘and’ operator in match query like </span><span style="color: blue; font-family: monospace;">{"match": {"my_field": {"query": "the quick brown fox", "operator":"and"}}}</span>. Which is then rewritten to a bool query <span style="color: blue; font-family: monospace;">{"bool":{"must":[{"term": {"my_field":"the"}}, {"term": {"my_field":"quick"}}, {"term": {"my_field":"brown"}}, {"term": {"my_field":"fox"}}]}}</span>. Elasticsearch will execute first the query with least frequent term to immediately reduce the number of explored documents.</div>
<div class="p1">
<span class="s1">2. Another option for enhancing performance is to use ‘<b><i>minimum_should_match</i></b>’ property in the match query.</span></div>
<div class="p1">
<span class="s1">3. Its possible to divide the terms in search query into low frequency group (relevant terms used for filtering/matching) and high frequency group (irrelevant terms used for scoring only) terms. This is can be achieved with ‘<b><i>cutoff_frequency</i></b>’ query parameter, e.g. </span><span style="color: blue; font-family: monospace;">{</span><span style="color: blue; font-family: monospace;">{"match": {"text": {"query": "Quick and the dead", "cutoff_frequency": 0.01}}}</span>. The latter result in a combined “must” clause with terms “quick/dead” and a should clause with terms “and/the”.</div>
<div class="p1">
<span class="s1">The parameters “<b><i>cutoff_frequency</i></b>” and “<i><b>minimum_should_match</b></i>” can be combined toghether.</span></div>
<div class="p1">
<span class="s1">To effective reduce the index size use the appropriate ’index_options’ in a Mapping API request, possible values are: </span></div>
<ul class="ul1">
<li class="li1"><span class="s1">‘<b><i>docs</i></b>’ (default for ’<b><i>non_analyzed</i></b>’ string fields): store only which documents include which terms,</span></li>
<li class="li1"><span class="s1">‘<b><i>freqs</i></b>’: store ‘docs’ information plus frequency of terms in each document,</span></li>
<li class="li1"><span class="s1">‘<b><i>positions</i></b>’ (default for ‘<b><i>analyzed</i></b>’ string fields): store ‘docs’ and ‘frees’ information plus the position of each term in each document,</span></li>
<li class="li1"><span class="s1">‘<b><i>offsets</i></b>’: store ‘docs’, ‘freqs’ and ‘positions’ plus the start and end character offsets of each term in the original string,</span></li>
</ul>
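<div class="p1">
<span class="s1">A mapping sketch setting index_options on a field (index, type and field names are illustrative):</span></div>
<pre class="brush:java">PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}
</pre>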
<span class="s1"><b>
</b></span><br />
<div class="p1">
<div class="p1">
<span class="s1"><b>Chapter 23: Synonyms -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap23.go">examples</a></span></div>
<div class="p1">
<span class="s1">Synonyms are used to broaden the scope of the matching documents, this kind of search (like stemming, partial matching) should be combined with another query on a field with the original text. Synonyms can be defined in the Index API request inlined in the ’synonyms’ settings parameter or in a file by specifying a path in ’synonyms_path' parameter. The latter can be absolute or relative Elasticsearch ‘config’ directory.</span></div>
<div class="p1">
<span class="s1">Synonym expansion can be done at index or search time, for instance we can replace English with the terms ‘english’ and ‘british’ at index time then in search time we could u-query for one of these terms. If synonyms are not used at index time, then at search time we have to convert the queries with ‘english’ into a query for ‘english OR british’.</span></div>
<div class="p1">
<span class="s1">Synonyms are listed as comma-separated values like ‘jump,leap,hop’. It is also possible to use the syntax with ‘=>’ to specify on the left side a list of terms to match (e.g. gb, great brain) and on the right side one or many replacement (e.g. britain,england,scotland,wales). In case many rules are specified for the same left side then the tokens in the right side are merged.</span></div>
<div class="p1">
<span class="s1">Replacing synonyms can be done with one of the following options:</span></div>
<div class="p1">
<span class="s1"><b><i>1. Simple expansion:</i></b> any of the listed synonyms is expanded into all of the listed synonyms (e.g. ‘jump,hop,leap’). This type expansions can be applied either at index time or search time.</span></div>
<div class="p1">
<span class="s1"><b><i>2. Simple contraction:</i></b> a group of synonyms in the left side are mapped to a single value in the right side (e.g. ‘leap,hop => jump’). This type of expansions must be applied at index time and query time to insure query terms are mapped to the same value.</span></div>
<div class="p1">
<span class="s1"><b><i>3. Genre expansion:</i></b> it widens the meaning of terms to be more generic. Applying this technique at index time with the following rules:</span></div>
<div class="p1">
</div>
<ul style="text-align: left;">
<li>‘cat => cat,pet’</li>
<li>‘kitten => kitten,cat,pet’</li>
<li>‘dog => dog,pet’</li>
<li>‘puppy => puppy,dog,pet’</li>
</ul>
<br />
<div class="p1">
<span class="s1">then when querying for ‘kitten’ only documents about kittens will be returned, when querying for cat documents about kittens and cats are returned, and when querying for pet all documents about kittens, cats, puppies, dogs or pets will be returned.</span></div>
<div class="p1">
<span class="s1">Synonyms and the analysis chain: </span></div>
<div class="p1">
<span class="s1">It is appropriate to set the first a tokeniser filter, then a stemmer filter before putting the synonyms filter. In this case instead of having a rule like ‘jumps,jumped,leap,leaps,leaped => jump’ we can have ‘leap => jump’.</span></div>
<div class="p1">
<span class="s1">In some case, the synonym filters cannot be simply put behind a lowercase filter as it have to deal with terms like CAT or PET (Positron Emission Tomography) which are conflicting when lowercased. A possibility will be to: </span></div>
<div class="p1">
<span class="s1">1. put the synonym filter before the lowercase filter and specify rules with both lowercase and uppercase forms.</span></div>
<div class="p1">
<span class="s1">2. or have two synonym filters one for case-sensitive synonyms (with rules like ‘CAT,CAT scan => cat_scan') and another one for case insensitive synonyms (with rules like ‘cat => cat,pet’).</span></div>
<div class="p1">
<span class="s1">Multi-word synonyms and phrase queries: using synonyms with 'simple expansion’ (i.e. rules like ‘usa,united states,u s a,united states of america') may lead to some bizarre results for phrase queries, it’s more appropriate to use ’simple contraction’ (i.e. rules like ‘united states,u s a,united states of america=>usa’).</span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1"><b>Symbol synonyms:</b> used for instance to avoid emoji (like ‘:)') been striped away by the standard tokeniser filter as they may change the meaning of the phrase. The solution will be to define a mapping character filter. This will ensure that emoticons are included in the index for instance to do sentiment analysis. </span></div>
<div class="p1">
<span class="s1">Note that mapping character filter is useful for simple replacements of exact characters, for more complex patterns; regular expressions should be used.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 24: Typoes and misspellings -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap24.go">examples</a></span></div>
<div class="p1">
<span class="s1">This chapter is about fuzzy matching at query time and sounds-like matching at index time for handling misspelled words.</span></div>
<div class="p1">
<span class="s1">Fuzzy matching treats words which are fuzzily similar as if they are the same word. This is based on Damerau-Levenshtein edit distance, i.e. number of operations (edit, insertion, deletion) to perform on a word until it becomes equal to the target word. Elasticsearch supports a maximum of edit distance ‘fuzziness’ of 2 (default is set to ‘AUTO’). Two can be overkilling as a fuzziness value (most misspelling errors are of distance 1) especially for short words (e.g. hat is at 2 distance for mad).</span></div>
<div class="p1">
<span class="s1">Fuzzy query with an edit distance of two can perform very badly and match a large number of documents, the following parameters can be used to limit performance impact:</span></div>
<ol class="ol1">
<li class="li1"><span class="s1"><b><i>prefix_length</i></b>: number of initial characters that will not be fuzzified, as most types occur at the end of words,</span></li>
<li class="li1"><span class="s1"><b><i>max_expansions</i></b>: limit the number of options produced, i.e. generated fuzzy words until this limit is matched. </span></li>
</ol>
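<div class="p1">
<span class="s1">A sketch of a fuzzy match query using both parameters (index, type, field and values are illustrative):</span></div>
<pre class="brush:java">GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "surprize",
        "fuzziness": "AUTO",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
</pre>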
<div class="p1">
<span class="s1">Scoring fuzziness: fuzzy matching should not be used for scoring but only to widen the match result (i.e. increasing recall). For example, if we have 1000 documents containing the word ‘Algeria’ and one document with the word ‘Algeia’, then the latter misspelled word will be considered more relevant (thanks to TF/IDF) as has fewer appearance.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Phonetic matching</b>: there is plenty of algorithms for dealing with phonetic error, most of them are specialisation of the Soundex algorithm. However they are language specific (either English or German). You need to install the phonetic plugin - here github.com/elasticsearch/elasticsearch-analysis-phonetic. </span></div>
<br />
<div class="p1">
<span class="s1">Similarly, phonetic matching should not be used for scoring as it is intended to increase recall. Phonetic algorithms are useful when the search result will be processed by the machine and not by humans.</span><br />
<span class="s1"><br /></span>
<span class="s1">Notes for subsequent chapters can be found <a href="http://elsoufy.blogspot.com/2016/05/notes-from-elasticsearch-definitive_9.html">here</a>.</span></div>
</div>
</div>
</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: left;">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a>I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> few weeks ago, and working on an Elasticsearch client for <a href="https://github.com/dzlab/elastic-go">golang</a>.</div>
Following are notes I've taken while reading this book:<br />
<br />
<div class="p1">
<span class="s1"><b>Chapter1: </b></span></div>
<ul class="ul1">
<li class="li1"><span class="s1">history: lucent, compass, elasticsearch</span></li>
<li class="li1"><span class="s1">download/run node, plugging manager Marvel, Elasticsearch vs Relational DB, </span></li>
<li class="li1"><span class="s1">Employee directory example: Create index (db), index (store) document, query (light && DSL), aggregations</span></li>
</ul>
<div class="p1">
<span class="s1"><b>Chapter 2:</b> (about distribution)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Cluster health (green yellow, red), Create index with 3 shards (default 5) and 1 replica, then scaling nb of replicas (up or down), master reelection after it fails</span></li>
</ul>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 3:</b> </span></div>
<div class="p1">
<span class="s1">API for managing documents (create, retrieve, update, delete)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Document metadata (_index, _type, _id)</span></li>
<li class="li1"><span class="s1">Index document: PUT /{index}/{type}/{id}, for auto-generated ids: POST /{index}/{type}/</span></li>
<li class="li1"><span class="s1">Retrieve a document: GET /{index}/{type}/{id}, without metadata GET /{index}/{type}/{id}/_source, some fields: GET /{index}/{type}/{id}?_source={field1},{field2}</span></li>
<li class="li1"><span class="s1">Check existence of document curl -i XHEAD http://elastic/{index}/{type}/{id}</span></li>
<li class="li1"><span class="s1">Delete a document: DELETE /{index}/{type}/{id}, </span></li>
<li class="li1"><span class="s1">Update conflicts with optimistic concurrency control, uses _version to ensure changes to be applied in correct order, to retry in case of failures many times POST /{index}/{type}/{id}/_update?retry_on_conflict=5</span></li>
<li class="li1"><span class="s1">Update using scripts (in Groovy) or set initial value (to avoid failures for non existing document) POST /{index}/{type}/{id}/_update -d ‘{“script”: “ctx._source.views+=1”, “upsert”: {“view”: 1}}’</span></li>
<li class="li1"><span class="s1">Multi-GET: GET /_mget -d {“docs”: [{“_index”: “website”, “_type”: “blog”, “_id”: 2}, …]} or GET /{index}/{type}/{id}/_mget -d {“ids”: [“2”, “1”]}</span></li>
<li class="li1"><span class="s1">Bulk operations (not atomic/transactional, i.e. if sone fails, some may succeeds) POST /_bulk -d {action: {metadata}}\n{request body}</span></li>
</ul>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 4:</b> </span></div>
<div class="p1">
<span class="s1">How document management operations are executed by elastic search</span></div>
<div class="p1">
<span class="s1"><b>Chapter 5:</b> </span></div>
<div class="p1">
<span class="s1">Search basics (look for data sample in gist)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Search all types in all indices GET /_search</span></li>
<li class="li1"><span class="s1">Search a type that contains a word in a field GET /_all/{type}/_search?q={field}:{word}</span></li>
<li class="li1"><span class="s1">Queries with + conditions (e.g. +{field}:{value}) must be satisfied, - conditions must not be satisfied, nothing means the condition is optional. </span></li>
</ul>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 6:</b></span></div>
<div class="p1">
<span class="s1">Core data types in elastic search are indexed differently, to understand how elastic search interpreted the indexed documents and to avoid surprising query results (e.g. age mapped to string instead of integer), look at the mapping (i.e. schema definition) for the type and index. GET /{index}/_mapping/{type}</span></div>
<div class="p1">
<span class="s1">ES uses inverted indexes that consists of a list of unique words in all documents and for each one, the list of document it appears in. </span></div>
<div class="p1">
<span class="s1">Each document and query are passed by analysers that filter characters, tokenise words, then filter these tokens. ES ships with some analysers: standard analyser (used by default), simple analyser, whitespace analyser, language analyser. Analysers are applied only to full text searches and not to exact values. </span></div>
<div class="p1">
<span class="s1">To understand how documents are tokenised and stored in a given index, we can use the Analyse API by specifying the analyser: GET /_analyze?analyzer=standard -d “Text to analyse”. In the response, the value of token is what it will be stored in the index.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 7:</b></span></div>
<div class="p1">
<span class="s1">Filter vs Query DSL, elastic search has two DLS which are similar but serve different purposes, the filter DSL asks a yes/no question on every document and it is used for exact value field. In the other hand, Query DSL asks how well this relevant is this document question, and assign it a _score. In terms of performance, filters are much lighter and uses caches for even faster future searches. Queries are heavier and must be used only for full text searches.</span></div>
<div class="p1">
<span class="s1">Most used filters are: term/terms, exists, match_all, match, multi_match (to run same match on multiple fields), and bool query.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Queries can become easily very complex, combining multiple queries and filters, elastic search provides _validate endpoint for query validation:</span></div>
<div class="p1">
<span class="s1">GET /{index}/{type}/_validate/query QUERY_BODY</span></div>
<div class="p1">
<span class="s1">Elastic search provides also a human-readable explanation for non valid queries: GET /{index}/{type}/_validate/query?explain QUERY_BODY</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 8: Sorting and relevance</b></span></div>
<div class="p1">
<span class="s1">By default search result documents are sorted by relevance (i.e. _score value) in descending order, however for filter queries which doesn’t have impact on the _score field it may be interesting to sort other ways (e.g. date). Example of a sort query:</span></div>
<div class="p1">
<span class="s1">GET /_search {"query": {“filtered”: {“filter”: {“term”: {“user_id”: 1}}}}, “sort”: {“date”: {“order”: “desc"}}}</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 10: Index Management </b></span></div>
<div class="p1">
<span class="s1">A type in Elasticsearch consists of a name and a mapping (just like a database schema) that describes its fields, there data types and how they are indexed and stored in lucene. The json representation of a document is stored in plain in the ‘_source’ field which may consume disk space, so a good idea will be to disable it.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 15:</b> - <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap15.go">examples</a></span></div>
<div class="p1">
<span class="s1">Phrase search (how to search for terms with a specific order in the target documents) and proximity search with ‘slop’ parameter that gives more flexibility to the search request</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 16: - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap16.go">examples</a></span></div>
<div class="p1">
<span class="s1">Queries for matching parts of a term (not the whole). In many cases, it is sufficient to use a stemmer to index the root form of words, but there are cases where we need partial matching (e.g. matching a regex in not_analyzed values).</span></div>
<div class="p1">
<span class="s1">Example of queries: ‘prefix’ query works on term level, doesn’t analyse the query string before searching, and performs as a filter (i.e. no relevance calculation). Shorter prefix length means many possible terms to be visited, so for better performance use longer prefixes.</span></div>
<div class="p1">
<span class="s1">Query-time search as you type with match_phrase_prefix queries, and index-time search as you type by defining n-grams</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 17: Controlling relevance score - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap17.go">examples</a></span></div>
<div class="p1">
<span class="s1">Relevance score in Lucene (thus Elasticsearch) is based on Term Frequency/Inverse Document Frequency and Vector Space Model (to combine weight of many terms in search query), in addition to a coordination factor, field length normalization and term/query clause boosting.</span></div>
<div class="p1">
<span class="s1"><b><i>1. Boolean model</i></b>: applies AND, OR and NOT conditions of the search query to find matching documents.</span></div>
<div class="p1">
<span class="s1"><i><b>2. Term frequency/Inverse document frequency (TF/IDF)</b></i>: the matching documents then have to be sorted by relevance that depends on the weight of the query terms appearing in these documents. The weight of a term is determined by the following factors:</span></div>
<ul class="ul1">
<li class="li1"><span class="s1"> Term frequency: defines how often a term appear in this document (the more often the higher is its weight). For a given term t and document d, it is calculated by the square root of the frequency, i.e. tf(t in d)=(frequency)^1/2</span></li>
<li class="li1"><span class="s1">Inverse document frequency: defines how often a term appears in all document of a collection (the more often the lower the weight). It is calculated based on the number of documents in the collection and number of document the term appears in, as follows: idf(t) = 1 + log(numDocs / (docFreq + 1))</span></li>
<li class="li1"><span class="s1">Field length norm: defines how long the field is (the shorter it is the higher the weight), if a term appears in a short field (e.g. title) then it is likely the content of that field is about this term. In some cases (e.g. logging) norms are not necessary (e.g. we don’t care about length of user agent), disabling them can save a lot of memory. This metric is calculated as the inverse square root of number of terms in the given field: norm(d) = 1 / (numTerms)^1/2</span></li>
</ul>
<div class="p1">
<span class="s1">These factors are calculated and stored at index time, together they serve to calculate of a single term in a document.</span></div>
<div class="p1">
<span class="s1"><b><i>3. Vector space model</i></b>:</span></div>
<div class="p1">
<span class="s1">A single score representing how well a document match a query. It is calculated by first representing the search query and the document as one-dimensional vector with a size equal to number of query terms. Each element is the weight of a term calculated with TF/IDF by default although it’s possible to use other techniques (e.g. Okapi-BM25). Then the angle between these vectors is calculated (Cosine similarity), the closer they are the more relevant the document is to the query.</span></div>
<div class="p1">
<span class="s1">Lucene’s practical scoring function: Lucene combines multiple scoring functions:</span></div>
<div class="p1">
<span class="s1"><i><b>1. Query coordination</b></i>: rewards document that have most of the search query terms, i.e. the more query terms the document contains the more relevant it is. Sometimes, you may want to disable this function (although most use cases for disabling Query Coord are handled automatically), for instance if the query contains synonyms.</span></div>
<div class="p1">
<span class="s1"><b><i>2. Query time boosting</i></b>: a particular query clause can use the boost parameter to be given a higher importance over clauses with less boost value or without it. Boosting can also be applied to entire indexes.</span></div>
<div class="p1">
<span class="s1">Note: not_analyzed fields have ‘field length norms’ disabled and ‘index_options’ set to docs these disabling ’term frequencies’, the IDF of each term are still considered.</span></div>
<div class="p1">
<span class="s1">Function score query: can use Decay functions (linear, exp, guess) incorporate sliding scale (like publish_date, geo_location, price) into the _score to alter documents relevance (e.g. recently published, near a lat-lon/price point) </span></div>
<br />
<div class="p1">
<span class="s1">For some use cases of ‘field_value_factor’ in a Function score query using directly the value of field (e.g. popularity) may not be appropriate (i.e. new_score = old_score * number_of_votes), in this case a modifier can be used for instance log1p which changes the formula to new_score = old_score * log(1 + number_of_votes).</span></div>
<b><br /></b>
<br />
<div class="p1">
Notes for subsequent chapters can be found <a href="http://elsoufy.blogspot.fr/2016/05/notes-from-elasticsearch-definitive.html">here</a>.</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-91630223836235303352015-08-06T09:51:00.001-07:002015-08-06T09:52:08.686-07:00Adding functionalities to existing classes in Scala<div dir="ltr" style="text-align: left;" trbidi="on">
New functionality can be added to existing classes by wrapping them in a wrapper class and adding implicit methods for converting back and forth from the original class:<br />
<br />
<pre class="brush:java">class TLong(val value: Long) {
def +(other: TLong) = new TLong(value + otehr.value)
def decrement = new TLong(value - 1L)
override def toString(): String = value.toString;
}
// implicit methods for conversions
implicit def toTLong(l: Long) = new TLong(l)
implicit def toLong(tl: TLong) = tl.value
// some tests
val l1: TLong = new TLong(1)
val l2: TLong = new TLong(2)
l1 + l2
1L + l2
l1 + 2L
</pre>
Starting with Scala 2.10, you can use implicit classes so that you don't have to define the conversion methods yourself, as they are created automatically:<br />
<pre class="brush:java">implicit class ImplicitLong(val l: Long) {
def print = l.toString
}
1L.print</pre>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-25173971721346698282015-06-13T09:03:00.002-07:002015-06-15T08:29:20.770-07:00Running Java applications on CloudFoundry<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Introduction</h4>
<div style="text-align: left;">
CloudFoundry v2 uses Heroku buildpacks to package the droplet on which an application will run. Before that, CF checks which of the locally available buildpacks can be used to prepare the application runtime. The <a href="http://docs.cloudfoundry.org/buildpacks/custom.html">buildpack contract</a> is composed of the following scripts (which can be written in shell, python, ruby, etc.):</div>
<ul style="text-align: left;">
<li>Detect: checks if this buildpack is suitable for the submitted application,</li>
<li>Compile: prepares the runtime environment of the application,</li>
<li>Release: finally launches the application</li>
</ul>
<br />
<h4 style="text-align: left;">
Applications with single file</h4>
Java applications, whether standalone or web, are managed by the <a href="https://github.com/cloudfoundry/java-buildpack">java-buildpack</a>. If a <a href="http://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html">manifest.yml</a> is used to submit the application, then for web applications or executable jars it may look like:<br />
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">applications:</span><br />
<span style="color: blue; font-family: monospace;">- name: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><br />
<span style="color: blue; font-family: monospace;"> memory: 4G</span><br />
<span style="color: blue; font-family: monospace;"> disk_quota: 2G</span><br />
<span style="color: blue; font-family: monospace;"> timeout: 180</span><br />
<span style="color: blue; font-family: monospace;"> instances: 1</span><br />
<span style="color: blue; font-family: monospace;"> host: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><span style="color: blue; font-family: monospace;">-${random-word}</span><br />
<span style="color: blue; font-family: monospace;"> path: /path/to/war/file.war</span> or <span style="color: blue; font-family: monospace;">/path/to/executable/file.jar</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The java-buildpack will check whether the file is a .war, in which case it launches a <a href="https://github.com/cloudfoundry/java-buildpack/blob/master/lib/java_buildpack/container/tomcat.rb">Tomcat container</a>, or an <a href="https://github.com/cloudfoundry/java-buildpack/blob/master/docs/container-java_main.md">executable jar</a>, in which case it looks for the mainClass in <span style="color: blue; font-family: monospace;">META-INF/MANIFEST.MF</span>.</div>
<div style="text-align: left;">
<br /></div>
<h4>
Applications with many files</h4>
<div style="text-align: left;">
In case the application is composed of multiple files (jars, assets, configs, etc.), the java-buildpack won't be able to automatically detect the appropriate container to use. We need:</div>
<div style="text-align: left;">
1. For the <b>Detect</b> phase to choose which container is appropriate (here the java-main): Clone the <a href="https://github.com/cloudfoundry/java-buildpack/">java-buildpack</a> and set the <span style="color: blue; font-family: monospace;">java_main_class</span> property in <span style="color: blue; font-family: monospace;">config/java_main.yml</span>.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
2. In the manifest: indicate the <span style="color: blue; font-family: monospace;">path</span> to the folder containing all artifacts that should be downloaded to the droplet at the <b>Compile</b> phase.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
3. In the manifest: set the <span style="color: blue; font-family: monospace;">command</span> that will be used at the <b>Release</b> phase to launch the application. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
An example of java_main.yml file:</div>
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">java_main_class: package.name.ClassName</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
An example of a manifest.yml file:</div>
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">applications:</span><br />
<span style="color: blue; font-family: monospace;">- name: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><br />
<span style="color: blue; font-family: monospace;"> memory: 2G</span><br />
<span style="color: blue; font-family: monospace;"> timeout: 180</span><br />
<span style="color: blue; font-family: monospace;"> instances: 1</span><br />
<span style="color: blue; font-family: monospace;"> host: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><span style="color: blue; font-family: monospace;">-${random-word}</span><br />
<span style="color: blue; font-family: monospace;"> path: ./</span><br />
<span style="color: blue; font-family: monospace;"> buildpack: http://url/to/custom/java-buildpack</span><br />
<span style="color: blue; font-family: monospace;"> command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:. -Djava.io.tmpdir=$TMPDIR </span><span style="color: blue; font-family: monospace;">package.name.ClassName</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<h4 style="text-align: left;">
Application submission</h4>
<span style="color: blue; font-family: monospace;">$ cf push</span><span style="color: blue; font-family: monospace;"> </span>to submit an application<br />
<span style="color: blue; font-family: monospace;">$ cf logs APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access the application logs<br />
<span style="color: blue; font-family: monospace;">$ cf events APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access CF events related to this application<br />
<span style="color: blue; font-family: monospace;">$ cf files APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access the VCAP user home where the application files are stored<br />
<br /></div>
<h4 style="text-align: left;">
Troubleshooting</h4>
If the application fails to start for some reason (you may see no logs), you can check what command was used to launch the application as follows:<br />
<span style="color: blue; font-family: monospace;">$ CF_TRACE=true cf app app_name | grep "detected_start_command"</span><br />
<br />
<div>
<b>Note</b> </div>
<div>
</div>
<ul>
<li>Uploaded jar files are extracted under <span style="color: blue; font-family: monospace;">/home/vcap/app/APP_NAME</span> in the droplet.</li>
<li>For an executable jar, the application needs to accept traffic on the port given by CF, which is available in the <span style="color: blue; font-family: monospace;">VCAP_APP_PORT</span> environment variable; otherwise CF will consider that the application failed to start and will shut it down (see the sketch after the code block below).</li>
<li>To check whether a Java program is running on CloudFoundry:</li>
</ul>
<pre class="brush:java">import org.cloudfoundry.runtime.env.CloudEnvironment;
...
CloudEnvironment cloudEnvironment = new CloudEnvironment();
if (cloudEnvironment.isCloudFoundry()) {
// activate cloud profile
System.out.println("On A cloudfoundry environment");
}else {
System.out.println("Not on A cloudfoundry environment");
}
</pre>
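A minimal sketch of the port note above: read the CF-assigned port from <span style="color: blue; font-family: monospace;">VCAP_APP_PORT</span> and bind a server to it (the class name and the 8080 fallback are illustrative, and the JDK's built-in HTTP server stands in for whatever framework the application actually uses):<br />
<pre class="brush:java">import java.io.IOException;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

public class PortAwareServer {
    public static void main(String[] args) throws IOException {
        // CF injects the port to listen on; fall back to 8080 when running locally
        String portEnv = System.getenv("VCAP_APP_PORT");
        int port = (portEnv != null) ? Integer.parseInt(portEnv) : 8080;

        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        System.out.println("Listening on port " + port);
    }
}
</pre>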
<br />
Resources:<br />
<ul style="text-align: left;">
<li>Standalone (non-web) applications on Cloud Foundry - <a href="http://blog.yenlo.com/nl/standalone-non-web-applications-cloud-foundry">link</a></li>
</ul>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com4tag:blogger.com,1999:blog-2035497736124196692.post-68090020339481032292015-05-31T03:54:00.000-07:002015-06-13T09:08:36.567-07:00DEV 301 - Developing Hadoop Applications<div dir="ltr" style="text-align: left;" trbidi="on">
1. Introduction to Developing Hadoop Applications<br />
- Introducing MapReduce concepts and history<br />
- Describing how MapReduce works at a high level and how data flows in it<br />
<br />
The typical example of a MapReduce application is Word Count. As input there are many files, which are split among the TaskTracker nodes where the files are located. Each split consists of multiple records; here a record is a line. The <b>Map</b> function gets key-value pairs and uses only the value (i.e. the line) to emit one occurrence at a time for each word. A <b>Combine</b> function then aggregates the occurrences and passes them to the <b>Shuffle</b> phase. The latter is handled by the framework and gathers the output of the prior functions by key before sending it to the reducers. The <b>Reduce</b> function takes the list of all occurrences (i.e. values) of a word (i.e. key) and sums them up to return the total number of times the word has been seen (see the code sketch after the figure below).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg24eRZqG1AgvVK7X9LfEMNpaWXGfI-Jid6tIBpsnM2GZ8p4uUhegcMOzABYXRj-SZUTl7WKBczExHMweMbrcNRHFCjdCeTHc9s-C9qbpP5UEVD6PApJPDRDWX419fncNyWiTXTfV8m00Q/s1600/WordCount+example.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg24eRZqG1AgvVK7X9LfEMNpaWXGfI-Jid6tIBpsnM2GZ8p4uUhegcMOzABYXRj-SZUTl7WKBczExHMweMbrcNRHFCjdCeTHc9s-C9qbpP5UEVD6PApJPDRDWX419fncNyWiTXTfV8m00Q/s1600/WordCount+example.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapReduce example: Word Count</td></tr>
</tbody></table>
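A minimal sketch of the Map and Reduce functions described above, using the Hadoop MapReduce API (class names are illustrative; the Reducer can also be registered as the Combiner):<br />
<pre class="brush:java">import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input record (a line) -> one (word, 1) pair per token
public class WordCountMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer (also usable as Combiner): sums the occurrences of each word
class WordCountReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
    @Override
    protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
</pre>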
Run the word count example:<br />
1. Prepare a set of input text files:<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p /user/user01/1.1/IN1</span><br />
<span style="color: blue; font-family: monospace;">$ cp /etc/*.conf /user/user01/1.1/IN1 2> /dev/null</span><br />
<span style="color: blue; font-family: monospace;">$ ls /user/user01/1.1/IN1 | wc -l</span><br />
2. Run the word count application using the previously created files<br />
<span style="color: blue; font-family: monospace;">$ hadoop jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1-mapr-1408.jar wordcount /user/user01/1.1/IN1 /user/user01/1.1/OUT1</span><br />
3. Check the job output<br />
<span style="color: blue; font-family: monospace;">$ wc -l /user/user01/1.1/OUT1/part-r-00000</span><br />
<span style="color: blue; font-family: monospace;">$ more </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT1/part-r-00000</span><br />
<br />
Trying binary files as input:<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p /user/user01/1.1/IN2/mybinary</span><br />
<span style="color: blue; font-family: monospace;">$ cp /bin/cp /user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary</span><br />
<div>
<span style="color: blue; font-family: monospace;">$ file </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary</span></div>
<span style="color: blue; font-family: monospace;">$ strings </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary | more</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-</span><span style="color: blue; font-family: monospace;">examples.jar wordcount /user/user01/1.1/IN2/mybinary </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT2</span><br />
<span style="color: blue; font-family: monospace;">$ more </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT2/</span><span style="color: blue; font-family: monospace;">part-r-00000</span><br />
Look for references to the word AUTH in the input and output:<br />
<span style="color: blue; font-family: monospace;">$ strings /user/user01/1.1/IN2/mybinary | grep -c AUTH</span><br />
<span style="color: blue; font-family: monospace;">$ egrep -ac AUTH /user/user01/1.1/OUT2/part-r-00000</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmy1meMR8NjOyyChf2bZEFmTb6V7x5AQ8bA7mfh93Zv9ttd2Fww3aNhgfGbyla8g9xKz5ieKDPMrtEjjCwKffXlCmtjwT2pT6vLgZg8DEUlHw6cueuxW7ajYIK2QiKmjZ3AEjJ6EUFXp8/s1600/Execution+and+Data+Flow.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmy1meMR8NjOyyChf2bZEFmTb6V7x5AQ8bA7mfh93Zv9ttd2Fww3aNhgfGbyla8g9xKz5ieKDPMrtEjjCwKffXlCmtjwT2pT6vLgZg8DEUlHw6cueuxW7ajYIK2QiKmjZ3AEjJ6EUFXp8/s1600/Execution+and+Data+Flow.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapReduce execution summary and data flow</td></tr>
</tbody></table>
<br />
<br />
MapReduce Workflow:<br />
<ul style="text-align: left;">
<li>Load production data into HDFS with tools like Sqoop for SQL data or Flume for log data, or with traditional tools, since MapR-FS supports POSIX operations and NFS access.</li>
<li>Analyze, Store, Read.</li>
</ul>
<div>
The InputFormat object is responsible for validating the job input, splitting files among mappers and instantiating the RecordReader. By default, the size of an input split is equal to the size of a block, which is 64 MB in Hadoop; in MapR it is the size of a chunk, which is 256 MB. Each input split references a set of records which will be broken into key-value pairs for the Mapper. The TaskTracker passes the input split to the RecordReader constructor, which reads the records one by one and passes them to the mapper as key-value pairs. By default, the RecordReader considers a line as a record. This can be modified by extending the RecordReader and InputFormat classes to define different records in the input file, for example multi-line records.</div>
<div>
The Partitioner takes the output generated by the Map functions and hashes the record key to create partitions. By default, each partition will be passed to one reducer; this behavior can be overridden (see the sketch below). As part of the Shuffle operation, the partitions are then sorted and merged in preparation for sending them to the reducers. Once an intermediate partition is complete, it will be sent over the network using protocols like RPC or HTTP.</div>
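<div>
A minimal sketch of a custom Partitioner (the class name is illustrative; this one simply mirrors the default hash-based behavior, while a real custom partitioner would route keys differently):</div>
<pre class="brush:java">import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Maps each map-output key onto one of the configured reduce tasks.
public class WordPartitioner extends Partitioner&lt;Text, IntWritable&gt; {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
</pre>
<div>
It would typically be registered on the job with <span style="color: blue; font-family: monospace;">job.setPartitionerClass(WordPartitioner.class)</span>.</div>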
<div>
The result of a MapReduce job is written to an output directory: </div>
<div>
<ul style="text-align: left;">
<li>an empty file named <span style="color: blue; font-family: monospace;">_SUCCESS</span> is created to indicate the success of the job,</li>
<li>the history of the job is captured under the <span style="color: blue; font-family: monospace;">_log/history*</span> directory,</li>
<li>the output of the reduce job is captured under <span style="color: blue; font-family: monospace;">part-r-00000<span style="color: black; font-family: 'Times New Roman';">, </span>part-r-00001<span style="color: black; font-family: 'Times New Roman';">, </span>...</span></li>
<li>if you run a map-only job the output will be <span style="color: blue; font-family: monospace;">part-m-00000<span style="color: black; font-family: Times New Roman;">, </span>part-m-00001<span style="color: black; font-family: 'Times New Roman';">, </span>...</span> </li>
</ul>
<div>
<h4 style="text-align: left;">
Hadoop Job Scheduling</h4>
Two schedulers are available in hadoop, the use of each one is declared in <span style="color: blue; font-family: monospace;">mapred-site.xml</span>:<br />
<br />
<ul style="text-align: left;">
<li>By default the <b>Fair Scheduler</b> is used, where resources are shared evenly across pools (slots of resources) and each user has its own pool. Custom pools can be configured to guarantee minimum access and prevent starvation. This scheduler supports preemption.</li>
<li><b>Capacity Scheduler</b>: resources are shared across queues; the administrator configures hierarchical queues (percentages of the total resources in the cluster) to control access to resources. Queues have ACLs to control user access, and it's also possible to configure soft and hard limits per user within a queue. This scheduler supports resource-based scheduling and job priority. </li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOTa7vfTwQZcQiJlBaLBGARa6_QhT5VePFc32wZKsd2lkxGY0Iz_sfxzl9r7mHLyaWyHeS3i_8YRwMbe_A6fJ21GEfI_wdqxwcDqEDIAiCYi44hTvQffkmOglUj_5jy9qKWP0mjNwhWko/s1600/yarn+architecture.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOTa7vfTwQZcQiJlBaLBGARa6_QhT5VePFc32wZKsd2lkxGY0Iz_sfxzl9r7mHLyaWyHeS3i_8YRwMbe_A6fJ21GEfI_wdqxwcDqEDIAiCYi44hTvQffkmOglUj_5jy9qKWP0mjNwhWko/s320/yarn+architecture.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">YARN architecture</td></tr>
</tbody></table>
<br />
<h4>
Hadoop Job Management</h4>
Depending on the MapReduce version, there are different ways to manage Hadoop jobs:<br />
<br />
<ul style="text-align: left;">
<li>MRv1: through web UIs (JobTracker, TaskTraker), MapR metrics database, <span style="color: blue; font-family: monospace;">hadoop job </span>CLI.</li>
<li>MRv2 (YARN): through web UIs (Resource Manager, Node Manager, History Server), MapR metrics database (for future releases), <span style="color: blue; font-family: monospace;">mapred</span> CLI.</li>
</ul>
<br />
<br /></div>
<div>
The DistributedShell example</div>
</div>
<div>
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar -shell_command /bin/ls -shell_args /user/user01 -jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar</span></div>
<div>
Check the application logs</div>
<div>
<span style="color: blue; font-family: monospace;">$ cd</span><span style="color: blue; font-family: monospace;"> /opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs/application_1430664875648_0001/</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cat container_1430664875648_0001_01_000002/stdout </span><span style="color: #666666; font-family: monospace;"># stdout file</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cat container_1430664875648_0001_01_000002/stderr</span><span style="color: blue; font-family: monospace;"> </span><span style="font-family: monospace;"><span style="color: #666666;"># stderr file</span></span></div>
<div>
<br /></div>
<div>
The logs can also be accessed from the History Server Web UI at <a href="http://node-ip:8088/">http://node-ip:8088/</a></div>
<div>
<br /></div>
<div>
to be continued</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-40505852340177018002015-04-08T01:22:00.000-07:002015-04-18T08:58:55.696-07:00ADM 201 - Hadoop Operations: Cluster Administration<div dir="ltr" style="text-align: left;" trbidi="on">
This article gathers notes taken from <a href="https://www.mapr.com/services/mapr-academy/Hadoop-Operations-Cluster-Administration">MapR's ADM 201</a> class, which is mainly about:<br />
- Testing & verifying hardware before installing MapR Hadoop<br />
- Installing MapR Hadoop<br />
- Benchmarking a MapR Hadoop cluster and configuring a new cluster for production<br />
- Monitoring the cluster for failures & performance<br />
<h4 style="text-align: left;">
Prerequisites</h4>
Install the cluster shell utility, declare the slave nodes and check if it is accessing the nodes properly<br />
<div>
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">sudo -i</span><br />
<span style="color: blue; font-family: monospace;">$ apt-get install clustershell</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ mv /etc/clustershell/groups /etc/clustershell/groups.original</span><br />
<span style="color: blue; font-family: monospace;">$ echo "all: 192.168.2.212 192.168.2.200" > /etc/clustershell/groups</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ clush -a date</span></div>
<div>
<br />
<h4 style="text-align: left;">
Mapr Cluster validation</h4>
</div>
<div>
Inconsistency in the hardware (e.g. different disk sizes or CPU cores) may not cause installation failure but may cause poor performance of the cluster. The use of benchmarking tools (cluster validation <a href="https://github.com/jbenninghoff/cluster-validation">github repo</a>) allows the measurement of the cluster performance.<br />
<br />
The remainder of this section addresses pre-install cluster hardware tests:<br />
1. Download Benchmark Tools<br />
2. Prepare Cluster Hardware for Parallel Execution of Tests<br />
3. Test & Measure Subsystem Components<br />
4. Validate Component Software & Firmware<br />
<div>
<br /></div>
Grab the validation tools from the github repo<br />
<span style="color: blue; font-family: monospace;">$ curl -L -o cluster-validation.tgz http://github.com/jbenninghoff/cluster-validation/tarball/master</span><br />
<span style="color: blue; font-family: monospace;">$ tar xvzf cluster-validation.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ mv jbenninghoff-cluster-validation-*/ </span><span style="color: blue; font-family: monospace;">./</span><br />
<span style="color: blue; font-family: monospace;">$ cd pre-install/</span><br />
<br />
Copy the pre-install folder to all nodes, and check if it succeeded<br />
<span style="color: blue; font-family: monospace;">$ clush -a --copy /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><br />
<span style="color: blue; font-family: monospace;">$ clush -a ls /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><br />
<div>
<br /></div>
<div>
Test the hardware for specification heterogeneity</div>
<span style="color: blue; font-family: monospace;"></span><span style="color: blue; font-family: monospace;"></span><span style="color: blue; font-family: monospace;">$ /root/pre-install/cluster-audit.sh | tee cluster-audit.log</span><br />
<br />
Test the network bandwidth for its ability to handle MapReduce operations:<br />
First, set the IP addresses of the node in <span style="color: blue; font-family: monospace;">network-test.sh</span> (divide them between <span style="color: blue; font-family: monospace;">half1</span> and <span style="color: blue; font-family: monospace;">half2</span>).<br />
<span style="color: blue; font-family: monospace;">$ /root/pre-install/network-test.sh | tee network-test.log</span><br />
<br />
Test memory performance<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">clush -Ba '/root</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">pre-install/memory-test.sh | grep ^Triad' | tee memory-test.log</span></div>
<div>
<br />
Test disk performance<br />
The <span style="color: blue; font-family: monospace;">disk-test.sh</span> script checks the disk health and performance (i.e. throughput for sequential and random I/O read/write), it destroys any data available on it.<br />
<span style="color: blue; font-family: monospace;">$ clush -ab /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><span style="color: blue; font-family: monospace;">disk-test.sh</span><br />
For each scanned disk there will be a result file of the form <span style="color: blue; font-family: monospace;">disk_name-iozone.log</span>.<br />
<br />
<h4 style="text-align: left;">
Mapr Quick Install - <a href="http://doc.mapr.com/display/MapR/Quick+Installation+Guide">link</a></h4>
Minimum requirements:<br />
<ul style="text-align: left;">
<li>2-4 cores (at least two: 1 CPU for OS, 1 CPU for filesystem)</li>
<li>6GB of ram</li>
<li>20GB size of raw disk (should not be formatted/partitioned)</li>
</ul>
<br />
First, download installer script<br />
<span style="color: blue; font-family: monospace;">$ wget </span><span style="color: blue; font-family: monospace;">http://package.mapr.com/releases/v4.1.0/ubuntu/mapr-setup</span><br />
<span style="color: blue; font-family: monospace;">$ chmod 755 </span><span style="color: blue; font-family: monospace;">mapr-setup</span><br />
<span style="color: blue; font-family: monospace;">$ ./</span><span style="color: blue; font-family: monospace;">mapr-setup</span><br />
<br />
Second, configure the installation process (e.g. define data and control nodes). A sample configuration can be found in <span style="color: blue; font-family: monospace;">/opt/mapr-installer/bin/config.example</span><br />
<span style="color: blue; font-family: monospace;">$ cd </span><span style="color: blue; font-family: monospace;">/opt/mapr-installer/bin</span><br />
<span style="color: blue; font-family: monospace;">$ cp </span><span style="color: blue; font-family: monospace;">config.example </span><span style="color: blue; font-family: monospace;">config.example.original</span><br />
Use following commands to find information on nodes to declare in the configuration<br />
<span style="color: blue; font-family: monospace;">$ clush -a lsblk </span><span style="font-family: monospace;"><span style="color: #666666;"># list drivers name</span></span><br />
<span style="color: blue; font-family: monospace;">$ clush -a mount </span><span style="font-family: monospace;"><span style="color: #666666;"># list ip addresses and mounted drivers</span></span><br />
<br />
Edit <span style="color: blue; font-family: monospace;">config.example</span> file<br />
<ul style="text-align: left;">
<li>Declare the nodes information (IP addresses and data drives) under the <span style="color: blue; font-family: monospace;">Control_Nodes</span> section. </li>
<li>Customize the cluster domain by replacing <span style="color: blue; font-family: monospace;">my.cluster.com</span> with your own.</li>
<li>Set a new password (e.g. mapr)</li>
<li>Declare the disks and set <span style="color: blue; font-family: monospace;">ForceFormat</span> to true.</li>
</ul>
Installing mapr (the installation script uses Ansible behind the scene)<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">./install --cfg config.example --private-key /root/.ssh/id_rsa -u root -s -U root --debug new</span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1vQBgpAqVhNSGKjwcnjzGA2h8z_Ym5S4S3bbhwNuNbkOKuoHQI9NEz8nzG5Ho9KaAQ15oLQBej9uq_ZwdmoasaFumNKJbCsOb1nt1g-vwzDRL8s3Z9V_oFOsHTiqsw9BqZ-tCRWsf1pI/s1600/MapR+Cluster+Services.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1vQBgpAqVhNSGKjwcnjzGA2h8z_Ym5S4S3bbhwNuNbkOKuoHQI9NEz8nzG5Ho9KaAQ15oLQBej9uq_ZwdmoasaFumNKJbCsOb1nt1g-vwzDRL8s3Z9V_oFOsHTiqsw9BqZ-tCRWsf1pI/s1600/MapR+Cluster+Services.png" height="158" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapR Cluster Services - <a href="https://mapr.app.box.com/ClusterServices">link</a></td></tr>
</tbody></table>
If the installation succeeded, you can log in to https://master-node:8443/ with mapr:mapr to access the <b>MapR Control System</b> (MCS) and then get a new license.<br />
Otherwise, if the installation fails, remove the install folder and check the installation logs, which can be found at <span style="color: blue; font-family: monospace;">/opt/mapr-installer/</span><span style="color: blue; font-family: monospace;">var/mapr-installer.log</span>. Examples of failure causes:<br />
<ul>
<li>problems formatting disks for MapR FS (check <span style="color: blue; font-family: monospace;">/opt/mapr/logs/disksetup.0.log</span>).</li>
<li>one of the nodes has less than 4G of memory</li>
<li>disks with LVM setup</li>
</ul>
As a last resort, you can remove all mapr packages and re-install:<br />
<br />
<span style="color: blue; font-family: monospace;">$ rm -r -f /opt/mapr/ </span><span style="color: #999999; font-family: monospace;"># remove installation folder</span><br />
<span style="color: blue; font-family: monospace;">$ dpkg --get-selections | grep -v deinstall | grep mapr</span><br />
<span style="color: blue; font-family: monospace;">mapr-cldb install</span><br />
<span style="color: blue; font-family: monospace;">mapr-core install</span><br />
<span style="color: blue; font-family: monospace;">mapr-core-internal install</span><br />
<span style="color: blue; font-family: monospace;">mapr-fileserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-hadoop-core install</span><br />
<span style="color: blue; font-family: monospace;">mapr-hbase install</span><br />
<span style="color: blue; font-family: monospace;">mapr-historyserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-mapreduce1 install</span><br />
<span style="color: blue; font-family: monospace;">mapr-mapreduce2 install</span><br />
<span style="color: blue; font-family: monospace;">mapr-nfs install</span><br />
<span style="color: blue; font-family: monospace;">mapr-nodemanager install</span><br />
<span style="color: blue; font-family: monospace;">mapr-resourcemanager install</span><br />
<span style="color: blue; font-family: monospace;">mapr-webserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-zk-internal install</span><br />
<span style="color: blue; font-family: monospace;">mapr-zookeeper install</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">dpkg -r --force-depends <package> </package></span><span style="color: #999999; font-family: monospace;"># remove all listed packages</span><br />
<br />
To check if the cluster is running properly, we can run the following quick test job.<br />
Note: check that the names of cluster nodes are resolvable through DNS, otherwise declare them in the <span style="color: blue; font-family: monospace;">/etc/hosts</span> of each node.<br />
<span style="color: blue; font-family: monospace;">$ su - mapr</span><br />
<span style="color: blue; font-family: monospace;">$ cd /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/</span><br />
<span style="color: blue; font-family: monospace;">$ yarn jar hadoop-mapreduce-examples-2.5.1-mapr-1501.jar pi 8 800</span><br />
<br />
<h4 style="text-align: left;">
Benchmark the Cluster</h4>
1. Hardware Benchmarking<br />
First, copy the post-install folder to all nodes<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">clush -a --copy /root/post-install</span><br />
<span style="color: blue; font-family: monospace;">$ clush -a ls /root/post-install</span><br />
<br />
Second, run tests to check drive throughput and establish a baseline for future comparison<br />
<span style="color: blue; font-family: monospace;">$ cd /root/post-install</span><br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">clush -Ba '/root/post-install/runRWSpeedTest.sh' | tee runRWSpeedTest.log</span><br />
<br />
2. Application Benchmarking<br />
Use specific MapReduce jobs to create test data and process it in order to challenge the performance limits of the cluster.<br />
First, create a volume for the test data<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume create -name benchmarks -replication 1 -mount 1 -path /benchmarks</span><br />
<br />
Second, generate random sequence of data<br />
<span style="color: blue; font-family: monospace;">$ su mapr</span><br />
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar teragen 5000000 /benchmarks/teragen1</span><br />
<br />
Then, sort the data and write the output to a directory<br />
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar terasort /benchmarks/teragen1 /benchmarks/terasort1</span><br />
<br />
To analyze how long it takes to perform each step check the logs on the JobHistoryServer<br />
<span style="color: blue; font-family: monospace;">$ clush -a jps | grep -i JobHistoryServer</span><br />
<br />
<h4 style="text-align: left;">
Cluster Storage Resources</h4>
MapR FS organizes the drives of a cluster into <b>Storage Pools</b>. The latter is a group of drives (three by default) on a single physical node. Data is stored across the drives of the cluster's storage pools. If one drive fails, the entire storage pool is lost. To recover it, we need to take all drives of this pool offline, replace the failed drive and then return them to the cluster.<br />
Three drives per pool gives a good balance between read/write speed when ingesting huge amounts of data and recovery time for failed drives.<br />
Storage pools hold units called <b>Containers</b> (32 GB by default) which are logically organized into <b>Volumes</b> (which are specific to MapR FS). By default, containers have a replication factor of three inside a volume. We can choose a pattern for replication across containers: chain pattern or star pattern.<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume create name <volume name=""> type 0|1</volume></span><br />
<br />
When writing a file, the Container Location Database (<b>CLDB</b>) is used to determine the first container where data is written. The CLDB replaces the function of a NameNode in MapR Hadoop; it stores container replication factor and pattern information. A file is divided into chunks (default size 256 MB): a small chunk size leads to high write-scheduling overhead, while a big chunk size requires more memory.<br />
A <b>topology</b> defines the physical layout of the cluster's nodes. It's recommended to have two top-level topologies:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">/data</span> the parent topology for active nodes in the cluster</li>
<li><span style="color: blue; font-family: monospace;">/decommissioned</span> the parent topology used to segregate offline nodes or nodes to be repaired.</li>
</ul>
Usually, the racks that house the physical nodes are used as sub-topologies of <span style="color: blue; font-family: monospace;">/data</span>.<br />
<br />
<h4 style="text-align: left;">
Data Ingestion</h4>
Ingestion data to MapR FS can be done through:<br />
<ul style="text-align: left;">
<li>NFS (e.g. Gateway Strategy, Colocation Strategy) by using traditional applications with multiple concurrent read/writes easily - <a href="http://doc.mapr.com/display/MapR/Setting+Up+VIPs+for+NFS">link</a>,</li>
<li>Sqoop to transfer data between MapR-FS and relational databases,</li>
<li>Flume a distributed service for collecting, aggregating & moving data into MapR-FS</li>
</ul>
<br />
<a href="http://doc.mapr.com/display/MapR/Working+with+Snapshots">Snapshots</a> are read-only images of volumes at a specific point in time, more accurately a pointer that costs almost nothing. It's good idea to create them regularly to protect the integrity of the data. By default, a snapshot is scheduled automatically at the creation of a volume, it can be customized through the MCS or manually created as follows:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">maprcli volume snapshot create -volume DemoVolume -snapshotname 12042015-DemoSnapshot</span><br />
<br />
<b>Mirrors</b> are volumes that represent an exact copy of a source volume from the same or a different cluster; they take an extra amount of resources and time to create. By default, a mirror is a read-only volume but can be made writable. Mirrors can be created through the MCS (where the replication factor can also be set) or manually as follows:<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume mirror start -name DemoVolumeMirror</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">$ maprcli volume mirror push -name DemoVolumeMirror</span><br />
<br />
Configuring remote mirrors<br />
First, edit cluster configuration file (in both clusters) to include the location of CLDB nodes on the remote one:<br />
<span style="color: blue; font-family: monospace;">$ echo "cldb_addr1:7222 </span><span style="color: blue; font-family: monospace;">cldb_addr2:7222 </span><span style="color: blue; font-family: monospace;">cldb_addr3:7222</span><span style="color: blue; font-family: monospace;">" >> /opt/mapr/conf/mapr-clusters.conf</span><br />
Second, copy this new configuration to all nodes in the cluster<br />
<span style="color: blue; font-family: monospace;">$ clush -a --copy </span><span style="color: blue; font-family: monospace;">/opt/mapr/conf/mapr-clusters.conf</span><br />
Third, restart the Warden service so that the modification takes effect:<br />
<span style="color: blue; font-family: monospace;">$ clush -a service mapr-warden restart</span><br />
Finally, start the mirroring from the MCS interface.<br />
<h4 style="text-align: left;">
Cluster Monitoring</h4>
Once a cluster is up and running, it has to be kept running smoothly. MCS helps monitor the health of the cluster and investigate failure causes by providing:<br />
<br />
<ul style="text-align: left;">
<li><a href="http://doc.mapr.com/display/MapR/Alarms+Reference">alarms</a>: sending emails, nagios notification, and </li>
<li>statistics about nodes (e.g. services), volumes, jobs (<a href="http://doc.mapr.com/display/MapR/Setting+up+the+MapR+Metrics+Database">MapR metrics database</a>). MapR Hadoop provide ways to</li>
</ul>
<br />
Standard logs for each node are stored at <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-2.5.1/logs</span>, however the <a href="http://doc.mapr.com/display/MapR/Centralized+Logging">centralized logs</a> are stored in <span style="color: blue; font-family: monospace;">/mapr/MaprQuickInstallDemo/var/mapr/local/c200-01/logs</span> at the cluster level.<br />
<br />
<a href="http://doc.mapr.com/display/MapR/Centralized+Logging">Centralized logging</a> automate for us the gathering of logs from all cluster nodes. It provides a job-centric view. The following command can be used to create a centralized log direcotry populated with symbolic links to all log files related to: tasks, map attempts, reduce attempts, pretaited to this specific job.<br />
<span style="color: blue; font-family: monospace;">$ maprcli job linklogs -jobid JOB_ID -todir MAPRFS_DIR</span><br />
<br />
The MapR centralized logging feature is enabled by default in <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-env.sh</span> through the environment variable <span style="color: blue; font-family: monospace;">HADOOP_TASKTRACKER_ROOT_LOGGER</span>.<br />
For MRv1, the standard log for each node is stored under <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/logs</span>,<br />
while the centralized logs are stored at the cluster level (see the path above).<br />
<br />
<b>Alarms</b><br />
When a disk failure alarm is raised, the report at /opt/mapr/logs/faileddisk.log gives information about which disks have failed, the reason for the failure and the <a href="http://doc.mapr.com/display/MapR/Handling+Disk+Failures">recommended resolution</a>.<br />
<br />
<br />
<div style="text-align: left;">
<b>Cluster Statistics</b></div>
MapR collects a variety of statistics about the cluster and running jobs. This information helps track cluster usage and health. The metrics can be written to an output file or consumed by Ganglia; the output type is specified in two <a href="http://doc.mapr.com/display/MapR/hadoop-metrics.properties">hadoop-metrics.properties</a> files:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties</span> for output of hadoop standard services</li>
<li><span style="color: blue; font-family: monospace;">/opt/mapr/conf/hadoop-metrics.properties</span> for output of MapR specific services</li>
</ul>
Collected metrics can be about <a href="http://doc.mapr.com/display/MapR/Service+Metrics">services</a>, <a href="http://doc.mapr.com/display/MapR/Analyzing+Job+Metrics">jobs</a>, <a href="http://doc.mapr.com/display/MapR/Node+Metrics">nodes</a> and <a href="http://doc.mapr.com/display/MapR/Monitoring+Node+Metrics">monitoring node</a>.<br />
<br />
<b>Schedule Maintenance Jobs</b><br />
The collected metrics give us a good view of the cluster's performance and health, but the complexity of the cluster makes it hard to use these metrics directly to optimize how the cluster is running.<br />
Run test jobs regularly to gather job statistics and watch cluster performance. If a variance in cluster performance is observed, actions need to be taken to bring performance back. By doing this in a controlled environment we can try different approaches (e.g. tweaking the Disk and Role <a href="http://doc.mapr.com/display/MapR/Configuring+Balancer+Settings">balancer settings</a>) to optimize the cluster performance.<br />
<br />
<h4 style="text-align: left;">
Resources:</h4>
<ul style="text-align: left;">
<li>MapR installation - <a href="https://mapr.app.box.com/ADM201-LabGuide-v401">Lab Guide</a>, <a href="http://doc.mapr.com/display/MapR/Quick+Installation+Guide">Quick Installation Guide</a></li>
<li>Preparing Each Node - <a href="http://doc.mapr.com/display/MapR/Preparing+Each+Node">link</a></li>
<li>Setting up a MapR Cluster on Amazon Elastic MapReduce - <a href="http://doc.mapr.com/display/MapR3/Setting+up+a+MapR+Cluster+on+Amazon+Elastic+MapReduce">link</a></li>
<li>Cluster service planning - <a href="http://doc.mapr.com/display/MapR/Planning+the+Cluster">link</a></li>
<li>Tuning cluster for MapReduce performance for specific jobs - <a href="http://doc.mapr.com/display/MapR3/Tuning+a+Cluster+for+MapReduce+Performance">link</a></li>
<li>MapR Hadoop data storage - <a href="http://doc.mapr.com/display/MapR/Managing+Disks">link</a></li>
</ul>
<br />
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com3tag:blogger.com,1999:blog-2035497736124196692.post-975922085712548302015-04-03T06:44:00.003-07:002015-04-03T06:55:45.330-07:00Hadoop interview questions<div dir="ltr" style="text-align: left;" trbidi="on">
1) An HDFS file can ...<br />
<br />
<ul style="text-align: left;">
<li>... be duplicated on several nodes</li>
<li>... be compressed</li>
<li>... combine multiple files</li>
<li>... contain multiple blocks of different sizes</li>
</ul>
<br />
<div>
2) How does HDFS ensure the integrity of the stored data?</div>
<div>
<div>
<ul style="text-align: left;">
<li>by comparing the replicated data blocks with each other</li>
<li>through error logs</li>
<li>using checksums</li>
<li>by comparing the replicated blocks to the master copy</li>
</ul>
</div>
<div>
<div>
3) HBase is ...</div>
<div>
<ul style="text-align: left;">
<li>... column oriented</li>
<li>... key-value oriented</li>
<li>... versioned</li>
<li>... unversioned</li>
<li>... use zookeeper for synchronization</li>
<li>... use zookeeper for electing a master</li>
</ul>
</div>
</div>
</div>
<div>
<div>
4) An HBase table ...</div>
<div>
<ul style="text-align: left;">
<li>... need a scheme</li>
<li>... doesn't need a scheme</li>
<li>... is served by only one server</li>
<li>... is distributed by region</li>
</ul>
</div>
</div>
<div>
<div>
5) What does a major_compact on an HBase table?</div>
<div>
<ul style="text-align: left;">
<li>It compresses the table files.</li>
<li>It combines multiple existing store files to one for each family.</li>
<li>It merges region to limit the region number.</li>
<li>It splits regions that are too big.</li>
</ul>
</div>
</div>
<div>
<div>
6) What is the relationship between Jobs and Tasks in Hadoop?</div>
<div>
<ul style="text-align: left;">
<li>One job contains only one task</li>
<li>One task contains only one job</li>
<li>One Job can contain multiple tasks</li>
<li>One task can contain multiple jobs</li>
</ul>
</div>
</div>
<div>
<div>
7) The number of Map tasks to be launched in a given job mostly depends on...</div>
<div>
<ul style="text-align: left;">
<li>the number of nodes in the cluster</li>
<li>property mapred.map.tasks</li>
<li>the number of reduce tasks</li>
<li>the size of input splits</li>
</ul>
</div>
<div>
8) If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?</div>
<div>
<ul style="text-align: left;">
<li>One by one on each available reduce slot</li>
<li>Statistically</li>
<li>By hash</li>
</ul>
</div>
<div>
9) In Hadoop can you set</div>
<div>
<ul style="text-align: left;">
<li>Number of map</li>
<li>Number of reduce</li>
<li>Both map and reduce number</li>
<li>None, it's automatic</li>
</ul>
</div>
<div>
10) What is the minimum number of Reduce tasks for a Job?</div>
<div>
<ul style="text-align: left;">
<li>0</li>
<li>1</li>
<li>100</li>
<li>As many as there are nodes in the cluster</li>
</ul>
</div>
<div>
11) When a task fails, hadoop....</div>
<div>
<ul style="text-align: left;">
<li>... try it again</li>
<li>... try it again until a failure threshold stops the job</li>
<li>... stop the job</li>
<li>... continue without this particular task</li>
</ul>
</div>
<div>
12) How can you debug map reduce job?</div>
<div>
<ul style="text-align: left;">
<li>By adding counters.</li>
<li>By analyzing log.</li>
<li>By running in local mode in an IDE.</li>
<li>You can't debug a job.</li>
</ul>
</div>
</div>
<div>
References:<br />
<ul style="text-align: left;">
<li>Hadoop wiki - <a href="http://wiki.apache.org/hadoop/">link</a></li>
<li>Hadoop tutorial - <a href="http://hadooptutorial.wikispaces.com/Hadoop">link</a></li>
</ul>
</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com2tag:blogger.com,1999:blog-2035497736124196692.post-20309610917958257842015-03-24T07:41:00.002-07:002015-04-04T03:03:06.544-07:00Password-less SSH root access<div dir="ltr" style="text-align: left;" trbidi="on">
So I had to configure password-less SSH access between a master machine and a slave one:<br />
<br />
1. Create an SSH key pair on the master machine<br />
<span style="color: blue; font-family: monospace;">root@master-machine$</span><span style="color: blue; font-family: monospace;"> ssh-keygen </span><br />
<br />
2. Create an SSH key pair on the slave machine,<br />
<span style="color: blue; font-family: monospace;">root@slave-machine$</span><span style="color: blue; font-family: monospace;"> ssh-keygen</span><br />
<br />
To copy the public key to the remote machine we need root access; however, by default password-based SSH access as root is not allowed.<br />
<br />
3. On the slave machine: <span style="color: blue; font-family: monospace;">sudo passwd</span>.<br />
3.1. set a password for root (if not already set)<br />
3.2. edit <span style="color: blue; font-family: monospace;">/etc/ssh/sshd_config</span> (not <span style="color: blue; font-family: monospace;">/etc/ssh/ssh_config</span>) to change <span style="color: blue; font-family: monospace;">PermitRootLogin without-password</span> to <span style="color: blue; font-family: monospace;">PermitRootLogin yes</span>.<br />
3.3. restart the SSH daemon with <span style="color: blue; font-family: monospace;">service ssh restart</span>, or <span style="color: blue; font-family: monospace;">service ssh reload</span> if in an SSH session.<br />
<br />
4. Copy master's root public key to the authorized keys in the slave machine<br />
<span style="color: blue; font-family: monospace;">root@master-machine$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">ssh-copy-id -i root@</span><span style="color: blue; font-family: monospace;">slave-machine</span><br />
<br />
Disable password-based SSH access for root:<br />
5. On the slave machine, edit <span style="color: blue; font-family: monospace;">/etc/ssh/sshd_config</span> to change <span style="color: blue; font-family: monospace;">PermitRootLogin </span><span style="color: blue; font-family: monospace;">yes</span> to <span style="color: blue; font-family: monospace;">PermitRootLogin </span><span style="color: blue; font-family: monospace;">without-password</span>.<br />
<br />
6. Now you can ssh as root from the master to the slave machine without password:<br />
<span style="color: blue; font-family: monospace;">root@master-machine$ ssh root@</span><span style="color: blue; font-family: monospace;">slave-machine</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
For more details on SSH keys, check <a href="https://help.ubuntu.com/community/SSH/OpenSSH/Keys">link</a>.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-22348036023012039592015-03-20T04:48:00.000-07:002015-03-21T04:30:32.237-07:00Exposing services to CF applications<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Service Broker API</h4>
The Service Broker (SB) API (full <a href="http://docs.cloudfoundry.org/services/">documentation</a>) enables service providers to expose their offerings to applications running on Cloud Foundry (CF). Implementing this contract allows the CF Cloud Controller (CC) to communicate with the service provider in order to handle:<br />
<ol style="text-align: left;">
<li>Catalog Management: register the offering catalog (e.g. different service plans), </li>
<li>Provisioning: create/delete a service instance (e.g. create a new MongoDB collection),</li>
<li>Binding: connect/disconnect a CF application to/from a provisioned service instance.</li>
</ol>
For each of these possible actions, there is an endpoint defined in the Service Broker contract.<br />
<br />
<b>1. Catalog Management</b><br />
The Service Broker (full <a href="http://docs.cloudfoundry.org/services/api.html">documentation</a>) should expose an endpoint for catalog management that provides, in JSON format, information on the service itself, the different plans (e.g. free or not) that can be consumed by applications, and some metadata describing the service.<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">GET http://broker-url/v2/catalog</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 200 OK</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">services</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
[<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">planUpdatable</span>: <span class="type-boolean" style="color: firebrick;">false</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">id</span>: <span class="type-string" style="color: green;">"a unique service identifier"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">name</span>: <span class="type-string" style="color: green;">"service name"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">description</span>: <span class="type-string" style="color: green;">"service description"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">bindable</span>: <span class="type-boolean" style="color: firebrick;">true</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_updateable</span>: <span class="type-boolean" style="color: firebrick;">false</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plans</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
[<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">id</span>: <span class="type-string" style="color: green;">"a unique plan id"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">name</span>: <span class="type-string" style="color: green;">"plan name"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">description</span>: <span class="type-string" style="color: green;">"plan description"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">metadata</span>: { },</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">free</span>: <span class="type-boolean" style="color: firebrick;">false</span></div>
</li>
</ul>
}</div>
</li>
</ul>
],</div>
</li>
<li style="position: relative;"><div class="hoverable hovered" style="-webkit-transition: background-color 0.2s ease-out 0.2s; background-color: white; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0.2s;">
<span class="property" style="font-weight: bold;">tags</span>: [ ],</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">metadata</span>: { },</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">requires</span>: [ ],</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">dashboard_client</span>: <span class="type-null" style="color: grey;">null</span></div>
</li>
</ul>
}</div>
</li>
</ul>
]</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="font-family: monospace;"><br /></span>
<b>2. Provisioning</b><br />
Provisioning consists of synchronous actions that the Service Broker performs, on demand from the CC, to create a new resource (or destroy an existing one) for the application. The CC sends a PUT request with a designated instance identifier and a JSON body carrying the chosen service and plan identifiers. Once the actions are performed, the Service Broker replies with a JSON body (e.g. a dashboard URL).<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">PUT http://broker-url/v2/</span><span style="color: blue; font-family: monospace;">service_instances/:instance_id</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">service_id</span>: "service identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_id</span>: "plan identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">organization_guid</span>: "ORG identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">space_id</span>: "SPACE identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 201 Created</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">dashboard_url</span>: null</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
Once created, a service instance can be updated (e.g. to upgrade the consumed plan). For this, the same request is sent to the SB with a body containing only the attributes to update:<br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_id</span>: "new_plan_identifier"</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
<b>3. Binding</b><br />
Binding allows a CF application to connect to a provisioned service instance and start consuming the offered plan. When the SB receives a binding request from the CC, it replies with the <a href="http://docs.cloudfoundry.org/services/binding-credentials.html">necessary information</a> (e.g. service URL, authentication information, etc.) for the CF application to utilize the offered service.<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">PUT http://broker-url/v2/</span><span style="color: blue; font-family: monospace;">service_instances/:instance_id/service_bindings/:binding_id</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">service_id</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"</span><span style="color: green; font-family: monospace;">service identifier"</span><br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; font-family: monospace; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span style="font-family: monospace;">,</span></div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">plan_id</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"plan identifier"</span><br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; font-family: monospace; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span style="font-family: monospace;">,</span></div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">app_guid</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"application identifier"</span></div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 201 Created</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">credentials</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span class="property" style="font-weight: bold;">uri</span>: <span class="type-string" style="color: green;">"a uri to the service instance"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">username</span>: <span class="type-string" style="color: green;">"username on the service"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">password</span>: <span class="type-string" style="color: green;">"password for the username"</span></div>
</li>
</ul>
}</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">syslog_drain_url</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
</div>
<span class="type-null" style="color: grey;">null</span></li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
To unbind the application from the service, the SB receives a DELETE request on the same URL.<br />
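A sketch of what such a request might look like (the service and plan identifiers are typically passed as query parameters):<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">DELETE http://broker-url/v2/service_instances/:instance_id/service_bindings/:binding_id?service_id=...&plan_id=...</span><br />
<span style="color: #666666; font-family: monospace;"># The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 200 OK</span><br />
<span style="font-family: monospace;">{ }</span><br />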
<br />
<b>Note! </b><br />
All the previous requests from the Cloud Controller to the Service Broker contain the <span style="color: blue; font-family: monospace;">X-Broker-Api-Version</span> HTTP header. It designates the Service Broker API version (e.g. <span style="color: blue; font-family: monospace;">2.4</span>) supported by the Cloud Controller.<br />
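For example:<br />
<span style="color: blue; font-family: monospace;">X-Broker-Api-Version: 2.4</span><br />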
<br />
<h4 style="text-align: left;">
Managing Service Brokers</h4>
Once the previous endpoints are implemented, the SB can be registered in Cloud Foundry and exposed to applications with the following command:<br />
<span style="color: blue; font-family: monospace;">$ cf create-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD </span><span style="color: blue; font-family: monospace;">http://broker-url/</span><br />
<br />
To check that the service broker was successfully registered:<br />
<span style="color: blue; font-family: monospace;">$ cf service-brokers</span><br />
<br />
Other management operations are available to update, rename or delete a service broker:<br />
<span style="color: blue; font-family: monospace;">$ cf update-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD </span><span style="color: blue; font-family: monospace;">http://broker-url/</span><br />
<span style="color: blue; font-family: monospace;">$ cf rename-service-broker SERVICE_BROKER_NAME NEW_</span><span style="color: blue; font-family: monospace;">SERVICE_BROKER_NAME</span><br />
<span style="color: blue; font-family: monospace;">$ cf delete-service-broker SERVICE_BROKER_NAME</span><br />
<br />
Once the SB is created in the CF database, its service plans can be viewed with:<br />
<span style="color: blue; font-family: monospace;">$ cf service-access</span><br />
<br />
By default, all plans are disabled. Pick the service name from the output of the previous command and then:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf enable-service-access SERVICE_NAME </span><span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> enable access to service</span><br />
<span style="color: blue; font-family: monospace;">$ cf marketplace -s </span><span style="color: blue; font-family: monospace;">SERVICE_NAME </span><span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> output service plans</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<b>Managing Services</b></div>
Once a service is available in the marketplace, an instance of it can be created:<br />
<span style="color: blue; font-family: monospace;">$ cf create-service SERVICE_NAME </span><span style="color: blue; font-family: monospace;">SERVICE_PLAN </span><span style="color: blue; font-family: monospace;">SERVICE_INSTANCE_NAME</span><br />
Then service instances can be seen with:<br />
<span style="color: blue; font-family: monospace;">$ cf services</span><br />
<br />
<b>Connecting service to </b><b>application</b><br />
To be able to connect an application to a service (running on a different network) and communicate with it, a route should be added through the definition of a <a href="http://docs.pivotal.io/pivotalcf/adminguide/app-sec-groups.html">Security group</a>. Security groups allow you to control the outbound traffic of a CF app:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf create-security-group my_security_settings security.json</span><br />
<br />
The content of security.json is as follows:<br />
<span style="color: blue; font-family: monospace;">[</span><br />
<span style="color: blue; font-family: monospace;"> {</span><br />
<span style="color: blue; font-family: monospace;"> "protocol": "tcp",</span><br />
<span style="color: blue; font-family: monospace;"> "destination": "192.168.2.0/24",</span><br />
<span style="color: blue; font-family: monospace;"> "ports":"80"</span><br />
<span style="color: blue; font-family: monospace;"> }</span><br />
<span style="color: blue; font-family: monospace;">]</span><br />
<br />
Then, binding to a service instance should be performed as follows:<br />
<span style="color: blue; font-family: monospace;">$ cf bind-service APP_NAME </span><span style="color: blue; font-family: monospace;">SERVICE_INSTANCE_NAME</span><br />
Now, the application running on CF can access service instances through the credentials available from the environment variable <span style="color: blue; font-family: monospace;">VCAP_SERVICES</span>.<br />
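The content of this variable (including the credentials returned by the Service Broker at binding time) can be inspected from the CLI with:<br />
<span style="color: blue; font-family: monospace;">$ cf env APP_NAME</span><br />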
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Managed services in CloudFoundry - <a href="http://software.danielwatrous.com/managed-services-in-cloudfoundry/">link</a></li>
<li>CloudFoundry and Apache Brooklyn for automating PaaS with a Service Broker - <a href="http://thenewstack.io/cloud-foundry-and-apache-brooklyn-for-automating-paas-with-a-service-broker/">link</a></li>
<li>Leveraging Riak with CloudFoundry - <a href="http://basho.com/tag/service-broker/">link</a></li>
</ul>
<br />
<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-71381442641641613052015-03-11T02:41:00.001-07:002015-03-11T02:41:36.908-07:00Pushing applications to CloudFoundry the Java way<div dir="ltr" style="text-align: left;" trbidi="on">
CloudFoundry provides a <a href="http://docs.run.pivotal.io/buildpacks/java/java-client.html">Java API</a> that can be used to do anything the CLI can. Following are the steps that show how to connect and push an application to CF using Java code:<br />
<br />
<b>1. Skip SSL validation</b><br />
You may have to skip SSL validation to avoid <span style="color: red; font-family: monospace;">sun.security.validator.ValidatorException</span>:<br />
<pre class="brush:java">SSLContext ctx = SSLContext.getInstance("TLS");
X509TrustManager tm = new X509TrustManager() {
public void checkClientTrusted(X509Certificate[] xcs, String string) {
}
public void checkServerTrusted(X509Certificate[] xcs, String string) {
}
public X509Certificate[] getAcceptedIssuers() {
return null;
}
};
ctx.init(null, new TrustManager[] { tm }, null);
SSLContext.setDefault(ctx);
</pre>
<br />
<b>2. Connect to CloudFoundry</b><br />
<pre class="brush:java">Connect to the CloudFoundry API endpoint (e.g. https://api.run.pivotal.io) and authenticatewith your credentials:</pre>
<pre class="brush:java">String user = "admin";
String password = "admin";
String target = "https://api.10.244.0.34.xip.io";
CloudCredentials credentials = new CloudCredentials(user, password);
HttpProxyConfiguration proxy = new HttpProxyConfiguration("proxy_hostname", proxy_port);
CloudFoundryClientclient = new CloudFoundryClient(credentials, target, org, space, proxy);
</pre>
<br />
<b>3. Create an application</b><br />
<pre class="brush:java">String appName = "my-app";
List<string> urls = Arrays.asList("my-app.10.244.0.34.xip.io");
Staging staging = new Staging(null, "app_buildpack_git_repo");
client.createApplication(appName, staging, disk, mem, urls, Collections.<string> emptyList());
</string></string></pre>
<br />
<b>4. Push the application
</b><br />
<pre class="brush:java">ZipFile file = new ZipFile(new File("path_to_app_archive_file"));
ApplicationArchive archive = new ZipApplicationArchive(file);
client.uploadApplication(appName, archive);
</pre>
<br />
<b>5. Check the application state
</b><br />
<pre class="brush:java">StartingInfo startingInfo = client.startApplication(appName);
System.out.println("Starting application: %s on %s", appName, startingInfo.getStagingFile());
CloudApplication application : client.getApplications()
System.out.printf(" %s (%s)%n", application.getName(), application.getState());
</pre>
<br />
<b>6. Disconnect from CloudFoundry
</b><br />
<pre class="brush:java">client.logout();</pre>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-65966005116065116542015-02-24T09:34:00.002-08:002015-06-05T10:31:39.239-07:00Installing Cloud Foundry v2 locally on Vagrant<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
Cloud Foundry (CF)</h3>
<a href="http://www.cloudfoundry.org/">CloundFoundry</a> (CF) is one of the many PaaS available out there that aims to empower developers to build their applications (e.g. web) without caring about infrastructure details. The PaaS handles the deployment, scaling and management of the apps in the cloud data center, thus boosting the developer productivity.<br />
CF has many advantages over other PaaS solutions as it is open source, it has a fast growing community and many big cloud actors are involved in the development and spreading it adoption. It also can be run anywhere even on a laptop and this what this post is about. So keep reading..<br />
<h4 style="text-align: left;">
Terminology</h4>
<div style="text-align: left;">
<b><i>- Bosh</i></b> is an open-source platform that helps deploy and manage systems on cloud infrastructures (AWS, OpenStack/CloudStack, vSphere, vCloud, etc.).</div>
<b><i>- Bosh Lite</i></b> is a lightweight version of <b><i>Bosh</i></b> that can be used to deploy systems locally by using Vagrant instead of a cloud infrastructure (e.g. AWS) and Linux Containers (the Warden project) to run your system instead of VMs.<br />
<b><i>- Stemcell</i></b> is a template VM that will be used by <b><i>Bosh</i></b> to create VMs and deploy them to the cloud. It essentially contains an OS (e.g. CentOS) and a Bosh Agent so that it can be controlled.<br />
<br />
<div style="text-align: left;">
<b>1. Install Git</b></div>
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">sudo apt-get install git</span><br />
<br />
<b>2. Install VirtualBox</b><br />
<span style="color: blue; font-family: monospace;">$ sudo echo "deb http://download.virtualbox.org/virtualbox/debian precise contrib" >> /etc/apt/sources.list</span><br />
or create a new .list file as described in this <a href="http://stackoverflow.com/questions/1584066/append-to-etc-apt-sources-list">thread</a>.<br />
<span style="color: blue; font-family: monospace;">$ wget -q http://download.virtualbox.org/virtualbox/debian/oracle_vbox.asc -O- | sudo apt-key add -</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get update</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install virtualbox-4.3</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install dkms</span><br />
<span style="color: blue; font-family: monospace;">$ VBoxManage --version</span><br />
<span style="color: blue; font-family: monospace;">4.3.10_Ubuntur93012</span><br />
<br />
<b>3. Install Vagrant</b> (the known version to work with bosh-lite is 1.6.3 - <a href="https://github.com/cloudfoundry/bosh-lite#prepare-the-environment">link</a>)<br />
<span style="color: blue; font-family: monospace;">$ wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.6.3_x86_64.deb</span><br />
<span style="color: blue; font-family: monospace;">$ sudo dpkg -i vagrant_1.6.3_x86_64.deb</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant --version</span><br />
<span style="color: blue; font-family: monospace;">Vagrant 1.6.3</span><br />
<br />
Check that Vagrant works correctly with the installed VirtualBox:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">vagrant init hashicorp/precise32</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant up</span><br />
<br />
<b>4. Install Ruby(using RVM) + RubyGems + Bundler</b><br />
<b>4.1. Install rvm</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">curl -sSL https://rvm.io/mpapis.asc | gpg --import -</span><br />
<span style="color: blue; font-family: monospace;">$ curl -sSL https://get.rvm.io | bash -s stable</span><br />
<span style="color: blue; font-family: monospace;">$ source /home/{username}/.rvm/scripts/rvm</span><br />
<span style="color: blue; font-family: monospace;">$ rvm --version</span><br />
<br />
<b>4.2. Install latest ruby version</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">rvm install 1.9.3-p551</span><br />
<span style="color: blue; font-family: monospace;">$ ruby -v</span><br />
<span style="color: blue; font-family: monospace;">ruby 1.9.3p551 (2014-11-13 revision 48407) [x86_64-linux]</span><br />
<br />
<b>5. Install Bosh CLI</b> (check the prerequisites for the target OS <a href="http://bosh.io/docs/bosh-cli.html">here</a>)<br />
- Note that the Bosh CLI is not supported on Windows - <a href="https://github.com/cloudfoundry/bosh-lite/issues/233#issuecomment-71705208">github issue</a><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install build-essential libxml2-dev libsqlite3-dev libxslt1-dev libpq-dev libmysqlclient-dev</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">gem install bosh_cli</span><br />
<br />
<b>6. Install Bosh-Lite</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">git clone https://github.com/cloudfoundry/bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant up --provider=virtualbox</span><br />
<br />
In case the following message is seen <span style="background-color: white; font-family: Courier New, Courier, monospace; font-size: x-small;"><b><i>The guest machine entered an invalid state while waiting for it to boot</i></b></span>, then:<br />
<ul style="text-align: left;">
<li>check whether virtualisation (Intel VT-x / AMD-V for 32 bits, or Intel EPT / AMD RVI for 64 bits) is enabled on the target system (see <a href="http://superuser.com/questions/22915/how-do-i-enable-vt-x">here</a>). If not, enable it from the BIOS; for ESXi check <a href="http://www.virtuallyghetto.com/2012/08/how-to-enable-nested-esxi-other.html">link1</a> and <a href="https://communities.vmware.com/docs/DOC-8970">link2</a>, add <span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><b style="background-color: #cccccc;">vhv.enable = "TRUE"</b></span> to the VM configuration file (i.e. the .vmx file), and make sure the VM is of hardware version 9. </li>
<li>You may also have to check whether the USB 2.0 controller is enabled; if it is, disable it.</li>
</ul>
Target the BOSH Director<br />
<span style="color: blue; font-family: monospace;">$ cd ..</span><br />
<span style="color: blue; font-family: monospace;">$ bosh target 192.168.50.4 lite</span><br />
<span style="color: blue; font-family: monospace;">$ bosh login</span><br />
<span style="color: blue; font-family: monospace;">Your username: admin</span><br />
<span style="color: blue; font-family: monospace;">Enter password: *****</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">Logged in as `admin'</span><br />
<br />
Setup a route between the laptop and the VMs running inside Bosh Lite<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ ./bin/add-route</span><br />
<br />
<b>7. Deploy Cloud Foundry</b><br />
Install spiff<br />
<div>
<div>
<span style="color: blue; font-family: monospace;">$ brew tap xoebus/homebrew-cloudfoundry</span></div>
<span style="color: blue; font-family: monospace;">
</span><span style="color: blue; font-family: monospace;"></span>
<br />
<div>
<span style="color: blue; font-family: monospace;">$ brew install spiff</span></div>
<span style="color: blue; font-family: monospace;">
</span><span style="color: blue; font-family: monospace;"></span>
<br />
<div>
<span style="color: blue; font-family: monospace;">$ spiff</span></div>
<span style="color: blue; font-family: monospace;">
</span>To install spiff on linux systems check this <a href="https://github.com/cloudfoundry-incubator/spiff/issues/29">issue</a>.<br />
<br />
Upload latest stemcell<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">wget http://bosh-jenkins-artifacts.s3.amazonaws.com/bosh-stemcell/warden/latest-bosh-stemcell-warden.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ bosh upload stemcell latest-bosh-stemcell-warden.tgz</span><br />
Check the stemcells<br />
<span style="color: blue; font-family: monospace;">$ bosh stemcells</span><br />
<br />
Upload latest CF release<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">git clone </span><span style="color: blue; font-family: monospace;">https://github.com/cloudfoundry/cf-release</span><br />
<span style="color: blue; font-family: monospace;">$ export CF_RELEASE_DIR=$PWD/</span><span style="color: blue; font-family: monospace;">cf-release/</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">bosh upload release </span><span style="color: blue; font-family: monospace;">cf-release/</span><span style="color: blue; font-family: monospace;">releases/cf-XXX.yml</span><br />
<br />
Deploy CF releases<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite/</span><br />
<span style="color: blue; font-family: monospace;">$ ./bin/provision_cf</span><br />
<span style="color: blue; font-family: monospace;">$ bosh target </span><span style="color: #666666;"><span style="font-family: monospace;"># </span><span style="font-family: monospace;">check the target director</span></span><br />
<span style="color: blue; font-family: monospace;">$ bosh vms </span><span style="color: #666666;"><span style="font-family: monospace;"># </span><span style="font-family: monospace;">check the installed VMs on the cloud</span></span><br />
<br />
Manually (to be continued)<br />
Generate a configuration file manifests/cf-manifest.yml<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p go</span><br />
<span style="color: blue; font-family: monospace;">$ export GOPATH=~/go</span><br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">./bin/make_manifest_spiff</span><br />
<br />
Deploy release<br />
<span style="color: blue; font-family: monospace;">$ bosh deploy</span><br />
<br />
Install CF <a href="https://github.com/cloudfoundry/cli">CLI</a><br />
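One way to do this on Linux is to download a release binary from the project's GitHub releases page and put it on the PATH; a rough sketch (the version and archive name below are illustrative, pick the ones matching your platform):<br />
<span style="color: blue; font-family: monospace;">$ wget https://github.com/cloudfoundry/cli/releases/download/vX.Y.Z/cf-linux-amd64.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ tar xzf cf-linux-amd64.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ sudo mv cf /usr/local/bin/</span><br />
<span style="color: blue; font-family: monospace;">$ cf --version</span><br />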
<br />
Play with CF<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">cf api --skip-ssl-validation https://api.10.244.0.34.xip.io</span><br />
<span style="color: blue; font-family: monospace;">$ cf login</span><br />
<span style="color: blue; font-family: monospace;">$ cf create-org ORG_NAME</span><br />
<span style="color: blue; font-family: monospace;">$ cf orgs</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cf target -o </span><span style="color: blue; font-family: monospace;">ORG_NAME</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf create-space SPACE_NAME</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cf target -o </span><span style="color: blue; font-family: monospace;">ORG_NAME</span><span style="color: blue; font-family: monospace;"> -s </span><span style="color: blue; font-family: monospace;">SPACE_NAME</span></div>
<div>
<br />
To access the VM from the LAN (i.e. another machine):<br />
<ol style="text-align: left;">
<li>Install an HTTP Proxy (e.g. <a href="https://help.ubuntu.com/lts/serverguide/squid.html">squid3</a>),</li>
<li>Configure CF <a href="http://docs.cloudfoundry.org/devguide/installcf/http-proxy.html">HTTP_PROXY</a> environment variable, and </li>
<li>Configure the proxy:</li>
</ol>
<span style="color: blue; font-family: monospace;"> $ sudo nano </span><span style="color: blue; font-family: monospace;">/etc/squid3/squid.conf</span><span style="color: blue; font-family: monospace;"> </span><br />
<span style="color: blue; font-family: monospace;"> acl </span><span style="color: blue; font-family: monospace;">local</span><span style="color: blue; font-family: monospace;">_network src 192.168.2.0/24</span><br />
<span style="color: blue; font-family: monospace;"> http_access allow local_network</span><br />
<br />
Stopping CF<br />
Shutting down the bosh-lite VM can be surprisingly tricky. It may be better to stop the VM with:<br />
<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">vagrant suspend</span> to save current state for next start up, or</li>
<li><span style="color: blue; font-family: monospace;">vagrant halt</span>, then next time to start CF use <span style="color: blue; font-family: monospace;">vagrant up</span> followed by <span style="color: blue; font-family: monospace;">bosh cck </span>(<a href="https://github.com/cloudfoundry/bosh-lite/blob/master/docs/bosh-cck.md">documentation</a>).</li>
</ul>
<br />
<br />
<b>Troubleshooting</b><br />
<span style="color: blue; font-family: monospace;">$ bosh ssh </span><span style="color: #666666; font-family: monospace;"># </span><span style="color: #666666; font-family: monospace;">then choose the job to access (password: admin)</span><br />
<span style="color: blue; font-family: monospace;">bosh_something@something:~</span><span style="color: blue; font-family: monospace;">$ sudo /var/vcap/bosh/bin/monit summary</span><br />
Find the Bosh Lite IP address<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite/</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant ssh</span><br />
<span style="color: blue; font-family: monospace;">vagrant@agent-id-bosh-0:~$ ifconfig</span><br />
<span style="color: blue; font-family: monospace;">vagrant@agent-id-bosh-0:~$ exit</span><br />
<br />
Complete installation script can be found <a href="https://gist.github.com/dzlab/986a2b79ecabe725d324">here</a>.<br />
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Installing latest versions for virtualbox and vagrant - <a href="http://www.icchasethi.com/installing-latest-vagrant-and-virtualbox-version-on-ubuntu-12-01/">link</a></li>
<li>Installing ruby with rvm - <a href="http://fhanik.blogspot.fr/2013/11/installing-ruby-193-on-ubuntu-1204.html">link</a>.</li>
<li>DIY PaaS (CF v1) running DEA <a href="http://starkandwayne.com/articles/2013/02/05/diy-paas-running-apps-with-a-dea/">link1</a>, stagging applications <a href="http://starkandwayne.com/articles/2013/02/14/diy-paas-staging-an-app/">link2</a>.</li>
<li>Deploying CF Playground (a kind of web admin interface) - <a href="https://blog.starkandwayne.com/2014/09/13/deploying-cfplayground-to-cloud-foundry/">link</a></li>
<li>Installing CF on vagrant - <a href="http://blog.cloudfoundry.org/2013/06/27/installing-cloud-foundry-on-vagrant/">link</a> <a href="https://www.youtube.com/watch?v=DYn_B_IPmM8">video</a></li>
<li>Installing BOSH lite - <a href="https://github.com/cloudfoundry/bosh-lite">github repo</a>, <a href="https://blog.starkandwayne.com/2014/12/16/running-cloud-foundry-locally-with-bosh-lite/">tutorial</a></li>
<li>Deploying CF using BOSH lite - <a href="https://github.com/cloudfoundry/bosh-lite/blob/master/docs/deploy-cf.md">github repo</a>, <a href="https://github.com/cloudfoundry-community/bosh-lite-demo">demo</a></li>
<li>http://altoros.github.io/2013/using-bosh-lite/</li>
<li>Installing a new hard drive - <a href="https://help.ubuntu.com/community/InstallingANewHardDrive">link</a></li>
<li>xip.io a free internet service providing DNS wildcard - <a href="http://xip.io/">link</a></li>
<li>Troubleshooting with Bosh CLI - <a href="http://docs.pivotal.io/pivotalcf/customizing/trouble-advanced.html">official doc</a>, <a href="http://docs.cloudfoundry.org/devguide/deploy-apps/troubleshoot-app-health.html">app health</a>, <a href="https://github.com/yudai/cf_nise_installer">monit summary</a></li>
<li>Remotely debug a CF application - <a href="http://blog.altoros.com/how-to-remotely-debug-cloud-foundry-apps.html">link</a></li>
<li>CloudFoundry manifest.yml generator - <a href="http://cfmanigen.mybluemix.net/">link</a></li>
</ul>
<br />
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-85264137553999743642014-09-04T10:38:00.002-07:002014-09-05T10:50:38.385-07:00Getting started with Hive<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Introducing Hive</h4>
<div style="text-align: left;">
Hive installation is straightforward (not much to configure):</div>
<span style="color: blue; font-family: monospace;">$ wget http://mir2.ovh.net/ftp.apache.org/dist/hive/stable/apache-hive-<version>-bin.tar.gz</version></span><br />
<span style="color: blue; font-family: monospace;">$ tar xzf </span><span style="color: blue; font-family: monospace;">apache-hive-<version>-bin.tar.gz</version></span><br />
<span style="color: blue; font-family: monospace;">$ cd </span><span style="color: blue; font-family: monospace;">apache-hive-<version>-bin/bin/</version></span><br />
<span style="color: blue; font-family: monospace;">hive> show tables;</span><br />
<br />
Notice that the environment variable <span style="color: blue; font-family: monospace;">HIVE_HOME</span> is not required (which is not the case for hadoop/hbase/tez). hive-site.xml is not required either, but if we want to use an HDFS directory as the warehouse location it should contain something like:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.metastore.warehouse.dir</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>hdfs://namenode_hostname/user/hive/warehouse</value></span><br />
<span style="color: blue; font-family: monospace;"> <description>location of default database for the warehouse</description></span><br />
<span style="color: blue; font-family: monospace;"></property></span></div>
<h4 style="text-align: left;">
Introducing HQL</h4>
So let's create a table 'quotes' and make it available to other Hadoop programs as a text file:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">CREATE EXTERNAL TABLE quotes (symbol STRING, name STRING, price DOUBLE)</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">ROW FORMAT DELIMITED FIELDS TERMINATED by ',' LINES TERMINATED BY '\n'</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">STORED AS TEXTFILE</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">LOCATION '/tmp/quotes.txt';</span><br />
<br />
Then, we can load data into this table, for instance from a local file quotes.csv that looks like:<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GE","General Electric ",28.09</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"MSFT","Microsoft Corpora",41.66</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GOOG","Google Inc.",604.83</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GM","General Motors Co",41.85</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"FB","Facebook, Inc.",72.59</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"AAPL","Apple Inc.",607.33</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"T","AT&T Inc.",37.15</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"VZ","Verizon Communica",52.06</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"TM","Toyota Motor Corp",134.94</span><br />
<br />
with the following query:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> LOAD DATA LOCAL INPATH '/path/to/quotes.csv'</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">OVERWRITE INTO TABLE quotes;</span><br />
<br />
Once the table is filled, we can query it with things like:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> SELECT * FROM quotes;</span><br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> SELECT symbol FROM quotes;</span><br />
<br />
We can export and save the result of a query into a local directory, say under /tmp:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/quotes_100'</span><br />
<span style="color: blue; font-family: monospace;">SELECT *</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">WHERE quotes.</span><span style="color: blue; font-family: monospace;">pric</span><span style="color: blue; font-family: monospace;">e > 100;</span><br />
The result of this export is a set of files under the quotes_100 directory; the list of quotes that match the criteria can be found in a file named 000000_0.<br />
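The exported rows can then be inspected directly from the shell, for example:<br />
<span style="color: blue; font-family: monospace;">$ cat /tmp/quotes_100/000000_0</span><br />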
<h4 style="text-align: left;">
Tuning Hive</h4>
Understanding how Hive plans the execution of queries is essential for performance tuning. One way to understand the query plan is to use the <b>EXPLAIN</b> keyword:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN </span><span style="color: blue; font-family: monospace;">SELECT * FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">WHERE quotes.</span><span style="color: blue; font-family: monospace;">pric</span><span style="color: blue; font-family: monospace;">e > 100;</span><br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN </span><span style="color: blue; font-family: monospace;">SELECT SUM(price) FROM quotes;</span><br />
The result shows the translation of these queries into different possible operations called stages, for instance map-reduce, sampling, merge, or limit stages.<br />
Using the EXTENDED keyword along with EXPLAIN provides even further details about the query execution plan:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN EXTENDED </span><span style="color: blue; font-family: monospace;">SELECT SUM(price) FROM quotes;</span><br />
<br />
By default, Hive executes one stage at a time. This behavior can be overridden by setting the hive.exec.parallel property to true in hive-site.xml:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.exec.parallel</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>true</value></span><br />
<span style="color: blue; font-family: monospace;"> <description>Whether to execute jobs in parallel</description></span><br />
<span style="color: blue; font-family: monospace;"></property></span><br />
<br />
The number of mappers/reducers launched is determined by the size of the input files divided by the default size attributed to a given task, which can be configured via:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.exec.reducers.bytes.per.reducer</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>750000000</value></span><br />
<span style="color: blue; font-family: monospace;"> <description></description></span><br />
<span style="color: blue; font-family: monospace;"></property></span><br />
<br />
<b>Resources</b><br />
<br />
<ul style="text-align: left;">
<li><a href="http://www.qubole.com/5-tips-for-efficient-hive-queries/">Tips for efficient hive queries</a></li>
<li><a href="https://github.com/dzlab/bigdata-samples">Check more samples on github</a>.</li>
</ul>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-31323314607245769312014-09-01T04:39:00.000-07:002014-09-01T04:39:15.185-07:00Troubleshooting ubuntu server network interface <div dir="ltr" style="text-align: left;" trbidi="on">
So I've installed Ubuntu server on VirtualBox, and when I activated a second network adapter in bridged mode, the latter was not automatically configured on Ubuntu.<br />
In fact, the interface could not be seen with <span style="color: blue; font-family: monospace;">ifconfig</span>, and <span style="color: blue; font-family: monospace;">ifconfig -a</span> showed it as disabled.<br />
I tried to bring it up and restart networking service:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">ifconfig eth1 up</span><br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/etc/init.d/networking restart</span><br />
Now the interface is active but it has only an IPv6 address and when I restart the virtual machine, the interface goes disabled again.<br />
When checking /etc/network/interfaces there was no eth1, so I added it so that it would be configured automatically:<br />
<span style="color: blue; font-family: monospace;">$ vi /etc/network/interfaces</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;">auto eth1</span><br />
<span style="color: blue; font-family: monospace;">iface eth1 inet dhcp</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
That's it, now the interface works fine.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-25001426942568641072014-08-18T03:13:00.003-07:002014-10-22T10:13:05.163-07:00Comparison between caching systems for Java<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Servers are getting more and more powerful, with a lot of RAM (hundreds to thousands of gigabytes). However, it is still not possible to use most of the available capacity directly in Java applications due to inherent limitations of the GC (Garbage Collector) on the JVM, which may pause the application for a long time (even several minutes) to move objects between generations.<br />
<br />
Following is a description/comparison of some solutions, also called data grids, that can be used to address this problem, such as the Infinispan project from JBoss (formerly JBoss Cache), DirectMemory (an Apache proposal), EhCache (from Terracotta), etc.<br />
<br />
<b>Caches</b><br />
<br />
<div style="text-align: left;">
1. Infinispan (JBoss Data Grid Platform)</div>
<ul style="text-align: left;">
<li>Doesn't provide support for expiration events, as discussed in the <a href="https://community.jboss.org/thread/175533">forum</a>.</li>
<li><a href="http://blog.infinispan.org/2013/07/faster-file-cache-store-no-extra.html">SingleFileCacheStore</a> is a file-based cache store that manages data activation (loading from store to cache) and <a href="http://infinispan.org/docs/6.0.x/user_guide/user_guide.html#cache-passivation">passivation</a> (saving data to store).</li>
<li>List of possible attributes in the XML configuration for <a href="https://docs.jboss.org/infinispan/4.0/apidocs/config.html">infinispan 4.0</a> and <a href="http://docs.jboss.org/infinispan/6.0/configdocs/infinispan-config-6.0.html">infinispan 6.0</a>.</li>
</ul>
<br />
2. <a href="http://www.mapdb.org/">MapDB</a><br />
<ul style="text-align: left;">
<li>Exists only in embedded mode</li>
<li>Enables the creation of on-heap and off-heap collections (map, queue), as well as file-backed collections</li>
<li>Listeners registered to cache events are notified in the main thread (i.e. async notifications should be implemented by the user)</li>
<li>Can be used for lazy loading (e.g. <a href="https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Lazily_Loaded_Records.java">Lazily_Loaded_Records.java</a>).</li>
<li>Provides means for pumping the integral data available on memory to disk (e.g. <a href="https://github.com/jankotek/MapDB/blob/master/src/test/java/org/mapdb/Pump_InMemory_Import_Then_Save_To_Disk.java">Pump_InMemory_Import_Then_Save_To_Disk.java</a>).</li>
<li>The transaction isolation level is <a href="http://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable">Serializable</a>, which is the highest level and means a new transaction can be initiated only once the previous one has committed.</li>
<li>Transactions use a global lock, which considerably reduces cache performance.</li>
</ul>
<br />
3. <a href="http://www.akiban.com/">Akiban</a>'s Persistit - <a href="https://github.com/pbeaman/persistit">github</a><br />
<ul style="text-align: left;">
<li>key/value data storage library</li>
<li>Transactions are based on the <a href="http://en.wikipedia.org/wiki/Snapshot_isolation">Snapshot Isolation</a> algorithm to provide high concurrency.</li>
<li>Used by <a href="https://github.com/thinkaurelius/titan/wiki/Using-Persistit">Titan</a> (a distributed graph database) as its storage layer.</li>
<li>For custom objects, users should provide a serializer for</li>
<ul>
<li>keys by implementing <a href="https://github.com/pbeaman/persistit/blob/master/src/main/java/com/persistit/encoding/KeyCoder.java">com.persistit.encoding.KeyCoder</a>, as well as for</li>
<li>values by implementing <a href="https://github.com/pbeaman/persistit/blob/master/src/main/java/com/persistit/encoding/ValueCoder.java">com.persistit.encoding.ValueCoder</a>,</li>
<li>and declare coder manager.</li>
</ul>
<li>Samples can be found here in <a href="https://github.com/tobrien/persistit-example">Index and Search 2.3 Million Freebase Person Records with Persistit</a>, and <a href="https://github.com/posulliv/nodejs-express-akiban-demo">Simple Blog Application with Akiban and JugglingDB</a>.</li>
</ul>
<div>
4. JCS (Java Caching System)</div>
<ul>
<li>Build faster Web applications with caching - <a href="http://www.ibm.com/developerworks/java/library/j-jcs/index.html">developerWorks</a></li>
<li>Caching with JCS - <a href="http://www.objectpartners.com/2012/12/19/caching-with-jcs/">Object Partners</a></li>
<li>JCS event handling examples on <a href="http://stackoverflow.com/questions/4473479/jcs-notify-on-expire-remove">Stackoverflow</a> and <a href="https://joinup.ec.europa.eu/svn/spocs/eSafe/trunk/ESafeDocX_Open_Module_Core_JEE/src/test/java/eu/spocseu/esafedocx/util/cache/JcsCacheTest.java">SPOCS</a>.</li>
<li>Configuring a JCS Cache - <a href="http://www.informit.com/guides/content.aspx?g=java&seqNum=438">InformIT</a></li>
<li>Introduction, Using, Developing Web applications and Java Object Caching with Java Caching System (JCS) - <a href="http://www.bhaveshthaker.com/29/introduction-using-developing-web-applications-and-java-object-caching-with-java-caching-system-jcs/">bhaveshthaker.com</a>.</li>
</ul>
5. Hazelcast<br />
<ul style="text-align: left;">
<li>Can be backed with different kind of stores <a href="https://github.com/RichardHightower/slumberdb/wiki/Configuring-a-Hazelcast-MySQL-MapStore">mysql</a>, <a href="http://www.enesakar.com/post/45187344794/distribute-with-hazelcast-persist-into-hbase">hbase</a>, etc.</li>
<li>A case of processing Mozilla very large crash reports - <a href="http://highscalability.com/blog/2011/4/12/caching-and-processing-2tb-mozilla-crash-reports-in-memory-w.html">highscalability.com</a></li>
</ul>
6. GridGain<br />
<br />
<ul style="text-align: left;">
<li>Resources: <a href="http://gridgain.blogspot.com/">gridgain.blogspot.com</a></li>
</ul>
<br />
<br />
7. Others: <a href="https://github.com/xerial/larray">LArray</a>, <a href="http://cache2k.org/">Cache2K</a>, DirectMemory (initial project on <a href="https://github.com/raffaeleguidi/DirectMemory/">github</a>, apache <a href="http://wiki.apache.org/incubator/DirectMemoryProposal">proposal</a> for incubation) an off-heap memory store, <a href="http://www.h2database.com/html/mvstore.html">MVStore</a> the storage subsystem of the H2 database, <a href="http://blog.zenika.com/index.php?post/2014/06/03/spring-cache">Spring cache</a>, <a href="https://code.google.com/p/vanilla-java/wiki/HugeCollections">HugeCollections</a>.<br />
<br />
<b>Search</b><br />
<ul style="text-align: left;">
<li><a href="http://www.infoq.com/articles/LuceneHbase">Integrating Lucene with HBase</a> - an article explaining implementation of a Lucene backend based on HBase, the code is on [[github>>https://github.com/akkumar/hbasene]]. Other implementations: <a href="https://github.com/Photobucket/Solbase">Solbase</a>.</li>
<li><a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">Lucandra / Solandra: A Cassandra-based Lucene backend</a> - an article explaining implementation of a Lucene backend based on Cassandra. The project source code is on <a href="https://github.com/tjake/Solandra">github</a>.</li>
<li><a href="http://mprabhat.wordpress.com/2012/08/13/create-lucene-index-in-database-using-jdbcdirectory/">Create Lucene Index in database using JdbcDirectory</a> - an article explaining the use of a database as Lucene backed.</li>
<li><a href="http://www.compass-project.org/">Compass</a> project provides an Java friendly API for wrapping the Lucence api for a better integration with Java/J2ee applications.</li>
</ul>
<b>Resources</b><br />
<ul style="text-align: left;">
<li>A good explanation of the use of <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html">ByteBuffer</a> to build non-heap memory caches by Keith Gregory: <a href="http://www.kdgregory.com/index.php?page=java.byteBuffer">blog post</a>, JUG <a href="http://www.kdgregory.com/programming/java/ByteBuffer_JUG_Presentation.pdf">presentation</a>, another <a href="http://coders.talend.com/sites/default/files/heapoff-wtf_OlivierLamy.pdf">one</a>.</li>
<li>An article on <a href="http://www.infoq.com/articles/Open-JDK-and-HashMap-Off-Heap">InfoQ</a> about HashMap implementation for off-heap map.</li>
<li>An ibm <a href="http://www.redbooks.ibm.com/abstracts/redp5070.html">red book</a> on capacity for big data and off-heap memory.</li>
<li>Examples related to the use of <a href="https://github.com/MathildeLemee/Hands_On_Ehcache">EhCache</a> from a Devoxx 2014 presentation.</li>
</ul>
<b>Benchmarks</b><br />
<ul style="text-align: left;">
<li>Cache2K vs Infinispan/EhCache/JCS - <a href="http://cache2k.org/benchmarks.html">bench</a></li>
<li><a href="https://github.com/radargun/radargun">Radargun</a> a framework for benchmarking data grids</li>
</ul>
<b>Memory storage</b><br />
<br />
In-memory databases (a detailed description can be found at <a href="http://www.informationweek.com/software/information-management/two-approaches-to-in-memory-database-battle/d/d-id/1114088">Information Week</a>):<br />
<ul style="text-align: left;">
<li>NoSQL approaches (covers the class of nonrelational and horizontally scalable databases) like <a href="http://www.aerospike.com/">Aerospike</a>.</li>
<li>NewSQL approaches (emerging databases offering NoSQL scalability but with familiar SQL query capabilities, i.e. SQL-compliant) like <a href="http://voltdb.com/">VoltDB</a>, Oracle TimesTen, IBM solidDB, <a href="http://www.memsql.com/">MemSQL</a>.</li>
</ul>
Companies like Microsoft, Oracle and IBM chose to add in-memory support to their traditional databases (e.g. moving tables to memory), whereas SAP adopted another approach with its Hana platform, which aims to put everything in memory.<br />
<br />
<br />
Some traditional RDBMS can be configured to store their data in-memory instead of disk storage like <a href="https://www.sqlite.org/inmemorydb.html">sqlite</a>, <a href="http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html">MySQL</a>, etc.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com12tag:blogger.com,1999:blog-2035497736124196692.post-5222583321988179652014-06-13T07:38:00.000-07:002014-11-20T04:25:19.149-08:00Getting started with HBase<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Introduction </b><br />
HBase indexes data based on 4D coordinates: rowkey, column family (or a <a href="http://www.toadworld.com/products/toad-for-cloud-databases/w/wiki/321.column-families-101.aspx">collection of columns</a>), column qualifier and version. As a result, HBase can be considered a key-value store with the 4D coordinates as the key and the cell as the value. Depending on how many of these coordinates are specified in a query, the returned value may be a single cell, a map, or a map of maps (see the sketch below).<br />
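<br />
To make this concrete, here is a small sketch using the Java client API covered later in this post; it assumes an open HTableInterface named usersTable with an 'info' column family, and uses the classic (0.9x-era) client methods:<br />
<pre class="brush:java">// rowkey + family + qualifier -> a single cell (latest version)
Get g = new Get(Bytes.toBytes("first"));
Result r = usersTable.get(g);
byte[] cell = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("username"));
// rowkey + family -> a map of qualifier to (latest) value
NavigableMap<byte[], byte[]> familyMap = r.getFamilyMap(Bytes.toBytes("info"));
// rowkey only -> a map of family to map of qualifier to map of version to value
NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> all = r.getMap();
</pre>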
<br />
<b>Installation</b><br />
<br />
Installing the latest stable version of HBase:<br />
<span style="color: blue; font-family: monospace;">$ mkdir hbase-install</span><br />
<span style="color: blue; font-family: monospace;">$ cd hbase-install</span><br />
<span style="color: blue; font-family: monospace;">$ wget http://apache.claz.org/hbase/stable/hbase-0.98.3-hadoop2-bin.tar.gz</span><br />
<span style="color: blue; font-family: monospace;">$ tar xvfz hbase-0.98.3-hadoop2-bin.tar.gz</span><br />
<span style="color: blue; font-family: monospace;">$ export HBASE_HOME=`pwd`/hbase-0.98.3-hadoop2</span><br />
<br />
Add the HBase binaries to the PATH:<br />
<span style="color: blue; font-family: monospace;">$ export PATH=$PATH:$HBASE_HOME/bin/</span><br />
<br />
# you need the <span style="color: blue; font-family: monospace;">JAVA_HOME </span>variable to be set; if you're using OpenJDK, you can set it to:<br />
<span style="color: blue; font-family: monospace;">$ export JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<br />
Running a standalone version<br />
<span style="color: blue; font-family: monospace;">$ start-hbase.sh</span><br />
<br />
Once the master is launched you can access the web admin interface at <a href="http://localhost:60010/">http://localhost:60010/</a><br />
<br />
By default, HBase will write data into the <span style="color: blue; font-family: monospace;">/tmp</span> directory. You can change this by editing <span style="color: blue; font-family: monospace;">$HBASE_HOME/conf/hbase-site.xml</span> and setting the following property (the complete list of properties can be found in the <a href="http://hbase.apache.org/book/config.files.html#hbase_default_configurations">official documentation</a>):<br />
<property><br />
<name>hbase.rootdir</name><br />
<value>file:///path/to/hbase/directory</value><br />
</property><br />
<br />
The <span style="color: blue; font-family: monospace;">$HBASE_HOME/conf/hbase-env.sh</span> bash file can be run to setup hbase configuration, for instance setting environment variables. For further information on configuring HBase, check the <a href="http://hbase.apache.org/book/quickstart.html">official documentation</a>.<br />
<br />
<b>Shell-based interaction</b><br />
Along with the installation binaries, there is a JRuby-based shell that wraps a Java client to interact with HBase interactively (sending commands and receiving responses directly in the terminal) or via shell scripts.<br />
<br />
To validate the installation, let's run the HBase shell and manipulate some data:<br />
<span style="color: blue; font-family: monospace;">$ hbase shell</span><br />
<span style="color: #666666; font-family: monospace;"># check existing tables</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):001:> list</span><br />
<span style="color: #666666; font-family: monospace;"># create table of column famity 'cf'</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):002:> create 'mytable', 'cf'</span><br />
<span style="color: #666666;"><span style="font-family: monospace;">#</span><span style="font-family: monospace;"> </span><span style="font-family: monospace;">write 'hello hbase' in first row of column 'cf:message' of table 'mytable'</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):003:> </span><span style="color: blue; font-family: monospace;">put 'mytable', 'first', 'cf:message', 'hello HBase'</span><br />
<span style="font-family: monospace;"><span style="color: #666666;"># create a user table of 'info' famity</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):004:> </span><span style="color: blue; font-family: monospace;">create 'users', 'info'</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):005:> </span><span style="color: blue; font-family: monospace;">put 'mytable', 'second', 'cf:foo', 3.14159</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):006:></span> <span style="color: blue; font-family: monospace;">put </span><span style="color: blue; font-family: monospace;">'users', 'first', 'cf:username', "John Doe"</span><br />
<span style="font-family: monospace;"><span style="color: #666666;"># reading the first row from a table</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):007:> </span><span style="color: blue; font-family: monospace;">get 'mytable', 'first'</span><br />
<span style="color: #666666;"><span style="font-family: monospace;"># reading the</span><span style="font-family: monospace;"> whole rows from a table</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):008:> </span><span style="color: blue; font-family: monospace;">scan 'mytable'</span><br />
<br />
<b>Java-based interaction</b><br />
<br />
<pre class="brush:java">// define a custom configuration (by default the content of hbase-site.xml is used)
Configuration myConf = HBaseConfiguration.create();
myConf.set("param_name", "param_value");
// e.g. to connect to a remote HBase instance you need to set Zookeeper quorum address and port number
myConf.set("hbase.zookeeper.quorum", "serverip");</pre>
<pre class="brush:java">myConf.set("hbase.zookeeper.property.clientPort", "2181");
// establish a connection
HTableInterface myTable = new HTable(myConf, "users");
// Use pool for a better reuse of connections which are expensive resources
HTablePool pool = new HTablePool(myConf, max_nb_connection);
HTableInterface myTable = pool.getTable("mytable");
...
// close connection and returned to the pool
myTable.close();</pre>
<br />
In HBase, data is manipulated as raw bytes; Java types should be converted to bytes with the help of the utility class Bytes. The HBase API for manipulating data is divided into commands: Get, Put, Delete, Scan and Increment. For example, data can be stored as follows:<br />
<pre class="brush:java">// create a command with row key TheRealMT
Put p = new Put(Bytes.toBytes("TheRealJD"));
// add information about user
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("john.doe@acme.inc"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("pass00"));
</pre>
<br />
Once the entry is ready, we can send it to HBase for persistence:<br />
<pre class="brush:java">HTableInterface usersTable = pool.getTable("users");
Put p = new Put(Bytes.toBytes("TheRealJD"));
p.add(...);
usersTable.put(p);
usersTable.close();</pre>
<br />
The Put command can also be used to update the user information:<br />
<pre class="brush:java">Put p = new Put(Bytes.toBytes("TheRealJD"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("securepass"));
usersTable.put(p);</pre>
<br />
The HBase client does not interact directly with the storage layer, which is made up of HFiles. Instead, HBase writes all operations to a Write-Ahead Log (WAL) for durability and failure recovery, while the data is stored in a memory region called the MemStore; once it fills up, its entire content is flushed to a new immutable file called an HFile (existing HFiles are never modified).<br />
This can be customized. For instance, the size of this region can be set via the hbase.hregion.memstore.flush.size parameter (a per-table override is sketched after the next snippet). Also, the WAL can be disabled with:<br />
<pre class="brush:java">Put p = new Put();
p.setWriteToWAL(false);</pre>
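<br />
As a sketch of a per-table override of the flush size mentioned above (assuming the classic client's HTableDescriptor.setMemStoreFlushSize method; 'admin' is an HBaseAdmin as created further below):<br />
<pre class="brush:java">// hypothetical per-table override: flush the MemStore of 'users' every 64MB
HTableDescriptor desc = new HTableDescriptor("users");
desc.setMemStoreFlushSize(64 * 1024 * 1024);
desc.addFamily(new HColumnDescriptor("info"));
admin.createTable(desc);
</pre>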
<br />
The Get command is used to query data from a set of given columns:
<br />
<pre class="brush:java">Get g = new Get(Bytes.toBytes("TheRealJD"));
g.addFamily(Bytes.toBytes("info"));
g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
Result r = usersTable.get(g);
byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
String email = Bytes.toString(b);
</pre>
As HBase is versioned, we can look at past values of a cell:
<br />
<pre class="brush:java">List<keyvalue> passwords = r.getColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
b = passwords.get(0).getValue();
String currentPassword = Bytes.toString(b);
b = passwords.get(1).getValue();
String previousPassword = Bytes.toString(b);
// the versions are by default the timestamp in milliseconds of the moment when the operation was performed
long version = passwords.get(0).getTimestamp();</pre>
<br />
The Delete command is used to delete data from HBase<br />
<pre class="brush:java">Delete d = new Delete(Bytes.toBytes("TheRealJD"));
// remove the latest version of a column
d.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
// remove all versions of a column (to remove an entire row, pass no columns to the Delete)
d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email"));
usersTable.delete(d);
</pre>
The delete operation is logical, meaning the record concerned is flagged as deleted and will no longer be returned by a get or scan. It is not until compaction (merging HFiles into a single bigger one) that the record is effectively deleted. More details on the compaction operation can be found in this <a href="http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/">article</a>.
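<br />
The Increment command, listed above but not shown so far, atomically increments a numeric counter cell. A small sketch (the 'logins' qualifier is only an illustrative name):<br />
<pre class="brush:java">Increment incr = new Increment(Bytes.toBytes("TheRealJD"));
incr.addColumn(Bytes.toBytes("info"), Bytes.toBytes("logins"), 1L);
usersTable.increment(incr);
// the same can be done in a single call that returns the new counter value
long logins = usersTable.incrementColumnValue(
    Bytes.toBytes("TheRealJD"), Bytes.toBytes("info"), Bytes.toBytes("logins"), 1L);
</pre>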
<br />
<br />
Creating a table programmatically:<br />
<pre class="brush:java">Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("UserFeed");
// create a column family
HColumnDescriptor c = new HColumnDescriptor("stream");
c.setMaxVersions(1);
desc.addFamily(c);
admin.createTable(desc);
</pre>
<br />
Once the table is created we can insert data into it. We may hash the row key used for users (i.e. TheRealJD) to avoid variable-length row keys and for better performance:<br />
<pre class="brush:java">// prepare the value of the row key
int longLength = Long.SIZE / 8;
byte[] userHash = Md5Utils.md5sum("TheRealJD");
byte[] timestamp = Bytes.toBytes(-1 * System.currentTimeMillis());
byte[] rowKey = new byte[Md5Utils.MD5_LENGTH + longLength];
int offset = 0;
offset = Bytes.putBytes(rowKey, offset, userHash, 0, userHash.length);
Bytes.putBytes(rowKey, offset, timestamp, 0, timestamp.length);
// prepare the put command
Put put = new Put(rowKey);
// we may need to store the real value of user id to be able to find the associated user when scanning the feeds table
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"), Bytes.toBytes("TheRealMT"));
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"), Bytes.toBytes("Hello world!"));
</pre>
<br />
When it comes to scanning the feeds table, things are easy thanks to the row key starting with a hash of the user row key.
<br />
<pre class="brush:java">byte[] userHash = Md5Utils.md5sum(user);
byte[] startRow = Bytes.padTail(userHash, longLength);
// create a stop key equal to the increment of the last byte of user id
byte[] stopRow = Bytes.padTail(userHash, longLength);
stopRow[Md5Utils.MD5_LENGTH-1]++;
Scan s = new Scan(startRow, stopRow);
ResultScanner rs = feedsTable.getScanner(s);
// extract the columns (as created previously) from each result
for(Result r: rs) {
// extract the username
byte[] b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"));
String user = Bytes.toString(b);
// extract the feed
b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"));
String feed = Bytes.toString(b);
// extract the timestamp
b = Arrays.copyOfRange(r.getRow(), Md5Utils.MD5_LENGTH, Md5Utils.MD5_LENGTH+longLength);
DateTime dt = new DateTime(-1 * Bytes.toLong(b));
}
</pre>
By default, each RPC call from the client to HBase will return only one row (i.e. no caching), which is not good when scanning a whole table. We can make each call return n rows by setting the property <span style="color: blue; font-family: monospace;">hbase.client.scanner.caching</span> or calling <span style="color: blue; font-family: monospace;">Scan.setCaching(int)</span>.<br />
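For instance, a sketch of a scan fetching 100 rows per RPC:<br />
<pre class="brush:java">Scan s = new Scan(startRow, stopRow);
s.setCaching(100); // each RPC to a region server now returns up to 100 rows
ResultScanner rs = feedsTable.getScanner(s);
</pre>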
<br />
Continue here.<br />
<br />
<b>Resources</b><br />
<br />
<ul style="text-align: left;">
<li>HBase administration using the Java API - <a href="http://linuxjunkies.wordpress.com/2011/12/03/hbase-administration-using-the-java-api-using-code-examples/">linuxjunkies.wordpress.com</a></li>
<li>Many examples for Delete operation - <a href="http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.hbase.client.Delete">programcreek.com</a></li>
<li>Coprocessor introduction - <a href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">official blog</a>, <a href="http://fr.slideshare.net/schubertzhang/coprocessor-introduction-20120830a">presentation</a>.</li>
</ul>
</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-83743820731689804582014-05-28T08:58:00.002-07:002014-05-31T03:00:55.276-07:00Indexing keys and values in MapDB<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://mapdb.org/">MapDB</a> is a high performance pure java database, it provides concurrent collections (Maps, Sets and Queues) backed by disk storage or off-heap memory.<br />
It provides a powerful mechanism to synchronize collections that can be used to build multiple indexes on a primary collection. Follows is an example showing how to index keys and also values of main collection.<br />
<br />
1. define a serializable class<br />
<pre class="brush:java">// this class should implement serializable in order to be stored
public class Person implements Serializable {
String firstname;
String lastname;
Integer age;
boolean male;
public Person(String f, String l, Integer a, boolean m) {
this.firstname = f;
this.lastname = l;
this.age = a;
this.male = m;
}
public boolean isMale() {
return male;
}
@Override
public String toString() {
return "Person [firstname=" + firstname + ", lastname=" + lastname + ", age=" + age + ", male=" + male + "]";
}
}
</pre>
<br />
2. Define a map of persons by id
<br />
<pre class="brush:java">// stores person under id
BTreeMap<Integer, Person> primary = DBMaker.newTempTreeMap();
primary.put(111, new Person("bIs9r", "NWmqoxFf", 92, true));
primary.put(111, new Person("4KXp8", "QrPsabf1", 31, false));
primary.put(111, new Person("eJLIo", "SJwJidWk", 6, true));
primary.put(111, new Person("LGW58", "vteM4khp", 42, false));
primary.put(111, new Person("tIM8R", "Rzq75ONh", 57, false));
primary.put(111, new Person("KqKRE", "BnpUV4dW", 26, true));
</pre>
<br />
3. Define a gender-based index
<br />
<pre class="brush:java">// stores value hash from primary map
NavigableSet<Fun.Tuple2<Boolean, Integer>> genderIndex = new TreeSet<Fun.Tuple2<Boolean, Integer>>();
//1. gender-based index: bind secondary to primary so it contains secondary key
Bind.secondaryKey(primary, genderIndex, new Fun.Function2<Boolean, Integer, Person>() {
@Override
public Boolean run(Integer key, Person value) {
return Boolean.valueOf(value.isMale());
}
});
</pre>
4. Use the gender-index to read all male persons
<br />
<pre class="brush:java">Iterable<Integer> ids = Fun.filter(genderIndex, true);
for(Integer id: ids) {
System.out.println(primary.get(id));
}
</pre>
<br />
MapDB offers multiple ways to define indexes on a given collection. It can also be extended to define specific kinds of indexes. The following is an example of implementing a <a href="http://en.wikipedia.org/wiki/Bitmap_index">Bitmap index</a> in MapDB:
<br />
<pre class="brush:java">public static <K, V, K2> void secondaryKey(MapWithModificationListener<K, V> map, final Map<K2, Set<K>> secondary,
final Fun.Function2<K2, K, V> fun) {
// fill if empty
if (secondary.isEmpty()) {
for (Map.Entry<K, V> e : map.entrySet()) {
K2 k2 = fun.run(e.getKey(), e.getValue());
Set<K> set = secondary.get(k2);
if (set == null) {
set = new TreeSet<K>();
secondary.put(k2, set);
}
set.add(e.getKey());
}
}
// hook listener
map.modificationListenerAdd(new MapListener<K, V>() {
@Override
public void update(K key, V oldVal, V newVal) {
if (newVal == null) {
// removal
secondary.get(fun.run(key, oldVal)).remove(key);
} else if (oldVal == null) {
// insert
K2 key2 = fun.run(key, newVal);
Set<K> set = secondary.get(key2);
if (set == null) {
set = new TreeSet<K>();
secondary.put(key2, set);
}
set.add(key);
} else {
// update, must remove old key and insert new
K2 oldKey = fun.run(key, oldVal);
K2 newKey = fun.run(key, newVal);
if (oldKey == newKey || oldKey.equals(newKey))
return;
Set<K> set1 = secondary.get(oldKey);
if (set1 != null) {
set1.remove(key);
}
Set<K> set2 = secondary.get(newKey);
if (set2 == null) {
set2 = new TreeSet<K>();
secondary.put(newKey, set2);
}
set2.add(key);
}
}
});
}
</pre>
This new index can be used as follows:
<br />
<pre class="brush:java">final Map<Boolean, Set<Integer>> bitmapIndex = new HashMap<Boolean, Set<Integer>>();
secondaryKey(primary, bitmapIndex, fun); // 'fun' is the same gender-extracting Fun.Function2 used for the gender index above
</pre>
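Reading from the bitmap index is then a single map lookup; for example, to list all male persons:<br />
<pre class="brush:java">Set<Integer> maleIds = bitmapIndex.get(Boolean.TRUE);
if (maleIds != null) {
    for (Integer id : maleIds) {
        System.out.println(primary.get(id));
    }
}
</pre>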
<br />
Continue here</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-30352804377554753132014-05-03T08:23:00.002-07:002014-05-20T01:29:22.920-07:00Exploiting Big RAMs<div dir="ltr" style="text-align: left;" trbidi="on">
These are notes from a talk given by Neil Ferguson on how to take advantage of very large amounts of memory to improve the performance of server-side applications.<br />
<br />
<b>Background</b><br />
With the increase in the amount of data managed by any enterprise or web application, there is a continuous need to store more and more data while providing real-time access to it. The performance of such applications can be improved by making data available directly from memory and efficiently using the huge amounts of memory that may reach many terabytes in the near future.<br />
<br />
In fact, memory prices are continuously decreasing while capacity increases, to the point where terabytes of RAM will be available in servers in the near future. The cost of 1MB of RAM was about $0.01 in Jan 2009 and $0.005 in 2013 (source: <a href="http://www.jcmit.com/memoryprice.htm">Memory Prices (1957-2013)</a>). We could buy a <a href="http://www.theverge.com/2012/3/6/2848762/hp-z820-workstation-512-gb-memory">workstation with 512GB of RAM</a> for $2299, and Intel processors (e.g. Xeon) allow up to 144GB of RAM, and more (around a terabyte) for new-generation processors dedicated to server-class machines. However, it is still not practical to do much with such an amount of RAM. Why?<br />
<br />
<b>Garbage Collection Limitations</b><br />
In any garbage-collected environment (like the JVM), if the object allocation rate overtakes the rate at which the GC collects these objects, then long GC pauses (time during which the JVM stops the application just to run the garbage collector) may become very frequent. One way to avoid this problem is to leave plenty of free space in the heap. The thing is, leaving a third of 3GB free is not really a big deal, compared to leaving a third of 300GB, even if the ratio between free space and live data is the same.<br />
The bad news is that even with a large free space there may be situations where GC pauses are too long, typically for memory defragmentation.<br />
You can improve application performance with <span style="color: blue; font-family: monospace;">-XX:+ExplicitGCInvokesConcurrent</span> as a workaround to avoid long pauses when <span style="color: blue; font-family: monospace;">System.gc()</span> or <span style="color: blue; font-family: monospace;">Runtime.getRuntime().gc()</span> is explicitly called (e.g. on direct ByteBuffer allocation).<br />
<br />
<b>Off-Heap storage</b><br />
To overcome some of these limitations of the JVM or any garbage-collected environment, allocating memory off-heap can be a solution. This can be done in different ways:<br />
<br />
<i>1. Direct ByteBuffers</i><br />
The NIO API allows the allocation of off-heap memory (i.e. not part of the process heap and not subject to GC) for storing long-lived data via <span style="color: blue; font-family: monospace;">ByteBuffer.allocateDirect(int capacity)</span>. The total capacity is limited to what is specified with the JVM option <span style="color: blue; font-family: monospace;">-XX:MaxDirectMemorySize</span>.<br />
Allocation through ByteBuffer has implications for GC (long pauses) when buffers are not freed fast enough, which makes it unsuitable for short-lived objects, i.e. allocating and freeing a lot of memory frequently.<br />
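A minimal sketch of allocating and using a direct buffer:<br />
<pre class="brush:java">// allocate 1MB outside the Java heap (bounded by -XX:MaxDirectMemorySize)
ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
buffer.putLong(0, 42L);            // write at absolute offset 0
long value = buffer.getLong(0);    // read it back
// the off-heap block is only reclaimed when the buffer object itself is garbage collected
</pre>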
<br />
<i>2. <a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html">sun.misc.Unsafe</a> </i><br />
Direct ByteBuffer itself relies on <span style="color: blue; font-family: monospace;"><a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html#allocateMemory(long)">sun.misc.Unsafe.allocateMemory</a></span> to allocate a big block of memory off-heap and on <a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html#freeMemory(long)" style="font-family: monospace;">sun.misc.Unsafe.freeMemory</a> to explicitly free it.<br />
Here is a very simple implementation of a wrapper class based on the Unsafe API for managing off-heap memory:
<br />
<pre class="brush:java">public class OffHeapObject {
// fields
private static Unsafe UNSAFE;
static {
try {
// get instance using reflection
Field field = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
UNSAFE = (sun.misc.Unsafe) field.get(null);
}catch(Exception e){
throw new IllegalStateException("Could not access theUnsafe instance field");
}
}
private static final int INT_SIZE = 4;
// base address for the allocated data
private long address;
// constructor
public OffHeapObject(T heapObject) {
// serialize data
byte[] data = serialize(heapObject);
// allocate off-heap memory
address = UNSAFE.allocateMemory(INT_SIZE + data.length);
// save the data size in first bytes
UNSAFE.putInt(address, data.length);
// Write data byte by byte to the allocated memory
for(int i=0; i < data.length; i++) {
UNSAFE.putByte(address + INT_SIZE + i, data[i]);
}
}
public T get() {
int length = UNSAFE.getInt(address);
// read data from the memory
byte[] data = new byte[length];
for(int i = 0; i < data.length; i++) {
data[i] = UNSAFE.getByte(address + INT_SIZE + i);
}
// return the deserialized data
return deserialize(data);
}
// free allocate space to avoid memory leaks
public void deallocate() {
//TODO make sure to not call this more than once
UNSAFE.freeMemory(address);
}
}</pre>
OffHeapObject can be used, for instance, to store the values of cached data, e.g. using Google Guava to store key/OffHeapObject pairs where the latter wraps the data living in off-heap memory (a sketch follows). This way GC pauses can be considerably reduced, as these objects are just small references and do not occupy big blocks of heap memory. Also, the process size should not grow indefinitely, as fragmentation is reduced.<br />
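A sketch of such a cache using Google Guava (the key 'user-42' and the String payload are only illustrative; the removal listener frees the off-heap block when an entry is evicted):<br />
<pre class="brush:java">// requires com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification}
Cache<String, OffHeapObject<String>> cache = CacheBuilder.newBuilder()
    .maximumSize(1000000)
    .removalListener(new RemovalListener<String, OffHeapObject<String>>() {
        @Override
        public void onRemoval(RemovalNotification<String, OffHeapObject<String>> n) {
            n.getValue().deallocate(); // free the off-heap block on eviction
        }
    })
    .build();

cache.put("user-42", new OffHeapObject<String>("a large serialized payload"));
String payload = cache.getIfPresent("user-42").get();
</pre>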
<br />
Note that this implementation of OffHeapObject is very basic, and there is a performance impact to using off-heap memory. In fact, everything needs to be serialized on writes to off-heap memory and de-serialized on reads from it, and these operations have some overhead and reduced throughput compared to on-heap storage.<br />
Furthermore, not every object can be stored in off-heap memory; for instance the OffHeapObject itself, which keeps a reference to a block of off-heap memory, is actually stored on the heap.<br />
The performance of this implementation may be enhanced with techniques like <a href="http://en.wikipedia.org/wiki/Data_structure_alignment">data alignment</a>.<br />
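The serialize/deserialize helpers used by OffHeapObject above are not shown in the talk; a minimal sketch based on standard Java serialization (hence the T extends Serializable bound) could be:<br />
<pre class="brush:java">// assumed helpers inside OffHeapObject, using plain java.io serialization
private static byte[] serialize(Object obj) {
    try {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(obj);
        oos.close();
        return bos.toByteArray();
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}

@SuppressWarnings("unchecked")
private T deserialize(byte[] data) {
    try {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data));
        return (T) ois.readObject();
    } catch (IOException e) {
        throw new IllegalStateException(e);
    } catch (ClassNotFoundException e) {
        throw new IllegalStateException(e);
    }
}
</pre>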
<br />
Some existing caches based on off-heap storage<br />
<br />
continue from 28:39<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/ysf1RekaZoI?feature=player_embedded' frameborder='0'></iframe></div>
<div style="text-align: center;">
<i>Big RAM: How Java Developers Can Fully Exploit Massive Amounts of RAM</i> </div>
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Understanding Java Garbage Collection presentation at JUG Victoria Nov/2013 - <a href="http://www.azulsystems.com/sites/default/files/images/UnderstandingGC-JUG-Victoria-Nov2013.pdf">Azul Systems</a></li>
<li>Measuring GC pauses with jHiccup - <a href="http://www.azulsystems.com/jHiccup">Azul Systems</a></li>
<li>A good documentation of the Unsafe API can be found in this <a href="http://mishadoff.github.io/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/">blog post</a>.</li>
<li>How Garbage Collection works in Java - <a href="http://javarevisited.blogspot.fr/2011/04/garbage-collection-in-java.html">blog post</a>.</li>
</ul>
<br />
<div>
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-22324471358167058942014-05-03T06:59:00.003-07:002014-05-28T08:26:18.155-07:00Random resources related to Docker<div dir="ltr" style="text-align: left;" trbidi="on">
<b>General</b><br />
<br />
<ul style="text-align: left;">
<li><a href="http://mjbright.github.io/Pygre/2014/2014-Mar-27-LightWeightVirtualizationWithDocker/presentation.html">Light weight virtualization with Docker</a> </li>
<li>The Docker Book <a href="http://www.dockerbook.com/TheDockerBook_sample.pdf">sample</a> and <a href="http://www.dockerbook.com/toc.html">TOC</a>.</li>
<li><a href="http://blog.octo.com/en/docker-qas/">Docker Q&As</a> </li>
<li><a href="http://blog.octo.com/en/docker-containers-configuration/">Containers configuration</a> </li>
<li><a href="http://blog.octo.com/en/docker-registry-first-steps/">Docker registry</a> </li>
<li>Getting started with docker - <a href="http://serversforhackers.com/articles/2014/03/20/getting-started-with-docker/">serversforhackers.com</a></li>
<li>Advanced provisionning with Packer - <a href="http://mmckeen.net/blog/2013/12/27/advanced-docker-provisioning-with-packer/">mmckeen.net</a></li>
<li>A practical introduction to docker containers - <a href="https://developerblog.redhat.com/2014/05/15/practical-introduction-to-docker-containers/">developerblog.redhat.com</a></li>
</ul>
<br />
<b>Management</b><br />
<ul style="text-align: left;">
<li>Cleanup old images <a href="http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html">blog post</a></li>
<li>Docker Log Management Using Fluentd - <a href="http://jasonwilder.com/blog/2014/03/17/docker-log-management-using-fluentd/">jasonwilder.com</a></li>
</ul>
<br />
<br />
<b>Continuous Integration</b><br />
<ul style="text-align: left;">
<li>Provisionning jenkins slaves with docker - <a href="http://www.nuxeo.com/blog/development/2014/02/docker-jenkins-cloud-provider/">nuxeo.com</a> </li>
<li>Continuous Integration using Docker, Maven and Jenkins - <a href="http://www.wouterdanes.net/2014/04/11/continuous-integration-using-docker-maven-and-jenkins.html">wouterdanes.net</a></li>
<li>Next gen CI built with docker - <a href="http://blog.frozenridge.co/next-generation-continuous-integration-deployment-with-dotclouds-docker-and-strider/">frozenridge.co</a> </li>
<li>Intro to build tools and continuous delivery (french) <a href="http://fr.slideshare.net/Zenika/nightclazz">slideshare.net</a></li>
</ul>
<br />
<b>Environment configuration</b><br />
<ul>
<li>Docker-Based Development Environments - <a href="http://www.vagrantup.com/blog/feature-preview-vagrant-1-6-docker-dev-environments.html">vagrantup.com</a></li>
<li>Configuring dev environment with docker - <a href="http://tersesystems.com/2013/11/20/building-a-development-environment-with-docker/">Terse Systtems</a></li>
</ul>
<b>DevOps</b><br />
<ul style="text-align: left;">
<li>Introducing the IBM DevOps approach - <a href="http://www.ibm.com/developerworks/devops/principles.html">developerworks</a></li>
</ul>
<br />
<b>lmctfy</b><br />
<ul style="text-align: left;">
<li>What is the essential difference between lmctfy and LXC? <a href="https://groups.google.com/forum/#!topic/lmctfy/e6oGQELK2oA">google groups</a></li>
<li>What is the difference between lmctfy and lxc <a href="http://stackoverflow.com/questions/19196495/what-is-the-difference-between-lmctfy-and-lxc">stackoverflow</a></li>
<li>Containers track at <a href="http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153">Linux plumber conf 2013</a></li>
<li>Let Me Contain That For You at <a href="http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1239/original/lmctfy%20%281%29.pdf">Linux plumber conf 2013</a></li>
<li>LMCTFY on <a href="https://www.blogger.com/github%20https://github.com/google/lmctfy">github</a></li>
<li>Develop apps on the cloud - <a href="http://www.ibm.com/developerworks/library/d-bluemix-devops-services-project/index.html">developerworks</a></li>
<li>Difference between docker and Cloud Foundry's warden - <a href="https://docs.google.com/document/d/1DDBJlLJ7rrsM1J54MBldgQhrJdPS_xpc9zPdtuqHCTI">google docs</a></li>
</ul>
<br />
<b>Networking</b><br />
<ul style="text-align: left;">
<li>Configure Networking - <a href="http://docs.docker.io/use/networking/">official doc</a></li>
<li>Docker container’s configuration - <a href="http://blog.octo.com/en/docker-containers-configuration/">octo talks</a></li>
</ul>
<br />
<b>Ecosystem</b><br />
<ul style="text-align: left;">
<li><a href="http://www.projectatomic.io/">Atomic project</a> - Deploy and Manage your Docker Containers </li>
<li><a href="https://www.openshift.com/blogs/geard-the-intersection-of-paas-docker-and-project-atomic">GearD</a> - The Intersection of PaaS, Docker and Project Atomic </li>
<li>Classification of the ecosystem of <a href="http://www.centurylinklabs.com/top-10-startups-built-on-docker/">startups based on Docker</a> </li>
<li><a href="http://goo.gl/03sPBH">Slides</a> from DockerFr Meetup on Docker ecosystem</li>
<li><a href="http://opencore.io/">OpenCore</a> a Big Data (Hadoop) as a Service provider</li>
</ul>
<br />
<br />
<b>API Client</b><br />
<ul style="text-align: left;">
<li>Docker Java on <a href="https://github.com/kpelykh/docker-java">github</a></li>
<li>Maven plugin on <a href="https://github.com/etux/docker-maven-plugin">github</a></li>
</ul>
<div>
work in progress</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-35811131912757539582014-04-13T08:48:00.001-07:002014-05-01T04:24:03.334-07:00Managing Docker images and containers<div dir="ltr" style="text-align: left;" trbidi="on">
In addition to managing Docker resources (including containers, images, hosts) through the official CLI, there are plenty of solutions available in the community to manage Docker resources in a comprehensive way from a single web-based interface.<br />
<h4 style="text-align: left;">
DockerUI</h4>
Once our containers are running, <a href="https://github.com/crosbymichael/dockerui">DockerUI</a> can be used to manage the overall system. It's a simple web app with basic features for:<br />
- Checking the state of containers (running, stopped)<br />
- Removing images<br />
- Starting, stopping, killing and removing containers<br />
<br />
DockerUI can be used with the following commands<br />
<br />
1. Build the web app from the GitHub repository and tag the built image<br />
<span style="color: blue; font-family: monospace;">$docker build -t crosbymichael/dockerui github.com/crosbymichael/dockerui</span><br />
<br />
2. Launch the built image, make the web app available on port 9000 and mount the Docker unix socket to remotely control Docker:<br />
<span style="color: blue; font-family: monospace;">$docker run -p 9000:9000 -v /var/run/docker.sock:/docker.sock crosbymichael/dockerui -e /docker.sock</span><br />
<br />
Then in the browser, visit localhost:9000 to get something like:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxKyu0Xmc83o2l4ziIPfVZfvomFVpwHRPNa7XZ6MWDVjI2FaHs0ovMVFCj5NItzUcKLESLeg6B3KHMdEa7y8sUS3v7NicblqBCE70dcf52w-TL3Re6pgZIFyF95iqip8TZv2HhA1zv8iQ/s1600/dockerui.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxKyu0Xmc83o2l4ziIPfVZfvomFVpwHRPNa7XZ6MWDVjI2FaHs0ovMVFCj5NItzUcKLESLeg6B3KHMdEa7y8sUS3v7NicblqBCE70dcf52w-TL3Re6pgZIFyF95iqip8TZv2HhA1zv8iQ/s1600/dockerui.png" height="273" width="320" /></a></div>
<h4 style="text-align: left;">
Shipyard</h4>
<a href="http://shipyard-project.com/">Shipyard</a> is a more advanced Docker management solution based on a client-server architecture where the agents (i.e. clients) collect information on Docker resources and report them to the Shipyard server. It providers in addition to the features available in DockerUI:<br />
- Authentication<br />
- Building new images by uploading local Dockerfile or providing URLs to a remote location<br />
- In the browser terminal emulation for attaching containers<br />
- Visualizing CPU and memory utilization of the running images<br />
- ...<br />
<br />
1. To use Shipyard, issue the following to pull and run the image from the Docker public index:<br />
<span style="color: blue; font-family: monospace;">$docker run -i -t -v /var/run/docker.sock:/docker.sock shipyard/deploy setup</span><br />
<br />
Now, we can register as admin to Shipyard on http://localhost:8000/<br />
<br />
2. Install the latest release (e.g. v0.2.5) of the <a href="https://github.com/shipyard/shipyard-agent">Shipyard agent</a> on every host to collect information on Docker resources:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">curl https://github.com/shipyard/shipyard-agent/releases/download/v0.2.5/shipyard-agent -L -o /usr/local/bin/shipyard-agent</span><br />
<span style="color: blue; font-family: monospace;">$chmod +x /usr/local/bin/shipyard-agent</span><br />
<br />
3. Run the agent and register to the main host where Shipyard is running<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">usr/local/bin/</span><span style="color: blue; font-family: monospace;">shipyard-agent -url http://localhost:8000 -register</span><br />
<br />
4. On the Shipyard interface, authorize the agents already deployed to enable them.<br />
5. Run the agent with the key given at registration:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">usr/local/bin/</span><span style="color: blue; font-family: monospace;">shipyard-agent -url http://localhost:8000 -key agent_key</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3SxWk3HOZkip-I-4QHKVMwTMsfeFJ3TPHTByfmWO8rs4yyn4pyoSVVrWrfyAhTqD77wsrcSP8RZRXweyVwWWrMM3RLlbneiRgBrB6VPuXiDGAajGNecNIaKYlZrK5vX00HG4Omog3L5s/s1600/shipyard.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3SxWk3HOZkip-I-4QHKVMwTMsfeFJ3TPHTByfmWO8rs4yyn4pyoSVVrWrfyAhTqD77wsrcSP8RZRXweyVwWWrMM3RLlbneiRgBrB6VPuXiDGAajGNecNIaKYlZrK5vX00HG4Omog3L5s/s1600/shipyard.png" height="315" width="320" /></a></div>
<br />
<br />
<b>Troubleshooting</b>, in case you get this message:<br />
<span style="color: red;">Error requesting images from Docker: Get http://127.0.0.1:4243/images/json?all=0</span><br />
Then stop the Docker service and re-start it while enabling Remote API access for any IP address:<br />
<span style="color: blue; font-family: monospace;">$sudo service docker stop</span><br />
<span style="color: blue; font-family: monospace;">$docker -H tcp://0.0.0.0:4243 -H unix:///var/run/docker.sock -d &</span><br />
<br />
happy dockering</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-83288945424865651112014-04-06T13:49:00.003-07:002014-04-13T07:52:59.336-07:00Automating Docker image builds with Dockerfiles<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Hello Dockerfile</b><br />
This is a continuation of a previous <a href="http://elsoufy.blogspot.fr/2014/03/build-your-own-saas-with-docker.html">post</a> on Docker, with the aim of using specific scripts called Dockerfiles to automate the steps that we had been issuing manually to build Docker images. When Docker parses the script file, it sequentially executes the commands, starting from a base image and creating a new image after each command.<br />
The syntax of a dockerfile instruction is as simple as :<br />
<span style="color: blue; font-family: monospace;">command argument1 argument2 ... </span><br />
or<br />
<span style="color: blue; font-family: monospace;">command ["argument1", "argument2", ...] </span> only for the entry-point command !!<br />
<br />
It's preferable to write the command in uppercase!<br />
<br />
<b>Dockerfile instructions</b><br />
There are a dozen instructions that can appear in a Dockerfile; a detailed list can be found in the <a href="http://docs.docker.io/en/latest/reference/builder/">official documentation</a>. The most common ones are:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">FROM</span> all dockerfile should start with this command that specify the name of the image to use as a working or base image;</li>
<li><span style="color: blue; font-family: monospace;">RUN</span> allows to run a command in the current container and commit (automatically) the changes to a new image;</li>
<li><span style="color: blue; font-family: monospace;">MAINTAINER</span> allows to specify information (name, email) on the person responsible for maintain this script;</li>
<li><span style="color: blue; font-family: monospace;">ENTRYPOINT</span> allows to specify what command should be executed at first once the container is started;</li>
<li><span style="color: blue; font-family: monospace;">USER</span> allows to specify with which user account the command inside the container have to be executed with; </li>
<li><span style="color: blue; font-family: monospace;">EXPOSE</span> allows to specify what port to expose for the running container.</li>
<li><span style="color: blue; font-family: monospace;">ENV</span> to use for setting environment variables</li>
<li><span style="color: blue; font-family: monospace;">ADD</span> to copy files from the build context (it does not work if using stdin to read dockerfile) into a physical directory in the image (e.g. copying a war file into tomcat webapps folder)</li>
</ul>
<a href="https://www.docker.io/learn/dockerfile/level1/">Here</a> you can find the official tutorial to experiment with these command.<br />
<br />
<b>Parsing dockerfiles</b><br />
Once you have finished editing the build script, issue <span style="color: blue; font-family: monospace;">docker build</span> to parse the Dockerfile and create a new image. There are different ways to use this command:<br />
<ul style="text-align: left;">
<li>when the Dockerfile is in the current directory: <span style="color: blue; font-family: monospace;">docker build .</span></li>
<li>from stdin: <span style="color: blue; font-family: monospace;">docker build - < Dockerfile</span></li>
<li>from a GitHub repository: <span style="color: blue; font-family: monospace;">docker build github.com/username/repo</span>; Docker will then clone the repo and parse the Dockerfile in its directory.</li>
</ul>
<br />
<b>Example</b><br />
Now let's take the instructions from the previous <a href="http://elsoufy.blogspot.fr/2014/03/build-your-own-saas-with-docker.html">post</a> and gather them into a Dockerfile:<br />
<span style="color: blue; font-family: monospace;"># Use ubuntu as a base image</span><br />
<span style="color: blue; font-family: monospace;">FROM ubuntu</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;"># update package respository</span><br />
<span style="color: blue; font-family: monospace;">RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">RUN echo "deb http://archive.ubuntu.com/ubuntu precise-security main universe" > /etc/apt/sources.list</span><br />
<span style="color: blue; font-family: monospace;">RUN </span><span style="background-color: white; color: blue; font-family: monospace;">apt-get update</span><br />
<span style="background-color: white; color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># install java, tomcat7</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">RUN apt-get install -y default-jdk</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">RUN </span><span style="background-color: white; color: blue; font-family: monospace;">apt-get install -y tomcat7</span><br />
<div>
<br /></div>
<span style="color: blue; font-family: monospace;">RUN mkdir /usr/share/tomcat7/logs/</span><br />
<span style="color: blue; font-family: monospace;">RUN mkdir /usr/share/tomcat7/temp/</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># set tomcat environment variables</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">JRE_HOME=/usr/</span><span style="color: blue; font-family: monospace;">lib/jvm/default-java</span><span style="color: blue; font-family: monospace;">/jre</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> CATALINA_HOME=/usr/share/tomcat7/</span><br />
<div>
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;"># copy war files to the webapps/ folder</span><br />
<span style="color: blue; font-family: monospace;">ADD path/to/war /usr/</span><span style="color: blue; font-family: monospace;">share/tomcat7/webapps/</span><br />
<span style="color: blue; font-family: monospace;"><br /></span></div>
<span style="color: blue; font-family: monospace;"># launch tomcat once the container started</span><br />
<span style="color: blue; font-family: monospace;">#ENTRYPOINT </span><span style="background-color: white; color: blue; font-family: monospace;">service tomcat7 start</span><br />
<span style="color: blue; font-family: monospace;">ENTRYPOINT </span><span style="color: blue; font-family: monospace;">/usr/share/tomcat7/bin/catalina.sh run</span><br />
<span style="background-color: white; color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># expose the tomcat port number</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">EXPOSE 8080</span><br />
<br />
Save this script as <span style="color: blue; font-family: monospace;">Dockerfile</span>, build it and tag the image as <span style="color: blue; font-family: monospace;">tomcat7</span>, then launch the container while publicly exposing the Tomcat server port 8080, and finally check that the container is running:<br />
<span style="color: blue; font-family: monospace;">$docker build -t tomcat7 - < Dockerfile</span><br />
<span style="color: blue; font-family: monospace;">$docker run -p 8080 tomcat7</span><br />
<span style="color: blue; font-family: monospace;">$docker ps</span><br />
<br />
to be continued;</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-56492059220928818262014-03-31T11:46:00.000-07:002014-05-01T05:22:10.751-07:00Build your own SaaS with Docker - Part I<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Hello Docker</b><br />
Docker enables sandboxing of applications and their dependencies in virtual containers so they can be run in isolation. It provides an easy-to-use API for automating deployment operations that looks very similar to Git commands. More introductory information can be found on its <a href="http://en.wikipedia.org/wiki/Docker_%28software%29">Wikipedia page</a>.<br />
<br />
<b>Installation</b><br />
Docker installation on Ubuntu 64-bit (for other OSes check the <a href="https://www.docker.io/gettingstarted/#h_installation">official documentation</a>):<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo sh -c "curl https://get.docker.io/gpg | apt-key add -"</span><span style="background-color: white; color: blue; font-family: monospace;"> </span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"</span><span style="color: blue; font-family: monospace;"> </span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo apt-get update</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo apt-get install lxc-docker</span><br />
<br />
Once Docker is installed, run a shell from within a container as follows:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t ubuntu /bin/bash</span><br />
<br />
Since it will not find the <span style="color: blue; font-family: monospace;">ubuntu</span> image locally, Docker will pull it from the <a href="https://index.docker.io/">registry</a>. Once installed, you can run:<br />
<ul style="text-align: left;">
<li><span style="background-color: white; color: blue; font-family: monospace;">#exit</span> to leave the container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker images</span> to see all local images.</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker inspect image_name </span>to see detailed information on an image.</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps </span>to see the status of the container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker stop CONTAINER_ID</span> to stop a running image (or container)</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker logs </span><span style="color: blue; font-family: monospace;">CONTAINER_ID</span> to see all logs if a given container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker commit CONTAINER_ID image_name</span> to commit changes made to a container</li>
</ul>
<br />
<b>Installing Tomcat within a container</b><br />
Start a new container using the <span style="color: blue; font-family: monospace;">ubuntu</span> base image:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t ubuntu /bin/bash</span><br />
<br />
Update the image's system packages<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get update</span><br />
<br />
1. Install the <a href="http://tomcat.apache.org/%E2%80%8E">Apache Tomcat</a> application server:<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get install -y tomcat7</span><br />
<br />
Once installed the following directories are created (more details can be found <a href="http://askubuntu.com/questions/135824/what-is-the-tomcat-installation-directory">here</a>):<br />
<ul style="text-align: left;">
<li><span style="background-color: #cccccc;"><span style="font-family: Courier New, Courier, monospace;"><b>/etc/tomcat7</b></span></span> for configuration</li>
<li><b><span style="background-color: #cccccc; font-family: Courier New, Courier, monospace;">/usr/share/tomcat7</span></b> for runtime, called <b>$CATALINA_HOME</b></li>
<li><span style="background-color: #cccccc;"><span style="font-family: Courier New, Courier, monospace;"><b>/usr/share/tomcat7-root</b></span></span> for webapps</li>
</ul>
2. Install Java DK<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get install -y default-jdk</span><br />
<br />
3. Configure environment variables<br />
<span style="background-color: white; color: blue; font-family: monospace;">#pico ~/.bashrc</span><br />
<span style="color: blue; font-family: monospace;">export JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<span style="color: blue; font-family: monospace;">export CATALINA_HOME=~/path/to/tomcat</span><br />
<span style="color: blue; font-family: monospace;">#. ~/.bashrc </span>to make the changes effective<br />
<br />
Now when typing <span style="background-color: white; color: blue; font-family: monospace;">#echo $</span><span style="color: blue; font-family: monospace;">CATALINA_HOME</span> you should see the exact path set to tomcat7.<br />
<br />
4. Start the Tomcat7 server<br />
<span style="background-color: white; color: blue; font-family: monospace;">#</span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/startup.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 start</span><br />
<br />
The start-up may fail with something like "cannot create directory '/usr/share/tomcat7/logs/catalina.out/'". To solve this, you may just have to create the logs directory:<br />
<span style="background-color: white; color: blue; font-family: monospace;">#mkdir </span><span style="color: blue; font-family: monospace;">/usr/share/tomcat7/logs</span><br />
<br />
To check whether Tomcat is running, issue<br />
<span style="background-color: white; color: blue; font-family: monospace;">#ps -ef | grep tomcat</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 status</span><br />
<br />
Then check in your browser: http://container_ip_address:8080/.<br />
To get the IP address of the container, issue (from inside the container)<br />
<span style="background-color: white; color: blue; font-family: monospace;">#ifconfig</span><br />
<br />
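Alternatively, assuming your Docker version already supports the <span style="font-family: monospace;">--format</span> flag of <span style="font-family: monospace;">docker inspect</span>, the container's IP address can be read directly from the host:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' CONTAINER_ID</span><br />
<br />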
5. Shutdown Tomcat7<br />
<span style="background-color: white; color: blue; font-family: monospace;">#</span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/shutdown.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 stop</span><br />
<br />
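As a side note, the manual steps above could also be captured in a Dockerfile so the image can be rebuilt automatically. The following is only a rough, untested sketch of that idea (no CMD is defined, so Tomcat is still started interactively as described above, and <span style="font-family: monospace;">my_tomcat_image</span> is a placeholder name):<br />
<pre class="brush:bash"># write a Dockerfile reproducing the manual steps: update, install tomcat7 and a JDK, create the logs directory
$ cat &gt; Dockerfile &lt;&lt;'EOF'
FROM ubuntu
RUN apt-get update && apt-get install -y tomcat7 default-jdk
ENV CATALINA_HOME /usr/share/tomcat7
RUN mkdir -p /usr/share/tomcat7/logs
EXPOSE 8080
EOF
# build an image from it
$ sudo docker build -t my_tomcat_image .
</pre>
<br />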
<b>Save the image to index.docker.io</b><br />
The changes we made on top of the base image exist only in our container; we should <span style="color: blue; font-family: monospace;">commit</span> them as a new image so they are not lost.<br />
<br />
1. Login to index.docker.io<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker login</span><br />
<span style="color: blue; font-family: monospace;">Username: your_user_name</span><br />
<span style="color: blue; font-family: monospace;">Password: your_password</span><br />
<span style="color: blue; font-family: monospace;">Email: your_email</span><br />
<span style="color: blue; font-family: monospace;">Login Succeeded</span><br />
<br />
If you don't have an account, sign up <a href="https://index.docker.io/account/signup/">here</a>.<br />
<br />
2. Commit changes to your repository<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker commit CONTAINER_ID USERNAME/REPO_NAME</span><br />
<br />
3. Push changes to this repository<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker push </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME</span><br />
<br />
4. Start a new container using the image committed to your repository as the base image<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME /bin/bash</span><br />
<span style="color: blue; font-family: monospace;">#</span><br />
<br />
To run Tomcat in the container<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME </span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/</span><span style="color: blue; font-family: monospace;">startup</span><span style="color: blue; font-family: monospace;">.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME</span><span style="color: blue; font-family: monospace;"> </span><span style="background-color: white; color: blue; font-family: monospace;">service tomcat7 start</span><br />
<br />
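Note that in the two commands above <span style="font-family: monospace;">$CATALINA_HOME</span> is expanded by the host shell rather than inside the container, and both start-up variants put Tomcat in the background, so the container is likely to exit as soon as the command returns. A more convenient sketch, assuming your Docker version supports the <span style="font-family: monospace;">-p host_port:container_port</span> publishing syntax, is to keep an interactive shell open and publish port 8080 so that Tomcat is reachable from the host browser at http://localhost:8080/:<br />
<pre class="brush:bash"># publish container port 8080 on host port 8080 and open an interactive shell
$ sudo docker run -i -t -p 8080:8080 USERNAME/REPO_NAME /bin/bash
# then, inside the container, start Tomcat as before
service tomcat7 start
</pre>
<br />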
To clean up old containers<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps -a -q | xargs sudo docker rm</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps -a | awk '{print $1}' | xargs sudo docker rm</span><br />
<br />
To clean up old, untagged images<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker images | grep "^<none><none>" | awk '{print $3}' | xargs sudo docker rmi -f</none></none></span><br />
<br />
<div style="text-align: left;">
<b>Resources</b></div>
If you are confused by Docker terminology (e.g. container, image, etc.) check the <a href="http://docs.docker.io/en/latest/terms/">official documentation</a>.<br />
General-purpose instructions for installing Tomcat7 on an Ubuntu machine can be found <a href="http://diegobenna.blogspot.fr/2011/01/install-tomcat-7-in-ubuntu-1010.html">here</a>.<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com2tag:blogger.com,1999:blog-2035497736124196692.post-72423016622682079062013-10-06T02:49:00.001-07:002014-05-03T09:34:43.140-07:00Joyn or RCS<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Rich_Communication_Services">RCS</a> (Rich Communication Services) is a GSMA standard that aims to bring a set of rich communication (that goes beyond SMS and phone calls) yet inter-operable services across different domains managed by different telecom operators. This telcos standard is marketed under the name of Joyn.<br />
Many operators have already deployed it on their networks, offering users VoIP and presence services that can be accessed by installing an application from the app store (<a href="https://play.google.com/store/apps/details?id=com.witsoftware.wmc">Google Play</a>, <a href="https://itunes.apple.com/fr/app/joyn-by-orange/id509534836">AppStore</a>). In addition, some smartphone manufacturers who have joined the movement already embed the RCS stack in their devices.<br />
<br />
The next step in commercializing Joyn is to build an ecosystem by providing APIs and empowering the developer community to create communication-based applications that rely on the platform. Orange, through <a href="http://www.orangepartner.com/">Orange Partner</a>, and Deutsche Telekom, through the <a href="https://www.developergarden.com/">Developer Garden</a> program, are leading these efforts in Europe. For instance, they jointly sponsored the <a href="http://www.orangepartner.com/articles/48-hours-open-innovation">Joyn Hackathon</a> (<a href="http://www.orange.com/en/press/press-releases/press-releases-2013/Orange-announces-the-winners-of-the-Paris-to-Berlin-Hackathon-dedicated-to-joyn-in-partnership-with-Deutsche-Telekom">press release</a>) where the <a href="http://www.orangepartner.com/content/joyn-api">Joyn API</a> was introduced.<br />
<br />
The remainder of this post explains how to use the Android <a href="https://rcsjta.googlecode.com/git/sdk-joyn/index.html">Joyn SDK</a> to build conversational applications. The overall interaction between an application, the Joyn SDK and, behind it, the RCS platform is illustrated in the following figure.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://rcsjta.googlecode.com/git/sdk-joyn/concepts.html" style="margin-left: auto; margin-right: auto;"><img border="0" height="253" src="https://rcsjta.googlecode.com/git/sdk-joyn/assets-sdk/images/concepts_2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Joyn API call flow</td></tr>
</tbody></table>
<br />
<ol style="text-align: left;">
<li>Instantiate the Joyn chat service and establish a connection</li>
</ol>
<pre class="brush:java"> private ChatService mService;
private JoynServiceListener mListener = new JoynServiceListener() {
@Override public void onServiceDisconnected(int error) {
Log.i(TAG, "ChatService disconnected!");
}
@Override public void onServiceConnected() {
Log.i(TAG, "ChatService connected!");
}
};
...
// Instantiate the API
mService = new ChatService(getApplicationContext(), mListener);
// Connect API
mService.connect();
</pre>
<ol start="2" style="text-align: left;">
<li>When the connection is successfully established, start calling the API methods</li>
</ol>
<pre class="brush:java"> private Chat mChat;
private ChatListener mChatListener = new ChatListener() {
@Override public void onReportMessageFailed(String arg0) {}
@Override public void onReportMessageDisplayed(String arg0) {}
@Override public void onReportMessageDelivered(String arg0) {}
@Override public void onNewMessage(ChatMessage arg0) {}
@Override public void onComposingEvent(boolean arg0) {}
};
...
@Override public void onServiceConnected() {
Log.i(TAG, "ChatService connected!");
if (mService != null && mService.isServiceRegistered()) {
// Get remote contact
String contact = getIntent().getStringExtra("contact");
// Call API Methods
mChat = mService.openSingleChat(contact, mChatListener);
mChat.sendMessage("hello world!");
}
}
</pre>
The API doc of the ChatService can be found on this <a href="https://rcsjta.googlecode.com/git/sdk-joyn/javadoc/org/gsma/joyn/chat/ChatService.html">link</a>.<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0