<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 3)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> a few weeks ago, and have been working on an Elasticsearch client in <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<h3 style="text-align: left;">
Aggregations (Part IV) :</h3>
<b>Chapter 35: Controlling memory use and latency</b><br />
Aggregation queries rely on a data structure called “fielddata” because inverted indices are not efficient at finding which unique terms exist in a single document. Understanding how fielddata works is important as it is the primary consumer of memory in an Elasticsearch cluster.<br />
Aggregations and Analysis: the “terms” bucket operates on a string field that may be analyzed or not_analyzed. For instance, running a terms aggregation on documents containing state names (e.g. New York) will create a bucket for each token (e.g. new, york) rather than one per state name, because the field is analysed by default. To fix this unwanted behaviour, the field should be explicitly declared not_analyzed in the mapping when the index is first created.<br />
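For illustration, here is a minimal mapping sketch (index, type and field names are hypothetical; the syntax follows the Elasticsearch 1.x/2.x releases covered by the book) that keeps each state name as a single un-analysed term:<br />
<pre class="brush:java">PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "state": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
</pre>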
Furthermore, the ngram analysis process can create a lot of tokens which is memory unfriendly.<br />
Choosing the right heap size significantly impacts the performance of fielddata and thus Elasticsearch. The value can be set with the $ES_HEAP_SIZE env variable:<br />
Choose no more than half the available memory, leaving the other half to Lucene, which relies on filesystem caches managed by the kernel.<br />
Choose no more than 32GB, which allows the JVM to use compressed pointers and save memory; a bigger value forces the JVM to use pointers of double the size and makes garbage collection more expensive.<br />
To control the size of memory allocated to fielddata, set ‘indices.fielddata.cache.size’ to a percentage of the heap or a concrete value (e.g. 40%, 5gb) in config/elasticsearch.yml. By default this value is unbounded, which means ES will never evict data from fielddata.<br />
Fielddata usage can be monitored (e.g. too many evictions may indicate poor performance), broken down per field:<br />
GET /_stats/fielddata?fields=*<br />
GET /_nodes/stats/indices/fielddata?fields=*<br />
GET /_nodes/stats/indices/fielddata?level=indices&fields=*<br />
To avoid an OutOfMemoryException when loading more data into fielddata, ES uses circuit breakers that estimate the memory required to answer a query before loading any more data. ES has different circuit breakers to ensure memory limits are not exceeded:<br />
indices.breaker.fielddata.limit: by default limits the size of fielddata to 60% of the heap<br />
indices.breaker.request.limit: estimates the size of structures required to complete other parts of a request, by default 40% of the heap<br />
indices.breaker.total.limit: wraps the ‘request’ and ‘fielddata’ circuit breakers, by default ensuring their combination does not exceed 70% of the heap<br />
For instance, the fielddata circuit breaker limit can be set dynamically on a live cluster:<br />
PUT /_cluster/settings -d '{"persistent": {"indices.breaker.fielddata.limit": "40%"}}'<br />
<br />
Fielddata filtering: in some cases we may want to filter out terms that fall into a less interesting long tail and avoid loading them into fielddata at all. This can be done in the document mapping by filtering terms by their frequency (or by a regular expression). Terms filtered out of fielddata are no longer available for aggregations, but for many applications the memory saved matters more than keeping rarely used terms in memory.<br />
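As a sketch (index, type and field names are made up, and the threshold values are arbitrary), a mapping that only loads into fielddata the terms appearing in at least 1% of the documents of a segment could look like:<br />
<pre class="brush:java">PUT /my_index/_mapping/my_type
{
  "properties": {
    "tag": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 0.01,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
</pre>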
</div>
<h3 style="text-align: left;">
Notes on Big Data related talks</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Hadoop Summit 2016</h4>
<a href="https://www.youtube.com/watch?v=scJ4pOQvE1Q">Apache Eagle Monitor Hadoop in Real time</a><br />
The talk was about <a href="https://eagle.incubator.apache.org/">Apache Eagle</a>, a Hadoop product developed by eBay to monitor activities on a Hadoop cluster from a security perspective. The talk started by describing the pillars of security in Hadoop: perimeter security; authorization & access control; discovery (e.g. classifying data according to its sensitivity); and activity monitoring. The talk focused mainly on the last part, to address info-sec questions: how many users are using the Hadoop cluster, what files are they accessing, etc. For this purpose Eagle was born, able to track events from different sources (e.g. accidentally deleting files from HDFS) and correlate them with user-defined policies.<br />
<br />
<a href="https://www.youtube.com/watch?v=LTONR-L40Xg">Ingest and Stream Processing What will you choose</a> <br />
The talk was divided in two parts; the first one was about <a href="https://blog.cloudera.com/blog/2015/06/architectural-patterns-for-near-real-time-data-processing-with-apache-hadoop/">streaming patterns</a> and how each component provides at-least-once or exactly-once message delivery.<br />
The second part was a demo of how easily a streaming pipeline can be built with the <a href="https://streamsets.com/">streamsets</a> editor. The demo used the <a href="https://data.sfgov.org/Geographic-Locations-and-Boundaries/San-Francisco-City-Lands-Current-Zipped-Shapefile-/nt8v-6azn">land data</a> of the city of San Francisco, streaming it and computing the land parcel with the maximum area. The generated data is then stored into two destinations: Kudu for analytics (e.g. top ten areas) and Kafka for the events used for rendering in Minecraft (which was pretty neat).<br />
<br />
<a href="https://www.youtube.com/watch?v=8S_mFmMRijE">Real time Search on Terabytes of Data Per Day Lessons Learned</a><br />
Lessons learned from the platform engineering team at Rocana (an Ops monitoring software vendor) on building a search engine on HDFS. They described the context and the terabytes of data they deal with on a daily basis. Then they talked about their initial use of SolrCloud as an enabler for their platform, how they struggled to scale it, and how they finally decided to build their own search engine based on Lucene, with HDFS to store the indexes. The rest of the talk was about their time-oriented search engine architecture. In the Q&A, one question was about Elasticsearch: they didn't really test it but rather relied on an analysis made by the author of <a href="http://jepsen.io/">Jepsen</a> (a tool for analysing distributed systems).<br />
<h4 style="text-align: left;">
Spark Summit East 2016</h4>
<a href="https://www.youtube.com/watch?v=JX0CdOTWYX4">Spark Performance: What's Next</a><br />
The talk started with an observation about the evolution, since the Spark project started in 2010, of IO speed, network throughput and CPU speed: the first two increased by a factor of 10x while CPUs are stuck around 3GHz. The first attempt at CPU and memory optimization was project Tungsten. Then, the speaker described the two phases of performance enhancement:<br />
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><i><b>Phase 1</b></i> (Spark 1.4 to 1.6): enhanced memory management by using the <span style="color: blue; font-family: monospace;">java.unsafe</span> API and off-heap memory instead of Java objects (which allocate unnecessary memory).</li>
<li><b><i>Phase 2</i></b> (Spark 2): instead of using the <a href="http://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf">Volcano Iterator Model</a> to implement operators (i.e. filter, projection, aggregation), use <a href="https://issues.apache.org/jira/browse/SPARK-12795">Whole-stage Codegen</a> to generate optimized code (and avoid virtual function calls), plus vectorization (i.e. a columnar layout) to represent data in memory for efficient scans.</li>
</ul>
<br />
Then the speaker described the impact of these enhancements by comparing the performance of Spark 1.6 vs Spark 2 for different queries. These modifications are in master under active development.<br />
In the Q&A, it was noted that the described techniques are applicable to DataFrames, as the engine has more information about the data schema, which is not the case with RDDs. With the Dataset API (which sits on top of the DataFrame API) you get the benefit of telling the engine the data schema as well as type safety (i.e. accessing items without having to cast them to their type). DataFrames give you index-based access, while Datasets give you object access.<br />
<h4 style="text-align: left;">
Others</h4>
<a href="https://www.youtube.com/watch?v=l3mDDKjDjMk">Ted Dunning on Kafka, MapR, Drill, Apache Arrow and More</a> <br />
Ted Dunning talked about why the Hadoop ecosystem succeeded over the NoSQL movement thanks to its more stable APIs serving as a standard way to build consensus in the community, while NoSQL systems tend to be isolated islands. As an example he gave the Kafka 0.9 release, which reached a new level of stability thanks to its API. He then described what Kafka is good at and gave an example of a use case it would be hard to use it for: real-time tracking of shipment containers where a dedicated Kafka topic is used to track each container, which would be hard to replicate effectively.<br />
Then, he described MapR's approach to open source as a way to innovate in the underlying implementation while adhering to a standard API (e.g. HDFS).<br />
He also talked about Drill and how MapR is trying to involve more members of the community so that it doesn't appear to be the only supporter. He also talked about the in-memory movement, especially the Apache Arrow in-memory format, and how it enabled the author of pandas to create <a href="https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">Feather</a>, a new file format to store data frames on disk that can be sent over the wire with Apache Arrow without the need for serialization.<br />
<br />
more to come.</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 2)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a>I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> few weeks ago, and working on an Elasticsearch client for <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<br />
<h3 style="text-align: left;">
Aggregations (Part IV) :</h3>
<b>Chapter 25: High-level concepts -</b> examples<br />
<div class="p1">
<span class="s1">Aggregations in Elasticsearch are based on ‘<b><i>bucket</i></b>’ which is a collection of documents that meet a certain criteria (equivalent to grouping in SQL) and ‘<b><i>metrics</i></b>’ which are statistics calculated on documents in a bucket (equivalent to count, avg, sum in SQL). </span></div>
<div class="p1">
<span class="s1">Buckets can be nester in other buckets, and there is variety of them. Elasticsearch allows you to partition documents in many different ways (e.g. by hour, by most popular terms, by geographical location).</span></div>
<div class="p1">
<span class="s1">An aggregation is a combination of buckets and metrics, and buckets can be nested inside other buckets we can create very complex aggregations. For instance to calculate the average salary for a combination of <country age="" gender="">:</country></span></div>
<ol class="ol1">
<li class="li1"><span class="s1">Partition documents by country (bucket),</span></li>
<li class="li1"><span class="s1">Partition each country bucket by gender (bucket),</span></li>
<li class="li1"><span class="s1">Partition each gender bucket by age range (bucket),</span></li>
<li class="li1"><span class="s1">Calculate the average salary for each age range (metric)</span></li>
</ol>
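<div class="p1">
<span class="s1">As a sketch of such a nested aggregation (index, type, field names and age ranges are hypothetical), the request below buckets by country, then by gender, then by age range, and computes an average-salary metric at the deepest level:</span></div>
<pre class="brush:java">GET /company/employee/_search?search_type=count
{
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "by_gender": {
          "terms": { "field": "gender" },
          "aggs": {
            "by_age": {
              "range": {
                "field": "age",
                "ranges": [ { "from": 20, "to": 30 }, { "from": 30, "to": 40 }, { "from": 40, "to": 50 } ]
              },
              "aggs": {
                "avg_salary": { "avg": { "field": "salary" } }
              }
            }
          }
        }
      }
    }
  }
}
</pre>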
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 26: Aggregation Test-drive -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap26.go">examples</a></span></div>
<div class="p1">
<span class="s1">Terms bucket in an aggregation query is a type of bucket definition that will create a new bucket for each unique term in encounters. In the result of this query, a bucket key correspond to the term value. </span></div>
<div class="p1">
<span class="s1">An additional ‘aggs’ level can be added nested inside another one in order to nested metrics, for example to a first ‘count’ by colour aggregation we can add an ‘avg' metric to calculate average of the values of the price ‘field'.</span></div>
<div class="p1">
<span class="s1">In addition to nest metric inside bucket, we can nest buckets inside other buckets.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 28: Building Bar Charts -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap28.go">examples</a></span></div>
<div class="p1">
<span class="s1">The ‘<b><i>histogram</i></b>’ bucket is essential for bar charts. It works by specifying an interval and a numeric field (e.g. price) to calculate bucket on. The interval defines how wide each bucket will be, for instance if it is set to 10 then a new bucket will be created every 10. In the response to such aggregation, the histogram keys correspond to the lower boundary.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 29: Looking at time -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap29.go">examples</a></span></div>
<div class="p1">
<span class="s1">The second most popular activity in Elasticsearch is building date histograms. Timestamps exists in variety of type of data, we can build on top of it metrics which are expressed over time. Example of time-based questions: how many cars sold each month this year, what was the price of this stock for the last 12 hours.</span></div>
<div class="p1">
<span class="s1">The <b><i>date_histogram</i></b> bucket works similarly as the histogram bucket but instead of building buckets based on numeric field, it is calendar-aware and uses time ranges. Each bucket is defined as a certain calendar size (e.g. a month).</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 30: Scoping Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap30.go">examples</a></span></div>
<div class="p1">
<span class="s1">But default when no query parameter is specified in an aggregation, Elasticsearch runs the all document. In fact, aggregations operate in the scope of the query and if there is no query then the scope will be ‘match_all’ query.</span></div>
<div class="p1">
<span class="s1">Omitting ’search_type=count’ from the aggregation url forces the search hits to be returned, and thus seeing the search result and aggregation results.</span></div>
<div class="p1">
<span class="s1">We can use global bucket to by pass the scope of a query to all documents.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 31: Filtering Queries and Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap31.go">examples</a></span></div>
<div class="p1">
<span class="s1">Because the aggregation operates in the scope of a query, then any filter added to the query will be applied to the aggregation.</span></div>
<div class="p1">
<span class="s1">We can use filter bucket so that document matching the filter (e.g. now - 1Month) will be added to the bucket. When using Filter bucket, all nested buckets or metrics will inherent the filter.</span></div>
<div class="p1">
<span class="s1"><b><i>Post filter</i></b> is a top level search parameter, it is executed after the search query to filter the results (i.e. search hits) but does not affect the query scope neither the aggregation. Thus it doesn’t affect the categorial facets. Note that for performance considerations, the post_filter should only be used in combination of aggregations and only when differential filtering is needed. Recap:</span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin: 0px; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
</div>
<ul class="ul1">
<li class="li1"><span class="s1">filtered query affects both search results and aggregations</span></li>
<li class="li1"><span class="s1">filter bucket: affects only aggregations</span></li>
<li class="li1"><span class="s1">post_filter: affects only search results.</span></li>
</ul>
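<div class="p1">
<span class="s1">A sketch of differential filtering with post_filter (names are illustrative): the colour aggregation sees every document matching the query, while the returned hits are narrowed down to green cars only.</span></div>
<pre class="brush:java">GET /cars/transactions/_search
{
  "query": { "match": { "make": "ford" } },
  "post_filter": { "term": { "color": "green" } },
  "aggs": {
    "all_colors": { "terms": { "field": "color" } }
  }
}
</pre>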
<b>Chapter 32: Sorting multi-value buckets -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap32.go">examples</a><br />
By default Elasticsearch sorts the aggregation buckets by doc_count in descending order. Elasticsearch provides many ways to customise the sorting:<br />
<b>1. <i>Intrinsic sorts</i></b>: operate on data generated by the bucket (e.g. doc_count). They use the ‘order’ object which can take one of these values: _count (sort by the bucket’s document count), _term (sort by the values of a field), _key (sort by the bucket’s key, works only with histogram and date histogram buckets).<b><br />2. <i>Sorting by a metric</i></b>: set the sort order with any single-value metric by referencing its name (see the sketch after these options). It is also possible to use multi-value metrics (e.g. extended_stats) by using a dot-path to the metric of interest.<br />
<b>3. <i>Sorting based on a metric in subsequent nested buckets</i></b> (my_bucket&gt;another_bucket&gt;metric): only for buckets generating a single value (e.g. filter, global); multi-value buckets (e.g. terms) generate many dynamic buckets, which makes it impossible to determine a deterministic path.<br />
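<div class="p1">
<span class="s1">For example, a sketch of sorting colour buckets by their average price in ascending order (names are illustrative):</span></div>
<pre class="brush:java">GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": {
        "field": "color",
        "order": { "avg_price": "asc" }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
</pre>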
<b><br /></b>
<div class="p1">
<span class="s1"><b>Chapter 33: Approximate Aggregations -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap33.go">examples</a></span></div>
<div class="p1">
<span class="s1">Simple operations like ‘max’ scales linearly with the number of machines of the Elasticsearch cluster. They don’t need coordination between the machines (i.e. no need for data movement over the network) and the memory footprint is too small (for the sum function all we need is to keep an integer). In the contrary, more complex operations need algorithms that can make tradeoffs between performance and memory utilisation.</span></div>
<div class="p1">
<span class="s1">Elastisearch support two approximate algorithms ‘<i>cardinality</i>’ and ‘<i>percentiles</i>’ which are fast but does provide an accurate result not an exact.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Cardinality</i></b> is the approximation of the distinct query that counts unique values of a field, it is based on the HyperLogLog (HLL) algorithm. This algorithm has configurable precision (through the ‘precision_threshold’ field that accept values from 0 to 40k) that impact how much memory will be used. If the field cardinality is below the threshold than the returned cardinality is almost always 100%.</span></div>
<div class="p1">
<span class="s1">To speed up the cardinality calculation on very large datasets in which case calculating hashes at query time can be painful, we can instead calculate the hash at index time.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b><i>Percentiles</i></b> is the other approximation algorithm provided by Elasticsearch, it shows the point at which certain percentage of values occur. For instance, 95th percentile is the value which is greater than 95% of the data. Percentiles are often used to quickly eyeball the distribution of data, check for skew or bimodalities, and also to find outliers. By default, the percentiles query return an array of pre-defined percentiles: 5, 25, 50, 75, 95, 99.</span></div>
<div class="p1">
<span class="s1">A compagnon metric is the ‘<i>percentile_rank</i>’ metrics which return for a given value the percentiles it belongs to. For example: the 50th percentile is 119ms, and 119ms percentile rank is the 50th percentile. </span></div>
<div class="p1">
<span class="s1">The percentiles metric is based on Ted Dunning’s TDigest algorithm (paper Computing Accurate Quantiles using T-Digest).</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 34: Significant Terms -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap34.go">examples</a></span></div>
<div class="p1">
<span class="s1">Significant terms are aggregation queries used for detecting anomalies. It is about finding uncommonly common patters, i.e. cases there becomes suddenly very common while in the past were uncommon. For instance, when analysing logs we may be interested in finding servers that throws a certain type of errors more often then they should.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<br />
<div class="p1">
<span class="s1">An example of how to use this to recommend .. is by analysing the group of people enjoying a certain .. (the foreground group) and determine what .. are most popular, it will then construct a list of popular .. for everyone (the background group). Comparing the two lists shows that statistical anomalies will be the .. which are over represented in the foreground compared to the background.</span></div>
<br />
to be continued..</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide (suite 1)</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><br /></a><a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a><br />
I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> a few weeks ago, and have been working on an Elasticsearch client in <a href="https://github.com/dzlab/elastic-go">golang</a>.<br />
Following are notes I've taken while reading this book.<br />
<br />
<h3 style="text-align: left;">
Dealing with Human Language (Part III) :</h3>
<br />
<div class="p1">
<span class="s1"><b>Chapter 18: Getting started with languages - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap18.go">examples</a></span></div>
<div class="p1">
<span class="s1">Elasticsearch comes with a set of analysers for most languages (e.g. Arabic, English, Japanese, Persian, Turkish, etc.). Each of these analysers perform the same kind of rules: tokenize text into words, lowercase each word, remove stopwords, stem tokens to their root. Additionally, these analysers may perform some language specific transformation to make the words searchable.</span></div>
<div class="p1">
<span class="s1">Language analysers can be used as is, but it is possible to configure them for instance by defining stem word exclusion (e.g. prevent word organisation from being stemmed to organ), or custom stop words (e.g. omitting no and not as they invert the meaning for the subsequent words).</span></div>
<div class="p1">
<span class="s1">In case there is multiple documents with predominant language in each one, it’s more appropriate to use different index for each language (e.g. blogs-en, blogs-fr). It is also possible to have all the translations gathered in the same document (e.g. title, title_br, title_es).</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 19: Identifying words - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap19.go">examples</a></span></div>
<div class="p1">
<span class="s1">Elasticsearch provides a set of tokenisers in order to extract tokens (i.e. words) from text. Example of tokenisers that can be used regardless of language:</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">whitespace: simply breaks text on whitespace,</span></li>
<li class="li1"><span class="s1">letter: breaks text on any character which is not letter,</span></li>
<li class="li1"><span class="s1">standard: uses Unicode Text Segmentation to find boundaries between words, </span></li>
<li class="li1"><span class="s1">tax_url_email: is similar to the standard tokeniser excepts it treats emails and urls as single words,</span></li>
</ul>
<div class="p1">
<span class="s1">The standard tokeniser is a good starting point to recognise words in most languages and is the basis tokeniser for specific one (e.g. spanish). However it provides a limited support for Asian languages, in such situation it’s better to consider the ‘icu_tokenizer'.</span></div>
<div class="p1">
<span class="s1">The ICU plugin need to be installed manually in order to have the support for other than english languages:</span></div>
<div class="p1">
<span style="color: blue; font-family: monospace;">./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION</span><br />
where <span style="color: blue; font-family: monospace;">$VERSION</span> can be found in <a href="http://github.com/elasticsearch/elasticsearch-analysis-icu">github.com/elasticsearch/elasticsearch-analysis-icu</a><br />
or in newer version of Elasticsearch <span style="color: blue; font-family: monospace;">./bin/plugin </span><span style="color: blue; font-family: monospace;">install analysis-icu</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">For tokenisers to work well the input text has to be cleaned, character filters can be added to preprocess text before tokenization. For instance, the ‘html_strip’ character filter removes HTML tags and decode entities into corresponding Unicode character.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 20: Normalising tokens - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap20.go">examples</a></span></div>
<div class="p1">
<span class="s1">After text is split into tokens, the later are normalised (e.g. to lowercase) in order for similar tokens to be searchable. For instance, removing diacritics (e.g. ‘,^ and ¨) in western languages with asciifolding filter which converts also Unicode characters into simpler ASCII representation.</span></div>
<div class="p1">
<span class="s1">Elasticsearch compares characters at the byte level, however the same Unicode characters may have different bytes representation. In this case, it is possible to use Unicode normalisation forms (nfc, nfd, nfkc, nfkd) that converts Unicode into standard format and comparable at byte level.</span></div>
<div class="p1">
<span class="s1">Lowercasing Unicode character is not straitforward, it has to be made by case folding that may not result in the correct spelling but does allow case-insensitive comparisons.</span></div>
<div class="p1">
<span class="s1">Similarly, asciifolding token filter has an equivalent for dealing with many languages which is icu_folding that extends the transformation to non ASCII-based scripts like Greek. For instance fold arabic numeral to latin equivalent.</span></div>
<br />
<div class="p1">
<span class="s1">We can protect particular characters from being folded using ‘UnicodeSet’ which is a kind of character class in regular expression.</span><br />
<span class="s1"><br /></span>
<br />
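<div class="p1">
<span class="s1">A sketch (assuming the ICU plugin is installed; index, filter and analyser names as well as the character set are illustrative) that folds everything except the Swedish letters å, ä and ö:</span></div>
<pre class="brush:java">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
</pre>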
<div class="p1">
<span class="s1"><b>Chapter 21: Reducing words to their root form -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap21.go">examples</a></span></div>
<div class="p1">
<span class="s1">Stemming attempts to remove the difference between inflected forms of a word (like number: fox and foxes, gender: waiter and waitress, aspect: ate and eaten) in order to reduce each word to its root form. English is a weak inflected language (i.e. we can ignore inflection in words and still having good search result), but this is not the case for all languages that may need an extra work.</span></div>
<div class="p1">
<span class="s1">Stemming may suffer from understemming and overstemming, the former is failing to reduce words with same meaning to the same root and result in relevant document not been returned. The latter is failling to separate words with different meaning which reduces precision (i.e. returning irrelevant documents).</span></div>
<div class="p1">
<span class="s1">Elasticsearch has two classes of stemmers that can be used: algorithmic and dictionary stemmer. </span></div>
<div class="p1">
<span class="s1"><b>Algorithmic stemmer</b> applies a sequence of rules to the given word to reduce it to its root form.</span></div>
<div class="p1">
<span class="s1"><b>Dictionary stemmer</b> uses a dictionary of words to their root format, so that it has only to lookup for the word to be stemmed. These stemmers are as good as their dictionaries, for instance words meaning may change over time and the dictionary have to be updated. Also, the size of the dictionary may hurt the performances as all words (suffixes and prefixes) have to be loaded into RAM. Example of widely used dictionary is the spell checker Hunspell.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 22: Stopwords performance vs precision -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap22.go">examples</a></span></div>
<div class="p1">
<span class="s1">Reducing index size can be achieved by indexing fewer words. Terms to index can be divided into Low frequency terms that appear in fewer index thus having high weight. And terms with high frequency that appear in many documents in the index. The frequency depends on the type of indexed documents, e.g. 'and’ in chinese documents will be a rare word. For any language there are common words (also called stop words) that may be filtered out before indexing but this may bring some limitations: distinguishing between ‘happy’ and 'not happy’.</span></div>
<div class="p1">
<span class="s1">To speedup query performance, we should avoid default query that uses the ‘or’ operator. </span></div>
<div class="p1">
<span class="s1">1. One possible option is to use ‘and’ operator in match query like </span><span style="color: blue; font-family: monospace;">{"match": {"my_field": {"query": "the quick brown fox", "operator":"and"}}}</span>. Which is then rewritten to a bool query <span style="color: blue; font-family: monospace;">{"bool":{"must":[{"term": {"my_field":"the"}}, {"term": {"my_field":"quick"}}, {"term": {"my_field":"brown"}}, {"term": {"my_field":"fox"}}]}}</span>. Elasticsearch will execute first the query with least frequent term to immediately reduce the number of explored documents.</div>
<div class="p1">
<span class="s1">2. Another option for enhancing performance is to use ‘<b><i>minimum_should_match</i></b>’ property in the match query.</span></div>
<div class="p1">
<span class="s1">3. Its possible to divide the terms in search query into low frequency group (relevant terms used for filtering/matching) and high frequency group (irrelevant terms used for scoring only) terms. This is can be achieved with ‘<b><i>cutoff_frequency</i></b>’ query parameter, e.g. </span><span style="color: blue; font-family: monospace;">{</span><span style="color: blue; font-family: monospace;">{"match": {"text": {"query": "Quick and the dead", "cutoff_frequency": 0.01}}}</span>. The latter result in a combined “must” clause with terms “quick/dead” and a should clause with terms “and/the”.</div>
<div class="p1">
<span class="s1">The parameters “<b><i>cutoff_frequency</i></b>” and “<i><b>minimum_should_match</b></i>” can be combined toghether.</span></div>
<div class="p1">
<span class="s1">To effective reduce the index size use the appropriate ’index_options’ in a Mapping API request, possible values are: </span></div>
<ul class="ul1">
<li class="li1"><span class="s1">‘<b><i>docs</i></b>’ (default for ’<b><i>non_analyzed</i></b>’ string fields): store only which documents include which terms,</span></li>
<li class="li1"><span class="s1">‘<b><i>freqs</i></b>’: store ‘docs’ information plus frequency of terms in each document,</span></li>
<li class="li1"><span class="s1">‘<b><i>positions</i></b>’ (default for ‘<b><i>analyzed</i></b>’ string fields): store ‘docs’ and ‘frees’ information plus the position of each term in each document,</span></li>
<li class="li1"><span class="s1">‘<b><i>offsets</i></b>’: store ‘docs’, ‘freqs’ and ‘positions’ plus the start and end character offsets of each term in the original string,</span></li>
</ul>
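<div class="p1">
<span class="s1">A mapping sketch setting index_options on a field (index, type and field names are illustrative):</span></div>
<pre class="brush:java">PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}
</pre>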
<span class="s1"><b>
</b></span><br />
<div class="p1">
<div class="p1">
<span class="s1"><b>Chapter 23: Synonyms -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap23.go">examples</a></span></div>
<div class="p1">
<span class="s1">Synonyms are used to broaden the scope of the matching documents, this kind of search (like stemming, partial matching) should be combined with another query on a field with the original text. Synonyms can be defined in the Index API request inlined in the ’synonyms’ settings parameter or in a file by specifying a path in ’synonyms_path' parameter. The latter can be absolute or relative Elasticsearch ‘config’ directory.</span></div>
<div class="p1">
<span class="s1">Synonym expansion can be done at index or search time, for instance we can replace English with the terms ‘english’ and ‘british’ at index time then in search time we could u-query for one of these terms. If synonyms are not used at index time, then at search time we have to convert the queries with ‘english’ into a query for ‘english OR british’.</span></div>
<div class="p1">
<span class="s1">Synonyms are listed as comma-separated values like ‘jump,leap,hop’. It is also possible to use the syntax with ‘=>’ to specify on the left side a list of terms to match (e.g. gb, great brain) and on the right side one or many replacement (e.g. britain,england,scotland,wales). In case many rules are specified for the same left side then the tokens in the right side are merged.</span></div>
<div class="p1">
<span class="s1">Replacing synonyms can be done with one of the following options:</span></div>
<div class="p1">
<span class="s1"><b><i>1. Simple expansion:</i></b> any of the listed synonyms is expanded into all of the listed synonyms (e.g. ‘jump,hop,leap’). This type expansions can be applied either at index time or search time.</span></div>
<div class="p1">
<span class="s1"><b><i>2. Simple contraction:</i></b> a group of synonyms in the left side are mapped to a single value in the right side (e.g. ‘leap,hop => jump’). This type of expansions must be applied at index time and query time to insure query terms are mapped to the same value.</span></div>
<div class="p1">
<span class="s1"><b><i>3. Genre expansion:</i></b> it widens the meaning of terms to be more generic. Applying this technique at index time with the following rules:</span></div>
<div class="p1">
</div>
<ul style="text-align: left;">
<li>‘cat => cat,pet’</li>
<li>‘kitten => kitten,cat,pet’</li>
<li>‘dog => dog,pet’</li>
<li>‘puppy => puppy,dog,pet’</li>
</ul>
<br />
<div class="p1">
<span class="s1">then when querying for ‘kitten’ only documents about kittens will be returned, when querying for cat documents about kittens and cats are returned, and when querying for pet all documents about kittens, cats, puppies, dogs or pets will be returned.</span></div>
<div class="p1">
<span class="s1">Synonyms and the analysis chain: </span></div>
<div class="p1">
<span class="s1">It is appropriate to set the first a tokeniser filter, then a stemmer filter before putting the synonyms filter. In this case instead of having a rule like ‘jumps,jumped,leap,leaps,leaped => jump’ we can have ‘leap => jump’.</span></div>
<div class="p1">
<span class="s1">In some case, the synonym filters cannot be simply put behind a lowercase filter as it have to deal with terms like CAT or PET (Positron Emission Tomography) which are conflicting when lowercased. A possibility will be to: </span></div>
<div class="p1">
<span class="s1">1. put the synonym filter before the lowercase filter and specify rules with both lowercase and uppercase forms.</span></div>
<div class="p1">
<span class="s1">2. or have two synonym filters one for case-sensitive synonyms (with rules like ‘CAT,CAT scan => cat_scan') and another one for case insensitive synonyms (with rules like ‘cat => cat,pet’).</span></div>
<div class="p1">
<span class="s1">Multi-word synonyms and phrase queries: using synonyms with 'simple expansion’ (i.e. rules like ‘usa,united states,u s a,united states of america') may lead to some bizarre results for phrase queries, it’s more appropriate to use ’simple contraction’ (i.e. rules like ‘united states,u s a,united states of america=>usa’).</span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1"><b>Symbol synonyms:</b> used for instance to avoid emoji (like ‘:)') been striped away by the standard tokeniser filter as they may change the meaning of the phrase. The solution will be to define a mapping character filter. This will ensure that emoticons are included in the index for instance to do sentiment analysis. </span></div>
<div class="p1">
<span class="s1">Note that mapping character filter is useful for simple replacements of exact characters, for more complex patterns; regular expressions should be used.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 24: Typoes and misspellings -</b> <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap24.go">examples</a></span></div>
<div class="p1">
<span class="s1">This chapter is about fuzzy matching at query time and sounds-like matching at index time for handling misspelled words.</span></div>
<div class="p1">
<span class="s1">Fuzzy matching treats words which are fuzzily similar as if they are the same word. This is based on Damerau-Levenshtein edit distance, i.e. number of operations (edit, insertion, deletion) to perform on a word until it becomes equal to the target word. Elasticsearch supports a maximum of edit distance ‘fuzziness’ of 2 (default is set to ‘AUTO’). Two can be overkilling as a fuzziness value (most misspelling errors are of distance 1) especially for short words (e.g. hat is at 2 distance for mad).</span></div>
<div class="p1">
<span class="s1">Fuzzy query with an edit distance of two can perform very badly and match a large number of documents, the following parameters can be used to limit performance impact:</span></div>
<ol class="ol1">
<li class="li1"><span class="s1"><b><i>prefix_length</i></b>: number of initial characters that will not be fuzzified, as most types occur at the end of words,</span></li>
<li class="li1"><span class="s1"><b><i>max_expansions</i></b>: limit the number of options produced, i.e. generated fuzzy words until this limit is matched. </span></li>
</ol>
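<div class="p1">
<span class="s1">A sketch of a fuzzy match query using both parameters (index, type, field and values are illustrative):</span></div>
<pre class="brush:java">GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "surprize",
        "fuzziness": "AUTO",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}
</pre>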
<div class="p1">
<span class="s1">Scoring fuzziness: fuzzy matching should not be used for scoring but only to widen the match result (i.e. increasing recall). For example, if we have 1000 documents containing the word ‘Algeria’ and one document with the word ‘Algeia’, then the latter misspelled word will be considered more relevant (thanks to TF/IDF) as has fewer appearance.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Phonetic matching</b>: there is plenty of algorithms for dealing with phonetic error, most of them are specialisation of the Soundex algorithm. However they are language specific (either English or German). You need to install the phonetic plugin - here github.com/elasticsearch/elasticsearch-analysis-phonetic. </span></div>
<br />
<div class="p1">
<span class="s1">Similarly, phonetic matching should not be used for scoring as it is intended to increase recall. Phonetic algorithms are useful when the search result will be processed by the machine and not by humans.</span><br />
<span class="s1"><br /></span>
<span class="s1">Notes for subsequent chapters can be found <a href="http://elsoufy.blogspot.com/2016/05/notes-from-elasticsearch-definitive_9.html">here</a>.</span></div>
</div>
</div>
</div>
<h3 style="text-align: left;">
Notes from Elasticsearch - The Definitive Guide</h3>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: left;">
<a href="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://www.elastic.co/assets/blt754ffd504915e208/elasticsearch-the-defnitive-guide.jpg" width="150" /></a>I started reading <a href="http://shop.oreilly.com/product/0636920028505.do">Elasticsearch - The Definitive Guide</a> few weeks ago, and working on an Elasticsearch client for <a href="https://github.com/dzlab/elastic-go">golang</a>.</div>
Following are notes I've taken while reading this book:<br />
<br />
<div class="p1">
<span class="s1"><b>Chapter1: </b></span></div>
<ul class="ul1">
<li class="li1"><span class="s1">history: lucent, compass, elasticsearch</span></li>
<li class="li1"><span class="s1">download/run node, plugging manager Marvel, Elasticsearch vs Relational DB, </span></li>
<li class="li1"><span class="s1">Employee directory example: Create index (db), index (store) document, query (light && DSL), aggregations</span></li>
</ul>
<div class="p1">
<span class="s1"><b>Chapter 2:</b> (about distribution)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Cluster health (green yellow, red), Create index with 3 shards (default 5) and 1 replica, then scaling nb of replicas (up or down), master reelection after it fails</span></li>
</ul>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 3:</b> </span></div>
<div class="p1">
<span class="s1">API for managing documents (create, retrieve, update, delete)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Document metadata (_index, _type, _id)</span></li>
<li class="li1"><span class="s1">Index document: PUT /{index}/{type}/{id}, for auto-generated ids: POST /{index}/{type}/</span></li>
<li class="li1"><span class="s1">Retrieve a document: GET /{index}/{type}/{id}, without metadata GET /{index}/{type}/{id}/_source, some fields: GET /{index}/{type}/{id}?_source={field1},{field2}</span></li>
<li class="li1"><span class="s1">Check existence of document curl -i XHEAD http://elastic/{index}/{type}/{id}</span></li>
<li class="li1"><span class="s1">Delete a document: DELETE /{index}/{type}/{id}, </span></li>
<li class="li1"><span class="s1">Update conflicts with optimistic concurrency control, uses _version to ensure changes to be applied in correct order, to retry in case of failures many times POST /{index}/{type}/{id}/_update?retry_on_conflict=5</span></li>
<li class="li1"><span class="s1">Update using scripts (in Groovy) or set initial value (to avoid failures for non existing document) POST /{index}/{type}/{id}/_update -d ‘{“script”: “ctx._source.views+=1”, “upsert”: {“view”: 1}}’</span></li>
<li class="li1"><span class="s1">Multi-GET: GET /_mget -d {“docs”: [{“_index”: “website”, “_type”: “blog”, “_id”: 2}, …]} or GET /{index}/{type}/{id}/_mget -d {“ids”: [“2”, “1”]}</span></li>
<li class="li1"><span class="s1">Bulk operations (not atomic/transactional, i.e. if sone fails, some may succeeds) POST /_bulk -d {action: {metadata}}\n{request body}</span></li>
</ul>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 4:</b> </span></div>
<div class="p1">
<span class="s1">How document management operations are executed by elastic search</span></div>
<div class="p1">
<span class="s1"><b>Chapter 5:</b> </span></div>
<div class="p1">
<span class="s1">Search basics (look for data sample in gist)</span></div>
<ul class="ul1">
<li class="li1"><span class="s1">Search all types in all indices GET /_search</span></li>
<li class="li1"><span class="s1">Search a type that contains a word in a field GET /_all/{type}/_search?q={field}:{word}</span></li>
<li class="li1"><span class="s1">Queries with + conditions (e.g. +{field}:{value}) must be satisfied, - conditions must not be satisfied, nothing means the condition is optional. </span></li>
</ul>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 6:</b></span></div>
<div class="p1">
<span class="s1">Core data types in elastic search are indexed differently, to understand how elastic search interpreted the indexed documents and to avoid surprising query results (e.g. age mapped to string instead of integer), look at the mapping (i.e. schema definition) for the type and index. GET /{index}/_mapping/{type}</span></div>
<div class="p1">
<span class="s1">ES uses inverted indexes that consists of a list of unique words in all documents and for each one, the list of document it appears in. </span></div>
<div class="p1">
<span class="s1">Each document and query are passed by analysers that filter characters, tokenise words, then filter these tokens. ES ships with some analysers: standard analyser (used by default), simple analyser, whitespace analyser, language analyser. Analysers are applied only to full text searches and not to exact values. </span></div>
<div class="p1">
<span class="s1">To understand how documents are tokenised and stored in a given index, we can use the Analyse API by specifying the analyser: GET /_analyze?analyzer=standard -d “Text to analyse”. In the response, the value of token is what it will be stored in the index.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 7:</b></span></div>
<div class="p1">
<span class="s1">Filter vs Query DSL, elastic search has two DLS which are similar but serve different purposes, the filter DSL asks a yes/no question on every document and it is used for exact value field. In the other hand, Query DSL asks how well this relevant is this document question, and assign it a _score. In terms of performance, filters are much lighter and uses caches for even faster future searches. Queries are heavier and must be used only for full text searches.</span></div>
<div class="p1">
<span class="s1">Most used filters are: term/terms, exists, match_all, match, multi_match (to run same match on multiple fields), and bool query.</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">Queries can become easily very complex, combining multiple queries and filters, elastic search provides _validate endpoint for query validation:</span></div>
<div class="p1">
<span class="s1">GET /{index}/{type}/_validate/query QUERY_BODY</span></div>
<div class="p1">
<span class="s1">Elastic search provides also a human-readable explanation for non valid queries: GET /{index}/{type}/_validate/query?explain QUERY_BODY</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 8: Sorting and relevance</b></span></div>
<div class="p1">
<span class="s1">By default search result documents are sorted by relevance (i.e. _score value) in descending order, however for filter queries which doesn’t have impact on the _score field it may be interesting to sort other ways (e.g. date). Example of a sort query:</span></div>
<div class="p1">
<span class="s1">GET /_search {"query": {“filtered”: {“filter”: {“term”: {“user_id”: 1}}}}, “sort”: {“date”: {“order”: “desc"}}}</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 10: Index Management </b></span></div>
<div class="p1">
<span class="s1">A type in Elasticsearch consists of a name and a mapping (just like a database schema) that describes its fields, there data types and how they are indexed and stored in lucene. The json representation of a document is stored in plain in the ‘_source’ field which may consume disk space, so a good idea will be to disable it.</span></div>
<div class="p2">
<b><span class="s1"></span><br /></b></div>
<div class="p1">
<span class="s1"><b>Chapter 15:</b> - <a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap15.go">examples</a></span></div>
<div class="p1">
<span class="s1">Phrase search (how to search for terms with a specific order in the target documents) and proximity search with ‘slop’ parameter that gives more flexibility to the search request</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 16: - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap16.go">examples</a></span></div>
<div class="p1">
<span class="s1">Queries for matching parts of a term (not the whole). In many cases, it is sufficient to use a stemmer to index the root form of words, but there are cases where we need partial matching (e.g. matching a regex in not_analyzed values).</span></div>
<div class="p1">
<span class="s1">Example of queries: ‘prefix’ query works on term level, doesn’t analyse the query string before searching, and performs as a filter (i.e. no relevance calculation). Shorter prefix length means many possible terms to be visited, so for better performance use longer prefixes.</span></div>
<div class="p1">
<span class="s1">Query-time search as you type with match_phrase_prefix queries, and index-time search as you type by defining n-grams</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1"><b>Chapter 17: Controlling relevance score - </b><a href="https://github.com/dzlab/elastic-go/blob/master/examples/chap17.go">examples</a></span></div>
<div class="p1">
<span class="s1">Relevance score in Lucene (thus Elasticsearch) is based on Term Frequency/Inverse Document Frequency and Vector Space Model (to combine weight of many terms in search query), in addition to a coordination factor, field length normalization and term/query clause boosting.</span></div>
<div class="p1">
<span class="s1"><b><i>1. Boolean model</i></b>: applies AND, OR and NOT conditions of the search query to find matching documents.</span></div>
<div class="p1">
<span class="s1"><i><b>2. Term frequency/Inverse document frequency (TF/IDF)</b></i>: the matching documents then have to be sorted by relevance that depends on the weight of the query terms appearing in these documents. The weight of a term is determined by the following factors:</span></div>
<ul class="ul1">
<li class="li1"><span class="s1"> Term frequency: defines how often a term appear in this document (the more often the higher is its weight). For a given term t and document d, it is calculated by the square root of the frequency, i.e. tf(t in d)=(frequency)^1/2</span></li>
<li class="li1"><span class="s1">Inverse document frequency: defines how often a term appears in all document of a collection (the more often the lower the weight). It is calculated based on the number of documents in the collection and number of document the term appears in, as follows: idf(t) = 1 + log(numDocs / (docFreq + 1))</span></li>
<li class="li1"><span class="s1">Field length norm: defines how long the field is (the shorter it is the higher the weight), if a term appears in a short field (e.g. title) then it is likely the content of that field is about this term. In some cases (e.g. logging) norms are not necessary (e.g. we don’t care about length of user agent), disabling them can save a lot of memory. This metric is calculated as the inverse square root of number of terms in the given field: norm(d) = 1 / (numTerms)^1/2</span></li>
</ul>
<div class="p1">
<span class="s1">These factors are calculated and stored at index time, together they serve to calculate of a single term in a document.</span></div>
<div class="p1">
<span class="s1"><b><i>3. Vector space model</i></b>:</span></div>
<div class="p1">
<span class="s1">A single score representing how well a document match a query. It is calculated by first representing the search query and the document as one-dimensional vector with a size equal to number of query terms. Each element is the weight of a term calculated with TF/IDF by default although it’s possible to use other techniques (e.g. Okapi-BM25). Then the angle between these vectors is calculated (Cosine similarity), the closer they are the more relevant the document is to the query.</span></div>
<div class="p1">
<span class="s1">Lucene’s practical scoring function: Lucene combines multiple scoring functions:</span></div>
<div class="p1">
<span class="s1"><i><b>1. Query coordination</b></i>: rewards document that have most of the search query terms, i.e. the more query terms the document contains the more relevant it is. Sometimes, you may want to disable this function (although most use cases for disabling Query Coord are handled automatically), for instance if the query contains synonyms.</span></div>
<div class="p1">
<span class="s1"><b><i>2. Query time boosting</i></b>: a particular query clause can use the boost parameter to be given a higher importance over clauses with less boost value or without it. Boosting can also be applied to entire indexes.</span></div>
<div class="p1">
<span class="s1">Note: not_analyzed fields have ‘field length norms’ disabled and ‘index_options’ set to docs these disabling ’term frequencies’, the IDF of each term are still considered.</span></div>
<div class="p1">
<span class="s1">Function score query: can use Decay functions (linear, exp, guess) incorporate sliding scale (like publish_date, geo_location, price) into the _score to alter documents relevance (e.g. recently published, near a lat-lon/price point) </span></div>
<br />
<div class="p1">
<span class="s1">For some use cases of ‘field_value_factor’ in a Function score query using directly the value of field (e.g. popularity) may not be appropriate (i.e. new_score = old_score * number_of_votes), in this case a modifier can be used for instance log1p which changes the formula to new_score = old_score * log(1 + number_of_votes).</span></div>
<b><br /></b>
<br />
<div class="p1">
Notes for subsequent chapters can be found <a href="http://elsoufy.blogspot.fr/2016/05/notes-from-elasticsearch-definitive.html">here</a>.</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-91630223836235303352015-08-06T09:51:00.001-07:002015-08-06T09:52:08.686-07:00Adding functionalities to existing classes in Scala<div dir="ltr" style="text-align: left;" trbidi="on">
New functionality can be added to existing classes by wrapping them in a wrapper class and adding implicit methods for converting back and forth from the original class:<br />
<br />
<pre class="brush:java">class TLong(val value: Long) {
def +(other: TLong) = new TLong(value + otehr.value)
def decrement = new TLong(value - 1L)
override def toString(): String = value.toString;
}
// implicit methods for conversions
implicit def toTLong(l: Long) = new TLong(l)
implicit def toLong(tl: TLong) = tl.value
// some tests
val l1: TLong = new TLong(1)
val l2: TLong = new TLong(2)
l1 + l2
1L + l2
l1 + 2L
</pre>
Starting with Scala 2.10, you can use implicit classes so that you don't have to define the conversion methods yourself, as they are created automatically:<br />
<pre class="brush:java">implicit class ImplicitLong(val l: Long) {
def print = l.toString
}
1L.print</pre>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-25173971721346698282015-06-13T09:03:00.002-07:002015-06-15T08:29:20.770-07:00Running Java applications on CloudFoundry<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Introduction</h4>
<div style="text-align: left;">
CloudFoundry v2 uses Heroku buildpacks to package the droplet on which an application will run. Before that, CF checks which of the locally available buildpacks can be used to prepare the application runtime. The <a href="http://docs.cloudfoundry.org/buildpacks/custom.html">buildpack contract</a> is composed of the following scripts (which can be written in shell, python, ruby, etc.):</div>
<ul style="text-align: left;">
<li>Detect: checks if this buildpack is suitable for the submitted application,</li>
<li>Compile: prepares the runtime environment of the application,</li>
<li>Release: finally launches the application</li>
</ul>
<br />
<h4 style="text-align: left;">
Applications with single file</h4>
Java applications, whether standalone or web, are managed by the <a href="https://github.com/cloudfoundry/java-buildpack">java-buildpack</a>. If a <a href="http://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html">manifest.yml</a> is used to submit the application, then for web applications or executable jars it may look like:<br />
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">applications:</span><br />
<span style="color: blue; font-family: monospace;">- name: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><br />
<span style="color: blue; font-family: monospace;"> memory: 4G</span><br />
<span style="color: blue; font-family: monospace;"> disk_quota: 2G</span><br />
<span style="color: blue; font-family: monospace;"> timeout: 180</span><br />
<span style="color: blue; font-family: monospace;"> instances: 1</span><br />
<span style="color: blue; font-family: monospace;"> host: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><span style="color: blue; font-family: monospace;">-${random-word}</span><br />
<span style="color: blue; font-family: monospace;"> path: /path/to/war/file.war</span> or <span style="color: blue; font-family: monospace;">/path/to/executable/file.jar</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The java-buildpack will check whether the file is a .war, in which case it launches a <a href="https://github.com/cloudfoundry/java-buildpack/blob/master/lib/java_buildpack/container/tomcat.rb">Tomcat container</a>, or an <a href="https://github.com/cloudfoundry/java-buildpack/blob/master/docs/container-java_main.md">executable jar</a>, in which case it looks for the mainClass in <span style="color: blue; font-family: monospace;">META-INF/MANIFEST.MF</span>.</div>
<div style="text-align: left;">
<br /></div>
<h4>
Applications with many files</h4>
<div style="text-align: left;">
In case the application is composed of multiple files (jars, assets, configs, etc.), the java-buildpack won't be able to automatically detect the appropriate container to use. We need:</div>
<div style="text-align: left;">
1. For the <b>Detect</b> phase to choose which container is appropriate (here the java-main): Clone the <a href="https://github.com/cloudfoundry/java-buildpack/">java-buildpack</a> and set the <span style="color: blue; font-family: monospace;">java_main_class</span> property in <span style="color: blue; font-family: monospace;">config/java_main.yml</span>.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
2. In the manifest: indicate the <span style="color: blue; font-family: monospace;">path</span> to the folder containing all artifacts that should be downloaded to the droplet at the <b>Compile</b> phase.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
3. In the manifest: set the <span style="color: blue; font-family: monospace;">command</span> that will be used at the <b>Release</b> phase to launch the application. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
An example of java_main.yml file:</div>
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">java_main_class: package.name.ClassName</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
An example of a manifest.yml file:</div>
<span style="color: blue; font-family: monospace;">---</span><br />
<span style="color: blue; font-family: monospace;">applications:</span><br />
<span style="color: blue; font-family: monospace;">- name: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><br />
<span style="color: blue; font-family: monospace;"> memory: 2G</span><br />
<span style="color: blue; font-family: monospace;"> timeout: 180</span><br />
<span style="color: blue; font-family: monospace;"> instances: 1</span><br />
<span style="color: blue; font-family: monospace;"> host: </span><span style="color: blue; font-family: monospace;">APP_NAME</span><span style="color: blue; font-family: monospace;">-${random-word}</span><br />
<span style="color: blue; font-family: monospace;"> path: ./</span><br />
<span style="color: blue; font-family: monospace;"> buildpack: http://url/to/custom/java-buildpack</span><br />
<span style="color: blue; font-family: monospace;"> command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:. -Djava.io.tmpdir=$TMPDIR </span><span style="color: blue; font-family: monospace;">package.name.ClassName</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<h4 style="text-align: left;">
Application submission</h4>
<span style="color: blue; font-family: monospace;">$ cf push</span><span style="color: blue; font-family: monospace;"> </span>to submit an application<br />
<span style="color: blue; font-family: monospace;">$ cf logs APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access the application logs<br />
<span style="color: blue; font-family: monospace;">$ cf events APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access CF events related to this application<br />
<span style="color: blue; font-family: monospace;">$ cf files APP_NAME</span><span style="color: blue; font-family: monospace;"> </span>to access the VCAP user home where the application files are stored<br />
<br /></div>
<h4 style="text-align: left;">
Troubleshooting</h4>
If the application fails to start for some reason (you may see no logs), you can check what command was used to launch the application as follows:<br />
<span style="color: blue; font-family: monospace;">$ CF_TRACE=true cf app app_name | grep "detected_start_command"</span><br />
<br />
<div>
<b>Note</b> </div>
<div>
</div>
<ul>
<li>Uploaded jar files are extracted under <span style="color: blue; font-family: monospace;">/home/vcap/app/APP_NAME</span> in the droplet.</li>
<li>For an executable jar, the application needs to accept traffic on the port given by CF, which is available in the <span style="color: blue; font-family: monospace;">VCAP_APP_PORT</span> environment variable; otherwise CF will consider that the application failed to start and will shut it down (see the sketch after the code block below).</li>
<li>To check whether a Java program is running on CloudFoundry:</li>
</ul>
<pre class="brush:java">import org.cloudfoundry.runtime.env.CloudEnvironment;
...
CloudEnvironment cloudEnvironment = new CloudEnvironment();
if (cloudEnvironment.isCloudFoundry()) {
// activate cloud profile
System.out.println("On A cloudfoundry environment");
}else {
System.out.println("Not on A cloudfoundry environment");
}
</pre>
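A minimal sketch of the port note above: read the CF-assigned port from <span style="color: blue; font-family: monospace;">VCAP_APP_PORT</span> and bind a server to it (the class name and the 8080 fallback are illustrative, and the JDK's built-in HTTP server stands in for whatever framework the application actually uses):<br />
<pre class="brush:java">import java.io.IOException;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

public class PortAwareServer {
    public static void main(String[] args) throws IOException {
        // CF injects the port to listen on; fall back to 8080 when running locally
        String portEnv = System.getenv("VCAP_APP_PORT");
        int port = (portEnv != null) ? Integer.parseInt(portEnv) : 8080;

        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        System.out.println("Listening on port " + port);
    }
}
</pre>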
<br />
Resources:<br />
<ul style="text-align: left;">
<li>Standalone (non-web) applications on Cloud Foundry - <a href="http://blog.yenlo.com/nl/standalone-non-web-applications-cloud-foundry">link</a></li>
</ul>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com4tag:blogger.com,1999:blog-2035497736124196692.post-68090020339481032292015-05-31T03:54:00.000-07:002015-06-13T09:08:36.567-07:00DEV 301 - Developing Hadoop Applications<div dir="ltr" style="text-align: left;" trbidi="on">
1. Introduction to Developing Hadoop Applications<br />
- Introducing MapReduce concepts and history<br />
- Describing how MapReduce works at a high level and how data flows in it<br />
<br />
The typical example of a MapReduce application is Word Count. As input there are many files, which are split among the TaskTracker nodes where the files are located. Each split consists of multiple records; here a record is a line. The <b>Map</b> function gets key-value pairs and uses only the value (i.e. the line) to emit one occurrence at a time for each word. A <b>Combine</b> function then aggregates the occurrences and passes them to the <b>Shuffle</b> phase. The latter is handled by the framework and gathers the output of the prior functions by key before sending it to the reducers. The <b>Reduce</b> function takes the list of all occurrences (i.e. values) of a word (i.e. key) and sums them up to return the total number of times the word has been seen (see the code sketch after the figure below).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg24eRZqG1AgvVK7X9LfEMNpaWXGfI-Jid6tIBpsnM2GZ8p4uUhegcMOzABYXRj-SZUTl7WKBczExHMweMbrcNRHFCjdCeTHc9s-C9qbpP5UEVD6PApJPDRDWX419fncNyWiTXTfV8m00Q/s1600/WordCount+example.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg24eRZqG1AgvVK7X9LfEMNpaWXGfI-Jid6tIBpsnM2GZ8p4uUhegcMOzABYXRj-SZUTl7WKBczExHMweMbrcNRHFCjdCeTHc9s-C9qbpP5UEVD6PApJPDRDWX419fncNyWiTXTfV8m00Q/s1600/WordCount+example.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapReduce example: Word Count</td></tr>
</tbody></table>
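A minimal sketch of the Map and Reduce functions described above, using the Hadoop MapReduce API (class names are illustrative; the Reducer can also be registered as the Combiner):<br />
<pre class="brush:java">import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input record (a line) -> one (word, 1) pair per token
public class WordCountMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer (also usable as Combiner): sums the occurrences of each word
class WordCountReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
    @Override
    protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
</pre>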
Run the word count example:<br />
1. Prepare a set of input text files:<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p /user/user01/1.1/IN1</span><br />
<span style="color: blue; font-family: monospace;">$ cp /etc/*.conf /user/user01/1.1/IN1 2> /dev/null</span><br />
<span style="color: blue; font-family: monospace;">$ ls /user/user01/1.1/IN1 | wc -l</span><br />
2. Run the word count application using the previously created files<br />
<span style="color: blue; font-family: monospace;">$ hadoop jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1-mapr-1408.jar wordcount /user/user01/1.1/IN1 /user/user01/1.1/OUT1</span><br />
3. Check the job output<br />
<span style="color: blue; font-family: monospace;">$ wc -l /user/user01/1.1/OUT1/part-r-00000</span><br />
<span style="color: blue; font-family: monospace;">$ more </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT1/part-r-00000</span><br />
<br />
Trying binary files as input:<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p /user/user01/1.1/IN2/mybinary</span><br />
<span style="color: blue; font-family: monospace;">$ cp /bin/cp /user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary</span><br />
<div>
<span style="color: blue; font-family: monospace;">$ file </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary</span></div>
<span style="color: blue; font-family: monospace;">$ strings </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/IN2/</span><span style="color: blue; font-family: monospace;">mybinary | more</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-</span><span style="color: blue; font-family: monospace;">examples.jar wordcount /user/user01/1.1/IN2/mybinary </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT2</span><br />
<span style="color: blue; font-family: monospace;">$ more </span><span style="color: blue; font-family: monospace;">/user/user01/1.1/OUT2/</span><span style="color: blue; font-family: monospace;">part-r-00000</span><br />
Look for references to the word AUTH in the input and output:<br />
<span style="color: blue; font-family: monospace;">$ strings /user/user01/1.1/IN2/mybinary | grep -c AUTH</span><br />
<span style="color: blue; font-family: monospace;">$ egrep -ac AUTH /user/user01/1.1/OUT2/part-r-00000</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmy1meMR8NjOyyChf2bZEFmTb6V7x5AQ8bA7mfh93Zv9ttd2Fww3aNhgfGbyla8g9xKz5ieKDPMrtEjjCwKffXlCmtjwT2pT6vLgZg8DEUlHw6cueuxW7ajYIK2QiKmjZ3AEjJ6EUFXp8/s1600/Execution+and+Data+Flow.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmy1meMR8NjOyyChf2bZEFmTb6V7x5AQ8bA7mfh93Zv9ttd2Fww3aNhgfGbyla8g9xKz5ieKDPMrtEjjCwKffXlCmtjwT2pT6vLgZg8DEUlHw6cueuxW7ajYIK2QiKmjZ3AEjJ6EUFXp8/s1600/Execution+and+Data+Flow.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapReduce execution summary and data flow</td></tr>
</tbody></table>
<br />
<br />
MapReduce Workflow:<br />
<ul style="text-align: left;">
<li>Load production data into HDFS with tools like Sqoop for SQL data or Flume for log data, or with traditional tools, since MapR-FS supports POSIX operations and NFS access.</li>
<li>Analyze, Store, Read.</li>
</ul>
<div>
The InputFormat object is responsible for validating the job input, splitting files among mappers and instantiating the RecordReader. By default, the size of an input split is equal to the size of a block, which is 64 MB in Hadoop; in MapR it is the size of a chunk, which is 256 MB. Each input split references a set of records which will be broken into key-value pairs for the Mapper. The TaskTracker passes the input split to the RecordReader constructor, which reads the records one by one and passes them to the mapper as key-value pairs. By default, the RecordReader considers a line as a record. This can be modified by extending the RecordReader and InputFormat classes to define different records in the input file, for example multi-line records.</div>
<div>
The Partitioner takes the output generated by the Map functions and hashes the record key to create partitions. By default, each partition will be passed to one reducer; this behavior can be overridden (see the sketch below). As part of the Shuffle operation, the partitions are then sorted and merged in preparation for sending them to the reducers. Once an intermediate partition is complete, it will be sent over the network using protocols like RPC or HTTP.</div>
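<div>
A minimal sketch of a custom Partitioner (the class name is illustrative; this one simply mirrors the default hash-based behavior, while a real custom partitioner would route keys differently):</div>
<pre class="brush:java">import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Maps each map-output key onto one of the configured reduce tasks.
public class WordPartitioner extends Partitioner&lt;Text, IntWritable&gt; {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
</pre>
<div>
It would typically be registered on the job with <span style="color: blue; font-family: monospace;">job.setPartitionerClass(WordPartitioner.class)</span>.</div>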
<div>
The result of a MapReduce job is written to an output directory: </div>
<div>
<ul style="text-align: left;">
<li>an empty file named <span style="color: blue; font-family: monospace;">_SUCCESS</span> is created to indicate the success of the job,</li>
<li>the history of the job is captured under the <span style="color: blue; font-family: monospace;">_log/history*</span> directory,</li>
<li>the output of the reduce job is captured under <span style="color: blue; font-family: monospace;">part-r-00000<span style="color: black; font-family: 'Times New Roman';">, </span>part-r-00001<span style="color: black; font-family: 'Times New Roman';">, </span>...</span></li>
<li>if you run a map-only job the output will be <span style="color: blue; font-family: monospace;">part-m-00000<span style="color: black; font-family: Times New Roman;">, </span>part-m-00001<span style="color: black; font-family: 'Times New Roman';">, </span>...</span> </li>
</ul>
<div>
<h4 style="text-align: left;">
Hadoop Job Scheduling</h4>
Two schedulers are available in hadoop, the use of each one is declared in <span style="color: blue; font-family: monospace;">mapred-site.xml</span>:<br />
<br />
<ul style="text-align: left;">
<li>By default the <b>Fair Scheduler</b> is used, where resources are shared evenly across pools (slots of resources) and each user has its own pool. Custom pools can be configured to guarantee minimum access and prevent starvation. This scheduler supports preemption.</li>
<li><b>Capacity Scheduler</b>: resources are shared across queues; the administrator configures hierarchical queues (percentages of the total resources in the cluster) to control access to resources. Queues have ACLs to control user access, and it's also possible to configure soft and hard limits per user within a queue. This scheduler supports resource-based scheduling and job priority. </li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOTa7vfTwQZcQiJlBaLBGARa6_QhT5VePFc32wZKsd2lkxGY0Iz_sfxzl9r7mHLyaWyHeS3i_8YRwMbe_A6fJ21GEfI_wdqxwcDqEDIAiCYi44hTvQffkmOglUj_5jy9qKWP0mjNwhWko/s1600/yarn+architecture.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOTa7vfTwQZcQiJlBaLBGARa6_QhT5VePFc32wZKsd2lkxGY0Iz_sfxzl9r7mHLyaWyHeS3i_8YRwMbe_A6fJ21GEfI_wdqxwcDqEDIAiCYi44hTvQffkmOglUj_5jy9qKWP0mjNwhWko/s320/yarn+architecture.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">YARN architecture</td></tr>
</tbody></table>
<br />
<h4>
Hadoop Job Management</h4>
Depending on the MapReduce version, there are different ways to manage Hadoop jobs:<br />
<br />
<ul style="text-align: left;">
<li>MRv1: through web UIs (JobTracker, TaskTraker), MapR metrics database, <span style="color: blue; font-family: monospace;">hadoop job </span>CLI.</li>
<li>MRv2 (YARN): through web UIs (Resource Manager, Node Manager, History Server), MapR metrics database (for future releases), <span style="color: blue; font-family: monospace;">mapred</span> CLI.</li>
</ul>
<br />
<br /></div>
<div>
The DistributedShell example</div>
</div>
<div>
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar -shell_command /bin/ls -shell_args /user/user01 -jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar</span></div>
<div>
Check the application logs</div>
<div>
<span style="color: blue; font-family: monospace;">$ cd</span><span style="color: blue; font-family: monospace;"> /opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs/application_1430664875648_0001/</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cat container_1430664875648_0001_01_000002/stdout </span><span style="color: #666666; font-family: monospace;"># stdout file</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cat container_1430664875648_0001_01_000002/stderr</span><span style="color: blue; font-family: monospace;"> </span><span style="font-family: monospace;"><span style="color: #666666;"># stderr file</span></span></div>
<div>
<br /></div>
<div>
The logs can also be accessed from the History Server Web UI at <a href="http://node-ip:8088/">http://node-ip:8088/</a></div>
<div>
<br /></div>
<div>
to be continued</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-40505852340177018002015-04-08T01:22:00.000-07:002015-04-18T08:58:55.696-07:00ADM 201 - Hadoop Operations: Cluster Administration<div dir="ltr" style="text-align: left;" trbidi="on">
This article gathers notes taken from <a href="https://www.mapr.com/services/mapr-academy/Hadoop-Operations-Cluster-Administration">MapR's ADM 201</a> class, which is mainly about:<br />
- Testing & verifying hardware before installing MapR Hadoop<br />
- Installing MapR Hadoop<br />
- Benchmarking a MapR Hadoop cluster and configuring a new cluster for production<br />
- Monitoring the cluster for failures & performance<br />
<h4 style="text-align: left;">
Prerequisites</h4>
Install the cluster shell utility, declare the slave nodes and check if it is accessing the nodes properly<br />
<div>
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">sudo -i</span><br />
<span style="color: blue; font-family: monospace;">$ apt-get install clustershell</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ mv /etc/clustershell/groups /etc/clustershell/groups.original</span><br />
<span style="color: blue; font-family: monospace;">$ echo "all: 192.168.2.212 192.168.2.200" > /etc/clustershell/groups</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ clush -a date</span></div>
<div>
<br />
<h4 style="text-align: left;">
Mapr Cluster validation</h4>
</div>
<div>
Inconsistency in the hardware (e.g. different disk sizes or CPU cores) may not cause installation failure but may cause poor performance of the cluster. The use of benchmarking tools (cluster validation <a href="https://github.com/jbenninghoff/cluster-validation">github repo</a>) allows the measurement of the cluster performance.<br />
<br />
The remainder of this section addresses pre-install cluster hardware tests:<br />
1. Download Benchmark Tools<br />
2. Prepare Cluster Hardware for Parallel Execution of Tests<br />
3. Test & Measure Subsystem Components<br />
4. Validate Component Software & Firmware<br />
<div>
<br /></div>
Grab the validation tools from the github repo<br />
<span style="color: blue; font-family: monospace;">$ curl -L -o cluster-validation.tgz http://github.com/jbenninghoff/cluster-validation/tarball/master</span><br />
<span style="color: blue; font-family: monospace;">$ tar xvzf cluster-validation.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ mv jbenninghoff-cluster-validation-*/ </span><span style="color: blue; font-family: monospace;">./</span><br />
<span style="color: blue; font-family: monospace;">$ cd pre-install/</span><br />
<br />
Copy the pre-install folder to all nodes, and check if it succeeded<br />
<span style="color: blue; font-family: monospace;">$ clush -a --copy /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><br />
<span style="color: blue; font-family: monospace;">$ clush -a ls /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><br />
<div>
<br /></div>
<div>
Test the hardware for specification heterogeneity</div>
<span style="color: blue; font-family: monospace;"></span><span style="color: blue; font-family: monospace;"></span><span style="color: blue; font-family: monospace;">$ /root/pre-install/cluster-audit.sh | tee cluster-audit.log</span><br />
<br />
Test the network bandwidth for its ability to handle MapReduce operations:<br />
First, set the IP addresses of the node in <span style="color: blue; font-family: monospace;">network-test.sh</span> (divide them between <span style="color: blue; font-family: monospace;">half1</span> and <span style="color: blue; font-family: monospace;">half2</span>).<br />
<span style="color: blue; font-family: monospace;">$ /root/pre-install/network-test.sh | tee network-test.log</span><br />
<br />
Test memory performance<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">clush -Ba '/root</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">pre-install/memory-test.sh | grep ^Triad' | tee memory-test.log</span></div>
<div>
<br />
Test disk performance<br />
The <span style="color: blue; font-family: monospace;">disk-test.sh</span> script checks the disk health and performance (i.e. throughput for sequential and random I/O read/write), it destroys any data available on it.<br />
<span style="color: blue; font-family: monospace;">$ clush -ab /root/</span><span style="color: blue; font-family: monospace;">pre-install/</span><span style="color: blue; font-family: monospace;">disk-test.sh</span><br />
For each scanned disk there will be a result file of the form <span style="color: blue; font-family: monospace;">disk_name-iozone.log</span>.<br />
<br />
<h4 style="text-align: left;">
Mapr Quick Install - <a href="http://doc.mapr.com/display/MapR/Quick+Installation+Guide">link</a></h4>
Minimum requirements:<br />
<ul style="text-align: left;">
<li>2-4 cores (at least two: 1 CPU for OS, 1 CPU for filesystem)</li>
<li>6GB of ram</li>
<li>20GB size of raw disk (should not be formatted/partitioned)</li>
</ul>
<br />
First, download installer script<br />
<span style="color: blue; font-family: monospace;">$ wget </span><span style="color: blue; font-family: monospace;">http://package.mapr.com/releases/v4.1.0/ubuntu/mapr-setup</span><br />
<span style="color: blue; font-family: monospace;">$ chmod 755 </span><span style="color: blue; font-family: monospace;">mapr-setup</span><br />
<span style="color: blue; font-family: monospace;">$ ./</span><span style="color: blue; font-family: monospace;">mapr-setup</span><br />
<br />
Second, configure the installation process (e.g. define data and control nodes). A sample configuration can be found in <span style="color: blue; font-family: monospace;">/opt/mapr-installer/bin/config.example</span><br />
<span style="color: blue; font-family: monospace;">$ cd </span><span style="color: blue; font-family: monospace;">/opt/mapr-installer/bin</span><br />
<span style="color: blue; font-family: monospace;">$ cp </span><span style="color: blue; font-family: monospace;">config.example </span><span style="color: blue; font-family: monospace;">config.example.original</span><br />
Use following commands to find information on nodes to declare in the configuration<br />
<span style="color: blue; font-family: monospace;">$ clush -a lsblk </span><span style="font-family: monospace;"><span style="color: #666666;"># list drivers name</span></span><br />
<span style="color: blue; font-family: monospace;">$ clush -a mount </span><span style="font-family: monospace;"><span style="color: #666666;"># list ip addresses and mounted drivers</span></span><br />
<br />
Edit <span style="color: blue; font-family: monospace;">config.example</span> file<br />
<ul style="text-align: left;">
<li>Declare the nodes information (IP addresses and data drives) under the <span style="color: blue; font-family: monospace;">Control_Nodes</span> section. </li>
<li>Customize the cluster domain by replacing <span style="color: blue; font-family: monospace;">my.cluster.com</span> with your own.</li>
<li>Set a new password (e.g. mapr)</li>
<li>Declare the disks and set <span style="color: blue; font-family: monospace;">ForceFormat</span> to true.</li>
</ul>
Installing mapr (the installation script uses Ansible behind the scene)<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">./install --cfg config.example --private-key /root/.ssh/id_rsa -u root -s -U root --debug new</span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1vQBgpAqVhNSGKjwcnjzGA2h8z_Ym5S4S3bbhwNuNbkOKuoHQI9NEz8nzG5Ho9KaAQ15oLQBej9uq_ZwdmoasaFumNKJbCsOb1nt1g-vwzDRL8s3Z9V_oFOsHTiqsw9BqZ-tCRWsf1pI/s1600/MapR+Cluster+Services.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1vQBgpAqVhNSGKjwcnjzGA2h8z_Ym5S4S3bbhwNuNbkOKuoHQI9NEz8nzG5Ho9KaAQ15oLQBej9uq_ZwdmoasaFumNKJbCsOb1nt1g-vwzDRL8s3Z9V_oFOsHTiqsw9BqZ-tCRWsf1pI/s1600/MapR+Cluster+Services.png" height="158" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MapR Cluster Services - <a href="https://mapr.app.box.com/ClusterServices">link</a></td></tr>
</tbody></table>
If the installation succeeded, you can log in to https://master-node:8443/ with mapr:mapr to access the <b>MapR Control System</b> (MCS) and then get a new license.<br />
Otherwise, if the installation fails, remove the install folder and check the installation logs, which can be found at <span style="color: blue; font-family: monospace;">/opt/mapr-installer/</span><span style="color: blue; font-family: monospace;">var/mapr-installer.log</span>. Examples of failure causes:<br />
<ul>
<li>problems formatting disks for MapR FS (check <span style="color: blue; font-family: monospace;">/opt/mapr/logs/disksetup.0.log</span>).</li>
<li>one of the nodes has less than 4G of memory</li>
<li>disks with LVM setup</li>
</ul>
As a last resort, you can remove all mapr packages and re-install:<br />
<br />
<span style="color: blue; font-family: monospace;">$ rm -r -f /opt/mapr/ </span><span style="color: #999999; font-family: monospace;"># remove installation folder</span><br />
<span style="color: blue; font-family: monospace;">$ dpkg --get-selections | grep -v deinstall | grep mapr</span><br />
<span style="color: blue; font-family: monospace;">mapr-cldb install</span><br />
<span style="color: blue; font-family: monospace;">mapr-core install</span><br />
<span style="color: blue; font-family: monospace;">mapr-core-internal install</span><br />
<span style="color: blue; font-family: monospace;">mapr-fileserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-hadoop-core install</span><br />
<span style="color: blue; font-family: monospace;">mapr-hbase install</span><br />
<span style="color: blue; font-family: monospace;">mapr-historyserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-mapreduce1 install</span><br />
<span style="color: blue; font-family: monospace;">mapr-mapreduce2 install</span><br />
<span style="color: blue; font-family: monospace;">mapr-nfs install</span><br />
<span style="color: blue; font-family: monospace;">mapr-nodemanager install</span><br />
<span style="color: blue; font-family: monospace;">mapr-resourcemanager install</span><br />
<span style="color: blue; font-family: monospace;">mapr-webserver install</span><br />
<span style="color: blue; font-family: monospace;">mapr-zk-internal install</span><br />
<span style="color: blue; font-family: monospace;">mapr-zookeeper install</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">dpkg -r --force-depends <package> </package></span><span style="color: #999999; font-family: monospace;"># remove all listed packages</span><br />
<br />
To check if the cluster is running properly, we can run the following quick test job.<br />
Note: check that the names of cluster nodes are resolvable through DNS, otherwise declare them in the <span style="color: blue; font-family: monospace;">/etc/hosts</span> of each node.<br />
<span style="color: blue; font-family: monospace;">$ su - mapr</span><br />
<span style="color: blue; font-family: monospace;">$ cd /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/</span><br />
<span style="color: blue; font-family: monospace;">$ yarn jar hadoop-mapreduce-examples-2.5.1-mapr-1501.jar pi 8 800</span><br />
<br />
<h4 style="text-align: left;">
Benchmark the Cluster</h4>
1. Hardware Benchmarking<br />
First, copy the post-install folder to all nodes<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">clush -a --copy /root/post-install</span><br />
<span style="color: blue; font-family: monospace;">$ clush -a ls /root/post-install</span><br />
<br />
Second, run tests to check drive throughput and establish a baseline for future comparison<br />
<span style="color: blue; font-family: monospace;">$ cd /root/post-install</span><br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">clush -Ba '/root/post-install/runRWSpeedTest.sh' | tee runRWSpeedTest.log</span><br />
<br />
2. Application Benchmarking<br />
Use specific MapReduce jobs to create test data and process it in order to challenge the performance limits of the cluster.<br />
First, create a volume for the test data<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume create -name benchmarks -replication 1 -mount 1 -path /benchmarks</span><br />
<br />
Second, generate random sequence of data<br />
<span style="color: blue; font-family: monospace;">$ su mapr</span><br />
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar teragen 5000000 /benchmarks/teragen1</span><br />
<br />
Then, sort the data and write the output to a directory<br />
<span style="color: blue; font-family: monospace;">$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar terasort /benchmarks/teragen1 /benchmarks/terasort1</span><br />
<br />
To analyze how long it takes to perform each step check the logs on the JobHistoryServer<br />
<span style="color: blue; font-family: monospace;">$ clush -a jps | grep -i JobHistoryServer</span><br />
<br />
<h4 style="text-align: left;">
Cluster Storage Resources</h4>
MapR FS organizes the drives of a cluster into <b>Storage Pools</b>. The latter is a group of drives (three by default) on a single physical node. Data is stored across the drives of the cluster's storage pools. If one drive fails, the entire storage pool is lost. To recover it, we need to take all drives of this pool offline, replace the failed drive and then return them to the cluster.<br />
Three drives per pool gives a good balance between read/write speed when ingesting huge amounts of data and recovery time for failed drives.<br />
Storage pools hold units called <b>Containers</b> (32 GB by default) which are logically organized into <b>Volumes</b> (which are specific to MapR FS). By default, containers have a replication factor of three inside a volume. We can choose a pattern for replication across containers: chain pattern or star pattern.<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume create name <volume name=""> type 0|1</volume></span><br />
<br />
When writing a file, the Container Location Database (<b>CLDB</b>) is used to determine the first container where data is written. The CLDB replaces the function of a NameNode in MapR Hadoop; it stores container replication factor and pattern information. A file is divided into chunks (default size 256 MB): a small chunk size leads to high write-scheduling overhead, while a big chunk size requires more memory.<br />
A <b>topology</b> defines the physical layout of the cluster's nodes. It's recommended to have two top-level topologies:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">/data</span> the parent topology for active nodes in the cluster</li>
<li><span style="color: blue; font-family: monospace;">/decommissioned</span> the parent topology used to segregate offline nodes or nodes to be repaired.</li>
</ul>
Usually, the racks that house the physical nodes are used as sub-topologies of <span style="color: blue; font-family: monospace;">/data</span>.<br />
<br />
<h4 style="text-align: left;">
Data Ingestion</h4>
Ingestion data to MapR FS can be done through:<br />
<ul style="text-align: left;">
<li>NFS (e.g. Gateway Strategy, Colocation Strategy) by using traditional applications with multiple concurrent read/writes easily - <a href="http://doc.mapr.com/display/MapR/Setting+Up+VIPs+for+NFS">link</a>,</li>
<li>Sqoop to transfer data between MapR-FS and relational databases,</li>
<li>Flume a distributed service for collecting, aggregating & moving data into MapR-FS</li>
</ul>
<br />
<a href="http://doc.mapr.com/display/MapR/Working+with+Snapshots">Snapshots</a> are read-only images of volumes at a specific point in time, more accurately a pointer that costs almost nothing. It's good idea to create them regularly to protect the integrity of the data. By default, a snapshot is scheduled automatically at the creation of a volume, it can be customized through the MCS or manually created as follows:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">maprcli volume snapshot create -volume DemoVolume -snapshotname 12042015-DemoSnapshot</span><br />
<br />
<b>Mirrors</b> are volumes that represent an exact copy of a source volume from the same or a different cluster; they take an extra amount of resources and time to create. By default, a mirror is a read-only volume but can be made writable. Mirrors can be created through the MCS (where the replication factor can also be set) or manually as follows:<br />
<span style="color: blue; font-family: monospace;">$ maprcli volume mirror start -name DemoVolumeMirror</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">$ maprcli volume mirror push -name DemoVolumeMirror</span><br />
<br />
Configuring remote mirrors<br />
First, edit cluster configuration file (in both clusters) to include the location of CLDB nodes on the remote one:<br />
<span style="color: blue; font-family: monospace;">$ echo "cldb_addr1:7222 </span><span style="color: blue; font-family: monospace;">cldb_addr2:7222 </span><span style="color: blue; font-family: monospace;">cldb_addr3:7222</span><span style="color: blue; font-family: monospace;">" >> /opt/mapr/conf/mapr-clusters.conf</span><br />
Second, copy this new configuration to all nodes in the cluster<br />
<span style="color: blue; font-family: monospace;">$ clush -a --copy </span><span style="color: blue; font-family: monospace;">/opt/mapr/conf/mapr-clusters.conf</span><br />
Third, restart the Warden service so that the modification takes effect:<br />
<span style="color: blue; font-family: monospace;">$ clush -a service mapr-warden restart</span><br />
Finally, start the mirroring from the MCS interface.<br />
<h4 style="text-align: left;">
Cluster Monitoring</h4>
Once a cluster is up and running, it has to be kept running smoothly. MCS helps monitor the health of the cluster and investigate failure causes by providing:<br />
<br />
<ul style="text-align: left;">
<li><a href="http://doc.mapr.com/display/MapR/Alarms+Reference">alarms</a>: sending emails, nagios notification, and </li>
<li>statistics about nodes (e.g. services), volumes, jobs (<a href="http://doc.mapr.com/display/MapR/Setting+up+the+MapR+Metrics+Database">MapR metrics database</a>). MapR Hadoop provide ways to</li>
</ul>
<br />
Standard logs for each node are stored at <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-2.5.1/logs</span>, however the <a href="http://doc.mapr.com/display/MapR/Centralized+Logging">centralized logs</a> are stored in <span style="color: blue; font-family: monospace;">/mapr/MaprQuickInstallDemo/var/mapr/local/c200-01/logs</span> at the cluster level.<br />
<br />
<a href="http://doc.mapr.com/display/MapR/Centralized+Logging">Centralized logging</a> automate for us the gathering of logs from all cluster nodes. It provides a job-centric view. The following command can be used to create a centralized log direcotry populated with symbolic links to all log files related to: tasks, map attempts, reduce attempts, pretaited to this specific job.<br />
<span style="color: blue; font-family: monospace;">$ maprcli job linklogs -jobid JOB_ID -todir MAPRFS_DIR</span><br />
<br />
The MapR centralized logging feature is enabled by default in <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-env.sh</span> through the environment variable <span style="color: blue; font-family: monospace;">HADOOP_TASKTRACKER_ROOT_LOGGER</span>.<br />
For MRv1, the standard log for each node is stored under <span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/logs</span>,<br />
while the centralized logs are stored at the cluster level (see the path above).<br />
<br />
<b>Alarms</b><br />
When a disk failure alarm is raised, the report at /opt/mapr/logs/faileddisk.log gives information about which disks have failed, the reason for the failure and the <a href="http://doc.mapr.com/display/MapR/Handling+Disk+Failures">recommended resolution</a>.<br />
<br />
<br />
<div style="text-align: left;">
<b>Cluster Statistics</b></div>
MapR collects a variety of statistics about the cluster and running jobs. This information helps track cluster usage and health. The metrics can be written to an output file or consumed by Ganglia; the output type is specified in two <a href="http://doc.mapr.com/display/MapR/hadoop-metrics.properties">hadoop-metrics.properties</a> files:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">/opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties</span> for output of hadoop standard services</li>
<li><span style="color: blue; font-family: monospace;">/opt/mapr/conf/hadoop-metrics.properties</span> for output of MapR specific services</li>
</ul>
Collected metrics can be about <a href="http://doc.mapr.com/display/MapR/Service+Metrics">services</a>, <a href="http://doc.mapr.com/display/MapR/Analyzing+Job+Metrics">jobs</a>, <a href="http://doc.mapr.com/display/MapR/Node+Metrics">nodes</a> and <a href="http://doc.mapr.com/display/MapR/Monitoring+Node+Metrics">monitoring node</a>.<br />
<br />
<b>Schedule Maintenance Jobs</b><br />
The collected metrics give us a good view of the cluster's performance and health, but the complexity of the cluster makes it hard to use these metrics directly to optimize how the cluster is running.<br />
Run test jobs regularly to gather job statistics and watch cluster performance. If a variance in cluster performance is observed, actions need to be taken to bring performance back. By doing this in a controlled environment we can try different approaches (e.g. tweaking the Disk and Role <a href="http://doc.mapr.com/display/MapR/Configuring+Balancer+Settings">balancer settings</a>) to optimize the cluster performance.<br />
<br />
<h4 style="text-align: left;">
Resources:</h4>
<ul style="text-align: left;">
<li>MapR installation - <a href="https://mapr.app.box.com/ADM201-LabGuide-v401">Lab Guide</a>, <a href="http://doc.mapr.com/display/MapR/Quick+Installation+Guide">Quick Installation Guide</a></li>
<li>Preparing Each Node - <a href="http://doc.mapr.com/display/MapR/Preparing+Each+Node">link</a></li>
<li>Setting up a MapR Cluster on Amazon Elastic MapReduce - <a href="http://doc.mapr.com/display/MapR3/Setting+up+a+MapR+Cluster+on+Amazon+Elastic+MapReduce">link</a></li>
<li>Cluster service planning - <a href="http://doc.mapr.com/display/MapR/Planning+the+Cluster">link</a></li>
<li>Tuning cluster for MapReduce performance for specific jobs - <a href="http://doc.mapr.com/display/MapR3/Tuning+a+Cluster+for+MapReduce+Performance">link</a></li>
<li>MapR Hadoop data storage - <a href="http://doc.mapr.com/display/MapR/Managing+Disks">link</a></li>
</ul>
<br />
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com3tag:blogger.com,1999:blog-2035497736124196692.post-975922085712548302015-04-03T06:44:00.003-07:002015-04-03T06:55:45.330-07:00Hadoop interview questions<div dir="ltr" style="text-align: left;" trbidi="on">
1) An HDFS file can ...<br />
<br />
<ul style="text-align: left;">
<li>... be duplicated on several nodes</li>
<li>... be compressed</li>
<li>... combine multiple files</li>
<li>... contain multiple blocks of different sizes</li>
</ul>
<br />
<div>
2) How does HDFS ensure the integrity of the stored data?</div>
<div>
<div>
<ul style="text-align: left;">
<li>by comparing the replicated data blocks with each other</li>
<li>through error logs</li>
<li>using checksums</li>
<li>by comparing the replicated blocks to the master copy</li>
</ul>
</div>
<div>
<div>
3) HBase is ...</div>
<div>
<ul style="text-align: left;">
<li>... column oriented</li>
<li>... key-value oriented</li>
<li>... versioned</li>
<li>... unversioned</li>
<li>... use zookeeper for synchronization</li>
<li>... use zookeeper for electing a master</li>
</ul>
</div>
</div>
</div>
<div>
<div>
4) An HBase table ...</div>
<div>
<ul style="text-align: left;">
<li>... need a scheme</li>
<li>... doesn't need a scheme</li>
<li>... is served by only one server</li>
<li>... is distributed by region</li>
</ul>
</div>
</div>
<div>
<div>
5) What does a major_compact on an HBase table?</div>
<div>
<ul style="text-align: left;">
<li>It compresses the table files.</li>
<li>It combines multiple existing store files to one for each family.</li>
<li>It merges region to limit the region number.</li>
<li>It splits regions that are too big.</li>
</ul>
</div>
</div>
<div>
<div>
6) What is the relationship between Jobs and Tasks in Hadoop?</div>
<div>
<ul style="text-align: left;">
<li>One job contains only one task</li>
<li>One task contains only one job</li>
<li>One Job can contain multiple tasks</li>
<li>One task can contain multiple jobs</li>
</ul>
</div>
</div>
<div>
<div>
7) The number of Map tasks to be launched in a given job mostly depends on...</div>
<div>
<ul style="text-align: left;">
<li>the number of nodes in the cluster</li>
<li>property mapred.map.tasks</li>
<li>the number of reduce tasks</li>
<li>the size of input splits</li>
</ul>
</div>
<div>
8) If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?</div>
<div>
<ul style="text-align: left;">
<li>One by one on each available reduce slot</li>
<li>Statistically</li>
<li>By hash</li>
</ul>
</div>
<div>
9) In Hadoop can you set</div>
<div>
<ul style="text-align: left;">
<li>Number of map</li>
<li>Number of reduce</li>
<li>Both map and reduce number</li>
<li>None, it's automatic</li>
</ul>
</div>
<div>
10) What is the minimum number of Reduce tasks for a Job?</div>
<div>
<ul style="text-align: left;">
<li>0</li>
<li>1</li>
<li>100</li>
<li>As many as there are nodes in the cluster</li>
</ul>
</div>
<div>
11) When a task fails, hadoop....</div>
<div>
<ul style="text-align: left;">
<li>... try it again</li>
<li>... try it again until a failure threshold stops the job</li>
<li>... stop the job</li>
<li>... continue without this particular task</li>
</ul>
</div>
<div>
12) How can you debug map reduce job?</div>
<div>
<ul style="text-align: left;">
<li>By adding counters.</li>
<li>By analyzing log.</li>
<li>By running in local mode in an IDE.</li>
<li>You can't debug a job.</li>
</ul>
</div>
</div>
<div>
References:<br />
<ul style="text-align: left;">
<li>Hadoop wiki - <a href="http://wiki.apache.org/hadoop/">link</a></li>
<li>Hadoop tutorial - <a href="http://hadooptutorial.wikispaces.com/Hadoop">link</a></li>
</ul>
</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com2tag:blogger.com,1999:blog-2035497736124196692.post-20309610917958257842015-03-24T07:41:00.002-07:002015-04-04T03:03:06.544-07:00Password-less SSH root access<div dir="ltr" style="text-align: left;" trbidi="on">
So I had to configure password-less SSH access between a master machine and a slave one:<br />
<br />
1. Create an SSH key pair on the master machine<br />
<span style="color: blue; font-family: monospace;">root@master-machine$</span><span style="color: blue; font-family: monospace;"> ssh-keygen </span><br />
<br />
2. Create an SSH key pair on the slave machine,<br />
<span style="color: blue; font-family: monospace;">root@slave-machine$</span><span style="color: blue; font-family: monospace;"> ssh-keygen</span><br />
<br />
To copy the public key to the remote machine we need root access; however, by default password-based SSH access as root is not allowed.<br />
<br />
3. On the slave machine: <span style="color: blue; font-family: monospace;">sudo passwd</span>.<br />
3.1. set a password for root (if not already set)<br />
3.2. edit <span style="color: blue; font-family: monospace;">/etc/ssh/sshd_config</span> (not <span style="color: blue; font-family: monospace;">/etc/ssh/ssh_config</span>) to change <span style="color: blue; font-family: monospace;">PermitRootLogin without-password</span> to <span style="color: blue; font-family: monospace;">PermitRootLogin yes</span>.<br />
3.3. restart the SSH daemon with <span style="color: blue; font-family: monospace;">service ssh restart</span>, or <span style="color: blue; font-family: monospace;">service ssh reload</span> if in an SSH session.<br />
<br />
4. Copy master's root public key to the authorized keys in the slave machine<br />
<span style="color: blue; font-family: monospace;">root@master-machine$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">ssh-copy-id -i root@</span><span style="color: blue; font-family: monospace;">slave-machine</span><br />
<br />
Disable password-based SSH access for root:<br />
5. On the slave machine, edit <span style="color: blue; font-family: monospace;">/etc/ssh/sshd_config</span> to change <span style="color: blue; font-family: monospace;">PermitRootLogin </span><span style="color: blue; font-family: monospace;">yes</span> to <span style="color: blue; font-family: monospace;">PermitRootLogin </span><span style="color: blue; font-family: monospace;">without-password</span>.<br />
<br />
6. Now you can ssh as root from the master to the slave machine without password:<br />
<span style="color: blue; font-family: monospace;">root@master-machine$ ssh root@</span><span style="color: blue; font-family: monospace;">slave-machine</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
For more details on SSH keys, check <a href="https://help.ubuntu.com/community/SSH/OpenSSH/Keys">link</a>.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-22348036023012039592015-03-20T04:48:00.000-07:002015-03-21T04:30:32.237-07:00Exposing services to CF applications<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Service Broker API</h4>
The Service Broker (SB) API (full <a href="http://docs.cloudfoundry.org/services/">documentation</a>) enables service providers to expose their offerings to applications running on Cloud Foundry (CF). Implementing this contract allows the CF Cloud Controller (CC) to communicate with the service provider in order to handle:<br />
<ol style="text-align: left;">
<li>Catalog Management: register the offering catalog (e.g. different service plans), </li>
<li>Provisioning: create/delete a service instance (e.g. create a new MongoDB collection),</li>
<li>Binding: connect/disconnect a CF application to/from a provisioned service instance.</li>
</ol>
For each of these possible actions, there is an endpoint defined in the Service Broker contract.<br />
<br />
<b>1. Catalog Management</b><br />
The Service Broker (full <a href="http://docs.cloudfoundry.org/services/api.html">documentation</a>) should expose an endpoint for catalog management that provides, in JSON format, information on the service itself, the different plans (e.g. free or not) that can be consumed by applications, and some metadata describing the service.<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">GET http://broker-url/v2/catalog</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 200 OK</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">services</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
[<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">planUpdatable</span>: <span class="type-boolean" style="color: firebrick;">false</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">id</span>: <span class="type-string" style="color: green;">"a unique service identifier"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">name</span>: <span class="type-string" style="color: green;">"service name"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">description</span>: <span class="type-string" style="color: green;">"service description"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">bindable</span>: <span class="type-boolean" style="color: firebrick;">true</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_updateable</span>: <span class="type-boolean" style="color: firebrick;">false</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plans</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
[<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">id</span>: <span class="type-string" style="color: green;">"a unique plan id"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">name</span>: <span class="type-string" style="color: green;">"plan name"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">description</span>: <span class="type-string" style="color: green;">"plan description"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">metadata</span>: { },</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">free</span>: <span class="type-boolean" style="color: firebrick;">false</span></div>
</li>
</ul>
}</div>
</li>
</ul>
],</div>
</li>
<li style="position: relative;"><div class="hoverable hovered" style="-webkit-transition: background-color 0.2s ease-out 0.2s; background-color: white; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0.2s;">
<span class="property" style="font-weight: bold;">tags</span>: [ ],</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">metadata</span>: { },</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">requires</span>: [ ],</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">dashboard_client</span>: <span class="type-null" style="color: grey;">null</span></div>
</li>
</ul>
}</div>
</li>
</ul>
]</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="font-family: monospace;"><br /></span>
<b>2. Provisioning</b><br />
Provisioning consists of synchronous actions that the Service Broker performs, on demand from the CC, to create a new resource (or destroy an existing one) for the application. The CC sends a PUT request with a designated instance identifier and a JSON body carrying the chosen service and plan identifiers. Once the actions are performed, the Service Broker replies with a JSON body (e.g. a dashboard URL).<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">PUT http://broker-url/v2/</span><span style="color: blue; font-family: monospace;">service_instances/:instance_id</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">service_id</span>: "service identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_id</span>: "plan identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">organization_guid</span>: "ORG identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">space_id</span>: "SPACE identifier"<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 201 Created</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">dashboard_url</span>: null</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
Once created, a service instance can be updated (e.g. to upgrade the consumed plan). For this, the same request is sent to the SB with a body containing only the attributes to update:<br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">plan_id</span>: "new_plan_identifier"</div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
<b>3. Binding</b><br />
Binding allows a CF application to connect to a provisioned service instance and start consuming the offered plan. When the SB receives a binding request from the CC, it replies with the <a href="http://docs.cloudfoundry.org/services/binding-credentials.html">necessary information</a> (e.g. service URL, authentication information, etc.) for the CF application to utilize the offered service.<br />
<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">PUT http://broker-url/v2/</span><span style="color: blue; font-family: monospace;">service_instances/:instance_id/service_bindings/:binding_id</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">service_id</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"</span><span style="color: green; font-family: monospace;">service identifier"</span><br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; font-family: monospace; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span style="font-family: monospace;">,</span></div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">plan_id</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"plan identifier"</span><br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; font-family: monospace; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span style="font-family: monospace;">,</span></div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-family: monospace; font-weight: bold;">app_guid</span><span style="font-family: monospace;">: </span><span style="color: green; font-family: monospace;">"application identifier"</span></div>
</li>
</ul>
<span style="font-family: monospace;">}</span><br />
<span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> </span><span style="color: #666666; font-family: monospace;">The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 201 Created</span><br />
<span style="color: blue; font-family: monospace;">< Content-Type: application/json;charset=UTF-8</span><br />
<span style="color: blue; font-family: monospace;">...</span><br />
<span style="font-family: monospace;">{</span><br />
<ul class="obj collapsible" style="font-family: monospace; list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">credentials</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
{<br />
<ul class="array collapsible" style="list-style-type: none; margin: 0px 0px 0px 2em; padding: 0px;">
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
<span class="property" style="font-weight: bold;">uri</span>: <span class="type-string" style="color: green;">"a uri to the service instance"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">username</span>: <span class="type-string" style="color: green;">"username on the service"</span>,</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">password</span>: <span class="type-string" style="color: green;">"password for the username"</span></div>
</li>
</ul>
}</div>
</li>
<li style="position: relative;"><div class="hoverable" style="-webkit-transition: background-color 0.2s ease-out 0s; border-radius: 2px; display: inline-block; padding: 1px 2px; transition: background-color 0.2s ease-out 0s;">
<span class="property" style="font-weight: bold;">syslog_drain_url</span>:<br />
<div class="collapser" style="-webkit-user-select: none; cursor: default; left: -1.5em; padding-left: 6px; padding-right: 6px; position: absolute; top: 1px;">
</div>
</div>
<span class="type-null" style="color: grey;">null</span></li>
</ul>
<span style="font-family: monospace;">}</span><br />
<br />
To unbind the application from the service, the SB receives a DELETE request on the same URL.<br />
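A sketch of what such a request might look like (the service and plan identifiers are typically passed as query parameters):<br />
<span style="color: #666666; font-family: monospace;"># The Cloud Controller sends the following request</span><br />
<span style="color: blue; font-family: monospace;">DELETE http://broker-url/v2/service_instances/:instance_id/service_bindings/:binding_id?service_id=...&plan_id=...</span><br />
<span style="color: #666666; font-family: monospace;"># The Service Broker may reply as follows</span><br />
<span style="color: blue; font-family: monospace;">< HTTP/1.1 200 OK</span><br />
<span style="font-family: monospace;">{ }</span><br />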
<br />
<b>Note! </b><br />
All the previous requests from the Cloud Controller to the Service Broker contain the <span style="color: blue; font-family: monospace;">X-Broker-Api-Version</span> HTTP header. It designates the Service Broker API version (e.g. <span style="color: blue; font-family: monospace;">2.4</span>) supported by the Cloud Controller.<br />
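For example:<br />
<span style="color: blue; font-family: monospace;">X-Broker-Api-Version: 2.4</span><br />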
<br />
<h4 style="text-align: left;">
Managing Service Brokers</h4>
Once the previous endpoints are implemented, the SB can be registered in Cloud Foundry and exposed to applications with the following command:<br />
<span style="color: blue; font-family: monospace;">$ cf create-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD </span><span style="color: blue; font-family: monospace;">http://broker-url/</span><br />
<br />
To check that the service broker was successfully registered:<br />
<span style="color: blue; font-family: monospace;">$ cf service-brokers</span><br />
<br />
Other management operations are available to update, rename or delete a service broker:<br />
<span style="color: blue; font-family: monospace;">$ cf update-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD </span><span style="color: blue; font-family: monospace;">http://broker-url/</span><br />
<span style="color: blue; font-family: monospace;">$ cf rename-service-broker SERVICE_BROKER_NAME NEW_</span><span style="color: blue; font-family: monospace;">SERVICE_BROKER_NAME</span><br />
<span style="color: blue; font-family: monospace;">$ cf delete-service-broker SERVICE_BROKER_NAME</span><br />
<br />
Once the SB is created in the CF database, its service plans can be viewed with:<br />
<span style="color: blue; font-family: monospace;">$ cf service-access</span><br />
<br />
By default, all plans are disabled. Pick the service name from the output of the previous command and then:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf enable-service-access SERVICE_NAME </span><span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> enable access to service</span><br />
<span style="color: blue; font-family: monospace;">$ cf marketplace -s </span><span style="color: blue; font-family: monospace;">SERVICE_NAME </span><span style="color: #666666; font-family: monospace;">#</span><span style="color: #666666; font-family: monospace;"> output service plans</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<b>Managing Services</b></div>
Once a service is available in the marketplace, an instance of it can be created:<br />
<span style="color: blue; font-family: monospace;">$ cf create-service SERVICE_NAME </span><span style="color: blue; font-family: monospace;">SERVICE_PLAN </span><span style="color: blue; font-family: monospace;">SERVICE_INSTANCE_NAME</span><br />
Then service instances can be seen with:<br />
<span style="color: blue; font-family: monospace;">$ cf services</span><br />
<br />
<b>Connecting service to </b><b>application</b><br />
To be able to connect an application to a service (running on a different network) and communicate with it, a route should be added through the definition of a <a href="http://docs.pivotal.io/pivotalcf/adminguide/app-sec-groups.html">Security group</a>. Security groups allow you to control the outbound traffic of a CF app:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf create-security-group my_security_settings security.json</span><br />
<br />
The content of security.json is as follows:<br />
<span style="color: blue; font-family: monospace;">[</span><br />
<span style="color: blue; font-family: monospace;"> {</span><br />
<span style="color: blue; font-family: monospace;"> "protocol": "tcp",</span><br />
<span style="color: blue; font-family: monospace;"> "destination": "192.168.2.0/24",</span><br />
<span style="color: blue; font-family: monospace;"> "ports":"80"</span><br />
<span style="color: blue; font-family: monospace;"> }</span><br />
<span style="color: blue; font-family: monospace;">]</span><br />
<br />
Then, binding to a service instance should be performed as follows:<br />
<span style="color: blue; font-family: monospace;">$ cf bind-service APP_NAME </span><span style="color: blue; font-family: monospace;">SERVICE_INSTANCE_NAME</span><br />
Now, the application running on CF can access service instances through the credentials available from the environment variable <span style="color: blue; font-family: monospace;">VCAP_SERVICES</span>.<br />
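The content of this variable (including the credentials returned by the Service Broker at binding time) can be inspected from the CLI with:<br />
<span style="color: blue; font-family: monospace;">$ cf env APP_NAME</span><br />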
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Managed services in CloudFoundry - <a href="http://software.danielwatrous.com/managed-services-in-cloudfoundry/">link</a></li>
<li>CloudFoundry and Apache Brooklyn for automating PaaS with a Service Broker - <a href="http://thenewstack.io/cloud-foundry-and-apache-brooklyn-for-automating-paas-with-a-service-broker/">link</a></li>
<li>Leveraging Riak with CloudFoundry - <a href="http://basho.com/tag/service-broker/">link</a></li>
</ul>
<br />
<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-71381442641641613052015-03-11T02:41:00.001-07:002015-03-11T02:41:36.908-07:00Pushing applications to CloudFoundry the Java way<div dir="ltr" style="text-align: left;" trbidi="on">
CloudFoundry provides a <a href="http://docs.run.pivotal.io/buildpacks/java/java-client.html">Java API</a> that can be used to do anything the CLI can. Following are the steps that show how to connect and push an application to CF using Java code:<br />
<br />
<b>1. Skip SSL validation</b><br />
You may have to skip SSL validation to avoid <span style="color: red; font-family: monospace;">sun.security.validator.ValidatorException</span>:<br />
<pre class="brush:java">SSLContext ctx = SSLContext.getInstance("TLS");
X509TrustManager tm = new X509TrustManager() {
public void checkClientTrusted(X509Certificate[] xcs, String string) {
}
public void checkServerTrusted(X509Certificate[] xcs, String string) {
}
public X509Certificate[] getAcceptedIssuers() {
return null;
}
};
ctx.init(null, new TrustManager[] { tm }, null);
SSLContext.setDefault(ctx);
</pre>
<br />
<b>2. Connect to CloudFoundry</b><br />
<pre class="brush:java">Connect to the CloudFoundry API endpoint (e.g. https://api.run.pivotal.io) and authenticatewith your credentials:</pre>
<pre class="brush:java">String user = "admin";
String password = "admin";
String target = "https://api.10.244.0.34.xip.io";
CloudCredentials credentials = new CloudCredentials(user, password);
HttpProxyConfiguration proxy = new HttpProxyConfiguration("proxy_hostname", proxy_port);
CloudFoundryClientclient = new CloudFoundryClient(credentials, target, org, space, proxy);
</pre>
<br />
<b>3. Create an application</b><br />
<pre class="brush:java">String appName = "my-app";
List<string> urls = Arrays.asList("my-app.10.244.0.34.xip.io");
Staging staging = new Staging(null, "app_buildpack_git_repo");
client.createApplication(appName, staging, disk, mem, urls, Collections.<string> emptyList());
</string></string></pre>
<br />
<b>4. Push the application
</b><br />
<pre class="brush:java">ZipFile file = new ZipFile(new File("path_to_app_archive_file"));
ApplicationArchive archive = new ZipApplicationArchive(file);
client.uploadApplication(appName, archive);
</pre>
<br />
<b>5. Check the application state
</b><br />
<pre class="brush:java">StartingInfo startingInfo = client.startApplication(appName);
System.out.println("Starting application: %s on %s", appName, startingInfo.getStagingFile());
CloudApplication application : client.getApplications()
System.out.printf(" %s (%s)%n", application.getName(), application.getState());
</pre>
<br />
<b>6. Disconnect from CloudFoundry
</b><br />
<pre class="brush:java">client.logout();</pre>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-65966005116065116542015-02-24T09:34:00.002-08:002015-06-05T10:31:39.239-07:00Installing Cloud Foundry v2 locally on Vagrant<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
Cloud Foundry (CF)</h3>
<a href="http://www.cloudfoundry.org/">CloundFoundry</a> (CF) is one of the many PaaS available out there that aims to empower developers to build their applications (e.g. web) without caring about infrastructure details. The PaaS handles the deployment, scaling and management of the apps in the cloud data center, thus boosting the developer productivity.<br />
CF has many advantages over other PaaS solutions as it is open source, it has a fast growing community and many big cloud actors are involved in the development and spreading it adoption. It also can be run anywhere even on a laptop and this what this post is about. So keep reading..<br />
<h4 style="text-align: left;">
Terminology</h4>
<div style="text-align: left;">
<b><i>- Bosh</i></b> is an open-source platform that helps deploy and manage systems on cloud infrastructures (AWS, OpenStack/CloudStack, vSphere, vCloud, etc.).</div>
<b><i>- Bosh Lite</i></b> is a lightweight version of <b><i>Bosh</i></b> that can be used to deploy systems locally by using Vagrant instead of a cloud infrastructure (e.g. AWS) and Linux Containers (the Warden project) to run your system instead of VMs.<br />
<b><i>- Stemcell</i></b> is a template VM that will be used by <b><i>Bosh</i></b> to create VMs and deploy them to the cloud. It essentially contains an OS (e.g. CentOS) and a Bosh Agent so that it can be controlled.<br />
<br />
<div style="text-align: left;">
<b>1. Install Git</b></div>
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">sudo apt-get install git</span><br />
<br />
<b>2. Install VirtualBox</b><br />
<span style="color: blue; font-family: monospace;">$ sudo echo "deb http://download.virtualbox.org/virtualbox/debian precise contrib" >> /etc/apt/sources.list</span><br />
or create a new .list file as described in this <a href="http://stackoverflow.com/questions/1584066/append-to-etc-apt-sources-list">thread</a>.<br />
<span style="color: blue; font-family: monospace;">$ wget -q http://download.virtualbox.org/virtualbox/debian/oracle_vbox.asc -O- | sudo apt-key add -</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get update</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install virtualbox-4.3</span><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install dkms</span><br />
<span style="color: blue; font-family: monospace;">$ VBoxManage --version</span><br />
<span style="color: blue; font-family: monospace;">4.3.10_Ubuntur93012</span><br />
<br />
<b>3. Install Vagrant</b> (the known version to work with bosh-lite is 1.6.3 - <a href="https://github.com/cloudfoundry/bosh-lite#prepare-the-environment">link</a>)<br />
<span style="color: blue; font-family: monospace;">$ wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.6.3_x86_64.deb</span><br />
<span style="color: blue; font-family: monospace;">$ sudo dpkg -i vagrant_1.6.3_x86_64.deb</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant --version</span><br />
<span style="color: blue; font-family: monospace;">Vagrant 1.6.3</span><br />
<br />
Check that Vagrant works correctly with the installed VirtualBox:<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">vagrant init hashicorp/precise32</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant up</span><br />
<br />
<b>4. Install Ruby(using RVM) + RubyGems + Bundler</b><br />
<b>4.1. Install rvm</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">curl -sSL https://rvm.io/mpapis.asc | gpg --import -</span><br />
<span style="color: blue; font-family: monospace;">$ curl -sSL https://get.rvm.io | bash -s stable</span><br />
<span style="color: blue; font-family: monospace;">$ source /home/{username}/.rvm/scripts/rvm</span><br />
<span style="color: blue; font-family: monospace;">$ rvm --version</span><br />
<br />
<b>4.2. Install latest ruby version</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">rvm install 1.9.3-p551</span><br />
<span style="color: blue; font-family: monospace;">$ ruby -v</span><br />
<span style="color: blue; font-family: monospace;">ruby 1.9.3p551 (2014-11-13 revision 48407) [x86_64-linux]</span><br />
<br />
<b>5. Install Bosh CLI</b> (check the prerequisites for the target OS <a href="http://bosh.io/docs/bosh-cli.html">here</a>)<br />
- Note that the Bosh CLI is not supported on Windows - <a href="https://github.com/cloudfoundry/bosh-lite/issues/233#issuecomment-71705208">github issue</a><br />
<span style="color: blue; font-family: monospace;">$ sudo apt-get install build-essential libxml2-dev libsqlite3-dev libxslt1-dev libpq-dev libmysqlclient-dev</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">gem install bosh_cli</span><br />
<br />
<b>6. Install Bosh-Lite</b><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">git clone https://github.com/cloudfoundry/bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant up --provider=virtualbox</span><br />
<br />
In case the following message is seen <span style="background-color: white; font-family: Courier New, Courier, monospace; font-size: x-small;"><b><i>The guest machine entered an invalid state while waiting for it to boot</i></b></span>, then:<br />
<ul style="text-align: left;">
<li>check whether virtualisation (Intel VT-x / AMD-V for 32 bits, or Intel EPT / AMD RVI for 64 bits) is enabled on the target system (see <a href="http://superuser.com/questions/22915/how-do-i-enable-vt-x">here</a>). If not, enable it from the BIOS; for ESXi check <a href="http://www.virtuallyghetto.com/2012/08/how-to-enable-nested-esxi-other.html">link1</a> and <a href="https://communities.vmware.com/docs/DOC-8970">link2</a>, add <span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><b style="background-color: #cccccc;">vhv.enable = "TRUE"</b></span> to the VM configuration file (i.e. the .vmx file), and make sure the VM is of hardware version 9. </li>
<li>You may also have to check whether the USB 2.0 controller is enabled; if it is, disable it.</li>
</ul>
Target the BOSH Director<br />
<span style="color: blue; font-family: monospace;">$ cd ..</span><br />
<span style="color: blue; font-family: monospace;">$ bosh target 192.168.50.4 lite</span><br />
<span style="color: blue; font-family: monospace;">$ bosh login</span><br />
<span style="color: blue; font-family: monospace;">Your username: admin</span><br />
<span style="color: blue; font-family: monospace;">Enter password: *****</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">Logged in as `admin'</span><br />
<br />
Setup a route between the laptop and the VMs running inside Bosh Lite<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ ./bin/add-route</span><br />
<br />
<b>7. Deploy Cloud Foundry</b><br />
Install spiff<br />
<div>
<div>
<span style="color: blue; font-family: monospace;">$ brew tap xoebus/homebrew-cloudfoundry</span></div>
<span style="color: blue; font-family: monospace;">
</span><span style="color: blue; font-family: monospace;"></span>
<br />
<div>
<span style="color: blue; font-family: monospace;">$ brew install spiff</span></div>
<span style="color: blue; font-family: monospace;">
</span><span style="color: blue; font-family: monospace;"></span>
<br />
<div>
<span style="color: blue; font-family: monospace;">$ spiff</span></div>
<span style="color: blue; font-family: monospace;">
</span>To install spiff on linux systems check this <a href="https://github.com/cloudfoundry-incubator/spiff/issues/29">issue</a>.<br />
<br />
Upload latest stemcell<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">wget http://bosh-jenkins-artifacts.s3.amazonaws.com/bosh-stemcell/warden/latest-bosh-stemcell-warden.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ bosh upload stemcell latest-bosh-stemcell-warden.tgz</span><br />
Check the stemcells<br />
<span style="color: blue; font-family: monospace;">$ bosh stemcells</span><br />
<br />
Upload latest CF release<br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">git clone </span><span style="color: blue; font-family: monospace;">https://github.com/cloudfoundry/cf-release</span><br />
<span style="color: blue; font-family: monospace;">$ export CF_RELEASE_DIR=$PWD/</span><span style="color: blue; font-family: monospace;">cf-release/</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">bosh upload release </span><span style="color: blue; font-family: monospace;">cf-release/</span><span style="color: blue; font-family: monospace;">releases/cf-XXX.yml</span><br />
<br />
Deploy CF releases<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite/</span><br />
<span style="color: blue; font-family: monospace;">$ ./bin/provision_cf</span><br />
<span style="color: blue; font-family: monospace;">$ bosh target </span><span style="color: #666666;"><span style="font-family: monospace;"># </span><span style="font-family: monospace;">check the target director</span></span><br />
<span style="color: blue; font-family: monospace;">$ bosh vms </span><span style="color: #666666;"><span style="font-family: monospace;"># </span><span style="font-family: monospace;">check the installed VMs on the cloud</span></span><br />
<br />
Manually (to be continued)<br />
Generate a configuration file manifests/cf-manifest.yml<br />
<span style="color: blue; font-family: monospace;">$ mkdir -p go</span><br />
<span style="color: blue; font-family: monospace;">$ export GOPATH=~/go</span><br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite</span><br />
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">./bin/make_manifest_spiff</span><br />
<br />
Deploy release<br />
<span style="color: blue; font-family: monospace;">$ bosh deploy</span><br />
<br />
Install CF <a href="https://github.com/cloudfoundry/cli">CLI</a><br />
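One way to do this on Linux is to download a release binary from the project's GitHub releases page and put it on the PATH; a rough sketch (the version and archive name below are illustrative, pick the ones matching your platform):<br />
<span style="color: blue; font-family: monospace;">$ wget https://github.com/cloudfoundry/cli/releases/download/vX.Y.Z/cf-linux-amd64.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ tar xzf cf-linux-amd64.tgz</span><br />
<span style="color: blue; font-family: monospace;">$ sudo mv cf /usr/local/bin/</span><br />
<span style="color: blue; font-family: monospace;">$ cf --version</span><br />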
<br />
Play with CF<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">cf api --skip-ssl-validation https://api.10.244.0.34.xip.io</span><br />
<span style="color: blue; font-family: monospace;">$ cf login</span><br />
<span style="color: blue; font-family: monospace;">$ cf create-org ORG_NAME</span><br />
<span style="color: blue; font-family: monospace;">$ cf orgs</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cf target -o </span><span style="color: blue; font-family: monospace;">ORG_NAME</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ </span><span style="color: blue; font-family: monospace;">cf create-space SPACE_NAME</span></div>
<div>
<span style="color: blue; font-family: monospace;">$ cf target -o </span><span style="color: blue; font-family: monospace;">ORG_NAME</span><span style="color: blue; font-family: monospace;"> -s </span><span style="color: blue; font-family: monospace;">SPACE_NAME</span></div>
<div>
<br />
To access the VM from the LAN (i.e. another machine):<br />
<ol style="text-align: left;">
<li>Install an HTTP Proxy (e.g. <a href="https://help.ubuntu.com/lts/serverguide/squid.html">squid3</a>),</li>
<li>Configure CF <a href="http://docs.cloudfoundry.org/devguide/installcf/http-proxy.html">HTTP_PROXY</a> environment variable, and </li>
<li>Configure the proxy:</li>
</ol>
<span style="color: blue; font-family: monospace;"> $ sudo nano </span><span style="color: blue; font-family: monospace;">/etc/squid3/squid.conf</span><span style="color: blue; font-family: monospace;"> </span><br />
<span style="color: blue; font-family: monospace;"> acl </span><span style="color: blue; font-family: monospace;">local</span><span style="color: blue; font-family: monospace;">_network src 192.168.2.0/24</span><br />
<span style="color: blue; font-family: monospace;"> http_access allow local_network</span><br />
<br />
Stopping CF<br />
Shutting down the bosh-lite VM can be surprisingly tricky. It may be better to stop the VM with:<br />
<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">vagrant suspend</span> to save current state for next start up, or</li>
<li><span style="color: blue; font-family: monospace;">vagrant halt</span>, then next time to start CF use <span style="color: blue; font-family: monospace;">vagrant up</span> followed by <span style="color: blue; font-family: monospace;">bosh cck </span>(<a href="https://github.com/cloudfoundry/bosh-lite/blob/master/docs/bosh-cck.md">documentation</a>).</li>
</ul>
<br />
<br />
<b>Troubleshooting</b><br />
<span style="color: blue; font-family: monospace;">$ bosh ssh </span><span style="color: #666666; font-family: monospace;"># </span><span style="color: #666666; font-family: monospace;">then choose the job to access (password: admin)</span><br />
<span style="color: blue; font-family: monospace;">bosh_something@something:~</span><span style="color: blue; font-family: monospace;">$ sudo /var/vcap/bosh/bin/monit summary</span><br />
Find the Bosh Lite IP address<br />
<span style="color: blue; font-family: monospace;">$ cd bosh-lite/</span><br />
<span style="color: blue; font-family: monospace;">$ vagrant ssh</span><br />
<span style="color: blue; font-family: monospace;">vagrant@agent-id-bosh-0:~$ ifconfig</span><br />
<span style="color: blue; font-family: monospace;">vagrant@agent-id-bosh-0:~$ exit</span><br />
<br />
Complete installation script can be found <a href="https://gist.github.com/dzlab/986a2b79ecabe725d324">here</a>.<br />
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Installing latest versions for virtualbox and vagrant - <a href="http://www.icchasethi.com/installing-latest-vagrant-and-virtualbox-version-on-ubuntu-12-01/">link</a></li>
<li>Installing ruby with rvm - <a href="http://fhanik.blogspot.fr/2013/11/installing-ruby-193-on-ubuntu-1204.html">link</a>.</li>
<li>DIY PaaS (CF v1) running DEA <a href="http://starkandwayne.com/articles/2013/02/05/diy-paas-running-apps-with-a-dea/">link1</a>, stagging applications <a href="http://starkandwayne.com/articles/2013/02/14/diy-paas-staging-an-app/">link2</a>.</li>
<li>Deploying CF Playground (a kind of web admin interface) - <a href="https://blog.starkandwayne.com/2014/09/13/deploying-cfplayground-to-cloud-foundry/">link</a></li>
<li>Installing CF on vagrant - <a href="http://blog.cloudfoundry.org/2013/06/27/installing-cloud-foundry-on-vagrant/">link</a> <a href="https://www.youtube.com/watch?v=DYn_B_IPmM8">video</a></li>
<li>Installing BOSH lite - <a href="https://github.com/cloudfoundry/bosh-lite">github repo</a>, <a href="https://blog.starkandwayne.com/2014/12/16/running-cloud-foundry-locally-with-bosh-lite/">tutorial</a></li>
<li>Deploying CF using BOSH lite - <a href="https://github.com/cloudfoundry/bosh-lite/blob/master/docs/deploy-cf.md">github repo</a>, <a href="https://github.com/cloudfoundry-community/bosh-lite-demo">demo</a></li>
<li>http://altoros.github.io/2013/using-bosh-lite/</li>
<li>Installing a new hard drive - <a href="https://help.ubuntu.com/community/InstallingANewHardDrive">link</a></li>
<li>xip.io a free internet service providing DNS wildcard - <a href="http://xip.io/">link</a></li>
<li>Troubleshooting with Bosh CLI - <a href="http://docs.pivotal.io/pivotalcf/customizing/trouble-advanced.html">official doc</a>, <a href="http://docs.cloudfoundry.org/devguide/deploy-apps/troubleshoot-app-health.html">app health</a>, <a href="https://github.com/yudai/cf_nise_installer">monit summary</a></li>
<li>Remotely debug a CF application - <a href="http://blog.altoros.com/how-to-remotely-debug-cloud-foundry-apps.html">link</a></li>
<li>CloudFoundry manifest.yml generator - <a href="http://cfmanigen.mybluemix.net/">link</a></li>
</ul>
<br />
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-85264137553999743642014-09-04T10:38:00.002-07:002014-09-05T10:50:38.385-07:00Getting started with Hive<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<h4 style="text-align: left;">
Introducing Hive</h4>
<div style="text-align: left;">
Hive installation is straightforward (not much to configure):</div>
<span style="color: blue; font-family: monospace;">$ wget http://mir2.ovh.net/ftp.apache.org/dist/hive/stable/apache-hive-<version>-bin.tar.gz</version></span><br />
<span style="color: blue; font-family: monospace;">$ tar xzf </span><span style="color: blue; font-family: monospace;">apache-hive-<version>-bin.tar.gz</version></span><br />
<span style="color: blue; font-family: monospace;">$ cd </span><span style="color: blue; font-family: monospace;">apache-hive-<version>-bin/bin/</version></span><br />
<span style="color: blue; font-family: monospace;">hive> show tables;</span><br />
<br />
Notice that the environment variable <span style="color: blue; font-family: monospace;">HIVE_HOME</span> is not required (which is not the case for hadoop/hbase/tez). hive-site.xml is not required either, but if we want to use an HDFS directory as the warehouse location it should contain something like:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.metastore.warehouse.dir</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>hdfs://namenode_hostname/user/hive/warehouse</value></span><br />
<span style="color: blue; font-family: monospace;"> <description>location of default database for the warehouse</description></span><br />
<span style="color: blue; font-family: monospace;"></property></span></div>
<h4 style="text-align: left;">
Introducing HQL</h4>
So let's create a table 'quotes' and make it available to other Hadoop programs as a text file:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">CREATE EXTERNAL TABLE quotes (symbol STRING, name STRING, price DOUBLE)</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">ROW FORMAT DELIMITED FIELDS TERMINATED by ',' LINES TERMINATED BY '\n'</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">STORED AS TEXTFILE</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">LOCATION '/tmp/quotes.txt';</span><br />
<br />
Then, we can load data into this table, for instance from a local file quotes.csv that looks like:<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GE","General Electric ",28.09</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"MSFT","Microsoft Corpora",41.66</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GOOG","Google Inc.",604.83</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"GM","General Motors Co",41.85</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"FB","Facebook, Inc.",72.59</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"AAPL","Apple Inc.",607.33</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"T","AT&T Inc.",37.15</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"VZ","Verizon Communica",52.06</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">"TM","Toyota Motor Corp",134.94</span><br />
<br />
with the following query:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> LOAD DATA LOCAL INPATH '/path/to/quotes.csv'</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">OVERWRITE INTO TABLE quotes;</span><br />
<br />
Once the table is filled, we can query it with things like:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> SELECT * FROM quotes;</span><br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> SELECT symbol FROM quotes;</span><br />
<br />
We can export and save the result of a query into a local directory, say under /tmp:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/quotes_100'</span><br />
<span style="color: blue; font-family: monospace;">SELECT *</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">WHERE quotes.</span><span style="color: blue; font-family: monospace;">pric</span><span style="color: blue; font-family: monospace;">e > 100;</span><br />
The result of this export is a set of files under the quotes_100 directory; the list of quotes that match the criteria can be found in a file named 000000_0.<br />
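The exported rows can then be inspected directly from the shell, for example:<br />
<span style="color: blue; font-family: monospace;">$ cat /tmp/quotes_100/000000_0</span><br />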
<h4 style="text-align: left;">
Tuning Hive</h4>
Understanding how Hive plans the execution of queries is essential for performance tuning. One way to understand the query plan is to use the <b>EXPLAIN</b> keyword:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN </span><span style="color: blue; font-family: monospace;">SELECT * FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">FROM quotes</span><br />
<span style="color: blue; font-family: monospace;"> ></span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">WHERE quotes.</span><span style="color: blue; font-family: monospace;">pric</span><span style="color: blue; font-family: monospace;">e > 100;</span><br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN </span><span style="color: blue; font-family: monospace;">SELECT SUM(price) FROM quotes;</span><br />
The result shows the translation of these queries into different possible operations called stages, for instance map-reduce, sampling, merge, or limit stages.<br />
Using the EXTENDED keyword along with EXPLAIN provides even further details about the query execution plan:<br />
<span style="color: blue; font-family: monospace;">hive></span><span style="color: blue; font-family: monospace;"> EXPLAIN EXTENDED </span><span style="color: blue; font-family: monospace;">SELECT SUM(price) FROM quotes;</span><br />
<br />
By default, Hive executes one stage at a time. This behavior can be overridden by setting the hive.exec.parallel property to true in hive-site.xml:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.exec.parallel</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>true</value></span><br />
<span style="color: blue; font-family: monospace;"> <description>Whether to execute jobs in parallel</description></span><br />
<span style="color: blue; font-family: monospace;"></property></span><br />
<br />
The number of mappers/reducers launched is determined by the size of the input files divided by the default size attributed to a given task, which can be configured via:<br />
<span style="color: blue; font-family: monospace;"><property></span><br />
<span style="color: blue; font-family: monospace;"> <name>hive.exec.reducers.bytes.per.reducer</name></span><br />
<span style="color: blue; font-family: monospace;"> <value>750000000</value></span><br />
<span style="color: blue; font-family: monospace;"> <description></description></span><br />
<span style="color: blue; font-family: monospace;"></property></span><br />
<br />
<b>Resources</b><br />
<br />
<ul style="text-align: left;">
<li><a href="http://www.qubole.com/5-tips-for-efficient-hive-queries/">Tips for efficient hive queries</a></li>
<li><a href="https://github.com/dzlab/bigdata-samples">Check more samples on github</a>.</li>
</ul>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-31323314607245769312014-09-01T04:39:00.000-07:002014-09-01T04:39:15.185-07:00Troubleshooting ubuntu server network interface <div dir="ltr" style="text-align: left;" trbidi="on">
So I've installed Ubuntu server on VirtualBox, and when I activated a second network adapter in bridged mode, the latter was not automatically configured on Ubuntu.<br />
In fact, the interface could not be seen with <span style="color: blue; font-family: monospace;">ifconfig</span>, and <span style="color: blue; font-family: monospace;">ifconfig -a</span> showed it as disabled.<br />
I tried to bring it up and restart networking service:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">ifconfig eth1 up</span><br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/etc/init.d/networking restart</span><br />
Now the interface is active but it has only an IPv6 address and when I restart the virtual machine, the interface goes disabled again.<br />
When checking /etc/network/interfaces there was no eth1, so I added it so that it would be configured automatically:<br />
<span style="color: blue; font-family: monospace;">$ vi /etc/network/interfaces</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;">auto eth1</span><br />
<span style="color: blue; font-family: monospace;">iface eth1 inet dhcp</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
That's it, now the interface works fine.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-25001426942568641072014-08-18T03:13:00.003-07:002014-10-22T10:13:05.163-07:00Comparison between caching systems for Java<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Servers are getting more and more powerful, with a lot of RAM (hundreds to thousands of gigabytes). However, it is still not possible to use most of the available capacity directly in Java applications due to inherent limitations of the GC (Garbage Collector) on the JVM, which may pause the application for a long time (even several minutes) to move objects between generations.<br />
<br />
Following is a description/comparison of some solutions, also called data grids, that can be used to address this problem, such as the Infinispan project from JBoss (formerly JBoss Cache), DirectMemory (an Apache proposal), EhCache (from Terracotta), etc.<br />
<br />
<b>Caches</b><br />
<br />
<div style="text-align: left;">
1. Infinispan (JBoss Data Grid Platform)</div>
<ul style="text-align: left;">
<li>Doesn't provide support for expiration events, as discussed in the <a href="https://community.jboss.org/thread/175533">forum</a>.</li>
<li><a href="http://blog.infinispan.org/2013/07/faster-file-cache-store-no-extra.html">SingleFileCacheStore</a> is a file-based cache store that manages data activation (loading from store to cache) and <a href="http://infinispan.org/docs/6.0.x/user_guide/user_guide.html#cache-passivation">passivation</a> (saving data to store).</li>
<li>List of possible attributes in the XML configuration for <a href="https://docs.jboss.org/infinispan/4.0/apidocs/config.html">infinispan 4.0</a> and <a href="http://docs.jboss.org/infinispan/6.0/configdocs/infinispan-config-6.0.html">infinispan 6.0</a>.</li>
</ul>
<br />
2. <a href="http://www.mapdb.org/">MapDB</a><br />
<ul style="text-align: left;">
<li>Exists only in embedded mode</li>
<li>Enables the creation of on-heap and off-heap collections (map, queue), as well as file-backed collections</li>
<li>Listeners registered to cache events are notified in the main thread (i.e. async notifications should be implemented by the user)</li>
<li>Can be used for lazy loading (e.g. <a href="https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Lazily_Loaded_Records.java">Lazily_Loaded_Records.java</a>).</li>
<li>Provides means for pumping the integral data available on memory to disk (e.g. <a href="https://github.com/jankotek/MapDB/blob/master/src/test/java/org/mapdb/Pump_InMemory_Import_Then_Save_To_Disk.java">Pump_InMemory_Import_Then_Save_To_Disk.java</a>).</li>
<li>The transaction isolation level is <a href="http://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable">Serializable</a>, which is the highest level and means a new transaction can be initiated only once the previous one has committed.</li>
<li>Transactions use a global lock, which considerably reduces cache performance.</li>
</ul>
<br />
3. <a href="http://www.akiban.com/">Akiban</a>'s Persistit - <a href="https://github.com/pbeaman/persistit">github</a><br />
<ul style="text-align: left;">
<li>key/value data storage library</li>
<li>Transactions are based on the <a href="http://en.wikipedia.org/wiki/Snapshot_isolation">Snapshot Isolation</a> algorithm to provide high concurrency.</li>
<li>Used by <a href="https://github.com/thinkaurelius/titan/wiki/Using-Persistit">Titan</a> (a distributed graph database) as its storage layer.</li>
<li>For custom objects, users should provide a serializer for</li>
<ul>
<li>keys by implementing <a href="https://github.com/pbeaman/persistit/blob/master/src/main/java/com/persistit/encoding/KeyCoder.java">com.persistit.encoding.KeyCoder</a>, as well as for</li>
<li>values by implementing <a href="https://github.com/pbeaman/persistit/blob/master/src/main/java/com/persistit/encoding/ValueCoder.java">com.persistit.encoding.ValueCoder</a>,</li>
<li>and declare coder manager.</li>
</ul>
<li>Samples can be found here in <a href="https://github.com/tobrien/persistit-example">Index and Search 2.3 Million Freebase Person Records with Persistit</a>, and <a href="https://github.com/posulliv/nodejs-express-akiban-demo">Simple Blog Application with Akiban and JugglingDB</a>.</li>
</ul>
<div>
4. JCS (Java Caching System)</div>
<ul>
<li>Build faster Web applications with caching - <a href="http://www.ibm.com/developerworks/java/library/j-jcs/index.html">developerWorks</a></li>
<li>Caching with JCS - <a href="http://www.objectpartners.com/2012/12/19/caching-with-jcs/">Object Partners</a></li>
<li>JCS event handling examples on <a href="http://stackoverflow.com/questions/4473479/jcs-notify-on-expire-remove">Stackoverflow</a> and <a href="https://joinup.ec.europa.eu/svn/spocs/eSafe/trunk/ESafeDocX_Open_Module_Core_JEE/src/test/java/eu/spocseu/esafedocx/util/cache/JcsCacheTest.java">SPOCS</a>.</li>
<li>Configuring a JCS Cache - <a href="http://www.informit.com/guides/content.aspx?g=java&seqNum=438">InformIT</a></li>
<li>Introduction, Using, Developing Web applications and Java Object Caching with Java Caching System (JCS) - <a href="http://www.bhaveshthaker.com/29/introduction-using-developing-web-applications-and-java-object-caching-with-java-caching-system-jcs/">bhaveshthaker.com</a>.</li>
</ul>
5. Hazelcast<br />
<ul style="text-align: left;">
<li>Can be backed with different kind of stores <a href="https://github.com/RichardHightower/slumberdb/wiki/Configuring-a-Hazelcast-MySQL-MapStore">mysql</a>, <a href="http://www.enesakar.com/post/45187344794/distribute-with-hazelcast-persist-into-hbase">hbase</a>, etc.</li>
<li>A case of processing Mozilla very large crash reports - <a href="http://highscalability.com/blog/2011/4/12/caching-and-processing-2tb-mozilla-crash-reports-in-memory-w.html">highscalability.com</a></li>
</ul>
6. GridGain<br />
<br />
<ul style="text-align: left;">
<li>Resources: <a href="http://gridgain.blogspot.com/">gridgain.blogspot.com</a></li>
</ul>
<br />
<br />
7. Others: <a href="https://github.com/xerial/larray">LArray</a>, <a href="http://cache2k.org/">Cache2K</a>, DirectMemory (initial project on <a href="https://github.com/raffaeleguidi/DirectMemory/">github</a>, apache <a href="http://wiki.apache.org/incubator/DirectMemoryProposal">proposal</a> for incubation) an off-heap memory store, <a href="http://www.h2database.com/html/mvstore.html">MVStore</a> the storage subsystem of the H2 database, <a href="http://blog.zenika.com/index.php?post/2014/06/03/spring-cache">Spring cache</a>, <a href="https://code.google.com/p/vanilla-java/wiki/HugeCollections">HugeCollections</a>.<br />
<br />
<b>Search</b><br />
<ul style="text-align: left;">
<li><a href="http://www.infoq.com/articles/LuceneHbase">Integrating Lucene with HBase</a> - an article explaining implementation of a Lucene backend based on HBase, the code is on [[github>>https://github.com/akkumar/hbasene]]. Other implementations: <a href="https://github.com/Photobucket/Solbase">Solbase</a>.</li>
<li><a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">Lucandra / Solandra: A Cassandra-based Lucene backend</a> - an article explaining implementation of a Lucene backend based on Cassandra. The project source code is on <a href="https://github.com/tjake/Solandra">github</a>.</li>
<li><a href="http://mprabhat.wordpress.com/2012/08/13/create-lucene-index-in-database-using-jdbcdirectory/">Create Lucene Index in database using JdbcDirectory</a> - an article explaining the use of a database as Lucene backed.</li>
<li><a href="http://www.compass-project.org/">Compass</a> project provides an Java friendly API for wrapping the Lucence api for a better integration with Java/J2ee applications.</li>
</ul>
<b>Resources</b><br />
<ul style="text-align: left;">
<li>A good explanation of the use of <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html">ByteBuffer</a> to build non-heap memory caches by Keith Gregory: <a href="http://www.kdgregory.com/index.php?page=java.byteBuffer">blog post</a>, JUG <a href="http://www.kdgregory.com/programming/java/ByteBuffer_JUG_Presentation.pdf">presentation</a>, another <a href="http://coders.talend.com/sites/default/files/heapoff-wtf_OlivierLamy.pdf">one</a>.</li>
<li>An article on <a href="http://www.infoq.com/articles/Open-JDK-and-HashMap-Off-Heap">InfoQ</a> about HashMap implementation for off-heap map.</li>
<li>An ibm <a href="http://www.redbooks.ibm.com/abstracts/redp5070.html">red book</a> on capacity for big data and off-heap memory.</li>
<li>Examples related to the use of <a href="https://github.com/MathildeLemee/Hands_On_Ehcache">EhCache</a> from a Devoxx 2014 presentation.</li>
</ul>
<b>Benchmarks</b><br />
<ul style="text-align: left;">
<li>Cache2K vs Infinispan/EhCache/JCS - <a href="http://cache2k.org/benchmarks.html">bench</a></li>
<li><a href="https://github.com/radargun/radargun">Radargun</a> a framework for benchmarking data grids</li>
</ul>
<b>Memory storage</b><br />
<br />
In-memory databases (a detailed description can be found at <a href="http://www.informationweek.com/software/information-management/two-approaches-to-in-memory-database-battle/d/d-id/1114088">Information Week</a>):<br />
<ul style="text-align: left;">
<li>NoSQL approaches (covers the class of nonrelational and horizontally scalable databases) like <a href="http://www.aerospike.com/">Aerospike</a>.</li>
<li>NewSQL approaches (emerging databases offering NoSQL scalability but with familiar SQL query capabilities, i.e. SQL-compliant) like <a href="http://voltdb.com/">VoltDB</a>, Oracle TimesTen, IBM solidDB, <a href="http://www.memsql.com/">MemSQL</a>.</li>
</ul>
Companies like Microsoft, Oracle and IBM chose to add in-memory support to their traditional databases (e.g. moving tables to memory), whereas SAP adopted another approach with its Hana platform, which aims to put everything in memory.<br />
<br />
<br />
Some traditional RDBMS can be configured to store their data in-memory instead of disk storage like <a href="https://www.sqlite.org/inmemorydb.html">sqlite</a>, <a href="http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html">MySQL</a>, etc.</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com12tag:blogger.com,1999:blog-2035497736124196692.post-5222583321988179652014-06-13T07:38:00.000-07:002014-11-20T04:25:19.149-08:00Getting started with HBase<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Introduction </b><br />
HBase indexes data based on 4D coordinates: rowkey, column family (or a <a href="http://www.toadworld.com/products/toad-for-cloud-databases/w/wiki/321.column-families-101.aspx">collection of columns</a>), column qualifier and version. As a result, HBase can be considered a key-value store with the 4D coordinates as the key and the cell as the value. Depending on how many of these coordinates are specified in a query, the returned value may be a single cell, a map, or a map of maps (see the sketch below).<br />
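<br />
To make this concrete, here is a small sketch using the Java client API covered later in this post; it assumes an open HTableInterface named usersTable with an 'info' column family, and uses the classic (0.9x-era) client methods:<br />
<pre class="brush:java">// rowkey + family + qualifier -> a single cell (latest version)
Get g = new Get(Bytes.toBytes("first"));
Result r = usersTable.get(g);
byte[] cell = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("username"));
// rowkey + family -> a map of qualifier to (latest) value
NavigableMap<byte[], byte[]> familyMap = r.getFamilyMap(Bytes.toBytes("info"));
// rowkey only -> a map of family to map of qualifier to map of version to value
NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> all = r.getMap();
</pre>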
<br />
<b>Installation</b><br />
<br />
Installing the latest stable version of HBase:<br />
<span style="color: blue; font-family: monospace;">$ mkdir hbase-install</span><br />
<span style="color: blue; font-family: monospace;">$ cd hbase-install</span><br />
<span style="color: blue; font-family: monospace;">$ wget http://apache.claz.org/hbase/stable/hbase-0.98.3-hadoop2-bin.tar.gz</span><br />
<span style="color: blue; font-family: monospace;">$ tar xvfz hbase-0.98.3-hadoop2-bin.tar.gz</span><br />
<span style="color: blue; font-family: monospace;">$ export HBASE_HOME=`pwd`/hbase-0.98.3-hadoop2</span><br />
<br />
Add the HBase binaries to the PATH:<br />
<span style="color: blue; font-family: monospace;">$ export PATH=$PATH:$HBASE_HOME/bin/</span><br />
<br />
# you need the <span style="color: blue; font-family: monospace;">JAVA_HOME </span>variable to be set; if you're using OpenJDK, you can set it to:<br />
<span style="color: blue; font-family: monospace;">$ export JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<br />
Running a standalone version<br />
<span style="color: blue; font-family: monospace;">$ start-hbase.sh</span><br />
<br />
Once the master is launched you can access the web admin interface at <a href="http://localhost:60010/">http://localhost:60010/</a><br />
<br />
By default, HBase will write data into the <span style="color: blue; font-family: monospace;">/tmp</span> directory. You can change this by editing <span style="color: blue; font-family: monospace;">$HBASE_HOME/conf/hbase-site.xml</span> and setting the following property (the complete list of properties can be found in the <a href="http://hbase.apache.org/book/config.files.html#hbase_default_configurations">official documentation</a>):<br />
<property><br />
<name>hbase.rootdir</name><br />
<value>file:///path/to/hbase/directory</value><br />
</property><br />
<br />
The <span style="color: blue; font-family: monospace;">$HBASE_HOME/conf/hbase-env.sh</span> bash file can be run to setup hbase configuration, for instance setting environment variables. For further information on configuring HBase, check the <a href="http://hbase.apache.org/book/quickstart.html">official documentation</a>.<br />
<br />
<b>Shell-based interaction</b><br />
Along with the installation binaries, there is a JRuby-based shell that wraps a Java client to interact with HBase interactively (sending commands and receiving responses directly in the terminal) or via shell scripts.<br />
<br />
To validate the installation, let's run the HBase shell and manipulate some data:<br />
<span style="color: blue; font-family: monospace;">$ hbase shell</span><br />
<span style="color: #666666; font-family: monospace;"># check existing tables</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):001:> list</span><br />
<span style="color: #666666; font-family: monospace;"># create table of column famity 'cf'</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):002:> create 'mytable', 'cf'</span><br />
<span style="color: #666666;"><span style="font-family: monospace;">#</span><span style="font-family: monospace;"> </span><span style="font-family: monospace;">write 'hello hbase' in first row of column 'cf:message' of table 'mytable'</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):003:> </span><span style="color: blue; font-family: monospace;">put 'mytable', 'first', 'cf:message', 'hello HBase'</span><br />
<span style="font-family: monospace;"><span style="color: #666666;"># create a user table of 'info' famity</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):004:> </span><span style="color: blue; font-family: monospace;">create 'users', 'info'</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):005:> </span><span style="color: blue; font-family: monospace;">put 'mytable', 'second', 'cf:foo', 3.14159</span><br />
<span style="color: blue; font-family: monospace;">hbase(main):006:></span> <span style="color: blue; font-family: monospace;">put </span><span style="color: blue; font-family: monospace;">'users', 'first', 'cf:username', "John Doe"</span><br />
<span style="font-family: monospace;"><span style="color: #666666;"># reading the first row from a table</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):007:> </span><span style="color: blue; font-family: monospace;">get 'mytable', 'first'</span><br />
<span style="color: #666666;"><span style="font-family: monospace;"># reading the</span><span style="font-family: monospace;"> whole rows from a table</span></span><br />
<span style="color: blue; font-family: monospace;">hbase(main):008:> </span><span style="color: blue; font-family: monospace;">scan 'mytable'</span><br />
<br />
<b>Java-based interaction</b><br />
<br />
<pre class="brush:java">// define a custom configuration (by default the content of hbase-site.xml is used)
Configuration myConf = HBaseConfiguration.create();
myConf.set("param_name", "param_value");
// e.g. to connect to a remote HBase instance you need to set Zookeeper quorum address and port number
myConf.set("hbase.zookeeper.quorum", "serverip");</pre>
<pre class="brush:java">myConf.set("hbase.zookeeper.property.clientPort", "2181");
// establish a connection
HTableInterface myTable = new HTable(myConf, "users");
// Use pool for a better reuse of connections which are expensive resources
HTablePool pool = new HTablePool(myConf, max_nb_connection);
HTableInterface myTable = pool.getTable("mytable");
...
// close connection and returned to the pool
myTable.close();</pre>
<br />
In HBase, data is manipulated as raw bytes; Java types should be converted to bytes with the help of the utility class Bytes. The HBase API for manipulating data is divided into commands: Get, Put, Delete, Scan and Increment. For example, data can be stored as follows:<br />
<pre class="brush:java">// create a command with row key TheRealMT
Put p = new Put(Bytes.toBytes("TheRealJD"));
// add information about user
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("john.doe@acme.inc"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("pass00"));
</pre>
<br />
Once the entry is ready, we can send it to HBase for persistence:<br />
<pre class="brush:java">HTableInterface usersTable = pool.getTable("users");
Put p = new Put(Bytes.toBytes("TheRealJD"));
p.add(...);
usersTable.put(p);
usersTable.close();</pre>
<br />
The Put command can also be used to update the user information:<br />
<pre class="brush:java">Put p = new Put(Bytes.toBytes("TheRealJD"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("securepass"));
usersTable.put(p);</pre>
<br />
The HBase client does not interact directly with the storage layer, which is made up of HFiles. Instead, HBase writes all operations to a Write-Ahead Log (WAL) for durability and failure recovery, while the data is stored in a memory region called the MemStore; once it fills up, its entire content is flushed to a new immutable file called an HFile (existing HFiles are never modified).<br />
This can be customized. For instance, the size of this region can be set via the hbase.hregion.memstore.flush.size parameter (a per-table override is sketched after the next snippet). Also, the WAL can be disabled with:<br />
<pre class="brush:java">Put p = new Put();
p.setWriteToWAL(false);</pre>
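<br />
As a sketch of a per-table override of the flush size mentioned above (assuming the classic client's HTableDescriptor.setMemStoreFlushSize method; 'admin' is an HBaseAdmin as created further below):<br />
<pre class="brush:java">// hypothetical per-table override: flush the MemStore of 'users' every 64MB
HTableDescriptor desc = new HTableDescriptor("users");
desc.setMemStoreFlushSize(64 * 1024 * 1024);
desc.addFamily(new HColumnDescriptor("info"));
admin.createTable(desc);
</pre>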
<br />
The Get command is used to query data from a set of given columns:
<br />
<pre class="brush:java">Get g = new Get(Bytes.toBytes("TheRealJD"));
g.addFamily(Bytes.toBytes("info"));
g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
Result r = usersTable.get(g);
byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
String email = Bytes.toString(b);
</pre>
As HBase is versioned, we can look at past values of a cell:
<br />
<pre class="brush:java">List<keyvalue> passwords = r.getColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
b = passwords.get(0).getValue();
String currentPassword = Bytes.toString(b);
b = passwords.get(1).getValue();
String previousPassword = Bytes.toString(b);
// the versions are by default the timestamp in milliseconds of the moment when the operation was performed
long version = passwords.get(0).getTimestamp();</pre>
<br />
The Delete command is used to delete data from HBase<br />
<pre class="brush:java">Delete d = new Delete(Bytes.toBytes("TheRealJD"));
// remove the latest version of a column
d.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
// remove all versions of a column (to remove an entire row, pass no columns to the Delete)
d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email"));
usersTable.delete(d);
</pre>
The delete operation is logical, meaning the record concerned is flagged as deleted and will no longer be returned by a get or scan. It is not until compaction (merging HFiles into a single bigger one) that the record is effectively deleted. More details on the compaction operation can be found in this <a href="http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/">article</a>.
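<br />
The Increment command, listed above but not shown so far, atomically increments a numeric counter cell. A small sketch (the 'logins' qualifier is only an illustrative name):<br />
<pre class="brush:java">Increment incr = new Increment(Bytes.toBytes("TheRealJD"));
incr.addColumn(Bytes.toBytes("info"), Bytes.toBytes("logins"), 1L);
usersTable.increment(incr);
// the same can be done in a single call that returns the new counter value
long logins = usersTable.incrementColumnValue(
    Bytes.toBytes("TheRealJD"), Bytes.toBytes("info"), Bytes.toBytes("logins"), 1L);
</pre>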
<br />
<br />
Creating a table programmatically:<br />
<pre class="brush:java">Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("UserFeed");
// create a column family
HColumnDescriptor c = new HColumnDescriptor("stream");
c.setMaxVersions(1);
desc.addFamily(c);
admin.createTable(desc);
</pre>
<br />
Once the table is created we can insert data into it. We may hash the row key used for users (i.e. TheRealJD) to avoid variable-length row keys and for better performance:<br />
<pre class="brush:java">// prepare the value of the row key
int longLength = Long.SIZE / 8;
byte[] userHash = Md5Utils.md5sum("TheRealJD");
byte[] timestamp = Bytes.toBytes(-1 * System.currentTimeMillis());
byte[] rowKey = new byte[Md5Utils.MD5_LENGTH + longLength];
int offset = 0;
offset = Bytes.putBytes(rowKey, offset, userHash, 0, userHash.length);
Bytes.putBytes(rowKey, offset, timestamp, 0, timestamp.length);
// prepare the put command
Put put = new Put(rowKey);
// we may need to store the real value of user id to be able to find the associated user when scanning the feeds table
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"), Bytes.toBytes("TheRealMT"));
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"), Bytes.toBytes("Hello world!"));
</pre>
<br />
When it comes to scanning the feeds table, things are easy thanks to the row key starting with a hash of the user row key.
<br />
<pre class="brush:java">byte[] userHash = Md5Utils.md5sum(user);
byte[] startRow = Bytes.padTail(userHash, longLength);
// create a stop key equal to the increment of the last byte of user id
byte[] stopRow = Bytes.padTail(userHash, longLength);
stopRow[Md5Utils.MD5_LENGTH-1]++;
Scan s = new Scan(startRow, stopRow);
ResultScanner rs = feedsTable.getScanner(s);
// extract the columns (as created previously) from each result
for(Result r: rs) {
// extract the username
byte[] b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"));
String user = Bytes.toString(b);
// extract the feed
b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"));
String feed = Bytes.toString(b);
// extract the timestamp
b = Arrays.copyOfRange(r.getRow(), Md5Utils.MD5_LENGTH, Md5Utils.MD5_LENGTH+longLength);
DateTime dt = new DateTime(-1 * Bytes.toLong(b));
}
</pre>
By default, each RPC call from the client to HBase will return only one row (i.e. no caching), which is not good when scanning a whole table. We can make each call return n rows by setting the property <span style="color: blue; font-family: monospace;">hbase.client.scanner.caching</span> or calling <span style="color: blue; font-family: monospace;">Scan.setCaching(int)</span>.<br />
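For instance, a sketch of a scan fetching 100 rows per RPC:<br />
<pre class="brush:java">Scan s = new Scan(startRow, stopRow);
s.setCaching(100); // each RPC to a region server now returns up to 100 rows
ResultScanner rs = feedsTable.getScanner(s);
</pre>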
<br />
Continue here.<br />
<br />
<b>Resources</b><br />
<br />
<ul style="text-align: left;">
<li>HBase administration using the Java API - <a href="http://linuxjunkies.wordpress.com/2011/12/03/hbase-administration-using-the-java-api-using-code-examples/">linuxjunkies.wordpress.com</a></li>
<li>Many examples for Delete operation - <a href="http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.hbase.client.Delete">programcreek.com</a></li>
<li>Coprocessor introduction - <a href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">official blog</a>, <a href="http://fr.slideshare.net/schubertzhang/coprocessor-introduction-20120830a">presentation</a>.</li>
</ul>
</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-83743820731689804582014-05-28T08:58:00.002-07:002014-05-31T03:00:55.276-07:00Indexing keys and values in MapDB<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://mapdb.org/">MapDB</a> is a high performance pure java database, it provides concurrent collections (Maps, Sets and Queues) backed by disk storage or off-heap memory.<br />
It provides a powerful mechanism to synchronize collections that can be used to build multiple indexes on a primary collection. Follows is an example showing how to index keys and also values of main collection.<br />
<br />
1. define a serializable class<br />
<pre class="brush:java">// this class should implement serializable in order to be stored
public class Person implements Serializable {
String firstname;
String lastname;
Integer age;
boolean male;
public Person(String f, String l, Integer a, boolean m) {
this.firstname = f;
this.lastname = l;
this.age = a;
this.male = m;
}
public boolean isMale() {
return male;
}
@Override
public String toString() {
return "Person [firstname=" + firstname + ", lastname=" + lastname + ", age=" + age + ", male=" + male + "]";
}
}
</pre>
<br />
2. Define a map of persons by id
<br />
<pre class="brush:java">// stores person under id
BTreeMap<Integer, Person> primary = DBMaker.newTempTreeMap();
primary.put(111, new Person("bIs9r", "NWmqoxFf", 92, true));
primary.put(111, new Person("4KXp8", "QrPsabf1", 31, false));
primary.put(111, new Person("eJLIo", "SJwJidWk", 6, true));
primary.put(111, new Person("LGW58", "vteM4khp", 42, false));
primary.put(111, new Person("tIM8R", "Rzq75ONh", 57, false));
primary.put(111, new Person("KqKRE", "BnpUV4dW", 26, true));
</pre>
<br />
3. Define a gender-based index
<br />
<pre class="brush:java">// stores value hash from primary map
NavigableSet<Fun.Tuple2<Boolean, Integer>> genderIndex = new TreeSet<Fun.Tuple2<Boolean, Integer>>();
//1. gender-based index: bind secondary to primary so it contains secondary key
Bind.secondaryKey(primary, genderIndex, new Fun.Function2<Boolean, Integer, Person>() {
@Override
public Boolean run(Integer key, Person value) {
return Boolean.valueOf(value.isMale());
}
});
</pre>
4. Use the gender-index to read all male persons
<br />
<pre class="brush:java">Iterable<Integer> ids = Fun.filter(genderIndex, true);
for(Integer id: ids) {
System.out.println(primary.get(id));
}
</pre>
<br />
MapDB offers multiple ways to define indexes on a given collection. It can also be extended to define specific kinds of indexes. The following is an example of implementing a <a href="http://en.wikipedia.org/wiki/Bitmap_index">Bitmap index</a> in MapDB:
<br />
<pre class="brush:java">public static <K, V, K2> void secondaryKey(MapWithModificationListener<K, V> map, final Map<K2, Set<K>> secondary,
final Fun.Function2<K2, K, V> fun) {
// fill if empty
if (secondary.isEmpty()) {
for (Map.Entry<K, V> e : map.entrySet()) {
K2 k2 = fun.run(e.getKey(), e.getValue());
Set<K> set = secondary.get(k2);
if (set == null) {
set = new TreeSet<K>();
secondary.put(k2, set);
}
set.add(e.getKey());
}
}
// hook listener
map.modificationListenerAdd(new MapListener<K, V>() {
@Override
public void update(K key, V oldVal, V newVal) {
if (newVal == null) {
// removal
secondary.get(fun.run(key, oldVal)).remove(key);
} else if (oldVal == null) {
// insert
K2 key2 = fun.run(key, newVal);
Set<K> set = secondary.get(key2);
if (set == null) {
set = new TreeSet<K>();
secondary.put(key2, set);
}
set.add(key);
} else {
// update, must remove old key and insert new
K2 oldKey = fun.run(key, oldVal);
K2 newKey = fun.run(key, newVal);
if (oldKey == newKey || oldKey.equals(newKey))
return;
Set<K> set1 = secondary.get(oldKey);
if (set1 != null) {
set1.remove(key);
}
Set<K> set2 = secondary.get(newKey);
if (set2 == null) {
set2 = new TreeSet<K>();
secondary.put(newKey, set2);
}
set2.add(key);
}
}
});
}
</pre>
This new index can be used as follows:
<br />
<pre class="brush:java">final Map<Boolean, Set<Integer>> bitmapIndex = new HashMap<Boolean, Set<Integer>>();
secondaryKey(primary, bitmapIndex, fun); // 'fun' is the same gender-extracting Fun.Function2 used for the gender index above
</pre>
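Reading from the bitmap index is then a single map lookup; for example, to list all male persons:<br />
<pre class="brush:java">Set<Integer> maleIds = bitmapIndex.get(Boolean.TRUE);
if (maleIds != null) {
    for (Integer id : maleIds) {
        System.out.println(primary.get(id));
    }
}
</pre>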
<br />
Continue here</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-30352804377554753132014-05-03T08:23:00.002-07:002014-05-20T01:29:22.920-07:00Exploiting Big RAMs<div dir="ltr" style="text-align: left;" trbidi="on">
These are notes from a talk given by Neil Ferguson on how to take advantage of very large amounts of memory to improve the performance of server-side applications.<br />
<br />
<b>Background</b><br />
With the increase in the amount of data managed by any enterprise or web application, there is a continuous need to store more and more data while providing real-time access to it. The performance of such applications can be improved by making data available directly from memory and efficiently using the huge amounts of memory that may reach many terabytes in the near future.<br />
<br />
In fact, memory prices are continuously decreasing while capacity increases, to the point where terabytes of RAM will be available in servers in the near future. The cost of 1MB of RAM was about $0.01 in Jan 2009 and $0.005 in 2013 (source: <a href="http://www.jcmit.com/memoryprice.htm">Memory Prices (1957-2013)</a>). We could buy a <a href="http://www.theverge.com/2012/3/6/2848762/hp-z820-workstation-512-gb-memory">workstation with 512GB of RAM</a> for $2299, and Intel processors (e.g. Xeon) allow up to 144GB of RAM, and more (around a terabyte) for new-generation processors dedicated to server-class machines. However, it is still not practical to do much with such an amount of RAM. Why?<br />
<br />
<b>Garbage Collection Limitations</b><br />
In any garbage-collected environment (like the JVM), if the object allocation rate overtakes the rate at which the GC collects these objects, then long GC pauses (time during which the JVM stops the application just to run the garbage collector) may become very frequent. One way to avoid this problem is to leave plenty of free space in the heap. The thing is, leaving a third of 3GB free is not really a big deal, compared to leaving a third of 300GB, even if the ratio between free space and live data is the same.<br />
The bad news is that even with a large free space there may be situations where GC pauses are too long, typically for memory defragmentation.<br />
You can improve application performance with <span style="color: blue; font-family: monospace;">-XX:+ExplicitGCInvokesConcurrent</span> as a workaround to avoid long pauses when <span style="color: blue; font-family: monospace;">System.gc()</span> or <span style="color: blue; font-family: monospace;">Runtime.getRuntime().gc()</span> is explicitly called (e.g. on direct ByteBuffer allocation).<br />
<br />
<b>Off-Heap storage</b><br />
To overcome some of these limitations of the JVM or any garbage-collected environment, allocating memory off-heap can be a solution. This can be done in different ways:<br />
<br />
<i>1. Direct ByteBuffers</i><br />
The NIO API allows the allocation of off-heap memory (i.e. not part of the process heap and not subject to GC) for storing long-lived data via <span style="color: blue; font-family: monospace;">ByteBuffer.allocateDirect(int capacity)</span>. The total capacity is limited to what is specified with the JVM option <span style="color: blue; font-family: monospace;">-XX:MaxDirectMemorySize</span>.<br />
Allocation through ByteBuffer has implications for GC (long pauses) when buffers are not freed fast enough, which makes it unsuitable for short-lived objects, i.e. allocating and freeing a lot of memory frequently.<br />
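A minimal sketch of allocating and using a direct buffer:<br />
<pre class="brush:java">// allocate 1MB outside the Java heap (bounded by -XX:MaxDirectMemorySize)
ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
buffer.putLong(0, 42L);            // write at absolute offset 0
long value = buffer.getLong(0);    // read it back
// the off-heap block is only reclaimed when the buffer object itself is garbage collected
</pre>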
<br />
<i>2. <a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html">sun.misc.Unsafe</a> </i><br />
Direct ByteBuffer itself relies on <span style="color: blue; font-family: monospace;"><a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html#allocateMemory(long)">sun.misc.Unsafe.allocateMemory</a></span> to allocate a big block of memory off-heap and on <a href="http://www.docjar.com/docs/api/sun/misc/Unsafe.html#freeMemory(long)" style="font-family: monospace;">sun.misc.Unsafe.freeMemory</a> to explicitly free it.<br />
Here is a very simple implementation of a wrapper class based on the Unsafe API for managing off-heap memory:
<br />
<pre class="brush:java">public class OffHeapObject {
// fields
private static Unsafe UNSAFE;
static {
try {
// get instance using reflection
Field field = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
field.setAccessible(true);
UNSAFE = (sun.misc.Unsafe) field.get(null);
}catch(Exception e){
throw new IllegalStateException("Could not access theUnsafe instance field");
}
}
private static final int INT_SIZE = 4;
// base address for the allocated data
private long address;
// constructor
public OffHeapObject(T heapObject) {
// serialize data
byte[] data = serialize(heapObject);
// allocate off-heap memory
address = UNSAFE.allocateMemory(INT_SIZE + data.length);
// save the data size in first bytes
UNSAFE.putInt(address, data.length);
// Write data byte by byte to the allocated memory
for(int i=0; i < data.length; i++) {
UNSAFE.putByte(address + INT_SIZE + i, data[i]);
}
}
public T get() {
int length = UNSAFE.getInt(address);
// read data from the memory
byte[] data = new byte[length];
for(int i = 0; i < data.length; i++) {
data[i] = UNSAFE.getByte(address + INT_SIZE + i);
}
// return the deserialized data
return deserialize(data);
}
// free allocate space to avoid memory leaks
public void deallocate() {
//TODO make sure to not call this more than once
UNSAFE.freeMemory(address);
}
}</pre>
OffHeapObject can be used, for instance, to store the values of cached data, e.g. using Google Guava to store key/OffHeapObject pairs where the latter wraps the data living in off-heap memory (a sketch follows). This way GC pauses can be considerably reduced, as these objects are just small references and do not occupy big blocks of heap memory. Also, the process size should not grow indefinitely, as fragmentation is reduced.<br />
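A sketch of such a cache using Google Guava (the key 'user-42' and the String payload are only illustrative; the removal listener frees the off-heap block when an entry is evicted):<br />
<pre class="brush:java">// requires com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification}
Cache<String, OffHeapObject<String>> cache = CacheBuilder.newBuilder()
    .maximumSize(1000000)
    .removalListener(new RemovalListener<String, OffHeapObject<String>>() {
        @Override
        public void onRemoval(RemovalNotification<String, OffHeapObject<String>> n) {
            n.getValue().deallocate(); // free the off-heap block on eviction
        }
    })
    .build();

cache.put("user-42", new OffHeapObject<String>("a large serialized payload"));
String payload = cache.getIfPresent("user-42").get();
</pre>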
<br />
Note that this implementation of OffHeapObject is very basic, and there is a performance impact to using off-heap memory. In fact, everything needs to be serialized on writes to off-heap memory and de-serialized on reads from it, and these operations have some overhead and reduced throughput compared to on-heap storage.<br />
Furthermore, not every object can be stored in off-heap memory; for instance the OffHeapObject itself, which keeps a reference to a block of off-heap memory, is actually stored on the heap.<br />
The performance of this implementation may be enhanced with techniques like <a href="http://en.wikipedia.org/wiki/Data_structure_alignment">data alignment</a>.<br />
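The serialize/deserialize helpers used by OffHeapObject above are not shown in the talk; a minimal sketch based on standard Java serialization (hence the T extends Serializable bound) could be:<br />
<pre class="brush:java">// assumed helpers inside OffHeapObject, using plain java.io serialization
private static byte[] serialize(Object obj) {
    try {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(obj);
        oos.close();
        return bos.toByteArray();
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}

@SuppressWarnings("unchecked")
private T deserialize(byte[] data) {
    try {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data));
        return (T) ois.readObject();
    } catch (IOException e) {
        throw new IllegalStateException(e);
    } catch (ClassNotFoundException e) {
        throw new IllegalStateException(e);
    }
}
</pre>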
<br />
Some existing caches based on off-heap storage<br />
<br />
continue from 28:39<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/ysf1RekaZoI?feature=player_embedded' frameborder='0'></iframe></div>
<div style="text-align: center;">
<i>Big RAM: How Java Developers Can Fully Exploit Massive Amounts of RAM</i> </div>
<br />
<b>Resources</b><br />
<ul style="text-align: left;">
<li>Understanding Java Garbage Collection presentation at JUG Victoria Nov/2013 - <a href="http://www.azulsystems.com/sites/default/files/images/UnderstandingGC-JUG-Victoria-Nov2013.pdf">Azul Systems</a></li>
<li>Measuring GC pauses with jHiccup - <a href="http://www.azulsystems.com/jHiccup">Azul Systems</a></li>
<li>A good documentation of the Unsafe API can be found in this <a href="http://mishadoff.github.io/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/">blog post</a>.</li>
<li>How Garbage Collection works in Java - <a href="http://javarevisited.blogspot.fr/2011/04/garbage-collection-in-java.html">blog post</a>.</li>
</ul>
<br />
<div>
<br /></div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-22324471358167058942014-05-03T06:59:00.003-07:002014-05-28T08:26:18.155-07:00Random resources related to Docker<div dir="ltr" style="text-align: left;" trbidi="on">
<b>General</b><br />
<br />
<ul style="text-align: left;">
<li><a href="http://mjbright.github.io/Pygre/2014/2014-Mar-27-LightWeightVirtualizationWithDocker/presentation.html">Light weight virtualization with Docker</a> </li>
<li>The Docker Book <a href="http://www.dockerbook.com/TheDockerBook_sample.pdf">sample</a> and <a href="http://www.dockerbook.com/toc.html">TOC</a>.</li>
<li><a href="http://blog.octo.com/en/docker-qas/">Docker Q&As</a> </li>
<li><a href="http://blog.octo.com/en/docker-containers-configuration/">Containers configuration</a> </li>
<li><a href="http://blog.octo.com/en/docker-registry-first-steps/">Docker registry</a> </li>
<li>Getting started with docker - <a href="http://serversforhackers.com/articles/2014/03/20/getting-started-with-docker/">serversforhackers.com</a></li>
<li>Advanced provisionning with Packer - <a href="http://mmckeen.net/blog/2013/12/27/advanced-docker-provisioning-with-packer/">mmckeen.net</a></li>
<li>A practical introduction to docker containers - <a href="https://developerblog.redhat.com/2014/05/15/practical-introduction-to-docker-containers/">developerblog.redhat.com</a></li>
</ul>
<br />
<b>Management</b><br />
<ul style="text-align: left;">
<li>Cleanup old images <a href="http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html">blog post</a></li>
<li>Docker Log Management Using Fluentd - <a href="http://jasonwilder.com/blog/2014/03/17/docker-log-management-using-fluentd/">jasonwilder.com</a></li>
</ul>
<br />
<br />
<b>Continuous Integration</b><br />
<ul style="text-align: left;">
<li>Provisionning jenkins slaves with docker - <a href="http://www.nuxeo.com/blog/development/2014/02/docker-jenkins-cloud-provider/">nuxeo.com</a> </li>
<li>Continuous Integration using Docker, Maven and Jenkins - <a href="http://www.wouterdanes.net/2014/04/11/continuous-integration-using-docker-maven-and-jenkins.html">wouterdanes.net</a></li>
<li>Next gen CI built with docker - <a href="http://blog.frozenridge.co/next-generation-continuous-integration-deployment-with-dotclouds-docker-and-strider/">frozenridge.co</a> </li>
<li>Intro to build tools and continuous delivery (french) <a href="http://fr.slideshare.net/Zenika/nightclazz">slideshare.net</a></li>
</ul>
<br />
<b>Environment configuration</b><br />
<ul>
<li>Docker-Based Development Environments - <a href="http://www.vagrantup.com/blog/feature-preview-vagrant-1-6-docker-dev-environments.html">vagrantup.com</a></li>
<li>Configuring dev environment with docker - <a href="http://tersesystems.com/2013/11/20/building-a-development-environment-with-docker/">Terse Systtems</a></li>
</ul>
<b>DevOps</b><br />
<ul style="text-align: left;">
<li>Introducing the IBM DevOps approach - <a href="http://www.ibm.com/developerworks/devops/principles.html">developerworks</a></li>
</ul>
<br />
<b>lmctfy</b><br />
<ul style="text-align: left;">
<li>What is the essential difference between lmctfy and LXC? <a href="https://groups.google.com/forum/#!topic/lmctfy/e6oGQELK2oA">google groups</a></li>
<li>What is the difference between lmctfy and lxc <a href="http://stackoverflow.com/questions/19196495/what-is-the-difference-between-lmctfy-and-lxc">stackoverflow</a></li>
<li>Containers track at <a href="http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153">Linux plumber conf 2013</a></li>
<li>Let Me Contain That For You at <a href="http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1239/original/lmctfy%20%281%29.pdf">Linux plumber conf 2013</a></li>
<li>LMCTFY on <a href="https://www.blogger.com/github%20https://github.com/google/lmctfy">github</a></li>
<li>Develop apps on the cloud - <a href="http://www.ibm.com/developerworks/library/d-bluemix-devops-services-project/index.html">developerworks</a></li>
<li>Difference between docker and Cloud Foundry's warden - <a href="https://docs.google.com/document/d/1DDBJlLJ7rrsM1J54MBldgQhrJdPS_xpc9zPdtuqHCTI">google docs</a></li>
</ul>
<br />
<b>Networking</b><br />
<ul style="text-align: left;">
<li>Configure Networking - <a href="http://docs.docker.io/use/networking/">official doc</a></li>
<li>Docker container’s configuration - <a href="http://blog.octo.com/en/docker-containers-configuration/">octo talks</a></li>
</ul>
<br />
<b>Ecosystem</b><br />
<ul style="text-align: left;">
<li><a href="http://www.projectatomic.io/">Atomic project</a> - Deploy and Manage your Docker Containers </li>
<li><a href="https://www.openshift.com/blogs/geard-the-intersection-of-paas-docker-and-project-atomic">GearD</a> - The Intersection of PaaS, Docker and Project Atomic </li>
<li>Classification of the ecosystem of <a href="http://www.centurylinklabs.com/top-10-startups-built-on-docker/">startups based on Docker</a> </li>
<li><a href="http://goo.gl/03sPBH">Slides</a> from DockerFr Meetup on Docker ecosystem</li>
<li><a href="http://opencore.io/">OpenCore</a> a Big Data (Hadoop) as a Service provider</li>
</ul>
<br />
<br />
<b>API Client</b><br />
<ul style="text-align: left;">
<li>Docker Java on <a href="https://github.com/kpelykh/docker-java">github</a></li>
<li>Maven plugin on <a href="https://github.com/etux/docker-maven-plugin">github</a></li>
</ul>
<div>
work in progress</div>
</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-35811131912757539582014-04-13T08:48:00.001-07:002014-05-01T04:24:03.334-07:00Managing Docker images and containers<div dir="ltr" style="text-align: left;" trbidi="on">
In addition to managing Docker resources (including containers, images, hosts) through the official CLI, there are plenty of solutions available in the community to manage Docker resources in a comprehensive way from a single web-based interface.<br />
<h4 style="text-align: left;">
DockerUI</h4>
Once our containers are running, <a href="https://github.com/crosbymichael/dockerui">DockerUI</a> can be used to manage the overall system. It's a simple web app with basic features for:<br />
- Checking the state of containers (running, stopped)<br />
- Removing images<br />
- Starting, stopping, killing and removing containers<br />
<br />
DockerUI can be used with the following commands<br />
<br />
1. Build the web app from the GitHub repository and tag the built image<br />
<span style="color: blue; font-family: monospace;">$docker build -t crosbymichael/dockerui github.com/crosbymichael/dockerui</span><br />
<br />
2. Launch the built image, make the web app available on port 9000 and mount the Docker unix socket to remotely control Docker:<br />
<span style="color: blue; font-family: monospace;">$docker run -p 9000:9000 -v /var/run/docker.sock:/docker.sock crosbymichael/dockerui -e /docker.sock</span><br />
<br />
Then in the browser, visit localhost:9000 to get something like:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxKyu0Xmc83o2l4ziIPfVZfvomFVpwHRPNa7XZ6MWDVjI2FaHs0ovMVFCj5NItzUcKLESLeg6B3KHMdEa7y8sUS3v7NicblqBCE70dcf52w-TL3Re6pgZIFyF95iqip8TZv2HhA1zv8iQ/s1600/dockerui.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxKyu0Xmc83o2l4ziIPfVZfvomFVpwHRPNa7XZ6MWDVjI2FaHs0ovMVFCj5NItzUcKLESLeg6B3KHMdEa7y8sUS3v7NicblqBCE70dcf52w-TL3Re6pgZIFyF95iqip8TZv2HhA1zv8iQ/s1600/dockerui.png" height="273" width="320" /></a></div>
<h4 style="text-align: left;">
Shipyard</h4>
<a href="http://shipyard-project.com/">Shipyard</a> is a more advanced Docker management solution based on a client-server architecture where the agents (i.e. clients) collect information on Docker resources and report them to the Shipyard server. It providers in addition to the features available in DockerUI:<br />
- Authentication<br />
- Building new images by uploading local Dockerfile or providing URLs to a remote location<br />
- In the browser terminal emulation for attaching containers<br />
- Visualizing CPU and memory utilization of the running images<br />
- ...<br />
<br />
1. To use Shipyard, issue the following to pull and run the image from the Docker public index:<br />
<span style="color: blue; font-family: monospace;">$docker run -i -t -v /var/run/docker.sock:/docker.sock shipyard/deploy setup</span><br />
<br />
Now, we can register as admin to Shipyard on http://localhost:8000/<br />
<br />
2. Install the latest release (e.g. v0.2.5) of the <a href="https://github.com/shipyard/shipyard-agent">Shipyard agent</a> on every host to collect information on Docker resources:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">curl https://github.com/shipyard/shipyard-agent/releases/download/v0.2.5/shipyard-agent -L -o /usr/local/bin/shipyard-agent</span><br />
<span style="color: blue; font-family: monospace;">$chmod +x /usr/local/bin/shipyard-agent</span><br />
<br />
3. Run the agent and register to the main host where Shipyard is running<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">usr/local/bin/</span><span style="color: blue; font-family: monospace;">shipyard-agent -url http://localhost:8000 -register</span><br />
<br />
4. On the Shipyard interface, authorize the agents already deployed to enable them.<br />
5. Run the agent with the key given at registration:<br />
<span style="color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">/</span><span style="color: blue; font-family: monospace;">usr/local/bin/</span><span style="color: blue; font-family: monospace;">shipyard-agent -url http://localhost:8000 -key agent_key</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3SxWk3HOZkip-I-4QHKVMwTMsfeFJ3TPHTByfmWO8rs4yyn4pyoSVVrWrfyAhTqD77wsrcSP8RZRXweyVwWWrMM3RLlbneiRgBrB6VPuXiDGAajGNecNIaKYlZrK5vX00HG4Omog3L5s/s1600/shipyard.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3SxWk3HOZkip-I-4QHKVMwTMsfeFJ3TPHTByfmWO8rs4yyn4pyoSVVrWrfyAhTqD77wsrcSP8RZRXweyVwWWrMM3RLlbneiRgBrB6VPuXiDGAajGNecNIaKYlZrK5vX00HG4Omog3L5s/s1600/shipyard.png" height="315" width="320" /></a></div>
<br />
<br />
<b>Troubleshooting</b>, in case you get this message:<br />
<span style="color: red;">Error requesting images from Docker: Get http://127.0.0.1:4243/images/json?all=0</span><br />
Then stop the Docker service and re-start it while enabling Remote API access for any IP address:<br />
<span style="color: blue; font-family: monospace;">$sudo service docker stop</span><br />
<span style="color: blue; font-family: monospace;">$docker -H tcp://0.0.0.0:4243 -H unix:///var/run/docker.sock -d &</span><br />
<br />
happy dockering</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0tag:blogger.com,1999:blog-2035497736124196692.post-83288945424865651112014-04-06T13:49:00.003-07:002014-04-13T07:52:59.336-07:00Automating Docker image builds with Dockerfiles<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Hello Dockerfile</b><br />
This is a continuation of a previous <a href="http://elsoufy.blogspot.fr/2014/03/build-your-own-saas-with-docker.html">post</a> on Docker, with the aim of using specific scripts called Dockerfiles to automate the steps that we had been issuing manually to build Docker images. When Docker parses the script file, it sequentially executes the commands, starting from a base image and creating a new image after each command.<br />
The syntax of a dockerfile instruction is as simple as :<br />
<span style="color: blue; font-family: monospace;">command argument1 argument2 ... </span><br />
or<br />
<span style="color: blue; font-family: monospace;">command ["argument1", "argument2", ...] </span> only for the entry-point command !!<br />
<br />
It's preferable to write the command in uppercase!<br />
<br />
<b>Dockerfile instructions</b><br />
There are a dozen instructions that can appear in a Dockerfile; a detailed list can be found in the <a href="http://docs.docker.io/en/latest/reference/builder/">official documentation</a>. The most common ones are:<br />
<ul style="text-align: left;">
<li><span style="color: blue; font-family: monospace;">FROM</span> all dockerfile should start with this command that specify the name of the image to use as a working or base image;</li>
<li><span style="color: blue; font-family: monospace;">RUN</span> allows to run a command in the current container and commit (automatically) the changes to a new image;</li>
<li><span style="color: blue; font-family: monospace;">MAINTAINER</span> allows to specify information (name, email) on the person responsible for maintain this script;</li>
<li><span style="color: blue; font-family: monospace;">ENTRYPOINT</span> allows to specify what command should be executed at first once the container is started;</li>
<li><span style="color: blue; font-family: monospace;">USER</span> allows to specify with which user account the command inside the container have to be executed with; </li>
<li><span style="color: blue; font-family: monospace;">EXPOSE</span> allows to specify what port to expose for the running container.</li>
<li><span style="color: blue; font-family: monospace;">ENV</span> to use for setting environment variables</li>
<li><span style="color: blue; font-family: monospace;">ADD</span> to copy files from the build context (it does not work if using stdin to read dockerfile) into a physical directory in the image (e.g. copying a war file into tomcat webapps folder)</li>
</ul>
<a href="https://www.docker.io/learn/dockerfile/level1/">Here</a> you can find the official tutorial to experiment with these command.<br />
<br />
<b>Parsing dockerfiles</b><br />
Once you have finished editing the build script, issue <span style="color: blue; font-family: monospace;">docker build</span> to parse the Dockerfile and create a new image. There are different ways to use this command:<br />
<ul style="text-align: left;">
<li>when the Dockerfile is in the current directory: <span style="color: blue; font-family: monospace;">docker build .</span></li>
<li>from stdin: <span style="color: blue; font-family: monospace;">docker build - < Dockerfile</span></li>
<li>from a GitHub repository: <span style="color: blue; font-family: monospace;">docker build github.com/username/repo</span>; Docker will then clone the repo and parse the Dockerfile in its directory.</li>
</ul>
<br />
<b>Example</b><br />
Now let's take the instructions from the previous <a href="http://elsoufy.blogspot.fr/2014/03/build-your-own-saas-with-docker.html">post</a> and gather them into a Dockerfile:<br />
<span style="color: blue; font-family: monospace;"># Use ubuntu as a base image</span><br />
<span style="color: blue; font-family: monospace;">FROM ubuntu</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;"># update package respository</span><br />
<span style="color: blue; font-family: monospace;">RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list</span><br />
<span style="color: blue; font-family: monospace;"></span><br />
<span style="color: blue; font-family: monospace;">RUN echo "deb http://archive.ubuntu.com/ubuntu precise-security main universe" > /etc/apt/sources.list</span><br />
<span style="color: blue; font-family: monospace;">RUN </span><span style="background-color: white; color: blue; font-family: monospace;">apt-get update</span><br />
<span style="background-color: white; color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># install java, tomcat7</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">RUN apt-get install -y default-jdk</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">RUN </span><span style="background-color: white; color: blue; font-family: monospace;">apt-get install -y tomcat7</span><br />
<div>
<br /></div>
<span style="color: blue; font-family: monospace;">RUN mkdir /usr/share/tomcat7/logs/</span><br />
<span style="color: blue; font-family: monospace;">RUN mkdir /usr/share/tomcat7/temp/</span><br />
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># set tomcat environment variables</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> </span><span style="color: blue; font-family: monospace;">JRE_HOME=/usr/</span><span style="color: blue; font-family: monospace;">lib/jvm/default-java</span><span style="color: blue; font-family: monospace;">/jre</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">ENV</span><span style="color: blue; font-family: monospace;"> CATALINA_HOME=/usr/share/tomcat7/</span><br />
<div>
<span style="color: blue; font-family: monospace;"><br /></span>
<span style="color: blue; font-family: monospace;"># copy war files to the webapps/ folder</span><br />
<span style="color: blue; font-family: monospace;">ADD path/to/war /usr/</span><span style="color: blue; font-family: monospace;">share/tomcat7/webapps/</span><br />
<span style="color: blue; font-family: monospace;"><br /></span></div>
<span style="color: blue; font-family: monospace;"># launch tomcat once the container started</span><br />
<span style="color: blue; font-family: monospace;">#ENTRYPOINT </span><span style="background-color: white; color: blue; font-family: monospace;">service tomcat7 start</span><br />
<span style="color: blue; font-family: monospace;">ENTRYPOINT </span><span style="color: blue; font-family: monospace;">/usr/share/tomcat7/bin/catalina.sh run</span><br />
<span style="background-color: white; color: blue; font-family: monospace;"><br /></span>
<span style="background-color: white; color: blue; font-family: monospace;"># expose the tomcat port number</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">EXPOSE 8080</span><br />
<br />
Save this script as <span style="color: blue; font-family: monospace;">Dockerfile</span>, build it and tag the image as <span style="color: blue; font-family: monospace;">tomcat7</span>, then launch the container while publicly exposing the Tomcat server port 8080, and finally check that the container is running:<br />
<span style="color: blue; font-family: monospace;">$docker build -t tomcat7 - < Dockerfile</span><br />
<span style="color: blue; font-family: monospace;">$docker run -p 8080 tomcat7</span><br />
<span style="color: blue; font-family: monospace;">$docker ps</span><br />
<br />
to be continued;</div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com1tag:blogger.com,1999:blog-2035497736124196692.post-56492059220928818262014-03-31T11:46:00.000-07:002014-05-01T05:22:10.751-07:00Build your own SaaS with Docker - Part I<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Hello Docker</b><br />
Docker enables sandboxing of applications and their dependencies in virtual containers so they can be run in isolation. It provides an easy-to-use API for automating deployment operations that looks very similar to Git commands. More introductory information can be found on its <a href="http://en.wikipedia.org/wiki/Docker_%28software%29">Wikipedia page</a>.<br />
<br />
<b>Installation</b><br />
Docker installation on Ubuntu 64-bit (for other OSes check the <a href="https://www.docker.io/gettingstarted/#h_installation">official documentation</a>):<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo sh -c "curl https://get.docker.io/gpg | apt-key add -"</span><span style="background-color: white; color: blue; font-family: monospace;"> </span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"</span><span style="color: blue; font-family: monospace;"> </span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo apt-get update</span><br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo apt-get install lxc-docker</span><br />
<br />
Once Docker is installed, run a shell from within a container as follows:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t ubuntu /bin/bash</span><br />
<br />
Since it will not find the <span style="color: blue; font-family: monospace;">ubuntu</span> image locally, Docker will pull it from the <a href="https://index.docker.io/">registry</a>. Once installed, you can run:<br />
<ul style="text-align: left;">
<li><span style="background-color: white; color: blue; font-family: monospace;">#exit</span> to leave the container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker images</span> to see all local images.</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker inspect image_name </span>to see detailed information on an image.</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps </span>to see the status of the container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker stop CONTAINER_ID</span> to stop a running image (or container)</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker logs </span><span style="color: blue; font-family: monospace;">CONTAINER_ID</span> to see all logs if a given container</li>
<li><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker commit CONTAINER_ID image_name</span> to commit changes made to a container</li>
</ul>
<br />
<b>Installing Tomcat within a container</b><br />
Start a new container using the <span style="color: blue; font-family: monospace;">ubuntu</span> base image:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t ubuntu /bin/bash</span><br />
<br />
Update the image's system packages<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get update</span><br />
<br />
1. Install the <a href="http://tomcat.apache.org/%E2%80%8E">Apache Tomcat</a> application server:<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get install -y tomcat7</span><br />
<br />
Once installed the following directories are created (more details can be found <a href="http://askubuntu.com/questions/135824/what-is-the-tomcat-installation-directory">here</a>):<br />
<ul style="text-align: left;">
<li><span style="background-color: #cccccc;"><span style="font-family: Courier New, Courier, monospace;"><b>/etc/tomcat7</b></span></span> for configuration</li>
<li><b><span style="background-color: #cccccc; font-family: Courier New, Courier, monospace;">/usr/share/tomcat7</span></b> for runtime, called <b>$CATALINA_HOME</b></li>
<li><span style="background-color: #cccccc;"><span style="font-family: Courier New, Courier, monospace;"><b>/usr/share/tomcat7-root</b></span></span> for webapps</li>
</ul>
2. Install Java DK<br />
<span style="background-color: white; color: blue; font-family: monospace;">#apt-get install -y default-jdk</span><br />
<br />
3. Configure environment variables<br />
<span style="background-color: white; color: blue; font-family: monospace;">#pico ~/.bashrc</span><br />
<span style="color: blue; font-family: monospace;">export JAVA_HOME=/usr/lib/jvm/default-java</span><br />
<span style="color: blue; font-family: monospace;">export CATALINA_HOME=~/path/to/tomcat</span><br />
<span style="color: blue; font-family: monospace;">#. ~/.bashrc </span>to make the changes effective<br />
<br />
Now when typing <span style="background-color: white; color: blue; font-family: monospace;">#echo $</span><span style="color: blue; font-family: monospace;">CATALINA_HOME</span> you should see the exact path set to tomcat7.<br />
<br />
4. Start the Tomcat7 server<br />
<span style="background-color: white; color: blue; font-family: monospace;">#</span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/startup.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 start</span><br />
<br />
The start-up may fail with something like "cannot create directory '/usr/share/tomcat7/logs/catalina.out/'". To solve this, you may just have to create the logs directory:<br />
<span style="background-color: white; color: blue; font-family: monospace;">#mkdir </span><span style="color: blue; font-family: monospace;">/usr/share/tomcat7/logs</span><br />
<br />
To check whether Tomcat is running, issue<br />
<span style="background-color: white; color: blue; font-family: monospace;">#ps -ef | grep tomcat</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 status</span><br />
<br />
Then check in your browser: http://container_ip_address:8080/.<br />
To get the IP address of the container, issue (from inside the container)<br />
<span style="background-color: white; color: blue; font-family: monospace;">#ifconfig</span><br />
<br />
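Alternatively, assuming your Docker version already supports the <span style="font-family: monospace;">--format</span> flag of <span style="font-family: monospace;">docker inspect</span>, the container's IP address can be read directly from the host:<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' CONTAINER_ID</span><br />
<br />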
5. Shutdown Tomcat7<br />
<span style="background-color: white; color: blue; font-family: monospace;">#</span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/shutdown.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">#service tomcat7 stop</span><br />
<br />
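As a side note, the manual steps above could also be captured in a Dockerfile so the image can be rebuilt automatically. The following is only a rough, untested sketch of that idea (no CMD is defined, so Tomcat is still started interactively as described above, and <span style="font-family: monospace;">my_tomcat_image</span> is a placeholder name):<br />
<pre class="brush:bash"># write a Dockerfile reproducing the manual steps: update, install tomcat7 and a JDK, create the logs directory
$ cat &gt; Dockerfile &lt;&lt;'EOF'
FROM ubuntu
RUN apt-get update && apt-get install -y tomcat7 default-jdk
ENV CATALINA_HOME /usr/share/tomcat7
RUN mkdir -p /usr/share/tomcat7/logs
EXPOSE 8080
EOF
# build an image from it
$ sudo docker build -t my_tomcat_image .
</pre>
<br />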
<b>Save the image to index.docker.io</b><br />
The changes we made on top of the base image exist only in our container; we should <span style="color: blue; font-family: monospace;">commit</span> them as a new image so they are not lost.<br />
<br />
1. Login to index.docker.io<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker login</span><br />
<span style="color: blue; font-family: monospace;">Username: your_user_name</span><br />
<span style="color: blue; font-family: monospace;">Password: your_password</span><br />
<span style="color: blue; font-family: monospace;">Email: your_email</span><br />
<span style="color: blue; font-family: monospace;">Login Succeeded</span><br />
<br />
If you don't have an account, sign up <a href="https://index.docker.io/account/signup/">here</a>.<br />
<br />
2. Commit changes to your repository<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker commit CONTAINER_ID USERNAME/REPO_NAME</span><br />
<br />
3. Push changes to this repository<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker push </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME</span><br />
<br />
4. Start a new container using the image committed to your repository as the base image<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME /bin/bash</span><br />
<span style="color: blue; font-family: monospace;">#</span><br />
<br />
To run Tomcat in the container<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME </span><span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">CATALINA_HOME/bin/</span><span style="color: blue; font-family: monospace;">startup</span><span style="color: blue; font-family: monospace;">.sh</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker run -i -t </span><span style="color: blue; font-family: monospace;">USERNAME/REPO_NAME</span><span style="color: blue; font-family: monospace;"> </span><span style="background-color: white; color: blue; font-family: monospace;">service tomcat7 start</span><br />
<br />
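Note that in the two commands above <span style="font-family: monospace;">$CATALINA_HOME</span> is expanded by the host shell rather than inside the container, and both start-up variants put Tomcat in the background, so the container is likely to exit as soon as the command returns. A more convenient sketch, assuming your Docker version supports the <span style="font-family: monospace;">-p host_port:container_port</span> publishing syntax, is to keep an interactive shell open and publish port 8080 so that Tomcat is reachable from the host browser at http://localhost:8080/:<br />
<pre class="brush:bash"># publish container port 8080 on host port 8080 and open an interactive shell
$ sudo docker run -i -t -p 8080:8080 USERNAME/REPO_NAME /bin/bash
# then, inside the container, start Tomcat as before
service tomcat7 start
</pre>
<br />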
To clean up old containers<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps -a -q | xargs sudo docker rm</span><br />
or<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker ps -a | awk '{print $1}' | xargs sudo docker rm</span><br />
<br />
To clean up old, untagged images<br />
<span style="background-color: white; color: blue; font-family: monospace;">$</span><span style="color: blue; font-family: monospace;">sudo docker images | grep "^<none><none>" | awk '{print $3}' | xargs sudo docker rmi -f</none></none></span><br />
<br />
<div style="text-align: left;">
<b>Resources</b></div>
If you are confused by Docker terminology (e.g. container, image, etc.) check the <a href="http://docs.docker.io/en/latest/terms/">official documentation</a>.<br />
General-purpose instructions for installing Tomcat7 on an Ubuntu machine can be found <a href="http://diegobenna.blogspot.fr/2011/01/install-tomcat-7-in-ubuntu-1010.html">here</a>.<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com2tag:blogger.com,1999:blog-2035497736124196692.post-72423016622682079062013-10-06T02:49:00.001-07:002014-05-03T09:34:43.140-07:00Joyn or RCS<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Rich_Communication_Services">RCS</a> (Rich Communication Services) is a GSMA standard that aims to bring a set of rich communication (that goes beyond SMS and phone calls) yet inter-operable services across different domains managed by different telecom operators. This telcos standard is marketed under the name of Joyn.<br />
Many operators have already deployed it on their networks, offering users VoIP and presence services that can be accessed by installing an application from the app store (<a href="https://play.google.com/store/apps/details?id=com.witsoftware.wmc">Google Play</a>, <a href="https://itunes.apple.com/fr/app/joyn-by-orange/id509534836">AppStore</a>). In addition, some smartphone manufacturers who have joined the movement already embed the RCS stack in their devices.<br />
<br />
The next step in commercializing Joyn is to build an ecosystem by providing APIs and empowering the developer community to create communication-based applications that rely on the platform. Orange, through <a href="http://www.orangepartner.com/">Orange Partner</a>, and Deutsche Telekom, through the <a href="https://www.developergarden.com/">Developer Garden</a> program, are leading these efforts in Europe. For instance, they jointly sponsored the <a href="http://www.orangepartner.com/articles/48-hours-open-innovation">Joyn Hackathon</a> (<a href="http://www.orange.com/en/press/press-releases/press-releases-2013/Orange-announces-the-winners-of-the-Paris-to-Berlin-Hackathon-dedicated-to-joyn-in-partnership-with-Deutsche-Telekom">press release</a>) where the <a href="http://www.orangepartner.com/content/joyn-api">Joyn API</a> was introduced.<br />
<br />
The remainder of this post explains how to use the Android <a href="https://rcsjta.googlecode.com/git/sdk-joyn/index.html">Joyn SDK</a> to build conversational applications. The overall interaction between an application, the Joyn SDK and, behind it, the RCS platform is illustrated in the following figure.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://rcsjta.googlecode.com/git/sdk-joyn/concepts.html" style="margin-left: auto; margin-right: auto;"><img border="0" height="253" src="https://rcsjta.googlecode.com/git/sdk-joyn/assets-sdk/images/concepts_2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Joyn API call flow</td></tr>
</tbody></table>
<br />
<ol style="text-align: left;">
<li>Instantiate the Joyn chat service and establish a connection</li>
</ol>
<pre class="brush:java"> private ChatService mService;
private JoynServiceListener mListener = new JoynServiceListener() {
@Override public void onServiceDisconnected(int error) {
Log.i(TAG, "ChatService disconnected!");
}
@Override public void onServiceConnected() {
Log.i(TAG, "ChatService connected!");
}
};
...
// Instantiate the API
mService = new ChatService(getApplicationContext(), mListener);
// Connect API
mService.connect();
</pre>
<ol start="2" style="text-align: left;">
<li>When the connection is successfully established, start calling the API methods</li>
</ol>
<pre class="brush:java"> private Chat mChat;
private ChatListener mChatListener = new ChatListener() {
@Override public void onReportMessageFailed(String arg0) {}
@Override public void onReportMessageDisplayed(String arg0) {}
@Override public void onReportMessageDelivered(String arg0) {}
@Override public void onNewMessage(ChatMessage arg0) {}
@Override public void onComposingEvent(boolean arg0) {}
};
...
@Override public void onServiceConnected() {
Log.i(TAG, "ChatService connected!");
if (mService != null && mService.isServiceRegistered()) {
// Get remote contact
String contact = getIntent().getStringExtra("contact");
// Call API Methods
mChat = mService.openSingleChat(contact, mChatListener);
mChat.sendMessage("hello world!");
}
}
</pre>
The API doc of the ChatService can be found on this <a href="https://rcsjta.googlecode.com/git/sdk-joyn/javadoc/org/gsma/joyn/chat/ChatService.html">link</a>.<br />
<br /></div>
b@ch!rhttp://www.blogger.com/profile/12329669313982425330noreply@blogger.com0