vendredi 13 mai 2016

Notes on Big Data related talks

Hadoop Summit 2016

Apache Eagle Monitor Hadoop in Real time
The talk was about Apache Eagle a Hadoop product developed by eBay to monitor activities on a Hadoop cluster from the security perspective. The talk started by describing the pillars of security in hadoop: perimeter security; authorization & access control; discovery (e.g. classifying data according to their sensitivity), activity monitoring. The talk is mainly on the last part to address info sec questions: who many users are using Hadoop cluster, what files are they accessing, etc. From this purpose Eagle was born to be able to track events form different sources (e.g. accidently deleting files from HDFS) and correlate them with some user-defined policies.

Ingest and Stream Processing What will you choose
The talk was divided in two parts, the first one was about streaming patterns. And how each part provide at least once or exactly one message delivery.
The second part was a demo for building a streaming pipeline using streamsets editor easily. The demo was about using land data of the city of San Fransisco, streaming it and trying to calculate the land with maximum area. The generated data is then store into two destinations Kudu for analytics (e.g. top ten areas) and another Kafa for the events to be used for rendering on minecraft (which was pretty neat).

Real time Search on Terabytes of Data Per Day Lessons Learned
Lessons learned from the plaform engineering team at Rocana (an Ops monitoring software vendering) on building a search engine on HDFS. They described the context and amount of data they are dealing with at a daily basis in terabytes of data. Then, they talked about their initial use of Solar cloud as an enabler to their platform, and how they struggled to scale it and finally decided to create their our search engine based on Lucene and HDFS to store indexes. The rest of the talk was about the specific time-oriented search engine architecture. In the Q&A, one question was on Elasticsearch, they didn't really tested but rather relied on an analysis made by the author of Jepsen (which is a tool for analysing distributed systems).

Spark Summit East 2016

Spark Performance: What's Next
The talk started by a finding since the Spark project started in 2010 up to now on the evolution of IO speed, network throughput and CPU speed as the two firsts increase by a factor of 10x while CPU is stuck at 3Gz. The first attempt to CPU and memory optimization was through project Tungsten. The, the speaker described the two phases of perf enhancement:
  • Phase 1 (Spark 1.4 to 1.6) enhanced memory managed through using java.unsafe API and offheap memory instead of using Java objects (that allocates memory unnecessary).
  • Phase 2 (Spark 2): instead of using the Volcano Iterator Model to implement operators (i.e. filter, projection, aggregation) use the Whole-stage Codegen to generate optimized code (and avoid virtual functions call). Plus the use of vectorization (i.e. columnar) to represent data in memory for an efficient scan.

Then the speaker described the impact of these enhancement by comparing the performance of Spark 1.6 vs Spark 2 for different queries. These modification are on master under active development.
In the QA, the described techniques are applicable for DataFrames as the the engine has more information on the data schema which is not the case with RDDs. With Dataset API (which is on top of the DataFrame API) you get the benefit of telling the Engine the data schema as well as the safety data types (i.e. accessing the items without having to cast them to their type). DataFrame gives you index access, while Datasets gives you object access.


Ted Dunning on Kafka, MapR, Drill, Apache Arrow and More
Ted Dunning talking about why the Hadoop ecosystem succeeded over the NoSQL movement thanks to the more stable API as a standard way to make consensus among the community. While in NoSQL it tends to be isolated icelands. As an example he gave Kafka release of version 0.9 as it reached a new level of stability thanks to its API. He then described how Kafka fit in its goal and give an example of a use case where it's going to be hard to used for. The use case was about real-time tracking of shipment containers, in the case where a dedicated Kafka topic is used to track each container, in this case it will be hard to replicate effectively.
Then, he described MapR approach to open source as way to innovate in the underneath implementation why applying to a standard API (e.g. HDFS).
He also talked about Drill and how MapR is trying to involve more member of the community so that it doesn't seem as the only supported. He also talked about the in-memory movement, and specially the Apache Arrow in-memory file system and how it enabled the co-author of pandas to be a Apache Feather a new file format to store data frames on disk and be able to send through wire with Apache Arrow without need for serialization.

more to come.

Aucun commentaire:

Enregistrer un commentaire