Thursday, September 4, 2014

Getting started with Hive

Introducing Hive

Hive installation is straightforward (there is not much to configure):
$ wget http://mir2.ovh.net/ftp.apache.org/dist/hive/stable/apache-hive--bin.tar.gz
$ tar xzf apache-hive--bin.tar.gz
$ cd apache-hive--bin/bin/
$ ./hive
hive> show tables;

Notice that the environment variable HIVE_HOME is not required (which is not the case for hadoop/hbase/tez). Also, hive-site.xml is not required, but if we want to store the warehouse in an HDFS directory it should contain something like:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://namenode_hostname/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
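
If the warehouse lives on HDFS, the directory must exist and be writable by Hive. The commands below are the usual getting-started steps (adjust the paths to your cluster):
$ hadoop fs -mkdir -p /user/hive/warehouse    # create the warehouse directory
$ hadoop fs -chmod g+w /user/hive/warehouse   # make it group-writable for hive
$ hadoop fs -mkdir -p /tmp                    # scratch space used by Hive jobs
$ hadoop fs -chmod g+w /tmp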

Introducing HQL

So let's create a table quotes and make it available to other Hadoop programs as a text file (note that LOCATION must point to a directory, not to a single file):
hive> CREATE EXTERNAL TABLE quotes (symbol STRING, name STRING, price DOUBLE)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE
    > LOCATION '/tmp/quotes';
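
One caveat: ROW FORMAT DELIMITED does not understand the double quotes in the data, so a field like "Facebook, Inc." would be split on its embedded comma. If your Hive version ships OpenCSVSerde (0.14 and later), a sketch of a quote-aware table could look like this (the table name and location are illustrative):
hive> CREATE EXTERNAL TABLE quotes_csv (symbol STRING, name STRING, price STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    > WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
    > STORED AS TEXTFILE
    > LOCATION '/tmp/quotes_csv';
Note that OpenCSVSerde reads every column as STRING, so price would need a CAST in queries.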

Then, we can load data into this table, for instance from a local file quotes.csv that looks like:
"GE","General Electric ",28.09
"MSFT","Microsoft Corpora",41.66
"GOOG","Google Inc.",604.83
"GM","General Motors Co",41.85
"FB","Facebook, Inc.",72.59
"AAPL","Apple Inc.",607.33
"T","AT&T Inc.",37.15
"VZ","Verizon Communica",52.06
"TM","Toyota Motor Corp",134.94

with the following query:
hive> LOAD DATA LOCAL INPATH '/path/to/quotes.csv'
    > OVERWRITE INTO TABLE quotes;
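
Two remarks on LOAD DATA: with LOCAL the file is read from the local filesystem and copied into the table's location, whereas without LOCAL the path is resolved on HDFS and the file is moved; likewise, OVERWRITE replaces the existing contents, while omitting it appends. For instance (the HDFS path here is illustrative):
hive> LOAD DATA INPATH '/user/hadoop/quotes.csv'
    > INTO TABLE quotes;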

Once the table is filled, we can query it with things like:
hive> SELECT * FROM quotes;
hive> SELECT symbol FROM quotes;
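
Queries that filter, sort, or aggregate are compiled into MapReduce jobs behind the scenes, for example:
hive> SELECT name, price FROM quotes WHERE price > 100 ORDER BY price DESC;
hive> SELECT COUNT(*), AVG(price) FROM quotes;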

We can also export the result of a query to a local directory, say /tmp/quotes_100:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/quotes_100'
    > SELECT *
    > FROM quotes
    > WHERE quotes.price > 100;
The result of this export is a set of files under the quotes_100 directory; the quotes matching the criterion can be found in a file named 000000_0.
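
By default the exported file uses Hive's ^A (\001) character as field separator. Since Hive 0.11, the output format can be stated in the INSERT itself:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/quotes_100'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > SELECT * FROM quotes WHERE quotes.price > 100;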

Tuning Hive

Understanding how Hive plans and executes queries is essential for performance tuning. One way to inspect the query plan is the EXPLAIN keyword:
hive> EXPLAIN SELECT *
    > FROM quotes
    > WHERE quotes.price > 100;
hive> EXPLAIN SELECT SUM(price) FROM quotes;
The result shows the translation of these queries into different possible operations called stages, for instance map-reduce, sampling, merge, or limit stages.
Using the EXTENDED keyword along with EXPLAIN provides even more detail on the query execution plan:
hive> EXPLAIN EXTENDED SELECT SUM(price) FROM quotes;

By default, Hive executes one stage at a time. Independent stages can be run in parallel by setting the following property to true in hive-site.xml:
<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
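
The same switch (and the number of stages allowed to run concurrently) can also be set per session from the CLI:
hive> SET hive.exec.parallel=true;
hive> SET hive.exec.parallel.thread.number=8;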

The number of reducers launched is determined by the size of the input divided by the data volume assigned to each reducer (the number of mappers, in contrast, follows the input splits). The per-reducer volume can be configured via:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>750000000</value>
  <description>size in bytes of input processed per reducer</description>
</property>
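
This setting, too, can be overridden per session; mapred.reduce.tasks bypasses the estimation entirely and forces a fixed number of reducers:
hive> SET hive.exec.reducers.bytes.per.reducer=750000000;
hive> SET mapred.reduce.tasks=4;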
