I had the same issue as Chris Zheng (see http://zhengdong.me/2012/02/22/hive-external-table-with-partitions/) re not being able to select anything out of my Hive external table partitions.
In fact, I had solved this problem several weeks ago without realising it, when I was moving data from one directory to another and altering the partition definitions for the move.
My data is loaded in a simplistic way into the following directory structure, i.e. each day's data lands in load_dt=YYYYMMDD:
hdfs:///data/myfeed/stg/load_dt=YYYYMMDD
E.g. given the following files:
cat 20120301_myfeed.dat
20120301|001500|test|A|35
20120301|003000|test|B|85
20120301|004500|test|A|25
20120301|010000|test|C|35
20120301|011500|test|A|95
20120301|013000|test|D|55
cat 20120302_myfeed.dat
20120302|001500|test|A|35
20120302|003000|test|B|85
20120302|004500|test|A|25
20120302|010000|test|C|35
20120302|011500|test|A|95
20120302|013000|test|D|55
Load them like this (creating the partition directories first, so -put drops each file inside its directory rather than creating a plain file at that path):
hadoop fs -mkdir /data/myfeed/stg/load_dt=20120301
hadoop fs -mkdir /data/myfeed/stg/load_dt=20120302
hadoop fs -put 20120301_myfeed.dat /data/myfeed/stg/load_dt=20120301
hadoop fs -put 20120302_myfeed.dat /data/myfeed/stg/load_dt=20120302
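It's worth checking the files landed where Hive will expect them; the listing (trimmed here) should show one directory per day:
hadoop fs -ls /data/myfeed/stg
Found 2 items
drwxr-xr-x   - ... /data/myfeed/stg/load_dt=20120301
drwxr-xr-x   - ... /data/myfeed/stg/load_dt=20120302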
Create an external table (partitioned by load_dt) as follows:
set myfeed_stg_location=/data/myfeed/stg;
set myfeed_stg_location;  -- echoes the value back as a sanity check
set myfeed_stg=myfeed_stg_ext;
set myfeed_stg;
-- Suppose myfeed stg data had records like this
-- 20120301|001500|test|A|35
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:myfeed_stg}
( event_dt STRING
, event_tm STRING
, category STRING
, code STRING
, num_hits INT
)
COMMENT 'Table for myfeed staging data'
PARTITIONED BY( load_dt STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${hiveconf:myfeed_stg_location}';
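Note that at this point the table exists but no partitions are registered in the metastore, which is exactly why the selects below come back empty. A quick check shows an empty partition list:
hive> show partitions myfeed_stg_ext;
OK
hive>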
I found, as Chris Zheng did, that I got nothing when I selected anything from my myfeed_stg_ext table.
It turns out you need to add the partitions explicitly :( like so:
hive> alter table myfeed_stg_ext add partition (load_dt=20120301);
OK
Time taken: 0.751 seconds
hive> alter table myfeed_stg_ext add partition (load_dt=20120302);
OK
Time taken: 0.279 seconds
hive> select * from myfeed_stg_ext;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
20120302 001500 test A 35 20120302
20120302 003000 test B 85 20120302
20120302 004500 test A 25 20120302
20120302 010000 test C 35 20120302
20120302 011500 test A 95 20120302
20120302 013000 test D 55 20120302
Time taken: 0.501 seconds
hive> select * from myfeed_stg_ext where load_dt = 20120301;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
Time taken: 0.314 seconds
hive>
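As an aside, depending on your Hive version, MSCK REPAIR TABLE can scan the table's location and register any partition-style directories it finds in one go, rather than one ALTER TABLE per day (worth testing on your version before relying on it):
hive> msck repair table myfeed_stg_ext;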
Here's a simple shell script to move the data from an existing directory structure /data/myfeed/stg/YYYYMMDD to /data/myfeed/stg/load_dt=YYYYMMDD. Note it only handles days within a single month; extend it if you need to cross month/year boundaries.
#!/bin/bash
# Move each daily directory into its Hive-style partition directory,
# then register the new partition with the external table.
day=20120301
while [ $day -le 20120331 ]
do
echo "hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}"
hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}
if [ $? -ne 0 ]
then
echo "ERROR: hadoop mv failed"
exit 1
fi
sleep 1 # don't need these sleeps - used during testing
hive -e "ALTER TABLE myfeed_stg_pext ADD PARTITION (load_dt=${day}); select * from myfeed_stg_pext where load_dt = '$day' limit 10;"
sleep 2 # don't need these sleeps - used during testing
day=$(($day+1))
done
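If you do need to cross month or year boundaries, one option (assuming GNU date is available) is to let date do the day arithmetic instead of adding 1 to the integer:
#!/bin/bash
# Minimal sketch: walk every day between two dates inclusive,
# letting GNU date handle month/year rollovers.
day=20120301
end=20120405
while [ "$day" -le "$end" ]
do
    echo "would process load_dt=${day}"
    day=$(date -d "${day} + 1 day" +%Y%m%d)
done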
TBC ...
Read up on dynamic partitions ... could this be a more elegant approach? Perhaps something like the sketch below:
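(A sketch only, not something I've run against this feed: it assumes a hypothetical non-partitioned table myfeed_raw holding the same columns plus load_dt, and lets Hive route each row to the right partition.)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- myfeed_raw is hypothetical, not defined anywhere above
INSERT OVERWRITE TABLE myfeed_stg_ext PARTITION (load_dt)
SELECT event_dt, event_tm, category, code, num_hits, load_dt
FROM myfeed_raw;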
And compression - lzo, others?
(http://www.mrbalky.com/2011/02/24/hive-tables-partitions-and-lzo-compression/)