I had the same issue as Chris Zheng (see http://zhengdong.me/2012/02/22/hive-external-table-with-partitions/) re not being able to select anything out of my Hive external table partitions.
In fact, I had solved this problem several weeks ago without realising it, when I was moving data from one directory to another and altering the partition definitions for the move.
My data is loaded in a simplistic way into the following directory structure, i.e. each day's data lands in load_dt=YYYYMMDD:
hdfs:///data/myfeed/stg/load_dt=YYYYMMDD
E.g. given the following files:
cat 20120301_myfeed.dat
20120301|001500|test|A|35
20120301|003000|test|B|85
20120301|004500|test|A|25
20120301|010000|test|C|35
20120301|011500|test|A|95
20120301|013000|test|D|55
cat 20120302_myfeed.dat
20120302|001500|test|A|35
20120302|003000|test|B|85
20120302|004500|test|A|25
20120302|010000|test|C|35
20120302|011500|test|A|95
20120302|013000|test|D|55
Load them like this (creating the partition directories first, so -put drops each file inside its directory rather than creating a plain file at that path):
hadoop fs -mkdir /data/myfeed/stg/load_dt=20120301
hadoop fs -mkdir /data/myfeed/stg/load_dt=20120302
hadoop fs -put 20120301_myfeed.dat /data/myfeed/stg/load_dt=20120301
hadoop fs -put 20120302_myfeed.dat /data/myfeed/stg/load_dt=20120302
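It's worth checking the files landed where Hive will expect them; the listing (trimmed here) should show one directory per day:
hadoop fs -ls /data/myfeed/stg
Found 2 items
drwxr-xr-x   - ... /data/myfeed/stg/load_dt=20120301
drwxr-xr-x   - ... /data/myfeed/stg/load_dt=20120302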
Create an external table (partitioned by load_dt) as follows:
set myfeed_stg_location=/data/myfeed/stg;
set myfeed_stg_location;  -- echoes the value back as a sanity check
set myfeed_stg=myfeed_stg_ext;
set myfeed_stg;
-- Suppose myfeed stg data had records like this
-- 20120301|001500|test|A|35
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:myfeed_stg}
( event_dt STRING
, event_tm STRING
, category STRING
, code STRING
, num_hits INT
)
COMMENT 'Table for myfeed staging data'
PARTITIONED BY( load_dt STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${hiveconf:myfeed_stg_location}';
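Note that at this point the table exists but no partitions are registered in the metastore, which is exactly why the selects below come back empty. A quick check shows an empty partition list:
hive> show partitions myfeed_stg_ext;
OK
hive>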
I found, as Chris Zheng did, that I got nothing when I selected anything from my myfeed_stg_ext table.
It turns out you need to add the partitions explicitly :( like so:
hive> alter table myfeed_stg_ext add partition (load_dt=20120301);
OK
Time taken: 0.751 seconds
hive> alter table myfeed_stg_ext add partition (load_dt=20120302);
OK
Time taken: 0.279 seconds
hive> select * from myfeed_stg_ext;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
20120302 001500 test A 35 20120302
20120302 003000 test B 85 20120302
20120302 004500 test A 25 20120302
20120302 010000 test C 35 20120302
20120302 011500 test A 95 20120302
20120302 013000 test D 55 20120302
Time taken: 0.501 seconds
hive> select * from myfeed_stg_ext where load_dt = 20120301;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
Time taken: 0.314 seconds
hive>
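As an aside, depending on your Hive version, MSCK REPAIR TABLE can scan the table's location and register any partition-style directories it finds in one go, rather than one ALTER TABLE per day (worth testing on your version before relying on it):
hive> msck repair table myfeed_stg_ext;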
Here's a simple shell script to move the data from an existing directory structure /data/myfeed/stg/YYYYMMDD to /data/myfeed/stg/load_dt=YYYYMMDD. Note it only handles days within a single month; extend it if you need to cross month/year boundaries.
#!/bin/bash
# Move each daily directory into its Hive-style partition directory,
# then register the new partition with the external table.
day=20120301
while [ $day -le 20120331 ]
do
echo "hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}"
hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}
if [ $? -ne 0 ]
then
echo "ERROR: hadoop mv failed"
exit 1
fi
sleep 1 # don't need these sleeps - used during testing
hive -e "ALTER TABLE myfeed_stg_pext ADD PARTITION (load_dt=${day}); select * from myfeed_stg_pext where load_dt = '$day' limit 10;"
sleep 2 # don't need these sleeps - used during testing
day=$(($day+1))
done
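If you do need to cross month or year boundaries, one option (assuming GNU date is available) is to let date do the day arithmetic instead of adding 1 to the integer:
#!/bin/bash
# Minimal sketch: walk every day between two dates inclusive,
# letting GNU date handle month/year rollovers.
day=20120301
end=20120405
while [ "$day" -le "$end" ]
do
    echo "would process load_dt=${day}"
    day=$(date -d "${day} + 1 day" +%Y%m%d)
done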
TBC ...
Read up on dynamic partitions ... could this be a more elegant approach? Perhaps something like the sketch below:
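(A sketch only, not something I've run against this feed: it assumes a hypothetical non-partitioned table myfeed_raw holding the same columns plus load_dt, and lets Hive route each row to the right partition.)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- myfeed_raw is hypothetical, not defined anywhere above
INSERT OVERWRITE TABLE myfeed_stg_ext PARTITION (load_dt)
SELECT event_dt, event_tm, category, code, num_hits, load_dt
FROM myfeed_raw;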
And compression - lzo, others?
(http://www.mrbalky.com/2011/02/24/hive-tables-partitions-and-lzo-compression/)