Friday 1 February 2013

Moving data from one Hadoop cluster to another using distcp (including a note at the end on distcp'ing between different Hadoop versions)

Cluster Specs
I have two HDFS clusters in the same network domain, connected by 1G networking.
The source cluster has 9 data nodes and the target cluster has 18; the nodes in both are identical (16 cores, 48G RAM, 4x2TB disks).

Task
I needed to transfer a week's worth of data from one cluster to the other, at roughly 80G per day.

Attempt 1 - Slow and Fiddly
Initially I used a two-step copy via an NFS mount, which was slow and fiddly:
  • copyToLocal - HDFS to NFS - 45 mins
  • copyFromLocal - NFS to HDFS - 1 hr 45 mins
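The two-step copy looked roughly like this (paths are hypothetical stand-ins, and the commands are guarded so the sketch is harmless on a machine without hadoop installed):

```shell
# Hypothetical paths for illustration; adjust to your own layout.
SRC_HDFS_DIR=/data/abc/stg/load_dt=20121127
NFS_STAGE_DIR=/mnt/nfs/staging/load_dt=20121127

if command -v hadoop >/dev/null 2>&1; then
  # Step 1: HDFS -> NFS (took ~45 min here), run against cluster #1
  hadoop fs -copyToLocal "$SRC_HDFS_DIR" "$NFS_STAGE_DIR"
  # Step 2: NFS -> HDFS (took ~1 hr 45 min here), run against cluster #2
  hadoop fs -copyFromLocal "$NFS_STAGE_DIR" /data/abc/stg/
fi
```

The obvious drawback: every byte is funnelled through one client machine and the NFS mount, twice, instead of moving node-to-node in parallel.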
Attempt 2 - Fast and Simple
Then I used hadoop distcp:
  • hadoop distcp - 4mins
  • Rename the path in HDFS cluster #2 (to match the naming convention there) - 1 sec
  • Create the Hive partitions in HDFS cluster #2 - 1 min
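The rename and partition-registration steps can be sketched as below. The target directory name and table name are hypothetical; substitute whatever your naming convention dictates:

```shell
LOAD_DT=20121127
WAREHOUSE=/user/hive/warehouse/abc
# Hypothetical renamed path matching cluster #2's convention
TARGET_DIR="$WAREHOUSE/dt=$LOAD_DT"

if command -v hadoop >/dev/null 2>&1; then
  # Rename the copied dir in place (a metadata-only operation, hence ~1 sec)
  hadoop fs -mv "$WAREHOUSE/load_dt=$LOAD_DT" "$TARGET_DIR"
fi

if command -v hive >/dev/null 2>&1; then
  # Register the new partition with the Hive metastore (~1 min)
  hive -e "ALTER TABLE abc ADD IF NOT EXISTS PARTITION (dt='$LOAD_DT');"
fi
```

The rename is fast because `hadoop fs -mv` within one filesystem only updates NameNode metadata; no blocks move.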

Conclusion
Hadoop distcp is pretty impressive when run across two quiet clusters: it cut a copy that took around 2.5 hours via NFS down to 4 minutes.

Misc Notes
See the official Hadoop distcp guide for more detail.

The command I used (note: it just happened that the HDFS NameNode was listening on a different port in each cluster):

nohup hadoop distcp hdfs://hdfssvr1:8020/data/abc/stg/load_dt=20121127 hdfs://hdfssvr2:54310/user/hive/warehouse/abc > /var/tmp/distcp_20121127.log 2>&1 &

(Note: the above command copies the load_dt=20121127 directory, which in my case was full of subdirectories, into /user/hive/warehouse/abc on the remote HDFS cluster.)
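For repeat or resumed runs, a couple of standard distcp options are worth knowing. A hedged sketch (same endpoints as above, guarded so it is inert without hadoop on the path):

```shell
SRC=hdfs://hdfssvr1:8020/data/abc/stg/load_dt=20121127
DST=hdfs://hdfssvr2:54310/user/hive/warehouse/abc

if command -v hadoop >/dev/null 2>&1; then
  # -m caps the number of map tasks doing the copying;
  # -update skips files that already exist at the destination with the
  #  same size/checksum, making re-runs after a partial failure cheap.
  hadoop distcp -m 100 -update "$SRC" "$DST"
fi
```

There is also -overwrite (force a full re-copy) and -i (ignore per-file failures and keep going); check `hadoop distcp` with no arguments for the full list on your version.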

Logging generated by the command:

13/02/01 18:41:07 INFO tools.DistCp: srcPaths=[hdfs://hdfssvr1:8020/data/abc/stg/load_dt=20121127]
13/02/01 18:41:07 INFO tools.DistCp: destPath=hdfs://hdfssvr2:54310/user/hive/warehouse/abc
13/02/01 18:41:09 INFO tools.DistCp: sourcePathsCount=4901
13/02/01 18:41:09 INFO tools.DistCp: filesToCopyCount=4311
13/02/01 18:41:09 INFO tools.DistCp: bytesToCopyCount=80.1g
13/02/01 18:41:10 INFO mapred.JobClient: Running job: job_201302011505_0006
13/02/01 18:41:11 INFO mapred.JobClient:  map 0% reduce 0%
13/02/01 18:41:26 INFO mapred.JobClient:  map 1% reduce 0%
13/02/01 18:41:27 INFO mapred.JobClient:  map 2% reduce 0%
13/02/01 18:41:28 INFO mapred.JobClient:  map 3% reduce 0%
13/02/01 18:41:29 INFO mapred.JobClient:  map 4% reduce 0%
13/02/01 18:41:30 INFO mapred.JobClient:  map 5% reduce 0%
13/02/01 18:44:05 INFO mapred.JobClient:  map 93% reduce 0%
13/02/01 18:44:07 INFO mapred.JobClient:  map 94% reduce 0%
13/02/01 18:44:10 INFO mapred.JobClient:  map 95% reduce 0%
13/02/01 18:44:12 INFO mapred.JobClient:  map 96% reduce 0%
13/02/01 18:44:14 INFO mapred.JobClient:  map 97% reduce 0%
13/02/01 18:44:17 INFO mapred.JobClient:  map 98% reduce 0%
13/02/01 18:44:20 INFO mapred.JobClient:  map 99% reduce 0%
13/02/01 18:44:31 INFO mapred.JobClient:  map 100% reduce 0%
13/02/01 18:44:32 INFO mapred.JobClient: Job complete: job_201302011505_0006
13/02/01 18:44:32 INFO mapred.JobClient: Counters: 20
13/02/01 18:44:32 INFO mapred.JobClient:   Job Counters 
13/02/01 18:44:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12572408
13/02/01 18:44:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/02/01 18:44:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/02/01 18:44:32 INFO mapred.JobClient:     Launched map tasks=186
13/02/01 18:44:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/02/01 18:44:32 INFO mapred.JobClient:   distcp
13/02/01 18:44:32 INFO mapred.JobClient:     Files copied=4311
13/02/01 18:44:32 INFO mapred.JobClient:     Bytes copied=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:     Bytes expected=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:   FileSystemCounters
13/02/01 18:44:32 INFO mapred.JobClient:     HDFS_BYTES_READ=85971609725
13/02/01 18:44:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=8955232
13/02/01 18:44:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=85969881779
13/02/01 18:44:32 INFO mapred.JobClient:   Map-Reduce Framework
13/02/01 18:44:32 INFO mapred.JobClient:     Map input records=4900
13/02/01 18:44:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=25844953088
13/02/01 18:44:32 INFO mapred.JobClient:     Spilled Records=0
13/02/01 18:44:32 INFO mapred.JobClient:     CPU time spent (ms)=3410430
13/02/01 18:44:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=35171532800
13/02/01 18:44:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=201811800064
13/02/01 18:44:32 INFO mapred.JobClient:     Map input bytes=1473253
13/02/01 18:44:32 INFO mapred.JobClient:     Map output records=0
13/02/01 18:44:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=25854

Update - 30th June 2013

Moving data between CDH3u3 and CDH4.3

Run this on the new (CDH4.3) cluster:

nohup hadoop distcp hftp://mycdh3u3nn:50070/data/xyz/stg/load_dt=YYYYMMDD hdfs://mycdh430nn:8020/data/xyz/stg/ &

This results in the load_dt=YYYYMMDD HDFS directory being copied to /data/xyz/stg/load_dt=YYYYMMDD in the new cluster.

Note:

  • The hftp protocol is read-only, so the job reads the data from the old cluster over hftp and writes it into the new cluster over the hdfs protocol; this is also why the job has to run on the new cluster.
  • I used a Unix user of the same name on both namenodes; this happened to be the hadoop supergroup user on the old cluster. You might need to work out the permissions to get this working in your environment.
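After a cross-version copy it's worth a quick sanity check that everything arrived. One simple approach is to compare `hadoop fs -count` output from both clusters (the same hypothetical YYYYMMDD placeholder path as above, guarded so the sketch runs harmlessly without hadoop):

```shell
DIR=/data/xyz/stg/load_dt=YYYYMMDD

if command -v hadoop >/dev/null 2>&1; then
  # -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  # Run against each cluster's URI and compare the three numbers.
  hadoop fs -count "hftp://mycdh3u3nn:50070$DIR"
  hadoop fs -count "hdfs://mycdh430nn:8020$DIR"
fi
```

If the file counts and byte totals match, the copy is almost certainly complete; distcp itself also reports Files copied / Bytes copied vs Bytes expected counters, as in the log above.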
