Monday, 23 January 2012

Hadoop notes to self

Hadoop Tools Ecosystem

Useful link re the Hadoop tools ecosystem

Hadoop standalone rebuild

See useful link

rm -rf /data/hdfs
rm -rf /data/tmpd_hdfs
hadoop namenode -format

Hadoop install on CentOS - JDK + Cloudera distribution

# copy CDH tarfiles and jdk somewhere - say /tmp/downloads

cd /opt # or wherever you decide to install hadoop
# install jdk (installer copied to /tmp/downloads above)
/tmp/downloads/jdk-6u25-linux-x64.bin

# install hadoop apps
for f in /tmp/downloads/*cdh3*gz; do
  tar -xvzf "$f"
done
# build soft links to current CDH version
for f in *cdh3u3; do g=$(echo "$f" | cut -d'-' -f 1); ln -s "$f" "$g"; done
# check permissions and chown -R hadoop:hadoop if required

# edit /etc/profile and add the necessary environment variables, e.g.

export JAVA_HOME="/opt/jdk"
export HIVE_HOME=/opt/hive
export HADOOP_HOME=/opt/hadoop
export PIG_HOME=/opt/pig
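
The exports above only define the home directories; the commands themselves are found only if each bin directory is also on the PATH. A minimal sketch, assuming every install has a bin subdirectory under its home (check your own layout):

```shell
# Fall back to the locations used above if the variables are not already set.
JAVA_HOME="${JAVA_HOME:-/opt/jdk}"
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
HIVE_HOME="${HIVE_HOME:-/opt/hive}"
PIG_HOME="${PIG_HOME:-/opt/pig}"
export JAVA_HOME HADOOP_HOME HIVE_HOME PIG_HOME

# Prepend the tool directories so they win over any system-wide copies.
export PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin:$PATH"
```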

Good Installation/config notes

Great notes re setting up a cluster

To be aware of

Small files in HDFS problem


Read this article re configuration notes as a starter
Jobtracker hanging - memory issues - read here
Also read misconfiguration article
Transparent HugePages - see Linux reference and Greg Rahn's expose of THP issue on Hadoop

architectural notes here

$ hadoop fsck / 2>&1 | grep -v '\.\.\.'
FSCK started by hadoop (auth:SIMPLE) from / for path / at Tue Jun 19 08:16:59 BST 2012
 Total size: 14476540835550 B
 Total dirs: 6780
 Total files: 1040334 (Files currently being written: 3678)
 Total blocks (validated): 1207343 (avg. block size 11990412 B)
 Minimally replicated blocks: 1207343 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0023208
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 9
 Number of racks: 1
FSCK ended at Tue Jun 19 08:17:15 BST 2012 in 15878 milliseconds

The filesystem under path '/' is HEALTHY
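
The average block size in the summary is just total size divided by total blocks, and it is a quick indicator of the small-files problem noted earlier: about 12 MB per block here is small next to the usual 64 MB default block size. A rough sketch that recomputes it from saved fsck output; the function name is made up for these notes and the field positions assume the summary layout shown above:

```shell
# Reads an fsck summary on stdin and prints the average block size in bytes.
fsck_avg_block_size() {
  awk '
    /Total size:/   { size = $3 }     # " Total size: <bytes> B"
    /Total blocks/  { blocks = $4 }   # " Total blocks (validated): <count> ..."
    END { if (blocks > 0) printf "%d\n", size / blocks }
  '
}
```

Usage would be something like: hadoop fsck / 2>&1 | fsck_avg_block_size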

$ hadoop dfsadmin -report
Configured Capacity: 65158503579648 (59.26 TB)
Present Capacity: 61841246261419 (56.24 TB)
DFS Remaining: 18049941311488 (16.42 TB)
DFS Used: 43791304949931 (39.83 TB)
DFS Used%: 70.81%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
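
Going by the figures in this report, DFS Used% is DFS Used divided by Present Capacity (not Configured Capacity). A small sketch that recomputes it from the report text, handy for a cron check; the function name is hypothetical and only the first (cluster-wide) occurrence of each line is used:

```shell
# Reads "hadoop dfsadmin -report" output on stdin and prints the
# cluster-wide used percentage to two decimal places.
dfs_used_pct() {
  awk '
    /^Present Capacity:/ && !cap  { cap = $3 }    # first match is the cluster total
    /^DFS Used:/         && !used { used = $3 }   # per-node sections come later
    END { if (cap > 0) printf "%.2f\n", 100 * used / cap }
  '
}
```

Usage would be something like: hadoop dfsadmin -report | dfs_used_pct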

Decommissioning nodes from cluster

Decommissioning a node
  • Add to the exclude node file
  • Run hadoop dfsadmin -refreshNodes on the name node
This causes all the data held on the data node listed in the exclude file to be moved to other nodes in the cluster.
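
The two steps above can be wrapped up as below. This is only a sketch: the function name and the HADOOP_CMD override are made up for these notes, and the exclude file is whatever dfs.hosts.exclude points to in your config (the path in the usage line is an assumption).

```shell
# $1 = hostname to decommission, $2 = path to the exclude file.
# HADOOP_CMD can be set to "echo" for a dry run.
decommission_node() {
  host="$1"
  exclude_file="$2"
  # Add the host only if it is not already listed (exact whole-line match).
  grep -qxF "$host" "$exclude_file" 2>/dev/null || echo "$host" >> "$exclude_file"
  # Ask the name node to re-read its include/exclude files.
  ${HADOOP_CMD:-hadoop} dfsadmin -refreshNodes
}
```

Usage would be something like: decommission_node mydecommnode /opt/hadoop/conf/excludes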

[hadoop@mynamenode conf]$ hadoop dfsadmin -report | more
Configured Capacity: 86201191694336 (78.4 TB)
Present Capacity: 81797299645664 (74.39 TB)
DFS Remaining: 32416639721472 (29.48 TB)
DFS Used: 49380659924192 (44.91 TB)
DFS Used%: 60.37%
Under replicated blocks: 315421
Blocks with corrupt replicas: 0
Missing blocks: 0

Name: mydecommnode:50010
Rack: /dc1/r16
Decommission Status : Decommission in progress
Configured Capacity: 7239833731072 (6.58 TB)
DFS Used: 6869802995712 (6.25 TB)
Non DFS Used: 369725145088 (344.33 GB)
DFS Remaining: 305590272(291.43 MB)
DFS Used%: 94.89%
DFS Remaining%: 0%
Last contact: Wed May 08 11:37:15 BST 2013
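
To watch a decommission without paging through the whole report, the status line for one node can be pulled out as below. A sketch only: the function name is made up, and it assumes the per-node layout shown above (a Name: line followed by a Decommission Status line).

```shell
# $1 = node name exactly as printed after "Name:" (e.g. mydecommnode:50010).
# Reads "hadoop dfsadmin -report" output on stdin and prints that node's status.
decommission_status() {
  awk -v node="$1" '
    $1 == "Name:" { current = $2 }            # remember which node section we are in
    /^Decommission Status/ && current == node {
      sub(/^[^:]*: */, "")                    # strip the "Decommission Status : " label
      print
      exit
    }
  '
}
```

Usage would be something like: hadoop dfsadmin -report | decommission_status mydecommnode:50010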

Removing the decommissioned datanode
  • Kill hadoop java processes still running on the node
  • Remove from slaves file on the name nodes
  • Leave in the exclude node file until a cluster reboot (changed in newer versions of Hadoop TBC?)
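
The slaves file edit from the list above can be scripted as below. A sketch under assumptions: the function name is made up, and the slaves file is taken to be one hostname per line.

```shell
# $1 = hostname to remove, $2 = path to the slaves file.
remove_from_slaves() {
  [ -f "$2" ] || return 1
  tmp="$2.tmp.$$"
  # grep -v exits 1 when no lines are left, which is not an error here.
  grep -vxF "$1" "$2" > "$tmp" || true
  # Write back in place so the file keeps its owner and permissions.
  cat "$tmp" > "$2"
  rm -f "$tmp"
}
```

Usage would be something like: remove_from_slaves mydecommnode /opt/hadoop/conf/slaves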

Fsck of HDFS shows replication violations/issues

The solution is to toggle the replication factor down by one and then back to where it should be.
See the example below.

$ ./ | head -20
fsck'ing fs [/data]
FSCK started by hadoop (auth:SIMPLE) from / for path /data at Tue May 14 08:33:37 BST 2013
/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo:  Replica placement policy is violated for blk_-1157106589514956189_3885959. Block should be additionally replicated on 1 more rack(s)
.......................Status: HEALTHY

So run the following:

$ hadoop fs -setrep 2 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

Replication 2 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

$ hadoop fs -setrep 3 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

Replication 3 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

And the replica violations disappear.

Also use this approach to solve under-replication issues.
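
With more than a handful of files, the same toggle can be driven from the fsck output. A rough sketch, not a tested tool: the function names and the HADOOP_CMD override are made up for these notes, and the path extraction assumes the "Replica placement policy is violated" message format shown above.

```shell
# Reads fsck output on stdin and prints each violating file path once.
violating_files() {
  sed -n 's/^\(.*\): *Replica placement policy is violated.*$/\1/p' | sort -u
}

# $1 = replication factor the files should end up with.
# Reads file paths on stdin; sets each one down by one, then back up.
# HADOOP_CMD can be set to "echo" for a dry run.
toggle_replication() {
  target="$1"
  while IFS= read -r path; do
    ${HADOOP_CMD:-hadoop} fs -setrep "$((target - 1))" "$path" </dev/null
    ${HADOOP_CMD:-hadoop} fs -setrep "$target" "$path" </dev/null
  done
}
```

Usage would be something like: hadoop fsck /data 2>&1 | violating_files | toggle_replication 3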
