GML blog: Hadoop notes to self

Hadoop Tools Ecosystem

Useful link re the Hadoop tools ecosystem
http://nosql.mypopescu.com/post/1541593207/quick-reference-hadoop-tools-ecosystem

Hadoop standalone rebuild

See useful link

rm -rf /data/hdfs
rm -rf /data/tmpd_hdfs
hadoop namenode -format
start-all.sh
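
Since the steps above wipe HDFS completely, a dry-run wrapper can be a useful safety net. A minimal sketch, assuming the same two directories as above; HDFS_DIRS, DRY_RUN, run and rebuild_standalone are names made up for this note, not Hadoop settings.

```shell
#!/bin/sh
# Sketch: wipe and re-format a standalone HDFS, printing the commands by default.
# HDFS_DIRS and DRY_RUN are illustrative variables, not part of Hadoop.
HDFS_DIRS="${HDFS_DIRS:-/data/hdfs /data/tmpd_hdfs}"

run() {
    # In dry-run mode (the default) only print the command; otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rebuild_standalone() {
    # Remove the HDFS data and temp directories, then format and restart.
    for d in $HDFS_DIRS; do
        run rm -rf "$d"
    done
    run hadoop namenode -format
    run start-all.sh
}
```

Call rebuild_standalone to preview the commands, then set DRY_RUN=0 and call it again to actually run them.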

Hadoop install on CentOS - JDK + Cloudera distribution

# copy CDH tarfiles and jdk somewhere - say /tmp/downloads

cd /opt # or wherever you decide to install hadoop

# install jdk (the installer needs the execute bit: chmod +x)
/tmp/downloads/jdk-6u25-linux-x64.bin

# install hadoop apps
for f in /tmp/downloads/*cdh3*gz
do
    tar -xvzf "$f"
done

# build soft links to current CDH version

for f in `ls -d *cdh3u3`; do g=`echo $f| cut -d'-' -f 1`; ln -s $f $g; done
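
A variant of the one-liner above that avoids parsing ls output and can be re-run safely. This is only a sketch: link_cdh_dirs is a made-up helper name, and it assumes the unpacked directories are named like name-version-suffix, so the text before the first '-' is the link name.

```shell
# Sketch: create version-independent symlinks next to the unpacked CDH directories.
# link_cdh_dirs is an illustrative helper, not part of CDH.
link_cdh_dirs() {
    base_dir="$1"      # directory holding the unpacked trees, e.g. /opt
    suffix="$2"        # release suffix to match, e.g. cdh3u3
    for path in "$base_dir"/*"$suffix"; do
        [ -d "$path" ] || continue          # skip an unmatched glob and plain files
        dir_name="${path##*/}"              # strip the leading directories
        link_name="${dir_name%%-*}"         # keep the text before the first '-'
        # -n replaces an existing link instead of creating a link inside its target
        ln -sfn "$dir_name" "$base_dir/$link_name"
    done
}
```

Example: link_cdh_dirs /opt cdh3u3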

# check permissions and chown -R hadoop:hadoop if reqd

# edit /etc/profile and add the necessary entries to the path, e.g.

export JAVA_HOME="/opt/jdk" # assumes /opt/jdk exists, e.g. as a link to the unpacked JDK directory
export PATH="$PATH:$JAVA_HOME/bin:/opt/hadoop/bin:/opt/hive/bin:/opt/pig/bin"
export HIVE_HOME=/opt/hive
export HADOOP_HOME=/opt/hadoop
export PIG_HOME=/opt/pig
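
After editing /etc/profile it is worth confirming that each variable points at a real directory. A small sketch of such a check; check_homes is a made-up function name and the variable list simply mirrors the exports above.

```shell
# Sketch: verify that the *_HOME variables from /etc/profile are set and valid.
# check_homes is an illustrative helper name.
check_homes() {
    rc=0
    for var in JAVA_HOME HADOOP_HOME HIVE_HOME PIG_HOME; do
        eval "value=\${$var:-}"             # indirect lookup of the variable's value
        if [ -z "$value" ]; then
            echo "$var is not set"
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$var=$value is not a directory"
            rc=1
        fi
    done
    return $rc                              # 0 only if every variable checked out
}
```

Run it in a fresh login shell: check_homes && echo "environment looks ok"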

Good Installation/config notes

Great notes re setting up a cluster

To be aware of

Small files in HDFS problem

Configuring

Read this article re configuration notes as a starter
JobTracker hanging - memory issues - read here
Also read misconfiguration article
Transparent HugePages (THP) - see Linux reference and Greg Rahn's exposé of the THP issue on Hadoop

Architectural notes here

Troubleshooting

$ hadoop fsck / 2>&1 | grep -v '\.\.\.'
FSCK started by hadoop (auth:SIMPLE) from /10.1.2.5 for path / at Tue Jun 19 08:16:59 BST 2012
Total size: 14476540835550 B
Total dirs: 6780
Total files: 1040334 (Files currently being written: 3678)
Total blocks (validated): 1207343 (avg. block size 11990412 B)
Minimally replicated blocks: 1207343 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0023208
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 9
Number of racks: 1
FSCK ended at Tue Jun 19 08:17:15 BST 2012 in 15878 milliseconds

The filesystem under path '/' is HEALTHY
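
The fsck summary is also a quick way to keep an eye on the small files problem noted earlier: total size divided by total files gives the average file size. A rough sketch that reads the fsck output on stdin; fsck_avg_file_size is a made-up name and the parsing assumes the "Total size:" and "Total files:" lines look like the ones above.

```shell
# Sketch: average bytes per file from the fsck summary read on stdin.
# fsck_avg_file_size is an illustrative helper name.
fsck_avg_file_size() {
    awk '
        /Total size:/  { size  = $3 }       # third field is the byte count
        /Total files:/ { files = $3 }       # third field is the file count
        END {
            if (files > 0) printf "%d\n", size / files
            else           print "no files reported"
        }'
}
# Usage: hadoop fsck / 2>&1 | fsck_avg_file_size
```

An average far below the configured HDFS block size suggests many small files.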

$ hadoop dfsadmin -report
Configured Capacity: 65158503579648 (59.26 TB)
Present Capacity: 61841246261419 (56.24 TB)
DFS Remaining: 18049941311488 (16.42 TB)
DFS Used: 43791304949931 (39.83 TB)
DFS Used%: 70.81%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
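
The report is easy to turn into a simple capacity check. A minimal sketch, assuming the cluster-wide "DFS Used%" line is the first one in the report (per-node lines follow it); dfs_used_alert and the threshold argument are made up for this note.

```shell
# Sketch: warn when the cluster-wide DFS Used% reaches a threshold.
# Reads "hadoop dfsadmin -report" output on stdin; dfs_used_alert is an illustrative name.
dfs_used_alert() {
    limit="$1"     # percentage threshold, e.g. 80
    awk -v limit="$limit" '
        /DFS Used%:/ {
            pct = $3
            sub(/%/, "", pct)               # drop the trailing percent sign
            if (pct + 0 >= limit + 0) {
                print "WARN: DFS used " pct "% >= " limit "%"
                exit 1
            }
            print "OK: DFS used " pct "%"
            exit 0                          # only the first (cluster-wide) line is checked
        }'
}
# Usage: hadoop dfsadmin -report | dfs_used_alert 80
```

The function returns non-zero when the threshold is reached, so it can be dropped into a cron job or monitoring check.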

Decommissioning nodes from cluster

Decommissioning a node

Add the node to the exclude node file
Run hadoop dfsadmin -refreshNodes on the name node

This makes the name node re-replicate all blocks held on the excluded data node to other nodes in the cluster.
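
The two steps above can be sketched as a small helper. Assumptions: the exclude file passed in is the one named by dfs.hosts.exclude in the name node config, and decommission_node and HADOOP_CMD are names made up for this note (HADOOP_CMD only exists so the command can be swapped out when testing).

```shell
# Sketch: add a node to the exclude file (once) and tell the name node to re-read it.
# decommission_node and HADOOP_CMD are illustrative names.
decommission_node() {
    node="$1"              # hostname as the name node knows it
    exclude_file="$2"      # file referenced by dfs.hosts.exclude
    # Append the host only if there is no exact matching line already.
    grep -qxF "$node" "$exclude_file" 2>/dev/null || echo "$node" >> "$exclude_file"
    # Ask the name node to re-read its include/exclude lists.
    ${HADOOP_CMD:-hadoop} dfsadmin -refreshNodes
}
# Usage (on the name node; the path is whatever dfs.hosts.exclude points at):
#   decommission_node mydecommnode /path/to/exclude-file
```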

[hadoop@mynamenode conf]$ hadoop dfsadmin -report | more
Configured Capacity: 86201191694336 (78.4 TB)
Present Capacity: 81797299645664 (74.39 TB)
DFS Remaining: 32416639721472 (29.48 TB)
DFS Used: 49380659924192 (44.91 TB)
DFS Used%: 60.37%
Under replicated blocks: 315421
Blocks with corrupt replicas: 0
Missing blocks: 0

Name: mydecommnode:50010
Rack: /dc1/r16
Decommission Status : Decommission in progress
Configured Capacity: 7239833731072 (6.58 TB)
DFS Used: 6869802995712 (6.25 TB)
Non DFS Used: 369725145088 (344.33 GB)
DFS Remaining: 305590272(291.43 MB)
DFS Used%: 94.89%
DFS Remaining%: 0%
Last contact: Wed May 08 11:37:15 BST 2013
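
Rather than paging through the whole report, the status of one node can be pulled out with awk. A sketch only: decommission_status is a made-up name, and it assumes the "Name:" and "Decommission Status" lines are laid out as in the report above.

```shell
# Sketch: print the decommission status of one node from a dfsadmin report on stdin.
# decommission_status is an illustrative helper name.
decommission_status() {
    node="$1"      # hostname without the port
    awk -v node="$node" '
        /^Name:/ {
            current = $2
            sub(/:.*/, "", current)         # strip the :port suffix
        }
        /^Decommission Status/ && current == node {
            sub(/^[^:]*: */, "")            # keep only the text after the colon
            print
            exit
        }'
}
# Usage: hadoop dfsadmin -report | decommission_status mydecommnode
```

Polling this every minute or so until it stops printing "Decommission in progress" is one way to know when the node has been drained.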

Removing the decommissioned datanode

Kill any hadoop java processes still running on the node
Remove the node from the slaves file on the name nodes
Leave the node in the exclude node file until a cluster restart (this may have changed in newer versions of Hadoop - TBC)
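
The slaves file edit can be scripted so it is repeatable and leaves a backup. A minimal sketch, assuming the slaves file holds one hostname per line; remove_from_slaves is a made-up helper name.

```shell
# Sketch: remove one hostname from a slaves file, keeping a .bak copy.
# remove_from_slaves is an illustrative helper name.
remove_from_slaves() {
    node="$1"
    slaves_file="$2"
    tmp_file="$(mktemp)"
    # Keep every line except an exact match for the node name.
    grep -vxF "$node" "$slaves_file" > "$tmp_file" || true
    cp "$slaves_file" "$slaves_file.bak"    # backup of the original
    cat "$tmp_file" > "$slaves_file"        # overwrite in place, keeping file permissions
    rm -f "$tmp_file"
}
# Usage: remove_from_slaves mydecommnode /opt/hadoop/conf/slaves
```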

Fsck of hdfs shows replication violations/issues

The solution is to lower the replication factor by one and then set it back to its intended value.

See the example below.

$ ./fsck_all.sh | head -20
fsck'ing fs [/data]
FSCK started by hadoop (auth:SIMPLE) from / for path /data at Tue May 14 08:33:37 BST 2013
/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo: Replica placement policy is violated for blk_-1157106589514956189_3885959. Block should be additionally replicated on 1 more rack(s)
.......................Status: HEALTHY

So run the following:

$ hadoop fs -setrep 2 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
Replication 2 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

$ hadoop fs -setrep 3 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
Replication 3 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo

And the replica violations disappear.

The same approach also resolves under-replication issues.
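
With many affected files, the setrep commands can be generated from the fsck output instead of typed by hand. A sketch under a few assumptions: violation lines look like the one above (path, colon, "Replica placement policy is violated"), paths contain no single quotes, and violations_to_setrep is a made-up name. It only prints the commands so they can be reviewed before running.

```shell
# Sketch: turn fsck placement-violation lines (on stdin) into setrep down/up commands.
# violations_to_setrep is an illustrative helper name; it prints commands, it does not run them.
violations_to_setrep() {
    target="$1"                 # intended replication factor, e.g. 3
    lower=$((target - 1))       # temporary lower factor
    # Extract the path in front of the violation message, one entry per file.
    sed -n 's/^\(\/.*\): Replica placement policy is violated.*/\1/p' |
    sort -u |
    while IFS= read -r path; do
        echo "hadoop fs -setrep $lower '$path'"
        echo "hadoop fs -setrep $target '$path'"
    done
}
# Usage: hadoop fsck /data 2>&1 | violations_to_setrep 3 > fix_replication.sh
```

Review the generated file, then run it with sh fix_replication.sh.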

Oozie

Good overview of Oozie
http://blog.cloudera.com/blog/2014/03/inside-apache-oozie-ha/
http://www.thecloudavenue.com/2013/10/installation-and-configuration-of.html

Monday, 23 January 2012
