GML blog: Balancing an HDFS cluster (including Java LeaseChecker OutOfMemoryError)

HDFS Balancer

Read the following articles for starters:

Yahoo tutorial module on Hadoop rebalancing
Rebalancer Design PDF

The Architecture of Open Source Applications, HDFS chapter - see the rebalancing paragraph, but take care: it talks about the threshold being a value between 0 and 1, whereas start-balancer.sh takes a percentage.
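To avoid mixing up the two conventions, here is a tiny sketch that converts a fractional threshold (as in the book chapter) into the percentage that -threshold expects. The helper name fraction_to_percent is my own and not part of Hadoop.

```shell
# Convert a fractional threshold (0..1) to the percentage form used by
# start-balancer.sh -threshold. Hypothetical helper, not a Hadoop tool.
fraction_to_percent() {
  awk -v f="$1" 'BEGIN { printf "%g\n", f * 100 }'
}

# A fraction of 0.05 corresponds to "-threshold 5".
fraction_to_percent 0.05
```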

Log on as the hadoop user (the user that runs our cluster is called hadoop).

Change to ${HADOOP_HOME}/bin, where the Hadoop scripts reside.

Then run start-balancer.sh.

The default balancing threshold is 10%, so choose something a little lower.

I chose 5%.
I should have started closer to 10%, at 9% or 8%.
Why? Because start-balancer.sh TAKES FOREVER!
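To put a number on "forever", here is a rough sketch that extrapolates the remaining time from a single line of the balancer log. It assumes the move rate stays constant, which is only an approximation, and the function name balancer_eta_hours is my own.

```shell
# Estimate the hours remaining from one balancer log line.
# Args: GB moved so far, GB left to move, seconds elapsed since iteration 0.
# Assumes a constant move rate (an approximation).
balancer_eta_hours() {
  awk -v moved="$1" -v left="$2" -v secs="$3" 'BEGIN {
    if (moved <= 0 || secs <= 0) { print "n/a"; exit }
    rate = moved / secs                 # GB per second observed so far
    printf "%.1f\n", left / rate / 3600 # remaining hours at that rate
  }'
}

# Iteration 1 in the log further down: 2.39 GB moved, 514.07 GB left,
# 21 min 30 s (1290 s) after iteration 0.
balancer_eta_hours 2.39 514.07 1290
```

With the figures from my 5% run this comes out at roughly 77 hours, a little over three days. As far as I know, the per-datanode balancing bandwidth in this Hadoop generation is capped by dfs.balance.bandwidthPerSec (1 MB/s by default), which would explain the slow pace.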

Use hadoop dfsadmin -report to check how the space has been redistributed.
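The report is long, so a small filter helps. The sketch below reads the report on stdin and prints the lowest and highest per-datanode "DFS Used%" values. It assumes that every datanode section starts with a "Name:" line and contains a "DFS Used%: NN.NN%" line; check this against the output of your own version. The function name dfs_used_spread is my own.

```shell
# Summarise per-datanode "DFS Used%" from a dfsadmin report read on stdin.
# Assumption: each datanode section starts with "Name:" and has a
# "DFS Used%: NN.NN%" line; the cluster-wide summary comes before the first "Name:".
dfs_used_spread() {
  awk '
    /^Name:/ { in_node = 1 }            # ignore the cluster-wide summary lines
    in_node && /^DFS Used%:/ {
      v = $3; sub(/%/, "", v); v += 0   # "40.5%" becomes the number 40.5
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      n++
    }
    END {
      if (n == 0) { print "no datanode sections found"; exit 1 }
      printf "nodes=%d min=%.2f max=%.2f spread=%.2f\n", n, min, max, max - min
    }'
}

# Usage (on the cluster):
#   hadoop dfsadmin -report | dfs_used_spread
```

Note that the balancer compares each datanode against the cluster-wide utilisation rather than against the other nodes, so the spread is only a quick indicator, not the balancer's own criterion.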

[hadoop@mynode hadoop]$ cd $HADOOP_HOME/bin

[hadoop@mynode bin]$ ./start-balancer.sh -threshold 5
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Feb 19, 2013 6:44:27 PM 0 0 KB 516.65 GB 20 GB
[hadoop@mynode bin]$ hadoop dfsadmin -report

[hadoop@mynode bin]$ cat /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Feb 19, 2013 6:44:27 PM 0 0 KB 516.65 GB 20 GB
Feb 19, 2013 7:05:57 PM 1 2.39 GB 514.07 GB 20 GB
Feb 19, 2013 7:28:28 PM 2 4.89 GB 511.59 GB 20 GB
Feb 19, 2013 7:50:29 PM 3 7.32 GB 509.2 GB 20 GB
Feb 19, 2013 8:12:29 PM 4 9.74 GB 506.67 GB 20 GB
Feb 19, 2013 8:34:30 PM 5 12.18 GB 504.51 GB 20 GB
Feb 19, 2013 8:56:30 PM 6 14.66 GB 502.14 GB 20 GB
Exception in thread "LeaseChecker" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:78)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:754)
at org.apache.hadoop.ipc.Client.call(Client.java:1080)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy1.renewLease(Unknown Source)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.renew(DFSClient.java:1282)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1294)
at java.lang.Thread.run(Thread.java:662)

[hadoop@mynode bin]$ ./stop-balancer.sh
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: Resource temporarily unavailable
[hadoop@mynode bin]$ w
21:19:18 up 231 days, 11:44, 2 users, load average: 0.03, 0.01, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT

[hadoop@mynode bin]$ hadoop job -list
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: Resource temporarily unavailable
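Both "unable to create new native thread" and "fork: retry: Resource temporarily unavailable" are typical symptoms of a user hitting its process/thread limit (ulimit -u). I have not confirmed that this was the cause here, so treat the following as a diagnostic sketch only. It assumes a Linux box with the procps ps; the function names are my own.

```shell
# Number of threads currently owned by a user (Linux procps "ps" assumed).
user_thread_count() {
  ps -L -u "$1" -o lwp= 2>/dev/null | wc -l | tr -d ' '
}

# Percentage of a limit in use; prints "n/a" when the limit is not a
# positive number (for example "unlimited").
limit_usage_pct() {
  case "$2" in
    ''|*[!0-9]*|0) echo "n/a" ;;
    *) awk -v u="$1" -v l="$2" 'BEGIN { printf "%d\n", u * 100 / l }' ;;
  esac
}

# Usage, run as the hadoop user while it can still fork:
#   limit_usage_pct "$(user_thread_count hadoop)" "$(ulimit -u)"
```

Once the hadoop user can no longer fork, the count has to be taken from another account, and the limit to compare against is still the one that applies to the hadoop user.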

[hadoop@mynode bin]$ cd ../pids
[hadoop@mynode pids]$ ls -atlr
total 20
drwxr-xr-x 17 hadoop hadoop 4096 Mar 8 2012 ..
-rw-rw-r-- 1 hadoop hadoop 5 Feb 13 12:20 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 5 Feb 13 12:21 hadoop-hadoop-jobtracker.pid
-rw-rw-r-- 1 hadoop hadoop 5 Feb 19 18:44 hadoop-hadoop-balancer.pid
drwxr-xr-x 2 hadoop hadoop 4096 Feb 19 18:44 .

[hadoop@mynode bin]$ kill -0 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ kill 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ ps -ef | grep 2329 | grep -v grep
[hadoop@mynode bin]$
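The kill -0 check above sends no signal; it only tests whether the process exists and can be signalled. A minimal sketch that wraps the same check around a pid file follows. The function name pid_alive is my own.

```shell
# Return 0 if the PID stored in a pid file refers to a process we can signal,
# 1 if it does not, and 2 if the pid file is unreadable or malformed.
pid_alive() {
  [ -r "$1" ] || return 2
  _pid=$(cat "$1")
  case "$_pid" in
    ''|*[!0-9]*) return 2 ;;
  esac
  kill -0 "$_pid" 2>/dev/null    # signal 0: existence check only
}

# Usage (path taken from the listing above):
#   pid_alive ../pids/hadoop-hadoop-balancer.pid && echo "balancer still running"
```

kill -0 also fails when the process belongs to another user, so run the check as the user that owns the daemon.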

Some time later, I restarted start-balancer.sh, first with a 9% and then with an 8% threshold:

[hadoop@mynode bin]$ ./start-balancer.sh -threshold 9
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
[hadoop@mynode bin]$ tail -10f /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
The cluster is balanced. Exiting...
Balancing took 629.0 milliseconds

[hadoop@mynode bin]$ ./start-balancer.sh -threshold 8
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out

Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Mar 15, 2013 6:21:37 PM 0 0 KB 63.46 GB 10 GB
Mar 15, 2013 6:42:37 PM 1 1.22 GB 62.13 GB 10 GB
...
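Stepping the threshold down one point at a time can be scripted. The sketch below runs the balancer in the foreground once per threshold, using the hadoop balancer command instead of start-balancer.sh so that each run finishes before the next starts. BALANCER_CMD is a hypothetical override of my own that allows a dry run with echo. The loop deliberately ignores the exit status, because its meaning differs between Hadoop versions.

```shell
# Print thresholds from START down to END, one percentage point at a time.
threshold_steps() {
  awk -v s="$1" -v e="$2" 'BEGIN { for (t = s; t >= e; t--) print t }'
}

# Run the balancer once per threshold, highest first.
# BALANCER_CMD is a hypothetical override used for dry runs.
run_stepped_balancer() {
  for t in $(threshold_steps "$1" "$2"); do
    ${BALANCER_CMD:-hadoop balancer} -threshold "$t"
  done
}

# Dry run, prints the arguments instead of starting the balancer:
#   BALANCER_CMD=echo run_stepped_balancer 9 5
```

A run whose threshold is already satisfied exits almost immediately, as the 9% run above did, so the early steps cost next to nothing.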

Wednesday, 20 February 2013

Balancing an HDFS cluster (including Java LeaseChecker OutOfMemoryError - still unresolved)
