Wednesday 20 February 2013

Balancing an HDFS cluster (including Java LeaseChecker OutOfMemoryError - still unresolved)

HDFS Balancer

Read the following articles for starters:

Yahoo tutorial module on Hadoop rebalancing 
Rebalancer Design PDF

The Architecture of Open Source Applications, HDFS chapter - see the rebalancing paragraph, but take care: it describes the threshold as a value between 0 and 1, whereas start-balancer.sh takes a percentage.

Log on as the hadoop user (the user that runs our cluster is called hadoop).
Change to ${HADOOP_HOME}/bin, where the hadoop scripts reside.
Then run start-balancer.sh.
The default balancing threshold is 10%, so choose something a little lower.
I chose 5%.
I should have started closer to 10%, such as 9% or 8%.
Why? Because start-balancer.sh TAKES FOREVER!
Use hadoop dfsadmin -report to check how the space has been redistributed.
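
The report is long on a cluster with many datanodes, so a small filter helps when checking it repeatedly. The following is only a sketch: the function name dfs_used_spread is made up for this post, and it assumes each datanode section of the report starts with a "Name:" line and contains a "DFS Used%: NN.NN%" line. Check that against the output of your own Hadoop version before relying on it.

```shell
#!/usr/bin/env bash
# Summarise the per-datanode "DFS Used%" values from `hadoop dfsadmin -report`.
# Assumption: every datanode section begins with "Name:" and has a
# "DFS Used%: NN.NN%" line. The cluster-wide "DFS Used%" line that appears
# before the first "Name:" is deliberately skipped.
dfs_used_spread() {
  awk '
    /^Name:/ { in_node = 1 }
    in_node && /^DFS Used%:/ {
      v = $3; sub(/%/, "", v); v += 0      # "80.5%" -> 80.5
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      n++
    }
    END {
      if (n > 0)
        printf "nodes=%d min=%.2f max=%.2f spread=%.2f\n", n, min, max, max - min
    }'
}
```

Usage: hadoop dfsadmin -report | dfs_used_spread. The spread is only a rough indicator. The balancer itself compares each datanode's utilisation with the cluster average, not the fullest node with the emptiest one.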

[hadoop@mynode hadoop]$ cd $HADOOP_HOME/bin


[hadoop@mynode bin]$ ./start-balancer.sh -threshold 5
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Feb 19, 2013 6:44:27 PM           0                 0 KB           516.65 GB              20 GB
[hadoop@mynode bin]$ hadoop dfsadmin -report


[hadoop@mynode bin]$ cat /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Feb 19, 2013 6:44:27 PM           0                 0 KB           516.65 GB              20 GB
Feb 19, 2013 7:05:57 PM           1              2.39 GB           514.07 GB              20 GB
Feb 19, 2013 7:28:28 PM           2              4.89 GB           511.59 GB              20 GB
Feb 19, 2013 7:50:29 PM           3              7.32 GB            509.2 GB              20 GB
Feb 19, 2013 8:12:29 PM           4              9.74 GB           506.67 GB              20 GB
Feb 19, 2013 8:34:30 PM           5             12.18 GB           504.51 GB              20 GB
Feb 19, 2013 8:56:30 PM           6             14.66 GB           502.14 GB              20 GB
Exception in thread "LeaseChecker" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:78)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:754)
at org.apache.hadoop.ipc.Client.call(Client.java:1080)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy1.renewLease(Unknown Source)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.renew(DFSClient.java:1282)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1294)
at java.lang.Thread.run(Thread.java:662)


[hadoop@mynode bin]$ ./stop-balancer.sh 
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: Resource temporarily unavailable
[hadoop@mynode bin]$ w
 21:19:18 up 231 days, 11:44,  2 users,  load average: 0.03, 0.01, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT

[hadoop@mynode bin]$ hadoop job -list
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: Resource temporarily unavailable
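
Taken together, "unable to create new native thread" and "fork: Resource temporarily unavailable" usually point at a per-user process/thread limit rather than at the Java heap. That is not confirmed for this cluster (the problem is still unresolved), so treat the following as a diagnostic sketch only. It assumes Linux with the procps version of ps, and the function names are made up for this post. It has to run from a session that can still fork, for example as another user, because ps and wc are external commands.

```shell
#!/usr/bin/env bash
# Report how close a user is to the per-user process/thread limit.
# Assumptions: Linux with procps `ps` (-L prints one row per thread).
# `ulimit -u` reports the limit of the *current* shell, so the number is
# only exact when this runs as the user being inspected.
format_usage() {
  local used=$1 limit=$2
  if [ "$limit" = "unlimited" ]; then
    echo "threads=$used limit=unlimited"
  else
    echo "threads=$used limit=$limit used_pct=$(( used * 100 / limit ))"
  fi
}

thread_usage() {
  local user=$1 limit used
  limit=$(ulimit -u)
  # Count every thread owned by the user; "pid=" suppresses the header row.
  used=$(( $(ps -L -u "$user" -o pid= | wc -l) ))
  format_usage "$used" "$limit"
}
```

Usage: thread_usage hadoop. A used_pct close to 100 would support the limit theory. A low value would suggest looking elsewhere.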


[hadoop@mynode bin]$ cd ../pids
[hadoop@mynode pids]$ ls -atlr
total 20
drwxr-xr-x 17 hadoop hadoop 4096 Mar  8  2012 ..
-rw-rw-r--  1 hadoop hadoop    5 Feb 13 12:20 hadoop-hadoop-namenode.pid
-rw-rw-r--  1 hadoop hadoop    5 Feb 13 12:21 hadoop-hadoop-jobtracker.pid
-rw-rw-r--  1 hadoop hadoop    5 Feb 19 18:44 hadoop-hadoop-balancer.pid
drwxr-xr-x  2 hadoop hadoop 4096 Feb 19 18:44 .


[hadoop@mynode bin]$ kill -0 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ kill 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ ps -ef | grep 2329 | grep -v grep
[hadoop@mynode bin]$ 
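
The transcript above does not show where 2329 came from; presumably it was read from hadoop-hadoop-balancer.pid. When the shell cannot fork, external commands such as cat and ps fail, but bash builtins still work. The sketch below uses only builtins (read, kill, echo). The function name stop_from_pidfile is made up for this post, and nothing in it is specific to Hadoop.

```shell
#!/usr/bin/env bash
# Stop a daemon via its pid file using only bash builtins, so it keeps
# working when fork fails with "Resource temporarily unavailable".
stop_from_pidfile() {
  local pidfile=$1 pid
  # `read` returns non-zero if the file lacks a trailing newline, but it
  # still fills in $pid, hence the extra -n check.
  read -r pid < "$pidfile" || [ -n "$pid" ] || return 1
  if kill -0 "$pid" 2>/dev/null; then   # signal 0: existence check only
    kill "$pid"                         # default signal is TERM
    echo "sent TERM to $pid"
  else
    echo "no process $pid"
  fi
}
```

Usage on this cluster would be stop_from_pidfile "$HADOOP_HOME/pids/hadoop-hadoop-balancer.pid", given that the pids directory sits next to bin.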


Some time later ... restarted start-balancer.sh using a 9% and then an 8% threshold ...


[hadoop@mynode bin]$ ./start-balancer.sh -threshold 9
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
[hadoop@mynode bin]$ tail -10f /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
Balancing took 629.0 milliseconds

[hadoop@mynode bin]$ ./start-balancer.sh -threshold 8
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Mar 15, 2013 6:21:37 PM           0                 0 KB            63.46 GB              10 GB
Mar 15, 2013 6:42:37 PM           1              1.22 GB            62.13 GB              10 GB
...
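
To get a feel for how long a run will take, the figures in the balancer log can be turned into a rough estimate. This is a back-of-the-envelope sketch that assumes the average rate so far simply continues; the function name balancer_eta_hours is made up for this post.

```shell
#!/usr/bin/env bash
# Rough ETA for a balancer run, assuming the average rate so far continues.
# Args: GB already moved, GB left to move, minutes elapsed since iteration 0.
balancer_eta_hours() {
  awk -v moved="$1" -v left="$2" -v mins="$3" 'BEGIN {
    if (moved <= 0) { print "unknown"; exit }   # no progress yet, no estimate
    printf "%.1f\n", left / moved * mins / 60
  }'
}
```

For the 8% run above (1.22 GB moved, 62.13 GB left, 21 minutes elapsed), balancer_eta_hours 1.22 62.13 21 gives about 17.8 hours. For the earlier 5% run at iteration 6 (14.66 GB moved, 502.14 GB left, roughly 132 minutes elapsed) it gives about 75.4 hours, which is why starting closer to the default threshold is the better first step.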








