Thursday 22 May 2014

Playing with the HortonWorks Sandbox

Getting started

I am using a Windows 8 laptop - 8GB RAM, quad core i5-4200U, 1TB hard disk.

HortonWorks provides the Sandbox as a self-contained virtual machine.
  • You need to download a virtualisation environment - HortonWorks recommends Oracle VirtualBox, but VMware and Hyper-V options are also available.
  • Download (about 100MB) and install Oracle VirtualBox - take the latest version - in my case "VirtualBox 4.3.10 for Windows hosts" from the Oracle VirtualBox Downloads page
  • Download (note 2.5GB) and import the HortonWorks sandbox image using the instructions provided on the HortonWorks Sandbox Installation page. 
  • Start the Virtual Machine (VM). If your Windows machine is not set up to allow virtualisation services, as mine wasn't (I got a VERR_VMX_MSR_VMXON_DISABLED error in the VM), you'll need to reboot your machine, hit F10 as it starts up and change the BIOS setting to enable Virtualisation Technology (VT-x).
  • Your HortonWorks Sandbox should now be running.
  • Access the HortonWorks Sandbox: 
    • welcome page using http://127.0.0.1:8888/ from a browser on your machine
    • Hue web interface using http://127.0.0.1:8000/ from a browser on your machine
    • from an SSH client like PuTTY, by ssh'ing to root@127.0.0.1 on port 2222 (password: hadoop) - you will need to enable root login via sshd_config first (see below, and the example after this list)
    • logging directly onto the VM (Alt-F5 on VirtualBox gets you to the login prompt). Note - root password is hadoop.
    • namenode web interface using http://127.0.0.1:50070/ from a browser on your machine
    • oozie server tomcat interface using http://127.0.0.1:11000/oozie from a browser on your machine
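For example, once root login is enabled (see the sshd_config steps further down), the SSH route from a MinGW or Cygwin shell looks like this; PuTTY users can set the same host and port in the GUI:

$ ssh -p 2222 root@127.0.0.1    # password: hadoop
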
Using Hue

To get started, explore the Hue interface. 

  • HCat
    • Lets you look at the tables registered in HCatalog - take a look at what's available
    • You should find the sample_07 and sample_08 tables
    • Click on the Browse the Data option
  • Beeswax (Hive UI)
    • Start by listening to the following Hortonworks Hive presentation on how to process data using Hive and how Hive compares to Pig 
    • To run a simple query that shows the first 5 rows of the sample_07 table, type the following in the Beeswax editor (the same can be done from the shell - see the sketch after this list)
      • select * from sample_07 limit 5;
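Both of these steps can also be exercised from the sandbox command line using the hive CLI. A minimal sketch - the column names in the comments are what the sample tables usually ship with, so check with describe first:

$ hive -e "show tables;"                        # the tables registered in HCatalog
$ hive -e "describe sample_07;"                 # expect code, description, total_emp, salary
$ hive -e "select * from sample_07 limit 5;"    # same query as in Beeswax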


Developing a small Java program against the Sandbox
  • Install Eclipse (I used Kepler, 64-bit)
  • Install Java SE 7, 64-bit
  • Read the following article by Vivek Ganesan on creating a Java MapReduce job that runs against the sandbox (it was referenced on the HortonWorks forums) - so I decided to try this for starters
  • I took the necessary Java jars from the HortonWorks sandbox, but this wasn't enough - note my laptop had no Hadoop "stuff" on it
  • Needed to allow root login via sshd (see the sketch after this list):
    • edit /etc/ssh/sshd_config and uncomment the PermitRootLogin yes line
    • restart sshd: /etc/init.d/sshd restart
    • note that although "ifconfig -a" showed a 10.0.x.x address (10.0.2.15) on my sandbox, you need to use the 127.0.0.1 address and connect on port 2222 for ssh connectivity, since VirtualBox NAT forwards that host port to the VM - see this post
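Putting those sshd steps together on the VM console (Alt-F5, log in as root) - a minimal sketch, assuming the PermitRootLogin line is present but commented out, as it was for me:

sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config   # uncomment the setting
/etc/init.d/sshd restart                                                     # pick up the change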
  • So now you should be able to copy all the required jars from the sandbox to your PC/laptop using sftp or scp (via PuTTY, MinGW or Cygwin - in my case MinGW)

me@PC /c/Tech/hadoop/client
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-common-* .
root@127.0.0.1's password:
hadoop-common-2.4.0.2.1.1.0-385.jar           100% 2876KB   1.4MB/s   00:02

me@PC /c/Tech/hadoop/client

$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-mapreduce-client-core-* .
root@127.0.0.1's password:
hadoop-mapreduce-client-core-2.4.0.2.1.1.0-38 100% 1458KB   1.4MB/s   00:01

me@PC /c/Tech/hadoop/client

$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-mapreduce-client-jobclient-* .
root@127.0.0.1's password:
hadoop-mapreduce-client-jobclient-2.4.0.2.1.1 100%   35KB  34.9KB/s   00:00
  • Since that article was written, the HortonWorks sandbox has been upgraded and currently uses Java 7, so you don't need to do the steps under "Setting Java Version - Compiler Options".
  • But when I came to the "Creating Java Archive (JAR) file" step, I was getting errors like:
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:314)
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:327)
at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:409)
at com.vivekGanesan.VoteCountApplication.main(VoteCountApplication.java:20)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Preconditions
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 4 more
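The missing class, com.google.common.base.Preconditions, comes from Google's Guava library, so hadoop-common alone isn't enough - its dependencies have to be on the classpath too. A targeted fix would be to pull over the Guava jar as well (the exact file name and version in the client directory may differ):

$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/guava-*.jar .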
  • Rather than chase dependencies one by one, I decided to follow the lead of Jesse Anderson's video (5'30'' into the video - watch 2 mins), which I had watched earlier, and copied all the jars in the /usr/lib/hadoop/client directory of the HortonWorks Sandbox. Note that in the video these were in a client.2.0 subdirectory, but they are now in the client subdirectory. Also, it looks like we don't need to take the 3 jars in /usr/lib/hadoop or the commons-httpclient jar in /usr/lib/hadoop/lib, as these all appear to be correctly positioned in the client subdirectory or linked from there.
me@PC /c/Tech/hadoop/client
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/*jar .
  • And as before, right-click on the UtopiaVoteCount package -> Properties -> Java Build Path -> Libraries -> Add External JARs..., then browse to where you copied all the jars locally from the HortonWorks Sandbox and add them all.
  • Then I retried the "Creating Java Archive (JAR) file" step and got the following:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
usage: [input] [output]

.... that's almost what we were meant to get ;)
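The log4j warnings are harmless here - they just mean no log4j configuration was found on the classpath. If you want to silence them, a minimal sketch is to drop a log4j.properties next to the classes (the file name and lookup location are log4j 1.2 defaults):

$ cat > log4j.properties <<'EOF'
# send everything at INFO and above to the console
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
EOF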
  • The next step is to set up a shared folder, which you can do in the VirtualBox settings

  • Make sure to set the automount option on and reboot the HortonWorks Sandbox VM (shutdown -r now from the command line). You could also mount the new share manually (see the sketch below). Check the share is visible with df -k and that you can write to it:
cd /media/<shared folder>
touch a; rm a
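If you'd rather mount the share by hand instead of rebooting, a minimal sketch - "hadoop_share" is a hypothetical share name, so use whatever you entered in the VirtualBox settings:

mkdir -p /mnt/hadoop_share
mount -t vboxsf hadoop_share /mnt/hadoop_share   # vboxsf is VirtualBox's shared-folder filesystem
df -k | grep hadoop_share                        # confirm the mount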
  • The rest of the tutorial works well. Note that when you upload the booth-1.txt and booth-2.txt files via the FileBrowser, they are placed in the /user/hue/VoteCountInput directory on HDFS. Check that on the command line via: hdfs dfs -ls /user/hue/VoteCountInput
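For completeness, a hedged sketch of running the finished job from the sandbox shell. The jar name and output directory are my own choices rather than the tutorial's, the main class name is taken from the stack trace above, and you need to run it as a user with access to /user/hue:

hadoop jar VoteCount.jar com.vivekGanesan.VoteCountApplication /user/hue/VoteCountInput /user/hue/VoteCountOutput
hdfs dfs -cat /user/hue/VoteCountOutput/part-r-00000   # the reducer output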

Installing Spark on the Sandbox

Spark is an interesting, relatively new distributed processing engine that claims speed improvements of orders of magnitude over Hadoop MapReduce. Here's an attempt at getting started with Spark using the HortonWorks Sandbox.