Listen to an informative webinar by Ralph Kimball and Cloudera here
Thursday, 29 May 2014
Thursday, 22 May 2014
Playing with the HortonWorks Sandbox
Getting started
I am using a Windows 8 laptop - 8GB RAM, quad core i5-4200U, 1TB hard disk.
HortonWorks provides a HortonWorks Sandbox as a self-contained virtual machine.
- Need to download a virtualisation environment - HortonWorks recommend Oracle VirtualBox but there are VMWare and Hyper-V options available.
- Download (100MB) and install Oracle VirtualBox - take the latest - in my case "VirtualBox 4.3.10 for Windows hosts" on the Oracle VirtualBox Downloads page
- Download (note 2.5GB) and import the HortonWorks sandbox image using the instructions provided on the HortonWorks Sandbox Installation page.
- Start the Virtual Machine (VM). If your Windows machine is not set up to allow virtualisation services (mine wasn't - I got a VERR_VMX_MSR_VMXON_DISABLED error in the VM), you'll need to reboot your machine, hit F10 as it is starting up and change the BIOS setting to enable Virtualisation Technology (Intel VT-x).
- Your HortonWorks Sandbox should now be running.
- Access the HortonWorks Sandbox:
- welcome page using http://127.0.0.1:8888/ from a browser on your machine
- Hue web interface using http://127.0.0.1:8000/ from a browser on your machine
- from an SSH client like Putty by ssh'ing to root@127.0.0.1 on port 2222 (password is hadoop) - will need to enable root login via sshd_config (see below)
- logging directly onto the VM (Alt-F5 on VirtualBox gets you to the login prompt). Note - root password is hadoop.
- namenode web interface using http://127.0.0.1:50070/ from a browser on your machine
- oozie server tomcat interface using http://127.0.0.1:11000/oozie from a browser on your machine
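As a quick sanity check once the VM is up, the SSH route above can be exercised from a MinGW/Cygwin shell as follows (a minimal sketch - port, user and password are as listed above; root login needs to be enabled as described later in this post):
$ ssh -p 2222 root@127.0.0.1      # password is hadoop
# hdfs dfs -ls /                  # once logged in, confirm HDFS answers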
Using Hue
To get started, explore the Hue interface.
- HCat
- Lets you look at the tables registered in HCatalog - take a look at what is available
- You should find tables sample_07 and sample_08
- Click on the Browse the Data option
- Beeswax (Hive UI)
- Start by listening to the following Hortonworks Hive presentation on how to process data using Hive and how Hive compares to Pig
- To run a simple query on the sample_07 table to show the first 5 rows, type the following in the Beeswax editor:
- select * from sample_07 limit 5;
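If you prefer a shell to Beeswax, the same query can be run from the sandbox command line with the hive CLI (a minimal sketch using the sample_07 table mentioned above):
$ hive -e "select * from sample_07 limit 5;"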
Developing a small java program against the Sandbox
- Install Eclipse (I used Kepler, 64-bit)
- Install Java 7 SE 64-bit
- Read the following article by Vivek Ganesan on creating a java MR job to run against the sandbox (it was referenced on the HortonWorks forums) - I decided to try this for starters
- I took the necessary java jars from the HortonWorks sandbox (but this wasn't enough - note my laptop had no hadoop "stuff" on it)
- Needed to allow root login via sshd
- edit /etc/ssh/sshd_config - uncomment the PermitRootLogin yes line
- restart sshd: /etc/init.d/sshd restart
- note that although "ifconfig -a" showed a 10.0.x.x address (10.0.2.15) on my sandbox, you need to use the 127.0.0.1 address and connect on port 2222 for ssh connectivity - see this post
- So you should now be able to copy all the jars required from the sandbox to your PC/laptop using sftp or scp (via putty, MinGW or Cygwin - in my case MinGW)
me@PC /c/Tech/hadoop/client
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-common-* .
root@127.0.0.1's password:
hadoop-common-2.4.0.2.1.1.0-385.jar 100% 2876KB 1.4MB/s 00:02
me@PC /c/Tech/hadoop/client
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-mapreduce-client-core-* .
root@127.0.0.1's password:
hadoop-mapreduce-client-core-2.4.0.2.1.1.0-38 100% 1458KB 1.4MB/s 00:01
me@PC /c/Tech/hadoop/client
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/hadoop-mapreduce-client-jobclient-* .
root@127.0.0.1's password:
hadoop-mapreduce-client-jobclient-2.4.0.2.1.1 100% 35KB 34.9KB/s 00:00
- Since this article was written, the HortonWorks sandbox has been upgraded and is currently using Java 7, so you don't need to do the steps under "Setting Java Version - Compiler Options".
- But when I came to "Creating Java Archive (JAR) file" - I was getting errors like:
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:314)
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:327)
at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:409)
at com.vivekGanesan.VoteCountApplication.main(VoteCountApplication.java:20)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Preconditions
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 4 more
- So I decided to follow the lead of Jesse Anderson's video (5'30'' into the video - watch 2 mins), which I had watched earlier, and copied all the jars in the /usr/lib/hadoop/client directory of the HortonWorks Sandbox. Note this used to be the client.2.0 subdir but is now the client subdir. Also - it looks like we don't need to take the 3 jars in /usr/lib/hadoop and the commons-httpclient jar in /usr/lib/hadoop/lib as these all appear to be correctly positioned in the client subdir or linked from there.
$ scp -P 2222 root@127.0.0.1:/usr/lib/hadoop/client/*jar .
- And as before, right click on the UtopiaVoteCount package -> Properties -> Java Build Path -> Libraries -> Add External JARs ... then browse to where you copied all the jars locally from the HortonWorks Sandbox and add them all.
- Then retry the "Creating Java Archive (JAR) file" step - this time I got the following:
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
usage: [input] [output]
.... that's almost what we were meant to get ;)
- The next step is to set up a shared folder, which one can do in the VirtualBox settings
- Make sure to set the automount option on and reboot the HortonWorks Sandbox VM (shutdown -r now from the command line). You could also manually mount the new share (see the sketch after this list). Check the share is visible with df -k.
- The rest of the tutorial works well. Note that when one uploads the booth-1.txt and booth-2.txt files via FileBrowser they are placed in the /user/hue/VoteCountInput directory on hdfs. Check that on the command line via this command: hdfs dfs -ls VoteCountInput
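If the automount doesn't kick in, a manual mount and the HDFS check look roughly like this (a sketch - the share name hadoop-share is a made-up example, use whatever you named the share in the VirtualBox settings):
# mkdir -p /mnt/hadoop-share
# mount -t vboxsf hadoop-share /mnt/hadoop-share    # mount the VirtualBox shared folder manually
# df -k | grep hadoop-share                         # confirm the share is visible
$ hdfs dfs -ls /user/hue/VoteCountInput             # absolute path of the VoteCountInput check above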
Installing Spark on the Sandbox
Spark is an interesting, relatively new distributed processing engine that claims speed improvements of orders of magnitude over Hadoop map reduce. Here's an attempt at getting started with Spark using the Hortonworks Sandbox.
- Start by reading this pdf - a great technical guide leading you through the steps
- In fact it is so thorough there is not much more to say on the matter
- Interesting tutorial here http://mbonaci.github.io/mbo-spark/
- Useful Spark standalone documentation here http://spark.apache.org/docs/latest/spark-standalone.html
Tuesday, 29 October 2013
How to handle large and potentially complex XML datasets in Hadoop
How should one handle large and potentially complex XML datasets in Hadoop?
Suppose you have loads of XML data being generated by source systems.
You want to do some data analytics on this data.
Chances are your data scientists will need to access this data and you might need to run a few batch jobs over this data set.
XML parsing is CPU intensive.
XML data contains both data and meaning making it robust against upstream definition changes and against corruption.
How should one approach this problem?
Came across this interesting slideshow re handling XML data here: http://www.slideshare.net/Hadoop_Summit/bose-june26-405pmroom230cv3-24148869.
Then found the Youtube video presentation of it.
Good stuff that presents some of the issues/challenges and then provides some solution approaches and backs them up with some test results.
Another interesting article for python coders can be found here http://davidvhill.com/article/processing-xml-with-hadoop-streaming.
An old post from 2010 shows how to use Hadoop and the Mahout XMLInputFormat class to process xml files. Debate about the effectiveness and richness of this approach can be found in this entry.
Another article uses Hive and XPath to solve a specific XML parsing problem (see the short example at the end of this post).
An article others point people to is this one http://www.undercloud.org/?p=408.
There is also an interesting article from June 2012 on this.
These guys in Feb 2012 were looking at a similar XML challenge - a useful articulation of the problem and their thinking at the time.
Andy was playing with XML and Hadoop way back in 2008.
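To give a flavour of the Hive/XPath approach mentioned above, Hive ships xpath UDFs that can pull values out of an XML string column (a minimal sketch - the table and column names here are made up purely for illustration):
$ hive -e "select xpath_string(xml_doc, '/order/customer/name') from raw_xml_orders limit 5;"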
Friday, 21 June 2013
Impala error connecting when running impala-shell on port 21000
Installed impala using Cloudera Manager.
Then logged on to one of the servers in the cluster (called below) and tried connecting to impala.
It failed with this error (see the details below):
"Error connecting:, Could not connect to localhost:21000"
I made a basic error.
There was no impalad daemon on this server (called below).
There was only an Impala Statestore daemon running on it.
So I needed to find one of the servers actually running impalad and direct impala-shell to it using the -i command line option to specify the host to connect to.
And it worked - see the access working in the last step.
Note I needed to refresh to be able to see my new mytable table.
Last word on impala - am impressed with the speed returned from a simple impala count(*) on mytable versus the hive equivalent.
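For reference, the comparison was just the same aggregate issued to both engines (a sketch - timings will obviously depend on your data and cluster):
[myimpalaserver:21000] > select count(*) from mytable;
$ hive -e "select count(*) from mytable;"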
Not working
$ ps -ef | grep impala | grep -v grep
impala 30929 19867 0 13:13 ? 00:00:19 /opt/cloudera/parcels/IMPALA-1.0.1-1.p0.431/lib/impala/sbin-retail/statestored --flagfile=/var/run/cloudera-scm-agent/process/215-impala-STATESTORE/impala-conf/state_store_flags
$ impala-shell -i localhost
Error connecting:, Could not connect to localhost:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun 4 08:08:13 PDT 2013)
[Not connected] > exit;
$ impala-shell -i
Error connecting:, Could not connect to :21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun 4 08:08:13 PDT 2013)
[Not connected] > exit
Working
$ impala-shell -i myimpalaserver
Connected to myimpalaserver:21000
Server version: impalad version 1.0.1 RELEASE (build df844fb967cec8740f08dfb8b21962bc053527ef)
Unable to load history: [Errno 2] No such file or directory
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun 4 08:08:13 PDT 2013)
[myimpalaserver:21000] > select * from mytable where load_dt = '20130410' and batch = '1' limit 4;
Query: select * from mytable where load_dt = '20130410' and batch = '1' limit 4
ERROR: AnalysisException: Unknown table: 'mytable'
[myimpalaserver:21000] > refresh ;
Successfully refreshed catalog
[myimpalaserver:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+-----------------+
| name |
+-----------------+
| mytable |
+-----------------+
Returned 1 row(s) in 0.11s
[myimpalaserver:21000] > select * from mytable where load_dt = '20130410' and batch = '1' limit 4;
Query: select * from mytable where load_dt = '20130410' and batch = '1' limit 4
Query finished, fetching results ...
Tuesday, 18 June 2013
Installing Cloudera Manager without internet access (path B)
After an abortive attempt at installing Cloudera Manager and the Cloudera software using Path C, i.e. building from tarballs, I reverted to using Path B, i.e. installing from local repos.
(Note - I recommend you read the Cloudera documentation on their web site in conjunction with this post. Cloudera is evolving their product rapidly so this post is likely to become out of date quickly. I wrote this mid-June 2013)
Note - these notes are a bit messy. They were written up a couple of days after doing the install, and I didn't keep great notes as I went along :(. Treat them as draft but they may be helpful.
Background
Up to now, we have not been using Cloudera Manager.
Then again, we have a really simple system running a small HDFS cluster.
Ganglia and simple scripting have done the job for us.
Now we have purchased Cloudera support and it makes sense to use their Cloudera Manager since it simplifies the support and gives visibility of the cluster.
(Wish Cloudera Manager and Ambari would come together as one management tool)
Our system lives in a DMZ and has no direct internet access. It has servers with the following specs:
- 2 x quad core Xeon 2.1GHz CPUs, 48GB RAM, 4 x 2TB hard disks and a pair of 1G NICs
- CentOS 6.3
On 6 servers in our dev/test area, I am looking to install:
- One server to function as the management host running Cloudera Manager 4.5.3 and the databases (will use a single MySQL instance for all the db repositories).
- 5 servers running the latest stable release of Cloudera 4.2.1
- Step - Download software
- Java - use the Java that comes in the Cloudera repo so no need to download Java from Oracle.
- MySQL (I used MySQL but could also use PostgreSQL or Oracle) - downloaded the following:
- MySQL-server-5.5.22-1.linux2.6.x86_64.rpm
- MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
- Downloaded the CentOS 6.3 packages for x86_64
- Download from vault.centos.org in my case vault.centos.org/6.3/os/x86_64/Packages/
- Placed them in /var/www/html/CentOS/...
- One of the first things you will need is the createrepo package (createrepo-0.9.8-5.el6.noarch.rpm) to create the local repo service on the server
- Download the Cloudera CDH4 parcel files and place them in /var/www/html/cdh4
- Download from http://archive.cloudera.com/cdh4/parcels/latest/ including:
- Download this http://archive.cloudera.com/cdh4/parcels/latest/CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel (for CentOS 6 in my case)
- Download this http://archive.cloudera.com/cdh4/parcels/latest/manifest.json
- Remember to create a file called CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel.sha (as per the documentation) based on the relevant entry in the manifest.json file - i.e. the hash entry for the parcel you are installing. In my case the contents of CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel.sha is df5cc61b2d257aaf625341f709a4f8e09754038a (see the sketch after this list).
- Download the Cloudera Manager software
- Download this http://archive.cloudera.com/cm4/repo-as-tarball/4.6.0/cm4.6.0-centos6.tar.gz (for CentOS 6 in my case)
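A quick way to create the .sha file mentioned above is to echo the hash taken from manifest.json into it and then check the parcel against it (a minimal sketch using the hash quoted above):
# cd /var/www/html/cdh4
# echo "df5cc61b2d257aaf625341f709a4f8e09754038a" > CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel.sha
# sha1sum CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel    # should print the same hash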
- Step - install a local yum repo using the createrepo rpm
- # rpm -i createrepo-0.9.8-5.el6.noarch.rpm
- warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- error: Failed dependencies:
- deltarpm is needed by createrepo-0.9.8-5.el6.noarch
- python-deltarpm is needed by createrepo-0.9.8-5.el6.noarch
- # rpm -i deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm
- warning: deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
- # rpm -i python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm
- warning: python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
- # rpm -i createrepo-0.9.8-5.el6.noarch.rpm
- warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- # cd /var/www/html/cm
- # createrepo .
- 14/14 - 4.6.0/RPMS/x86_64/enterprise-debuginfo-4.6.0-1.cm460.p0.141.x86_64.rpm
- Saving Primary metadata
- Saving file lists metadata
- Saving other metadata
- make the following repo files
- Edit the files in /etc/yum.repos.d to look something like:
- # cat /etc/yum.repos.d/local-centos-6.3
- [local-centos]
- name=centos
- baseurl=http://<local httpd root>/CentOS/6.3/local/x86_64
- enabled=1
- gpgcheck=0
- # cat /etc/yum.repos.d/cloudera-manager.repo
- [cloudera-manager]
- name = Cloudera Manager, Version 4.6.0
- baseurl = http://<local httpd root>/cm
- gpgkey = http://<local httpd root>/cm/RPM-GPG-KEY-cloudera
- gpgcheck = 1
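Once the repo files are in place (and httpd is serving /var/www/html as per the next step), a quick check that yum can see them (a minimal sketch):
# yum clean all
# yum repolist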
- Step - Install httpd if required - you will need httpd running on the server with the yum repos.
- Install the httpd packages and leave the DocumentRoot and Directory set to /var/www/html (unless you have the repos elsewhere, in which case edit the /etc/httpd/conf/httpd.conf file, changing the DocumentRoot and Directory tags).
- These CentOS packages were needed to install the apache httpd web server on a server (I put mine on the cloudera manager server) to allow the other servers to install from this repository.
- # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
- warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- error: Failed dependencies:
- /etc/mime.types is needed by httpd-2.2.15-9.el6.centos.x86_64
- apr-util-ldap is needed by httpd-2.2.15-9.el6.centos.x86_64
- httpd-tools = 2.2.15-9.el6.centos is needed by httpd-2.2.15-9.el6.centos.x86_64
- libaprutil-1.so.0()(64bit) is needed by httpd-2.2.15-9.el6.centos.x86_64
- # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm
- warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- error: Failed dependencies:
- libaprutil-1.so.0()(64bit) is needed by httpd-tools-2.2.15-9.el6.centos.x86_64
- # rpm -i apr-util-1.3.9-3.el6_0.1.x86_64.rpm
- warning: apr-util-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
- # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm
- warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- # rpm -i apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm
- warning: apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
- # rpm -i mailcap-2.1.31-2.el6.noarch.rpm
- warning: mailcap-2.1.31-2.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
- warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
- Restarted the httpd service:
# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
# pwd
# cd /var/www/html/cdh4/
# sha1sum CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel
cadf5cc61b2d257aaf625341f709a4f8e09754038a CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel
# cat CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel.sha
df5cc61b2d257aaf625341f709a4f8e09754038a
# cd ../impala/
# sha1sum IMPALA-1.0.1-1.p0.431-el6.parcel
992467f2e54bd394cbdd3f4ed97b6e9bead60ff0 IMPALA-1.0.1-1.p0.431-el6.parcel
# cat IMPALA-1.0.1-1.p0.431-el6.parcel.sha
992467f2e54bd394cbdd3f4ed97b6e9bead60ff0
# yum clean all
Loaded plugins: fastestmirror, security
Cleaning repos: cloudera-manager
Cleaning up Everything
Cleaning up list of fastest mirrors
- Step - Install the Java SDK from the cloudera repo on the Cloudera Manager server - don't install this on the other servers as Cloudera Manager will install it for them (if I remember correctly ;). Note I had to remove an existing OpenJDK install before installing the Cloudera jdk version
- # yum -y remove java-1.6.0-openjdk-1.6.0.0–1.45.1.11.1.el6.x86_64
- Loaded plugins: fastestmirror, security
- Determining fastest mirrors
- cloudera-manager | 1.3 kB 00:00
- cloudera-manager/primary | 5.0 kB 00:00
- cloudera-manager 20/20
- Setting up Install Process
- Resolving Dependencies
- --> Running transaction check
- ---> Package jdk.x86_64 2000:1.6.0_31-fcs will be installed
- --> Finished Dependency Resolution
- Dependencies Resolved
- Package Arch Version Repository Size
- Installing:
- jdk x86_64 2000:1.6.0_31-fcs cloudera-manager 68 M
- Transaction Summary
- Install 1 Package(s)
- Total download size: 68 M
- Installed size: 143 M
- Is this ok [y/N]: y
- Downloading Packages:
- jdk-6u31-linux-amd64.rpm | 68 MB 00:01
- Running rpm_check_debug
- Running Transaction Test
- Transaction Test Succeeded
- Running Transaction
- Installing : 2000:jdk-1.6.0_31-fcs.x86_64 1/1
- Unpacking JAR files...
- rt.jar...
- jsse.jar...
- charsets.jar...
- tools.jar...
- localedata.jar...
- plugin.jar...
- javaws.jar...
- deploy.jar...
- Verifying : 2000:jdk-1.6.0_31-fcs.x86_64 1/1
- Installed:
- jdk.x86_64 2000:1.6.0_31-fcs
- # java -version
- java version "1.6.0_31"
- Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
- Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
- Step - Set up database
- Got a choice of databases - PostgreSQL, MySQL or Oracle. We had set up MySQL for Hive previously so I decided to create the repositories required for Cloudera Manager and the monitoring repositories using MySQL, but all three were options for us. Interestingly, in the documentation PostgreSQL is listed first - not sure if this indicates a preference.
- Run the downloaded MySQL rpms - here's what I did on my system
- rpm -i ./MySQL-server-5.5.22-1.linux2.6.x86_64.rpm --replacefiles
- rpm -i ./MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
- including the following:
- mkdir -p /mysql_mon_rep_data
- chown -R mysql:mysql /mysql_mon_rep_data
- chmod 755 /mysql_mon_rep_data
- mkdir -p /var/log/mysql/logs/binary/mysql_binary_log
- chown -R mysql:mysql /var/log/mysql
- mkdir -p /usr/share/java/
- cp -ip mysql-connector-java-5.1.18-bin.jar /usr/share/java/mysql-connector-java.jar # copied from earlier download I had for Hive repository. This can be found on the Oracle mySQL downloads site
- chown mysql:mysql /usr/share/java/mysql-connector-java.jar
- Moved the old version of MySQL out of the way (i.e. what was in /var/lib/mysql_data)
- Change the /etc/my.cnf MySQL config file to include the configuration settings as per the documentation. I foolishly changed the bind_address to be equal to the server name rather than leaving it out and having the default of localhost (need to verify this, but it sent me on a wild goose chase - the Cloudera consultant was very good at keeping to the script in their documentation).
- Stopped and restarted the mysql daemon ie "service mysql restart" (note - looks like CentOS 6.3 uses mysql and not mysqld as the daemon)
- This builds a new set of database files (if you are upgrading your database - do something different) ... but it fails with the following error (for me at least) because I had moved the files from the default location. So I had to run the following to get it to work.
# /usr/bin/mysql_secure_installation
NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MySQL SERVERS IN PRODUCTION USE! PLEASE READ EACH STEP CAREFULLY!
In order to log into MySQL to secure it, we'll need the current password for the root user. If you've just installed MySQL, and you haven't set the root password yet, the password will be blank, so you should just press enter here.
Enter current password for root (enter for none):
OK, successfully used password, moving on...
Setting the root password ensures that nobody can log into the MySQL root user without the proper authorisation.
Set root password? [Y/n] Y
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
... Success!
By default, a MySQL installation has an anonymous user, allowing anyone to log into MySQL without having to have a user account created for them. This is intended only for testing, and to make the installation go a bit smoother. You should remove them before moving into a production environment.
Remove anonymous users? [Y/n] Y
... Success!
Normally, root should only be allowed to connect from 'localhost'. This ensures that someone cannot guess at the root password from the network.
Disallow root login remotely? [Y/n] n
... skipping.
By default, MySQL comes with a database named 'test' that anyone can access. This is also intended only for testing, and should be removed before moving into a production environment.
Remove test database and access to it? [Y/n] Y
- Dropping test database...
... Success!
- Removing privileges on test database...
... Success!
Reloading the privilege tables will ensure that all changes made so far will take effect immediately.
Reload privilege tables now? [Y/n] Y
... Success!
Cleaning up...
All done! If you've completed all of the above steps, your MySQL installation should now be secure.
Thanks for using MySQL!
- Step - Install cloudera-manager-daemons and cloudera-manager-server on the Cloudera Manager server
If these already exist from an earlier attempt, drop the following Cloudera database schemas: amon, hive, hmon, rman, scm and smon.
And also drop the scm user if it exists (you can leave the amon, hive, hmon, rman and smon users in the instance).
Check what Cloudera packages are available in the repo.
# yum search cloudera
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
============================================================================================== N/S Matched: cloudera ==============================================================================================
cloudera-manager-agent.x86_64 : The Cloudera Manager Agent
cloudera-manager-server.x86_64 : The Cloudera Manager Server
cloudera-manager-server-db.x86_64 : Embedded database for the Cloudera Manager Server
cloudera-manager-daemons.x86_64 : Provides daemons for monitoring Hadoop and related tools.
cloudera-manager-parcel-4.6.0.x86_64 : All the CM bits in one relocatable package
Name and summary matches only, use "search all" for everything.
# yum install cloudera-manager-daemons
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package cloudera-manager-daemons.x86_64 0:4.6.0-1.cm460.p0.141 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
===================================================================================================================================================================================================================
Package Arch Version Repository Size
===================================================================================================================================================================================================================
Installing:
cloudera-manager-daemons x86_64 4.6.0-1.cm460.p0.141 cloudera-manager 135 M
Transaction Summary
===================================================================================================================================================================================================================
Install 1 Package(s)
Total download size: 135 M
Installed size: 175 M
Is this ok [y/N]: y
Downloading Packages:
cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64.rpm | 135 MB 00:02
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64 1/1
Verifying : cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64 1/1
Installed:
cloudera-manager-daemons.x86_64 0:4.6.0-1.cm460.p0.141
Complete!
# yum install cloudera-manager-server
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package cloudera-manager-server.x86_64 0:4.6.0-1.cm460.p0.141 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
===================================================================================================================================================================================================================
Package Arch Version Repository Size
===================================================================================================================================================================================================================
Installing:
cloudera-manager-server x86_64 4.6.0-1.cm460.p0.141 cloudera-manager 7.6 k
Transaction Summary
===================================================================================================================================================================================================================
Install 1 Package(s)
Total download size: 7.6 k
Installed size: 8.8 k
Is this ok [y/N]: y
Downloading Packages:
cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64.rpm | 7.6 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64 1/1
Verifying : cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64 1/1
Installed:
cloudera-manager-server.x86_64 0:4.6.0-1.cm460.p0.141
Complete!
Edited the config.ini to add the name of the server hosting the Cloudera Manager server service.
# grep -i server /etc/cloudera-scm-agent/config.ini
# Hostname of Cloudera SCM Server
server_host=.uk.pri.o2.com
# Port that server is listening on
server_port=7182
Step - Prepare the databases
Run the scm database create and user create script (I initially set the user password to scm but changed it later).
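The invocation itself isn't captured in my notes; with the tarball layout shown in the output below it would have been along these lines (the MySQL root credentials and hosts here are assumptions - adjust to your setup):
# /opt/cloudera-manager/cm-4.6.0/share/cmf/schema/scm_prepare_database.sh mysql -uroot -p<root password> --scm-host localhost scm scm scm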
Verifying that we can write to /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server
log4j:ERROR Could not find value for key log4j.appender.A
log4j:ERROR Could not instantiate appender named "A".
Creating SCM configuration file in /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server
Executing: /usr/java/jdk1.6.0_45/bin/java -cp /usr/share/java/mysql-connector-java.jar:/opt/cloudera-manager/cm-4.6.0/share/cmf/schema/../lib/* com.cloudera.enterprise.dbutil.DbCommandExecutor /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server/db.properties com.cloudera.cmf.db.
log4j:ERROR Could not find value for key log4j.appender.A
log4j:ERROR Could not instantiate appender named "A".
[2013-06-17 11:05:25,426] INFO 0[main] - com.cloudera.enterprise.dbutil.DbCommandExecutor.testDbConnection(DbCommandExecutor.java:231) - Successfully connected to database.
All done, your SCM database is configured correctly!
The log4j errors appear but don't seem to be harmful.
Create the monitoring databases
#!/bin/bash -x
# Settings based on the article in
# http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Enterprise-Edition-Installation-Guide/cmeeig_topic_5_5.html#cmeeig_topic_5_5
# might want to check this post too http://forums.mysql.com/read.php?22,274891,274891
NOW=`date +"%Y%m%d%H%M%S"`
MYSQLCONF=/etc/my.cnf
amon_password=<amon password>
smon_password=<smon password>
rman_password=<rman password>
hmon_password=<hmon password>
hive_password=<hive password>
nav_password=<nav password>
echo "can login automatically if you have password in the ~/.my.cnf file"
mysql -u root <<EOF
create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY '$amon_password';
create database smon DEFAULT CHARACTER SET utf8;
grant all on smon.* TO 'smon'@'%' IDENTIFIED BY '$smon_password';
create database rman DEFAULT CHARACTER SET utf8;
grant all on rman.* TO 'rman'@'%' IDENTIFIED BY '$rman_password';
create database hmon DEFAULT CHARACTER SET utf8;
grant all on hmon.* TO 'hmon'@'%' IDENTIFIED BY '$hmon_password';
create database hive DEFAULT CHARACTER SET utf8;
grant all on hive.* TO 'hive'@'%' IDENTIFIED BY '$hive_password';
create database nav DEFAULT CHARACTER SET utf8;
grant all on nav.* TO 'nav'@'%' IDENTIFIED BY '$nav_password';
flush privileges;
EOF
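A quick check that the monitoring databases were created (a minimal sketch; assumes the root password is in ~/.my.cnf as the script's echo suggests):
$ mysql -u root -e "show databases;"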
When changing user passwords in MySQL, remember to flush the privileges for the changes to take effect:
mysql> flush privileges;
Step - Start the Cloudera Manager server
# service cloudera-scm-server start
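Before heading to the UI, it's worth confirming the server has come up (a sketch - the log location is the standard one for the package install, and 7180 is the default Cloudera Manager web UI port):
# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
Then browse to http://<cm host>:7180 and log in (the default credentials are admin/admin).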
Step - install packages in the cluster
Now use the Cloudera Manager UI to install packages across the other nodes in the cluster
Installed … see screenshots (still need to add these)
You need to fill these (or your equivalent ones) in the entry fields in the set up wizard.
http://<local httpd root>/cdh4
http://<local httpd root>/cdh4
http://<local httpd root>/cdh4,http://<>/impala
http://<local httpd root>/cm
http://<local httpd root>/cm/RPM-GPG-KEY-cloudera
Niggles ...
Zookeeper perms in the
Wasn't able to create dirs in new dir
So initialized zookeeper from the menu somewhere (could also have set allow to create setting)
We had issues with lack of space in /var so did the following:
Renamed the hdfs log /var/log/hadoop-hdfs to /data/var/log/hadoop-hdfs
Renamed the scm logs to /data/var…
Renamed mapred logs to /data/var…
For some, preferred to lower the warning/critical thresholds in the UI.
Step - Install impala
Use UI menu to add service
Mentioned config changes to support Impala
Perms changed from 700 to 755 so that impala could read hadoop blocks directly
Noted hdfs flag on console showing outdated config
Restart hdfs
Restart impala
Note - if stopping/starting everything, all the dependencies are taken care of.
Step - Post implementation bits
Create a user area for my user
$ hdfs dfs -ls /
Found 2 items
drwxrwxrwt - hdfs supergroup 0 2013-06-17 15:03 /tmp
drwxr-xr-x - hdfs supergroup 0 2013-06-17 15:01 /user
$ hdfs dfs -mkdir /user/<myuser>
$ hdfs dfs -chown <myuser>:<myuser> /user/<myuser>
Benchmark testing - TeraGen & TeraSort
(more benchmark testing in this article)
$ time hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.3.0.jar teragen -Dmapred.map.tasks=60 1000000000 tera/in
$ time hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.3.0.jar terasort -Dmapred.reduce.tasks=60 tera/in tera/out
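There is also a teravalidate step in the same examples jar that checks the terasort output is correctly ordered (a sketch along the same lines as the commands above):
$ time hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.3.0.jar teravalidate tera/out tera/validate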
Think about scripting everything for our environment using the Cloudera Manager APIs.
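As a first taste of the Cloudera Manager API mentioned above, it is a plain REST interface on the same port as the UI (a minimal sketch - admin/admin are the default credentials, adjust host and credentials to your setup):
$ curl -u admin:admin 'http://<cm host>:7180/api/version'      # returns the highest API version this CM supports
$ curl -u admin:admin 'http://<cm host>:7180/api/v3/clusters'  # list clusters using the version returned above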
Step - Installing and using lzo
Follow these instructions or, as this post describes, use the Cloudera software here and follow the instructions in the following Cloudera documentation.