Tuesday 29 October 2013

How to handle large and potentially complex XML datasets in Hadoop

How should one handle large and potentially complex XML datasets in Hadoop?

Suppose you have loads of XML data being generated by source systems, and you want to run some analytics on it.
Chances are your data scientists will need access to this data, and you might need to run a few batch jobs over the data set.
XML parsing is CPU intensive.
On the other hand, XML carries both the data and its meaning, which makes it robust against upstream definition changes and against corruption.
How should one approach this problem?

I came across this interesting slideshow on handling XML data here: http://www.slideshare.net/Hadoop_Summit/bose-june26-405pmroom230cv3-24148869.
I then found the YouTube video of the presentation.
Good stuff: it presents some of the issues and challenges, offers some solution approaches and backs them up with test results.

Another interesting article for Python coders can be found here: http://davidvhill.com/article/processing-xml-with-hadoop-streaming.

An old post from 2010 shows how to use Hadoop and the Mahout XMLInputFormat class to process XML files. There is some debate about the effectiveness and richness of this approach in this entry.

Another article uses Hive and XPath to solve a specific XML parsing problem.

An article that others point people to is this one: http://www.undercloud.org/?p=408.

Interesting article from June 2012 re this.

These guys were looking at a similar XML challenge in Feb 2012 - a useful articulation of the problem and their thinking at the time.

Andy was playing with XML and Hadoop way back in 2008.

Friday 21 June 2013

Impala: error connecting when running impala-shell on port 21000

Installed impala using Cloudera Manager.

Then logged on to one of the servers in the cluster (called below) and tried connecting to impala.
It failed with this error (see the details below):
"Error connecting: , Could not connect to localhost:21000"

I made a basic error.
There was no impalad daemon on this server (called below).
There was only an Impala Statestore daemon running on it.
So I needed to find one of the servers running impalad and point impala-shell at it using the -i command line option, which names the impalad host to connect to.
And it worked - see the access working in the last step.
Note I needed to refresh to be able to see my new mytable table.
Last word on impala - am impressed with the speed returned from a simple impala count(*) on mytable versus hive equivalent.
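
The check I skipped can be sketched as a small shell function. It reads `ps -ef` output and succeeds only if an impalad process is listed, so a statestore-only host like mine is caught before impala-shell is pointed at it. The function name `has_impalad` is my own invention, not something that ships with Impala, and I am assuming the impalad binary shows up in the process list under the name `impalad`.

```shell
#!/bin/sh
# Succeeds (exit 0) if the ps listing on stdin contains an impalad process.
# A host that runs only statestored does not match.
has_impalad() {
    # Match "impalad" as a path component or as a whole word.
    grep -Eq '(^|[[:space:]/])impalad([[:space:]]|$)'
}

# Usage sketch ("myimpalaserver" is the placeholder host used in this post):
#   if ps -ef | has_impalad; then
#       impala-shell -i localhost
#   else
#       impala-shell -i myimpalaserver
#   fi
```

This only tells you about the local host. Finding which hosts do run impalad is still a job for Cloudera Manager or your own host list.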

Not working

$ ps -ef | grep impala | grep -v grep 
impala   30929 19867  0 13:13 ?        00:00:19 /opt/cloudera/parcels/IMPALA-1.0.1-1.p0.431/lib/impala/sbin-retail/statestored --flagfile=/var/run/cloudera-scm-agent/process/215-impala-STATESTORE/impala-conf/state_store_flags

$ impala-shell  -i localhost
Error connecting: , Could not connect to localhost:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun  4 08:08:13 PDT 2013)
[Not connected] > exit;

$ impala-shell  -i
Error connecting: , Could not connect to :21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun  4 08:08:13 PDT 2013)
[Not connected] > exit

Working

$ impala-shell -i myimpalaserver
Connected to myimpalaserver:21000
Server version: impalad version 1.0.1 RELEASE (build df844fb967cec8740f08dfb8b21962bc053527ef)
Unable to load history: [Errno 2] No such file or directory
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.0.1 (df844fb) built on Tue Jun  4 08:08:13 PDT 2013)
[myimpalaserver:21000] > select * from mytable where load_dt = '20130410' and batch = '1' limit 4;
Query: select * from mytable where load_dt = '20130410' and batch = '1' limit 4
ERROR: AnalysisException: Unknown table: 'mytable'
[myimpalaserver:21000] > refresh ;
Successfully refreshed catalog
[myimpalaserver:21000] > show tables;      
Query: show tables
Query finished, fetching results ...
+-----------------+
| name            |
+-----------------+
| mytable         |
+-----------------+
Returned 1 row(s) in 0.11s
[myimpalaserver:21000] > select * from mytable where load_dt = '20130410' and batch = '1' limit 4;
Query: select * from mytable where load_dt = '20130410' and batch = '1' limit 4

Query finished, fetching results ...

Tuesday 18 June 2013

Installing Cloudera Manager without internet access (path B)

After an abortive attempt at trying to install Cloudera Manager and the Cloudera software using Path C ie building from tarballs, I reverted to using Path B ie building from local repos.
(Note - I recommend you read the Cloudera documentation on their web site in conjunction with this post. Cloudera is evolving their product rapidly so this post is likely to become out of date quickly. I wrote this mid-June 2013)

Note - these notes are a bit messy. I wrote them up a couple of days after doing the install, without having kept great notes as I went along :(. Treat them as a draft, but they may be helpful.

Background

Up to now, we have not been using Cloudera Manager.
Then again we have a really simple system running a small HDFS cluster.
Ganglia and simple scripting have done the job for us.
Now we have purchased Cloudera support and it makes sense to use their Cloudera Manager since it simplifies the support and gives visibility of the cluster.
(Wish Cloudera Manager and Ambari would come together as one management tool)

Our system lives in a DMZ and has no direct internet access. It has servers with the following specs:
  • 2 x quad core Xeon 2.1GHz CPUs, 48GB RAM, 4 x 2TB hard disks and a pair of 1G NICs
  • CentOS 6.3
On 6 servers in our dev/test area, I am looking to install:
  • One server to function as the management host running Cloudera Manager 4.5.3 and the databases (will use a single MySQL instance for all the db repositories).
  • 5 servers running the latest stable release of Cloudera 4.2.1
Installation steps used:
  1. Step - Download software
    • Java - use the java that comes in the Cloudera repo so no need to download Java from Oracle.
    • MySQL (I used MySQL but could also use PostgreSQL or Oracle) - downloaded the following:
      • MySQL-server-5.5.22-1.linux2.6.x86_64.rpm
      • MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
    • Downloaded the CentOS 6.3 packages for x86_64
    • Downloaded the Cloudera CDH4 parcel files and placed them in /var/www/html/cdh4
    • Downloaded the Cloudera Manager software
  2. Step - set up a local yum repo using the createrepo rpm
    • # rpm -i createrepo-0.9.8-5.el6.noarch.rpm
      • warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • error: Failed dependencies:
      • deltarpm is needed by createrepo-0.9.8-5.el6.noarch
      • python-deltarpm is needed by createrepo-0.9.8-5.el6.noarch
  • # rpm -i deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm 
    • warning: deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
  • # rpm -i python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm 
    • warning: python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
  • # rpm -i createrepo-0.9.8-5.el6.noarch.rpm 
    • warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
  • # cd /var/www/html/cm
  • # createrepo .
    • 14/14 - 4.6.0/RPMS/x86_64/enterprise-debuginfo-4.6.0-1.cm460.p0.141.x86_64.rpm  
    • Saving Primary metadata
    • Saving file lists metadata
    • Saving other metadata
  • Create repo files under /etc/yum.repos.d that look something like:
  • # cat /etc/yum.repos.d/local-centos-6.3
    • [local-centos]
    • name=centos
    • baseurl=http:///CentOS/6.3/local/x86_64
    • enabled=1
    • gpgcheck=0
  • # cat /etc/yum.repos.d/cloudera-manager.repo
    • [cloudera-manager]
    • name = Cloudera Manager, Version 4.6.0
    • baseurl = http:///cm
    • gpgkey = http:///cm/RPM-GPG-KEY-cloudera
    • gpgcheck = 1
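
A repo file like the ones above can be generated rather than hand-edited. This is only a minimal sketch: the function name `repo_stanza` and the host `repohost.example` are placeholders I made up, and gpgcheck is simply switched on whenever a key URL is supplied.

```shell
#!/bin/sh
# Print a yum repo stanza. Arguments: repo id, display name, base URL,
# and optionally a GPG key URL (gpgcheck is enabled only when one is given).
repo_stanza() {
    printf '[%s]\n' "$1"
    printf 'name=%s\n' "$2"
    printf 'baseurl=%s\n' "$3"
    printf 'enabled=1\n'
    if [ -n "${4:-}" ]; then
        printf 'gpgkey=%s\n' "$4"
        printf 'gpgcheck=1\n'
    else
        printf 'gpgcheck=0\n'
    fi
}

# Example (hypothetical host name):
#   repo_stanza cloudera-manager "Cloudera Manager, Version 4.6.0" \
#       http://repohost.example/cm \
#       http://repohost.example/cm/RPM-GPG-KEY-cloudera \
#       > /etc/yum.repos.d/cloudera-manager.repo
```

One thing worth remembering when naming the output file: yum only reads files in /etc/yum.repos.d whose names end in .repo.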
  • Step - Install httpd if reqd - you will need httpd running on the server with the yum repos.
    • Install the httpd packages and leave the DocumentRoot and Directory set to /var/www/html (unless you have the repos elsewhere, in which case edit /etc/httpd/conf/httpd.conf and change the DocumentRoot and Directory entries).
    • These are the CentOS packages I needed to install the Apache httpd web server (I put mine on the Cloudera Manager server) so that the other servers can install from this repository.
      • # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
        • warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
        • error: Failed dependencies:
        • /etc/mime.types is needed by httpd-2.2.15-9.el6.centos.x86_64
        • apr-util-ldap is needed by httpd-2.2.15-9.el6.centos.x86_64
        • httpd-tools = 2.2.15-9.el6.centos is needed by httpd-2.2.15-9.el6.centos.x86_64
        • libaprutil-1.so.0()(64bit) is needed by httpd-2.2.15-9.el6.centos.x86_64
      • # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm 
        • warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
        • error: Failed dependencies:
        • libaprutil-1.so.0()(64bit) is needed by httpd-tools-2.2.15-9.el6.centos.x86_64
      • # rpm -i apr-util-1.3.9-3.el6_0.1.x86_64.rpm 
        • warning: apr-util-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
      • # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm 
        • warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • # rpm -i apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm 
        • warning: apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
      • # rpm -i mailcap-2.1.31-2.el6.noarch.rpm
        • warning: mailcap-2.1.31-2.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
        • warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
    •  Restarted the httpd service:
# service httpd restart
Stopping httpd:                                            [  OK  ]
Starting httpd:                                            [  OK  ]
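
The failed-dependency round trips in the rpm transcripts above can usually be avoided by handing rpm all the downloaded packages in one go, since rpm resolves dependencies among the packages named on a single command line. A minimal sketch, assuming all the needed rpms sit in one directory; `install_rpms_together` and the `RPM_CMD` dry-run override are names I made up for illustration.

```shell
#!/bin/sh
# Install every .rpm in a directory in ONE rpm transaction.
# Set RPM_CMD=echo to print the arguments instead of running rpm.
install_rpms_together() {
    dir="$1"
    set -- "$dir"/*.rpm
    # If the glob matched nothing, $1 is still the literal pattern.
    if [ ! -e "$1" ]; then
        echo "no .rpm files found in $dir" >&2
        return 1
    fi
    ${RPM_CMD:-rpm} -ivh "$@"
}

# Example dry run (hypothetical directory):
#   RPM_CMD=echo install_rpms_together /root/httpd-rpms
```

If a dependency is missing from the directory altogether, rpm will still refuse the whole transaction, which is arguably what you want.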
# pwd

# cd /var/www/html/cdh4/                                     
# sha1sum CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel
df5cc61b2d257aaf625341f709a4f8e09754038a  CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel
# cat CDH-4.3.0-1.cdh4.3.0.p0.22-el6.parcel.sha 
df5cc61b2d257aaf625341f709a4f8e09754038a
# cd ../impala/
# sha1sum IMPALA-1.0.1-1.p0.431-el6.parcel
992467f2e54bd394cbdd3f4ed97b6e9bead60ff0  IMPALA-1.0.1-1.p0.431-el6.parcel
# cat IMPALA-1.0.1-1.p0.431-el6.parcel.sha 
992467f2e54bd394cbdd3f4ed97b6e9bead60ff0
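
The manual sha1sum comparison above can be scripted so that a mismatched parcel is caught before Cloudera Manager tries to distribute it. A minimal sketch, assuming each parcel sits next to a `.sha` file holding the expected SHA-1 hex digest, as in the transcript; the function name `verify_parcel` is illustrative.

```shell
#!/bin/sh
# Compare a parcel's SHA-1 with the digest stored in <parcel>.sha.
# Returns 0 on match, 1 on mismatch or if either file is missing.
verify_parcel() {
    parcel="$1"
    [ -f "$parcel" ] && [ -f "$parcel.sha" ] || return 1
    # sha1sum prints "<digest>  <file>"; keep only the digest.
    actual="$(sha1sum "$parcel" | awk '{print $1}')"
    # The .sha file may carry trailing whitespace or a newline.
    expected="$(tr -d '[:space:]' < "$parcel.sha")"
    if [ "$actual" = "$expected" ]; then
        echo "OK       $parcel"
    else
        echo "MISMATCH $parcel" >&2
        return 1
    fi
}

# Example: check every parcel in the current directory.
#   for p in *.parcel; do verify_parcel "$p"; done
```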

# yum clean all
Loaded plugins: fastestmirror, security
Cleaning repos: cloudera-manager
Cleaning up Everything
Cleaning up list of fastest mirrors
    • Step - Install the Java SDK from the Cloudera repo on the Cloudera Manager server. Don't install it on the other servers, as Cloudera Manager will do that (if I remember correctly ;). Note I had to remove an existing OpenJDK install before installing the Cloudera jdk version
      • # yum -y remove java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64

        # yum install jdk
        Loaded plugins: fastestmirror, security
        Determining fastest mirrors
        cloudera-manager                                                                                                                                                                            | 1.3 kB     00:00     
        cloudera-manager/primary                                                                                                                                                                    | 5.0 kB     00:00     
        cloudera-manager                                                                                                                                                                                             20/20
        Setting up Install Process
        Resolving Dependencies
        --> Running transaction check
        ---> Package jdk.x86_64 2000:1.6.0_31-fcs will be installed
        --> Finished Dependency Resolution

        Dependencies Resolved

        ===================================================================================================================================================================================================================
         Package                                    Arch                                          Version                                                    Repository                                               Size
        ===================================================================================================================================================================================================================
        Installing:
         jdk                                        x86_64                                        2000:1.6.0_31-fcs                                          cloudera-manager                                         68 M

        Transaction Summary
        ===================================================================================================================================================================================================================
        Install       1 Package(s)

        Total download size: 68 M
        Installed size: 143 M
        Is this ok [y/N]: y
        Downloading Packages:
        jdk-6u31-linux-amd64.rpm                                                                                                                                                                    |  68 MB     00:01     
        Running rpm_check_debug
        Running Transaction Test
        Transaction Test Succeeded
        Running Transaction
          Installing : 2000:jdk-1.6.0_31-fcs.x86_64                                                                                                                                                                    1/1 
        Unpacking JAR files...
        rt.jar...
        jsse.jar...
        charsets.jar...
        tools.jar...
        localedata.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
          Verifying  : 2000:jdk-1.6.0_31-fcs.x86_64                                                                                                                                                                    1/1 

        Installed:
          jdk.x86_64 2000:1.6.0_31-fcs                                                                                                                                                                                   

        Complete!
      • # java -version
        • java version "1.6.0_31"
        • Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
        • Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
    • Step - Set up database
      • There is a choice of databases - PostgreSQL, MySQL or Oracle. We had set up MySQL for Hive previously, so I decided to create the Cloudera Manager and monitoring repositories in MySQL, but all three were options for us. Interestingly, PostgreSQL is listed first in the documentation. Not sure if this indicates a preference.
      • Install the downloaded MySQL rpms - here's what I did on my system
        • rpm -i ./MySQL-server-5.5.22-1.linux2.6.x86_64.rpm --replacefiles
        • rpm -i ./MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
        • followed by this directory and connector setup
          • mkdir -p /mysql_mon_rep_data
          • chown -R mysql:mysql /mysql_mon_rep_data
          • chmod 755 /mysql_mon_rep_data
          • mkdir -p /var/log/mysql/logs/binary/mysql_binary_log
          • chown -R mysql:mysql /var/log/mysql
          • mkdir -p /usr/share/java/
          • cp -ip mysql-connector-java-5.1.18-bin.jar /usr/share/java/mysql-connector-java.jar  # copied from earlier download I had for Hive repository. This can be found on the Oracle mySQL downloads site
          • chown mysql:mysql /usr/share/java/mysql-connector-java.jar
      • Moved old version of MySQL out the way (ie that was in /var/lib/mysql_data)
      • Change the /etc/my.cnf MySQL config file to include the configuration settings as per the documentation. I foolishly set bind_address to the server name rather than leaving it out and taking the default, which I believe is localhost (need to verify this). This led me on a wild goose chase, and the Cloudera consultant was very good at keeping to the script in their documentation.
      • Stopped and restarted the mysql daemon ie "service mysql restart" (note - looks like CentOS 6.3 uses mysql and not mysqld as the daemon)
      • This builds a new set of database files (if you are upgrading your database, do something different) ... but it failed for me because I had moved the files from the default location. So I had to run the following to get it to work.
      • # /usr/bin/mysql_secure_installation
        
        NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MySQL
              SERVERS IN PRODUCTION USE!  PLEASE READ EACH STEP CAREFULLY!
        
        In order to log into MySQL to secure it, we'll need the current
        password for the root user.  If you've just installed MySQL, and
        you haven't set the root password yet, the password will be blank,
        so you should just press enter here.
        
        Enter current password for root (enter for none): 
        OK, successfully used password, moving on...
        
        Setting the root password ensures that nobody can log into the MySQL
        root user without the proper authorisation.
        
        Set root password? [Y/n] Y
        New password: 
        Re-enter new password: 
        Password updated successfully!
        Reloading privilege tables..
         ... Success!
        
        
        By default, a MySQL installation has an anonymous user, allowing anyone
        to log into MySQL without having to have a user account created for
        them.  This is intended only for testing, and to make the installation
        go a bit smoother.  You should remove them before moving into a
        production environment.
        
        Remove anonymous users? [Y/n] Y
         ... Success!
        
        Normally, root should only be allowed to connect from 'localhost'.  This
        ensures that someone cannot guess at the root password from the network.
        
        Disallow root login remotely? [Y/n] n
         ... skipping.
        
        By default, MySQL comes with a database named 'test' that anyone can
        access.  This is also intended only for testing, and should be removed
        before moving into a production environment.
        
        Remove test database and access to it? [Y/n] Y
         - Dropping test database...
         ... Success!
         - Removing privileges on test database...
         ... Success!
        
        Reloading the privilege tables will ensure that all changes made so far
        will take effect immediately.
        
        Reload privilege tables now? [Y/n] Y
         ... Success!
        
        Cleaning up...
        
        All done!  If you've completed all of the above steps, your MySQL
        installation should now be secure.
        
        Thanks for using MySQL!
        

    • Step - Install cloudera-manager-daemons and cloudera-manager-server on the Cloudera Manager server
    If you have a previous installation of the Cloudera scm and monitoring database schemas you may want to drop these (depending on whether you are starting fresh).
    If so, drop the following Cloudera database schemas: amon, hive, hmon, rman, scm, smon
    And also drop the scm user if it exists (you can leave the amon, hive, hmon, rman and smon users in the instance).
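
If you do want to start fresh, the schema list above can be turned into SQL for review before anything is run. A minimal sketch: the function name `drop_schema_sql` is made up, it only prints statements, and it leaves the scm user alone because the host part of that account depends on how it was created.

```shell
#!/bin/sh
# Print DROP statements for the Cloudera schemas listed above so they can be
# reviewed before being piped into mysql. Nothing is executed here.
drop_schema_sql() {
    for db in amon hive hmon rman scm smon; do
        printf 'DROP DATABASE IF EXISTS %s;\n' "$db"
    done
}

# Example, once you are happy with the output:
#   drop_schema_sql | mysql -u root -p
```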

    Check which Cloudera packages are available in the repo.

    # yum search cloudera
    Loaded plugins: fastestmirror, security
    Loading mirror speeds from cached hostfile
    ============================================================================================== N/S Matched: cloudera ==============================================================================================
    cloudera-manager-agent.x86_64 : The Cloudera Manager Agent
    cloudera-manager-server.x86_64 : The Cloudera Manager Server
    cloudera-manager-server-db.x86_64 : Embedded database for the Cloudera Manager Server
    cloudera-manager-daemons.x86_64 : Provides daemons for monitoring Hadoop and related tools.
    cloudera-manager-parcel-4.6.0.x86_64 : All the CM bits in one relocatable package

      Name and summary matches only, use "search all" for everything.

    # yum install cloudera-manager-daemons
    Loaded plugins: fastestmirror, security
    Loading mirror speeds from cached hostfile
    Setting up Install Process
    Resolving Dependencies
    --> Running transaction check
    ---> Package cloudera-manager-daemons.x86_64 0:4.6.0-1.cm460.p0.141 will be installed
    --> Finished Dependency Resolution

    Dependencies Resolved

    ===================================================================================================================================================================================================================
     Package                                                   Arch                                    Version                                                 Repository                                         Size
    ===================================================================================================================================================================================================================
    Installing:
     cloudera-manager-daemons                                  x86_64                                  4.6.0-1.cm460.p0.141                                    cloudera-manager                                  135 M

    Transaction Summary
    ===================================================================================================================================================================================================================
    Install       1 Package(s)

    Total download size: 135 M
    Installed size: 175 M
    Is this ok [y/N]: y
    Downloading Packages:
    cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64.rpm                                                                                                                                    | 135 MB     00:02     
    Running rpm_check_debug
    Running Transaction Test
    Transaction Test Succeeded
    Running Transaction
      Installing : cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64                                                                                                                                            1/1 
      Verifying  : cloudera-manager-daemons-4.6.0-1.cm460.p0.141.x86_64                                                                                                                                            1/1 

    Installed:
      cloudera-manager-daemons.x86_64 0:4.6.0-1.cm460.p0.141                                                                                                                                                          

    Complete!
    # yum install cloudera-manager-server
    Loaded plugins: fastestmirror, security
    Loading mirror speeds from cached hostfile
    Setting up Install Process
    Resolving Dependencies
    --> Running transaction check
    ---> Package cloudera-manager-server.x86_64 0:4.6.0-1.cm460.p0.141 will be installed
    --> Finished Dependency Resolution

    Dependencies Resolved

    ===================================================================================================================================================================================================================
     Package                                                  Arch                                    Version                                                  Repository                                         Size
    ===================================================================================================================================================================================================================
    Installing:
     cloudera-manager-server                                  x86_64                                  4.6.0-1.cm460.p0.141                                     cloudera-manager                                  7.6 k

    Transaction Summary
    ===================================================================================================================================================================================================================
    Install       1 Package(s)

    Total download size: 7.6 k
    Installed size: 8.8 k
    Is this ok [y/N]: y
    Downloading Packages:
    cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64.rpm                                                                                                                                     | 7.6 kB     00:00     
    Running rpm_check_debug
    Running Transaction Test
    Transaction Test Succeeded
    Running Transaction
      Installing : cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64                                                                                                                                             1/1 
      Verifying  : cloudera-manager-server-4.6.0-1.cm460.p0.141.x86_64                                                                                                                                             1/1 

    Installed:
      cloudera-manager-server.x86_64 0:4.6.0-1.cm460.p0.141                                                                                                                                                            

    Complete!


    Edited the agent's config.ini to add the name of the server hosting the Cloudera Manager server service.
    # grep -i server /etc/cloudera-scm-agent/config.ini 
    # Hostname of Cloudera SCM Server
    server_host=.uk.pri.o2.com
    # Port that server is listening on
    server_port=7182
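
With more than a handful of agents, the server_host edit is worth scripting. A minimal sketch that rewrites only the server_host line and leaves everything else in the file alone; `set_server_host` and the host name in the example are hypothetical.

```shell
#!/bin/sh
# Point a Cloudera Manager agent config at a given server host.
# Arguments: path to config.ini, host name of the Cloudera Manager server.
set_server_host() {
    conf="$1"
    host="$2"
    tmp="$(mktemp)" || return 1
    # Rewrite only the server_host= line; every other line passes through.
    if sed "s|^server_host=.*|server_host=$host|" "$conf" > "$tmp"; then
        cat "$tmp" > "$conf"   # write back in place, keeping owner and mode
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (hypothetical host name):
#   set_server_host /etc/cloudera-scm-agent/config.ini cm.example.internal
```

I avoided `sed -i` here only so the sketch does not depend on GNU sed.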

    Step - Prepare the databases
    Run the scm database create and user create script (initially set the user passwd to scm but change later)
    /usr/share/cmf/schema/scm_prepare_database.sh mysql -u root -p scm scm scm 2>&1 | tee scm_prepare_database8.log
    Verifying that we can write to /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server
    log4j:ERROR Could not find value for key log4j.appender.A
    log4j:ERROR Could not instantiate appender named "A".
    Creating SCM configuration file in /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server
    Executing:  /usr/java/jdk1.6.0_45/bin/java -cp /usr/share/java/mysql-connector-java.jar:/opt/cloudera-manager/cm-4.6.0/share/cmf/schema/../lib/* com.cloudera.enterprise.dbutil.DbCommandExecutor /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-server/db.properties com.cloudera.cmf.db.
    log4j:ERROR Could not find value for key log4j.appender.A
    log4j:ERROR Could not instantiate appender named "A".
    [2013-06-17 11:05:25,426] INFO     0[main] - com.cloudera.enterprise.dbutil.DbCommandExecutor.testDbConnection(DbCommandExecutor.java:231) - Successfully connected to database.
    All done, your SCM database is configured correctly!

    The log4j errors appear but don't seem to be harmful.

    Create the monitoring databases

    #!/bin/bash -x
    # Settings based on the article in 
    # http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Enterprise-Edition-Installation-Guide/cmeeig_topic_5_5.html#cmeeig_topic_5_5
    # might want to check this post too http://forums.mysql.com/read.php?22,274891,274891

    NOW=`date +"%Y%m%d%H%M%S"`
    MYSQLCONF=/etc/my.cnf

    # Replace each placeholder with the real password (keep the quotes).
    amon_password='<amon password>'
    smon_password='<smon password>'
    rman_password='<rman password>'
    hmon_password='<hmon password>'
    hive_password='<hive password>'
    nav_password='<nav password>'

    echo "can login automatically if you have password in the ~/.my.cnf file"
    mysql -u root <<EOF
    create database amon DEFAULT CHARACTER SET utf8;
    grant all on amon.* TO 'amon'@'%' IDENTIFIED BY '$amon_password';
    create database smon DEFAULT CHARACTER SET utf8;
    grant all on smon.* TO 'smon'@'%' IDENTIFIED BY '$smon_password';
    create database rman DEFAULT CHARACTER SET utf8;
    grant all on rman.* TO 'rman'@'%' IDENTIFIED BY '$rman_password';
    create database hmon DEFAULT CHARACTER SET utf8;
    grant all on hmon.* TO 'hmon'@'%' IDENTIFIED BY '$hmon_password';
    create database hive DEFAULT CHARACTER SET utf8;
    grant all on hive.* TO 'hive'@'%' IDENTIFIED BY '$hive_password';
    create database nav DEFAULT CHARACTER SET utf8;
    grant all on nav.* TO 'nav'@'%' IDENTIFIED BY '$nav_password';
    flush privileges;
    EOF
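
The repeated create/grant pairs in the script above are easy to get subtly wrong by hand, so they could be generated in a loop instead. A minimal sketch, assuming the database name and the user name are always the same, as they are above; `service_db_sql` is a made-up name and the passwords in the example are placeholders.

```shell
#!/bin/sh
# Print the CREATE DATABASE / GRANT pair for one Cloudera service database.
# Arguments: database (and user) name, password.
service_db_sql() {
    printf 'create database %s DEFAULT CHARACTER SET utf8;\n' "$1"
    printf "grant all on %s.* TO '%s'@'%%' IDENTIFIED BY '%s';\n" "$1" "$1" "$2"
}

# Example: build the statements for all six databases in one loop.
#   for db in amon smon rman hmon hive nav; do
#       service_db_sql "$db" "changeme-$db"
#   done | mysql -u root
```

A password containing a single quote would break the generated SQL, so this is only safe for passwords you choose yourself.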

    When changing user passwds in MySQL remember to flush the privileges for them to take effect:
    mysql> flush privileges;

    Step - Start the Cloudera Manager server

    # service cloudera-scm-server start

    Step - install packages in the cluster

    Now use the Cloudera Manager UI to install packages across the other nodes in the cluster

    Installed … see screenshots (still need to add these)

    You need to fill these (or your equivalent ones) in the entry fields in the set up wizard.
    http://<local httpd root>/cdh4
    http://<local httpd root>/cdh4,http://<local httpd root>/impala
    http://<local httpd root>/cm
    http://<local httpd root>/cm/RPM-GPG-KEY-cloudera

    Niggles ...
    Zookeeper perms in the
    Wasn't able to create dirs in new dir
    So initialized zookeeper from the menu somewhere (could also have set allow to create setting)

    We had issues with lack of space in /var so did the following:
    Renamed the hdfs log /var/log/hadoop-hdfs to /data/var/log/hadoop-hdfs
    Renamed the scm logs to /data/var…
    Renamed mapred logs to /data/var…
    For some, preferred to lower the warning/critical thresholds in the UI.

    Step - Install impala

    Use UI menu to add service
    Mentioned config changes to support Impala
    Permissions changed from 700 to 755 so that Impala could read the Hadoop blocks directly
    Noted hdfs flag on console showing outdated config
    Restart hdfs
    Restart impala

    Note - if stopping/starting everything, all the dependencies are taken care of.

    Step - Post implementation bits

    Create a user area for my user

    $ hdfs dfs -ls /
    Found 2 items
    drwxrwxrwt   - hdfs supergroup          0 2013-06-17 15:03 /tmp
    drwxr-xr-x   - hdfs supergroup          0 2013-06-17 15:01 /user

    $ hdfs dfs -mkdir /user/<myuser>

    $ hdfs dfs -chown <myuser>:<myuser> /user/<myuser>

    Benchmark testing - TeraGen & TeraSort
    (more benchmark testing in this article)

    $ time hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.3.0.jar teragen -Dmapred.map.tasks=60 1000000000 tera/in

    $ time hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.3.0.jar terasort -Dmapred.reduce.tasks=60 tera/in tera/out
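
For a sense of scale: TeraGen writes fixed-size 100-byte rows, so the size of the data set follows directly from the row count passed on the command line. The arithmetic below is a quick sketch; the function names are mine, and the per-map figure assumes the rows are split evenly across the 60 map tasks requested above.

```shell
#!/bin/sh
# TeraGen rows are 100 bytes each, so total bytes = rows * 100.
teragen_bytes() {
    echo $(( $1 * 100 ))
}

# Bytes written per map task if the rows are split evenly (integer division).
teragen_bytes_per_map() {
    echo $(( $1 * 100 / $2 ))
}

teragen_bytes 1000000000              # 100000000000 bytes, i.e. 100 GB
teragen_bytes_per_map 1000000000 60   # 1666666666 bytes, about 1.67 GB
```

So the run above generates and then sorts roughly 100 GB, before HDFS replication is taken into account.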

    Think about scripting everything for our environment using the Cloudera Manager APIs.


    Step - Installing and using lzo

    Follow these instructions or, as this post describes, use the Cloudera software here and follow the instructions in the Cloudera documentation.

    Saturday 1 June 2013

    Abortive (came close but ultimately failed :() installation of Cloudera Manager and Cloudera software without internet access (ie Cloudera Path C approach)

    Note - this was an abortive attempt using Path C.
    Read here for some notes re Path B installation approach.

    Old abortive Path C approach below ...

    Up to now, we have not been using Cloudera Manager.
    Then again, we have a really simple system running a small HDFS cluster.
    Ganglia and simple scripting have done the job for us.

    Now we have bought Cloudera support and it makes sense to add their Cloudera Manager layer.
    Why? It seems like this simplifies the support (will give examples soon).

    Our system lives in a DMZ and has no direct internet access. It has servers with the following specs:
    • 2 x quad core Xeon 2.1GHz CPUs, 48GB RAM, 4 x 2TB hard disks and a pair of 1G NICs
    • CentOS 6.3
    • Latest stable release of Cloudera at the time 4.2.1 (Cloudera Manager 4.5.3)
    Problem.
    The free edition of Cloudera Manager found here is not useful if you don't have a connection to the internet. Installing Cloudera Manager without internet access is a pain. I was speaking to a consultant from Cloudera who strongly urged me to get our firewalls opened up to allow access to archive.cloudera.com from at least one server in the walled off environment. He said this would help if we needed to patch the environment quickly. Anyway, we have this constraint now so that is the starting point for this article.

    Note - I have not yet completed this exercise but will do a write up as soon as I have got this working.

    Here are some places to start ...

    I started here in cmeeig_topic_6_1 but there wasn't enough information in this article.
    From that document, I found myself going round in circles in the Cloudera documentation and not getting anywhere.

    For a repository of tarballs including the Cloudera Manager, use this link
    The instructions I've followed below are based on Appendix C in this article (you might also cross reference this guide in cmeeig_topic_21_5 for installing parcels).
    The Cloudera documentation moves around a bit - so this is following the Installation Path C ie installing from Tarballs. This route (ie Path C) is not the preferred route. It makes it difficult to get updates quickly and efficiently (chasing package dependencies is painful). So consider carefully before going this route rather than a direct connection from the cloudera manager server to archive.cloudera.com.

    Installation steps used:
    1. Download
      • Java - downloaded
        • Latest JDK 6 - use the rpm version because if you move back to the Cloudera Manager installer with an internet connection, it will check the rpm repository to determine whether java is installed (or so it seems). So look here for Java JDK versions. I downloaded this one: http://download.oracle.com/otn-pub/java/jdk/6u45-b06/jdk-6u45-linux-x64-rpm.bin. If you want the version certified by Cloudera, you'll need to look for the correct older version, but the documentation suggests newer versions should be fine.
      • MySQL (or PostgreSQL or Oracle) - downloaded the following:
        • MySQL-server-5.5.22-1.linux2.6.x86_64.rpm
        • MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
      • Cloudera Manager and CDH - download the latest Cloudera Manager tarball and CDH parcel
    2. Install Java JDK - skip this, as Cloudera Manager will install it
      • java -showversion
      • java version "1.6.0_24"
      • OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
    3. Set up database
      • Got a choice of databases - PostgreSQL, MySQL or Oracle - we had already set up MySQL for Hive so we decided to create another database for monitoring and reporting using MySQL. But all three of these were options for us. Interestingly, PostgreSQL comes first in the documentation. Not sure if this is a preference.
      • Run the downloaded mySQL rpms - here's what I did on my system
        • rpm -i ./MySQL-server-5.5.22-1.linux2.6.x86_64.rpm --replacefiles
        • rpm -i ./MySQL-client-5.5.22-1.linux2.6.x86_64.rpm
        • including the following
          • mkdir -p /mysql_mon_rep_data
          • chown -R mysql:mysql /mysql_mon_rep_data
          • chmod 755 /mysql_mon_rep_data
          • mkdir -p /var/log/mysql/logs/binary/mysql_binary_log
          • chown -R mysql:mysql /var/log/mysql
          • mkdir -p /usr/share/java/
          • cp -ip mysql-connector-java-5.1.18-bin.jar /usr/share/java/mysql-connector-java.jar  # copied from earlier download I had for Hive repository. This can be found on the Oracle mySQL downloads site
          • chown mysql:mysql /usr/share/java/mysql-connector-java.jar
      • Moved old version of MySQL out the way (ie that was in /var/lib/mysql_data)
      • Change the /etc/my.cnf MySQL config file to include the configuration settings as per the documentation. I foolishly changed the bind_address to be equal to the server name rather than leaving it out and keeping the default of localhost (need to verify this, but it led me on a wild goose chase, and the Cloudera consultant was very good at keeping to the script in their documentation).
      • Stopped and restarted the mysql daemon (note - looks like CentOS 6.3 uses mysql and not mysqld as the daemon)
      • This builds a new set of database files (if you are upgrading your database - do something different) ... but it fails with the following error (for me at least) because I had moved the files from the default location. So I had to run the following to get it to work.
      • # /usr/bin/mysql_secure_installation
        
        NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MySQL
              SERVERS IN PRODUCTION USE!  PLEASE READ EACH STEP CAREFULLY!
        
        In order to log into MySQL to secure it, we'll need the current
        password for the root user.  If you've just installed MySQL, and
        you haven't set the root password yet, the password will be blank,
        so you should just press enter here.
        
        Enter current password for root (enter for none): 
        OK, successfully used password, moving on...
        
        Setting the root password ensures that nobody can log into the MySQL
        root user without the proper authorisation.
        
        Set root password? [Y/n] Y
        New password: 
        Re-enter new password: 
        Password updated successfully!
        Reloading privilege tables..
         ... Success!
        
        
        By default, a MySQL installation has an anonymous user, allowing anyone
        to log into MySQL without having to have a user account created for
        them.  This is intended only for testing, and to make the installation
        go a bit smoother.  You should remove them before moving into a
        production environment.
        
        Remove anonymous users? [Y/n] Y
         ... Success!
        
        Normally, root should only be allowed to connect from 'localhost'.  This
        ensures that someone cannot guess at the root password from the network.
        
        Disallow root login remotely? [Y/n] n
         ... skipping.
        
        By default, MySQL comes with a database named 'test' that anyone can
        access.  This is also intended only for testing, and should be removed
        before moving into a production environment.
        
        Remove test database and access to it? [Y/n] Y
         - Dropping test database...
         ... Success!
         - Removing privileges on test database...
         ... Success!
        
        Reloading the privilege tables will ensure that all changes made so far
        will take effect immediately.
        
        Reload privilege tables now? [Y/n] Y
         ... Success!
        
        Cleaning up...
        
        All done!  If you've completed all of the above steps, your MySQL
        installation should now be secure.
        
        Thanks for using MySQL!
        
      • Set the various mySQL account passwords
      • amon_password=insertpassword
        smon_password=insertpassword
        rman_password=insertpassword
        hmon_password=insertpassword
        hive_password=insertpassword
        nav_password=insertpassword
        mysql -u root -p <<EOF
        create database amon DEFAULT CHARACTER SET utf8;
        grant all on amon.* TO 'amon'@'%' IDENTIFIED BY '$amon_password';
        create database smon DEFAULT CHARACTER SET utf8;
        grant all on smon.* TO 'smon'@'%' IDENTIFIED BY '$smon_password';
        create database rman DEFAULT CHARACTER SET utf8;
        grant all on rman.* TO 'rman'@'%' IDENTIFIED BY '$rman_password';
        create database hmon DEFAULT CHARACTER SET utf8;
        grant all on hmon.* TO 'hmon'@'%' IDENTIFIED BY '$hmon_password';
        create database hive DEFAULT CHARACTER SET utf8;
        grant all on hive.* TO 'hive'@'%' IDENTIFIED BY '$hive_password';
        create database nav DEFAULT CHARACTER SET utf8;
        grant all on nav.* TO 'nav'@'%' IDENTIFIED BY '$nav_password';
        EOF
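Since the six create/grant pairs differ only in the name, they can be generated from one list, which keeps the database name and the account name in step. A minimal sketch, assuming every account uses the same name as its database (as in the statements above). emit_db_sql is a made-up helper name and "insertpassword" is the same placeholder used above.

```shell
#!/usr/bin/env bash
# Sketch: print the create/grant statements for each monitoring database.
# emit_db_sql is an illustrative helper, not part of MySQL or Cloudera.
# Note: a password containing a single quote would break the generated SQL.

emit_db_sql() {
    # $1 = database name (also used as the account name), $2 = password
    printf "create database %s DEFAULT CHARACTER SET utf8;\n" "$1"
    printf "grant all on %s.* TO '%s'@'%%' IDENTIFIED BY '%s';\n" "$1" "$1" "$2"
}

for db in amon smon rman hmon hive nav; do
    emit_db_sql "$db" "insertpassword"
done
```

The script only prints the SQL, so it can be reviewed first and then piped into the mysql client. The GRANT ... IDENTIFIED BY form suits the MySQL 5.5 used here; it was removed in MySQL 8.0.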
    4. Install cloudera manager server 
      • mkdir /opt/cloudera-manager
      • Install the tarball in this directory  /opt/cloudera-manager. It creates cm-4.6.0/... 
      • Create a soft link to the latest Cloudera Manager tarball installation: ln -s cm-4.6.0 cm. I later removed this step because I ran into difficulties and thought the link might be the cause, but it does make sense to have a non-versioned path.
      • Create a cloudera service manager group and user account (note - an explicit uid and gid are not required, so they are left out here):
        • groupadd cloudera-scm
        • useradd --system --home=/opt/cloudera-manager/cm-4.6.0/run/cloudera-scm-server --gid cloudera-scm --no-create-home --shell=/bin/false --comment "Cloudera SCM User" cloudera-scm
      • Cloudera manager requires these directories (can be changed if required)
      • mkdir -p /var/log/cloudera-scm-headlamp
        chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-headlamp
        mkdir -p /var/log/cloudera-scm-firehose
        chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-firehose
        mkdir -p /var/log/cloudera-scm-alertpublisher
        chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-alertpublisher
        mkdir -p /var/log/cloudera-scm-eventserver
        chown cloudera-scm:cloudera-scm /var/log/cloudera-scm-eventserver
        mkdir -p /var/lib/cloudera-scm-headlamp
        chown cloudera-scm:cloudera-scm /var/lib/cloudera-scm-headlamp
        mkdir -p /var/lib/cloudera-scm-firehose
        chown cloudera-scm:cloudera-scm /var/lib/cloudera-scm-firehose
        mkdir -p /var/lib/cloudera-scm-alertpublisher
        chown cloudera-scm:cloudera-scm /var/lib/cloudera-scm-alertpublisher
        mkdir -p /var/lib/cloudera-scm-eventserver
        chown cloudera-scm:cloudera-scm /var/lib/cloudera-scm-eventserver
      • Configure the /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-agent/config.ini file. Note initially you only need to change the server_host variable to be the name or ip address of the server you are running your cloudera manager server on. You will send this to the cloudera manager agents once you have installed the cloudera manager software there.
      • Create the scm database in MySQL. The Cloudera documentation assumes the database is on a remote server and logs in through a temporary user (because root can't log in remotely, I think). As my MySQL database runs on the cloudera manager server itself, I used: /opt/cloudera-manager/cmf/schema/scm_prepare_database.sh mysql -u root -p scm scm scm (the -h option defaults to localhost)
      • Start the cloudera manager server  /opt/cloudera-manager/cm-4.6.0/etc/init.d/cloudera-scm-server start
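The server_host change in the agent config.ini can be scripted rather than edited by hand. This sketch assumes the file holds a line starting with "server_host=", as the step above implies. The helper name set_server_host, the host name cm-server.example.com and the contents of the demo file are all made up for illustration, and the demo works on a temporary file rather than the real config. It uses "sed -i" as provided by GNU sed on CentOS.

```shell
#!/usr/bin/env bash
# Sketch: point a Cloudera Manager agent config at the manager server.
# Assumption: config.ini contains a line of the form "server_host=...".
# set_server_host and the demo values below are illustrative only.

set_server_host() {
    # $1 = path to config.ini, $2 = host name or IP of the manager server
    sed -i "s/^server_host=.*/server_host=$2/" "$1"
}

# Demo on a throwaway file, so the real config is not touched.
cfg=$(mktemp)
printf '[General]\nserver_host=localhost\nserver_port=7182\n' > "$cfg"
set_server_host "$cfg" cm-server.example.com
grep '^server_host=' "$cfg"
rm -f "$cfg"
```

On the real system, the first argument would be /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-agent/config.ini.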
    5. Prepare the cluster
      • Do this at server build time for least admin overhead ...
      • Generate a ssh rsa key pair on the cloudera manager server
      • On each server running the agent, as root (or sudo user):
        • cd; mkdir -p .ssh; cd .ssh; vi authorized_keys; chmod 700 . ; chmod 600 authorized_keys  # in vi, add the ssh public key created above
      • On the cloudera manager server, create a file called host_list with the names of the servers in the cluster
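The authorized_keys step can also be done without an editor. A minimal sketch, assuming the public key (for example one made with "ssh-keygen -t rsa" on the cloudera manager server) has already been copied to the target server as a file. install_pubkey is a made-up helper name, and it relies on nothing beyond mkdir, cat and chmod.

```shell
#!/usr/bin/env bash
# Sketch: append a public key to an account's authorized_keys file and
# set the permissions sshd expects. install_pubkey is an illustrative name.

install_pubkey() {
    # $1 = home directory of the account, $2 = file holding the public key
    mkdir -p "$1/.ssh"
    cat "$2" >> "$1/.ssh/authorized_keys"
    chmod 700 "$1/.ssh"
    chmod 600 "$1/.ssh/authorized_keys"
}

# Example call (as root on each server that will run the agent):
#   install_pubkey /root /tmp/id_rsa.pub
```

Appending with >> keeps any keys that are already in the file.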
    6. Install the cloudera manager agents (my servers had a bit of nfs on each server, but you could equally scp the tarball to the server and then install it) and copy the changed agent config.ini (see step 4 where we prep'd this)
      • Build the cloudera manager software (note - entries in host_list that are commented out with # are excluded from the operation by the grep -v '^#')
        • for h in `cat host_list | grep -v '^#' | awk ' { print $1 } '`; do   ssh -q $h 'hostname; [ ! -d /mnt/nfs/vol1/packages ] && echo No_NFS && exit; mkdir -p /opt/cloudera-manager; cd /opt/cloudera-manager; tar xvfz /mnt/nfs/vol1/packages/cloudera-manager-el6-cm4.6.0_x86_64.tar.gz; cd /opt; chown -R root:root cloudera-manager '; scp /opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-agent/config.ini root@$h:/opt/cloudera-manager/cm-4.6.0/etc/cloudera-scm-agent ; done
      • Start and check the status of the agents in the cluster
        • for h in `cat host_list | grep -v '^#' | awk ' { print $1 } '`; do   ssh -q $h 'hostname; /opt/cloudera-manager/cm-4.6.0/etc/init.d/cloudera-scm-agent start; ' ; done
        • for h in `cat host_list | grep -v '^#' | awk ' { print $1 } '`; do   ssh -q $h 'hostname; /opt/cloudera-manager/cm-4.6.0/etc/init.d/cloudera-scm-agent status; ' ; done
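The three loops above share the same host_list filter, so it can be pulled into a helper. A sketch only: active_hosts and for_each_host are made-up names, and the host_list format (host name in the first column, # to comment a line out) is taken from the loops above. Blank lines are skipped as well.

```shell
#!/usr/bin/env bash
# Sketch: run a command on every active host in host_list.
# active_hosts and for_each_host are illustrative helper names.

active_hosts() {
    # $1 = host list file; prints the first field of each line that is
    # neither commented out with # nor blank
    grep -v '^#' "$1" | awk 'NF { print $1 }'
}

for_each_host() {
    # $1 = host list file, remaining arguments = command to run remotely
    local list="$1"
    shift
    for h in $(active_hosts "$list"); do
        echo "== $h"
        ssh -q "$h" "$@"
    done
}

# Example call:
#   for_each_host host_list /opt/cloudera-manager/cm-4.6.0/etc/init.d/cloudera-scm-agent status
```

Running "active_hosts host_list" on its own first is a cheap way to check which servers a loop is about to touch.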
    7. Try to connect to the cloudera manager server
      • I had to use port forwarding to get round firewalling
        • ssh -L 7180:localhost:7180 <user>@<cloudera manager server>
      • Using browser http://localhost:7180
    8. Hit a problem installing the Cloudera packages so changed tack and opted to install a local cloudera repository ... download the latest cloudera repository archive.cloudera.com/cm4/repo-as-tarball/4.6.0/cm4.6.0-centos6.tar.gz
    9. Needed CentOS packages to install the apache httpd web server on a server (I put mine on the cloudera manager server) to allow the servers to install from this repository. Change the DocumentRoot in the httpd config settings in /etc/httpd/conf/httpd.conf to DocumentRoot "/opt/cloudera/yum-repo" and point the matching <Directory> section at the same path.
      • # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
      • warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • error: Failed dependencies:
      • /etc/mime.types is needed by httpd-2.2.15-9.el6.centos.x86_64
      • apr-util-ldap is needed by httpd-2.2.15-9.el6.centos.x86_64
      • httpd-tools = 2.2.15-9.el6.centos is needed by httpd-2.2.15-9.el6.centos.x86_64
      • libaprutil-1.so.0()(64bit) is needed by httpd-2.2.15-9.el6.centos.x86_64
      • # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm 
      • warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • error: Failed dependencies:
      • libaprutil-1.so.0()(64bit) is needed by httpd-tools-2.2.15-9.el6.centos.x86_64
      • # rpm -i apr-util-1.3.9-3.el6_0.1.x86_64.rpm 
      • warning: apr-util-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
      • # rpm -i httpd-tools-2.2.15-9.el6.centos.x86_64.rpm 
      • warning: httpd-tools-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • # rpm -i apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm 
      • warning: apr-util-ldap-1.3.9-3.el6_0.1.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
      • # rpm -i mailcap-2.1.31-2.el6.noarch.rpm
      • warning: mailcap-2.1.31-2.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • # rpm -i httpd-2.2.15-9.el6.centos.x86_64.rpm
      • warning: httpd-2.2.15-9.el6.centos.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
    10. Download and install createrepo rpm
      • Download from vault.centos.org in my case vault.centos.org/6.3/os/x86_64/Packages/
        • # rpm -i createrepo-0.9.8-5.el6.noarch.rpm 
        • warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
        • error: Failed dependencies:
        • deltarpm is needed by createrepo-0.9.8-5.el6.noarch
        • python-deltarpm is needed by createrepo-0.9.8-5.el6.noarch
        • # rpm -i deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm 
        • warning: deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
        • # rpm -i python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm 
        • warning: python-deltarpm-3.5-0.5.20090913git.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID c105b9de: NOKEY
        • # rpm -i createrepo-0.9.8-5.el6.noarch.rpm 
        • warning: createrepo-0.9.8-5.el6.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
      • Install a cloudera yum repository
        • # cd /opt/cloudera
        • # mkdir yum-repo
        • # cd yum-repo
        • # tar xvfz /cm4.6.0-centos6.tar.gz
        • # cd cm
        • # createrepo .
        • 14/14 - 4.6.0/RPMS/x86_64/enterprise-debuginfo-4.6.0-1.cm460.p0.141.x86_64.rpm  
        • Saving Primary metadata
        • Saving file lists metadata
        • Saving other metadata
        • make the following repo file
        • # cat /etc/yum.repos.d/clouderarepo.repo 
        • [clouderarepo]
        • name=clouderarepo
        • baseurl=http://<local httpd root>/cm
        • enabled=1
        • gpgcheck=0
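The repo file shown above has to exist on every server that installs from the local repository, so it can help to generate it from a script. A minimal sketch: write_repo_file is a made-up helper name, and the base URL is passed in because the real host depends on where httpd runs.

```shell
#!/usr/bin/env bash
# Sketch: write a yum .repo file that points at the local Cloudera repository.
# write_repo_file is an illustrative name; the section matches the file above.

write_repo_file() {
    # $1 = path of the .repo file to write, $2 = base URL of the repository
    cat > "$1" <<EOF
[clouderarepo]
name=clouderarepo
baseurl=$2
enabled=1
gpgcheck=0
EOF
}

# Example call (as root):
#   write_repo_file /etc/yum.repos.d/clouderarepo.repo "http://<local httpd root>/cm"
```

Afterwards "yum clean all" followed by "yum repolist" should list clouderarepo. Note that gpgcheck=0 turns off package signature checking, which is only reasonable for a repository you built yourself.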

    A bit disappointing not to get this Path C to work.
    Tried hard but kept hitting hassles that we never managed to resolve.
    We were close ... not sure whether I'll have the opportunity to return to this and retry.

    So we ended up using Path B documented here.