Friday 30 November 2012

HDFS Misc

HDFS configuration

LZO compression

Good overview of LZO setup here

LZO compression notes (on Centos 6.1)


Install these packages:
lzo-2.03-3.1.el6.x86_64.rpm
lzop-1.02-0.9.rc1.el6.x86_64.rpm
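
A minimal install sketch (assuming the two RPMs above are sitting in the current directory):

$ sudo rpm -ivh lzo-2.03-3.1.el6.x86_64.rpm lzop-1.02-0.9.rc1.el6.x86_64.rpm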
Check:
$ rpm -qa | grep lzo
lzo-2.03-3.1.el6.x86_64
lzop-1.02-0.9.rc1.el6.x86_64

# Next unpack the tarball lzo-hadoop.tar.gz 
cd /mnt/nfs/vol1/packages
tar tvfz lzo-hadoop.tar.gz
drwx------ root/root         0 2012-04-12 11:30 native/
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/lib/
-rwx------ root/root     67841 2012-04-12 11:40 native/Linux-amd64-64/lib/libgplcompression.so.0.0.0
lrwxrwxrwx root/root         0 2012-04-12 11:40 native/Linux-amd64-64/lib/libgplcompression.so.0 -> libgplcompression.so.0.0.0
lrwxrwxrwx root/root         0 2012-04-12 11:40 native/Linux-amd64-64/lib/libgplcompression.so -> libgplcompression.so.0.0.0
-rw------- root/root      1124 2012-04-12 11:40 native/Linux-amd64-64/lib/libgplcompression.la
-rw-r--r-- root/root    104238 2012-04-12 11:40 native/Linux-amd64-64/lib/libgplcompression.a
-rwx------ root/root    257940 2012-04-12 11:40 native/Linux-amd64-64/libtool
-rw------- root/root     18384 2012-04-12 11:40 native/Linux-amd64-64/Makefile
-rwx------ root/root     59681 2012-04-12 11:40 native/Linux-amd64-64/config.status
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/
-rw------- root/root       332 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/LzoDecompressor.lo
-rw------- root/root     49048 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/LzoDecompressor.o
-rw------- root/root     54400 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/LzoCompressor.o
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.libs/
-rw------- root/root     49048 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.libs/LzoDecompressor.o
-rw------- root/root     54400 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.libs/LzoCompressor.o
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.deps/
-rw------- root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.deps/.dirstamp
-rw------- root/root      3803 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.deps/LzoDecompressor.Plo
-rw------- root/root      3799 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.deps/LzoCompressor.Plo
-rw------- root/root         0 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/.dirstamp
-rw------- root/root       326 2012-04-12 11:40 native/Linux-amd64-64/impl/lzo/LzoCompressor.lo
-rw------- root/root        28 2012-04-12 11:40 native/Linux-amd64-64/impl/stamp-h1
-rw------- root/root      4324 2012-04-12 11:40 native/Linux-amd64-64/impl/config.h
drwx------ root/root         0 2012-04-12 11:40 native/Linux-amd64-64/.libs/
-rwx------ root/root     67841 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.so.0.0.0
lrwxrwxrwx root/root         0 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.so.0 -> libgplcompression.so.0.0.0
lrwxrwxrwx root/root         0 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.so -> libgplcompression.so.0.0.0
lrwxrwxrwx root/root         0 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.la -> ../libgplcompression.la
-rw------- root/root      1124 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.lai
-rw------- root/root    104238 2012-04-12 11:40 native/Linux-amd64-64/.libs/libgplcompression.a
drwx------ root/root         0 2012-04-12 11:30 native/Linux-amd64-64/src/
drwx------ root/root         0 2012-04-12 11:30 native/Linux-amd64-64/src/com/
drwx------ root/root         0 2012-04-12 11:30 native/Linux-amd64-64/src/com/hadoop/
drwx------ root/root         0 2012-04-12 11:30 native/Linux-amd64-64/src/com/hadoop/compression/
drwx------ root/root         0 2012-04-12 11:38 native/Linux-amd64-64/src/com/hadoop/compression/lzo/
-rw------- root/root      1423 2012-04-12 11:40 native/Linux-amd64-64/src/com/hadoop/compression/lzo/com_hadoop_compression_lzo_LzoDecompressor.h
-rw------- root/root      1398 2012-04-12 11:40 native/Linux-amd64-64/src/com/hadoop/compression/lzo/com_hadoop_compression_lzo_LzoCompressor.h
-rw------- root/root      1123 2012-04-12 11:40 native/Linux-amd64-64/libgplcompression.la
-rw------- root/root     34625 2012-04-12 11:40 native/Linux-amd64-64/config.log
-rw------- root/root     62240 2012-04-12 11:40 hadoop-lzo-0.4.15.jar

cd ${HADOOP_HOME}/lib
tar xvfz lzo-hadoop.tar.gz
# Get the permissions right
chmod 640 hadoop-lzo-0.4.15.jar
cd native
chown -R hadoop:hadoop Linux-amd64-64
cd Linux-amd64-64
find . -type f -exec chmod 640 {} \;
find . -type d -exec chmod 750 {} \;

Note - this caught me out: Hadoop looks for libgplcompression.so directly under native/Linux-amd64-64, but the tarball puts the libraries in a lib/ subdirectory, so add symlinks:
cd ${HADOOP_HOME}/lib/native/Linux-amd64-64
-bash-4.1$ ln -s ./lib/libgplcompression.so.0.0.0 libgplcompression.so
-bash-4.1$ ln -s ./lib/libgplcompression.so.0.0.0 libgplcompression.so.0
-bash-4.1$ ls -atl libg*
lrwxrwxrwx 1 hadoop hadoop   32 Nov 30 13:43 libgplcompression.so.0 -> ./lib/libgplcompression.so.0.0.0
lrwxrwxrwx 1 hadoop hadoop   32 Nov 30 13:42 libgplcompression.so -> ./lib/libgplcompression.so.0.0.0
-rw-r----- 1 hadoop hadoop 1123 Apr 12  2012 libgplcompression.la


# Add the following properties to ${HADOOP_HOME}/conf/core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  <description>added from build steps of lzo</description>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
  <description>added from build steps of lzo</description>
</property>
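
A quick smoke test once the config is in (a sketch: assumes lzop is on the PATH and HDFS is running; the LzoIndexer class ships in the hadoop-lzo jar):

$ echo "hello lzo" > /tmp/test.txt && lzop /tmp/test.txt
$ hadoop fs -put /tmp/test.txt.lzo /tmp/
$ hadoop jar ${HADOOP_HOME}/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /tmp/test.txt.lzo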

Wednesday 14 November 2012

Unix stuff - fsck usb h/d eg

Unix bits

fsck external hard disk (/mnt/usb) that lost power


# df -h


/dev/sdh1             1.8T  1.7T   57G  97% /mnt/usb2
/dev/sdf1             1.8T  196M  1.7T   1% /mnt/usb

# cd /mnt/usb
# ls 
ls: reading directory .: Input/output error
# fdisk -l
# fdisk -l | grep dev


Disk /dev/mapper/RootVolume-OptVol: 34.4 GB, 34359738368 bytes
Disk /dev/sdh: 2000.4 GB, 2000398934016 bytes
/dev/sdh1               1      243202  1953514583+  ee  GPT
Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
/dev/sdf1               1      243202  1953514583+  ee  GPT

# add device to fstab
# vi /etc/fstab
# cat /etc/fstab



# /mnt/usb
# df -h


/dev/sdh1             1.8T  1.7T   57G  97% /mnt/usb2
/dev/sdf1             1.8T  196M  1.7T   1% /mnt/usb

# fsck.ext3 /dev/sdf1

e2fsck 1.41.12 (17-May-2010)
/dev/sdf1 is mounted.  

WARNING!!!  The filesystem is mounted.   If you continue you ***WILL***
cause ***SEVERE*** filesystem damage.

Do you really want to continue (y/n)? yes

/dev/sdf1: recovering journal
/dev/sdf1: clean, 11/122101760 files, 7713452/488378637 blocks
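
In hindsight, the safer sequence is to unmount first - running e2fsck on a mounted filesystem risks exactly the damage the warning describes:

# umount /mnt/usb
# fsck.ext3 /dev/sdf1
# mount /mnt/usb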

Tuesday 13 November 2012

Throttling traffic via iptables


Useful link: an answer to a "throttle traffic via iptables" question

Note from Graham Hargreaves (prob based on someone else) ...


Not very accurate, but it definitely restricts the flow.
Change the modemif variable to the interface you need to throttle.
With the example below a download never went above 130kbs, and with the rate set higher it never went above 3Mbs (probably needs some more testing).

To turn on:

#!/bin/bash
modemif=eth0   # interface to throttle

# steer low-latency and web traffic into class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp -m tos --tos Minimize-Delay -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp --dport 80 -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp --dport 443 -j CLASSIFY --set-class 1:10

# root htb qdisc; anything unclassified falls into default class 1:12
tc qdisc add dev $modemif root handle 1: htb default 12
# the original note never created class 1:10 (or its parent class), so the
# first two classes below are assumed values filled in to make the script complete
tc class add dev $modemif parent 1: classid 1:1 htb rate 3mbit
tc class add dev $modemif parent 1:1 classid 1:10 htb rate 3mbit ceil 3mbit
tc class add dev $modemif parent 1:1 classid 1:12 htb rate 50kbit ceil 50kbit



To turn the above off simply run:

tc qdisc del dev eth0 root
iptables -t mangle -F POSTROUTING   # flush the CLASSIFY rules added above
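
To check the throttle is actually biting (byte/packet counters should climb on the classified rules):

tc -s class show dev eth0
iptables -t mangle -L POSTROUTING -v -n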


Thursday 12 July 2012

AWS related - notes to self

Uploading files to AWS

A python interface to AWS

e.g. code using the interface

How to decrypt S3 data before EMRing

The below discussion demonstrates how to decrypt S3 data as a bootstrap action to the EMR cluster: https://forums.aws.amazon.com/thread.jspa?threadID=50189

Another example is to use the S3 Java client side encryption in Map/Reduce jobs: http://aws.typepad.com/aws/2011/04/client-side-data-encryption-using-the-aws-sdk-for-java.html

AWS streaming job flow

See link below re how to create a streaming job flow (note: can use gzip + password as part of the streaming job)

Wednesday 11 July 2012

Installing HA - notes to self

c/o Graham H

Checking the HA


[root@dmmlw-r410-12 ~]# crm_mon


============
Last updated: Tue Jul 10 14:12:10 2012
Stack: openais
Current DC: myserver2 - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============


Online: [ myserver1 myserver2 ]


shared_ip_one   (ocf::heartbeat:IPaddr):        Started myserver1

Configuration


Install these packages:
cifs-utils-4.8.1-2.el6.x86_64.rpm
cluster-glue-1.0.5-2.el6.x86_64.rpm
cluster-glue-libs-1.0.5-2.el6.x86_64.rpm
corosync-1.2.3-36.el6.x86_64.rpm
corosynclib-1.2.3-36.el6.x86_64.rpm
corosynclib-devel-1.2.3-36.el6.x86_64.rpm
heartbeat-3.0.4-1.el6.x86_64.rpm  #from epel repo
heartbeat-libs-3.0.4-1.el6.x86_64.rpm  #from epel repo
keyutils-1.4-1.el6.x86_64.rpm
libibverbs-1.1.4-2.el6.x86_64.rpm
libmlx4-1.0.1-7.el6.x86_64.rpm
librdmacm-1.0.10-2.el6.x86_64.rpm
libtalloc-2.0.1-1.1.el6.x86_64.rpm
libtool-ltdl-2.2.6-15.5.el6.x86_64.rpm
lm_sensors-libs-3.1.1-10.el6.x86_64.rpm
net-snmp-libs-5.5-31.el6.x86_64.rpm
pacemaker-1.1.5-5.el6.x86_64.rpm
pacemaker-cts-1.1.5-5.el6.x86_64.rpm
pacemaker-libs-1.1.5-5.el6.x86_64.rpm
perl-TimeDate-1.16-11.1.el6.noarch.rpm
PyXML-0.8.4-19.el6.x86_64.rpm
resource-agents-3.0.12-22.el6.x86_64.rpm
net-snmp-5.5-31.el6.x86_64.rpm

sudo rpm -i --nodeps libvirt-0.8.7-18.el6.x86_64.rpm libvirt-client-0.8.7-18.el6.x86_64.rpm numactl-2.0.3-9.el6.x86_64.rpm \
  gnutls-utils-2.8.5-4.el6.x86_64.rpm nc-1.84-22.el6.x86_64.rpm libxslt-1.1.26-2.el6.x86_64.rpm netcf-libs-0.1.7-1.el6.x86_64.rpm \
  augeas-libs-0.7.2-6.el6.x86_64.rpm cyrus-sasl-md5-2.1.23-8.el6.x86_64.rpm qpid-cpp-client-0.10-3.el6.x86_64.rpm \
  boost-1.41.0-11.el6.x86_64.rpm boost-date-time-1.41.0-11.el6.x86_64.rpm boost-python-1.41.0-11.el6.x86_64.rpm \
  boost-test-1.41.0-11.el6.x86_64.rpm boost-regex-1.41.0-11.el6.x86_64.rpm boost-graph-1.41.0-11.el6.x86_64.rpm \
  boost-serialization-1.41.0-11.el6.x86_64.rpm boost-wave-1.41.0-11.el6.x86_64.rpm boost-iostreams-1.41.0-11.el6.x86_64.rpm \
  boost-signals-1.41.0-11.el6.x86_64.rpm ebtables-2.0.9-6.el6.x86_64.rpm iscsi-initiator-utils-6.2.0.872-21.el6.x86_64.rpm \
  libicu-4.2.1-9.el6.x86_64.rpm dnsmasq-2.48-4.el6.x86_64.rpm radvd-1.6-1.el6.x86_64.rpm qemu-img-0.12.1.2-2.160.el6.x86_64.rpm \
  yajl-1.0.7-3.el6.x86_64.rpm libcgroup-0.37-2.el6.x86_64.rpm libpciaccess-0.10.9-4.el6.x86_64.rpm
sudo rpm -i fence-virtd-libvirt-0.2.1-8.el6.x86_64.rpm fence-virtd-0.2.1-8.el6.x86_64.rpm
sudo rpm -i libesmtp-1.0.4-15.el6.x86_64.rpm
sudo rpm -i clusterlib-3.0.12-41.el6.x86_64.rpm
sudo rpm -i openais-1.1.1-7.el6.x86_64.rpm openaislib-1.1.1-7.el6.x86_64.rpm
sudo rpm -i pexpect-2.3-6.el6.noarch.rpm
sudo rpm -i perl-Net-Telnet-3.03-11.el6.noarch.rpm
sudo rpm -i cman-3.0.12-41.el6.x86_64.rpm fence-virt-0.2.1-8.el6.x86_64.rpm fence-agents-3.0.12-23.el6.x86_64.rpm \
  net-snmp-utils-5.5-31.el6.x86_64.rpm ricci-0.16.2-35.el6.x86_64.rpm sg3_utils-1.28-3.el6.x86_64.rpm \
  sg3_utils-libs-1.28-3.el6.x86_64.rpm oddjob-0.30-5.el6.x86_64.rpm nss-tools-3.12.9-9.el6.x86_64.rpm \
  modcluster-0.16.2-10.el6.x86_64.rpm
sudo rpm -i pacemaker-1.1.5-5.el6.x86_64.rpm pacemaker-cts-1.1.5-5.el6.x86_64.rpm pacemaker-libs-1.1.5-5.el6.x86_64.rpm

create /etc/corosync/corosync.conf:

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.x.x.x
        #mcastaddr: 226.94.1.1
        broadcast: yes
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: on
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}
#end of file
############################
run: 
crm configure
paste the below into the crm shell:
primitive shared_ip_one IPaddr params ip=10.x.x.0 cidr_netmask="255.255.254.0" nic="bond0"
property stonith-enabled="false"
location share_ip_one_master shared_ip_one 100: myserver1
monitor shared_ip_one 20s:10s
commit
exit
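
A quick failover sanity check (a sketch: migrate forces the shared IP onto the other node, unmigrate removes the temporary constraint afterwards):

crm_mon -1
crm resource migrate shared_ip_one myserver2
crm_mon -1
crm resource unmigrate shared_ip_one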


Monday 9 July 2012

Pentaho PDI (Kettle) - notes to self

Starters


The Pentaho download page majors on the commercial versions.
(not sure whether the Community Edition (CE) comes with the commercial version)
Scroll down to Community Projects to find the open source version.

Spoon - the GUI where one designs, develops and tests ETL graphs.
So remember to run spoon.sh (spoon.bat) to fire up the environment rather than simply clicking on the "Data Integration 64-bit" application (clicking the application meant the necessary libext JDBC libraries were not available, which caused the errors below).

http://wiki.pentaho.com/display/EAI/02.+Spoon+Introduction

Problem with MySQL connection and PDI v.4.3

Error connecting to database [mytest_mysql] : org.pentaho.di.core.exception.KettleDatabaseException: 
Error occured while trying to connect to the database
Exception while loading class
org.gjt.mm.mysql.Driver
...
Caused by: java.lang.ClassNotFoundException: org.gjt.mm.mysql.Driver

To resolve this problem, read the issue log here, which requires you to download Connector/J from here:
$ tar xvzf mysql-connector-java-5.1.21.tar.gz mysql-connector-java-5.1.21/mysql-connector-java-5.1.21-bin.jar
$ cp -ip /downloads/mysql-connector-java-5.1.21/mysql-connector-java-5.1.21-bin.jar /usr/local/pentaho/data-integration/libext/JDBC/


mysql misc - notes to self

installed mysql on mac
lazily running as root
needed to mkdir /var/run/mysqld and chmod 777 /var/run/mysqld
(missing something, obviously)
Came across this useful document for installing mysql on mac after following my nose.

To connect to mysql using perl DBI/DBD

To load data into mysql

how to run an SQL command in a file from within mysql
mysql> source mysqlcmds.sql

how to run an SQL command from the command line
mysql mytest < mysqlcmds.sql
(note: you can leave off the database name if the first line in the mysqlcmds.sql file is a "use" statement)

Self consuming mysql sql script in shell script
cat load_myfile.sh

#!/bin/bash

MYFILE=/mypath/myfile.dat


mysql --user=myuser --password=xyz <<EOF
use mytest;

load data local infile '${MYFILE}'
replace
into table mytest.mytable
character set UTF8
fields terminated by '|';

EOF

If you are getting the following error, it could be that you are missing the "local" keyword (even when providing a full path to the file):

$ ./load_myfile.sh 
ERROR 13 (HY000) at line 3: Can't get stat of '/mypath/myfile.dat' (Errcode: 13)





Tuesday 19 June 2012

Pig notes to self


Some commands


# Note SUBSTRING is like a python slice
# so suppose field x has "abcdefgh"
# SUBSTRING(x,3,4) => "d"
# SUBSTRING(x,2,5) => "cde"


Note this code is there for syntax purposes only - it does nothing meaningful ...


comments 


/* .... over multiple lines ...*/


-- use -param arg1='abcd' on the command line
-- use -param myvar='xyz' on the command line
%default arg1 'default value'
%default myvar 'default value'


REGISTER myudf.jar;
REGISTER piggybank.jar;


DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
DEFINE LENGTH  org.apache.pig.piggybank.evaluation.string.LENGTH();


my_file = LOAD '$myfile' USING PigStorage('|') AS (col1:chararray, col2:double, col3:long);
my_file = DISTINCT my_file; -- remove duplicates


my_recs = FOREACH my_file GENERATE SUBSTRING(col1,0,14) AS mycol, null AS col4:chararray, (LENGTH(col1) < 3 ? col1 : SUBSTRING(REPLACE(col1,' ',''), 0,LENGTH(REPLACE(col1,' ',''))-2)) AS col5:chararray, col2, col3;


-- CONCAT(myudf.ZeroPad6Left(col1), myudf.ZeroPad6Left(col1)) AS col6:chararray


my_joined = JOIN my_recs BY (mycol, col2), my_file BY (col1, col2); -- a self-join needs two distinct relations


my_joined = FILTER my_joined BY (my_file::col3 < 1000); -- disambiguate col3 after the join


my_joined2 = JOIN my_joined BY my_file::col1 LEFT OUTER, my_recs BY mycol;


my_fin_rec = FOREACH my_joined2 GENERATE *; -- projection left open in the original; * keeps every field


STORE my_fin_rec INTO '$OUTPUTfile' USING PigStorage('|');
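
To run the above with the command-line parameters mentioned in the comments (myscript.pig is a hypothetical filename):

pig -param arg1='abcd' -param myvar='xyz' myscript.pig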

Saturday 9 June 2012

Transferring Data via SSH - notes to self

Notes re transferring Data via SSH

ssh -c arcfour

If using SSH (scp/sftp/rsync with ssh), you can achieve speed enhancements using "-c arcfour" (sacrificing a little security - might be ok in-house e.g.). See notes re SSH from Charles Martin Reid's wiki.
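
For example (bigfile.tar.gz and the destination are placeholders):

scp -c arcfour bigfile.tar.gz remuser@remsvr:/remdestdir/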

Example using rsync

rsync can sync entire directory structures but this script needed data positioned in a certain way. rsync can do loads and is a good starting point ...
This script could/should be rewritten to make more use of rsync features.


#!/bin/ksh


eval $@


PUBKEY=${HOME}/.ssh/mykey   # despite the variable name, ssh -i expects the private key
svrname=`uname -n | cut -c1-8`
srcdir=/mysrcdir
sftpUsr=remuser
prisftpserver=remsvr
remdir=/remdestdir


cd ${srcdir}


START_DAY=${START_DAY:-`date --date="1 days ago" +%Y%m%d`}
END_DAY=${END_DAY:-`date --date="1 days ago" +%Y%m%d`}


DAY=${START_DAY}
while [ $DAY -le $END_DAY ]
do


echo "Starting DAY=$DAY ..."


echo "`date +'%Y/%m/%d %H:%M:%S'`|Start|${DAY}"


# Try and create the directory - it may already have been created
ssh -i ${PUBKEY} -q ${sftpUsr}@${prisftpserver} "mkdir ${remdir}/${DAY}; chmod 777 ${remdir}/${DAY}"


# replace the source pattern below with the files you want rsync'd
rsync -av --rsync-path=/opt/sfw/bin/rsync --rsh="ssh -i ${PUBKEY}" ${srcdir}/*.gz ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}


echo "`date +'%Y/%m/%d %H:%M:%S'`|Complete|${DAY}"


DAY=$(($DAY+1))


done
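
The eval $@ at the top means the day range can be overridden on the command line, e.g. (sync_logs.sh is a hypothetical name for the script above):

./sync_logs.sh START_DAY=20121101 END_DAY=20121107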


Example not using rsync


#!/bin/ksh
# script built by several hence slightly different formatting stds used :(


eval $@


PUBKEY=${HOME}/.ssh/mykey        # despite the variable name, ssh -i expects the private key
svrname=`uname -n | cut -c1-8`   # local server
srcdir=/src_logs        # replace with location of source data files
sftpUsr=remuser          # replace with remote user
prisftpserver=remserver  # replace with remote server
remdir=/rem_logs         # replace with location of destination directory


cd ${srcdir}


# this example caters for daily logfiles
START_DAY=${START_DAY:-`date --date="1 days ago" +%Y%m%d`}
END_DAY=${END_DAY:-`date --date="1 days ago" +%Y%m%d`}


DAY=${START_DAY}
while [ $DAY -le $END_DAY ]
do


echo "Starting DAY=$DAY ..."


# Try and create the directory - it may already have been created
ssh -i ${PUBKEY} -q ${sftpUsr}@${prisftpserver} "mkdir ${remdir}/${DAY}; chmod 777 ${remdir}/${DAY}"


for filename in `ls -1 *.gz`  # replace the pattern as needed
do


base_filename=`basename ${filename} .gz`
dir_filename=`dirname ${filename}`


scp_count=0
scp_error=1


while [ $scp_error -ne 0 ] && [ $scp_count -le 2 ] # give up after 3 scp attempts
do


scp_count=$(($scp_count+1))
echo "`date +'%Y/%m/%d %H:%M:%S'`|Started (${scp_count})|$filename|${base_filename}.gz"


# throttle the transfer rate (scp -l takes Kbit/s) with 120sec timeout to handle hanging scp's
scp -i ${PUBKEY} -l100000 -o ConnectTimeout=120 -q ${filename} ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}_${dir_filename}_${base_filename}.gz
# use arcfour cipher which is faster but less secure with 120sec timeout to handle hanging scp's
#scp -i ${PUBKEY} -c arcfour -o ConnectTimeout=120 -q ${filename} ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}_${dir_filename}_${base_filename}.gz
scp_error=$?


done

echo "`date +'%Y/%m/%d %H:%M:%S'`|Complete|${filename}|${base_filename}.gz"


done


DAY=$(($DAY+1))


done


Streaming data


Flume 
Scribe
Storm
S4


TBC

Saturday 2 June 2012

Dumbo python - links and notes to self

https://github.com/klbostee/dumbo/wiki/Short-tutorial

https://raw.github.com/klbostee/dumbo/dbeae6c939cf7ef84ac81996041fc368df054c52/examples/join.py

http://dumbotics.com/category/examples/

https://github.com/klbostee/dumbo/wiki/Example-programs

Thursday 31 May 2012

Mac things - notes to self

Installing hadoop

Check this link for general hadoop install including Mac
And this one for Lion OSX install

Start the ssh daemon

# sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist
### check it's running

# launchctl list | grep -i ssh
- 0 com.openssh.sshd
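
### to stop it again
# sudo launchctl unload -w /System/Library/LaunchDaemons/ssh.plist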


Wednesday 9 May 2012

R - notes to self

Contents



  1. Getting Started on a mac (including some good reads re getting started on R in general)
  2. Using R - Baby steps (example R commands)
  3. Installing R on CentOS from source



Getting started


Download R for mac from http://cran.r-project.org/bin/macosx/R-2.15.0.pkg 
Well written introductory guide to R
More R commands 

Using R - Baby steps


On file system:


# start with some arb data
$ pwd
/Users/myuser/data
$ cat test_data.csv
Name,Test1,Test2,Test3,Test4
Adam,68,73,75,82
Ben,57,62,61,59
Jim,80,85,87,92
Zak,79,73,65,63


In R:



> student.data <- read.table("/Users/myuser/data/test_data.csv", header = TRUE, sep = ",")
> names(student.data)
[1] "Name"  "Test1" "Test2" "Test3" "Test4"
> ls()
[1] "student.data"
> summary(student.data[Test1])
Error in `[.data.frame`(student.data, Test1) : object 'Test1' not found
> summary(student.data["Test1"])
     Test1      
 Min.   :57.00  
 1st Qu.:65.25  
 Median :73.50  
 Mean   :71.00  
 3rd Qu.:79.25  
 Max.   :80.00  
> mean(student.data["Test1"])
Test1 
   71 
Warning message:
mean() is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
> colMeans(student.data["Test1"])
Test1 
   71 
> mean(student.data)
 Name Test1 Test2 Test3 Test4 
   NA 71.00 73.25 72.00 74.00 
Warning messages:
1: mean() is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
2: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA


Plotting



> a <- c(1:4)
> a
[1] 1 2 3 4
> b <- student.data[2,2:5]
> b
  Test1 Test2 Test3 Test4
2    57    62    61    59
> plot(a,b)
> plot(c(1:4),student.data[2,2:5])
> plot(c(1:4),student.data[2,2:5],main = "Ben's test Results", xlab = "Tests", ylab = "Test Result / 100")



Installing R on Centos VM and then deploying to a production machine


Downloaded the latest R 2.15.0 from here: http://cran.ma.imperial.ac.uk/src/base/R-2/R-2.15.0.tar.gz (find link from here http://cran.ma.imperial.ac.uk/)

scp'd tar file to my CentOS Virtual Machine
cd /opt
tar xvzf R-2.15.0.tar.gz
cd /opt/R-2.15.0
./configure --prefix=/opt/R --with-x=no --with-readline=yes
# the next step takes 5 mins and is CPU intensive
make
make test
make prefix=/opt/R install

For the above to progress, I needed to install the following packages:
pkgconfig
ncurses-devel
readline-devel
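
For example, via yum:

# yum install pkgconfig ncurses-devel readline-devel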

Need "cluster" and "fields" R packages for macro.
To check whether an R package has been installed:
is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
E.g

> is.installed('cluster')
[1] TRUE

> is.installed('fields')
[1] FALSE

> is.installed('spam')
[1] FALSE

We needed the "fields" R package.
To install "fields", need to first install "spam":

/opt/R/bin/R CMD INSTALL --build /root/R-packages/spam_0.29-1.tar.gz
/opt/R/bin/R CMD INSTALL --build /root/R-packages/fields_6.6.3.tar.gz

Tar up the /opt/R tree.

cd /opt
tar cvzf /root/R.tar.gz R/

Shipped it to production server

Untarred into /opt

cd /opt
tar xvfz /root/R.tar.gz

Added PATH="...:/opt/R/bin"; export PATH to /etc/profile

But was missing some packages

[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory

< note I didn't install the libgfortran*i686* rpm >

[root@myserver ~]# rpm -i libgfortran-4.4.5-6.el6.x86_64.rpm 
warning: libgfortran-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY

[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory

[root@myserver ~]# rpm -i libgomp-4.4.5-6.el6.x86_64.rpm 
warning: libgomp-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY 

Finally checked and all was working

[myuser@myserver ~]$ R

R version 2.15.0 (2012-03-30)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
> is.installed('cluster');
[1] TRUE
> is.installed('spam');
[1] TRUE
> is.installed('fields');
[1] TRUE

Check if package is installed


> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
> is.installed('base')
[1] TRUE
> is.installed('fields')
[1] FALSE
> is.installed('cluster')
[1] TRUE
> is.installed('spam')
[1] FALSE

More data


Borrowed this from somewhere ...

> require(stats); require(graphics)
> ## work with pre-seatbelt period to identify a model, use logs
> work <- window(log10(UKDriverDeaths), end = 1982+11/12)
> work
          Jan      Feb      Mar      Apr      May      Jun      Jul
1969 3.227115 3.178401 3.178113 3.141450 3.212720 3.179264 3.192846
1970 3.243534 3.246745 3.234770 3.192567 3.197281 3.181844 3.256477
1971 3.307496 3.218798 3.228657 3.210319 3.256477 3.242044 3.254064
1972 3.318063 3.247482 3.263636 3.195623 3.295787 3.267875 3.293363
1973 3.321598 3.292920 3.224533 3.288026 3.301681 3.258398 3.303628
1974 3.206286 3.176959 3.189771 3.140508 3.238297 3.254790 3.250176
1975 3.197832 3.132260 3.218010 3.140508 3.181558 3.152594 3.158965
1976 3.168203 3.218798 3.148294 3.144574 3.184691 3.116940 3.183555
1977 3.216957 3.146438 3.149527 3.147058 3.144263 3.181844 3.184123
1978 3.291369 3.164947 3.193959 3.164055 3.160168 3.210051 3.219323
1979 3.258398 3.159868 3.246006 3.164650 3.192010 3.155640 3.154424
1980 3.221414 3.133858 3.177825 3.133539 3.162266 3.182415 3.164353
1981 3.168497 3.163758 3.188084 3.147367 3.182415 3.141450 3.215109
1982 3.163161 3.159868 3.163161 3.135133 3.172311 3.192567 3.172603
          Aug      Sep      Oct      Nov      Dec
1969 3.212188 3.198382 3.218273 3.332842 3.332034
1970 3.255273 3.235276 3.302764 3.350636 3.394101
1971 3.284656 3.209247 3.299289 3.348889 3.340841
1972 3.227630 3.249932 3.295787 3.379668 3.423901
1973 3.281488 3.318898 3.318063 3.325926 3.332438
1974 3.275772 3.301898 3.317436 3.320562 3.311966
1975 3.188366 3.219060 3.193403 3.279895 3.342225
1976 3.122871 3.211388 3.242541 3.291813 3.356790
1977 3.215638 3.180413 3.226600 3.301030 3.345374
1978 3.214314 3.215638 3.226084 3.311754 3.354493
1979 3.191451 3.216166 3.218273 3.304491 3.343802
1980 3.190892 3.189771 3.261739 3.239800 3.288026
1981 3.178977 3.225568 3.287354 3.271377 3.237041
1982 3.226342 3.202488 3.267172 3.300595 3.317854