GML blog

Tuesday, 19 June 2012

Pig notes to self

Some commands

# Note SUBSTRING is like a Python slice: the start index is included, the stop index is not
# so suppose field x has "abcdefgh"
# SUBSTRING(x,3,4) => "d"
# SUBSTRING(x,2,5) => "cde"
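A quick way to sanity-check the indexes is to try the same slice in Python. This is only a sketch: the `substring` helper is a hypothetical stand-in that mimics the (start, stop) convention described in the note, not Pig's implementation.

```python
def substring(value, start, stop):
    """Hypothetical stand-in for SUBSTRING(value, start, stop): start included, stop excluded."""
    return value[start:stop]

x = "abcdefgh"
print(substring(x, 3, 4))  # d
print(substring(x, 2, 5))  # cde
```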

Note: the code below is there for syntax reference only; it does nothing meaningful.

comments

/* .... over multiple lines ...*/

-- use -param arg1='abcd' on the command line
-- use -param myvar='xyz' on the command line
%default arg1 'default value'
%default myvar 'default value'

REGISTER myudf.jar;
REGISTER piggybank.jar;

DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
DEFINE LENGTH org.apache.pig.piggybank.evaluation.string.LENGTH();

my_file = LOAD '$myfile' USING PigStorage('|') AS (col1:chararray, col2:double, col3:long);
my_file = DISTINCT my_file; -- remove duplicates

my_recs = FOREACH my_file GENERATE SUBSTRING(col1,0,14) AS mycol, null AS col4:chararray, (LENGTH(col1) < 3 ? col1 : SUBSTRING(REPLACE(col1,' ',''), 0,LENGTH(REPLACE(col1,' ',''))-2)) AS col5:chararray, col2, col3;

-- CONCAT(myudf.ZeroPad6Left(col1), myudf.ZeroPad6Left(col1)) AS col6:chararray

my_joined = JOIN my_recs by (col1, col2), my_recs by (col1,col2);

my_joined = FILTER my_joined BY (col3 < 1000);

my_joined2 = JOIN my_joined by col1 LEFT OUTER, my_recs by col1;

my_fin_rec = FOREACH my_joined2 GENERATE ;

STORE my_fin_rec INTO '$OUTPUTfile' USING PigStorage('|');
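The nested col5 expression in the FOREACH is the hardest part of the script to read. Here is a rough Python rendering of what it appears to do, assuming REPLACE simply strips spaces and SUBSTRING behaves like a slice. The function name `col5` and the clamping of the stop index are my own choices for illustration.

```python
def col5(col1):
    """Short values pass through; longer ones lose their spaces, then their last two characters."""
    if len(col1) < 3:
        return col1
    squeezed = col1.replace(" ", "")
    # Assumption: clamp at 0 so a very short squeezed value never produces a negative stop index.
    stop = max(len(squeezed) - 2, 0)
    return squeezed[0:stop]

print(col5("ab"))        # ab
print(col5("ab cd ef"))  # abcd
```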

Saturday, 9 June 2012

Transferring Data via SSH - notes to self

Notes re transferring Data via SSH

ssh -c arcfour

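When transferring over SSH (scp, sftp, or rsync over ssh), "-c arcfour" selects a faster cipher at the cost of weaker security, which may be acceptable on an in-house network. Newer OpenSSH releases have dropped arcfour, so check what your version offers (ssh -Q cipher) before relying on it. See the notes on SSH in Charles Martin Reid's wiki.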

Example using rsync

rsync can sync entire directory structures, but this script needed the data laid out in a particular way on the remote side. rsync can do far more than is used here, so treat this as a starting point.
The script could and should be rewritten to make more use of rsync features.

#!/bin/ksh

eval $@

PUBKEY=${HOME}/.ssh/mykey.pub
svrname=`uname -n | cut -c1-8`
srcdir=/mysrcdir
sftpUsr=remuser
prisftpserver=remsvr
remdir=/remdestdir

cd ${srcdir}

START_DAY=${START_DAY:-`date --date="1 days ago" +%Y%m%d`}
END_DAY=${END_DAY:-`date --date="1 days ago" +%Y%m%d`}

DAY=${START_DAY}
while [ $DAY -le $END_DAY ]
do

echo "Starting DAY=$DAY ..."

echo "`date +'%Y/%m/%d %H:%M:%S'`|Start|${DAY}"

# Try to create the directory - it may already have been created (mkdir -p does not fail if it exists)
ssh -i ${PUBKEY} -q ${sftpUsr}@${prisftpserver} "mkdir -p ${remdir}/${DAY}; chmod 777 ${remdir}/${DAY}"

# replace with the pattern matching the files you want rsync'd
rsync -av --rsync-path=/opt/sfw/bin/rsync --rsh="ssh -i ${PUBKEY}" ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}

echo "`date +'%Y/%m/%d %H:%M:%S'`|Complete|${DAY}"

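# step by calendar day so the loop stays valid across month ends (uses GNU date, as above)
DAY=`date --date="${DAY} +1 day" +%Y%m%d`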

done
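One thing to watch with YYYYMMDD loops like the one above: adding 1 to the number does not roll over at month end (20120630 + 1 is 20120631, not 20120701). A minimal Python sketch of calendar-aware iteration follows; the `day_range` name is mine and the dates are just sample values.

```python
from datetime import datetime, timedelta

def day_range(start_day, end_day):
    """Yield every calendar day from start_day to end_day inclusive, as YYYYMMDD strings."""
    fmt = "%Y%m%d"
    day = datetime.strptime(start_day, fmt).date()
    last = datetime.strptime(end_day, fmt).date()
    while day <= last:
        yield day.strftime(fmt)
        day += timedelta(days=1)

print(list(day_range("20120629", "20120702")))
# ['20120629', '20120630', '20120701', '20120702']
```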

Example not using rsync

#!/bin/ksh
# script built by several people, hence the slightly different formatting standards :(

eval $@

PUBKEY=${HOME}/.ssh/mykey.pub
svrname=`uname -n | cut -c1-8` # local server
srcdir=/src_logs # replace with location of source data files
sftpUsr=remuser # replace with remote user
prisftpserver=remserver # replace with remote server
remdir=/rem_logs # replace with location of destination directory

cd ${srcdir}

# this example caters for daily logfiles
START_DAY=${START_DAY:-`date --date="1 days ago" +%Y%m%d`}
END_DAY=${END_DAY:-`date --date="1 days ago" +%Y%m%d`}

DAY=${START_DAY}
while [ $DAY -le $END_DAY ]
do

echo "Starting DAY=$DAY ..."

# Try to create the directory - it may already have been created (mkdir -p does not fail if it exists)
ssh -i ${PUBKEY} -q ${sftpUsr}@${prisftpserver} "mkdir -p ${remdir}/${DAY}; chmod 777 ${remdir}/${DAY}"

for filename in `ls -1 ` # replace
do

base_filename=`basename ${filename} .gz`
dir_filename=`dirname ${filename}`

scp_count=0
scp_error=1

while [ $scp_error -ne 0 ] && [ $scp_count -le 2 ] # give up after 3 scp attempts
do

scp_count=$(($scp_count+1))
echo "`date +'%Y/%m/%d %H:%M:%S'`|Started (${scp_count})|$filename|${base_filename}.gz"

# throttle bandwidth with -l (the value is in Kbit/s, so 100000 is roughly 100 Mbit/s); ConnectTimeout=120 gives up if no connection is made within 120 seconds
scp -i ${PUBKEY} -l100000 -o ConnectTimeout=120 -q ${filename} ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}_${dir_filename}_${base_filename}.gz
# alternative: use the arcfour cipher, which is faster but less secure, with the same 120 second connect timeout
#scp -i ${PUBKEY} -c arcfour -o ConnectTimeout=120 -q ${filename} ${sftpUsr}@${prisftpserver}:${remdir}/${DAY}/${svrname}_${dir_filename}_${base_filename}.gz
scp_error=$?

done

echo "`date +'%Y/%m/%d %H:%M:%S'`|Complete|${filename}|${base_filename}.gz"

done

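# step by calendar day so the loop stays valid across month ends (uses GNU date, as above)
DAY=`date --date="${DAY} +1 day" +%Y%m%d`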

done
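The inner while loop above is a bounded retry: keep calling scp until it returns 0 or three attempts have been made. The same pattern as a Python sketch, for reference. `run_with_retries` and `fake_runner` are hypothetical names, and the runner is injectable so the loop can be tried without a real scp.

```python
import subprocess

def run_with_retries(command, max_attempts=3, runner=subprocess.call):
    """Run command until it exits with status 0 or max_attempts is reached.

    Returns the number of the successful attempt, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if runner(command) == 0:
            return attempt
    return None

# Demonstration with a fake runner that fails twice and then succeeds.
outcomes = iter([1, 1, 0])

def fake_runner(command):
    return next(outcomes)

print(run_with_retries(["scp", "example.gz", "remuser@remserver:/rem_logs/"], runner=fake_runner))  # 3
```

Unlike the shell version, this returns None when all attempts fail, so the caller can avoid logging "Complete" for a file that never arrived.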

Streaming data

Flume
Scribe
Storm
S4

TBC

Saturday, 2 June 2012

Dumbo python - links and notes to self

https://github.com/klbostee/dumbo/wiki/Short-tutorial

https://raw.github.com/klbostee/dumbo/dbeae6c939cf7ef84ac81996041fc368df054c52/examples/join.py

http://dumbotics.com/category/examples/

https://github.com/klbostee/dumbo/wiki/Example-programs

Thursday, 31 May 2012

Mac things - notes to self

Installing hadoop

Check this link for general hadoop install including Mac
And this one for Lion OSX install

Start the ssh daemon

# sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist
### check it's running

# launchctl list | grep -i ssh
- 0 com.openssh.sshd

Wednesday, 9 May 2012

R - notes to self

Contents

Getting Started on a mac (including some good reads re getting started on R in general)
Using R - Baby steps (example R commands)
Installing R on CentOS from source

Getting started

Download R for mac from http://cran.r-project.org/bin/macosx/R-2.15.0.pkg
Well written introductory guide to R
More R commands

Using R - Baby steps

On file system:

# start with some arb data
$ pwd
/Users/myuser/data
$ cat test_data.csv
Name,Test1,Test2,Test3,Test4
Adam,68,73,75,82
Ben,57,62,61,59
Jim,80,85,87,92
Zak,79,73,65,63

In R:

> student.data <- read.table("/Users/myuser/data/test_data.csv",header = TRUE, sep = ",")

> names(student.data)

[1] "Name" "Test1" "Test2" "Test3" "Test4"

> ls()

[1] "student.data"

> summary(student.data[Test1])

Error in `[.data.frame`(student.data, Test1) : object 'Test1' not found

> summary(student.data["Test1"])

Test1

Min. :57.00

1st Qu.:65.25

Median :73.50

Mean :71.00

3rd Qu.:79.25

Max. :80.00

> mean(student.data["Test1"])

Test1

71

Warning message:

mean() is deprecated.

Use colMeans() or sapply(*, mean) instead.

> colMeans(student.data["Test1"])

Test1

71

> mean(student.data)

Name Test1 Test2 Test3 Test4

NA 71.00 73.25 72.00 74.00

Warning messages:

1: mean() is deprecated.

Use colMeans() or sapply(*, mean) instead.

2: In mean.default(X[[1L]], ...) :

argument is not numeric or logical: returning NA
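As a cross-check on the numbers R reports above, here is a small Python sketch that recomputes the column means and the Test1 quartiles from the same four rows. It assumes Python 3.8+ (for `statistics.quantiles`) and that R's default quartiles correspond to the "inclusive" method, which matches the summary() output for this data.

```python
import csv
import io
import statistics

CSV_TEXT = """Name,Test1,Test2,Test3,Test4
Adam,68,73,75,82
Ben,57,62,61,59
Jim,80,85,87,92
Zak,79,73,65,63
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
test_columns = [name for name in rows[0] if name != "Name"]

# Mean of every numeric column, like colMeans() on the Test columns.
column_means = {
    name: statistics.mean(float(row[name]) for row in rows)
    for name in test_columns
}

# Quartiles of Test1, like the 1st Qu. / Median / 3rd Qu. lines of summary().
test1 = sorted(float(row["Test1"]) for row in rows)
quartiles = statistics.quantiles(test1, n=4, method="inclusive")

print(column_means)  # {'Test1': 71.0, 'Test2': 73.25, 'Test3': 72.0, 'Test4': 74.0}
print(quartiles)     # [65.25, 73.5, 79.25]
```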

Plotting

> a <- c(1:4)

> a

[1] 1 2 3 4

> b <- student.data[2,2:5]

> b

Test1 Test2 Test3 Test4

2 57 62 61 59

> plot(a,b)

> plot(c(1:4),student.data[2,2:5])
> plot(c(1:4),student.data[2,2:5],main = "Ben's test Results", xlab = "Tests", ylab = "Test Result / 100")

Installing R on a CentOS VM and then deploying to a production machine

Downloaded the latest R 2.15.0 from here: http://cran.ma.imperial.ac.uk/src/base/R-2/R-2.15.0.tar.gz (find link from here http://cran.ma.imperial.ac.uk/)

scp'd tar file to my CentOS Virtual Machine

cd /opt

tar xvzf R-2.15.0.tar.gz

cd /opt/R-2.15.0

./configure --prefix=/opt/R --with-x=no --with-readline=yes

# the next step takes about 5 minutes and is CPU intensive

make

make check

make prefix=/opt/R install

For the above to progress, I needed to install the following packages:

pkgconfig

ncurses-devel

readline-devel

Need "cluster" and "fields" R packages for macro.

To check whether an R package has been installed:

is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])

E.g

> is.installed('cluster')

[1] TRUE

> is.installed('fields')

[1] FALSE

> is.installed('spam')

[1] FALSE

We needed the "fields" R package.

To install "fields", need to first install "spam":

/opt/R/bin/R CMD INSTALL --build /root/R-packages/spam_0.29-1.tar.gz

/opt/R/bin/R CMD INSTALL --build /root/R-packages/fields_6.6.3.tar.gz

Tar up the /opt/R tree.

cd /opt

tar cvzf /root/R.tar.gz R/

Shipped it to production server

Untarred into /opt

cd /opt

tar xvfz /root/R.tar.gz

Added PATH="……:/opt/R/bin" export PATH to /etc/profile

But was missing some packages

[myuser@myserver ~]$ R

/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory

< note I didn't install the libgfortran*i686* rpm >

[root@myserver ~]# rpm -i libgfortran-4.4.5-6.el6.x86_64.rpm

warning: libgfortran-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY

[myuser@myserver ~]$ R

/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory

[root@myserver ~]# rpm -i libgomp-4.4.5-6.el6.x86_64.rpm

warning: libgomp-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY

Finally checked and all was working

[myuser@myserver ~]$ R

R version 2.15.0 (2012-03-30)

ISBN 3-900051-07-0

Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.

Type 'contributors()' for more information and

'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.

> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])

> is.installed('cluster');

[1] TRUE

> is.installed('spam');

[1] TRUE

> is.installed('fields');

[1] TRUE

Check if package is installed

> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])

> is.installed('base')

[1] TRUE

> is.installed('fields')

[1] FALSE

> is.installed('cluster')

[1] TRUE

> is.installed('spam')

[1] FALSE

More data

Borrowed this from somewhere ...

> require(stats); require(graphics)

> ## work with pre-seatbelt period to identify a model, use logs

> work <- window(log10(UKDriverDeaths), end = 1982+11/12)

> work

Jan Feb Mar Apr May Jun Jul

1969 3.227115 3.178401 3.178113 3.141450 3.212720 3.179264 3.192846

1970 3.243534 3.246745 3.234770 3.192567 3.197281 3.181844 3.256477

1971 3.307496 3.218798 3.228657 3.210319 3.256477 3.242044 3.254064

1972 3.318063 3.247482 3.263636 3.195623 3.295787 3.267875 3.293363

1973 3.321598 3.292920 3.224533 3.288026 3.301681 3.258398 3.303628

1974 3.206286 3.176959 3.189771 3.140508 3.238297 3.254790 3.250176

1975 3.197832 3.132260 3.218010 3.140508 3.181558 3.152594 3.158965

1976 3.168203 3.218798 3.148294 3.144574 3.184691 3.116940 3.183555

1977 3.216957 3.146438 3.149527 3.147058 3.144263 3.181844 3.184123

1978 3.291369 3.164947 3.193959 3.164055 3.160168 3.210051 3.219323

1979 3.258398 3.159868 3.246006 3.164650 3.192010 3.155640 3.154424

1980 3.221414 3.133858 3.177825 3.133539 3.162266 3.182415 3.164353

1981 3.168497 3.163758 3.188084 3.147367 3.182415 3.141450 3.215109

1982 3.163161 3.159868 3.163161 3.135133 3.172311 3.192567 3.172603

Aug Sep Oct Nov Dec

1969 3.212188 3.198382 3.218273 3.332842 3.332034

1970 3.255273 3.235276 3.302764 3.350636 3.394101

1971 3.284656 3.209247 3.299289 3.348889 3.340841

1972 3.227630 3.249932 3.295787 3.379668 3.423901

1973 3.281488 3.318898 3.318063 3.325926 3.332438

1974 3.275772 3.301898 3.317436 3.320562 3.311966

1975 3.188366 3.219060 3.193403 3.279895 3.342225

1976 3.122871 3.211388 3.242541 3.291813 3.356790

1977 3.215638 3.180413 3.226600 3.301030 3.345374

1978 3.214314 3.215638 3.226084 3.311754 3.354493

1979 3.191451 3.216166 3.218273 3.304491 3.343802

1980 3.190892 3.189771 3.261739 3.239800 3.288026

1981 3.178977 3.225568 3.287354 3.271377 3.237041

1982 3.226342 3.202488 3.267172 3.300595 3.317854

Saturday, 5 May 2012

Mongo get started - notes to self

Installing MongoDB on a mac

Download from http://www.mongodb.org/downloads
Download for your O/S - I am on a mac - so mongodb-osx-x86_64-2.0.4.tgz
Go to the directory beneath which you want to install mongo.
tar xvzf mongodb-osx-x86_64-2.0.4.tgz
Create a soft link mongo to the versioned directory so that you can support upgrades seamlessly ie
ln -s mongodb-osx-x86_64-2.0.4 mongo
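The point of the soft link is that an upgrade only has to repoint `mongo` at the new versioned directory. A sketch of that repointing in Python, run in a throwaway temp directory: `point_link` is a hypothetical helper and the "NEW" directory name is a made-up placeholder, not a real release.

```python
import os
import tempfile

def point_link(link_path, target):
    """Point link_path at target, replacing any existing link with a single rename."""
    temp_link = link_path + ".new"
    os.symlink(target, temp_link)
    os.replace(temp_link, link_path)

base = tempfile.mkdtemp()
for version in ("mongodb-osx-x86_64-2.0.4", "mongodb-osx-x86_64-NEW"):
    os.mkdir(os.path.join(base, version))

link = os.path.join(base, "mongo")
point_link(link, "mongodb-osx-x86_64-2.0.4")  # initial install
point_link(link, "mongodb-osx-x86_64-NEW")    # later upgrade: only the link changes
print(os.readlink(link))
```

Anything that refers to the `mongo` path keeps working across the switch.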

Pentaho loading into MongoDB

Loading data into mongodb using Pentaho Kettle (PDI)

MongoDB commands/tips

It's up to the client to determine the type of the field in a document.

Note - it's possible to have different types in the same field in different documents in the same collection!

At a glance MongoDB commands

> help
db.help()                    help on db methods
db.mycoll.help()             help on collection methods
rs.help()                    help on replica set methods
show dbs                     show database names
show collections             show collections in current database
show users                   show users in current database
show profile                 show most recent system.profile entries with time >= 1ms
show logs                    show the accessible logger names
show log [name]              prints out the last segment of log in memory, 'global' is default
use <db_name>                set current database
db.foo.find()                list objects in collection foo
db.foo.find( { a : 1 } )     list objects in foo where a == 1
it                           result of the last line evaluated; use to further iterate
DBQuery.shellBatchSize = x   set default number of items to display on shell
exit                         quit the mongo shell

> db.foo.count() count items in collection foo

More commands here from the mongodb.org site

Query Dates

Find a record for which today's date falls between eff_from_dt (the effective-from date) and eff_to_dt (the effective-to date):

> db.test.findOne({eff_from_dt: { $lte: new Date() }, eff_to_dt: { $gte: new Date() }} )
{
"_id" : ObjectId("50892120c2e69e0d395d6daa"),
"field1" : "129384749",
"field2" : "ABC",
"field3" : "XYZ",
"eff_from_dt" : ISODate("2012-05-14T23:00:00Z"),
"eff_to_dt" : ISODate("9999-12-31T00:00:00Z")
}
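To make the logic of the query explicit: a document matches when eff_from_dt <= now and eff_to_dt >= now. Below is a plain Python sketch of the same test on in-memory dicts, with no MongoDB driver involved. The second document ("OLD") is made-up sample data, and `is_effective` is a hypothetical helper name.

```python
from datetime import datetime

def is_effective(doc, now):
    """True when eff_from_dt <= now <= eff_to_dt, mirroring $lte / $gte in the query."""
    return doc["eff_from_dt"] <= now and doc["eff_to_dt"] >= now

docs = [
    {"field2": "ABC", "eff_from_dt": datetime(2012, 5, 14, 23, 0), "eff_to_dt": datetime(9999, 12, 31)},
    # Hypothetical expired record, added only to show a non-match.
    {"field2": "OLD", "eff_from_dt": datetime(2011, 1, 1), "eff_to_dt": datetime(2012, 5, 14, 23, 0)},
]

now = datetime(2012, 6, 1)
current = [doc["field2"] for doc in docs if is_effective(doc, now)]
print(current)  # ['ABC']
```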

Wednesday, 2 May 2012

Misc Hive notes

Hive

Hive from the Command prompt

Read how to run from command prompt

Get info on the hive command-line options by running hive -H

From within hive, use:
source FILE

Run a hive script from the shell or command line using: hive -f <filename>

Example shell script calling hive directly.

while [ 1 ]
do
hive -e "select count(*) from mytable where id = $RANDOM"
done

Hive UDFs

Write own Hive UDF

Join optimisation hints

Check out the join optimisation hints for Hive queries in this article.

Setting mapred.child.java.opts in hive

The first two forms did not work for me; the third did.

Did not work:
hive> SET mapred.child.java.opts="-server -Xmx512M"
hive> SET mapred.child.java.opts=" -Xmx512M";

Worked:
hive> SET mapred.child.java.opts=-Xmx512M;
