Thursday 31 May 2012

Mac things - notes to self

Installing hadoop

Check this link for general hadoop install including Mac
And this one for Lion OSX install

Start the ssh daemon

# sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist
### check it's running

# launchctl list | grep -i ssh
- 0 com.openssh.sshd


Wednesday 9 May 2012

R - notes to self

Contents



  1. Getting Started on a mac (including some good reads re getting started on R in general)
  2. Using R - Baby steps (example R commands)
  3. Installing R on CentOS from source



Getting started


Download R for mac from http://cran.r-project.org/bin/macosx/R-2.15.0.pkg 
Well written introductory guide to R
More R commands 

Using R - Baby steps


On file system:


# start with some arb data
$ pwd
/Users/myuser/data
$ cat test_data.csv
Name,Test1,Test2,Test3,Test4
Adam,68,73,75,82
Ben,57,62,61,59
Jim,80,85,87,92
Zak,79,73,65,63


In R:



> student.data <- read.table("/Users/myuser/data/test_data.csv", header = TRUE, sep = ",")
> names(student.data)
[1] "Name"  "Test1" "Test2" "Test3" "Test4"
> ls()
[1] "student.data"
> summary(student.data[Test1])
Error in `[.data.frame`(student.data, Test1) : object 'Test1' not found
> summary(student.data["Test1"])
     Test1      
 Min.   :57.00  
 1st Qu.:65.25  
 Median :73.50  
 Mean   :71.00  
 3rd Qu.:79.25  
 Max.   :80.00  
> mean(student.data["Test1"])
Test1 
   71 
Warning message:
mean() is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
> colMeans(student.data["Test1"])
Test1 
   71 
> mean(student.data)
 Name Test1 Test2 Test3 Test4 
   NA 71.00 73.25 72.00 74.00 
Warning messages:
1: mean() is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
2: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
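
The column means above can be cross-checked outside R. This is only a sketch: the function name col_means is made up for illustration, and it assumes the same layout as test_data.csv (a header row, names in the first column, numbers everywhere else).

```shell
# Print the mean of every numeric column in a comma-separated file.
# Assumes row 1 is a header and column 1 holds names, as in test_data.csv.
col_means() {
  awk -F, '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
    NF > 1  { for (i = 2; i <= NF; i++) sum[i] += $i; rows++ }
    END     { if (rows) for (i = 2; i <= ncol; i++) printf "%s %.2f\n", name[i], sum[i] / rows }
  ' "$1"
}

# Usage: col_means /Users/myuser/data/test_data.csv
```

For the four rows shown earlier this should agree with what colMeans() reports for Test1 to Test4.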


Plotting



> a <- c(1:4)
> a
[1] 1 2 3 4
> b <- student.data[2,2:5]
> b
  Test1 Test2 Test3 Test4
2    57    62    61    59
> plot(a,b)
> plot(c(1:4),student.data[2,2:5])
> plot(c(1:4),student.data[2,2:5],main = "Ben's test Results", xlab = "Tests", ylab = "Test Result / 100")



Installing R on a CentOS VM and then deploying to a production machine


Downloaded the latest R 2.15.0 from here: http://cran.ma.imperial.ac.uk/src/base/R-2/R-2.15.0.tar.gz (find link from here http://cran.ma.imperial.ac.uk/)

scp'd tar file to my CentOS Virtual Machine
cd /opt
tar xvzf R-2.15.0.tar.gz
cd /opt/R-2.15.0
./configure --prefix=/opt/R --with-x=no --with-readline=yes
# the next step takes 5 mins and is CPU intensive
make
make check
make prefix=/opt/R install

For the above to progress, I needed to install the following packages:
pkgconfig
ncurses-devel
readline-devel
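
A quick way to see which of these are still missing before running configure. This is a sketch, not what I ran at the time: missing_rpms is a made-up helper name, and it relies only on rpm -q returning non-zero for a package that is not installed.

```shell
# Print the name of each package that rpm does not report as installed.
missing_rpms() {
  for pkg in "$@"; do
    rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
  done
}

# Usage: missing_rpms pkgconfig ncurses-devel readline-devel
```

Anything it prints can then be installed with yum or from a downloaded rpm.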

Need "cluster" and "fields" R packages for macro.
To check whether an R package has been installed:
is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
E.g

> is.installed('cluster')
[1] TRUE

> is.installed('fields')
[1] FALSE

> is.installed('spam')
[1] FALSE

We needed the "fields" R package.
To install "fields", need to first install "spam":

/opt/R/bin/R CMD INSTALL --build /root/R-packages/spam_0.29-1.tar.gz
/opt/R/bin/R CMD INSTALL --build /root/R-packages/fields_6.6.3.tar.gz

Tar up the /opt/R tree.

cd /opt
tar cvzf /root/R.tar.gz R/

Shipped it to production server

Untarred into /opt

cd /opt
tar xvfz /root/R.tar.gz

Added PATH="……:/opt/R/bin" export PATH to /etc/profile
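
If /etc/profile can be sourced more than once in a session, it may be worth guarding the PATH change so the directory is only appended once. A minimal sketch, where append_path is a hypothetical helper name and not something that ships with CentOS:

```shell
# Append a directory to PATH only if it is not already there.
append_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                      # already present, leave PATH untouched
    *) PATH="${PATH:+$PATH:}$1" ;;    # append, adding a colon only if PATH is non-empty
  esac
  export PATH
}

append_path /opt/R/bin
```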

But R would not start because some shared libraries were missing

[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory

< note I didn't install the libgfortran*i686* rpm >

[root@myserver ~]# rpm -i libgfortran-4.4.5-6.el6.x86_64.rpm 
warning: libgfortran-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY

[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory

[root@myserver ~]# rpm -i libgomp-4.4.5-6.el6.x86_64.rpm 
warning: libgomp-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY 
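
Rather than finding the missing libraries one failed start at a time, as above, ldd can list them all in one go. This is a sketch I did not use at the time; missing_libs is a made-up helper name, and it assumes ldd marks unresolved libraries with "not found", as glibc's ldd does.

```shell
# List shared libraries that the dynamic linker cannot resolve for a binary.
missing_libs() {
  ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Usage: missing_libs /opt/R/lib64/R/bin/exec/R
```

Each name it prints (libgfortran.so.3 and libgomp.so.1 in my case) then has to be matched to the rpm that provides it.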

Finally checked and all was working

[myuser@myserver ~]$ R

R version 2.15.0 (2012-03-30)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
> is.installed('cluster');
[1] TRUE
> is.installed('spam');
[1] TRUE
> is.installed('fields');
[1] TRUE

Check if package is installed


> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) 
> is.installed('base')
[1] TRUE
> is.installed('fields')
[1] FALSE
> is.installed('cluster')
[1] TRUE
> is.installed('spam')
[1] FALSE

More data


Borrowed this from somewhere ...

> require(stats); require(graphics)
> ## work with pre-seatbelt period to identify a model, use logs
> work <- window(log10(UKDriverDeaths), end = 1982+11/12)
> work
          Jan      Feb      Mar      Apr      May      Jun      Jul
1969 3.227115 3.178401 3.178113 3.141450 3.212720 3.179264 3.192846
1970 3.243534 3.246745 3.234770 3.192567 3.197281 3.181844 3.256477
1971 3.307496 3.218798 3.228657 3.210319 3.256477 3.242044 3.254064
1972 3.318063 3.247482 3.263636 3.195623 3.295787 3.267875 3.293363
1973 3.321598 3.292920 3.224533 3.288026 3.301681 3.258398 3.303628
1974 3.206286 3.176959 3.189771 3.140508 3.238297 3.254790 3.250176
1975 3.197832 3.132260 3.218010 3.140508 3.181558 3.152594 3.158965
1976 3.168203 3.218798 3.148294 3.144574 3.184691 3.116940 3.183555
1977 3.216957 3.146438 3.149527 3.147058 3.144263 3.181844 3.184123
1978 3.291369 3.164947 3.193959 3.164055 3.160168 3.210051 3.219323
1979 3.258398 3.159868 3.246006 3.164650 3.192010 3.155640 3.154424
1980 3.221414 3.133858 3.177825 3.133539 3.162266 3.182415 3.164353
1981 3.168497 3.163758 3.188084 3.147367 3.182415 3.141450 3.215109
1982 3.163161 3.159868 3.163161 3.135133 3.172311 3.192567 3.172603
          Aug      Sep      Oct      Nov      Dec
1969 3.212188 3.198382 3.218273 3.332842 3.332034
1970 3.255273 3.235276 3.302764 3.350636 3.394101
1971 3.284656 3.209247 3.299289 3.348889 3.340841
1972 3.227630 3.249932 3.295787 3.379668 3.423901
1973 3.281488 3.318898 3.318063 3.325926 3.332438
1974 3.275772 3.301898 3.317436 3.320562 3.311966
1975 3.188366 3.219060 3.193403 3.279895 3.342225
1976 3.122871 3.211388 3.242541 3.291813 3.356790
1977 3.215638 3.180413 3.226600 3.301030 3.345374
1978 3.214314 3.215638 3.226084 3.311754 3.354493
1979 3.191451 3.216166 3.218273 3.304491 3.343802
1980 3.190892 3.189771 3.261739 3.239800 3.288026
1981 3.178977 3.225568 3.287354 3.271377 3.237041
1982 3.226342 3.202488 3.267172 3.300595 3.317854

Saturday 5 May 2012

Mongo get started - notes to self

Installing MongoDB on a mac

Download from http://www.mongodb.org/downloads
Download for your O/S - I am on a mac - so mongodb-osx-x86_64-2.0.4.tgz
Go to the directory beneath which you want to install mongo.
tar xvzf mongodb-osx-x86_64-2.0.4.tgz
Create a soft link named mongo that points to the versioned directory, so that an upgrade only requires repointing the link, i.e.
ln -s mongodb-osx-x86_64-2.0.4 mongo
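
When a newer version is unpacked alongside the old one, the link can be repointed in a single step. A sketch only: switch_link is a hypothetical helper name, and any newer directory name passed to it is illustrative rather than a release I have installed.

```shell
# Repoint a stable symlink at a different versioned directory.
# -s: symbolic link, -f: replace the existing link,
# -n: treat the existing link as a file instead of following it into the old directory.
switch_link() {
  target=$1   # versioned directory, e.g. mongodb-osx-x86_64-2.0.4
  link=$2     # stable name, e.g. mongo
  ln -sfn "$target" "$link"
}

# Usage: switch_link mongodb-osx-x86_64-2.0.4 mongo
```

Without -n, ln would follow the existing mongo link and create the new link inside the old versioned directory.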

Pentaho loading into MongoDB

Loading data into mongodb using Pentaho Kettle (PDI)

MongoDB commands/tips

It's up to the client to determine the type of the field in a document.
Note - it's possible to have different types in the same field in different documents in the same collection!

At a glance MongoDB commands

> help
db.help()                    help on db methods
db.mycoll.help()             help on collection methods
rs.help()                    help on replica set methods
show dbs                     show database names
show collections             show collections in current database
show users                   show users in current database
show profile                 show most recent system.profile entries with time >= 1ms
show logs                    show the accessible logger names
show log [name]              prints out the last segment of log in memory, 'global' is default
use <db_name>                set current database
db.foo.find()                list objects in collection foo
db.foo.find( { a : 1 } )     list objects in foo where a == 1
it                           result of the last line evaluated; use to further iterate
DBQuery.shellBatchSize = x   set default number of items to display on shell
exit                         quit the mongo shell

> db.foo.count()                     count items in collection foo


More commands here from the mongodb.org site

Query Dates

Find a record where today's date falls between eff_from_dt (effective-from date) and eff_to_dt (effective-to date)


> db.test.findOne({eff_from_dt: { $lte: new Date() }, eff_to_dt: { $gte: new Date() }} )
{
        "_id" : ObjectId("50892120c2e69e0d395d6daa"),
        "field1" : "129384749",
        "field2" : "ABC",
        "field3" : "XYZ",
        "eff_from_dt" : ISODate("2012-05-14T23:00:00Z"),
        "eff_to_dt" : ISODate("9999-12-31T00:00:00Z")
}




Wednesday 2 May 2012

Misc Hive notes

Hive

Hive from the Command prompt

Read how to run from command prompt

Get info re hive options by running hive -H (or hive --help)

From within hive, use:
source FILE

Run a hive program from shell or command line using: hive -f <filename>

Example shell script calling hive directly.

while [ 1 ]
do
  hive -e "select count(*) from mytable where id = $RANDOM"
done
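
The loop above runs until it is killed. A bounded variant is sketched below; run_hive_counts is a made-up function name, the pause between runs is my own addition, and it leans on bash's $RANDOM just as the original loop does.

```shell
# Run the count query a fixed number of times, pausing between runs.
# Stops early if hive exits with an error.
run_hive_counts() {
  runs=$1         # how many queries to issue
  pause=${2:-1}   # seconds to sleep between queries (default 1)
  i=0
  while [ "$i" -lt "$runs" ]; do
    hive -e "select count(*) from mytable where id = $RANDOM" || return 1
    i=$((i + 1))
    sleep "$pause"
  done
}

# Usage: run_hive_counts 10 5
```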

Hive UDFs

Write own Hive UDF

Join optimisation hints

Check out the join optimisation hints for Hive queries in this article.

Setting mapred.child.java.opts in hive


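The first two did not work for me; the third did.
As far as I can tell, Hive takes everything after the = literally, so the quotes end up inside the option string passed to the child JVM (the first line is also missing its terminating semicolon).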

hive> SET mapred.child.java.opts="-server -Xmx512M"
hive> SET mapred.child.java.opts=" -Xmx512M";
hive> SET mapred.child.java.opts=-Xmx512M;