Thursday, 30 May 2013

Overview of an Oracle DB

A quick overview of an Oracle database (these queries are from a few years back)

-- Oracle files details
select * from dba_data_files
select * from v$logfile
select * from v$log
select * from v$tempfile
select * from v$loghist  -- check the amount of redo logging (log switch history)

-- Oracle database & instance details & parameter settings
select * from v$database
select * from v$instance
select * from v$parameter

-- #objects x owner & object_type
select owner, object_type, count(*)
from dba_objects
group by owner, object_type

-- #segments (count) & size (GB) x schema x segment type
select owner, segment_type, count(*), sum(bytes)/1024/1024/1024
from dba_segments
group by owner, segment_type
order by owner, segment_type

-- #segments (count) & size (GB) x schema x segment type for a given table (fill in the LIKE pattern)
select owner, segment_type, count(*), sum(bytes)/1024/1024/1024
from dba_segments
where segment_name like '%%'
group by owner, segment_type
order by 4 desc

-- Listing of schema, segment type, segment name & bytes (per-table filter commented out)
select owner, segment_type, segment_name, bytes
from dba_segments
--where segment_name like '%%'
order by 4 desc

-- other useful views
dba_jobs   -- scheduled jobs
dba_mviews -- materialized views (see also the other 9i mview dictionary views)

Thursday, 23 May 2013

Graph Analysis using MapReduce

Graph Analysis - Social Network Analysis

I was at a Teradata CTO day today and learnt about the new graph analysis capability in Asterdata, which had passed me by.

I wonder whether this is a development from the same Yahoo seed that grew Giraph.

Time to play with this ...

Friday, 3 May 2013

Hadoop MapReduce jobs hanging (after adding new nodes to cluster)

Interesting Hadoop issue

Background

We have been running a proof of concept (PoC) CDH3u3 cluster of 9 data nodes running CentOS 6.1.
Now we want to go to production and don't have the time to upgrade to CDH4.2 ;-)
We were adding new data nodes running CentOS 6.3 with the same version of CDH3u3.
At the same time we took the opportunity to update the cluster topology for these new nodes.
We added the rack they were in to the topology file.
We left the PoC cluster nodes with their default rack locations.

Symptoms

We could load data into the cluster no problem.
But when we ran queries, they would run for a while and then stop.
The number of occupied map slots would drop to 0 even though there were plenty of free slots.

We saw some errors/warnings in the logs, but they were certainly not obvious.

Hadoop jobtracker log warnings/errors

Example warnings/errors in the jobtracker logfile on the namenode host (possibly related):


2013-05-01 10:11:11,984 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved duplicate status update of 'KILLED' for 'attempt_201305010958_0001_m_000174_1' of TIP 'task_201305010958_0001_m_000174'oldTT=tracker_:localhost/127.0.0.1:54000 while newTT=tracker_:localhost/127.0.0.1:54000
2013-05-01 10:11:11,984 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201305010958_0001_m_000220_1' to tip task_201305010958_0001_m_000220, for tracker 'tracker_:localhost/127.0.0.1:54000'
2013-05-01 10:11:11,984 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: java.lang.NullPointerException
2013-05-01 10:11:11,984 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8021, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@17b0b765, false, false, true, 2320) from 10.173.226.117:53514: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException

Example odd entries in the namenode host's tasktracker logfile (possibly related):


2013-04-30 00:19:22,760 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549
2013-04-30 00:19:22,764 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549
2013-04-30 00:19:22,767 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549
2013-04-30 00:19:22,770 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549
2013-04-30 00:19:22,773 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549
2013-04-30 00:19:22,777 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to '' with reponseId '6549


Solution

Adding the rack locations of all the cluster nodes to the topology file did the trick.
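
For context, here's a minimal sketch of the kind of rack-mapping setup this refers to, assuming the common pattern of a shell script plus a flat lookup file wired up via topology.script.file.name in core-site.xml (the script name, data file path and rack labels below are illustrative, not our actual configuration). Hadoop calls the script with one or more IPs/hostnames and expects one rack per argument back on stdout.

#!/bin/bash
# Illustrative rack-mapping script - paths and rack names are made up.
# Configured via topology.script.file.name in core-site.xml.
MAP_FILE=/etc/hadoop/conf/topology.data   # flat file of "host-or-ip rack" pairs, e.g. "10.173.226.117 /dc1/rack22"

result=""
for node in "$@" ; do
  # look the node up in the flat file; unknown nodes fall back to /default-rack
  rack=$(awk -v n="$node" '$1 == n { print $2 }' "$MAP_FILE")
  if [ -z "$rack" ] ; then
    rack="/default-rack"
  fi
  result="$result $rack"
done
echo $result

The fix amounted to making sure every node, the original PoC nodes included, had an entry in the lookup data, rather than a mix of real racks and /default-rack.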

Friday, 22 February 2013

Cheap archival NAS storage using BackBlaze design

BackBlaze have a great low-cost storage product.
The best part is they have open sourced their design.

They have just announced a new 3rd-generation storage design, as reported by The Register, costing $2,000 for a chassis holding 45 disks (disks not included in the price).

Interestingly, Netflix were influenced by the BackBlaze design for their own 100TB (36 x 3TB) storage design.


Thursday, 21 February 2013

BI, DW DBMS, Big Data articles

Articles covering BI, DW, DBMSs, NoSQL and Big Data:



13 Big Data Vendors to watch in 2013 - including AWS, 10gen, Cloudera, Hortonworks, ...

Random entry - a Graph DB Neo4j overview, but the first 5 minutes gives an interesting overview of Key-Value Pair vs ColumnStore vs Document vs Graph databases


Big Data Architectures patterns by Eddie Satterley

Wednesday, 20 February 2013

Balancing an HDFS cluster (including java LeaseChecker OutOfMemoryError - still unresolved)

HDFS Balancer

Read the following articles for starters:

Yahoo tutorial module on Hadoop rebalancing 
Rebalancer Design PDF

The Architecture of Open Source Applications chapter on HDFS - see the rebalancing paragraph, but take care: it talks about the threshold being between 0 and 1 rather than as a percentage

Log on as the hadoop user (the user that runs our cluster is called hadoop).
Change to ${HADOOP_HOME}/bin where the hadoop scripts reside.
Then run start-balancer.sh.
The default balancing threshold is 10%, so choose something a little lower.
I chose 5%.
I should have started closer to 10%, say 9% or 8%.
Why? Because start-balancer.sh TAKES FOREVER!
Use hadoop dfsadmin -report to check the redistribution of the space.
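
For example, a quick way to watch the per-node figures converge is to filter the report down to node names and usage percentages (just a convenience one-liner, not a required step):

hadoop dfsadmin -report | grep -E '^Name:|DFS Used%'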

[hadoop@mynode hadoop]$ cd $HADOOP_HOME/bin


[hadoop@mynode bin]$ ./start-balancer.sh -threshold 5
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Feb 19, 2013 6:44:27 PM           0                 0 KB           516.65 GB              20 GB
[hadoop@mynode bin]$ hadoop dfsadmin -report


[hadoop@mynode bin]$ cat /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Feb 19, 2013 6:44:27 PM           0                 0 KB           516.65 GB              20 GB
Feb 19, 2013 7:05:57 PM           1              2.39 GB           514.07 GB              20 GB
Feb 19, 2013 7:28:28 PM           2              4.89 GB           511.59 GB              20 GB
Feb 19, 2013 7:50:29 PM           3              7.32 GB            509.2 GB              20 GB
Feb 19, 2013 8:12:29 PM           4              9.74 GB           506.67 GB              20 GB
Feb 19, 2013 8:34:30 PM           5             12.18 GB           504.51 GB              20 GB
Feb 19, 2013 8:56:30 PM           6             14.66 GB           502.14 GB              20 GB
Exception in thread "LeaseChecker" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:78)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:754)
at org.apache.hadoop.ipc.Client.call(Client.java:1080)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy1.renewLease(Unknown Source)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.renew(DFSClient.java:1282)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1294)
at java.lang.Thread.run(Thread.java:662)


[hadoop@mynode bin]$ ./stop-balancer.sh 
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: retry: Resource temporarily unavailable
./stop-balancer.sh: fork: Resource temporarily unavailable
[hadoop@mynode bin]$ w
 21:19:18 up 231 days, 11:44,  2 users,  load average: 0.03, 0.01, 0.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT

[hadoop@mynode bin]$ hadoop job -list
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: retry: Resource temporarily unavailable
/opt/hadoop/bin/hadoop: fork: Resource temporarily unavailable


[hadoop@mynode bin]$ cd ../pids
[hadoop@mynode pids]$ ls -atlr
total 20
drwxr-xr-x 17 hadoop hadoop 4096 Mar  8  2012 ..
-rw-rw-r--  1 hadoop hadoop    5 Feb 13 12:20 hadoop-hadoop-namenode.pid
-rw-rw-r--  1 hadoop hadoop    5 Feb 13 12:21 hadoop-hadoop-jobtracker.pid
-rw-rw-r--  1 hadoop hadoop    5 Feb 19 18:44 hadoop-hadoop-balancer.pid
drwxr-xr-x  2 hadoop hadoop 4096 Feb 19 18:44 .


[hadoop@mynode bin]$ kill -0 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ kill 2329
[hadoop@mynode bin]$ echo $?
0
[hadoop@mynode bin]$ ps -ef | grep 2329 | grep -v grep
[hadoop@mynode bin]$ 
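
For what it's worth - and this is a guess rather than something we have confirmed, hence the "still unresolved" in the title - the "unable to create new native thread" error together with the fork failures looks like the hadoop user hitting its per-user process/thread limit (nproc) rather than running out of heap. A quick check would be something along these lines:

# Suggested diagnostics (not run here) - inspect the hadoop user's process limit
ulimit -u                               # max user processes in the current shell
ps -eLf | awk '$1 == "hadoop"' | wc -l  # threads currently owned by the hadoop user
grep -r nproc /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null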


Some time later ... restarted start-balancer.sh using a 9% and then an 8% threshold ...


[hadoop@mynode bin]$ ./start-balancer.sh -threshold 9
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
[hadoop@mynode bin]$ tail -10f /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
The cluster is balanced. Exiting...
Balancing took 629.0 milliseconds

[hadoop@mynode bin]$ ./start-balancer.sh -threshold 8
starting balancer, logging to /opt/hadoop-0.20.2-cdh3u3/bin/../logs/hadoop-hadoop-balancer-mynode.out

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Mar 15, 2013 6:21:37 PM           0                 0 KB            63.46 GB              10 GB
Mar 15, 2013 6:42:37 PM           1              1.22 GB            62.13 GB              10 GB
...

Thursday, 14 February 2013

Hadoop Hive insert into partition table example script

Here's an example of creating Hadoop Hive daily summary partitions and loading data from a Hive transaction table into the newly created partitioned summary table.

The Hadoop Hive Manual has the insert syntax covered neatly but sometimes it's good to see an example.

Background

Colleagues wanted us to produce a smaller query set based on a large (billion rows per day) transaction table called big_txns that was partitioned by load date (load_dt).
They wanted the following in the smaller query set (1,000s of records):

  • Transaction Day - the day the transaction occurred (txn_dt was datetime yyyymmddHHMMSS so needed substringing for yyyymmdd)
  • Transaction Type - an interesting attribute
  • Counts - totals x transaction day x txn_type


Hive Create Partitioned External Table


DROP TABLE IF EXISTS txn_summ_x_txn_type;

CREATE EXTERNAL TABLE txn_summ_x_txn_type
    (
        txn_dt     STRING,
        txn_type   STRING,
        cnt        BIGINT
    )
    COMMENT 'Transaction summary table showing counts x txn_dt x txn_type partitioned by load_dt'
    PARTITIONED BY( load_dt STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
    STORED AS TEXTFILE
    LOCATION '/data/txn/summ/txn_summ_x_txn_type';
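
As a quick sanity check, with Hive's default partition directory naming each day's data should land under the table's LOCATION in a load_dt=YYYYMMDD subdirectory, so something like the following lists a single day's files (the date is purely illustrative, borrowed from the commented-out defaults in the script further down):

hadoop fs -ls /data/txn/summ/txn_summ_x_txn_type/load_dt=20121001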


Hive Insert into Daily Partitions

Here is a very basic shell script to build this summary table.
[Aside - originally I had the hive statement using self-consuming (here-document) input, i.e.
hive <<EOF
...
EOF
but when I ran it in the background it kept stopping, for reasons I don't know.
So I resorted to hive -e "hive cmds".
Need to go back and look at this.]


#!/bin/bash
# Description: Insert the txn_type counts x txn_dt partitioned by load_dt
# Usage:      ins_txn_summ_x_txn_type.sh START_DAY=YYYYMMDD END_DAY=YYYYMMDD
# --------------------------------------------------------------------------


dohive() {

load_dt=$1

hive -e "
SET mapred.child.java.opts=-Xmx1024M;

alter table txn_summ_x_txn_type add if not exists partition (load_dt='${load_dt}');

insert overwrite table txn_summ_x_txn_type
partition ( load_dt = '${load_dt}' )
select SUBSTR(t.txn_dt,1,8) as txn_dt,
       t.txn_type, 
       count(*) as cnt
from big_txns t
where t.load_dt='${load_dt}'
group by SUBSTR(t.txn_dt,1,8),
         t.txn_type;
"

}

#START_DAY=20121001
#END_DAY=20121031

# Allow one to change the START_DAY and END_DAY
eval $@

if [ $# -ne 2 ]
then
  echo "ERROR: usage: ins_txn_summ_x_txn_type.sh START_DAY=YYYYMMDD END_DAY=YYYYMMDD (do not cross month boundaries - if non-Linux - see comment below)"
  exit 1
fi

DAY=$START_DAY

while [ $DAY -le $END_DAY ]
do
  echo DAY $DAY
  dohive $DAY 
  # DAY=$(($DAY+1)) # use this and don't span months if non-Linux
  # on linux this DAY increment works a treat - tx Matt Livesey for reminding
  DAY=`date --date=${DAY}' +1 day' +%Y%m%d`
done
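
To run it for a month in the background and then confirm the partitions exist, something along these lines does the job (the dates match the commented-out defaults above; the nohup and log redirection are just my preference):

nohup ./ins_txn_summ_x_txn_type.sh START_DAY=20121001 END_DAY=20121031 > ins_txn_summ.log 2>&1 &

# afterwards, list the daily partitions that were created
hive -e "show partitions txn_summ_x_txn_type;"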