Thursday, 31 May 2012
Installing hadoop
Check this link for a general Hadoop install, including Mac
And this one for an OS X Lion install
Start the ssh daemon
# sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist
### check it's running
# launchctl list | grep -i ssh
- 0 com.openssh.sshd
Wednesday, 9 May 2012
R - notes to self
Contents
- Getting Started on a mac (including some good reads re getting started on R in general)
- Using R - Baby steps (example R commands)
- Installing R on CentOS from source
Getting started
Download R for mac from http://cran.r-project.org/bin/macosx/R-2.15.0.pkg
Well written introductory guide to R
More R commands
Using R - Baby steps
On file system:
# start with some arb data
$ pwd
/Users/myuser/data
$ cat test_data.csv
Name,Test1,Test2,Test3,Test4
Adam,68,73,75,82
Ben,57,62,61,59
Jim,80,85,87,92
Zak,79,73,65,63
In R:
> student.data <- read.table("/Users/myuser/data/test_data.csv", header = TRUE, sep = ",")
> names(student.data)
[1] "Name" "Test1" "Test2" "Test3" "Test4"
> ls()
[1] "student.data"
> summary(student.data[Test1])
Error in `[.data.frame`(student.data, Test1) : object 'Test1' not found
> summary(student.data["Test1"])
Test1
Min. :57.00
1st Qu.:65.25
Median :73.50
Mean :71.00
3rd Qu.:79.25
Max. :80.00
> mean(student.data["Test1"])
Test1
71
Warning message:
mean() is deprecated.
Use colMeans() or sapply(*, mean) instead.
> colMeans(student.data["Test1"])
Test1
71
> mean(student.data)
Name Test1 Test2 Test3 Test4
NA 71.00 73.25 72.00 74.00
Warning messages:
1: mean() is deprecated.
Use colMeans() or sapply(*, mean) instead.
2: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
Plotting
> a <- c(1:4)
> a
[1] 1 2 3 4
> b <- student.data[2,2:5]
> b
Test1 Test2 Test3 Test4
2 57 62 61 59
> plot(a,b)
> plot(c(1:4),student.data[2,2:5])
> plot(c(1:4),student.data[2,2:5],main = "Ben's test Results", xlab = "Tests", ylab = "Test Result / 100")
Installing R on Centos VM and then deploying to a production machine
Downloaded the latest R 2.15.0 from here: http://cran.ma.imperial.ac.uk/src/base/R-2/R-2.15.0.tar.gz (find link from here http://cran.ma.imperial.ac.uk/)
scp'd tar file to my CentOS Virtual Machine
cd /opt
tar xvzf R-2.15.0.tar.gz
cd /opt/R-2.15.0
./configure --prefix=/opt/R --with-x=no --with-readline=yes
# the next step takes 5 mins and is CPU intensive
make
make check
make prefix=/opt/R install
For the above to progress, I needed to install the following packages:
pkgconfig
ncurses-devel
readline-devel
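On CentOS, something like this should pull those in (a sketch, assuming yum and root access):
yum install -y pkgconfig ncurses-devel readline-devel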
Need "cluster" and "fields" R packages for macro.
To check whether an R package has been installed:
is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])
E.g
> is.installed('cluster')
[1] TRUE
> is.installed('fields')
[1] FALSE
> is.installed('spam')
[1] FALSE
We needed the "fields" R package.
To install "fields", need to first install "spam":
/opt/R/bin/R CMD INSTALL --build /root/R-packages/spam_0.29-1.tar.gz
/opt/R/bin/R CMD INSTALL --build /root/R-packages/fields_6.6.3.tar.gz
Tar up the /opt/R tree.
cd /opt
tar cvzf /root/R.tar.gz R/
Shipped it to production server
Untarred into /opt
cd /opt
tar xvfz /root/R.tar.gz
Added PATH="……:/opt/R/bin" and export PATH to /etc/profile.
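In other words, roughly these lines (existing PATH entries elided):
PATH="$PATH:/opt/R/bin"
export PATH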
But was missing some packages
[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
(Note: I didn't install the libgfortran*i686* rpm.)
[root@myserver ~]# rpm -i libgfortran-4.4.5-6.el6.x86_64.rpm
warning: libgfortran-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
[myuser@myserver ~]$ R
/opt/R/lib64/R/bin/exec/R: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory
[root@myserver ~]# rpm -i libgomp-4.4.5-6.el6.x86_64.rpm
warning: libgomp-4.4.5-6.el6.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID c105b9de: NOKEY
Finally checked and all was working
[myuser@myserver ~]$ R
R version 2.15.0 (2012-03-30)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])
> is.installed('cluster');
[1] TRUE
> is.installed('spam');
[1] TRUE
> is.installed('fields');
[1] TRUE
>
Check if package is installed
> is.installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])
> is.installed('base')
[1] TRUE
> is.installed('fields')
[1] FALSE
> is.installed('cluster')
[1] TRUE
> is.installed('spam')
[1] FALSE
More data
Borrowed this from somewhere ...
> require(stats); require(graphics)
> ## work with pre-seatbelt period to identify a model, use logs
> work <- window(log10(UKDriverDeaths), end = 1982+11/12)
> work
Jan Feb Mar Apr May Jun Jul
1969 3.227115 3.178401 3.178113 3.141450 3.212720 3.179264 3.192846
1970 3.243534 3.246745 3.234770 3.192567 3.197281 3.181844 3.256477
1971 3.307496 3.218798 3.228657 3.210319 3.256477 3.242044 3.254064
1972 3.318063 3.247482 3.263636 3.195623 3.295787 3.267875 3.293363
1973 3.321598 3.292920 3.224533 3.288026 3.301681 3.258398 3.303628
1974 3.206286 3.176959 3.189771 3.140508 3.238297 3.254790 3.250176
1975 3.197832 3.132260 3.218010 3.140508 3.181558 3.152594 3.158965
1976 3.168203 3.218798 3.148294 3.144574 3.184691 3.116940 3.183555
1977 3.216957 3.146438 3.149527 3.147058 3.144263 3.181844 3.184123
1978 3.291369 3.164947 3.193959 3.164055 3.160168 3.210051 3.219323
1979 3.258398 3.159868 3.246006 3.164650 3.192010 3.155640 3.154424
1980 3.221414 3.133858 3.177825 3.133539 3.162266 3.182415 3.164353
1981 3.168497 3.163758 3.188084 3.147367 3.182415 3.141450 3.215109
1982 3.163161 3.159868 3.163161 3.135133 3.172311 3.192567 3.172603
Aug Sep Oct Nov Dec
1969 3.212188 3.198382 3.218273 3.332842 3.332034
1970 3.255273 3.235276 3.302764 3.350636 3.394101
1971 3.284656 3.209247 3.299289 3.348889 3.340841
1972 3.227630 3.249932 3.295787 3.379668 3.423901
1973 3.281488 3.318898 3.318063 3.325926 3.332438
1974 3.275772 3.301898 3.317436 3.320562 3.311966
1975 3.188366 3.219060 3.193403 3.279895 3.342225
1976 3.122871 3.211388 3.242541 3.291813 3.356790
1977 3.215638 3.180413 3.226600 3.301030 3.345374
1978 3.214314 3.215638 3.226084 3.311754 3.354493
1979 3.191451 3.216166 3.218273 3.304491 3.343802
1980 3.190892 3.189771 3.261739 3.239800 3.288026
1981 3.178977 3.225568 3.287354 3.271377 3.237041
1982 3.226342 3.202488 3.267172 3.300595 3.317854
Saturday, 5 May 2012
Mongo get started - notes to self
Installing MongoDB on a mac
Download from http://www.mongodb.org/downloads for your O/S - I am on a Mac, so mongodb-osx-x86_64-2.0.4.tgz
Go to the directory beneath which you want to install mongo.
tar xvzf mongodb-osx-x86_64-2.0.4.tgz
Create a soft link mongo to the versioned directory so that you can support upgrades seamlessly ie
ln -s mongodb-osx-x86_64-2.0.4 mongo
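To bring it up, something like the following should work (a sketch - the data directory is whatever you choose; /data/db is just mongod's default dbpath):
mkdir -p /data/db                          # default dbpath; pick your own
./mongo/bin/mongod --dbpath /data/db --logpath /data/db/mongod.log --fork    # start the daemon
./mongo/bin/mongo                          # connect with the shell to check it's up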
Pentaho loading into MongoDB
Loading data into mongodb using Pentaho Kettle (PDI)
MongoDB commands/tips
It's up to the client to determine the type of the field in a document.
Note - it's possible to have different types in the same field in different documents in the same collection!
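A quick illustration of that (the typedemo collection name is made up) - the same field holding a number in one document and a string in another, in the same collection:
mongo test <<'EOF'
db.typedemo.insert({ name: "a", val: 42 })       // val is a number here
db.typedemo.insert({ name: "b", val: "42" })     // ... and a string here, same field, same collection
db.typedemo.find().forEach(printjson)
db.typedemo.drop()
EOF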
At a glance MongoDB commands
> help
db.help() help on db methods
db.mycoll.help() help on collection methods
rs.help() help on replica set methods
show dbs show database names
show collections show collections in current database
show users show users in current database
show profile show most recent system.profile entries with time >= 1ms
show logs show the accessible logger names
show log [name] prints out the last segment of log in memory, 'global' is default
use <db_name> set current database
db.foo.find() list objects in collection foo
db.foo.find( { a : 1 } ) list objects in foo where a == 1
it result of the last line evaluated; use to further iterate
DBQuery.shellBatchSize = x set default number of items to display on shell
exit quit the mongo shell
> db.foo.count() count items in collection foo
More commands here from the mongodb.org site
Query Dates
Find a record which has today's date between eff_from_dt (effective_from date) and eff_to_dt (effective_to date):
> db.test.findOne({eff_from_dt: { $lte: new Date() }, eff_to_dt: { $gte: new Date() }} )
{
"_id" : ObjectId("50892120c2e69e0d395d6daa"),
"field1" : "129384749",
"field2" : "ABC",
"field3" : "XYZ",
"eff_from_dt" : ISODate("2012-05-14T23:00:00Z"),
"eff_to_dt" : ISODate("9999-12-31T00:00:00Z")
}
Wednesday, 2 May 2012
Misc Hive notes
Hive
Hive from the Command prompt
Read how to run from the command prompt.
Get info re hive options by running hive -h
From within hive, use:
source FILE
Run a hive program from shell or command line using: hive -f <file>
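For example (the script path here is just illustrative):
hive -f /home/myuser/hql/daily_counts.hql
# or, equivalently, from inside the hive CLI:
# hive> source /home/myuser/hql/daily_counts.hql;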
Example shell script calling hive directly.
while [ 1 ]
do
hive -e "select count(*) from mytable where id = $RANDOM"
done
Hive UDFs
Write own Hive UDF
Join optimisation hints
Check out the join optimisation hints for Hive queries in this article.
Setting mapred.child.java.opts in hive
Of the approaches I tried, the first two did not work for me; the third - setting it from within the Hive CLI - did:
hive> SET mapred.child.java.opts=-Xmx512M;
Saturday, 14 April 2012
Hadoop Hive Partitioned External Table - notes to self
Hive Partitioned External Table
I had the same issue as Chris Zheng (see http://zhengdong.me/2012/02/22/hive-external-table-with-partitions/) re not being able to select anything out of my Hive external table partitions.
In fact, I had solved this problem several weeks earlier without realising it, when I was moving data from one directory to another and altered the partition definitions for the move.
My data is being loaded in a simplistic way into the following directory structure - i.e. each day's data loads into load_dt=YYYYMMDD:
hdfs://data/myfeed/stg/load_dt=YYYYMMDD
E.g. given the following files:
cat 20120301_myfeed.dat
20120301|001500|test|A|35
20120301|003000|test|B|85
20120301|004500|test|A|25
20120301|010000|test|C|35
20120301|011500|test|A|95
20120301|013000|test|D|55
cat 20120302_myfeed.dat
20120302|001500|test|A|35
20120302|003000|test|B|85
20120302|004500|test|A|25
20120302|010000|test|C|35
20120302|011500|test|A|95
20120302|013000|test|D|55
Load them like this:
hadoop fs -put 20120301_myfeed.dat /data/myfeed/stg/load_dt=20120301
hadoop fs -put 20120302_myfeed.dat /data/myfeed/stg/load_dt=20120302
Create an external table (with load_dt partition) as follows:
set myfeed_stg_location=/data/myfeed/stg;
set myfeed_stg_location;
set myfeed_stg=myfeed_stg_ext;
set myfeed_stg;
-- Suppose myfeed stg data had records like this
-- 20120301|001500|test|A|35
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:myfeed_stg}
( event_dt STRING
, event_tm STRING
, category STRING
, code STRING
, num_hits INT
)
COMMENT 'Table for myfeed staging data'
PARTITIONED BY( load_dt STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
STORED AS TEXTFILE
LOCATION '${hiveconf:myfeed_stg_location}';
I found as Chris Zheng did, that I got nothing when I selected anything from my myfeed_stg_ext table.
Turns out you need to add the partitions explicitly :( like so:
hive> alter table myfeed_stg_ext add partition (load_dt=20120301);
OK
Time taken: 0.751 seconds
hive> alter table myfeed_stg_ext add partition (load_dt=20120302);
OK
Time taken: 0.279 seconds
hive> select * from myfeed_stg_ext;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
20120302 001500 test A 35 20120302
20120302 003000 test B 85 20120302
20120302 004500 test A 25 20120302
20120302 010000 test C 35 20120302
20120302 011500 test A 95 20120302
20120302 013000 test D 55 20120302
Time taken: 0.501 seconds
hive> select * from myfeed_stg_ext where load_dt = 20120301;
OK
20120301 001500 test A 35 20120301
20120301 003000 test B 85 20120301
20120301 004500 test A 25 20120301
20120301 010000 test C 35 20120301
20120301 011500 test A 95 20120301
20120301 013000 test D 55 20120301
Time taken: 0.314 seconds
hive>
Here's a simple shell script to move the data from an existing directory structure /data/myfeed/stg/YYYYMMDD to /data/myfeed/stg/load_dt=YYYYMMDD. Make sure it runs within a single month, or change it to handle month/year boundaries.
#!/bin/bash
day=20120301
while [ $day -le 20120331 ]
do
echo "hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}"
hadoop fs -mv /data/myfeed/stg/${day} /data/myfeed/stg/load_dt=${day}
if [ $? -ne 0 ]
then
echo "ERROR: hadoop mv failed"
exit 1
fi
sleep 1 # don't need these sleeps - used during testing
hive -e "ALTER TABLE myfeed_stg_pext ADD PARTITION (load_dt=${day}); select * from myfeed_stg_pext where load_dt = '$day' limit 10;"
sleep 2 # don't need these sleeps - used during testing
day=$(($day+1))
done
TBC ...
Read up on dynamic partitions (rough sketch below) ... could this be a more elegant approach?
And compression - lzo, others?
(http://www.mrbalky.com/2011/02/24/hive-tables-partitions-and-lzo-compression/)
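Re the dynamic partitions note above, an untested sketch of what such a load might look like - it assumes a non-partitioned staging table, here called myfeed_raw, with the same columns, using event_dt as the partition value:
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE myfeed_stg_ext PARTITION (load_dt)
SELECT event_dt, event_tm, category, code, num_hits, event_dt AS load_dt
FROM myfeed_raw;
"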
Saturday, 4 February 2012
Teradata Fastload - notes to self
Fastload
Date and timestamp example
http://forums.teradata.com/forum/tools/fastload-datetime-timestamp
Filler example
http://forums.teradata.com/forum/tools/fastload-utility-0
Monday, 23 January 2012
Hadoop notes to self
Hadoop Tools Ecosystem
Useful link re the Hadoop tools ecosystem
http://nosql.mypopescu.com/post/1541593207/quick-reference-hadoop-tools-ecosystem
Hadoop standalone rebuild
See useful link
rm -rf /data/hdfs
rm -rf /data/tmpd_hdfs
hadoop namenode -format
start-all.sh
Hadoop install on CentOS - JDK + Cloudera distribution
# copy CDH tarfiles and jdk somewhere - say /tmp/downloads
cd /opt # or wherever you decide to install hadoop
# install jdk
/tmp/jdk-6u25-linux-x64.bin # Install jdk
# install hadoop apps
for f in *cdh3*gz
do
tar -xvzf $f
done
# build soft links to current CDH version
for f in `ls -d *cdh3u3`; do g=`echo $f| cut -d'-' -f 1`; ln -s $f $g; done
# check permissions and chown -R hadoop:hadoop if reqd
# edit /etc/profile and add necessary to the path e.g. +++
export JAVA_HOME="/opt/jdk"
PATH="$PATH:$JAVA_HOME/bin:/opt/hadoop/bin:/opt/hive/bin:/opt/pig/bin"
export HIVE_HOME=/opt/hive
export HADOOP_HOME=/opt/hadoop
export PIG_HOME=/opt/pig
Good Installation/config notes
Great notes re setting up a cluster
To be aware of
Small files in HDFS problem
Configuring
Read this article re configuration notes as a starter
Jobtracker hanging - memory issues - read here
Also read misconfiguration article
Transparent HugePages - see Linux reference and Greg Rahn's exposé of the THP issue on Hadoop
architectural notes here
Troubleshooting
$ hadoop fsck / 2>&1 | grep -v '\.\.\.'
FSCK started by hadoop (auth:SIMPLE) from /10.1.2.5 for path / at Tue Jun 19 08:16:59 BST 2012
Total size: 14476540835550 B
Total dirs: 6780
Total files: 1040334 (Files currently being written: 3678)
Total blocks (validated): 1207343 (avg. block size 11990412 B)
Minimally replicated blocks: 1207343 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0023208
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 9
Number of racks: 1
FSCK ended at Tue Jun 19 08:17:15 BST 2012 in 15878 milliseconds
The filesystem under path '/' is HEALTHY
$ hadoop dfsadmin -report
Configured Capacity: 65158503579648 (59.26 TB)
Present Capacity: 61841246261419 (56.24 TB)
DFS Remaining: 18049941311488 (16.42 TB)
DFS Used: 43791304949931 (39.83 TB)
DFS Used%: 70.81%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Decommissioning nodes from cluster
Decommissioning a node
- Add to the exclude node file
- Run hadoop dfsadmin -refreshNodes on the name node
This results in the data node that was added to the exclude file moving all of its data to other nodes in the cluster.
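A minimal sketch of those two steps (the exclude file path here is illustrative - it is whatever dfs.hosts.exclude points at in hdfs-site.xml on the name node):
# on the name node: add the host to the exclude file referenced by dfs.hosts.exclude
echo "mydecommnode" >> /opt/hadoop/conf/excludes
# make the namenode re-read its include/exclude lists
hadoop dfsadmin -refreshNodes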
[hadoop@mynamenode conf]$ hadoop dfsadmin -report | more
Configured Capacity: 86201191694336 (78.4 TB)
Present Capacity: 81797299645664 (74.39 TB)
DFS Remaining: 32416639721472 (29.48 TB)
DFS Used: 49380659924192 (44.91 TB)
DFS Used%: 60.37%
Under replicated blocks: 315421
Blocks with corrupt replicas: 0
Missing blocks: 0
Name: mydecommnode:50010
Rack: /dc1/r16
Decommission Status : Decommission in progress
Configured Capacity: 7239833731072 (6.58 TB)
DFS Used: 6869802995712 (6.25 TB)
Non DFS Used: 369725145088 (344.33 GB)
DFS Remaining: 305590272(291.43 MB)
DFS Used%: 94.89%
DFS Remaining%: 0%
Last contact: Wed May 08 11:37:15 BST 2013
Removing the decommissioned datanode
- Kill hadoop java processes still running on the node
- Remove from slaves file on the name nodes
- Leave in the exclude node file until a cluster reboot (changed in newer versions of Hadoop TBC?)
Fsck of hdfs shows replication violations/issues
Solution is to toggle the replication factor down one and then back to where it should be.
See example below.
$ ./fsck_all.sh | head -20
fsck'ing fs [/data]
FSCK started by hadoop (auth:SIMPLE) from / for path /data at Tue May 14 08:33:37 BST 2013
/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo: Replica placement policy is violated for blk_-1157106589514956189_3885959. Block should be additionally replicated on 1 more rack(s)
.......................Status: HEALTHY
So run the following:
$ hadoop fs -setrep 2 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
Replication 2 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
$ hadoop fs -setrep 3 /data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
Replication 3 set: hdfs://mynamenode/data/myfeed/load_dt=20130430/batch=202/myfile1_20130430110747+0100_20130430110848+0100_src1_0224180.dat.lzo
And the replica violations disappear.
Also use this approach to solve underreplication issues.
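A rough, untested loop to do the same for every file fsck flags (shown for the replica placement violations; adjust the grep for under-replicated files):
hadoop fsck /data | grep 'Replica placement policy is violated' | cut -d':' -f1 |
while read f
do
    # bounce the replication factor down and back up to force re-placement
    hadoop fs -setrep 2 "$f"
    hadoop fs -setrep 3 "$f"
done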
Oozie
Good overview of Oozie
http://blog.cloudera.com/blog/2014/03/inside-apache-oozie-ha/
http://www.thecloudavenue.com/2013/10/installation-and-configuration-of.html