The Tech Munch: April 2012

I have been playing with Cloudera Manager lately. After successfully installing the HDFS service with it, I hit the following error when I tried to copy a file to HDFS.

Here is the message:

-bash-3.2$ hadoop dfs -copyFromLocal file01 file01
12/04/23 23:24:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/cloudera/file01 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3553)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3421)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)

It took me a while to figure out the problem. Running a file system report from the command line gives a hint:

-bash-3.2$ hadoop dfsadmin -report
Configured Capacity: 0 (0 KB)
Present Capacity: 28672 (28 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 28672 (28 KB)
DFS Used%: 100%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 10.132.169.81:50010
Rack: /default
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 28672 (28 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Mon Apr 23 23:50:10 EDT 2012

From the report above, you can see that the configured capacity is 0 KB. However, Cloudera Manager reported that the file system was healthy.

The cause turned out to be disk space. I have 3 data nodes, each with 2GB available, but the default configuration requires each data node to have at least 10GB of free space. With less than that, the data node reports a configured capacity of 0, so no block can be placed on it.

Since my Hadoop setup runs in a VM environment, I added an extra 20GB virtual disk to each VM, and the problem was solved.
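A rough way to see why the capacity came out as 0 is to do the arithmetic. The sketch below is only a simplified model, not Hadoop's actual code: it assumes the 10GB requirement is a reserved-space setting on the data node (in stock Hadoop this is usually `dfs.datanode.du.reserved`, which the post does not name), that each node's volume is about 2GB in total, and that the extra 20GB disk enlarges the volume holding the data directory. The function name `configured_capacity` is made up for illustration.

```python
GB = 1024 ** 3


def configured_capacity(disk_bytes: int, reserved_bytes: int) -> int:
    """Simplified model: space left for HDFS once the reservation is taken out.

    The result is clamped at 0, which matches the "Configured Capacity: 0"
    line in the report when the reservation exceeds the disk size.
    """
    return max(0, disk_bytes - reserved_bytes)


reserved = 10 * GB  # assumed to be the 10GB default mentioned in the post

before = configured_capacity(2 * GB, reserved)         # original 2GB volume
after = configured_capacity((2 + 20) * GB, reserved)   # with the extra 20GB

print(before // GB, "GB usable before adding the disk")
print(after // GB, "GB usable after adding the disk")
```

Under these assumptions the node has nothing to offer before the change and about 12GB afterwards, which fits the fix described above. Lowering the reserved value would presumably work as well, but that was not tried here.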

Recently I have been working on a data mining project and investigating different visualization and computation tools to help analyze the data model. Both Octave and R with RStudio interest me.

I was eager to try out both tools and came across an interesting article about using Monte Carlo simulation to estimate pi. I know this is not exactly a data mining topic, but it is still interesting.

Here is the article by Niall O'Higgins.

http://niallohiggins.com/2007/07/05/monte-carlo-simulation-in-python-1/

The author uses Python to demonstrate the concept. After learning R for a couple of hours, I rewrote the code in R.

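# Estimate pi by sampling random points in the unit square.
n <- 1000000
x <- runif(n, 0.0, 1.0)
y <- runif(n, 0.0, 1.0)
# A point is a hit when it falls inside the quarter circle of radius 1.
score <- ifelse(sqrt(x^2 + y^2) <= 1.0, 1, 0)
hit <- sum(score)
# The quarter circle covers pi/4 of the square, so scale the hit ratio by 4.
myPi <- 4 * hit / n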

I am surprised at how compact the resulting code is. R seems to have better data import/export capabilities than Octave. I am very impressed with what both Octave and R can do.
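For comparison, here is a minimal Python sketch of the same estimate. It is not the code from Niall O'Higgins's article, just a plain loop that mirrors the R script; the function name `estimate_pi`, the fixed seed, and the smaller sample size are assumptions chosen so the example runs quickly and repeatably.

```python
import math
import random


def estimate_pi(n: int, seed: int = 42) -> float:
    """Estimate pi from n random points in the unit square."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same value
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        # Count the point when it lies inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            hits += 1
    # The quarter circle covers pi/4 of the square, so scale the ratio by 4.
    return 4 * hits / n


estimate = estimate_pi(200_000)
print("estimate:", estimate)
print("absolute error:", abs(estimate - math.pi))
```

The loop needs a dozen lines where R's vectorized `runif` and `sum` need a handful, which is a fair picture of why the R version feels so compact.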

Monday, April 30, 2012

HDFS - java.io.IOException: File XXX could only be replicated to 0 nodes, instead of 1

Monday, April 2, 2012

A simple R script to find the Pi