
Friday, May 11, 2012

How to install R on Cloudera CDH3

I wanted to play with the RHadoop package to see how R worked with Hadoop. Since the demo CDH3 image I was using from Cloudera did not bundle R, the first thing I had to do was install R. Easy, I thought: I just needed to install the three R RPMs from CRAN and it would be done.

It turned out the R RPMs had a lot of dependencies (about 20-30 extra RPMs were required), and the easiest way to install them was to set up the EPEL (Extra Packages for Enterprise Linux) repository first. Unfortunately, the repo location returned by a Google search (http://download.fedora.redhat.com) didn't seem to work any more. Finally, I found the right repo, and everything was done in just two commands:


$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
$ sudo yum install R


*Replace x86_64 with i386 if you are installing on a 32-bit system.
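
Once the install finishes, a quick sanity check is to run R from the shell (the output is only illustrative; the exact version depends on what EPEL ships for your release):

$ R --version
$ R -q -e 'sessionInfo()'

If both commands work, R is ready and you can move on to installing the RHadoop packages.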


Monday, April 30, 2012

HDFS - java.io.IOException: File XXX could only be replicated to 0 nodes, instead of 1

I have been playing with Cloudera Manager lately. After successfully installing the HDFS service through Cloudera Manager, I hit the following error when I tried to copy a file to HDFS:

-bash-3.2$ hadoop dfs -copyFromLocal file01 file01
12/04/23 23:24:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/cloudera/file01 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

    at org.apache.hadoop.ipc.Client.call(Client.java:1107)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy0.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy0.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3553)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3421)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)


It took me a while to figure out the problem. You can use the command line to get a report on the file system:

-bash-3.2$ hadoop dfsadmin -report
Configured Capacity: 0 (0 KB)
Present Capacity: 28672 (28 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 28672 (28 KB)
DFS Used%: 100%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)


Name: 10.132.169.81:50010
Rack: /default
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 28672 (28 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Mon Apr 23 23:50:10 EDT 2012

From the report above, you can see that the configured capacity is 0 KB, even though Cloudera Manager reported the file system as healthy.

The root cause: even though I had three data nodes, each with 2 GB available, the default configuration requires each data node to have at least 10 GB of free space.
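
If adding disk is not an option, another route (which I have not tried here) would be to lower the reserved space instead. In stock Hadoop the dfs.datanode.du.reserved property controls how much disk is set aside for non-DFS use, and Cloudera Manager exposes the same setting on the DataNode configuration page; a sketch for a plain hdfs-site.xml setup:

<!-- Assumption: lowering the non-DFS reserved space lets HDFS count
     more of a small disk as usable capacity.
     The value is in bytes; this reserves roughly 1 GB. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>

The DataNodes need a restart before a new value takes effect.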

Since my Hadoop setup runs in a VM environment, I added an extra 20 GB virtual disk to each VM, and the problem was solved.
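
After resizing the disks and restarting the cluster, re-running the report should show a non-zero configured capacity:

-bash-3.2$ hadoop dfsadmin -report | grep -E 'Configured Capacity|DFS Remaining'

If Configured Capacity is still 0, double-check that the new disk is actually mounted under the DataNode data directories.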