Showing posts with label Big Data.

Friday, May 11, 2012

How to install R on Cloudera CDH3

I wanted to play with the RHadoop package to see how R worked with Hadoop. Since the demo CDH3 image I was using from Cloudera did not bundle R, the first thing I had to do was install R. Easy, I thought: I just needed to install the three R RPMs from CRAN and I would be done.

It turned out the R RPMs had a lot of dependencies (about 20-30 extra RPMs were required), and the easiest way to install them was to set up the EPEL (Extra Packages for Enterprise Linux) repo first. Unfortunately, the repo location returned by a Google search (http://download.fedora.redhat.com) didn't seem to work any more. Finally, I found the right repo, and everything was done in just 2 commands:


$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
$ sudo yum install R


*replace the x86_64 with i386 if you are installing on a 32-bit system.
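
To make sure the install actually worked before moving on to RHadoop, a quick sanity check is to ask R for its version and evaluate a throwaway expression from the shell:


$ R --version
$ Rscript -e '1 + 1'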


Monday, April 30, 2012

HDFS - java.io.IOException: File XXX could only be replicated to 0 nodes, instead of 1

I have been playing with Cloudera Manager lately. After installing the HDFS service successfully using Cloudera Manager, I hit the following error when I tried to copy a file to HDFS.

Here is the message:

-bash-3.2$ hadoop dfs -copyFromLocal file01 file01
12/04/23 23:24:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/cloudera/file01 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)


at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3553)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3421)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)


It took me a while to figure out the problem. If you use the command line to get a report on the file system:

-bash-3.2$ hadoop dfsadmin -report
Configured Capacity: 0 (0 KB)
Present Capacity: 28672 (28 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 28672 (28 KB)
DFS Used%: 100%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)


Name: 10.132.169.81:50010
Rack: /default
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 28672 (28 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Mon Apr 23 23:50:10 EDT 2012

From the report above, you can see that the configured capacity is 0 KB. However, Cloudera Manager reported the file system as healthy.

It took me a while to figure it out: even though I had 3 data nodes, each with 2GB available, the default configuration requires each data node to have at least 10GB of free space.

Since my Hadoop setup runs in a VM environment, I added an extra 20GB virtual disk to each VM and the problem was solved.
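
If adding disk space is not convenient, another option would be to lower the amount of space each data node reserves for non-DFS use. A minimal sketch, assuming CDH3 honors the stock Hadoop property dfs.datanode.du.reserved (I believe Cloudera Manager exposes it as a reserved-space setting for the DataNode); the value is in bytes, so this reserves 1GB per volume instead of the 10GB mentioned above:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>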

Monday, April 2, 2012

A simple R script to find the Pi

Recently I have been working on a data mining project and investigating different visualization and computation tools to help analyze the data model. Both Octave and R (with RStudio) interest me.

I was eager to try out both tools and came across an interesting article about using Monte Carlo simulation to estimate pi. I know this is not exactly a data mining topic, but it is still interesting.

Here is the article by Niall O'Higgins.

http://niallohiggins.com/2007/07/05/monte-carlo-simulation-in-python-1/

The author uses Python to demonstrate the concept. After learning R for a couple of hours, I rewrote the code in R:

n <- 1000000                                     # number of random points
x <- runif(n, 0.0, 1.0)                          # x coordinates, uniform on [0, 1]
y <- runif(n, 0.0, 1.0)                          # y coordinates, uniform on [0, 1]
score <- ifelse(sqrt(x^2 + y^2) <= 1.0, 1, 0)    # 1 if the point falls inside the quarter circle
hit <- sum(score)                                # count of points inside
myPi <- 4 * hit / n                              # fraction of hits approximates pi/4

I am surprised that the resulting code can be so compact. R seems to have better data import/export capabilities than Octave. I am very impressed with what both Octave and R can do.
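
Speaking of compact, the same estimate can be squeezed into a single line, since R treats the logical comparison as 0/1 when it is averaged. This is a small variation on the script above (not from the original article):

# reusing x and y from the script above; mean() of the logical vector is the fraction of hits
myPi <- 4 * mean(x^2 + y^2 <= 1)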

Thursday, March 15, 2012

Highlights of SVForum's Big Data Analytics Conference

I attended the SVForum's Big Data Analytics Conference today. The conference took place in the Microsoft conference center in Mountain View. There were 3 keynote sessions and 3-4 panel discussions. Here are some of the highlights:

Microsoft - Data Doesn't Matter Until...
Microsoft speaker Bruno Aziza kickstarted the keynote session. Surprisingly, Microsoft supports the open-source big data community (Hadoop and others) and even contributes an "Excel module for Hive". (I find this combination a bit mind-boggling.) His talk focused on four trends in big data:

  1. Data as a Service - raw data is only the beginning; providing a value-added data service is the key.
  2. Get(Data) - get as much data as you can, since the cost of storage is down and you can always analyze it later.
  3. Scarcity vs Accessibility - the future demand for data analysts is high; new tools are needed to make data more accessible and to leverage the global talent pool.
  4. Marketing matters - use the results of data analytics to promote action; case study: how Lego analyzes children's feedback to create new products and new markets.

IBM - Smarter Decision Making, Leveraging Big Data to Gain New Actionable Insights
I was surprised that IBM had already formed a big data division; the speaker from IBM is the VP of that division. Her talk started with where the data of big data comes from (social media, RFID sensors, phones, cameras, GPS devices, smart meters, etc.), followed by a few case studies IBM had done in different segments (healthcare, green technology, traffic control). A few key takeaways:
  • the noise (as in signal-to-noise ratio) in the data stored in HDFS can be high
  • time to process data is important; Hadoop is good at batch processing unstructured data, but not that good at real-time, interactive data analytics
  • a mix of Hadoop and a traditional data warehouse approach may bring the best of both
  • big data analytics can provide value in 3 areas: strategic advantage, operational intelligence, and customer insight.

Greylock - Data Jujitsu, the art of turning data into product
Dr. DJ Patil (ex-LinkedIn and eBay) talked about how to form a good data science team, how LinkedIn promoted data analysis, and what makes a data project successful (having a diverse set of tools to use is very important), among other things.

Panel Discussions
The panel discussions were a bit free-form, so they are hard to summarize. Here are some insights I found interesting:
  • Hadoop's batch processing can be thought of as big data 1.0. A more interactive, real-time style of big data analytics will be big data 2.0.
  • A successful big data tool should be able to handle data both in the cloud and on-premises (security is not the concern; the logistics of moving large amounts of data is).
  • A good visualization tool is not enough. A good tool should provide actionable insights and recommendations (what we should do next).
  • Building a generic big data platform may not be a wise starting point for your big data project. Start with a vertical (specific) big data problem, then move horizontally.
  • Structured data and unstructured data will coexist. So will SQL and NoSQL.
  • Hadoop opens doors to new ETL (extract, transform, load) architectures. Expect new ETL tools for real-time analysis and reentrant analysis.
  • With the capabilities provided by Hadoop/HDFS, disk has become the new tape, and memory the new disk.
  • Big data analytics is still in the first half of the hype curve, but the future is very exciting.

P.S. I asked the organizer whether the slides from the conference would be available; he said he would ask the speakers. However, the conference was recorded and the videos should be available soon, so check out their web site if you find the talks interesting.