Sunday, November 25, 2012

Babbage's Difference Engine in Action


I used the holiday season to visit the Computer History Museum in Mountain View. It was not the first time I had visited this wonderful museum (a must-see if you ever visit Silicon Valley), but it was the first time I saw their Babbage machine (Difference Engine #2, Serial 2) in action (it was under repair due to shipping damage the last time I visited). When I saw the double-helix pattern appear as the machine turned through a calculation, I was speechless. It was like seeing the birth of the modern computer!

I won't go into the details of the machine, for which I encourage you to visit the museum (or the Science Museum in London), but I found a few "design principles" Babbage applied to the machine quite interesting. For example, to "improve" the speed of the calculation, the machine was designed to perform the calculation in parallel. Babbage also added a printing system, a design he borrowed from his Analytical Engine (which he never finished). This reusable module could even print different fonts! He also designed the wheels so that if some mechanical parts fell out of sync (and would therefore generate wrong results), they would jam the whole machine, so a properly functioning machine would never produce an incorrect calculation. It's fascinating to see these design ideas and safety mechanisms still applicable in modern software and hardware design.

One interesting anecdote about the machine: the restoration project was sponsored by former Microsoft CTO Nathan Myhrvold. He commissioned a duplicate so he could display it in his living room, and it has been on loan to the Computer History Museum while he gets his home ready. Well, let's hope his home remodeling takes a bit longer so more people can see this wonderful machine in its full glory.

One of the two fully restored Babbage difference engines in the world.
Details of the difference engine.
You can read the results of the calculation here or have it printed.
The printing system. It supports multiple fonts.

Tuesday, September 25, 2012

Thoughts on AWS's Fast I/O Instance

A while back I wrote a blog post on how to use Delphix in the AWS cloud. It was something of a thought experiment, because EBS performance was too slow for serious database usage. However, with the release of Amazon's latest high-performance EC2 SSD instance, things have changed. Netflix has already run their Cassandra benchmark on the new instance. They were able to use it to replace the m2.4xlarge instances they had been running with a cache and cut costs by about 60%.

To me, the fast I/O EC2 instance now makes AWS very attractive for data-intensive analytics projects. It also means you can really run databases in the cloud without relying on heavy (sometimes custom) caching*. I can hardly wait to see if I can use it in a future project.

*See the very interesting five-minute rule for how SSDs improve storage performance.

Sunday, July 15, 2012

Exciting Python Machine Learning Package (scikit-learn)

A while back, I blogged about using rpy2 to leverage R's powerful plotting and plentiful model selection from Python. It's usable but still a bit cumbersome. It turns out there is an even easier way to do machine learning in Python: scikit-learn.

Scikit-learn is another project born out of Google Summer of Code. It's currently only at version 0.11, but it has been around for more than two years and supports many supervised and unsupervised learning models. Its BSD license may be more attractive to people who are considering embedding a machine learning library in their own products. Overall it seems to be a very exciting addition to Python's machine learning toolkit.
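
To give a flavor of the API, here is a minimal sketch (assuming only the iris sample dataset bundled with the package and a default support vector classifier):

# Train a support vector classifier on the bundled iris dataset and predict a few held-out samples.
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()                    # 150 labeled flower samples, 4 features each
clf = SVC()                                    # support vector classifier with default settings
clf.fit(iris.data[:-10], iris.target[:-10])    # train on all but the last 10 samples
print(clf.predict(iris.data[-10:]))            # predicted labels for the 10 held-out samples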

Their web site is full of useful info (docs, tutorials, and demo videos), so go check it out: scikit-learn.org

P.S. If you encounter problems installing scikit-learn on your Mac, here is a very useful page on installing all the required packages: http://kral4u.blogspot.com/2012/07/installing-numpy-scipy-matplotlib.html. I also highly recommend switching from easy_install to pip.

Sunday, June 10, 2012

How to Install rpy2 on Mac OS X 10.7 (Lion)

Python and R are powerful tools for machine learning and data analysis. Like superheroes in the movies, their combined power can be unmatched. Python provides a richer set of scientific and data-processing modules, while R provides easier plotting and analytic modeling capabilities.

To access R from Python, you will need to install the rpy2* package. Usually, it's as easy as running Python's easy_install:

easy_install rpy2

However, I found I had to jump through a few hoops to get the rpy2 package compiled and installed on my Mac. The time I spent (and wasted) convinced me the info is worth sharing.

If you encounter errors while running easy_install on Mac OS X 10.7, try the following steps:

1. Install Xcode on your Mac.
You will need the gcc compiler to build rpy2. If Xcode is not installed, download and install it from the Mac App Store. (It's free.) Then install the command line tools from Xcode (go to Preferences -> Downloads tab and click the "Install" button next to Command Line Tools). This is what the Preferences pop-up looks like after installation.

    
    Note: if you upgraded your Mac OS to 10.7 (Lion) from 10.6 (Snow Leopard) and had Xcode installed before the upgrade, you still have to do this, since the old Xcode tools were moved from /usr/bin/ to /Developer/usr/bin (which was a surprise to me) and the old binaries may not work properly.

2. Make sure your R installation is shared-library enabled. If not, build it yourself.
You will need the header files from R to build rpy2. If your R was installed from a binary-only release (i.e., installed from the one-click Mac OS X R package), you need to download the R source code and build it yourself. Here are the instructions from CRAN on how to build R from source: http://cran.r-project.org/doc/manuals/R-admin.html#Installing-R-under-_0028Mac_0029-OS-X

You may have to install gfortran to build R. Unfortunately, the gfortran link provided on the CRAN site does not work for OS X 10.7, so make sure you get the right version. You can find a good reference here:

3. Download and build rpy2.
The rpy2 page on SourceForge (http://rpy.sourceforge.net/rpy2/doc-2.2/html/overview.html#installation) provides pretty good instructions on how to build and install rpy2.

Note that the default Python installation (/usr/bin/python) on Lion is Python 2.7. If you encounter version compatibility issues, you can still build it with Python 2.6:

export ARCHFLAGS='-arch i386 -arch x86_64'
/usr/bin/python2.6 setup.py build  # specify --r-home if R not in default location

4. Install and test.
After successfully building it, you can install the Python package (using the same Python version you used to build it):

python setup.py install

and verify your installation with the following:

import rpy2.robjects as robjects


If you don't see any errors, congratulations: your rpy2 is ready to go.
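
As a slightly longer sanity check, here is a minimal sketch of how R can be driven from Python through rpy2 (robjects.r lets you look up R objects and evaluate R expressions):

import rpy2.robjects as robjects

pi = robjects.r['pi']              # look up R's built-in constant pi
print(pi[0])                       # 3.141592...

robjects.r('x <- rnorm(1000)')     # evaluate arbitrary R code
print(robjects.r('mean(x)')[0])    # should be close to 0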


*rpy2 is a redesign of the rpy module. Its interface to R is better designed, and it is recommended over the original rpy module.


Wednesday, May 23, 2012

Mechanical Sympathy

I came across this article written by Martin Fowler:

http://martinfowler.com/articles/lmax.html

Originally, I was looking for more info about the LMAX Disruptor after Nathan Marz mentioned at a meetup that he had replaced the traditional queues in Storm 0.8 with the LMAX Disruptor to increase overall performance.

The second part of Martin Fowler's article was the first time I had read the term "mechanical sympathy". According to the article: "The term comes from race car driving and it reflects the driver having an innate feel for the car, so they are able to feel how to get the best out of it." Basically, you need to understand how modern hardware works in order to squeeze out the last drop of performance. These days, it is not just disk access that is slow; relative to the CPU, even a memory access is slow. You want to make sure your code and data stay in the cache to get the best performance.
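
As a rough illustration of the idea (a minimal sketch in Python with NumPy, not from the article): summing a large row-major matrix row by row walks contiguous memory, while summing it column by column strides across cache lines, and the difference usually shows up in the timings.

import time
import numpy as np

a = np.ones((5000, 5000))    # C-ordered (row-major) by default

start = time.time()
total = sum(a[i, :].sum() for i in range(a.shape[0]))   # contiguous, cache-friendly access
print("row-wise:    %.3f s" % (time.time() - start))

start = time.time()
total = sum(a[:, j].sum() for j in range(a.shape[1]))   # strided, cache-unfriendly access
print("column-wise: %.3f s" % (time.time() - start))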


This reminds me of the old days when I worked on embedded systems: you needed to consider what both the software and the hardware could provide in order to come up with the best solution.

Friday, May 11, 2012

How to install R on Cloudera CDH3

I wanted to play with the RHadoop package to see how R works with Hadoop. Since the CDH3 demo image I was using from Cloudera did not bundle R, the first thing I had to do was install R. Easy, I thought: I just needed to install the three R RPMs from CRAN and I would be done.

It turned out the R RPMs had a lot of dependencies (about 20-30 extra RPMs required), and the easiest way to install them was to add the EPEL (Extra Packages for Enterprise Linux) repo first. Unfortunately, the repo location returned by a Google search (http://download.fedora.redhat.com) didn't seem to work anymore. Finally, I found the right repo, and everything was done with just two commands:


$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
$ sudo yum install R


*Replace x86_64 with i386 if you are installing on a 32-bit system.


Monday, April 30, 2012

HDFS - java.io.IOException: File XXX could only be replicated to 0 nodes, instead of 1

I have been playing with Cloudera Manager lately. After installing the HDFS service successfully with Cloudera Manager, I hit the following error when I tried to copy a file to HDFS.

Here is the message:

-bash-3.2$ hadoop dfs -copyFromLocal file01 file01
12/04/23 23:24:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/cloudera/file01 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)


at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3553)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3421)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)


It took me a while to figure out the problem. If you use the command line to report on the file system:

-bash-3.2$ hadoop dfsadmin -report
Configured Capacity: 0 (0 KB)
Present Capacity: 28672 (28 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 28672 (28 KB)
DFS Used%: 100%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)


Name: 10.132.169.81:50010
Rack: /default
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 28672 (28 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Mon Apr 23 23:50:10 EDT 2012

From the output above, you can see that the configured capacity is 0 KB. However, Cloudera Manager reported that the file system was healthy.

Eventually I figured it out: even though I had 3 data nodes, each had only 2 GB of disk available, while the default configuration requires each data node to have at least 10 GB of free space.

Since my Hadoop setup runs in a VM environment, I added an extra 20 GB virtual disk to each VM, and the problem was solved.

Monday, April 2, 2012

A simple R script to find Pi

Recently I have been working on a data mining project and investigating different visualization and computation tools to help analyze the data model. Both Octave and R (with RStudio) interest me.

I was eager to try out both tools and came across an interesting article about using Monte Carlo simulation to estimate pi. I know this is not exactly a data mining topic, but it is still interesting.

Here is the article by Niall O'Higgins:

http://niallohiggins.com/2007/07/05/monte-carlo-simulation-in-python-1/

The author uses Python to demonstrate the concept. After learning R for a couple of hours, I rewrote the code in R. The idea is simple: draw random points uniformly in the unit square; the fraction that lands inside the quarter circle of radius 1 approximates pi/4.

n <- 1000000                                   # number of random points
x <- runif(n, 0.0, 1.0)                        # x coordinates in [0, 1]
y <- runif(n, 0.0, 1.0)                        # y coordinates in [0, 1]
score <- ifelse(sqrt(x^2 + y^2) <= 1.0, 1, 0)  # 1 if the point falls inside the quarter circle
hit <- sum(score)                              # number of hits
myPi <- 4 * hit / n                            # estimate of pi

I am surprised that the resulting code can be so compact. R seems to have better data import/export capabilities than Octave. I am very impressed with what both Octave and R can do.

Saturday, March 17, 2012

Cutting Edge vs Bleeding Edge

I was reading an article about how Facebook designed their new data center. The photo of their data center suddenly reminded me of the company I worked for more than 10 years ago: Terraspring.

Terraspring was ahead of its time. What did we do? We did utility computing. All the resources (compute, storage, network) in a data center could be put into a pool managed by Terraspring, they could be dynamically allocated, and users were charged only for what they used. Does that sound familiar? Yes, many of those ideas are now part of cloud computing.

Our original plan was to build our own data centers and provide services to other companies. When the 2000 internet bubble burst, demand for data centers took a nosedive. We were forced to change our business model and become a software provider. That didn't go well either. Soon we were sold to Sun Microsystems, and the whole team was disbanded in 2004*.

Looking at this new wave of the data center boom, I was reminded of a comment I heard at this week's Big Data Analytics Conference. Somebody asked the panel speakers for their advice on building a successful company. One of them replied that you don't have to start on the bleeding edge. That's too risky. You should start on the cutting edge. That's good enough**. When Terraspring started in the late 90s and built its data centers, it was on the bleeding edge. Nobody had ever heard of utility computing, nor did the market exist. Would our fate have been different if we had started at a different time or lasted a bit longer? I wish I knew.

*Interestingly, that's also about the time Amazon started their AWS effort.
**But you also have to move really really fast.


Terraspring became part of Sun's N1. On slide 35 you can still see the name Terraspring.

Thursday, March 15, 2012

Highlights of SVForum's Big Data Analytics Conference

I attended SVForum's Big Data Analytics Conference today. The conference took place at the Microsoft conference center in Mountain View. There were 3 keynote sessions and 3-4 panel discussions. Here are some of the highlights:

Microsoft - Data Doesn't Matter Until...
Microsoft speaker Bruno Aziza kicked off the keynote session. Surprisingly, Microsoft supports the open-source big data community (Hadoop and others) and even contributes an "Excel module for Hive". (I find this combination a bit mind-boggling.) His talk focused on 4 trends in big data:

  1. Data as Service - Raw data is only the beginning; providing value-added data services is the key.
  2. Get(Data) - Collect as much data as you can, since the cost of storage is falling and you can always analyze it later.
  3. Scarcity vs Accessibility - The future demand for data analysts is high; new tools are needed to make data more accessible and leverage the global talent pool.
  4. Marketing matters - Use the results of data analytics to drive action. Case study: how Lego analyzes children's feedback to create new products and new markets.

IBM - Smarter Decision Making, Leveraging Big Data to Gain New Actionable Insights
I was surprised that IBM had already formed a big data division. The speaker from IBM is the VP of that division. Her talk started with where the data of big data comes from (social media, RFID sensors, phones, cameras, GPS devices, smart meters, etc.), followed by a few case studies IBM had done in different segments (healthcare, green technology, traffic control). A few key takeaways:
  • Noise (as in signal/noise ratio) in the data stored in HDFS can be high.
  • Time to process data matters; Hadoop is good at batch processing unstructured data, but not that good at real-time, interactive data analytics.
  • A mix of Hadoop and the traditional data warehouse approach may bring the best of both breeds.
  • Big data analytics can provide value in 3 areas: strategic advantage, operational intelligence, and customer insight.

Greylock - Data Jujitsu, the Art of Turning Data into Product
Dr. DJ Patil (formerly of LinkedIn and eBay) talked about how to form a good data science team, how LinkedIn promoted data analysis, what makes a good data project successful (having a diverse set of tools is very important), and so on.

Panel Discussions
The panel discussions were a bit free-form, so they are hard to summarize. Here are some insights I found interesting:
  • Hadoop's batch processing can be thought of as big data 1.0. A more interactive, real-time kind of big data analytics will be big data 2.0.
  • A successful big data tool should be able to handle data both in the cloud and on-premises (security is not the concern; the logistics of moving large amounts of data is).
  • A good visualization tool is not enough. A good tool should provide actionable insights and recommendations (what we should do next).
  • Building a generic big data platform may not be a wise starting point for your big data project. Start with a vertical (specific) big data problem, then move horizontally.
  • Structured data and unstructured data will coexist, and so will SQL and NoSQL.
  • Hadoop opens doors to new ETL (extract, transform, load) architectures. New ETL tools for real-time and reentrant analysis are to be expected.
  • With the capabilities provided by Hadoop/HDFS, disk has become the new tape, and memory the new disk.
  • Big data analytics is still in the first half of the hype curve, but the future is very exciting.

P.S. I asked the organizer whether the conference slides would be available. He said he would ask the speakers. However, the conference was recorded, and the videos should be available soon, so check out their website if you find the talks interesting.

Thursday, March 8, 2012

The Magician's Trick

There is a ton of info on the web about cloud computing and highly scalable systems. However, the whole architecture of a product is rarely shown. A product's architecture is like a magician's trick: a trade secret.

That's why I got very excited when I stumbled upon the Netflix slides. For the first time, a product of such scalability and availability was shown in detail. For those who have always wondered how the magician pulls it off, this is treasure. Better yet, the SVForum Cloud and Virtualization SIG is doing a Netflix Night on March 27th. There is also another meetup on the same night focused on how Netflix migrated from Oracle to Cassandra. Is this too good to be true?

Global Netflix Platform (slides by Adrian Cockcroft)

Monday, March 5, 2012

Bridging the Cloud with Delphix

For two years I worked at a wonderful startup called Delphix. Delphix is the leader in database virtualization, and its technology allows customers to provision a database as easily as provisioning a VM. With Delphix, CIOs finally have a tool that provides the agility and cost savings their IT divisions have long enjoyed with server virtualization technology.

My focus has since shifted to cloud computing and other areas. Recently, while reading about the AWS Storage Gateway announcement and RDS support, I started ruminating on Delphix's solution. Delphix has positioned its product as an enterprise/private cloud solution (because of data sensitivity, the database is always the last thing an enterprise customer wants to move to the cloud). Can Delphix work with a public cloud such as Amazon AWS? Can it cross the chasm to enable DBaaS in a hybrid cloud environment? Most importantly, can it be done NOW?

The answer immediately appeared in my mind: use V2P*. With Delphix's V2P feature, a user can create an EC2 instance, attach an EBS volume, and then run Delphix V2P. With just a few clicks, the user has a new database created, with its data loaded, in the cloud. The concept is illustrated in the following diagram**, with a rough sketch of the AWS provisioning steps after it.

Fig 1. Use Delphix V2P to create a database instance in AWS.
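
For the AWS side of the workflow, here is a minimal sketch using the boto library (the AMI ID, key pair, and volume size are placeholders I made up, and the Delphix V2P step itself is omitted since it is driven from the Delphix side):

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Launch the target EC2 instance (ami-xxxxxxxx is a placeholder for your database AMI).
reservation = conn.run_instances("ami-xxxxxxxx", instance_type="m1.large", key_name="my-key")
instance = reservation.instances[0]

# Create an EBS volume in the same availability zone and attach it to the instance.
# (In real code, wait until instance.update() reports the 'running' state before attaching.)
volume = conn.create_volume(100, instance.placement)    # 100 GB is a placeholder size
conn.attach_volume(volume.id, instance.id, "/dev/sdf")

# From here, Delphix V2P restores the chosen point-in-time copy onto the attached volume.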

This approach has several advantages over the AWS RDS approach. First, Delphix does all the bookkeeping, so the user can provision a new database to any point in time without worrying about backup schedules or how to find the right data set in the backups. Second, the whole operation runs against the Delphix appliance only, so no load is placed on the production database (which the RDS export or dump approach requires). Third, the user gets a solution that works for both private and public clouds.

In short, cloud computing opens many new opportunities. How to bridge the gap between private and public clouds is an interesting challenge for all solution providers. Delphix already has a strong start with its current V2P offering. It will be really interesting to see what new features Delphix provides to make the cloud transition even easier in the future.


* How did I know? Because the feature was implemented by yours truly.
** Thanks to my coworker Fred for contributing the diagram and validating the info.