Tuesday, March 27, 2012
SVForum Big Data Conference Video Available
Here is the link: http://www.youtube.com/playlist?list=PL7CCD96639D8A3087
Saturday, March 17, 2012
Cutting Edge vs Bleeding Edge
I was reading an article about how Facebook designed their new data center. The photo of the data center suddenly reminded me of the company I worked for more than 10 years ago: Terraspring.
Terraspring was ahead of its time. What did we do? Utility computing. All the resources (compute, storage, network) in a data center could be put into a pool managed by Terraspring, allocated dynamically, and users were charged only for what they used. Does it sound familiar? Yes, many of those ideas are now part of cloud computing.
Our original plan was to build our own data centers and provide services to other companies. When the 2000 internet bubble burst, demand for data centers took a nosedive. We were forced to change our business model and become a software provider. That didn't go well either. Soon we were sold to Sun Microsystems, and the whole team was disbanded in 2004*.
Looking at this new wave of the data center boom, I was reminded of a comment I heard at this week's Big Data Analytics Conference. Somebody asked the panel speakers for their advice on building a successful company. One of them replied that you don't have to start at the bleeding edge. That's too risky. Start at the cutting edge. That's good enough**. When Terraspring started in the late 90s and built its data centers, it was on the bleeding edge. Nobody had heard of utility computing, nor did the market exist. Would our fate have been different if we had started at a different time or lasted a bit longer? I wish I knew.
*Interestingly, that's also about the time Amazon started their AWS effort.
**But you also have to move really really fast.
Terraspring became part of Sun's N1. On slide 35 you can still see the name Terraspring.
Thursday, March 15, 2012
Highlights of SVForum's Big Data Analytics Conference
I attended SVForum's Big Data Analytics Conference today. The conference took place at the Microsoft conference center in Mountain View. There were 3 keynote sessions and 3-4 panel discussions. Here are some of the highlights:
Microsoft - Data Doesn't Matter Until...
Microsoft speaker Bruno Aziza kickstarted the keynote session. Surprisingly, Microsoft supports the open-source big data community (Hadoop and others) and even contributes an "Excel module for Hive". (I find this combination a bit mind-boggling.) His talk focused on 4 trends of big data:
- Data as a Service - raw data is only the beginning; providing a value-added data service is the key.
- Get(Data) - collect as much data as you can, since the cost of storage keeps falling and you can always analyze it later (see the sketch after this list).
- Scarcity vs Accessibility - demand for data analysts will be high; new tools are needed to make data more accessible and to leverage the global talent pool.
- Marketing matters - use the results of data analytics to drive action; case study: how Lego analyzes children's feedback to create new products and new markets.
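To give a flavor of the Get(Data)-then-analyze-later idea, here is a minimal sketch of running a HiveQL query from Python through the plain Hive CLI. The web_logs table, its columns, and the partition date are made up for illustration, and this is not the Excel module mentioned above:

import subprocess

# Hypothetical query: top 10 pages by view count for one day of raw logs
# already sitting in HDFS. Table and column names are invented.
query = """
SELECT page, COUNT(*) AS views
FROM web_logs
WHERE dt = '2012-03-15'
GROUP BY page
ORDER BY views DESC
LIMIT 10
"""

# 'hive -e' runs the query string and prints the result to stdout,
# where it can be redirected to a file and pulled into a spreadsheet.
subprocess.check_call(['hive', '-e', query])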
IBM - Smarter Decision Making, Leveraging Big Data to Gain New Actionable Insights
I was surprised that IBM had already formed a big data division. The speaker from IBM is the VP of that division. Her talk started with where big data comes from (social media, RFID sensors, phones, cameras, GPS devices, smart meters, etc.), followed by a few case studies IBM had done in different segments (healthcare, green technology, traffic control). A few key takeaways:
- the noise (as in signal-to-noise ratio) in the data stored in HDFS can be high
- time to process data matters; Hadoop is good at batch processing unstructured data, but not as good at real-time, interactive data analytics (see the sketch after this list)
- a mix of Hadoop and the traditional data warehouse approach may bring the best of both worlds
- big data analytics can provide value in 3 areas: strategic advantage, operational intelligence, and customer insight.
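To make the batch-processing point concrete, here is the canonical Hadoop streaming example, a word count over raw text, written as two small Python scripts (the file names and paths are illustrative):

#!/usr/bin/env python
# mapper.py: emit a (word, 1) pair for every token read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: sum the counts for each word; Hadoop streaming delivers the
# mapper output sorted by key, so all counts for a word arrive contiguously
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The two scripts are then submitted as a regular batch job, something like: hadoop jar hadoop-streaming.jar -input /raw/logs -output /out/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. The job scans everything under the input path and only produces output when it finishes, which is exactly the batch, non-interactive style the takeaway above refers to.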
Greylock - Data Jujitsu, the art of turning data into product
Dr. DJ Patil (formerly of LinkedIn and eBay) talked about how to form a good data science team, how LinkedIn promoted data analysis, what makes a data project successful (having a diverse set of tools to use is very important), and so on.
Panel Discussions
The panel discussions were fairly free-form, so they are hard to summarize. Here are some insights I found interesting:
- Hadoop's batch processing can be thought of as big data 1.0. A more interactive, real-time type of big data analytics will be big data 2.0.
- A successful big data tool should be able to handle data both in the cloud and on-premises (security is not the concern; the logistics of moving large amounts of data is).
- A good visualization tool is not enough. A good tool should provide actionable insights and recommendations (what we should do next).
- Building a generic big data platform may not be a wise starting point for your big data project. Start with a vertical (specific) big data problem, then move horizontally.
- Structured data and unstructured data will coexist. So will SQL and NoSQL.
- Hadoop opens the door to new ETL (extract, transform, load) architectures. Expect new ETL tools for real-time and reentrant analysis.
- With the capabilities provided by Hadoop/HDFS, disk has become the new tape, and memory the new disk.
- Big data analytics is still in the first half of the hype curve, but the future is very exciting.
P.S. I asked the organizer if the slides from the conference would be made available. He said he would ask the speakers. However, the conference was recorded and the videos should be available soon. So check out their web site if you find the talks interesting.
Thursday, March 8, 2012
The Magician's Trick
There is tons of info on the web about cloud computing and highly scalable systems. However, the whole architecture of a product is rarely illustrated. The architecture of a product is like a magician's trick: a trade secret.
That's why I got very excited when I stumbled upon the Netflix slides. For the first time, a product of such scalability and availability was shown in detail. For those who have always wondered how the magician pulls it off, this is a treasure. Better yet, the SVForum Cloud and Virtualization SIG is hosting a Netflix Night on March 27th. There is also another meetup the same night focused on how Netflix migrated from Oracle to Cassandra. Are things too good to be true?
Monday, March 5, 2012
Bridging the Cloud with Delphix
For two years I worked at a wonderful startup called Delphix. Delphix is the leader in database virtualization, and its technology allows customers to provision a database as easily as provisioning a VM. With Delphix, CIOs finally have a tool to deliver the agility and cost savings their IT divisions have long enjoyed with server virtualization technology.
My focus has since shifted to cloud computing and other areas. Recently, while reading about the AWS Storage Gateway announcement and the RDS support, I started ruminating on Delphix's solution. Delphix has positioned its product as an enterprise/private cloud solution (because of the sensitivity of the data, the database is always the last thing an enterprise customer wants to move to the cloud). Can Delphix work with a public cloud such as Amazon AWS? Can it cross the chasm to enable DBaaS in a hybrid cloud environment? Most importantly, can it be done NOW?
The answer immediately came to mind: use V2P*. With Delphix's V2P feature, a user can create an EC2 instance, attach an EBS volume, and then run the Delphix V2P. With just a few clicks, the user has a new database created, with data loaded, in the cloud. The concept is illustrated in the diagram below**.
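For the AWS side of those steps, a rough sketch using the boto library might look like the following. The AMI id, instance type, key pair, and volume size are placeholders, and the final V2P step is driven from the Delphix appliance itself, so it appears only as a comment:

import time
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# 1. Launch the EC2 instance that will host the new database.
reservation = conn.run_instances('ami-00000000',
                                 instance_type='m1.large',
                                 key_name='my-keypair')
instance = reservation.instances[0]
while instance.state != 'running':          # wait until the instance is up
    time.sleep(5)
    instance.update()

# 2. Create an EBS volume in the same availability zone and attach it.
volume = conn.create_volume(100, instance.placement)   # 100 GB volume
conn.attach_volume(volume.id, instance.id, '/dev/sdf')

# 3. From the Delphix appliance, run V2P against this instance to create
#    the database and load the data (done through the Delphix UI, not here).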
The approach has several advantages over the AWS RDS approach. First, Delphix does all the bookkeeping, so the user can provision a new database to any point in time without worrying about backup schedules or how to find the right data set in the backups. Second, the whole operation runs against the Delphix appliance only, so no load is placed on the production database (which the RDS export or dump approach requires). Third, the user gets a solution that works for both private and public clouds.
In short, cloud computing opens many new opportunities. How to bridge the gap between the private and public cloud is an interesting challenge for all solution providers. Delphix already has a strong start with its current V2P offering. It will be really interesting to see what new features Delphix provides to make the cloud transition even easier in the future.
* How did I know? Because the feature was implemented by yours truly.
** Thanks to my coworker Fred for contributing the diagram and validating the info.
Fig 1. Use Delphix V2P to create a database instance in AWS.