Wednesday, July 2, 2014

What's new in stream processing

The Lambda architecture seems to be popular in stream processing. It is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. The idea is that you implement the same transformation logic twice, once for the batch system and once for the stream processing system. Then you combine the results from both systems at query time to produce a complete answer.
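To make the idea concrete, here is a minimal sketch in plain Python. It is only a toy under my own assumptions: the class names `BatchLayer` and `SpeedLayer`, the `query` function, and the event-counting task are all hypothetical, and nothing here uses the real MapReduce or Storm APIs. It only shows the same counting logic written twice and merged at query time.

```python
from collections import Counter


class BatchLayer:
    """Toy batch system: recomputes a full view from all historical events."""

    def __init__(self):
        self.view = Counter()

    def recompute(self, master_dataset):
        # First copy of the logic: count every event from scratch.
        self.view = Counter(master_dataset)


class SpeedLayer:
    """Toy stream system: updates a view one event at a time."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        # Second copy of the same logic, written incrementally.
        self.view[event] += 1

    def reset(self):
        # Called once a batch run has absorbed the recent events.
        self.view = Counter()


def query(key, batch, speed):
    """Merge both views at query time to get a complete answer."""
    return batch.view[key] + speed.view[key]
```

For example, after `batch.recompute(["a", "b", "a"])` and `speed.on_event("a")`, `query("a", batch, speed)` adds the two partial counts. Note that the counting logic lives in two places, which is exactly the maintenance cost discussed next.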

The problems with the Lambda architecture that Jay Kreps (@jaykreps) points out are also worth noting.

"First, maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems to be. One proposed approach to fix this is to have a language or framework that abstracts over both the real-time and batch framework. Summingbird is a framework that does this."

"Second, even we can only code once, the operational burden of running and debugging two systems is going to be very high."

The reason we are still interested in the Lambda architecture would be:

"What they have at their disposal are two things that don’t quite solve their problem: a scalable high-latency batch system that can process historical data and a low-latency stream processing system that can’t reprocess results."
 "In this sense, even though it can be painful, I think the Lambda Architecture solves an important problem that was otherwise generally ignored. But I don’t think this is a new paradigm or the future or big data. It is just a temporary state driven by the current limitation of off-the-shelf tools. I also think there are better alternatives."
 
Some references:

  • Website, Lambda Architecture. http://lambda-architecture.net/
  • Book, Big Data. http://www.manning.com/marz/
  • Framework, Kafka 
  • Framework, Samza
  • Website, How to beat the guys that say they beat CAP,  Beating the CAP Theorem Checklist  

Tuesday, July 1, 2014

Want to write an Operating System

To most computer science students, operating systems are the most mysterious pieces of software. Back in my school days, I was quite curious about how a piece of software could bootstrap itself from nothing. So you may have a good reason to try this website, OSDev:

"This website provides information about the creation of operating systems and serves as a community for those people interested in OS creation. The wiki contain articles about various OS developing subjects.  "


It basically provides all the details you need to build your own operating system. I mean, if you spend a semester studying the material listed on this site, you can probably learn much more than by listening to boring lectures on producer-consumer and semaphores.

Monday, June 30, 2014

The Best Graph Processing Engine?

Graph processing is so hot these days. But what qualities should the best graph processing engine have? Here I list several naive ideas, which I believe are error-prone and easy for you to beat, so please give me your feedback.

  1. Good support for both in-memory and out-of-core processing. This requirement comes from the large size of graphs. Even with a huge cluster, we may still be unable to load the whole graph into memory.
  2. Good support for time-evolving graphs. It is wasteful to rerun a complex computation from scratch on a graph in which only a small portion has changed. Using reasonable resources to accelerate processing of time-evolving graphs is essential here.
  3. Good support for graph traversal. Map-reduce or scatter-gather style processing is not the only important mode; traversal that starts from a given vertex and ends at a set of vertices is also critical in many use cases. However, efficient graph traversal may conflict with the divide-and-conquer graph partitioning strategy. Graph traversal is usually considered a feature of graph databases, not of graph processing frameworks. Still, I think it should be weighed carefully before making that decision.
  4. Good support for rich-data graphs. The graph structure itself contains only vertices and the edges between them. However, what really matters is the rich data attached to those vertices and edges. The processing model should be aware of this data and process it accordingly.
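To illustrate the difference between the two processing styles in item 3, here is a rough sketch in plain Python. The toy graph `GRAPH` and the function names `traverse` and `scatter_gather_in_degree` are my own hypothetical choices, not taken from any real engine. The traversal touches only the vertices reachable from a start vertex, while the scatter-gather round touches every vertex and edge.

```python
from collections import deque

# Hypothetical toy graph stored as an adjacency list.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}


def traverse(graph, start, max_hops):
    """Breadth-first traversal from start, limited to max_hops hops.

    Returns a dict mapping each reached vertex to its hop distance.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[vertex]:
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return hops


def scatter_gather_in_degree(graph):
    """One scatter-gather round over the whole graph.

    Every vertex sends the message 1 along each out-edge (scatter),
    and every vertex sums the messages it receives (gather).
    """
    inbox = {vertex: 0 for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor] += 1
    return inbox
```

Calling `traverse(GRAPH, "a", 1)` never looks at vertex "e", whereas `scatter_gather_in_degree(GRAPH)` visits every vertex. If the graph is partitioned across machines, a single traversal may hop between partitions at every step, which is the conflict mentioned above.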
To be continued...

Friday, June 27, 2014

Google abandoned MapReduce?

Recently, big news spread from Google I/O 2014: Google has abandoned MapReduce, which was considered one of its most powerful weapons. Rumor says the newest execution engine and programming model are MillWheel and FlumeJava, respectively.

It is easy to see that MapReduce would be abandoned sooner or later: it is inefficient, slow, and resource-wasting in most use cases. From the programming model's perspective, it is simple and easy to use, but far from enough to express the wide range of real-world applications, such as iterative algorithms, multi-phase workflows, incremental processing, and real-time stream processing.
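As a reminder of how small the programming model is, here is a single-process toy version in Python. It is only a sketch under my own assumptions: `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical names, and this is not Google's or Hadoop's actual API.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one toy MapReduce job in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: group all values by key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)
```

Word count is one call: `map_reduce(["a b a", "b c"], word_mapper, sum_reducer)`. An iterative algorithm written in this style would have to call `map_reduce` once per iteration and feed the full output of one job into the next, which hints at why the model fits such workloads poorly.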

Luckily, in the open source camp, we have Spark, which has proven useful in almost all of the applications listed above. Its in-memory computation also leaves a lot of room for imagination in the future (maybe related to the anti-caching topic?).

P.S.
After writing this, I saw some interesting posts discussing the retirement of MapReduce at Google and the possible coming dotage of Hadoop. For example: The elephant was a Trojan Horse: On the Death of Map-Reduce at Google.

The taste of your research

Research taste is an important but often overlooked, sometimes deliberately, concept in both science and technology research. Thousands of researchers all around the world are trying to tackle problems. Some of them may not be capable, but most of them are really smart and hard-working. So what will distinguish you from them, the way we can distinguish Euler from the other mathematicians of his era? I guess the answer is your research taste.

Taste covers several aspects. First, you should have good taste in choosing important problems. It makes no sense to solve only 'no-one-cares' problems and expect massive recognition from others. Second, even if you are solving an important problem, you should be able to judge whether you are the right person to do it. It is not a good idea for a mathematician to try to solve a physics puzzle. With the right and important question in hand, you still need good taste about the possible solutions: whether they will work. Going too far in the wrong direction will waste your valuable time and destroy your confidence. Finally, once you have a good solution, the last piece of taste is how to present your results. In the history of science and technology, it is not rare for someone to fail to be recognized as a real contributor just because they failed to show what they had done and how it would affect humanity.

It may be a little late for me to start building research taste now, but at least it will not hurt. And do you remember that Einstein spent roughly the last thirty years of his life on unified field theory? Your taste can fade even if you once had it.