Wednesday, July 2, 2014

What's new in stream processing

The Lambda architecture seems to be popular in stream processing. It is an approach to building stream processing applications on top of MapReduce and Storm, or similar systems. The idea is that you implement the same transformation logic twice, once for the batch system and once for the stream processing system. Then you combine the results from both systems at query time to produce a complete answer.
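As a rough illustration of the idea, here is a minimal single-process sketch in Python. It does not use MapReduce or Storm; the names `batch_count`, `StreamCounter`, and `query`, as well as the page-view events, are made up for this example. The same counting logic is written twice, once as a batch recomputation and once as an incremental update, and the two views are merged at query time.

```python
from collections import Counter


def batch_count(events):
    """Batch layer (hypothetical): recompute page-view counts from the full log."""
    counts = Counter()
    for page, _timestamp in events:
        counts[page] += 1
    return counts


class StreamCounter:
    """Speed layer (hypothetical): update counts incrementally per event."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, page, _timestamp):
        self.counts[page] += 1


def query(page, batch_view, realtime_view):
    """Serving step: merge the batch view and the real-time view at query time."""
    return batch_view[page] + realtime_view[page]


# Events up to the last batch run are covered by the batch layer.
historical = [("home", 1), ("about", 2), ("home", 3)]
batch_view = batch_count(historical)

# Events that arrived after the last batch run are only seen by the speed layer.
speed = StreamCounter()
for page, ts in [("home", 4), ("about", 5), ("home", 6)]:
    speed.on_event(page, ts)

print(query("home", batch_view, speed.counts))
```

Note that the counting logic lives in two places (`batch_count` and `StreamCounter.on_event`), and both have to stay in agreement for the merged answer to be correct.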

Jay Kreps (@jaykreps) has also pointed out some notable problems with the Lambda architecture.

"First, maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems to be. One proposed approach to fix this is to have a language or framework that abstracts over both the real-time and batch framework. Summingbird is a framework that does this."

"Second, even we can only code once, the operational burden of running and debugging two systems is going to be very high."

The reason we are still interested in the Lambda architecture would be:

"What they have at their disposal are two things that don’t quite solve their problem: a scalable high-latency batch system that can process historical data and a low-latency stream processing system that can’t reprocess results."
 "In this sense, even though it can be painful, I think the Lambda Architecture solves an important problem that was otherwise generally ignored. But I don’t think this is a new paradigm or the future or big data. It is just a temporary state driven by the current limitation of off-the-shelf tools. I also think there are better alternatives."
 
Some references:

  • Website, Lambda Architecture. http://lambda-architecture.net/
  • Book, Big Data. http://www.manning.com/marz/
  • Framework, Kafka 
  • Framework, Samza
  • Website, How to beat the guys that say they beat CAP,  Beating the CAP Theorem Checklist  

Tuesday, July 1, 2014

Want to write an Operating System

To most computer science students, operating systems are the most mysterious pieces of software. Back in my school days I was quite curious about how a piece of software could bootstrap itself from nothing. So you may have a good reason to try this website, OSDev:

"This website provides information about the creation of operating systems and serves as a community for those people interested in OS creation. The wiki contain articles about various OS developing subjects.  "


It basically provides all the details you need to build your own operating system. I mean, if you spend a semester studying the material listed on this site, you will probably learn much more than by listening to boring lectures on producer-consumer problems and semaphores.

Monday, June 30, 2014

The Best Graph Processing Engine?

Graph processing is so hot these days. But what qualities should the best graph processing engine have? Here I list several naive ideas, which I believe are error-prone and easily beaten by yours, so please give me your feedback.

  1. Good support for both in-memory and out-of-core processing. This requirement comes from the sheer size of graphs. Even with a huge cluster, it is still possible that we cannot load the whole graph into memory.
  2. Good support for time-evolving graphs. It is wasteful to rerun a complex computation from scratch on a graph in which only a small portion has changed. Using a reasonable amount of resources to speed up processing of time-evolving graphs is essential here.
  3. Good support for graph traversal. Map-reduce or scatter-gather style processing on graphs is important, but so is traversal that starts from a given vertex and ends at a set of vertices, which is critical in many use cases. However, efficient graph traversal may conflict with the divide-and-conquer graph partitioning strategy. Graph traversal is usually considered a feature of graph databases, not of graph processing frameworks. Still, I think it should be weighed carefully before making such a decision.
  4. Good support for data-rich graphs. The graph structure itself contains only vertices and the edges between them, but what really matters is the rich data attached to those vertices and edges. The processing model should be aware of this data and process it differently.
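To make the traversal item a bit more concrete, here is a minimal sketch in Python of a breadth-first traversal from a given vertex, limited to a number of hops. The adjacency-list layout, the function name `traverse`, and the tiny example graph are assumptions for illustration only; a real engine would have to run this across partitions, which is exactly where the conflict with partitioning shows up.

```python
from collections import deque


def traverse(graph, start, max_hops):
    """Breadth-first traversal from `start`.

    Returns {vertex: hop count} for every vertex within `max_hops` of `start`.
    `graph` is a plain adjacency list: {vertex: [neighbor, ...]}.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(vertex, []):
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return hops


# Hypothetical example graph.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
reachable = traverse(graph, "a", 2)
print(reachable)
```

Every step follows an edge to wherever the neighbor lives, so on a partitioned graph each hop may cross machines, unlike scatter-gather processing where every vertex does local work in lockstep.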
To be continued...

Friday, June 27, 2014

Google abandoned MapReduce?

Recently, big news came out of Google I/O 2014: Google has abandoned MapReduce, which was considered one of its most powerful weapons. Rumor has it that the newest execution engine and programming model are MillWheel and FlumeJava, respectively.

It is easy to see that MapReduce would be abandoned sooner or later: in most use cases it is inefficient, slow, and wasteful of resources. From the programming model's perspective, it is simple and easy to use, but far from expressive enough to cover the many applications in the real world, such as iterative algorithms, multi-phase workflows, incremental processing, and real-time stream processing.
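For readers who have not seen the programming model, here is a minimal in-process sketch of it in Python, using word count as the example. This is not Hadoop or Google's implementation; `run_job`, `word_mapper`, and `sum_reducer` are names invented for this illustration. It shows both why the model is simple and why it is limiting: a job is exactly one map step, one shuffle, and one reduce step.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass over `records` in a single process."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = (pair for record in records for pair in mapper(record))

    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(_key, values):
    return sum(values)


counts = run_job(["the quick fox", "the lazy dog"], word_mapper, sum_reducer)
print(counts)
```

An iterative algorithm would have to call `run_job` once per iteration and pass the whole intermediate result between jobs. On a real cluster that usually means writing to and reading from distributed storage each time, which is one source of the inefficiency described above.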

Luckily, in the open source camp, we have Spark, which has proven useful in almost all of the applications listed above. Its in-memory computation also leaves a lot of room for imagination in the future (maybe related to the anti-caching topic?).

P.S.
After writing this, I saw some interesting posts discussing the retirement of MapReduce at Google and the possible coming dotage of Hadoop. For example: The elephant was a Trojan Horse: On the Death of Map-Reduce at Google.

The taste of your research

Research taste is an important but often overlooked, or deliberately ignored, concept in both scientific and technological research. Thousands of researchers around the world are trying to tackle problems. Some of them may not be capable, but most of them are really smart and hard-working. So what will distinguish you from them, the way we can distinguish Euler from other mathematicians of his era? I guess the answer is your research taste.

Taste has several aspects. First, you should have good taste in choosing important problems. It makes no sense to solve problems that no one cares about and still expect broad recognition from others. Second, even if you are solving an important problem, you should be able to judge whether you are the right person to do it. It is not a good idea for a mathematician to try to solve a physics puzzle. Third, with the right and important question in hand, you still need good taste in possible solutions: will they work? Going too far in the wrong direction will waste your valuable time and destroy your confidence. Finally, once you have a good solution, the last kind of taste is how to present your results. In the history of science and technology, it is not rare for someone to miss out on recognition as a real contributor simply because they failed to demonstrate what they had done and how it would affect humanity.

It may be a little late for me to start building scientific taste now, but at least it will not hurt. And do you still remember that Einstein spent most of his later decades on unified field theory? Your taste can fade even if you once had it.


Monday, January 28, 2013

Some Recent Must-Read Papers on Cloud Computing Frameworks




In this post, I will keep updating a list of well-known papers on cloud computing, cloud storage, and related topics from the last 5~6 years. I think every new researcher in this area should at least glance over them.

Any comments or suggestions are welcome!




  • Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. LINK
  • Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003. LINK
  • Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4. LINK
  • Yang, Hung-chih, et al. "Map-reduce-merge: simplified relational data processing on large clusters." Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007. LINK
  • Bu, Yingyi, et al. "HaLoop: Efficient iterative data processing on large clusters." Proceedings of the VLDB Endowment 3.1-2 (2010): 285-296. LINK
  • Borthakur, Dhruba, et al. "Apache Hadoop goes realtime at Facebook." Proceedings of the 2011 international conference on Management of data. ACM, 2011. LINK
  • Burrows, Mike. "The Chubby lock service for loosely-coupled distributed systems." Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, 2006. LINK
  • DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., ... & Vogels, W. (2007, October). Dynamo: amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review (Vol. 41, No. 6, pp. 205-220). ACM. LINK
  • Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010, December). S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on (pp. 170-177). IEEE. LINK
  • McKusick, Marshall Kirk, and Sean Quinlan. "GFS: Evolution on fast-forward." ACM Queue 7.7 (2009): 10-20. LINK
  • Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 2006. LINK
  • Lakshman, Avinash, and Prashant Malik. "Cassandra—A decentralized structured storage system." Operating systems review 44.2 (2010): 35. LINK
  • Baker, J., Bond, C., Corbett, J. C., Furman, J. J., Khorlin, A., Larson, J., ... & Yushprakh, V. (2011, January). Megastore: Providing scalable, highly available storage for interactive services. In Proc. of CIDR (pp. 223-234). LINK
  • Ousterhout, John, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra et al. "The case for RAMCloud." Communications of the ACM 54, no. 7 (2011): 121-130. LINK
  • Ongaro, Diego, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. "Fast crash recovery in RAMCloud." In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 29-41. ACM, 2011. LINK
  • Ekanayake, Jaliya, et al. "Twister: a runtime for iterative mapreduce." Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010. LINK
  • Zhang, Yanfeng, et al. "Priter: a distributed framework for prioritized iterative computations." Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011. LINK
  • Peng, Daniel, and Frank Dabek. "Large-scale incremental processing using distributed transactions and notifications." Proceedings of the 9th USENIX conference on Operating systems design and implementation. USENIX Association, 2010. LINK
  • Bhatotia, Pramod, et al. "Incoop: MapReduce for incremental computations." Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011. LINK
  • Zaharia, M.; Borthakur, D.; Sen Sarma, J.; Elmeleegy, K.; Shenker, S. & Stoica, I. "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling" Proceedings of the 5th European conference on Computer systems, 2010, 265-278.  LINK
  • Zaharia, M.; Chowdhury, M.; Franklin, M.; Shenker, S. & Stoica, I. "Spark: cluster computing with working sets" Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010, 10-10   LINK
  • Zaharia, M.; Konwinski, A.; Joseph, A.; Katz, R. & Stoica, I. "Improving mapreduce performance in heterogeneous environments" Proceedings of the 8th USENIX conference on Operating systems design and implementation, 2008, 29-42. LINK
  • Armbrust, M.; Fox, A.; Griffith, R.; Joseph, A.; Katz, R.; Konwinski, A.; Lee, G.; Patterson, D.; Rabkin, A.; Stoica, I. & others "A view of cloud computing" Communications of the ACM, ACM, 2010, 53, 50-58. LINK
  • Lloyd, Wyatt, et al. "Don't settle for eventual: scalable causal consistency for wide-area storage with COPS." Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011. LINK
  • Low, Yucheng, et al. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment 5.8 (2012): 716-727. LINK
  • Mitchell, Christopher, Russell Power, and Jinyang Li. "Oolong: Programming Asynchronous Distributed Applications with Triggers." Proc. SOSP. 2011. LINK
  • Mitchell, Christopher, Russell Power, and Jinyang Li. "Oolong: asynchronous distributed applications made easy." Proceedings of the Asia-Pacific Workshop on Systems. ACM, 2012. LINK
  • Power, Russell, and Jinyang Li. "Piccolo: building fast, distributed programs with partitioned tables." Proceedings of the 9th USENIX conference on Operating systems design and implementation. USENIX Association, 2010. LINK
  • Harter, Tyler, et al. "A file is not a file: understanding the I/O behavior of Apple desktop applications." Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011. LINK

Basic Theory

[Chubby] Burrows, Mike. "The Chubby lock service for loosely-coupled distributed systems." Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, 2006. http://www.usenix.org/event/osdi06/tech/full_papers/burrows/burrows_html/

Armbrust, M.; Fox, A.; Griffith, R.; Joseph, A.; Katz, R.; Konwinski, A.; Lee, G.; Patterson, D.; Rabkin, A.; Stoica, I. & others "A view of cloud computing" Communications of the ACM, ACM, 2010, 53, 50-58. http://x-integrate.de/x-in-cms.nsf/id/DE_Von_Regenmachern_und_Wolkenbruechen_-_Impact_2009_Nachlese/$file/abovetheclouds.pdf

Programming Model

Abadi, D. J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J. H., ... & Zdonik, S. (2005, January). The design of the borealis stream processing engine. CIDR. http://www.cs.harvard.edu/~mdw/course/cs260r/papers/borealis-cidr05.pdf

Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. http://fastandfuriousdecisiontree.googlecode.com/svn-history/r474/trunk/DIVERS/mapReduceByGoogle.pdf

Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010, December). S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on (pp. 170-177). IEEE. http://www.4lunas.org/pub/2010-s4.pdf

Zaharia, M.; Chowdhury, M.; Franklin, M.; Shenker, S. & Stoica, I. "Spark: cluster computing with working sets" Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010, 10-10   http://www.usenix.org/event/hotcloud10/tech/full_papers/Zaharia.pdf

Zaharia, Matei, et al. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing. USENIX Association, 2012. http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Gunda, Pradeep Kumar, et al. "Nectar: automatic management of data and computation in datacenters." Proceedings of the 9th USENIX conference on Operating systems design and implementation. USENIX Association, 2010. http://static.usenix.org/events/osdi10/tech/full_papers/Gunda.pdf

Peng, Daniel, and Frank Dabek. "Large-scale incremental processing using distributed transactions and notifications." Proceedings of the 9th USENIX conference on Operating systems design and implementation. USENIX Association, 2010. http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf

Low, Yucheng, et al. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment 5.8 (2012): 716-727. http://arxiv.org/pdf/1204.6078

Mitchell, Christopher, Russell Power, and Jinyang Li. "Oolong: Programming Asynchronous Distributed Applications with Triggers." Proc. SOSP. 2011. http://sigops.org/sosp/sosp11/posters/summaries/sosp11-final6.pdf

Mitchell, Christopher, Russell Power, and Jinyang Li. "Oolong: asynchronous distributed applications made easy." Proceedings of the Asia-Pacific Workshop on Systems. ACM, 2012. https://apsys2012.kaist.ac.kr/media/papers/apsys2012-final28.pdf

Power, Russell, and Jinyang Li. "Piccolo: building fast, distributed programs with partitioned tables." Proceedings of the 9th USENIX conference on Operating systems design and implementation. USENIX Association, 2010. http://www.usenix.org/event/osdi10/tech/full_papers/Power.pdf

McSherry, Frank, et al. "Differential dataflow." Conference on Innovative Data Systems Research (CIDR). 2013. http://research.microsoft.com/pubs/176693/differentialdataflow.pdf

Improvements on Existing Models

Zaharia, M.; Konwinski, A.; Joseph, A.; Katz, R. & Stoica, I. "Improving mapreduce performance in heterogeneous environments" Proceedings of the 8th USENIX conference on Operating systems design and implementation, 2008, 29-42. http://www.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia_html/

Yang, Hung-chih, et al. "Map-reduce-merge: simplified relational data processing on large clusters." Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007. http://www.cs.duke.edu/courses/cps399.28/current/papers/sigmod07-YangDasdanEtAl-map_reduce_merge.pdf

Bu, Yingyi, et al. "HaLoop: Efficient iterative data processing on large clusters." Proceedings of the VLDB Endowment 3.1-2 (2010): 285-296. http://vldb2010.org/proceedings/files/papers/R25.pdf

Borthakur, Dhruba, et al. "Apache Hadoop goes realtime at Facebook." Proceedings of the 2011 international conference on Management of data. ACM, 2011. http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf

Ekanayake, Jaliya, et al. "Twister: a runtime for iterative mapreduce." Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010. http://www.iterativemapreduce.org/hpdc-camera-ready-submission.pdf

Zhang, Yanfeng, et al. "Priter: a distributed framework for prioritized iterative computations." Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011. http://rio.ecs.umass.edu/mnilpub/papers/socc11-zhang.pdf

Bhatotia, Pramod, et al. "Incoop: MapReduce for incremental computations." Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011. https://www.systems.ethz.ch/education/spring-2012/hotDMS/papers/incoop-socc11.pdf

Storage

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003. ftp://121.9.13.178/PPP/ppt-hadoop/The.Google.File.System.pdf

Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4. http://static.usenix.org/event/osdi06/tech/chang/chang_html/

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., ... & Vogels, W. (2007, October). Dynamo: amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review (Vol. 41, No. 6, pp. 205-220). ACM. http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf

McKusick, Marshall Kirk, and Sean Quinlan. "GFS: Evolution on fast-forward." ACM Queue 7.7 (2009): 10-20. http://queue.acm.org/detail.cfm?id=1594206

Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 2006. https://www.usenix.org/legacyurl/osdi-06-paper-4

Lakshman, Avinash, and Prashant Malik. "Cassandra—A decentralized structured storage system." Operating systems review 44.2 (2010): 35. http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF

Baker, J., Bond, C., Corbett, J. C., Furman, J. J., Khorlin, A., Larson, J., ... & Yushprakh, V. (2011, January). Megastore: Providing scalable, highly available storage for interactive services. In Proc. of CIDR (pp. 223-234). http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf

Ousterhout, John, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra et al. "The case for RAMCloud." Communications of the ACM 54, no. 7 (2011): 121-130. http://ilpubs.stanford.edu:8090/942/1/ramcloud.pdf

Ongaro, Diego, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. "Fast crash recovery in RAMCloud." In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 29-41. ACM, 2011. http://www.cs.columbia.edu/~junfeng/11fa-e6121/papers/ramcloud-recovery.pdf

Lloyd, Wyatt, et al. "Don't settle for eventual: scalable causal consistency for wide-area storage with COPS." Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011. http://www-users.cselabs.umn.edu/classes/Fall-2012/csci8980-2/papers/cops.pdf

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., ... & Zdonik, S. (2005, August). C-store: a column-oriented DBMS. In Proceedings of the 31st international conference on Very large data bases (pp. 553-564). VLDB Endowment. http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf

Hall, A., Bachmann, O., Büssow, R., Gănceanu, S., & Nunkesser, M. (2012). Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment, 5(11), 1436-1446. http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf

Abadi, Daniel J., Samuel R. Madden, and Nabil Hachem. "Column-Stores vs. Row-Stores: How different are they really?." Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008. http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf

Scheduler

Zaharia, M.; Borthakur, D.; Sen Sarma, J.; Elmeleegy, K.; Shenker, S. & Stoica, I. "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling" Proceedings of the 5th European conference on Computer systems, 2010, 265-278.  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.1524&rep=rep1&type=pdf

Hindman, Benjamin, et al. "Mesos: A platform for fine-grained resource sharing in the data center." Proceedings of the 8th USENIX conference on Networked systems design and implementation. USENIX Association, 2011. http://incubator.apache.org/mesos/papers/nsdi_mesos.pdf

Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., & Stoica, I. (2011, March). Dominant resource fairness: fair allocation of multiple resource types. In USENIX NSDI. http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf

Systems

Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., & Vassilakis, T. (2010). Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), 330-339. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf

Cluster Management

Applications

Sunday, January 27, 2013

We shall follow Prismatic

Prismatic is a news recommendation site created by several Berkeley students. Three of them are well-known PhD students at Berkeley: Aria Haghighi (worked with Dan Klein on NLP, and with Andrew Ng as an undergraduate at Stanford), Jenny Finkel (worked with Chris on NLP), and Jason Wolfe (worked with Stuart Russell in the AI lab). Recently, it raised a $15 million Series A round from Jim Breyer and Yuri Milner to "Attack The Impossible Problem of Bringing You Relevant News".

In fact, before Prismatic there were many attempts to solve this complex problem of "what shall I read now?", including some not-so-successful ones such as Zite, Pulse, Flipboard, Digg, StumbleUpon, and Wavii. The core challenge is that computers still cannot understand the real content of a news story or the needs of a human being. Timing is also tricky: what readers want to read right now is clearly less regular than what they want to read eventually.

Lots of people think that Prismatic is just another immature attempt at this impossible problem. However, in my mind, Prismatic has grasped the core of this type of application.

  1. Using the social network. Asking computers to understand real humans is not feasible, but asking your friends is. Gathering enough material from your social network is critical for solving this problem.
  2. Analyzing people instead of content. People carry different kinds of tags: a geek, an artist, a student, or even a news reporter. Among the geeks alone, there are still many subtypes. News and articles have different weights for different groups of people. Besides, the interaction between the system and a person can further shape the system in a stable way.
  3. Strong academic background. Machine learning and NLP are still hot research topics today. PhD students at the best universities will find it easier to deploy an obviously 'better' system.

Shall we follow Prismatic?

Is that too hard for us ordinary programmers, who hold no doctorate in ML or NLP and do not study at the world's best laboratories or universities? The answer is yes, but there is another way: a simple version of Prismatic.

For example, I am a social network addict. Each day I spend at least one hour (I call it lunch-social time) on different social networks: Sina Weibo (a Twitter-like microblogging system), Facebook, AcFun (an interesting video site like YouTube), and many others. Outside this hour, I do not like social media disrupting my work. In the first quarter of the lunch-social hour, there are always interesting posts or news that I like and read. But after around 15 minutes, I usually find there is no new content, and I spend the next 45 minutes refreshing the page again and again, trying to find one interesting item. (Of course, none of the posts I read during this lunch-social hour are serious; they are often jokes, breaking news, funny comics, or popular short videos.) How can you help me? You need:


A Simple Version of Prismatic:

  1. Narrow your content down to fun stuff only: "funny comics, videos, stories", "breaking news", "social trends", and so on. Other high-quality blogs and posts can be left to Prismatic. People usually do not criticize a system harshly for less accurate recommendations when they are just looking for fun.
  2. Classify your users into predefined classes by hand instead of through unsupervised learning by computers. A heavy social network user can easily predict whether a post will be popular or not. Humans are really better than computers at such problems.
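The two points above could be sketched roughly as follows in Python. Everything here is hypothetical: the user classes, the content categories, the hand-assigned weights in `CLASS_WEIGHTS`, and the `recommend` function are placeholders to show the shape of the idea, not a real recommender.

```python
# Hypothetical, hand-assigned preferences: user class -> weight per fun category.
CLASS_WEIGHTS = {
    "geek": {"comic": 0.9, "breaking_news": 0.5, "video": 0.3},
    "artist": {"comic": 0.4, "breaking_news": 0.2, "video": 0.9},
}


def recommend(user_class, posts, top_n=2):
    """Rank posts by the weight that the user's class gives to each category."""
    weights = CLASS_WEIGHTS[user_class]
    ranked = sorted(
        posts,
        key=lambda post: weights.get(post["category"], 0.0),
        reverse=True,
    )
    return [post["title"] for post in ranked[:top_n]]


# Made-up posts, already restricted to "fun" categories.
posts = [
    {"title": "Short film of the week", "category": "video"},
    {"title": "New webcomic strip", "category": "comic"},
    {"title": "Election results", "category": "breaking_news"},
]

print(recommend("geek", posts))
print(recommend("artist", posts))
```

The weights are set by a person rather than learned, which keeps the system simple at the cost of accuracy, a trade-off that matters less when the content is only meant to be fun.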

I would like to try it anyway, no matter how hard it is, because I want to use such a tool during my lunch-social time. :)


Here are some reference materials about Prismatic. You can check them out if you are also interested: