Om invited me to write a piece on what we've learned while building Hunch. My key point is that tools like MapReduce and Hadoop are great for offline batch processing, but much less useful for harder, real-time parallel programming, where you want to compute answers in milliseconds or need to combine different, hard-to-predict pieces of data for each part of the computation.
At Hunch, we've built a bunch of systems using memcached, MySQL, and custom caching and replication code to attack this problem. Roughly, we store things in memcached across a bunch of machines. We then do client-side caching so that we don't have to go back to memcached for every request. Finally, processes that put data into memcached also write it to either local disk files or MySQL, so that we can quickly reload it into memcached should we need to.
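To make that layering concrete, here is a minimal Python sketch of the idea, not Hunch's actual code. Plain dicts stand in for the memcached cluster and for MySQL or disk files, and the `TieredStore` class, its method names, and the `local_ttl` parameter are all hypothetical names chosen for illustration.

```python
import time


class TieredStore:
    """Client-side cache -> shared cache -> durable store (illustrative only)."""

    def __init__(self, shared, durable, local_ttl=5.0, clock=time.monotonic):
        self.shared = shared        # stand-in for a memcached client
        self.durable = durable      # stand-in for MySQL or local disk files
        self.local_ttl = local_ttl  # seconds a client-side entry stays fresh
        self.clock = clock
        self.local = {}             # key -> (expires_at, value)

    def put(self, key, value):
        # Write the durable copy too, so the shared cache can be rebuilt later.
        self.durable[key] = value
        self.shared[key] = value
        self.local[key] = (self.clock() + self.local_ttl, value)

    def get(self, key):
        # 1. Client-side cache: no network round trip at all.
        entry = self.local.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]
        # 2. Shared cache (memcached in the real system).
        value = self.shared.get(key)
        if value is None:
            # 3. Durable store; repopulate the shared cache on the way back.
            value = self.durable.get(key)
            if value is not None:
                self.shared[key] = value
        if value is not None:
            self.local[key] = (self.clock() + self.local_ttl, value)
        return value

    def reload_shared(self):
        # Bulk reload after the shared cache has been lost or restarted.
        for key, value in self.durable.items():
            self.shared[key] = value
```

A short TTL on the client-side layer is one simple way to bound staleness; a real system would also have to decide how to invalidate entries that other processes change.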
The website uses message queues or TCP requests to app servers for long computations instead of tying up Apache threads. These app servers can also cache a lot of data out of memcached, so that cached data lives in a few app servers instead of being spread through every Apache instance.
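The hand-off from web tier to app server can be sketched in a few lines of Python. This is an assumption-laden toy: an in-process `queue.Queue` and a thread stand in for a real message queue or TCP connection and a separate server process, and `AppServer`, `submit`, and `compute` are hypothetical names.

```python
import queue
import threading
from concurrent.futures import Future


class AppServer:
    """Runs long computations off the web thread and keeps one shared cache."""

    def __init__(self, compute):
        self.compute = compute      # the slow function to run per request
        self.jobs = queue.Queue()   # stand-in for a message queue or TCP socket
        self.cache = {}             # cached once here, not in every web process
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, key):
        # Called by the web tier: enqueue the job and return immediately.
        future = Future()
        self.jobs.put((key, future))
        return future

    def _run(self):
        while True:
            key, future = self.jobs.get()
            try:
                if key not in self.cache:
                    self.cache[key] = self.compute(key)
                future.set_result(self.cache[key])
            except Exception as exc:
                # Report the failure to the caller instead of killing the worker.
                future.set_exception(exc)
```

The web handler calls `server.submit(key)` and later `future.result(timeout=...)`, so the expensive work and its cache stay in the app server rather than in each Apache worker.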
I don't know that this is the right solution for everyone. It was a pain putting it together, too. It would be great if there were an off-the-shelf, open source set of tools to make irregular, memory-intensive, real-time parallel programming tasks easier. If you're interested in these sorts of parallel programming problems and have thoughts on this, I'd love to talk about it.
Saturday, February 6, 2010