Saturday, February 20, 2010

PageRank for shipping

The other day my friend Jim Psota at Panjiva was explaining how their search and ranking systems work. Panjiva is basically a search engine for people to find overseas manufacturers. The technology was built by a bunch of MIT and Stanford computer scientists and I thought it was pretty interesting stuff. Like many search problems, the big challenge is figuring out how to rank the results for relevancy. Panjiva's insight here is that a lot of the things that make web search better also make searching for exporters better.

Companies like Alibaba already have directories of millions of exporters. But when you search for a supplier for a specific type of product, it's sort of like going back to the web before Google. For example, searching for "wool sweaters" returns a list of over 24,000 different contract manufacturers with no great way to rank them.

This is a hard problem. You can't just rank by sales volume, number of customers, or other measures of popularity. I might want a low-volume exporter who is highly specialized. You also have the problem of exporters making (or claiming to make) a variety of product lines, but really only specializing in a few of them.

Panjiva has a clever approach to solving this problem using shipping data. Instead of looking at a supplier to decide how they should rank, Panjiva looks at the people buying from those suppliers. The network amongst suppliers and buyers gives a much more truthful representation of what a supplier is really good at building than purely analyzing the suppliers themselves. Other interesting factors like the rate at which customers are being gained or lost and which industries the buyers are in also mirror web search (inbound link growth rate and topic analysis respectively).
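To make the idea of ranking suppliers by their buyers concrete, here is a minimal sketch. It is not Panjiva's actual algorithm, which I don't know in detail. It uses a HITS-style mutual reinforcement (a close relative of PageRank) on a buyer-supplier graph: a supplier scores well when credible buyers ship from it, and a buyer is credible when it ships from well-scored suppliers. The function name `rank_suppliers` and the shipment data are made up for illustration.

```python
from collections import defaultdict


def rank_suppliers(shipments, iterations=20):
    """Rank suppliers from (buyer, supplier) shipment pairs.

    Hypothetical HITS-style sketch, not Panjiva's real method.
    Returns a list of (supplier, score) sorted best first.
    """
    buyers_of = defaultdict(set)     # supplier -> buyers that ship from it
    suppliers_of = defaultdict(set)  # buyer -> suppliers it ships from
    for buyer, supplier in shipments:
        buyers_of[supplier].add(buyer)
        suppliers_of[buyer].add(supplier)

    supplier_score = {s: 1.0 for s in buyers_of}
    for _ in range(iterations):
        # A buyer is credible if it buys from well-scored suppliers.
        buyer_score = {
            b: sum(supplier_score[s] for s in suppliers_of[b])
            for b in suppliers_of
        }
        total = sum(buyer_score.values())
        buyer_score = {b: v / total for b, v in buyer_score.items()}

        # A supplier scores well if credible buyers buy from it.
        supplier_score = {
            s: sum(buyer_score[b] for b in buyers_of[s])
            for s in buyers_of
        }
        total = sum(supplier_score.values())
        supplier_score = {s: v / total for s, v in supplier_score.items()}

    return sorted(supplier_score.items(), key=lambda kv: kv[1], reverse=True)


# Made-up shipment records: (buyer, supplier).
shipments = [
    ("b1", "A"), ("b2", "A"), ("b3", "A"),
    ("b3", "B"),
    ("b4", "C"),
]
ranking = rank_suppliers(shipments)
```

In this toy data, supplier B and supplier C each have one buyer, but B's buyer also ships from the well-connected supplier A, so B ends up ranked above C. A real system would presumably also weight by product category, shipment volume, and customer growth rate, as described above.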

Hopefully systems like this will improve the overall quality of the market for finding suppliers. There will always be gaming, but perhaps it will follow the SEO path of the web, whereby sites try to rank better by creating great content that attracts links. Similarly, suppliers will hopefully focus less on keyword-stuffing their profiles and more on getting quality customers. Assuming more honest marketplaces get more business, this should be good for everyone.

Saturday, February 6, 2010

Infrastructure for real-time parallel computing

Om invited me to write a piece on what we've learned while building Hunch. My key point is that things like MapReduce and Hadoop are great for offline batch processing, but not so useful for harder, real-time parallel programming, where you want to compute answers in milliseconds or need to combine different hard-to-predict pieces of data for each part of the computation.

At Hunch, we've built a bunch of systems using memcached, MySQL, and custom caching and replication code to attack this problem. Roughly, we store things in memcached across a bunch of machines. We then do client-side caching so that we don't have to go back to memcached for every request. Finally, processes that put data into memcached also write it to either local disk files or MySQL, so that it can be reloaded quickly into memcached should we need to.
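The three tiers described above can be sketched roughly as follows. This is an illustration of the pattern, not our production code: the class name `TieredStore` is made up, and plain dictionaries stand in for memcached and for MySQL or local disk files so the example runs on its own. It also ignores expiry and invalidation of the client-side cache, which is a big part of the real work.

```python
class TieredStore:
    """Client-side cache in front of a shared cache and a durable store.

    Hypothetical sketch: `remote_cache` stands in for memcached and
    `durable_store` for MySQL or local disk files. Both are dict-like.
    """

    def __init__(self, remote_cache, durable_store):
        self.local = {}               # per-process client-side cache
        self.remote = remote_cache    # shared across machines
        self.durable = durable_store  # survives a cache restart

    def put(self, key, value):
        # Writers update the durable copy as well as the caches,
        # so the shared cache can be refilled later.
        self.durable[key] = value
        self.remote[key] = value
        self.local[key] = value

    def get(self, key):
        # 1. Cheapest: this process's own memory.
        if key in self.local:
            return self.local[key]
        # 2. The shared cache.
        value = self.remote.get(key)
        if value is None:
            # 3. Fall back to the durable copy and reload the shared cache.
            value = self.durable.get(key)
            if value is not None:
                self.remote[key] = value
        if value is not None:
            self.local[key] = value
        return value


remote, durable = {}, {}
writer = TieredStore(remote, durable)
writer.put("user:42:prefs", {"likes": ["sweaters"]})

remote.clear()                          # simulate the shared cache being flushed
reader = TieredStore(remote, durable)   # a different process, empty local cache
prefs = reader.get("user:42:prefs")     # served from the durable copy
```

After the simulated flush, the read falls through to the durable copy and repopulates the shared cache, which is the "fast reloading" step described above.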

The website uses message queues or TCP requests to app servers for long computations instead of tying up Apache threads. These app servers can also cache a lot of data out of memcached instead of spreading that cached data through every Apache instance.
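Here is a toy version of that hand-off, assuming an in-process queue and a thread in place of a real message queue and a separate app server machine. The names `app_server`, `handle_web_request`, and `slow_computation` are invented for the example. The point it shows is that the cache lives in the one long-running worker rather than in every web-serving process.

```python
import queue
import threading


def slow_computation(n):
    # Placeholder for a long-running calculation.
    return sum(i * i for i in range(n))


def app_server(requests):
    """Long-running worker that owns the cache (stand-in for an app server)."""
    cache = {}
    while True:
        job = requests.get()
        if job is None:  # shutdown signal
            return
        n, reply = job
        if n not in cache:
            cache[n] = slow_computation(n)
        reply.put(cache[n])


def handle_web_request(requests, n):
    """What the web tier does: enqueue the job and wait for the answer."""
    reply = queue.Queue(maxsize=1)
    requests.put((n, reply))
    return reply.get(timeout=5)


requests = queue.Queue()
worker = threading.Thread(target=app_server, args=(requests,), daemon=True)
worker.start()

first = handle_web_request(requests, 4)   # computed by the worker
second = handle_web_request(requests, 4)  # answered from the worker's cache

requests.put(None)  # ask the worker to stop
worker.join(timeout=5)
```

In a real deployment the web tier would not block a thread while waiting like this sketch does. It would hand off the job and pick up the result asynchronously or by polling.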

I don't know that this is the right solution for everyone. It was a pain putting it together too. It would be great if there were an off-the-shelf open source set of tools to make irregular, memory-intensive, real-time parallel programming tasks easier. If you're interested in these sorts of parallel programming problems and have thoughts on this, I'd love to talk about it.