Tom Pinckney

Sunday, February 27, 2011

Adding padding to a schedule

A couple of weeks ago we were putting together a schedule for a mobile app we're working on. We prefer having our developers come up with their own schedules (so they really believe in the schedule) and using short milestones to avoid surprises.

At some point, one of our developers mentioned they were adding some extra time to each step of their schedule to account for various unknowns. I'm a big believer in getting an accurate schedule by trying to build in some padding for unknowns. However, I like adding all of the padding as an explicit piece at the end of the schedule instead of sprinkling it throughout the plan.

The best example I heard for why you do it this way was an example from the military. A general asks how long it will take to get his jeep ready for a trip. His request goes down the chain of command until it reaches the private who will actually be pulling the general's jeep out of the motor pool, checking it out and driving over to the general.

The private figures it'll take about a day, but to be safe, he doubles it to two days. He tells his boss, a lieutenant, that it'll take two days. The lieutenant, to be safe, doubles this and tells his boss, the captain, that it'll take four days. By the time the general hears back, the estimate is that it will take a year or two to have his jeep ready.

At every step of the process, people weren't aware of the padding already built into the schedule, and so to be safe they added more of their own. No one was being particularly unreasonable; it was just a lack of awareness.

Instead, the private could have told his boss that it'll take about a day to get the jeep ready, plus a second day of extra time just in case the jeep turns out to have some sort of serious mechanical problem that wasn't anticipated. Then everyone would know how much explicit padding was built in, could judge that it was reasonable, and not feel the need to add more.
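To make the arithmetic of the jeep story concrete, here is a tiny sketch. The function names and the number of levels in the chain of command are made up for illustration; the story itself doesn't say how many people the request passes through.

```python
def stacked_estimate(base_days, levels, factor=2):
    """Estimate after each of `levels` people multiplies the number
    they were handed by `factor`, unaware of the padding already in it."""
    return base_days * factor ** levels


def explicit_estimate(base_days, padding_days):
    """Estimate when the padding is stated once, separately from the base."""
    return base_days + padding_days
```

With a one-day base, nine hidden doublings give stacked_estimate(1, 9) == 512 days, which is how you get to "a year or two". The explicit version stays at explicit_estimate(1, 1) == 2 days no matter how many people pass it along.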

This problem of excessive padding isn't just a problem in "vertical" stacks like chains of command. It also happens in "linear" stacks when there are many steps in a schedule and each one is padded. In all likelihood, not every step of the schedule is going to run over, and so you will have way over-padded the schedule. And if every step IS going over schedule, you've got a more fundamental problem to solve first in learning how to come up with decent baseline time estimates for tasks.

Sunday, November 7, 2010

NoTrans vs NoSQL

I think the whole NoSQL movement is misnamed. The scalability problems most NoSQL databases aim to solve aren't really related to what the query language is. They're instead more about removing the overhead of transactions, using column stores instead of row stores, relaxing consistency guarantees or things like that.

Tuesday, October 19, 2010

Testing Facebook Connect applications

At Hunch, we use Facebook Connect to make it easier for our users to create accounts. Knowing who a Hunch user is on Facebook also improves the recommendations on Hunch as we can look at what the user has already "liked". Unfortunately, it's really difficult to test Facebook Connect applications as you cannot easily create test Facebook users.

We've worked around this by creating a minimal re-implementation of the Facebook Graph API. So instead of making Facebook API calls to the real Facebook, we can make API calls against our fake Facebook. Our fake Facebook, called Fakebook, can simulate an unlimited number of fake Facebook users which is handy for load testing.

Fakebook is still very incomplete in terms of which parts of the Facebook Graph API it emulates, but it emulates enough to allow us to see what happens when thousands of simulated users all try to Facebook Connect into Hunch at once. Currently, all of the data Fakebook returns is random (random email addresses, random first names, random friends etc). This makes it easy to do load testing but makes it harder to do functional testing where your tests actually care about the values in the data. This is probably what we'll work on handling next.
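As a rough illustration of the random-data idea (this is not Fakebook's actual code, and the field names, name list and seed parameter are all assumptions), a generator for fake users might look like the sketch below. Passing a seed makes the "random" users repeatable, which is one possible way to make functional tests that care about values easier to write.

```python
import random
import string

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]  # made-up sample data


def fake_user(user_id, all_ids, rng):
    """Build one random profile. Field names are illustrative only."""
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    others = [i for i in all_ids if i != user_id]
    friends = rng.sample(others, k=min(3, len(others)))
    return {
        "id": user_id,
        "first_name": rng.choice(FIRST_NAMES),
        "email": local + "@example.com",
        "friends": friends,
    }


def fake_users(count, seed=None):
    """Return `count` fake users; the same seed always yields the same users."""
    rng = random.Random(seed)
    ids = list(range(1, count + 1))
    return [fake_user(i, ids, rng) for i in ids]
```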

The Fakebook is written in Python and is designed to run on Google App Engine. You can check it out at http://code.google.com/p/thefakebook

Wednesday, September 29, 2010

Release scripts

Our release scripts are pretty simple, but they're still a lot more complicated than I would have expected. We started with just checking out our code, rsync'ing it out to every web server and restarting apache on those web servers.

Of course, our users were getting dropped connections or connection refused errors since the Apache instances were being restarted while people were using them and while our load balancers were still sending them traffic. We added a secret URL to our app that would return "UP" or "SHUTTING DOWN" based on whether a file in /tmp/ was present. The load balancers would check this page and stop sending traffic to a server if its status was "SHUTTING DOWN". Our upgrade scripts now create this flag file in /tmp, give the load balancers 15 seconds to recognize that the server is going down and stop sending new traffic to that Apache, and only then shut Apache down on the server.

That page has a secondary purpose in that generating it requires our app to be working end-to-end, and so it serves as an application-level health check for each server (db connections are working, memcache connections are working, etc., and not just that Apache can return a page).
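The logic behind that status page can be sketched in a few lines. This is a minimal sketch, not our production code: the flag file name and the idea of passing in a list of check functions are assumptions made for the example.

```python
import os

# Hypothetical file name; the only real detail is that the flag lives in /tmp/.
FLAG_PATH = "/tmp/shutting_down"


def health_status(flag_path=FLAG_PATH, checks=()):
    """Return the text the load balancers poll for.

    If the flag file exists, the server is draining. Otherwise every
    check (db, memcache, ...) is called; a check signals a problem by
    raising, which makes the page fail instead of returning "UP".
    """
    if os.path.exists(flag_path):
        return "SHUTTING DOWN"
    for check in checks:
        check()
    return "UP"
```

The release script then only has to create the flag file, wait for the load balancers to notice, and shut the server down.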

The next discovery was that under high load we couldn't restart our Apache instances too quickly because there was some lag between when a server came back up and when the load balancers detected that it was back up and really started sending traffic. If we restarted too quickly we ended up with all the servers being up except for one at any given point in time, but the load balancers not sending traffic to any of the servers. Now we wait a minute between each server restart.

Then there was the time that even though we tested code in a supposedly exact replica of production, the code totally failed on production. So we rolled out broken code to every web server. Now the push scripts test each server after it restarts before moving on to the next server to avoid pushing totally broken code to every machine.
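Putting the last few lessons together, the core loop of a release script might look something like this. It is a sketch under assumptions: the drain, restart and is_healthy callables are hypothetical stand-ins for whatever actually creates the flag file, pushes code and restarts Apache, and fetches the status page.

```python
import time


def rolling_release(servers, drain, restart, is_healthy,
                    drain_seconds=15, settle_seconds=60, sleep=time.sleep):
    """Update servers one at a time and return the list that succeeded.

    Raises RuntimeError at the first server that fails its post-restart
    check, so broken code never reaches the remaining machines.
    """
    done = []
    for server in servers:
        drain(server)            # e.g. create the flag file
        sleep(drain_seconds)     # let the load balancers stop sending traffic
        restart(server)          # push code and restart Apache
        if not is_healthy(server):
            raise RuntimeError(
                f"{server} failed its post-restart check; halting release")
        done.append(server)
        sleep(settle_seconds)    # let the load balancers pick the server back up
    return done
```

Passing sleep in as a parameter is just a convenience so the loop can be exercised in a test without actually waiting.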

Early on we were sloppy in releasing code and static content. We'd push out code that referenced new images without those images being on all the servers. So a request could be served by the new code, reference a new image and then the request for that image would go to an old server that didn't have the image. Now we rsync all the static content out before we rsync the code.

We eventually also created a simple database schema migration system that runs with each code push. The db's store a version number and the update scripts look in a special directory in our code base for files each named by version number. Any file named with a version number higher than the version in the db is run. After each file is run the version number in the db is updated.
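The version-number scheme is simple enough to sketch. The file naming here (a leading integer, as in 12.sql) and the run and set_version callables are assumptions for illustration, not a description of our actual scripts.

```python
import re


def pending_migrations(filenames, current_version):
    """Return (version, filename) pairs newer than current_version, in order."""
    found = []
    for name in filenames:
        match = re.match(r"(\d+)", name)
        if match and int(match.group(1)) > current_version:
            found.append((int(match.group(1)), name))
    return sorted(found)  # numeric order, so 10 runs after 9


def migrate(filenames, current_version, run, set_version):
    """Run each pending file, recording the new version after every one."""
    for version, name in pending_migrations(filenames, current_version):
        run(name)
        set_version(version)
        current_version = version
    return current_version
```

Updating the stored version after each file, rather than once at the end, means a failure halfway through leaves the db pointing at the last migration that actually ran.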

We have monitoring that checks each server every minute. We disable this monitoring as we take each server down for updating to avoid spurious alarms being generated.

Other things we do on each code update:

1) We use Python and found that deleting all .pyc files on every update solved many obscure problems.
2) Solr and Lucene have lock files that can get left behind. We delete these when restarting to make sure everything starts.
3) We patch the svn version number of the code into page templates so we can see which version of the code is generating each page.
4) We email the commits from svn log to our dev team so everyone knows what has gone to production.
5) We dynamically generate press and faq pages from a google spreadsheet that can be updated by non-technical people.
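As an example of the first item, clearing out stale .pyc files takes only a few lines of Python. This is a sketch rather than our actual script; the function name is made up.

```python
from pathlib import Path


def delete_pyc(root):
    """Remove every .pyc file under root and return how many were deleted."""
    stale = list(Path(root).rglob("*.pyc"))  # collect first, then delete
    for path in stale:
        path.unlink()
    return len(stale)
```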

Saturday, February 20, 2010

PageRank for shipping

The other day my friend Jim Psota at Panjiva was explaining how their search and ranking systems work. Panjiva is basically a search engine for people to find overseas manufacturers. The technology was built by a bunch of MIT and Stanford computer scientists and I thought it was pretty interesting stuff. Like many search problems, the big challenge is figuring out how to rank the results for relevancy. Panjiva's insight here is that a lot of the things that make web search better also make searching for exporters better.

Companies like Alibaba already have directories of millions of exporters. But when you search for a supplier for a specific type of product, it's sort of like going back to the web before Google. For example, searching for "wool sweaters" returns a list of over 24,000 different contract manufacturers with no great way to rank them.

This is a hard problem. You can't just rank by sales volume, number of customers or other measures of popularity. I might want a low volume exporter who is highly specialized. You also have the problem of exporters making (or claiming to make) a variety of product lines, but really only specializing in a few of them.

Panjiva has a clever approach to solving this problem using shipping data. Instead of looking at a supplier to decide how they should rank, Panjiva looks at the people buying from those suppliers. The network amongst suppliers and buyers gives a much more truthful representation of what a supplier is really good at building than purely analyzing the suppliers themselves. Other interesting factors like the rate at which customers are being gained or lost and which industries the buyers are in also mirror web search (inbound link growth rate and topic analysis respectively).

Hopefully systems like this will improve the overall quality of the market for finding suppliers. There will always be gaming, but perhaps it will follow the "seo" path of the web whereby sites try to rank better by creating great content that attracts links. Similarly, hopefully suppliers will focus less on "keyword" stuffing their profiles and more on getting quality customers. Assuming more honest marketplaces get more business, this should be good for everyone.

Saturday, February 6, 2010

Infrastructure for real-time parallel computing

Om invited me to write a piece on what we've learned while building Hunch. My key point is that things like MapReduce and Hadoop are great for offline batch processing, but not so useful for doing harder and real-time parallel programming where you want to compute answers in milliseconds or need to combine different hard-to-predict pieces of data for each part of the computation.

At Hunch, we've built a bunch of systems using memcached, mysql and custom caching and replication code to attack this problem. Roughly, we store things in memcached across a bunch of machines. We then do client-side caching so that we don't have to go back to memcached for every request. Finally, processes that put data into memcached also write it to either local disk files or to mysql for fast reloading back into memcached should we need to.
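The layering can be sketched roughly as follows. This is not Hunch's code: a plain dict stands in for the memcached client, the loader callable stands in for reloading from disk or mysql, and the class name and TTL are invented for the example.

```python
import time


class TwoTierCache:
    """Check a small in-process cache first, then the shared cache,
    and only rebuild from the backing store as a last resort."""

    def __init__(self, remote, loader, local_ttl=5.0, clock=time.monotonic):
        self.remote = remote        # stand-in for memcached; any dict-like object
        self.loader = loader        # rebuilds a value from disk or mysql
        self.local_ttl = local_ttl  # seconds a local copy is trusted
        self.clock = clock
        self.local = {}             # key -> (expires_at, value)

    def get(self, key):
        entry = self.local.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                      # client-side hit
        value = self.remote.get(key)
        if value is None:                        # shared cache lost it
            value = self.loader(key)
            self.remote[key] = value             # reload it for everyone
        self.local[key] = (self.clock() + self.local_ttl, value)
        return value
```

The trade-off is staleness: a local copy can lag behind memcached by up to local_ttl seconds, which is the price of skipping the network round trip.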

The website uses message queues or tcp requests to app servers for long computations instead of tying up Apache threads. These app servers can also cache a lot of data out of memcached instead of spreading that cached data through every Apache instance.

I don't know that this is the right solution for everyone. It was a pain putting it together too. It would be great if there was an off-the-shelf open source set of tools to make irregular memory-intensive real-time parallel programming tasks easier. If you're interested in these sorts of parallel programming problems and have thoughts on this I'd love to talk about it.

Saturday, January 30, 2010

How to figure out what those VC terms mean for your equity

My friend Chris Dixon wrote about what everyone should know about their equity grants. Following up on that, I wrote a simple python program that helps you simulate what your stock would be worth in the event someone buys your company. The reason it's not just as simple as purchase_price * your_percent_ownership is that many times VC deals include things like preferences and anti-dilution provisions. These are basically mechanisms whereby the VCs may get more than their percent ownership.

Even though I'm currently working on my third VC backed company I found I still had to spend a lot of time looking up the definitions of terms and thinking through how they affected various outcomes. This was a really good exercise for me and I highly recommend it for anyone else raising VC funding. As a side note, I find that writing a program to simulate something is the best way to see if I really understand something.

The code is available at https://code.google.com/p/startupequitysimulator/ along with some extremely basic documentation. I know that I haven't gotten all the scenarios exactly right, so contributions or improvements are definitely welcome.

Preferred Stock Background

The key point that everyone in a VC backed company should understand is the difference between the stock VCs buy (called preferred stock) and the stock you and I get (common stock). Preferred stock is called that because it gets preferential treatment over common stock. As far as dividing up the proceeds from the sale of a company, that preferential treatment usually falls into two broad categories: preferences and anti-dilution provisions.

If there are multiple rounds of financing, each new round of preferred stock sold is called a "series" with the most recent series being "senior" to the older series. Debt is usually the most senior in a company's capital structure. Then higher seniority stock holders get paid before more junior stock holders. Common stock is the most junior of all.

Preferred stock almost always has the right to convert to common stock. Conversion causes the investor to lose any special privileges that the preferred stock holds. The preferred shares may not necessarily convert 1-to-1 into common stock depending on anti-dilution provisions. Typically, investor A will have the right to convert each preferred share into more than one common share if subsequent investor B paid less per share than A did.

There are several different standard ways to calculate how A's conversion ratio from preferred to common should be adjusted when B invests at a lower price. The two most common forms are called broad-based weighted average and full ratchet. The former moves A's price part of the way toward B's price, using a weighted average that accounts for how many shares B bought, while the latter fully adjusts A's price down to B's price.
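Here is a small sketch of the two adjustments. It is not taken from the simulator; the function names are made up, and it uses the commonly quoted weighted-average formula, where the new price is the old price times (shares outstanding + shares the new money would have bought at the old price) / (shares outstanding + shares actually issued). Real term sheets differ in exactly which shares count as outstanding.

```python
def full_ratchet_price(old_price, new_price):
    """A's conversion price drops all the way to B's price."""
    return min(old_price, new_price)


def weighted_average_price(old_price, shares_outstanding, new_money, new_price):
    """A's conversion price moves part of the way toward B's price."""
    if new_price >= old_price:
        return old_price                          # not a down round, no adjustment
    shares_at_old_price = new_money / old_price   # what B's money would have bought
    shares_issued = new_money / new_price         # what B's money actually bought
    return old_price * (shares_outstanding + shares_at_old_price) / (
        shares_outstanding + shares_issued)


def conversion_ratio(original_price, adjusted_price):
    """Common shares A receives for each preferred share."""
    return original_price / adjusted_price
```

For example, suppose A paid $2.00 a share, there are 1,000,000 shares outstanding, and B invests $500,000 at $1.00 a share. Full ratchet drops A's price to $1.00, so each preferred share converts into 2 common shares. The weighted average drops it to about $1.67, for a ratio of 1.2.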

When people talk about percent ownership, they're really talking about the "as converted to common" percent ownership. This is a hypothetical number of common shares that would result from adding together all the preferred shares if they converted into common shares, all the stock options or warrants issued plus any other common stock granted.

Preferences give the preferred stock holder the right to get some multiple of their investment off the top without regard to the percent ownership that stockholder has. Participating preferred gives the preferred holder the right to take their preferences off the top of the deal AND then still get a cut of what's left based on their percent ownership. Non-participating preferred means that the preferred stockholder can EITHER take their preference off the top OR convert to common and participate on a percent ownership basis only. Capped participating preferred limits how much a preferred holder can make through their preferences.
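To see how the three flavors differ, here is a deliberately simplified sketch with a single preferred investor and no debt. It is not the simulator's code. The function names are made up, and the capped version assumes the cap is a multiple of the investment and that the investor may still convert to common if that pays more.

```python
def non_participating(sale_price, investment, multiple, ownership):
    """EITHER take the preference OR convert and take the ownership share."""
    preference = min(multiple * investment, sale_price)
    return max(preference, ownership * sale_price)


def participating(sale_price, investment, multiple, ownership):
    """Take the preference off the top AND a share of what's left."""
    preference = min(multiple * investment, sale_price)
    return preference + ownership * (sale_price - preference)


def capped_participating(sale_price, investment, multiple, ownership, cap_multiple):
    """Participating, but the total is capped unless converting pays more."""
    capped = min(participating(sale_price, investment, multiple, ownership),
                 cap_multiple * investment)
    return max(capped, ownership * sale_price)
```

Take an investor who put in $5M for 25% with a 1x preference, and a $40M sale. Plain purchase_price * your_percent_ownership says $10M. Non-participating also gives $10M, since converting beats the $5M preference. Participating gives $5M + 25% of the remaining $35M = $13.75M. With a 2x cap, that is limited back to $10M. Whatever the investor takes comes out of what's left for common.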