Tom Pinckney

Friday, April 5, 2013

NYC Cassandra Tech Day Talk

Here's the talk I did at the NYC* Big Data Day...

Saturday, March 9, 2013

The triumph of intuition

It's in vogue to say data trumps intuition. Relying on data alone simply isn't possible: there are too many variables to test and only a finite number of users to test them on. Intuition has to sit at the top of the idea-generation funnel; data may then be able to separate the good ideas from the bad ones. Even so, it's not possible to validate every idea, and frequently you have to go with your gut. There's no escaping that, at the end of the day, it's the quality of the people making the decisions that matters. No matter how much you spend on Hadoop or data scientists, if you have the wrong people you'll get the wrong results.

Sunday, February 10, 2013

The code test as recruiting tool

At SiteAdvisor, Hunch and now eBay we've long been fans of making practical onsite coding projects an integral part of the interview process. I always assumed this was all about us learning more about the candidate. What I have slowly come to realize is that it's also about the candidate learning about us: what we value in a developer, what we think is important, what kind of problems we work on, and how we solve those problems. I increasingly believe that an interesting, challenging, and educational code test helps recruit great programmers as much as it helps us identify them. Even in cases where we don't end up hiring the candidate, or they turn us down, they've often told us that the coding projects were fun and taught them something. So whether a hire comes out of the process or not, it's still a win for the candidate.

Sunday, February 27, 2011

Adding padding to a schedule

A couple of weeks ago we were putting together a schedule for a mobile app we're working on. We prefer having our developers come up with their own schedules (so they really believe in the schedule) and using short milestones to avoid surprises.

At some point, one of our developers mentioned they were adding some extra time to each step of their schedule to account for various unknowns. I'm a big believer in getting an accurate schedule by trying to build in some padding for unknowns. However, I like adding all of the padding as an explicit piece at the end of the schedule instead of sprinkling it throughout the plan.

The best illustration I've heard of why you do it this way comes from the military. A general asks how long it will take to get his jeep ready for a trip. His request goes down the chain of command until it reaches the private who will actually be pulling the general's jeep out of the motor pool, checking it out, and driving it over to the general.

The private figures it'll take about a day but, to be safe, he doubles it to two days. He tells his boss, a lieutenant, that it'll take two days. The lieutenant, to be safe, doubles this and tells his boss, the captain, that it'll take four days. By the time the general hears back, the estimate is that it will take a year or two to have his jeep ready.

At every step of the process, people weren't aware of the padding already built into the estimate, so to be safe they added more of their own. No one was being particularly unreasonable; it was just a lack of awareness.

Instead, the private could have told his boss that it will take about a day to get the jeep ready, plus a second day of extra time just in case the jeep turns out to have some sort of serious mechanical problem that wasn't anticipated. Then everyone would know how much explicit padding was built in, could judge whether it was reasonable, and wouldn't feel the need to add more.

This problem of excessive padding isn't just a problem in "vertical" stacks like chains of command. It also happens in "linear" stacks, when a schedule has many steps and each one is padded. In all likelihood, not every single step of the schedule is going to run over, so you will have badly over-padded the schedule. And if every step IS running over, you've got a more fundamental problem to solve first: learning how to come up with decent baseline time estimates for tasks.
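
To make the compounding concrete, here's a tiny back-of-the-envelope sketch (the numbers, function names, and safety factor are mine, purely for illustration):

    # Padding compounds when every layer silently doubles the estimate it receives,
    # but stays visible and bounded when it's added once, explicitly, at the end.

    def silently_padded(base_days, layers, safety_factor=2):
        """Each layer in the chain pads the estimate it was given 'to be safe'."""
        estimate = base_days
        for _ in range(layers):
            estimate *= safety_factor
        return estimate

    def explicitly_padded(base_days, padding_days):
        """One visible chunk of padding at the end of the schedule."""
        return base_days + padding_days

    print(silently_padded(1, 5))    # 32 days for a one-day task
    print(explicitly_padded(1, 1))  # 2 days, and everyone can see where the extra day went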

Sunday, November 7, 2010

NoTrans vs NoSQL

I think the whole NoSQL movement is misnamed. The scalability problems most NoSQL databases aim to solve aren't really about the query language. They're more about removing the overhead of transactions, using column stores instead of row stores, relaxing consistency guarantees, and things like that.

Tuesday, October 19, 2010

Testing Facebook Connect applications

At Hunch, we use Facebook Connect to make it easier for our users to create accounts. Knowing who a Hunch user is on Facebook also improves the recommendations on Hunch as we can look at what the user has already "liked". Unfortunately, it's really difficult to test Facebook Connect applications as you cannot easily create test Facebook users.

We've worked around this by creating a minimal re-implementation of the Facebook Graph API. So instead of making API calls to the real Facebook, we can make them against our fake Facebook. Our fake Facebook, called Fakebook, can simulate an unlimited number of fake Facebook users, which is handy for load testing.

Fakebook is still very incomplete in terms of which parts of the Facebook Graph API it emulates, but it emulates enough to let us see what happens when thousands of simulated users all try to Facebook Connect into Hunch at once. Currently, all of the data Fakebook returns is random (random email addresses, random first names, random friends, etc.). This makes it easy to do load testing but harder to do functional testing, where your tests actually care about the values in the data. This is probably what we'll work on next.
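
The real code is at the link below, but as a rough sketch of the idea (not Fakebook's actual implementation, and simplified down to a plain WSGI app rather than the App Engine version), a fake Graph API endpoint only has to answer the same kind of URL the app would normally send to graph.facebook.com and hand back randomly generated user data:

    import json
    import random
    import string
    from wsgiref.simple_server import make_server

    def random_name(length=8):
        return ''.join(random.choice(string.ascii_lowercase) for _ in range(length)).capitalize()

    def fake_user(user_id):
        # Everything here is random, which is fine for load testing but not
        # for functional tests that care about the actual values.
        first = random_name()
        return {
            'id': user_id,
            'first_name': first,
            'last_name': random_name(),
            'email': '%s.%s@example.com' % (first.lower(), user_id),
        }

    def app(environ, start_response):
        # Treat the request path as a Graph API user id, e.g. GET /1234567890
        user_id = environ['PATH_INFO'].strip('/') or str(random.randint(1, 10 ** 9))
        body = json.dumps(fake_user(user_id)).encode('utf-8')
        start_response('200 OK', [('Content-Type', 'application/json')])
        return [body]

    if __name__ == '__main__':
        make_server('', 8080, app).serve_forever()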

Fakebook is written in Python and is designed to run on Google App Engine. You can check it out at http://code.google.com/p/thefakebook

Wednesday, September 29, 2010

Release scripts

Our release scripts are pretty simple, but they're still a lot more complicated than I would have expected. We started with just checking out our code, rsync'ing it out to every web server and restarting apache on those web servers.

Of course, our users were getting dropped connections or connection refused errors, since the apaches were being restarted while people were using them and while our load balancers were still sending them traffic. We added a secret URL to our app that would return "UP" or "SHUTTING DOWN" based on whether a file in /tmp/ was present. The load balancers check this page and stop sending traffic to a server if its status is "SHUTTING DOWN". Our upgrade scripts now create this flag file in /tmp, give the load balancers 15 seconds to recognize that the server is going down and stop sending it new traffic, and only then shut apache down on that server.
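
A minimal sketch of that status URL (the flag file path and function are illustrative, not our exact code):

    import os

    SHUTDOWN_FLAG = '/tmp/shutting_down'   # hypothetical path for the flag file

    def status_page():
        """Return the body and HTTP status the load balancer's check sees."""
        # The release script touches the flag file before taking the server down;
        # once the load balancer sees anything other than "UP" it drains traffic.
        if os.path.exists(SHUTDOWN_FLAG):
            return 'SHUTTING DOWN', 503
        return 'UP', 200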

That page has a secondary purpose: generating it requires our app to be working end-to-end, so it serves as an application-level health check for each server (database connections working, memcache connections working, etc., not just that apache can return a page).

The next discovery was that under high load we couldn't restart our apaches too quickly, because there was some lag between when a server came back up and when the load balancers detected that and really started sending it traffic again. If we restarted too quickly, all the servers except one would actually be up at any given point in time, but the load balancers, still catching up, wouldn't be sending traffic to any of them. Now we wait a minute between each server restart.

Then there was the time that, even though we'd tested the code in a supposedly exact replica of production, it totally failed in production, and we rolled out broken code to every web server. Now the push scripts test each server after it restarts before moving on to the next one, to avoid pushing totally broken code to every machine.
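
Putting those lessons together, the push loop ends up looking roughly like the sketch below. The hostnames, commands, status URL, deploy helper, and timings are all placeholders, not our actual script:

    import subprocess
    import time
    import urllib.request

    WEB_SERVERS = ['web1.example.com', 'web2.example.com']   # hypothetical hosts
    DRAIN_SECONDS = 15    # time for the load balancers to notice "SHUTTING DOWN"
    WARMUP_SECONDS = 60   # time for the load balancers to start sending traffic again

    def run(host, command):
        subprocess.check_call(['ssh', host, command])

    def healthy(host):
        """Hit the status URL and make sure the new code actually works end-to-end."""
        try:
            with urllib.request.urlopen('http://%s/status' % host, timeout=5) as resp:
                return resp.read().strip() == b'UP'
        except OSError:
            return False

    for host in WEB_SERVERS:
        run(host, 'touch /tmp/shutting_down')          # tell the load balancers to drain
        time.sleep(DRAIN_SECONDS)
        run(host, 'apachectl stop')
        run(host, '/usr/local/bin/deploy_new_code')    # hypothetical rsync/update helper
        run(host, 'rm -f /tmp/shutting_down')
        run(host, 'apachectl start')
        if not healthy(host):
            raise SystemExit('%s failed its health check; stopping the push' % host)
        time.sleep(WARMUP_SECONDS)                     # let the load balancers catch up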

Early on we were sloppy in releasing code and static content. We'd push out code that referenced new images without those images being on all the servers. So a request could be served by the new code, reference a new image and then the request for that image would go to an old server that didn't have the image. Now we rsync all the static content out before we rsync the code.

We eventually also created a simple database schema migration system that runs with each code push. The databases store a version number, and the update scripts look in a special directory in our code base for files named by version number. Any file with a version number higher than the version in the database is run, and after each file runs, the version number in the database is updated.
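
A rough sketch of that scheme (using sqlite and hypothetical table and directory names just to show the shape of it):

    import os
    import sqlite3   # stand-in for our real database; the idea is the same

    MIGRATIONS_DIR = 'migrations'   # hypothetical directory of files like 0001.sql, 0002.sql

    def current_version(db):
        db.execute('CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)')
        row = db.execute('SELECT version FROM schema_version').fetchone()
        return row[0] if row else 0

    def migrate(db):
        version = current_version(db)
        for name in sorted(os.listdir(MIGRATIONS_DIR)):
            file_version = int(os.path.splitext(name)[0])
            if file_version <= version:
                continue   # already applied
            with open(os.path.join(MIGRATIONS_DIR, name)) as f:
                db.executescript(f.read())
            # Record the new version only after the migration file has run.
            db.execute('DELETE FROM schema_version')
            db.execute('INSERT INTO schema_version VALUES (?)', (file_version,))
            db.commit()

    if __name__ == '__main__':
        migrate(sqlite3.connect('app.db'))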

We have monitoring that checks each server every minute. We disable this monitoring as we take each server down for updating to avoid spurious alarms being generated.

Other things we do on each code update:

1) We use Python, and we found that deleting all .pyc files on every update solved many obscure problems.
2) Solr and Lucene have lock files that can get left behind; we delete these when restarting to make sure everything starts (a small sketch of these two cleanup steps follows this list).
3) We patch the svn version number of the code into page templates so we can see which version of the code is generating each page.
4) We email the commits from svn log to our dev team so everyone knows what has gone to production.
5) We dynamically generate press and FAQ pages from a Google spreadsheet that can be updated by non-technical people.
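
As a small illustration of the first two items (paths and directory layout are placeholders, not our actual setup):

    import os

    CODE_ROOT = '/srv/app'                   # hypothetical code checkout
    LUCENE_DIRS = ['/srv/solr/data/index']   # hypothetical Solr/Lucene index directories

    def clean_before_restart():
        # Stale .pyc files have caused obscure problems, so remove them all.
        for dirpath, _, filenames in os.walk(CODE_ROOT):
            for name in filenames:
                if name.endswith('.pyc'):
                    os.remove(os.path.join(dirpath, name))
        # Lock files left behind by a hard stop keep Solr/Lucene from starting.
        for index_dir in LUCENE_DIRS:
            for name in os.listdir(index_dir):
                if name.endswith('.lock'):
                    os.remove(os.path.join(index_dir, name))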