Wednesday, September 29, 2010

Release scripts

Our release scripts are pretty simple, but they're still a lot more complicated than I would have expected. We started with just checking out our code, rsyncing it out to every web server, and restarting Apache on those web servers.

Of course, our users were getting dropped connections or "connection refused" errors, since Apache was being restarted while people were using it and while our load balancers were still sending it traffic. We added a secret URL to our app that returns "UP" or "SHUTTING DOWN" based on whether a flag file in /tmp/ is present. The load balancers check this page and stop sending traffic to a server if its status is "SHUTTING DOWN". Our upgrade scripts now create this flag file in /tmp/, give the load balancers 15 seconds to recognize that the server is going down and stop sending it new traffic, and only then shut Apache down on that server.
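As a rough sketch of the flag-file idea (not our actual code), the status logic could look something like the following. The path `/tmp/shutting_down` and the function names `health_status` and `begin_drain` are made up for illustration; all that matters is that a file under /tmp/ flips the answer.

```python
import os

# Hypothetical flag path: any agreed-upon file under /tmp/ would do.
FLAG_FILE = "/tmp/shutting_down"


def health_status(flag_file=FLAG_FILE):
    """Return the text the load balancer polls for."""
    if os.path.exists(flag_file):
        return "SHUTTING DOWN"
    return "UP"


def begin_drain(flag_file=FLAG_FILE):
    """Create the flag file so the next health check reports SHUTTING DOWN."""
    with open(flag_file, "w"):
        pass
```

In this sketch the upgrade script would call `begin_drain()`, sleep for 15 seconds, and only then stop Apache.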

That page has a secondary purpose: generating it requires our app to be working end-to-end, so it serves as an application-level health check for each server (db connections are working, memcache connections are working, etc., and not just that Apache can return a page).

The next discovery was that under high load we couldn't restart our Apache servers too quickly, because there was some lag between when a server came back up and when the load balancers detected that it was back up and actually started sending it traffic. If we restarted too quickly, we ended up with all the servers but one being up at any given point in time, yet the load balancers not sending traffic to any of them. Now we wait a minute between each server restart.

Then there was the time that, even though we had tested the code in a supposedly exact replica of production, the code totally failed on production, and we rolled out broken code to every web server. Now the push scripts test each server after it restarts before moving on to the next server, to avoid pushing totally broken code to every machine.
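Put together, the one-minute pause and the per-server test amount to a loop roughly like the one below. This is only a sketch: `rolling_deploy`, `restart`, and `is_healthy` are hypothetical names, and the callables stand in for whatever actually restarts Apache and fetches the health-check page.

```python
import time


def rolling_deploy(servers, restart, is_healthy, pause_seconds=60, sleep=time.sleep):
    """Restart servers one at a time, stopping at the first failed check.

    Returns (updated, failed): the servers that came back healthy, and the
    server that failed its check, or None if every server passed.
    """
    updated = []
    for index, server in enumerate(servers):
        restart(server)
        if not is_healthy(server):
            # Stop here so broken code never reaches the remaining servers.
            return updated, server
        updated.append(server)
        if index < len(servers) - 1:
            # Give the load balancers time to notice the server is back.
            sleep(pause_seconds)
    return updated, None
```

Passing `sleep` in as a parameter is just a convenience so the loop can be exercised without actually waiting a minute per server.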

Early on we were sloppy in releasing code and static content. We'd push out code that referenced new images without those images being on all the servers. So a request could be served by the new code, reference a new image and then the request for that image would go to an old server that didn't have the image. Now we rsync all the static content out before we rsync the code.

We eventually also created a simple database schema migration system that runs with each code push. Each database stores a version number, and the update scripts look in a special directory in our code base for files that are each named by version number. Any file named with a version number higher than the version in the db is run. After each file is run, the version number in the db is updated.
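A minimal sketch of that selection logic follows. It assumes migration files are named with a plain integer plus an extension (for example `12.sql`) and are run in ascending order. Those naming details, and the names `pending_migrations`, `run_migrations`, `apply`, and `set_version`, are assumptions for illustration rather than a description of our scripts.

```python
import os


def pending_migrations(directory, current_version):
    """Return (version, path) pairs newer than current_version, lowest first."""
    pending = []
    for name in os.listdir(directory):
        stem = os.path.splitext(name)[0]
        if not stem.isdigit():
            continue  # Ignore files that are not named by version number.
        version = int(stem)
        if version > current_version:
            pending.append((version, os.path.join(directory, name)))
    # Sort numerically so that version 10 runs after version 2.
    return sorted(pending)


def run_migrations(directory, current_version, apply, set_version):
    """Run each pending file, recording the new version after each one."""
    for version, path in pending_migrations(directory, current_version):
        apply(path)           # Hypothetical callable that executes the file.
        set_version(version)  # Hypothetical callable that updates the db.
        current_version = version
    return current_version
```

Recording the version after every file, rather than once at the end, means a failure partway through leaves the db pointing at the last migration that actually ran.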

We have monitoring that checks each server every minute. We disable this monitoring as we take each server down for updating, to avoid generating spurious alarms.

Other things we do on each code update:

1) We use Python, and found that deleting all .pyc files on every update solved many obscure problems.
2) Solr and Lucene have lock files that can get left behind. We delete these when restarting to make sure everything starts.
3) We patch the svn version number of the code into page templates so we can see which version of the code is generating each page.
4) We email the commits from svn log to our dev team so everyone knows what has gone to production.
5) We dynamically generate press and FAQ pages from a Google spreadsheet that can be updated by non-technical people.
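
The .pyc cleanup from item 1 is easy to sketch in Python. The function name `delete_pyc_files` is made up for this example, and a shell one-liner would work just as well.

```python
import os


def delete_pyc_files(root):
    """Delete every .pyc file under root and return the paths removed."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```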