Saturday, September 26, 2009

Code tests when hiring engineers

I also always do some sort of code test after a first round of interviews. At the end of the day you're not hiring programmers based on their ability to talk a good game. They should want to show off what they can do. If they really are 10x better than the average (and yes, good programmers are at least 10x better than the average) they will want to show off their coding since they know it's their competitive advantage. I think a lot of great candidates think better of a company that openly evaluates people based on programming skill -- it's the kind of environment they want to work in.

A good programming project should evaluate whether the programmer will work well in your company. For me, this means working on big underspecified problems, getting something to work quickly and iterating, using any open source tools or code they want, looking up stuff on the web, etc. The only thing I don't allow is someone to talk to another person for help. I do the code test in person just to reduce the possibility of outside help.

A classic coding problem we used back at SiteAdvisor was to write a program that would crawl a website, automatically fill out forms and evaluate whether the form was accepted successfully or not. We'd ask people to do this in about two hours. Obviously, this is a big project but it was extremely interesting to see what people would accomplish in those two hours. Would they have simplified the problem down to a core that worked and which could be expanded on later? Or had they simply drawn up a list of outside vendors they would hire to help with the project? (I kid you not, one senior candidate actually did this.)

The programming test gets at what is one of the most important things to me. Is the candidate a scrappy thinker who finds ways to get something working no matter what? Most startup programming projects are not about inventing new algorithms or big ideas. They're about making daily progress against a nearly insurmountable set of challenges that require quick clever workarounds.

Friday, September 25, 2009

The intangibles of hiring programmers

When you first get a chance to have a prospective engineer come visit your office, it's critical to subtly show off all the smart people you have and all the big interesting projects they are working on. Make sure the candidate gets a chance to hang out with your existing engineers, chat about the problems you're working on etc. You'll get to evaluate each other in a more casual environment than a one-on-one interview and hopefully impress them with what working in your company is like.

Of course, these informal discussions are also your chance to evaluate the candidate, but be careful. You've been thinking about your projects every waking minute for a year while they just got exposed to them. Don't jump to conclusions based on their lack of amazing insights after only 5 minutes.

Ask the candidate about interesting problems they've solved. If they can't come up with any good problems they've worked on, or if they've come from a big company where they work on some really narrow problem then be careful. Have they ever thought up new features for products before or have they always had everything carefully spec'ed and spoon fed to them? Have they ever built a new type of product? Do they have (non-fanatically held) opinions about preferred tools, languages etc? Anyone really into programming will have these sorts of opinions.

Speaking of which, I think it is incredibly rare that a great programmer is someone who just programs 9-5 as a job and then forgets about it. Most of the really good programmers I've known were programming as a hobby before doing it professionally. When growing up they found ways to get access to computers at a local school, they wrote games on PCs, they built websites, whatever. The point is they love to program, they have done it for free, and they have done a lot of it. Like anything, programming every day for the past ten years is what makes you a great programmer today.

Wednesday, September 16, 2009

QA and benefiting from your users' testing

From time to time someone asks me about how I think you should test a website. It depends on the type of website -- if your website is responsible for millions of dollars or people's lives you need to test it differently than if your website lets you play with animated puppies.

The fundamental challenge in testing your website is the number of possible states your code can be in and your inability to test every one of those states. Every "if", variable, or loop in your code, every record and field value in your db, every API call you make to a third party increases the number of states your system can be in. It's impossible to test even a billionth of the possible states a tiny site can be in.
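To get a rough feel for the numbers, here is a back-of-the-envelope calculation. Both inputs (100 independent two-way branches, a million tests per second) are assumptions picked purely for illustration, not measurements from any real site.

```python
# Back-of-the-envelope: how long would it take to test one billionth
# of the states of a system with 100 independent two-way branches?
# Both input figures below are illustrative assumptions.
BRANCHES = 100                 # assumed number of independent "if"s
TESTS_PER_SECOND = 1_000_000   # assumed (very optimistic) test throughput

total_states = 2 ** BRANCHES
one_billionth = total_states // 10 ** 9
seconds = one_billionth / TESTS_PER_SECOND
years = seconds / (365 * 24 * 3600)

print(f"{total_states:.3e} states, {years:.1e} years for one billionth")
```

Under these assumptions the answer comes out in the tens of millions of years, and a real site has far more than 100 branches once you count the data in the db.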

QA people have developed all sorts of approaches to deal with this like randomized inputs, testing boundary conditions, black box testing based on requirements, etc. However, consumer web sites in particular can benefit from the thousands to millions of users they have since each of these users is constantly testing different states of your system.

You don't want your users running into serious bugs like accidentally deleting their accounts, buying things they didn't mean to etc. Those are the kinds of features you need to have your QA people focus on to make sure they never happen. Your QA people also need to focus on backend processes that users don't see. But that still leaves a large body of other problems your users may run into.

The key to benefiting from your users finding bugs is to detect problems immediately and fix them so quickly that (hopefully) no one notices them. Have your site send you email on errors (but rate limit or aggregate them so you don't get flooded!). Even better, send the entire dev team the errors so everyone knows asap when there's a problem and so there can be some social pressure to not check in buggy code. If you're fixing problems asap, you need a basic regression test and an automated deployment system so you can push a button and know you're releasing basically good code.
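A minimal sketch of the "rate limit or aggregate" idea might look like the following. The class name, the five-minute default and the digest format are all my own assumptions for illustration; the actual sending of the email is left to the caller.

```python
import time
from collections import defaultdict


class ErrorAggregator:
    """Collect errors and release at most one digest per interval.

    Hypothetical helper: the name, defaults and digest format are
    illustrative assumptions, not part of any particular framework.
    """

    def __init__(self, min_interval_seconds=300, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.counts = defaultdict(int)   # error signature -> occurrences
        self.last_sent = None            # time of the last digest, if any

    def record(self, signature):
        """Note one occurrence; return a digest string if one is due, else None."""
        self.counts[signature] += 1
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.min_interval:
            return None  # still inside the quiet period: keep aggregating
        self.last_sent = now
        digest = "\n".join(
            f"{count}x {sig}" for sig, count in sorted(self.counts.items())
        )
        self.counts.clear()
        return digest
```

Whenever record() returns a string, the caller would mail it to the dev team. Note that in this sketch errors collected during the quiet period only go out when the next error arrives after the interval, so a real version would probably also flush on a timer.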

Tuesday, September 15, 2009

PythonDB

I've written earlier about my frustrations with mysql and memcache. Instead of just complaining, I'll now try to offer some ideas on what might be a better alternative for web applications.

As a first approximation, consider a server that accepted short snippets of python from clients and executed them. You could potentially have a very large amount of RAM and an even bigger swap file to provide hundreds of gigabytes of storage via the heap. Instead of writing "select * from users where id = 'bob'" you might instead say "users['bob']" and get back a json encoded version of whatever object was stored in the dict 'users' under key 'bob'. The snippets might also be more complicated like defining classes, using loops, etc.
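Here is a toy, in-process sketch of that idea, leaving out the network layer entirely. The SnippetStore name and its run() method are hypothetical, and since it executes whatever it is given with no sandboxing it is only an illustration of the concept.

```python
import json


class SnippetStore:
    """Toy in-process stand-in for the server described above.

    Hypothetical sketch: no networking, no access control, no sandbox.
    """

    def __init__(self):
        # Long-lived namespace shared by every snippet: this is "the heap".
        self.namespace = {"users": {}}

    def run(self, snippet):
        """Run a snippet; expressions return their value JSON encoded."""
        try:
            code = compile(snippet, "<snippet>", "eval")
        except SyntaxError:
            # Not a single expression, so treat it as statements.
            exec(compile(snippet, "<snippet>", "exec"), self.namespace)
            return json.dumps(None)
        return json.dumps(eval(code, self.namespace))


db = SnippetStore()
db.run("users['bob'] = {'name': 'Bob', 'age': 42}")
print(db.run("users['bob']"))
```

The second call plays the role of the "select": it returns the JSON encoding of whatever is stored under 'bob'.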

You would no longer have to think of the world as only tables (mysql) or key/value pairs (memcache). Instead you can use the same data representations you're already using in your programs. For example, if you wanted a message queue you could simply create a class that has a list, push/pop methods, and a lock for synchronizing it. It's hard to get much simpler than that!
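The message queue mentioned above could be sketched in a few lines. The class and method names are just one possible choice.

```python
import threading


class MessageQueue:
    """A list, push/pop methods, and a lock for synchronizing them."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def push(self, message):
        with self._lock:
            self._items.append(message)

    def pop(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        with self._lock:
            if not self._items:
                return None
            return self._items.pop(0)
```

Popping from the front of a list is linear in the queue length, so for long queues collections.deque would be the more usual container; a plain list is used here only to match the description.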

Speaking of locks, you would control the synchronization instead of having the database implicitly try to do it with transactions. As I've written before, I'm very skeptical about using transactions to build modular software.

Replication could work by having the same python snippets sent to all copies of the database. So long as the snippets were deterministic and they were applied in the same order, all of the replicas would end up with the same data.
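As a rough sketch of that replication scheme, assume each replica is just a namespace and the "log" is an ordered list of snippets. The Replica class and replicate() function are hypothetical names for illustration.

```python
class Replica:
    """One copy of the database: a namespace that snippets mutate."""

    def __init__(self):
        self.namespace = {}

    def apply(self, snippet):
        exec(snippet, self.namespace)


def replicate(log, replicas):
    """Apply every snippet, in log order, to every replica."""
    for snippet in log:
        for replica in replicas:
            replica.apply(snippet)


log = [
    "users = {}",
    "users['bob'] = {'visits': 0}",
    "users['bob']['visits'] += 1",
]
a, b = Replica(), Replica()
replicate(log, [a, b])
print(a.namespace["users"] == b.namespace["users"])
```

This only works because the snippets are deterministic. A snippet that called something like random.random() or time.time() would leave the replicas with different data.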

Obviously, there are some (big!) disadvantages to what I've proposed. There's no persistence, though a python interpreter could manage a persistent paged heap. There would have to be different options for how deeply to serialize results and when to return proxies instead of the actual object values. Different levels of access control may have to be added in some way.

You probably wouldn't have CPython and its full library embedded in some sort of server but instead have a separate implementation of the python language with its own memory manager, concurrency, etc.

Thursday, September 10, 2009

Bring the iphone app store to the desktop

In a previous life I was part of an effort to help people avoid malware on their computers. SiteAdvisor was an attempt to help inform people about which websites they could trust and which downloads they could install with confidence. However, it still relies on people paying attention to the red / yellow / green warning symbols in search results.

The Apple iPhone App Store offers a different model for how to protect people's computers. The only software you can download onto iPhones is software that has been reviewed by Apple. This makes a lot of sense to me. Many people just click 'Yes' on anything that pops up on their computer. Or they just type their admin password in whenever prompted to do so. Or they don't know what those handy red SiteAdvisor ratings next to Yahoo search results mean. Or maybe they really did get malware installed on their computer by a worm or browser exploit.

Why not go one step further than SiteAdvisor's advisory ratings and implement mandatory access control for laptop/desktop systems? I know there are a lot of people that hate having their iPhones locked down. Fine. Give the power users an option to unlock their desktop or laptop. But for the other 95% of the population simply block installing any software that is not approved by Apple, Microsoft or whoever is the security auditor of choice for your operating system. Or at least block installing any system software update, browser plugin, or other critical piece of software that is not whitelisted.

This needs to be implemented at a low level in the OS. Any data loaded into a region of memory that is marked or will be marked executable needs to have a code signature verified to prove it is reviewed and whitelisted code. One consequence of this is that run-time code generation is not possible. So no just-in-time compilers like the ones Java uses. But in the end, I'd happily settle for a much more secure computer and let Intel make my Java apps run faster.
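To make the whitelist check concrete, here is a deliberately simplified sketch. It uses a set of SHA-256 digests as the whitelist, which is my own simplification: real code signing relies on public-key signatures from the auditor rather than a local list of hashes, and the real check would live in the OS loader, not in Python. The names are hypothetical.

```python
import hashlib


def digest(code_bytes):
    """SHA-256 hex digest of a blob of would-be executable code."""
    return hashlib.sha256(code_bytes).hexdigest()


class Whitelist:
    """Simplified stand-in for signature verification (assumption:
    a set of approved hashes instead of real public-key signatures)."""

    def __init__(self, approved_binaries):
        self._approved = {digest(blob) for blob in approved_binaries}

    def may_execute(self, code_bytes):
        # Only byte-for-byte identical, previously reviewed code passes.
        return digest(code_bytes) in self._approved


whitelist = Whitelist([b"reviewed app"])
print(whitelist.may_execute(b"reviewed app"))
print(whitelist.may_execute(b"reviewed app + malware"))
```

The sketch also shows why run-time code generation is ruled out: freshly generated code has no reviewed digest to match.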

Tuesday, September 8, 2009

How to set up mysql multi-master replication

Here's one way to implement multi-master replication in MySQL for fault-tolerance. I'm assuming you have two databases with each one replicating from the other one. In principle you could extend this to more databases replicating to each other in a ring. It's important to note that the reason to do this is NOT to gain more write scalability (you actually lose write scalability since every write has to happen twice). What you gain is reliability since the different db's can be in different data centers on different networks, with different power grids, in different countries etc.

The basic problem you need to solve with multi-master replication is that there is no single db that holds locks that synchronize transactions. Two transactions on the different dbs will happily run completely unaware of each other and then try to replicate their results to each other and potentially conflict in the updates they made.

1) Set up each mysql instance to use its own unique id sequence so that inserts done on one database will replicate to the other database without conflict. Having unique IDs on each db also helps you debug problems later if something goes wrong since you can tell which db an insert was done on.

Set one db to start inserting at auto increment value 1, the other to start at 2 and then have them each increment by 10 between successive IDs. So all the IDs on the first db will be 1, 11, 21, etc and the ones on the second will be 2, 12, 22, 32 etc. You can also set server-id to the start value so that you remember which db uses which sequence of values.

Read more on how to do this at the mysql website
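In MySQL the relevant server variables are auto_increment_increment (the step) and auto_increment_offset (the start value). The sketch below does not talk to MySQL at all; it just simulates the two ID sequences from step 1 to show that they never collide. The function names are hypothetical.

```python
from itertools import count, islice


def id_sequence(offset, increment):
    """IDs a db would hand out with the given start value and step."""
    return count(start=offset, step=increment)


def origin_db(row_id, increment=10):
    """Which db inserted a row, assuming offsets 1..increment-1 are used."""
    return row_id % increment


db1_ids = list(islice(id_sequence(1, 10), 4))
db2_ids = list(islice(id_sequence(2, 10), 4))
print(db1_ids, db2_ids)
print(origin_db(21), origin_db(32))
```

Taking the ID modulo the increment recovers the start value, which is the debugging trick of telling which db an insert was done on.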

2) Don't use the pattern "SELECT something, operate on it in your app code and then INSERT or UPDATE". For example, to increment a counter don't read the value, increment it in your app and then write it back with an UPDATE.

Normally innodb tables will create a read lock when you do the SELECT and thus prevent any other transaction from inserting or modifying a record that would have matched the SELECT's WHERE clause. However, since transactions don't span across replication, you could have both dbs run the SELECT, not see a record and both decide to insert it. Or two different transactions on the two db's could read a value, operate on it and then write it back and conflict with each other.

Instead, you can use an INSERT ... ON DUPLICATE KEY UPDATE ... combined with a judicious choice of unique index fields to make sure that the record is only inserted once and then updated in the future. You can also move the update logic out of your application and into your UPDATE statement.

For example, if you're trying to update a counter don't SELECT its value, if it's not there INSERT it with a zero and if it is there UPDATE it with a new value. Instead create an index on the stat's name perhaps and then use INSERT ... ON DUPLICATE KEY UPDATE to either initialize the counter with a zero value or increment it by one if the counter name already exists. Or a simple UPDATE bar SET foo = foo + 1 could be used if you know the stat record will be there since this can run on both db's and give the correct result without requiring any locks.
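To see why the relative UPDATE survives replication while the read-modify-write pattern does not, here is a small simulation. It is pure Python with no database involved: each "db" is a dict, each "statement" is a function, and replication is modeled as replaying the other db's statement. All names are hypothetical.

```python
def relative_increment(row):
    # Models: UPDATE bar SET foo = foo + 1
    row["foo"] += 1


def make_absolute_write(value):
    # Models: UPDATE bar SET foo = <value computed in the app>
    def statement(row):
        row["foo"] = value
    return statement


def run_on_both(make_statement):
    """Each db runs one statement locally, then replays the other's."""
    db1, db2 = {"foo": 5}, {"foo": 5}
    s1 = make_statement(db1)   # built from what db1's client read
    s2 = make_statement(db2)   # built from what db2's client read
    s1(db1)
    s2(db2)                    # local execution on each master
    s2(db1)
    s1(db2)                    # replication to the other master
    return db1["foo"], db2["foo"]


good = run_on_both(lambda row: relative_increment)
bad = run_on_both(lambda row: make_absolute_write(row["foo"] + 1))
print(good, bad)
```

With the relative update both masters end at 7 after two increments. With read-modify-write both clients read 5 and write 6, so one increment is silently lost.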

3) Since writes are 2x more expensive now that they have to be executed on two different db's, you may want to consider ways to remove writes from your application. Maybe you can store the data in memcache? Or, if you can't remove the write from the db, maybe you can write the data to non-replicated tables and later merge them together in a batch job?

4) Try appending to tables instead of inserting or updating values in a table. INSERTs to a table where the only unique index is an auto-increment field can run without any locks (other than for the auto increment value) and so can safely replicate across db's. Maybe you can insert values from your web app running against both db's and then later have a back-end batch process merge the results together?

5) You probably can't use ORMs like sqlalchemy since you can't tell what sql they're generating and they probably make assumptions about transactions that don't hold.

6) Make users sticky to the same db if at all possible. For example, if you're using geographic DNS to load balance across two different data centers, always send the same client subnet to the same datacenter if possible. This will reduce the chance that the user will write to one db, bounce to the other db, and not see their change or somehow make a conflicting change.
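The stickiness part of step 6 can be sketched as a deterministic mapping from client subnet to data center. This is an assumption-laden illustration: it picks the data center by hashing the subnet rather than by geography, the data center names are made up, and it only handles IPv4 with a /24 prefix by default.

```python
import hashlib
import ipaddress

DATACENTERS = ["dc-east", "dc-west"]   # hypothetical names


def datacenter_for(client_ip, prefix_len=24):
    """Map every IPv4 address in the same subnet to the same data center."""
    subnet = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    # A stable hash of the subnet, so the answer never changes between requests.
    h = hashlib.sha256(str(subnet).encode("ascii")).digest()
    return DATACENTERS[h[0] % len(DATACENTERS)]


print(datacenter_for("203.0.113.7") == datacenter_for("203.0.113.200"))
```

A real geographic DNS setup would choose by location first. The point here is only that the choice depends on the subnet and nothing else, so the same client keeps landing on the same db.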

7) Try to avoid using unique values beyond auto increment values. Uniqueness is impossible to enforce since two users could exactly simultaneously try to create the same unique value on both dbs, succeed and then have the replication fail since the value being INSERTed or UPDATEd into the other db will already exist.

Sometimes you just can't avoid unique values. For example, if your site lets you create usernames there's a tiny chance that two users could try to create the same username at the same time. You can at least make this less likely by requiring minimum username lengths.

Saturday, September 5, 2009

master-standby vs multi-master replication

When setting out to build a reliable web application, people frequently try to get rid of any single point of failure. Running two data centers and two databases is usually the hardest part. There are a few different options for building in database and data center reliability: master-standby and master-master.

In master-standby, one data center and database is active processing user requests while the other is an inactive copy that can be quickly activated in case the master fails. In master-master, two different data centers and databases are both actively serving user traffic.

The advantage of master-standby fault tolerance is that at first blush it seems simpler. You just replicate your database to a standby data center. You don't need to change your app. If the primary data center or database fails, you just change your DNS to point to the backup data center.

The first problem is you don't really know if your backup data center works since most of the time it's just sitting there unused. It's relatively easy to test that the read functionality of your app works in the standby, but much harder to test that write functionality works without corrupting your database replication from the master.

The other problem is that you probably don't want automated fail over to the backup since once you fail over, you probably have to do a manual shut down of the master and later go through some reconciliation process when the master comes back up. So you risk longer periods of downtime while you try to decide whether a failure is really persistent, whether to fail everyone over to the standby, testing the standby after failing over etc.

With master-master replication, you know both sides are always working and you can constantly and automatically have failover happen any time health checks fail for one data center even if they're only transient hiccups lasting several minutes. You also have the knowledge that real users are getting served out of both data centers constantly so you know they both work (assuming you have internal reporting of errors, have users who notice problems and complain about them, run monitoring on both sides constantly etc).

One additional benefit of master-master replication is that your second data center isn't sitting idle 99% of the time. You have to be careful not to assume master-master replication increases your write capacity (it actually decreases it since all writes have to now happen twice, once in each data center), but you can increase the read performance from your site. As many web sites are read-mostly, this can be a big win.

The obvious downsides of master-master replication are that when replication gets out of sync, you have a bigger mess to clean up and that you have to write your application to be master-master aware. I'll talk about my experience with both of these in a future post.

Wednesday, September 2, 2009

What I don't like about mysql and memcache

I love mysql and memcache, don't get me wrong. But I find a lot of times they just don't match up well with how I want to build software.

1) I have some graph of objects I want to store into the database. Creating a table for every relationship and reading/writing it via SQL is tedious. Memcache on the other hand makes this very easy, but only by serializing it and thus losing any understanding of what the data means so the data cannot be operated on within memcache.

2) I have some one-off piece of data that I want to compute in a back-end process and have my website read. I could store it in memcache, but it's not persistent and I don't want to lose this data. I could write it into mysql but then I have a table with one record with a single field which feels kind of silly. I could write it into files, but I'd like to use the replication infrastructure I have with mysql to get this value to all my data centers, back up servers, dev machines, qa machines etc.

3) I frequently want to use message queues without the hassle of setting up message queuing software like ActiveMQ. I don't want yet another server to monitor, yet another complicated configuration file to support and yet another app to understand. So I end up emulating message queues in memcache or mysql, but neither makes it a natural thing to do.

4) It's annoying to keep memcache and mysql in sync with each other. If you're caching underlying database result sets in memcache, then you have problems of inconsistencies between the objects that came out of the database vs the objects that came out of memcache. If you're caching higher-level more granular data (like the result of some internal API call) then you may get more cache misses since two different parts of the system might be caching different objects even though they use the same underlying db records.

5) I've learned to hate transactions as a programming model. It is impossible to develop modular software if you have to acquire all locks in a fixed global order to avoid deadlocks. Further, one rogue application that holds some critical lock too long can completely kill your web application. Transactions are also nearly impossible to replicate reliably and are impossible to reconcile in a distributed environment where data centers come up and go down periodically.

The alternative to locks and transactions is to write lock free code that potentially leaves things in a messy state if it crashes or fails to clean up things properly. I'd rather have a back end process run periodically over my data and clean it up. Obviously in the classic "bank account" kind of application this would be unacceptable, but for many applications this is fine.

6) Stored procedures are nice but always second class citizens in databases. I'd like them to be upgraded to first class citizens in terms of the performance I get, the ease of writing, being able to use modern languages etc.

7) Multi-master replication for fault-tolerance is hard. Statement and row based replication are too low-level. Instead of replicating a bunch of statements between locations, I'd like to replicate a higher level operation like "try to make a user with this username and this user id". If there was lag between the two masters and somehow the same user created two different accounts on both masters with the same username but with different user ids, I'd like to have custom reconciliation logic that might rename one of the users.
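The kind of custom reconciliation logic I have in mind for point 7 could look something like this sketch. It assumes each master's users can be represented as a map from user id to username, and it picks an arbitrary rule (the first master wins, the other account gets a numeric suffix). Both the representation and the rule are assumptions for illustration.

```python
def reconcile_users(master_a, master_b):
    """Merge two {user_id: username} maps, renaming username collisions.

    Hypothetical rule: accounts from master_a keep their names; a
    colliding account from master_b gets a numeric suffix.
    """
    merged = dict(master_a)
    taken = set(master_a.values())
    for user_id, username in sorted(master_b.items()):
        if user_id in merged:
            continue                      # already replicated from the other side
        new_name = username
        suffix = 2
        while new_name in taken:
            new_name = f"{username}{suffix}"
            suffix += 1
        merged[user_id] = new_name
        taken.add(new_name)
    return merged


a = {1: "bob", 11: "alice"}
b = {1: "bob", 2: "alice"}
print(reconcile_users(a, b))
```

For the masters to converge, both sides would have to run the same rule with the same notion of which master is "a".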

Tuesday, September 1, 2009

DNS load balancing

I got annoyed that to get a DNS server that does geographic load balancing, you have to buy expensive Foundry or F5 gear. It seems like you should be able to plug in your own code to an existing DNS server and have it perform a health check before resolving a request. Apparently there's no great option to do this though.

So I started a project to do this: http://sourceforge.net/projects/pymds Note, there is not an official release yet, so just get the latest from SVN.

The basic idea is that pymds (python modular DNS) is a simple DNS server where all the real logic is supplied by extensions written in python. So the standard ability to read a zone file and answer queries is one extension. Round-robin load balancing is another extension. You can chain extensions together so, for example, you could use the zone file extension to look up several possible IPs to return and then the load balancing extension will pick one.
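To illustrate the chaining idea only, here is a sketch with made-up classes. This is NOT the actual pymds extension interface: the handle(name, candidates) signature, the class names and the resolve() helper are all hypothetical, and the zone data is a plain dict rather than a parsed zone file.

```python
import itertools


class ZoneLookup:
    """Hypothetical extension: look up candidate IPs for a name."""

    def __init__(self, records):
        self.records = records           # name -> list of IPs

    def handle(self, name, candidates):
        return list(self.records.get(name, []))


class RoundRobin:
    """Hypothetical extension: pick one candidate, rotating each call."""

    def __init__(self):
        self._counter = itertools.count()

    def handle(self, name, candidates):
        if not candidates:
            return []
        return [candidates[next(self._counter) % len(candidates)]]


def resolve(name, chain):
    """Pass the candidate list through each extension in order."""
    candidates = []
    for extension in chain:
        candidates = extension.handle(name, candidates)
    return candidates


chain = [ZoneLookup({"www.example.com": ["192.0.2.1", "192.0.2.2"]}), RoundRobin()]
print(resolve("www.example.com", chain), resolve("www.example.com", chain))
```

Each extension only sees the output of the one before it, which is what lets you mix and match them.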

Next up is to write a health check extension that will ping a set of servers and not return their IP address if they haven't recently passed a health check.
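The filtering half of such a health check could be sketched as below. Again this is hypothetical and not pymds code: the class name, the handle(name, candidates) signature and the 30 second freshness window are assumptions, and the actual probing (ping, HTTP check, etc.) is left to whatever calls mark_healthy().

```python
import time


class HealthFilter:
    """Hypothetical filter: drop IPs without a recent passed health check."""

    def __init__(self, max_age_seconds=30, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock
        self._last_ok = {}               # ip -> time of last passed check

    def mark_healthy(self, ip):
        """Called by a separate prober each time a server passes its check."""
        self._last_ok[ip] = self.clock()

    def handle(self, name, candidates):
        """Return only the candidates that passed a check recently enough."""
        now = self.clock()
        return [
            ip for ip in candidates
            if ip in self._last_ok and now - self._last_ok[ip] <= self.max_age
        ]
```

A server that has never passed a check is treated the same as one whose last pass is too old, so new servers only start receiving traffic after their first successful probe.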

I've used this server on some personal domains for a while and tested resolving against it using a bunch of different resolver libraries. That said, this is still extremely alpha software and probably will erase your hard drive or something else terrible if you were to be silly enough to try and actually use it.