Saturday, October 4, 2008

ORMs must die

Until the past year,  I had always done DB programming the old fashioned way: I wrote SQL that was embedded in my software. It seemed to work pretty well too: it was fast, predictable and easy to debug. There were some annoyances such as when the table schema changed I had to track down changes in my code, but this was pretty minor. 

I always felt like I was doing something unhygienic though. It felt like I wasn't brushing my teeth or I wasn't showering. All the cool kids made fun of me for being stuck in the past. So I decided to try an ORM as the interface between my program and the DB. Big mistake. 

Now, I can't say an ORM is always the wrong choice. But in my particular case I was working on building a website that needed to be very high performance and crunch on large datasets. What I quickly found was that while it is possible to write high performance code with ORMs, it's very easy to write low performance code. Now it's always easy to write low performance code, but I found that it was much easier to shoot myself in the foot with an ORM.

For example, suppose I was trying to access a series of variables though an expression such as a.b.c in my (Python) program. Looks pretty straightforward right? Actually, it turns out that my ORM is turning each one of those object references into a SQL join and loading exactly one record. So if my code looked like:

for a in a_lot_of_objects:

Then I was generating a ton of individual joins. Now, if I were to write this by hand I would have seen that this equivalent SQL looks really stupid:

for a_id in a_lot_of_object_ids:
  c = select c from a join b join c where = a_id

I would instead move the select outside of the loop, select all the records I want at once and then use the loop to iterate over them. Duh.

This may sound like a trivial problem, but I think it's just once aspect of a bigger problem. ORMs just don't have enough information to do a good job. And if you build an ORM that has enough flexibility to have enough information about what you're trying to do then that ORM is way too complicated to use.

This is all reminiscent of the days when high level language compilers weren't smart enough to generate good machine code and hand written assembly would always be better. Perhaps in time languages will understand the semantics of DB calls and ORM calls and could optimize my example code above to automatically hoist the select out of the loop.

But until then, I'm writing my SQL by hand.

PS, the other thing I find really annoying about ORMs is that too many of them require you to duplicate your schema -- once in the DB and once in a config/models file somewhere. Every time I have to do things twice I end up doing things in two different ways accidentally and bugs result.