Tuesday, May 20, 2008

What I've learned about programming, part 1

I figured I'd try to start writing down some of the ideas that I've found useful when building large systems. There is no great general framework to the series. It's really just a grab-bag of observations about a few abstract ideas and some concrete examples from the real world. 

I'll start the series with some thoughts on queues. I think queues are one of the great under-appreciated abstractions in software. I constantly find that they're the basis of simplifying and solving many hard processing problems.

For example, the simple queue allows a producer to be separated from a consumer of data. Fundamentally, a producer-consumer queue is powerful because is synchronizes and decouples two independent parts of a system and lets them be written and run in isolation from each other. Anything that lets us decouple, synchronize and isolate a complicated system into simpler pieces is a good thing!

The producer and consumer might be in the same thread and simply be used as a convenience to store work that still needs to be done (as in remembering nodes that still need to be visited for a breadth-first search). Alternatively, the producer and consumer could be in the same thread but asynchronous events such as a disk controller generating an interrupt that is quickly queued for later handling by a disk driver in an operating system. Finally, the producer and consumer may be in different threads and serve as an explicit loosely coupled way for the two threads to coordinate their actions -- perhaps one thread generates log records and the other thread writes them out to a file.

A natural extension of a producer-consumer queue is to have multiple producers and consumers arranged linearly to form a processing pipeline. N different processing steps can be de-coupled from each other and still be properly synchronized through the very simple to understand idea of putting work into an output queue and reading work from an input queue.

Naturally, each stage of a pipeline can run concurrently with every other stage without requiring subtle reasoning about the correctness of locking schemes. In fact, multiple instances of each stage of the pipeline can run concurrently in order to increase throughput of the system without requiring additional synchronization effort.

If the queues are transactional and persistent, then the stages of the pipeline can be run on independent machines where machines and processes can fail without affecting the correctness of the system. Again, all without subtle failure recovery logic in the applications using the queuing service.

If the queues are of finite size, they can be used to schedule processes. When an output queue fills up a process must wait for space to free up, thus letting whatever process that is reading from that queue a chance to catch up. 

Imagine being able to write a program that scales to large numbers of machines/processors, runs reliably even in the face of machine and process failures, properly synchronizes itself, inherently applies back-pressure to slow down parts of the system that are running too fast and which requires nothing more complicated than understanding the idea of being able to put() to an output queue and get() from an input queue?

Monday, May 12, 2008

Thoughts on computers in 2018

When I was in grad school, I realized that the really hard part was coming up with an interesting problem to work on. Actually solving an interesting problem was much easier than coming up with it in the first place. And the prospect of spending six years working on a problem that wasn't particularly relevant was pretty depressing.

The trick with computer science, and particularly my area of computer science which is systems computer science, is to try and target problems that will be relevant based on how hardware technology advances. And since this is supposed to be relatively pure research, it should be focused on how what new problems will need to be solved based on where things will be in ten years or so.

To that end, I tried to put down a set of thoughts on what computers will be like in 2018 and then see if that inspires any particularly interesting systems problems.

First, my assumptions for the hardware itself:
  1. No rotating media, all solid state storage (except perhaps for mass backup)
  2. Many processor cores with various purposes
  3. Hundreds of megabits of wireless bandwidth (and much more wired bandwidth)
  4. Integrated biometric authentication that will pervasively authenticate the computer's operator to applications local and remote
Next, how people will use computers:
  1. Everyone uses many computers -- cellphone/PDA, home server/STB, home computer, navigation unit in cars, game console etc. People expect these computers to all work well with each other, share content, etc.
  2. Applications are mixed between cloud and local device without a clean delineation. Bandwidth won't be so plentiful that everything can just run remotely. We'll still have at least UIs running locally, and more than just UIs for many applications such as games.
  3. People use a variety of interfaces -- keyboards, speech recognition, touch, accelerometers etc
  4. There will not be a single standard, a single type of computer etc. There will be many diverse choices with varying levels of integration and compatibility.
Finally, some implications for systems computer science research:
  1. The storage hierarchy will not be as extreme as it is today (today there is a factor of a million -- nanoseconds to access L1 cache and tens of milliseconds to access disk blocks). Many issues in filesystems will go away (layout, seek minimization, random access will be nearly as fast as sequential access, etc). 
  2. Resources will be plentiful and infrequently shared between applications or users on the same computer. One of the traditional jobs of operating systems will have disappeared. Instead the challenge will be how to put all of these resources to productive use.
  3. Firewalls, NATs etc  will loose their relevance as people realize that inside the network is just as dangerous as outside, that with IPv6 there are plenty of addresses, etc. However, any given network device is likely to have many different IP addresses over the course of a day as users move between networks. Some way besides IP addresses and DNS needs to exist so that all of a users computers can name each other regardless of which network they're currently on.
  4. People will not find it that hard to put many cores to use. This is a relatively solved problem from an academic point of view.
  5. There will be a greater and greater distinction between client computers and server computers and their system software needs will be very different. While bandwidth to servers will be much higher (especially for wireless devices), latency will not be that much better. It'll be back to the 1950s mainframe days. Time to dust off those IBM redbooks.
  6. Given how many devices people will use, replication / syncing / authorization will be important. People will want a consistent view of their world regardless of which device they're using. Third parties / content owners will also want to control how content is replicated. 
  7. Single sign on, password/authentication systems etc will loose their relevance in the face of devices with integrated biometric authentication. If I can use my thumbprint to authenticate myself to my phone/laptop when I first use it each day, then it can securely and transparently authenticate me to remote computers as I need to access them during the day. Anonymizing services can act as intermediaries if I want to authenticate to a third party using a unique identity without revealing my actual identity.
  8. Most sharing will not be within a computer but between users on different computers. This will not be particularly complicated since it will be mediated by mutually trusted central computers. Think how easy it is to share pictures on Flickr with someone today -- you just send them an email invite with a capability in it. 
  9. Given the importance of the cloud and the value of the data stored on it, better systems will have to be put into place to prevent exploits, worms etc from accessing private data.