We use delayed_job for our background queue, and it's great. You can do this really cool thing where you call "send_later" on any object, passing the name of the method you eventually want called, and it will queue that call up for later execution. Very lightweight.
This is what we do for our reporting features. A user selects a date range and which report they want to run, and the work gets offloaded to the background queue. The result is sent to them later as an emailed link to a file on S3.
Today, suddenly, things aren't working. Reports aren't coming through; in fact, the whole queue has frozen and is backed up to 85 jobs. When we look at "top" on one of our EC2 instances, we can see the Ruby processes dying as quickly as they spool up. Shit.
Frantically we search the logs. Nothing. Configuration? Still the same, and looks good. We try to manually run a few jobs, no issues. Finally I try spooling up a worker in process and telling it to work off the queue, and I see this:
ArgumentError: argument out of range
    from /usr/lib/ruby/1.8/yaml.rb:133:in `utc'
    from /usr/lib/ruby/1.8/yaml.rb:133:in `node_import'
    from /usr/lib/ruby/1.8/yaml.rb:133:in `load'
    from /usr/lib/ruby/1.8/yaml.rb:133:in `load'
From YAML? Why?!
Fortunately, we got some help from our friends at EngineYard (we've always had top-notch support from them), who pointed out this little beauty on the Ruby bug list. A bad date can cause YAML to fail hard, and in our case that happens during the "deserialization" of our jobs (when the process loads them out of the database to decide which one to lock and run). So the failure hits any worker that comes across that job and kills the process, effectively destroying our job queue by halting it in its path. Upon investigation, we found that someone was indeed trying to run a report from 08/01/2010 to 10/01/12010. Dammit!
Deleting that job allowed our queue to get back up and start processing again (and since we use AppCloud, we were able to spool up another utility instance to work off the now VERY long backlog of jobs). Nevertheless, this now means we need to sanitize all of our report inputs, because it's not enough to be a VALID date (which technically that is, although far in the future), it also has to be within a reasonable distance from now (at least until we're using Ruby 1.9.2, in which this is fixed).
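For the curious, here is a minimal sketch of the kind of sanity check described above. It is not our actual code: the helper name parse_report_range, the InvalidReportRange error, the MM/DD/YYYY input format, and the five-years-back / one-year-forward window are all assumptions made up for illustration. The idea is simply to reject the range before anything gets queued, so a bad date never reaches the serialized job.

```ruby
require 'date'

# Hypothetical limits, chosen only for this sketch.
MAX_YEARS_BACK    = 5
MAX_YEARS_FORWARD = 1

# Hypothetical error class so callers can rescue bad input specifically.
class InvalidReportRange < ArgumentError; end

# Parses two MM/DD/YYYY strings and returns [start_date, end_date].
# Raises InvalidReportRange if either string is not a real date, if either
# date falls outside the allowed window around `today`, or if the range is
# backwards. `today` is a parameter so the check is easy to test.
def parse_report_range(start_str, end_str, today = Date.today)
  begin
    start_date = Date.strptime(start_str, '%m/%d/%Y')
    end_date   = Date.strptime(end_str, '%m/%d/%Y')
  rescue ArgumentError
    raise InvalidReportRange, 'dates must look like MM/DD/YYYY'
  end

  earliest = today << (12 * MAX_YEARS_BACK)     # shift back N years
  latest   = today >> (12 * MAX_YEARS_FORWARD)  # shift forward N years

  [start_date, end_date].each do |date|
    unless date >= earliest && date <= latest
      raise InvalidReportRange, "#{date} is too far from today"
    end
  end

  if start_date > end_date
    raise InvalidReportRange, 'start date is after end date'
  end

  [start_date, end_date]
end
```

With something like this in the controller (or wherever the report request is accepted), a range such as 08/01/2010 to 10/01/12010 would be rejected up front with an error we can show to the user, instead of being written into the jobs table where every worker trips over it.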