We use delayed_job for our background queue, and it's great. You can do this really cool thing where you call "send_later" on any object, with the method you want eventually called, and it will queue it up for later execution. Very lightweight.
This is what we do for our reporting features. A user selects a date range and which report they want to run, and it gets offloaded to the background queue, sent to them later as an email link to an S3 file.
Today, suddenly, things aren't working. Reports aren't coming through; in fact the whole queue has frozen and is backed up to 85 jobs, and when we look at "top" on one of our EC2 instances, we can see the ruby processes dying as quickly as they spool up. Shit.
Frantically we search the logs. Nothing. Configuration? Still the same, and looks good. We try to manually run a few jobs, no issues. Finally I try spooling up a worker in process and telling it to work off the queue, and I see this:
ArgumentError: argument out of range
  from /usr/lib/ruby/1.8/yaml.rb:133:in `utc'
  from /usr/lib/ruby/1.8/yaml.rb:133:in `node_import'
  from /usr/lib/ruby/1.8/yaml.rb:133:in `load'
  from /usr/lib/ruby/1.8/yaml.rb:133:in `load'
From YAML? Why?!
Fortunately, we got some help from our friends at EngineYard (we've always had top-notch support from them), who pointed out this little beauty on the Ruby bug list. A bad date can cause YAML to fail hard, and in our case that happens during the "deserialization" of our jobs (when the process loads them out of the database to decide which one to lock and run). So it hits any worker that comes across that job and kills the process, effectively destroying our job queue by halting it in its path. Upon investigation, we found that someone was indeed trying to run a report from 08/01/2010 to 10/01/12010. Dammit!
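To make that failure mode concrete, here's a toy model in plain Ruby. It is only a sketch of the shape of the problem, not delayed_job's actual code: `PoisonedPayload`, `deserialize`, and `naive_worker` are all names I made up for illustration. The point is that the rescue only wraps the job run, so a row that blows up while being loaded takes the whole worker down with it.

```ruby
# Toy model only: none of these names are delayed_job's real API.
class PoisonedPayload < StandardError; end

# Stand-in for YAML deserialization; blows up on the "bad" row.
def deserialize(row)
  raise PoisonedPayload, "argument out of range" if row[:handler] == :bad_date
  row[:handler]
end

# A worker that only rescues errors raised while the job is *running*.
def naive_worker(rows)
  done = []
  rows.each do |row|
    job = deserialize(row) # unprotected: an error here escapes the worker
    begin
      done << job.call
    rescue StandardError => e
      done << "failed: #{e.message}" # per-job failures are recorded, not fatal
    end
  end
  done
end
```

A job that raises while running gets recorded and the worker moves on; a row that can't even be deserialized stops everything behind it.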
Deleting that job allowed our queue to get back up and start processing again (and since we use AppCloud, we were able to spool up another utility instance to work off the now VERY long backlog of jobs). Nevertheless, this now means we need to sanitize all of our report inputs, because it's not enough to be a VALID date (which technically that is, although far in the future), it also has to be within a reasonable distance from now (at least until we're using Ruby 1.9.2, in which this is fixed).
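Here's a minimal sketch of the kind of check I mean, run before a report job is ever queued. The `parse_report_date` helper and the ten-year window are assumptions for illustration, not what's in our codebase; pick whatever limit makes sense for your reports.

```ruby
require 'date'

# Hypothetical limit: how far from today a report date may be, in years.
MAX_YEARS_FROM_TODAY = 10

# Parses a MM/DD/YYYY string and returns a Date. Raises ArgumentError if
# the string is malformed or the date is unreasonably far from `today`.
def parse_report_date(text, today = Date.today)
  date = Date.strptime(text, '%m/%d/%Y')
  earliest = today << (12 * MAX_YEARS_FROM_TODAY) # that many months back
  latest   = today >> (12 * MAX_YEARS_FROM_TODAY) # that many months ahead
  unless (earliest..latest).cover?(date)
    raise ArgumentError, "date #{text} is outside #{earliest}..#{latest}"
  end
  date
end
```

With this in front of the queue, 08/01/2010 goes through, while 10/01/12010 gets rejected with an ArgumentError in the web request, where we can show the user a message, instead of inside a worker.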

2 comments:
Umm... how about error handling around a single job? That way, even if one of them fails, others later in the queue still execute.
Yes, to the uninformed it would seem like it would be that easy, but you have to realize that this error was not occurring as part of the job. It was happening in the delayed_job library itself as it tried to deserialize the job object to begin processing. Since every DJ process runs through the first few available jobs to find one to run, they would all try to deserialize this job, and the whole DJ process would exit.
Remember, DJ has built-in exception handling for problems that occur during the job run itself. If a job fails during execution, it will be set up to rerun later, and the associated DB column will be updated to show what the latest error was.