Friday, March 12, 2010

Get the gist of data loading (csv,xls,xlsx,ods)

Common problem: you need to load some spreadsheet data into your Rails application.

Common solution: send it to me as a csv file, I'll take care of it.

Why? Because a csv file is super easy to parse. Using the built-in ruby csv reader, every row comes back as an array as you traverse the file, so it's easy to process one row at a time:

CSV::Reader.parse(file).each do |row|
  #...save some model that this represents...
end

The complication that often comes up is that your customers might not know what a csv file is. So you'll get xls, xlsx, ods, pretty much anything.

You could manually save each file as a csv (kinda frustrating).

You could also write a kludgy class that wraps each library's different interface.

Or you could use this gist I'm about to provide you for all your parsing needs. It takes the power of the "roo" gem and the built-in "csv" parser, and translates all your file types into the same iterable interface that you so love:

GIST!

Now you can just write:

SpreadsheetParser.parse(file) do |row|
  #...pretend it's a csv file, every row is just an array
end
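
For a rough idea of what a wrapper like this can look like, here is a minimal sketch. It is not the gist itself: the registry, the "register" method, and the choice to take a file path rather than a file object are all assumptions made for illustration. Only the csv reader is wired in; readers backed by roo for xls, xlsx, and ods would be registered the same way.

```ruby
require "csv"

# Hypothetical sketch, not the original gist: one parse method that
# dispatches on file extension and always yields rows as arrays.
class SpreadsheetParser
  # Extension => callable that yields each row of the file as an array.
  def self.readers
    @readers ||= {
      ".csv" => ->(path, &block) { CSV.foreach(path, &block) }
    }
  end

  # Register a reader for another extension (assumed helper, e.g. one
  # that wraps a spreadsheet library and yields row arrays).
  def self.register(extension, &reader)
    readers[extension.downcase] = reader
  end

  # Look up the reader for the file's extension and hand it the block.
  def self.parse(path, &block)
    extension = File.extname(path).downcase
    reader = readers.fetch(extension) do
      raise ArgumentError, "no reader registered for #{extension.inspect}"
    end
    reader.call(path, &block)
  end
end
```

The calling code never needs to know which library did the reading, which is the whole point of the single interface.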


Enjoy!

Tuesday, March 9, 2010

New Mini-Gem in the Wild

Ok, here's a problem I've had:

I want to deploy my application to a "cloud" setup rather than my current "slice" infrastructure. This is great, but means I have to refactor out many features that used to make use of the local filesystem because I cannot depend on it. I could spawn new servers at any time in a cloud infrastructure, and there's no GFS joining them. For most features, this is no big deal: store your assets on Amazon S3 and get over it. That's what I've done for user uploads, report files, and all manner of other assets.

The problem came with some of my data updates. We have a process at my company where some of our internal users go to a page on our application to update our data from non-web-enabled sources (mostly through CSV files, since all our data sources seem to be able to generate those). Currently we store those files on the GFS, then kick off a background job, passing it the file name, and let the background job do whatever processing it needs to do on that file.

It's harder to do on the cloud, though, because you can't just store the file locally: the utility server instance that runs our background jobs won't be able to get to it, since it's on a different filesystem entirely. You could put it in the database, and if I were using MongoDB on this project I probably would, but that's not a habit I want to get into with MySQL.

The patchwork solution we're going forward with for now is ditching the file into S3, passing the key to the background task, and re-downloading the file on that end for processing.
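
The shape of that hand-off can be sketched as follows. This is only an illustration of the pattern, not the Cumulus CSV API: "FakeBucket", "store_upload", and "process_upload" are made-up names, and the in-memory hash stands in for a real S3 bucket.

```ruby
require "csv"
require "securerandom"

# Hypothetical in-memory stand-in for an S3 bucket; a real app would
# talk to S3 through a client library here instead.
class FakeBucket
  def initialize
    @objects = {}
  end

  def put(key, data)
    @objects[key] = data
  end

  def get(key)
    @objects.fetch(key)
  end
end

# Web side: store the uploaded csv text and return only its key,
# which is all the background job needs to be handed.
def store_upload(bucket, csv_text)
  key = "uploads/#{SecureRandom.uuid}.csv"
  bucket.put(key, csv_text)
  key
end

# Worker side: fetch the data back by key and yield one row at a time.
def process_upload(bucket, key)
  CSV.parse(bucket.get(key)).each { |row| yield row }
end
```

Only the key crosses the boundary between the web server and the worker, so neither side has to share a filesystem with the other.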

In order to make this process a little more palatable, I've quickly built and released a mini-gem called Cumulus CSV. It just wraps a simple interface around storing an uploaded csv file to S3, and iterating over it later. It's available from my github account, or on gemcutter as "cumulus_csv", so if you're struggling with the same problem see if it will help you out!

Monday, March 8, 2010

We're Hiring!

This is a call to all Rubyists in Columbia, MO; if you're looking for a new (or your first) coding gig, we're interested in talking to you.

Here's the story: we (mostly I) went and built this Rails app, therapylog.com. It lets therapists at public school districts track all of the therapy they do from day to day, in what we think is a pleasant way. No paper, no hassle! We even started a BLOG to let our therapists see how our webapp was growing (you can check it out too to see some sample functionality).

Now the problem: We've grown. We've got a slew of new feature ideas, plus a bit of refactoring work ahead of us, and I can't do it alone.

What I'm looking for is somebody at what an enterprise operation might call the "junior" level. We've got a pretty good code base established, but need someone to step in and keep the gears of progress moving forward. You would be working with me directly, as locally as possible, and we could have you start as soon as you're ready.

Things I'd like:

-at least some experience with Ruby or Python, doesn't have to be professional in nature
-interest in making things better, not just getting the checklist done
-comfortable on a non-Windows development environment (*nix, Mac OS X)
-history of self-improvement
-not necessarily a design pro, but unwilling to create something ugly
-good sense of humor (won't help your coding, but should at least help our interaction) :)

Although we'd love to have a 5-7 year Rails vet, the fact of the matter is that we're talking a $30,000-$40,000+ position ($20/hr if part time). Being a developer myself, I know what you're worth when you've been in the business that long, and I can't see you being happy at our current cashflow level.

However, if you're a recent/soon-to-be college grad or a programming hobbyist looking for a career change, this might be for you!

You can reach me directly at ethan.vizitei@gmail.com. I can't wait to hear from you!

Friday, March 5, 2010

I've been had!

Here is a lesson for you that you should take to heart: Trust your mentors, but find out for yourself.

When I first heard about batched finding in ActiveRecord (#find_each and #find_in_batches), it was a great revelation. My buddy told me how it worked, and I loved it!

According to him, find_in_batches used a "batch_size" hash option, defaulting to 1000, to decide how many records to load into memory at a time, and "find_each" would perform a query for each record one at a time.

Given those options, and wanting to balance my memory conservation against responsible use of database time, I ended up with a lot of blocks like this:


Model.find_in_batches do |models|
  models.each do |model|
    # ...do something with each model...
  end
end

It seemed tedious to do a nested block every time, but I didn't want to use the "find_each" option which was going to slam my database with a million single row queries.

Eventually I'd repeated myself enough that I decided to do something about it. I figured I'd create a plugin that would monkeypatch in a method called "find_each_in_batches" that would do the extra block for you, so you could automatically iterate over one model at a time but still have the finding occur in batches. In order to get the namespacing right, I opened up the source file for ActiveRecord's built-in batching methods (batches.rb), and imagine my surprise when I saw this:


def find_each(options = {})
  find_in_batches(options) do |records|
    records.each { |record| yield record }
  end
  self
end

What do you know! #find_each actually does exactly what I was planning on making my new plugin do! My buddy had been misinformed, and had in turn misinformed me, and I just took his word for it!
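
To see the effect without a database, here is a small sketch that copies the delegation above onto a fake model. "FakeModel", its "RECORDS" list, and the "query_count" counter are inventions for illustration; each slice handed out by the fake find_in_batches stands in for one database query.

```ruby
# Hypothetical stand-in for an ActiveRecord model: 2,500 "rows" and a
# counter that goes up once per simulated batch query.
class FakeModel
  RECORDS = (1..2500).to_a
  @query_count = 0

  class << self
    attr_reader :query_count

    # Simulates batched loading: one "query" per slice of batch_size.
    def find_in_batches(options = {})
      batch_size = options.fetch(:batch_size, 1000)
      RECORDS.each_slice(batch_size) do |batch|
        @query_count += 1
        yield batch
      end
    end

    # Same shape as the ActiveRecord source shown above.
    def find_each(options = {})
      find_in_batches(options) do |records|
        records.each { |record| yield record }
      end
      self
    end
  end
end

seen = 0
FakeModel.find_each { |record| seen += 1 }
puts "records yielded: #{seen}, simulated queries: #{FakeModel.query_count}"
```

With the default batch size of 1000, all 2,500 records come through one at a time while only three batches are ever loaded.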

If I had checked out this source code myself in the first place, I would have known this from the start, and wouldn't have those double iteration blocks littered throughout my code base.

Reading the source code is a good habit anyway, because it introduces you to idioms the pros are using that you may not be familiar with, and it's a great way to improve your own code. But even just a glance at the inline documentation to confirm what I'd been told would have prevented a world of hurt.

So, while I'm here slapping my forehead, I encourage you to take heed of what I said at the top: Trust your mentors, but find out for yourself.