Thursday, June 4, 2009

Processing large files with Ruby and Rails

UPDATE: As "stefano" pointed out in the comments, the standard "gets" method does indeed accept a parameter. My fault for not checking the documentation!

Although we as web developers prefer working on neat features for our websites, sometimes we need to get down and dirty with data processing. I know, I don't like it any more than you do, but if you want to run a business sometimes you have to do stuff that isn't that fun. That's one of the reasons I like Ruby; the File library for reading and writing files makes data processing a lot simpler than similar tasks I've written in C++, Java, and other mainstream languages. Usually I just do something like this:



file = File.open("some_file")
# read the entire contents of the file into the "str" variable
str = file.read
file.close
# ... do some processing ...


Wow, that was easy! However, sometimes code like this just won't cut it. For example, if your file is too big to read into memory all at once and could cause performance issues for the server, you may want to process the file iteratively. Again, Ruby provides a pleasant way of handling the scenario. Each call to the "gets" method returns the next line of the file (or nil once you reach the end), so you only hold one line at a time, thus conserving your precious memory. See the example below:



file = File.open("some_file")
while (cur_line = file.gets)
  # ... do some processing ...
end
file.close
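
If you'd rather not manage the file handle yourself, a block-based variant should work too. Here is a minimal sketch using File.foreach, which yields one line at a time and closes the file when the block finishes. The helper name count_lines_matching and the sample file contents are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: counts the lines containing `needle`,
# reading one line at a time instead of loading the whole file.
def count_lines_matching(path, needle)
  count = 0
  # File.foreach opens the file, yields each line, and closes it afterwards.
  File.foreach(path) do |line|
    count += 1 if line.include?(needle)
  end
  count
end

# Usage with a throwaway sample file
Tempfile.create("sample") do |f|
  f.write("apple\nbanana\napple pie\n")
  f.flush
  puts count_lines_matching(f.path, "apple")
end
```

Memory use is the same as the "gets" loop; the only real difference is that you can't forget the close.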


Also pretty easy. Today, however, I ran into a new problem. What happens when your file is too big to read into memory at one time, but the data is all on one single line? Don't believe that would ever happen? Check out EDI sometime and see what you think about it (on second thought, never check that out. Never ever look at EDI. I wouldn't want to make you cry). Sometimes even XML or HTML files are written all on one line when human readability isn't of any particular concern.

Well, I had never dealt with that situation before. I had this file I needed to process, roughly 30 MB, all on one line. Now, iterative processing would be okay, because each segment in the file was in the proper order for processing and was delimited by a pipe character, but I didn't know of any built-in method on the File object that does what the "gets" method does with a delimiter other than newline (see the update at the top). So I wrote one:


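class File
  # Like "gets", but yields segments split on an arbitrary delimiter.
  def uber_gets(delimiter)
    segment = ""
    each_byte do |byte|
      char = byte.chr
      if char == delimiter
        yield segment
        segment = ""
      else
        # append in place rather than building a new string every time
        segment << char
      end
    end
    # yield whatever follows the last delimiter, if anything
    yield segment unless segment.empty?
  end
end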


With this modification, you can now do small iterative processing based on any delimiter. In my case, using EDI files, each record is separated by a "~". So I used the above method as follows:

file = File.open("some_file")
file.uber_gets("~") do |segment|
  # ... do some processing ...
end
file.close

There you go. The whole file is on one line, but the code still keeps memory consumption in check. If it helps you, enjoy. Here's a link to the gist:

http://gist.github.com/123924
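
As the update at the top says, the built-in methods take a separator argument, so the same processing can be written without patching File. Here is a minimal sketch, assuming the "~" delimiter from my EDI files; the helper name each_segment and the sample data are invented for illustration.

```ruby
require "stringio"

# Hypothetical helper: yields each delimiter-terminated segment of an IO,
# with the trailing delimiter stripped off.
def each_segment(io, delimiter)
  # IO#each_line accepts a custom record separator in place of "\n".
  io.each_line(delimiter) do |segment|
    yield segment.chomp(delimiter)
  end
end

# StringIO stands in for a real file here; an object from
# File.open("some_file") can be passed in the same way.
segments = []
each_segment(StringIO.new("ISA*00~GS*PO~ST*850~"), "~") { |s| segments << s }
p segments  # => ["ISA*00", "GS*PO", "ST*850"]
```

Calling file.gets("~") in a while loop should behave the same way, one segment per call, with the delimiter still attached to the end of each segment.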

A new way to Blog

Most of my day I spend in front of a computer. I write code, I answer e-mails, and then when I get home, I blog. All that typing can be hard on the hands. I try to do most of the right things, ergonomically speaking, but I still end up with tendinitis. Since I'm only 22 years old, this is obviously something I really want to avoid if I'd like to continue a career in the software business.

Enter dictation. This isn't the first time that I've played with the idea of speaking to my computer. Both Windows Vista, which I had installed on my old computer, and Mac OS X, which I have on all my new computers, have built-in speech recognition software. However, this is all mostly for command and control. The software helps you do things; you can open new windows, push menu buttons, click links, and do all sorts of other command-based tasks. But when it comes to actually writing, these solutions fall short. Today, though, I'm happy to say that I'm now past that point. Every word that you are reading on this page was put there by dictation software. MacSpeech Dictation is the program I'm using, and I have to admit I'm impressed. Everything I say seems to just end up on the screen without me having to use my hands or forearms.

There are some drawbacks to the software. For one, it's not cheap: $200 for the current release. Now admittedly, that's not the most expensive piece of software ever seen, but for regular consumers that price point seems a bit steep. On top of that, there is no trial version you can download. In fact, even when you buy it you can't download it. You have to have it shipped to you as if we were back in the 90s. So if you decide you want to buy MacSpeech Dictation, be aware that it's all or nothing.

For someone like me though, the advantage of being able to do my blog posts without my hands far outweighs the drawbacks inherent in MacSpeech's distribution system. I hope that as my experience with the package progresses, I'll be able to say that all my e-mails and all my blog posts are done without putting any unnecessary strain on my forearms. That way, I can save my limited typing capacity for what I enjoy most: code!

If this is something you'd like to try, check out the link below.

Mac Speech