Although we as web developers prefer working on neat features for our websites, sometimes we need to get down and dirty with data processing. I know, I don't like it any more than you do, but if you want to run a business, sometimes you have to do stuff that isn't that fun. That's one of the reasons I like Ruby; the File library for reading and writing files makes data processing a lot simpler than similar tasks I have written in C++, Java, and other mainstream languages. Usually I just do something like this:
file = File.open("some_file")
#read all the contents of the file into "str" variable
str = file.read
file.close
# ...do some processing on str...
Wow, that was easy! However, sometimes code like this just won't cut it. For example, if your file is too big to read into memory all at once and could cause performance issues for the server, you may want to process the file iteratively. Again, Ruby provides a pleasant way of handling the scenario. The "gets" method returns the next line of the file each time you call it (and nil once the file is exhausted), so you can loop over the file one line at a time, thus conserving your precious memory. See the example below:
file = File.open("some_file")
while(cur_line = file.gets)
  # ...do some processing on cur_line...
end
file.close
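For what it's worth, the same line-by-line loop can also be sketched with File.foreach, which opens the file, yields one line at a time, and closes the file when the block finishes. The helper name count_nonblank_lines is made up for illustration and is not part of the original post:

```ruby
# Hypothetical helper: counts the non-blank lines in a file
# without ever holding the whole file in memory.
def count_nonblank_lines(path)
  count = 0
  # File.foreach yields each line in turn and closes the file afterwards.
  File.foreach(path) do |line|
    count += 1 unless line.strip.empty?
  end
  count
end
```

Because the file is opened and closed inside the call, there is no file.close to forget.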
Also pretty easy. Today, however, I ran into a new problem. What happens when your file is too big to read into memory at one time, but all the data is on one single line? Don't believe that would ever happen? Check out EDI sometime and see what you think about it (on second thought, never check that out. Never ever look at EDI. I wouldn't want to make you cry). Sometimes even XML or HTML files are written all on one line when human readability isn't of any particular concern.
Well, I had never dealt with that situation before. I had this file I needed to process, roughly 30 MB, all on one line. Now, iterative processing would be okay, because each segment in the file was in the proper order for processing and it was delimited by a pipe character, but there just aren't any built-in methods all in the file object that do what the "gets" method does on a delimiter other than newline. So I wrote one:
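class File
  # Yields each delimiter-separated segment, reading one byte at a time.
  def uber_gets(delimiter)
    segment = ""
    each_byte do |byte|
      char = byte.chr
      if char == delimiter
        yield segment
        segment = ""
      else
        segment << char
      end
    end
    # Yield whatever follows the last delimiter, if anything.
    yield segment unless segment.empty?
  end
end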
With this modification, you can now do iterative processing in small pieces based on any delimiter. In my case, using EDI files, each record is separated by a "~". So, I used the above method as follows:
file = File.open("some_file")
file.uber_gets("~") do |segment|
  # ...do some processing on segment...
end
file.close
There you go. The whole file is on one line, but the code is still respecting memory consumption. If it helps you, enjoy. I'll post a link to the gist:
http://gist.github.com/123924

3 comments:
http://www.reddit.com/r/ruby/comments/8pydz/processing_large_files_with_ruby_and_rails
"(...) there just aren't any built-in methods all in the file object that do what the "gets" method does on a delimiter other than newline."
Ahem... maybe you should have a second look at the documentation for IO#gets :-)
@stefano,
I looked it up, and you are right, "gets" takes a parameter specifying the separator. All the examples I've ever seen of it never mention that capability. I guess that's what I get for putting faith in blog posts. :)
Thanks for the heads up!
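For reference, here is a minimal sketch of the built-in approach the comments describe, passing the separator straight to "gets". The helper name each_record is hypothetical; the only assumption beyond the post is that each returned chunk still carries its trailing separator, which is why it gets chomped:

```ruby
# Hypothetical helper: yields each record of a file split on `delimiter`,
# reading only one record into memory at a time.
def each_record(path, delimiter)
  File.open(path) do |file|
    # gets(sep) reads up to and including the next separator,
    # and returns nil at end of file.
    while (record = file.gets(delimiter))
      yield record.chomp(delimiter)
    end
  end
end
```

Calling each_record("some_file", "~") { |segment| ... } would then stand in for the uber_gets loop from the post.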