Wednesday, January 27, 2010

gem Read-through: slim_scrooge

Ok, new project. I believe it's dangerous to rely on code that you do not understand. As a Rails developer, I have tons of plugins and gems that I do not understand. See the problem?

To rectify this, I'm making it my goal to read through one of my main project's many dependencies each week. Two side benefits:

1) I will probably be better at writing my own open source libraries if I've seen a larger sample of how they're usually constructed.

2) Code reading is good for you, but it's tough to find time to just sit down and crack open a library. This will give me a good reason.

So without further ado, today I'm doing a read-through of slim_scrooge, a great ActiveRecord optimizing library that has made a difference in the performance of my current main project. Don't expect anything linear here; I'm just going to record my notes, and if you want to use them too you're welcome to them.

Slim Scrooge

The point of the slim scrooge library is to monitor your ActiveRecord queries and optimize them so that they only pull back the columns that you end up using in that section of code. Let's find out how it works:


1) First thing I noticed: there is a test directory, but no tests. Problem? Maybe...

2) Scratch my first note. It appears that SlimScrooge::ActiveRecordTest actually runs the ActiveRecord tests that are included with Rails. I guess this makes sense as a regression test: anything that filters ActiveRecord should still pass the ActiveRecord test suite. Still, this definitely means that the gem's own behavior is not under test. The gem could do nothing, and the tests would still go green. I'm not here to judge, though. I've written my own share of untested code.

3) The first included file in the main library is a C extension called 'callsite_hash', which lives in the /ext directory of the plugin. My "C" is a little rusty since I've been out of it for 3 years, but I think I get that it's defining the global ruby function "callsite_hash", and mapping it to the C function "rb_f_callsite" in this callsite_hash.c file. I don't know what it does yet, as the rb_f_callsite function is a little dense for my limited C skills, but maybe it will make more sense in context. So, moving on.

4) Next inclusion is SlimScrooge::SimpleSet (a subclass of Hash, /lib/slim_scrooge/simple_set.rb). This class stores a set of keys based on a submitted array, all mapped to the value "true". Because of the syntax, each time an element is added, it will only create a new entry if it's not already in the set. So basically it's a set of unique elements with some helper methods to keep operations restricted to only the keys (like a collect method that only runs over the keys array). Knowing what the gem does, at this point I'm guessing this is the structure that column names are stored in so you know which ones were used and which ones weren't after a query. We'll see.
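To make the "Hash as a set" idea concrete, here is a minimal sketch of my own. It is not the gem's code: the class name TinySet and its methods are made up for illustration, and the real SimpleSet may offer different helpers.

```ruby
# Hypothetical stand-in for SlimScrooge::SimpleSet; all names are illustrative.
class TinySet < Hash
  def initialize(items = [])
    super()
    items.each { |item| self[item] = true }
  end

  # Adding an element that is already present just overwrites its `true`,
  # so the set never holds duplicates.
  def <<(item)
    self[item] = true
    self
  end

  # Run over the members (the keys) and ignore the `true` values.
  def collect(&block)
    keys.collect(&block)
  end

  def to_a
    keys
  end
end

columns = TinySet.new(%w[id name])
columns << "name" << "email"
columns.to_a # => ["id", "name", "email"]
```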

5) Moving on to /lib/slim_scrooge/callsites.rb, which defines the class SlimScrooge::Callsites (no parent class). This class only has class-level methods, so I guess it's never instantiated. It has a class variable called @@callsites, which is a hash. Write access to the hash is synchronized through the use of a Mutex, which is instantiated at the time of class definition as a class-level constant (SlimScrooge::Callsites::CallsitesMutex). Given that I don't know what's being stored here, I don't feel like I can accurately analyze it. Therefore, I'm jumping over to the top-level algorithm in /lib/slim_scrooge/slim_scrooge.rb
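The general shape of that pattern (a class that is never instantiated, holding a hash whose writes go through a Mutex) looks roughly like this. This is my own sketch with invented names (CallsiteRegistry, REGISTRY_MUTEX), not a copy of the gem's class.

```ruby
# Illustrative only: a class-level registry whose writes are guarded by a Mutex.
class CallsiteRegistry
  REGISTRY_MUTEX = Mutex.new # created once, when the class is defined
  @@callsites = {}

  def self.[](key)
    @@callsites[key]
  end

  def self.has_key?(key)
    @@callsites.has_key?(key)
  end

  # Only one thread at a time may run the block passed to synchronize.
  def self.[]=(key, value)
    REGISTRY_MUTEX.synchronize { @@callsites[key] = value }
  end
end

threads = 4.times.map do |i|
  Thread.new { CallsiteRegistry[i] = "callsite #{i}" }
end
threads.each(&:join)
```

After the threads finish, each of the four keys holds the value its thread wrote.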

6) lib/slim_scrooge/slim_scrooge.rb is definitely the meat of the gem. SlimScrooge uses good old alias_method_chain to bring about "find_by_sql_with_slim_scrooge" (defined in the gem) and "find_by_sql_without_slim_scrooge" (the original "find_by_sql" method in ActiveRecord). This is how the gem inserts itself into every ActiveRecord call. In "find_by_sql_with_slim_scrooge", we see what's being done step by step:

A) If the sql passed in is an array (that is, a custom query directly from a programmer writing Model.find_by_sql("blah")), don't bother. Let it run like normal.
B) If this "callsite" has been seen before, try to optimize it.
C) If it hasn't been seen before, try to monitor it.
D) Otherwise, let it go (find_by_sql_without_slim_scrooge).
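For anyone who hasn't seen the with/without aliasing trick written out by hand, here is a toy version of the steps above. Everything in it is an assumption made for illustration: FakeModel is not ActiveRecord, the "_tracking" suffix is invented, and using sql.hash as the key is only a placeholder for the gem's real callsite key.

```ruby
# Toy stand-in for a model class; this is not real ActiveRecord.
class FakeModel
  def self.find_by_sql(sql)
    "ran: #{Array(sql).first}"
  end
end

# Roughly what alias_method_chain :find_by_sql, :tracking sets up, by hand.
class << FakeModel
  def seen_callsites
    @seen_callsites ||= {}
  end

  def find_by_sql_with_tracking(sql)
    # A) Array-style SQL is passed straight through.
    return find_by_sql_without_tracking(sql) if sql.is_a?(Array)

    key = sql.hash # placeholder key, not how the gem computes it
    if seen_callsites[key]
      # B) Seen before: this is where an optimized query would be built.
      find_by_sql_without_tracking(sql) + " (seen before)"
    else
      # C) First visit: remember it, then run the original query.
      seen_callsites[key] = true
      find_by_sql_without_tracking(sql)
    end
  end

  alias_method :find_by_sql_without_tracking, :find_by_sql
  alias_method :find_by_sql, :find_by_sql_with_tracking
end

FakeModel.find_by_sql("SELECT * FROM users") # => "ran: SELECT * FROM users"
FakeModel.find_by_sql("SELECT * FROM users") # => "ran: SELECT * FROM users (seen before)"
```

Callers keep calling find_by_sql as before; the original behavior stays reachable under the "without" name.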

7) So what is a "callsite"? How do you know if you've been here before? Well, apparently that's what the C extension ("callsite_hash.c") is for. The query is passed into this black-magic extension, which by some occult method creates a unique key for it (called a callsite_key). This is then stored in that class-level hash in the "Callsites" class.
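I can't vouch for what the C code actually hashes, but a rough pure-Ruby approximation of "a key for where this query was issued from" could combine the call stack with the SQL. The method names below are invented, and the assumption that the key depends on both the stack and the query is mine.

```ruby
# Rough approximation only; the gem does this in C and may work differently.
def rough_callsite_key(sql)
  # `caller` is the stack of whoever called this method.
  [caller, sql].hash
end

def load_users
  rough_callsite_key("SELECT * FROM users")
end

same_site  = [1, 2].map { load_users } # both calls come from this one line
other_site = load_users                # same query, different line
```

Within one run, the two keys in same_site match each other, while other_site gets a different key because its call stack differs.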

8) There is logic written in here to pass the query through unoptimized if it is not "scroogable", and several conditions make a query non-scroogable. For one, if there's any joining, it won't bother. It also won't bother if it's not a "select" query (that is, unless it starts with SELECT, includes the expected table name, and has a "FROM" in it). [These were limitations I was unaware of before].
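Those conditions could be checked along these lines. This is a sketch under my own assumptions: the method name slimmable? and the regular expressions are invented, and the gem's real test may be stricter or looser.

```ruby
# Hypothetical check; the gem's actual conditions and regexes may differ.
def slimmable?(sql, table_name)
  return false if sql =~ /\bJOIN\b/i # any joining: leave the query alone

  starts_with_select = sql =~ /\A\s*SELECT\b/i
  mentions_table     = sql.include?(table_name)
  has_from           = sql =~ /\bFROM\b/i

  !!(starts_with_select && mentions_table && has_from)
end

slimmable?("SELECT * FROM users WHERE id = 1", "users")                        # => true
slimmable?("SELECT * FROM users JOIN posts ON posts.user_id = users.id", "users") # => false
slimmable?("UPDATE users SET name = 'x'", "users")                             # => false
```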

9) The monitoring of a query is done by attaching a MonitoredHash to each row in the first query. This hash maintains a reference to the callsite, and can be configured to not monitor certain columns. Anytime a column is accessed that was previously unseen, the callsite is notified.
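A stripped-down version of that idea is a Hash subclass that reports each read back to its callsite. The names here (ToyCallsite, WatchedRow, column_used) are made up, and the gem's MonitoredHash is certainly more involved than this.

```ruby
# Hypothetical names throughout; this only illustrates the notification idea.
class ToyCallsite
  attr_reader :seen_columns

  def initialize
    @seen_columns = []
  end

  def column_used(name)
    @seen_columns << name unless @seen_columns.include?(name)
  end
end

class WatchedRow < Hash
  def initialize(attributes, callsite, ignored = [])
    super()
    update(attributes)
    @callsite = callsite
    @ignored = ignored # columns that should not be monitored
  end

  # Every read tells the callsite which column was touched.
  def [](column)
    @callsite.column_used(column) unless @ignored.include?(column)
    super
  end
end

callsite = ToyCallsite.new
row = WatchedRow.new({ "id" => 1, "name" => "Ann", "bio" => "..." }, callsite, ["id"])
row["name"]
row["id"]
callsite.seen_columns # => ["name"]
```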

10) Next time the query is run, the callsite has a record of which columns were used and uses "scrooged_sql()" to only produce a select query for those columns.
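Once you know which columns were used, the rewrite itself can be as simple as swapping out the select list. The helper below is my own illustration (slimmed_sql is an invented name) and is not the gem's scrooged_sql.

```ruby
# Illustrative rewrite of the SELECT list; not the gem's scrooged_sql.
def slimmed_sql(sql, table_name, columns)
  select_list = columns.map { |column| "#{table_name}.#{column}" }.join(", ")
  # Replace everything between the leading SELECT and the first FROM.
  sql.sub(/\ASELECT\s.*?\sFROM\b/im, "SELECT #{select_list} FROM")
end

slimmed_sql("SELECT * FROM users WHERE active = 1", "users", %w[id name])
# => "SELECT users.id, users.name FROM users WHERE active = 1"
```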

Well, this was fun. I feel like I've learned a bit about how my site works under the hood, and I'm a little more qualified to comment on the use of this gem in the future. Here are a few things I learned that are not directly about the slim_scrooge gem:

1) The Mutex class can be used to synchronize access to an object.

2) ActiveRecord appears to direct all queries through the "find_by_sql" method. That's the place to hit it if you want to get in some sort of filtering.

3) C extensions for ruby use an "Init_*" method to integrate themselves into the runtime.

Until next time,


Friday, January 22, 2010

Un-Joining your Scopes

I had a surprising problem today when one of my tests started failing after I had done a little refactoring. You see, I'd had this ActiveRecord class that was doing some reporting (massive data extraction) and it was originally using a pretty ugly SQL statement:

Model.find_in_batches("Massive SQL statement") do |models|
  models.each do |model|
    models_to_compare = Model.scope_with_other_joins
  end
end

Naturally, I wanted to make that SQL statement go away, and use a bundle of named scopes instead. I had good tests wrapping this area already, so I set autotest running and started hacking away:

Model.scope_with_some_joins.find_in_batches do |models|
  models.each do |mdl|
    other_comparisons = Model.scope_with_other_joins
  end
end

Note that both queries (line 1 and line 3) have similar joins involved. Now, my tests started failing on line 3 -- I get a runtime error showing me that, for some reason, the second query maintains the join scope from the outer query, giving me an "ambiguous column" error because one table is joined in from both queries. Now, this "some reason" is really just that this is the way it's designed... the whole point of a "scope" is to be able to nest other things inside of it. In my case, though, I needed the line 3 query to be totally separate and distinct. It took some googling and stack-overflowing (love that community), but here's what I discovered:

Model.scope_with_some_joins.find_in_batches do |models|
  models.each do |mdl|
    Model.send(:with_exclusive_scope) do
      other_comparisons = Model.scope_with_other_joins
    end
  end
end

This protected "with_exclusive_scope" method resets the scope entirely for that model within that block. Thus, you're able to have a clean query regardless of the surrounding context. Now, I'm not saying that this hack of sending a protected method is ever a good idea, but in my case I didn't have an easy way to get around it (other than leaving the SQL statement in place). It's still cleaner to me than having that giant SQL string I had in the code before, and maybe once I do a little more reading on the subject I'll get an even better idea. Other suggestions welcome!

Thursday, January 21, 2010

Autotest saves the day

This is not a tutorial on how to set up autotest on your machine. People have already done that plenty of times; a good one is here.

I'm just writing to say what a big difference it's made for me. My first job was at a "test-everything" development shop. I really agreed with the notion of having solid tests surrounding all possible code, and running them before every commit/deployment. The problem for me arose when I moved to start my own business. Without all that peer pressure (I'm working as the only development talent currently), it's easy to fall off the wagon. Especially when using tools that don't exactly integrate testing into your workflow. Typically I'd write tests around all new code, and run them before committing, but if I was making a quick change just to format something better or to fix a bug, I was often hurried enough to not only write no new tests, but to not run any of my current tests before committing just to get the damn thing out the door.

Autotest silently runs in the background, running your tests anytime you change a file. Not only that, it can be configured to use Growl to notify you every time a test breaks. Now I don't even have to think about it. Every time I press command-S, my tests get run and I know that at the least I haven't broken anything that's currently covered.

Of course the limitation is that you must write tests in the first place. Having your tests run all the time without any real coverage doesn't save you much. For me, though, just knowing that my tests are being run consistently gives me more motivation to write more of them, more often. If you're a rails developer, just try it. It's really not too much of a time commitment to set up, and if you're like me you'll be suddenly one giant step closer to being the unit-testing-guru that you always wished you were.

Monday, January 11, 2010

My favorite coding music

I thought I'd kick off the new year on a slightly less technical note than usual: the tunes I play to boost my productivity. Silence kills me and causes my mind to drift; I personally find that having the right kind of music playing low in the background keeps me going in the right direction when I'm trying to work.

This is a very personal subject, so I can't say that what works for me will work for you, but it can't hurt to try out what I'm playing and see what it does for your work life.  So, without further ado, here are my 5 top albums for cranking out code:

5) 'Marimbach' - Beverley Johnston

Soft, compelling, and beautiful. This is an album made entirely from classical marimba pieces (mostly solos). It's not distracting at all, but isn't so slow or quiet as to make you lethargic.

4) 'Flight of the Cosmic Hippo' - Bela Fleck and the Flecktones

A nostalgic favorite of mine from high school, this album is a collection of eclectic instruments playing a genre that is entirely its own. Don't be thrown off by the funky title; these are some serious musicians (including the great Victor Wooten) and it is pretty high-energy the whole way through. My favorite for pressure situations. If I have a deadline today, this is what I play.

3) 'Hush' - Yo-Yo Ma & Bobby McFerrin

This is a collection of duets between a pair of virtuosos: Yo-Yo Ma, the master cellist, and Bobby McFerrin, the most amazing vocalist I've ever listened to. This is a man who has truly made his voice an instrument, and it's almost trance-inducing. I like to put this on first thing in the morning when I sit down to start, although I can't articulate why. Just give it a chance. It's strange at first, but once it grows on you it's an indispensable part of your collection.

2) 'Evolution' - Stefon Harris & Blackout

As a mallet player myself, there are few contemporary musicians I love as much as Stefon Harris. This is much richer and fuller music than anything else on my list, but I think it's perfect for the "deep coding" work. Unhurried, but high energy, and smooth without being "easy listening" (which I despise).

1) 'Libertango' - Gary Burton

My favorite album in my collection; its steady rhythm always makes me feel motivated. If you try nothing else on my list, give this a listen, especially track one (the title track). It's very dense music, but so accessible even to one who is not a musician. I can't recommend it enough.

So if you're looking for a few new tunes to keep you going during those times of loose focus, see if you can find something you like among my favorites, and drop me a line if you do! I've got more in my back pocket, I just can't share it all in one blog post. :)