Wednesday, December 31, 2008

Happy New Year

My first post on this blog happened in January 2008. At that time, I figured I'd be doing this for a couple of months and then move on to something else. Instead it's turned into something I truly have enjoyed, as I've gotten all kinds of valuable feedback and advice from readers. As a final post of the year (before I go find some eggnog and an Xbox party), I just wanted to thank everybody who took the time to read what I had to say and start a conversation with me about it. I hope those opportunities come up even more frequently in 2009.

And now, by the authority of Google Analytics, I give you the 10 most popular posts of 2008:

10) Rails Reporting with Excel: Why write functionality into your webapp that already exists in a spreadsheet program that's been around for years? Just give your users the data in an Excel sheet, and they can manipulate it any way they want (this post will show you how).

9) Centurytel: What an Embarrassment. Hey kids! Want to see some comedy? Well, everyone enjoys a good rant, so step on up and watch as I totally lose my cool and blast my old ISP for doing exactly what all the other ISPs do all the time!

8) Background Jobs in Rails: Have some work that needs to be done in a background thread? Here's a quick-and-dirty way to get it done.

7) Maintaining Discipline: A remorseful post I wrote the morning after I decided to go ahead and "get shit done" without writing tests or refactoring. A case study in technical debt.

6) Refactoring in Ruby: You might call this a "lab" in refactoring. Basically I just walked through a real-life example of fixing up some bad code that I had run into the day of that post.

5) Remote NHibernate: Apparently I'm not the only person who's ever had trouble getting NHibernate to work over a network. This post doesn't have all the answers, but it captures everything I learned when I first attempted it, and I've gotten many emails thanking me for the information it does have, so it must be useful to some people.

4) Becoming Less Obtrusive: I wrote this post when I finally caught up with the times and started using an unobtrusive approach to JavaScript. Not much in the way of implementation here, but if you want to know why I write all my JavaScript this way now, then this post will tell you.

3) Sending email with Rails and Gmail: Ever wanted to have your Rails application send out emails, but don't want to configure Postfix? Here's your ticket. I don't use this trick anymore because I eventually was sending out so many emails to users that Google thought I was a spammer, but if your email volume is relatively low, this might be the perfect solution for you.

2) Burning the midnight oil: Honestly, I have no idea why this one became as popular as it did. All I was saying was that I seemed to work better at night, but it became a huge deal. If you like midnight coding as well, check it out.

1) All I need is a programmer: a short diatribe I wrote after a friend of mine expected me to do a programming project for him for free. For some reason it must have struck a nerve with a lot of people out there, because I got a lot of very positive feedback from programmers in the same position.

Thanks again for keeping this blog alive (not through money, because I don't accept donations or run ads, but through attention, because that's really what I crave anyway).

Happy 2009!

ActiveRecord and Large Datasets

So I ran into a neat problem today. I should clarify right off the bat that when I say "neat" I really mean "ate up all the memory on my server and caused it to slow to a crawl where it was serving requests at an average of 20 seconds each". On a regular day that would be cause for panic, but our product is used by school districts and we had next to no traffic today, so I had the luxury of considering my problem a "challenge" instead of an "emergency". Here's what I found:

Those of you who are repeat readers of my blog will know that I've done several posts in the past on the topic of generating reports via a background thread so that they don't take up time on one of your mongrels. That's where my problem was. One of the "Reports" I can run as the administrator of my startup's product is an audit that checks every transaction ever put into our system and reports on the ones that need attention. The specific business rules aren't important for this post; suffice it to say that we have to filter through every record in our transactions table.

Now, the way I had written this before looked something like this:


District.all.each do |dist|
  transactions = Transaction.find(:all, :conditions => "..belongs to district..")
  transactions.each do |trans|
    report_on(trans) if trans.should_report
  end
end


That's obviously way oversimplified, but the gist of it is there. In previous months this has worked fine, but every month the transactions table picks up another 200,000 or so records, and I guess this month was the breaking point for the audit report. You see, the reason I had split up the transactions by school district instead of iterating through all of them at once was to defuse the problem of bringing too many records into memory at once. Clearly that's no longer enough of a precaution, because the table has grown to the point where even the number of records for just one school district is too much to process in memory at once. We need a new solution.

ActiveRecord (IMO) is great as a database interface. What it's not so super at (out of the box) is very large sets of data. Usually that's not a problem for a website, because almost everything a user needs to do deals with either a single record per action or a very small set of records pertinent to that specific user. Now that I need to deal with millions of records in one go, I've kind of exceeded what AR alone is going to be able to do for me.

However, let's not forget that there is another case common to web apps where we need to take a dataset in chunks rather than as a whole. What happens when you need to show a user 100 records in a table? Do you just display them all on one page? You can, but most people wouldn't. That's why we have so many great plugins that deal with pagination, and that's what I used to dig myself out of this hole.

At first it seemed unorthodox to use a piece of functionality designed to make views more pleasant as an asset for solving my data processing problem, but after thinking about it for a few minutes I realized that the problem I was trying to solve was exactly the same: I want to deal with only small chunks of data at a time.

If you don't have a pagination plugin, I highly recommend the one that I use, Paginating Find. Here's what my code looks like now:


arr = Transaction.find(:all, :page => { :size => 100 })
begin
  # process the page currently in memory...
  arr.each do |trans|
    report_on(trans) if trans.should_report
  end
  # ...then advance to the next page, if there is one
end while arr.next_page? and arr.next_page!


Done. Now we're only processing the transactions in batches of 100. The running time is a little longer, but it happily uses only a small amount of memory the whole time, so the web server can keep on cranking.
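
If you'd rather not add a plugin dependency at all, the same chunking can be done (a little more verbosely) with the plain :limit and :offset options on find. This is just a sketch of the idea rather than the code I'm actually running; report_on and should_report are the same placeholders as above, and I'm assuming an id column to keep the slices stable:

batch_size = 100
offset = 0
loop do
  # pull one fixed-size slice at a time, in a stable order
  batch = Transaction.find(:all, :order => 'id',
                           :limit => batch_size, :offset => offset)
  break if batch.empty?
  batch.each { |trans| report_on(trans) if trans.should_report }
  offset += batch_size
end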

Hope someone else out there benefits from the fix!

Tuesday, December 30, 2008

Binary Search

This is another journal entry as I follow along with an MIT algorithms class from iTunesU. If you haven't read the first two posts in this series (Insertion Sort and Merge Sort), what I'm trying to do here is record what I'm learning so that I can solidify it in my mind and catalog any information that others might find useful later. As I've said in my previous posts, I'm presenting this information as a student, not as a master, so any corrections would be most welcome.

Today I want to look at the following problem:

I have a sorted array of 10,000 elements. I am asked to find a certain element in the array and return the index of its location (or some other value like -1 or nil if it is not in the array). What's the best way to do this?

Well, before I give you the good answer (which has really already been given away in my title), let's look at the naive solution: iteration. This is the easy one to implement, and frankly it's the first one that would have occurred to me (although I know better now). Pseudocode for an iterative searching algorithm follows:


1  define linear_search(array, token)
2    for each i in 0..(array.size - 1)
3      value = array[i]
4      if (value == token)
5        return i
6      else if (value > token)
7        return -1
8      end
9    end
10   return -1
11 end


Not bad really, short and sweet. Why don't we do a quick analysis to determine what the runtime is?

Let's see, I want to do a worst-case analysis, so I'm going to assume that the desired token is the last element in the array. Line 2 (the loop initializer) is run exactly 1 time, regardless of the size of the array, so there's one operation, and Line 5 is run exactly one time as well (on the loop iteration when we find the token), so now we're up to 2 operations:

2

lines 3 and 4 will be run once for every item in the array (in the worst case), so that's 2 operations occurring n times, added on at the front:

2n + 2

line 6 is run for every element in the array except the last one when it finally finds the token, so that's 1 operation occurring (n-1) times:

2n + (n - 1) + 2

which simplifies to:

3n + 1

So the naive iterative search runs in O(n) [remember, big-O notation is the upper bound/worst case; if the token were found near the beginning of the array, it would finish much quicker]. You may want to observe as a side note here that the FASTEST sorting algorithm we've looked at so far (merge sort) runs in O(n log2 n), which is actually slower than our slowest searching algorithm. Sorting is comparatively expensive; that's just a good thing to know when working with large collections.

Now, you probably have guessed from the title of this post that there is a faster way to accomplish all this searching, and even if you've never heard of "Binary Search" before, I guarantee you already know how it works. To prove it, let me ask you how you might look up the last name "Johnson" in the phone book. Would you start at the beginning and read through every name until you found the one you were looking for? Of course not! You'd just open the book to about the middle and see where you were. Based on the page you opened to, you'd say either "I'm too far, I need to go back to the left" or "I'm not far enough in, I need to continue somewhere to the right". In that one action, you've eliminated half of the set of names you have to dig through. That action, performed recursively, is binary searching. See the pseudo-code below:


define binary_search(array, token, low, high)
  if (high < low)
    return -1
  end
  mid = (low + high) / 2
  if (array[mid] > token)
    return binary_search(array, token, low, mid-1)
  else if (array[mid] < token)
    return binary_search(array, token, mid+1, high)
  else
    return mid
  end
end


If you run the benchmarking I've already done for you in my searching code on github, you will see empirically that this is MUCH faster than our original iterative search. But, if we want to do this right, we should go ahead and analyze the binary search algorithm so we can give a mathematical comparison between the two.
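
For anyone who would rather not clone the repo, here's a minimal Ruby sketch of both searches plus a quick timing comparison using the standard Benchmark library. It is not the exact harness from github, just enough to see the gap for yourself:

require 'benchmark'

def linear_search(array, token)
  array.each_with_index do |value, i|
    return i if value == token
    return -1 if value > token # the array is sorted, so we have passed it
  end
  -1
end

def binary_search(array, token, low = 0, high = array.size - 1)
  return -1 if high < low
  mid = (low + high) / 2
  if array[mid] > token
    binary_search(array, token, low, mid - 1)
  elsif array[mid] < token
    binary_search(array, token, mid + 1, high)
  else
    mid
  end
end

sorted = (0...10_000).to_a
Benchmark.bm(8) do |bm|
  # worst case for the linear search: the token is the last element
  bm.report('linear:') { 1_000.times { linear_search(sorted, 9_999) } }
  bm.report('binary:') { 1_000.times { binary_search(sorted, 9_999) } }
end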

First, in order to analyze binary search, we need to produce a recurrence that describes it. Although we did this in a pretty lengthy way in my post on merge sort, the quick way to do this is to ask "what are the recurrent subproblems" and "how much work is done non-recursively in each call". For binary search, this is very easy. As you can see from the pseudo-code above, binary search will decide whether to go to the left or to the right half of the array based on the element it pulls from the middle, so on each successive call it will be considering half of the previous array (n/2). Unlike merge sort, which performed a recursive call on BOTH halves of the array [2T(n/2)], binary search only considers one or the other [T(n/2)]. So the beginning of our recurrence looks like this:

T(n) = T(n/2)

Then the only question we have to ask now is "how much work is being done non-recursively in each recursive call"? Remember, in merge sort we were merging both halves of the array together, so we had to do one operation for each member of the original array, giving us "n" work per level. In the case of binary search, we just do the same 3 comparisons regardless of how many elements there are in the array, so there is always some constant "K" of work being done.

T(n) = T(n/2) + K

Pretty simple. Now, we could use substitution and try to guess what the right answer is (this one is simple enough that it would not be too hard to guess "log n", since the problem set halves every time). If we weren't sure of that, we could use a recursion tree, like we did in merge sort, to show us that we were doing "K" work over "log n" levels, giving us "K * log n" which in Big-O notation would be "O(log n)", and then we could use substitution to prove that log n was a reasonable upper bound. However, the simplicity and form of the algorithm means this is a great time to introduce the "master theorem", a quick way to solve recurrences of this type.

In the lecture series I'm listening to, they made the master theorem sound very complex (it could also be because it's all audio, so I couldn't see the blackboard). After listening to the same section 3 or 4 times, though, it began to click. Here's the process:

The master theorem can only be used to solve recurrences of a certain type:

T(n) = aT(n/b) + f(n)

Now, don't balk just because it's a multi-variable equation; it's really quite simple. "a" is just the number of subproblems (recursive calls). In merge sort, we had 2 of these because we called merge_sort once on each half of the original array. For binary search, "a" is 1, as we just call search again on one half of the array or the other. It must, logically, be at least one, otherwise you wouldn't need to be using a recurrence because it wouldn't be a recursive algorithm. "b" is what's used to reduce the size of the problem set on each recursive call. In both merge sort and binary search, the problem set is being divided in half for each recursive call, so "b" is 2. Logically, "b" must be greater than 1, because if it is 1 then the problem set is not getting smaller on each go-around, and it will recurse infinitely (or until you run out of memory, pragmatically). "f(n)" is just some function describing the amount of work done non-recursively in each call. As we said a moment ago, for binary search there is a constant amount of work being done for each recursive call, so "f(n)" would be "K" (some constant). Remember that for merge sort, we must merge the two arrays together, which is done in O(n) time, so the work is "n". Looking again at our recurrence for binary search:

T(n) = T(n/2) + K

We can see that binary search does indeed meet the qualifications for the master theorem. It has an "a" of 1, a "b" of 2, and an "f(n)" of K. Now, this is where it gets scary if you are intimidated easily. There are three cases of the master theorem, and you must decide which one your recurrence falls into. Fortunately, it's easy to remember them because they're all concerned with the relation of "f(n)" to the quantity "n^(logb a)" (it's hard to show complex equations in a blog; that last quantity, in English, is "n to the log base b of a"). Although I may be oversimplifying: if f(n) is less than "n^(logb a)", you're in case one. If f(n) is roughly equal to "n^(logb a)" (multiplied by log n to some power k, where "k" can be 0), you're in case two. And if f(n) is greater than "n^(logb a)", you're in case three. Each one of the cases basically tells you what the answer is.

Case 1 => T(n) = O(n^[logb a])
Case 2 => T(n) = O(n^[logb a] * (log n)^(k+1))
Case 3 => T(n) = O(f(n))
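
Since plain text makes this hard to read, here are the same three cases written out in standard notation. This is the simplified form of the theorem (the real statement attaches an extra regularity condition to case 3, which I'm glossing over), and note that it actually gives tight Theta bounds, which are at least as strong as the big-O statements above:

% T(n) = a T(n/b) + f(n),  with  a >= 1  and  b > 1
\begin{align*}
\text{Case 1: } f(n) = O\big(n^{\log_b a - \epsilon}\big) \text{ for some } \epsilon > 0
  &\;\Rightarrow\; T(n) = \Theta\big(n^{\log_b a}\big) \\
\text{Case 2: } f(n) = \Theta\big(n^{\log_b a} (\log n)^k\big) \text{ for some } k \ge 0
  &\;\Rightarrow\; T(n) = \Theta\big(n^{\log_b a} (\log n)^{k+1}\big) \\
\text{Case 3: } f(n) = \Omega\big(n^{\log_b a + \epsilon}\big) \text{ for some } \epsilon > 0
  &\;\Rightarrow\; T(n) = \Theta\big(f(n)\big)
\end{align*}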

Now a full proof and analysis of this theorem is beyond the scope of this blog post. A good resource for more information on the master theorem is HERE. All I want to do is apply the master theorem to my binary search recurrence. So, first I need to compare f(n) to (n^[logb a]) so I can decide which case to use. Remember, we said earlier that our "f(n)", or "work done non-recursively on each call", was some constant "K". Now I need to calculate (n^[logb a]) so I have something to compare it to. For binary search, b = 2, and a = 1, so we're calculating "n to the log base 2 of 1". What power do we need to take 2 to in order to produce 1? 0, of course, so we're looking at n^0, which is 1.

So which case do we fall into? Is K less than, roughly equal to, or greater than n^0? Well, K is a constant and n^0 is a constant (1), so they're roughly equal. Looking back, we see that "roughly equal" is defined as equal to "n^(logb a)" (multiplied by log n to some power k, where "k" can be 0). We don't really need that "log n to some power" in order to be roughly equal, so we'll use 0 as our "k" (little k, not to be confused with the big K we're using to represent our "f(n)"). Looking at the above statement of the answer for Case 2, I know that our runtime is:

O(n^[logb a] * (log n)^(k+1))

We already showed that n ^(logb a) in our case is n^0, which is 1, so we can make this:

O(1 * (log n)^(k+1))

and of course 1 is the multiplicative identity so:

O((log n)^(k+1))

and we're using 0 for our little k, so:

O((log n)^1)

and raising to the power of 1 changes nothing:

O(log n)

So binary search runs in O(log n), MUCH faster than the O(n) we have for our linear search above.

So there you have it. We've now covered all three methods for solving recurrences, and we're successfully analyzing and comparing algorithms. Pretty neat, huh? Nothing like a little academic exercise to get your mind going in the morning.

As I've mentioned before, I'm implementing each of these algorithms as we discuss them and the code (and the harnesses that benchmark them) are available on github. Also, don't forget that I'm learning all of this material off of iTunesU, and you can too. There are hundreds of interesting courses from top schools available online, for free. It's an incredibly rich opportunity for self-improvement, and there's no reason not to take advantage of it.

Merge Sort

In my last post, I mentioned that I've started listening through an online lecture series on algorithms, and that I'm recording what I'm learning here both to solidify it in my memory, and to try to present the information in a non-intimidating way so that others who share my lack of formal computer science background (and passion for software) can benefit.

Last time, I demonstrated the "insertion sort" algorithm, and then gave a worst-case analysis that showed that insertion sort performs "n^2 + 2n - 2" operations to sort an array, meaning that it scales at O(n^2). This is an interesting piece of information in and of itself, but it becomes USEFUL when you can compare insertion sort to other sorting algorithms to see which one is a better choice for your present needs. To conduct such a comparison, I'll be showing the "merge sort" algorithm in this post, conducting an analysis, and then comparing it to the insertion sort analysis done in my last post.

Merge Sort

Merge Sort, unlike insertion sort, is a recursive algorithm (if you don't know what that means, the short version is that it's a method that calls itself. Google for more detail if necessary). It has two paths on each call:

1) if the input array has a size of 1 or 0 elements, it's already sorted, and can be returned.
2) if the input array has 2 or more elements, divide the array in half and call merge_sort on each one of them, then merge the two results together.

Pseudo code for this algorithm is below:

define merge_sort(array)
  if array.size <= 1
    return array
  else
    split_index = array.size / 2
    first_half  = merge_sort(array[0...split_index])
    second_half = merge_sort(array[split_index...array.size])
    return merge(first_half, second_half)
  end
end

define merge(left, right)
  result = []
  while length(left) > 0 and length(right) > 0
    if first(left) ≤ first(right)
      append first(left) to result
      left = rest(left)
    else
      append first(right) to result
      right = rest(right)
    end
  end
  while length(left) > 0
    append first(left) to result
    left = rest(left)
  end
  while length(right) > 0
    append first(right) to result
    right = rest(right)
  end
  return result
end
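
If the pseudo-code is awkward to read, here's the same algorithm in plain Ruby. This is a quick sketch in the spirit of what's in the github repo, not a copy of it:

def merge_sort(array)
  return array if array.size <= 1
  split_index = array.size / 2
  merge(merge_sort(array[0...split_index]),
        merge_sort(array[split_index...array.size]))
end

def merge(left, right)
  result = []
  # repeatedly take the smaller of the two head elements
  until left.empty? || right.empty?
    result << (left.first <= right.first ? left.shift : right.shift)
  end
  # one list is now empty; tack the remainder of the other onto the end
  result + left + right
end

merge_sort([5, 2, 9, 1, 5, 6]) # => [1, 2, 5, 5, 6, 9]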


Ok, so far so good. Now, how do we analyze this? Well, it's tougher than insertion sort, that's for sure. Because of the recursive nature of the function, it's difficult to just start adding up the number of times a line gets executed as a function of the size of the input array. A better place to start when analyzing a recursive function is to write a recurrence for it (a definition of its running time in terms of the running time on smaller inputs). How does that work? Well, let's take a look at the English version of the merge sort algorithm again:

1) if the input array has a size of 1 or 0 elements, it's already sorted, and can be returned.
2) if the input array has 2 or more elements, divide the array in half and call merge_sort on each one of them, then merge the two results together.

So the time it takes to sort an array is equal to the time it takes to sort the left half of the array, plus the time it takes to sort the right half of the array, plus the time it takes to merge the two together. This can be written as follows:

T(n) = T(n/2) + T(n/2) + (time to merge)

Which simplifies to:

T(n) = 2T(n/2) + (time to merge)

Now what are we going to do with the "time to merge" statement (the one that combines the two sorted arrays)? Do you see why that's easy to take care of? Maybe it would be easier if you saw it written out, rather than in pseudo code; the merge function works like this:

1) take two sorted lists
2) compare the first element in each list.
3) place the smaller element of the two into the output array (and delete it from the list it came from).
4) repeat until each element has been merged into the output array.

You should be able to intuitively spot that there will be exactly as many iterations as there are elements to merge, so the runtime is (x*k*n), where x = the number of operations in an iteration (which I think is 4, in our case), k = the constant time it takes to execute one operation, and n is the total number of elements in the two lists combined. Thus, in big-O notation, merge has a runtime of O(n) (linear time). Now, to simplify the final equation, we will take the liberty of saying that since each iteration is a constant number of operations, each iteration runs in constant time, and thus it's easiest to say that it takes n "steps" to merge the two sorted lists. This means that our above time calculation can be rewritten like this:

T(n) = 2T(n/2) + n

And this is our "recurrence". Remember, all that means is that the amount of time it takes to merge_sort a list is equal to the amount of time it takes to merge_sort the first half of the list, plus the amount of time it takes to merge_sort the second half of the list, plus the amount of time it takes to merge them together.

Now, how do we take that recurrence and turn it into a formula that represents the growth of the runtime in proportion to n (which is what we need to do if we want to compare it to insertion sort)?

Well, it's not going to be pretty. The only method I know of for solving this (so far) is substitution, and that method basically goes like this:

1) guess the right answer
2) prove it

You can most likely guess that a lot hangs on step one. Now, you may know off the top of your head that merge sort is "O(n log2 n)", but let's pretend that we didn't know it. I might just toss out a guess of O(n^2). What's cool about this is that technically it's correct. Big O notation just tells us the upper bound, like a <= sign, which means that if an algorithm has a runtime of O(n^2), it also fits into the set of O(n^1000), because its runtime will grow proportionately with n at a rate that is LESS THAN OR EQUAL to n^1000. So since we know that merge sort's tight upper bound is n log n, we should definitely be able to prove that O(n^2) is a viable loose upper bound. So, to demonstrate, let's say that I take a guess and say "I think that merge sort will run in less than O(n^2)". Remember, big-O notation removes all coefficients, so what I'm really saying is that the highest order term will be n^2 and it will have some constant C as its coefficient. This means that I'm saying T(n) is actually less than or equal to Cn^2:

T(n) <= Cn^2

This could also be said as follows: "If I pick a C that is big enough, merge sort will ALWAYS execute in no more than C*n^2 operations". Now, how do we prove it? Well, let's look back at our recurrence:

T(n) = 2T(n/2) + n

Since we are trying to prove that T(n) <= Cn^2, let's substitute Cn^2 in for T(n) on the right side of the equation and solve from there. Now, we don't actually have a T(n) on the right side, but we do have a T(n/2), and the substitution trick is to assume our guess already holds for the smaller input, i.e. T(n/2) <= C(n/2)^2. Substituting that in, the following inequality will hold:

T(n) <= 2C(n/2)^2 + n

Which can be rewritten to:

T(n) <= 2C(n^2/4) + n

and the leading 2 cancels against the 4 in the denominator:

T(n) <= C(n^2/2) + n

and to make it easier to read, I can take out the fraction to the front:

T(n) <= (1/2)Cn^2 + n

We're getting close to the inequality we need to prove, which is T(n) <= Cn^2, but we need to get rid of the (1/2) somehow. The simplest solution in this case is to just add (1/2)Cn^2 to the first term, and then subtract the same amount from the back end of the equation, which gives us this:

T(n) <= Cn^2 + n - (1/2)Cn^2

So now we've got Cn^2, but we also have all this trailing junk that's throwing off the inequality. What you need to remember in order to go through this next step is what an inequality really is. Consider the following:

If I asked you to prove that x < 5, but I had already told you beforehand that x < 3, you would have no work to do at all. See, 3 is already less than 5, so anything less than 3 is less than 5 automatically. We have the same property in play here when I rewrite the equation thus:

T(n) <= Cn^2 - [(1/2)Cn^2 - n]

It's going to be tough to show that T(n) is less than Cn^2 by itself, but if we can be sure that the quantity we're comparing it to is ALREADY no more than Cn^2, then we've no more work to do. So how do we make sure that Cn^2 - [(1/2)Cn^2 - n] is no more than Cn^2? Well, X minus any non-negative number is no more than X, so as long as everything inside of those brackets comes out to a number that is >= 0, we know our whole right side is no more than Cn^2, and we've proved that O(n^2) is a reasonable loose upper bound for T(n). So to prove this, we just need to set up the following inequality:

(1/2)Cn^2 - n >= 0

And now we solve:

(1/2)Cn^2 - n >= 0
(1/2)Cn^2 >= n
Cn^2 >= 2n
Cn >= 2   (dividing both sides by n, which is fine since n >= 1)

So now we see that as long as C is greater than or equal to 2, the quantity [(1/2)Cn^2 - n] will be greater than or equal to 0, which means the entire quantity Cn^2 - [(1/2)Cn^2 - n] will be no more than Cn^2, which means that we've shown O(n^2) to be an upper bound for merge sort. Note carefully here that we have not shown that it is the tightest upper bound, only that it is AN upper bound (as are O(n^3), O(n^4), O(n^5), etc). All we've said is that T(n) is guaranteed to be no more than Cn^2 if C is at least 2:

T(n) <= 2n^2

So that's an interesting piece of information. However, it's not super useful. Remember, we want to compare merge sort to insertion sort. Insertion sort has a worst case operation count of "n^2 + 2n - 2", and the upper bound we've just proved for merge sort (2n^2) is much bigger than n^2 + 2n - 2 for large n. This lets us know that we haven't proven a very tight upper bound, because empirically, if you benchmark insertion sort against merge sort for n = 10000, merge sort actually performs at least two orders of magnitude BETTER than insertion sort (download my example implementations from github and run them yourself for proof).

We need to be able to prove a tighter upper bound, and it would be helpful if we could do it without guessing.

The way that the MIT course recommends finding out what sort of bound we should be looking for involves using what's called a recursion tree to visually picture the algorithm and determine the time spent based on it. For merge sort this is fortunately very simple, which is good, because it's hard to convey something as visual as a recursion tree on a blog.

The idea is this: We want to compute the amount of time it will take to perform T(n) (the full sort). T(n) is defined as the sum of T(n/2) [sorting the left side] plus T(n/2) [sorting the right side] plus the n work that is done non-recursively (that's the merging part).

If I were to draw this in a tree, for an 8 element array, it would look something like this:

                               T(n)
                          /           \
                   T(n/2)               T(n/2)
                  /      \             /      \
             T(n/4)    T(n/4)     T(n/4)    T(n/4)
             /   \      /   \      /   \      /   \
        T(n/8) T(n/8) T(n/8) T(n/8) T(n/8) T(n/8) T(n/8) T(n/8)

In order to determine the amount of work done, you find the amount of work done on each level and sum it up. This is easy in the case of merge sort, because if you look at each level you will notice a pattern. The root has to merge 8 elements, which is our n, so level one does n work. Level two has two nodes, each one working on 4 elements (n/2 each), so the amount of work done on that level is n/2 + n/2, which is n. Level three has 4 nodes, each one working on 2 elements (n/4 each), so the total work is n/4 + n/4 + n/4 + n/4, which again is n. See the pattern? The sum of work on each level is n, so in order to find the total we just need to know how many levels there are and multiply that number by the work done on each level (n). Since we see that an array of 8 elements has 4 levels, 16 elements would give 5 levels, and 32 would give 6 levels, we can tell immediately that the recursion tree for merge sort will have (log2 n + 1) levels. What does that give us?

n(log2 n + 1)
n log2 n + n

which is in big-O notation:

O(n log2 n)

Alright! Now we're getting somewhere! Now, we could use substitution to prove beyond a shadow of a doubt that this is a valid upper bound, but for the purposes of this blog post it's unnecessary. What's important to us now is the ability to compare one algorithm's runtime to another. We saw before that insertion sort performs "n^2 + 2n -2" operations, which in big-O notation would be O(n^2). We just showed that merge sort grows at O(n log2 n). Which one is better for a large array? Merge sort, clearly, and the bigger the array the more true it is. On a small set it hardly matters (in fact, on very small sets insertion sort is actually faster than merge sort), but the more elements you add the greater the disparity between the two.
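
To put rough numbers on that, take an array of n = 10,000 elements and ignore constant factors (so this is only a ballpark comparison):

\begin{align*}
\text{insertion sort (worst case): } & n^2 + 2n - 2 \approx 10{,}000^2 = 10^8 \text{ operations} \\
\text{merge sort: } & n \log_2 n \approx 10{,}000 \times 13.3 \approx 1.3 \times 10^5 \text{ operations}
\end{align*}

That's a ratio of roughly 750 to 1, which lines up with the couple-of-orders-of-magnitude difference the benchmarks show.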

That's it for today, so I'll use this last paragraph to remind everyone reading that I'm approaching this subject not as an expert, but as an amateur. I want to get better, so I welcome your corrections if you find fault in the above, but please don't think that I'm maliciously trying to mislead anyone or that I believe I'm some sort of authority; any mistakes in the above are the honest errors of an apprentice.

Again, code for everything I've covered in this series can be obtained from github:

http://github.com/evizitei/ruby_algorithms/tree/master

Monday, December 29, 2008

Insertion Sort

I've discovered a really fun new pastime this week as I've been looking for ways to increase my facility and skill in programming without necessarily spending any money (cash is short in the startup world). For those of you who don't know about this great opportunity, go check out iTunesU immediately. Essentially, it's got real lectures from top universities on pretty much any topic you could be interested in (many in video format as well as audio), and all the content I've run into so far has been free. Now, is it as good as taking the class yourself? Probably not, as you obviously don't get college credit and you can't ask your professor questions, but for my purposes (pure self-improvement) and with my research ability (I'm pretty sure I can find the answer to any questions I might have online somewhere), I feel like it's something worth my time.

Currently, I'm listening to the first lecture in an MIT course on algorithms, and as an exercise to see if I'm retaining the information at all, I decided to do a little report here on my blog regarding what I learned in that short lecture recording. If it all comes out well, I'll continue throughout the series and see if I just might make a computer scientist out of myself yet. ; )

If you want to brush up on your familiarity with algorithms, or if you're an expert and would like to critique my presentation of the information for accuracy and clarity [I'm looking at you, (cadr life)], I welcome you as a reader.

Is performance important?

The professor for this course began by asking this question, but in the form of "what is MORE important than performance?". Many answers came to the front: usability, features, maintainability, security, and a host of others. The question is then raised: since all these things are often desired above performance, should we care about learning algorithms at all? Of course, since the professor probably wants to give the other 24 lectures in the series, the answer was "yes". He says it's because good performance gives you the currency you need to spend on other necessary aspects of the software. I'm sure there's a long discussion that could be had about which aspects of improving performance are and aren't important for your software, but for the sake of moving on to the real meat of the course I will avoid spending any unnecessary energy on fleshing out that discussion. Instead, I want to proceed directly to the first example of an algorithm given and the analysis conducted upon it.

Insertion-sort

So far I'm not out of my depth; insertion sort isn't an unfamiliar algorithm to me. You pull elements off the unsorted portion of the array (which initially is the entire array), and insert them at the correct point in the sorted portion of the array (which starts out as none of the array, but grows by one element with each iteration). Some non-standard over-simplified pseudo-code follows:

0  define insertion_sort(A)
1    for i = 1 to length[A]-1 do
2      element = A[i]
3      j = i-1
4      while j ≥ 0 and A[j] > element do
5        A[j + 1] = A[j]
6        j = j-1
7      end
8      A[j+1] = element
9    end
10 end
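
Here's the same thing in plain Ruby for anyone who wants to poke at it in irb. It's a sketch along the lines of the github code, not a copy of it:

def insertion_sort(a)
  (1...a.size).each do |i|
    element = a[i]
    j = i - 1
    # shift larger sorted elements one slot right until we find element's spot
    while j >= 0 && a[j] > element
      a[j + 1] = a[j]
      j -= 1
    end
    a[j + 1] = element
  end
  a
end

insertion_sort([4, 2, 7, 1]) # => [1, 2, 4, 7]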

So, how do we analyze this algorithm's runtime in a meaningful way so that we can compare it to other algorithms that accomplish the same thing? (This is where my lack of a formal background shows, so correct me where necessary, but be nice.)

As I understand it, one good way to analyze and compare algorithms is to first assume that each "operation", or "step", takes some constant time, and then count how many times each operation in the algorithm is executed in the worst case scenario (i.e., operating on the input that would cause it to perform the most work). What follows is my attempt at performing this analysis on the insertion sort pseudo-code above:

OK, line 0 is a method definition, so we can discard that as part of the discussion. Next is line 1, which is really only executed once (since the loop is not redefined on each iteration). So far, our analysis looks like this:

L1

Now, I think I can look at lines 2,3, and 8 all on the same level, since they are each executed once for each iteration of the outer for loop. Since that loop starts with the second element in the array (index 1) and moves through each element, each of these lines will be executed "n-1" times ("n" being the number of elements in the array to be sorted). So including each of these in the equation, our analysis now looks like this:

L1 + (n-1)*(L2+L3+L8)

The only lines we have left to include in the analysis are lines 5 and 6, but this part is the trickiest of the whole process. You see, those lines are only run as many times as it takes to find the right spot to insert the current element being considered. If that element is smaller than all the elements in the already sorted array, it will have to iterate through every sorted element to arrive at the correct insertion point, but if the current element is BIGGER than all the sorted elements, then those lines won't even get executed once on that particular run through the outer for loop. Thus, in an average scenario, you'll have a variable number of executions for that particular segment. However, remember now that we're trying to consider the worst case scenario. That means that we should assume that the list begins sorted in reverse order, and that each element will always have to be inserted at the beginning of the already sorted portion of the array, thus requiring a full traversal of the sorted portion on each iteration of the for loop. This means that those lines will be run 0 times on the first for loop iteration, then once, then twice, then three times, and so on for each iteration. So on the LAST iteration of the for loop, they will be run "n-1" times (one less than the full size of the array). So now we're looking at the following for our analysis:

L1 + (n-1)*(L2+L3+L8) + (L5+L6)*(0+1+2+...+(n-3)+(n-2)+(n-1))

Now this probably looks rather intimidating, but it can be simplified rather easily. First, since we said earlier that we were assuming all operations took some constant time, I can replace each of the line numbers with a "k". This gives us the following:

k + (n-1)*(3k) + (2k)*(0+1+2+...+(n-3)+(n-2)+(n-1))

That's a little better. Next, if you've had any higher math, you'll know that the arithmetic series at the right of the equation (the summation of 1 to n-1) can be simplified as well. I seem to remember that the sum of 1 to n can be written as follows:

(1+n)*(n/2)

if this is the case (and I just looked it up again, so I know that it is), then if we only wanted the sum up until "n-1", not all the way to n, all we need to do is subtract the final term in the series (n):

[(1+n)*(n/2)] - n

I actually prefer it to be multiplied out, so I would write it this way:

[(n+n^2)/2] - n

which gives us the following formula for the length of time in our worst case insertion sort as a function of the size of the input array:

f(n) = k + (n-1)*(3k) + (2k)*([(n+n^2)/2] - n)

Still sort of big and ugly, but stick with me here. The next step towards simplifying it is realizing that we don't want to know the actual runtime; we just want to be able to compare its runtime to the runtime of similar algorithms. We want to be able to relate it to other algorithms using a common measuring stick. As I understand it, this means we aren't looking for "amount of time" so much as "number of operations". Since k is one operation in our equation, what we are really asking is "how many k's will occur in the worst case scenario?". This means we can move the k constant out of the equation ("divide both sides by k", if you want to think about it that way) and just look at the whole thing in terms of n:

f(n)/k = 1 + 3(n-1) + 2([(n+n^2)/2] - n)

which can be multiplied out to be:

f(n)/k = 1 + 3n - 3 + n + n^2 - 2n

And now we can combine like terms:

f(n)/k = n^2 + 2n -2

Hey! That's a quadratic equation! That's not so bad at all! (Check my math on this, BTW; I'm no pro and have no easy way to validate my answer). Now I can say that for any array of size "n", insertion sort (in the worst case) will execute "n^2 + 2n -2" operations to sort it, and I can compare that function to similar functions for other sorting algorithms to get an idea of what's faster. See? This stuff isn't nearly as hard as it seems.

This would probably be as good a time as any to mention that you needn't go through this whole process when analyzing an algorithm, as the most significant factor in the final performance appears to be the highest order term, with disregard for constants or lower order terms. Thus, in the common Big O notation, insertion sort's worst-case analysis is really just "O(n^2)".

Well, this post has probably gone on for long enough. Check back soon, as I'm having a blast with this and hope to continue down this path for a few posts, and I really encourage everybody to go check out iTunesU, there's a ton of interesting material there.

By the way, I had to play around with some code while I was doing this to make sure I understood the sorting algorithm correctly. If you'd like to see my implementation, you can check it out on github here:

http://github.com/evizitei/ruby_algorithms/tree/master

Tuesday, December 23, 2008

Merry Christmas to Me!

It's funny how my RSS reader structured things for me today. The first article I picked up was one from a developer's blog where he talked about how Rails was slowing down in its evolution and was beginning to frustrate him as it failed to reinvent itself in the same exciting ways it did when it was cutting edge. I could sympathize with that to a point. After all, that's a natural consequence of having more and more people involved with and dependent on the framework in question. It was humorous to me, though, that the very next article I looked at was the announcement that Merb 2.0 and Rails 3.0 would be the same entity. Woot!

Merb was built after Rails and learned a lot of lessons from its development, and now those lessons are going to be brought back to their source as all the performance optimizations and component-agnosticism that make Merb such a great framework are brought into Rails to allow the old workhorse to make a major leap forward. Looks like the beta will be attempting a release around May, but I'm sure that one way or another we'll have a functional version to play with by the end of next year. Hooray for cooperation!

Ramen Profitable

Big news from the startup I'm working on: we've hit a major milestone. After about 8 months of work, we've finally reached the level of income that Paul Graham describes as Ramen Profitable. Yes, this month, for the first time, we'll actually take in slightly more money than we spend. No telling if this will continue steadily, but it's a very encouraging sign. Theoretically, as long as we contain our expenses, this gives us the runway we need to stay alive while we continue to bring on new customers and realize the full potential of the ones who are already on board with us. So in my house, we'll be celebrating more than just the holidays this year. ; )

Branching Out

Just a one time announcement for readers who have an interest in other areas of my life besides technology: I started a second blog yesterday on blogger that will be following my attempt to become a member of the local volunteer fire department. If that kind of thing interests you, head on over and take a look.

Tuesday, December 16, 2008

How to advance medical science while you sleep

A friend of mine recently sent me a link to BOINC. If you're familiar with the program SETI@home, then you already know about this middleware. Basically, BOINC is the framework that allows you to donate your unused computer time (via a screensaver that kicks in when you aren't working) to the CPU-intensive project(s) of your choice. That's right, you can contribute to furthering science just by not using your computer from time to time (if you're one of those people who requires sleep, this is particularly easy, since your computer's processor already sits idle for about a third of every day, time you can give away to needy science projects).

Personally, I have all three of my computers running Rosetta@home. The full description of what they do is here, along with the medical relevance of the project, but the short version is as follows:

Proteins are made up of amino acids. There are 20-some amino acids used in the construction of proteins, and each one of them has a different shape and exerts different forces on its surroundings. The forces exerted by the amino acids cause the protein (which could be thought of as a chain of amino acids) to take on a particular shape (which they call a "fold"). The basic goal of Rosetta@home is to run through many likely folds for each protein, trying to produce the one with the lowest energy, which is the greatest predictor of the protein's true shape. The results of this research can potentially be applied to problems ranging from malaria, to cancer, to HIV. As you can imagine, this is a CPU-intensive project, so distributed computing works quite well.

Hence, it's a good cause, and you've got a computer that you don't use 24 hours a day anyway. Why not give?

Thursday, December 11, 2008

Internet Explorer and the Z-Index (or, how I wasted a whole 3 hours of my life)

As my most recent post mentioned, I frequently get frustrated over small inconsistencies between the way Firefox and Safari render my projects and the way they look in Internet Explorer. However, as any pragmatic web developer will tell you, IE ain't leaving anytime soon, so I didn't have the option of just telling all my users to switch browsers (though it would have pleased me greatly). Fortunately, IE8 looks like it will be fixing a lot of the quirks that IE has had in the past. UN-fortunately, IE6 is still hanging around on many corporate/government/home systems as a plague to all developers, even a couple years after the release of IE7, which means that if we follow the pattern, we won't be able to rely on Windows users having IE8 until a full decade after its release. So, until that happy day arrives, here's a little tip that may save someone out there time when they are having a bad day with IE.

z-index

I've got this layout I've been working on for my startup. It has a header, and three vertical columns on the page (should seem familiar, this is a very common layout scheme). At one point about halfway down the page, I have this really large element in the center column that expands a bit over the borders of the column it's in. That's totally intentional, because my design called for this large element to overlap and stand in front of the edges of elements in both of the side columns, making it look kind of like the top-most magazine clipping in a child's collage.

Now, the way I was achieving this effect was through the use of the "z-index" CSS property. If any of you readers are software developers who are not familiar with HTML/CSS, the "z-index" is basically an integer that you use to guide the browser into placing elements "on top" of each other (from the perspective of the viewer: the x-axis runs left to right, the y-axis runs top to bottom, and the z-axis runs front to back). The higher an element's z-index, the closer it is to the viewer, or the closer to the "top of the stack" it is. For the sake of example, let's say I gave the elements in the left and right columns of my layout all a z-index of "1", and then gave my large center element a z-index of "2". In Firefox and Safari, this works no problem. Open it in IE, and God only knows what will be on top of what.

Now, I had no idea what was going on when I saw this, and it took quite a bit of research for me to figure out what the problem was. Aren't you lucky that you came across this blog post before YOU had to do that kind of research? I'll answer that for you: "Yes, you are very lucky". I can say this confidently because the answer is so stupidly simple that you would have cried if you had spent hours trying to find it (I know I did).

Internet Explorer (even version 7) does not respect the z-index property when the elements in question have different parent elements. In my case, the "div" tags that represented my three columns didn't have z-indexes assigned to them. The elements in the center column were children of one div tag, the elements in the left column were children of another, and the elements in the right column were children of a third div tag, none of which had z-indexes assigned to them. Basically, it seems IE only uses the z-indexes of children with the same direct parent to calculate positioning. The fix? I gave my columns z-index properties ("1" for each of the flanking columns, "2" for the center) instead of the elements inside of them. Since the columns were siblings, IE was finally able to figure out what was on top, and now it looks fine in all three browsers. So there you have it, a special fix for a special browser.
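
To make that concrete, here's a stripped-down sketch of the before and after. The class names are made up for illustration, and note that z-index only takes effect on positioned elements, which is why everything here gets position: relative:

<!-- Before: z-index set on children that live under different parents.
     Firefox/Safari stack them as you'd expect; IE6/IE7 do their own thing. -->
<div class="left-column">
  <div class="clipping" style="position: relative; z-index: 1;">...</div>
</div>
<div class="center-column">
  <div class="big-feature" style="position: relative; z-index: 2;">...</div>
</div>

<!-- After: stack the sibling columns themselves, and IE falls in line. -->
<div class="left-column" style="position: relative; z-index: 1;">...</div>
<div class="center-column" style="position: relative; z-index: 2;">...</div>
<div class="right-column" style="position: relative; z-index: 1;">...</div>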

Tuesday, December 9, 2008

UI Pain

Time for a mini-rant.

Every time I get something working just the way I want it: looking a certain way, positioned a certain way, colored a certain way. And every time, I end up slapped in the face as soon as I look at it in Internet Explorer. It's like it's actively rejecting my work, telling me that it just isn't content to render content the way the other browsers do (Firefox and Safari, of course). No; instead it requires custom attention and special handling, kind of like a worthless train-wreck of a person strung out on crack who happens to be the nephew of a powerful politician.

Screw it, I guess I've got more work ahead of me.

Scaling Images with GIMP

Here's a UI tip for all of you who are software developers but happen to need to get something done with an image.

As readers of this blog probably know, my primary skill set is software. That's what I've spent the most time learning, and it's what I'm most comfortable with. In previous jobs, when it came to the user interface, I'd have an HTML template handed to me, built by a talented web designer, and then I would make it actually do stuff. But now, working on a startup, we don't have a dedicated UI guy. I AM the UI guy. I'm also the software guy, the server guy, the internal IT guy, and the customer service guy. Because of the position that puts me in, I've learned a lot about web user interfaces in a sort of trial-by-fire pedagogy, and this morning I figured something out that just hadn't occurred to me before, and thought I might share it with other "non-UI guys" who are in the position of doing the best they can.

For a web page I was working on, I had three images that had to be resized way, way down to fit in a banner. I have a passing familiarity with GIMP, so I fired it up and cropped and scaled until it was about the right size. But you know what? It looked terrible. It was so pixelated that you couldn't even tell what it was by just looking at it (a speedboat, for the curious).

Not knowing what else to do, I started redoing the process from the beginning, using slightly different configuration settings, trying to scale down less, saving to different formats, and it just looked bad, then worse. I had to actually THINK for a minute before I figured out what the deal was. Basically, when you scale an image using GIMP, you have the following dialog box to work with:



Notice the resolution settings, which I just hadn't been touching as I scaled. If you think about it, of course you will lose quality as you shrink a picture. Why? Well, if you leave the resolution at a constant 72 pixels per inch and then scale the picture down by half, it can only fit every other pixel from the original image into the new one, so you've lost half the data of the original file. What I did instead was to increase the resolution by the same percentage that I was decreasing the size of the picture. Thus, if I was making it a fourth of the original width, I quadrupled the number of pixels per inch along the X-axis (and likewise matched the ratio of heights to the Y-axis resolution), thus preserving all the data from the original picture as the size shrank. What resulted was a much sharper and clearer tiny picture, which now looks great on our webpage.


Overall word to the wise: I was about to just hire a "professional" to do this stuff for me, and if it were a really complicated layout that I needed, that might have made more sense. But for something simple, it's not too challenging to just sit down and grind out something that will work. If you're smart enough to write good code, I guarantee you that you are capable of understanding HTML/CSS and image editors. At least give this stuff a shot before you outsource it; it might not be as challenging as you think, and when you're bootstrapping, almost any cash savings is a win.

Sunday, December 7, 2008

Cross Training

I spend most of the day, every day, writing RoR code for my startup. It's fun, to a point, but sometimes you need to take a break. I've got a list of hobbies, things I usually do in my free time: rock climbing, piano, etc. But recently I've been into video game development as my down-time filler, and so I've found some new fun things to do in that sector. One of those things is 3D modeling.

Now I'm not going to claim to be any kind of authority on this subject. I've been playing around in Blender for maybe 8 hours total, and my lack of artistic skills is apparent to no one so much as me. Nevertheless, I've quickly come to appreciate just how much work artists have to do in order to make the spectacular renderings I've seen in most of the video games I own.

I think that one of the most valuable skills a person can have is the ability to learn new things. As I've said in previous posts, being an expert in your craft is great, but I'm more interested in a person who's an expert at becoming an expert. And what's the best way to get good at learning new things? By learning new things, of course! So that's why I work on side projects like this. Not only are they enjoyable, but I think they make me better at my job because they give me practice in continuing the evolution of my skillset.

Anyway, I've been having a great time going through tutorials and resources on 3D modeling, and I finally have my own first project that I did without any tutorial or guidance. Yes, I'm proud of it; it's the kind of proud that a 7-year-old feels when he makes a clay pot for his parents not realizing how much more advanced the art of ceramics is beyond his skill level, but proud nonetheless. So here it is, my primitive model of a Colonial Viper Mark II from the re-imagined Battlestar Galactica series:






Now, before all you real artists pounce on me for my amateurism, let's remember that no field would ever be advanced if not for the encouragement of amateurs.

Also, you've been doing it a lot longer than me, so back off, k? ; )

Wednesday, December 3, 2008

SVN to git: it doesn't hurt

I just migrated one of my startup's projects to the git version control system today. I've been meaning to do it for some time, but I kept putting it off. Not because I didn't see the advantages of git over SVN; I did, I've read the blog posts. It's because new things can be uncomfortable. I've been using SVN for a couple of years now, and I know how it works. I've never used git, and that can be a little intimidating.

If anyone out there is in the same place: just do it, it's not that bad. You CAN work with git almost the same way you do with SVN (not that this is necessarily the best way):


SVN:

> svn update
....work on code...
> svn status
....see changes....
> svn commit -m "[message]"



GIT:

> git pull
....work on code...
> git status
.....see changes....
> git add [things you want to commit]
> git commit -m "[message]"
> git push [remote server] [local branch]


Now admittedly, if you just use git this way, you're not taking advantage of everything it has to offer. As I was saying to Ray Myers of Cadr Life today:

Me: I can't believe I waited this long to switch. This isn't that bad at all. It's hardly different!
Ray: Yeah, as long as you use none of the distributed version control features it's just a superior version control system!

Some might ask: if the way you're going to use it is hardly any different, why switch? Well, for one thing, it's faster. That's nice for big commits. More importantly, although the distributed version control model makes little difference when you have one or two developers, a team of 20+ people (hey, I can dream, can't I?) would be better off with git than SVN (for a host of reasons that others have already covered in multiple blog posts and that I won't rehash here. See here if curious). If it doesn't hurt in the beginning, and has a positive effect on scalability, I'm all for it.
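
For what it's worth, the biggest thing the SVN-style workflow above leaves on the table is cheap local branching. Here's a tiny taste, using nothing but standard git commands:

> git checkout -b experiment
....hack away, committing as often as you like....
> git checkout master
> git merge experiment
> git branch -d experiment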

All that being said, let me clarify quickly that I'm not a bigot about this. Some people prefer SVN, and I don't fault them for it. I've used it on plenty of projects and been plenty happy with it. All I'm saying is that if you're one of those guys like I was, who wants to try git out but just hasn't made the time because of the innate fear of the learning curve, go ahead and give it a shot; it'll take much less time than you think.

By the way, if you're looking for an easy way to migrate an SVN repository to a git repo, check out github, they've got an importer that will get you up and running in no time.

Monday, December 1, 2008

12-hour time select for Rails 2.2

The title of this post is lengthy, but that's because I want to capture a good amount of search traffic for this particular topic. Given how much time I spent looking for a solution, others should benefit from this post.

I used the Thanksgiving holiday to spend some time upgrading my application to the latest release of Rails (2.2). What was super-cool about this upgrade for me in particular was the built-in timezone support that arrived with Rails 2.1, something I really needed for my application (we had been running Rails 2.0 and doing the whole TzInfo plugin thing, which worked fine but wasn't nearly as clean as having it baked in like the newer Rails). What was NOT super-cool was the fact that the plugin I was using for building a 12-hour time picker (necessary for us because our users don't like using military time) was broken by the upgrade. I feel like the Rails team did a great job of refactoring the code for generating select HTML in their view helpers, but the changes weren't backwards compatible, so every part of my app that used time selection was back to the default 24-hour time selector, which wasn't going to fly.

Quick note: I am not complaining about the changes to Rails not being backwards compatible. On the contrary, I support it. I think one of the reasons that some APIs become ugly and unpleasant over time is an obsessive concern with not breaking apps built against older versions. Having the freedom to make broad changes to the API if there's a truly better way to accomplish something allows a framework like Rails to grow in a healthy way by cutting out the tumorous growths that turn up over time.

Anyway, the plugin I was using, and the others that I was aware of that served the same purpose, had not been upgraded to work with the refactored DateHelper module. There probably is a plugin out there that has been worked up to this point, but I had enough trouble finding one that I just modded one of them myself in order to get up and running. Credit for the original goes to Bruno (bruno@bopia.com), who wrote the twelve-hour plugin that I modified to get my app working.

Here's the code for the plugin:



module ActionView
  module Helpers
    class DateTimeSelector
      def select_hour_with_twelve_hour_time
        datetime = @datetime
        options = @options
        return select_hour_without_twelve_hour_time(datetime, options) unless options[:twelve_hour].eql? true
        if options[:use_hidden]
          build_hidden(:hour, hour)
        else
          val = datetime ? (datetime.kind_of?(Fixnum) ? datetime : datetime.hour) : ''
          hour_options = []
          0.upto(23) do |hr|
            ampm = hr <= 11 ? ' AM' : ' PM'
            ampm_hour = hr == 12 ? 12 : (hr / 12 == 1 ? hr % 12 : hr)
            if val == hr
              hour_options << "<option value='#{hr}' selected='selected'>#{ampm_hour}#{ampm}</option>\n"
            else
              hour_options << "<option value='#{hr}'>#{ampm_hour}#{ampm}</option>\n"
            end
          end
          build_select(:hour, hour_options)
        end
      end
      alias_method_chain :select_hour, :twelve_hour_time
    end
  end
end


If you want to use this plugin to do the same for your Rails 2.2 app, you can package this into a plugin by doing the following:

1) create a folder called "twelve_hour" under the vendor/plugins directory.

2) put a new folder called "lib" into the twelve_hour directory, and a new file called "init.rb"

3) init.rb should have one line of code:

require 'twelve_hour.rb'

4) create a new file in the new "lib" directory called "twelve_hour.rb".

5) paste the above code into "twelve_hour.rb"

6) start up your server and test.
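
Once the plugin is loaded, I'd expect you to be able to ask for the 12-hour picker by passing the option straight through your time_select call, since the patched method just looks for options[:twelve_hour]. Something like this (the Event model and its starts_at attribute are made up for the example):

<%# anywhere you render a time picker %>
<%= time_select :event, :starts_at, :twelve_hour => true %>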

Hope that helps somebody out there!