Wednesday, November 18, 2009

What to do with expensive and schema-less data in Rails

OK, the title for this post definitely needs work, but this is an important topic.

Scenario: I have some invoice that needs to be generated once a month for each customer in my database. Each invoice has a table on it that specifies the amount the customer is being charged for each category of services. incomplete ruby-code follows:

class Invoice
  def data_row(category)
    [category.name,
     category.description,
     self.expenses_for(category),
     self.expense_ratio(category)]
  end

  def invoice_data_table
    Category.all.map{|c| data_row(c)}
  end
end

for the sake of argument, assume that "expenses_for" and "expense_ratio" are both defined and are expensive calculations. Also, the categories list changes from month to month. Now, how do I store this data so that I don't need to re-calculate it if I need to re-print this invoice later?

1) static columns
This is probably the worst one, and the one I see most commonly in large corporate applications. Basically you write 50 new columns in your invoices table, each one looking like this: category_1_expense_ratio, category_1_expense, category_2_expense_ratio, etc, etc.

Retrieval is efficient, but every time the category list changes you have to add new columns to your database, and the code for populating these fields is probably ugly and repetitive.

2) child records in join table
Better, but still somewhat expensive in relation to it's purpose. We add an invoice_category_expenses table, and it's modeled as "category_id, invoice_id, expense_ratio, expenses".

This works, and is more robust in the face of change, but in order to reprint the invoice that means you have to fetch 20 other records besides just the invoice. They're not huge, so it's not a real performance problem, but it does feel wrong to me to grace this flat data table with all the trimmings of a full model class.

3) serialized data-structure

This is what I ended up going with, and I'm maddeningly happy with my decision. For readonly structured primitive data that just needs to be stored for possible re-examination (ie, no other behavior necessary) I don't think you can beat it.

Rather than stuffing it in another table, or making your invoices table a schema nightmare, you just add one more column: category_expenses_data. It's a text column with a large size limitation. Now in your invoice model you write the following:

class Invoice
  serialize :category_expenses_data,Array
end

This specifies to rails that this column is going to be a serialized Array (basically a string representation of an array that can easily be re-instantiated). Now when creating the invoice, you generate the expensive data table, and store it by saying "invoice.category_expenses_data = invoice.invoice_data_table". The expensive calculations are done once and stored in the database. When you need to reprint the invoice, you can just ask for "invoice.category_expenses_data" and it will give you back your array (deserialized from the database) ready to be used to generate your PDF again or whatever.

1 comment:

Aaron Scruggs said...

If the data is expensive & schemaless, is a relational database really the best place to store it? Perhaps S3 or some other type of filesystem would be a better option.