Archive

Tag Archives: excelsior

I found myself yesterday needing to parse an Excel file into the database, which I approached first by of course exporting the file into CSV format. The CSV file came out fairly well despite Excel’s CSV oddities, which included putting embedded newlines (\n) into some fields and only sometimes enclosing fields in quotes.

When it came to parsing it, I decided to try the relatively new Excelsior gem (home page here with a link to its github page). Its docs are rather minimal, but I looked briefly at the source and decided to give it a go. Thankfully, it turns out that not only is it super fast, but it also handled the embedded newlines and sometimes-but-not-always field quoting. The only thing I don’t quite like is that Excelsior outputs all of the fields surrounded by double-quotes. I haven’t looked into how easy/hard it might be to make that configurable. In any case, I thought I’d share some bits of my code in case it helps speed up work for other folks.

First, of course, sudo gem install excelsior to get it going. I decided to put my CSV parsing in a helper, in case I do find that I need to swap in an alternative later, so I have /helpers/csv_helper.rb:

  require 'rubygems'
  require 'excelsior'

  module CsvHelper

    #
    # Accepts a file path to a CSV. Uses a CSV reader to process
    # the file one line at a time, and yields the resulting array
    # to the caller's code block. NOTE: The first row will be the
    # header row of the file, which is treated like any other row
    # by this method.
    #
    def self.parse_csv(file_path)
      Excelsior::Reader.rows(File.open(file_path, 'rb')) do |row|
        yield row
      end
    end
  end

Note that I decided to simply have this method do a yield, so the caller can process the row however it wants.

Right now I’ve got the importing code as class methods in a model, to be run via a rake task. As an example:

class Thing < ActiveRecord::Base

  require 'csv_helper'

  def self.load_csv(file_path)
    header_fields = Hash.new
    got_header = false
    CsvHelper::parse_csv(file_path) { |row|
      if (!got_header)
        header_fields = load_header(row)
        got_header = true
      else
        load_row(header_fields, row)
      end
    }
  end

  #
  # Loads the header row from a CSV file, returning a Hash that contains the
  # field name as a key and the integer offset into the row array as the value.
  # This is used later by load_row to access each field in a row.
  #
  def self.load_header(csv_row)
    headers = Hash.new
    csv_row.each_index do |i|
      headers[csv_row[i]] = i
    end
    return headers
  end

  #
  # Loads a row of a CSV file into the database. Expects a hash containing the
  # header field names & indices, and an array containing the row to be processed.
  #
  def self.load_row(header_fields, csv_row)
    my_field = csv_row[header_fields[MY_FIELD_NAME]]
  end
end

So as you can see, the load_csv() method is the entry point, given the path to a file. It invokes the helper’s CsvHelper::parse_csv() method with a code block that’s called for each row. The slightly messy part here is handling the first row differently, since it will contain the field names. So for the first row it calls load_header(), which creates a hash. The keys are the field names, and the values are the array indices. This lets the load_row() method figure out where to find the fields that it’s interested in.

Obviously load_row() above doesn’t do anything as shown (I have to edit out the details from my code), but it illustrates getting the value of a field. MY_FIELD_NAME is a constant containing the field name, such as “My Field”, which is used to look up the index where that field was found in the header row. If it’s 2, for example, then that field is the third one in the row (counting from 0, remember).

Of course if you can guarantee that your CSV will always have the fields in a known order then you can simplify things a lot by using constants and simply throwing away the header row because you won’t need it. In my case I can’t be sure about the field order, unfortunately.

I haven’t gotten to optimizing things yet, so I won’t be surprised if there are ways to improve the efficiency here. But I thought I’d share the code in case it helps someone out.

Advertisements