Comparing MongoDB and Redis, Part 2

As discussed in Part 1, I reached a certain point with MongoDB, and decided that rather than fussing with things I’d move over to Redis and see how things went. The first thing was to change the model — rather than using the mongoid “:field” definitions, for Redis the model becomes a simple PORO (Plain Old Ruby Object). I chose to borrow a nice initialization technique from StackOverflow here so that I didn’t have to hand-code all of the attributes, but basically my initialize() method just sets the attributes and then creates a Redis connection via @redis = Redis.new. So the changes to the model were easy. The harder part was working out how the relationships between objects would work.
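
To make the PORO idea concrete, here is a minimal sketch of such a model. It is not the original code: the attribute list is an assumption, and the connection is passed in as an argument (rather than created with Redis.new as described above) so that the sketch runs without a server.

```ruby
# Hypothetical sketch of a PORO model; the attribute names are assumptions.
class Book
  ATTRIBUTES = [:number, :title, :price, :authors, :back_references]
  attr_accessor(*ATTRIBUTES)

  # Accepts a hash of attributes so that none of them has to be
  # assigned by hand. The connection is injected here; the post
  # creates it inside initialize via Redis.new instead.
  def initialize(attrs = {}, redis = nil)
    attrs.each do |name, value|
      # Ignore keys that are not declared attributes.
      send("#{name}=", value) if ATTRIBUTES.include?(name.to_sym)
    end
    @redis = redis
  end
end

book = Book.new(:number => "123", :title => "The Ruby Programming Language")
puts book.title
```

Declaring the attributes once in a constant keeps the accessor list and the initializer in step, which is the main point of the technique.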

Rather than the document-style storage of MongoDB, Redis is purely based on key-value, but with more advanced data structures placed on the top. For my purposes, after some fantastic answers from Salvatore (author of Redis) and folks on the redis mailing list, I worked out how to use Sets to access the data in the ways I needed. So let’s say we have three books, ISBN numbers 123, 456, and 789. Book 123 references book 456, and book 789 references both 123 and 456. We have two authors, “Matsumoto,Yukihiro” who wrote 123 and 456, and “Flanagan,David” who wrote 456 and 789. How do we handle this in a key-value store? By using Sets:

  • Create entries for each book, with key pattern “book:<number>:data”. The value is a JSON string of data like title, price, etc. (see below for a note on this).
  • Create a set called “books” which contains the number of every book.
  • Create sets called “backrefs:<number>” that contain the numbers of the books that reference book <number>.
  • Create a set called “authors” which contains all of the authors.
  • Create sets called “author:<name>” that contain the numbers of the books written by that author.
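
The layout in the list above can be sketched for the three example books. This is only an illustration: a Ruby Hash of Sets stands in for Redis, the titles are placeholders, and the author names are stored unescaped for readability.

```ruby
require 'set'
require 'json'

# In-memory stand-ins for Redis (assumption: no server needed for the sketch).
strings = {}                                   # plain key => value entries
sets    = Hash.new { |h, k| h[k] = Set.new }   # key => set of members

books = [
  { :number => "123", :refs => ["456"],        :authors => ["Matsumoto,Yukihiro"] },
  { :number => "456", :refs => [],             :authors => ["Matsumoto,Yukihiro", "Flanagan,David"] },
  { :number => "789", :refs => ["123", "456"], :authors => ["Flanagan,David"] }
]

books.each do |b|
  # Placeholder JSON payload; the real data would hold title, price, etc.
  strings["book:#{b[:number]}:data"] = JSON.generate("title" => "Book #{b[:number]}")
  sets["books"] << b[:number]
  # Each referenced book records which books point at it.
  b[:refs].each { |ref| sets["backrefs:#{ref}"] << b[:number] }
  b[:authors].each do |a|
    sets["authors"] << a
    sets["author:#{a}"] << b[:number]
  end
end

sets.keys.sort.each { |k| puts "#{k} => #{sets[k].to_a.sort.inspect}" }
```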

Using the set operations in Redis, then, I can display all of the books by using the “books” set; I can display all of the books by a given author by using the “author:<name>” set; I can display all of the books that reference a given book by using the “backrefs:<number>” set. In the latter case, you might be thinking that I could just keep an array in the JSON string. Yes, that could work, but I wouldn’t be able to use some of the other interesting set operations, such as intersections to determine references for a given author. Note that right now, since an author is just a name, there actually is no longer any Author model! If I add more metadata about authors in the future, I can add that easily.
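
As a small illustration of the intersection idea, the sketch below uses Ruby's Set class as a stand-in for two of the Redis sets from the example data. In Redis itself this would be an intersection command such as SINTER; the variable names are hypothetical.

```ruby
require 'set'

# Stand-in sets holding the example data from the post.
backrefs_456    = Set["123", "789"]   # books that reference book 456
author_flanagan = Set["456", "789"]   # books written by Flanagan,David

# Which of Flanagan's books reference book 456?
flanagan_refs = backrefs_456 & author_flanagan
puts flanagan_refs.to_a.inspect
```

An array buried inside the JSON string could not take part in this kind of operation without first being read and parsed by the application.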

About that JSON string: this has advantages and disadvantages that I’m still considering. Some would say that every individual attribute (or “column” in RDBMS-speak) should be a separate key-value pair. In that approach, for example, if I have a book title and price, I’d have book:123:title => “The Ruby Programming Language” and book:123:price => “39.99”. Obviously I can then do things like add a book to sets like “Under $50” by adding the price item to the set. The big advantage noted by some is that attributes can be added instantly by just saving a new key. Using a JSON string, adding an attribute requires reading/updating all of the existing keys. On the other hand, it is tidy to have a single key, and working with JSON is easy. For the time being, I’m giving it a try by using “book:123:data” to store the “data” about the book, and separating out certain attributes if it makes sense to use them in other data structures like sets and lists. Is this the best of both worlds or the worst of both? I’m not sure yet.
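
The two layouts can be compared side by side in a rough sketch. A plain Hash stands in for Redis here, and the extra "language" attribute is made up purely to show what adding an attribute involves in each case.

```ruby
require 'json'

store = {}  # in-memory stand-in for Redis string keys

# Layout 1: one key per attribute.
store["book:123:title"] = "The Ruby Programming Language"
store["book:123:price"] = "39.99"
# Adding an attribute is just one more key (hypothetical attribute).
store["book:123:language"] = "en"

# Layout 2: one key holding a JSON string.
store["book:123:data"] = JSON.generate(
  "title" => "The Ruby Programming Language",
  "price" => "39.99"
)
# Adding an attribute means read, modify, write back.
data = JSON.parse(store["book:123:data"])
data["language"] = "en"
store["book:123:data"] = JSON.generate(data)

puts store["book:123:data"]
```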

A quick note here before getting into the code: I did this using the redis-rb plugin, which has a lot of functionality but is definitely lacking in documentation. However, the code is extremely clear and easy to read through, so I strongly recommend reading through it, particularly the main lib/redis.rb file. Using it is pretty much just a matter of installing the plugin and then calling Redis.new.

So, my save() method looks like this:

  def save
    book_key = "book:#{number}:data"
    @redis[book_key] = json_data              # json_data creates the JSON string
    @redis.set_add "books", number            # add to global books set
    if back_references
      back_references.each do |ref|
        # record that this book references book `ref`
        @redis.set_add "backrefs:#{ref}", number
      end
    end
    if authors
      authors.each do |a|
        a = CGI::escape(a)                    # needs require 'cgi'
        @redis.set_add "authors", a           # add to global authors set
        @redis.set_add "author:#{a}", number  # add to this author's set of books
      end
    end
  end

Improvements to be made here include handling the author names in a better way; doing a CGI::escape works, but a proper hash would be better. During prototyping, the escaping is nice because I can go in with the redis client and see human-readable names, but it makes the keys too long in my opinion.
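
One way the “proper hash” idea might look, as a sketch only: the author_key helper and the reverse lookup table are hypothetical, and SHA1 is just one possible digest.

```ruby
require 'digest/sha1'

# Hypothetical helper: derive a fixed-length key from an author name.
def author_key(name)
  "author:#{Digest::SHA1.hexdigest(name)}"
end

# Stand-in for a lookup from the hashed key back to the readable name,
# which is needed because the digest cannot be reversed.
names = {}
["Matsumoto,Yukihiro", "Flanagan,David"].each do |name|
  names[author_key(name)] = name
end

names.each { |key, name| puts "#{key} => #{name}" }
```

The trade-off is the one mentioned above: every key has the same length regardless of the name, but the keys are no longer readable when poking around with the redis client.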

So now the index() action in the Books controller looks like this:

  def index
    redis = Redis.new
    @entries = redis.set_count 'books'
    @pager = Paginator.new(@entries, 20) do |offset, per_page|
      redis.sort('books', { :limit => [ offset, per_page ], :order => "alpha asc" })
    end
    @keys = @pager.page(params[:page])

    @books = Hash.new
    @keys.each do |k|
      @books[k] = redis["book:#{k}:data"]
    end
  end

Here we get a redis connection, and use Paginator to do its thing — we have to get a count of the set, and then we use sort. This is a big part of the magic, and something that took me some time to work out. The sort command in redis (doc here) is the entry point to doing a lot of the interesting operations once you have things in a set. You’ll notice that in the save() method, all I do is add the book number to the set, not the actual key. That’s much more efficient (Redis is especially good with integers), and is enough. In the case above, all it does is call sort on the “books” set, with the “limit” and “order” options — “limit” as shown takes an offset and number of entries to return, which makes pagination a cinch. For “order” you’ll see that I use “alpha asc” which might seem confusing here since we’re dealing with numbers. In my actual use case the “numbers” can have alphanumerics, and I decided to leave this here because it’s a useful variant to see. In reality, the default for the sort command is ascending numeric so you wouldn’t need to even specify the option here.
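
To show what the “limit” and “order” options do, here is a plain-Ruby emulation. It does not talk to Redis; the helper names and sample members are made up, and the alphabetical case assumes a simple byte-wise string comparison.

```ruby
# Stand-in for the members of the "books" set (made-up values).
members = ["9", "10", "100", "25", "3"]

# Emulates sorting with "alpha asc": compare members as strings,
# then take `count` entries starting at `offset`.
def sort_alpha(members, offset, count)
  members.sort[offset, count] || []
end

# Emulates the default sort: compare members as numbers.
def sort_numeric(members, offset, count)
  members.sort_by { |m| m.to_f }[offset, count] || []
end

puts sort_alpha(members, 0, 3).inspect    # first page, alphabetical
puts sort_numeric(members, 0, 3).inspect  # first page, numeric
puts sort_numeric(members, 3, 3).inspect  # second page, numeric
```

The alphabetical ordering puts "10" and "100" ahead of "9", which is why it only makes sense when the “numbers” really are alphanumeric identifiers.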

Once the keys are retrieved, I iterate over each one and get the actual data. This is very quick with Redis, but still not ideal. Redis supports an MGET command to retrieve multiple items in a single command, but it doesn’t return the keys, which would mean I’d have the data but not know which book number each item goes with. The redis-rb library provides a great mapped_mget() method, but at the moment it doesn’t support passing in an array, so I would have to iterate over the keys and build a string of them. Presumably a fix can be made to accept an array, in which case this can all be collapsed down to a one-liner: @books = redis.mapped_mget(@keys). (By the way, in case you’re wondering why @keys is an instance variable, it’s because it contains Paginator metadata like page number, to display in my view.)
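
Since MGET returns its values in the same order as the keys it was given, one possible workaround is to pair numbers and values by position. The sketch below emulates that with a Hash standing in for Redis; the mget helper and the sample data are hypothetical.

```ruby
require 'json'

# In-memory stand-in for Redis string keys (values are made up).
store = {
  "book:123:data" => JSON.generate("title" => "Book 123"),
  "book:456:data" => JSON.generate("title" => "Book 456")
}

# Emulates MGET: values come back in the order the keys were given,
# with nil for a key that does not exist.
def mget(store, keys)
  keys.map { |k| store[k] }
end

numbers   = ["123", "456", "999"]
data_keys = numbers.map { |n| "book:#{n}:data" }
values    = mget(store, data_keys)

# Pair each book number with its value by position.
books = Hash[numbers.zip(values)]
puts books.inspect
```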

Hopefully it’s pretty obvious that showing a book is pretty straightforward:

    book_data = redis["book:#{@book_number}:data"]
    if (book_data)
      @book = JSON.parse(book_data)
    end

Also simple, here’s the code to get the list of books which reference the current book — that is, the books that have the current book as one of their backward references:

    begin
      references = redis.sort("backrefs:#{number}")
    rescue
      return Array.new
    end

That’s pretty easy, isn’t it? Obviously you can add in an “order” option and even a “limit” if necessary. More interesting, here we get the list of authors, with the list of books written by each:

    alist = redis.sort("authors", { :order => "alpha asc" })
    @authors = Hash.new
    alist.each do |a|
      @authors[CGI::unescape(a)] = redis.sort("author:#{a}")
    end

First we do an initial call to sort to get the authors, sorted in ascending alphabetical order (note that this will be a little undependable given my current implementation since the names are CGI::escaped). Then we iterate over each one and do a further sort to get each one’s books. This is fine, but it just returns the number of each book by the author: the key, not the value. Do we have to iterate yet again and do a third call to get the data for each book? Not at all, and this is one of the magic bits of the Redis sort command. Instead of the above sort call, we can ask sort to return the values to us rather than the keys. Using the redis client, the difference is like so:

$ ./redis-cli sort author:Smith%3B+Bob limit 0 5
1. 123456789
2. 465768794
3. 344756635
4. 436485606
5. 347634767

$ ./redis-cli sort author:Smith%3B+Bob limit 0 5 get book:*:data
1. {"title":"My Book","price":"19.99"}
…etc…

The second command, as you can see, adds a “get” option. This is a somewhat magic option that instructs Redis to get the values of the keys matching the pattern provided. So what happens, in a sense, is that Redis does the sort and gets the book numbers. It then plugs each number into the pattern and does a get. So the first sort command is augmented with a “get book:123456789:data” and so on for the others, and the results are returned. This is all done on the Redis side, very quickly indeed. It is, clearly, extremely powerful. So we can change our code to get the data for the list of books, rather than just the keys:
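
The substitution step can be emulated in plain Ruby to show what is going on. This sketch does not call Redis; the sort_get helper and the sample data are hypothetical.

```ruby
require 'json'

# In-memory stand-in for Redis string keys (values are made up).
strings = {
  "book:123:data" => JSON.generate("title" => "Book 123"),
  "book:456:data" => JSON.generate("title" => "Book 456")
}
author_set = ["456", "123"]  # members of an "author:<name>" set

# Emulates sort with a "get" pattern: sort the members, substitute each
# one for the "*" in the pattern, and look up the resulting key.
def sort_get(members, pattern, strings)
  members.sort_by { |m| m.to_f }.map { |m| strings[pattern.sub("*") { m }] }
end

titles = sort_get(author_set, "book:*:data", strings).map do |d|
  JSON.parse(d)["title"]
end
puts titles.inspect
```

The difference in the real thing is that the lookups happen inside Redis, so the application makes one round trip instead of one per book.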

    alist = redis.sort("authors", { :order => "alpha asc" })
    @authors = Hash.new
    alist.each do |a|
      books = Array.new
      a_data = redis.sort("author:#{a}", { :get => "book:*:data" })
      if (a_data)
        a_data.each do |data|
          books << (JSON.parse(data))
        end
      end
      @authors[CGI::unescape(a)] = books
    end

With this, my controller is passing @authors to the view, which is a Hash keyed off the unescaped author names. The value of each entry in the Hash is an Array of book data, where each element is itself a Hash created by the JSON.parse call. In the view, I can do something like this rather silly example:

<% @authors.keys.sort.each do |author| %>
  <% books = @authors[author] %>
  <tr class="<%= cycle("even", "odd") -%>">
    <td><%= author %></td>
    <td>
      <% if (books.length > 0) -%>
        <%= books.length %> :
        <% books.each do |b| -%>
          (<%= truncate(b["title"], :length => 25) %>) |
        <% end -%>
      <% else -%>
        0
      <% end -%>
    </td>
  </tr>
<% end %>

This page simply iterates through the authors, and for each one it displays the number of books they’ve written, and the first 25 characters of each title. If they didn’t write any books, it shows a zero.

There is one problem here, and it’s one that I’m working on a solution for: the “sort” with “get” is very cool, but it returns the value of each entry instead of the key. That means that in the above view, I have access to the book’s title, price, etc — but NOT the number! That’s because the number is embodied in the key. This is obviously a problem, since I need to display the book number. Right now, I’m working around this by storing the number in the JSONified data, but that’s not the right thing to do. Ideally, there would be a way to have the “sort get” return the key along with the data, though I’m not certain what that would look like. Alternately, the app can get the keys, and use them to do an MGET for the data. We’ll see.
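
One possible way out, offered as a sketch rather than a tested solution: as I understand the current SORT documentation, a “get” pattern of “#” returns the element itself, and “get” can be given more than once, so the number and the data could come back together. The snippet below only emulates that behaviour in plain Ruby; the sort_get_multi helper and the sample data are hypothetical.

```ruby
require 'json'

# In-memory stand-in for Redis string keys (values are made up).
strings = {
  "book:123:data" => JSON.generate("title" => "Book 123"),
  "book:456:data" => JSON.generate("title" => "Book 456")
}
members = ["456", "123"]  # members of an "author:<name>" set

# Emulates a sort with two "get" options, "#" and "book:*:data".
# For each sorted member, "#" yields the member itself and any other
# pattern is looked up. Results are grouped per member here for
# convenience; Redis itself returns one flat list.
def sort_get_multi(members, patterns, strings)
  members.sort_by { |m| m.to_f }.map do |m|
    patterns.map { |p| p == "#" ? m : strings[p.sub("*") { m }] }
  end
end

rows  = sort_get_multi(members, ["#", "book:*:data"], strings)
books = rows.map { |number, data| JSON.parse(data).merge("number" => number) }
puts books.inspect
```

With the number merged in after parsing, it would no longer need to be duplicated inside the stored JSON.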

In any case, we’re now able to display the books and the authors, approaching the objects from either direction to access the others. I’ll post more and/or update this post as I experiment further, but I hope this and the first part serve as a useful introduction to people interested in exploring MongoDB and Redis. For my purposes, I plan to continue forward with Redis rather than MongoDB, but as I’ve shown, they’re not at all the same thing — I can easily see cases where MongoDB might be a better fit. It’s clearly worthwhile to do quick prototyping to make sure you understand your problem set, and then see what the best tool is. One of the most exciting things about the so-called “NoSQL” data stores is that developers now have more tools to work with. If I get the time, I hope to play with Cassandra and Tokyo Cabinet to see how they might fit in. It’s always great to have more options in the tool box.

12 comments
  1. Kyle said:

    Really a great set of articles! Very informative.

  2. Bojacob said:

    Thank you. That was really informative. Redis seems to be simpler and makes more sense to me somehow.

    It’s pretty tough breaking out of the SQL shell, but it looks like it’ll have to happen sooner or later.

  3. Great write-up. So the question is, based on your experience if you need to do complex set operations, specifically union and intersections, as well as grouping and aggregation, would you go with Mongo or Redis?

    I have a set of arbitrary data, such as {a:'1', b:'2', e:'7', t:'9'} etc etc and then need to be able to query against the data set with something where ALL or SOME of the elements may or may not match documents inside the db and want to return aggregated, weighted information about the results (i.e. which records are the closest match to what I send grouped).

    Thanks for the thoughts.

    • Glad the write-up is useful. Regarding your question, given your example I’m not quite sure which way I’d go. It sounds like what you’re getting at is a number of sets with data and you want to find the sets that match a query set (e.g. union and intersection) — which Redis does better than any other solution I know of. However, the idea of aggregating and weighting based on the closeness of the match is something I’m not sure about. I’d recommend looking at the sorted sets in Redis to see if you could use that idea to do what you need. If you know ahead of time how you’re likely to slice and dice your data, you may be able to create sorted sets that themselves identify sets, where the scores indicate the closeness. Something to look at, in any case. Good luck!

  4. Pierre said:

    Great post!
    I didn’t know about that feature “that instructs Redis to get the values of the keys matching the pattern provided”.
    Thanks

  5. rycfung said:

    Nice writeup. I do have some questions regarding how you decided to store your data in both Redis and Mongo. I’ve been using Mongo for a while, and am not very familiar with the key-value approach, so please excuse me if my comments are incorrect about Redis.

    In Mongo, it looks like that you’re embedding the whole Authors object within the Book collection. So far so good, it makes a lot of sense if you need to retrieve your authors for any given book. This saves the extra query to the Authors collection by id (essentially a join in relational terms). However, it may not be a reasonable choice to use the embedded Author collection as the source of Author records. Taking the instance where you need to update the information of an Author. You’ll run into the problem that you now have multiple records of the same Author littered all over the Books collection.

    You have three choices here.
    1.) Find all records in the Book collection to update the Author
    2.) Go back to storing just the collection of Authors’ IDs (rather than the Author object) rather than embedding the whole things.
    3.) Use the embedded collection of Authors in the Book collection as a merely “snapshots”

    – The first one is pretty obviously a bad choice, as you’re fetching the whole Books Collection and it’s bad for performance. (* This is a write problem)
    – The second one is also not so great since you’ll be missing the point of embedding objects, since you now have to make extra queries to fetch the Authors of books. (* This is a read problem)
    – The third one is more interesting. The embedded Authors are merely snapshots which you can use at your disposal when you’re trying to get the Authors of the books. However, you also keep an explicit collection of all Authors. Consider this as your master copy of the Authors. Whenever you update your Authors records, you would update the author in this Authors Collection. And you would continue to use your embedded Authors as snapshots. Note that this is good for both write and read, and is my preferred way to design the schema. Now, considering your alternate considerations back in Part 1):

    “The alternative seems to be to make authors a first-level document, and link explicitly with book numbers, which isn’t horrible but means, again, multiple queries to get our list of authors with their books. This was beginning to look like it might be too relational a problem for MongoDB to make sense”

    Solution 3 solves the problem of having to ‘fire multiple queries to get our list of authors with their books’. If necessary, you would also use the same approach to save your Books as embedded objects inside your Authors. Of course, you will need to break the cycle by not saving embedded objects within embedded objects in something like beforeSave().

    On to the Redis approach. You have defined multiple sets to store these data which addresses the indexing problem. You are still able to index by the Author name and Books by having the different sets storing your data with different keys. However, (pls correct me if I’m wrong) I do feel like you have somewhat neglected the database performance in this example. From the way you have implemented the data structures, it looks like you would still require n+1 queries in finding ‘books of authors’ or ‘authors of books’ (i.e. 1 query to find the book, n queries to find the n authors of books, and vice versa). The code probably looks a tad cleaner because the data structure returned by Redis fits snugly into how you would operate on them in Ruby, but the performance doesn’t seem to compare to the MongoDB approach.

    • Thanks for the excellent comment and the good thoughts about the post. It may be time for me to add an update to the post, given the age of these and what’s changed since then! But regarding your comment, you’re correct that your third choice could work, but it does leave you in the somewhat uncomfortable situation of having the same data (authors) in two places. Depending on the use cases it might not matter, but let’s pretend that the system needs to show the author’s name and agent (hypothetically). You’d then want to store both of those fields in the embedded author object, of course. Separately, we store the author with name, agent, city, and other information. Now let’s say the author changes agents. Instead of just changing that data in the author record, we need to change it in every book for that author as well. It’s added complexity and runs the risk of the data getting out of sync if anything goes wrong — basically it’s the same problem as any caching-type system involves. I’m certainly not saying that’s the end of the world, but it’s also not a situation I care for very much. As always, there are trade-offs in anything.

      In the case of Redis I have to say that the info is more out of date, because a lot has happened with Redis since this was written. But even given what’s here, the “sort…get” operator shown in the post demonstrates how it’s possible to get all of the books for a given author in a single command, and that work in Redis is extraordinarily fast — all work in Redis is extremely fast, which is one of its many advantages. It’s pretty much built to manage sets (and lists) and do it very quickly.

      Thanks again for the thoughts and the good ideas.
