Archive

Tag Archives: mongodb

I recently was provided with an eBook review copy of PHP and MongoDB Web Development from Packt Publishing, by Rubayeet Islam. Being interested in MongoDB, though a bit out of date with my PHP development, I read through the book fairly quickly.

As the title indicates, the author focuses on using MongoDB to provide the storage back-end for PHP web applications. After a short introduction to the basic concepts behind MongoDB, we get a walk-through of installing MongoDB and getting PHP to talk with it, before starting in on building a blog. It’s a safe example, since a blog is a reasonable candidate for a document store like MongoDB. It also provides a way to address one of the big design questions when using MongoDB, which is when to use embedded documents and when to store references. To my mind the question is glossed over a bit too quickly, but it is discussed.

Additional projects like session management and geolocation get a bit off-track, as a lot of time is spent describing the concepts rather than MongoDB, but the meatier sections that get into topics like Map-Reduce (creating a tag cloud) and web analytics are certainly worthwhile. I did feel that the chapter reviewing two MongoDB management tools could have been skipped, since the information will likely be out of date within a couple of months.

Overall, this is a reasonable beginner’s guide, as its subtitle indicates. There’s a great deal of PHP code filling its pages, which will give you a starting point if you need a boost to get going. Reading through it will give you the basics about MongoDB, and a bit more — hints on indexing, optimizing, and Map-Reduce will keep you running. A lot of the information felt cursory, and I would have appreciated more depth, but that’s probably just me wanting more than a beginner’s introduction. Perhaps more relevant were my concerns about the copy-editing and grammar. I didn’t notice any actual errors, but the grammar is quite rough and it made me wonder. It may make me old-fashioned these days, but I still expect my books to be well-edited and grammatically correct. The issue didn’t get directly in the way of the information to be had, but it’s still a pity. Nonetheless, if you’re a PHP developer and you’re looking to get started with MongoDB, you’ll doubtless find this a useful book.

PHP and MongoDB Web Development from Packt Publishing, also available from Amazon.

Advertisements

As discussed in Part 1, I reached a certain point with MongoDB, and decided that rather than fussing with things I’d move over to Redis and see how things went. The first thing was to change the model — rather than using the mongoid “:field” definitions, for Redis the model becomes a simple PORO (Plain Old Ruby Object). I chose to borrow a nice initialization technique from StackOverflow here so that I didn’t have to hand-code all of the attributes, but basically my initialize() method just sets the attributes and then creates a Redis connection via @redis = Redis.new. So the changes to the model were easy. The harder part was working out how the relationships between objects would work.

Rather than the document-style storage of MongoDB, Redis is purely based on key-value, but with more advanced data structures placed on the top. For my purposes, after some fantastic answers from Salvatore (author of Redis) and folks on the redis mailing list, I worked out how to use Sets to access the data in the ways I needed. So let’s say we have three books, ISBN numbers 123, 456, and 789. Book 123 references book 456, and book 789 references both 123 and 456. We have two authors, “Matsumoto,Yukihiro” who wrote 123 and 456, and “Flanagan,David” who wrote 456 and 789. How do we handle this in a key-value store? By using Sets:

  • Create entries for each book, with key pattern “book::data”. The value is a JSON string of data like title, price, etc (see below for note on this).
  • Create set called “books” which contains the number of every book.
  • Create sets called “backrefs:” that contain the numbers of books referenced by book #.
  • Create a set called “authors” which contains all of the authors.
  • Create sets called “author:” that contains the numbers of books written by the author.

Using the set operations in Redis, then, I can display all of the books by using the “books” set; I can display all of the books by a given author by using the “author:” set; I can display all of the books referenced by a given book by using the “backrefs:” set. In the latter case, you might be thinking that I could just keep an array in the JSON string — and yes, that could work, but I wouldn’t be able to use some of the other interesting set operations, such as intersections to determine references for a given author, for example. Note that right now, since an author is just a name, there actually is no longer any Author model! If I add more meta-data about authors in the future, I can add that easily.

About that JSON string: this has advantages and disadvantages that I’m still considering. Some would say that every individual attribute (or “column” in RDBMS-speak) should be a separate key-value pair. In that approach, for example, if I have a book title and price, I’d have book:123:title => “The Ruby Programming Language” and book:123:price => “39.99”. Obviously I can then do things like add a book to sets like “Under $50” by adding the price item to the set. The big advantage noted by some is that attributes can be added instantly by just saving a new key. Using a JSON string, adding an attribute requires reading/updating all of the existing keys. On the other hand, it is tidy to have a single key, and working with JSON is easy. For the time being, I’m giving it a try by using “book:123:data” to store the “data” about the book, and separating out certain attributes if it makes sense to use them in other data structures like sets and lists. Is this the best of both worlds or the worst of both? I’m not sure yet.

A quick note here before getting into the code: I did this using the redis-rb plugin, which has a lot of functionality but is definitely lacking in documentation. However, the code is extremely clear and easy to read through, so I strongly recommend reading through it, particularly the main lib/redis.rb file. Using it’s pretty much just a matter of installing the plugin and then calling Redis.new.

So, my save() method looks like this:

def save
    book_key = "book:#{number}:data"
    @redis[book_key] = json_data		# creates JSON string
    @redis.set_add "books", number	# add to global books set
    if (back_references)
      back_references.each do |ref|
        @redis.set_add "backrefs:#{ref}", number
      end
    end
    if (authors) then
      authors.each do |a|
        a = CGI::escape(a)
        @redis.set_add "authors", a		# add to global authors set
        @redis.set_add "author:#{a}", number
      end
    end
  end

Improvements to be made here include handling the author names in a better way; doing a CGI::escape works, but a proper hash would be better. During prototyping, the escaping is nice because I can go in with the redis client and see human-readable names, but it makes the keys too long in my opinion.

So now the index() action in the Books controller looks like this:

  def index
    redis = Redis.new
    @entries = redis.set_count 'books'
    @pager = Paginator.new(@entries, 20) do |offset, per_page|
      redis.sort('books', { :limit => [ offset, per_page ], : order => "alpha asc" })
    end
    @keys = @pager.page(params[:page])

    @books = Hash.new
    @keys.each do |k|
      @books[k] = redis["book:#{k}:data"]
    end
  end

Here we get a redis connection, and use Paginator to do its thing — we have to get a count of the set, and then we use sort. This is a big part of the magic, and something that took me some time to work out. The sort command in redis (doc here) is the entry point to doing a lot of the interesting operations once you have things in a set. You’ll notice that in the save() method, all I do is add the book number to the set, not the actual key. That’s much more efficient (Redis is especially good with integers), and is enough. In the case above, all it does is call sort on the “books” set, with the “limit” and “order” options — “limit” as shown takes an offset and number of entries to return, which makes pagination a cinch. For “order” you’ll see that I use “alpha asc” which might seem confusing here since we’re dealing with numbers. In my actual use case the “numbers” can have alphanumerics, and I decided to leave this here because it’s a useful variant to see. In reality, the default for the sort command is ascending numeric so you wouldn’t need to even specify the option here.

Once the keys are retrieved, then I iterate on each one and get the actual data. This is very quick with Redis, but still not ideal. Redis supports an MGET command to retrieve multiple items in a single command, but it doesn’t return the keys, which would mean I’d have to data but not know which book number each one goes with. The redis-rb library provides a great mapped_mget() method, but at the moment it doesn’t support passing in an array. I would have to iterate each key and build a string of them. Presumably a fix can be made to accept an array, in which case this can all be collapsed down to a one-liner: @books = redis.mapped_mget(@keys). (By the way, in case you’re wondering why @keys is an instance variable, it’s because it contains Paginator metadata like page number, to display in my view).

Hopefully it’s pretty obvious that showing a book is pretty straightforward:

    book_data = redis["book:#{@book_number}:data"]
    if (book_data)
      @book = JSON.parse(book_data)
    end

Also simple, here’s the code to get the list of books which reference the current book — that is, the books that have the current book as one of their backward references:

    begin
      references = redis.sort("backrefs:#{number}")
    rescue
      return Array.new
    end

That’s pretty easy, isn’t it? Obviously you can add in an “order” option and even a “limit” if necessary. More interesting, here we get the list of authors, with the list of books written by each:

    alist = redis.sort("authors", { : order => "alpha asc" })
    @authors = Hash.new
    alist.each do |a|
        @authors[CGI::unescape(i)] = redis.sort("author:#{a}")
    end

First we do an initial call to sort to get the authors, sorted in ascending alphabetical order (note that this will be a little undependable given my current implementation since the names are CGI::escaped). Then we iterate each one and do a further sort to get each one’s books. This is fine, but it just returns the number of each book by the author — they key, not the value. Do we have to iterate yet again and do a third call to get the data for each book? Not at all, and this is one of the magic bits of the Redis sort command. If instead of the above sort call we can ask sort to return the values to us, instead of the keys. Using the redis client, the difference is like so:

$ ./redis-cli sort authors:Smith%3B+Bob limit 0 5
1. 123456789
2. 465768794
3. 344756635
4. 436485606
5. 347634767

$ ./redis-cli sort authors:Smith%3B+Bob limit 0 5 get book:*:data
1. {"title":"My Book","price":"19.99"}
…etc…

The second command, as you can see, adds a “get” option. This is a somewhat magic option that instructs Redis to get the values of the keys matching the pattern provided. So what happens, in a sense, is that Redis does the sort, and gets the keys. It then takes the keys and plugs them into the pattern, and does a get. So the first sort command is augmented with a “get 123456789” and so on for the others, and the results are returned. This is all done on the Redis side, very quickly indeed. It is, clearly, extremely powerful. So if we change our code to get the data for the list of books, rather than just the keys:

    alist = redis.sort("authors", { : order => "alpha asc" })
    @authors = Hash.new
    alist.each do |a|
      books = Array.new
      a_data = redis.sort("author:#{a}", { :get => "book:*:data" })
      if (a_data)
        a_data.each do |data|
          books << (JSON.parse(data))
        end
      end
      @authors[CGI::unescape(i)] = books
    end

With this, my controller is passing @authors to the view, which is a Hash keyed off the unescaped author names. The value of each entry in the Hash is an Array of data (which is actually another Hash, created by the JSON.parse call). In the view, I can do something like this rather silly example:

<% @authors.keys.sort.each do |author| %>
  <% books = @authors[author] %>
  <tr class="<%= cycle("even", "odd") -%>">
    <td><%= author %></td>
    <td>
      <% if (books.length > 0) -%>
        <%= books.length %> :
        <% books.each do |b| -%>
        (<%= truncate(b["title"], :length => 25) %>) |
      <% end -%>
      <% else -%>
        0
      <% end -%>
    </td>
  </tr>

This page simply iterates through the authors, and for each one it displays the number of books they’ve written, and the first 25 characters of each title. If they didn’t write any books, it shows a zero.

There is one problem here, and it’s one that I’m working on a solution for: the “sort” with “get” is very cool, but it returns the value of each entry instead of the key. That means that in the above view, I have access to the book’s title, price, etc — but NOT the number! That’s because the number is embodied in the key. This is obviously a problem, since I need to display the book number. Right now, I’m working around this by storing the number in the JSONified data, but that’s not the right thing to do. Ideally, there would be a way to have the “sort get” return the key along with the data, though I’m not certain what that would look like. Alternately, the app can get the keys, and use them to do an MGET for the data. We’ll see.

In any case, we’re now able to display the books and the authors, approaching the objects from either direction to access the others. I’ll post more and/or update this post as I experiment further, but I hope this and the first part serve as a useful introduction to people interested in exploring MongoDB and Redis. For my purposes, I plan to continue forward with Redis rather than MongoDB, but as I’ve shown, they’re not at all the same thing — I can easily see cases where MongoDB might be a better fit. It’s clearly worthwhile to do quick prototyping to make sure you understand your problem set, and then see what the best tool is. One of the most exciting things about the so-called “NoSQL” data stores is that developers now have more tools to work with. If I get the time, I hope to play with Cassandra and Tokyo Cabinet to see how they might fit in. It’s always great to have more options in the tool box.

For the new project I’m working on, after doing some initial very simple prototyping using MySQL (mainly because I could get from 0 to somewhere very quickly with ActiveScaffold and a few simple migrations), I started to look at alternate data stores. There are real reasons given the type of data being managed, but I have to admit that at least some of it was my desire to get a bit of hands-on experience with some of the new kids on the block, too. After exploring the alternatives, I settled on doing some prototyping with both MongoDB, and Redis. There are obviously others that are equally interesting, particularly Cassandra, but there simply isn’t time for everything! I selected Redis because I’d already done some playing with it, understood its basic concepts, and felt that its support for sets would be valuable for what I’m working on. I chose MongoDB as another option after doing some reading on it and finding it to be an interesting combination of key-value with some relational-style support. I also thought the mongoid was a nice bit of work that would be nice to use.

I want to note that I purposely did not call this “MongoDB vs Redis” — they’re different tools, and have different uses, which is one of the things I hope will be clear from these posts. This isn’t a competition, but just a summary of my experiments in looking at how I might approach my needs using the two.

The “problem” to be solved

I’m not at liberty to divulge the details of what I’m working on, so I have a sort of parallel-world simulation of the problem that replicates the types of issues I have to take care of. The idea, then, is to model a reference library, where we have Books and Authors. A Book can have multiple Authors, while an Author may have written multiple Books, so in a relational schema there would be a many-to-many relationship between them. In addition, a Book can contain references to other Books. We want to build a web app that will:

  • Show all of the Books
  • Show all of the Authors
  • For a Book, show all of the Authors
  • For a Book, show all of the Books that it references
  • For a Book, show all of the Books that reference it
  • For an Author, show all of the Books they’ve authored

MongoDB

As I mentioned above, I liked the look of the mongoid plugin to work with MongoDB, though I did do an initial pass using MongoMapper as well. I just felt that mongoid was a bit smoother, had more support for associations, and had somewhat more documentation, but they both did the job. Using Mongoid, my models looked something like this:

class Book
  include Mongoid::Document

  field :number
  field :title
  field :back_references, :type => Array
  field :forward_references, :type => Array
  index :number
  has_many :authors
end

class Author
    include Mongoid::Document

    field :name
    belongs_to :book, :inverse_of => :authors
end

As you can see, much like with ActiveRecord, you simply specify the fields you want persisted, and use a has_many/belongs_to pair to create an association. Do note that instead of extending a class as you would with AR, for mongoid you simply include Mongoid::Document. When I want to create a Book, it goes something like the following, assuming that I have the book number/title and an array of author names:

    the_book = Book.new(
                        :number => book_number,
                        :title => book_title
    )
    authors.each do |a|
      the_book.authors << Author.new(:name => a)
    end
    the_book.save

But what about the references, then? In the Book model above, I have two arrays, back_references (a list of books that reference this one) and forward_references (a list of books that are referenced by this one). Actually, all it takes for these is to create arrays containing the book numbers, assign them to the instance, and save. That’s one of the nice things about MongoDB, as we’ll see: you can query for items in embedded arrays.

A quick note here: I’ve glossed over the setup and configuration of MongoDB here, somewhat on purpose. Once you’ve installed it, if you’re using mongoid there are very clear instructions on setting up your Rails app to use the db so there’s not much need for me to repeat things here. Let’s just say we’re using a db called “books-development” which will then contain our collection, which is called “books”. Wait, shouldn’t we have another collection called “authors” since we have an Author model? Well, no, because the way we set up the has_many/belongs_to it means that Authors are embedded objects within Books. Let’s see what an entry looks like when we persist it. Running the mongo shell:

> db.books.find({number : "1234567890"});
{ "_id" : "4b58f90c69bef38f8f000720", "number" : "1234567890", "forward_references" : [
        "6215628454",
        "63107472345"
], "back_references" : [
        "39848733434",
        "51895763321",
        "5216434662"
], "authors" : [
        {
                "_id" : "4b58f90569bef38f8f000091",
                "name" : "Matsumoto,Yukihiro",
                "_type" : "Author"
        },
        {
                "_id" : "4b58f90569bef38f8f000092",
                "name" : "Flanagan,David",
                "_type" : "Author"
        }
],  "_type" : "Book", "title" : "The Ruby Programming Language" }

From this, you can see that Mongo has assigned “_id” values to each object, the references are both just arrays of book numbers, and the authors have become embedded objects with their own “_id” and “_type” (used by mongoid). As we’ll see in a bit, the fact that the authors are embedded objects is convenient for some purposes, but problematic for others due to the queries I needed to do. For now, though, let’s see what our queries look like for the various activities listed above.

  # Inside books_controller.rb, index action to list the books
  def index
    @entries = Book.count
    @pager = Paginator.new(@entries, 20) do |offset, per_page|
      Book.criteria.skip(offset).limit(per_page).order_by([[:title, :asc]])
    end
    @books = @pager.page(params[:page]) 
  end

  # show action to display a single book's details
  def show
    @book = Book.find(:first,  :conditions => { :number => params[:number] })
  end

Pretty straightforward stuff, even when bringing Paginator into the picture. Being able to chain the criteria with mongoid is a nice bonus to using it. So when a single book is displayed, the page can show the list of author names by simply iterating the array:

  <tr>
    <td class="label">Authors</td>
    <td class="show">
      <% if (@book.authors)
         @book.authors.each do |author| -%>
        <%= author.name %> |
      <% end -%>
      <% else -%>
        None
      <% end -%>
    </td>
  </tr>

The backward references are exactly the same way. However, I discovered while writing the data entry scripts that the forward references (i.e. the books that reference the current book) were not available. No problem, I figured, instead of storing that I’ll just query it:

  def referenced_by
    Book.find(:all, :conditions => { :back_references => number }) 
  end

There’s some nice MongoDB magic. Very simply, that will return any Book entry that contains “number” in its “back_references” attribute — even though that attribute is an array! That ability to query for contents of an array comes in very handy, needless to say. As an aside, I came across a reference that I sadly can’t find now to link to it, but it showed me how to add a super simple search. To make the books searchable, I just took the title and the author, did a split(), and created an array containing each word. I called that “search_words” and made it a new array-type attribute. The search is then a simple query:

  def search_books(search_term)
    Book.find(:all, :conditions => { :search_words => search_term }) 
  end

This is obviously a very simplistic search, but given that it takes about 2 minutes to implement, who’s complaining?

The Author problem

So now we come to where I began to find problems with the approach. I wanted to display the list of all authors. Hmm, the authors are embedded documents within the books. Okay, it is possible:

  def get_author_list
    results = Books.criteria.only(:authors)
    author_list = Hash.new
    results.each do |book|
      book.authors.each do |a|
        if (!author_list.has_key?(a))
          author_list[a] = Book.where(:authors => a)
        end
      end
    end
    return author_list
  end

Pretty ugly, ain’t it? It queries all of the books and gets just the authors attribute, then iterates each book, then iterates the authors. For each one, it does a query to get the list of books (so our page can show each author followed by their books), and creates a Hash with key=author, value=books array. This obviously doesn’t do any pagination, which would make it even messier, plus the results aren’t sorted yet. Nope, I didn’t like it.

The alternative seems to be to make authors a first-level document, and link explicitly with book numbers, which isn’t horrible but means, again, multiple queries to get our list of authors with their books. This was beginning to look like it might be too relational a problem for MongoDB to make sense.

Update: as noted in the comment below by module0000, using distinct(“author”) solves this particular problem in a much cleaner way — thanks for the comment! I’ll still stand by the thought that this is really a relational problem and a document database has shortcomings in that regard (and of course strengths in other ways).

So, I set this aside, since as a prototype it did work. I made a new branch (thanks, git) and converted it to use Redis. Which I’ll cover in part 2, shortly.

A quick one here, but it took a few minutes to work out the best approach so I figured I could perhaps save someone else some time. I’ve been playing with MongoMapper and MongoDB (more on that later I’m sure) and after scaffolding a model I wanted to add pagination to the index action, since the list was loading a few hundred items. After a little searching in the MongoMapper group archives, it was clear that will_paginate wasn’t going to work, since it has some reliance on the ORM being used (though it does support both ActiveRecord and DataMapper). After spending a bit of time fruitlessly trying to get ActiveScaffold to work with MongoMapper, I didn’t want to try to do the same thing with will_paginate. So, someone in the MongoMapper group posted some helpful advice about using Paginator and I figured I would try that.

It turned out to be very easy indeed. After the usual gem install paginator I just had to a add a few lines to my controller:

  def index
    @pager = Paginator.new(MyModel.count, 20) do |offset, per_page|
      MyModel.find(:all, :offset => offset, :limit => per_page, :order => 'name asc, file_date asc')
    end
    @mythings = @pager.page(params[:page])

    respond_to do |format|
      format.html # index.html.erb
      format.xml  { render :xml => @mythings }
    end
  end

This could pretty much be ActiveRecord, of course, since MongoMapper makes the finds look easy. A few things to note here… There’s the initial MyModel.count which determines the number of items so that Paginator knows how many pages it will take to show them all, based on our 20 items per page. That of course works because my find is for all; otherwise the count will have to be based on the actual query, of course. Caching the count would be smart, though with Mongo it’s pretty fast and my data set’s not going to be millions so it doesn’t worry me in this particular case. Something to think about, but in general providing a paginating interface for a massive data set isn’t very smart anyway.

The find is pretty straightforward, and then @mythings is filled with the items for the page. In the view, we do this:

<% @mythings.each do |thing| %>
  <tr>
    <td><%= thing.name %></td>
    <td>Etc...</td>
  </tr>
<% end %>
</table>

<br />

Page <%= @mythings.number %> -
<%= link_to("Prev", mythings_path(:page => @mythings.prev.number)) if @mythings.prev? %>
<%= link_to("Next", mythings_path(:page => @mythings.next.number)) if @mythings.next? %>
<br />

This should be clear enough — Paginator adds a couple of attributes to @mythings that can be used to display the current page number and determine whether there’s a previous or a next page for which a link should be shown.

And that’s it — simple pagination using MongoMapper or any other persistence layer you might want to use, since Paginator makes no assumptions about it. This means there’s a little more work to do yourself, but as you can see from the code above, not much.