I’ve recently had the chance and excuse to play with Elasticsearch, after reading good things about it. We’ve been using Solr with decent success, but it feels like whenever we try to do anything outside the normal index-and-search it’s more complicated than it should be. The basics are easy thanks to the terrific Sunspot gem, though. So when I had a small project to prototype that involved indexing PDFs as well as database records, I figured it was a good opportunity to try out Elasticsearch.

I quickly reached for the Tire gem, which is very similar to Sunspot if you’re using ActiveRecord. Where Sunspot has you include a “searchable” block, Tire adds a “mapping” block, but the idea is the same — that’s where you tell it what fields to index, and how to do it. For each field you can adjust the data type, boost, and more. You can also tack on a “settings” block to adjust things like the analyzers.

The documentation for Tire is pretty good, but I found that I made a number of mistakes trying to adapt the instructions on the Elasticsearch site to the Tire way of doing things, so I thought I’d write up some of the things I learned in hopes that it can help save time for others. Many thanks to the folks on StackOverflow who answered my questions and pointed me in the right direction.

One starter suggestion is to enable Tire’s debug logging, which is really convenient because it outputs each request sent to the ES server as a curl command that you can copy and paste into a terminal for testing. Very handy. I added this to my config/environments/development.rb file:

  Tire.configure do
    logger STDERR, :level => 'debug'
  end

Now on to the model. I’ll call mine Publication, so inside app/models/publication.rb:

class Publication < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  attr_accessible :title, :isbn, :authors, :abstract, :pub_date

  settings :analysis => {
    :filter  => {
      :ngram_filter => {
        :type => "nGram",
        :min_gram => 2,
        :max_gram => 12
      }
    },
    :analyzer => {
      :index_ngram_analyzer => {
        :type  => "custom",
        :tokenizer  => "standard",
        :filter  => ["lowercase", "ngram_filter"]
      },
      :search_ngram_analyzer => {
        :type  => "custom",
        :tokenizer  => "standard",
        :filter  => ["standard", "lowercase", "ngram_filter"]
      }
    }
  } do
    mapping :_source => { :excludes => ['attachment'] } do
      indexes :id, :type => 'integer'
      indexes :isbn
      [:title, :abstract].each do |attribute|
        indexes attribute, :type => 'string', :index_analyzer => 'index_ngram_analyzer', :search_analyzer => 'search_ngram_analyzer'
      end
      indexes :authors
      indexes :pub_date, :type => 'date'
      indexes :attachment, :type => 'attachment'
    end
  end

  def to_indexed_json
    to_json(:methods => [:attachment])
  end

  def attachment
    if isbn.present?
      path_to_pdf = "/Users/foobar/Documents/docs/#{isbn}.pdf"
      Base64.encode64(open(path_to_pdf) { |pdf| pdf.read })
    end
  end
end

Okay, that’s a lot of code, so let’s look things over bit by bit. Naturally the includes at the top are needed to mix in the Tire methods. They are split in two so that you can pull in the search methods without the callbacks if you don’t need them. The callbacks, though, are what make things work auto-magically when you persist ActiveRecord objects. With those included, whenever you call save() on an AR model, the object is indexed into ES for you.
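Conceptually, the callbacks behave as if a save hook were layered over the model. Here’s a stripped-down, plain-Ruby sketch of the idea (this is not Tire’s actual implementation; FakeModel and IndexAfterSave are made-up names):

```ruby
# Simplified sketch of the callback idea -- NOT Tire's real code.
# A prepended module intercepts save() and re-indexes the record
# after a successful save.
module IndexAfterSave
  def save
    result = super
    self.class.indexed << self if result  # stand-in for pushing the doc to ES
    result
  end
end

class FakeModel
  def self.indexed
    @indexed ||= []
  end

  def save
    true  # pretend persistence succeeded
  end

  prepend IndexAfterSave
end

m = FakeModel.new
m.save
FakeModel.indexed.size  # => 1
```

Tire’s real callbacks hook into the model’s save/destroy lifecycle rather than prepending a module, but the effect is the same: a successful save triggers an index update.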

Next up are two blocks of code: settings, and mapping. The settings block defines a filter, and two analyzers, one for indexing and one for searching. I can’t claim to be enough of an expert yet to fully explain the ramifications of the filter/analyzer options, so rather than risk confusion I’ll just note that this code is there to set up the nGram filter and connect it with two analyzers, index and search, which differ slightly in order to ensure that the standard filter is included for searching. You may want to play with the nGram’s min and max settings to get the matching behavior you want. Note that if you don’t need the nGram filter, you can remove the settings block and let the mapping block stand on its own, in which case the default settings will be used (but you’ll have to change the mapping entry for the :title and :abstract fields, as described below).
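To build an intuition for what the nGram filter emits, here is a tiny plain-Ruby illustration (this is not Elasticsearch code, and the min/max are shrunk from 2/12 to 2/4 so the output stays readable):

```ruby
# Plain-Ruby illustration of n-gram generation -- roughly what the
# nGram filter does to each token after tokenizing and lowercasing.
# min/max shrunk from the 2/12 used in the model so the output is short.
def ngrams(term, min_gram, max_gram)
  (min_gram..max_gram).flat_map do |n|
    (0..term.length - n).map { |i| term[i, n] }
  end
end

ngrams("tire", 2, 4)
# => ["ti", "ir", "re", "tir", "ire", "tire"]
```

Because every substring of length 2 through max_gram is indexed, a partial query like “tir” can match the document, which is what makes find-as-you-type style matching work.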

The mapping block is the more interesting one, as it defines the fields and indexing behavior. The first line took me some searching and StackOverflow questioning to figure out. The issue is that by default, Elasticsearch puts all of the fields you index into its _source storage. Because I’m indexing large PDF documents, the result was that a huge Base64-encoded field was being stored. If I wanted to serve the PDFs out of Elasticsearch that might be okay, but that’s not the plan. The :excludes instruction prevents the attachment field from being stored.

Next are the fields themselves, and I won’t spend much time on them because the Tire documentation explains them well. The only interesting items are the entry for :title and :abstract, which specifies that those fields should use the custom analyzers defined in the settings block, and the :attachment field, where it gets a little bit tricky.

When the indexing is performed, the fields are gathered up by calling the method to_indexed_json(). Normally that just does a to_json() on your model and collects the fields, but you can also override it, which we do here. You can see that we add in the method attachment(), which is defined below, so the other fields are JSONized as normal, along with the output of the attachment() method. The attachment() method itself uses the ISBN to open the PDF file, which is read and Base64-encoded. The result of that encoding is included with the other fields and sent to ES for indexing.
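The Base64 step is just Ruby’s standard library; a quick round-trip shows what actually ends up in the JSON payload (the string below is a stand-in for real PDF bytes):

```ruby
require 'base64'

# Stand-in for a PDF's raw bytes; the real attachment method reads
# them from disk with open/File.read as shown in the model above.
pdf_bytes = "%PDF-1.4 fake content"

encoded = Base64.encode64(pdf_bytes)  # this string goes into the JSON sent to ES
decoded = Base64.decode64(encoded)    # what the attachment handling recovers

decoded == pdf_bytes  # => true
```

The encoded form is ASCII-safe, which is why it can travel inside the JSON document; the cost is the roughly 4/3 size blowup, which is exactly why excluding it from _source matters.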

Performing the searching is almost too easy, but there was one bit that threw me off initially, which was getting highlighting to work. The search block in my controller looks like this:

results = Publication.search do
  query { string query_string }
  sort { by :pub_date, 'desc' }
  highlight :title, :options => { :tag => "<strong class='highlight'>" }
end

I was trying to test the highlighting and was thrown off by the field names being case-sensitive (see my question on StackOverflow), but this now works. The other key is that the highlighted fields are returned separately from the plain fields, which was odd to see. This means that to display the highlighting I have to check for the field:

results.each do |r|
  r_title = (r.highlight.nil? ? r.title : r.highlight.title[0])
  puts "Title: #{r_title}"
end

If the highlighting is present then it’s used; if not (because the term isn’t present in that field) then the regular field is used. The other handy thing to note is that you can specify the tag with which to wrap the term. The default is “<em>” but I wanted to specify the “highlight” CSS class, as is shown here. This is a really convenient feature.

That covers the basics, though it’s also probably worth sharing just how nice it is to be able to test using curl. For example, I wanted to check how easy it is to have the search call return just a single field (to speed up certain requests), so I tried it first in curl:

curl -XPOST http://localhost:9200/publications/_search\?pretty\=true -d '{
"query": {"query_string": {"query": "Foobar"}},
"fields": ["title"]
}'

That’s of course assuming that ES is running on your local system on port 9200; if not, adjust accordingly.

There you go. I hope this writeup is helpful to folks getting started and it saves you some time.

After spending time to get some data into Redis (as documented in some of my previous posts here), I not surprisingly wanted to make the data searchable. After looking around at some of the full-text search solutions available for Ruby, I really liked the look of Sunspot. Well-presented, well-designed, and it even has decent documentation. It uses Solr underneath, which is a very respectable search engine, so that’s all good. Of course, it didn’t take me long to discover that the sunspot_rails plugin makes things drop-and-go when using ActiveRecord, but those of us branching off into alternatives have to put in more effort. Hence, I’ll document my findings here to hopefully make it easier for others.

I won’t bother going into the details of getting things set up, as the Sunspot wiki does a fine job of that. Suffice it to say that we install the gem (and the sunspot_rails gem if you’re going to have some ActiveRecord models as well), start the Solr server, and that’s about it. We’ve got Redis already going, right? So now it’s time to get our model indexed and searchable!

There are a few steps that we need to follow to make this happen. First, we put code in the model to tell Sunspot what fields should be indexed, which ones are just for ordering/filtering, and which ones should be stored if desired for quicker display:

class Book
  require 'sunspot'
  require 'sunspot_helper'

  # Pretend some attributes like number, title, etc are defined here

  Sunspot.setup(Book) do
    text :number, :boost => 2.0
    text :title, :boost => 2.0
    text :excerpt
    text :authors
    string :title, :stored => true
    string :number, :stored => true
    date :publication_date
  end

  def save
    book_key = "book:#{number}:data"
    @redis[book_key] = json_data    # @redis and json_data are set up elsewhere in the model
    @redis.set_add 'books', number
    # Make searchable
    Sunspot.index( self )
    Sunspot.commit
  end

  def self.find_by_number(redis, number)
    redis["book:#{number}:data"]
  end
end

First, note that we need to require 'sunspot' to get access to the Sunspot class. This isn’t required for ActiveRecord models, but since we’re on our own, we have to specify that. Then, we call setup, passing the name of our model. In the code block, we specify a few text fields: the number, title, excerpt, and authors. Those fields will be indexed and searchable. Then we specify title and number again as strings, asking that they be stored for quicker retrieval. This is so we can display just that data without fetching the whole object, if we want — I won’t get into the details of doing that here because, well, fetching the objects in Redis is so fast that I found it didn’t matter. Last, the publication date is also listed, so we can filter and order by it if we want.

In our save() method, after we store a book in Redis, we tell Sunspot to index it, and commit the updated index. So far, so good. In theory, we should be able to create a Book, save it, and then search for it. Alas, if this were an ActiveRecord model we’d be pretty much done (and wouldn’t even have to do the index/commit part because those are automagically triggered on create and update). Unfortunately, we have some harder work ahead of us.

Sunspot uses what it calls “adapters” to tell it what to do when it wants to identify an object, and when it wants to fetch an object given an id. We have to provide the adapters for our model. To give credit where it’s due, this Linux Magazine article helped me figure out what to do, and then reading through the Sunspot adapter source code filled in the blanks. If you look back at our model, you’ll see that it requires ‘sunspot_helper’. That’s where we’ll put our adapters:

/app/helpers/sunspot_helper.rb:

require 'rubygems'
require 'sunspot'

module SunspotHelper

  class InstanceAdapter < Sunspot::Adapters::InstanceAdapter
    def id
      @instance.number  # return the book number as the id
    end
  end

  class DataAccessor < Sunspot::Adapters::DataAccessor
    def load( id )
      Book.new(JSON.parse(Book.find_by_number(Redis.new, id)))
    end

    def load_all( ids )
      redis = Redis.new
      ids.map { |id| Book.new(JSON.parse(Book.find_by_number(redis, id))) }
    end
  end

end

So, what’s going on here? We provide two adapters for Sunspot: the InstanceAdapter, and the DataAccessor. The InstanceAdapter just provides a method that returns the ID of the object. Easy enough, we just return the book’s number, which is the unique identifier. The DataAccessor has to provide two methods, load() and load_all(), that take an id and a list of ids, respectively, and expect objects back. In my case, the objects are serialized JSON, so we just call our find_by_number() method to get each object, call JSON.parse() to get the Hash of data, and construct a new Book object. (Note: obviously this requires having an initializer that can take a Hash and create the object, which I’ll leave as an exercise) Now we just register our adapters, by adding a couple of lines of code right before the call to Sunspot.setup():

  Sunspot::Adapters::InstanceAdapter.register(SunspotHelper::InstanceAdapter, Book)

  Sunspot::Adapters::DataAccessor.register(SunspotHelper::DataAccessor, Book)
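As an aside, the hash-taking initializer mentioned above might look something like this (a hypothetical minimal version; a real model would have more attributes and its own Redis plumbing):

```ruby
require 'json'

# Hypothetical minimal Book initializer that accepts a Hash of
# attributes, as the adapters require. Only two attributes shown;
# a real model would carry the full set.
class Book
  attr_reader :number, :title

  def initialize(attrs = {})
    @number = attrs["number"]
    @title  = attrs["title"]
  end
end

book = Book.new(JSON.parse('{"number":8888888,"title":"My test title"}'))
book.number  # => 8888888
```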

Now we should be good to go, right? Okay, we construct a Book object, and call save…then search for it:

b = Book.new({ "number" => 8888888, "title" => "My test title"})
=> #<Book:blahblah...
b.save
=> nil
search = Sunspot.search(Book) { keywords 'test' }
=> <Sunspot::Search:{:rows=>1, blahblah…
r = search.results
=> [#<Book:blahblah...
r[0].title
=> "My test title"

And we’re good! Congratulations. So now we want to add the search capability to our controller, right?

# In a view, put in a search form. I have a little search image, so excuse the image_submit_tag:
<% form_for(:book, :url => { :action => "search" }) do |f| %>
    <p>
      <%= f.label "Search for:" %>
      <input type="text" name="searchterm" id="searchterm" size="20">
      <%= image_submit_tag('search.png', :width => '30', :alt => 'Search', :style => 'vertical-align:middle') %>
    </p>
<% end %>

# Now in the controller. Note the pagination, which is why we store the search in the session,
# so we can grab it out again if they click forward/back through the pages.
  def search
    @search_term = params[:searchterm] || session[:searchterm]
    if (@search_term)
      session[:searchterm] = @search_term
    end
    page_number = params[:page] || 1
    search = Sunspot.search(Book) do |query|
      query.keywords @search_term
      query.paginate :page => page_number, :per_page => 30
      query.order_by :number, :asc
    end

    @books = search.results
  end

# And then in our search view, display the results:
<ul>
<% @books.each do |book| %>
    <li><%= book.number %>: <%= book.title %></li>
<% end %>
</ul>
<br />
Found: <%= @books.total_entries %> - <%= will_paginate @books %>

Yes, Sunspot is so cool that it integrates automatically with will_paginate. So, looking through the above, we have a form that posts to our action (assuming you set the routes up, which you did, yes?). The action then takes the searchterm parameter if it’s there, or extracts it from the session if it’s not. Note that this is not robust code — if it’s called with no param and nothing in the session, it ends up searching for an empty string, which returns every book. In any case, we store the search term in the session, so that when someone clicks through to page 2, we can re-run the search to get the second page. The more important code here, though, is the call to search.

I will give a thousand thanks to this blog post, specifically the fourth item! I was doing this:

    search = Sunspot.search(Book) do
      keywords @search_term
    end

And it didn’t work — it was fetching every object, even though I knew that @search_term was getting set properly. As that blog post notes, though, the search is done in a new scope, so this didn’t work. The code I showed above, using the query argument, fixes that problem. It certainly took me a while to figure that out, though, because nothing is said about it anywhere in the examples in the Sunspot wiki.
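The underlying Ruby behavior is easy to reproduce: a zero-argument block gets instance_eval’d, which rebinds self to the search object, so the caller’s instance variables silently evaluate to nil. Passing a block argument keeps the caller’s scope. Here’s a stripped-down sketch of the pattern (not Sunspot’s actual code; FakeSearch is a made-up stand-in):

```ruby
# Stripped-down sketch of the DSL scoping gotcha -- not Sunspot's code.
# A zero-arity block is instance_eval'd (self is rebound, so the
# caller's instance variables vanish); a one-argument block is simply
# called, leaving the caller's self and its ivars visible.
class FakeSearch
  attr_reader :term
  def keywords(t); @term = t; end

  def self.search(&block)
    s = new
    block.arity == 1 ? block.call(s) : s.instance_eval(&block)
    s
  end
end

class Controller
  def initialize; @search_term = "test"; end

  def broken
    FakeSearch.search { keywords @search_term }       # @search_term is nil in here
  end

  def fixed
    FakeSearch.search { |q| q.keywords @search_term } # caller scope intact
  end
end

c = Controller.new
c.broken.term  # => nil
c.fixed.term   # => "test"
```

This is why the one-argument form works: the block is simply called with the query object, so self (and @search_term) are left untouched.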

So now you should be all set. Put “test” into the form, submit it, the controller will do the search, return the book, and your view will list it. You are searching! Not so bad, and the fetches from Redis are so fast that the whole thing really speeds along. Pretty simple free-text search against any objects that you put into Redis.

    A Warning

I had one other hitch when I was working on this, which mysteriously went away. I hate that. So, in case someone else encounters it, I wanted to document the issue. When I got the adapters in place for the Book model and tried to work with it, I got an error saying that there was no adapter registered for String. I was very puzzled, wondering if something about the fact that Redis was returning a JSON String was confusing Sunspot. So I made a quick change to the InstanceAdapter:

  class InstanceAdapter < Sunspot::Adapters::InstanceAdapter
    def id
      if (@instance.class.to_s == "String")
        @instance
      else
        @instance.number  # return the book number as the id
      end
    end
  end

And changed the register lines in my model:

  Sunspot::Adapters::InstanceAdapter.register(SunspotHelper::InstanceAdapter, Book, String)

  Sunspot::Adapters::DataAccessor.register(SunspotHelper::DataAccessor, Book, String)

And that did the trick. I didn’t like it, and intended to try to figure out what was going on. But after getting all the rest of it working, when I put the code back to its pre-String-adapter state, the error didn’t return. Like I said, I hate that. Hopefully it was just due to something that I was unknowingly doing wrong which I fixed along the way, but…just in case, now the quick-fix is documented here for anyone else who runs into the problem.
