Machine Intelligence

And now for something different

Improving results of phrase searches in Drupal with Apache Solr

Apache Solr is a powerful tool that is becoming part of many sites and online applications. Its focus is very specific: text-based search. By its nature, though, it can be applied to many different kinds of applications, and it provides a number of options for adapting to them. To get the most out of it, you have to understand both the configuration options and the usage scenarios.

Drupal 7 and Apache Solr in use at India Environment Portal

We have been working with CSE on their India Environment Portal, which has a large collection of news items along with many articles. All the content is classified with a taxonomy of around ten thousand tags. The site itself is built on Drupal 7 but uses Apache Solr for searches. We observed that visitors to the site search not just for keywords but for phrases, e.g. Pollution in Bihar, Air Quality around Delhi. Although the standard search returned good results, the most relevant articles were not always listed at the top. That is because, by default, Solr matches and ranks the individual words of a query, so searching for a phrase is like searching for several independent words.

Apache Solr sloppy phrase query

There is a way to force phrase searches by enclosing them in double quotes ("), e.g. "Pollution in bihar". This gives better results for phrases, but it is not a good user interface: we want to keep searching simple for users and would rather not expose such implementation details. Among its many options, Apache Solr has the sloppy phrase query, which ranks a document higher the closer the words of a phrase appear together. This option can easily be used to get better results.

It can be added to the standard request handler as:

q=content:"Pollution in bihar"~50

Here ~50 is the slop: pollution and bihar are matched as a phrase when they appear within roughly 50 word positions of each other.
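The query string above can also be assembled in code. The following is only a minimal sketch: sloppy_phrase_query() and its parameter names are made up for illustration and are not part of Solr or of any Drupal module. It escapes quotes in the user's phrase and appends the slop.

```php
<?php
// Build a Lucene-syntax sloppy phrase query such as content:"a b"~50.
// The function and parameter names are illustrative assumptions.
function sloppy_phrase_query($field, $phrase, $slop) {
  // Escape backslashes and double quotes so the phrase cannot
  // terminate the quoted string early.
  $escaped = strtr($phrase, array('\\' => '\\\\', '"' => '\\"'));
  return $field . ':"' . $escaped . '"~' . (int) $slop;
}

$q = sloppy_phrase_query('content', 'Pollution in bihar', 50);
echo $q, "\n";

// URL-encode the value before placing it in the request.
echo 'q=' . rawurlencode($q), "\n";
```

The first line printed is the same q value shown above; the second is its URL-encoded form, ready to append to the select URL of your own Solr instance.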

With the dismax query handler, the equivalent is:

q=Pollution in bihar&pf=content&ps=50
  • pf lists the phrase fields; you can specify more than one.
  • ps is the phrase slop, i.e. how many words apart the words of the query may be and still count as a phrase match.

In both cases "in" is not considered, as it will generally be filtered out as a stopword.
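As a rough sketch, the dismax parameters can be put together like this. The helper dismax_phrase_params() is hypothetical, and defType=dismax is an assumption about how the handler is selected; your setup may instead make dismax the default in solrconfig.xml. Other parameters a real request usually needs, such as qf, are left out.

```php
<?php
// Assemble dismax request parameters with phrase boosting.
// The function name and argument names are illustrative only.
function dismax_phrase_params($userQuery, array $phraseFields, $slop) {
  return array(
    // Assumption: the dismax parser is selected per request.
    'defType' => 'dismax',
    'q'       => $userQuery,
    // Several phrase fields go into one space-separated pf value.
    'pf'      => implode(' ', $phraseFields),
    'ps'      => (int) $slop,
  );
}

$params = dismax_phrase_params('Pollution in bihar', array('label', 'content'), 50);

// Encode the parameters as a URL query string (spaces become %20).
$queryString = http_build_query($params, '', '&', PHP_QUERY_RFC3986);
echo $queryString, "\n";
```

Unlike the standard handler form, the user's text goes into q untouched, so no quoting has to be exposed in the search box.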

After implementing this change to the query, we did find improved search results.

Implementing sloppy phrase query in Drupal 7

To enable the sloppy phrase query in Drupal 7 with the apachesolr module, you need to alter the apachesolr search query. The following code handles it; just change MYMODULE to the name of your own module.

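/**
 * Implements hook_apachesolr_query_alter().
 */
function MYMODULE_apachesolr_query_alter($query) {
  // Improve search results for phrases by ranking documents higher
  // when the query words appear closer to each other.
  $param = $query->getParam('pf');
  // Only set the phrase fields if nothing has configured them already.
  if (empty($param)) {
    $query->addParam('pf', 'label');
    $query->addParam('pf', 'content');
    $query->addParam('ps', 50);
  }
}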

Both the label and content fields will be used for the sloppy phrase query.
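To check the hook's logic outside Drupal, the sketch below runs the same function body against a tiny stand-in query object. StubSolrQuery is hypothetical and not part of the apachesolr module: it only mimics the getParam() and addParam() calls used in the hook and stores every value in an array, which may differ from how the module's real query class treats single-valued parameters. The field name title is likewise just an example.

```php
<?php
// Hypothetical stand-in for the apachesolr query object; it mimics only
// the two methods the hook calls and is NOT the module's real class.
class StubSolrQuery {
  private $params = array();

  // Return all values stored under $name, or an empty array if none.
  public function getParam($name) {
    return isset($this->params[$name]) ? $this->params[$name] : array();
  }

  // Append $value to the list of values stored under $name.
  public function addParam($name, $value) {
    $this->params[$name][] = $value;
    return $this;
  }
}

// Same body as the hook, with MYMODULE replaced by "demo".
function demo_apachesolr_query_alter($query) {
  $param = $query->getParam('pf');
  if (empty($param)) {
    $query->addParam('pf', 'label');
    $query->addParam('pf', 'content');
    $query->addParam('ps', 50);
  }
}

// A query with no phrase fields receives the defaults.
$fresh = new StubSolrQuery();
demo_apachesolr_query_alter($fresh);

// A query that already has pf set (example field) is left untouched.
$preset = new StubSolrQuery();
$preset->addParam('pf', 'title');
demo_apachesolr_query_alter($preset);

print_r($fresh->getParam('pf'));
print_r($preset->getParam('pf'));
```

The empty() check is what keeps the hook from overriding phrase fields that another module or the search page configuration has already set.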

References:

  1. Apache Solr
  2. Solr Relevancy FAQ
  3. Drupal Module for Apache Solr