mrtopf.de

FOSDEM 2008: Introducing the ApacheSolr Module for Drupal

speaker: Robert Douglass
subtitle: Next Generation Drupal search

He opens by pointing out that searching for content on drupal.org can be problematic. Newbies complain about it, along the lines of "cannot find anything" or "why not use Google?"

Back in 2004, Robert suggested using Google on drupal.org. The response was that this is a bad idea: Drupal's own search would then never be improved, and Drupal also has more information about the content than Google has.

He then tries it and searches drupal.org for "Google search robertDouglass", but drupal.org can't find it. He runs the same search on Google, "site:drupal.org google search robertDouglass", and it's the first result. He still agrees, though, that the Google box shouldn't be up there.

The solution

The underlying technology is called Lucene. It is Java-based and relatively old. Basically, you put text in and get search results out.

Solr is a web service which sits on top of Lucene and you can use GET requests to search for content. It has a number of return formats such as JSON and others. Solr can also serve many clients. That would mean that you can search multiple Drupal sites with one Solr instance. This makes sense for a family of sites.
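To make the GET-request idea concrete, here is a minimal sketch that builds a Solr search URL asking for JSON output. The host, port and path are assumptions (a local example install is taken to listen on port 8983 under /solr/select), and `build_search_url` is a hypothetical helper, not part of the ApacheSolr module.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a local Solr example install reachable at this address.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(keywords, rows=10, start=0, base_url=SOLR_SELECT_URL):
    """Return a GET URL that asks Solr for JSON-formatted results."""
    params = {
        "q": keywords,   # the search query
        "wt": "json",    # response format: JSON instead of the default XML
        "rows": rows,    # number of results per page
        "start": start,  # offset of the first result
    }
    return base_url + "?" + urlencode(params)


url = build_search_url("google search")
print(url)
```

Because the search is just an HTTP request, several Drupal sites could point at the same URL, which is what makes one Solr instance usable for a family of sites.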

Another plus is that Solr replicates "out of the box". He mentioned that, back in the day, the only way to keep drupal.org running was to turn search off, because search puts a heavy burden on MySQL. That is of course not a good solution. With Solr it's easier, because you simply buy another box if that happens.

You can also filter results easily, as Digg does (they use Solr), and you can sort them.

Solr with Drupal

There is the ApacheSolr module (http://drupal.org/project/apachesolr) from Robert.
There is also the Solr module (http://drupal.org/project/solr), but it is a replacement for the core search.module. In addition, you need to run one instance per content type, which Robert did not like, so he wrote ApacheSolr.

Installing Solr

  • You need Java 1.5 or later
  • Download and unpack the tarball
  • Move schema.xml from the ApacheSolr module into solr/conf
  • java -jar start.jar

Solr Features

  • Faceted search (meaning: you search for a keyword and, in addition to the search results, you get some clues and guides on how to narrow down the scope of your search. In the ApacheSolr module you can, for example, narrow it down by content type.)
  • Range facets (e.g. date ranges)
  • Spelling suggestions (not yet in ApacheSolr). This does not rely on a dictionary; it looks at the index statistics and derives suggestions from them.
  • Keyword extraction. Might be interesting for content recommendations.
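As an illustration of faceting, the following sketch requests facet counts for one field and turns the answer into a lookup table. `facet` and `facet.field` are standard Solr request parameters; the field name `type` and the sample response are made up for this example and are not taken from the talk.

```python
from urllib.parse import urlencode


def facet_params(keywords, field):
    """Query string for a keyword search that also returns facet counts."""
    return urlencode({
        "q": keywords,
        "wt": "json",
        "facet": "true",       # switch faceting on
        "facet.field": field,  # field to count values for
    })


def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list to a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample response, not real data.
sample = {"facet_counts": {"facet_fields": {"type": ["story", 12, "book", 5, "forum", 3]}}}
counts = facet_counts(sample, "type")
print(facet_params("drupal", "type"))
print(counts)
```

Each entry in `counts` can be rendered as a link that repeats the search restricted to that value, which is the "narrow it down" guidance described in the list above.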

Solr Performance

There are 1000 ways to scale Drupal (caching, memcached, static files, …), but you cannot really scale search. The usual solution is to use something like Google instead.

  • But Solr has distributed searching built in
  • It takes load off the database
  • Extra point of scalability

He then quoted some users from the Solr website who, with a single and not particularly powerful box, handle far more search requests than Drupal itself handles requests per month (7 million).

Direct comparison Drupal – Solr

Searching for "drupal" with Solr takes 61 database queries in 36.67 ms; page load time was 660.82 ms.

Searching for "drupal" with Drupal's own search takes 227 queries in 44746.32 ms; page load time was 45495.2 ms.
A second run took 29278.42 ms for the queries and 29484.35 ms for the page load.

But: the computer the test ran on wasn't optimized for it (MySQL-wise).
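To put the page load times in proportion, here is a quick calculation using the numbers quoted above. It assumes the second pair of Drupal figures is again query time followed by page load time, and it inherits the caveat that the machine was not tuned for MySQL.

```python
# Page load times in milliseconds, as quoted in the talk notes.
solr_page_ms = 660.82
drupal_first_run_ms = 45495.2
drupal_second_run_ms = 29484.35  # assumed to be the page load time of the second run

first_ratio = drupal_first_run_ms / solr_page_ms
second_ratio = drupal_second_run_ms / solr_page_ms

print(f"first run: core search page is about {first_ratio:.0f}x slower")
print(f"second run: core search page is about {second_ratio:.0f}x slower")
```

On this one machine, the core search page took roughly 69 and 45 times as long as the Solr-backed page.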

Relevant results

  • Solr doesn’t weight HTML
  • Solr can have query-time boosts on specific fields (Drupal can do that too, e.g. by number of comments, how recent the content is, etc.)
  • Solr doesn't yet have any "Page Rank" mechanism for incoming links as Drupal does.
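A query-time boost can be expressed directly in the Lucene query syntax as `field:term^boost`. The sketch below assembles such a query for a single search term; the field names `title` and `body` and the boost values are assumptions chosen for illustration, not the module's actual configuration.

```python
def boosted_query(term, boosts):
    """Build a Lucene-syntax query that weights each field by its boost.

    Only handles a single term without spaces or special characters.
    """
    clauses = [f"{field}:{term}^{boost}" for field, boost in boosts.items()]
    return " OR ".join(clauses)


# Hypothetical weighting: a match in the title counts five times as much.
query = boosted_query("drupal", {"title": 5, "body": 1})
print(query)
```

Because the boost travels with the query, the weighting can be changed per request without re-indexing any content.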

Configurability

Tokenizers, analyzers, filters are possible but not yet implemented in ApacheSolr

He then showed an XML file which defines what is indexed. Only indexed fields can be searched. Body is not indexed itself because it's copied to text. nid is indexed so that we can search content from a specific author.
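The following sketch shows what such a definition can look like and how to read it. The fragment uses Solr's `field` and `copyField` schema elements, but it was written for this example; it is not the schema.xml shipped with the ApacheSolr module.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, written for illustration only.
SCHEMA_FRAGMENT = """
<schema name="example">
  <fields>
    <field name="nid" type="integer" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="body" type="text" indexed="false" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <copyField source="title" dest="text"/>
  <copyField source="body" dest="text"/>
</schema>
"""

root = ET.fromstring(SCHEMA_FRAGMENT.strip())

# Fields that can be searched directly.
searchable = [f.get("name") for f in root.iter("field") if f.get("indexed") == "true"]

# Fields whose content is copied into another field at index time.
copies = [(c.get("source"), c.get("dest")) for c in root.iter("copyField")]

print(searchable)
print(copies)
```

In this fragment `body` is stored but not indexed, and it still becomes searchable through the catch-all `text` field it is copied into.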

Index time vs. Query time

  • Custom pipeline for indexing
  • Must be mirrored (roughly) at query time.
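Why the two pipelines have to match can be shown with a toy inverted index. This is a plain Python sketch, not Solr's analyzer chain: the same hypothetical `analyze` function is applied when indexing and when querying, so a query in different case still finds the document.

```python
import re

STOPWORDS = {"the", "a", "of", "and"}


def analyze(text):
    """Toy analysis pipeline: lowercase, split into tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analyzed query token."""
    tokens = analyze(query)  # same pipeline as at index time
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {1: "The Drupal Search Module", 2: "Solr and Lucene"}
index = build_index(docs)
hits = search(index, "DRUPAL search")
print(hits)
```

If the query skipped the lowercasing step, "DRUPAL" would never match the lowercased token stored in the index, which is the mismatch the second bullet warns about.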

Drupal’s search framework

  • Any module can define a search tab
  • Generic keyword handling (type:book, uid:4459)
  • Excerpt highlighting
  • Programmatic searches (do_search)
  • Dynamic scoring factors
  • Search theming support
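The keyword handling mentioned in the list can be pictured as splitting a query into free-text keywords and `key:value` filters. The sketch below is a simplified Python illustration with a hypothetical `extract_filters` helper; Drupal's own implementation is written in PHP and is more involved.

```python
def extract_filters(query):
    """Split a query into free-text keywords and key:value filters."""
    filters = {}
    keywords = []
    for part in query.split():
        key, sep, value = part.partition(":")
        if sep and key and value:
            filters[key] = value   # e.g. type:book
        else:
            keywords.append(part)  # ordinary search word
    return " ".join(keywords), filters


keywords, filters = extract_filters("solr type:book uid:4459")
print(keywords, filters)
```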

Limitations

  • Can’t turn off content and user search
  • Query factor injection is fragile and nearly impossible to understand
  • Markup weights aren't configurable
  • Cron-based indexing is hard to get right
  • Only one search per page (temp tables). No longer true for Drupal 6.

Example: solr.robshouse.net
This also shows facets, and you can inspect the queries that Solr used.

