FOSDEM 2008: Introducing the ApacheSolr Module for Drupal

24 February 2008, 15:31, Conferences and Meetings, English Posts, Open Source, Planet Plone

speaker: Robert Douglass
subtitle: Next Generation Drupal search

He first asserts that searching for content on drupal.org can be problematic. N00bs complain about it with things like “cannot find anything” or “why not use Google?”

Back in 2004, Robert suggested using Google for search on drupal.org. The response was that this is a bad idea, because Drupal’s own search would then never be improved, and because Drupal has more information about its content than Google has.

He then tries it and searches drupal.org for “Google search robertDouglass”, but drupal.org can’t find it. He does the same on Google with “site:drupal.org google search robertDouglass” and it’s the first result. He still agrees, though, that the Google box shouldn’t be up there.

The solution

The underlying technology is called Lucene; it is Java-based and relatively old. Basically, you put text in and get search results out.

Solr is a web service that sits on top of Lucene, and you use GET requests to search for content. It has a number of return formats, JSON among them. Solr can also serve many clients, which means you can search multiple Drupal sites with one Solr instance. This makes sense for a family of sites.
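To make the GET request idea a bit more concrete, here is a rough Python sketch that builds a search URL asking for JSON and pulls the documents out of the response. The host, port and path (localhost:8983/solr/select) are what a stock Solr download uses and are an assumption here, as are the helper names; the talk did not show any client code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of a locally running Solr; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(keywords, rows=10, base_url=SOLR_SELECT_URL):
    """Return the GET URL for a keyword search that asks for JSON output."""
    params = {"q": keywords, "wt": "json", "rows": rows}
    return base_url + "?" + urlencode(params)


def extract_docs(raw_json):
    """Return (total hit count, list of documents) from a Solr JSON response."""
    body = json.loads(raw_json)["response"]
    return body["numFound"], body["docs"]


def search(keywords, rows=10):
    """Run the search against a live Solr instance (needs network access)."""
    with urlopen(build_search_url(keywords, rows)) as response:
        return extract_docs(response.read().decode("utf-8"))
```

Several Drupal sites could call something like search() against the same Solr instance, which is the "family of sites" case mentioned above.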

Also good is that Solr replicates “outta the box”. He mentioned that, back in the day, the only way to keep drupal.org running was to turn search off, because search puts a heavy burden on MySQL. This of course is not a good solution. With Solr it’s easier, because you simply buy a new box if that happens.

You can also easily filter results, like Digg does (they use Solr), and sort them.

Solr with Drupal

There is the ApacheSolr module (http://drupal.org/project/apachesolr) from Robert.
There is also the Solr module (http://drupal.org/project/solr), but this is a replacement for the core search.module. Additionally, you need to run one instance per content type, which Robert did not like, and so he wrote ApacheSolr.

Installing Solr

You need Java 1.5 or later
Download and unpack the tarball
Move schema.xml from the ApacheSolr module into solr/conf
Run java -jar start.jar

Solr Features

Faceted search (meaning: you search for a keyword and, in addition to the search results, you get some clues and guides on how to narrow down the scope of your search. In the ApacheSolr module you can e.g. narrow it down by content type.)
Range facets (e.g. date ranges)
Spelling suggestions (not yet in ApacheSolr). This does not use a dictionary but looks at the index statistics and makes suggestions from there.
Keyword extraction. Might be interesting for content recommendations.
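A faceted search by content type could be requested roughly like this. This is only a sketch: the field name "type" is an assumption about the schema, and the function names are made up for illustration. The facet, facet.field and fq parameters are standard Solr request parameters, and by default the JSON writer returns facet counts as a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def facet_params(keywords, facet_field="type", narrow_to=None):
    """Build query parameters for a faceted search, optionally narrowed."""
    params = [
        ("q", keywords),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    if narrow_to is not None:
        # fq restricts the result set to one facet value.
        params.append(("fq", f"{facet_field}:{narrow_to}"))
    return urlencode(params)


def facet_counts(response, facet_field="type"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))
```

The first request returns the counts per content type alongside the results; clicking on one of them would repeat the search with narrow_to set.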

Solr Performance

There are 1000 ways to scale Drupal (caching, memcached, static files, …) but you cannot really scale search. The only solution usually is to use e.g. Google instead.

But Solr has distributed searching built in
It takes load off of the database
Extra point of scalability

He then quoted some people from the Solr website who, with a single and not particularly powerful box, handle a lot more search requests than Drupal handles requests per month (7 million).

Direct comparison Drupal – Solr

Searching for “drupal” with Solr takes 61 database queries in 36.67 ms, with a page load time of 660.82 ms.

Searching for “drupal” with Drupal’s own search takes 227 queries in 44746.32 ms. Page load time was 45495.2 ms.
Doing it a second time gave 29278.42 ms and 29484.35 ms.

But: the computer on which this was run wasn’t optimized for it (MySQL-wise).
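To put the numbers above into proportion, here is a quick back-of-the-envelope calculation in Python. It only divides the figures from these notes: page load comes out roughly 69 times faster with Solr compared to the first Drupal run, roughly 45 times faster compared to the second run, and Solr needs about 3.7 times fewer database queries. Given the caveat about the unoptimized machine, take this as an illustration and not as a benchmark.

```python
# Timings in milliseconds, copied from the notes above.
solr_page_load = 660.82
drupal_page_load_first = 45495.2
drupal_page_load_second = 29484.35

solr_db_queries = 61
drupal_db_queries = 227


def ratio(larger, smaller):
    """How many times bigger `larger` is than `smaller`."""
    return larger / smaller


first_run_speedup = ratio(drupal_page_load_first, solr_page_load)
second_run_speedup = ratio(drupal_page_load_second, solr_page_load)
query_reduction = ratio(drupal_db_queries, solr_db_queries)

print(f"page load, first run:  {first_run_speedup:.1f}x faster with Solr")
print(f"page load, second run: {second_run_speedup:.1f}x faster with Solr")
print(f"database queries:      {query_reduction:.1f}x fewer with Solr")
```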

Relevant results

Solr doesn’t weight HTML
Solr can apply query-time boosts on specific fields (Drupal can do that too, based on e.g. the number of comments, how recent the content is, etc.)
Solr doesn’t yet have any “PageRank”-like mechanism for incoming links, as Drupal does.
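A query-time boost on specific fields can be written directly in the Lucene query syntax as field:term^boost. The little helper below is a sketch of that idea; the field names "title" and "text" and the function name are assumptions for illustration, and it only handles a single plain term without any escaping.

```python
def boosted_query(term, boosts):
    """Search `term` in several fields, weighting each field with ^boost."""
    clauses = [f"{field}:{term}^{boost}" for field, boost in boosts.items()]
    return " OR ".join(clauses)


# A match in the title counts twice as much as a match in the text.
query = boosted_query("drupal", {"title": 2.0, "text": 1.0})
print(query)
```

The resulting string would go into the q parameter of the search request.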

Configurability

Tokenizers, analyzers, filters are possible but not yet implemented in ApacheSolr

He then showed an XML file which defines what is indexed. Only indexed fields can be searched. Body is not indexed because it’s copied to text. nid is indexed so that we can search content from a specific author.

Index time vs. Query time

Custom pipeline for indexing
Must be mirrored (roughly) at query time.
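Why the pipeline must be mirrored is easiest to see with a toy example. The Python below is not how Solr works internally, just a minimal inverted index with a made-up analysis step (split into words, lowercase). If the query goes through the same analysis as the indexed text, it matches; if it goes through a different one, it silently finds nothing.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


class TinyIndex:
    """Minimal inverted index: token -> ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index time: store the analyzed tokens, not the raw text.
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query, analyzer=analyze):
        # Query time: the query goes through an analyzer as well.
        tokens = analyzer(query)
        if not tokens:
            return set()
        hits = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*hits)


index = TinyIndex()
index.add(1, "Next Generation Drupal Search")
index.add(2, "Solr sits on top of Lucene")

# Same analysis on both sides: "DRUPAL" is lowercased and matches.
mirrored = index.search("DRUPAL search")
# Different analysis (plain whitespace split, no lowercasing): no match.
mismatched = index.search("DRUPAL search", analyzer=str.split)
```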

Drupal’s search framework

Any module can define a search tab
Generic keyword handling (type:book, uid: 4459)
Excerpt highlighting
Programmatic searches (do_search)
Dynamic scoring factors
Search theming support
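The generic keyword handling from the list above means that options like type:book or uid:4459 are picked out of the search string and the rest is treated as normal keywords. Drupal does this in PHP; the following is only a rough Python sketch of the idea with a made-up function name, not a port of Drupal’s code.

```python
import re


def split_keywords(search_string):
    """Separate key:value options from the free-text keywords."""
    options = {}

    def collect(match):
        # Remember the option and cut it out of the search string.
        options[match.group(1)] = match.group(2)
        return " "

    remaining = re.sub(r"\b(\w+):\s*(\S+)", collect, search_string)
    return " ".join(remaining.split()), options


keywords, options = split_keywords("solr type:book uid: 4459")
```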

Limitations

Can’t turn off content and user search
Query factor injection is fragile and nearly impossible to understand
Markup weights aren’t configurable
Cron-based indexing is hard to get right
Only one search per page (temp tables). No longer true for Drupal 6.

Example: solr.robshouse.net
This also shows facets, and you can check out the queries which Solr used.

Technorati Tags: fosdem, fosdem2008, opensource, drupal, solr, search

mrtopf.de
