Nelz's Blog

Mah blogginess

Lucene Overview

What Is It?

Lucene (The Website) is an open-source suite of index-based search software projects hosted by the Apache Software Foundation. Lucene (The Project) is the base project at the center of all of the other projects presented on Lucene (The Website).

My Perspective

One could use Lucene to search any type of text, be it from HTML pages, from Excel or Word documents, or from a database. I am using Lucene to search for the members of a website, based on information entered by those members. This information is primarily stored in a database.

Why is Lucene Better than Database Searching?

Imagine that you want to search for the word "balloon" across three different columns in your database. "Balloon" doesn’t need to be in all three columns, but it must be in at least one of them. "Yeah," you say to yourself, "that wouldn’t be too hard an SQL query to write."
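For reference, here is roughly what that "not too hard" query looks like. This is only a sketch: the `members` table, its three text columns (`bio`, `interests`, `tagline`), and the sample rows are all made up for illustration, and it runs against an in-memory SQLite database rather than a real one.

```python
import sqlite3

# Hypothetical schema: a members table with three free-text columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    "id INTEGER PRIMARY KEY, bio TEXT, interests TEXT, tagline TEXT)"
)
conn.executemany(
    "INSERT INTO members (id, bio, interests, tagline) VALUES (?, ?, ?, ?)",
    [
        (1, "I twist balloon animals", "juggling", "clown for hire"),
        (2, "Zoo keeper", "giraffe care", "tall friends"),
        (3, "Pilot", "hot air balloon rides", "up, up and away"),
    ],
)


def find_members(term):
    """Return ids of members where the term appears in at least one column."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM members "
        "WHERE bio LIKE ? OR interests LIKE ? OR tagline LIKE ? "
        "ORDER BY id",
        (pattern, pattern, pattern),
    )
    return [row[0] for row in rows]


matching_ids = find_members("balloon")
```

Note that the query only answers "which rows match?". It says nothing about which match is better, which is where things get harder.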

What if I then tell you that you need to rank the search results by how often the word "balloon" occurs? Oh, and did I mention that we need the ability to weight matches in one column more heavily than matches in another? This whole searching thing becomes a much harder task…
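To make the ranking requirement concrete, here is a toy sketch of frequency-based scoring with per-column weights. It is not how Lucene scores documents; the field names, the weights, and the sample members are assumptions picked purely to show the idea.

```python
import re
from collections import Counter

# Hypothetical weights: a hit in "interests" counts three times a hit in "bio".
FIELD_WEIGHTS = {"bio": 1.0, "interests": 3.0, "tagline": 2.0}

# Made-up sample data, keyed by member id.
MEMBERS = {
    1: {"bio": "balloon artist who loves every balloon",
        "interests": "juggling", "tagline": "clown for hire"},
    2: {"bio": "zoo keeper",
        "interests": "balloon rides", "tagline": "tall friends"},
    3: {"bio": "pilot",
        "interests": "sailing", "tagline": "up and away"},
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(member, term):
    """Sum (column weight x number of occurrences) over all columns."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(member.get(field, "")))
        total += weight * counts[term]
    return total


def rank(members, term):
    """Return (member_id, score) pairs for matches, best score first."""
    hits = []
    for member_id, member in members.items():
        member_score = score(member, term)
        if member_score > 0:
            hits.append((member_id, member_score))
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits


ranking = rank(MEMBERS, "balloon")
```

With these weights, member 2 (one hit in the heavily weighted `interests` column) outranks member 1 (two hits in `bio`). Doing the same thing in plain SQL is possible, but it is already a lot more work than the simple `LIKE` query.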

Now, when I tell you that this should scale to include not only "balloon", but also "giraffe" and "cotton candy" and a varying number of other phrases, we’ve basically put the whole SQL-based searching option to bed. Yeah, you could come up with a whole code-plus-SQL framework to do all these things dynamically… But why bother? The Lucene project has already solved all these issues for you.

The Two Sides of Search

Lucene is an index-based search library. This means that the information to be searched must first be converted and pre-processed into efficiently searchable chunks of data. This data (a.k.a. the index) is kept in a Lucene-specific set of files on a file system. Searching the index requires knowledge of the pre-processing and conversion conventions that were used to build it.
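If "index-based" sounds abstract, a toy inverted index shows the general idea. This is a minimal sketch and not Lucene's actual data structure or file format; the `tokenize`, `build_index`, and `search` names are invented for this example. The thing to notice is that the query goes through the same pre-processing as the indexed text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """The shared pre-processing convention: lower-case, split into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


documents = {
    1: "Balloon animals",
    2: "Cotton candy and a balloon",
    3: "Giraffe",
}
index = build_index(documents)
balloon_hits = search(index, "BALLOON")
```

Searching for "BALLOON" still finds documents 1 and 2, because the query is lower-cased the same way the documents were. If the search side used a different convention, it would look up tokens that were never stored.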

This is why I say there are two sides to search. The first side is the involved process of creating the index. The second side is actually querying the data in the index.

It turns out that very little needs to connect the two sides of searching, other than the index itself. You could, theoretically, build your index using Lucy (the loose C port of the Lucene Java library) and consume that index Java-style in your Java-based web application, provided both sides read and write the same index file format.
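Here is a small sketch of that separation, with a JSON file standing in for the index. The file layout and the `write_index` / `query_index` names are assumptions made up for this example and have nothing to do with Lucene's real index format. The only things the two functions share are the file on disk and the lower-casing convention.

```python
import json
import os
import re
import tempfile


def write_index(documents, path):
    """Indexing side: turn the documents into a token -> ids map on disk."""
    index = {}
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            ids = index.setdefault(token, [])
            if doc_id not in ids:
                ids.append(doc_id)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)


def query_index(path, term):
    """Searching side: only needs the file and the lower-casing convention."""
    with open(path, encoding="utf-8") as handle:
        index = json.load(handle)
    return sorted(index.get(term.lower(), []))


# Build the index in one step, then query it in a completely separate step.
with tempfile.TemporaryDirectory() as folder:
    index_path = os.path.join(folder, "index.json")
    write_index({1: "Balloon", 2: "giraffe balloon"}, index_path)
    hits = query_index(index_path, "Balloon")
```

Because the searching side never touches the original documents, the two functions could just as well live in different programs, as long as both agree on the file format.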

This is all I have time for right now. I plan to make several more Lucene-based posts over the next couple of weeks. Let me know if you have any questions on what I’ve posted so far…