Nelz's Blog

Mah blogginess

Lucene - Ranged Numerical Search

(This is the 2nd part of my ongoing series on Lucene, which I have been working with as of late. Also see the first part: Lucene Overview.)

I was psyched when I read that Lucene did ranged searches. This was going to be a big help when searching on member age, or even with latitude/longitude searches. The thing I glazed over (on the first reading, at least) was that Lucene does lexicographical sorting. (Which can also colloquially be referred to as alphabetical sorting.)

To see an example of some of the problems with lexicographical sorting, here is a quick Java/TestNG method that can display a lexicographically sorted set of strings:

public void example() {
final List<String> list = new ArrayList<String>();
list.add("<string to sort>");
for (String tmp : list) {System.out.println(tmp);}

Magnitudes of Numbers

Using the method above, what happens when we sort the four following strings {“1.00”, “2.00”, “11.00”, “12.00”}? We get the following ordering:

  1. “1.00”
  2. “11.00”
  3. “12.00”
  4. “2.00”

That doesn’t really meet our needs, but there is an answer to this issue, and it has to do with formatting. If you can figure out the entire conceivable range of values, write your entry into the index using a padded formatter. Example: For a weight range in Kg (but also calculated from Lbs), we will probably see a values from about 45Kg to 150Kg. Let’s go for 3 digits before the decimal place, and 4 decimal digits of precision. Then you will probably want to use the following:

public static final DecimalFormat DECIMAL_FORMAT = new DecimalFormat("000.0000");

The great thing about this formatting trick is that it is automatically going to be understood when you convert the String back to a Double.

Negative Numbers

Using the method above again, notice what happens when we sort the four following strings {“-11.00”, “-01.00”, “02.00”, “12.00”}? We get the following ordering:

  1. “-01.00”
  2. “-11.00”
  3. “02.00”
  4. “12.00”

Now, we get negative numbers first, in descending (value, not lexicographical) order, and then the positive numbers in ascending (value) order.

We have two options here:

  • Either make our query builder smart enough to recognize when a query would bridge the negative to positive value ranges, and break the query into two pieces.
  • Add an offset to your values so that even the smallest (most negative) possible value can still be represented as a positive value. Again, this requires adding intelligence to your query builder.

In the end, we chose the latter of the two options… We figured it was going to be easier in the end to have a single query search over a contiguous range rather than conditionally breaking a query into separate positive and negative pieces.

For example, when we were trying to index on Longitudes, the possible values are -180 degrees to +180 degrees. Since we’re dealing in trigonometric degree measurement, we opted to use 360 degrees as our offset. (Since trig-wise -179 deg = -179 deg + 360 deg = 181 deg…)

Remember the trade-off though. You now have to add the offset at two times, once when you are creating the index, and once when you are creating your query against this index. This is far from an insurmountable problem, but you have to keep this additional point of synchronization in mind.

Well, that’s all I’ve got for today. Let me know if you have any questions.