Lucene conference touches many areas of growth in search

With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.

Hence the importance of search engines like Solr, based on the Lucene library. Both are open source Apache projects; Lucid Imagination, a company founded to commercialize the underlying technology, employs a number of their committers. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the directions much of the computer industry is headed.


Wait till they get big

In his opening remarks, CEO Paul Doscher showed some statistics from the sign-ups and indicated that many of the 350 attendees were new to Lucene and Solr; one third had less than a year of experience. That explains to me why turnout for the regular tracks was higher than for the new “big data” track on advanced processing and performance issues, which I had expected to draw more participants. Speakers in the big data track had some fascinating applications to show off, suggesting that this is a case of the future not being evenly distributed.

Thus, Mark Davis gave a fast-paced presentation on the use of Solr along with Hadoop, <a href="http://mahout.apache.org/">Mahout</a>, and systems hosting GPUs at the information processing firm Kitenga. A RESTful API from LucidWorks Enterprise gives Solr access to Hadoop to run jobs. Glenn Engstrand described how Zoosk, the “Romantic Social Network,” keeps slow operations on the update side so that searches can be simple and fast. As in many applications, Solr at Zoosk pulls information from MySQL. Other tools they use include the High-speed ObjectWeb Logger (HOWL) to log transactions and RabbitMQ for auto-acknowledge messaging. HOWL is also useful for warming Solr’s cache with recent searches, because certain operations flush the cache.

Along these lines, Apache has released a replication tool called SolrCloud that is supposed to make it much easier to manage sharding (partitioning) and multiple servers in Solr. Lucid Imagination used the show to announce its LucidWorks Big Data platform, now accepting beta applicants, which will allow organizations to do pretty much what Davis described in his talk without having to configure all the tools on local systems. I suspect that the first uses of this cloud service will be limited to early adopters, but that by next year both the “big data” presentations and LucidWorks Big Data will be popular.

The flexibility of a good search

Several presenters pointed out that Google has spoiled users: they now expect every commercial site, health provider, or other major organization to provide a local search with Google-like features, including auto-completion and auto-suggestion, fuzzy matching and spelling correction (“Did you mean to search for…?”), and of course highly relevant “give me what I’m thinking of” results.

Many companies offer search solutions–and O’Reilly actually has a book on another open source project with some very sophisticated back-end features, Introduction to Search with Sphinx–but Lucene, with its strong Apache branding, is the most popular open source solution, and (again according to Paul Doscher) probably the most popular independent search engine anywhere.

Sudarshan Gaikaiwari presented a talk on auto-completion, concentrating on geospatially informed results. For instance, if you enter “pi” into a search box, you may be presented with pizza joints, piano bars, and other popular searches within a few miles. Gaikaiwari achieved this through careful mining of log files and by prepending a hidden prefix to the search term (for instance, “pi” can be expanded to “times square new york city pi”). The long prefix matters because the longer a search string is, the fewer results have to be considered and the quicker you can present suggested search items while the user is still typing. (To feel responsive, a site should present results to the user within 140 milliseconds.)
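To make the hidden-prefix trick concrete, here’s a toy sketch. The popular-query list and the expansion string are invented stand-ins, not Gaikaiwari’s actual data; the point is only that a longer key narrows the candidate scan to a small contiguous run of a sorted list:

```python
import bisect

# Hypothetical log-mined popular queries, kept sorted so that all
# completions of any prefix sit in one contiguous run.
popular = sorted([
    "boston pizza",
    "times square new york city pianos",
    "times square new york city pilates",
    "times square new york city pizza",
])

def complete(user_text, hidden_prefix, limit=10):
    """Return popular queries starting with hidden_prefix + user_text."""
    key = hidden_prefix + user_text
    # bisect_left finds the first candidate >= key; walk forward
    # until the prefix no longer matches.
    start = bisect.bisect_left(popular, key)
    out = []
    for q in popular[start:start + limit]:
        if not q.startswith(key):
            break
        out.append(q)
    return out

complete("pi", "times square new york city ")
```

With the hidden prefix, the two characters the user typed select only the three Times Square queries; without it, the scan would also have to consider “boston pizza” and every other “pi…” query in the log.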

The geospatial information is retrieved through geohashes, a way of representing cells of the world’s grid as short strings. Shorter strings represent larger geographical areas, and as you add a character to the end of the string you zoom in on a smaller area. By mixing four-character and five-character strings, you can cover a reasonable area in which to show local search results.
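The encoding itself is simple enough to sketch: alternate between halving the longitude and latitude ranges, emit a 1 or 0 depending on which half the point falls in, and pack every five bits into a base-32 character. This is the standard geohash algorithm, not anything specific to Gaikaiwari’s system:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, length=5):
    """Encode a lat/lon point as a geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even, out = 0, 0, True, ""
    while len(out) < length:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:          # point is in the upper half: emit a 1 bit
            bits = (bits << 1) | 1
            rng[0] = mid
        else:                   # lower half: emit a 0 bit
            bits = bits << 1
            rng[1] = mid
        even = not even         # alternate longitude/latitude
        bit_count += 1
        if bit_count == 5:      # five bits make one base-32 character
            out += BASE32[bits]
            bits, bit_count = 0, 0
    return out

geohash(57.64911, 10.40744, 11)  # → "u4pruydqqvj"
```

The prefix property the talk relied on falls out for free: `geohash(57.64911, 10.40744, 5)` is `"u4pru"`, the larger cell containing the eleven-character one.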

Some of the other interesting parts of Gaikaiwari’s talk included:

  • Check each search string you recommend against the main search index to make sure that someone clicking on that search string will come up with at least one actual document. Users quickly come to distrust your recommendations if they click one and come up with an empty result set.

  • Measure “time to first click” to check how good your recommendations are. This metric is valuable because it combines two important criteria: presenting suggestions to the user quickly and success in actually producing a suggestion the user likes. Gaikaiwari also listed several other metrics.
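The first point above reduces to a simple filter. A sketch, with a stand-in for the real index lookup (in practice `count_hits` would be a wrapper that queries the search engine and reads the hit count, e.g. `numFound` in a Solr response; the query names below are invented):

```python
def validate_suggestions(suggestions, count_hits):
    """Keep only suggestions that would return at least one document.

    count_hits: a callable mapping a query string to its number of hits
    in the main search index.
    """
    return [s for s in suggestions if count_hits(s) > 0]

# Toy stand-in for a real index lookup:
fake_index = {"red shirt": 12, "red shoes": 3, "red shrubbery": 0}
validate_suggestions(["red shirt", "red shrubbery", "red shoes"],
                     lambda q: fake_index.get(q, 0))
```

Run as a batch job over the mined suggestion set, this keeps dead queries out of the auto-complete list before users ever see them.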

Interestingly, search engines such as Solr and Sphinx functioned as NoSQL replacements for relational databases (though usually used to offload the search function from these databases) long before the term NoSQL was invented. Although people don’t tend to think of the search tools in that light, they do in fact work like NoSQL in that they perform specific functions more efficiently than a relational database can, and they sometimes compete with document stores like CouchDB and MongoDB. But search engines have evolved tremendously to intersect with the worlds of taxonomy and analytics. And now they’re dealing with big enough data sets to require sharding and replication as well.

  • http://www.flax.co.uk/blog Charlie Hull

    Although Lucid Imagination does employ some Lucene/Solr committers, the company is not the only one maintaining the software – it is an Apache open source project.


  • Anon

    Andy,

    You should be aware of ElasticSearch. Also based on Lucene, ElasticSearch is about a year ahead of Solr.

    Interesting how at *Lucene* Revolution conference there is *zero* mention of ElasticSearch.
