How to Perform Fuzzy-Matching with Mongo Connector and Elasticsearch

Aug 26

By Luke Lovett, Python Engineer at MongoDB

Introduction

Suppose you’re running MongoDB. Great! You can find exact matches to all the queries you can throw at the database. Now, imagine that you’re also building a text-search feature into your application. It has to draw words out of misspelled noise, and results may match on synonyms, too! For this daunting task you’ve chosen to use one of the Lucene-based projects, Elasticsearch or Solr. But now you have a problem: how will this tool search through your documents stored in MongoDB? And how will you keep the contents of the search engine up-to-date?

Mongo Connector fills a gap between MongoDB and some of the best search tools out there, such as Elasticsearch and Solr. It not only exports data from your MongoDB replica set or sharded cluster to these systems, but also keeps your data consistent between them: as you insert, update, and remove documents in MongoDB, these changes are soon reflected on the other side through Mongo Connector. You may even use Mongo Connector to stream changes performed on one replica set primary to another, thus simulating a “multi-master” cluster.

When Mongo Connector saw its first release in August of 2012, it was very limited in its capabilities and lacked fault tolerance. I’ve been working on Mongo Connector since November 2013 with the help of the MongoDB Python team, and I’m glad to say that Mongo Connector has come a long way in terms of the features it provides and (especially) stability. This post will show off some of these new features and give an example of how to replicate operations from MongoDB to Elasticsearch, an open-source search engine, using Mongo Connector. At the end of this post, we’ll be able to make fuzzy-match text queries against data streaming into Elasticsearch.

Getting our Dataset

For this post, we’ll be pulling in posts from the popular link aggregation website, Reddit. We recently added safe encoding of the data types supported by MongoDB (i.e., BSON types) into types that external database drivers (in this case, elasticsearch-py) can handle. This makes Mongo Connector safe to use for replicating documents whose content we may not have much control over (e.g., from web scraping). Using this script, which pulls new posts from Reddit, we’ll stream new Reddit posts to MongoDB:

./reddit2mongo --mongo-host localhost --mongo-port 27017

As each post is processed, you should see the first 20 characters of its title. This is (I admit, slowly, thanks to Reddit API limits) emulating the inserts into MongoDB that your application is making.

Firing up the Connector

Next, we’ll start Mongo Connector. To download and install Mongo Connector, you can use pip:

pip install mongo-connector

For this demonstration, we’ll assume that you already have Elasticsearch set up and running on your local machine, listening on port 9200. You can start replicating from MongoDB to Elasticsearch using the following command:
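The original command is not reproduced in this copy of the post, so here is a minimal sketch of what such an invocation might look like, built in Python. The `-m`, `-t`, and `-d` options are mongo-connector's flags for the MongoDB address, the target URL, and the doc manager. The doc manager name `elastic_doc_manager` is an assumption: the value this option expects depends on the version of Mongo Connector you have installed, so check its documentation.

```python
import shlex


def connector_command(mongo_addr="localhost:27017",
                      target_url="localhost:9200",
                      doc_manager="elastic_doc_manager"):
    """Return the argv list for a mongo-connector run.

    The defaults mirror this post's setup; the doc manager name is an
    assumption and may differ between Mongo Connector versions.
    """
    return [
        "mongo-connector",
        "-m", mongo_addr,    # MongoDB instance whose operations are replicated
        "-t", target_url,    # target system, here Elasticsearch
        "-d", doc_manager,   # doc manager that writes to the target
    ]


command = connector_command()

# Print the equivalent shell command line.
print(" ".join(shlex.quote(arg) for arg in command))

# To actually launch the connector from Python:
#     import subprocess
#     subprocess.run(command, check=True)
```

Keeping the arguments in a list (rather than one string) avoids shell-quoting problems if you later launch the process with `subprocess`.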

Of course, if we only want to perform text search on post titles and text, we can restrict which fields are passed through to Elasticsearch using the --fields option. This way, we can minimize the amount of data we are actually duplicating:
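To make the effect of field restriction concrete, here is a small sketch of the idea in Python. It is not Mongo Connector's own code: `restrict_fields` is a hypothetical helper, the sample post and the field names `title` and `text` are assumptions about what the scraped documents look like, and keeping `_id` alongside the listed fields is likewise assumed. The last lines show how the same list might be passed to `--fields` as a comma-separated value.

```python
def restrict_fields(document, fields):
    """Keep only the listed fields, plus _id so the document stays addressable."""
    keep = set(fields) | {"_id"}
    return {key: value for key, value in document.items() if key in keep}


# Hypothetical Reddit post as it might be stored in MongoDB.
post = {
    "_id": "abc123",
    "title": "My kitten learned to open doors",
    "text": "He figured out the handle.",
    "score": 42,
    "subreddit": "aww",
}

fields = ["title", "text"]
slim = restrict_fields(post, fields)
print(slim)

# Extra command-line arguments for mongo-connector (comma-separated list).
fields_args = ["--fields", ",".join(fields)]
print(" ".join(fields_args))
```

Only the title, text, and identifier would reach the search index in this model; the score and subreddit stay in MongoDB alone.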

Just as you see the Reddit posts printed to STDOUT by reddit2mongo, you should see output coming from Mongo Connector logging the fact that each document has been forwarded to ES at about the same time! What a beautiful scene!

Searching, Elastically

Now we’re ready to use Elasticsearch to perform fuzzy text queries on our dataset as it arrives from MongoDB. Because we’re streaming directly from Reddit’s website, I can’t really say what results you’ll find in your dataset, but as this particular corner of the internet seems to love cats almost as much as we love search engines, it’s probably safe to say that a query for kitten will get you somewhere:
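The query from the original post is not reproduced here, so the following is only a sketch of how such a search might be issued from Python with the standard library. The `match` query is part of Elasticsearch's query DSL. The field name `title`, the index name `reddit`, and the helper names `match_query` and `search` are assumptions for illustration; the index Mongo Connector writes to depends on the database and collection names your documents come from.

```python
import json
import urllib.request


def match_query(field, text):
    """Build an Elasticsearch search body that matches `text` in `field`."""
    return {"query": {"match": {field: text}}}


def search(body, url):
    """POST a search body to Elasticsearch and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = match_query("title", "kitten")
print(json.dumps(body, indent=2))

# With Elasticsearch running locally (the index name is hypothetical):
#     results = search(body, "http://localhost:9200/reddit/_search")
#     for hit in results["hits"]["hits"]:
#         print(hit["_source"].get("title"))
```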

Because we’re performing a fuzzy search, we can even do a search for the non-word kiten. Since most people aren’t too careful with their spelling, you can imagine how powerful this feature is when performing text searches based directly on a user’s input:
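A sketch of what the misspelled search might look like as a request body follows. Elasticsearch's `match` query accepts `fuzziness` and `prefix_length` options. The field name `title` and the helper `fuzzy_match_query` are assumptions, as is the fuzziness value of 1; a prefix length of 1 corresponds to requiring the first letter to match.

```python
import json


def fuzzy_match_query(field, text, fuzziness=1, prefix_length=1):
    """Build a match query that tolerates small misspellings of `text`."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,          # maximum edit distance allowed
                    "prefix_length": prefix_length,  # leading characters that must match exactly
                }
            }
        }
    }


body = fuzzy_match_query("title", "kiten")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a POST request to the index's `_search` endpoint.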

The fuzziness parameter sets the maximum “edit distance” allowed between the text query and a field value for the two to match. The prefix_length parameter sets how many leading characters must match exactly; here, results have to match the first letter of the query. This article offers a great explanation of how this works. This search yielded the same results for me as its properly-spelled version.
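To see why kiten finds kitten, here is a simplified model of the two parameters in Python. It uses plain Levenshtein distance (insertions, deletions, substitutions) and compares whole strings; Elasticsearch's real implementation lives in Lucene, works on analyzed terms, and differs in details such as how transpositions are counted. The helper names `edit_distance` and `would_match` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def would_match(query, term, fuzziness, prefix_length):
    """Simplified fuzzy test: exact prefix, then bounded edit distance."""
    if query[:prefix_length] != term[:prefix_length]:
        return False
    return edit_distance(query, term) <= fuzziness


print(edit_distance("kiten", "kitten"))                             # 1
print(would_match("kiten", "kitten", fuzziness=1, prefix_length=1))  # True
print(would_match("kiten", "mitten", fuzziness=2, prefix_length=1))  # False
```

In this model, mitten is within two edits of kiten but is still rejected, because the required one-letter prefix differs.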

More than just Inserts

Although our demo only took advantage of continuously streaming documents from MongoDB to Elasticsearch, Mongo Connector is more than just an import/export tool. When you update or delete documents in MongoDB, those operations are replicated to your other systems as well, keeping all systems in sync with the current primary of a replica set. If the primary fails over and a rollback occurs, Mongo Connector can detect this and do the Right Thing to maintain consistency regardless.

Recap

The really great thing about this is that we’re performing operations in MongoDB and Elasticsearch at the same time. Without a tool like Mongo Connector, we would have to use a tool like mongoexport to dump data from MongoDB periodically into JSON, then upload this data into an empty Elasticsearch index, so we don’t have previously-deleted documents hanging around. This would probably be an enormous hassle, and we would lose the near real-time capability of our ES-powered search engine.

Although Mongo Connector has improved substantially since its first release, it’s still an experimental project and has a ways to go before official support by MongoDB, Inc. However, I am committed to answering questions as well as reviewing feature requests and bug reports filed on Mongo Connector’s issues page on GitHub. Also be sure to check out the full documentation on its GitHub wiki page.

