Posts tagged:

navigation

Faceted Search with MongoDB

Aug 30 • Posted 11 months ago

By Jon Rangel, MongoDB Consulting Engineer

Introduction

Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. It is increasingly seen as an important part of the UI for many search platforms, and indeed nowadays is pretty much expected in places such as e-commerce websites.

Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on.

Some core functionality that a faceted search feature should provide might include:

  • finding the items that match a particular value of a certain facet (e.g. colour:blue)
  • finding the items in the intersection of multiple facet values (e.g. colour:blue AND size:medium)
  • finding the items in the union of multiple facet values (e.g. colour:blue OR colour:red OR size:large)
  • for each possible facet filter combination, display to the user the possible facet values on which it is possible to filter further (“drill down”)
  • for each facet value on which it is possible to drill down, display to the user the count of items matching that filter.

In this article, we’ll look at implementing the above faceted search functionality using a pure MongoDB solution. We’ll examine a number of approaches to solving this problem, and discuss their relative performance characteristics and any other pros/cons. We will also introduce some third party tools that, alternatively, can integrate with MongoDB to provide faceted search functionality.

Navigating a Book Store

Suppose we want to build faceted search functionality for a product catalog for a book store. A typical document representing a publication in the catalog might look something like the following:

  {
        _id : 123,
        title : "MongoDB: The Definitive Guide",
        authors : [ "Kristina Chodorow" ],
        publication_date : ISODate("2013-05-23"),
        pages : 432,
        edition : 2,
        isbn_10 : 1449344682,
        isbn_13 : 978-1449344689,
        language : "English",
        publisher : {
            name: "O’Reilly Media",
            ...
        },
        last_updated : ISODate("2013-05-16"),
        ...
    } 

First off, let’s state some reasonable assumptions about the facets for this (or indeed any other) catalog:

  • The total number of facets will be small.
  • The total number of possible facet values for each facet may be large, but will typically be small.
  • Each item in the catalog may have zero or more facet values (“tags”) for each facet (but typically one).
  • The facets are well-known up front, and change rarely if at all. The set of facet values may change frequently i.e. any time the product catalog is updated to add/remove items, or change the tags on existing items.
  • The application has knowledge of the facets being used, but not the set of all possible facet values that exist in the catalog for each of those facets.

For this example, let’s say we have three facets on which we wish to search — Subject, Publisher and Language — and consider how to search efficiently, and how to generate the faceted navigation meta-data to present to the user. We will test on some pre-generated test data based on a real-world product catalog.

Searching

The first part of the problem to solve is how to efficiently search for items in the product catalog. A few schema and indexing approaches are presented below.

Solution #1

One way to define the facet tags for a publication would be to store all facet types and values in subdocuments in an array, as follows:

  {
        _id: 123,
        ...
        facets1 : [
            {
                type : "subject",
                val : "MongoDB"
            },
            {
                type : "subject",
                val : "Databases"
            },
            {
                type : "publisher",
                val : "O'Reilly Media"
            },
            {
                type : "language",
                val : "English"
            }
        ]
    } 

A single ‘generic’ compound index can then be created containing all the facets and facet values:

 > db.books.ensureIndex({"facets1.type" : 1, "facets1.val" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 29891456,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536
        },
        "ok" : 1
    }

See this blog post for a good treatment on these kinds of generic indexes.

Let’s see how this performs for some faceted searches, using explain(). We’ll look at queries on a single facet tag to start with.

Find all books about databases:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 27,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Find all books by a specific publisher:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 39960,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 133,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Both of these queries use the index optimally as the number of documents returned is the same as the number of documents scanned (nscanned is the same as n).

How about queries for documents matching the union or intersection of multiple facet values? To do these “and”/”or” queries we use the $all/$in operators respectively.

Find all books about databases OR published by O’Reilly Media:

  > db.books.find(
    ...     { "facets1" :
    ...         { "$in" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    Fri Aug 16 15:59:04.989 JavaScript execution failed: error: 
    { "$err" : "$elemMatch not allowed within $in", "code" : 15881 } at src/mongo/shell/query.js:L128

Oops! This type of search doesn’t work using $in to construct the query as we cannot use the $elemMatch operator within a $in clause. This query can instead be constructed using the $or operator:

  > db.books.find(
    ...     { "$or" : [
    ...             { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } },
    ...             { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ...         ]
    ...     }
    ... ).explain()
    {
        "clauses" : [
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 40019,
                "nscannedObjects" : 40019,
                "nscanned" : 40019,
                "nscannedObjectsAllPlans" : 40019,
                "nscannedAllPlans" : 40019,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 118,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "publisher",
                            "publisher"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "O'Reilly Media",
                            "O'Reilly Media"
                        ]
                    ]
                }
            },
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 6640,
                "nscannedObjects" : 7374,
                "nscanned" : 7374,
                "nscannedObjectsAllPlans" : 7374,
                "nscannedAllPlans" : 7374,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 1,
                "nChunkSkips" : 0,
                "millis" : 123,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "subject",
                            "subject"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "Databases",
                            "Databases"
                        ]
                    ]
                }
            }
        ],
        "n" : 46659,
        "nscannedObjects" : 47393,
        "nscanned" : 47393,
        "nscannedObjectsAllPlans" : 47393,
        "nscannedAllPlans" : 47393,
        "millis" : 242,
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index is used for both parts of the “or” statement.

Next, find all books about databases AND published by O’Reilly Media:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 118,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query uses the index, but is not optimal as many more documents are scanned than returned. Note that the number of documents scanned is the same as the number of books by this publisher (as seen from the previous query) — this is because at present $all only uses the index for the first element in the query array.

The performance of these kinds of queries will improve significantly once MongoDB supports index intersection, which is a feature that is coming soon (see SERVER-3071). With single index intersection, queries like the above will not need to scan more documents than those returned. In the meantime, to optimize these kinds of queries put the most selective filter criterion as the first element of the $all array if possible to minimize scanning:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } },
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }
Solution #2

Store all facet types and values in in an array, but instead of each element of the array being a subdocument, concatenate the facet type name and value into a single string value:

 {
        _id: 123,
        ...
        facets2 : [
            "subject:MongoDB",
            "subject:Databases",
            "publisher:O'Reilly Media",
            "language:English"
        ]
    }

Create an index on the facets field:

  > db.books.ensureIndex({"facets2" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 55694912,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456
        },
        "ok" : 1
    }

Now let’s try some of the same queries as before. First, a simple query on a single facet value (all books about databases):

   > db.books.find(
    ...     { "facets2" : "subject"+":"+"Databases" }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 28,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This works exactly as expected.

Now, lets try an “or” query (all books about databases OR published by O’Reilly Media):

    > db.books.find(
    ...     { "facets2" :
    ...         { "$in" : [
    ...             "publisher"+":"+"O'Reilly Media",
    ...             "subject"+":"+"Databases"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1 multi",
        "isMultiKey" : true,
        "n" : 46600,
        "nscannedObjects" : 47275,
        "nscanned" : 47276,
        "nscannedObjectsAllPlans" : 47275,
        "nscannedAllPlans" : 47276,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 117,
        "indexBounds" : {
            "facets2" : [
                [
                    "publisher:O'Reilly Media",
                    "publisher:O'Reilly Media"
                ],
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index bounds look sensible, showing that the index is used for both elements of the $in array. Note that $in may be used to construct this type of query since we don’t need to use the $elemMatch operator with this schema.

Finally, an “and” query (all books about databases that are published by O’Reilly Media):

  > db.books.find(
    ...     { "facets2" :
    ...         { "$all" : [
    ...             "subject"+":"+"Databases",
    ...             "publisher"+":"+"O'Reilly Media"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

If you’ve been following so far, you won’t be too surprised to see that, unfortunately, this performs exactly the same as in solution #1, for the same reasons described there. Index intersection is coming soon though!

Solution #3

Consider the following schema, where each facet is a field in a subdocument, associated with an array of the tags for that facet:

 {
        _id: 123,
        ...
        facets3 : {
            subject : [ "MongoDB", "Databases" ],
            publisher : [ "O'Reilly Media" ],
            language : [ "English" ]
        }
    }

Add an index on each facet individually:

  > db.books.ensureIndex({"facets3.subject" : 1})
    > db.books.ensureIndex({"facets3.publisher" : 1})
    > db.books.ensureIndex({"facets3.language" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 75464480,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456,
            "facets3.subject_1" : 12084128,
            "facets3.publisher_1" : 2321984,
            "facets3.language_1" : 5363456
        },
        "ok" : 1
    }

This solution has the same performance characteristics as the first two solutions, with the additional benefit that the total size of the indexes required is significantly smaller. This is because we are not storing the facet names in the index for each indexed value.

Once index intersection using multiple indexes is supported (which is also coming under SERVER-3071), this approach will also perform well for “and” queries.

Generating the Faceted Navigation Information

The other part of the faceted search problem is how to most efficiently generate and return the faceted search meta-data. One way to do this would be to use the Aggregation Framework to calculate this information on-the-fly.

For example, to get all the facet values for the collection and the count of documents associated with each one, we could perform the following aggregation query (assuming schema #2 as above):

 > db.books.aggregate([{ "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", count : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 39960
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

Then, as the user drills down using the facets, we need to add the filter predicates to the aggregation query. For instance, if the user clicks on the “Databases” subject facet, we can obtain the facet values and counts for documents matching this filter as follows:

 > db.books.aggregate([{ "$match" : { "facets2" : "subject"+":"+"Databases" } },
                          { "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", "count" : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 675
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

The downside to this approach is that it incurs the overhead of an additional aggregation query each time the user queries the product catalog. Furthermore, for certain choices of schema (e.g. solution #3 above) we actually need to do one aggregation query per distinct facet.

It’s reasonable to assume that the product catalog will be updated much less frequently than it is queried, therefore it may well make sense to pre-compute the faceted navigation meta-data and store it in a separate collection. Consider the following schema for a collection of faceted navigation documents:

 {
        _id : "'facet_filter_string",
        value : {
            count : 12,
            facets : {
                facet1_name : {
                    facet1_val1 : 8,
                    facet1_val2 : 12,
                    ...
                },
                facet2_name : {
                    facet2_val1 : 5,
                    ...
                },
                ...
            }
        }
    }

where <facet_filter_string> is either the empty string (for the document representing the root of the faceted navigation) or one or more of “|<facet_name>:<facet_filter_val>|" concatenated together.

Then, to find the faceted navigation information pertaining to all books about databases, the following simple query on _id will do the job:

 > db.facetnav.find({_id:"|subject:Databases|"}).pretty()
    {
        "_id" : "|subject:Databases|",
        "value" : {
            "count" : 7315,
            "facets" : {
                "publisher" : {
                    "O'Reilly Media" : 675,
                    "Pub2" : 3605,
                    "Pub3" : 185,
                    "Pub4" : 305,
                    "Pub5" : 2505,
                    "Pub6" : 15,
                    "Pub7" : 25
                },
                "language" : {
                    "English" : 7250,
                    "French" : 1095,
                    "German" : 1290
                }
            }
        }
    }

Note that it’s not necessary to generate a document like the above for every single permutation of facet filters, only for each unique combination of filters according to some predetermined canonical ordering of facets (e.g. Subject, Publisher, Language). We can then ensure that the application always builds the _id string with which to query using this canonical ordering.

The faceted navigation meta-data collection can be generated quite easily using a Map-Reduce job. For some example code that does this, take a look at my GitHub repo. With the map and reduce functions defined there, the facetnav info for the entire product catalog can be generated as follows:

 > db.books.mapReduce(mapFn, reduceFn, { "out" : "facetnav" })
    {
        "result" : "facetnav",
        "timeMillis" : 117529,
        "counts" : {
            "input" : 105280,
            "emit" : 2423080,
            "reduce" : 63850,
            "output" : 1599
        },
        "ok" : 1,
    }

Subsequently, whenever the product catalog is updated, the facetnav collection can be quickly updated by specifying that the map-reduce job operate only on the recently updated items and fold those changes in to the existing facetnav collection. For example:

  > db.books.ensureIndex({"last_updated : 1"})
    > db.books.mapReduce(mapFn, reduceFn,
    ...                  { "query" : { "last_updated" : { "$gt" : new Date(2013,7,1) } },
    ...                    "out" : { "reduce" : "facetnav" } })
    {
        "result" : "facetnav",
        "timeMillis" : 724,
        "counts" : {
            "input" : 1000,
            "emit" : 13484,
            "reduce" : 198,
            "output" : 1599
        },
        "ok" : 1,
    }

Third-Party Tools

There are a number of search engine software packages that provide faceted search capabilities. These typically provide the core functionality we have described above, plus more advanced features such as more convenient searching on ranges of facet values (e.g. finding documents that fall within a certain date or price range) or auto-completion (i.e. displaying relevant suggestions, grouped by facet, as a user types in a search query).

The trade-offs with using an additional search engine are:

  • Extra complexity due to adding another ‘moving part’ to your deployment
  • Your application must deal with the fact that the system as a whole is now eventually consistent, with respect to the data stored in MongoDB versus the data stored in the external search engine. This may be undesirable, particularly for a product catalog that changes very frequently, for example.

Two of the most popular search engines are Solr and ElasticSearch which, like MongoDB, are also free and open-source products.

Solr and ElasticSearch can be easily integrated with MongoDB using Mongo Connector, which comes bundled with plugins for interfacing with each of them. Using the appropriate plugin, Mongo Connector can integrate data from MongoDB into the desired target system and keep the two systems in sync.

Conclusion

Faceted search functionality can be implemented in MongoDB, without requiring the use of external search engines. When index intersection arrives, all the types of queries we have examined here will perform optimally. Integrating with an external search engine to provide faceted search is also a good option, and something to consider depending on the specific requirements of your application.

blog comments powered by Disqus