Managing the web nuggets with MongoDB and MongoKit

MongoDB
September 27, 2013 | Updated: May 2, 2018
#Releases

This is a guest post by Nicolas Clairon, maintainer of MongoKit and founder of Elkorado

MongoKit is a Python ODM for MongoDB. I created it in 2009 (when the ODM acronym wasn’t even used) for my startup project called Elkorado. Now that the service is live, I realize that I never wrote about MongoKit. I’d like to introduce it to you with this quick tutorial based on real use cases from Elkorado.

Elkorado: a place to store web nuggets

Elkorado is a collaborative, interest-based curation tool. It was born out of the frustration that there is no place to find quality resources about a particular topic of interest. There are so many blogs, forums, videos and websites out there that it is very difficult to find our way through this massive wealth of information.

Elkorado aims to help people centralize quality content, so they can easily find it later and discover more.

MongoDB to the rescue

Rapid prototyping is one of the most important things in the startup world, and it is an area where MongoDB shines.

The web is changing fast, and so are web resources and their metadata. MongoDB’s schemaless design is a perfect fit for storing this kind of data. After losing hair trying to use polymorphism with SQL databases, I moved to MongoDB… and I fell in love with it.

While playing with the data, I needed a validation layer and wanted to add some methods to my documents. Back then, there was no ODM for Python, so I created MongoKit.

MongoKit: MongoDB ODM for Python

MongoKit is a thin layer on top of Pymongo. It brings field validations, inheritance, polymorphism and a bunch of other features. Let’s see how it is used in Elkorado.

Elkorado is a collection of quality web resources called nuggets. This is how we could fetch a nugget discovered by the user “namlook” with Pymongo:

>>> import pymongo
>>> con = pymongo.Connection()
>>> nugget = con.elkorado.nuggets.find_one({"discoverer": "namlook"})

nugget here is a regular Python dict.

Here’s a simple nugget definition with MongoKit:

import mongokit

connection = mongokit.Connection()

@connection.register
class Nugget(mongokit.Document):
    __database__ = "elkorado"
    __collection__ = "nuggets"
    # needed so that fields can be read as attributes (self.popularity)
    use_dot_notation = True
    structure = {
        "url": unicode,
        "discoverer": unicode,
        "topics": list,
        "popularity": int
    }
    default_values = {"popularity": 0}

    def is_popular(self):
        """ this is for the example purpose """
        return self.popularity > 1000

Fetching a nugget with MongoKit is pretty much the same:

nugget = connection.Nugget.find_one({"discoverer": "namlook"})

However, this time, nugget is a Nugget object and we can call the is_popular method on it:

>>> nugget.is_popular()
True

One of the main advantages of MongoKit is that all your models are registered and accessible via the connection instance. MongoKit looks at the __database__ and __collection__ attributes to know which database and which collection to use, so there is only one place to specify them.
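To make the registration idea concrete, here is a minimal standalone sketch of a registry that works as a class decorator. It is not MongoKit's implementation: the Registry class and its location_of helper are hypothetical names invented for illustration, and the sketch uses plain Python 3 with no database involved.

```python
class Registry:
    """Hypothetical stand-in for a connection that keeps track of models."""

    def __init__(self):
        self._models = {}

    def register(self, cls):
        # Used as a class decorator: remember the class under its name
        # and return it unchanged so the definition still works normally.
        self._models[cls.__name__] = cls
        return cls

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so
        # registry.Nugget resolves to the registered class.
        try:
            return self._models[name]
        except KeyError:
            raise AttributeError(name)

    def location_of(self, name):
        # Read the database and collection names declared on the class.
        cls = self._models[name]
        return cls.__database__, cls.__collection__


registry = Registry()


@registry.register
class Nugget(dict):
    __database__ = "elkorado"
    __collection__ = "nuggets"


print(registry.Nugget is Nugget)
print(registry.location_of("Nugget"))
```

Because the database and collection names live on the class, code that goes through the registry never has to repeat them.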

Inheritance

MongoKit was built from the start to natively support inheritance:

from datetime import datetime

class Core(mongokit.Document):
    __database__ = "elkorado"
    use_dot_notation = True
    structure = {
        "created_at": datetime,
        "updated_at": datetime
    }
    default_values = {
        "created_at": datetime.utcnow,
        "updated_at": datetime.utcnow
    }
    def save(self, *args, **kwargs):
        self.updated_at = datetime.utcnow()
        super(Core, self).save(*args, **kwargs)

In this Core object, we are defining the database name and some fields that will be shared by other models.

If one wants a Nugget object to have date metadata, one just has to make it inherit from Core:

@connection.register
class Nugget(Core):
    __collection__ = "nuggets"
    structure = {
        "url": unicode,
        "topics": list,
        "discoverer": unicode,
        "popularity": int
    }
    default_values = {"popularity": 0}
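One way to picture how a child model can end up with both its own fields and those of its parent is to merge the class attributes along the inheritance chain. The sketch below is an illustration in plain Python 3 under that assumption, not MongoKit's actual code: the Model base class and the merged, new and utcnow helpers are hypothetical, and str stands in for Python 2's unicode.

```python
from datetime import datetime, timezone


def utcnow():
    # Callable default: evaluated each time a document is created.
    return datetime.now(timezone.utc)


class Model(dict):
    structure = {}
    default_values = {}

    @classmethod
    def merged(cls, attr):
        # Walk the MRO from the most generic class to the most specific,
        # so that subclasses extend (and can override) parent entries.
        result = {}
        for klass in reversed(cls.__mro__):
            result.update(vars(klass).get(attr, {}))
        return result

    @classmethod
    def new(cls):
        doc = cls()
        defaults = cls.merged("default_values")
        for field in cls.merged("structure"):
            value = defaults.get(field)
            # Call the default if it is a function, otherwise use it as is.
            doc[field] = value() if callable(value) else value
        return doc


class Core(Model):
    structure = {"created_at": datetime, "updated_at": datetime}
    default_values = {"created_at": utcnow, "updated_at": utcnow}


class Nugget(Core):
    structure = {"url": str, "popularity": int}
    default_values = {"popularity": 0}


print(sorted(Nugget.merged("structure")))
print(Nugget.new())
```

Passing a function such as utcnow as a default, rather than calling it in the class body, is what lets every new document get its own timestamp.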

It’s all about Pymongo

With MongoKit, you are still very close to Pymongo. In fact, MongoKit’s connection, database and collection are subclasses of Pymongo’s. If some algorithm needs raw performance, you can use the Pymongo layer directly, which is blazing fast:

>>> nuggets = connection.Nugget.find() # nuggets is a cursor of Nugget objects
>>> nuggets = connection.elkorado.nuggets.find() # nuggets is a cursor of plain Python dicts

Here, connection is a MongoKit connection but it can be used like a Pymongo connection. Note that, to keep the benefit of DRY, we can call the Pymongo layer from a MongoKit document:

>>> nuggets = connection.Nugget.collection.find() # fast!

A real life “simplified” example

Let’s see an example of CRUD done with MongoKit.

On Elkorado, each nugget is unique, but multiple users can share a nugget with different metadata. Each time a user picks up a nugget, a UserNugget is created with user-specific information. If this is the first time the nugget is discovered, a Nugget object is created; otherwise, it is updated. Here is a simplified UserNugget structure:

from mongokit import ObjectId, Connection

connection = Connection()

@connection.register
class UserNugget(Core):
    __collection__ = "user_nuggets"
    structure = {
        "url": unicode,
        "topics": [unicode],
        "user_id": unicode
    }
    required_fields = ["url", "topics", "user_id"]

    def save(self, *args, **kwargs):
        super(UserNugget, self).save(*args, **kwargs)
        nugget = self.db.Nugget.find_one({"url": self.url})
        if not nugget:
            # first discovery: create the shared Nugget
            nugget = self.db.Nugget()
            nugget.url = self.url
            nugget.discoverer = self.user_id
            nugget.save()
        # atomic update: add the new topics and bump the popularity
        self.db.Nugget.collection.update({"url": self.url}, {"$addToSet": {"topics": {"$each": self.topics}}, "$inc": {"popularity": 1}})

This example illustrates what can be done with MongoKit. Here, the save method has been overridden to check whether a nugget exists (remember, each nugget is unique by its URL). It creates the nugget if it does not exist yet, and then updates it.

Updating data with MongoKit is similar to Pymongo. Use save on the object, or use the Pymongo layer directly to make atomic updates. Here, we use an atomic update to add new topics and increase the popularity:

self.db.Nugget.collection.update({"url": self.url}, {
    "$addToSet": {"topics": {"$each": self.topics}},
    "$inc": {"popularity": 1}
})
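To see what these two operators do to a document, the following sketch simulates them in memory with plain Python 3. It covers only the two operators used here, and apply_update is a hypothetical helper written for illustration; the real work is done atomically by the MongoDB server. Note that $inc takes a mapping from field name to amount, and that $addToSet with $each adds only the values that are not already present.

```python
import copy


def apply_update(document, update):
    """Return a copy of document with $addToSet and $inc applied."""
    result = copy.deepcopy(document)
    for field, spec in update.get("$addToSet", {}).items():
        # {"$each": [...]} adds several values; a bare value adds one.
        if isinstance(spec, dict) and "$each" in spec:
            values = spec["$each"]
        else:
            values = [spec]
        current = result.setdefault(field, [])
        for value in values:
            if value not in current:  # set semantics: no duplicates
                current.append(value)
    for field, amount in update.get("$inc", {}).items():
        # A missing field is treated as 0 before incrementing.
        result[field] = result.get(field, 0) + amount
    return result


nugget = {
    "url": "http://www.example.org/blog/post123",
    "topics": ["example"],
    "popularity": 1,
}
updated = apply_update(nugget, {
    "$addToSet": {"topics": {"$each": ["example", "fun"]}},
    "$inc": {"popularity": 1},
})
print(updated["topics"], updated["popularity"])
```

Since "example" was already in the list, only "fun" is appended, and the popularity goes from 1 to 2.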

Getting live

Let’s play with our model:

>>> user_nugget = connection.UserNugget()
>>> user_nugget.url = u"http://www.example.org/blog/post123"
>>> user_nugget.user_id = u"namlook"
>>> user_nugget.topics = [u"example", u"fun"]
>>> user_nugget.save()

When calling the save method, the document is validated against the UserNugget’s structure. As expected, the fields created_at and updated_at have been added:

>>> user_nugget
{
    "_id": ObjectId("4f314163a1e5fa16fe000000"),
    "created_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "updated_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "url": u"http://www.example.org/blog/post123",
    "user_id": u"namlook",
    "topics": [u"example", u"fun"]
}

and the related nugget has been created:

>>> connection.Nugget.find_one({"url": u"http://www.example.org/blog/post123"})
{
    "_id": ObjectId("4f314163a1e5fa16fe000001"),
    "created_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "updated_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "url": u"http://www.example.org/blog/post123",
    "discoverer": u"namlook",
    "topics": [u"example", u"fun"],
    "popularity": 1
}
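The validation step that runs on save can be pictured as a check of each field against the declared structure. The sketch below is a simplified illustration in plain Python 3, not MongoKit's validator: the validate function and its error messages are hypothetical, it handles only flat fields and typed lists, and str stands in for Python 2's unicode.

```python
def validate(document, structure, required_fields=()):
    """Return a list of error messages; an empty list means the document is valid."""
    errors = []
    for field in required_fields:
        if document.get(field) is None:
            errors.append("%s is required" % field)
    for field, value in document.items():
        if field not in structure:
            errors.append("%s is not in the structure" % field)
            continue
        if value is None:
            continue  # unset optional fields are allowed
        expected = structure[field]
        if isinstance(expected, list):
            # [str] means "a list whose items are all str"
            ok = isinstance(value, list) and all(
                isinstance(item, expected[0]) for item in value
            )
        else:
            ok = isinstance(value, expected)
        if not ok:
            errors.append("%s has the wrong type" % field)
    return errors


structure = {"url": str, "topics": [str], "user_id": str}
required = ["url", "topics", "user_id"]

good = {
    "url": "http://www.example.org/blog/post123",
    "topics": ["example", "fun"],
    "user_id": "namlook",
}
bad = {"url": "http://www.example.org/blog/post123", "topics": "fun"}

print(validate(good, structure, required))
print(validate(bad, structure, required))
```

The first call returns an empty list, while the second reports the missing user_id and the topics field that is a string instead of a list.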

Conclusion

MongoKit is a central piece of Elkorado. It has been written to be small and minimalist but powerful. There is so much more to say about features like inherited queries, i18n and GridFS, so take a look at the wiki to read more about how this tool can help you.

Check the documentation for more information about MongoKit. And if you register on Elkorado, check out the nuggets about MongoDB. Don’t hesitate to share your nuggets as well, the more the merrier.
