Transitioning from Relational Databases to MongoDB - Data Models

MongoDB
January 10, 2014 | Updated: April 27, 2017
#Releases

This post was written by Bryan Reinero, a Consulting Engineer at MongoDB. Understanding how to use MongoDB isn’t difficult, but it does require you to change the way you think about databases if you are coming from a traditional relational database management system (RDBMS). This blog post is designed to help you understand data modeling and schema design in MongoDB from the perspective of someone used to programming with an RDBMS. I’ll explore the fundamental differences between RDBMS and MongoDB and highlight the advantages and design compromises of each approach.

Fundamental Differences The immediate and fundamental difference between MongoDB and an RDBMS is the underlying data model. A relational database structures data into tables and rows, while MongoDB structures data into collections of JSON documents. JSON is a self-describing, human readable data format. Originally designed for lightweight exchanges between browser and server, it has become widely accepted for many types of applications.

JSON documents are particularly useful for data management for several reasons. A JSON document is composed of a set of fields which are themselves key-value pairs. This means each JSON document carries its own human readable schema design with it wherever it goes, allowing the documents to easily move between database and client applications without losing their meaning.

JSON is also a natural data format for use in the application layer. JSON supports a richer and more flexible data structure than tables made up of columns and rows. In addition to supporting field types like number, string, Boolean, etc., JSON fields can be arrays or nested sub-objects. This means we can represent a set of sophisticated relations which are a closer representation of the objects our applications work with. Using JSON documents in our database means we don’t need an object relational mapper between our database and the applications it serves. We can persist our data in the right form for our application.

Let’s dive into an example. Imagine I have an application dealing with information describing vehicle information including make, manufacturer, and category. My documents might look like this:

  {
    "_id" : ObjectId("528ba7691738025d11aab772"),
    "manufacturer" : "Porsche",
    "name" : "550 Spyder",
    "category" : [
        "sports",
        "touring",
        "coupé"
    ]
  }

It’s pretty clear what this document describes, and I could easily unmarshall this into an object native to my chosen language. Notice also that the “category” field is an array of strings. The ability to support arrays is an especially helpful feature; it simplifies the way my application interfaces with the database and helps me avoid a complicated database schema. Consider the complexity of supporting a repeating group in a properly normalized table structure. To represent the same data object in a single table row would like like this:

In the real world, I know that vehicles belong to multiple categories, so I’ve chosen to represent that relation with a comma separated list: “sports,touring,race,coupé”. However, a comma separated list can be difficult to work with because:

I can’t use equality to match single embedded values
I must use regular expressions to find data

Which means that:

Aggregate functions are difficult
Updating a specific element is difficult

The first normal form was designed to avoid such problems by requiring that each relation contain a single atomic value. In other words, No repeating groups!

MongoDB’s JSON arrays do a far better job of matching the semantics of our categories than a set of rows since arrays are multi-element types, representing lists and sets by nature. As a document store, MongoDB’s multi value fields are a natural fit.

The Normal Forms and Why We Don’t Need Them All of this is nice, but a perfectly valid argument can be made to implement the first normal form properly in my relational database and all the trouble I have with repeating groups will be solved. To adhere to the first normal form, I could restructure my rows to look like this:

Vehicle_Id | Manufacturer | Name | Category

This looks great, but I’ve clearly created data redundancies that make me vulnerable to anomalies. For example, consider an update to a single row:

update Manufacturer = "Porsche AG" where Vehicle_Id = 2253 and Category = "coupé"

This update leads to the following data anomaly.

Vehicle_Id | Manufacturer | Name | Category

This happened because I failed to adhere to the second normal form. To avoid this problem I’d have to normalize my data into three separate tables.

Vehicle_Id | Name

2253 | “"550 Spyder”

Vehicle_Id | Category

2253 | “sports” 2253 | “touring” 2253 | “coupé”

Vehicle_Id | Manufacturer

2253 | “Porsche AG”

This structure prevents the anomaly in the first example, but introduces a new set of problems. The new schema has become much more complex, and I have lost any semantic understanding of my stored objects. By adhering to these two normal forms, I’ve also entered the realm of cross-table joins and the need to enforce ACID transactions across multiple tables. The enforcement of referential integrity in multi-row, multi-table operations will require concurrency controls, increasing overhead and affecting performance.

MongoDB avoids these complications through the use of a document data model. The JSON document forms a more natural representation on the data than the normalized schema we explored. It does this by allowing the embedding of related data via arrays and sub-documents within a single document - thus eliminating the need for JOINs, referential integrity and multi-record ACID transactions

The Tao of MongoDB { “_id” : ObjectId(“528ba7691738025d11aab772”), “manufacturer” : “Porsche AMG”, “name” : “550 Spyder”, “categories” : [ “sports”, “touring”, “coupé” ] }

The data anomalies are avoided since the denormalized JSON structure has a single field “manufacturer.” As an attribute of the single document, the “manufacturer” field can be modified atomically, and consistency is preserved across all relations in the document. Using JSON gives us less reason to adhere to the normal forms. This doesn’t mean that you are prohibited from normalizing data in MongoDB. Of course you can, but the the reasons why are going to be different than those of a traditional RDBMS. We’ll look into that in a subsequent post on schema design.

For a full-length presentation on moving from Relational Databases to MongoDB, please visit the MongoDB presentations page. You can learn how to use MongoDB with our online education courses here. And please contact MongoDB support with any questions about transitioning to or using the product. You can also download our new whitepaper which provides best practices and considerations for migrating from an RDBMS to MongoDB.

More Information

This post was updated in January 2015 to include additional resources and updated links.

← Previous

What We're Reading This Week

We’ve got a lot of great pieces to share with you this week. Here’s your weekly roundup of news from the MongoDB community. SD Times : Hadoop and NoSQL -- Friends, not Frenemies MongoDB Blog : Hudl -- Getting Athletes to the Top with MongoDB MongoDB Blog : MongoDB Named 2013 Database of the Year: Why it Matters MongoDB Blog : Meet Stacy Ferranti, Campus Recruiting and University Relations Manager MongoDB Blog : The Ten Most Popular Posts of 2013 Data and Co. : MongoDB in 2013, a Year in Review MongoDB Press : MongoDB Powers Hudl’s Online Video Analysis Platform AppSignal : Realtime Graphs from MongoDB with Aggregations MongoHQ : There’s Always Money in the MongoDB Oplog Database Trends and Applications : Distribution Reshapes the Database World Glorified Geek : Web Analytics Using the MongoDB Aggregation Framework Midsize Insider : Midsize IT Strategy -- Tips from Startups Trisha Gee : Goodbye 2013 Wired : Meet the Open Source Trio Primed to Topple Oracle

January 10, 2014

Next →

Retrieval Augmented Generation for Claim Processing: Combining MongoDB Atlas Vector Search and Large Language Models

Following up on our previous blog, AI, Vectors, and the Future of Claims Processing: Why Insurance Needs to Understand The Power of Vector Databases , we’ll pick up the conversation right where we left it. We discussed extensively how Atlas Vector Search can benefit the claim process in insurance and briefly covered Retrieval Augmented Generation (RAG) and Large Language Models (LLMs). MongoDB.local NYC Join us in person on May 2, 2024 for our keynote address, announcements, and technical sessions to help you build and deploy mission-critical applications at scale. Use Code Web50 for 50% off your ticket! Learn More One of the biggest challenges for claim adjusters is pulling and aggregating information from disparate systems and diverse data formats. PDFs of policy guidelines might be stored in a content-sharing platform, customer information locked in a legacy CRM, and claim-related pictures and voice reports in yet another tool. All of this data is not just fragmented across siloed sources and hard to find but also in formats that have been historically nearly impossible to index with traditional methods. Over the years, insurance companies have accumulated terabytes of unstructured data in their data stores but have failed to capitalize on the possibility of accessing and leveraging it to uncover business insights, deliver better customer experiences, and streamline operations. Some of our customers even admit they’re not fully aware of all the data in their archives. There’s a tremendous opportunity to leverage this unstructured data to benefit the insurer and its customers. Our image search post covered part of the solution to these challenges, opening the door to working more easily with unstructured data. RAG takes it a step further, integrating Atlas Vector Search and LLMs, thus allowing insurers to go beyond the limitations of baseline foundational models, making them context-aware by feeding them proprietary data. Figure 1 shows how the interaction works in practice: through a chat prompt, we can ask questions to the system, and the LLM returns answers to the user and shows what references it used to retrieve the information contained in the response. Great! We’ve got a nice UI, but how can we build an RAG application? Let’s open the hood and see what’s in it! Figure 1: UI of the claim adjuster RAG-powered chatbot Architecture and flow Before we start building our application, we need to ensure that our data is easily accessible and in one secure place. Operational Data Layers (ODLs) are the recommended pattern for wrangling data to create single views. This post walks the reader through the process of modernizing insurance data models with Relational Migrator, helping insurers migrate off legacy systems to create ODLs. Once the data is organized in our MongoDB collections and ready to be consumed, we can start architecting our solution. Building upon the schema developed in the image search post , we augment our documents by adding a few fields that will allow adjusters to ask more complex questions about the data and solve harder business challenges, such as resolving a claim in a fraction of the time with increased accuracy. Figure 2 shows the resulting document with two highlighted fields, “claimDescription” and its vector representation, “claimDescriptionEmbedding” . We can now create a Vector Search index on this array, a key step to facilitate retrieving the information fed to the LLM. Figure 2: document schema of the claim collection, the highlighted fields are used to retrieve the data that will be passed as context to the LLM Having prepared our data, building the RAG interaction is straightforward; refer to this GitHub repository for the implementation details. Here, we’ll just discuss the high-level architecture and the data flow, as shown in Figure 3 below: The user enters the prompt, a question in natural language. The prompt is vectorized and sent to Atlas Vector Search; similar documents are retrieved. The prompt and the retrieved documents are passed to the LLM as context. The LLM produces an answer to the user (in natural language), considering the context and the prompt. Figure 3: RAG architecture and interaction flow It is important to note how the semantics of the question are preserved throughout the different steps. The reference to “adverse weather” related accidents in the prompt is captured and passed to Atlas Vector Search, which surfaces claim documents whose claim description relates to similar concepts (e.g., rain) without needing to mention them explicitly. Finally, the LLM consumes the relevant documents to produce a context-aware question referencing rain, hail, and fire, as we’d expect based on the user's initial question. So what? To sum it all up, what’s the benefit of combining Atlas Vector Search and LLMs in a Claim Processing RAG application? Speed and accuracy: Having the data centrally organized and ready to be consumed by LLMs, adjusters can find all the necessary information in a fraction of the time. Flexibility: LLMs can answer a wide spectrum of questions, meaning applications require less upfront system design. There is no need to build custom APIs for each piece of information you’re trying to retrieve; just ask the LLM to do it for you. Natural interaction: Applications can be interrogated in plain English without programming skills or system training. Data accessibility: Insurers can finally leverage and explore unstructured data that was previously hard to access. Not just claim processing The same data model and architecture can serve additional personas and use cases within the organization: Customer Service: Operators can quickly pull customer data and answer complex questions without navigating different systems. For example, “Summarize this customer's past interactions,” “What coverages does this customer have?” or “What coverages can I recommend to this customer?” Customer self-service: Simplify your members’ experience by enabling them to ask questions themselves. For example, “My apartment is flooded. Am I covered?” or “How long do windshield repairs take on average?” Underwriting: Underwriters can quickly aggregate and summarize information, providing quotes in a fraction of the time. For example, “Summarize this customer claim history.” “I Am renewing a customer policy. What are the customer's current coverages? Pull everything related to the policy entity/customer. I need to get baseline info. Find relevant underwriting guidelines.” If you would like to discover more about Converged AI and Application Data Stores with MongoDB, take a look at the following resources: RAG for claim processing GitHub repository From Relational Databases to AI: An Insurance Data Modernization Journey Modernize your insurance data models with MongoDB and Relational Migrator

April 18, 2024