Analyzing Your MongoDB Data with Analytica

MongoDB
January 29, 2013
#Releases

This is a guest post by Nosh Petigara, president of Analytica.

Analytica is an analytics platform that makes it easy to analyze and report on data like user profiles, event logs, product catalogs, user-generated content, financial assets, or anything else you may have stored in your MongoDB database.

Analytica is built from the ground up for rich, document-type data and uses a JSON-like representation throughout its architecture. You use Analytica Script, a declarative expression language tailored for JSON data, to tell Analytica how to perform calculations and how to filter, group, and transform your documents into the results you want. You can interact with Analytica using a plug-in for Microsoft Excel or a command-line shell. Analytica can also be used through its REST API. Browser-based and mobile interfaces are coming soon.

To show some of Analytica’s capabilities, we used the Twitter API to download all of the tweets sent by the @mongodb Twitter account over the last 4 years into a MongoDB database. Using Analytica, we then developed a dashboard that shows @mongodb’s entire Twitter history.

Assume you have a database called ‘twitter’ and a collection called ‘tweets’ that contains the JSON documents for @mongodb’s tweets from the Twitter API. Here is how you’d use Analytica to calculate the most commonly used hashtags with 3 commands:

SET twitter.byHashtag = group(tweets.by(entities.hashtags.text)) // group our tweets by hashtag and store them in a calculated (virtual) collection called 'byHashtag'
SET twitter.byHashtag.count = count(tweets) // count the number of tweets for each hashtag in our virtual collection
SET twitter.tophashtags = orderdesc(byHashtag.by(count)) // sort the results in descending order of count

Analytica uses dot notation to specify which collections, documents, or properties to operate on. Each SET command in Analytica results in a computation or a transformation of a set of documents, the results of which are stored in what we call calculated properties or calculated collections. These are intermediate results, stored in Analytica (at the database, collection, or document level, depending on how you specify them), which can be used in subsequent computations. Finally, the command ‘twitter.tophashtags.(text, count)’ retrieves the text of each hashtag along with the count of how many tweets use that hashtag.
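For readers who want to see the same group, count, and sort steps outside Analytica, here is a minimal Python sketch of what the three SET commands above compute. It is not Analytica's implementation. The sample documents and the `top_hashtags` helper are made up for illustration, and the only assumption about the data is that each tweet carries its hashtags under `entities.hashtags[].text`, as in the Twitter API's JSON.

```python
from collections import Counter

# Hypothetical sample documents; only the fields used below are included.
tweets = [
    {"text": "MongoDB 2.2 released",
     "entities": {"hashtags": [{"text": "mongodb"}, {"text": "release"}]}},
    {"text": "Join the webinar",
     "entities": {"hashtags": [{"text": "mongodb"}]}},
    {"text": "No tags here",
     "entities": {"hashtags": []}},
]


def top_hashtags(docs):
    """Return (hashtag, tweet_count) pairs, most used hashtag first."""
    counts = Counter()
    for doc in docs:
        # Use a set so a tweet that repeats a hashtag is counted once for it.
        tags = {h["text"] for h in doc.get("entities", {}).get("hashtags", [])}
        counts.update(tags)
    # Sort by count descending, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


result = top_hashtags(tweets)
for text, count in result:
    print(text, count)
```

The list returned by `top_hashtags` plays the role of the ‘tophashtags’ calculated collection: one entry per hashtag, each with its text and count.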

Since we wanted to graph our results, we used Analytica’s plug-in for Excel to enter a series of Analytica Script expressions. In addition to calculating the most tweeted hashtags, we also looked at the frequency of tweets per month from the @mongodb account, analyzed the content of @mongodb’s tweets to see how hashtags and URLs were being used, and computed a few other metrics. With this quick analysis, we saw that @mongodb’s tweeting patterns have changed over time (a lot more tweets recently!), figured out that over 80% of @mongodb’s tweets are retweeted at least once, and learnt (perhaps not surprisingly!) that the most popular tweets are about new releases. We graphed the results and generated an HTML page to share with the MongoDB community.
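Two of the metrics mentioned above, tweets per month and the share of tweets retweeted at least once, can be sketched in a few lines of Python. This is an illustration rather than the Analytica Script we used. The sample documents and helper names are hypothetical, and the sketch assumes each tweet has the `created_at` string and `retweet_count` number found in the Twitter API's JSON.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout assumed for the Twitter API's `created_at` field,
# e.g. "Tue Jan 08 15:02:11 +0000 2013".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Hypothetical sample documents; only the fields used below are included.
tweets = [
    {"created_at": "Tue Jan 08 15:02:11 +0000 2013", "retweet_count": 4},
    {"created_at": "Mon Jan 21 09:30:00 +0000 2013", "retweet_count": 0},
    {"created_at": "Wed Dec 05 18:45:27 +0000 2012", "retweet_count": 1},
]


def tweets_per_month(docs):
    """Count tweets per calendar month, keyed as 'YYYY-MM' in date order."""
    counts = Counter()
    for doc in docs:
        created = datetime.strptime(doc["created_at"], TWITTER_DATE_FORMAT)
        counts[created.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


def retweeted_fraction(docs):
    """Fraction of tweets that were retweeted at least once."""
    if not docs:
        return 0.0
    retweeted = sum(1 for doc in docs if doc.get("retweet_count", 0) >= 1)
    return retweeted / len(docs)


per_month = tweets_per_month(tweets)
fraction = retweeted_fraction(tweets)
print(per_month)
print(f"{fraction:.0%} of tweets were retweeted at least once")
```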

We’re holding a webinar with 10gen on February 12 so that you can learn more about Analytica and ask questions. In the webinar, we’ll go through how you can use Analytica on your own data to produce in-depth analyses, dashboards, and reports and become a data whiz! In the meantime, you can learn more and download the beta version of Analytica. You’ll be able to run Analytica against your own datasets or try an example we’ve put together using data from StackOverflow.

If you are looking for other datasets to try, I’d recommend checking out Twitter’s API, Foursquare’s API, the NYTimes API, or Sunlight Labs’ API. Each of these has JSON, CSV, or XML data that you can easily import into MongoDB to start analyzing with Analytica or with MongoDB’s query language and aggregation framework. We’ll also post a step-by-step guide soon, which will describe how you can run an analysis on your own Twitter history. We’d love to hear from you - you can email us with questions or feedback.
