How Buffer uses MongoDB to power its Growth Platform

MongoDB
June 9, 2014 | Updated: May 2, 2018
#Releases

Buffer, powered by experiments and metrics

At Buffer, every product decision we make is driven by quantitative metrics. We have always sought to be lean in our decision making, and one of the core tenants of being lean is launching experimental features early and measuring their impact.

Buffer is a social media tool to help you schedule and space out your posts on social media networks like Twitter, Facebook, Google+ and Linkedin. We started in late 2010 and thanks to a keen focus on analytical data, we have now grown to over 1.5 million users and 155k unique active users per month. We’re now responsible for sharing 3 million social media posts a week.

When I started at Buffer in September 2012 we were using a mixture of Google Analytics, Kissmetrics and an internal tool to track our app usage and analytics. We struggled to move fast and effectively measure product and feature usage with these disconnected tools. We didn’t have an easy way to generate powerful reports like cohort analysis charts or measure things like activation segmented by signup sources over time. Third party tracking services were great for us early on, but as we started to dig deeper into our app insights, we realized there was no way around it—we needed to build our own custom metrics and event tracking.

We took the plunge in April 2013 to build our own metrics framework using MongoDB. While we’ve had some bumps and growing pains setting this up, it’s been one of the best decisions we’ve made. We are now in control of all metrics and event tracking and are able to understand what’s going on with our app at a deeper level. Here’s how we use MongoDB to power our metrics framework.

Why we chose MongoDB

At the time we were evaluating datastores, we had no idea what our data would look like. When I started designing our schema, I quickly found that we needed something that would let us change the metrics we track over time and on the fly. Today, I’ll want to measure our signup funnel based on referrals, tomorrow I might want to measure some custom event and associated data that is specific to some future experiment. I needed to plan for the future, and give our developers the power to track any arbitrary data. MongoDB and its dynamic schema made the most sense for us. MongoDB’s super powerful aggregation framework also seemed perfect for creating the right views with our tracking data.

Our Metrics Framework Architecture

In our app, we’ve set up an AWS SQS queue and any data we want to track from the app goes immediately to this queue. We use SQS heavily in our app and have found it to be a great tool to manage messaging at high throughput levels. We have a simple python worker which picks off messages from this queue and writes them to our metrics database. The reason why we’ve done this instead of connecting and writing directly to our metrics MongoDB database is because we wanted our metrics set up to have absolutely zero impact on application performance. Much like Google Analytics offers no overhead to an application, our event tracking had to do the same. The MongoDB database that would store our events would be extremely write heavy since we would be tracking anything we could think of, including every API request, page visited, Buffer user/profile/post/email created etc. If, for whatever reason our metrics db goes down, or starts having write locking issues, our users shouldn’t be impacted. Using SQS as a middleman would allow tracking data to queue up if any of these issues occur. SQS gives us enough time to figure out what the issue is, fix, it and then process that backlog. We’ve had quite a few times in the past year where using Amazon’s robust SQS service has saved us from losing any data during maintenance or downtime that would occur when creating a robust high throughput metrics framework from scratch. We use MongoHQ to host our data. They’ve been super helpful with any challenges in scaling a db like ours. Since our setup is write heavy, we’ve initially set up a 400GB SSD replica set. As of today (May 16) we have 90 collections and are storing over 500 million documents.

We wrote simple client libraries for tracking data for every language that we use (PHP, Python, Java, NodeJS, Javascript, Objective-C). In addition to bufferapp.com, our API, mobile apps and internal tools all plug into this framework.

Tracking events

Our event tracking is super simple. When a developer creates a new event message, our python worker creates a generic event collection (if it doesn’t exist) and stores event data that’s defined by the developer. It will store the user or visitor id, and the date that the event occurred. It’ll also store the user_joined_at date which is useful for cohort analysis.

Here are some examples of event tracking our metrics platform lets us do.

Visitor page views in the app.

Like many other apps, we want to track every visitor that hits our page. There is a bunch of data that we want to store to understand the context around the event. We’d like to know the IP address, the URI they viewed, the user agent they’re using among other data.

Here’s what the tracking would look like in our app written in PHP:

$visit_event = array(
	'visitor_id' => $visitor_id,
	'ip' => $ip_address,
	'uri' => $uri,
	'referrer' => $referrer,
	'user_agent' => $user_agent
);
//track, < metric name >, < metric data > , < operation type >
$visitor->track('visits', $visit_event, 'event')

Here’s the corresponding result in our MongoDB metrics db:

> db.event.visits.findOne({date:{$gt:ISODate("2014-05-05")}})
{
    	"_id" : ObjectId("5366d48148222c37e51a9f31"),
    	"domain" : "blog.rafflecopter.com",
    	"user_id" : null,
    	"ip" : "50.27.200.15",
     	"user_joined_at" : null,
     	"visitor_id" : ObjectId("5366d48151823c7914450517"),
     	"uri" : "",
     	"agent" : {
            	"platform" : "Windows 7",
            	"version" : "34.0.1847.131",
            	"browser" : "Chrome"
       	},
       	"referrer" : "blog.rafflecopter.com/",
       	"date" : ISODate("2014-05-05T00:00:01.603Z"),
       	"page" : "/"
}

Logging User API calls

We track every API call our clients make to the Buffer API. Essentially what we’ve done here is create query-able logging for API requests. This has been way more effective than standard web server logs and has allowed us to dig deeper into API bugs, security issues and understanding the load on our API.

db.event.api.findOne()
{
    	"_id" : ObjectId("536c1a7648222c105f807212"),
    	"endpoint" : {
            	"name" : "updates/create"
    	},
    	"user_id" : ObjectId("50367b2c6ffb36784c000048"),
    	"params" : {
            	"get" : {
                    	"text" : "Sending a test update for the the blog post!",
                    	"profile_ids" : [
                            	"52f52d0a86b3e9211f000012"
                    	],
                    	"media" : ""
            	}
    	},
    	"client_id" : ObjectId("4e9680b8562f7e6b22000000"),
    	"user_joined_at" : ISODate("2012-08-23T18:50:20.405Z"),
    	"date" : ISODate("2014-05-08T23:59:50.419Z"),
    	"ip_address" : "32.163.4.8",
    	"response_time" : 414.95399475098
}
<p>

Experiment data

With this type of event tracking, our developers are able to track anything by writing a single line of code. This has been especially useful for measuring events specific to a feature experiment. This frictionless process helps keep us lean: we can measure feature usage as soon as a feature is launched. For example, we recently launched a group sharing feature for business customers so that they can group their Buffer social media accounts together. Our hypothesis was that people with several social media accounts prefer to share specific content to subsets of accounts. We wanted to quantifiably validate whether this is something many would use, or whether it’s a niche or power user feature. After a week of testing this out, we had our answer.

This example shows our tracking of our ‘group sharing’ experiment. We wanted to track each group that was created with this new feature. With this, we’re able to track the user, the groups created, the name of the group, and the date it was created.

> db.event.web.group_sharing.create_group.find().pretty()
{
    	"_id" : ObjectId("536c07e148022c1069b4ff3d"),
    	"via" : "web",
    	"user_id" : ObjectId("536bfbea61bb78af76e2a94d"),
    	"user_joined_at" : ISODate("2014-05-08T21:49:30Z"),
    	"date" : ISODate("2014-05-08T22:40:33.880Z"),
    	"group" : {
            	"profile_ids" : [
                    	"536c074d613b7d9924e1a90f",
                    	"536c07c361bb7d732d198f1"
            	],
            	"id" : "536c07e156a66a28563f14ec",
            	"name" : "Dental"
    	}
}

Making sense of the data

We store a lot of tracking data. While it’s great that we’re tracking all this data, there would be no point if we weren’t able to make sense of it. Our goal for tracking this data was to create our own growth dashboard so we can keep track of key metrics, and understand results of experiments. Making sense of the data was one of the most challenging parts of setting up our growth platform.

MongoDB Aggregation

We rely heavily on MongoDB’s aggregation framework. It has been super handy for things like gauging API client requests by hour, response times separated by API endpoint, number of visitors based on referrers, cohort analysis and so much more.

Here’s a simple example of how we use MongoDB aggregation to obtain our average API response times between April 8th and April 9th:

db.event.api.aggregate({
	$match: {
    		date: {
        		$gt: ISODate("2014-05-08T20:02:33.133Z"),
        		$lt: ISODate("2014-05-09T20:02:33.133Z"),
         			}
    		}
	}, {
	$group: {
    	_id: {
        			endpoint: '$endpoint.name'
    		},
    	avgResponseTime: {
        			$avg: '$response_time'
    	},
    		count: {
        	$sum: 1
    		}
	}
}, {
	$sort: {
    	"count": -1
	}
})

Result:

{
    	"result" : [
            	{
                    	"_id" : {
                            	"endpoint" : "profiles/updates_pending"
                    	},
                    	"avgResponseTime" : 118.69420306241872,
                    	"count" : 749800
            	},
            	{
                    	"_id" : {
                            	"endpoint" : "updates/create"
                    	},
                    	"avgResponseTime" : 1597.2882786981013,
                    	"count" : 393282
            	},
            	{
                    	"_id" : {
                            	"endpoint" : "profiles/updates_sent"
                    	},
                    	"avgResponseTime" : 281.65717282199824,
                    	"count" : 368860
            	},
            	{
                    	"_id" : {
                            	"endpoint" : "profiles/index"
                    	},
                    	"avgResponseTime" : 112.43379622794643,
                    	"count" : 323844
            	},
            	{
                    	"_id" : {
                            	"endpoint" : "user/friends"
                    	},
                    	"avgResponseTime" : 559.7830099245549,
                    	"count" : 122320
            	},
            	...

With the aggregation framework, we have powerful insight into how clients are using our platform, which users are power users and a lot more. We previously created long running scripts to generate our cohort analysis reports. Now we can use MongoDB aggregation for much of this.

Running ETL jobs

We have several ETL jobs that run periodically to power our growth dashboard. This is the way we make sense of our data core. Some of the more complex reports need this level of reporting. For example, the way we measure product activation is whether someone has posted an update within a week of joining. With the way we’ve structured our data, this requires a join query in two different collections. All of this processing is done in our ETL jobs. We’ll upload the results to a different database which is used to power the views in our growth dashboard for faster loading.

Here are some reports on our growth dashboard that are powered by ETL jobs

Scaling Challenges and Pitfalls

We’ve faced a few different challenges and we’ve iterated to get to a point where we can make solid use out of our growth platform. Here are a few pitfalls and examples of challenges that we’ve faced in setting this up and scaling our platform.

Plan for high disk I/O and write throughput.

The DB server size and type has a key role in how quickly we could process and store events. In planning for the future we knew that we’d be tracking quite a lot of data and a fast pace, so a db with high disk write throughput was key for us. We ended up going for a large SSD replica set. This of course really depends on your application and use case. If you use an intermediate datastore like SQS, you can always start small, and upgrade db servers when you need it without any data loss.

We keep an eye on mongostat and SQS queue size almost daily to see how our writes are doing.

One of the good things about an SSD backed DB is that disk reads are much quicker compared to hard disk. This means it’s much more feasible to run ad hoc queries on un-indexed fields. We do this all the time whenever we have a hunch of something to dig into further.

Be mindful of the MongoDB document limit and how data is structured

Our first iteration of schema design was not scalable. True, MongoDB does not perform schema validation but that doesn’t mean it’s not important to think about how data is structured. Originally, we tracked all events in a single user_metrics and visitor_metrics collection. An event was stored as an embedded object in an array in a user document. Our hope was that we wouldn’t need to do any joins and we could effectively segment out tracking data super easily by user.

We had fields as arrays that were unbounded and could grow infinitely causing the document size to grow. For some highly active users (and bots), after a few months of tracking data in this way some documents in this collection would hit the 16MB document limit and fail to write any more. This created various performance issues in processing updates, and in our growth worker and ETL jobs because there were these huge documents transferred over the wire. When this happened we had to move quickly to restructure our data.

Moving to a single collection per event type has been the most scalable solution and a more flexible solution.

Reading from secondaries

Some of our ETL jobs read and process a lot of data. If you end up querying documents that haven’t been read or written to recently, it is very possible this may be stored out of memory and on disk. Querying this data means MongoDB will page out some documents that have been touched recently and bring query results into memory. This will then make writing to that paged out document slower. It’s for this reason that we have set up our ETL and aggregation queries to read only from our secondaries in our replica-set, even though they may not be consistent with the primary.

Our secondaries have a high number of faults because of paging due to reading ‘stale’ data

Visualizing results

As I mentioned before, one of the more challenging parts about maintaining our own growth platform is extracting and visualizing the data in a way that makes a lot of sense. I can’t say that we’ve come to a great solution yet. We’ve put a lot of effort into building out and maintaining our growth dashboard and creating visualizations is the bottleneck for us today. There is really a lot of room to reduce the turnaround time. We have started to experiment a bit with using Stripe’s MoSQL to map results from MongoDB to PostgresSQL and connect with something like Chart.io to make this a bit more seamless. If you’ve come across some solid solutions for visualizing event tracking with MongoDB, I’d love to hear about it!

Event tracking for everyone!

We would love to open source our growth platform. It’s something we’re hoping to do later this year. We’ve learned a lot by setting up our own tracking platform. If you have any questions about any of this or would like to have more control of your own event tracking with MongoDB, just hit me up @sunils34

Want to help build out our growth platform? Buffer is looking to grow its growth team and reliability team!

Like what you see? Sign up for the MongoDB Newsletter and get MongoDB updates straight to your inbox

← Previous

Using MMS to Build New Environments

What is your process for spinning up development, test or staging environments? If you’re like most, it may be complex and manual. You may be have difficulty simulating production data. If that’s the case, attend our webinar on July 1 . We’ll show you how you can quickly spin up new MongoDB environments using MongoDB Management Service . MongoDB Backup using the MongoDB Management Service provides unlimited restores and a simple process that enables you to quickly and easily deploy new environments from live production data. You can be sure you’re working with the most current data at all times, without affecting your production environment. In addition, the newly released MMS API enables you to programmatically perform restores to automate the process of building new environments. Ready to learn more? Sign up for the webinar – it’s free! Back up MongoDB with MMS – use our free tier to get started

June 9, 2014

Next →

Retrieval Augmented Generation for Claim Processing: Combining MongoDB Atlas Vector Search and Large Language Models

Following up on our previous blog, AI, Vectors, and the Future of Claims Processing: Why Insurance Needs to Understand The Power of Vector Databases , we’ll pick up the conversation right where we left it. We discussed extensively how Atlas Vector Search can benefit the claim process in insurance and briefly covered Retrieval Augmented Generation (RAG) and Large Language Models (LLMs). MongoDB.local NYC Join us in person on May 2, 2024 for our keynote address, announcements, and technical sessions to help you build and deploy mission-critical applications at scale. Use Code Web50 for 50% off your ticket! Learn More One of the biggest challenges for claim adjusters is pulling and aggregating information from disparate systems and diverse data formats. PDFs of policy guidelines might be stored in a content-sharing platform, customer information locked in a legacy CRM, and claim-related pictures and voice reports in yet another tool. All of this data is not just fragmented across siloed sources and hard to find but also in formats that have been historically nearly impossible to index with traditional methods. Over the years, insurance companies have accumulated terabytes of unstructured data in their data stores but have failed to capitalize on the possibility of accessing and leveraging it to uncover business insights, deliver better customer experiences, and streamline operations. Some of our customers even admit they’re not fully aware of all the data in their archives. There’s a tremendous opportunity to leverage this unstructured data to benefit the insurer and its customers. Our image search post covered part of the solution to these challenges, opening the door to working more easily with unstructured data. RAG takes it a step further, integrating Atlas Vector Search and LLMs, thus allowing insurers to go beyond the limitations of baseline foundational models, making them context-aware by feeding them proprietary data. Figure 1 shows how the interaction works in practice: through a chat prompt, we can ask questions to the system, and the LLM returns answers to the user and shows what references it used to retrieve the information contained in the response. Great! We’ve got a nice UI, but how can we build an RAG application? Let’s open the hood and see what’s in it! Figure 1: UI of the claim adjuster RAG-powered chatbot Architecture and flow Before we start building our application, we need to ensure that our data is easily accessible and in one secure place. Operational Data Layers (ODLs) are the recommended pattern for wrangling data to create single views. This post walks the reader through the process of modernizing insurance data models with Relational Migrator, helping insurers migrate off legacy systems to create ODLs. Once the data is organized in our MongoDB collections and ready to be consumed, we can start architecting our solution. Building upon the schema developed in the image search post , we augment our documents by adding a few fields that will allow adjusters to ask more complex questions about the data and solve harder business challenges, such as resolving a claim in a fraction of the time with increased accuracy. Figure 2 shows the resulting document with two highlighted fields, “claimDescription” and its vector representation, “claimDescriptionEmbedding” . We can now create a Vector Search index on this array, a key step to facilitate retrieving the information fed to the LLM. Figure 2: document schema of the claim collection, the highlighted fields are used to retrieve the data that will be passed as context to the LLM Having prepared our data, building the RAG interaction is straightforward; refer to this GitHub repository for the implementation details. Here, we’ll just discuss the high-level architecture and the data flow, as shown in Figure 3 below: The user enters the prompt, a question in natural language. The prompt is vectorized and sent to Atlas Vector Search; similar documents are retrieved. The prompt and the retrieved documents are passed to the LLM as context. The LLM produces an answer to the user (in natural language), considering the context and the prompt. Figure 3: RAG architecture and interaction flow It is important to note how the semantics of the question are preserved throughout the different steps. The reference to “adverse weather” related accidents in the prompt is captured and passed to Atlas Vector Search, which surfaces claim documents whose claim description relates to similar concepts (e.g., rain) without needing to mention them explicitly. Finally, the LLM consumes the relevant documents to produce a context-aware question referencing rain, hail, and fire, as we’d expect based on the user's initial question. So what? To sum it all up, what’s the benefit of combining Atlas Vector Search and LLMs in a Claim Processing RAG application? Speed and accuracy: Having the data centrally organized and ready to be consumed by LLMs, adjusters can find all the necessary information in a fraction of the time. Flexibility: LLMs can answer a wide spectrum of questions, meaning applications require less upfront system design. There is no need to build custom APIs for each piece of information you’re trying to retrieve; just ask the LLM to do it for you. Natural interaction: Applications can be interrogated in plain English without programming skills or system training. Data accessibility: Insurers can finally leverage and explore unstructured data that was previously hard to access. Not just claim processing The same data model and architecture can serve additional personas and use cases within the organization: Customer Service: Operators can quickly pull customer data and answer complex questions without navigating different systems. For example, “Summarize this customer's past interactions,” “What coverages does this customer have?” or “What coverages can I recommend to this customer?” Customer self-service: Simplify your members’ experience by enabling them to ask questions themselves. For example, “My apartment is flooded. Am I covered?” or “How long do windshield repairs take on average?” Underwriting: Underwriters can quickly aggregate and summarize information, providing quotes in a fraction of the time. For example, “Summarize this customer claim history.” “I Am renewing a customer policy. What are the customer's current coverages? Pull everything related to the policy entity/customer. I need to get baseline info. Find relevant underwriting guidelines.” If you would like to discover more about Converged AI and Application Data Stores with MongoDB, take a look at the following resources: RAG for claim processing GitHub repository From Relational Databases to AI: An Insurance Data Modernization Journey Modernize your insurance data models with MongoDB and Relational Migrator

April 18, 2024