Hi there! This is Logan with Knudge Engineering. I’m using this opportunity to go through some of the lessons and tricks I’ve learned in optimizing our usage of MongoDB — the database we use to store all Knudge data. I’m going to use the plural “indexes” here, though it and “indices” are both valid, because it’s a simpler plural and both are widely used.
Background
Database optimization is everywhere. It’s there when you pick a database. It’s in your schemas and (de)normalization. It’s in the indexes you choose to build across your data. When you turn on the TV, when you go to work, when you go to sleep. It is everywhere and it is everything—ok, maybe it’s not all-encompassing, but it’s important! A database is a very interesting part of any tech stack. It is central — underlying most everything else. Databases tend to be huge and increasingly complicated software projects. They’re full of power, but also peculiarities, limitations, and even bugs! So is the ORM/ODM software that sits between the database and the rest of the codebase, as we keep finding out 🙃
Indexes & Optimization
The goal of any database is to be able to support all the queries that are necessary for the various functions of the software application using it. The concept of a database index is central to the pursuit of any optimized database. Wikipedia is a bit verbose there, so here’s the short: every database holds data in a way that is efficient to store, but not necessarily easy to query in every way. When you submit a query, you’re asking the database a question. How many nudges were sent to this contact? What are all the nudges sent by this organization? The database engine is tasked with finding the most efficient way to answer that question. In order to make it easier to answer different types of questions, we create indexes that pre-filter and slice the data in ways that make it much faster to fulfill certain queries (i.e. answer certain questions).
Generally, a good index for a given query is one where the database has to consult the least amount of data in order to fulfill the query. If a query asks about 10 nudges, and the database only has to read information about those 10 nudges, that’s pretty good. That’s what’s considered linear time complexity, referred to as O(n) in Big-O Notation — a fancy way of saying that, for each piece of data retrieved, a constant number of things had to happen, whether 1, or 2, or 100. If a query asks about 10 nudges, but only has to read a constant amount of data, that is O(1) or constant time.
A “simple” query
I like examples. Let’s look at an example of one of these queries on our contacts collection. For context, Knudge has different kinds of contacts — normal contacts, other professionals, and team members. We use Mongoose as an ODM on top of Mongo, which lets us define schemas, then attach indexes to those schemas like so:
contactSchema.index({
  kind: 1
});
What does that do? It creates an index in the contacts collection on the kind field in ascending order (the order isn’t important here, but can be, as we might see later). This index copies minimal metadata from each contact (i.e. their kind and ID) into a separate data structure (the index!) where it can very efficiently answer the question Which contacts have the kind […]? That index essentially buckets all contacts by what kind they are.
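To make "buckets all contacts by kind" concrete, here’s a standalone sketch in plain Node.js — no MongoDB involved, and the data is made up. A real MongoDB index is a B-tree, not a Map, but the idea is the same: a separate structure holding just the kind and the ID, so answering the question never touches non-matching documents.

```javascript
// Illustrative documents, not real Knudge data.
const contacts = [
  { _id: 1, kind: 'contact' },
  { _id: 2, kind: 'professional' },
  { _id: 3, kind: 'contact' },
  { _id: 4, kind: 'team-member' },
];

// Build the "index": a Map from each kind to the ids of contacts having it.
// Only minimal metadata (kind + _id) is copied in, mirroring the real index.
const kindIndex = new Map();
for (const { _id, kind } of contacts) {
  if (!kindIndex.has(kind)) kindIndex.set(kind, []);
  kindIndex.get(kind).push(_id);
}

// Answering "which contacts have this kind?" now touches only one bucket,
// instead of scanning every document.
console.log(kindIndex.get('contact')); // [ 1, 3 ]
```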
A little more complexity, and ESR
One issue: contacts are always within a given organization (what we call a group of people using Knudge professionally). Because we’re servicing queries from users within an organization, they are only ever looking at their own data. Because of that, we’re almost always querying by a specific organization when looking up contacts. Let’s add a field to the index to support that:
contactSchema.index({
organization: 1,
kind: 1
});
You may notice that I put organization before kind. Order matters! But why this order?
- It makes basic queries including these two fields faster. The organization field is a randomly-generated ID, while the kind field is an enum with only a few, fixed possible values. The former lets MongoDB cut down the data a lot faster, because there are going to be fewer contacts of any kind within a specific organization than contacts with a specific kind across all organizations in Knudge.
- The MongoDB quirk of ESR (equality, then sort, then range). When we perform queries against this index, we’re always looking up a specific organization (equality), then filtering by one or more kinds. If it’s one kind, that’s another equality check, but if it’s more than one (when using something like the $in operator), Mongo is doing a range check. To fulfill a query like { organization: 'abcd1234', kind: { $in: [ … ] } } with this index, we need the kind field (for the range check) to appear after organization (for the equality check).
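Here’s a standalone sketch of why that ordering matters — again plain Node.js with made-up data, not MongoDB internals. A compound index is conceptually a sorted list of [organization, kind, _id] entries: equality on organization selects one contiguous run, and the kinds inside that run are already in order, so the range check stays local.

```javascript
// Conceptual compound index: entries sorted by organization first, then kind.
// (A real index is a B-tree; a sorted array is the same idea in miniature.)
const indexEntries = [
  ['org-a', 'contact', 1],
  ['org-a', 'contact', 4],
  ['org-a', 'professional', 2],
  ['org-b', 'contact', 3],
  ['org-b', 'team-member', 5],
].sort();

// Equality on organization narrows to one contiguous slice of the entries;
// the $in-style check on kind then only ever looks inside that slice.
function idsFor(organization, kinds) {
  return indexEntries
    .filter(([org]) => org === organization)     // equality: contiguous run
    .filter(([, kind]) => kinds.includes(kind))  // range check within the run
    .map(([, , id]) => id);
}

console.log(idsFor('org-a', ['contact', 'professional'])); // [ 1, 4, 2 ]
```

If the keys were ordered kind-first instead, the matching entries for one organization would be scattered across every kind bucket, and the engine would have to probe each one.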
Even more with partialFilterExpression
Let’s say we want to support deleting contacts, but we want to keep the backing contact to e.g. maintain metadata about existing nudges. We add a deletedAt field, which is a date if deleted, and null otherwise. For most everything in the app, we only care about showing non-deleted contacts to users. We could add a deletedAt: 1 clause to our index, but we almost never need to look for deleted contacts or query by the specific date they were deleted. We’d rather just ignore the ones that are deleted, omitting them from the index entirely. And we can!
contactSchema.index({
  organization: 1,
  kind: 1
}, {
  partialFilterExpression: { deletedAt: null }
});
The partialFilterExpression gives a way to match which documents should appear in the index at all. When querying against the index, the entire partialFilterExpression must be included as-is in the query. The above index can now field a query for
{
organization: 'abcd1234',
kind: 'contact',
deletedAt: null
}
but it cannot field a query for
{
organization: 'abcd1234',
kind: 'contact'
}
because it doesn’t hold every contact in the index — only those with the value null (or undefined! know thy enemy, for their name is “mongoose”) in their deletedAt field. This can be highly constraining, but in the right circumstances, it cuts down on index size pretty significantly. It’s particularly useful in cases where you have some object that is relevant for a specific, short period of time, and you're mostly querying for objects in that state — for example, building a backing collection for a message queue, where messages in the queue are no longer relevant after they are processed.
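As a sketch of that queue idea, following the same partialFilterExpression pattern as the contacts index — the schema and field names here (messageSchema, queue, enqueuedAt, processedAt) are made up for illustration, not from our codebase:

```javascript
// Hypothetical message-queue collection. Messages get a processedAt date
// once handled; until then it is null. Processed messages never enter the
// index, so the index only ever covers the small set of pending work, no
// matter how much history the collection accumulates.
messageSchema.index({
  queue: 1,
  enqueuedAt: 1
}, {
  partialFilterExpression: { processedAt: null }
});
```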
One thing worth noting is that the partialFilterExpression field does not support everything that a normal find would — for example, the $ne operator. You’ll have to get clever where you’d usually do a { $ne: null } to check for truthy values. Even $exists has its quirks — a null field still exists! We use $type like { someStringField: { $type: 'string' } } to filter by truthy values in a partialFilterExpression.
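Putting that workaround into index form — a sketch, reusing the someStringField name from the example above:

```javascript
// We can't write { someStringField: { $ne: null } } in a
// partialFilterExpression, so instead we index only the documents where the
// field holds an actual string value.
contactSchema.index({
  someStringField: 1
}, {
  partialFilterExpression: { someStringField: { $type: 'string' } }
});
```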
Where do I go from here?
Go try some queries! Grab a sample data set, or just insert a few basic documents. Learn the query explainer! Fear the COLLSCAN! So much of learning a complex piece of software like MongoDB is through experimentation. The docs can’t feed you the exact way that every database feature is going to intersect and interact with every other database feature. Set up minimal test cases, verify that queries give the expected results, and see what the explainer spits out for how efficiently it was able to execute the query. Don’t be afraid to massage data into a different form to efficiently fulfill queries. Read through the available query predicates — some notables, not yet mentioned: $elemMatch, $all, and $regex (efficient when doing prefix matches!).
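As a hedged starting point in the mongo shell (this assumes a running deployment and a contacts collection shaped like the examples above):

```javascript
// Ask the planner how it would execute the query, and how it actually did.
db.contacts.find({
  organization: 'abcd1234',
  kind: 'contact',
  deletedAt: null
}).explain('executionStats');
// In the output, look for an IXSCAN stage (index scan) in winningPlan rather
// than a COLLSCAN, and compare executionStats.totalDocsExamined against
// nReturned — a big gap means the index isn't narrowing things down much.
```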
MongoDB is a fast, massively useful, incredibly generic database. At Knudge, we’ve built message queues, global locking mechanisms, global abort signals (à la AbortController), synchronized caches, and a couple simple key-value stores — all on top of the same database. It has its caveats, as with any database, but is incredibly useful. It can probably do whatever you want it to do, if only with one more layer of indirection.