Hi there! This is Logan with Knudge Engineering. I’m using this opportunity to go through some of the lessons and tricks I’ve learned in optimizing our usage of MongoDB — the database we use to store all Knudge data. I’m going to use the plural “indexes” here, though it and “indices” are both valid, because it’s a simpler plural and both are widely used.
Background
Database optimization is everywhere. It’s there when you pick a database. It’s in your schemas and (de)normalization. It’s in the indexes you choose to build across your data. When you turn on the TV, when you go to work, when you go to sleep. It is everywhere and it is everything—ok, maybe it’s not all-encompassing, but it’s important! A database is a very interesting part of any tech stack. It is central — underlying most everything else. Databases tend to be huge and increasingly complicated software projects. They’re full of power, but also peculiarities, limitations, and even bugs! So is the ORM/ODM software that sits between the database and the rest of the codebase, as we keep finding out 🙃
Indexes & Optimization
The goal of any database is to be able to support all the queries that are necessary for the various functions of the software application using it. The concept of a database index is central to the pursuit of any optimized database. Wikipedia is a bit verbose there, so here’s the short: every database holds data in a way that is efficient to store, but not necessarily easy to query in every way. When you submit a query, you’re asking the database a question. How many nudges were sent to this contact? What are all the nudges sent by this organization? The database engine is tasked with finding the most efficient way to answer that question. In order to make it easier to answer different types of questions, we create indexes that pre-filter and slice the data in ways that make it much faster to fulfill certain queries (i.e. answer certain questions).
Generally, a good index for a given query is one where the database has to consult the least amount of data in order to fulfill the query. If a query asks about 10 nudges, and the database only has to read information about those 10 nudges, that’s pretty good. That’s what’s considered linear time complexity, referred to as O(n) in Big-O Notation — a fancy way of saying that, for each piece of data retrieved, a constant number of things had to happen, whether 1, or 2, or 100. If a query asks about 10 nudges, but only has to read a constant amount of data, that is O(1) or constant time.
A “simple” query
I like examples. Let’s look at an example of one of these queries on our contacts collection. For context, Knudge has different kinds of contacts — normal contacts, other professionals, and team members. We use Mongoose as an ODM on top of Mongo, which lets us define schemas, then attach indexes to those schemas like so:
contactSchema.index({
  kind: 1
});
What does that do? It creates an index in the contacts collection on the kind field in ascending order (the order isn’t important here, but can be, as we might see later). This index copies minimal metadata from each contact (i.e. their kind and ID) into a separate data structure (the index!) where it can very efficiently answer the question Which contacts have the kind […]? That index essentially buckets all contacts by what kind they are.
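To make "buckets all contacts by kind" concrete, here’s a standalone sketch in plain Node.js — no MongoDB involved, and the data is made up. A real MongoDB index is a B-tree, not a Map, but the idea is the same: a separate structure holding just the kind and the ID, so answering the question never touches non-matching documents.

```javascript
// Illustrative documents, not real Knudge data.
const contacts = [
  { _id: 1, kind: 'contact' },
  { _id: 2, kind: 'professional' },
  { _id: 3, kind: 'contact' },
  { _id: 4, kind: 'team-member' },
];

// Build the "index": a Map from each kind to the ids of contacts having it.
// Only minimal metadata (kind + _id) is copied in, mirroring the real index.
const kindIndex = new Map();
for (const { _id, kind } of contacts) {
  if (!kindIndex.has(kind)) kindIndex.set(kind, []);
  kindIndex.get(kind).push(_id);
}

// Answering "which contacts have this kind?" now touches only one bucket,
// instead of scanning every document.
console.log(kindIndex.get('contact')); // [ 1, 3 ]
```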
A little more complexity, and ESR
One issue: contacts are always within a given organization (what we call a group of people using Knudge professionally). Because we’re servicing queries from users within an organization, they are only ever looking at their own data. Because of that, we’re almost always querying by a specific organization when looking up contacts. Let’s add a field to the index to support that:
contactSchema.index({
organization: 1,
kind: 1
});
You may notice that I put organization before kind. Order matters! But why this order?
- It makes basic queries including these two fields faster. The organization field is a randomly-generated ID, while the kind field is an enum with only a few, fixed possible values. The former lets MongoDB cut down the data a lot faster, because there are going to be fewer contacts of any kind within a specific organization than contacts with a specific kind across all organizations in Knudge.
- The MongoDB quirk of ESR (equality, then sort, then range). When we perform queries against this index, we’re always looking up a specific organization (equality), then filtering by one or more kinds. If it’s one kind, that’s another equality check, but if it’s more than one (when using something like the $in operator), Mongo is doing a range check. To fulfill a query like { organization: 'abcd1234', kind: { $in: [ … ] } } with this index, we need the kind field (for the range check) to appear after organization (for the equality check).
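Here’s a standalone sketch of why that ordering matters — again plain Node.js with made-up data, not MongoDB internals. A compound index is conceptually a sorted list of [organization, kind, _id] entries: equality on organization selects one contiguous run, and the kinds inside that run are already in order, so the range check stays local.

```javascript
// Conceptual compound index: entries sorted by organization first, then kind.
// (A real index is a B-tree; a sorted array is the same idea in miniature.)
const indexEntries = [
  ['org-a', 'contact', 1],
  ['org-a', 'contact', 4],
  ['org-a', 'professional', 2],
  ['org-b', 'contact', 3],
  ['org-b', 'team-member', 5],
].sort();

// Equality on organization narrows to one contiguous slice of the entries;
// the $in-style check on kind then only ever looks inside that slice.
function idsFor(organization, kinds) {
  return indexEntries
    .filter(([org]) => org === organization)     // equality: contiguous run
    .filter(([, kind]) => kinds.includes(kind))  // range check within the run
    .map(([, , id]) => id);
}

console.log(idsFor('org-a', ['contact', 'professional'])); // [ 1, 4, 2 ]
```

If the keys were ordered kind-first instead, the matching entries for one organization would be scattered across every kind bucket, and the engine would have to probe each one.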
Even more with partialFilterExpression
Let’s say we want to support deleting contacts, but we want to keep the backing contact to e.g. maintain metadata about existing nudges. We add a deletedAt field, which is a date if deleted, and null otherwise. For most everything in the app, we only care about showing non-deleted contacts to users. We could add a deletedAt: 1 clause to our index, but we almost never need to look for deleted contacts or query by the specific date they were deleted. We’d rather just ignore the ones that are deleted, omitting them from the index entirely. And we can!
contactSchema.index({
  organization: 1,
  kind: 1
}, {
  partialFilterExpression: { deletedAt: null }
});
The partialFilterExpression gives a way to match which documents should appear in the index at all. When querying against the index, the entire partialFilterExpression must be included as-is in the query. The above index can now field a query for
{
organization: 'abcd1234',
kind: 'contact',
deletedAt: null
}
but it cannot field a query for
{
organization: 'abcd1234',
kind: 'contact'
}
because it doesn’t hold every contact in the index — only those with the value null (or undefined! know thy enemy, for their name is “mongoose”) in their deletedAt field. This can be highly constraining, but in the right circumstances, it cuts down on index size pretty significantly. It’s particularly useful in cases where you have some object that is relevant for a specific, short period of time, and you're mostly querying for objects in that state — for example, building a backing collection for a message queue, where messages in the queue are no longer relevant after they are processed.
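As a sketch of that queue idea, following the same partialFilterExpression pattern as the contacts index — the schema and field names here (messageSchema, queue, enqueuedAt, processedAt) are made up for illustration, not from our codebase:

```javascript
// Hypothetical message-queue collection. Messages get a processedAt date
// once handled; until then it is null. Processed messages never enter the
// index, so the index only ever covers the small set of pending work, no
// matter how much history the collection accumulates.
messageSchema.index({
  queue: 1,
  enqueuedAt: 1
}, {
  partialFilterExpression: { processedAt: null }
});
```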
One thing worth noting is that the partialFilterExpression field does not support everything that a normal find would — for example, the $ne operator. You’ll have to get clever where you’d usually do a { $ne: null } to check for truthy values. Even $exists has its quirks — a null field still exists! We use $type like { someStringField: { $type: 'string' } } to filter by truthy values in a partialFilterExpression.
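Putting that workaround into index form — a sketch, reusing the someStringField name from the example above:

```javascript
// We can't write { someStringField: { $ne: null } } in a
// partialFilterExpression, so instead we index only the documents where the
// field holds an actual string value.
contactSchema.index({
  someStringField: 1
}, {
  partialFilterExpression: { someStringField: { $type: 'string' } }
});
```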
Where do I go from here?
Go try some queries! Grab a sample data set, or just insert a few basic documents. Learn the query explainer! Fear the COLLSCAN! So much of learning a complex piece of software like MongoDB is through experimentation. The docs can’t feed you the exact way that every database feature is going to intersect and interact with every other database feature. Set up minimal test cases, verify that queries give the expected results, and see what the explainer spits out for how efficiently it was able to execute the query. Don’t be afraid to massage data into a different form to efficiently fulfill queries. Read through the available query predicates — some notables, not yet mentioned: $elemMatch, $all, and $regex (efficient when doing prefix matches!).
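As a hedged starting point in the mongo shell (this assumes a running deployment and a contacts collection shaped like the examples above):

```javascript
// Ask the planner how it would execute the query, and how it actually did.
db.contacts.find({
  organization: 'abcd1234',
  kind: 'contact',
  deletedAt: null
}).explain('executionStats');
// In the output, look for an IXSCAN stage (index scan) in winningPlan rather
// than a COLLSCAN, and compare executionStats.totalDocsExamined against
// nReturned — a big gap means the index isn't narrowing things down much.
```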
MongoDB is a fast, massively useful, incredibly generic database. At Knudge, we’ve built message queues, global locking mechanisms, global abort signals (à la AbortController), synchronized caches, and a couple simple key-value stores — all on top of the same database. It has its caveats, as with any database, but is incredibly useful. It can probably do whatever you want it to do, if only with one more layer of indirection.