["Thinking", "About", "Arrays", "In", "MongoDB"]

Greetings adventurers!

The growing popularity of MongoDB means more and more people are thinking about data in ways divergent from traditional relational models. For this reason alone, it's exciting to experiment with new ways of modelling data. However, with additional flexibility comes the need to properly analyze the performance impact of data model decisions.

Embedding arrays in documents is a great example of this. MongoDB's versatile array operators ($push/$pull, $addToSet, $elemMatch, etc.) offer the ability to manage data sets within documents. However, one must be careful. Data models that call for very large arrays, or arrays with high rates of modification, can often lead to performance problems.
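To make that concrete, here's a quick sketch of the kind of array manipulation we're talking about (the "dealers" collection and its fields are just for illustration):

db.dealers.update(
    { "_id": 1234 },
    { "$push": { "cars": { "year": 2013, "make": "10gen", "model": "MongoCar" } } }
)

db.dealers.update(
    { "_id": 1234 },
    { "$pull": { "cars": { "vin": 3928056 } } }
)

Convenient, certainly. The trouble starts when "cars" gets large.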

That's because at high levels of scale, large arrays incur relatively high CPU overhead, leading to longer insert, update, and query times than we'd like.

Let's use car dealerships as an example and discuss why an array of cars in a dealer document isn't necessarily the ideal data model.

Here's our dealer document:

{ "_id" : 1234,
  "dealershipName": "Eric's Mongo Cars",
  "cars": [
           {"year": 2013,
            "make": "10gen",
            "model": "MongoCar",
            "vin": 3928056,
            "mechanicNotes": "Runs great!"},
           {"year": 1985,
            "make": "DeLorean",
            "model": "DMC-12",
            "vin": 8056309,
            "mechanicNotes": "Great Scott!"}
  ]
}

Now for some concerns with this model.

Querying Cars

One of the advantages of MongoDB is its rich query language, which supports matching documents by the contents of an array. If we want to locate cars by make and model, matching both fields within the same array element requires a specific feature of MongoDB's query language: $elemMatch. This operator, while it can be assisted by indexes, still needs to traverse the entire "cars" array of every eligible dealer document in order to execute the query. If we query documents whose "cars" arrays contain thousands of car entries, we are essentially doing mini collection scans. As a result, we'll see high CPU utilization and slow query execution.
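For reference, such a query against the embedded model would look something like this (the "dealers" collection name is assumed for illustration):

db.dealers.find(
    { "cars": { "$elemMatch": { "make": "DeLorean", "model": "DMC-12" } } }
)

Even with a multikey index on {"cars.make": 1, "cars.model": 1}, the server must still walk each candidate document's "cars" array to evaluate the predicate.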

What about more complex computation? The Aggregation framework -- with the $unwind operator -- could support many of the queries we'd like to perform, but indexes may not be used to their full effectiveness.
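For example, counting cars per model year at a dealership might look something like this under the embedded model (again assuming a "dealers" collection):

db.dealers.aggregate([
    { "$match": { "_id": 1234 } },
    { "$unwind": "$cars" },
    { "$group": { "_id": "$cars.year", "count": { "$sum": 1 } } }
])

Stages that run after $unwind operate on transformed documents, so they cannot take advantage of the collection's indexes.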

Finally, dealer documents can get very large with this data model. Although projections and atomic update operators cut down the size of the documents transmitted over the wire, there may still be scenarios where the system is hauling more baggage around than it should.

Adding and Updating Cars

Adding and modifying car entries can require a scan of much or all of the array being updated, resulting in slow operations. For example, $addToSet adds a new element to an array only if it isn't already present, which requires the database to scan through every existing element to check for a duplicate. $pull can be similarly inefficient, since every element must be examined to find those that match.
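As a sketch, both of these operations force a scan of the existing array (collection name assumed, as before):

// $addToSet must compare the new car against every existing element
db.dealers.update(
    { "_id": 1234 },
    { "$addToSet": { "cars": { "year": 1985,
                               "make": "DeLorean",
                               "model": "DMC-12",
                               "vin": 8056309,
                               "mechanicNotes": "Great Scott!" } } }
)

// $pull must examine every element to find the ones to remove
db.dealers.update(
    { "_id": 1234 },
    { "$pull": { "cars": { "vin": 8056309 } } }
)

With a few dozen cars per dealer this is harmless; with thousands, every such write pays the scan cost.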

Furthermore, modifications to the "cars" array can sometimes grow a dealer document beyond its allocated space, forcing the document to be relocated. These moves can be very expensive, particularly when the collection is heavily indexed: every index entry that points to the relocated document must also be updated to point to its new location.

Can we fix it?

Sometimes a data model that entails very large arrays can be reformulated into a data model that is much more efficient. In our case we could alternatively model dealers and cars like this:


{ "_id": 3423, "dealershipName": "Eric's Mongo Cars" }


{ "_id" : 1234,
  "dealership": 3423,
  "year": 2013,
  "make": "10gen",
  "model": "MongoCar",
  "vin": 3928056,
  "mechanicNotes": "Runs great!"}
{ "_id" : 54321,
  "dealership": 3423,
  "year": 1985,
  "make": "DeLorean",
  "model": "DMC-12",
  "vin": 8056309,
  "mechanicNotes": "Great Scott!"}

In this data model we avoid the excessively large arrays and, with the right indexes, perform efficient queries on dealerships with even the largest of inventories. By keeping cars in their own collection, Eric's Mongo Cars is ready to move inventory with crazy low prices, without fear that our volume is going to bring down the system for Eddie's Junker Shack down the road. We love those guys.
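With the separate "cars" collection, the equivalent operations become ordinary indexed queries and single-document writes. A sketch, with index and collection names assumed for illustration:

db.cars.ensureIndex({ "dealership": 1, "make": 1, "model": 1 })

db.cars.find({ "dealership": 3423, "make": "DeLorean", "model": "DMC-12" })

db.cars.insert({ "_id": 54322,
                 "dealership": 3423,
                 "year": 2013,
                 "make": "10gen",
                 "model": "MongoCar",
                 "vin": 5551212,
                 "mechanicNotes": "Brand new!" })

db.cars.remove({ "vin": 8056309 })

No array scans, no document growth, and each operation touches only the documents it needs to.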


Storing information in document arrays is an exciting capability available in MongoDB, but we want to avoid it under the following conditions:

  • The arrays can get very large, even if only in some documents
  • Individual elements must be regularly queried and computed on
  • Elements are added and removed often

None of the pitfalls described above are deal-breakers in and of themselves. It's just that when summed together, the total overhead can become noticeable. So, be wary. In these high-volume cases, it is appropriate for us to use collections, not arrays, to store data. When we store such data using collections,

  • Regular computation is performed using simple, efficient methods
  • Adding and removing elements are simple insert/remove operations
  • Each element is accessible using simpler queries that can be effectively indexed to scale well

There's a lot more detail under the hood, but if you'd like to discuss it, we'll have to get there in the comments section below.

Thanks for reading, and good luck out there!