[“Thinking”, “About”, “Arrays”, “In”, “MongoDB”]

Greetings adventurers!

The growing popularity of MongoDB means more and more people are thinking about data in ways divergent from traditional relational models. For this reason alone, it's exciting to experiment with new ways of modelling data. However, with additional flexibility comes the need to properly analyze the performance impact of data model decisions.

Embedding arrays in documents is a great example of this. MongoDB's versatile array operators ($push/$pull, $addToSet, $elemMatch, etc.) offer the ability to manage data sets within documents. However, one must be careful. Data models that call for very large arrays, or arrays with high rates of modification, can often lead to performance problems.

That's because at high levels of scale, large arrays incur relatively high CPU overhead, leading to longer-than-desired insert, update, and query times.

Let's use car dealerships as an example and discuss why an array of cars in a dealer document isn't necessarily the ideal data model.

Here's our dealer document:

{ "_id": 1234,
  "dealershipName": "Eric's Mongo Cars",
  "cars": [
           {"year": 2013,
            "make": "10gen",
            "model": "MongoCar",
            "vin": 3928056,
            "mechanicNotes": "Runs great!"},
           {"year": 1985,
            "make": "DeLorean",
            "model": "DMC-12",
            "vin": 8056309,
            "mechanicNotes": "Great Scott!"}
          ]
}

Now for some concerns with this model.

Querying Cars

One of the advantages of MongoDB is its rich query language, which supports accessing documents by the contents of an array. If we want to locate cars by make and model, matching both fields against the same array element requires a specific feature of MongoDB's query language, $elemMatch: {"cars": {"$elemMatch": {"make": MAKE, "model": MODEL}}}. This operator, while optimizable by indexes, still needs to traverse the entire "cars" array of every eligible dealer document in order to execute the query. If we query documents whose "cars" arrays contain thousands of car entries, we are essentially doing mini collection-scans. As a result, we'll see high CPU utilization and slow query execution.
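To make that scan cost concrete, here's a rough sketch in plain JavaScript (not MongoDB's actual implementation) of what an $elemMatch predicate amounts to per document: a linear walk of the embedded array.

```javascript
// Sketch: what an $elemMatch predicate effectively does per candidate
// document. For non-matching documents the server walks the whole array.
function elemMatch(array, predicate) {
  return array.some((element) =>
    Object.keys(predicate).every((key) => element[key] === predicate[key])
  );
}

const dealer = {
  _id: 1234,
  dealershipName: "Eric's Mongo Cars",
  cars: [
    { year: 2013, make: "10gen", model: "MongoCar", vin: 3928056 },
    { year: 1985, make: "DeLorean", model: "DMC-12", vin: 8056309 },
  ],
};

console.log(elemMatch(dealer.cars, { make: "DeLorean", model: "DMC-12" })); // true
```

With thousands of entries per "cars" array, that inner loop is exactly the mini collection-scan described above.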

What about more complex computation? The Aggregation framework -- with the $unwind operator -- could support many of the queries we'd like to perform, but indexes may not be used to their full effectiveness.
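Conceptually, $unwind turns one dealer document into one document per car, duplicating the parent fields. A minimal in-memory sketch (plain JavaScript, illustrative only):

```javascript
// Sketch of what the aggregation $unwind stage does conceptually:
// one output document per array element, with parent fields duplicated.
function unwind(docs, arrayField) {
  return docs.flatMap((doc) =>
    (doc[arrayField] || []).map((element) => ({ ...doc, [arrayField]: element }))
  );
}

const dealers = [
  { _id: 1234, dealershipName: "Eric's Mongo Cars",
    cars: [{ make: "10gen", model: "MongoCar" },
           { make: "DeLorean", model: "DMC-12" }] },
];

// Two output docs, one per car. Note the entire array is materialized
// before any later pipeline stage can narrow the results down.
console.log(unwind(dealers, "cars").length); // 2
```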

Finally, dealer documents can get very large with this data model. Although projections and atomic update operators cut down the size of the documents transmitted over the wire, there may still be scenarios where the system is hauling more baggage around than it should.

Adding and updating Cars

Adding and modifying car entries can require a scan of much or all of each array being updated, resulting in slow operations. For example, $addToSet appends a new element only after the database has scanned every existing array element to confirm the new one isn't already present. $pull can be similarly inefficient.
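The linear scans behind those operators can be sketched in plain JavaScript (a simplification: real $addToSet equality on subdocuments is also sensitive to field order, which the stringify comparison here only approximates):

```javascript
// Sketch of the per-update scans behind $addToSet and $pull.
function addToSet(array, value) {
  // $addToSet must compare the candidate against every existing
  // element before it can append -- O(n) per insertion.
  const exists = array.some((el) => JSON.stringify(el) === JSON.stringify(value));
  if (!exists) array.push(value);
  return array;
}

function pull(array, predicate) {
  // $pull similarly visits every element to decide what to remove.
  return array.filter((el) => !predicate(el));
}

const vins = [3928056, 8056309];
addToSet(vins, 8056309);  // already present: full scan, no change
addToSet(vins, 1111111);  // appended, but only after scanning everything
console.log(vins.length); // 3
console.log(pull(vins, (v) => v === 3928056).length); // 2
```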

Furthermore, modifications to the "cars" array can sometimes grow a dealer document beyond its allocated space, forcing the server to move it. These moves can be very expensive, particularly when the collection is heavily indexed, because every index entry pointing to the relocated document must be updated with its new location.

Can we fix it?

Sometimes a data model that entails very large arrays can be reformulated into a data model that is much more efficient. In our case we could alternatively model dealers and cars like this:


{ "_id": 3423, "dealershipName": "Eric's Mongo Cars" }


{ "_id" : 1234,
  "dealership": 3423,
  "year": 2013,
  "make": "10gen",
  "model": "MongoCar",
  "vin": 3928056,
  "mechanicNotes": "Runs great!" }
{ "_id" : 54321,
  "dealership": 3423,
  "year": 1985,
  "make": "DeLorean",
  "model": "DMC-12",
  "vin": 8056309,
  "mechanicNotes": "Great Scott!"}

In this data model we avoid the excessively large arrays and, with the right indexes, perform efficient queries on dealerships with even the largest of inventories. By keeping cars in their own collection, Eric's Mongo Cars is ready to move inventory with crazy low prices, without fear that our volume is going to bring down the system for Eddie's Junker Shack down the road. We love those guys.
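The access pattern this model implies is two simple queries instead of one array traversal. Here's a sketch with in-memory stand-ins for the two collections (in the shell, the second step would be db.cars.find({ "dealership": dealer._id }), served efficiently by an index on { "dealership": 1 }):

```javascript
// In-memory stand-ins for the dealers and cars collections.
const dealersCollection = [
  { _id: 3423, dealershipName: "Eric's Mongo Cars" },
];
const carsCollection = [
  { _id: 1234, dealership: 3423, year: 2013, make: "10gen", model: "MongoCar" },
  { _id: 54321, dealership: 3423, year: 1985, make: "DeLorean", model: "DMC-12" },
];

// Query 1: locate the dealer.
const dealer = dealersCollection.find(
  (d) => d.dealershipName === "Eric's Mongo Cars"
);

// Query 2: fetch its inventory by the reference field. With an index on
// { dealership: 1 }, this is an index lookup, not an array scan.
const inventory = carsCollection.filter((c) => c.dealership === dealer._id);

console.log(inventory.map((c) => c.model)); // [ 'MongoCar', 'DMC-12' ]
```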


Storing information in document arrays is an exciting capability available in MongoDB, but we want to avoid it under the following conditions:

  • The arrays can get very large, even if only in some documents
  • Individual elements must be regularly queried and computed on
  • Elements are added and removed often

None of the pitfalls described above are deal-breakers in and of themselves. It's just that when summed together, the total overhead can become noticeable. So, be wary. In these high-volume cases, it is appropriate for us to use collections, not arrays, to store data. When we store such data using collections,

  • Regular computation is performed using simple, efficient methods
  • Adding and removing elements are simple insert/remove operations
  • Each element is accessible using simpler queries that can be effectively indexed to scale well

There's a lot more detail under the hood, but if you'd like to discuss it, we'll have to get there in the comments section below.

Thanks for reading, and good luck out there!


43 Responses to [“Thinking”, “About”, “Arrays”, “In”, “MongoDB”]

  1. Sky Viker-Rumsey 2013/04/23 at 4:21 am #

    Hey there

    Just a comment of support!

    I attended a meetup with Trisha Gee and I had exactly this concern, and her reply was as your article mentions, to move to a more document model based approach.

    This resolved the issues I was facing along with my concerns about array growth. It’s also changed the way I thought about incorporating MongoDB for future models.

  2. Tom 2013/04/23 at 9:29 am #

    Nice, but it’s a shame we need to go back to relational fundamentals to be performant

  3. Eric Sedor 2013/04/23 at 3:18 pm #

    Thanks! It’s true that this feels similar to RDBMS optimizations. We feel an important point here is that MongoDB does not rely on enforcing predefined tabular schemas to sidestep the scalability problem of managing dynamic data on disk. It puts that flexibility and responsibility in the hands of the developer. To a developer, normalization of fast data is a valid performance tuning strategy in all computing applications. It is a superset of the methodologies that guide the RDBMS sphere, not a subset. It is also important, especially at high levels of scale, to work in harmony with the tool you’re using. For MongoDB, that means acknowledging that the primary access mode is ‘querying documents in collections.’

  4. Geoffroy 2013/04/24 at 12:55 am #

    Same as relational database….so what’s advantage of MongoDB!?

  5. Eric Sedor 2013/04/24 at 11:12 am #


    I think the right way to think about it is: the data structures in
    MongoDB are a superset of those you can express in a relational
    database. So just because MongoDB lets you create “rows” that are
    complex objects with nested arrays and sub-objects, it doesn’t eliminate
    all relational concepts (like normalization via references).

  6. jchlu 2013/04/26 at 1:23 am #

    Nice to see a rational write-up suggesting DB normalisation for performance tuning is good practise, and that good practise is good practise – no matter what the underlying technology.

  7. hpavc 2013/05/15 at 6:52 am #

    This is great, however it would be nice to have both worlds, a symbolic reference to the collection Cars within the original unnamed document at times would be wonderful.

  8. Gene Vilain 2013/05/21 at 8:59 am #

    Isn’t one of the trade-offs the cost of doing joins when you need to see dealer and car information together? Any recommendations or best practices for measuring the cost of both in order to assess the trade-off? Great stuff! thanks!

  9. Ravi Kishor Shakya 2013/07/16 at 9:13 pm #

    Useful schema design hint you have provided. Thanks

  10. ryan 2013/08/07 at 9:01 pm #

    Hi there,

    Thanks for article.

    However in a little bit more complicate case, say we have multiple dealers and therefore dealer ship ‘class’ stores more information (eg, location, revenue).

    Then it would be hard to find a DeLorean from a dealer with more than 100M revenue with the altered structure.

    Also I guess there are some cases where one database call on a medium-size array would be faster than two database calls when we want information for both dealer and car.

    Would be nice to know the boundary.

  11. Eric Sedor 2013/08/07 at 10:38 pm #

    It’s a great question, Ryan. The normalization optimization presented in this blog is designed to address fast-moving “child” data while also serving heavy user-traffic. For serving such traffic while performing hardcore data-mining of arbitrary relationships like dealer revenue and car model, two queries may be better than one. In short, bring the information to the app-side and perform computation there, rather than relying on the database to perform computation for you.

    The boundary of the tradeoff may not be easy to determine in advance, but is a function of total dealer count, average dealer inventory count, and rate of inventory change. Because the challenge of the outlying dealer with the largest or fastest-moving inventory remains, RANGE of dealer inventory count is also a factor.

    An alternate strategy, if the desired data-relationships are known, is to annotate each car document with the necessary information about the dealership. So a rough car document for this example might look like:

    { vin: 123, model: "DeLorean", dealer: { id: 1, revenue: "100M" } }

    This would require two updates to modify dealer revenue (one to the dealer document and one to all cars matching dealer.id). In this example, that would probably be a semi-regular bulk/batch process, not a fast-moving real-time one. All updates to cars would still benefit from the efficiency of having their own collection (and the app could have the dealer document in memory to assist with all car document construction). Then, data-mining based on car model and dealership information could be performed with a simple query to an index like { model: 1, "dealer.revenue": 1 }.

    Again, this annotation strategy is most useful if the relationships are known in advance, the parent information is slow-moving, and the goal is ideal performance for data-mining the known relationship. In terms of tradeoff, the cost is spent in disk space rather than query-time.
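[Editor's note: the dual-write tradeoff described in this reply can be sketched in plain JavaScript with hypothetical in-memory data; in the shell, the second write would be a multi-document update like db.cars.update({ "dealer.id": dealerId }, { $set: { "dealer.revenue": revenue } }, { multi: true }).]

```javascript
// Sketch of the two writes a dealer-revenue change implies under the
// annotation (denormalization) strategy. Data is hypothetical.
const dealers = [{ _id: 1, revenue: "100M" }];
const cars = [
  { vin: 123, model: "DeLorean", dealer: { id: 1, revenue: "100M" } },
  { vin: 456, model: "MongoCar", dealer: { id: 1, revenue: "100M" } },
];

function setDealerRevenue(dealerId, revenue) {
  // Write 1: the dealer document itself.
  dealers.find((d) => d._id === dealerId).revenue = revenue;
  // Write 2: every car annotated with that dealer's information.
  cars.filter((c) => c.dealer.id === dealerId)
      .forEach((c) => { c.dealer.revenue = revenue; });
}

setDealerRevenue(1, "150M");
console.log(cars.every((c) => c.dealer.revenue === "150M")); // true
```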

  12. Jesse Pedersen 2013/11/22 at 3:53 pm #

    Thanks for the great write up! We’re happy MongoLab customers ourselves. Just curious, when you say “very large arrays”, what order of magnitude are we talking about? 1,000? 100,000? 1M+


  13. Eric Sedor 2013/11/25 at 9:41 am #

    Really good question, thank you! I definitely warn against 100k or higher, especially if array elements are coming and going. In general, begin being concerned in the 10,000s range. That said, your exact mileage may vary. For example, a high $push/$pull/$addToSet load on an array of subdocuments (with multiple indexes on subdocument fields) could be problematic as early as the 1,000s. Please feel free to email support@mongolab.com if you have questions about specific operations on your MongoLab cluster(s)!

  14. Gordon 2014/01/28 at 3:01 pm #

    So a query to get dealerid, then another query using the id. What happens if you need to get a hundred ids and use them in another query? That would be a huge ‘match’ list.

  15. Ian Weisberger 2014/03/13 at 7:48 pm #

    Forgive me if I’m wrong, but this seems incorrect. Doing “joins” in mongodb is obscenely inefficient because of a lack of in-database join capabilities. So the application would have to do the “joins”. Why not just index the subdocument array?

    Keep the first schema, and run db.cars.ensureIndex({‘cars.make’:1})

  16. Eric Sedor 2014/03/13 at 11:22 pm #

    Even including the cost of $elemMatch operations for more sophisticated car queries, the first strategy can be performant with the index you suggest, so you’re not wrong. Even fast-moving child relationships can be modeled this way at low points of scale.

    The idea here is that as the speed at which dealers move inventory increases, so does the cost of frequent $push/$pull/$addToSet operations to change car inventories. The Power of 2 Sizes option mitigates that cost for an additional range of use-cases, but regularly rewriting arrays can still add up depending on how heavily-indexed they are.

    The second model is intended for maximum query versatility (including aggregation pipelines that won’t require $unwinding into unmapped memory) at high scale. In this environment, the cost of two distinct queries to resolve the relationship is more easily paid for full query features and faster writes.

    This even buys the ability to more readily embed and manipulate arrays in car documents, should it be necessary.

  17. Eric Sedor 2014/03/13 at 11:40 pm #

    This is the challenge in dealing with the relationship resolution component at high points of scale. In addition to large query bodies, resolving sets of ids incurs less efficient index traversal, and large result sets to boot.

    For this reason, the slower-moving components of the data model, dealers, might well be cached in memory by the app, so reversing a multi-dealer extrapolation from a set of cars doesn’t involve the db directly.

    Still, tuning large, multi-dealer operations on cars does become critical; that topic is explored in this blog.

  18. Jernej Jerin 2014/03/31 at 8:22 am #

    What about if you are storing in the fields large chunks of text. Would not there be quite a large overhead when transmitting data back and forth even when having just a couple of hundred subdocuments? So this magnitude probably varies with regard to size of each subdocument in array, right?

  19. Eric Sedor 2014/04/01 at 11:25 am #

    One thing that dramatically helps a case in which network bandwidth is the concern, is that having “elements” in their own collection allows you to selectively query for the elements that you want, rather than querying for a document that contains those elements and asking the DB to extract the value with a query projection and the $slice operator, which may not even be applicable for some array use-cases.

    Because projection with $slice is both a data copy AND requires some degree of iterating through the array, it is exactly the sort of operation we discuss–one that offers unpredictable performance depending on array length and size.

    The sheer number of bytes devoted to the array incurs a cost during potentially any move or rewrite associated with any write operation, even if the operation doesn’t target the array itself.

    By contrast, the number of elements in the array more directly impacts operations like $pull, $addToSet, and $elemMatch, which require iteration through the array.

    For cases in which moving sub-elements to their own collection does NOT allow for more selective querying of those elements, queries will still require more or less the same network bandwidth because the same data has to be transmitted.

  20. Ian Mercer 2014/10/02 at 11:47 am #

    Designing your schema with MongoDB really doesn’t have to be a binary decision: (a) store it normalized or (b) store it in an array decision like that presented here. Depending on how the data is accessed you might also decide to perform any number of different partial denormalizations. For example, if your car dealership view needs a list of manufacturers they have in stock you might have that as an array on the dealership record. You would update it each time their inventory changes, or periodically (eventual consistency), or both. You might also store a count by type on the dealer record. Or you might keep the 10 most recent additions in an array so that the initial page load for a dealer can be accomplished in one round trip to the database, while requests to ‘show more’ are met by querying the full car collection. What you will need really depends on the query patterns not on some abstract storage model. The one caveat is that you must be clear in your code / design documents as to what’s ground-truth data and what’s computed or cached data so other developers aren’t confused.

  21. Akash Gupta 2014/10/30 at 3:23 am #

    How would a query run on such a structure? JOINs are not supported by MongoDB. I am just starting out with MongoDB and the word "denormalization" used in its context sends shivers down my spine.

  22. Dewsworld 2014/12/18 at 4:06 am #

    Mongodb should work on the 16mb limit.

    Also, by linking the previous embedded sub-doc, don’t you think the total read time might get worse?

  23. Randy 2014/12/23 at 11:40 am #

    Good analysis. I was running into the same problem trying to keep thousands of recordset under one entity. It’s a tradeoff worth taking for massive data

  24. jonnyonion 2015/01/17 at 8:55 pm #

    what about storing only the ids of the cars in the dealers collection as an array. And then use this field in combination with $in operator in cars.

    Like: hey that’s me, and these are the keys to my cars ;)

    In your case you need an additional index in cars “dealership”, and each index means RAM.

  25. Antonio de Perio 2015/01/28 at 8:58 pm #

    Mate, I have this exact same modelling problem right now. This article was exactly what I needed. Thanks!

  26. mason 2015/02/27 at 11:54 am #

    Hi. Great article, thanks. We recently came across just this issue at our company and I am tasked with refactoring the data model. I am curious how you would go about maintaining the ordering of the cars objects in a separate collection as they would be in the embedded array?

  27. Eric 2015/04/07 at 12:50 pm #

    Hmm. considering something very relative at the moment, i’m also separating things out for better parsing. But wondering if it will hinder me at all when running cross collection queries.

  28. Irrelon Software Limited 2015/05/13 at 4:48 am #

    We had similar concerns when writing ForerunnerDB and agree that a normalisation is a good idea. We also added a $join operator to allow your final query results to incorporate data from other collections even though we’d like to maintain compatibility with MongoDB as much as possible as it is our hero :) sometimes an extra operator like $join can be very useful.

  29. Eli Arad 2015/12/16 at 4:22 am #

    How do you do index the field year for example in the array?
    i did an index but the executionStats shows still that mongo scan the entire collection

  30. Chris Chang 2015/12/16 at 1:44 pm #

    Hey Eli,

    Check out multikey indexes here: https://docs.mongodb.org/v3.0/core/index-multikey/

  31. Luiz Felipe Pedone 2016/01/13 at 4:33 am #

    Nice article! Thanks for sharing your experience about these issues.

    Do you think executing all necessary queries on this normalised data structure in MongoDB is better than doing these same queries on a relational database?

  32. Arthur Chen 2016/03/03 at 4:03 pm #

    Would it be good that in a dealer document, we store an array of ‘car id’, instead of array of ‘car object’?

  33. Marvin 2016/05/04 at 2:14 pm #


    I really enjoyed this article, because it tackles a current challenge I am facing.
    So I have been working on a project for quite a while and I want to remodel my mongo db accordingly to be more efficient.

    Let’s assume you had 10000+ car dealers in your database and each one of them had 300 cars. If the cars are stored in a separate collection, you would end up with 3 million cars in the collection. While the car document itself, might be small about 100 bytes large or less, finding the associate cars one specific dealers owns would take the database to query 3 million documents. I would assume this could be time intensive quickly.

    Contrary, if you stored 300 cars within the car dealer document, as an array, you would end up with one really long car dealer document. It could take long to update, select, delete and rearrange etc… car 217 within this document.

    For a case like this what would be the best practice or the advice here?

  34. pablo moreno 2016/10/28 at 7:08 am #

    I have an array field containing thousand of ids (nowadays between 1K and 4K), and I have to push a couple of ids every day. Is an array a good idea? Or better to move that content to a collection?

  35. Ben 2016/11/03 at 10:25 pm #

    How you would go about maintaining the ordering of the cars objects? Specifically maintaining the order after new cars are added and old cars are deleted?

  36. Nathan Park 2016/11/04 at 1:33 pm #

    We can’t be exactly sure of what’s best for your app. 1000-4000 is definitely a size range within which we suggest considering the tradeoffs and other factors described above.

    If you’re asking about a database deployment hosted on mLab, feel free to reach out to us at support@mlab.com.

  37. Nathan Park 2016/11/04 at 5:00 pm #

    The $sort update operator (https://docs.mongodb.com/manual/reference/operator/update/sort/) will help maintain an ordering within an array, but this is exactly the kind of operation that will become expensive on larger, faster-moving arrays.


  1. MongoDB 2.4 now available on all MongoLab plans | MongoLab: MongoDB-as-Service - 2013/06/05

    […] you tend you use arrays in your data model, our recent blog post might be of interest to […]

  2. Production-Ready MongoDB | MongoLab: MongoDB-as-Service - 2013/08/29

    […] Thinking about indexing on arrays […]

  3. Tuning MongoDB Performance with MMS | MongoLab: MongoDB-as-Service - 2013/12/04

    […] as well as updates to large documents, and especially updates to documents with large arrays (or large subdocument arrays). These are CPU-intensive operations that can be avoided by altering […]

  4. MongoDB – What Scales Better, An Array Property, A Nested Object Property or Putting That Data In A Separate Model? - 2014/08/27

    […] a REST API that will hopefully scale to be quite large (if I’m lucky). I’ve read about difficulties that can arise with Arrays in MongoDB, so I’m hesitant to use them. Here are the specific requirements for the […]

  5. Avoiding arrays in MongoDB | Ambrose - 2014/12/16

    […] review here that is some-more fit to prevaricate controlling arrays in MongoDB. Therefore, we combined […]

  6. START CREATING AMAZING WEB EXPERIENCES - Tech news, Windows - Mobile - Tips & Tricks, Computers, Science, Books, Extreme Videos... - 2016/01/17

    […] are often used in big data and real-time web applications. Instead of tables, NoSQL databases likeMongoDB use arrays to store […]
