Telemetry Series: Queues and Effective Lock Percent

A key component of optimizing application performance is tuning the performance of the database that supports it. Each post in our Telemetry series discusses an important metric used by developers and database administrators to tune the database and describes how MongoLab users can leverage Telemetry, MongoLab's monitoring interface, to effectively review and take action on these metrics.

Queues and Effective Lock Percent

Any time an operation can't acquire a lock it needs, it is queued and must wait for the lock to be released. Because operations queued on the database side often mean that requests are backing up on the application side, Queues is an important Telemetry metric for assessing how well your MongoLab deployment is responding to the demands of your app.

In MongoDB 2.6 and earlier, you will find that Queues tend to rise and fall with Effective Lock %. This is because Queues refers specifically to the operations that are waiting on another operation (or series of operations) to release MongoDB's database-level and global-level locks.

With MongoDB 3.0 (using the MMAP storage engine), locking is enforced at the collection level, and Effective Lock % is not reported as a server-level metric. This makes the Queues metric even more important. While the Telemetry interface may not show exactly which collections are heavily locked, elevated queueing is usually a consequence of locking.

The focus on Queues is also preferable because, by design, locking happens on any MongoDB deployment that is receiving writes. As long as that locking isn't resulting in queueing, it is usually not a concern.
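
To see where a chart like Queues comes from, you can poll serverStatus yourself and read the globalLock.currentQueue counters. Here is a minimal Python (PyMongo) sketch; the connection URI is a placeholder for your own deployment:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI

    while True:
        status = client.admin.command("serverStatus")
        # globalLock.currentQueue counts operations waiting on locks.
        queue = status["globalLock"]["currentQueue"]
        print("queued readers: %d  writers: %d  total: %d"
              % (queue["readers"], queue["writers"], queue["total"]))
        time.sleep(5)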

[Telemetry charts: high locks leading to high queues]

What is Effective Lock Percent?

MongoDB uses multi-granular reader-writer locking. Reads prevent a write from acquiring the lock, and a write prevents reads and other writes from acquiring the lock, but reads do not block other reads. In addition, each operation holds the lock at a granularity level appropriate to the operation itself.

In MongoDB 2.6 there are two granularity levels: a single Global lock and a Database lock for each database. In this scheme, operations performed on separate databases do not lock each other unless those operations also require the Global lock.

Effective Lock Percent in MongoDB 2.6 is a calculated metric that adds together the Global Lock % and the Lock % of the most-locked database at the time. Because of this computation, and because of the way operations are sampled, values greater than 100% may occur.    
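
To make that arithmetic concrete, here is a rough Python sketch of the calculation against the counters MongoDB 2.6 reports in serverStatus (globalLock.lockTime/totalTime and per-database locks.<db>.timeLockedMicros). It is an approximation of what a chart like Telemetry's would show, not the exact implementation, and the field names are specific to 2.6:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI

    def sample():
        s = client.admin.command("serverStatus")
        # Per-database write-lock time, in microseconds (2.6 field names);
        # the "." entry is the global lock, so it is skipped here.
        per_db = {}
        for name, doc in s.get("locks", {}).items():
            if name != ".":
                per_db[name] = doc.get("timeLockedMicros", {}).get("w", 0)
        return s["globalLock"]["lockTime"], s["globalLock"]["totalTime"], per_db

    lock1, total1, dbs1 = sample()
    time.sleep(5)  # sampling interval
    lock2, total2, dbs2 = sample()

    elapsed = float(total2 - total1)
    global_pct = 100.0 * (lock2 - lock1) / elapsed

    # Lock % of the most-locked database during the interval.
    worst_db_pct = 0.0
    for name, micros in dbs2.items():
        pct = 100.0 * (micros - dbs1.get(name, 0)) / elapsed
        worst_db_pct = max(worst_db_pct, pct)

    # Effective Lock % = Global Lock % + the most-locked database's Lock %.
    print("effective lock %%: %.1f" % (global_pct + worst_db_pct))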

In MongoDB 3.0 with the MMAP storage engine, MongoDB locks at the Global, Database, and Collection levels. A normal write operation holds a Database lock in MongoDB 2.6, but only holds the relevant collection's Collection lock in MongoDB 3.0. This improvement means separate collections can be read from or written to concurrently.

MongoDB 3.0 with the WiredTiger storage engine uses document-level locking for even greater parallelization. Writes to a single collection won't block each other unless they are to the same document.

Note that locking operations can and do yield periodically, so incoming operations may still progress on a heavily locked server. For more detail, read MongoDB's concurrency documentation.

What do I do if I see locking and queueing?

Locking is a normal part of database operation, so some level of locking and queueing is expected. First, consider whether the locking and queueing are actually a problem. You should typically not be concerned by Effective Lock Percent values of less than 15%, but each app is different. Likewise, queueing can be fine as long as the app is not blocked on queued requests.

If you see a rise in Queues and Effective Lock % in Telemetry that corresponds to problems with your application, try the following steps:

  1. If queues and locks coincide with Page Faults, check out Telemetry Series: Page Faults (the previous post in this series) for potential resolutions, such as optimizing indexes or, ultimately, increasing RAM.
  2. If locking and queueing don't coincide with Page Faults, there are two potential causes:
    1. You may have an inefficient index. While poor indexing typically leads to page faulting, this is not the case if all of your data and indexes already fit in available RAM. Even so, the CPU cost of a collection scan can cause a lock to be held longer than necessary. In this case, reduce collection scanning using the index optimization steps in Telemetry Series: Page Faults; the first sketch after this list shows how to spot a collection scan.
    2. If operations are well-indexed, check your write operations (see the second sketch after this list) and consider reducing the frequency of:
      • updates to large documents
      • updates that require document moves
      • full-document updates (i.e., those that don't use update operators)
      • updates using array update operators such as $push and $pull
  3. If queueing and locking cannot be reduced by improving indexes, write strategies, or the data model, it may be time to consider heavier hardware and, potentially, sharding.
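
On the indexing point (step 2.1 above), one way to spot lock-holding collection scans is to explain() a representative query before adding an index. A minimal PyMongo sketch; the database, collection, and field names are hypothetical:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    orders = client.mydb.orders                        # hypothetical collection

    # Explain a representative query. A "COLLSCAN" winning plan (3.0+) or a
    # "BasicCursor" (2.6) means every document is examined under the lock.
    plan = orders.find({"status": "open"}).explain()
    print(plan.get("queryPlanner", {}).get("winningPlan", plan.get("cursor")))

    # An index on the queried field lets the operation touch far fewer
    # documents and hold locks for less time.
    orders.create_index("status")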
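
On the write patterns (step 2.2 above), targeted update operators generally hold write locks more briefly than full-document replacements, since only the modified fields are written and, under MMAP, documents are less likely to grow and move. A hypothetical before-and-after using the PyMongo 3.x API, with the same placeholder collection as above:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    orders = client.mydb.orders                        # hypothetical collection

    # Full-document replacement: the whole document is rewritten, and under
    # MMAP a growing document may have to move on disk.
    doc = orders.find_one({"status": "open"})
    if doc:
        doc["status"] = "shipped"
        orders.replace_one({"_id": doc["_id"]}, doc)

    # Targeted update: only the named field changes in place, which typically
    # holds the lock for less time.
    orders.update_one({"status": "open"}, {"$set": {"status": "shipped"}})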

Importantly, queueing can occur because of a small number of long-running operations. If those operations haven't finished yet, they won't appear in the mongod logs. Viewing, and potentially killing, the offending operations can be a short-term fix until they can be examined for efficiency. To learn more about viewing and killing operations, refer to our documentation on Operation Management.
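
As a starting point, the sketch below lists operations that have been running for longer than a threshold, using PyMongo's current_op() helper (the shell equivalents are db.currentOp() and db.killOp(); the helper is available in the PyMongo versions contemporary with these server releases, and the URI and 30-second threshold are placeholders):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI

    # current_op() returns the same data as db.currentOp() in the shell.
    inprog = client.admin.current_op().get("inprog", [])

    for op in inprog:
        # Flag operations that have been running for more than 30 seconds.
        if op.get("secs_running", 0) > 30:
            print("opid=%s ns=%s running for %ss"
                  % (op.get("opid"), op.get("ns"), op.get("secs_running")))
            # To terminate one, run db.killOp(<opid>) from the mongo shell.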

Have questions or feedback?

We'd love to hear from you as this Telemetry blog series continues. What topics would be most interesting to you? What types of performance problems have you struggled to diagnose?

Email us at support@mongolab.com to let us know your thoughts, or to get our help tuning your MongoLab deployment.
