Understanding and optimising RabbitMQ’s behaviour relies heavily on metrics. We’ve taken insights directly from our latest comprehensive white paper on Observability in RabbitMQ to give you a sneak preview of how those metrics are applied in practice, through real-world RabbitMQ monitoring use cases…
Use Case 1. Unravelling Client Troubles by Monitoring RabbitMQ Logs
The developers have just deployed a new version of the application, but it is not working. The application appears to be running, yet when users try to log in on the website they are presented with a 500 Internal Server Error and nothing happens.
Looking at the RabbitMQ Overview Grafana dashboard, we notice that after the redeployment the channel churn is no longer zero: a few channels are being opened every second:

We also see that before the deployment the channel churn was zero. If we take a look at the RabbitMQ logs, the cause of the issue is immediately visible:
08:20:13.653913+00:00 [info] <0.244750.0> accepting AMQP connection <0.244750.0> (94.44.249.142:2339 -> 172.31.30.226:5672)
08:20:13.760417+00:00 [info] <0.244750.0> connection <0.244750.0> (94.44.249.142:2339 -> 172.31.30.226:5672): user 'testuser' authenticated and granted access to vhost '/'
08:20:13.850177+00:00 [error] <0.244766.0> Channel error on connection <0.244750.0> (94.44.249.142:2339 -> 172.31.30.226:5672, vhost: '/', user: 'testuser'), channel 1:
08:20:13.850177+00:00 [error] <0.244766.0> operation basic.consume caused a channel exception not_found: no queue 'messages' in vhost '/'
08:20:13.927795+00:00 [info] <0.244750.0> closing AMQP connection <0.244750.0> (94.44.249.142:2339 -> 172.31.30.226:5672, vhost: '/', user: 'testuser')
From the logs, we can see that the application is stuck in a connect, attempt-to-consume, disconnect loop. The fix is to create the missing queue. In many cases it is the application’s responsibility to declare it; in other cases it may be done by a configuration tool or the RabbitMQ administrator.
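If the application is responsible for declaring the queue, the fix is a single declaration at start-up. A minimal sketch, assuming the Python pika client and the queue name 'messages' seen in the logs (connection details and durability are assumptions):

import pika

# Connect to the broker (host and credentials are illustrative).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare the queue the consumer expects. queue_declare is idempotent,
# so it is safe to call on every application start-up.
channel.queue_declare(queue="messages", durable=True)

connection.close()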

Now that the queue is created, messages start to flow and users can log in, but the application is slow and for many users the request times out. Error rates are still elevated, only now users receive a 408 Request Timeout instead of a 500. Looking at the queue in question, we can see that messages are piling up.
This means that either there is no consumer attached to the queue or the consumer cannot keep up with the load. By comparing the incoming and outgoing message rates, we can immediately see that the consumer throughput is well below what is required.

We have an incoming message rate of about 15 messages / second.

The consumer is only able to process about 2 messages / second.
This situation is common when the backend service or database is under high load, or, for example, when our authentication service is provided by a third-party API that is experiencing latency.

We can verify that the consumer is slow by looking at the number of unacknowledged messages. We know that the prefetch count for this consumer is 10, and the chart shows that it is maxed out.
This confirms that the issue is the performance of the consumer.
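For reference, a prefetch limit like the one in this example is set on the channel before consuming starts. A minimal sketch with the Python pika client (connection details and callback logic are illustrative):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Deliver at most 10 unacknowledged messages to this consumer.
# Once all 10 are outstanding (as on the chart above), RabbitMQ stops
# delivering until some of them are acknowledged.
channel.basic_qos(prefetch_count=10)

def handle(ch, method, properties, body):
    # ... process the message, e.g. authenticate the user ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="messages", on_message_callback=handle)
channel.start_consuming()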
There can be multiple solutions to this kind of scenario, such as giving the consumer application more resources, purging the messages in the queue, or setting up TTLs so that we are not processing stale messages.
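As an illustration of the last option, a message TTL can be applied through a policy so that stale login requests are dropped rather than processed. The policy name, queue pattern, and 60-second limit below are assumptions:

# Drop messages older than 60 seconds from the 'messages' queue
# (the TTL value should match the application's own timeout).
rabbitmqctl set_policy login-ttl "^messages$" '{"message-ttl":60000}' --apply-to queues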
Use Case 2. Managing Node Overload with RabbitMQ Monitoring Tools and Memory Usage Dashboards
For performance optimisation, many applications choose not to use publisher confirms. While this can increase throughput by removing a small amount of communication overhead, it also increases the risk of destabilising the RabbitMQ server. Without publisher confirms there is nothing to slow the publishers down, so RabbitMQ has to buffer the incoming messages in memory, which puts excessive load on the queues and on the garbage collector. Eventually the memory alarm is raised and traffic is blocked; in bad cases it can even crash the RabbitMQ server, underscoring the importance of monitoring system metrics.
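For contrast, enabling publisher confirms is a small change on the publishing side. A minimal sketch with the Python pika client, publishing to the default exchange (the exchange, routing key, and message body are assumptions):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Put the channel into confirm mode. With pika's BlockingConnection,
# basic_publish then waits for the broker to confirm the message (and raises
# on a nack), which naturally throttles the publisher when the broker is
# under pressure.
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="messages",
    body=b"login-request",
)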
As traffic increases, there will be a point in any installation where RabbitMQ cannot handle the incoming message flow fast enough, leading to increased memory usage.
The load from the publishers increases gradually up to a limit.


We see that, correlating with traffic, the memory usage of one of the nodes increases until it reaches the memory alarm threshold (zero on the graph). At this point we may also notice elevated CPU usage, which adds to the stress on the system.
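For context, the memory alarm threshold is controlled by the high watermark setting in rabbitmq.conf. The relative limit below is the default value, shown as an example rather than a recommendation:

# rabbitmq.conf
# Raise the memory alarm, and block publishing connections, once the node
# uses more than 40% of available RAM. An absolute limit can be set instead
# via vm_memory_high_watermark.absolute.
vm_memory_high_watermark.relative = 0.4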
The Seventh State Clusters dashboard shows that the cluster is having issues, though we need to look carefully, as the memory alarm is only raised for a split second.


Digging further on the “Erlang Memory” dashboard, we see that the heap allocator and the ETS allocator have the largest amounts of memory allocated, correlating with traffic. From this we can suspect that something is storing a large amount of data in ETS and that one or more processes are consuming a lot of memory.

We can use the RabbitMQ Top plugin to dig deeper into the cause of the issue. We see that the Write-Ahead Log (WAL) of the Quorum Queues is backed up and has high memory usage. The WAL process is also responsible for ETS memory usage, which is one of the key metrics to monitor.
This indicates that RabbitMQ is overloaded and that this particular node cannot handle the load.
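The Top plugin ships with RabbitMQ but is not enabled by default; it is switched on the same way as the management plugin:

# Adds per-process and per-ETS-table memory views to the management UI.
rabbitmq-plugins enable rabbitmq_top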

The memory alarm is no longer raised after the load is reduced.

The write-ahead log on this node catches up with the messages.

After the load has stopped, the node returns to a normal level of memory usage and the memory alarm clears. Because publisher confirms are not used, it takes some time for all nodes to catch up with the incoming messages.
This is a good example of why a comprehensive monitoring stack like Prometheus and Grafana is essential for RabbitMQ.
In such cases, having properly tuned alerts for memory and disk space can prevent service interruptions.
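If the cluster is not yet being scraped, the built-in Prometheus plugin is the usual starting point for dashboards like the ones used in this post. A quick sketch for enabling it and checking the endpoint (15692 is the default port):

# Expose metrics for Prometheus to scrape.
rabbitmq-plugins enable rabbitmq_prometheus

# Verify that the endpoint is serving metrics.
curl -s http://localhost:15692/metrics | head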
Use Case 3. Navigating RabbitMQ Cluster Partitions and Node Metrics for High Availability
Today RabbitMQ is mostly deployed in single-datacenter installations. There can be many reasons for this, but the main one is that RabbitMQ, especially with Classic Mirrored Queues, does not handle network-level failures very well. RabbitMQ needs to restart to resynchronise its internal databases, which, even if the network issue was short, can cause a few seconds or more of disruption to traffic. Network partitions are not very common, but they are more common than we’d like, so it is recommended to put monitoring in place to ensure RabbitMQ recovers successfully.
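The partition-handling behaviour demonstrated in this use case is configured in rabbitmq.conf; pause_minority, used here, is a common choice for clusters with an odd number of nodes:

# rabbitmq.conf
# Pause any node that finds itself in the minority side of a partition.
# The alternatives are ignore, autoheal, and pause_if_all_down.
cluster_partition_handling = pause_minority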

In this trial, we have created three queues in the system. Initially the queues and their leaders are distributed evenly across the cluster nodes, which we can verify using the RabbitMQ Overview dashboard.

On the Management Interface, we can drill into individual queues. At this point all members of the queues are online. For the queue pictured above, the leader is located on node ip-172-31-25-196.
We introduce an artificial network partition, placing node ip-172-31-25-196 into a full partition: it loses all connectivity to the other nodes.
It takes some time for the nodes to realise that the network is down. The errors are usually symmetrical (all nodes display similar messages), but this is not a given.
On node ip-172-31-25-196 we notice the following messages for both of the other nodes:
2024-04-12 15:18:21.311282+00:00 [error] <0.251.0> ** Node 'rabbit@ip-172-31-19-156' not responding **
2024-04-12 15:18:21.311282+00:00 [error] <0.251.0> ** Removing (timedout) connection **
A timed-out connection in this case means that the intra-cluster heartbeat timed out and the TCP connection was forcefully closed by the runtime.
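How quickly this timeout is detected is governed by the Erlang net tick time. Assuming a RabbitMQ version where it is exposed in rabbitmq.conf (otherwise it is set via the Erlang kernel application), it looks like this:

# rabbitmq.conf
# Unreachable peers are considered down after roughly net_ticktime to
# net_ticktime * 1.25 seconds; 60 is the default.
net_ticktime = 60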
In certain error scenarios, we get a Mnesia partitioned-network error, which may require manual restarts of the nodes:
2024-04-12 15:10:15.532724+00:00 [error] <0.338.0> Mnesia('rabbit@ip-172-31-25-196'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@ip-172-31-19-156'}
The node also reports that the RabbitMQ application on the other nodes is down, and the Quorum Queues stop serving traffic:
2024-04-12 15:18:21.311640+00:00 [info] <0.547.0> rabbit on node 'rabbit@ip-172-31-19-156' down
2024-04-12 15:18:21.311936+00:00 [info] <0.637.0> queue 'queue-3' in vhost '/': Leader monitor down with noconnection, setting election timeout
This is because Quorum Queues require a majority of their members to be reachable, but node ip-172-31-25-196 is in a full partition and has no connectivity to the other replicas.
Because of how RabbitMQ works today, this node also needs to go into the paused state and stops serving all traffic:
2024-04-12 15:18:22.822023+00:00 [warning] <0.547.0> Cluster minority/secondary status detected - awaiting recovery
2024-04-12 15:18:22.822096+00:00 [info] <0.2442.0> RabbitMQ is asked to stop…

On other nodes in the cluster we can verify that the queue failed over and now has its leader on a different node.

On the RabbitMQ Overview dashboard, we can observe that the paused node no longer reports metrics, and also does not host any queue leaders. The queue leader failed over to the green node.

Metrics are not received from paused nodes. The yellow line is the memory available metric of the paused node.
Let’s introduce another network partition, this time on node ip-172-31-19-156. Due to how pause_minority works in RabbitMQ, the cluster becomes fully unavailable, i.e. all nodes stop automatically:
After the network recovers, the nodes restart, traffic flows once more, and metrics are reported again.

No metrics are reported at all.
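Once the partition has healed, it is worth confirming from the command line that the cluster no longer sees any partitions:

# The output lists running nodes and any network partitions the cluster is
# still aware of; the partitions section should be empty after recovery.
rabbitmqctl cluster_status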
Monitoring RabbitMQ Metrics with the Management Plugin
While external dashboards offer broad observability, sometimes the quickest way to understand what’s happening inside RabbitMQ is by using the tools that ship with it. One of the most practical of these is the RabbitMQ management plugin. This web-based interface gives you a live view into your broker without extra configuration or third-party setup.
Once enabled, the plugin opens access to real-time information about your queues, exchanges, nodes, and connections. From the browser, you can check which queues are backing up, how many messages are unacknowledged, and whether any consumers are falling behind. It’s a direct and intuitive way to keep tabs on what’s flowing through your system.
The plugin also helps surface issues at the node level. You can see which nodes are running hot, which ones might be approaching memory limits, and whether file descriptors are becoming a bottleneck. It complements external monitoring tools and is ideal for quick diagnostics or understanding an issue as it unfolds.
For those managing systems manually, the plugin also provides some control. You can create or delete queues, set user permissions, and inspect bindings, all without needing to touch the command line. While it won’t replace automation or API-driven workflows, it’s often the fastest way to troubleshoot RabbitMQ or test a configuration change.
To turn it on, you just need to run:
rabbitmq-plugins enable rabbitmq_management
Then navigate to http://localhost:15672 in your browser. The plugin makes it easy to view RabbitMQ metrics and interact with your system hands-on, especially useful during testing, incident response, or early-stage deployments.
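The same information is also available over the plugin's HTTP API, which is handy for scripting quick checks. For example (the default guest credentials only work from localhost, and the queue name comes from the earlier use case):

# List all queues, including ready and unacknowledged message counts.
curl -s -u guest:guest http://localhost:15672/api/queues

# Inspect a single queue ('%2F' is the URL-encoded default vhost '/').
curl -s -u guest:guest http://localhost:15672/api/queues/%2F/messages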

This is just a snapshot of the insights you can find in our latest white paper – your guide to effective monitoring for ultimate RabbitMQ performance. Download the full white paper now. 👇
Gabor Olah
Technical Lead | Seventh State
