Mesos Observability Metrics

This document describes the observability metrics provided by Mesos master and agent nodes. It also gives initial guidance on which metrics to monitor to detect abnormal situations in your cluster.

Overview

Mesos master and agent nodes report a set of statistics and metrics that enable cluster operators to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active agents, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.

Metric information is not persisted to disk at either master or agent nodes, which means that metrics will be reset when masters and agents are restarted. Similarly, if the current leading master fails and a new leading master is elected, metrics at the new master will be reset.
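Because counters start again from zero after a restart or failover, the difference between two consecutive samples can be negative. The following is a minimal sketch of a reset-aware delta. The function name counter_delta is hypothetical, and treating any decrease as a reset is a common monitoring convention rather than something Mesos prescribes.

```python
def counter_delta(previous, current):
    """Return the increase of a counter between two consecutive samples.

    Assumption: a decrease means the process restarted (or a new leading
    master was elected) and the counter began again from zero, so the
    whole current value is counted as new events.
    """
    if current < previous:
        return current
    return current - previous
```

For example, if master/tasks_failed was 100 in one sample and 3 in the next, this sketch reports 3 new failures rather than -97.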

Metric Types

Mesos provides two different kinds of metrics: counters and gauges.

Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of agent registrations. For some metrics of this type, the rate of change is often more useful than the value itself.

Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected agents. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
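The two ways of reading these metric types can be sketched as follows. The helper names, the sampling interval, and the number of samples that counts as "sustained" are all assumptions made for illustration; choose values that match your polling setup.

```python
def rate_per_second(previous, current, interval_secs):
    """Average rate of change of a counter between two samples."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    return (current - previous) / interval_secs


def sustained_above(samples, threshold, min_samples):
    """True if the last `min_samples` gauge samples all exceed `threshold`.

    `samples` is ordered oldest first. Fewer than `min_samples` samples
    is treated as "not sustained".
    """
    if min_samples <= 0 or len(samples) < min_samples:
        return False
    return all(value > threshold for value in samples[-min_samples:])
```

A counter that moved from 100 to 160 over 30 seconds has a rate of 2.0 events per second. A gauge sampled as 0.5, 0.95, 0.92, 0.91 is above 0.9 for its last three samples.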

The tables in this document indicate the type of each available metric.

Master Nodes

Metrics from each master node are available via the /metrics/snapshot master endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
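A minimal sketch of reading this endpoint with the Python standard library is shown here. The function names and the timeout are assumptions, and the base URL passed by the caller is a placeholder for your own master address.

```python
import json
from urllib.request import urlopen


def parse_snapshot(body):
    """Decode a /metrics/snapshot response body into a dict of name -> value."""
    metrics = json.loads(body)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metric names and values")
    return metrics


def fetch_snapshot(base_url, timeout_secs=5.0):
    """Fetch and decode the snapshot from a node's base URL.

    `base_url` is supplied by the caller, e.g. a hypothetical
    "http://master.example.com:5050".
    """
    url = base_url.rstrip("/") + "/metrics/snapshot"
    with urlopen(url, timeout=timeout_secs) as response:
        return parse_snapshot(response.read().decode("utf-8"))
```

Calling fetch_snapshot with the address of a master returns a dictionary that can be indexed by metric name, for example snapshot["master/uptime_secs"].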

Observability Metrics

This section lists all available metrics from Mesos master nodes grouped by category.

Resources

The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.

Metric Description Type
master/cpus_percent Percentage of allocated CPUs Gauge
master/cpus_used Number of allocated CPUs Gauge
master/cpus_total Number of CPUs Gauge
master/cpus_revocable_percent Percentage of allocated revocable CPUs Gauge
master/cpus_revocable_total Number of revocable CPUs Gauge
master/cpus_revocable_used Number of allocated revocable CPUs Gauge
master/disk_percent Percentage of allocated disk space Gauge
master/disk_used Allocated disk space in MB Gauge
master/disk_total Disk space in MB Gauge
master/disk_revocable_percent Percentage of allocated revocable disk space Gauge
master/disk_revocable_total Revocable disk space in MB Gauge
master/disk_revocable_used Allocated revocable disk space in MB Gauge
master/gpus_percent Percentage of allocated GPUs Gauge
master/gpus_used Number of allocated GPUs Gauge
master/gpus_total Number of GPUs Gauge
master/gpus_revocable_percent Percentage of allocated revocable GPUs Gauge
master/gpus_revocable_total Number of revocable GPUs Gauge
master/gpus_revocable_used Number of allocated revocable GPUs Gauge
master/mem_percent Percentage of allocated memory Gauge
master/mem_used Allocated memory in MB Gauge
master/mem_total Memory in MB Gauge
master/mem_revocable_percent Percentage of allocated revocable memory Gauge
master/mem_revocable_total Revocable memory in MB Gauge
master/mem_revocable_used Allocated revocable memory in MB Gauge

Master

The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time is malfunctioning. This points to either leader election issues (check the connection to ZooKeeper) or a flapping master process. A low uptime value indicates that the master has restarted recently.

Metric Description Type
master/elected Whether this is the elected master Gauge
master/uptime_secs Uptime in seconds Gauge
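Since master/elected is reported per master, checking for an elected master means looking at the snapshots of all masters together. The sketch below assumes the caller has already collected one decoded snapshot per master; the function name and the master labels are hypothetical.

```python
def elected_masters(snapshots):
    """Return the labels of masters whose snapshot reports master/elected == 1.

    `snapshots` maps a caller-chosen master label to that master's decoded
    /metrics/snapshot dict. A master without the metric is treated as not
    elected.
    """
    return sorted(
        label
        for label, metrics in snapshots.items()
        if metrics.get("master/elected", 0) == 1
    )
```

An empty result for a sustained period matches the "no elected master" situation described above. Exactly one label is what a healthy cluster would be expected to return.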

System

The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.

Metric Description Type
system/cpus_total Number of CPUs available in this master node Gauge
system/load_15min Load average for the past 15 minutes Gauge
system/load_5min Load average for the past 5 minutes Gauge
system/load_1min Load average for the past minute Gauge
system/mem_free_bytes Free memory in bytes Gauge
system/mem_total_bytes Total memory in bytes Gauge

Agents

The following metrics provide information about agent events, agent counts, and agent states. A low number of active agents may indicate that agents are unhealthy or that they are not able to connect to the elected master.

Metric Description Type
master/slave_registrations Number of agents that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected. Counter
master/slave_removals Number of agents removed for various reasons, including maintenance Counter
master/slave_reregistrations Number of agent re-registrations Counter
master/slave_unreachable_scheduled Number of agents which have failed their health check and are scheduled to be marked unreachable. They will not be marked unreachable immediately due to the Agent Removal Rate-Limit, but master/slave_unreachable_completed will start increasing as they do get removed. Counter
master/slave_unreachable_canceled Number of times that an agent was due to be marked unreachable but this transition was canceled. This happens when the agent removal rate limit is enabled and the agent sends a PONG response message to the master before the rate limit allows the agent to be marked unreachable. Counter
master/slave_unreachable_completed Number of agents that were marked as unreachable because they failed health checks. These are agents which were not heard from despite the agent-removal rate limit, and have been marked as unreachable in the master's agent registry. Counter
master/slaves_active Number of active agents Gauge
master/slaves_connected Number of connected agents Gauge
master/slaves_disconnected Number of disconnected agents Gauge
master/slaves_inactive Number of inactive agents Gauge
master/slaves_unreachable Number of unreachable agents. Unreachable agents are periodically garbage collected from the registry, which will cause this value to decrease. Gauge

Frameworks

The following metrics provide information about the registered frameworks in the cluster. No active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.

Metric Description Type
master/frameworks_active Number of active frameworks Gauge
master/frameworks_connected Number of connected frameworks Gauge
master/frameworks_disconnected Number of disconnected frameworks Gauge
master/frameworks_inactive Number of inactive frameworks Gauge
master/outstanding_offers Number of outstanding resource offers Gauge

Tasks

The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.

Metric Description Type
master/tasks_error Number of tasks that were invalid Counter
master/tasks_failed Number of failed tasks Counter
master/tasks_finished Number of finished tasks Counter
master/tasks_killed Number of killed tasks Counter
master/tasks_killing Number of tasks currently being killed Gauge
master/tasks_lost Number of lost tasks Counter
master/tasks_running Number of running tasks Gauge
master/tasks_staging Number of staging tasks Gauge
master/tasks_starting Number of starting tasks Gauge
master/tasks_unreachable Number of unreachable tasks Gauge

Messages

The following metrics provide information about messages between the master and the agents and between the framework and the executors. A high rate of dropped messages may indicate that there is a problem with the network.

Metric Description Type
master/invalid_executor_to_framework_messages Number of invalid executor to framework messages Counter
master/invalid_framework_to_executor_messages Number of invalid framework to executor messages Counter
master/invalid_status_update_acknowledgements Number of invalid status update acknowledgements Counter
master/invalid_status_updates Number of invalid status updates Counter
master/dropped_messages Number of dropped messages Counter
master/messages_authenticate Number of authentication messages Counter
master/messages_deactivate_framework Number of framework deactivation messages Counter
master/messages_decline_offers Number of offers declined Counter
master/messages_executor_to_framework Number of executor to framework messages Counter
master/messages_exited_executor Number of terminated executor messages Counter
master/messages_framework_to_executor Number of messages from a framework to an executor Counter
master/messages_kill_task Number of kill task messages Counter
master/messages_launch_tasks Number of launch task messages Counter
master/messages_reconcile_tasks Number of reconcile task messages Counter
master/messages_register_framework Number of framework registration messages Counter
master/messages_register_slave Number of agent registration messages Counter
master/messages_reregister_framework Number of framework re-registration messages Counter
master/messages_reregister_slave Number of agent re-registration messages Counter
master/messages_resource_request Number of resource request messages Counter
master/messages_revive_offers Number of offer revival messages Counter
master/messages_status_update Number of status update messages Counter
master/messages_status_update_acknowledgement Number of status update acknowledgement messages Counter
master/messages_unregister_framework Number of framework unregistration messages Counter
master/messages_unregister_slave Number of agent unregistration messages Counter
master/messages_update_slave Number of update agent messages Counter
master/recovery_slave_removals Number of agents not re-registered during master failover Counter
master/slave_removals/reason_registered Number of agents removed when new agents registered at the same address Counter
master/slave_removals/reason_unhealthy Number of agents removed due to failed health checks Counter
master/slave_removals/reason_unregistered Number of agents unregistered Counter
master/valid_framework_to_executor_messages Number of valid framework to executor messages Counter
master/valid_status_update_acknowledgements Number of valid status update acknowledgement messages Counter
master/valid_status_updates Number of valid status update messages Counter
master/task_lost/source_master/reason_invalid_offers Number of tasks lost due to invalid offers Counter
master/task_lost/source_master/reason_slave_removed Number of tasks lost due to agent removal Counter
master/task_lost/source_slave/reason_executor_terminated Number of tasks lost due to executor termination Counter
master/valid_executor_to_framework_messages Number of valid executor to framework messages Counter

Event queue

The following metrics provide information about different types of events in the event queue.

Metric Description Type
master/event_queue_dispatches Number of dispatches in the event queue Gauge
master/event_queue_http_requests Number of HTTP requests in the event queue Gauge
master/event_queue_messages Number of messages in the event queue Gauge

Registrar

The following metrics provide information about read and write latency to the agent registrar.

Metric Description Type
registrar/state_fetch_ms Registry read latency in ms Gauge
registrar/state_store_ms Registry write latency in ms Gauge
registrar/state_store_ms/max Maximum registry write latency in ms Gauge
registrar/state_store_ms/min Minimum registry write latency in ms Gauge
registrar/state_store_ms/p50 Median registry write latency in ms Gauge
registrar/state_store_ms/p90 90th percentile registry write latency in ms Gauge
registrar/state_store_ms/p95 95th percentile registry write latency in ms Gauge
registrar/state_store_ms/p99 99th percentile registry write latency in ms Gauge
registrar/state_store_ms/p999 99.9th percentile registry write latency in ms Gauge
registrar/state_store_ms/p9999 99.99th percentile registry write latency in ms Gauge
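The latency statistics above share a common name prefix, so they can be gathered into one structure for plotting or alerting. This is a minimal sketch that assumes a decoded snapshot dictionary; the function name timer_summary is hypothetical.

```python
def timer_summary(metrics, timer_name):
    """Collect `<timer_name>/<stat>` entries into a dict keyed by stat.

    The bare `<timer_name>` entry itself is not included, only the
    suffixed statistics such as max, min, p50, or p99.
    """
    prefix = timer_name + "/"
    return {
        key[len(prefix):]: value
        for key, value in metrics.items()
        if key.startswith(prefix)
    }
```

For instance, timer_summary(snapshot, "registrar/state_store_ms") returns whichever of the write latency statistics are present in that snapshot.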

Replicated log

The following metrics provide information about the replicated log underneath the registrar, which is the persistent store for masters.

Metric Description Type
registrar/log/recovered Whether the replicated log for the registrar has caught up with the other masters in the cluster. A cluster is operational as long as a quorum of "recovered" masters is available in the cluster. Gauge

Allocator

The following metrics provide information about performance and resource allocations in the allocator.

Metric Description Type
allocator/mesos/allocation_run_ms Allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/count Number of allocation algorithm latency measurements in the window Gauge
allocator/mesos/allocation_run_ms/max Maximum allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/min Minimum allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p50 Median allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p90 90th percentile allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p95 95th percentile allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p99 99th percentile allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p999 99.9th percentile allocation algorithm latency in ms Gauge
allocator/mesos/allocation_run_ms/p9999 99.99th percentile allocation algorithm latency in ms Gauge
allocator/mesos/allocation_runs Number of times the allocation algorithm has run Counter
allocator/mesos/roles/<role>/shares/dominant Dominant resource share for the role, exposed as a fraction (0.0-1.0) Gauge
allocator/mesos/event_queue_dispatches Number of dispatch events in the event queue Gauge
allocator/mesos/offer_filters/roles/<role>/active Number of active offer filters for all frameworks within the role Gauge
allocator/mesos/quota/roles/<role>/resources/<resource>/offered_or_allocated Amount of resources considered offered or allocated towards a role's quota guarantee Gauge
allocator/mesos/quota/roles/<role>/resources/<resource>/guarantee Amount of resources guaranteed for a role via quota Gauge
allocator/mesos/resources/cpus/offered_or_allocated Number of CPUs offered or allocated Gauge
allocator/mesos/resources/cpus/total Number of CPUs Gauge
allocator/mesos/resources/disk/offered_or_allocated Allocated or offered disk space in MB Gauge
allocator/mesos/resources/disk/total Total disk space in MB Gauge
allocator/mesos/resources/mem/offered_or_allocated Allocated or offered memory in MB Gauge
allocator/mesos/resources/mem/total Total memory in MB Gauge
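One simple use of the allocator resource gauges is to compute how much of each resource is neither offered nor allocated. The sketch below assumes a decoded snapshot dictionary and uses the hypothetical name allocator_headroom; the subtraction is an illustration of how the two gauges relate, not a metric that Mesos reports.

```python
def allocator_headroom(metrics, resources=("cpus", "disk", "mem")):
    """Return total minus offered_or_allocated for each allocator resource.

    Resources whose metrics are missing from the snapshot are skipped.
    """
    headroom = {}
    for resource in resources:
        prefix = "allocator/mesos/resources/" + resource
        total = metrics.get(prefix + "/total")
        used = metrics.get(prefix + "/offered_or_allocated")
        if total is None or used is None:
            continue  # metric not present in this snapshot
        headroom[resource] = total - used
    return headroom
```

CPU headroom is a number of CPUs, while disk and memory headroom are in MB, matching the units in the table above.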

Basic Alerts

This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.

master/uptime_secs is low

The master has restarted.

master/uptime_secs < 60 for sustained periods of time

The cluster has a flapping master node.

master/tasks_lost is increasing rapidly

Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.

master/slaves_active is low

Agents are having trouble connecting to the master.

master/cpus_percent > 0.9 for sustained periods of time

Cluster CPU utilization is close to capacity.

master/mem_percent > 0.9 for sustained periods of time

Cluster memory utilization is close to capacity.

master/elected is 0 for sustained periods of time

No master is currently elected.
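The threshold-based alerts in this section can be sketched as a small rule table evaluated over recent snapshots. This assumes a list of decoded snapshots from a single master, oldest first. The rule names, the function name, and the window parameter (a stand-in for "sustained periods of time") are hypothetical, and a missing metric is treated as not triggering its rule.

```python
def evaluate_alerts(history, window=3):
    """Return names of rules that hold for each of the last `window` snapshots."""
    rules = {
        "master_flapping": lambda m: m.get("master/uptime_secs", float("inf")) < 60,
        "cpu_near_capacity": lambda m: m.get("master/cpus_percent", 0) > 0.9,
        "mem_near_capacity": lambda m: m.get("master/mem_percent", 0) > 0.9,
        "not_elected": lambda m: m.get("master/elected", 1) == 0,
    }
    if window <= 0 or len(history) < window:
        return []  # not enough samples to call anything sustained
    recent = history[-window:]
    return sorted(
        name for name, rule in rules.items() if all(rule(m) for m in recent)
    )
```

Note that the not_elected rule only describes the polled master. Deciding that no master is elected requires the same check to hold on every master in the cluster.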

Agent Nodes

Metrics from each agent node are available via the /metrics/snapshot agent endpoint. The response is a JSON object that contains metric names and values as key-value pairs.

Observability Metrics

This section lists all available metrics from Mesos agent nodes grouped by category.

Resources

The following metrics provide information about the total resources available in the agent and their current usage.

Metric Description Type
slave/cpus_percent Percentage of allocated CPUs Gauge
slave/cpus_used Number of allocated CPUs Gauge
slave/cpus_total Number of CPUs Gauge
slave/cpus_revocable_percent Percentage of allocated revocable CPUs Gauge
slave/cpus_revocable_total Number of revocable CPUs Gauge
slave/cpus_revocable_used Number of allocated revocable CPUs Gauge
slave/disk_percent Percentage of allocated disk space Gauge
slave/disk_used Allocated disk space in MB Gauge
slave/disk_total Disk space in MB Gauge
slave/disk_revocable_percent Percentage of allocated revocable disk space Gauge
slave/disk_revocable_total Revocable disk space in MB Gauge
slave/disk_revocable_used Allocated revocable disk space in MB Gauge
slave/gpus_percent Percentage of allocated GPUs Gauge
slave/gpus_used Number of allocated GPUs Gauge
slave/gpus_total Number of GPUs Gauge
slave/gpus_revocable_percent Percentage of allocated revocable GPUs Gauge
slave/gpus_revocable_total Number of revocable GPUs Gauge
slave/gpus_revocable_used Number of allocated revocable GPUs Gauge
slave/mem_percent Percentage of allocated memory Gauge
slave/mem_used Allocated memory in MB Gauge
slave/mem_total Memory in MB Gauge
slave/mem_revocable_percent Percentage of allocated revocable memory Gauge
slave/mem_revocable_total Revocable memory in MB Gauge
slave/mem_revocable_used Allocated revocable memory in MB Gauge

Agent

The following metrics provide information about whether an agent is currently registered with a master and for how long it has been running.

Metric Description Type
slave/registered Whether this agent is registered with a master Gauge
slave/uptime_secs Uptime in seconds Gauge

System

The following metrics provide information about the agent system.

Metric Description Type
system/cpus_total Number of CPUs available Gauge
system/load_15min Load average for the past 15 minutes Gauge
system/load_5min Load average for the past 5 minutes Gauge
system/load_1min Load average for the past minute Gauge
system/mem_free_bytes Free memory in bytes Gauge
system/mem_total_bytes Total memory in bytes Gauge

Executors

The following metrics provide information about the executor instances running on the agent.

Metric Description Type
containerizer/mesos/container_destroy_errors Number of containers destroyed due to launch errors Counter
slave/container_launch_errors Number of container launch errors Counter
slave/executors_preempted Number of executors destroyed due to preemption Counter
slave/frameworks_active Number of active frameworks Gauge
slave/executor_directory_max_allowed_age_secs Maximum allowed age in seconds to delete executor directory Gauge
slave/executors_registering Number of executors registering Gauge
slave/executors_running Number of executors running Gauge
slave/executors_terminated Number of terminated executors Counter
slave/executors_terminating Number of terminating executors Gauge
slave/recovery_errors Number of errors encountered during agent recovery Gauge

Tasks

The following metrics provide information about active and terminated tasks.

Metric Description Type
slave/tasks_failed Number of failed tasks Counter
slave/tasks_finished Number of finished tasks Counter
slave/tasks_killed Number of killed tasks Counter
slave/tasks_lost Number of lost tasks Counter
slave/tasks_running Number of running tasks Gauge
slave/tasks_staging Number of staging tasks Gauge
slave/tasks_starting Number of starting tasks Gauge

Messages

The following metrics provide information about messages between the agent and the master it is registered with.

Metric Description Type
slave/invalid_framework_messages Number of invalid framework messages Counter
slave/invalid_status_updates Number of invalid status updates Counter
slave/valid_framework_messages Number of valid framework messages Counter
slave/valid_status_updates Number of valid status updates Counter