Per-container Network Monitoring and Isolation
Mesos on Linux provides support for per-container network monitoring and
isolation. The network isolation prevents a single container from exhausting the
available network ports, consuming an unfair share of the network bandwidth or
significantly delaying packet transmission for others. Network statistics for
each active container are published through the /monitor/statistics.json
endpoint on the slave. The network isolation is transparent for the majority of
tasks running on a slave (those that bind to port 0 and let the kernel allocate
their port).
Installation
Per-container network monitoring and isolation is not supported by default. To enable it you need to install additional dependencies and configure it during the build process.
Prerequisites
Per-container network monitoring and isolation is only supported on Linux kernel versions 3.6 and above. Additionally, the kernel must include the following patches (merged in kernel version 3.15):
- 6a662719c9868b3d6c7d26b3a085f0cd3cc15e64
- 0d5edc68739f1c1e0519acbea1d3f0c1882a15d7
- e374c618b1465f0292047a9f4c244bd71ab5f1f0
- 25f929fbff0d1bcebf2e92656d33025cd330cbf8
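The version requirements above can be checked before building. The following is a minimal Python 3 sketch, not part of Mesos; the function names are hypothetical. It only compares version numbers, so a kernel between 3.6 and 3.15 still needs a manual check that the four patches were backported.

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '3.15.0-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release, minimum=(3, 6)):
    """True if the release is at least the given (major, minor) version."""
    return parse_kernel_version(release) >= minimum


def running_kernel_report():
    """Describe the running kernel relative to the 3.6 and 3.15 thresholds."""
    release = platform.release()
    if not meets_minimum(release, (3, 6)):
        return release + ": too old for network isolation"
    if not meets_minimum(release, (3, 15)):
        return release + ": supported only if the patches are backported"
    return release + ": patches are included upstream"
```

Calling `running_kernel_report()` on the slave host gives a one-line summary of which case applies.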
The following packages are required on the slave:
Additionally, if you are building from source, you will also need the libnl3 development package to compile Mesos:
- libnl3-devel / libnl3-dev >= 3.2.26
Build
To build Mesos with per-container network monitoring and isolation support, you need to add a configure option:
$ ./configure --with-network-isolator
$ make
Configuration
Per-container network monitoring and isolation is enabled on the slave by adding network/port_mapping to the slave command line --isolation flag.
--isolation="network/port_mapping"
If the slave has not been compiled with per-container network monitoring and isolation support, it will refuse to start and print an error:
I0708 00:17:08.080271 44267 containerizer.cpp:111] Using isolation: network/port_mapping
Failed to create a containerizer: Could not create MesosContainerizer: Unknown or unsupported
isolator: network/port_mapping
Configuring network ports
Without network isolation, all the containers on a host share the public IP address of the slave and can bind to any port allowed by the OS.
When network isolation is enabled, each container on the slave has a separate network stack (via Linux network namespaces). All containers still share the same public IP of the slave (so that the service discovery mechanism does not need to be changed). The slave assigns each container a non-overlapping range of ports, and only packets to or from these assigned port ranges are delivered. Applications that ask the kernel to assign a port (by binding to port 0) are given ports from the container's assigned range. Applications can bind to ports outside the container's assigned ranges, but packets to or from these ports are silently dropped by the host.
Mesos provides two ranges of ports to containers:
- OS allocated "ephemeral" ports are assigned by the OS in a range specified for each container by Mesos.
- Mesos allocated "non-ephemeral" ports are acquired by a framework using the same Mesos resource offer mechanism used for cpu, memory etc. for allocation to executors/tasks as required.
Additionally, the host itself will require ephemeral ports for network communication. You need to configure these three non-overlapping port ranges on the host.
Host ephemeral port range
The currently configured host ephemeral port range can be discovered at any time using the command sysctl net.ipv4.ip_local_port_range. If ports need to be set aside for slave containers, the ephemeral port range can be updated in /etc/sysctl.conf. Rebooting after the update will apply the change and eliminate the possibility that ports are already in use by other processes. For example, add the following to /etc/sysctl.conf:
# net.ipv4.ip_local_port_range defines the host ephemeral port range, by
# default 32768-61000. We reduce this range to allow the Mesos slave to
# allocate ports 32768-57344
# net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.ip_local_port_range = 57345 61000
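Because the three port ranges must not overlap, it can help to verify the host setting against the planned container ranges. The sketch below is illustrative Python 3 with hypothetical helper names; it assumes the standard Linux procfs path /proc/sys/net/ipv4/ip_local_port_range, which holds the same two numbers that sysctl prints.

```python
def parse_port_range(text):
    """Parse 'low high' (sysctl/procfs format) into an inclusive (low, high)."""
    low, high = (int(field) for field in text.split())
    return low, high


def ranges_overlap(a, b):
    """True if two inclusive (low, high) port ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def read_host_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the live host ephemeral port range (Linux only)."""
    with open(path) as f:
        return parse_port_range(f.read())


def find_conflicts(host, container_ephemeral, non_ephemeral):
    """Return the names of every pair of ranges that overlap."""
    named = [("host", host),
             ("container ephemeral", container_ephemeral),
             ("non-ephemeral", non_ephemeral)]
    conflicts = []
    for i in range(len(named)):
        for j in range(i + 1, len(named)):
            if ranges_overlap(named[i][1], named[j][1]):
                conflicts.append((named[i][0], named[j][0]))
    return conflicts
```

With the values used in this document, `find_conflicts((57345, 61000), (32768, 57344), (31000, 32000))` reports no conflicts, whereas the default host range 32768-61000 would conflict with the container ephemeral range.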
Container port ranges
The container ephemeral and non-ephemeral port ranges are configured using the slave --resources flag. The non-ephemeral port range is provided to the master, which will then offer it to frameworks for allocation.
The ephemeral port range is sub-divided by the slave, giving ephemeral_ports_per_container (default 1024) to each container. The maximum number of containers on the slave will therefore be limited to approximately:
number of ephemeral_ports / ephemeral_ports_per_container
The master --max_executors_per_slave flag can be used to prevent allocation of more executors on a slave when the ephemeral port range has been exhausted.
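As a worked instance of the limit above, the following Python 3 sketch computes the approximate container count for a given range. The function name is hypothetical, and the integer division is an assumption about how leftover ports are treated; the exact number a slave supports may differ slightly.

```python
def max_containers(ephemeral_low, ephemeral_high, ports_per_container=1024):
    """Approximate container limit for an inclusive ephemeral port range."""
    total_ports = ephemeral_high - ephemeral_low + 1  # both ends are included
    return total_ports // ports_per_container


# The range used in this document, with the default and a smaller allotment.
print(max_containers(32768, 57344))       # 24577 ports / 1024 -> 24
print(max_containers(32768, 57344, 512))  # 24577 ports / 512  -> 48
```

A value derived this way is a reasonable starting point for --max_executors_per_slave.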
It is recommended (but not required) that ephemeral_ports_per_container be set to a power of 2 (e.g., 512, 1024) and the lower bound of the ephemeral port range be a multiple of ephemeral_ports_per_container to minimize CPU overhead in packet processing. For example:
--resources="ports:[31000-32000];ephemeral_ports:[32768-57344]" \
--ephemeral_ports_per_container=512
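The two recommendations can be checked mechanically before starting the slave. This is a small Python 3 sketch with hypothetical function names; it only reports deviations from the recommendation and does not reflect any validation Mesos itself performs.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def check_ephemeral_layout(range_low, ports_per_container):
    """Return a list of warnings; an empty list means both hints are met."""
    warnings = []
    if not is_power_of_two(ports_per_container):
        warnings.append("ephemeral_ports_per_container is not a power of 2")
    if range_low % ports_per_container != 0:
        warnings.append("range lower bound is not a multiple of "
                        "ephemeral_ports_per_container")
    return warnings
```

For the example above, `check_ephemeral_layout(32768, 512)` returns an empty list, since 32768 is a multiple of 512 and 512 is a power of 2.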
Rate limiting container traffic
Outbound traffic from a container to the network can be rate limited to prevent a single container from consuming all available network resources with detrimental effects to the other containers on the host. The --egress_rate_limit_per_container flag specifies that each container launched on the host be limited to the specified bandwidth (in bytes per second). Network traffic which would cause this limit to be exceeded is delayed for later transmission. The TCP protocol will adjust to the increased latency and reduce the transmission rate, ensuring no packets need be dropped.
--egress_rate_limit_per_container=100MB
We do not rate limit inbound traffic since we can only modify the network flows after they have been received by the host and any congestion has already occurred.
Egress traffic isolation
Delaying network data for later transmission can increase latency and jitter (variability) for all traffic on the interface. Mesos can reduce the impact on other containers on the same host through flow classification and isolation: it uses the containers' port ranges to maintain a unique flow for each container and sends traffic from these flows fairly (using the FQ_Codel algorithm). Use the --egress_unique_flow_per_container flag to enable this.
--egress_unique_flow_per_container
Putting it all together
A complete slave command line that enables network isolation, reserves ports 57345-61000 for host ephemeral ports, 32768-57344 for container ephemeral ports, and 31000-32000 for non-ephemeral ports allocated by the framework, and limits container transmit bandwidth to 300 Mbits/second (37.5 MBytes/second) with unique flows enabled would thus be:
mesos-slave \
--isolation=network/port_mapping \
--resources="ports:[31000-32000];ephemeral_ports:[32768-57344]" \
--ephemeral_ports_per_container=1024 \
--egress_rate_limit_per_container=37500KB \
--egress_unique_flow_per_container
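The bandwidth value in the command line above comes from converting bits to bytes. The Python 3 sketch below reproduces that arithmetic with hypothetical helper names. It assumes decimal units (1 Mbit = 10^6 bits, 1 KB = 1000 bytes), which is what the 300 Mbits/second to 37500KB example implies; check how your Mesos version parses byte suffixes before relying on it.

```python
def mbits_to_bytes_per_second(mbits):
    """Convert megabits/second to bytes/second, assuming 1 Mbit = 10**6 bits."""
    return mbits * 10**6 // 8


def format_rate_flag(mbits, kb=1000):
    """Build the flag string; kb=1000 is an assumption about the KB suffix."""
    kilobytes = mbits_to_bytes_per_second(mbits) // kb
    return "--egress_rate_limit_per_container=%dKB" % kilobytes


print(format_rate_flag(300))  # --egress_rate_limit_per_container=37500KB
```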
Monitoring container network statistics
Mesos exposes statistics from the Linux network stack for each container network on the /monitor/statistics.json slave endpoint.
From the network interface inside the container, we report the following counters (since container creation) under the statistics key:
Metric | Description | Type |
---|---|---|
net_rx_bytes | Received bytes | Counter |
net_rx_dropped | Packets dropped on receive | Counter |
net_rx_errors | Errors reported on receive | Counter |
net_rx_packets | Packets received | Counter |
net_tx_bytes | Sent bytes | Counter |
net_tx_dropped | Packets dropped on send | Counter |
net_tx_errors | Errors reported on send | Counter |
net_tx_packets | Packets sent | Counter |
Additionally, Linux Traffic Control can report the following statistics for the elements which implement bandwidth limiting and bloat reduction under the statistics/net_traffic_control_statistics key. The entry for each of these elements includes:
Metric | Description | Type |
---|---|---|
backlog | Bytes queued for transmission [1] | Gauge |
bytes | Sent bytes | Counter |
drops | Packets dropped on send | Counter |
overlimits | Count of times the interface was over its transmit limit when it attempted to send a packet. Since the normal action when the network is overlimit is to delay the packet, the overlimit counter can be incremented many times for each packet sent on a heavily congested interface. [2] | Counter |
packets | Packets sent | Counter |
qlen | Packets queued for transmission | Gauge |
ratebps | Transmit rate in bytes/second [3] | Gauge |
ratepps | Transmit rate in packets/second [3] | Gauge |
requeues | Packets failed to send due to resource contention (such as kernel locking) [3] | Counter |
[1] backlog is only reported on the bloat_reduction interface.
[2] overlimits are only reported on the bw_limit interface.
[3] Currently always reported as 0 by the underlying Traffic Control element.
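Since ratebps and ratepps are currently reported as 0, a transmit rate has to be derived from two samples of a counter such as bytes or net_tx_bytes. The following Python 3 sketch shows one way to do that; the function name and the choice to return None on a counter reset are illustrative assumptions, not Mesos behavior.

```python
def counter_rate(prev_value, prev_timestamp, cur_value, cur_timestamp):
    """Average per-second rate between two samples of a monotonic counter.

    Returns None when no rate can be computed: the samples are not in time
    order, or the counter went backwards (for example, the container was
    restarted and its counters started again from zero).
    """
    elapsed = cur_timestamp - prev_timestamp
    if elapsed <= 0 or cur_value < prev_value:
        return None
    return (cur_value - prev_value) / elapsed
```

Gauges such as qlen and backlog are instantaneous values and should be read directly rather than differenced.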
For example, these are the statistics you will get by hitting the /monitor/statistics endpoint on a slave with network monitoring turned on:
$ curl -s http://localhost:5051/monitor/statistics | python2.6 -mjson.tool
[
    {
        "executor_id": "job.1436298853",
        "executor_name": "Command Executor (Task: job.1436298853) (Command: sh -c 'iperf ....')",
        "framework_id": "20150707-195256-1740121354-5150-29801-0000",
        "source": "job.1436298853",
        "statistics": {
            "cpus_limit": 1.1,
            "cpus_nr_periods": 16314,
            "cpus_nr_throttled": 16313,
            "cpus_system_time_secs": 2667.06,
            "cpus_throttled_time_secs": 8036.840845388,
            "cpus_user_time_secs": 123.49,
            "mem_anon_bytes": 8388608,
            "mem_cache_bytes": 16384,
            "mem_critical_pressure_counter": 0,
            "mem_file_bytes": 16384,
            "mem_limit_bytes": 167772160,
            "mem_low_pressure_counter": 0,
            "mem_mapped_file_bytes": 0,
            "mem_medium_pressure_counter": 0,
            "mem_rss_bytes": 8388608,
            "mem_total_bytes": 9945088,
            "net_rx_bytes": 10847,
            "net_rx_dropped": 0,
            "net_rx_errors": 0,
            "net_rx_packets": 143,
            "net_traffic_control_statistics": [
                {
                    "backlog": 0,
                    "bytes": 163206809152,
                    "drops": 77147,
                    "id": "bw_limit",
                    "overlimits": 210693719,
                    "packets": 107941027,
                    "qlen": 10236,
                    "ratebps": 0,
                    "ratepps": 0,
                    "requeues": 0
                },
                {
                    "backlog": 15481368,
                    "bytes": 163206874168,
                    "drops": 27081494,
                    "id": "bloat_reduction",
                    "overlimits": 0,
                    "packets": 107941070,
                    "qlen": 10239,
                    "ratebps": 0,
                    "ratepps": 0,
                    "requeues": 0
                }
            ],
            "net_tx_bytes": 163200529816,
            "net_tx_dropped": 0,
            "net_tx_errors": 0,
            "net_tx_packets": 107936874,
            "perf": {
                "duration": 0,
                "timestamp": 1436298855.82807
            },
            "timestamp": 1436300487.41595
        }
    }
]
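Output of this shape is straightforward to post-process. The Python 3 sketch below pulls the network counters for each container out of the endpoint's JSON text; the function name and the selection of fields are illustrative choices, and the key names are taken from the example output above.

```python
import json


def network_summary(payload):
    """Summarize per-container network counters from the endpoint's JSON text."""
    summary = []
    for entry in json.loads(payload):
        stats = entry["statistics"]
        # Index the traffic control elements by their "id" field.
        elements = stats.get("net_traffic_control_statistics", [])
        by_id = {element["id"]: element for element in elements}
        summary.append({
            "executor_id": entry["executor_id"],
            "rx_bytes": stats.get("net_rx_bytes", 0),
            "tx_bytes": stats.get("net_tx_bytes", 0),
            "bw_limit_overlimits": by_id.get("bw_limit", {}).get("overlimits", 0),
            "bloat_reduction_drops": by_id.get("bloat_reduction", {}).get("drops", 0),
        })
    return summary
```

The payload can be the text returned by the curl command shown earlier. Containers without network isolation simply report 0 for the missing fields.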