External Containerizer

EC = external containerizer. A part of the mesos slave that provides an API for containerizing via external plugin executables.
ECP = external containerizer program. An external plugin executable implementing the actual containerizing by interfacing with a containerizing system (e.g. Docker).

Containerizing

General Overview

EC invokes ECP as a shell process, passing the command as a parameter to the ECP executable. Additional data is exhanged via stdin and stdout.

The ECP is expected to return a zero exit code for all commands it was able to process. A non-zero status code signals an error. Below you will find an overview of the commands that have to be implemented by an ECP, as well as their invocation scheme.

The ECP is expected to be using stderr for state info and displaying additional debug information. That information is getting logged to a file, see Enviroment: Sandbox.

Call and communication scheme

Interface describing the functions an ECP has to implement via command calls. Many invocations on the ECP will also pass a protobuf message along via stdin. Some invocations on the ECP also expect to deliver a result protobuf message back via stdout. All protobuf messages are prefixed by their original length - this is sometimes referred to as "Record-IO"-format. See Record-IO De/Serializing Example.

COMMAND < INPUT-PROTO > RESULT-PROTO

launch < containerizer::Launch
update < containerizer::Update
usage < containerizer::Usage > mesos::ResourceStatistics
wait < containerizer::Wait > containerizer::Termination
destroy < containerizer::Destroy
containers > containerizer::Containers
recover

Command Ordering

Make no assumptions

Commands may pretty much come in any order. There is only one exception to this rule; when launching a task, the EC will make sure that the ECP first receives a launch on that specific container, all other commands are queued until launch returns from the ECP.

Use Cases

Task Launching EC / ECP Overview

EC invokes launch on the ECP.
Along with that call, the ECP will receive a containerizer::Launch protobuf message via stdin.
ECP now makes sure the executor gets started. Note that launch is not supposed to block. It should return immediately after triggering the executor/command - that could be done via fork-exec within the ECP.
EC invokes wait on the ECP.
Along with that call, the ECP will receive a containerizer::Wait protobuf message via stdin.
ECP now blocks until the launched command is reaped - that could be implemented via waitpid within the ECP.
Once the command is reaped, the ECP should deliver a containerizer::Termination protobuf message via stdout, back to the EC.

Container Lifecycle Sequence Diagrams

Container Launching

A container is in a staging state and now gets started and observed until it gets into a final state.

Container Launching Scheme

Container Running

A container has gotten launched at some point and now is considered being in a non terminal state by the slave. The following commands will get triggered multiple times at the ECP over the lifetime of a container. Their order however is not determined.

Container Running Scheme

Resource Limitation

While a container is active, a resource limitation was identified (e.g. out of memory) by the ECP isolation mechanism of choice.

Resource Limitation Scheme

Slave Recovery Overview

Slave recovers via check pointed state.
EC invokes recover on the ECP - there is no protobuf message sent or expected as a result from this command.
The ECP may try to recover internal states via its own failover mechanisms, if needed.
After recover returns, the EC will invoke containers on the ECP.
The ECP should return Containers which is a list of currently active containers. Note these containers are known to the ECP but might in fact partially be unknown to the slave (e.g. slave failed after launch but before or within wait) - those containers are considered to be orphans.
The EC now compares the list of slave known containers to those listed within Containers. For each orphan it identifies, the slave will invoke a wait followed by a destroy on the ECP for those containers.
Slave will now call wait on the ECP (via EC) for all recovered containers. This does once again put wait into the position of the ultimate command reaper.

Slave Recovery Sequence Diagram

Recovery

While containers are active, the slave fails over.

Recovery Scheme

Orphan Destruction

Containers identified by the ECP as being active but not slave state recoverable are getting terminated.

Orphan Destruction Scheme

Command Details

launch

Start the containerized executor

Hands over all information the ECP needs for launching a task via an executor. This call should not wait for the executor/command to return. The actual reaping of the containerized command is done via the wait call.

launch < containerizer::Launch

This call receives the containerizer::Launch protobuf via stdin.

/**
 * Encodes the launch command sent to the external containerizer
 * program.
 */
message Launch {
  required ContainerID container_id = 1;
  optional TaskInfo task_info = 2;
  optional ExecutorInfo executor_info = 3;
  optional string directory = 4;
  optional string user = 5;
  optional SlaveID slave_id = 6;
  optional string slave_pid = 7;
  optional bool checkpoint = 8;
}

This call does not return any data via stdout.

wait

Gets information on the containerized executor's Termination

Is expected to reap the executor/command. This call should block until the executor/command has terminated.

wait < containerizer::Wait > containerizer::Termination

This call receives the containerizer::Wait protobuf via stdin.

/**
 * Encodes the wait command sent to the external containerizer
 * program.
 */
message Wait {
  required ContainerID container_id = 1;
}

This call is expected to return containerizer::Termination via stdout.

/**
 * Information about a container termination, returned by the
 * containerizer to the slave.
 */
message Termination {
  // A container may be killed if it exceeds its resources; this will
  // be indicated by killed=true and described by the message string.
  required bool killed = 1;
  required string message = 2;

  // Exit status of the process.
  optional int32 status = 3;
}

The Termination attribute killed is to be set only when the containerizer or the underlying isolation had to enforce a limitation by killing the task (e.g. task exceeded suggested memory limit).

update

Updates the container's resource limits

Is sending (new) resource constraints for the given container. Resource constraints onto a container may vary over the lifetime of the containerized task.

update < containerizer::Update

This call receives the containerizer::Update protobuf via stdin.

/**
 * Encodes the update command sent to the external containerizer
 * program.
 */
message Update {
  required ContainerID container_id = 1;
  repeated Resource resources = 2;
}

This call does not return any data via stdout.

usage

Gathers resource usage statistics for a containerized task

Is used for polling the current resource uses for the given container.

usage < containerizer::Usage > mesos::ResourceStatistics

This call received the containerizer::Usage protobuf via stdin.

/**
 * Encodes the usage command sent to the external containerizer
 * program.
 */
message Usage {
  required ContainerID container_id = 1;
}

This call is expected to return mesos::ResourceStatistics via stdout.

/*
 * A snapshot of resource usage statistics.
 */
message ResourceStatistics {
  required double timestamp = 1; // Snapshot time, in seconds since the Epoch.

  // CPU Usage Information:
  // Total CPU time spent in user mode, and kernel mode.
  optional double cpus_user_time_secs = 2;
  optional double cpus_system_time_secs = 3;

  // Number of CPUs allocated.
  optional double cpus_limit = 4;

  // cpu.stat on process throttling (for contention issues).
  optional uint32 cpus_nr_periods = 7;
  optional uint32 cpus_nr_throttled = 8;
  optional double cpus_throttled_time_secs = 9;

  // Memory Usage Information:
  optional uint64 mem_rss_bytes = 5; // Resident Set Size.

  // Amount of memory resources allocated.
  optional uint64 mem_limit_bytes = 6;

  // Broken out memory usage information (files, anonymous, and mmaped files)
  optional uint64 mem_file_bytes = 10;
  optional uint64 mem_anon_bytes = 11;
  optional uint64 mem_mapped_file_bytes = 12;
}

destroy

Terminates the containerized executor

Is used in rare situations, like for graceful slave shutdown but also in slave fail over scenarios - see Slave Recovery for more.

destroy < containerizer::Destroy

This call receives the containerizer::Destroy protobuf via stdin.

/**
 * Encodes the destroy command sent to the external containerizer
 * program.
 */
message Destroy {
  required ContainerID container_id = 1;
}

This call does not return any data via stdout.

containers

Gets all active container-id's

Returns all container identifiers known to be currently active.

containers > containerizer::Containers

This call does not receive any additional data via stdin.

This call is expected to pass containerizer::Containers back via stdout.

/**
 * Information on all active containers returned by the containerizer
 * to the slave.
 */
message Containers {
  repeated ContainerID containers = 1;
}

recover

Internal ECP state recovery

Allows the ECP to do a state recovery on its own. If the ECP uses state check-pointing e.g. via file system, then this call would be a good moment to de-serialize that state information. Make sure you also see Slave Recovery Overview for more.

recover

This call does not receive any additional data via stdin. No returned data via stdout.

Protobuf Message Definitions

For possibly more up-to-date versions of the above mentioned protobufs as well as protobuf messages referenced by them, please check:

containerizer::XXX are defined within include/mesos/containerizer/containerizer.proto.
mesos::XXX are defined within include/mesos/mesos.proto.

Environment

Sandbox

A sandbox environment is formed by cd into the work-directory of the executor as well as a stderr redirect into the executor's "stderr" log-file. Note not all invocations have a complete sandbox environment.

Addional Environment Variables

Additionally, there are a few new environment variables set when invoking the ECP.

MESOS_LIBEXEC_DIRECTORY = path to mesos-executor, mesos-usage, ... This information is always present.
MESOS_WORK_DIRECTORY = slave work directory. This should be used for distinguishing slave instances. This information is always present.

Note that this is specifically helpful for being able to tie a set of containers to a specific slave instance, thus allowing proper recovery when needed.

MESOS_DEFAULT_CONTAINER_IMAGE = default image as provided via slave flags (default_container_image). This variable is provided only in calls to launch.

Debugging

Enhanced Verbosity Logging

For receiving an increased level of status information from the EC use the GLOG verbosity level. Prefix your mesos startup call by setting the level to a value higher than or equal to two.

GLOG_v=2 ./bin/mesos-slave --master=[...]

ECP stderr Logging

All output to stderr of your ECP will get logged to the executor's 'stderr' log file. The specific location can be extracted from the Enhanced Verbosity Logging of the EC.

Example Log Output:

I0603 02:12:34.165662 174215168 external_containerizer.cpp:1083] Invoking external containerizer for method 'launch'
I0603 02:12:34.165675 174215168 external_containerizer.cpp:1100] calling: [/Users/till/Development/mesos-till/build/src/test-containerizer launch]
I0603 02:12:34.165678 175824896 slave.cpp:497] Successfully attached file '/tmp/ExternalContainerizerTest_Launch_lP22ci/slaves/20140603-021232-16777343-51377-7591-0/frameworks/20140603-021232-16777343-51377-7591-0000/executors/1/runs/558e0a69-70da-4d71-b4c4-c2820b1d6345'
I0603 02:12:34.165686 174215168 external_containerizer.cpp:1101] directory: /tmp/ExternalContainerizerTest_Launch_lP22ci/slaves/20140603-021232-16777343-51377-7591-0/frameworks/20140603-021232-16777343-51377-7591-0000/executors/1/runs/558e0a69-70da-4d71-b4c4-c2820b1d6345

The stderr output of the ECP for this call is found within the stderr file located in the directory displayed in the last quoted line.

cat /tmp/ExternalContainerizerTest_Launch_lP22ci/slaves/20140603-021232-16777343-51377-7591-0/frameworks/20140603-021232-16777343-51377-7591-0000/executors/1/runs/558e0a69-70da-4d71-b4c4-c2820b1d6345/stderr

Appendix

Record-IO Proto Example: Launch

This is what a properly record-io formatted protobuf looks like.

name: offset

length: 00 - 03 = record length in byte
payload: 04 - (length + 4) = protobuf payload

Example length: 00000240h = 576 byte total protobuf size

Example Hexdump:

00000000:  4002 0000 0a26 0a24 3433 3532 3533 6162 2d64 3234 362d 3437  :@....&.$435253ab-d246-47
00000018:  6265 2d61 3335 302d 3335 3432 3034 3635 6438 3638 1a81 020a  :be-a350-35420465d868....
00000030:  030a 0131 2a16 0a04 6370 7573 1000 1a09 0900 0000 0000 0000  :...1*...cpus............
00000048:  4032 012a 2a15 0a03 6d65 6d10 001a 0909 0000 0000 0000 9040  :@2.**...mem............@
00000060:  3201 2a2a 160a 0464 6973 6b10 001a 0909 0000 0000 0000 9040  :2.**...disk............@
00000078:  3201 2a2a 180a 0570 6f72 7473 1001 220a 0a08 0898 f201 1080  :2.**...ports..".........
00000090:  fa01 3201 2a3a 2a1a 2865 6368 6f20 274e 6f20 7375 6368 2066  :..2.*:*.(echo 'No such f
000000a8:  696c 6520 6f72 2064 6972 6563 746f 7279 273b 2065 7869 7420  :ile or directory'; exit
000000c0:  3142 2b0a 2932 3031 3430 3532 362d 3031 3530 3036 2d31 3637  :1B+.)20140526-015006-167
000000d8:  3737 3334 332d 3535 3430 332d 3632 3536 372d 3030 3030 4a3d  :77343-55403-62567-0000J=
000000f0:  436f 6d6d 616e 6420 4578 6563 7574 6f72 2028 5461 736b 3a20  :Command Executor (Task:
00000108:  3129 2028 436f 6d6d 616e 643a 2073 6820 2d63 2027 7768 696c  :1) (Command: sh -c 'whil
00000120:  6520 7472 7565 203b 2e2e 2e27 2952 0131 22c5 012f 746d 702f  :e true ;...')R.1"../tmp/
00000138:  4578 7465 726e 616c 436f 6e74 6169 6e65 7269 7a65 7254 6573  :ExternalContainerizerTes
00000150:  745f 4c61 756e 6368 5f6c 5855 6839 662f 736c 6176 6573 2f32  :t_Launch_lXUh9f/slaves/2
00000168:  3031 3430 3532 362d 3031 3530 3036 2d31 3637 3737 3334 332d  :0140526-015006-16777343-
00000180:  3535 3430 332d 3632 3536 372d 302f 6672 616d 6577 6f72 6b73  :55403-62567-0/frameworks
00000198:  2f32 3031 3430 3532 362d 3031 3530 3036 2d31 3637 3737 3334  :/20140526-015006-1677734
000001b0:  332d 3535 3430 332d 3632 3536 372d 3030 3030 2f65 7865 6375  :3-55403-62567-0000/execu
000001c8:  746f 7273 2f31 2f72 756e 732f 3433 3532 3533 6162 2d64 3234  :tors/1/runs/435253ab-d24
000001e0:  362d 3437 6265 2d61 3335 302d 3335 3432 3034 3635 6438 3638  :6-47be-a350-35420465d868
000001f8:  2a04 7469 6c6c 3228 0a26 3230 3134 3035 3236 2d30 3135 3030  :*.till2(.&20140526-01500
00000210:  362d 3136 3737 3733 3433 2d35 3534 3033 2d36 3235 3637 2d30  :6-16777343-55403-62567-0
00000228:  3a18 736c 6176 6528 3129 4031 3237 2e30 2e30 2e31 3a35 3534  ::.slave(1)@127.0.0.1:554
00000240:  3033 4000

Record-IO De/Serializing Example

How to send and receive such record-io formatted message using Python

taken from src/examples/python/test_containerizer.py

# Read a data chunk prefixed by its total size from stdin.
def receive():
    # Read size (uint32 => 4 bytes).
    size = struct.unpack('I', sys.stdin.read(4))
    if size[0] <= 0:
        print >> sys.stderr, "Expected protobuf size over stdin. " \
                         "Received 0 bytes."
        return ""

    # Read payload.
    data = sys.stdin.read(size[0])
    if len(data) != size[0]:
        print >> sys.stderr, "Expected %d bytes protobuf over stdin. " \
                         "Received %d bytes." % (size[0], len(data))
        return ""

    return data

# Write a protobuf message prefixed by its total size (aka recordio)
# to stdout.
def send(data):
    # Write size (uint32 => 4 bytes).
    sys.stdout.write(struct.pack('I', len(data)))

    # Write payload.
    sys.stdout.write(data)