20/ZRE

20/ZRE

ZeroMQ Realtime Exchange Protocol

The ZeroMQ Realtime Exchange Protocol (ZRE) governs how a group of peers on a network discover each other, organize into groups, and send each other events. ZRE runs over the ZeroMQ ZMTP protocol. Note that this specification has been superceded by rfc.zeromq.org/spec:36/ZRE.

Preamble

Copyright (c) 2009-2013 iMatix Corporation

This Specification is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This Specification is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses.

This Specification is a free and open standard and is governed by the Digital Standards Organization’s Consensus-Oriented Specification System.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Goals

The ZRE protocol provides a way for a set of nodes on a local network to discover each other, track when peers come and go, send messages to individual peers (unicast), and send messages to groups of peers (multicast). Its goals are:

  • To work with no centralized services or mediation except those available by default on a network.
  • To be robust on poor-quality networks, especially wireless networks.
  • To minimize the time taken to detect a new peer’s arrival on the network.
  • To recover from transient failures in connectivity.
  • To be neutral with respect to operating system, programming language, and hardware.
  • To allow any number of nodes to run in one process, to allow large-scale simulation and testing.
  • To provide mechanisms for collecting logging information across the network.

Implementation

Node Identification and Life-cycle

A ZRE node represents a source or a target for messaging. Nodes usually map to applications. A ZRE node is identified by a 16-octet universally unique identifier (UUID). ZRE does not define how a node is created or destroyed but does assume that nodes have a certain durability.

Node Discovery and Presence

ZRE uses UDP IPv4 beacon broadcasts to discover nodes. Each ZRE node SHALL listen to the ZRE discovery service which is UDP port 5670 (ZRE-DISC port assigned by IANA). Each ZRE node SHALL broadcast, at regular intervals, on UDP port 5670 a beacon that identifies itself to any listening nodes on the network.

The ZRE beacon consists of one 22-octet UDP message with this format:

+---+---+---+------+  +------+------+
| Z | R | E | %x01 |  | UUID | port |
+---+---+---+------+  +------+------+

       Header               Body

The header SHALL consist of the letters ‘Z’, ‘R’, and ‘E’, followed by the beacon version number, which SHALL be %x01.

The body SHALL consist of the sender’s 16-octet UUID, followed by a two-byte mailbox port number in network order. If the port is non-zero this signals that the peer will accept ZeroMQ TCP connections on that port number. If the port is zero, this signals that the peer is disconnecting from the network.

A valid beacon SHALL use a recognized header and a body of the correct size. A node that receives an invalid beacon SHALL discard it silently. A node MAY log the sender IP address for the purposes of debugging. A node SHALL discard beacons that it receives from itself.

When a ZRE node receives a beacon from a node that it does not already know about, with a non-zero port number, it SHALL consider this to be a new peer.

When a ZRE node receives a beacon from a known node, with a non-zero port number, it SHALL disconnect from this peer.

Interconnection Model

Each node SHALL create a ZeroMQ ROUTER socket and bind this to an ephemeral TCP port (in the range %C000x - %FFFFx). The node SHALL broadcast this mailbox port number in all beacons that it sends. Note that a node does not broadcast its IP address as this is provided by the UDP recvfrom function.

This ROUTER socket SHALL be used for all incoming ZeroMQ messages from other nodes. A node SHALL NOT send messages to peers via this socket.

When a node discovers a new peer, it SHALL create a ZeroMQ DEALER socket, set its identity (binary 16-octet UUID) on that socket, and connect this to the peer’s mailbox port. A node may immediately, after connection, start to send messages to a peer via this DEALER socket.

A node SHALL connect each DEALER sockets to at most one peer. A node may disconnect its DEALER socket if the peer has failed to respond within some time (see Heartbeating).

This DEALER socket SHALL be used for all outgoing ZeroMQ messages to a specific peer. A node SHALL not receive messages on this socket. The sender MAY set a high-water mark (HWM) of, for example, 100 messages per second (if the timeout period is 30 second, this means a HWM of 3,000 messages). The sender SHOULD set the send timeout on the socket to zero so that a full send buffer can be detected and treated as “peer not responding”.

Note that the ROUTER socket provides the caller with the UUID of the sender for any message received on the socket, as an identity frame that precedes other frames in the message. A peer can thus use the identity on received messages to look up the appropriate DEALER socket for messages back to that peer. The identity SHALL be a binary 16-octet UUID value.

When a node receives, on its ROUTER socket, a valid message from an unknown node, it SHALL treat this as a new peer in the identical fashion as if a UDP beacon was received from an unknown node.

NOTE: the ROUTER-to-DEALER pattern that ZRE uses is designed to ensure that messages are never lost due to synchronization issues. Sending to a ROUTER socket that does not (yet) have a connection to a peer causes the message to be dropped.

Protocol Signature

Every ZRE message sent by TCP SHALL start with the ZRE protocol signature, %xAA %xA1. A node SHALL silently discard any message received that does not start with these two octets.

This mechanism is designed particularly for applications that bind to ephemeral ports which may have been previously used by other protocols, and to which there are still nodes attempting to connect. It is also a general fail-fast mechanism to detect ill-formed messages.

TCP Protocol Grammar

The following ABNF grammar defines the ZRE protocol, where all commands are sent by one node (the sender, “S:") to another peer (the recipient, “R:"):

zre-protocol    = greeting *traffic

greeting        = S:HELLO
traffic         = S:WHISPER
                / S:SHOUT
                / S:JOIN
                / S:LEAVE
                / S:PING R:PING-OK

;   Greet a peer so it can connect back to us
S:HELLO         = signature %x01 sequence ipaddress mailbox groups status headers
signature       = %xAA %xA1
sequence        = 2OCTET        ; Incremental sequence number
ipaddress       = string        ; Sender IP address
string          = size *VCHAR
size            = OCTET
mailbox         = 2OCTET        ; Sender mailbox port number
groups          = strings       ; List of groups sender is in
strings         = size *string
status          = OCTET         ; Sender group status sequence
headers         = dictionary    ; Sender header properties
dictionary      = size *key-value
key-value       = string        ; Formatted as name=value

; Send a message to a peer
S:WHISPER       = signature %x02 sequence content
content         = FRAME         ; Message content as 0MQ frame

; Send a message to a group
S:SHOUT         = signature %x03 sequence group content
group           = string        ; Name of group
content         = FRAME         ; Message content as 0MQ frame

; Join a group
S:JOIN          = signature %x04 sequence group status
status          = OCTET         ; Sender group status sequence

; Leave a group
S:LEAVE         = signature %x05 sequence group status

; Ping a peer that has gone silent
S:PING          = signature %06 sequence

; Reply to a peer's ping
R:PING-OK       = signature %07 sequence

ZRE Commands

The HELLO Command

Each node SHALL start a dialog by sending HELLO as the first command on an connection to a peer.

When a node receives messages from a new peer it SHALL silently ignore any commands that precede a HELLO command.

The HELLO command contains these fields:

  • ipaddress - IP address that the sender will accept connections on.
  • mailbox - port number of the sender’s mailbox.
  • groups - the list of groups that the sender is present in, as a list of strings.
  • status - the sender’s group status sequence.
  • headers - zero or more properties set by the sender.

If the recipient has not already connected to this peer it SHALL create a ZeroMQ DEALER socket and connect it to the endpoint specified as “tcp://ipaddress:mailbox”.

The “group status sequence” is a one-octet number that is incremented each time the peer joins or leaves a group. Each peer MAY use this to assert the accuracy of its own group management information.

The WHISPER Command

When a node wishes to send a message to a single peer it SHALL use the WHISPER command. The WHISPER command contains a single field, which is the message content defined as one 0MQ frame. ZRE does not support multi-frame message contents.

The SHOUT Command

When a node wishes to send a message to a set of nodes participating in a group it SHALL use the SHOUT command. The SHOUT command contains two fields: the name of the group, and the the message content defined as one 0MQ frame.

Note that messages are sent via ZeroMQ over TCP, so the SHOUT command is unicast to each peer that should receive it. ZRE does not provide any UDP multicast functionality.

The JOIN Command

When a node joins a group it SHALL broadcast a JOIN command to all its peers. The JOIN command has two fields: the name of the group to join, and the group status sequence number after joining the group. Group names are case sensitive.

The LEAVE Command

When a node leaves a group it SHALL broadcast a LEAVE command to all its peers. The LEAVE command has two fields: the name of the group to leave, and the group status sequence number after leaving the group.

The PING Command

A node SHOULD send a PING command to any peer that it has not received a UDP beacon from within a certain time (typically five seconds). Note that UDP traffic may be dropped on a network that is heavily saturated. If a node receives no reply to a PING command, and no other traffic from a peer within a somewhat longer time (typically 30 seconds), it SHOULD treat that peer as dead.

Note that PING commands SHOULD be used only in targeted cases where a peer is otherwise silent. Otherwise, the cost of PING commands will rise exponentially with the number of peers connected, and can degrade network performance.

The PING-OK Command

When a node receives a PING command it SHALL reply with a PING-OK command.

ZRE Extension Protocols

ZRE allows the addition of extension protocols that share the ZRE discovery mechanisms. These protocols have the following common properties:

  • They use ZeroMQ messaging over TCP;
  • They are based on a service model where each node may offer zero or more services;
  • Each service binds to an ephemeral port and announces itself to other nodes;
  • Nodes that wish to use a service may then connect to it.

Each extension protocol uses a specific ZeroMQ socket pattern and message format. Currently ZRE supports two extension protocols:

To announce an extension protocol a node adds a headers field to the HELLO command, in the form:

service-name=tcp://ipaddress:port

We specify these extension protocols formally:

  • The ZRE log collection protocol (ZRE/LOG), which provides a subsystem for collecting log data from a distributed network. The service name is “X-ZRELOG”.

The ZRE/LOG Extension Protocol

The log service SHALL create a ZeroMQ SUB socket and bind this to an ephemeral TCP port (in the range %C000x - %FFFFx). The log service SHALL broadcast its connection endpoint in the HELLO command as explained above. Any node wishing to send log data SHALL create a PUB socket and connect it to this endpoint.

Every ZRE/LOG message SHALL start with the protocol signature %xAA %xA2. The log service SHALL silently discard any message received that does not start with these two octets.

The following ABNF grammar defines the ZRE/LOG protocol:

zrelog-protocol = *LOG

;   Send a log message to the log service
LOG             = signature %x01 level event node peer time data
header          = signature
signature       = %xAA %xA2

level           = ERROR / WARNING / INFO
ERROR           = %x01
WARNING         = %x02
INFO            = %x03

event           = JOIN / LEAVE / ENTER / EXIT / SEND / RECV
JOIN            = %x01          ;  We joined a group
LEAVE           = %x02          ;  We left a group
ENTER           = %x03          ;  Peer joined network
EXIT            = %x04          ;  Peer left network
SEND            = %x05          ;  Sent outgoing message
RECV            = %x06          ;  Received message

node            = 2OCTET        ;  Hash of node UUID
peer            = 2OCTET        ;  Hash of peer UUID
time            = 8OCTET        ;  Time in msecs
data            = string        ;  Printable error text
string          = size *VCHAR
size            = OCTET

Node Discovery and Presence

ZRE uses UDP IPv4 beacon broadcasts to discover nodes and track their presence. This works as follows:

  • A ZRE node SHALL listen to the ZRE discovery service which is UDP port 5670 (ZRE-DISC port assigned by IANA).
  • Each ZRE node SHALL broadcast, at regular intervals, a UDP beacon that identifies itself to any listening nodes on the network.
  • When a ZRE node receives a beacon from a node that it does not already know about, it SHALL consider this to be a new peer.
  • When a ZRE node stops receiving beacons from a peer that it knows about, after a certain interval it SHALL consider this peer to be disconnected or dead.

The ZRE beacon consists of one 22-octet UDP message with this format:

+---+---+---+------+  +------+------+
| Z | R | E | %x01 |  | UUID | port |
+---+---+---+------+  +------+------+

       Header               Body

Notes for implementors:

  • The header SHALL consist of the letters ‘Z’, ‘R’, and ‘E’, followed by the beacon version number, which SHALL be %x01.
  • The body SHALL consist of the sender’s 16-octet UUID, followed by a two-byte mailbox port number in network order.
  • A valid beacon SHALL: use a recognized header; use a body of the right size; and provide a non-zero mailbox port number.
  • A node that receives an invalid beacon SHALL discard it silently. A node MAY log the sender IP address for the purposes of debugging.
  • A node SHALL discard beacons that it receives from itself.

Security Aspects

ZRE security is handled by the underlying transport. For ZMTP v2 and v1, there is no security model and all information is exchanged in clear text. For ZMTP v3, any of the defined security mechanism may be used.