#include <sys/socket.h>
#include <netinet/in.h>
Currently, RDS can be transported over Infiniband and loopback. RDS over TCP is disabled, but will be re-enabled in the near future.
RDS uses standard AF_INET addresses as described in ip(7) to identify end points.
rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
In addition to these, RDS supports a number of protocol specific options (with socket level SOL_RDS). Just as with the RDS protocol family, an official value has not been assigned yet, so the kernel will assign a value dynamically. The assigned value can be retrieved from the sol_rds sysctl parameter file.
RDS specific socket options will be described in a separate section below.
For instance, when binding to the address of an Infiniband interface such as ib0, the socket will use the Infiniband transport. If RDS is not able to associate a transport with the given address, it will return EADDRNOTAVAIL.
An RDS socket can only be bound to one address and only one socket can be bound to a given address/port pair. If no port is specified in the binding address then an unbound port is selected at random.
RDS does not allow the application to bind a previously bound socket to another address. Binding to the wildcard address INADDR_ANY is not permitted either.
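The binding rules above can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the helper names rds_fill_bind_addr and rds_open_bound are hypothetical, and the caller is assumed to have already obtained the dynamically assigned protocol family value and to pass it in as pf_rds.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: fill *sin for an RDS bind.
 * Returns 0 on success, -1 if ip is not a valid IPv4 address or is the
 * wildcard address, which RDS does not accept. A port of 0 asks RDS to
 * pick an unbound port. */
static int rds_fill_bind_addr(struct sockaddr_in *sin, const char *ip,
                              uint16_t port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1)
        return -1;
    if (sin->sin_addr.s_addr == htonl(INADDR_ANY))
        return -1;
    return 0;
}

/* Hypothetical helper: create an RDS socket and bind it once.
 * pf_rds is the protocol family value supplied by the caller (assumption).
 * Returns the socket descriptor, or -1 with errno set. */
static int rds_open_bound(int pf_rds, const char *ip, uint16_t port)
{
    struct sockaddr_in sin;
    int fd;

    if (rds_fill_bind_addr(&sin, ip, port) != 0) {
        errno = EINVAL;
        return -1;
    }
    fd = socket(pf_rds, SOCK_SEQPACKET, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) != 0) {
        int saved = errno;      /* e.g. EADDRNOTAVAIL: no transport found */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Rejecting INADDR_ANY before calling bind(2) only saves a system call; the kernel performs the same check.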
The send queue size limits how much data local processes can queue on a local socket (see the following section). If that limit is exceeded, the kernel will not accept further messages until the queue is drained and messages have been delivered to and acknowledged by the remote host.
The receive queue size limits how much data RDS will put on the receive queue of a socket before marking the socket as congested. When a socket becomes congested, RDS will send a congestion map update to the other participating hosts, who are then expected to stop sending more messages to this port.
There is a timing window during which a remote host can still continue to send messages to a congested port; RDS solves this by accepting these messages even if the socket's receive queue is already over the limit.
As the application pulls incoming messages off the receive queue using recvmsg(2), the number of bytes on the receive queue will eventually drop below the receive queue size, at which point the port is then marked uncongested, and another congestion update is sent to all participating hosts. This tells them to allow applications to send additional messages to this port.
The default values for the send and receive buffer size are controlled by the system wide wmem_default and rmem_default sysctls, respectively. They can be tuned by the application through the SO_SNDBUF and SO_RCVBUF socket options.
In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used to specify a timeout (in seconds) after which the call will abort waiting, and return an error. The default timeout is 0, which tells RDS to block indefinitely.
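As an illustration of the queue size and timeout options, the following sketch sets them through setsockopt(2) at the SOL_SOCKET level. The helper names set_queue_sizes and set_io_timeouts are hypothetical. SO_SNDTIMEO and SO_RCVTIMEO take a struct timeval, so a whole number of seconds goes in the tv_sec field.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: request send and receive queue sizes in bytes.
 * Returns 0 on success, -1 with errno set. */
static int set_queue_sizes(int fd, int snd_bytes, int rcv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &snd_bytes, sizeof(snd_bytes)) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &rcv_bytes, sizeof(rcv_bytes));
}

/* Hypothetical helper: apply the same timeout to sends and receives.
 * A value of 0 seconds means block indefinitely. */
static int set_io_timeouts(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```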
RDS does not support out of band data. Applications are allowed to send to unicast addresses only; broadcast or multicast are not supported.
A successful sendmsg(2) call puts the message in the socket's transmit queue where it will remain until either the destination acknowledges that the message is no longer in the network or the application removes the message from the send queue.
Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO socket option described below.
While a message is in the transmit queue its payload bytes are accounted for. If an attempt is made to send a message while there is not sufficient room on the transmit queue, the call will either block or return EAGAIN.
If an attempt is made to send to a destination that is marked congested (see above), the call will either block or return ENOBUFS.
A message sent with no payload bytes will not consume any space in the destination's send buffer but will result in a message receipt on the destination. The receiver will not get any payload data but will be able to see the sender's address.
Messages sent to a port to which no socket is bound will be silently discarded by the destination host. No error messages are reported to the sender.
The address of the sender will be returned in the sockaddr_in structure pointed to by the msg_name field, if set.
If the MSG_PEEK flag is given, the first message on the receive queue is returned without removing it from the queue.
The memory consumed by messages waiting for delivery does not limit the number of messages that can be queued for receive. RDS does attempt to perform congestion control as described in the section above.
If the length of the message exceeds the size of the buffer provided to recvmsg(2), then the remainder of the bytes in the message are discarded and the MSG_TRUNC flag is set in the msg_flags field. In this truncating case recvmsg(2) will still return the number of bytes copied, not the length of the entire message. If MSG_TRUNC is set in the flags argument to recvmsg(2), then it will return the number of bytes in the entire message. Thus one can examine the size of the next message in the receive queue without incurring a copying overhead by providing a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the flags argument.
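The size probe described above might look like the following sketch. The function name peek_next_message_size is hypothetical; the call passes a msghdr with no iovec, which serves as the zero length buffer.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: return the length of the next queued message
 * without copying or dequeuing it, or -1 with errno set.
 * Blocks if the receive queue is empty and the socket is blocking. */
static ssize_t peek_next_message_size(int fd)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* msg_iov = NULL, msg_iovlen = 0 */
    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
}
```

An application can use the returned length to allocate a buffer of the right size before calling recvmsg(2) again without MSG_PEEK.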
The sending address of a zero-length message will still be provided in the msg_name field.
The only exception is the RDS_CMSG_CONG_UPDATE message, which is described in the following section.
Sending to congested ports requires special handling. When an application tries to send to a congested destination, the system call will return ENOBUFS. However, the application cannot usefully poll for POLLOUT, as there is probably still room on the transmit queue, so the call to poll(2) would return immediately, even though the destination is still congested.
There are two ways of dealing with this situation. The first is to simply poll for POLLIN. By default, a process sleeping in poll(2) is always woken up when the congestion map is updated, and thus the application can retry any previously congested sends.
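The first approach can be sketched as a bounded retry loop. The function name send_retry_on_congestion and its max_tries and timeout_ms parameters are hypothetical, and the sketch assumes a non-blocking send so that congestion surfaces as ENOBUFS rather than as a blocked call.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: try to send msg, and on ENOBUFS sleep in poll(2)
 * for POLLIN before retrying. Gives up after max_tries attempts.
 * Returns the sendmsg(2) result, or -1 with errno set. */
static ssize_t send_retry_on_congestion(int fd, const struct msghdr *msg,
                                        int max_tries, int timeout_ms)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t n = sendmsg(fd, msg, MSG_DONTWAIT);

        if (n >= 0 || errno != ENOBUFS)
            return n;               /* success, or an unrelated error */

        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        /* Wakes up on a congestion map update or on incoming data. */
        if (poll(&pfd, 1, timeout_ms) < 0 && errno != EINTR)
            return -1;
    }
    errno = ENOBUFS;
    return -1;
}
```

Because POLLIN also fires when data is waiting on the receive queue, a real application would drain incoming messages between retries; otherwise poll(2) returns immediately and the loop spins until max_tries is exhausted.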
The second option is explicit congestion monitoring, which gives the application more fine-grained control.
With explicit monitoring, the application polls for POLLIN as before, and additionally uses the RDS_CONG_MONITOR socket option to install a 64bit mask value in the socket, where each bit corresponds to a group of ports. When a congestion update arrives, RDS checks the set of ports that became uncongested against the bit mask installed in the socket. If they overlap, a control message is enqueued on the socket, and the application is woken up. When it calls recvmsg(2), it will be given the control message containing the bitmap.
The congestion monitor bitmask can be set and queried using setsockopt(2) with RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.
Congestion updates are delivered to the application via RDS_CMSG_CONG_UPDATE control messages. These control messages are always delivered by themselves (or possibly with additional control messages), but never along with an RDS data message. The cmsg_data field of the control message is an 8 byte datum containing the 64bit mask value.
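Extracting the mask from a received msghdr can be sketched with the standard CMSG macros. The function name extract_cong_update is hypothetical. Since the SOL_RDS level is assigned dynamically, the sketch takes the level and the numeric value of RDS_CMSG_CONG_UPDATE as parameters supplied by the caller instead of assuming fixed constants.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: scan the control messages in msg for a congestion
 * update. sol_rds and cong_update_type are supplied by the caller
 * (assumption: the application has looked up both values).
 * Returns 1 and stores the 64bit mask in *mask_out if found, else 0. */
static int extract_cong_update(struct msghdr *msg, int sol_rds,
                               int cong_update_type, uint64_t *mask_out)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level != sol_rds || c->cmsg_type != cong_update_type)
            continue;
        if (c->cmsg_len < CMSG_LEN(sizeof(*mask_out)))
            continue;               /* too short to hold the 8 byte datum */
        /* memcpy avoids alignment assumptions about CMSG_DATA. */
        memcpy(mask_out, CMSG_DATA(c), sizeof(*mask_out));
        return 1;
    }
    return 0;
}
```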
Applications can use the following macros to test for and set bits in the bitmask:
#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
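A short sketch of how an application might build and test a monitor mask follows. The helper names cong_mask_for_ports and cong_mask_covers are hypothetical, and ports are assumed to be given in host byte order. The local macro definitions use a 64bit constant (1ULL) so that shifts up to bit 63 are well defined; they are guarded so that definitions from a system header take precedence.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef RDS_CONG_MONITOR_SIZE
#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port) \
    (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
#endif

/* Hypothetical helper: OR together the mask bits for n ports. */
static uint64_t cong_mask_for_ports(const uint16_t *ports, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}

/* Hypothetical helper: nonzero if the mask covers the group of port. */
static int cong_mask_covers(uint64_t mask, uint16_t port)
{
    return (mask & RDS_CONG_MONITOR_MASK(port)) != 0;
}
```

Because ports map to bits modulo 64, ports 1, 65 and 129 share one bit. A set bit in a received update therefore means that some port in that group became uncongested, not necessarily the one the application is sending to.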
Note that this affects messages that have not yet been transmitted as well as messages that have been transmitted, but for which no acknowledgment from the remote host has been received yet.
If there is no socket bound on the destination, the message is silently dropped. If the sending RDS can't be sure that there is no socket bound then it will try to send the message indefinitely until it can be sure or the sent message is canceled.
If a socket is closed then all pending sent messages on the socket are canceled and may or may not be seen by the receiver.
The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending messages to a given destination.
If a receiving socket is closed with pending messages then the sender considers those messages as having left the network and will not retransmit them.
A message will only be seen by recvmsg(2) once, unless MSG_PEEK was specified. Once the message has been delivered it is removed from the sending socket's transmit queue.
All messages sent from the same socket to the same destination will be delivered in the order they're sent. Messages sent from different sockets, or to different destinations, may be delivered in any order.