1 2Overview 3======== 4 5This readme tries to provide some background on the hows and whys of RDS, 6and will hopefully help you find your way around the code. 7 8In addition, please see this email about RDS origins: 9http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html 10 11RDS Architecture 12================ 13 14RDS provides reliable, ordered datagram delivery by using a single 15reliable connection between any two nodes in the cluster. This allows 16applications to use a single socket to talk to any other process in the 17cluster - so in a cluster with N processes you need N sockets, in contrast 18to N*N if you use a connection-oriented socket transport like TCP. 19 20RDS is not Infiniband-specific; it was designed to support different 21transports. The current implementation used to support RDS over TCP as well 22as IB. Work is in progress to support RDS over iWARP, and using DCE to 23guarantee no dropped packets on Ethernet, it may be possible to use RDS over 24UDP in the future. 25 26The high-level semantics of RDS from the application's point of view are 27 28 * Addressing 29 RDS uses IPv4 addresses and 16bit port numbers to identify 30 the end point of a connection. All socket operations that involve 31 passing addresses between kernel and user space generally 32 use a struct sockaddr_in. 33 34 The fact that IPv4 addresses are used does not mean the underlying 35 transport has to be IP-based. In fact, RDS over IB uses a 36 reliable IB connection; the IP address is used exclusively to 37 locate the remote node's GID (by ARPing for the given IP). 38 39 The port space is entirely independent of UDP, TCP or any other 40 protocol. 41 42 * Socket interface 43 RDS sockets work *mostly* as you would expect from a BSD 44 socket. The next section will cover the details. At any rate, 45 all I/O is performed through the standard BSD socket API. 46 Some additions like zerocopy support are implemented through 47 control messages, while other extensions use the getsockopt/ 48 setsockopt calls. 49 50 Sockets must be bound before you can send or receive data. 51 This is needed because binding also selects a transport and 52 attaches it to the socket. Once bound, the transport assignment 53 does not change. RDS will tolerate IPs moving around (eg in 54 a active-active HA scenario), but only as long as the address 55 doesn't move to a different transport. 56 57 * sysctls 58 RDS supports a number of sysctls in /proc/sys/net/rds 59 60 61Socket Interface 62================ 63 64 AF_RDS, PF_RDS, SOL_RDS 65 AF_RDS and PF_RDS are the domain type to be used with socket(2) 66 to create RDS sockets. SOL_RDS is the socket-level to be used 67 with setsockopt(2) and getsockopt(2) for RDS specific socket 68 options. 69 70 fd = socket(PF_RDS, SOCK_SEQPACKET, 0); 71 This creates a new, unbound RDS socket. 72 73 setsockopt(SOL_SOCKET): send and receive buffer size 74 RDS honors the send and receive buffer size socket options. 75 You are not allowed to queue more than SO_SNDSIZE bytes to 76 a socket. A message is queued when sendmsg is called, and 77 it leaves the queue when the remote system acknowledges 78 its arrival. 79 80 The SO_RCVSIZE option controls the maximum receive queue length. 81 This is a soft limit rather than a hard limit - RDS will 82 continue to accept and queue incoming messages, even if that 83 takes the queue length over the limit. However, it will also 84 mark the port as "congested" and send a congestion update to 85 the source node. The source node is supposed to throttle any 86 processes sending to this congested port. 87 88 bind(fd, &sockaddr_in, ...) 89 This binds the socket to a local IP address and port, and a 90 transport. 91 92 sendmsg(fd, ...) 93 Sends a message to the indicated recipient. The kernel will 94 transparently establish the underlying reliable connection 95 if it isn't up yet. 96 97 An attempt to send a message that exceeds SO_SNDSIZE will 98 return with -EMSGSIZE 99 100 An attempt to send a message that would take the total number 101 of queued bytes over the SO_SNDSIZE threshold will return 102 EAGAIN. 103 104 An attempt to send a message to a destination that is marked 105 as "congested" will return ENOBUFS. 106 107 recvmsg(fd, ...) 108 Receives a message that was queued to this socket. The sockets 109 recv queue accounting is adjusted, and if the queue length 110 drops below SO_SNDSIZE, the port is marked uncongested, and 111 a congestion update is sent to all peers. 112 113 Applications can ask the RDS kernel module to receive 114 notifications via control messages (for instance, there is a 115 notification when a congestion update arrived, or when a RDMA 116 operation completes). These notifications are received through 117 the msg.msg_control buffer of struct msghdr. The format of the 118 messages is described in manpages. 119 120 poll(fd) 121 RDS supports the poll interface to allow the application 122 to implement async I/O. 123 124 POLLIN handling is pretty straightforward. When there's an 125 incoming message queued to the socket, or a pending notification, 126 we signal POLLIN. 127 128 POLLOUT is a little harder. Since you can essentially send 129 to any destination, RDS will always signal POLLOUT as long as 130 there's room on the send queue (ie the number of bytes queued 131 is less than the sendbuf size). 132 133 However, the kernel will refuse to accept messages to 134 a destination marked congested - in this case you will loop 135 forever if you rely on poll to tell you what to do. 136 This isn't a trivial problem, but applications can deal with 137 this - by using congestion notifications, and by checking for 138 ENOBUFS errors returned by sendmsg. 139 140 setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) 141 This allows the application to discard all messages queued to a 142 specific destination on this particular socket. 143 144 This allows the application to cancel outstanding messages if 145 it detects a timeout. For instance, if it tried to send a message, 146 and the remote host is unreachable, RDS will keep trying forever. 147 The application may decide it's not worth it, and cancel the 148 operation. In this case, it would use RDS_CANCEL_SENT_TO to 149 nuke any pending messages. 150 151 152RDMA for RDS 153============ 154 155 see rds-rdma(7) manpage (available in rds-tools) 156 157 158Congestion Notifications 159======================== 160 161 see rds(7) manpage 162 163 164RDS Protocol 165============ 166 167 Message header 168 169 The message header is a 'struct rds_header' (see rds.h): 170 Fields: 171 h_sequence: 172 per-packet sequence number 173 h_ack: 174 piggybacked acknowledgment of last packet received 175 h_len: 176 length of data, not including header 177 h_sport: 178 source port 179 h_dport: 180 destination port 181 h_flags: 182 CONG_BITMAP - this is a congestion update bitmap 183 ACK_REQUIRED - receiver must ack this packet 184 RETRANSMITTED - packet has previously been sent 185 h_credit: 186 indicate to other end of connection that 187 it has more credits available (i.e. there is 188 more send room) 189 h_padding[4]: 190 unused, for future use 191 h_csum: 192 header checksum 193 h_exthdr: 194 optional data can be passed here. This is currently used for 195 passing RDMA-related information. 196 197 ACK and retransmit handling 198 199 One might think that with reliable IB connections you wouldn't need 200 to ack messages that have been received. The problem is that IB 201 hardware generates an ack message before it has DMAed the message 202 into memory. This creates a potential message loss if the HCA is 203 disabled for any reason between when it sends the ack and before 204 the message is DMAed and processed. This is only a potential issue 205 if another HCA is available for fail-over. 206 207 Sending an ack immediately would allow the sender to free the sent 208 message from their send queue quickly, but could cause excessive 209 traffic to be used for acks. RDS piggybacks acks on sent data 210 packets. Ack-only packets are reduced by only allowing one to be 211 in flight at a time, and by the sender only asking for acks when 212 its send buffers start to fill up. All retransmissions are also 213 acked. 214 215 Flow Control 216 217 RDS's IB transport uses a credit-based mechanism to verify that 218 there is space in the peer's receive buffers for more data. This 219 eliminates the need for hardware retries on the connection. 220 221 Congestion 222 223 Messages waiting in the receive queue on the receiving socket 224 are accounted against the sockets SO_RCVBUF option value. Only 225 the payload bytes in the message are accounted for. If the 226 number of bytes queued equals or exceeds rcvbuf then the socket 227 is congested. All sends attempted to this socket's address 228 should return block or return -EWOULDBLOCK. 229 230 Applications are expected to be reasonably tuned such that this 231 situation very rarely occurs. An application encountering this 232 "back-pressure" is considered a bug. 233 234 This is implemented by having each node maintain bitmaps which 235 indicate which ports on bound addresses are congested. As the 236 bitmap changes it is sent through all the connections which 237 terminate in the local address of the bitmap which changed. 238 239 The bitmaps are allocated as connections are brought up. This 240 avoids allocation in the interrupt handling path which queues 241 sages on sockets. The dense bitmaps let transports send the 242 entire bitmap on any bitmap change reasonably efficiently. This 243 is much easier to implement than some finer-grained 244 communication of per-port congestion. The sender does a very 245 inexpensive bit test to test if the port it's about to send to 246 is congested or not. 247 248 249RDS Transport Layer 250================== 251 252 As mentioned above, RDS is not IB-specific. Its code is divided 253 into a general RDS layer and a transport layer. 254 255 The general layer handles the socket API, congestion handling, 256 loopback, stats, usermem pinning, and the connection state machine. 257 258 The transport layer handles the details of the transport. The IB 259 transport, for example, handles all the queue pairs, work requests, 260 CM event handlers, and other Infiniband details. 261 262 263RDS Kernel Structures 264===================== 265 266 struct rds_message 267 aka possibly "rds_outgoing", the generic RDS layer copies data to 268 be sent and sets header fields as needed, based on the socket API. 269 This is then queued for the individual connection and sent by the 270 connection's transport. 271 struct rds_incoming 272 a generic struct referring to incoming data that can be handed from 273 the transport to the general code and queued by the general code 274 while the socket is awoken. It is then passed back to the transport 275 code to handle the actual copy-to-user. 276 struct rds_socket 277 per-socket information 278 struct rds_connection 279 per-connection information 280 struct rds_transport 281 pointers to transport-specific functions 282 struct rds_statistics 283 non-transport-specific statistics 284 struct rds_cong_map 285 wraps the raw congestion bitmap, contains rbnode, waitq, etc. 286 287Connection management 288===================== 289 290 Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and 291 ERROR states. 292 293 The first time an attempt is made by an RDS socket to send data to 294 a node, a connection is allocated and connected. That connection is 295 then maintained forever -- if there are transport errors, the 296 connection will be dropped and re-established. 297 298 Dropping a connection while packets are queued will cause queued or 299 partially-sent datagrams to be retransmitted when the connection is 300 re-established. 301 302 303The send path 304============= 305 306 rds_sendmsg() 307 struct rds_message built from incoming data 308 CMSGs parsed (e.g. RDMA ops) 309 transport connection alloced and connected if not already 310 rds_message placed on send queue 311 send worker awoken 312 rds_send_worker() 313 calls rds_send_xmit() until queue is empty 314 rds_send_xmit() 315 transmits congestion map if one is pending 316 may set ACK_REQUIRED 317 calls transport to send either non-RDMA or RDMA message 318 (RDMA ops never retransmitted) 319 rds_ib_xmit() 320 allocs work requests from send ring 321 adds any new send credits available to peer (h_credits) 322 maps the rds_message's sg list 323 piggybacks ack 324 populates work requests 325 post send to connection's queue pair 326 327The recv path 328============= 329 330 rds_ib_recv_cq_comp_handler() 331 looks at write completions 332 unmaps recv buffer from device 333 no errors, call rds_ib_process_recv() 334 refill recv ring 335 rds_ib_process_recv() 336 validate header checksum 337 copy header to rds_ib_incoming struct if start of a new datagram 338 add to ibinc's fraglist 339 if competed datagram: 340 update cong map if datagram was cong update 341 call rds_recv_incoming() otherwise 342 note if ack is required 343 rds_recv_incoming() 344 drop duplicate packets 345 respond to pings 346 find the sock associated with this datagram 347 add to sock queue 348 wake up sock 349 do some congestion calculations 350 rds_recvmsg 351 copy data into user iovec 352 handle CMSGs 353 return to application 354 355 356