Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices. It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge). Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions. The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table. If there
is a matching flow, it executes the associated actions. If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).


Flow key compatibility
----------------------

Network protocols evolve over time. New protocols become important
and existing protocols lose their prominence. For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key. It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete. Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet. Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary. Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel. This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct. (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes. Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received. Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.
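
As an informal illustration of the comparison described under "Flow
key compatibility", the following Python sketch models a flow key as a
mapping from attribute name to value. This model, and the names
decide, INSTALL_FLOW, and FORWARD_MANUALLY, are assumptions made for
the example; they are not part of the Netlink interface. The sketch
compares attribute names only and ignores values and nesting. The case
where each side parsed fields that the other did not is not covered by
the rules above, so the sketch conservatively treats it like the third
case.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of comparing the two flow keys."""
    INSTALL_FLOW = "set up a flow using the kernel-provided key"
    FORWARD_MANUALLY = "forward in userspace without installing a flow"


def decide(kernel_key: dict, user_key: dict) -> Action:
    """Pick an action from the attribute names present in each key.

    Keys are modelled as {attribute_name: value} dicts, which is an
    assumption of this sketch; the real interface carries Netlink
    attributes.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)
    if user_attrs <= kernel_attrs:
        # Cases 1 and 2: the keys agree, or the kernel parsed deeper
        # than userspace. Install a flow with the kernel's key.
        return Action.INSTALL_FLOW
    # Case 3 (and, by assumption, any mixed case): userspace saw
    # fields that the kernel did not, so handle the packet manually.
    return Action.FORWARD_MANUALLY


# Illustrative keys: one parsed through IPv6, one stopped at eth_type.
deep_key = {"in_port": 1, "eth": "...", "eth_type": 0x86DD, "ipv6": "..."}
shallow_key = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}

kernel_parsed_deeper = decide(deep_key, shallow_key)
userspace_parsed_deeper = decide(shallow_key, deep_key)
```

Here kernel_parsed_deeper is INSTALL_FLOW and userspace_parsed_deeper
is FORWARD_MANUALLY. A real application could refine the last branch
when it can tell that the extra fields would not affect forwarding.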

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1:

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.:

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and
an optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow
key. A '0' bit specifies a don't care bit, which will match either a '1' or
'0' bit of an incoming packet. Using wildcarded flows can improve the flow
setup rate by reducing the number of new flows that need to be processed by
the user space program.

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to fewer than
were specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.
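
The matching rule above amounts to comparing the packet and the flow
key only at the bit positions where the mask is '1'. A minimal Python
sketch, treating a single field as an integer (the function name and
the sample values are illustrative assumptions, not kernel code):

```python
def matches(packet_bits: int, key_bits: int, mask_bits: int) -> bool:
    """Return True if the packet field matches the wildcarded flow.

    A '1' mask bit demands an exact match on that bit position; a '0'
    mask bit is a don't care.
    """
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


# Illustrative 32-bit IPv4 destination field.
key = 0xAC120034          # 172.18.0.52
mask = 0xFFFF0000         # exact match on the top 16 bits only
exact_mask = 0xFFFFFFFF   # no don't care bits at all

same_prefix = matches(0xAC12FFFF, key, mask)         # 172.18.255.255
other_prefix = matches(0xAC130034, key, mask)        # 172.19.0.52
exact_only = matches(0xAC12FFFF, key, exact_mask)
```

With the wildcarded mask, same_prefix is True and other_prefix is
False. With exact_mask, the same packet no longer matches, which
mirrors what happens when the kernel ignores the mask: the flow still
works, but more packets miss and trigger flow setups.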

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious:

    ------------------------------------------------------------------
    New network protocol support must only supplement existing flow
    key attributes. It must not change the meaning of already defined
    flow key attributes.
    ------------------------------------------------------------------

This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata:

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
flow key much like this:

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets. This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes.
A TCP packet in VLAN 10 is actually expressed as:

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute. Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them. (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc. This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this:

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.
The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this:

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way. This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
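
The last rule means that userspace can treat a kernel-provided flow
key as an opaque byte string. A minimal Python sketch, assuming the
raw attribute bytes are already in hand (the FlowCache class is
hypothetical and the byte strings are placeholders, not real Netlink
encodings):

```python
class FlowCache:
    """Index per-flow state by the raw bytes of a kernel-provided key."""

    def __init__(self):
        self._by_key = {}

    def record(self, raw_key: bytes, info) -> None:
        # No parsing is needed: the bytes themselves are the dict key,
        # so attributes that userspace cannot interpret still count.
        self._by_key[bytes(raw_key)] = info

    def lookup(self, raw_key: bytes):
        # Returns None when no flow with these exact bytes is known.
        return self._by_key.get(bytes(raw_key))


cache = FlowCache()
key_a = b"\x01\x02\x03\x04"   # placeholder for one kernel-composed key
key_b = b"\x03\x04\x01\x02"   # same bytes in a different order

cache.record(key_a, "flow A")
found = cache.lookup(key_a)
missing = cache.lookup(key_b)
```

Here found is "flow A" and missing is None. Because ordering of
attributes is not significant in general, this byte-wise comparison
is presumably only reliable between keys that the kernel itself
composed, not between a kernel key and one that userspace built.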