Distributed Switch Architecture
===============================

Introduction
============

This document describes the Distributed Switch Architecture (DSA) subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch is typically comprised of multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-the-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.

The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
which is a hardware feature making the switch insert a specific tag for each
Ethernet frame it receives from or sends to specific ports, to help the
management interface figure out:

- what port is this frame coming from
- what was the reason why this frame got forwarded
- how to send CPU originated traffic to specific ports

The subsystem does support switches not capable of inserting/stripping tags, but
the features might be slightly limited in that case (traffic separation relies
on Port-based VLAN IDs).

Note that DSA does not currently create network interfaces for the "cpu" and
"dsa" ports because:

- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, would create a duplication of feature, since you
  would get two interfaces for the same conduit: master netdev, and "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either, only the
  downstream, or the top-most upstream interface makes sense with that model

Switch tagging protocols
------------------------

DSA currently supports 4 different tagging protocols, and a tag-less mode as
well.
The different protocols are implemented in:

net/dsa/tag_trailer.c: Marvell's 4-byte trailer tag mode (legacy)
net/dsa/tag_dsa.c: Marvell's original DSA tag
net/dsa/tag_edsa.c: Marvell's enhanced DSA tag
net/dsa/tag_brcm.c: Broadcom's 4-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
e1000e, mv643xx_eth etc. without having to introduce modifications to these
drivers. Such network devices are also often referred to as conduit network
devices since they act as a pipe between the host processor and the hardware
Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming skb->protocol) with the
networking stack, this is also known as a ptype or packet_type. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

Receive interrupt fires:
- receive function is invoked
- basic packet processing is done: getting length, status etc.
- packet is prepared to be processed by the Ethernet layer by calling
  eth_type_trans

net/ethernet/eth.c:

eth_type_trans(skb, dev)
        if (dev->dsa_ptr != NULL)
                -> skb->protocol = ETH_P_XDSA

drivers/net/ethernet/*:

netif_receive_skb(skb)
        -> iterate over registered packet_type
                -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

net/dsa/dsa.c:
        -> dsa_switch_rcv()
                -> invoke switch tag specific protocol handler in
                   net/dsa/tag_*.c

net/dsa/tag_*.c:
        -> inspect and strip switch tag protocol to determine originating port
        -> locate per-port network device
        -> invoke eth_type_trans() with the DSA slave network device
        -> invoke netif_receive_skb()

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.
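
The tag insertion and removal performed by slave network devices can be
illustrated with a small user-space sketch. It is not kernel code: the 4-byte
trailer layout (first byte carrying the port number) and every toy_* name are
assumptions made up for this example, since real tag formats are vendor
specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag layout, for illustration only: a 4-byte trailer whose
 * first byte carries the switch port number. Real tags differ. */
#define TOY_TAG_LEN 4

/* Transmit side: append the tag naming the destination port.
 * Returns the new frame length, or 0 if the buffer is too small. */
static size_t toy_tag_xmit(uint8_t *buf, size_t len, size_t cap, uint8_t port)
{
	assert(buf != NULL);
	if (len + TOY_TAG_LEN > cap)
		return 0;
	memset(buf + len, 0, TOY_TAG_LEN);
	buf[len] = port;
	return len + TOY_TAG_LEN;
}

/* Receive side: read the originating port from the trailer and strip it.
 * Returns the untagged frame length, or 0 if the frame is too short. */
static size_t toy_tag_rcv(const uint8_t *buf, size_t len, uint8_t *port)
{
	assert(buf != NULL && port != NULL);
	if (len < TOY_TAG_LEN)
		return 0;
	*port = buf[len - TOY_TAG_LEN];
	return len - TOY_TAG_LEN;
}
```

On receive, the port number recovered this way is what would let the tagging
code locate the per-port slave network device before handing the untagged
frame back to the networking stack.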

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
ndo_start_xmit() function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective:


                        |--------------------------|
                        | CPU network device (eth0)|
                        |--------------------------|
                        | <tag added by switch     |
                        |                          |
                        |                          |
                        |        tag added by CPU> |
                |--------------------------------------------|
                | Switch driver                              |
                |--------------------------------------------|
                   ||         ||         ||
                |-------|  |-------|  |-------|
                | sw0p0 |  | sw0p1 |  | sw0p2 |
                |-------|  |-------|  |-------|

Slave MDIO bus
--------------

In order to be able to read from/write to a PHY built into the switch, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to obtain link status, link partner pages,
auto-negotiation results etc.
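
A minimal user-space sketch of this diversion follows. The structure, the
phy_mask field and all toy_* names are hypothetical and only model the idea:
reads aimed at addresses owned by the switch are handed to a driver callback,
and everything else reads back as 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PORTS 12	/* same value this document gives for DSA_MAX_PORTS */

/* Hypothetical switch driver hook returning a built-in PHY register. */
typedef int (*toy_phy_read_t)(int port, int regnum);

struct toy_slave_mii_bus {
	uint32_t phy_mask;		/* bit N set: address N is a built-in PHY */
	toy_phy_read_t phy_read;	/* may be NULL if the driver has no hook */
};

/* Divert reads for switch-owned addresses to the driver hook; any other
 * address, or a missing hook, reads as 0xffff. */
static int toy_slave_mii_read(const struct toy_slave_mii_bus *bus,
			      int addr, int regnum)
{
	assert(bus != NULL);
	if (addr < 0 || addr >= TOY_MAX_PORTS)
		return 0xffff;
	if (!(bus->phy_mask & (1u << addr)) || bus->phy_read == NULL)
		return 0xffff;
	return bus->phy_read(addr, regnum);
}

/* Example hook: register 1 (the MII status register) reports link up
 * (bit 2 set) on every port, all other registers read as 0. */
static int toy_phy_read(int port, int regnum)
{
	(void)port;
	return regnum == 1 ? 0x0004 : 0;
}
```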

For Ethernet switches which have both external and internal MDIO buses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

Data structures
---------------

DSA data structures are defined in include/net/dsa.h as well as
net/dsa/dsa_priv.h.

dsa_chip_data: platform data configuration for a given switch device, this
structure describes a switch device's parent device, its address, as well as
various properties of its ports: names/labels, and finally a routing table
indication (when cascading switches)

dsa_platform_data: platform device configuration data which can reference a
collection of dsa_chip_data structures if multiple switches are cascaded, the
master network device this switch tree is attached to needs to be referenced

dsa_switch_tree: structure assigned to the master network device under
"dsa_ptr", this structure references a dsa_platform_data structure as well as
the tagging protocol supported by the switch tree, and which receive/transmit
function hooks should be invoked, information about the directly attached switch
is also provided: CPU port. Finally, a collection of dsa_switch are referenced
to address individual switches in the tree.

dsa_switch: structure describing a switch device in the tree, referencing a
dsa_switch_tree as a backpointer, slave network devices, master network device,
and a reference to the backing dsa_switch_driver

dsa_switch_driver: structure referencing function pointers, see below for a full
description.
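
How a tree and its switches reference each other can be pictured with the
following user-space sketch. The toy_* structures are heavily simplified
stand-ins invented for this example; their fields are assumptions and do not
reproduce the definitions in include/net/dsa.h.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SWITCHES 4	/* same value this document gives for DSA_MAX_SWITCHES */

struct toy_switch_tree;

/* Stand-in for the driver structure holding function pointers. */
struct toy_switch_driver {
	const char *name;
};

/* Stand-in for one switch in the tree. */
struct toy_switch {
	struct toy_switch_tree *dst;		/* backpointer to the tree */
	int index;				/* position within the tree */
	const struct toy_switch_driver *drv;	/* backing driver */
};

/* Stand-in for the tree hanging off the master network device. */
struct toy_switch_tree {
	int cpu_port;				/* port facing the host */
	struct toy_switch *ds[TOY_MAX_SWITCHES];
};

/* Attach a switch at the given index and set its backpointer.
 * Returns 0 on success, -1 if the index is out of range or taken. */
static int toy_tree_add_switch(struct toy_switch_tree *dst,
			       struct toy_switch *ds, int index)
{
	assert(dst != NULL && ds != NULL);
	if (index < 0 || index >= TOY_MAX_SWITCHES || dst->ds[index] != NULL)
		return -1;
	ds->dst = dst;
	ds->index = index;
	dst->ds[index] = ds;
	return 0;
}
```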

Design limitations
==================

DSA is a platform device driver
-------------------------------

DSA is implemented as a DSA platform device driver which is convenient because
it will register the entire DSA switch tree attached to a master network device
in one shot, facilitating the device creation and simplifying the device driver
model a bit. This comes however with a number of limitations:

- building DSA and its switch drivers as modules is currently not working
- the device driver parenting does not necessarily reflect the original
  bus/device the switch can be created from
- supporting non-MDIO and non-MMIO (platform) switches is not possible

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(DSA_MAX_SWITCHES), and the number of ports per switch to 12 (DSA_MAX_PORTS).
These limits could be extended to support larger configurations should this need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before.
This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the capabilities
  of the Ethernet controller attached to it:
  http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (dev->dsa_ptr becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.

Slave network devices check that the master network device is UP before allowing
you to administratively bring UP these slave network devices. A common
configuration mistake is forgetting to bring UP the master network device first.
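
The ordering requirement can be pictured with a small user-space model. The
toy_netdev structure and toy_slave_open() are hypothetical, and returning
-ENETDOWN is an assumption of this sketch rather than something this document
specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal model of a network device. */
struct toy_netdev {
	bool up;			/* administratively UP? */
	struct toy_netdev *master;	/* NULL for the master itself */
};

/* Refuse to bring a slave UP while its master is still DOWN.
 * The -ENETDOWN return value is an assumption made for this sketch. */
static int toy_slave_open(struct toy_netdev *slave)
{
	assert(slave != NULL && slave->master != NULL);
	if (!slave->master->up)
		return -ENETDOWN;
	slave->up = true;
	return 0;
}
```

In practice this means bringing up the master interface (eth0 in the diagram
earlier in this document) before any of the sw0pN slave interfaces.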

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: drivers/net/phy/phy.c, mdio_bus.c
- Switchdev: net/switchdev/*
- Device Tree for various of_* functions
- HWMON: drivers/hwmon/*

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (struct phy_device as defined in include/linux/phy.h), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
  fixed PHYs

The PHY configuration is done by the dsa_slave_phy_setup() function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using of_phy_connect()

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  Documentation/devicetree/bindings/net/fixed-link.txt, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA

SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices.
Since DSA primarily deals with MDIO-connected switches, although not
exclusively, SWITCHDEV's prepare/abort/commit phases are often simplified into
a prepare phase which checks whether the operation is supported by the DSA
switch driver, and a commit phase which applies the changes.

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

Device Tree
-----------

DSA features a standardized binding which is documented in
Documentation/devicetree/bindings/net/dsa/dsa.txt. PHY/MDIO library helper
functions such as of_get_phy_mode(), of_phy_connect() are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc.

HWMON
-----

Some switch drivers feature internal temperature sensors which are exposed as
regular HWMON devices in /sys/class/hwmon/.

Driver development
==================

DSA switch drivers need to implement a dsa_switch_driver structure which will
contain the various members described below.

register_switch_driver() registers this dsa_switch_driver in its internal list
of drivers to probe for. unregister_switch_driver() does the exact opposite.

Unless requested differently by setting the priv_size member accordingly, DSA
does not allocate any driver private context space.

Switch configuration
--------------------

- priv_size: additional size needed by the switch driver for its private context

- tag_protocol: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the dsa_tag_protocol enum

- probe: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device.
  For other buses, return a non-NULL string

- setup: setup function for the switch, this function is responsible for setting
  up the dsa_switch_driver private structure with all it needs: register maps,
  interrupts, mutexes, locks etc. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.

- set_addr: Some switches require the programming of the management interface's
  Ethernet MAC address, switch drivers can also disable ageing of MAC addresses
  on the management interface and "hardcode"/"force" this MAC address for the
  CPU/management interface as an optimization

PHY devices and link management
-------------------------------

- get_phy_flags: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags" that is private between the switch
  driver and the Ethernet PHY driver in drivers/net/phy/*.

- phy_read: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.

- phy_write: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable, return a negative error
  code.

- poll_link: Function invoked by DSA to query the link state of the switch
  builtin Ethernet PHYs, per port. This function is responsible for calling
  netif_carrier_{on,off} when appropriate, and can be used to poll all ports in a
  single call. Executes from workqueue context.

- adjust_link: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the phy_device is providing.

- fixed_link_update: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained

Ethtool operations
------------------

- get_strings: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- get_ethtool_stats: ethtool function used to query per-port statistics and
  return their values.
  DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- get_sset_count: ethtool function used to query the number of statistics items

- get_wol: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations, also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- set_wol: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to get_wol with similar restrictions

- set_eee: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant. This function should enable EEE at the switch port MAC
  controller and data-processing logic

- get_eee: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- get_eeprom_len: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- get_eeprom: ethtool function returning for a given switch the EEPROM contents

- set_eeprom: ethtool function writing specified data to a given switch EEPROM

- get_regs_len: ethtool function returning the register length for a given
  switch

- get_regs: ethtool function returning the Ethernet switch internal register
  contents.
  This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- suspend: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- resume: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- port_enable: function invoked by the DSA slave network device ndo_open
  function when a port is administratively brought up, this function should
  fully enable a given switch port. DSA takes care of marking the port with
  BR_STATE_BLOCKING if the port is a bridge member, or BR_STATE_FORWARDING if it
  is not, and propagating these changes down to the hardware

- port_disable: function invoked by the DSA slave network device ndo_close
  function when a port is administratively brought down, this function should
  fully disable a given switch port. DSA takes care of marking the port with
  BR_STATE_DISABLED and propagating changes to the hardware if this port is
  disabled while being a bridge member

Hardware monitoring
-------------------

These callbacks are only available if CONFIG_NET_DSA_HWMON is enabled:

- get_temp: this function queries the given switch for its temperature

- get_temp_limit: this function returns the switch current maximum temperature
  limit

- set_temp_limit: this function configures the maximum temperature limit allowed

- get_temp_alarm: this function returns the critical temperature threshold
  returning an alarm notification

See Documentation/hwmon/sysfs-interface for details.

Bridge layer
------------

- port_join_bridge: bridge layer function invoked when a given switch port is
  added to a bridge, this function should do what is necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain so that it can ingress/egress traffic with other members of the
  bridge. DSA does nothing but calculate a bitmask of switch ports that are
  currently members of the bridge being joined

- port_leave_bridge: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should do what is necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress
  traffic with the remaining bridge members. When the port leaves the bridge,
  its MAC addresses should be aged out at the switch hardware so that the
  switch can (re)learn MAC addresses behind this port. DSA calculates the
  bitmask of ports still members of the bridge being left

- port_stp_update: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic.
  The switch driver is responsible for
  computing an STP state change based on current and requested parameters and
  performing the relevant ageing based on the intersection results

Bridge VLAN filtering
---------------------

- port_pvid_get: bridge layer function invoked when a Port-based VLAN ID is
  queried for the given switch port

- port_pvid_set: bridge layer function invoked when a Port-based VLAN ID needs
  to be configured on the given switch port

- port_vlan_add: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- port_vlan_del: bridge layer function invoked when a VLAN is removed from the
  given switch port

- vlan_getnext: bridge layer function invoked to query the next configured VLAN
  in the switch, i.e. returns the bitmaps of members and untagged ports

- port_fdb_add: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID

Note: VLAN ID 0 corresponds to the port private database, which, in the context
of DSA, would be its port-based VLAN, used by the associated bridge device.

- port_fdb_del: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

TODO
====

The platform device problem
---------------------------

DSA is currently implemented as a platform device driver which is far from ideal
as was discussed in this thread:

http://permalink.gmane.org/gmane.linux.network/329848

This basically prevents the device driver model from being properly used and
applied, and prevents support for non-MDIO, non-MMIO Ethernet connected
switches.

Another problem with the platform device driver approach is that it prevents the
use of a modular switch drivers build due to a circular dependency, illustrated
here:

http://comments.gmane.org/gmane.linux.network/345803

An attempt at reworking this has been made here:

https://lwn.net/Articles/643149/

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch-specific details. At some point we should envision a merger
between these two subsystems and get the best of both worlds.

Other low-hanging fruit
-----------------------

- making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510