Wei Yang <weiyang@linux.vnet.ibm.com>
Benjamin Herrenschmidt <benh@au1.ibm.com>
Bjorn Helgaas <bhelgaas@google.com>
26 Aug 2014

This document describes the requirement from hardware for PCI MMIO resource
sizing and assignment on PowerKVM and how generic PCI code handles this
requirement. The first two sections describe the concepts of Partitionable
Endpoints and the implementation on P8 (IODA2). The next two sections talk
about considerations on enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA, they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1s. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze etc., but
that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)

P8 supports up to 256 Partitionable Endpoints per PHB.
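As an illustration of the PE state table from Section 1, sized for the 256
PEs per PHB, the following is a minimal software model in C. It is a sketch
only, not the hardware or kernel implementation. The names phb_model,
pe_state and the helper functions are hypothetical, and having MMIO loads
and stores check only the MMIO frozen bit is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PES 256 /* P8 (IODA2): up to 256 PEs per PHB */

/* Hypothetical model of one PE state entry: a pair of "frozen" bits. */
struct pe_state {
    bool mmio_frozen;
    bool dma_frozen;
};

/* Hypothetical per-PHB table; each PHB has its own set of PEs. */
struct phb_model {
    struct pe_state pe[NUM_PES];
};

/* An error sets both frozen bits together. */
static void pe_freeze(struct phb_model *phb, unsigned pe)
{
    assert(pe < NUM_PES);
    phb->pe[pe].mmio_frozen = true;
    phb->pe[pe].dma_frozen = true;
}

/* The two bits can be cleared independently of each other. */
static void pe_clear_mmio(struct phb_model *phb, unsigned pe)
{
    assert(pe < NUM_PES);
    phb->pe[pe].mmio_frozen = false;
}

static void pe_clear_dma(struct phb_model *phb, unsigned pe)
{
    assert(pe < NUM_PES);
    phb->pe[pe].dma_frozen = false;
}

/* Assumption: an MMIO load through a frozen PE returns all 1s. */
static uint32_t pe_mmio_load(const struct phb_model *phb, unsigned pe,
                             uint32_t device_value)
{
    assert(pe < NUM_PES);
    return phb->pe[pe].mmio_frozen ? UINT32_MAX : device_value;
}

/* Assumption: an MMIO store through a frozen PE is dropped.
 * Returns true if the store reached the (modelled) device register. */
static bool pe_mmio_store(const struct phb_model *phb, unsigned pe,
                          uint32_t *device_reg, uint32_t value)
{
    assert(pe < NUM_PES);
    if (phb->pe[pe].mmio_frozen)
        return false;
    *device_reg = value;
    return true;
}
```

In this model, freezing a PE makes loads read back as 0xffffffff and
silently drops stores until the MMIO bit is cleared, while the DMA bit
stays set until it is cleared separately.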

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be naturally aligned
    and a power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.
        (Note: the top 64KB are actually reserved for MSIs, but this is
        not a problem at this point; we just need to ensure Linux doesn't
        assign anything there. The M32 logic ignores that reservation,
        however, and will forward in that space if we try.)

      * It is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#. That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV). We basically use the trick of forcing the bridge MMIO windows
      onto a segment alignment/granularity so that the space behind a bridge
      can be assigned to a PE.

      Ideally we would like to be able to have individual functions in PEs
      but that would mean using a completely different address allocation
      scheme where individual function BARs can be "grouped" to fit in one or
      more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

      We have code (fairly new compared to the M32 stuff) that exploits that
      for large BARs in 64-bit space:

      We configure an M64 window to cover the entire region of address space
      that has been assigned by FW for the PHB (about 64GB, ignore the space
      for the M32, it comes out of a different "reserve").
      We configure it as segmented.

      Then we do the same thing as with M32, using the bridge alignment
      trick, to match to those giant segments.

      Since we cannot remap, we have two additional constraints:

      - We do the PE# allocation *after* the 64-bit space has been assigned
        because the addresses we use directly determine the PE#. We then
        update the M32 PE# for the devices that use both 32-bit and 64-bit
        spaces or assign the remaining PE# to 32-bit only devices.

      - We cannot "group" segments in HW, so if a device ends up using more
        than one segment, we end up with more than one PE#. There is a HW
        mechanism to make the freeze state cascade to "companion" PEs but
        that only works for PCIe error messages (typically used so that if
        you freeze a switch, it freezes all its children). So we do it in
        SW. We lose a bit of effectiveness of EEH in that case, but that's
        the best we found. So when any of the PEs freezes, we freeze the
        other ones for that "domain". We thus introduce the concept of
        "master PE" which is the one used for DMA, MSIs, etc., and "secondary
        PEs" that are used for the remaining M64 segments.

      We would like to investigate using additional M64 windows in "single
      PE" mode to overlay over specific BARs to work around some of that, for
      example for devices with very large BARs, e.g., GPUs. It would make
      sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.
    For a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments. The finest granularity possible is a 256MB
      window with 1MB segments. VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window. Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size. If
      they are different sizes, the entire window has to be small enough that
      the segment size matches the smallest VF BAR, which means larger VF
      BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
      to a single PE, so it could only isolate one VF.

    - Single segmented M64 windows: A segmented M64 window could be used just
      like the M32 window, but the segments can't be individually mapped to
      PEs (the segment number is the PE#), so there isn't as much
      flexibility.
      A VF with multiple BARs would have to be in a "domain" of multiple
      PEs, which is not as well isolated as a single PE.

    - Multiple segmented M64 windows: As usual, each window is split into 256
      equally-sized segments, and the segment number is the PE#. But if we
      use several M64 windows, they can be set to different base addresses
      and different segment sizes. If we have VFs that each have a 1MB BAR
      and a 32MB BAR, we could use one M64 window to assign 1MB segments and
      another M64 window to assign 32MB segments.

  Finally, the plan is to use M64 windows for SR-IOV, as described in more
  detail below. For a given VF BAR, we need to
  effectively reserve the entire 256 segments (256 * VF BAR size) and
  position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  range to PE#. Each M64 window defines one MMIO range and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage this M64 window to map VFs to individual PEs, since
  SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments. If
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
  other devices:

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1:

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices. Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1]. There's nothing in hardware that
  responds to segments [total_VFs, 255].

4. Implications for the Generic PCI Code

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.
If the PCI core allocates the exact amount of space required for the
VF(n) BAR space, the VF BAR value is fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
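The relationship between the VF BAR value and the PE# in a segmented M64
window can be sketched in C as follows. This is an illustrative model under
stated assumptions, not the kernel's code: the function names and the
example window base are hypothetical, and the model assumes the VF BAR size
is a whole multiple of the segment size.

```c
#include <assert.h>
#include <stdint.h>

#define M64_SEGMENTS 256 /* a segmented M64 window has 256 segments */

/* In a segmented M64 window the segment number *is* the PE#. */
static unsigned m64_segment(uint64_t win_base, uint64_t seg_size,
                            uint64_t addr)
{
    assert(addr >= win_base);
    assert(addr - win_base < seg_size * M64_SEGMENTS);
    return (unsigned)((addr - win_base) / seg_size);
}

/* VF BAR value that places VF0 in PE first_pe, for the ideal case where
 * the segment size equals the individual VF BAR size. The whole VF(n)
 * BAR space must stay inside the 256-segment window. */
static uint64_t vf_bar_base_for_pe(uint64_t win_base, uint64_t vf_bar_size,
                                   unsigned first_pe, unsigned num_vfs)
{
    assert(first_pe + num_vfs <= M64_SEGMENTS);
    return win_base + (uint64_t)first_pe * vf_bar_size;
}

/* VF BARs are contiguous: VF n's BAR starts n BAR sizes above the base. */
static uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
                            unsigned n)
{
    return vf_bar_base + (uint64_t)n * vf_bar_size;
}

/* Segments consumed by the VF(n) BAR space when the segment size may be
 * smaller than the VF BAR size (assumed to divide it evenly). */
static unsigned segments_consumed(uint64_t vf_bar_size, uint64_t seg_size,
                                  unsigned num_vfs)
{
    assert(seg_size != 0 && vf_bar_size % seg_size == 0);
    return (unsigned)(num_vfs * (vf_bar_size / seg_size));
}
```

For instance, with a hypothetical window base of 0x1_0000_0000, 1MB VF BARs
and eight VFs, choosing PE 16 for VF0 gives a VF BAR value of
0x1_0100_0000, and VF3 then falls in segment (and PE) 19. With 4MB VF BARs
and 1MB segments, the same eight VFs would consume 32 segments instead of 8.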