Coherent Accelerator Interface (CXL)
====================================

Introduction
============

    The coherent accelerator interface is designed to allow the
    coherent connection of accelerators (FPGAs and other devices) to a
    POWER system. These devices need to adhere to the Coherent
    Accelerator Interface Architecture (CAIA).

    IBM refers to this as the Coherent Accelerator Processor Interface
    or CAPI. In the kernel it's referred to by the name CXL to avoid
    confusion with the ISDN CAPI subsystem.

    Coherent in this context means that the accelerator and CPUs can
    both access system memory directly and with the same effective
    addresses.


Hardware overview
=================

          POWER8               FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

    The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
    unit which is part of the PCIe Host Bridge (PHB). Linux manages
    the CAPP through calls into OPAL and doesn't program it directly.

    The FPGA (or coherently attached device) consists of two parts:
    the POWER Service Layer (PSL) and the Accelerator Function Unit
    (AFU). The AFU is used to implement specific functionality behind
    the PSL. The PSL, among other things, provides memory address
    translation services to allow each AFU direct access to userspace
    memory.

    The AFU is the core part of the accelerator (e.g. the compression
    or crypto function). The kernel has no knowledge of the function
    of the AFU. Only userspace interacts directly with the AFU.

    The PSL provides the translation and interrupt services that the
    AFU needs. This is what the kernel interacts with. For example, if
    the AFU needs to read a particular effective address, it sends
    that address to the PSL; the PSL then translates it, fetches the
    data from memory and returns it to the AFU. If the PSL has a
    translation miss, it interrupts the kernel and the kernel services
    the fault. The context in which this fault is serviced depends on
    who owns that acceleration function.


AFU Modes
=========

    There are two programming modes supported by the AFU: dedicated
    and AFU directed. An AFU may support one or both modes.

    When using dedicated mode, only one MMU context is supported. In
    this mode, only one userspace process can use the accelerator at
    a time.

    When using AFU directed mode, up to 16K simultaneous contexts can
    be supported. This means up to 16K simultaneous userspace
    applications may use the accelerator (although specific AFUs may
    support fewer). In this mode, the AFU sends a 16-bit context ID
    with each of its requests. This tells the PSL which context is
    associated with each operation. If the PSL can't translate an
    operation, the ID can also be accessed by the kernel so it can
    determine the userspace context associated with an operation.


MMIO space
==========

    A portion of the accelerator MMIO space can be directly mapped
    from the AFU to userspace. Either the whole space can be mapped or
    just a per-context portion. The hardware is self-describing, hence
    the kernel can determine the offset and size of the per-context
    portion.


Interrupts
==========

    AFUs may generate interrupts that are destined for userspace. These
    are received by the kernel as hardware interrupts and passed on to
    userspace via the read syscall documented below.

    Data storage faults and error interrupts are handled by the kernel
    driver.


Work Element Descriptor (WED)
=============================

    The WED is a 64-bit parameter passed to the AFU when a context is
    started. Its format is up to the AFU, hence the kernel has no
    knowledge of what it represents. Typically it will be the
    effective address of a work queue or status block where the AFU
    and userspace can share control and status information.
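
    As an illustration of the typical case described above, the sketch
    below turns the address of a userspace structure into a 64-bit WED
    value. The structure layout and the helper names are hypothetical;
    a real AFU defines its own format.

```c
#include <assert.h>
#include <stdint.h>

/* The WED is 64 bits wide, so a userspace pointer has to fit in it. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t),
              "pointer does not fit in a 64-bit WED");

/* Hypothetical work queue layout; the real one is defined by the AFU. */
struct hypothetical_work_queue {
        uint64_t head;    /* next entry the AFU will consume */
        uint64_t tail;    /* next entry userspace will fill */
        uint64_t status;  /* status word written by the AFU */
};

/* Turn a userspace pointer (an effective address) into a WED value. */
static inline uint64_t wed_from_pointer(const void *p)
{
        return (uint64_t)(uintptr_t)p;
}

/* Inverse conversion, shown only to make the round trip explicit. */
static inline void *pointer_from_wed(uint64_t wed)
{
        return (void *)(uintptr_t)wed;
}
```

    Because the AFU sees the same effective addresses as the process,
    no further translation of the pointer should be needed on the
    userspace side.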


User API
========

    For AFUs operating in AFU directed mode, two character device
    files will be created. /dev/cxl/afu0.0m will correspond to a
    master context and /dev/cxl/afu0.0s will correspond to a slave
    context. Master contexts have access to the full MMIO space an
    AFU provides. Slave contexts have access to only the per-process
    MMIO space an AFU provides.

    For AFUs operating in dedicated process mode, the driver will
    only create a single character device per AFU called
    /dev/cxl/afu0.0d. This will have access to the entire MMIO space
    that the AFU provides (like master contexts in AFU directed mode).
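
    The naming scheme above can be sketched as a small helper that
    builds the device path. It assumes the two numbers in afuX.Y are a
    card index and an AFU index; the enum and function names are
    illustrative and not part of the kernel API.

```c
#include <assert.h>
#include <stdio.h>

/* Suffix of the device node, matching the names used in this document. */
enum cxl_node_kind {
        CXL_NODE_MASTER = 'm',     /* AFU directed, full MMIO space */
        CXL_NODE_SLAVE = 's',      /* AFU directed, per-process MMIO space */
        CXL_NODE_DEDICATED = 'd'   /* dedicated process mode */
};

/*
 * Format "/dev/cxl/afu<card>.<afu><kind>" into buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
static inline int cxl_node_path(char *buf, size_t len, unsigned card,
                                unsigned afu, enum cxl_node_kind kind)
{
        int n;

        assert(buf != NULL);
        n = snprintf(buf, len, "/dev/cxl/afu%u.%u%c", card, afu, (char)kind);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

    The resulting path can then be passed to open() as described
    below.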

    The types described below are defined in include/uapi/misc/cxl.h.

    The following file operations are supported on both slave and
    master devices.

    A userspace library libcxl is available here:
        https://github.com/ibm-capi/libcxl
    This provides a C interface to this kernel API.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API.

    A dedicated mode AFU only has one context and only allows the
    device to be opened once.

    An AFU directed mode AFU can have many contexts; the device can be
    opened once for each context that is available.

    When all available contexts are allocated, the open call will fail
    and return -ENOSPC.

    Note: IRQs need to be allocated for each context, which may limit
          the number of contexts that can be created, and therefore
          how many times the device can be opened. The POWER8 CAPP
          supports 2040 IRQs and 3 are used by the kernel, so 2037 are
          left. If 1 IRQ is needed per context, then only 2037
          contexts can be allocated. If 4 IRQs are needed per context,
          then only 2037/4 = 509 contexts can be allocated.
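
    The arithmetic in the note can be written out as follows. The
    constants come from the note itself; the function name is
    illustrative.

```c
#include <assert.h>

enum {
        CAPP_TOTAL_IRQS = 2040,   /* IRQs supported by the POWER8 CAPP */
        KERNEL_IRQS = 3           /* IRQs used by the kernel itself */
};

/* Upper bound on contexts when each one needs irqs_per_context IRQs. */
static inline int max_contexts(int irqs_per_context)
{
        assert(irqs_per_context > 0);
        /* Integer division: a context with a partial IRQ set is useless. */
        return (CAPP_TOTAL_IRQS - KERNEL_IRQS) / irqs_per_context;
}
```

    This is only the IRQ-imposed bound; the AFU itself may support
    fewer contexts.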


ioctl
-----

    CXL_IOCTL_START_WORK:
        Starts the AFU context and associates it with the current
        process. Once this ioctl is successfully executed, all memory
        mapped into this process is accessible to this AFU context
        using the same effective addresses. No additional calls are
        required to map/unmap memory. The AFU memory context will be
        updated as userspace allocates and frees memory. This ioctl
        returns once the AFU context is started.

        Takes a pointer to a struct cxl_ioctl_start_work:

                struct cxl_ioctl_start_work {
                        __u64 flags;
                        __u64 work_element_descriptor;
                        __u64 amr;
                        __s16 num_interrupts;
                        __s16 reserved1;
                        __s32 reserved2;
                        __u64 reserved3;
                        __u64 reserved4;
                        __u64 reserved5;
                        __u64 reserved6;
                };

            flags:
                Indicates which optional fields in the structure are
                valid.

            work_element_descriptor:
                The Work Element Descriptor (WED) is a 64-bit argument
                defined by the AFU. Typically this is an effective
                address pointing to an AFU specific structure
                describing what work to perform.

            amr:
                Authority Mask Register (AMR), same as the powerpc
                AMR. This field is only used by the kernel when the
                corresponding CXL_START_WORK_AMR value is specified in
                flags. If not specified, the kernel will use a default
                value of 0.

            num_interrupts:
                Number of userspace interrupts to request. This field
                is only used by the kernel when the corresponding
                CXL_START_WORK_NUM_IRQS value is specified in flags.
                If not specified, the minimum number required by the
                AFU will be allocated. The min and max number can be
                obtained from sysfs.

            reserved fields:
                For ABI padding and future extensions.

    CXL_IOCTL_GET_PROCESS_ELEMENT:
        Get the current context id, also known as the process element.
        The value is returned from the kernel as a __u32.
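
    A minimal sketch of preparing the CXL_IOCTL_START_WORK argument
    follows. To stay self-contained it uses a local mirror of the
    structure listed above instead of including the uapi header, and
    the two flag values are assumptions that should be checked against
    include/uapi/misc/cxl.h. Zeroing the reserved fields is a
    conservative choice rather than something this document requires.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work as listed above. */
struct start_work_sketch {
        uint64_t flags;
        uint64_t work_element_descriptor;
        uint64_t amr;
        int16_t  num_interrupts;
        int16_t  reserved1;
        int32_t  reserved2;
        uint64_t reserved3;
        uint64_t reserved4;
        uint64_t reserved5;
        uint64_t reserved6;
};

/* Assumed flag values; take the real ones from the uapi header. */
#define SKETCH_START_WORK_AMR       0x1ULL
#define SKETCH_START_WORK_NUM_IRQS  0x2ULL

static_assert(sizeof(struct start_work_sketch) == 64, "unexpected padding");

/*
 * Fill in a request. A negative num_interrupts leaves the optional field
 * unset, so the kernel allocates the minimum the AFU requires. The AMR
 * flag is never set here, so the kernel default of 0 applies.
 */
static inline void start_work_init(struct start_work_sketch *w,
                                   uint64_t wed, int num_interrupts)
{
        memset(w, 0, sizeof(*w));          /* reserved fields stay zero */
        w->work_element_descriptor = wed;
        if (num_interrupts >= 0) {
                w->flags |= SKETCH_START_WORK_NUM_IRQS;
                w->num_interrupts = (int16_t)num_interrupts;
        }
}
```

    With the real header, the filled structure would be passed by
    pointer as the third argument of ioctl() on the opened device.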


mmap
----

    An AFU may have an MMIO space to facilitate communication with the
    AFU. If it does, the MMIO space can be accessed via mmap. The size
    and contents of this area are specific to the particular AFU. The
    size can be discovered via sysfs.

    In AFU directed mode, master contexts are allowed to map all of
    the MMIO space and slave contexts are allowed to only map the per
    process MMIO space associated with the context. In dedicated
    process mode the entire MMIO space can always be mapped.

    This mmap call must be done after the START_WORK ioctl.

    Care should be taken when accessing MMIO space. Only 32-bit and
    64-bit accesses are supported by POWER8. Also, the AFU will be
    designed with a specific endianness, so all MMIO accesses should
    take endianness into account (the endian(3) helpers such as
    le64toh() and be64toh() are recommended). These endian issues
    apply equally to shared memory queues the WED may describe.
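
    The endianness advice can be sketched with a pair of 64-bit
    accessors. The example assumes an AFU whose registers are
    big-endian, which is purely illustrative, and the helper names are
    made up. Real MMIO code may also need memory barriers, which are
    not shown here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes be64toh()/htobe64() in glibc */
#endif
#include <assert.h>
#include <endian.h>
#include <stdint.h>

/* Read a 64-bit register of an AFU that is assumed to be big-endian. */
static inline uint64_t mmio_read64_be(const volatile uint64_t *reg)
{
        assert(((uintptr_t)reg & 7) == 0);  /* keep the access naturally aligned */
        return be64toh(*reg);               /* one 64-bit load, then convert */
}

/* Write a 64-bit register of an AFU that is assumed to be big-endian. */
static inline void mmio_write64_be(volatile uint64_t *reg, uint64_t value)
{
        assert(((uintptr_t)reg & 7) == 0);
        *reg = htobe64(value);              /* convert, then one 64-bit store */
}
```

    For a little-endian AFU the same pattern applies with le64toh()
    and htole64().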


read
----

    Reads events from the AFU. Blocks if no events are pending
    (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
    unrecoverable error or if the card is removed.

    read() will always return an integral number of events.

    The buffer passed to read() must be at least 4K bytes.

    The result of the read will be a buffer of one or more events;
    each event is a struct cxl_event of varying size.

            struct cxl_event {
                    struct cxl_event_header header;
                    union {
                            struct cxl_event_afu_interrupt irq;
                            struct cxl_event_data_storage fault;
                            struct cxl_event_afu_error afu_error;
                    };
            };

    The struct cxl_event_header is defined as:

            struct cxl_event_header {
                    __u16 type;
                    __u16 size;
                    __u16 process_element;
                    __u16 reserved1;
            };

        type:
            This defines the type of event. The type determines how
            the rest of the event is structured. These types are
            described below and defined by enum cxl_event_type.

        size:
            This is the size of the event in bytes including the
            struct cxl_event_header. The start of the next event can
            be found at this offset from the start of the current
            event.

        process_element:
            Context ID of the event.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_AFU_INTERRUPT then the event
    structure is defined as:

            struct cxl_event_afu_interrupt {
                    __u16 flags;
                    __u16 irq; /* Raised AFU interrupt number */
                    __u32 reserved1;
            };

        flags:
            These flags indicate which optional fields are present
            in this struct. Currently all fields are mandatory.

        irq:
            The IRQ number sent by the AFU.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_DATA_STORAGE then the event
    structure is defined as:

            struct cxl_event_data_storage {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 addr;
                    __u64 dsisr;
                    __u64 reserved3;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        addr:
            The address that the AFU unsuccessfully attempted to
            access. Valid accesses will be handled transparently by the
            kernel but invalid accesses will generate this event.

        dsisr:
            This field gives information on the type of fault. It is a
            copy of the DSISR from the PSL hardware when the address
            fault occurred. The form of the DSISR is as defined in the
            CAIA.

        reserved fields:
            For future extensions.

    If the event type is CXL_EVENT_AFU_ERROR then the event structure
    is defined as:

            struct cxl_event_afu_error {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 error;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        error:
            Error status from the AFU. Defined by the AFU.

        reserved fields:
            For future extensions and padding.
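
    Since events vary in size, a reader has to step through the buffer
    using the size field of each header. The sketch below does that
    and counts events by type. It uses a local mirror of struct
    cxl_event_header, and the numeric values for the event types are
    assumptions that should be checked against enum cxl_event_type in
    include/uapi/misc/cxl.h. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header as listed above. */
struct event_header_sketch {
        uint16_t type;
        uint16_t size;
        uint16_t process_element;
        uint16_t reserved1;
};

/* Assumed values of enum cxl_event_type. */
enum {
        SKETCH_EVENT_AFU_INTERRUPT = 1,
        SKETCH_EVENT_DATA_STORAGE = 2,
        SKETCH_EVENT_AFU_ERROR = 3
};

struct event_counts {
        unsigned irq, fault, afu_error, unknown;
};

/*
 * Walk the len bytes returned by read() and count events by type.
 * Returns the number of events, or -1 if a size field is malformed.
 */
static inline int count_events(const unsigned char *buf, size_t len,
                               struct event_counts *out)
{
        size_t off = 0;
        int n = 0;

        assert(buf != NULL && out != NULL);
        memset(out, 0, sizeof(*out));
        while (off < len) {
                struct event_header_sketch h;

                if (len - off < sizeof(h))
                        return -1;                /* truncated header */
                memcpy(&h, buf + off, sizeof(h)); /* avoids unaligned access */
                if (h.size < sizeof(h) || h.size > len - off)
                        return -1;                /* must cover header and fit */
                switch (h.type) {
                case SKETCH_EVENT_AFU_INTERRUPT: out->irq++; break;
                case SKETCH_EVENT_DATA_STORAGE:  out->fault++; break;
                case SKETCH_EVENT_AFU_ERROR:     out->afu_error++; break;
                default:                         out->unknown++; break;
                }
                off += h.size;                    /* next event starts here */
                n++;
        }
        return n;
}
```

    A real consumer would pass the byte count returned by read() as
    len and would decode the payload that follows each header rather
    than just counting.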


Sysfs Class
===========

    A cxl sysfs class is added under /sys/class/cxl to facilitate
    enumeration and tuning of the accelerators. Its layout is
    described in Documentation/ABI/testing/sysfs-class-cxl


Udev rules
==========

    The following udev rules could be used to create a symlink to the
    most logical chardev to use in any programming mode (afuX.Yd for
    dedicated, afuX.Ys for AFU directed), since the API is virtually
    identical for each:

        SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
        SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
                          KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"