Rationale
=========

How significant is the cache maintenance overhead?
It depends. Fast eMMC and multiple cache levels with speculative cache
pre-fetch make the cache overhead relatively significant. If the DMA
preparations for the next request are done in parallel with the current
transfer, the DMA preparation overhead does not affect MMC performance.
The intention of non-blocking (asynchronous) MMC requests is to minimize the
time between when an MMC request ends and another MMC request begins.
Using mmc_wait_for_req(), the MMC controller is idle while dma_map_sg() and
dma_unmap_sg() are processing. Using non-blocking MMC requests makes it
possible to prepare the caches for the next job in parallel with an active
MMC request.

MMC block driver
================

The mmc_blk_issue_rw_rq() in the MMC block driver is made non-blocking.
The increase in throughput is proportional to the time it takes to
prepare a request (the major part of the preparation is dma_map_sg() and
dma_unmap_sg()) and to how fast the memory is. The faster the MMC/SD is, the
more significant the prepare request time becomes. Roughly, the expected
performance gain is 5% for large writes and 10% for large reads on an L2
cache platform. In power save mode, when clocks run at a lower frequency,
the DMA preparation may cost even more. As long as these slower preparations
are run in parallel with the transfer, performance won't be affected.

Details on measurements from IOZone and mmc_test
================================================

https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req

MMC core API extension
======================

There is one new public function, mmc_start_req().
It starts a new MMC command request for a host. The function isn't
truly non-blocking. If there is an ongoing async request, it waits
for completion of that request, starts the new one, and returns. It
doesn't wait for the new request to complete. If there is no ongoing
request, it starts the new request and returns immediately.

MMC host extensions
===================

There are two optional members in the mmc_host_ops -- pre_req() and
post_req() -- that the host driver may implement in order to move work
to before and after the actual mmc_host_ops.request() function is called.
In the DMA case, pre_req() may do dma_map_sg() and prepare the DMA
descriptor, and post_req() runs dma_unmap_sg().

Optimize for the first request
==============================

The first request in a series of requests can't be prepared in parallel
with the previous transfer, since there is no previous request.
The argument is_first_req in pre_req() indicates that there is no previous
request. The host driver may optimize for this scenario to minimize
the performance loss. One way to optimize for this is to split the current
request into two chunks: start the MMC command for the complete request,
prepare the first chunk and hand it to the DMA controller, and then prepare
the second chunk while the first one is being transferred.

Pseudocode to handle is_first_req scenario with minimal prepare overhead:

  if (is_first_req && req->size > threshold) {
     /* start MMC transfer for the complete transfer size */
     mmc_start_command(MMC_CMD_TRANSFER_FULL_SIZE);

     /*
      * Begin to prepare DMA while cmd is being processed by MMC.
      * The first chunk of the request should take the same time
      * to prepare as the "MMC process command time".
      * If prepare time exceeds MMC cmd time
      * the transfer is delayed, guesstimate max 4k as first chunk size.
      */
     prepare_1st_chunk_for_dma(req);
     /* flush pending desc to the DMAC (dmaengine.h) */
     dma_issue_pending(req->dma_desc);

     prepare_2nd_chunk_for_dma(req);
     /*
      * The second issue_pending should be called before MMC runs out
      * of the first chunk. If the MMC runs out of the first data chunk
      * before this call, the transfer is delayed.
      */
     dma_issue_pending(req->dma_desc);
  }
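
The reason only the first request pays the full preparation cost can be
illustrated with a small timing model. The following is a user-space C
sketch, not kernel code: the function names blocking_total() and
nonblocking_total() and the time units are hypothetical. The model assumes
that the next request is prepared before waiting for the ongoing one, and
that post-processing of the finished request runs after the next transfer
has been started.

```c
#include <assert.h>

/*
 * Illustrative timing model only; all names here are hypothetical.
 * Times are in arbitrary units:
 *   prep - cost of pre_req() work, e.g. dma_map_sg()
 *   xfer - time the MMC controller needs for one transfer
 *   post - cost of post_req() work, e.g. dma_unmap_sg()
 */

/* Blocking model: each request runs prepare, transfer, post back to back. */
long blocking_total(int n, long prep, long xfer, long post)
{
	assert(n > 0 && prep >= 0 && xfer >= 0 && post >= 0);
	return n * (prep + xfer + post);
}

/*
 * Non-blocking model (assumed ordering): prepare the next request while
 * the previous one transfers, wait for the previous one, start the next
 * one, then post-process the previous one in parallel with the new
 * transfer.
 */
long nonblocking_total(int n, long prep, long xfer, long post)
{
	long cpu;	/* time at which the CPU is free again */
	long done;	/* time at which the ongoing transfer completes */
	int i;

	assert(n > 0 && prep >= 0 && xfer >= 0 && post >= 0);

	cpu = prep;		/* first request: nothing to overlap with */
	done = cpu + xfer;	/* first transfer is now running */

	for (i = 1; i < n; i++) {
		cpu += prep;		/* prepare next during the transfer */
		if (cpu < done)
			cpu = done;	/* wait for the previous request */
		done = cpu + xfer;	/* start the next transfer */
		cpu += post;		/* post-process the previous request */
	}

	if (cpu < done)
		cpu = done;		/* wait for the last transfer */
	return cpu + post;		/* post-process the last request */
}
```

With n = 4, prep = 1, xfer = 10 and post = 1, blocking_total() returns 48
and nonblocking_total() returns 42. In this model the only preparation that
is not hidden behind a transfer is the one for the first request, which is
the cost the is_first_req handling tries to reduce. When prep + post exceeds
xfer, the controller still has to wait for the CPU, so the gain shrinks.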