1Wait/Wound Deadlock-Proof Mutex Design 2====================================== 3 4Please read mutex-design.txt first, as it applies to wait/wound mutexes too. 5 6Motivation for WW-Mutexes 7------------------------- 8 9GPU's do operations that commonly involve many buffers. Those buffers 10can be shared across contexts/processes, exist in different memory 11domains (for example VRAM vs system memory), and so on. And with 12PRIME / dmabuf, they can even be shared across devices. So there are 13a handful of situations where the driver needs to wait for buffers to 14become ready. If you think about this in terms of waiting on a buffer 15mutex for it to become available, this presents a problem because 16there is no way to guarantee that buffers appear in a execbuf/batch in 17the same order in all contexts. That is directly under control of 18userspace, and a result of the sequence of GL calls that an application 19makes. Which results in the potential for deadlock. The problem gets 20more complex when you consider that the kernel may need to migrate the 21buffer(s) into VRAM before the GPU operates on the buffer(s), which 22may in turn require evicting some other buffers (and you don't want to 23evict other buffers which are already queued up to the GPU), but for a 24simplified understanding of the problem you can ignore this. 25 26The algorithm that the TTM graphics subsystem came up with for dealing with 27this problem is quite simple. For each group of buffers (execbuf) that need 28to be locked, the caller would be assigned a unique reservation id/ticket, 29from a global counter. In case of deadlock while locking all the buffers 30associated with a execbuf, the one with the lowest reservation ticket (i.e. 31the oldest task) wins, and the one with the higher reservation id (i.e. the 32younger task) unlocks all of the buffers that it has already locked, and then 33tries again. 34 35In the RDBMS literature this deadlock handling approach is called wait/wound: 36The older tasks waits until it can acquire the contended lock. The younger tasks 37needs to back off and drop all the locks it is currently holding, i.e. the 38younger task is wounded. 39 40Concepts 41-------- 42 43Compared to normal mutexes two additional concepts/objects show up in the lock 44interface for w/w mutexes: 45 46Acquire context: To ensure eventual forward progress it is important the a task 47trying to acquire locks doesn't grab a new reservation id, but keeps the one it 48acquired when starting the lock acquisition. This ticket is stored in the 49acquire context. Furthermore the acquire context keeps track of debugging state 50to catch w/w mutex interface abuse. 51 52W/w class: In contrast to normal mutexes the lock class needs to be explicit for 53w/w mutexes, since it is required to initialize the acquire context. 54 55Furthermore there are three different class of w/w lock acquire functions: 56 57* Normal lock acquisition with a context, using ww_mutex_lock. 58 59* Slowpath lock acquisition on the contending lock, used by the wounded task 60 after having dropped all already acquired locks. These functions have the 61 _slow postfix. 62 63 From a simple semantics point-of-view the _slow functions are not strictly 64 required, since simply calling the normal ww_mutex_lock functions on the 65 contending lock (after having dropped all other already acquired locks) will 66 work correctly. After all if no other ww mutex has been acquired yet there's 67 no deadlock potential and hence the ww_mutex_lock call will block and not 68 prematurely return -EDEADLK. The advantage of the _slow functions is in 69 interface safety: 70 - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow 71 has a void return type. Note that since ww mutex code needs loops/retries 72 anyway the __must_check doesn't result in spurious warnings, even though the 73 very first lock operation can never fail. 74 - When full debugging is enabled ww_mutex_lock_slow checks that all acquired 75 ww mutex have been released (preventing deadlocks) and makes sure that we 76 block on the contending lock (preventing spinning through the -EDEADLK 77 slowpath until the contended lock can be acquired). 78 79* Functions to only acquire a single w/w mutex, which results in the exact same 80 semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL 81 context. 82 83 Again this is not strictly required. But often you only want to acquire a 84 single lock in which case it's pointless to set up an acquire context (and so 85 better to avoid grabbing a deadlock avoidance ticket). 86 87Of course, all the usual variants for handling wake-ups due to signals are also 88provided. 89 90Usage 91----- 92 93Three different ways to acquire locks within the same w/w class. Common 94definitions for methods #1 and #2: 95 96static DEFINE_WW_CLASS(ww_class); 97 98struct obj { 99 struct ww_mutex lock; 100 /* obj data */ 101}; 102 103struct obj_entry { 104 struct list_head head; 105 struct obj *obj; 106}; 107 108Method 1, using a list in execbuf->buffers that's not allowed to be reordered. 109This is useful if a list of required objects is already tracked somewhere. 110Furthermore the lock helper can use propagate the -EALREADY return code back to 111the caller as a signal that an object is twice on the list. This is useful if 112the list is constructed from userspace input and the ABI requires userspace to 113not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl). 114 115int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) 116{ 117 struct obj *res_obj = NULL; 118 struct obj_entry *contended_entry = NULL; 119 struct obj_entry *entry; 120 121 ww_acquire_init(ctx, &ww_class); 122 123retry: 124 list_for_each_entry (entry, list, head) { 125 if (entry->obj == res_obj) { 126 res_obj = NULL; 127 continue; 128 } 129 ret = ww_mutex_lock(&entry->obj->lock, ctx); 130 if (ret < 0) { 131 contended_entry = entry; 132 goto err; 133 } 134 } 135 136 ww_acquire_done(ctx); 137 return 0; 138 139err: 140 list_for_each_entry_continue_reverse (entry, list, head) 141 ww_mutex_unlock(&entry->obj->lock); 142 143 if (res_obj) 144 ww_mutex_unlock(&res_obj->lock); 145 146 if (ret == -EDEADLK) { 147 /* we lost out in a seqno race, lock and retry.. */ 148 ww_mutex_lock_slow(&contended_entry->obj->lock, ctx); 149 res_obj = contended_entry->obj; 150 goto retry; 151 } 152 ww_acquire_fini(ctx); 153 154 return ret; 155} 156 157Method 2, using a list in execbuf->buffers that can be reordered. Same semantics 158of duplicate entry detection using -EALREADY as method 1 above. But the 159list-reordering allows for a bit more idiomatic code. 160 161int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) 162{ 163 struct obj_entry *entry, *entry2; 164 165 ww_acquire_init(ctx, &ww_class); 166 167 list_for_each_entry (entry, list, head) { 168 ret = ww_mutex_lock(&entry->obj->lock, ctx); 169 if (ret < 0) { 170 entry2 = entry; 171 172 list_for_each_entry_continue_reverse (entry2, list, head) 173 ww_mutex_unlock(&entry2->obj->lock); 174 175 if (ret != -EDEADLK) { 176 ww_acquire_fini(ctx); 177 return ret; 178 } 179 180 /* we lost out in a seqno race, lock and retry.. */ 181 ww_mutex_lock_slow(&entry->obj->lock, ctx); 182 183 /* 184 * Move buf to head of the list, this will point 185 * buf->next to the first unlocked entry, 186 * restarting the for loop. 187 */ 188 list_del(&entry->head); 189 list_add(&entry->head, list); 190 } 191 } 192 193 ww_acquire_done(ctx); 194 return 0; 195} 196 197Unlocking works the same way for both methods #1 and #2: 198 199void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) 200{ 201 struct obj_entry *entry; 202 203 list_for_each_entry (entry, list, head) 204 ww_mutex_unlock(&entry->obj->lock); 205 206 ww_acquire_fini(ctx); 207} 208 209Method 3 is useful if the list of objects is constructed ad-hoc and not upfront, 210e.g. when adjusting edges in a graph where each node has its own ww_mutex lock, 211and edges can only be changed when holding the locks of all involved nodes. w/w 212mutexes are a natural fit for such a case for two reasons: 213- They can handle lock-acquisition in any order which allows us to start walking 214 a graph from a starting point and then iteratively discovering new edges and 215 locking down the nodes those edges connect to. 216- Due to the -EALREADY return code signalling that a given objects is already 217 held there's no need for additional book-keeping to break cycles in the graph 218 or keep track off which looks are already held (when using more than one node 219 as a starting point). 220 221Note that this approach differs in two important ways from the above methods: 222- Since the list of objects is dynamically constructed (and might very well be 223 different when retrying due to hitting the -EDEADLK wound condition) there's 224 no need to keep any object on a persistent list when it's not locked. We can 225 therefore move the list_head into the object itself. 226- On the other hand the dynamic object list construction also means that the -EALREADY return 227 code can't be propagated. 228 229Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a 230list of starting nodes (passed in from userspace) using one of the above 231methods. And then lock any additional objects affected by the operations using 232method #3 below. The backoff/retry procedure will be a bit more involved, since 233when the dynamic locking step hits -EDEADLK we also need to unlock all the 234objects acquired with the fixed list. But the w/w mutex debug checks will catch 235any interface misuse for these cases. 236 237Also, method 3 can't fail the lock acquisition step since it doesn't return 238-EALREADY. Of course this would be different when using the _interruptible 239variants, but that's outside of the scope of these examples here. 240 241struct obj { 242 struct ww_mutex ww_mutex; 243 struct list_head locked_list; 244}; 245 246static DEFINE_WW_CLASS(ww_class); 247 248void __unlock_objs(struct list_head *list) 249{ 250 struct obj *entry, *temp; 251 252 list_for_each_entry_safe (entry, temp, list, locked_list) { 253 /* need to do that before unlocking, since only the current lock holder is 254 allowed to use object */ 255 list_del(&entry->locked_list); 256 ww_mutex_unlock(entry->ww_mutex) 257 } 258} 259 260void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) 261{ 262 struct obj *obj; 263 264 ww_acquire_init(ctx, &ww_class); 265 266retry: 267 /* re-init loop start state */ 268 loop { 269 /* magic code which walks over a graph and decides which objects 270 * to lock */ 271 272 ret = ww_mutex_lock(obj->ww_mutex, ctx); 273 if (ret == -EALREADY) { 274 /* we have that one already, get to the next object */ 275 continue; 276 } 277 if (ret == -EDEADLK) { 278 __unlock_objs(list); 279 280 ww_mutex_lock_slow(obj, ctx); 281 list_add(&entry->locked_list, list); 282 goto retry; 283 } 284 285 /* locked a new object, add it to the list */ 286 list_add_tail(&entry->locked_list, list); 287 } 288 289 ww_acquire_done(ctx); 290 return 0; 291} 292 293void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) 294{ 295 __unlock_objs(list); 296 ww_acquire_fini(ctx); 297} 298 299Method 4: Only lock one single objects. In that case deadlock detection and 300prevention is obviously overkill, since with grabbing just one lock you can't 301produce a deadlock within just one class. To simplify this case the w/w mutex 302api can be used with a NULL context. 303 304Implementation Details 305---------------------- 306 307Design: 308 ww_mutex currently encapsulates a struct mutex, this means no extra overhead for 309 normal mutex locks, which are far more common. As such there is only a small 310 increase in code size if wait/wound mutexes are not used. 311 312 In general, not much contention is expected. The locks are typically used to 313 serialize access to resources for devices. The only way to make wakeups 314 smarter would be at the cost of adding a field to struct mutex_waiter. This 315 would add overhead to all cases where normal mutexes are used, and 316 ww_mutexes are generally less performance sensitive. 317 318Lockdep: 319 Special care has been taken to warn for as many cases of api abuse 320 as possible. Some common api abuses will be caught with 321 CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended. 322 323 Some of the errors which will be warned about: 324 - Forgetting to call ww_acquire_fini or ww_acquire_init. 325 - Attempting to lock more mutexes after ww_acquire_done. 326 - Attempting to lock the wrong mutex after -EDEADLK and 327 unlocking all mutexes. 328 - Attempting to lock the right mutex after -EDEADLK, 329 before unlocking all mutexes. 330 331 - Calling ww_mutex_lock_slow before -EDEADLK was returned. 332 333 - Unlocking mutexes with the wrong unlock function. 334 - Calling one of the ww_acquire_* twice on the same context. 335 - Using a different ww_class for the mutex than for the ww_acquire_ctx. 336 - Normal lockdep errors that can result in deadlocks. 337 338 Some of the lockdep errors that can result in deadlocks: 339 - Calling ww_acquire_init to initialize a second ww_acquire_ctx before 340 having called ww_acquire_fini on the first. 341 - 'normal' deadlocks that can occur. 342 343FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic 344implemented. 345