<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
        "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>

<book id="LKLockingGuide">
 <bookinfo>
  <title>Unreliable Guide To Locking</title>

  <authorgroup>
   <author>
    <firstname>Rusty</firstname>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2003</year>
   <holder>Rusty Russell</holder>
  </copyright>

  <legalnotice>
   <para>
     This documentation is free software; you can redistribute
     it and/or modify it under the terms of the GNU General Public
     License as published by the Free Software Foundation; either
     version 2 of the License, or (at your option) any later
     version.
   </para>

   <para>
     This program is distributed in the hope that it will be
     useful, but WITHOUT ANY WARRANTY; without even the implied
     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
     See the GNU General Public License for more details.
   </para>

   <para>
     You should have received a copy of the GNU General Public
     License along with this program; if not, write to the Free
     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
     MA 02111-1307 USA
   </para>

   <para>
     For more details see the file COPYING in the source
     distribution of Linux.
   </para>
  </legalnotice>
 </bookinfo>

 <toc></toc>
  <chapter id="intro">
   <title>Introduction</title>
   <para>
     Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
     Locking issues.  This document describes the locking systems in
     the Linux Kernel in 2.6.
   </para>
   <para>
     With the wide availability of HyperThreading, and <firstterm
     linkend="gloss-preemption">preemption </firstterm> in the Linux
     Kernel, everyone hacking on the kernel needs to know the
     fundamentals of concurrency and locking for
     <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>.
   </para>
  </chapter>

   <chapter id="races">
    <title>The Problem With Concurrency</title>
    <para>
      (Skip this if you know what a Race Condition is).
    </para>
    <para>
      In a normal program, you can increment a counter like so:
    </para>
    <programlisting>
      very_important_count++;
    </programlisting>

    <para>
      This is what you would expect to happen:
    </para>

    <table>
     <title>Expected Results</title>

     <tgroup cols="2" align="left">

      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (6)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (7)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (7)</entry>
       </row>
      </tbody>

     </tgroup>
    </table>

    <para>
      This is what might happen:
    </para>

    <table>
     <title>Possible Results</title>

     <tgroup cols="2" align="left">
      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (5)</entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (6)</entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (6)</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <sect1 id="race-condition">
    <title>Race Conditions and Critical Regions</title>
    <para>
      This overlap, where the result depends on the
      relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>.
      The piece of code containing the concurrency issue is called a
      <firstterm>critical region</firstterm>.  Especially since Linux started running
      on SMP machines, race conditions have become one of the major issues
      in kernel design and implementation.
    </para>
    <para>
      Preemption can have the same effect, even if there is only one
      CPU: by preempting one task during the critical region, we have
      exactly the same race condition.  In this case the thread which
      preempts might run the critical region itself.
    </para>
    <para>
      The solution is to recognize when these simultaneous accesses
      occur, and use locks to make sure that only one instance can
      enter the critical region at any time.  There are many
      friendly primitives in the Linux kernel to help you do this.
      And then there are the unfriendly primitives, but I'll pretend
      they don't exist.
    </para>
    </sect1>
  </chapter>

  <chapter id="locks">
   <title>Locking in the Linux Kernel</title>

   <para>
     If I could give you one piece of advice: never sleep with anyone
     crazier than yourself.  But if I had to give you advice on
     locking: <emphasis>keep it simple</emphasis>.
   </para>

   <para>
     Be reluctant to introduce new locks.
   </para>

   <para>
     Strangely enough, this last one is the exact reverse of my advice when
     you <emphasis>have</emphasis> slept with someone crazier than yourself.
     And you should think about getting a big dog.
   </para>

   <sect1 id="lock-intro">
   <title>Two Main Types of Kernel Locks: Spinlocks and Mutexes</title>

   <para>
     There are two main types of kernel locks.  The fundamental type
     is the spinlock
     (<filename class="headerfile">include/asm/spinlock.h</filename>),
     which is a very simple single-holder lock: if you can't get the
     spinlock, you keep trying (spinning) until you can.  Spinlocks are
     very small and fast, and can be used anywhere.
   </para>
   <para>
     The second type is a mutex
     (<filename class="headerfile">include/linux/mutex.h</filename>): it
     is like a spinlock, but you may block holding a mutex.
     If you can't lock a mutex, your task will suspend itself, and be woken
     up when the mutex is released.  This means the CPU can do something
     else while you are waiting.  There are many cases when you simply
     can't sleep (see <xref linkend="sleeping-things"/>), and so have to
     use a spinlock instead.
   </para>
   <para>
     Neither type of lock is recursive: see
     <xref linkend="deadlock"/>.
   </para>
   </sect1>

   <sect1 id="uniprocessor">
    <title>Locks and Uniprocessor Kernels</title>

    <para>
      For kernels compiled without <symbol>CONFIG_SMP</symbol>, and
      without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at
      all.  This is an excellent design decision: when no-one else can
      run at the same time, there is no reason to have a lock.
    </para>

    <para>
      If the kernel is compiled without <symbol>CONFIG_SMP</symbol>,
      but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks
      simply disable preemption, which is sufficient to prevent any
      races.  For most purposes, we can think of preemption as
      equivalent to SMP, and not worry about it separately.
    </para>

    <para>
      You should always test your locking code with <symbol>CONFIG_SMP</symbol>
      and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it
      will still catch some kinds of locking bugs.
    </para>

    <para>
      Mutexes still exist, because they are required for
      synchronization between <firstterm linkend="gloss-usercontext">user
      contexts</firstterm>, as we will see below.
    </para>
   </sect1>

    <sect1 id="usercontextlocking">
     <title>Locking Only In User Context</title>

     <para>
       If you have a data structure which is only ever accessed from
       user context, then you can use a simple mutex
       (<filename>include/linux/mutex.h</filename>) to protect it.  This
       is the most trivial case: you initialize the mutex.  Then you can
       call <function>mutex_lock_interruptible()</function> to grab the mutex,
       and <function>mutex_unlock()</function> to release it.  There is also a
       <function>mutex_lock()</function>, which should be avoided, because it
       will not return if a signal is received.
     </para>

     <para>
       Example: <filename>net/netfilter/nf_sockopt.c</filename> allows
       registration of new <function>setsockopt()</function> and
       <function>getsockopt()</function> calls, with
       <function>nf_register_sockopt()</function>.  Registration and
       de-registration are only done on module load and unload (and boot
       time, where there is no concurrency), and the list of registrations
       is only consulted for an unknown <function>setsockopt()</function>
       or <function>getsockopt()</function> system call.  The
       <varname>nf_sockopt_mutex</varname> is perfect to protect this,
       especially since the setsockopt and getsockopt calls may well
       sleep.
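       A minimal sketch of this pattern follows.  It is an illustration
       only, not the actual netfilter code: the names
       <varname>demo_mutex</varname>, <varname>demo_value</varname> and
       <function>demo_set()</function> are hypothetical.
<programlisting>
#include <linux/mutex.h>
#include <linux/errno.h>

/* Protects demo_value; only ever taken from user context. */
static DEFINE_MUTEX(demo_mutex);
static int demo_value;

int demo_set(int val)
{
        /* Sleeps until the mutex is free; a signal aborts the wait. */
        if (mutex_lock_interruptible(&demo_mutex))
                return -ERESTARTSYS;

        demo_value = val;       /* critical region: sleeping is allowed */

        mutex_unlock(&demo_mutex);
        return 0;
}
</programlisting>
       Returning <symbol>-ERESTARTSYS</symbol> when the wait is interrupted
       is a common convention, assumed here; the point is only that the
       caller must check the return value before touching the data.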
     </para>
   </sect1>

   <sect1 id="lock-user-bh">
    <title>Locking Between User Context and Softirqs</title>

    <para>
      If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares
      data with user context, you have two problems.  Firstly, the current
      user context can be interrupted by a softirq, and secondly, the
      critical region could be entered from another CPU.  This is where
      <function>spin_lock_bh()</function>
      (<filename class="headerfile">include/linux/spinlock.h</filename>) is
      used.  It disables softirqs on that CPU, then grabs the lock.
      <function>spin_unlock_bh()</function> does the reverse.  (The
      '_bh' suffix is a historical reference to "Bottom Halves", the
      old name for software interrupts.  It should really be
      called spin_lock_softirq() in a perfect world).
    </para>

    <para>
      Note that you can also use <function>spin_lock_irq()</function>
      or <function>spin_lock_irqsave()</function> here, which stop
      hardware interrupts as well: see <xref linkend="hardirq-context"/>.
    </para>

    <para>
      This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
      </acronym></firstterm> as well: the spin lock vanishes, and this macro
      simply becomes <function>local_bh_disable()</function>
      (<filename class="headerfile">include/linux/interrupt.h</filename>), which
      protects you from the softirq being run.
    </para>
   </sect1>

   <sect1 id="lock-user-tasklet">
    <title>Locking Between User Context and Tasklets</title>

    <para>
      This is exactly the same as above, because <firstterm
      linkend="gloss-tasklet">tasklets</firstterm> are actually run
      from a softirq.
    </para>
   </sect1>

   <sect1 id="lock-user-timers">
    <title>Locking Between User Context and Timers</title>

    <para>
      This, too, is exactly the same as above, because <firstterm
      linkend="gloss-timers">timers</firstterm> are actually run from
      a softirq.
      From a locking point of view, tasklets and timers
      are identical.
    </para>
   </sect1>

   <sect1 id="lock-tasklets">
    <title>Locking Between Tasklets/Timers</title>

    <para>
      Sometimes a tasklet or timer might want to share data with
      another tasklet or timer.
    </para>

    <sect2 id="lock-tasklets-same">
     <title>The Same Tasklet/Timer</title>
     <para>
       Since a tasklet is never run on two CPUs at once, you don't
       need to worry about your tasklet being reentrant (running
       twice at once), even on SMP.
     </para>
    </sect2>

    <sect2 id="lock-tasklets-different">
     <title>Different Tasklets/Timers</title>
     <para>
       If another tasklet/timer wants
       to share data with your tasklet or timer, you will both need to use
       <function>spin_lock()</function> and
       <function>spin_unlock()</function> calls.
       <function>spin_lock_bh()</function> is
       unnecessary here, as you are already in a tasklet, and
       none will be run on the same CPU.
     </para>
    </sect2>
   </sect1>

   <sect1 id="lock-softirqs">
    <title>Locking Between Softirqs</title>

    <para>
      Often a softirq might
      want to share data with itself or a tasklet/timer.
    </para>

    <sect2 id="lock-softirqs-same">
     <title>The Same Softirq</title>

     <para>
       The same softirq can run on the other CPUs: you can use a
       per-CPU array (see <xref linkend="per-cpu"/>) for better
       performance.  If you're going so far as to use a softirq,
       you probably care about scalable performance enough
       to justify the extra complexity.
     </para>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data.
     </para>
    </sect2>

    <sect2 id="lock-softirqs-different">
     <title>Different Softirqs</title>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data, whether it
       be a timer, tasklet, different softirq or the same or another
       softirq: any of them could be running on a different CPU.
     </para>
    </sect2>
   </sect1>
  </chapter>

  <chapter id="hardirq-context">
   <title>Hard IRQ Context</title>

   <para>
     Hardware interrupts usually communicate with a
     tasklet or softirq.  Frequently this involves putting work in a
     queue, which the softirq will take out.
   </para>

   <sect1 id="hardirq-softirq">
    <title>Locking Between Hard IRQ and Softirqs/Tasklets</title>

    <para>
      If a hardware irq handler shares data with a softirq, you have
      two concerns.  Firstly, the softirq processing can be
      interrupted by a hardware interrupt, and secondly, the
      critical region could be entered by a hardware interrupt on
      another CPU.  This is where <function>spin_lock_irq()</function> is
      used.  It is defined to disable interrupts on that cpu, then grab
      the lock.  <function>spin_unlock_irq()</function> does the reverse.
    </para>

    <para>
      The irq handler does not need to use
      <function>spin_lock_irq()</function>, because the softirq cannot
      run while the irq handler is running: it can use
      <function>spin_lock()</function>, which is slightly faster.  The
      only exception would be if a different hardware irq handler uses
      the same lock: <function>spin_lock_irq()</function> will stop
      that from interrupting us.
    </para>

    <para>
      This works perfectly for UP as well: the spin lock vanishes,
      and this macro simply becomes <function>local_irq_disable()</function>
      (<filename class="headerfile">include/asm/smp.h</filename>), which
      protects you from the softirq/tasklet/BH being run.
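      As a rough sketch of the two sides: the names
      <varname>demo_lock</varname>, <varname>demo_pending</varname>,
      <function>demo_irq_handler()</function> and
      <function>demo_tasklet_fn()</function> are hypothetical, the lock is
      assumed not to be shared with any other hardware irq handler, and
      the handler and tasklet prototypes shown vary between kernel
      versions.
<programlisting>
#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);
static unsigned int demo_pending;       /* shared by irq and tasklet */

/* Hard irq side: the tasklet cannot run on this CPU while we do. */
static irqreturn_t demo_irq_handler(int irq, void *dev_id)
{
        spin_lock(&demo_lock);
        demo_pending++;
        spin_unlock(&demo_lock);
        return IRQ_HANDLED;
}

/* Softirq side: interrupts must be off while the lock is held. */
static void demo_tasklet_fn(unsigned long data)
{
        unsigned int work;

        spin_lock_irq(&demo_lock);
        work = demo_pending;
        demo_pending = 0;
        spin_unlock_irq(&demo_lock);

        /* ... handle 'work' queued items with the lock dropped ... */
}
</programlisting>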
    </para>

    <para>
      <function>spin_lock_irqsave()</function>
      (<filename>include/linux/spinlock.h</filename>) is a variant
      which saves whether interrupts were on or off in a flags word,
      which is passed to <function>spin_unlock_irqrestore()</function>.  This
      means that the same code can be used inside a hard irq handler (where
      interrupts are already off) and in softirqs (where the irq
      disabling is required).
    </para>

    <para>
      Note that softirqs (and hence tasklets and timers) are run on
      return from hardware interrupts, so
      <function>spin_lock_irq()</function> also stops these.  In that
      sense, <function>spin_lock_irqsave()</function> is the most
      general and powerful locking function.
    </para>

   </sect1>
   <sect1 id="hardirq-hardirq">
    <title>Locking Between Two Hard IRQ Handlers</title>
    <para>
      It is rare to have to share data between two IRQ handlers, but
      if you do, <function>spin_lock_irqsave()</function> should be
      used: it is architecture-specific whether all interrupts are
      disabled inside irq handlers themselves.
    </para>
   </sect1>

  </chapter>

  <chapter id="cheatsheet">
   <title>Cheat Sheet For Locking</title>
   <para>
     Pete Zaitcev gives the following summary:
   </para>
   <itemizedlist>
      <listitem>
        <para>
          If you are in a process context (any syscall) and want to
          lock other processes out, use a mutex.  You can take a mutex
          and sleep (<function>copy_from_user*()</function> or
          <function>kmalloc(x,GFP_KERNEL)</function>).
      </para>
      </listitem>
      <listitem>
        <para>
          Otherwise (== data can be touched in an interrupt), use
          <function>spin_lock_irqsave()</function> and
          <function>spin_unlock_irqrestore()</function>.
        </para>
      </listitem>
      <listitem>
        <para>
          Avoid holding spinlock for more than 5 lines of code and
          across any function call (except accessors like
          <function>readb</function>).
        </para>
      </listitem>
    </itemizedlist>

   <sect1 id="minimum-lock-reqirements">
   <title>Table of Minimum Requirements</title>

   <para> The following table lists the <emphasis>minimum</emphasis>
        locking requirements between various contexts.  In some cases,
        the same context can only be running on one CPU at a time, so
        no locking is required for that context (eg. a particular
        thread can only run on one CPU at a time, but if it needs
        to share data with another thread, locking is required).
   </para>
   <para>
        Remember the advice above: you can always use
        <function>spin_lock_irqsave()</function>, which is a superset
        of all other spinlock primitives.
   </para>

   <table>
<title>Table of Locking Requirements</title>
<tgroup cols="11">
<tbody>

<row>
<entry></entry>
<entry>IRQ Handler A</entry>
<entry>IRQ Handler B</entry>
<entry>Softirq A</entry>
<entry>Softirq B</entry>
<entry>Tasklet A</entry>
<entry>Tasklet B</entry>
<entry>Timer A</entry>
<entry>Timer B</entry>
<entry>User Context A</entry>
<entry>User Context B</entry>
</row>

<row>
<entry>IRQ Handler A</entry>
<entry>None</entry>
</row>

<row>
<entry>IRQ Handler B</entry>
<entry>SLIS</entry>
<entry>None</entry>
</row>

<row>
<entry>Softirq A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
</row>

<row>
<entry>Softirq B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
</row>

<row>
<entry>Tasklet A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Tasklet B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Timer A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Timer B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>User Context A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>None</entry>
</row>

<row>
<entry>User Context B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>MLI</entry>
<entry>None</entry>
</row>

</tbody>
</tgroup>
</table>

   <table>
<title>Legend for Locking Requirements Table</title>
<tgroup cols="2">
<tbody>

<row>
<entry>SLIS</entry>
<entry>spin_lock_irqsave</entry>
</row>
<row>
<entry>SLI</entry>
<entry>spin_lock_irq</entry>
</row>
<row>
<entry>SL</entry>
<entry>spin_lock</entry>
</row>
<row>
<entry>SLBH</entry>
<entry>spin_lock_bh</entry>
</row>
<row>
<entry>MLI</entry>
<entry>mutex_lock_interruptible</entry>
</row>

</tbody>
</tgroup>
</table>

</sect1>
</chapter>

<chapter id="trylock-functions">
 <title>The trylock Functions</title>
  <para>
   There are functions that try to acquire a lock only once and immediately
   return a value telling about success or failure to acquire the lock.
   They can be used if you need no access to the data protected with the lock
   when some other thread is holding the lock. You should acquire the lock
   later if you then need access to the data protected with the lock.
  </para>

  <para>
   <function>spin_trylock()</function> does not spin but returns non-zero if
   it acquires the spinlock on the first try or 0 if not. This function can
   be used in all contexts like <function>spin_lock</function>: you must have
   disabled the contexts that might interrupt you and acquire the spin lock.
  </para>

  <para>
   <function>mutex_trylock()</function> does not suspend your task
   but returns non-zero if it could lock the mutex on the first try
   or 0 if not. This function cannot be safely used in hardware or software
   interrupt contexts despite not sleeping.
  </para>
</chapter>

  <chapter id="Examples">
   <title>Common Examples</title>
    <para>
Let's step through a simple example: a cache of number to name
mappings.  The cache keeps a count of how often each of the objects is
used, and when it gets full, throws out the least used one.

    </para>

   <sect1 id="examples-usercontext">
    <title>All In User Context</title>
    <para>
For our first example, we assume that all operations are in user
context (ie. from system calls), so we can sleep.  This means we can
use a mutex to protect the cache and all the objects within
it.
Here's the code:
    </para>

    <programlisting>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/mutex.h>
#include <asm/errno.h>

struct object
{
        struct list_head list;
        int id;
        char name[32];
        int popularity;
};

/* Protects the cache, cache_num, and the objects within it */
static DEFINE_MUTEX(cache_lock);
static LIST_HEAD(cache);
static unsigned int cache_num = 0;
#define MAX_CACHE_SIZE 10

/* Must be holding cache_lock */
static struct object *__cache_find(int id)
{
        struct object *i;

        list_for_each_entry(i, &cache, list)
                if (i->id == id) {
                        i->popularity++;
                        return i;
                }
        return NULL;
}

/* Must be holding cache_lock */
static void __cache_delete(struct object *obj)
{
        BUG_ON(!obj);
        list_del(&obj->list);
        kfree(obj);
        cache_num--;
}

/* Must be holding cache_lock */
static void __cache_add(struct object *obj)
{
        list_add(&obj->list, &cache);
        if (++cache_num > MAX_CACHE_SIZE) {
                struct object *i, *outcast = NULL;
                list_for_each_entry(i, &cache, list) {
                        if (!outcast || i->popularity < outcast->popularity)
                                outcast = i;
                }
                __cache_delete(outcast);
        }
}

int cache_add(int id, const char *name)
{
        struct object *obj;

        if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                return -ENOMEM;

        strlcpy(obj->name, name, sizeof(obj->name));
        obj->id = id;
        obj->popularity = 0;

        mutex_lock(&cache_lock);
        __cache_add(obj);
        mutex_unlock(&cache_lock);
        return 0;
}

void cache_delete(int id)
{
        mutex_lock(&cache_lock);
        __cache_delete(__cache_find(id));
        mutex_unlock(&cache_lock);
}

int cache_find(int id, char *name)
{
        struct object *obj;
        int ret = -ENOENT;

        mutex_lock(&cache_lock);
        obj = __cache_find(id);
        if (obj) {
                ret = 0;
                strcpy(name, obj->name);
        }
        mutex_unlock(&cache_lock);
        return ret;
}
</programlisting>

    <para>
Note that we always make sure we have the cache_lock when we add,
delete, or look up the cache: both the cache infrastructure itself and
the contents of the objects are protected by the lock.  In this case
it's easy, since we copy the data for the user, and never let them
access the objects directly.
    </para>
    <para>
There is a slight (and common) optimization here: in
<function>cache_add</function> we set up the fields of the object
before grabbing the lock.  This is safe, as no-one else can access it
until we put it in cache.
    </para>
    </sect1>

   <sect1 id="examples-interrupt">
    <title>Accessing From Interrupt Context</title>
    <para>
Now consider the case where <function>cache_find</function> can be
called from interrupt context: either a hardware interrupt or a
softirq.  An example would be a timer which deletes objects from the
cache.
    </para>
    <para>
The change is shown below, in standard patch format: the
<symbol>-</symbol> are lines which are taken away, and the
<symbol>+</symbol> are lines which are added.
    </para>
<programlisting>
--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
+++ cache.c.interrupt   2003-12-09 14:07:49.000000000 +1100
@@ -12,7 +12,7 @@
         int popularity;
 };
 
-static DEFINE_MUTEX(cache_lock);
+static DEFINE_SPINLOCK(cache_lock);
 static LIST_HEAD(cache);
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
@@ -55,6 +55,7 @@
 int cache_add(int id, const char *name)
 {
         struct object *obj;
+        unsigned long flags;
 
         if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                 return -ENOMEM;
@@ -63,30 +64,33 @@
         obj->id = id;
         obj->popularity = 0;
 
-        mutex_lock(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return 0;
 }
 
 void cache_delete(int id)
 {
-        mutex_lock(&cache_lock);
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_delete(__cache_find(id));
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
 }
 
 int cache_find(int id, char *name)
 {
         struct object *obj;
         int ret = -ENOENT;
+        unsigned long flags;
 
-        mutex_lock(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj) {
                 ret = 0;
                 strcpy(name, obj->name);
         }
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return ret;
 }
</programlisting>

    <para>
Note that the <function>spin_lock_irqsave</function> will turn off
interrupts if they are on, otherwise does nothing (if we are already
in an interrupt handler), hence these functions are safe to call from
any context.
    </para>
    <para>
Unfortunately, <function>cache_add</function> calls
<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol>
flag, which is only legal in user context.
I have assumed that
<function>cache_add</function> is still only called in user context,
otherwise this should become a parameter to
<function>cache_add</function>.
    </para>
  </sect1>
   <sect1 id="examples-refcnt">
    <title>Exposing Objects Outside This File</title>
    <para>
If our objects contained more information, it might not be sufficient
to copy the information in and out: other parts of the code might want
to keep pointers to these objects, for example, rather than looking up
the id every time.  This produces two problems.
    </para>
    <para>
The first problem is that we use the <symbol>cache_lock</symbol> to
protect objects: we'd need to make this non-static so the rest of the
code can use it.  This makes locking trickier, as it is no longer all
in one place.
    </para>
    <para>
The second problem is the lifetime problem: if another structure keeps
a pointer to an object, it presumably expects that pointer to remain
valid.  Unfortunately, this is only guaranteed while you hold the
lock, otherwise someone might call <function>cache_delete</function>
and even worse, add another object, re-using the same address.
    </para>
    <para>
As there is only one lock, you can't hold it forever: no-one else would
get any work done.
    </para>
    <para>
The solution to this problem is to use a reference count: everyone who
has a pointer to the object increases it when they first get the
object, and drops the reference count when they're finished with it.
Whoever drops it to zero knows it is unused, and can actually delete it.
    </para>
    <para>
Here is the code:
    </para>

<programlisting>
--- cache.c.interrupt   2003-12-09 14:25:43.000000000 +1100
+++ cache.c.refcnt      2003-12-09 14:33:05.000000000 +1100
@@ -7,6 +7,7 @@
 struct object
 {
         struct list_head list;
+        unsigned int refcnt;
         int id;
         char name[32];
         int popularity;
@@ -17,6 +18,35 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
 
+static void __object_put(struct object *obj)
+{
+        if (--obj->refcnt == 0)
+                kfree(obj);
+}
+
+static void __object_get(struct object *obj)
+{
+        obj->refcnt++;
+}
+
+void object_put(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_put(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+void object_get(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_get(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
 /* Must be holding cache_lock */
 static struct object *__cache_find(int id)
 {
@@ -35,6 +65,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
+        __object_put(obj);
         cache_num--;
 }
 
@@ -63,6 +94,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
+        obj->refcnt = 1; /* The cache holds a reference */
 
         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -79,18 +111,15 @@
         spin_unlock_irqrestore(&cache_lock, flags);
 }
 
-int cache_find(int id, char *name)
+struct object *cache_find(int id)
 {
         struct object *obj;
-        int ret = -ENOENT;
         unsigned long flags;
 
         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
-        if (obj) {
-                ret = 0;
-                strcpy(name, obj->name);
-        }
+        if (obj)
+                __object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
-        return ret;
+        return obj;
 }
</programlisting>

<para>
We encapsulate the reference counting in the standard 'get' and 'put'
functions.  Now we can return the object itself from
<function>cache_find</function> which has the advantage that the user
can now sleep holding the object (eg. to
<function>copy_to_user</function> the name to userspace).
</para>
<para>
The other point to note is that I said a reference should be held for
every pointer to the object: thus the reference count is 1 when first
inserted into the cache.  In some versions the framework does not hold
a reference count, but they are more complicated.
</para>

   <sect2 id="examples-refcnt-atomic">
    <title>Using Atomic Operations For The Reference Count</title>
<para>
In practice, <type>atomic_t</type> would usually be used for
<structfield>refcnt</structfield>.  There are a number of atomic
operations defined in

<filename class="headerfile">include/asm/atomic.h</filename>: these are
guaranteed to be seen atomically from all CPUs in the system, so no
lock is required.  In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer.  The
<function>atomic_inc</function> and
<function>atomic_dec_and_test</function> are used instead of the
standard increment and decrement operators, and the lock is no longer
used to protect the reference count itself.
</para>

<programlisting>
--- cache.c.refcnt      2003-12-09 15:00:35.000000000 +1100
+++ cache.c.refcnt-atomic       2003-12-11 15:49:42.000000000 +1100
@@ -7,7 +7,7 @@
 struct object
 {
         struct list_head list;
-        unsigned int refcnt;
+        atomic_t refcnt;
         int id;
         char name[32];
         int popularity;
@@ -18,33 +18,15 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10

-static void __object_put(struct object *obj)
-{
-        if (--obj->refcnt == 0)
-                kfree(obj);
-}
-
-static void __object_get(struct object *obj)
-{
-        obj->refcnt++;
-}
-
 void object_put(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_put(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        if (atomic_dec_and_test(&obj->refcnt))
+                kfree(obj);
 }

 void object_get(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_get(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        atomic_inc(&obj->refcnt);
 }

 /* Must be holding cache_lock */
@@ -65,7 +47,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
-        __object_put(obj);
+        object_put(obj);
         cache_num--;
 }

@@ -94,7 +76,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
-        obj->refcnt = 1; /* The cache holds a reference */
+        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */

         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -119,7 +101,7 @@
         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj)
-                __object_get(obj);
+                object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
         return obj;
 }
</programlisting>
</sect2>
</sect1>

   <sect1
id="examples-lock-per-obj">
    <title>Protecting The Objects Themselves</title>
    <para>
In these examples, we assumed that the objects (except the reference
counts) never changed once they were created.  If we wanted to allow
the name to change, there are three possibilities:
    </para>
    <itemizedlist>
      <listitem>
        <para>
You can make <symbol>cache_lock</symbol> non-static, and tell people
to grab that lock before changing the name in any object.
        </para>
      </listitem>
      <listitem>
        <para>
You can provide a <function>cache_obj_rename</function> which grabs
this lock and changes the name for the caller, and tell everyone to
use that function.
        </para>
      </listitem>
      <listitem>
        <para>
You can make the <symbol>cache_lock</symbol> protect only the cache
itself, and use another lock to protect the name.
        </para>
      </listitem>
    </itemizedlist>

    <para>
Theoretically, you can make the locks as fine-grained as one lock for
every field, for every object.  In practice, the most common variants
are:
</para>
    <itemizedlist>
      <listitem>
        <para>
One lock which protects the infrastructure (the <symbol>cache</symbol>
list in this example) and all the objects.  This is what we have done
so far.
        </para>
      </listitem>
      <listitem>
        <para>
One lock which protects the infrastructure (including the list
pointers inside the objects), and one lock inside the object which
protects the rest of that object.
        </para>
      </listitem>
      <listitem>
        <para>
Multiple locks to protect the infrastructure (eg. one lock per hash
chain), possibly with a separate per-object lock.
        </para>
      </listitem>
    </itemizedlist>

<para>
Here is the "lock-per-object" implementation:
</para>
<programlisting>
--- cache.c.refcnt-atomic       2003-12-11 15:50:54.000000000 +1100
+++ cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
@@ -6,11 +6,17 @@

 struct object
 {
+        /* These two protected by cache_lock. */
         struct list_head list;
+        int popularity;
+
         atomic_t refcnt;
+
+        /* Doesn't change once created. */
         int id;
+
+        spinlock_t lock; /* Protects the name */
         char name[32];
-        int popularity;
 };

 static DEFINE_SPINLOCK(cache_lock);
@@ -77,6 +84,7 @@
         obj->id = id;
         obj->popularity = 0;
         atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
+        spin_lock_init(&obj->lock);

         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
</programlisting>

<para>
Note that I decided that the <structfield>popularity</structfield>
count should be protected by the <symbol>cache_lock</symbol> rather
than the per-object lock: this is because it (like the
<structname>struct list_head</structname> inside the object) is
logically part of the infrastructure.  This way, I don't need to grab
the lock of every object in <function>__cache_add</function> when
seeking the least popular.
</para>

<para>
I also decided that the <structfield>id</structfield> member is
unchangeable, so I don't need to grab each object lock in
<function>__cache_find()</function> to examine the
<structfield>id</structfield>: the object lock is only used by a
caller who wants to read or write the <structfield>name</structfield>
field.
</para>

<para>
Note also that I added a comment describing what data was protected by
which locks.
This is extremely important, as it describes the runtime
behavior of the code, and can be hard to work out just from reading
it.  And as Alan Cox says, <quote>Lock data, not code</quote>.
</para>
</sect1>
</chapter>

   <chapter id="common-problems">
    <title>Common Problems</title>
    <sect1 id="deadlock">
    <title>Deadlock: Simple and Advanced</title>

    <para>
      There is a coding bug where a piece of code tries to grab a
      spinlock twice: it will spin forever, waiting for the lock to
      be released (spinlocks, rwlocks and mutexes are not
      recursive in Linux).  This is trivial to diagnose: not a
      stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
      problem.
    </para>

    <para>
      For a slightly more complex case, imagine you have a region
      shared by a softirq and user context.  If you use a
      <function>spin_lock()</function> call to protect it, it is
      possible that the user context will be interrupted by the softirq
      while it holds the lock, and the softirq will then spin
      forever trying to get the same lock.
    </para>

    <para>
      Both of these are called deadlock, and as shown above, they can
      occur even with a single CPU (although not on UP compiles,
      since spinlocks vanish on kernel compiles with
      <symbol>CONFIG_SMP</symbol>=n; you'll still get data corruption
      in the second example).
    </para>

    <para>
      This complete lockup is easy to diagnose: on SMP boxes the
      watchdog timer or compiling with <symbol>DEBUG_SPINLOCK</symbol> set
      (<filename>include/linux/spinlock.h</filename>) will show this up
      immediately when it happens.
    </para>

    <para>
      A more complex problem is the so-called 'deadly embrace',
      involving two or more locks.  Say you have a hash table: each
      entry in the table is a spinlock, and a chain of hashed
      objects.
      Inside a softirq handler, you sometimes want to
      move an object from one place in the hash to another: you
      grab the spinlock of the old hash chain and the spinlock of
      the new hash chain, and delete the object from the old one,
      and insert it in the new one.
    </para>

    <para>
      There are two problems here.  First, if your code ever
      tries to move the object to the same chain, it will deadlock
      with itself as it tries to lock it twice.  Secondly, if the
      same softirq on another CPU is trying to move another object
      in the reverse direction, the following could happen:
    </para>

    <table>
     <title>Consequences</title>

     <tgroup cols="2" align="left">

      <thead>
       <row>
        <entry>CPU 1</entry>
        <entry>CPU 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>Grab lock A -> OK</entry>
        <entry>Grab lock B -> OK</entry>
       </row>
       <row>
        <entry>Grab lock B -> spin</entry>
        <entry>Grab lock A -> spin</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <para>
      The two CPUs will spin forever, each waiting for the other to
      give up its lock.  It will look, smell, and feel like a crash.
    </para>
    </sect1>

    <sect1 id="techs-deadlock-prevent">
     <title>Preventing Deadlock</title>

     <para>
       Textbooks will tell you that if you always lock in the same
       order, you will never get this kind of deadlock.  Practice
       will tell you that this approach doesn't scale: when I
       create a new lock, I don't understand enough of the kernel
       to figure out where in the 5000-lock hierarchy it will fit.
     </para>

     <para>
       The best locks are encapsulated: they never get exposed in
       headers, and are never held around calls to non-trivial
       functions outside the same file.
       You can read through this
       code and see that it will never deadlock, because it never
       tries to grab another lock while it has that one.  People
       using your code don't even need to know you are using a
       lock.
     </para>

     <para>
       A classic problem here is when you provide callbacks or
       hooks: if you call these with the lock held, you risk simple
       deadlock, or a deadly embrace (who knows what the callback
       will do?).  Remember, the other programmers are out to get
       you, so don't do this.
     </para>

    <sect2 id="techs-deadlock-overprevent">
     <title>Overzealous Prevention Of Deadlocks</title>

     <para>
       Deadlocks are problematic, but not as bad as data
       corruption.  Code which grabs a read lock, searches a list,
       fails to find what it wants, drops the read lock, grabs a
       write lock and inserts the object has a race condition.
     </para>

     <para>
       If you don't see why, please stay the fuck away from my code.
     </para>
    </sect2>
    </sect1>

   <sect1 id="racing-timers">
    <title>Racing Timers: A Kernel Pastime</title>

    <para>
      Timers can produce their own special problems with races.
      Consider a collection of objects (list, hash, etc) where each
      object has a timer which is due to destroy it.
    </para>

    <para>
      If you want to destroy the entire collection (say on module
      removal), you might do the following:
    </para>

    <programlisting>
        /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
           HUNGARIAN NOTATION */
        spin_lock_bh(&list_lock);

        while (list) {
                struct foo *next = list->next;
                del_timer(&list->timer);
                kfree(list);
                list = next;
        }

        spin_unlock_bh(&list_lock);
    </programlisting>

    <para>
      Sooner or later, this will crash on SMP, because a timer can
      have just gone off before the <function>spin_lock_bh()</function>,
      and it will only get the lock after we
      <function>spin_unlock_bh()</function>, and then try to free
      the element (which has already been freed!).
    </para>

    <para>
      This can be avoided by checking the result of
      <function>del_timer()</function>: if it returns
      <returnvalue>1</returnvalue>, the timer has been deleted.
      If <returnvalue>0</returnvalue>, it means (in this
      case) that it is currently running, so we can do:
    </para>

    <programlisting>
        retry:
                spin_lock_bh(&list_lock);

                while (list) {
                        struct foo *next = list->next;
                        if (!del_timer(&list->timer)) {
                                /* Give timer a chance to delete this */
                                spin_unlock_bh(&list_lock);
                                goto retry;
                        }
                        kfree(list);
                        list = next;
                }

                spin_unlock_bh(&list_lock);
    </programlisting>

    <para>
      Another common problem is deleting timers which restart
      themselves (by calling <function>add_timer()</function> at the end
      of their timer function).  Because this is a fairly common case
      which is prone to races, you should use <function>del_timer_sync()</function>
      (<filename class="headerfile">include/linux/timer.h</filename>)
      to handle this case.
      It waits for any running instance of the timer function to
      finish, and its return value tells you whether it deactivated
      a pending timer.
    </para>
   </sect1>

  </chapter>

 <chapter id="Efficiency">
    <title>Locking Speed</title>

    <para>
There are three main things to worry about when considering speed of
some code which does locking.  First is concurrency: how many things
are going to be waiting while someone else is holding a lock.  Second
is the time taken to actually acquire and release an uncontended lock.
Third is using fewer, or smarter locks.  I'm assuming that the lock is
used fairly often: otherwise, you wouldn't be concerned about
efficiency.
</para>
    <para>
Concurrency depends on how long the lock is usually held: you should
hold the lock for as long as needed, but no longer.  In the cache
example, we always create the object without the lock held, and then
grab the lock only when we are ready to insert it in the list.
</para>
    <para>
Acquisition times depend on how much damage the lock operations do to
the pipeline (pipeline stalls) and how likely it is that this CPU was
the last one to grab the lock (ie. is the lock cache-hot for this
CPU): on a machine with more CPUs, this likelihood drops fast.
Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns,
an atomic increment takes about 58ns, a lock which is cache-hot on
this CPU takes 160ns, and a cacheline transfer from another CPU takes
an additional 170 to 360ns.  (These figures are from Paul McKenney's
<ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux
Journal RCU article</ulink>).
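
To get a feel for these magnitudes, the quoted figures can be converted
into "ordinary instructions' worth" of time.  The following is a small
sketch that only restates the numbers above; the macro and function
names (NS_PER_INSTRUCTION, instr_equiv, contended_lock_ns) are
hypothetical and chosen for the illustration.

```c
#include <assert.h>

/* Figures quoted in the text for a 700MHz Pentium III, in nanoseconds. */
#define NS_PER_INSTRUCTION  0.7
#define NS_ATOMIC_INC       58.0
#define NS_LOCK_CACHE_HOT   160.0
#define NS_CACHELINE_MIN    170.0
#define NS_CACHELINE_MAX    360.0

/* Convert a cost in nanoseconds into a number of ordinary instructions. */
double instr_equiv(double ns)
{
    assert(ns >= 0.0);
    return ns / NS_PER_INSTRUCTION;
}

/* Cost of a lock whose cacheline must first be fetched from another CPU. */
double contended_lock_ns(double transfer_ns)
{
    return NS_LOCK_CACHE_HOT + transfer_ns;
}
```

With these figures, instr_equiv(NS_ATOMIC_INC) is roughly 83
instructions, instr_equiv(NS_LOCK_CACHE_HOT) roughly 229, and a lock
that also needs a cacheline transfer costs somewhere between about 471
and 743 instructions' worth of time.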
</para>
    <para>
These two aims conflict: holding a lock for a short time might be done
by splitting locks into parts (such as in our final per-object-lock
example), but this increases the number of lock acquisitions, and the
results are often slower than having a single lock.  This is another
reason to advocate locking simplicity.
</para>
    <para>
The third concern is addressed below: there are some methods to reduce
the amount of locking which needs to be done.
</para>

  <sect1 id="efficiency-rwlocks">
   <title>Read/Write Lock Variants</title>

   <para>
      Both spinlocks and mutexes have read/write variants:
      <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
      These divide users into two classes: the readers and the writers.  If
      you are only reading the data, you can get a read lock, but to write to
      the data you need the write lock.  Many people can hold a read lock,
      but a writer must be sole holder.
    </para>

   <para>
      If your code divides neatly along reader/writer lines (as our
      cache code does), and the lock is held by readers for
      significant lengths of time, using these locks can help.  They
      are slightly slower than the normal locks though, so in practice
      <type>rwlock_t</type> is not usually worthwhile.
    </para>
   </sect1>

   <sect1 id="efficiency-read-copy-update">
    <title>Avoiding Locks: Read Copy Update</title>

    <para>
      There is a special method of read/write locking called Read Copy
      Update.  Using RCU, the readers can avoid taking a lock
      altogether: as we expect our cache to be read more often than
      updated (otherwise the cache is a waste of time), it is a
      candidate for this optimization.
    </para>

    <para>
      How do we get rid of read locks?  Getting rid of read locks
      means that writers may be changing the list underneath the
      readers.
      That is actually quite simple: we can read a linked
      list while an element is being added if the writer adds the
      element very carefully.  For example, adding
      <symbol>new</symbol> to a singly linked list called
      <symbol>list</symbol>:
    </para>

    <programlisting>
        new->next = list->next;
        wmb();
        list->next = new;
    </programlisting>

    <para>
      The <function>wmb()</function> is a write memory barrier.  It
      ensures that the first operation (setting the new element's
      <symbol>next</symbol> pointer) is complete and will be seen by
      all CPUs before the second operation (putting the new
      element into the list) is.  This is important, since modern
      compilers and modern CPUs can both reorder instructions unless
      told otherwise: we want a reader to either not see the new
      element at all, or see the new element with the
      <symbol>next</symbol> pointer correctly pointing at the rest of
      the list.
    </para>
    <para>
      Fortunately, there is a function to do this for standard
      <structname>struct list_head</structname> lists:
      <function>list_add_rcu()</function>
      (<filename>include/linux/rculist.h</filename>).
    </para>
    <para>
      Removing an element from the list is even simpler: we replace
      the pointer to the old element with a pointer to its successor,
      and readers will either see it, or skip over it.
    </para>
    <programlisting>
        list->next = old->next;
    </programlisting>
    <para>
      There is <function>list_del_rcu()</function>
      (<filename>include/linux/rculist.h</filename>) which does this (the
      normal version poisons the old object, which we don't want).
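
The careful insertion and removal described above can be mimicked in
userspace with C11 atomics.  This is a sketch under stated assumptions,
not the kernel's implementation: a release store stands in for
<function>wmb()</function> followed by the pointer assignment, acquire
loads stand in for whatever the kernel's RCU list helpers do on the
reader side, and the names struct node, publish_after, unlink_after,
sum_list and rcu_style_demo are hypothetical.  Writers are assumed to be
serialised by some lock outside the sketch, and the sketch never frees
an unlinked node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical singly linked node; next is atomic so that readers may
 * run concurrently with a (single) writer. */
struct node {
    int value;
    struct node *_Atomic next;
};

/*
 * Writer: insert new after pos.  The new node's next pointer is filled
 * in first; the release store then publishes the node, so a reader sees
 * either the old list or the new node with a valid next pointer.
 */
void publish_after(struct node *pos, struct node *new)
{
    struct node *succ = atomic_load_explicit(&pos->next, memory_order_relaxed);

    atomic_store_explicit(&new->next, succ, memory_order_relaxed);
    atomic_store_explicit(&pos->next, new, memory_order_release);
}

/*
 * Writer: unlink the node after pos and return it (NULL if none).
 * The node is not freed here: a reader may still be standing on it.
 */
struct node *unlink_after(struct node *pos)
{
    struct node *old = atomic_load_explicit(&pos->next, memory_order_relaxed);

    if (old) {
        struct node *succ =
            atomic_load_explicit(&old->next, memory_order_relaxed);
        atomic_store_explicit(&pos->next, succ, memory_order_release);
    }
    return old;
}

/* Reader: lock-free traversal; acquire loads pair with release stores. */
int sum_list(struct node *head)
{
    int sum = 0;
    struct node *n = atomic_load_explicit(&head->next, memory_order_acquire);

    while (n) {
        sum += n->value;
        n = atomic_load_explicit(&n->next, memory_order_acquire);
    }
    return sum;
}

/* Single-threaded walk-through: returns 10 * sum_before + sum_after. */
int rcu_style_demo(void)
{
    struct node head, a, b;
    int before, after;

    head.value = 0;
    a.value = 1;
    b.value = 2;
    atomic_init(&head.next, NULL);
    atomic_init(&a.next, NULL);
    atomic_init(&b.next, NULL);

    publish_after(&head, &a);   /* head -> a      */
    publish_after(&head, &b);   /* head -> b -> a */
    before = sum_list(&head);   /* 2 + 1          */

    unlink_after(&head);        /* head -> a      */
    after = sum_list(&head);    /* 1              */

    return 10 * before + after;
}
```

The walk-through runs in one thread only to keep it deterministic; the
ordering of the two stores in publish_after() is what would matter if
sum_list() were running on another CPU at the same time.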
    </para>
    <para>
      The reader must also be careful: some CPUs can look through the
      <symbol>next</symbol> pointer to start reading the contents of
      the next element early, but don't realize that the pre-fetched
      contents are wrong when the <symbol>next</symbol> pointer changes
      underneath them.  Once again, there is a
      <function>list_for_each_entry_rcu()</function>
      (<filename>include/linux/rculist.h</filename>) to help you.  Of
      course, writers can just use
      <function>list_for_each_entry()</function>, since there cannot
      be two simultaneous writers.
    </para>
    <para>
      Our final dilemma is this: when can we actually destroy the
      removed element?  Remember, a reader might be stepping through
      this element in the list right now: if we free this element and
      the <symbol>next</symbol> pointer changes, the reader will jump
      off into garbage and crash.  We need to wait until we know that
      all the readers who were traversing the list when we deleted the
      element are finished.  We use <function>call_rcu()</function> to
      register a callback which will actually destroy the object once
      all pre-existing readers are finished.  Alternatively,
      <function>synchronize_rcu()</function> may be used to block until
      all pre-existing readers are finished.
    </para>
    <para>
      But how does Read Copy Update know when the readers are
      finished?  The method is this: firstly, the readers always
      traverse the list inside
      <function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function>
      pairs: these simply disable preemption so the reader won't go to
      sleep while reading the list.
    </para>
    <para>
      RCU then waits until every other CPU has slept at least once:
      since readers cannot sleep, we know that any readers which were
      traversing the list during the deletion are finished, and the
      callback is triggered.
      The real Read Copy Update code is a
      little more optimized than this, but this is the fundamental
      idea.
    </para>

<programlisting>
--- cache.c.perobjectlock       2003-12-11 17:15:03.000000000 +1100
+++ cache.c.rcupdate    2003-12-11 17:55:14.000000000 +1100
@@ -1,15 +1,18 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/rcupdate.h>
 #include <linux/mutex.h>
 #include <asm/errno.h>

 struct object
 {
-        /* These two protected by cache_lock. */
+        /* This is protected by RCU */
         struct list_head list;
         int popularity;

+        struct rcu_head rcu;
+
         atomic_t refcnt;

         /* Doesn't change once created. */
@@ -40,7 +43,7 @@
 {
         struct object *i;

-        list_for_each_entry(i, &cache, list) {
+        list_for_each_entry_rcu(i, &cache, list) {
                 if (i->id == id) {
                         i->popularity++;
                         return i;
@@ -49,19 +52,25 @@
         return NULL;
 }

+/* Final discard done once we know no readers are looking.
+ */
+static void cache_delete_rcu(struct rcu_head *head)
+{
+        object_put(container_of(head, struct object, rcu));
+}
+
 /* Must be holding cache_lock */
 static void __cache_delete(struct object *obj)
 {
         BUG_ON(!obj);
-        list_del(&obj->list);
-        object_put(obj);
+        list_del_rcu(&obj->list);
         cache_num--;
+        call_rcu(&obj->rcu, cache_delete_rcu);
 }

 /* Must be holding cache_lock */
 static void __cache_add(struct object *obj)
 {
-        list_add(&obj->list, &cache);
+        list_add_rcu(&obj->list, &cache);
         if (++cache_num > MAX_CACHE_SIZE) {
                 struct object *i, *outcast = NULL;
                 list_for_each_entry(i, &cache, list) {
@@ -104,12 +114,11 @@
 struct object *cache_find(int id)
 {
         struct object *obj;
-        unsigned long flags;

-        spin_lock_irqsave(&cache_lock, flags);
+        rcu_read_lock();
         obj = __cache_find(id);
         if (obj)
                 object_get(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        rcu_read_unlock();
         return obj;
 }
</programlisting>

<para>
Note that the reader will alter the
<structfield>popularity</structfield> member in
<function>__cache_find()</function>, and now it doesn't hold a lock.
One solution would be to make it an <type>atomic_t</type>, but for
this usage, we don't really care about races: an approximate result is
good enough, so I didn't change it.
</para>

<para>
The result is that <function>cache_find()</function> requires no
synchronization with any other functions, so is almost as fast on SMP
as it would be on UP.
</para>

<para>
There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object?  This is still possible: if
you hold the lock, no one can delete the object, so you don't need to
get and put the reference count.
</para>

<para>
Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
<function>cache_find()</function> and
<function>object_put()</function> does not need to actually get and
put the reference count: we could expose
<function>__cache_find()</function> by making it non-static, and
such callers could simply call that.
</para>
<para>
The benefit here is that the reference count is not written to: the
object is not altered in any way, which is much faster on SMP
machines due to caching.
</para>
  </sect1>

   <sect1 id="per-cpu">
    <title>Per-CPU Data</title>

    <para>
      Another technique for avoiding locking which is used fairly
      widely is to duplicate information for each CPU.  For example,
      if you wanted to keep a count of a common condition, you could
      use a spin lock and a single counter.  Nice and simple.
    </para>

    <para>
      If that was too slow (it's usually not, but if you've got a
      really big machine to test on and can show that it is), you
      could instead use a counter for each CPU, then none of them need
      an exclusive lock.  See <function>DEFINE_PER_CPU()</function>,
      <function>get_cpu_var()</function> and
      <function>put_cpu_var()</function>
      (<filename class="headerfile">include/linux/percpu.h</filename>).
    </para>

    <para>
      Of particular use for simple per-cpu counters is the
      <type>local_t</type> type, and the
      <function>cpu_local_inc()</function> and related functions,
      which are more efficient than simple code on some architectures
      (<filename class="headerfile">include/asm/local.h</filename>).
    </para>

    <para>
      Note that there is no simple, reliable way of getting an exact
      value of such a counter, without introducing more locks.  This
      is not a problem for some uses.
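
The idea of one counter per CPU can be sketched in userspace as an array
of counters, one per slot, where each slot is meant to be written by
only one thread or CPU.  This is a hedged illustration, not the kernel's
per-CPU machinery: NR_SLOTS, hits, hit_inc, hit_sum and percpu_demo are
hypothetical names, the 64-byte cache line size is an assumption, and
relaxed C11 atomics are used only so that a concurrent reader is well
defined in userspace C.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS 4  /* hypothetical stand-in for the number of CPUs */

/* One counter per slot, aligned so that two slots never share a cache
 * line (64 bytes is an assumption about the machine). */
struct slot_counter {
    _Alignas(64) atomic_long count;
};

struct slot_counter hits[NR_SLOTS];

/* Fast path: touch only the caller's own slot, no lock taken. */
void hit_inc(unsigned int slot)
{
    assert(slot < NR_SLOTS);
    atomic_fetch_add_explicit(&hits[slot].count, 1, memory_order_relaxed);
}

/*
 * Slow path: add up every slot.  If other threads are still counting,
 * the total is only approximate, as the text warns.
 */
long hit_sum(void)
{
    long total = 0;

    for (unsigned int i = 0; i < NR_SLOTS; i++)
        total += atomic_load_explicit(&hits[i].count, memory_order_relaxed);
    return total;
}

/* Spread 1000 increments over the slots; returns how many were counted. */
long percpu_demo(void)
{
    long before = hit_sum();

    for (unsigned int i = 0; i < 1000; i++)
        hit_inc(i % NR_SLOTS);
    return hit_sum() - before;
}
```

The demo passes the slot number explicitly instead of spawning threads,
to stay deterministic.  The design point is that the increment path
writes only to the caller's own cache line, while the cost of combining
the slots is paid by the (rare) reader.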
    </para>
   </sect1>

   <sect1 id="mostly-hardirq">
    <title>Data Which Is Mostly Used By An IRQ Handler</title>

    <para>
      If data is always accessed from within the same IRQ handler, you
      don't need a lock at all: the kernel already guarantees that the
      irq handler will not run simultaneously on multiple CPUs.
    </para>
    <para>
      Manfred Spraul points out that you can still do this, even if
      the data is very occasionally accessed in user context or
      softirqs/tasklets.  The irq handler doesn't use a lock, and
      all other accesses are done like so:
    </para>

<programlisting>
        spin_lock(&lock);
        disable_irq(irq);
        ...
        enable_irq(irq);
        spin_unlock(&lock);
</programlisting>
    <para>
      The <function>disable_irq()</function> prevents the irq handler
      from running (and waits for it to finish if it's currently
      running on other CPUs).  The spinlock prevents any other
      accesses happening at the same time.  Naturally, this is slower
      than just a <function>spin_lock_irq()</function> call, so it
      only makes sense if this type of access happens extremely
      rarely.
    </para>
   </sect1>
  </chapter>

 <chapter id="sleeping-things">
    <title>What Functions Are Safe To Call From Interrupts?</title>

    <para>
      Many functions in the kernel sleep (ie. call schedule())
      directly or indirectly: you can never call them while holding a
      spinlock, or with preemption disabled.  This also means you need
      to be in user context: calling them from an interrupt is illegal.
    </para>

   <sect1 id="sleeping">
    <title>Some Functions Which Sleep</title>

    <para>
      The most common ones are listed below, but you usually have to
      read the code to find out if other calls are safe.  If everyone
      else who calls it can sleep, you probably need to be able to
      sleep, too.
      In particular, registration and deregistration
      functions usually expect to be called from user context, and can
      sleep.
    </para>

    <itemizedlist>
     <listitem>
      <para>
        Accesses to
        <firstterm linkend="gloss-userspace">userspace</firstterm>:
      </para>
      <itemizedlist>
       <listitem>
        <para>
          <function>copy_from_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>copy_to_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>get_user()</function>
        </para>
       </listitem>
       <listitem>
        <para>
          <function>put_user()</function>
        </para>
       </listitem>
      </itemizedlist>
     </listitem>

     <listitem>
      <para>
        <function>kmalloc(GFP_KERNEL)</function>
      </para>
     </listitem>

     <listitem>
      <para>
      <function>mutex_lock_interruptible()</function> and
      <function>mutex_lock()</function>
      </para>
      <para>
       There is a <function>mutex_trylock()</function> which does not
       sleep.  Still, it must not be used inside interrupt context since
       its implementation is not safe for that.
       <function>mutex_unlock()</function> will also never sleep.
       It cannot be used in interrupt context either since a mutex
       must be released by the same task that acquired it.
      </para>
     </listitem>
    </itemizedlist>
   </sect1>

   <sect1 id="dont-sleep">
    <title>Some Functions Which Don't Sleep</title>

    <para>
     Some functions are safe to call from any context, or holding
     almost any lock.
    </para>

    <itemizedlist>
     <listitem>
      <para>
        <function>printk()</function>
      </para>
     </listitem>
     <listitem>
      <para>
        <function>kfree()</function>
      </para>
     </listitem>
     <listitem>
      <para>
        <function>add_timer()</function> and <function>del_timer()</function>
      </para>
     </listitem>
    </itemizedlist>
   </sect1>
  </chapter>

  <chapter id="apiref-mutex">
   <title>Mutex API reference</title>
<!-- include/linux/mutex.h -->
<refentry id="API-mutex-init">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_init</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_init</refname>
 <refpurpose>
  initialize the mutex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef> <function>mutex_init </function></funcdef>
   <paramdef> <parameter>mutex</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>mutex</parameter></term>
   <listitem>
    <para>
     the mutex to be initialized
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Initialize the mutex to unlocked state.
   </para><para>

   It is not allowed to initialize an already locked mutex.
</para>
</refsect1>
</refentry>

<refentry id="API-mutex-is-locked">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_is_locked</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_is_locked</refname>
 <refpurpose>
  is the mutex locked
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>mutex_is_locked </function></funcdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be queried
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Returns 1 if the mutex is locked, 0 if unlocked.
</para>
</refsect1>
</refentry>

<!-- kernel/locking/mutex.c -->
<refentry id="API-mutex-lock">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_lock</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_lock</refname>
 <refpurpose>
  acquire the mutex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void __sched <function>mutex_lock </function></funcdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be acquired
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Lock the mutex exclusively for this task. If the mutex is not
   available right now, it will sleep until it can get it.
   </para><para>

   The mutex must later on be released by the same task that
   acquired it. Recursive locking is not allowed. The task
   may not exit without first unlocking the mutex. Also, kernel
   memory where the mutex resides must not be freed with
   the mutex still locked. The mutex must first be initialized
   (or statically defined) before it can be locked. <function>memset</function>-ing
   the mutex to 0 is not allowed.
   </para><para>

   ( The CONFIG_DEBUG_MUTEXES .config option turns on debugging
   checks that will enforce the restrictions and will also do
   deadlock debugging.
   )
   </para><para>

   This function is similar to (but not equivalent to) <function>down</function>.
</para>
</refsect1>
</refentry>

<refentry id="API-mutex-unlock">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_unlock</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_unlock</refname>
 <refpurpose>
  release the mutex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void __sched <function>mutex_unlock </function></funcdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be released
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Unlock a mutex that has been locked by this task previously.
   </para><para>

   This function must not be used in interrupt context. Unlocking
   a mutex that is not locked is not allowed.
   </para><para>

   This function is similar to (but not equivalent to) <function>up</function>.
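   </para><para>

   The lock/unlock pairing described for <function>mutex_lock</function> and
   <function>mutex_unlock</function> can be illustrated outside the kernel. The
   following is a minimal userspace sketch, not kernel code: it assumes a POSIX
   <type>pthread_mutex_t</type> as a stand-in for <structname>struct mutex</structname>,
   and the names <structname>struct counter</structname>, <function>counter_init</function>
   and <function>counter_add</function> are hypothetical.
   </para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state; the pthread mutex stands in for struct mutex. */
struct counter {
	pthread_mutex_t lock;
	long value;
};

/* The lock must be initialized before first use (compare mutex_init()). */
static int counter_init(struct counter *c)
{
	assert(c != NULL);
	c->value = 0;
	return pthread_mutex_init(&c->lock, NULL);
}

/*
 * Acquire, modify and release within one function, so the thread that
 * locked is always the thread that unlocks, and every exit path unlocks.
 */
static long counter_add(struct counter *c, long delta)
{
	long result;

	assert(c != NULL);
	pthread_mutex_lock(&c->lock);
	c->value += delta;
	result = c->value;
	pthread_mutex_unlock(&c->lock);
	return result;
}
```
]]></programlisting>
<para>
   Keeping the acquire and release in the same function is one simple way to
   respect the rule that the locking task must also be the unlocking task.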
</para>
</refsect1>
</refentry>

<refentry id="API-ww-mutex-unlock">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>ww_mutex_unlock</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>ww_mutex_unlock</refname>
 <refpurpose>
  release the w/w mutex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void __sched <function>ww_mutex_unlock </function></funcdef>
   <paramdef>struct ww_mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be released
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Unlock a mutex that has been locked by this task previously with any of the
   ww_mutex_lock* functions (with or without an acquire context). It is
   forbidden to release the locks after releasing the acquire context.
   </para><para>

   This function must not be used in interrupt context. Unlocking
   an unlocked mutex is not allowed.
</para>
</refsect1>
</refentry>

<refentry id="API-mutex-lock-interruptible">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_lock_interruptible</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_lock_interruptible</refname>
 <refpurpose>
  acquire the mutex, interruptible
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int __sched <function>mutex_lock_interruptible </function></funcdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be acquired
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Lock the mutex like <function>mutex_lock</function>: sleep until the mutex
   becomes available, and return 0 once it has been acquired. If a
   signal arrives while waiting for the lock, this function
   returns -EINTR.
   </para><para>

   This function is similar to (but not equivalent to) <function>down_interruptible</function>.
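   </para><para>

   Because the call can fail, callers have to check the return value and must
   not unlock when the lock was never taken. The following is a minimal,
   self-contained sketch of that caller-side pattern. It does not use the
   kernel API: <structname>struct fake_lock</structname>,
   <function>fake_lock_interruptible</function>, <function>fake_unlock</function>
   and <function>set_value</function> are hypothetical stand-ins, and the
   <quote>signal</quote> is simulated by a flag.
   </para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lock with an injectable "a signal arrived" condition. */
struct fake_lock {
	int held;            /* 1 while the lock is owned */
	int signal_pending;  /* 1 simulates a signal arriving during the wait */
};

/* Stand-in for an interruptible lock: 0 on success, -EINTR if interrupted. */
static int fake_lock_interruptible(struct fake_lock *l)
{
	if (l->signal_pending)
		return -EINTR;   /* interrupted: the lock is NOT held */
	l->held = 1;
	return 0;
}

static void fake_unlock(struct fake_lock *l)
{
	assert(l->held);         /* unlocking an unlocked lock is a bug */
	l->held = 0;
}

/* Caller pattern: bail out on error, unlock only after a successful lock. */
static int set_value(struct fake_lock *l, int *value, int new_value)
{
	int ret = fake_lock_interruptible(l);

	if (ret)
		return ret;      /* propagate -EINTR; nothing to unlock */

	*value = new_value;
	fake_unlock(l);
	return 0;
}
```
]]></programlisting>
<para>
   In this sketch the protected value is left untouched on the error path,
   which is the property a caller usually wants when the wait is interrupted.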
</para>
</refsect1>
</refentry>

<refentry id="API-mutex-trylock">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>mutex_trylock</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>mutex_trylock</refname>
 <refpurpose>
  try to acquire the mutex, without waiting
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int __sched <function>mutex_trylock </function></funcdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to be acquired
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Try to acquire the mutex atomically. Returns 1 if the mutex
   has been acquired successfully, and 0 on contention.
</para>
</refsect1>
<refsect1>
<title>NOTE</title>
<para>
   This function follows the <function>spin_trylock</function> convention, so
   its return values are negated relative to those of <function>down_trylock</function>! Be careful
   about this when converting semaphore users to mutexes.
   </para><para>

   This function must not be used in interrupt context. The
   mutex must be released by the same task that acquired it.
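   </para><para>

   The return convention is easy to get backwards, so a small userspace
   illustration may help. This is a sketch under stated assumptions, not
   kernel code: it uses POSIX <function>pthread_mutex_trylock</function>
   (which returns 0 on success and EBUSY when the lock is taken) and wraps it
   in a hypothetical <function>try_lock</function> helper that reports 1 on
   success and 0 on contention, the same convention described above.
   <function>try_increment</function> is likewise a hypothetical caller.
   </para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical wrapper: returns 1 if the lock was taken and 0 if it was
 * busy, i.e. "nonzero means success".
 */
static int try_lock(pthread_mutex_t *m)
{
	int err = pthread_mutex_trylock(m);

	/* On a valid lock only "acquired" (0) or "busy" (EBUSY) are expected. */
	assert(err == 0 || err == EBUSY);
	return err == 0;
}

/*
 * Hypothetical caller: bump *counter only if the lock is free right now.
 * Returns 1 if the update happened, 0 if it was skipped due to contention.
 */
static int try_increment(pthread_mutex_t *m, long *counter)
{
	if (!try_lock(m))
		return 0;        /* contended: lock not held, so do not unlock */

	(*counter)++;
	pthread_mutex_unlock(m);
	return 1;
}
```
]]></programlisting>
<para>
   Note that the failure branch returns without unlocking, since the lock was
   never acquired on that path.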
</para>
</refsect1>
</refentry>

<refentry id="API-atomic-dec-and-mutex-lock">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>atomic_dec_and_mutex_lock</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>atomic_dec_and_mutex_lock</refname>
 <refpurpose>
  return holding mutex if we dec to 0
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>atomic_dec_and_mutex_lock </function></funcdef>
   <paramdef>atomic_t * <parameter>cnt</parameter></paramdef>
   <paramdef>struct mutex * <parameter>lock</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>cnt</parameter></term>
   <listitem>
    <para>
     the atomic counter to decrement
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>lock</parameter></term>
   <listitem>
    <para>
     the mutex to return holding if we dec to 0
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Returns true and holds the lock if the counter was decremented to 0;
   returns false otherwise.
</para>
</refsect1>
</refentry>

  </chapter>

  <chapter id="apiref-futex">
     <title>Futex API reference</title>
<!-- kernel/futex.c -->
<refentry id="API-struct-futex-q">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>struct futex_q</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>struct futex_q</refname>
 <refpurpose>
  The hashed futex queue entry, one per waiting task
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <programlisting>
struct futex_q {
  struct plist_node list;
  struct task_struct * task;
  spinlock_t * lock_ptr;
  union futex_key key;
  struct futex_pi_state * pi_state;
  struct rt_mutex_waiter * rt_waiter;
  union futex_key * requeue_pi_key;
  u32 bitset;
};  </programlisting>
</refsynopsisdiv>
 <refsect1>
  <title>Members</title>
  <variablelist>
    <varlistentry>      <term>list</term>
      <listitem><para>
priority-sorted list of tasks waiting on this futex
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>task</term>
      <listitem><para>
the task waiting on the futex
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>lock_ptr</term>
      <listitem><para>
the hash bucket lock
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>key</term>
      <listitem><para>
the key the futex is hashed on
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>pi_state</term>
      <listitem><para>
optional priority inheritance state
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>rt_waiter</term>
      <listitem><para>
rt_waiter storage for use with requeue_pi
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>requeue_pi_key</term>
      <listitem><para>
the requeue_pi target futex key
      </para></listitem>
    </varlistentry>
    <varlistentry>      <term>bitset</term>
      <listitem><para>
bitset for the optional bitmasked wakeup
      </para></listitem>
    </varlistentry>
  </variablelist>
 </refsect1>
<refsect1>
<title>Description</title>
<para>
   We use this hashed waitqueue, instead of a normal wait_queue_t, so
   we can wake only the relevant ones (hashed queues may be shared).
   </para><para>

   A futex_q has a woken state, just like tasks have TASK_RUNNING.
   It is considered woken when plist_node_empty(&amp;q-&gt;list) || q-&gt;lock_ptr == 0.
   The order of wakeup is always to make the first condition true, then
   the second.
   </para><para>

   PI futexes are typically woken before they are removed from the hash list via
   the rt_mutex code. See <function>unqueue_me_pi</function>.
</para>
</refsect1>
</refentry>

<refentry id="API-get-futex-key">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>get_futex_key</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>get_futex_key</refname>
 <refpurpose>
  Get parameters which are the keys for a futex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>get_futex_key </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
   <paramdef>int <parameter>fshared</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key</parameter></paramdef>
   <paramdef>int <parameter>rw</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     virtual address of the futex
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>fshared</parameter></term>
   <listitem>
    <para>
     0 for a
     PROCESS_PRIVATE futex, 1 for PROCESS_SHARED
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key</parameter></term>
   <listitem>
    <para>
     address where result is stored.
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>rw</parameter></term>
   <listitem>
    <para>
     mapping needs to be read/write (values: VERIFY_READ,
     VERIFY_WRITE)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   a negative error code or 0
   </para><para>

   The key words are stored in *key on success.
   </para><para>

   For shared mappings, it's (page-&gt;index, file_inode(vma-&gt;vm_file),
   offset_within_page). For private mappings, it's (uaddr, current-&gt;mm).
   We can usually work out the index without swapping in the page.
   </para><para>

   <function>lock_page</function> might sleep, so the caller must not hold a spinlock.
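   </para><para>

   The two key shapes described above can be modelled with a small toy
   structure. This is only an illustrative sketch: the layout below is
   hypothetical and is not the kernel's <structname>union futex_key</structname>,
   the identifiers (<structname>struct toy_futex_key</structname>,
   <function>toy_private_key</function>, <function>toy_shared_key</function>,
   <function>toy_keys_match</function>) are invented for the example, and a
   4096-byte page size is assumed.
   </para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Page size assumed for this sketch only. */
#define TOY_PAGE_SIZE 4096u

enum toy_key_kind { TOY_KEY_PRIVATE, TOY_KEY_SHARED };

/* Hypothetical key layout; NOT the kernel's union futex_key. */
struct toy_futex_key {
	enum toy_key_kind kind;
	uintptr_t owner;   /* private: address-space id; shared: inode id */
	uint64_t index;    /* private: user address; shared: page index in file */
	uint32_t offset;   /* shared: offset within the page; private: 0 */
};

/* Private mapping: the key is (uaddr, mm). */
static struct toy_futex_key toy_private_key(uintptr_t mm, uintptr_t uaddr)
{
	struct toy_futex_key k = { TOY_KEY_PRIVATE, mm, uaddr, 0 };
	return k;
}

/* Shared mapping: the key is (page index, inode, offset within page). */
static struct toy_futex_key toy_shared_key(uintptr_t inode, uint64_t pgindex,
					   uint32_t offset)
{
	struct toy_futex_key k = { TOY_KEY_SHARED, inode, pgindex, offset };

	assert(offset < TOY_PAGE_SIZE);
	return k;
}

/* Two waiters refer to the same futex only if every key word matches. */
static bool toy_keys_match(const struct toy_futex_key *a,
			   const struct toy_futex_key *b)
{
	return a->kind == b->kind && a->owner == b->owner &&
	       a->index == b->index && a->offset == b->offset;
}
```
]]></programlisting>
<para>
   In this model a shared key carries no virtual address, so two tasks that
   map the same file page at different addresses still produce matching keys,
   while private keys with the same address in different address spaces do not
   match.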
</para>
</refsect1>
</refentry>

<refentry id="API-fault-in-user-writeable">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>fault_in_user_writeable</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>fault_in_user_writeable</refname>
 <refpurpose>
  Fault in user address and verify RW access
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>fault_in_user_writeable </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     pointer to faulting user space address
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Slow path to fix up the fault we just took in the atomic write
   access to <parameter>uaddr</parameter>.
   </para><para>

   We have no generic implementation of a non-destructive write to the
   user address. We know that we faulted in the atomic pagefault
   disabled section, so we may as well avoid the #PF overhead by
   calling <function>get_user_pages</function> right away.
</para>
</refsect1>
</refentry>

<refentry id="API-futex-top-waiter">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_top_waiter</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_top_waiter</refname>
 <refpurpose>
  Return the highest priority waiter on a futex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>struct futex_q * <function>futex_top_waiter </function></funcdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     the hash bucket the futex_q's reside in
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key</parameter></term>
   <listitem>
    <para>
     the futex key (to distinguish it from other futex futex_q's)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Must be called with the hb lock held.
</para>
</refsect1>
</refentry>

<refentry id="API-futex-lock-pi-atomic">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_lock_pi_atomic</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_lock_pi_atomic</refname>
 <refpurpose>
  Atomic work required to acquire a pi aware futex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>futex_lock_pi_atomic </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key</parameter></paramdef>
   <paramdef>struct futex_pi_state ** <parameter>ps</parameter></paramdef>
   <paramdef>struct task_struct * <parameter>task</parameter></paramdef>
   <paramdef>int <parameter>set_waiters</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     the pi futex user address
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     the pi futex hash bucket
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key</parameter></term>
   <listitem>
    <para>
     the futex key associated with uaddr and hb
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>ps</parameter></term>
   <listitem>
    <para>
     the pi_state pointer where we store the
     result of the
     lookup
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>task</parameter></term>
   <listitem>
    <para>
     the task to perform the atomic lock work for.  This will
     be <quote>current</quote> except in the case of requeue pi.
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>set_waiters</parameter></term>
   <listitem>
    <para>
     force setting the FUTEX_WAITERS bit (1) or not (0)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   0 - ready to wait;
   1 - acquired the lock;
   &lt;0 - error
   </para><para>

   The hb-&gt;lock and futex_key refs shall be held by the caller.
</para>
</refsect1>
</refentry>

<refentry id="API---unqueue-futex">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>__unqueue_futex</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>__unqueue_futex</refname>
 <refpurpose>
  Remove the futex_q from its futex_hash_bucket
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void <function>__unqueue_futex </function></funcdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     The futex_q to unqueue
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   The q-&gt;lock_ptr must not be NULL and must be held by the caller.
</para>
</refsect1>
</refentry>

<refentry id="API-requeue-futex">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>requeue_futex</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>requeue_futex</refname>
 <refpurpose>
  Requeue a futex_q from one hb to another
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void <function>requeue_futex </function></funcdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb1</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb2</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key2</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     the futex_q to requeue
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb1</parameter></term>
   <listitem>
    <para>
     the source hash_bucket
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb2</parameter></term>
   <listitem>
    <para>
     the target hash_bucket
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key2</parameter></term>
   <listitem>
    <para>
     the new key for the requeued futex_q
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
</refentry>

<refentry
 id="API-requeue-pi-wake-futex">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>requeue_pi_wake_futex</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>requeue_pi_wake_futex</refname>
 <refpurpose>
  Wake a task that acquired the lock during requeue
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void <function>requeue_pi_wake_futex </function></funcdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     the futex_q
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key</parameter></term>
   <listitem>
    <para>
     the key of the requeue target futex
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     the hash_bucket of the requeue target futex
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   During futex_requeue, with requeue_pi=1, it is possible to acquire the
   target futex if it is uncontended or via a lock steal.
   Set the futex_q key
   to the requeue target futex so the waiter can detect the wakeup on the right
   futex, but remove it from the hb and NULL the rt_waiter so it can detect
   atomic lock acquisition.  Set the q-&gt;lock_ptr to the requeue target hb-&gt;lock
   to protect access to the pi_state so the owner can be fixed up later.  Must be called
   with both q-&gt;lock_ptr and hb-&gt;lock held.
</para>
</refsect1>
</refentry>

<refentry id="API-futex-proxy-trylock-atomic">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_proxy_trylock_atomic</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_proxy_trylock_atomic</refname>
 <refpurpose>
  Attempt an atomic lock for the top waiter
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>futex_proxy_trylock_atomic </function></funcdef>
   <paramdef>u32 __user * <parameter>pifutex</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb1</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb2</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key1</parameter></paramdef>
   <paramdef>union futex_key * <parameter>key2</parameter></paramdef>
   <paramdef>struct futex_pi_state ** <parameter>ps</parameter></paramdef>
   <paramdef>int <parameter>set_waiters</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>pifutex</parameter></term>
   <listitem>
    <para>
     the user address of the to futex
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb1</parameter></term>
   <listitem>
    <para>
     the from futex hash bucket, must be locked by the caller
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb2</parameter></term>
   <listitem>
    <para>
     the to futex hash bucket, must be locked by the caller
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key1</parameter></term>
   <listitem>
    <para>
     the from futex key
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key2</parameter></term>
   <listitem>
    <para>
     the to futex key
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>ps</parameter></term>
   <listitem>
    <para>
     address to store the pi_state pointer
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>set_waiters</parameter></term>
   <listitem>
    <para>
     force setting the FUTEX_WAITERS bit (1) or not (0)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Try to get the lock on behalf of the top waiter if we can do it atomically.
   Wake the top waiter if we succeed.  If the caller specified set_waiters,
   then direct <function>futex_lock_pi_atomic</function> to force setting the FUTEX_WAITERS bit.
   hb1 and hb2 must be held by the caller.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   0 - failed to acquire the lock atomically;
   &gt;0 - acquired the lock, return value is vpid of the top_waiter;
   &lt;0 - error
</para>
</refsect1>
</refentry>

<refentry id="API-futex-requeue">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_requeue</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_requeue</refname>
 <refpurpose>
  Requeue waiters from uaddr1 to uaddr2
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>futex_requeue </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr1</parameter></paramdef>
   <paramdef>unsigned int <parameter>flags</parameter></paramdef>
   <paramdef>u32 __user * <parameter>uaddr2</parameter></paramdef>
   <paramdef>int <parameter>nr_wake</parameter></paramdef>
   <paramdef>int <parameter>nr_requeue</parameter></paramdef>
   <paramdef>u32 * <parameter>cmpval</parameter></paramdef>
   <paramdef>int <parameter>requeue_pi</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr1</parameter></term>
   <listitem>
    <para>
     source futex user address
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>flags</parameter></term>
   <listitem>
    <para>
     futex flags (FLAGS_SHARED, etc.)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>uaddr2</parameter></term>
   <listitem>
    <para>
     target futex user address
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>nr_wake</parameter></term>
   <listitem>
    <para>
     number of waiters to wake (must be 1 for requeue_pi)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>nr_requeue</parameter></term>
   <listitem>
    <para>
     number of waiters to requeue (0-INT_MAX)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>cmpval</parameter></term>
   <listitem>
    <para>
     <parameter>uaddr1</parameter> expected value (or <constant>NULL</constant>)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>requeue_pi</parameter></term>
   <listitem>
    <para>
     if we are attempting to requeue from a non-pi futex to a
     pi futex (pi to pi requeue is not supported)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
   uaddr2 atomically on behalf of the top waiter.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   &gt;=0 - on success, the number of tasks requeued or woken;
   &lt;0 - on error
</para>
</refsect1>
</refentry>

<refentry id="API-queue-me">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>queue_me</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>queue_me</refname>
 <refpurpose>
  Enqueue the futex_q on the futex_hash_bucket
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void <function>queue_me </function></funcdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     The futex_q to enqueue
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     The destination hash bucket
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   The hb-&gt;lock must be held by the caller, and is released here. A call to
   <function>queue_me</function> is typically paired with exactly one call to <function>unqueue_me</function>.
   The exceptions involve the PI related operations, which may use <function>unqueue_me_pi</function>
   or nothing if the unqueue is done as part of the wake process and the unqueue
   state is implicit in the state of the woken task (see <function>futex_wait_requeue_pi</function> for
   an example).
</para>
</refsect1>
</refentry>

<refentry id="API-unqueue-me">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>unqueue_me</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>unqueue_me</refname>
 <refpurpose>
  Remove the futex_q from its futex_hash_bucket
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>unqueue_me </function></funcdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     The futex_q to unqueue
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   The q-&gt;lock_ptr must not be held by the caller. A call to <function>unqueue_me</function> must
   be paired with exactly one earlier call to <function>queue_me</function>.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   1 - if the futex_q was still queued (and we unqueued it);
   0 - if the futex_q was already removed by the waking thread
</para>
</refsect1>
</refentry>

<refentry id="API-fixup-owner">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>fixup_owner</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>fixup_owner</refname>
 <refpurpose>
  Post lock pi_state and corner case management
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>fixup_owner </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>int <parameter>locked</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     user address of the futex
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     futex_q (contains pi_state and access to the rt_mutex)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>locked</parameter></term>
   <listitem>
    <para>
     if the attempt to take the rt_mutex succeeded (1) or not (0)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   After attempting to lock an rt_mutex, this function is
   called to clean up
   the pi_state owner as well as handle race conditions that may allow us to
   acquire the lock. Must be called with the hb lock held.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   1 - success, lock taken;
   0 - success, lock not taken;
   &lt;0 - on error (-EFAULT)
</para>
</refsect1>
</refentry>

<refentry id="API-futex-wait-queue-me">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_wait_queue_me</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_wait_queue_me</refname>
 <refpurpose>
  <function>queue_me</function> and wait for wakeup, timeout, or signal
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>void <function>futex_wait_queue_me </function></funcdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>struct hrtimer_sleeper * <parameter>timeout</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     the futex hash bucket, must be locked by the caller
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     the futex_q to queue up on
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>timeout</parameter></term>
   <listitem>
    <para>
     the prepared hrtimer_sleeper, or NULL for no timeout
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
</refentry>

<refentry id="API-futex-wait-setup">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_wait_setup</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_wait_setup</refname>
 <refpurpose>
  Prepare to wait on a futex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>futex_wait_setup </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
   <paramdef>u32 <parameter>val</parameter></paramdef>
   <paramdef>unsigned int <parameter>flags</parameter></paramdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>struct futex_hash_bucket ** <parameter>hb</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     the futex userspace address
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>val</parameter></term>
   <listitem>
    <para>
     the expected value
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>flags</parameter></term>
   <listitem>
    <para>
     futex flags (FLAGS_SHARED, etc.)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     the associated futex_q
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     storage for hash_bucket pointer to be returned to caller
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Set up the futex_q and locate the hash_bucket. Get the futex value and
   compare it with the expected value. Handle atomic faults internally.
   Return with the hb lock held and a q.key reference on success, and unlocked
   with no q.key reference on failure.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   0 - uaddr contains val and hb has been locked;
   &lt;0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
</para>
</refsect1>
</refentry>

<refentry id="API-handle-early-requeue-pi-wakeup">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>handle_early_requeue_pi_wakeup</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>handle_early_requeue_pi_wakeup</refname>
 <refpurpose>
  Detect early wakeup on the initial futex
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>handle_early_requeue_pi_wakeup </function></funcdef>
   <paramdef>struct futex_hash_bucket * <parameter>hb</parameter></paramdef>
   <paramdef>struct futex_q * <parameter>q</parameter></paramdef>
   <paramdef>union futex_key *
   <parameter>key2</parameter></paramdef>
   <paramdef>struct hrtimer_sleeper * <parameter>timeout</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>hb</parameter></term>
   <listitem>
    <para>
     the hash_bucket futex_q was originally enqueued on
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>q</parameter></term>
   <listitem>
    <para>
     the futex_q woken while waiting to be requeued
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>key2</parameter></term>
   <listitem>
    <para>
     the futex_key of the requeue target futex
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>timeout</parameter></term>
   <listitem>
    <para>
     the timeout associated with the wait (NULL if none)
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   Detect if the task was woken on the initial futex as opposed to the requeue
   target futex. If so, determine if it was a timeout or a signal that caused
   the wakeup and return the appropriate error code to the caller. Must be
   called with the hb lock held.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   0 = no early wakeup detected;
   &lt;0 = -ETIMEDOUT or -ERESTARTNOINTR
</para>
</refsect1>
</refentry>

<refentry id="API-futex-wait-requeue-pi">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>futex_wait_requeue_pi</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>futex_wait_requeue_pi</refname>
 <refpurpose>
  Wait on uaddr and take uaddr2
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>int <function>futex_wait_requeue_pi </function></funcdef>
   <paramdef>u32 __user * <parameter>uaddr</parameter></paramdef>
   <paramdef>unsigned int <parameter>flags</parameter></paramdef>
   <paramdef>u32 <parameter>val</parameter></paramdef>
   <paramdef>ktime_t * <parameter>abs_time</parameter></paramdef>
   <paramdef>u32 <parameter>bitset</parameter></paramdef>
   <paramdef>u32 __user * <parameter>uaddr2</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>uaddr</parameter></term>
   <listitem>
    <para>
     the futex we initially wait on (non-pi)
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>flags</parameter></term>
   <listitem>
    <para>
     futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must be
     the same type, no requeueing from private to shared, etc.
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>val</parameter></term>
   <listitem>
    <para>
     the expected value of uaddr
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>abs_time</parameter></term>
   <listitem>
    <para>
     absolute timeout
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>bitset</parameter></term>
   <listitem>
    <para>
     32 bit wakeup bitset set by userspace, defaults to all
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>uaddr2</parameter></term>
   <listitem>
    <para>
     the pi futex we will take prior to returning to user-space
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
<refsect1>
<title>Description</title>
<para>
   The caller will wait on uaddr and will be requeued by <function>futex_requeue</function> to
   uaddr2 which must be PI aware and distinct from uaddr. Normal wakeup will wake
   on uaddr2 and complete the acquisition of the rt_mutex prior to returning to
   userspace. This ensures the rt_mutex maintains an owner when it has waiters;
   without one, the pi logic would not know which task to boost/deboost, if
   there was a need to.
   </para><para>

   We call schedule in <function>futex_wait_queue_me</function> when we enqueue and return there
   via one of the following:
   1) wakeup on uaddr2 after an atomic lock acquisition by <function>futex_requeue</function>
   2) wakeup on uaddr2 after a requeue
   3) signal
   4) timeout
   </para><para>

   If 3, clean up and return -ERESTARTNOINTR.
   </para><para>

   If 2, we may then block on trying to take the rt_mutex and return via:
   5) successful lock
   6) signal
   7) timeout
   8) other lock acquisition failure
   </para><para>

   If 6, return -EWOULDBLOCK (restarting the syscall would do the same).
   </para><para>

   If 4 or 7, we clean up and return with -ETIMEDOUT.
</para>
</refsect1>
<refsect1>
<title>Return</title>
<para>
   0 - On success;
   &lt;0 - On error
</para>
</refsect1>
</refentry>

<refentry id="API-sys-set-robust-list">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>sys_set_robust_list</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>sys_set_robust_list</refname>
 <refpurpose>
  Set the robust-futex list head of a task
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>long <function>sys_set_robust_list </function></funcdef>
   <paramdef>struct robust_list_head __user * <parameter>head</parameter></paramdef>
   <paramdef>size_t <parameter>len</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>head</parameter></term>
   <listitem>
    <para>
     pointer to the list-head
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>len</parameter></term>
   <listitem>
    <para>
     length of the list-head, as userspace expects
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
</refentry>

<refentry id="API-sys-get-robust-list">
<refentryinfo>
 <title>LINUX</title>
 <productname>Kernel Hackers Manual</productname>
 <date>July 2017</date>
</refentryinfo>
<refmeta>
 <refentrytitle><phrase>sys_get_robust_list</phrase></refentrytitle>
 <manvolnum>9</manvolnum>
 <refmiscinfo class="version">4.1.27</refmiscinfo>
</refmeta>
<refnamediv>
 <refname>sys_get_robust_list</refname>
 <refpurpose>
  Get the robust-futex list head of a task
 </refpurpose>
</refnamediv>
<refsynopsisdiv>
 <title>Synopsis</title>
  <funcsynopsis><funcprototype>
   <funcdef>long <function>sys_get_robust_list </function></funcdef>
   <paramdef>int <parameter>pid</parameter></paramdef>
   <paramdef>struct robust_list_head __user *__user * <parameter>head_ptr</parameter></paramdef>
   <paramdef>size_t __user * <parameter>len_ptr</parameter></paramdef>
  </funcprototype></funcsynopsis>
</refsynopsisdiv>
<refsect1>
 <title>Arguments</title>
 <variablelist>
  <varlistentry>
   <term><parameter>pid</parameter></term>
   <listitem>
    <para>
     pid of the process [zero for current task]
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>head_ptr</parameter></term>
   <listitem>
    <para>
     pointer to a list-head pointer, the kernel fills it in
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term><parameter>len_ptr</parameter></term>
   <listitem>
    <para>
     pointer to a length field, the kernel fills in the header size
    </para>
   </listitem>
  </varlistentry>
 </variablelist>
</refsect1>
</refentry>

  </chapter>

  <chapter id="references">
   <title>Further reading</title>

   <itemizedlist>
    <listitem>
     <para>
       <filename>Documentation/locking/spinlocks.txt</filename>:
       Linus Torvalds' spinlocking tutorial in the kernel sources.
     </para>
    </listitem>

    <listitem>
     <para>
       Unix Systems for Modern Architectures: Symmetric
       Multiprocessing and Caching for Kernel Programmers:
     </para>

     <para>
       Curt Schimmel's very good introduction to kernel level
       locking (not written for Linux, but nearly everything
       applies). The book is expensive, but really worth every
       penny to understand SMP locking. [ISBN: 0201633388]
     </para>
    </listitem>
   </itemizedlist>
  </chapter>

  <chapter id="thanks">
    <title>Thanks</title>

    <para>
      Thanks to Telsa Gwynne for DocBooking, neatening and adding
      style.
    </para>

    <para>
      Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
      Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim
      Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney,
      John Ashby for proofreading, correcting, flaming, commenting.
    </para>

    <para>
      Thanks to the cabal for having no influence on this document.
    </para>
  </chapter>

  <glossary id="glossary">
    <title>Glossary</title>

    <glossentry id="gloss-preemption">
      <glossterm>preemption</glossterm>
      <glossdef>
        <para>
          Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is
          unset, processes in user context inside the kernel would not
          preempt each other (i.e. you had that CPU until you gave it up,
          except for interrupts). With the addition of
          <symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when
          in user context, higher priority tasks can "cut in": spinlocks
          were changed to disable preemption, even on UP.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-bh">
      <glossterm>bh</glossterm>
      <glossdef>
        <para>
          Bottom Half: for historical reasons, functions with
          '_bh' in them often now refer to any software interrupt, e.g.
          <function>spin_lock_bh()</function> blocks any software interrupt
          on the current CPU. Bottom halves are deprecated, and will
          eventually be replaced by tasklets. Only one bottom half will be
          running at any time.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-hwinterrupt">
      <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
      <glossdef>
        <para>
          Hardware interrupt request. <function>in_irq()</function> returns
          <returnvalue>true</returnvalue> in a hardware interrupt handler.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-interruptcontext">
      <glossterm>Interrupt Context</glossterm>
      <glossdef>
        <para>
          Not user context: processing a hardware irq or software irq.
          Indicated by the <function>in_interrupt()</function> macro
          returning <returnvalue>true</returnvalue>.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-smp">
      <glossterm><acronym>SMP</acronym></glossterm>
      <glossdef>
        <para>
          Symmetric Multi-Processor: kernels compiled for multiple-CPU
          machines. (CONFIG_SMP=y).
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-softirq">
      <glossterm>Software Interrupt / softirq</glossterm>
      <glossdef>
        <para>
          Software interrupt handler. <function>in_irq()</function> returns
          <returnvalue>false</returnvalue>; <function>in_softirq()</function>
          returns <returnvalue>true</returnvalue>. Tasklets and softirqs
          both fall into the category of 'software interrupts'.
        </para>
        <para>
          Strictly speaking a softirq is one of up to 32 enumerated software
          interrupts which can run on multiple CPUs at once.
          Sometimes used to refer to tasklets as
          well (i.e. all software interrupts).
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-tasklet">
      <glossterm>tasklet</glossterm>
      <glossdef>
        <para>
          A dynamically-registrable software interrupt,
          which is guaranteed to only run on one CPU at a time.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-timers">
      <glossterm>timer</glossterm>
      <glossdef>
        <para>
          A dynamically-registrable software interrupt, which is run at
          (or close to) a given time. When running, it is just like a
          tasklet (in fact, they are called from the TIMER_SOFTIRQ).
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-up">
      <glossterm><acronym>UP</acronym></glossterm>
      <glossdef>
        <para>
          Uni-Processor: Non-SMP. (CONFIG_SMP=n).
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-usercontext">
      <glossterm>User Context</glossterm>
      <glossdef>
        <para>
          The kernel executing on behalf of a particular process (i.e. a
          system call or trap) or kernel thread. You can tell which
          process with the <symbol>current</symbol> macro. Not to
          be confused with userspace. Can be interrupted by software or
          hardware interrupts.
        </para>
      </glossdef>
    </glossentry>

    <glossentry id="gloss-userspace">
      <glossterm>Userspace</glossterm>
      <glossdef>
        <para>
          A process executing its own code outside the kernel.
        </para>
      </glossdef>
    </glossentry>

  </glossary>
</book>