1<?xml version="1.0" encoding="UTF-8"?> 2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" 3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> 4 5<book id="LKLockingGuide"> 6 <bookinfo> 7 <title>Unreliable Guide To Locking</title> 8 9 <authorgroup> 10 <author> 11 <firstname>Rusty</firstname> 12 <surname>Russell</surname> 13 <affiliation> 14 <address> 15 <email>rusty@rustcorp.com.au</email> 16 </address> 17 </affiliation> 18 </author> 19 </authorgroup> 20 21 <copyright> 22 <year>2003</year> 23 <holder>Rusty Russell</holder> 24 </copyright> 25 26 <legalnotice> 27 <para> 28 This documentation is free software; you can redistribute 29 it and/or modify it under the terms of the GNU General Public 30 License as published by the Free Software Foundation; either 31 version 2 of the License, or (at your option) any later 32 version. 33 </para> 34 35 <para> 36 This program is distributed in the hope that it will be 37 useful, but WITHOUT ANY WARRANTY; without even the implied 38 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 39 See the GNU General Public License for more details. 40 </para> 41 42 <para> 43 You should have received a copy of the GNU General Public 44 License along with this program; if not, write to the Free 45 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, 46 MA 02111-1307 USA 47 </para> 48 49 <para> 50 For more details see the file COPYING in the source 51 distribution of Linux. 52 </para> 53 </legalnotice> 54 </bookinfo> 55 56 <toc></toc> 57 <chapter id="intro"> 58 <title>Introduction</title> 59 <para> 60 Welcome, to Rusty's Remarkably Unreliable Guide to Kernel 61 Locking issues. This document describes the locking systems in 62 the Linux Kernel in 2.6. 63 </para> 64 <para> 65 With the wide availability of HyperThreading, and <firstterm 66 linkend="gloss-preemption">preemption </firstterm> in the Linux 67 Kernel, everyone hacking on the kernel needs to know the 68 fundamentals of concurrency and locking for 69 <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>. 70 </para> 71 </chapter> 72 73 <chapter id="races"> 74 <title>The Problem With Concurrency</title> 75 <para> 76 (Skip this if you know what a Race Condition is). 77 </para> 78 <para> 79 In a normal program, you can increment a counter like so: 80 </para> 81 <programlisting> 82 very_important_count++; 83 </programlisting> 84 85 <para> 86 This is what they would expect to happen: 87 </para> 88 89 <table> 90 <title>Expected Results</title> 91 92 <tgroup cols="2" align="left"> 93 94 <thead> 95 <row> 96 <entry>Instance 1</entry> 97 <entry>Instance 2</entry> 98 </row> 99 </thead> 100 101 <tbody> 102 <row> 103 <entry>read very_important_count (5)</entry> 104 <entry></entry> 105 </row> 106 <row> 107 <entry>add 1 (6)</entry> 108 <entry></entry> 109 </row> 110 <row> 111 <entry>write very_important_count (6)</entry> 112 <entry></entry> 113 </row> 114 <row> 115 <entry></entry> 116 <entry>read very_important_count (6)</entry> 117 </row> 118 <row> 119 <entry></entry> 120 <entry>add 1 (7)</entry> 121 </row> 122 <row> 123 <entry></entry> 124 <entry>write very_important_count (7)</entry> 125 </row> 126 </tbody> 127 128 </tgroup> 129 </table> 130 131 <para> 132 This is what might happen: 133 </para> 134 135 <table> 136 <title>Possible Results</title> 137 138 <tgroup cols="2" align="left"> 139 <thead> 140 <row> 141 <entry>Instance 1</entry> 142 <entry>Instance 2</entry> 143 </row> 144 </thead> 145 146 <tbody> 147 <row> 148 <entry>read very_important_count (5)</entry> 149 <entry></entry> 150 </row> 151 <row> 152 <entry></entry> 153 <entry>read very_important_count (5)</entry> 154 </row> 155 <row> 156 <entry>add 1 (6)</entry> 157 <entry></entry> 158 </row> 159 <row> 160 <entry></entry> 161 <entry>add 1 (6)</entry> 162 </row> 163 <row> 164 <entry>write very_important_count (6)</entry> 165 <entry></entry> 166 </row> 167 <row> 168 <entry></entry> 169 <entry>write very_important_count (6)</entry> 170 </row> 171 </tbody> 172 </tgroup> 173 </table> 174 175 <sect1 id="race-condition"> 176 <title>Race Conditions and Critical Regions</title> 177 <para> 178 This overlap, where the result depends on the 179 relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>. 180 The piece of code containing the concurrency issue is called a 181 <firstterm>critical region</firstterm>. And especially since Linux starting running 182 on SMP machines, they became one of the major issues in kernel 183 design and implementation. 184 </para> 185 <para> 186 Preemption can have the same effect, even if there is only one 187 CPU: by preempting one task during the critical region, we have 188 exactly the same race condition. In this case the thread which 189 preempts might run the critical region itself. 190 </para> 191 <para> 192 The solution is to recognize when these simultaneous accesses 193 occur, and use locks to make sure that only one instance can 194 enter the critical region at any time. There are many 195 friendly primitives in the Linux kernel to help you do this. 196 And then there are the unfriendly primitives, but I'll pretend 197 they don't exist. 198 </para> 199 </sect1> 200 </chapter> 201 202 <chapter id="locks"> 203 <title>Locking in the Linux Kernel</title> 204 205 <para> 206 If I could give you one piece of advice: never sleep with anyone 207 crazier than yourself. But if I had to give you advice on 208 locking: <emphasis>keep it simple</emphasis>. 209 </para> 210 211 <para> 212 Be reluctant to introduce new locks. 213 </para> 214 215 <para> 216 Strangely enough, this last one is the exact reverse of my advice when 217 you <emphasis>have</emphasis> slept with someone crazier than yourself. 218 And you should think about getting a big dog. 219 </para> 220 221 <sect1 id="lock-intro"> 222 <title>Two Main Types of Kernel Locks: Spinlocks and Mutexes</title> 223 224 <para> 225 There are two main types of kernel locks. The fundamental type 226 is the spinlock 227 (<filename class="headerfile">include/asm/spinlock.h</filename>), 228 which is a very simple single-holder lock: if you can't get the 229 spinlock, you keep trying (spinning) until you can. Spinlocks are 230 very small and fast, and can be used anywhere. 231 </para> 232 <para> 233 The second type is a mutex 234 (<filename class="headerfile">include/linux/mutex.h</filename>): it 235 is like a spinlock, but you may block holding a mutex. 236 If you can't lock a mutex, your task will suspend itself, and be woken 237 up when the mutex is released. This means the CPU can do something 238 else while you are waiting. There are many cases when you simply 239 can't sleep (see <xref linkend="sleeping-things"/>), and so have to 240 use a spinlock instead. 241 </para> 242 <para> 243 Neither type of lock is recursive: see 244 <xref linkend="deadlock"/>. 245 </para> 246 </sect1> 247 248 <sect1 id="uniprocessor"> 249 <title>Locks and Uniprocessor Kernels</title> 250 251 <para> 252 For kernels compiled without <symbol>CONFIG_SMP</symbol>, and 253 without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at 254 all. This is an excellent design decision: when no-one else can 255 run at the same time, there is no reason to have a lock. 256 </para> 257 258 <para> 259 If the kernel is compiled without <symbol>CONFIG_SMP</symbol>, 260 but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks 261 simply disable preemption, which is sufficient to prevent any 262 races. For most purposes, we can think of preemption as 263 equivalent to SMP, and not worry about it separately. 264 </para> 265 266 <para> 267 You should always test your locking code with <symbol>CONFIG_SMP</symbol> 268 and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it 269 will still catch some kinds of locking bugs. 270 </para> 271 272 <para> 273 Mutexes still exist, because they are required for 274 synchronization between <firstterm linkend="gloss-usercontext">user 275 contexts</firstterm>, as we will see below. 276 </para> 277 </sect1> 278 279 <sect1 id="usercontextlocking"> 280 <title>Locking Only In User Context</title> 281 282 <para> 283 If you have a data structure which is only ever accessed from 284 user context, then you can use a simple mutex 285 (<filename>include/linux/mutex.h</filename>) to protect it. This 286 is the most trivial case: you initialize the mutex. Then you can 287 call <function>mutex_lock_interruptible()</function> to grab the mutex, 288 and <function>mutex_unlock()</function> to release it. There is also a 289 <function>mutex_lock()</function>, which should be avoided, because it 290 will not return if a signal is received. 291 </para> 292 293 <para> 294 Example: <filename>net/netfilter/nf_sockopt.c</filename> allows 295 registration of new <function>setsockopt()</function> and 296 <function>getsockopt()</function> calls, with 297 <function>nf_register_sockopt()</function>. Registration and 298 de-registration are only done on module load and unload (and boot 299 time, where there is no concurrency), and the list of registrations 300 is only consulted for an unknown <function>setsockopt()</function> 301 or <function>getsockopt()</function> system call. The 302 <varname>nf_sockopt_mutex</varname> is perfect to protect this, 303 especially since the setsockopt and getsockopt calls may well 304 sleep. 305 </para> 306 </sect1> 307 308 <sect1 id="lock-user-bh"> 309 <title>Locking Between User Context and Softirqs</title> 310 311 <para> 312 If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares 313 data with user context, you have two problems. Firstly, the current 314 user context can be interrupted by a softirq, and secondly, the 315 critical region could be entered from another CPU. This is where 316 <function>spin_lock_bh()</function> 317 (<filename class="headerfile">include/linux/spinlock.h</filename>) is 318 used. It disables softirqs on that CPU, then grabs the lock. 319 <function>spin_unlock_bh()</function> does the reverse. (The 320 '_bh' suffix is a historical reference to "Bottom Halves", the 321 old name for software interrupts. It should really be 322 called spin_lock_softirq()' in a perfect world). 323 </para> 324 325 <para> 326 Note that you can also use <function>spin_lock_irq()</function> 327 or <function>spin_lock_irqsave()</function> here, which stop 328 hardware interrupts as well: see <xref linkend="hardirq-context"/>. 329 </para> 330 331 <para> 332 This works perfectly for <firstterm linkend="gloss-up"><acronym>UP 333 </acronym></firstterm> as well: the spin lock vanishes, and this macro 334 simply becomes <function>local_bh_disable()</function> 335 (<filename class="headerfile">include/linux/interrupt.h</filename>), which 336 protects you from the softirq being run. 337 </para> 338 </sect1> 339 340 <sect1 id="lock-user-tasklet"> 341 <title>Locking Between User Context and Tasklets</title> 342 343 <para> 344 This is exactly the same as above, because <firstterm 345 linkend="gloss-tasklet">tasklets</firstterm> are actually run 346 from a softirq. 347 </para> 348 </sect1> 349 350 <sect1 id="lock-user-timers"> 351 <title>Locking Between User Context and Timers</title> 352 353 <para> 354 This, too, is exactly the same as above, because <firstterm 355 linkend="gloss-timers">timers</firstterm> are actually run from 356 a softirq. From a locking point of view, tasklets and timers 357 are identical. 358 </para> 359 </sect1> 360 361 <sect1 id="lock-tasklets"> 362 <title>Locking Between Tasklets/Timers</title> 363 364 <para> 365 Sometimes a tasklet or timer might want to share data with 366 another tasklet or timer. 367 </para> 368 369 <sect2 id="lock-tasklets-same"> 370 <title>The Same Tasklet/Timer</title> 371 <para> 372 Since a tasklet is never run on two CPUs at once, you don't 373 need to worry about your tasklet being reentrant (running 374 twice at once), even on SMP. 375 </para> 376 </sect2> 377 378 <sect2 id="lock-tasklets-different"> 379 <title>Different Tasklets/Timers</title> 380 <para> 381 If another tasklet/timer wants 382 to share data with your tasklet or timer , you will both need to use 383 <function>spin_lock()</function> and 384 <function>spin_unlock()</function> calls. 385 <function>spin_lock_bh()</function> is 386 unnecessary here, as you are already in a tasklet, and 387 none will be run on the same CPU. 388 </para> 389 </sect2> 390 </sect1> 391 392 <sect1 id="lock-softirqs"> 393 <title>Locking Between Softirqs</title> 394 395 <para> 396 Often a softirq might 397 want to share data with itself or a tasklet/timer. 398 </para> 399 400 <sect2 id="lock-softirqs-same"> 401 <title>The Same Softirq</title> 402 403 <para> 404 The same softirq can run on the other CPUs: you can use a 405 per-CPU array (see <xref linkend="per-cpu"/>) for better 406 performance. If you're going so far as to use a softirq, 407 you probably care about scalable performance enough 408 to justify the extra complexity. 409 </para> 410 411 <para> 412 You'll need to use <function>spin_lock()</function> and 413 <function>spin_unlock()</function> for shared data. 414 </para> 415 </sect2> 416 417 <sect2 id="lock-softirqs-different"> 418 <title>Different Softirqs</title> 419 420 <para> 421 You'll need to use <function>spin_lock()</function> and 422 <function>spin_unlock()</function> for shared data, whether it 423 be a timer, tasklet, different softirq or the same or another 424 softirq: any of them could be running on a different CPU. 425 </para> 426 </sect2> 427 </sect1> 428 </chapter> 429 430 <chapter id="hardirq-context"> 431 <title>Hard IRQ Context</title> 432 433 <para> 434 Hardware interrupts usually communicate with a 435 tasklet or softirq. Frequently this involves putting work in a 436 queue, which the softirq will take out. 437 </para> 438 439 <sect1 id="hardirq-softirq"> 440 <title>Locking Between Hard IRQ and Softirqs/Tasklets</title> 441 442 <para> 443 If a hardware irq handler shares data with a softirq, you have 444 two concerns. Firstly, the softirq processing can be 445 interrupted by a hardware interrupt, and secondly, the 446 critical region could be entered by a hardware interrupt on 447 another CPU. This is where <function>spin_lock_irq()</function> is 448 used. It is defined to disable interrupts on that cpu, then grab 449 the lock. <function>spin_unlock_irq()</function> does the reverse. 450 </para> 451 452 <para> 453 The irq handler does not to use 454 <function>spin_lock_irq()</function>, because the softirq cannot 455 run while the irq handler is running: it can use 456 <function>spin_lock()</function>, which is slightly faster. The 457 only exception would be if a different hardware irq handler uses 458 the same lock: <function>spin_lock_irq()</function> will stop 459 that from interrupting us. 460 </para> 461 462 <para> 463 This works perfectly for UP as well: the spin lock vanishes, 464 and this macro simply becomes <function>local_irq_disable()</function> 465 (<filename class="headerfile">include/asm/smp.h</filename>), which 466 protects you from the softirq/tasklet/BH being run. 467 </para> 468 469 <para> 470 <function>spin_lock_irqsave()</function> 471 (<filename>include/linux/spinlock.h</filename>) is a variant 472 which saves whether interrupts were on or off in a flags word, 473 which is passed to <function>spin_unlock_irqrestore()</function>. This 474 means that the same code can be used inside an hard irq handler (where 475 interrupts are already off) and in softirqs (where the irq 476 disabling is required). 477 </para> 478 479 <para> 480 Note that softirqs (and hence tasklets and timers) are run on 481 return from hardware interrupts, so 482 <function>spin_lock_irq()</function> also stops these. In that 483 sense, <function>spin_lock_irqsave()</function> is the most 484 general and powerful locking function. 485 </para> 486 487 </sect1> 488 <sect1 id="hardirq-hardirq"> 489 <title>Locking Between Two Hard IRQ Handlers</title> 490 <para> 491 It is rare to have to share data between two IRQ handlers, but 492 if you do, <function>spin_lock_irqsave()</function> should be 493 used: it is architecture-specific whether all interrupts are 494 disabled inside irq handlers themselves. 495 </para> 496 </sect1> 497 498 </chapter> 499 500 <chapter id="cheatsheet"> 501 <title>Cheat Sheet For Locking</title> 502 <para> 503 Pete Zaitcev gives the following summary: 504 </para> 505 <itemizedlist> 506 <listitem> 507 <para> 508 If you are in a process context (any syscall) and want to 509 lock other process out, use a mutex. You can take a mutex 510 and sleep (<function>copy_from_user*(</function> or 511 <function>kmalloc(x,GFP_KERNEL)</function>). 512 </para> 513 </listitem> 514 <listitem> 515 <para> 516 Otherwise (== data can be touched in an interrupt), use 517 <function>spin_lock_irqsave()</function> and 518 <function>spin_unlock_irqrestore()</function>. 519 </para> 520 </listitem> 521 <listitem> 522 <para> 523 Avoid holding spinlock for more than 5 lines of code and 524 across any function call (except accessors like 525 <function>readb</function>). 526 </para> 527 </listitem> 528 </itemizedlist> 529 530 <sect1 id="minimum-lock-reqirements"> 531 <title>Table of Minimum Requirements</title> 532 533 <para> The following table lists the <emphasis>minimum</emphasis> 534 locking requirements between various contexts. In some cases, 535 the same context can only be running on one CPU at a time, so 536 no locking is required for that context (eg. a particular 537 thread can only run on one CPU at a time, but if it needs 538 shares data with another thread, locking is required). 539 </para> 540 <para> 541 Remember the advice above: you can always use 542 <function>spin_lock_irqsave()</function>, which is a superset 543 of all other spinlock primitives. 544 </para> 545 546 <table> 547<title>Table of Locking Requirements</title> 548<tgroup cols="11"> 549<tbody> 550 551<row> 552<entry></entry> 553<entry>IRQ Handler A</entry> 554<entry>IRQ Handler B</entry> 555<entry>Softirq A</entry> 556<entry>Softirq B</entry> 557<entry>Tasklet A</entry> 558<entry>Tasklet B</entry> 559<entry>Timer A</entry> 560<entry>Timer B</entry> 561<entry>User Context A</entry> 562<entry>User Context B</entry> 563</row> 564 565<row> 566<entry>IRQ Handler A</entry> 567<entry>None</entry> 568</row> 569 570<row> 571<entry>IRQ Handler B</entry> 572<entry>SLIS</entry> 573<entry>None</entry> 574</row> 575 576<row> 577<entry>Softirq A</entry> 578<entry>SLI</entry> 579<entry>SLI</entry> 580<entry>SL</entry> 581</row> 582 583<row> 584<entry>Softirq B</entry> 585<entry>SLI</entry> 586<entry>SLI</entry> 587<entry>SL</entry> 588<entry>SL</entry> 589</row> 590 591<row> 592<entry>Tasklet A</entry> 593<entry>SLI</entry> 594<entry>SLI</entry> 595<entry>SL</entry> 596<entry>SL</entry> 597<entry>None</entry> 598</row> 599 600<row> 601<entry>Tasklet B</entry> 602<entry>SLI</entry> 603<entry>SLI</entry> 604<entry>SL</entry> 605<entry>SL</entry> 606<entry>SL</entry> 607<entry>None</entry> 608</row> 609 610<row> 611<entry>Timer A</entry> 612<entry>SLI</entry> 613<entry>SLI</entry> 614<entry>SL</entry> 615<entry>SL</entry> 616<entry>SL</entry> 617<entry>SL</entry> 618<entry>None</entry> 619</row> 620 621<row> 622<entry>Timer B</entry> 623<entry>SLI</entry> 624<entry>SLI</entry> 625<entry>SL</entry> 626<entry>SL</entry> 627<entry>SL</entry> 628<entry>SL</entry> 629<entry>SL</entry> 630<entry>None</entry> 631</row> 632 633<row> 634<entry>User Context A</entry> 635<entry>SLI</entry> 636<entry>SLI</entry> 637<entry>SLBH</entry> 638<entry>SLBH</entry> 639<entry>SLBH</entry> 640<entry>SLBH</entry> 641<entry>SLBH</entry> 642<entry>SLBH</entry> 643<entry>None</entry> 644</row> 645 646<row> 647<entry>User Context B</entry> 648<entry>SLI</entry> 649<entry>SLI</entry> 650<entry>SLBH</entry> 651<entry>SLBH</entry> 652<entry>SLBH</entry> 653<entry>SLBH</entry> 654<entry>SLBH</entry> 655<entry>SLBH</entry> 656<entry>MLI</entry> 657<entry>None</entry> 658</row> 659 660</tbody> 661</tgroup> 662</table> 663 664 <table> 665<title>Legend for Locking Requirements Table</title> 666<tgroup cols="2"> 667<tbody> 668 669<row> 670<entry>SLIS</entry> 671<entry>spin_lock_irqsave</entry> 672</row> 673<row> 674<entry>SLI</entry> 675<entry>spin_lock_irq</entry> 676</row> 677<row> 678<entry>SL</entry> 679<entry>spin_lock</entry> 680</row> 681<row> 682<entry>SLBH</entry> 683<entry>spin_lock_bh</entry> 684</row> 685<row> 686<entry>MLI</entry> 687<entry>mutex_lock_interruptible</entry> 688</row> 689 690</tbody> 691</tgroup> 692</table> 693 694</sect1> 695</chapter> 696 697<chapter id="trylock-functions"> 698 <title>The trylock Functions</title> 699 <para> 700 There are functions that try to acquire a lock only once and immediately 701 return a value telling about success or failure to acquire the lock. 702 They can be used if you need no access to the data protected with the lock 703 when some other thread is holding the lock. You should acquire the lock 704 later if you then need access to the data protected with the lock. 705 </para> 706 707 <para> 708 <function>spin_trylock()</function> does not spin but returns non-zero if 709 it acquires the spinlock on the first try or 0 if not. This function can 710 be used in all contexts like <function>spin_lock</function>: you must have 711 disabled the contexts that might interrupt you and acquire the spin lock. 712 </para> 713 714 <para> 715 <function>mutex_trylock()</function> does not suspend your task 716 but returns non-zero if it could lock the mutex on the first try 717 or 0 if not. This function cannot be safely used in hardware or software 718 interrupt contexts despite not sleeping. 719 </para> 720</chapter> 721 722 <chapter id="Examples"> 723 <title>Common Examples</title> 724 <para> 725Let's step through a simple example: a cache of number to name 726mappings. The cache keeps a count of how often each of the objects is 727used, and when it gets full, throws out the least used one. 728 729 </para> 730 731 <sect1 id="examples-usercontext"> 732 <title>All In User Context</title> 733 <para> 734For our first example, we assume that all operations are in user 735context (ie. from system calls), so we can sleep. This means we can 736use a mutex to protect the cache and all the objects within 737it. Here's the code: 738 </para> 739 740 <programlisting> 741#include <linux/list.h> 742#include <linux/slab.h> 743#include <linux/string.h> 744#include <linux/mutex.h> 745#include <asm/errno.h> 746 747struct object 748{ 749 struct list_head list; 750 int id; 751 char name[32]; 752 int popularity; 753}; 754 755/* Protects the cache, cache_num, and the objects within it */ 756static DEFINE_MUTEX(cache_lock); 757static LIST_HEAD(cache); 758static unsigned int cache_num = 0; 759#define MAX_CACHE_SIZE 10 760 761/* Must be holding cache_lock */ 762static struct object *__cache_find(int id) 763{ 764 struct object *i; 765 766 list_for_each_entry(i, &cache, list) 767 if (i->id == id) { 768 i->popularity++; 769 return i; 770 } 771 return NULL; 772} 773 774/* Must be holding cache_lock */ 775static void __cache_delete(struct object *obj) 776{ 777 BUG_ON(!obj); 778 list_del(&obj->list); 779 kfree(obj); 780 cache_num--; 781} 782 783/* Must be holding cache_lock */ 784static void __cache_add(struct object *obj) 785{ 786 list_add(&obj->list, &cache); 787 if (++cache_num > MAX_CACHE_SIZE) { 788 struct object *i, *outcast = NULL; 789 list_for_each_entry(i, &cache, list) { 790 if (!outcast || i->popularity < outcast->popularity) 791 outcast = i; 792 } 793 __cache_delete(outcast); 794 } 795} 796 797int cache_add(int id, const char *name) 798{ 799 struct object *obj; 800 801 if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 802 return -ENOMEM; 803 804 strlcpy(obj->name, name, sizeof(obj->name)); 805 obj->id = id; 806 obj->popularity = 0; 807 808 mutex_lock(&cache_lock); 809 __cache_add(obj); 810 mutex_unlock(&cache_lock); 811 return 0; 812} 813 814void cache_delete(int id) 815{ 816 mutex_lock(&cache_lock); 817 __cache_delete(__cache_find(id)); 818 mutex_unlock(&cache_lock); 819} 820 821int cache_find(int id, char *name) 822{ 823 struct object *obj; 824 int ret = -ENOENT; 825 826 mutex_lock(&cache_lock); 827 obj = __cache_find(id); 828 if (obj) { 829 ret = 0; 830 strcpy(name, obj->name); 831 } 832 mutex_unlock(&cache_lock); 833 return ret; 834} 835</programlisting> 836 837 <para> 838Note that we always make sure we have the cache_lock when we add, 839delete, or look up the cache: both the cache infrastructure itself and 840the contents of the objects are protected by the lock. In this case 841it's easy, since we copy the data for the user, and never let them 842access the objects directly. 843 </para> 844 <para> 845There is a slight (and common) optimization here: in 846<function>cache_add</function> we set up the fields of the object 847before grabbing the lock. This is safe, as no-one else can access it 848until we put it in cache. 849 </para> 850 </sect1> 851 852 <sect1 id="examples-interrupt"> 853 <title>Accessing From Interrupt Context</title> 854 <para> 855Now consider the case where <function>cache_find</function> can be 856called from interrupt context: either a hardware interrupt or a 857softirq. An example would be a timer which deletes object from the 858cache. 859 </para> 860 <para> 861The change is shown below, in standard patch format: the 862<symbol>-</symbol> are lines which are taken away, and the 863<symbol>+</symbol> are lines which are added. 864 </para> 865<programlisting> 866--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 867+++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 868@@ -12,7 +12,7 @@ 869 int popularity; 870 }; 871 872-static DEFINE_MUTEX(cache_lock); 873+static DEFINE_SPINLOCK(cache_lock); 874 static LIST_HEAD(cache); 875 static unsigned int cache_num = 0; 876 #define MAX_CACHE_SIZE 10 877@@ -55,6 +55,7 @@ 878 int cache_add(int id, const char *name) 879 { 880 struct object *obj; 881+ unsigned long flags; 882 883 if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) 884 return -ENOMEM; 885@@ -63,30 +64,33 @@ 886 obj->id = id; 887 obj->popularity = 0; 888 889- mutex_lock(&cache_lock); 890+ spin_lock_irqsave(&cache_lock, flags); 891 __cache_add(obj); 892- mutex_unlock(&cache_lock); 893+ spin_unlock_irqrestore(&cache_lock, flags); 894 return 0; 895 } 896 897 void cache_delete(int id) 898 { 899- mutex_lock(&cache_lock); 900+ unsigned long flags; 901+ 902+ spin_lock_irqsave(&cache_lock, flags); 903 __cache_delete(__cache_find(id)); 904- mutex_unlock(&cache_lock); 905+ spin_unlock_irqrestore(&cache_lock, flags); 906 } 907 908 int cache_find(int id, char *name) 909 { 910 struct object *obj; 911 int ret = -ENOENT; 912+ unsigned long flags; 913 914- mutex_lock(&cache_lock); 915+ spin_lock_irqsave(&cache_lock, flags); 916 obj = __cache_find(id); 917 if (obj) { 918 ret = 0; 919 strcpy(name, obj->name); 920 } 921- mutex_unlock(&cache_lock); 922+ spin_unlock_irqrestore(&cache_lock, flags); 923 return ret; 924 } 925</programlisting> 926 927 <para> 928Note that the <function>spin_lock_irqsave</function> will turn off 929interrupts if they are on, otherwise does nothing (if we are already 930in an interrupt handler), hence these functions are safe to call from 931any context. 932 </para> 933 <para> 934Unfortunately, <function>cache_add</function> calls 935<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol> 936flag, which is only legal in user context. I have assumed that 937<function>cache_add</function> is still only called in user context, 938otherwise this should become a parameter to 939<function>cache_add</function>. 940 </para> 941 </sect1> 942 <sect1 id="examples-refcnt"> 943 <title>Exposing Objects Outside This File</title> 944 <para> 945If our objects contained more information, it might not be sufficient 946to copy the information in and out: other parts of the code might want 947to keep pointers to these objects, for example, rather than looking up 948the id every time. This produces two problems. 949 </para> 950 <para> 951The first problem is that we use the <symbol>cache_lock</symbol> to 952protect objects: we'd need to make this non-static so the rest of the 953code can use it. This makes locking trickier, as it is no longer all 954in one place. 955 </para> 956 <para> 957The second problem is the lifetime problem: if another structure keeps 958a pointer to an object, it presumably expects that pointer to remain 959valid. Unfortunately, this is only guaranteed while you hold the 960lock, otherwise someone might call <function>cache_delete</function> 961and even worse, add another object, re-using the same address. 962 </para> 963 <para> 964As there is only one lock, you can't hold it forever: no-one else would 965get any work done. 966 </para> 967 <para> 968The solution to this problem is to use a reference count: everyone who 969has a pointer to the object increases it when they first get the 970object, and drops the reference count when they're finished with it. 971Whoever drops it to zero knows it is unused, and can actually delete it. 972 </para> 973 <para> 974Here is the code: 975 </para> 976 977<programlisting> 978--- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 979+++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 980@@ -7,6 +7,7 @@ 981 struct object 982 { 983 struct list_head list; 984+ unsigned int refcnt; 985 int id; 986 char name[32]; 987 int popularity; 988@@ -17,6 +18,35 @@ 989 static unsigned int cache_num = 0; 990 #define MAX_CACHE_SIZE 10 991 992+static void __object_put(struct object *obj) 993+{ 994+ if (--obj->refcnt == 0) 995+ kfree(obj); 996+} 997+ 998+static void __object_get(struct object *obj) 999+{ 1000+ obj->refcnt++; 1001+} 1002+ 1003+void object_put(struct object *obj) 1004+{ 1005+ unsigned long flags; 1006+ 1007+ spin_lock_irqsave(&cache_lock, flags); 1008+ __object_put(obj); 1009+ spin_unlock_irqrestore(&cache_lock, flags); 1010+} 1011+ 1012+void object_get(struct object *obj) 1013+{ 1014+ unsigned long flags; 1015+ 1016+ spin_lock_irqsave(&cache_lock, flags); 1017+ __object_get(obj); 1018+ spin_unlock_irqrestore(&cache_lock, flags); 1019+} 1020+ 1021 /* Must be holding cache_lock */ 1022 static struct object *__cache_find(int id) 1023 { 1024@@ -35,6 +65,7 @@ 1025 { 1026 BUG_ON(!obj); 1027 list_del(&obj->list); 1028+ __object_put(obj); 1029 cache_num--; 1030 } 1031 1032@@ -63,6 +94,7 @@ 1033 strlcpy(obj->name, name, sizeof(obj->name)); 1034 obj->id = id; 1035 obj->popularity = 0; 1036+ obj->refcnt = 1; /* The cache holds a reference */ 1037 1038 spin_lock_irqsave(&cache_lock, flags); 1039 __cache_add(obj); 1040@@ -79,18 +111,15 @@ 1041 spin_unlock_irqrestore(&cache_lock, flags); 1042 } 1043 1044-int cache_find(int id, char *name) 1045+struct object *cache_find(int id) 1046 { 1047 struct object *obj; 1048- int ret = -ENOENT; 1049 unsigned long flags; 1050 1051 spin_lock_irqsave(&cache_lock, flags); 1052 obj = __cache_find(id); 1053- if (obj) { 1054- ret = 0; 1055- strcpy(name, obj->name); 1056- } 1057+ if (obj) 1058+ __object_get(obj); 1059 spin_unlock_irqrestore(&cache_lock, flags); 1060- return ret; 1061+ return obj; 1062 } 1063</programlisting> 1064 1065<para> 1066We encapsulate the reference counting in the standard 'get' and 'put' 1067functions. Now we can return the object itself from 1068<function>cache_find</function> which has the advantage that the user 1069can now sleep holding the object (eg. to 1070<function>copy_to_user</function> to name to userspace). 1071</para> 1072<para> 1073The other point to note is that I said a reference should be held for 1074every pointer to the object: thus the reference count is 1 when first 1075inserted into the cache. In some versions the framework does not hold 1076a reference count, but they are more complicated. 1077</para> 1078 1079 <sect2 id="examples-refcnt-atomic"> 1080 <title>Using Atomic Operations For The Reference Count</title> 1081<para> 1082In practice, <type>atomic_t</type> would usually be used for 1083<structfield>refcnt</structfield>. There are a number of atomic 1084operations defined in 1085 1086<filename class="headerfile">include/asm/atomic.h</filename>: these are 1087guaranteed to be seen atomically from all CPUs in the system, so no 1088lock is required. In this case, it is simpler than using spinlocks, 1089although for anything non-trivial using spinlocks is clearer. The 1090<function>atomic_inc</function> and 1091<function>atomic_dec_and_test</function> are used instead of the 1092standard increment and decrement operators, and the lock is no longer 1093used to protect the reference count itself. 1094</para> 1095 1096<programlisting> 1097--- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 1098+++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 1099@@ -7,7 +7,7 @@ 1100 struct object 1101 { 1102 struct list_head list; 1103- unsigned int refcnt; 1104+ atomic_t refcnt; 1105 int id; 1106 char name[32]; 1107 int popularity; 1108@@ -18,33 +18,15 @@ 1109 static unsigned int cache_num = 0; 1110 #define MAX_CACHE_SIZE 10 1111 1112-static void __object_put(struct object *obj) 1113-{ 1114- if (--obj->refcnt == 0) 1115- kfree(obj); 1116-} 1117- 1118-static void __object_get(struct object *obj) 1119-{ 1120- obj->refcnt++; 1121-} 1122- 1123 void object_put(struct object *obj) 1124 { 1125- unsigned long flags; 1126- 1127- spin_lock_irqsave(&cache_lock, flags); 1128- __object_put(obj); 1129- spin_unlock_irqrestore(&cache_lock, flags); 1130+ if (atomic_dec_and_test(&obj->refcnt)) 1131+ kfree(obj); 1132 } 1133 1134 void object_get(struct object *obj) 1135 { 1136- unsigned long flags; 1137- 1138- spin_lock_irqsave(&cache_lock, flags); 1139- __object_get(obj); 1140- spin_unlock_irqrestore(&cache_lock, flags); 1141+ atomic_inc(&obj->refcnt); 1142 } 1143 1144 /* Must be holding cache_lock */ 1145@@ -65,7 +47,7 @@ 1146 { 1147 BUG_ON(!obj); 1148 list_del(&obj->list); 1149- __object_put(obj); 1150+ object_put(obj); 1151 cache_num--; 1152 } 1153 1154@@ -94,7 +76,7 @@ 1155 strlcpy(obj->name, name, sizeof(obj->name)); 1156 obj->id = id; 1157 obj->popularity = 0; 1158- obj->refcnt = 1; /* The cache holds a reference */ 1159+ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 1160 1161 spin_lock_irqsave(&cache_lock, flags); 1162 __cache_add(obj); 1163@@ -119,7 +101,7 @@ 1164 spin_lock_irqsave(&cache_lock, flags); 1165 obj = __cache_find(id); 1166 if (obj) 1167- __object_get(obj); 1168+ object_get(obj); 1169 spin_unlock_irqrestore(&cache_lock, flags); 1170 return obj; 1171 } 1172</programlisting> 1173</sect2> 1174</sect1> 1175 1176 <sect1 id="examples-lock-per-obj"> 1177 <title>Protecting The Objects Themselves</title> 1178 <para> 1179In these examples, we assumed that the objects (except the reference 1180counts) never changed once they are created. If we wanted to allow 1181the name to change, there are three possibilities: 1182 </para> 1183 <itemizedlist> 1184 <listitem> 1185 <para> 1186You can make <symbol>cache_lock</symbol> non-static, and tell people 1187to grab that lock before changing the name in any object. 1188 </para> 1189 </listitem> 1190 <listitem> 1191 <para> 1192You can provide a <function>cache_obj_rename</function> which grabs 1193this lock and changes the name for the caller, and tell everyone to 1194use that function. 1195 </para> 1196 </listitem> 1197 <listitem> 1198 <para> 1199You can make the <symbol>cache_lock</symbol> protect only the cache 1200itself, and use another lock to protect the name. 1201 </para> 1202 </listitem> 1203 </itemizedlist> 1204 1205 <para> 1206Theoretically, you can make the locks as fine-grained as one lock for 1207every field, for every object. In practice, the most common variants 1208are: 1209</para> 1210 <itemizedlist> 1211 <listitem> 1212 <para> 1213One lock which protects the infrastructure (the <symbol>cache</symbol> 1214list in this example) and all the objects. This is what we have done 1215so far. 1216 </para> 1217 </listitem> 1218 <listitem> 1219 <para> 1220One lock which protects the infrastructure (including the list 1221pointers inside the objects), and one lock inside the object which 1222protects the rest of that object. 1223 </para> 1224 </listitem> 1225 <listitem> 1226 <para> 1227Multiple locks to protect the infrastructure (eg. one lock per hash 1228chain), possibly with a separate per-object lock. 1229 </para> 1230 </listitem> 1231 </itemizedlist> 1232 1233<para> 1234Here is the "lock-per-object" implementation: 1235</para> 1236<programlisting> 1237--- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 1238+++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 1239@@ -6,11 +6,17 @@ 1240 1241 struct object 1242 { 1243+ /* These two protected by cache_lock. */ 1244 struct list_head list; 1245+ int popularity; 1246+ 1247 atomic_t refcnt; 1248+ 1249+ /* Doesn't change once created. */ 1250 int id; 1251+ 1252+ spinlock_t lock; /* Protects the name */ 1253 char name[32]; 1254- int popularity; 1255 }; 1256 1257 static DEFINE_SPINLOCK(cache_lock); 1258@@ -77,6 +84,7 @@ 1259 obj->id = id; 1260 obj->popularity = 0; 1261 atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ 1262+ spin_lock_init(&obj->lock); 1263 1264 spin_lock_irqsave(&cache_lock, flags); 1265 __cache_add(obj); 1266</programlisting> 1267 1268<para> 1269Note that I decide that the <structfield>popularity</structfield> 1270count should be protected by the <symbol>cache_lock</symbol> rather 1271than the per-object lock: this is because it (like the 1272<structname>struct list_head</structname> inside the object) is 1273logically part of the infrastructure. This way, I don't need to grab 1274the lock of every object in <function>__cache_add</function> when 1275seeking the least popular. 1276</para> 1277 1278<para> 1279I also decided that the <structfield>id</structfield> member is 1280unchangeable, so I don't need to grab each object lock in 1281<function>__cache_find()</function> to examine the 1282<structfield>id</structfield>: the object lock is only used by a 1283caller who wants to read or write the <structfield>name</structfield> 1284field. 1285</para> 1286 1287<para> 1288Note also that I added a comment describing what data was protected by 1289which locks. This is extremely important, as it describes the runtime 1290behavior of the code, and can be hard to gain from just reading. And 1291as Alan Cox says, <quote>Lock data, not code</quote>. 1292</para> 1293</sect1> 1294</chapter> 1295 1296 <chapter id="common-problems"> 1297 <title>Common Problems</title> 1298 <sect1 id="deadlock"> 1299 <title>Deadlock: Simple and Advanced</title> 1300 1301 <para> 1302 There is a coding bug where a piece of code tries to grab a 1303 spinlock twice: it will spin forever, waiting for the lock to 1304 be released (spinlocks, rwlocks and mutexes are not 1305 recursive in Linux). This is trivial to diagnose: not a 1306 stay-up-five-nights-talk-to-fluffy-code-bunnies kind of 1307 problem. 1308 </para> 1309 1310 <para> 1311 For a slightly more complex case, imagine you have a region 1312 shared by a softirq and user context. If you use a 1313 <function>spin_lock()</function> call to protect it, it is 1314 possible that the user context will be interrupted by the softirq 1315 while it holds the lock, and the softirq will then spin 1316 forever trying to get the same lock. 1317 </para> 1318 1319 <para> 1320 Both of these are called deadlock, and as shown above, it can 1321 occur even with a single CPU (although not on UP compiles, 1322 since spinlocks vanish on kernel compiles with 1323 <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption 1324 in the second example). 1325 </para> 1326 1327 <para> 1328 This complete lockup is easy to diagnose: on SMP boxes the 1329 watchdog timer or compiling with <symbol>DEBUG_SPINLOCK</symbol> set 1330 (<filename>include/linux/spinlock.h</filename>) will show this up 1331 immediately when it happens. 1332 </para> 1333 1334 <para> 1335 A more complex problem is the so-called 'deadly embrace', 1336 involving two or more locks. Say you have a hash table: each 1337 entry in the table is a spinlock, and a chain of hashed 1338 objects. Inside a softirq handler, you sometimes want to 1339 alter an object from one place in the hash to another: you 1340 grab the spinlock of the old hash chain and the spinlock of 1341 the new hash chain, and delete the object from the old one, 1342 and insert it in the new one. 1343 </para> 1344 1345 <para> 1346 There are two problems here. First, if your code ever 1347 tries to move the object to the same chain, it will deadlock 1348 with itself as it tries to lock it twice. Secondly, if the 1349 same softirq on another CPU is trying to move another object 1350 in the reverse direction, the following could happen: 1351 </para> 1352 1353 <table> 1354 <title>Consequences</title> 1355 1356 <tgroup cols="2" align="left"> 1357 1358 <thead> 1359 <row> 1360 <entry>CPU 1</entry> 1361 <entry>CPU 2</entry> 1362 </row> 1363 </thead> 1364 1365 <tbody> 1366 <row> 1367 <entry>Grab lock A -> OK</entry> 1368 <entry>Grab lock B -> OK</entry> 1369 </row> 1370 <row> 1371 <entry>Grab lock B -> spin</entry> 1372 <entry>Grab lock A -> spin</entry> 1373 </row> 1374 </tbody> 1375 </tgroup> 1376 </table> 1377 1378 <para> 1379 The two CPUs will spin forever, waiting for the other to give up 1380 their lock. It will look, smell, and feel like a crash. 1381 </para> 1382 </sect1> 1383 1384 <sect1 id="techs-deadlock-prevent"> 1385 <title>Preventing Deadlock</title> 1386 1387 <para> 1388 Textbooks will tell you that if you always lock in the same 1389 order, you will never get this kind of deadlock. Practice 1390 will tell you that this approach doesn't scale: when I 1391 create a new lock, I don't understand enough of the kernel 1392 to figure out where in the 5000 lock hierarchy it will fit. 1393 </para> 1394 1395 <para> 1396 The best locks are encapsulated: they never get exposed in 1397 headers, and are never held around calls to non-trivial 1398 functions outside the same file. You can read through this 1399 code and see that it will never deadlock, because it never 1400 tries to grab another lock while it has that one. People 1401 using your code don't even need to know you are using a 1402 lock. 1403 </para> 1404 1405 <para> 1406 A classic problem here is when you provide callbacks or 1407 hooks: if you call these with the lock held, you risk simple 1408 deadlock, or a deadly embrace (who knows what the callback 1409 will do?). Remember, the other programmers are out to get 1410 you, so don't do this. 1411 </para> 1412 1413 <sect2 id="techs-deadlock-overprevent"> 1414 <title>Overzealous Prevention Of Deadlocks</title> 1415 1416 <para> 1417 Deadlocks are problematic, but not as bad as data 1418 corruption. Code which grabs a read lock, searches a list, 1419 fails to find what it wants, drops the read lock, grabs a 1420 write lock and inserts the object has a race condition. 1421 </para> 1422 1423 <para> 1424 If you don't see why, please stay the fuck away from my code. 1425 </para> 1426 </sect2> 1427 </sect1> 1428 1429 <sect1 id="racing-timers"> 1430 <title>Racing Timers: A Kernel Pastime</title> 1431 1432 <para> 1433 Timers can produce their own special problems with races. 1434 Consider a collection of objects (list, hash, etc) where each 1435 object has a timer which is due to destroy it. 1436 </para> 1437 1438 <para> 1439 If you want to destroy the entire collection (say on module 1440 removal), you might do the following: 1441 </para> 1442 1443 <programlisting> 1444 /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE 1445 HUNGARIAN NOTATION */ 1446 spin_lock_bh(&list_lock); 1447 1448 while (list) { 1449 struct foo *next = list->next; 1450 del_timer(&list->timer); 1451 kfree(list); 1452 list = next; 1453 } 1454 1455 spin_unlock_bh(&list_lock); 1456 </programlisting> 1457 1458 <para> 1459 Sooner or later, this will crash on SMP, because a timer can 1460 have just gone off before the <function>spin_lock_bh()</function>, 1461 and it will only get the lock after we 1462 <function>spin_unlock_bh()</function>, and then try to free 1463 the element (which has already been freed!). 1464 </para> 1465 1466 <para> 1467 This can be avoided by checking the result of 1468 <function>del_timer()</function>: if it returns 1469 <returnvalue>1</returnvalue>, the timer has been deleted. 1470 If <returnvalue>0</returnvalue>, it means (in this 1471 case) that it is currently running, so we can do: 1472 </para> 1473 1474 <programlisting> 1475 retry: 1476 spin_lock_bh(&list_lock); 1477 1478 while (list) { 1479 struct foo *next = list->next; 1480 if (!del_timer(&list->timer)) { 1481 /* Give timer a chance to delete this */ 1482 spin_unlock_bh(&list_lock); 1483 goto retry; 1484 } 1485 kfree(list); 1486 list = next; 1487 } 1488 1489 spin_unlock_bh(&list_lock); 1490 </programlisting> 1491 1492 <para> 1493 Another common problem is deleting timers which restart 1494 themselves (by calling <function>add_timer()</function> at the end 1495 of their timer function). Because this is a fairly common case 1496 which is prone to races, you should use <function>del_timer_sync()</function> 1497 (<filename class="headerfile">include/linux/timer.h</filename>) 1498 to handle this case. It returns the number of times the timer 1499 had to be deleted before we finally stopped it from adding itself back 1500 in. 1501 </para> 1502 </sect1> 1503 1504 </chapter> 1505 1506 <chapter id="Efficiency"> 1507 <title>Locking Speed</title> 1508 1509 <para> 1510There are three main things to worry about when considering speed of 1511some code which does locking. First is concurrency: how many things 1512are going to be waiting while someone else is holding a lock. Second 1513is the time taken to actually acquire and release an uncontended lock. 1514Third is using fewer, or smarter locks. I'm assuming that the lock is 1515used fairly often: otherwise, you wouldn't be concerned about 1516efficiency. 1517</para> 1518 <para> 1519Concurrency depends on how long the lock is usually held: you should 1520hold the lock for as long as needed, but no longer. In the cache 1521example, we always create the object without the lock held, and then 1522grab the lock only when we are ready to insert it in the list. 1523</para> 1524 <para> 1525Acquisition times depend on how much damage the lock operations do to 1526the pipeline (pipeline stalls) and how likely it is that this CPU was 1527the last one to grab the lock (ie. is the lock cache-hot for this 1528CPU): on a machine with more CPUs, this likelihood drops fast. 1529Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, 1530an atomic increment takes about 58ns, a lock which is cache-hot on 1531this CPU takes 160ns, and a cacheline transfer from another CPU takes 1532an additional 170 to 360ns. (These figures from Paul McKenney's 1533<ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux 1534Journal RCU article</ulink>). 1535</para> 1536 <para> 1537These two aims conflict: holding a lock for a short time might be done 1538by splitting locks into parts (such as in our final per-object-lock 1539example), but this increases the number of lock acquisitions, and the 1540results are often slower than having a single lock. This is another 1541reason to advocate locking simplicity. 1542</para> 1543 <para> 1544The third concern is addressed below: there are some methods to reduce 1545the amount of locking which needs to be done. 1546</para> 1547 1548 <sect1 id="efficiency-rwlocks"> 1549 <title>Read/Write Lock Variants</title> 1550 1551 <para> 1552 Both spinlocks and mutexes have read/write variants: 1553 <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>. 1554 These divide users into two classes: the readers and the writers. If 1555 you are only reading the data, you can get a read lock, but to write to 1556 the data you need the write lock. Many people can hold a read lock, 1557 but a writer must be sole holder. 1558 </para> 1559 1560 <para> 1561 If your code divides neatly along reader/writer lines (as our 1562 cache code does), and the lock is held by readers for 1563 significant lengths of time, using these locks can help. They 1564 are slightly slower than the normal locks though, so in practice 1565 <type>rwlock_t</type> is not usually worthwhile. 1566 </para> 1567 </sect1> 1568 1569 <sect1 id="efficiency-read-copy-update"> 1570 <title>Avoiding Locks: Read Copy Update</title> 1571 1572 <para> 1573 There is a special method of read/write locking called Read Copy 1574 Update. Using RCU, the readers can avoid taking a lock 1575 altogether: as we expect our cache to be read more often than 1576 updated (otherwise the cache is a waste of time), it is a 1577 candidate for this optimization. 1578 </para> 1579 1580 <para> 1581 How do we get rid of read locks? Getting rid of read locks 1582 means that writers may be changing the list underneath the 1583 readers. That is actually quite simple: we can read a linked 1584 list while an element is being added if the writer adds the 1585 element very carefully. For example, adding 1586 <symbol>new</symbol> to a single linked list called 1587 <symbol>list</symbol>: 1588 </para> 1589 1590 <programlisting> 1591 new->next = list->next; 1592 wmb(); 1593 list->next = new; 1594 </programlisting> 1595 1596 <para> 1597 The <function>wmb()</function> is a write memory barrier. It 1598 ensures that the first operation (setting the new element's 1599 <symbol>next</symbol> pointer) is complete and will be seen by 1600 all CPUs, before the second operation is (putting the new 1601 element into the list). This is important, since modern 1602 compilers and modern CPUs can both reorder instructions unless 1603 told otherwise: we want a reader to either not see the new 1604 element at all, or see the new element with the 1605 <symbol>next</symbol> pointer correctly pointing at the rest of 1606 the list. 1607 </para> 1608 <para> 1609 Fortunately, there is a function to do this for standard 1610 <structname>struct list_head</structname> lists: 1611 <function>list_add_rcu()</function> 1612 (<filename>include/linux/list.h</filename>). 1613 </para> 1614 <para> 1615 Removing an element from the list is even simpler: we replace 1616 the pointer to the old element with a pointer to its successor, 1617 and readers will either see it, or skip over it. 1618 </para> 1619 <programlisting> 1620 list->next = old->next; 1621 </programlisting> 1622 <para> 1623 There is <function>list_del_rcu()</function> 1624 (<filename>include/linux/list.h</filename>) which does this (the 1625 normal version poisons the old object, which we don't want). 1626 </para> 1627 <para> 1628 The reader must also be careful: some CPUs can look through the 1629 <symbol>next</symbol> pointer to start reading the contents of 1630 the next element early, but don't realize that the pre-fetched 1631 contents is wrong when the <symbol>next</symbol> pointer changes 1632 underneath them. Once again, there is a 1633 <function>list_for_each_entry_rcu()</function> 1634 (<filename>include/linux/list.h</filename>) to help you. Of 1635 course, writers can just use 1636 <function>list_for_each_entry()</function>, since there cannot 1637 be two simultaneous writers. 1638 </para> 1639 <para> 1640 Our final dilemma is this: when can we actually destroy the 1641 removed element? Remember, a reader might be stepping through 1642 this element in the list right now: if we free this element and 1643 the <symbol>next</symbol> pointer changes, the reader will jump 1644 off into garbage and crash. We need to wait until we know that 1645 all the readers who were traversing the list when we deleted the 1646 element are finished. We use <function>call_rcu()</function> to 1647 register a callback which will actually destroy the object once 1648 all pre-existing readers are finished. Alternatively, 1649 <function>synchronize_rcu()</function> may be used to block until 1650 all pre-existing are finished. 1651 </para> 1652 <para> 1653 But how does Read Copy Update know when the readers are 1654 finished? The method is this: firstly, the readers always 1655 traverse the list inside 1656 <function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function> 1657 pairs: these simply disable preemption so the reader won't go to 1658 sleep while reading the list. 1659 </para> 1660 <para> 1661 RCU then waits until every other CPU has slept at least once: 1662 since readers cannot sleep, we know that any readers which were 1663 traversing the list during the deletion are finished, and the 1664 callback is triggered. The real Read Copy Update code is a 1665 little more optimized than this, but this is the fundamental 1666 idea. 1667 </para> 1668 1669<programlisting> 1670--- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 1671+++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 1672@@ -1,15 +1,18 @@ 1673 #include <linux/list.h> 1674 #include <linux/slab.h> 1675 #include <linux/string.h> 1676+#include <linux/rcupdate.h> 1677 #include <linux/mutex.h> 1678 #include <asm/errno.h> 1679 1680 struct object 1681 { 1682- /* These two protected by cache_lock. */ 1683+ /* This is protected by RCU */ 1684 struct list_head list; 1685 int popularity; 1686 1687+ struct rcu_head rcu; 1688+ 1689 atomic_t refcnt; 1690 1691 /* Doesn't change once created. */ 1692@@ -40,7 +43,7 @@ 1693 { 1694 struct object *i; 1695 1696- list_for_each_entry(i, &cache, list) { 1697+ list_for_each_entry_rcu(i, &cache, list) { 1698 if (i->id == id) { 1699 i->popularity++; 1700 return i; 1701@@ -49,19 +52,25 @@ 1702 return NULL; 1703 } 1704 1705+/* Final discard done once we know no readers are looking. */ 1706+static void cache_delete_rcu(void *arg) 1707+{ 1708+ object_put(arg); 1709+} 1710+ 1711 /* Must be holding cache_lock */ 1712 static void __cache_delete(struct object *obj) 1713 { 1714 BUG_ON(!obj); 1715- list_del(&obj->list); 1716- object_put(obj); 1717+ list_del_rcu(&obj->list); 1718 cache_num--; 1719+ call_rcu(&obj->rcu, cache_delete_rcu); 1720 } 1721 1722 /* Must be holding cache_lock */ 1723 static void __cache_add(struct object *obj) 1724 { 1725- list_add(&obj->list, &cache); 1726+ list_add_rcu(&obj->list, &cache); 1727 if (++cache_num > MAX_CACHE_SIZE) { 1728 struct object *i, *outcast = NULL; 1729 list_for_each_entry(i, &cache, list) { 1730@@ -104,12 +114,11 @@ 1731 struct object *cache_find(int id) 1732 { 1733 struct object *obj; 1734- unsigned long flags; 1735 1736- spin_lock_irqsave(&cache_lock, flags); 1737+ rcu_read_lock(); 1738 obj = __cache_find(id); 1739 if (obj) 1740 object_get(obj); 1741- spin_unlock_irqrestore(&cache_lock, flags); 1742+ rcu_read_unlock(); 1743 return obj; 1744 } 1745</programlisting> 1746 1747<para> 1748Note that the reader will alter the 1749<structfield>popularity</structfield> member in 1750<function>__cache_find()</function>, and now it doesn't hold a lock. 1751One solution would be to make it an <type>atomic_t</type>, but for 1752this usage, we don't really care about races: an approximate result is 1753good enough, so I didn't change it. 1754</para> 1755 1756<para> 1757The result is that <function>cache_find()</function> requires no 1758synchronization with any other functions, so is almost as fast on SMP 1759as it would be on UP. 1760</para> 1761 1762<para> 1763There is a further optimization possible here: remember our original 1764cache code, where there were no reference counts and the caller simply 1765held the lock whenever using the object? This is still possible: if 1766you hold the lock, no one can delete the object, so you don't need to 1767get and put the reference count. 1768</para> 1769 1770<para> 1771Now, because the 'read lock' in RCU is simply disabling preemption, a 1772caller which always has preemption disabled between calling 1773<function>cache_find()</function> and 1774<function>object_put()</function> does not need to actually get and 1775put the reference count: we could expose 1776<function>__cache_find()</function> by making it non-static, and 1777such callers could simply call that. 1778</para> 1779<para> 1780The benefit here is that the reference count is not written to: the 1781object is not altered in any way, which is much faster on SMP 1782machines due to caching. 1783</para> 1784 </sect1> 1785 1786 <sect1 id="per-cpu"> 1787 <title>Per-CPU Data</title> 1788 1789 <para> 1790 Another technique for avoiding locking which is used fairly 1791 widely is to duplicate information for each CPU. For example, 1792 if you wanted to keep a count of a common condition, you could 1793 use a spin lock and a single counter. Nice and simple. 1794 </para> 1795 1796 <para> 1797 If that was too slow (it's usually not, but if you've got a 1798 really big machine to test on and can show that it is), you 1799 could instead use a counter for each CPU, then none of them need 1800 an exclusive lock. See <function>DEFINE_PER_CPU()</function>, 1801 <function>get_cpu_var()</function> and 1802 <function>put_cpu_var()</function> 1803 (<filename class="headerfile">include/linux/percpu.h</filename>). 1804 </para> 1805 1806 <para> 1807 Of particular use for simple per-cpu counters is the 1808 <type>local_t</type> type, and the 1809 <function>cpu_local_inc()</function> and related functions, 1810 which are more efficient than simple code on some architectures 1811 (<filename class="headerfile">include/asm/local.h</filename>). 1812 </para> 1813 1814 <para> 1815 Note that there is no simple, reliable way of getting an exact 1816 value of such a counter, without introducing more locks. This 1817 is not a problem for some uses. 1818 </para> 1819 </sect1> 1820 1821 <sect1 id="mostly-hardirq"> 1822 <title>Data Which Mostly Used By An IRQ Handler</title> 1823 1824 <para> 1825 If data is always accessed from within the same IRQ handler, you 1826 don't need a lock at all: the kernel already guarantees that the 1827 irq handler will not run simultaneously on multiple CPUs. 1828 </para> 1829 <para> 1830 Manfred Spraul points out that you can still do this, even if 1831 the data is very occasionally accessed in user context or 1832 softirqs/tasklets. The irq handler doesn't use a lock, and 1833 all other accesses are done as so: 1834 </para> 1835 1836<programlisting> 1837 spin_lock(&lock); 1838 disable_irq(irq); 1839 ... 1840 enable_irq(irq); 1841 spin_unlock(&lock); 1842</programlisting> 1843 <para> 1844 The <function>disable_irq()</function> prevents the irq handler 1845 from running (and waits for it to finish if it's currently 1846 running on other CPUs). The spinlock prevents any other 1847 accesses happening at the same time. Naturally, this is slower 1848 than just a <function>spin_lock_irq()</function> call, so it 1849 only makes sense if this type of access happens extremely 1850 rarely. 1851 </para> 1852 </sect1> 1853 </chapter> 1854 1855 <chapter id="sleeping-things"> 1856 <title>What Functions Are Safe To Call From Interrupts?</title> 1857 1858 <para> 1859 Many functions in the kernel sleep (ie. call schedule()) 1860 directly or indirectly: you can never call them while holding a 1861 spinlock, or with preemption disabled. This also means you need 1862 to be in user context: calling them from an interrupt is illegal. 1863 </para> 1864 1865 <sect1 id="sleeping"> 1866 <title>Some Functions Which Sleep</title> 1867 1868 <para> 1869 The most common ones are listed below, but you usually have to 1870 read the code to find out if other calls are safe. If everyone 1871 else who calls it can sleep, you probably need to be able to 1872 sleep, too. In particular, registration and deregistration 1873 functions usually expect to be called from user context, and can 1874 sleep. 1875 </para> 1876 1877 <itemizedlist> 1878 <listitem> 1879 <para> 1880 Accesses to 1881 <firstterm linkend="gloss-userspace">userspace</firstterm>: 1882 </para> 1883 <itemizedlist> 1884 <listitem> 1885 <para> 1886 <function>copy_from_user()</function> 1887 </para> 1888 </listitem> 1889 <listitem> 1890 <para> 1891 <function>copy_to_user()</function> 1892 </para> 1893 </listitem> 1894 <listitem> 1895 <para> 1896 <function>get_user()</function> 1897 </para> 1898 </listitem> 1899 <listitem> 1900 <para> 1901 <function>put_user()</function> 1902 </para> 1903 </listitem> 1904 </itemizedlist> 1905 </listitem> 1906 1907 <listitem> 1908 <para> 1909 <function>kmalloc(GFP_KERNEL)</function> 1910 </para> 1911 </listitem> 1912 1913 <listitem> 1914 <para> 1915 <function>mutex_lock_interruptible()</function> and 1916 <function>mutex_lock()</function> 1917 </para> 1918 <para> 1919 There is a <function>mutex_trylock()</function> which does not 1920 sleep. Still, it must not be used inside interrupt context since 1921 its implementation is not safe for that. 1922 <function>mutex_unlock()</function> will also never sleep. 1923 It cannot be used in interrupt context either since a mutex 1924 must be released by the same task that acquired it. 1925 </para> 1926 </listitem> 1927 </itemizedlist> 1928 </sect1> 1929 1930 <sect1 id="dont-sleep"> 1931 <title>Some Functions Which Don't Sleep</title> 1932 1933 <para> 1934 Some functions are safe to call from any context, or holding 1935 almost any lock. 1936 </para> 1937 1938 <itemizedlist> 1939 <listitem> 1940 <para> 1941 <function>printk()</function> 1942 </para> 1943 </listitem> 1944 <listitem> 1945 <para> 1946 <function>kfree()</function> 1947 </para> 1948 </listitem> 1949 <listitem> 1950 <para> 1951 <function>add_timer()</function> and <function>del_timer()</function> 1952 </para> 1953 </listitem> 1954 </itemizedlist> 1955 </sect1> 1956 </chapter> 1957 1958 <chapter id="apiref-mutex"> 1959 <title>Mutex API reference</title> 1960!Iinclude/linux/mutex.h 1961!Ekernel/locking/mutex.c 1962 </chapter> 1963 1964 <chapter id="apiref-futex"> 1965 <title>Futex API reference</title> 1966!Ikernel/futex.c 1967 </chapter> 1968 1969 <chapter id="references"> 1970 <title>Further reading</title> 1971 1972 <itemizedlist> 1973 <listitem> 1974 <para> 1975 <filename>Documentation/locking/spinlocks.txt</filename>: 1976 Linus Torvalds' spinlocking tutorial in the kernel sources. 1977 </para> 1978 </listitem> 1979 1980 <listitem> 1981 <para> 1982 Unix Systems for Modern Architectures: Symmetric 1983 Multiprocessing and Caching for Kernel Programmers: 1984 </para> 1985 1986 <para> 1987 Curt Schimmel's very good introduction to kernel level 1988 locking (not written for Linux, but nearly everything 1989 applies). The book is expensive, but really worth every 1990 penny to understand SMP locking. [ISBN: 0201633388] 1991 </para> 1992 </listitem> 1993 </itemizedlist> 1994 </chapter> 1995 1996 <chapter id="thanks"> 1997 <title>Thanks</title> 1998 1999 <para> 2000 Thanks to Telsa Gwynne for DocBooking, neatening and adding 2001 style. 2002 </para> 2003 2004 <para> 2005 Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul 2006 Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim 2007 Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, 2008 John Ashby for proofreading, correcting, flaming, commenting. 2009 </para> 2010 2011 <para> 2012 Thanks to the cabal for having no influence on this document. 2013 </para> 2014 </chapter> 2015 2016 <glossary id="glossary"> 2017 <title>Glossary</title> 2018 2019 <glossentry id="gloss-preemption"> 2020 <glossterm>preemption</glossterm> 2021 <glossdef> 2022 <para> 2023 Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is 2024 unset, processes in user context inside the kernel would not 2025 preempt each other (ie. you had that CPU until you gave it up, 2026 except for interrupts). With the addition of 2027 <symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when 2028 in user context, higher priority tasks can "cut in": spinlocks 2029 were changed to disable preemption, even on UP. 2030 </para> 2031 </glossdef> 2032 </glossentry> 2033 2034 <glossentry id="gloss-bh"> 2035 <glossterm>bh</glossterm> 2036 <glossdef> 2037 <para> 2038 Bottom Half: for historical reasons, functions with 2039 '_bh' in them often now refer to any software interrupt, e.g. 2040 <function>spin_lock_bh()</function> blocks any software interrupt 2041 on the current CPU. Bottom halves are deprecated, and will 2042 eventually be replaced by tasklets. Only one bottom half will be 2043 running at any time. 2044 </para> 2045 </glossdef> 2046 </glossentry> 2047 2048 <glossentry id="gloss-hwinterrupt"> 2049 <glossterm>Hardware Interrupt / Hardware IRQ</glossterm> 2050 <glossdef> 2051 <para> 2052 Hardware interrupt request. <function>in_irq()</function> returns 2053 <returnvalue>true</returnvalue> in a hardware interrupt handler. 2054 </para> 2055 </glossdef> 2056 </glossentry> 2057 2058 <glossentry id="gloss-interruptcontext"> 2059 <glossterm>Interrupt Context</glossterm> 2060 <glossdef> 2061 <para> 2062 Not user context: processing a hardware irq or software irq. 2063 Indicated by the <function>in_interrupt()</function> macro 2064 returning <returnvalue>true</returnvalue>. 2065 </para> 2066 </glossdef> 2067 </glossentry> 2068 2069 <glossentry id="gloss-smp"> 2070 <glossterm><acronym>SMP</acronym></glossterm> 2071 <glossdef> 2072 <para> 2073 Symmetric Multi-Processor: kernels compiled for multiple-CPU 2074 machines. (CONFIG_SMP=y). 2075 </para> 2076 </glossdef> 2077 </glossentry> 2078 2079 <glossentry id="gloss-softirq"> 2080 <glossterm>Software Interrupt / softirq</glossterm> 2081 <glossdef> 2082 <para> 2083 Software interrupt handler. <function>in_irq()</function> returns 2084 <returnvalue>false</returnvalue>; <function>in_softirq()</function> 2085 returns <returnvalue>true</returnvalue>. Tasklets and softirqs 2086 both fall into the category of 'software interrupts'. 2087 </para> 2088 <para> 2089 Strictly speaking a softirq is one of up to 32 enumerated software 2090 interrupts which can run on multiple CPUs at once. 2091 Sometimes used to refer to tasklets as 2092 well (ie. all software interrupts). 2093 </para> 2094 </glossdef> 2095 </glossentry> 2096 2097 <glossentry id="gloss-tasklet"> 2098 <glossterm>tasklet</glossterm> 2099 <glossdef> 2100 <para> 2101 A dynamically-registrable software interrupt, 2102 which is guaranteed to only run on one CPU at a time. 2103 </para> 2104 </glossdef> 2105 </glossentry> 2106 2107 <glossentry id="gloss-timers"> 2108 <glossterm>timer</glossterm> 2109 <glossdef> 2110 <para> 2111 A dynamically-registrable software interrupt, which is run at 2112 (or close to) a given time. When running, it is just like a 2113 tasklet (in fact, they are called from the TIMER_SOFTIRQ). 2114 </para> 2115 </glossdef> 2116 </glossentry> 2117 2118 <glossentry id="gloss-up"> 2119 <glossterm><acronym>UP</acronym></glossterm> 2120 <glossdef> 2121 <para> 2122 Uni-Processor: Non-SMP. (CONFIG_SMP=n). 2123 </para> 2124 </glossdef> 2125 </glossentry> 2126 2127 <glossentry id="gloss-usercontext"> 2128 <glossterm>User Context</glossterm> 2129 <glossdef> 2130 <para> 2131 The kernel executing on behalf of a particular process (ie. a 2132 system call or trap) or kernel thread. You can tell which 2133 process with the <symbol>current</symbol> macro.) Not to 2134 be confused with userspace. Can be interrupted by software or 2135 hardware interrupts. 2136 </para> 2137 </glossdef> 2138 </glossentry> 2139 2140 <glossentry id="gloss-userspace"> 2141 <glossterm>Userspace</glossterm> 2142 <glossdef> 2143 <para> 2144 A process executing its own code outside the kernel. 2145 </para> 2146 </glossdef> 2147 </glossentry> 2148 2149 </glossary> 2150</book> 2151 2152