1vlocks for Bare-Metal Mutual Exclusion 2====================================== 3 4Voting Locks, or "vlocks" provide a simple low-level mutual exclusion 5mechanism, with reasonable but minimal requirements on the memory 6system. 7 8These are intended to be used to coordinate critical activity among CPUs 9which are otherwise non-coherent, in situations where the hardware 10provides no other mechanism to support this and ordinary spinlocks 11cannot be used. 12 13 14vlocks make use of the atomicity provided by the memory system for 15writes to a single memory location. To arbitrate, every CPU "votes for 16itself", by storing a unique number to a common memory location. The 17final value seen in that memory location when all the votes have been 18cast identifies the winner. 19 20In order to make sure that the election produces an unambiguous result 21in finite time, a CPU will only enter the election in the first place if 22no winner has been chosen and the election does not appear to have 23started yet. 24 25 26Algorithm 27--------- 28 29The easiest way to explain the vlocks algorithm is with some pseudo-code: 30 31 32 int currently_voting[NR_CPUS] = { 0, }; 33 int last_vote = -1; /* no votes yet */ 34 35 bool vlock_trylock(int this_cpu) 36 { 37 /* signal our desire to vote */ 38 currently_voting[this_cpu] = 1; 39 if (last_vote != -1) { 40 /* someone already volunteered himself */ 41 currently_voting[this_cpu] = 0; 42 return false; /* not ourself */ 43 } 44 45 /* let's suggest ourself */ 46 last_vote = this_cpu; 47 currently_voting[this_cpu] = 0; 48 49 /* then wait until everyone else is done voting */ 50 for_each_cpu(i) { 51 while (currently_voting[i] != 0) 52 /* wait */; 53 } 54 55 /* result */ 56 if (last_vote == this_cpu) 57 return true; /* we won */ 58 return false; 59 } 60 61 bool vlock_unlock(void) 62 { 63 last_vote = -1; 64 } 65 66 67The currently_voting[] array provides a way for the CPUs to determine 68whether an election is in progress, and plays a role analogous to the 69"entering" array in Lamport's bakery algorithm [1]. 70 71However, once the election has started, the underlying memory system 72atomicity is used to pick the winner. This avoids the need for a static 73priority rule to act as a tie-breaker, or any counters which could 74overflow. 75 76As long as the last_vote variable is globally visible to all CPUs, it 77will contain only one value that won't change once every CPU has cleared 78its currently_voting flag. 79 80 81Features and limitations 82------------------------ 83 84 * vlocks are not intended to be fair. In the contended case, it is the 85 _last_ CPU which attempts to get the lock which will be most likely 86 to win. 87 88 vlocks are therefore best suited to situations where it is necessary 89 to pick a unique winner, but it does not matter which CPU actually 90 wins. 91 92 * Like other similar mechanisms, vlocks will not scale well to a large 93 number of CPUs. 94 95 vlocks can be cascaded in a voting hierarchy to permit better scaling 96 if necessary, as in the following hypothetical example for 4096 CPUs: 97 98 /* first level: local election */ 99 my_town = towns[(this_cpu >> 4) & 0xf]; 100 I_won = vlock_trylock(my_town, this_cpu & 0xf); 101 if (I_won) { 102 /* we won the town election, let's go for the state */ 103 my_state = states[(this_cpu >> 8) & 0xf]; 104 I_won = vlock_lock(my_state, this_cpu & 0xf)); 105 if (I_won) { 106 /* and so on */ 107 I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; 108 if (I_won) { 109 /* ... */ 110 } 111 vlock_unlock(the_whole_country); 112 } 113 vlock_unlock(my_state); 114 } 115 vlock_unlock(my_town); 116 117 118ARM implementation 119------------------ 120 121The current ARM implementation [2] contains some optimisations beyond 122the basic algorithm: 123 124 * By packing the members of the currently_voting array close together, 125 we can read the whole array in one transaction (providing the number 126 of CPUs potentially contending the lock is small enough). This 127 reduces the number of round-trips required to external memory. 128 129 In the ARM implementation, this means that we can use a single load 130 and comparison: 131 132 LDR Rt, [Rn] 133 CMP Rt, #0 134 135 ...in place of code equivalent to: 136 137 LDRB Rt, [Rn] 138 CMP Rt, #0 139 LDRBEQ Rt, [Rn, #1] 140 CMPEQ Rt, #0 141 LDRBEQ Rt, [Rn, #2] 142 CMPEQ Rt, #0 143 LDRBEQ Rt, [Rn, #3] 144 CMPEQ Rt, #0 145 146 This cuts down on the fast-path latency, as well as potentially 147 reducing bus contention in contended cases. 148 149 The optimisation relies on the fact that the ARM memory system 150 guarantees coherency between overlapping memory accesses of 151 different sizes, similarly to many other architectures. Note that 152 we do not care which element of currently_voting appears in which 153 bits of Rt, so there is no need to worry about endianness in this 154 optimisation. 155 156 If there are too many CPUs to read the currently_voting array in 157 one transaction then multiple transations are still required. The 158 implementation uses a simple loop of word-sized loads for this 159 case. The number of transactions is still fewer than would be 160 required if bytes were loaded individually. 161 162 163 In principle, we could aggregate further by using LDRD or LDM, but 164 to keep the code simple this was not attempted in the initial 165 implementation. 166 167 168 * vlocks are currently only used to coordinate between CPUs which are 169 unable to enable their caches yet. This means that the 170 implementation removes many of the barriers which would be required 171 when executing the algorithm in cached memory. 172 173 packing of the currently_voting array does not work with cached 174 memory unless all CPUs contending the lock are cache-coherent, due 175 to cache writebacks from one CPU clobbering values written by other 176 CPUs. (Though if all the CPUs are cache-coherent, you should be 177 probably be using proper spinlocks instead anyway). 178 179 180 * The "no votes yet" value used for the last_vote variable is 0 (not 181 -1 as in the pseudocode). This allows statically-allocated vlocks 182 to be implicitly initialised to an unlocked state simply by putting 183 them in .bss. 184 185 An offset is added to each CPU's ID for the purpose of setting this 186 variable, so that no CPU uses the value 0 for its ID. 187 188 189Colophon 190-------- 191 192Originally created and documented by Dave Martin for Linaro Limited, for 193use in ARM-based big.LITTLE platforms, with review and input gratefully 194received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for 195grabbing most of this text out of the relevant mail thread and writing 196up the pseudocode. 197 198Copyright (C) 2012-2013 Linaro Limited 199Distributed under the terms of Version 2 of the GNU General Public 200License, as defined in linux/COPYING. 201 202 203References 204---------- 205 206[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming 207 Problem", Communications of the ACM 17, 8 (August 1974), 453-455. 208 209 http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm 210 211[2] linux/arch/arm/common/vlock.S, www.kernel.org. 212