unshare system call:
--------------------
This document describes the new system call, unshare. The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log:
-----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents:
---------
    1) Overview
    2) Benefits
    3) Cost
    4) Requirements
    5) Functional Specification
    6) High Level Design
    7) Low Level Design
    8) Test Specification
    9) Future Work

1) Overview
-----------
Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, does not make a distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare was conceptualized by
Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process.
unshare is a natural addition to the set of available primitives on
Linux that implement the concept of process/thread as a virtual machine.

2) Benefits
-----------
unshare would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare can be used.

2.1 Per-security context namespaces
-----------------------------------
unshare can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare, a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.

2.2 unsharing of virtual memory and/or open files
-------------------------------------------------
Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare, the server has to
decide what needs to be shared at the time of creating the process
which services the request.
unshare gives the server the ability to disassociate parts of the
context during the servicing of the request. For large and complex
middleware application frameworks, this ability to unshare after the
process was created can be very useful.

3) Cost
-------
In order to not duplicate code and to handle the fact that unshare
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------
unshare reverses sharing that was done using the clone(2) system call,
so unshare should have an interface similar to clone(2). That is,
since flags in clone(int flags, void *stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare design should allow
incremental unsharing of those resources on an as-needed basis.
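
As a rough illustration of the flag convention described in the
Requirements section, the following user-space sketch models the set of
resources a task shares as a bitmask of CLONE_* constants. The helper
name remaining_shared() is hypothetical and is not part of the kernel
or of any library. The sketch only shows the bookkeeping implied by
"the same flag means share in clone and unshare in unshare"; it does
not call into the kernel and it ignores implied flags.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes the CLONE_* constants only with this */
#endif
#include <sched.h>

/*
 * Hypothetical helper, for illustration only: given the flags that were
 * shared at clone(2) time and the flags passed to a later unshare call,
 * return the flags that would still be shared afterwards. This is plain
 * bitmask arithmetic and makes no system call.
 */
static int remaining_shared(int shared_at_clone, int unshare_flags)
{
    return shared_at_clone & ~unshare_flags;
}
```

In this model, a task created with CLONE_VM | CLONE_FS | CLONE_FILES
that later unshares CLONE_FILES still shares CLONE_VM | CLONE_FS, and a
second call with CLONE_FS leaves only CLONE_VM shared. Because each
resource is its own bit, a flag added later does not change the meaning
of the flags that old applications already pass.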

5) Functional Specification
---------------------------
NAME
    unshare - disassociate parts of the process execution context

SYNOPSIS
    #include <sched.h>

    int unshare(int flags);

DESCRIPTION
    unshare allows a process to disassociate parts of its execution
    context that are currently being shared with other processes. Part
    of the execution context, such as the namespace, is shared by default
    when a new process is created using fork(2), while other parts,
    such as the virtual memory, open file descriptors, etc., may be
    shared by explicit request to share them when creating a process
    using clone(2).

    The main use of unshare is to allow a process to control its
    shared execution context without creating a new process.

    The flags argument specifies one or more of the following
    constants, bitwise-or'ed together.

    CLONE_FS
        If CLONE_FS is set, file system information of the caller
        is disassociated from the shared file system information.

    CLONE_FILES
        If CLONE_FILES is set, the file descriptor table of the
        caller is disassociated from the shared file descriptor
        table.

    CLONE_NEWNS
        If CLONE_NEWNS is set, the namespace of the caller is
        disassociated from the shared namespace.

    CLONE_VM
        If CLONE_VM is set, the virtual memory of the caller is
        disassociated from the shared virtual memory.

RETURN VALUE
    On success, zero is returned. On failure, -1 is returned and errno is
    set to indicate the error.

ERRORS
    EPERM   CLONE_NEWNS was specified by a non-root process (process
            without CAP_SYS_ADMIN).

    ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
            context that need to be unshared.

    EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
    The unshare() call is Linux-specific and should not be used
    in programs intended to be portable.
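
EXAMPLE
    The following is a minimal sketch, not taken from the original
    specification, of how a caller might use the interface described
    above. The wrapper name try_unshare() is hypothetical. The sketch
    assumes glibc, where _GNU_SOURCE has to be defined before
    including <sched.h> for the declaration of unshare() to be
    visible. It folds the RETURN VALUE convention into a single
    result: zero on success, or the negated errno value on failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc declares unshare() and CLONE_* only with this */
#endif
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical wrapper, for illustration only: call unshare(2) with the
 * given flags. Returns 0 on success, or -errno on failure (for example
 * -EPERM, -ENOMEM or -EINVAL, as listed under ERRORS).
 */
static int try_unshare(int flags)
{
    if (unshare(flags) == -1)
        return -errno;
    return 0;
}
```

    A caller that wants a private copy of its file system information
    could call try_unshare(CLONE_FS) and treat any negative result as
    an error. Requesting CLONE_NEWNS additionally requires
    CAP_SYS_ADMIN, as noted under ERRORS.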

SEE ALSO
    clone(2), fork(2)

6) High Level Design
--------------------
Depending on the flags argument, the unshare system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare for the following two reasons.
  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare operates on the current active
     task. Therefore unshare has to take the appropriate task_lock()
     before associating newly duplicated context structures.
  2) unshare has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed.
The unshare system call, on the other hand, performs the following:
  1) Check flags to force missing, but implied, flags
  2) For each context structure, call the corresponding unshare
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.
  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.
  4) Appropriately release older, shared, context structures.

7) Low Level Design
-------------------
Implementation of unshare can be grouped in the following 4 different
items:
  a) Reorganization of existing copy_* functions
  b) unshare system call service function
  c) unshare helper functions for each different process context
  d) Registration of system call number for different architectures

  7.1) Reorganization of copy_* functions
       Each copy function such as copy_mm, copy_namespace, copy_files,
       etc, had roughly two components. The first component allocated
       and duplicated the appropriate structure and the second component
       linked it to the task structure passed in as an argument to the copy
       function. The first component was split into its own function.
       These dup_* functions allocated and duplicated the appropriate
       context structure. The reorganized copy_* functions invoked
       their corresponding dup_* functions and then linked the newly
       duplicated structures to the task structure with which the
       copy function was called.

  7.2) unshare system call service function
       * Check flags
         Force implied flags. If CLONE_THREAD is set force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.
       * For each context flag, invoke the corresponding unshare_*
         helper routine with the flags passed into the system call and
         a reference to a pointer to the new unshared structure
       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.
       * For all newly unshared structures, release the corresponding
         older, shared, structures.

  7.3) unshare_* helper functions
       For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
       and CLONE_THREAD, return -EINVAL since they are not implemented yet.
       For others, check the flag value to see if the unsharing is
       required for that structure. If it is, invoke the corresponding
       dup_* function to allocate and duplicate the structure and return
       a pointer to it.

  7.4) Appropriately modify architecture specific code to register the
       new system call.

8) Test Specification
---------------------
The test for unshare should test the following:
  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.
  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of filesystem correctly
     unshares both namespace and filesystem information.
  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.
  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags.
     Verify that unsharing is performed as expected and
     that there are no oops or hangs.

9) Future Work
--------------
The current implementation of unshare does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare without affecting legacy
applications using unshare.
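
The signal-related flags mentioned above also take part in the
flag-forcing rules listed in section 7.2. Those rules can be sketched
in user space as follows. This is a minimal illustration of the rules
as the text states them, applied once in the order given, and is not
the kernel's implementation. The function name force_implied_flags()
and the signals_shared parameter (standing in for the kernel's check of
whether signals are being shared) are assumptions made for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes the CLONE_* constants only with this */
#endif
#include <sched.h>
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: return the flags with
 * every implied flag from section 7.2 forced on. Each rule is applied
 * once, in the order the text lists them.
 */
static int force_implied_flags(int flags, bool signals_shared)
{
    /* Unsharing a thread implies unsharing virtual memory. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* Unsharing virtual memory implies unsharing signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Unsharing signal handlers while signals are shared implies
     * unsharing the thread. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* Unsharing the namespace implies unsharing file system info. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

Under these rules a request for CLONE_NEWNS alone comes back as
CLONE_NEWNS | CLONE_FS, which is the behaviour that test 2 of the Test
Specification is meant to check.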