1Shared Subtrees
2---------------
3
4Contents:
5	1) Overview
6	2) Features
7	3) Setting mount states
	4) Use cases
9	5) Detailed semantics
10	6) Quiz
11	7) FAQ
12	8) Implementation
13
14
151) Overview
16-----------
17
18Consider the following situation:
19
A process wants to clone its own namespace, but still wants to access the CD
that got mounted recently.  Shared subtree semantics provide the necessary
mechanism to accomplish this.

They also provide the building blocks for features such as per-user namespaces
and versioned filesystems.
26
272) Features
28-----------
29
Shared subtrees provide four different flavors of mounts (struct vfsmount, to
be precise):
32
33	a. shared mount
34	b. slave mount
35	c. private mount
36	d. unbindable mount
37
38
2a) A shared mount can be replicated to as many mountpoints as needed, and
all the replicas continue to be exactly the same.

	Here is an example:

	Let's say /mnt has a mount that is shared.
	# mount --make-shared /mnt

	Note: the mount(8) command now supports the --make-shared flag,
	so the sample 'smount' program is no longer needed and has been
	removed.
50
51	# mount --bind /mnt /tmp
52	The above command replicates the mount at /mnt to the mountpoint /tmp
53	and the contents of both the mounts remain identical.
54
	# ls /mnt
	a b c

	# ls /tmp
	a b c

	Now let's say we mount a device at /tmp/a
	# mount /dev/sd0  /tmp/a

	# ls /tmp/a
	t1 t2 t3

	# ls /mnt/a
	t1 t2 t3

	Note that the mount has propagated to the mount at /mnt as well.

	The same is true when /dev/sd0 is mounted on /mnt/a instead: the
	contents will be visible under /tmp/a too.
74
75
2b) A slave mount is like a shared mount except that mount and umount events
	only propagate towards it.

	All slave mounts have a master mount, which is a shared mount.
80
81	Here is an example:
82
83	Let's say /mnt has a mount which is shared.
84	# mount --make-shared /mnt
85
86	Let's bind mount /mnt to /tmp
87	# mount --bind /mnt /tmp
88
	The new mount at /tmp becomes a shared mount and it is a replica of
	the mount at /mnt.

	Now let's make the mount at /tmp a slave of /mnt
	# mount --make-slave /tmp

	Let's mount /dev/sd0 on /mnt/a
	# mount /dev/sd0 /mnt/a

	# ls /mnt/a
	t1 t2 t3

	# ls /tmp/a
	t1 t2 t3

	Note that the mount event has propagated to the mount at /tmp.

	However, let's see what happens if we mount something on the mount at
	/tmp

	# mount /dev/sd1 /tmp/b

	# ls /tmp/b
	s1 s2 s3

	# ls /mnt/b

	Note that the mount event has not propagated to the mount at
	/mnt.
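
	The shared and slave behaviour shown in 2a and 2b can be sketched
	with a toy model. This is only an illustration and not kernel code:
	the class Mount and the helpers bind_shared(), make_slave() and
	do_mount() are hypothetical names invented for this example, and the
	model does not distinguish a private mount from a lone shared mount.

```python
# Toy model of shared/slave propagation; NOT kernel code.
class Mount:
    def __init__(self, name):
        self.name = name
        self.peers = {self}   # peer group, shared by reference among peers
        self.slaves = set()   # mounts that receive events from this mount
        self.children = {}    # dentry name -> what is mounted there


def bind_shared(src, name):
    """Replicate a shared mount: the copy joins src's peer group."""
    new = Mount(name)
    new.children = dict(src.children)
    src.peers.add(new)
    new.peers = src.peers
    return new


def make_slave(mount):
    """Leave the peer group and become a slave of one former peer."""
    others = sorted(mount.peers - {mount}, key=lambda m: m.name)
    if not others:
        return                # a lone shared mount has no master
    mount.peers.discard(mount)
    mount.peers = {mount}
    others[0].slaves.add(mount)


def do_mount(target, dentry, what):
    """Mount 'what' at target/dentry and propagate to peers and slaves."""
    seen = set()

    def visit(m):
        if m in seen:
            return
        seen.add(m)
        m.children[dentry] = what
        # Events flow to peers and down to slaves, never up to a master.
        for other in list(m.peers) + list(m.slaves):
            visit(other)

    visit(target)


mnt = Mount("/mnt")
tmp = bind_shared(mnt, "/tmp")
make_slave(tmp)
do_mount(mnt, "a", "/dev/sd0")   # also becomes visible at /tmp/a
do_mount(tmp, "b", "/dev/sd1")   # stays local to /tmp
print(sorted(tmp.children), sorted(mnt.children))
```

	In this sketch a mount on the master reaches the slave, while a
	mount on the slave stays local, matching the /mnt and /tmp example.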
117
118
2c) A private mount does not forward or receive propagation.

	This is the mount we are familiar with. It's the default type.


2d) An unbindable mount is a private mount that cannot be bind mounted.

	Let's say we have a mount at /mnt and we make it unbindable

	# mount --make-unbindable /mnt

	Let's try to bind mount this mount somewhere else.
	# mount --bind /mnt /tmp
	mount: wrong fs type, bad option, bad superblock on /mnt,
	       or too many mounted file systems

	Binding an unbindable mount is an invalid operation.
136
137
3) Setting mount states
-----------------------
139
140	The mount command (util-linux package) can be used to set mount
141	states:
142
143	mount --make-shared mountpoint
144	mount --make-slave mountpoint
145	mount --make-private mountpoint
146	mount --make-unbindable mountpoint
147
148
1494) Use cases
150------------
151
152	A) A process wants to clone its own namespace, but still wants to
153	   access the CD that got mounted recently.
154
155	   Solution:
156
157		The system administrator can make the mount at /cdrom shared
158		mount --bind /cdrom /cdrom
159		mount --make-shared /cdrom
160
161		Now any process that clones off a new namespace will have a
162		mount at /cdrom which is a replica of the same mount in the
163		parent namespace.
164
165		So when a CD is inserted and mounted at /cdrom that mount gets
166		propagated to the other mount at /cdrom in all the other clone
167		namespaces.
168
	B) A process wants its mounts to be invisible to any other process,
	   but still wants to see the other system mounts.

	   Solution:

		To begin with, the administrator can mark the entire mount tree
		as shareable.

		mount --make-rshared /

		A new process can clone off a new namespace and mark some part
		of its namespace as slave

		mount --make-rslave /myprivatetree

		Henceforth any mounts within /myprivatetree done by the
		process will not show up in any other namespace. However, mounts
		done in the parent namespace under /myprivatetree still show
		up in the process's namespace.
188
189
190	Apart from the above semantics this feature provides the
191	building blocks to solve the following problems:
192
193	C)  Per-user namespace
194
		The above semantics allow a way to share mounts across
		namespaces.  But namespaces are associated with processes. If
		namespaces are made first class objects with a user API to
		associate/disassociate a namespace with a userid, then each
		user could have his/her own namespace and tailor it to his/her
		requirements. Of course, this needs support from PAM.
201
202	D)  Versioned files
203
		If the entire mount tree is visible at multiple locations, then
		an underlying versioning file system can return a different
		version of the file depending on the path used to access that
		file.
208
209		An example is:
210
211		mount --make-shared /
212		mount --rbind / /view/v1
213		mount --rbind / /view/v2
214		mount --rbind / /view/v3
215		mount --rbind / /view/v4
216
		and if /usr has a versioning filesystem mounted, then that
		mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
		/view/v4/usr too.

		A user can request the v3 version of the file
		/usr/fs/namespace.c by accessing /view/v3/usr/fs/namespace.c.
		The underlying versioning filesystem can then determine that
		the v3 version of the filesystem is being requested and return
		the corresponding inode.
226
5) Detailed semantics
---------------------
	The section below explains the detailed semantics of the
	bind, rbind, move, mount, umount and clone-namespace operations.

	Note: the word 'vfsmount' and the noun 'mount' are used
	to mean the same thing throughout this document.
234
2355a) Mount states
236
237	A given mount can be in one of the following states
238	1) shared
239	2) slave
240	3) shared and slave
241	4) private
242	5) unbindable
243
	A 'propagation event' is defined as an event generated on a vfsmount
	that leads to mount or unmount actions in other vfsmounts.
246
247	A 'peer group' is defined as a group of vfsmounts that propagate
248	events to each other.
249
250	(1) Shared mounts
251
252		A 'shared mount' is defined as a vfsmount that belongs to a
253		'peer group'.
254
255		For example:
256			mount --make-shared /mnt
257			mount --bind /mnt /tmp
258
		The mount at /mnt and that at /tmp are both shared and belong
		to the same peer group. Anything mounted or unmounted under
		/mnt or /tmp is reflected in all the other mounts of its peer
		group.
263
264
265	(2) Slave mounts
266
267		A 'slave mount' is defined as a vfsmount that receives
268		propagation events and does not forward propagation events.
269
		A slave mount, as the name implies, has a master mount from
		which mount/unmount events are received. Events do not
		propagate from the slave mount to the master.  Only a shared
		mount can be made a slave by executing the following command

			mount --make-slave mount

		A shared mount that is made a slave is no longer shared unless
		it is modified to become shared again.
279
280	(3) Shared and Slave
281
282		A vfsmount can be both shared as well as slave.  This state
283		indicates that the mount is a slave of some vfsmount, and
284		has its own peer group too.  This vfsmount receives propagation
285		events from its master vfsmount, and also forwards propagation
286		events to its 'peer group' and to its slave vfsmounts.
287
288		Strictly speaking, the vfsmount is shared having its own
289		peer group, and this peer-group is a slave of some other
290		peer group.
291
		Only a slave vfsmount can be made 'shared and slave', either
		by executing the following command
			mount --make-shared mount
		or by moving the slave vfsmount under a shared vfsmount.

	(4) Private mount

		A 'private mount' is defined as a vfsmount that does not
		receive or forward any propagation events.

	(5) Unbindable mount

		An 'unbindable mount' is defined as a vfsmount that does not
		receive or forward any propagation events and cannot
		be bind mounted.
307
308
	State diagram:
	The state diagram below shows the state transition of a mount
	in response to various commands.
	------------------------------------------------------------------------------
	|            | make-shared | make-slave     | make-private | make-unbindable |
	|------------|-------------|----------------|--------------|-----------------|
	| shared     | shared      | *slave/private | private      | unbindable      |
	|------------|-------------|----------------|--------------|-----------------|
	| slave      | shared      | **slave        | private      | unbindable      |
	|            | and slave   |                |              |                 |
	|------------|-------------|----------------|--------------|-----------------|
	| shared     | shared      | slave          | private      | unbindable      |
	| and slave  | and slave   |                |              |                 |
	|------------|-------------|----------------|--------------|-----------------|
	| private    | shared      | **private      | private      | unbindable      |
	|------------|-------------|----------------|--------------|-----------------|
	| unbindable | shared      | **unbindable   | private      | unbindable      |
	------------------------------------------------------------------------------
328
	* If the shared mount is the only mount in its peer group, making it
	a slave makes it private automatically, because there is no master
	to which it can be slaved.

	** Slaving a non-shared mount has no effect on the mount.

	Apart from the commands listed above, the 'move' operation also changes
	the state of a mount depending on the type of the destination mount.
	This is explained in section 5d.
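
	The state diagram can also be read as a lookup table. The sketch
	below is only an illustration of that table: the names TRANSITIONS
	and next_state(), and the flag alone_in_peer_group, are hypothetical
	and do not correspond to any kernel interface.

```python
# Lookup-table form of the state diagram; an illustration, not kernel code.
TRANSITIONS = {
    "shared": {
        "make-shared": "shared",
        "make-slave": "slave",          # "private" if alone, see below
        "make-private": "private",
        "make-unbindable": "unbindable",
    },
    "slave": {
        "make-shared": "shared and slave",
        "make-slave": "slave",          # no effect
        "make-private": "private",
        "make-unbindable": "unbindable",
    },
    "shared and slave": {
        "make-shared": "shared and slave",
        "make-slave": "slave",
        "make-private": "private",
        "make-unbindable": "unbindable",
    },
    "private": {
        "make-shared": "shared",
        "make-slave": "private",        # no effect
        "make-private": "private",
        "make-unbindable": "unbindable",
    },
    "unbindable": {
        "make-shared": "shared",
        "make-slave": "unbindable",     # no effect
        "make-private": "private",
        "make-unbindable": "unbindable",
    },
}


def next_state(state, command, alone_in_peer_group=False):
    """Return the state of a mount after applying 'command' to it."""
    if state == "shared" and command == "make-slave" and alone_in_peer_group:
        # A lone shared mount has no master to be slaved to.
        return "private"
    return TRANSITIONS[state][command]


print(next_state("slave", "make-shared"))
print(next_state("shared", "make-slave", alone_in_peer_group=True))
```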
338
3395b) Bind semantics
340
341	Consider the following command
342
343	mount --bind A/a  B/b
344
345	where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
346	is the destination mount and 'b' is the dentry in the destination mount.
347
	The outcome depends on the types of the mounts 'A' and 'B'. The table
	below is a quick reference.
   ---------------------------------------------------------------------------
   |                          BIND MOUNT OPERATION                           |
   |*************************************************************************|
   | source(A)-> | shared      | private     | slave          | unbindable   |
   | dest(B)     |             |             |                |              |
   |   |         |             |             |                |              |
   |   v         |             |             |                |              |
   |*************************************************************************|
   | shared      | shared      | shared      | shared & slave | invalid      |
   |             |             |             |                |              |
   | non-shared  | shared      | private     | slave          | invalid      |
   ---------------------------------------------------------------------------
362
	Details:

	1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
	mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
	are created and mounted at the dentry 'b' on all mounts that 'B'
	propagates to. A new propagation tree containing 'C1',..,'Cn' is
	created. This propagation tree is identical to the propagation tree of
	'B'.  And finally the peer group of 'C' is merged with the peer group
	of 'A'.

	2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
	mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
	are created and mounted at the dentry 'b' on all mounts that 'B'
	propagates to. A new propagation tree is set up containing all new
	mounts 'C', 'C1', .., 'Cn' with exactly the same configuration as the
	propagation tree for 'B'.

	3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
	mount 'C', which is a clone of 'A', is created. Its root dentry is 'a'.
	'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
	'C3' ... are created and mounted at the dentry 'b' on all mounts that
	'B' propagates to. A new propagation tree containing the new mounts
	'C','C1',..  'Cn' is created. This propagation tree is identical to the
	propagation tree for 'B'. And finally the mount 'C' and its peer group
	are made the slave of mount 'Z'.  In other words, mount 'C' is in the
	state 'shared and slave'.

	4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
	invalid operation.

	5. 'A' is a private mount and 'B' is a non-shared (private, slave or
	unbindable) mount. A new mount 'C', which is a clone of 'A', is created.
	Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.

	6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C',
	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
	mounted on mount 'B' at dentry 'b'.  'C' is made a member of the
	peer group of 'A'.

	7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
	new mount 'C', which is a clone of 'A', is created. Its root dentry is
	'a'.  'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
	slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
	'Z'.  All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
	mount/unmount events on 'A' do not propagate anywhere else. Similarly
	mount/unmount events on 'C' do not propagate anywhere else.

	8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
	invalid operation. An unbindable mount cannot be bind mounted.
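
	The eight cases above reduce to a small function of the source type
	and of whether the destination is shared. The sketch below only
	restates the quick-reference table; bind_result() is a hypothetical
	helper name, and it does not model the creation of the propagated
	copies 'C1'..'Cn'.

```python
# Restates the bind quick-reference table; an illustration, not kernel code.
def bind_result(source, dest_is_shared):
    """Return the state of the new mount 'C' created by a bind mount.

    source is one of "shared", "private", "slave", "unbindable".
    """
    if source == "unbindable":
        # Cases 4 and 8: an unbindable mount cannot be bind mounted.
        raise ValueError("invalid operation: source is unbindable")
    if dest_is_shared:
        # Cases 1-3: 'C' always ends up in a peer group.
        return {
            "shared": "shared",
            "private": "shared",
            "slave": "shared and slave",
        }[source]
    # Cases 5-7: 'C' keeps the propagation type of its source.
    return {"shared": "shared", "private": "private", "slave": "slave"}[source]


for src in ("shared", "private", "slave"):
    print(src, "->", bind_result(src, True), "/", bind_result(src, False))
```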
414
4155c) Rbind semantics
416
	Rbind is the recursive form of bind. Bind replicates the specified
	mount.  Rbind replicates all the mounts in the tree belonging to the
	specified mount; in other words, an rbind mount is a bind mount
	applied to all the mounts in the tree.

	If the source tree that is rbound has some unbindable mounts,
	then the subtree under each unbindable mount is pruned in the new
	location.

	E.g.: let's say we have the following mount tree.
426
427		A
428	      /   \
429	      B   C
430	     / \ / \
431	     D E F G
432
	     Let's say all the mounts in the tree except the mount C are
	     of a type other than unbindable.

	     If this tree is rbound to, say, Z,

	     we will have the following tree at the new location.
439
440		Z
441		|
442		A'
443	       /
444	      B'		Note how the tree under C is pruned
445	     / \ 		in the new location.
446	    D' E'
447
448
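	The pruning rule can be sketched as a recursive copy that skips
	unbindable mounts together with everything below them. This is an
	illustration only: the (name, children) tuple representation and the
	function rbind_copy() are assumptions made for this example.

```python
# Recursive copy that prunes unbindable subtrees; not kernel code.
def rbind_copy(tree, unbindable):
    """Copy a (name, children) mount tree, pruning unbindable mounts.

    Returns None if the root of 'tree' is itself unbindable.
    """
    name, children = tree
    if name in unbindable:
        return None
    copies = (rbind_copy(child, unbindable) for child in children)
    return (name + "'", [c for c in copies if c is not None])


source = ("A", [("B", [("D", []), ("E", [])]),
                ("C", [("F", []), ("G", [])])])
print(rbind_copy(source, unbindable={"C"}))
```

	With C marked unbindable, the copy contains only A', B', D' and E',
	as in the figure above.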
449
4505d) Move semantics
451
452	Consider the following command
453
454	mount --move A  B/b
455
456	where 'A' is the source mount, 'B' is the destination mount and 'b' is
457	the dentry in the destination mount.
458
	The outcome depends on the types of the mounts 'A' and 'B'. The table
	below is a quick reference.
   ---------------------------------------------------------------------------
   |                          MOVE MOUNT OPERATION                           |
   |*************************************************************************|
   | source(A)-> | shared      | private     | slave          | unbindable   |
   | dest(B)     |             |             |                |              |
   |   |         |             |             |                |              |
   |   v         |             |             |                |              |
   |*************************************************************************|
   | shared      | shared      | shared      | shared & slave | invalid      |
   |             |             |             |                |              |
   | non-shared  | shared      | private     | slave          | unbindable   |
   ---------------------------------------------------------------------------
	NOTE: moving a mount residing under a shared mount is invalid.
474
475      Details follow:
476
477	1. 'A' is a shared mount and 'B' is a shared mount.  The mount 'A' is
478	mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1', 'A2'...'An'
479	are created and mounted at dentry 'b' on all mounts that receive
480	propagation from mount 'B'. A new propagation tree is created in the
481	exact same configuration as that of 'B'. This new propagation tree
482	contains all the new mounts 'A1', 'A2'...  'An'.  And this new
483	propagation tree is appended to the already existing propagation tree
484	of 'A'.
485
	2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
	mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
488	are created and mounted at dentry 'b' on all mounts that receive
489	propagation from mount 'B'. The mount 'A' becomes a shared mount and a
490	propagation tree is created which is identical to that of
491	'B'. This new propagation tree contains all the new mounts 'A1',
492	'A2'...  'An'.
493
494	3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount.  The
495	mount 'A' is mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1',
496	'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
497	receive propagation from mount 'B'. A new propagation tree is created
498	in the exact same configuration as that of 'B'. This new propagation
499	tree contains all the new mounts 'A1', 'A2'...  'An'.  And this new
500	propagation tree is appended to the already existing propagation tree of
501	'A'.  Mount 'A' continues to be the slave mount of 'Z' but it also
502	becomes 'shared'.
503
	4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
	is invalid, because mounting anything on the shared mount 'B' can
	create new mounts that get mounted on the mounts that receive
	propagation from 'B', and since the mount 'A' is unbindable, cloning
	it to mount at other mountpoints is not possible.

	5. 'A' is a private mount and 'B' is a non-shared (private, slave or
	unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.

	6. 'A' is a shared mount and 'B' is a non-shared mount.  The mount 'A'
	is mounted on mount 'B' at dentry 'b'.  Mount 'A' continues to be a
	shared mount.

	7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
	The mount 'A' is mounted on mount 'B' at dentry 'b'.  Mount 'A'
	continues to be a slave mount of mount 'Z'.

	8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
	'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
	unbindable mount.
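
	As with bind, the move cases can be restated as a small function.
	The sketch below only mirrors the quick-reference table for move;
	move_result() is a hypothetical name, and the rule that a mount
	residing under a shared mount cannot be moved is not modelled.

```python
# Restates the move quick-reference table; an illustration, not kernel code.
def move_result(source, dest_is_shared):
    """Return the state of mount 'A' after it is moved onto mount 'B'.

    source is one of "shared", "private", "slave", "unbindable".
    """
    if dest_is_shared:
        if source == "unbindable":
            # Case 4: propagation would need clones of an unbindable mount.
            raise ValueError("invalid operation: cannot propagate unbindable")
        return {
            "shared": "shared",
            "private": "shared",
            "slave": "shared and slave",
        }[source]
    # Cases 5-8: moving onto a non-shared mount leaves the type unchanged.
    if source not in ("shared", "private", "slave", "unbindable"):
        raise KeyError(source)
    return source


print(move_result("slave", True))
print(move_result("unbindable", False))
```

	The only difference from the bind table is the last column: an
	unbindable mount can be moved onto a non-shared mount, because no
	clone of it has to be created.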
524
5255e) Mount semantics
526
527	Consider the following command
528
529	mount device  B/b
530
531	'B' is the destination mount and 'b' is the dentry in the destination
532	mount.
533
534	The above operation is the same as bind operation with the exception
535	that the source mount is always a private mount.
536
537
5385f) Unmount semantics
539
540	Consider the following command
541
542	umount A
543
544	where 'A' is a mount mounted on mount 'B' at dentry 'b'.
545
	If mount 'B' is shared, then all most-recently-mounted mounts at dentry
	'b' on mounts that receive propagation from mount 'B' and that do not
	have sub-mounts within them are unmounted.

	Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
	each other.

	Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
	'B1', 'B2' and 'B3' respectively.

	Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
	mount 'B1', 'B2' and 'B3' respectively.

	If 'C1' is unmounted, all the mounts that are most-recently-mounted on
	'B1' and on the mounts that 'B1' propagates to are unmounted.

	'B1' propagates to 'B2' and 'B3'. The most recently mounted mount
	on 'B2' at dentry 'b' is 'C2', and that on mount 'B3' is 'C3'.

	So 'C1', 'C2' and 'C3' should all be unmounted.

	If either 'C2' or 'C3' has child mounts, then that mount is not
	unmounted, but all the other mounts are unmounted. However, if 'C1'
	is the mount being unmounted and 'C1' has sub-mounts, the umount
	operation fails entirely.
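
	The example can be sketched with one stack of mounts per parent,
	where the last element is the most recently mounted one. This is a
	simplified illustration: the function umount_top() and its arguments
	are hypothetical, and every parent in 'stacks' is assumed to receive
	propagation from the parent on which the umount is issued.

```python
# Toy model of umount propagation at a single dentry; not kernel code.
def umount_top(stacks, parent, has_submounts):
    """Unmount the top mount at dentry 'b' of 'parent' and propagate.

    stacks maps each parent mount name to the list of mounts stacked at
    dentry 'b' (last element is the most recently mounted one).
    has_submounts is the set of mount names that have child mounts.
    """
    target = stacks[parent][-1]
    if target in has_submounts:
        # The mount named in the umount request is busy: nothing happens.
        raise RuntimeError(target + " has sub-mounts, umount fails")
    for stack in stacks.values():
        # Propagated umounts silently skip mounts that have sub-mounts.
        if stack and stack[-1] not in has_submounts:
            stack.pop()


stacks = {"B1": ["A1", "C1"], "B2": ["A2", "C2"], "B3": ["A3", "C3"]}
umount_top(stacks, "B1", has_submounts={"C2"})
print(stacks)
```

	Here 'C1' and 'C3' are unmounted while 'C2', which has a child
	mount, is left in place.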
571
5725g) Clone Namespace
573
	A cloned namespace contains all the mounts of the parent
	namespace.

	Let's say 'A' and 'B' are the corresponding mounts in the parent and the
	child namespace.

	If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
	each other.

	If 'A' is a slave mount of 'Z', then 'B' is also a slave mount of
	'Z'.

	If 'A' is a private mount, then 'B' is a private mount too.

	If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
589
590
5916) Quiz
592
593	A. What is the result of the following command sequence?
594
595		mount --bind /mnt /mnt
596		mount --make-shared /mnt
597		mount --bind /mnt /tmp
598		mount --move /tmp /mnt/1
599
		What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
		Should they all be identical, or should only /mnt and /mnt/1
		be identical?
603
604
605	B. What is the result of the following command sequence?
606
607		mount --make-rshared /
608		mkdir -p /v/1
609		mount --rbind / /v/1
610
		What should the contents of /v/1/v/1 be?
612
613
614	C. What is the result of the following command sequence?
615
616		mount --bind /mnt /mnt
617		mount --make-shared /mnt
618		mkdir -p /mnt/1/2/3 /mnt/1/test
619		mount --bind /mnt/1 /tmp
620		mount --make-slave /mnt
621		mount --make-shared /mnt
622		mount --bind /mnt/1/2 /tmp1
623		mount --make-slave /mnt
624
		At this point we have the first mount at /tmp and
		its root dentry is 1. Let's call this mount 'A'.
		Then we have a second mount at /tmp1 with root
		dentry 2. Let's call this mount 'B'.
		Next we have a third mount at /mnt with root dentry
		mnt. Let's call this mount 'C'.

		'B' is the slave of 'A' and 'C' is a slave of 'B'
		A -> B -> C

		At this point, if we execute the following command

		mount --bind /bin /tmp/test

		the mount is attempted on 'A'.

		Will the mount propagate to 'B' and 'C'?

		What would the contents of /mnt/1/test be?
645
6467) FAQ
647
	Q1. Why is bind mount needed? How is it different from symbolic links?

		Symbolic links can get stale if the destination mount gets
		unmounted or moved. Bind mounts continue to exist even if the
		other mount is unmounted or moved.

	Q2. Why can't the shared subtree be implemented using exportfs?

		exportfs is a heavyweight way of accomplishing part of what
		shared subtree can do. I cannot imagine a way to implement the
		semantics of slave mount using exportfs.

	Q3. Why is unbindable mount needed?

		Let's say we want to replicate the mount tree at multiple
		locations within the same subtree.

		If one rbind mounts a tree within the same subtree 'n' times,
		the number of mounts created grows very rapidly (at least
		exponentially) with 'n'.  Having unbindable mounts can help
		prune the unneeded bind mounts. Here is an example.
668
669		step 1:
670		   let's say the root tree has just two directories with
671		   one vfsmount.
672				    root
673				   /    \
674				  tmp    usr
675
676		    And we want to replicate the tree at multiple
677		    mountpoints under /root/tmp
678
679		step2:
680		      mount --make-shared /root
681
682		      mkdir -p /tmp/m1
683
684		      mount --rbind /root /tmp/m1
685
686		      the new tree now looks like this:
687
688				    root
689				   /    \
690				 tmp    usr
691				/
692			       m1
693			      /  \
694			     tmp  usr
695			     /
696			    m1
697
698			  it has two vfsmounts
699
700		step3:
701			    mkdir -p /tmp/m2
702			    mount --rbind /root /tmp/m2
703
704			the new tree now looks like this:
705
706				      root
707				     /    \
708				   tmp     usr
709				  /    \
710				m1       m2
711			       / \       /  \
712			     tmp  usr   tmp  usr
713			     / \          /
714			    m1  m2      m1
715				/ \     /  \
716			      tmp usr  tmp   usr
717			      /        / \
718			     m1       m1  m2
719			    /  \
720			  tmp   usr
721			  /  \
722			 m1   m2
723
724		       it has 6 vfsmounts
725
726		step 4:
727			  mkdir -p /tmp/m3
728			  mount --rbind /root /tmp/m3
729
730			  I won't draw the tree..but it has 24 vfsmounts
731
732
		At step i the number of vfsmounts is V[i] = i*V[i-1], with
		V[1] = 1, so V[i] = i! (1, 2, 6, 24, ...).  This grows even
		faster than an exponential function, and the tree has far more
		mounts than we really needed in the first place.
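
		The recurrence can be checked with a few lines of code. The
		function name vfsmount_counts() is made up for this sketch; it
		simply iterates V[i] = i*V[i-1] starting from the single
		vfsmount of step 1.

```python
import math


def vfsmount_counts(steps):
    """Return [V[1], ..., V[steps]] for the recurrence V[i] = i * V[i-1]."""
    counts = [1]                      # step 1: one vfsmount
    for i in range(2, steps + 1):
        counts.append(i * counts[-1])
    return counts


counts = vfsmount_counts(4)
print(counts)
# The recurrence is the factorial, which outgrows any fixed exponential.
print(all(v == math.factorial(i) for i, v in enumerate(counts, start=1)))
```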
736
		One could use a series of umounts at each step to prune
		out the unneeded mounts. But there is a better solution.
		Unbindable (unclonable) mounts come in handy here.
740
741		step 1:
742		   let's say the root tree has just two directories with
743		   one vfsmount.
744				    root
745				   /    \
746				  tmp    usr
747
748		    How do we set up the same tree at multiple locations under
749		    /root/tmp
750
751		step2:
752		      mount --bind /root/tmp /root/tmp
753
754		      mount --make-rshared /root
755		      mount --make-unbindable /root/tmp
756
757		      mkdir -p /tmp/m1
758
759		      mount --rbind /root /tmp/m1
760
761		      the new tree now looks like this:
762
763				    root
764				   /    \
765				 tmp    usr
766				/
767			       m1
768			      /  \
769			     tmp  usr
770
771		step3:
772			    mkdir -p /tmp/m2
773			    mount --rbind /root /tmp/m2
774
775		      the new tree now looks like this:
776
777				    root
778				   /    \
779				 tmp    usr
780				/   \
781			       m1     m2
782			      /  \     / \
783			     tmp  usr tmp usr
784
785		step4:
786
787			    mkdir -p /tmp/m3
788			    mount --rbind /root /tmp/m3
789
790		      the new tree now looks like this:
791
792				    	  root
793				      /    	  \
794				     tmp    	   usr
795			         /    \    \
796			       m1     m2     m3
797			      /  \     / \    /  \
798			     tmp  usr tmp usr tmp usr
799
8008) Implementation
801
8A) Data structure

	Four new fields are introduced to struct vfsmount:
	->mnt_share
	->mnt_slave_list
	->mnt_slave
	->mnt_master

	->mnt_share links together all the mounts to/from which this vfsmount
		sends/receives propagation events.

	->mnt_slave_list links all the mounts to which this vfsmount
		propagates.
815
816	->mnt_slave links together all the slaves that its master vfsmount
817		propagates to.
818
819	->mnt_master points to the master vfsmount from which this vfsmount
820		receives propagation.
821
822	->mnt_flags takes two more flags to indicate the propagation status of
823		the vfsmount.  MNT_SHARE indicates that the vfsmount is a shared
824		vfsmount.  MNT_UNCLONABLE indicates that the vfsmount cannot be
825		replicated.
826
827	All the shared vfsmounts in a peer group form a cyclic list through
828	->mnt_share.
829
	All vfsmounts with the same ->mnt_master form a cyclic list anchored
	in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
832
833	 ->mnt_master can point to arbitrary (and possibly different) members
834	 of master peer group.  To find all immediate slaves of a peer group
835	 you need to go through _all_ ->mnt_slave_list of its members.
836	 Conceptually it's just a single set - distribution among the
837	 individual lists does not affect propagation or the way propagation
838	 tree is modified by operations.
839
840	All vfsmounts in a peer group have the same ->mnt_master.  If it is
841	non-NULL, they form a contiguous (ordered) segment of slave list.
842
	An example propagation tree is shown in the figure below.
	[ NOTE: Though it looks like a forest, if we consider all the shared
	mounts as a conceptual entity called 'pnode', it becomes a tree]
846
847
848		        A <--> B <--> C <---> D
849		       /|\	      /|      |\
850		      / F G	     J K      H I
851		     /
852		    E<-->K
853			/|\
854		       M L N
855
	In the above figure A, B, C and D are all shared and propagate to each
	other.  'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
	mounts 'J' and 'K'; and 'D' has 2 slave mounts 'H' and 'I'.
	'E' is also shared with 'K' and they propagate to each other, and
	'K' has 3 slaves 'M', 'L' and 'N'.
861
862	A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
863
864	A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
865
866	E's ->mnt_share links with ->mnt_share of K
867	'E', 'K', 'F', 'G' have their ->mnt_master point to struct
868				vfsmount of 'A'
869	'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
870	K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
871
872	C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
873	J and K's ->mnt_master points to struct vfsmount of C
874	and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
875	'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
876
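
	The linkage can be sketched with a toy class. This is not the kernel
	data structure: PropMount, join_peer_group(), add_slave(), peers()
	and immediate_slaves() are hypothetical names, ->mnt_slave_list is
	modelled as a plain Python list instead of a list anchored through
	->mnt_slave, and only the top peer group of the figure is built.

```python
# Toy model of the propagation links; NOT the kernel's struct vfsmount.
class PropMount:
    def __init__(self, name):
        self.name = name
        self.mnt_share = self        # cyclic peer list; alone at first
        self.mnt_slave_list = []     # slaves anchored at this mount
        self.mnt_master = None       # master this mount receives from


def join_peer_group(member, new):
    """Insert 'new' into the cyclic ->mnt_share list right after 'member'."""
    new.mnt_share = member.mnt_share
    member.mnt_share = new


def add_slave(master, slave):
    master.mnt_slave_list.append(slave)
    slave.mnt_master = master


def peers(mount):
    """Walk the cyclic list once, starting and ending at 'mount'."""
    found = [mount]
    cursor = mount.mnt_share
    while cursor is not mount:
        found.append(cursor)
        cursor = cursor.mnt_share
    return found


def immediate_slaves(mount):
    """Slaves of a peer group: go through ALL members' slave lists."""
    return [s for p in peers(mount) for s in p.mnt_slave_list]


m = {n: PropMount(n) for n in "ABCDEFGHIJK"}
join_peer_group(m["A"], m["B"])
join_peer_group(m["B"], m["C"])
join_peer_group(m["C"], m["D"])
for master, slaves in (("A", "EFG"), ("C", "JK"), ("D", "HI")):
    for s in slaves:
        add_slave(m[master], m[s])

print([p.name for p in peers(m["B"])])
print(sorted(s.name for s in immediate_slaves(m["B"])))
```

	Starting from any member of the peer group yields the same set of
	immediate slaves, which is why the distribution of slaves among the
	individual lists does not affect propagation.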
877
878	NOTE: The propagation tree is orthogonal to the mount tree.
879
8B) Locking
881
882	->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
883	by namespace_sem (exclusive for modifications, shared for reading).
884
885	Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
886	There are two exceptions: do_add_mount() and clone_mnt().
887	The former modifies a vfsmount that has not been visible in any shared
888	data structures yet.
889	The latter holds namespace_sem and the only references to vfsmount
890	are in lists that can't be traversed without namespace_sem.
891
8C) Algorithm

	The crux of the implementation resides in the rbind/move operation.

	The overall algorithm breaks the operation into 3 phases (look at
	attach_recursive_mnt() and propagate_mnt()):

	1. prepare phase.
	2. commit phase.
	3. abort phase.
902
903	Prepare phase:
904
905	for each mount in the source tree:
906		   a) Create the necessary number of mount trees to
907		   	be attached to each of the mounts that receive
908			propagation from the destination mount.
909		   b) Do not attach any of the trees to its destination.
910		      However note down its ->mnt_parent and ->mnt_mountpoint
911		   c) Link all the new mounts to form a propagation tree that
912		      is identical to the propagation tree of the destination
913		      mount.
914
		   If this phase is successful, there should be 'n' new
		   propagation trees, where 'n' is the number of mounts in the
		   source tree, and 'm' new mount trees, where 'm' is the
		   number of mounts to which the destination mount
		   propagates.  Go to the commit phase.

		   If any memory allocation fails, go to the abort phase.

	Commit phase:
		Attach each of the mount trees to its corresponding
		destination mount.

	Abort phase:
		Delete all the newly created trees.
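
	The three phases follow a common all-or-nothing pattern, which can
	be sketched as below. This is an illustration of the pattern only
	and is not how attach_recursive_mnt() or propagate_mnt() are
	written: attach_with_propagation() and its 'clone' and 'attach'
	callbacks are hypothetical names, and an allocation failure is
	modelled as a MemoryError raised by 'clone'.

```python
# All-or-nothing sketch of the prepare/commit/abort phases; not kernel code.
def attach_with_propagation(source_tree, destinations, clone, attach):
    """Attach a copy of source_tree to every destination, or to none.

    clone(tree) returns a new copy and may raise MemoryError.
    attach(tree, dest) makes the copy visible and is assumed not to fail.
    """
    prepared = []
    try:
        # Prepare phase: create every copy, but attach nothing yet.
        for dest in destinations:
            prepared.append((clone(source_tree), dest))
    except MemoryError:
        # Abort phase: drop all newly created trees.
        prepared.clear()
        return False
    # Commit phase: nothing can fail any more, so attach everything.
    for tree, dest in prepared:
        attach(tree, dest)
    return True


attached = []
ok = attach_with_propagation(["A"], ["B1", "B2", "B3"], list,
                             lambda tree, dest: attached.append(dest))
print(ok, attached)
```

	Deferring every attach until all copies exist means that a failure
	part-way through never leaves some destinations with the new mount
	and others without it.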
931
932	NOTE: all the propagation related functionality resides in the file
933	pnode.c
934
935
936------------------------------------------------------------------------
937
938version 0.1  (created the initial document, Ram Pai linuxram@us.ibm.com)
939version 0.2  (Incorporated comments from Al Viro)
940