vfs.txt - OpenGrok cross reference for /linux-4.4.14/Documentation/filesystems/vfs.txt

Lines Matching refs:the
2 	      Overview of the Linux Virtual File System
11   This file is released under the GPLv2.
17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem
20 within the kernel which allows different filesystem implementations to
25 in the document Documentation/filesystems/Locking.
31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry
39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the
51 beasts.  They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode
57 To look up an inode requires that the VFS calls the lookup() method of
58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to
71 structure (this is the kernel-side implementation of file
73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process.
80 is done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use.
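
The "switch" described above, where the file structure takes its operation table from the inode and later calls go through that table, can be sketched in userspace C. This is a minimal model under stated assumptions: every name here (model_file, model_inode, model_file_ops, ramfile_ops, model_open, model_read) is hypothetical and only loosely mirrors the kernel's struct file, struct inode and struct file_operations.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct model_file;

struct model_file_ops {
	int  (*open)(struct model_file *f);
	long (*read)(struct model_file *f, char *buf, size_t len);
};

struct model_inode {
	const struct model_file_ops *fop; /* installed by the "filesystem" */
	const char *data;                 /* pretend file contents */
	size_t size;
};

struct model_file {
	struct model_inode *inode;
	const struct model_file_ops *ops; /* taken from the inode at open time */
	size_t pos;
};

/* A toy "filesystem" read method: copy bytes out of the inode's buffer. */
static long ramfile_read(struct model_file *f, char *buf, size_t len)
{
	size_t left = f->inode->size - f->pos;
	size_t n = len < left ? len : left;

	for (size_t i = 0; i < n; i++)
		buf[i] = f->inode->data[f->pos + i];
	f->pos += n;
	return (long)n;
}

static const struct model_file_ops ramfile_ops = {
	.open = NULL,           /* optional in this model */
	.read = ramfile_read,
};

/* Model of open: wire the file to the inode's ops, then call ->open if set. */
int model_open(struct model_file *f, struct model_inode *inode)
{
	f->inode = inode;
	f->ops = inode->fop;
	f->pos = 0;
	if (f->ops && f->ops->open)
		return f->ops->open(f);
	return 0;
}

/* Model of read: dispatch through the file's operation table. */
long model_read(struct model_file *f, char *buf, size_t len)
{
	if (!f->ops || !f->ops->read)
		return -EINVAL;
	return f->ops->read(f, buf, len);
}
```

The generic layer (model_open, model_read) never looks at the file contents itself; it only dispatches, which is the property the text attributes to the VFS.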
89 To register and unregister a filesystem, use the following API
99 the VFS will call the appropriate mount() method for the specific
100 filesystem.  A new vfsmount referring to the tree returned by ->mount()
101 will be attached to the mountpoint, so that when pathname resolution
102 reaches the mountpoint it will jump into the root of that vfsmount.
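
Registration and unregistration of a filesystem type can be pictured as maintaining a list of named types. The sketch below is a userspace model only; model_fs_type, model_register_filesystem and model_unregister_filesystem are hypothetical names, and the error codes (-EBUSY for a duplicate name, -EINVAL for an unknown type) are choices made for this model rather than something stated in the lines above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a registered filesystem type. */
struct model_fs_type {
	const char *name;
	struct model_fs_type *next;
};

static struct model_fs_type *model_fs_list;

/* Return the link that points at the named type, or the tail link. */
static struct model_fs_type **model_find_fs(const char *name)
{
	struct model_fs_type **p;

	for (p = &model_fs_list; *p; p = &(*p)->next)
		if (strcmp((*p)->name, name) == 0)
			break;
	return p;
}

/* Add a type to the list; refuse a second type with the same name. */
int model_register_filesystem(struct model_fs_type *fs)
{
	struct model_fs_type **p;

	if (fs->next)
		return -EBUSY;
	p = model_find_fs(fs->name);
	if (*p)
		return -EBUSY;
	*p = fs;
	return 0;
}

/* Remove exactly this type object from the list. */
int model_unregister_filesystem(struct model_fs_type *fs)
{
	struct model_fs_type **p;

	for (p = &model_fs_list; *p; p = &(*p)->next) {
		if (*p == fs) {
			*p = fs->next;
			fs->next = NULL;
			return 0;
		}
	}
	return -EINVAL;
}
```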
104 You can see all filesystems that are registered to the kernel in the
111 This describes the filesystem. As of kernel 2.6.39, the following
127   name: the name of the filesystem type, such as "ext2", "iso9660",
132   mount: the method to call when a new instance of this
135   kill_sb: the method to call when an instance of this filesystem
145 The mount() method has the following arguments:
147   struct file_system_type *fs_type: describes the filesystem, partly initialized
148   	by the specific filesystem code
152   const char *dev_name: the device name we are mounting.
157 The mount() method must return the root dentry of the tree requested by
158 the caller.  An active reference to its superblock must be grabbed and the
164 contains a suitable filesystem image the method creates and initializes
168 doesn't have to create a new one.  The main result from the caller's
169 point of view is a reference to the dentry at the root of the (sub)tree to
172 The most interesting member of the superblock structure that the
173 mount() method fills in is the "s_op" field. This is a pointer to
174 a "struct super_operations" which describes the next level of the
177 Usually, a filesystem uses one of the generic mount() implementations
184   mount_single: mount a filesystem which shares the instance between
187 A fill_super() callback implementation has the following arguments:
189   struct super_block *sb: the superblock structure. The callback
207 This describes how the VFS can manipulate the superblock of your
208 filesystem. As of kernel 2.6.22, the following members are defined:
251   dirty_inode: this method is called by the VFS to mark an inode dirty.
253   write_inode: this method is called when the VFS needs to write an
254 	inode to disc.  The second parameter indicates whether the write
257   drop_inode: called when the last access to the inode is dropped,
258 	with the inode->i_lock spinlock held.
263 	called regardless of the value of i_nlink)
265 	The "generic_delete_inode()" behavior is equivalent to the
266 	old practice of using "force_delete" in the put_inode() case,
267 	but does not have the races that the "force_delete()" approach
270   delete_inode: called when the VFS wants to delete an inode
272   put_super: called when the VFS wishes to free the superblock
273 	(i.e. unmount). This is called with the superblock lock held
276   	a superblock. The second parameter indicates whether the method
277 	should wait until the write out has been completed. Optional.
281   	used by the Logical Volume Manager (LVM).
286   statfs: called when the VFS needs to get filesystem statistics.
288   remount_fs: called when the filesystem is remounted. This is called
289 	with the kernel lock held
291   clear_inode: called when the VFS clears the inode. Optional
293   umount_begin: called when the VFS is unmounting a filesystem.
295   show_options: called by the VFS to show mount options for
298   quota_read: called by the VFS to read from filesystem quota file.
300   quota_write: called by the VFS to write to filesystem quota file.
302   nr_cached_objects: called by the sb cache shrinking function for the
303 	filesystem to return the number of freeable cached objects it contains.
306   free_cache_objects: called by the sb cache shrinking function for the
307 	filesystem to scan the number of objects indicated to try to free them.
311 	We can't do anything with any errors that the filesystem might
312 	encounter, hence the void return type. This will never be called if
313 	the VM is trying to reclaim under GFP_NOFS conditions, hence this
317 	scanning loop that is done. This allows the VFS to determine
322 Whoever sets up the inode is responsible for filling in the "i_op" field. This
323 is a pointer to a "struct inode_operations" which describes the methods that
330 An inode object represents an object within the filesystem.
336 This describes how the VFS can manipulate an inode in your
337 filesystem. As of kernel 2.6.22, the following members are defined:
373   create: called by the open(2) and creat(2) system calls. Only
376 	dentry). Here you will probably call d_instantiate() with the
377 	dentry and the newly created inode
379   lookup: called when the VFS needs to look up an inode in a parent
380 	directory. The name to look for is found in the dentry. This
381 	method must call d_add() to insert the found inode into the
382 	dentry. The "i_count" field in the inode structure should be
383 	incremented. If the named inode does not exist a NULL inode
384 	should be inserted into the dentry (this is called a negative
388 	If you wish to overload the dentry methods then you should
389 	initialise the "d_dop" field in the dentry; this is a pointer
391 	This method is called with the directory inode semaphore held
393   link: called by the link(2) system call. Only required if you want
395 	d_instantiate() just as you would in the create() method
397   unlink: called by the unlink(2) system call. Only required if you
400   symlink: called by the symlink(2) system call. Only required if you
402 	d_instantiate() just as you would in the create() method
404   mkdir: called by the mkdir(2) system call. Only required if you want
406 	call d_instantiate() just as you would in the create() method
408   rmdir: called by the rmdir(2) system call. Only required if you want
411   mknod: called by the mknod(2) system call to create a device (char,
415 	in the create() method
417   rename: called by the rename(2) system call to rename the object to
418 	have the parent and name given by the second inode and dentry.
421 	If no flags are supported by the filesystem then this method
422 	need not be implemented.  If some flags are supported then the
424 	flags.  Currently the following flags are implemented:
425 	(1) RENAME_NOREPLACE: this flag indicates that if the target
426 	of the rename exists the rename should fail with -EEXIST
427 	instead of replacing the target.  The VFS already checks for
428 	existence, so for local filesystems the RENAME_NOREPLACE
431 	exist; this is checked by the VFS.  Unlike plain rename,
434   readlink: called by the readlink(2) system call. Only required if
437   follow_link: called by the VFS to follow a symbolic link to the
439 	symbolic links.  This method returns the symlink body
440 	to traverse (and possibly resets the current position with
441 	nd_jump_link()).  If the body won't go away until the inode
443 	pinned, the data needed to release whatever we'd grabbed
447   put_link: called by the VFS to release resources allocated by
449 	to this method as the last parameter; only called when
452   permission: called by the VFS to check for access rights on a POSIX-like
456         mode, the filesystem must check the permission without blocking or
457 	storing to the inode.
462   setattr: called by the VFS to set attributes for a file. This method
465   getattr: called by the VFS to get attributes of a file. This method
468   setxattr: called by the VFS to set an extended attribute for a file.
472   getxattr: called by the VFS to retrieve the value of an extended
476   listxattr: called by the VFS to list all extended attributes for a
479   removexattr: called by the VFS to remove an extended attribute from
482   update_time: called by the VFS to update a specific time or the i_version of
483   	an inode.  If this is not defined the VFS will update the inode itself
486   atomic_open: called on the last component of an open.  Using this optional
487   	method the filesystem can look up, possibly create and open the file in
488   	one atomic operation.  If it cannot perform this (e.g. the file type
490 	usual 0 or -ve.  This method is only called if the last component is
492 	f_op->open().  If the file was created, the FILE_CREATED flag should be
493 	set in "opened".  In case of O_EXCL the method must only succeed if the
496   tmpfile: called at the end of O_TMPFILE open().  Optional, equivalent to
502 The address space object is used to group and manage pages in the page
503 cache.  It can be used to keep track of the pages in a file (or
504 anything else) and also track the mapping of sections of the file into
512 The first can be used independently of the others.  The VM can try to
514 pages in order to reuse them.  To do this it can call the ->writepage
517 references will be released without notice being given to the
521 lru_cache_add and mark_page_active needs to be called whenever the
525 maintains information about the PG_Dirty and PG_Writeback status of
529 The Dirty tag is primarily used by mpage_writepages - the default
530 ->writepages method.  It uses the tag to find dirty pages to call
531 ->writepage on.  If mpage_writepages is not used (i.e. the address_space
532 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
535 writing out the whole address_space.
543 typically using the 'private' field in the 'struct page'.  If such
544 information is attached, the PG_Private flag should be set.  This will
545 cause various VM routines to make extra calls into the address_space
549 application.  Data is read into the address space a whole page at a
550 time, and provided to the application either by copying of the page,
551 or by memory-mapping the page.
552 Data is written into the address space by the application, and then
553 written-back to storage typically in whole pages, however the
558 set_page_dirty to write data into the address_space, and writepage,
561 Adding and removing pages to/from an address_space is protected by the
564 When data is written to a page, the PG_Dirty flag should be set.  It
575 This describes how the VFS can manipulate mapping of a file to page cache in
596 	/* migrate the contents of a page to the specified target */
607   writepage: called by the VM to write a dirty page to backing store.
613       and should make sure the page is unlocked, either synchronously
614       or asynchronously when the write operation completes.
618       other pages from the mapping if that is easier (e.g. due to
620       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
623       See the file "Locking" for more details.
625   readpage: called by the VM to read a page from backing store.
627        unlocked and marked uptodate once the read completes.
628        If ->readpage discovers that it needs to unlock the page for
630        In this case, the page will be relocated, relocked and if
633   writepages: called by the VM to write out pages associated with the
635   	the writeback_control will specify a range of pages that must be
639   	instead.  This will choose pages from the address space that are
642   set_page_dirty: called by the VM to set a page dirty.
647 	If defined, it should set the PageDirty flag, and the
648         PAGECACHE_TAG_DIRTY tag in the radix tree.
650   readpages: called by the VM to read pages associated with the address_space
658 	Called by the generic buffered write code to ask the filesystem to
659 	prepare to write len bytes at the given offset in the file. The
660 	address_space should check that the write will be able to complete,
662 	housekeeping.  If the write will update parts of any basic-blocks on
664 	read already) so that the updated blocks can be written out properly.
666         The filesystem must return the locked pagecache page for the specified
667 	offset, in *pagep, for the caller to write into.
669 	It must be able to cope with short writes (where the length passed to
670 	write_begin is greater than the number of bytes copied into the page).
678         Returns 0 on success; < 0 on failure (which is the error code), in
682         be called. len is the original len passed to write_begin, and copied
683         is the amount that was able to be copied (copied == len is always true
684 	if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
686         The filesystem must take care of unlocking the page and releasing it
689         Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
692   bmap: called by the VFS to map a logical block offset within object to
693   	physical block number. This method is used by the FIBMAP
695   	a file, the file must have a stable mapping to a block
696   	device.  The swap system does not go through the filesystem
697   	but instead uses bmap to find out where the blocks in the file
701 	alternative to f_op->open(), the difference is that this method may open
702 	a file not necessarily originating from the same filesystem as the one
708         will be called when part or all of the page is to be removed
709 	from the address space.  This generally corresponds to either a
710 	truncation, punch hole  or a complete invalidation of the address
711 	space (in the latter case 'offset' will always be 0 and 'length'
712 	will be PAGE_CACHE_SIZE). Any private data associated with the page
714 	length is PAGE_CACHE_SIZE, then the private data should be released,
715 	because the page must be able to be completely discarded.  This may
716 	be done by calling the ->releasepage function, but in this case the
720         that the page should be freed if possible.  ->releasepage
721         should remove any private data from the page and clear the
725 	first is when the VM finds a clean page with no active users and
726         wants to make it a free page.  If ->releasepage succeeds, the
727         page will be removed from the address_space and become free.
731         through the fadvise(POSIX_FADV_DONTNEED) system call or by the
733         they believe the cache may be out of date with storage) by
735 	If the filesystem makes such a call, and needs to be certain
737         need to ensure this.  Possibly it can clear the PageUptodate
740   freepage: freepage is called once the page is no longer visible in
741         the page cache in order to allow the cleanup of any private
742 	data. Since it may be called by the memory reclaimer, it
743 	should not assume that the original address_space mapping still
746   direct_IO: called by the generic read/write routines to perform
747         direct_IO - that is IO requests which bypass the page cache
748         and transfer data directly between the storage and the
751   migrate_page:  This is used to compact the physical memory usage.
752         If the VM wants to relocate a page (maybe off a memory card
756         that it has to the page.
758   launder_page: Called before freeing a page - it writes back the dirty page. To
759   	prevent redirtying the page, it is kept locked during the whole
762   is_partially_uptodate: Called by the VM when reading a file through the
763 	pagecache when the underlying blocksize != pagesize. If the required
764 	block is up to date then the read can complete without needing the IO
765 	to bring the whole page up to date.
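
The check described for is_partially_uptodate can be illustrated with a small userspace model: a page is divided into blocks, each block has an "up to date" bit, and a read can skip the IO if every block it touches has its bit set. The function name model_range_uptodate and the bitmask representation are assumptions made for this sketch; the model also assumes the number of blocks per page does not exceed the number of bits in an unsigned long.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: bit i of uptodate_bits says whether block i of the
 * page holds valid data.  Returns true if the byte range [from, from+count)
 * only touches blocks that are up to date.
 */
bool model_range_uptodate(unsigned long uptodate_bits, size_t page_size,
			  size_t block_size, size_t from, size_t count)
{
	size_t first, last, i;

	if (count == 0 || block_size == 0 || from >= page_size)
		return false;
	if (count > page_size - from)
		count = page_size - from;   /* clamp to the end of the page */

	first = from / block_size;
	last = (from + count - 1) / block_size;
	for (i = first; i <= last; i++)
		if (!(uptodate_bits & (1UL << i)))
			return false;
	return true;
}
```

With a 4096-byte page and 1024-byte blocks, a mask of 0x3 means only the first two blocks are valid, so a read of the first 2048 bytes passes while a read that spills into the third block does not.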
767   is_dirty_writeback: Called by the VM when attempting to reclaim a page.
773 	allows a filesystem to indicate to the VM if a page should be
774 	treated as dirty or writeback for the purposes of stalling.
782 	space if necessary and pin the block lookup information in
801 This describes how the VFS can manipulate an open file. As of kernel
802 4.1, the following members are defined:
842   llseek: called when the VFS needs to move the file position index
852   iterate: called when the VFS needs to read the directory contents
854   poll: called by the VFS when a process wants to check if there is
856 	is activity. Called by the select(2) and poll(2) system calls
858   unlocked_ioctl: called by the ioctl(2) system call.
860   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
863   mmap: called by the mmap(2) system call
865   open: called by the VFS when an inode should be opened. When the VFS
866 	opens a file, it creates a new "struct file". It then calls the
867 	open method for the newly allocated file structure. You might
868 	think that the open method really belongs in
870 	done the way it is because it makes filesystems simpler to
871 	implement. The open() method is a good place to initialize the
872 	"private_data" member in the file structure if you want to point
875   flush: called by the close(2) system call to flush a file
877   release: called when the last reference to an open file is closed
879   fsync: called by the fsync(2) system call
881   fasync: called by the fcntl(2) system call when asynchronous
884   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
887   get_unmapped_area: called by the mmap(2) system call
889   check_flags: called by the fcntl(2) system call for F_SETFL command
891   flock: called by the flock(2) system call
893   splice_write: called by the VFS to splice data from a pipe to a file. This
894 		method is used by the splice(2) system call
896   splice_read: called by the VFS to splice data from file to a pipe. This
897 	       method is used by the splice(2) system call
899   setlease: called by the VFS to set or release a file lock lease. setlease
901 	    the lease in the inode after setting it.
903   fallocate: called by the VFS to preallocate blocks or punch a hole.
905 Note that the file operations are implemented by the specific
906 filesystem in which the inode resides. When opening a device node
908 support routines in the VFS which will locate the required device
909 driver information. These support routines replace the filesystem file
910 operations with those for the device driver, and then proceed to call
911 the new open() method for the file. This is how opening a device file
912 in the filesystem eventually ends up calling the device driver open()
923 This describes how a filesystem can overload the standard dentry
924 operations. Dentries and the dcache are the domain of the VFS and the
927 the VFS uses a default. As of kernel 2.6.22, the following members are
944   d_revalidate: called when the VFS needs to revalidate a dentry. This
945 	is called whenever a name look-up finds a dentry in the
947 	dentries in the dcache are valid. Network filesystems are different
948 	since things can change on the server without the client necessarily
951 	This function should return a positive value if the dentry is still
955 	If in rcu-walk mode, the filesystem must revalidate the dentry without
956 	blocking or storing to the dentry, d_parent and d_inode should not be
963  d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry.
965 	doing a lookup in the parent directory. This includes "/", "." and "..",
968 	In this case, we are less concerned with whether the dentry is still
969 	fully correct, but rather that the inode is still valid. As with
973 	This function has the same return code semantics as d_revalidate.
977   d_hash: called when the VFS adds a dentry to the hash table. The first
978 	dentry passed to d_hash is the parent directory that the name is
985 	dentry is the parent of the dentry to be compared, the second is
986 	the child dentry. len and name string are properties of the dentry
987 	to be compared. qstr is the name to compare it with.
990 	possible, and should not store into the dentry.
991 	Should not dereference pointers outside the dentry without
994 	However, our vfsmount is pinned, and RCU held, so the dentries and
1001   d_delete: called when the last reference to a dentry is dropped and the
1003 	immediately, or 0 to cache the dentry. Default is NULL which means to
1010 	being deallocated). The default when this is NULL is that the
1014   d_dname: called when the pathname of a dentry should be generated.
1017 	it's done only when the path is needed.). Real filesystems probably
1020 	held, d_dname() should not try to modify the dentry itself, unless
1023 	at the end of the buffer, and return a pointer to the first char.
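
The buffer convention described for d_dname (place the generated name at the end of the buffer and hand back a pointer to its first character) can be sketched in userspace C as follows. model_name_at_end is a hypothetical helper written for illustration; it returns NULL when the name does not fit, which is a simplification chosen for this model.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a name, copy it (with its terminating NUL)
 * to the END of buffer, and return a pointer to the first character.
 */
char *model_name_at_end(char *buffer, int buflen, const char *fmt, ...)
{
	char temp[64];
	va_list args;
	int sz;

	va_start(args, fmt);
	sz = vsnprintf(temp, sizeof(temp), fmt, args);
	va_end(args);
	if (sz < 0)
		return NULL;

	sz += 1;  /* account for the terminating NUL */
	if (sz > (int)sizeof(temp) || sz > buflen)
		return NULL;  /* name too long for this model */

	buffer += buflen - sz;
	memcpy(buffer, temp, (size_t)sz);
	return buffer;
}
```

A caller that prepends parent components can keep writing leftwards from the returned pointer, which is why the name is placed at the end rather than the start.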
1027 	This should create a new VFS mount record and return the record to the
1028 	caller.  The caller is supplied with a path parameter giving the
1029 	automount directory to describe the automount target and the parent
1031 	be returned if someone else managed to make the automount first.  If
1032 	the vfsmount creation failed, then an error code should be returned.
1033 	If -EISDIR is returned, then the directory will be treated as an
1036 	If a vfsmount is returned, the caller will attempt to mount it on the
1037 	mountpoint and will remove the vfsmount from its expiration list in
1038 	the case of failure.  The vfsmount should be returned with 2 refs on
1039 	it to prevent automatic expiration - the caller will clean up the
1042 	This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
1043 	dentry.  This is set by __d_instantiate() if S_AUTOMOUNT is set on the
1046   d_manage: called to allow the filesystem to manage the transition from a
1048 	waiting to explore behind a 'mountpoint' whilst letting the daemon go
1049 	past and construct the subtree there.  0 should be returned to let the
1052 	mounted on it and not to check the automount flag.  Any other error
1055 	If the 'rcu_walk' parameter is true, then the caller is doing a
1057 	and the caller can be asked to leave it and call again by returning
1061 	This function is only used if DCACHE_MANAGE_TRANSIT is set on the
1084 	the usage count)
1086   dput: close a handle for a dentry (decrements the usage count). If
1087 	the usage count drops to 0, and the dentry is still in its
1088 	parent's hash, the "d_delete" method is called to check whether
1089 	it should be cached. If it should not be cached, or if the dentry
1094 	subsequent call to dput() will deallocate the dentry if its
1098 	the dentry then the dentry is turned into a negative dentry
1099 	(the d_iput() method is called). If there are other
1105   d_instantiate: add a dentry to the alias hash list for the inode and
1106 	update the "d_inode" member. The "i_count" member in the
1107 	inode structure should be set/incremented. If the inode
1108 	pointer is NULL, the dentry is called a "negative
1113 	It looks up the child of that given name from the dcache
1114 	hash table. If it is found, the reference count is incremented
1115 	and the dentry is returned. The caller must use dput()
1116 	to free the dentry when it finishes using it.
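
The usage-count rules for dget and dput can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: model_dentry, model_dget, model_dput and always_delete are hypothetical names, and the model only records a "freed" flag instead of releasing memory or maintaining an LRU list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dentry's lifetime state. */
struct model_dentry {
	int count;     /* usage count */
	bool hashed;   /* still in its parent's hash */
	bool freed;    /* the model sets a flag instead of freeing */
	int (*d_delete)(const struct model_dentry *);
};

/* Open a new handle: increment the usage count. */
struct model_dentry *model_dget(struct model_dentry *d)
{
	if (d)
		d->count++;
	return d;
}

/* Close a handle; decide between caching and deleting at count 0. */
void model_dput(struct model_dentry *d)
{
	if (!d || --d->count > 0)
		return;
	/* Hashed and d_delete does not ask for deletion: keep it cached. */
	if (d->hashed && !(d->d_delete && d->d_delete(d)))
		return;
	d->hashed = false;
	d->freed = true;
}

/* Example d_delete method that never wants the dentry cached. */
static int always_delete(const struct model_dentry *d)
{
	(void)d;
	return 1;
}
```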
1124 On mount and remount the filesystem is passed a string containing a
1139 to show all the currently active options.  The rules are:
1142     from the default
1147 Options used only internally between a mount helper and the kernel
1148 (such as file descriptors), or which only have an effect during the
1149 mounting (such as ones controlling the creation of a journal) are exempt
1150 from the above rules.
1152 The underlying reason for the above rules is to make sure that a
1154 based on the information found in /proc/mounts.
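
One way to picture a show_options-style method is a function that emits only the options whose values differ from the default, so that the mount can be reproduced from the text it prints. The sketch below is a userspace model; model_mount_opts, MODEL_DEFAULT_COMMIT, the two example options and the leading-comma output style are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define MODEL_DEFAULT_COMMIT 5u   /* hypothetical default value */

/* Hypothetical per-mount option state. */
struct model_mount_opts {
	bool noatime;         /* default: false */
	unsigned int commit;  /* default: MODEL_DEFAULT_COMMIT */
};

/*
 * Write the non-default options into out, each prefixed by a comma.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int model_show_options(char *out, size_t outlen,
		       const struct model_mount_opts *o)
{
	char commit[32] = "";
	int n;

	if (o->commit != MODEL_DEFAULT_COMMIT)
		snprintf(commit, sizeof(commit), ",commit=%u", o->commit);

	n = snprintf(out, outlen, "%s%s",
		     o->noatime ? ",noatime" : "", commit);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A mount left at its defaults produces an empty string in this model, while changing both options produces ",noatime,commit=30".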
1157 them is provided with the save_mount_options() and
1165 (Note some of these resources are not up-to-date with the latest kernel
1174 A tour of the Linux VFS by Michael K. Johnson. 1996
1177 A small trail through the Linux kernel by Andries Brouwer. 2001