Lines Matching refs:the

2 	      Overview of the Linux Virtual File System
11 This file is released under the GPLv2.
17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem
20 within the kernel which allows different filesystem implementations to
25 in the document Documentation/filesystems/Locking.
31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry
39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the
51 beasts. They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode
57 Looking up an inode requires that the VFS call the lookup() method of
58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to
71 structure (this is the kernel-side implementation of file
73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process.
80 is done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use.
89 To register and unregister a filesystem, use the following API
99 the VFS will call the appropriate mount() method for the specific
100 filesystem. A new vfsmount referring to the tree returned by ->mount()
101 will be attached to the mountpoint, so that when pathname resolution
102 reaches the mountpoint it will jump into the root of that vfsmount.
104 You can see all filesystems that are registered to the kernel in the
111 This describes the filesystem. As of kernel 2.6.39, the following
127 name: the name of the filesystem type, such as "ext2", "iso9660",
132 mount: the method to call when a new instance of this
135 kill_sb: the method to call when an instance of this filesystem
145 The mount() method has the following arguments:
147 struct file_system_type *fs_type: describes the filesystem, partly initialized
148 by the specific filesystem code
152 const char *dev_name: the device name we are mounting.
157 The mount() method must return the root dentry of the tree requested by
158 the caller. An active reference to its superblock must be grabbed and the
164 contains a suitable filesystem image the method creates and initializes
168 doesn't have to create a new one. The main result from the caller's
169 point of view is a reference to the dentry at the root of the (sub)tree to
172 The most interesting member of the superblock structure that the
173 mount() method fills in is the "s_op" field. This is a pointer to
174 a "struct super_operations" which describes the next level of the
177 Usually, a filesystem uses one of the generic mount() implementations
184 mount_single: mount a filesystem which shares the instance between
187 A fill_super() callback implementation has the following arguments:
189 struct super_block *sb: the superblock structure. The callback
207 This describes how the VFS can manipulate the superblock of your
208 filesystem. As of kernel 2.6.22, the following members are defined:
251 dirty_inode: this method is called by the VFS to mark an inode dirty.
253 write_inode: this method is called when the VFS needs to write an
254 inode to disc. The second parameter indicates whether the write
257 drop_inode: called when the last access to the inode is dropped,
258 with the inode->i_lock spinlock held.
263 called regardless of the value of i_nlink)
265 The "generic_delete_inode()" behavior is equivalent to the
266 old practice of using "force_delete" in the put_inode() case,
267 but does not have the races that the "force_delete()" approach
270 delete_inode: called when the VFS wants to delete an inode
272 put_super: called when the VFS wishes to free the superblock
273 (i.e. unmount). This is called with the superblock lock held
276 a superblock. The second parameter indicates whether the method
277 should wait until the write out has been completed. Optional.
281 used by the Logical Volume Manager (LVM).
286 statfs: called when the VFS needs to get filesystem statistics.
288 remount_fs: called when the filesystem is remounted. This is called
289 with the kernel lock held
291 clear_inode: called when the VFS clears the inode. Optional
293 umount_begin: called when the VFS is unmounting a filesystem.
295 show_options: called by the VFS to show mount options for
298 quota_read: called by the VFS to read from filesystem quota file.
300 quota_write: called by the VFS to write to filesystem quota file.
302 nr_cached_objects: called by the sb cache shrinking function for the
303 filesystem to return the number of freeable cached objects it contains.
306 free_cache_objects: called by the sb cache shrinking function for the
307 filesystem to scan the number of objects indicated to try to free them.
311 We can't do anything with any errors that the filesystem might
312 have encountered, hence the void return type. This will never be called if
313 the VM is trying to reclaim under GFP_NOFS conditions, hence this
317 scanning loop that is done. This allows the VFS to determine
322 Whoever sets up the inode is responsible for filling in the "i_op" field. This
323 is a pointer to a "struct inode_operations" which describes the methods that
330 An inode object represents an object within the filesystem.
336 This describes how the VFS can manipulate an inode in your
337 filesystem. As of kernel 2.6.22, the following members are defined:
373 create: called by the open(2) and creat(2) system calls. Only
376 dentry). Here you will probably call d_instantiate() with the
377 dentry and the newly created inode
379 lookup: called when the VFS needs to look up an inode in a parent
380 directory. The name to look for is found in the dentry. This
381 method must call d_add() to insert the found inode into the
382 dentry. The "i_count" field in the inode structure should be
383 incremented. If the named inode does not exist a NULL inode
384 should be inserted into the dentry (this is called a negative
388 If you wish to overload the dentry methods then you should
389 initialise the "d_dop" field in the dentry; this is a pointer
391 This method is called with the directory inode semaphore held
393 link: called by the link(2) system call. Only required if you want
395 d_instantiate() just as you would in the create() method
397 unlink: called by the unlink(2) system call. Only required if you
400 symlink: called by the symlink(2) system call. Only required if you
402 d_instantiate() just as you would in the create() method
404 mkdir: called by the mkdir(2) system call. Only required if you want
406 call d_instantiate() just as you would in the create() method
408 rmdir: called by the rmdir(2) system call. Only required if you want
411 mknod: called by the mknod(2) system call to create a device (char,
415 in the create() method
417 rename: called by the rename(2) system call to rename the object to
418 have the parent and name given by the second inode and dentry.
421 If no flags are supported by the filesystem then this method
422 need not be implemented. If some flags are supported then the
424 flags. Currently the following flags are implemented:
425 (1) RENAME_NOREPLACE: this flag indicates that if the target
426 of the rename exists the rename should fail with -EEXIST
427 instead of replacing the target. The VFS already checks for
428 existence, so for local filesystems the RENAME_NOREPLACE
431 exist; this is checked by the VFS. Unlike plain rename,
434 readlink: called by the readlink(2) system call. Only required if
437 follow_link: called by the VFS to follow a symbolic link to the
439 symbolic links. This method returns the symlink body
440 to traverse (and possibly resets the current position with
441 nd_jump_link()). If the body won't go away until the inode
443 pinned, the data needed to release whatever we'd grabbed
447 put_link: called by the VFS to release resources allocated by
449 to this method as the last parameter; only called when
452 permission: called by the VFS to check for access rights on a POSIX-like
456 mode, the filesystem must check the permission without blocking or
457 storing to the inode.
462 setattr: called by the VFS to set attributes for a file. This method
465 getattr: called by the VFS to get attributes of a file. This method
468 setxattr: called by the VFS to set an extended attribute for a file.
472 getxattr: called by the VFS to retrieve the value of an extended
476 listxattr: called by the VFS to list all extended attributes for a
479 removexattr: called by the VFS to remove an extended attribute from
482 update_time: called by the VFS to update a specific time or the i_version of
483 an inode. If this is not defined the VFS will update the inode itself
486 atomic_open: called on the last component of an open. Using this optional
487 method the filesystem can look up, possibly create and open the file in
488 one atomic operation. If it cannot perform this (e.g. the file type
490 usual 0 or -ve. This method is only called if the last component is
492 f_op->open(). If the file was created, the FILE_CREATED flag should be
493 set in "opened". In case of O_EXCL the method must only succeed if the
496 tmpfile: called at the end of O_TMPFILE open(). Optional, equivalent to
502 The address space object is used to group and manage pages in the page
503 cache. It can be used to keep track of the pages in a file (or
504 anything else) and also track the mapping of sections of the file into
512 The first can be used independently of the others. The VM can try to
514 pages in order to reuse them. To do this it can call the ->writepage
517 references will be released without notice being given to the
521 lru_cache_add and mark_page_accessed need to be called whenever the
525 maintains information about the PG_Dirty and PG_Writeback status of
529 The Dirty tag is primarily used by mpage_writepages - the default
530 ->writepages method. It uses the tag to find dirty pages to call
531 ->writepage on. If mpage_writepages is not used (i.e. the address
532 space provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
535 writing out the whole address_space.
543 typically using the 'private' field in the 'struct page'. If such
544 information is attached, the PG_Private flag should be set. This will
545 cause various VM routines to make extra calls into the address_space
549 application. Data is read into the address space a whole page at a
550 time, and provided to the application either by copying of the page,
551 or by memory-mapping the page.
552 Data is written into the address space by the application, and then
553 written-back to storage typically in whole pages, however the
558 set_page_dirty to write data into the address_space, and writepage,
561 Adding and removing pages to/from an address_space is protected by the
564 When data is written to a page, the PG_Dirty flag should be set. It
575 This describes how the VFS can manipulate mapping of a file to page cache in
596 /* migrate the contents of a page to the specified target */
607 writepage: called by the VM to write a dirty page to backing store.
613 and should make sure the page is unlocked, either synchronously
614 or asynchronously when the write operation completes.
618 other pages from the mapping if that is easier (e.g. due to
620 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
623 See the file "Locking" for more details.
625 readpage: called by the VM to read a page from backing store.
627 unlocked and marked uptodate once the read completes.
628 If ->readpage discovers that it needs to unlock the page for
630 In this case, the page will be relocated, relocked and if
633 writepages: called by the VM to write out pages associated with the
635 the writeback_control will specify a range of pages that must be
639 instead. This will choose pages from the address space that are
642 set_page_dirty: called by the VM to set a page dirty.
647 If defined, it should set the PageDirty flag, and the
648 PAGECACHE_TAG_DIRTY tag in the radix tree.
650 readpages: called by the VM to read pages associated with the address_space
658 Called by the generic buffered write code to ask the filesystem to
659 prepare to write len bytes at the given offset in the file. The
660 address_space should check that the write will be able to complete,
662 housekeeping. If the write will update parts of any basic-blocks on
664 read already) so that the updated blocks can be written out properly.
666 The filesystem must return the locked pagecache page for the specified
667 offset, in *pagep, for the caller to write into.
669 It must be able to cope with short writes (where the length passed to
670 write_begin is greater than the number of bytes copied into the page).
678 Returns 0 on success; < 0 on failure (which is the error code), in
682 be called. len is the original len passed to write_begin, and copied
683 is the amount that was able to be copied (copied == len is always true
684 if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
686 The filesystem must take care of unlocking the page and releasing it
689 Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
692 bmap: called by the VFS to map a logical block offset within an object to
693 physical block number. This method is used by the FIBMAP
695 a file, the file must have a stable mapping to a block
696 device. The swap system does not go through the filesystem
697 but instead uses bmap to find out where the blocks in the file
701 alternative to f_op->open(), the difference is that this method may open
702 a file not necessarily originating from the same filesystem as the one
708 will be called when part or all of the page is to be removed
709 from the address space. This generally corresponds to either a
710 truncation, punch hole or a complete invalidation of the address
711 space (in the latter case 'offset' will always be 0 and 'length'
712 will be PAGE_CACHE_SIZE). Any private data associated with the page
714 length is PAGE_CACHE_SIZE, then the private data should be released,
715 because the page must be able to be completely discarded. This may
716 be done by calling the ->releasepage function, but in this case the
720 that the page should be freed if possible. ->releasepage
721 should remove any private data from the page and clear the
725 first is when the VM finds a clean page with no active users and
726 wants to make it a free page. If ->releasepage succeeds, the
727 page will be removed from the address_space and become free.
731 through the fadvise(POSIX_FADV_DONTNEED) system call or by the
733 they believe the cache may be out of date with storage) by
735 If the filesystem makes such a call, and needs to be certain
737 need to ensure this. Possibly it can clear the PageUptodate
740 freepage: freepage is called once the page is no longer visible in
741 the page cache in order to allow the cleanup of any private
742 data. Since it may be called by the memory reclaimer, it
743 should not assume that the original address_space mapping still
746 direct_IO: called by the generic read/write routines to perform
747 direct_IO - that is IO requests which bypass the page cache
748 and transfer data directly between the storage and the
751 migrate_page: This is used to compact the physical memory usage.
752 If the VM wants to relocate a page (maybe off a memory card
756 that it has to the page.
758 launder_page: Called before freeing a page - it writes back the dirty page. To
759 prevent redirtying the page, it is kept locked during the whole
762 is_partially_uptodate: Called by the VM when reading a file through the
763 pagecache when the underlying blocksize != pagesize. If the required
764 block is up to date then the read can complete without needing the IO
765 to bring the whole page up to date.
767 is_dirty_writeback: Called by the VM when attempting to reclaim a page.
773 allows a filesystem to indicate to the VM if a page should be
774 treated as dirty or writeback for the purposes of stalling.
782 space if necessary and pin the block lookup information in
801 This describes how the VFS can manipulate an open file. As of kernel
802 4.1, the following members are defined:
842 llseek: called when the VFS needs to move the file position index
852 iterate: called when the VFS needs to read the directory contents
854 poll: called by the VFS when a process wants to check if there is
856 is activity. Called by the select(2) and poll(2) system calls
858 unlocked_ioctl: called by the ioctl(2) system call.
860 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
863 mmap: called by the mmap(2) system call
865 open: called by the VFS when an inode should be opened. When the VFS
866 opens a file, it creates a new "struct file". It then calls the
867 open method for the newly allocated file structure. You might
868 think that the open method really belongs in
870 done the way it is because it makes filesystems simpler to
871 implement. The open() method is a good place to initialize the
872 "private_data" member in the file structure if you want to point
875 flush: called by the close(2) system call to flush a file
877 release: called when the last reference to an open file is closed
879 fsync: called by the fsync(2) system call
881 fasync: called by the fcntl(2) system call when asynchronous
884 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
887 get_unmapped_area: called by the mmap(2) system call
889 check_flags: called by the fcntl(2) system call for F_SETFL command
891 flock: called by the flock(2) system call
893 splice_write: called by the VFS to splice data from a pipe to a file. This
894 method is used by the splice(2) system call
896 splice_read: called by the VFS to splice data from file to a pipe. This
897 method is used by the splice(2) system call
899 setlease: called by the VFS to set or release a file lock lease. setlease
901 the lease in the inode after setting it.
903 fallocate: called by the VFS to preallocate blocks or punch a hole.
905 Note that the file operations are implemented by the specific
906 filesystem in which the inode resides. When opening a device node
908 support routines in the VFS which will locate the required device
909 driver information. These support routines replace the filesystem file
910 operations with those for the device driver, and then proceed to call
911 the new open() method for the file. This is how opening a device file
912 in the filesystem eventually ends up calling the device driver open()
923 This describes how a filesystem can overload the standard dentry
924 operations. Dentries and the dcache are the domain of the VFS and the
927 the VFS uses a default. As of kernel 2.6.22, the following members are
944 d_revalidate: called when the VFS needs to revalidate a dentry. This
945 is called whenever a name look-up finds a dentry in the
947 dentries in the dcache are valid. Network filesystems are different
948 since things can change on the server without the client necessarily
951 This function should return a positive value if the dentry is still
955 If in rcu-walk mode, the filesystem must revalidate the dentry without
956 blocking or storing to the dentry, d_parent and d_inode should not be
963 d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry.
965 doing a lookup in the parent directory. This includes "/", "." and "..",
968 In this case, we are less concerned with whether the dentry is still
969 fully correct, but rather that the inode is still valid. As with
973 This function has the same return code semantics as d_revalidate.
977 d_hash: called when the VFS adds a dentry to the hash table. The first
978 dentry passed to d_hash is the parent directory that the name is
985 dentry is the parent of the dentry to be compared, the second is
986 the child dentry. len and name string are properties of the dentry
987 to be compared. qstr is the name to compare it with.
990 possible, and should not store into the dentry.
991 Should not dereference pointers outside the dentry without
994 However, our vfsmount is pinned, and RCU held, so the dentries and
1001 d_delete: called when the last reference to a dentry is dropped and the
1003 immediately, or 0 to cache the dentry. Default is NULL which means to
1010 being deallocated). The default when this is NULL is that the
1014 d_dname: called when the pathname of a dentry should be generated.
1017 it's done only when the path is needed.). Real filesystems probably
1020 held, d_dname() should not try to modify the dentry itself, unless
1023 at the end of the buffer, and returns a pointer to the first char.
1027 This should create a new VFS mount record and return the record to the
1028 caller. The caller is supplied with a path parameter giving the
1029 automount directory to describe the automount target and the parent
1031 be returned if someone else managed to make the automount first. If
1032 the vfsmount creation failed, then an error code should be returned.
1033 If -EISDIR is returned, then the directory will be treated as an
1036 If a vfsmount is returned, the caller will attempt to mount it on the
1037 mountpoint and will remove the vfsmount from its expiration list in
1038 the case of failure. The vfsmount should be returned with 2 refs on
1039 it to prevent automatic expiration - the caller will clean up the
1042 This function is only used if DCACHE_NEED_AUTOMOUNT is set on the
1043 dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the
1046 d_manage: called to allow the filesystem to manage the transition from a
1048 waiting to explore behind a 'mountpoint' whilst letting the daemon go
1049 past and construct the subtree there. 0 should be returned to let the
1052 mounted on it and not to check the automount flag. Any other error
1055 If the 'rcu_walk' parameter is true, then the caller is doing a
1057 and the caller can be asked to leave it and call again by returning
1061 This function is only used if DCACHE_MANAGE_TRANSIT is set on the
1084 the usage count)
1086 dput: close a handle for a dentry (decrements the usage count). If
1087 the usage count drops to 0, and the dentry is still in its
1088 parent's hash, the "d_delete" method is called to check whether
1089 it should be cached. If it should not be cached, or if the dentry
1094 subsequent call to dput() will deallocate the dentry if its
1098 the dentry then the dentry is turned into a negative dentry
1099 (the d_iput() method is called). If there are other
1105 d_instantiate: add a dentry to the alias hash list for the inode and
1106 update the "d_inode" member. The "i_count" member in the
1107 inode structure should be set/incremented. If the inode
1108 pointer is NULL, the dentry is called a "negative
1113 It looks up the child of that given name from the dcache
1114 hash table. If it is found, the reference count is incremented
1115 and the dentry is returned. The caller must use dput()
1116 to free the dentry when it finishes using it.
1124 On mount and remount the filesystem is passed a string containing a
1139 to show all the currently active options. The rules are:
1142 from the default
1147 Options used only internally between a mount helper and the kernel
1148 (such as file descriptors), or which only have an effect during the
1149 mounting (such as ones controlling the creation of a journal) are exempt
1150 from the above rules.
1152 The underlying reason for the above rules is to make sure that a
1154 based on the information found in /proc/mounts.
1157 them is provided with the save_mount_options() and
1165 (Note some of these resources are not up-to-date with the latest kernel
1174 A tour of the Linux VFS by Michael K. Johnson. 1996
1177 A small trail through the Linux kernel by Andries Brouwer. 2001