Linux Filesystems API

Linux Filesystems API This documentation is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA For more details see the file COPYING in the source distribution of Linux. The Linux VFS The Filesystem types LINUX Kernel Hackers Manual July 2017 enum positive_aop_returns 9 4.1.27 enum positive_aop_returns aop return codes with specific semantics Synopsis enum positive_aop_returns { AOP_WRITEPAGE_ACTIVATE, AOP_TRUNCATED_PAGE }; Constants AOP_WRITEPAGE_ACTIVATE Informs the caller that page writeback has completed, that the page is still locked, and should be considered active. The VM uses this hint to return the page to the active list -- it won't be a candidate for writeback again in the near future. Other callers must be careful to unlock the page if they get this return. Returned by writepage; AOP_TRUNCATED_PAGE The AOP method that was handed a locked page has unlocked it and the page might have been truncated. The caller should back up to acquiring a new page and trying again. The aop will be taking reasonable precautions not to livelock. If the caller held a page reference, it should drop it before retrying. Returned by readpage. Description address_space_operation functions return these large constants to indicate special semantics to the caller. These are much larger than the bytes in a page to allow for functions that return the number of bytes operated on in a given page. LINUX Kernel Hackers Manual July 2017 sb_end_write 9 4.1.27 sb_end_write drop write access to a superblock Synopsis void sb_end_write struct super_block * sb Arguments sb the super we wrote to Description Decrement number of writers to the filesystem. Wake up possible waiters wanting to freeze the filesystem. LINUX Kernel Hackers Manual July 2017 sb_end_pagefault 9 4.1.27 sb_end_pagefault drop write access to a superblock from a page fault Synopsis void sb_end_pagefault struct super_block * sb Arguments sb the super we wrote to Description Decrement number of processes handling write page fault to the filesystem. Wake up possible waiters wanting to freeze the filesystem. LINUX Kernel Hackers Manual July 2017 sb_end_intwrite 9 4.1.27 sb_end_intwrite drop write access to a superblock for internal fs purposes Synopsis void sb_end_intwrite struct super_block * sb Arguments sb the super we wrote to Description Decrement fs-internal number of writers to the filesystem. Wake up possible waiters wanting to freeze the filesystem. LINUX Kernel Hackers Manual July 2017 sb_start_write 9 4.1.27 sb_start_write get write access to a superblock Synopsis void sb_start_write struct super_block * sb Arguments sb the super we write to Description When a process wants to write data or metadata to a file system (i.e. dirty a page or an inode), it should embed the operation in a sb_start_write - sb_end_write pair to get exclusion against file system freezing. This function increments number of writers preventing freezing. If the file system is already frozen, the function waits until the file system is thawed. Since freeze protection behaves as a lock, users have to preserve ordering of freeze protection and other filesystem locks. Generally, freeze protection should be the outermost lock. In particular, we have: sb_start_write -> i_mutex (write path, truncate, directory ops, ...) -> s_umount (freeze_super, thaw_super) LINUX Kernel Hackers Manual July 2017 sb_start_pagefault 9 4.1.27 sb_start_pagefault get write access to a superblock from a page fault Synopsis void sb_start_pagefault struct super_block * sb Arguments sb the super we write to Description When a process starts handling write page fault, it should embed the operation into sb_start_pagefault - sb_end_pagefault pair to get exclusion against file system freezing. This is needed since the page fault is going to dirty a page. This function increments number of running page faults preventing freezing. If the file system is already frozen, the function waits until the file system is thawed. Since page fault freeze protection behaves as a lock, users have to preserve ordering of freeze protection and other filesystem locks. It is advised to put sb_start_pagefault close to mmap_sem in lock ordering. Page fault handling code implies lock dependency mmap_sem -> sb_start_pagefault LINUX Kernel Hackers Manual July 2017 inode_inc_iversion 9 4.1.27 inode_inc_iversion increments i_version Synopsis void inode_inc_iversion struct inode * inode Arguments inode inode that need to be updated Description Every time the inode is modified, the i_version field will be incremented. The filesystem has to be mounted with i_version flag The Directory Cache LINUX Kernel Hackers Manual July 2017 __d_drop 9 4.1.27 __d_drop drop a dentry Synopsis void __d_drop struct dentry * dentry Arguments dentry dentry to drop Description d_drop unhashes the entry from the parent dentry hashes, so that it won't be found through a VFS lookup any more. Note that this is different from deleting the dentry - d_delete will try to mark the dentry negative if possible, giving a successful _negative_ lookup, while d_drop will just make the cache lookup fail. d_drop is used mainly for stuff that wants to invalidate a dentry for some reason (NFS timeouts or autofs deletes). __d_drop requires dentry->d_lock. LINUX Kernel Hackers Manual July 2017 shrink_dcache_sb 9 4.1.27 shrink_dcache_sb shrink dcache for a superblock Synopsis void shrink_dcache_sb struct super_block * sb Arguments sb superblock Description Shrink the dcache for the specified super block. This is used to free the dcache before unmounting a file system. LINUX Kernel Hackers Manual July 2017 have_submounts 9 4.1.27 have_submounts check for mounts over a dentry Synopsis int have_submounts struct dentry * parent Arguments parent dentry to check. Description Return true if the parent or its subdirectories contain a mount point LINUX Kernel Hackers Manual July 2017 shrink_dcache_parent 9 4.1.27 shrink_dcache_parent prune dcache Synopsis void shrink_dcache_parent struct dentry * parent Arguments parent parent of entries to prune Description Prune the dcache to remove unused children of the parent dentry. LINUX Kernel Hackers Manual July 2017 d_invalidate 9 4.1.27 d_invalidate detach submounts, prune dcache, and drop Synopsis void d_invalidate struct dentry * dentry Arguments dentry dentry to invalidate (aka detach, prune and drop) Description no dcache lock. The final d_drop is done as an atomic operation relative to rename_lock ensuring there are no races with d_set_mounted. This ensures there are no unhashed dentries on the path to a mountpoint. LINUX Kernel Hackers Manual July 2017 d_alloc 9 4.1.27 d_alloc allocate a dcache entry Synopsis struct dentry * d_alloc struct dentry * parent const struct qstr * name Arguments parent parent of entry to allocate name qstr of the name Description Allocates a dentry. It returns NULL if there is insufficient memory available. On a success the dentry is returned. The name passed in is copied and the copy passed in may be reused after this call. LINUX Kernel Hackers Manual July 2017 d_alloc_pseudo 9 4.1.27 d_alloc_pseudo allocate a dentry (for lookup-less filesystems) Synopsis struct dentry * d_alloc_pseudo struct super_block * sb const struct qstr * name Arguments sb the superblock name qstr of the name Description For a filesystem that just pins its dentries in memory and never performs lookups at all, return an unhashed IS_ROOT dentry. LINUX Kernel Hackers Manual July 2017 d_instantiate 9 4.1.27 d_instantiate fill in inode information for a dentry Synopsis void d_instantiate struct dentry * entry struct inode * inode Arguments entry dentry to complete inode inode to attach to this dentry Description Fill in inode information in the entry. This turns negative dentries into productive full members of society. NOTE! This assumes that the inode count has been incremented (or otherwise set) by the caller to indicate that it is now in use by the dcache. LINUX Kernel Hackers Manual July 2017 d_instantiate_no_diralias 9 4.1.27 d_instantiate_no_diralias instantiate a non-aliased dentry Synopsis int d_instantiate_no_diralias struct dentry * entry struct inode * inode Arguments entry dentry to complete inode inode to attach to this dentry Description Fill in inode information in the entry. If a directory alias is found, then return an error (and drop inode). Together with d_materialise_unique this guarantees that a directory inode may never have more than one alias. LINUX Kernel Hackers Manual July 2017 d_find_any_alias 9 4.1.27 d_find_any_alias find any alias for a given inode Synopsis struct dentry * d_find_any_alias struct inode * inode Arguments inode inode to find an alias for Description If any aliases exist for the given inode, take and return a reference for one of them. If no aliases exist, return NULL. LINUX Kernel Hackers Manual July 2017 d_obtain_alias 9 4.1.27 d_obtain_alias find or allocate a DISCONNECTED dentry for a given inode Synopsis struct dentry * d_obtain_alias struct inode * inode Arguments inode inode to allocate the dentry for Description Obtain a dentry for an inode resulting from NFS filehandle conversion or similar open by handle operations. The returned dentry may be anonymous, or may have a full name (if the inode was already in the cache). When called on a directory inode, we must ensure that the inode only ever has one dentry. If a dentry is found, that is returned instead of allocating a new one. On successful return, the reference to the inode has been transferred to the dentry. In case of an error the reference on the inode is released. To make it easier to use in export operations a NULL or IS_ERR inode may be passed in and the error will be propagated to the return value, with a NULL inode replaced by ERR_PTR(-ESTALE). LINUX Kernel Hackers Manual July 2017 d_obtain_root 9 4.1.27 d_obtain_root find or allocate a dentry for a given inode Synopsis struct dentry * d_obtain_root struct inode * inode Arguments inode inode to allocate the dentry for Description Obtain an IS_ROOT dentry for the root of a filesystem. We must ensure that directory inodes only ever have one dentry. If a dentry is found, that is returned instead of allocating a new one. On successful return, the reference to the inode has been transferred to the dentry. In case of an error the reference on the inode is released. A NULL or IS_ERR inode may be passed in and will be the error will be propagate to the return value, with a NULL inode replaced by ERR_PTR(-ESTALE). LINUX Kernel Hackers Manual July 2017 d_add_ci 9 4.1.27 d_add_ci lookup or allocate new dentry with case-exact name Synopsis struct dentry * d_add_ci struct dentry * dentry struct inode * inode struct qstr * name Arguments dentry the negative dentry that was passed to the parent's lookup func inode the inode case-insensitive lookup has found name the case-exact name to be associated with the returned dentry Description This is to avoid filling the dcache with case-insensitive names to the same inode, only the actual correct case is stored in the dcache for case-insensitive filesystems. For a case-insensitive lookup match and if the the case-exact dentry already exists in in the dcache, use it and return it. If no entry exists with the exact case name, allocate new dentry with the exact case, and return the spliced entry. LINUX Kernel Hackers Manual July 2017 d_lookup 9 4.1.27 d_lookup search for a dentry Synopsis struct dentry * d_lookup const struct dentry * parent const struct qstr * name Arguments parent parent dentry name qstr of name we wish to find Returns dentry, or NULL d_lookup searches the children of the parent dentry for the name in question. If the dentry is found its reference count is incremented and the dentry is returned. The caller must use dput to free the entry when it has finished using it. NULL is returned if the dentry does not exist. LINUX Kernel Hackers Manual July 2017 d_hash_and_lookup 9 4.1.27 d_hash_and_lookup hash the qstr then search for a dentry Synopsis struct dentry * d_hash_and_lookup struct dentry * dir struct qstr * name Arguments dir Directory to search in name qstr of name we wish to find Description On lookup failure NULL is returned; on bad name - ERR_PTR(-error) LINUX Kernel Hackers Manual July 2017 d_delete 9 4.1.27 d_delete delete a dentry Synopsis void d_delete struct dentry * dentry Arguments dentry The dentry to delete Description Turn the dentry into a negative dentry if possible, otherwise remove it from the hash queues so it can be deleted later LINUX Kernel Hackers Manual July 2017 d_rehash 9 4.1.27 d_rehash add an entry back to the hash Synopsis void d_rehash struct dentry * entry Arguments entry dentry to add to the hash Description Adds a dentry to the hash according to its name. LINUX Kernel Hackers Manual July 2017 dentry_update_name_case 9 4.1.27 dentry_update_name_case update case insensitive dentry with a new name Synopsis void dentry_update_name_case struct dentry * dentry struct qstr * name Arguments dentry dentry to be updated name new name Description Update a case insensitive dentry with new case of name. dentry must have been returned by d_lookup with name name. Old and new name lengths must match (ie. no d_compare which allows mismatched name lengths). Parent inode i_mutex must be held over d_lookup and into this call (to keep renames and concurrent inserts, and readdir(2) away). LINUX Kernel Hackers Manual July 2017 d_splice_alias 9 4.1.27 d_splice_alias splice a disconnected dentry into the tree if one exists Synopsis struct dentry * d_splice_alias struct inode * inode struct dentry * dentry Arguments inode the inode which may have a disconnected dentry dentry a negative dentry which we want to point to the inode. Description If inode is a directory and has an IS_ROOT alias, then d_move that in place of the given dentry and return it, else simply d_add the inode to the dentry and return NULL. If a non-IS_ROOT directory is found, the filesystem is corrupt, and we should error out directories can't have multiple aliases. This is needed in the lookup routine of any filesystem that is exportable (via knfsd) so that we can build dcache paths to directories effectively. If a dentry was found and moved, then it is returned. Otherwise NULL is returned. This matches the expected return value of ->lookup. Cluster filesystems may call this function with a negative, hashed dentry. In that case, we know that the inode will be a regular file, and also this will only occur during atomic_open. So we need to check for the dentry being already hashed only in the final case. LINUX Kernel Hackers Manual July 2017 d_path 9 4.1.27 d_path return the path of a dentry Synopsis char * d_path const struct path * path char * buf int buflen Arguments path path to report buf buffer to return value in buflen buffer length Description Convert a dentry into an ASCII path name. If the entry has been deleted the string (deleted) is appended. Note that this is ambiguous. Returns a pointer into the buffer or an error code if the path was too long. Note: Callers should use the returned pointer, not the passed in buffer, to use the name! The implementation often starts at an offset into the buffer, and may leave 0 bytes at the start. buflen should be positive. LINUX Kernel Hackers Manual July 2017 d_add 9 4.1.27 d_add add dentry to hash queues Synopsis void d_add struct dentry * entry struct inode * inode Arguments entry dentry to add inode The inode to attach to this dentry Description This adds the entry to the hash queues and initializes inode. The entry was actually filled in earlier during d_alloc. LINUX Kernel Hackers Manual July 2017 d_add_unique 9 4.1.27 d_add_unique add dentry to hash queues without aliasing Synopsis struct dentry * d_add_unique struct dentry * entry struct inode * inode Arguments entry dentry to add inode The inode to attach to this dentry Description This adds the entry to the hash queues and initializes inode. The entry was actually filled in earlier during d_alloc. LINUX Kernel Hackers Manual July 2017 dget_dlock 9 4.1.27 dget_dlock get a reference to a dentry Synopsis struct dentry * dget_dlock struct dentry * dentry Arguments dentry dentry to get a reference to Description Given a dentry or NULL pointer increment the reference count if appropriate and return the dentry. A dentry will not be destroyed when it has references. LINUX Kernel Hackers Manual July 2017 d_unhashed 9 4.1.27 d_unhashed is dentry hashed Synopsis int d_unhashed const struct dentry * dentry Arguments dentry entry to check Description Returns true if the dentry passed is not currently hashed. LINUX Kernel Hackers Manual July 2017 d_really_is_negative 9 4.1.27 d_really_is_negative Determine if a dentry is really negative (ignoring fallthroughs) Synopsis bool d_really_is_negative const struct dentry * dentry Arguments dentry The dentry in question Description Returns true if the dentry represents either an absent name or a name that doesn't map to an inode (ie. ->d_inode is NULL). The dentry could represent a true miss, a whiteout that isn't represented by a 0,0 chardev or a fallthrough marker in an opaque directory. Note! (1) This should be used *only* by a filesystem to examine its own dentries. It should not be used to look at some other filesystem's dentries. (2) It should also be used in combination with d_inode to get the inode. (3) The dentry may have something attached to ->d_lower and the type field of the flags may be set to something other than miss or whiteout. LINUX Kernel Hackers Manual July 2017 d_really_is_positive 9 4.1.27 d_really_is_positive Determine if a dentry is really positive (ignoring fallthroughs) Synopsis bool d_really_is_positive const struct dentry * dentry Arguments dentry The dentry in question Description Returns true if the dentry represents a name that maps to an inode (ie. ->d_inode is not NULL). The dentry might still represent a whiteout if that is represented on medium as a 0,0 chardev. Note! (1) This should be used *only* by a filesystem to examine its own dentries. It should not be used to look at some other filesystem's dentries. (2) It should also be used in combination with d_inode to get the inode. LINUX Kernel Hackers Manual July 2017 d_inode 9 4.1.27 d_inode Get the actual inode of this dentry Synopsis struct inode * d_inode const struct dentry * dentry Arguments dentry The dentry to query Description This is the helper normal filesystems should use to get at their own inodes in their own dentries and ignore the layering superimposed upon them. LINUX Kernel Hackers Manual July 2017 d_inode_rcu 9 4.1.27 d_inode_rcu Get the actual inode of this dentry with ACCESS_ONCE Synopsis struct inode * d_inode_rcu const struct dentry * dentry Arguments dentry The dentry to query Description This is the helper normal filesystems should use to get at their own inodes in their own dentries and ignore the layering superimposed upon them. LINUX Kernel Hackers Manual July 2017 d_backing_inode 9 4.1.27 d_backing_inode Get upper or lower inode we should be using Synopsis struct inode * d_backing_inode const struct dentry * upper Arguments upper The upper layer Description This is the helper that should be used to get at the inode that will be used if this dentry were to be opened as a file. The inode may be on the upper dentry or it may be on a lower dentry pinned by the upper. Normal filesystems should not use this to access their own inodes. LINUX Kernel Hackers Manual July 2017 d_backing_dentry 9 4.1.27 d_backing_dentry Get upper or lower dentry we should be using Synopsis struct dentry * d_backing_dentry struct dentry * upper Arguments upper The upper layer Description This is the helper that should be used to get the dentry of the inode that will be used if this dentry were opened as a file. It may be the upper dentry or it may be a lower dentry pinned by the upper. Normal filesystems should not use this to access their own dentries. Inode Handling LINUX Kernel Hackers Manual July 2017 inode_init_always 9 4.1.27 inode_init_always perform inode structure intialisation Synopsis int inode_init_always struct super_block * sb struct inode * inode Arguments sb superblock inode belongs to inode inode to initialise Description These are initializations that need to be done on every inode allocation as the fields are not initialised by slab allocation. LINUX Kernel Hackers Manual July 2017 drop_nlink 9 4.1.27 drop_nlink directly drop an inode's link count Synopsis void drop_nlink struct inode * inode Arguments inode inode Description This is a low-level filesystem helper to replace any direct filesystem manipulation of i_nlink. In cases where we are attempting to track writes to the filesystem, a decrement to zero means an imminent write when the file is truncated and actually unlinked on the filesystem. LINUX Kernel Hackers Manual July 2017 clear_nlink 9 4.1.27 clear_nlink directly zero an inode's link count Synopsis void clear_nlink struct inode * inode Arguments inode inode Description This is a low-level filesystem helper to replace any direct filesystem manipulation of i_nlink. See drop_nlink for why we care about i_nlink hitting zero. LINUX Kernel Hackers Manual July 2017 set_nlink 9 4.1.27 set_nlink directly set an inode's link count Synopsis void set_nlink struct inode * inode unsigned int nlink Arguments inode inode nlink new nlink (should be non-zero) Description This is a low-level filesystem helper to replace any direct filesystem manipulation of i_nlink. LINUX Kernel Hackers Manual July 2017 inc_nlink 9 4.1.27 inc_nlink directly increment an inode's link count Synopsis void inc_nlink struct inode * inode Arguments inode inode Description This is a low-level filesystem helper to replace any direct filesystem manipulation of i_nlink. Currently, it is only here for parity with dec_nlink. LINUX Kernel Hackers Manual July 2017 inode_sb_list_add 9 4.1.27 inode_sb_list_add add inode to the superblock list of inodes Synopsis void inode_sb_list_add struct inode * inode Arguments inode inode to add LINUX Kernel Hackers Manual July 2017 __insert_inode_hash 9 4.1.27 __insert_inode_hash hash an inode Synopsis void __insert_inode_hash struct inode * inode unsigned long hashval Arguments inode unhashed inode hashval unsigned long value used to locate this object in the inode_hashtable. Description Add an inode to the inode hash for this superblock. LINUX Kernel Hackers Manual July 2017 __remove_inode_hash 9 4.1.27 __remove_inode_hash remove an inode from the hash Synopsis void __remove_inode_hash struct inode * inode Arguments inode inode to unhash Description Remove an inode from the superblock. LINUX Kernel Hackers Manual July 2017 new_inode 9 4.1.27 new_inode obtain an inode Synopsis struct inode * new_inode struct super_block * sb Arguments sb superblock Description Allocates a new inode for given superblock. The default gfp_mask for allocations related to inode->i_mapping is GFP_HIGHUSER_MOVABLE. If HIGHMEM pages are unsuitable or it is known that pages allocated for the page cache are not reclaimable or migratable, mapping_set_gfp_mask must be called with suitable flags on the newly created inode's mapping LINUX Kernel Hackers Manual July 2017 unlock_new_inode 9 4.1.27 unlock_new_inode clear the I_NEW state and wake up any waiters Synopsis void unlock_new_inode struct inode * inode Arguments inode new inode to unlock Description Called when the inode is fully initialised to clear the new state of the inode and wake up anyone waiting for the inode to finish initialisation. LINUX Kernel Hackers Manual July 2017 lock_two_nondirectories 9 4.1.27 lock_two_nondirectories take two i_mutexes on non-directory objects Synopsis void lock_two_nondirectories struct inode * inode1 struct inode * inode2 Arguments inode1 first inode to lock inode2 second inode to lock Description Lock any non-NULL argument that is not a directory. Zero, one or two objects may be locked by this function. LINUX Kernel Hackers Manual July 2017 unlock_two_nondirectories 9 4.1.27 unlock_two_nondirectories release locks from lock_two_nondirectories Synopsis void unlock_two_nondirectories struct inode * inode1 struct inode * inode2 Arguments inode1 first inode to unlock inode2 second inode to unlock LINUX Kernel Hackers Manual July 2017 iget5_locked 9 4.1.27 iget5_locked obtain an inode from a mounted file system Synopsis struct inode * iget5_locked struct super_block * sb unsigned long hashval int (*test) struct inode *, void * int (*set) struct inode *, void * void * data Arguments sb super block of file system hashval hash value (usually inode number) to get test callback used for comparisons between inodes set callback used to initialize a new struct inode data opaque data pointer to pass to test and set Description Search for the inode specified by hashval and data in the inode cache, and if present it is return it with an increased reference count. This is a generalized version of iget_locked for file systems where the inode number is not sufficient for unique identification of an inode. If the inode is not in cache, allocate a new inode and return it locked, hashed, and with the I_NEW flag set. The file system gets to fill it in before unlocking it via unlock_new_inode. Note both test and set are called with the inode_hash_lock held, so can't sleep. LINUX Kernel Hackers Manual July 2017 iget_locked 9 4.1.27 iget_locked obtain an inode from a mounted file system Synopsis struct inode * iget_locked struct super_block * sb unsigned long ino Arguments sb super block of file system ino inode number to get Description Search for the inode specified by ino in the inode cache and if present return it with an increased reference count. This is for file systems where the inode number is sufficient for unique identification of an inode. If the inode is not in cache, allocate a new inode and return it locked, hashed, and with the I_NEW flag set. The file system gets to fill it in before unlocking it via unlock_new_inode. LINUX Kernel Hackers Manual July 2017 iunique 9 4.1.27 iunique get a unique inode number Synopsis ino_t iunique struct super_block * sb ino_t max_reserved Arguments sb superblock max_reserved highest reserved inode number Description Obtain an inode number that is unique on the system for a given superblock. This is used by file systems that have no natural permanent inode numbering system. An inode number is returned that is higher than the reserved limit but unique. BUGS With a large number of inodes live on the file system this function currently becomes quite slow. LINUX Kernel Hackers Manual July 2017 ilookup5_nowait 9 4.1.27 ilookup5_nowait search for an inode in the inode cache Synopsis struct inode * ilookup5_nowait struct super_block * sb unsigned long hashval int (*test) struct inode *, void * void * data Arguments sb super block of file system to search hashval hash value (usually inode number) to search for test callback used for comparisons between inodes data opaque data pointer to pass to test Description Search for the inode specified by hashval and data in the inode cache. If the inode is in the cache, the inode is returned with an incremented reference count. Note I_NEW is not waited upon so you have to be very careful what you do with the returned inode. You probably should be using ilookup5 instead. Note2 test is called with the inode_hash_lock held, so can't sleep. LINUX Kernel Hackers Manual July 2017 ilookup5 9 4.1.27 ilookup5 search for an inode in the inode cache Synopsis struct inode * ilookup5 struct super_block * sb unsigned long hashval int (*test) struct inode *, void * void * data Arguments sb super block of file system to search hashval hash value (usually inode number) to search for test callback used for comparisons between inodes data opaque data pointer to pass to test Description Search for the inode specified by hashval and data in the inode cache, and if the inode is in the cache, return the inode with an incremented reference count. Waits on I_NEW before returning the inode. returned with an incremented reference count. This is a generalized version of ilookup for file systems where the inode number is not sufficient for unique identification of an inode. Note test is called with the inode_hash_lock held, so can't sleep. LINUX Kernel Hackers Manual July 2017 ilookup 9 4.1.27 ilookup search for an inode in the inode cache Synopsis struct inode * ilookup struct super_block * sb unsigned long ino Arguments sb super block of file system to search ino inode number to search for Description Search for the inode ino in the inode cache, and if the inode is in the cache, the inode is returned with an incremented reference count. LINUX Kernel Hackers Manual July 2017 find_inode_nowait 9 4.1.27 find_inode_nowait find an inode in the inode cache Synopsis struct inode * find_inode_nowait struct super_block * sb unsigned long hashval int (*match) struct inode *, unsigned long, void * void * data Arguments sb super block of file system to search hashval hash value (usually inode number) to search for match callback used for comparisons between inodes data opaque data pointer to pass to match Description Search for the inode specified by hashval and data in the inode cache, where the helper function match will return 0 if the inode does not match, 1 if the inode does match, and -1 if the search should be stopped. The match function must be responsible for taking the i_lock spin_lock and checking i_state for an inode being freed or being initialized, and incrementing the reference count before returning 1. It also must not sleep, since it is called with the inode_hash_lock spinlock held. This is a even more generalized version of ilookup5 when the function must never block --- find_inode can block in __wait_on_freeing_inode --- or when the caller can not increment the reference count because the resulting iput might cause an inode eviction. The tradeoff is that the match funtion must be very carefully implemented. LINUX Kernel Hackers Manual July 2017 iput 9 4.1.27 iput put an inode Synopsis void iput struct inode * inode Arguments inode inode to put Description Puts an inode, dropping its usage count. If the inode use count hits zero, the inode is then freed and may also be destroyed. Consequently, iput can sleep. LINUX Kernel Hackers Manual July 2017 bmap 9 4.1.27 bmap find a block number in a file Synopsis sector_t bmap struct inode * inode sector_t block Arguments inode inode of file block block to find Description Returns the block number on the device holding the inode that is the disk block number for the block of the file requested. That is, asked for block 4 of inode 1 the function will return the disk block relative to the disk start that holds that block of the file. LINUX Kernel Hackers Manual July 2017 touch_atime 9 4.1.27 touch_atime update the access time Synopsis void touch_atime const struct path * path Arguments path the struct path to update Description Update the accessed time on an inode and mark it for writeback. This function automatically handles read only file systems and media, as well as the noatime flag and inode specific noatime markers. LINUX Kernel Hackers Manual July 2017 file_update_time 9 4.1.27 file_update_time update mtime and ctime time Synopsis int file_update_time struct file * file Arguments file file accessed Description Update the mtime and ctime members of an inode and mark the inode for writeback. Note that this function is meant exclusively for usage in the file write path of filesystems, and filesystems may choose to explicitly ignore update via this function with the S_NOCMTIME inode flag, e.g. for network filesystem where these timestamps are handled by the server. This can return an error for file systems who need to allocate space in order to update an inode. LINUX Kernel Hackers Manual July 2017 inode_init_owner 9 4.1.27 inode_init_owner Init uid,gid,mode for new inode according to posix standards Synopsis void inode_init_owner struct inode * inode const struct inode * dir umode_t mode Arguments inode New inode dir Directory inode mode mode of the new inode LINUX Kernel Hackers Manual July 2017 inode_owner_or_capable 9 4.1.27 inode_owner_or_capable check current task permissions to inode Synopsis bool inode_owner_or_capable const struct inode * inode Arguments inode inode being checked Description Return true if current either has CAP_FOWNER in a namespace with the inode owner uid mapped, or owns the file. LINUX Kernel Hackers Manual July 2017 inode_dio_wait 9 4.1.27 inode_dio_wait wait for outstanding DIO requests to finish Synopsis void inode_dio_wait struct inode * inode Arguments inode inode to wait for Description Waits for all pending direct I/O requests to finish so that we can proceed with a truncate or equivalent operation. Must be called under a lock that serializes taking new references to i_dio_count, usually by inode->i_mutex. LINUX Kernel Hackers Manual July 2017 make_bad_inode 9 4.1.27 make_bad_inode mark an inode bad due to an I/O error Synopsis void make_bad_inode struct inode * inode Arguments inode Inode to mark bad Description When an inode cannot be read due to a media or remote network failure this function makes the inode bad and causes I/O operations on it to fail from this point on. LINUX Kernel Hackers Manual July 2017 is_bad_inode 9 4.1.27 is_bad_inode is an inode errored Synopsis int is_bad_inode struct inode * inode Arguments inode inode to test Description Returns true if the inode in question has been marked as bad. LINUX Kernel Hackers Manual July 2017 iget_failed 9 4.1.27 iget_failed Mark an under-construction inode as dead and release it Synopsis void iget_failed struct inode * inode Arguments inode The inode to discard Description Mark an under-construction inode as dead and release it. Registration and Superblocks LINUX Kernel Hackers Manual July 2017 deactivate_locked_super 9 4.1.27 deactivate_locked_super drop an active reference to superblock Synopsis void deactivate_locked_super struct super_block * s Arguments s superblock to deactivate Description Drops an active reference to superblock, converting it into a temprory one if there is no other active references left. In that case we tell fs driver to shut it down and drop the temporary reference we had just acquired. Caller holds exclusive lock on superblock; that lock is released. LINUX Kernel Hackers Manual July 2017 deactivate_super 9 4.1.27 deactivate_super drop an active reference to superblock Synopsis void deactivate_super struct super_block * s Arguments s superblock to deactivate Description Variant of deactivate_locked_super, except that superblock is *not* locked by caller. If we are going to drop the final active reference, lock will be acquired prior to that. LINUX Kernel Hackers Manual July 2017 generic_shutdown_super 9 4.1.27 generic_shutdown_super common helper for ->kill_sb Synopsis void generic_shutdown_super struct super_block * sb Arguments sb superblock to kill Description generic_shutdown_super does all fs-independent work on superblock shutdown. Typical ->kill_sb should pick all fs-specific objects that need destruction out of superblock, call generic_shutdown_super and release aforementioned objects. Note: dentries and inodes _are_ taken care of and do not need specific handling. Upon calling this function, the filesystem may no longer alter or rearrange the set of dentries belonging to this super_block, nor may it change the attachments of dentries to inodes. LINUX Kernel Hackers Manual July 2017 sget 9 4.1.27 sget find or create a superblock Synopsis struct super_block * sget struct file_system_type * type int (*test) struct super_block *,void * int (*set) struct super_block *,void * int flags void * data Arguments type filesystem type superblock should belong to test comparison callback set setup callback flags mount flags data argument to each of them LINUX Kernel Hackers Manual July 2017 iterate_supers_type 9 4.1.27 iterate_supers_type call function for superblocks of given type Synopsis void iterate_supers_type struct file_system_type * type void (*f) struct super_block *, void * void * arg Arguments type fs type f function to call arg argument to pass to it Description Scans the superblock list and calls given function, passing it locked superblock and given argument. LINUX Kernel Hackers Manual July 2017 get_super 9 4.1.27 get_super get the superblock of a device Synopsis struct super_block * get_super struct block_device * bdev Arguments bdev device to get the superblock for Scans the superblock list and finds the superblock of the file system mounted on the device given. NULL is returned if no match is found. LINUX Kernel Hackers Manual July 2017 get_super_thawed 9 4.1.27 get_super_thawed get thawed superblock of a device Synopsis struct super_block * get_super_thawed struct block_device * bdev Arguments bdev device to get the superblock for Description Scans the superblock list and finds the superblock of the file system mounted on the device. The superblock is returned once it is thawed (or immediately if it was not frozen). NULL is returned if no match is found. LINUX Kernel Hackers Manual July 2017 freeze_super 9 4.1.27 freeze_super lock the filesystem and force it into a consistent state Synopsis int freeze_super struct super_block * sb Arguments sb the super to lock Description Syncs the super to make sure the filesystem is consistent and calls the fs's freeze_fs. Subsequent calls to this without first thawing the fs will return -EBUSY. During this function, sb->s_writers.frozen goes through these values: SB_UNFROZEN File system is normal, all writes progress as usual. SB_FREEZE_WRITE The file system is in the process of being frozen. New writes should be blocked, though page faults are still allowed. We wait for all writes to complete and then proceed to the next stage. SB_FREEZE_PAGEFAULT Freezing continues. Now also page faults are blocked but internal fs threads can still modify the filesystem (although they should not dirty new pages or inodes), writeback can run etc. After waiting for all running page faults we sync the filesystem which will clean all dirty pages and inodes (no new dirty pages or inodes can be created when sync is running). SB_FREEZE_FS The file system is frozen. Now all internal sources of fs modification are blocked (e.g. XFS preallocation truncation on inode reclaim). This is usually implemented by blocking new transactions for filesystems that have them and need this additional guard. After all internal writers are finished we call ->freeze_fs to finish filesystem freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is mostly auxiliary for filesystems to verify they do not modify frozen fs. sb->s_writers.frozen is protected by sb->s_umount. LINUX Kernel Hackers Manual July 2017 thaw_super 9 4.1.27 thaw_super - unlock filesystem Synopsis int thaw_super struct super_block * sb Arguments sb the super to thaw Description Unlocks the filesystem and marks it writeable again after freeze_super. File Locks LINUX Kernel Hackers Manual July 2017 posix_lock_file 9 4.1.27 posix_lock_file Apply a POSIX-style lock to a file Synopsis int posix_lock_file struct file * filp struct file_lock * fl struct file_lock * conflock Arguments filp The file to apply the lock to fl The lock to be applied conflock Place to return a copy of the conflicting lock, if found. Description Add a POSIX style lock to a file. We merge adjacent & overlapping locks whenever possible. POSIX locks are sorted by owner task, then by starting address Note that if called with an FL_EXISTS argument, the caller may determine whether or not a lock was successfully freed by testing the return value for -ENOENT. LINUX Kernel Hackers Manual July 2017 posix_lock_inode_wait 9 4.1.27 posix_lock_inode_wait Apply a POSIX-style lock to a file Synopsis int posix_lock_inode_wait struct inode * inode struct file_lock * fl Arguments inode inode of file to which lock request should be applied fl The lock to be applied Description Variant of posix_lock_file_wait that does not take a filp, and so can be used after the filp has already been torn down. LINUX Kernel Hackers Manual July 2017 locks_mandatory_area 9 4.1.27 locks_mandatory_area Check for a conflicting lock Synopsis int locks_mandatory_area int read_write struct inode * inode struct file * filp loff_t offset size_t count Arguments read_write FLOCK_VERIFY_WRITE for exclusive access, FLOCK_VERIFY_READ for shared inode the file to check filp how the file was opened (if it was) offset start of area to check count length of area to check Description Searches the inode's list of locks to find any POSIX locks which conflict. This function is called from rw_verify_area and locks_verify_truncate. LINUX Kernel Hackers Manual July 2017 __break_lease 9 4.1.27 __break_lease revoke all outstanding leases on file Synopsis int __break_lease struct inode * inode unsigned int mode unsigned int type Arguments inode the inode of the file to return mode O_RDONLY: break only write leases; O_WRONLY or O_RDWR: break all leases type FL_LEASE: break leases and delegations; FL_DELEG: break only delegations Description break_lease (inlined for speed) has checked there already is at least some kind of lock (maybe a lease) on this file. Leases are broken on a call to open or truncate. This function can sleep unless you specified O_NONBLOCK to your open. LINUX Kernel Hackers Manual July 2017 lease_get_mtime 9 4.1.27 lease_get_mtime get the last modified time of an inode Synopsis void lease_get_mtime struct inode * inode struct timespec * time Arguments inode the inode time pointer to a timespec which will contain the last modified time Description This is to force NFS clients to flush their caches for files with exclusive leases. The justification is that if someone has an exclusive lease, then they could be modifying it. LINUX Kernel Hackers Manual July 2017 generic_setlease 9 4.1.27 generic_setlease sets a lease on an open file Synopsis int generic_setlease struct file * filp long arg struct file_lock ** flp void ** priv Arguments filp file pointer arg type of lease to obtain flp input - file_lock to use, output - file_lock inserted priv private data for lm_setup (may be NULL if lm_setup doesn't require it) Description The (input) flp->fl_lmops->lm_break function is required by break_lease. LINUX Kernel Hackers Manual July 2017 vfs_setlease 9 4.1.27 vfs_setlease sets a lease on an open file Synopsis int vfs_setlease struct file * filp long arg struct file_lock ** lease void ** priv Arguments filp file pointer arg type of lease to obtain lease file_lock to use when adding a lease priv private info for lm_setup when adding a lease (may be NULL if lm_setup doesn't require it) Description Call this to establish a lease on the file. The lease argument is not used for F_UNLCK requests and may be NULL. For commands that set or alter an existing lease, the (*lease)->fl_lmops->lm_break operation must be set; if not, this function will return -ENOLCK (and generate a scary-looking stack trace). The priv pointer is passed directly to the lm_setup function as-is. It may be NULL if the lm_setup operation doesn't require it. LINUX Kernel Hackers Manual July 2017 flock_lock_inode_wait 9 4.1.27 flock_lock_inode_wait Apply a FLOCK-style lock to a file Synopsis int flock_lock_inode_wait struct inode * inode struct file_lock * fl Arguments inode inode of the file to apply to fl The lock to be applied Description Apply a FLOCK style lock request to an inode. LINUX Kernel Hackers Manual July 2017 vfs_test_lock 9 4.1.27 vfs_test_lock test file byte range lock Synopsis int vfs_test_lock struct file * filp struct file_lock * fl Arguments filp The file to test lock for fl The lock to test; also used to hold result Description Returns -ERRNO on failure. Indicates presence of conflicting lock by setting conf->fl_type to something other than F_UNLCK. LINUX Kernel Hackers Manual July 2017 vfs_lock_file 9 4.1.27 vfs_lock_file file byte range lock Synopsis int vfs_lock_file struct file * filp unsigned int cmd struct file_lock * fl struct file_lock * conf Arguments filp The file to apply the lock to cmd type of locking operation (F_SETLK, F_GETLK, etc.) fl The lock to be applied conf Place to return a copy of the conflicting lock, if found. Description A caller that doesn't care about the conflicting lock may pass NULL as the final argument. If the filesystem defines a private ->lock method, then conf will be left unchanged; so a caller that cares should initialize it to some acceptable default. To avoid blocking kernel daemons, such as lockd, that need to acquire POSIX locks, the ->lock interface may return asynchronously, before the lock has been granted or denied by the underlying filesystem, if (and only if) lm_grant is set. Callers expecting ->lock to return asynchronously will only use F_SETLK, not F_SETLKW; they will set FL_SLEEP if (and only if) the request is for a blocking lock. When ->lock does return asynchronously, it must return FILE_LOCK_DEFERRED, and call ->lm_grant when the lock request completes. If the request is for non-blocking lock the file system should return FILE_LOCK_DEFERRED then try to get the lock and call the callback routine with the result. If the request timed out the callback routine will return a nonzero return code and the file system should release the lock. The file system is also responsible to keep a corresponding posix lock when it grants a lock so the VFS can find out which locks are locally held and do the correct lock cleanup when required. The underlying filesystem must not drop the kernel lock or call ->lm_grant before returning to the caller with a FILE_LOCK_DEFERRED return code. LINUX Kernel Hackers Manual July 2017 posix_unblock_lock 9 4.1.27 posix_unblock_lock stop waiting for a file lock Synopsis int posix_unblock_lock struct file_lock * waiter Arguments waiter the lock which was waiting Description lockd needs to block waiting for locks. LINUX Kernel Hackers Manual July 2017 vfs_cancel_lock 9 4.1.27 vfs_cancel_lock file byte range unblock lock Synopsis int vfs_cancel_lock struct file * filp struct file_lock * fl Arguments filp The file to apply the unblock to fl The lock to be unblocked Description Used by lock managers to cancel blocked requests LINUX Kernel Hackers Manual July 2017 locks_mandatory_locked 9 4.1.27 locks_mandatory_locked Check for an active lock Synopsis int locks_mandatory_locked struct file * file Arguments file the file to check Description Searches the inode's list of locks to find any POSIX locks which conflict. This function is called from locks_verify_locked only. LINUX Kernel Hackers Manual July 2017 fcntl_getlease 9 4.1.27 fcntl_getlease Enquire what lease is currently active Synopsis int fcntl_getlease struct file * filp Arguments filp the file Description The value returned by this function will be one of (if no lease break is pending): F_RDLCK to indicate a shared lease is held. F_WRLCK to indicate an exclusive lease is held. F_UNLCK to indicate no lease is held. (if a lease break is pending): F_RDLCK to indicate an exclusive lease needs to be changed to a shared lease (or removed). F_UNLCK to indicate the lease needs to be removed. XXX sfr & willy disagree over whether F_INPROGRESS should be returned to userspace. LINUX Kernel Hackers Manual July 2017 check_conflicting_open 9 4.1.27 check_conflicting_open see if the given dentry points to a file that has an existing open that would conflict with the desired lease. Synopsis int check_conflicting_open const struct dentry * dentry const long arg int flags Arguments dentry dentry to check arg type of lease that we're trying to acquire flags -- undescribed -- Description Check to see if there's an existing open fd on this file that would conflict with the lease we're trying to set. LINUX Kernel Hackers Manual July 2017 fcntl_setlease 9 4.1.27 fcntl_setlease sets a lease on an open file Synopsis int fcntl_setlease unsigned int fd struct file * filp long arg Arguments fd open file descriptor filp file pointer arg type of lease to obtain Description Call this fcntl to establish a lease on the file. Note that you also need to call F_SETSIG to receive a signal when the lease is broken. LINUX Kernel Hackers Manual July 2017 sys_flock 9 4.1.27 sys_flock flock system call. Synopsis long sys_flock unsigned int fd unsigned int cmd Arguments fd the file descriptor to lock. cmd the type of lock to apply. Description Apply a FL_FLOCK style lock to an open file descriptor. The cmd can be one of LOCK_SH -- a shared lock. LOCK_EX -- an exclusive lock. LOCK_UN -- remove an existing lock. LOCK_MAND -- a `mandatory' flock. This exists to emulate Windows Share Modes. LOCK_MAND can be combined with LOCK_READ or LOCK_WRITE to allow other processes read and write access respectively. Other Functions LINUX Kernel Hackers Manual July 2017 mpage_readpages 9 4.1.27 mpage_readpages populate an address space with some pages & start reads against them Synopsis int mpage_readpages struct address_space * mapping struct list_head * pages unsigned nr_pages get_block_t get_block Arguments mapping the address_space pages The address of a list_head which contains the target pages. These pages have their ->index populated and are otherwise uninitialised. The page at pages->prev has the lowest file offset, and reads should be issued in pages->prev to pages->next order. nr_pages The number of pages at *pages get_block The filesystem's block mapper function. Description This function walks the pages and the blocks within each page, building and emitting large BIOs. If anything unusual happens, such as: - encountering a page which has buffers - encountering a page which has a non-hole after a hole - encountering a page with non-contiguous blocks then this code just gives up and calls the buffer_head-based read function. It does handle a page which has holes at the end - that is a common case: the end-of-file on blocksize < PAGE_CACHE_SIZE setups. BH_Boundary explanation There is a problem. The mpage read code assembles several pages, gets all their disk mappings, and then submits them all. That's fine, but obtaining the disk mappings may require I/O. Reads of indirect blocks, for example. So an mpage read of the first 16 blocks of an ext2 file will cause I/O to be submitted in the following order 12 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 because the indirect block has to be read to get the mappings of blocks 13,14,15,16. Obviously, this impacts performance. So what we do it to allow the filesystem's get_block function to set BH_Boundary when it maps block 11. BH_Boundary says: mapping of the block after this one will require I/O against a block which is probably close to this one. So you should push what I/O you have currently accumulated. This all causes the disk requests to be issued in the correct order. LINUX Kernel Hackers Manual July 2017 mpage_writepages 9 4.1.27 mpage_writepages walk the list of dirty pages of the given address space & writepage all of them Synopsis int mpage_writepages struct address_space * mapping struct writeback_control * wbc get_block_t get_block Arguments mapping address space structure to write wbc subtract the number of written pages from *wbc->nr_to_write get_block the filesystem's block mapper function. If this is NULL then use a_ops->writepage. Otherwise, go direct-to-BIO. Description This is a library function, which implements the writepages address_space_operation. If a page is already under I/O, generic_writepages skips it, even if it's dirty. This is desirable behaviour for memory-cleaning writeback, but it is INCORRECT for data-integrity system calls such as fsync. fsync and msync need to guarantee that all the data which was dirty at the time the call was made get new I/O started against them. If wbc->sync_mode is WB_SYNC_ALL then we were called for data integrity and we must wait for existing IO to complete. LINUX Kernel Hackers Manual July 2017 generic_permission 9 4.1.27 generic_permission check for access rights on a Posix-like filesystem Synopsis int generic_permission struct inode * inode int mask Arguments inode inode to check access rights for mask right to check for (MAY_READ, MAY_WRITE, MAY_EXEC, ...) Description Used to check for read/write/execute permissions on a file. We use fsuid for this, letting us set arbitrary permissions for filesystem access without changing the normal uids which are used for other things. generic_permission is rcu-walk aware. It returns -ECHILD in case an rcu-walk request cannot be satisfied (eg. requires blocking or too much complexity). It would then be called again in ref-walk mode. LINUX Kernel Hackers Manual July 2017 __inode_permission 9 4.1.27 __inode_permission Check for access rights to a given inode Synopsis int __inode_permission struct inode * inode int mask Arguments inode Inode to check permission on mask Right to check for (MAY_READ, MAY_WRITE, MAY_EXEC) Description Check for read/write/execute permissions on an inode. When checking for MAY_APPEND, MAY_WRITE must also be set in mask. This does not check for a read-only file system. You probably want inode_permission. LINUX Kernel Hackers Manual July 2017 inode_permission 9 4.1.27 inode_permission Check for access rights to a given inode Synopsis int inode_permission struct inode * inode int mask Arguments inode Inode to check permission on mask Right to check for (MAY_READ, MAY_WRITE, MAY_EXEC) Description Check for read/write/execute permissions on an inode. We use fs[ug]id for this, letting us set arbitrary permissions for filesystem access without changing the normal UIDs which are used for other things. When checking for MAY_APPEND, MAY_WRITE must also be set in mask. LINUX Kernel Hackers Manual July 2017 path_get 9 4.1.27 path_get get a reference to a path Synopsis void path_get const struct path * path Arguments path path to get the reference to Description Given a path increment the reference count to the dentry and the vfsmount. LINUX Kernel Hackers Manual July 2017 path_put 9 4.1.27 path_put put a reference to a path Synopsis void path_put const struct path * path Arguments path path to put the reference to Description Given a path decrement the reference count to the dentry and the vfsmount. LINUX Kernel Hackers Manual July 2017 vfs_path_lookup 9 4.1.27 vfs_path_lookup lookup a file path relative to a dentry-vfsmount pair Synopsis int vfs_path_lookup struct dentry * dentry struct vfsmount * mnt const char * name unsigned int flags struct path * path Arguments dentry pointer to dentry of the base directory mnt pointer to vfs mount of the base directory name pointer to file name flags lookup flags path pointer to struct path to fill LINUX Kernel Hackers Manual July 2017 lookup_one_len 9 4.1.27 lookup_one_len filesystem helper to lookup single pathname component Synopsis struct dentry * lookup_one_len const char * name struct dentry * base int len Arguments name pathname component to lookup base base directory to lookup from len maximum length len should be interpreted to Description Note that this routine is purely a helper for filesystem usage and should not be called by generic code. LINUX Kernel Hackers Manual July 2017 vfs_unlink 9 4.1.27 vfs_unlink unlink a filesystem object Synopsis int vfs_unlink struct inode * dir struct dentry * dentry struct inode ** delegated_inode Arguments dir parent directory dentry victim delegated_inode returns victim inode, if the inode is delegated. Description The caller must hold dir->i_mutex. If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and return a reference to the inode in delegated_inode. The caller should then break the delegation on that inode and retry. Because breaking a delegation may take a long time, the caller should drop dir->i_mutex before doing so. Alternatively, a caller may pass NULL for delegated_inode. This may be appropriate for callers that expect the underlying filesystem not to be NFS exported. LINUX Kernel Hackers Manual July 2017 vfs_link 9 4.1.27 vfs_link create a new link Synopsis int vfs_link struct dentry * old_dentry struct inode * dir struct dentry * new_dentry struct inode ** delegated_inode Arguments old_dentry object to be linked dir new parent new_dentry where to create the new link delegated_inode returns inode needing a delegation break Description The caller must hold dir->i_mutex If vfs_link discovers a delegation on the to-be-linked file in need of breaking, it will return -EWOULDBLOCK and return a reference to the inode in delegated_inode. The caller should then break the delegation and retry. Because breaking a delegation may take a long time, the caller should drop the i_mutex before doing so. Alternatively, a caller may pass NULL for delegated_inode. This may be appropriate for callers that expect the underlying filesystem not to be NFS exported. LINUX Kernel Hackers Manual July 2017 vfs_rename 9 4.1.27 vfs_rename rename a filesystem object Synopsis int vfs_rename struct inode * old_dir struct dentry * old_dentry struct inode * new_dir struct dentry * new_dentry struct inode ** delegated_inode unsigned int flags Arguments old_dir parent of source old_dentry source new_dir parent of destination new_dentry destination delegated_inode returns an inode needing a delegation break flags rename flags Description The caller must hold multiple mutexes--see lock_rename). If vfs_rename discovers a delegation in need of breaking at either the source or destination, it will return -EWOULDBLOCK and return a reference to the inode in delegated_inode. The caller should then break the delegation and retry. Because breaking a delegation may take a long time, the caller should drop all locks before doing so. Alternatively, a caller may pass NULL for delegated_inode. This may be appropriate for callers that expect the underlying filesystem not to be NFS exported. The worst of all namespace operations - renaming directory. Perverted doesn't even start to describe it. Somebody in UCB had a heck of a trip... Problems a) we can get into loop creation. b) race potential - two innocent renames can create a loop together. That's where 4.4 screws up. Current fix: serialization on sb->s_vfs_rename_mutex. We might be more accurate, but that's another story. c) we have to lock _four_ objects - parents and victim (if it exists), and source (if it is not a directory). And that - after we got ->i_mutex on parents (until then we don't know whether the target exists). Solution: try to be smart with locking order for inodes. We rely on the fact that tree topology may change only under ->s_vfs_rename_mutex _and_ that parent of the object we move will be locked. Thus we can rank directories by the tree (ancestors first) and rank all non-directories after them. That works since everybody except rename does lock parent, lookup, lock child and rename is under ->s_vfs_rename_mutex. HOWEVER, it relies on the assumption that any object with ->lookup has no more than 1 dentry. If hybrid objects will ever appear, we'd better make sure that there's no link(2) for them. d) conversion from fhandle to dentry may come in the wrong moment - when we are removing the target. Solution: we will have to grab ->i_mutex in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on ->i_mutex on parents, which works but leads to some truly excessive locking]. LINUX Kernel Hackers Manual July 2017 sync_mapping_buffers 9 4.1.27 sync_mapping_buffers write out & wait upon a mapping's associated buffers Synopsis int sync_mapping_buffers struct address_space * mapping Arguments mapping the mapping which wants those buffers written Description Starts I/O against the buffers at mapping->private_list, and waits upon that I/O. Basically, this is a convenience function for fsync. mapping is a file or directory which needs those buffers to be written for a successful fsync. LINUX Kernel Hackers Manual July 2017 mark_buffer_dirty 9 4.1.27 mark_buffer_dirty mark a buffer_head as needing writeout Synopsis void mark_buffer_dirty struct buffer_head * bh Arguments bh the buffer_head to mark dirty Description mark_buffer_dirty will set the dirty bit against the buffer, then set its backing page dirty, then tag the page as dirty in its address_space's radix tree and then attach the address_space's inode to its superblock's dirty inode list. mark_buffer_dirty is atomic. It takes bh->b_page->mapping->private_lock, mapping->tree_lock and mapping->host->i_lock. LINUX Kernel Hackers Manual July 2017 __bread_gfp 9 4.1.27 __bread_gfp reads a specified block and returns the bh Synopsis struct buffer_head * __bread_gfp struct block_device * bdev sector_t block unsigned size gfp_t gfp Arguments bdev the block_device to read from block number of block size size (in bytes) to read gfp page allocation flag Description Reads a specified block, and returns buffer head that contains it. The page cache can be allocated from non-movable area not to prevent page migration if you set gfp to zero. It returns NULL if the block was unreadable. LINUX Kernel Hackers Manual July 2017 block_invalidatepage 9 4.1.27 block_invalidatepage invalidate part or all of a buffer-backed page Synopsis void block_invalidatepage struct page * page unsigned int offset unsigned int length Arguments page the page which is affected offset start of the range to invalidate length length of the range to invalidate Description block_invalidatepage is called when all or part of the page has become invalidated by a truncate operation. block_invalidatepage does not have to release all buffers, but it must ensure that no dirty buffer is left outside offset and that no I/O is underway against any of the blocks which are outside the truncation point. Because the caller is about to free (and possibly reuse) those blocks on-disk. LINUX Kernel Hackers Manual July 2017 ll_rw_block 9 4.1.27 ll_rw_block level access to block devices (DEPRECATED) Synopsis void ll_rw_block int rw int nr struct buffer_head * bhs[] Arguments rw whether to READ or WRITE or maybe READA (readahead) nr number of struct buffer_heads in the array bhs[] array of pointers to struct buffer_head Description ll_rw_block takes an array of pointers to struct buffer_heads, and requests an I/O operation on them, either a READ or a WRITE. The third READA option is described in the documentation for generic_make_request which ll_rw_block calls. This function drops any buffer that it cannot get a lock on (with the BH_Lock state bit), any buffer that appears to be clean when doing a write request, and any buffer that appears to be up-to-date when doing read request. Further it marks as clean buffers that are processed for writing (the buffer cache won't assume that they are actually clean until the buffer gets unlocked). ll_rw_block sets b_end_io to simple completion handler that marks the buffer up-to-date (if appropriate), unlocks the buffer and wakes any waiters. All of the buffers must be for the same device, and must also be a multiple of the current approved size for the device. LINUX Kernel Hackers Manual July 2017 bh_uptodate_or_lock 9 4.1.27 bh_uptodate_or_lock Test whether the buffer is uptodate Synopsis int bh_uptodate_or_lock struct buffer_head * bh Arguments bh struct buffer_head Description Return true if the buffer is up-to-date and false, with the buffer locked, if not. LINUX Kernel Hackers Manual July 2017 bh_submit_read 9 4.1.27 bh_submit_read Submit a locked buffer for reading Synopsis int bh_submit_read struct buffer_head * bh Arguments bh struct buffer_head Description Returns zero on success and -EIO on error. LINUX Kernel Hackers Manual July 2017 bio_reset 9 4.1.27 bio_reset reinitialize a bio Synopsis void bio_reset struct bio * bio Arguments bio bio to reset Description After calling bio_reset, bio will be in the same state as a freshly allocated bio returned bio bio_alloc_bioset - the only fields that are preserved are the ones that are initialized by bio_alloc_bioset. See comment in struct bio. LINUX Kernel Hackers Manual July 2017 bio_chain 9 4.1.27 bio_chain chain bio completions Synopsis void bio_chain struct bio * bio struct bio * parent Arguments bio the target bio parent the bio's parent bio Description The caller won't have a bi_end_io called when bio completes - instead, parent's bi_end_io won't be called until both parent and bio have completed; the chained bio will also be freed when it completes. The caller must not set bi_private or bi_end_io in bio. LINUX Kernel Hackers Manual July 2017 bio_alloc_bioset 9 4.1.27 bio_alloc_bioset allocate a bio for I/O Synopsis struct bio * bio_alloc_bioset gfp_t gfp_mask int nr_iovecs struct bio_set * bs Arguments gfp_mask the GFP_ mask given to the slab allocator nr_iovecs number of iovecs to pre-allocate bs the bio_set to allocate from. Description If bs is NULL, uses kmalloc to allocate the bio; else the allocation is backed by the bs's mempool. When bs is not NULL, if __GFP_WAIT is set then bio_alloc will always be able to allocate a bio. This is due to the mempool guarantees. To make this work, callers must never allocate more than 1 bio at a time from this pool. Callers that need to allocate more than 1 bio must always submit the previously allocated bio for IO before attempting to allocate a new one. Failure to do so can cause deadlocks under memory pressure. Note that when running under generic_make_request (i.e. any block driver), bios are not submitted until after you return - see the code in generic_make_request that converts recursion into iteration, to prevent stack overflows. This would normally mean allocating multiple bios under generic_make_request would be susceptible to deadlocks, but we have deadlock avoidance code that resubmits any blocked bios from a rescuer thread. However, we do not guarantee forward progress for allocations from other mempools. Doing multiple allocations from the same mempool under generic_make_request should be avoided - instead, use bio_set's front_pad for per bio allocations. RETURNS Pointer to new bio on success, NULL on failure. LINUX Kernel Hackers Manual July 2017 bio_put 9 4.1.27 bio_put release a reference to a bio Synopsis void bio_put struct bio * bio Arguments bio bio to release reference to Description Put a reference to a struct bio, either one you have gotten with bio_alloc, bio_get or bio_clone. The last put of a bio will free it. LINUX Kernel Hackers Manual July 2017 __bio_clone_fast 9 4.1.27 __bio_clone_fast clone a bio that shares the original bio's biovec Synopsis void __bio_clone_fast struct bio * bio struct bio * bio_src Arguments bio destination bio bio_src bio to clone Description Clone a bio. Caller will own the returned bio, but not the actual data it points to. Reference count of returned bio will be one. Caller must ensure that bio_src is not freed before bio. LINUX Kernel Hackers Manual July 2017 bio_clone_fast 9 4.1.27 bio_clone_fast clone a bio that shares the original bio's biovec Synopsis struct bio * bio_clone_fast struct bio * bio gfp_t gfp_mask struct bio_set * bs Arguments bio bio to clone gfp_mask allocation priority bs bio_set to allocate from Description Like __bio_clone_fast, only also allocates the returned bio LINUX Kernel Hackers Manual July 2017 bio_clone_bioset 9 4.1.27 bio_clone_bioset clone a bio Synopsis struct bio * bio_clone_bioset struct bio * bio_src gfp_t gfp_mask struct bio_set * bs Arguments bio_src bio to clone gfp_mask allocation priority bs bio_set to allocate from Description Clone bio. Caller will own the returned bio, but not the actual data it points to. Reference count of returned bio will be one. LINUX Kernel Hackers Manual July 2017 bio_get_nr_vecs 9 4.1.27 bio_get_nr_vecs return approx number of vecs Synopsis int bio_get_nr_vecs struct block_device * bdev Arguments bdev I/O target Description Return the approximate number of pages we can send to this target. There's no guarantee that you will be able to fit this number of pages into a bio, it does not account for dynamic restrictions that vary on offset. LINUX Kernel Hackers Manual July 2017 bio_add_pc_page 9 4.1.27 bio_add_pc_page attempt to add page to bio Synopsis int bio_add_pc_page struct request_queue * q struct bio * bio struct page * page unsigned int len unsigned int offset Arguments q the target queue bio destination bio page page to add len vec entry length offset vec entry offset Description Attempt to add a page to the bio_vec maplist. This can fail for a number of reasons, such as the bio being full or target block device limitations. The target block device must allow bio's up to PAGE_SIZE, so it is always possible to add a single page to an empty bio. This should only be used by REQ_PC bios. LINUX Kernel Hackers Manual July 2017 bio_add_page 9 4.1.27 bio_add_page attempt to add page to bio Synopsis int bio_add_page struct bio * bio struct page * page unsigned int len unsigned int offset Arguments bio destination bio page page to add len vec entry length offset vec entry offset Description Attempt to add a page to the bio_vec maplist. This can fail for a number of reasons, such as the bio being full or target block device limitations. The target block device must allow bio's up to PAGE_SIZE, so it is always possible to add a single page to an empty bio. LINUX Kernel Hackers Manual July 2017 submit_bio_wait 9 4.1.27 submit_bio_wait submit a bio, and wait until it completes Synopsis int submit_bio_wait int rw struct bio * bio Arguments rw whether to READ or WRITE, or maybe to READA (read ahead) bio The struct bio which describes the I/O Description Simple wrapper around submit_bio. Returns 0 on success, or the error from bio_endio on failure. LINUX Kernel Hackers Manual July 2017 bio_advance 9 4.1.27 bio_advance increment/complete a bio by some number of bytes Synopsis void bio_advance struct bio * bio unsigned bytes Arguments bio bio to advance bytes number of bytes to complete Description This updates bi_sector, bi_size and bi_idx; if the number of bytes to complete doesn't align with a bvec boundary, then bv_len and bv_offset will be updated on the last bvec as well. bio will then represent the remaining, uncompleted portion of the io. LINUX Kernel Hackers Manual July 2017 bio_alloc_pages 9 4.1.27 bio_alloc_pages allocates a single page for each bvec in a bio Synopsis int bio_alloc_pages struct bio * bio gfp_t gfp_mask Arguments bio bio to allocate pages for gfp_mask flags for allocation Description Allocates pages up to bio->bi_vcnt. Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are freed. LINUX Kernel Hackers Manual July 2017 bio_copy_data 9 4.1.27 bio_copy_data copy contents of data buffers from one chain of bios to another Synopsis void bio_copy_data struct bio * dst struct bio * src Arguments dst destination bio list src source bio list Description If src and dst are single bios, bi_next must be NULL - otherwise, treats src and dst as linked lists of bios. Stops when it reaches the end of either src or dst - that is, copies min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios). LINUX Kernel Hackers Manual July 2017 bio_uncopy_user 9 4.1.27 bio_uncopy_user finish previously mapped bio Synopsis int bio_uncopy_user struct bio * bio Arguments bio bio being terminated Description Free pages allocated from bio_copy_user_iov and write back data to user space in case of a read. LINUX Kernel Hackers Manual July 2017 bio_unmap_user 9 4.1.27 bio_unmap_user unmap a bio Synopsis void bio_unmap_user struct bio * bio Arguments bio the bio being unmapped Description Unmap a bio previously mapped by bio_map_user. Must be called with a process context. bio_unmap_user may sleep. LINUX Kernel Hackers Manual July 2017 bio_map_kern 9 4.1.27 bio_map_kern map kernel address into bio Synopsis struct bio * bio_map_kern struct request_queue * q void * data unsigned int len gfp_t gfp_mask Arguments q the struct request_queue for the bio data pointer to buffer to map len length in bytes gfp_mask allocation flags for bio allocation Description Map the kernel address into a bio suitable for io to a block device. Returns an error pointer in case of error. LINUX Kernel Hackers Manual July 2017 bio_copy_kern 9 4.1.27 bio_copy_kern copy kernel address into bio Synopsis struct bio * bio_copy_kern struct request_queue * q void * data unsigned int len gfp_t gfp_mask int reading Arguments q the struct request_queue for the bio data pointer to buffer to copy len length in bytes gfp_mask allocation flags for bio and page allocation reading data direction is READ Description copy the kernel address into a bio suitable for io to a block device. Returns an error pointer in case of error. LINUX Kernel Hackers Manual July 2017 bio_endio 9 4.1.27 bio_endio end I/O on a bio Synopsis void bio_endio struct bio * bio int error Arguments bio bio error error, if any Description bio_endio will end I/O on the whole bio. bio_endio is the preferred way to end I/O on a bio, it takes care of clearing BIO_UPTODATE on error. error is 0 on success, and and one of the established -Exxxx (-EIO, for instance) error values in case something went wrong. No one should call bi_end_io directly on a bio unless they own it and thus know that it has an end_io function. LINUX Kernel Hackers Manual July 2017 bio_endio_nodec 9 4.1.27 bio_endio_nodec end I/O on a bio, without decrementing bi_remaining Synopsis void bio_endio_nodec struct bio * bio int error Arguments bio bio error error, if any Description For code that has saved and restored bi_end_io; thing hard before using this function, probably you should've cloned the entire bio. LINUX Kernel Hackers Manual July 2017 bio_split 9 4.1.27 bio_split split a bio Synopsis struct bio * bio_split struct bio * bio int sectors gfp_t gfp struct bio_set * bs Arguments bio bio to split sectors number of sectors to split from the front of bio gfp gfp mask bs bio set to allocate from Description Allocates and returns a new bio which represents sectors from the start of bio, and updates bio to represent the remaining sectors. Unless this is a discard request the newly allocated bio will point to bio's bi_io_vec; it is the caller's responsibility to ensure that bio is not freed before the split. LINUX Kernel Hackers Manual July 2017 bio_trim 9 4.1.27 bio_trim trim a bio Synopsis void bio_trim struct bio * bio int offset int size Arguments bio bio to trim offset number of sectors to trim from the front of bio size size we want to trim bio to, in sectors LINUX Kernel Hackers Manual July 2017 bioset_create 9 4.1.27 bioset_create Create a bio_set Synopsis struct bio_set * bioset_create unsigned int pool_size unsigned int front_pad Arguments pool_size Number of bio and bio_vecs to cache in the mempool front_pad Number of bytes to allocate in front of the returned bio Description Set up a bio_set to be used with bio_alloc_bioset. Allows the caller to ask for a number of bytes to be allocated in front of the bio. Front pad allocation is useful for embedding the bio inside another structure, to avoid allocating extra data to go with the bio. Note that the bio must be embedded at the END of that structure always, or things will break badly. LINUX Kernel Hackers Manual July 2017 bioset_create_nobvec 9 4.1.27 bioset_create_nobvec Create a bio_set without bio_vec mempool Synopsis struct bio_set * bioset_create_nobvec unsigned int pool_size unsigned int front_pad Arguments pool_size Number of bio to cache in the mempool front_pad Number of bytes to allocate in front of the returned bio Description Same functionality as bioset_create except that mempool is not created for bio_vecs. Saving some memory for bio_clone_fast users. LINUX Kernel Hackers Manual July 2017 seq_open 9 4.1.27 seq_open initialize sequential file Synopsis int seq_open struct file * file const struct seq_operations * op Arguments file file we initialize op method table describing the sequence Description seq_open sets file, associating it with a sequence described by op. op->start sets the iterator up and returns the first element of sequence. op->stop shuts it down. op->next returns the next element of sequence. op->show prints element into the buffer. In case of error ->start and ->next return ERR_PTR(error). In the end of sequence they return NULL. ->show returns 0 in case of success and negative number in case of error. Returning SEQ_SKIP means discard this element and move on. LINUX Kernel Hackers Manual July 2017 seq_read 9 4.1.27 seq_read ->read method for sequential files. Synopsis ssize_t seq_read struct file * file char __user * buf size_t size loff_t * ppos Arguments file the file to read from buf the buffer to read to size the maximum number of bytes to read ppos the current position in the file Description Ready-made ->f_op->read LINUX Kernel Hackers Manual July 2017 seq_lseek 9 4.1.27 seq_lseek ->llseek method for sequential files. Synopsis loff_t seq_lseek struct file * file loff_t offset int whence Arguments file the file in question offset new position whence 0 for absolute, 1 for relative position Description Ready-made ->f_op->llseek LINUX Kernel Hackers Manual July 2017 seq_release 9 4.1.27 seq_release free the structures associated with sequential file. Synopsis int seq_release struct inode * inode struct file * file Arguments inode its inode file file in question Description Frees the structures associated with sequential file; can be used as ->f_op->release if you don't have private data to destroy. LINUX Kernel Hackers Manual July 2017 seq_escape 9 4.1.27 seq_escape print string into buffer, escaping some characters Synopsis int seq_escape struct seq_file * m const char * s const char * esc Arguments m target buffer s string esc set of characters that need escaping Description Puts string into buffer, replacing each occurrence of character from esc with usual octal escape. Returns 0 in case of success, -1 - in case of overflow. LINUX Kernel Hackers Manual July 2017 mangle_path 9 4.1.27 mangle_path mangle and copy path to buffer beginning Synopsis char * mangle_path char * s const char * p const char * esc Arguments s buffer start p beginning of path in above buffer esc set of characters that need escaping Description Copy the path from p to s, replacing each occurrence of character from esc with usual octal escape. Returns pointer past last written character in s, or NULL in case of failure. LINUX Kernel Hackers Manual July 2017 seq_path 9 4.1.27 seq_path seq_file interface to print a pathname Synopsis int seq_path struct seq_file * m const struct path * path const char * esc Arguments m the seq_file handle path the struct path to print esc set of characters to escape in the output Description return the absolute path of 'path', as represented by the dentry / mnt pair in the path parameter. LINUX Kernel Hackers Manual July 2017 seq_write 9 4.1.27 seq_write write arbitrary data to buffer Synopsis int seq_write struct seq_file * seq const void * data size_t len Arguments seq seq_file identifying the buffer to which data should be written data data address len number of bytes Description Return 0 on success, non-zero otherwise. LINUX Kernel Hackers Manual July 2017 seq_pad 9 4.1.27 seq_pad write padding spaces to buffer Synopsis void seq_pad struct seq_file * m char c Arguments m seq_file identifying the buffer to which data should be written c the byte to append after padding if non-zero LINUX Kernel Hackers Manual July 2017 seq_hlist_start 9 4.1.27 seq_hlist_start start an iteration of a hlist Synopsis struct hlist_node * seq_hlist_start struct hlist_head * head loff_t pos Arguments head the head of the hlist pos the start position of the sequence Description Called at seq_file->op->start. LINUX Kernel Hackers Manual July 2017 seq_hlist_start_head 9 4.1.27 seq_hlist_start_head start an iteration of a hlist Synopsis struct hlist_node * seq_hlist_start_head struct hlist_head * head loff_t pos Arguments head the head of the hlist pos the start position of the sequence Description Called at seq_file->op->start. Call this function if you want to print a header at the top of the output. LINUX Kernel Hackers Manual July 2017 seq_hlist_next 9 4.1.27 seq_hlist_next move to the next position of the hlist Synopsis struct hlist_node * seq_hlist_next void * v struct hlist_head * head loff_t * ppos Arguments v the current iterator head the head of the hlist ppos the current position Description Called at seq_file->op->next. LINUX Kernel Hackers Manual July 2017 seq_hlist_start_rcu 9 4.1.27 seq_hlist_start_rcu start an iteration of a hlist protected by RCU Synopsis struct hlist_node * seq_hlist_start_rcu struct hlist_head * head loff_t pos Arguments head the head of the hlist pos the start position of the sequence Description Called at seq_file->op->start. This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu as long as the traversal is guarded by rcu_read_lock. LINUX Kernel Hackers Manual July 2017 seq_hlist_start_head_rcu 9 4.1.27 seq_hlist_start_head_rcu start an iteration of a hlist protected by RCU Synopsis struct hlist_node * seq_hlist_start_head_rcu struct hlist_head * head loff_t pos Arguments head the head of the hlist pos the start position of the sequence Description Called at seq_file->op->start. Call this function if you want to print a header at the top of the output. This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu as long as the traversal is guarded by rcu_read_lock. LINUX Kernel Hackers Manual July 2017 seq_hlist_next_rcu 9 4.1.27 seq_hlist_next_rcu move to the next position of the hlist protected by RCU Synopsis struct hlist_node * seq_hlist_next_rcu void * v struct hlist_head * head loff_t * ppos Arguments v the current iterator head the head of the hlist ppos the current position Description Called at seq_file->op->next. This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu as long as the traversal is guarded by rcu_read_lock. LINUX Kernel Hackers Manual July 2017 seq_hlist_start_percpu 9 4.1.27 seq_hlist_start_percpu start an iteration of a percpu hlist array Synopsis struct hlist_node * seq_hlist_start_percpu struct hlist_head __percpu * head int * cpu loff_t pos Arguments head pointer to percpu array of struct hlist_heads cpu pointer to cpu cursor pos start position of sequence Description Called at seq_file->op->start. LINUX Kernel Hackers Manual July 2017 seq_hlist_next_percpu 9 4.1.27 seq_hlist_next_percpu move to the next position of the percpu hlist array Synopsis struct hlist_node * seq_hlist_next_percpu void * v struct hlist_head __percpu * head int * cpu loff_t * pos Arguments v pointer to current hlist_node head pointer to percpu array of struct hlist_heads cpu pointer to cpu cursor pos start position of sequence Description Called at seq_file->op->next. LINUX Kernel Hackers Manual July 2017 register_filesystem 9 4.1.27 register_filesystem register a new filesystem Synopsis int register_filesystem struct file_system_type * fs Arguments fs the file system structure Description Adds the file system passed to the list of file systems the kernel is aware of for mount and other syscalls. Returns 0 on success, or a negative errno code on an error. The struct file_system_type that is passed is linked into the kernel structures and must not be freed until the file system has been unregistered. LINUX Kernel Hackers Manual July 2017 unregister_filesystem 9 4.1.27 unregister_filesystem unregister a file system Synopsis int unregister_filesystem struct file_system_type * fs Arguments fs filesystem to unregister Description Remove a file system that was previously successfully registered with the kernel. An error is returned if the file system is not found. Zero is returned on a success. Once this function has returned the struct file_system_type structure may be freed or reused. LINUX Kernel Hackers Manual July 2017 writeback_in_progress 9 4.1.27 writeback_in_progress determine whether there is writeback in progress Synopsis int writeback_in_progress struct backing_dev_info * bdi Arguments bdi the device's backing_dev_info structure. Description Determine whether there is writeback waiting to be handled against a backing device. LINUX Kernel Hackers Manual July 2017 writeback_inodes_sb_nr 9 4.1.27 writeback_inodes_sb_nr writeback dirty inodes from given super_block Synopsis void writeback_inodes_sb_nr struct super_block * sb unsigned long nr enum wb_reason reason Arguments sb the superblock nr the number of pages to write reason reason why some writeback work initiated Description Start writeback on some inodes on this super_block. No guarantees are made on how many (if any) will be written, and this function does not wait for IO completion of submitted IO. LINUX Kernel Hackers Manual July 2017 writeback_inodes_sb 9 4.1.27 writeback_inodes_sb writeback dirty inodes from given super_block Synopsis void writeback_inodes_sb struct super_block * sb enum wb_reason reason Arguments sb the superblock reason reason why some writeback work was initiated Description Start writeback on some inodes on this super_block. No guarantees are made on how many (if any) will be written, and this function does not wait for IO completion of submitted IO. LINUX Kernel Hackers Manual July 2017 try_to_writeback_inodes_sb_nr 9 4.1.27 try_to_writeback_inodes_sb_nr try to start writeback if none underway Synopsis int try_to_writeback_inodes_sb_nr struct super_block * sb unsigned long nr enum wb_reason reason Arguments sb the superblock nr the number of pages to write reason the reason of writeback Description Invoke writeback_inodes_sb_nr if no writeback is currently underway. Returns 1 if writeback was started, 0 if not. LINUX Kernel Hackers Manual July 2017 try_to_writeback_inodes_sb 9 4.1.27 try_to_writeback_inodes_sb try to start writeback if none underway Synopsis int try_to_writeback_inodes_sb struct super_block * sb enum wb_reason reason Arguments sb the superblock reason reason why some writeback work was initiated Description Implement by try_to_writeback_inodes_sb_nr Returns 1 if writeback was started, 0 if not. LINUX Kernel Hackers Manual July 2017 sync_inodes_sb 9 4.1.27 sync_inodes_sb sync sb inode pages Synopsis void sync_inodes_sb struct super_block * sb Arguments sb the superblock Description This function writes and waits on any dirty inode belonging to this super_block. LINUX Kernel Hackers Manual July 2017 write_inode_now 9 4.1.27 write_inode_now write an inode to disk Synopsis int write_inode_now struct inode * inode int sync Arguments inode inode to write to disk sync whether the write should be synchronous or not Description This function commits an inode to disk immediately if it is dirty. This is primarily needed by knfsd. The caller must either have a ref on the inode or must have set I_WILL_FREE. LINUX Kernel Hackers Manual July 2017 sync_inode 9 4.1.27 sync_inode write an inode and its pages to disk. Synopsis int sync_inode struct inode * inode struct writeback_control * wbc Arguments inode the inode to sync wbc controls the writeback mode Description sync_inode will write an inode and its pages to disk. It will also correctly update the inode on its superblock's dirty inode lists and will update inode->i_state. The caller must have a ref on the inode. LINUX Kernel Hackers Manual July 2017 sync_inode_metadata 9 4.1.27 sync_inode_metadata write an inode to disk Synopsis int sync_inode_metadata struct inode * inode int wait Arguments inode the inode to sync wait wait for I/O to complete. Description Write an inode to disk and adjust its dirty state after completion. Note only writes the actual inode, no associated data or other metadata. LINUX Kernel Hackers Manual July 2017 freeze_bdev 9 4.1.27 freeze_bdev - lock a filesystem and force it into a consistent state Synopsis struct super_block * freeze_bdev struct block_device * bdev Arguments bdev blockdevice to lock Description If a superblock is found on this device, we take the s_umount semaphore on it to make sure nobody unmounts until the snapshot creation is done. The reference counter (bd_fsfreeze_count) guarantees that only the last unfreeze process can unfreeze the frozen filesystem actually when multiple freeze requests arrive simultaneously. It counts up in freeze_bdev and count down in thaw_bdev. When it becomes 0, thaw_bdev will unfreeze actually. LINUX Kernel Hackers Manual July 2017 thaw_bdev 9 4.1.27 thaw_bdev - unlock filesystem Synopsis int thaw_bdev struct block_device * bdev struct super_block * sb Arguments bdev blockdevice to unlock sb associated superblock Description Unlocks the filesystem and marks it writeable again after freeze_bdev. LINUX Kernel Hackers Manual July 2017 bdev_read_page 9 4.1.27 bdev_read_page Start reading a page from a block device Synopsis int bdev_read_page struct block_device * bdev sector_t sector struct page * page Arguments bdev The device to read the page from sector The offset on the device to read the page to (need not be aligned) page The page to read Description On entry, the page should be locked. It will be unlocked when the page has been read. If the block driver implements rw_page synchronously, that will be true on exit from this function, but it need not be. Errors returned by this function are usually soft, eg out of memory, or queue full; callers should try a different route to read this page rather than propagate an error back up the stack. Return negative errno if an error occurs, 0 if submission was successful. LINUX Kernel Hackers Manual July 2017 bdev_write_page 9 4.1.27 bdev_write_page Start writing a page to a block device Synopsis int bdev_write_page struct block_device * bdev sector_t sector struct page * page struct writeback_control * wbc Arguments bdev The device to write the page to sector The offset on the device to write the page to (need not be aligned) page The page to write wbc The writeback_control for the write Description On entry, the page should be locked and not currently under writeback. On exit, if the write started successfully, the page will be unlocked and under writeback. If the write failed already (eg the driver failed to queue the page to the device), the page will still be locked. If the caller is a ->writepage implementation, it will need to unlock the page. Errors returned by this function are usually soft, eg out of memory, or queue full; callers should try a different route to write this page rather than propagate an error back up the stack. Return negative errno if an error occurs, 0 if submission was successful. LINUX Kernel Hackers Manual July 2017 bdev_direct_access 9 4.1.27 bdev_direct_access Get the address for directly-accessibly memory Synopsis long bdev_direct_access struct block_device * bdev sector_t sector void ** addr unsigned long * pfn long size Arguments bdev The device containing the memory sector The offset within the device addr Where to put the address of the memory pfn The Page Frame Number for the memory size The number of bytes requested Description If a block device is made up of directly addressable memory, this function will tell the caller the PFN and the address of the memory. The address may be directly dereferenced within the kernel without the need to call ioremap, kmap or similar. The PFN is suitable for inserting into page tables. Return negative errno if an error occurs, otherwise the number of bytes accessible at this address. LINUX Kernel Hackers Manual July 2017 bdgrab 9 4.1.27 bdgrab - Grab a reference to an already referenced block device Synopsis struct block_device * bdgrab struct block_device * bdev Arguments bdev Block device to grab a reference to. LINUX Kernel Hackers Manual July 2017 bd_link_disk_holder 9 4.1.27 bd_link_disk_holder create symlinks between holding disk and slave bdev Synopsis int bd_link_disk_holder struct block_device * bdev struct gendisk * disk Arguments bdev the claimed slave bdev disk the holding disk Description DON'T USE THIS UNLESS YOU'RE ALREADY USING IT. This functions creates the following sysfs symlinks. - from slaves directory of the holder disk to the claimed bdev - from holders directory of the bdev to the holder disk For example, if /dev/dm-0 maps to /dev/sda and disk for dm-0 is passed to bd_link_disk_holder, then: /sys/block/dm-0/slaves/sda --> /sys/block/sda /sys/block/sda/holders/dm-0 --> /sys/block/dm-0 The caller must have claimed bdev before calling this function and ensure that both bdev and disk are valid during the creation and lifetime of these symlinks. CONTEXT Might sleep. RETURNS 0 on success, -errno on failure. LINUX Kernel Hackers Manual July 2017 bd_unlink_disk_holder 9 4.1.27 bd_unlink_disk_holder destroy symlinks created by bd_link_disk_holder Synopsis void bd_unlink_disk_holder struct block_device * bdev struct gendisk * disk Arguments bdev the calimed slave bdev disk the holding disk Description DON'T USE THIS UNLESS YOU'RE ALREADY USING IT. CONTEXT Might sleep. LINUX Kernel Hackers Manual July 2017 check_disk_size_change 9 4.1.27 check_disk_size_change checks for disk size change and adjusts bdev size. Synopsis void check_disk_size_change struct gendisk * disk struct block_device * bdev Arguments disk struct gendisk to check bdev struct bdev to adjust. Description This routine checks to see if the bdev size does not match the disk size and adjusts it if it differs. LINUX Kernel Hackers Manual July 2017 revalidate_disk 9 4.1.27 revalidate_disk wrapper for lower-level driver's revalidate_disk call-back Synopsis int revalidate_disk struct gendisk * disk Arguments disk struct gendisk to be revalidated Description This routine is a wrapper for lower-level driver's revalidate_disk call-backs. It is used to do common pre and post operations needed for all revalidate_disk operations. LINUX Kernel Hackers Manual July 2017 blkdev_get 9 4.1.27 blkdev_get open a block device Synopsis int blkdev_get struct block_device * bdev fmode_t mode void * holder Arguments bdev block_device to open mode FMODE_* mask holder exclusive holder identifier Description Open bdev with mode. If mode includes FMODE_EXCL, bdev is open with exclusive access. Specifying FMODE_EXCL with NULL holder is invalid. Exclusive opens may nest for the same holder. On success, the reference count of bdev is unchanged. On failure, bdev is put. CONTEXT Might sleep. RETURNS 0 on success, -errno on failure. LINUX Kernel Hackers Manual July 2017 blkdev_get_by_path 9 4.1.27 blkdev_get_by_path open a block device by name Synopsis struct block_device * blkdev_get_by_path const char * path fmode_t mode void * holder Arguments path path to the block device to open mode FMODE_* mask holder exclusive holder identifier Description Open the blockdevice described by the device file at path. mode and holder are identical to blkdev_get. On success, the returned block_device has reference count of one. CONTEXT Might sleep. RETURNS Pointer to block_device on success, ERR_PTR(-errno) on failure. LINUX Kernel Hackers Manual July 2017 blkdev_get_by_dev 9 4.1.27 blkdev_get_by_dev open a block device by device number Synopsis struct block_device * blkdev_get_by_dev dev_t dev fmode_t mode void * holder Arguments dev device number of block device to open mode FMODE_* mask holder exclusive holder identifier Description Open the blockdevice described by device number dev. mode and holder are identical to blkdev_get. Use it ONLY if you really do not have anything better - i.e. when you are behind a truly sucky interface and all you are given is a device number. _Never_ to be used for internal purposes. If you ever need it - reconsider your API. On success, the returned block_device has reference count of one. CONTEXT Might sleep. RETURNS Pointer to block_device on success, ERR_PTR(-errno) on failure. LINUX Kernel Hackers Manual July 2017 lookup_bdev 9 4.1.27 lookup_bdev lookup a struct block_device by name Synopsis struct block_device * lookup_bdev const char * pathname Arguments pathname special file representing the block device Description Get a reference to the blockdevice at pathname in the current namespace if possible and return it. Return ERR_PTR(error) otherwise. The proc filesystem sysctl interface LINUX Kernel Hackers Manual July 2017 proc_dostring 9 4.1.27 proc_dostring read a string sysctl Synopsis int proc_dostring struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes a string from/to the user buffer. If the kernel buffer provided is not large enough to hold the string, the string is truncated. The copied string is NULL-terminated. If the string is being read by the user process, it is copied and a newline '\n' is added. It is truncated if the buffer is not large enough. Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_dointvec 9 4.1.27 proc_dointvec read a vector of integers Synopsis int proc_dointvec struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string. Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_dointvec_minmax 9 4.1.27 proc_dointvec_minmax read a vector of integers with min/max values Synopsis int proc_dointvec_minmax struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string. This routine will ensure the values are within the range specified by table->extra1 (min) and table->extra2 (max). Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_doulongvec_minmax 9 4.1.27 proc_doulongvec_minmax read a vector of long integers with min/max values Synopsis int proc_doulongvec_minmax struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long values from/to the user buffer, treated as an ASCII string. This routine will ensure the values are within the range specified by table->extra1 (min) and table->extra2 (max). Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_doulongvec_ms_jiffies_minmax 9 4.1.27 proc_doulongvec_ms_jiffies_minmax read a vector of millisecond values with min/max values Synopsis int proc_doulongvec_ms_jiffies_minmax struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes up to table->maxlen/sizeof(unsigned long) unsigned long values from/to the user buffer, treated as an ASCII string. The values are treated as milliseconds, and converted to jiffies when they are stored. This routine will ensure the values are within the range specified by table->extra1 (min) and table->extra2 (max). Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_dointvec_jiffies 9 4.1.27 proc_dointvec_jiffies read a vector of integers as seconds Synopsis int proc_dointvec_jiffies struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos file position Description Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string. The values read are assumed to be in seconds, and are converted into jiffies. Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_dointvec_userhz_jiffies 9 4.1.27 proc_dointvec_userhz_jiffies read a vector of integers as 1/USER_HZ seconds Synopsis int proc_dointvec_userhz_jiffies struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos pointer to the file position Description Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string. The values read are assumed to be in 1/USER_HZ seconds, and are converted into jiffies. Returns 0 on success. LINUX Kernel Hackers Manual July 2017 proc_dointvec_ms_jiffies 9 4.1.27 proc_dointvec_ms_jiffies read a vector of integers as 1 milliseconds Synopsis int proc_dointvec_ms_jiffies struct ctl_table * table int write void __user * buffer size_t * lenp loff_t * ppos Arguments table the sysctl table write TRUE if this is a write to the sysctl file buffer the user buffer lenp the size of the user buffer ppos the current position in the file Description Reads/writes up to table->maxlen/sizeof(unsigned int) integer values from/to the user buffer, treated as an ASCII string. The values read are assumed to be in 1/1000 seconds, and are converted into jiffies. Returns 0 on success. proc filesystem interface LINUX Kernel Hackers Manual July 2017 proc_flush_task 9 4.1.27 proc_flush_task Remove dcache entries for task from the /proc dcache. Synopsis void proc_flush_task struct task_struct * task Arguments task task that should be flushed. Description When flushing dentries from proc, one needs to flush them from global proc (proc_mnt) and from all the namespaces' procs this task was seen in. This call is supposed to do all of this job. Looks in the dcache for /proc/pid /proc/tgid/task/pid if either directory is present flushes it and all of it'ts children from the dcache. It is safe and reasonable to cache /proc entries for a task until that task exits. After that they just clog up the dcache with useless entries, possibly causing useful dcache entries to be flushed instead. This routine is proved to flush those useless dcache entries at process exit time. NOTE This routine is just an optimization so it does not guarantee that no dcache entries will exist at process exit time it just makes it very unlikely that any will persist. Events based on file descriptors LINUX Kernel Hackers Manual July 2017 eventfd_signal 9 4.1.27 eventfd_signal Adds n to the eventfd counter. Synopsis __u64 eventfd_signal struct eventfd_ctx * ctx __u64 n Arguments ctx [in] Pointer to the eventfd context. n [in] Value of the counter to be added to the eventfd internal counter. The value cannot be negative. Description This function is supposed to be called by the kernel in paths that do not allow sleeping. In this function we allow the counter to reach the ULLONG_MAX value, and we signal this as overflow condition by returining a POLLERR to poll(2). Returns the amount by which the counter was incrememnted. This will be less than n if the counter has overflowed. LINUX Kernel Hackers Manual July 2017 eventfd_ctx_get 9 4.1.27 eventfd_ctx_get Acquires a reference to the internal eventfd context. Synopsis struct eventfd_ctx * eventfd_ctx_get struct eventfd_ctx * ctx Arguments ctx [in] Pointer to the eventfd context. Returns In case of success, returns a pointer to the eventfd context. LINUX Kernel Hackers Manual July 2017 eventfd_ctx_put 9 4.1.27 eventfd_ctx_put Releases a reference to the internal eventfd context. Synopsis void eventfd_ctx_put struct eventfd_ctx * ctx Arguments ctx [in] Pointer to eventfd context. Description The eventfd context reference must have been previously acquired either with eventfd_ctx_get or eventfd_ctx_fdget. LINUX Kernel Hackers Manual July 2017 eventfd_ctx_remove_wait_queue 9 4.1.27 eventfd_ctx_remove_wait_queue Read the current counter and removes wait queue. Synopsis int eventfd_ctx_remove_wait_queue struct eventfd_ctx * ctx wait_queue_t * wait __u64 * cnt Arguments ctx [in] Pointer to eventfd context. wait [in] Wait queue to be removed. cnt [out] Pointer to the 64-bit counter value. Description Returns 0 if successful, or the following error codes: -EAGAIN : The operation would have blocked. This is used to atomically remove a wait queue entry from the eventfd wait queue head, and read/reset the counter value. LINUX Kernel Hackers Manual July 2017 eventfd_ctx_read 9 4.1.27 eventfd_ctx_read Reads the eventfd counter or wait if it is zero. Synopsis ssize_t eventfd_ctx_read struct eventfd_ctx * ctx int no_wait __u64 * cnt Arguments ctx [in] Pointer to eventfd context. no_wait [in] Different from zero if the operation should not block. cnt [out] Pointer to the 64-bit counter value. Description Returns 0 if successful, or the following error codes: -EAGAIN : The operation would have blocked but no_wait was non-zero. -ERESTARTSYS : A signal interrupted the wait operation. If no_wait is zero, the function might sleep until the eventfd internal counter becomes greater than zero. LINUX Kernel Hackers Manual July 2017 eventfd_fget 9 4.1.27 eventfd_fget Acquire a reference of an eventfd file descriptor. Synopsis struct file * eventfd_fget int fd Arguments fd [in] Eventfd file descriptor. Description Returns a pointer to the eventfd file structure in case of success, or the following error pointer -EBADF : Invalid fd file descriptor. -EINVAL : The fd file descriptor is not an eventfd file. LINUX Kernel Hackers Manual July 2017 eventfd_ctx_fdget 9 4.1.27 eventfd_ctx_fdget Acquires a reference to the internal eventfd context. Synopsis struct eventfd_ctx * eventfd_ctx_fdget int fd Arguments fd [in] Eventfd file descriptor. Description Returns a pointer to the internal eventfd context, otherwise the error pointers returned by the following functions eventfd_fget LINUX Kernel Hackers Manual July 2017 eventfd_ctx_fileget 9 4.1.27 eventfd_ctx_fileget Acquires a reference to the internal eventfd context. Synopsis struct eventfd_ctx * eventfd_ctx_fileget struct file * file Arguments file [in] Eventfd file pointer. Description Returns a pointer to the internal eventfd context, otherwise the error pointer -EINVAL : The fd file descriptor is not an eventfd file. The Filesystem for Exporting Kernel Objects LINUX Kernel Hackers Manual July 2017 sysfs_create_file_ns 9 4.1.27 sysfs_create_file_ns create an attribute file for an object with custom ns Synopsis int sysfs_create_file_ns struct kobject * kobj const struct attribute * attr const void * ns Arguments kobj object we're creating for attr attribute descriptor ns namespace the new file should belong to LINUX Kernel Hackers Manual July 2017 sysfs_add_file_to_group 9 4.1.27 sysfs_add_file_to_group add an attribute file to a pre-existing group. Synopsis int sysfs_add_file_to_group struct kobject * kobj const struct attribute * attr const char * group Arguments kobj object we're acting for. attr attribute descriptor. group group name. LINUX Kernel Hackers Manual July 2017 sysfs_chmod_file 9 4.1.27 sysfs_chmod_file update the modified mode value on an object attribute. Synopsis int sysfs_chmod_file struct kobject * kobj const struct attribute * attr umode_t mode Arguments kobj object we're acting for. attr attribute descriptor. mode file permissions. LINUX Kernel Hackers Manual July 2017 sysfs_remove_file_ns 9 4.1.27 sysfs_remove_file_ns remove an object attribute with a custom ns tag Synopsis void sysfs_remove_file_ns struct kobject * kobj const struct attribute * attr const void * ns Arguments kobj object we're acting for attr attribute descriptor ns namespace tag of the file to remove Description Hash the attribute name and namespace tag and kill the victim. LINUX Kernel Hackers Manual July 2017 sysfs_remove_file_from_group 9 4.1.27 sysfs_remove_file_from_group remove an attribute file from a group. Synopsis void sysfs_remove_file_from_group struct kobject * kobj const struct attribute * attr const char * group Arguments kobj object we're acting for. attr attribute descriptor. group group name. LINUX Kernel Hackers Manual July 2017 sysfs_create_bin_file 9 4.1.27 sysfs_create_bin_file create binary file for object. Synopsis int sysfs_create_bin_file struct kobject * kobj const struct bin_attribute * attr Arguments kobj object. attr attribute descriptor. LINUX Kernel Hackers Manual July 2017 sysfs_remove_bin_file 9 4.1.27 sysfs_remove_bin_file remove binary file for object. Synopsis void sysfs_remove_bin_file struct kobject * kobj const struct bin_attribute * attr Arguments kobj object. attr attribute descriptor. LINUX Kernel Hackers Manual July 2017 sysfs_create_link 9 4.1.27 sysfs_create_link create symlink between two objects. Synopsis int sysfs_create_link struct kobject * kobj struct kobject * target const char * name Arguments kobj object whose directory we're creating the link in. target object we're pointing to. name name of the symlink. LINUX Kernel Hackers Manual July 2017 sysfs_remove_link 9 4.1.27 sysfs_remove_link remove symlink in object's directory. Synopsis void sysfs_remove_link struct kobject * kobj const char * name Arguments kobj object we're acting for. name name of the symlink to remove. LINUX Kernel Hackers Manual July 2017 sysfs_rename_link_ns 9 4.1.27 sysfs_rename_link_ns rename symlink in object's directory. Synopsis int sysfs_rename_link_ns struct kobject * kobj struct kobject * targ const char * old const char * new const void * new_ns Arguments kobj object we're acting for. targ object we're pointing to. old previous name of the symlink. new new name of the symlink. new_ns new namespace of the symlink. Description A helper function for the common rename symlink idiom. The debugfs filesystem debugfs interface LINUX Kernel Hackers Manual July 2017 debugfs_create_file 9 4.1.27 debugfs_create_file create a file in the debugfs filesystem Synopsis struct dentry * debugfs_create_file const char * name umode_t mode struct dentry * parent void * data const struct file_operations * fops Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. data a pointer to something that the caller will want to get to later on. The inode.i_private pointer will point to this value on the open call. fops a pointer to a struct file_operations that should be used for this file. Description This is the basic create a file function for debugfs. It allows for a wide range of flexibility in creating a file, or a directory (if you want to create a directory, the debugfs_create_dir function is recommended to be used instead.) This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_create_file_size 9 4.1.27 debugfs_create_file_size create a file in the debugfs filesystem Synopsis struct dentry * debugfs_create_file_size const char * name umode_t mode struct dentry * parent void * data const struct file_operations * fops loff_t file_size Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. data a pointer to something that the caller will want to get to later on. The inode.i_private pointer will point to this value on the open call. fops a pointer to a struct file_operations that should be used for this file. file_size initial file size Description This is the basic create a file function for debugfs. It allows for a wide range of flexibility in creating a file, or a directory (if you want to create a directory, the debugfs_create_dir function is recommended to be used instead.) This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_create_dir 9 4.1.27 debugfs_create_dir create a directory in the debugfs filesystem Synopsis struct dentry * debugfs_create_dir const char * name struct dentry * parent Arguments name a pointer to a string containing the name of the directory to create. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the directory will be created in the root of the debugfs filesystem. Description This function creates a directory in debugfs with the given name. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_create_automount 9 4.1.27 debugfs_create_automount create automount point in the debugfs filesystem Synopsis struct dentry * debugfs_create_automount const char * name struct dentry * parent struct vfsmount *(*f) void * void * data Arguments name a pointer to a string containing the name of the file to create. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. f function to be called when pathname resolution steps on that one. data opaque argument to pass to f. Description f should return what ->d_automount would. LINUX Kernel Hackers Manual July 2017 debugfs_create_symlink 9 4.1.27 debugfs_create_symlink create a symbolic link in the debugfs filesystem Synopsis struct dentry * debugfs_create_symlink const char * name struct dentry * parent const char * target Arguments name a pointer to a string containing the name of the symbolic link to create. parent a pointer to the parent dentry for this symbolic link. This should be a directory dentry if set. If this parameter is NULL, then the symbolic link will be created in the root of the debugfs filesystem. target a pointer to a string containing the path to the target of the symbolic link. Description This function creates a symbolic link with the given name in debugfs that links to the given target path. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the symbolic link is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_remove 9 4.1.27 debugfs_remove removes a file or directory from the debugfs filesystem Synopsis void debugfs_remove struct dentry * dentry Arguments dentry a pointer to a the dentry of the file or directory to be removed. Description This function removes a file or directory in debugfs that was previously created with a call to another debugfs function (like debugfs_create_file or variants thereof.) This function is required to be called in order for the file to be removed, no automatic cleanup of files will happen when a module is removed, you are responsible here. LINUX Kernel Hackers Manual July 2017 debugfs_remove_recursive 9 4.1.27 debugfs_remove_recursive recursively removes a directory Synopsis void debugfs_remove_recursive struct dentry * dentry Arguments dentry a pointer to a the dentry of the directory to be removed. Description This function recursively removes a directory tree in debugfs that was previously created with a call to another debugfs function (like debugfs_create_file or variants thereof.) This function is required to be called in order for the file to be removed, no automatic cleanup of files will happen when a module is removed, you are responsible here. LINUX Kernel Hackers Manual July 2017 debugfs_rename 9 4.1.27 debugfs_rename rename a file/directory in the debugfs filesystem Synopsis struct dentry * debugfs_rename struct dentry * old_dir struct dentry * old_dentry struct dentry * new_dir const char * new_name Arguments old_dir a pointer to the parent dentry for the renamed object. This should be a directory dentry. old_dentry dentry of an object to be renamed. new_dir a pointer to the parent dentry where the object should be moved. This should be a directory dentry. new_name a pointer to a string containing the target name. Description This function renames a file/directory in debugfs. The target must not exist for rename to succeed. This function will return a pointer to old_dentry (which is updated to reflect renaming) if it succeeds. If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_initialized 9 4.1.27 debugfs_initialized Tells whether debugfs has been registered Synopsis bool debugfs_initialized void Arguments void no arguments LINUX Kernel Hackers Manual July 2017 debugfs_create_u8 9 4.1.27 debugfs_create_u8 create a debugfs file that is used to read and write an unsigned 8-bit value Synopsis struct dentry * debugfs_create_u8 const char * name umode_t mode struct dentry * parent u8 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. Description This function creates a file in debugfs with the given name that contains the value of the variable value. If the mode variable is so set, it can be read from, and written to. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_u16 9 4.1.27 debugfs_create_u16 create a debugfs file that is used to read and write an unsigned 16-bit value Synopsis struct dentry * debugfs_create_u16 const char * name umode_t mode struct dentry * parent u16 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. Description This function creates a file in debugfs with the given name that contains the value of the variable value. If the mode variable is so set, it can be read from, and written to. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_u32 9 4.1.27 debugfs_create_u32 create a debugfs file that is used to read and write an unsigned 32-bit value Synopsis struct dentry * debugfs_create_u32 const char * name umode_t mode struct dentry * parent u32 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. Description This function creates a file in debugfs with the given name that contains the value of the variable value. If the mode variable is so set, it can be read from, and written to. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_u64 9 4.1.27 debugfs_create_u64 create a debugfs file that is used to read and write an unsigned 64-bit value Synopsis struct dentry * debugfs_create_u64 const char * name umode_t mode struct dentry * parent u64 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. Description This function creates a file in debugfs with the given name that contains the value of the variable value. If the mode variable is so set, it can be read from, and written to. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_x8 9 4.1.27 debugfs_create_x8 create a debugfs file that is used to read and write an unsigned 8-bit value Synopsis struct dentry * debugfs_create_x8 const char * name umode_t mode struct dentry * parent u8 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_x16 9 4.1.27 debugfs_create_x16 create a debugfs file that is used to read and write an unsigned 16-bit value Synopsis struct dentry * debugfs_create_x16 const char * name umode_t mode struct dentry * parent u16 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_x32 9 4.1.27 debugfs_create_x32 create a debugfs file that is used to read and write an unsigned 32-bit value Synopsis struct dentry * debugfs_create_x32 const char * name umode_t mode struct dentry * parent u32 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_x64 9 4.1.27 debugfs_create_x64 create a debugfs file that is used to read and write an unsigned 64-bit value Synopsis struct dentry * debugfs_create_x64 const char * name umode_t mode struct dentry * parent u64 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_size_t 9 4.1.27 debugfs_create_size_t create a debugfs file that is used to read and write an size_t value Synopsis struct dentry * debugfs_create_size_t const char * name umode_t mode struct dentry * parent size_t * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_atomic_t 9 4.1.27 debugfs_create_atomic_t create a debugfs file that is used to read and write an atomic_t value Synopsis struct dentry * debugfs_create_atomic_t const char * name umode_t mode struct dentry * parent atomic_t * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. LINUX Kernel Hackers Manual July 2017 debugfs_create_bool 9 4.1.27 debugfs_create_bool create a debugfs file that is used to read and write a boolean value Synopsis struct dentry * debugfs_create_bool const char * name umode_t mode struct dentry * parent u32 * value Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. value a pointer to the variable that the file should read to and write from. Description This function creates a file in debugfs with the given name that contains the value of the variable value. If the mode variable is so set, it can be read from, and written to. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_blob 9 4.1.27 debugfs_create_blob create a debugfs file that is used to read a binary blob Synopsis struct dentry * debugfs_create_blob const char * name umode_t mode struct dentry * parent struct debugfs_blob_wrapper * blob Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. blob a pointer to a struct debugfs_blob_wrapper which contains a pointer to the blob data and the size of the data. Description This function creates a file in debugfs with the given name that exports blob->data as a binary blob. If the mode variable is so set it can be read from. Writing is not supported. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_u32_array 9 4.1.27 debugfs_create_u32_array create a debugfs file that is used to read u32 array. Synopsis struct dentry * debugfs_create_u32_array const char * name umode_t mode struct dentry * parent u32 * array u32 elements Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. array u32 array that provides data. elements total number of elements in the array. Description This function creates a file in debugfs with the given name that exports array as data. If the mode variable is so set it can be read from. Writing is not supported. Seek within the file is also not supported. Once array is created its size can not be changed. The function returns a pointer to dentry on success. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. LINUX Kernel Hackers Manual July 2017 debugfs_print_regs32 9 4.1.27 debugfs_print_regs32 use seq_print to describe a set of registers Synopsis void debugfs_print_regs32 struct seq_file * s const struct debugfs_reg32 * regs int nregs void __iomem * base char * prefix Arguments s the seq_file structure being used to generate output regs an array if struct debugfs_reg32 structures nregs the length of the above array base the base address to be used in reading the registers prefix a string to be prefixed to every output line Description This function outputs a text block describing the current values of some 32-bit hardware registers. It is meant to be used within debugfs files based on seq_file that need to show registers, intermixed with other information. The prefix argument may be used to specify a leading string, because some peripherals have several blocks of identical registers, for example configuration of dma channels LINUX Kernel Hackers Manual July 2017 debugfs_create_regset32 9 4.1.27 debugfs_create_regset32 create a debugfs file that returns register values Synopsis struct dentry * debugfs_create_regset32 const char * name umode_t mode struct dentry * parent struct debugfs_regset32 * regset Arguments name a pointer to a string containing the name of the file to create. mode the permission that the file should have parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. regset a pointer to a struct debugfs_regset32, which contains a pointer to an array of register definitions, the array size and the base address where the register bank is to be found. Description This function creates a file in debugfs with the given name that reports the names and values of a set of 32-bit registers. If the mode variable is so set it can be read from. Writing is not supported. This function will return a pointer to a dentry if it succeeds. This pointer must be passed to the debugfs_remove function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here.) If an error occurs, NULL will be returned. If debugfs is not enabled in the kernel, the value -ENODEV will be returned. It is not wise to check for this value, but rather, check for NULL or !NULL instead as to eliminate the need for #ifdef in the calling code. LINUX Kernel Hackers Manual July 2017 debugfs_create_devm_seqfile 9 4.1.27 debugfs_create_devm_seqfile create a debugfs file that is bound to device. Synopsis struct dentry * debugfs_create_devm_seqfile struct device * dev const char * name struct dentry * parent int (*read_fn) struct seq_file *s, void *data Arguments dev device related to this debugfs file. name name of the debugfs file. parent a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the debugfs filesystem. read_fn function pointer called to print the seq_file content. The Linux Journalling API Roger Gammans

rgammans@computer-surgery.co.uk

Stephen Tweedie

sct@redhat.com

2002 Roger Gammans The Linux Journalling API Overview Details The journalling layer is easy to use. You need to first of all create a journal_t data structure. There are two calls to do this dependent on how you decide to allocate the physical media on which the journal resides. The journal_init_inode() call is for journals stored in filesystem inodes, or the journal_init_dev() call can be use for journal stored on a raw device (in a continuous range of blocks). A journal_t is a typedef for a struct pointer, so when you are finally finished make sure you call journal_destroy() on it to free up any used kernel memory. Once you have got your journal_t object you need to 'mount' or load the journal file, unless of course you haven't initialised it yet - in which case you need to call journal_create(). Most of the time however your journal file will already have been created, but before you load it you must call journal_wipe() to empty the journal file. Hang on, you say , what if the filesystem wasn't cleanly umount()'d . Well, it is the job of the client file system to detect this and skip the call to journal_wipe(). In either case the next call should be to journal_load() which prepares the journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery() for you if it detects any outstanding transactions in the journal and similarly journal_load() will call journal_recover() if necessary. I would advise reading fs/ext3/super.c for examples on this stage. [RGG: Why is the journal_wipe() call necessary - doesn't this needlessly complicate the API. Or isn't a good idea for the journal layer to hide dirty mounts from the client fs] Now you can go ahead and start modifying the underlying filesystem. Almost. You still need to actually journal your filesystem changes, this is done by wrapping them into transactions. Additionally you also need to wrap the modification of each of the buffers with calls to the journal layer, so it knows what the modifications you are actually making are. To do this use journal_start() which returns a transaction handle. journal_start() and its counterpart journal_stop(), which indicates the end of a transaction are nestable calls, so you can reenter a transaction if necessary, but remember you must call journal_stop() the same number of times as journal_start() before the transaction is completed (or more accurately leaves the update phase). Ext3/VFS makes use of this feature to simplify quota support. Inside each transaction you need to wrap the modifications to the individual buffers (blocks). Before you start to modify a buffer you need to call journal_get_{create,write,undo}_access() as appropriate, this allows the journalling layer to copy the unmodified data if it needs to. After all the buffer may be part of a previously uncommitted transaction. At this point you are at last ready to modify a buffer, and once you are have done so you need to call journal_dirty_{meta,}data(). Or if you've asked for access to a buffer you now know is now longer required to be pushed back on the device you can call journal_forget() in much the same way as you might have used bforget() in the past. A journal_flush() may be called at any time to commit and checkpoint all your transactions. Then at umount time , in your put_super() you can then call journal_destroy() to clean up your in-core journal object. Unfortunately there a couple of ways the journal layer can cause a deadlock. The first thing to note is that each task can only have a single outstanding transaction at any one time, remember nothing commits until the outermost journal_stop(). This means you must complete the transaction at the end of each file/inode/address etc. operation you perform, so that the journalling system isn't re-entered on another journal. Since transactions can't be nested/batched across differing journals, and another filesystem other than yours (say ext3) may be modified in a later syscall. The second case to bear in mind is that journal_start() can block if there isn't enough space in the journal for your transaction (based on the passed nblocks param) - when it blocks it merely(!) needs to wait for transactions to complete and be committed from other tasks, so essentially we are waiting for journal_stop(). So to avoid deadlocks you must treat journal_start/stop() as if they were semaphores and include them in your semaphore ordering rules to prevent deadlocks. Note that journal_extend() has similar blocking behaviour to journal_start() so you can deadlock here just as easily as on journal_start(). Try to reserve the right number of blocks the first time. ;-). This will be the maximum number of blocks you are going to touch in this transaction. I advise having a look at at least ext3_jbd.h to see the basis on which ext3 uses to make these decisions. Another wriggle to watch out for is your on-disk block allocation strategy. why? Because, if you undo a delete, you need to ensure you haven't reused any of the freed blocks in a later transaction. One simple way of doing this is make sure any blocks you allocate only have checkpointed transactions listed against them. Ext3 does this in ext3_test_allocatable(). Lock is also providing through journal_{un,}lock_updates(), ext3 uses this when it wants a window with a clean and stable fs for a moment. eg. journal_lock_updates() //stop new stuff happening.. journal_flush() // checkpoint everything. ..do stuff on stable fs journal_unlock_updates() // carry on with filesystem use. The opportunities for abuse and DOS attacks with this should be obvious, if you allow unprivileged userspace to trigger codepaths containing these calls. A new feature of jbd since 2.5.25 is commit callbacks with the new journal_callback_set() function you can now ask the journalling layer to call you back when the transaction is finally committed to disk, so that you can do some of your own management. The key to this is the journal_callback struct, this maintains the internal callback information but you can extend it like this:- struct myfs_callback_s { //Data structure element required by jbd.. struct journal_callback for_jbd; // Stuff for myfs allocated together. myfs_inode* i_commited; } this would be useful if you needed to know when data was committed to a particular inode. Summary Using the journal is a matter of wrapping the different context changes, being each mount, each modification (transaction) and each changed buffer to tell the journalling layer about them. Here is a some pseudo code to give you an idea of how it works, as an example. journal_t* my_jnrl = journal_create(); journal_init_{dev,inode}(jnrl,...) if (clean) journal_wipe(); journal_load(); foreach(transaction) { /*transactions must be completed before a syscall returns to userspace*/ handle_t * xct=journal_start(my_jnrl); foreach(bh) { journal_get_{create,write,undo}_access(xact,bh); if ( myfs_modify(bh) ) { /* returns true if makes changes */ journal_dirty_{meta,}data(xact,bh); } else { journal_forget(bh); } } journal_stop(xct); } journal_destroy(my_jrnl); Data Types The journalling layer uses typedefs to 'hide' the concrete definitions of the structures used. As a client of the JBD layer you can just rely on the using the pointer as a magic cookie of some sort. Obviously the hiding is not enforced as this is 'C'. Structures LINUX Kernel Hackers Manual July 2017 typedef handle_t 9 typedef handle_t The handle_t type represents a single atomic update being performed by some process. Synopsis typedef handle_t; Description All filesystem modifications made by the process go through this handle. Recursive operations (such as quota operations) are gathered into a single update. The buffer credits field is used to account for journaled buffers being modified by the running process. To ensure that there is enough log space for all outstanding operations, we need to limit the number of outstanding buffers possible at any time. When the operation completes, any buffer credits not used are credited back to the transaction, so that at all times we know how many buffers the outstanding updates on a transaction might possibly touch. This is an opaque datatype. LINUX Kernel Hackers Manual July 2017 typedef journal_t 9 typedef journal_t The journal_t maintains all of the journaling state information for a single filesystem. Synopsis typedef journal_t; Description journal_t is linked to from the fs superblock structure. We use the journal_t to keep track of all outstanding transaction activity on the filesystem, and to manage the state of the log writing process. This is an opaque datatype. LINUX Kernel Hackers Manual July 2017 struct handle_s 9 4.1.27 struct handle_s this is the concrete type associated with handle_t. Synopsis struct handle_s { transaction_t * h_transaction; int h_buffer_credits; int h_ref; int h_err; unsigned int h_sync:1; unsigned int h_jdata:1; unsigned int h_aborted:1; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map h_lockdep_map; #endif }; Members h_transaction Which compound transaction is this update a part of? h_buffer_credits Number of remaining buffers we are allowed to dirty. h_ref Reference count on this handle h_err Field for caller's use to track errors through large fs operations h_sync flag for sync-on-close h_jdata flag to force data journaling h_aborted flag indicating fatal error on handle h_lockdep_map lockdep info for debugging lock problems LINUX Kernel Hackers Manual July 2017 struct journal_s 9 4.1.27 struct journal_s this is the concrete type associated with journal_t. Synopsis struct journal_s { unsigned long j_flags; int j_errno; struct buffer_head * j_sb_buffer; journal_superblock_t * j_superblock; int j_format_version; spinlock_t j_state_lock; int j_barrier_count; transaction_t * j_running_transaction; transaction_t * j_committing_transaction; transaction_t * j_checkpoint_transactions; wait_queue_head_t j_wait_transaction_locked; wait_queue_head_t j_wait_logspace; wait_queue_head_t j_wait_done_commit; wait_queue_head_t j_wait_checkpoint; wait_queue_head_t j_wait_commit; wait_queue_head_t j_wait_updates; struct mutex j_checkpoint_mutex; unsigned int j_head; unsigned int j_tail; unsigned int j_free; unsigned int j_first; unsigned int j_last; struct block_device * j_dev; int j_blocksize; unsigned int j_blk_offset; struct block_device * j_fs_dev; unsigned int j_maxlen; spinlock_t j_list_lock; struct inode * j_inode; tid_t j_tail_sequence; tid_t j_transaction_sequence; tid_t j_commit_sequence; tid_t j_commit_request; tid_t j_commit_waited; __u8 j_uuid[16]; struct task_struct * j_task; int j_max_transaction_buffers; unsigned long j_commit_interval; struct timer_list j_commit_timer; spinlock_t j_revoke_lock; struct jbd_revoke_table_s * j_revoke; struct jbd_revoke_table_s * j_revoke_table[2]; struct buffer_head ** j_wbuf; int j_wbufsize; pid_t j_last_sync_writer; u64 j_average_commit_time; void * j_private; }; Members j_flags General journaling state flags j_errno Is there an outstanding uncleared error on the journal (from a prior abort)? j_sb_buffer First part of superblock buffer j_superblock Second part of superblock buffer j_format_version Version of the superblock format j_state_lock Protect the various scalars in the journal j_barrier_count Number of processes waiting to create a barrier lock j_running_transaction The current running transaction.. j_committing_transaction the transaction we are pushing to disk j_checkpoint_transactions a linked circular list of all transactions waiting for checkpointing j_wait_transaction_locked Wait queue for waiting for a locked transaction to start committing, or for a barrier lock to be released j_wait_logspace Wait queue for waiting for checkpointing to complete j_wait_done_commit Wait queue for waiting for commit to complete j_wait_checkpoint Wait queue to trigger checkpointing j_wait_commit Wait queue to trigger commit j_wait_updates Wait queue to wait for updates to complete j_checkpoint_mutex Mutex for locking against concurrent checkpoints j_head Journal head - identifies the first unused block in the journal j_tail Journal tail - identifies the oldest still-used block in the journal. j_free Journal free - how many free blocks are there in the journal? j_first The block number of the first usable block j_last The block number one beyond the last usable block j_dev Device where we store the journal j_blocksize blocksize for the location where we store the journal. j_blk_offset starting block offset for into the device where we store the journal j_fs_dev Device which holds the client fs. For internal journal this will be equal to j_dev j_maxlen Total maximum capacity of the journal region on disk. j_list_lock Protects the buffer lists and internal buffer state. j_inode Optional inode where we store the journal. If present, all journal block numbers are mapped into this inode via bmap. j_tail_sequence Sequence number of the oldest transaction in the log j_transaction_sequence Sequence number of the next transaction to grant j_commit_sequence Sequence number of the most recently committed transaction j_commit_request Sequence number of the most recent transaction wanting commit j_commit_waited Sequence number of the most recent transaction someone is waiting for to commit. j_uuid[16] Uuid of client object. j_task Pointer to the current commit thread for this journal j_max_transaction_buffers Maximum number of metadata buffers to allow in a single compound commit transaction j_commit_interval What is the maximum transaction lifetime before we begin a commit? j_commit_timer The timer used to wakeup the commit thread j_revoke_lock Protect the revoke table j_revoke The revoke table - maintains the list of revoked blocks in the current transaction. j_revoke_table[2] alternate revoke tables for j_revoke j_wbuf array of buffer_heads for journal_commit_transaction j_wbufsize maximum number of buffer_heads allowed in j_wbuf, the number that will fit in j_blocksize j_last_sync_writer most recent pid which did a synchronous write j_average_commit_time the average amount of time in nanoseconds it takes to commit a transaction to the disk. j_private An opaque pointer to fs-private information. Functions The functions here are split into two groups those that affect a journal as a whole, and those which are used to manage transactions Journal Level LINUX Kernel Hackers Manual July 2017 journal_init_dev 9 4.1.27 journal_init_dev creates and initialises a journal structure Synopsis journal_t * journal_init_dev struct block_device * bdev struct block_device * fs_dev int start int len int blocksize Arguments bdev Block device on which to create the journal fs_dev Device which hold journalled filesystem for this journal. start Block nr Start of journal. len Length of the journal in blocks. blocksize blocksize of journalling device Returns a newly created journal_t * journal_init_dev creates a journal which maps a fixed contiguous range of blocks on an arbitrary block device. LINUX Kernel Hackers Manual July 2017 journal_init_inode 9 4.1.27 journal_init_inode creates a journal which maps to a inode. Synopsis journal_t * journal_init_inode struct inode * inode Arguments inode An inode to create the journal in Description journal_init_inode creates a journal which maps an on-disk inode as the journal. The inode must exist already, must support bmap and must have all data blocks preallocated. LINUX Kernel Hackers Manual July 2017 journal_create 9 4.1.27 journal_create Initialise the new journal file Synopsis int journal_create journal_t * journal Arguments journal Journal to create. This structure must have been initialised Description Given a journal_t structure which tells us which disk blocks we can use, create a new journal superblock and initialise all of the journal fields from scratch. LINUX Kernel Hackers Manual July 2017 journal_load 9 4.1.27 journal_load Read journal from disk. Synopsis int journal_load journal_t * journal Arguments journal Journal to act on. Description Given a journal_t structure which tells us which disk blocks contain a journal, read the journal from disk to initialise the in-memory structures. LINUX Kernel Hackers Manual July 2017 journal_destroy 9 4.1.27 journal_destroy Release a journal_t structure. Synopsis int journal_destroy journal_t * journal Arguments journal Journal to act on. Description Release a journal_t structure once it is no longer in use by the journaled object. Return <0 if we couldn't clean up the journal. LINUX Kernel Hackers Manual July 2017 journal_check_used_features 9 4.1.27 journal_check_used_features Check if features specified are used. Synopsis int journal_check_used_features journal_t * journal unsigned long compat unsigned long ro unsigned long incompat Arguments journal Journal to check. compat bitmask of compatible features ro bitmask of features that force read-only mount incompat bitmask of incompatible features Description Check whether the journal uses all of a given set of features. Return true (non-zero) if it does. LINUX Kernel Hackers Manual July 2017 journal_check_available_features 9 4.1.27 journal_check_available_features Check feature set in journalling layer Synopsis int journal_check_available_features journal_t * journal unsigned long compat unsigned long ro unsigned long incompat Arguments journal Journal to check. compat bitmask of compatible features ro bitmask of features that force read-only mount incompat bitmask of incompatible features Description Check whether the journaling code supports the use of all of a given set of features on this journal. Return true LINUX Kernel Hackers Manual July 2017 journal_set_features 9 4.1.27 journal_set_features Mark a given journal feature in the superblock Synopsis int journal_set_features journal_t * journal unsigned long compat unsigned long ro unsigned long incompat Arguments journal Journal to act on. compat bitmask of compatible features ro bitmask of features that force read-only mount incompat bitmask of incompatible features Description Mark a given journal feature as present on the superblock. Returns true if the requested features could be set. LINUX Kernel Hackers Manual July 2017 journal_update_format 9 4.1.27 journal_update_format Update on-disk journal structure. Synopsis int journal_update_format journal_t * journal Arguments journal Journal to act on. Description Given an initialised but unloaded journal struct, poke about in the on-disk structure to update it to the most recent supported version. LINUX Kernel Hackers Manual July 2017 journal_flush 9 4.1.27 journal_flush Flush journal Synopsis int journal_flush journal_t * journal Arguments journal Journal to act on. Description Flush all data for a given journal to disk and empty the journal. Filesystems can use this when remounting readonly to ensure that recovery does not need to happen on remount. LINUX Kernel Hackers Manual July 2017 journal_wipe 9 4.1.27 journal_wipe Wipe journal contents Synopsis int journal_wipe journal_t * journal int write Arguments journal Journal to act on. write flag (see below) Description Wipe out all of the contents of a journal, safely. This will produce a warning if the journal contains any valid recovery information. Must be called between journal_init_*() and journal_load. If 'write' is non-zero, then we wipe out the journal on disk; otherwise we merely suppress recovery. LINUX Kernel Hackers Manual July 2017 journal_abort 9 4.1.27 journal_abort Shutdown the journal immediately. Synopsis void journal_abort journal_t * journal int errno Arguments journal the journal to shutdown. errno an error number to record in the journal indicating the reason for the shutdown. Description Perform a complete, immediate shutdown of the ENTIRE journal (not of a single transaction). This operation cannot be undone without closing and reopening the journal. The journal_abort function is intended to support higher level error recovery mechanisms such as the ext2/ext3 remount-readonly error mode. Journal abort has very specific semantics. Any existing dirty, unjournaled buffers in the main filesystem will still be written to disk by bdflush, but the journaling mechanism will be suspended immediately and no further transaction commits will be honoured. Any dirty, journaled buffers will be written back to disk without hitting the journal. Atomicity cannot be guaranteed on an aborted filesystem, but we _do_ attempt to leave as much data as possible behind for fsck to use for cleanup. Any attempt to get a new transaction handle on a journal which is in ABORT state will just result in an -EROFS error return. A journal_stop on an existing handle will return -EIO if we have entered abort state during the update. Recursive transactions are not disturbed by journal abort until the final journal_stop, which will receive the -EIO error. Finally, the journal_abort call allows the caller to supply an errno which will be recorded (if possible) in the journal superblock. This allows a client to record failure conditions in the middle of a transaction without having to complete the transaction to record the failure to disk. ext3_error, for example, now uses this functionality. Errors which originate from within the journaling layer will NOT supply an errno; a null errno implies that absolutely no further writes are done to the journal (unless there are any already in progress). LINUX Kernel Hackers Manual July 2017 journal_errno 9 4.1.27 journal_errno returns the journal's error state. Synopsis int journal_errno journal_t * journal Arguments journal journal to examine. Description This is the errno numbet set with journal_abort, the last time the journal was mounted - if the journal was stopped without calling abort this will be 0. If the journal has been aborted on this mount time -EROFS will be returned. LINUX Kernel Hackers Manual July 2017 journal_clear_err 9 4.1.27 journal_clear_err clears the journal's error state Synopsis int journal_clear_err journal_t * journal Arguments journal journal to act on. Description An error must be cleared or Acked to take a FS out of readonly mode. LINUX Kernel Hackers Manual July 2017 journal_ack_err 9 4.1.27 journal_ack_err Ack journal err. Synopsis void journal_ack_err journal_t * journal Arguments journal journal to act on. Description An error must be cleared or Acked to take a FS out of readonly mode. LINUX Kernel Hackers Manual July 2017 journal_recover 9 4.1.27 journal_recover recovers a on-disk journal Synopsis int journal_recover journal_t * journal Arguments journal the journal to recover Description The primary function for recovering the log contents when mounting a journaled device. Recovery is done in three passes. In the first pass, we look for the end of the log. In the second, we assemble the list of revoke blocks. In the third and final pass, we replay any un-revoked blocks in the log. LINUX Kernel Hackers Manual July 2017 journal_skip_recovery 9 4.1.27 journal_skip_recovery Start journal and wipe exiting records Synopsis int journal_skip_recovery journal_t * journal Arguments journal journal to startup Description Locate any valid recovery information from the journal and set up the journal structures in memory to ignore it (presumably because the caller has evidence that it is out of date). This function does'nt appear to be exorted.. We perform one pass over the journal to allow us to tell the user how much recovery information is being erased, and to let us initialise the journal transaction sequence numbers to the next unused ID. Transasction Level LINUX Kernel Hackers Manual July 2017 journal_start 9 4.1.27 journal_start Obtain a new handle. Synopsis handle_t * journal_start journal_t * journal int nblocks Arguments journal Journal to start transaction on. nblocks number of block buffer we might modify Description We make sure that the transaction can guarantee at least nblocks of modified buffers in the log. We block until the log can guarantee that much space. This function is visible to journal users (like ext3fs), so is not called with the journal already locked. Return a pointer to a newly allocated handle, or an ERR_PTR value on failure. LINUX Kernel Hackers Manual July 2017 journal_extend 9 4.1.27 journal_extend extend buffer credits. Synopsis int journal_extend handle_t * handle int nblocks Arguments handle handle to 'extend' nblocks nr blocks to try to extend by. Description Some transactions, such as large extends and truncates, can be done atomically all at once or in several stages. The operation requests a credit for a number of buffer modications in advance, but can extend its credit if it needs more. journal_extend tries to give the running handle more buffer credits. It does not guarantee that allocation - this is a best-effort only. The calling process MUST be able to deal cleanly with a failure to extend here. Return 0 on success, non-zero on failure. return code < 0 implies an error return code > 0 implies normal transaction-full status. LINUX Kernel Hackers Manual July 2017 journal_restart 9 4.1.27 journal_restart restart a handle. Synopsis int journal_restart handle_t * handle int nblocks Arguments handle handle to restart nblocks nr credits requested Description Restart a handle for a multi-transaction filesystem operation. If the journal_extend call above fails to grant new buffer credits to a running handle, a call to journal_restart will commit the handle's transaction so far and reattach the handle to a new transaction capabable of guaranteeing the requested number of credits. LINUX Kernel Hackers Manual July 2017 journal_lock_updates 9 4.1.27 journal_lock_updates establish a transaction barrier. Synopsis void journal_lock_updates journal_t * journal Arguments journal Journal to establish a barrier on. Description This locks out any further updates from being started, and blocks until all existing updates have completed, returning only once the journal is in a quiescent state with no updates running. We do not use simple mutex for synchronization as there are syscalls which want to return with filesystem locked and that trips up lockdep. Also hibernate needs to lock filesystem but locked mutex then blocks hibernation. Since locking filesystem is rare operation, we use simple counter and waitqueue for locking. LINUX Kernel Hackers Manual July 2017 journal_unlock_updates 9 4.1.27 journal_unlock_updates release barrier Synopsis void journal_unlock_updates journal_t * journal Arguments journal Journal to release the barrier on. Description Release a transaction barrier obtained with journal_lock_updates. LINUX Kernel Hackers Manual July 2017 journal_get_write_access 9 4.1.27 journal_get_write_access notify intent to modify a buffer for metadata (not data) update. Synopsis int journal_get_write_access handle_t * handle struct buffer_head * bh Arguments handle transaction to add buffer modifications to bh bh to be used for metadata writes Description Returns an error code or 0 on success. In full data journalling mode the buffer may be of type BJ_AsyncData, because we're writeing a buffer which is also part of a shared mapping. LINUX Kernel Hackers Manual July 2017 journal_get_create_access 9 4.1.27 journal_get_create_access notify intent to use newly created bh Synopsis int journal_get_create_access handle_t * handle struct buffer_head * bh Arguments handle transaction to new buffer to bh new buffer. Description Call this if you create a new bh. LINUX Kernel Hackers Manual July 2017 journal_get_undo_access 9 4.1.27 journal_get_undo_access Notify intent to modify metadata with non-rewindable consequences Synopsis int journal_get_undo_access handle_t * handle struct buffer_head * bh Arguments handle transaction bh buffer to undo Description Sometimes there is a need to distinguish between metadata which has been committed to disk and that which has not. The ext3fs code uses this for freeing and allocating space, we have to make sure that we do not reuse freed space until the deallocation has been committed, since if we overwrote that space we would make the delete un-rewindable in case of a crash. To deal with that, journal_get_undo_access requests write access to a buffer for parts of non-rewindable operations such as delete operations on the bitmaps. The journaling code must keep a copy of the buffer's contents prior to the undo_access call until such time as we know that the buffer has definitely been committed to disk. We never need to know which transaction the committed data is part of, buffers touched here are guaranteed to be dirtied later and so will be committed to a new transaction in due course, at which point we can discard the old committed data pointer. Returns error number or 0 on success. LINUX Kernel Hackers Manual July 2017 journal_dirty_data 9 4.1.27 journal_dirty_data mark a buffer as containing dirty data to be flushed Synopsis int journal_dirty_data handle_t * handle struct buffer_head * bh Arguments handle transaction bh bufferhead to mark Description Mark a buffer as containing dirty data which needs to be flushed before we can commit the current transaction. The buffer is placed on the transaction's data list and is marked as belonging to the transaction. Returns error number or 0 on success. journal_dirty_data can be called via page_launder->ext3_writepage by kswapd. LINUX Kernel Hackers Manual July 2017 journal_dirty_metadata 9 4.1.27 journal_dirty_metadata mark a buffer as containing dirty metadata Synopsis int journal_dirty_metadata handle_t * handle struct buffer_head * bh Arguments handle transaction to add buffer to. bh buffer to mark Description Mark dirty metadata which needs to be journaled as part of the current transaction. The buffer is placed on the transaction's metadata list and is marked as belonging to the transaction. Returns error number or 0 on success. Special care needs to be taken if the buffer already belongs to the current committing transaction (in which case we should have frozen data present for that commit). In that case, we don't relink the buffer that only gets done when the old transaction finally completes its commit. LINUX Kernel Hackers Manual July 2017 journal_forget 9 4.1.27 journal_forget bforget for potentially-journaled buffers. Synopsis int journal_forget handle_t * handle struct buffer_head * bh Arguments handle transaction handle bh bh to 'forget' Description We can only do the bforget if there are no commits pending against the buffer. If the buffer is dirty in the current running transaction we can safely unlink it. bh may not be a journalled buffer at all - it may be a non-JBD buffer which came off the hashtable. Check for this. Decrements bh->b_count by one. Allow this call even if the handle has aborted --- it may be part of the caller's cleanup after an abort. LINUX Kernel Hackers Manual July 2017 journal_stop 9 4.1.27 journal_stop complete a transaction Synopsis int journal_stop handle_t * handle Arguments handle tranaction to complete. Description All done for a particular handle. There is not much action needed here. We just return any remaining buffer credits to the transaction and remove the handle. The only complication is that we need to start a commit operation if the filesystem is marked for synchronous update. journal_stop itself will not usually return an error, but it may do so in unusual circumstances. In particular, expect it to return -EIO if a journal_abort has been executed since the transaction began. LINUX Kernel Hackers Manual July 2017 journal_force_commit 9 4.1.27 journal_force_commit force any uncommitted transactions Synopsis int journal_force_commit journal_t * journal Arguments journal journal to force For synchronous operations force any uncommitted transactions to disk. May seem kludgy, but it reuses all the handle batching code in a very simple manner. LINUX Kernel Hackers Manual July 2017 journal_try_to_free_buffers 9 4.1.27 journal_try_to_free_buffers try to free page buffers. Synopsis int journal_try_to_free_buffers journal_t * journal struct page * page gfp_t gfp_mask Arguments journal journal for operation page to try and free gfp_mask we use the mask to detect how hard should we try to release buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to release the buffers. Description For all the buffers on this page, if they are fully written out ordered data, move them onto BUF_CLEAN so try_to_free_buffers can reap them. This function returns non-zero if we wish try_to_free_buffers to be called. We do this if the page is releasable by try_to_free_buffers. We also do it if the page has locked or dirty buffers and the caller wants us to perform sync or async writeout. This complicates JBD locking somewhat. We aren't protected by the BKL here. We wish to remove the buffer from its committing or running transaction's ->t_datalist via __journal_unfile_buffer. This may *change* the value of transaction_t->t_datalist, so anyone who looks at t_datalist needs to lock against this function. Even worse, someone may be doing a journal_dirty_data on this buffer. So we need to lock against that. journal_dirty_data will come out of the lock with the buffer dirty, which makes it ineligible for release here. Who else is affected by this? hmm... Really the only contender is do_get_write_access - it could be looking at the buffer while journal_try_to_free_buffer is changing its state. But that cannot happen because we never reallocate freed data as metadata while the data is part of a transaction. Yes? Return 0 on failure, 1 on success LINUX Kernel Hackers Manual July 2017 journal_invalidatepage 9 4.1.27 journal_invalidatepage invalidate a journal page Synopsis void journal_invalidatepage journal_t * journal struct page * page unsigned int offset unsigned int length Arguments journal journal to use for flush page page to flush offset offset of the range to invalidate length length of the range to invalidate Description Reap page buffers containing data in specified range in page. See also Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie splice API splice is a method for moving blocks of data around inside the kernel, without continually transferring them between the kernel and user space. LINUX Kernel Hackers Manual July 2017 splice_to_pipe 9 4.1.27 splice_to_pipe fill passed data into a pipe Synopsis ssize_t splice_to_pipe struct pipe_inode_info * pipe struct splice_pipe_desc * spd Arguments pipe pipe to fill spd data to fill Description spd contains a map of pages and len/offset tuples, along with the struct pipe_buf_operations associated with these pages. This function will link that data to the pipe. LINUX Kernel Hackers Manual July 2017 generic_file_splice_read 9 4.1.27 generic_file_splice_read splice data from file to a pipe Synopsis ssize_t generic_file_splice_read struct file * in loff_t * ppos struct pipe_inode_info * pipe size_t len unsigned int flags Arguments in file to splice from ppos position in in pipe pipe to splice to len number of bytes to splice flags splice modifier flags Description Will read pages from given file and fill them into a pipe. Can be used as long as the address_space operations for the source implements a readpage hook. LINUX Kernel Hackers Manual July 2017 splice_from_pipe_feed 9 4.1.27 splice_from_pipe_feed feed available data from a pipe to a file Synopsis int splice_from_pipe_feed struct pipe_inode_info * pipe struct splice_desc * sd splice_actor * actor Arguments pipe pipe to splice from sd information to actor actor handler that splices the data Description This function loops over the pipe and calls actor to do the actual moving of a single struct pipe_buffer to the desired destination. It returns when there's no more buffers left in the pipe or if the requested number of bytes (sd->total_len) have been copied. It returns a positive number (one) if the pipe needs to be filled with more data, zero if the required number of bytes have been copied and -errno on error. This, together with splice_from_pipe_{begin,end,next}, may be used to implement the functionality of __splice_from_pipe when locking is required around copying the pipe buffers to the destination. LINUX Kernel Hackers Manual July 2017 splice_from_pipe_next 9 4.1.27 splice_from_pipe_next wait for some data to splice from Synopsis int splice_from_pipe_next struct pipe_inode_info * pipe struct splice_desc * sd Arguments pipe pipe to splice from sd information about the splice operation Description This function will wait for some data and return a positive value (one) if pipe buffers are available. It will return zero or -errno if no more data needs to be spliced. LINUX Kernel Hackers Manual July 2017 splice_from_pipe_begin 9 4.1.27 splice_from_pipe_begin start splicing from pipe Synopsis void splice_from_pipe_begin struct splice_desc * sd Arguments sd information about the splice operation Description This function should be called before a loop containing splice_from_pipe_next and splice_from_pipe_feed to initialize the necessary fields of sd. LINUX Kernel Hackers Manual July 2017 splice_from_pipe_end 9 4.1.27 splice_from_pipe_end finish splicing from pipe Synopsis void splice_from_pipe_end struct pipe_inode_info * pipe struct splice_desc * sd Arguments pipe pipe to splice from sd information about the splice operation Description This function will wake up pipe writers if necessary. It should be called after a loop containing splice_from_pipe_next and splice_from_pipe_feed. LINUX Kernel Hackers Manual July 2017 __splice_from_pipe 9 4.1.27 __splice_from_pipe splice data from a pipe to given actor Synopsis ssize_t __splice_from_pipe struct pipe_inode_info * pipe struct splice_desc * sd splice_actor * actor Arguments pipe pipe to splice from sd information to actor actor handler that splices the data Description This function does little more than loop over the pipe and call actor to do the actual moving of a single struct pipe_buffer to the desired destination. See pipe_to_file, pipe_to_sendpage, or pipe_to_user. LINUX Kernel Hackers Manual July 2017 splice_from_pipe 9 4.1.27 splice_from_pipe splice data from a pipe to a file Synopsis ssize_t splice_from_pipe struct pipe_inode_info * pipe struct file * out loff_t * ppos size_t len unsigned int flags splice_actor * actor Arguments pipe pipe to splice from out file to splice to ppos position in out len how many bytes to splice flags splice modifier flags actor handler that splices the data Description See __splice_from_pipe. This function locks the pipe inode, otherwise it's identical to __splice_from_pipe. LINUX Kernel Hackers Manual July 2017 iter_file_splice_write 9 4.1.27 iter_file_splice_write splice data from a pipe to a file Synopsis ssize_t iter_file_splice_write struct pipe_inode_info * pipe struct file * out loff_t * ppos size_t len unsigned int flags Arguments pipe pipe info out file to write to ppos position in out len number of bytes to splice flags splice modifier flags Description Will either move or copy pages (determined by flags options) from the given pipe inode to the given file. This one is ->write_iter-based. LINUX Kernel Hackers Manual July 2017 generic_splice_sendpage 9 4.1.27 generic_splice_sendpage splice data from a pipe to a socket Synopsis ssize_t generic_splice_sendpage struct pipe_inode_info * pipe struct file * out loff_t * ppos size_t len unsigned int flags Arguments pipe pipe to splice from out socket to write to ppos position in out len number of bytes to splice flags splice modifier flags Description Will send len bytes from the pipe to a network socket. No data copying is involved. LINUX Kernel Hackers Manual July 2017 splice_direct_to_actor 9 4.1.27 splice_direct_to_actor splices data directly between two non-pipes Synopsis ssize_t splice_direct_to_actor struct file * in struct splice_desc * sd splice_direct_actor * actor Arguments in file to splice from sd actor information on where to splice to actor handles the data splicing Description This is a special case helper to splice directly between two points, without requiring an explicit pipe. Internally an allocated pipe is cached in the process, and reused during the lifetime of that process. LINUX Kernel Hackers Manual July 2017 do_splice_direct 9 4.1.27 do_splice_direct splices data directly between two files Synopsis long do_splice_direct struct file * in loff_t * ppos struct file * out loff_t * opos size_t len unsigned int flags Arguments in file to splice from ppos input file offset out file to splice to opos output file offset len number of bytes to splice flags splice modifier flags Description For use by do_sendfile. splice can easily emulate sendfile, but doing it in the application would incur an extra system call (splice in + splice out, as compared to just sendfile). So this helper can splice directly through a process-private pipe. pipes API Pipe interfaces are all for in-kernel (builtin image) use. They are not exported for use by modules. LINUX Kernel Hackers Manual July 2017 struct pipe_buffer 9 4.1.27 struct pipe_buffer a linux kernel pipe buffer Synopsis struct pipe_buffer { struct page * page; unsigned int offset; unsigned int len; const struct pipe_buf_operations * ops; unsigned int flags; unsigned long private; }; Members page the page containing the data for the pipe buffer offset offset of data inside the page len length of data inside the page ops operations associated with this buffer. See pipe_buf_operations. flags pipe buffer flags. See above. private private data owned by the ops. LINUX Kernel Hackers Manual July 2017 struct pipe_inode_info 9 4.1.27 struct pipe_inode_info a linux kernel pipe Synopsis struct pipe_inode_info { struct mutex mutex; wait_queue_head_t wait; unsigned int nrbufs; unsigned int curbuf; unsigned int buffers; unsigned int readers; unsigned int writers; unsigned int files; unsigned int waiting_writers; unsigned int r_counter; unsigned int w_counter; struct page * tmp_page; struct fasync_struct * fasync_readers; struct fasync_struct * fasync_writers; struct pipe_buffer * bufs; }; Members mutex mutex protecting the whole thing wait reader/writer wait point in case of empty/full pipe nrbufs the number of non-empty pipe buffers in this pipe curbuf the current pipe buffer entry buffers total number of buffers (should be a power of 2) readers number of current readers of this pipe writers number of current writers of this pipe files number of struct file referring this pipe (protected by ->i_lock) waiting_writers number of writers blocked waiting for room r_counter reader counter w_counter writer counter tmp_page cached released page fasync_readers reader side fasync fasync_writers writer side fasync bufs the circular array of pipe buffers LINUX Kernel Hackers Manual July 2017 generic_pipe_buf_steal 9 4.1.27 generic_pipe_buf_steal attempt to take ownership of a pipe_buffer Synopsis int generic_pipe_buf_steal struct pipe_inode_info * pipe struct pipe_buffer * buf Arguments pipe the pipe that the buffer belongs to buf the buffer to attempt to steal Description This function attempts to steal the struct page attached to buf. If successful, this function returns 0 and returns with the page locked. The caller may then reuse the page for whatever he wishes; the typical use is insertion into a different file page cache. LINUX Kernel Hackers Manual July 2017 generic_pipe_buf_get 9 4.1.27 generic_pipe_buf_get get a reference to a struct pipe_buffer Synopsis void generic_pipe_buf_get struct pipe_inode_info * pipe struct pipe_buffer * buf Arguments pipe the pipe that the buffer belongs to buf the buffer to get a reference to Description This function grabs an extra reference to buf. It's used in in the tee system call, when we duplicate the buffers in one pipe into another. LINUX Kernel Hackers Manual July 2017 generic_pipe_buf_confirm 9 4.1.27 generic_pipe_buf_confirm verify contents of the pipe buffer Synopsis int generic_pipe_buf_confirm struct pipe_inode_info * info struct pipe_buffer * buf Arguments info the pipe that the buffer belongs to buf the buffer to confirm Description This function does nothing, because the generic pipe code uses pages that are always good when inserted into the pipe. LINUX Kernel Hackers Manual July 2017 generic_pipe_buf_release 9 4.1.27 generic_pipe_buf_release put a reference to a struct pipe_buffer Synopsis void generic_pipe_buf_release struct pipe_inode_info * pipe struct pipe_buffer * buf Arguments pipe the pipe that the buffer belongs to buf the buffer to put a reference to Description This function releases a reference to buf.