📄 vfs.txt
字号:
if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag). The filesystem must take care of unlocking the page and releasing it refcount, and updating i_size. Returns < 0 on failure, otherwise the number of bytes (<= 'copied') that were able to be copied into pagecache. bmap: called by the VFS to map a logical block offset within object to physical block number. This method is used by the FIBMAP ioctl and for working with swap-files. To be able to swap to a file, the file must have a stable mapping to a block device. The swap system does not go through the filesystem but instead uses bmap to find out where the blocks in the file are and uses those addresses directly. invalidatepage: If a page has PagePrivate set, then invalidatepage will be called when part or all of the page is to be removed from the address space. This generally corresponds to either a truncation or a complete invalidation of the address space (in the latter case 'offset' will always be 0). Any private data associated with the page should be updated to reflect this truncation. If offset is 0, then the private data should be released, because the page must be able to be completely discarded. This may be done by calling the ->releasepage function, but in this case the release MUST succeed. releasepage: releasepage is called on PagePrivate pages to indicate that the page should be freed if possible. ->releasepage should remove any private data from the page and clear the PagePrivate flag. It may also remove the page from the address_space. If this fails for some reason, it may indicate failure with a 0 return value. This is used in two distinct though related cases. The first is when the VM finds a clean page with no active users and wants to make it a free page. If ->releasepage succeeds, the page will be removed from the address_space and become free. The second case is when a request has been made to invalidate some or all pages in an address_space. This can happen through the fadvice(POSIX_FADV_DONTNEED) system call or by the filesystem explicitly requesting it as nfs and 9fs do (when they believe the cache may be out of date with storage) by calling invalidate_inode_pages2(). If the filesystem makes such a call, and needs to be certain that all pages are invalidated, then its releasepage will need to ensure this. Possibly it can clear the PageUptodate bit if it cannot free private data yet. direct_IO: called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the application's address space. get_xip_page: called by the VM to translate a block number to a page. The page is valid until the corresponding filesystem is unmounted. Filesystems that want to use execute-in-place (XIP) need to implement it. An example implementation can be found in fs/ext2/xip.c. migrate_page: This is used to compact the physical memory usage. If the VM wants to relocate a page (maybe off a memory card that is signalling imminent failure) it will pass a new page and an old page to this function. migrate_page should transfer any private data across and update any references that it has to the page. launder_page: Called before freeing a page - it writes back the dirty page. To prevent redirtying the page, it is kept locked during the whole operation.The File Object===============A file object represents a file opened by a process.struct file_operations----------------------This describes how the VFS can manipulate an open file. As of kernel2.6.22, the following members are defined:struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);};Again, all methods are called without any locks being held, unlessotherwise noted. llseek: called when the VFS needs to move the file position index read: called by read(2) and related system calls aio_read: called by io_submit(2) and other asynchronous I/O operations write: called by write(2) and related system calls aio_write: called by io_submit(2) and other asynchronous I/O operations readdir: called when the VFS needs to read the directory contents poll: called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls ioctl: called by the ioctl(2) system call unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not require the BKL should use this method instead of the ioctl() above. compat_ioctl: called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels. mmap: called by the mmap(2) system call open: called by the VFS when an inode should be opened. When the VFS opens a file, it creates a new "struct file". It then calls the open method for the newly allocated file structure. You might think that the open method really belongs in "struct inode_operations", and you may be right. I think it's done the way it is because it makes filesystems simpler to implement. The open() method is a good place to initialize the "private_data" member in the file structure if you want to point to a device structure flush: called by the close(2) system call to flush a file release: called when the last reference to an open file is closed fsync: called by the fsync(2) system call fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands readv: called by the readv(2) system call writev: called by the writev(2) system call sendfile: called by the sendfile(2) system call get_unmapped_area: called by the mmap(2) system call check_flags: called by the fcntl(2) system call for F_SETFL command dir_notify: called by the fcntl(2) system call for F_NOTIFY command flock: called by the flock(2) system call splice_write: called by the VFS to splice data from a pipe to a file. This method is used by the splice(2) system call splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system callNote that the file operations are implemented by the specificfilesystem in which the inode resides. When opening a device node(character or block special) most filesystems will call specialsupport routines in the VFS which will locate the required devicedriver information. These support routines replace the filesystem fileoperations with those for the device driver, and then proceed to callthe new open() method for the file. This is how opening a device filein the filesystem eventually ends up calling the device driver open()method.Directory Entry Cache (dcache)==============================struct dentry_operations------------------------This describes how a filesystem can overload the standard dentryoperations. Dentries and the dcache are the domain of the VFS and theindividual filesystem implementations. Device drivers have no businesshere. These methods may be set to NULL, as they are either optional orthe VFS uses a default. As of kernel 2.6.22, the following members aredefined:struct dentry_operations { int (*d_revalidate)(struct dentry *, struct nameidata *); int (*d_hash) (struct dentry *, struct qstr *); int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)(struct dentry *, char *, int);}; d_revalidate: called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most filesystems leave this as NULL, because all their dentries in the dcache are valid d_hash: called when the VFS adds a dentry to the hash table d_compare: called when a dentry should be compared with another d_delete: called when the last reference to a dentry is deleted. This means no-one is using the dentry, however it is still valid and in the dcache d_release: called when a dentry is really deallocated d_iput: called when a dentry loses its inode (just prior to its being deallocated). The default when this is NULL is that the VFS calls iput(). If you define this method, you must call iput() yourself d_dname: called when the pathname of a dentry should be generated. Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay pathname generation. (Instead of doing it when dentry is created, its done only when the path is needed.). Real filesystems probably dont want to use it, because their dentries are present in global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless appropriate SMP safety is used. CAUTION : d_path() logic is quite tricky. The correct way to return for example "Hello" is to put it at the end of the buffer, and returns a pointer to the first char. dynamic_dname() helper function is provided to take care of this.Example :static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen){ return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);}Each dentry has a pointer to its parent dentry, as well as a hash listof child dentries. Child dentries are basically like files in adirectory.Directory Entry Cache API--------------------------There are a number of functions defined which permit a filesystem tomanipulate dentries: dget: open a new handle for an existing dentry (this just increments the usage count) dput: close a handle for a dentry (decrements the usage count). If the usage count drops to 0, the "d_delete" method is called and the dentry is placed on the unused list if the dentry is still in its parents hash list. Putting the dentry on the unused list just means that if the system needs some RAM, it goes through the unused list of dentries and deallocates them. If the dentry has already been unhashed and the usage count drops to 0, in this case the dentry is deallocated after the "d_delete" method is called d_drop: this unhashes a dentry from its parents hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0 d_delete: delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead d_add: add a dentry to its parents hash list and then calls d_instantiate() d_instantiate: add a dentry to the alias hash list for the inode and updates the "d_inode" member. The "i_count" member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a "negative dentry". This function is commonly called when an inode is created for an existing negative dentry d_lookup: look up a dentry given its parent and path name component It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use d_put() to free the dentry when it finishes using it.For further information on dentry locking, please refer to the documentDocumentation/filesystems/dentry-locking.txt.Resources=========(Note some of these resources are not up-to-date with the latest kernel version.)Creating Linux virtual filesystems. 2002 <http://lwn.net/Articles/13325/>The Linux Virtual File-system Layer by Neil Brown. 1999 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>A tour of the Linux VFS by Michael K. Johnson. 1996 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>A small trail through the Linux kernel by Andries Brouwer. 2001 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -