</style> </head> <body revision="dcbsxfpf_33dwh2nqdt:334"> <div align=center>
<table align=center border=0 cellpadding=0 cellspacing=0 height=5716 width=768>
<tbody>
<tr>
<td height=5716 valign=top width=802>
<pre>2007-12-30<br>fs/inode.c <br> Is the inode another kind of cache? Yes, and it is no different from the others.<br> We have already analyzed quite a few caches, file system caches in particular, and they follow a recognizable pattern. A cache usually has a hash table,<br>a get/put function pair, sync support, invalidate support, shrink support, wait_on support, aging-list operations (mark dirty and so on),<br>and definitions of the used/valid/dirty/clean states. That is all there is to it.<br>Master these aspects, add the synchronization and mutual-exclusion concerns, and the picture of a complete cache emerges.<br> <br> We follow one obvious thread through this cache: the lists an inode can be on. (Each cache has been approached differently so far,<br>since there was no earlier experience to draw on.) inode.c has a very good comment that helps explain the inode list operations. <br><font color=#3333ff><br> /*<br> * Each inode can be on two separate lists. One is<br> * the hash list of the inode, used for lookups. The<br> * other linked list is the "type" list:<br> * "in_use" - valid inode, i_count > 0, i_nlink > 0<br> * "dirty" - as "in_use" but also dirty<br> * "unused" - valid inode, i_count = 0<br> *<br> * A "dirty" list is maintained for each super block,<br> * allowing for low-overhead inode sync() operations.<br> */</font><br><br>static LIST_HEAD(<font color=#000099><b>inode_in_use</b></font>); <font color=#3333ff> /* valid inode, i_count>0, i_nlink>0 */</font><br>static LIST_HEAD(<font color=#000099><b>inode_unused</b></font>); <font color=#3333ff> /* valid inode, i_count=0 */</font><br><font color=#006600><b>(LIST in super_block) sb->s_dirty /* valid, i_count>0, i_nlink>0, dirty */<br></b></font><br>static LIST_HEAD(<font color=#000099><b>anon_hash_chain</b></font>); /* for inodes with NULL i_sb (a stand-in hash chain: insert_inode_hash() links such an inode here<br>through i_hash, so that it still counts as hashed) */<br><br>static struct list_head *<font color=#000099><b>inode_hashtable</b></font>; /* valid (an inode is valid once it is hashed) */<br><br><font color=#339999 size=3><font color=#3333ff>Summary of inode life management:</font><br><br></font><font color=#339999 size=4>1) Regular inodes: created, or read in by a lookup</font><br>After a regular inode is created and initialized, it is inserted into the hash table (anon_hash_chain takes its place only when i_sb is NULL) and becomes a valid inode.<br>A mark-dirty operation then immediately moves it to the <font color=#006600><b>sb->s_dirty</b></font> list. A typical creation path:<br>path_walk -> vfs_create -><font color=#000099>ext2_create</font>-><font 
color=#006600>ext2_new_inode(->new_inode), mark_inode_dirty</font><br>A regular inode can also be read in from disk by a lookup. In that case it is added to the hash and placed on the <font color=#000099><b>inode_in_use</b></font> list right away. For the code, see<br>path_walk->real_lookup->ext2_lookup->iget4-><font color=#006600><b>get_new_inode</b></font>.<br> <font color=#000099><b>anon_hash_chain</b></font> is a side chain; sync and invalidate do not need to consider it.<br><br><font color=#339999 size=4>2) Special inodes: on a life-management list, but possibly not in the hash table</font><br> File systems such as pipefs, which we analyzed recently, never hash their inodes, yet those inodes still go on a life-management list: in_use. For a typical case, see<br>do_pipe->get_pipe_inode. The same holds for fake inodes that are used only briefly, for example in blkdev_get.<br><br><font color=#339999 size=4>3) Allocating memory for an inode (the lowest-level allocation functions)</font><br>From the analysis above we can identify the two most basic inode allocation functions.<br>struct inode * <font color=#000066><b>get_empty_inode</b></font>(void) /* for special inodes, or when nothing is known in advance; no inode hash lookup */ <br>static struct inode * <font color=#000066><b>get_new_inode</b></font>(*sb, ino, *head, find_inode_t find_actor, void *opaque) /* allocate and read in */<br>Reading them together with the scenarios in 1) and 2) makes this clear.<br>static inline struct inode * <font color=#000066><b>new_inode</b></font>(struct super_block *sb) /* calls get_empty_inode */<br><br><font color=#339999 size=4>4) iget/iput</font><br>iget->iget4<br>A get operation first looks in the hash (a cache lookup) and allocates a new inode only if none is found.<br>struct inode *<font color=#000066><b>iget4</b></font>(struct super_block *sb, unsigned long ino, find_inode_t find_actor, void *opaque)<br>{<br>    struct list_head * head = inode_hashtable + hash(sb,ino);<br>    struct inode * inode;<br><br>    spin_lock(&inode_lock);<br>    inode = find_inode(sb, ino, head, find_actor, opaque);<font color=#3333ff>/* cache hash lookup */</font><br>    if (inode) {<br>        __iget(inode); <font color=#3333ff>/* a chance to check the inode's state */</font><br>        spin_unlock(&inode_lock);<br>        wait_on_inode(inode);<br>        return inode;<br>    }<br>    spin_unlock(&inode_lock);<br><br>    <font color=#3333ff>/*<br>     * get_new_inode() will do the right thing, re-trying the search<br>     * in case it had to block at any point.<br>     */</font><br>    return get_new_inode(sb, ino, head, find_actor, opaque);<br>}<br>static inline void __iget(struct inode * inode)<br>{<br>    if (atomic_read(&inode->i_count)) {<font color=#3333ff> /* already referenced */</font><br>        atomic_inc(&inode->i_count);<br>        return;<br>    }<br>    atomic_inc(&inode->i_count); <font color=#3333ff>/* valid inode with no references: these are the inodes that are truly cached */</font><br>    if (!(inode->i_state & I_DIRTY)) {<br>        list_del(&inode->i_list); <font color=#3333ff>/* this should be the unused list */</font><br>        list_add(&inode->i_list, &inode_in_use); <br>    }<br>    inodes_stat.nr_unused--;<br>}<br><br>iput is the basic way to release an inode. However an inode was handed out, it is always released through iput.<br>/**<br> * iput - put an inode <br> * @inode: inode to put<br> *<br> * Puts an inode, dropping its usage count. If the inode use count hits<br> * zero the inode is also then freed and may be destroyed.<br> */<br>void iput(struct inode *inode)<br>{<br>    if (inode) {<br>        struct super_operations *op = NULL;<br>        <font color=#3333ff>/* call the put function supplied by the file system */</font><br>        if (inode->i_sb && inode->i_sb->s_op)<br>            op = inode->i_sb->s_op;<br>        if (op && op->put_inode)<br>            op->put_inode(inode);<br><br>        if (!atomic_dec_and_lock(&inode->i_count, &inode_lock))<br>            return;<br><br>        if (!inode->i_nlink) {<font color=#3333ff>/* already deleted */</font><br>            ... destroy all data<br><br>            if (inode->i_data.nrpages)<br>                truncate_inode_pages(&inode->i_data, 0);<font color=#3333ff> /* clear the filemap pages, see the figure below */</font><br><br>            if (op && op->delete_inode) {<br>                void (*delete)(struct inode *) = op->delete_inode;<br>                /* s_op->delete_inode internally recalls clear_inode() */<br>                delete(inode);<br>            } else<br>                clear_inode(inode);<br>            if (inode->i_state != I_CLEAR)<br>                BUG();<br>        } else {<font color=#3366ff>/* not deleted */</font><br>            if (!list_empty(&inode->i_hash)) { <br>                if (!(inode->i_state & I_DIRTY)) {<br>                    list_del(&inode->i_list);<br>                    list_add(&inode->i_list,<br>                         &inode_unused);<br>                }<br>                inodes_stat.nr_unused++;<br>                spin_unlock(&inode_lock);<br>                return;<br>            } else {<font color=#3333ff>/* pipe inodes and temporary inodes are never hashed */</font> <br>                /* magic nfs path */<br>                <font color=#3333ff>..... clear all data</font><br>            }<br>        }<br>        destroy_inode(inode);<font color=#3333ff> /* call free .... */</font><br>    }<br>}<br><br>Reading iput shows that an inode that has not been deleted and is still in the hash table is kept around as a cache entry; in every other case it is freed at once.<br>Finally, note the interface struct inode *igrab(struct inode *inode), which takes a reference on an inode whose pointer the caller already holds.<br><br><br><font color=#339999 size=4>5) The role of the inode in the file system</font><br>Here we meet the relationship between filemap and the file system again, so recall the figure below. If an inode (a file) is mmapped into memory, the link is:<br>inode->i_mapping->i_mmap_shared|i_mmap.</pre>
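The list movements performed by __iget and iput can be condensed into a small user-space model. The sketch below is not kernel code: the toy_* names, the enum, and the flag fields are hypothetical stand-ins for i_count, i_nlink, i_hash, and I_DIRTY, and all locking is left out. It only illustrates, under those assumptions, which "type" list an inode ends up on after each call.

```c
#include <assert.h>

/* Toy model only: which "type" list the inode currently sits on. */
enum toy_list { TOY_IN_USE, TOY_UNUSED, TOY_DIRTY, TOY_FREED };

struct toy_inode {
    int count;          /* models i_count */
    int nlink;          /* models i_nlink */
    int hashed;         /* models !list_empty(&inode->i_hash) */
    int dirty;          /* models i_state & I_DIRTY */
    enum toy_list list; /* models the list that i_list is linked on */
};

int toy_nr_unused;      /* models inodes_stat.nr_unused */

/* Mirrors __iget(): take a reference; on the 0 -> 1 transition a clean
 * cached inode moves back to in_use, a dirty one stays on the dirty list. */
void toy_iget(struct toy_inode *in)
{
    assert(in->list != TOY_FREED);
    if (in->count++ > 0)
        return;                 /* already referenced, nothing moves */
    if (!in->dirty)
        in->list = TOY_IN_USE;
    toy_nr_unused--;
}

/* Mirrors the decision made in iput(): drop a reference and, when it was
 * the last one, either keep the inode as a cache entry or free it. */
void toy_iput(struct toy_inode *in)
{
    assert(in->count > 0);
    if (--in->count > 0)
        return;                 /* other users remain */
    if (in->nlink == 0 || !in->hashed) {
        in->list = TOY_FREED;   /* deleted file, or unhashed (pipe-like) inode */
        return;
    }
    if (!in->dirty)
        in->list = TOY_UNUSED;  /* clean and hashed: cached on the unused list */
    toy_nr_unused++;            /* counted as unused even if it stays dirty */
}
```

For example, dropping the last reference on a hashed, clean, linked inode leaves it on the unused list as a cache entry, and a later toy_iget brings it back to in_use; an unhashed inode (the pipe case) is freed instead.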
<div id=ips: style="PADDING:1em 0pt; TEXT-ALIGN:left">
<img src=045__fs_inode_c_images/dcbsxfpf_15cq94jchq.gif style="WIDTH:701px; HEIGHT:606px">
</div>
<pre>Comparing this with the many members of struct inode may help us understand more. The parts marked in blue below are the ones that can be understood from the figure above<br>and from the preceding discussion. (The gray ones are not hard to understand: they are features and attributes. The complicated part is the management.)<br>struct inode {<br>    struct list_head i_hash; <font color=#3333ff>/* inode hash table */</font><br>    struct list_head i_list; <font color=#3333ff>/* inode life-management list: used/unused/dirty (<font color=#006600><b>sb->s_dirty</b></font>) */</font><br>    struct list_head i_dentry;<font color=#3333ff>/* list of the dentries that refer to this inode */</font><br> <br>    struct list_head i_dirty_buffers;<font color=#3333ff> /* metadata: buffers; file data: buffer entries; see the analysis of buffer.c */</font><br><br>    <font color=#666666>unsigned long i_ino;</font><br>    atomic_t <font color=#3333ff>i_count</font>;<br>    kdev_t <font color=#ff0000><b>i_dev</b></font>;<br><font color=#666666>    umode_t i_mode;</font><br>    nlink_t <font color=#3333ff>i_nlink</font>;<br>    kdev_t <font color=#ff0000><b>i_rdev</b></font>;<br>    <font color=#666666>.....//time,size,uid,block|blksize</font><br>    struct semaphore i_sem;<br>    struct semaphore i_zombie;<br>    <font color=#999999>struct inode_operations *i_op;<br>    struct file_operations *i_fop; /* former ->i_op->default_file_ops */</font><br>    struct super_block *<font color=#3333ff>i_sb</font>;<br>    <font color=#999999>wait_queue_head_t i_wait;</font><br>    struct file_lock *i_flock;<br>    struct address_space *<font color=#3333ff>i_mapping</font>;<br>    struct address_space <font color=#3333ff>i_data</font>; <br>    struct dquot *i_dquot[MAXQUOTAS];<br>    struct pipe_inode_info *<font color=#3333ff>i_pipe</font>;<br>    struct block_device *<font color=#cc0000><b>i_bdev</b></font>;<br><br>    unsigned long <font color=#990000>i_dnotify_mask</font>; /* Directory notify events */<br>    struct dnotify_struct *<font color=#990000>i_dnotify</font>; /* for directory notifications */<br><font color=#666666><br>    unsigned long i_state;<br><br>    unsigned int i_flags;<br>    unsigned char i_sock;<br><br>    atomic_t i_writecount;<br>    unsigned int i_attr_flags;<br>    __u32 i_generation;<br>    union {....<br>        struct ext2_inode_info ext2_i;<br>        ....... <br>    } u;</font><br>};<br><br>The good news is that we have already analyzed nearly every role the inode plays. Take dnotify: this diagram is a reminder (ideally, reread the relevant part of fcntl.c as well).<br> <br> +------+<br> |inode |<br> +--+---+<br>    |<br>    |    +--------------+  +--------------+<br>    +----|dnotify_struct|--|dnotify_struct|<br>         +------+-------+  +--------------+<br>                |<br>                |<br>         +------+------------+<br>         | file->f_owner.pid |<br>         +-------------------+<br><br>Finally, the three dev members deserve a mention (they are a bit messy; unifying them later looks like the trend):<br><br>kdev_t <font color=#ff0000><b>i_dev</b></font><font color=#000000>; </font>
kdev_t <font color=#ff0000><b>i_rdev</b></font><font color=#000000>; </font>
struct block_device *<font color=#cc0000><b>i_bdev</b></font>;<br>These three devices bring us back to a classic function:<br><font color=#000066><b>init_special_inode</b></font>(struct inode *inode, umode_t mode, int rdev)<br>{<br>    inode->i_mode = mode;<br>    if (S_ISCHR(mode)) {<br>        inode->i_fop = &def_chr_fops;<br>        inode-><b>i_rdev</b> = to_kdev_t(rdev);<br>    } else if (S_ISBLK(mode)) {<br>        inode->i_fop = &def_blk_fops;<br>        inode-><b>i_rdev</b> = to_kdev_t(rdev);<br>        inode->i_bdev = bdget(rdev);<br>    } else if (S_ISFIFO(mode))<br>        inode->i_fop = &def_fifo_fops;<br>    else if (S_ISSOCK(mode))<br>        inode->i_fop = &bad_sock_fops;<br>    else<br>        printk(KERN_DEBUG "init_special_inode: bogus imode (%o)\n", mode);<br>}<br><br>One look gives the general idea: the three dev members are not all valid at all times. For a FIFO, neither i_rdev nor i_bdev is valid; for a block device, both i_rdev<br>and i_bdev are valid; for a char device, only i_rdev is valid. So what is i_dev? <font color=#cc0000><b>i_dev</b></font> is the device the inode itself resides on (usually a block device).<br>Note that a kdev_t is enough to identify a device inside the kernel, which is what i_dev and i_rdev are for. i_bdev has type struct block_device and is used<br>only for block devices. A block device can itself be a file (see block_dev.c), but unlike a char device, which exposes its file_operations directly<br>(see devices.c), a block device is an indirect structure: reads and writes go through generic block-layer interfaces, and the device's own operations table has only open/release,<br>with no read/write. To see the difference, read the fops of block and char devices carefully:<br> &<font color=#cc0000><b>def_chr_fops</b></font>; /* only open, which simply swaps in the char device's own fops */<br> &<font color=#cc0000><b>def_blk_fops</b></font>; /* provides everything: a complete virtual layer */<br>Also, devfs support in 2.4 is not very complete and is entangled with block_device. 
If the blkdev fops could be obtained from i_rdev, i_bdev would not be needed at all.<br><br><br><font color=#339999 size=4><b>6) mark dirty and sync<br></b></font>The dirty state of an inode has several levels:<br>#define <font color=#ff0000>I_DIRTY_SYNC</font> 1 /* Not dirty enough for O_DATASYNC */<br>#define I_DIRTY_DATASYNC 2 /* Data-related inode changes pending (there is no separate interface to mark only datasync) */<br>#define <font color=#ff0000>I_DIRTY_PAGES</font> 4 /* Data-related inode changes pending */<br>#define <font color=#ff0000>I_DIRTY</font> (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)<br><br>These states interact with the O_SYNC flag; each one records that some part of the inode's data is dirty.<br><font color=#ff0000>I_DIRTY_SYNC</font> /* only the access time was updated */<br>I_DIRTY_DATASYNC /* no separate mark function; set together with I_DIRTY, means the inode itself is dirty */<br><font color=#ff0000>I_DIRTY_PAGES</font> /* set on mmap writes, when the kernel finds that an mmapped page is dirty */<br>I_DIRTY: all of the above combined.<br>Syncing an inode means syncing the filemap associated with it as well as the inode itself. The operation that does this is sync_one.<br>static inline void <font color=#000066><b>sync_one</b></font>(struct inode *inode, int sync)<br>{<br>    if (inode->i_state & I_LOCK) { <font color=#3333ff>/* I/O already in progress: no need to sync, just wait */</font><br>        __iget(inode);<br>        spin_unlock(&inode_lock);<br>        __wait_on_inode(inode);<br>        iput(inode);<br>        spin_lock(&inode_lock);<br>    } else { <font color=#3333ff>/* otherwise sync the filemap pages associated with the inode, and the inode itself */</font><br>        unsigned dirty;<br>        <font color=#3333ff> /* it is about to become clean, so move it to the appropriate list */</font><br>        list_del(&inode->i_list);<br>        list_add(&inode->i_list, atomic_read(&inode->i_count)<br>                            ? &inode_in_use<br>                            : &inode_unused);<br>        /* Set I_LOCK, reset I_DIRTY */<br>        dirty = inode->i_state & I_DIRTY;<br>        inode->i_state |= I_LOCK; /* start I/O */<br>        inode->i_state &= ~I_DIRTY;<br>        spin_unlock(&inode_lock);<br><br>        filemap_fdatasync(inode->i_mapping);<font color=#3333ff> /* write out all page cache pages on inode->i_mapping->dirty_pages */</font><br><br>        /* Don't write the inode if only I_DIRTY_PAGES was set */<br>        if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))<br>            write_inode(inode, sync); <font color=#3333ff>/* sync the inode itself: mmap writes alone do not cause the inode itself to be written */</font><br><br>        filemap_fdatawait(inode->i_mapping); <font color=#3333ff>/* wait on pages */</font><br><br>        spin_lock(&inode_lock);<br>        inode->i_state &= ~I_LOCK;<br>        wake_up(&inode->i_wait);<br>    }<br>}<br><br>A related but not identical operation is the osync path:<br>O_SYNC: an open flag that makes write wait until the disk write I/O completes, for both file data and metadata.<br>O_SYNC is handled by generic_osync_inode. Unlike syncing an inode, it adds one more layer: it also syncs the dirty buffers attached to the inode, that is, file data and<br>metadata.<br><br>/**<br> * generic_osync_inode - flush all dirty data for a given inode to disk<br> * @inode: inode to write<br> * @datasync: if set, don't bother flushing timestamps<br> *<br> * This can be called by file_write functions for files which have the<br> * O_SYNC flag set, to flush dirty writes to disk. <br> */<br><br>int <font color=#000066><b>generic_osync_inode</b></font>(struct inode *inode, int datasync)<br>{<br>    int err;<br> <br>    /* <br>     * WARNING<br>     *<br>     * Currently, the filesystem write path does not pass the<br>     * filp down to the low-level write functions. Therefore it<br>     * is impossible for (say) __block_commit_write to know if<br>     * the operation is O_SYNC or not.<br>     *<br>     * Ideally, O_SYNC writes would have the filesystem call<br>     * ll_rw_block as it went to kick-start the writes, and we<br>     * could call osync_inode_buffers() here to wait only for<br>     * those IOs which have already been submitted to the device<br>     * driver layer. As it stands, if we did this we'd not write<br>     * anything to disk since our writes have not been queued by<br>     * this point: they are still on the dirty LRU.<br>     * <br>     * So, currently we will call fsync_inode_buffers() instead,<br>     * to flush _all_ dirty buffers for this inode to disk on <br>     * every O_SYNC write, not just the synchronous I/Os. --sct<br>     */<br><br>#ifdef WRITERS_QUEUE_IO<br>    err = osync_inode_buffers(inode);<br>#else<br>    err = fsync_inode_buffers(inode); <font color=#3333ff>/* the function currently used: writes the inode's dirty buffers to disk, both metadata<br>                                         and file data */</font><br>#endif<br><br>    spin_lock(&inode_lock);<br>    if (!(inode->i_state & I_DIRTY))<br>        goto out;<br>    if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) <font color=#3333ff>/* without DATASYNC, do not sync the filemap or the inode */</font><br>        goto out; <font color=#3333ff> /* i.e. mmap writes and access-time updates do not make an O_SYNC write sync the filemap and inode */</font><br>    spin_unlock(&inode_lock);<br>    write_inode_now(inode, 1); /* calls sync_one */<br>    return err;<br><br> out:<br>    spin_unlock(&inode_lock);<br>    return err;<br>}<br><br>That is all on inode sync. Note that earlier analyses may not have been entirely accurate where they touched on inode syncing;<br>reread them in light of this section and the code.<br><br><font color=#339999 size=4><b>7) invalidate</b></font><br>To invalidate means to discard: give up the inode, all the way to finally freeing the memory it occupies. Note that if sync has not been called first, this operation loses data:<br>invalidate does not check for dirtiness at all. It first moves the inodes from their lists onto a temporary list, then frees them one by one.<br>/**<br> * invalidate_inodes - discard the inodes on a device<br> * @sb: superblock<br> *<br> * Discard all of the inodes for a given superblock. If the discard<br> * fails because there are busy inodes then a non zero value is returned.<br> * If the discard is successful all the inodes have been discarded.<br> */<br> <br>int invalidate_inodes(struct super_block * sb)<br>{<br>    int busy;<br>    LIST_HEAD(throw_away);<br><br>    spin_lock(&inode_lock);<br>    busy = invalidate_list(&inode_in_use, sb, &throw_away); <font color=#3333ff>/* unlink from the hash and in_use lists */</font><br>    busy |= invalidate_list(&inode_unused, sb, &throw_away);/* discard the dirty buffers on the inode */<br>    busy |= invalidate_list(&sb->s_dirty, sb, &throw_away); /**/<br>    spin_unlock(&inode_lock);<br><br>    dispose_list(&throw_away); <font color=#3333ff>/* discard the filemap pages and free the inode's memory */</font><br><br>    return busy;<br>}<br><br><font color=#339999 size=4><b>8) shrink and prune</b></font><br>These find the unused inodes and free those that can be freed. This also takes two steps: first collect the freeable inodes on a temporary list, then free them one by one.<br>void prune_icache(int goal)<br>void shrink_icache_memory(int priority, int gfp_mask)<br>See the two functions above for details. Why can an unused inode not be deleted right away? Because an unused inode is still valid and may hold dirty data. <br><br><br></pre>
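The two-step structure of prune (collect the victims first, free them afterwards) can be sketched in plain C. This is a simplified model, not the kernel's prune_icache: the toy_* names and the singly linked list are assumptions made for illustration, and toy_can_unuse is only a stand-in for the kernel's real freeability checks. In the kernel the first pass runs under inode_lock and the second pass runs after the lock is dropped, which is why the victims are moved to a private list in between.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
    int count;               /* models i_count */
    int dirty;               /* models i_state & I_DIRTY */
    int locked;              /* models i_state & I_LOCK */
    int freed;               /* set by the dispose pass */
    struct toy_inode *next;  /* link on the unused list or the private list */
};

/* Simplified freeability test: no references and no pending state. */
int toy_can_unuse(const struct toy_inode *in)
{
    return in->count == 0 && !in->dirty && !in->locked;
}

/* Pass 1: unlink up to `goal` freeable inodes from *unused and push them
 * onto a private list that no other path can see. */
struct toy_inode *toy_collect(struct toy_inode **unused, int goal)
{
    struct toy_inode *victims = NULL;
    struct toy_inode **link = unused;

    while (*link && goal > 0) {
        struct toy_inode *in = *link;
        if (toy_can_unuse(in)) {
            *link = in->next;     /* unlink from the unused list */
            in->next = victims;   /* push onto the private list */
            victims = in;
            goal--;
        } else {
            link = &in->next;     /* skip: still dirty, locked, or in use */
        }
    }
    return victims;
}

/* Pass 2: free everything on the private list; returns how many were freed. */
int toy_dispose(struct toy_inode *victims)
{
    int n = 0;
    while (victims) {
        struct toy_inode *next = victims->next;
        victims->next = NULL;
        victims->freed = 1;       /* a real implementation would release memory here */
        victims = next;
        n++;
    }
    return n;
}

/* Both passes together, in the order described above. */
int toy_prune(struct toy_inode **unused, int goal)
{
    return toy_dispose(toy_collect(unused, goal));
}
```

With an unused list of three inodes where the middle one is dirty, toy_prune frees the two clean inodes and leaves the dirty one as the only entry on the list, which matches the point that an unused inode cannot always be dropped immediately.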
</td>
</tr>
</tbody>
</table>
</div>
<br></body></html>