</style> </head> <body revision="dcbsxfpf_65fnfzsmft:5"> <table align=center cellpadding=0 cellspacing=0 height=5716 width=768>
<tbody>
<tr>
<td height=5716 valign=top width=100%>
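<pre>2006-8-10 <br>mm/swap_state.c<br><br> Whenever a page has to interact with an external storage device, an address_space must be<br>set up for it; for swap this is swapper_space. An address_space_operations table must also<br>be supplied; for swap this is swap_aops. The same holds for the page cache, swap cache,<br>shmem and filemap.<br> <br> Setting up these two structures only solves the problem of writing pages out. Reading pages<br>in relies on direct function calls from handle_pte_fault: do_swap_page for swap, and<br>vma->vm_ops->nopage for file mappings/mmap. There is no unified solution.<br> <br> I will not analyze these in much depth. The focus here is on physical memory pages,<br>page->count, and the reference counts of swap entries. (Is a function-by-function listing<br>really needed here?)<br> <br> <br> <br> page: where does it go<br> <br> See page_alloc.c and the buddy system: all physical pages are managed by the buddy allocator<br>(except reserved pages, which are device memory or set aside for special purposes). To see<br>where pages go, it is enough to follow the callers of the allocation functions.<br> Allocation interfaces provided by page_alloc.c (only these are in use in 2.4):<br> <br> 1. alloc_pages, called by page_cache_alloc<br> Pages that leave through this interface all sit in the page cache (or swap cache) and<br> cache the contents of disk (or disk-like) files. The concrete users are: swap cache,<br> page cache, file read (page cache), filemap (page cache, or a copy from the page cache),<br> COW (may not be in the page cache), shmem_nopage (page cache).<br> <br> 2. __get_free_pages:<br> Widely used by drivers, kernel hash tables, the task struct, networking (hash tables<br> etc.), buffers (filesystem metadata and block device file reads/writes; there are also<br> on-the-fly buffers, created for pages that need I/O, but those pages do not come from<br> __get_free_pages), and slab.<br> <br> 3. __get_free_page<br> Page tables (page directory, pmd), drivers, user argument pages.<br> <br> 4. get_zeroed_page:<br> Drivers (tty), shmem (for the direct/indirect mapping tables kept in the kernel, which<br> never deal with backing store).<br> <br> 5. alloc_page:<br> buffers, string argument pages, anonymous pages (page fault), vmalloc (kernel pages,<br> never swapped).<br> <br> <br> Now the question can be answered: where does all the physical memory go?<br> (FIXME: I think everything is listed here.)<br> 1) Used by the kernel 'itself'<br> This covers drivers, networking, page tables (kernel or process), various hash tables,<br> arguments copied from user space, slab caches for kernel data structures (inode,<br> dentry, ...), shmem mapping tables, and the kernel pages used by vmalloc.<br> <br> 2) page cache/swap cache <br> A page can be in only one of these two caches. They cache the contents of files on disk<br> (not metadata), including regular files, filemap (shared), and shmem.<br> <br> 3) buffers<br> They cache filesystem metadata and serve as the cache for device file reads and writes.<br> This excludes the on-the-fly buffers created for page I/O, although those buffers also<br> do page->count++.<br> <br> 4) User process pages<br> This is a mixture. Pages used by a process can also be in the page cache/swap cache, and<br> can also have buffers. 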
 Besides these pages that have an owner, a process also uses anonymous pages, i.e. pages<br> with no mapping. These include process pages not yet in swap, filemap (not shared), and<br> COW pages.<br> <br> <br> <br> page->count <br> <br> First, a comment from mm.h that is worth reading. Note that this comment is very old:<br>inode->i_pages no longer exists in 2.4, and the sentence "For pages belonging to inodes, the<br>page->count is the number of attaches, plus 1 if buffers are allocated to the page." is no<br>longer correct. (Like our documentation, it has not been updated for a long time; things are<br>better in 2.6.)<br><br>/*<br> * Various page->flags bits:<br> *<br> * PG_reserved is set for a page which must never be accessed (which<br> * may not even be present).<br> *<br> * PG_DMA has been removed, page->zone now tells exactly wether the<br> * page is suited to do DMAing into.<br> *<br> * Multiple processes may "see" the same page. E.g. for untouched<br> * mappings of /dev/null, all processes see the same page full of<br> * zeroes, and text pages of executables and shared libraries have<br> * only one copy in memory, at most, normally.<br> *<br> * For the non-reserved pages, page->count denotes a reference count.<br> * page->count == 0 means the page is free.<br> * page->count == 1 means the page is used for exactly one purpose<br> * (e.g. a private data page of one process).<br> *<br> * A page may be used for kmalloc() or anyone else who does a<br> * __get_free_page(). In this case the page->count is at least 1, and<br> * all other fields are unused but should be 0 or NULL. The<br> * management of this page is the responsibility of the one who uses<br> * it.<br> *<br> * The other pages (we may call them "process pages") are completely<br> * managed by the Linux memory manager: I/O, buffers, swapping etc.<br> * The following discussion applies only to them.<br> *<br> * A page may belong to an inode's memory mapping. In this case,<br> * page->inode is the pointer to the inode, and page->offset is the<br> * file offset of the page (not necessarily a multiple of PAGE_SIZE).<br> *<br> * A page may have buffers allocated to it. In this case,<br> * page->buffers is a circular list of these buffer heads. 
Else,<br> * page->buffers == NULL.<br> *<br> * For pages belonging to inodes, the page->count is the number of<br> * attaches, plus 1 if buffers are allocated to the page.<br> *<br> * All pages belonging to an inode make up a doubly linked list<br> * inode->i_pages, using the fields page->next and page->prev. (These<br> * fields are also used for freelist management when page->count==0.)<br> * There is also a hash table mapping (inode,offset) to the page<br> * in memory if present. The lists for this hash table use the fields<br> * page->next_hash and page->pprev_hash.<br> *<br> * All process pages can do I/O:<br> * - inode pages may need to be read from disk,<br> * - inode pages which have been modified and are MAP_SHARED may need<br> * to be written to disk,<br> * - private pages which have been modified may need to be swapped out<br> * to swap space and (later) to be read back into memory.<br> * During disk I/O, PG_locked is used. This bit is set before I/O<br> * and reset when I/O completes. page->wait is a wait queue of all<br> * tasks waiting for the I/O on this page to complete.<br> * PG_uptodate tells whether the page's contents is valid.<br> * When a read completes, the page becomes uptodate, unless a disk I/O<br> * error happened.<br> *<br> * For choosing which pages to swap out, inode pages carry a<br> * PG_referenced bit, which is set any time the system accesses<br> * that page through the (inode,offset) hash table.<br> *<br> * PG_skip is used on sparc/sparc64 architectures to "skip" certain<br> * parts of the address space.<br> *<br> * PG_error is set to indicate that an I/O error occurred on this page.<br> *<br> * PG_arch_1 is an architecture specific page state bit. 
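The generic<br> * code guarentees that this bit is cleared for a page when it first<br> * is entered into the page cache.<br> */<br><br> Based on the analysis above of where physical pages go, page->count can be summarized as<br>follows:<br> 1) Pages of the first kind, used by the kernel itself, generally have a reference count<br> of 1 (FIXME).<br> <br> 2) Pages in the page/swap cache: the cache adds 1, buffers add 1.<br> <br> 3) User processes: each process adds 1.<br> <br> 4) Many places take a temporary reference to keep the page from being freed and drop it<br> again shortly afterwards. These are ignored here.<br> <br> <br> <br> <br> page->count: a worked example<br><br> I chose the function is_page_shared for a detailed analysis. In October 2002 linuxforum was<br>very lively, and the discussion of this function was drowned in a sea of posts. Curiosity and<br>debate about page->count have never stopped, though. Forums abroad may have long since stopped<br>discussing such questions actively, but we will carry on.<br> Please read the comments carefully.<br>/*<br> * Work out if there are any other processes sharing this page, ignoring<br> * any page reference coming from the swap cache, or from outstanding<br> * swap IO on this page. (The page cache _does_ count as another valid<br> * reference to the page, however.)<br> */<br> /* I) Sources of the page reference count in this situation:<br> * 1. processes, one per process 2. swap or page cache, one 3. buffers, one<br> * <br> * II) The swap entry associated with the page:<br> * once the page is in the swap cache, a swap entry reference count other than 1 (e.g. tmpfs)<br> * means that some other place still expects to find this page through the swap entry (tmpfs).<br> * That is effectively one more, anonymous, reference to the page.<br> *<br> * III) The page cache counts as "another process" (see the English comment above)<br> */<br>static inline int is_page_shared(struct page *page)<br>{<br> unsigned int count;<br> if (PageReserved(page))<br> return 1;<br> count = page_count(page); // the page's own reference count<br><br> /* II) Page is in the swap cache (not in the page cache):<br> * references from all processes = page count + swap entry count - (swap's own references) <br> * swap's own references are: <br> * 1 for the swap cache's reference to the page, 1 for the page's reference on the swap entry;<br> * if there are buffers, count 1 more as a swap reference (it is not a process in any case).<br> */<br> if (PageSwapCache(page))<br> count += swap_count(page) - 2 - !!page->buffers;<br><br> /* III) Page is in the page cache, or in no cache at all: <br> * in this case, if it has buffers it must belong to the page cache (filemap). Otherwise,<br> * why would a process page need to be written to disk?<br> * process references + page cache (with bound buffers) references = page count<br> */<br> <br><br> /* If the page is in the swap cache, one of the remaining counts is the current process,<br> * so only a count > 1 means that other processes are using this page<br> */<br> return count > 1;<br>}<br> <br> Its meaning has been analyzed above. Now for the precondition and the call sites:<br> This function assumes that a process is already using the page (ref 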
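one); that is the precondition. There are three call sites: <br> 1. do_wp_page: the goal is pte_mkwrite. The reference count is known to be 2; if only the<br> swap cache references this page (there will be no buffers), the operation is safe. The<br> function applies.<br> 2. do_swap_page: the page is certainly in the swap cache, and even if it has buffers the<br> read has already completed, so the buffers' reference can be subtracted.<br> 3. memory.c: free_pte -> free_page_and_swap_cache (in every case a process wants to release<br> its own pte). The page is known to be in the swap cache, and the buffers will be released<br> afterwards as well (with the page locked). This case is probably what the function was<br> originally written for.<br> <br> <br> Then there is the function deactivate_page_nolock; see try_to_swap_out -> deactivate_page -><br>deactivate_page_nolock:<br> try_to_swap_out: while examining the current process it decides to deactivate this page.<br>If, however, anything other than the swap cache, the current process, and possibly buffers<br>still references the page, it is not deactivated yet. It is really deactivated only once the<br>other process also decides to deactivate it. <br> Also, page_ramdisk pages must not be deactivated, so that ramdisk pages stay resident in<br>memory.<br> refill_inactive_scan is a special case; see the relevant code.<br> <br> After deactivation the page moves to the inactive_dirty_list of the LRU. What happens to<br>pages on that list?<br> page_launder launders the dirty pages (if it is dirty, wash it clean! ^_^). Laundering<br>requires locking the page; if another process, or someone like tmpfs or ramdisk, were quietly<br>using the page, the consequences would be dire. So pages referenced by anything other than the<br>swap cache/buffers must not be laundered. (The caller's extra reference, or the current<br>process's reference, is dropped with page->count-- after this function returns; see<br>try_to_swap_out.)<br><br><br>/**<br> * (de)activate_page - move pages from/to active and inactive lists<br> * @page: the page we want to move<br> * @nolock - are we already holding the pagemap_lru_lock?<br> *<br> * Deactivate_page will move an active page to the right<br> * inactive list, while activate_page will move a page back<br> * from one of the inactive lists to the active list. If<br> * called on a page which is not on any of the lists, the<br> * page is left alone.<br> */<br>void deactivate_page_nolock(struct page * page)<br>{<br> /*<br> * One for the cache, one for the extra reference the<br> * caller has and (maybe) one for the buffers.<br> *<br> * This isn't perfect, but works for just about everything.<br> * Besides, as long as we don't move unfreeable pages to the<br> * inactive_clean list it doesn't need to be perfect...<br> */<br> /* extra reference: the current process or the caller. Keep the<br> * three sources of the ref count in mind to apply them flexibly.<br> */<br> int maxcount = (page->buffers ? 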
3 : 2);<br> page->age = 0;<br> ClearPageReferenced(page);<br><br> /*<br> * Don't touch it if it's not on the active list.<br> * (some pages aren't on any list at all)<br> */<br> if (PageActive(page) && page_count(page) <= maxcount && !page_ramdisk(page)) {<br> del_page_from_active_list(page);<br> add_page_to_inactive_dirty_list(page);<br> }<br>} <br> <br> <br> That is the way to reason about page count. <br> <br> <br> <br> <br> An aside: swap entry reference counts<br> Let us look only at how shmem_writepage handles the reference count of a swap entry.<br>/*<br> * Move the page from the page cache to the swap cache<br> * (no actual write is done here; that is left to the swap cache)<br> */<br> /* page_launder:page->mapping->a_ops->writepage<br> * filemap_fdatasync-> page->mapping->a_ops->writepage<br> */<br>static int shmem_writepage(struct page * page)<br>{<br> int error;<br> struct shmem_inode_info *info;<br> swp_entry_t *entry, swap;<br><br> info = &page->mapping->host->u.shmem_i;<br> if (info->locked)<br> return 1;<br> swap = __get_swap_page(2); /* allocate a swap page (tmpfs mapping table + swap cache (page->index), so refs is 2) */<br> if (!swap.val)<br> return 1;<br><br> spin_lock(&info->lock);<br> /* look up the slot in the tmpfs table that records the swap entry */<br> entry = shmem_swp_entry (info, page->index);<br> if (!entry) /* this had been allocated on page allocation */<br> BUG();<br> error = -EAGAIN;<br> if (entry->val) { /* a swap entry already corresponds to it */<br> __swap_free(swap, 2);<br> goto out;<br> }<br><br> *entry = swap; /* tmpfs refs the swap entry; the reference is dropped in shmem_unuse-..>shmem_clear_swp */<br> error = 0;<br> /* Remove it from the page cache */<br> lru_cache_del(page);<br> remove_inode_page(page);<br><br> /* Add it to the swap cache */<br> add_to_swap_cache(page, swap); /* swap cache refs the swap entry; the reference is dropped in try_to_unuse or __delete_from_swap_cache */<br> page_cache_release(page);<br> set_page_dirty(page);<br> info->swapped++;<br>out:<br> spin_unlock(&info->lock);<br> UnlockPage(page);<br> return error;<br>}<br><br> In short, the purpose of a ref count is this: every place from which the object can be<br>reached should hold one corresponding ref.<br></pre>
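<pre>
 The reference-count arithmetic of is_page_shared and deactivate_page_nolock discussed above
can be replayed in user space. What follows is only a minimal sketch under stated assumptions:
struct toy_page and the toy_* functions are hypothetical stand-ins invented for illustration,
not kernel definitions, and a signed int is used for the count so the subtraction is easy to
follow.

```c
/* Hypothetical user-space model of the counters discussed in the notes.
 * None of these names exist in the kernel; each field mirrors one input
 * of is_page_shared() or deactivate_page_nolock(). */
struct toy_page {
    int count;          /* models page_count(page) */
    int swap_count;     /* models swap_count(page), refs on the swap entry */
    int in_swap_cache;  /* models PageSwapCache(page) */
    int has_buffers;    /* models page->buffers != NULL */
    int reserved;       /* models PageReserved(page) */
    int active;         /* models PageActive(page) */
    int ramdisk;        /* models page_ramdisk(page) */
};

/* Same arithmetic as is_page_shared(): start from the page count and,
 * for a swap-cache page, add the swap entry count and subtract the
 * references that belong to swap itself (cache, entry, maybe buffers). */
static int toy_is_page_shared(const struct toy_page *p)
{
    int count;

    if (p->reserved)
        return 1;
    count = p->count;
    if (p->in_swap_cache)
        count += p->swap_count - 2 - !!p->has_buffers;
    /* One remaining reference is the current process itself. */
    return count > 1;
}

/* Same condition as deactivate_page_nolock(): one ref for the cache,
 * one for the caller, and one more if the page has buffers. */
static int toy_can_deactivate(const struct toy_page *p)
{
    int maxcount = p->has_buffers ? 3 : 2;

    return p->active && p->count <= maxcount && !p->ramdisk;
}
```

 For example, a page mapped by one process and held in the swap cache has count == 2 and
swap_count == 1, so the adjusted count is 2 + 1 - 2 = 1 and the page is reported as not
shared. If tmpfs or a second process also holds the swap entry (swap_count == 2), the adjusted
count becomes 2 and the page counts as shared.
</pre>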
</td>
</tr>
</tbody>
</table></body></html>