📄 023_mm_swap_state_c.html

📁 Notes written while re-reading linux 2.4.2o
      flow: static(header);    }    /* used to insert page numbers */    div.google_header::before, div.google_footer::before {      position: absolute;      top: 0;    }    div.google_footer {      flow: static(footer);    }    /* always consider this element at the start of the doc */    div#google_footer {      flow: static(footer, start);    }    span.google_pagenumber {      content: counter(page);    }    span.google_pagecount {      content: counter(pages);    }  }  @page {    @top {      content: flow(header);    }    @bottom {      content: flow(footer);    }  }  /* end default print css */ /* custom css *//* end custom css */  /* ui edited css */    body {    font-family: Verdana;        font-size: 10.0pt;    line-height: normal;    background-color: #ffffff;  }    .documentBG {    background-color: #ffffff;  }  /* end ui edited css */</style>   </head>  <body  revision="dcbsxfpf_65fnfzsmft:5">      <table align=center cellpadding=0 cellspacing=0 height=5716 width=768>
  <tbody>
  <tr>
    <td height=5716 valign=top width=100%>
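      <pre>2006-8-10   <br>mm/swap_state.c<br><br>    Whenever a page has to interact with an external storage device, an address_space<br>must be set up for it; for swap this is swapper_space. An address_space_operations<br>table must be provided as well; for swap this is swap_aops.<br>    The page cache, swap cache, shmem and filemap all work this way.<br>    <br>    These two structures only solve the problem of writing pages out. Reading pages in<br>relies on direct function calls from handle_pte_fault: do_swap_page for swap, and<br>vma-&gt;vm_ops-&gt;nopage for file mappings/mmap. There is no unified solution.<br>             <br>     I do not intend to analyze these in much depth. The focus here is on physical memory<br>pages, page-&gt;count, and the reference count of swap entries. (Is it really necessary to<br>list things function by function here?)<br>               <br>               <br>               <br>                            page: where does it go?<br>                  <br>   See page_alloc.c and the buddy system. All physical pages are managed by the buddy<br>allocator (except reserved pages, which are device memory or serve special purposes).<br>To see where pages go, it is enough to follow the callers of the allocation functions.<br>   Allocation interfaces provided by page_alloc.c (only these are used in 2.4):<br>     <br>     1. alloc_pages, called by page_cache_alloc:<br>        Pages that come out of this interface all live in the page cache (or swap<br>        cache) and cache files on disk (or things that look like files). The users<br>        are: swap cache, page cache, file read (page cache), filemap (page cache, or<br>        a copy from the page cache), COW (may not be in the page cache),<br>        shmem_nopage (page cache).<br>     <br>     2. __get_free_pages:<br>         Widely used by drivers, kernel hash tables, the task struct, networking<br>         (hash tables etc.), buffers (filesystem metadata and block device file<br>         reads/writes; there are also on-the-fly buffers created for pages that need<br>         I/O, but those pages do not come from __get_free_pages), and slab.<br>         <br>     3. 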
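__get_free_page:<br>         Page tables (pgd, pmd), drivers, user argument pages.<br>     <br>     4. get_zeroed_page:<br>         Drivers (tty), shmem (its direct/indirect mapping tables, which live in kernel<br>         memory and never deal with backing store).<br>   <br>     5. alloc_page:<br>         Buffers, string argument pages, anonymous pages (page fault), vmalloc (kernel<br>         pages, never swapped).<br>         <br>     <br>     Now we can answer the question: where does all the physical memory go?<br>     (FIXME: I think everything is listed here.)<br>     1) Used by the kernel 'itself'<br>       Drivers, networking, page tables (kernel or process), various hash tables,<br>       arguments copied from user space, slab caches for kernel data structures<br>       (inode, dentry, ...), shmem mapping tables, and kernel pages used by vmalloc.<br>       <br>     2) page cache/swap cache <br>       A page can be in only one of these two caches. They cache file contents (not<br>       metadata) that live on disk: regular files, filemap (shared), and shmem.<br>      <br>     3) buffers<br>        Cache filesystem metadata and reads/writes of device files. This does not<br>        include the on-the-fly buffers created to perform page I/O, although those<br>        buffers also do page-&gt;count++.<br>        <br>     4) User process pages<br>        This is a mixed category. Pages used by a process may also be in the page<br>        cache/swap cache and may have buffers. Besides these owned pages, a process<br>        also uses anonymous pages, i.e. pages with no mapping: process pages that have<br>        not yet entered swap, filemap pages (not shared), and COW pages.<br>       <br>     <br>     <br>                      <br>                      <br>                                page-&gt;count  <br>                           <br>  First, a comment pasted from mm.h that is worth reading. Note that it is very old:<br>inode-&gt;i_pages no longer exists in 2.4, and the sentence "For pages belonging to inodes,<br>the page-&gt;count is the number of attaches, plus 1 if buffers are allocated to the page."<br>is no longer correct. (Like our own documentation, it has not been updated for a long<br>time; it is fine in 2.6.)<br><br>/*<br> * Various page-&gt;flags bits:<br> *<br> * PG_reserved is set for a page which must never be accessed (which<br> * may not even be present).<br> *<br> * PG_DMA has been removed, page-&gt;zone now tells exactly wether the<br> * page is suited to do DMAing into.<br> *<br> * Multiple processes may "see" the same page. E.g. 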
for untouched<br> * mappings of /dev/null, all processes see the same page full of<br> * zeroes, and text pages of executables and shared libraries have<br> * only one copy in memory, at most, normally.<br> *<br> * For the non-reserved pages, page-&gt;count denotes a reference count.<br> *   page-&gt;count == 0 means the page is free.<br> *   page-&gt;count == 1 means the page is used for exactly one purpose<br> *   (e.g. a private data page of one process).<br> *<br> * A page may be used for kmalloc() or anyone else who does a<br> * __get_free_page(). In this case the page-&gt;count is at least 1, and<br> * all other fields are unused but should be 0 or NULL. The<br> * management of this page is the responsibility of the one who uses<br> * it.<br> *<br> * The other pages (we may call them "process pages") are completely<br> * managed by the Linux memory manager: I/O, buffers, swapping etc.<br> * The following discussion applies only to them.<br> *<br> * A page may belong to an inode's memory mapping. In this case,<br> * page-&gt;inode is the pointer to the inode, and page-&gt;offset is the<br> * file offset of the page (not necessarily a multiple of PAGE_SIZE).<br> *<br> * A page may have buffers allocated to it. In this case,<br> * page-&gt;buffers is a circular list of these buffer heads. Else,<br> * page-&gt;buffers == NULL.<br> *<br> * For pages belonging to inodes, the page-&gt;count is the number of<br> * attaches, plus 1 if buffers are allocated to the page.<br> *<br> * All pages belonging to an inode make up a doubly linked list<br> * inode-&gt;i_pages, using the fields page-&gt;next and page-&gt;prev. (These<br> * fields are also used for freelist management when page-&gt;count==0.)<br> * There is also a hash table mapping (inode,offset) to the page<br> * in memory if present. 
The lists for this hash table use the fields<br> * page-&gt;next_hash and page-&gt;pprev_hash.<br> *<br> * All process pages can do I/O:<br> * - inode pages may need to be read from disk,<br> * - inode pages which have been modified and are MAP_SHARED may need<br> *   to be written to disk,<br> * - private pages which have been modified may need to be swapped out<br> *   to swap space and (later) to be read back into memory.<br> * During disk I/O, PG_locked is used. This bit is set before I/O<br> * and reset when I/O completes. page-&gt;wait is a wait queue of all<br> * tasks waiting for the I/O on this page to complete.<br> * PG_uptodate tells whether the page's contents is valid.<br> * When a read completes, the page becomes uptodate, unless a disk I/O<br> * error happened.<br> *<br> * For choosing which pages to swap out, inode pages carry a<br> * PG_referenced bit, which is set any time the system accesses<br> * that page through the (inode,offset) hash table.<br> *<br> * PG_skip is used on sparc/sparc64 architectures to "skip" certain<br> * parts of the address space.<br> *<br> * PG_error is set to indicate that an I/O error occurred on this page.<br> *<br> * PG_arch_1 is an architecture specific page state bit.  The generic<br> * code guarentees that this bit is cleared for a page when it first<br> * is entered into the page cache.<br> */<br><br>  Based on the analysis above of where physical pages go, page-&gt;count can be described<br>roughly as follows:<br>  1) Pages of the first kind, used by the kernel itself, generally have a reference<br>     count of 1. (FIXME)<br>  <br>  2) A page in the page/swap cache gets +1; buffers add another +1.<br>  <br>  3) User processes: +1 for each process.<br>  <br>  4) Many places take a temporary reference to keep the page from being freed and drop<br>     it soon after the get. These are ignored here.<br>  <br>  <br>  <br>                       <br>                          page-&gt;count: a worked example<br><br>  I picked the function is_page_shared for a detailed analysis. 
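In October 2002 linuxforum was very lively,<br>and the discussion of this function drowned in a sea of posts. Still, curiosity and debate<br>about page-&gt;count have never stopped. Perhaps forums abroad stopped actively discussing<br>this kind of question long ago, but we will keep going.<br>  Please read the comments carefully.<br>/*<br> * Work out if there are any other processes sharing this page, ignoring<br> * any page reference coming from the swap cache, or from outstanding<br> * swap IO on this page.  (The page cache _does_ count as another valid<br> * reference to the page, however.)<br> */<br> /* I) Sources of the page reference count in this case:<br>  *   1. processes, one per process  2. swap cache or page cache, one  3. buffers, one<br>  *  <br>  * II) The swap entry associated with the page:<br>  *     Once the page is in the swap cache, a swap entry reference count other than 1<br>  *  (e.g. with tmpfs) means that some other place still expects to find this page<br>  *  through the swap entry (tmpfs). This is effectively one more anonymous reference<br>  *  to the page.<br>  *<br>  * III) The page cache counts as "another process" (see the English comment above)<br>  */<br>static inline int is_page_shared(struct page *page)<br>{<br>	unsigned int count;<br>	if (PageReserved(page))<br>		return 1;<br>	count = page_count(page); // the page's own reference count<br><br>   /*  II) page is in the swap cache (not in the page cache):<br>    *      references from all processes = page count + swap entry count<br>    *                                      - (references due to swap itself)<br>    *      The references due to swap itself are:<br>    *        1 for the swap cache's reference to the page, 1 for this page's reference<br>    *        to the swap entry and, if there are buffers, 1 more counted as swap's<br>    *        reference (it is not a process in any case).<br>    */<br>	if (PageSwapCache(page))<br>		count += swap_count(page) - 2 - !!page-&gt;buffers;<br><br>    /* III) page is in the page cache, or in no cache at all:<br>     *    If there are buffers in this case, the page must belong to the page cache<br>     *  (filemap); otherwise why would a process page need to be written to disk?<br>     *    references from processes + page cache (with attached buffers) = page count<br>     */<br>        	<br><br>	 /* If the page is in the swap cache, one of the remaining references belongs<br>	  * to the current process, so only count &gt; 1 means other processes use this page<br>    */<br>	 return  count &gt; 1;<br>}<br>  <br>  Its meaning has been analyzed above. Now for the precondition and the actual call sites:<br>  The function assumes that some process is already using the page (one reference); that<br>is the precondition. It is called from three places:<br>  1. do_wp_page: the goal is pte_mkwrite. The reference count is known: it is exactly 2<br>     when only the swap cache references the page besides the process (there will be no<br>     buffers), and then the operation is safe. The function applies.<br>  2. do_swap_page: the page is certainly in the swap cache, and even if it has buffers<br>    the read has already completed, so the buffers reference can be subtracted.<br>  3. 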
memory.c: free_pte-&gt;free_page_and_swap_cache (in every case a process wants to<br>    release its own pte). The page is known to be in the swap cache, and its buffers will<br>    be released afterwards as well (with the page locked). This case is probably what<br>    the function was originally written for.<br>  <br>  <br>  Another example is the function deactivate_page_nolock; see try_to_swap_out-&gt;<br>deactivate_page-&gt;deactivate_page_nolock:<br>   try_to_swap_out: while examining the current process it decides to deactivate the<br>page. If the page is referenced from anywhere other than the swap cache, the current<br>process and possibly buffers, it is not deactivated yet; the real deactivation waits until<br>the other process also decides to deactivate it.<br>   Also, page_ramdisk pages must not be deactivated, so that ramdisk pages stay resident<br>in memory.<br>   refill_inactive_scan is a special case; see the relevant code.<br>   <br>   After deactivation the page moves to the LRU inactive_dirty_list. What happens to the<br>pages on that list?<br>   page_launder cleans the dirty pages (if it is dirty, wash it! ^_^). Cleaning requires<br>locking the page, and if another process, or a user such as tmpfs or ramdisk, were still<br>quietly using the page, the consequences would be dire. So a page with references other<br>than the swap cache/buffers must not be laundered. (The caller's extra reference, or the<br>current process's reference, is dropped with page-&gt;count-- after this function returns;<br>see try_to_swap_out.)<br><br><br>/**<br> * (de)activate_page - move pages from/to active and inactive lists<br> * @page: the page we want to move<br> * @nolock - are we already holding the pagemap_lru_lock?<br> *<br> * Deactivate_page will move an active page to the right<br> * inactive list, while activate_page will move a page back<br> * from one of the inactive lists to the active list. If<br> * called on a page which is not on any of the lists, the<br> * page is left alone.<br> */<br>void deactivate_page_nolock(struct page * page)<br>{<br>	/*<br>	 * One for the cache, one for the extra reference the<br>	 * caller has and (maybe) one for the buffers.<br>	 *<br>	 * This isn't perfect, but works for just about everything.<br>	 * Besides, as long as we don't move unfreeable pages to the<br>	 * inactive_clean list it doesn't need to be perfect...<br>	 */<br>	 /* extra reference: the current process or the caller. Keep the<br>	   * three sources of the ref count in mind to apply them flexibly.<br>	   */<br>	int maxcount = (page-&gt;buffers ? 
3 : 2);<br>	page-&gt;age = 0;<br>	ClearPageReferenced(page);<br><br>	/*<br>	 * Don't touch it if it's not on the active list.<br>	 * (some pages aren't on any list at all)<br>	 */<br>	if (PageActive(page) &amp;&amp; page_count(page) &lt;= maxcount &amp;&amp; !page_ramdisk(page)) {<br>		del_page_from_active_list(page);<br>		add_page_to_inactive_dirty_list(page);<br>	}<br>}	<br>  <br>  <br>   That is the way of thinking for dealing with page count. <br>   <br>   <br>   <br>                         <br>                        Aside: the reference count of swap entries<br>  Here we only analyze how shmem_writepage handles the swap entry reference count.<br>/*<br> * Move the page from the page cache to the swap cache<br> * (no actual write is done here; it is left to the swap cache)<br> */<br> /*  page_launder-&gt;page-&gt;mapping-&gt;a_ops-&gt;writepage<br>   *  filemap_fdatasync-&gt;page-&gt;mapping-&gt;a_ops-&gt;writepage<br>   */<br>static int shmem_writepage(struct page * page)<br>{<br>	int error;<br>	struct shmem_inode_info *info;<br>	swp_entry_t *entry, swap;<br><br>  /*<br>   *  <br>	 */<br>	info = &amp;page-&gt;mapping-&gt;host-&gt;u.shmem_i;<br>	if (info-&gt;locked)<br>		return 1;<br>	swap = __get_swap_page(2); /* allocate a swap page; tmpfs (mapping table) + swap cache (page-&gt;index), so refs is 2 */<br>	if (!swap.val)<br>		return 1;<br><br>	spin_lock(&amp;info-&gt;lock);<br>	/* find the slot in the tmpfs table that records swap entries */<br>	entry = shmem_swp_entry (info, page-&gt;index);<br>	if (!entry)	/* this had been allocted on page allocation */<br>		BUG();<br>	error = -EAGAIN;<br>	if (entry-&gt;val) { /* a swap entry already corresponds to it */<br>                __swap_free(swap, 2);<br>		goto out;<br>        }<br><br>        *entry = swap; /* tmpfs refs the swap entry; for the release see shmem_unuse-..&gt;shmem_clear_swp */<br>	error = 0;<br>	/* Remove the from the page cache */<br>	lru_cache_del(page);<br>	remove_inode_page(page);<br><br>	/* Add it to the swap cache */<br>	add_to_swap_cache(page, swap); /* swap cache refs the swap entry; for the release see try_to_unuse or __delete_from_swap_cache */<br>	page_cache_release(page);<br>	set_page_dirty(page);<br>	info-&gt;swapped++;<br>out:<br>	
spin_unlock(&amp;info-&gt;lock);<br>	UnlockPage(page);<br>	return error;<br>}<br><br>   In short, the point of a ref count is this: every place from which the object can be<br>reached should hold one reference.<br></pre>
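<pre>
   The counting rules discussed in these notes can be tried out in a small
user-space model. This is only a sketch under stated assumptions: struct
page_model and the two model_* functions are hypothetical names invented for
illustration, the fields merely stand in for page_count(), PageReserved(),
PageSwapCache(), the buffers pointer and swap_count(), and none of it is
kernel code.

```c
/* User-space model of the reference counting described in these notes.
 * All names are hypothetical; this is not kernel code. */
struct page_model {
	int count;         /* stands in for page_count(page) */
	int reserved;      /* stands in for PageReserved(page) */
	int in_swap_cache; /* stands in for PageSwapCache(page) */
	int has_buffers;   /* nonzero when page->buffers would be set */
	int swap_refs;     /* stands in for swap_count(page) */
};

/* Mirrors the arithmetic of is_page_shared() as quoted in the notes. */
static int model_is_page_shared(const struct page_model *page)
{
	int count;

	if (page->reserved)
		return 1;
	count = page->count;
	/* Add the swap entry count, then drop what swap itself holds:
	 * the cache's reference to the page, the cache's reference to
	 * the swap entry, and the buffers reference if there is one. */
	if (page->in_swap_cache)
		count += page->swap_refs - 2 - !!page->has_buffers;
	/* One remaining reference is the caller's own. */
	return count > 1;
}

/* Mirrors the test in deactivate_page_nolock(): one reference for the
 * cache, one for the caller, plus one more if buffers are attached. */
static int model_may_deactivate(const struct page_model *page,
				int active, int ramdisk)
{
	int maxcount = page->has_buffers ? 3 : 2;

	if (!active)
		return 0;
	if (ramdisk)
		return 0;
	return page->count > maxcount ? 0 : 1;
}
```

   With count = 2, in_swap_cache = 1 and swap_refs = 1 (one process plus the
swap cache) the model reports the page as not shared, which is the do_wp_page
case described in the notes. Raising swap_refs to 2, as when another user
still holds the swap entry, makes the same page count as shared.
</pre>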
    </td>
  </tr>
  </tbody>
</table></body></html>