A Source-Level Walkthrough of the Linux Kernel's Copy-on-Write Mechanism

姜换新 2020-08-31


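Copy-on-write (COW for short) is one of the more important mechanisms in the Linux kernel. As is well known, when a parent process forks a child, the two share all private writable pages read-only, and a COW page fault occurs as soon as either side tries to write to one of them. So how exactly is COW triggered inside the kernel, and how is the fault handled? This article answers both questions by walking through the source code scenario by scenario, so that COW can be understood at the source level.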

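Note that the kernel source analyzed here is linux-5.0 on the arm64 architecture. The kernel had already reached linux-5.8.x when this article was published, but a look at the latest source shows that not much has changed. The discussion covers the following aspects of copy-on-write: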

1. What the kernel prepares for COW when forking a child
2. How a COW fault is triggered
3. How the kernel handles the COW page fault
4. Reuse of anonymous pages

I. Starting from fork

Processes are created with fork, and at fork time the child shares resources such as fs, files and mm with its parent. Memory is shared along the following path:
kernel/fork.c
_do_fork->copy_process->copy_mm

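Since this article is about COW, the sharing of other resources and the other parts of memory sharing are left for later articles. In outline, copy_mm allocates an mm_struct instance mm, copies the parent's old_mm into it, and creates the child's own pgd (page global directory). It then walks the parent's vma list to build the child's vma list (code segment, data segment, and so on). Next comes the key step, sharing the pages: for efficiency the kernel does not copy the contents of all of the parent's physical pages but shares them by copying the page tables. While copying, it decides for each page-table entry whether to copy it verbatim or to make it read-only in preparation for a later COW fault.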

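The parent's memory is shared on the path dup_mm->dup_mmap->copy_page_range, which descends through the page-table levels to copy_pte_range->copy_one_pte.

We focus on copy_one_pte, the function that copies a single page-table entry.
It first handles entries that are non-empty but whose physical page is not in memory (the !pte_present(pte) branch), such as pages swapped out to a swap partition, and then handles the case where the physical page is present: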

    773         /*
    774          * If it's a COW mapping, write protect it both
    775          * in the parent and the child
    776          */
    777         if (is_cow_mapping(vm_flags) && pte_write(pte)) {  // private, potentially writable vma and a writable pte
    778                 ptep_set_wrprotect(src_mm, addr, src_pte); // make the parent's entry read-only
    779                 pte = pte_wrprotect(pte);                  // read-only value for the child's entry
    780         }
    781

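The block above checks whether the vma containing the page is private and potentially writable (VM_MAYWRITE set, VM_SHARED clear) and whether the parent's page-table entry is writable: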

    247 static inline bool is_cow_mapping(vm_flags_t flags)
    248 {
    249         return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    250 }

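If the check holds, this is a COW mapping and the entries of both parent and child must be made read-only:
ptep_set_wrprotect(src_mm, addr, src_pte) makes the parent's page-table entry read-only, and pte = pte_wrprotect(pte) makes read-only the value that will be written to the child's entry. (Note: before the call, pte holds the parent's original entry value; after it, the child's value still has not been written into the child's page table.)
The core function that makes an entry value read-only is: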

    152 static inline pte_t pte_wrprotect(pte_t pte)
    153 {
    154         pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));  // clear the write bit
    155         pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));   // set the read-only bit
    156         return pte;
    157 }
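The two checks above come down to a few bit operations. The following is a minimal user-space sketch of them, not kernel code: the `TOY_` constants and the `toy_` function names are hypothetical and chosen only for illustration, and the bit positions are not the ones arm64 uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative vma flag values for this sketch only. */
#define TOY_VM_SHARED   0x08u
#define TOY_VM_MAYWRITE 0x20u

/* Illustrative PTE attribute bits (hypothetical positions). */
#define TOY_PTE_WRITE   (1ull << 1)
#define TOY_PTE_RDONLY  (1ull << 2)

/* Same predicate as is_cow_mapping(): MAYWRITE set and SHARED clear. */
bool toy_is_cow_mapping(uint32_t flags)
{
    return (flags & (TOY_VM_SHARED | TOY_VM_MAYWRITE)) == TOY_VM_MAYWRITE;
}

/* Same shape as pte_wrprotect(): clear the write bit, set the read-only bit. */
uint64_t toy_pte_wrprotect(uint64_t pte)
{
    pte &= ~TOY_PTE_WRITE;
    pte |= TOY_PTE_RDONLY;
    return pte;
}
```

A private mapping that may be written passes the predicate, while a shared one does not, so only the former has its entries write-protected at fork time.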

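Back in copy_one_pte, continuing downward:
The parent's entry has been modified and we hold the value pte destined for the child. It has not been written to the child's page table yet because it is not fully assembled. The next block finishes assembling it: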

    782         /*
    783          * If it's a shared mapping, mark it clean in
    784          * the child
    785          */
    786         if (vm_flags & VM_SHARED)               // the vma is a shared mapping
    787                 pte = pte_mkclean(pte);         // mark the entry value clean
    788         pte = pte_mkold(pte);                   // mark it not accessed, i.e. clear PTE_AF
    789
    790         page = vm_normal_page(vma, addr, pte);  // page struct for the pte (the descriptor shared with the parent)
    791         if (page) {
    792                 get_page(page);                 // take a reference on the page
    793                 page_dup_rmap(page, false);     // note: does not copy the rmap, it increments page->_mapcount
    794                 rss[mm_counter(page)]++;
    795         } else if (pte_devmap(pte)) {
    796                 page = pte_page(pte);
    797
    798                 /*
    799                  * Cache coherent device memory behave like regular page and
    800                  * not like persistent memory page. For more informations see
    801                  * MEMORY_DEVICE_CACHE_COHERENT in memory_hotplug.h
    802                  */
    803                 if (is_device_public_page(page)) {
    804                         get_page(page);
    805                         page_dup_rmap(page, false);
    806                         rss[mm_counter(page)]++;
    807                 }
    808         }
    809
    810 out_set_pte:
    811         set_pte_at(dst_mm, addr, dst_pte, pte); // write the assembled value into the child's page table
    812         return 0;

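This completes the preparation: for every page that needs copy-on-write, the entries of both parent and child are now read-only (while the vma itself remains writable) and both map the same physical page. This is the page-table-level groundwork for the COW page fault discussed next.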
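The present-page path of copy_one_pte can be modelled in a few lines of user-space C. This is a simplified sketch under stated assumptions: `struct toy_page`, `struct toy_pte` and `toy_copy_one_pte` are hypothetical names, a single boolean stands in for the write permission bits, and swap entries, device pages, locking and rss accounting are left out.

```c
#include <stdbool.h>

/* Toy page descriptor: only the two counters discussed in the text. */
struct toy_page {
    int refcount;   /* references held on the page (get_page) */
    int mapcount;   /* number of page-table entries mapping the page */
};

/* Toy page-table entry: the page it maps and whether writes are allowed. */
struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/*
 * Write-protect the parent's entry for a COW mapping, then give the child
 * a read-only entry that maps the same page.
 */
void toy_copy_one_pte(struct toy_pte *src, struct toy_pte *dst, bool cow_mapping)
{
    struct toy_pte pte = *src;          /* start from the parent's value */

    if (cow_mapping && pte.writable) {
        src->writable = false;          /* models ptep_set_wrprotect() on the parent */
        pte.writable = false;           /* models pte_wrprotect() on the child's value */
    }
    if (pte.page) {
        pte.page->refcount++;           /* models get_page() */
        pte.page->mapcount++;           /* models page_dup_rmap() */
    }
    *dst = pte;                         /* models set_pte_at() for the child */
}
```

After the call both entries point at the same page, neither is writable, and the page's counters each went up by one, which is the state the fault handler later relies on.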

II. What triggers a COW page fault

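As long as parent and child only read the COW-shared pages, each reaches the data through its own page table and nothing special happens. As soon as one side writes, the processor raises an exception, which the kernel identifies as a COW page fault.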

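On arm64 the write raises a permission fault, which the architecture code routes through do_mem_abort and do_page_fault into the generic handle_mm_fault path.

We start the analysis at handle_pte_fault: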

    3800         if (vmf->flags & FAULT_FLAG_WRITE) {    // the faulting access is a write
    3801                 if (!pte_write(entry))          // the entry is read-only
    3802                         return do_wp_page(vmf); // handle COW
    3803                 entry = pte_mkdirty(entry);
    3804         }

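Reaching this check means that the page-table entry exists and the physical page is in memory, yet a write to a writable vma has hit a read-only entry (exactly the state prepared at fork time). These are the conditions by which a COW page fault is recognized.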
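The decision taken for a present entry can be written as a small pure function. This is only a sketch of the branch shown above: the enum and the function name are hypothetical, and the third outcome stands for the rest of handle_pte_fault, which this article does not analyze.

```c
#include <stdbool.h>

enum toy_action {
    TOY_DO_WP_PAGE,     /* write hit a present, read-only entry: COW path */
    TOY_MARK_DIRTY,     /* write hit an entry that is already writable */
    TOY_ACCESS_FIXUP    /* not a write fault: no write-protect handling */
};

/* Models the FAULT_FLAG_WRITE / pte_write() test for a present entry. */
enum toy_action toy_handle_present_pte(bool write_fault, bool pte_writable)
{
    if (write_fault) {
        if (!pte_writable)
            return TOY_DO_WP_PAGE;
        return TOY_MARK_DIRTY;
    }
    return TOY_ACCESS_FIXUP;
}
```

Only the combination of a write access and a read-only entry leads to do_wp_page; a read of the same shared page never does.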

III. Handling the COW page fault

Once the kernel has determined that the fault is a COW page fault, it calls do_wp_page to handle it:

    2480 static vm_fault_t do_wp_page(struct vm_fault *vmf)
    2481         __releases(vmf->ptl)
    2482 {
    2483         struct vm_area_struct *vma = vmf->vma;
    2484
    2485         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); // page instance for the faulting address
    2486         if (!vmf->page) {
    2487                 /*
    2488                  * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
    2489                  * VM_PFNMAP VMA.
    2490                  *
    2491                  * We should not cow pages in a shared writeable mapping.
    2492                  * Just mark the pages writable and/or call ops->pfn_mkwrite.
    2493                  */
    2494                 if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
    2495                                      (VM_WRITE|VM_SHARED))
    2496                         return wp_pfn_shared(vmf);   // shared writable mapping
    2497
    2498                 pte_unmap_unlock(vmf->pte, vmf->ptl);
    2499                 return wp_page_copy(vmf);            // private writable mapping
    2500         }

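Line 2485 looks up the page struct for the faulting address. If there is none, this is a special mapping that uses raw page frame numbers: wp_pfn_shared handles the shared writable case and wp_page_copy the private writable case. This is not the focus of our analysis.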

Continuing:
The line to note is 2522, which decides whether the page can be reused; we come back to it later.

    2543         /*
    2544          * Ok, we need to copy. Oh, well..
    2545          */
    2546         get_page(vmf->page);
    2547
    2548         pte_unmap_unlock(vmf->pte, vmf->ptl);
    2549         return wp_page_copy(vmf);

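Line 2546 takes a reference on the original page so that it cannot be freed.
Line 2548 releases the page-table lock.
Line 2549 calls wp_page_copy, the core function of COW handling.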

We now look at wp_page_copy in detail:

    2233  * - Allocate a page, copy the content of the old page to the new one.
    2234  * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc.
    2235  * - Take the PTL. If the pte changed, bail out and release the allocated page
    2236  * - If the pte is still the way we remember it, update the page table and all
    2237  *   relevant references. This includes dropping the reference the page-table
    2238  *   held to the old page, as well as updating the rmap.
    2239  * - In any case, unlock the PTL and drop the reference we took to the old page.
    2240  */
    2241 static vm_fault_t wp_page_copy(struct vm_fault *vmf)
    2242 {
    2243         struct vm_area_struct *vma = vmf->vma;
    2244         struct mm_struct *mm = vma->vm_mm;
    2245         struct page *old_page = vmf->page;
    2246         struct page *new_page = NULL;
    2247         pte_t entry;
    2248         int page_copied = 0;
    2249         struct mem_cgroup *memcg;
    2250         struct mmu_notifier_range range;
    2251
    2252         if (unlikely(anon_vma_prepare(vma)))
    2253                 goto oom;
    2254
    2255         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
    2256                 new_page = alloc_zeroed_user_highpage_movable(vma,
    2257                                                               vmf->address);
    2258                 if (!new_page)
    2259                         goto oom;
    2260         } else {
    2261                 new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
    2262                                 vmf->address);
    2263                 if (!new_page)
    2264                         goto oom;
    2265                 cow_user_page(new_page, old_page, vmf->address, vma);
    2266         }

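Line 2252 attaches an anon_vma instance to the vma.
Lines 2255 to 2259: if the original entry maps the zero page, allocate a movable user page (from highmem where available) and zero-fill it.
Lines 2261 to 2265: otherwise allocate a movable user page and copy the contents of the original page into it.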

    2268         if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
    2269                 goto oom_free_new;
    2270
    2271         __SetPageUptodate(new_page);
    2272
    2273         mmu_notifier_range_init(&range, mm, vmf->address & PAGE_MASK,
    2274                                 (vmf->address & PAGE_MASK) + PAGE_SIZE);
    2275         mmu_notifier_invalidate_range_start(&range);
    2276
    2277         /*
    2278          * Re-check the pte - we dropped the lock
    2279          */
    2280         vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
    2281         if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
    2282                 if (old_page) {
    2283                         if (!PageAnon(old_page)) {
    2284                                 dec_mm_counter_fast(mm,
    2285                                                 mm_counter_file(old_page));
    2286                                 inc_mm_counter_fast(mm, MM_ANONPAGES);
    2287                         }
    2288                 } else {
    2289                         inc_mm_counter_fast(mm, MM_ANONPAGES);
    2290                 }
    2291                 flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
    2292                 entry = mk_pte(new_page, vma->vm_page_prot);
    2293                 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
    2294                 /*
    2295                  * Clear the pte entry and flush it first, before updating the
    2296                  * pte with the new entry. This will avoid a race condition
    2297                  * seen in the presence of one thread doing SMC and another
    2298                  * thread doing COW.
    2299                  */
    2300                 ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
    2301                 page_add_new_anon_rmap(new_page, vma, vmf->address, false);
    2302                 mem_cgroup_commit_charge(new_page, memcg, false, false);
    2303                 lru_cache_add_active_or_unevictable(new_page, vma);
    2304                 /*
    2305                  * We call the notify macro here because, when using secondary
    2306                  * mmu page tables (such as kvm shadow page tables), we want the
    2307                  * new page to be mapped directly into the secondary page table.
    2308                  */
    2309                 set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
    2310                 update_mmu_cache(vma, vmf->address, vmf->pte);
    2311                 if (old_page) {
    2312                         /*
    2313                          * Only after switching the pte to the new page may
    2314                          * we remove the mapcount here. Otherwise another
    2315                          * process may come and find the rmap count decremented
    2316                          * before the pte is switched to the new page, and
    2317                          * 'reuse' the old page writing into it while our pte
    2318                          * here still points into it and can be read by other
    2319                          * threads.
    2320                          *
    2321                          * The critical issue is to order this
    2322                          * page_remove_rmap with the ptp_clear_flush above.
    2323                          * Those stores are ordered by (if nothing else,)
    2324                          * the barrier present in the atomic_add_negative
    2325                          * in page_remove_rmap.
    2326                          *
    2327                          * Then the TLB flush in ptep_clear_flush ensures that
    2328                          * no process can access the old page before the
    2329                          * decremented mapcount is visible. And the old page
    2330                          * cannot be reused until after the decremented
    2331                          * mapcount is visible. So transitively, TLBs to
    2332                          * old page will be flushed before it can be reused.
    2333                          */
    2334                         page_remove_rmap(old_page, false);
    2335                 }
    2336
    2337                 /* Free the old page.. */
    2338                 new_page = old_page;
    2339                 page_copied = 1;
    2340         } else {
    2341                 mem_cgroup_cancel_charge(new_page, memcg, false);
    2342         }

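Line 2271 sets the PageUptodate flag on the new page, marking its contents as valid.
Line 2280 takes the page-table lock.
Lines 2281 to 2339 cover the case where the entry read under the lock is the same as the one recorded when the fault was taken.
Line 2341 covers the case where the entry has changed.
We concentrate on the unchanged case:
Lines 2282 to 2290 mainly update the page counters.
Line 2291 flushes the page from the cache.
Line 2292 builds the page-table entry value from the vma's access permissions and the new page's descriptor.
Line 2293 marks the entry value dirty and, if the vma is writable, writable as well; this undoes the read-only setting made at fork time.
Line 2300 clears the old entry and flushes the TLB entry for the faulting address (this step is important).
Line 2301 adds the new physical page to the anonymous reverse mapping of the vma.
Line 2303 adds the new physical page to the active or unevictable LRU list.
Line 2309 writes the assembled value into the page-table entry; only now does the change take effect.
Line 2334 removes the reverse mapping from the old page to the virtual page. One important part of this is atomic_add_negative(-1, &page->_mapcount), which decrements the page's map count by one.
Lines 2344 to 2347 drop the reference on the old page and release the page-table lock.
Lines 2353 to 2364: if the new page has been mapped and the old page was locked in memory, the old page is unlocked.
That completes the copy-on-write path. To summarize: allocate a new physical page, copy the contents of the original page into it, then point the page-table entry at the new page and make it writable (provided the vma is writable).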
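The sequence just summarized (allocate, copy, re-check the entry, switch it over, drop the old mapping) can be sketched in user space as follows. The types and the function `toy_wp_page_copy` are hypothetical, `calloc` stands in for the page allocator, and everything related to locking, TLB and cache maintenance, cgroups, LRU lists and mmu notifiers is omitted.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int mapcount;               /* number of entries mapping this page */
};

struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/*
 * Returns true if the entry now maps a private copy; false if allocation
 * failed or the entry changed since 'orig' was sampled at fault time.
 */
bool toy_wp_page_copy(struct toy_pte *pte, struct toy_pte orig, bool vma_writable)
{
    struct toy_page *old_page = orig.page;
    struct toy_page *new_page = calloc(1, sizeof(*new_page));

    if (!new_page)
        return false;                       /* corresponds to the oom path */
    if (old_page)                           /* models cow_user_page() */
        memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);

    /* Re-check the entry, as the kernel does after retaking the lock. */
    if (pte->page != orig.page || pte->writable != orig.writable) {
        free(new_page);
        return false;
    }

    new_page->mapcount = 1;                 /* models page_add_new_anon_rmap() */
    pte->page = new_page;                   /* models set_pte_at_notify() */
    pte->writable = vma_writable;           /* models maybe_mkwrite() */
    if (old_page)
        old_page->mapcount--;               /* models page_remove_rmap() */
    return true;
}
```

In this model a write through the updated entry changes only the new page, and the old page's map count drops by one, which is what makes the reuse case of the next section possible.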
One question was left open earlier: the reuse_swap_page handling in do_wp_page, that is, the handling of an anonymous page with a single owner.

IV. Reuse of anonymous pages

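Consider the following scenario. Parent process P forks child A, and a private writable anonymous page, page1, is shared between them. The kernel maps this page into the virtual address space of each process and makes both page-table entries read-only, so page1 has a map count of 2 (as reported by page_mapcount()). Now suppose child A writes to page1. A COW fault occurs, and the handler allocates a new page, page2, for A, maps it at the faulting virtual page, and makes A's entry writable. From then on the child can write page2 freely without affecting the parent. As the analysis above showed, the map count of page1 drops by one to 1, which means page1 is now mapped only by the parent. What happens when the parent then writes to page1? Does another COW take place and allocate yet another page?
Let us look for the answer in the source:

Lines 2502 to 2541 of do_wp_page are the part that matters here: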

    2502         /*
    2503          * Take out anonymous pages first, anonymous shared vmas are
    2504          * not dirty accountable.
    2505          */
    2506         if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
    2507                 int total_map_swapcount;
    2508                 if (!trylock_page(vmf->page)) {
    2509                         get_page(vmf->page);
    2510                         pte_unmap_unlock(vmf->pte, vmf->ptl);
    2511                         lock_page(vmf->page);
    2512                         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
    2513                                         vmf->address, &vmf->ptl);
    2514                         if (!pte_same(*vmf->pte, vmf->orig_pte)) {
    2515                                 unlock_page(vmf->page);
    2516                                 pte_unmap_unlock(vmf->pte, vmf->ptl);
    2517                                 put_page(vmf->page);
    2518                                 return 0;
    2519                         }
    2520                         put_page(vmf->page);
    2521                 }
    2522                 if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
    2523                         if (total_map_swapcount == 1) {
    2524                                 /*
    2525                                  * The page is all ours. Move it to
    2526                                  * our anon_vma so the rmap code will
    2527                                  * not search our parent or siblings.
    2528                                  * Protected against the rmap code by
    2529                                  * the page lock.
    2530                                  */
    2531                                 page_move_anon_rmap(vmf->page, vma);
    2532                         }
    2533                         unlock_page(vmf->page);
    2534                         wp_page_reuse(vmf);
    2535                         return VM_FAULT_WRITE;
    2536                 }
    2537                 unlock_page(vmf->page);
    2538         } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
    2539                                         (VM_WRITE|VM_SHARED))) {
    2540                 return wp_page_shared(vmf);
    2541         }

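Line 2506 selects anonymous pages that are not KSM pages.
Line 2522 checks whether the page belongs to this process alone; reuse_swap_page returns true when the page's combined map count and swap count is at most 1.
Line 2534 calls wp_page_reuse to handle the fault (this is the key step).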

    2195 /*
    2196  * Handle write page faults for pages that can be reused in the current vma
    2197  *
    2198  * This can happen either due to the mapping being with the VM_SHARED flag,
    2199  * or due to us being the last reference standing to the page. In either
    2200  * case, all we need to do here is to mark the page as writable and update
    2201  * any related book-keeping.
    2202  */
    2203 static inline void wp_page_reuse(struct vm_fault *vmf)
    2204         __releases(vmf->ptl)
    2205 {
    2206         struct vm_area_struct *vma = vmf->vma;
    2207         struct page *page = vmf->page;
    2208         pte_t entry;
    2209         /*
    2210          * Clear the pages cpupid information as the existing
    2211          * information potentially belongs to a now completely
    2212          * unrelated process.
    2213          */
    2214         if (page)
    2215                 page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
    2216
    2217         flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
    2218         entry = pte_mkyoung(vmf->orig_pte);
    2219         entry = maybe_mkwrite(pte_mkdirty(entry), vma);
    2220         if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
    2221                 update_mmu_cache(vma, vmf->address, vmf->pte);
    2222         pte_unmap_unlock(vmf->pte, vmf->ptl);
    2223 }

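The code shows this clearly:
Line 2218 marks the entry as accessed.
Line 2219 marks the entry dirty and, if the vma containing the page is writable, makes the entry value writable.
Line 2220 writes the prepared value into the page-table entry, which is what actually changes the mapping. Note that on arm64 ptep_set_access_flags also flushes the TLB entry for the page.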
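The choice between reusing and copying an anonymous page can be sketched as below. The names are hypothetical, and the test on the sum of map count and swap count is a simplification modelled on reuse_swap_page for an ordinary (non-huge) page; page locking and the anon_vma move are left out.

```c
#include <stdbool.h>

struct toy_page {
    int mapcount;       /* page-table entries mapping the page */
    int swapcount;      /* swap entries still referring to the page */
};

struct toy_pte {
    struct toy_page *page;
    bool writable;
    bool dirty;
    bool young;
};

enum toy_result { TOY_REUSED, TOY_MUST_COPY };

/* Models wp_page_reuse(): nothing is allocated, only the entry's bits change. */
static void toy_wp_page_reuse(struct toy_pte *pte, bool vma_writable)
{
    pte->young = true;                  /* models pte_mkyoung() */
    pte->dirty = true;                  /* models pte_mkdirty() */
    if (vma_writable)
        pte->writable = true;           /* models maybe_mkwrite() */
}

/* Decide between reuse and copy for a write fault on an anonymous page. */
enum toy_result toy_do_wp_anon(struct toy_pte *pte, bool vma_writable)
{
    if (pte->page->mapcount + pte->page->swapcount <= 1) {
        toy_wp_page_reuse(pte, vma_writable);
        return TOY_REUSED;
    }
    return TOY_MUST_COPY;               /* the caller would go on to copy the page */
}
```

Applied to the scenario above: while page1 is mapped by both P and A the function asks for a copy, and once A has taken its own copy and the map count is 1, the parent's write is resolved by flipping the entry to writable.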

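That concludes the analysis of the COW mechanism. The process involves many more technical details than can be covered here; some of them may be discussed in later articles.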

V. Summary

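To summarize the whole copy-on-write (COW) mechanism: it starts when a parent forks a child. Parent and child share all private writable physical pages (sharing here does not refer to the usual distinction between shared and private mappings; it simply means that the same page is mapped in each process's page table), and the corresponding page-table entries in both processes are made read-only. When either side tries to write a shared physical page, the read-only entry causes a COW page fault. The fault handler allocates a new physical page for the writer, copies the contents of the shared page into it, and maps the new page in the writer's page table. The writer can then continue without affecting the other side, and from that point parent and child go their separate ways on that page. When a shared page is finally left with a single owner (every other process that mapped it has taken its own copy through copy-on-write), a write by that owner reuses the page instead of allocating a new one.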

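An experiment illustrates this:
The program has a global variable num = 10 and prints it, then forks a child. The child sets num = 100 and prints it. The parent sleeps for 1 s so that the child runs to completion first, then prints num again.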

    1 #include <stdio.h>
    2 #include <unistd.h>
    3 #include <sys/types.h>
    4 
    5 
    6 int num = 10;
    7 
    8 int main(int argc,char **argv)
    9 {
   10 
   11         pid_t pid;
   12 
   13         printf("###%s:%d  pid=%d num=%d###\n", __func__, __LINE__,  getpid(), num);
   14 
   15 
   16         pid = fork();
   17         if (pid < 0) {
   18                 printf("fail to fork\n");
   19                 return -1;
   20         } else if (pid == 0) { //child process
   21                 num = 100;
   22                 printf("### This is child process pid=%d  num=%d###\n", getpid(), num);
   23                 _exit(0);
   24         } else { //parent process
   25                 sleep(1);
   26                 printf("### This is parent process  pid=%d  num=%d###\n", getpid(), num);
   27                 _exit(0);
   28         }
   29 
   30         return 0;
   31 }

Consider what value of num is printed at lines 13, 22 and 26.
Compile and run:

    hanch@hanch-VirtualBox:~/test/COW$ gcc fork-cow-test.c -o fork-cow-test
    hanch@hanch-VirtualBox:~/test/COW$ ./fork-cow-test
    ###main:13  pid=26844 num=10###
    ### This is child process pid=26845  num=100###
    ### This is parent process  pid=26844  num=10###

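The parent's global num is 10, and after fork the child sets its own num to 100. At fork time the page holding num was made read-only in both processes and shared between them, so the child's write triggered a COW page fault. The fault is transparent to the application, but the kernel does a good deal of work in the fault handler: it allocates a physical page for the child, copies the contents of the page holding the parent's num into it, and updates the page-table entry for the child's virtual address so that it is writable and maps the newly allocated page. The fault handler then returns from kernel space to user space, the processor re-executes the store instruction, and only then is the child's num set to 100. Note that the page holding the parent's num is still read-only at this point, so a later write by the parent also takes a COW page fault, which is resolved by reuse if the parent is by then the only process mapping the page.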
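Although the fault is transparent to the program, its effect can be observed through the minor-fault counter that getrusage reports. The following is a rough sketch under stated assumptions: `count_child_write_faults` is a hypothetical helper, it relies on the POSIX calls fork, pipe, getrusage and waitpid, and the number it returns depends on the system (page size, transparent huge pages, allocator behaviour), so no particular value should be expected.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that writes one byte to each of 'npages' pages the parent
 * has already touched, and return the number of minor page faults the
 * child took while writing. Returns -1 on error.
 */
long count_child_write_faults(size_t npages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    long faults = -1;
    unsigned char *buf;
    int fds[2];
    pid_t pid;

    if (page_size <= 0 || npages == 0)
        return -1;
    buf = malloc(npages * (size_t)page_size);
    if (!buf)
        return -1;
    memset(buf, 1, npages * (size_t)page_size);   /* parent touches every page */

    if (pipe(fds) < 0) {
        free(buf);
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        free(buf);
        return -1;
    }
    if (pid == 0) {                                /* child */
        volatile unsigned char *p = buf;           /* keep the stores from being optimized away */
        struct rusage before, after;
        long delta;

        close(fds[0]);
        getrusage(RUSAGE_SELF, &before);
        for (size_t i = 0; i < npages; i++)
            p[i * (size_t)page_size] = 2;          /* first write to each shared page */
        getrusage(RUSAGE_SELF, &after);
        delta = after.ru_minflt - before.ru_minflt;
        if (write(fds[1], &delta, sizeof(delta)) != (ssize_t)sizeof(delta))
            _exit(1);
        _exit(0);
    }
    close(fds[1]);                                 /* parent */
    if (read(fds[0], &faults, sizeof(faults)) != (ssize_t)sizeof(faults))
        faults = -1;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (buf[0] != 1)                               /* the parent's data must be unchanged */
        faults = -1;
    free(buf);
    return faults;
}
```

Calling `count_child_write_faults(16)` and printing the result gives a rough view of how many faults the child's writes caused, while the check on `buf[0]` confirms that the parent's copy of the data was not modified.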

Finally, a diagram illustrates the whole COW process: