qsnet-suse-2.6.patch
Index: LINUX-SRC-TREE/arch/i386/defconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/i386/defconfig
+++ LINUX-SRC-TREE/arch/i386/defconfig
@@ -2932,3 +2932,5 @@ CONFIG_CFGNAME="default"
 CONFIG_RELEASE="7.283"
 CONFIG_X86_BIOS_REBOOT=y
 CONFIG_PC=y
+CONFIG_IOPROC=y
+CONFIG_PTRACK=y
Index: LINUX-SRC-TREE/arch/i386/Kconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/i386/Kconfig
+++ LINUX-SRC-TREE/arch/i386/Kconfig
@@ -1022,6 +1022,9 @@ config APM_REAL_MODE_POWER_OFF
 	  a work-around for a number of buggy BIOSes. Switch this option on if
 	  your computer crashes instead of powering off properly.
 
+source "mm/Kconfig"
+source "kernel/Kconfig"
+
 endmenu
 
 source "arch/i386/kernel/cpu/cpufreq/Kconfig"
Index: LINUX-SRC-TREE/arch/i386/mm/hugetlbpage.c
===================================================================
--- LINUX-SRC-TREE.orig/arch/i386/mm/hugetlbpage.c
+++ LINUX-SRC-TREE/arch/i386/mm/hugetlbpage.c
@@ -16,6 +16,7 @@
 #include <linux/err.h>
 #include <linux/sysctl.h>
 #include <linux/mempolicy.h>
+#include <linux/ioproc.h>
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -393,6 +394,7 @@ zap_hugepage_range(struct vm_area_struct
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, start + length);
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
Index: LINUX-SRC-TREE/arch/ia64/defconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/ia64/defconfig
+++ LINUX-SRC-TREE/arch/ia64/defconfig
@@ -104,6 +104,8 @@ CONFIG_IA64_PALINFO=y
 CONFIG_EFI_VARS=y
 CONFIG_BINFMT_ELF=y
 CONFIG_BINFMT_MISC=m
+CONFIG_IOPROC=y
+CONFIG_PTRACK=y
 
 #
 # Power management and ACPI
Index: LINUX-SRC-TREE/arch/ia64/Kconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/ia64/Kconfig
+++ LINUX-SRC-TREE/arch/ia64/Kconfig
@@ -334,6 +334,8 @@ config EFI_VARS
 	  To use this option, you have to check that the "/proc file system
 	  support" (CONFIG_PROC_FS) is enabled, too.
 
+source "mm/Kconfig"
+source "kernel/Kconfig"
 source "fs/Kconfig.binfmt"
 
 endmenu
Index: LINUX-SRC-TREE/arch/ia64/mm/hugetlbpage.c
===================================================================
--- LINUX-SRC-TREE.orig/arch/ia64/mm/hugetlbpage.c
+++ LINUX-SRC-TREE/arch/ia64/mm/hugetlbpage.c
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/sysctl.h>
 #include <linux/mempolicy.h>
+#include <linux/ioproc.h>
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -378,6 +379,7 @@ void zap_hugepage_range(struct vm_area_s
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, start + length);
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
Index: LINUX-SRC-TREE/arch/x86_64/defconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/x86_64/defconfig
+++ LINUX-SRC-TREE/arch/x86_64/defconfig
@@ -98,6 +98,8 @@ CONFIG_MTRR=y
 CONFIG_GART_IOMMU=y
 CONFIG_SWIOTLB=y
 CONFIG_X86_MCE=y
+CONFIG_IOPROC=y
+CONFIG_PTRACK=y
 
 #
 # Power management options
Index: LINUX-SRC-TREE/arch/x86_64/Kconfig
===================================================================
--- LINUX-SRC-TREE.orig/arch/x86_64/Kconfig
+++ LINUX-SRC-TREE/arch/x86_64/Kconfig
@@ -343,6 +343,9 @@ source "drivers/acpi/Kconfig"
 
 source "arch/x86_64/kernel/cpufreq/Kconfig"
 
+source "mm/Kconfig"
+source "kernel/Kconfig"
+
 endmenu
 
 menu "Bus options (PCI etc.)"
Index: LINUX-SRC-TREE/Documentation/vm/ioproc.txt
===================================================================
--- /dev/null
+++ LINUX-SRC-TREE/Documentation/vm/ioproc.txt
@@ -0,0 +1,468 @@
+Linux IOPROC patch overview
+===========================
+
+The network interface for an HPC network differs significantly from
+network interfaces for traditional IP networks. HPC networks tend to
+be used directly from user processes and perform large RDMA transfers
+between these processes' address spaces. They also have a requirement
+for low latency communication, and typically achieve this by OS bypass
+techniques. This then requires a different model to traditional
+interconnects, in that a process may need to expose a large amount of
+its address space to the network RDMA.
+
+Locking down of memory has been a common mechanism for performing
+this, together with a pin-down cache implemented in user
+libraries. The disadvantage of this method is that large portions of
+the physical memory can be locked down for a single process, even if
+its working set changes over the different phases of its
+execution. This leads to inefficient memory utilisation - akin to the
+disadvantage of swapping compared to paging.
+
+This model also has problems where memory is being dynamically
+allocated and freed, since the pin-down cache is unaware that memory
+may have been released by a call to munmap() and so it will still be
+locking down the now unused pages.
+
+Some modern HPC network interfaces implement their own MMU and are
+able to handle a translation fault during a network access. The
+Quadrics (http://www.quadrics.com) devices (Elan3 and Elan4) have done
+this for some time and we expect others to follow the same route in
+the relatively near future. These NICs are able to operate in an
+environment where paging occurs and do not require memory to be locked
+down. The advantage of this is that the user process can expose large
+portions of its address space without having to worry about physical
+memory constraints.
+
+However, should the operating system decide to swap a page to disk,
+then the NIC must be made aware that it should no longer read/write
+from this memory, but should generate a translation fault instead.
+
+The ioproc patch has been developed to provide a mechanism whereby the
+device driver for a NIC can be aware of when a user process's address
+translations change, either by paging or by explicitly mapping or
+unmapping memory.
+
+The patch involves inserting callbacks where translations are being
+invalidated to notify the NIC that the memory behind those
+translations is no longer visible to the application (and so should
+not be visible to the NIC). This callback is then responsible for
+ensuring that the NIC will not access the physical memory that was
+being mapped.
+
+An ioproc invalidate callback in the kswapd code could be utilised to
+prevent memory from being paged out if the NIC is unable to support
+network page faulting.
+
+For NICs which support network page faulting, there is no requirement
+for a user level pin-down cache, since they are able to page-in their
+translations on the first communication using a buffer. However, this
+is likely to be inefficient, resulting in slow first use of the
+buffer. If the communication buffers were continually allocated and
+freed using mmap based malloc() calls then this would lead to all
+communications being slower than desirable.
+
+To optimise these warm-up cases the ioproc patch adds calls to
+ioproc_update wherever the kernel is creating translations for a user
+process. This then allows the device driver to preload translations
+so that they are already present for the first network communication
+from a buffer.
+
+Linux 2.6 IOPROC implementation details
+=======================================
+
+The Linux IOPROC patch adds hooks to the Linux VM code whenever page
+table entries are being created and/or invalidated. IOPROC device
+drivers can register their interest in being informed of such changes
+by registering an ioproc_ops structure which is defined as follows:
+
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+ioproc_register_ops
+===================
+This function should be called by the IOPROC device driver to register
+its interest in PTE changes for the process associated with the passed
+in mm_struct.
+
+The ioproc registration is not inherited across fork() and should be
+called once for each process that IOPROC is interested in.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_unregister_ops
+=====================
+This function should be called by the IOPROC device driver when it no
+longer needs to be informed of PTE changes in the process associated
+with the supplied mm_struct.
+
+This function does not normally need to be called as the ioproc_ops
+struct is unlinked from the associated mm_struct during the
+ioproc_release() call.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_ops struct
+=================
+A linked list of ioproc_ops structures is hung off the user process
+mm_struct (linux/sched.h). At each hook point in the patched kernel
+the ioproc patch will call the associated ioproc_ops callback function
+pointer in turn for each registered structure.
+
+The intention of the callbacks is to allow the IOPROC device driver to
+inspect the new or modified PTE entry via the Linux kernel
+(e.g. find_pte_map()). These callbacks should not modify the Linux
+kernel VM state or PTE entries.
+
+The ioproc_ops callback function pointers are defined as follows:
+
+ioproc_release
+==============
+The release hook is called when a program exits and all its vma areas
+are torn down and unmapped, i.e. during exit_mmap(). Before each
+release hook is called the ioproc_ops structure is unlinked from the
+mm_struct.
+
+No locks are required as the process has the only reference to the mm
+at this point.
+
+ioproc_sync_[range|page]
+========================
+The sync hooks are called when a memory map is synchronised with its
+disk image, i.e. when the msync() syscall is invoked. Any future read
+or write by the IOPROC device to the associated pages should cause the
+page to be marked as referenced or modified.
+
+Called holding the mm->page_table_lock
+
+ioproc_invalidate_[range|page]
+==============================
+The invalidate hooks are called whenever a valid PTE is unloaded,
+e.g. when a page is unmapped by the user or paged out by the
+kernel. After this call the IOPROC must not access the physical memory
+again unless a new translation is loaded.
+
+Called holding the mm->page_table_lock
+
+ioproc_update_[range|page]
+==========================
+The update hooks are called whenever a valid PTE is loaded,
+e.g. mmapping memory, moving the brk up, when breaking COW or faulting
+in an anonymous page of memory. These give the IOPROC device the
+opportunity to load translations speculatively, which can improve
+performance by avoiding device translation faults.
+
+Called holding the mm->page_table_lock
+
+ioproc_change_protection
+========================
+This hook is called when the protection on a region of memory is
+changed, i.e. when the mprotect() syscall is invoked.
+
+The IOPROC must not be able to write to a read-only page, so if the
+permissions are downgraded then it must honour them. If they are
+upgraded it can treat this in the same way as the
+ioproc_update_[range|page]() calls.
+
+Called holding the mm->page_table_lock
+
+
+Linux 2.6 IOPROC patch details
+==============================
+
+Here are the specific details of each ioproc hook added to the Linux
+2.6 VM system and the reasons for doing so:
+
+++++ FILE
+	mm/fremap.c
+
+==== FUNCTION
+	zap_pte
+
+CALLED FROM
+	install_page
+	install_file_pte
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+==== FUNCTION
+	install_page
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte
+
+ADDED HOOKS
+	ioproc_update_page
+
+==== FUNCTION
+	install_file_pte
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+++++ FILE
+	mm/memory.c
+
+==== FUNCTION
+	zap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, madvise_dontneed, unmap_mapping_range,
+	unmap_mapping_range_list, do_mmap_pgoff
+
+PTE MODIFICATION
+	set_pte (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+
+
+==== FUNCTION
+	zeromap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, mmap_zero
+
+PTE MODIFICATION
+	set_pte (zeromap_pte_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	remap_page_range
+
+CALLED FROM
+	many device drivers
+
+PTE MODIFICATION
+	set_pte (remap_pte_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	break_cow
+
+CALLED FROM
+	do_wp_page
+
+PTE MODIFICATION
+	ptep_establish
+
+ADDED HOOKS
+	ioproc_invalidate_page
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_wp_page
+
+CALLED FROM
+	do_swap_page, handle_pte_fault
+
+PTE MODIFICATION
+	ptep_set_access_flags
+
+ADDED HOOKS