CN114840332A - Page exchange method and device and electronic equipment - Google Patents

Page exchange method and device and electronic equipment

Info

Publication number
CN114840332A
Authority
CN
China
Prior art keywords
page table
thread
page
global
cpu core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210307376.1A
Other languages
Chinese (zh)
Inventor
乔一凡
陆庆达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210307376.1A
Publication of CN114840332A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiments of the disclosure disclose a page swapping method, a page swapping apparatus, and an electronic device. In the page swapping method described in the present disclosure, the swap-out overhead under a hybrid storage structure can be reduced by: (1) tracking page-mapping caching with per-thread page tables to reduce TLB shootdown overhead; and (2) treating swap-out as lightweight threads that the operating system kernel exports to the runtime of a programming language with garbage collection and lightweight concurrency, which schedules them so that swap-out is performed asynchronously.

Description

Page exchange method and device and electronic equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular to a page swapping method, a page swapping apparatus, and an electronic device.
Background
Currently, the memory consumption of data center applications keeps increasing. Memory-intensive applications, such as data analysis systems, graph processing systems, in-memory caching systems, and databases, face tremendous memory pressure given the limited Dynamic Random Access Memory (DRAM) capacity that a single server can accommodate. Kernel swapping can alleviate this pressure by swapping out (page-out) inactive pages to a high-performance Solid State Disk (SSD) or to remote memory. Before the kernel swaps in (swap-in) pages from swap space, it first needs to swap pages out to make room for the pages being swapped in. However, page-out is a poorly scalable, heavyweight operation that sits on the critical path.
The current kernel swap system is designed for the traditional scenario in which a disk (HDD) is used as the swap device and disk access is slow. Applications cannot swap to disk frequently without unacceptable performance degradation. However, this assumption does not hold in a hybrid storage architecture, because non-volatile memory (NVM) provides higher read/write bandwidth and lower random-access latency than conventional HDDs/SSDs. Thus, the swap throughput of applications in hybrid memory can be orders of magnitude higher than when swapping to disk. For example, if a logistic regression test is run on hybrid memory with 25% of the working set on the DRAM side and 75% on the NVM side, experimental results show that each CPU core triggers a major page fault (Major page fault) roughly every 100 us during operation. High-frequency swap-in is accompanied by high-frequency swap-out to make room for the newly swapped-in pages.
Swapping out is heavyweight work on the operating system kernel's page fault handling path. Each swap-out first selects a victim physical page by scanning the kernel's LRU (Least Recently Used) active/inactive page lists, then searches for the Page Table Entries (PTEs) referencing the victim page via reverse mapping, and finally sends an Inter-Processor Interrupt (IPI) broadcast to all online CPU cores to shoot down the corresponding Translation Lookaside Buffer (TLB) entries before writing the page out to NVM. This complex series of steps introduces high computational overhead. Furthermore, the cost of sending and responding to IPIs grows with the number of online CPU cores, leading to serious scalability issues. Frequent TLB shootdowns and flushes also increase the TLB miss rate and hurt application performance.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present disclosure provide a lightweight page swapping method and apparatus for a hybrid storage architecture, and an electronic device, which solve the scalability problem of TLB shootdown during page swap-out by using per-thread page tables and can reduce the computational overhead of the page swap-out process.
In a first aspect, an embodiment of the present disclosure provides a page swapping method, where the method includes:
allocating a per-thread page table for a first-type thread to record the page mappings of pages accessed locally by the first-type thread;
allocating a global page table for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process;
when page fault handling is performed during page swapping, detecting the shared state, in the global page table, of a page table entry of the global page table, where a CPU core bitmap is attached to each page table entry in the global page table to track all CPU cores that cache the entry in a translation lookaside buffer, thereby determining the shared state of the page table entry;
and, according to the detected shared state of the page table entry of the global page table, performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved in the page fault handling.
With reference to the first aspect, in a first implementation manner of the first aspect, the performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, according to the detected shared state of the page table entry of the global page table and based on the per-thread page tables and the global page table involved in the page fault handling, includes:
traversing, with the memory management unit of a CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure when a translation lookaside buffer miss occurs after a flush of the translation lookaside buffer;
and performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the traversal result of the memory management unit in the CPU core and on the per-thread page tables and the global page table involved in the page fault handling.
With reference to the first aspect or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the page fault handling includes at least one of the following:
for a CPU core loaded with a per-thread page table, when the page fault is that the corresponding page table entry is absent from the per-thread page table, having the per-thread page table obtain the missing entry from the global page table;
for a CPU core loaded with a per-thread page table, when the page fault is that the corresponding page table entry is absent from both the per-thread page table and the global page table, obtaining a valid page table entry through the operating system kernel, filling the obtained entry into the per-thread page table, and marking the entry in the global page table with a specific forward pointer to the per-thread page table's entry;
for a CPU core loaded with the global page table, when the page fault is that a specific forward pointer to a per-thread page table entry is invalid, copying the value of that entry into the global page table through the operating system kernel and marking the entry as shared in its CPU core bitmap;
for a CPU core loaded with the global page table, when the page fault is that the page table entry does not exist, obtaining a valid page table entry through the operating system kernel, filling the obtained entry into the per-thread page table, and marking the entry in the global page table with a specific forward pointer to the per-thread page table's entry.
With reference to the first aspect or the first implementation manner of the first aspect, in a third implementation manner of the first aspect, the performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, according to the detected shared state of the page table entry of the global page table and based on the per-thread page tables and the global page table involved in the page fault handling, includes:
when the detected page table entry of the global page table is in a private state in the global page table, the CPU core scheduling the thread involved in the page fault handling shooting the entry down from its own translation lookaside buffer and skipping the inter-processor interrupt broadcast; or
when the detected page table entry of the global page table is in a shared state in the global page table, the CPU core scheduling the thread involved in the page fault handling sending inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, where the CPU cores recorded in the CPU core bitmap schedule second-type threads other than the first-type threads.
With reference to the first aspect, in a fourth implementation manner of the first aspect, the first-type thread is created and started by the runtime of a programming language with lightweight concurrency.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the method further includes:
exporting the swap-out thread executed in page fault handling as a swap-out lightweight thread, using the operating system kernel and the runtime of the programming language with lightweight concurrency, and scheduling the swap-out lightweight thread through the runtime together with the user lightweight threads managed and scheduled by the runtime, so that the swap-out lightweight thread executes asynchronously.
With reference to the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the method further includes:
inserting safepoints into the multiple steps performed by the swap-out lightweight thread;
at runtime startup, allocating a set of memory areas in the operating system kernel to store the swap-out state information;
the swap-out lightweight thread allocating a sub-memory area from the set of memory areas and saving its execution state into the sub-memory area;
at each safepoint, returning to the runtime, through the operating system kernel, an identifier serving as the index of the sub-memory area;
and passing the identifier back to the operating system kernel when the runtime reschedules the swap-out lightweight thread.
With reference to the fifth implementation manner or the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the scheduling, through the runtime, of the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime includes:
assigning, through the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads.
With reference to the first aspect or the first implementation manner of the first aspect, in an eighth implementation manner of the first aspect, the page swapping is performed in a hybrid storage architecture that includes at least a dynamic random access memory DRAM and a non-volatile memory NVM.
In a second aspect, an embodiment of the present disclosure provides a page swapping apparatus, where the apparatus includes:
a per-thread page table allocation module configured to allocate a per-thread page table for a first-type thread to record the page mappings of pages accessed locally by the first-type thread;
a global page table allocation module configured to allocate a global page table for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process;
a page table entry detection module configured to detect, when page fault handling is performed during page swapping, the shared state of a page table entry of the global page table in the global page table, where a CPU core bitmap is attached to each page table entry in the global page table to track all CPU cores that cache the entry in a translation lookaside buffer, thereby determining the shared state of the page table entry;
and a translation lookaside buffer shootdown module configured to perform shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved in the page fault handling and according to the detected shared state of the page table entry of the global page table.
In a third aspect, an embodiment of the present disclosure provides an electronic device including a memory and a processor, where the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or any one of the first to eighth implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a readable storage medium on which computer instructions are stored, and the computer instructions, when executed by a processor, implement the method according to the first aspect or any one of the first to eighth implementation manners of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including computer instructions that, when executed by a processor, implement the method according to the first aspect or any one of the first to eighth implementation manners of the first aspect.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
According to the technical solution provided by the embodiments of the present disclosure, a per-thread page table is allocated for a first-type thread to record the page mappings of pages accessed locally by the first-type thread; a global page table is allocated for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process; when page fault handling is performed during page swapping, the shared state of a page table entry of the global page table is detected, where a CPU core bitmap is attached to each page table entry in the global page table to track all CPU cores that cache the entry in a translation lookaside buffer, thereby determining the shared state of the entry; and, according to the detected shared state, shootdown processing is performed on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved. The scalability problem of TLB shootdown in page swap-out operations can thus be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
According to the technical solution provided by the embodiments of the present disclosure, the shootdown processing includes: traversing, with the memory management unit of a CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure when a translation lookaside buffer miss occurs after a flush of the translation lookaside buffer; and performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the traversal result and on the per-thread page tables and the global page table involved. This solves the scalability problem of TLB shootdown in page swap-out operations by using per-thread page tables and reduces the computational overhead of the page swap-out process.
According to the technical solution provided by the embodiments of the present disclosure, the page fault handling includes at least one of the following: for a CPU core loaded with a per-thread page table, when the corresponding page table entry is absent from the per-thread page table, obtaining the missing entry from the global page table; for a CPU core loaded with a per-thread page table, when the corresponding entry is absent from both the per-thread page table and the global page table, obtaining a valid entry through the operating system kernel, filling it into the per-thread page table, and marking the entry in the global page table with a specific forward pointer to the per-thread entry; for a CPU core loaded with the global page table, when a specific forward pointer to a per-thread page table entry is invalid, copying the value of that entry into the global page table through the operating system kernel and marking it as shared in its CPU core bitmap; and for a CPU core loaded with the global page table, when the entry does not exist, obtaining a valid entry through the operating system kernel, filling it into the per-thread page table, and marking the entry in the global page table with a specific forward pointer to the per-thread entry. The scalability problem of TLB shootdown in page swap-out operations can thus be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
According to the technical solution provided by the embodiments of the present disclosure, when the detected page table entry of the global page table is in a private state, the CPU core scheduling the thread involved in the page fault handling shoots the entry down from its own translation lookaside buffer and skips the inter-processor interrupt broadcast; when the entry is in a shared state, that CPU core sends inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, which schedule second-type threads other than the first-type threads. The scalability problem of TLB shootdown in page swap-out operations can thus be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
According to the technical solution provided by the embodiments of the present disclosure, the first-type thread is created and started by the runtime of a programming language with lightweight concurrency, thereby reducing the swap-out overhead of applications in such a language.
According to the technical solution provided by the embodiments of the present disclosure, the swap-out thread executed in page fault handling is exported as a swap-out lightweight thread by the operating system kernel and the runtime of the programming language with lightweight concurrency, and the swap-out lightweight thread is scheduled by the runtime together with the user lightweight threads it manages and schedules, so that the swap-out lightweight thread executes asynchronously. The swap-out thread can thus be decoupled into several swap-out lightweight threads scheduled together with ordinary user lightweight threads to perform swap-out asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
According to the technical solution provided by the embodiments of the present disclosure, safepoints are inserted into the multiple steps performed by the swap-out lightweight thread; at runtime startup, a set of memory areas is allocated in the operating system kernel to store the swap-out state information; the swap-out lightweight thread allocates a sub-memory area from the set of memory areas and saves its execution state into that sub-area; at each safepoint, an identifier serving as the index of the sub-area is returned to the runtime through the operating system kernel; and the identifier is passed back to the operating system kernel when the runtime reschedules the swap-out lightweight thread. The swap-out thread can thus be decoupled into several swap-out lightweight threads scheduled together with ordinary user lightweight threads to perform swap-out asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
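To make the safepoint flow above concrete, the following user-space C sketch models the identifier/state-area handshake between the kernel and the runtime; all names (swap_ctx, swap_state_area, swap_ctx_save, swap_ctx_resume) and the fixed-size table are illustrative assumptions rather than interfaces from the disclosure or any real kernel.

    #include <stdint.h>

    #define MAX_SWAP_CTX 1024

    /* Execution state of one swap-out lightweight thread, saved at a safepoint
     * (assumed fields; the disclosure does not specify the layout). */
    struct swap_ctx {
        int      in_use;
        int      step;        /* which swap-out step to resume from */
        uint64_t victim_pfn;  /* victim physical page chosen so far */
    };

    /* The set of memory areas allocated in the kernel at runtime startup. */
    static struct swap_ctx swap_state_area[MAX_SWAP_CTX];

    /* At a safepoint: allocate a sub-area, save state, and return an
     * identifier (the index) for the kernel to hand back to the runtime. */
    int swap_ctx_save(int step, uint64_t victim_pfn) {
        for (int id = 0; id < MAX_SWAP_CTX; id++) {
            if (!swap_state_area[id].in_use) {
                swap_state_area[id] = (struct swap_ctx){1, step, victim_pfn};
                return id;
            }
        }
        return -1;  /* the area is exhausted */
    }

    /* When the runtime reschedules the swap-out lightweight thread, it passes
     * the identifier back and the kernel resumes from the saved step. */
    struct swap_ctx *swap_ctx_resume(int id) {
        if (id < 0 || id >= MAX_SWAP_CTX || !swap_state_area[id].in_use)
            return NULL;
        swap_state_area[id].in_use = 0;  /* the sub-area is free once resumed */
        return &swap_state_area[id];
    }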
According to the technical solution provided by the embodiments of the present disclosure, scheduling the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime includes assigning, through the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads. The swap-out thread can thus be decoupled into several swap-out lightweight threads scheduled together with ordinary user lightweight threads to perform swap-out asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
According to the technical solution provided by the embodiments of the present disclosure, by performing page swapping in a hybrid storage architecture that includes at least a dynamic random access memory DRAM and a non-volatile memory NVM, the scalability problem of TLB shootdown in page swap-out operations can be solved for the DRAM/NVM hybrid storage architecture by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
According to the technical solution provided by the embodiments of the present disclosure, the per-thread page table allocation module is configured to allocate a per-thread page table for a first-type thread to record the page mappings of pages accessed locally by the first-type thread; the global page table allocation module is configured to allocate a global page table for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process; the page table entry detection module is configured to detect, when page fault handling is performed during page swapping, the shared state of a page table entry of the global page table, where a CPU core bitmap is attached to each entry in the global page table to track all CPU cores that cache it in a translation lookaside buffer; and the translation lookaside buffer shootdown module is configured to perform shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved. The scalability problem of TLB shootdown in page swap-out operations can thus be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a page swapping method according to an embodiment of the present disclosure;
FIG. 2 illustrates an exemplary diagram of a page table structure including a per-thread page table and a global page table employed in a lightweight page swap method of a hybrid storage architecture according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a scenario in the page swapping method according to an embodiment of the present disclosure in which a swap-out thread is exported by the Go runtime as a special Goroutine and scheduled together with user Goroutines;
FIG. 4 shows a block diagram of a page swapping device according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a computer system suitable for use in implementing a method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having" are intended to indicate the presence of the features, numbers, steps, actions, components, parts, or combinations thereof disclosed in this specification, and are not intended to preclude the possibility that one or more other features, numbers, steps, actions, components, parts, or combinations thereof are present or added.
It should be further noted that the embodiments of the present disclosure and the features in the embodiments may be combined with each other in the absence of conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In embodiments of the present disclosure, the following concepts may be mentioned:
non-volatile memory (NVM): persistent byte-addressable memory. NVM provides larger memory capacity compared to DRAM, lower access latency (<10us) compared to SSD, and higher bandwidth. For example, as an intermediate layer between a DRAM and an SSD in a storage hierarchy, an NVM may be used as a high performance swap partition when application memory is scarce.
Programming languages with garbage collection and lightweight concurrency: many general-purpose programming languages have built-in support for lightweight concurrency (see the definition below) and garbage-collected runtime systems (see the definition below). Go and Erlang are two examples of such languages. Go is a popular modern language, and we use Go as the example in the description of the embodiments of the present disclosure. The following embodiments also refer to a programming language with garbage collection and lightweight concurrency simply as a programming language with lightweight concurrency. Lightweight concurrency refers to the large amount of concurrency that can exist when lightweight threads are run with such a language, hence the name. Compared with traditional system-level threads and processes, the biggest advantage of lightweight threads is that they are "lightweight": millions or even tens of millions can easily be created without exhausting system resources, whereas the number of threads and processes the system's resources can support is far smaller.
Lightweight concurrency is provided by lightweight user-level threads managed by the language runtime. For example, a Goroutine is a lightweight user-level thread (or simply, a lightweight thread) managed and scheduled by the Go runtime. The Go runtime creates a set of kernel threads as a mutator thread pool and schedules Goroutines onto the available threads in the pool. Throughout this document, we refer to kernel threads as "threads".
Garbage Collection (GC) in Go: in addition to the mutator threads, the Go runtime creates a set of GC threads for concurrent tracing. The GC threads trace from certain "root" objects and mark all reachable objects along reference chains. When all reachable objects are marked live, tracing is complete. The GC threads then compact the live objects into a contiguous memory region and release the memory occupied by dead objects. A thread running user code is called a mutator thread; the other threads created at runtime for auxiliary work such as GC are non-mutator threads.
Page table: a page table is a data structure residing in memory that maps virtual memory addresses to physical addresses. In the x86 and x86-64 architectures, the page table can be viewed as a radix tree (radix-tree) indexed by virtual address. The operating system provides a page table for each process, whose address is stored in the CPU's CR3 register. When translating a virtual address to a physical address, the CPU reads the CR3 register to locate the page table and retrieves the mapping.
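As a concrete illustration of this radix-tree indexing, the following standalone C program decomposes an x86-64 virtual address into its four 9-bit level indices and 12-bit page offset; this reflects the standard x86-64 four-level layout rather than anything specific to the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    /* x86-64 four-level paging: each level consumes 9 bits of the virtual
     * address, and a 4KB page leaves a 12-bit offset. */
    static unsigned level_index(uint64_t va, int level) {
        return (unsigned)((va >> (12 + 9 * level)) & 0x1ff);  /* level 3 = PGD ... 0 = PT */
    }

    int main(void) {
        uint64_t va = 0x00007f8a12345678ULL;
        printf("PGD %u, PUD %u, PMD %u, PT %u, offset 0x%llx\n",
               level_index(va, 3), level_index(va, 2),
               level_index(va, 1), level_index(va, 0),
               (unsigned long long)(va & 0xfff));
        return 0;
    }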
Page Table Entry (PTE): the leaves of the radix tree, referred to as PTEs, store physical memory addresses and auxiliary bit flags indicating the status of the physical page, such as the access bit and the dirty bit. For example, the PTE structure defined by the architecture specification has a number of available bits that are ignored by the hardware and customizable by the operating system.
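The following C sketch shows the corresponding x86-64 flag-bit definitions, including one software-available bit of the kind the disclosure later uses to tag forward pointers in the global page table; the particular bit chosen (bit 9) is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t pte_t;

    /* Architectural x86-64 PTE flag bits (low bits of the entry). */
    #define PTE_PRESENT  (1ULL << 0)
    #define PTE_WRITABLE (1ULL << 1)
    #define PTE_ACCESSED (1ULL << 5)   /* set by hardware on access */
    #define PTE_DIRTY    (1ULL << 6)   /* set by hardware on write */
    /* Bits 9-11 are ignored by the hardware and free for OS use; the
     * disclosure uses one such customizable bit to mark forward pointers
     * (the choice of bit 9 here is an assumption). */
    #define PTE_SW_FWD   (1ULL << 9)

    static inline bool pte_present(pte_t e)  { return e & PTE_PRESENT; }
    static inline bool pte_forward(pte_t e)  { return e & PTE_SW_FWD; }
    static inline uint64_t pte_pfn(pte_t e)  { return (e >> 12) & ((1ULL << 40) - 1); }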
Paging and Swapping (Paging and Swapping): a typical operating system divides the virtual address space and the physical memory space into contiguous pages of the same size (for example, 4KB or 2MB). When operating system memory runs low, it swaps out (page-out) recently unused pages to swap space through the swap mechanism. Swap space is typically located on NVM or disk. Later, when an application accesses a page located in swap space, it triggers a page fault, and the operating system fetches the page back into main memory; this is called swap-in.
Inter-processor interrupts (IPIs): a special type of hardware interrupt in multi-core CPUs. An IPI issued by one CPU core to another can interrupt the target core and require it to act and respond.
TLB shootdown (TLB shootdown): Translation Lookaside Buffers (TLBs) are a special CPU cache used to cache virtual-to-physical address translations; the TLB is accessed on every memory read/write operation. To improve efficiency, hardware does not typically keep TLBs coherent; instead, the operating system implements a TLB shootdown protocol to maintain TLB coherency. When the mapping of a virtual page changes and its TLB entries become stale, the operating system kernel flushes those entries from the TLBs of all CPU cores to restore coherency. In practice, the kernel can directly invalidate the corresponding entry in the current CPU core's TLB, but must broadcast an IPI request to all remote CPU cores, have them invalidate the corresponding TLB entry or flush the entire TLB according to additional information in the IPI, and wait for their responses. The latency of IPI transmission and response grows as the number of CPU cores increases.
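For reference, here is a minimal C sketch of the broadcast shootdown protocol just described; local_tlb_invalidate and the ipi_* primitives are assumed stubs standing in for real kernel facilities. Its cost grows linearly with the number of online cores, which is precisely the scalability problem the disclosure targets.

    #include <stdint.h>

    #define NR_CPUS 64

    void local_tlb_invalidate(uint64_t va);  /* e.g., invlpg on this core (stub) */
    void ipi_send(int cpu, uint64_t va);     /* ask a remote core to invalidate (stub) */
    void ipi_wait_ack(int cpu);              /* wait for the remote response (stub) */
    int  this_cpu(void);                     /* current core id (stub) */

    /* Baseline shootdown: invalidate locally, then IPI every online remote
     * core and wait for all of them to respond. */
    void tlb_shootdown_broadcast(uint64_t va) {
        local_tlb_invalidate(va);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu != this_cpu())
                ipi_send(cpu, va);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu != this_cpu())
                ipi_wait_ack(cpu);
    }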
In order to solve the problems of the prior art, the inventors of the present disclosure considered the following related solutions:
for example, one access bit-based TLB invalidation mechanism: ABIS (Access Based invocation System) re-uses the access bits in the PTE as hints of PTE sharing status, avoiding IPI broadcast during TLB shootdown if the page is accessed by only one CPU core. ABIS constructs a false page table to copy page mappings when handling page faults. With such a duplicate page table, it can insert an entry directly into the TLB of the CPU core without setting an access bit in the corresponding PTE. If other CPU cores access the page and trigger a TLB miss, the access bit will be set, thus serving as an indicator of the shared page. However, when a TLB entry is evicted and later reinserted, the ABIS may cause a false-positive indication (FALSE-POSITIVE INDICATION). Go GC (garpage Collection) threads can significantly magnify the problem because they track active objects at the same time, thus sharing a large number of pages with the mutator thread.
For example, per-core page table (Per-core page table) schemes: RadixVM provides radix-tree-structured per-core page tables to track page sharing state and avoid unnecessary IPI broadcasts during TLB shootdown, but it is implemented in a teaching operating system (teaching OS) that lacks many memory management functions. PSPT (Partially Separated Page Tables) provides a similar per-core page table design for a certain processor in an HPC (high performance computing) setting. PSPT requires the programmer to carefully annotate the compute regions each per-core page table will handle; such detailed modification and annotation of programs is often impractical for non-HPC applications. Furthermore, neither scheme has a global page table, so the PTEs in multiple page tables must be changed whenever a page mapping is modified, incurring additional overhead. And because a per-core page table is used instead of a per-thread page table, the design cannot distinguish between different types of application threads (e.g., mutator threads and GC threads).
For example, kernel swap path optimization schemes: Fastswap optimizes page reads/writes during swapping by bypassing the kernel block-IO software stack and using the frontswap interface. It reduces the software overhead of page IO, but the swap-out operation remains on the critical path of page fault handling, and TLB shootdown still has scalability issues.
Considering the disadvantages of the related technical solutions, the embodiments of the present disclosure provide a per-thread page table (per-thread page table) design to solve the scalability problem of TLB shootdown in the page swap-out process and reduce its computational overhead. In addition, the embodiments of the present disclosure propose a method to reduce the swap-out overhead (swap-out overhead) of applications in programming languages with garbage collection and lightweight concurrency (e.g., the Go language) by decoupling the swap-out thread into several "swap-out lightweight threads" (e.g., Goroutines) and scheduling them together with normal lightweight threads (e.g., Goroutines).
It should be noted that "lightweight" is a concept opposed to "heavyweight"; the two are compared mainly in terms of the ease of use of the application framework, the characteristics of the services provided, and so on. References to "lightweight" in the embodiments of the present disclosure mean "lightweight" in the sense of programming languages with garbage collection and lightweight concurrency.
According to the technical solution provided by the embodiments of the present disclosure, a per-thread page table is allocated for a first-type thread to record the page mappings of pages accessed locally by the first-type thread; a global page table is allocated for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process; when page fault handling is performed during page swapping, the shared state of a page table entry of the global page table is detected, where a CPU core bitmap is attached to each page table entry in the global page table to track all CPU cores that cache the entry in a translation lookaside buffer, thereby determining the shared state of the entry; and, according to the detected shared state, shootdown processing is performed on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved. The scalability problem of TLB shootdown in page swap-out operations can thus be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
In order to solve the above problems, the present disclosure proposes a page swapping method, apparatus, electronic device, readable storage medium, and computer program product.
Fig. 1 illustrates a flow chart of a page swapping method according to an embodiment of the present disclosure. As shown in fig. 1, the page swapping method includes steps S101, S102, S103, S104.
In step S101, a per-thread page table is allocated for a first-type thread to record the page mappings of pages accessed locally by the first-type thread.
In step S102, a global page table is allocated for the process to which the first-type thread belongs to record the page mappings of pages accessed by the threads in the process.
In step S103, when page fault handling is performed during page swapping, the shared state of a page table entry of the global page table in the global page table is detected, where a CPU core bitmap is attached to each page table entry in the global page table to track all CPU cores that cache the entry in a translation lookaside buffer, thereby determining the shared state of the page table entry.
In step S104, according to the detected shared state of the page table entry of the global page table, shootdown processing is performed on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the per-thread page tables and the global page table involved in the page fault handling.
In one embodiment of the present disclosure, page swapping is performed in a hybrid storage architecture that includes at least a dynamic random access memory DRAM and a non-volatile memory NVM. The kernel swap system of the related art is designed for the traditional scenario in which a disk is used as the swap device (swap device) and disk access is slow. Applications cannot swap to disk frequently without unacceptable performance degradation. However, this assumption does not hold under the hybrid storage architectures mentioned in the embodiments of the present disclosure, since NVM provides higher read/write bandwidth and lower random-access latency than conventional HDDs/SSDs. Thus, the swap throughput of applications in hybrid memory can be orders of magnitude higher than when swapping to disk. For example, if a logistic regression test is run on hybrid memory with 25% of the working set on the DRAM side and 75% on the NVM side, experimental results show that each CPU core triggers a major page fault roughly every 100 us during operation. High-frequency swap-in is accompanied by high-frequency swap-out to make room for the newly swapped-in pages. In one embodiment of the present disclosure, a hybrid storage architecture may include multiple storage tiers; for example, NVM as an intermediate layer between DRAM and SSD in the storage hierarchy may be used as a high-performance swap partition when application memory is scarce.
According to the technical solution provided by the embodiments of the present disclosure, by performing page swapping in a hybrid storage architecture that includes at least a dynamic random access memory DRAM and a non-volatile memory NVM, the scalability problem of TLB shootdown in page swap-out operations can be solved for the DRAM/NVM hybrid storage architecture by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
In one embodiment of the present disclosure, the first-type thread is created and launched by the runtime of a programming language with lightweight concurrency. In one embodiment of the present disclosure, a programming language with lightweight concurrency refers to the aforementioned programming language with garbage collection and lightweight concurrency, e.g., the Go language or the Erlang language.
According to the technical solution provided by the embodiments of the present disclosure, the first-type thread is created and started by the runtime of a programming language with lightweight concurrency, thereby reducing the swap-out overhead of such languages.
In one embodiment of the present disclosure, the page swapping method may refer to a page swapping method for a hybrid storage architecture, and in particular to a lightweight page swapping method for a hybrid storage architecture. Calling the page swapping method lightweight means that it makes use of the support of a programming language with lightweight concurrency.
In one embodiment of the present disclosure, a hierarchical page table structure is proposed to track whether a page is accessed by multiple CPU cores. If a physical page is accessed by only a single core, the operating system kernel can skip the expensive IPI broadcast and perform the TLB shootdown locally when evicting the page, thereby reducing TLB shootdown overhead during swap-out. To balance memory consumption and tracking accuracy, we provide a per-thread page table for each Go mutator thread and a backup global page table for all (performance-uncritical) non-mutator threads. Good performance can be provided for the mutator threads by keeping the per-thread page tables and the global page table consistent.
A page table structure employed in a lightweight page swap method of a hybrid storage architecture according to an embodiment of the present disclosure will be described below with reference to fig. 2.
FIG. 2 illustrates an exemplary schematic diagram of a page table structure including a per-thread page table and a global page table employed in a lightweight page swap method of a hybrid storage architecture according to an embodiment of the present disclosure.
The general design of the page table structure is described below with reference to fig. 2. As shown in FIG. 2, a per-thread page table is assigned to each mutator thread started by the Go runtime (Go runtime) to record the page mappings of pages accessed locally by that mutator thread, and a "shared" global page table is reserved for the entire process to which the mutator threads belong to record the page mappings of pages accessed by multiple threads. The global page table is also used for address translation by the non-mutator threads (see CPU core (or simply, core) 1 in fig. 2). The operating system also maintains a per-core flag marking whether the thread currently on that core is a mutator thread; as shown in FIG. 2, the flag is 1 for a mutator thread and 0 for a non-mutator thread. Both the per-thread page table structure and the process global page table structure may be similar to the conventional x86/x86-64 page table structure, and they share the same Process-Context Identifier (PCID) in the CR3 value (although not shown in fig. 2, all the CR3 values in fig. 2 carry the same PCID). A CPU core bitmap is also attached to each PTE in the global page table to track all CPU cores that may cache this PTE in their TLBs. Each CPU core loads the address of the per-thread page table into its CR3 register when scheduling a thread. When a TLB miss (TLB miss) is encountered, the Memory Management Unit (MMU) in the CPU core can still walk the per-thread page table because it follows the same page table structure. In addition, the meanings of PGD, PUD, PMD, etc. in the global page table follow from the conventional x86/x86-64 page table structure and are not repeated in the present disclosure.
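A hedged C sketch of the bookkeeping this structure implies is shown below. The layout is an assumption: in the disclosure the forward pointer is encoded inside the global PTE itself using a customizable bit, whereas this sketch keeps it as a separate field for clarity.

    #include <stdbool.h>
    #include <stdint.h>

    #define NR_CPUS 64
    typedef uint64_t pte_t;

    /* Leaf entry of the global page table: the PTE value (or a forward
     * pointer into a per-thread page table) plus the CPU core bitmap that
     * tracks every core possibly caching this entry in its TLB. */
    struct global_pte {
        pte_t    val;          /* real PTE value, meaningful when fwd == NULL */
        pte_t   *fwd;          /* forward pointer to a per-thread PTE, or NULL */
        uint64_t core_bitmap;  /* bit i set => core i may cache this entry */
    };

    /* Per-core bookkeeping: which page table CR3 points at, and the flag
     * saying whether the thread currently on the core is a mutator thread. */
    struct core_state {
        uint64_t cr3;      /* address of a per-thread or the global page table */
        bool     mutator;  /* 1 = mutator thread, 0 = non-mutator thread */
    };

    static struct core_state cores[NR_CPUS];

    /* An entry is private when no core besides the owner appears in the bitmap. */
    static inline bool pte_private(const struct global_pte *g, int owner) {
        return (g->core_bitmap & ~(1ULL << owner)) == 0;
    }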
Page fault (Page Fault) handling is described below with reference to fig. 2.
For a CPU core whose CR3 is loaded with a per-thread page table, a page fault falls into two cases: (1) the corresponding PTE is missing from the per-thread page table but can be resolved from the global page table; (2) the PTE is missing from both.
If the corresponding PTE is valid in the global page table, which is case (1), the page fault handler marks the PTE as shared by setting the corresponding bit in the core bitmap of the global page table's PTE, and then fills the PTE value into the per-thread page table to complete the fault handling. See the absence of PTE 204 in the per-thread page table in FIG. 2; filling the per-thread page table with the value of PTE 203 in the global page table is indicated by the dashed double-dotted line in the legend.
If the corresponding PTE is present in neither the per-thread page table nor the global page table, this is a true page fault, which is case (2). For case (2), the page fault handler first follows the original page fault handling logic in the operating system kernel to obtain a valid PTE. We then lock the per-thread page table and the global page table, fill this PTE only into the per-thread page table, mark the global PTE 202 with a special forward pointer to the per-thread PTE 201 (indicated by the dotted arrow in fig. 2, see legend), and finally unlock the page tables. The page is now in the "private" state, so the IPI broadcast may be skipped when performing a TLB shootdown.
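Putting the two cases together, a hedged C sketch of the per-thread-side fault handler might look as follows; per_thread_slot, global_entry, kernel_resolve_fault, and the table locks are assumed helpers, and the forward pointer is a separate struct field rather than a bit-tagged PTE as in the disclosure.

    #include <stdint.h>

    typedef uint64_t pte_t;
    #define PTE_PRESENT (1ULL << 0)

    struct global_pte { pte_t val; pte_t *fwd; uint64_t core_bitmap; };

    pte_t *per_thread_slot(uint64_t va);      /* slot in the current per-thread table */
    struct global_pte *global_entry(uint64_t va);
    pte_t  kernel_resolve_fault(uint64_t va); /* original kernel fault-handling logic */
    void   lock_tables(void);
    void   unlock_tables(void);
    int    this_cpu(void);

    void per_thread_page_fault(uint64_t va) {
        pte_t *lpte = per_thread_slot(va);
        struct global_pte *g = global_entry(va);

        if (g->fwd == NULL && (g->val & PTE_PRESENT)) {
            /* Case (1): resolvable from the global table. Mark the entry
             * shared in the core bitmap, then copy it down. */
            g->core_bitmap |= 1ULL << this_cpu();
            *lpte = g->val;
            return;
        }
        /* Case (2): missing in both tables. Resolve through the kernel's
         * original fault path, fill only the per-thread table, and leave a
         * forward pointer in the global table: the page starts out private. */
        pte_t val = kernel_resolve_fault(va);
        lock_tables();
        *lpte = val;
        g->fwd = lpte;
        unlock_tables();
    }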
For a CPU core whose CR3 is loaded with the global page table, a page fault also falls into two cases: (1) the PTE of the global page table is a forward pointer to an entry in a per-thread page table, and the PTE is invalid; (2) the PTE does not exist in the global page table.
In case (1), where the PTE is a forward pointer, the operating system should lock the global page table and the per-thread page table pointed to by the forward pointer, copy the PTE value into the global page table, and mark this PTE as shared in the core bitmap.
Case (2) is handled in the same manner as case (2) for a CPU core whose CR3 is loaded with a per-thread page table. That is, if the PTE does not exist in the global page table, the global page table may be locked, the page fault resolved by populating the PTE, and the table then unlocked.
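The global-side counterpart can be sketched the same way, under the same assumed helpers and struct layout as the previous sketch:

    #include <stdint.h>

    typedef uint64_t pte_t;
    struct global_pte { pte_t val; pte_t *fwd; uint64_t core_bitmap; };

    struct global_pte *global_entry(uint64_t va);
    pte_t  kernel_resolve_fault(uint64_t va);
    void   lock_tables(void);
    void   unlock_tables(void);
    int    this_cpu(void);

    void global_page_fault(uint64_t va) {
        struct global_pte *g = global_entry(va);

        if (g->fwd != NULL) {
            /* Case (1): the entry is a forward pointer. Copy the per-thread
             * value up into the global table and mark the entry shared. */
            lock_tables();
            g->val = *g->fwd;
            g->fwd = NULL;
            g->core_bitmap |= 1ULL << this_cpu();
            unlock_tables();
            return;
        }
        /* Case (2): no PTE anywhere. Resolve and populate the global table. */
        lock_tables();
        g->val = kernel_resolve_fault(va);
        unlock_tables();
    }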
TLB miss handling is described below with reference to FIG. 2. For a mutator thread, when the CPU core on which it runs encounters a TLB miss, the MMU of that core first walks the per-thread page table of the current mutator thread. If the PTE is valid, the MMU resolves the TLB miss by caching the PTE. If the per-thread PTE does not exist, the MMU triggers a page fault, which is then handled by the page fault handler. Non-mutator threads use the global page table directly, so the MMU walks the global page table on a TLB miss. If the required PTE is not present, or the PTE is a forward pointer to an entry in a per-thread page table, a page fault is generated and handled by the page fault handler.
The TLB shootdown process is described below with reference to fig. 2. When an initiator core (initiator core) of the CPU wants to shoot down a PTE, the operating system kernel first checks the shared state of the global page table's PTE in the global page table. If the PTE in the global page table is private, the initiator core shoots the PTE down from its own TLB and skips the IPI broadcast; if it is shared, the initiator core sends IPI requests only to the cores recorded in the core bitmap, and those cores schedule non-mutator threads. In either case, a significant amount of IPI overhead is saved.
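In sketch form (same assumptions as the earlier C fragments; treating "forward pointer present" as the private state follows the design above, and the IPI primitives are stubs):

    #include <stdint.h>

    #define NR_CPUS 64
    typedef uint64_t pte_t;
    struct global_pte { pte_t val; pte_t *fwd; uint64_t core_bitmap; };

    void local_tlb_invalidate(uint64_t va);
    void ipi_send(int cpu, uint64_t va);
    void ipi_wait_ack(int cpu);
    int  this_cpu(void);

    void tlb_shootdown(struct global_pte *g, uint64_t va) {
        local_tlb_invalidate(va);      /* always shoot down on the initiator */
        if (g->fwd != NULL)
            return;                    /* private page: skip the IPI broadcast */
        uint64_t remote = g->core_bitmap & ~(1ULL << this_cpu());
        for (int cpu = 0; cpu < NR_CPUS; cpu++)  /* IPI only the bitmap cores, */
            if (remote & (1ULL << cpu))          /* which run non-mutator threads */
                ipi_send(cpu, va);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (remote & (1ULL << cpu))
                ipi_wait_ack(cpu);
    }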
Page mapping changes are described below with reference to fig. 2. When the initiator core wants to change a PTE, it needs to keep its per-thread page table and the global page table consistent. It first locks its per-thread page table and the global page table and checks the shared state of the global page table's PTE. If the PTE is private, the core only needs to change the PTE directly in its per-thread page table; the forward pointer in the global page table remains unchanged during this process and need not be updated. If the PTE is shared, the initiator core needs to invalidate the PTEs in the per-thread page tables of the other cores that share the PTE. For per-thread page tables whose threads are not currently scheduled, the modification may be deferred until they are scheduled.
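A corresponding sketch (same assumptions; invalidate_shared_copies is an assumed helper covering the deferred invalidation of unscheduled threads' tables):

    #include <stdint.h>

    typedef uint64_t pte_t;
    struct global_pte { pte_t val; pte_t *fwd; uint64_t core_bitmap; };

    void lock_tables(void);
    void unlock_tables(void);
    /* Invalidate the copies of this mapping in the per-thread page tables of
     * the other sharing cores, deferring tables whose threads are unscheduled. */
    void invalidate_shared_copies(struct global_pte *g, uint64_t va);

    void change_mapping(struct global_pte *g, uint64_t va, pte_t newval) {
        lock_tables();
        if (g->fwd != NULL) {
            /* Private: change only the per-thread entry; the forward pointer
             * in the global table stays valid and needs no update. */
            *g->fwd = newval;
        } else {
            /* Shared: invalidate the other cores' copies, then update the
             * global value. */
            invalidate_shared_copies(g, va);
            g->val = newval;
        }
        unlock_tables();
    }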
Thread scheduling is described below with reference to fig. 2. The operating system maintains the per-core mutator thread flag whenever a Go thread is scheduled, as shown in fig. 2. Each time the operating system starts a mutator thread, it assigns a newly allocated per-thread page table to the mutator thread and sets the page table address into the CR3 register of the scheduling core. Later, when the operating system deschedules the thread, it first locks the global page table, scans the per-thread page table, and evicts all private PTEs in the per-thread page table into the global page table by replacing the forward pointers in the global page table with real PTE values. We use one of the customizable bits in the PTE to mark whether a PTE in the global page table is a forward pointer or a true PTE value. Note that a PTE in the global page table may not be in use by any CPU core, but this does not affect correctness; for shared PTEs, the operating system will later clear the corresponding bits in their CPU core bitmaps, because the core bitmap accurately tracks all cores referencing the PTE. At this point page table consistency is restored. The operating system then flushes all TLB entries associated with the descheduled thread from the core on which the thread ran and unlocks the global page table. Non-mutator threads always use the global page table when scheduled; likewise, the operating system flushes all TLB entries from the core where such a thread ran, without having to change the global page table.
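Finally, a hedged sketch of the deschedule path; the single-level table, the global_table array, and the helper names are simplifying assumptions:

    #include <stdint.h>

    #define PT_ENTRIES 512
    typedef uint64_t pte_t;
    struct global_pte { pte_t val; pte_t *fwd; uint64_t core_bitmap; };

    struct per_thread_pt { pte_t e[PT_ENTRIES]; };
    static struct global_pte global_table[PT_ENTRIES];

    void lock_tables(void);
    void unlock_tables(void);
    void flush_core_tlb(int cpu);  /* drop the departing thread's TLB entries */

    void deschedule_mutator(struct per_thread_pt *pt, int cpu) {
        lock_tables();
        for (int i = 0; i < PT_ENTRIES; i++) {
            struct global_pte *g = &global_table[i];
            if (g->fwd == &pt->e[i]) {
                g->val = pt->e[i];  /* fold the private value back into the global table */
                g->fwd = NULL;
            } else {
                g->core_bitmap &= ~(1ULL << cpu);  /* shared: clear this core's bit */
            }
        }
        flush_core_tlb(cpu);
        unlock_tables();
    }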
In one embodiment of the present disclosure, step S104 includes:
traversing, with the memory management unit of a CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure when a translation lookaside buffer miss occurs after a flush of the translation lookaside buffer;
and performing shootdown processing on the translation lookaside buffers of the CPU cores scheduling the threads involved in the page fault handling, based on the traversal result of the memory management unit in the CPU core and on the per-thread page tables and the global page table involved in the page fault handling.
In one embodiment of the present disclosure, performing a shootdown on a translation lookaside buffer (TLB shootdown) may refer to invalidating a PTE in the TLB, and may also refer to flushing the TLB.
According to the technical solution provided by this embodiment of the present disclosure, performing shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing and according to the detected shared state of the page table entry of the global page table in the global page table, includes: traversing, by a memory management unit in the CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure, based on a translation lookaside buffer miss occurring when the translation lookaside buffer is flushed; and performing shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the traversal result of the memory management unit in the CPU core and on the per-thread page table and the global page table involved in the page fault processing. This solves the scalability problem of TLB shootdown in page swap-out operations by using per-thread page tables and reduces the computational overhead of the page swap-out process.
In one embodiment of the present disclosure, the page fault processing includes at least one of the following (a hedged sketch of the four cases follows the list):
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from the per-thread page table, causing the per-thread page table to acquire the absent page table entry from the global page table;
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from both the per-thread page table and the global page table, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table;
for a CPU core loaded with the global page table, in the case that the page fault is that a specific forward pointer pointing to a page table entry of the per-thread page table is invalid, copying the value of that page table entry into the global page table through the operating system kernel, and marking the page table entry as shared in the CPU core bitmap of the page table entry;
for a CPU core loaded with the global page table, in the case that the page fault is that the page table entry does not exist, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table.
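The following Go sketch dispatches over the four cases above. The fault kinds, the Tables layout, and the obtainValidPTE helper are all invented for illustration; case 4 follows the embodiment's wording in filling the per-thread page table.

```go
package main

import "fmt"

// Invented fault classification covering the four cases in the list above.
type FaultKind int

const (
	MissInPerThread FaultKind = iota // per-thread core: entry absent locally
	MissEverywhere                   // per-thread core: absent locally and globally
	ForwardInvalid                   // global core: invalid forward pointer
	MissInGlobal                     // global core: entry absent
)

type Tables struct {
	PerThread map[uint64]uint64
	Global    map[uint64]uint64
	Shared    map[uint64]bool // stands in for the CPU core bitmap
}

const fwdBit = uint64(1) << 62 // assumed software-available marker bit

// obtainValidPTE is a placeholder for the kernel allocating a valid PTE.
func obtainValidPTE(vpn uint64) uint64 { return 0x1000 + vpn }

func handleFault(t *Tables, kind FaultKind, vpn uint64) {
	switch kind {
	case MissInPerThread: // case 1: refill from the global page table
		t.PerThread[vpn] = t.Global[vpn]
	case MissEverywhere: // case 2: new PTE into per-thread, forward pointer into global
		t.PerThread[vpn] = obtainValidPTE(vpn)
		t.Global[vpn] = vpn | fwdBit
	case ForwardInvalid: // case 3: copy the real value into global, mark shared
		t.Global[vpn] = t.PerThread[vpn]
		t.Shared[vpn] = true
	case MissInGlobal: // case 4: as case 2, per the embodiment's wording
		t.PerThread[vpn] = obtainValidPTE(vpn)
		t.Global[vpn] = vpn | fwdBit
	}
}

func main() {
	t := &Tables{
		PerThread: map[uint64]uint64{},
		Global:    map[uint64]uint64{0x10: 0xAA},
		Shared:    map[uint64]bool{},
	}
	handleFault(t, MissInPerThread, 0x10)
	fmt.Printf("per-thread now holds %#x\n", t.PerThread[0x10])
}
```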
According to the technical solution provided by this embodiment of the present disclosure, the page fault processing includes at least one of the following: for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from the per-thread page table, causing the per-thread page table to acquire the absent page table entry from the global page table; for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from both the per-thread page table and the global page table, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table; for a CPU core loaded with the global page table, in the case that the page fault is that a specific forward pointer pointing to a page table entry of the per-thread page table is invalid, copying the value of that page table entry into the global page table through the operating system kernel, and marking the page table entry as shared in the CPU core bitmap of the page table entry; and for a CPU core loaded with the global page table, in the case that the page fault is that the page table entry does not exist, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table. The scalability problem of TLB shootdown in page swap-out operations can thereby be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
In one embodiment of the present disclosure, step S104 includes:
in accordance with detecting that the page table entry of the global page table is in a private state in the global page table, shooting down, by the CPU core scheduling the thread involved in the page fault processing, the page table entry from its own translation lookaside buffer and skipping the inter-processor interrupt broadcast; or
in accordance with detecting that the page table entry of the global page table is in a shared state in the global page table, sending, by the CPU core scheduling the thread involved in the page fault processing, inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, where the CPU core scheduling the thread involved in the page fault processing and the CPU cores recorded in the CPU core bitmap schedule a second type of thread other than the first type of thread.
In one embodiment of the present disclosure, the first type of thread may refer to a mutator thread, and the second type of thread may refer to a thread other than a mutator thread, for example, a GC thread.
According to the technical solution provided by this embodiment of the present disclosure, performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing based on the per-thread page table and the global page table involved in the page fault processing includes: in accordance with detecting that the page table entry of the global page table is in a private state in the global page table, shooting down, by the CPU core scheduling the thread involved in the page fault processing, the page table entry from its own translation lookaside buffer and skipping the inter-processor interrupt broadcast; or in accordance with detecting that the page table entry of the global page table is in a shared state in the global page table, sending, by the CPU core scheduling the thread involved in the page fault processing, inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, where the CPU core scheduling the thread involved in the page fault processing and the CPU cores recorded in the CPU core bitmap schedule a second type of thread other than the first type of thread. The scalability problem of TLB shootdown in page swap-out operations can thereby be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
A scenario in which a swap-out thread is exported as a special Goroutine in the page swapping method is described below with reference to FIG. 3.
Fig. 3 is a schematic diagram illustrating a scenario in a page swapping method according to an embodiment of the present disclosure, in which a swap-out thread is exported as a special Goroutine and scheduled by the Go runtime together with user Goroutines.
In the embodiments of the present disclosure, Goroutine is used as the example of a lightweight thread.
Exporting a swap-out thread (one of the operating system threads in kernel space) as a special Goroutine (the swap-out Goroutines T0 and T1 shown in FIG. 3) is described below with reference to FIG. 3. As shown in FIG. 3, the swap-out Goroutines are scheduled by the user-space Go runtime together with user Goroutines.
In the embodiment shown in FIG. 3, a swap-out thread is exported as a Goroutine using the operating system kernel and the Go runtime, so the swap-out operation may be performed asynchronously. Exporting swap-out as Goroutines can benefit applications in three ways. First, asynchronous swap-out may release cold pages ahead of demand swap-in, thereby reducing demand swap-in latency and thus overall latency. Second, the Go runtime can schedule swap-out Goroutines when user Goroutines cannot fully utilize the CPU resources, improving overall CPU utilization and performance. Third, moving swap-out off the critical path of demand page fault handling means that the Go runtime can overlap swap-out work with user computation, further improving CPU utilization and NVM bandwidth utilization.
For example, at the kernel level, a new system call, "user_swapout", may be implemented to export a swap-out thread to the Go runtime in user space. The logic of "user_swapout" is similar to the existing swap-out function in the kernel, but differs in two main ways. First, given that Goroutines are co-scheduled, several "safe points" need to be added in "user_swapout", at which the system call saves its state and returns to user level. Thus, the Go runtime can safely suspend a swap-out Goroutine and reassign the CPU to other Goroutines. We divide swapping out a page into five steps: (1) scanning the kernel active/inactive lists and selecting a batch of pages to be swapped out; (2) finding the PTEs referencing these victim physical pages through the reverse map; (3) allocating swap entries and inserting them into the swap cache; (4) unmapping the pages and shooting down the TLBs; (5) writing the page contents to NVM. A safe point may be inserted at the end of each step.
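A hedged sketch of this five-step structure with a safe point after each step is shown below; the step bodies are stubs, and yieldAtSafePoint stands in for saving state and returning to user level so the Go runtime can reschedule.

```go
package main

import "fmt"

// swapOutStep is a placeholder for the real kernel work of one step.
type swapOutStep func()

// yieldAtSafePoint models a safe point: the system call would save its
// state here and return to user level so the CPU can be reassigned.
func yieldAtSafePoint(step int) {
	fmt.Printf("safe point after step %d: state saved, CPU yielded\n", step)
}

// userSwapOut models the structure of the hypothetical "user_swapout"
// system call described above: five steps, each followed by a safe point.
func userSwapOut() {
	steps := []swapOutStep{
		func() { /* (1) scan active/inactive lists, pick victim pages */ },
		func() { /* (2) reverse-map lookup of PTEs referencing the victims */ },
		func() { /* (3) allocate swap entries, insert into the swap cache */ },
		func() { /* (4) unmap the pages and shoot down TLB entries */ },
		func() { /* (5) write the page contents out to NVM */ },
	}
	for i, step := range steps {
		step()
		yieldAtSafePoint(i + 1)
	}
}

func main() { userSwapOut() }
```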
To correctly resume a swap-out when a swap-out Goroutine is rescheduled, Goroutine state information, such as the stack and other auxiliary state, needs to be saved at each safe point. Since swap-out is performed at the kernel level, the kernel stack and data cannot be stored in the user-level Go runtime. Instead, when the Go runtime starts, a set of memory regions is allocated in kernel space to store swap-out state information. A swap-out Goroutine may allocate a region from this set and save its execution state into it (see the state array shown in FIG. 3). At each safe point, the "user_swapout" system call returns an ID to the Go runtime, which is an index into the region storing its state. The Go runtime simply remembers this ID for the swap-out Goroutine and passes it back to the "user_swapout" system call the next time the swap-out Goroutine is rescheduled. The system call can follow the ID to retrieve the previous state and resume execution.
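The state array and ID handshake can be modeled as below; the slot-based stateArray type and its method names are assumptions made for this illustration.

```go
package main

import "fmt"

// swapState is a stand-in for the saved execution state of a suspended
// swap-out Goroutine; real code would also hold the kernel stack and
// other auxiliary state.
type swapState struct {
	step int // which of the five steps to resume from
}

// stateArray models the kernel-space memory regions allocated at runtime
// startup; the slot index doubles as the ID handed to the Go runtime.
type stateArray struct {
	slots []*swapState
}

// save stores the state in a free slot and returns its ID (the index).
func (a *stateArray) save(s *swapState) int {
	for i, slot := range a.slots {
		if slot == nil {
			a.slots[i] = s
			return i
		}
	}
	a.slots = append(a.slots, s)
	return len(a.slots) - 1
}

// resume retrieves and frees the slot named by id.
func (a *stateArray) resume(id int) *swapState {
	s := a.slots[id]
	a.slots[id] = nil
	return s
}

func main() {
	a := &stateArray{}
	id := a.save(&swapState{step: 3}) // suspended after step 3
	fmt.Println("ID handed to the Go runtime:", id)
	fmt.Println("resume from step:", a.resume(id).step+1)
}
```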
At the user level, the Go runtime may be modified to co-schedule swap-out Goroutines and user Goroutines. To prevent swap-out Goroutines from consuming too many CPU resources and starving user Goroutines, a swap-out Goroutine may be assigned a lower priority than ordinary user Goroutines and scheduled only when no runnable user Goroutines are available. The scheduling policy is also customizable: users can adjust the policy or implement their own to achieve higher throughput or lower latency.
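A minimal sketch of this two-level priority policy follows, assuming a simple FIFO queue per priority level; the real Go runtime scheduler is far more elaborate, and the scheduler type here is invented for illustration.

```go
package main

import "fmt"

type task struct{ name string }

// scheduler holds two queues: user Goroutines at normal priority and
// swap-out Goroutines at a strictly lower priority.
type scheduler struct {
	userQ    []task
	swapOutQ []task
}

// next picks the next task: swap-out work runs only when no user
// Goroutine is runnable, so it cannot starve user work.
func (s *scheduler) next() (task, bool) {
	if len(s.userQ) > 0 {
		t := s.userQ[0]
		s.userQ = s.userQ[1:]
		return t, true
	}
	if len(s.swapOutQ) > 0 { // reached only when the user queue is empty
		t := s.swapOutQ[0]
		s.swapOutQ = s.swapOutQ[1:]
		return t, true
	}
	return task{}, false
}

func main() {
	s := &scheduler{
		userQ:    []task{{"user G1"}},
		swapOutQ: []task{{"swap-out T0"}, {"swap-out T1"}},
	}
	for t, ok := s.next(); ok; t, ok = s.next() {
		fmt.Println("run:", t.name) // user G1 first, then the swap-out tasks
	}
}
```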
In one embodiment of the present disclosure, the page swapping method further includes:
exporting a swap-out thread executed in page fault processing as a swap-out lightweight thread by using the operating system kernel and the runtime of the lightweight concurrent programming language, and scheduling, by the runtime, the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime, so that the swap-out lightweight thread executes asynchronously.
According to the technical solution provided by this embodiment of the present disclosure, the swap-out thread executed in page fault processing is exported as a swap-out lightweight thread by using the operating system kernel and the runtime of the lightweight concurrent programming language, and the runtime schedules the swap-out lightweight thread together with the user lightweight threads it manages and schedules, so that the swap-out lightweight thread executes asynchronously. The swap-out thread can thereby be decoupled into swap-out lightweight threads that are co-scheduled with ordinary user lightweight threads to perform the swap-out operation asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
In one embodiment of the present disclosure, the page swapping method further includes:
inserting safe points for the plurality of steps performed by the swap-out lightweight thread;
allocating, when the runtime starts, a set of memory regions in the operating system kernel to store swap-out state information;
allocating, by the swap-out lightweight thread, a sub memory region from the set of memory regions and saving the execution state of the swap-out lightweight thread into the sub memory region;
returning, at each safe point, an identifier serving as an index of the sub memory region to the runtime through the operating system kernel;
and passing the identifier back to the operating system kernel when the runtime reschedules the swap-out lightweight thread.
According to the technical solution provided by this embodiment of the present disclosure, safe points are inserted for the plurality of steps performed by the swap-out lightweight thread; a set of memory regions is allocated in the operating system kernel when the runtime starts to store swap-out state information; the swap-out lightweight thread allocates a sub memory region from the set of memory regions and saves its execution state into the sub memory region; an identifier serving as an index of the sub memory region is returned to the runtime through the operating system kernel at each safe point; and the identifier is passed back to the operating system kernel when the runtime reschedules the swap-out lightweight thread. The swap-out thread can thereby be decoupled into swap-out lightweight threads that are co-scheduled with ordinary user lightweight threads to perform the swap-out operation asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
In an embodiment of the present disclosure, the scheduling, by the runtime, of the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime comprises:
assigning, by the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads.
According to the technical solution provided by this embodiment of the present disclosure, scheduling, by the runtime, the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime includes: assigning, by the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads. The swap-out thread can thereby be decoupled into swap-out lightweight threads that are co-scheduled with ordinary user lightweight threads to perform the swap-out operation asynchronously, reducing the swap-out overhead of the programming language with lightweight concurrency.
In embodiments of the present disclosure, per-thread page tables are suitable for the x86/x86-64 architecture, whose TLBs are managed by hardware. Embodiments of the present disclosure are directed to applications written in Go that run on top of a hybrid storage architecture, where the NVM can be used as a swap device.
In the page swapping method described in the present disclosure, the swap-out overhead of the hybrid storage architecture can be reduced by: (1) using per-thread page tables to track page mapping caching and reduce the overhead of TLB shootdown operations; (2) treating swap-out as lightweight threads scheduled by the runtime of the programming language with garbage collection and lightweight concurrency, via the operating system kernel and that runtime, so that swap-out is performed asynchronously.
A page swapping apparatus according to an embodiment of the present disclosure is described below with reference to FIG. 4. Fig. 4 shows a block diagram of a page swapping apparatus 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the page swapping apparatus 400 includes a per-thread page table allocation module 401, a global page table allocation module 402, a page table entry detection module 403, and a translation lookaside buffer shootdown processing module 404.
The per-thread page table allocation module 401 is configured to allocate a per-thread page table for a first type of thread to record a page mapping of pages locally accessed by the first type of thread.
The global page table allocation module 402 is configured to allocate a global page table for a process to which the first type of thread belongs to record a page mapping of pages accessed by the threads in the process.
The page table entry detection module 403 is configured to detect, in the case of performing page fault processing during page swapping, the shared state of a page table entry of the global page table in the global page table, wherein a CPU core bitmap is appended to the page table entry in the global page table to track all CPU cores caching the page table entry in a translation lookaside buffer, thereby determining the shared state of the page table entry.
The translation lookaside buffer shootdown processing module 404 is configured to perform, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing.
According to the technical solution provided by this embodiment of the present disclosure, the per-thread page table allocation module is configured to allocate a per-thread page table for a first type of thread to record the page mapping of pages locally accessed by the first type of thread; the global page table allocation module is configured to allocate a global page table for the process to which the first type of thread belongs to record the page mapping of pages accessed by the threads in the process; the page table entry detection module is configured to detect, when page swapping is performed, the shared state of a page table entry of the global page table in the global page table, wherein a CPU core bitmap is appended to the page table entry in the global page table to track all CPU cores caching the page table entry in a translation lookaside buffer, thereby determining the shared state of the page table entry; and the translation lookaside buffer shootdown processing module is configured to perform, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing. The scalability problem of TLB shootdown in page swap-out operations can thereby be solved by using per-thread page tables, and the computational overhead of the page swap-out process can be reduced.
It will be understood by those skilled in the art that the technical solution described with reference to fig. 4 may be combined with the embodiment described with reference to fig. 1 to 3, so as to have the technical effects achieved by the embodiment described with reference to fig. 1 to 3. For details, reference may be made to the description made above with reference to fig. 1 to 3, and details thereof are not repeated herein.
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
The disclosed embodiment also provides an electronic device, as shown in FIG. 5, including at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; wherein the memory 502 stores instructions executable by the at least one processor 501, the instructions being executed by the at least one processor 501 to implement the following steps:
allocating a per-thread page table for a first type of thread to record the page mapping of pages locally accessed by the first type of thread;
allocating a global page table for the process to which the first type of thread belongs to record the page mapping of pages accessed by the threads in the process;
in the case of performing page fault processing during page swapping, detecting the shared state of a page table entry of the global page table in the global page table, wherein a CPU core bitmap is attached to the page table entry in the global page table to track all CPU cores caching the page table entry in a translation lookaside buffer, thereby determining the shared state of the page table entry;
and performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing.
In an embodiment of the present disclosure, the performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing based on the per-thread page table and the global page table involved in the page fault processing includes:
traversing, by a memory management unit in the CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure, based on a translation lookaside buffer miss occurring when the translation lookaside buffer is flushed;
and performing shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the traversal result of the memory management unit in the CPU core and on the per-thread page table and the global page table involved in the page fault processing.
In one embodiment of the present disclosure, the page fault processing includes at least one of:
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from the per-thread page table, causing the per-thread page table to acquire the absent page table entry from the global page table;
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from both the per-thread page table and the global page table, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table;
for a CPU core loaded with the global page table, in the case that the page fault is that a specific forward pointer pointing to a page table entry of the per-thread page table is invalid, copying the value of that page table entry into the global page table through the operating system kernel, and marking the page table entry as shared in the CPU core bitmap of the page table entry;
for a CPU core loaded with the global page table, in the case that the page fault is that the page table entry does not exist, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table.
In an embodiment of the present disclosure, the performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing based on the per-thread page table and the global page table involved in the page fault processing includes:
in accordance with detecting that the page table entry of the global page table is in a private state in the global page table, shooting down, by the CPU core scheduling the thread involved in the page fault processing, the page table entry from its own translation lookaside buffer and skipping the inter-processor interrupt broadcast; or
in accordance with detecting that the page table entry of the global page table is in a shared state in the global page table, sending, by the CPU core scheduling the thread involved in the page fault processing, inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, where the CPU core scheduling the thread involved in the page fault processing and the CPU cores recorded in the CPU core bitmap schedule a second type of thread other than the first type of thread.
In one embodiment of the present disclosure, the memory 502 also stores instructions executable by the at least one processor 501, the instructions being executed by the at least one processor 501 to implement the steps of:
exporting a swap-out thread executed in page fault processing as a swap-out lightweight thread by using the operating system kernel and the runtime of the lightweight concurrent programming language, and scheduling, by the runtime, the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime, so that the swap-out lightweight thread executes asynchronously.
In one embodiment of the present disclosure, the memory 502 also stores instructions executable by the at least one processor 501, the instructions being executed by the at least one processor 501 to implement the steps of:
inserting safe points for the plurality of steps performed by the swap-out lightweight thread;
allocating, when the runtime starts, a set of memory regions in the operating system kernel to store swap-out state information;
allocating, by the swap-out lightweight thread, a sub memory region from the set of memory regions and saving the execution state of the swap-out lightweight thread into the sub memory region;
returning, at each safe point, an identifier serving as an index of the sub memory region to the runtime through the operating system kernel;
and passing the identifier back to the operating system kernel when the runtime reschedules the swap-out lightweight thread.
In an embodiment of the present disclosure, the scheduling, by the runtime, of the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime comprises:
assigning, by the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads.
In one embodiment of the present disclosure, page swapping is performed in a hybrid memory architecture that includes at least a dynamic random access memory (DRAM) and a non-volatile memory (NVM).
FIG. 6 is a block diagram of a computer system suitable for implementing methods according to embodiments of the present disclosure. As shown in FIG. 6, the computer system 600 includes a processing unit 601, which can execute various processes in the embodiments shown in the above-described drawings in accordance with a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The processing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as necessary. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as needed. The processing unit 601 may be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or another processing unit.
In particular, according to embodiments of the present disclosure, the methods described above with reference to the figures may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the methods shown in the figures. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. For example, embodiments of the present disclosure include a readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the methods shown in the figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the node in the above embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept. For example, technical solutions may be formed by interchanging the above features with (but not limited to) features with similar functions disclosed in this disclosure.

Claims (13)

1. A page swapping method, wherein the method comprises:
allocating a per-thread page table for a first type of thread to record the page mapping of pages locally accessed by the first type of thread;
allocating a global page table for the process to which the first type of thread belongs to record the page mapping of pages accessed by the threads in the process;
in the case of performing page fault processing during page swapping, detecting the shared state of a page table entry of the global page table in the global page table, wherein a CPU core bitmap is attached to the page table entry in the global page table to track all CPU cores caching the page table entry in a translation lookaside buffer, thereby determining the shared state of the page table entry;
and performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing.
2. The method of claim 1, wherein the performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing based on the per-thread page table and the global page table involved in the page fault processing comprises:
traversing, by a memory management unit in the CPU core that caches the page table entry in its translation lookaside buffer, a per-thread page table conforming to the same page table structure, based on a translation lookaside buffer miss occurring when the translation lookaside buffer is flushed;
and performing shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the traversal result of the memory management unit in the CPU core and on the per-thread page table and the global page table involved in the page fault processing.
3. The method according to claim 1 or 2, wherein the page fault processing comprises at least one of:
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from the per-thread page table, causing the per-thread page table to acquire the absent page table entry from the global page table;
for a CPU core loaded with a per-thread page table, in the case that the page fault is that the corresponding page table entry is absent from both the per-thread page table and the global page table, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table;
for a CPU core loaded with the global page table, in the case that the page fault is that a specific forward pointer pointing to a page table entry of the per-thread page table is invalid, copying the value of that page table entry into the global page table through the operating system kernel, and marking the page table entry as shared in the CPU core bitmap of the page table entry;
for a CPU core loaded with the global page table, in the case that the page fault is that the page table entry does not exist, obtaining a valid page table entry through the operating system kernel, filling the obtained page table entry into the per-thread page table, and marking the page table entry in the global page table with a specific forward pointer pointing to the page table entry of the per-thread page table.
4. The method according to claim 1 or 2, wherein the performing, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing based on the per-thread page table and the global page table involved in the page fault processing comprises:
in accordance with detecting that the page table entry of the global page table is in a private state in the global page table, shooting down, by the CPU core scheduling the thread involved in the page fault processing, the page table entry from its own translation lookaside buffer and skipping the inter-processor interrupt broadcast; or
in accordance with detecting that the page table entry of the global page table is in a shared state in the global page table, sending, by the CPU core scheduling the thread involved in the page fault processing, inter-processor interrupt requests only to the CPU cores recorded in the CPU core bitmap, where the CPU core scheduling the thread involved in the page fault processing and the CPU cores recorded in the CPU core bitmap schedule a second type of thread other than the first type of thread.
5. The method of claim 1, wherein the first type of thread is created and launched by the runtime of a lightweight concurrent programming language.
6. The method of claim 5, wherein the method further comprises:
exporting a swap-out thread executed in page fault processing as a swap-out lightweight thread by using the operating system kernel and the runtime of the lightweight concurrent programming language, and scheduling, by the runtime, the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime, so that the swap-out lightweight thread executes asynchronously.
7. The method of claim 6, wherein the method further comprises:
inserting safe points for the plurality of steps performed by the swap-out lightweight thread;
allocating, when the runtime starts, a set of memory regions in the operating system kernel to store swap-out state information;
allocating, by the swap-out lightweight thread, a sub memory region from the set of memory regions and saving the execution state of the swap-out lightweight thread into the sub memory region;
returning, at each safe point, an identifier serving as an index of the sub memory region to the runtime through the operating system kernel;
and passing the identifier back to the operating system kernel when the runtime reschedules the swap-out lightweight thread.
8. The method of claim 6 or 7, wherein the scheduling, by the runtime, of the swap-out lightweight thread together with the user lightweight threads managed and scheduled by the runtime comprises:
assigning, by the runtime, a lower priority to the swap-out lightweight thread than to the user lightweight threads.
9. The method according to claim 1 or 2, wherein the page swapping is performed in a hybrid memory architecture that includes at least a dynamic random access memory (DRAM) and a non-volatile memory (NVM).
10. A page exchange apparatus, wherein the apparatus comprises:
a per-thread page table allocation module configured to allocate a per-thread page table for a first type of thread to record the page mapping of pages locally accessed by the first type of thread;
a global page table allocation module configured to allocate a global page table for the process to which the first type of thread belongs to record the page mapping of pages accessed by the threads in the process;
a page table entry detection module configured to detect, when page swapping is performed, the shared state of a page table entry of the global page table in the global page table, wherein a CPU core bitmap is appended to the page table entry in the global page table to track all CPU cores caching the page table entry in a translation lookaside buffer, thereby determining the shared state of the page table entry;
and a translation lookaside buffer shootdown processing module configured to perform, according to the detected shared state of the page table entry of the global page table in the global page table, shootdown processing on the translation lookaside buffer of the CPU core scheduling the thread involved in the page fault processing, based on the per-thread page table and the global page table involved in the page fault processing.
11. An electronic device comprising a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of any of claims 1-9.
12. A readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the method steps of any of claims 1-9.
13. A computer program product comprising computer instructions which, when executed by a processor, carry out the method steps of any of claims 1-9.