CN117827464B

CN117827464B - Memory optimization method and system for software and hardware collaborative design under heterogeneous memory situation

Info

Publication number: CN117827464B
Application number: CN202410239173.2A
Authority: CN
Inventors: 孙广宇; 周哲; 陈奕奇
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2024-03-04
Filing date: 2024-03-04
Publication date: 2024-04-30
Anticipated expiration: 2044-03-04
Also published as: CN117827464A

Abstract

The invention discloses a memory optimization method and a memory optimization system for software and hardware collaborative design under heterogeneous memory situations, wherein a memory access analysis unit is integrated in a memory controller of a hardware device side; realizing memory layering daemon on an operating system end, namely a software end; the memory layering daemon utilizes the information provided by the memory access analysis unit to migrate the hot pages in the memory to the CPU local memory with high speed, thereby realizing memory optimization. The invention is based on the memory layering of the CXL native of the high-speed computing interconnection protocol, and the system computing performance can be greatly improved by adopting the collaborative design of hardware and an operating system.

Description

Memory optimization method and system for software and hardware collaborative design under heterogeneous memory situation

Technical Field

The invention relates to heterogeneous memory optimization technology of a computing system, in particular to a memory optimization method and a memory optimization system for software and hardware collaborative design under heterogeneous memory situations.

Background

With the rapid development of data centers and high performance computing, there is an increasing demand for more efficient, flexible memory management schemes. Among many solutions, the high-speed computing interconnection protocol (CXL) technology stands out for providing a consistent, byte-addressable way of interconnection between a processor (CPU) and an external device. In particular, memory expansion using CXL technology (referred to as CXL memory for short) has been the focus of recent years, which allows servers to easily integrate a variety of memory devices to expand memory capacity and bandwidth without requiring hardware modifications to the CPU side.

The presence of CXLs has significant advantages in terms of expanding server memory capacity and bandwidth, but also suffers from higher access latency. This problem is particularly pronounced when slower memory media such as PCM and ReRAM are integrated with the CPU through CXL. Thus, a CXL-based hierarchical memory system was created that aims to optimize system performance by placing frequently accessed "hot" pages in fast NUMA nodes and "cold" pages in slow nodes. This technique is known as memory layering.

However, existing memory layering techniques encounter significant challenges when applied to CXL-based layered memory systems. These challenges stem primarily from the lack of an efficient, low-overhead memory access analysis method. Unlike RDMA-based memory splitting techniques, CXL memory allows direct CPU access, while the operating system cannot perceive access patterns, so special analysis methods are required to distinguish between "hot" pages and "cold" pages.

The traditional memory access analysis method comprises a software method and a hardware method, and the following defects exist respectively:

(1) Software-only methods include page table tagging-based methods and page fault interrupt-based methods. They all require periodic scanning of the memory address space, which can result in significant CPU computational overhead, as well as low timeliness and accuracy of the memory analysis.

(2) The traditional hardware method is to add a hardware counter on the CPU side to analyze the memory access. But this is related to the specifications defined by CPU manufacturers and is difficult to support across platforms. At the same time, the use of the counter on the CPU side also causes a large performance overhead for the CPU. Reducing the sampling frequency in order to reduce the performance overhead reduces the accuracy of the memory analysis.

In summary, it is difficult to accurately provide memory access information in real time by the conventional memory access analysis method.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a memory optimization method and a memory optimization system for collaborative design of software and hardware under heterogeneous memory situations, which are memory layering solutions based on a high-speed computing interconnection protocol (CXL), adopt collaborative design of hardware and an operating system, design heterogeneous memory optimization based on the high-speed computing interconnection protocol CXL, namely expand memory by using the high-speed computing interconnection protocol (CXL), and optimize memory management by using a method of collaborative design of the operating system and the hardware.

The present invention designs a memory access analysis unit (or referred to as a dedicated hardware analyzer) and integrates the memory access analysis unit in a memory controller (memory device side). The memory access analysis unit can effectively analyze the Last Level Cache (LLC) miss of CXL memory (memory expansion by CXL technology) and provide key information such as page heat, memory bandwidth utilization, read/write ratio, access frequency distribution and the like for an operating system. Meanwhile, at the operating system end, an advanced memory layering strategy is realized, the hot page is effectively lifted by utilizing the insight of a memory access analysis unit, and the hot page can be migrated to a faster local memory, so that the memory optimization is realized.

The technical scheme of the invention is as follows:

The invention provides a memory optimization method for collaborative design of software and hardware under heterogeneous memory situation, which is implemented by integrating a memory access analysis unit in a memory hardware (equipment end) controller; realizing a memory layering daemon (heterogeneous memory optimization system daemon) at an operating system side; and the memory layering daemon utilizes the information provided by the memory access analysis unit to migrate the hot pages in the memory to the CPU local memory with higher speed.

The workflow of the invention comprises the following steps:

1) Integrating a memory access analysis unit in a system hardware (device side) controller, for supporting efficient memory access analysis, comprising: detecting a hot page in a slow memory and monitoring a runtime state;

The invention adopts CXL-based memory layering technology, and system hardware supports any number of memory levels, including a fast memory level and a slow memory level; different memory levels are managed through NUMA (Non-Uniform Memory Access ) interfaces of a Linux operating system;

A memory access analysis unit at the CXL memory side intercepts the memory access address of the CPU to the CXL memory;

The memory access analysis unit records a memory access address, analyzes the state of the equipment side and sends the memory access address into an asynchronous FIFO;

The present invention may be implemented on a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). When in implementation, the invention is realized on the FPGA, and asynchronous first-in first-out queues (FIFO) are used for carrying out the cross-clock domain processing; then, the asynchronous FIFO sends the state and the page address to a memory access analysis unit for recording; the memory access analysis unit is responsible for interacting with the CPU, namely, the host sends an instruction to the memory access analysis unit, and the memory access analysis unit sends data to the host; interaction is realized through memory mapping input/output (MMIO);

By using a schematic algorithm controller in the memory access analysis unit, the address space cost of the memory is greatly reduced; the rough algorithm controller is realized through hardware, page addresses with the accessed times being larger than a certain threshold value are found out and sent to a cache after being repeated; the CPU only needs to read information from the cache to obtain pages frequently accessed in the CXL memory;

Designing a system for analyzing the error of the rough statistical data structure, wherein the system is also positioned in the memory access analysis unit and is used for reading the frequency distribution of the memory page access from the rough algorithm controller; from this distribution, an upper error bound in the outline algorithm controller can be calculated, providing relevant information about hardware errors for the user-mode program.

2) Designing and realizing a heterogeneous memory optimization system daemon interacting with a memory access analysis unit, wherein the heterogeneous memory optimization system daemon is used for collecting running statistics data, carrying out parameterized configuration in a user state and lifting a management hot page, and conforming to a migration strategy appointed in a user space; after the hot page information is analyzed, the heterogeneous memory optimization system daemon can lift the hot page from the CXL memory to the CPU local memory, namely, the hot page is lifted. The hot page analysis requires only device side modifications, i.e. is compatible with any CXL-enabled server platform;

3) Designing a migration strategy to provide memory analysis and migration guidance for the daemon process; migration policies occur in user space, allowing users to customize and adjust;

the page migration strategy is carried out according to the information provided by the hardware through a scheduling algorithm, so that the dynamic adjustment of hot page promotion is realized; the strategy comprises the following steps:

31 The main body of the strategy consists of a continuously running circulation body; the starting and stopping of the loop body can be controlled in the user mode of the operating system;

32 In each cycle, the user mode policy program first reads information about CXL memory access from MMIO interface provided by hardware, including: the page access frequency distribution f of the CXL memory, the bandwidth occupancy rate b of the CXL memory, the hot page lifting accuracy x and the outline error e;

33 Lifting the identified hot page in each cycle; the threshold value of the hot page adopts a percentile p of CXL memory page access frequency, namely, the page with p divided bits before the CXL memory page access frequency is the hot page to be lifted;

34 Calculating the current error through the outline algorithm controller, setting a threshold value of the outline error, and judging whether the current error is larger than the threshold value e_max of the outline error or not; the threshold e_max of the gross error is equal to the access frequency of the CXL memory page for the p-bit.

If so, the percentile p is reduced; otherwise, dynamically adjusting the value of the percentile p according to the bandwidth occupancy b of the CXL memory and the hot page lifting accuracy x.

The performance of load operation can be obviously improved through continuous dynamic adjustment;

35 Memory access analysis unit hardware and its interaction with the operating system is accomplished through the memory mapped I/O interface and the driver of the operating system kernel.

Through the steps, the memory optimization of the software and hardware collaborative design under the heterogeneous memory situation is realized.

In specific implementation, the memory optimization system for the collaborative design of the software and the hardware under the heterogeneous memory situation comprises a memory access analysis unit in memory equipment and a system daemon module in a Linux operating system, wherein the system daemon module interacts with the memory access analysis unit. Wherein: integrating a memory access analysis unit in a memory hardware (equipment end) controller; realizing a memory layering daemon module at an operating system end; the memory access analysis unit records the statistical information of the CPU access memory at the equipment end and provides the statistical information for the memory layering daemon in the operating system; the memory layering daemon in the operating system utilizes the information provided by the memory access analysis unit to migrate the hot pages in the memory to the CPU local memory with higher speed.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a memory optimization method and a memory optimization system for software and hardware collaborative design under heterogeneous memory situations, which are based on memory layering of a high-speed computing interconnection protocol CXL native, and adopt collaborative design of hardware and an operating system. The memory access analysis unit is integrated in the system hardware (equipment side) controller, so that the Last Level Cache (LLC) miss of CXL memory (memory expansion by using CXL technology) can be effectively analyzed, and key information is provided for an operating system. Meanwhile, a memory layering strategy is realized at an operating system end, and the insight of a memory access analysis unit is utilized to effectively thermally promote. The heterogeneous memory system realized based on the technology can improve the full system performance by 28 to 73 percent compared with the traditional heterogeneous memory system realized based on software.

Drawings

FIG. 1 is an exemplary architecture of a memory access analysis unit employed in the practice of the present invention.

FIG. 2 is a block diagram of the architecture employed by the runtime statistics provided by the present invention.

Fig. 3 is a schematic diagram of a hardware structure provided by the present invention.

FIG. 4 is a flow chart of a scheduling algorithm according to the present invention for determining a page migration policy based on information provided by hardware.

Detailed Description

The invention is further described by way of examples in the following with reference to the accompanying drawings, but in no way limit the scope of the invention.

The invention provides a memory optimization method and a heterogeneous memory optimization system for software and hardware collaborative design under heterogeneous memory situations, which comprise the following contents:

system architecture:

the heterogeneous memory optimization system of the present invention supports any number of memory levels: DDR DRAM, which is directly connected to the CPU, serves as the fast memory hierarchy, while memory connected through CXL serves as the slow memory hierarchy.

These different memory levels are managed through the NUMA (Non-Uniform Memory Access ) interface of the Linux operating system.

And (II) memory access analysis:

one of the core functions of the heterogeneous memory optimization system is to support efficient memory access analysis, and the heterogeneous memory optimization system mainly comprises: detecting hot pages in the slow memory and monitoring the runtime state.

Hot page detection: the heterogeneous memory optimization system integrates a memory access analysis unit for high-resolution and low-overhead hot page detection. The memory access analysis unit is located in a device-side controller of the CXL memory, analyzes a memory access request sent through the CXL channel, and generates page heat information and other statistical data.

And (3) state monitoring: in a hierarchical memory system, monitoring the runtime state of memory hardware (such as bandwidth utilization, read/write ratio, and page access frequency distribution) is critical to efficient memory layering. The heterogeneous memory optimization system supports efficient and accurate monitoring of the runtime state of these hardware.

And (III) a heterogeneous memory optimization system daemon:

In the kernel of the Linux operating system, a heterogeneous memory optimization system daemon which interacts with a memory access analysis unit is realized and is responsible for collecting the running statistics data. The daemon follows the general daemon design specification of the Linux operating system kernel, and periodically acquires memory cold and hot data from a memory access analysis unit and transfers hot pages to a fast CPU local memory. The daemon may be configured parameterized in the user state.

The heterogeneous memory optimization system daemon is responsible for managing the promotion of hot pages and following migration policies specified in user space.

(IV) migration strategy:

Based on the rich information provided by the memory access analysis unit, the migration strategy of the heterogeneous memory optimization system provides memory analysis and migration guidance for daemons.

This policy decision occurs in the user space, allowing the user to make customizations and adjustments.

And (V) hardware and software interactions:

The implementation of the memory access analysis unit hardware and interaction with the operating system is accomplished through a memory mapped I/O interface and drivers of the operating system kernel.

In order to detect cold pages, the heterogeneous memory optimization system utilizes the existing LRU 2Q mechanism in the Linux kernel. This is because migration of a cold page requires less timeliness than promotion of a hot page, because even if the cold page resides in memory for a long period of time, the performance of the overall system is not affected as long as there is sufficient space in the memory.

(Six) System optimization and compatibility:

since the memory access analysis unit requires only memory device side modifications based on hot page analysis, the memory access analysis unit is compatible with any CXL-enabled server platform.

Through heterogeneous memory optimization system daemon and user-defined migration strategy, memory use can be flexibly adjusted and optimized, and overall system performance is improved.

The specific implementation of the invention comprises the following steps:

1. The heterogeneous memory optimization system intercepts the memory access address of the CPU to the CXL memory at the memory access analysis unit of the CXL memory side, as shown in figure 3. Fig. 3 shows a hardware structure provided by the present invention. The memory access analysis unit records the access addresses, analyzes the status of the device side, and puts the addresses into an asynchronous first-in-first-out queue (FIFO). Since the prototype of the present invention is implemented on a Field Programmable Gate Array (FPGA), FIFO is required for cross-clock domain processing. But this is not meant to limit the scope of this patent to FPGAs. This patent may also be implemented on an Application Specific Integrated Circuit (ASIC). The asynchronous FIFO then feeds the status and page address to the memory access analysis unit for recording. The memory access analysis unit is responsible for interacting with the CPU, i.e. the CPU may send instructions to the memory access analysis unit, while the memory access analysis unit may send data to the host. This interaction is achieved by MMIO.

2. Fig. 1 shows a memory access analysis unit architecture used in the present invention, and fig. 1 details a schematic algorithm structure (a schematic algorithm controller implemented by hardware) used in a memory access analysis unit. The reason for this structure is that the address space of the memory is very large, the cost of maintaining a counter for each page is very expensive, and the cost of reducing space can be greatly reduced using a diagrammatical algorithm. As shown in fig. 1, the page addresses sent from the asynchronous FIFO to the memory access analysis unit are mapped to the locations of the corresponding caches by several hash functions. The several hash functions are independent of each other. The counter of the cache location mapped by the hash function is incremented, and when the accessed times of a certain page address need to be queried, only the minimum value in the counters of the corresponding cache locations needs to be found. A hardware implemented sketch algorithm controller (search) will find out the page addresses that are accessed more than a certain threshold, and send them to the cache after the time of the re-execution. The outline algorithm controller is implemented in a memory access analysis unit and is used for capturing the access of the CPU to the CXL memory. The host-side CPU need only read information from this cache to know which pages in the CXL memory are accessed frequently.

3. One inherent drawback of the sketch statistical algorithm is its proximity. Since the number of counters used for performing the sketch statistics is far smaller than the number of memory pages to be counted, hash conflicts can occur in the cache data structures corresponding to the hash functions, and thus a situation that the value obtained by the query is inconsistent with the actual value can occur during the query. For this purpose, the heterogeneous memory optimization system designs a system for analyzing the error of the profile statistics structure, as shown in fig. 2. FIG. 2 illustrates the runtime statistics provided by the present invention, which reads the frequency distribution of memory page accesses from the profiling algorithm controller. From this distribution, an upper bound for errors in the profile statistics structure can be calculated, providing relevant information about hardware errors for the user-mode program.

4. FIG. 4 illustrates a flow of a scheduling algorithm (dynamic adjustment strategy for hot page promotion) designed in the present invention to determine a page migration strategy based on information provided by hardware. As shown in fig. 4, the heterogeneous memory optimization system implements a dynamic adjustment strategy for hot page promotion in the user mode. The main body of the strategy consists of a continuously running circulation body, and the starting and stopping of the circulation body are controlled by a user mode to display. In each cycle, the user state policy program will first read the relevant information of CXL memory access from the MMIO interface provided by the hardware, such as the page access frequency distribution f of the CXL memory, the bandwidth occupancy b of the CXL memory, the hot page promotion accuracy x, and the outline error e. The bandwidth occupancy rate b is the proportion of the clock cycle number of the hardware side for reading and writing the memory to the total clock cycle number, and the hot page promotion accuracy is the proportion of the page which is promoted and then demoted into the CXL memory to the total page. The algorithm promotes the identified hot page in each cycle, and the threshold value of the hot page takes a percentile p of the CXL memory page access frequency, i.e., the p-indexed page preceding the CXL memory page access frequency is considered to be the hot page that should be promoted. The algorithm then determines if the current error magnitude is greater than some specified threshold e_max, if so, the algorithm decreases p, otherwise the value of p is dynamically adjusted based on b and x. Specifically, each cycle p is multiplied by (1+b)/(1+x). The performance of load operation can be significantly improved by using such a way to constantly dynamically adjust how aggressive the hot page is to be migrated to the fast local memory (i.e., the threshold p of the hot page).

It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the scope of the invention and the appended claims, including but not limited to: the invention, the adjustment of the super parameter e _ max, etc. is implemented using ASIC specific circuits. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims

1. A memory optimization method for software and hardware collaborative design under heterogeneous memory situation is characterized in that a memory access analysis unit is integrated in a memory controller of a hardware device end; realizing memory layering daemon on an operating system end, namely a software end; the memory layering daemon utilizes the information provided by the memory access analysis unit to migrate the hot pages in the memory to the CPU local memory with high speed, thereby realizing memory optimization; the method comprises the following steps:

1) Integrating a memory access analysis unit in a memory controller of a device side, wherein the memory access analysis unit is used for supporting efficient memory access analysis and comprises: detecting a hot page in a slow memory and monitoring a runtime state;

the memory expansion by using the high-speed computing interconnection protocol CXL technology is CXL memory;

System hardware supports any number of memory levels, including fast memory levels and slow memory levels; different memory levels are managed through non-uniform memory access interfaces of a Linux operating system;

The memory access analysis unit of the CXL memory side intercepts the memory access address of the CPU to the CXL memory;

The rough algorithm controller is realized through hardware, namely a rough algorithm structure is used in a memory access analysis unit, page addresses with the accessed times larger than a set threshold value are found out, and the page addresses are sent to a cache after duplicate removal; the CPU can obtain the frequently accessed pages in the CXL memory only by reading the information from the cache; reading the frequency distribution of memory page access from the sketch algorithm controller; calculating the upper bound of hardware error in the outline algorithm controller according to the frequency distribution;

2) The heterogeneous memory optimization system daemon process interacted with the memory access analysis unit is realized and used for collecting the statistical data during operation, carrying out parameterized configuration in a user state, managing the promotion of a hot page and following a migration strategy appointed in a user space;

3) Designing a migration strategy, wherein the migration strategy occurs in a user space and allows a user to customize and adjust; by designing a scheduling algorithm, page migration strategies are carried out according to information provided by hardware, and dynamic adjustment of hot page promotion is realized; comprising the following steps:

31 A continuously operating loop body; the starting and stopping of the loop body are controlled by a user mode;

32 In each cycle, firstly, information of CXL memory access is read from a memory mapping input-output MMIO interface provided by hardware, including: the page access frequency distribution f of the CXL memory, the bandwidth occupancy rate b of the CXL memory, the hot page lifting accuracy x and the outline error e;

34 Judging whether the current error is larger than a set rough error threshold value or not; if so, p is reduced; otherwise dynamically adjusting the value of p according to b and x; the running performance of the load is improved through continuous dynamic adjustment;

35 The hardware of the memory access analysis unit and the interaction between the hardware and the operating system are completed through the memory mapping I/O interface and the driver of the operating system kernel;

2. The memory optimization method of software and hardware co-design in heterogeneous memory scenario of claim 1, wherein the method is implemented on a field programmable gate array FPGA or an application specific integrated circuit ASIC.

3. The memory optimization method for co-design of software and hardware in heterogeneous memory context as claimed in claim 2, wherein said method specifically uses asynchronous FIFO to perform cross-clock domain processing; then, the asynchronous FIFO sends the state and the page address to a memory access analysis unit for recording;

The memory access analysis unit interacts with the host CPU, namely, the host sends an instruction to the memory access analysis unit, and the memory access analysis unit sends data to the host; the interaction is realized by MMIO.

4. The memory optimization method of software and hardware collaborative design under heterogeneous memory situation according to claim 1, wherein DDR DRAM directly connected with CPU is used as a fast memory level; the memory connected through CXL is used as a slow memory hierarchy.

5. The memory optimization method of software and hardware co-design under heterogeneous memory scenario as claimed in claim 1, wherein the method is characterized in that a cold page is detected by using an LRU 2Q mechanism in a Linux kernel.

6. The memory optimization method of software and hardware collaborative design under heterogeneous memory situation according to claim 1, wherein a general algorithm controller is implemented in a memory access analysis unit through hardware, and is used for capturing access of a CPU to CXL memory, and sending the access to a cache after de-duplication.

7. The memory optimization method of software and hardware collaborative design under heterogeneous memory scenario as set forth in claim 6, wherein the frequency distribution of memory page accesses is read out by a schematic algorithm controller; from this distribution, an upper error bound in the profile statistics structure is further calculated.

8. The memory optimization method of claim 1, wherein the scheduling algorithm is specifically a dynamic adjustment strategy for implementing hot page promotion in a user state, and is used for providing guidance for memory analysis and migration for a daemon of a heterogeneous memory optimization system.

9. A heterogeneous memory optimization system adopting the memory optimization method of the software and hardware collaborative design under the heterogeneous memory situation as set forth in claim 1, which is characterized by comprising a memory access analysis unit in memory equipment and a system daemon module in a Linux operating system which interacts with the memory access analysis unit; wherein: integrating a memory access analysis unit in a memory hardware equipment end controller; realizing a memory layering daemon module at an operating system end;

Recording the statistical information of the access of the CPU to the memory in the memory hardware through the memory access analysis unit, and providing the statistical information to a memory layering daemon in an operating system;

And migrating the hot page in the memory to the CPU local memory by using the information provided by the memory access analysis unit through the memory layering daemon in the operating system.

10. The heterogeneous memory optimization system of claim 9, wherein the system supports any number of memory levels, wherein DDR DRAMs directly connected to the CPU are fast memory levels; the memory connected through the CXL is a slow memory hierarchy.