WO2013044829A1

WO2013044829A1 - Data readahead method and device for non-uniform memory access

Info

Publication number: WO2013044829A1
Application number: PCT/CN2012/082202
Authority: WO
Inventors: 谭玺; 韦竹林; 刘轶; 朴明铉
Original assignee: 华为技术有限公司
Priority date: 2011-09-27
Filing date: 2012-09-27
Publication date: 2013-04-04
Also published as: CN102508638A; CN102508638B

Abstract

A data readahead method and device for non-uniform memory access (NUMN), the method comprising: obtaining a data readahead volume parameter factor r according to a parameter representing the magnetic disk load of a NUMA system and the idle readahead buffer capacity of a node that a process is in; calculating the product (S_size) of the previous readahead window size (R_{prev_size}), the readahead volume maximization multiplier (T_scale) and the data readahead volume parameter factor r; comparing the preset maximum readahead volume (MAX_readahead) with the S_size, and using the smaller value of the MAX_readahead and the S_size as the readahead window size to read the data ahead. The method of the present invention comprehensively considers the factors such as magnetic disk I/O load, the remaining cache size of the node and the like affecting system performance, thus facilitating data I/O hiding and system resource saving.

Description

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to the field of communications, and more particularly to a data prefetching method and apparatus for non-uniform memory access. Background technique

Currently, disks are still the primary storage medium for computer systems. However, with the rapid development of technology, the disk faces its input/output (I/O) bandwidth, which is not comparable to the central processing unit (CPU)/memory development speed and disk access latency and CPU. / Memory two gaps in the growing gap between read and write speeds. Among the CPU speed, disk transfer speed and disk I/O access speed, the disk I/O access speed is the slowest. In particular, the disk I/O access speed and the CPU speed are getting wider and larger. Disk I /O access latency has become one of the most important bottlenecks that constrain system I/O performance. At the operating system level, heterogeneity is a very effective I/O performance optimization strategy. Data prefetching is a common method to achieve I/O heterogeneity.

The so-called data prefetching means that the system performs I/O operations in advance in the background, and loads the required data into the memory in advance to hide the I/O delay of the application, thereby effectively improving the utilization of the computer system. Compared with the traditional serial processing, the data prefetching provides an operation strategy that eliminates the waiting time of the CPU and enables the CPU and the disk to work in parallel, thereby improving the overall I/O performance of the system. Data prefetching adopts the method of pattern matching, gp, to maintain the history access record by monitoring the access sequence of each file, and to match the identified patterns one by one. If the behavior characteristics of an access pattern are met, data prediction and prefetching can be performed accordingly. The specific implementation techniques include heuristic pre-fetching and informed prefetching. Among them, heuristic prefetching is transparent to the upper layer application, and the I/O characteristics are analyzed through the historical access record of the automatic observing program, and the prediction and prefetching are performed independently. The data block that will be accessed. The Lmux kernel version after 2.6.22 provides a heuristic-based on-demand prefetch algorithm that works on the Virtual File System (VFS) layer to uniformly serve various file read operations ( Through the system call API), the pair is independent of the specific file system. The on-demand prefetch algorithm introduces page state and page cache state, and adopts loose sequential decision conditions to provide effective support for sequential I/O operations, including heterogeneous/non-blocking I/O and multi-thread interleaved I/O. , sequential random mixed I / O, large-scale concurrent I / O and other operations. When an application wants to access data, it accesses a disk file via the page cache through the system call interface. The kernel calls the prefetch algorithm on this standard file access path, tracks the application's access sequence, and performs proper prefetching. Specifically, the heuristic-based on-demand prefetch algorithm provided by Lmux mainly determines the access mode of the application by monitoring the read request of the application and the page cache, and then determines the location and size of the prefetch according to the access mode. The prefetching framework can be roughly divided into two parts: a monitoring part and a judging processing part, wherein the monitoring part is embedded in a read request response process such as a do-generic_file_read() function, and detects whether each page in the request has been In the file cache address space, if not, apply for a new page, and the application is temporarily suspended while waiting for I/O to load the page for peer read-ahead. If the offset of the new page is exactly the location pointed to by the pre-read parameter async_size, the page is prefetched (PG_readahead). In the subsequent data prefetching process, when the prefetch tag page (PG_readahead page) is detected, it means that the timing of the next prefetch I/O has arrived, and the system performs the pre-reading. The matching part is located

The ondemand-readahead〇 function is logically composed of a set of independent decision modules that determine whether the file is a read, a small file read, a sequential read, or a random read. The on-demand prefetching framework supports both sequential read and random read access modes, and simple discarding of small file reads is performed without data prefetching.

Both the heuristic-based on-demand prefetch algorithm provided by the Lmux kernel version 2.2.63 and other data prefetching techniques are designed for single-processor systems. Since the single processor system itself is limited by factors such as the computing power, memory capacity and bandwidth of the processor, the corresponding data prefetch design is relatively conservative, especially the prefetch management part: The page size of the initial read request is the benchmark, taking a multiplication strategy with a factor of 2 and setting the upper limit window.

□.

With the continuous improvement of computer performance requirements in scientific computing and transaction processing, Symmetrical Multi-Processing (SMP) systems are becoming more widely used and larger in scale. A non-uniform memory access (NUMA) multiprocessor system is a system of independent nodes connected by a high-speed private network. Each node can be a single CPU or an SMP system. As a distributed shared memory structure, NUMA system combines the advantages of SMP system easy programming and high scalability of distributed storage systems, and has become one of the mainstream architectures of today's high-performance servers.

Multiprocessor systems based on distributed shared memory NUMA architecture are very different from single processor systems in terms of CPU access queue control, memory access control, and node load balancing architecture. Data prefetching for single processor systems The multiprocessor system environment of the NUMA architecture has not been met. If the Lmux system is deployed on a distributed shared memory NUMA architecture server, the prefetch management method provided by the Linux system does not take into account the unique characteristics of the NUMA architecture server, such as CPU load, node remaining memory size, and global remaining memory size. And other influencing factors, therefore, the actual operation effect of this data prefetching for a single processor system cannot be optimal. For example, when multiple processors access files at the same time, if the data is still prefetched according to the amount of prefetched data designed for a single processor system, the disk system may be overloaded; for example, when the node of the NUMA architecture has local residual memory. When there are few, if the data is still pre-fetched according to the amount of pre-fetched data designed for the single-processor system, the data in the local memory of the node is likely to be due to the large delay of accessing the remote memory due to the characteristics of the distributed memory of the NUMA architecture. When it has not been taken (the node that has not been accessed by the remote memory is taken away), the prefetched data further increases the occupation of the remaining memory of the node. Summary of the invention

Embodiments of the present invention provide a data prefetching method and apparatus for non-uniform memory access, Improve the reliability and accuracy of file prefetching under the NUMA architecture.

An embodiment of the present invention provides a data prefetching method for non-uniform memory access, where the method includes:

Obtain the data prefetch parameter parameter based on the parameter representing the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located.

R

Find the product of the size of the previous prefetch window ^e " ^ize , the maximum multiplier of the prefetch amount , ^fe and the data prefetch parameter parameter r. Compare the maximum prefetch readahead and the size of the ^s With the stated

- _rea ^ and the smaller of the values are used to prefetch data as the size of this prefetch window. An embodiment of the present invention provides a data prefetching apparatus for non-uniform memory access, where the data prefetching parameter factor obtaining module is configured to access parameters and processes of a disk load in a NUMA system according to a non-uniform memory. The idle prefetch buffer capacity of the node acquires a data prefetch parameter parameter prefetch window multiplier module, which is used to obtain the size of the previous prefetch window, the maximum multiplier of the prefetch amount, and the data prefetch The product of the quantity parameter factor r and the _β ;

And the size of the ^ύ -, the smaller value of the ^1⁄2 ^ ^ ^ and the ^ is used as the size of the prefetch window to prefetch the data.

According to the embodiment of the present invention, after the data prefetch parameter parameter is obtained according to the parameter indicating the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located, the previous prefetch can be obtained. The size of the window ^ f, the maximum multiplier of the prefetch amount and the data prefetch parameter parameter r are the product of the maximum prefetch

The size relationship of MAX determines the size of this prefetch window, and finally according to the determined prefetch ^ The size of the mouth to prefetch the data. Since the parameter representing the disk load in the NUMA system is related to the input/output I/O queue of the current operating system, and the data prefetch parameter parameter ^ is obtained according to the idle prefetch buffer capacity of the node where the process is located, therefore, Compared with the data prefetching algorithm provided by the technology for the single processor, the data prefetching method provided by the embodiment of the present invention comprehensively considers the factors affecting system performance such as the disk I/O load and the remaining memory size of the node, gp, on the disk. When the I/O load is light and the remaining memory of the node is large, the data prefetch amount is appropriately enlarged, which is conducive to hiding data I/O. When the disk I/O load is heavy and the remaining memory of the node is small, the data is appropriately reduced. Pre-fetching is beneficial to save system resources. DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the prior art or the embodiments will be briefly described below. Obviously, the drawings in the following description are only some of the present invention. For the embodiments, other figures can also be obtained as those skilled in the art.

1 is a schematic flowchart of a data prefetching method for non-uniform memory access according to an embodiment of the present invention;

2 is a schematic diagram of a general design principle of a data prefetching algorithm;

Figure 3 is a schematic diagram of the working hierarchy of the data prefetch algorithm;

4 is a schematic diagram of a prefetch window in a data prefetch algorithm;

FIG. 5 is a schematic structural diagram of a data prefetching apparatus for non-uniform memory access according to an embodiment of the present invention; FIG.

6 is a schematic structural diagram of a data prefetching device for non-uniform memory access according to another embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a data prefetching apparatus for non-uniform memory access according to another embodiment of the present invention. detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention are within the scope of the present invention.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a data prefetching method for non-uniform memory access according to an embodiment of the present invention, which mainly includes the following steps:

S101. Obtain a data prefetch parameter parameter according to a parameter that identifies a non-uniform memory access disk load in the NUMA system and an idle prefetch buffer capacity of the node where the process is located.

It should be noted that although the NUMA system contains multiple nodes, only one operating system is run. Therefore, data prefetching is for the entire operating system. In an embodiment of the invention, the parameters characterizing the disk load in the NUMA system are related to the input/output I/O queue of the current operating system. The so-called "current operating system I / O queue" refers to the I / O queue managed by the operating system and the current access to the disk by the I / O queue, how many read and write queues in the current NUMA system are accessing the disk .

In order to illustrate the capacity of the idle prefetch buffer of the node, the Lmux system is taken as an example to describe the data prefetching algorithm based on the design principle and working level of the data prefetching algorithm. After the Lmux kernel finishes reading the data, it caches the file page it has recently accessed into the memory for a period of time. The memory of the cache file page is called the page cache. Usually the data read (via the system's "read ( )" API) occurs between the application buffer and the page cache, as shown in Figure 2, and The data prefetch algorithm is responsible for reading data from the disk to fill the page cache. When the page cache application reads from the page cache to the application buffer, its read granularity is generally small, for example, the read and write granularity of the file copy command. Typically 4KByte (kilobytes), the kernel's data prefetch fills the data from disk to the page cache at a size that it thinks is more appropriate, for example, 16 KByte (kilobytes) to 128 KByte (kilobytes) (page cache as for data prefetching The working level of the algorithm can be seen in Figure 3. The data prefetch algorithm works at the VFS layer, which uniformly serves various file read operations (system call APIs), and is independent of the specific file system. When the application requests to read file data through different system APIs such as read ( ), pread ( ), readv ( ), al/O_ read ( ), sendfile ( ), and splice ( ), it will enter the unified read request processing. The function do— generic— file— read ( ). This function fetches data from the page cache to satisfy the application's request and, when appropriate, calls the prefetch routine to perform the necessary read-ahead I/O. The read-ahead I/O request from the read-ahead algorithm is preprocessed by — do_page — cache — readaheadO , which checks if each page in the request is already in the file cache address space, and if not, requests a new page. If the offset of the new page is exactly the location pointed to by the pre-read parameter async_size, the PG_readahead flag is set for the page. Finally, all new pages are passed to read_pages(), where they are added to the in-memory radix tree and inactive-list one by one, and call the readpage() of the filesystem in which they are located to deliver the page to I/O.

In the embodiment of the present invention, the prefetch buffer of the node is a memory allocated by the system to the node for caching the file page recently accessed by the node kernel, that is, the page cache, and the idle prefetch buffer of the node. The area is the memory remaining after removing the memory occupied by the already prefetched data in the page cache. The idle prefetch buffer capacity of a node is also one of the factors that affect the amount of data prefetch.

S102, a prefetch window size before obtaining ^R - ^Size, multiple prefetch largest multiplication ^. ^Fe and the product of the data prefetch parameter parameter ^.

In the data prefetch algorithm, when a thread running on a node belongs to a file, whenever a data prefetch request is issued, the data prefetch algorithm uses a data called a "prefetch window". The structure records the data prefetch request to indicate the length of the data requested for prefetching, as shown in FIG. The start and size form a prefetch window that records the location and size of the most recent prefetch request, and async_size indicates the amount of advance prefetch. PG—The readahead page is set in the last prefetch I/O, indicating that the application has exhausted enough early read windows, the time for the next prefetch I/O has arrived, and the isochronous read-ahead is initiated. To read more file pages. Therefore, it is easy to obtain the size of the ^previous prefetch window ^Rprev — ^Size by the recorded data prefetch request.

It should be noted that if the process is the first access to the file, that is, there is no pre-fetch window of the previous record, in this case, the pre-fetch window size can be set to be larger than the data length of the initial request pre-fetch, for example , you can set the prefetch window size to twice the length of the data requested for the first time. Of course, it can be set to other multiples. In principle, as long as the data length of the first request prefetch is larger, the present invention does not particularly limit this. In the embodiment of the present invention, the pre-fetch amount is multiplied by a maximum of ^. It is used to limit the multiplier of each pre-fetch amount, which can be set by the user according to the actual situation. With the size of the previous prefetch window

», the maximum multiplier of the pre-fetch amount each time. The relationship between ^fe and the data prefetch parameter factor is prev_size χ ¹ _sca le XT

S103, comparing the set maximum prefetch amount ^^^/^ and the size of the ₆ , and using the smaller value of the ^4 _rearfatearf and the & _β as the size of the prefetch window to prefetch data .

Due to various limitations of the prefetch buffer capacity, the size of the prefetch window cannot be increased indefinitely. For example, the size of the prefetch window cannot be infinitely increased according to the relationship of ^^- xxr, gp, should There is some restriction on the size of the prefetch window. In the embodiment of the present invention, it can be set by a user.

The A^ _{rearfatearf is} compared with ^ (= ^Prev_size χ T丄_scale χ 求 ) obtained in step _S10 2 , and finally, the data is prefetched with the smaller value in ^S and the size of the prefetch window.

The data prefetching method for non-uniform memory access provided by the embodiment of the present invention can be used to access the parameters and processes of the disk load in the NUMA system according to the characterization of the non-uniform memory. The product of three factors the set ^Λ maximum prefetch quantity ^^^^ size relationship to determine the size of this window prefetch finally determined according to the size of this window to prefetch data prefetching. Since the parameter representing the disk load in the NUMA system is related to the input/output I/O queue of the current operating system, and the data prefetch parameter parameter ^ is obtained according to the idle prefetch buffer capacity of the node where the process is located, therefore, Compared with the data prefetching algorithm provided by the technology for the single processor, the data prefetching method provided by the embodiment of the present invention comprehensively considers the factors affecting the system performance such as the disk I/O load and the remaining memory of the node, BP, in the disk. When the I/O load is light and the remaining memory of the node is large, the data prefetch amount is appropriately enlarged, which is conducive to hiding data I/O. When the disk I/O load is heavy and the remaining memory of the node is small, the data is appropriately reduced. Pre-fetching is beneficial to save system resources.

In an embodiment provided by the present invention, the data prefetch parameter parameter ^ obtained by characterizing the disk load in the NUMA system and the idle prefetch buffer capacity of the node where the process is located can be implemented by:

First, according to the parameters of the disk load in the NUMA system and the idle prefetch buffer capacity of the node where the process is located, the weight of the disk load to the prefetch amount and the prefetch buffer capacity of the node where the process is located are added to the weight of the prefetch amount. Then, the difference between the weight of the disk load on the prefetch amount and the weight of the prefetch buffer capacity of the node where the process is located is increased, and the difference is the data prefetch parameter parameter. .

In the above embodiment, the weight of the disk load to the prefetch amount is obtained according to the parameter representing the disk load in the NUMA system, and the length of the current I/O queue of the operating system can be obtained by calling the I/O queue obtaining module. Specifically, you can use the jprobe technique to detect the do-generic-make-request() function, from which you can get the length of the current I/O queue of the system, that is, the operating system I/O queue length being used (by detecting do-generic- Make - the parameter of the request ( ) function count ) , you can also get the maximum I / O queue length defined by the operating system (by detecting the parameter max - io - length of the do - generic - make - request ( ) function); said current operating system I / O queue length (referred to as ^G ™ operating system defined maximum I / O The ratio of the queue length (recorded as ² dishes) is multiplied by the first adjustable factor (denoted as a) to obtain the weight of the disk load on the prefetch increase. BP, the weight of the disk load on the prefetch increase is a ⁰ TM ⁰ In the foregoing embodiment, the weight of the prefetch buffer capacity of the node where the process is located is increased according to the idle prefetch buffer capacity of the node where the process is located, and the idle prefetch of the node where the process is located may be obtained by calling the memory acquisition module. The buffer capacity is implemented. Specifically, the idle prefetch buffer capacity of the node where the process is located (recorded as the ratio of ^M to the total prefetch buffer capacity of the node where the process is located (denoted as ^M w. ') is multiplied by the second adjustable factor (denoted as b) to obtain the weight of the prefetch buffer capacity of the node where the process is located, and gp, the weight of the prefetch buffer capacity of the node where the process is located to the prefetch amount is ^M sl ^M f

At this time, the data prefetch parameter parameter ^ is the difference between the weight of the disk load on the prefetch amount and the weight of the prefetch buffer capacity of the node where the process is located, and gP = ^current IQ^,

/ total

It should be noted that, in the embodiment of the present invention, the first adjustable factor a and the second adjustable factor b may be determined by the user according to the hardware environment and the requirements of the user, and the value ranges from (0 1). If the user is not the first The adjustable factor a and the second adjustable factor b are adjusted, and the first adjustable factor a and the second adjustable factor b may take a default value of 1. From the expression of the data prefetch parameter parameter, the first adjustable The factor a and the second adjustable factor b are used to adjust the weight of the total prefetch buffer capacity (ie, ^/^) and the disk load condition (ie, Q-1Q) of the node where the process is located, and specifically, the weight of the prefetch amount, specifically, When the first adjustable factor a is relatively large and the second adjustable factor b is relatively small, the pre-fetch buffer idle ratio (ie, ^MM fw ) of the node where the process is located has a relatively large influence on the pre-fetch amount; When the adjustment factor a is relatively small and the second adjustment factor b is relatively large, the disk load condition (ie, the »·^/β dish) has a relatively large influence on the prefetch amount.

In order to achieve efficient use of resources, multiple virtual machines are often run in a server system (which can be a node in a NUMA system), and each virtual machine runs a separate operating system. In virtual In the virtualized system, the data prefetching method for non-uniform memory access provided by the foregoing embodiment is basically unchanged, except that, due to the function of the virtualized I/O subsystem, each virtual machine runs independently. The operating system can have a separate file system and independently manage I/O queues. Therefore, the I/O queue length inside the operating system does not reflect the disk I/O load of the entire system. At this time, if the virtualization system provides a call interface that acquires the entire NUMA system I/O queue length, the I/O queue acquisition module uses the call interface to obtain the current I/O queue length of the entire NUMA system instead of the virtual machine. The independent operating system running in the middle obtains the current I/O queue length of the entire NUMA system; if the virtualization system does not provide such a calling interface, the I/O queue length of the current NUMA system is obtained from the operating system running in the virtual machine. . Specifically, for the case where the virtualization system does not provide a calling interface, a virtualization system management tool (for example, Hypervisor) may be invoked to obtain related parameters. Management tools such as Hypervisor coordinate the memory management and communication in the virtualization system running on each node. The strategy adopted for memory allocation and scheduling is public. The specific implementation is as follows, if virtualization is performed on a node. The system is running prefetch software, which can first obtain the I/O queue length of the node, according to the management tool.

(Hypervisor) memory scheduling strategy to calculate the current I / O queue length of the entire NUMA system

(ie the length of the I/O queue being used by the entire NUMA system). What needs to be further explained is that for the pre-fetch amount, the maximum multiplication factor ^. ^Fe , can be determined by the user according to the total prefetch buffer capacity of the node where the process is located and the characteristics of the main application of the system, § 卩, if the total prefetch buffer allocated by the node where the process is located has a large capacity, the main application has sequential sequential read For the characteristics of the file, the maximum multiplier of the prefetch amount ⁷ ^ can be set to a larger value, so that the prefetch window can be rapidly increased and the data prefetch hit rate is increased, if allowed.

As an embodiment of the present invention, the maximum multiplication factor of the prefetch amount may be in the range [0, 8], where the symbol "[]" represents a closed interval; in [0, 8], the maximum prefetch amount is also followed. The multiplication factor ^^ is larger, and the value of the maximum multiplication factor of the prefetch amount is larger.

According to the embodiment provided by the present invention, it is known to provide a single processor with the prior art. Compared with the data prefetching algorithm, the data prefetching method for non-uniform memory access provided by the present invention can at least bring the following effects:

First, the present invention does not change the basic framework of the file prefetching of the Lmux kernel, but proposes a new prefetch management strategy based on the traditional data prefetching algorithm and the exclusive environment. An optimization of prefetch management does not affect the stability of the system;

Secondly, the present invention comprehensively considers the multiple architecture features of the NUMA system affecting file prefetching effects such as disk load and memory management, and solves the problem that the Lmux kernel data prefetching algorithm does not match, and improves data prefetching. Reliability and accuracy;

Thirdly, the data prefetch parameter parameter ^ of the active prediction algorithm is proposed. According to the positive correlation between the data prefetch amount and the idle prefetch buffer capacity of the node where the process is located, the inverse correlation relationship with the NUMA system disk load, The determined data prefetch is determined by the current disk load of the NUMA system, the idle prefetch buffer size of the node where the process is located, and the global memory size, rather than simply a certain number (for example, 2 or 4). Multiplying, realizing the dynamic determination of the data prefetch parameter parameter size, scientifically effective and low management prefetch window size;

Fourth, dynamic and adaptive prefetching of data size and advancement is achieved, ensuring that the program terminates its order and reverse order access at any time, and the prefetch hit rate is at an acceptable level.

Referring to FIG. 5, it is a schematic structural diagram of a data prefetching apparatus 05 for non-uniform memory access according to an embodiment of the present invention. For the convenience of description, only parts related to the embodiment of the present invention are shown. The data prefetching device 05 for non-uniform memory access illustrated in Fig. 5 includes a data prefetch parameter factor obtaining module 501, a prefetch window multiplying module 502, and a prefetching window fetching module 503, wherein:

The data prefetch parameter acquisition module 501 is configured to obtain a data prefetch parameter parameter according to the parameter that identifies the non-uniform memory access disk load in the NUMA system and the idle prefetch buffer capacity of the node where the process is located, where the NUMA is represented. The parameters of the disk load in the system are related to the current operating system input/output I/O queue.

It should be noted that although the NUMA system contains multiple nodes, only one operating system is run. Therefore, data prefetching is for the entire operating system. In the embodiment shown in FIG. 5, the parameters characterizing the disk load in the NUMA system are related to the input/output I/O queue of the current operating system. The so-called "current operating system I / O queue" refers to the I / O queue managed by the operating system and the current access to the disk by the I / O queue, how many read and write queues in the current NUMA system are accessing the disk .

In order to illustrate the capacity of the idle prefetch buffer of the node, the Lmux system is taken as an example to describe the data prefetching algorithm based on the design principle and working level of the data prefetching algorithm. After the Lmux kernel finishes reading the data, it caches the file page it has recently accessed into the memory for a period of time. The memory of the cache file page is called the page cache. Usually the data read (via the system's "read ( )" API) occurs between the application buffer and the page cache, as shown in Figure 2, and The data prefetch algorithm is responsible for reading data from the disk to fill the page cache. When the page cache application reads from the page cache to the application buffer, its read granularity is generally small, for example, the read and write granularity of the file copy command. Typically 4KByte (kilobytes), the kernel's data prefetch fills the data from disk to the page cache at a size that it thinks is more appropriate, for example, 16 KByte (kilobytes) to 128 KByte (kilobytes) (page cache As for the working level of the data prefetch algorithm, refer to Figure 3. The data prefetch algorithm works at the VFS layer, and uniformly serves various file read operations (system call APIs), which are independent of the specific File system. When an application requests to read file data through different system APIs such as read ( ), pread ( ), readv ( ), al/O_ read ( ), sendfile ( ), and splice ( ), Enter the unified read request handler do — generic — file — read ( ) This function takes the data from the page cache to satisfy the application's request, and calls the prefetch routine to make the necessary read-ahead I/O when appropriate. The read-ahead I/O request from the read-ahead algorithm is preprocessed by — do_page — cache — readaheadO , which checks if each page in the request is already in the file cache address space, and if not, requests a new page. If the offset of the new page is exactly the location pointed to by the pre-read parameter async_size, then the PG_readahead flag is set for the page. Finally, all The new page is passed to read_pages ( ), where they are added to the in-memory radix tree and inactive_ list one by one, and call the readpage ( ) of the file system in which they are located to deliver the page to I/O.

In the embodiment shown in FIG. 5, the prefetch buffer of the node is a memory allocated by the system to the node for caching the file page recently accessed by the node kernel, that is, the page cache, and the node is idle. The prefetch buffer is the memory remaining after the memory occupied by the already prefetched data in the page cache is removed. The idle prefetch buffer capacity of a node is also one of the factors that affect the amount of data prefetch. The prefetch window multiplication module 502 is configured to obtain the size of the previous prefetch window ^f and the maximum multiplier of the prefetch amount ^. ^Fe and the product of the data prefetch parameter parameter acquisition module 501 and the data prefetch parameter parameter ^.

In the data prefetch algorithm, when a thread running on a node belongs to a file, whenever a data prefetch request is issued, the data prefetch algorithm uses a data called a "prefetch window". The structure records the data prefetch request to indicate the length of the data requested for prefetching, as shown in FIG. The start and size form a prefetch window that records the location and size of the most recent prefetch request, and async_size indicates the amount of advance prefetch. PG—The readahead page is set in the last prefetch I/O, indicating that the application has exhausted enough early read windows, the time for the next prefetch I/O has arrived, and the read-ahead read-ahead is read to read Take more file pages. Therefore, it is easy to obtain the size of the last prefetch window ^Rprev — ^Size by the recorded data prefetch request.

It should be noted that if the process is the first access to the file, that is, there is no pre-fetch window of the previous record, in this case, the pre-fetch window size can be set to be larger than the data length of the initial request pre-fetch, for example , you can set the prefetch window size to twice the length of the data requested for the first time. Of course, it can be set to other multiples. In principle, as long as the data length of the first request prefetch is larger, the present invention does not particularly limit this.

In the embodiment shown in Figure 5, the maximum multiplier of the prefetch amount is used to limit each pre-preparation. The multiplication multiple can be set by the user according to the actual situation. ^ with the last prefetch window:

R

Small ^f, pre-fetch amount maximum multiplier each time. ^Fe and the data prefetch parameter parameter ^

S,.

The relationship between the three is prev-size χ ¹ _sca le XT

.MAX

The prefetch window obtaining module 503 is configured to compare the set maximum read-ahead readahead and the size of the ^S obtained by the pre-fetch window multiplying module 502 to the smaller of the 'readahead and the ' ^ze ' The value is used to prefetch the data as the size of this prefetch window.

Due to various limitations of the prefetch buffer capacity, the size of the prefetch window cannot be increased indefinitely. For example, the size of the prefetch window cannot be infinitely increased according to the relationship of ^^- xxr, gp, should There is some restriction on the size of the prefetch window.

.MAX In the embodiment shown in Figure 5, a maximum prefetch can be set by the user: readahead

(= R

The comparison χ ^ scale X Γ ) is compared, and finally, the smaller value of the readahead and the S _size is used as the _size of the prefetch window to prefetch the data.

The data prefetching device 05 for non-uniform memory access provided by the embodiment shown in FIG. 5 above knows that the data prefetch parameter parameter obtaining module 501 accesses the parameter of the disk load in the NUMA system according to the non-uniform memory. After the data prefetch parameter parameter ^ is obtained from the idle prefetch buffer capacity of the node where the process is located, the prefetch window obtaining module 503 can be the size of the last prefetch window ^{R e} , the maximum multiplier of the prefetch amount ^ . The size relationship between ^fe and the data prefetch parameter parameter ^ and the set maximum prefetch amount ^ _rearfatearf determines the size of the prefetch window, and finally prefetches the data according to the determined size of the prefetch window. Since the parameter representing the disk load in the NUMA system is related to the input/output I/O queue of the current operating system, and the data prefetch parameter parameter ^ is obtained according to the idle prefetch buffer capacity of the node where the process is located, therefore, The present invention is compared to a data prefetch algorithm for a single processor provided by the technology. The data prefetching device provided by the embodiment comprehensively considers factors such as the disk I/O load and the remaining memory size of the node, which affects system performance, gp, when the disk I/O load is light, and the remaining memory of the node is large, the data pre-expanding is appropriately expanded. The amount of data is conducive to concealing data 1/0. When the disk I/O load is heavy and the remaining memory of the node is small, the data prefetching amount is appropriately reduced, which is beneficial to save system resources.

It should be noted that, in the foregoing embodiment of the data prefetching device for non-uniform memory access, the division of each functional module is merely an example, and the actual application may be implemented according to requirements, such as corresponding hardware configuration requirements or software implementation. Convenience considerations, and the above function assignment is performed by different functional modules, that is, the internal structure of the data prefetching device for non-uniform memory access is divided into different functional modules to complete all or part of the functions described above. . Moreover, in practical applications, the corresponding functional modules in this embodiment may be implemented by corresponding hardware, or may be executed by corresponding hardware, for example, the foregoing data prefetch parameter parameter acquisition module may be Having the foregoing hardware for characterizing the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located, for example, the data prefetch parameter factor acquirer may also A general processor or other hardware device capable of executing a corresponding computer program to perform the foregoing functions; and the pre-fetch window multiplication module as described above may have a size of a pre-fetch window for performing the aforementioned pre-fetching window, R _prev — _Size , Pre Taking the maximum multiplier of each time and the hardware of the product function of the data prefetch parameter parameter acquisition module (or the data prefetch parameter parameter acquirer), for example, the prefetch amount Window multiplier, or it can be executed before the corresponding computer program is completed A general processor or other hardware features (various embodiments of the present disclosure provides the above described principles can be applied).

The data prefetch parameter parameter acquisition module 501 illustrated in FIG. 5 further includes a weight acquisition sub-module 601 and a difference sub-module 602, such as the data prefetching device 06 for non-uniform memory access as illustrated in FIG. 6, wherein :

The weight obtaining sub-module 601 is configured to obtain a disk load pair according to a parameter indicating a non-uniform memory access to a disk load in the NUMA system and an idle prefetch buffer capacity of a node where the process is located. The weight of the prefetching amount and the weight of the prefetch buffer capacity of the node where the process is located increase the prefetch amount;

The difference sub-module 602 is configured to obtain a difference between a weight of the disk load on the pre-fetch amount and a weight of the pre-fetch buffer capacity of the node where the process is located, and obtain the data pre-fetch Quantity parameter factor.

The weight acquisition sub-module 601 of the example of Fig. 6 further includes a memory acquisition unit 701 and a prefetch weight acquisition unit 702, such as the data prefetching apparatus 07 for non-uniform memory access as illustrated in Fig. 7, wherein:

The memory obtaining unit 701 is configured to call the I/O queue obtaining module and the memory acquiring module to respectively acquire the length of the current I/O queue of the operating system and the idle prefetch buffer prefetching weight obtaining unit 702 of the node where the process is located, Multiplying a ratio of a length of a current I/O queue of the operating system to an operating system-defined maximum I/O queue length by a first adjustable factor to obtain a weight of a disk load to a prefetch amount, the process The ratio of the capacity of the idle prefetch buffer of the node to the total prefetch buffer capacity of the node where the process is located is multiplied by the second adjustable factor to obtain the weight of the prefetch buffer capacity of the node where the process is located. The ratio of the length of the current I/O queue of the operating system to the maximum I/O queue length defined by the operating system is a parameter characterizing the disk load in the non-uniform memory access NUMA system.

In the data prefetching device 07 for non-uniform memory access exemplified in FIG. 7 above, the memory obtaining unit 701 can acquire the length of the current I/O queue of the operating system by calling the I/O queue obtaining module, combined with the prefetch amount. The weight obtaining unit 702 obtains the weight of the disk load on the increase of the prefetch amount. Specifically, the jprobe technique can be used to detect the do-generic-make-request() function, from which the length of the current I/O queue of the system is obtained, that is, the operating system I/O queue length being used (by detecting do-generic- The value of the make_request() function, count), can also get the maximum I/O queue length defined by the operating system (by detecting the parameter max_io_length of the do-generic-make-request() function); then, prefetching Volume weight acquisition unit 702 will operate the operating system The ratio of the length of the current I/O queue (denoted ^Q -t ) to the maximum I/O queue length defined by the operating system (denoted as G _{max )} multiplied by the first adjustable factor (denoted as _a ) to obtain the disk load pair The weight of the prefetch increase, BP, the weight of the disk load on the prefetch increase is a ⁰ TM ⁰

In the data prefetching device 07 for non-uniform memory access as illustrated in FIG. 7 above, the memory obtaining unit 701 can also acquire the idle prefetch buffer capacity of the node where the process is located by calling the memory obtaining module, and combine the prefetch weight. The obtaining unit 702 is configured to obtain the weight of the prefetch buffer capacity of the node where the process is located to increase the prefetch amount. Specifically, the prefetch amount weight obtaining unit 702 multiplies the ratio of the idle prefetch buffer capacity of the node where the process is located (reported as ^M and the total prefetch buffer capacity of the node where the process is located (denoted as ^M w. '). two adjustable factor (denoted by b) of the process is located to the right of access nodes pre-fetch buffer capacity prefetch grew heavy, gp, a process where the prefetch buffer capacity of the node on the right amount of growth prefetch weight ^M Sl ^M f

/ total

It should be noted that, in the data prefetching device 06 or 07 for non-uniform memory access illustrated in FIG. 6 or FIG. 7, the first adjustable factor _a and the second adjustable factor b may be determined by the user according to the hardware environment. Determined by its own requirements, its value range is (0 1). If the user does not adjust the first adjustable factor a and the second adjustable factor b, the first adjustable factor a and the second adjustable factor b may be absent. The value is 1. From the expression of the data prefetch parameter parameter, the first adjustable factor a and the second adjustable factor b are used to adjust the total prefetch buffer capacity occupancy of the node where the process is located (ie, ^M ^

/ Gmax ) The weight of the prefetch amount, specifically, when the first adjustable factor a is relatively large and the second adjustable factor b is relatively small, the node where the process is located prefetches the buffer idle ratio (ie, ^MM fw ) The pre-fetch amount has a relatively large influence; on the contrary, when the first adjustable factor a is relatively small and the second adjustable factor b is relatively large, the disk load condition (ie, Qcurrent / Q^ ) has a relatively large impact on the amount of prefetch.

In order to achieve efficient use of resources, multiple virtual machines are often run in a server system (which can be a node in a NUMA system), and each virtual machine runs a separate operating system. In the virtualization system, the data prefetching method for non-uniform memory access provided by the foregoing embodiment is basically unchanged, except that, due to the role of the virtualized I/O subsystem, each virtual machine runs independently. The operating system can have a separate file system and independently manage I/O queues. Therefore, the I/O queue length inside the operating system does not reflect the disk I/O load of the entire system. At this time, if the virtualization system provides a call interface that acquires the entire NUMA system I/O queue length, the I/O queue acquisition module uses the call interface to obtain the current I/O queue length of the entire NUMA system instead of the virtual machine. The independent operating system running in the middle obtains the current I/O queue length of the entire NUMA system; if the virtualization system does not provide such a calling interface, the I/O queue length of the current NUMA system is obtained from the operating system running in the virtual machine. . Specifically, for the case where the virtualization system does not provide a calling interface, a virtualization management tool (for example, a hypervisor) may be invoked, and a management tool such as a hypervisor uniformly coordinates the memory management and communication in a virtualized system running on each node. The strategy adopted for memory allocation and scheduling is public. The specific implementation is as follows. If the virtualization system on a node is running prefetching software, you can first obtain the I/O queue length of the node, according to the management tool (Hypervisor). The memory scheduling strategy derives the current I/O queue length for the entire NUMA system (that is, the length of the I/O queue being used by the entire NUMA system). Need to further explain that, for the maximum multiplier of the prefetch amount ^ is used to limit the multiplier of each prefetch amount, which can be determined by the user according to the total prefetch buffer capacity of the node where the process is located and the characteristics of the main application of the system, gp If the total prefetch buffer allocated by the node where the process is located has a large capacity, and the main application has the characteristics of sequential sequential read files, the maximum multiplier of the prefetch amount ^ can be set to a larger value, so that the allowed In this case, the prefetch window can grow rapidly and improve the data prefetch hit rate. As an embodiment of the data prefetching means 07 for non-uniform memory access exemplified in FIG. 7, the prefetch amount is multiplied by a maximum number ^. The value range of ^fe can be [0, 8], where the symbol "[ ] " represents a closed interval; in [0, 8], the same as the maximum multiplier of the pre-fetch amount ⁷ ^ s, the prefetch The maximum multiplication factor ^. The principle that the value of ^fe is larger.

It should be noted that the information interaction, the execution process, and the like between the modules/units of the foregoing device are based on the same concept as the method embodiment of the present invention, and the technical effects thereof are the same as the embodiment of the method of the present invention. Reference is made to the description in the method embodiment of the present invention, and details are not described herein again.

One of ordinary skill in the art will appreciate that all or a portion of the various methods of the above-described embodiments can be performed by a program to instruct related hardware, such as one or more or all of the following various methods:

According to the parameter indicating the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located, the data prefetch parameter parameter is obtained to obtain the size of the previous prefetch window ^R - ^Size , the maximum prefetch amount Multiplier ^. ^Fe and the product of the data prefetch parameter parameter r, ^S · ' compare the set maximum prefetch amount ^ ^ ^ ^ and the size, with the _{comparison of} the ^4 _rearfatearf and the & _β The small value is used to prefetch the data as the size of this prefetch window. The program may be stored in a computer readable storage medium, and the storage medium may include: a read only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.

The data prefetching method and apparatus for non-uniform memory access provided by the embodiments of the present invention are described in detail above. The specific examples are used to explain the principles and implementation manners of the present invention. The description of the above embodiments is only Used to help understand the method of the present invention and its core idea; at the same time, for those of ordinary skill in the art, in accordance with the idea of the present invention, The scope of the application and the scope of the application are subject to change. In the above description, the content of the specification should not be construed as limiting the invention.

+

Claims

Rights request

A data prefetching method for non-uniform memory access, wherein the method comprises:

According to the parameter indicating the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located, the data prefetch parameter parameter is obtained to obtain the size of the previous prefetch window ^R - ^Size , the maximum prefetch amount Multiplier ^. ^Fe and the product of the data prefetch parameter parameter r, ^S · ' compare the set maximum prefetch amount ^ ^ ^ ^ and the size, with the _{comparison of} the ^4 _rearfatearf and the & _β The small value is used to prefetch the data as the size of this prefetch window.

2. The method according to claim 1, wherein the obtaining a data prefetch parameter parameter according to a parameter representing a disk load in the non-uniform memory access NUMA system and an idle prefetch buffer capacity of a node where the thread is located ^Includes:

Obtaining the weight of the disk load on the prefetch amount and the prefetch buffer capacity of the node where the process is located according to the parameter indicating the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located. Take the weight of growth;

The difference between the weight of the disk load on the prefetch amount and the weight of the prefetch buffer capacity of the node where the process is located is increased, and the data prefetch parameter parameter is obtained.

The method according to claim 2, wherein the obtaining the disk load pair prefetch amount according to the parameter indicating the disk load in the non-uniform memory access NUMA system and the idle prefetch buffer capacity of the node where the process is located The weight of the increase and the weight of the prefetch buffer capacity of the node where the process is located increase the amount of prefetch:

Calling the input/output I/O queue acquisition module and the memory acquisition module to respectively obtain the length of the current I/O queue of the operating system and the idle prefetch buffer capacity of the node where the process is located;

Multiplying a ratio of a length of a current I/O queue of the operating system to a maximum I/O queue length defined by an operating system by a first adjustable factor to obtain a weight of a disk load to a prefetch amount, The ratio of the capacity of the idle prefetch buffer of the node where the process is located to the total prefetch buffer capacity of the node where the thread is located is multiplied by the second adjustable factor to obtain the prefetch buffer capacity of the node where the thread is located, which increases the prefetch amount. Weight, the ratio of the length of the current I/O queue of the operating system to the maximum I/O queue length defined by the operating system is a parameter characterizing the disk load in the non-uniform memory access NUMA system.

4. The method of claim 1, wherein the prefetch amount is a maximum multiplication factor. The range of _fe is [ ₀ , _8] , where the symbol " [ ] " indicates a closed interval.

The method according to claim 1, wherein the larger the capacity of the idle prefetch buffer is, the maximum multiplier of the prefetch amount is ^. The larger the value of ^fe .

6. A data prefetching apparatus for non-uniform memory access, wherein the apparatus comprises:

a data prefetch parameter parameter acquisition module, configured to obtain a data prefetch parameter parameter prefetch window multiplication module according to a parameter representing a non-uniform memory access parameter of a disk load in a NUMA system and an idle prefetch buffer capacity of a node where the process is located , for obtaining the size of the previous prefetch window ^f, the maximum multiplier of the prefetch amount ⁷ ^ ^fe and the product of the data prefetch parameter parameter ^ &; the prefetch window acquisition module, for comparison set the maximum prefetch quantity and size of the ^M4 ^^^ ^S, the smaller the value of the ^S and Λ are prefetched as the size of this window to prefetch data.

The device according to claim 6, wherein the data prefetch parameter parameter acquisition module comprises:

a weight obtaining sub-module, configured to obtain, according to the parameter that the non-uniform memory accesses the disk load in the NUMA system and the idle prefetch buffer capacity of the node where the process is located, obtain the weight of the disk load on the prefetch amount and the process The weight of the prefetch buffer capacity of the node for the increase of the prefetch amount; a difference submodule, configured to obtain a difference between a weight of the disk load on the prefetch amount and a weight of the prefetch buffer capacity of the node where the thread is located, and a weight of the prefetch amount, to obtain the data prefetch amount Parameter factor

The apparatus according to claim 7, wherein the weight acquisition sub-module comprises: a memory acquisition unit, configured to call an input/output I/O queue acquisition module and a memory acquisition module to respectively acquire an operating system current I/ O The length of the queue and the idle prefetch buffer capacity of the node where the thread is located;

a prefetching weight obtaining unit, configured to multiply a ratio of a length of a current I/O queue of the operating system to a maximum I/O queue length defined by an operating system by a first adjustable factor to obtain a disk load to increase a prefetch amount The weight of the idle prefetch buffer of the node where the thread is located is multiplied by the ratio of the total prefetch buffer capacity of the node where the thread is located by the second adjustable factor to obtain the prefetch buffer capacity of the node where the thread is located. For the weight of the prefetch increase, the ratio of the length of the current I/O queue of the operating system to the maximum I/O queue length defined by the operating system is a parameter characterizing the disk load in the non-uniform memory access NUMA system.

9. The apparatus according to claim 6, wherein the pre-fetch amount is a maximum multiplication factor. The range of _fe is [ ₀ , _8] , where the symbol " [ ] " indicates a closed interval.

10. The apparatus according to claim 6 or 9, wherein the idle prefetch buffer