CN112817887B - Far memory access optimization method and system under separated combined architecture - Google Patents


Info

Publication number
CN112817887B
CN112817887B CN202110209483.6A
Authority
CN
China
Prior art keywords
data
rdma
read
data block
write
Prior art date
Legal status
Active
Application number
CN202110209483.6A
Other languages
Chinese (zh)
Other versions
CN112817887A (en)
Inventor
李超
王靖
汪陶磊
过敏意
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority application: CN202110209483.6A
Publication of CN112817887A
Application granted
Publication of CN112817887B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
    • G06F 9/542: Event management; Broadcasting; Multicasting; Notifications
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G06F 9/546: Message passing systems or structures, e.g. queues
    • G06F 2209/544: Indexing scheme relating to G06F 9/54; Remote
    • G06F 2209/548: Indexing scheme relating to G06F 9/54; Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Bus Control (AREA)

Abstract

A far memory access optimization method and system under a separated and combinable architecture are disclosed. First, according to the memory read/write frequency of the application, the writable working set is deployed on the local compute node and the read-only working set on the remote memory node. During data transmission, a suitable default data-block size is selected according to the characteristics of the hardware resources, and transparent scattering and gathering of data blocks is achieved by assigning an index to each data block and dynamically chunking data during RDMA transmission. A bidirectional one-sided operation mechanism that cooperates with local application reads and writes is realized using one-sided reads and writes together with the queue-based RDMA mechanism. Finally, buffers are set up under an event-notification-based asynchronous read/write mechanism, so that local computation and RDMA data transfer proceed asynchronously in parallel. The method fully exploits the performance potential of application-level computing tasks accessing remote memory via RDMA.

Description

Far memory access optimization method and system under separated combined architecture
Technical Field
The invention relates to a technique in the field of distributed data processing, and in particular to a remote memory access optimization method and system under a separated and combinable architecture.
Background
Under existing separated and combinable memory architectures, in which memory resources are scarce, high-speed networks such as RDMA are used to read and write remote memory. Existing application-transparent remote memory access solutions replace the back end of the Linux page-swap mechanism with RDMA to perform remote memory access transparently; they cannot avoid the extra overhead introduced by passing through the kernel, and they ignore the parallelism potential offered by the memory-access characteristics of the upper-layer application program.
Disclosure of Invention
To address these deficiencies of the prior art, the invention provides a remote memory access optimization method and system under a separated and combinable architecture that fully exploit the performance potential of application-level computing tasks accessing remote memory via RDMA.
The invention is realized by the following technical scheme:
the invention relates to an RDMA (remote direct memory Access) remote memory access method applying parallel collaboration under a combinable architecture, which comprises the steps of firstly deploying a writable working set on a local computing node according to the memory read-write frequency of application, and deploying a read-only working set on a remote memory node; selecting the size of a default data block according to hardware resource characteristics in the data transmission process, and realizing transparent dispersion and integration of the data block by setting indexes for the data block and combining RDMA transmission process dynamic blocking; a bidirectional unilateral operation mechanism matched with local application reading and writing is realized by utilizing unilateral reading and writing and a RDMA mechanism based on a queue; and setting a buffer by using an asynchronous read-write mechanism based on event notification to realize asynchronous parallel processing of local computation and RDMA data read-write.
The separated and combinable architecture is: an architecture in which a data center flexibly combines and matches the CPUs and memories of multiple servers through network connections, wherein servers whose function is computation serve as compute nodes and servers whose function is memory access serve as memory nodes.
The far memory architecture is: a distributed architecture comprising at least one compute node and at least one memory node, wherein each compute node and each memory node is a server, and the compute nodes and memory nodes are connected by wire through their respective RDMA network cards.
Each server uses its CPU as the computing core and its DRAM as the memory unit; the RDMA network card is attached to the server motherboard via PCIe. The CPU of each server uses its local memory directly and uses remote memory through the RDMA network card without consuming the resources of the remote CPU.
The working set deployment specifically includes:
i) dividing a read-only working set according to the memory read-write frequency of the application;
ii) during preprocessing, the data blocks of the read-only working set divided in step i) are transmitted block by block to the remote memory region by RDMA Write;
iii) during computation, the local application program continuously issues requests to read remote data blocks, and the remote end, according to each received request from the server program, returns the corresponding data block to the local machine by RDMA Read for use by the current program.
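For illustration only (not part of the claimed method), the working-set division of step i) can be sketched as follows; the write-count dictionary, the threshold, and all function names are assumptions introduced here, not taken from the patent text:

```python
# Hypothetical sketch of step i): split application data into a writable
# working set (kept on the local compute node) and a read-only working set
# (pushed to the remote memory node), based on observed writes per block.

def partition_working_sets(block_write_counts, write_threshold=0):
    """Blocks written more than `write_threshold` times stay local;
    the rest form the read-only working set for the remote node."""
    writable, read_only = [], []
    for block_id, writes in block_write_counts.items():
        (writable if writes > write_threshold else read_only).append(block_id)
    return sorted(writable), sorted(read_only)

# Example: graph-computation edge data is never written after loading,
# so it lands in the read-only set (cf. step 1 of the embodiment).
counts = {"vertices": 120, "edges": 0, "frontier": 45}
writable, read_only = partition_working_sets(counts)
print(writable, read_only)  # ['frontier', 'vertices'] ['edges']
```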
The default data block size is Chunk = α × Channel × Frame ÷ Core, where Channel is the number of PCIe channels the motherboard uses for one data transfer, Core is the number of CPUs on the motherboard, Frame is the number of data frames of the RDMA network card, and 1 ≤ α ≤ 1024.
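As a minimal illustrative sketch (the helper name and the example parameter values are assumptions, not from the patent), the formula Chunk = α × Channel × Frame ÷ Core can be evaluated as:

```python
# Sketch of the default chunk-size formula from the description:
#   Chunk = α × Channel × Frame ÷ Core,  with 1 ≤ α ≤ 1024.

def default_chunk_size(channels, frames, cores, alpha=1):
    if not 1 <= alpha <= 1024:
        raise ValueError("alpha must satisfy 1 <= alpha <= 1024")
    return alpha * channels * frames // cores  # integer block size

# Illustrative values: 16 PCIe channels, a 4096-frame NIC, 40 cores.
print(default_chunk_size(16, 4096, 40))  # 1638
```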
Setting an index for a data block means: the index consists of the address of the memory region corresponding to the data block together with its local key (lkey) and remote key (rkey).
Dynamic chunking during RDMA transmission means: when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split on sending into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately; otherwise it is sent as a single Chunk, thereby achieving transparent scattering. On receiving, the β sub-blocks are reassembled in their original order according to their indexes, thereby achieving gathering.
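A minimal Python sketch of this transparent scatter/gather, for illustration only: integer positions stand in for the real (address, lkey, rkey) index tuples, and all names are assumed:

```python
# Split a payload larger than Chunk into β = ceil(len / Chunk) indexed
# sub-blocks before sending, then reassemble them in index order on
# receipt, regardless of arrival order.
from math import ceil

def scatter(data: bytes, chunk: int):
    beta = max(1, ceil(len(data) / chunk))
    return [(i, data[i * chunk:(i + 1) * chunk]) for i in range(beta)]

def gather(indexed_blocks):
    # Integrate the β sub-blocks back in their original order.
    return b"".join(block for _, block in sorted(indexed_blocks))

payload = b"x" * 10
parts = scatter(payload, 4)              # 3 sub-blocks: 4 + 4 + 2 bytes
assert gather(reversed(parts)) == payload  # order-independent reassembly
print(len(parts))  # 3
```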
The bidirectional one-sided operation mechanism that cooperates with local application reads and writes is: the server program sets up a buffer for receiving information and sends the index of the data to be read to the remote end; the remote end locates the corresponding data block according to the index and, using one-sided read/write operations under the queue-based RDMA mechanism, writes the data directly into the server's buffer without any data copying.
In one-sided operation, every data block read into or written from the receive buffer is treated as newly read data; a later data block overwrites the information of the previous data block in the buffer.
The queue-based RDMA mechanism refers to:
Step 1: a send (receive) request for event A joins the queue;
Step 2: event A executes and data read/write begins;
Step 3: event A pops from the send (receive) queue and joins the completion queue;
Step 4: the next send (receive) event B enters the send (receive) queue;
Step 5: event A pops from the completion queue and its status is scanned;
Step 6: if event A's status is success, B begins to execute; if the status is unsuccessful, an error is reported;
Step 7: steps 2-6 are repeated until no new event joins the send (receive) queue.
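The seven steps above can be simulated with ordinary in-memory queues. The sketch below is illustrative only: real one-sided RDMA posts work requests and polls a completion queue through a verbs library, and every name here is an assumption:

```python
# Minimal simulation of the queue-based mechanism: each event enters the
# work (send/receive) queue, executes, moves to the completion queue, and
# its completion status is scanned before the next event may run.
from collections import deque

def run_events(events):
    """`events` is a list of (name, succeeded) pairs with mocked statuses."""
    work_q, completion_q, done = deque(events), deque(), []
    while work_q:
        ev = work_q.popleft()                  # steps 1-3: execute and pop
        completion_q.append(ev)                # ... into the completion queue
        name, status = completion_q.popleft()  # step 5: scan the status
        if not status:                         # step 6: unsuccessful -> error
            raise RuntimeError(f"event {name} failed")
        done.append(name)                      # success: next event may run
    return done                                # step 7: queue drained

print(run_events([("A", True), ("B", True)]))  # ['A', 'B']
```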
The event-notification-based asynchronous read/write mechanism is: the moment the remote memory read/write starts, the operation is regarded as successful and the event is moved from the send/receive queue to the completion queue; the actual duration of the read/write depends on the size of the data block and the current network bandwidth.
The buffers are located in both the local memory region and the remote memory region, and specifically comprise an asynchronous parallel buffer, a send region, and a receive region, wherein the send region and the receive region respectively execute data send (write) and receive (read) operations, and the asynchronous parallel buffer temporarily stores transmitted data to support asynchronous reading; data written into the asynchronous parallel buffer is not overwritten until the next rewrite.
Asynchronous parallel processing means: during local computation, the RDMA data transmission is prepared at the same time; that is, within each iteration, the reception of the data to be used by the next iteration is already under way. The specific steps are:
i) computing an iteration start;
ii) copying data of the RDMA receiving buffer to a computing area, and reading information of the read-only working set;
iii) preparing the index of the data block to be accessed in the next round, and sending the index to the far end;
iv) opening the RDMA receive buffer to receive data in preparation for the next round of data;
v) executing the calculation part of the iteration;
vi) return to step i) until the algorithm converges or there is no new remote data block to access.
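The overlap of steps i) to vi) amounts to double-buffered prefetching: while one round computes, the receive buffer is already filling with the next round's data, so RDMA latency hides behind computation. A schematic Python version, with synchronous stand-ins for the asynchronous RDMA Read and with all names assumed, is:

```python
# Sketch of compute/communication overlap: `fetch_remote` stands in for an
# asynchronous RDMA Read into the receive buffer; `compute` is the local
# calculation part of one iteration.

def iterate(indices, fetch_remote, compute):
    recv_buffer = fetch_remote(indices[0])        # prefetch round 0
    results = []
    for step, idx in enumerate(indices):
        local_copy = recv_buffer                  # ii) copy out of the buffer
        if step + 1 < len(indices):               # iii)-iv) request next round
            recv_buffer = fetch_remote(indices[step + 1])
        results.append(compute(local_copy))       # v) compute this round
    return results                                # vi) until no blocks remain

# Example with synchronous stand-ins:
data = {0: 1, 1: 2, 2: 3}
print(iterate([0, 1, 2], data.get, lambda v: v * 10))  # [10, 20, 30]
```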
Technical effects
Overall, the invention resolves the performance bottleneck of the prior art, in which replacing the back end of the Linux page-swap mechanism with RDMA for far memory access can neither avoid the extra overhead introduced by the kernel nor let the kernel exploit the access characteristics of the upper-layer application program.
Compared with the prior art, the invention realizes an application-level remote memory read/write framework that performs fine-grained optimization for the application's read/write characteristics and the characteristics of the RDMA hardware: it deploys the read-only working set on the remote memory node, selects a suitable default data-block size according to the hardware resources, sets indexes for data blocks to achieve their transparent scattering and gathering, establishes a bidirectional one-sided operation mechanism that cooperates with local application reads and writes, and designs asynchronous read/write buffers so that local computation and RDMA data transfer run asynchronously in parallel. This reduces bandwidth occupation, improves transmission efficiency and the overall performance of applications under a far memory architecture, and lowers overall latency, even approaching the performance of local memory processing.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a schematic diagram of an embodiment framework;
FIG. 4 is a schematic diagram of an embodiment data block partitioning and integration;
FIG. 5 is a block diagram illustrating an embodiment RDMA single-sided read and compute communication parallelism.
Detailed Description
In this embodiment, taking a graph-computation application as an example and using RDMA as the far memory medium, the system environment is as follows: each server has two 20-core Intel(R) Xeon(R) Gold 6148 CPUs, 256 GB of memory, 21 TB of hard disk, and a dual-channel Mellanox ConnectX-5 RDMA network card. One server serves as the compute node and the other as the remote memory node (remote node).
As shown in fig. 1, the far memory access optimization system under a separated and combinable architecture according to this embodiment comprises at least one local compute node and at least one remote memory node, connected by wire through their respective RDMA network cards, over which they exchange data. Each node contains a memory region, consisting of a local region and a remote region, and the CPU of the local compute node exchanges data with the memory region through its cache.
As shown in fig. 3, the remote memory access optimization system comprises a compute node and a memory node, wherein: the compute node divides the application content into a read-only part and a non-read-only part according to the read/write characteristics of the local application; meanwhile, it partitions the data into blocks according to the parameters of the local network card and memory hardware and interacts with the memory node through remote memory read/write operations. According to the compute node's read/write requests, the memory node writes the data transmitted by the compute node into remote memory using one-sided memory writes, and reads the data required by the compute node back to it using index-based one-sided memory reads.
The compute node comprises an application read/write separation module, a data-block selection module, a data-block scatter/gather module, a first bidirectional one-sided read/write module, and a first asynchronous parallel module, wherein: the application read/write separation module divides the application content into a read-only part and a non-read-only part according to the read/write characteristics of the local application, obtaining two classes of data blocks that it passes to the data-block selection module; the data-block selection module selects a data-block size according to the parameters of the local network card and memory hardware and outputs it to the data-block scatter/gather module; the data-block scatter/gather module splits and merges data according to the selected block size and exchanges the resulting blocks with the first bidirectional one-sided read/write module; the first asynchronous parallel module maintains a send/receive isolation region through an asynchronous buffer: it asynchronously passes data blocks to the first bidirectional one-sided read/write module in far-memory-write mode, while in far-memory-read mode it reads returned data blocks from that module in real time and stores them into the asynchronous buffer for later use; the first bidirectional one-sided read/write module performs the remote memory reads and writes of the data: in far-memory-write mode it writes the data from the first asynchronous parallel module into remote memory using one-sided memory writes and retains the corresponding data index locally; in far-memory-read mode it sends the retained index to the far end, where the corresponding data block is located, read back by a far-memory read, and output to the first asynchronous parallel module.
The memory node cooperates with the compute node and comprises a second bidirectional one-sided read/write module and a second asynchronous parallel module, wherein: the second bidirectional one-sided read/write module handles the writes and reads issued by the compute node; in far-memory-write mode it receives one-sided memory-write data blocks from the compute node, writes them into the second asynchronous parallel module, and returns the corresponding index to the compute node; in far-memory-read mode it locates the corresponding data block according to the index received from the compute node, obtains it from the second asynchronous parallel module, and returns it to the compute node by a far-memory read. The second asynchronous parallel module, through an asynchronous buffer serving as a send/receive isolation region, reads in real time the data blocks delivered by the second bidirectional one-sided read/write module in far-memory-write mode and stores them into the asynchronous buffer, while in far-memory-read mode it asynchronously passes data blocks to the second bidirectional one-sided read/write module.
As shown in fig. 2, the present embodiment relates to a method for optimizing remote memory access under the split and combinable architecture of the above system, which includes the following steps:
step 1) dividing the edge data of the graph calculation into a read-only working set.
Step 2) In the preprocessing stage, the read-only working set of the graph computation is deployed to remote memory by RDMA Write.
Step 3) The default size of the transmitted data block is selected according to the hardware characteristics: the default block size is computed as Chunk = α × Channel × Frame ÷ Core from the number of PCIe channels (Channel) the motherboard uses for one data transfer, the number of CPUs (Core) on the motherboard, and the number of data frames (Frame) of the RDMA network card, with 1 ≤ α ≤ 1024.
Step 4) As shown in fig. 4, the data is split and gathered according to the block size determined in step 3). Specifically, when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately. When Data_block is smaller than or equal to Chunk, it is merged into a single Chunk for sending. On receiving, if the data was divided into β sub-blocks, they are reassembled in their original order.
Step 5) RDMA port bindings and connection setup are first opened locally and remotely as shown in fig. 5.
Step 6) Three kinds of buffers, namely an asynchronous parallel buffer, a send region, and a receive region, are set up simultaneously in the local and remote memory regions; the local send region L-SB and the remote receive region R-RB together execute data send (write) operations, and the local receive region L-RB and the remote send region R-SB together execute data receive (read) operations.
Step 7) In the preprocessing stage, the system writes the data blocks of the read-only working set into remote memory by one-sided RDMA Write, following the block-transmission scheme of step 4).
Step 8) During iterative computation, computation and communication are overlapped in time to achieve parallelism. The specific steps are:
i) Computing an iteration start;
ii) copying data of the RDMA receiving buffer to a computing area, and reading information of the read-only working set;
iii) preparing the index of the data block to be accessed in the next round, and sending the index to the far end;
iv) opening the RDMA receive buffer to receive data in preparation for the next round of data;
v) executing the calculation part of the iteration;
vi) returning to step i) until the algorithm converges or no new remote data block needs to be accessed.
Step 9) After the computation finishes, the memory regions occupied locally and remotely are reclaimed and the RDMA connection is closed.
In practical experiments, the BFS and PageRank algorithms of the GridGraph graph-computation framework were rewritten to access far memory with RDMA in the manner described above, processing four data sets (ranging from 1 GB to 32 GB), including LiveJournal. The results, summarized in the table below, show that while saving 80% of local memory, the total time improves by about 8.3 times over the latest far memory access framework Fastswap. Compared with the prior art, the method achieves a smaller local memory footprint, lower transmission volume, and lower total latency.
(Table of experimental results, presented as an image in the original publication.)
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. An RDMA (Remote Direct Memory Access) remote memory access method with application-parallel cooperation under a separated and combinable architecture, characterized in that: first, according to the memory read/write frequency of the application, the writable working set is deployed on the local compute node and the read-only working set on the remote memory node; during data transmission, a suitable default data-block size is selected according to the characteristics of the hardware resources, and transparent scattering and gathering of data blocks is achieved by assigning an index to each data block and dynamically chunking data during RDMA transmission; a bidirectional one-sided operation mechanism that cooperates with local application reads and writes is realized using one-sided reads and writes together with the queue-based RDMA mechanism; and buffers are set up under an event-notification-based asynchronous read/write mechanism to achieve asynchronous parallel processing of local computation and RDMA data reads and writes;
the said separated and combined structure is: the data center is provided with a framework for flexibly combining and matching a plurality of server CPUs and memories in a network connection mode, wherein: taking a server with a calculation task as a calculation node and a server with a memory access function as a memory node;
the far memory means: a distributed architecture comprising at least one compute node and at least one memory node, wherein: the computing node and the memory node respectively comprise a server, and the computing node and the memory node are in wired connection through respective RDMA network cards;
the server takes a CPU as a computing core and a DRAM as a memory unit, the RDMA network card is connected with a mainboard of the server through PCIe, the CPU of each server uses a local memory and uses a remote memory through the RDMA network card without occupying the resources of the remote CPU.
2. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the working set deployment specifically comprises:
i) dividing a read-only working set according to the memory read-write frequency of the application;
ii) during preprocessing, the data blocks of the read-only working set divided in step i) are transmitted block by block to the remote memory region by RDMA Write;
iii) during computation, the local application continuously issues requests to read remote data blocks, and the remote end, according to each received request from the server program, returns the corresponding data block to the local machine by RDMA Read for use by the current program.
3. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the default data block size is Chunk = α × Channel × Frame ÷ Core, where Channel is the number of PCIe channels the motherboard uses for one data transfer, Core is the number of CPUs on the motherboard, Frame is the number of data frames of the RDMA network card, and 1 ≤ α ≤ 1024.
4. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein setting an index for a data block means: the index consists of the address of the memory region corresponding to the data block together with its local key (lkey) and remote key (rkey).
5. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the dynamic chunking during RDMA transmission is: when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split on sending into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately; otherwise it is sent as a single Chunk, thereby achieving transparent scattering; on receiving, the β sub-blocks are reassembled in their original order according to their indexes, thereby achieving gathering.
6. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the bidirectional one-sided operation mechanism that cooperates with local application reads and writes is: the server program sets up a buffer for receiving information and sends the index of the data to be read to the remote end; the remote end locates the corresponding data block according to the index and, using one-sided read/write operations under the queue-based RDMA mechanism, writes the data directly into the server's buffer without any data copying;
in one-sided operation, every data block read into or written from the receive buffer is treated as newly read data; a later data block overwrites the information of the previous data block in the buffer.
7. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the queue-based RDMA mechanism is:
Step 1: a send (receive) request for event A joins the queue;
Step 2: event A executes and data read/write begins;
Step 3: event A pops from the send (receive) queue and joins the completion queue;
Step 4: the next send (receive) event B enters the send (receive) queue;
Step 5: event A pops from the completion queue and its status is scanned;
Step 6: if event A's status is success, B begins to execute; if the status is unsuccessful, an error is reported;
Step 7: steps 2-6 are repeated until no new event joins the send (receive) queue.
8. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the event-notification-based asynchronous read/write mechanism is: the moment the remote memory read/write starts, the operation is regarded as successful and the event is moved from the send/receive queue to the completion queue; the actual duration of the read/write depends on the size of the data block and the current network bandwidth.
9. The RDMA remote memory access method with application-parallel collaboration under the separated and combinable architecture as claimed in claim 1, wherein the buffers are located in the local memory region and the remote memory region, and specifically comprise: an asynchronous parallel buffer, a send region, and a receive region, wherein: the send region and the receive region perform the corresponding data send or receive operations, the asynchronous parallel buffer temporarily stores transferred data to support asynchronous reads, and data written into the asynchronous parallel buffer is not overwritten until it is rewritten the next time.
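The three regions above can be sketched as a small container; this is an illustrative model under assumed names (`MemoryRegions`, `stage_for_compute`), showing only the stability guarantee of the asynchronous parallel buffer:

```python
class MemoryRegions:
    def __init__(self):
        self.send_region = None      # staged outgoing data
        self.recv_region = None      # landing zone for incoming writes
        self.parallel_buffer = None  # stable copy for asynchronous reads

    def receive(self, block):
        # Incoming one-sided writes land here and may be overwritten
        # by the next transfer at any time.
        self.recv_region = block

    def stage_for_compute(self):
        # Copy out of the volatile receive region; this copy is only
        # replaced when stage_for_compute() is called again.
        self.parallel_buffer = self.recv_region
        return self.parallel_buffer

regions = MemoryRegions()
regions.receive(b"round-1")
stable = regions.stage_for_compute()
regions.receive(b"round-2")               # recv_region is overwritten...
print(stable == regions.parallel_buffer)  # True: staged copy is unchanged
```

The design choice being modeled: the receive region may be clobbered by the next round's transfer, so reads that must stay valid across rounds go through the asynchronous parallel buffer instead.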
10. The RDMA remote memory access method with application-parallel collaboration under the separated and combinable architecture as claimed in claim 1, wherein the asynchronous parallel processing is: during local computation, the RDMA data transfer is prepared concurrently, that is, each iteration already starts receiving the data that the next iteration will use, with the specific steps:
i) a compute iteration starts;
ii) the data in the RDMA receive buffer is copied to the compute area, and the read-only working set is read;
iii) the index of the data block to be accessed in the next round is prepared and sent to the remote end;
iv) the RDMA receive buffer is opened to receive the next round's data;
v) the compute part of the current iteration is executed;
vi) return to step i) until the algorithm converges or there is no new remote data block to access.
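The loop in steps i)-vi) can be sketched as a double-buffered pipeline; this is a minimal simulation under assumed names (`fetch_block` stands in for the RDMA receive, `data.upper()` for the local compute), not the patent's implementation:

```python
import threading

def fetch_block(index, store):
    # Stand-in for the remote one-sided write into the receive buffer.
    store[index] = f"block-{index}"

def pipeline(num_iters):
    results, store = [], {}
    fetch_block(0, store)                           # warm-up fetch
    for k in range(num_iters):
        data = store[k]                             # ii) copy to compute area
        t = None
        if k + 1 < num_iters:                       # iii)/iv) prefetch next
            t = threading.Thread(target=fetch_block, args=(k + 1, store))
            t.start()
        results.append(data.upper())                # v) compute this round
        if t:
            t.join()                                # vi) next block is ready
    return results

print(pipeline(3))  # ['BLOCK-0', 'BLOCK-1', 'BLOCK-2']
```

The overlap is the whole point: when compute time and transfer time are comparable, fetching block k+1 while computing on block k hides most of the remote-access latency.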
CN202110209483.6A 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture Active CN112817887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110209483.6A CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110209483.6A CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Publications (2)

Publication Number Publication Date
CN112817887A CN112817887A (en) 2021-05-18
CN112817887B true CN112817887B (en) 2021-09-17

Family

ID=75865550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110209483.6A Active CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Country Status (1)

Country Link
CN (1) CN112817887B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448897B (en) * 2021-07-12 2022-09-06 上海交通大学 Optimization method suitable for pure user mode far-end direct memory access
CN113395359B (en) * 2021-08-17 2021-10-29 苏州浪潮智能科技有限公司 Filecoin cluster data transmission method and system based on remote direct memory access
CN115495246B (en) * 2022-09-30 2023-04-18 上海交通大学 Hybrid remote memory scheduling method under separated memory architecture

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7281030B1 (en) * 1999-09-17 2007-10-09 Intel Corporation Method of reading a remote memory
CN105426321A (en) * 2015-11-13 2016-03-23 上海交通大学 RDMA friendly caching method using remote position information
CN106844048A (en) * 2017-01-13 2017-06-13 上海交通大学 Distributed shared memory method and system based on hardware features
CN108268208A (en) * 2016-12-30 2018-07-10 清华大学 A kind of distributed memory file system based on RDMA
CN110262754A (en) * 2019-06-14 2019-09-20 华东师范大学 A kind of distributed memory system and lightweight synchronized communication method towards NVMe and RDMA
CN111221773A (en) * 2020-01-15 2020-06-02 华东师范大学 Data storage architecture method based on RMDA high-speed network and skip list

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539780B2 (en) * 2003-12-01 2009-05-26 International Business Machines Corporation Asynchronous completion notification for an RDMA system
US20060168094A1 (en) * 2005-01-21 2006-07-27 International Business Machines Corporation DIRECT ACCESS OF SCSI BUFFER WITH RDMA ATP MECHANISM BY iSCSI TARGET AND/OR INITIATOR
US8966195B2 (en) * 2009-06-26 2015-02-24 Hewlett-Packard Development Company, L.P. Direct memory access and super page swapping optimizations for a memory blade
CN105589664B (en) * 2015-12-29 2018-07-31 四川中电启明星信息技术有限公司 Virtual memory high speed transmission method
CN111078607B (en) * 2019-12-24 2023-06-23 上海交通大学 Network access programming framework deployment method and system for RDMA (remote direct memory access) and nonvolatile memory
CN111400307B (en) * 2020-02-20 2023-06-23 上海交通大学 Persistent hash table access system supporting remote concurrent access
CN111400306B (en) * 2020-02-20 2023-03-28 上海交通大学 RDMA (remote direct memory Access) -and non-volatile memory-based radix tree access system
CN111459418B (en) * 2020-05-15 2021-07-23 南京大学 RDMA (remote direct memory Access) -based key value storage system transmission method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7281030B1 (en) * 1999-09-17 2007-10-09 Intel Corporation Method of reading a remote memory
CN105426321A (en) * 2015-11-13 2016-03-23 上海交通大学 RDMA friendly caching method using remote position information
CN108268208A (en) * 2016-12-30 2018-07-10 清华大学 A kind of distributed memory file system based on RDMA
CN106844048A (en) * 2017-01-13 2017-06-13 上海交通大学 Distributed shared memory method and system based on hardware features
CN110262754A (en) * 2019-06-14 2019-09-20 华东师范大学 A kind of distributed memory system and lightweight synchronized communication method towards NVMe and RDMA
CN111221773A (en) * 2020-01-15 2020-06-02 华东师范大学 Data storage architecture method based on RMDA high-speed network and skip list

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Survey of Distributed Storage Systems Based on RDMA" ("基于RDMA的分布式存储系统研究综述"); Chen Youmin et al.; Journal of Computer Research and Development (《计算机研究与发展》); 20190129; pp. 227-238 *

Also Published As

Publication number Publication date
CN112817887A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112817887B (en) Far memory access optimization method and system under separated combined architecture
CN113485823A (en) Data transmission method, device, network equipment and storage medium
CN111339192A (en) Distributed edge computing data storage system
CN111708738B (en) Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data
CN107907867A (en) A kind of real-time SAR quick look systems of multi-operation mode
CN111708719B (en) Computer storage acceleration method, electronic equipment and storage medium
CN114201421B (en) Data stream processing method, storage control node and readable storage medium
US11243714B2 (en) Efficient data movement method for in storage computation
WO2023019800A1 (en) Filecoin cluster data transmission method and system based on remote direct memory access
CN113590528A (en) Multi-channel data acquisition, storage and playback card, system and method based on HP interface
CN101576912A (en) System and reading and writing method for realizing asynchronous input and output interface of distributed file system
CN112445735A (en) Method, computer equipment, system and storage medium for transmitting federated learning data
US20080201549A1 (en) System and Method for Improving Data Caching
JP4208506B2 (en) High-performance storage device access environment
US7600074B2 (en) Controller of redundant arrays of independent disks and operation method thereof
US7409486B2 (en) Storage system, and storage control method
CN103986771A (en) High-availability cluster management method independent of shared storage
CN112003800B (en) Method and device for exchanging and transmitting messages of ports with different bandwidths
CN116074179B (en) High expansion node system based on CPU-NPU cooperation and training method
WO2023093608A1 (en) Automatic distributed cloud storage scheduling interaction method and apparatus, and device
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
CN105740166A (en) Cache reading and reading processing method and device
CN106502828A (en) A kind of remote copy method based on LVM of optimization
US8054857B2 (en) Task queuing methods and systems for transmitting frame information over an I/O interface
JPH0715670B2 (en) Data processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant