CN112817887B - Far memory access optimization method and system under separated combined architecture - Google Patents


Info

Publication number
CN112817887B
CN112817887B CN202110209483.6A
Authority
CN
China
Prior art keywords
data
rdma
read
data block
write
Prior art date
Legal status
Active
Application number
CN202110209483.6A
Other languages
Chinese (zh)
Other versions
CN112817887A (en)
Inventor
李超
王靖
汪陶磊
过敏意
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority application: CN202110209483.6A
Publication of CN112817887A
Application granted
Publication of CN112817887B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
    • G06F 9/542: Event management; Broadcasting; Multicasting; Notifications
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G06F 9/546: Message passing systems or structures, e.g. queues
    • G06F 2209/544: Indexing scheme relating to G06F 9/54; Remote
    • G06F 2209/548: Indexing scheme relating to G06F 9/54; Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Bus Control (AREA)

Abstract

A far memory access optimization method and system under a separated and combinable architecture are disclosed. First, according to the memory read/write frequency of the application, the writable working set is deployed on the local compute node and the read-only working set on the remote memory node. During data transmission, a suitable default data-block size is selected according to the characteristics of the hardware resources, and transparent scattering and gathering of data blocks is achieved by assigning an index to each data block and dynamically chunking data during RDMA transmission. A bidirectional one-sided operation mechanism that cooperates with local application reads and writes is realized using one-sided reads and writes together with the queue-based RDMA mechanism. Finally, buffers are set up under an event-notification-based asynchronous read/write mechanism, so that local computation and RDMA data transfer proceed asynchronously in parallel. The method fully exploits the performance potential of application-level computing tasks accessing remote memory via RDMA.

Description

Far memory access optimization method and system under separated combined architecture
Technical Field
The invention relates to a technique in the field of distributed data processing, and in particular to a remote memory access optimization method and system under a separated and combinable architecture.
Background
Under existing separated and combinable memory architectures, in which memory resources are scarce, high-speed networks such as RDMA are used to read and write remote memory. Existing application-transparent remote memory access solutions replace the back end of the Linux page-swap mechanism with RDMA to perform remote memory access transparently; they cannot avoid the extra overhead introduced by passing through the kernel, and they ignore the parallelism potential offered by the memory-access characteristics of the upper-layer application program.
Disclosure of Invention
To address these deficiencies of the prior art, the invention provides a remote memory access optimization method and system under a separated and combinable architecture that fully exploit the performance potential of application-level computing tasks accessing remote memory via RDMA.
The invention is realized by the following technical scheme:
the invention relates to an RDMA (remote direct memory Access) remote memory access method applying parallel collaboration under a combinable architecture, which comprises the steps of firstly deploying a writable working set on a local computing node according to the memory read-write frequency of application, and deploying a read-only working set on a remote memory node; selecting the size of a default data block according to hardware resource characteristics in the data transmission process, and realizing transparent dispersion and integration of the data block by setting indexes for the data block and combining RDMA transmission process dynamic blocking; a bidirectional unilateral operation mechanism matched with local application reading and writing is realized by utilizing unilateral reading and writing and a RDMA mechanism based on a queue; and setting a buffer by using an asynchronous read-write mechanism based on event notification to realize asynchronous parallel processing of local computation and RDMA data read-write.
The separated and combinable architecture is: an architecture in which a data center flexibly combines and matches the CPUs and memories of multiple servers through network connections, wherein servers whose function is computation serve as compute nodes and servers whose function is memory access serve as memory nodes.
The far memory architecture is: a distributed architecture comprising at least one compute node and at least one memory node, wherein each compute node and each memory node is a server, and the compute nodes and memory nodes are connected by wire through their respective RDMA network cards.
Each server uses its CPU as the computing core and its DRAM as the memory unit; the RDMA network card is attached to the server motherboard via PCIe. The CPU of each server uses its local memory directly and uses remote memory through the RDMA network card without consuming the resources of the remote CPU.
The working set deployment specifically includes:
i) dividing a read-only working set according to the memory read-write frequency of the application;
ii) during preprocessing, the data blocks of the read-only working set divided in step i) are transmitted block by block to the remote memory region by RDMA Write;
iii) during computation, the local application program continuously issues requests to read remote data blocks, and the remote end, according to each received request from the server program, returns the corresponding data block to the local machine by RDMA Read for use by the current program.
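For illustration only (not part of the claimed method), the working-set division of step i) can be sketched as follows; the write-count dictionary, the threshold, and all function names are assumptions introduced here, not taken from the patent text:

```python
# Hypothetical sketch of step i): split application data into a writable
# working set (kept on the local compute node) and a read-only working set
# (pushed to the remote memory node), based on observed writes per block.

def partition_working_sets(block_write_counts, write_threshold=0):
    """Blocks written more than `write_threshold` times stay local;
    the rest form the read-only working set for the remote node."""
    writable, read_only = [], []
    for block_id, writes in block_write_counts.items():
        (writable if writes > write_threshold else read_only).append(block_id)
    return sorted(writable), sorted(read_only)

# Example: graph-computation edge data is never written after loading,
# so it lands in the read-only set (cf. step 1 of the embodiment).
counts = {"vertices": 120, "edges": 0, "frontier": 45}
writable, read_only = partition_working_sets(counts)
print(writable, read_only)  # ['frontier', 'vertices'] ['edges']
```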
The default data block size is Chunk = α × Channel × Frame ÷ Core, where Channel is the number of PCIe channels the motherboard uses for one data transfer, Core is the number of CPUs on the motherboard, Frame is the number of data frames of the RDMA network card, and 1 ≤ α ≤ 1024.
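As a minimal illustrative sketch (the helper name and the example parameter values are assumptions, not from the patent), the formula Chunk = α × Channel × Frame ÷ Core can be evaluated as:

```python
# Sketch of the default chunk-size formula from the description:
#   Chunk = α × Channel × Frame ÷ Core,  with 1 ≤ α ≤ 1024.

def default_chunk_size(channels, frames, cores, alpha=1):
    if not 1 <= alpha <= 1024:
        raise ValueError("alpha must satisfy 1 <= alpha <= 1024")
    return alpha * channels * frames // cores  # integer block size

# Illustrative values: 16 PCIe channels, a 4096-frame NIC, 40 cores.
print(default_chunk_size(16, 4096, 40))  # 1638
```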
Setting an index for a data block means: the index consists of the address of the memory region corresponding to the data block together with its local key (lkey) and remote key (rkey).
Dynamic chunking during RDMA transmission means: when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split on sending into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately; otherwise it is sent as a single Chunk, thereby achieving transparent scattering. On receiving, the β sub-blocks are reassembled in their original order according to their indexes, thereby achieving gathering.
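A minimal Python sketch of this transparent scatter/gather, for illustration only: integer positions stand in for the real (address, lkey, rkey) index tuples, and all names are assumed:

```python
# Split a payload larger than Chunk into β = ceil(len / Chunk) indexed
# sub-blocks before sending, then reassemble them in index order on
# receipt, regardless of arrival order.
from math import ceil

def scatter(data: bytes, chunk: int):
    beta = max(1, ceil(len(data) / chunk))
    return [(i, data[i * chunk:(i + 1) * chunk]) for i in range(beta)]

def gather(indexed_blocks):
    # Integrate the β sub-blocks back in their original order.
    return b"".join(block for _, block in sorted(indexed_blocks))

payload = b"x" * 10
parts = scatter(payload, 4)              # 3 sub-blocks: 4 + 4 + 2 bytes
assert gather(reversed(parts)) == payload  # order-independent reassembly
print(len(parts))  # 3
```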
The bidirectional one-sided operation mechanism that cooperates with local application reads and writes is: the server program sets up a buffer for receiving information and sends the index of the data to be read to the remote end; the remote end locates the corresponding data block according to the index and, using one-sided read/write operations under the queue-based RDMA mechanism, writes the data directly into the server's buffer without any data copying.
In one-sided operation, every data block read into or written from the receive buffer is treated as newly read data; a later data block overwrites the information of the previous data block in the buffer.
The queue-based RDMA mechanism refers to:
Step 1: a send (receive) request for event A joins the queue;
Step 2: event A executes and data read/write begins;
Step 3: event A pops from the send (receive) queue and joins the completion queue;
Step 4: the next send (receive) event B enters the send (receive) queue;
Step 5: event A pops from the completion queue and its status is scanned;
Step 6: if event A's status is success, B begins to execute; if the status is unsuccessful, an error is reported;
Step 7: steps 2-6 are repeated until no new event joins the send (receive) queue.
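The seven steps above can be simulated with ordinary in-memory queues. The sketch below is illustrative only: real one-sided RDMA posts work requests and polls a completion queue through a verbs library, and every name here is an assumption:

```python
# Minimal simulation of the queue-based mechanism: each event enters the
# work (send/receive) queue, executes, moves to the completion queue, and
# its completion status is scanned before the next event may run.
from collections import deque

def run_events(events):
    """`events` is a list of (name, succeeded) pairs with mocked statuses."""
    work_q, completion_q, done = deque(events), deque(), []
    while work_q:
        ev = work_q.popleft()                  # steps 1-3: execute and pop
        completion_q.append(ev)                # ... into the completion queue
        name, status = completion_q.popleft()  # step 5: scan the status
        if not status:                         # step 6: unsuccessful -> error
            raise RuntimeError(f"event {name} failed")
        done.append(name)                      # success: next event may run
    return done                                # step 7: queue drained

print(run_events([("A", True), ("B", True)]))  # ['A', 'B']
```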
The event-notification-based asynchronous read/write mechanism is: the moment the remote memory read/write starts, the operation is regarded as successful and the event is moved from the send/receive queue to the completion queue; the actual duration of the read/write depends on the size of the data block and the current network bandwidth.
The buffers are located in both the local memory region and the remote memory region, and specifically comprise an asynchronous parallel buffer, a send region, and a receive region, wherein the send region and the receive region respectively execute data send (write) and receive (read) operations, and the asynchronous parallel buffer temporarily stores transmitted data to support asynchronous reading; data written into the asynchronous parallel buffer is not overwritten until the next rewrite.
Asynchronous parallel processing means: during local computation, the RDMA data transmission is prepared at the same time; that is, within each iteration, the reception of the data to be used by the next iteration is already under way. The specific steps are:
i) computing an iteration start;
ii) copying data of the RDMA receiving buffer to a computing area, and reading information of the read-only working set;
iii) preparing the index of the data block to be accessed in the next round, and sending the index to the far end;
iv) opening the RDMA receive buffer to receive data in preparation for the next round of data;
v) executing the calculation part of the iteration;
vi) return to step i) until the algorithm converges or there is no new remote data block to access.
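The overlap of steps i) to vi) amounts to double-buffered prefetching: while one round computes, the receive buffer is already filling with the next round's data, so RDMA latency hides behind computation. A schematic Python version, with synchronous stand-ins for the asynchronous RDMA Read and with all names assumed, is:

```python
# Sketch of compute/communication overlap: `fetch_remote` stands in for an
# asynchronous RDMA Read into the receive buffer; `compute` is the local
# calculation part of one iteration.

def iterate(indices, fetch_remote, compute):
    recv_buffer = fetch_remote(indices[0])        # prefetch round 0
    results = []
    for step, idx in enumerate(indices):
        local_copy = recv_buffer                  # ii) copy out of the buffer
        if step + 1 < len(indices):               # iii)-iv) request next round
            recv_buffer = fetch_remote(indices[step + 1])
        results.append(compute(local_copy))       # v) compute this round
    return results                                # vi) until no blocks remain

# Example with synchronous stand-ins:
data = {0: 1, 1: 2, 2: 3}
print(iterate([0, 1, 2], data.get, lambda v: v * 10))  # [10, 20, 30]
```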
Technical effects
Overall, the invention resolves the performance bottleneck of the prior art, in which replacing the back end of the Linux page-swap mechanism with RDMA for far memory access can neither avoid the extra overhead introduced by the kernel nor let the kernel exploit the access characteristics of the upper-layer application program.
Compared with the prior art, the invention realizes an application-level remote memory read/write framework that performs fine-grained optimization for the application's read/write characteristics and the characteristics of the RDMA hardware: it deploys the read-only working set on the remote memory node, selects a suitable default data-block size according to the hardware resources, sets indexes for data blocks to achieve their transparent scattering and gathering, establishes a bidirectional one-sided operation mechanism that cooperates with local application reads and writes, and designs asynchronous read/write buffers so that local computation and RDMA data transfer run asynchronously in parallel. This reduces bandwidth occupation, improves transmission efficiency and the overall performance of applications under a far memory architecture, and lowers overall latency, even approaching the performance of local memory processing.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a schematic diagram of an embodiment framework;
FIG. 4 is a schematic diagram of an embodiment data block partitioning and integration;
FIG. 5 is a block diagram illustrating an embodiment RDMA single-sided read and compute communication parallelism.
Detailed Description
In this embodiment, taking a graph-computation application as an example and using RDMA as the far memory medium, the system environment is as follows: each server has two 20-core Intel(R) Xeon(R) Gold 6148 CPUs, 256 GB of memory, 21 TB of hard disk, and a dual-channel Mellanox ConnectX-5 RDMA network card. One server serves as the compute node and the other as the remote memory node (remote node).
As shown in fig. 1, the far memory access optimization system under a separated and combinable architecture according to this embodiment comprises at least one local compute node and at least one remote memory node, connected by wire through their respective RDMA network cards, over which they exchange data. Each node contains a memory region, consisting of a local region and a remote region, and the CPU of the local compute node exchanges data with the memory region through its cache.
As shown in fig. 3, the remote memory access optimization system comprises a compute node and a memory node, wherein: the compute node divides the application content into a read-only part and a non-read-only part according to the read/write characteristics of the local application; meanwhile, it partitions the data into blocks according to the parameters of the local network card and memory hardware and interacts with the memory node through remote memory read/write operations. According to the compute node's read/write requests, the memory node writes the data transmitted by the compute node into remote memory using one-sided memory writes, and reads the data required by the compute node back to it using index-based one-sided memory reads.
The compute node comprises an application read/write separation module, a data-block selection module, a data-block scatter/gather module, a first bidirectional one-sided read/write module, and a first asynchronous parallel module, wherein: the application read/write separation module divides the application content into a read-only part and a non-read-only part according to the read/write characteristics of the local application, obtaining two classes of data blocks that it passes to the data-block selection module; the data-block selection module selects a data-block size according to the parameters of the local network card and memory hardware and outputs it to the data-block scatter/gather module; the data-block scatter/gather module splits and merges data according to the selected block size and exchanges the resulting blocks with the first bidirectional one-sided read/write module; the first asynchronous parallel module maintains a send/receive isolation region through an asynchronous buffer: it asynchronously passes data blocks to the first bidirectional one-sided read/write module in far-memory-write mode, while in far-memory-read mode it reads returned data blocks from that module in real time and stores them into the asynchronous buffer for later use; the first bidirectional one-sided read/write module performs the remote memory reads and writes of the data: in far-memory-write mode it writes the data from the first asynchronous parallel module into remote memory using one-sided memory writes and retains the corresponding data index locally; in far-memory-read mode it sends the retained index to the far end, where the corresponding data block is located, read back by a far-memory read, and output to the first asynchronous parallel module.
The memory node cooperates with the compute node and comprises a second bidirectional one-sided read/write module and a second asynchronous parallel module, wherein: the second bidirectional one-sided read/write module handles the writes and reads issued by the compute node; in far-memory-write mode it receives one-sided memory-write data blocks from the compute node, writes them into the second asynchronous parallel module, and returns the corresponding index to the compute node; in far-memory-read mode it locates the corresponding data block according to the index received from the compute node, obtains it from the second asynchronous parallel module, and returns it to the compute node by a far-memory read. The second asynchronous parallel module, through an asynchronous buffer serving as a send/receive isolation region, reads in real time the data blocks delivered by the second bidirectional one-sided read/write module in far-memory-write mode and stores them into the asynchronous buffer, while in far-memory-read mode it asynchronously passes data blocks to the second bidirectional one-sided read/write module.
As shown in fig. 2, the present embodiment relates to a method for optimizing remote memory access under the split and combinable architecture of the above system, which includes the following steps:
step 1) dividing the edge data of the graph calculation into a read-only working set.
Step 2) In the preprocessing stage, the read-only working set of the graph computation is deployed to remote memory by RDMA Write.
Step 3) The default size of the transmitted data block is selected according to the hardware characteristics: the default block size is computed as Chunk = α × Channel × Frame ÷ Core from the number of PCIe channels (Channel) the motherboard uses for one data transfer, the number of CPUs (Core) on the motherboard, and the number of data frames (Frame) of the RDMA network card, with 1 ≤ α ≤ 1024.
Step 4) As shown in fig. 4, the data is split and gathered according to the block size determined in step 3). Specifically, when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately. When Data_block is smaller than or equal to Chunk, it is merged into a single Chunk for sending. On receiving, if the data was divided into β sub-blocks, they are reassembled in their original order.
Step 5) RDMA port bindings and connection setup are first opened locally and remotely as shown in fig. 5.
Step 6) Three kinds of buffers, namely an asynchronous parallel buffer, a send region, and a receive region, are set up simultaneously in the local and remote memory regions; the local send region L-SB and the remote receive region R-RB together execute data send (write) operations, and the local receive region L-RB and the remote send region R-SB together execute data receive (read) operations.
Step 7) In the preprocessing stage, the system writes the data blocks of the read-only working set into remote memory by one-sided RDMA Write, following the block-transmission scheme of step 4).
Step 8) During iterative computation, computation and communication are overlapped in time to achieve parallelism. The specific steps are:
i) Computing an iteration start;
ii) copying data of the RDMA receiving buffer to a computing area, and reading information of the read-only working set;
iii) preparing the index of the data block to be accessed in the next round, and sending the index to the far end;
iv) opening the RDMA receive buffer to receive data in preparation for the next round of data;
v) executing the calculation part of the iteration;
vi) returning to step i) until the algorithm converges or no new remote data block needs to be accessed.
Step 9) After the computation finishes, the memory regions occupied locally and remotely are reclaimed and the RDMA connection is closed.
In practical experiments, the BFS and PageRank algorithms of the GridGraph graph-computation framework were rewritten to access far memory with RDMA in the manner described above, processing four data sets (ranging from 1 GB to 32 GB), including LiveJournal. The results, summarized in the table below, show that while saving 80% of local memory, the total time improves by about 8.3 times over the latest far memory access framework Fastswap. Compared with the prior art, the method achieves a smaller local memory footprint, lower transmission volume, and lower total latency.
(Table of experimental results, presented as an image in the original publication.)
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. An RDMA (Remote Direct Memory Access) remote memory access method with application-parallel cooperation under a separated and combinable architecture, characterized in that: first, according to the memory read/write frequency of the application, the writable working set is deployed on the local compute node and the read-only working set on the remote memory node; during data transmission, a suitable default data-block size is selected according to the characteristics of the hardware resources, and transparent scattering and gathering of data blocks is achieved by assigning an index to each data block and dynamically chunking data during RDMA transmission; a bidirectional one-sided operation mechanism that cooperates with local application reads and writes is realized using one-sided reads and writes together with the queue-based RDMA mechanism; and buffers are set up under an event-notification-based asynchronous read/write mechanism to achieve asynchronous parallel processing of local computation and RDMA data reads and writes;
the said separated and combined structure is: the data center is provided with a framework for flexibly combining and matching a plurality of server CPUs and memories in a network connection mode, wherein: taking a server with a calculation task as a calculation node and a server with a memory access function as a memory node;
the far memory means: a distributed architecture comprising at least one compute node and at least one memory node, wherein: the computing node and the memory node respectively comprise a server, and the computing node and the memory node are in wired connection through respective RDMA network cards;
the server takes a CPU as a computing core and a DRAM as a memory unit, the RDMA network card is connected with a mainboard of the server through PCIe, the CPU of each server uses a local memory and uses a remote memory through the RDMA network card without occupying the resources of the remote CPU.
2. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the working set deployment specifically comprises:
i) dividing a read-only working set according to the memory read-write frequency of the application;
ii) during preprocessing, the data blocks of the read-only working set divided in step i) are transmitted block by block to the remote memory region by RDMA Write;
iii) during computation, the local application continuously issues requests to read remote data blocks, and the remote end, according to each received request from the server program, returns the corresponding data block to the local machine by RDMA Read for use by the current program.
3. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the default data block size is Chunk = α × Channel × Frame ÷ Core, where Channel is the number of PCIe channels the motherboard uses for one data transfer, Core is the number of CPUs on the motherboard, Frame is the number of data frames of the RDMA network card, and 1 ≤ α ≤ 1024.
4. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein setting an index for a data block means: the index consists of the address of the memory region corresponding to the data block together with its local key (lkey) and remote key (rkey).
5. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the dynamic chunking during RDMA transmission is: when the data block Data_block currently to be sent is larger than the currently configured default size Chunk, it is split on sending into β = ⌈Data_block ÷ Chunk⌉ sub-blocks that are sent separately; otherwise it is sent as a single Chunk, thereby achieving transparent scattering; on receiving, the β sub-blocks are reassembled in their original order according to their indexes, thereby achieving gathering.
6. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the bidirectional one-sided operation mechanism that cooperates with local application reads and writes is: the server program sets up a buffer for receiving information and sends the index of the data to be read to the remote end; the remote end locates the corresponding data block according to the index and, using one-sided read/write operations under the queue-based RDMA mechanism, writes the data directly into the server's buffer without any data copying;
in one-sided operation, every data block read into or written from the receive buffer is treated as newly read data; a later data block overwrites the information of the previous data block in the buffer.
7. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the queue-based RDMA mechanism is:
Step 1: a send (receive) request for event A joins the queue;
Step 2: event A executes and data read/write begins;
Step 3: event A pops from the send (receive) queue and joins the completion queue;
Step 4: the next send (receive) event B enters the send (receive) queue;
Step 5: event A pops from the completion queue and its status is scanned;
Step 6: if event A's status is success, B begins to execute; if the status is unsuccessful, an error is reported;
Step 7: steps 2-6 are repeated until no new event joins the send (receive) queue.
8. The RDMA remote memory access method with application-parallel cooperation under a separated and combinable architecture as claimed in claim 1, wherein the event-notification-based asynchronous read/write mechanism is: the moment the remote memory read/write starts, the operation is regarded as successful and the event is moved from the send/receive queue to the completion queue; the actual duration of the read/write depends on the size of the data block and the current network bandwidth.
9. The RDMA remote memory access method with application-parallel collaboration under the separated and combinable architecture as claimed in claim 1, wherein the buffers are located in the local memory region and the remote memory region, and specifically comprise: an asynchronous parallel buffer, a send region, and a receive region, wherein: the send region and the receive region perform the corresponding data send or receive operations, the asynchronous parallel buffer temporarily stores transferred data to support asynchronous reads, and data written into the asynchronous parallel buffer is not overwritten until it is rewritten the next time.
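The three regions above can be sketched as a small container; this is an illustrative model under assumed names (`MemoryRegions`, `stage_for_compute`), showing only the stability guarantee of the asynchronous parallel buffer:

```python
class MemoryRegions:
    def __init__(self):
        self.send_region = None      # staged outgoing data
        self.recv_region = None      # landing zone for incoming writes
        self.parallel_buffer = None  # stable copy for asynchronous reads

    def receive(self, block):
        # Incoming one-sided writes land here and may be overwritten
        # by the next transfer at any time.
        self.recv_region = block

    def stage_for_compute(self):
        # Copy out of the volatile receive region; this copy is only
        # replaced when stage_for_compute() is called again.
        self.parallel_buffer = self.recv_region
        return self.parallel_buffer

regions = MemoryRegions()
regions.receive(b"round-1")
stable = regions.stage_for_compute()
regions.receive(b"round-2")               # recv_region is overwritten...
print(stable == regions.parallel_buffer)  # True: staged copy is unchanged
```

The design choice being modeled: the receive region may be clobbered by the next round's transfer, so reads that must stay valid across rounds go through the asynchronous parallel buffer instead.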
10. The RDMA remote memory access method with application-parallel collaboration under the separated and combinable architecture as claimed in claim 1, wherein the asynchronous parallel processing is: during local computation, the RDMA data transfer is prepared concurrently, that is, each iteration already starts receiving the data that the next iteration will use, with the specific steps:
i) a compute iteration starts;
ii) the data in the RDMA receive buffer is copied to the compute area, and the read-only working set is read;
iii) the index of the data block to be accessed in the next round is prepared and sent to the remote end;
iv) the RDMA receive buffer is opened to receive the next round's data;
v) the compute part of the current iteration is executed;
vi) return to step i) until the algorithm converges or there is no new remote data block to access.
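The loop in steps i)-vi) can be sketched as a double-buffered pipeline; this is a minimal simulation under assumed names (`fetch_block` stands in for the RDMA receive, `data.upper()` for the local compute), not the patent's implementation:

```python
import threading

def fetch_block(index, store):
    # Stand-in for the remote one-sided write into the receive buffer.
    store[index] = f"block-{index}"

def pipeline(num_iters):
    results, store = [], {}
    fetch_block(0, store)                           # warm-up fetch
    for k in range(num_iters):
        data = store[k]                             # ii) copy to compute area
        t = None
        if k + 1 < num_iters:                       # iii)/iv) prefetch next
            t = threading.Thread(target=fetch_block, args=(k + 1, store))
            t.start()
        results.append(data.upper())                # v) compute this round
        if t:
            t.join()                                # vi) next block is ready
    return results

print(pipeline(3))  # ['BLOCK-0', 'BLOCK-1', 'BLOCK-2']
```

The overlap is the whole point: when compute time and transfer time are comparable, fetching block k+1 while computing on block k hides most of the remote-access latency.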
CN202110209483.6A 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture Active CN112817887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110209483.6A CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110209483.6A CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Publications (2)

Publication Number Publication Date
CN112817887A CN112817887A (en) 2021-05-18
CN112817887B true CN112817887B (en) 2021-09-17

Family

ID=75865550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110209483.6A Active CN112817887B (en) 2021-02-24 2021-02-24 Far memory access optimization method and system under separated combined architecture

Country Status (1)

Country Link
CN (1) CN112817887B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448897B (en) * 2021-07-12 2022-09-06 上海交通大学 Optimization method suitable for pure user mode far-end direct memory access
CN113395359B (en) * 2021-08-17 2021-10-29 苏州浪潮智能科技有限公司 Filecoin cluster data transmission method and system based on remote direct memory access
CN115495246B (en) * 2022-09-30 2023-04-18 上海交通大学 Hybrid remote memory scheduling method under separated memory architecture

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7281030B1 (en) * 1999-09-17 2007-10-09 Intel Corporation Method of reading a remote memory
CN105426321A (en) * 2015-11-13 2016-03-23 上海交通大学 RDMA friendly caching method using remote position information
CN106844048A (en) * 2017-01-13 2017-06-13 上海交通大学 Distributed shared memory method and system based on hardware features
CN108268208A (en) * 2016-12-30 2018-07-10 清华大学 A kind of distributed memory file system based on RDMA
CN110262754A (en) * 2019-06-14 2019-09-20 华东师范大学 A kind of distributed memory system and lightweight synchronized communication method towards NVMe and RDMA
CN111221773A (en) * 2020-01-15 2020-06-02 华东师范大学 Data storage architecture method based on RMDA high-speed network and skip list

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539780B2 (en) * 2003-12-01 2009-05-26 International Business Machines Corporation Asynchronous completion notification for an RDMA system
US20060168094A1 (en) * 2005-01-21 2006-07-27 International Business Machines Corporation DIRECT ACCESS OF SCSI BUFFER WITH RDMA ATP MECHANISM BY iSCSI TARGET AND/OR INITIATOR
US8966195B2 (en) * 2009-06-26 2015-02-24 Hewlett-Packard Development Company, L.P. Direct memory access and super page swapping optimizations for a memory blade
CN105589664B (en) * 2015-12-29 2018-07-31 四川中电启明星信息技术有限公司 Virtual memory high speed transmission method
CN111078607B (en) * 2019-12-24 2023-06-23 上海交通大学 Network access programming framework deployment method and system for RDMA (remote direct memory access) and nonvolatile memory
CN111400307B (en) * 2020-02-20 2023-06-23 上海交通大学 Persistent hash table access system supporting remote concurrent access
CN111400306B (en) * 2020-02-20 2023-03-28 上海交通大学 RDMA (remote direct memory Access) -and non-volatile memory-based radix tree access system
CN111459418B (en) * 2020-05-15 2021-07-23 南京大学 RDMA (remote direct memory Access) -based key value storage system transmission method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7281030B1 (en) * 1999-09-17 2007-10-09 Intel Corporation Method of reading a remote memory
CN105426321A (en) * 2015-11-13 2016-03-23 上海交通大学 RDMA friendly caching method using remote position information
CN108268208A (en) * 2016-12-30 2018-07-10 清华大学 A kind of distributed memory file system based on RDMA
CN106844048A (en) * 2017-01-13 2017-06-13 上海交通大学 Distributed shared memory method and system based on hardware features
CN110262754A (en) * 2019-06-14 2019-09-20 华东师范大学 A kind of distributed memory system and lightweight synchronized communication method towards NVMe and RDMA
CN111221773A (en) * 2020-01-15 2020-06-02 华东师范大学 Data storage architecture method based on RMDA high-speed network and skip list

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A Survey of Distributed Storage Systems Based on RDMA" ("基于RDMA的分布式存储系统研究综述"); Chen Youmin et al.; Journal of Computer Research and Development (《计算机研究与发展》); 20190129; pp. 227-238 *

Also Published As

Publication number Publication date
CN112817887A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112817887B (en) Far memory access optimization method and system under separated combined architecture
CN113485823A (en) Data transmission method, device, network equipment and storage medium
CN111339192A (en) Distributed edge computing data storage system
CN111708738B (en) Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data
CN107907867A (en) A kind of real-time SAR quick look systems of multi-operation mode
CN111708719B (en) Computer storage acceleration method, electronic equipment and storage medium
CN114201421B (en) Data stream processing method, storage control node and readable storage medium
US11243714B2 (en) Efficient data movement method for in storage computation
WO2023019800A1 (en) Filecoin cluster data transmission method and system based on remote direct memory access
CN113590528A (en) Multi-channel data acquisition, storage and playback card, system and method based on HP interface
CN101576912A (en) System and reading and writing method for realizing asynchronous input and output interface of distributed file system
CN112445735A (en) Method, computer equipment, system and storage medium for transmitting federated learning data
US20080201549A1 (en) System and Method for Improving Data Caching
JP4208506B2 (en) High-performance storage device access environment
US7600074B2 (en) Controller of redundant arrays of independent disks and operation method thereof
US7409486B2 (en) Storage system, and storage control method
CN103986771A (en) High-availability cluster management method independent of shared storage
CN112003800B (en) Method and device for exchanging and transmitting messages of ports with different bandwidths
CN116074179B (en) High expansion node system based on CPU-NPU cooperation and training method
WO2023093608A1 (en) Automatic distributed cloud storage scheduling interaction method and apparatus, and device
US11847049B2 (en) Processing system that increases the memory capacity of a GPGPU
CN105740166A (en) Cache reading and reading processing method and device
CN106502828A (en) A kind of remote copy method based on LVM of optimization
US8054857B2 (en) Task queuing methods and systems for transmitting frame information over an I/O interface
JPH0715670B2 (en) Data processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant