CN112565326A - RDMA communication address exchange method for distributed file system
- Publication number: CN112565326A
- Application number: CN201910918615.5A
- Authority: CN (China)
- Prior art keywords: rdma, address, information, message, remote
- Legal status: Granted
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Abstract
The invention discloses an RDMA communication address exchange method for a distributed file system. The method comprises an RDMA data sending phase and an RDMA data receiving phase of the distributed file system and includes the following steps: pre-exchanging a plurality of RDMA addresses when a connection with the remote end is established; according to the type of file operation, packaging the file operation data into a locally registered RDMA address to generate an RDMA message, the file operation types including open, close, lookup, write and read; allocating a remote RDMA address for the file operation according to the current usage of the remote RDMA addresses; and, if the free RDMA addresses number fewer than half of the current RDMA addresses, carrying an RDMA address shortage alarm in the generated file operation RDMA message, whereupon the remote end allocates 8 more RDMA addresses. The method causes no memory waste, does not reduce communication efficiency and consumes no additional system resources; it also reduces redundant RDMA address exchange operations and improves the overall efficiency of RDMA communication.
Description
Technical Field
The invention belongs to the technical field of communication address exchange methods, and in particular relates to an RDMA (Remote Direct Memory Access) communication address exchange method for a distributed file system.
Background
Communication protocols based on RDMA have become the preferred means of high-speed, high-performance data transfer, and more and more distributed file systems communicate over RDMA. In RDMA communication, before data can be transferred, the target of a write or read must send its RDMA address to the initiator; this step is called the address exchange operation and is an essential part of current RDMA communication. The mainstream address exchange methods all have drawbacks: pre-allocating memory blocks and exchanging all addresses once before communication wastes memory; halting the communication thread to exchange addresses when they run short during RDMA communication reduces communication efficiency; and starting additional threads that monitor the state of the communication addresses in order to exchange them consumes additional resources.
To address server-side data processing delays in network transmission, Remote Direct Memory Access (RDMA) technology was developed. It moves data directly from the memory of one system into the memory of a remote system without involving the operating system, so little CPU processing is required; it eliminates the overhead of extra memory copies and context switches, freeing memory bandwidth and CPU cycles and improving application performance. RDMA offers low latency, low CPU load and high bandwidth. Because a large-scale distributed file system performs file-operation communication between many clients and servers, it places high demands on network bandwidth and latency, and more distributed file systems, such as GlusterFS and Ceph, have begun to use RDMA for network communication.
To transfer data using RDMA, the first operation is the RDMA address exchange: before data can be sent from the source memory to the target memory, the source must be told the address of the target memory. Because a distributed file system generates a large volume of I/O, RDMA address exchanges occur frequently, and because such a system is highly sensitive to network latency, the address exchange operation can become its performance bottleneck during RDMA communication.
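For readers unfamiliar with verbs-level RDMA, it may help to see what is actually being exchanged. In the usual verbs model, a one-sided RDMA write or read is posted against a remote buffer identified by its virtual address and the remote key (rkey) returned when that buffer was registered; the buffer length is normally exchanged alongside them. The sketch below models such a descriptor and a fixed-size wire encoding for it. The class name `RdmaAddr` and the 16-byte big-endian layout are illustrative assumptions, not taken from the patent or from any RDMA library.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: 64-bit virtual address, 32-bit rkey, 32-bit length,
# all big-endian ("network order"). 8 + 4 + 4 = 16 bytes per descriptor.
_ADDR_FMT = "!QII"
ADDR_SIZE = struct.calcsize(_ADDR_FMT)


@dataclass(frozen=True)
class RdmaAddr:
    """What a target must tell an initiator before a one-sided RDMA write/read."""
    vaddr: int   # virtual address of the registered remote buffer
    rkey: int    # remote key obtained when the buffer was registered
    length: int  # size of the buffer in bytes

    def pack(self) -> bytes:
        return struct.pack(_ADDR_FMT, self.vaddr, self.rkey, self.length)

    @classmethod
    def unpack(cls, data: bytes) -> "RdmaAddr":
        vaddr, rkey, length = struct.unpack(_ADDR_FMT, data[:ADDR_SIZE])
        return cls(vaddr, rkey, length)


if __name__ == "__main__":
    a = RdmaAddr(vaddr=0x7F00_0000_1000, rkey=0x1234, length=4096)
    wire = a.pack()
    print(len(wire), RdmaAddr.unpack(wire) == a)  # prints: 16 True
```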
The mainstream RDMA address exchange methods currently used in distributed file systems are as follows:
(1) the client and the server pre-allocate a large memory region and exchange its addresses once before communication; the drawback is that when the file system is lightly loaded, most of the allocated memory sits unused, so memory is wasted;
(2) the client and the server pre-allocate several small memory blocks for address exchange; if addresses run short during RDMA communication, the communication thread temporarily stops communicating in order to allocate memory and exchange addresses, which reduces communication efficiency and increases latency;
(3) an additional thread is started to monitor the state of RDMA communication; it allocates memory and exchanges addresses when addresses are insufficient and releases memory when they are plentiful; the drawback is that it consumes additional system resources.
Disclosure of Invention
The invention provides an RDMA communication address exchange method that causes no memory waste, does not reduce communication efficiency and consumes no additional system resources, while reducing redundant RDMA address exchange operations and improving the overall efficiency of RDMA communication.
To achieve this object, the invention adopts the following technical solution: an RDMA communication address exchange method for a distributed file system, comprising an RDMA data sending phase and an RDMA data receiving phase of the distributed file system.
The RDMA data sending phase comprises the following steps:
S0: define the data structure of the RDMA message, which consists of an RDMA message header and RDMA message content;
RDMA message header | RDMA message content |
the RDMA message content stores the file operation type information data and the file content data to be transmitted;
when a message is sent, its RDMA message header carries the usage of the local RDMA receive addresses and alarm information about the remote RDMA receive addresses;
when a message is received, its RDMA message header therefore supplies the usage of the remote RDMA receive addresses and alarm information about the local RDMA receive addresses;
an RDMA receive address is an RDMA address used to receive RDMA messages, and an RDMA send address is an RDMA address used to send RDMA messages;
file operation data: the specific data of a file operation; some operations carry only operation type information data, while others carry operation type information data plus file content data to be transmitted;
S1: the local end and the remote end each allocate a plurality of RDMA receive addresses and RDMA send addresses in memory in advance, and exchange their RDMA receive addresses by filling them into the handshake information exchanged when the TCP connection is established; that is, the remote end informs the local end of its RDMA receive addresses, so that the local end can send RDMA messages to the remote end through them;
S2: according to the type of the file operation, package the file operation data into the RDMA message content of the local RDMA message as file operation type information data (plus file content data where the operation has any);
S3: allocate a remote RDMA receive address for the current file operation according to the current usage of the remote RDMA receive addresses, as obtained in step S2 of the most recent local receiving flow;
S4: if the unused remote RDMA receive addresses number fewer than half of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address shortage alarm flag into the RDMA message header of the generated file operation RDMA message, so that the shortage alarm is sent, and proceed to the next step;
if the free remote RDMA receive addresses number between one half and three quarters of all current remote RDMA receive addresses, fill no special flag into the RDMA message header, fill in only the usage of the local RDMA receive addresses, and proceed to the next step;
if the free remote RDMA receive addresses number more than three quarters of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address surplus alarm flag into the header of the generated file operation RDMA message, so that the surplus alarm is also sent and the remote end releases 8 RDMA receive addresses upon receiving the message;
S5: send the file operation RDMA message, composed of the RDMA message header and the RDMA message content, to the remote end.
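Steps S4 and S5 of the sending phase reduce to a threshold test on the peer's last reported usage followed by prepending a header to the message content. The sketch below is one possible encoding, assuming the usage is reported as a pair of counts (free and total receive addresses); the `Alarm` codes, the `!BHH` header layout and the function names are hypothetical, since the patent does not specify a wire format. Integer comparisons avoid floating-point rounding at the boundaries, and a peer with exactly one half or exactly three quarters of its addresses free falls into the no-alarm band, matching the wording of S4.

```python
import struct
from enum import IntEnum


class Alarm(IntEnum):
    NONE = 0      # free remote receive addresses within [1/2, 3/4] of the total
    SHORTAGE = 1  # fewer than half free: ask the peer to allocate more
    SURPLUS = 2   # more than three quarters free: ask the peer to release some


# Hypothetical header layout: alarm (u8), local free receive addresses (u16),
# local total receive addresses (u16).
HEADER_FMT = "!BHH"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def classify_remote_usage(free: int, total: int) -> Alarm:
    """Step S4: compare the peer's free receive addresses with its total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if 2 * free < total:        # free < total / 2
        return Alarm.SHORTAGE
    if 4 * free > 3 * total:    # free > 3 * total / 4
        return Alarm.SURPLUS
    return Alarm.NONE


def build_message(content: bytes, local_free: int, local_total: int,
                  remote_free: int, remote_total: int) -> bytes:
    """Steps S4-S5: prepend the header (local usage + alarm) to the message content."""
    alarm = classify_remote_usage(remote_free, remote_total)
    header = struct.pack(HEADER_FMT, alarm, local_free, local_total)
    return header + content
```

For example, if the peer holds 32 receive addresses, 15 free triggers a shortage alarm, 16 through 24 free send no alarm, and 25 or more free trigger a surplus alarm.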
The RDMA data receiving phase comprises the following steps:
S1: pre-exchange a plurality of RDMA addresses when establishing the connection with the remote end;
S2: receive the file operation RDMA message sent by the remote end, parse the file operation type information data and, from the RDMA message header, the usage of the remote RDMA receive addresses, and store that usage for use when the next message is sent;
S3: if the RDMA message header carries no RDMA address alarm information, execute the corresponding file operation;
S4: if the file operation message carries RDMA address alarm information, extract the specific alarm type from its RDMA message header;
S5: if the alarm type is RDMA receive address surplus, release 8 local RDMA receive addresses; if the alarm type is RDMA receive address shortage, allocate 8 new RDMA receive addresses and update the usage of the local RDMA receive addresses, which is reported to the remote end in step S4 of the next sending flow;
S6: execute the corresponding file operation.
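On the receiving side, the alarm in an incoming header refers to the receiver's own receive addresses, while the usage counts refer to the sender's. The sketch below, built on a hypothetical `!BHH` header (alarm, sender's free count, sender's total count) as an illustrative assumption, stores the peer's usage (S2) and adjusts the local pool by the patent's fixed step of 8 (S5). Clamping a release to the number of currently free addresses is an added assumption, since a receive buffer still in use cannot safely be deregistered; actual buffer registration and deregistration are represented only by counters.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum

ADJUST_STEP = 8  # receive addresses released or allocated per alarm (from the patent)


class Alarm(IntEnum):
    NONE = 0
    SHORTAGE = 1  # peer found fewer than half of our receive addresses free
    SURPLUS = 2   # peer found more than three quarters of them free


# Hypothetical header: alarm (u8), sender's free receive addresses (u16),
# sender's total receive addresses (u16).
HEADER_FMT = "!BHH"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


@dataclass
class ReceiveSide:
    local_total: int       # our registered receive addresses
    local_free: int        # of which currently unused
    remote_free: int = 0   # peer's usage from the last header (used by send-phase S3/S4)
    remote_total: int = 0

    def on_message(self, msg: bytes) -> bytes:
        """Receive-phase S2-S5; returns the content for the file operation (S3/S6)."""
        raw_alarm, peer_free, peer_total = struct.unpack(HEADER_FMT, msg[:HEADER_SIZE])
        # S2: store the sender's usage for our next send.
        self.remote_free, self.remote_total = peer_free, peer_total
        # S4: extract the alarm type (raises ValueError on an unknown code).
        alarm = Alarm(raw_alarm)
        # S5: adjust our own pool; the new usage goes out in our next header.
        if alarm is Alarm.SURPLUS:
            # Assumption: only buffers that are currently free can be released.
            n = min(ADJUST_STEP, self.local_free)
            self.local_free -= n
            self.local_total -= n
        elif alarm is Alarm.SHORTAGE:
            self.local_free += ADJUST_STEP
            self.local_total += ADJUST_STEP
        return msg[HEADER_SIZE:]
```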
Further improvements of the above technical solution are as follows:
1. In the above solution, the file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, writev, flush, fsync, setxattr, getxattr, setattr, fstat, ftruncate, statfs, truncate, fgetxattr, setxattr, close, or lookup.
2. In the above solution, the file operation is an open, write, close, or lookup operation.
Owing to the above technical solution, the invention has the following advantages over the prior art:
the RDMA communication address exchange method for a distributed file system needs no dedicated RDMA address exchange operation. Instead, address exchange is piggybacked on the file system's operation semantics: during every file operation the method checks whether the pre-exchanged RDMA address space is sufficient and, if it is not, carries supplementary RDMA address information inside the file operation. As a result, the method causes no memory waste, does not reduce communication efficiency, consumes no extra system resources, reduces redundant RDMA address exchange operations, and improves the overall efficiency of RDMA communication.
Drawings
FIG. 1 is a schematic flow diagram of the RDMA data sending phase of the invention;
FIG. 2 is a schematic flow diagram of the RDMA data receiving phase of the invention.
Detailed Description
The invention is further described below with reference to an embodiment.
Embodiment: an RDMA communication address exchange method for a distributed file system, comprising an RDMA data sending phase and an RDMA data receiving phase of the distributed file system.
The RDMA data sending phase comprises the following steps:
S0: define the data structure of the RDMA message, which consists of an RDMA message header and RDMA message content;
RDMA message header | RDMA message content |
the RDMA message content stores the file operation type information data and the file content data to be transmitted;
when a message is sent, its RDMA message header carries the usage of the local RDMA receive addresses and alarm information about the remote RDMA receive addresses;
when a message is received, its RDMA message header therefore supplies the usage of the remote RDMA receive addresses and alarm information about the local RDMA receive addresses;
an RDMA receive address is an RDMA address used to receive RDMA messages, and an RDMA send address is an RDMA address used to send RDMA messages;
file operation data: the specific data of a file operation; some operations carry only operation type information data, while others carry operation type information data plus file content data to be transmitted;
S1: the local end and the remote end each allocate a plurality of RDMA receive addresses and RDMA send addresses in memory in advance, and exchange their RDMA receive addresses by filling them into the handshake information exchanged when the TCP connection is established; that is, the remote end informs the local end of its RDMA receive addresses, so that the local end can send RDMA messages to the remote end through them;
S2: according to the type of the file operation, package the file operation data into the RDMA message content of the local RDMA message as file operation type information data (plus file content data where the operation has any);
S3: allocate a remote RDMA receive address for the current file operation according to the current usage of the remote RDMA receive addresses, as obtained in step S2 of the most recent local receiving flow;
S4: if the unused remote RDMA receive addresses number fewer than half of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address shortage alarm flag into the RDMA message header of the generated file operation RDMA message, so that the shortage alarm is sent, and proceed to the next step;
if the free remote RDMA receive addresses number between one half and three quarters of all current remote RDMA receive addresses, fill no special flag into the RDMA message header, fill in only the usage of the local RDMA receive addresses, and proceed to the next step;
if the free remote RDMA receive addresses number more than three quarters of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address surplus alarm flag into the header of the generated file operation RDMA message, so that the surplus alarm is also sent and the remote end releases 8 RDMA receive addresses upon receiving the message;
S5: send the file operation RDMA message, composed of the RDMA message header and the RDMA message content, to the remote end.
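Step S1 piggybacks the receive addresses on the TCP handshake, and step S3 then draws on that pre-exchanged set for every file operation. The sketch below models both halves: a count-prefixed list of hypothetical `(vaddr, rkey, length)` descriptors for the handshake, and a local view of the peer's receive addresses from which a free slot is taken. How a particular slot becomes free again is not specified at this level in the patent; the `mark_free` call standing in for the peer's usage report is an assumption.

```python
import struct
from dataclasses import dataclass, field

# Hypothetical descriptor layout: virtual address (u64), rkey (u32), length (u32).
_DESC_FMT = "!QII"
_DESC_SIZE = struct.calcsize(_DESC_FMT)  # 16 bytes


def pack_handshake(addrs: list) -> bytes:
    """S1: serialise this side's pre-allocated receive addresses for the TCP handshake."""
    out = bytearray(struct.pack("!H", len(addrs)))
    for vaddr, rkey, length in addrs:
        out += struct.pack(_DESC_FMT, vaddr, rkey, length)
    return bytes(out)


def unpack_handshake(buf: bytes) -> list:
    """S1 on the other side: recover the peer's receive addresses."""
    (count,) = struct.unpack_from("!H", buf, 0)
    return [struct.unpack_from(_DESC_FMT, buf, 2 + i * _DESC_SIZE) for i in range(count)]


@dataclass
class RemoteRecvPool:
    """Local view of the peer's receive addresses, consulted in S3 and S4."""
    addrs: list
    busy: set = field(default_factory=set)

    @property
    def free_count(self) -> int:
        return len(self.addrs) - len(self.busy)

    def allocate(self) -> int:
        """S3: reserve an unused remote receive address for the current file operation."""
        for slot in range(len(self.addrs)):
            if slot not in self.busy:
                self.busy.add(slot)
                return slot
        raise RuntimeError("no free remote receive address")

    def mark_free(self, slot: int) -> None:
        # Assumption: the peer's usage report tells us this slot may be reused.
        self.busy.discard(slot)
```

Here `free_count` and `len(addrs)` are the two quantities that the S4 threshold test compares.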
The RDMA data receiving phase comprises the following steps:
S1: pre-exchange a plurality of RDMA addresses when establishing the connection with the remote end;
S2: receive the file operation RDMA message sent by the remote end, parse the file operation type information data and, from the RDMA message header, the usage of the remote RDMA receive addresses, and store that usage for use when the next message is sent;
S3: if the RDMA message header carries no RDMA address alarm information, execute the corresponding file operation;
S4: if the file operation message carries RDMA address alarm information, extract the specific alarm type from its RDMA message header;
S5: if the alarm type is RDMA receive address surplus, release 8 local RDMA receive addresses; if the alarm type is RDMA receive address shortage, allocate 8 new RDMA receive addresses and update the usage of the local RDMA receive addresses, which is reported to the remote end in step S4 of the next sending flow;
S6: execute the corresponding file operation.
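The two phases form a feedback loop: each message reports the sender's own usage, and the receiver of that report decides on its next send whether to raise an alarm. The simulation below follows that loop with counters only, ignoring the addresses consumed by individual messages; the `Peer` class, the tuple header and the starting numbers are hypothetical. Peer B starts with 4 of 16 receive addresses free, so A's first message carries a shortage alarm, B allocates 8 more, and B's reply reports 12 free out of 24, exactly half, which is still inside the no-alarm band.

```python
from dataclasses import dataclass
from enum import Enum

STEP = 8  # receive addresses allocated or released per alarm, as in the patent


class Alarm(Enum):
    NONE = "none"
    SHORTAGE = "shortage"
    SURPLUS = "surplus"


def classify(free: int, total: int) -> Alarm:
    """Send-phase S4 thresholds: < 1/2 free is a shortage, > 3/4 free a surplus."""
    if 2 * free < total:
        return Alarm.SHORTAGE
    if 4 * free > 3 * total:
        return Alarm.SURPLUS
    return Alarm.NONE


@dataclass
class Peer:
    name: str
    local_total: int        # this peer's receive addresses
    local_free: int         # of which unused
    remote_total: int = 0   # peer's usage, learned at handshake and from headers
    remote_free: int = 0

    def handshake(self, other: "Peer") -> None:
        # S1 in both phases: each side learns the other's initial usage.
        self.remote_total, self.remote_free = other.local_total, other.local_free
        other.remote_total, other.remote_free = self.local_total, self.local_free

    def make_header(self) -> tuple:
        # Send-phase S4: alarm about the peer's addresses + our own usage.
        return classify(self.remote_free, self.remote_total), self.local_free, self.local_total

    def on_header(self, header: tuple) -> None:
        alarm, peer_free, peer_total = header
        # Receive-phase S2: keep the peer's usage for the next send.
        self.remote_free, self.remote_total = peer_free, peer_total
        # Receive-phase S5: the alarm concerns *our* receive addresses.
        if alarm is Alarm.SHORTAGE:
            self.local_total += STEP
            self.local_free += STEP
        elif alarm is Alarm.SURPLUS:
            n = min(STEP, self.local_free)  # assumption: only free buffers are released
            self.local_total -= n
            self.local_free -= n


def demo() -> tuple:
    a = Peer("A", local_total=16, local_free=12)
    b = Peer("B", local_total=16, local_free=4)
    a.handshake(b)
    b.on_header(a.make_header())  # A sees B at 4/16 free -> shortage alarm
    a.on_header(b.make_header())  # B reports 12/24 free, no alarm about A
    return a, b


if __name__ == "__main__":
    a, b = demo()
    print(b.local_free, b.local_total, a.make_header()[0])  # prints: 12 24 Alarm.NONE
```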
The file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, writev, flush, fsync, setxattr, getxattr, setattr, fstat, ftruncate, statfs, truncate, fgetxattr, setxattr, close, or lookup.
The file operation is an open, write, close, or lookup operation.
With the RDMA communication address exchange method for a distributed file system, no dedicated RDMA address exchange operation is needed. Instead, address exchange is piggybacked on the file system's operation semantics: during every file operation the method checks whether the pre-exchanged RDMA address space is sufficient and, if not, carries supplementary RDMA address information inside the file operation. As a result, the method causes no memory waste, does not reduce communication efficiency, consumes no extra system resources, reduces redundant RDMA address exchange operations, and improves the overall efficiency of RDMA communication.
The above embodiment merely illustrates the technical ideas and features of the invention; its purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All equivalent changes or modifications made in accordance with the spirit of the invention shall fall within the scope of protection of the invention.
Claims (3)
1. An RDMA communication address exchange method for a distributed file system, characterized in that it comprises an RDMA data sending phase and an RDMA data receiving phase of the distributed file system,
the RDMA data sending phase comprising the following steps:
S0: define the data structure of the RDMA message, which consists of an RDMA message header and RDMA message content;
the RDMA message content stores the file operation type information data and the file content data to be transmitted;
when a message is sent, its RDMA message header carries the usage of the local RDMA receive addresses and alarm information about the remote RDMA receive addresses;
when a message is received, its RDMA message header therefore supplies the usage of the remote RDMA receive addresses and alarm information about the local RDMA receive addresses;
an RDMA receive address is an RDMA address used to receive RDMA messages, and an RDMA send address is an RDMA address used to send RDMA messages;
file operation data: the specific data of a file operation; some operations carry only operation type information data, while others carry operation type information data plus file content data to be transmitted;
S1: the local end and the remote end each allocate a plurality of RDMA receive addresses and RDMA send addresses in memory in advance, and exchange their RDMA receive addresses by filling them into the handshake information exchanged when the TCP connection is established; that is, the remote end informs the local end of its RDMA receive addresses, so that the local end can send RDMA messages to the remote end through them;
S2: according to the type of the file operation, package the file operation data into the RDMA message content of the local RDMA message as file operation type information data (plus file content data where the operation has any);
S3: allocate a remote RDMA receive address for the current file operation according to the current usage of the remote RDMA receive addresses, as obtained in step S2 of the most recent local receiving flow;
S4: if the unused remote RDMA receive addresses number fewer than half of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address shortage alarm flag into the RDMA message header of the generated file operation RDMA message, so that the shortage alarm is sent, and proceed to the next step;
if the free remote RDMA receive addresses number between one half and three quarters of all current remote RDMA receive addresses, fill no special flag into the RDMA message header, fill in only the usage of the local RDMA receive addresses, and proceed to the next step;
if the free remote RDMA receive addresses number more than three quarters of all current remote RDMA receive addresses, fill the usage of the local RDMA receive addresses and a remote RDMA receive address surplus alarm flag into the header of the generated file operation RDMA message, so that the surplus alarm is also sent and the remote end releases 8 RDMA receive addresses upon receiving the message;
S5: send the file operation RDMA message, composed of the RDMA message header and the RDMA message content, to the remote end;
the RDMA data receiving phase comprising the following steps:
S1: pre-exchange a plurality of RDMA addresses when establishing the connection with the remote end;
S2: receive the file operation RDMA message sent by the remote end, parse the file operation type information data and, from the RDMA message header, the usage of the remote RDMA receive addresses, and store that usage for use when the next message is sent;
S3: if the RDMA message header carries no RDMA address alarm information, execute the corresponding file operation;
S4: if the file operation message carries RDMA address alarm information, extract the specific alarm type from its RDMA message header;
S5: if the alarm type is RDMA receive address surplus, release 8 local RDMA receive addresses; if the alarm type is RDMA receive address shortage, allocate 8 new RDMA receive addresses and update the usage of the local RDMA receive addresses, which is reported to the remote end in step S4 of the next sending flow;
S6: execute the corresponding file operation.
2. The RDMA communication address exchange method for a distributed file system according to claim 1, characterized in that the file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, writev, flush, fsync, setxattr, getxattr, setattr, ftruncate, statfs, truncate, fgetxattr, setxattr, close, or lookup.
3. The RDMA communication address exchange method for a distributed file system according to claim 1, characterized in that the file operation is an open, write, close, or lookup operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910918615.5A CN112565326B (en) | 2019-09-26 | 2019-09-26 | RDMA communication address exchange method for distributed file system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112565326A | 2021-03-26 |
CN112565326B | 2023-10-17 |
Family
ID=75029871
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112565326B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170187621A1 (en) * | 2015-12-29 | 2017-06-29 | Amazon Technologies, Inc. | Connectionless reliable transport |
US20170199841A1 (en) * | 2016-01-13 | 2017-07-13 | Red Hat, Inc. | Pre-registering memory regions for remote direct memory access in a distributed file system |
US20180024865A1 (en) * | 2016-07-22 | 2018-01-25 | Fujitsu Limited | Parallel processing apparatus and node-to-node communication method |
CN110191194A (en) * | 2019-06-13 | 2019-08-30 | Huazhong University of Science and Technology (华中科技大学) | Distributed file system data transmission method and system based on RDMA network |
Non-Patent Citations (1)
Title |
---|
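Liu Lu; Zhang Lei; Cao Jijun; Dai Yi: "Design of an RDMA reliable transport protocol based on dynamic connections" (基于动态连接的RDMA可靠传输协议设计), Computer Engineering & Science (计算机工程与科学), no. 08 * |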
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |