CN112565326B - RDMA communication address exchange method for distributed file system - Google Patents

RDMA communication address exchange method for distributed file system

Info

Publication number
CN112565326B
CN112565326B CN201910918615.5A CN201910918615A CN112565326B CN 112565326 B CN112565326 B CN 112565326B CN 201910918615 A CN201910918615 A CN 201910918615A CN 112565326 B CN112565326 B CN 112565326B
Authority
CN
China
Prior art keywords
rdma
address
message
receiving
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910918615.5A
Other languages
Chinese (zh)
Other versions
CN112565326A (en
Inventor
肖伟
何晓斌
余婷
陈起
高洁
王涛
罗永耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910918615.5A
Publication of CN112565326A
Application granted
Publication of CN112565326B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an RDMA communication address exchange method for a distributed file system. The method covers the RDMA data sending stage and the RDMA data receiving stage of the distributed file system and comprises the following steps: pre-exchanging a plurality of RDMA addresses when a connection is established with the remote end; encapsulating file operation data into a locally registered RDMA address according to the file operation type to generate an RDMA message, where the file operation types include open, close, lookup, write, and read; allocating a remote RDMA address to the current file operation according to the current usage of the remote RDMA addresses; and, if the idle RDMA addresses number fewer than half of the current RDMA addresses, attaching RDMA address tension alarm information to the generated file operation RDMA message so that the remote end applies for 8 more RDMA addresses. The invention avoids wasting memory, reducing communication efficiency, and consuming extra system resources; it reduces redundant RDMA address exchange operations and improves the overall efficiency of RDMA communication.

Description

RDMA communication address exchange method for distributed file system
Technical Field
The invention belongs to the technical field of communication address exchange methods, and particularly relates to an RDMA communication address exchange method oriented to a distributed file system.
Background
Communication protocols based on RDMA technology have become the method of choice for high-speed, high-performance data transfer, and a growing number of distributed file systems communicate using RDMA. In an RDMA communication instance, data can be transferred only after the target end of a write or read sends its RDMA communication address to the initiating end. This operation is called the address exchange operation, and it is an essential link in current RDMA communication. The mainstream address exchange methods each have a defect: applying for one large block of memory and exchanging its addresses once before communication wastes memory; having the communication thread pause communication to exchange addresses when addresses run short during RDMA communication reduces communication efficiency; and starting an additional thread that monitors the state of the communication addresses and performs address exchanges consumes additional resources.
Remote Direct Memory Access (RDMA) was developed to address server-side data processing delays in network transmission. It allows data to be moved quickly from the memory of one system to the memory of a remote system without involving the operating system and with little use of the computer's processing power. This eliminates the overhead of external memory copies and context switches and frees memory bandwidth and CPU cycles, which improves application performance. RDMA offers low latency, low load, and high bandwidth. In a large-scale distributed file system, many clients perform file operation communication with the servers, which places high demands on network bandwidth and latency, so a growing number of distributed file systems, such as GlusterFS and Ceph, use RDMA for network communication.
To transfer data with RDMA, the first operation to perform is the RDMA address exchange: before data can be sent from a source memory to a destination memory, the source must be told the address of the destination memory. Because of its large IO volume, a distributed file system may exchange RDMA addresses frequently, and because it is highly sensitive to network latency, the RDMA address exchange operation can become a performance bottleneck in RDMA communication.
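As a rough illustration of what an address exchange carries, the sketch below serializes a list of receive-buffer descriptors that a destination could hand to a source. The field layout (address, length, remote key), the byte order, and the function names are assumptions made for this example; the text does not specify a wire format.

```python
import struct
from typing import List, Tuple

# Hypothetical descriptor: 64-bit address, 32-bit length, 32-bit remote key,
# all in network byte order. None of these fields is prescribed by the text.
ENTRY = struct.Struct("!QII")
COUNT = struct.Struct("!H")

Descriptor = Tuple[int, int, int]


def encode_addresses(entries: List[Descriptor]) -> bytes:
    """Pack a count followed by one fixed-size record per receive address."""
    return COUNT.pack(len(entries)) + b"".join(ENTRY.pack(*e) for e in entries)


def decode_addresses(payload: bytes) -> List[Descriptor]:
    """Inverse of encode_addresses: read the count, then each record."""
    (count,) = COUNT.unpack_from(payload)
    return [
        ENTRY.unpack_from(payload, COUNT.size + i * ENTRY.size)
        for i in range(count)
    ]
```

A payload of this kind could be carried in any message the two ends already exchange, so the source learns several destination addresses in one round trip.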
The mainstream RDMA address exchange methods currently used in distributed file systems are as follows:
(1) A large block of memory is applied for in advance and its addresses are exchanged once before communication. The defect of this method is that, if the file system is under low load, most of the applied memory goes unused and is wasted;
(2) The client and the server apply in advance for several small blocks of memory and exchange their addresses. If addresses run short during RDMA communication, the communication thread pauses communication to apply for memory and exchange addresses, which reduces communication efficiency and increases latency;
(3) An additional thread is started to monitor the state of RDMA communication; it applies for memory and exchanges addresses when addresses run short and releases memory when addresses are plentiful. The defect of this method is that it consumes additional system resources.
Disclosure of Invention
The invention aims to provide an RDMA communication address exchange method for a distributed file system that avoids wasting memory, reducing communication efficiency, and consuming extra system resources, that reduces redundant RDMA address exchange operations, and that improves the overall efficiency of RDMA communication.
To achieve the above purpose, the invention adopts the following technical scheme: an RDMA communication address exchange method for a distributed file system, comprising an RDMA data sending stage and an RDMA data receiving stage of the distributed file system.
The RDMA data sending stage comprises the following steps:
S0, defining the data structure of an RDMA message, which consists of an RDMA message header followed by RDMA message content:
RDMA message header | RDMA message content
The RDMA message content stores the file operation type information data and the file content data to be transmitted;
when a message is sent, the RDMA message header stores the usage of the local RDMA receiving addresses and the alarm information for the remote RDMA receiving addresses;
when a message is received, the RDMA message header stores the usage of the remote RDMA receiving addresses and the alarm information for the local RDMA receiving addresses;
an RDMA receiving address is an RDMA address used for receiving RDMA messages, and an RDMA sending address is an RDMA address used for sending RDMA messages;
file operation data is the specific data of a file operation: some operations carry only operation type information data, while others carry both operation type data and file content data to be transmitted;
S1, the local end and the remote end each apply in advance for a plurality of RDMA receiving addresses and RDMA sending addresses in memory, and exchange the RDMA receiving addresses by filling the RDMA receiving addresses of both parties into the handshake information exchanged when the TCP connection is established; that is, the remote end informs the local end of its RDMA receiving addresses so that the local end can send RDMA messages to the remote end through them;
S2, according to the type of the file operation, encapsulating the file operation data as file operation type information data in the RDMA message content of a local RDMA message;
S3, allocating a remote RDMA receiving address to the current file operation according to the current usage of the remote RDMA receiving addresses, as obtained in step S2 of the most recent local receiving flow;
S4, if the unused remote RDMA receiving addresses number fewer than half of all current remote RDMA receiving addresses, filling the RDMA message header of the generated file operation RDMA message with the usage of the local RDMA receiving addresses and a tension alarm flag for the remote RDMA receiving addresses, so that the tension alarm is sent along with the message, and proceeding to the next step;
if the idle remote RDMA receiving addresses number between half and three quarters of all current remote RDMA receiving addresses, filling the RDMA message header only with the usage of the local RDMA receiving addresses, without any special flag, and proceeding to the next step;
if the idle remote RDMA receiving addresses number more than three quarters of all current remote RDMA receiving addresses, filling the RDMA message header of the generated file operation RDMA message with the usage of the local RDMA receiving addresses and a redundancy alarm flag for the remote RDMA receiving addresses, so that the redundancy alarm is sent along with the message and the remote end releases 8 RDMA receiving addresses when it receives the message;
S5, sending the RDMA message, composed of the RDMA message header and the RDMA message content of the file operation, to the remote end.
The RDMA data receiving stage comprises the following steps:
S1, pre-exchanging a plurality of RDMA addresses when the connection with the remote end is established;
S2, receiving the RDMA message of a file operation sent by the remote end, parsing the received file operation type information data and the usage of the remote RDMA receiving addresses through the RDMA message header of the RDMA message, and storing the usage of the remote RDMA receiving addresses for the next message sending;
S3, if the RDMA message is a file operation RDMA message whose header carries no RDMA address alarm information, executing the corresponding file operation;
S4, if the message is a file operation message carrying RDMA address alarm information, extracting the specific alarm type from the RDMA message header of the RDMA message;
S5, if the alarm type is RDMA receiving address redundancy, releasing 8 local RDMA receiving addresses; if the alarm type is RDMA receiving address tension, applying for 8 new RDMA receiving addresses, updating the usage of the local RDMA receiving addresses, and informing the remote end through step S4 of the next sending flow;
S6, executing the corresponding file operation.
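The three-way decision in step S4 of the sending stage can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Alarm` and `classify` are hypothetical, and treating exactly one half and exactly three quarters as "no alarm" is an assumption, since the text only says "less than half", "between half and three quarters", and "more than three quarters".

```python
from enum import Enum


class Alarm(Enum):
    NONE = 0        # header carries only local usage
    TENSION = 1     # remote end should apply for more receiving addresses
    REDUNDANT = 2   # remote end should release receiving addresses


def classify(free_remote: int, total_remote: int) -> Alarm:
    """Choose the alarm flag from the last known remote address usage."""
    if total_remote <= 0:
        raise ValueError("total_remote must be positive")
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if 2 * free_remote < total_remote:        # fewer than half are free
        return Alarm.TENSION
    if 4 * free_remote > 3 * total_remote:    # more than three quarters are free
        return Alarm.REDUNDANT
    return Alarm.NONE                         # assumed to include both boundaries
```

For example, with 16 remote receiving addresses, 7 free addresses would raise the tension flag, 8 to 12 would raise none, and 13 or more would raise the redundancy flag.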
Further improvements to the above technical scheme are as follows:
1. In the above scheme, the file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, write, flush, fsync, setxattr, getxattr, setattr, getattr, ftruncate, statfs, truncate, fgetxattr, fsetxattr, close, or lock.
2. In the above scheme, the file operation is an open operation, a write operation, a close operation, or a lookup operation.
By applying the above technical scheme, the invention has the following advantages over the prior art:
The RDMA communication address exchange method for a distributed file system requires no dedicated RDMA address exchange operation. Instead, it attaches the RDMA address exchange to the operation semantics of the file system: on each file operation it checks whether the pre-exchanged RDMA address space is sufficient and, if it is not, attaches the supplementary RDMA address information to the file operation and transmits it inside the file operation message. The method therefore avoids wasting memory, reducing communication efficiency, and consuming extra system resources; it reduces redundant RDMA address exchange operations and improves the overall efficiency of RDMA communication.
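To make the idea of carrying address information inside an ordinary file operation message concrete, here is a small sketch of a header placed in front of the message content. The wire layout (two unsigned 32-bit counters and a one-byte alarm flag in network byte order) and the function names are assumptions chosen for illustration; the text defines only that a header precedes the content.

```python
import struct
from typing import Tuple

# Hypothetical header: free receiving addresses, total receiving addresses,
# alarm flag. "!" selects network byte order with no padding (9 bytes).
HEADER = struct.Struct("!IIB")


def pack_message(free: int, total: int, alarm: int, content: bytes) -> bytes:
    """Prefix the file operation content with the address usage header."""
    return HEADER.pack(free, total, alarm) + content


def unpack_message(message: bytes) -> Tuple[int, int, int, bytes]:
    """Split a message back into header fields and file operation content."""
    free, total, alarm = HEADER.unpack_from(message)
    return free, total, alarm, message[HEADER.size:]
```

Because the header travels with every file operation, no separate message or monitoring thread is needed to report address usage in this sketch.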
Drawings
FIG. 1 is a schematic flow diagram of the RDMA data sending stage according to the present invention;
FIG. 2 is a schematic flow diagram of the RDMA data receiving stage according to the present invention.
Detailed Description
The invention is further described below with reference to examples:
examples: an RDMA communication address exchange method facing to a distributed file system is characterized in that: including the RDMA data send phase and RDMA data receive phase of the distributed file system,
the RDMA data transmission stage comprises the following steps:
s0, setting a data structure of an RDMA message, wherein the data structure of the RDMA message consists of an RDMA message header and RDMA message content;
RDMA message header RDMA message content
The RDMA message content is used for storing file operation type information data and file content data to be transmitted;
when the message is sent, the RDMA message header stores the use condition of a local RDMA receiving address and remote RDMA receiving address alarm information;
when receiving messages, the RDMA message header stores the use condition of the remote RDMA receiving address and the local RDMA receiving address alarm information;
RDMA receive address is RDMA address for receiving RDMA message, RDMA send address is RDMA address for sending RDMA message;
file operation data: specific data of file operations, some operations are operation type information data only, and some operations comprise operation type data and file content data to be transmitted;
s1, a local end and a remote end apply for a plurality of RDMA receiving addresses and RDMA sending addresses in a memory in advance, and exchange the RDMA receiving addresses by filling the RDMA receiving addresses of both parties into handshake information exchanged by establishing TCP connection, namely, the remote end informs the local end of the RDMA receiving addresses, so that the local end can send RDMA messages to the remote end through the remote end RDMA receiving addresses;
s2, according to the type of file operation, encapsulating file operation data into file operation type information data in RDMA message content of a local RDMA message;
s3, distributing remote RDMA receiving addresses for the current file operation according to the current remote RDMA receiving address service condition obtained in the step S2 of the last local receiving process;
s4, if the unused remote RDMA receiving address is less than half of the current all remote RDMA receiving addresses, filling a local RDMA receiving address using condition and a remote RDMA receiving address tension alarm mark in an RDMA message header of the generated file operation RDMA message; carrying out the next step by additionally sending RDMA receiving address tension alarm information;
if the remote idle RDMA receiving address is between half and three quarters of the current all remote RDMA receiving addresses, the RDMA message header does not make special mark filling, only fills the use condition of the local RDMA receiving address, and executes the next step;
if the remote idle RDMA receiving address is more than three fourths of all the current remote RDMA receiving addresses, filling the RDMA message header of the generated file operation RDMA message with the use condition of the local RDMA receiving address and the redundant alarm identification of the remote RDMA receiving address, and additionally sending the redundant alarm of the RDMA receiving address to enable the remote to release 8 RDMA receiving addresses when receiving the message;
s5, sending RDMA message composed of RDMA message header and RDMA message content of file operation to far end;
the RDMA data reception phase includes the steps of:
s1, pre-exchanging a plurality of RDMA addresses when connection is established with a far end;
s2, receiving an RDMA message of a file operation sent by a far end, analyzing received file operation type information data and a use condition of a far-end RDMA receiving address through an RDMA message header of the RDMA message, and storing the use condition of the far-end RDMA receiving address for the next message sending;
s3, if the RDMA message is the RDMA message of the file operation without the RDMA address alarm information in the header of the RDMA message, executing the corresponding file operation;
s4, if the file operation message is a file operation message with RDMA address alarm information, extracting a specific alarm type from an RDMA message header of the RDMA message;
s5, if the alarm type is redundant in RDMA receiving addresses, releasing local 8 RDMA receiving addresses; if the alarm type is that the RDMA receiving address is tense, applying 8 new RDMA receiving addresses and updating the service condition of the local RDMA receiving address, and informing the far end through the S4 step of the next sending flow;
s6, executing corresponding file operation.
The file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, write, flush, fsync, setxattr, getxattr, setattr, getattr, ftruncate, statfs, truncate, fgetxattr, fsetxattr, close, or lock.
The file operation is an open operation, a write operation, a close operation, or a lookup operation.
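The receiving stage of this embodiment can be sketched roughly as below. The class and field names, the numeric alarm encoding, and the choice to release only addresses that are currently free are assumptions made for illustration; the step size of 8 comes from the text. Actual memory registration and RDMA transfers are replaced by simple counters.

```python
from dataclasses import dataclass, field
from typing import List

STEP = 8  # receiving addresses applied for or released per alarm (from the text)
ALARM_NONE, ALARM_TENSION, ALARM_REDUNDANT = 0, 1, 2  # hypothetical encoding


@dataclass
class Header:
    sender_free: int    # sender's unused receiving addresses
    sender_total: int   # sender's total receiving addresses
    alarm: int          # alarm aimed at the receiver's own receiving addresses


@dataclass
class Endpoint:
    local_free: int
    local_total: int
    remote_free: int = 0
    remote_total: int = 0
    executed: List[str] = field(default_factory=list)

    def on_receive(self, header: Header, operation: str) -> None:
        # S2: store the peer's usage for the next sending flow.
        self.remote_free = header.sender_free
        self.remote_total = header.sender_total
        # S4/S5: adjust the local pool according to the alarm type.
        if header.alarm == ALARM_TENSION:
            self.local_free += STEP
            self.local_total += STEP
        elif header.alarm == ALARM_REDUNDANT:
            released = min(STEP, self.local_free)  # assumption: free ones only
            self.local_free -= released
            self.local_total -= released
        # S3/S6: execute the file operation (recorded here, not performed).
        self.executed.append(operation)
```

The updated `local_free` and `local_total` values are what the endpoint would write into the header of its next outgoing message, which is how the peer learns about the change without a dedicated exchange.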
With this RDMA communication address exchange method for a distributed file system, no dedicated RDMA address exchange operation is needed. The RDMA address exchange is attached to the operation semantics of the file system: on each file operation the method checks whether the pre-exchanged RDMA address space is sufficient and, if it is not, attaches the supplementary RDMA address information to the file operation and transmits it inside the file operation message. The method therefore avoids wasting memory, reducing communication efficiency, and consuming extra system resources; it reduces redundant RDMA address exchange operations and improves the overall efficiency of RDMA communication.
The above embodiments are provided to illustrate the technical concept and features of the present invention and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the spirit of the present invention should be construed to be included in the scope of the present invention.

Claims (3)

1. An RDMA communication address exchange method for a distributed file system, characterized in that it comprises an RDMA data sending stage and an RDMA data receiving stage of the distributed file system,
the RDMA data sending stage comprises the following steps:
S0, defining the data structure of an RDMA message, which consists of an RDMA message header followed by RDMA message content:
RDMA message header | RDMA message content
the RDMA message content stores the file operation type information data and the file content data to be transmitted;
when a message is sent, the RDMA message header stores the usage of the local RDMA receiving addresses and the alarm information for the remote RDMA receiving addresses;
when a message is received, the RDMA message header stores the usage of the remote RDMA receiving addresses and the alarm information for the local RDMA receiving addresses;
an RDMA receiving address is an RDMA address used for receiving RDMA messages, and an RDMA sending address is an RDMA address used for sending RDMA messages;
file operation data is the specific data of a file operation: some operations carry only operation type information data, while others carry both operation type data and file content data to be transmitted;
S1, the local end and the remote end each apply in advance for a plurality of RDMA receiving addresses and RDMA sending addresses in memory, and exchange the RDMA receiving addresses by filling the RDMA receiving addresses of both parties into the handshake information exchanged when the TCP connection is established; that is, the remote end informs the local end of its RDMA receiving addresses so that the local end can send RDMA messages to the remote end through them;
S2, according to the type of the file operation, encapsulating the file operation data as file operation type information data in the RDMA message content of a local RDMA message;
S3, allocating a remote RDMA receiving address to the current file operation according to the current usage of the remote RDMA receiving addresses, as obtained in step S2 of the most recent local receiving flow;
S4, if the unused remote RDMA receiving addresses number fewer than half of all current remote RDMA receiving addresses, filling the RDMA message header of the generated file operation RDMA message with the usage of the local RDMA receiving addresses and a tension alarm flag for the remote RDMA receiving addresses, so that the tension alarm is sent along with the message, and proceeding to the next step;
if the idle remote RDMA receiving addresses number between half and three quarters of all current remote RDMA receiving addresses, filling the RDMA message header only with the usage of the local RDMA receiving addresses, without any special flag, and proceeding to the next step;
if the idle remote RDMA receiving addresses number more than three quarters of all current remote RDMA receiving addresses, filling the RDMA message header of the generated file operation RDMA message with the usage of the local RDMA receiving addresses and a redundancy alarm flag for the remote RDMA receiving addresses, so that the redundancy alarm is sent along with the message and the remote end releases 8 RDMA receiving addresses when it receives the message;
S5, sending the RDMA message, composed of the RDMA message header and the RDMA message content of the file operation, to the remote end;
the RDMA data receiving stage comprises the following steps:
S1, pre-exchanging a plurality of RDMA addresses when the connection with the remote end is established;
S2, receiving the RDMA message of a file operation sent by the remote end, parsing the received file operation type information data and the usage of the remote RDMA receiving addresses through the RDMA message header of the RDMA message, and storing the usage of the remote RDMA receiving addresses for the next message sending;
S3, if the RDMA message is a file operation RDMA message whose header carries no RDMA address alarm information, executing the corresponding file operation;
S4, if the message is a file operation message carrying RDMA address alarm information, extracting the specific alarm type from the RDMA message header of the RDMA message;
S5, if the alarm type is RDMA receiving address redundancy, releasing 8 local RDMA receiving addresses; if the alarm type is RDMA receiving address tension, applying for 8 new RDMA receiving addresses, updating the usage of the local RDMA receiving addresses, and informing the remote end through step S4 of the next sending flow;
S6, executing the corresponding file operation.
2. The RDMA communication address exchange method for a distributed file system as recited in claim 1, wherein: the file operation type information data is the operation information of one of the following operations: open, write, read, readdir, stat, opendir, mkdir, unlink, create, readv, write, flush, fsync, setxattr, getxattr, setattr, getattr, ftruncate, statfs, truncate, fgetxattr, fsetxattr, close, or lock.
3. The RDMA communication address exchange method for a distributed file system as recited in claim 1, wherein: the file operation is an open operation, a write operation, a close operation, or a lookup operation.
CN201910918615.5A 2019-09-26 2019-09-26 RDMA communication address exchange method for distributed file system Active CN112565326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918615.5A CN112565326B (en) 2019-09-26 2019-09-26 RDMA communication address exchange method for distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918615.5A CN112565326B (en) 2019-09-26 2019-09-26 RDMA communication address exchange method for distributed file system

Publications (2)

Publication Number Publication Date
CN112565326A CN112565326A (en) 2021-03-26
CN112565326B CN112565326B (en) 2023-10-17

Family

ID=75029871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918615.5A Active CN112565326B (en) 2019-09-26 2019-09-26 RDMA communication address exchange method for distributed file system

Country Status (1)

Country Link
CN (1) CN112565326B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191194A (en) * 2019-06-13 2019-08-30 华中科技大学 A kind of Distributed File System Data transmission method and system based on RDMA network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10148570B2 (en) * 2015-12-29 2018-12-04 Amazon Technologies, Inc. Connectionless reliable transport
US10713211B2 (en) * 2016-01-13 2020-07-14 Red Hat, Inc. Pre-registering memory regions for remote direct memory access in a distributed file system
JP6668993B2 (en) * 2016-07-22 2020-03-18 富士通株式会社 Parallel processing device and communication method between nodes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a reliable RDMA transport protocol based on dynamic connections; 刘路; 张磊; 曹继军; 戴艺; Computer Engineering & Science (08); full text *

Also Published As

Publication number Publication date
CN112565326A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN108268208B (en) RDMA (remote direct memory Access) -based distributed memory file system
CN110401592B (en) Method and equipment for data transfer in message channel
CN111277616B (en) RDMA-based data transmission method and distributed shared memory system
US10200460B2 (en) Server-processor hybrid system for processing data
CN110177118A (en) A kind of RPC communication method based on RDMA
CN112631788B (en) Data transmission method and data transmission server
CN111459417B (en) Non-lock transmission method and system for NVMeoF storage network
CN112099977A (en) Real-time data analysis engine of distributed tracking system
CN111432025A (en) Cloud edge cooperation-oriented distributed service directory management method and system
CN112087490A (en) High-performance mobile terminal application software log collection system
CN113468221A (en) System integration method based on kafka message data bus
TWI442248B (en) Processor-server hybrid system for processing data
CN106131162B (en) A method of network service agent is realized based on IOCP mechanism
CN115202573A (en) Data storage system and method
Sun et al. SKV: A SmartNIC-Offloaded Distributed Key-Value Store
CN112565326B (en) RDMA communication address exchange method for distributed file system
CN113259408A (en) Data transmission method and system
CN112052104A (en) Message queue management method based on multi-computer-room realization and electronic equipment
CN113630366A (en) Internet of things equipment access method and system
CN108234595B (en) Log transmission method and system
CN115086311B (en) Management system of enterprise cross-system service based on cloud service bus
CN110674221A (en) Spatial data synchronization method, terminal and computer readable storage medium
CN110290035B (en) Intelligent family data storage and access method and system based on K3S
CN111782322A (en) Intranet and extranet message communication server and system based on cloud desktop server
CN111541667A (en) Method, equipment and storage medium for intersystem message communication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant