CN117857658A - Traffic communication and dynamic switching method and device based on three-stack fusion


Info

Publication number
CN117857658A
Authority
CN
China
Prior art keywords
computing node
tcp
storage node
node
rdma
Legal status
Pending
Application number
CN202311735913.3A
Other languages
Chinese (zh)
Inventor
韩杏玲
樊小平
Current Assignee
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Application filed by Tianyi Cloud Technology Co Ltd
Priority to CN202311735913.3A
Publication of CN117857658A

Landscapes

  • Computer And Data Communications (AREA)

Abstract

The invention relates to a traffic communication and dynamic switching method and device based on three-stack fusion, in the field of high-performance networks. The method uses a fused protocol stack that supports three stacks (RDMA, user-mode TCP and kernel-mode TCP), hides the internal implementation differences between the transport protocols, and provides applications with a unified communication interface. Compared with a protocol stack built on a single transport protocol, applications do not need to deal with the internal differences and programming interfaces of each transport protocol, so the stack is simple and easy to use. It exploits the technical advantages of RDMA and TCP at the same time and solves the problem that large-scale networking cannot be achieved with the RDMA protocol alone.

Description

Traffic communication and dynamic switching method and device based on three-stack fusion
Technical Field
The invention belongs to the field of high-performance networks, and particularly relates to a traffic communication and dynamic switching method and device based on three-stack fusion.
Background
For many years the TCP/IP protocol has been a pillar of Internet communication. The traditional kernel TCP protocol stack offers stable functionality and strong reliability, but mechanisms such as interrupt handling and kernel memory copies create many performance bottlenecks. A user-mode TCP protocol stack bypasses the kernel and delivers data from the network hardware directly to user-mode applications, reducing interrupts and memory copies, so its performance is much better than that of the kernel TCP stack. However, applications such as high-performance computing and distributed storage require ultra-high bandwidth, ultra-low latency and ultra-high reliability from the network, which user-mode TCP still struggles to provide. RDMA can access memory data directly through the network interface without involving the operating system kernel, so the industry generally adopts RDMA in place of TCP.
However, RDMA networks are very sensitive to packet loss: they transmit at full rate in a lossless state, but performance drops rapidly once packet loss and retransmission occur. When nodes within an AZ communicate, congestion and packet loss can be avoided by enabling PFC on the switches and the congestion control mechanism of the RDMA network cards, which preserves the high performance of the RDMA network. Across AZs, however, link delay is large and lossless transmission is difficult to guarantee, so RDMA alone is hard to deploy in large-scale networks.
Disclosure of Invention
To address these problems in the prior art, the invention provides a traffic communication and dynamic switching method and device based on three-stack fusion, which adopts the following technical solution:
In a first aspect, a traffic communication and dynamic switching method based on three-stack fusion is provided, the method comprising:
S01, initializing the fused protocol stack of each computing node and each storage node;
S02, the computing node sends a negotiation message and records the sending time t1, and records time t2 when the reply to the negotiation message is received;
S03, determining whether the computing node and the storage node are in the same AZ by comparing the round trip delay between t1 and t2 with a configured round trip delay threshold;
S04, when the computing node and the storage node are in the same AZ, establishing read-write requests using RDMA in the fused protocol stack as the transport protocol;
S14, when the computing node and the storage node are not in the same AZ, establishing read-write requests using user-mode TCP as the transport protocol;
S05, when the transport protocol channel between the computing node and the storage node becomes abnormal, the computing node transfers the service data to kernel-mode TCP;
S06, re-establishing the original read-write channel between the computing node and the storage node as in S04 (or S14 for user-mode TCP), and migrating the service data carried on kernel-mode TCP back to the channel of the corresponding transport type.
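The decision logic of S03, S04/S14 and S05/S06 can be summarized as a small selection function. This is a minimal Python sketch under stated assumptions: the names `Transport`, `select_primary` and `active_transport` are hypothetical, and the fallback to kernel-mode TCP when neither primary protocol is supported on both ends is an assumption, because the method does not describe that case.

```python
from enum import Enum


class Transport(Enum):
    RDMA = "rdma"
    USER_TCP = "user_tcp"
    KERNEL_TCP = "kernel_tcp"


def select_primary(rtt_us: float, threshold_us: float,
                   both_support_rdma: bool, both_support_user_tcp: bool) -> Transport:
    """S03, S04 and S14: pick the primary transport from the measured RTT."""
    same_az = rtt_us < threshold_us
    if same_az and both_support_rdma:
        return Transport.RDMA        # S041: same AZ and RDMA on both ends
    if not same_az and both_support_user_tcp:
        return Transport.USER_TCP    # S141: cross-AZ and user-mode TCP on both ends
    # Assumption: the method does not cover other cases; use kernel-mode TCP.
    return Transport.KERNEL_TCP


def active_transport(primary: Transport, primary_healthy: bool) -> Transport:
    """S05 and S06: carry traffic on kernel-mode TCP while the primary channel is down."""
    return primary if primary_healthy else Transport.KERNEL_TCP
```

For example, `select_primary(80, 200, True, True)` yields `Transport.RDMA`, while an RTT of 1000 µs with the same flags yields `Transport.USER_TCP`. Keeping the primary choice separate from the health check mirrors the method: the primary protocol is fixed by the RTT probe, and kernel-mode TCP is only a temporary detour.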
Further, S04 specifically includes:
S041, when the computing node and the storage node are in the same AZ and both support RDMA, selecting RDMA as the transport protocol of the computing node;
S042, the computing node creates a QP, the storage node creates its QP in response to the QP negotiation message, and the RDMA channel between the computing node and the storage node is completed;
S043, over the RDMA QP channel, the computing node sends read-write requests to the storage node to implement service communication.
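The QP negotiation message of S042 carries the QP's starting sequence number, QP number and local GID (as described in the embodiments). The following is a minimal Python sketch of packing and unpacking such a message; the byte layout (network byte order, 4-byte PSN, 4-byte QPN, 16-byte GID) and the class name `QpNegotiation` are hypothetical, chosen only for illustration.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout (network byte order): 4-byte PSN, 4-byte QPN, 16-byte GID.
_QP_MSG = struct.Struct("!II16s")
_MAX_24BIT = 0xFFFFFF  # InfiniBand PSNs and QP numbers are 24-bit values


@dataclass(frozen=True)
class QpNegotiation:
    start_psn: int  # starting packet sequence number of the sender's QP
    qp_num: int     # sender's QP number
    gid: bytes      # sender's 16-byte local GID

    def pack(self) -> bytes:
        if not (0 <= self.start_psn <= _MAX_24BIT and 0 <= self.qp_num <= _MAX_24BIT):
            raise ValueError("PSN and QP number must fit in 24 bits")
        if len(self.gid) != 16:
            raise ValueError("GID must be 16 bytes")
        return _QP_MSG.pack(self.start_psn, self.qp_num, self.gid)

    @classmethod
    def unpack(cls, data: bytes) -> "QpNegotiation":
        psn, qpn, gid = _QP_MSG.unpack(data)
        return cls(start_psn=psn, qp_num=qpn, gid=gid)
```

In a verbs-based implementation, each side would then feed the peer's values into `ibv_qp_attr` fields such as `dest_qp_num`, `rq_psn` and `ah_attr.grh.dgid` when calling `ibv_modify_qp` to move its QP to RTR and RTS; how the patented stack does this internally is not stated.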
Further, S05 specifically includes:
S051, when the fused protocol stack detects that the IB device is abnormal and the QP is in the error state, the computing node creates a kernel-mode TCP socket and sends a channel connection request to the storage node to complete channel establishment;
S052, the computing node migrates the service data to the kernel-mode TCP channel, and the read-write state is maintained.
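S051 and S052 imply that IO already in flight on the failed QP must be resent on the new kernel-mode TCP channel so that the read-write state survives. A minimal Python sketch, assuming in-flight IO is tracked as a queue of unacknowledged payloads; `Channel`, `FailoverSender` and `_open_kernel_tcp` are hypothetical stand-ins for real QPs and sockets.

```python
from collections import deque
from typing import Deque, List, Optional


class Channel:
    """Stand-in for one transport channel (an RDMA QP or a TCP socket)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self.sent: List[bytes] = []

    def send(self, payload: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.sent.append(payload)


class FailoverSender:
    """S051/S052 sketch: move unacknowledged IO to kernel-mode TCP when the primary fails."""

    def __init__(self, primary: Channel) -> None:
        self.primary = primary
        self.fallback: Optional[Channel] = None
        self.unacked: Deque[bytes] = deque()  # sent but not yet acknowledged by the storage node

    def _open_kernel_tcp(self) -> Channel:
        # Hypothetical: a real stack would connect a kernel TCP socket to the storage node here.
        return Channel("kernel_tcp")

    def send(self, payload: bytes) -> None:
        channel = self.primary if self.primary.available else self.fallback
        if channel is None:
            raise ConnectionError("no usable channel")
        channel.send(payload)
        self.unacked.append(payload)

    def ack_oldest(self) -> None:
        self.unacked.popleft()

    def on_primary_error(self) -> None:
        """Called when monitoring sees the IB device fail or the QP enter the error state."""
        if not self.primary.available:
            return  # already failed over
        self.primary.available = False
        if self.fallback is None:
            self.fallback = self._open_kernel_tcp()
        for payload in self.unacked:  # retransmit in-flight IO so it is not lost
            self.fallback.send(payload)
```

Only unacknowledged payloads are replayed, so IO already confirmed by the storage node is not sent twice.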
Further, S06 specifically includes:
S061, the computing node reclaims its QP resources, notifies the storage node to reclaim its QP resources, and periodically checks the status of the IB network card;
S062, after the network card state recovers, the RDMA QP channel between the computing node and the storage node is re-established;
S063, over the RDMA QP channel, the computing node migrates the service data back into the RDMA QP.
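The periodic check in S061 and the retry in S062 amount to a polling loop. A minimal Python sketch, assuming NIC status and channel rebuilding are exposed as callables; `recover_primary`, `nic_is_up` and `try_rebuild_channel` are hypothetical names, and the poll interval is an assumed parameter the method does not specify.

```python
import time
from typing import Callable, Optional


def recover_primary(nic_is_up: Callable[[], bool],
                    try_rebuild_channel: Callable[[], bool],
                    poll_interval_s: float = 1.0,
                    max_attempts: Optional[int] = None,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """S061/S062 sketch: poll the NIC, then retry channel setup until it succeeds.

    Returns True once the channel is rebuilt (the caller then performs S063 and
    migrates service data back); returns False if max_attempts is exhausted.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        # try_rebuild_channel is only called once the NIC reports it is up.
        if nic_is_up() and try_rebuild_channel():
            return True
        sleep(poll_interval_s)
    return False
```

Injecting `sleep` keeps the loop testable without real delays; with `max_attempts=None` it retries indefinitely, matching the "until successful" wording of the embodiments.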
Further, S14 specifically includes:
S141, when the computing node and the storage node are in different AZs, that is, the RTT is greater than the round trip delay threshold, and both support user-mode TCP, selecting user-mode TCP as the transport protocol of the computing node;
S142, the computing node creates a user-mode TCP socket and initiates a link establishment request to the storage node, and the storage node replies to the request to complete link establishment;
S143, over the established link, the computing node sends read-write requests to the storage node to implement service communication.
Further, when user-mode TCP is used, S05 specifically includes:
S151, when the user-mode TCP channel of the computing node becomes abnormal, the underlying protocol stack closes the user-mode TCP socket, creates a kernel-mode TCP socket, and sends a kernel-mode TCP channel connection request to the storage node to complete channel creation;
S152, the computing node migrates the service data to kernel-mode TCP and maintains the read-write state.
Further, when user-mode TCP is used, S06 specifically includes:
S061, the computing node re-creates a user-mode TCP socket and establishes a user-mode TCP channel with the storage node;
S062, over the user-mode TCP channel, the computing node migrates the service data back into user-mode TCP.
In a second aspect, an embodiment of the present invention provides a traffic communication and dynamic switching device based on three-stack fusion, configured to implement the traffic communication and dynamic switching method based on three-stack fusion of the first aspect. The device includes:
an external interface module, which provides a unified calling interface for application software;
a connection management module, which implements control message negotiation between the computing node and the storage node and adaptive selection of the transport protocol;
a task scheduling module, which manages connection switching in the connection management module and the dynamic loading and unloading of tasks;
a task execution module, which packages the data processing within the protocols into tasks and adds them to a task execution queue;
a transport protocol module, which sends and receives data through each underlying protocol stack and includes an RDMA protocol processing unit, a user-mode TCP protocol processing unit and a kernel-mode TCP protocol processing unit.
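The task execution module wraps protocol work as tasks on a queue. A minimal Python sketch of such a queue, assuming tasks are plain callables and that each processing pass runs a bounded number of them; `TaskExecutor`, `submit`, `run_pending` and the `budget` parameter are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class TaskExecutor:
    """Sketch of a task execution queue: protocol work is wrapped as callables."""

    def __init__(self) -> None:
        self._queue: Deque[Callable[[], None]] = deque()

    def submit(self, task: Callable[[], None]) -> None:
        """Append one unit of protocol work (for example, completing a send)."""
        self._queue.append(task)

    def run_pending(self, budget: int = 64) -> int:
        """Run up to `budget` queued tasks in FIFO order; return how many ran."""
        ran = 0
        while self._queue and ran < budget:
            self._queue.popleft()()
            ran += 1
        return ran
```

A per-call budget keeps a single processing pass from monopolizing the caller's thread. The cap is a design assumption, not something the device description specifies.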
In a third aspect, an embodiment of the present invention provides an electronic device including a memory and a processor, where the memory stores one or more computer instructions that, when executed by the processor, implement the method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium storing a computer program that, when executed by a processor, implements the method according to the first aspect.
The invention has the following beneficial effects:
the fused protocol stack hides the internal implementation differences of the three transport protocols and provides a unified calling interface to upper-layer services, so it is simple and easy to use;
by actively probing the network environment and adaptively selecting RDMA or user-mode TCP according to the round trip delay between communicating nodes, no application intervention is needed, network resources are fully utilized, and communication performance improves;
the connection state and network card state are monitored in real time; when network congestion or a misoperation makes the original channel unavailable, the transport is immediately switched to the highly reliable kernel-mode TCP channel, and after the fault recovers the RDMA or user-mode TCP channel is actively restored and the service data migrated back. The whole process is transparent to the service and service communication is not interrupted, greatly improving communication stability and reliability;
by fusing protocol stacks, the high-performance, low-latency advantages of RDMA (remote direct memory access) are fully exploited while TCP (transmission control protocol) serves as a supplement, which solves the problem that RDMA networks are difficult to deploy in large-scale networks and achieves simplicity, ease of use, high performance and high reliability.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views. It is apparent that the drawings in the following description are only some of the embodiments described in the embodiments of the present invention, and that other drawings may be obtained from these drawings by those of ordinary skill in the art.
Fig. 1 is a flow chart of adaptive transport protocol selection in the traffic communication and dynamic switching method based on three-stack fusion according to an embodiment of the present invention;
Fig. 2 is a transport protocol failover flowchart of the traffic communication and dynamic switching method based on three-stack fusion according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a traffic communication and dynamic switching device based on three-stack fusion according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the transport protocol adaptation principle according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the embodiments of the present invention better understood by those skilled in the art, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.
In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
In the description of the present invention, it should be noted that unless explicitly stated and limited otherwise, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The terms "mounted," "connected," "coupled," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of methods and systems that are consistent with aspects of the invention as detailed in the accompanying claims.
First, some terms used in the present invention are explained:
TCP (Transmission Control Protocol): a protocol that provides reliable data transmission in an IP environment.
RDMA (Remote Direct Memory Access): a technology for directly reading and writing memory data on other servers without processing by the operating system or CPU.
AZ (Availability Zone): one or more data centers interconnected by a low-latency fiber network.
PFC (Priority-based Flow Control): queues on which PFC is enabled are called lossless queues. When a lossless queue on a downstream device becomes congested, the downstream device notifies the upstream device to stop sending traffic for that queue, achieving zero packet loss.
RTT (Round-Trip Time): the total delay from the moment the sender sends data until the sender receives the acknowledgement from the receiver.
QP (Queue Pair): the abstraction of one connection in RDMA.
ACK (Acknowledge character): a transmission control character sent by the receiver to the sender in data communication, indicating that the transmitted data has been received without errors.
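The PFC behavior in the glossary (a congested lossless queue pauses its upstream sender, then lets it resume) can be illustrated with a toy queue model. This Python sketch is illustrative only: the `xoff`/`xon` thresholds and the `LosslessQueue` class are assumptions for the model and do not reproduce how any particular switch configures PFC.

```python
class LosslessQueue:
    """Toy model of one PFC-enabled (lossless) priority queue on a downstream port."""

    def __init__(self, xoff: int, xon: int) -> None:
        if not 0 <= xon < xoff:
            raise ValueError("need 0 <= xon < xoff")
        self.xoff = xoff              # depth at which the upstream is told to stop
        self.xon = xon                # depth at which the upstream may resume
        self.depth = 0
        self.upstream_paused = False

    def enqueue(self, n: int = 1) -> None:
        self.depth += n
        if not self.upstream_paused and self.depth >= self.xoff:
            self.upstream_paused = True   # tell the upstream device to stop this priority

    def dequeue(self, n: int = 1) -> None:
        self.depth = max(0, self.depth - n)
        if self.upstream_paused and self.depth <= self.xon:
            self.upstream_paused = False  # tell the upstream device to resume
```

The gap between the two thresholds adds hysteresis, so the upstream is not toggled between pause and resume on every packet.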
Embodiment one:
FIG. 1 shows a flow chart of adaptive transport protocol selection for the traffic communication and dynamic switching method based on three-stack fusion;
FIG. 2 shows a transport protocol failover flowchart of the traffic communication and dynamic switching method based on three-stack fusion;
The traffic communication and dynamic switching method based on three-stack fusion provided by this embodiment includes the following steps:
First, per S01, the fused protocol stacks of computing node A and storage node C are initialized;
Next, per S02, computing node A sends a negotiation message and records time t1, then records time t2 when the reply to the negotiation message arrives;
Computing node A creates a QP and sends a QP negotiation message containing the QP's starting sequence number, QP number, local GID and other information; after receiving it, storage node C creates and connects its QP and replies with its own QP negotiation message; computing node A connects its QP after receiving the reply and notifies that the service connection is available;
Next, per S03, computing node A uses the configured round trip delay threshold and the RTT between t1 and t2 to determine whether it is in the same AZ as storage node C;
The RTT between computing node A and storage node C is calculated as RTTA = t2 - t1. Typically, the RTT between nodes in the same AZ is below 100 µs and the RTT across AZs is about 1 ms, so the threshold is configured as 200 µs: when the RTT is below the threshold, the nodes are considered to be in the same AZ; otherwise they communicate across AZs.
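The S02/S03 arithmetic can be sketched in Python. This is a minimal sketch, assuming the negotiation exchange can be wrapped in one blocking callable; `measure_rtt_us` and `in_same_az` are hypothetical names, and the 200 µs default only mirrors the example threshold above.

```python
import time
from typing import Callable


def measure_rtt_us(exchange: Callable[[], None],
                   clock_ns: Callable[[], int] = time.monotonic_ns) -> float:
    """S02: timestamp t1 before sending the negotiation message, t2 after the reply."""
    t1 = clock_ns()
    exchange()          # send the negotiation message and block until the reply arrives
    t2 = clock_ns()
    return (t2 - t1) / 1_000  # nanoseconds -> microseconds


def in_same_az(rtt_us: float, threshold_us: float = 200.0) -> bool:
    """S03: below the threshold means same AZ; otherwise cross-AZ."""
    return rtt_us < threshold_us
```

A monotonic clock avoids errors from wall-clock adjustments between t1 and t2. One possible refinement, not described in the patent, is to take several samples and compare the minimum against the threshold, since a single sample can be inflated by transient queuing.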
Next, per S041, since computing node A and storage node C are in the same AZ and both support RDMA, RDMA is selected as the transport protocol of computing node A;
Because RTTA is less than the threshold and both nodes support RDMA, RDMA is used as the transport protocol.
Next, per S042, computing node A creates a QP, storage node C creates its QP in response to the QP negotiation message, and the QP connection between computing node A and storage node C is completed;
Computing node A creates a QP and sends a QP negotiation message containing the QP's starting sequence number, QP number, local GID and other information; after receiving it, storage node C creates and connects its QP and replies with its own QP negotiation message.
Next, per S043, computing node A sends read-write requests to storage node C over the QP connection, implementing service communication;
Computing node A creates the RDMA channel QP and sends a QP negotiation message containing the QP's starting sequence number, QP number, local GID and other information; storage node C creates and connects its QP after receiving the message and replies with a QP negotiation message; computing node A connects its QP after receiving the reply and notifies that the service connection is available.
Next, per S051, the fused protocol stack detects that the IB device is abnormal and the QP is in the error state; computing node A establishes a kernel-mode TCP connection by sending a connection request to storage node C, completing connection creation;
The data sending task of computing node A finds no available data channel, so it notifies the connection management module to immediately establish a kernel-mode TCP channel; a kernel-mode TCP channel establishment request is sent to storage node C, storage node C replies, and the kernel-mode TCP channel is established successfully;
Next, per S052, computing node A migrates the service data to kernel-mode TCP and maintains the read-write state;
That is, when computing node A retransmits service data, it selects the kernel-mode TCP channel and migrates the service data to it, ensuring that IO is not interrupted;
Next, per S061, computing node A reclaims its QP resources, notifies storage node C to reclaim its QP resources, and periodically checks the status of the IB network card;
Computing node A resets the QP state, notifies the peer storage node C to perform the same reset, reinitializes the QP, and retries QP negotiation with storage node C until the QPs at both ends are connected, then sets the RDMA channel state to available;
Next, per S062, once the network card state recovers, the QP connection between computing node A and storage node C is re-established;
After the network card recovers, computing node A retries QP creation until it succeeds, then retries QP negotiation with storage node C until the QPs at both ends are connected;
Finally, per S063, computing node A migrates the service data back onto the RDMA QP over the restored QP connection:
the data sending task of computing node A selects the RDMA channel and re-migrates the service data to it;
Computing node A detects that the kernel-mode TCP channel has not been used for a long time, destroys it, and sends a destroy request to storage node C, which destroys its end of the kernel-mode TCP channel synchronously.
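The last step destroys the kernel-mode TCP channel after it has gone unused "for a long time", a duration the text does not quantify. A minimal Python sketch of an idle reaper, assuming a configurable `idle_limit_s`; `KernelTcpReaper` and its methods are hypothetical, and `destroy` stands in for tearing down the local socket and sending the destroy request to the storage node.

```python
import time
from typing import Callable, Optional


class KernelTcpReaper:
    """Destroy the fallback kernel-TCP channel once it has been idle for idle_limit_s."""

    def __init__(self, idle_limit_s: float, destroy: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_limit_s = idle_limit_s
        self._destroy = destroy
        self._clock = clock
        self.last_used: Optional[float] = None  # None means no kernel-TCP channel exists

    def on_channel_created(self) -> None:
        self.last_used = self._clock()

    def on_send(self) -> None:
        self.last_used = self._clock()

    def tick(self) -> bool:
        """Call periodically; returns True if the channel was destroyed on this tick."""
        if self.last_used is None:
            return False
        if self._clock() - self.last_used >= self.idle_limit_s:
            self._destroy()  # hypothetical: also notifies the storage node to destroy its end
            self.last_used = None
            return True
        return False
```

Waiting for idleness rather than destroying the channel the instant RDMA recovers gives in-flight IO on kernel-mode TCP time to complete.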
Embodiment two:
FIG. 1 shows a flow chart of adaptive transport protocol selection for the traffic communication and dynamic switching method based on three-stack fusion;
FIG. 2 shows a transport protocol failover flowchart of the traffic communication and dynamic switching method based on three-stack fusion;
The traffic communication and dynamic switching method based on three-stack fusion provided by this embodiment includes the following steps:
First, per S01, the fused protocol stacks of computing node B and storage node C are initialized;
Next, per S02, computing node B sends a negotiation message and records time t1, then records time t2 when the reply to the negotiation message arrives;
Computing node B creates a QP and sends a QP negotiation message containing the QP's starting sequence number, QP number, local GID and other information; after receiving it, storage node C creates and connects its QP and replies with its own QP negotiation message; computing node B connects its QP after receiving the reply and notifies that the service connection is available;
Next, per S03, computing node B uses the configured round trip delay threshold and the RTT between t1 and t2 to determine whether it is in the same AZ as storage node C;
The RTT between computing node B and storage node C is calculated as RTTB = t2 - t1. Typically, the RTT between nodes in the same AZ is below 100 µs and the RTT across AZs is about 1 ms, so the threshold is configured as 200 µs: when the RTT is below the threshold, the nodes are considered to be in the same AZ; otherwise they communicate across AZs.
Next, per S141, since computing node B and storage node C are in different AZs (RTTB is greater than the round trip delay threshold) and both support user-mode TCP, user-mode TCP is selected as the transport protocol of computing node B;
RTTB is greater than the threshold, which indicates that computing node B and storage node C are in different AZs; since both also support user-mode TCP, computing node B adopts user-mode TCP as its transport protocol.
Next, per S142, computing node B creates a user-mode TCP socket and initiates a link establishment request to storage node C; storage node C replies to the request, the link is established successfully, and computing node B notifies that the service connection is available;
Next, per S143, computing node B sends read-write requests to storage node C over the established link, implementing service communication.
Computing node B calls the data sending interface on the connection created in S142 and sends read-write IO requests to storage node C; the user-mode TCP channel is selected to deliver the data to the network, and storage node C receives the data from the user-mode TCP channel, so the service communicates normally;
Next, per S151, when the user-mode TCP channel of computing node B becomes abnormal, the underlying protocol stack closes the user-mode TCP socket, immediately creates a kernel-mode TCP socket and sends a kernel-mode TCP connection request to storage node C; storage node C replies to the request and the channel is created successfully;
Computing node B detects that the user-mode TCP socket is abnormal and sets the user-mode TCP channel to unavailable;
The data sending task of computing node B finds no available data channel and notifies the connection management module to immediately establish a kernel-mode TCP channel; a kernel-mode TCP channel establishment request is sent to storage node C, storage node C replies, and the kernel-mode TCP channel is established and set to available;
Next, per S152, computing node B migrates the service data to kernel-mode TCP and maintains the read-write state.
When computing node B retransmits service data, it selects the kernel-mode TCP channel and immediately migrates the service data to it, ensuring that IO is not interrupted; computing node B then calls the message sending interface on the connection created in S151 and sends read-write IO requests to storage node C, so the service continues to communicate normally.
Next, per S061, computing node B re-creates a user-mode TCP socket and establishes a user-mode TCP connection with storage node C;
Computing node B destroys the faulty user-mode TCP socket, re-creates one, and retries establishing the user-mode TCP connection with storage node C until both ends are connected, then sets the user-mode TCP channel to available;
Next, per S062, computing node B migrates the service data back to the user-mode TCP channel;
The data sending task of computing node B selects the user-mode TCP channel, and the service data is migrated back to it;
Computing node B detects that the kernel-mode TCP channel has not been used for a long time, destroys it, and sends a destroy request to storage node C, which destroys its end of the kernel-mode TCP channel synchronously.
By detecting the RTT between communicating nodes, the invention determines from the RTT whether the two nodes are in the same AZ and adaptively selects a suitable transport protocol for them, so no application intervention is needed, network resources are fully utilized, and communication performance improves;
at the same time, the connection state and network card state are monitored in real time; when the original connection becomes unavailable because of network congestion or a network card fault, the transport is immediately switched to the highly reliable kernel-mode TCP protocol, and after the fault recovers the RDMA or user-mode TCP connection is actively restored and the service data migrated back. The whole process is transparent to the service and service communication is not interrupted, greatly improving communication stability and reliability;
by fusing protocol stacks, the invention supports three stacks (RDMA, user-mode TCP and kernel-mode TCP), hides the internal implementation differences of the transport protocols, and provides a unified communication interface to applications. Compared with a protocol stack for a single transport protocol, applications need not deal with the internal differences and programming interface of each protocol, so it is simple and easy to use; it exploits the technical advantages of RDMA and TCP at the same time and solves the problem that large-scale networking cannot be achieved with the RDMA protocol alone.
Embodiment three:
Fig. 3 shows a schematic diagram of a traffic communication and dynamic switching device based on three-stack fusion for implementing the method of the first embodiment. As shown in Fig. 3, the traffic communication and dynamic switching device based on three-stack fusion provided in this embodiment includes:
An external interface module, which provides a unified calling interface for application software. It includes a fused protocol stack resource initialization interface, a connection creation interface, a data transmission interface and a task processing interface. The resource initialization interface initializes the resources of each underlying protocol stack; the connection creation interface creates a connection from the client node to the server node and returns a connection number to the application software; the application software calls the data transmission interface to send data on a designated connection number; and the application software calls the task processing interface in a loop to process the various events in the fused protocol stack, such as asynchronous data sending and data reception.
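The four interfaces of the external interface module can be sketched as a facade. This Python sketch is hypothetical throughout: `FusedStack`, `init`, `connect`, `send` and `poll` are illustrative names, the transports are stubbed out, and sends "complete" immediately so the event loop has something to report.

```python
import itertools
from collections import deque
from typing import Deque, Dict, List, Tuple

Event = Tuple[str, int, bytes]


class FusedStack:
    """Hypothetical facade over the four interfaces; transports are stubbed out."""

    def __init__(self) -> None:
        self._next_conn = itertools.count(1)
        self._conns: Dict[int, Tuple[str, int]] = {}
        self._events: Deque[Event] = deque()
        self.initialized = False

    def init(self) -> None:
        # Resource initialization: a real stack would set up RDMA, user-mode TCP and kernel TCP.
        self.initialized = True

    def connect(self, server: str, port: int) -> int:
        # Connection creation: returns a connection number, never a transport handle.
        if not self.initialized:
            raise RuntimeError("call init() first")
        conn = next(self._next_conn)
        self._conns[conn] = (server, port)
        return conn

    def send(self, conn: int, data: bytes) -> None:
        # Data transmission: the caller names a connection number only.
        if conn not in self._conns:
            raise KeyError(f"unknown connection {conn}")
        self._events.append(("send_done", conn, data))  # stub: completes immediately

    def poll(self, max_events: int = 32) -> List[Event]:
        # Task processing: the application calls this in a loop to drive async events.
        out: List[Event] = []
        while self._events and len(out) < max_events:
            out.append(self._events.popleft())
        return out
```

Because applications only ever hold a connection number, the stack can switch the channel behind that number between RDMA, user-mode TCP and kernel-mode TCP without the application noticing.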
The connection management module is used to implement control message negotiation between the two communication nodes, adaptive selection of the transmission protocol, data connection creation, state monitoring, fault switching and similar functions;
the fusion transmission stack preferentially selects RDMA for communication nodes within the same AZ and selects user-mode TCP for communication nodes across AZs. The main reason is that a cross-AZ link has a large transmission delay and lossless transmission is difficult to guarantee on it; since RDMA is very sensitive to packet loss, it is not suitable for cross-AZ communication nodes.
Transmission protocol adaptive selection: the stack actively measures the RTT between communication nodes and uses it to determine whether the nodes are in the same AZ. The round trip delay between nodes in the same AZ is generally less than 100 us, while across AZs it is generally about 1 ms, so the threshold is configured as 200 us: when the RTT is less than the threshold, the nodes are considered to be in the same AZ; otherwise they communicate across AZs. (200 us is not absolute; it is a configurable parameter that can be set according to the actual network environment.)
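A minimal sketch of the selection rule, assuming the RTT is taken as t2 - t1 from the negotiation message timestamps (as in claim 1) and expressed in microseconds. The function names and the `Transport` enum are hypothetical. The text does not say what happens when the preferred protocol is not supported; falling back to kernel-mode TCP in that case is an assumption of this sketch.

```python
from enum import Enum


class Transport(Enum):
    RDMA = "rdma"
    USER_TCP = "user_tcp"
    KERNEL_TCP = "kernel_tcp"


def measure_rtt_us(t1_us, t2_us):
    """RTT from the negotiation send time t1 and the reply time t2 (microseconds)."""
    if t2_us < t1_us:
        raise ValueError("reply time precedes send time")
    return t2_us - t1_us


def select_transport(rtt_us, threshold_us=200.0,
                     rdma_supported=True, user_tcp_supported=True):
    """RTT below the threshold means same AZ -> RDMA; otherwise cross-AZ -> user-mode TCP."""
    same_az = rtt_us < threshold_us
    if same_az and rdma_supported:
        return Transport.RDMA
    if not same_az and user_tcp_supported:
        return Transport.USER_TCP
    # Assumption: kernel-mode TCP is the last resort when the preferred stack is missing.
    return Transport.KERNEL_TCP
```

For example, an 80 us RTT selects RDMA and a 1 ms RTT selects user-mode TCP; the threshold is a parameter so it can be tuned per network, as the text recommends.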
Creating a data connection: a connection in the fusion protocol stack is an abstract description of the communication connection from the client node to the server node and is managed through a connection number. One connection manages multiple data channels, where a channel is a communication instance of a specific transport protocol, such as one QP under the RDMA transport protocol or one TCP socket under the TCP transport protocol. When a connection is initialized, only the high-priority channel (an RDMA channel or a user-mode TCP channel) is created. A kernel-mode TCP channel is created only when a fault makes the high-priority channel unavailable, and once the high-priority channel becomes available again, the kernel-mode TCP channel is promptly destroyed to save resources;
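The connection/channel relationship can be sketched as a small data structure. This is an illustration under stated assumptions, not the patented implementation: `Channel`, `Connection` and the two event methods are hypothetical names, and channel kinds are plain strings. It shows the lifecycle the text describes: only the high-priority channel exists at first, the kernel-mode TCP channel appears on failure, and it is removed on recovery.

```python
from dataclasses import dataclass, field

HIGH_PRIORITY_KINDS = ("rdma", "user_tcp")


@dataclass
class Channel:
    kind: str                 # "rdma" (one QP), "user_tcp" or "kernel_tcp" (one socket)
    available: bool = True


@dataclass
class Connection:
    conn_id: int              # connection number returned to the application
    primary_kind: str         # high-priority transport chosen during negotiation
    channels: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.primary_kind not in HIGH_PRIORITY_KINDS:
            raise ValueError(f"unexpected primary transport {self.primary_kind!r}")
        # Only the high-priority channel exists right after initialization.
        self.channels[self.primary_kind] = Channel(self.primary_kind)

    def on_primary_failure(self):
        self.channels[self.primary_kind].available = False
        # The kernel-mode TCP channel is created only after a fault.
        self.channels.setdefault("kernel_tcp", Channel("kernel_tcp"))

    def on_primary_recovered(self):
        self.channels[self.primary_kind].available = True
        # Destroy the fallback promptly to release its resources.
        self.channels.pop("kernel_tcp", None)
```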
Connection state monitoring: the connection state of the nodes is monitored in real time through a channel management state machine, and the channel state is set to unavailable when a fault occurs. When data then needs to be sent, no available RDMA channel or user-mode TCP channel can be selected, which triggers the creation of a kernel-mode TCP channel and ensures that the service flow is not interrupted.
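One way to express the channel management state machine and the send-time fallback is sketched below. The three states, the allowed transitions and the names `ChannelFsm` and `pick_channel` are assumptions made for illustration; the text only says that a faulty channel is marked unavailable and that failing to find a high-priority channel triggers kernel-mode TCP creation.

```python
from enum import Enum, auto


class ChannelState(Enum):
    CONNECTING = auto()
    AVAILABLE = auto()
    UNAVAILABLE = auto()


# Allowed transitions of the hypothetical per-channel state machine.
TRANSITIONS = {
    ChannelState.CONNECTING: {ChannelState.AVAILABLE, ChannelState.UNAVAILABLE},
    ChannelState.AVAILABLE: {ChannelState.UNAVAILABLE},
    ChannelState.UNAVAILABLE: {ChannelState.CONNECTING},  # repair restarts the channel
}


class ChannelFsm:
    def __init__(self, kind):
        self.kind = kind                      # "rdma", "user_tcp" or "kernel_tcp"
        self.state = ChannelState.CONNECTING

    def move(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.kind}: {self.state.name} -> {new_state.name} not allowed")
        self.state = new_state


def pick_channel(channels, create_kernel_tcp):
    """Prefer an AVAILABLE high-priority channel; otherwise fall back to kernel-mode TCP.

    The kernel-mode TCP channel is created on demand the first time no high-priority
    channel can be selected. Returns None if the fallback exists but is not usable yet.
    """
    for ch in channels:
        if ch.kind != "kernel_tcp" and ch.state is ChannelState.AVAILABLE:
            return ch
    for ch in channels:
        if ch.kind == "kernel_tcp":
            return ch if ch.state is ChannelState.AVAILABLE else None
    kernel = create_kernel_tcp()
    channels.append(kernel)
    return kernel if kernel.state is ChannelState.AVAILABLE else None
```

Making the transition table explicit means an illegal jump (for example, straight from UNAVAILABLE back to AVAILABLE without a repair attempt) fails loudly instead of silently.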
Automatic connection repair: the fusion protocol stack repairs connections automatically. While a channel is unavailable, it periodically tries to recover it, either by resetting the faulty channel or by creating a new one. Once the high-priority channel has recovered, its state is set to available, and subsequent data to be sent is migrated back to the RDMA or user-mode TCP channel, achieving efficient and stable service transmission.
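The periodic repair can be sketched as a retry loop. The callables `try_reset` and `try_recreate`, the attempt limit and the interval are hypothetical parameters: the text says recovery is attempted periodically but gives no interval or limit. The `sleep` argument is injectable so the loop can be exercised without real delays.

```python
import time


def repair_loop(try_reset, try_recreate, max_attempts=5, interval_s=1.0, sleep=time.sleep):
    """Periodically try to recover an unavailable channel.

    try_reset / try_recreate are caller-supplied callables returning True on success;
    a reset is tried first, then a freshly created channel.
    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_reset() or try_recreate():
            # The caller would now mark the channel AVAILABLE and migrate traffic back.
            return attempt
        sleep(interval_s)
    return None
```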
The task scheduling module is used to manage the connection switching of the connection management module and the dynamic loading and unloading of tasks;
tasks are loaded and unloaded dynamically according to channel usage in the connection management module. For example, when only RDMA channels have been created on a node, the task scheduling module loads only RDMA tasks for processing; when a kernel-mode TCP channel is created because of a fault, kernel-mode TCP tasks are loaded, and after the fault is recovered and the kernel-mode TCP channel is destroyed, those tasks are unloaded. This avoids, as far as possible, wasting CPU resources on executing invalid tasks.
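The load/unload rule amounts to keeping the set of loaded task groups equal to the set of channel types currently in use. A minimal sketch, assuming a hypothetical `task_factory` that returns the task callables for one transport type:

```python
class TaskScheduler:
    """Keeps a task set loaded only while a channel of that transport type exists."""

    def __init__(self, task_factory):
        self.task_factory = task_factory  # kind -> list of zero-argument task callables
        self.loaded = {}                  # kind -> loaded task list

    def sync(self, channel_kinds):
        wanted = set(channel_kinds)
        for kind in sorted(wanted - set(self.loaded)):
            self.loaded[kind] = self.task_factory(kind)   # load tasks for a new channel type
        for kind in sorted(set(self.loaded) - wanted):
            del self.loaded[kind]                         # unload tasks no channel needs

    def run_once(self):
        """One polling round: run every loaded task and return how many ran."""
        ran = 0
        for tasks in self.loaded.values():
            for task in tasks:
                task()
                ran += 1
        return ran
```

Calling `sync` whenever the connection management module creates or destroys a channel keeps the polling loop free of tasks for transports that have no channels.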
The task execution module is the high-performance data-plane processing module. The data transmission and data reception processes of each protocol are abstracted and encapsulated as tasks, which are added to a task execution queue through the task calling module; the application software calls the task processing interface in a loop to trigger task execution. There are mainly the following four task types:
Asynchronous data transmission task: to ensure reliable transmission, the fusion protocol stack creates one asynchronous send queue per worker thread, and when the application software calls the data transmission interface, the data is only enqueued in that queue. The data sending task tries to deliver the queued data to the network in order, preferring a high-priority channel of the designated connection. Only when no high-priority channel is available does it check the kernel-mode TCP channel: if that channel is available, the data is sent on it directly; if it has not yet been created, the connection management module is notified to create it. When no channel is available, transmission is retried periodically until the packet is delivered to the network.
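One pass of the sending task might look like the sketch below. `FakeChannel`, `choose_channel` and `send_task` are hypothetical names, and the connection is modeled as a dict from channel kind to channel. Two choices here are assumptions rather than statements from the text: a newly requested kernel-mode TCP channel is not used in the same pass (its setup is treated as asynchronous), and the pass stops at the first packet it cannot send so that queue order is preserved.

```python
from collections import deque

HIGH_PRIORITY = ("rdma", "user_tcp")


class FakeChannel:
    """Stand-in channel for this sketch; records what it 'sent'."""

    def __init__(self, available=True):
        self.available = available
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)


def choose_channel(conn, create_kernel_tcp):
    for kind in HIGH_PRIORITY:
        ch = conn.get(kind)
        if ch is not None and ch.available:
            return ch
    kernel = conn.get("kernel_tcp")
    if kernel is None:
        conn["kernel_tcp"] = create_kernel_tcp()  # ask connection management to create it
        return None                               # used on a later pass
    return kernel if kernel.available else None


def send_task(queue, conn, create_kernel_tcp):
    """One pass over the asynchronous send queue; returns how many packets were sent."""
    sent = 0
    while queue:
        ch = choose_channel(conn, create_kernel_tcp)
        if ch is None:
            break  # nothing usable: keep order and retry the head on the next pass
        ch.send(queue.popleft())
        sent += 1
    return sent
```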
Successful transmission confirmation task: the underlying transport protocols of the fusion protocol stack are reliable RDMA and TCP, and the peer replies with an ack after receiving the data. The fusion protocol stack calls the underlying transport protocol interface to query whether a packet has been acked by the peer, and only acked data messages are reported to the application software as successfully sent, which ensures reliable data transmission.
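The confirmation rule can be sketched as follows. `query_acked` is a hypothetical callable standing in for whatever query the underlying transport offers; no specific RDMA or TCP API is implied. In-flight packets are keyed by a hypothetical sequence number, and only those the lower layer reports as acked are handed to the application as successful.

```python
def confirm_task(in_flight, query_acked, notify_app):
    """Report success only for packets the lower transport says the peer has acked.

    in_flight: dict seq -> payload of packets delivered but not yet confirmed.
    query_acked(seqs) -> set of acked sequence numbers (hypothetical lower-layer query).
    notify_app(seq, payload) is called once per confirmed packet, in sequence order.
    """
    confirmed = 0
    for seq in sorted(query_acked(list(in_flight))):
        payload = in_flight.pop(seq, None)
        if payload is not None:   # ignore acks for packets no longer tracked
            notify_app(seq, payload)
            confirmed += 1
    return confirmed
```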
Data retransmission task: although the underlying transport protocols of the fusion protocol stack are reliable, packets can still be lost when an underlying data channel fails, so to ensure complete and reliable transmission the fusion transmission stack implements a retransmission mechanism in the common framework. Each packet is timestamped when it is delivered to the network; if it has not been confirmed after more than 5 seconds, it is retransmitted once and its timestamp is updated. A packet is cleared from the send queue either when it is confirmed successfully or when, because of a prolonged network fault, the number of retransmissions reaches the upper limit, in which case a transmission failure is returned to the application software.
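A sketch of the timeout-driven retransmission, using the 5-second timeout from the text. The retry limit of 3 is a hypothetical value (the text only mentions "the upper limit"), and `Packet`, `resend` and `fail` are illustrative names. Acked packets are assumed to have been removed already by the confirmation task; time is passed in explicitly so the logic is deterministic.

```python
from dataclasses import dataclass

ACK_TIMEOUT_S = 5.0   # from the text
MAX_RETRIES = 3       # hypothetical upper limit; the text does not give a value


@dataclass
class Packet:
    seq: int
    payload: bytes
    sent_at: float
    retries: int = 0


def retransmit_task(queue, now, resend, fail):
    """Retransmit packets unconfirmed for more than ACK_TIMEOUT_S; give up after MAX_RETRIES.

    queue: dict seq -> Packet still awaiting confirmation.
    """
    for seq in list(queue):
        pkt = queue[seq]
        if now - pkt.sent_at <= ACK_TIMEOUT_S:
            continue
        if pkt.retries >= MAX_RETRIES:
            del queue[seq]
            fail(seq)           # report transmission failure to the application
            continue
        resend(pkt)
        pkt.retries += 1
        pkt.sent_at = now       # refresh the timestamp after each retransmission
```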
Data receiving task: calls the interface provided by the underlying protocol stack to receive data and submits the received packets to the application software for processing.
A transport protocol module, used for the send and receive processing of each underlying protocol stack: it provides a unified API and calls the interface of each underlying protocol stack, and comprises an RDMA protocol processing unit, a user-mode TCP protocol processing unit and a kernel-mode TCP protocol processing unit.
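The "unified API over three processing units" idea can be sketched with an abstract base class. `ProtocolUnit`, `LoopbackUnit` and `TransportModule` are hypothetical names; a loopback buffer stands in for real RDMA, user-mode TCP and kernel-mode TCP units, which in practice would wrap their own native interfaces.

```python
from abc import ABC, abstractmethod


class ProtocolUnit(ABC):
    """Common send/receive surface each lower-layer processing unit implements."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> list: ...


class LoopbackUnit(ProtocolUnit):
    """Stand-in for the RDMA, user-mode TCP or kernel-mode TCP unit in this sketch."""

    def __init__(self, name):
        self.name = name
        self._buf = []

    def send(self, payload):
        self._buf.append(payload)

    def receive(self):
        out, self._buf = self._buf, []
        return out


class TransportModule:
    def __init__(self, units):
        self.units = units  # transport kind -> ProtocolUnit

    def send(self, kind, payload):
        self.units[kind].send(payload)

    def receive_all(self):
        # Polls every unit through the same interface, regardless of transport.
        return {kind: unit.receive() for kind, unit in self.units.items()}
```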
Embodiment IV:
This embodiment also provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer instructions that, when executed by the processor, implement the method of the first or second embodiment;
in practical applications, the processor may be implemented by an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a controller, a microcontroller unit (MCU), a microprocessor or another electronic component, for executing the method in the above embodiment.
The method implemented in this embodiment is as described in embodiment one.
Embodiment V:
This embodiment also provides a computer storage medium storing a computer program which, when executed by one or more processors, implements the method of the first or second embodiment;
the computer readable storage medium may be implemented by any type of volatile or nonvolatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
The method implemented in this embodiment is as described in the first or second embodiment.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the invention, and any changes and substitutions that would be apparent to one skilled in the art are intended to be included within the scope of the present invention.

Claims (10)

1. A traffic communication and dynamic switching method based on three-stack fusion, characterized by comprising the following steps:
S01, initializing the fusion protocol stack of each computing node and each storage node;
S02, sending, by a computing node, a negotiation message and recording a time point t1, and recording a time point t2 upon receiving the reply to the negotiation message;
S03, judging whether the computing node and the storage node are in the same AZ by comparing the round trip delay between t1 and t2 with the configured round trip delay threshold;
S04, when the computing node and the storage node belong to the same AZ, establishing read-write requests using integrated RDMA as the transmission protocol;
S14, when the computing node and the storage node do not belong to the same AZ, establishing read-write requests using user-mode TCP as the transmission protocol;
S05, when the transmission protocol between the computing node and the storage node becomes abnormal, migrating, by the computing node, the service data to kernel-mode TCP;
S06, re-establishing the read-write requests between the computing node and the storage node according to S04, and migrating the service data carried on kernel-mode TCP to the storage node over the corresponding transmission type.
2. The traffic communication and dynamic switching method based on three-stack fusion according to claim 1, wherein S04 specifically comprises:
S041, when the computing node and the storage node are in the same AZ and support integrated RDMA, selecting integrated RDMA as the transmission protocol of the computing node;
S042, creating a QP on the computing node, creating a QP on the storage node through QP negotiation information, and thereby completing an RDMA QP channel between the computing node and the storage node;
S043, sending, by the computing node, read-write requests to the storage node over the RDMA QP channel to realize service communication.
3. The traffic communication and dynamic switching method based on three-stack fusion according to claim 2, wherein S05 specifically comprises:
S051, when the fusion protocol stack detects that the IB device is abnormal and the QP is in the error state, establishing a kernel-mode TCP connection by the computing node, which sends a TCP socket connection request to the storage node to complete channel creation;
S052, migrating, by the computing node, the service data to kernel-mode TCP while maintaining the read-write state.
4. The traffic communication and dynamic switching method based on three-stack fusion according to claim 3, wherein S06 specifically comprises:
S061, reclaiming QP resources on the computing node, notifying the storage node to reclaim its QP resources, and periodically checking the status of the IB network card;
S062, after the network card status recovers, re-establishing the RDMA QP channel between the computing node and the storage node;
S063, migrating, by the computing node, the service data back into the RDMA QP over the RDMA QP channel.
5. The traffic communication and dynamic switching method based on three-stack fusion according to claim 1, wherein S14 specifically comprises:
S141, when the computing node and the storage node are in different AZs, the RTT is greater than the round trip delay threshold and user-mode TCP is supported, selecting user-mode TCP as the transmission protocol of the computing node;
S142, creating a user-mode TCP socket on the computing node and initiating a link establishment request to the storage node, the storage node replying to the request to complete link establishment;
S143, after link establishment, sending, by the computing node, read-write requests to the storage node to realize service communication.
6. The traffic communication and dynamic switching method based on three-stack fusion according to claim 5, wherein S05 specifically comprises:
S151, when the user-mode TCP channel of the computing node becomes abnormal, closing, by the underlying protocol stack, the user-mode TCP socket, creating a kernel-mode TCP socket, and sending a kernel-mode TCP socket connection request to the storage node to complete channel creation;
S152, migrating, by the computing node, the service data to kernel-mode TCP while maintaining the read-write state.
7. The traffic communication and dynamic switching method based on three-stack fusion according to claim 6, wherein S06 specifically comprises:
S061, re-creating, by the computing node, a user-mode TCP socket and establishing a user-mode TCP channel with the storage node;
S062, migrating, by the computing node, the service data back into user-mode TCP over the user-mode TCP channel.
8. A traffic communication and dynamic switching device based on three-stack fusion, the device comprising:
an external interface module, used to provide a unified calling interface for application software;
a connection management module, used to implement control message negotiation between the computing node and the storage node and adaptive selection of the transmission protocol;
a task scheduling module, used to dynamically load and unload tasks as the transmission protocol changes;
a task execution module, used to encapsulate data processing in the protocols as tasks and add them to a task execution queue;
a transport protocol module, used for the send and receive processing of each underlying protocol stack, comprising an RDMA protocol processing unit, a user-mode TCP protocol processing unit and a kernel-mode TCP protocol processing unit.
9. An electronic device comprising a memory and a processor, the memory being configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the processor, implement the traffic communication and dynamic switching method based on three-stack fusion according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, and the computer program, when executed by a processor, implements the traffic communication and dynamic switching method based on three-stack fusion according to any one of claims 1 to 7.
CN202311735913.3A 2023-12-15 2023-12-15 Traffic communication and dynamic switching method and device based on three-stack fusion Pending CN117857658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311735913.3A CN117857658A (en) 2023-12-15 2023-12-15 Traffic communication and dynamic switching method and device based on three-stack fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311735913.3A CN117857658A (en) 2023-12-15 2023-12-15 Traffic communication and dynamic switching method and device based on three-stack fusion

Publications (1)

Publication Number Publication Date
CN117857658A true CN117857658A (en) 2024-04-09

Family

ID=90532000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311735913.3A Pending CN117857658A (en) 2023-12-15 2023-12-15 Traffic communication and dynamic switching method and device based on three-stack fusion

Country Status (1)

Country Link
CN (1) CN117857658A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination