CN108494817B - Data transmission method, related device and system - Google Patents

Publication number
CN108494817B
Authority
CN
China
Prior art keywords
middleware
working node
socket
rdma
application layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810131741.1A
Other languages
Chinese (zh)
Other versions
CN108494817A (en)
Inventor
袁学文
王曙光
周斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201810131741.1A
Publication of CN108494817A
Application granted
Publication of CN108494817B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/141 Setup of application sessions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161 Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162 Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application discloses a data transmission method, a related device and a system. The method may comprise: a first device and a second device establish a Remote Direct Memory Access (RDMA) connection between a first working node and a second working node, wherein a distributed computing system comprises the first device and the second device, the first device deploys the first working node, and the second device deploys the second working node; and the first device transmits the data output by the first working node to the second working node through a network card supporting the RoCE protocol. By implementing the scheme, the RoCE technology can be used on the storm system, so that the data transmission delay is reduced and the processing resources consumed by devices on data transmission are reduced. In addition, the implementation code of the application-layer interfaces of the device does not need to be modified: it is only necessary to add the middleware and redirect the application-layer interfaces to the middleware in order to use the RoCE technology, so the implementation process is simple and convenient.

Description

Data transmission method, related device and system
Technical Field
The present application relates to the field of big data processing and communication technologies, and in particular, to a data transmission method, a related apparatus and a system.
Background
With the development of information technology, a large amount of information is generated all the time, and the amount of data required to be processed by various communication systems or communication platforms is increasing. Since the value of the information decreases with the passage of time, it is desirable to process the information in time after it is generated so that the user can make accurate analysis and decision on the real-time information.
In order to meet users' timeliness requirements on information, distributed processing technologies and platforms are currently used to process big data; common distributed computing systems include storm, spark, samza, flink, and the like.
At present, users have higher and higher requirements on timeliness of information processing, and how to improve a common distributed computing system to meet the higher timeliness requirement is an urgent problem to be solved.
Disclosure of Invention
The application provides a data transmission method, a related device and a system, which can use RoCE technology on a storm system, reduce data transmission delay and reduce processing resources consumed by equipment on data transmission.
In a first aspect, the present application provides a data transmission method, including: establishing, by a first device and a second device, a remote direct memory access (RDMA) connection between a first working node and a second working node, wherein a distributed computing system comprises the first device and the second device, the first device deploys the first working node, and the second device deploys the second working node; and sending, by the first device, the data output by the first working node to the second working node through a network card supporting the RDMA over Converged Ethernet (RoCE) protocol.
By implementing the data transmission method of the first aspect, the RoCE technology can be used on the storm system, so that the data transmission delay is reduced, and the processing resources consumed by the equipment in data transmission are reduced.
Specifically, the first device is deployed with an application layer and a first middleware, the application layer provides various APIs, the first middleware provides various APIs supporting the RoCE protocol, and the API of the application layer is redirected to the API of the first middleware; the second device is deployed with second middleware which provides various APIs supporting the RoCE protocol.
In an alternative embodiment, the first device, through the application layer, instructs the first middleware to create, together with the second middleware of the second device, the RDMA connection between the first working node and the second working node. By implementing the method of this optional embodiment, the storm system does not need to modify the implementation code of the application-layer interfaces of the device; it only needs to add a middleware and redirect the application-layer interfaces to the middleware in order to use the RoCE technology, so the implementation process is simple and convenient.
Here, when the first device and the second device establish the RDMA connection, the first device may actively initiate a connection request to the second device, or the first device may accept a connection request initiated by the second device. That is, the first device may be either a client or a server; the two cases are described below.
(1) In the first case, the first device is the client that establishes the RDMA connection and the second device is the server. In this case, the first device, through the application layer, specifically instructs the first middleware to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; and, based on the RDMA socket pointed to by the first descriptor, initiate to the second middleware of the second device an RDMA connection request between the first working node and the second working node.
(2) In the second case, the first device is the server that establishes the RDMA connection and the second device is the client. In this case, the first device, through the application layer, specifically instructs the first middleware to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; create a new RDMA socket when an RDMA connection request between the first working node and the second working node, sent by the second middleware of the second device, is detected by listening on the socket pointed to by the first descriptor; and accept the RDMA connection request based on the new RDMA socket.
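The two cases above can be pictured with a small descriptor table. The following is only a minimal sketch under stated assumptions: the class `SketchMiddleware`, its method names, the dictionary-based socket records, and the example address are all hypothetical, and no real RDMA library is called. It only illustrates how a descriptor first obtained for a TCP/IP socket might be re-pointed to an RDMA socket, and how the server side creates a new RDMA socket to accept a request.

```python
import itertools


class SketchMiddleware:
    """Hypothetical middleware model; socket records are plain dicts."""

    def __init__(self):
        self._next_fd = itertools.count(3)  # 0-2 are conventionally stdio
        self._table = {}                    # descriptor -> socket record

    def create_tcp_socket(self):
        # Create a TCP/IP socket and return the descriptor pointing to it.
        fd = next(self._next_fd)
        self._table[fd] = {"kind": "tcp", "state": "created"}
        return fd

    def redirect_to_rdma(self, fd):
        # Create an RDMA socket and point the existing descriptor to it.
        self._table[fd] = {"kind": "rdma", "state": "created"}

    def connect(self, fd, server_tag):
        # Client side: initiate an RDMA connection request (modelled as a tuple).
        sock = self._table[fd]
        if sock["kind"] != "rdma":
            raise ValueError("descriptor does not point to an RDMA socket")
        sock["state"] = "connecting"
        return ("RDMA_CONNECT_REQUEST", server_tag)

    def accept(self, listen_fd, request):
        # Server side: create a NEW RDMA socket and accept the request on it.
        if self._table[listen_fd]["kind"] != "rdma":
            raise ValueError("descriptor does not point to an RDMA socket")
        new_fd = next(self._next_fd)
        self._table[new_fd] = {"kind": "rdma", "state": "connected",
                               "request": request}
        return new_fd

    def kind(self, fd):
        return self._table[fd]["kind"]


# Case (1): the first device is the client.
client = SketchMiddleware()
client_fd = client.create_tcp_socket()
client.redirect_to_rdma(client_fd)
request = client.connect(client_fd, ("192.0.2.10", 6700))  # example address

# Case (2): the first device is the server.
server = SketchMiddleware()
listen_fd = server.create_tcp_socket()
server.redirect_to_rdma(listen_fd)
conn_fd = server.accept(listen_fd, request)
```

In this sketch the application layer keeps using the descriptor it was first given, which is one way to read why the application-layer code does not need to change.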
In an optional embodiment, before the first device instructs the first middleware through the application layer, the first device may obtain the tag of the server and determine whether the tag of the server is the tag of the first working node or the tag of the second working node, where a tag includes an IP address and/or a port number.
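As a hedged illustration of this tag check, the sketch below compares a server tag against the tags of the two working nodes to decide which role the first device plays. The `Tag` type, the function names, and the rule that only the fields present in the server tag are compared (one reading of "IP address and/or port number") are assumptions made for the example, not details taken from the application.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    ip: Optional[str] = None
    port: Optional[int] = None


def matches(server_tag: Tag, node_tag: Tag) -> bool:
    # Assumption: compare only the fields that the server tag carries.
    if server_tag.ip is None and server_tag.port is None:
        return False
    ip_ok = server_tag.ip is None or server_tag.ip == node_tag.ip
    port_ok = server_tag.port is None or server_tag.port == node_tag.port
    return ip_ok and port_ok


def role_of_first_device(server_tag: Tag, first_node: Tag, second_node: Tag) -> str:
    # The first device is the server if the server tag is its own node's tag.
    if matches(server_tag, first_node):
        return "server"
    if matches(server_tag, second_node):
        return "client"
    raise ValueError("server tag matches neither working node")


first = Tag("192.0.2.1", 6700)   # example addresses only
second = Tag("192.0.2.2", 6701)
role_a = role_of_first_device(Tag("192.0.2.2", 6701), first, second)
role_b = role_of_first_device(Tag(port=6700), first, second)
```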
In this application, the first device sends the data output by the first working node to the second working node through the network card supporting the RoCE protocol as follows. First, the data output by the first working node is sent from the application layer to the first middleware, and the first middleware sends the data to the network card supporting the RoCE protocol. Next, that network card encapsulates the data to obtain a data frame that can be transmitted over Ethernet. Finally, the network card sends the data frame to the network card supporting the RoCE protocol of the second device. The second device decapsulates the data frame through its network card supporting the RoCE protocol, sends the decapsulated data to the second middleware, and, through the second middleware, delivers the decapsulated data to the application layer deployed in the second device.
In a second aspect, the present application provides a first device configured to perform the data transmission method described in the first aspect. The first device may include a memory, and a processor and a communication interface coupled with the memory, wherein: the communication interface is configured with a network card supporting the RoCE protocol and may be used by the first device to communicate with other devices; the memory includes an application layer and a middleware, the user interfaces of the application layer being redirected to the user interfaces of the middleware, and is used to store the implementation code of the data transmission method described in the first aspect; and the processor is configured to execute the program code stored in the memory, that is, to execute the data transmission method provided in the first aspect or in any possible implementation of the first aspect.
In a third aspect, a first device is provided, which includes a plurality of functional modules, and is configured to correspondingly perform the data transmission method provided in the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, the present application provides a communication system, where the communication system is a storm system and includes a first device and a second device. The first device deploys a first working node, and the second device deploys a second working node. The first device is configured to establish, with the second device, a remote direct memory access (RDMA) connection between the first working node and the second working node, and to transmit the data output by the first working node to the second working node through a network card supporting the RDMA over Converged Ethernet (RoCE) protocol.
In some optional embodiments, the first device of the fourth aspect may be the first device described in the second or third aspect.
In a fifth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing thereon a program code for implementing the data transmission method described in the first aspect, the program code containing execution instructions for executing the data transmission method described in the first aspect.
In a sixth aspect, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data transmission method described in the first aspect.
By implementing the method and the device, the RoCE technology can be used on the storm system, the data transmission time delay is reduced, and the processing resources consumed by the equipment on data transmission are reduced. In addition, by applying the data transmission method of the application, the storm system does not need to modify the implementation codes of all interfaces of the application layer of the equipment, only needs to add a middleware, redirects all interfaces of the application layer to the middleware, can use the RoCE technology, and is simple and convenient in implementation process.
Drawings
FIG. 1 is a schematic diagram of a data processing process of a distributed computing system in the prior art;
FIG. 2 is a schematic structural diagram of the storm system provided herein;
FIG. 3 is a hardware block diagram of the device provided herein;
FIG. 4 is a schematic diagram of the TCP/IP protocol stack and the protocol stack of RoCE V2 as provided herein;
FIG. 5 is a schematic structural diagram of a RoCE frame provided in the present application;
FIG. 6 is a schematic flowchart of a procedure for calling interface functions at two ends of communication according to the present application;
FIG. 7 is a schematic flowchart of a data transmission method provided in the present application;
FIG. 8 is a schematic diagram of an interface function call process of the client provided in the present application;
FIG. 9 is a schematic diagram of an interface function call process of a server according to the present application;
FIG. 10 is a functional block diagram of a first device provided in the present application.
Detailed Description
The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
At present, the amount of data to be processed by various communication systems or communication platforms is increasing, and in order to increase the processing speed, a distributed computing method is generally used to connect a plurality of devices having different functions or having different data at different locations through a communication network, and perform different processing on the data under the unified management of a control system, thereby coordinating and cooperating to solve the problem.
In a conventional distributed computing system, such as storm or spark, data needs to be transmitted between different devices many times while it is being processed. During data transmission, most of the time and processing resources are consumed by copying data between user mode and kernel mode and by processing the data in the kernel protocol stack.
Referring to fig. 1, fig. 1 illustrates a process of processing data in a distributed computing system in the prior art. In fig. 1, device 1 processes data first, and then sends the processed data to device 2, and device 2 continues to further process the data. That is, the device 1 is a data transmitting end, and the device 2 is a data receiving end.
First, device 1 and device 2 establish a socket connection based on the TCP/IP protocol; that is, all data transmission over this socket connection uses the TCP/IP protocol. After the socket connection is established, device 1 and device 2 may transfer data.
In the device 1, the data processing flow is as follows:
1. the data is stored in a buffer of the application layer in user space, and the device 1 first copies the data from user space to kernel space;
2. in the kernel space, the device 1 encapsulates the data through the kernel protocol stack (for example, constructing the TCP header, filling in IP packets, and fragmentation);
3. the device 1 sends the encapsulated data from the kernel space to the underlying hardware device, such as a network card;
4. the underlying hardware device sends the data onto the network, and the network delivers it to the hardware device (e.g., a network card) of the device 2.
In the device 2, the data processing flow is as follows:
1. the underlying hardware device (such as a network card) of the device 2 receives the data sent by the device 1, and the device 2 sends the data to the kernel space;
2. in the kernel space, the device 2 decapsulates the data through the kernel protocol stack (e.g., removes the TCP header);
3. the device 2 sends the decapsulated data from the kernel space to the user space;
4. the buffer of the application layer in user space stores the data.
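The conventional flow in the two lists above can be reproduced with an ordinary TCP socket. The following is a minimal sketch, assuming both "devices" are simulated on the loopback interface of one machine; the function names and the payload are illustrative. The comments mark where the user-space/kernel-space copies described above take place.

```python
import socket
import threading


def receive_once(server_sock, out):
    # Device 2: accept one connection and read until the sender closes it.
    conn, _ = server_sock.accept()
    chunks = []
    with conn:
        while True:
            chunk = conn.recv(4096)  # kernel space -> user-space buffer
            if not chunk:
                break
            chunks.append(chunk)
    out.append(b"".join(chunks))


def tcp_round_trip(payload):
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        receiver = threading.Thread(target=receive_once, args=(server, received))
        receiver.start()
        # Device 1: establish the TCP/IP socket connection, then send.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(("127.0.0.1", port))
            client.sendall(payload)     # user-space buffer -> kernel space;
                                        # the kernel stack builds TCP/IP headers
        receiver.join()
    return received[0]


result = tcp_round_trip(b"data processed by device 1")
```

Every call to `sendall` and `recv` here crosses the user/kernel boundary, which is the overhead the application aims to avoid.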
In the prior art, a distributed computing system generally processes data in the manner shown in fig. 1, and cannot avoid a copy process of data between a user space and a kernel space and a processing process of data in a kernel protocol stack, so that transmission delay is large, and a requirement of a user on higher timeliness of data cannot be met.
In order to further improve the efficiency of a distributed computing system and meet the requirement of users on higher and higher information timeliness, the application provides a data transmission method, a related device and a system, which can avoid the copying process of data between a user mode and a kernel mode and the protocol encapsulation process of the data in a kernel protocol stack, thereby reducing the data transmission delay, reducing the use of processing resources and increasing the throughput of a communication system.
Referring to fig. 2, fig. 2 is a schematic structural diagram of the storm system, a distributed computing system, provided herein. The storm system may also be called a cluster; a plurality of nodes are configured in the storm system and are divided into a master node and slave nodes.
The master node, also called the Nimbus node, is a process that performs global management in the storm system. It is mainly responsible for receiving the Topology submitted by a user, performing the corresponding verification, allocating tasks, and writing task-related information into the corresponding Zookeeper directory. In addition, the Nimbus node is responsible for monitoring task execution through Zookeeper.
The slave node, also called the Supervisor node, is a process that manages the tasks on its node in the storm system. It is responsible for monitoring task allocation and starting or stopping work processes (workers) as required. One or more workers can be started inside a Supervisor node; a worker completes an actual data processing task, and a slave node running a worker is referred to as a working node in this application. A single worker may include one or more threads (executors), which jointly complete the processing task of the worker.
Here, a complete Topology is executed by multiple working nodes (workers) together, each worker executing only a subset of the Topology. For example, suppose the Topology submitted by a user performs statistical analysis on a session. The Topology can be processed by two working nodes in storm, worker1 and worker2: worker1 performs word segmentation on the session, and worker2 computes statistics on the segmented data, so that worker1 and worker2 jointly complete the task submitted by the user. Further, worker1 may include executor 1, executor 2, and executor 3, where executor 1 receives data (i.e., the session), executor 2 processes the data (i.e., performs word segmentation on the session), and executor 3 sends the data to worker2.
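The worker1/worker2 example can be sketched in a few lines. This is a simplified, single-process illustration and not the storm API: the in-memory queue stands in for the connection between the two working nodes, whitespace splitting stands in for word segmentation, and the `None` end-of-stream marker is an assumption of the sketch.

```python
from collections import Counter
from queue import Queue


def worker1(sessions, outbox):
    for session in sessions:      # executor 1: receive data (the session)
        words = session.split()   # executor 2: word segmentation
        outbox.put(words)         # executor 3: send the data to worker2
    outbox.put(None)              # end-of-stream marker (assumption)


def worker2(inbox):
    # Count the segmented words received from worker1.
    counts = Counter()
    while True:
        words = inbox.get()
        if words is None:
            break
        counts.update(words)
    return counts


channel = Queue()                 # stands in for the inter-worker connection
worker1(["hello storm", "hello roce"], channel)
totals = worker2(channel)
```

In the scheme of this application, the channel between the two workers is where the RDMA connection would take the place of a TCP/IP connection.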
Referring to fig. 3, fig. 3 is a schematic structural diagram of the device 100 provided in the present application. The device 100 may deploy a working node in the storm system shown in fig. 2. As shown in FIG. 3, device 100 may include a bus 110, a processor 120, a memory 130, an input-output module 140, a display module 150, a communication interface 160, and other similar components.
Bus 110 may be circuitry that interconnects the above-described elements and passes communications (e.g., control messages) between the above-described elements.
The processor 120 may receive commands from the above-described other elements (e.g., the memory 130, the input-output module 140, the display module 150, the communication interface 160, etc.) through the bus 110, parse the received commands, and perform calculations or data processing according to the parsed commands. Specifically, the processor 120 may be configured to call a program stored in the memory 130, for example, an implementation program, on the first device or the second device side, of the data transmission method provided in one or more embodiments of the present application, and execute instructions included in the program.
Memory 130 may store commands and data received from or generated by processor 120 or other elements (e.g., input-output module 140, display module 150, communication interface 160, etc.). In some embodiments of the present application, the memory 130 may be configured to store an implementation program, on the first device or the second device side, of the data transmission method provided in one or more embodiments of the present application. For the implementation of the data transmission method, please refer to the following embodiments.
The memory 130 includes programming modules, such as a kernel 131, middleware 132, an application layer 133, an SDP module 134, etc., which may each be implemented in software, firmware, hardware, or a combination of two or more thereof.
The SDP module 134 may be used to configure parameters of working nodes deployed or running in the device 100. For example, in the present application, the SDP module may be used to configure a startup parameter of a worker in the device, so that the device may use the SDP protocol when deploying or running the worker.
The application layer 133 may include a home application, dialer application, short message service/multimedia message service application, instant messaging application, browser application, camera application, alarm application, contacts application, voice dialing application, email application, calendar application, media player application, photo album application, clock application, and any other similar application. The application layer provides an Application Programming Interface (API), which may include at least one interface or function for file control, window control, image processing, character control, and the like.
Middleware 132 may be used to communicate and exchange data between application layer 133 and kernel 131. In this application, the middleware 132 also provides APIs, and each API in the middleware 132 supports the RDMA over Converged Ethernet (RoCE) protocol, so that the device 100 supports data transmission based on the RoCE protocol. For RDMA and RoCE, refer to the description of the basic concepts below. In this application, the APIs of the application layer 133 of the device are redirected to the APIs of the middleware 132: when the device calls an API of the application layer 133, the corresponding API in the middleware 132 is actually called. For the specific implementation of the redirection, refer to the descriptions of the subsequent key technical points and method embodiments.
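The effect of redirecting application-layer APIs to middleware APIs can be illustrated generically. The sketch below is an assumption-laden illustration of the idea only: the classes, the `send` method, and the rebinding via `setattr` are hypothetical, and the actual redirection mechanism of the application is the one described in its later embodiments.

```python
class ApplicationLayer:
    """Hypothetical application-layer API backed by the TCP/IP path."""

    def send(self, data):
        return ("tcp/ip", data)


class RoceMiddleware:
    """Hypothetical middleware API backed by the RoCE path."""

    def send(self, data):
        return ("roce", data)


def redirect(app, middleware, api_names):
    # Rebind each named application-layer API to the middleware's version.
    for name in api_names:
        setattr(app, name, getattr(middleware, name))


app = ApplicationLayer()
before = app.send(b"payload")     # uses the original implementation
redirect(app, RoceMiddleware(), ["send"])
after = app.send(b"payload")      # same call site, middleware implementation
```

The caller's code is identical before and after the redirection, which matches the stated goal of leaving the application-layer implementation unmodified.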
The kernel 131 may control or manage the system resources (e.g., the bus 110, the processor 120, the memory 130, etc.) used to perform operations or functions implemented by other programming modules (e.g., the middleware 132 and the application layer 133). In addition, the kernel 131 may provide an interface through which the middleware 132 or the application layer 133 can access, control, and manage the various elements of the device 100. The transport layer of the TCP/IP protocol stack and the mechanisms below it are provided in the kernel 131 for encapsulating communication data.
The input-output module 140 may be used to enable interaction between the device 100 and the device/external environment, and may mainly include an audio input-output module, a key input module, a display, and the like. Specifically, the input/output module may further include: cameras, touch screens, sensors, and the like.
The display module 150 may display various information (e.g., multimedia data, text data, etc.) received from the above elements.
Communication interface 160 may be used for device 100 to communicate with other electronic devices. Specifically, the other electronic device may be the device shown in fig. 1. The communication interface 160 may be a Long Term Evolution (LTE) communication interface, or may be a communication interface of a 5G or future new air interface. The device 100 is not limited to wireless communication interfaces and may also be configured with a wired communication interface, such as a Local Area Network (LAN) interface, to support wired communication. In this application, the communication interface 160 is configured with a network card. The network card may include both a RoCE network card and a common Ethernet network card (e.g., a 10GE network card), or may include only the RoCE network card, which can also implement the functions of the common Ethernet network card. For the functions of the RoCE network card and the Ethernet network card, refer to the description of the basic concepts below.
It should be noted that the device 100 shown in fig. 3 is only one implementation of the embodiments of the present application; in practical applications, the device 100 may include more or fewer components, which is not limited herein.
Based on the communication system shown in fig. 2 and the device shown in fig. 3, the present application provides a data transmission method, so that a distributed computing system can use a RoCE technology in combination, reduce data transmission delay, and reduce processing resources consumed by the device in data transmission. In addition, by applying the data transmission method, the distributed computing system does not need to modify the implementation codes of the application layer, only needs to add middleware and make simple configuration, can use the RoCE technology, and has simple and convenient implementation process.
To better describe the present application, the basic concepts related to the present application are first introduced.
(I) RoCE
First, Remote Direct Memory Access (RDMA) is a data transfer technique that can quickly move data over the network from the memory of one machine or device to the memory of another, without passing the data through the operating system kernel protocol stack and without any impact on the operating system. By using the RoCE technology, the copying of data between user space and kernel space and the processing of data in the kernel protocol stack are avoided, so memory consumption, CPU consumption, and data transmission delay can all be reduced.
The RDMA over Converged Ethernet (RoCE) protocol is a network protocol that allows remote direct memory access to be used over Ethernet. Currently, the widely used RoCE standard is RoCE V2. See FIG. 4, which shows the TCP/IP protocol stack and the protocol stack of RoCE V2.
As shown in the left diagram of fig. 4, in the TCP/IP protocol stack, the application layer is provided by the user process in the user space, and the transport layer and the mechanism below are provided by the kernel space. When the TCP/IP protocol is used for transmitting data, an application layer in a user space is used for explaining the meaning of the data, and a transmission layer provided by a kernel and a network layer and a link layer below the transmission layer encapsulate the data.
As shown in the right diagram of fig. 4, in the protocol stack of RoCE V2, the application layer is provided by the user process in the user space, and the transport layer and the mechanisms below it are provided by the network card. The transport layer comprises the IB transport protocol (InfiniBand Base Transport protocol), and the network layer comprises the UDP protocol and the IP protocol. When the RoCE protocol is used for transmitting data, the application layer in the user space interprets the meaning of the data, and the transport layer provided by the network card, together with the network layer and link layer below it, encapsulates the data. Here, a network card capable of providing this data encapsulation function is a RoCE network card supporting the RoCE protocol, which is different from a conventional network card.
Because the TCP/IP protocol stack and the RoCE V2 protocol stack differ, the data frames generated when data is transmitted using the TCP/IP protocol and the RoCE V2 protocol have different structures.
Referring to fig. 5, fig. 5 shows the frame structure of RoCE V2. The RoCE V2 frame includes an Ethernet header (Eth Header), an IP header (IP Header), a UDP header (UDP Header), an IB transport layer header (InfiniBand Base Transport Header), a message payload (IB Payload), a redundancy check field (ICRC), and a frame check sequence (FCS).
In the UDP header, the destination port number is 4791, which indicates that the current data frame is a RoCE V2 frame encapsulated by the RoCE V2 protocol stack. That is, whether the current data frame was transmitted using the TCP/IP protocol or the RoCE protocol can be determined from the value of the destination port number in the UDP header of the Ethernet frame.
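The port-based check described above can be sketched as follows. This is a minimal illustration, assuming an untagged Ethernet frame (no VLAN tag) carrying IPv4; the function names are hypothetical, and `build_frame` only fabricates zero-filled headers for the example rather than producing a valid RoCE V2 frame.

```python
import struct

ROCE_V2_UDP_PORT = 4791   # destination port named in the text above
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ETH_HEADER_LEN = 14


def is_roce_v2_frame(frame):
    # Need at least Ethernet (14) + minimal IPv4 (20) + UDP (8) bytes.
    if len(frame) < ETH_HEADER_LEN + 20 + 8:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return False
    ip_header_len = (frame[ETH_HEADER_LEN] & 0x0F) * 4   # IHL in 32-bit words
    if frame[ETH_HEADER_LEN + 9] != IPPROTO_UDP:         # IPv4 protocol field
        return False
    udp_start = ETH_HEADER_LEN + ip_header_len
    if len(frame) < udp_start + 8:
        return False
    (dst_port,) = struct.unpack("!H", frame[udp_start + 2:udp_start + 4])
    return dst_port == ROCE_V2_UDP_PORT


def build_frame(dst_port, protocol=IPPROTO_UDP):
    # Zero-filled example headers; only the fields inspected above are set.
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = bytes([0x45]) + bytes(8) + bytes([protocol]) + bytes(10)  # 20 bytes
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return eth + ip + udp


roce_frame = build_frame(4791)
other_udp_frame = build_frame(53)
tcp_frame = build_frame(4791, protocol=6)
```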
(II) socket-based connections
A socket is the basic operation unit supporting network communication. It can be regarded as an endpoint for bidirectional communication between processes on different devices and is the basis of communication. In short, the two communicating parties each first create a socket and establish a connection through the sockets, after which data can be transmitted to complete the communication.
In this application, sockets are divided into two categories according to protocols: TCP/IP sockets and RDMA sockets.
A TCP/IP socket is a socket that uses the conventional TCP/IP protocol. The connection established based on the TCP/IP socket is called a TCP/IP connection, the data transmission based on the TCP/IP connection follows the TCP/IP protocol, and the transmission and processing process of the data in the TCP/IP protocol stack can refer to fig. 4 and the related description.
Here, when data is transmitted based on a TCP/IP connection, copying the data between the user space and the kernel space and processing the data in the kernel protocol stack account for a large part of the transmission delay and consume a large amount of the device's processing resources.
RDMA sockets are sockets using the RoCE protocol. The connection established based on the RDMA socket is called RDMA connection, the data transmission based on the RDMA connection follows the RoCE protocol, and the transmission and processing process of the data in the RoCE protocol stack can refer to fig. 4 and the related description.
When data is transmitted based on an RDMA connection, the copying of the data between the user space and the kernel space and the processing of the data in the kernel protocol stack are avoided. The encapsulation of the data is offloaded to the RoCE network card, which does not consume the processing resources of the device when encapsulating the data. As a result, both the transmission delay and the processing resources consumed by the device in data transmission are reduced.
(III) Interface functions
An API may also be referred to as an interface function. An interface function is a predefined function that encapsulates the implementation code of certain functionality; it is the interface through which application programs and developers use functionality provided by some software or hardware. The functionality corresponding to an interface function is obtained by calling that interface function.
Several common interface function names and their corresponding functions are described below.
Socket function: int socket(int family, int type, int protocol)
The socket function is used to create a socket (socket). A socket is the basis for network communication, and for a user program, a socket is an opened file. During communication, both communication parties need to create a socket, and communication can be performed after connection is established based on the socket.
In the input parameters of the socket function, the protocol specifies the type of the transmission protocol, which indicates the transmission protocol used by the data to be transmitted by using the socket connection.
When the socket function call succeeds, it returns a socket descriptor (sockfd) that the kernel space allocates from the currently free socket descriptors, and this socket descriptor points to the created socket.
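As a small illustration of the socket function and its return value, the following Python sketch creates a TCP socket and reads back the descriptor the kernel allocated. It uses Python's standard `socket` module as a stand-in for the C call; `create_tcp_socket` is a hypothetical helper name.

```python
import socket


def create_tcp_socket():
    """Create a TCP socket and return it together with its descriptor."""
    # family = AF_INET, type = SOCK_STREAM, protocol = IPPROTO_TCP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
    return sock, sock.fileno()  # fileno() is the kernel-allocated descriptor
```

The descriptor is a small non-negative integer; after the socket is closed, Python reports `-1` for it.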
Bind function: int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen)
The bind function is used for the server to bind the own IP address and port number to the created socket so as to listen to the incoming network connection request.
In the input parameters of the bind function, myaddr is used for describing the IP address and port number of the server, sockfd is used for indicating the socket descriptor of the server, and the bind function is used for binding the socket pointed by sockfd and the IP address and port number described by myaddr.
Listen function: int listen(int sockfd, int backlog)
The listen function is used for the server side to monitor the connection request initiated by the client side.
In the input parameters of the listen function, sockfd is used to indicate the socket descriptor of the server, and the listen function is used to listen to the incoming connection on the IP address and port number bound by the socket (i.e. the socket pointed to by the sockfd).
Accept function: int accept(int sockfd, struct sockaddr *cliaddr, socklen_t *addrlen)
The accept function is used for the server to accept the connection request initiated by the client.
In the input parameters of the accept function, sockfd is used to indicate the socket descriptor of the server, and the accept function is used to establish a new socket and obtain the socket descriptor pointing to the new socket allocated by the kernel after monitoring the connection request on the socket pointed to by sockfd, and accept the connection request initiated by the client on the new socket, which is equivalent to establishing a connection with the client based on the new socket. Here, the new socket uses the same protocol as the listening socket to which sockfd points.
Connect function: int connect(int sockfd, const struct sockaddr *servaddr, socklen_t addrlen)
The connect function is used for the client to initiate a socket connection request to the server.
In the input parameters of the connect function, sockfd is used for indicating a socket descriptor of the client, servaddr is used for describing an IP address and a port number of the server, and the connect function is used for initiating a connection request to the server on the socket pointed by sockfd.
Recv function: ssize_t recv(int sockfd, void *buff, size_t nbytes, int flags)
The recv function is used by the receiving end to receive data over a network connection.
In the input parameters of the recv function, the sockfd is used to indicate the socket descriptor of the receiving end, the recv function is used to receive data based on the network connection, and the network connection is the connection established based on the socket pointed by the sockfd.
Send function: ssize_t send(int sockfd, const void *buff, size_t nbytes, int flags)
The send function is used to send data over the network connection.
In the input parameters of the send function, the sockfd is used to indicate the socket descriptor of the sender, the send function is used to send data based on the network connection, and the network connection is the connection established based on the socket pointed by the sockfd.
Close function: int close(int fd)
The close function is used to cancel the pointing relationship between the input parameter fd and the socket, thereby stopping the data operation on the socket based on the fd.
Dup2 function: int dup2(int oldfd, int newfd)
The dup2 function is used to change the socket pointed to by oldfd to the socket pointed to by newfd.
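The effect of dup2 followed by close can be tried out with the sketch below, which assumes a POSIX system. Note that in the POSIX call the second argument is the descriptor that gets repointed, so making fd1 refer to the second socket is written `dup2(fd2, fd1)`. The helper name `repoint` is hypothetical, and two ordinary sockets of different types are used only so the result can be observed.

```python
import os
import socket


def repoint(first: socket.socket, second: socket.socket) -> None:
    """Make first's descriptor refer to second's socket, then release fd2."""
    # POSIX dup2(oldfd, newfd): newfd now refers to the same socket as oldfd,
    # and whatever newfd referred to before is closed by the kernel.
    os.dup2(second.fileno(), first.fileno())
    # Closing fd2 only drops that descriptor; the socket stays open via fd1.
    second.close()
```

After `repoint(s1, s2)`, the numeric value of `s1.fileno()` is unchanged, but kernel queries through it describe the socket that `s2` created.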
It is understood that only a few commonly used interface functions are shown here, and that more interface functions may be included in practical applications, and that other interface functions not listed may be used as well in the present application.
(IV) Server and client
A socket connection is first established before two devices communicate. In this application, the device that initiates a socket connection request is referred to as the client, and the device that receives the socket connection request is referred to as the server. After the server accepts the socket connection request, the socket connection between the two devices is successfully established.
Referring to fig. 6, when a socket connection is established between a server and a client, both ends call interface functions of the application layer to implement the related functions. The order in which the server calls the interface functions is typically: socket function, bind function, listen function, accept function. The order in which the client calls the interface functions is typically: socket function, connect function.
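The two call orders can be sketched on a single machine over the loopback interface. This is an illustrative example using ordinary TCP sockets from Python's standard library; `serve_once` and `exchange` are hypothetical names, and the upper-casing of the reply is only there to make the round trip visible.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server side after listen: accept one connection and answer it."""
    conn, _peer = listener.accept()            # accept
    with conn:
        data = conn.recv(1024)                 # recv
        conn.sendall(data.upper())             # send


def exchange(message: bytes) -> bytes:
    """Run one client/server round trip on 127.0.0.1 and return the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket
    listener.bind(("127.0.0.1", 0))                               # bind
    listener.listen(1)                                            # listen
    listener.settimeout(5)
    server_addr = listener.getsockname()

    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # socket
    client.settimeout(5)
    client.connect(server_addr)                                   # connect
    client.sendall(message)
    reply = client.recv(1024)

    client.close()
    worker.join()
    listener.close()
    return reply
```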
The key points of the present application are described below.
(I) Middleware
In this application, middleware is deployed in the memory of the device. The middleware is software that connects system software and application software in the user space, and it defines a relatively stable high-level application environment.
The middleware provides various APIs, such as those described in basic concept (III), interface functions, above. In this application, the APIs provided by the middleware support the RoCE protocol.
(II) interface redirection
In the application, an application layer is deployed in a memory of the device, a plurality of APIs are provided in the application layer, and implementation codes of the APIs provided in the application layer are not changed and support the TCP/IP protocol by default.
In the present application, the various APIs of the application layer are redirected to the APIs of the middleware. Specifically, when the device runs an application program or a process, each API in the application layer is originally called, and after redirection, each API in the middleware is actually called, so that the communication system and the device of the present application can transmit data using an RDMA protocol. Here, the interface redirection may be implemented by a preload technology or other redirection technologies, and the present application is not limited in any way.
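The idea of interface redirection can be modelled with a small dispatch table. The sketch below is a conceptual illustration only, not the preload mechanism itself: the class names, the `redirect` method and the returned strings are all hypothetical.

```python
class Middleware:
    """Hypothetical middleware whose APIs are assumed to support RoCE."""

    def socket(self, family, sock_type, protocol):
        return f"middleware socket (RoCE-capable), protocol={protocol}"


class ApplicationLayer:
    """Hypothetical application layer with redirectable API entries."""

    def __init__(self):
        # By default each API name resolves to the unchanged implementation.
        self._dispatch = {"socket": self._default_socket}

    def _default_socket(self, family, sock_type, protocol):
        return f"default TCP/IP socket, protocol={protocol}"

    def redirect(self, name, target):
        # The caller keeps using the same API name; only the target changes.
        self._dispatch[name] = target

    def call(self, name, *args):
        return self._dispatch[name](*args)
```

The caller's code is identical before and after `redirect`, which mirrors the point that the application layer's own implementation code is not modified.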
For details of which middleware APIs each application-layer API is redirected to when it is called, reference may be made to the relevant description of the subsequent method embodiment.
Referring to fig. 7, fig. 7 shows a flow chart of the data transmission method of the present application.
In the method shown in fig. 7, the storm system includes a first device and a second device, and specific implementations of the first device and the second device may refer to the device provided in fig. 3 and the related description. The first device is provided with a first working node, the second device is provided with a second working node, and the first working node and the second working node are used for jointly executing the Topology submitted by the user, so that data transmission is required between the first working node and the second working node. Fig. 7 illustrates a data transmission process between a first working node and a second working node, and as shown in the figure, the data transmission method of the present application may include the following steps:
S101. The first device and the second device establish an RDMA connection between the first working node and the second working node.
In the application, before data transmission, the first working node and the second working node need to establish RDMA connection.
Here, the first device is deployed with an application layer and first middleware, the application layer provides various APIs, the first middleware provides various APIs supporting the RoCE protocol, and the APIs of the application layer are redirected to the APIs of the first middleware; the second device is deployed with second middleware which provides various APIs supporting the RoCE protocol.
In the application, the first device first calls an API of the application layer for establishing an RDMA connection between the first working node and the second working node, and since the API of the application layer is redirected to the first middleware, the RDMA connection between the first working node and the second working node is actually established by the first middleware and a second middleware deployed in the second device.
That is, the first device, through the application layer, instructs the first middleware to create, together with the second middleware of the second device, an RDMA connection between the first working node and the second working node. Here, when the first device and the second device establish the RDMA connection, the first device may either actively initiate a connection request to the second device or accept a connection request initiated by the second device. The following describes how the first device and the second device establish the RDMA connection based on sockets.
(1) In the first case, the first device is a client that establishes the RDMA connection, and the second device is a server that establishes the RDMA connection, i.e., the first device actively initiates an RDMA connection request to the second device.
In this case, the first device instructs the first middleware, through the application layer, to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; and initiate an RDMA connection request to the second middleware of the second device based on the RDMA socket pointed to by the first descriptor, the RDMA connection request being a request for an RDMA connection between the first working node and the second working node. After the server (the second device) accepts the RDMA connection request, the first device and the second device have successfully established an RDMA connection between the first working node and the second working node.
(2) In the second case, the first device is the server that establishes the RDMA connection, and the second device is the client that establishes the RDMA connection, i.e., the first device accepts the RDMA connection request initiated by the second device.
In this case, the first device instructs the first middleware, through the application layer, to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; create a new RDMA socket when an RDMA connection request from the second middleware of the second device is detected on the socket pointed to by the first descriptor; and accept the RDMA connection request based on the new RDMA socket, the RDMA connection request being a request for an RDMA connection between the first working node and the second working node. After the server (the first device) accepts the RDMA connection request, the first device and the second device have successfully established an RDMA connection between the first working node and the second working node.
In an optional embodiment, before the first device instructs the first middleware through the application layer, the first device may further obtain the tag of the server and determine that the tag of the server is the tag of the first working node or the tag of the second working node.
Here, when establishing a connection between two devices, the tags of the server side need to be uniformly acquired, and the connection is established according to the tags of the server side. The server is a device that receives the connection request, and in this application, the server may be a first device or a second device.
Here, various nodes, such as a super node, a Zookeeper node, a worker node (worker), and the like, may be deployed in the first device and the second device. In order to distinguish different nodes, labels of various types of nodes are predefined in the first device and the second device, and the labels of the nodes are used for identifying the types of the nodes.
Specifically, when the first device is the server, the server tag obtained by the first device is its own tag. When the first device determines that the server tag is the tag of the first working node, that is, the tag of a worker process, the first device instructs the first middleware through the application layer to create, together with the second middleware of the second device, an RDMA connection between the first working node and the second working node. When the first device is the client, the server tag obtained by the first device is the tag of the second device. When the first device determines that the server tag is the tag of the second working node, that is, the tag of a worker, the first device likewise instructs the first middleware through the application layer to create, together with the second middleware of the second device, an RDMA connection between the first working node and the second working node. The tag may include, among other things, an IP address and a port number.
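The tag check can be sketched as a simple set lookup. In this illustration the `Tag` type, the `choose_transport` function and the example addresses are hypothetical; the fallback to TCP/IP for non-worker tags follows the client and server call flows described for fig. 8 and fig. 9.

```python
from typing import NamedTuple, Set


class Tag(NamedTuple):
    """Hypothetical server tag made of an IP address and a port number."""
    ip: str
    port: int


def choose_transport(server_tag: Tag, worker_tags: Set[Tag]) -> str:
    """Use RDMA only when the server tag identifies a worker node."""
    return "RDMA" if server_tag in worker_tags else "TCP/IP"
```

For example, with `worker_tags = {Tag("192.0.2.10", 6700)}`, a connection to that address and port would use RDMA, while a connection to any other endpoint on the same host would stay on TCP/IP.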
S102. The first device sends the data output by the first working node to the second working node through a network card supporting the RDMA over Converged Ethernet (RoCE) protocol.
Through step S101, the first device and the second device have successfully established a remote direct memory access (RDMA) connection between the first working node and the second working node, and can transmit data based on that RDMA connection.
In the application, the data output by the first working node is intermediate data generated after the first working node processes the Topology submitted by the user, and the intermediate data needs to be sent to the second working node, so that the second working node further processes the data. And the data output by the first working node is stored in a buffer area of the application layer.
In the application, the first device and the second device are both provided with network cards supporting the RoCE protocol.
Specifically, when the first device sends the data output by the first working node to the second working node, the data output by the first working node is sent from the application layer to the first middleware, the data is sent to the network card supporting the RoCE protocol through the first middleware, the data is encapsulated through the network card supporting the RoCE protocol to obtain a data frame capable of being transmitted in the ethernet, and the data frame is sent to the network card supporting the RoCE protocol of the second device through the network card supporting the RoCE protocol. Here, the data frame is transmitted from the RoCE network card of the first device to the RoCE network card of the second device through the ethernet.
Specifically, when receiving data output by the first working node, the second working node of the second device first receives a data frame sent by the first device through a network card supporting a RoCE protocol in the second device, decapsulates the data frame, sends data obtained after decapsulation to the second middleware, and sends the decapsulated data to an application layer deployed in the second device through the second middleware. At this point, the first working node successfully sends the data to the second working node, and the second working node can continue to further process the received data, so as to jointly execute the Topology submitted by the user with the first working node.
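The send and receive paths above can be followed in a toy model. The sketch below is only a simulation under stated assumptions: `ToyRoceNic` and `ToyMiddleware` are hypothetical classes, the 4-byte marker and length header stand in for the real Ethernet, IP, UDP and IB transport headers, and a CRC-32 from `zlib` stands in for the frame's check fields.

```python
import struct
import zlib


class ToyRoceNic:
    """Toy network card that wraps and unwraps a payload."""
    MAGIC = b"RoCE"  # placeholder for the real protocol headers

    def encapsulate(self, payload: bytes) -> bytes:
        header = self.MAGIC + struct.pack("!I", len(payload))
        check = struct.pack("!I", zlib.crc32(header + payload))
        return header + payload + check

    def decapsulate(self, frame: bytes) -> bytes:
        if len(frame) < 12:
            raise ValueError("frame too short")
        header, body = frame[:8], frame[8:-4]
        (check,) = struct.unpack("!I", frame[-4:])
        (length,) = struct.unpack("!I", header[4:])
        if header[:4] != self.MAGIC or length != len(body):
            raise ValueError("malformed frame")
        if zlib.crc32(header + body) != check:
            raise ValueError("check field mismatch")
        return body


class ToyMiddleware:
    """Toy middleware sitting between the application layer and the card."""

    def __init__(self, nic: ToyRoceNic):
        self.nic = nic

    def send(self, app_buffer: bytes) -> bytes:
        # Application layer -> middleware -> network card (encapsulation).
        return self.nic.encapsulate(app_buffer)

    def deliver(self, frame: bytes) -> bytes:
        # Network card (decapsulation) -> middleware -> application layer.
        return self.nic.decapsulate(frame)
```

In this model the first device would call `send` on its middleware and the second device would call `deliver` on the resulting frame to recover the original buffer.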
In the storm system, the worker nodes are used to perform actual data processing, and thus, data interaction occurs in large quantities between the devices that deploy the worker nodes. By using the data transmission method shown in fig. 7, when data interaction is required between the working nodes, the RDMA connection between the working nodes is created, so that the data is transmitted between the working nodes based on the created RDMA connection using the RoCE protocol, the transmission delay is reduced, the processing resources consumed by the device in data transmission are reduced, and the data processing efficiency and throughput of the storm system are improved.
By implementing the data transmission method shown in fig. 7, the RoCE technology can be used in the storm system, thereby reducing data transmission delay and reducing the processing resources consumed by the device in data transmission. In addition, with the data transmission method shown in fig. 7, the storm system does not need to modify the implementation code of the application-layer interfaces of the device. It only needs to add a middleware and redirect the application-layer interfaces to the middleware in order to use the RoCE technology, so the implementation process is simple and convenient.
In the specific implementation of the data transmission method shown in fig. 7, the first device and the second device implement functions of creating connection, receiving and sending data, and the like by calling an interface function, and the data transmission method of the present application is described below from the perspective of the device calling the interface function.
Referring to fig. 8, fig. 8 is an interface function call process when a device is used as a client, and the device mentioned in fig. 8 may be the first device or the second device in the data transmission method shown in fig. 7. As shown, the interface function calling process when the device is used as a client may include the following steps:
1. The device calls the socket function of the application layer, wherein the value of the input parameter protocol is a first protocol value, and the first protocol value represents a TCP/IP protocol.
In this application, the code of the application layer of the device is not changed. When the device calls the interfaces of the application layer, it still, as in the prior art, passes a first protocol value to the application layer to specify that the device uses the TCP/IP protocol. For example, the first protocol value may be X1, X2, X3 or X4, which respectively represent the TCP, UDP, SCTP and TIPC transport protocols, and the device of this application may pass any one of X1, X2, X3 and X4 to the application layer.
Here, the socket function of the application layer is redirected to the socket function in the middleware, that is, the device triggers step 2 when calling the socket function through the application layer.
2. The device calls a socket function of the middleware for creating a first socket and obtaining a first descriptor (fd1) pointing to the first socket (socket 1).
Specifically, the device calls a socket function of the middleware by taking the first protocol value as a value of an input parameter protocol. Here, since the first protocol value represents the TCP/IP protocol, the first socket created by the device through the middleware is a TCP/IP socket.
3. The device calls the connect function of the application layer, where the input parameter sockfd = fd1.
Specifically, when the device calls the connect function of the application layer, a parameter servaddr for describing the IP address and the port number of the server is also transmitted to the application layer.
4. Through the SDP module, the device determines whether the IP address of the server described by servaddr is an IP address associated with a worker and whether the port number of the server is a port number associated with a worker.
If yes, the connect function of the application layer is redirected to the socket function, the dup2 function, the close function and the connect function in the middleware, that is, when the device calls the connect function through the application layer, step 5 is triggered.
If not, the connect function of the application layer is redirected to the connect function of the middleware, that is, when the device calls the connect function through the application layer, step 6 is triggered.
5. If yes, the device calls the socket function, the dup2 function, the close function and the connect function of the middleware in sequence.
First, the device calls a socket function of the middleware for creating a second socket and obtaining a second descriptor (fd2) pointing to the second socket (socket 2). The value of the input parameter protocol is a second protocol value, the second protocol value is transmitted to the middleware by the SDP module, and the second protocol value represents an RDMA protocol. Here, since the second protocol value represents the RDMA protocol, the second socket created by the device through the middleware is an RDMA socket.
Here, a conventional storm system does not configure parameters related to the RDMA protocol, and the SDP module does not store any protocol value representing the RDMA protocol. To enable the middleware to use the RDMA protocol without extensive changes to the implementation code, this application reuses the SDP-protocol parameters configured by the SDP module of the conventional storm system. Specifically, the JAVA start-up parameters of the SDP module of the device are configured so that, when the IP address and port number of the server are an IP address and port number associated with a worker, the SDP module passes a second protocol value representing the SDP protocol to the middleware. On receiving this value, the middleware treats it as representing the RDMA protocol.
Next, the device calls the dup2 function of the middleware, with the input parameters oldfd = fd1 and newfd = fd2, to point the first descriptor to the second socket. At this point, fd1 and fd2 both point to the second socket.
Then, the device calls the close function through the middleware, with the input parameter fd = fd2, to stop data operations on the second socket based on fd2. Here, the kernel may reclaim fd2, making efficient use of resources.
Finally, the device calls the connect function of the middleware, wherein the values of the input parameters sockfd and servaddr are the same as those passed when the connect function of the application layer was called, that is, sockfd = fd1. Here, as a result of step 5, fd1 points to the second socket (the RDMA socket), so the connect function called by the middleware initiates, on the second socket, a connection request to the server identified by the IP address and port number described by servaddr, creating an RDMA connection.
6. If not, the device calls the connect function of the middleware, wherein the values of the input parameters sockfd and servaddr are the same as those passed when the connect function of the application layer was called, that is, sockfd = fd1.
Here, after steps 1-4, fd1 points to the first socket (the TCP/IP socket), and the connect function called by the middleware initiates a connection request to the server on the first socket, creating a TCP/IP connection.
After step 5 or step 6, the device performs steps 7-8.
7. The device calls a data transmission function of the application layer, such as send or recv, where the input parameter sockfd = fd1.
Here, the send or recv function of the application layer is redirected to the send or recv function in the middleware, that is, step 8 is triggered when the device calls the send or recv function through the application layer. When the device calls the send function of the application layer, the device is the data sending end, that is, the first device in the method embodiment shown in fig. 7; when the device calls the recv function of the application layer, the device is the data receiving end, that is, the second device in the method embodiment shown in fig. 7.
8. The device calls a data transmission function of the middleware, such as send or recv, wherein the value of the input parameter sockfd is the same as that passed when the send or recv function of the application layer was called, that is, sockfd = fd1.
If the device is running a worker process, it performs step 5, so fd1 points to the second socket. The device creates an RDMA connection on the second socket pointed to by fd1 and, through step 8, transmits data based on that RDMA connection.
If the device is running a non-worker process, it performs step 6, so fd1 points to the first socket. The device creates a TCP/IP connection on the first socket pointed to by fd1 and, through step 8, transmits data based on that TCP/IP connection. For the data transmission process, reference may be made to step S102 in fig. 7 and the related description, which are not repeated here.
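The client-side flow in steps 1-8 can be walked through with the sketch below, which assumes a POSIX system. It is not an RDMA implementation: an ordinary TCP socket stands in for the RDMA socket, `mw_connect` is a hypothetical name for the middleware's handling of connect, and SO_KEEPALIVE is set on the stand-in socket only so that the descriptor swap can be observed afterwards. The dup2 call uses the POSIX argument order, in which the second argument is the descriptor that gets repointed.

```python
import os
import socket


def mw_connect(sock1: socket.socket, servaddr, worker_endpoints) -> str:
    """Handle an application-layer connect(fd1, servaddr) in the middleware."""
    if servaddr in worker_endpoints:
        # Step 5: create the second socket (stand-in for an RDMA socket).
        sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock2.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)  # marker
        # Point fd1 at the second socket, then release fd2.
        os.dup2(sock2.fileno(), sock1.fileno())
        sock2.close()
        transport = "RDMA (stand-in)"
    else:
        # Step 6: fd1 still points at the first (TCP/IP) socket.
        transport = "TCP/IP"
    # The caller's descriptor value is unchanged in both branches.
    sock1.connect(servaddr)
    return transport
```

After `mw_connect` returns, the application keeps calling send and recv on fd1 exactly as before, which is why steps 7 and 8 need no knowledge of which branch was taken.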
Referring to fig. 9, fig. 9 shows the interface function calling process when the device is used as a server. The difference from fig. 8 is that the function call sequence when the device is used as a server is: socket function, bind function, listen function, accept function, send/recv function. As shown in the figure, the interface function calling process when the device is used as a server may include the following steps:
1. The device calls the socket function of the application layer, wherein the value of the input parameter protocol is a first protocol value, and the first protocol value represents a TCP/IP protocol.
Here, the socket function of the application layer is redirected to the socket function in the middleware, that is, the device triggers step 2 when calling the socket function through the application layer.
2. The device calls a socket function of the middleware for creating a first socket and obtaining a first descriptor (fd1) pointing to the first socket (socket 1).
3. The device calls the bind function of the application layer, where the input parameter sockfd = fd1.
Specifically, when the device calls the bind function of the application layer, a parameter servaddr describing the IP address and the port number of the server is also passed to the application layer. Here, the IP address and port number of the server are the IP address and port number of the device itself.
4. Through the SDP module, the device determines whether the IP address of the server is an IP address associated with a worker and whether the port number of the server is a port number associated with a worker.
If yes, the bind function of the application layer is redirected to the socket function, dup2 function, close function and bind function in the middleware, that is, when the device calls the bind function through the application layer, step 5 is triggered.
If not, the bind function of the application layer is redirected to the bind function of the middleware, that is, when the device calls the bind function through the application layer, step 6 is triggered.
5. If yes, the device calls the socket function, dup2 function, close function and bind function of the middleware in sequence.
First, the device calls a socket function of the middleware for creating a second socket and obtaining a second descriptor (fd2) pointing to the second socket (socket 2). The value of the input parameter protocol is a second protocol value, the second protocol value is transmitted to the middleware by the SDP module, and the second protocol value represents an RDMA protocol. Here, since the second protocol value represents the RDMA protocol, the second socket created by the device through the middleware is an RDMA socket.
Next, the device calls the dup2 function of the middleware, with the input parameters oldfd = fd1 and newfd = fd2, to point the first descriptor to the second socket. At this point, fd1 and fd2 both point to the second socket.
Then, the device calls the close function through the middleware, with the input parameter fd = fd2, to stop data operations on the second socket based on fd2. Here, the kernel may reclaim fd2, making efficient use of resources.
Finally, the device calls the bind function of the middleware, wherein the values of the input parameters sockfd and servaddr are the same as those passed when the bind function of the application layer was called, that is, sockfd = fd1. Here, as a result of step 5, fd1 points to the second socket (the RDMA socket), and the bind function called by the middleware binds the device's own IP address and port number to the second socket.
6. If not, the device calls the bind function of the middleware, wherein the values of the input parameters sockfd and servaddr are consistent with the value of the input parameter when the application layer calls the bind function.
Here, via steps 1-4, fd1 points to the first socket and the bind function called by the middleware is used to bind its own IP address and port number on the first socket (TCP/IP socket).
After step 5 or step 6, the device performs steps 7-12.
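The descriptor swap in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the middleware of the present application: it uses Python's standard `os` and `socket` modules on a POSIX system, the function name `retarget_first_descriptor` is hypothetical, and a UDP socket stands in for the RDMA socket because no RDMA socket type is available in the standard library.

```python
import os
import socket


def retarget_first_descriptor():
    """Point fd1 at a second socket with dup2, then bind through fd1 (POSIX only)."""
    # The application layer created a TCP/IP socket earlier; fd1 is its descriptor.
    fd1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM).detach()

    # Middleware socket(): create the second socket and obtain fd2.
    # ASSUMPTION: a UDP socket stands in for the RDMA socket.
    fd2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM).detach()

    # Middleware dup2(oldfd=fd2, newfd=fd1): fd1 now refers to the second socket.
    # The first socket loses its only descriptor and is released by the kernel.
    os.dup2(fd2, fd1)

    # Middleware close(fd2): only fd1 keeps the second socket open.
    os.close(fd2)

    # Middleware bind(): uses the same sockfd (fd1) the application layer passed in.
    second = socket.socket(fileno=fd1)  # family and type are detected from fd1
    try:
        second.bind(("127.0.0.1", 0))   # port 0 lets the kernel pick a free port
        return second.type, second.getsockname()[1]
    finally:
        second.close()
```

The returned socket type is that of the second socket, which shows that code holding only fd1 now operates on the replacement socket without any change on the application side.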
7. The device calls the listen function of the application layer, where the input parameter sockfd = fd1.
Here, the listen function of the application layer is redirected to the listen function of the middleware, that is, the device triggers step 8 when it calls the listen function through the application layer.
8. The device calls the listen function of the middleware, where the value of the input parameter sockfd is the same as that passed when the application layer called the listen function, that is, sockfd = fd1. This call listens for connection requests from a client arriving on the socket pointed to by fd1.
9. The device calls the accept function of the application layer, where the input parameter sockfd = fd1.
Here, the accept function of the application layer is redirected to the accept function of the middleware, i.e. the device triggers step 10 when calling the accept function through the application layer.
10. The device calls the accept function of the middleware, where the value of the input parameter sockfd is the same as that passed when the application layer called the accept function, that is, sockfd = fd1. This call accepts a connection request received on the socket pointed to by fd1.
Here, if the device is running a worker process, the device has performed step 5, and fd1 points to the second socket. Through step 10, the device listens for a connection request from the client on the second socket pointed to by fd1, creates a new socket (the third socket), obtains a third descriptor (fd3) pointing to the third socket, and accepts the client's connection request on the third socket to create an RDMA connection.
If the device is running a non-worker process, the device has performed step 6, and fd1 points to the first socket. Through step 10, the device listens for a connection request from the client on the first socket pointed to by fd1, creates a new socket (the fourth socket), obtains a fourth descriptor (fd4) pointing to the fourth socket, and accepts the client's connection request on the fourth socket to create a TCP/IP connection.
11. The device calls a data transmission function of the application layer, such as send or recv, where the input parameter sockfd = accept(fd1, cliaddr, addrlen) = fd3/fd4.
Here, the send or recv function of the application layer is redirected to the send or recv function of the middleware, that is, the device triggers step 12 when it calls the send or recv function through the application layer. When the device calls the send function of the application layer, the device is the data sending end, that is, the first device in the method embodiment shown in fig. 7; when the device calls the recv function of the application layer, the device is the data receiving end, that is, the second device in the method embodiment shown in fig. 7.
12. The device calls a data transmission function of the middleware, such as send or recv, where the value of the input parameter sockfd is the same as that passed when the application layer called the send or recv function, that is, sockfd = accept(fd1, cliaddr, addrlen) = fd3/fd4.
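Steps 7 to 12 on the server side can be sketched as a loopback exchange. This is a hedged illustration only: the `Middleware` class, `recv_exact`, and `run_server_once` are hypothetical names, ordinary TCP sockets from Python's standard library are used in place of RDMA sockets, and the client runs in a thread of the same process so that the example is self-contained.

```python
import socket
import threading


def recv_exact(conn, size):
    """Read exactly `size` bytes, since a single recv may return fewer."""
    chunks = []
    while size > 0:
        chunk = conn.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


class Middleware:
    """Hypothetical middleware exposing the call names the application layer uses."""

    def listen(self, sock, backlog=1):
        sock.listen(backlog)

    def accept(self, sock):
        conn, _peer = sock.accept()
        return conn

    def send(self, conn, data):
        conn.sendall(data)

    def recv(self, conn, size):
        return recv_exact(conn, size)


def run_server_once(payload=b"tuple"):
    mw = Middleware()
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # plays the role of fd1
    listener.bind(("127.0.0.1", 0))
    mw.listen(listener)                      # steps 7-8: listen on fd1
    port = listener.getsockname()[1]
    replies = []

    def client():
        with socket.create_connection(("127.0.0.1", port)) as peer:
            peer.sendall(payload)
            replies.append(recv_exact(peer, len(payload)))

    worker = threading.Thread(target=client)
    worker.start()
    conn = mw.accept(listener)               # steps 9-10: new socket (fd3 or fd4)
    received = mw.recv(conn, len(payload))   # steps 11-12: transfer on the new socket
    mw.send(conn, received.upper())
    worker.join()
    conn.close()
    listener.close()
    return received, replies[0]
```

Note that data is exchanged on the socket returned by accept, not on the listening socket, which mirrors sockfd = fd3/fd4 in steps 11 and 12.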
It is understood that fig. 8 and fig. 9 are only examples, and as the communication technology develops, names of various functions or interfaces may change in the future, and the present application can also call other functions or interfaces with the same function to implement the data transmission method of the present application.
Referring to fig. 10, fig. 10 is a functional block diagram of a first device provided in the present application. The functional blocks of the first device may be implemented by hardware, software or a combination of hardware and software. Those skilled in the art will appreciate that the functional blocks described in FIG. 10 may be combined or separated into sub-blocks to implement the application scheme. Thus, the above description in this application may support any possible combination or separation or further definition of the functional blocks described below.
As shown in fig. 10, the first device may include: a connection unit 101 and a data transmission unit 102, wherein:
A connection unit 101, configured to establish, with a second device, a remote direct memory access (RDMA) connection between a first working node and a second working node, where a distributed computing system includes the first device and the second device, the first device deploys the first working node, and the second device deploys the second working node.
A data transmission unit 102, configured to send the data output by the first working node to the second working node through a network card supporting the Ethernet-based remote direct memory access (RDMA over Converged Ethernet, RoCE) protocol. For the specific implementation of the data transmission unit 102, refer to the related description of step S102 in fig. 7.
Optionally, when establishing an RDMA connection between a first working node and a second working node, the connection unit 101 may instruct, through an application layer, a first middleware and a second middleware of a second device to create the RDMA connection between the first working node and the second working node, where the first device deploys the application layer and the first middleware, and the second device deploys the second middleware.
Optionally, the first device is a client that establishes the RDMA connection, and the second device is a server that establishes the RDMA connection. When the connection unit 101 instructs, through the application layer, the first middleware and the second middleware of the second device to create the RDMA connection between the first working node and the second working node, it specifically instructs the first middleware, through the application layer, to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; and initiate, based on the RDMA socket pointed to by the first descriptor, an RDMA connection request between the first working node and the second working node to the second middleware of the second device.
Optionally, the first device is a server that establishes the RDMA connection, and the second device is a client that establishes the RDMA connection. When the connection unit 101 instructs, through the application layer, the first middleware and the second middleware of the second device to create the RDMA connection between the first working node and the second working node, it specifically instructs the first middleware, through the application layer, to: create a TCP/IP socket and obtain a first descriptor pointing to the TCP/IP socket; create an RDMA socket and point the first descriptor to the RDMA socket; create a new RDMA socket when an RDMA connection request between the first working node and the second working node, sent by the second middleware of the second device, is detected on the socket pointed to by the first descriptor; and accept the RDMA connection request based on the new RDMA socket.
Optionally, the first device may further include a determining unit 103, configured to obtain the label of the server before the connection unit 101 instructs the first middleware through the application layer, and to determine that the label of the server is the label of the first working node or the label of the second working node, where a label includes an IP address and/or a port number.
It can be understood that, regarding the specific implementation manner of the functional blocks included in the first device in fig. 10, reference may be made to the related description of the first device side in the method embodiments shown in fig. 7, fig. 8, and fig. 9, which is not repeated herein.
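The check performed by the determining unit 103 can be illustrated with a short sketch. The representation below is an assumption made for illustration: a label is modeled as an (IP address, port) pair in which either field may be absent, and the names `Label`, `label_matches`, and `choose_transport`, as well as the sample addresses and ports, are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Label(NamedTuple):
    """A label holds an IP address and/or a port number (None means 'not part of the label')."""
    ip: Optional[str] = None
    port: Optional[int] = None


def label_matches(server: Label, worker: Label) -> bool:
    """True if every field present in the worker label equals the server's field."""
    if worker.ip is None and worker.port is None:
        return False  # an empty label identifies nothing
    ip_ok = worker.ip is None or worker.ip == server.ip
    port_ok = worker.port is None or worker.port == server.port
    return ip_ok and port_ok


def choose_transport(server: Label, worker_labels: Iterable[Label]) -> str:
    """Select RDMA only when the server label belongs to a working node."""
    if any(label_matches(server, worker) for worker in worker_labels):
        return "rdma"
    return "tcp"
```

For example, with working-node labels `Label("10.0.0.1", 6700)` and `Label(port=6701)`, a server at 10.0.0.1:6700 is routed to the RDMA path, while a server at 10.0.0.9:2181 stays on TCP/IP.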
By implementing the method and the device of the present application, the RoCE technology can be used in a Storm system, which reduces data transmission latency and the processing resources that a device consumes on data transmission. In addition, with this data transmission method, the Storm system does not need to modify the implementation code of each interface of the device's application layer; it only needs to add a middleware and redirect the interfaces of the application layer to the middleware in order to use the RoCE technology, so the implementation process is simple.
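The idea of redirecting application-layer interfaces without touching application code can be sketched with a table of callables. This is only an analogy under stated assumptions: the present application does not specify the redirection mechanism shown here, and `tcp_send`, `RoceMiddleware`, `emit`, and `redirect` are hypothetical names that return tags instead of performing real network I/O.

```python
def tcp_send(payload):
    """Original application-layer send path (stand-in that only tags the payload)."""
    return ("tcp", payload)


class RoceMiddleware:
    """Hypothetical middleware offering an interface with the same name."""

    def send(self, payload):
        return ("roce", payload)


def emit(api, payload):
    """Application-layer code: it is never modified, it only calls api['send']."""
    return api["send"](payload)


def redirect(api, middleware):
    """Replace each interface that the middleware also provides."""
    for name in list(api):
        target = getattr(middleware, name, None)
        if callable(target):
            api[name] = target
```

After `redirect(api, RoceMiddleware())`, the unchanged `emit` function reaches the middleware, which is the property the paragraph above describes.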
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk, SSD), among others.

Claims (12)

1. A data transmission method, applied to a first device, the method comprising:
the first device is deployed with an application layer and a first middleware, and an Application Programming Interface (API) in the application layer of the first device is redirected to an API supporting remote direct memory access (RoCE) provided by the first middleware;
the first device indicates the first middleware through the application layer and establishes Remote Direct Memory Access (RDMA) connection between a first working node and a second working node with second middleware of a second device; the second middleware provides an API supporting RoCE, wherein a distributed computing system comprises the first device and the second device, the first device deploys the first working node, the second device deploys the second working node, and the first working node and the second working node are used for executing the same task;
and the first device sends the data output by the first working node to the second working node through a network card supporting an Ethernet-based remote direct memory access (RoCE) protocol.
2. The method of claim 1, wherein the first device is a client that establishes the RDMA connection, the second device is a server that establishes the RDMA connection,
the instructing, by the first device, the first middleware and the second middleware of the second device through the application layer to establish an RDMA connection between the first working node and the second working node, specifically including:
the first device instructs, by the application layer, the first middleware to: creating a TCP/IP socket and obtaining a first descriptor pointing to the TCP/IP socket; creating an RDMA socket to which the first descriptor is directed; initiating an RDMA connection request between the first working node and the second working node to a second middleware of the second device based on the RDMA socket pointed to by the first descriptor.
3. The method of claim 1, wherein the first device is a server that establishes the RDMA connection, the second device is a client that establishes the RDMA connection,
the instructing, by the first device, the first middleware and the second middleware of the second device through the application layer to establish an RDMA connection between the first working node and the second working node, specifically including:
the first device instructs, by the application layer, the first middleware to: creating a TCP/IP socket and obtaining a first descriptor pointing to the TCP/IP socket; creating an RDMA socket to which the first descriptor is directed; creating a new RDMA socket on condition that an RDMA connection request between the first working node and the second working node, which is transmitted by a second middleware of the second device, is listened to on the socket pointed by the first descriptor; accepting the RDMA connection request based on the new RDMA socket.
4. The method of claim 2, wherein the method comprises:
before the first device indicates the first middleware through the application layer, acquiring a label of the server, and determining that the label of the server is the label of the first working node or the label of the second working node, where the label includes an IP address and/or a port number.
5. The method according to any one of claims 2 to 4, wherein the sending, by the first device, the data output by the first working node to the second working node through a network card supporting a RoCE protocol specifically includes:
the first device sends data output by the first working node to the first middleware from the application layer, the data are sent to a network card supporting a RoCE protocol through the first middleware, the data are packaged through the network card supporting the RoCE protocol to obtain a data frame capable of being transmitted in an Ethernet, the data frame is sent to a network card supporting the RoCE protocol of the second device through the network card supporting the RoCE protocol, so that the second device decapsulates the data frame through the network card supporting the RoCE protocol and sends the decapsulated data to the second middleware, and the decapsulated data are sent to the application layer deployed in the second device through the second middleware.
6. A first device, comprising: a connection unit, a data transmission unit, wherein,
the first device is deployed with an application layer and a first middleware, and an Application Programming Interface (API) in the application layer of the first device is redirected to an API supporting remote direct memory access (RoCE) provided by the first middleware;
the connection unit is used for indicating the first middleware through the application layer and establishing Remote Direct Memory Access (RDMA) connection between the first working node and the second working node with the second middleware of the second device; the second middleware provides an API supporting RoCE, wherein a distributed computing system comprises the first device and the second device, the first device deploys the first working node, the second device deploys the second working node, and the first working node and the second working node are used for executing the same task;
and the data transmission unit is used for transmitting the data output by the first working node to the second working node through a network card supporting an Ethernet-based remote direct memory access (RoCE) protocol.
7. The first device of claim 6, wherein the first device is a client that establishes the RDMA connection, the second device is a server that establishes the RDMA connection,
the connection unit is specifically configured to instruct, by the application layer, the first middleware to: creating a TCP/IP socket and obtaining a first descriptor pointing to the TCP/IP socket; creating an RDMA socket to which the first descriptor is directed; initiating an RDMA connection request between the first working node and the second working node to a second middleware of the second device based on the RDMA socket pointed to by the first descriptor.
8. The first device of claim 6, wherein the first device is a server that establishes the RDMA connection, the second device is a client that establishes the RDMA connection,
the connection unit is specifically configured to instruct, by the application layer, the first middleware to: creating a TCP/IP socket and obtaining a first descriptor pointing to the TCP/IP socket; creating an RDMA socket to which the first descriptor is directed; creating a new RDMA socket on condition that an RDMA connection request between the first working node and the second working node, which is transmitted by a second middleware of the second device, is listened to on the socket pointed by the first descriptor; accepting the RDMA connection request based on the new RDMA socket.
9. The first device of claim 8, further comprising: a determining unit, configured to obtain a label of the server before the connecting unit indicates the first middleware through the application layer, and determine that the label of the server is a label of the first working node or a label of the second working node, where the label includes an IP address and/or a port number.
10. The first device of any one of claims 6-9, wherein
the data transmission unit is specifically configured to send data output by the first working node from the application layer to the first middleware, send the data to a network card supporting a RoCE protocol through the first middleware, encapsulate the data through the network card supporting the RoCE protocol to obtain a data frame capable of being transmitted in an ethernet, send the data frame to a network card supporting the RoCE protocol of the second device through the network card supporting the RoCE protocol, so that the second device decapsulates the data frame through the network card supporting the RoCE protocol, sends the decapsulated data to the second middleware, and sends the decapsulated data to the application layer deployed in the second device through the second middleware.
11. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1-5.
12. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-5.
CN201810131741.1A 2018-02-08 2018-02-08 Data transmission method, related device and system Active CN108494817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810131741.1A CN108494817B (en) 2018-02-08 2018-02-08 Data transmission method, related device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810131741.1A CN108494817B (en) 2018-02-08 2018-02-08 Data transmission method, related device and system

Publications (2)

Publication Number Publication Date
CN108494817A CN108494817A (en) 2018-09-04
CN108494817B true CN108494817B (en) 2022-03-04

Family

ID=63340170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810131741.1A Active CN108494817B (en) 2018-02-08 2018-02-08 Data transmission method, related device and system

Country Status (1)

Country Link
CN (1) CN108494817B (en)

Also Published As

Publication number Publication date
CN108494817A (en) 2018-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant