WO2024001820A1 - Data transmission method and gateway device - Google Patents
Data transmission method and gateway device
- Publication number
- WO2024001820A1 (application PCT/CN2023/100640)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- message
- network
- rocev2
- gateway device
- path
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0896—Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/302—Route determination based on requested QoS
Definitions
- the present application relates to the field of data transmission, and in particular, to a data transmission method and gateway device.
- RDMA Remote direct memory access
- TCP transmission control protocol
- RoCEv1 is an RDMA protocol implemented on the Ethernet link layer (the switch needs to support flow control technologies such as priority-based flow control (PFC) to ensure reliable transmission at the physical layer).
- RoCEv2 is implemented on top of the user datagram protocol (UDP) layer of the Ethernet transmission control protocol/internet protocol (TCP/IP) stack.
- PFC priority-based flow control
- UDP user datagram protocol
- Because of its own protocol mechanism, RDMA places very high deterministic service-level agreement (SLA) requirements, such as network bandwidth and delay, on the lower-layer bearer network. As a result, it currently does not adapt well to non-data-center scenarios such as wide area networks and metropolitan area networks, and the application ecosystem of RDMA is limited to data centers.
- SLA service-level agreement
- Embodiments of the present application provide a data transmission method and gateway device, which are used to realize long-distance transmission of cross-domain RDMA data based on the SRv6 network by coordinating the interaction process between RoCEv2 messages and the SRv6 network.
- In this way, the application scope of RDMA can be extended from the data center to the entire Internet, end-to-end data transmission is accelerated through RDMA, and RDMA transmission no longer depends on expensive dedicated-line networks, resulting in lower engineering deployment costs and shorter startup cycles. In addition, this changes the current situation in which the SRv6 network cannot sense application requirements: application-layer data can complete accurate optimal path planning before entering the SRv6 network (that is, the path calculation result is planned before the RoCEv2 message is sent). This solves the problem that optimization could only be performed after the fact based on monitoring technology, and can significantly improve user experience while reducing network management complexity.
- In a first aspect, embodiments of the present application provide a data transmission method, which can be used in the field of data transmission.
- The method includes: first, the first network card device, acting as the source, generates a RoCEv2 message based on the RDMA data to be sent, and the first gateway device corresponding to the first network card device then receives the RoCEv2 message from the first network card device. The RoCEv2 message is a message to be sent by the first gateway device to the second gateway device. The first gateway device and the second gateway device perform data transmission over the SRv6 network, where the SRv6 network can be a wide area network or a metropolitan area network; this application does not limit the specific type of the SRv6 network.
- the first gateway device generates a corresponding request according to the RoCEv2 message, which may be called a target request, and sends the target request to the network controller in the SRv6 network.
- The network controller obtains the path calculation result based on the received target request and determines the quality of service (QoS) policy based on the path calculation result. The path calculation result includes the network path between the first gateway device and the second gateway device (also called the SRv6 forwarding path) and the reserved network resources, which are the network resources reserved for the RoCEv2 message along the entire network path.
- After obtaining the path calculation result, the first gateway device extends the RoCEv2 message to be sent based on the path calculation result, thereby obtaining an extended RoCEv2 message.
- The extended RoCEv2 message carries at least a first identifier and a second identifier. The first identifier is used to characterize each network node in the network path, and the second identifier is used to characterize the RDMA data corresponding to the RoCEv2 message, which enables the network controller to guarantee bandwidth based on the second identifier and the pre-calculated QoS policy.
- the first gateway device sends the expanded RoCEv2 message to the opposite second gateway device through the calculated network path.
- The process of the first gateway device generating the target request based on the RoCEv2 message may be: when the first gateway device determines that the RoCEv2 message does not carry a payload field, that is, that the RoCEv2 message is a control message, the first gateway device generates a first path calculation request based on the control message, where the first path calculation request is used to trigger the network controller to determine the path calculation result corresponding to the control message.
- The first path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system. It should also be noted that the first path calculation request may or may not carry network performance requirements, and this application does not limit this.
- Whether the RoCEv2 message is a control message or a data message can be determined based on whether it carries a payload field. When the RoCEv2 message is a control message, the first gateway device generates the corresponding first path calculation request, which is flexible and targeted.
- The process of the first gateway device generating the target request based on the RoCEv2 message may also be: when the first gateway device determines that the RoCEv2 message carries a payload field, that is, that the RoCEv2 message is a data message, the first gateway device generates a second path calculation request based on the data message.
- The second path calculation request must carry network performance requirements, and is used to request the network controller to determine a path calculation result for the data message based on those network performance requirements.
- the second path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system.
- When the RoCEv2 message is a data message, the first gateway device generates a second path calculation request that carries the network performance requirements, so that the SRv6 network accurately senses the size of the RDMA data to be sent at the application layer and can infer the required network resources from it. The application-layer RDMA data can thus complete accurate optimal path planning before entering the SRv6 network, which solves the problem that adjustment could only be made after the fact based on monitoring technology, and can greatly improve the user experience while reducing the complexity of network management.
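- The classification described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `PathCalcRequest` type, its field names, and the `build_target_request` function are hypothetical, and the network performance requirement is reduced to a single byte count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathCalcRequest:
    """Hypothetical target request sent towards the network controller."""
    kind: str                      # "first" (control message) or "second" (data message)
    src_gateway: str
    dst_gateway: str
    required_bytes: Optional[int]  # network performance requirement, if carried


def build_target_request(payload_len: int,
                         dma_length: Optional[int],
                         src_gateway: str,
                         dst_gateway: str) -> PathCalcRequest:
    """Build the target request for one RoCEv2 message.

    No payload  -> control message -> first path calculation request
                   (here built without a performance requirement).
    Has payload -> data message    -> second path calculation request,
                   which must carry the performance requirement.
    """
    if payload_len == 0:
        return PathCalcRequest("first", src_gateway, dst_gateway, None)
    if dma_length is None:
        # Assumption of this sketch: a data message without a known size
        # cannot produce a valid second path calculation request.
        raise ValueError("second path calculation request needs a size")
    return PathCalcRequest("second", src_gateway, dst_gateway, dma_length)
```

- For example, `build_target_request(0, None, "gw1", "gw2")` yields a first path calculation request, while a data message with a DMA length of 1 MiB yields a second path calculation request carrying that size.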
- The manner in which the first gateway device generates a second path calculation request carrying network performance requirements according to the data message may specifically include: when the data message is a message triggered by a send primitive operation, the first gateway device determines the network performance requirements based on the DMA length field in the RETH (RDMA extended transport header) header of the data message, where the RETH header is determined by the first network card device corresponding to the first gateway device and the DMA length field in the RETH header represents the size of the RDMA data corresponding to the data message. Based on this, the first gateway device generates the second path calculation request carrying the network performance requirements.
- For a send operation, the message size of the RDMA data to be transmitted is also determined, but the message currently has no field that identifies the message size. A field therefore needs to be added to the header of the message to identify the size of the message to be sent in this send operation.
- In this way, the data message triggered by the send primitive operation can also carry a field that identifies the message size of the RoCEv2 data to be sent in this operation. This is achieved by reusing the RETH header defined by the IB specification, which is flexible and widely applicable.
- The manner in which the first gateway device generates a second path calculation request carrying network performance requirements according to the data message may also include: when the data message is a message triggered by a write primitive operation or a read primitive operation, the first gateway device first determines the network performance requirements based on the DMA length field in the RETH header of the data message, and then generates a second path calculation request carrying the network performance requirements.
- It should be noted that when the data message is triggered by a write primitive operation, one piece of RDMA data may be split into n data packets, and the RETH header is carried only in the first packet. The RETH header indicates, through the DMA length field, the message size to be transmitted in this operation.
- When the data message is triggered by a read primitive operation, the network performance requirements are carried in the read request message, and the requester triggers calculation of the reverse forwarding path, with the responder as the starting node. When the responder receives a read request and replies with a read response, it may reply with multiple response messages, all of which need to be forwarded along the precomputed path. Therefore, in this case too, the first gateway device can determine the network performance requirement based on the DMA length field in the RETH header of the data message, thereby generating a second path calculation request carrying the network performance requirement.
- In both cases, the size of the message to be transmitted in this operation can be determined through the DMA length field in the RETH header of the first packet.
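- A minimal sketch of reading the DMA length field is given below. It assumes the standard InfiniBand layout (a 12-byte base transport header followed by a 16-byte RETH made up of an 8-byte virtual address, a 4-byte R_Key and a 4-byte DMA length) and the standard reliable-connection opcodes for write and read requests. The function name `reth_dma_length` is hypothetical, and the reuse of the RETH header for send operations proposed in this application is not modelled.

```python
import struct
from typing import Optional

BTH_LEN = 12    # base transport header length in bytes
RETH_LEN = 16   # virtual address (8) + R_Key (4) + DMA length (4)

# Reliable-connection opcodes whose packets carry a RETH right after the BTH:
# RDMA WRITE First, RDMA WRITE Only, RDMA WRITE Only with Immediate,
# RDMA READ Request.
RETH_OPCODES = {0x06, 0x0A, 0x0B, 0x0C}


def reth_dma_length(ib_packet: bytes) -> Optional[int]:
    """Return the DMA length from the RETH, or None if no RETH is present.

    `ib_packet` is the UDP payload of a RoCEv2 message, starting at the BTH.
    """
    if len(ib_packet) < BTH_LEN:
        raise ValueError("truncated BTH")
    opcode = ib_packet[0]  # the opcode is the first byte of the BTH
    if opcode not in RETH_OPCODES:
        return None        # e.g. middle/last packets of a write carry no RETH
    if len(ib_packet) < BTH_LEN + RETH_LEN:
        raise ValueError("truncated RETH")
    _virtual_addr, _r_key, dma_len = struct.unpack_from("!QII", ib_packet, BTH_LEN)
    return dma_len
```

- Only the first packet of a write (or the read request) returns a size, which matches the statement above that the RETH header is carried only in the first packet.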
- The manner in which the first gateway device extends the RoCEv2 message according to the path calculation result to obtain the extended RoCEv2 message may be: the first gateway device modifies the IPv6 Hop-by-Hop Option in the IPv6 header of the RoCEv2 message according to the path calculation result, thereby obtaining the extended RoCEv2 message.
- In this way, the RoCEv2 message can carry the size of the data to be transmitted at the application layer and the forwarding path identifier. Accordingly, after the RDMA traffic enters the SRv6 network, the mature QoS capabilities of wide area/metropolitan area networks can be used to guarantee transmission bandwidth.
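- As an illustration of carrying these two values in a Hop-by-Hop option, the sketch below builds an IPv6 Hop-by-Hop Options extension header containing a single option. The generic header structure (Next Header, Hdr Ext Len in 8-octet units excluding the first 8 octets, type-length-value options, Pad1/PadN padding) follows the IPv6 specification. The option type `0x1E`, the 4-byte path identifier and the 8-byte data size are placeholders chosen for this example; the actual format used by the application is defined in its Figure 10 and is not reproduced here.

```python
import struct


def build_hbh_header(next_header: int, path_id: int, data_size: int,
                     opt_type: int = 0x1E) -> bytes:
    """Build a Hop-by-Hop Options header with one hypothetical option.

    Option data layout (an assumption of this sketch):
      4 bytes  path identifier
      8 bytes  size of the RDMA data to be transmitted
    """
    opt_data = struct.pack("!IQ", path_id, data_size)
    body = struct.pack("!BB", opt_type, len(opt_data)) + opt_data

    # Pad so that the whole header is a multiple of 8 octets.
    pad = (-(2 + len(body))) % 8
    if pad == 1:
        body += b"\x00"                                          # Pad1 option
    elif pad > 1:
        body += struct.pack("!BB", 1, pad - 2) + bytes(pad - 2)  # PadN option

    hdr_ext_len = (2 + len(body)) // 8 - 1
    return struct.pack("!BB", next_header, hdr_ext_len) + body
```

- In a real packet, the IPv6 header's Next Header field would be set to 0 to announce this extension header, and `next_header` here would identify whatever follows it (for example 17 for UDP).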
- The data transmission method based on the SRv6 network may also include the following steps. First, the first gateway device obtains a backup path, which may be called a first backup path; the calculation of the first backup path can be triggered by the network controller when it receives the target request. Next, the first gateway device copies the RoCEv2 message to obtain a first copy message, and extends the first copy message based on the first backup path, in the same way as described above, to obtain an extended first copy message. The extended first copy message also carries an SRv6 forwarding path identifier, which may be called a third identifier and is used to characterize each network node in the first backup path.
- Finally, the first gateway device sends the extended first copy message to the second gateway device through the first backup path, so that the second gateway device performs dual-send and selective-receive processing on the extended RoCEv2 message and the extended first copy message. Dual-send and selective-receive processing can be: keep the message received first and discard the message received later. For example, if the extended RoCEv2 message reaches the second gateway device first, it is retained and the extended first copy message that arrives later is discarded. The reverse case is similar and is not described here.
- In other words, in addition to performing transmission optimization based on message classification, the first gateway device makes an additional copy of the RoCEv2 message and sends it over the backup path, completing multiple-send and selective-receive processing of the RoCEv2 message. This traffic redundancy improves resistance to packet loss during network transmission.
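- The "keep the first, discard the later" rule can be sketched as a small receiver-side filter. The class name `SelectiveReceiver` and the choice of `(flow_id, psn)` as the key that identifies two copies of the same message are assumptions of this example; the source does not specify how copies are matched.

```python
class SelectiveReceiver:
    """Keep the first copy of each message and discard later duplicates."""

    def __init__(self) -> None:
        # Keys of messages already delivered. A real device would age
        # entries out; this sketch keeps them forever for simplicity.
        self._seen = set()

    def accept(self, flow_id: int, psn: int) -> bool:
        """Return True if this copy should be kept, False if discarded."""
        key = (flow_id, psn)
        if key in self._seen:
            return False   # the other path already delivered this message
        self._seen.add(key)
        return True
```

- Whichever of the primary or backup path delivers a given message first wins; the receiver does not need to know which path a copy took.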
- An alternative solution for RoCEv2 anti-packet-loss processing is also provided. The main difference is that the RDMA gateway does not copy the messages; instead, the head network node of the SRv6 network completes the copy processing of RoCEv2 messages.
- The process may be: when the extended RoCEv2 message is sent to the head network node of the network path (the head network node, also called the source network node, is the network node in the network path that is directly connected to the first gateway device), the head network node is triggered to copy the extended RoCEv2 message, thereby obtaining a second copy message, and the second copy message is sent through a pre-prepared backup path (which can be called the second backup path).
- The tail network node of the network path (also called the destination network node, which is the network node in the network path that is directly connected to the second gateway device, that is, the last network node along the data transmission direction) performs dual-send and selective-receive processing on the extended RoCEv2 message and the second copy message.
- The second backup path may also be triggered and generated by the network controller when receiving the target request.
- In this solution, the gateway device does not need to implement the dual-send and selective-receive anti-packet-loss processing. This capability is provided by the network nodes in the SRv6 network, which reduces the forwarding pressure on the gateway device.
- The manner in which the first gateway device sends the extended RoCEv2 message to the second gateway device may also include: when the reserved network resources do not meet the network performance requirements of the extended RoCEv2 message, the first gateway device performs source-end rate limiting through a flow control mechanism and sends the extended RoCEv2 message to the second gateway device through the network path.
- In this case, an optimal SRv6 forwarding path under the current conditions can still be selected, and the first gateway device is notified of the size of the network resources allocated for this transmission, so that it can use the flow control mechanism to reduce the transmission speed at the source end, which reduces congestion, avoids packet loss, and improves transmission reliability.
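- The source does not say which flow control mechanism is used. As one possible illustration, the sketch below uses a token bucket, a common rate-limiting technique that is swapped in here as an assumption, sized from the bandwidth the controller reports as allocated. The class name and parameters are hypothetical, and time is passed in explicitly so that the behaviour is deterministic.

```python
class TokenBucket:
    """Token-bucket rate limiter (an assumed flow control mechanism)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s   # allocated bandwidth, in bytes/second
        self.capacity = burst_bytes    # largest burst allowed at once
        self.tokens = burst_bytes      # the bucket starts full
        self.last = 0.0                # time of the previous update, in seconds

    def allow(self, size: int, now: float) -> bool:
        """Return True if a message of `size` bytes may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        # Refill according to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                   # caller should queue or delay the message
```

- A gateway using such a limiter would hold back messages for which `allow` returns False until enough tokens have accumulated, keeping the source rate at or below the allocated bandwidth.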
- The manner in which the first gateway device obtains the RoCEv2 message to be sent to the second gateway device may be: the first network card device generates an original RoCEv2 message based on the RDMA data to be sent, and the first gateway device then performs IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 message, thereby obtaining the RoCEv2 message.
- When the network card device is deployed with an IPv4 private network address, the corresponding gateway device also needs the capability of IPv4 over IPv6 tunnel encapsulation, so that network card devices deployed in users' private networks can communicate with each other across wide area/metropolitan area networks; this makes the solution implementable.
- The process for the first gateway device to obtain the RoCEv2 message to be sent to the second gateway device may also be: first, the first gateway device receives a target message that is sent by the first network card device and is to be sent to the second gateway device, where the first network card device corresponds to the first gateway device. The first gateway device then determines, according to the UDP destination port number in the target message, that the target message is the RoCEv2 message.
- Because the first gateway device can receive various messages from the first network card device, it uses the UDP destination port number in the received target message to determine whether that message is a RoCEv2 message, which makes the solution operable.
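- The port check can be sketched as follows. The value 4791 is the UDP destination port registered for RoCEv2; the function name `is_rocev2` and the choice to treat a truncated datagram as "not RoCEv2" are assumptions of this example.

```python
import struct

ROCEV2_UDP_PORT = 4791  # UDP destination port registered for RoCEv2


def is_rocev2(udp_datagram: bytes) -> bool:
    """Check the UDP destination port of a datagram that starts at the UDP header."""
    if len(udp_datagram) < 8:   # a UDP header is 8 bytes long
        return False
    _src_port, dst_port, _length, _checksum = struct.unpack_from("!HHHH", udp_datagram, 0)
    return dst_port == ROCEV2_UDP_PORT
```

- Messages that fail this check would simply be forwarded without the RoCEv2-specific processing described above.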
- the reserved network resources include at least any of the following: network bandwidth and minimum network delay.
- a second aspect of the embodiments of the present application provides a gateway device, which serves as a first gateway device and has the function of implementing the method of the above-mentioned first aspect or any possible implementation of the first aspect.
- This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the above functions.
- the third aspect of the embodiment of the present application provides a gateway device.
- the gateway device may include a memory, a processor, and a bus system.
- the memory is used to store programs, and the processor is used to call programs stored in the memory.
- the fourth aspect of the embodiments of the present application provides a computer-readable storage medium.
- The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method of the first aspect or any possible implementation of the first aspect.
- the fifth aspect of the embodiment of the present application provides a computer program that, when run on a computer, causes the computer to execute the method of the above-mentioned first aspect or any possible implementation of the first aspect.
- the sixth aspect of the embodiment of the present application provides a chip.
- the chip includes at least one processor and at least one interface circuit.
- the interface circuit is coupled to the processor.
- The at least one interface circuit is used to perform a transceiver function and send instructions to the at least one processor, and the at least one processor is used to run computer programs or instructions, so that the chip has the function of implementing the above-mentioned first aspect or any possible implementation of the first aspect, or the function of implementing the above-mentioned second aspect or any possible implementation of the second aspect.
- This function can be realized by hardware, software, or a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above functions.
- the interface circuit is used to communicate with other modules outside the chip.
- Figure 1 is a schematic diagram of the workflow of the SRv6 TE policy provided by the embodiment of this application;
- Figure 2 is a schematic diagram of an application scenario provided by the embodiment of the present application.
- Figure 3 is a schematic diagram of a general model of the current processing solution provided by the embodiment of the present application.
- Figure 4 is a schematic diagram of an application scenario provided by the embodiment of the present application.
- Figure 5 is an application schematic diagram of the system architecture provided by the embodiment of the present application.
- Figure 6 is another application schematic diagram of the system architecture provided by the embodiment of the present application.
- Figure 7 is a schematic flow chart of the data transmission method provided by the embodiment of the present application.
- Figure 8 is a schematic flow chart of the network initialization processing stage provided by the embodiment of the present application.
- Figure 9 is a schematic diagram of the change in the header of the RoCEv2 message when it passes through each network node on the network path provided by the embodiment of the present application;
- Figure 10 is a schematic diagram of the format definition provided by the embodiment of the present application.
- Figure 11 is a schematic structural diagram of each component of the data transmission method provided by the embodiment of the present application.
- Figure 12 is a schematic structural diagram of the first gateway device provided by the embodiment of the present application.
- Figure 13 is another schematic structural diagram of the first gateway device provided by an embodiment of the present application.
- Segment routing over IPv6 (segment routing IPv6, SRv6)
- SRv6 refers to segment routing (SR+IPv6) based on the IPv6 forwarding plane, which uses existing IPv6 forwarding technology to achieve network programmability through flexible IPv6 extension headers.
- SRv6 is a new generation of IP bearer protocol that can simplify and unify traditional complex network protocols and is the basis for building intelligent IP networks in the 5G and cloud era.
- SRv6 combines the source routing advantages of SR with the simplicity and easy scalability of IPv6. It has multiple programming spaces, conforms to the idea of software defined network, and is a powerful tool for realizing intent-driven networks.
- SRv6's rich network programming capabilities can better meet the needs of new network services, and its IPv6 compatibility also makes network service deployment easier.
- SRv6 has two working modes: the traffic-engineering-based SRv6 policy (segment routing IPv6 traffic engineering, SRv6 TE) and the best-effort SRv6 policy (segment routing IPv6 best effort, SRv6 BE).
- The SRv6 BE policy uses the optimal SRv6 path calculated by the shortest path algorithm.
- The SRv6 TE policy is a new tunnel traffic steering technology developed on the basis of SRv6.
- An SRv6 TE path is represented as a segment list specifying the path, called a segment ID list (SID list).
- Each SID list is an end-to-end path from a source to a destination and instructs devices on the network to follow the specified path rather than following the shortest path calculated by the interior gateway protocols (IGP). If a packet is directed into the SRv6 TE path, the SID list is added to the packet by the headend, and the rest of the network executes the instructions embedded in the SID list. Compared with the SRv6 BE policy, the SRv6 TE policy can better respond to the differentiated needs of the business through traffic engineering technology, achieving a business-driven network, and is suitable for scenarios where the business has strict requirements on network SLA.
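- How a SID list steers a packet can be sketched in a few lines. The sketch follows the general segment routing header convention in which the segment list is stored in reverse order and a Segments Left counter indexes the active segment; the function names and the SID strings in the usage note are hypothetical, and all other header fields are omitted.

```python
from typing import List, Tuple


def headend_encap(path_sids: List[str]) -> Tuple[List[str], int, str]:
    """Headend: attach the SID list and pick the first destination.

    `path_sids` is given in travel order. Returns the stored segment list
    (reverse order), the initial Segments Left value, and the destination
    address to place in the IPv6 header.
    """
    if not path_sids:
        raise ValueError("empty SID list")
    segment_list = list(reversed(path_sids))
    segments_left = len(segment_list) - 1
    return segment_list, segments_left, segment_list[segments_left]


def endpoint_process(segment_list: List[str], segments_left: int) -> Tuple[int, str]:
    """Node owning the active SID: advance to the next segment."""
    if segments_left == 0:
        raise ValueError("already at the final segment")
    segments_left -= 1
    return segments_left, segment_list[segments_left]
```

- With a path of `["2001:db8:a::1", "2001:db8:b::1", "2001:db8:c::1"]`, the headend addresses the packet to the first SID, and each endpoint in turn rewrites the destination to the next SID until Segments Left reaches 0.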
- Topology information includes node and link information, as well as traffic engineering (TE) attributes such as link cost, bandwidth, and delay.
- TE traffic engineering
- the controller calculates the path according to the business requirements and meets the business SLA.
- the controller sends the path information to the head node of the network (that is, the node connected to the source device, which can also be called the source node) through the border gateway protocol (BGP) SR-Policy extension.
- The head node generates the SRv6 TE policy.
- The generated SRv6 TE policy includes key information such as the headend address, destination address, and Color.
- The head node of the network selects the appropriate SRv6 TE policy for the service to guide forwarding.
- When forwarding data, each forwarder executes the instructions of the SIDs that it issued itself.
- The SRv6 TE policy calculates paths based on network topology and status and does not perceive the status and needs of the application. It therefore cannot perform strict and precise path selection based on the exact bandwidth requirements of the application; it can only schedule application traffic onto the currently most appropriate path, and monitoring and detection methods are needed to detect anomalies and perform subsequent path optimization and scheduling.
- RDMA is a direct memory access technology. Through the registration and binding of the network card and memory, it provides the ability to access remote memory directly over the network and transfers data from the memory of the local computer device to the remote computer device without intervention by either party's operating system (OS). It therefore has no impact on either OS and needs little computer processing power. It completes all processing logic in user mode, eliminating the overhead of external memory copying and context switching, which frees up memory bandwidth and CPU cycles and improves application system performance. It is characterized by high bandwidth, low latency, and low CPU load.
- OS operating system
- The IB network is a network specially designed for RDMA; it guarantees reliable transmission at the hardware level.
- RoCE and iWARP are Ethernet-based RDMA technologies and support the corresponding verbs interfaces.
- The RoCE protocol exists in two versions: RoCEv1 and RoCEv2.
- A network interface card (NIC), also known as a network card or network adapter, is a piece of computer hardware designed to allow computers to communicate on a computer network. Because it has a media access control (MAC) address, it falls between Layer 1 and Layer 2 of the open system interconnection (OSI) model. It allows users to connect to each other via cable or wirelessly.
- The network card may also be a data processing unit (DPU), a smart network interface card (smart NIC), an RDMA network card, or another form of equipment with network card functions; this is not specifically limited in this application.
- The RDMA network card is used to: receive remote memory access requests from the central processing unit (CPU) and send them to the network; or receive remote memory access requests from the network, access the host memory through the direct memory access (DMA) engine, and finally return the access results to the initiator through the network.
- Since the data to be sent is RDMA data, the network card used may be an RDMA network card (RDMA NIC, RNIC).
- A gateway device, referred to as a gateway for short and also known as an inter-network connector or a protocol converter, is a computer system or device that provides data conversion services between multiple networks. Between two systems that use different communication protocols, data formats, or languages, or even completely different architectures, the gateway acts as a translator. The gateway repackages the received information to meet the needs of the destination system, and it also provides filtering and security functions.
- A gateway that handles RDMA data may be called an RDMA gateway (RGW).
- The gateway works at the transport layer of the open system interconnection reference model (OSI/RM) and at all layers above it. It repackages information so that it can be processed by another system. For this purpose, the gateway must also be able to communicate with various applications, including establishing and managing sessions and transmitting and parsing data. In fact, today's gateway can no longer be classified purely as network hardware; it is better described as a combination of software and hardware that can connect different networks.
- The SRv6 network-based data transmission method provided by the embodiments of this application can be deployed on existing gateway devices (that is, coupled with existing gateway devices), or deployed independently as a dedicated gateway device (that is, decoupled from existing gateway devices). This application does not limit this.
- InfiniBand (IB) is a computer network communication standard for high-performance computing. It has extremely high throughput and extremely low latency and is used for data interconnection between computer devices.
- A dedicated network line is a dedicated channel provided by a network service provider to users, making the user's data transmission reliable and trustworthy.
- The advantage of a dedicated line is good security and guaranteed quality of service (QoS).
- Network dedicated lines mainly have the following two channel modes:
- Physical dedicated channel: a physical dedicated channel is a dedicated line laid between the service provider and the user. The line is used exclusively by that user, and no other data can enter it, whereas general lines allow multiple users to share the channel.
- Virtual dedicated channel: a virtual dedicated channel reserves a certain amount of bandwidth for the user on a general channel so that the user has exclusive use of that bandwidth. It is like opening another channel within a public channel that only the corresponding user may use. User data is encrypted to ensure reliability and security.
- Data centers, such as public cloud/private cloud data centers and supercomputing data centers, become the concentration point of computing power.
- Enterprises need to build applications around the data center in order to use the powerful computing power of the data center.
- Data and computing power are concentrated inside the data center.
- One is the "T+0" calculation type, which requires real-time processing. The amount of data is not large, but the real-time processing requirements are strict, and the processing results may need to be fed back to the production system for real-time control.
- The other is the "T+1" calculation type, which does not require real-time processing. The amount of data is large and there is no real-time requirement, but the faster the processing, the better.
- The embodiments of this application mainly address the case where the terminal side generates massive data but local computing power is insufficient, so that the centralized computing power of the data center is needed to complete "T+1" type calculation requests, as shown in Figure 2.
- Such computing requests are widely found in various industries, such as seismic exploration data processing in the energy industry, gene sequencing data processing in the medical industry, 3D rendering and video image data processing in the media industry, and high-energy physics and meteorological data analysis and processing in the scientific research field.
- The scenario demands and current solutions of these industries can be abstracted into the general model shown in Figure 3.
- From the perspective of the network (taking the wide area network as an example), data travelling from the end side to the data center passes through the wide area network (such as the SRv6 network) and the data center network. From the perspective of data transmission, the data first reaches a storage node in the data center, either over the wide area network through TCP transmission or by physical transport of hard disks, and is kept on the storage node as a data copy. The computing node then reads the storage node data over the data center network, using TCP or high-performance RDMA technology, to complete data processing.
- Data security issues: the risk of data leakage caused by various abnormal problems during data copying.
- In view of the mature application of RDMA in high-performance transmission solutions in data centers, the industry hopes to transplant RDMA to Ethernet networks.
- Because RDMA was originally designed for the IB network, it requires dedicated switches, routers, and network cards combined into a dedicated IB network, which uses hardware resources to ensure high bandwidth, low latency, and zero packet loss in the transmission network, while RDMA's own reliability mechanisms are relatively crude.
- The feature requirements of the IB network, especially the lossless requirement, were therefore also carried over to the Ethernet network.
- The Ethernet network, which originally provided best-effort service, needed to use complex flow control mechanisms (such as global pause).
- RDMA has the following problems in Ethernet-based transmission:
- Packet loss handling: the RDMA network card is limited by hardware resources. It uses coarse Go-back-N retransmission and is very sensitive to packet loss; abnormal packet loss causes a sharp drop in transmission efficiency. It is therefore necessary to find a way to keep the network from dropping packets.
- RDMA can achieve wide-area transmission by leveraging the high-quality characteristics of network dedicated lines.
- RDMA over dedicated lines suffers only some degradation in latency; the other network service parameters can reach the same level.
- Although network dedicated lines can provide high-quality bearer services, several problems remain:
- The activation period of a network dedicated line is very long, and the line requires management and maintenance by professionals from the network service provider.
- The dedicated line services of different network service providers may not be able to interconnect.
- This application solves the above problems by designing a wide area/metropolitan area network that can carry RDMA transmission and thereby provide end-to-end high-performance RDMA transmission.
- This application focuses on using SRv6 technology on wide area/metropolitan area networks to solve the problem of reliable RDMA transmission.
- The specific application scenario is shown in Figure 4.
- With the method provided by this application, the application scope of RDMA can be extended from the data center to the entire Internet, and end-to-end data transmission can be accelerated through RDMA technology. The method mainly solves the following problems:
- How does the gateway device identify (for example, through the RDMA transmission processing layer) and classify RDMA data (also called RDMA traffic), accurately perceive through refined management the network performance requirements (for example, bandwidth and delay) of different RDMA data transmissions, and then cooperate with the SRv6 network so that accurate path calculation is performed based on those requirements and the real-time link status of the SRv6 network, with the path calculation results fed back to the gateway device?
- How does the gateway device cooperate with the network nodes in the SRv6 network (such as wide area routers) to steer different RDMA data into the corresponding SRv6 TE policy based on the path calculation results (for example, through the RDMA transmission processing layer), so that the network nodes provide a precise bandwidth guarantee?
- FIG. 5 is a schematic diagram of an application of the system architecture provided in an embodiment of this application.
- The embodiments of this application mainly apply to the scenario where RDMA network interface cards (RNICs) located on the end side need to communicate across the SRv6 network.
- RDMA network card devices are collectively referred to as network card devices.
- The source network card device and the destination network card device can be deployed with IPv4 private network addresses or IPv6 addresses and interact using the RoCEv2 protocol.
- The SRv6 network generally consists of a network controller (such as a wide area network controller) and multiple routing devices, such as RGW1, RGW2, R3, R4, and R5 in Figure 5.
- Routing devices can also be called network nodes. The network controller manages all network nodes of the SRv6 network and is responsible for path calculation in the SRv6 network. As shown in Figure 5, it is assumed that the network controller has calculated four SRv6 TE policy paths based on the network status: the MD (min delay) path, the LC (least cost) path, the BB (bandwidth balance) path, and the MA (max availability) path. The MD path is the minimum delay path, the LC path is the minimum cost path, the BB path is the bandwidth balance path, and the MA path is the maximum availability path.
- The source network card device and the destination network card device can be deployed with IPv4 private network addresses or IPv6 addresses. This application does not specifically limit this.
- If the source network card device and the destination network card device are deployed with IPv4 private network addresses, their corresponding gateway devices need to provide IPv4 over IPv6 tunnel traversal capability. That is, the gateway device performs IPv4 over IPv6 tunnel encapsulation on the packets sent by the network card device, so that RNICs deployed in the user's private network can communicate with each other across the wide area network. The transmission process is shown in Figure 5. If the source network card device and the destination network card device are deployed with IPv6 addresses, the gateway device does not need to perform IPv4 over IPv6 tunnel encapsulation on the packets sent by the network card device.
- In the IPv6 case, the network card devices run the RoCEv2 protocol directly over IPv6 to communicate with each other.
- The routes between the devices are reachable across the SRv6 network.
- The transmission process is shown in Figure 6.
- The advantage of this method is that the gateway device does not need to provide IPv4 over IPv6 tunnel encapsulation capability.
- The entire network runs the IPv6 protocol, the protocol stack technology is unified, and networking becomes simpler and clearer.
- The other processing procedures are similar; for details, refer to the subsequent embodiments. For ease of explanation, the subsequent embodiments take the deployment of IPv4 private network addresses on the source network card device and the destination network card device as an example.
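The addressing rule described above (tunnel encapsulation for IPv4 private addresses, none for IPv6) can be sketched as a small decision function. This is an illustrative sketch only: the function name `needs_4over6_tunnel` is hypothetical, and the check that both ends use the same IP version is an assumption that the text does not state.

```python
import ipaddress


def needs_4over6_tunnel(src: str, dst: str) -> bool:
    """Return True when the gateway would have to wrap packets in an
    IPv4 over IPv6 tunnel before they enter the SRv6 network."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if src_ip.version != dst_ip.version:
        # Assumption: mixed IPv4/IPv6 endpoints are outside the described scenario.
        raise ValueError("source and destination must use the same IP version")
    # IPv4 endpoints need the tunnel; IPv6 endpoints are routed natively.
    return src_ip.version == 4
```

With two private IPv4 addresses such as `10.0.0.1` and `10.0.1.1` the function returns `True`; with two IPv6 addresses it returns `False`.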
- This application uses RDMA gateway equipment (RGW) to implement the data transmission method based on the SRv6 network.
- All RGWs within a preset range can correspond to one end-side management system and are managed by it.
- The range of RGWs that the end-side management system can manage is deployed and defined in advance. No further details are given here.
- The RGW is generally deployed at the exit of the user network, with its local area network (LAN) side connected to the RNIC and its wide area network (WAN) side connected to the network nodes in the SRv6 network.
- All traffic sent by the RNIC is monitored and classified by the egress gateway RGW, so that RDMA traffic (which the embodiments of this application divide into control traffic and data traffic) is matched to its SRv6 forwarding path according to the required network performance (such as bandwidth and delay), thereby achieving a deterministic SLA guarantee when RDMA traffic is forwarded over wide area/metropolitan area networks.
- In some implementations, the end-side management system may be omitted; instead, modules with the corresponding functions are deployed on the respective RGWs to realize the functions of the end-side management system.
- This application does not limit this. For ease of explanation, the following embodiments of this application take the deployment of the end-side management system as an example.
- FIG. 7 is a schematic flowchart of a data transmission method provided by an embodiment of the present application. The method may include the following steps:
- The first gateway device obtains the RoCEv2 message.
- The source network card device (which can be called the first network card device) generates a RoCEv2 message based on the RDMA data to be sent, and the source gateway device corresponding to the first network card device (which can be called the first gateway device) then receives the RoCEv2 message from the first network card device. The RoCEv2 message is a message to be sent by the first gateway device to the destination gateway device (which can be called the second gateway device).
- The first gateway device and the second gateway device perform data transmission based on the SRv6 network.
- The SRv6 network can be a wide area network or a metropolitan area network. This application does not limit the specific application type of the SRv6 network.
- The first gateway device may obtain the RoCEv2 message to be sent to the second gateway device as follows: the first network card device generates the original RoCEv2 message based on the RDMA data to be sent, and the first gateway device then performs IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 message to obtain the RoCEv2 message, so that network card devices deployed in the user's private network can communicate with each other across the wide area/metropolitan area network.
- Since the first gateway device can receive various messages from the first network card device, it needs to determine whether a received target message is a RoCEv2 message. One way to do so is to determine whether the currently received target message is a RoCEv2 message based on the UDP destination port number in the target message.
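The port-based check described above can be sketched as follows. The sketch assumes the input is a raw UDP datagram (UDP header followed by its payload) and relies on UDP destination port 4791, which is the port assigned to RoCEv2; the function name `is_rocev2` is hypothetical, and a real gateway would first have to locate the UDP header inside the Ethernet/IP framing.

```python
import struct

ROCEV2_UDP_PORT = 4791  # UDP destination port assigned to RoCEv2


def is_rocev2(udp_datagram: bytes) -> bool:
    """Return True when the UDP destination port marks the datagram as RoCEv2."""
    if len(udp_datagram) < 8:
        return False  # too short to hold the 8-byte UDP header
    # UDP header: source port, destination port, length, checksum (16 bits each).
    _src_port, dst_port, _length, _checksum = struct.unpack("!HHHH", udp_datagram[:8])
    return dst_port == ROCEV2_UDP_PORT
```

Messages that fail the check would simply follow the gateway's ordinary forwarding path rather than the RoCEv2 processing described here.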
- This application may use a dedicated gateway device; that is, the components that implement the SRv6 network-based data transmission method provided by the embodiments of this application can be coupled to an existing gateway device to form a dedicated gateway device.
- However, the components that implement the SRv6 network-based data transmission method are not limited to dedicated gateway devices; they can also be decoupled and deployed independently.
- For example, the components that implement the SRv6 network-based data transmission method can be deployed on servers, traditional network equipment, field-programmable gate array (FPGA) equipment, and so on.
- The first gateway device described in this application is only an execution subject that implements the functions of these components. This application does not limit the specific physical form of the first gateway device.
- The first gateway device generates a target request according to the RoCEv2 message and sends the target request to the network controller of the SRv6 network.
- After obtaining the RoCEv2 message, the first gateway device further generates a corresponding request based on the RoCEv2 message. This request may be called a target request. The first gateway device then sends the target request to the network controller in the SRv6 network.
- After recognizing the RoCEv2 message, the first gateway device can first determine the message type of the RoCEv2 message based on custom determination rules, and then generate the corresponding target request based on that message type. Specifically, using the RoCEv2 message classification method designed in this application, the first gateway device can distinguish whether the RoCEv2 message is a control flow (CF) message or a data flow (DF) message, and that classification determines which kind of target request is sent subsequently.
- This application classifies RoCEv2 messages into two categories, CF messages and DF messages, which are introduced separately below:
- CF packets do not carry service data payloads. That is to say, if a RoCEv2 packet does not carry a payload field, the RoCEv2 packet is a CF packet.
- CF messages can carry various other L4 headers as needed, such as ETH (extended transport header), CMM (communication management message), etc.
- CF messages are generally triggered by the interaction mechanism of the IB protocol. Different types of CF messages have different lengths, but each is a single short message of fixed length. The transmission bandwidth requirement can be calculated from the message size and the number of concurrent messages. Generally speaking, CF messages place little demand on the transmission bandwidth of the bearer network.
- CF messages include the following types of messages: CM link establishment message, Read Request message, Cmp Swap message, Fetch Add message, ACK message, Atomic ACK message, and RESYNC message. Which of the above categories a specific CF message belongs to can be determined based on the values of the opcode field and the destination queue pair field in the BTH header of the message, which will not be described in detail in this application.
- DF packets carry service data payloads. That is to say, if a RoCEv2 packet carries a payload field, the RoCEv2 packet is a DF packet.
- DF messages also carry the BTH and optional L4 headers. DF messages are generally triggered by send, read, and write primitive operations. The length of a DF message is affected by the size of the data to be sent and by the path maximum transmission unit (PMTU) of the transmission path. Generally speaking, DF messages place higher transmission bandwidth demands on the bearer network. The transmission bandwidth requirement can be calculated according to the existing rules in the standard protocol or according to other customized rules. This application does not limit this.
- DF messages include the following types of messages: read response message, write/send first message, write/send middle message, write/send last message, write/send only message, and write/send only with immediate message.
- The write/send first message, the write/send middle message, and the write/send last message jointly correspond to one RDMA data (that is, one RDMA message), while a write/send only message or a write/send only with immediate message each corresponds to one RDMA data.
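The CF/DF classification described above can be sketched from the opcode and destination queue pair fields of the 12-byte BTH. This is a hedged illustration, not the classification method claimed by the application: the opcode values are the reliable-connection (RC) and unreliable-datagram (UD) opcodes as commonly listed for the IB transport and should be checked against the specification before use; treating UD send-only packets addressed to queue pair 1 as CM link establishment messages is an assumption; RESYNC messages are not covered; and the names `classify_bth`, `RC_CF_OPS`, `RC_DF_OPS`, and `UD_SEND_ONLY` are hypothetical.

```python
CF, DF, UNKNOWN = "CF", "DF", "UNKNOWN"

# Low 5 bits of the BTH opcode for the RC transport (top 3 bits are 000).
RC_CF_OPS = {0x0C, 0x11, 0x12, 0x13, 0x14}   # read request, ACK, atomic ACK, CmpSwap, FetchAdd
RC_DF_OPS = set(range(0x00, 0x0C)) | {0x0D, 0x0E, 0x0F, 0x10}  # send/write variants, read responses
UD_SEND_ONLY = {0x64, 0x65}                   # UD send only, with and without immediate


def classify_bth(bth: bytes) -> str:
    """Classify a RoCEv2 packet as CF or DF from its Base Transport Header."""
    if len(bth) < 12:
        raise ValueError("the BTH is 12 bytes long")
    opcode = bth[0]
    dest_qp = int.from_bytes(bth[5:8], "big")  # 24-bit destination queue pair
    if opcode in UD_SEND_ONLY and dest_qp == 1:
        return CF  # assumed to be a CM link establishment message
    if opcode >> 5 == 0:  # RC transport
        op = opcode & 0x1F
        if op in RC_CF_OPS:
            return CF
        if op in RC_DF_OPS:
            return DF
    return UNKNOWN
```

A table-driven check like this keeps the per-packet work to one byte comparison and one set lookup, which matters for a gateway that inspects every packet sent by the RNIC.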
- One RDMA data (also called an RDMA message or RDMA traffic) can be split into at least 3 data packets: the first split data packet (also called the first packet) corresponds to the write/send first message, the last split data packet corresponds to the write/send last message, and the middle data packets (1 or more) correspond to write/send middle messages.
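The mapping from one RDMA data to first, middle, and last packets can be sketched as a function of the message size and the PMTU. The sketch is illustrative only: the function name `segment` and its string labels are hypothetical, and the two-packet case, which the text does not discuss, is assumed to produce a first and a last packet with no middle packet.

```python
import math
from typing import List


def segment(message_size: int, pmtu: int, op: str = "write") -> List[str]:
    """Return the packet kinds produced when one RDMA data is split by PMTU."""
    if pmtu <= 0:
        raise ValueError("pmtu must be positive")
    packets = max(1, math.ceil(message_size / pmtu))
    if packets == 1:
        return [f"{op} only"]  # fits in a single packet, no split needed
    # One first packet, zero or more middle packets, one last packet.
    return [f"{op} first"] + [f"{op} middle"] * (packets - 2) + [f"{op} last"]
```

For a 4096-byte write with a PMTU of 1024 bytes this yields one first packet, two middle packets, and one last packet.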
- After the first gateway device classifies the obtained RoCEv2 message using the RoCEv2 message classification method described above, it generates the corresponding target request according to the type of the RoCEv2 message (that is, whether it is a CF message or a DF message). The two cases are introduced below:
- When the first gateway device determines that the RoCEv2 message is a CF message, the target request is the first path calculation request.
- When the first gateway device determines that the RoCEv2 message is a CF message, it generates a first path calculation request based on the CF message. The first path calculation request is used to trigger the network controller to determine the path calculation result corresponding to the CF message.
- In some implementations, the first path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system.
- When the RoCEv2 message is a CF message, one RDMA data generally corresponds to one message, because CF messages are generally small and do not need to be split.
- The network controller can pre-calculate the path calculation result of RGW1→RGW2.
- The path calculation result includes the network path of RGW1→RGW2 and the network resources (such as bandwidth) reserved for CF messages. Any CF message subsequently sent from RGW1 to RGW2 uses this path calculation result.
- The time at which the network controller calculates the path calculation result of RGW1→RGW2 for CF messages may include, but is not limited to, the following: the result may be determined during the network initialization process, or it may be determined when the first gateway device has a CF message to send for the first time. The two cases are detailed below:
- During network initialization, the network controller calculates the path calculation result of RGW1→RGW2 for CF messages.
- Figure 8 is a schematic flow chart of the network initialization processing stage provided by an embodiment of this application.
- The network controller can collect network status information (such as network topology, Binding SID, and SR-Policy status) through the BGP-LS protocol.
- When RGW1 is powered on, it comes under the management of the end-side management system.
- The administrator can orchestrate the connectivity relationships between all managed gateway devices through the end-side management system.
- The topology of the gateway devices can be the default fully connected topology or a topology customized in advance (this application does not limit this).
- RGW1 receives the orchestration configuration issued by the end-side management system.
- The orchestration configuration gives RGW1 and RGW2 the ability to tunnel across the SRv6 network.
- When the end-side management system determines that two RGWs (that is, RGW1 and RGW2) are connected, it issues the respective orchestration configuration to each of the two RGWs.
- The administrator then configures the network performance parameters for CF messages from RGW1 to each other connected RGW (note: 10 Mbps can support 3814 CM link establishments or 16890 read requests; RGW1 can adjust the network performance parameters of CF messages based on local real-time statistics of signaling interactions). For example, the administrator can configure a bandwidth parameter and a latency parameter.
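The capacity note above relates reserved bandwidth to the number of CF messages it can carry. A minimal sketch of that arithmetic follows, assuming that 10 Mbps means 10,000,000 bits per second and that the quoted counts are per second; both function names are hypothetical. Under those assumptions the quoted figures imply roughly 328 bytes on the wire per CM link establishment and roughly 74 bytes per read request.

```python
def messages_per_second(bandwidth_bps: int, bytes_per_message: int) -> int:
    """How many fixed-size messages fit into the reserved bandwidth each second."""
    return bandwidth_bps // (8 * bytes_per_message)


def implied_bytes_per_message(bandwidth_bps: int, rate_per_second: int) -> float:
    """Average on-wire size implied by a bandwidth and a message rate."""
    return bandwidth_bps / 8 / rate_per_second
```

For instance, `implied_bytes_per_message(10_000_000, 16890)` evaluates to about 74.0, and `messages_per_second(10_000_000, 74)` evaluates to 16891, close to the figure in the note; the small difference presumably comes from rounding in the original.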
- The network performance parameters may also be default values set in advance based on certain rules.
- This application does not limit how the network performance parameters are configured, nor the specific types of network performance parameters; users can determine the required network resource types based on their needs.
- Based on the pre-configured network performance parameters, the end-side management system can submit to the network controller path calculation requests for CF messages between the gateway devices that have a connectivity relationship (such as RGW1→RGW2, RGW1→RGW3, RGW1→RGW4, and so on).
- The network controller calculates and stores the path calculation results for CF messages transmitted between the gateway devices that have a connectivity relationship. For example, assume that the path calculation result of RGW1→RGW2 is recorded as S1, that of RGW1→RGW3 as S2, ..., and that of RGW1→RGWn as Sn. When RGW1 has a CF message to send to RGW2, RGW1 sends a first path calculation request (including the destination gateway device RGW2) to the network controller through the end-side management system. Based on the first path calculation request, the network controller retrieves the pre-calculated path calculation result S1 of RGW1→RGW2 and sends S1 to RGW1 through the end-side management system.
- When the first gateway device has a CF message to send for the first time, the network controller calculates the path calculation result of RGW1→RGW2 for CF messages.
- The network initialization process is similar to the above.
- The difference is that the network controller does not calculate in advance the path calculation results for CF messages transmitted between connected gateway devices. Instead, when a gateway device (for example, RGW1) needs to send a CF message to another gateway device (for example, RGW2) for the first time, RGW1 sends a first path calculation request directly to the network controller through the end-side management system. The first path calculation request can carry the network performance requirements.
- The network controller then calculates the path calculation result of RGW1→RGW2 on demand based on the first path calculation request. If RGW1 later has more CF messages to send to RGW2, it continues to send them using this path calculation result.
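The two timing options for CF path calculation (results prepared during network initialization, or computed when the first CF message appears and then reused) can be sketched as a small result store on the controller side. This is an assumed structure for illustration: the class `PathResultStore`, its method names, and the injected `compute` callable are hypothetical, and the path calculation itself is left abstract.

```python
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source gateway, destination gateway)


class PathResultStore:
    """Keeps one CF path calculation result per gateway pair."""

    def __init__(self, compute: Callable[[str, str], str]) -> None:
        self._compute = compute              # stands in for the controller's path calculation
        self._results: Dict[Pair, str] = {}

    def precompute(self, pairs: Iterable[Pair]) -> None:
        # Option 1: fill the store during network initialization.
        for src, dst in pairs:
            self._results[(src, dst)] = self._compute(src, dst)

    def handle_first_path_request(self, src: str, dst: str) -> str:
        # Option 2: compute on the first request, then reuse the stored result.
        key = (src, dst)
        if key not in self._results:
            self._results[key] = self._compute(src, dst)
        return self._results[key]
```

Either way, every later CF message between the same pair of gateways is served from the stored result, so the path calculation runs at most once per pair.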
- When the RoCEv2 message is a DF message, the target request is the second path calculation request carrying network performance requirements.
- When the first gateway device determines that the RoCEv2 message is a DF message, it generates a second path calculation request based on the DF message. Unlike the first path calculation request described above, the second path calculation request must carry network performance requirements; it is used to request the network controller to determine the path calculation result for the DF message based on those requirements. It should also be noted that, in some implementations of this application, the second path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system.
- The second path calculation request must carry network performance requirements because the length of a DF message is determined by the upper-layer RDMA application and is generally not fixed. The second path calculation request therefore needs to carry the requested network performance requirements (for example, how much bandwidth needs to be reserved), which the network controller uses to determine the path calculation result. Because each DF message may carry a different message size (that is, the value of the payload field differs), the network controller needs to calculate the path calculation result separately for different DF messages, based on the network performance requirements carried in the second path calculation request.
- The DF message is a message triggered by the write primitive operation or the read primitive operation.
- When the DF message is triggered by the write primitive operation, one RDMA data may be split into n data packets, and only the first packet carries the message size of the corresponding RDMA data (the subsequent data packets of that RDMA data are sent over the same network path as the first packet).
- The DMA length field in the RETH header indicates the size of the message to be transmitted by this operation.
- When the DF message is triggered by the read primitive operation, the network performance requirements are carried in the read request message, and the requester triggers the calculation of a reverse forwarding path with the responder as the starting node.
- When the responder receives the read request and replies with a read response, it may reply with multiple response messages, all of which need to be forwarded along the pre-calculated path. In this case, therefore, the first gateway device can determine the network performance requirement from the DMA length field in the RETH header of the DF message and thereby generate a second path calculation request carrying the network performance requirement.
- the DF message is a message triggered by the send primitive operation.
- A DF message triggered by the send primitive operation can also carry a field that identifies the message size of the RoCEv2 data to be sent in this operation. This is implemented using the RETH header defined by the IB specification. The RETH header can be added by the first network card device (the network card device corresponding to the first gateway device) before the payload field of the DF message, and the DMA length field in the RETH header represents the size of the RDMA data corresponding to the DF message. For example, the first network card device can set virtual addr (a field defined in the RETH header) to 0xFFFFFFFF (the specific value can be customized), remote key (a field defined in the RETH header) to 0xFFFF (the specific value can be customized), and DMA length (a field defined in the RETH header) to the message size of the RoCEv2 data to be sent in the DF message triggered by this send primitive operation.
- the first gateway device can determine the network performance requirement based on the DMA length field in the RETH header of the DF message, thereby generating a second path calculation request carrying the network performance requirement.
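The way a gateway might read the DMA length out of the RETH header can be sketched as follows. This is a minimal illustration, assuming the RETH layout of the IB specification (64-bit virtual address, 32-bit remote key, 32-bit DMA length, network byte order). The names `parse_reth` and `build_second_path_request`, and the shape of the request dictionary, are hypothetical and are not defined in this application.

```python
import struct
from typing import NamedTuple

# RETH layout per the IB specification: 64-bit virtual address,
# 32-bit remote key, 32-bit DMA length, all in network byte order.
RETH_FORMAT = "!QII"
RETH_SIZE = struct.calcsize(RETH_FORMAT)  # 16 bytes


class Reth(NamedTuple):
    virtual_addr: int
    remote_key: int
    dma_length: int


def parse_reth(raw: bytes) -> Reth:
    """Decode the first 16 bytes of `raw` as a RETH header."""
    if len(raw) < RETH_SIZE:
        raise ValueError("RETH header is truncated")
    return Reth(*struct.unpack(RETH_FORMAT, raw[:RETH_SIZE]))


def build_second_path_request(reth: Reth, latency_ms: int) -> dict:
    """Hypothetical request body; the real message format is not given in the text."""
    return {
        "kind": "second_path_calculation_request",
        "message_size_bytes": reth.dma_length,  # taken from the DMA length field
        "latency_ms": latency_ms,               # configured delay requirement
    }
```

With the placeholder values from the send example (virtual addr 0xFFFFFFFF, remote key 0xFFFF), only `dma_length` is meaningful to the gateway; the other two fields are simply decoded and ignored.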
- the network controller obtains the path calculation result based on the target request, and determines the QoS policy based on the path calculation result.
- The path calculation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 message.
- After the first gateway device generates a target request based on the RoCEv2 message and sends it to the network controller of the SRv6 network (for example, through the end-side management system), the network controller obtains the path calculation result based on the received target request and determines the QoS policy based on that result. For example, the process of calculating the path calculation result and determining the QoS policy for CF messages is shown in Figure 5.
- The path calculation result includes the network path between the first gateway device and the second gateway device (also known as the SRv6 forwarding path) and the network resources reserved by the network controller for RoCEv2 messages, such as network bandwidth resources and minimum network delay.
- the reserved network resources are the network resources of the RoCEv2 message on the entire network path.
- the network path is R1→R2→R3
- the reserved network resource is the resource of the entire R1→R2→R3 network path.
- RDMA data transmission based on the RoCEv2 protocol has clear bandwidth requirements. When RoCEv2 messages are transmitted through the SRv6 network, the network path with the shortest delay can be chosen among those that satisfy the bandwidth requirement (that is, the reserved network resource is the single factor of bandwidth), or a network path can be chosen that meets an explicit "bandwidth + delay" requirement (that is, the reserved network resources are the dual target factors of bandwidth and delay).
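The two selection modes can be illustrated with a small sketch. It assumes the controller already holds a list of candidate paths with their available bandwidth and delay; `CandidatePath` and `select_path` are hypothetical names. Picking the lowest-delay path among the feasible ones in the dual-factor case is also an assumption, since the text only requires that the path meet the requirements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CandidatePath:
    nodes: Tuple[str, ...]
    available_bw_mbps: float
    delay_ms: float


def select_path(
    paths: Sequence[CandidatePath],
    required_bw_mbps: float,
    max_delay_ms: Optional[float] = None,
) -> Optional[CandidatePath]:
    """Return the lowest-delay path that satisfies the constraints, or None.

    With max_delay_ms=None only bandwidth is a constraint (single factor);
    otherwise both bandwidth and delay must be met (dual factor).
    """
    feasible = [
        p for p in paths
        if p.available_bw_mbps >= required_bw_mbps
        and (max_delay_ms is None or p.delay_ms <= max_delay_ms)
    ]
    return min(feasible, key=lambda p: p.delay_ms) if feasible else None
```

Returning `None` corresponds to the case where no path can satisfy the request; how the controller then falls back is outside the scope of this sketch.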
- the first gateway device obtains the path calculation result, and expands the RoCEv2 message according to the path calculation result to obtain the expanded RoCEv2 message.
- the expanded RoCEv2 message carries the first identifier and the second identifier.
- The first identifier is used to represent each network node in the network path, and the second identifier is used to represent the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the QoS policy.
- After the network controller calculates the path calculation result based on the target request, it can deliver the result to the first gateway device through the end-side management system. After the first gateway device obtains the path calculation result, it expands the RoCEv2 message according to that result to obtain an expanded RoCEv2 message.
- The expanded RoCEv2 message carries at least a first identifier and a second identifier, where the first identifier characterizes each network node in the network path and the second identifier characterizes the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the pre-calculated QoS policy.
- The following explains how the message header changes as the RoCEv2 message travels from the source network card (i.e., the first network card device) to the destination network card (i.e., the second network card device) through each network node on the network path.
- the network path is R1→R3→R2
- the first network card device is RNIC1
- the second network card device is RNIC2
- the first gateway device is RGW1
- the second gateway device is RGW2.
- RGW1 receives the message transmitted by RNIC1.
- the IPv4 over IPv6 encapsulation process is completed first, and then the encapsulated RoCEv2 message is delivered to the head network node R1 on the network path.
- RoCEv2 messages carry the SRH field (this field is used for routing between network nodes).
- RoCEv2 messages can carry the flow identifier RDMA Flow ID of the RDMA data (that is, the second identifier, which the network controller uses to guarantee bandwidth based on the QoS policy) and the forwarding path identifier SRv6 TE Policy ID (that is, the first identifier). In other words, the first gateway device obtains the expanded RoCEv2 message by modifying the IPv6 Hop-by-Hop Option in the outer IPv6 header of the RoCEv2 message based on the path calculation result.
- The extended option RDMA Option TLV defined in this application can also carry the RDMA flow type identifier Flow Type, which indicates whether the current message is a CF message or a DF message, and a reserved field that can be used to add other identifiers later, which improves flexibility.
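To make the role of the RDMA Option TLV concrete, the sketch below encodes and decodes one possible layout. The actual format is the one defined in Figure 10, which is not reproduced here, so every field width, the field order, the option type value `0x1E` and the Flow Type codes are placeholders chosen only for illustration. The only thing taken from the IPv6 specification is the general option shape: a 1-byte type, a 1-byte data length, then the data.

```python
import struct
from typing import NamedTuple

OPT_TYPE_RDMA = 0x1E   # placeholder option type, not an assigned value
FLOW_TYPE_CF = 0       # assumed encoding of Flow Type
FLOW_TYPE_DF = 1
_BODY_FORMAT = "!BBII"  # flow type, reserved, SRv6 TE Policy ID, RDMA Flow ID
_BODY_SIZE = struct.calcsize(_BODY_FORMAT)  # 10 bytes


class RdmaOption(NamedTuple):
    flow_type: int
    policy_id: int
    flow_id: int


def encode_rdma_option(opt: RdmaOption) -> bytes:
    """Serialize the option as type, length, then the 10-byte body."""
    body = struct.pack(_BODY_FORMAT, opt.flow_type, 0, opt.policy_id, opt.flow_id)
    return struct.pack("!BB", OPT_TYPE_RDMA, len(body)) + body


def decode_rdma_option(data: bytes) -> RdmaOption:
    """Parse an option produced by encode_rdma_option."""
    if len(data) < 2 + _BODY_SIZE:
        raise ValueError("option is truncated")
    opt_type, length = struct.unpack("!BB", data[:2])
    if opt_type != OPT_TYPE_RDMA or length != _BODY_SIZE:
        raise ValueError("not an RDMA Option TLV in the assumed layout")
    flow_type, _reserved, policy_id, flow_id = struct.unpack(
        _BODY_FORMAT, data[2:2 + _BODY_SIZE]
    )
    return RdmaOption(flow_type, policy_id, flow_id)
```

A head network node would run the equivalent of `decode_rdma_option` to recover the SRv6 TE Policy ID for traffic diversion and the RDMA Flow ID for QoS matching.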
- The first gateway device modifies the outer IPv6 header extension option so that it carries the first identifier and the second identifier. After the message enters the SRv6 network, the network node can then divert the CF message to the SRv6 forwarding path (that is, the network path) pre-allocated based on TE (traffic engineering) and enforce strict bandwidth guarantees.
- RGW1 can encapsulate the RDMA Option TLV in the IPv6 header according to the path calculation result applied for in advance for the CF message in the embodiment corresponding to Figure 8 (see the format definition shown in Figure 10 for details), so that the IPv6 header carries the first identifier of the CF message (that is, the SRv6 forwarding path identifier of the CF message, which can be represented by policy-id1) and the second identifier (that is, the CF message flow identifier, which can be represented by cf-flow-id).
- The head network node R1 parses the RDMA Option TLV option, performs traffic diversion based on the forwarding path identifier SRv6 TE Policy ID (see Figure 10), and matches the pre-delivered QoS policy according to the flow identifier RDMA Flow ID of the RDMA data in the CF message to guarantee bandwidth.
- The read request message and the read response message are processed in a somewhat special manner. This is because the read request message is a CF message, and the peer gateway device needs to reply with the read response message after receiving the read request message (that is, the sending of the read response message is triggered by the read request message), and the read response message is a DF message.
- The specific processing can be as follows: First, the first network card device RNIC1 sends a read request (that is, a CF message, which can be called a request message) to the second network card device RNIC2. After receiving the request message, the first gateway device RGW1 caches it and, on behalf of the response message to that request, submits a path calculation request for R2→R1 to the network controller.
- After the network controller completes the calculation based on the path calculation request, it delivers the SRv6 forwarding path to the head network node R2, and delivers the forwarding path identifier (which can be represented by policy-id2) and the actual reserved bandwidth (which can be represented by allocated-bw) to the end-side management system.
- The end-side management system receives the forwarding path identifier policy-id2 and the actual reserved bandwidth allocated-bw (that is, the path calculation result), and synchronizes the path calculation result of R2→R1 to RGW2.
- The path calculation result carries the allocated policy-id2 and the bandwidth allocated for this path calculation, allocated-bw. Note that if the actually allocated bandwidth is less than the requested bandwidth, RGW2 can locally record is_bw_satisfied as false. This record serves as the basis for later deciding whether source-end speed limiting is required.
- the end-side management system notifies RGW1 that the path calculation is completed.
- After RGW1 receives the notification, it sends the cached request message according to the general CF message processing procedure.
- RNIC2 replies with the response message.
- After RGW2 receives the response message, it first completes IPv4 over IPv6 processing. Then, based on the DF message forwarding path information triggered by the request message, it encapsulates the RDMA Option TLV in the IPv6 header so that the IPv6 header carries the DF message flow identifier df-flow-id1 and the DF message forwarding path identifier policy-id2, and forwards the response message to the head network node R2.
- If the bandwidth actually reserved by the network controller is less than the requested bandwidth, RGW2 also needs to perform source-end speed limiting based on the actually allocated reserved bandwidth. After receiving the message, the head network node R2 parses the RDMA Option TLV option, performs traffic diversion based on the SRv6 TE Policy ID in the response message, and matches the pre-delivered QoS policy based on the RDMA Flow ID in the response message to guarantee bandwidth.
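The bookkeeping around is_bw_satisfied in the read flow above can be sketched as follows. The class `PathCalcResult` and the helper `source_rate_limit` are hypothetical; the field names mirror the identifiers used in the text (policy-id2, allocated-bw), and the bandwidth unit is left abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathCalcResult:
    policy_id: str        # forwarding path identifier, e.g. "policy-id2"
    requested_bw: float   # bandwidth applied for
    allocated_bw: float   # bandwidth actually reserved by the controller

    @property
    def is_bw_satisfied(self) -> bool:
        # False when the controller reserved less than was applied for.
        return self.allocated_bw >= self.requested_bw


def source_rate_limit(result: PathCalcResult) -> Optional[float]:
    """Return the rate to enforce at the source, or None if no limit is needed."""
    return None if result.is_bw_satisfied else result.allocated_bw
```

In this sketch the gateway stores the result when the end-side management system synchronizes it, and consults `source_rate_limit` when the response message is about to be forwarded.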
- the first gateway device's processing function for CF message transmission optimization can be implemented through the CF message transmission optimization component.
- After the first gateway device recognizes RoCEv2 messages, the function of distinguishing CF messages from DF messages according to the classification method provided by this application can be realized through the RoCEv2 packet classification component shown in Figure 11.
- For the optimized processing of DF message transmission, the first gateway device triggers the network controller to complete TE-based path calculation according to the message size of the RDMA data to be transmitted, and then modifies the outer IPv6 header extension option to carry the first identifier and the second identifier, so that after the message enters the SRv6 network, the network node can divert the DF message to the pre-allocated SRv6 forwarding path (i.e., the network path) and enforce strict bandwidth guarantees.
- The first gateway device needs to cache all DF messages and submit a second path calculation request to the network controller through the end-side management system. The request carries network performance requirements such as bandwidth and delay.
- After the network controller calculates the path calculation result based on the submitted second path calculation request, the end-side management system forwards the result to the first gateway device, which fills in the IPv6 header extension option based on the result and sends the DF message. If the actual bandwidth pre-allocated by the network controller does not meet the requested bandwidth, the first gateway device needs to perform source-side rate limiting based on the actual bandwidth when sending the DF message.
- the network node directs the DF packet to the pre-calculated SRv6 forwarding path according to the IPv6 header extension options in the packet, and completes local QoS guarantee processing.
- The optimization process for transmitting DF messages triggered by write/send primitive operations (referred to as write/send messages) is as follows: First, the first network card device RNIC1 sends a write/send message to the second network card device RNIC2, and the first gateway device RGW1 caches the write/send message after receiving it. If the message is a write/send first message, the subsequent write/send middle and write/send last messages also need to be cached. If it is a write/send only message, only the current message is cached.
- RGW1 uses the DMA Length in the message as the bandwidth request and the configured latency parameter as the delay request, and submits the R1→R2 path calculation request for the write/send message to the network controller through the end-side management system.
- The network controller calculates the path calculation result based on the received path calculation request, delivers the SRv6 forwarding path to the head network node R1, and delivers the forwarding path identifier policy-id3 and the actual reserved bandwidth allocated-bw to the end-side management system.
- After receiving the path calculation result, the end-side management system synchronizes the path calculation result of R1→R2 to RGW1.
- the path calculation result carries the allocated policy-id3 and the allocated-bw bandwidth reserved for this path calculation. If the actual allocated bandwidth is less than the requested bandwidth, RGW1 needs to locally record is_bw_satisfied as false. The purpose of recording is to use it as a basis for subsequent judgments on whether source-end speed limiting is required.
- RGW1 processes the cached write/send messages in order (write/send first→write/send middle→write/send last, or a single write/send only). It first completes IPv4 over IPv6 processing, then, according to the DF message forwarding path information applied for by the write/send request, encapsulates the RDMA Option TLV in the IPv6 header so that the IPv6 header carries the DF message flow identifier df-flow-id2 and the DF message forwarding path identifier policy-id3, and forwards the message to the head network node R1.
- If the actual reserved bandwidth is less than the requested bandwidth, RGW1 needs to perform source-end speed limiting based on the actually allocated reserved bandwidth.
- the head network node R1 parses the RDMA Option TLV option, performs traffic diversion processing according to the SRv6 TE Policy ID in the message, and matches the pre-delivered QoS policy based on the RDMA Flow ID in the message to ensure bandwidth.
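The cache-then-flush behaviour of RGW1 in the write/send flow can be sketched as below. This is an illustration only: `DfMessageCache` and the per-operation key `op_key` are hypothetical, and the tagging step stands in for the IPv4 over IPv6 processing and RDMA Option TLV encapsulation described above.

```python
from collections import defaultdict
from typing import Dict, List


class DfMessageCache:
    """Holds write/send packets until the path calculation result arrives."""

    def __init__(self) -> None:
        self._pending: Dict[str, List[bytes]] = defaultdict(list)

    def cache(self, op_key: str, packet: bytes) -> None:
        # Packets are appended in arrival order: first, middle..., last (or only).
        self._pending[op_key].append(packet)

    def flush(self, op_key: str, policy_id: str, flow_id: str) -> List[dict]:
        """Release the cached packets in order, tagged with the path result."""
        packets = self._pending.pop(op_key, [])
        return [
            {"policy_id": policy_id, "flow_id": flow_id, "packet": p}
            for p in packets
        ]
```

Because each operation keeps its own list, the first→middle→last order is preserved without any extra sorting; a second `flush` for the same key returns nothing.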
- first gateway device's processing function for DF message transmission optimization can be implemented through the DF message transmission optimization component. For details, see Figure 11.
- the first gateway device sends the expanded RoCEv2 message to the second gateway device through the network path.
- the first gateway device can send the expanded RoCEv2 message to the opposite second gateway device through the calculated network path (ie, the above-mentioned SRv6 forwarding path).
- The first gateway device can also actively make a copy of the message and modify the outer IPv6 header extension option of the copied message so that it carries the SRv6 forwarding path identifier and the CF message flow identifier. After the copied message enters the SRv6 network, the network controller can then divert the copied message to the backup SRv6 forwarding path, so that the second gateway device at the opposite end performs dual-send and selective-receive processing and packet loss is avoided through traffic redundancy.
- The process may be as follows. First, the first gateway device obtains a backup path, which may be called the first backup path; it can be triggered and generated by the network controller when it receives the target request. The first gateway device then copies the RoCEv2 message to obtain the first copy message, and expands the first copy message according to the first backup path, in a manner similar to the process above, to obtain the expanded first copy message. The expanded first copy message still needs to carry an SRv6 forwarding path identifier; the one it carries may be called a third identifier, and it is used to characterize each network node in the first backup path. Finally, the first gateway device sends the expanded first copy message to the second gateway device through the first backup path, so that the second gateway device performs dual-send and selective-receive processing on the expanded RoCEv2 message and the expanded first copy message.
- The RDMA gateway at the source end needs to make an additional copy of every RoCEv2 message, complete the IPv4 over IPv6 processing, and then encapsulate the RDMA Option TLV in the IPv6 header so that the IPv6 header carries the backup forwarding policy pre-allocated by the network controller (in this embodiment, the highest-availability path (max availability path) can be chosen, and the SRv6 network is not required to provide bandwidth guarantees when forwarding backup traffic). It then forwards the copied RoCEv2 message to the head network node R1.
- the peer RDMA gateway receives two RoCEv2 messages from the calculated designated path and the backup path, and performs dual-send and selective-receive processing.
- The dual-send and selective-receive processing can work as follows: the message received first is retained, and the message received later is discarded directly. For example, if the expanded RoCEv2 message reaches the second gateway device first, it is retained and the expanded first copy message that arrives later is discarded, and vice versa; the details are not repeated here.
- Through the dual-send and selective-receive capability, in which one of the two received copies is discarded, this application achieves anti-packet-loss capability for RoCEv2.
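The keep-first, drop-later rule can be sketched with a small duplicate filter. The text does not say how the receiver recognizes two copies of the same message, so keying on the pair (flow identifier, packet sequence number) is an assumption made for illustration, and `SelectiveReceiver` is a hypothetical name.

```python
from typing import Set, Tuple


class SelectiveReceiver:
    """Accepts the first copy of each message and rejects later copies."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, int]] = set()

    def accept(self, flow_id: int, psn: int) -> bool:
        """Return True for the first arrival of (flow_id, psn), False afterwards."""
        key = (flow_id, psn)
        if key in self._seen:
            return False  # the other copy already arrived; discard this one
        self._seen.add(key)
        return True
```

It does not matter whether the copy from the designated path or the one from the backup path arrives first; whichever is seen first is kept. A practical receiver would also need to age out old entries, which this sketch omits.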
- the above-mentioned anti-packet loss function of the first gateway device can be implemented through the RoCEv2 packet anti-packet loss component. For details, see Figure 11.
- an alternative solution for RoCEv2 anti-packet loss processing is also provided.
- The main difference is that, in this embodiment of the present application, the RDMA gateway does not copy messages; instead, the head network node of the SRv6 network completes the copying of RoCEv2 messages.
- The process may be: when the expanded RoCEv2 message is sent to the head network node of the network path (the head network node may also be called the source network node; it is the network node in the network path that is directly connected to the first gateway device), the head network node is triggered to copy the expanded RoCEv2 message to obtain a second copy message, which is sent through a pre-prepared backup path (which can be called the second backup path). The tail network node of the network path (the tail network node may also be called the destination network node; it is the network node in the network path that is directly connected to the second gateway device, that is, the last network node along the data transmission direction) then performs dual-send and selective-receive processing on the expanded RoCEv2 message and the second copy message.
- the second backup path may also be triggered and generated by the network controller when receiving the target request.
- After the head network node of the SRv6 network receives the RoCEv2 message, it can first identify whether the current message is RoCEv2 traffic based on the Flow Type field in the RDMA Option TLV of the IPv6 header. If it is RoCEv2 traffic, the head network node of the SRv6 network makes a copy of the message and forwards the copy along the highest-availability path (i.e., the second backup path). After the tail network node receives both copies of the RoCEv2 message, it can discard one of them through the dual-send and selective-receive capability, thereby realizing anti-packet-loss capability for RoCEv2 during wide area/metropolitan area network transmission.
- The gateway device in this embodiment of the present application does not need to implement dual-send and selective-receive anti-packet-loss processing. This capability is provided by the network nodes in the SRv6 network, which can reduce the forwarding pressure on gateway devices.
- When the reserved network resources do not meet the network performance requirements of the expanded RoCEv2 message, the first gateway device can also use a flow control mechanism to perform source-end speed limiting. That is, when the network transmission bandwidth cannot meet the demand, an optimal SRv6 forwarding path under the current conditions can still be chosen and the first gateway device is informed of the bandwidth allocated for this transmission, so that the first gateway device can send at a reduced rate at the source end through the flow control mechanism, reducing congestion and avoiding packet loss.
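The text does not specify which flow control mechanism is used for source-end speed limiting. As one common possibility, the sketch below uses a token bucket whose refill rate is set to the bandwidth actually allocated by the network controller. The class name, the byte-based units and the explicit `now` argument (used instead of a wall clock so the behaviour is deterministic) are all assumptions.

```python
class TokenBucket:
    """Token-bucket limiter: tokens are bytes, refilled at rate_bytes_per_s."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0) -> None:
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = now

    def try_send(self, size_bytes: int, now: float) -> bool:
        """Return True and consume tokens if the packet may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # caller should hold the packet and retry later
```

A gateway using this sketch would create the bucket with the allocated-bw value converted to bytes per second and keep a rejected packet queued until a later `try_send` succeeds.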
- The basic logic of the above embodiments of this application relies on a characteristic of the RoCEv2 protocol: the size of the RDMA data to be transmitted can be accurately perceived at the application layer. Precise network performance requirements can therefore be put to the underlying bearer network before traffic is transmitted, so that path planning and network resource reservation are performed in advance and the traffic is transmitted deterministically over the bearer network. On this basis, this application designs the collaboration between RoCEv2 messages and the SRv6 network.
- This idea can also be applied to flexible Ethernet (FlexE) hard-slicing networks and time-sensitive networks (TSN), where collaboration between RoCEv2 messages and the FlexE and TSN protocols achieves wide area/metropolitan area transmission optimization of RDMA data.
- Figure 12 is a schematic structural diagram of a first gateway device provided by an embodiment of the present application.
- The first gateway device 1200 may specifically include an acquisition module 1201, a generation module 1202, an expansion module 1203 and a sending module 1204. The acquisition module 1201 is used to obtain the RoCEv2 message. The generation module 1202 is used to generate a target request according to the RoCEv2 message and send the target request to the network controller of the SRv6 network, so that the network controller obtains the path calculation result based on the target request and determines the QoS policy based on the path calculation result. The path calculation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 message, and the first gateway device and the second gateway device perform data transmission based on the SRv6 network. The expansion module 1203 is used to obtain the path calculation result and expand the RoCEv2 message according to it to obtain the expanded RoCEv2 message. The expanded RoCEv2 message carries a first identifier and a second identifier: the first identifier characterizes each network node in the network path, and the second identifier characterizes the RDMA data corresponding to the RoCEv2 message, so that the network controller performs bandwidth guarantee based on the second identifier and the QoS policy. The sending module 1204 is configured to send the expanded RoCEv2 message to the second gateway device through the network path.
- The generation module 1202 is specifically configured to: when it is determined that the RoCEv2 message is a control message that does not carry a payload field, generate a first path calculation request based on the control message. The first path calculation request is used to trigger the network controller to determine the path calculation result for the control message.
- The generation module 1202 is specifically configured to: when it is determined that the RoCEv2 message is a data message carrying a payload field, generate a second path calculation request carrying network performance requirements based on the data message. The second path calculation request is used to request the network controller to determine a path calculation result for the data message based on the network performance requirements.
- The generation module 1202 is further specifically configured to: when the data message is triggered by a send primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data message, where the RETH header is added by the first network card device before the payload field of the data message, the DMA length field in the RETH header represents the size of the RDMA data corresponding to the data message, and the first network card device corresponds to the first gateway device; and generate a second path calculation request carrying the network performance requirements.
- The generation module 1202 is further specifically configured to: when the data message is triggered by a write primitive operation or a read primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data message; and generate a second path calculation request carrying the network performance requirements.
- The expansion module 1203 is specifically used to modify the IPv6 Hop-by-Hop Option in the IPv6 header of the RoCEv2 message based on the path calculation result to obtain the expanded RoCEv2 message.
- The first gateway device 1200 also includes a backup module 1205, configured to: obtain a first backup path, which is triggered and generated by the network controller when it receives the target request; copy the RoCEv2 message to obtain a first copy message; expand the first copy message according to the first backup path to obtain an expanded first copy message, which carries a third identifier used to characterize each network node in the first backup path; and send the expanded first copy message to the second gateway device through the first backup path, so that the second gateway device performs dual-send and selective-receive processing on the expanded RoCEv2 message and the expanded first copy message.
- The first gateway device 1200 also includes a backup module 1205, configured to: when the expanded RoCEv2 message is sent to the head network node of the network path, trigger the head network node to copy the expanded RoCEv2 message to obtain a second copy message, which is sent via the second backup path, so that the tail network node of the network path performs dual-send and selective-receive processing on the expanded RoCEv2 message and the second copy message. The second backup path is triggered and generated by the network controller when it receives the target request.
- The sending module 1204 is specifically used to: when the reserved network resources do not meet the network performance requirements of the expanded RoCEv2 message, perform source-side rate limiting through a flow control mechanism and send the expanded RoCEv2 message to the second gateway device through the network path.
- the first network card device corresponding to the first gateway device and the second network card device corresponding to the second gateway device deploy IPv4 private network addresses.
- The acquisition module 1201 is specifically used to: obtain the original RoCEv2 message to be sent to the second gateway device; and perform IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 message to obtain the RoCEv2 message.
- The acquisition module 1201 is specifically configured to: receive a target message sent by the first network card device corresponding to the first gateway device; and determine that the target message is a RoCEv2 message according to the UDP destination port number in the target message.
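The recognition and classification steps can be sketched together. RoCEv2 traffic is identified by UDP destination port 4791, and the CF/DF split follows the rule used in this application (a control message carries no payload field, a data message does). The `UdpPacket` type and the string labels are hypothetical simplifications; a real gateway would derive `payload_len` by parsing the RoCEv2 headers.

```python
from dataclasses import dataclass

ROCEV2_UDP_DST_PORT = 4791  # UDP destination port used by RoCEv2


@dataclass(frozen=True)
class UdpPacket:
    dst_port: int
    payload_len: int  # length of the RDMA payload field; 0 when absent


def classify(pkt: UdpPacket) -> str:
    """Return "other" for non-RoCEv2 traffic, else "CF" or "DF"."""
    if pkt.dst_port != ROCEV2_UDP_DST_PORT:
        return "other"
    return "DF" if pkt.payload_len > 0 else "CF"
```

Under this rule a read request, which carries a RETH header but no payload, is classified as a CF message, which matches the read flow described earlier.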
- the reserved network resources include at least any of the following: network bandwidth and minimum network delay.
- the embodiment of the present application also provides a first gateway device.
- Figure 13 is another structural schematic diagram of the first gateway device provided by the embodiment of the present application.
- The modules of the first gateway device 1200 described in the embodiment corresponding to Figure 12 can be deployed on the first gateway device 1300 to implement the functions of the first gateway device 1200 in that embodiment.
- the first gateway device 1300 is implemented by one or more servers.
- The first gateway device 1300 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1322 and memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) storing application programs 1342 or data 1344.
- the memory 1332 and the storage medium 1330 may be short-term storage or persistent storage.
- the program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the first gateway device 1300 .
- the central processor 1322 may be configured to communicate with the storage medium 1330 and execute a series of instruction operations in the storage medium 1330 on the first gateway device 1300 .
- The first gateway device 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input and output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
- The central processor 1322 is used to execute the SRv6-based data transmission method performed by the first gateway device in the embodiment corresponding to FIG. 7.
- Specifically, the first network card device first generates a RoCEv2 message based on the RDMA data to be sent, and the central processor 1322 of the source gateway device corresponding to the first network card device (which can be called the first gateway device) then receives the RoCEv2 message from the first network card device.
- The RoCEv2 message is a message to be sent by the first gateway device to the destination gateway device (which can be called the second gateway device). Note that the first gateway device and the second gateway device transmit data over an SRv6 network.
- After obtaining the RoCEv2 message, the central processor 1322 generates a corresponding request based on it.
- This request can be called the target request.
- The target request is sent to the network controller in the SRv6 network, so that the network controller obtains a path calculation result based on the target request and determines a QoS policy based on the path calculation result.
- The path calculation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 message.
- After the network controller calculates the path calculation result based on the target request, it can send the result to the central processor 1322 through the end-side management system. After the central processor 1322 obtains the path calculation result, it extends the RoCEv2 message to be sent according to that result to obtain an extended RoCEv2 message.
- The extended RoCEv2 message carries at least a first identifier and a second identifier. The first identifier characterizes each network node in the network path, and the second identifier characterizes the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the pre-calculated QoS policy.
- Finally, the extended RoCEv2 message can be sent to the peer second gateway device through the calculated network path.
- Embodiments of the present application also provide a computer-readable storage medium.
- The computer-readable storage medium stores a program for signal processing. When the program is run on a computer, it causes the computer to execute the steps described in the embodiment shown in Figure 7.
- the first gateway device provided by the embodiment of the present application may specifically include a chip.
- the chip may include: a processing unit and a communication unit.
- the processing unit may be, for example, a processor.
- The communication unit may be, for example, an input/output interface, a pin, or a circuit.
- The processing unit can execute computer-executable instructions stored in the storage unit, so that the chip executes the steps described in the embodiment shown in FIG. 7.
- The storage unit may be a storage unit within the chip, such as a register or a cache, or a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
- the device embodiments described above are only illustrative.
- The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
- That is, they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
- The present application can be implemented by software plus the necessary general-purpose hardware. It can of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can readily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. For this application, however, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product.
- The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions that cause a computer device (which can be a personal computer, a network device, etc.) to execute the methods described in the embodiments of the application.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, or data center to another website, computer, or data center over a wired connection (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless connection (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that a computer can store or a data storage device such as a data center integrated with one or more available media.
- the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
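This application discloses a data transmission method and a gateway device, applicable to the field of data transmission. The method includes: a first gateway device obtains a RoCEv2 message to be sent to a second gateway device, generates a target request, and sends the target request to a network controller, where the two gateway devices transmit data over an SRv6 network; the network controller obtains a path calculation result based on the target request and returns it to the first gateway device, where the path calculation result includes the network path between the two gateway devices and the reserved network resources; the first gateway device extends the RoCEv2 message according to the path calculation result, where the extended RoCEv2 message carries a first identifier (representing the network nodes on the network path) and a second identifier (representing the RDMA data); the network controller guarantees bandwidth based on the second identifier and a QoS policy; and the extended RoCEv2 message is forwarded along the network path. Through precise coordination between RoCEv2 and the SRv6 network, this application makes cross-domain, long-distance RDMA transmission over an SRv6 network possible, so that RDMA transmission no longer depends on expensive dedicated-line networks.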
Description
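This application claims priority to Chinese patent application No. 202210750558.6, filed with the Chinese Patent Office on June 29, 2022 and entitled "Data transmission method and gateway device", which is incorporated herein by reference in its entirety.
This application relates to the field of data transmission, and in particular to a data transmission method and a gateway device.
Remote direct memory access (RDMA) is a transmission technology that originated in the data center and, after years of development there, has formed a fairly mature ecosystem. In high-performance computing, big data analytics, distributed storage, and databases, there are high-performance solutions that use RDMA in place of the transmission control protocol (TCP).
The emergence of the RDMA over converged ethernet (RoCE) protocol freed RDMA from its dependence on dedicated InfiniBand (IB) networks. (RDMA was originally designed for IB networks, which require dedicated switches, routers, and network cards combined into a dedicated IB network; hardware resources guarantee the high bandwidth, low latency, and zero packet loss of the transport network, while RDMA's own reliability mechanisms are rather rudimentary.) By drawing on the strong Ethernet ecosystem, RoCE greatly broadened the range of applications of RDMA technology. RoCE has two versions, RoCEv1 and RoCEv2. The main difference is that RoCEv1 is an RDMA protocol implemented at the Ethernet link layer (switches must support flow-control technologies such as priority-based flow control (PFC) to ensure reliable transmission at the physical layer), whereas RoCEv2 is implemented on the user datagram protocol (UDP) layer of the Ethernet transmission control protocol/internet protocol (TCP/IP) stack.
Because of its protocol mechanisms, RDMA places stringent, deterministic service-level agreement (SLA) requirements, such as network bandwidth and latency, on the underlying bearer network. As a result, it does not yet adapt well to non-data-center scenarios such as wide area networks and metropolitan area networks, which confines the RDMA application ecosystem to the data center.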
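Summary of the Invention
Embodiments of this application provide a data transmission method and a gateway device that coordinate the interaction between RoCEv2 messages and an SRv6 network to enable cross-domain, long-distance transmission of RDMA data over the SRv6 network. This extends the scope of RDMA from the data center to the entire internetwork, accelerates end-to-end data transmission through RDMA, and removes the dependence of RDMA transmission on expensive dedicated-line networks, giving lower deployment cost and shorter provisioning time. In addition, it changes the current situation in which the SRv6 network is unaware of application demands: application-layer data can have an accurate optimal path planned before it enters the SRv6 network (that is, the path calculation result is planned before the RoCEv2 message is sent). This solves the problem that tuning could previously be performed only after the fact on the basis of monitoring technology, reduces network management complexity, and can greatly improve user experience.
On this basis, the embodiments of this application provide the following technical solutions: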
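In a first aspect, an embodiment of this application provides a data transmission method that can be used in the field of data transmission. The method includes the following. First, a first network card device acting as the source generates a RoCEv2 message based on the RDMA data to be sent, and the first gateway device corresponding to the first network card device receives the RoCEv2 message from the first network card device. The RoCEv2 message is a message to be sent by the first gateway device to a second gateway device. Note that the first gateway device and the second gateway device transmit data over an SRv6 network, which may be a wide area network or a metropolitan area network; this application does not limit the specific type of SRv6 network. Next, the first gateway device generates a corresponding request, referred to as the target request, from the RoCEv2 message and sends the target request to the network controller in the SRv6 network. The network controller obtains a path calculation result based on the received target request and determines a quality of service (QoS) policy based on the path calculation result. The path calculation result includes the network path between the first gateway device and the second gateway device (also called the SRv6 forwarding path) and the network resources that the network controller reserves for the RoCEv2 message; the reserved network resources are those for the RoCEv2 message along the entire network path. After calculating the path calculation result, the network controller delivers it to the first gateway device. The first gateway device then extends the RoCEv2 message to be sent according to the path calculation result, obtaining an extended RoCEv2 message that carries at least a first identifier and a second identifier. The first identifier represents the network nodes on the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the previously calculated QoS policy. Finally, the first gateway device sends the extended RoCEv2 message to the peer second gateway device over the calculated network path.
In the above implementation of this application, coordinating the interaction between RoCEv2 messages and the SRv6 network enables cross-domain, long-distance transmission of RDMA data over the SRv6 network and extends the scope of RDMA from the data center to the entire internetwork. This changes the deployment form of the RoCEv2 protocol and lifts the constraint that it could only be used on networks inside a data center or on long-distance dedicated-line networks, so that RDMA transmission no longer depends on expensive dedicated-line networks. The method of this application allows RDMA to be carried over an SRv6 network, accelerating end-to-end data transmission with lower deployment cost and shorter provisioning time.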
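In a possible implementation of the first aspect, the first gateway device may generate the target request from the RoCEv2 message as follows: when the first gateway device determines that the RoCEv2 message is a control message that carries no payload field, it generates a first path calculation request from the control message. The first path calculation request triggers the network controller to determine the path calculation result corresponding to the control message. Note that, in some implementations of this application, the first path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system. Note also that the first path calculation request may or may not carry network performance requirements; this application does not limit this.
The above implementation explains that whether a RoCEv2 message is a control message or a data message can be judged by whether it carries a payload field, and that when the RoCEv2 message is a control message the first gateway device generates the first path calculation request accordingly, which is flexible and targeted.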
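In a possible implementation of the first aspect, the first gateway device may also generate the target request from the RoCEv2 message as follows: when the first gateway device determines that the RoCEv2 message is a data message that carries a payload field, it generates a second path calculation request from the data message. Unlike the first path calculation request, the second path calculation request must carry network performance requirements; it requests the network controller to determine the path calculation result for the data message based on those requirements. Likewise, in some implementations of this application, the second path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system.
The above implementation explains that when the RoCEv2 message is a data message, the first gateway device generates a second path calculation request that must carry network performance requirements. This lets the SRv6 network accurately perceive the size of the RDMA data that the application layer is about to send and infer the network resources required, so that an accurate optimal path can be planned before the application-layer RDMA data enters the SRv6 network. This solves the problem that tuning could previously be performed only after the fact on the basis of monitoring technology, reduces network management complexity, and can greatly improve user experience.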
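In a possible implementation of the first aspect, generating the second path calculation request carrying network performance requirements from the data message may specifically include: when the data message is triggered by a send primitive operation, the first gateway device determines the required network performance requirements from the DMA length field in the RETH (RDMA extended transport header) of the data message. The RETH is added before the payload field of the data message by the first network card device corresponding to the first gateway device, and its DMA length field represents the size of the RDMA data corresponding to the data message. On this basis, the first gateway device generates the second path calculation request carrying the network performance requirements.
In the above implementation, when the data message is triggered by a send primitive operation, the message size of the RDMA data to be transmitted is also known, but the message currently has no field that indicates this message size. A field therefore needs to be added to the message header to indicate the size of the message to be sent by this send operation. By optimizing the RoCEv2 protocol on the network card device, a data message triggered by a send primitive operation can also carry a field indicating the message size of the RoCEv2 data to be sent by this operation; this can be achieved by reusing the RETH defined in the IB specification, which is flexible and widely applicable.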
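In a possible implementation of the first aspect, generating the second path calculation request carrying network performance requirements from the data message may also include: when the data message is triggered by a write primitive operation or a read primitive operation, the first gateway device first determines the network performance requirements from the DMA length field in the RETH of the data message and then generates the second path calculation request carrying those requirements. Note that when the data message is triggered by a write primitive operation, the DMA length field in the RETH (RDMA extended transport header) of the first packet indicates the size of the message to be transmitted by this operation. (One piece of RDMA data may be split into n packets; only the first packet carries the message size of the RDMA data, and the other packets of that RDMA data are sent over the same network path as the first packet.) Similarly, when the data message is triggered by a read primitive, the network performance requirements are carried in the read request message, and the requester triggers path calculation for the reverse forwarding path that starts at the responder. When the responder receives the read request and replies with a read response, it may reply with multiple response messages, all of which must be forwarded along the precomputed path. In these cases, the first gateway device can therefore determine the network performance requirements from the DMA length field in the RETH of the data message and generate the second path calculation request carrying them.
In the above implementation, when the data message is triggered by a write or read primitive operation, the DMA length field in the RETH of the first packet indicates the size of the message to be transmitted by this operation, so the network performance requirements can be determined directly from the DMA length field, which is simple, convenient, and practicable.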
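In a possible implementation of the first aspect, the first gateway device may extend the RoCEv2 message according to the path calculation result as follows: the first gateway device modifies the IPv6 Hop-by-Hop Option in the IPv6 header of the RoCEv2 message according to the path calculation result, thereby obtaining the extended RoCEv2 message.
In the above implementation, extending the format of the RoCEv2 message allows it to carry the size of the application-layer data to be transmitted and the forwarding path identifier. Once RDMA traffic enters the SRv6 network, the mature QoS capabilities of wide area/metropolitan area networks can then be used to provide a committed transmission bandwidth guarantee.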
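In a possible implementation of the first aspect, the SRv6-based data transmission method may further include the following steps. First, the first gateway device obtains a backup path, referred to as the first backup path, whose generation may be triggered when the network controller receives the target request. Next, the first gateway device copies the RoCEv2 message to obtain a first copied message, and extends the first copied message according to the first backup path in a manner similar to the process above, obtaining an extended first copied message. The extended first copied message must still carry an SRv6 forwarding path identifier; this identifier, referred to as the third identifier, represents the network nodes on the first backup path. Finally, the first gateway device sends the extended first copied message to the second gateway device over the first backup path, so that the second gateway device performs dual-send selective-receive processing on the extended RoCEv2 message and the extended first copied message. Dual-send selective-receive processing may work as follows: the copy received first is kept and the copy received later is discarded. For example, if the extended RoCEv2 message reaches the second gateway device first, it is kept and the extended first copied message that arrives later is discarded, and vice versa; details are not repeated here.
In the above implementation, for every RoCEv2 message, in addition to the transmission optimization based on message classification, the first gateway device makes an extra copy of the RoCEv2 message and completes multi-send selective-receive processing of the RoCEv2 message over the backup path. This improves resilience to packet loss during network transmission by using traffic redundancy to avoid packet loss.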
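A possible implementation of the first aspect also provides an alternative scheme for RoCEv2 packet-loss resilience. The main difference is that the message is not copied at the RDMA gateway; instead, the head network node of the SRv6 network copies the RoCEv2 message. Specifically, when the extended RoCEv2 message is sent to the head network node of the network path (also called the source network node: the network node on the network path that is directly connected to the first gateway device and is the first node along the direction of data transmission), the head network node is triggered to copy the extended RoCEv2 message, obtaining a second copied message, which is sent over a backup path prepared in advance (referred to as the second backup path). Finally, the tail network node of the network path (also called the destination network node: the network node on the network path that is directly connected to the second gateway device and is the last node along the direction of data transmission) performs dual-send selective-receive processing on the extended RoCEv2 message and the second copied message. Generation of the second backup path may likewise be triggered when the network controller receives the target request.
In the above implementation, the gateway device does not need to implement dual-send selective-receive packet-loss resilience; that capability is provided by the network nodes in the SRv6 network, which reduces the forwarding load on the gateway device.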
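In a possible implementation of the first aspect, the first gateway device may also send the extended RoCEv2 message to the second gateway device over the network path as follows: when the reserved network resources do not meet the network performance requirements of the extended RoCEv2 message, the first gateway device limits the rate at the source through a flow control mechanism and sends the extended RoCEv2 message to the second gateway device over the network path.
In the above implementation, when the network resources available for transmission cannot meet the demand, the currently best SRv6 forwarding path can still be selected, and the first gateway device is informed of the amount of network resources allocated for this transmission so that it can send at a reduced rate at the source through a flow control mechanism. This reduces congestion, avoids packet loss, and improves transmission reliability.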
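In a possible implementation of the first aspect, if the first network card device corresponding to the first gateway device and the second network card device corresponding to the second gateway device are configured with private IPv4 addresses, the first gateway device may obtain the RoCEv2 message to be sent to the second gateway device as follows: the first network card device generates an original RoCEv2 message based on the RDMA data to be sent, and the first gateway device then applies IPv4 over IPv6 tunnel encapsulation to the original RoCEv2 message to obtain the RoCEv2 message.
The above implementation explains that when the network card devices are configured with private IPv4 addresses, the corresponding gateway devices must also be capable of IPv4 over IPv6 tunnel encapsulation. Only then can network card devices deployed in users' private networks communicate with each other across a wide area/metropolitan area network, which makes the scheme practicable.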
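In a possible implementation of the first aspect, the first gateway device may also obtain the RoCEv2 message to be sent to the second gateway device as follows: first, the first gateway device receives a target message that the first network card device, which corresponds to the first gateway device, sends toward the second gateway device; then the first gateway device determines from the UDP destination port number in the target message that the target message is the RoCEv2 message.
In the above implementation, because the first gateway device may receive various kinds of messages from the first network card device, it can decide whether a received target message is a RoCEv2 message according to the UDP destination port number in that message, which is practicable.
In a possible implementation of the first aspect, the reserved network resources include at least one of the following: network bandwidth and minimum network latency.
The above implementation specifies typical types of the reserved network resources described in this application, which makes it widely applicable.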
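A second aspect of the embodiments of this application provides a gateway device that serves as the first gateway device and has the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.
A third aspect of the embodiments of this application provides a gateway device that serves as the first gateway device and may include a memory, a processor, and a bus system. The memory stores a program, and the processor calls the program stored in the memory to execute the method of the first aspect or any possible implementation of the first aspect.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium that stores instructions which, when run on a computer, enable the computer to execute the method of the first aspect or any possible implementation of the first aspect.
A fifth aspect of the embodiments of this application provides a computer program which, when run on a computer, causes the computer to execute the method of the first aspect or any possible implementation of the first aspect.
A sixth aspect of the embodiments of this application provides a chip that includes at least one processor and at least one interface circuit coupled to the processor. The at least one interface circuit performs transceiving functions and sends instructions to the at least one processor, and the at least one processor runs computer programs or instructions. The chip has the function of implementing the method of the first aspect or any possible implementation of the first aspect, or the function of implementing the method of the second aspect or any possible implementation of the second aspect. This function may be implemented by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the function. In addition, the interface circuit is used to communicate with modules outside the chip.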
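Figure 1 is a schematic diagram of the workflow of the SRv6 TE policy provided by an embodiment of this application;
Figure 2 is a schematic diagram of an application scenario provided by an embodiment of this application;
Figure 3 is a schematic diagram of the general model of the current processing scheme provided by an embodiment of this application;
Figure 4 is a schematic diagram of an application scenario provided by an embodiment of this application;
Figure 5 is a schematic diagram of one application of the system architecture provided by an embodiment of this application;
Figure 6 is a schematic diagram of another application of the system architecture provided by an embodiment of this application;
Figure 7 is a schematic flowchart of the data transmission method provided by an embodiment of this application;
Figure 8 is a schematic flowchart of the network initialization stage provided by an embodiment of this application;
Figure 9 is a schematic diagram of how the message headers change as a RoCEv2 message passes through each network node on the network path, provided by an embodiment of this application;
Figure 10 is a schematic diagram of the format definition provided by an embodiment of this application;
Figure 11 is a schematic structural diagram of the components of the data transmission method provided by an embodiment of this application;
Figure 12 is a schematic structural diagram of the first gateway device provided by an embodiment of this application;
Figure 13 is another schematic structural diagram of the first gateway device provided by an embodiment of this application.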
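Embodiments of this application provide a data transmission method and a gateway device that coordinate the interaction between RoCEv2 messages and an SRv6 network to enable cross-domain, long-distance transmission of RDMA data over the SRv6 network. This extends the scope of RDMA from the data center to the entire internetwork, accelerates end-to-end data transmission through RDMA, and removes the dependence of RDMA transmission on expensive dedicated-line networks, giving lower deployment cost and shorter provisioning time. In addition, it changes the current situation in which the SRv6 network is unaware of application demands: application-layer data can have an accurate optimal path planned before it enters the SRv6 network (that is, the path calculation result is planned before the RoCEv2 message is sent). This solves the problem that tuning could previously be performed only after the fact on the basis of monitoring technology, reduces network management complexity, and can greatly improve user experience.
The terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the means used in describing the embodiments of this application to distinguish objects with the same attributes. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units but may include other units that are not expressly listed or that are inherent to the process, method, product, or device.
The embodiments of this application involve a good deal of networking background. To make the solutions easier to understand, the related terms and concepts that may be involved are introduced first. It should be understood that the explanations of these concepts may be constrained by the specific circumstances of the embodiments of this application, but this does not mean that this application is limited to those circumstances; the specifics may differ between embodiments and are not limited here.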
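(1) Segment routing over IPv6 (segment routing IPv6, SRv6)
SRv6 is segment routing based on the IPv6 forwarding plane (SR + IPv6): it uses existing IPv6 forwarding technology and flexible IPv6 extension headers to make the network programmable.
SRv6 is a new-generation IP bearer protocol that can simplify and unify traditional, complex network protocols, and it is the foundation for building intelligent IP networks in the 5G and cloud era. SRv6 combines the source-routing advantages of SR with the simplicity and extensibility of IPv6, offers multiple programming spaces, and fits the software defined network (SDN) philosophy, making it a powerful tool for intent-driven networking. Its rich network programming capability better meets the needs of new network services, and its compatibility with IPv6 makes service deployment simpler.
SRv6 has two working modes: the traffic-engineering SRv6 policy (segment routing IPv6 traffic engineering, SRv6 TE) and the best-effort SRv6 policy (segment routing IPv6 best effort, SRv6 BE). The SRv6 BE policy is the optimal SRv6 path computed with a shortest-path algorithm. The SRv6 TE policy is a new tunnel traffic-steering technology developed on top of SRv6. An SRv6 TE path is expressed as a segment list for a specified path, called a segment identifier list or SID list (segment ID list) for short. Each SID list is an end-to-end path from source to destination and instructs the devices in the network to follow the specified path rather than the shortest path computed by the interior gateway protocol (IGP). When a packet is steered into an SRv6 TE path, the head end adds the SID list to the packet and the remaining devices in the network execute the instructions embedded in the SID list. Compared with the SRv6 BE policy, the SRv6 TE policy uses traffic engineering to respond better to differentiated service needs, lets services drive the network, and suits scenarios in which services have strict network SLA requirements.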
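Specifically, as shown in Figure 1, the workflow of the SRv6 TE policy can be summarized in five steps:
①. The forwarders report network topology information to the network controller. The topology information includes node and link information as well as traffic engineering (TE) attributes such as link cost, bandwidth, and latency.
②. The controller computes paths that satisfy the service SLA from the collected topology information according to service requirements.
③. The controller delivers the path information to the head node of the network (the node to which the source device attaches, also called the source node) through the border gateway protocol (BGP) SR-Policy extension, and the head node generates the SRv6 TE policy. The generated SRv6 TE policy includes key information such as the head-end address, the destination address, and the Color.
④. The head node of the network selects a suitable SRv6 TE policy for the service to guide forwarding.
⑤. During data forwarding, each forwarder executes the instructions of the SIDs it has advertised.
Note, however, that the SRv6 TE policy computes paths from network topology and state and is unaware of application state and requirements. It therefore cannot select paths strictly and precisely according to an application's exact bandwidth requirement; it can only schedule application traffic onto the path that is currently most suitable, and must then rely on monitoring and detection to discover anomalies and perform path optimization after the fact.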
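(2) Remote direct memory access (RDMA)
As data center networks have been upgraded, bandwidth has increased from 1 Gbps to 10 Gbps and then to 40 Gbps/100 Gbps, and base round-trip latency has fallen from hundreds of microseconds to a few microseconds. Traditional transmission technology centered on TCP, despite repeated optimization, struggles to meet the performance expected of high-speed, low-latency data center networks, and RDMA technology emerged in response.
RDMA is a direct memory access technology. By registering and binding memory with the network card, it provides the ability to access remote memory directly over the network, transferring data from the memory of a local computer device to a remote computer device without involving the operating system (OS) of either side. It therefore has no impact on either OS and needs little of the computers' processing capacity. It completes all processing logic in user mode and eliminates the overhead of external memory copies and context switches, freeing memory bandwidth and CPU cycles to improve application performance, and thus offers high bandwidth, low latency, and low CPU load.
There are currently three broad types of RDMA network: IB networks, RoCE, and the internet wide area RDMA protocol (iWARP). An IB network is designed specifically for RDMA and guarantees reliable transmission at the hardware level, whereas RoCE and iWARP are both Ethernet-based RDMA technologies that support the corresponding verbs interface. The RoCE protocol has two versions, RoCEv1 and RoCEv2.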
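(3) Network card device (network interface card, NIC)
A network card device, or network card for short (also called a network adapter), is a piece of computer hardware designed to allow a computer to communicate on a computer network. Because it has a media access control (MAC) address, it sits between layer 1 and layer 2 of the open system interconnection (OSI) model. It allows users to connect to each other by cable or wirelessly.
Note that, in the embodiments of this application, the network card may be, besides a network card in the ordinary sense, a data processing unit (DPU), a smart network interface card (smart NIC), an RDMA network card, or another form of device with network card functions; this application does not limit this. For example, an RDMA network card receives remote memory access requests from the central processing unit (CPU) and sends them onto the network, or receives remote memory access requests from the network, accesses host memory through its direct memory access (DMA) engine, and finally returns the access result to the initiator over the network. In the embodiments of this application, because the data to be sent is RDMA data, the network card used may be an RDMA network card (RDMA NIC, RNIC).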
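(4) Gateway device (gateway)
A gateway device, or gateway for short (also called an inter-network connector or protocol converter), is a computer system or device that provides data conversion services between multiple networks. Between two systems that use different communication protocols, data formats, or languages, or even entirely different architectures, the gateway acts as a translator: it repackages the information it receives to suit the needs of the destination system, and also provides filtering and security. In the embodiments of this application, a gateway for RDMA data may be called an RDMA gateway (RGW).
A gateway works at the transport layer and all layers above it in the open system interconnection reference model (OSI/RM). It re-encapsulates information so that another system can process it, and must therefore be able to communicate with a variety of applications, including establishing and managing sessions and transmitting and parsing data. In fact, today's gateway can no longer be classed purely as network hardware; it is better described as a combination of software and hardware that can connect different networks.
Note that the SRv6-based data transmission method provided by the embodiments of this application may be deployed on an existing gateway device (coupled deployment) or deployed independently as a dedicated gateway device (decoupled deployment); this application does not limit this.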
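(5) InfiniBand (IB)
IB is a computer network communication standard for high-performance computing. It offers very high throughput and very low latency and is used for data interconnection between computer devices.
(6) Dedicated network line
A dedicated network line is a dedicated channel that a network service provider offers to a user so that the user's data transmission is reliable and trustworthy. Its advantages are good security and guaranteed QoS. Dedicated lines mainly use one of the following two channel types:
a. Physical dedicated channel: a dedicated line is laid between the service provider and the user for the user's exclusive use; no other data may enter it, whereas an ordinary line lets multiple users share the channel.
b. Virtual dedicated channel: a certain amount of bandwidth on an ordinary channel is reserved for the user's exclusive use, as if a separate lane had been opened on a public channel for that user alone; the user's data is also encrypted to ensure reliability and security.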
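Advances in information technology have made data centers (for example, public/private cloud data centers and supercomputing data centers) the points where computing power is concentrated. Enterprises must build their applications around a data center to use its powerful computing power, and in typical scenarios both data and computing power are concentrated inside the data center.
As enterprise production becomes increasingly digital, massive amounts of data are generated. This data is often located outside the data center, possibly hundreds or thousands of kilometers away. Data processing demands generally fall into two types: the "T+0" type, which requires real-time processing, involves modest data volumes but strong real-time requirements, and may need to feed results back to the production system for real-time control; and the "T+1" type, which does not require real-time processing, involves large data volumes, and has no real-time requirement, although faster turnaround is better.
The embodiments of this application focus on the problem that the end side generates massive data but lacks local computing power and must use the centralized computing power of a data center to complete "T+1" computing requests, as shown in Figure 2. Such requests are common across industries, for example seismic exploration data processing in the energy industry, gene sequencing data processing in healthcare, 3D rendering and video/image data processing in media, and high-energy physics and meteorological data analysis in scientific research. The requirements of these industries and the current processing scheme can be abstracted into the general model shown in Figure 3. From the network perspective (taking a wide area network as an example), data travelling from the end side to the data center must cross the wide area network (for example, an SRv6 network) and the data center network. From the data transmission perspective, the data first crosses the wide area network to a storage node in the data center, either by TCP transmission or by shipping hard disks, and resides on the storage node as a copy; compute nodes then read the data from the storage node over the data center network, using TCP or high-performance RDMA, to process it. The current solution faces the following challenges:
a. Data security: the data copy risks being leaked through various anomalies.
b. Data consistency: updates to the source data risk leaving the copy inconsistent with it.
c. Data timeliness: the time taken to ship hard disks and the low throughput of TCP over wide area networks reduce the timeliness of data processing.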
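Given the mature use of RDMA in high-performance data center transmission, the industry wants to port RDMA to Ethernet. However, RDMA was originally designed for IB networks, which require dedicated switches, routers, and network cards combined into a dedicated IB network; hardware resources guarantee the high bandwidth, low latency, and zero packet loss of the transport network, while RDMA's own reliability mechanisms are rather rudimentary. Porting RDMA to Ethernet brings the IB network's requirements with it, in particular the requirement for losslessness, so that Ethernet, which originally provides best-effort service, must adapt through complex flow control mechanisms (for example, global pause and priority-based flow control (PFC)) or congestion control (for example, explicit congestion notification (ECN) and data center quantized congestion notification (DCQCN)). In summary, RDMA over Ethernet has the following problems:
a. Scalability: poor; it suits only closed internal networks and cannot be deployed across regions.
b. Packet loss handling: constrained by hardware resources, RDMA network cards use coarse Go-back-N retransmission and are very sensitive to packet loss. Unexpected packet loss causes transmission efficiency to drop sharply, so the network must be kept from losing packets.
c. Requirements on network equipment: a lossless network requires cooperation from the switches (IB networks need dedicated switches, and RoCE needs switches that support PFC, ECN, and so on).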
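The industry has also explored RDMA over long-distance networks; for example, RDMA can use the high quality of dedicated network lines to achieve wide-area transmission. Compared with the data center scenario, RDMA over a dedicated line degrades somewhat only in latency, while the other network service parameters reach the same level. Although dedicated lines provide a high-quality bearer service, they still have the following problems:
a. Dedicated lines are expensive to lease, and provisioning them is subject to many constraints and restrictions.
b. Dedicated lines take a long time to provision and need specialist staff from the network service provider for management and maintenance.
c. The dedicated-line services of different network service providers may not interoperate.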
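On this basis, this application designs a wide area/metropolitan area network that can carry RDMA transmission and solves the above problems through end-to-end, high-performance RDMA transmission. Considering cost control and deployment efficiency in engineering practice, this application focuses on using SRv6 technology on wide area/metropolitan area networks to solve the problem of reliable RDMA transmission. The specific application scenario may be as shown in Figure 4. The method provided by this application extends the existing scope of RDMA from the data center to the entire internetwork and accelerates end-to-end data transmission through RDMA, mainly by addressing the following problems:
a. How the gateway device identifies (for example, through an RDMA transmission processing layer) and classifies RDMA data (also called RDMA traffic), uses fine-grained management to perceive accurately the network performance requirements (for example, bandwidth and latency) of different RDMA data transmissions, and then cooperates with the SRv6 network to compute paths precisely from those requirements and the real-time link state of the SRv6 network, with the path calculation result fed back to the gateway device.
b. How the gateway device, according to the path calculation result (for example, through the RDMA transmission processing layer), cooperates with the network nodes in the SRv6 network (for example, wide area routers) to steer different RDMA data into the corresponding SRv6 TE policies and achieve precise bandwidth guarantees through the network nodes.
The embodiments of this application are described below with reference to the drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided by the embodiments of this application apply equally to similar technical problems.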
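First, the system architecture provided by the embodiments of this application is described. Refer to Figure 5, which is a schematic diagram of one application of the system architecture. The embodiments of this application mainly apply to scenarios in which RDMA network card devices (RDMA network interface card, RNIC) on the end side need to cross an SRv6 network; RDMA network card devices are referred to collectively as network card devices. The source and destination network card devices may be configured with private IPv4 addresses or IPv6 addresses and interact using the RoCEv2 protocol. An SRv6 network generally consists of a network controller (for example, a wide area network controller) and multiple routing devices, shown as RGW1, RGW2, R3, R4, and R5 in Figure 5. Routing devices may also be called network nodes. The network controller manages all network nodes in its SRv6 network and is responsible for path calculation in the SRv6 network. As shown in Figure 5, suppose the network controller has already computed four SRv6 TE policy paths from the network state: the MD (min delay) path, the LC (least cost) path, the BB (bandwidth balance) path, and the MA (max availability) path, where the MD path is the minimum-latency path, the LC path is the least-cost path, the BB path is the bandwidth-balanced path, and the MA path is the maximum-availability path. When RDMA traffic crosses the wide area SRv6 network, that network carries both ordinary internet traffic (traffic other than RoCEv2 data) and RoCEv2 traffic.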
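Note that, in the embodiments of this application, the source and destination network card devices may be configured with private IPv4 addresses or with IPv6 addresses; this application does not limit this.
Note, however, that if the source and destination network card devices are configured with private IPv4 addresses, their corresponding gateway devices must provide IPv4 over IPv6 tunnel traversal for them. That is, the gateway device must apply IPv4 over IPv6 tunnel encapsulation to the messages sent by the network card device, so that RNICs deployed in users' private networks can communicate with each other across the wide area network, as in the transmission process shown in Figure 5. If the source and destination network card devices are configured with IPv6 addresses, the gateway device does not need to apply IPv4 over IPv6 tunnel encapsulation to the messages sent by the network card device: the network card devices run the RoCEv2 protocol directly over IPv6, and the communicating network card devices are reachable from each other by routing across the SRv6 network, as in the transmission process shown in Figure 6. The benefit of this approach is that the gateway device need not provide IPv4 over IPv6 tunnel encapsulation, the whole network runs IPv6, the protocol stack is unified, and the networking is simpler and clearer. Apart from the different local area network addresses deployed, the rest of the processing in the embodiments of this application is similar; see the implementations below. For ease of description, the following embodiments all assume that the source and destination network card devices are configured with private IPv4 addresses.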
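This application implements the SRv6-based data transmission method through an RDMA gateway device (RDMA gateway, RGW). In the overall solution, a matching end-side management system can be provided to coordinate with the SRv6 network. All RGWs within a preset scope can correspond to one end-side management system and be managed by it; which RGWs a given end-side management system manages is defined in advance at deployment and is not detailed here. An RGW is generally deployed at the egress of the user network, with its local area network (LAN) side connected to the RNICs and its wide area network (WAN) side connected to a network node in the SRv6 network. All traffic sent by an RNIC is monitored and classified by the egress gateway RGW, so that RDMA traffic (divided into control traffic and data traffic in the embodiments of this application) can be matched to its own SRv6 forwarding path according to the network performance (for example, bandwidth and latency) it requires, giving RDMA traffic a deterministic SLA guarantee when forwarded across the wide area/metropolitan area network.
Note that, in other implementations of this application, the end-side management system may be omitted and modules with the corresponding functions deployed on each RGW to provide the functions of the end-side management system; this application does not limit this. For ease of description, the following embodiments all assume that an end-side management system is deployed.
Next, the SRv6-based data transmission method provided by the embodiments of this application is described. The core of the embodiments lies in how the RDMA gateway device coordinates the RoCEv2 protocol with the SRv6 network so that, based on the precise bandwidth, latency, and other network performance requirements of an RDMA data transmission, paths are planned in advance and strict QoS guarantees are then enforced, achieving deterministic transmission of RoCEv2 messages over the SRv6 network. Refer to Figure 7, which is a schematic flowchart of the data transmission method provided by an embodiment of this application. The method may include the following steps: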
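701. The first gateway device obtains a RoCEv2 message.
First, the source network card device (the first network card device) generates a RoCEv2 message based on the RDMA data to be sent, and the source gateway device corresponding to the first network card device (the first gateway device) receives the RoCEv2 message from the first network card device. The RoCEv2 message is a message to be sent by the first gateway device to the destination gateway device (the second gateway device). Note that the first gateway device and the second gateway device transmit data over an SRv6 network, which may be a wide area network or a metropolitan area network; this application does not limit the specific type of SRv6 network.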
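Note that, in some implementations of this application, if the first network card device corresponding to the first gateway device and the second network card device corresponding to the second gateway device are configured with private IPv4 addresses, the first gateway device may obtain the RoCEv2 message to be sent to the second gateway device as follows: the first network card device generates an original RoCEv2 message based on the RDMA data to be sent, and the first gateway device then applies IPv4 over IPv6 tunnel encapsulation to the original RoCEv2 message to obtain the RoCEv2 message. Only then can network card devices deployed in users' private networks communicate with each other across a wide area/metropolitan area network.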
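As an illustration of the tunnel encapsulation described above, the following is a minimal sketch that prepends a fixed 40-byte IPv6 header to an IPv4 packet. The function name `encapsulate_ipv4_in_ipv6`, the default hop limit, and the example addresses are assumptions made for illustration; the source does not specify how a gateway builds the outer header, and a real gateway that also adds a Hop-by-Hop option would set the outer Next Header field differently.

```python
import ipaddress
import struct

IPPROTO_IPV4 = 4  # Next Header value for an encapsulated IPv4 packet


def encapsulate_ipv4_in_ipv6(ipv4_packet: bytes, src: str, dst: str,
                             hop_limit: int = 64) -> bytes:
    """Prepend a fixed IPv6 header to an IPv4 packet (hypothetical helper)."""
    version_tc_flow = 6 << 28  # version 6, traffic class 0, flow label 0
    header = struct.pack(
        "!IHBB",
        version_tc_flow,
        len(ipv4_packet),  # payload length excludes the 40-byte IPv6 header
        IPPROTO_IPV4,
        hop_limit,
    )
    header += ipaddress.IPv6Address(src).packed  # 16-byte source address
    header += ipaddress.IPv6Address(dst).packed  # 16-byte destination address
    return header + ipv4_packet
```

In this sketch the inner packet is carried unchanged, so the peer gateway can recover the original RoCEv2 message by stripping the first 40 bytes.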
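Note further that, in some implementations of this application, because the first gateway device may receive various kinds of messages from the first network card device, it can decide whether a received target message is a RoCEv2 message according to the UDP destination port number in that message.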
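The port check described above can be sketched as follows. The source does not name the port; this sketch assumes the UDP destination port 4791 that is registered for RoCEv2, and the function name `is_rocev2` is hypothetical.

```python
import struct

ROCEV2_UDP_PORT = 4791  # assumed here: the UDP destination port registered for RoCEv2


def is_rocev2(udp_datagram: bytes) -> bool:
    """Return True if a UDP datagram (header plus payload) targets the RoCEv2 port."""
    if len(udp_datagram) < 8:  # a UDP header is 8 bytes
        return False
    _src_port, dst_port, _length, _checksum = struct.unpack("!HHHH", udp_datagram[:8])
    return dst_port == ROCEV2_UDP_PORT
```

A gateway would apply such a check after locating the UDP header inside the received frame; messages that fail it would follow the ordinary forwarding path.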
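Note further that, in some implementations of this application, to achieve the best message forwarding performance, this application may use a dedicated gateway device; that is, the components implementing the SRv6-based data transmission method may be coupled to an existing gateway device to form a dedicated gateway device. These components are not restricted to a dedicated gateway device, however, and may be decoupled and deployed independently, for example on a server, a traditional network device, or a field-programmable gate array (FPGA) device. The first gateway device described in this application is merely one entity that performs the functions of these components, and this application does not limit its physical form.
702. The first gateway device generates a target request from the RoCEv2 message and sends the target request to the network controller of the SRv6 network.
After obtaining the RoCEv2 message, the first gateway device generates a corresponding request, referred to as the target request, from it and then sends the target request to the network controller in the SRv6 network.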
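Note that, in some implementations of this application, after identifying a RoCEv2 message, the first gateway device may first determine the message type of the RoCEv2 message according to custom rules and then generate the corresponding target request based on that type. Specifically, the first gateway device may use the RoCEv2 message classification method designed in this application to determine whether the RoCEv2 message is a control flow (CF) message or a data flow (DF) message, and decide what kind of target request to send accordingly.
First, the RoCEv2 message classification method designed in this application is described. By analysing the IB protocol specification and the formats of the various types of RoCEv2 message, this application divides RoCEv2 messages into two classes, CF messages and DF messages, introduced in turn below: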
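I. CF messages
A CF message carries no service data payload; that is, a RoCEv2 message that does not carry a payload field is a CF message. Besides the BTH (base transport header), a CF message may carry various other L4 headers as needed, such as an ETH (extended transport header) or a CMM (communication management message). CF messages are generally triggered by the interaction mechanisms of the IB protocol. Different types of CF message differ in length, but each is a single short message of fixed length. The transmission bandwidth required can be estimated from the message size and the number of concurrent messages and, overall, the demand on the bearer network's bandwidth is small.
CF messages include the following types: CM connection-establishment messages, Read Request messages, Cmp Swap messages, Fetch Add messages, ACK messages, Atomic ACK messages, and RESYNC messages. Which of these types a particular CF message belongs to can be determined from the values of the opcode field and the destination queue pair field in its BTH; details are not given here.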
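II. DF messages
A DF message carries a service data payload; that is, a RoCEv2 message that carries a payload field is a DF message. A DF message also carries the BTH and optional L4 headers. DF messages are generally triggered by send, read, and write primitive operations, and their length depends on the size of the data to be sent and on the path maximum transmission unit (PMTU). Overall, they place a high demand on the bearer network's transmission bandwidth. The required bandwidth may be calculated according to rules already in the standard protocol or according to other custom rules; this application does not limit this.
DF messages include the following types: read response messages, write/send first messages, write/send middle messages, write/send last messages, write/send only messages, and write/send only with immediate messages.
Note that, in the embodiments of this application, the write/send first, write/send middle, and write/send last messages together correspond to one piece of RDMA data (that is, one RDMA message), whereas a write/send only message or a write/send only with immediate message corresponds to one piece of RDMA data on its own. In other words, a piece of RDMA data (also called an RDMA message or RDMA traffic) can be sent as a single packet, generally when it is small, in which case the corresponding message is a write/send only or write/send only with immediate message. When the RDMA data is too large to send in one packet, it can be split into at least three packets: the first (also called the first packet) corresponds to the write/send first message, the last corresponds to the write/send last message, and the packet or packets in between correspond to write/send middle messages.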
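Similarly, which of these types a particular DF message belongs to can also be determined from the values of the opcode field and the destination queue pair field in its BTH; details are not given here.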
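The CF/DF classification above can be sketched from the BTH alone. The source gives no opcode values, so the numbers below are assumptions drawn from the commonly documented InfiniBand encoding (the low five bits of the opcode select the operation, and connection-management messages are addressed to queue pair 1); they should be checked against the IB specification before use. The names `FlowType` and `classify` are hypothetical.

```python
from enum import Enum

BTH_LEN = 12  # opcode(1) flags(1) P_Key(2) reserved(1) dest QP(3) ack/reserved(1) PSN(3)
CM_QP = 1     # assumption: CM messages are sent to the general services queue pair

# Assumed operation codes (low five bits of the BTH opcode).
DF_OPERATIONS = set(range(0x00, 0x0C)) | set(range(0x0D, 0x11))  # send/write packets, read responses
CF_OPERATIONS = {0x0C, 0x11, 0x12, 0x13, 0x14, 0x15}  # read request, ACK, atomic ACK, CmpSwap, FetchAdd, RESYNC


class FlowType(Enum):
    CF = "control"
    DF = "data"


def classify(bth: bytes) -> FlowType:
    """Classify a RoCEv2 message as CF or DF from its base transport header."""
    if len(bth) < BTH_LEN:
        raise ValueError("truncated BTH")
    dest_qp = int.from_bytes(bth[5:8], "big")
    if dest_qp == CM_QP:
        return FlowType.CF  # connection-management traffic
    operation = bth[0] & 0x1F
    if operation in CF_OPERATIONS:
        return FlowType.CF
    if operation in DF_OPERATIONS:
        return FlowType.DF
    raise ValueError(f"unrecognised opcode 0x{bth[0]:02x}")
```

Checking the destination queue pair before the opcode matters in this sketch because a CM message travels as an ordinary send packet and would otherwise be counted as data.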
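After classifying the obtained RoCEv2 message with the classification method described above, the first gateway device can generate the corresponding target request according to the type of the RoCEv2 message (CF or DF), as described below:
I. When the RoCEv2 message is a CF message, the target request is the first path calculation request
When the first gateway device determines that the RoCEv2 message is a CF message, it generates a first path calculation request from the CF message. The first path calculation request triggers the network controller to determine the path calculation result corresponding to the CF message. Note that, in some implementations of this application, the first path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system. Note also that when the RoCEv2 message is a CF message, one piece of RDMA data generally corresponds to a single message, because CF messages are usually small and need not be split.
Note that, in the embodiments of this application, although CF messages of different types may differ in length, each is a single short message of fixed length and places little demand on the bearer network's bandwidth. Therefore, for the first gateway device acting as the source (denoted RGW1) and the second gateway device acting as the destination (denoted RGW2), the network controller can compute the RGW1→RGW2 path calculation result in advance. This result includes the RGW1→RGW2 network path and the network resources (for example, bandwidth) reserved for CF messages. Every CF message subsequently sent from RGW1 to RGW2 uses this path calculation result.
Note further that, in the embodiments of this application, the time at which the network controller computes the RGW1→RGW2 path calculation result for CF messages includes, but is not limited to, network initialization and the first time the first gateway device has a CF message to send, as described below:
A. The network controller computes the RGW1→RGW2 path calculation result for CF messages during network initialization.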
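Refer to Figure 8, a schematic flowchart of the network initialization stage provided by an embodiment of this application. In the SRv6 network, the network controller first collects network state information (such as the network topology, Binding SIDs, and SR-Policy state) through the BGP-LS protocol. When RGW1 powers on, it comes under the management of the end-side management system. An administrator can use the end-side management system to orchestrate the connectivity among all the gateway devices it manages (the topology of the managed gateway devices may be the default full mesh or a topology configured in advance; this application does not limit this). RGW1 then receives the orchestration configuration delivered by the end-side management system, which gives RGW1 and RGW2 the ability to tunnel across the SRv6 network. When the end-side management system determines that the two RGWs (RGW1 and RGW2) are connected, it delivers to each its own orchestration configuration. In addition, the administrator configures network performance parameters from RGW1 to every other connected RGW. (Note: 10 Mbps can support 3814 CM connection establishments or 16890 read requests; RGW1 can adjust the network performance parameters for CF messages according to its locally collected real-time statistics on signalling interactions.) For example, the administrator may configure bandwidth and latency as the bandwidth and latency parameters respectively. In other implementations of this application, the network performance parameters may be default values set in advance according to certain rules. This application does not limit how the network performance parameters are configured, nor their specific types (for example, bandwidth or latency); users can decide which types of network resource they need. Finally, the end-side management system can, based on the preconfigured network performance parameters, submit to the network controller path calculation requests for CF messages between each pair of connected gateway devices (for example, RGW1→RGW2, RGW1→RGW3, and RGW1→RGW4). The network controller computes and stores, for each request, the path calculation result for CF messages between the corresponding pair of gateway devices. For example, suppose the RGW1→RGW2 result is denoted S1, the RGW1→RGW3 result S2, ..., and the RGW1→RGWn result Sn. When RGW1 has a CF message to send to RGW2, RGW1 sends a first path calculation request (which includes the destination gateway device RGW2) to the network controller through the end-side management system; based on this request, the network controller retrieves the precomputed RGW1→RGW2 result S1 and sends it to RGW1 through the end-side management system.
B. The network controller computes the RGW1→RGW2 path calculation result for CF messages the first time there is a CF message to send.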
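In this case, network initialization is similar to the above. The difference is that the network controller does not need to compute in advance the path calculation results for CF messages between each pair of connected gateway devices. Instead, the first time a gateway device (for example, RGW1) needs to send a CF message to another gateway device (for example, RGW2), RGW1 sends a first path calculation request, which may carry network performance requirements, directly to the network controller through the end-side management system, and the network controller computes the RGW1→RGW2 path calculation result on demand. If RGW1 later has further CF messages to send to RGW2, they are sent using the result already computed.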
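Options A and B differ only in when the controller computes the CF result for a gateway pair, which a small cache illustrates. This is a sketch under assumptions: the class `CfPathCache`, its methods, and the shape of a result are all hypothetical, and `compute` stands in for whatever path calculation the network controller actually performs.

```python
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source gateway, destination gateway)


class CfPathCache:
    """Stores one CF path calculation result per gateway pair (illustrative only)."""

    def __init__(self, compute: Callable[[str, str], dict]):
        self._compute = compute              # stand-in for the controller's path calculation
        self._results: Dict[Pair, dict] = {}

    def precompute(self, pairs: Iterable[Pair]) -> None:
        """Option A: compute results for all connected pairs at initialization."""
        for src, dst in pairs:
            self._results[(src, dst)] = self._compute(src, dst)

    def get(self, src: str, dst: str) -> dict:
        """Option B: compute on the first request, then reuse the stored result."""
        key = (src, dst)
        if key not in self._results:
            self._results[key] = self._compute(src, dst)
        return self._results[key]
```

With either option, every later CF message between the same pair of gateways is served from the stored result, matching the reuse described in the text.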
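II. When the RoCEv2 message is a DF message, the target request is the second path calculation request carrying network performance requirements
When the first gateway device determines that the RoCEv2 message is a DF message, it generates a second path calculation request from the DF message. Unlike the first path calculation request, the second path calculation request must carry network performance requirements; it requests the network controller to determine the path calculation result for the DF message based on those requirements. Likewise, in some implementations of this application, the second path calculation request generated by the first gateway device may be sent to the network controller through the end-side management system.
Note that, in the embodiments of this application, the second path calculation request must carry network performance requirements because the length of a DF message is decided by the upper-layer RDMA application and is generally not fixed. The second path calculation request therefore carries the requested network performance requirements (for example, how much bandwidth is being requested) and asks the network controller to determine the path calculation result from them. Because each DF message may carry a message of a different size (that is, a payload field of a different size), the network controller must compute path calculation results separately for different DF messages, on the basis of the network performance requirements carried in the second path calculation request.
Note further that, in some implementations of this application, the way the network performance requirements are determined differs for DF messages triggered by different operations, as described below:
A. The DF message is triggered by a write primitive operation or a read primitive operation.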
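When the DF message is triggered by a write primitive operation, the DMA length field in the RETH of the first packet indicates the size of the message to be transmitted by this operation. (One piece of RDMA data may be split into n packets; only the first packet carries the message size of the RDMA data, and the other packets of that RDMA data are sent over the same network path as the first packet.) Similarly, when the data message is triggered by a read primitive, the network performance requirements are carried in the read request message, and the requester triggers path calculation for the reverse forwarding path that starts at the responder. When the responder receives the read request and replies with a read response, it may reply with multiple response messages, all of which must be forwarded along the precomputed path. In these cases, the first gateway device can therefore determine the network performance requirements from the DMA length field in the RETH of the DF message and generate the second path calculation request carrying them.
B. The DF message is triggered by a send primitive operation.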
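When the DF message is triggered by a send primitive operation, the message size of the RDMA data to be transmitted is also known, but the message currently has no field that indicates it. In the embodiments of this application, the RoCEv2 protocol on the network card device is therefore optimized so that a DF message triggered by a send primitive operation can also carry a field indicating the message size of the RoCEv2 data to be sent by this operation; this can be achieved by reusing the RETH defined in the IB specification. Specifically, the RETH may be added before the payload field of the DF message by the first network card device (the network card device corresponding to the first gateway device), and its DMA length field represents the size of the RDMA data corresponding to the DF message. For example, the first network card device may set virtual addr (a field already defined in the RETH) to 0xFFFFFFFF (the value can be customized), set remote key (a field already defined in the RETH) to 0xFFFF (the value can be customized), and set DMA length (a field already defined in the RETH) to the message size of the RoCEv2 data to be sent by the DF message triggered by this send primitive operation. The first gateway device can then determine the network performance requirements from the DMA length field in the RETH of the DF message and generate the second path calculation request carrying them.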
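The RETH handling for send operations can be sketched as below. The 16-byte layout (64-bit virtual address, 32-bit remote key, 32-bit DMA length) follows the RETH as defined for InfiniBand, and the two placeholder values are the customizable examples given in the text. The function names and the dictionary shape of the request are hypothetical, and how a gateway converts a message size into a bandwidth figure is left open here because the source does not specify it.

```python
import struct

RETH_FORMAT = "!QII"  # virtual address (64 bits), remote key (32 bits), DMA length (32 bits)
RETH_LEN = struct.calcsize(RETH_FORMAT)  # 16 bytes

SEND_VIRTUAL_ADDR = 0xFFFFFFFF  # example placeholder from the text; customizable
SEND_REMOTE_KEY = 0xFFFF        # example placeholder from the text; customizable


def build_send_reth(message_size: int) -> bytes:
    """Build the RETH a network card could prepend to a send-triggered DF message."""
    return struct.pack(RETH_FORMAT, SEND_VIRTUAL_ADDR, SEND_REMOTE_KEY, message_size)


def dma_length(reth: bytes) -> int:
    """Read the DMA length field, i.e. the size of the RDMA data in bytes."""
    _virtual_addr, _remote_key, length = struct.unpack(RETH_FORMAT, reth[:RETH_LEN])
    return length


def second_path_request(src_gw: str, dst_gw: str, reth: bytes, latency_ms: float) -> dict:
    """Assemble a hypothetical second path calculation request."""
    return {
        "src": src_gw,
        "dst": dst_gw,
        "message_bytes": dma_length(reth),  # basis for the bandwidth requirement
        "latency_ms": latency_ms,           # configured latency parameter
    }
```

The same `dma_length` helper would serve write and read request messages, whose RETH already carries a meaningful DMA length.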
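Note that, in some implementations of this application, whether the DF message is triggered by a write or read primitive operation or by a send primitive operation, if one piece of RDMA data is split into n packets, the first packet carries the bandwidth, latency, and other network performance requirements of the whole piece of RDMA data. The subsequent packets carry no network performance requirements and are transmitted using the path calculation result of the first packet.
703. The network controller obtains a path calculation result based on the target request and determines a QoS policy based on the path calculation result, where the path calculation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 message.
After the first gateway device generates the target request from the RoCEv2 message and sends it to the network controller of the SRv6 network (for example, through the end-side management system), the network controller obtains a path calculation result based on the received target request and determines a QoS policy based on the path calculation result, for example as in the process of computing the path calculation result and determining the QoS policy for CF messages shown in Figure 5. The path calculation result includes the network path between the first gateway device and the second gateway device (also called the SRv6 forwarding path) and the network resources that the network controller reserves for the RoCEv2 message, such as network bandwidth and minimum network latency. Note that, in the embodiments of this application, the reserved network resources are those for the RoCEv2 message along the entire network path; for example, if the network path is R1→R2→R3, the reserved network resources are those of the whole R1→R2→R3 path.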
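Note that, because RDMA data transmission carried over the RoCEv2 protocol has a well-defined bandwidth requirement, when a RoCEv2 message is transmitted over the SRv6 network the controller can select the lowest-latency network path that satisfies the bandwidth requirement (that is, the reserved network resource is bandwidth alone), or select a network path that satisfies a combined "bandwidth + latency" requirement (that is, the reserved network resources are the two targets bandwidth and latency).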
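The two selection rules above can be sketched as a filter followed by a minimum. The candidate model (`CandidatePath` with a delay and a free-bandwidth figure), the function name, and in particular the fallback used when no path has enough bandwidth are assumptions for illustration; the source only states that a currently best path is chosen and that the gateway is told how much was actually allocated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CandidatePath:
    policy_id: str
    delay_ms: float
    free_bw_mbps: float


def select_path(paths: List[CandidatePath], need_bw_mbps: float,
                max_delay_ms: Optional[float] = None) -> Tuple[CandidatePath, float]:
    """Return (chosen path, bandwidth allocated in Mbit/s)."""
    if not paths:
        raise ValueError("no candidate paths")
    # Optional latency target: keep only paths within the delay budget.
    in_budget = [p for p in paths if max_delay_ms is None or p.delay_ms <= max_delay_ms]
    feasible = [p for p in in_budget if p.free_bw_mbps >= need_bw_mbps]
    if feasible:
        # Bandwidth is satisfied: pick the lowest-latency feasible path.
        return min(feasible, key=lambda p: p.delay_ms), need_bw_mbps
    # Assumed fallback: allocate what the roomiest path can offer, less than requested.
    best = max(in_budget or paths, key=lambda p: p.free_bw_mbps)
    return best, best.free_bw_mbps
```

When the allocated figure returned by the fallback is smaller than the request, the gateway would apply source rate limiting as described for step 705.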
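704. The first gateway device obtains the path calculation result and extends the RoCEv2 message according to it to obtain an extended RoCEv2 message. The extended RoCEv2 message carries a first identifier and a second identifier: the first identifier represents the network nodes on the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the QoS policy.
After calculating the path calculation result based on the target request, the network controller can deliver it to the first gateway device through the end-side management system. The first gateway device then extends the RoCEv2 message to be sent according to the path calculation result, obtaining an extended RoCEv2 message that carries at least a first identifier and a second identifier. The first identifier represents the network nodes on the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 message, so that the network controller can guarantee bandwidth based on the second identifier and the previously calculated QoS policy.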
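Before describing how the RoCEv2 message to be sent is extended, we first explain how the message headers change as a RoCEv2 message passes through each network node on the network path when the source network card (the first network card device) and the destination network card (the second network card device) communicate. Refer to Figure 9, which takes as an example the case in which the first and second network card devices are configured with private IPv4 addresses. Suppose the network path is R1→R3→R2, the first network card device is RNIC1, the second network card device is RNIC2, the first gateway device is RGW1, and the second gateway device is RGW2. The message headers change during transmission as shown in Figure 9: after receiving the RoCEv2 message from RNIC1 (the original RoCEv2 message described above), RGW1 first performs IPv4 over IPv6 encapsulation and then passes the encapsulated RoCEv2 message to the head network node R1 on the network path. Among the network nodes, the RoCEv2 message is routed by means of the SRH field, which is used for routing between network nodes.
For the outer IPv6 header encapsulated here, later steps encapsulate the extension option defined in this application, the RDMA Option TLV (whose format may be as shown in Figure 10), in the IPv6 Hop-by-Hop Option according to the path calculation result returned by the network controller in the SRv6 network. This allows the RoCEv2 message to carry the flow identifier of the RDMA data, RDMA Flow ID (the second identifier, used by the network controller to guarantee bandwidth based on the QoS policy), and the forwarding path identifier, SRv6 TE Policy ID (the first identifier). In other words, the first gateway device obtains the extended RoCEv2 message by modifying the IPv6 Hop-by-Hop Option of the outer IPv6 header of the RoCEv2 message according to the path calculation result.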
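Note that, in some implementations of this application, the RDMA Option TLV defined in this application may also carry an RDMA traffic type identifier, Flow Type, which indicates whether the current message is a CF message or a DF message, and a reserved field, reserved, which leaves room for further identifiers to be added later and improves flexibility.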
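To make the extension concrete, the sketch below encodes an RDMA Option TLV inside an IPv6 Hop-by-Hop Options header and parses it back. The real format is defined in Figure 10, which is not reproduced in the text, so everything about the layout here is an assumption: a 1-byte Flow Type, a 1-byte reserved field, a 4-byte SRv6 TE Policy ID, and a 4-byte RDMA Flow ID. The option type 0x1E is a placeholder and not an assigned value. Only the generic framing (Next Header, Hdr Ext Len, type-length-value options, Pad1/PadN padding to a multiple of 8 bytes) follows the IPv6 specification.

```python
import struct
from typing import Optional

OPT_TYPE_RDMA = 0x1E  # placeholder option type (assumption, not an assigned value)
OPT_PAD1 = 0x00
OPT_PADN = 0x01
FLOW_TYPE_CF = 0      # assumed encoding of the Flow Type field
FLOW_TYPE_DF = 1
RDMA_OPTION_DATA = "!BBII"  # assumed layout: flow type, reserved, policy ID, flow ID


def build_hop_by_hop(next_header: int, flow_type: int,
                     policy_id: int, flow_id: int) -> bytes:
    """Build a Hop-by-Hop Options header carrying one RDMA Option TLV."""
    data = struct.pack(RDMA_OPTION_DATA, flow_type, 0, policy_id, flow_id)
    options = struct.pack("!BB", OPT_TYPE_RDMA, len(data)) + data
    pad = -(2 + len(options)) % 8  # the whole header must be a multiple of 8 bytes
    if pad == 1:
        options += bytes([OPT_PAD1])
    elif pad > 1:
        options += struct.pack("!BB", OPT_PADN, pad - 2) + bytes(pad - 2)
    hdr_ext_len = (2 + len(options)) // 8 - 1  # 8-byte units beyond the first 8 bytes
    return struct.pack("!BB", next_header, hdr_ext_len) + options


def parse_rdma_option(header: bytes) -> Optional[dict]:
    """Walk the options and return the RDMA Option TLV fields, if present."""
    end = (header[1] + 1) * 8
    offset = 2
    while offset < end:
        opt_type = header[offset]
        if opt_type == OPT_PAD1:  # Pad1 has no length byte
            offset += 1
            continue
        opt_len = header[offset + 1]
        if opt_type == OPT_TYPE_RDMA:
            flow_type, _reserved, policy_id, flow_id = struct.unpack(
                RDMA_OPTION_DATA, header[offset + 2:offset + 2 + opt_len])
            return {"flow_type": flow_type, "policy_id": policy_id, "flow_id": flow_id}
        offset += 2 + opt_len
    return None
```

A head network node would run the equivalent of `parse_rdma_option`, steer the message by the policy ID, and match the flow ID against the QoS policy delivered earlier.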
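To make the above process easier to understand, specific examples of how the first gateway device optimizes the transmission of CF messages and DF messages are given below:
I. Transmission optimization of CF messages by the first gateway device
For a message classified as a CF message, the first gateway device modifies the extension option of the outer IPv6 header so that it carries the first identifier and the second identifier. After the message enters the SRv6 network, the network nodes can then steer the CF message onto the SRv6 forwarding path (the network path) pre-allocated by TE traffic engineering and strictly guarantee its bandwidth.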
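Specifically, RGW1 can, according to the CF path calculation result requested in advance in the embodiment corresponding to Figure 8, encapsulate the RDMA Option TLV in the IPv6 header (see the format definition shown in Figure 10) so that the IPv6 header carries the first identifier of the CF message (the SRv6 forwarding path identifier of the CF message, denoted policy-id1) and the second identifier (the CF flow identifier, denoted cf-flow-id), and forward the extended CF message to the head network node R1. On receiving the extended CF message, the head network node R1 parses the RDMA Option TLV, steers the traffic according to the SRv6 forwarding path identifier SRv6 TE Policy ID in the CF message (see Figure 10), and guarantees bandwidth by matching the flow identifier RDMA Flow ID of the RDMA data in the CF message against the pre-delivered QoS policy.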
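Note that, in some implementations of this application, read request and read response messages are handled somewhat specially. A read request message is a CF message, and the peer gateway device must reply with a read response message after receiving it (the read response is triggered by the read request), but a read response message is a DF message. The specific processing may be as follows. First, the first network card device RNIC1 sends a read request (a CF message, referred to as the request message) to the second network card device RNIC2. On receiving the request message, the first gateway device RGW1 buffers it and, using the DMA Length carried in the RETH of the request message as the bandwidth request and the configured latency parameter as the latency request, submits through the end-side management system an R2→R1 path calculation request to the network controller on behalf of the read response message that is about to be sent (the response to the request message, referred to as the response message). After completing the calculation, the network controller delivers the SRv6 forwarding path to the head network node R2, delivers the forwarding path identifier (denoted policy-id2) and the bandwidth actually reserved (denoted allocated-bw) to the end-side management system, and delivers a QoS policy containing (df-flow-id1, allocated-bw) to each network node on the SRv6 forwarding path. On receiving policy-id2 and allocated-bw (the path calculation result), the end-side management system synchronizes the R2→R1 path calculation result to RGW2; the result carries the allocated policy-id2 and the bandwidth allocated-bw reserved by this calculation. Note that if the bandwidth actually allocated is less than the bandwidth requested, RGW2 can record is_bw_satisfied as false locally, to be used later in deciding whether source rate limiting is needed. The end-side management system then notifies RGW1 that path calculation is complete. On receiving the notification, RGW1 sends the buffered request message following the general CF message processing flow. On receiving the request message, RNIC2 replies with the response message. When RGW2 receives the response message, it first performs IPv4 over IPv6 processing. Then, according to the DF forwarding path information requested on the trigger of the request message, it encapsulates the RDMA Option TLV in the IPv6 header so that the header carries the DF flow identifier df-flow-id1 and the DF forwarding path identifier policy-id2, and forwards the response message to the head network node R2. If the bandwidth actually reserved by the network controller is less than the bandwidth requested, RGW2 must also limit the rate at the source to the bandwidth actually reserved. On receiving the message, the head network node R2 parses the RDMA Option TLV, steers the traffic according to the SRv6 TE Policy ID in the response message, and guarantees bandwidth by matching the RDMA Flow ID in the response message against the pre-delivered QoS policy.
Note further that the first gateway device's transmission optimization of CF messages can be implemented by the CF message transmission optimization component; see Figure 11. The function by which the first gateway device, after identifying a RoCEv2 message, distinguishes CF messages from DF messages using the classification method provided by this application can be implemented by the RoCEv2 message classification component shown in Figure 11.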
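II. Processing by the first gateway device for DF packet transmission optimization
For a packet classified as a DF packet, the first gateway device triggers the network controller, according to the message size of the RDMA data to be transmitted, to complete TE-based path computation. It then modifies the extension option of the outer IPv6 header so that it carries the first identifier and the second identifier. After the packet enters the SRv6 network, network nodes can steer the DF packet onto the pre-allocated SRv6 forwarding path (that is, the network path) and enforce a strict bandwidth guarantee.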
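Specifically, for packets classified as DF packets, the first gateway device needs to buffer all the DF packets and submit a second path computation request to the network controller through the end-side management system; the request carries network performance requirements such as bandwidth and latency. After the network controller computes the path computation result based on the submitted second path computation request, the end-side management system forwards the result to the first gateway device, which fills in the IPv6 header extension option according to the result and sends the DF packets. If the bandwidth actually pre-allocated by the network controller does not satisfy the requested bandwidth, the first gateway device also needs to perform source-side rate limiting according to the actual bandwidth when sending the DF packets. Network nodes steer the DF packets onto the pre-computed SRv6 forwarding path according to the IPv6 header extension option in the packets and complete local QoS guarantee processing.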
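In particular, the transmission optimization procedure for DF packets triggered by the write/send primitive operations (referred to as write/send packets) is as follows. First, the first NIC device RNIC1 sends a write/send packet to the second NIC device RNIC2. After receiving it, the first gateway device RGW1 first buffers the write/send packet: if it is a write/send first packet, the subsequent write/send middle and write/send last packets also need to be buffered; if it is a write/send only packet, only the current packet is buffered. Using the DMA Length in the packet as the bandwidth request and the configured latency parameter as the latency request, RGW1 submits, through the end-side management system, an R1→R2 path computation request to the network controller for the write/send packets about to be sent. The network controller computes the path computation result according to the received request, delivers the SRv6 forwarding path to the head network node R1, delivers the forwarding path identifier policy-id3 and the actually reserved bandwidth allocated-bw to the end-side management system, and delivers a QoS policy containing (df-flow-id2, allocated-bw) to each network node on the SRv6 forwarding path. After receiving the path computation result, the end-side management system synchronizes the R1→R2 result to RGW1. The result carries the allocated policy-id3 and the bandwidth allocated-bw reserved by this computation. If the actually allocated bandwidth is less than the requested bandwidth, RGW1 needs to locally record is_bw_satisfied as false; this record later serves as the basis for deciding whether source-side rate limiting is needed. RGW1 then processes the buffered write/send packets in order (write/send first → write/send middle → write/send last, or a single write/send only). It first completes IPv4 over IPv6 processing and then, according to the DF packet forwarding path information requested on the trigger of the write/send request, encapsulates the RDMA Option TLV in the IPv6 header so that the header carries the DF packet flow identifier df-flow-id2 and the DF packet forwarding path identifier policy-id3, and forwards the packet to the head network node R1. If the bandwidth actually reserved by the network controller is less than the requested bandwidth, RGW1 needs to perform source-side rate limiting according to the actually reserved bandwidth. After receiving the packet, the head network node R1 parses the RDMA Option TLV, performs traffic steering according to the SRv6 TE Policy ID in the packet, and matches the RDMA Flow ID in the packet against the pre-delivered QoS policy to guarantee bandwidth.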
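It should also be noted that the DF packet transmission optimization function of the first gateway device may be implemented by a DF packet transmission optimization component; see FIG. 11.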
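The buffer-then-release behaviour described for write/send packets can be sketched as a small queue: packets are held until the path computation result arrives, and are then released in arrival order with both identifiers attached. This is only an illustrative sketch; the class name, the string labels for packet kinds, and the dictionary used to represent a tagged packet are hypothetical.

```python
from collections import deque


class DfBuffer:
    """Holds DF packets until the path computation result arrives."""

    def __init__(self):
        self._queue = deque()
        self.path_requested = False

    def add(self, kind: str, payload: bytes) -> bool:
        """Buffer a packet; return True if a path request should be submitted now.

        kind is one of "first", "middle", "last", "only" (assumed labels).
        """
        self._queue.append((kind, payload))
        if kind in ("first", "only") and not self.path_requested:
            self.path_requested = True
            return True
        return False

    def flush(self, flow_id: int, policy_id: int) -> list:
        """Release buffered packets in arrival order, tagged with both identifiers."""
        out = []
        while self._queue:
            kind, payload = self._queue.popleft()
            out.append({"kind": kind, "flow_id": flow_id,
                        "policy_id": policy_id, "payload": payload})
        self.path_requested = False
        return out
```

For example, adding a first, a middle, and a last packet triggers exactly one path request, and a later `flush(flow_id=2, policy_id=3)` returns the three packets in that same order.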
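705. The first gateway device sends the extended RoCEv2 packet to the second gateway device through the network path.
After obtaining the extended RoCEv2 packet, the first gateway device can send it to the peer second gateway device through the computed network path (that is, the SRv6 forwarding path described above).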
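Note that, in some implementations of this application, for all RoCEv2 packets, in addition to performing deterministic transmission according to the CF/DF packet transmission optimization mechanism described above, the first gateway device may also proactively duplicate the packet and modify the extension option of the outer IPv6 header of the duplicate so that it carries an SRv6 forwarding path identifier and the CF packet flow identifier. After the duplicate enters the SRv6 network, network nodes can steer it onto a backup SRv6 forwarding path, so that the peer second gateway device performs dual-send selective-receive processing. This traffic redundancy avoids packet loss.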
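Specifically, the process may be as follows. First, the first gateway device obtains a backup path, referred to as the first backup path, which may be generated by the network controller upon receiving the target request. The first gateway device then duplicates the RoCEv2 packet to obtain a first duplicate packet, and extends the first duplicate packet according to the first backup path in a manner similar to the process above, obtaining an extended first duplicate packet. The extended first duplicate packet still needs to carry an SRv6 forwarding path identifier, referred to here as the third identifier, which represents the network nodes in the first backup path. Finally, the first gateway device sends the extended first duplicate packet to the second gateway device through the first backup path, so that the second gateway device performs dual-send selective-receive processing on the extended RoCEv2 packet and the extended first duplicate packet. In other words, in addition to performing transmission optimization on RoCEv2 packets according to the preceding steps, the source-side RDMA gateway makes an additional copy of every RoCEv2 packet, completes IPv4 over IPv6 processing, and then encapsulates the RDMA Option TLV in the IPv6 header so that the header carries the backup forwarding policy pre-allocated by the network controller (in this embodiment the max availability path may be used, and the SRv6 network is not required to guarantee bandwidth when forwarding backup traffic), and forwards the duplicated RoCEv2 packet to the head network node R1. The peer RDMA gateway receives two copies of the RoCEv2 packet, one from the computed designated path and one from the backup path, and performs dual-send selective-receive processing. This may work as follows: the copy received first is kept and the copy received later is discarded. For example, if the extended RoCEv2 packet reaches the second gateway device first, it is kept and the later-arriving extended first duplicate packet is discarded, and vice versa; details are not repeated here. By discarding the later-received copy through the dual-send selective-receive capability, this application gives RoCEv2 resistance to packet loss.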
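It should also be noted that the packet loss resistance function of the first gateway device may be implemented by a RoCEv2 packet loss resistance component; see FIG. 11.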
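The dual-send selective-receive rule described above (keep whichever copy arrives first, drop the other) can be sketched as a duplicate filter at the receiving gateway. The text does not specify how the two copies are matched; keying on a flow identifier plus a packet sequence number is an assumption made here for illustration, and the class and method names are hypothetical.

```python
class SelectiveReceiver:
    """Keeps the first-arriving copy of each packet and drops the later copy."""

    def __init__(self):
        # Keys already delivered. A real implementation would age entries out;
        # this sketch keeps them indefinitely for simplicity.
        self._seen = set()

    def accept(self, flow_id: int, seq: int) -> bool:
        """Return True to deliver the packet, False to discard it as a duplicate."""
        key = (flow_id, seq)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Whether the first copy came over the primary path or the backup path does not matter to the filter, which matches the "and vice versa" case in the description.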
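It should also be noted that other implementations of this application provide an alternative scheme for RoCEv2 packet loss resistance. The main difference is that, in this embodiment, packets are not duplicated at the RDMA gateway; instead, the head network node of the SRv6 network duplicates the RoCEv2 packets. Specifically, the process may be as follows. When the extended RoCEv2 packet is sent to the head network node of the network path (also called the source network node, that is, the network node in the network path that is directly connected to the first gateway device and is the first network node along the data transmission direction), the head network node is triggered to duplicate the extended RoCEv2 packet, obtaining a second duplicate packet. The second duplicate packet is sent over a backup path prepared in advance (referred to as the second backup path). Finally, the tail network node of the network path (also called the destination network node, that is, the network node in the network path that is directly connected to the second gateway device and is the last network node along the data transmission direction) performs dual-send selective-receive processing on the extended RoCEv2 packet and the second duplicate packet. The second backup path may also be generated by the network controller upon receiving the target request. In other words, after receiving a RoCEv2 packet, the head network node of the SRv6 network may first determine, from the Flow Type field in the RDMA Option TLV of the IPv6 header, whether the current packet is RoCEv2 traffic. If it is, the head network node duplicates the packet and forwards the copy along the max availability path (that is, the second backup path). After receiving both copies of the RoCEv2 packet, the tail network node discards the later-received copy through the dual-send selective-receive capability, which gives RoCEv2 resistance to packet loss during wide-area or metropolitan-area transmission.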
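Compared with the process in which the first gateway device duplicates packets to resist packet loss, in this embodiment the gateway devices do not need to implement dual-send selective-receive processing; that capability is provided by the network nodes in the SRv6 network, which reduces the forwarding load on the gateway devices.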
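It should also be noted that, in some implementations of this application, when the reserved network resources do not satisfy the network performance requirements of the extended RoCEv2 packet, the first gateway device may also perform source-side rate limiting through a flow control mechanism. That is, when the network transmission bandwidth cannot meet the requirement, the best SRv6 forwarding path available under the current conditions can still be selected, and the first gateway device is informed of the bandwidth allocated for this transmission, so that it sends at a reduced rate at the source through the flow control mechanism, which reduces congestion and avoids packet loss.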
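The description leaves the flow control mechanism used for source-side rate limiting unspecified. One common way to pace a sender to an allocated bandwidth is a token bucket; the sketch below uses that technique as a stand-in, not as the mechanism of this application. The class name, the byte-per-second unit, and the burst parameter are assumptions.

```python
class TokenBucket:
    """Paces sending to roughly rate_bytes_per_s, allowing bursts up to burst_bytes."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # e.g. derived from allocated-bw
        self.capacity = burst_bytes
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # time of the previous call, in seconds

    def try_send(self, size: int, now: float) -> bool:
        """Return True if a packet of `size` bytes may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                      # caller keeps the packet buffered and retries
```

In this sketch a gateway that recorded is_bw_satisfied as false would create the bucket from the allocated bandwidth and consult `try_send` before forwarding each buffered DF packet.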
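It should be noted that the basic logic of the foregoing embodiments relies on the fact that the RoCEv2 protocol can accurately know the size of the RDMA data the application layer is about to transmit. Before traffic is transmitted, precise network performance requirements are put to the underlying bearer network, so that path planning and network resource reservation are done in advance and the traffic is transmitted deterministically across the bearer network. On this basis, this application designs the coordination between RoCEv2 packets and the SRv6 network. In other implementations of this application, the same idea may also be applied to, for example, flex ethernet (FlexE) hard-slicing networks and time sensitive networks (TSN): coordinating RoCEv2 packets with the FlexE or TSN protocol achieves the same goal of optimizing wide-area or metropolitan-area transmission of RDMA data. The specific process is similar to the above and is not repeated here.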
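On the basis of the foregoing embodiments, to better implement the foregoing solutions of the embodiments of this application, related devices for implementing the solutions are provided below. Referring to FIG. 12, FIG. 12 is a schematic structural diagram of the first gateway device provided in an embodiment of this application. The first gateway device 1200 may include an obtaining module 1201, a generating module 1202, an extending module 1203, and a sending module 1204. The obtaining module 1201 is configured to obtain a RoCEv2 packet. The generating module 1202 is configured to generate a target request according to the RoCEv2 packet and send the target request to the network controller of the SRv6 network, so that the network controller obtains a path computation result based on the target request and determines a QoS policy based on the path computation result; the path computation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 packet, and the first gateway device and the second gateway device perform data transmission over the SRv6 network. The extending module 1203 is configured to obtain the path computation result and extend the RoCEv2 packet according to it to obtain an extended RoCEv2 packet; the extended RoCEv2 packet carries a first identifier and a second identifier, the first identifier represents the network nodes in the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 packet, so that the network controller guarantees bandwidth based on the second identifier and the QoS policy. The sending module 1204 is configured to send the extended RoCEv2 packet to the second gateway device through the network path.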
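In a possible design, the generating module 1202 is specifically configured to: when determining that the RoCEv2 packet is a control packet that does not carry a payload field, generate a first path computation request according to the control packet, where the first path computation request triggers the network controller to determine a path computation result for the control packet.
In a possible design, the generating module 1202 is specifically configured to: when determining that the RoCEv2 packet is a data packet that carries a payload field, generate, according to the data packet, a second path computation request carrying network performance requirements, where the second path computation request requests the network controller to determine a path computation result for the data packet based on the network performance requirements.
In a possible design, the generating module 1202 is further specifically configured to: when the data packet is triggered by a send primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data packet, where the RETH header is added by the first NIC device in front of the payload field of the data packet, the DMA length field in the RETH header indicates the size of the RDMA data corresponding to the data packet, and the first NIC device corresponds to the first gateway device; and generate the second path computation request carrying the network performance requirements.
In a possible design, the generating module 1202 is further specifically configured to: when the data packet is triggered by a write primitive operation or a read primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data packet; and generate the second path computation request carrying the network performance requirements.
In a possible design, the extending module 1203 is specifically configured to: modify, according to the path computation result, the IPv6 Hop-by-Hop Option of the IPv6 header of the RoCEv2 packet to obtain the extended RoCEv2 packet.
In a possible design, the first gateway device 1200 further includes a backup module 1205, configured to: obtain a first backup path, where the first backup path is generated by the network controller upon receiving the target request; duplicate the RoCEv2 packet to obtain a first duplicate packet; extend the first duplicate packet according to the first backup path to obtain an extended first duplicate packet, where the extended first duplicate packet carries a third identifier that represents the network nodes in the first backup path; and send the extended first duplicate packet to the second gateway device through the first backup path, so that the second gateway device performs dual-send selective-receive processing on the extended RoCEv2 packet and the extended first duplicate packet.
In a possible design, the first gateway device 1200 further includes a backup module 1205, configured to: when the extended RoCEv2 packet is sent to the head network node of the network path, trigger the head network node to duplicate the extended RoCEv2 packet to obtain a second duplicate packet, where the second duplicate packet is sent over a second backup path, so that the tail network node of the network path performs dual-send selective-receive processing on the extended RoCEv2 packet and the second duplicate packet, and the second backup path is generated by the network controller upon receiving the target request.
In a possible design, the sending module 1204 is specifically configured to: when the reserved network resources do not satisfy the network performance requirements of the extended RoCEv2 packet, perform source-side rate limiting through a flow control mechanism, and send the extended RoCEv2 packet to the second gateway device through the network path.
In a possible design, the first NIC device corresponding to the first gateway device and the second NIC device corresponding to the second gateway device are deployed with private IPv4 addresses, and the obtaining module 1201 is specifically configured to: obtain an original RoCEv2 packet to be sent to the second gateway device; and perform IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 packet to obtain the RoCEv2 packet.
In a possible design, the obtaining module 1201 is specifically configured to: receive a target packet sent by the first NIC device, where the first NIC device corresponds to the first gateway device; and determine, according to the UDP destination port number in the target packet, that the target packet is the RoCEv2 packet.
In a possible design, the reserved network resources include at least one of the following: network bandwidth and minimum network latency.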
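It should be noted that the information exchange between the modules/units of the first gateway device 1200, their execution processes, and related content are based on the same concept as the method embodiment corresponding to FIG. 7 of this application. For details, see the description in the foregoing method embodiments of this application; they are not repeated here.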
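Two of the designs above lend themselves to a short illustration: recognizing a RoCEv2 packet by its UDP destination port, and then separating control (CF) packets, which carry no payload field, from data (DF) packets, which do. The sketch below assumes the caller has already parsed the UDP destination port and the payload length; the function name and the returned labels are hypothetical. Port 4791 is the IANA-assigned UDP destination port for RoCEv2.

```python
ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def classify(udp_dst_port: int, payload_len: int) -> str:
    """Return "other" for non-RoCEv2 traffic, else "CF" or "DF".

    payload_len is the length of the packet's payload field, as already
    extracted by the caller (an assumption of this sketch).
    """
    if udp_dst_port != ROCEV2_UDP_PORT:
        return "other"
    # Control packets carry no payload field; data packets do.
    return "DF" if payload_len > 0 else "CF"
```

Under this rule a read request, which carries no payload, is classified as CF, while a write or send packet with data is classified as DF, consistent with the method embodiment.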
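An embodiment of this application further provides a first gateway device. Referring to FIG. 13, FIG. 13 is another schematic structural diagram of the first gateway device provided in an embodiment of this application. The modules of the first gateway device 1200 described in the embodiment corresponding to FIG. 12 may be deployed on the first gateway device 1300 to implement the functions of the first gateway device 1200. Specifically, the first gateway device 1300 is implemented by one or more servers and may vary considerably with configuration or performance. It may include one or more central processing units (CPUs) 1322, a memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) storing application programs 1342 or data 1344. The memory 1332 and the storage medium 1330 may provide transient or persistent storage. A program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the first gateway device 1300. Further, the CPU 1322 may be configured to communicate with the storage medium 1330 and execute, on the first gateway device 1300, the series of instruction operations in the storage medium 1330.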
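The first gateway device 1300 may further include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.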
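In this embodiment of this application, the CPU 1322 is configured to perform the SRv6-based data transmission method performed by the first gateway device in the embodiment corresponding to FIG. 7. For example, the first NIC device first generates a RoCEv2 packet based on the RDMA data to be sent, and the CPU 1322 of the source-side gateway device corresponding to the first NIC device (referred to as the first gateway device) receives the RoCEv2 packet from the first NIC device. The RoCEv2 packet is to be sent by the first gateway device to the destination-side gateway device (referred to as the second gateway device); note that the first gateway device and the second gateway device perform data transmission over the SRv6 network. After obtaining the RoCEv2 packet, the CPU 1322 generates a corresponding request according to the packet, referred to as the target request, and sends the target request to the network controller in the SRv6 network, so that the network controller obtains a path computation result based on the target request and determines a QoS policy based on the path computation result. The path computation result includes the network path between the first gateway device and the second gateway device and the network resources reserved for the RoCEv2 packet. After computing the result based on the target request, the network controller may deliver it to the CPU 1322 through the end-side management system. After obtaining the path computation result, the CPU 1322 extends the RoCEv2 packet to be sent according to the result, obtaining an extended RoCEv2 packet that carries at least a first identifier and a second identifier. The first identifier represents the network nodes in the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 packet, so that the network controller guarantees bandwidth based on the second identifier and the previously computed QoS policy. The CPU 1322 can then send the extended RoCEv2 packet to the peer second gateway device through the computed network path.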
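It should be noted that the specific manner in which the CPU 1322 performs the foregoing steps is based on the same concept as the method embodiment corresponding to FIG. 7 of this application and brings the same technical effects as the foregoing embodiments. For details, see the description in the foregoing method embodiments of this application; they are not repeated here.
An embodiment of this application further provides a computer-readable storage medium that stores a program for signal processing. When the program runs on a computer, the computer performs the steps described in the embodiment shown in FIG. 7.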
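The first gateway device and the like provided in the embodiments of this application may specifically include a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the first gateway device performs the steps described in the embodiment shown in FIG. 7.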
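Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache. The storage unit may also be a storage unit located outside the chip within the first gateway device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).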
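It should further be noted that the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.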
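From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is in most cases the better choice. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a network device, or the like) to perform the methods described in the embodiments of this application.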
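The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented wholly or partly in the form of a computer program product.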
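The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are wholly or partly produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).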
Claims (28)
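- 1. A data transmission method, comprising: obtaining, by a first gateway device, a RoCEv2 packet; generating, by the first gateway device, a target request according to the RoCEv2 packet, and sending the target request to a network controller of an SRv6 network, so that the network controller obtains a path computation result based on the target request and determines a QoS policy based on the path computation result, wherein the path computation result comprises a network path between the first gateway device and a second gateway device and network resources reserved for the RoCEv2 packet, and the first gateway device and the second gateway device perform data transmission over the SRv6 network; obtaining, by the first gateway device, the path computation result, and extending the RoCEv2 packet according to the path computation result to obtain an extended RoCEv2 packet, wherein the extended RoCEv2 packet carries a first identifier and a second identifier, the first identifier represents the network nodes in the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 packet, so that the network controller guarantees bandwidth based on the second identifier and the QoS policy; and sending, by the first gateway device, the extended RoCEv2 packet to the second gateway device through the network path.
- 2. The method according to claim 1, wherein generating, by the first gateway device, the target request according to the RoCEv2 packet comprises: when the first gateway device determines that the RoCEv2 packet is a control packet that does not carry a payload field, generating, by the first gateway device, a first path computation request according to the control packet, wherein the first path computation request triggers the network controller to determine a path computation result for the control packet.
- 3. The method according to claim 1, wherein generating, by the first gateway device, the target request according to the RoCEv2 packet comprises: when the first gateway device determines that the RoCEv2 packet is a data packet that carries a payload field, generating, by the first gateway device according to the data packet, a second path computation request carrying network performance requirements, wherein the second path computation request requests the network controller to determine a path computation result for the data packet based on the network performance requirements.
- 4. The method according to claim 3, wherein generating, by the first gateway device according to the data packet, the second path computation request carrying network performance requirements comprises: when the data packet is triggered by a send primitive operation, determining, by the first gateway device, the network performance requirements according to the DMA length field in the RETH header of the data packet, wherein the RETH header is added by a first NIC device in front of the payload field of the data packet, the DMA length field in the RETH header indicates the size of the RDMA data corresponding to the data packet, and the first NIC device corresponds to the first gateway device; and generating, by the first gateway device, the second path computation request carrying the network performance requirements.
- 5. The method according to claim 3, wherein generating, by the first gateway device according to the data packet, the second path computation request carrying network performance requirements comprises: when the data packet is triggered by a write primitive operation or a read primitive operation, determining, by the first gateway device, the network performance requirements according to the DMA length field in the RETH header of the data packet; and generating, by the first gateway device, the second path computation request carrying the network performance requirements.
- 6. The method according to any one of claims 1 to 5, wherein extending the RoCEv2 packet according to the path computation result to obtain the extended RoCEv2 packet comprises: modifying, by the first gateway device according to the path computation result, the IPv6 Hop-by-Hop Option of the IPv6 header of the RoCEv2 packet to obtain the extended RoCEv2 packet.
- 7. The method according to any one of claims 1 to 6, further comprising: obtaining, by the first gateway device, a first backup path, wherein the first backup path is generated by the network controller upon receiving the target request; duplicating, by the first gateway device, the RoCEv2 packet to obtain a first duplicate packet; extending, by the first gateway device, the first duplicate packet according to the first backup path to obtain an extended first duplicate packet, wherein the extended first duplicate packet carries a third identifier that represents the network nodes in the first backup path; and sending, by the first gateway device, the extended first duplicate packet to the second gateway device through the first backup path, so that the second gateway device performs dual-send selective-receive processing on the extended RoCEv2 packet and the extended first duplicate packet.
- 8. The method according to any one of claims 1 to 6, further comprising: when the extended RoCEv2 packet is sent to the head network node of the network path, triggering the head network node to duplicate the extended RoCEv2 packet to obtain a second duplicate packet, wherein the second duplicate packet is sent over a second backup path, so that the tail network node of the network path performs dual-send selective-receive processing on the extended RoCEv2 packet and the second duplicate packet, and the second backup path is generated by the network controller upon receiving the target request.
- 9. The method according to any one of claims 1 to 8, wherein sending, by the first gateway device, the extended RoCEv2 packet to the second gateway device through the network path comprises: when the reserved network resources do not satisfy the network performance requirements of the extended RoCEv2 packet, performing, by the first gateway device, source-side rate limiting through a flow control mechanism, and sending the extended RoCEv2 packet to the second gateway device through the network path.
- 10. The method according to any one of claims 1 to 9, wherein a first NIC device corresponding to the first gateway device and a second NIC device corresponding to the second gateway device are deployed with private IPv4 addresses, and obtaining, by the first gateway device, the RoCEv2 packet to be sent to the second gateway device comprises: obtaining, by the first gateway device, an original RoCEv2 packet to be sent to the second gateway device; and performing, by the first gateway device, IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 packet to obtain the RoCEv2 packet.
- 11. The method according to any one of claims 1 to 9, wherein obtaining, by the first gateway device, the RoCEv2 packet comprises: receiving, by the first gateway device, a target packet sent by a first NIC device, wherein the first NIC device corresponds to the first gateway device; and determining, by the first gateway device according to the UDP destination port number in the target packet, that the target packet is the RoCEv2 packet.
- 12. The method according to any one of claims 1 to 11, wherein the reserved network resources comprise at least one of the following: network bandwidth and minimum network latency.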
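- 13. A first gateway device, comprising: an obtaining module, configured to obtain a RoCEv2 packet; a generating module, configured to generate a target request according to the RoCEv2 packet and send the target request to a network controller of an SRv6 network, so that the network controller obtains a path computation result based on the target request and determines a QoS policy based on the path computation result, wherein the path computation result comprises a network path between the first gateway device and a second gateway device and network resources reserved for the RoCEv2 packet, and the first gateway device and the second gateway device perform data transmission over the SRv6 network; an extending module, configured to obtain the path computation result and extend the RoCEv2 packet according to the path computation result to obtain an extended RoCEv2 packet, wherein the extended RoCEv2 packet carries a first identifier and a second identifier, the first identifier represents the network nodes in the network path, and the second identifier represents the RDMA data corresponding to the RoCEv2 packet, so that the network controller guarantees bandwidth based on the second identifier and the QoS policy; and a sending module, configured to send the extended RoCEv2 packet to the second gateway device through the network path.
- 14. The device according to claim 13, wherein the generating module is specifically configured to: when determining that the RoCEv2 packet is a control packet that does not carry a payload field, generate a first path computation request according to the control packet, wherein the first path computation request triggers the network controller to determine a path computation result for the control packet.
- 15. The device according to claim 13, wherein the generating module is specifically configured to: when determining that the RoCEv2 packet is a data packet that carries a payload field, generate, according to the data packet, a second path computation request carrying network performance requirements, wherein the second path computation request requests the network controller to determine a path computation result for the data packet based on the network performance requirements.
- 16. The device according to claim 15, wherein the generating module is further specifically configured to: when the data packet is triggered by a send primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data packet, wherein the RETH header is added by a first NIC device in front of the payload field of the data packet, the DMA length field in the RETH header indicates the size of the RDMA data corresponding to the data packet, and the first NIC device corresponds to the first gateway device; and generate the second path computation request carrying the network performance requirements.
- 17. The device according to claim 15, wherein the generating module is further specifically configured to: when the data packet is triggered by a write primitive operation or a read primitive operation, determine the network performance requirements according to the DMA length field in the RETH header of the data packet; and generate the second path computation request carrying the network performance requirements.
- 18. The device according to any one of claims 13 to 17, wherein the extending module is specifically configured to: modify, according to the path computation result, the IPv6 Hop-by-Hop Option of the IPv6 header of the RoCEv2 packet to obtain the extended RoCEv2 packet.
- 19. The device according to any one of claims 13 to 18, further comprising a backup module, configured to: obtain a first backup path, wherein the first backup path is generated by the network controller upon receiving the target request; duplicate the RoCEv2 packet to obtain a first duplicate packet; extend the first duplicate packet according to the first backup path to obtain an extended first duplicate packet, wherein the extended first duplicate packet carries a third identifier that represents the network nodes in the first backup path; and send the extended first duplicate packet to the second gateway device through the first backup path, so that the second gateway device performs dual-send selective-receive processing on the extended RoCEv2 packet and the extended first duplicate packet.
- 20. The device according to any one of claims 13 to 18, further comprising a backup module, configured to: when the extended RoCEv2 packet is sent to the head network node of the network path, trigger the head network node to duplicate the extended RoCEv2 packet to obtain a second duplicate packet, wherein the second duplicate packet is sent over a second backup path, so that the tail network node of the network path performs dual-send selective-receive processing on the extended RoCEv2 packet and the second duplicate packet, and the second backup path is generated by the network controller upon receiving the target request.
- 21. The device according to any one of claims 13 to 20, wherein the sending module is specifically configured to: when the reserved network resources do not satisfy the network performance requirements of the extended RoCEv2 packet, perform source-side rate limiting through a flow control mechanism, and send the extended RoCEv2 packet to the second gateway device through the network path.
- 22. The device according to any one of claims 13 to 21, wherein a first NIC device corresponding to the first gateway device and a second NIC device corresponding to the second gateway device are deployed with private IPv4 addresses, and the obtaining module is specifically configured to: obtain an original RoCEv2 packet to be sent to the second gateway device; and perform IPv4 over IPv6 tunnel encapsulation on the original RoCEv2 packet to obtain the RoCEv2 packet.
- 23. The device according to any one of claims 13 to 21, wherein the obtaining module is specifically configured to: receive a target packet sent by a first NIC device, wherein the first NIC device corresponds to the first gateway device; and determine, according to the UDP destination port number in the target packet, that the target packet is the RoCEv2 packet.
- 24. The device according to any one of claims 13 to 23, wherein the reserved network resources comprise at least one of the following: network bandwidth and minimum network latency.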
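- 25. A gateway device, comprising a processor and a memory, the processor being coupled to the memory, wherein the memory is configured to store a program, and the processor is configured to execute the program in the memory, so that the gateway device performs the method according to any one of claims 1 to 12.
- 26. A computer-readable storage medium, comprising a program that, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 12.
- 27. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 12.
- 28. A chip, comprising a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory and performs the method according to any one of claims 1 to 12.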
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210750558.6 | 2022-06-29 | ||
CN202210750558.6A CN117376144A (zh) | 2022-06-29 | 2022-06-29 | Data transmission method and gateway device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024001820A1 (zh) | 2024-01-04 |
Family
ID=89383216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/100640 WO2024001820A1 (zh) | 2022-06-29 | 2023-06-16 | Data transmission method and gateway device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117376144A (zh) |
WO (1) | WO2024001820A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118118296A (zh) * | 2024-03-01 | 2024-05-31 | 深圳市渊信科技有限公司 | Intelligent gateway |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190342199A1 (en) * | 2019-02-08 | 2019-11-07 | Intel Corporation | Managing congestion in a network |
WO2020192397A1 (zh) * | 2019-03-28 | 2020-10-01 | 华为技术有限公司 | 一种发送设备的调整方法和通信装置 |
WO2022022229A1 (zh) * | 2020-07-28 | 2022-02-03 | 华为技术有限公司 | 一种处理报文的方法及装置 |
WO2022052882A1 (zh) * | 2020-09-14 | 2022-03-17 | 华为技术有限公司 | 数据传输方法和装置 |
CN114629836A (zh) * | 2020-11-27 | 2022-06-14 | 华为技术有限公司 | 一种基于分段路由的数据传输方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN117376144A (zh) | 2024-01-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23830000 Country of ref document: EP Kind code of ref document: A1 |