WO2023109891A1 - Multicast transmission method, apparatus and system - Google Patents

Multicast transmission method, apparatus and system

Info

Publication number
WO2023109891A1
Authority
WO
WIPO (PCT)
Prior art keywords
multicast
data
request
data packet
information
Prior art date
Application number
PCT/CN2022/139219
Other languages
French (fr)
Chinese (zh)
Inventor
古列维奇埃琳娜
吉辛维克多
曲会春
沙列夫拉维夫
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023109891A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B7/00 Radio transmission systems, i.e. using radiation field
    • H04B7/14 Relay systems
    • H04B7/15 Active relay systems
    • H04B7/185 Space-based or airborne stations; Stations for satellite systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/54 Store-and-forward switching systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/90 Buffering arrangements
    • H04L49/901 Buffering arrangements using storage descriptor, e.g. read or write pointers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5069 Address allocation for group communication, multicast communication or broadcast communication

Abstract

The present application provides a multicast transmission method, apparatus and system. The method is applied to a first computing device, the first computing device comprising a first input/output (IO) device. The method comprises: the first IO device acquires a first request; the first IO device generates a first data packet according to first information and the first request, the first information comprising multicast information, the multicast information being used for identifying the link connections between multicast members of a multicast group and the first computing device, and the first data packet carrying the multicast information and data to be written corresponding to the first request; the first IO device sends the first data packet to the multicast members of the multicast group. The method can achieve reliable and efficient multicast data transmission.

Description

Multicast transmission method, apparatus and system
This application claims priority to Chinese Patent Application No. 202111556558.4, filed with the China Patent Office on December 17, 2021 and entitled "Multicast transmission method, apparatus and system", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of network communication technologies, and more specifically, to a multicast transmission method, apparatus and system.
Background
Multicast is a technology for network communication between a single sender and multiple receivers. Multicast uses a multicast group address as the destination address of a data packet, establishes a multicast tree, and uses the multicast tree to implement point-to-multipoint (P2MP) data forwarding, which helps reduce bandwidth consumption and improve quality of service. It is especially well suited to wide use in computing clusters.
A conventional multicast transmission method is implemented by the host of a computing device. In this implementation, the host must replicate the data packets to be sent, and the operating system of the host must also participate, so the method suffers from high complexity, long latency, and low transmission efficiency.
Therefore, a multicast transmission method that achieves efficient multicast data transmission is urgently needed.
Summary
This application provides a multicast transmission method, apparatus and system. The method can achieve reliable and efficient multicast data transmission.
According to a first aspect, a multicast transmission method is provided. The method is applied to a first computing device that includes a first input/output (IO) device. The method includes: the first IO device obtains a first request; the first IO device generates a first data packet according to first information and the first request, where the first information includes multicast information, the multicast information identifies the link connections between the multicast members of a multicast group and the first computing device, and the first data packet carries the multicast information and the data to be written that corresponds to the first request; and the first IO device sends the first data packet to the multicast members of the multicast group.
A multicast group may include at least two multicast members. Because the multicast information identifies the link connections between the multicast members of the multicast group and the first computing device, it identifies at least two link connections. The data to be written may be all of the data corresponding to the first request, in which case its volume equals that of the data corresponding to the first request. Optionally, the data to be written may be only a part of the data corresponding to the first request, in which case its volume is smaller than that of the data corresponding to the first request.
In the foregoing technical solution, the first IO device encapsulates data according to the first request and the first information to generate the first data packet and sends it to the multicast members of the multicast group; the processor of the first host does not perform the encapsulation according to a work request. This avoids the high complexity, long latency, and low transmission efficiency of conventional multicast transmission methods. Because the first data packet sent by the first IO device carries multicast information that identifies the link connections between the multicast members of the multicast group and the first computing device, reliable transmission of multicast data can be ensured. The multicast transmission method provided in this application can therefore achieve reliable and efficient multicast data transmission.
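As an illustration only, the packet-generation step of the first aspect can be sketched in Python as follows. The names WriteRequest, MulticastPacket, and build_packets, and the max_payload parameter, are hypothetical and are not defined in this application; the actual packet format is not specified here. The sketch covers both cases described above: a packet carrying all of the data of the request, or only a part of it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WriteRequest:
    """Hypothetical model of the first request: the data the application wants written."""
    payload: bytes


@dataclass(frozen=True)
class MulticastPacket:
    """Hypothetical model of the first data packet."""
    multicast_id: int  # multicast information identifying the group's link connections
    payload: bytes     # data to be written (all or part of the request's data)


def build_packets(request: WriteRequest, multicast_id: int, max_payload: int) -> List[MulticastPacket]:
    """Split the request's data into packets that each carry the multicast information."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    data = request.payload
    return [
        MulticastPacket(multicast_id, data[i:i + max_payload])
        for i in range(0, len(data), max_payload)
    ]


# An 8-byte request split into payloads of at most 3 bytes.
packets = build_packets(WriteRequest(b"abcdefgh"), multicast_id=7, max_payload=3)
```

In this sketch the IO device, not the host processor, performs the split and attaches the multicast information, which mirrors the division of work described above.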
In a possible design, the multicast information includes a multicast identifier, and the multicast identifier is obtained when the first computing device establishes link connections with the multicast group.
The multicast group may include at least two multicast members (denoted multicast member 1 and multicast member 2), and the multicast identifier may identify at least two link connections (denoted link 1 and link 2): link 1 is associated with multicast member 1 and the first computing device, and link 2 is associated with multicast member 2 and the first computing device.
In the foregoing technical solution, the multicast identifier included in the multicast information identifies the link connections between the multicast members of the multicast group and the first computing device.
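The relationship between one multicast identifier and several link connections can be pictured with a small lookup table. The following Python sketch is an assumption for illustration: the class MulticastTable, its methods, and the way identifiers are allocated are hypothetical and are not taken from this application.

```python
from typing import Dict, Sequence, Tuple


class MulticastTable:
    """Hypothetical table mapping a multicast identifier to the group's link connections."""

    def __init__(self) -> None:
        self._links: Dict[int, Tuple[str, ...]] = {}
        self._next_id = 1  # assumed allocation scheme: sequential identifiers

    def establish(self, member_links: Sequence[str]) -> int:
        """Record the links set up with a group and return the identifier obtained for them."""
        if len(member_links) < 2:
            raise ValueError("a multicast group has at least two members")
        multicast_id = self._next_id
        self._next_id += 1
        self._links[multicast_id] = tuple(member_links)
        return multicast_id

    def links_for(self, multicast_id: int) -> Tuple[str, ...]:
        """Return every link connection identified by the multicast identifier."""
        return self._links[multicast_id]


table = MulticastTable()
group_id = table.establish(["link1", "link2"])  # link 1 -> member 1, link 2 -> member 2
```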
In another possible design, the first data packet further includes a port number, and the port number indicates that the first data packet is transmitted in multicast mode.
Optionally, the port number may be the destination port number of the first data packet.
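A receiver or forwarding device could use the destination port number to tell the transmission mode apart. The sketch below assumes two reserved port values; MULTICAST_PORT and UNICAST_PORT are made-up numbers chosen for illustration, since this application does not state which port values are used.

```python
MULTICAST_PORT = 50001  # hypothetical value marking multicast transmission
UNICAST_PORT = 50000    # hypothetical value marking unicast transmission


def transmission_mode(destination_port: int) -> str:
    """Classify a packet by its destination port number."""
    if destination_port == MULTICAST_PORT:
        return "multicast"
    if destination_port == UNICAST_PORT:
        return "unicast"
    raise ValueError(f"port {destination_port} is not reserved in this sketch")
```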
In another possible design, the first computing device further includes a first host, the first IO device communicates with the first host through an IO network, and the first request is sent by an application program running on a processor of the first host.
In another possible design, the first information further includes first window information, and the first window information indicates the minimum amount of data that the multicast members of the multicast group can process within a first time period.
In the foregoing technical solution, when generating the first data packet, the first IO device takes into account the minimum amount of data that the multicast members of the multicast group can process within the first time period. By controlling the amount of data carried in the first data packet, network flow control can be implemented.
Optionally, the first IO device may also update the window information. In an example, the first IO device may further update second window information to the first window information, where the second window information is the minimum amount of data that the multicast members of the multicast group can process within a time period preceding the first time period. The minimum amount of data indicated by the second window information differs from the minimum amount indicated by the first window information. In this implementation, because the first IO device updates the window information promptly within a preset time period, the amount of data carried in the first data packet can be controlled accurately, which implements network flow control.
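One way to read the window information is as the smallest capacity advertised by any multicast member, so that the slowest member bounds what the sender puts in a packet. The following Python sketch rests on that assumption; the function names and the per-member capacity dictionary are hypothetical and do not appear in this application.

```python
from typing import Dict


def group_window(member_capacity: Dict[str, int]) -> int:
    """Return the minimum amount of data any member can process in the current period."""
    if not member_capacity:
        raise ValueError("the multicast group has no members")
    return min(member_capacity.values())


def payload_for_next_packet(pending: bytes, window: int) -> bytes:
    """Limit the data carried in the next packet to the group window."""
    return pending[:window]


# Previous period (second window information in the text above).
second_window = group_window({"member1": 4096, "member2": 1024})
# Current period: member2 can now process more, so the window is updated.
first_window = group_window({"member1": 4096, "member2": 2048})
chunk = payload_for_next_packet(b"x" * 3000, first_window)
```

Taking the minimum keeps the sketch consistent with the statement that the window indicates the minimum amount of data the members can process; other aggregation rules are not described in the text.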
In another possible design, the transmission protocol used by the first IO device to send the first data packet is TCP/IP.
The foregoing technical solution provides a TCP/IP-based multicast transmission method that avoids the high complexity, long latency, and low transmission efficiency of conventional multicast transmission methods and can achieve reliable and efficient multicast data transmission.
In another possible design, the data to be written is a part of the data corresponding to the first request, and the first information further includes indication information that indicates that the data to be written is to be encapsulated. Before the first IO device generates the first data packet according to the first information and the first request, the method further includes: the first IO device sends an IO write command to the multicast members of the multicast group, where the IO write command instructs that the data corresponding to the first request be stored in a first registered memory region (MR), the data corresponding to the first request is located in a memory of the first host included in the first computing device, the first MR is a storage area in a memory of a second IO device to which a storage area in a memory of a second host is registered, a second computing device includes the second host and the second IO device, and the second computing device is a multicast member of the multicast group; and the first IO device receives the indication information sent by the multicast members of the multicast group. The amount of the data to be written is smaller than the amount of the data corresponding to the first request.
In the foregoing technical solution, the first IO device can proactively request to write data into the storage areas of the multicast members of the multicast group, encapsulate the first data packet according to the indication information received from the multicast members, and send the first data packet to the multicast members. This effectively reduces the probability of congestion on the multicast member side and better meets the needs of the multicast members of the multicast group.
In another possible design, the IO write command includes a second key value and second location information, where the second key value identifies the second MR, and the second location information indicates the location of the data to be written in the second MR.
In another possible design, the first information further includes a credit value that indicates the minimum number of requests that the multicast members of the multicast group can process within a second time period, and the base transport header (BTH) of the first data packet carries the credit value.
In the foregoing technical solution, after the second IO device sends the credit value to the first IO device, the first IO device, when generating the first data packet, takes into account the minimum number of requests that the multicast members of the multicast group can process within the first time period, which facilitates network flow control.
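Credit-based flow control of this kind can be sketched as a counter that the sender decrements for every request it issues and refreshes when members report new credit values. The class CreditGate and its methods are hypothetical; how the credit value is encoded in the header is not modeled, and taking the minimum across members is an assumption based on the description of the credit value as a minimum number of requests.

```python
from typing import Iterable


class CreditGate:
    """Hypothetical sender-side gate that limits outstanding requests by group credit."""

    def __init__(self, credits: int = 0) -> None:
        if credits < 0:
            raise ValueError("credits cannot be negative")
        self._credits = credits

    def update(self, member_credits: Iterable[int]) -> None:
        """Adopt the smallest credit value reported by the multicast members."""
        self._credits = min(member_credits)

    def try_send(self) -> bool:
        """Consume one credit if available; return False when the sender must wait."""
        if self._credits == 0:
            return False
        self._credits -= 1
        return True


gate = CreditGate()
gate.update([3, 2, 5])  # the slowest member can accept two requests
results = [gate.try_send() for _ in range(3)]
```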
In another possible design, that the first IO device obtains the first request includes: the first IO device receives the first request sent by a multicast member of the multicast group, where the first request instructs that the data to be written, which is located in a second MR, be stored in a storage area of the multicast member of the multicast group; the second MR is a storage area in a memory of the first IO device to which a storage area in a memory of the first host is registered; and the first computing device further includes the first host.
In the foregoing technical solution, the first IO device sends, in response to the first request of a multicast member of the multicast group, the first data packet to the multicast members of the multicast group. The first data packet carries the data to be written corresponding to the first request and the multicast information of the multicast group, so reliable and efficient multicast data transmission can be achieved.
In another possible design, the first request includes a first key value, first location information, and a preset field, where the first key value identifies the second MR, the first location information indicates the location of the data to be written in the second MR, and the value of the preset field indicates the first request.
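The role of the key value and the location information can be illustrated with a registry of memory regions that is looked up by key and then read at a given position. In the following Python sketch, the class MemoryRegions is hypothetical, and treating the location information as an (offset, length) pair is an assumption; this application does not specify its encoding.

```python
from typing import Dict


class MemoryRegions:
    """Hypothetical registry of memory regions (MRs) addressed by key value."""

    def __init__(self) -> None:
        self._regions: Dict[int, bytes] = {}
        self._next_key = 1  # assumed key allocation: sequential integers

    def register(self, buffer: bytes) -> int:
        """Register a host buffer and return the key value that identifies the MR."""
        key = self._next_key
        self._next_key += 1
        self._regions[key] = bytes(buffer)
        return key

    def read(self, key: int, offset: int, length: int) -> bytes:
        """Return the data found at the requested location inside the identified MR."""
        region = self._regions[key]
        if offset < 0 or length < 0 or offset + length > len(region):
            raise ValueError("location lies outside the memory region")
        return region[offset:offset + length]


regions = MemoryRegions()
key = regions.register(b"0123456789")
data_to_write = regions.read(key, offset=2, length=4)
```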
In another possible design, the transmission protocol used by the first IO device to send the first data packet is Ethernet-based remote direct memory access (RDMA).
The foregoing technical solution provides an RDMA-based multicast transmission method that can achieve reliable and efficient multicast data transmission.
In another possible design, that the first IO device sends the first data packet to the multicast members of the multicast group includes: the first IO device sends the first data packet to a forwarding device, where the forwarding device replicates the first data packet and forwards the replicated first data packets to the multicast members of the multicast group, and the link connections between the multicast members of the multicast group and the first computing device include the forwarding device.
In the foregoing technical solution, the first IO device does not replicate the first data packet, which helps reduce the resource overhead of the first IO device.
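The replication step performed by the forwarding device can be sketched as a fan-out over the links associated with the multicast identifier. The function replicate and the group_links table below are hypothetical names used for illustration; a real forwarding device would perform this in its data plane.

```python
from typing import Dict, List


def replicate(packet: bytes, multicast_id: int, group_links: Dict[int, List[str]]) -> Dict[str, bytes]:
    """Return one copy of the packet per link associated with the multicast identifier."""
    return {link: packet for link in group_links[multicast_id]}


# The sender transmits the packet once; the forwarding device fans it out.
group_links = {7: ["link1", "link2"]}
copies = replicate(b"payload", 7, group_links)
```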
In another possible design, the multicast members of the multicast group include a second computing device that is different from the first computing device. After the first IO device sends the first data packet to the multicast members of the multicast group, the method further includes: the first IO device receives a second request sent by the second computing device, where the second request requests the data to be written that corresponds to the first request and is carried in the first data packet; and the first IO device sends a second data packet to the second computing device, where the second data packet carries the data to be written, and a port number included in the second data packet indicates that the second data packet is transmitted in unicast mode.
In the foregoing technical solution, the first IO device sends the second data packet to the second computing device according to the second request, so that a multicast member that did not successfully receive the first data packet receives the second data packet. This ensures reliable transmission of multicast data.
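The recovery path can be sketched as a sender that remembers what it multicast and answers a second request with a unicast copy addressed to the requesting member only. Everything named below is an assumption for illustration: the Sender class, the use of a sequence number to identify the lost packet, and the two port values are not specified in this application.

```python
from typing import Dict

MULTICAST_PORT = 50001  # hypothetical port value marking multicast transmission
UNICAST_PORT = 50000    # hypothetical port value marking unicast transmission


class Sender:
    """Hypothetical first IO device that can resend lost data by unicast."""

    def __init__(self) -> None:
        self._sent: Dict[int, bytes] = {}  # sequence number -> data already multicast

    def send_multicast(self, seq: int, payload: bytes) -> dict:
        """Build the first data packet and remember its payload for later recovery."""
        self._sent[seq] = payload
        return {"port": MULTICAST_PORT, "seq": seq, "payload": payload}

    def handle_second_request(self, seq: int, member: str) -> dict:
        """Build the second data packet: the same data, unicast to the requesting member."""
        return {"port": UNICAST_PORT, "dst": member, "seq": seq, "payload": self._sent[seq]}


sender = Sender()
first_packet = sender.send_multicast(seq=1, payload=b"data")
second_packet = sender.handle_second_request(seq=1, member="member2")
```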
In another possible design, the multicast members of the multicast group include only a second computing device and a third computing device, and the first, second, and third computing devices are all different from one another. The method further includes: after the first IO device receives a first completion message and a second completion message, the first IO device sends a third completion message to the processor of the first host, where the third completion message indicates that the first request has been executed successfully, the first completion message indicates that the second computing device has executed the first request successfully, and the second completion message indicates that the third computing device has executed the first request successfully.
In the foregoing technical solution, the first IO device reports the processing result of the first request to the processor of the first host only after receiving the completion messages sent by all multicast members of the multicast group (that is, the first completion message and the second completion message), rather than reporting a result each time a completion message from one multicast member is received. This increases the IO processing rate and achieves efficient multicast data transmission.
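The completion handling described above amounts to waiting for every member before notifying the host once. A minimal Python sketch, assuming a hypothetical CompletionAggregator class that tracks which members are still outstanding for one request:

```python
from typing import Iterable, Set


class CompletionAggregator:
    """Hypothetical tracker that reports to the host only after all members complete."""

    def __init__(self, members: Iterable[str]) -> None:
        self._pending: Set[str] = set(members)

    def on_member_completion(self, member: str) -> bool:
        """Record one member's completion; return True when the host should be notified."""
        self._pending.discard(member)
        return not self._pending


aggregator = CompletionAggregator(["second_device", "third_device"])
notify_after_first = aggregator.on_member_completion("second_device")
notify_after_second = aggregator.on_member_completion("third_device")
```

Here the True result of the second call corresponds to sending the third completion message to the processor of the first host.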
The first IO device described in the first aspect includes at least one of a network interface controller, a smart network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processor, an image processor, an artificial intelligence device, or a software-defined infrastructure. The IO network described in the first aspect includes any one of the high-speed serial computer expansion bus standard Peripheral Component Interconnect Express (PCIe), Compute Express Link (CXL), Cache Coherent Interconnect for Accelerators (CCIX), or a unified bus (Ubus).
According to a second aspect, a multicast transmission method is provided. The method is applied to a second computing device that is a multicast member of a multicast group. The second computing device includes a second input/output (IO) device and a second host, and the second IO device communicates with the second host through an IO network. The method includes: the second IO device receives a first data packet sent by a first IO device, where the first data packet carries multicast information and the data to be written that corresponds to a first request, the multicast information identifies the link connections between the multicast members of the multicast group and a first computing device, the first computing device includes the first IO device, and the first computing device is different from the second computing device; and the second IO device stores, according to the first data packet, the data to be written in a memory of the second host.
In the foregoing technical solution, the second IO device receives the first data packet sent by the first IO device, and the first data packet carries the multicast information and the data to be written corresponding to the first request, so reliable and efficient multicast data transmission can be achieved.
In a possible implementation, the method further includes: the second IO device sends a second request to the first IO device, where the second request requests the data to be written that corresponds to the first request and is carried in the first data packet.
In the foregoing technical solution, when the second IO device fails to receive the first data packet, it proactively sends the second request to the first IO device to obtain the data that was not received, so reliable data transmission can be achieved.
The second IO device described in the second aspect includes at least one of a network interface controller, a smart network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processor, an image processor, an artificial intelligence device, or a software-defined infrastructure. The IO network described in the second aspect includes any one of the high-speed serial computer expansion bus standard Peripheral Component Interconnect Express (PCIe), the memory interconnect Compute Express Link (CXL), or a unified bus (Ubus).
According to a third aspect, a multicast transmission apparatus is provided. The apparatus includes a transceiver unit and a processing unit. The transceiver unit is configured to obtain a first request. The processing unit is configured to generate a first data packet according to first information and the first request, where the first information includes multicast information, the multicast information identifies the link connections between the multicast members of a multicast group and the multicast transmission apparatus, and the first data packet carries the multicast information and the data to be written that corresponds to the first request. The transceiver unit is further configured to send the first data packet to the multicast members of the multicast group.
In a possible design, the multicast information includes a multicast identifier, and the multicast identifier is obtained when the multicast group establishes link connections with the multicast transmission apparatus.
In another possible design, the first data packet further includes a port number, and the port number indicates that the first data packet is transmitted in multicast mode.
In another possible design, the first information further includes first window information, and the first window information indicates the minimum amount of data that the multicast members of the multicast group can process within a first time period.
Optionally, the processing unit is further configured to update the window information. In an example, the processing unit is further configured to update second window information to the first window information, where the second window information is the minimum amount of data that the multicast members of the multicast group can process within a time period preceding the first time period. The minimum amount of data indicated by the second window information differs from the minimum amount indicated by the first window information.
According to a fourth aspect, a multicast transmission apparatus is provided. The apparatus is applied to a second computing device that is a multicast member of a multicast group. The second computing device includes a second input/output (IO) device and a second host, and the second IO device communicates with the second host through an IO network. The apparatus includes a transceiver unit and a processing unit. The transceiver unit is configured to receive a first data packet sent by a first IO device, where the first data packet carries multicast information and the data to be written that corresponds to a first request, the multicast information identifies the link connections between the multicast members of the multicast group and a first computing device, the first computing device includes the first IO device, and the first computing device is different from the second computing device. The processing unit is configured to store, according to the first data packet, the data to be written in a memory of the second host.
In a possible design, the transceiver unit is further configured to send a second request to the first IO device, where the second request requests the data to be written that corresponds to the first request and is carried in the first data packet.
According to a fifth aspect, a first input/output (IO) device is provided. The first IO device has the functions of the multicast transmission apparatus described in the third aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions.
In a possible implementation, the structure of the first IO device includes a processor, and the processor is configured to support the first IO device in performing the corresponding functions in the foregoing method.
The first IO device may further include a memory that is coupled to the processor and stores the program instructions and data necessary for the first IO device.
在另一种可能的实现方式中,该第一IO设备包括:处理器、发送器、接收器、随机存取存储器、只读存储器以及总线。其中,处理器通过总线分别耦接发送器、接收器、随机存取存储器以及只读存储器。其中,当需要运行第一IO设备时,通过固化在只读存储器中的基本输入/输出系统或者嵌入式系统中的bootloader引导系统进行启动,引导第一IO设备进入正常运行状态。在第一IO设备进入正常运行状态后,在随机存取存储器中运行应用程序和操作系统,使得该处理器执行第一方面或第一方面的任意可能的实现方式中的方法。In another possible implementation manner, the first IO device includes: a processor, a transmitter, a receiver, a random access memory, a read only memory, and a bus. Wherein, the processor is respectively coupled to the transmitter, the receiver, the random access memory and the read-only memory through the bus. Wherein, when the first IO device needs to be operated, the basic input/output system solidified in the read-only memory or the bootloader boot system in the embedded system is started to guide the first IO device into a normal operation state. After the first IO device enters the normal running state, run the application program and the operating system in the random access memory, so that the processor executes the method in the first aspect or any possible implementation manner of the first aspect.
In a sixth aspect, a second input/output (IO) device is provided. The second IO device has the functions of the multicast transmission apparatus described in the fourth aspect. These functions may be implemented in hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions.
In a possible implementation, the second IO device supports performing the corresponding functions of the foregoing method.
The second IO device may further include a memory that is coupled to the processor and stores the program instructions and data necessary for the second IO device.
In another possible implementation, the second IO device includes a transmitter, a receiver, a random access memory, a read-only memory, and a bus. The processor is coupled to the transmitter, the receiver, the random access memory, and the read-only memory through the bus. When the second IO device needs to run, it is started by a basic input/output system fixed in the read-only memory or by a bootloader in an embedded system, which brings the second IO device into a normal running state. After the second IO device enters the normal running state, the application program and the operating system run in the random access memory, so that the processor performs the method of the second aspect or of any possible implementation of the second aspect.
In a seventh aspect, a computer program product is provided. The computer program product includes computer program code that, when run on a computer, causes the computer to perform the method of the first aspect or the second aspect, or of any possible implementation of the first aspect or the second aspect.
In an eighth aspect, a computer-readable medium is provided. The computer-readable medium stores program code that, when run on a computer, causes the computer to perform the method of the first aspect or the second aspect, or of any possible implementation of the first aspect or the second aspect. The computer-readable storage includes, but is not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard disk drives.
In a ninth aspect, a chip system is provided. The chip system includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the first aspect or the second aspect, or of any possible implementation of the first aspect or the second aspect. In a specific implementation, the chip system may take the form of a central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), a system on chip (SoC), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a programmable logic device (PLD).
In a tenth aspect, a multicast transmission system is provided. The system includes the multicast transmission apparatus described in the third aspect and the multicast transmission apparatus described in the fourth aspect.
The implementations provided in the foregoing aspects of this application may be further combined to provide additional implementations.
Description of the Drawings
FIG. 1 is a schematic diagram of a computing cluster to which the embodiments of this application are applicable.
FIG. 2 is a schematic diagram of a computing device included in the computing cluster of FIG. 1.
FIG. 3 is a schematic flowchart of a multicast transmission method 300 provided by an embodiment of this application.
FIG. 4 is a schematic diagram of a multicast transmission scenario provided by an embodiment of this application.
FIG. 5 is a schematic flowchart of a multicast transmission method 500 provided by an embodiment of this application.
FIG. 6 is a schematic diagram of the format of a data packet transmitted over TCP/IP, provided by an embodiment of this application.
FIG. 7 is a schematic diagram of a window provided by an embodiment of this application.
FIG. 8 is a schematic flowchart of a multicast transmission method 800 provided by an embodiment of this application.
FIG. 9 is a schematic diagram of the format of data packets and messages transmitted over the RoCE protocol, provided by an embodiment of this application.
FIG. 10 is a schematic flowchart of a multicast transmission method 1000 provided by an embodiment of this application.
FIG. 11 is a schematic flowchart of a multicast transmission method 1100 provided by an embodiment of this application.
FIG. 12 is a schematic structural diagram of a multicast transmission apparatus 1200 provided by an embodiment of this application.
FIG. 13 is a schematic diagram of the hardware structure of a multicast transmission apparatus 1300 provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
This application presents various aspects, embodiments, and features in terms of a system that includes multiple devices, components, modules, and the like. It should be understood that each system may include additional devices, components, modules, and the like, and/or may not include all of the devices, components, modules, and the like discussed with reference to the drawings. Combinations of these solutions may also be used.
The network architectures and service scenarios described in the embodiments of this application are intended to explain the technical solutions more clearly and do not limit them. A person of ordinary skill in the art will appreciate that, as network architectures evolve and new service scenarios emerge, the technical solutions provided by the embodiments of this application apply equally to similar technical problems.
The chip referred to in the embodiments of this application may be a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), an application processor (AP), or another integrated chip.
For ease of understanding, the terms and concepts that the embodiments of this application may involve are introduced first.
1. Multicast
Multicast is a one-to-many communication mode between hosts. It allows one or more multicast sources to send the same packet to multiple receivers; that is, multicast implements point-to-multipoint (P2MP) packet forwarding. A multicast source sends a single copy of a packet to a specific multicast address. Unlike a unicast address, a multicast address does not belong to one particular host but to a group of hosts. One multicast address represents one group, and every receiver that needs to receive the multicast packets joins that group.
2. Unicast
Unicast is a one-to-one communication mode between hosts. Devices in the network select a transmission path according to the destination address contained in a packet and deliver the unicast packet to the specified destination. They only forward the received data and do not replicate it. Unicast allows a timely response to each individual host.
3. Protocol stack
A protocol stack is the sum of the protocols at all layers of a network. It reflects the process of data transmission in a network: from the upper-layer protocols down to the lower-layer protocols, and then from the lower-layer protocols back up to the upper-layer protocols. Put simply, a protocol stack (for example, but not limited to, a transmission control protocol/internet protocol (TCP/IP) stack) is an implementation of a protocol (for example, but not limited to, TCP/IP).
4. Remote direct memory access (RDMA)
RDMA transfers the contents of a data buffer directly between two nodes over the network. The local node can send data over the network straight into the memory of the remote node, bypassing the multiple memory copies that would otherwise take place inside the operating system. Compared with conventional network transmission, RDMA requires no involvement from the operating system or the TCP/IP protocol stack, so it can deliver very low latency and very high throughput. It also does not consume resources such as the remote node's CPU, which avoids spending excessive resources on processing and moving data.
RDMA works as follows:
1) When an application issues an RDMA read or write request, the request is sent from the application running in user space to the local network interface controller (NIC, also called a network card) without involving any kernel memory.
2) The local NIC reads the contents of the buffer and sends them over the network to the remote NIC.
3) The RDMA information transmitted on the network contains the target virtual address, the memory key, and the data itself. Request completion can be handled either entirely in user space (by polling the user-level completion queue) or through kernel memory when the application sleeps until the request completes. RDMA operations allow an application to read data from, or write data to, the memory of a remote application.
4) The target NIC verifies the memory key and writes the data directly into the application buffer. The remote virtual memory address used for the operation is included in the RDMA information.
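The target-side behavior in steps 3) and 4) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class names `MemoryRegion`, `RdmaWrite`, and `TargetNic` are hypothetical, the "virtual address" is simplified to an offset into the registered buffer, and no real RDMA library is involved.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A registered application buffer on the target node (hypothetical model)."""
    key: int
    buffer: bytearray


@dataclass
class RdmaWrite:
    """What travels on the wire: target address, memory key, and the data."""
    remote_addr: int  # simplified here to an offset inside the region
    memory_key: int
    payload: bytes


@dataclass
class TargetNic:
    """Toy model of the target NIC; it never calls into host software."""
    regions: dict = field(default_factory=dict)  # memory key -> MemoryRegion

    def register(self, region: MemoryRegion) -> None:
        self.regions[region.key] = region

    def handle(self, msg: RdmaWrite) -> bool:
        # Step 4: verify the memory key before touching any memory.
        region = self.regions.get(msg.memory_key)
        if region is None:
            return False
        end = msg.remote_addr + len(msg.payload)
        # Reject writes that would fall outside the registered buffer.
        if msg.remote_addr < 0 or end > len(region.buffer):
            return False
        # Write the data directly into the application buffer.
        region.buffer[msg.remote_addr:end] = msg.payload
        return True
```

A write with an unknown key or an out-of-range address is rejected, which mirrors the key check the target NIC performs before placing data in application memory.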
5. RDMA over converged Ethernet (RoCE)
RoCE supports the use of RDMA over standard Ethernet infrastructure. In RoCE, the RDMA-aware network interface controller (RNIC) offloads the entire protocol stack (that is, the user datagram protocol (UDP)) to the ASIC chip of the RNIC. Offloading means that the protocol stack processing on the host side is handed over from the host CPU to the network card. Data is moved from the user buffer on the host to the network card buffer by direct memory access (DMA), and the network card then sends the data to the peer over UDP. The peer network card receives the data and moves it by DMA directly into the user buffer. The whole process involves neither the CPU nor memory copies, which reduces the TCP/IP processing burden on the CPU and the server I/O system and removes the network bottleneck of the server.
6. Direct memory access (DMA)
DMA is a high-speed data transfer operation that allows data to be read and written directly between an IO device and memory, without passing through the CPU and without CPU intervention. In other words, DMA is an interface technology by which an IO device exchanges data with system memory directly, bypassing the CPU. DMA is therefore a working mode in which the I/O exchange is performed entirely by hardware. RDMA, by contrast, transfers data over the network directly into the memory of a computer, that is, it moves data quickly from one system into the memory of a remote system. RDMA is thus a working mode in which software and hardware cooperate to perform the I/O exchange.
The technical solutions of this application are described in detail below.
A computing cluster (cluster for short) is a computing system in which a group of computing devices are connected and cooperate closely to complete computing work. A single computing device in a computing cluster may be called a node. FIG. 1 is a schematic diagram of a computing cluster to which the embodiments of this application are applicable. As shown in FIG. 1, the computing cluster 100 includes, but is not limited to, multiple computing devices. FIG. 1 uses a computing cluster 100 with six computing devices as an example: computing device 111, computing device 112, computing device 113, computing device 114, computing device 115, and computing device 116. Any two computing devices (for example, computing device 111 and computing device 115) can communicate through the network 110.
The computing devices shown in FIG. 1 are described below with reference to FIG. 2. The computing device 200 shown in FIG. 2 may be any one of computing device 111 to computing device 116 shown in FIG. 1.
As shown in FIG. 2, the computing device 200 includes a host 210, an input/output (IO) interconnect channel 220, and an IO device 230. The host 210 can be connected to the IO device 230 through the IO interconnect channel 220.
The host 210 may be the computing core and control core. It sends pending requests to the IO device 230 and receives the processing results of those requests from the IO device 230. The host 210 includes a first processor 211 and a first memory 212. The first processor 211 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The first processor 211 may also be a system on chip (SoC) or an embedded processor. The first processor 211 processes instructions, performs operations, and processes data. It can allocate independent memory resources to multiple processes and thereby run multiple processes. The address space addressable by the first processor 211 includes the first memory 212. The first memory 212 may be implemented by random access memory (RAM), a hard disk (for example, a solid state disk (SSD)), or another storage medium, and can be used to store the program code of multiple processes.
The IO interconnect channel 220 is the interconnection mechanism between the host 210 and the IO device 230, for example, peripheral component interconnect express (PCIe), compute express link (CXL), cache coherent interconnect for accelerators (CCIX), or unified bus (UB or Ubus).
The IO device 230 is hardware that can exchange data with the host 210 and/or other computing devices. For example, the IO device 230 receives a pending request sent by the host 210 and processes it with a protocol stack in order to execute the request. The IO device 230 also sends the processing result of the pending request to the host 210. In the embodiments of this application, the IO device 230 has network capability and can implement the functions of a protocol stack. Saying that the IO device 230 has protocol stack (for example, TCP/IP protocol stack) functionality means that it can perform all of the protocol processing needed to transmit a pending request over the network, using the protocols included in the stack (for example, TCP/IP, UDP, and Ethernet). The IO device 230 may be at least one of a network interface controller (NIC), a smart NIC, an RDMA-aware network interface controller (RNIC), a host bus adapter (HBA), a host channel adapter (HCA), an accelerator, a data processing unit (DPU), a graphics processing unit (GPU), an artificial intelligence (AI) device, a software defined infrastructure (SDI) device, and the like. The IO device 230 may include a second processor 231 and a second memory 232. The second memory 232 may be implemented by random access memory (RAM), a hard disk (for example, an SSD), or another storage medium.
The multicast transmission method provided by this application is described in detail below with reference to FIG. 3.
FIG. 3 is a schematic flowchart of a multicast transmission method 300 provided by an embodiment of this application. As shown in FIG. 3, the method 300 includes steps 310 to 330. The following description takes the case in which the method 300 is applied to a first computing device as an example. The first computing device includes a first input/output (IO) device and a first host, the first host includes a processor and a memory, and the first IO device communicates with the first host through an IO network. For example, the first computing device in the method 300 may be any computing device in the computing cluster shown in FIG. 1, and the multicast group in the method 300 may include at least two multicast members. Taking FIG. 1 as an example, when the first computing device is computing device 111, the at least two multicast members may be computing device 112 and computing device 116. The structure of the first computing device and of any multicast member of the multicast group may be as shown in FIG. 2. Steps 310 to 330 are described in detail below.
Step 310: the first IO device obtains a first request.
Step 320: the first IO device generates a first data packet according to first information and the first request. The first information includes multicast information, which identifies the link connections between the multicast members of the multicast group and the first computing device. The first data packet carries the multicast information and the to-be-written data corresponding to the first request.
The multicast information includes a multicast identifier, which is obtained when the first computing device establishes a link connection with the multicast group. In the embodiments of this application, a multicast group may include at least two multicast members, that is, two or more members. Because the multicast information identifies the link connections between the multicast members of the multicast group and the first computing device, it identifies at least two link connections.
The first data packet further includes a port number, which indicates that the first data packet is transmitted by multicast. Optionally, the port number is the destination port number of the first data packet; that is, the value of the destination port number indicates that the first data packet is transmitted by multicast.
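As an illustration of a packet that carries a multicast identifier and signals multicast transmission through its destination port number, the following sketch packs and parses a toy header. The field widths, the header layout, and the port value `MULTICAST_DST_PORT` are assumptions made for this example only; the actual packet formats are those shown in the application's figures.

```python
import struct

# Hypothetical destination port value that marks a packet as multicast.
MULTICAST_DST_PORT = 50000

# Toy header: 16-bit destination port, 32-bit multicast identifier,
# both in network byte order. This layout is an assumption.
HEADER = struct.Struct("!HI")


def build_first_packet(multicast_id: int, data: bytes) -> bytes:
    """Prefix the to-be-written data with the port and multicast identifier."""
    return HEADER.pack(MULTICAST_DST_PORT, multicast_id) + data


def parse_packet(packet: bytes):
    """Return (is_multicast, multicast_id, payload) for a received packet."""
    dst_port, multicast_id = HEADER.unpack_from(packet)
    is_multicast = dst_port == MULTICAST_DST_PORT
    return is_multicast, multicast_id, packet[HEADER.size:]
```

A receiver in this sketch decides how to treat the packet purely from the destination port value, then uses the multicast identifier to look up the associated link connections.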
The embodiments of this application do not specifically limit which entity sends the first request. In one example, the first request is sent by an application program running on the processor of the first host; for ease of description, this is referred to below as scenario 1. In scenario 1 the first request is a write request, which may be used to request that data stored in the memory of the first host be stored (written) into the memory of the host corresponding to each multicast member of the multicast group. In another example, the first request is sent by an application program running on the processor of a second host, where a second computing device includes the second host and a second IO device and is a multicast member of the multicast group; for ease of description, this is referred to below as scenario 2. In scenario 2 the first request is a read request, which may be used to request that data stored in the memory of the first host be stored (read) into the memory of the host corresponding to each multicast member of the multicast group. Scenario 1 and scenario 2 are described in detail below.
Scenario 1:
In scenario 1, the first request is sent by an application program running on the processor of the first host. Depending on the transport protocol, scenario 1 can be further divided into two implementations, referred to below as implementation 1 and implementation 2 for ease of description. Both are described in detail below.
Implementation 1:
In implementation 1, the transport protocol that the first IO device uses to send the first data packet is TCP/IP.
Optionally, the first information may further include first window information, which indicates the minimum amount of data that the multicast members of the multicast group can process within a first time period.
Optionally, the first IO device may also update the window information. In one example, the first IO device may further update second window information to the first window information, where the second window information is the minimum amount of data that the multicast members of the multicast group could process in the time period immediately preceding the first time period. The minimum amount of data indicated by the second window information differs from the minimum amount indicated by the first window information. In this implementation, having the first IO device update the window information promptly within a preset time period allows the amount of data carried in the first data packet to be controlled accurately, which provides flow control for the network.
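One way to read the window information is as the smallest capacity reported by any multicast member, so that a single multicast packet never exceeds what the slowest member can accept. The sketch below follows that reading; it is an interpretation rather than the application's definitive mechanism, and the function names and per-member dictionary are hypothetical.

```python
def group_window(member_windows: dict) -> int:
    """Smallest amount of data any member can process in the current period."""
    if not member_windows:
        raise ValueError("a multicast group needs at least one member")
    return min(member_windows.values())


def update_window(previous_window: int, member_windows: dict) -> int:
    """Replace the previous period's window when the new minimum differs."""
    current = group_window(member_windows)
    return current if current != previous_window else previous_window


def bytes_to_send(pending: int, window: int) -> int:
    """Cap the data carried in the next packet by the group window."""
    return min(pending, window)
```

For example, if two members report capacities of 4096 and 1024 bytes, the group window is 1024 bytes, and a 3000-byte request would place at most 1024 bytes in the next packet.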
Optionally, when the to-be-written data carried in the first data packet is only part of the data corresponding to the first request, the first IO device may further generate at least one additional data packet according to the first information and the first request. The at least one additional data packet carries the multicast information and all of the data corresponding to the first request other than the to-be-written data. When the to-be-written data carried in the first data packet is all of the data corresponding to the first request, the first IO device may generate only the first data packet according to the first information and the first request. Optionally, when a multicast member of the multicast group does not successfully receive the first data packet, the first IO device may further generate a data packet that carries all of the data corresponding to the first request together with unicast information, where the unicast information indicates the multicast member of the multicast group that did not successfully receive the first data packet.
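The two behaviors in this paragraph, splitting a request across several multicast packets and falling back to unicast for members that missed the data, can be sketched as follows. The dictionary-based packet representation, the `max_payload` parameter, and the acknowledgment set are hypothetical simplifications introduced for illustration.

```python
def plan_multicast_packets(multicast_id: int, data: bytes, max_payload: int):
    """Split the request's data into packets that all carry the multicast info."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [
        {"multicast_id": multicast_id, "payload": data[i:i + max_payload]}
        for i in range(0, len(data), max_payload)
    ]


def plan_unicast_retransmissions(data: bytes, members, acked):
    """Build one unicast packet, carrying all of the request's data,
    for each member that did not acknowledge the multicast transmission."""
    return [
        {"unicast_to": member, "payload": data}
        for member in members
        if member not in acked
    ]
```

When the data fits in one packet, `plan_multicast_packets` returns a single entry, matching the case in which only the first data packet is generated.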
Implementation 2:
In Implementation 2, the first IO device sends the first data packet using RDMA over Ethernet (for example, but not limited to, RoCEv2) as the transport protocol.
Optionally, the data to be written is part of the data corresponding to the first request, and the first information further includes indication information, where the indication information is used to instruct that the data to be written be encapsulated. In this implementation, before the first IO device generates the first data packet according to the first information and the first request, the method further includes: the first IO device sends an IO write command to the multicast members of the multicast group, where the IO write command is used to instruct that the data corresponding to the first request be stored in a first registered region (memory registration, MR), the data corresponding to the first request is located in the memory of the first host included in the first computing device, the first MR is a storage area in the memory of the second IO device to which a storage area in the memory of the second host is registered, the second computing device includes the second host and the second IO device, and the second computing device is a multicast member of the multicast group; and the first IO device receives the indication information sent by the multicast members of the multicast group. The IO write command includes a second key value and second location information, where the second key value is used to identify the second MR, and the second location information indicates the location of the data to be written in the second MR.
Optionally, the first information further includes a credit value, which indicates the minimum number of requests that the multicast members of the multicast group can process within a second time period; the base transport header (BTH) of the first data packet carries the credit value.
Scenario 2:
In Scenario 2, the first request is a read request, which may be used to request that data stored in the memory of the first host be stored in (read into) the memory of the host corresponding to a multicast member of the multicast group. In this scenario, the first IO device sends the first data packet using Ethernet-based remote direct memory access (RDMA) as the transport protocol.
Optionally, the first IO device obtaining the first request includes: the first IO device receives the first request sent by a multicast member of the multicast group, where the first request is used to instruct that the data to be written located in the second MR be stored in a storage area of the multicast member of the multicast group, the second MR is a storage area in the memory of the first IO device to which a storage area in the memory of the first host is registered, and the first computing device further includes the first host. Optionally, in this implementation, the window size of a first shared receive queue (SRQ) is non-zero, which indicates that the number of receive queue elements (RQEs) available in the first SRQ is non-zero; the available RQEs in the first SRQ are used to process data packets received by the first IO device, and the first SRQ is stored in the memory of the first host.
Optionally, the first request includes a first key value, first location information, and a preset field, where the first key value is used to identify the second MR, the first location information indicates the location of the data to be written in the second MR, and the value of the preset field is used to indicate the first request.
Step 330: The first IO device sends the first data packet to the multicast members of the multicast group.
The first IO device sends the first data packet to the multicast members of the multicast group. Optionally, the first data packet may be carried in the form of a message when transmitted over the network; the manner of generating a message from the first data packet is not specifically limited. It can be understood that when the data corresponding to the first request is encapsulated into multiple data packets that include the first data packet, a single message transmitted over the network may include those multiple data packets.
Optionally, the first IO device sending the first data packet to the multicast members of the multicast group includes: the first IO device sends the first data packet to a forwarding device, where the forwarding device is used to replicate the first data packet and forward the replicated first data packets to the multicast members of the multicast group, and the link connection between the multicast members of the multicast group and the first computing device includes the forwarding device. The number of forwarding devices between the first computing device and the multicast members of the multicast group is not specifically limited; for example, there may be 1, 2, or 5 forwarding devices. It can be understood that the number of data packets obtained when the forwarding device replicates the first data packet is associated with the number of multicast members of the multicast group. Every replicated first data packet carries the data to be written corresponding to the first request, but the destination addresses of the replicated packets may differ; that is, the destination address carried by each replicated first data packet corresponds one-to-one to a multicast member of the multicast group.
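The replication behaviour described above can be sketched as follows. This is an illustrative model only: the `Packet` type, its field names, and the string addresses are assumptions, and a real forwarding device would operate on packet headers rather than Python objects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    dst: str        # destination address carried in the packet
    payload: bytes  # data to be written for the first request


def replicate(packet: Packet, member_addrs: list[str]) -> list[Packet]:
    """Return one copy per multicast member, changing only the destination."""
    return [replace(packet, dst=addr) for addr in member_addrs]
```

Each copy keeps the same payload, and the number of copies equals the number of member addresses supplied.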
Optionally, the multicast members of the multicast group include a second computing device that is different from the first computing device, and the following operations may further be performed after step 330: the first IO device receives a second request sent by the second computing device, where the second request is used to request the data to be written that corresponds to the first request and is carried by the first data packet; and the first IO device sends a second data packet to the second computing device, where the second data packet carries the data to be written, and a port number included in the second data packet indicates that the second data packet is transmitted by unicast. The port number may be the destination port number of the second data packet; that is, the value of the destination port number may indicate that the second data packet is transmitted by unicast.
Optionally, the multicast members of the multicast group include only a second computing device and a third computing device, where the first, second, and third computing devices are all different from one another, and the following operation may further be performed after step 330: after the first IO device receives a first completion message and a second completion message, the first IO device sends a third completion message to the processor of the first host, where the third completion message indicates that the first request has been executed successfully, the first completion message indicates that the second computing device has executed the first request successfully, and the second completion message indicates that the third computing device has executed the first request successfully.
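One possible way to model this aggregation of completion messages is sketched below. The class name `CompletionTracker` and the member labels are hypothetical; the sketch only illustrates that the completion reported to the host is produced once every member has reported.

```python
class CompletionTracker:
    """Tracks which multicast members still owe a completion message."""

    def __init__(self, members):
        self.pending = set(members)

    def on_completion(self, member) -> bool:
        """Record a member's completion.

        Returns True once all members have completed, which is the point
        at which the completion for the whole request could be reported
        to the host processor.
        """
        self.pending.discard(member)
        return not self.pending
```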
Optionally, in Implementation 1 of Scenario 1 above, the following step may further be included after step 330: when each multicast member of the multicast group has received a third message, the first IO device receives an acknowledgement packet sent by each multicast member, where the third message indicates that the data packets it carries include the last data packet of the data to be written corresponding to the first request, and the acknowledgement packet indicates that the multicast member has successfully received all data packets carrying the data to be written corresponding to the first request. This way of sending acknowledgement packets is also called cumulative acknowledgement: the receiving side sends a single acknowledgement to the sending side only after receiving multiple messages, and that single acknowledgement indicates that the receiving side has received those messages. In this implementation, the first IO device needs only one acknowledgement packet from a multicast member to learn which data packets corresponding to the first request that member has received, which helps improve data transmission efficiency.
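A receiver-side sketch of cumulative acknowledgement is given below for illustration. The sequence numbering starting at 0, the `is_last` flag, and the tolerance for out-of-order arrival are assumptions that the text above does not specify.

```python
class CumulativeAckReceiver:
    """Decides when a single acknowledgement for a whole request is due."""

    def __init__(self):
        self.received = set()
        self.last_seq = None  # sequence number of the last packet, once known

    def on_packet(self, seq: int, is_last: bool = False) -> bool:
        """Record a packet; return True when one cumulative ACK should be sent."""
        self.received.add(seq)
        if is_last:
            self.last_seq = seq
        if self.last_seq is None:
            return False
        # Acknowledge only when every packet 0..last_seq has arrived.
        return all(s in self.received for s in range(self.last_seq + 1))
```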
It should be understood that the method 300 shown in FIG. 3 is merely illustrative and does not limit the multicast transmission method provided in the embodiments of this application. For content not described in detail in method 300, refer to the related descriptions in the method embodiments corresponding to FIG. 4 to FIG. 11 below; details are not repeated here.
In the embodiments of this application, the first IO device may perform data encapsulation according to the first request and the first information to generate the first data packet, and send the first data packet to the multicast members of the multicast group, rather than having the processor of the first host encapsulate data packets according to work requests. This avoids the high complexity, long delay, and low transmission efficiency of conventional multicast transmission methods. Because the first data packet sent by the first IO device carries multicast information that identifies the link connection between the multicast members of the multicast group and the first computing device, reliable transmission of multicast data can be ensured. The multicast transmission method provided in the embodiments of this application can therefore achieve reliable and efficient multicast data transmission.
FIG. 4 is a schematic diagram of a multicast transmission scenario provided by an embodiment of this application.
As shown in FIG. 4, assume that source computing device 1 (INI 1) in FIG. 4 is computing device 111 in the computing cluster 100 shown in FIG. 1, target computing device 1 (TGT 1) is computing device 114 in the computing cluster 100, and target computing device 2 (TGT 2) is computing device 115 in the computing cluster 100. For ease of description, it is further assumed below that at least one application program, referred to as application program 1, exists in host 10. As shown in FIG. 4, when at least one application program 1 exists in host 10, queue pair (QP) 1 and completion queue (CQ) 1 are created in the storage area of host 10. QP1 includes send queue (SQ) 1 and shared receive queue (SRQ) 1. SRQ1 may be shared by multiple TGTs (that is, TGT 1 and TGT 2). SQ1 is used to store work requests (WRs) generated by application program 1 in host 10. CQ1 is used to store the processing results of WRs that IO device 11 has finished executing. SRQ1 is associated with INI 1. In addition, QP2 and CQ2 are created in the storage area of host 20, where QP2 includes RQ2 and SQ2; QP3 and CQ3 are created in the memory of host 30, where QP3 includes RQ3 and SQ3. A pending request stored in an SQ may be referred to as a send queue element (SQE); that is, one SQE stores one pending request. A pending request stored in an RQ may be referred to as a receive queue element (RQE); that is, one RQE stores one received pending request. The processing result of a pending request stored in a CQ may be referred to as a completion queue element (CQE); that is, one CQE stores the processing result of one pending request. For example, when any one of host 10, host 20, and host 30 has the structure of the host 210 shown in FIG. 2, the memory included in that host may be the first memory 212.
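The relationship between the queues can be illustrated with a small in-memory model. This is a sketch only: the `QueuePair` type, the helper names, and the dictionary used as a CQE are assumptions, and real queues live in host memory and are consumed by the IO device.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    sq: deque = field(default_factory=deque)  # send queue, holds SQEs
    rq: deque = field(default_factory=deque)  # receive queue, holds RQEs


def post_send(qp: QueuePair, request) -> None:
    """Host side: place a pending request into the send queue as an SQE."""
    qp.sq.append(request)


def process_one(qp: QueuePair, cq: deque) -> None:
    """Device side: take one SQE and record its result in the CQ as a CQE."""
    sqe = qp.sq.popleft()
    cq.append({"request": sqe, "status": "done"})
```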
In the scenario shown in FIG. 4, the multicast members of multicast group 1 include TGT 1 and TGT 2. TGT 1 includes host 20 and IO device 21, and TGT 2 includes host 30 and IO device 31. INI 1 may send information (for example, work requests or data packets) to the multicast members of multicast group 1 by multicast, and receive information sent by the multicast members of multicast group 1 by unicast. Any multicast member of multicast group 1 may likewise send information to INI 1 by unicast. In the embodiments of this application, a multicast identifier may uniquely identify the link connection between INI 1 shown in FIG. 4 and all multicast members of multicast group 1 (that is, TGT 1 and TGT 2); for ease of description, this multicast identifier is referred to below as a replication group (RG) identifier. An INI may use one RG identifier to uniquely identify the multicast link connection between that INI and all multicast members of a multicast group. Any multicast member of that multicast group may use another RG identifier to uniquely identify the unicast link connection between that member, acting as a multicast member of the group, and the INI. These two RG identifiers may be different. Optionally, they may also be the same.
For example, in FIG. 4, INI 1 may use the identifier ini_rg_id_1 to uniquely identify the link connection between INI 1 and all multicast members of multicast group 1. This link connection includes link 1 and link 2, where link 1 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 1, and link 2 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 2. TGT 1 may use the identifier tgt_rg_id_a to uniquely identify unicast link 1 between TGT 1 and INI 1 when TGT 1 acts as a member of multicast group 1. TGT 2 may use the identifier tgt_rg_id_b to uniquely identify unicast link 2 between TGT 2 and INI 1 when TGT 2 acts as a member of multicast group 1. Any two of the identifiers ini_rg_id_1, tgt_rg_id_a, and tgt_rg_id_b may be different.
In the embodiments of this application, the link connection between an INI and all multicast members of a multicast group, the unicast link connection between the INI and each multicast member of that group, and the RG may together be referred to as a reliable point-to-multipoint connection (rP2M). Taking FIG. 4 again as an example, one rP2M includes: the unicast link connections (that is, unicast link 1 and unicast link 2); the multicast link connection (that is, link 1 and link 2, where link 1 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 1, and link 2 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 2); and the RG (including the identifier ini_rg_id_1 corresponding to INI 1, the identifier tgt_rg_id_a corresponding to TGT 1, and the identifier tgt_rg_id_b corresponding to TGT 2). One INI may also create multicast link connections with multiple multicast groups. Accordingly, multiple RG identifiers may be allocated to that INI, in one-to-one correspondence with the multicast link connections. One TGT may also establish unicast link connections with multiple INIs. Accordingly, multiple RG identifiers may be allocated to that TGT, in one-to-one correspondence with the unicast link connections. A multicast group may include at least two multicast members and, optionally, more. For example, multicast group 1 shown in FIG. 4 may include 3, 4, 5, 9, or 20 multicast members.
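As an illustration of how the identifiers in the FIG. 4 example relate to one another, the following sketch records them in a single structure. The type `RP2MConnection`, its fields, and the device labels are hypothetical and are not defined by this application.

```python
from dataclasses import dataclass


@dataclass
class RP2MConnection:
    ini: str          # label of the source device, e.g. "INI 1"
    ini_rg_id: str    # RG identifier of the multicast link connection at the INI
    tgt_rg_ids: dict  # member label -> RG identifier of its unicast link to the INI

    def rg_id_for(self, device: str) -> str:
        """Look up the RG identifier allocated to a device in this connection."""
        if device == self.ini:
            return self.ini_rg_id
        return self.tgt_rg_ids[device]


conn = RP2MConnection(
    ini="INI 1",
    ini_rg_id="ini_rg_id_1",
    tgt_rg_ids={"TGT 1": "tgt_rg_id_a", "TGT 2": "tgt_rg_id_b"},
)
```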
It should be understood that the multicast transmission scenario shown in FIG. 4 is merely illustrative and does not limit the multicast scenarios to which the multicast transmission method provided in this application applies. For example, the multicast group in the scenario shown in FIG. 4 may include more multicast members, such as 3 or 5. In FIG. 4, unicast link 1 from INI 1 to TGT 1 is shown, as an example, without any intermediate forwarding device. Optionally, in other implementations, unicast link 1 from INI 1 to TGT 1 may include one or more forwarding devices.
Taking the application scenario shown in FIG. 4 as an example, a specific implementation of the multicast transmission method provided in the embodiments of this application is described in detail below with reference to the embodiment in FIG. 5.
FIG. 5 is a schematic flowchart of a multicast transmission method 500 provided by an embodiment of this application. It should be understood that, for ease of description, FIG. 5 is illustrated using only INI 1 and TGT 1 in FIG. 4 as an example. The operations performed by TGT 2 in FIG. 4 are similar in principle to those performed by TGT 1; for details, refer to the operations corresponding to TGT 1 in method 500. As shown in FIG. 5, method 500 may include steps 510 to 592, which are described in detail below.
Step 510: Host 10 obtains work queue element (WQE) 1 and places WQE1 in SQ1, where WQE1 carries write request 1, and write request 1 carries the source address at which data 1 to be written is located in the memory of host 10.
Host 10 obtaining WQE1 may include the following steps: application program 1 in the processor of host 10 generates WR1; and host 10 invokes an interface provided by the driver to convert WR1 into WQE1. WR1 and WQE1 carry the same information and differ only in format. It can be understood that once WQE1 is placed in SQ1, WQE1 is also called SQE1. For ease of description, the WQE1 obtained from SQ1 is uniformly referred to below as SQE1. For example, SQE1 may indicate write request 1, which requests that 10 bytes of data stored at address 0x12345678 in the memory of host 10 be written into the memory of the computing devices corresponding to the multicast members of multicast group 1. For example, when INI 1 in FIG. 4 has the structure of the computing device shown in FIG. 2, application program 1 in the processor of host 10 may be an application program in the first processor 211 of the host 210 shown in FIG. 2, the memory of host 10 may be the first memory 212 shown in FIG. 2, IO device 11 may be the IO device 230 shown in FIG. 2, and the memory of IO device 11 may be the second memory 232 shown in FIG. 2.
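The statement that a WR and a WQE carry the same information in different formats can be illustrated with a round trip between a dictionary and a packed byte string. The layout `WQE_FORMAT`, the opcode value, and the field names are hypothetical; the actual WQE format is defined by the driver and the IO device and is not given here.

```python
import struct

# Hypothetical layout: opcode (1 byte), source address (8 bytes), length (4 bytes),
# little-endian with no padding.
WQE_FORMAT = "<BQI"
OP_WRITE = 1  # hypothetical opcode for a write request


def wr_to_wqe(wr: dict) -> bytes:
    """Convert a work request (dict form) into a packed WQE (bytes form)."""
    return struct.pack(WQE_FORMAT, wr["opcode"], wr["addr"], wr["length"])


def wqe_to_wr(wqe: bytes) -> dict:
    """Recover the work request fields from a packed WQE."""
    opcode, addr, length = struct.unpack(WQE_FORMAT, wqe)
    return {"opcode": opcode, "addr": addr, "length": length}


# The write request from the example above: 10 bytes at address 0x12345678.
wr1 = {"opcode": OP_WRITE, "addr": 0x12345678, "length": 10}
wqe1 = wr_to_wqe(wr1)
```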
Step 520: IO device 11 obtains SQE1 from SQ1.
The SQE1 obtained by IO device 11 is in scatter gather list (SGL) format; that is, SQE1 includes the source address of data 1 to be written and the source address of SQE1. The source address of data 1 to be written may be the address at which data 1 to be written is located in the memory of host 10, and the source address of SQE1 may be the address at which SQE1 is located in the memory of host 10.
Step 530: IO device 11 encapsulates data 1 to be written, as indicated by SQE1, according to SQE1, the multicast information, and the window information, and generates data packet 1.
The multicast information refers to the multicast information between INI 1 and the multicast members of multicast group 1, and includes but is not limited to: the media access control (MAC) address of INI 1, the unicast IP address of INI 1, the MAC address of the multicast group, the IP address of the multicast group, and the RG identifier of INI 1. The RG identifier of INI 1 indicates the multicast link connection between INI 1 and the multicast members of multicast group 1. The window information includes TCP window information and congestion window information: the TCP window information indicates the smallest of the TCP window sizes of all members of multicast group 1, and the congestion window information indicates the smallest of the congestion window sizes of all members of multicast group 1. The algorithm for determining the congestion window is not specifically limited in the embodiments of this application. For example, congestion control algorithms include but are not limited to slow start, congestion avoidance, fast retransmit, and fast recovery.
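The window information described above amounts to taking a minimum over the group. A minimal sketch follows, assuming each member's windows are available as a pair of byte counts; the function name and the input shape are illustrative choices.

```python
def group_window_info(member_windows: dict) -> tuple:
    """Return (min TCP window, min congestion window) over all group members.

    `member_windows` maps a member label to a (tcp_window, congestion_window)
    pair for that member.
    """
    min_tcp = min(tcp for tcp, _ in member_windows.values())
    min_cwnd = min(cwnd for _, cwnd in member_windows.values())
    return min_tcp, min_cwnd
```

The two minima are taken independently, so they may come from different members.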
In step 530, the functions of the TCP/IP protocol stack are offloaded to IO device 11; that is, IO device 11 can implement the functions of the TCP/IP protocol stack. In other words, IO device 11 can use the TCP/IP protocol stack to encapsulate data 1 to be written, which corresponds to write request 1, into data packets. Optionally, IO device 11 may also parse a received data packet to obtain the content carried in its header, the data carried in its payload, and so on. For example, the format of data packet 1 obtained after encapsulation by IO device 11 may be as shown in FIG. 6: data packet 1 may include, in order, an Ethernet header, an IP header, a UDP header, a TCP header, a payload, and a frame check sequence. An extension header in the TCP header may carry the RG identifier. Optionally, the extension header may also carry a message boundary, which indicates that the transport protocol transmits the data over the network as one independent message. For the specific content carried in the headers of the data packet shown in FIG. 6, see Table 1 below.
Table 1
Figure PCTCN2022139219-appb-000001
TGT i (i = 1 or 2) in Table 1 is any multicast member of multicast group 1 (that is, TGT 1 or TGT 2). In Table 1, the destination UDP port number rp2m_port_x indicates multicast transmission, and the destination UDP port number rp2m_port_y indicates unicast transmission. It can be understood that the source UDP port number src_port shown in Table 1 may be a variable value, whereas the destination UDP port numbers rp2m_port_x and rp2m_port_y shown in Table 1 are fixed values. It should also be understood that, in the embodiments of this application, the destination UDP port number may be the same for all packets that use the same transmission mode (unicast or multicast). For example, if rp2m_port_y is set to 1000, the destination UDP port number in data packets that INI 1 sends to TGT 1 or TGT 2 over a unicast link is 1000, and the destination UDP port number in data packets that TGT 1 or TGT 2 sends to INI 1 over a unicast link is also 1000. Optionally, the destination UDP port number may be selected from the Internet Corporation for Assigned Names and Numbers (ICANN) port range. Table 1 shows that the RG identifier allocated to INI 1 is ini_rg_id, which uniquely identifies all link connections between INI 1 and the multicast members of multicast group 1; these include link 1 and link 2, where link 1 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 1, and link 2 runs from INI 1 to forwarding device 1 and from forwarding device 1 to TGT 2. Table 1 also shows that the RG identifier allocated to TGT i is tgt_rg_id[i], which may uniquely identify the unicast link connection from TGT i to INI 1 when TGT i acts as a multicast member of multicast group 1.
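How a receiver might use the destination UDP port to tell the two transmission modes apart can be sketched as follows. The value 1000 for rp2m_port_y is the example given in the text; the value 1001 for rp2m_port_x is purely an assumption for illustration, as is the function name.

```python
RP2M_PORT_X = 1001  # hypothetical value for the multicast port (not given in the text)
RP2M_PORT_Y = 1000  # example value for the unicast port, taken from the text


def transmission_mode(dst_udp_port: int) -> str:
    """Classify a packet by its destination UDP port number."""
    if dst_udp_port == RP2M_PORT_X:
        return "multicast"
    if dst_udp_port == RP2M_PORT_Y:
        return "unicast"
    return "unknown"
```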
In one implementation, in step 530 above, the IO device 11 encapsulates the to-be-written data 1 indicated by SQE1 according to SQE1, the multicast information, and the window information to generate data packet 1, which may include the following steps: the IO device 11 obtains the to-be-written data 1, according to SQE1, from the location in the host 10 where the to-be-written data 1 is stored; the IO device 11 determines the send window size of the IO device 11 according to the window information, where the size of the TCP window indicated by the TCP window information is non-zero; and the IO device 11 encapsulates the to-be-written data 1 into the payload of data packet 1 and encapsulates the multicast information into the packet header of data packet 1 to generate data packet 1, where the amount of data carried in the payload of data packet 1 is equal to the send window size of the IO device 11. In this implementation, when the TCP window size of any multicast member of multicast group 1 is equal to zero, the IO device 11 suspends encapsulating and sending data packets. The IO device 11 can then use a unicast channel to probe and individually update the TCP window size of that multicast member. Once the IO device 11 determines that the TCP window size of that multicast member is non-zero, the IO device 11 resumes encapsulating data according to the window information to generate a data packet and sends the data packet. When the IO device 11 sends a data packet to any multicast member of the multicast group in unicast mode, the parameters carried in the header of the data packet are as listed in the third column of Table 1 above. The manner in which INI 1 obtains the TCP window size and congestion window size of each multicast member of multicast group 1 is not specifically limited. In one example, when INI 1 establishes a TCP connection with each multicast member of multicast group 1, each multicast member informs INI 1 of its TCP window size and congestion window size. In another example, a TGT (that is, any multicast member of multicast group 1) may periodically generate its TCP window size according to the state of its receive buffer and update its congestion window size according to the state of the network in which it is located. INI 1 maintains the TCP window size and congestion window size of each multicast member of a given multicast group according to the individual window notifications from the TGTs.
Taking (1) in FIG. 7 as an example to illustrate the content carried in data packet 1, the to-be-written data 1 may correspond to the 10th byte through the 25th byte in (1) in FIG. 7. In this implementation, the payload of data packet 1 carries the entire content from the 10th byte through the 25th byte, and the parameters carried in the header of data packet 1 are as listed in the second column of Table 1 above. In this implementation, the send window size of the IO device 11 is 16 bytes, the TCP window sizes of TGT 1 and TGT 2 may each be equal to 16 bytes, and the congestion window sizes of TGT 1 and TGT 2 may also each be equal to 16 bytes.
In another implementation, the amount of to-be-written data 1 is large, and the buffers of the multicast members of multicast group 1 cannot receive the to-be-written data 1 all at once. In this case, the IO device 11 encapsulating the to-be-written data 1 indicated by SQE1 according to SQE1, the multicast information, and the window information to generate data packet 1 may include the following steps: the IO device 11 obtains the to-be-written data 1, according to SQE1, from the location in the host 10 where the to-be-written data 1 is stored; the IO device 11 determines the send window size according to the window information, where the size of the TCP window indicated by the TCP window information is non-zero and the send window size is smaller than the amount of to-be-written data 1; the IO device 11 obtains to-be-encapsulated data 1 from the to-be-written data 1 according to the send window size, where the to-be-written data 1 includes the to-be-encapsulated data 1; and the IO device 11 encapsulates the to-be-encapsulated data 1 into the payload of data packet 1 and encapsulates the multicast information into the packet header of data packet 1 to generate data packet 1, where the payload of data packet 1 carries only part of the to-be-written data 1. In this implementation, the following step may further be included after step 530: the IO device 11 also sends at least one data packet 2 to the members of multicast group 1 in multicast transmission mode, where the content carried in the header of data packet 2 is the same as that carried in the header of data packet 1, and the size of the content carried in the payload of data packet 2 is determined according to the TCP window information and to-be-encapsulated data 2, the to-be-encapsulated data 2 being the data in the to-be-written data 1 other than the to-be-encapsulated data 1. In this implementation, when the TCP window size indicated by the TCP window information is equal to zero, the IO device 11 suspends encapsulating and sending messages. The IO device 11 can then use a unicast channel to probe and individually update the TCP window size indicated by the TCP window information. Once the IO device 11 determines that the TCP window size indicated by the TCP window information is non-zero, the IO device 11 resumes encapsulating data according to the window information to generate a message and sends the message. When the IO device 11 sends a message to any member of the multicast group in unicast mode, the parameters carried in the header of the data packet included in the message are as listed in the third column of Table 1 above.
Taking (2) in FIG. 7 as an example to illustrate the above steps, assume that the IO device 11 can transmit the to-be-written data 1 to the multicast members of multicast group 1 by sending data packet 1 and data packet 2 to multicast group 1 in multicast transmission mode. When INI 1 establishes TCP connections with the multicast members of multicast group 1, it learns that the TCP window size of TGT 1 is 7 bytes and the TCP window size of TGT 2 is 6 bytes. Accordingly, data packet 1 generated by the IO device 11 carries the content of to-be-encapsulated data 1 shown in (2) in FIG. 7. Thereafter, INI 1 receives an update of the TCP window size of TGT 1 to 10 bytes and an update of the TCP window size of TGT 2 to 15 bytes. Accordingly, data packet 2 generated by the IO device 11 carries the content of to-be-encapsulated data 2 shown in (2) in FIG. 7. That is, by sending two data packets (data packet 1 and data packet 2), the IO device 11 transmits the to-be-written data 1 to the multicast members of multicast group 1.
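The window-limited encapsulation described above can be sketched as follows. This is an illustration under stated assumptions, not the method as claimed: it assumes that the send window of the IO device 11 is the smallest TCP window among the group members (the text only says the send window is determined "according to the window information"), that the two quoted window sizes in each round belong to the two group members, and it ignores the congestion window. The function names send_window and next_payload are hypothetical.

```python
from typing import Mapping


def send_window(tcp_windows: Mapping[str, int]) -> int:
    """Assumed policy: the group send window is the smallest member window."""
    return min(tcp_windows.values())


def next_payload(data: bytes, offset: int, tcp_windows: Mapping[str, int]) -> bytes:
    """Return the next payload to multicast, or b"" while any window is zero."""
    window = send_window(tcp_windows)
    if window == 0:
        return b""  # sender pauses and probes the stalled member over unicast
    return data[offset:offset + window]


# Stand-in for the 16 bytes (10th through 25th) of to-be-written data 1.
data = bytes(range(10, 26))

# Windows learned at connection setup, then the updated windows.
packet1 = next_payload(data, 0, {"TGT1": 7, "TGT2": 6})
packet2 = next_payload(data, len(packet1), {"TGT1": 10, "TGT2": 15})
```

Under the minimum-window assumption, the first payload is 6 bytes and the second is 10 bytes, so two data packets carry all 16 bytes, which matches the two-packet outcome described in the text.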
Step 540: the IO device 11 sends data packet 1 to the multicast members of multicast group 1.
The multicast members of multicast group 1 include TGT 1 and TGT 2. Accordingly, both the IO device 21 and the IO device 31 receive data packet 1. The IO device 31 is the IO device included in TGT 2; it can be understood that TGT 2 is not shown in FIG. 5.
In some implementations, the IO device 11 encapsulates the to-be-written data 1 into a single data packet 1 and then sends data packet 1 in multicast mode; that is, the information carried in the header of data packet 1 may be as listed in the second column of Table 1 above.
In other implementations, the IO device 11 encapsulates the to-be-written data 1 into data packet 1 and data packet 2, and then sends data packet 1 and data packet 2 in multicast mode. In these implementations, the timing at which the IO device 11 sends data packet 1 and data packet 2 is not specifically limited. In one example, the IO device 11 may send data packet 2 to multicast group 1 in multicast transmission mode after receiving the acknowledgment (ACK) information for data packet 1 from each member of multicast group 1. In another example, after sending data packet 1 to multicast group 1 in multicast transmission mode, the IO device 11 may send data packet 2 to multicast group 1 in multicast transmission mode without waiting to receive the ACK information for data packet 1 from each multicast member. Likewise, each multicast member of multicast group 1 does not need to send ACK information each time it receives a message; each multicast member (for example, TGT 1) may send a single piece of ACK information after receiving multiple messages, and that single piece of ACK information indicates that the multicast member has received the multiple messages. This way of sending acknowledgment information is also called cumulative acknowledgment: the receiving side sends one acknowledgment to the sending side only after receiving multiple messages, and that acknowledgment indicates that the receiving side has received the multiple messages. In yet another example, the acknowledgment number (ACK number) field of the TCP header of the message may optionally be set, and each message may carry a unique identifier to help aggregate messages on the receiving side.
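The cumulative acknowledgment idea can be sketched on the sender side as follows. The class name CumulativeAckTracker and the use of small integer message numbers (1, 2, 3, ...) are assumptions made for illustration; the source does not specify how messages are numbered.

```python
class CumulativeAckTracker:
    """Tracks, per multicast member, the highest message number acknowledged."""

    def __init__(self, members):
        # 0 means "nothing acknowledged yet" in this sketch.
        self.acked = {member: 0 for member in members}

    def on_ack(self, member, ack_number):
        # Cumulative: acknowledging N implies every message <= N was received.
        # A late or duplicate ACK never moves the mark backwards.
        self.acked[member] = max(self.acked[member], ack_number)

    def group_acked(self):
        # A message is complete for the group once every member has acknowledged it.
        return min(self.acked.values())


tracker = CumulativeAckTracker(["TGT1", "TGT2"])
tracker.on_ack("TGT1", 3)  # one ACK covers messages 1..3
tracker.on_ack("TGT2", 2)  # one ACK covers messages 1..2
```

In this sketch, messages 1 and 2 are complete for the whole group after two ACKs in total, while message 3 is still outstanding for TGT 2.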
Optionally, after step 540 above, the following operations may also be performed: on the INI 1 side, a timer is set for each multicast member of multicast group 1 (timer 1 corresponds to TGT 1 and timer 2 corresponds to TGT 2) to ensure reliable data transmission. After the IO device 11 sends a data packet (for example, data packet 1) to the multicast members of multicast group 1, timer 1 and timer 2 start timing to determine the round-trip time (RTT) of each TGT. When the IO device 11 determines that the RTT of a TGT exceeds the retransmission timeout (RTO), the IO device 11 may resend the message to that TGT over a unicast link. In this case, the parameters encapsulated in the data packet sent by the IO device 11 are as listed in the third column of Table 1 above.
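The per-member timers can be sketched as follows. The class RetransmitTimers and its methods are hypothetical names, time is passed in explicitly so that the example is deterministic, and a single fixed RTO is assumed for all members (the source does not say how the RTO is chosen).

```python
class RetransmitTimers:
    """One timer per multicast member, started when a packet is multicast."""

    def __init__(self, members, rto):
        self.members = list(members)
        self.rto = rto        # retransmission timeout in seconds (assumed fixed)
        self.sent_at = {}     # member -> send time of the unacknowledged packet

    def on_multicast_send(self, now):
        # Start (or restart) every member's timer at the moment of sending.
        for member in self.members:
            self.sent_at[member] = now

    def on_ack(self, member):
        # An ACK stops that member's timer.
        self.sent_at.pop(member, None)

    def expired(self, now):
        # Members whose elapsed time exceeds the RTO need a unicast resend.
        return [m for m, t in self.sent_at.items() if now - t > self.rto]


timers = RetransmitTimers(["TGT1", "TGT2"], rto=0.2)
timers.on_multicast_send(now=0.0)
timers.on_ack("TGT1")                 # TGT 1 answered in time
late = timers.expired(now=0.3)        # TGT 2 has not answered within the RTO
```

Only the members returned by expired() would receive the unicast retransmission; the rest of the group is not disturbed.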
For ease of description, the following uses the example in which INI 1 sends one data packet to the multicast members of multicast group 1 in multicast transmission mode. That is, when any multicast member of multicast group 1 receives this data packet, it has received the entire content of the to-be-written data 1.
Step 550: the IO device 21 parses data packet 1 to obtain the to-be-written data 1 and generates rqe1. rqe1 carries processing result 1 of SQE1, and processing result 1 of SQE1 indicates that the to-be-written data 1 in the memory of the IO device 21 is to be written into the memory of the host 20.
The step in which the IO device 21 parses data packet 1 in step 550 is the inverse of the step in which the IO device 11 generates data packet 1 in step 530. Taking FIG. 2 as an example, when TGT 1 has the structure of the computing device shown in FIG. 2, the memory of the host 20 may be the first memory 212 shown in FIG. 2, and the memory of the IO device 21 may be the second memory 232 shown in FIG. 2.
Optionally, when CQ1 is shared by multiple QPs on the INI 1 side, rqe1 also needs to carry the identifier (identity, ID) of SQ1 and the index of SQE1, where the index of SQE1 indicates the position of SQE1 in SQ1.
Step 560: the IO device 21 sends rqe1 to the host 20. Accordingly, the host 20 stores the received rqe1 in RQ2.
The format of rqe1 is the SGL format; that is, rqe1 includes the source address of the to-be-written data 1 in the memory of the IO device 21 and the source address of rqe1 in the memory of the IO device 21.
Step 570: the host 20 obtains rqe1 from RQ2, reads the to-be-written data 1 from the memory of the IO device 21 according to rqe1, and stores the to-be-written data 1 in the memory of the host 20.
In some implementations, RQ2 is associated with the memory of the host 20 in which the to-be-written data 1 is stored; that is, RQ2 is used to indicate that memory of the host 20.
Step 580: the IO device 21 sends acknowledgment (ACK) message 1 to the IO device 11 in unicast mode. ACK message 1 indicates that the IO device 21 has successfully received the to-be-written data 1 carried in data packet 1.
In step 580 above, the parameters carried in the header of ACK message 1 are as listed in the fourth column of Table 1 above. Optionally, ACK message 1 may also carry the TCP window size and the congestion window size. Each ACK message 1 consumes one shared receive queue element (SRQE) in SRQ1. Optionally, in some implementations, if the IO device 11 finds upon receiving ACK message 1 that all SRQEs in SRQ1 of the IO device 11 are occupied (that is, the receive window of the IO device 11 is equal to zero), the IO device 11 may discard ACK message 1 and notify the multicast members of multicast group 1 that its receive window is zero, so that the multicast members suspend sending ACK messages. When an unused SRQE becomes available in SRQ1, the IO device 11 sends its updated receive window to the multicast members of multicast group 1. Accordingly, after learning that the receive window of the IO device 11 is non-zero, the multicast members of multicast group 1 resume sending ACK messages to the IO device 11.
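The SRQE-based flow control for ACK messages can be sketched as follows. The class SharedReceiveQueue and its methods are hypothetical; the sketch models only the counting behaviour (one SRQE per ACK, zero receive window when all SRQEs are occupied) and leaves the window notifications to the caller.

```python
class SharedReceiveQueue:
    """Counts free SRQEs; the free count plays the role of the receive window."""

    def __init__(self, capacity):
        self.capacity = capacity  # total number of SRQEs in SRQ1
        self.used = 0

    @property
    def receive_window(self):
        return self.capacity - self.used

    def on_ack(self):
        """Try to accept one ACK message; return False if it must be dropped."""
        if self.receive_window == 0:
            # All SRQEs occupied: drop the ACK; the caller would now advertise
            # a zero receive window so that members pause sending ACKs.
            return False
        self.used += 1  # each accepted ACK consumes one SRQE
        return True

    def release(self):
        """Free one SRQE; the caller would then advertise the updated window."""
        if self.used > 0:
            self.used -= 1


srq1 = SharedReceiveQueue(capacity=1)
first = srq1.on_ack()    # accepted, consumes the only SRQE
second = srq1.on_ack()   # dropped, receive window is zero
```

After release() frees an SRQE, the receive window becomes non-zero again and the members can resume sending ACK messages.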
The IO device 21 performs step 580 above, and accordingly the IO device 11 receives ACK message 1. The IO device 11 receives ACK messages from the multicast members of multicast group 1 strictly in order, and it can determine whether ACK messages have arrived out of order according to the TCP sequence number carried in each ACK message. When the IO device 11 determines that the currently received ACK message is not the ACK message it currently expects, the IO device 11 may discard the ACK message and instruct the corresponding multicast member of the multicast group to resend it.
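The in-order check on ACK messages can be sketched as follows. The class AckOrderChecker is a hypothetical name, and the sketch simplifies the TCP sequence number to a per-member counter that increases by one per ACK message, which is an assumption made only to keep the example short.

```python
class AckOrderChecker:
    """Accepts an ACK only if it carries the sequence number currently expected."""

    def __init__(self, members, first_seq=1):
        self.expected = {member: first_seq for member in members}

    def on_ack(self, member, seq):
        if seq != self.expected[member]:
            # Out of order: discard; the caller would ask the member to resend.
            return False
        self.expected[member] = seq + 1
        return True


checker = AckOrderChecker(["TGT1", "TGT2"])
in_order = checker.on_ack("TGT1", 1)      # expected, accepted
out_of_order = checker.on_ack("TGT1", 3)  # 2 was expected, discarded
resent = checker.on_ack("TGT1", 2)        # the expected ACK arrives after the resend
```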
Step 590: the host 20 sends SQE2 to the IO device 21.
The format of SQE2 is the SGL format; that is, SQE2 carries the location of SQE2 in the memory of the host 20 and the location of the to-be-written data 1 in the memory of the host 20. SQE2 indicates that the data request corresponding to SQE1 has been completed.
Accordingly, the IO device 21 receives SQE2 and parses it to determine that the host 20 has successfully received the to-be-written data 1 indicated by SQE1.
Step 591: the IO device 21 sends IO completion message 1 to the IO device 11. IO completion message 1 indicates that execution of SQE1 is complete.
Before step 591, after receiving SQE2 from the host 20, the IO device 21 parses SQE2, determines that the to-be-written data 1 has been successfully written into the memory of the host 20, and generates IO completion message 1.
Step 592: the IO device 11 sends RQE1 to the host 10. RQE1 indicates that execution of SQE1 is complete.
Accordingly, the host 10 receives RQE1 and stores it in SRQ1; RQE1 stored in SRQ1 is also called SRQE1.
Steps 510 to 592 above are described using the transmission processing flow between INI 1 and TGT 1 as an example. It can be understood that the transmission processing flow between INI 1 and TGT 2 is similar. It can also be understood that the above steps are all described using successful reception as an example. Optionally, successful reception in the above steps may be replaced with failed reception; in that case, INI 1 needs to resend the unsuccessfully received message by multicast or unicast. Steps 510 to 592 are described using the example in which INI 1 sends one data packet 1 to TGT 1; optionally, the method also applies when INI 1 needs to send multiple data packets to TGT 1 to fulfill the data request corresponding to one WQE. The above steps are described with TGT 1 and TGT 2 as the multicast members of multicast group 1; optionally, multicast group 1 may include more multicast members.
It can be understood that when the IO device 11 sends data packet 1 to the multicast members of multicast group 1 over the network, data packet 1 may be carried in the form of a message.
The embodiments of this application provide a TCP/IP-based multicast transmission method. In a specific implementation, the IO device 11 included in INI 1 can generate data packet 1 according to write request 1 and the multicast information and send data packet 1 to the multicast members of multicast group 1, instead of having the processor in the host 10 included in INI 1 perform the data encapsulation to generate the data packet. This avoids the high complexity, long delay, and low transmission efficiency of TCP/IP-based multicast transmission in the prior art. Because data packet 1 sent by the IO device 11 included in INI 1 carries the multicast information that identifies the link connections between the multicast members of multicast group 1 and INI 1, reliable transmission of multicast data can be ensured.
The following takes the application scenario shown in FIG. 4 as an example and, with reference to the embodiment in FIG. 8, describes in detail another specific implementation of the multicast transmission method provided in the embodiments of this application.
FIG. 8 is a schematic flowchart of a multicast transmission method 800 provided in an embodiment of this application. As shown in FIG. 8, the method 800 includes steps 810 to 880, which are described in detail below. In this embodiment, each of the IO device 11, the IO device 21, and the IO device 31 supports RoCE.
Step 810: the host 10 obtains SQE1 and stores SQE1 in SQ1.
SQE1 carries write request 1, and write request 1 carries the source address of the to-be-written data 1 in the memory included in the host 10. The host 10 obtaining SQE1 may include the following steps: application 1 in the host 10 generates WR1; and the host 10 calls an interface provided by the driver to convert WR1 into WQE1. WR1 and WQE1 carry the same information and differ only in format. It can be understood that after WQE1 is placed in SQ1, WQE1 is also called SQE1. Hereinafter, WQE1 obtained from SQ1 is uniformly referred to as SQE1.
Step 811: the IO device 11 obtains SQE1 from SQ1.
The format of SQE1 is the SGL format; that is, SQE1 includes the source address of the to-be-written data 1 and the source address of SQE1. The source address of the to-be-written data 1 is the address of the to-be-written data 1 in the memory included in the host 10, and the source address of SQE1 is the location of SQE1 in the memory included in the host 10.
Step 812: the IO device 11 obtains the to-be-written data 1 from the memory of the host 10 according to SQE1 and encapsulates the to-be-written data 1 to generate data packet 1.
The format of data packet 1 may be as shown in (1) in FIG. 9. Data packet 1 includes, in order, an Ethernet header, an IP header, a UDP header, a BTH, a payload, an invariant cyclic redundancy check (ICRC), and a frame check sequence (FCS). The format of the BTH is shown in (2) in FIG. 9. In (2) in FIG. 9, the operation code (Opcode) indicates the type of the packet or the higher-layer protocol type in the payload. S stands for Solicited Event and indicates that the responder should generate an event. M stands for MigReq and is generally used for the migration state. Pad indicates how many extra bytes are padded into the IB payload. TVer stands for Transport Header Version and indicates the version number of the packet. Partition Key identifies the logical memory partition associated with the packet. rsvd stands for reserved and denotes a reserved field. Destination QP indicates the queue pair number of the destination. A stands for Acknowledge Request and indicates that the acknowledgment for the packet can be scheduled by the responder. PSN is the packet sequence number, which the receiving side uses to determine whether packets are out of order. Table 2 below lists the OpCodes corresponding to RDMA operations. The BTH also includes an extended header, which may include at least one of the following fields: an AETH field, a RETH field, an ImmDt field, and a SynETH field. The AETH field indicates the number of SQEs available in SQ1; the RETH field indicates an RDMA read operation or an RDMA write operation; the ImmDt field indicates whether immediate data is carried; and the value of the SynETH field uniquely identifies a read request. The read request is a request sent to INI 1 by a multicast member of multicast group 1; that is, the read request is used to request that the to-be-written data stored in INI 1 be stored in the storage area corresponding to that multicast member of multicast group 1.
Table 2
Figure PCTCN2022139219-appb-000002
Figure PCTCN2022139219-appb-000003
It can be understood that, in this embodiment, the ImmDt field in the header of data packet 1 indicates that no immediate data is carried, the RETH field in the header of data packet 1 indicates an RDMA write operation, and the SynETH field is empty, that is, it is not used to uniquely identify a read request. For the other parameter information in the header of data packet 1, refer to the second column of Table 1 above.
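The BTH fields named for (2) in FIG. 9 can be illustrated with a short packing sketch. FIG. 9 is not reproduced here, so the bit widths below are an assumption taken from the standard 12-byte InfiniBand base transport header (8-bit OpCode; 1-bit S, 1-bit M, 2-bit Pad, 4-bit TVer; 16-bit Partition Key; 8 reserved bits and a 24-bit Destination QP; 1-bit A, 7 reserved bits and a 24-bit PSN). The function names and the sample field values are hypothetical, and the OpCode value 0x0A is used only as a placeholder for an RDMA write; Table 2 of the source governs the actual values.

```python
import struct

BTH_FORMAT = "!BBHII"  # 1 + 1 + 2 + 4 + 4 = 12 bytes, network byte order


def pack_bth(opcode, solicited, mig_req, pad, tver, pkey, dest_qp, ack_req, psn):
    """Pack the base transport header fields into 12 bytes."""
    flags = ((solicited & 0x1) << 7) | ((mig_req & 0x1) << 6) \
        | ((pad & 0x3) << 4) | (tver & 0xF)
    qp_word = dest_qp & 0xFFFFFF                           # upper 8 bits reserved
    psn_word = ((ack_req & 0x1) << 31) | (psn & 0xFFFFFF)  # 7 reserved bits stay 0
    return struct.pack(BTH_FORMAT, opcode & 0xFF, flags, pkey & 0xFFFF,
                       qp_word, psn_word)


def unpack_bth(header):
    """Inverse of pack_bth: recover the fields from the first 12 bytes."""
    opcode, flags, pkey, qp_word, psn_word = struct.unpack(BTH_FORMAT, header[:12])
    return {
        "opcode": opcode,
        "solicited": (flags >> 7) & 0x1,
        "mig_req": (flags >> 6) & 0x1,
        "pad": (flags >> 4) & 0x3,
        "tver": flags & 0xF,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,
        "ack_req": (psn_word >> 31) & 0x1,
        "psn": psn_word & 0xFFFFFF,
    }


bth = pack_bth(opcode=0x0A, solicited=0, mig_req=0, pad=0, tver=0,
               pkey=0xFFFF, dest_qp=0x12, ack_req=1, psn=100)
fields = unpack_bth(bth)
```

The receiving side would compare the recovered PSN with the one it expects in order to detect out-of-order packets, as described for the PSN field above.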
步骤820,IO设备11向组播组1的组播成员发送消息1,消息1包括数据包1。相应的,组播组1的成员(即,IO设备21和IO设备31)分别接收到该消息1。In step 820, the IO device 11 sends a message 1 to the multicast members of the multicast group 1, where the message 1 includes a data packet 1. Correspondingly, the members of the multicast group 1 (ie, the IO device 21 and the IO device 31 ) respectively receive the message 1 .
示例性的,图9中的(3)示出了消息1包括的数据包1的格式。可选的,在另一些实现方式中,IO设备11对将待写入数据1进行封装还可以生成多个数据包,该多个数据包共同携带待写入数据1。这种实现方式中,消息1还可以包括该多个数据包。示例性的,图9中的(3)示出了一个消息包括2个数据包的格式。Exemplarily, (3) in FIG. 9 shows the format of the data packet 1 included in the message 1 . Optionally, in other implementation manners, the IO device 11 may also generate multiple data packets after encapsulating the data 1 to be written, and the multiple data packets jointly carry the data 1 to be written. In this implementation manner, message 1 may also include the multiple data packets. Exemplarily, (3) in FIG. 9 shows a format in which a message includes 2 data packets.
上述实现方式中,当IO设备21或IO设备31接收到消息1,且对消息1进行校验后,IO设备21或IO设备31可以向INI 1发送一个ACK消息,该ACK消息表示IO设备21或IO设备31已经正确接收到消息1。In the above implementation, when the IO device 21 or the IO device 31 receives the message 1, and after checking the message 1, the IO device 21 or the IO device 31 can send an ACK message to the INI 1, and the ACK message indicates that the IO device 21 Or the IO device 31 has correctly received message 1.
Step 830: host 20 obtains rqe1 from RQ2 of IO device 21, where rqe1 indicates that data 1 to be written is to be written to location 1 in the memory of host 20.
rqe1 is in SGL format, that is, rqe1 carries the source address of rqe1 in the memory of IO device 21 and the source address of data 1 to be written in the memory of IO device 21.
Before step 830, IO device 21 further performs the following operations: parsing message 1, obtaining data 1 to be written, and storing the generated rqe1 in RQ2.
Step 831: host 30 obtains rqe2 from RQ3 of IO device 31, where rqe2 indicates that data 1 to be written is to be written to location 2 in the memory of host 30.
rqe2 is in SGL format, that is, rqe2 carries the source address of rqe2 in the memory of IO device 31 and the source address of data 1 to be written in the memory of IO device 31.
Before step 831, IO device 31 further performs the following operations: parsing message 1, obtaining data 1 to be written, and storing the generated rqe2 in RQ3.
Step 840: based on rqe1, host 20 reads data 1 to be written from the memory of IO device 21 and stores it at location 1 in the memory of host 20.
Step 841: based on rqe2, host 30 reads data 1 to be written from the storage area of IO device 31 and stores it at location 2 in the memory of host 30.
Step 850: IO device 11 sends CQE1 to host 10, where CQE1 includes the completion information of SQE1.
Step 850 is performed after IO device 11 has received the ACK messages sent by IO device 21 and IO device 31, each of which indicates that message 1 was successfully received.
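The condition for step 850 can be pictured as a small bookkeeping structure on the initiator side. The following is only a sketch under the assumption that the initiator knows the member list of the multicast group in advance; `AckTracker` and `on_ack` are hypothetical names.

```python
class AckTracker:
    """Tracks which multicast members have acknowledged a message."""

    def __init__(self, members):
        self.pending = set(members)  # members whose ACK is still outstanding

    def on_ack(self, member) -> bool:
        """Record an ACK and return True once every member has acknowledged."""
        self.pending.discard(member)
        return not self.pending


tracker = AckTracker({"IO device 21", "IO device 31"})
first = tracker.on_ack("IO device 21")   # one ACK is still missing
second = tracker.on_ack("IO device 31")  # all ACKs received, CQE1 may now be sent
```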
Step 860: host 20 sends SQE2 to IO device 21, where SQE2 includes the completion information of rqe1.
Step 861: IO device 21 sends IO completion message 1 to INI 1, where IO completion message 1 indicates that IO device 21 has successfully executed write request 1 corresponding to SQE1. Correspondingly, IO device 11 of INI 1 receives IO completion message 1 and sends an ACK message to TGT 1.
Step 862: IO device 11 sends CQE2 to host 10, where CQE2 includes the completion information of SQE2.
Step 870: host 30 sends SQE3 to IO device 31, where SQE3 indicates the completion information of rqe2.
Step 871: IO device 31 sends IO completion message 2 to INI 1, where IO completion message 2 indicates that IO device 31 has successfully executed write request 1 corresponding to SQE1. Correspondingly, IO device 11 of INI 1 receives IO completion message 2 and sends an ACK message to TGT 2.
Step 872: IO device 11 sends CQE3 to host 10, where CQE3 includes the completion information of SQE3.
Step 880: host 10 sends IO completion message 3 to application 1, where IO completion message 3 indicates that write request 1 corresponding to WR1 has been executed.
Before step 880, the following step may also be included: host 10 generates IO completion message 3 based on CQE2 and CQE3.
It can be understood that the execution order of steps 810 to 880 above is only illustrative and does not constitute any limitation. For example, step 841 may be performed before step 840.
The above implementation provides an RDMA-based multicast transmission method that can achieve reliable and efficient data transmission.
Taking the application scenario shown in FIG. 4 as an example, and with reference to the embodiment in FIG. 10, the following describes in detail another specific implementation of the multicast transmission method provided by the embodiments of the present application.
FIG. 10 is a schematic flowchart of a multicast transmission method 1000 provided by an embodiment of the present application. As shown in FIG. 10, method 1000 includes steps 1010 to 1080, which are described in detail below.
In this embodiment of the present application, each of IO device 11, IO device 21, and IO device 31 supports RoCE. When two IO devices exchange packets over RoCE, memory registration (MR) is required. Once an MR is registered, it has the following attributes: the RDMA operation context (context), the address of the registered buffer (address, abbreviated as addr), the length of the registered buffer (length), the local key (lkey) of the MR, and the remote key (rkey) of the MR. Taking INI 1 as an example, a storage area in the memory of host 10 included in INI 1 can be registered with the memory of IO device 11 included in INI 1, after which IO device 11 can operate directly on that storage area in the memory of host 10 (for example, perform read or write operations).
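As an illustration of the MR attributes listed above, the following sketch models a registered region as a plain record and shows one plausible access check. The class `MemoryRegion` and the method `allows_remote` are hypothetical and do not correspond to any real verbs API; the key and bounds check is an assumption about how a device might validate a remote access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    context: str  # RDMA operation context (an opaque handle in this sketch)
    addr: int     # start address of the registered buffer
    length: int   # length of the registered buffer in bytes
    lkey: int     # local key
    rkey: int     # remote key

    def allows_remote(self, rkey: int, addr: int, size: int) -> bool:
        """True if the key matches and [addr, addr + size) lies inside the region."""
        return (rkey == self.rkey
                and size >= 0
                and self.addr <= addr
                and addr + size <= self.addr + self.length)


# Example values are arbitrary.
mr = MemoryRegion(context="ctx0", addr=0x1000, length=256, lkey=0x11, rkey=0x22)
```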
Optionally, before step 1010, INI 1 interacts with TGT 1 to obtain the address of area 1 in the memory of host 20 included in TGT 1 and the remote key value rkey1; that is, INI 1 obtains read and write permission for area 1, after which INI 1 can access area 1 directly, as if accessing its own memory, to perform read or write operations. Likewise, INI 1 interacts with TGT 2 to obtain the address of area 2 in the memory of host 30 included in TGT 2 and the remote key value rkey2; that is, INI 1 obtains read and write permission for area 2, after which INI 1 can access area 2 directly, as if accessing its own memory, to perform read or write operations. The key value of rkey1 is the same as the key value of rkey2; for ease of description, the two are collectively referred to below as the key value of rkey. It can be understood that, for TGT 1, both the key value of rkey and the key value of rkey1 indicate area 1. For TGT 2, both the key value of rkey and the key value of rkey2 indicate area 2.
Step 1010: host 10 obtains SQE1 and stores SQE1 in SQ1.
SQE1 carries write request 1, and write request 1 carries the source address of data 1 to be written in the memory included in host 10. Obtaining SQE1 by host 10 may include the following steps: application 1 in host 10 generates WR1; host 10 calls the interface provided by the driver to convert WR1 into WQE1. WR1 and WQE1 carry the same information and differ only in format. It can be understood that once WQE1 is placed in SQ1, WQE1 is also referred to as SQE1. Hereinafter, WQE1 obtained from SQ1 is uniformly referred to as SQE1.
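The statement that WR1 and WQE1 carry the same information in different formats can be sketched as a conversion between a structured record and a packed byte string. The field set and the 20-byte layout below are assumptions chosen for illustration; the application does not specify the WQE layout.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed layout: 8-byte request id, 8-byte source address, 4-byte length.
_WQE_LAYOUT = struct.Struct("<QQI")


@dataclass(frozen=True)
class WorkRequest:
    wr_id: int     # identifier chosen by the application
    src_addr: int  # source address of the data to be written in host memory
    length: int    # number of bytes to write


def to_wqe(wr: WorkRequest) -> bytes:
    """Driver side: encode the work request as a queue element."""
    return _WQE_LAYOUT.pack(wr.wr_id, wr.src_addr, wr.length)


def from_wqe(raw: bytes) -> WorkRequest:
    """Device side: decode the queue element back into its fields."""
    return WorkRequest(*_WQE_LAYOUT.unpack(raw))


wr1 = WorkRequest(wr_id=1, src_addr=0x2000, length=10)
wqe1 = to_wqe(wr1)  # same information as wr1, different format
```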
Step 1011: IO device 11 obtains SQE1 from SQ1.
SQE1 is in SGL format, that is, SQE1 includes the source address of data 1 to be written and the source address of SQE1. The source address of data 1 to be written is the address of data 1 to be written in the memory included in host 10, and the source address of SQE1 is the location of SQE1 in the memory included in host 10.
Step 1012: based on SQE1, the multicast link information, and the key value of rkey, IO device 11 encapsulates data 1 to be written indicated by SQE1 to generate message 1, where message 1 includes data packet 1.
Optionally, the following step is also included before step 1012: IO device 11 obtains data 1 to be written from the memory of host 10 based on SQE1.
Data packet 1 carries data 1 to be written. The format of data packet 1 may be as shown in (1) in FIG. 9.
Step 1020: IO device 11 sends message 1 to the multicast members of multicast group 1.
The multicast members of multicast group 1 include TGT 1 and TGT 2. Correspondingly, IO device 21 and IO device 31 receive message 1. For the parameters encapsulated in message 1, refer to the second column of Table 1 above.
Step 1030: IO device 21 parses message 1, obtains data 1 to be written and the key value of rkey, and stores data 1 to be written in area 1 indicated by the key value of rkey.
Step 1040: IO device 31 parses message 1, obtains data 1 to be written and the key value of rkey, and stores data 1 to be written in area 2 indicated by the key value of rkey.
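Because every target resolves the same key value locally, a single multicast message can land in a different area on each target. The sketch below illustrates this under simplifying assumptions: host memory is modeled as a `bytearray`, multicast delivery is modeled as a loop, and the names `Target`, `register`, and `on_write` are hypothetical.

```python
class Target:
    """A target whose IO device maps rkey values to local memory areas."""

    def __init__(self, name: str):
        self.name = name
        self.regions = {}  # rkey value -> bytearray standing in for host memory

    def register(self, rkey: int, size: int) -> None:
        self.regions[rkey] = bytearray(size)

    def on_write(self, rkey: int, data: bytes) -> None:
        region = self.regions[rkey]  # raises KeyError for an unknown key
        if len(data) > len(region):
            raise ValueError("data does not fit in the registered area")
        region[:len(data)] = data


RKEY = 0x55  # the shared key value (arbitrary example)
tgt1, tgt2 = Target("TGT 1"), Target("TGT 2")
tgt1.register(RKEY, 16)  # area 1 on TGT 1
tgt2.register(RKEY, 16)  # area 2 on TGT 2

# One message, delivered to both members, written to each member's own area.
for tgt in (tgt1, tgt2):
    tgt.on_write(RKEY, b"data-1")
```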
Step 1050: IO device 11 sends CQE1 to host 10, where CQE1 includes the completion information of SQE1. Correspondingly, host 10 stores the received CQE1 in CQ1.
Step 1060: IO device 21 sends IO completion message 1 to INI 1. Correspondingly, IO device 11 receives IO completion message 1. IO completion message 1 indicates that TGT 1 has successfully executed write request 1 corresponding to SQE1.
Optionally, after step 1060, IO device 11 may also send an ACK message for IO completion message 1 to IO device 21.
Step 1061: IO device 11 sends CQE2 to host 10. Correspondingly, host 10 stores the received CQE2 in CQ1.
Step 1070: IO device 31 sends IO completion message 2 to INI 1.
IO completion message 2 indicates that TGT 2 has successfully executed write request 1 corresponding to SQE1. Correspondingly, IO device 11 receives IO completion message 2.
Optionally, after step 1070, IO device 11 may also send an ACK message for IO completion message 2 to IO device 31.
Step 1071: IO device 11 sends CQE3 to host 10. Correspondingly, host 10 stores the received CQE3 in CQ1.
Step 1080: host 10 sends IO completion message 3 to application 1, where IO completion message 3 indicates that write request 1 corresponding to WR1 has been successfully executed.
Before step 1080, the following step may also be included: host 10 generates IO completion message 3 based on CQE2 and CQE3.
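Generating one completion message from several per-target completions can be sketched as a simple reduction. The record `Cqe`, the function `aggregate`, and the shape of the returned dictionary are assumptions made for illustration; the application does not define the contents of IO completion message 3.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Cqe:
    name: str    # for example "CQE2"
    target: str  # target whose completion this entry reports
    ok: bool     # whether that target reported success


def aggregate(cqes: Iterable[Cqe], members) -> Optional[dict]:
    """Return a completion message once every member has reported, else None."""
    by_target = {c.target: c for c in cqes}
    if not set(members) <= set(by_target):
        return None  # at least one member has not reported yet
    return {"success": all(by_target[m].ok for m in members)}


members = {"TGT 1", "TGT 2"}
partial = aggregate([Cqe("CQE2", "TGT 1", True)], members)
complete = aggregate([Cqe("CQE2", "TGT 1", True), Cqe("CQE3", "TGT 2", True)], members)
```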
It can be understood that the execution order of steps 1010 to 1080 above is only illustrative and does not constitute any limitation. For example, step 1040 may be performed before step 1030.
The above implementation provides an RDMA-based multicast transmission method that can achieve reliable and efficient data transmission.
Taking the application scenario shown in FIG. 4 as an example, and with reference to the embodiment in FIG. 11, the following describes in detail yet another specific implementation of the multicast transmission method provided by the embodiments of the present application.
FIG. 11 is a schematic flowchart of a multicast transmission method 1100 provided by an embodiment of the present application. As shown in FIG. 11, method 1100 includes steps 1110 to 1130, which are described in detail below.
In this embodiment of the present application, data 1 to be written may be stored in area 1 of the memory of host 10. Before step 1110, after INI 1 interacts with TGT 1, TGT 1 can obtain the address VA1 of area 1 and the remote key value rkey1; that is, TGT 1 obtains read and write permission for area 1, after which TGT 1 can access area 1 directly, as if accessing its own memory, to perform read or write operations. Likewise, after INI 1 interacts with TGT 2, TGT 2 can also obtain the address VA1 of area 1 and the remote key value rkey1; that is, TGT 2 obtains read and write permission for area 1, after which TGT 2 can access area 1 directly, as if accessing its own memory, to perform read or write operations. The procedure by which TGT 1 and TGT 2 obtain the data stored in area 1 is described in steps 1110 to 1120 below.
Step 1110: host 10 obtains SQE1 and stores SQE1 in SQ1.
The following step is also included before step 1110: host 10 creates IO command message 1 in the memory of host 10. SQE1 includes IO command message 1, and IO command message 1 instructs that data 1 to be written be stored in the memories of the hosts corresponding to the multicast members of multicast group 1 (that is, host 20 and host 30). Exemplarily, data 1 to be written may be, but is not limited to, 10 bytes of data.
Step 1111: IO device 11 obtains SQE1 from SQ1.
SQE1 is in SGL format, that is, SQE1 includes information indicating that data 1 to be written, which corresponds to IO command message 1, is located in area 1 of the memory of host 10, and SQE1 includes VA1 and rkey1.
Step 1112: IO device 11 sends IO command message 1 to the multicast members of multicast group 1.
The multicast members of multicast group 1 include TGT 1 and TGT 2. Correspondingly, IO device 21 and IO device 31 receive IO command message 1. IO command message 1 may include at least one RoCE data packet; for the parameters carried in the header of the at least one RoCE data packet, refer to the second column of Table 1 above.
Optionally, before step 1112, IO device 11 may also perform the following operation: generating the at least one RoCE data packet based on the multicast link information and the key value of rkey.
It can be understood that IO command message 1 carries VA1 and rkey1 but does not carry data 1 to be written.
Step 1113: IO device 21 sends rqe1 to host 20. Correspondingly, host 20 receives rqe1.
rqe1 is in SGL format, that is, rqe1 includes the source address of IO command message 1 in the memory of IO device 21 and the address of data 1 to be written in the memory of host 10 (that is, VA1).
Optionally, before step 1113, IO device 21 further performs the following operations: parsing IO command message 1 to obtain rqe1, and storing the generated rqe1 in RQ2, where rqe1 indicates that data 1 to be written is to be stored in area 1 of the memory of host 20.
Optionally, after step 1113, IO device 21 also sends an ACK message to IO device 11, where the ACK message indicates that IO device 21 has successfully received IO command message 1.
Step 1114: IO device 31 sends rqe2 to host 30. Correspondingly, host 30 receives rqe2.
Before step 1114, IO device 31 further performs the following operations: parsing IO command message 1 to obtain rqe2, and storing the generated rqe2 in RQ3, where rqe2 indicates that data 1 to be written is to be stored in area 2 of the memory of host 30.
rqe2 is in SGL format, that is, rqe2 includes the source address of IO command message 1 in the memory of IO device 31 and the address of data 1 to be written in the memory of host 10.
Optionally, after step 1114, IO device 31 also sends an ACK message to IO device 11, where the ACK message indicates that IO device 31 has successfully received IO command message 1.
Optionally, after step 1114, INI 1 may use time 1 as the start time of the aggregation time, where time 1 is the time at which IO device 11 receives the last of ACK message 1 and ACK message 2.
Step 1115: IO device 11 sends CQE1 to host 10.
CQE1 indicates that IO command message 1 included in SQE1 has been executed. In some implementations, IO device 11 performs step 1115 only after it has received the ACK messages for IO command message 1 sent by all multicast members of multicast group 1.
Step 1116: host 20 sends SQE2 to IO device 21.
SQE2 may be in SGL format, that is, SQE2 includes the address of SQE2 in the memory of host 20 and the address of area 1, in the memory of host 20, for data 1 to be written. In other words, SQE2 may carry address information 1, which includes VA1 and lkey1: lkey1 indicates the registered region MR2 in IO device 21, and VA1 indicates an area 2 within MR2 that is used to store data 1 to be written. SQE2 may also carry read information 1, which indicates how data 1 to be written is to be obtained from area 1 in the memory of host 10. Read information 1 may include (rkey1, size, offset), where rkey1 indicates area 1, size indicates the size of the data in area 1, and offset indicates the offset of the address of the area to be read relative to the start address of area 1. In one example, read information 1 may specifically indicate that all of data 1 to be written is to be obtained from area 1; in this case, size equals the size of data 1 to be written, and the value of offset indicates the offset from the end address of area 1 to the start address of area 1. In another example, read information 1 may specifically indicate that part of data 1 to be written is to be obtained from area 1; in this case, size equals the size of that part of the data, and the value of offset indicates the offset of the address of that part of the data within area 1. It can be understood that, in this implementation, TGT 1 needs to send multiple (for example, at least two) pieces of read information to obtain the entire content of data 1 to be written.
Step 1117: host 30 sends SQE3 to IO device 31.
SQE3 may be in SGL format, that is, SQE3 includes the address of SQE3 in the memory of host 30 and the address of area 1, in the memory of host 30, for data 1 to be written. In other words, SQE3 may carry address information 2, which includes VA2 and lkey2: lkey2 indicates the registered region MR3 in IO device 31, and VA2 indicates an area 3 within MR3 that is used to store data 1 to be written. SQE3 may also carry read information 2, which indicates how data 1 to be written is to be obtained from the memory of host 10. Read information 2 may include (rkey1, size, offset), where rkey1 indicates area 1, size indicates the size of the data to be read from area 1, and offset indicates the offset of the address of the area to be read relative to the start address of area 1. In one example, read information 2 may specifically indicate that all of data 1 to be written is to be obtained from area 1; in this case, size equals the size of data 1 to be written, and the value of offset indicates the offset from the end address of area 1 to the start address of area 1. In another example, read information 2 may specifically indicate that part of data 1 to be written is to be obtained from area 1; in this case, size equals the size of that part of the data, and the value of offset indicates the offset of the address of that part of the data within area 1. It can be understood that, in this implementation, TGT 2 needs to send multiple (for example, at least two) pieces of read information to obtain the entire content of data 1 to be written.
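The partial-read case, in which a target needs several pieces of read information to cover the whole of data 1 to be written, can be sketched as follows. The sketch assumes that offset is measured from the start address of area 1, as in the general definition of the (rkey1, size, offset) triple; the names `ReadInfo`, `plan_reads`, and `max_size` are hypothetical.

```python
from typing import List, NamedTuple


class ReadInfo(NamedTuple):
    rkey: int    # key indicating the area to read from
    size: int    # number of bytes to read
    offset: int  # assumed here: offset from the start address of the area


def plan_reads(rkey: int, total: int, max_size: int) -> List[ReadInfo]:
    """Cover `total` bytes with consecutive reads of at most `max_size` bytes."""
    if total < 0 or max_size <= 0:
        raise ValueError("total must be non-negative and max_size positive")
    return [ReadInfo(rkey, min(max_size, total - off), off)
            for off in range(0, total, max_size)]


# Ten bytes read in pieces of at most four bytes need three pieces of read information.
reads = plan_reads(rkey=0x22, total=10, max_size=4)
```

When `max_size` is at least as large as the data, the plan degenerates to a single piece of read information, which corresponds to the full-read example under the stated assumption about offset.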
Step 1118: IO device 21 sends read request message 1 to IO device 11. Correspondingly, IO device 11 receives read request message 1.
Optionally, before step 1118, IO device 21 also needs to generate read request message 1 based on SQE2 and the content of the fourth column of Table 1 above; that is, read request message 1 carries address information 1 and read information 1. For the format of read request message 1, see (4) in FIG. 9: read request message 1 includes one data packet, and the extension header of that data packet should include SynETH field 1 and RETH field 1. The value of SynETH field 1 uniquely identifies synchronous read operation request 1, which requests reading data 1 to be written from area 1 of the memory of host 10. The value of RETH field 1 indicates a read operation. Specifically, the value of SynETH field 1 equals tag1, that is, tag1 uniquely identifies synchronous read operation request 1.
IO device 21 sends read request message 1 to IO device 11 by unicast. For the address information encapsulated in read request message 1, refer to the fourth column of Table 1 above.
Step 1119: IO device 31 sends read request message 2 to IO device 11. Correspondingly, IO device 11 receives read request message 2.
Optionally, before step 1119, IO device 31 also needs to generate read request message 2 based on SQE3 and the content of the fourth column of Table 1 above; that is, read request message 2 carries address information 2 and read information 2. For the format of read request message 2, see (4) in FIG. 9: read request message 2 includes one data packet, and the extension header of that data packet should include SynETH field 2 and RETH field 2. The value of SynETH field 2 uniquely identifies synchronous read operation request 1, which requests reading data 1 to be written from area 1 of INI 1. The value of RETH field 2 indicates a read operation. Specifically, the value of SynETH field 2 equals tag1, that is, tag1 uniquely identifies synchronous read operation request 1.
IO device 31 sends read request message 2 to IO device 11 by unicast. For the address information encapsulated in read request message 2, refer to the fourth column of Table 1 above.
Step 1120: IO device 11 sends synchronous read request response message 1 to the members of multicast group 1. Correspondingly, the members of multicast group 1 receive synchronous read request response message 1. The multicast members of multicast group 1 include TGT 1 and TGT 2.
Optionally, before step 1120, IO device 11 may also perform the following step: processing read request message 1 and read request message 2 to generate synchronous read request response message 1, which carries data 1 to be written and the information of multicast group member 1. The processing of read request message 1 and read request message 2 by IO device 11 to generate synchronous read request response message 1 may include the following steps: IO device 11 parses read request message 1 to obtain SynETH field 1 and read information 1 (that is, (rkey1, size, offset)), and parses read request message 2 to obtain SynETH field 2 and read information 2 (that is, (rkey1, size, offset)); IO device 11 determines, based on the value of SynETH field 1 and the value of SynETH field 2, that the read request task corresponding to read request message 1 and the read request task corresponding to read request message 2 are the same task; IO device 11 obtains data 1 to be written from area 1 based on read information 1, and performs encapsulation based on the link information of multicast group 1 and data 1 to be written to generate synchronous read request response message 1. For the content of the link information of multicast group 1, refer to the second column of Table 1 above.
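The matching of read requests by SynETH value can be sketched as a table keyed by tag. The sketch assumes that the initiator answers once every member of the multicast group has sent a request with the same tag and that matching requests ask for the same range; both points are assumptions, and `SyncReadCoalescer` and its fields are hypothetical names.

```python
from collections import defaultdict


class SyncReadCoalescer:
    """Groups unicast read requests by tag and answers them with one response."""

    def __init__(self, members, memory: bytes):
        self.members = set(members)
        self.memory = memory              # stands in for area 1 of host memory
        self.waiting = defaultdict(dict)  # tag -> {requester: (size, offset)}

    def on_read_request(self, requester, tag, size, offset):
        """Return one multicast response once all members asked with this tag."""
        if requester not in self.members:
            raise ValueError("requester is not a member of the multicast group")
        self.waiting[tag][requester] = (size, offset)
        if set(self.waiting[tag]) != self.members:
            return None  # still waiting for the other members
        requests = self.waiting.pop(tag)
        if len(set(requests.values())) != 1:
            raise ValueError("requests with the same tag must ask for the same range")
        size, offset = next(iter(requests.values()))
        # The data is read once and sent once, addressed to the whole group.
        return {"tag": tag,
                "to": sorted(self.members),
                "payload": self.memory[offset:offset + size]}


coalescer = SyncReadCoalescer({"IO device 21", "IO device 31"}, b"0123456789")
pending = coalescer.on_read_request("IO device 21", "tag1", size=10, offset=0)
response = coalescer.on_read_request("IO device 31", "tag1", size=10, offset=0)
```

The benefit suggested by this arrangement is that the initiator reads area 1 once and transmits the data once, instead of sending a separate unicast response to each target.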
步骤1121,IO设备21对同步读请求响应消息1进行解析,获取待写入数据1,并将待写入数据1存储至rkey2的密钥值指示的区域1中。In step 1121, the IO device 21 parses the synchronous read request response message 1, obtains the data 1 to be written, and stores the data 1 to be written into the area 1 indicated by the key value of rkey2.
可选的,在步骤1121之后IO设备21还可以向IO设备11发送ACK消息3,ACK消息3表示IO设备21成功接收到同步读请求响应消息1。Optionally, after step 1121, the IO device 21 may also send an ACK message 3 to the IO device 11, where the ACK message 3 indicates that the IO device 21 has successfully received the synchronous read request response message 1.
Step 1122, the IO device 31 parses synchronous read request response message 1, obtains the data to be written 1, and stores the data to be written 1 in area 2 indicated by the key value of rkey3.
Optionally, after step 1122, the IO device 31 may further send ACK message 4 to the IO device 11, where ACK message 4 indicates that the IO device 31 has successfully received synchronous read request response message 1.
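Steps 1121 and 1122 follow the same pattern on each receiving IO device: look up the area indicated by a key value, copy the payload into it, and optionally acknowledge. A minimal sketch follows; the function name `store_response`, the `offset` parameter, the `"ACK"` return value and the dictionary of areas are hypothetical and are not taken from the application.

```python
def store_response(payload: bytes, regions: dict, rkey: int, offset: int = 0) -> str:
    """Copy a response payload into the area identified by rkey.

    `regions` maps a key value to a bytearray that stands in for a
    registered storage area (an assumption of this sketch).
    """
    region = regions[rkey]
    end = offset + len(payload)
    if end > len(region):
        raise ValueError("payload does not fit in the registered area")
    # Equal-length slice assignment overwrites in place without resizing.
    region[offset:end] = payload
    # The acknowledgement is optional in the described flow.
    return "ACK"
```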
Step 1123, the IO device 11 sends CQE2 to the host 10, where CQE2 includes the completion information of sending synchronous read request response message 1 to multicast group 1.
Step 1124, the host 20 sends SQE4 to the IO device 11, where SQE4 includes the completion information of TGT 1 executing synchronous read request response message 1.
SQE4 is in the SGL format, that is, SQE4 includes the address information of area 1 in TGT 1, where the data to be written 1 is stored.
Step 1125, the host 30 sends SQE5 to the IO device 11, where SQE5 includes the completion information of TGT 2 executing synchronous read request response message 1.
SQE5 is in the SGL format, that is, SQE5 includes the address information of area 2 in TGT 2, where the data to be written 1 is stored.
Step 1126, the IO device 21 sends IO completion message 1 to the IO device 11, where IO completion message 1 indicates that TGT 1 has successfully executed the task corresponding to synchronous read request response message 1. Correspondingly, the IO device 11 receives IO completion message 1.
Optionally, after step 1126, the IO device 11 may further send an ACK message for IO completion message 1 to the IO device 21, indicating that the IO device 11 has received IO completion message 1.
Step 1127, the IO device 11 sends CQE3 to the host 10, where CQE3 includes the completion information indicating that TGT 1 has successfully executed synchronous read request response message 1.
Step 1128, the IO device 31 sends IO completion message 2 to the IO device 11, where IO completion message 2 indicates that TGT 2 has successfully executed the task corresponding to synchronous read request response message 1. Correspondingly, the IO device 11 receives IO completion message 2.
Optionally, after step 1128, the IO device 11 may further send an ACK message for IO completion message 2 to the IO device 31, indicating that the IO device 11 has received IO completion message 2.
Step 1129, the IO device 11 sends CQE4 to the host 10, where CQE4 includes the completion information indicating that TGT 2 has successfully executed synchronous read request response message 1.
Step 1130, the host 10 sends IO completion message 3 to application 1, where IO completion message 3 indicates that IO command message 1 corresponding to SQE1 has been executed.
Before step 1130, the following step may further be included: processor 1 generates IO completion message 3 according to CQE3 and CQE4.
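The aggregation of CQE3 and CQE4 into a single IO completion message can be sketched as follows. This is only an illustration under stated assumptions: the class `CompletionAggregator`, its method names and the use of member names such as "TGT1" as keys are hypothetical, and the failure handling goes beyond what the description states.

```python
class CompletionAggregator:
    """Track per-member completions for one IO command (illustrative only)."""

    def __init__(self, expected_members):
        self.expected = frozenset(expected_members)
        if not self.expected:
            raise ValueError("at least one multicast member is required")
        self.pending = set(self.expected)
        self.failed = False

    def on_cqe(self, member, success=True):
        """Record one completion; return True once the whole command is done."""
        if member not in self.expected:
            raise KeyError(f"unexpected member: {member}")
        if not success:
            # Assumption: one failed member fails the whole command.
            self.failed = True
        self.pending.discard(member)
        return self.done

    @property
    def done(self):
        # The command is complete only when every member has reported success.
        return not self.pending and not self.failed
```

For example, with members "TGT1" and "TGT2", the first completion leaves the command pending and the second one completes it, at which point the host could report the IO completion to the application.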
It can be understood that the execution order of steps 1110 to 1130 above is merely illustrative and does not constitute any limitation. For example, step 1114 may be performed before step 1113, and step 1117 may be performed before step 1116.
The foregoing implementation provides another RDMA-based multicast transmission method, which can achieve reliable and efficient data transmission.
When the computing device (for example, INI 1, TGT 1, or TGT 2) in the foregoing method embodiments is implemented by a virtual machine, the host and the IO device described above correspond to the host and the IO device in the virtual machine, respectively, and the host and the IO device in the virtual machine are implemented by the physical host and the physical IO device that carry their virtual functions. The implementation is similar to that described above and is not repeated here.
The multicast transmission methods described above are merely illustrative and do not limit the multicast transmission method provided by the embodiments of this application. The multicast transmission method provided by the embodiments of this application has been described in detail above with reference to FIG. 3 to FIG. 11. Apparatus embodiments of this application are described in detail below with reference to FIG. 12 and FIG. 13. The descriptions of the method embodiments and of the apparatus embodiments correspond to each other; therefore, for parts not described in detail, refer to the foregoing method embodiments.
FIG. 12 is a schematic structural diagram of a multicast transmission apparatus 1200 provided by an embodiment of this application. The multicast transmission apparatus 1200 shown in FIG. 12 can perform the corresponding steps of the multicast transmission method in the foregoing embodiments. As shown in FIG. 12, the multicast transmission apparatus 1200 includes a transceiver unit 1210 and a processing unit 1220.
In some implementations, the apparatus 1200 is applied to the first IO device, and the transceiver unit 1210 is configured to perform the foregoing step 520, step 540, step 580, step 591, step 592, step 811, step 820, step 850, step 861, step 862, step 871, step 872, step 1101, step 1020, step 1050, step 1061, step 1070, step 1071, step 1111, step 1112, step 1115, step 1118, step 1119, step 1120, step 1123, step 1126, step 1127, step 1128, step 1129, step 310, and step 330. The processing unit 1220 is configured to perform the foregoing step 530, step 812, step 1012, and step 320. For details of these steps, refer to the related descriptions in the method embodiments above; they are not repeated here.
It should be understood that the apparatus 1200 in this embodiment of this application may be used to implement the multicast transmission method in the foregoing embodiments. Specifically, when the apparatus 1200 is hardware, the apparatus 1200 may be the IO device itself, or may be some of the modules in the IO device. When the apparatus 1200 is software, the apparatus 1200 may be a software system deployed in the IO device.
In other implementations, the apparatus 1200 is applied to the second IO device, and the transceiver unit 1210 is configured to perform the foregoing step 540, step 560, step 830, step 831, step 860, step 861, step 870, step 871, step 1060, step 1070, step 1112, step 1114, step 1116, step 1117, step 1118, step 1119, step 1126, and step 1128. The processing unit 1220 is configured to perform the foregoing step 550, step 580, step 590, step 591, step 840, step 841, step 1030, step 1040, step 1121, and step 1122. For details of these steps, refer to the related descriptions in the method embodiments above; they are not repeated here.
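The division of the apparatus 1200 into a transceiver unit and a processing unit can be pictured with the following sketch. It only mirrors the logical split (obtain a request, build a packet that carries multicast information and data, send it); the class names, the `Packet` fields and the in-memory `sent` list are hypothetical and do not describe the actual apparatus.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    multicast_id: int  # stands in for the multicast information
    payload: bytes     # the data to be written


class ProcessingUnit:
    """Builds packets; corresponds loosely to processing unit 1220."""

    def build_packet(self, multicast_id: int, data: bytes) -> Packet:
        return Packet(multicast_id, data)


@dataclass
class TransceiverUnit:
    """Sends packets; corresponds loosely to transceiver unit 1210."""
    sent: list = field(default_factory=list)

    def send(self, packet: Packet) -> None:
        # A real device would put the packet on the wire; here it is recorded.
        self.sent.append(packet)


@dataclass
class MulticastTransmissionApparatus:
    transceiver: TransceiverUnit
    processor: ProcessingUnit

    def handle_request(self, multicast_id: int, data: bytes) -> Packet:
        """Generate one packet for the request and hand it to the transceiver."""
        packet = self.processor.build_packet(multicast_id, data)
        self.transceiver.send(packet)
        return packet
```

Because the split is logical, both units could equally be realized in one module, in hardware, or in software.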
FIG. 13 is a schematic diagram of the hardware structure of a multicast transmission apparatus 1300 provided by an embodiment of this application. The multicast transmission apparatus 1300 shown in FIG. 13 can perform the operation steps of the method implemented by the IO device in the multicast transmission method of the foregoing embodiments.
As shown in FIG. 13, the apparatus 1300 includes a processor 1301, a memory 1302, a communication interface 1303, and a data transmission line 1304. The processor 1301, the memory 1302, and the communication interface 1303 communicate through the data transmission line 1304, and may also communicate by other means such as wireless transmission. The memory 1302 is configured to store instructions, and the processor 1301 is configured to execute the computer instructions or program code stored in the memory 1302.
In this embodiment of this application, the processor 1301 may be a processor in a network interface card or a smart network interface card. The memory 1302 may include computer instructions or program code that can be used to implement the functions of the transceiver unit 1210 in FIG. 12 and/or the functions of the processing unit 1220 in FIG. 12. For the functions of the transceiver unit 1210 and the processing unit 1220, refer to the description above; they are not repeated here.
The memory 1302 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1301. The memory 1302 may further include a non-volatile random access memory. For example, the memory 1302 may further store device type information.
The memory 1302 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The data transmission line 1304 is configured to connect the processor 1301, the memory 1302, and the communication interface 1303.
An embodiment of this application further provides a computer-readable medium that stores program code. When the program code runs on a computer, the computer is caused to perform the method performed by the first IO device or the second IO device described above. Such computer-readable storage includes, but is not limited to, one or more of the following: a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), a flash memory, an electrically erasable PROM (EEPROM), and a hard drive.
An embodiment of this application further provides a computing device, which includes the foregoing host and the foregoing IO device.
An embodiment of this application further provides a computing cluster, which includes a plurality of the foregoing computing devices, where each of the plurality of computing devices includes the foregoing IO device and the foregoing host.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific implementation of this application, but the protection scope of this application is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (23)

  1. A multicast transmission method, applied to a first computing device, wherein the first computing device comprises a first input/output (IO) device, and the method comprises:
    obtaining, by the first IO device, a first request;
    generating, by the first IO device, a first data packet according to first information and the first request, wherein the first information comprises multicast information, the multicast information is used to identify a link connection between multicast members of a multicast group and the first computing device, and the first data packet carries the multicast information and data to be written corresponding to the first request; and
    sending, by the first IO device, the first data packet to the multicast members of the multicast group.
  2. The method according to claim 1, wherein
    the multicast information comprises a multicast identifier, and the multicast identifier is obtained when the first computing device establishes the link connection with the multicast group.
  3. The method according to claim 1 or 2, wherein
    the first data packet further comprises a port number, and the port number is used to indicate that the first data packet is transmitted in a multicast transmission mode.
  4. The method according to any one of claims 1 to 3, wherein
    the first computing device further comprises a first host, the first IO device communicates with the first host through an IO network, and the first request is a request sent by an application running on a processor comprised in the first host.
  5. The method according to claim 4, wherein
    the first information further comprises first window information, and the first window information is used to indicate the minimum amount of data that the multicast members of the multicast group can process within a first time period.
  6. The method according to claim 5, wherein
    the transmission protocol used by the first IO device to send the first data packet is the Transmission Control Protocol/Internet Protocol (TCP/IP).
  7. The method according to claim 4, wherein the data to be written is a part of the data corresponding to the first request, the first information further comprises indication information, the indication information is used to indicate that the data to be written is to be encapsulated, and before the first IO device generates the first data packet according to the first information and the first request, the method further comprises:
    sending, by the first IO device, an IO write command to the multicast members of the multicast group, wherein the IO write command is used to instruct that the data corresponding to the first request be stored in a first registered region (MR), the data corresponding to the first request is located in a memory comprised in the first host comprised in the first computing device, the first MR is a storage area in a memory of a second IO device to which a storage area in a memory comprised in a second host is registered, a second computing device comprises the second host and the second IO device, and the second computing device is a multicast member of the multicast group; and
    receiving, by the first IO device, the indication information sent by the multicast members of the multicast group.
  8. The method according to claim 7, wherein
    the IO write command comprises a second key value and second location information, the second key value is used to identify the second MR, and the second location information is used to indicate the location of the data to be written in the second MR.
  9. The method according to claim 7 or 8, wherein
    the first information further comprises a credit value, the credit value is used to indicate the minimum number of requests that the multicast members of the multicast group can process within a second time period, and a base transport header (BTH) of the first data packet carries the credit value.
  10. The method according to any one of claims 1 to 3, wherein the obtaining, by the first IO device, a first request comprises:
    receiving, by the first IO device, the first request sent by the multicast members of the multicast group, wherein the first request is used to instruct that the data to be written, which is located in a second MR, be stored in a storage area of the multicast members of the multicast group, the second MR is a storage area in a memory of the first IO device to which a storage area in a memory comprised in a first host is registered, and the first computing device further comprises the first host.
  11. The method according to claim 10, wherein
    the first request comprises a first key value, first location information, and a preset field, the first key value is used to identify the second MR, the first location information is used to indicate the location of the data to be written in the second MR, and the value of the preset field is used to indicate the first request.
  12. The method according to any one of claims 8 to 11, wherein
    the transmission protocol used by the first IO device to send the first data packet is Ethernet-based remote direct memory access (RDMA).
  13. The method according to any one of claims 1 to 12, wherein the sending, by the first IO device, the first data packet to the multicast members of the multicast group comprises:
    sending, by the first IO device, the first data packet to a forwarding device, wherein the forwarding device is configured to copy the first data packet and forward the copied first data packet to the multicast members of the multicast group, and the link connection between the multicast members of the multicast group and the first computing device comprises the forwarding device.
  14. The method according to any one of claims 1 to 13, wherein the multicast members of the multicast group comprise a second computing device, and
    after the first IO device sends the first data packet to the multicast members of the multicast group, the method further comprises:
    receiving, by the first IO device, a second request sent by the second computing device, wherein the second request is used to request the data to be written that corresponds to the first request and is carried in the first data packet; and
    sending, by the first IO device, a second data packet to the second computing device, wherein the second data packet carries the data to be written, and a port number comprised in the second data packet is used to indicate that the second data packet is transmitted in a unicast transmission mode.
  15. The method according to any one of claims 1 to 14, wherein the multicast members of the multicast group comprise a second computing device and a third computing device, and
    the method further comprises:
    after the first IO device receives a first completion message and a second completion message, sending, by the first IO device, a third completion message to a processor comprised in the first host, wherein the third completion message is used to indicate that the first request has been successfully executed, the first completion message is used to indicate that the second computing device has successfully executed the first request, and the second completion message is used to indicate that the third computing device has successfully executed the first request.
  16. A multicast transmission apparatus, comprising a transceiver unit and a processing unit, wherein
    the transceiver unit is configured to obtain a first request;
    the processing unit is configured to generate a first data packet according to first information and the first request, wherein the first information comprises multicast information, the multicast information is used to identify a link connection between multicast members of a multicast group and the multicast transmission apparatus, and the first data packet carries the multicast information and data to be written corresponding to the first request; and
    the transceiver unit is further configured to send the first data packet to the multicast members of the multicast group.
  17. The apparatus according to claim 16, wherein
    the multicast information comprises a multicast identifier, and the multicast identifier is obtained when the multicast transmission apparatus establishes the link connection with the multicast group.
  18. The apparatus according to claim 16 or 17, wherein
    the first data packet further comprises a port number, and the port number is used to indicate that the first data packet is transmitted in a multicast transmission mode.
  19. The apparatus according to any one of claims 16 to 18, wherein
    the transmission protocol used by the processing unit to send the first data packet is the Transmission Control Protocol/Internet Protocol (TCP/IP), or Ethernet-based remote direct memory access (RDMA).
  20. An input/output (IO) device, comprising at least one processor and a communication interface, wherein the at least one processor is configured to execute a computer program or instructions, so that the IO device performs the method according to any one of claims 1 to 15.
  21. A computing device, comprising a host and an input/output (IO) device, wherein the host and the IO device communicate through an IO network;
    the host is configured to run an application and send a request generated by the application to the IO device; and
    the IO device is configured to perform the method according to any one of claims 1 to 15.
  22. A computer-readable storage medium, comprising a computer program, wherein when the computer program runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 15.
  23. 一种组播传输系统,其特征在于,所述系统包括第一计算设备和组播组,所述组播组为权利要求1至15中任一项所述的方法中的组播组,所述第一计算设备包括第一输入输出IO设备,所述第一IO设备用于执行权利要求1至15中任一项所述的方法的操作步骤。A multicast transmission system, characterized in that the system includes a first computing device and a multicast group, and the multicast group is the multicast group in the method according to any one of claims 1 to 15, wherein The first computing device includes a first input and output IO device, and the first IO device is used to execute the operation steps of the method described in any one of claims 1 to 15.
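
As an illustration only, the packet-level features recited in claims 17 to 19 and the host-to-IO-device request flow of claim 21 might be sketched as follows. The claims do not define a packet layout, a port value, or any programming interface, so every name here (`IoDevice`, `FirstDataPacket`, `Transport`, `MULTICAST_PORT`, `is_multicast`) and the port value 50000 are assumptions made for the sketch, not part of the application.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical value: the claims only say that a port number signals
# multicast transmission, not which number is used.
MULTICAST_PORT = 50000


class Transport(Enum):
    """The two transport options named in claim 19."""
    TCP_IP = "tcp/ip"
    RDMA_OVER_ETHERNET = "rdma-over-ethernet"


@dataclass(frozen=True)
class FirstDataPacket:
    """Assumed in-memory view of the 'first data packet'."""
    multicast_id: int      # multicast identifier (claim 17)
    port: int              # port number marking multicast mode (claim 18)
    transport: Transport   # transport protocol (claim 19)
    payload: bytes         # request data from the host application


class IoDevice:
    """Toy IO device that remembers identifiers of connected groups."""

    def __init__(self) -> None:
        self._group_ids: dict[str, int] = {}
        self._next_id = 1

    def connect(self, group_name: str) -> int:
        """Simulate link establishment; the identifier is obtained here."""
        if group_name not in self._group_ids:
            self._group_ids[group_name] = self._next_id
            self._next_id += 1
        return self._group_ids[group_name]

    def build_packet(self, group_name: str, request: bytes,
                     transport: Transport = Transport.TCP_IP) -> FirstDataPacket:
        """Wrap a host request; raises KeyError if no link was established."""
        multicast_id = self._group_ids[group_name]
        return FirstDataPacket(multicast_id, MULTICAST_PORT, transport, request)


def is_multicast(packet: FirstDataPacket) -> bool:
    """A receiver can tell the transmission mode from the port number alone."""
    return packet.port == MULTICAST_PORT


if __name__ == "__main__":
    device = IoDevice()
    device.connect("replica-group")  # hypothetical group name
    packet = device.build_packet("replica-group", b"write request")
    print(packet, is_multicast(packet))
```

In this sketch the identifier is stored at connection time and reused for every later packet, which mirrors the wording that the identifier is "obtained when" the link is established; a real device would carry these fields in protocol headers rather than in a Python object.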
PCT/CN2022/139219 2021-12-17 2022-12-15 Multicast transmission method, apparatus and system WO2023109891A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111556558.4 2021-12-17
CN202111556558.4A CN116266800A (en) 2021-12-17 2021-12-17 Multicast transmission method, device and system

Publications (1)

Publication Number Publication Date
WO2023109891A1 (en) 2023-06-22

Family

ID=86743952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/139219 WO2023109891A1 (en) 2021-12-17 2022-12-15 Multicast transmission method, apparatus and system

Country Status (2)

Country Link
CN (1) CN116266800A (en)
WO (1) WO2023109891A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101325536A (en) * 2007-06-15 2008-12-17 上海贝尔阿尔卡特股份有限公司 Base station of WiMAX system, method and apparatus for controlling transmission of multicast data packet in gateway
CN104378217A (en) * 2014-11-26 2015-02-25 中国联合网络通信集团有限公司 Method and device for determining multicast group data
CN109067578A (en) * 2018-07-31 2018-12-21 杭州迪普科技股份有限公司 A kind of method and apparatus of rapidly channel switching
CN110768709A (en) * 2018-07-27 2020-02-07 清华大学 Multicast and unicast cooperative data transmission method, server and terminal
CN110768708A (en) * 2018-07-27 2020-02-07 清华大学 Multicast method, server and terminal based on communication satellite

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240642A (en) * 2023-11-15 2023-12-15 常州楠菲微电子有限公司 IB multicast message copying and receiving device and method
CN117240642B (en) * 2023-11-15 2024-01-19 常州楠菲微电子有限公司 IB multicast message copying and receiving device and method

Also Published As

Publication number Publication date
CN116266800A (en) 2023-06-20

Similar Documents

Publication Publication Date Title
US20220197838A1 (en) System and method for facilitating efficient event notification management for a network interface controller (nic)
US11470000B2 (en) Medical device communication method
US10013390B2 (en) Secure handle for intra-and inter-processor communications
US11381514B2 (en) Methods and apparatus for early delivery of data link layer packets
US7817634B2 (en) Network with a constrained usage model supporting remote direct memory access
US10880204B1 (en) Low latency access for storage using multiple paths
US10320677B2 (en) Flow control and congestion management for acceleration components configured to accelerate a service
US11886940B2 (en) Network interface card, storage apparatus, and packet receiving method and sending method
WO2021204091A1 (en) Method and device for clearing buffer
WO2023040949A1 (en) Network interface card, message sending method and storage apparatus
WO2023109891A1 (en) Multicast transmission method, apparatus and system
US10326696B2 (en) Transmission of messages by acceleration components configured to accelerate a service
WO2020007278A1 (en) Data transmitting method and device, and data receiving method and device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22906636

Country of ref document: EP

Kind code of ref document: A1