CN112291293B - Task processing method, related equipment and computer storage medium

Publication number
CN112291293B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910687998.XA
Other languages
Chinese (zh)
Other versions
CN112291293A (en)
Inventor
王庚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201910687998.XA
Publication of CN112291293A
Application granted
Publication of CN112291293B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Abstract

The embodiments of the invention disclose a task processing method, a task processing device and a computer storage medium. The task processing method comprises the following steps: a first device obtains a task network graph, wherein the task network graph comprises at least one communication task request and/or at least one calculation task request that form a service communication; if the task network graph comprises the communication task request, the data transmission operation indicated by the communication task request is performed by using Remote Direct Memory Access (RDMA) technology; if the task network graph comprises the calculation task request, the data calculation operation indicated by the calculation task request is executed; and after the task network graph has been executed, the service communication is complete. The embodiments of the invention can solve the problems of the traditional scheme, in which task requests are issued frequently and no functional operation other than transmission can be implemented.

Description

Task processing method, related equipment and computer storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a task processing method, a related device, and a computer storage medium.
Background
Infiniband (IB) is a switching network designed to meet the requirements of high bandwidth and low latency. For details, see the IB Specification Vol 1-Release-1.3. It is composed of a transport layer, a network layer, a link layer and a physical layer. The transport layer provides an IB resource management interface and an Input Output (IO) interface for applications, and completes service communication with a remote device by means of queues. The IO interfaces mainly comprise a send interface (post-send), a receive interface (post-recv) and a queue polling interface (poll-cq).
The host device provides transport layer instances to communicate with remote devices through the IB network. A single transport layer instance is called a Queue Pair (QP) and consists of a Send Queue (SQ) and a Receive Queue (RQ). Multiple QPs may be used simultaneously in a host device to communicate with multiple remote devices. When the host device and a remote device perform actual service communication, each task request in the service communication is associated with one QP; for example, a send task request is issued to the send queue SQ, and a receive task request is issued to the receive queue RQ. Accordingly, a remote direct memory access (RDMA) execution engine of the host device executes the task requests in the order in which they are actually issued, which may involve frequent interaction for task requests, affect device performance, and reduce task processing efficiency. In addition, the RDMA execution engine only has a transmission function for sending and receiving data and does not support other operation functions, such as data calculation, so the requirements of actual service communication may not be met.
Disclosure of Invention
The embodiment of the invention discloses a task processing method, related equipment and a computer storage medium, which can solve the problems that task requests are frequently issued and other functional operations except a transmission function cannot be realized in the traditional scheme.
In a first aspect, an embodiment of the present invention discloses a method for task processing, where the method is applied to a first device side, and the method includes: the method comprises the steps of obtaining a task network graph, wherein the task network graph comprises at least one communication task request and/or at least one calculation task request which form service communication. Any two communication task requests support parallel execution or serial execution, the communication task requests are used for requesting to perform data transmission operation indicated by the communication task requests, and the calculation task requests are used for requesting to perform data calculation operation indicated by the calculation task requests. When the task network graph comprises the communication task request, performing data transmission operation indicated by the communication task request by adopting a Remote Direct Memory Access (RDMA) technology; when the task network graph comprises the computing task request, executing the data computing operation indicated by the computing task request; and after the task network graph is executed, the service communication is completed.
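By way of a non-limiting illustration, the following Python sketch models a task network graph and executes it in dependency order. It is a minimal sketch under stated assumptions, not the claimed implementation: the names TaskRequest, rdma_transfer, compute and run_graph are hypothetical, and the transfer function is a stub that performs no real RDMA operation.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class TaskRequest:
    task_id: str
    kind: str  # "comm" (communication) or "compute" (calculation)
    depends_on: list = field(default_factory=list)


def rdma_transfer(task):
    # Hypothetical stand-in for the data transmission operation.
    return f"transferred:{task.task_id}"


def compute(task, inputs):
    # Hypothetical stand-in for the data calculation operation.
    return f"computed:{task.task_id}({len(inputs)} inputs)"


def run_graph(tasks):
    """Execute each task request after every request it depends on."""
    by_id = {t.task_id: t for t in tasks}
    sorter = TopologicalSorter({t.task_id: set(t.depends_on) for t in tasks})
    results = {}
    for task_id in sorter.static_order():
        task = by_id[task_id]
        if task.kind == "comm":
            results[task_id] = rdma_transfer(task)
        else:
            inputs = [results[dep] for dep in task.depends_on]
            results[task_id] = compute(task, inputs)
    return results  # the service communication is complete at this point


graph = [
    TaskRequest("recv_a", "comm"),
    TaskRequest("recv_b", "comm"),
    TaskRequest("sum", "compute", ["recv_a", "recv_b"]),
    TaskRequest("send_c", "comm", ["sum"]),
]
results = run_graph(graph)
```

The sketch runs the requests serially; requests with no dependency between them (here recv_a and recv_b) are the ones that could be executed in parallel.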
With reference to the first aspect, in some possible embodiments, the communication task request carries a transmission type. When the transmission type indicates that the communication task request is a send task request, the first device obtains data to be sent and sends it by using RDMA technology. Alternatively, when the transmission type indicates that the communication task request is a receive task request, the first device uses RDMA technology to receive the data to be received that is sent by the second device.
With reference to the first aspect, in some possible embodiments, the communication task request further carries a source data address. When the communication task request is a send task request, the data to be sent is stored at the source data address. When the communication task request is a receive task request, the first device can store the data to be received at the source data address.
With reference to the first aspect, in some possible embodiments, when the communication task request further carries an identifier of a first task request that the communication task request depends on, the data to be sent and the data to be received are execution results of the first task request, and the first task request is the communication task request or the computation task request. Or, when the communication task request does not carry the identifier of the first task request depended on by the communication task request, the data to be transmitted and the data to be received may be pre-stored data of the remote device (pre-stored data for short).
With reference to the first aspect, in some possible embodiments, the communication task request further carries the identifier of the queue group QP associated with the communication task request. When the communication task request is a send task request, the first device issues the send task request to the send queue in the queue group QP corresponding to the queue group identifier, and waits for the RDMA execution engine of the first device to execute the send task request. When the communication task request is a receive task request, the first device issues the receive task request to the receive queue in the queue group QP corresponding to the queue group identifier, and waits for the RDMA execution engine of the first device to execute the receive task request.
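As an illustrative sketch only, the routing of a communication task request by its transmission type and queue group identifier could look as follows in Python. The sketch assumes the pairing stated in the Background (send task requests go to the send queue SQ, receive task requests go to the receive queue RQ); the class QueuePair, the function issue and the dictionary keys are hypothetical names.

```python
from collections import deque


class QueuePair:
    """Hypothetical model of one QP: a send queue and a receive queue."""

    def __init__(self, qp_id):
        self.qp_id = qp_id
        self.sq = deque()  # send queue
        self.rq = deque()  # receive queue


def issue(qps, request):
    """Place a communication task request in the queue matching its type."""
    qp = qps[request["qp_id"]]
    if request["type"] == "send":
        qp.sq.append(request)
    elif request["type"] == "recv":
        qp.rq.append(request)
    else:
        raise ValueError(f"unknown transmission type: {request['type']}")
    return qp


qps = {1: QueuePair(1), 3: QueuePair(3)}
issue(qps, {"id": "WR1", "type": "recv", "qp_id": 1})
issue(qps, {"id": "WR3", "type": "send", "qp_id": 3})
```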
With reference to the first aspect, in some possible embodiments, when the first device queries a completion queue entry for the communication task request in the completion queue, or receives a completion notification message for completion of execution of the communication task request, it determines that the communication task request is completely executed. The completion queue stores a completion queue entry used for indicating the identifier of the executed communication task request.
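The two ways of detecting completion described above can be sketched as follows. This is a hedged illustration: the function name is_complete and the shapes of the completion queue entries and the notification set are assumptions made for the example.

```python
from collections import deque


def is_complete(request_id, completion_queue, notifications):
    """Return True if a CQE or a completion notification names request_id."""
    in_cq = any(cqe["request_id"] == request_id for cqe in completion_queue)
    return in_cq or request_id in notifications


# WR1 finished and left a completion queue entry; WR2 finished and was
# reported through a completion notification message; WR3 is still pending.
cq = deque([{"request_id": "WR1"}])
notified = {"WR2"}
```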
With reference to the first aspect, in some possible embodiments, the calculation task request carries a calculation operation type, and the first device may obtain data to be calculated, and perform a calculation operation indicated by the calculation operation type on the data to be calculated, so as to obtain an execution result of the calculation task request.
With reference to the first aspect, in some possible embodiments, the computing task request further carries a source data address. The source data address is used to store data to be computed.
With reference to the first aspect, in some possible embodiments, the computation task request further carries an identifier of a second task request on which the computation task request depends. The data to be calculated is the execution result of the second task request.
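As a sketch of how a calculation task request might be resolved, the Python below takes the data to be calculated either from a source data address or from the execution result of the task request it depends on, and then applies the operation named by the calculation operation type. The operation names ("sum", "max", "min"), the dictionary-based memory model and the function run_compute are assumptions for illustration; the source does not enumerate the supported operation types.

```python
from functools import reduce
import operator

# Hypothetical calculation operation types.
OPERATIONS = {"sum": operator.add, "max": max, "min": min}


def run_compute(request, memory, results):
    """Resolve the data to be calculated, then apply the requested operation."""
    if request.get("depends_on") is not None:
        operands = results[request["depends_on"]]  # result of the dependency
    else:
        operands = memory[request["src_addr"]]     # data at the source address
    return reduce(OPERATIONS[request["op_type"]], operands)


memory = {0x1000: [3, 5, 4]}
results = {"WR1": [10, 20]}
from_memory = run_compute({"op_type": "max", "src_addr": 0x1000}, memory, results)
from_dependency = run_compute({"op_type": "sum", "depends_on": "WR1"}, memory, results)
```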
With reference to the first aspect, in some possible embodiments, the first device may perform task decomposition on the service communication, and obtain at least one service task request and an execution dependency relationship of the at least one service task request, where the execution dependency relationship includes a data dependency relationship and a task execution relationship. The task execution relation includes serial execution and/or parallel execution, the data dependency relation is used for indicating other task requests depended on when the business task request is executed, and the business task request may specifically include a communication task request and/or a calculation task request. And the first equipment generates a task network graph according to the execution dependency relationship of the at least one service task request and the at least one service task request.
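One possible way to derive the task execution relation from the data dependency relation is sketched below: task requests are grouped into stages so that the requests within a stage do not depend on one another and may run in parallel, while the stages themselves run serially. The function parallel_stages and the request names are hypothetical, and the use of a topological sort is an assumption of this sketch rather than a technique stated in the source.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_stages(dependencies):
    """Group task requests into stages; requests in one stage are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Each key maps a task request to the set of requests it depends on.
deps = {
    "WR1": set(),
    "WR2": set(),
    "merge": {"WR1", "WR2"},
    "WR3": {"merge"},
}
stages = parallel_stages(deps)
```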
In a second aspect, an embodiment of the present invention discloses a network device (specifically, a network interface card), including a task execution engine and an RDMA execution engine. The task execution engine is used for acquiring a task network diagram, and the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication. The task execution engine is also used for calling the RDMA execution engine to execute the data transmission operation indicated by the communication task request if the task network graph comprises the communication task request; and if the task network graph comprises the calculation task request, executing data calculation operation indicated by the calculation task request. The task execution engine is also used for determining to complete the business communication after the task network graph is executed.
With reference to the second aspect, in some possible embodiments, the communication task request carries a transmission type. The task execution engine is specifically configured to: if the transmission type indicates that the communication task request is a send task request, store the communication task request in the form of a task request entry in the send queue of the corresponding queue group QP, and wait for the RDMA execution engine to execute the send task request; and if the transmission type indicates that the communication task request is a receive task request, store the communication task request in the form of a task request entry in the receive queue of the corresponding queue group QP, and wait for the RDMA execution engine to execute the receive task request. Specifically, if the communication task request is a send task request, the RDMA execution engine executes the send task request, obtains the data to be sent, and sends it by using RDMA technology. If the communication task request is a receive task request, the RDMA execution engine executes the receive task request and uses RDMA technology to receive the data to be received that is sent by the host equipment.
In some possible embodiments, in combination with the second aspect, the RDMA execution engine is further configured to send, after the execution of the corresponding task request, a completion notification message for the corresponding task request to the task execution engine, so as to notify that the corresponding task request is completed. Or after the RDMA execution engine executes the communication task request, a completion queue entry can be automatically generated and added to the completion queue, and the completion queue entry is used for indicating that the communication task request is executed completely.
In combination with the second aspect, in some possible embodiments, the computing task request carries a computing operation type. The task execution engine is used for acquiring the data to be calculated, executing the calculation operation indicated by the calculation operation type on the data to be calculated, and obtaining the execution result of the calculation task request.
For the content that is not shown or not described in the embodiment of the present invention, reference may be specifically made to the explanation in the embodiment of the method described in the foregoing first aspect, and details are not described here again.
In a third aspect, an embodiment of the present invention provides a first device, where the first device includes a functional unit configured to perform the method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a first device, which includes a network interface card and a host processor. The host processor is used for generating a task network graph according to at least one service task request of the obtained service communication and the execution dependency relationship of the at least one service task request. The network interface card is used for acquiring the task network diagram from the host processor and executing the task network diagram to realize the service communication. For details that are not shown or described in the embodiments of the present invention, reference may be made to the related explanations in the foregoing embodiments, and details are not described here.
In a fifth aspect, an embodiment of the present invention provides a network device (specifically, a network interface card), including a memory and a processor coupled to the memory; the memory is configured to store instructions, and the processor is configured to execute the instructions; when the processor executes the instructions, the processor executes the method described in the fourth aspect with the network interface card as the execution subject.
In a sixth aspect, an embodiment of the present invention provides another first device, including a memory and a processor coupled to the memory; the memory is configured to store instructions, and the processor is configured to execute the instructions; wherein the processor, when executing the instructions, performs the method described in the first aspect.
In some possible embodiments, the first device further includes a communication interface, which is in communication with the processor, and the communication interface is used for communicating with other devices (such as network devices and the like) under the control of the processor.
In a seventh aspect, a computer-readable storage medium storing program code for task processing is provided. The program code comprises instructions for performing the method described in the first or second aspect above.
The invention can be further combined to provide more implementation modes on the basis of the implementation modes provided by the aspects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a diagram of an RDMA network framework provided by the prior art.
Fig. 2 is a schematic diagram of a conventional network framework provided by the prior art.
Fig. 3 is a schematic diagram of an IB network framework according to an embodiment of the present invention.
Fig. 4A-4B are schematic diagrams of two task request processes provided by the embodiment of the invention.
Fig. 5 is a schematic diagram of a task processing framework according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a network framework according to an embodiment of the present invention.
Fig. 7 is a flowchart illustrating a task processing method according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a task network diagram according to an embodiment of the present invention.
Fig. 9 is a flowchart illustrating another task processing method according to an embodiment of the present invention.
Fig. 10A-10B are schematic diagrams of two services based on tree networking according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of another task network diagram provided by an embodiment of the present invention.
Fig. 12 is an operation diagram of a task process according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a host device according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of another host device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings of the present invention.
First, some technical knowledge related to the present application is introduced.
1. Remote Direct Memory Access (RDMA) technology
RDMA technology was proposed to address server-side data processing delays in network transmission. RDMA transfers data over the network directly into a storage area of a host device, quickly moving data from one system into the memory of a remote system without affecting the operating system and without consuming much of the host device's computing resources. Three RDMA technologies are currently supported: Infiniband (IB), RoCE and iWARP. By avoiding data copies, RDMA provides low latency, reduces CPU utilization, relieves memory bandwidth bottlenecks, and provides high bandwidth utilization. RDMA provides channel-based IO operations, allowing applications that use RDMA technology to read and write remote virtual memory directly.
As shown in fig. 1, in a conventional network (e.g., a socket network), an application requests a network resource from the operating system, and the network resource is transferred by a system call. As shown in fig. 1, the application program is stored in a virtual buffer, which creates an instance buffer in the operating system, and accesses the application program stored in the operating system of the host device through the network interface card (NIC). In other words, in the conventional network, the network resource (application program) is owned by the operating system of the local device; the user cannot directly access and obtain the network resource, and must rely on the operating system to move the network resource (i.e. data) out of the virtual buffer of the application program and then transmit it onto the line through the protocol stack. Accordingly, in the remote device, the application program needs to rely on the operating system to obtain the data from the line and store it in the virtual buffer.
In contrast, RDMA allows applications to exchange messages directly once a channel has been established through the operating system, without further intervention by the operating system. The message may be an RDMA read message, an RDMA write message, an RDMA receive message, an RDMA send message, or the like. Specifically, as shown in fig. 2, the respective applications of the local device and the remote device are stored in the corresponding virtual buffers of the devices, and the devices may directly access and acquire each other's data by using RDMA technology. For example, the local device may use RDMA technology to access applications stored in buffers in the remote device.
2. Infiniband (IB) technology
Infiniband (IB) is a computer network communication standard used in high-performance computing. It offers very high throughput and very low latency and is used for data interconnects between computers. Infiniband also serves as a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems.
Infiniband is a switched-fabric IO technology. Its design idea is to establish a single connection link among devices such as remote storage, networks and servers through a central mechanism (specifically, a central Infiniband switch), which directs the traffic. This greatly improves the performance, reliability and effectiveness of the system and relieves data congestion among hardware devices. Fig. 3 is a schematic diagram of an IB network framework. The network framework shown in fig. 3 includes: a processing node 301 (processor node), a storage node 302 (storage node), an Input Output (IO) node 303, and an Infiniband switch 304.
The number of processing nodes 301 is not limited, and one is illustrated as an example. The processing node 301 may include a Central Processing Unit (CPU), a memory, and a Host Channel Adapter (HCA). The number of the central processing units CPU is not limited, and the CPU, the memory, and the HCA may be connected by a bus (e.g., PCIe bus).
Storage nodes 302 may include, but are not limited to, RAID subsystems, storage subsystems, or other system nodes for data storage. The RAID subsystem includes a processor, a memory, a target channel adapter (TCA), a Small Computer System Interface (SCSI), and a storage resource, where the storage resource includes, but is not limited to, a hard disk, a magnetic disk, or other storage devices. The storage subsystem may include a controller, a target channel adapter TCA, and a storage resource.
The IO node may specifically be an input/output IO device, which may include at least one adapted IO unit (IO module) that supports connections with images, hard disks, networks, and the like. The number of Infiniband switches is usually multiple and they together form an Infiniband switching network, referred to as an Infiniband network. Optionally, the Infiniband switch also supports communications with router.
In practical applications, the nodes communicate with each other through an Infiniband switch, such as the Infiniband switch in fig. 3 supports communication with TCA and HCA, and the communication link thereof may be referred to as Infiniband link. Infiniband link is an optical fiber connecting the HCA and the TCA, and in an Infiniband network framework, a hardware manufacturer is allowed to connect the TCA and the HCA in three ways of 1 fiber, 4 fibers and 12 fibers. In addition, as shown in fig. 3, the HCA is a bridge connecting the memory controller and the TCA, and the TCA packages and transmits the digital signals of the IO devices (e.g., the network card and the SCSI controller) to the HCA.
3. Queue group QP
QP is an important concept in Infiniband; it refers to the combination of a receive queue and a send queue. In practical applications, when the host device calls an Application Programming Interface (API) to send or receive data, the data request (for example, a task request) is actually placed in the QP, and the requests in the QP are then processed one by one in a polling manner. Fig. 4A is a schematic diagram of request processing. As shown in the figure, a task request (WR) generated in the host device exists in a task queue (WQ) in the form of a task request entry (WRE). When a task request in the WQ has been processed by hardware in the host device, a corresponding Completion Queue Entry (CQE) is generated and stored in the Completion Queue (CQ) as a task completion (WC). In practical applications, the WQ may specifically be a Receive Queue (RQ) or a Send Queue (SQ).
Fig. 4B is a specific processing diagram of task requests in the send queue and the receive queue. As shown in fig. 4B, a send task request (send WQE) is stored in the send queue SQ, and a receive task request (receive WQE) is stored in the receive queue RQ. When the hardware of the host device processes the send task request, it uses RDMA technology to read the data that the WQE requests to send from a preset read memory (read buffer), writes the data into the WQE (i.e. RDMA write WQE), and finally sends the WQE (send WQE); that is, the hardware executes the WQE to send the corresponding data. Accordingly, when the hardware of the host device processes the receive task request, the data requested by the receive task request may be stored in a receive memory (receive buffer).
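The life cycle of a task request described above (posted into a task queue, processed by hardware, reported through the completion queue, then polled by the host) can be illustrated with the following simplified Python model. The function names post, hardware_step and poll_cq and the entry fields are hypothetical; real hardware and the IB IO interfaces are not modelled.

```python
from collections import deque

work_queue = deque()        # WQ: holds task request entries
completion_queue = deque()  # CQ: holds completion queue entries


def post(request_id):
    """Host side: place a task request in the task queue as an entry."""
    work_queue.append({"wr_id": request_id})


def hardware_step():
    """Process the oldest entry and report it through the completion queue."""
    if work_queue:
        entry = work_queue.popleft()
        completion_queue.append({"wr_id": entry["wr_id"], "status": "ok"})


def poll_cq():
    """Host side: return the next completion entry, or None if none is ready."""
    return completion_queue.popleft() if completion_queue else None


post("WR1")
post("WR2")
hardware_step()      # only WR1 has been processed so far
first = poll_cq()    # completion entry for WR1
second = poll_cq()   # nothing else has completed yet
```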
Next, a flow framework for task request processing is introduced.
Fig. 5 is a schematic diagram of a task processing framework according to an embodiment of the present invention. As shown in fig. 5, completing one service communication involves three task requests (WRs), shown in the figure as post WR1 to post WR3. Each WR exists in a QP in the form of a task request entry (WRE), and each WR is associated with one QP: a send task request is issued to the send queue SQ, and a receive task request is issued to the receive queue RQ. After the host device finishes processing each WRE, a corresponding Completion Queue Entry (CQE) is generated and stored in the completion queue CQ to indicate that the WRE has been executed.
As shown, host device 0 (host0) may communicate with host device 1 (host1) via QP1, host device 2 (host2) via QP2, host device 3 (host3) via QP3, and so on up to QPn. QP1 through QPn are associated with the same completion queue CQ. In this example, the actual service communication requires host0 to merge the data from host1 and host2 and then send the merged data to host3.
Specifically, 3 task requests (post WR1 to post WR3) are generated in the service communication, hereinafter referred to as WR1 to WR3. WR1 is a receive task request (recv WR1) indicating a request to receive data A from host1; host0 may issue WR1 in the form of WRE1 to the RQ of QP1. WR2 is also a receive task request (recv WR2), indicating a request to receive data B from host2. After issuing WR1, host0 issues WR2 in the form of WRE2 to the RQ of QP2. When host0 completes WR1 and WR2, that is, has received the data from host1 and host2 respectively, the corresponding CQE1 and CQE2 may be generated to indicate that WR1 and WR2 have been executed.
Accordingly, after host0 has finished issuing WR1 and WR2, it may poll the completion queue (poll CQ) to learn whether WR1 and WR2 have completed. Once they have, host0 can obtain the received data A and data B corresponding to WR1 and WR2, respectively. Host0 can then merge or superpose data A and data B under program control to obtain new data C, that is, data C = data A + data B. Host0 then issues WR3 to the SQ of QP3 for execution. WR3 requests that data C be sent to host3. Accordingly, after the RDMA execution engine of host0 receives WR3, it obtains data C in response to WR3 and sends it to host3. After WR3 has been executed, a corresponding CQE3 may be generated and stored in the CQ. Host0 can then actively query the CQ (poll CQ) for CQE3 to learn that the service communication is complete, at which point the CPU of host0 becomes idle.
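The host-driven flow of this example can be traced with the short Python sketch below, which records every step that the CPU of host0 has to initiate. It is an illustration only: the function conventional_flow is hypothetical, no data is actually received or sent, and treating "data C = data A + data B" as an element-wise sum of two lists is an assumption made for the example.

```python
def conventional_flow(data_a, data_b):
    """Trace the host-driven flow; every logged step is initiated by the CPU."""
    log = []
    log.append("post recv WR1 to RQ of QP1")
    log.append("post recv WR2 to RQ of QP2")
    log.append("poll CQ for CQE1 and CQE2")
    # The merge is performed on the CPU side (assumed element-wise here).
    data_c = [a + b for a, b in zip(data_a, data_b)]
    log.append("post send WR3 to SQ of QP3")
    log.append("poll CQ for CQE3")
    return data_c, log


data_c, log = conventional_flow([1, 2, 3], [10, 20, 30])
```

Each logged step corresponds to one interaction between the host CPU and its RDMA execution engine, in addition to the calculation that the CPU performs itself.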
In practice it has been found that when a service communication involves task requests in multiple QPs and the task requests depend directly on one another, the host device needs to interact frequently over the PCIe bus with its RDMA execution engine (specifically, the host interface card HCA), and the process is complex and cumbersome. Moreover, the host device must handle the dependencies between WRs on the CPU side, because this cannot be done on the HCA side; for example, the simple data calculation above is also performed on the CPU side rather than the HCA side. As a result, considerable CPU resources of the host device are consumed, and the frequent interaction also increases data processing latency and reduces the efficiency of service communication.
In addition, the RDMA execution engine executes according to the issuing sequence of the task requests WR, can only ensure the sequential execution of different task requests, and cannot support the synchronous processing of different task requests in the same QP. Moreover, the RDMA execution engine only has a data transmission function, cannot realize data calculation, and cannot meet the calculation requirements of actual service communication.
In order to solve the above problems, the present application proposes another task processing method and an associated device to which the method is applied. Fig. 6 is a schematic diagram of a network framework according to an embodiment of the present invention. The network framework shown in fig. 6 includes N host devices and M storage devices (storage) that communicate with each other through an IB network, where M and N are positive integers. The figure shows 4 host devices and 2 storage devices as an example, namely host device 1 to host device 4 and storage device 1 to storage device 2. As shown, any host device 600 (e.g., host device 1) includes a host processor (host processor) 601, a host memory (host memory) 602, and a network interface card (also referred to as a host interface card, HCA) 603. Optionally, a control processor (control processor) 604 may also be included. Wherein:
the processor (which may be specifically the host processor 601 or the control processor 604) may include one or more processing units, such as: the processor may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can be the nerve center and command center of the host device 600. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor for storing instructions and data. In some embodiments, the memory in the processor is a cache memory. The memory may hold instructions or data that the processor has just used or uses repeatedly. If the processor needs to use the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor, thereby increasing the efficiency of the system.
Host memory 602 may include volatile memory (volatile memory), such as Random Access Memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a Hard Disk Drive (HDD), or a solid-state drive (SSD); the memory may also comprise a combination of the above kinds of memories. The memory may be configured to store a set of program code to facilitate the processor in invoking the program code stored in the memory to implement the corresponding functionality.
As shown, the host memory 602 may include applications and a communication library. An application may specifically be an application that is custom-configured in the system and supported to run on the host processor 601, such as the High Performance Computing (HPC) application and the Artificial Intelligence (AI) application in the figure. The communication library provides a communication interface to enable the host device 600 to communicate with other devices. As shown, the communication library may include a message-passing interface (MPI) communication library and a collective communication (NCCL) communication library.
Specifically, in the HPC or AI application scenario, multiple host devices access the IB network through the HCA 603 and may form a host cluster. The HPC application supports calling a collective communication interface provided by the MPI communication library to complete communication. The AI application can also call a collective communication interface provided by a communication library such as NCCL to complete the communication. The MPI or NCCL communication library provides an RDMA interface through the HCA, in particular a send interface, a receive interface, and a completion queue polling (poll CQ) interface, to complete the actual communication.
The network interface card 603 is responsible for processing data, such as parsing a task network diagram and processing the task requests included in the task network diagram, as described later in the present application. As shown, the network interface card 603 includes a host interface (host interface) 6031, a task executor (WR executor) 6032, and an IB network interface (IB interface) 6033. Wherein:
the host interface 6031 is used to enable communication between the network interface card 603 and the host device 600, such as between the network interface card 603 and the control processor 604. The IB network interface 6033 is used to connect the host device 600 to the IB network so that it can communicate with other host devices in the network over the IB network.
The task executor 6032, also called a task execution engine, is configured to process a task request included in the business communication. Optionally, the task executor 6032 includes a control unit and a calculation unit therein. The control unit is used for controlling the logic sequence of task request processing, and the computing unit is used for realizing computing operation indicated by the task request, such as fusion or summation processing of data A and data B, and the like. The calculation operation can be specifically set by the system, for example, SUM, MAX, MIN, and so on.
In actual practice, the task executor 6032 may interact with queues (queues) in the host device. For example, a send task request is issued to the SQ, a receive task request is issued to the RQ, and a request completion event is issued to the CQ for storage. The request completion event, which may also be referred to as a completion notification message or a completion notification event, is used to indicate that processing of the task request has been completed. Specifically, whether the task request is a send task request or a receive task request (both are communication task requests), it is issued to the corresponding queue group QP in the form of a task request entry WRE, and the completion notification event is issued to the corresponding completion queue CQ in the form of a completion queue entry CQE.
Optionally, an RDMA execution engine (not shown) is also included in the host device. In practical applications, the RDMA execution engine and the task execution engine (task executor 6032) cooperate with each other to complete the processing of all task requests included in the service communication, so as to implement the corresponding service communication. For example, the task executor 6032 obtains the task network graph, parses the task requests included in the task network graph, and calls the RDMA execution engine according to the logical sequence of the task requests to complete the task processing corresponding to the task requests.
For the communication task request, the HCA puts the communication task request into the corresponding queue group QP through the control unit of the task execution engine 6032, specifically storing it in the form of a task request entry WRE, and waits for the RDMA execution engine to schedule and execute the communication task request. After the communication task request is completed, a corresponding completion queue entry CQE is generated and filled in the corresponding completion queue. The completion queue entry is used to indicate that the communication task request has been executed. Usually, completion queue entries correspond to communication task requests one-to-one, that is, each communication task request corresponds to one completion queue entry and also to one task request entry. Optionally, the completion queue entry carries an identifier of the corresponding communication task request, that is, an identifier indicating which communication task request has been executed. Optionally, the control unit obtains the execution state of the communication task request by receiving a completion notification message for the communication task request or by actively polling the completion queue CQ, and thereby determines whether the execution of the communication task request has finished.
For the calculation task request, the HCA obtains data of the calculation task request through the control unit in the task execution engine 6032, and calculates corresponding data through the calculation unit in the task execution engine 6032 to store the calculation result at a corresponding destination address. After the computing unit completes the computation requested by the computation task request, a completion notification message is sent to the control unit to notify that the computation task request has been currently executed. The specific implementation of the communication task request and the calculation task request will be described in detail below, and will not be described in detail here.
In this application, the host processor 601, the host memory 602, the network interface card 603, and the control processor 604 may be interconnected using a bus. The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
Based on the above embodiments, embodiments related to the task request processing method according to the present application are described below. Fig. 7 is a flowchart illustrating a task request processing method according to an embodiment of the present invention. The method shown in fig. 7 is applied in a host device as shown in fig. 6, which includes a host processor and a network interface card HCA. The method comprises the following implementation steps:
S701. The first device constructs a task network graph through the host processor. The task network graph comprises at least one service task request forming the service communication and the execution dependency relationship of each service task request. The service task request may specifically comprise a communication task request and/or a calculation task request.
For a certain service communication of the first device, the first device (specifically, the host processor of the first device) may perform task decomposition on the service communication, so as to obtain at least one service task request constituting the service communication and the execution dependency relationship of each service task request. The execution dependency relationship is used to indicate the execution order of the service task requests and the other task requests on which each service task request depends. The first device then constructs the task network graph according to the at least one service task request and the respective execution dependency relationships.
The specific implementation of task decomposition is not limited. For example, the first device may perform task splitting according to operation steps of the service communication, for example, one operation step is correspondingly encapsulated as one task request, so as to obtain one or more service task requests composing the service communication.
Task requests WR referred to in the present application can be divided into two categories, namely, a calculation task request and a communication task request. Each task request WR is distinguished by having a correspondingly unique identification WR-id including, but not limited to, an identity identification, number, or other identifier customized to distinguish task requests. The first device related to the present application may specifically be the host device shown in fig. 6, which includes a host processor (processor CPU for short) and a network interface card HCA, and reference is made to the related description in the embodiment shown in fig. 6, which is not described herein again.
In some embodiments, the task request carries custom set parameters, which may include, but are not limited to, any one or a combination of more of the following: the task request comprises information such as identification of the task request, task type of the task request, source data address, destination address and length, and verification keyword (qkey). The task type of the task request is used to indicate whether the task request is a computation task request or a communication task request, and the task type of the task request may be generally represented by a preset character, for example, "0" represents that the task request is a communication task request, "1" represents that the task request is a computation task request, and so on.
S702, the host processor of the first device sends the task network graph to the network interface card HCA of the first device. Accordingly, the HCA of the first device obtains the task network map.
S703, the first device analyzes the task network diagram through the HCA, executes each task request in the task network diagram, and completes service communication.
After the first device constructs the task network map through the host processor, the task network map can be issued to the network interface card HCA of the first device. Correspondingly, the HCA receives the task network diagram, analyzes the task network diagram, and obtains at least one service task request forming service communication and an execution dependency relationship of the service task request. The business task request comprises a communication task request and/or a calculation task request. Further, the HCA executes the service task request according to the execution dependency relationship of each service task request in the task network diagram until the processing of each service task request in the task network diagram is completed, so that the service communication of the first device is realized.
In some embodiments, the communication task request is used to request a corresponding data transfer operation, such as data transmission or data reception. Some custom parameters may be carried in the communication task request, including, but not limited to, any one or combination of the following: the communication task request identifier, the queue group QP (or queue group QP identifier) associated with the communication task request, the transmission type of the communication task request (for example, send task request or receive task request), other task requests (specifically, the identifier of the other task requests) depended by the communication task request, and other information required by communication, such as information of source data address and length, destination address and length, and the like. The source data address and the destination address are both used for storing data, and the length is used for indicating the size of the data. The source data address is typically used to store the received data, and the destination address is typically used to store the data that needs to be sent or calculated by the device itself, etc.
The calculation task request is used for requesting a corresponding data calculation operation, such as data fusion/summation SUM, data maximum MAX, data minimum MIN, and the like. Some custom parameters may be carried in the calculation task request, which may include, but are not limited to, any one or a combination of the following: an identification of the calculation task request, a type of the calculation operation (e.g., SUM, MAX, MIN, etc.), a type of the data operation (e.g., integer int, long integer long, etc.), an address and a length of the source data, an address and a length of the destination data, a verification key (qkey), other task requests (specifically, identifications of other task requests) on which the calculation task request depends, and so on. The data operation type is used for indicating the type of the data on which the calculation task request performs the data operation, and may be an integer int, a floating point float, and the like. The verification key is used by the communication peer device to verify the validity of the information, similar to key verification.
In some embodiments, the execution dependencies may specifically include task execution relationships and data dependencies. The task execution relationship is used to indicate the execution order of the task requests, including but not limited to parallel execution and/or serial execution. The data dependency relationship is used for indicating other task requests which are depended on when the task request is executed, and particularly execution results from the other task requests can be depended on. For example, when executing the second task request, the execution result of the first task request needs to be used, and the second task request depends on the first task request, and specifically depends on the execution result of the first task request.
For example, taking the first device as the host device 2, the process of service communication performed by the host device 2 is specifically as follows: the host device 2 needs to receive data from the host device 4 and the host device 5, fuse the data to obtain fused data, and send the fused data to the host device 1 for use. Fig. 8 specifically shows a schematic diagram of the task network diagram corresponding to this service communication, in which a square represents a communication task request and a circle represents a calculation task request. Referring to fig. 8, the service communication includes 4 task requests: a calculation task request T3 and communication task requests T1, T2, and T4. The communication task request T1 is used to indicate that data 1 from the host device 4 is received. T1 carries an identifier of the task request (WR-ID: 1 shown in the figure), a type of the task request (WR-TYPE: 0, indicating a communication task request), a queue group identifier associated with the task request (local-QPN: 1), a transmission type of the task request (opcode: receive, indicating a receive task request), and identifiers of other task requests on which the task request depends (dependent: null, indicating that it does not depend on any other task request). Optionally, the source data address (src-addr) and the source data length (src-length) may also be included in T1. The source data address is a start address for storing data 1, and the source data length is used for reflecting the size of data 1.
The communication task request T2 is used to indicate that data 2 from the host device 5 is received. T2 carries an identifier of the task request (shown as WR-ID: 2), a type of the task request (WR-TYPE: 0, indicating a communication task request), a queue group identifier associated with the task request (local-QPN: 2), a transmission type of the task request (opcode: receive, indicating a receive task request), and identifiers of other task requests on which the task request depends (dependent: null, indicating that it does not depend on any other task request). Optionally, the source data address (src-addr) and the source data length (src-length) may also be included in T2. The source data address is a start address for storing data 2, and the source data length is used for reflecting the size of data 2.
The calculation task request T3 is used to indicate that a fusion operation is performed on data 1 and data 2 to obtain fused data. T3 carries an identifier of the task request (WR-ID: 3 is shown), a type of the task request (WR-TYPE: 1, indicating a calculation task request), a calculation operation type (calc-TYPE: SUM, indicating data summation), a source data address 1 (src-addr 1), a source data address 2 (src-addr 2), a destination address (dest-addr), a data operation type (data-TYPE: int), a data length (data-length), and identifiers of other task requests on which it depends (dependent: 1, 2, indicating that T3 depends on T1 and T2, and specifically on their execution results, namely data 1 and data 2). The source data address 1 and the source data address 2 are respectively used for storing the execution results of the T1 and T2 task requests; for example, the source data address 1 is used for storing data 1, and the source data address 2 is used for storing data 2. The destination address is used to store the execution result of the calculation task request T3, which here is the fused data. The data length is used to reflect the size of the fused data.
The communication task request T4 is used to instruct to transmit the fused data to the host apparatus 1. T4 carries an identifier of the task request (WR-ID: 4 shown in the figure), a TYPE of the task request (WR-TYPE: 0 shown as a communication task request), a queue group identifier associated with the task request (local-QPN: 3), a transmission TYPE of the task request (opcode: send shown as a sending task request), and an identifier of other task requests on which the task request depends (dependent: 3 shown as T4 depending on a computing task request T3). Optionally, T4 may also carry an identifier of the destination device (also referred to as an identifier of the remote device, remote-id: 1, indicating the host device 1), a queue group identifier associated with the destination device (also referred to as a remote queue group, remote-QPN: 5), and the like. The identification of the destination device is used to indicate the remote device to which the T4 task request needs to communicate.
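The four task requests described above can be pictured as simple records. The following is only an illustrative sketch: the class names, the Python field names (derived from the labels shown in fig. 8), and the concrete addresses and data length are assumptions made for the example, not values given in this application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

COMMUNICATION = 0  # WR-TYPE: 0
CALCULATION = 1    # WR-TYPE: 1


@dataclass
class CommTaskRequest:
    wr_id: int                        # WR-ID
    local_qpn: int                    # local-QPN
    opcode: str                       # "send" or "receive"
    depends_on: List[int] = field(default_factory=list)  # dependent
    src_addr: Optional[int] = None    # src-addr (optional)
    src_length: Optional[int] = None  # src-length (optional)
    remote_id: Optional[int] = None   # remote-id (optional)
    remote_qpn: Optional[int] = None  # remote-QPN (optional)
    wr_type: int = COMMUNICATION


@dataclass
class CalcTaskRequest:
    wr_id: int
    calc_type: str                    # calc-TYPE: "SUM", "MAX", "MIN"
    data_type: str                    # data-TYPE: e.g. "int"
    src_addrs: List[int]              # src-addr 1, src-addr 2
    dest_addr: int                    # dest-addr
    data_length: int                  # data-length
    depends_on: List[int] = field(default_factory=list)
    wr_type: int = CALCULATION


# Addresses and lengths below are made-up placeholders.
t1 = CommTaskRequest(wr_id=1, local_qpn=1, opcode="receive", src_addr=0x1000, src_length=64)
t2 = CommTaskRequest(wr_id=2, local_qpn=2, opcode="receive", src_addr=0x2000, src_length=64)
t3 = CalcTaskRequest(wr_id=3, calc_type="SUM", data_type="int",
                     src_addrs=[0x1000, 0x2000], dest_addr=0x3000,
                     data_length=64, depends_on=[1, 2])
t4 = CommTaskRequest(wr_id=4, local_qpn=3, opcode="send", depends_on=[3],
                     remote_id=1, remote_qpn=5)
task_network_graph = [t1, t2, t3, t4]
```

In this sketch an empty `depends_on` list plays the role of "dependent: null".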
As shown in fig. 8, the task network diagram includes information such as 4 task requests constituting service communication, an execution sequence of each task request, and data dependency. The execution sequence may be parallel, e.g., T1 and T2 in FIG. 8 are parallel; it can also be performed serially, for example, in fig. 8, T3 is executed first and then T4 is executed. The host device 2 analyzes the task network diagram, and after each task request included in the task network diagram is executed in sequence, the service communication can be realized, and the fused data is sent to the host device 1 for use.
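One possible way to derive the parallel and serial execution order from the dependency information is a level-by-level topological ordering. The sketch below is an assumption about how such an ordering could be computed, not the parsing procedure of this application; the function name `execution_waves` and the dictionary encoding of fig. 8 are hypothetical.

```python
from typing import Dict, List


def execution_waves(depends_on: Dict[int, List[int]]) -> List[List[int]]:
    """Group task request IDs into waves.

    Requests in the same wave have no unmet dependencies and may run in
    parallel; the waves themselves run one after another (serially).
    """
    remaining = {wr_id: set(deps) for wr_id, deps in depends_on.items()}
    done = set()
    waves = []
    while remaining:
        # A request is ready once everything it depends on has completed.
        ready = sorted(wr_id for wr_id, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unresolved dependency in task graph")
        waves.append(ready)
        done.update(ready)
        for wr_id in ready:
            del remaining[wr_id]
    return waves


# Dependencies of fig. 8: T3 depends on T1 and T2, T4 depends on T3.
FIG8_DEPENDENCIES = {1: [], 2: [], 3: [1, 2], 4: [3]}
waves = execution_waves(FIG8_DEPENDENCIES)
print(waves)
```

For the graph of fig. 8 this groups T1 and T2 into the first wave, followed by T3 and then T4.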
By implementing the embodiment of the invention, the task processing of the service communication is realized based on the task network diagram, so that the task request WR can be merged and issued once through the network diagram, the WR issuing times are reduced, the time consumption of WR issuing is reduced, and the equipment bandwidth resource is saved. In addition, the task request is executed at the HCA side, other business processing can be carried out after the CPU sends the task network diagram, the CPU does not need to be occupied, and compared with the traditional technology, the method can reduce the load of the CPU and improve the utilization rate of the CPU.
Fig. 9 is a schematic flowchart of another task processing method according to an embodiment of the present invention. The method as shown in fig. 9 is applied in a host device (first device) comprising a network interface card HCA and a host processor, the network interface card comprising a task execution engine and an RDMA execution engine, the method comprising the implementation steps of:
S901. The HCA obtains the task network diagram and analyzes the task network diagram to obtain at least one communication task request and/or at least one calculation task request which form the service communication.
In the present application, an HCA (specifically, a task execution engine of the HCA) of a first device acquires a task network graph of service communication, analyzes the task network graph, and obtains information included in the task network graph, such as at least one task request constituting the service communication and an execution dependency relationship of the task request. Each task request also carries custom parameters, which may include, but are not limited to, at least one of the following: the task type of the task request, the identification of the task request, the transmission type of the task request, the source data address, the destination address and other information. The task type of the task request is used for indicating whether the task request is a communication task request or a calculation task request.
S902. The HCA executes the task network graph; if the task network graph contains a communication task request, the HCA waits to invoke the RDMA execution engine to execute the data transmission operation indicated by the communication task request.
The HCA executes each task request in sequence according to the information contained in the task network diagram, and corresponding service communication is realized. Wherein:
for the communication task request, the HCA identifies the communication task request as a send task request (send) or a receive task request (receive) according to the transmission type carried in the communication task request. If the transmission type in the communication task request indicates that the communication task request is a sending task request, the HCA acquires data to be sent and sends the data to be sent by adopting an RDMA technology.
Optionally, the communication task request may also carry a queue group identifier (QPN) associated with the communication task request, and the HCA may store the communication task request in the queue group QP corresponding to the queue group identifier QPN in the form of a task request entry WRE. Specifically, if the communication task request is a send task request, the HCA may store the send task request in the send queue SQ of the queue group in WRE form and wait to invoke the RDMA execution engine to execute the send task request. If the communication task request is a receive task request, the HCA may store the receive task request in the receive queue RQ of the queue group in WRE form and wait to invoke the RDMA execution engine to execute the receive task request.
Specifically, the HCA parses the task network graph through the task execution engine to obtain each task request included in the task network graph. The task execution engine can determine, according to the task type carried in a task request, whether the task request is a calculation task request or a communication task request. For a communication task request, the task execution engine can identify, according to the transmission type carried in the task request, whether it is a send task request or a receive task request. For a send task request, the task execution engine may issue the send task request, in the form of a task request entry WRE, to the send queue in the queue group QP corresponding to the queue group identifier, and wait to invoke the RDMA execution engine to execute the send task request. Specifically, when the RDMA execution engine executes the send task request, it may obtain the data to be sent and send the data by using the RDMA technique. For a receive task request, the task execution engine issues the receive task request, in the form of a task request entry WRE, to the receive queue in the queue group QP corresponding to the queue group identifier, and waits to invoke the RDMA execution engine to execute the receive task request. Specifically, the RDMA execution engine is invoked to receive, by using the RDMA technique, the data to be received sent by the second device.
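The dispatch of communication task requests into queues can be sketched as follows. This is a minimal in-memory model under stated assumptions: the `QueuePair` class, the dictionary layout of a request, and the function `post_comm_request` are hypothetical names for illustration and do not correspond to any real RDMA library call.

```python
from collections import deque
from typing import Dict


class QueuePair:
    """Toy model of a queue group QP with a send queue and a receive queue."""

    def __init__(self) -> None:
        self.sq = deque()  # send queue SQ
        self.rq = deque()  # receive queue RQ


def post_comm_request(qps: Dict[int, QueuePair], request: dict) -> str:
    """Store a communication task request as a WRE in the SQ or RQ of its QP."""
    qp = qps[request["local_qpn"]]  # look up the QP by its QPN
    wre = {"wr_id": request["wr_id"], "opcode": request["opcode"]}
    if request["opcode"] == "send":
        qp.sq.append(wre)
        return "SQ"
    if request["opcode"] == "receive":
        qp.rq.append(wre)
        return "RQ"
    raise ValueError(f"unknown transmission type: {request['opcode']!r}")


qps = {1: QueuePair(), 3: QueuePair()}
where_t1 = post_comm_request(qps, {"wr_id": 1, "local_qpn": 1, "opcode": "receive"})
where_t4 = post_comm_request(qps, {"wr_id": 4, "local_qpn": 3, "opcode": "send"})
```

In this model the entries simply wait in the queues; executing them would be the job of the RDMA execution engine, which is not modeled here.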
Optionally, the communication task request may also carry a source data address and a destination device identifier. The HCA acquires the data to be sent from the source data address through the RDMA execution engine so as to send the data to be sent to the target equipment corresponding to the target equipment identification. Optionally, the communication task request may also carry a source data length, and the HCA may specifically obtain, from the source data address, to-be-sent data corresponding to the source data length through the RDMA execution engine.
Optionally, the communication task request may also carry an identifier of another task request (e.g., the first task request) on which the communication task request depends, and the data to be sent may be an execution result of the first task request. The first task request may include, but is not limited to, a communication task request or a computing task request.
If the transmission type in the communication task request indicates that the communication task request is a receive task request, the HCA may receive, by using the RDMA technique, the data to be received sent by a second host device (second device for short). Optionally, the communication task request carries a source data address, and the HCA may store the data to be received at the source data address. The communication task request may also carry a source data length; the HCA stores the data to be received starting from the source data address, and the source data length reflects/indicates the size of the data to be received.
Optionally, the communication task request may also carry an identifier of another task request (for example, the first task request) on which the communication task request depends, and the data to be received may be an execution result of the first task request. For the first task request, reference is made to the foregoing embodiments, and details are not repeated here.
Optionally, after the RDMA execution engine completes executing the communication task request, a completion notification message may be actively sent to the task execution engine to notify that the communication task request is currently completed. Alternatively, after the RDMA execution engine executes the communication task request, it may generate a corresponding completion queue entry CQE, and add it to the completion queue. Accordingly, when receiving the completion notification message for the communication task request or querying the completion queue entry for the communication task request through the active polling completion queue, the task execution engine may determine that the communication task request is currently executed, so as to continue the execution of the next task request.
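The polling variant can be sketched as a lookup in a completion queue. The model below is only an assumption for illustration: the CQ is represented as a Python `deque` of dictionaries keyed by a hypothetical `wr_id` field, and `poll_cq` is an invented helper, not a real verbs API.

```python
from collections import deque
from typing import Optional


def poll_cq(cq: deque, wr_id: int) -> Optional[dict]:
    """Return and remove the CQE for wr_id if it is in the CQ, else None."""
    for index, cqe in enumerate(cq):
        if cqe["wr_id"] == wr_id:
            del cq[index]  # consume the entry; we return immediately afterwards
            return cqe
    return None


# The RDMA execution engine is assumed to have completed requests 1 and 2.
completion_queue = deque([{"wr_id": 1}, {"wr_id": 2}])

cqe_2 = poll_cq(completion_queue, 2)  # request 2 has finished
cqe_3 = poll_cq(completion_queue, 3)  # request 3 has not finished yet
```

A task execution engine built this way would proceed to the next task request only after `poll_cq` returns an entry for every request that the next one depends on.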
S903. If the task network graph comprises a calculation task request, the HCA performs the data calculation operation indicated by the calculation task request.
For the calculation task request, the calculation task request carries a calculation operation type, and the task execution engine of the HCA can acquire data to be calculated and perform the calculation operation indicated by the calculation operation type on the data to be calculated to obtain a calculation result. Alternatively, the source data address may be carried in the computation task request, and the HCA may obtain the data to be computed from the source data address. The number of the data to be calculated is not limited, and may be one or more. For example, when there are two pieces of data to be calculated, specifically, the two pieces of data to be calculated may be the first data to be calculated and the second data to be calculated, the source data addresses carried in the calculation task request are also two pieces, specifically, the two pieces of data may be the first source data address and the second source data address. Accordingly, the HCA may obtain the first data to be computed from the first source data address and the second data to be computed from the second source data address.
Optionally, the calculation task request may also carry a source data length for reflecting the size of the data to be calculated, and the HCA obtains the data to be calculated corresponding to the source data length from the source data address. Optionally, the computation task request further carries a destination address, and the HCA may store the computation result to the destination address.
Optionally, the computation task request carries an identifier of another task request (for example, a second task request) on which the computation task request depends, and the data to be computed may specifically be an execution result of the second task request. The second task request may be a communication task request or a calculation task request, and is not limited.
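The handling of a calculation task request described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the `memory` dictionary standing in for device buffers, the `ComputeRequest` class, its field names, and the `CALC_OPS` table are hypothetical and are not structures defined by the embodiment. Only the SUM operation is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical table of calculation operation types (only SUM appears in the text).
CALC_OPS: Dict[str, Callable[[List[int], List[int]], List[int]]] = {
    "SUM": lambda a, b: [x + y for x, y in zip(a, b)],
}


@dataclass
class ComputeRequest:
    wr_id: int                      # identifier of the task request
    calc_type: str                  # calculation operation type
    src_addrs: List[str]            # one or more source data addresses
    des_addr: Optional[str] = None  # optional destination address
    src_len: Optional[int] = None   # optional source data length


def execute_compute(req: ComputeRequest, memory: Dict[str, List[int]]) -> List[int]:
    """Fetch the operands, apply the operation, optionally store the result."""
    # Read src_len elements from each source address (everything if src_len is None).
    operands = [memory[addr][: req.src_len] for addr in req.src_addrs]
    result = operands[0]
    for operand in operands[1:]:
        result = CALC_OPS[req.calc_type](result, operand)
    if req.des_addr is not None:
        memory[req.des_addr] = result
    return result
```

With two source buffers and a destination buffer, the request mirrors the two-address case in the text; omitting `des_addr` simply returns the result without storing it.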
And S904, after the HCA executes each task request contained in the task network diagram, the service communication is completed.
The HCA processes each task request included in the task network graph in turn according to the task request processing principle described above, so as to implement service communication of the first device.
To facilitate a better understanding of the embodiments of the invention, a specific example is described in detail below. Fig. 10A and fig. 10B illustrate communication based on hierarchical aggregation in a tree-based network. Fig. 10A shows data aggregation, and fig. 10B shows data distribution. Each node of the illustrated cluster establishes connections with its parent node and child nodes in the physical topology. A parent node receives the aggregated data (also called fused data) of all of its child nodes, and each node fuses the data of all its child nodes with its own data to obtain aggregated data. If a node has a parent node, it sends its aggregated data to the parent node, waits for the parent node to send back the parent node's aggregated data, and then forwards that data to all of its child nodes. If a node has no parent node, it sends its own aggregated data to all of its child nodes. In the illustration, each node may represent one host device, namely host device 1 (host 1) to host device 7 (host 7). Specifically, taking the host device 2 as the parent node as an example, the host device 2 needs to receive the respective data of the host device 4 and the host device 5 and fuse the data to obtain first fused data, which it sends to its own parent node, the host device 1. Accordingly, the host device 2 may receive the total fused data sent by the host device 1, and the total fused data may specifically include the data of each of the host devices 2 to 7. The host device 2 then forwards the total fused data to its child nodes, the host device 4 and the host device 5.
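The two phases of hierarchical aggregation can be illustrated with a small Python simulation. It is a sketch under assumptions: the text only states that host 1 is the parent of host 2 and that hosts 4 and 5 are children of host 2, so the remaining edges of the `CHILDREN` table (host 3 with children 6 and 7) are assumed; fusion is modeled as integer summation; and every node, including the root, contributes its own data, following the general rule rather than the host 2 example.

```python
from typing import Dict, List, Optional

# Assumed tree: host 1 is the root; the edges below host 3 are hypothetical.
CHILDREN: Dict[int, List[int]] = {
    1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: [],
}


def aggregate(node: int, local: Dict[int, int],
              children: Dict[int, List[int]] = CHILDREN) -> int:
    """Upward phase: fuse a node's own data with the aggregated data of its children."""
    return local[node] + sum(aggregate(c, local, children) for c in children[node])


def distribute(node: int, total: int,
               children: Dict[int, List[int]] = CHILDREN,
               received: Optional[Dict[int, int]] = None) -> Dict[int, int]:
    """Downward phase: forward the total fused data from a node to all descendants."""
    received = {} if received is None else received
    received[node] = total
    for child in children[node]:
        distribute(child, total, children, received)
    return received


local_data = {host: host for host in CHILDREN}   # host i holds the value i
total = aggregate(1, local_data)                 # aggregation toward the root
result = distribute(1, total)                    # distribution back to every host
```

After both phases every host holds the same total, which is the property the tree-based scheme is meant to provide.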
Taking the service communication of the host device 2 as an example, the specific process is as follows: the host device 2 needs to receive the data from the host device 4 and the host device 5, fuse them, and send the fused data to its parent node, the host device 1; it then waits for the host device 1 to complete the data aggregation of all nodes in the cluster and send the aggregation result to the host device 2. The host device 2 then issues the aggregation result to the host device 4 and the host device 5. Based on this specific flow of service communication, the host processor of the host device 2 (host 2) can construct a task network diagram as shown in fig. 11.
In fig. 11, squares represent communication task requests, and circles represent calculation task requests. The figure includes 7 task requests, T1 to T7: 6 communication task requests and 1 calculation task request, T3. Specifically, the communication task request T1 is used to receive data 1 from the host device 4, and T2 is used to receive data 2 from the host device 5. T3 is used to fuse data 1 and data 2 to obtain first fused data. T4 is used to send the first fused data to the host device 1. T5 is used to receive the second fused data from the host device 1. T6 is used to send the second fused data to the host device 4, and T7 is used to send the second fused data to the host device 5.
The communication task request T1 carries custom parameters. As shown in the figure, these may include an identifier of the task request (WR-ID: 1), a task type of the task request (WR-TYPE: 0, which indicates a communication task request), a queue group identifier associated with the task request (local-QPN: 1), a transmission type of the task request (opcode: receive, which indicates a receive task request), a source data address (src-addr: buffer 1), and an identifier of another task request on which the task request depends (depended: null, which indicates that T1 does not depend on any other task request). The source data address is used to store data 1 of the host device 4 received by T1.
Similarly, the parameters that may be carried by each task request are exemplarily given in the figure. For example, the computation task request T3 may carry, in addition to the parameters in T1, a computation operation type (calc-type: SUM, which indicates data fusion/summation), a source data address 1 (src-addr 1: buffer 1), a source data address 2 (src-addr 2: buffer 2), and a destination address (des-addr: buffer 3). The source data address 1 stores data 1, the source data address 2 stores data 2, and the destination address is used for storing first fusion data obtained after fusion SUM operation is performed on the data 1 and the data 2. The communication task requests T6 and T7 may also carry an identifier (remote-id) of the destination device (indicating that the communication task request supports the remote host device for communication).
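To make the parameters of fig. 11 concrete, the seven task requests could be modeled as plain records, as in the following Python sketch. The class and field names merely mirror the labels in the text (WR-ID, WR-TYPE, opcode, local-QPN, src-addr, des-addr, calc-type, depended, remote-id). Several values are assumptions for illustration: the value 1 for a calculation request's WR-TYPE, the name `buffer4` for the buffer holding the second fused data, the `remote_id` strings, and the modeling of the serial order T3, T4, T5 as dependencies. Placing the data to be sent in `des_addr` follows the wording of claim 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

COMM, CALC = 0, 1  # WR-TYPE 0 is from the text; 1 for calculation is assumed


@dataclass(frozen=True)
class TaskRequest:
    wr_id: int
    wr_type: int
    opcode: Optional[str] = None       # "receive" or "send" for communication requests
    local_qpn: Optional[int] = None    # queue group identifier
    src_addrs: Tuple[str, ...] = ()    # buffers holding received data / operands
    des_addr: Optional[str] = None     # buffer holding data to send or a result
    calc_type: Optional[str] = None
    depends_on: Tuple[int, ...] = ()   # identifiers of task requests depended on
    remote_id: Optional[str] = None    # hypothetical destination identifier


FIG11: List[TaskRequest] = [
    TaskRequest(1, COMM, "receive", 1, ("buffer1",)),
    TaskRequest(2, COMM, "receive", 2, ("buffer2",)),
    TaskRequest(3, CALC, calc_type="SUM", src_addrs=("buffer1", "buffer2"),
                des_addr="buffer3", depends_on=(1, 2)),
    TaskRequest(4, COMM, "send", 3, des_addr="buffer3", depends_on=(3,)),
    TaskRequest(5, COMM, "receive", 3, ("buffer4",), depends_on=(4,)),
    TaskRequest(6, COMM, "send", 1, des_addr="buffer4", depends_on=(5,),
                remote_id="host4"),
    TaskRequest(7, COMM, "send", 2, des_addr="buffer4", depends_on=(5,),
                remote_id="host5"),
]


def check_dependencies(graph: List[TaskRequest]) -> bool:
    """Return True if every dependency names another request in the graph."""
    ids = {t.wr_id for t in graph}
    return all(dep in ids and dep != t.wr_id
               for t in graph for dep in t.depends_on)
```

A host processor could run a check such as `check_dependencies(FIG11)` before handing the graph to the network card, so that dangling identifiers are caught early.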
Referring to fig. 12, an operational diagram of task network graph execution is shown. Specifically, the task execution engine of the host device 2 receives the task network graph from the host processor through the host interface. The task execution engine comprises a control unit and a computing unit. The task execution engine parses the task network graph through the control unit to obtain the 7 task requests T1 to T7. First, the control unit may execute T1 and T2 in parallel: it issues T1 in the form of a task request entry WRE1 to the receive queue RQ in QP1 (the queue group corresponding to the queue group identifier local-QPN: 1) and waits for the RDMA execution engine to schedule the execution of T1 (step S1 in the figure), and it issues T2 in the form of a task request entry WRE2 to the receive queue RQ in QP2 and waits for the RDMA execution engine to schedule the execution of T2 (step S2 in the figure). Optionally, the host device 2 may process WRE1 and WRE2 through the RDMA execution engine, that is, receive data 1 sent by the host device 4 using RDMA technology and data 2 sent by the host device 5 using RDMA technology. Further, the host device 2 may store the received data 1 and data 2 to the source data address 1 (buffer 1) and the source data address 2 (buffer 2), respectively.
After each communication task request is completed, for example, after T1 (WRE1 in particular) is executed, the RDMA execution engine may generate a corresponding completion queue entry CQE and place the CQE into the corresponding completion queue CQ. Specifically, after WRE1 and WRE2 are executed, CQE1 and CQE2 may be generated correspondingly and placed in the completion queue CQ.
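The issuing of task request entries to the queues of a queue group can be pictured with the following Python sketch. It is a toy model only: `QueuePair` and `post` are hypothetical names, the queues are plain in-memory deques, and no real RDMA verbs interface is involved.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class QueuePair:
    qpn: int
    sq: Deque[str] = field(default_factory=deque)  # send queue
    rq: Deque[str] = field(default_factory=deque)  # receive queue


def post(qps: Dict[int, QueuePair], wr_id: int, local_qpn: int, opcode: str) -> str:
    """Place a task request entry on the SQ or RQ of the queue group local_qpn."""
    qp = qps.setdefault(local_qpn, QueuePair(local_qpn))
    wre = f"WRE{wr_id}"
    if opcode == "receive":
        qp.rq.append(wre)
    elif opcode == "send":
        qp.sq.append(wre)
    else:
        raise ValueError(f"unknown opcode: {opcode}")
    return wre


qps: Dict[int, QueuePair] = {}
post(qps, 1, 1, "receive")  # T1 -> RQ of QP1
post(qps, 2, 2, "receive")  # T2 -> RQ of QP2
```

Selecting the queue by transmission type keeps the control unit's job to a single lookup; the entries then wait until the RDMA execution engine schedules them.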
The control unit may determine the execution status (whether the execution is finished) of the communication task request by actively polling the completion queue CQ to determine whether a completion queue entry corresponding to the communication task request exists. Alternatively, the RDMA execution engine may actively send a completion notification event/message to the task execution engine (specifically, a control unit in the task execution engine) after executing one communication task request, and accordingly, after receiving the completion notification message for the communication task request, the control unit may determine that the execution of the communication task request is completed. Illustratively, as shown, the control unit actively polls the completion queue CQ for the presence of CQE1 and CQE2 to determine the respective execution status of T1 and T2. CQE1 indicates that T1 (WRE 1) has been executed, and CQE2 indicates that T2 (WRE 2) has been executed (steps S3 and S4 are shown).
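Active polling of the completion queue might look like the following Python sketch. The representation of a CQE as a dictionary with a `wr_id` key and the function name `poll_cq` are assumptions made for illustration; the text does not specify the CQE layout.

```python
from collections import deque
from typing import Deque, Dict, Set


def poll_cq(cq: Deque[Dict[str, int]], pending: Set[int]) -> Set[int]:
    """Drain the completion queue and return the pending requests that completed."""
    completed: Set[int] = set()
    while cq:
        cqe = cq.popleft()
        wr_id = cqe["wr_id"]
        if wr_id in pending:
            pending.discard(wr_id)
            completed.add(wr_id)
    return completed


cq: Deque[Dict[str, int]] = deque([{"wr_id": 1}, {"wr_id": 2}])  # CQE1, CQE2
pending = {1, 2, 4}            # T1, T2 and T4 have been issued
done = poll_cq(cq, pending)    # T1 and T2 are reported as executed
```

The notification-based alternative described above would replace the polling loop with a callback invoked by the RDMA execution engine; the bookkeeping of pending identifiers would stay the same.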
After determining that T1 and T2 have been completely executed, the control unit may continue to parse T3 and invoke the computing unit to obtain the T1 execution result (data 1) and the T2 execution result (data 2), on which T3 depends, from buffer1 and buffer2, respectively, as shown in steps S5 and S6 in the figure. The computing unit then performs the fusion operation indicated by the calculation operation type on data 1 and data 2 to obtain the first fused data. Optionally, the computing unit may also store the first fused data into the destination address buffer3 (step S7 in the figure). After the computing unit completes the computation, it can actively send a completion notification message to the control unit to indicate that the current calculation task request has been executed.
After determining that T3 has been completely executed, the control unit may then parse and execute T4, issue T4 in the form of WRE4 to the send queue SQ in the queue group QP3, and wait for the RDMA execution engine to execute it. Accordingly, the RDMA execution engine acquires the data to be sent (here, the first fused data) from buffer3 and sends the first fused data to the host device 1 by using the RDMA technique (step S8 in the figure). The control unit determines the execution state of T4 by actively polling or by receiving a completion notification event (step S9 in the figure); for details, refer to the related description of determining the execution states of T1 and T2, which is not repeated here.
Upon determining that T4 is done, the control unit may then parse and execute T5, issue T5 in the form of WRE5 to the receive queue RQ in QP3, and wait for the RDMA execution engine to execute it (step S10 in the figure). The control unit determines the execution state of T5 by actively polling, by receiving a completion notification message, or the like (step S11 in the figure). Similarly, after determining that T5 has finished executing, the control unit may parse and execute T6 and T7 in parallel, issue T6 in the form of WRE6 to the send queue SQ of QP1, issue T7 in the form of WRE7 to the send queue SQ of QP2, and wait for the RDMA execution engine to execute them (steps S12 and S13 in the figure). Accordingly, the control unit determines the respective execution states of T6 and T7 by actively polling, by receiving a completion notification event, or the like (steps S14 and S15 in the figure). After the control unit determines that the execution of T6 and T7 is completed, it may determine that each task request in the task network graph has been completed, and the control unit may send a notification message to the host processor through the host interface to indicate that the task network graph has been executed and that the service communication of the host device 2 is currently complete.
By implementing the embodiment of the invention, a task network graph compatible with both communication task requests and calculation task requests is designed, which reduces the number of times task requests are issued and therefore the time spent issuing them. In addition, the task execution engine is used to parse and execute the task network graph, which reduces CPU utilization, and any task request in the task network graph supports control processing operations on the same QP or on different QPs.
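The order in which the control unit releases T1 to T7 follows from their dependencies, and can be reproduced with a short Python sketch. The function `execution_waves` and the `FIG11_DEPS` table are hypothetical; the serial steps T3, T4, T5 are modeled here as dependencies, which is an assumption about how the execution relationship is encoded.

```python
from typing import Dict, List

# Dependencies of fig. 11 as described in the walkthrough (modeling assumption).
FIG11_DEPS: Dict[str, List[str]] = {
    "T1": [], "T2": [],
    "T3": ["T1", "T2"],
    "T4": ["T3"],
    "T5": ["T4"],
    "T6": ["T5"], "T7": ["T5"],
}


def execution_waves(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group task requests into waves; requests in one wave may run in parallel."""
    done = set()
    remaining = dict(deps)
    waves: List[List[str]] = []
    while remaining:
        # A request is ready once everything it depends on has completed.
        ready = sorted(t for t, d in remaining.items() if set(d) <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves


waves = execution_waves(FIG11_DEPS)
# waves == [["T1", "T2"], ["T3"], ["T4"], ["T5"], ["T6", "T7"]]
```

Each wave corresponds to one round of issuing requests and waiting for their completion, so the graph of fig. 11 needs five rounds on the network card and a single issue from the host processor.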
Based on the above examples, the following sets forth the relevant products to which the present application is directed. Fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present invention. As shown in fig. 13, the network device 500 (specifically, a network interface card, or simply a network card) includes one or more processors 501, a communication interface 502, and a memory 503, where the processors 501, the communication interface 502, and the memory 503 may be connected by a bus, and may also implement communication by other means such as wireless transmission. The embodiment of the present invention is exemplified by being connected through a bus 504, wherein the memory 503 is used for storing instructions, and the processor 501 is used for executing the instructions stored by the memory 503. The memory 503 stores program codes, and the processor 501 can call the program codes stored in the memory 503 to execute the relevant steps with HCA as the main execution body in the above method embodiment, and/or the technical contents described in the text. 
For example, the following steps are performed: acquiring a task network diagram, wherein the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication, the communication task request is used for requesting to perform data transmission operation indicated by the communication task request, and the calculation task request is used for requesting to perform data calculation operation indicated by the calculation task request; if the task network graph comprises the communication task request, performing data transmission operation indicated by the communication task request by adopting a Remote Direct Memory Access (RDMA) technology; if the task network graph comprises the calculation task request, executing data calculation operation indicated by the calculation task request; and after the task network graph is executed, the service communication is completed.
It should be understood that processor 501 may be comprised of one or more general-purpose processors, such as a Central Processing Unit (CPU). The processor 501 may be configured to run related program codes to implement related steps of the task processing method embodiment, where the HCA is the main execution body.
The communication interface 502 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules/devices. For example, in the embodiment of the present application, the communication interface 502 may be specifically configured to receive a task network graph and the like from a host processor.
The Memory 503 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, HDD), or a Solid-State Drive (SSD); the memory 503 may also comprise a combination of the above kinds of memories. The memory 503 may be used to store a set of program codes, so that the processor 501 may call the program codes stored in the memory 503 to implement the relevant contents of the HCA as the execution subject in the embodiment of the method of the present invention.
It should be noted that fig. 13 is only one possible implementation manner of the embodiment of the present invention, and in practical applications, the network interface card may further include more or less components, which is not limited herein. For the content that is not shown or not described in the embodiment of the present invention, reference may be made to the related explanation in the embodiment described in fig. 1 to fig. 12, and details are not described here.
Fig. 14 is a schematic structural diagram of a host device according to an embodiment of the present invention. The host device 600 includes a processing unit 602 and a communication unit 603. The processing unit 602 is used to control and manage the actions of the host device 600. Illustratively, the processing unit 602 is configured to support the host device 600 in performing steps S701-S703 in fig. 7, steps S901-S904 in fig. 9, and/or other steps of the techniques described herein. The communication unit 603 is used to support communication between the host device 600 and other devices. Optionally, the host device 600 may further include a storage unit 601 for storing program codes and data of the host device 600.
The processing unit 602 may be a processor or a controller, such as a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication unit 603 may be a communication interface, a transceiver, a transceiver circuit, etc., where "communication interface" is a generic term that may include one or more interfaces, such as interfaces between the host device and other devices. The storage unit 601 may be a memory.
When the processing unit 602 is a processor, the communication unit 603 is a communication interface, and the storage unit 601 is a memory, the host device according to the embodiment of the present invention may be the host device shown in fig. 15.
Referring to fig. 15, the host device 610 includes a processor 612, a communication interface 613, and a memory 611. Optionally, the host device 610 may also include a bus 614 and a network interface card 615 (network card for short). The communication interface 613, the processor 612, the memory 611, and the network interface card 615 may be connected to each other via the bus 614; the bus 614 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 614 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 15, but this does not mean that there is only one bus or one type of bus. The network card 615 may specifically be the network card (network device) 500 shown in fig. 13. The network card 615 includes the processor 501, the communication interface 502, and the memory 503, which are shown connected by a bus. The memory 503 is used for storing instructions, and the processor 501 is used for executing the instructions stored in the memory 503. The memory 503 stores program codes, and the processor 501 can call the program codes stored in the memory 503 to execute the operation steps performed by the HCA in the above method embodiments.
The specific implementation of the host device shown in fig. 14 or fig. 15 may also refer to the corresponding description of the foregoing method embodiment, and is not described herein again.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware or in software executed by a processor. The software instructions may be composed of corresponding software modules, and the software modules may be stored in a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a register, a hard disk, a removable hard disk, a Compact Disc Read-Only Memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a computing device. Of course, the processor and the storage medium may also reside as discrete components in a computing device.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed to implement the processes of the embodiments of the methods described above. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Claims (14)

1. A method of task processing, the method comprising:
the method comprises the steps that a first device obtains a task network diagram, wherein the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication, the communication task request is used for requesting to perform data transmission operation indicated by the communication task request, the calculation task request is used for requesting to perform data calculation operation indicated by the calculation task request, the communication task request carries one or more of identification of the communication task request, a queue group QP associated with the communication task request, a transmission type of the communication task request, other task requests depended by the communication task request, a source data address and a destination address, the calculation task request carries one or more of identification of the calculation task request, a calculation operation type, a data operation type, a verification keyword, other task requests depended by the calculation task request, a source data address and a destination address, the source data address is used for storing received data, and the destination address is used for storing data required to be sent or data obtained by self calculation;
if the task network graph comprises the communication task request, a task execution engine in the first device is used for executing data transmission operation indicated by the communication task request by adopting a Remote Direct Memory Access (RDMA) technology;
if the task network graph comprises the calculation task request, executing data calculation operation indicated by the calculation task request by using the task execution engine;
and after the task network graph is executed, the service communication is completed.
2. The method of claim 1, wherein performing the data transfer operation indicated by the communication task request using Remote Direct Memory Access (RDMA) technology comprises:
if the transmission type indicates that the communication task request is a sending task request, obtaining data to be sent, and sending the data to be sent by adopting an RDMA (remote direct memory access) technology;
and if the transmission type indicates that the communication task request is a receiving task request, receiving data to be received sent by the second equipment by adopting an RDMA technology.
3. The method of claim 2, wherein performing the data transfer operation indicated by the communication task request using Remote Direct Memory Access (RDMA) technology comprises:
if the communication task request is a sending task request, storing the sending task request in a sending queue in a queue group corresponding to a queue group identifier, waiting for executing the sending task request to obtain data to be sent, and sending the data to be sent by adopting an RDMA (remote direct memory access) technology;
and if the communication task request is a received task request, storing the received task request in a receiving queue in the queue group corresponding to the queue group identifier, and waiting for executing the received task request to receive data to be received sent by second equipment by using an RDMA (remote direct memory access) technology.
4. The method of claim 2,
when the communication task request also carries an identifier of a first task request depended on by the communication task request, the data to be sent and the data to be received are execution results of the first task request; or,
and when the communication task request does not carry the identifier of the first task request depended by the communication task request, the data to be sent and the data to be received are pre-stored data.
5. The method of claim 3, further comprising:
inquiring a completion queue entry aiming at the communication task request in a completion queue or receiving a completion notification message aiming at the completion of the execution of the communication task request, and determining that the execution of the communication task request is completed;
the completion queue stores the completion queue entry used for indicating the identifier of the communication task request after completion of execution.
6. The method of any of claims 1-5, wherein the performing the data computation operation indicated by the computation task request comprises:
and acquiring data to be calculated, and performing calculation operation indicated by the calculation operation type on the data to be calculated to obtain an execution result of the calculation task request.
7. The method according to any one of claims 1-5, wherein before the obtaining the task network graph, further comprising:
performing task decomposition on service communication to obtain at least one task request forming the service communication and an execution dependency relationship of the at least one task request, wherein the execution dependency relationship comprises a data dependency relationship and a task execution relationship, the task execution relationship comprises serial execution and/or parallel execution, the data dependency relationship is used for indicating other task requests depended on when the task request is executed, and the task request comprises a communication task request and/or a calculation task request;
and generating a task network graph according to the at least one task request and the execution dependency relationship of the at least one task request.
8. A network device comprising a task execution engine and an RDMA execution engine, wherein:
the task execution engine is used for acquiring a task network diagram, wherein the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication, the communication task request carries one or more of an identifier of the communication task request, a queue group QP associated with the communication task request, a transmission type of the communication task request, other task requests depended on by the communication task request, a source data address and a destination address, the calculation task request carries one or more of an identifier of the calculation task request, a calculation operation type, a data operation type, a verification keyword, other task requests depended on by the calculation task request, a source data address and a destination address, the source data address is used for storing received data, and the destination address is used for storing data required to be sent or data obtained by self calculation;
the task execution engine is further used for calling the RDMA execution engine to execute the data transmission operation indicated by the communication task request if the task network graph comprises the communication task request;
the task execution engine is further configured to execute a data calculation operation indicated by the calculation task request if the task network graph includes the calculation task request;
and the task execution engine is also used for determining to finish the service communication after the task network graph is executed.
9. The apparatus of claim 8,
the task execution engine is specifically configured to store the send task request in a send queue in a corresponding queue group QP in a form of a task request entry if the transmission type indicates that the communication task request is a send task request, and wait for the RDMA execution engine to execute the send task request; or,
and if the transmission type indicates that the communication task request is a received task request, storing the received task request in a received queue in a corresponding queue group QP in a form of a task request entry, and waiting for an RDMA execution engine to execute the received task request.
10. The apparatus of claim 9,
the RDMA execution engine is used for acquiring data to be sent and sending the data to be sent by adopting an RDMA technology if the communication task request is a sending task request; and if the communication task request is a task receiving request, receiving data to be received sent by the host equipment by adopting an RDMA technology.
11. The apparatus of claim 8,
the RDMA execution engine is further used for sending a completion notification message aiming at the communication task request to the task execution engine after the communication task request is executed; or after the communication task request is executed, generating a completion queue entry and adding the completion queue entry into a completion queue, wherein the completion queue entry is used for indicating that the communication task request is executed;
the task execution engine is further configured to determine that the communication task request is executed completely if a completion notification message for the communication task request is received or the completion queue is polled to obtain the completion queue entry.
12. The apparatus according to any one of claims 8-11,
the task execution engine is used for acquiring data to be calculated, and executing the calculation operation indicated by the calculation operation type on the data to be calculated to obtain the execution result of the calculation task request.
13. A network device, comprising a memory for storing instructions and a processor coupled to the memory for executing the instructions; wherein the processor, when executing the instructions, performs the method of any of claims 1-7 above.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910687998.XA CN112291293B (en) 2019-07-27 2019-07-27 Task processing method, related equipment and computer storage medium


Publications (2)

Publication Number Publication Date
CN112291293A CN112291293A (en) 2021-01-29
CN112291293B true CN112291293B (en) 2023-01-06


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112822300B (en) * 2021-04-19 2021-07-13 北京易捷思达科技发展有限公司 RDMA (remote direct memory Access) -based data transmission method and device and electronic equipment
CN113537937A (en) * 2021-07-16 2021-10-22 重庆富民银行股份有限公司 Task arrangement method, device and equipment based on topological sorting and storage medium
CN114296916B (en) * 2021-12-23 2024-01-12 苏州浪潮智能科技有限公司 Method, device and medium for improving RDMA release performance
CN114510339B (en) * 2022-04-20 2022-07-29 苏州浪潮智能科技有限公司 Computing task scheduling method and device, electronic equipment and readable storage medium
CN115237582B (en) * 2022-09-22 2022-12-09 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN116721007B (en) * 2023-08-02 2023-10-27 摩尔线程智能科技(北京)有限责任公司 Task control method, system and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129390A (en) * 2011-03-10 2011-07-20 中国科学技术大学苏州研究院 Task scheduling system of on-chip multi-core computing platform and method for task parallelization
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
EP3096226A4 (en) * 2014-10-29 2017-11-15 Baidu Online Network Technology (Beijing) Co., Ltd. Conversation processing method, conversation management system and computer device
CN107819855A (en) * 2017-11-14 2018-03-20 成都路行通信息技术有限公司 Message distribution method and device
CN108628672A (en) * 2018-05-04 2018-10-09 武汉轻工大学 Task scheduling method, system, terminal device and storage medium
CN109697122A (en) * 2017-10-20 2019-04-30 华为技术有限公司 Task processing method, equipment and computer storage medium
CN109840154A (en) * 2019-01-08 2019-06-04 南京邮电大学 Task-dependency-based computation migration method in a mobile cloud environment
CN109918432A (en) * 2019-01-28 2019-06-21 中国平安财产保险股份有限公司 Method, apparatus, computer device and storage medium for extracting a task relationship chain

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063344B (en) * 2014-06-20 2018-06-26 华为技术有限公司 Method and network interface card for storing data
DK3358463T3 (en) * 2016-08-26 2020-11-16 Huawei Tech Co Ltd METHOD, DEVICE AND SYSTEM FOR IMPLEMENTING HARDWARE ACCELERATION TREATMENT
CN114095427A (en) * 2017-12-29 2022-02-25 西安华为技术有限公司 Method and network card for processing data message

Also Published As

Publication number Publication date
CN112291293A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112291293B (en) Task processing method, related equipment and computer storage medium
US20220214919A1 (en) System and method for facilitating efficient load balancing in a network interface controller (nic)
US20200344189A1 (en) Communication method and communication apparatus
US7937447B1 (en) Communication between computer systems over an input/output (I/O) bus
EP4136532A1 (en) Storage transactions with predictable latency
US20170168986A1 (en) Adaptive coalescing of remote direct memory access acknowledgements based on i/o characteristics
US11922304B2 (en) Remote artificial intelligence (AI) acceleration system
CN113485823A (en) Data transmission method, device, network equipment and storage medium
US20230353419A1 (en) Cross network bridging
US20100306387A1 (en) Network interface device
WO2021073546A1 (en) Data access method, device, and first computer device
US10609125B2 (en) Method and system for transmitting communication data
WO2022032984A1 (en) MQTT protocol simulation method and simulation device
US20230152978A1 (en) Data Access Method and Related Device
KR20240004315A (en) Network-attached MPI processing architecture within SMARTNICs
US10305772B2 (en) Using a single work item to send multiple messages
CN111970497B (en) Video stream processing method and device, SDN controller and storage medium
US20220291928A1 (en) Event controller in a device
WO2022143774A1 (en) Data access method and related device
US20150254100A1 (en) Software Enabled Network Storage Accelerator (SENSA) - Storage Virtualization Offload Engine (SVOE)
US20150254196A1 (en) Software Enabled Network Storage Accelerator (SENSA) - network - disk DMA (NDDMA)
WO2024077999A1 (en) Collective communication method and computing cluster
CN117041147B (en) Intelligent network card equipment, host equipment, method and system
Fredj et al. On distributed intrusion detection systems design for high speed networks
WO2022179293A1 (en) Network card, computing device and data acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant