CN112291293A - Task processing method, related equipment and computer storage medium - Google Patents

Info

Publication number: CN112291293A (application CN201910687998.XA); granted as CN112291293B
Authority: CN (China)
Prior art keywords: task request, task, data, communication, request
Legal status: Granted; currently active
Other languages: Chinese (zh)
Inventor: 王庚
Current and original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd; priority to CN201910687998.XA

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04L — TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 — Network arrangements or protocols for supporting network services or applications
    • H04L67/50 — Network services
    • H04L67/51 — Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • H04L67/56 — Provisioning of proxy services
    • H04L67/568 — Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/60 — Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)
  • Multi Processors (AREA)

Abstract

The embodiments of the invention disclose a task processing method, a task processing device and a computer storage medium. The task processing method comprises the following steps: a first device obtains a task network graph, where the task network graph comprises at least one communication task request and/or at least one computation task request that together make up a service communication; if the task network graph comprises a communication task request, the data transmission operation indicated by that request is performed using Remote Direct Memory Access (RDMA) technology; if the task network graph comprises a computation task request, the data computation operation indicated by that request is performed; and once the task network graph has been fully executed, the service communication is complete. The embodiments of the invention can solve two problems of the conventional scheme: task requests must be issued frequently, and no functional operation other than transmission can be performed.

Description

Task processing method, related equipment and computer storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a task processing method, a related device, and a computer storage medium.
Background
InfiniBand (IB) is a switched network designed to meet the requirements of high bandwidth and low latency (see the IB Specification Vol 1, Release 1.3) and consists of a transport layer, a network layer, a link layer and a physical layer. The transport layer provides applications with an IB resource management interface and input/output (IO) interfaces, and completes service communication with a remote device by means of queues. The IO interfaces mainly comprise a send interface (post-send), a receive interface (post-recv) and a queue polling interface (poll-cq).
The host device communicates with a remote device over the IB network through a transport layer instance; a single transport layer instance is called a queue pair (QP) and consists of a send queue (SQ) and a receive queue (RQ). A host device may use multiple QPs simultaneously, each corresponding to communication with a different remote device. When the host device and a remote device carry out actual service communication, each task request in the service communication is associated with one QP; for example, a send task request is issued into the send queue SQ, and a receive task request is issued into the receive queue RQ. The remote direct memory access (RDMA) execution engine of the host device then executes the task requests in the order in which they were actually issued. This may involve frequent interaction over the task requests, which affects device performance and reduces task processing efficiency. In addition, the RDMA execution engine only provides a data transceiving (transmission) function and does not support other operations, such as data computation, and may therefore fail to meet the requirements of actual service communication.
Disclosure of Invention
The embodiments of the invention disclose a task processing method, related equipment and a computer storage medium, which can solve the problems of the conventional scheme that task requests are issued frequently and that no functional operation other than transmission can be performed.
In a first aspect, an embodiment of the present invention discloses a task processing method applied on a first device side. The method includes: obtaining a task network graph, where the task network graph comprises at least one communication task request and/or at least one computation task request that make up a service communication. Any two communication task requests support parallel or serial execution; a communication task request is used to request the data transmission operation it indicates, and a computation task request is used to request the data computation operation it indicates. When the task network graph comprises a communication task request, the data transmission operation indicated by the communication task request is performed using remote direct memory access (RDMA) technology; when the task network graph comprises a computation task request, the data computation operation indicated by the computation task request is performed; and after the task network graph has been executed, the service communication is complete.
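For illustration only, the graph-driven execution described in this aspect can be sketched as a tiny dependency-respecting executor. Everything below (the `Task` class, the stubbed RDMA actions, the values) is an invented stand-in, not part of the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node of the task network graph: either a communication
    task (a send/receive, stubbed here) or a computation task."""
    name: str
    kind: str                                 # "comm" or "compute"
    action: callable = None                  # performed when the task runs
    deps: list = field(default_factory=list)  # names of prerequisite tasks

def run_task_graph(tasks):
    """Execute every task once all of its dependencies have finished.
    Tasks with no mutual dependency could run in parallel; here they
    are simply run in a dependency-respecting (topological) order."""
    done, results, pending = set(), {}, list(tasks)
    while pending:
        progressed = False
        for t in list(pending):
            if all(d in done for d in t.deps):
                results[t.name] = t.action([results[d] for d in t.deps])
                done.add(t.name)
                pending.remove(t)
                progressed = True
        if not progressed:
            raise ValueError("cycle or missing dependency in task graph")
    return results

# Example: receive A and B, add them, send the sum (RDMA calls are stubbed).
graph = [
    Task("recv_a", "comm", lambda _: 3),   # stands in for an RDMA receive
    Task("recv_b", "comm", lambda _: 4),
    Task("add", "compute", lambda ins: sum(ins), deps=["recv_a", "recv_b"]),
    Task("send_c", "comm", lambda ins: ins[0], deps=["add"]),
]
print(run_task_graph(graph)["send_c"])   # 7
```

Once the final `send_c` task has run, the whole graph has been executed, which corresponds to the point at which the service communication is deemed complete.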
With reference to the first aspect, in some possible embodiments, the communication task request carries its transmission type. When the transmission type indicates that the communication task request is a send task request, the first device obtains the data to be sent and sends it using RDMA technology. Alternatively, when the transmission type indicates that the communication task request is a receive task request, the first device uses RDMA technology to receive the data to be received that is sent by a second device.
With reference to the first aspect, in some possible embodiments, the communication task request further carries a source data address. When the communication task request is a send task request, the data to be sent is stored at the source data address. When the communication task request is a receive task request, the first device can store the data to be received at the source data address.
With reference to the first aspect, in some possible embodiments, when the communication task request carries the identifier of a first task request on which it depends, the data to be sent or received is the execution result of that first task request, where the first task request is itself a communication task request or a computation task request. Alternatively, when the communication task request does not carry the identifier of a task request on which it depends, the data to be sent or received may be pre-stored data of the device (pre-stored data for short).
With reference to the first aspect, in some possible embodiments, the communication task request further carries the identifier of the queue pair QP associated with it. When the communication task request is a send task request, the first device issues the send task request into the send queue of the queue pair QP corresponding to the queue pair identifier, and waits for the RDMA execution engine of the first device to execute it. When the communication task request is a receive task request, the first device issues the receive task request into the receive queue of the queue pair QP corresponding to the queue pair identifier, and waits for the RDMA execution engine of the first device to execute it.
With reference to the first aspect, in some possible embodiments, when the first device finds a completion queue entry for the communication task request in the completion queue, or receives a completion notification message indicating that execution of the communication task request has finished, it determines that the communication task request has been executed. The completion queue stores completion queue entries that indicate the identifiers of executed communication task requests.
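The two completion mechanisms in this embodiment (polling the completion queue versus receiving a notification) can be illustrated with a minimal stand-in model; the class and identifiers below are invented for illustration, not taken from the patent:

```python
import collections

class CompletionQueue:
    """Minimal stand-in for an RDMA completion queue: the execution
    engine pushes a completion queue entry (CQE) per finished task
    request; the issuer polls until the entry it cares about appears."""
    def __init__(self):
        self._cqes = collections.deque()
        self._seen = set()

    def push(self, task_id):
        # Done by the execution engine when a task request finishes.
        self._cqes.append(task_id)

    def poll(self, task_id):
        # Done by the first device: drain new CQEs, then check.
        while self._cqes:
            self._seen.add(self._cqes.popleft())
        return task_id in self._seen

cq = CompletionQueue()
assert not cq.poll("WR1")    # nothing completed yet
cq.push("WR1")               # engine reports WR1 done
assert cq.poll("WR1")        # issuer now observes the completion
```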
With reference to the first aspect, in some possible embodiments, the computation task request carries a computation operation type, and the first device may obtain the data to be computed and perform the computation operation indicated by the computation operation type on that data, thereby obtaining the execution result of the computation task request.
With reference to the first aspect, in some possible embodiments, the computation task request further carries a source data address, which is used to store the data to be computed.
With reference to the first aspect, in some possible embodiments, the computation task request further carries the identifier of a second task request on which it depends; the data to be computed is the execution result of the second task request.
With reference to the first aspect, in some possible embodiments, the first device may decompose the service communication into tasks, obtaining at least one service task request and the execution dependency relationships of the at least one service task request, where the execution dependency relationships include data dependency relationships and task execution relationships. A task execution relationship specifies serial and/or parallel execution; a data dependency relationship indicates the other task requests on which a service task request depends when executed; and a service task request may specifically be a communication task request and/or a computation task request. The first device then generates the task network graph from the at least one service task request and its execution dependency relationships.
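As a sketch of how the serial/parallel execution relationships could fall out of the data dependencies, the helper below groups task requests into layers (requests in the same layer are mutually independent and may run in parallel; layers run serially). The function name and the example requests are illustrative only:

```python
def schedule_layers(deps):
    """Group task requests into parallel-capable layers.
    `deps` maps each request to the set of requests it depends on."""
    remaining = {t: set(d) for t, d in deps.items()}
    layers = []
    while remaining:
        # A request is ready once all of its prerequisites are scheduled.
        ready = [t for t, d in remaining.items() if not d]
        if not ready:
            raise ValueError("dependency cycle in task requests")
        layers.append(sorted(ready))
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return layers

# Two independent receives, then a computation on both, then a send.
deps = {"recv1": set(), "recv2": set(),
        "calc": {"recv1", "recv2"}, "send": {"calc"}}
print(schedule_layers(deps))
# [['recv1', 'recv2'], ['calc'], ['send']]
```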
In a second aspect, an embodiment of the present invention discloses a network device (specifically, a network interface card) comprising a task execution engine and an RDMA execution engine. The task execution engine is used to obtain a task network graph comprising at least one communication task request and/or at least one computation task request that make up a service communication. The task execution engine is further used to invoke the RDMA execution engine to perform the data transmission operation indicated by a communication task request if the task network graph comprises one, and to perform the data computation operation indicated by a computation task request if the task network graph comprises one. The task execution engine is also used to determine that the service communication is complete after the task network graph has been executed.
With reference to the second aspect, in some possible embodiments, the communication task request carries a transmission type, and the task execution engine is specifically configured to: if the transmission type indicates a send task request, store the request as a task request entry in the send queue of the corresponding queue pair QP and wait for the RDMA execution engine to execute it; if the transmission type indicates a receive task request, store the request as a task request entry in the receive queue of the corresponding queue pair QP and wait for the RDMA execution engine to execute it. Specifically, if the communication task request is a send task request, the RDMA execution engine executes it by obtaining the data to be sent and sending it using RDMA technology; if the communication task request is a receive task request, the RDMA execution engine executes it by receiving, using RDMA technology, the data to be received that is sent by the host device.
With reference to the second aspect, in some possible embodiments, the RDMA execution engine is further configured to send, after executing a task request, a completion notification message for that task request to the task execution engine to announce that execution has finished. Alternatively, after executing a communication task request, the RDMA execution engine may automatically generate a completion queue entry and add it to the completion queue, where the completion queue entry indicates that the communication task request has been executed.
With reference to the second aspect, in some possible embodiments, the computation task request carries a computation operation type. The task execution engine is used to obtain the data to be computed and perform the computation operation indicated by the computation operation type on it, obtaining the execution result of the computation task request.
For the content that is not shown or not described in the embodiment of the present invention, reference may be specifically made to the explanation in the embodiment of the method described in the foregoing first aspect, and details are not described here again.
In a third aspect, an embodiment of the present invention provides a first device, where the first device includes a functional unit configured to perform the method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a first device, which includes a network interface card and a host processor. The host processor is used for generating a task network graph according to at least one service task request of the obtained service communication and the execution dependency relationship of the at least one service task request. The network interface card is used for acquiring the task network diagram from the host processor and executing the task network diagram to realize the service communication. For details that are not shown or described in the embodiments of the present invention, reference may be made to the related explanations in the foregoing embodiments, and details are not described here.
In a fifth aspect, an embodiment of the present invention provides a network device (specifically, a network interface card), including a memory and a processor coupled to the memory; the memory is configured to store instructions, and the processor is configured to execute the instructions; when the processor executes the instructions, the processor executes the method described in the fourth aspect with the network interface card as the execution subject.
In a sixth aspect, an embodiment of the present invention provides another first device, including a memory and a processor coupled to the memory; the memory is configured to store instructions, and the processor is configured to execute the instructions; wherein the processor executes the instructions to perform the method described in the first aspect.
In some possible embodiments, the first device further includes a communication interface, which is in communication with the processor, and the communication interface is used for communicating with other devices (such as network devices and the like) under the control of the processor.
In a seventh aspect, a computer-readable storage medium storing program code for task processing is provided. The program code comprises instructions for performing the method described in the first or second aspect above.
On the basis of the implementations provided by the above aspects, the present invention can further combine them to provide more implementations.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic diagram of a conventional network framework in the prior art.
Fig. 2 is a diagram of an RDMA network framework in the prior art.
Fig. 3 is a schematic diagram of an IB network framework according to an embodiment of the present invention.
Fig. 4A-4B are schematic diagrams of two task request processes provided by the embodiment of the invention.
Fig. 5 is a schematic diagram of a task processing framework according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a network framework according to an embodiment of the present invention.
Fig. 7 is a flowchart illustrating a task processing method according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a task network diagram according to an embodiment of the present invention.
Fig. 9 is a flowchart illustrating another task processing method according to an embodiment of the present invention.
Fig. 10A-10B are schematic diagrams of two services based on tree networking according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of another task network diagram provided by an embodiment of the present invention.
Fig. 12 is an operation diagram of a task process according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a host device according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of another host device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings of the present invention.
First, some technical knowledge related to the present application is introduced.
Remote Direct Memory Access (RDMA) technology
RDMA technology was proposed to address server-side data processing latency in network transmission. RDMA transfers data directly over the network into a storage area of the host device, quickly moving data from one system into the memory of a remote system without involving either operating system and without consuming much of the computing capacity of the host device. Three network technologies currently support RDMA: InfiniBand (IB), RoCE and iWARP. RDMA avoids data copies, thereby providing low latency, reducing CPU utilization, relieving memory bandwidth bottlenecks and offering high bandwidth utilization. RDMA provides channel-based IO operations, allowing an application that uses RDMA technology to read and write remote virtual memory directly.
As shown in fig. 1, in a conventional network (e.g., a socket network), an application requests network resources from the operating system, and the resources are transferred via system calls. The application's data resides in a virtual buffer, the operating system maintains its own buffer instance, and the data reaches the wire through the network interface card (NIC). In other words, in a conventional network the network resource (the application's data) is owned by the operating system of the local device; the user cannot access it directly, but must rely on the operating system to move the data out of the application's virtual buffer and then transmit it onto the wire through the protocol stack. Correspondingly, on the remote device, the application must rely on its operating system to take the data off the wire and store it in the application's virtual buffer.
With RDMA technology, by contrast, once a channel has been established in the operating system, applications can exchange messages directly without further intervention by the operating system. A message may be an RDMA read, an RDMA write, an RDMA receive, an RDMA send, and so on. Specifically, as shown in fig. 2, the applications of the local and remote devices keep their data in the corresponding virtual buffers of their devices, and they can directly access and obtain each other's data using RDMA technology. For example, the local device may use RDMA to access data stored in a buffer of the remote device.
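The contrast between the two data paths can be caricatured as a copy count. This toy model is purely illustrative (the functions, buffers and copy counts are invented to convey the idea, not measured from any real stack):

```python
def conventional_send(app_buffer):
    """Model of the OS-mediated path: each hop copies the data."""
    os_buffer = list(app_buffer)   # copy 1: app buffer -> local OS buffer
    wire = list(os_buffer)         # copy 2: protocol stack -> wire
    remote_os = list(wire)         # copy 3: wire -> remote OS buffer
    remote_app = list(remote_os)   # copy 4: remote OS -> remote app buffer
    return remote_app, 4

def rdma_send(app_buffer):
    """Model of the RDMA path: direct placement into remote memory."""
    remote_app = list(app_buffer)  # single direct transfer
    return remote_app, 1

data = [1, 2, 3]
print(conventional_send(data)[1], rdma_send(data)[1])   # 4 1
```

Fewer intermediate copies is one reason RDMA achieves the low latency and low CPU utilization described above.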
Second, Infiniband (IB) technology
InfiniBand (IB) is a serial networking technology and a computer-network communications standard used in high-performance computing, featuring very high throughput and very low latency for computer-to-computer data interconnects. InfiniBand also serves as a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems.
InfiniBand is a switched-fabric IO technology. Its design idea is to link remote storage, network, server and other devices through a central mechanism (specifically, a central InfiniBand switch) and to direct traffic through that switch, thereby greatly improving the performance, reliability and effectiveness of the system and relieving data congestion among hardware devices. Fig. 3 is a schematic diagram of an IB network framework. The network framework shown in fig. 3 includes: a processor node 301, a storage node 302, an input/output (IO) node 303, and an InfiniBand switch 304.
The number of processing nodes 301 is not limited; one is shown as an example. A processing node 301 may include a central processing unit (CPU), a memory, and a host channel adapter (HCA). The number of CPUs is not limited, and the CPU, memory and HCA may be connected by a bus (e.g., a PCIe bus).
Storage nodes 302 may include, but are not limited to, a RAID subsystem, a storage subsystem, or other system nodes for data storage. The RAID subsystem includes a processor, a memory, a target channel adapter (TCA), a small computer system interface (SCSI), and storage resources, where the storage resources include, but are not limited to, hard disks, magnetic disks, or other storage devices. The storage subsystem may include a controller, a target channel adapter TCA, and storage resources.
The IO node may specifically be an input/output IO device, which may include at least one adapted IO unit (IO module) supporting connections with video, hard disk, network and similar devices. There are usually multiple InfiniBand switches, which together form an InfiniBand switching network, referred to as an InfiniBand network. Optionally, an InfiniBand switch also supports communication with a router.
In practical applications, the nodes communicate with one another through the InfiniBand switches; for example, an InfiniBand switch supports communication with the TCA and the HCA, and such a communication link may be referred to as an InfiniBand link. An InfiniBand link is a fiber connection between an HCA and a TCA, and the InfiniBand network framework allows hardware manufacturers to connect TCAs and HCAs in three link widths: 1 fiber, 4 fibers or 12 fibers. In addition, as shown in the figure, the HCA is a bridge between the memory controller and the TCA, and the TCA packages the digital signals of the IO devices (e.g., a network card or SCSI controller) and transmits them to the HCA.
Third, queue pair (QP)
The QP is an important concept in InfiniBand; it refers to a combination of a receive queue and a send queue. In practical applications, when the host device calls an application programming interface (API) to send or receive a data request (for example, a task request), the request is actually stored in a QP, and the data requests in the QP are then processed one by one in a polling manner. Fig. 4A is a schematic diagram of request processing. As shown, a task request, i.e., a work request (WR), generated in the host device exists in the form of a work queue entry (WQE) in a work queue (WQ). After a task request in the WQ has been processed by the hardware of the host device, a corresponding completion queue entry (CQE) is generated and stored in the completion queue (CQ) as a work completion (WC). In practical applications, the WQ may specifically be a receive queue (RQ) or a send queue (SQ).
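The WR-to-WQE-to-CQE lifecycle just described can be modelled with a small stand-in; the class below is an invented illustration of the bookkeeping, not real verbs-API code (a real implementation would use libibverbs calls such as posting to a send or receive queue):

```python
from collections import deque

class QueuePair:
    """Toy model of the WR -> WQE -> CQE lifecycle: posting a work
    request enqueues an entry on the send or receive queue; the
    'hardware' drains the queues and emits completion queue entries."""
    def __init__(self, cq):
        self.sq, self.rq, self.cq = deque(), deque(), cq

    def post_send(self, wr):
        self.sq.append(wr)          # WR stored as a WQE in the SQ

    def post_recv(self, wr):
        self.rq.append(wr)          # WR stored as a WQE in the RQ

    def hw_process(self):
        """Stand-in for the host channel adapter: consume WQEs in
        posting order and report each as a CQE on the shared CQ."""
        for q in (self.sq, self.rq):
            while q:
                self.cq.append(("CQE", q.popleft()))

cq = deque()
qp = QueuePair(cq)
qp.post_send("WR1")
qp.post_recv("WR2")
qp.hw_process()
print(list(cq))   # [('CQE', 'WR1'), ('CQE', 'WR2')]
```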
Fig. 4B shows the specific processing of task requests in the send queue and the receive queue. As shown in fig. 4B, a send task request (send WQE) is stored in the send queue SQ, and a receive task request (receive WQE) is stored in the receive queue RQ. When the hardware of the host device processes a send task request, it uses RDMA technology to read the data that the WQE requests to send from a preset read buffer, writes the data out (RDMA write WQE), and finally completes the send WQE; that is, the hardware executes the WQE to send the corresponding data. Correspondingly, when the hardware of the computing device processes a receive task request, the data requested by the receive task request may be stored into a receive buffer.
Next, a flow framework for task request processing is introduced.
Fig. 5 is a schematic diagram of a task processing framework according to an embodiment of the present invention. As shown in fig. 5, completing one service communication involves three task requests (WRs), shown in the figure as post WR1 to post WR3. Each WR exists in a QP in the form of a task request entry (WRE); each WR is associated with one QP, specifically a send task request is issued into the send queue SQ and a receive task request is issued into the receive queue RQ. After the host device finishes processing each WRE, a corresponding completion queue entry (CQE) is generated and stored in the completion queue CQ to indicate that the WRE has been executed.
As shown, a host device (e.g., host0) may communicate with host device 1 (host1) via QP1, with host device 2 (host2) via QP2, with host device 3 (host3) via QP3, and so on up to host device n (hostn) via QPn. QP1 through QPn are associated with the same completion queue CQ. In this example, in the actual service communication, host0 needs to merge the data from host1 and host2 and then send the merged data to host3.
Specifically, three task requests (post WR1 to post WR3) are generated for this service; they are referred to as WR1 to WR3 below. WR1 is a receive task request (recv WR1) indicating that the data A from host1 is to be received; host0 issues WR1 in the form of WRE1 into the RQ under QP1. WR2 is likewise a receive task request (recv WR2) indicating that the data B from host2 is to be received; after finishing the issuing of WR1, host0 issues WR2 in the form of WRE2 into the RQ under QP2. When host0 completes WR1 and WR2, i.e., has received the data from host1 and host2 respectively, the corresponding CQE1 and CQE2 may be generated to indicate that WR1 and WR2 have finished executing.
Accordingly, after issuing WR1 and WR2, host0 may poll the completion queue (poll CQ) to learn whether WR1 and WR2 have finished. Once they have, the received data A and data B corresponding to WR1 and WR2 are available. host0 may then fuse or superpose data A and data B under program control to obtain new data C, i.e., data C = data A + data B. host0 then issues WR3 into the SQ under QP3 so that WR3 is executed; WR3 requests that data C be sent to host3. After the RDMA execution engine of host0 receives WR3, it may obtain data C in response to WR3 and send it to host3. Upon completion of WR3, a corresponding CQE3 may be generated and stored in the CQ. host0 can then actively query CQE3 in the CQ (poll CQ) to learn that the service communication is complete, after which the CPU of host0 is idle (CPU idle).
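The fig. 5 walkthrough can be condensed into a few lines of stand-in code. The host names mirror the figure, but the numeric values and helper function are invented for illustration:

```python
# WR1/WR2 receive data A and B, program logic on the CPU side fuses
# them into C, and WR3 sends C on to host3.
def recv_from(host, inbox):
    """Models a receive WR completing and yielding its data."""
    return inbox[host]

inbox = {"host1": 10, "host2": 32}    # data A and data B (made-up values)
data_a = recv_from("host1", inbox)    # WR1 completes -> CQE1
data_b = recv_from("host2", inbox)    # WR2 completes -> CQE2
data_c = data_a + data_b              # fusion done under program control
sent = {"host3": data_c}              # WR3: send data C to host3
print(sent)                            # {'host3': 42}
```

Note that the fusion step sits between the completions of WR1/WR2 and the issuing of WR3, which is exactly the CPU-side dependency handling the next paragraph identifies as the bottleneck.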
In practice it has been found that when a service communication involves task requests in multiple QPs and those task requests have direct dependency relationships, the host device needs to interact frequently with its RDMA execution engine (specifically, the host channel adapter HCA) over the PCIe bus, and the process is complex and cumbersome. Moreover, the host device has to handle the dependency relationships between WRs on the CPU side, since this cannot be done on the HCA side; for example, even the simple data computation above is placed on the CPU side rather than the HCA side. As a result, considerable CPU resources of the host device are consumed, and the frequent interaction also lengthens the data processing latency and reduces the efficiency of service communication.
In addition, the RDMA execution engine executes task requests WR in the order in which they are issued; it can only guarantee sequential execution of different task requests and cannot support synchronized processing of different task requests in the same QP. Moreover, the RDMA execution engine only provides a data transmission function, cannot perform data computation, and thus cannot meet the computation requirements of actual service communication.
In order to solve the above problems, the present application proposes another task processing method and the associated devices to which the method applies. Fig. 6 is a schematic diagram of a network framework according to an embodiment of the present invention. The network framework shown in fig. 6 includes N host devices and M storage devices (storage) that communicate with each other through an IB network, where M and N are positive integers. The figure shows 4 host devices and 2 storage devices as examples: host device 1 to host device 4, and storage device 1 to storage device 2. As shown, any host device 600 (e.g., host device 1) includes a host processor 601, a host memory 602, and a network interface card (also referred to as a host interface card, HCA) 603. Optionally, a control processor 604 may also be included. Among these components:
the processor (which may be specifically the host processor 601 or the control processor 604) may include one or more processing units, such as: the processor may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can be, among other things, a neural hub and a command center of the host device 600. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor for storing instructions and data. In some embodiments, the memory in the processor is a cache. It may hold instructions or data that the processor has just used or uses cyclically; if the processor needs the instruction or data again, it can be fetched directly from this memory. This avoids repeated accesses, reduces processor waiting time, and thus improves system efficiency.
Host memory 602 may include volatile memory (volatile memory), such as Random Access Memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a Hard Disk Drive (HDD), or a solid-state drive (SSD); the memory may also comprise a combination of memories of the kind described above. The memory may be configured to store a set of program code to facilitate the processor in invoking the program code stored in the memory to implement the corresponding functionality.
As shown, the host memory 602 may include applications and communication libraries. An application may specifically be one customized and configured by the system to run on the host processor 601, such as the high-performance computing (HPC) application and artificial intelligence (AI) application in the figure. A communication library provides a communication interface so that the host device 600 can communicate with other devices. As shown, the communication libraries may include a message-passing interface (MPI) communication library and a collective communication (NCCL) communication library.
Specifically, in an HPC or AI application scenario, multiple host devices access the IB network through the network interface card HCA 603 and may form a host cluster. An HPC application completes communication by calling a collective communication interface provided by the MPI communication library. An AI application can likewise complete communication by calling a collective interface provided by a communication library such as NCCL. The MPI or NCCL communication library in turn uses the RDMA interface provided by the HCA, in particular a send interface, a receive interface, and a completion-queue polling (poll CQ) interface, to carry out the actual communication.
The network interface card 603 is responsible for data processing, such as parsing a task network graph and processing the task requests it contains, as described later in this application. As shown, the network interface card 603 includes a host interface 6031, a task executor (WR executor) 6032, and an IB network interface (IB interface) 6033. Among these components:
the host interface 6031 is used to enable communication between the network interface card 603 and the host device 600, such as the network interface card 603 and the control processor 604. IB network interface 6033 is used to enable host device 600 to communicate with an IB network for communicating with other host devices in the network over the IB network.
The task executor 6032, also called a task execution engine, is configured to process a task request included in the business communication. Optionally, the task executor 6032 includes a control unit and a calculation unit therein. The control unit is used for controlling the logic sequence of task request processing, and the computing unit is used for realizing computing operation indicated by the task request, such as fusion or summation processing of data A and data B, and the like. The calculation operation can be specifically set by the system, for example, SUM, MAX, MIN, and so on.
In actual operation, the task executor 6032 may interact with queues in the host device. For example, a send task request is issued to the SQ, a receive task request is issued to the RQ, and a request completion event, which may also be referred to as a completion notification message or completion notification event, is delivered to the CQ for storage and indicates that processing of the task request has finished. Specifically, whether it is a send task request or a receive task request (both are communication task requests), the request is issued to the corresponding queue group QP in the form of a task request entry WRE, and the completion notification event is delivered to the corresponding completion queue CQ in the form of a completion queue entry CQE.
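The WRE/CQE bookkeeping described above can be sketched as a minimal queue-pair model; the class and field names here are hypothetical, not the real HCA data structures:

```python
from collections import deque

class QueuePair:
    """Minimal QP model: WRs are stored as task request entries (WREs)
    in the SQ or RQ; each completed WR yields exactly one CQE in the CQ."""

    def __init__(self):
        self.sq = deque()   # send queue
        self.rq = deque()   # receive queue
        self.cq = deque()   # completion queue

    def post(self, wr):
        # send task requests go to the SQ, receive task requests to the RQ
        (self.sq if wr["opcode"] == "send" else self.rq).append(wr)

    def complete(self, wr):
        # one CQE per WR, carrying the identifier of the completed WR
        self.cq.append({"wr_id": wr["wr_id"], "status": "success"})

qp = QueuePair()
qp.post({"wr_id": 4, "opcode": "send"})
qp.complete(qp.sq.popleft())   # engine consumes the WRE, deposits a CQE
```

The one-to-one correspondence between WRE and CQE mirrors the text: every communication task request has exactly one task request entry and one completion queue entry.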
Optionally, an RDMA execution engine (not shown) is also included in the host device. In practical applications, the RDMA execution engine and the task execution engine (task executor 6032) cooperate with each other to complete the processing of all task requests included in the service communication, so as to implement the corresponding service communication. For example, the task executor 6032 obtains the task network graph, parses the task requests included in the task network graph, and calls the RDMA execution engine according to the logical sequence of the task requests to complete the task processing corresponding to the task requests.
For the communication task request, the HCA puts the communication task request into the corresponding queue group QP through the control unit of the task execution engine 6032, and specifically stores the communication task request in the form of a task request entry WRE. And waiting for the RDMA execution engine to schedule and execute the communication task request. After the communication task request is completed, a corresponding completion queue entry CQE is generated and filled in a corresponding completion queue. The completion queue entry is used to indicate that the communication task request has been executed, and usually the completion queue entry corresponds to the communication task request one-to-one, that is, each communication task request corresponds to one completion queue entry and also corresponds to one task request entry. Optionally, the completion queue entry is used to indicate that the corresponding communication task request has been executed, and generally carries an identifier of the corresponding communication task request. I.e. an identifier indicating the completed communication task request. Optionally, the control unit obtains the execution state of the communication task request by receiving a completion notification message for the execution of the communication task request or actively polling a completion queue CQ, and determines whether the execution of the communication task request is finished.
For the calculation task request, the HCA obtains data of the calculation task request through the control unit in the task execution engine 6032, and calculates corresponding data through the calculation unit in the task execution engine 6032 to store the calculation result at a corresponding destination address. After the computing unit completes the computation requested by the computation task request, a completion notification message is sent to the control unit to notify that the computation task request has been currently executed. The specific implementation of the communication task request and the calculation task request will be described in detail below, and will not be described in detail here.
In this application, the host processor 601, the host memory 602, the network interface card 603, and the control processor 604 may be interconnected by a bus. The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean there is only one bus or only one type of bus.
Based on the above embodiments, embodiments related to the task request processing method according to the present application are described below. Fig. 7 is a flowchart illustrating a task request processing method according to an embodiment of the present invention. The method shown in fig. 7 is applied in a host device as shown in fig. 6, which includes a host processor and a network interface card HCA. The method comprises the following implementation steps:
S701, the first device constructs a task network graph through its host processor. The task network graph includes at least one service task request constituting the service communication and the execution dependency relationship of each service task request; a service task request may specifically be a communication task request and/or a computation task request.
For a certain service communication of the first device, the first device (specifically, its host processor) may decompose the service communication into tasks, thereby obtaining at least one service task request constituting the service communication and the execution dependency relationship of each service task request. The execution dependency relationship is used to indicate the execution order of the service task requests and the other task requests on which each service task request depends. The first device then constructs the task network graph from the at least one service task request and the respective execution dependency relationships.
The specific implementation of task decomposition is not limited. For example, the first device may perform task splitting according to operation steps of the service communication, for example, one operation step is correspondingly encapsulated as one task request, so as to obtain one or more service task requests composing the service communication.
Task requests WR referred to in the present application can be divided into two categories, namely, a calculation task request and a communication task request. Each task request WR is distinguished by having a correspondingly unique identification WR-id including, but not limited to, an identity identification, number, or other identifier customized to distinguish task requests. The first device related to the present application may specifically be the host device shown in fig. 6, which includes a host processor (processor CPU for short) and a network interface card HCA, and reference is made to the related description in the embodiment shown in fig. 6, which is not described herein again.
In some embodiments, the task request carries custom-set parameters, which may include, but are not limited to, any one or a combination of the following: the identifier of the task request, the task type of the task request, the source data address, the destination address and length, and a verification keyword (qkey). The task type of the task request indicates whether the task request is a computation task request or a communication task request, and may generally be represented by a preset character, for example "0" for a communication task request and "1" for a computation task request.
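As a sketch, the custom parameters listed above might be modeled as a record, using the text's convention that task type 0 means communication and 1 means computation; the field names are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

COMM, CALC = 0, 1   # task-type convention from the text: 0 = communication, 1 = computation

@dataclass
class TaskRequest:
    """Illustrative WR record: identifier, task type, and the optional
    custom parameters named in the text (addresses, length, qkey, deps)."""
    wr_id: int
    wr_type: int                       # COMM or CALC
    opcode: Optional[str] = None       # "send"/"receive" for COMM; "SUM"/"MAX"/... for CALC
    src_addr: Optional[int] = None     # source data address
    dest_addr: Optional[int] = None    # destination address
    length: int = 0                    # data length
    qkey: int = 0                      # verification keyword
    depends_on: List[int] = field(default_factory=list)  # ids of WRs this one depends on

wr = TaskRequest(wr_id=1, wr_type=COMM, opcode="receive")
```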
S702, the host processor of the first device sends the task network graph to the network interface card HCA of the first device. Accordingly, the HCA of the first device obtains the task network graph.
S703, the first device analyzes the task network diagram through the HCA, executes each task request in the task network diagram, and completes service communication.
After the first device constructs the task network graph through the host processor, the task network graph can be issued to the network interface card HCA of the first device. Correspondingly, the HCA receives the task network diagram, analyzes the task network diagram, and obtains at least one service task request forming service communication and an execution dependency relationship of the service task request. The business task request comprises a communication task request and/or a calculation task request. Further, the HCA executes the service task request according to the execution dependency relationship of each service task request in the task network diagram until the processing of each service task request in the task network diagram is completed, so that the service communication of the first device is realized.
In some embodiments, the communication task request is used to request a corresponding data transfer operation, such as data transmission or data reception. Some custom parameters may be carried in the communication task request, including, but not limited to, any one or combination of the following: the communication task request includes an identifier of the communication task request, a queue group QP (or a queue group QP identifier) associated with the communication task request, a transmission type of the communication task request (for example, send the task request or receive the task request), other task requests (specifically, the identifier of the other task requests) depended by the communication task request, and other information required by communication, such as information of a source data address and length, a destination address and length, and the like. The source data address and the destination address are used for storing data, and the length is used for indicating the size of the data. The source data address is typically used to store the received data, and the destination address is typically used to store the data that needs to be sent or calculated by the device itself, etc.
The calculation task request is used for requesting corresponding data calculation operations, such as data fusion/summation SUM, data maximum MAX, data minimum MIN, and the like. Some custom parameters may be carried in the compute task request, for example, which may include, but are not limited to, a combination of any one or more of: the task request may include an identification of the computing task request, a type of the computing operation (e.g., SUM, MAX, MIN, etc.), a type of the data operation (e.g., integer int, long and integer long, etc.), an address and a length of the source data, an address and a length of the destination data, an authentication key (qkey), other task requests (specifically, identifications of other task requests) on which the computing task request depends, and so on. The data operation type is used for indicating that the computing task requests to perform the data operation and relates to the type of data, and may be an integer int, a floating point float, and the like. The verification key is used for verifying the validity of the information by the communication opposite-end device, and is similar to key verification.
In some embodiments, the execution dependencies may specifically include task execution relationships and data dependencies. The task execution relationship is used to indicate the execution order of the task requests, including but not limited to parallel execution and/or serial execution. The data dependency relationship is used for indicating other task requests which are depended on when the task request is executed, and particularly execution results from the other task requests can be depended on. For example, when executing the second task request, the execution result of the first task request needs to be used, and the second task request depends on the first task request, and specifically depends on the execution result of the first task request.
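A minimal sketch of how an executor could honor such execution dependencies, using a Kahn-style topological ordering over `depends_on` lists; the graph encoding is hypothetical:

```python
def execution_order(requests):
    """Return an order that respects depends_on: a WR runs only after every
    WR it depends on; WRs with no remaining dependencies may run in parallel."""
    pending = {r["wr_id"]: set(r["depends_on"]) for r in requests}
    order = []
    while pending:
        # every WR whose dependencies are all satisfied is ready (parallel batch)
        ready = [wid for wid, deps in pending.items() if not deps]
        if not ready:
            raise ValueError("cyclic dependency in task network graph")
        for wid in ready:
            order.append(wid)
            del pending[wid]
        for deps in pending.values():
            deps -= set(ready)
    return order

graph = [
    {"wr_id": 1, "depends_on": []},       # e.g. receive data 1
    {"wr_id": 2, "depends_on": []},       # e.g. receive data 2 (parallel with 1)
    {"wr_id": 3, "depends_on": [1, 2]},   # fuse the two execution results
    {"wr_id": 4, "depends_on": [3]},      # send the fused result
]
order = execution_order(graph)
```

Here WRs 1 and 2 form a parallel batch (task execution relationship), while 3 and 4 are serialized behind them by their data dependencies.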
For example, taking the first device as the host device 2, the process by which the host device 2 performs service communication is specifically: the host device 2 needs to receive data from the host device 4 and the host device 5, fuse the data to obtain fused data, and send the fused data to the host device 1 for use. Referring to fig. 8, a schematic diagram of the task network graph corresponding to this service communication is shown, where squares represent communication task requests and circles represent computation task requests. As shown in fig. 8, the service communication includes 4 task requests: a computation task request T3 and communication task requests T1, T2, and T4. For example, the communication task request T1 is used to indicate receiving data 1 from the host device 4, and T1 carries an identifier of the task request (shown as WR-ID:1), a type of the task request (WR-TYPE: 0, denoting a communication task request), a queue group identifier associated with the task request (local-QPN: 1), a transmission type of the task request (opcode: receive, denoting a receive task request), and the identifiers of other task requests it depends on (dependent: null, denoting that it depends on no other task request). Optionally, T1 may further include a source data address (src-addr) and a source data length (src-length). The source data address is the start address for storing data 1, and the source data length reflects the size of data 1.
The communication task request T2 is used to indicate that data 2 from the host device 5 is received, and T2 carries an identifier of the task request (shown as WR-ID:2), a TYPE of the task request (WR-TYPE: 0, which indicates that the communication task request is received), a queue group identifier associated with the task request (local-QPN: 2), a transmission TYPE of the task request (opcode: receive, which indicates that the task request is received), and an identifier of another task request on which the task request depends (depend: null, which indicates that the task request is not dependent on another task request). Optionally, the T2 may further include a source data address (src-addr) and a source data length (src-length). The source data address is a start address for storing the data 2, and the source data length is used for reflecting the size of the data 2.
The computation task request T3 is used to indicate that the fusion operation is performed on data 1 and data 2 to obtain fused data. T3 carries an identification of the task request (WR-ID: 3 as shown), a TYPE of the task request (WR-TYPE: 1 as computation task request), a TYPE of computation operation (calc-TYPE: SUM as data summation), a source data address 1(src-addr1), a source data address 2(src-addr2), a destination address (dest-addr), a data operation TYPE (data-TYPE: int), a data length (data-length), and identifications of other tasks (depended: 1, 2 as T3 depends on T1 and T2, and specifically depends on their execution results, namely, data 1 and data 2 here). Wherein, the source data address 1 and the source data address 2 are respectively used for storing the execution results of the T1 and the T2 task requests, for example, the source data address 1 is used for storing data 1, and the source data address 2 is used for storing data 2. The destination address is used to store the results of the execution of the compute task request T3, which may be used here to store the fused data. The data length is used to reflect the size of the fused data.
The communication task request T4 is for instructing to send the fused data to the host apparatus 1. T4 carries an identifier of the task request (WR-ID: 4 shown in the figure), a TYPE of the task request (WR-TYPE: 0 shown as a communication task request), a queue group identifier associated with the task request (local-QPN: 3), a transmission TYPE of the task request (opcode: send shown as a sending task request), and an identifier of other task requests on which the task request depends (dependent: 3 shown as T4 depending on the computation task request T3). Optionally, T4 may also carry an identifier of the destination device (also referred to as an identifier of the remote device, remote-lid: 1, indicating the host device 1), a queue group identifier associated with the destination device (also referred to as a remote queue group, remote-QPN: 5), and so on. The identification of the destination device is used to indicate the remote device to which the task request of T4 requires communication.
As shown in fig. 8, the task network diagram includes information such as 4 task requests constituting service communication, an execution sequence of each task request, and data dependency. The execution sequence may be parallel execution, such as T1 and T2 in FIG. 8 in parallel; it may also be performed serially, for example, in FIG. 8, T3 is executed first and then T4 is executed. The host device 2 analyzes the task network diagram, and after each task request included in the task network diagram is executed in sequence, the service communication can be realized, and the fused data is sent to the host device 1 for use.
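The T1-T4 graph of fig. 8 can be walked through with a toy interpreter, assuming the tasks are already in dependency order; the addresses and payload values below are made up for illustration:

```python
def run_task_graph(tasks, inbox, memory):
    """Execute a fig. 8 style graph: receives fill memory at src-addr,
    SUM fuses two source addresses into dest-addr, send returns the
    outgoing payload. A toy model, not real RDMA."""
    sent = None
    for t in tasks:                     # assumes dependency order (T1, T2, T3, T4)
        if t["type"] == "recv":
            memory[t["src_addr"]] = inbox[t["peer"]]
        elif t["type"] == "SUM":
            memory[t["dest_addr"]] = memory[t["src_addr1"]] + memory[t["src_addr2"]]
        elif t["type"] == "send":
            sent = (t["remote"], memory[t["src_addr"]])
    return sent

tasks = [
    {"type": "recv", "peer": "host4", "src_addr": 0x10},                       # T1
    {"type": "recv", "peer": "host5", "src_addr": 0x20},                       # T2
    {"type": "SUM", "src_addr1": 0x10, "src_addr2": 0x20, "dest_addr": 0x30},  # T3
    {"type": "send", "src_addr": 0x30, "remote": "host1"},                     # T4
]
result = run_task_graph(tasks, inbox={"host4": 5, "host5": 7}, memory={})
```

The whole graph is issued to the HCA once, and the fused result reaches host device 1 without the host CPU touching the intermediate data.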
By implementing the embodiment of the invention, the task processing of the service communication is realized based on the task network diagram, so that the task request WR can be merged and issued once through the network diagram, the WR issuing times are reduced, the time consumption of the WR issuing is reduced, and the device bandwidth resource is saved. In addition, the task request is executed at the HCA side, other business processing can be carried out after the CPU sends the task network diagram, the CPU does not need to be occupied, and compared with the traditional technology, the method can reduce the load of the CPU and improve the utilization rate of the CPU.
Fig. 9 is a schematic flowchart of another task processing method according to an embodiment of the present invention. The method as shown in fig. 9 is applied in a host device (first device) comprising a network interface card HCA and a host processor, the network interface card comprising a task execution engine and an RDMA execution engine, the method comprising the implementation steps of:
S901, the HCA obtains the task network graph, parses it, and obtains the at least one communication task request and/or at least one computation task request constituting the service communication.
In the present application, an HCA (specifically, a task execution engine of the HCA) of a first device acquires a task network graph of service communication, analyzes the task network graph, and obtains information included in the task network graph, such as at least one task request constituting the service communication and an execution dependency relationship of the task request. Each task request also carries a custom parameter, which may include, but is not limited to, at least one of the following: the task type of the task request, the identification of the task request, the transmission type of the task request, the source data address, the destination address and other information. The task type of the task request is used for indicating that the task request can be a communication task request and/or a calculation task request.
S902, the HCA executes the task network graph; if the task network graph contains a communication task request, the HCA waits to invoke the RDMA execution engine to perform the data transfer operation indicated by the communication task request.
The HCA executes each task request in sequence according to the information contained in the task network diagram, and corresponding service communication is realized. Wherein:
for the communication task request, the HCA identifies the communication task request as a send task request (send) or a receive task request (receive) according to the transmission type carried in the communication task request. If the transmission type in the communication task request indicates that the communication task request is a sending task request, the HCA acquires data to be sent and sends the data to be sent by adopting an RDMA technology.
Optionally, the communication task request may also carry a queue group identifier (QPN) associated with it, and the HCA may store the communication task request, in the form of a task request entry WRE, in the queue group QP corresponding to that QPN. Specifically, if the communication task request is a send task request, the HCA may store it in WRE form in the send queue SQ of the queue group and wait to invoke the RDMA execution engine to execute it. If the communication task request is a receive task request, the HCA may store it in WRE form in the receive queue RQ of the queue group and wait to invoke the RDMA execution engine to execute it.
Specifically, the HCA parses the task network graph through the task execution engine to obtain each task request the graph contains. The task execution engine can determine whether a task request is a computation task request or a communication task request from the task type carried in the request. For a communication task request, the task execution engine can further identify, from the transmission type carried in the request, whether it is a send task request or a receive task request. For a send task request, the task execution engine issues the request, in the form of a task request entry WRE, to the send queue of the queue group QP corresponding to the queue group identifier, and waits to invoke the RDMA execution engine to execute it; specifically, when the RDMA execution engine executes the send task request, it may acquire the data to be sent and send it using the RDMA technique. For a receive task request, the task execution engine issues the request, in WRE form, to the receive queue of the queue group QP corresponding to the queue group identifier, and waits to invoke the RDMA execution engine to execute it; specifically, the RDMA execution engine is invoked to receive, using the RDMA technique, the data sent by the second device.
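The dispatch logic of this paragraph, routing first by task type and then by transmission type, might look like the following sketch; the queue representation is hypothetical:

```python
def dispatch(wr, qp):
    """Route a parsed WR as the text describes: computation task requests go
    to the task execution engine's compute unit; communication task requests
    are posted as WREs to the SQ or RQ of the associated queue group QP."""
    if wr["wr_type"] == 1:             # computation task request
        return "compute_unit"
    if wr["opcode"] == "send":         # communication: send task request
        qp["sq"].append(wr)
        return "sq"
    qp["rq"].append(wr)                # communication: receive task request
    return "rq"

qp = {"sq": [], "rq": []}
targets = [
    dispatch({"wr_type": 1, "opcode": "SUM"}, qp),
    dispatch({"wr_type": 0, "opcode": "send"}, qp),
    dispatch({"wr_type": 0, "opcode": "receive"}, qp),
]
```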
Optionally, the communication task request may also carry a source data address and a destination device identifier. The HCA acquires the data to be sent from the source data address through the RDMA execution engine so as to send the data to be sent to the target equipment corresponding to the target equipment identification. Optionally, the communication task request may also carry a source data length, and the HCA may specifically obtain, from the source data address, to-be-sent data corresponding to the source data length through the RDMA execution engine.
Optionally, the communication task request may also carry an identifier of another task request (e.g., the first task request) on which the communication task request depends, and the data to be sent may be an execution result of the first task request. The first task request may include, but is not limited to, a communication task request or a computing task request.
If the transmission type in the communication task request indicates that the communication task request is a receive task request, the HCA may also receive data to be received transmitted by using an RDMA technique from a second host device (for short, a second device). Optionally, the communication task request carries a source data address, and the HCA may store the data to be received at the source data address. The communication task request can also carry a source data length, the HCA can store the data to be received from the source data address, and the source data length is used for reflecting/indicating the size of the data to be received.
Optionally, the communication task request may also carry an identifier of another task request (e.g., the first task request) on which the communication task request depends, and the data to be received may be an execution result of the first task request. For the first task request, reference is made to the foregoing embodiments, and details are not repeated here.
Optionally, after the RDMA execution engine completes executing the communication task request, a completion notification message may be actively sent to the task execution engine to notify that the communication task request is currently completed. Alternatively, after the RDMA execution engine executes the communication task request, it may generate a corresponding completion queue entry CQE, and add it to the completion queue. Accordingly, when receiving the completion notification message for the communication task request or querying the completion queue entry for the communication task request through the active polling completion queue, the task execution engine may determine that the communication task request is currently executed, so as to continue the execution of the next task request.
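Both completion paths described here, the pushed completion notification and the active poll of the CQ, can be sketched as a toy model of the task execution engine's control unit (names are illustrative):

```python
from collections import deque

class TaskExecEngine:
    """The control unit learns of WR completion either via an explicit
    completion notification message (push) or by polling the CQ (pull)."""

    def __init__(self):
        self.cq = deque()        # completion queue filled by the RDMA engine
        self.completed = set()   # WR ids known to have finished

    def on_completion_notify(self, wr_id):
        # push path: RDMA engine actively sends a completion notification
        self.completed.add(wr_id)

    def poll_cq(self):
        # pull path: drain CQEs and record the WRs they identify
        while self.cq:
            self.completed.add(self.cq.popleft()["wr_id"])

    def is_done(self, wr_id):
        # gate for starting the next dependent task request
        return wr_id in self.completed

eng = TaskExecEngine()
eng.on_completion_notify(1)      # WR 1 completes via notification
eng.cq.append({"wr_id": 2})      # WR 2's CQE awaits polling
eng.poll_cq()
```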
S903, if the task network graph contains a computation task request, the HCA performs the data computation operation indicated by the computation task request.
For the calculation task request, the calculation task request carries a calculation operation type, and the task execution engine of the HCA can acquire data to be calculated and perform the calculation operation indicated by the calculation operation type on the data to be calculated to obtain a calculation result. Alternatively, the computation task request may carry a source data address from which the HCA may obtain the data to be computed. The number of the data to be calculated is not limited, and may be one or more. For example, when there are two pieces of data to be calculated, specifically, the two pieces of data to be calculated may be the first data to be calculated and the second data to be calculated, the source data addresses carried in the calculation task request are also two pieces, specifically, the two pieces of data may be the first source data address and the second source data address. Accordingly, the HCA may obtain the first data to be computed from the first source data address and the second data to be computed from the second source data address.
Optionally, the calculation task request may also carry a source data length for reflecting the size of the data to be calculated, and the HCA obtains the data to be calculated corresponding to the source data length from the source data address. Optionally, the computation task request further carries a destination address, and the HCA may store the computation result to the destination address.
Optionally, the computation task request carries an identifier of another task request (for example, a second task request) on which the computation task request depends, and the data to be computed may specifically be an execution result of the second task request. The second task request may be a communication task request or a calculation task request, and is not limited.
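The fields of a calculation task request described above (calculation operation type, two source data addresses, destination address) can be modelled as follows. This is a sketch under the assumption that memory is represented as a dictionary of named buffers; the field names mirror the parameters in the text but the Python types are invented here for illustration.

```python
from dataclasses import dataclass

@dataclass
class ComputeTaskRequest:
    # Field names mirror the parameters described above; illustrative only.
    calc_type: str   # e.g. "SUM" for data fusion/summation
    src_addr1: str   # address holding the first data to be calculated
    src_addr2: str   # address holding the second data to be calculated
    dst_addr: str    # destination address for the calculation result

# Memory modelled as named buffers (an assumption for this sketch).
memory = {"buffer1": [1, 2], "buffer2": [3, 4], "buffer3": None}

def execute(req, mem):
    # Fetch both pieces of data to be calculated from their source
    # addresses, perform the indicated operation, store the result.
    a, b = mem[req.src_addr1], mem[req.src_addr2]
    if req.calc_type == "SUM":
        mem[req.dst_addr] = [x + y for x, y in zip(a, b)]

execute(ComputeTaskRequest("SUM", "buffer1", "buffer2", "buffer3"), memory)
assert memory["buffer3"] == [4, 6]
```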
And S904, after the HCA executes each task request contained in the task network diagram, the service communication is completed.
The HCA processes each task request included in the task network graph in turn according to the task request processing principle described above, so as to implement service communication of the first device.
To facilitate a better understanding of the embodiments of the present invention, a detailed description is given below with a specific example. Please refer to fig. 10A and fig. 10B, which illustrate communication diagrams of hierarchical aggregation over tree-based networking. Fig. 10A shows the data aggregation phase, and fig. 10B shows the data distribution phase. Each node of the illustrated cluster establishes connections with its parent node and child nodes in the physical topology. A node aggregates/fuses the data of all its child nodes with its own data to obtain aggregated data; that is, a parent node receives and fuses the data (also called fused data) of all its child nodes. If a node has a parent node, it sends its aggregated data to the parent node, waits for the parent node to send down the parent node's aggregated data, and then forwards that data to all its child nodes. If a node has no parent node, it sends its own aggregated data to all its child nodes. In the illustration, each node may represent a host device, denoted host device 1 (host1) to host device 7 (host7). Specifically, taking host device 2 as an example of a parent node, host device 2 needs to receive the respective data of host device 4 and host device 5 and fuse them to obtain first fused data, which it sends to its own parent node, host device 1. Accordingly, host device 2 may receive the total fused data sent by host device 1, which may specifically include the data of each of host devices 2 to 7. Host device 2 then forwards the total fused data to its child nodes, host device 4 and host device 5.
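The per-node behaviour described above depends only on whether the node has a parent and which children it has. The sketch below (illustrative Python; the function name and action strings are invented here, not part of the patent) lists the ordered steps each node performs:

```python
def node_role_actions(node, parent, children):
    """Return the ordered aggregation/distribution actions of one node
    in the tree, as described for fig. 10A and fig. 10B (illustrative)."""
    # Aggregation phase: receive and fuse the data of every child node.
    actions = [f"recv+fuse from {c}" for c in children]
    if parent is not None:
        # A node with a parent sends its aggregated data up, then waits
        # for the parent's aggregated (total) data to come back down.
        actions.append(f"send fused to {parent}")
        actions.append(f"recv total from {parent}")
    # Distribution phase: forward the total data to every child node.
    actions += [f"send total to {c}" for c in children]
    return actions

# host2 has parent host1 and children host4, host5:
acts = node_role_actions("host2", "host1", ["host4", "host5"])
assert acts == [
    "recv+fuse from host4", "recv+fuse from host5",
    "send fused to host1", "recv total from host1",
    "send total to host4", "send total to host5",
]
```

A root node (no parent) skips the upward send and downward wait, matching the case where a node without a parent sends its own aggregated data directly to all child nodes.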
Taking the service communication of host device 2 as an example, the specific flow is as follows: host device 2 receives the data from host device 4 and host device 5, fuses it, and sends the result to its parent node, host device 1; it then waits for host device 1 to complete the data aggregation of all nodes in the cluster and send the result back to host device 2. Host device 2 then issues the aggregation result to host device 4 and host device 5. Based on this detailed flow of service communication, the host processor of host device 2 (host2) can construct a task network graph as shown in fig. 11.
In fig. 11, squares represent communication task requests and circles represent calculation task requests. The illustration includes seven task requests, T1-T7: six communication task requests and one calculation task request, T3. Specifically, the communication task request T1 is used to receive data 1 from host device 4, and T2 is used to receive data 2 from host device 5. T3 is used to fuse data 1 and data 2 to obtain first fused data. T4 is used to send the first fused data to host device 1. T5 is used to receive the second fused data from host device 1. T6 is used to send the second fused data to host device 4, and T7 is used to send the second fused data to host device 5.
The communication task request T1 carries a custom parameter, which may include an identifier of the task request (WR-ID:1), a task TYPE of the task request (WR-TYPE: 0, which is denoted as a communication task request), a queue group identifier associated with the task request (local-QPN: 1), a transmission TYPE of the task request (opcode: receive, which is denoted as a received task request), a source data address (src-addr: buffer1), and an identifier of another task request on which the task request depends (depended: null, which denotes that T1 does not depend on another task request). Wherein the source data address is used to store data 1 of the host device 4 received at T1.
Similarly, the parameters that each task request may carry are given by way of example in the figure. For instance, in addition to the parameters in T1, the calculation task request T3 may carry a calculation operation type (calc-type: SUM, indicating data fusion/summation), a source data address 1 (src-addr1: buffer1), a source data address 2 (src-addr2: buffer2), and a destination address (des-addr: buffer3). Source data address 1 stores data 1, source data address 2 stores data 2, and the destination address is used to store the first fused data obtained after performing the SUM fusion operation on data 1 and data 2. The communication task requests T6 and T7 may also carry an identifier of the destination device (remote-lid), indicating the remote host device with which the communication task request communicates.
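The parameters listed for T1 and T3, including the dependency identifiers, can be encoded as plain records. The sketch below is illustrative only; the dictionary keys follow the names in the text (WR-ID, WR-TYPE, local-QPN, opcode, depend), and the `is_ready` helper is an assumption of this sketch, showing how a dependency identifier gates execution:

```python
# Illustrative encoding of two of the task requests shown in fig. 11.
t1 = {"wr_id": 1, "wr_type": 0, "local_qpn": 1, "opcode": "receive",
      "src_addr": "buffer1", "depend": None}          # T1 depends on nothing
t3 = {"wr_id": 3, "wr_type": 1, "calc_type": "SUM",
      "src_addr1": "buffer1", "src_addr2": "buffer2",
      "des_addr": "buffer3", "depend": [1, 2]}        # T3 depends on T1, T2

def is_ready(req, done_ids):
    """A task request is ready when every task request it depends on
    (identified by WR-ID) has completed."""
    dep = req["depend"]
    return dep is None or all(d in done_ids for d in dep)

assert is_ready(t1, set())          # T1 can run immediately
assert not is_ready(t3, {1})        # T3 must wait for both T1 and T2
assert is_ready(t3, {1, 2})
```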
Referring to fig. 12, an operational diagram of task network graph execution is shown. Specifically, the task execution engine of host device 2 receives the task network graph from the host processor through the host interface. The task execution engine comprises a control unit and a computing unit. The task execution engine parses the task network graph through the control unit to obtain the seven task requests T1-T7. First, the control unit may execute T1 and T2 in parallel: it issues T1 as a task request entry WRE1 into the receive queue RQ in QP1 (the queue group corresponding to the queue group identifier local-QPN: 1) and waits for the RDMA execution engine to schedule the execution of T1 (step S1 in the figure); it issues T2 as a task request entry WRE2 into the receive queue RQ in QP2 and waits for the RDMA execution engine to schedule the execution of T2 (step S2 in the figure). Host device 2 may then process WRE1 and WRE2 through the RDMA execution engine, i.e., receive data 1 sent by host device 4 using the RDMA technique and data 2 sent by host device 5 using the RDMA technique. Further, host device 2 may store the received data 1 and data 2 to source data address 1 (buffer1) and source data address 2 (buffer2), respectively.
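The routing of task request entries into per-queue-group queues can be sketched as follows. This is an illustrative model; the `issue` function and the dictionary layout are assumptions of this sketch, showing only that the queue-group identifier selects the QP and the transmission type selects SQ versus RQ:

```python
from collections import defaultdict, deque

# Each queue group (QP) holds a send queue (SQ) and a receive queue (RQ).
qps = defaultdict(lambda: {"SQ": deque(), "RQ": deque()})

def issue(wre, qpn, opcode):
    """Place a task request entry into the queue selected by its
    queue-group identifier (qpn) and transmission type (illustrative)."""
    queue = "RQ" if opcode == "receive" else "SQ"
    qps[qpn][queue].append(wre)

issue("WRE1", 1, "receive")   # T1 -> receive queue RQ of QP1
issue("WRE2", 2, "receive")   # T2 -> receive queue RQ of QP2
assert list(qps[1]["RQ"]) == ["WRE1"]
assert list(qps[2]["RQ"]) == ["WRE2"]
```

The RDMA execution engine would then consume entries from these queues; that consumption side is omitted here.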
Each time a communication task request, such as T1 (specifically, WRE1), is completed, the RDMA execution engine may generate a corresponding completion queue entry CQE and place it into the corresponding completion queue CQ. Specifically, after WRE1 and WRE2 are executed, CQE1 and CQE2 may be generated and placed into the completion queue CQ.
The control unit may determine the execution status of a communication task request (whether it has finished) by actively polling the completion queue CQ to check whether a completion queue entry corresponding to that communication task request exists. Alternatively, after executing a communication task request, the RDMA execution engine may actively send a completion notification event/message to the task execution engine (specifically, the control unit in the task execution engine); accordingly, upon receiving the completion notification message for the communication task request, the control unit may determine that its execution is complete. Illustratively, as shown, the control unit actively polls the completion queue CQ for the presence of CQE1 and CQE2 to determine the respective execution status of T1 and T2, where CQE1 indicates that T1 (WRE1) has been executed and CQE2 indicates that T2 (WRE2) has been executed (steps S3 and S4 in the figure).
After determining that T1 and T2 have been executed, the control unit may continue to parse T3 and call the computing unit to obtain the execution result of T1 (data 1) and the execution result of T2 (data 2), on which T3 depends, from buffer1 and buffer2 respectively, as shown in steps S5 and S6 in the figure. It then performs the fusion operation indicated by the calculation operation type on data 1 and data 2 to obtain the first fused data. Optionally, the computing unit may also store the first fused data into the destination address buffer3 (step S7 in the figure). After the computing unit completes the computation, it may actively send a completion notification message to the control unit to indicate that the current calculation task request has been executed.
After determining that T3 has been executed, the control unit may parse and execute T4: it issues T4 in the form of WRE4 to the send queue SQ in queue group QP3 and waits for the RDMA execution engine to execute it. Accordingly, the RDMA execution engine acquires the data to be sent (here, the first fused data) from buffer3 and sends the first fused data to host device 1 using the RDMA technique (step S8 in the figure). The control unit determines the execution status of T4 by actively polling or by receiving a completion notification event (step S9 in the figure); for details, refer to the related descriptions of determining the execution status of T1 and T2, which are not repeated here.
After determining that T4 has been executed, the control unit may parse and execute T5: it issues T5 in the form of WRE5 to the receive queue RQ in QP3 and waits for the RDMA execution engine to execute it (step S10 in the figure). The control unit determines the execution status of T5 by actively polling, receiving a completion notification message, or the like (step S11 in the figure). Similarly, after determining that T5 has been executed, the control unit may parse and execute T6 and T7 in parallel: it issues T6 in the form of WRE6 to the send queue SQ of QP1 and T7 in the form of WRE7 to the send queue SQ of QP2, and waits for the RDMA execution engine to execute them (steps S12 and S13 in the figure). Accordingly, the control unit determines the respective execution status of T6 and T7 by actively polling, receiving completion notification events, or the like (steps S14 and S15 in the figure). After the control unit determines that T7 has been executed, it may determine that every task request in the task network graph has been completed, and may send a notification message to the host processor through the host interface to indicate that the task network graph has been executed and the service communication of host device 2 is now complete.
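The scheduling order the control unit follows for T1-T7 can be reproduced from the dependency edges alone. The sketch below is illustrative Python; `execution_waves` is a name invented here, not part of any described interface. It runs every task whose dependencies are satisfied in parallel, wave by wave:

```python
# Dependency edges of the task network graph in fig. 11 (illustrative).
deps = {"T1": [], "T2": [], "T3": ["T1", "T2"], "T4": ["T3"],
        "T5": ["T4"], "T6": ["T5"], "T7": ["T5"]}

def execution_waves(deps):
    """Group task requests into waves: each wave contains every task
    whose dependencies completed in earlier waves (assumes an acyclic
    graph, as a valid task network graph must be)."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(t for t, d in deps.items()
                       if t not in done and all(x in done for x in d))
        waves.append(ready)
        done.update(ready)
    return waves

# Matches the execution order described for fig. 12: T1 and T2 in
# parallel, then T3, T4, T5, and finally T6 and T7 in parallel.
assert execution_waves(deps) == [["T1", "T2"], ["T3"], ["T4"],
                                 ["T5"], ["T6", "T7"]]
```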
By implementing the embodiments of the present invention, a task network graph compatible with both communication task requests and calculation task requests is designed, and the number of times task requests are issued is reduced, thereby reducing the time consumed in issuing task requests. In addition, the task execution engine is used to parse and execute the task network graph, which reduces CPU utilization, and any task request in the task network graph supports control and processing operations on the same QP or different QPs.
Based on the above examples, the following sets forth the relevant products to which the present application is directed. Fig. 13 is a schematic structural diagram of a network device according to an embodiment of the present invention. As shown in fig. 13, the network device 500 (specifically, a network interface card, which may be simply referred to as a network card) includes one or more processors 501, a communication interface 502, and a memory 503, where the processors 501, the communication interface 502, and the memory 503 may be connected by a bus, and may also implement communication by other means such as wireless transmission. The embodiment of the present invention is exemplified by being connected through a bus 504, wherein the memory 503 is used for storing instructions, and the processor 501 is used for executing the instructions stored by the memory 503. The memory 503 stores program codes, and the processor 501 can call the program codes stored in the memory 503 to execute the relevant steps with HCA as the main execution body in the above method embodiment, and/or the technical contents described in the text. 
For example, the following steps are performed: acquiring a task network diagram, wherein the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication, the communication task request is used for requesting to perform data transmission operation indicated by the communication task request, and the calculation task request is used for requesting to perform data calculation operation indicated by the calculation task request; if the task network graph comprises the communication task request, performing data transmission operation indicated by the communication task request by adopting a Remote Direct Memory Access (RDMA) technology; if the task network graph comprises the calculation task request, executing data calculation operation indicated by the calculation task request; and after the task network graph is executed, the service communication is completed.
It should be appreciated that processor 501 may be comprised of one or more general-purpose processors, such as a Central Processing Unit (CPU). The processor 501 may be configured to run related program codes to implement related steps of the task processing method embodiment, where the HCA is the main execution body.
The communication interface 502 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules/devices. For example, in the embodiment of the present application, the communication interface 502 may be specifically configured to receive the task network graph and the like from the host processor.
The Memory 503 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a hard disk (hard disk Drive, HDD), or a Solid-State Drive (SSD); the memory 503 may also comprise a combination of the above kinds of memories. The memory 503 may be used to store a set of program codes, so that the processor 501 may call the program codes stored in the memory 503 to implement the relevant contents of the HCA as the execution subject in the embodiment of the method of the present invention.
It should be noted that fig. 13 is only one possible implementation manner of the embodiment of the present invention, and in practical applications, the network interface card may further include more or less components, which is not limited herein. For the content that is not shown or not described in the embodiment of the present invention, reference may be made to the related explanation in the embodiment described in fig. 1 to fig. 12, and details are not described here.
Fig. 14 is a schematic structural diagram of a host device according to an embodiment of the present invention. The host device 600 includes: a processing unit 602 and a communication unit 603. The processing unit 602 is used to control and manage the actions of the host device 600. Illustratively, the processing unit 602 is configured to support the host device 600 in performing steps S701-S703 in fig. 7, steps S901-S904 in fig. 9, and/or other steps of the techniques described herein. The communication unit 603 is used to support communication between the host device 600 and other devices. Optionally, the host device 600 may further include a storage unit 601 for storing program codes and data of the host device 600.
The Processing Unit 602 may be a Processor or a controller, such as a Central Processing Unit (CPU), a general purpose Processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), or other Programmable logic devices, transistor logic devices, hardware components, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs, and microprocessors, among others. The communication unit 603 may be a communication interface, a transceiver circuit, etc., wherein the communication interface is a generic term and may include one or more interfaces, such as interfaces between terminal devices and other devices. The storage unit 601 may be a memory.
When the processing unit 602 is a processor, the communication unit 603 is a communication interface, and the storage unit 601 is a memory, the host device according to the embodiment of the present invention may be the host device shown in fig. 7.
Referring to fig. 15, the host device 610 includes: a processor 612, a communication interface 613, and a memory 611. Optionally, the host device 610 may also include a bus 614 and a network interface card 615 (simply, network card). The communication interface 613, the processor 612, the memory 611, and the network interface card 615 may be connected to each other via the bus 614; the bus 614 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 614 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 15, but this is not intended to represent only one bus or type of bus. The network card 615 may specifically be the network card (network device) 500 shown in fig. 13, and is not described again here. The network card 615 includes the processor 501, the communication interface 502, and the memory 503, which are shown connected by a bus. The memory 503 is used to store instructions, and the processor 501 is used to execute the instructions stored in the memory 503. The memory 503 stores program codes, and the processor 501 can call the program codes stored in the memory 503 to execute the operation steps in which the HCA is the execution subject, as in the above method embodiments.
The specific implementation of the host device shown in fig. 14 or fig. 15 may also refer to the corresponding description of the foregoing method embodiment, and is not described herein again.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware or in software executed by a processor. The software instructions may be comprised of corresponding software modules that may be stored in a Random Access Memory (RAM), a flash Memory, a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a register, a hard disk, a removable hard disk, a compact disc Read Only Memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a computing device. Of course, the processor and the storage medium may reside as discrete components in a computing device.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Claims (17)

1. A method of task processing, the method comprising:
the method comprises the steps that a first device obtains a task network diagram, wherein the task network diagram comprises at least one communication task request and/or at least one calculation task request for forming service communication, the communication task request is used for requesting to perform data transmission operation indicated by the communication task request, and the calculation task request is used for requesting to perform data calculation operation indicated by the calculation task request;
if the task network graph comprises the communication task request, performing data transmission operation indicated by the communication task request by adopting a Remote Direct Memory Access (RDMA) technology;
if the task network graph comprises the calculation task request, executing data calculation operation indicated by the calculation task request;
and after the task network graph is executed, the service communication is completed.
2. The method of claim 1, wherein the communication task request carries a transmission type,
the method for executing the data transmission operation indicated by the communication task request by adopting the remote direct memory access RDMA technology comprises the following steps:
if the transmission type indicates that the communication task request is a sending task request, obtaining data to be sent, and sending the data to be sent by adopting an RDMA (remote direct memory access) technology;
and if the transmission type indicates that the communication task request is a receiving task request, receiving data to be received sent by the second equipment by adopting an RDMA technology.
3. The method of claim 2, wherein the communication task request further carries a queue group identifier,
the method for executing the data transmission operation indicated by the communication task request by adopting the remote direct memory access RDMA technology comprises the following steps:
if the communication task request is a sending task request, storing the sending task request in a sending queue in a queue group corresponding to the queue group identifier, waiting for executing the sending task request to obtain data to be sent, and sending the data to be sent by adopting an RDMA (remote direct memory access) technology;
and if the communication task request is a received task request, storing the received task request in a receiving queue in the queue group corresponding to the queue group identifier, and waiting for executing the received task request to receive data to be received sent by second equipment by using an RDMA (remote direct memory access) technology.
4. The method according to claim 2 or 3, wherein the communication task request further carries a source data address, and the source data address is used for storing the data to be sent or storing the data to be received.
5. The method according to any one of claims 2 to 4,
when the communication task request also carries an identifier of a first task request on which the communication task request depends, the data to be sent and the data to be received are execution results of the first task request; alternatively,
and when the communication task request does not carry the identifier of the first task request depended by the communication task request, the data to be sent and the data to be received are pre-stored data.
6. The method of claim 3, further comprising:
inquiring a completion queue entry aiming at the communication task request in a completion queue or receiving a completion notification message aiming at the completion of the execution of the communication task request, and determining that the execution of the communication task request is completed;
the completion queue stores the completion queue entry used for indicating the identifier of the communication task request after completion of execution.
7. The method according to any of claims 1-6, wherein the computing task request carries a computing operation type,
the executing the data computing operation indicated by the computing task request comprises:
and acquiring data to be calculated, and performing calculation operation indicated by the calculation operation type on the data to be calculated to obtain an execution result of the calculation task request.
8. The method of claim 7, wherein the computation task request further carries a source data address, and wherein the source data address is used for storing the data to be computed.
9. The method according to claim 7 or 8, wherein the computation task request further carries an identifier of a second task request on which the computation task request depends, and the data to be computed is an execution result of the second task request.
10. The method according to any one of claims 1-9, wherein before the obtaining the task network graph, further comprising:
performing task decomposition on service communication to obtain at least one task request forming the service communication and an execution dependency relationship of the at least one task request, wherein the execution dependency relationship comprises a data dependency relationship and a task execution relationship, the task execution relationship comprises serial execution and/or parallel execution, the data dependency relationship is used for indicating other task requests depended on when the task request is executed, and the task request comprises a communication task request and/or a calculation task request;
and generating a task network graph according to the at least one task request and the execution dependency relationship of the at least one task request.
11. A network device comprising a task execution engine and an RDMA execution engine, wherein:
the task execution engine is used for acquiring a task network graph, and the task network graph comprises at least one communication task request and/or at least one calculation task request for forming service communication;
the task execution engine is further used for calling the RDMA execution engine to execute the data transmission operation indicated by the communication task request if the task network graph comprises the communication task request;
the task execution engine is further configured to execute a data calculation operation indicated by the calculation task request if the task network graph includes the calculation task request;
and the task execution engine is also used for determining to finish the service communication after the task network graph is executed.
12. The apparatus of claim 11, wherein the communication task request carries a transmission type,
the task execution engine is specifically configured to store the send task request in a send queue in a corresponding queue group QP in the form of a task request entry if the transmission type indicates that the communication task request is a send task request, and wait for the RDMA execution engine to execute the send task request; alternatively,
and if the transmission type indicates that the communication task request is a received task request, storing the received task request in a received queue in a corresponding queue group QP in a form of a task request entry, and waiting for an RDMA execution engine to execute the received task request.
13. The apparatus of claim 12,
the RDMA execution engine is used for acquiring data to be sent and sending the data to be sent by adopting an RDMA technology if the communication task request is a sending task request; and if the communication task request is a task receiving request, receiving data to be received sent by the host equipment by adopting an RDMA technology.
14. The apparatus according to any one of claims 11-13,
the RDMA execution engine is further used for sending a completion notification message aiming at the communication task request to the task execution engine after the communication task request is executed; or after the communication task request is executed, generating a completion queue entry and adding the completion queue entry into a completion queue, wherein the completion queue entry is used for indicating that the communication task request is executed;
the task execution engine is further configured to determine that the communication task request is executed completely if a completion notification message for the communication task request is received or the completion queue is polled to obtain the completion queue entry.
15. The apparatus according to any of claims 11-14, wherein the compute task request carries a compute operation type,
the task execution engine is used for acquiring data to be calculated, and executing the calculation operation indicated by the calculation operation type on the data to be calculated to obtain the execution result of the calculation task request.
16. A network device, comprising a memory and a processor coupled to the memory, the memory configured to store instructions, the processor configured to execute the instructions; wherein the processor, when executing the instructions, performs the method of any of claims 1-9 above.
17. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 10.
CN201910687998.XA 2019-07-27 2019-07-27 Task processing method, related equipment and computer storage medium Active CN112291293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910687998.XA CN112291293B (en) 2019-07-27 2019-07-27 Task processing method, related equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN112291293A true CN112291293A (en) 2021-01-29
CN112291293B CN112291293B (en) 2023-01-06

Family

ID=74418899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910687998.XA Active CN112291293B (en) 2019-07-27 2019-07-27 Task processing method, related equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112291293B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129390A (en) * 2011-03-10 2011-07-20 中国科学技术大学苏州研究院 Task scheduling system of on-chip multi-core computing platform and method for task parallelization
CN104063344A (en) * 2014-06-20 2014-09-24 华为技术有限公司 Data storage method and network interface card
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
EP3096226A4 (en) * 2014-10-29 2017-11-15 Baidu Online Network Technology (Beijing) Co., Ltd. Conversation processing method, conversation management system and computer device
CN107690622A (en) * 2016-08-26 2018-02-13 华为技术有限公司 Method, apparatus and system for implementing hardware-accelerated processing
CN107819855A (en) * 2017-11-14 2018-03-20 成都路行通信息技术有限公司 Message distribution method and device
CN108628672A (en) * 2018-05-04 2018-10-09 武汉轻工大学 Task scheduling method, system, terminal device and storage medium
CN109697122A (en) * 2017-10-20 2019-04-30 华为技术有限公司 Task processing method, equipment and computer storage medium
CN109840154A (en) * 2019-01-08 2019-06-04 南京邮电大学 Task-dependency-based computation migration method in a mobile cloud environment
CN109918432A (en) * 2019-01-28 2019-06-21 中国平安财产保险股份有限公司 Method, apparatus, computer device and storage medium for extracting a task relationship chain
CN109992405A (en) * 2017-12-29 2019-07-09 西安华为技术有限公司 Method and network interface card for processing data packets

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112822300A (en) * 2021-04-19 2021-05-18 北京易捷思达科技发展有限公司 RDMA (Remote Direct Memory Access)-based data transmission method and device, and electronic device
CN113537937A (en) * 2021-07-16 2021-10-22 重庆富民银行股份有限公司 Topological-sorting-based task orchestration method, apparatus, device and storage medium
CN114296916A (en) * 2021-12-23 2022-04-08 苏州浪潮智能科技有限公司 Method, device and medium for improving RDMA (remote direct memory Access) release performance
CN114296916B (en) * 2021-12-23 2024-01-12 苏州浪潮智能科技有限公司 Method, device and medium for improving RDMA release performance
CN114510339A (en) * 2022-04-20 2022-05-17 苏州浪潮智能科技有限公司 Computing task scheduling method and device, electronic equipment and readable storage medium
CN115237582A (en) * 2022-09-22 2022-10-25 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN115237582B (en) * 2022-09-22 2022-12-09 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
TWI831729B (en) * 2022-09-22 2024-02-01 大陸商摩爾線程智能科技(北京)有限責任公司 Method for processing multiple tasks, processing device and heterogeneous computing system
WO2024093885A1 (en) * 2022-10-31 2024-05-10 华为技术有限公司 Chip system and collective communication method
CN116721007A (en) * 2023-08-02 2023-09-08 摩尔线程智能科技(北京)有限责任公司 Task control method, system and device, electronic equipment and storage medium
CN116721007B (en) * 2023-08-02 2023-10-27 摩尔线程智能科技(北京)有限责任公司 Task control method, system and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112291293B (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN112291293B (en) Task processing method, related equipment and computer storage medium
US20220214919A1 (en) System and method for facilitating efficient load balancing in a network interface controller (nic)
US20200314181A1 (en) Communication with accelerator via RDMA-based network adapter
US7937447B1 (en) Communication between computer systems over an input/output (I/O) bus
US10521283B2 (en) In-node aggregation and disaggregation of MPI alltoall and alltoallv collectives
CN113485823A (en) Data transmission method, device, network equipment and storage medium
US20170168986A1 (en) Adaptive coalescing of remote direct memory access acknowledgements based on i/o characteristics
US11922304B2 (en) Remote artificial intelligence (AI) acceleration system
WO2016187813A1 (en) Data transmission method and device for photoelectric hybrid network
US11750418B2 (en) Cross network bridging
US20100306387A1 (en) Network interface device
WO2021073546A1 (en) Data access method, device, and first computer device
US10609125B2 (en) Method and system for transmitting communication data
US20230080588A1 (en) Mqtt protocol simulation method and simulation device
WO2022017475A1 (en) Data access method and related device
US20220358002A1 (en) Network attached mpi processing architecture in smartnics
US10305772B2 (en) Using a single work item to send multiple messages
CN116471242A (en) RDMA-based transmitting end, RDMA-based receiving end, data transmission system and data transmission method
US20220291928A1 (en) Event controller in a device
WO2022143774A1 (en) Data access method and related device
US20150254100A1 (en) Software Enabled Network Storage Accelerator (SENSA) - Storage Virtualization Offload Engine (SVOE)
US20150254196A1 (en) Software Enabled Network Storage Accelerator (SENSA) - network - disk DMA (NDDMA)
WO2024077999A1 (en) Collective communication method and computing cluster
CN117041147B (en) Intelligent network card equipment, host equipment, method and system
Tian et al. A novel software-based multi-path RDMA solution for data center networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant