CN114553879A - Distributed task processing method, system and storage medium

Info

Publication number
CN114553879A
Authority
CN
China
Prior art keywords
task, node, parameter, aggregation, data packet
Legal status
Pending
Application number
CN202011390848.1A
Other languages
Chinese (zh)
Inventor
吴文斐
刘俊林
陈奕熹
Current Assignee
Zhongguancun Haihua Information Technology Frontier Research Institute
Original Assignee
Zhongguancun Haihua Information Technology Frontier Research Institute
Application filed by Zhongguancun Haihua Information Technology Frontier Research Institute
Publication of CN114553879A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application discloses a distributed task processing method, system, and storage medium. The distributed task processing method includes: sending a first data packet containing task parameters and identification information indicating the node that performs the aggregation operation, where the identification information indicates whether a forwarding node or a parameter node performs the aggregation operation on the task parameters, and the task parameters are obtained by executing a distributed computing task; and receiving a second data packet containing an aggregation parameter, where the aggregation parameter is used for data processing by all working nodes corresponding to the same distributed computing task and is obtained after the forwarding node or the parameter node performs the aggregation operation on the task parameters in the first data packets. The method solves the communication bottleneck problem in distributed systems and improves the processing efficiency of distributed tasks.

Description

Distributed task processing method, system and storage medium
Technical Field
The present application relates to the field of computer data processing, and in particular, to a distributed task processing method, system, and storage medium.
Background
At present, the network architecture of a distributed computing system includes a parameter server and a plurality of nodes: the nodes compute model parameters and synchronize them to the parameter server, and the parameter server updates the global parameters according to the model parameters transmitted by the nodes. However, as the amount of data transmitted between each node and the parameter server grows, such an architecture inevitably encounters a communication bottleneck.
Disclosure of Invention
In view of the above-mentioned shortcomings of the related art, it is an object of the present application to provide a distributed task processing method, system, and storage medium that overcome the technical problem of communication bottlenecks in distributed computing systems in the related art.
To achieve the above and other related objects, a first aspect of the present disclosure provides a distributed task processing method, including: sending a first data packet containing task parameters and identification information indicating the node that performs the aggregation operation, the identification information indicating whether a forwarding node or a parameter node performs the aggregation operation on the task parameters, where the task parameters are obtained by executing a distributed computing task; and receiving a second data packet containing an aggregation parameter, where the aggregation parameter is used for data processing by all working nodes corresponding to the same distributed computing task and is obtained after the forwarding node or the parameter node performs the aggregation operation on the task parameters in the first data packets.
A second aspect of the present disclosure provides a distributed task processing system, including: a sending module configured to send a first data packet containing task parameters and identification information indicating the node that performs the aggregation operation, the identification information indicating whether a forwarding node or a parameter node performs the aggregation operation on the task parameters, where the task parameters are obtained by executing a distributed computing task; and a receiving module configured to receive a second data packet containing an aggregation parameter, where the aggregation parameter is used for data processing by all working nodes corresponding to the same distributed computing task and is obtained after the forwarding node or the parameter node performs the aggregation operation according to the task parameters in the first data packets.
A third aspect of the present disclosure provides a working node, including: at least one memory for storing at least one program; and at least one processor, coupled to the at least one memory, configured to execute the at least one program so as to implement the distributed task processing method of the first aspect.
A fourth aspect of the present disclosure provides a distributed task processing method, including: receiving a plurality of first data packets containing task parameters and identification information indicating the node that performs the aggregation operation, the identification information indicating whether the aggregation operation on the task parameters is performed by a forwarding node or a parameter node, where the task parameters are obtained by a plurality of working nodes executing a distributed computing task; performing the aggregation operation on each task parameter according to the identification information in the first data packets to obtain an aggregation parameter, and sending the aggregation parameter to a parameter node for verification; and feeding back a second data packet, sent by the parameter node and containing the aggregation parameter, to the corresponding working nodes.
A fifth aspect of the present disclosure provides a distributed task processing system, including: a receiving module configured to receive a plurality of first data packets containing task parameters and identification information indicating the node that performs the aggregation operation, the identification information indicating whether the aggregation operation on the task parameters is performed by a forwarding node or a parameter node, where the task parameters are obtained by a plurality of working nodes executing a distributed computing task; a processing module configured to perform the aggregation operation on each task parameter according to the identification information in the first data packets to obtain an aggregation parameter, and to send the aggregation parameter to a parameter node for verification; and a feedback module configured to feed back a second data packet, sent by the parameter node and containing the aggregation parameter, to the corresponding working nodes.
A sixth aspect of the present disclosure provides a forwarding node, including: at least one memory for storing at least one program; and at least one processor, coupled to the at least one memory, configured to execute the at least one program so as to implement the distributed task processing method of the fourth aspect.
A seventh aspect of the present disclosure provides a distributed task processing method, including: receiving a second data packet containing an aggregation parameter, where the aggregation parameter is obtained by a forwarding node performing an aggregation operation on the task parameters in a plurality of first data packets, the first data packets contain identification information indicating that the forwarding node performs the aggregation operation, and the task parameters are obtained by a plurality of working nodes executing a distributed computing task; and performing a verification operation on the received second data packet and feeding back the verified second data packet to the forwarding node, so that the forwarding node feeds back the second data packet containing the aggregation parameter to the corresponding working nodes.
An eighth aspect of the present disclosure provides a distributed task processing system, including: a receiving module configured to receive a second data packet containing an aggregation parameter, where the aggregation parameter is obtained by a forwarding node performing an aggregation operation on the task parameters in a plurality of first data packets, the first data packets contain identification information indicating that the forwarding node performs the aggregation operation, and the task parameters are obtained by a plurality of working nodes executing a distributed computing task; and a processing module configured to perform a verification operation on the aggregation parameter in the received second data packet and feed back the verified second data packet to the forwarding node, so that the forwarding node feeds back the second data packet containing the aggregation parameter to the corresponding working nodes.
A ninth aspect of the present disclosure provides a parameter node, including: at least one memory for storing at least one program; and at least one processor, coupled to the at least one memory, configured to execute the at least one program so as to implement the distributed task processing method of the seventh aspect.
A tenth aspect of the present disclosure provides a distributed task processing system, including: a plurality of working nodes according to the third aspect; at least one forwarding node according to the sixth aspect, communicatively connected to the working nodes and configured to perform an aggregation operation according to the task parameters in the first data packets sent by the working nodes to obtain an aggregation result or an aggregation parameter; and the parameter node according to the ninth aspect, communicatively connected to the forwarding node and configured to perform an aggregation operation on the aggregation result to obtain an aggregation parameter and/or to perform a verification operation on the aggregation parameter.
An eleventh aspect of the present disclosure provides a computer-readable storage medium storing at least one program which, when executed by a processor, implements the distributed task processing method of the first aspect, the fourth aspect, or the seventh aspect.
To sum up, in the distributed task processing method, system, and storage medium provided by the present application, a forwarding node in the distributed task processing architecture performs the aggregation operation on the task parameters received from the working nodes and sends the aggregation parameter obtained by the aggregation operation to the parameter node for verification, thereby solving the communication bottleneck problem between the forwarding node and the parameter node and improving the processing efficiency of distributed tasks.
Other aspects and advantages of the present application will be readily apparent to those skilled in the art from the following detailed description, in which only exemplary embodiments of the present application are shown and described. As those skilled in the art will recognize, the present disclosure enables changes to the specific embodiments disclosed without departing from the spirit and scope of the invention to which this application is directed. Accordingly, the drawings and the descriptions in the specification are illustrative only and not limiting.
Drawings
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings, which are briefly described as follows:
FIG. 1 is a schematic diagram of a distributed task processing architecture according to an embodiment of the present application.
FIGS. 2A and 2B are schematic diagrams of a distributed task processing architecture according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a distributed task processing architecture according to an embodiment of the present application.
FIG. 4 is a schematic flowchart of a distributed task processing method according to an embodiment of the present application.
FIG. 5 is a schematic flowchart of a distributed task processing method according to another embodiment of the present application.
FIG. 6 is a schematic flowchart of a distributed task processing method according to another embodiment of the present application.
FIG. 7 is a schematic flowchart of a distributed task processing method according to another embodiment of the present application.
FIG. 8 is a schematic flowchart of a distributed task processing method according to another embodiment of the present application.
FIG. 9 is a schematic flowchart of a distributed task processing method according to another embodiment of the present application.
FIG. 10 is a schematic flowchart of an exception handling method for distributed tasks according to an embodiment of the present application.
FIG. 11 is a schematic flowchart of an exception handling method for distributed tasks according to another embodiment of the present application.
FIG. 12 is a block diagram of the modules of a distributed task processing system according to an embodiment of the present application.
FIG. 13 is a block diagram of the modules of a distributed task processing system according to another embodiment of the present application.
FIG. 14 is a block diagram of the modules of a distributed task processing system according to another embodiment of the present application.
FIG. 15 is a block diagram of the modules of an exception handling system for distributed tasks according to an embodiment of the present application.
FIG. 16 is a block diagram of the modules of an exception handling system for distributed tasks according to an embodiment of the present application.
FIG. 17 is a block diagram of the modules of a working node according to an embodiment of the present application.
FIG. 18 is a block diagram of the modules of a forwarding node according to an embodiment of the present application.
FIG. 19 is a block diagram of the modules of a parameter node according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided for illustrative purposes, and other advantages and capabilities of the present application will become apparent to those skilled in the art from the present disclosure.
In the following description, reference is made to the accompanying drawings, which describe several embodiments of the application. It is to be understood that other embodiments may be utilized, and that structural, electrical, and operational changes to the modules or units may be made without departing from the spirit and scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of the embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Although the terms first, second, etc. may be used herein to describe various elements, information, or parameters in some instances, these elements or parameters should not be limited by these terms. These terms are only used to distinguish one element or parameter from another. For example, a first data format may be referred to as a second data format, and similarly, a second data format may be referred to as a first data format, without departing from the scope of the various described embodiments. The first data format and the second data format both describe a data format, but they are not the same data format unless the context clearly dictates otherwise. Depending on the context, the word "if" as used herein may be interpreted, for example, as "when" or "upon".
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
Distributed computing systems have found increasingly widespread use as data volumes have exploded and computer performance has improved. One common application of distributed computing systems is distributed machine learning model training. Taking gradient training of a distributed machine learning model as an example, the distributed architecture includes a plurality of working nodes and a parameter node. In some examples, the training process is as follows: the working nodes each train on their own data sets to obtain gradient parameters and transmit the gradient parameters to the parameter node, which updates the global gradient parameters accordingly. In other examples, the working nodes share the training of different parts of one model; each part is trained with its data set to obtain the corresponding gradient parameters, which are transmitted to the parameter node so that the parameter node updates the global gradient parameters. However, as the amount of data to be transmitted increases, setting only one parameter node makes the parameter node an unavoidable transmission-bandwidth bottleneck, while setting a plurality of parameter nodes causes network interconnection saturation and excessive load.
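For illustration only, the following Python sketch mirrors the parameter-server pattern just described: each working node computes a gradient on its own data subset, and a single parameter node averages the gradients into a global update. The least-squares loss and all names are hypothetical, not taken from the application; the single aggregation point in this loop is precisely the bandwidth bottleneck discussed above.

```python
import numpy as np

def worker_gradient(weights, data, labels):
    # Hypothetical per-worker gradient of a least-squares loss on its data subset.
    preds = data @ weights
    return data.T @ (preds - labels) / len(labels)

def training_round(weights, shards, lr=0.1):
    # Each working node computes a gradient on its own shard ...
    grads = [worker_gradient(weights, d, l) for d, l in shards]
    # ... and the single parameter node aggregates them into one global update.
    avg_grad = np.mean(grads, axis=0)
    return weights - lr * avg_grad

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
w = np.zeros(3)
for _ in range(10):
    w = training_round(w, shards)
```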
In view of this, the present application provides a distributed task processing architecture that can be used to implement distributed computing (Distributed Computing) tasks in both a single-tenant mode and a multi-tenant (Multi-tenant) mode.
Please refer to fig. 1, 2A-2B, which are schematic diagrams illustrating a distributed task processing architecture according to an embodiment of the present invention. As shown in fig. 1, the distributed task processing architecture includes a plurality of working nodes, at least one forwarding node, and a parameter node.
Wherein the plurality of worker nodes are configured to perform one or more distributed computing tasks; each working node is in communication connection with the forwarding node so as to transmit task parameters obtained by executing the distributed computing task to the forwarding node; the forwarding node receives the task parameters sent by the working node, executes aggregation operation to obtain aggregation parameters, and transmits the aggregation parameters to the parameter node for verification; and the parameter node receives the aggregation parameters sent by the forwarding node and executes verification operation.
When the parameter node checks and confirms that the aggregation parameter is correct, the parameter node returns the aggregation parameter to the forwarding node, and the forwarding node distributes the aggregation parameter to each working node; and each working node executes a new round of distributed computing task according to the received aggregation parameters.
Therefore, by having the forwarding node perform the aggregation operation and the parameter node perform the verification operation, the distributed task processing architecture alleviates the transmission-bandwidth bottleneck, improves the computing efficiency of distributed computing tasks, and effectively reduces the communication overhead in the network.
When large-scale distributed computing tasks need to be processed, the number of working nodes increases; to prevent the transmission bandwidth between the forwarding node and the working nodes from becoming a bottleneck, the forwarding nodes can further be arranged as primary forwarding nodes and secondary forwarding nodes.
In some embodiments, the distributed task processing architecture includes a plurality of working nodes, multi-level forwarding nodes, and a parameter node. Illustratively, as shown in fig. 2A, the multi-level forwarding nodes include a plurality of first-level forwarding nodes and at least one second-level forwarding node. Each first-level forwarding node is communicatively connected to a plurality of working nodes, and the second-level forwarding node is communicatively connected to all the first-level forwarding nodes. Similar to the embodiment shown in fig. 1, a first-level forwarding node receives the task parameters obtained by the working nodes executing the distributed computing task and performs the aggregation operation to obtain aggregation parameters, which are sent to the parameter node via the second-level forwarding node; the parameter node performs the verification operation.
In some embodiments, the aggregation operation performed by the primary forwarding node results in partially completed intermediate results (for the sake of clarity, the final result of completing the aggregation is hereinafter referred to as "aggregation parameter", and the partially completed intermediate results are hereinafter referred to as "aggregation result"); the second-level forwarding node receives the aggregation result sent by the first-level forwarding node and then executes aggregation operation to obtain the final aggregation parameter.
In order to improve the processing efficiency of the distributed computing task and fully utilize resources, the secondary forwarding node can be further connected with at least one working node. Therefore, the second-level forwarding node receives the aggregation result sent by the first-level forwarding node, also receives the task parameter sent by the working node connected with the second-level forwarding node, and performs aggregation operation on the aggregation result and the task parameter to obtain a final aggregation parameter.
As the amount of data to be processed increases, the distributed task processing architecture may also take the form shown in fig. 2B. As shown in the figure, the multi-level forwarding nodes may further include a data center (Data Center), which communicatively connects the first-level and second-level forwarding nodes and implements functions such as communication transmission and data interaction between them.
It should be understood that the distributed task processing architecture shown in fig. 2A and 2B may also be adjusted according to actual requirements in a specific application scenario, for example, one or more working nodes are added or deleted. It will be apparent to those skilled in the art that any variations and modifications made without departing from the distributed task processing architecture as shown in fig. 1 are within the scope of the present application.
For ease of understanding, the terms referred to in this application will be described below.
The working node is responsible for the parameter computation of distributed computing tasks and may be a single computer device, or an entity device or virtual device used in a cloud-architecture-based service system. The single computer device may be an autonomously configured computer device that can execute distributed computing tasks based on computation instructions; it may be located in a private machine room or in a leased cabinet in a public machine room. The service system of the cloud architecture includes a public cloud service end and a private cloud service end, where the public or private cloud service end includes IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), SaaS (Software-as-a-Service), and the like; examples include the Ali cloud computing service platform, the Amazon cloud computing service platform, the Baidu cloud computing platform, the Tencent cloud computing platform, and the like. The working node may also be a virtual device, in which case the operation instructions it is configured with are a software program executable by the virtual device, and the entity or virtual device of the working node is configured in the distributed architecture/system. In some embodiments, depending on the hardware that actually runs the operation instructions of the working node, the above devices may be located on a single server or across multiple servers, with the operation instructions executed through data communication between the servers.
The forwarding node has aggregation-computation and forwarding functions. Illustratively, the forwarding node is, for example, a programmable switch (Programmable Switch) or a dedicated network device with an embedded FPGA; it provides a path between the working nodes and the parameter node connected to it and receives data sent by the working nodes to perform the corresponding computation. Since servers are usually deployed relatively densely, the working nodes and the forwarding node are illustratively connected in a Top-of-Rack (TOR) manner, that is, the switch is mounted in the server cabinet so that the network ports of the servers connect to the switch at the top of the cabinet, which simplifies wiring and facilitates maintenance and management.
The parameter node is responsible for managing the storage and updating of parameters. In some embodiments, the parameter node further has an aggregation calculation and verification function to ensure correctness of the result obtained by aggregation. Illustratively, the parameter node is, for example, one server or a cluster composed of a plurality of servers.
The communication connection is a connection mode which establishes connection between different nodes based on a certain communication protocol or other matching rules and realizes data transmission and data exchange, and comprises connection established by identification information pairing between the nodes. The communication protocol is a rule to be followed for realizing communication or service between nodes, and the protocol defines the format of a data unit, the connection mode of the data unit and the sending and receiving time sequence of the data so as to ensure the identification and transmission of the data between the nodes; the communication protocol includes a TCP/IP protocol, etc. The identification information is used for determining the attributes of different nodes so as to determine that the connection between the nodes is correct and corresponding, and then the data transmission between subsequent nodes is realized.
The single-tenant mode refers to the distributed processing architecture being used for processing the same distributed computing task, that is, all the work nodes perform computing of one distributed computing task. Correspondingly, the multi-tenant mode means that a plurality of distributed computing tasks can be processed simultaneously in the distributed processing architecture, that is, the working nodes are divided into a plurality of clusters, and each working node in the same cluster executes the same distributed computing task, so that the computing efficiency can be improved, and the waste of computing resources is avoided.
Illustratively, the distributed computing task includes a computing task that performs gradient training on the machine learning model in a distributed computing manner. The machine learning model is a model obtained by using a machine learning algorithm, such as a classification model. In the distributed computing task, each working node is respectively distributed with a subset of a complete training data set, and gradient updating of a machine learning model is completed through a plurality of rounds of iterative training.
For the sake of clarity, the distributed task processing architecture is used herein as an example for processing a distributed computing task (hereinafter referred to as "task") based on machine learning model training; meanwhile, unless otherwise specified, in the embodiments of the present application, each worker node is used to execute one task (i.e., in a single-tenant mode) as an example.
In order to clearly show the data processing and transmission flow between nodes in the distributed task processing architecture, different execution entities are used as cut-in angles in the following description.
Example one
Referring to fig. 3 and fig. 4, fig. 3 is a schematic diagram illustrating an embodiment of a distributed task processing architecture of the present application, and fig. 4 is a flowchart illustrating a distributed task processing method of the present application in an embodiment. Taking the single-tenant mode as an example, after the worker node executes the training task of the current round to obtain the task parameter (for example, the gradient value obtained by the training subset), as shown in fig. 3 and 4, the worker node executes step S101.
In step S101, a first data packet including a task parameter and identification information indicating a node performing an aggregation operation is transmitted; the identification information is used for indicating that a forwarding node or a parameter node executes an aggregation operation on the task parameter; wherein the task parameters are obtained by performing a distributed computing task.
In this case, each worker node sends a Packet (Packet) containing the task parameters to the forwarding node. For the purpose of distinction, a data packet containing task parameters sent by a working node is referred to as a first data packet, and a data packet containing aggregation parameters obtained after aggregation operation is referred to as a second data packet.
It should be noted that the "first" and "second" are only used to distinguish whether the data packet is aggregated, and although other information (such as information of source port number, destination port number, etc.) in the data packet may change with the transmission process in an actual scenario, they are not used as a basis for distinguishing the first data packet from the second data packet; instead, the first and second packets are distinguished by parameter data (task parameters or aggregation parameters) carried in the packets.
The task parameters are parameters computed while the working nodes execute the distributed computing task, such as gradient values obtained by training a deep neural network model. This is not limiting; for example, when the distributed task processing architecture is used to execute a computing task based on the private data of the distributed architecture, such as a federated learning (Federated Learning) training task, the task parameter may also be a parameter encrypted by each working node, and the like.
The first data packet further includes identification information indicating the node that performs the aggregation operation. When the identification information indicates that the aggregation operation is performed by the forwarding node, the forwarding node performs the aggregation operation after receiving the first data packet; when the identification information indicates that the aggregation operation is performed by the parameter node, the forwarding node receives the first data packet and forwards it to the parameter node, which performs the aggregation operation. Under normal conditions ("normal conditions" here means no packet loss caused by communication link interruption, network delay, or the like), the identification information in the first data packet sent by the working node instructs the forwarding node to perform the aggregation operation.
In some embodiments, the working node adds the identification information to the header (Header) of the data packet when encapsulating it. The identification information consists, for example, of one or more flag bits (Flag) in the header. Illustratively, the identification information indicating the node that performs the aggregation operation is represented by "0" or "1": when the identification information is "0", the aggregation operation is performed by the forwarding node; when it is "1", the aggregation operation is performed by the parameter node and the forwarding node only performs the forwarding operation.
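For illustration, a minimal Python sketch of such a header is given below. The application does not specify a wire format; the one-byte agg_by flag ("0" = forwarding node, "1" = parameter node), the job_id field (task identification information, see the multi-tenant embodiment), and the node_bits field (node identification information) are assumed layout choices.

```python
import struct

# Hypothetical header layout (not specified by the application):
# agg_by    - 1 byte: 0 = aggregate at forwarding node, 1 = aggregate at parameter node
# job_id    - 2 bytes: task identification information
# node_bits - 2 bytes: node identification bitmap
HEADER_FMT = "!BHH"

def build_first_packet(agg_by, job_id, node_bits, payload):
    # Encapsulate the task parameters behind the identification fields.
    return struct.pack(HEADER_FMT, agg_by, job_id, node_bits) + payload

def parse_header(packet):
    agg_by, job_id, node_bits = struct.unpack_from(HEADER_FMT, packet)
    return agg_by, job_id, node_bits

pkt = build_first_packet(0, job_id=7, node_bits=0b0001, payload=b"\x00" * 8)
assert parse_header(pkt) == (0, 7, 0b0001)
```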
It should be understood that the above embodiments are only schematic examples and are not limited, and the form or type of the identification information in a specific application scenario may vary according to actual requirements, for example, the representation may also be a field represented by a plurality of bits (bits).
In some embodiments, the first data packet further includes node identification information, so that the node performing the aggregation operation confirms which working nodes perform the current task cooperatively. Here, the working nodes add node identification information in the header of the first data packet, and when the forwarding node or the parameter node receives and analyzes the first data packet, it is possible to know how many working nodes execute a task corresponding to the first data packet in total according to the node identification information.
In some embodiments, the node identification information may also be used to determine from which working node a currently received data packet originated. Illustratively, the node identification information is represented by a field of several bits. For example, when the current task is executed by 4 working nodes, the node identification information includes, for example: 0001, 0010, 0100, 1000, where the "0" or "1" in each bit indicates whether the corresponding working node sent the first data packet. Through the node identification information, the forwarding node or the parameter node can know that the current task is executed by 4 working nodes, which working nodes' data packets have already been received, which working nodes' data packets are still pending, from which working node the currently received first data packet originated, and so on.
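A minimal sketch of how a node performing the aggregation operation might track such a bitmap follows; the 4-worker mask and helper names are hypothetical, not taken from the application.

```python
EXPECTED = 0b1111  # four working nodes cooperate on the current task

def update_received(received_mask, node_bits):
    # node_bits identifies the sender, e.g. 0b0001, 0b0010, 0b0100, 0b1000.
    return received_mask | node_bits

received = 0
for node_bits in (0b0001, 0b0100):      # packets from workers 1 and 3 have arrived
    received = update_received(received, node_bits)

missing = EXPECTED & ~received          # workers whose packets are still pending
print(f"still waiting on mask {missing:04b}")  # -> 1010
```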
And after the working node sends the first data packet, the forwarding node receives the first data packet and executes aggregation operation. Please refer to fig. 5, which is a flowchart illustrating a distributed task processing method according to another embodiment of the present application. As shown in fig. 3 and 5, the forwarding node performs step S201 and step S202.
In step S201, a plurality of first packets containing task parameters and identification information indicating nodes performing an aggregation operation are received.
The forwarding node receives first data packets sent by a plurality of working nodes, and extracts task parameters in the first data packets through analysis. Under normal conditions, the identification information in the first data packet sent by the working node indicates that the aggregation operation is executed by the forwarding node.
In step S202, an aggregation operation is performed on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and the aggregation parameter is sent to a parameter node for verification.
The forwarding node performs the aggregation operation on the task parameters sent by the working nodes under the same task according to one or more of the following: the identification information indicating that the forwarding node performs the aggregation operation, the task identification information, and the node identification information, thereby obtaining the aggregation parameter.
The aggregation operation refers to an operation that integrates the task parameters, for example accumulating, superimposing, merging, averaging, or summarizing them. For example, the aggregation operation computes the sum of the gradient values (corresponding to the task parameters) provided by the nodes, so that the sum of gradient values is obtained as the aggregation parameter. As another example, the aggregation operation sums the encrypted values provided by the nodes in a multi-party secure computation, and so on. The aggregation parameter obtained by the aggregation operation is used for data processing by the working nodes corresponding to the same distributed computing task; for example, each working node performs a gradient update for the next round of the distributed computing task according to the updated gradient value (corresponding to the aggregation parameter) received from the forwarding node or the parameter node.
In some embodiments, the forwarding node may extract the task parameter in each first data packet to perform the aggregation operation after all first data packets under a task arrive. In some embodiments, because the arrival sequence of the first data packets under the same task is different, the forwarding node may aggregate the first data packets arriving first to obtain a partial aggregation result, and when a subsequent first data packet arrives, aggregate the task parameter in the subsequent first data packet with the obtained partial aggregation result to obtain a final aggregation parameter.
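The following Python sketch illustrates this incremental style of aggregation, accumulating a partial aggregation result as first data packets arrive and emitting the final aggregation parameter once every expected working node has reported. The class and field names are assumptions, not the patented design.

```python
import numpy as np

class Aggregator:
    """Accumulates task parameters packet by packet (a sketch, not the patented design)."""

    def __init__(self, expected_mask):
        self.expected_mask = expected_mask
        self.received_mask = 0
        self.partial = None  # partially completed intermediate "aggregation result"

    def add(self, node_bits, task_params):
        self.received_mask |= node_bits
        self.partial = task_params if self.partial is None else self.partial + task_params
        # Return the final aggregation parameter once every worker has reported.
        if self.received_mask == self.expected_mask:
            return self.partial
        return None

agg = Aggregator(expected_mask=0b0011)
assert agg.add(0b0001, np.array([1.0, 2.0])) is None   # partial result only
print(agg.add(0b0010, np.array([3.0, 4.0])))           # -> [4. 6.]
```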
After obtaining the aggregation parameter, the forwarding node sends it to the parameter node for verification; upon receiving the second data packet sent by the forwarding node, the parameter node performs the verification operation. Please refer to fig. 6, which is a schematic flowchart of a distributed task processing method according to another embodiment of the present application. As shown in fig. 3 and 6, the parameter node performs step S301 and step S302.
In step S301, a second data packet containing an aggregation parameter is received; the aggregation parameter is obtained by the forwarding node by performing an aggregation operation on task parameters in a plurality of first data packets, and the first data packets include identification information for indicating that the forwarding node performs the aggregation operation.
Here, the parameter node receives a second data packet sent by the forwarding node, where the second data packet includes an aggregation parameter obtained after the forwarding node performs an aggregation operation.
In step S302, a verification operation is performed on the received second data packet, and the verified second data packet is fed back to the forwarding node, so that the forwarding node feeds back the second data packet containing the aggregation parameter to the corresponding working node.
Here, the parameter node performs a verification operation on the received second data packet. The verification operation includes at least one of: checking whether any task parameter provided by a working node was omitted from the aggregation operation, and checking whether the aggregation parameter overflows.
Checking whether the aggregation operation missed anything means checking whether the aggregation parameter of a task indeed corresponds to the expected number of working nodes. For example, when a task is completed cooperatively by four working nodes, the parameter node checks whether the aggregation parameter aggregated by the forwarding node was obtained from the complete set of task parameters sent by the four working nodes.
Checking whether the aggregation parameter overflows means checking whether the aggregation parameter aggregated by the forwarding node exceeds a preset data representation range. For example, the data obtained when a working node performs distributed computation is generally in a floating-point (Floating Point Number) data format, referred to as the first data format, while the data format the forwarding node can handle is an integer (Integer), referred to as the second data format. The data representation range of integers (for example, 2^-15 to 2^15) is limited compared with floating-point numbers, so the result of aggregation may exceed the data representation range that the forwarding node can represent.
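As an illustration of this format mismatch, the sketch below quantizes floating-point values to a fixed-point integer representation that a switch could sum, together with the kind of overflow check a parameter node might perform. The scaling factor and the 32-bit width are assumptions for the example, not values from the application.

```python
SCALE = 1 << 15          # assumed fixed-point scaling factor
INT32_MAX = 2**31 - 1

def to_fixed_point(x):
    # First data format (float) -> second data format (integer).
    return int(round(x * SCALE))

def check_overflow(aggregated_int):
    # Parameter-node style check: has the integer sum left the representable range?
    return not (-INT32_MAX - 1 <= aggregated_int <= INT32_MAX)

vals = [0.75, -0.5, 1.25]                      # gradients from three workers
agg = sum(to_fixed_point(v) for v in vals)     # what the switch would compute
assert not check_overflow(agg)
print(agg / SCALE)                             # -> 1.5, recovered after aggregation
```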
If the verification finds that the aggregation operation missed some task parameters, the parameter node does not return any data packet; since the working node then receives no second data packet within the preset time, it detects a packet loss and retransmits the first data packet. As mentioned above, when the working node retransmits the first data packet after detecting packet loss, the forwarding node only performs the forwarding operation on the retransmitted first data packet, and the aggregation operation is completed by the parameter node.
If the verification confirms that the aggregation operation and the aggregation parameter are correct, the parameter node returns the verified second data packet to the forwarding node, which receives it and performs the distribution operation. As shown in fig. 3 and 5, the forwarding node then performs step S203.
In step S203, the second data packet containing the aggregation parameter sent by the parameter node is fed back to the corresponding working node.
Here, the forwarding node receives a second data packet containing the aggregation parameter sent by the parameter node, and forwards the second data packet to the corresponding working node. The forwarding node may distribute the second data packet to corresponding work nodes executing the same task, for example, in a Multicast (Multicast) manner.
And after receiving the second data packet, the working node extracts the aggregation parameters in the second data packet and executes the next round of tasks according to the aggregation parameters. As shown in fig. 3 and 4, the worker node then performs step S102.
In step S102, a second data packet containing the aggregation parameter is received; the aggregation parameter is used for data processing by all working nodes corresponding to the same distributed computing task.
The working node receives a second data packet containing the aggregation parameter, and performs data processing according to the aggregation parameter in the second data packet. And the data processing refers to a process that the working node updates and replaces the corresponding task according to the aggregation parameter. For example, the working node performs gradient update according to the average gradient value (corresponding to the aggregation parameter) in the second data packet, so as to execute the next round of tasks with the new gradient value. For another example, the working node performs decryption processing according to the encrypted value (corresponding to the aggregation parameter) in the second data packet, so as to perform next round of multiparty security calculation according to the value obtained after decryption.
Thus, the working nodes, the forwarding node, and the parameter node together complete one round of the distributed computing task.
Example two
The above embodiment is a workflow of each node in the single-tenant mode. In the multi-tenant mode (i.e., the working node executes multiple tasks at the same time), the node executing the aggregation operation needs to know which task corresponds to the currently received data packet, so as to prevent confusion of parameters of different tasks. Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, the first data packet sent by the working node further includes task identification information, so that a node performing aggregation operation confirms a distributed computing task corresponding to the first data packet.
Here, the working node adds task identification information to a header of a first data packet, and when a forwarding node or a parameter node receives and analyzes the first data packet, the working node can obtain a task corresponding to the current first data packet according to the task identification information. Illustratively, the task identification information is, for example, a serial number (Job ID) assigned to each task by the work node. It should be understood that the task identification information for each task is unique.
EXAMPLE III
The above-described embodiment is a description of a flow in a normal case (normal here means a case where a packet loss problem due to a communication link interruption, a network delay, or the like does not occur). In an actual scenario, packet loss problems caused by situations such as communication link interruption and network delay may be encountered.
Therefore, on the basis of any embodiment or any combination thereof, in some embodiments, the method further includes performing packet loss detection during transmission of the first data packet, and retransmitting the first data packet when packet loss is detected.
Here, taking fig. 3 as an example, the working node starts a timer when sending the first data packet; if the timer reaches a preset threshold and no returned data packet has been received, the working node determines that a packet loss has occurred and resends the first data packet containing the task parameters. When the working node receives the returned second data packet, it resets the timer and restarts it the next time a data packet is sent.
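A minimal sketch of this timer-driven retransmission follows; the timeout value, retry count, send/recv callables, and the leading retransmission flag byte are all assumptions for illustration, not the patented protocol.

```python
import time

TIMEOUT_S = 0.5  # assumed retransmission threshold

def mark_retransmission(packet):
    # Retransmission identification information: here a hypothetical leading flag byte.
    return b"\x01" + packet[1:]

def send_with_retry(send, recv, packet, max_retries=3):
    """Send a first data packet; if no second data packet returns in time, resend."""
    for attempt in range(max_retries + 1):
        send(packet if attempt == 0 else mark_retransmission(packet))
        deadline = time.monotonic() + TIMEOUT_S
        while time.monotonic() < deadline:
            reply = recv()
            if reply is not None:
                return reply  # timer resets on the next send
            time.sleep(0.01)
    raise TimeoutError("no aggregation parameter received")

# Minimal stub usage: the first send is "lost", the retransmission is answered.
state = {"sent": 0}
def send(p): state["sent"] += 1
def recv(): return b"ok" if state["sent"] >= 2 else None
print(send_with_retry(send, recv, b"\x00grad"))  # -> b'ok'
```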
In some other specific examples, the working node treats N consecutive first data packets as a group and marks their sending order; while the group's first data packets are sent in sequence, it monitors the feedback information corresponding to each first data packet, determines the group's packet loss situation according to the number and/or order of the feedback information received for the group, and retransmits the first data packets determined to be lost.
In order for the forwarding node or the parameter node to identify whether a received first data packet is a retransmission due to packet loss, in some embodiments the working node adds retransmission identification information to the first data packet when retransmitting it. The retransmission identification information lets the forwarding node and the parameter node know that the corresponding first data packet is a retransmitted data packet, distinguishing it from a first data packet sent by the working node under normal conditions.
In some embodiments, in a first data packet retransmitted by a working node, the identification information indicating the node that performs the aggregation operation is set to indicate the parameter node, so that the forwarding node performs the forwarding operation on the first data packet according to the identification information and the parameter node performs the aggregation operation on the task parameters.
Here, when the identification information in the first packet received by the forwarding node indicates that the aggregation operation is performed by the parameter node, the forwarding node performs the forwarding operation on the first packet, and the parameter node performs the aggregation operation. For this, as shown in fig. 7, the parameter node further performs step S401 and step S402.
In step S401, a plurality of first packets containing task parameters and identification information indicating that an aggregation operation is performed by a parameter node are received.
Here, each worker node transmits a first packet including a task parameter, and identification information indicating a node performing an aggregation operation in the first packet indicates that the parameter node performs the aggregation operation. The forwarding node receives the first data packet and executes forwarding operation according to one or more of identification information used for indicating that the parameter node executes aggregation operation and retransmission identification information; and the parameter node receives the first data packet forwarded by the forwarding node.
In step S402, an aggregation operation is performed on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and a second data packet including the aggregation parameter is sent to the forwarding node, so that the forwarding node feeds back the second data packet including the aggregation parameter to the corresponding working node.
Here, the parameter node performs an aggregation operation on the task parameter in each first data packet according to one or more identification information of the task identification information, the node identification information, the identification information for indicating that the parameter node performs the aggregation operation, and the retransmission identification information, to obtain an aggregation parameter. And after the aggregation parameters are obtained, the parameter nodes send the second data packets containing the aggregation parameters to the forwarding nodes, and the forwarding nodes distribute the second data packets to all the working nodes. For the specific process, please refer to the foregoing embodiments, which are not described herein again.
Example four
On the basis of any of the foregoing embodiments or any combination thereof, in order to improve the efficiency of the aggregation operation, the forwarding node may further partition (Partition) its internal space, where each partition is a computing resource and each computing resource is used to perform the aggregation operation on task parameters. A computing resource is a unit with computing processing capability and may also be referred to as an aggregator (Aggregator).
Thus, when receiving a first data packet, the forwarding node determines one of its computing resources to perform the aggregation operation. For example, the forwarding node may poll by the index (Index) of the computing resources to find an available one, or it may randomly select a computing resource to perform the aggregation operation.
For the forwarding node to internally allocate the computing resource to perform the corresponding aggregation operation, as shown in fig. 8, the working node further performs step S501 and step S502.
In step S501, resource configuration identification information corresponding to the same distributed computing task is generated, where the resource configuration identification information is used to instruct a forwarding node to allocate computing resources for performing aggregation operation.
Here, the working node generates resource configuration identification information for the distributed computing task it executes. Illustratively, the resource configuration identification information is obtained, for example, by the working node hashing (Hash) the task identification information together with the sequence number of the computing resource. It should be noted that the resource configuration identification information added by the working nodes for the same task is identical, so as to ensure that the first data packets sent by the working nodes executing the same task are aggregated by the same computing resource.
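For illustration, the sketch below derives a resource configuration identification by hashing the task identification together with a sequence number, so that every working node of the same task maps deterministically to the same computing resource. The SHA-256 choice and the number of aggregators are assumptions, not specified by the application.

```python
import hashlib

NUM_AGGREGATORS = 64  # assumed number of partitions/computing resources in the switch

def resource_config_id(job_id: int, seq: int) -> int:
    # Hash the task identification and sequence number; every worker of the same
    # task computes the same value, so one aggregator serves the whole task.
    digest = hashlib.sha256(f"{job_id}:{seq}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AGGREGATORS

assert resource_config_id(7, 0) == resource_config_id(7, 0)  # identical across workers
```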
In step S502, a first data packet including task parameters and the resource configuration identification information is sent.
Here, the working node adds the resource configuration identification information to the first data packet, and sends the first data packet to the forwarding node, so that the forwarding node allocates corresponding computing resources according to the resource configuration identification information to perform aggregation operation.
After the forwarding node executes the aggregation operation to obtain the aggregation parameter, sending a second data packet to the parameter node for verification; and after the parameter node returns the verified second data packet to the forwarding node, the forwarding node distributes the second data packet to each working node. For this reason, as shown in fig. 8, the working node further performs step S503: a second data packet containing aggregation parameters is received. For the subsequent processes, please refer to the foregoing embodiments, which are not described herein.
EXAMPLE five
In order to make full use of the computing resources and avoid the resource waste and unbalanced distribution of a static, partitioned allocation, on the basis of any of the above embodiments or any combination thereof, in some embodiments the forwarding node may also allocate computing resources to the first data packets in a dynamic allocation manner.
Thus, after receiving a first data packet including the task parameters and the resource configuration identification information, the forwarding node performs steps S601 and S602, as shown in fig. 9.
In step S601, a plurality of first packets including task parameters and resource configuration identification information indicating allocation of computing resources to perform an aggregation operation are received.
The forwarding node receives first data packets sent by a plurality of working nodes, each containing task parameters as well as resource configuration identification information indicating the allocation of computing resources for performing the aggregation operation. For the specific process, refer to the foregoing embodiments, which are not repeated here.
In step S602, computing resources corresponding to the same distributed computing task are dynamically allocated according to the resource configuration identification information in each of the first data packets, and aggregation operations are performed on corresponding multiple task parameters by using the computing resources to obtain aggregation parameters corresponding to the distributed computing task.
Here, the forwarding node dynamically allocates a computing resource to the first data packet to perform the aggregation operation, according to the resource configuration identification information in the received first data packet. The forwarding node may further determine the task corresponding to the computing resource from the task identification information in the first data packet, and determine how many working nodes the current task has in total, so as to know how many first data packets need to be aggregated; the computing resource remains occupied for the duration of the aggregation operation.
Illustratively, the manner in which the forwarding node dynamically allocates the computing resources includes step S611 and step S612 (not shown).
In step S611, the computing resource for the first data packets corresponding to the same distributed computing task is determined according to the mapping relationship between the resource configuration identification information in each first data packet and the computing resources.
Here, the mapping relationship represents a one-to-one correspondence between the resource configuration identification information and the computing resources. For example, the resource configuration identification information may be the sequence number of a computing resource among all computing resources, in which case the forwarding node can determine the corresponding computing resource directly from that sequence number. Illustratively, the mapping relationship is determined by hashing the resource configuration identification information; for example, the resource configuration identification information is obtained by the working node performing a Hash (Hash) operation on the task identification information and the sequence number of the computing resource.
In step S612, an aggregation operation corresponding to the same distributed computing task is performed by using the allocated computing resources, so as to obtain an aggregation parameter.
After the forwarding node assigns a first data packet to a computing resource, it assigns the subsequently received first data packets sent by other nodes of the same task to that same computing resource, so that the computing resource extracts the task parameters from each first data packet and performs an aggregation operation over the multiple task parameters to obtain the aggregation parameter.
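A minimal sketch of this dynamic allocation (steps S601 and S602, with sub-steps S611 and S612) is shown below, assuming the JSON packet layout from the worker-side sketch above, a modulo mapping from the resource configuration identification to aggregator indices, and elementwise summation as the aggregation operation; all of these are illustrative assumptions rather than definitions from this application.

```python
# Hedged sketch of dynamic computing-resource allocation on a forwarding
# node. The modulo mapping, buffer layout, and sum aggregation are
# assumptions introduced for illustration.
import json

class DynamicAllocator:
    def __init__(self, num_aggregators: int, nodes_per_task: dict):
        self.num_aggregators = num_aggregators
        self.nodes_per_task = nodes_per_task  # task_id -> worker count
        self.slots = {}    # aggregator index -> task_id occupying it
        self.buffers = {}  # task_id -> task parameters received so far

    def on_first_packet(self, raw: bytes):
        pkt = json.loads(raw)
        task = pkt["task_id"]
        idx = pkt["resource_config_id"] % self.num_aggregators  # mapping
        if self.slots.get(idx, task) != task:
            return ("collision", pkt)  # see the multi-tenant case below
        self.slots[idx] = task  # resource stays occupied while aggregating
        buf = self.buffers.setdefault(task, [])
        buf.append(pkt["task_params"])
        if len(buf) < self.nodes_per_task[task]:
            return ("waiting", None)  # more first data packets expected
        aggregated = [sum(vals) for vals in zip(*buf)]  # elementwise sum
        del self.buffers[task]
        del self.slots[idx]  # release the computing resource
        return ("aggregated", aggregated)
```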
After obtaining the aggregation parameter, the forwarding node sends the aggregation parameter to the parameter node for verification, and for a specific process, reference is made to the foregoing embodiment, which is not described herein again.
Example six
It should be understood that the dynamic allocation manner of the above embodiments can also be applied in a multi-tenant mode. However, when a forwarding node dynamically allocates computing resources in the multi-tenant mode, multiple tasks may be mapped to the same computing resource, causing computing-resource collisions (e.g., hash collisions).
Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, when a computing resource dynamically allocated to a distributed computing task corresponding to a first data packet is already occupied, a forwarding node marks collision identification information in the first data packet and performs a forwarding operation, so that a parameter node performs an aggregation operation on task parameters in the forwarded first data packet according to the collision identification information to obtain an aggregation parameter.
Here, the forwarding node forwards the first data packet to a parameter node; correspondingly, the parameter node is further configured to receive the first data packet containing the collision identification information sent by the forwarding node, and to perform an aggregation operation on the task parameters in the forwarded first data packet according to the collision identification information to obtain the aggregation parameter. From the collision identification information, the parameter node learns that the currently received first data packet was forwarded by the forwarding node because of a collision, and it performs the aggregation operation according to one or more kinds of identification information in the first data packet, such as the task identification information and the node identification information. For the specific process, refer to the foregoing embodiments, which are not repeated here.
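Continuing the illustrative sketch above, the collision fallback might look as follows; the collision flag name and the parameter-node bookkeeping are assumptions, not definitions from this application.

```python
# Hypothetical collision path: the forwarding node marks collision
# identification information and forwards the packet; the parameter node
# then aggregates by task identification instead of by computing resource.
def mark_and_forward(pkt: dict, send_to_parameter_node) -> None:
    pkt["collision"] = True  # collision identification information
    send_to_parameter_node(pkt)

class ParameterNode:
    def __init__(self, nodes_per_task: dict):
        self.nodes_per_task = nodes_per_task
        self.buffers = {}

    def on_forwarded_first_packet(self, pkt: dict):
        # The collision flag tells the parameter node why it received a
        # first data packet rather than a second one.
        assert pkt.get("collision")
        buf = self.buffers.setdefault(pkt["task_id"], [])
        buf.append(pkt["task_params"])
        if len(buf) < self.nodes_per_task[pkt["task_id"]]:
            return None  # still waiting for other working nodes
        params = self.buffers.pop(pkt["task_id"])
        return [sum(vals) for vals in zip(*params)]  # aggregation parameter
```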
After obtaining the aggregation parameter, the parameter node sends a second data packet containing the aggregation parameter to the forwarding node. In some embodiments, so that the forwarding node can identify whether a currently received data packet is a first data packet sent by a working node or a second data packet sent by a parameter node, the parameter node adds verification identification information to the second data packet when returning the verified second data packet; when the forwarding node parses out the verification identification information in a data packet, it can determine that the currently received data packet is the verified second data packet sent by the parameter node.
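Illustratively, this distinction could be as simple as checking one flag in the packet; the field name below is an assumption introduced for illustration.

```python
# Hypothetical check letting a forwarding node tell packet kinds apart:
# verified second data packets carry verification identification
# information added by the parameter node, first data packets do not.
def classify_packet(pkt: dict) -> str:
    if pkt.get("verified"):
        return "second_data_packet"  # verified, to be fed back to workers
    return "first_data_packet"       # from a working node, to be aggregated
```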
In some embodiments, in order to prevent a collision from occurring in the next computation resource allocation, the parameter node is further configured to update the resource configuration identification information according to the collision identification information, and send a second data packet including the aggregation parameter and the updated resource configuration identification information, so that the forwarding node feeds back the second data packet to the corresponding working node.
Thereby, the forwarding node further performs the steps of: and receiving a second data packet which is sent by the parameter node and contains the aggregation parameter, and feeding back the second data packet to the corresponding working node. For the specific process, please refer to the foregoing embodiments, which are not described herein again.
Example seven
Based on any of the above embodiments or any combination thereof, in some embodiments, when it is determined that the computing resource does not need to be occupied, the forwarding node releases the corresponding computing resource.
Illustratively, after the forwarding node completes the aggregation operation to obtain the aggregation parameter, and prepares to send the second data packet containing the aggregation parameter to the parameter node, the corresponding computing resource is released.
Illustratively, when the forwarding node receives the second data packet containing the aggregation parameter returned by the parameter node and prepares to distribute it to the working nodes, the corresponding computing resource is released.
Illustratively, after the forwarding node has received first data packets from some of the working nodes of a task (at which point a computing resource is occupied), it may receive a first data packet retransmitted, due to packet loss, by another working node of the same task. In that case the forwarding node forwards the retransmitted first data packet to the parameter node and releases the corresponding computing resource. The forwarding node also forwards to the parameter node the task parameters from the previously received first data packets, or the partial aggregation result computed from them, and the parameter node performs the aggregation operation to obtain the aggregation parameter.
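A sketch of this retransmission branch, written as an additional method for the DynamicAllocator sketch above, might look like the following; the handover message format is an assumption made for illustration.

```python
# Hedged sketch of Example seven's retransmission case: the forwarding
# node hands both the retransmitted first data packet and its partial
# aggregation state to the parameter node, then frees the aggregator.
def on_retransmitted_first_packet(self, idx: int, pkt: dict,
                                  send_to_parameter_node) -> None:
    task = pkt["task_id"]
    partial = self.buffers.pop(task, [])  # parameters received so far
    send_to_parameter_node({
        "task_id": task,
        "retransmit": True,
        "partial_params": partial,  # or a partial aggregation result
        "packet": pkt,
    })
    self.slots.pop(idx, None)  # release the corresponding computing resource
```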
Example eight
In the above embodiments, the task parameters in the first data packets sent by the working nodes are all in the second data format. As mentioned above, when the forwarding node performs the aggregation operation on task parameters in the second data format, the aggregation result may exceed the range of values the forwarding node can represent, so that the parameter node, on verification, finds that the aggregation parameter has overflowed.
To provide a recovery mechanism for the case where the aggregation parameter overflows, on the basis of any of the above embodiments or any combination thereof, the parameter node performs steps S701, S702, and S703 when it verifies that the aggregation parameter has overflowed, as shown in fig. 10.
In step S701, receiving a second data packet containing an aggregation parameter; the aggregation parameters are obtained by the forwarding node through performing aggregation operation on first data packets which are sent by the plurality of working nodes and contain the task parameters in the second data format.
Here, according to the flow in the foregoing embodiment, the parameter node receives the second packet containing the aggregation parameter sent by the forwarding node. The task parameter in the first data packet sent by the working node is in the second data format (i.e., an integer), and thus the aggregation parameter obtained by the forwarding node performing the aggregation operation is also in the second data format.
In step S702, when the overflow of the aggregation result is detected, an abnormal retransmission instruction is issued; and the abnormal retransmission instruction is used for indicating the working node to perform data format conversion.
Here, the parameter node detects whether the received aggregation parameter has overflowed. Illustratively, when an aggregation parameter aggregated by the forwarding node overflows, its value saturates at the maximum of the data range the forwarding node can process (e.g., 2^31). Therefore, when the parameter node detects that the aggregation parameter has overflowed, it sends an abnormal retransmission instruction to the working nodes of the current task, instructing them to resend the first data packet with the task parameters converted to a different data format on retransmission, thereby avoiding the overflow that occurs when the forwarding node performs the aggregation operation.
In some embodiments, the abnormal retransmission instruction is, for example, a data packet containing identification information instructing a working node to perform a retransmission operation; the parameter node sends this data packet, the forwarding node forwards it to the working node, and the working node obtains the abnormal retransmission instruction by parsing the identification information in the received data packet.
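Illustratively, overflow detection at the parameter node could be sketched as below; treating a value saturated at 2^31 - 1 as the overflow marker is an assumption based on the example range given above, not a requirement of this application.

```python
# Hypothetical sketch of step S702: values saturated at the maximum of
# the forwarding node's processable range (assumed signed 32-bit here)
# are taken as overflow, triggering an abnormal retransmission instruction.
INT32_MAX = 2**31 - 1

def check_aggregation(aggregation_params: list):
    if any(abs(v) >= INT32_MAX for v in aggregation_params):
        # Packet whose identification information instructs the working
        # nodes to convert data formats and retransmit.
        return {"abnormal_retransmission": True}
    return None  # no overflow; proceed with normal verification
```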
After receiving the abnormal retransmission instruction from the parameter node, the working node retransmits the first data packet. To this end, as shown in fig. 11, the working node further performs step S801.
In step S801, when an abnormal retransmission instruction is received, a first data packet including a task parameter in a first data format and identification information for instructing an aggregation operation to be performed by a parameter node is transmitted; and the abnormal retransmission instruction is used for indicating the working node to perform data format conversion.
After receiving the abnormal retransmission instruction, the working node retransmits the first data packet containing the task parameters. Since task parameters in the second data format may cause the aggregated parameter to overflow, the working node converts the task parameters from the second data format (integer) to the first data format (floating point) when resending the first data packet. And because the forwarding node cannot process task parameters in the first data format, the identification information in the retransmitted first data packet indicates that the aggregation operation is to be performed by the parameter node.
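A worker-side sketch of this retransmission (step S801) follows; the field names and the plain integer-to-float cast are illustrative assumptions consistent with the format conversion described above.

```python
# Hedged sketch of step S801: the worker converts its task parameters
# from the second (integer) format to the first (floating-point) format
# and flags the packet so the parameter node performs the aggregation.
def rebuild_first_packet(task_id: str, node_id: str, int_params: list) -> dict:
    return {
        "task_id": task_id,
        "node_id": node_id,
        "aggregate_at": "parameter_node",  # identification information
        "retransmit": True,
        "task_params": [float(v) for v in int_params],  # format conversion
    }
```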
After the parameter node receives the first data packet retransmitted by the working node, the parameter node executes step S703.
In step S703, receiving a first data packet retransmitted by the working node based on the abnormal retransmission instruction; and performing aggregation operation on the task parameters of each first data format according to the identification information in the first data packet to obtain a second data packet containing the aggregation parameters, and feeding the second data packet back to each working node.
The parameter node receives the first data packets retransmitted by the working nodes and performs an aggregation operation on the task parameters in the first data format according to one or more kinds of identification information in the first data packet: the identification information indicating the node performing the aggregation operation, the task identification information, the node identification information, and the retransmission identification information. For the flow of the parameter node performing the aggregation operation, refer to the foregoing embodiments, which are not repeated here.
After performing the aggregation operation to obtain the aggregation parameter in the first data format, the parameter node sends a second data packet containing the aggregation parameter to the forwarding node, and the forwarding node distributes the second data packet to each working node.
To this end, the working node performs step S802: a second data packet containing aggregation parameters is received. For the specific process, please refer to the foregoing embodiments, which are not described herein again.
Example nine
The embodiment of the present application further provides a distributed task processing system, configured to execute the distributed task processing method corresponding to steps S101 to S102 in the foregoing embodiment, and the distributed task processing system has corresponding functional modules and can achieve the same technical effect.
Referring now to FIG. 12, shown is a block diagram illustrating the components of a distributed task processing system according to an embodiment of the present application. As shown in fig. 12, the distributed task processing system includes a transmitting module 101 and a receiving module 102.
The sending module is used for sending a first data packet containing task parameters and identification information used for indicating nodes executing aggregation operation; the identification information is used for indicating that a forwarding node or a parameter node executes an aggregation operation on the task parameter; the task parameters are derived by performing a distributed computing task.
The receiving module is used for receiving a second data packet containing aggregation parameters; the aggregation parameters are used for carrying out data processing on all working nodes corresponding to the same distributed computing task; and the aggregation parameter is obtained by the forwarding node or the parameter node after performing aggregation operation according to the task parameter in the first data packet.
In some embodiments, the distributed task processing system further includes a generation module configured to generate resource configuration identification information corresponding to the same distributed computing task, where the resource configuration identification information is used to instruct a forwarding node to allocate computing resources for performing aggregation operations.
Correspondingly, the sending module is further configured to send a first data packet including task parameters and the resource configuration identification information; wherein the task parameters are obtained by a plurality of working nodes through executing the distributed computing task.
Correspondingly, the receiving module is further configured to receive a second data packet containing the aggregation parameter; wherein the aggregation parameter is used for updating task parameters of the distributed computing task; and the aggregation parameter is obtained after a computing resource in the forwarding node executes aggregation operation according to the task parameter in the first data packet.
In this embodiment, for simplicity of description, the modules in the distributed task processing system may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Details of the distributed task processing method of the embodiment shown in fig. 4 are not repeated here.
Example ten
The embodiment of the present application further provides a distributed task processing system, configured to execute the distributed task processing method corresponding to steps S201 to S203 in the foregoing embodiments, which has corresponding functional modules and can achieve the same technical effect.
Referring now to FIG. 13, shown is a block diagram illustrating the components of a distributed task processing system according to an embodiment of the present application. As shown in fig. 13, the distributed task processing system includes a receiving module 201, a processing module 202, and a feedback module 203.
The receiving module is used for receiving a plurality of first data packets which comprise task parameters and identification information used for indicating nodes which execute aggregation operation, wherein the identification information is used for indicating that the task parameters are executed aggregation operation by a forwarding node or a parameter node; wherein the plurality of task parameters are derived by a plurality of working nodes by performing a distributed computing task.
The processing module is used for performing aggregation operation on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and sending the aggregation parameter to a parameter node for verification.
The feedback module is used for feeding back a second data packet which is sent by the parameter node and contains the aggregation parameter to the corresponding working node.
In some embodiments, the receiving module is further configured to receive a plurality of first data packets including task parameters and resource configuration identification information indicating allocation of computing resources to perform the aggregation operation; wherein the task parameters are obtained by a plurality of working nodes through executing a distributed computing task.
Correspondingly, the processing module is further configured to dynamically allocate a computing resource to the first data packet according to the resource configuration identification information in the first data packet, perform aggregation operation by the computing resource to obtain an aggregation parameter, and send the aggregation parameter to a parameter node for verification.
Correspondingly, the feedback module is further configured to feed back a second data packet containing the aggregation parameter sent by the parameter node to the corresponding working node.
In this embodiment, for simplicity of description, the modules in the distributed task processing system may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Details of the distributed task processing method of the embodiment shown in fig. 5 are not repeated here.
Example eleven
The present application further provides a distributed task processing system, configured to execute the distributed task processing method corresponding to steps S301 to S302, which has corresponding functional modules and can achieve the same technical effect.
Referring now to FIG. 14, therein is shown a block diagram of the modules of the distributed task processing system of the present application in one embodiment. As shown in fig. 14, the distributed task processing system includes a receiving module 301 and a processing module 302.
The receiving module is used for receiving a second data packet containing aggregation parameters; the aggregation parameter is obtained by a forwarding node through performing aggregation operation on task parameters in a plurality of first data packets, and the first data packets contain identification information for indicating that the forwarding node performs the aggregation operation; wherein the plurality of task parameters are derived by a plurality of working nodes by performing a distributed computing task.
The processing module is configured to perform a check operation on the received second data packet, and feed back the checked second data packet to the forwarding node, so that the forwarding node feeds back the second data packet including the aggregation parameter to the corresponding working node.
In this embodiment, for simplicity of description, the modules in the distributed task processing system may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Details of the distributed task processing method of the embodiment shown in fig. 6 are not repeated here.
Example twelve
The application also provides an exception handling system for the distributed tasks, which is used for executing the exception handling method for the distributed tasks corresponding to the steps S701 to S703, has corresponding functional modules, and can achieve the same technical effect.
Referring to FIG. 15, a block diagram illustrating modules of the distributed task exception handling system of the present application in one embodiment is shown. As shown in fig. 15, the exception handling system includes a receiving module 401, a detecting module 402, and a processing module 403.
The receiving module is used for receiving a second data packet containing an aggregation result; the aggregation result is obtained by the forwarding node executing aggregation operation on a first data packet which is sent by a plurality of working nodes and contains task parameters in a second data format; wherein the task parameters are obtained by the working nodes through executing a distributed computing task.
The detection module is used for sending an abnormal retransmission instruction when it detects that the aggregation result has overflowed; the abnormal identification information therein is used by the working node to adjust the task parameters and perform a retransmission operation.
The processing module is used for receiving a first data packet retransmitted by a working node based on an abnormal retransmission instruction, wherein the retransmitted first data packet comprises task parameters in a first data format and identification information used for indicating that a parameter node executes aggregation operation; and performing aggregation operation on the task parameters of each first data format according to the identification information in the first data packet to obtain a second data packet containing the aggregation parameters, and feeding the second data packet back to each working node.
In this embodiment, for simplicity of description, the modules in the exception handling system may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Details of the exception handling method for distributed tasks of the embodiment shown in fig. 10 are not repeated here.
Example thirteen
The application also provides an exception handling system for the distributed tasks, which is used for executing the exception handling method for the distributed tasks corresponding to the steps S801 to S802, has corresponding functional modules, and can achieve the same technical effect.
Referring to FIG. 16, a block diagram illustrating modules of the distributed task exception handling system of the present application in one embodiment is shown. As shown in fig. 16, the exception handling system includes a sending module 501 and a receiving module 502.
The sending module is used for sending a first data packet which contains task parameters in a first data format and identification information used for indicating a parameter node to execute aggregation operation when receiving an abnormal retransmission instruction; the abnormal retransmission instruction is used for indicating the working node to perform data format conversion; the task parameters are obtained by executing a distributed computing task;
the receiving module is used for receiving a second data packet containing aggregation parameters; the aggregation parameters are used for carrying out data processing on all working nodes corresponding to the same distributed computing task; and the aggregation parameter is obtained by a parameter node after aggregation operation is executed according to the first data packet.
In this embodiment, for simplicity of description, the modules in the exception handling system may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Details of the exception handling method for distributed tasks of the embodiment shown in fig. 11 are not repeated here.
Example fourteen
The application also provides a working node. Referring to fig. 17, a block diagram of modules of the working node of the present application in one embodiment is shown. As shown in fig. 17, the working node comprises at least one memory 601 and at least one processor 602.
Wherein the at least one memory is to store at least one program; in embodiments, the memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In certain embodiments, the memory may also include memory that is remote from the one or more processors, such as network attached memory that is accessed via RF circuitry or external ports and a communications network, which may be the internet, one or more intranets, local area networks, wide area networks, storage area networks, and the like, or suitable combinations thereof. The memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In some embodiments, the at least one processor is connected to the at least one memory, and is configured to execute and implement at least one embodiment described in the above distributed task processing method when running the at least one program, such as the embodiment correspondingly described in fig. 4. In an embodiment, the processor is operatively coupled with a memory and/or a non-volatile storage device. More specifically, the processor may execute instructions stored in the memory and/or the non-volatile storage device to perform operations in the computing device, such as generating image data and/or transmitting image data to an electronic display. As such, the processor may include one or more general purpose microprocessors, one or more special purpose processors, one or more field programmable logic arrays, or any combination thereof.
Example fifteen
The application also provides a forwarding node. Referring to fig. 18, a block diagram of modules of a forwarding node according to an embodiment of the present application is shown. As shown in fig. 18, the forwarding node includes at least one memory 701 and at least one processor 702.
Wherein the at least one memory is to store at least one program; in embodiments, the memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In certain embodiments, the memory may also include memory that is remote from the one or more processors, such as network attached memory that is accessed via RF circuitry or external ports and a communications network, which may be the internet, one or more intranets, local area networks, wide area networks, storage area networks, and the like, or suitable combinations thereof. The memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In some embodiments, the at least one processor is connected to the at least one memory, and is configured to execute and implement at least one embodiment described in the above distributed task processing method when running the at least one program, such as the embodiment correspondingly described in fig. 5. In an embodiment, the processor is operatively coupled with a memory and/or a non-volatile storage device. More specifically, the processor may execute instructions stored in the memory and/or the non-volatile storage device to perform operations in the computing device, such as generating image data and/or transmitting image data to an electronic display. As such, the processor may include one or more general purpose microprocessors, one or more special purpose processors, one or more field programmable logic arrays, or any combination thereof.
Example sixteen
The application also provides a parameter node. Referring to fig. 19, a block diagram of modules of a parameter node according to an embodiment of the present application is shown. As shown in fig. 19, the parameter node comprises at least one memory 801 and at least one processor 802.
Wherein the at least one memory is to store at least one program; in embodiments, the memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In certain embodiments, the memory may also include memory that is remote from the one or more processors, such as network attached memory that is accessed via RF circuitry or external ports and a communications network, which may be the internet, one or more intranets, local area networks, wide area networks, storage area networks, and the like, or suitable combinations thereof. The memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In some embodiments, the at least one processor is connected to the at least one memory, and is configured to execute and implement at least one embodiment described in the above distributed task processing method when running the at least one program, such as the embodiment correspondingly described in fig. 6. In an embodiment, the processor is operatively coupled with a memory and/or a non-volatile storage device. More specifically, the processor may execute instructions stored in the memory and/or the non-volatile storage device to perform operations in the computing device, such as generating image data and/or transmitting image data to an electronic display. As such, the processor may include one or more general purpose microprocessors, one or more special purpose processors, one or more field programmable logic arrays, or any combination thereof.
The present application further provides a computer readable and writable storage medium, which stores a computer program, and the computer program, when executed, implements at least one embodiment described above for the distributed task processing method, such as the embodiment described in fig. 4.
The present application also provides a computer-readable and writable storage medium storing a computer program that, when executed, implements at least one embodiment described above for the distributed task processing method, such as the embodiment described in fig. 5.
The present application also provides a computer-readable and writable storage medium storing a computer program that, when executed, implements at least one embodiment described above for the distributed task processing method, such as the embodiment described in fig. 6.
The present application also provides a computer-readable and writable storage medium storing a computer program which, when executed, implements at least one embodiment described above for the distributed task processing method, such as the embodiment described in fig. 8.
The present application also provides a computer-readable and writable storage medium storing a computer program that, when executed, implements at least one embodiment described above for the distributed task processing method, such as the embodiment described in fig. 9.
The present application further provides a computer readable and writable storage medium, which stores a computer program, and when executed, the computer program implements at least one embodiment described above for the distributed task exception handling method, such as the embodiment described in fig. 10.
The present application also provides a computer-readable and writable storage medium storing a computer program, which when executed, implements at least one embodiment described above for the distributed task exception handling method, such as the embodiment described in fig. 11.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application.
In the embodiments provided herein, the computer-readable and writable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable-writable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be non-transitory, tangible storage media. Disk and disc, as used in this application, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described in the computer program of the methods described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and/or writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (30)

1. A distributed task processing method, comprising:
sending a first data packet containing task parameters and identification information for indicating nodes performing aggregation operation; the identification information is used for indicating that a forwarding node or a parameter node executes an aggregation operation on the task parameter; wherein the task parameters are obtained by performing a distributed computing task;
receiving a second data packet containing aggregation parameters; the aggregation parameters are used for carrying out data processing on all working nodes corresponding to the same distributed computing task; and the aggregation parameter is obtained by the forwarding node or the parameter node after performing aggregation operation on the task parameter in the first data packet.
2. The distributed task processing method according to claim 1, wherein the first data packet further includes task identification information, so that a forwarding node or a parameter node that performs an aggregation operation confirms the distributed computation task corresponding to the first data packet.
3. The distributed task processing method according to claim 1, wherein the first data packet further includes node identification information for a forwarding node or a parameter node that performs an aggregation operation to confirm work nodes corresponding to the same distributed computation task.
4. The distributed task processing method according to claim 1, further comprising: performing packet loss detection during transmission of the first data packet, and retransmitting the first data packet when packet loss is detected.
5. The distributed task processing method according to claim 1 or 4, wherein the identification information indicating the nodes performing the aggregation operation indicates that the parameter node performs the aggregation operation when determining to retransmit, so that the forwarding node performs the forwarding operation on the first packet according to the identification information, and the parameter node performs the aggregation operation on the task parameter.
6. The distributed task processing method according to claim 1, wherein the first packet further includes resource configuration identification information, so that the forwarding node allocates a computation resource to the first packet according to the resource configuration identification information.
7. The distributed task processing method according to claim 1, further comprising: and executing a new round of distributed computing task according to the aggregation parameters in the second data packet.
8. The distributed task processing method according to claim 1, wherein the distributed computing task includes a computing task that performs gradient training on a machine learning model by using a distributed computing method.
9. A distributed task processing system, comprising:
a sending module, configured to send a first data packet including a task parameter and identification information indicating a node performing an aggregation operation; the identification information is used for indicating that a forwarding node or a parameter node performs aggregation operation on the task parameters; the task parameters are obtained by executing a distributed computing task;
a receiving module, configured to receive a second data packet containing aggregation parameters; the aggregation parameters are used for carrying out data processing on all working nodes corresponding to the same distributed computing task; and the aggregation parameter is obtained by the forwarding node or the parameter node after performing aggregation operation according to the task parameter in the first data packet.
10. A working node, comprising:
at least one memory for storing at least one program;
at least one processor coupled to the at least one memory and configured to execute and implement the distributed task processing method according to any one of claims 1 to 8 when running the at least one program.
11. A distributed task processing method, comprising:
receiving a plurality of first data packets containing task parameters and identification information for indicating nodes which execute aggregation operation, wherein the identification information is used for indicating that the task parameters are executed aggregation operation by forwarding nodes or parameter nodes; wherein the plurality of task parameters are obtained by a plurality of working nodes through executing a distributed computing task;
performing aggregation operation on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and sending the aggregation parameter to a parameter node for verification;
and feeding back a second data packet which is sent by the parameter node and contains the aggregation parameter to the corresponding working node.
12. The distributed task processing method according to claim 11, wherein the received first packet further includes task identification information for confirming the distributed computing task corresponding to the first packet.
13. The distributed task processing method of claim 11, wherein the received first packet further includes node identification information for identifying the work nodes corresponding to the same distributed computing task.
14. The distributed task processing method of claim 11, wherein the received first data packet further includes resource configuration identification information for indicating allocation of computing resources for performing the aggregation operation; correspondingly, the method further comprises the following steps: and allocating a computing resource for the first data packet according to the resource configuration identification information to execute aggregation operation.
15. The distributed task processing method of claim 11, wherein when identification information in the received first packet indicates that an aggregation operation is performed by a parameter node, a forwarding operation is performed on the first packet according to the identification information.
16. The distributed task processing method of claim 11, wherein the distributed computing task comprises a computing task that performs gradient training on a machine learning model using a distributed computing approach.
17. A distributed task processing system, comprising:
a receiving module, configured to receive a plurality of first data packets including task parameters and identification information indicating a node performing an aggregation operation, where the identification information is used to indicate that the task parameters are aggregated by a forwarding node or a parameter node; wherein the task parameters are obtained by a plurality of working nodes through executing a distributed computing task;
the processing module is used for performing aggregation operation on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and sending the aggregation parameter to a parameter node for verification;
and the feedback module is used for feeding back a second data packet which contains the aggregation parameters and is sent by the parameter node to the corresponding working node.
18. A forwarding node, comprising:
at least one memory for storing at least one program;
at least one processor coupled to the at least one memory and configured to execute and implement the distributed task processing method according to any one of claims 11 to 16 when running the at least one program.
19. A distributed task processing method, comprising:
receiving a second data packet containing aggregation parameters; the aggregation parameter is obtained by a forwarding node through performing aggregation operation on task parameters in a plurality of first data packets, and the first data packets contain identification information for indicating that the forwarding node performs the aggregation operation; wherein the plurality of task parameters are obtained by a plurality of working nodes through executing a distributed computing task;
and performing verification operation on the received second data packet, and feeding back the verified second data packet to the forwarding node, so that the forwarding node feeds back the second data packet containing the aggregation parameter to the corresponding working node.
20. The distributed task processing method according to claim 19, further comprising:
receiving a plurality of first data packets containing task parameters and identification information for indicating that an aggregation operation is performed by a parameter node;
and performing aggregation operation on each task parameter according to the identification information in the first data packet to obtain an aggregation parameter, and sending a second data packet containing the aggregation parameter to a forwarding node, so that the forwarding node feeds the second data packet containing the aggregation parameter back to the corresponding working node.
21. The distributed task processing method according to claim 19, wherein the first packet further includes task identification information for confirming the distributed computing task corresponding to the first packet.
22. The distributed task processing method of claim 19, wherein the first packet further includes node identification information for identifying the work nodes corresponding to the same distributed computing task.
23. The distributed task processing method of claim 19, wherein the distributed computing task comprises a computing task that performs gradient training on a machine learning model using distributed computing.
24. A distributed task processing system, comprising:
a receiving module, configured to receive a second data packet containing aggregation parameters; the aggregation parameter is obtained by a forwarding node through performing aggregation operation on task parameters in a plurality of first data packets, and the first data packets contain identification information for indicating that the forwarding node performs the aggregation operation; wherein the task parameters are obtained by a plurality of working nodes through executing a distributed computing task;
and the processing module is used for performing verification operation on the aggregation parameters in the received second data packet and feeding back the verified second data packet to the forwarding node, so that the forwarding node feeds back the second data packet containing the aggregation parameters to the corresponding working node.
25. A parameter node, comprising:
at least one memory for storing at least one program;
at least one processor coupled to the at least one memory and configured to execute and implement the distributed task processing method according to any one of claims 19 to 23 when running the at least one program.
26. A distributed task processing system, comprising:
a plurality of working nodes according to claim 10;
at least one forwarding node according to claim 18, communicatively connected to the working node, configured to perform an aggregation operation according to the task parameter in the first data packet sent by the working node to obtain an aggregation result or an aggregation parameter;
the parameter node according to claim 25, being communicatively connected to the forwarding node, and configured to perform an aggregation operation on the aggregation result to obtain an aggregation parameter and/or perform a verification operation on the aggregation parameter.
27. The distributed task processing system of claim 26, wherein the forwarding nodes are TOR-connected to working nodes.
28. The distributed task processing system of claim 26, wherein the forwarding nodes include primary forwarding nodes and secondary forwarding nodes; the primary forwarding node is in communication connection with a plurality of working nodes, and the secondary forwarding node is in communication connection with at least one primary forwarding node.
29. The distributed task processing system of claim 28, wherein the parameter node is communicatively coupled to the secondary forwarding node and at least one worker node.
30. A computer-readable storage medium, characterized by storing at least one program which, when executed by a processor, executes and implements the distributed task processing method according to any one of claims 1 to 8, or executes and implements the distributed task processing method according to any one of claims 11 to 16, or executes and implements the distributed task processing method according to any one of claims 19 to 23.
CN202011390848.1A 2020-11-24 2020-12-02 Distributed task processing method, system and storage medium Pending CN114553879A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020113296733 2020-11-24
CN202011329673 2020-11-24

Publications (1)

Publication Number Publication Date
CN114553879A true CN114553879A (en) 2022-05-27

Family

ID=81667954

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202011390879.7A Pending CN114553880A (en) 2020-11-24 2020-12-02 Distributed task exception handling method and system
CN202011394457.7A Pending CN114546633A (en) 2020-11-24 2020-12-02 Distributed task processing method and system
CN202011390848.1A Pending CN114553879A (en) 2020-11-24 2020-12-02 Distributed task processing method, system and storage medium

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN202011390879.7A Pending CN114553880A (en) 2020-11-24 2020-12-02 Distributed task exception handling method and system
CN202011394457.7A Pending CN114546633A (en) 2020-11-24 2020-12-02 Distributed task processing method and system

Country Status (1)

Country Link
CN (3) CN114553880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277446A (en) * 2022-07-12 2022-11-01 中国信息通信研究院 Energy-saving online internet connection learning network and method

Also Published As

Publication number Publication date
CN114546633A (en) 2022-05-27
CN114553880A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
US11218537B2 (en) Load balancing in distributed computing systems
US10257066B2 (en) Interconnect congestion control in a storage grid
US11463511B2 (en) Model-based load balancing for network data plane
EP3281369B1 (en) Server load balancing
US9930018B2 (en) System and method for providing source ID spoof protection in an infiniband (IB) network
US11277350B2 (en) Communication of a large message using multiple network interface controllers
EP2710784B1 (en) A method for load balancing of requests' processing of diameter servers
US9495324B2 (en) Efficient distribution of subnet administration data over an RDMA network
US9602428B2 (en) Method and apparatus for locality sensitive hash-based load balancing
US9413652B2 (en) Systems and methods for path maximum transmission unit discovery
KR20180098358A (en) Multipath transmission design
JP2016535904A (en) System and method for providing data services in engineered systems for execution of middleware and applications
US11750699B2 (en) Small message aggregation
JP6598771B2 (en) Distributed data transmission in data networks
WO2012080257A1 (en) Behavior based client selection for disparate treatment
JP2016531372A (en) Memory module access method and apparatus
CN110120897A (en) Link detection method, apparatus, electronic equipment and machine readable storage medium
US9491098B1 (en) Transparent network multipath utilization through encapsulation
CN113452778B (en) Session holding method, device, equipment, system and storage medium
CN114553879A (en) Distributed task processing method, system and storage medium
CN109327400B (en) Data communication method and data communication network
CN107493254B (en) TCP message forwarding method, device and system
CN116010130B (en) Cross-card link aggregation method, device, equipment and medium for DPU virtual port
CN110995609A (en) Message sending method and device, electronic equipment and storage medium
CN108259329B (en) Message forwarding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination