CN111597055B - Distributed data processing system, distributed computing task deployment system and method - Google Patents

Distributed data processing system, distributed computing task deployment system and method

Info

Publication number
CN111597055B
Authority
CN
China
Prior art keywords
node
model parameter
computing
model
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010724559.4A
Other languages
Chinese (zh)
Other versions
CN111597055A (en)
Inventor
柳俊丞
上官士源
李新奇
郭冉
袁进辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd filed Critical Beijing Oneflow Technology Co Ltd
Priority to CN202010724559.4A priority Critical patent/CN111597055B/en
Priority to CN202110634765.0A priority patent/CN113342525A/en
Publication of CN111597055A publication Critical patent/CN111597055A/en
Application granted granted Critical
Publication of CN111597055B publication Critical patent/CN111597055B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/54 - Interprogram communication
    • G06F9/542 - Event management; Broadcasting; Multicasting; Notifications
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Multi Processors (AREA)

Abstract

The disclosure provides a distributed data processing system, a distributed computing task deployment system, and a corresponding method. The distributed data processing system processes data in parallel on a plurality of computing devices, each of which comprises a forward data processing component and a backward data processing component. At least one computing device comprises a model parameter component that the other computing devices do not have, together with a model parameter updating component corresponding to that model parameter component. Through its corresponding broadcast component, the model parameter component sends the subset of model parameters it holds, which is needed for processing the parallel data, to the broadcast components of the other computing devices, and the model parameter updating component obtains the corresponding global gradient value from its corresponding gradient convergence component for the update processing.

Description

Distributed data processing system, distributed computing task deployment system and method
Technical Field
The present disclosure relates to data processing technology, and more particularly, to a distributed data processing system, a distributed computing task deployment system, and a method thereof.
Background
With the development of machine learning and the deepening of research on artificial neural networks, deep learning has attracted wide attention and found wide application. Deep learning is a special form of machine learning that represents the objects being learned with a layered network structure, composing abstract concepts out of simpler ones and expressing them through computation on those simpler concepts. Deep learning has recently made great advances in image recognition, speech recognition, and natural language processing. Because deep learning involves many model parameters, the amount of computation is huge, and because the training data are also large in scale, more computing resources are required.
At present, general-purpose processors such as GPUs and special-purpose chips such as TPUs already offer computing power many times that of CPUs, but the appetite of real-world applications for computing power is endless: practitioners need to process larger-scale data with larger models at a faster rate, and this cannot be satisfied by a single hardware device. Hardware development is constrained by the manufacturing process (chip area, power consumption, the propagation range of the clock signal), so the processing capacity of a single chip cannot be improved without limit. Instead, many high-throughput devices are connected by high-speed interconnect technology to perform large-scale tasks cooperatively. In a common GPU cluster architecture, GPUs within the same node (server) communicate via NVLink or PCIe, and multiple nodes are interconnected via high-speed Ethernet or InfiniBand. In the hardware deployment of the TPU Cloud inside Google, each server manages several TPUs, and the servers are connected into a large-scale cluster by high-speed interconnect technology.
To this end, those skilled in the art have proposed data parallelism. The data are divided into several parts and each device processes one of them, so each device only needs to handle a small fraction of the whole dataset; the running time of the system drops from the time one device would need to process all the data to the time one device needs to process its fraction, which yields the speed-up. This is the most common parallel mode in big-data scenarios. Data parallelism is particularly suitable for convolutional neural networks, whose models consist of many small convolution kernels; because the model itself is small, the communication traffic for synchronizing the model gradients between devices is small. At present, all mainstream frameworks support this parallel mode well.
However, in conventional data-parallel distributed deep learning training, when a batch of data is split and placed on the computing devices for computation, i.e., each computing device executes the same computation flow on different input data, the model parameters must also be mirrored on every computing device, the gradients computed by the devices must be aggregated, and the parameters then updated. The most mainstream gradient aggregation mode today is AllReduce: the gradients produced on each computing device are reduced (summed) and the result is broadcast back to every computing device, so each device obtains the same aggregated gradient and then performs the parameter update. Note that in this case the parameter update computation performed on every computing device is identical, so the parameters on every computing device remain consistent after the update. Corresponding to the model parameters being maintained, the parameters of the update component (Optimizer) used to update the model parameters must also be maintained. For example, the Adam Optimizer needs to maintain two quantities, m and v, each with the same data volume as the model parameters, plus a gradient; when performing a parameter update, Adam touches memory about 4 times the size of the model parameters (model parameters, gradient, first moment m of the gradient, second moment v of the gradient). In this case, if N computing devices are used for data-parallel training, the computation of the parameter-update part is repeated N times, and device memory of size N × 4 × the model size is needed to perform the update. When training a deep learning model, the forward and backward computation on each device can proceed independently, but once the update gradient of the model has been obtained on each device, synchronization between the devices is required, and the model parameters are updated only after the gradients of the complete batch have been aggregated.
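As a rough illustration of the redundancy described above, the following sketch (hypothetical names, plain NumPy, not the implementation of the patent or of any particular framework) shows the identical Adam-style update that every one of the N devices repeats under conventional AllReduce data parallelism; params, the aggregated gradient, m, and v together account for roughly 4 times the model size on every device.

```python
# Illustrative sketch only: the per-device state and the identical update that
# conventional AllReduce data parallelism repeats on every device.
# All names are hypothetical; beta1/beta2/eps follow the usual Adam defaults.
import numpy as np

def allreduce_adam_step(params, grads_per_device, m, v, t,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # AllReduce: the gradients produced on each device are summed and the
    # result is broadcast back, so every device sees the same aggregated value.
    grad = sum(grads_per_device)
    # Every device then runs the same update, so params, grad, m and v
    # (about 4x the model size) are all replicated N times.
    m[:] = beta1 * m + (1 - beta1) * grad
    v[:] = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    params[:] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params
```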
General computing equipment for deep learning typically includes CPUs, GPUs, FPGAs, ASICs, and the like; GPUs, FPGAs, and ASICs can collectively be called data acceleration processing devices. Such devices usually offer faster computation and higher memory bandwidth, but their memory capacity is limited, expensive per unit, and hard to expand. Clearly, the repeated computation and memory in existing data parallelism stem from the fact that the parameters and optimizer of distributed training are likewise replicated on every computing device, which on the one hand increases the occupation of each device's memory resources and on the other hand inevitably causes transmission overhead for synchronizing parameters between devices. For today's computing devices, and in particular for the GPU with its expensive memory resources, this is an enormous waste.
Accordingly, it is desirable to have a distributed data processing system, or a deployment system for distributed computing tasks, that reduces the waste of computing-device memory and computing resources during data-parallel processing.
Disclosure of Invention
To this end, the object of the present invention is to solve at least one of the above problems. According to one aspect of the present disclosure, a distributed data processing system is provided for processing data in parallel on a plurality of computing devices, each computing device comprising a forward data processing component and a backward data processing component, wherein at least one computing device comprises a model parameter component that the other computing devices do not have, together with a model parameter updating component corresponding to that model parameter component. Through its corresponding broadcast component, the model parameter component sends the subset of model parameters it holds, which the data to be processed in parallel require, to the broadcast components of the other computing devices, and the model parameter updating component obtains the corresponding global gradient value from its corresponding gradient convergence component for the update processing.
A distributed data processing system according to the present disclosure, wherein the broadcast component of each computing device inputs the model parameters to the corresponding forward data processing component for forward data processing and to the corresponding backward data processing component for backward data processing.
A distributed data processing system according to the present disclosure, further comprising: one or more operation task components connected in series between the model parameter component and the corresponding broadcast component, wherein each operation task component in the operation task components is a single-input and single-output operation task component; and one or more operation task components connected in series between the model parameter updating component and the corresponding gradient convergence component, wherein each operation task component in the operation task components is a single-input and single-output operation task component.
According to the distributed data processing system of the present disclosure, the broadcast component connected to the model parameter component inputs the model parameters in the model parameter component to the forward operation component and backward operation component corresponding to the model parameter component on the computing device where the model parameter component is located, and also to the parallel broadcast components on the other parallel computing devices, so that each parallel broadcast component inputs the received model parameters to its parallel forward operation component and backward operation component.
According to another aspect of the present disclosure, a distributed data processing method is provided, comprising: slicing the data to be processed into a plurality of sliced data according to the number of computing devices in the distributed data processing architecture, sending the sliced data to each computing device, and performing forward data processing and backward data processing; dividing the set of model parameters into a plurality of subsets and distributing each subset to a model parameter component on exactly one computing device; and having each model parameter component send the parameters it maintains, through the broadcast component connected to it, to the corresponding broadcast components on the other computing devices that process the sliced data in parallel, so that the forward data processing components and backward data processing components on those other computing devices process their sliced data based on the received model parameters.
The distributed data processing method according to the present disclosure further comprises: the gradient convergence component corresponding to the broadcast component connected to the model parameter component obtains the gradient values sent by the corresponding gradient convergence components on the other computing devices so as to obtain a global gradient value, and transmits the global gradient value to the model parameter updating component corresponding to the model parameter component, so that the model parameter updating component can update the model parameters.
According to another aspect of the present disclosure, a distributed computing task deployment system is provided, comprising: a job description component that describes a job neural network model based on the job type and obtains the computing resources to be used, the computing resources comprising a plurality of computing devices capable of performing parallel computation, and that assigns the position label of its computing device to each forward operation task node, broadcast node, backward operation task node, and gradient convergence node to which a slice of the job data to be computed in parallel belongs; a model parameter node configuration component that, based on the description of the job neural network model, obtains the model parameters used to process the job, calculates the total amount of all model parameters, and divides all the model parameters among several model parameter nodes in a load-balanced manner, wherein each model parameter node is configured on exactly one computing device and is placed, via a broadcast node, before its subsequent forward operation task node in the neural network model; and a model parameter update node configuration component that configures one update component node for each model parameter node, wherein each model parameter update node is configured on exactly one computing device and is connected behind the gradient convergence node corresponding to the broadcast node of its model parameter node, the gradient convergence node being arranged behind the backward operation task node that corresponds to the forward operation task node of the model parameter node.
The distributed computing task deployment system according to the present disclosure further comprises a single-successor operation task node configuration component that traverses every subsequent forward operation task node of each model parameter node; for a single subsequent forward operation task node that consumes only the output of the model parameter node and itself has only a single output, it configures the same position label as the model parameter node and connects that forward operation task node in series between the model parameter node and its corresponding broadcast node, and it configures the backward operation task node corresponding to that single subsequent forward operation task node between the model parameter update node corresponding to the model parameter node and the gradient convergence node corresponding to the broadcast node.
According to still another aspect of the present disclosure, a distributed computing task deployment method is also provided, comprising: describing a job neural network model based on the job type and obtaining the computing resources to be used, the computing resources comprising a plurality of computing devices capable of performing parallel computation, and assigning the position label of its computing device to each forward operation task node, broadcast node, backward operation task node, and gradient convergence node to which a slice of the job data to be computed in parallel belongs; based on the job neural network model, obtaining the model parameters used to process the job, calculating the total amount of all model parameters, dividing all the model parameters into several parts in a load-balanced manner, configuring a corresponding number of model parameter nodes, and configuring the position label of exactly one computing device for each model parameter node; inserting a broadcast node between each model parameter node and its subsequent forward operation task node and configuring it with the same position label; assigning to the model parameter update node corresponding to a model parameter node the position label of the same computing device as that model parameter node; and inserting a gradient convergence node corresponding to the inserted broadcast node between the model parameter update node and the corresponding backward operation task node, and configuring it with the same position label.
The distributed computing task deployment method according to the present disclosure further comprises: traversing every subsequent forward operation task node of each model parameter node; for a single subsequent forward operation task node that consumes only the output of the model parameter node and itself has only a single output, changing its position label to the same label as the model parameter node and connecting the relabeled forward operation task node in series between the model parameter node and the corresponding broadcast node; and changing the position label of the backward operation task node corresponding to that single subsequent forward operation task node to the same label as the model parameter node and connecting the relabeled backward operation task node in series between the model parameter update node and the gradient convergence node.
According to still another aspect of the present disclosure, a distributed computing task deployment method is also provided, comprising: describing a job neural network model based on the job type and obtaining the computing resources to be used, the computing resources comprising a plurality of computing devices capable of performing parallel computation, and assigning the position label of its computing device to each forward operation task node to which a slice of the job data to be computed in parallel belongs; based on the forward part of the job neural network model, obtaining the model parameters used to process the job, calculating the total amount of all model parameters, dividing all the model parameters into several parts in a load-balanced manner, configuring a corresponding number of model parameter nodes, and configuring the position label of exactly one computing device for each model parameter node; inserting a broadcast node between each model parameter node and its subsequent forward operation task node and configuring it with the same position label; and configuring the model parameter update node corresponding to a model parameter node with the same position label as the model parameter node, configuring the gradient convergence node corresponding to a broadcast node with the same position label as the broadcast node, and configuring the backward operation task node corresponding to a forward operation task node with the same position label as the forward operation task node.
The distributed computing task deployment method according to the present disclosure further comprises: traversing every subsequent forward operation task node of each model parameter node; for a single subsequent forward operation task node that consumes only the output of the model parameter node and itself has only a single output, changing its position label to the same label as the model parameter node and connecting the relabeled forward operation task node in series between the model parameter node and the corresponding broadcast node; and changing the position label of the backward operation task node corresponding to that single subsequent forward operation task node to the same label as the model parameter node and connecting the relabeled backward operation task node in series between the model parameter update node and the gradient convergence node.
With the distributed computing task deployment system and method and the distributed data processing system and method described above, the model parameters and the update component parameters used for data-parallel processing are deployed on only one of the several computing devices executing the data-parallel processing, and the model parameters in the model parameter component are sent through the broadcast component to every other parallel computing device for use by the subsequent forward and backward data processing components. This removes the need to deploy the same model parameters on the other parallel computing devices and thus reduces the memory footprint of model parameter deployment on those devices. Correspondingly, the update component parameters are deployed on the same computing device as the model parameters, and the gradient convergence component uses collective communication with the corresponding gradient convergence components on the other computing devices to aggregate the gradients produced by the backward data processing components, so that a global gradient is obtained and the model parameters can be updated by the model parameter updating component. This likewise removes the need to deploy, and simultaneously update, the update component parameters on the other parallel computing devices, thereby reducing the memory footprint of model parameter updating. Therefore, compared with the global gradient aggregation (AllReduce) data-parallel mode used in the prior art, the use of computing resources and memory for the update component part is effectively reduced. Meanwhile, since the cost of a global gradient aggregation is roughly equivalent to the gradient convergence plus broadcast of the present disclosure, the communication efficiency of the present disclosure matches that of the global-gradient-aggregation approach of the related art. In addition, as described above, besides the model parameters themselves, the present disclosure deploys the operation nodes that involve only the model parameters (for example, type conversion or normalization of the model) on the same computing device as the model parameter nodes, in front of the broadcast nodes corresponding to those model parameter nodes or components; this further avoids deploying such model-parameter-only operation nodes in parallel on the other computing devices and further saves memory and computing resources on those devices.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is a schematic diagram illustrating the basic structure of a distributed data processing system according to a first embodiment of the present disclosure.
Fig. 2 is a schematic diagram illustrating the basic structure of a distributed data processing system according to a second embodiment of the present disclosure.
FIG. 3 is a schematic flow diagram illustrating a distributed data processing method according to the present disclosure.
FIG. 4 is a schematic diagram illustrating a system for deploying distributed computing tasks, according to one embodiment of the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to the following examples and the accompanying drawings so that those skilled in the art can practice the invention with reference to the description.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. Furthermore, the reference to "first" does not imply the presence of "second", and sometimes a first or second is referred to only for simplicity. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
For a better understanding of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram illustrating the basic structure of a distributed data processing system according to a first embodiment of the present disclosure. As shown, a distributed data processing system according to the present disclosure includes a plurality of computing devices, for example computing device 100, computing device 200, computing device 300, and computing device 400. For simplicity, only four computing devices, such as GPUs, processing data in parallel are shown in Fig. 1; in actual use the number of computing devices can be reduced or increased as needed, with a minimum of two. As shown in Fig. 1, in the distributed data processing system according to the present disclosure, the data to be processed are sliced and distributed to the computing devices for processing: for example, sliced data 1 are input to computing device 100, sliced data 2 to computing device 200, sliced data 3 to computing device 300, sliced data 4 to computing device 400, and so on. Taking computing device 100 as an example, it includes forward data processing components 113, 213, 313, 413, etc., and backward data processing components 423, 323, 223, 123, etc. This is merely illustrative; the neural network model for actual data processing may comprise more components. The model parameter component 110, which is configured with one subset of the model parameters, is connected to the broadcast component 112 and sends its subset of model parameters via the broadcast component 112 to the subsequent forward data processing component 113 and backward data processing component 123 residing on computing device 100, while also sending the subset to the parallel broadcast components 112 on the other computing devices 200, 300, and 400, so that the parallel broadcast components 112 on those devices provide the received subset of model parameters to their parallel forward data processing components 113 and backward data processing components 123 for processing the other sliced data.
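A minimal sketch of this broadcast path, assuming a generic collective-communication handle with a broadcast primitive, may help fix the idea; the function name and the comm API are illustrative assumptions, not the actual implementation.

```python
# Sketch only: the owner-side model parameter component feeds its subset into
# its broadcast component, and the parallel broadcast components on the other
# devices receive the same subset. `comm.broadcast` is an assumed collective.
def broadcast_parameter_subset(comm, my_rank, owner_rank, owned_subset=None):
    payload = owned_subset if my_rank == owner_rank else None
    subset = comm.broadcast(payload, root=owner_rank)  # assumed primitive
    # Every device now hands the received subset to its own forward and
    # backward data processing components for its slice of the batch.
    return subset
```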
Likewise, the other subsets of the model parameter set for the job are configured to the model parameter components 210, 310, 410, etc. of the other computing devices. Each computing device is normally configured with only one model parameter component, and that component manages only one subset of the model parameters. Alternatively, when the number of divided model parameter subsets is greater than the number of computing devices, more than one model parameter component may be configured on a computing device. For example, if there are 4 computing devices for parallel computing and the model parameters are divided into 8 subsets, then two model parameter components may be configured on each computing device, each managing one subset. If the model parameters are divided into 6 subsets, two model parameter components may be configured on each of two of the computing devices to manage two subsets each, and one model parameter component on each of the other two computing devices to manage one subset each. The model parameter component 210, which is configured with one subset of the model parameters, is connected to the broadcast component 212 and sends its subset of model parameters via the broadcast component 212 to the subsequent forward data processing component 213 and backward data processing component 223 on computing device 200, while also sending the subset to the parallel broadcast components 212 on the other computing devices 100, 300, and 400, so that the parallel broadcast components 212 on those devices provide the received subset to their parallel forward data processing components 213 and backward data processing components 223 for processing the other sliced data. The model parameter components 310 and 410 perform similar processing.
Corresponding to the model parameter components 110, 210, 310, and 410 deployed on the computing devices, model parameter update components 120, 220, 320, and 420 are deployed in the backward processing section. Likewise, gradient convergence components 122, 222, 322, and 422 are deployed in the backward processing section, corresponding to the broadcast components 112, 212, 312, and 412 deployed after the model parameter components 110, 210, 310, and 410. Each model parameter update component aggregates the gradients of the sliced data via the gradient convergence component deployed on the same computing device. For example, the gradient convergence component 122 uses collective communication to collect the gradient values that the parallel gradient convergence components 122 on the other computing devices 200, 300, and 400 obtained from the backward computation on their different sliced data, thereby obtaining the global gradient value over all parallel slices, and sends this global gradient value to the model parameter update component 120 for the update. Similarly, the gradient convergence component 222 uses collective communication to collect the gradient values obtained from the backward computation on the different sliced data and sent by the parallel gradient convergence components 222 on the other computing devices 100, 300, and 400, thereby obtaining the global gradient value, and sends it to the model parameter update component 220 for the model parameter update.
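Correspondingly, the backward path can be pictured as a reduce onto the owning device, again under an assumed collective API; only the owner keeps an update component for its subset. This is a hedged sketch, not the actual component interface.

```python
# Sketch only: the gradient convergence components exchange the gradients
# computed from each data slice; only the owning device receives the summed
# (global) gradient and runs the model parameter update for its subset.
def converge_and_update(comm, my_rank, owner_rank, local_grad,
                        owned_subset, update_component):
    global_grad = comm.reduce(local_grad, root=owner_rank)  # assumed primitive
    if my_rank == owner_rank:
        update_component.update(owned_subset, global_grad)
```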
Suppose the memory required by the model, i.e., by the full set of model parameters, is M and the number of computing devices is N. In a conventional data-parallel distributed data processing system, the full set of model parameters must be copied onto every computing device for parallel processing, so every computing device must update all of the model parameters. Taking the Adam Optimizer as an example, under the AllReduce approach each computing device needs to allocate, for the Optimizer, memory of size M for the model parameters, M for the gradients, M for the first moment (m) of the gradients, and M for the second moment (v) of the gradients, i.e., 4 × M in total (model parameters, gradients, m, v), so N computing devices need N × 4 × M in total. With the distributed data processing system according to the present disclosure described above, each computing device only needs to allocate memory of size 2 × M (the model parameters after broadcast and the gradients before convergence), i.e., memory for the model parameters used by the broadcast components 112, 212, 312, and 412 and for the gradients needed by the gradient convergence components 122, 222, 322, and 422. In addition, the total memory required by all the model parameter update components 120, 220, 320, and 420 is 3 × M (model parameters, m, v): across all the computing devices, the model parameters occupy M, the first moment (m) of the gradients occupies M, and the second moment (v) of the gradients occupies M. Hence the total memory usage of the N computing devices for performing the model update is 2 × N × M + 3 × M. Clearly, this saves a large amount of computing-device memory compared with a conventional distributed data-parallel processing system. Moreover, in the ideal case, with the same number N of computing devices, the computational overhead at the model parameter update components of the distributed data processing system of the present disclosure is 1/N of the computational overhead of the conventional update component part.
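The comparison can be checked with a few lines of arithmetic, expressed as multiples of the model size M.

```python
# Worked check of the memory comparison above (single-precision training with
# an Adam-style optimizer), expressed as multiples of the model size M.
def allreduce_memory(n_devices):
    # every device holds model parameters, gradients, m and v: 4 * M each
    return n_devices * 4

def sharded_memory(n_devices):
    # every device holds the broadcast parameters and the pre-convergence
    # gradients (2 * M each), plus a single copy of (parameters, m, v) = 3 * M
    # spread across the owning devices
    return 2 * n_devices + 3

for n in (2, 4, 8):
    print(n, allreduce_memory(n), sharded_memory(n))
# N = 4: 16 * M versus 11 * M; N = 8: 32 * M versus 19 * M
```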
Fig. 2 is a schematic diagram of the basic structure of a distributed data processing system according to a second embodiment of the present disclosure. The embodiment shown in Fig. 2 differs from the embodiment shown in Fig. 1 in that an operation task component 111, 211, 311, or 411 is connected in series between the model parameter component and its corresponding broadcast component on each computing device. Although only one operation task component 111 is shown here, there may in fact be any number. Correspondingly, in the backward part, an operation task component 121, 221, 321, or 421 is connected in series between the model parameter update component and the corresponding gradient convergence component. Each of these operation task components is a single-input, single-output operation task component. This is a distributed data processing system for half-precision training. By applying the distributed data processing system of the present disclosure to half-precision training, the operation nodes that involve only the model parameters (for example, type conversion or normalization of the model) are deployed on the same computing device as the model parameter nodes, in front of the broadcast nodes corresponding to those model parameter nodes or components; this further avoids deploying such model-parameter-only operation nodes in parallel on the other computing devices and further saves memory resources on those devices. The portions of Fig. 2 that are the same as in Fig. 1 are drawn as dashed boxes, so their description will not be repeated.
Specifically, suppose again that the memory required by the model, i.e., by the full set of model parameters, is M and the number of computing devices is N. In a conventional data-parallel distributed data processing system for half-precision training, the full set of model parameters must be copied onto every computing device for parallel processing, so every computing device must update all the model parameters. Taking the Adam Optimizer as an example, under the AllReduce approach each computing device needs to allocate memory of size 5 × M for the model parameter update component (model parameters, first moment (m) of the gradient, second moment (v) of the gradient, the half-precision model, and the half-precision gradient), i.e., N × 5 × M in total. With the distributed data processing system according to the present disclosure described above, each computing device only needs to be assigned M (the half-precision model after broadcast and the half-precision gradient before convergence). In addition, a total of 4 × M (model, m, v, and the single-precision gradient) is assigned across all the computing devices. The total memory usage for implementing the model parameter update is therefore N × M + 4 × M. Clearly, this saves a large amount of computing-device memory compared with a conventional distributed data-parallel processing system. Moreover, in the ideal case, with N computing devices, the computational overhead at the model parameter update components of the distributed data processing system of the present disclosure is 1/N of the computational overhead of the conventional update component part.
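The same quick check can be made for the half-precision case, using the counts stated above.

```python
# Worked check for the half-precision case above: conventional AllReduce
# needs N * 5 * M in total, while the sharded layout needs N * M + 4 * M.
for n in (2, 4, 8):
    print(n, n * 5, n + 4)
# N = 8: 40 * M versus 12 * M
```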
It should be noted that when the actual component deployment is performed, all the data-parallel model parameters are first traversed, and for each model parameter its successor nodes are traversed. If the traversed model parameter has only one successor node or component, and that successor consumes only the model parameter and has only one output, then the successor is treated together with the model parameter as a node that consumes only the model parameter, and the successor of that successor is traversed under the same condition, until all successor nodes satisfying the condition have been found. The model parameter and all the successor nodes satisfying the condition form one group, and the groups of all the model parameters form a group list.
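This grouping rule can be sketched as a simple traversal; the node API (successors, inputs, outputs) below is an assumption for illustration, not the actual graph interface.

```python
# Sketch only: starting from each data-parallel model parameter node, absorb
# successors while there is exactly one successor, it consumes only this
# chain's output, and it has a single output.
def build_group(param_node):
    group = [param_node]
    node = param_node
    while True:
        successors = node.successors()          # assumed node API
        if len(successors) != 1:
            break
        nxt = successors[0]
        if nxt.inputs() != [node] or len(nxt.outputs()) != 1:
            break
        group.append(nxt)
        node = nxt
    return group

def build_group_list(model_param_nodes):
    return [build_group(p) for p in model_param_nodes]
```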
FIG. 3 is a schematic flow diagram illustrating a distributed data processing method according to the present disclosure. As shown in Fig. 3, first, at step S310, the data to be processed are sliced into a plurality of sliced data based on the number of computing devices used for data-parallel processing, the sliced data are sent to the respective computing devices, and forward data processing and backward data processing are performed. Concurrently or subsequently, at step S320, the set of model parameters needed for the data processing is divided into a plurality of subsets, each subset being distributed to a model parameter component on exactly one computing device. Next, at step S330, each model parameter component sends the parameters it maintains, through the broadcast component connected to it, to the corresponding broadcast components on the other computing devices processing the sliced data in parallel, so that each corresponding broadcast component passes the received subset of model parameters to the subsequent forward data processing components and backward data processing components on its own computing device that need those parameters, and these components process the sliced data based on the received model parameters. Finally, at step S340, each gradient convergence component obtains the gradient values sent by the corresponding gradient convergence components on the other computing devices so as to obtain a global gradient value, and transmits the global gradient value to the model parameter update component corresponding to the model parameter component, so that the model parameter update component performs the model parameter update.
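Putting steps S310 to S340 together, one training step might look like the following sketch, reusing the assumed communication and component APIs from the earlier sketches and a hypothetical forward_backward helper that runs the local pass and returns one gradient per parameter subset.

```python
# Sketch only: one data-parallel step following S310-S340, with hypothetical
# comm/update_components APIs and a hypothetical forward_backward helper.
def distributed_step(batch_slices, my_rank, comm, owned_subsets,
                     update_components, forward_backward):
    shard = batch_slices[my_rank]                       # S310: per-device slice
    # S320 was done at deployment time: owned_subsets maps owner_rank -> subset.
    params = {}
    for owner, subset in owned_subsets.items():         # S330: broadcast subsets
        payload = subset if my_rank == owner else None
        params[owner] = comm.broadcast(payload, root=owner)
    grads = forward_backward(shard, params)             # local forward + backward
    for owner, grad in grads.items():                   # S340: converge gradients
        global_grad = comm.reduce(grad, root=owner)
        if my_rank == owner:                            # only the owner updates
            update_components[owner].update(owned_subsets[owner], global_grad)
```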
FIG. 4 is a schematic diagram illustrating a system for deploying distributed computing tasks according to one embodiment of the present disclosure. As shown in Fig. 4, a deployment system 500 for distributed computing tasks includes a job description component 510, a model parameter node configuration component 520, and a model parameter update node configuration component 530. The job description component 510 describes a job neural network model based on the job type and obtains the computing resources to be used for the job, including multiple computing devices that can perform parallel computation, such as computing device 1, computing device 2, computing device 3, and so on. The job description component 510 assigns the position label of its computing device to each forward operation task node, broadcast node, backward operation task node, and gradient convergence node to which a slice of the job data computed in parallel belongs. The model parameter node configuration component 520 obtains, based on the description of the job neural network model, the model parameters used to process the job, calculates the total amount of all model parameters, and divides all the model parameters among several model parameter nodes in a load-balanced manner; each model parameter node is configured on exactly one computing device and is placed in the neural network, via a broadcast node, before its subsequent forward operation task node. The model parameter update node configuration component 530 configures one model parameter update node for each model parameter node; each model parameter update node is configured on exactly one computing device and is connected behind the gradient convergence node corresponding to the broadcast node of its model parameter node, and that gradient convergence node is arranged behind the backward operation task node corresponding to the forward operation task node of the model parameter node.
Optionally, the distributed computing task deployment system 500 further includes a single-successor operation task node configuration component 540, which traverses every subsequent forward operation task node of each model parameter node; for a single subsequent forward operation task node that consumes only the output of the model parameter node and has only a single output, it configures the same position label as the model parameter node and connects that node in series between the model parameter node and its corresponding broadcast node, and it configures the backward operation task node corresponding to that single subsequent forward operation task node between the model parameter update node corresponding to the model parameter node and the gradient convergence node corresponding to the broadcast node.
The method for deploying a distributed computing task on the distributed data processing system can be implemented on the deployment system 500 for distributed computing tasks shown in Fig. 4. The deployment method can be carried out either before the forward-and-backward expansion of the neural network or after it.
Specifically, before the network is expanded from front to back, all the data-parallel model parameters are traversed. For each model parameter, its successor nodes are traversed: if the model parameter has only one successor node, and that successor consumes only the model parameter and has only one output, then the successor is treated together with the model parameter as a node that consumes only the model parameter, and the successor of that successor is traversed under the same condition, until all successor nodes satisfying the condition have been found. The model parameter and all the successor nodes satisfying the condition form one group (that is, a group made up of the model parameter node and its single-input, single-output successors; there may be several single-input, single-output successors connected in series within a group), and the groups of all the model parameters form a group list.
For all the groups found, the size of the model parameters in each group is calculated, and the groups are distributed to the computing devices according to these sizes so that the amount of model parameters assigned to each computing device is as even as possible, i.e., the model parameter subsets distributed to the computing devices are load balanced. The position labels (placements) of the nodes in each group are modified according to the assignment of the model parameter subsets, and a broadcast node (e.g., broadcast component 112, 212, 312, or 412 in Fig. 1) is inserted between the last node in each group and that node's successor. On the computing device where the model parameter node is deployed, the broadcast node takes the last node of the group containing the model parameter node as its input, while the parallel broadcast nodes on the other devices take the broadcast node on that deploying device as their input. As shown in Fig. 2, for example, broadcast node 112 on computing device 1 takes operation node 111 as input, and the parallel broadcast nodes 112 on the other computing devices take broadcast node 112 on computing device 1 as input. Optionally, the topological order, within the whole computation graph, of the successor nodes of the last node of each group containing a model parameter node is obtained, all the groups are sorted in ascending order by the minimum of the topological orders of all the successors of the group's last node, and the broadcast nodes corresponding to the groups are controlled in turn according to this sorted result, so that the broadcast processes are executed in that order.
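The load-balanced assignment and the broadcast-node insertion can be pictured as a greedy packing over the group list; the graph-editing helpers and node attributes below are assumptions for illustration only, not the actual deployment interface.

```python
# Sketch only: assign each group to the least-loaded device by parameter size,
# pin the group's placement, and insert the owner-side and parallel broadcast
# nodes. graph.insert_* and node attributes are assumed helpers.
def assign_groups_and_insert_broadcasts(groups, devices, graph):
    def param_size(group):
        return sum(node.byte_size for node in group)    # assumed attribute
    load = {device: 0 for device in devices}
    for group in sorted(groups, key=param_size, reverse=True):
        owner = min(load, key=load.get)                 # least-loaded device
        load[owner] += param_size(group)
        for node in group:
            node.placement = owner                      # pin the whole group
        last = group[-1]
        bcast = graph.insert_broadcast_after(last, placement=owner)
        for device in devices:                          # parallel broadcast nodes
            if device != owner:
                graph.insert_parallel_broadcast(bcast, placement=device)
```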
As the system expands backward and builds the training graph, the deployment system 500 also deploys onto a single device the backward nodes generated from forward nodes that have already been placed on that same device. For parallel forward nodes, the corresponding parallel backward nodes are deployed on the multiple parallel computing devices. Likewise, corresponding to the position labels of the model parameter nodes, the related nodes of the corresponding update component carry the same position labels and are therefore deployed on the same computing device, and in the backward direction there are gradient convergence nodes corresponding to the broadcast nodes, e.g., corresponding to the gradient convergence components 122, 222, 322, and 422 in Fig. 1 or Fig. 2. Similarly, these gradient convergence nodes also have parallel gradient convergence nodes on the other computing devices.
Alternatively, the deployment of the distributed computing task may be performed after the whole neural network has been expanded in the forward and backward directions. Specifically, after the forward and backward expansion, all the forward and backward operation nodes and the loss function nodes have been obtained. All the data-parallel model parameters are traversed, and for each model parameter its successor nodes are traversed: if the model parameter has only one successor node, and that successor consumes only the model parameter and has only one output, then the successor is treated together with the model parameter as a node that consumes only the model parameter, and the successor of that successor is traversed under the same condition, until all successor nodes satisfying the condition have been found. The model parameter and all the successor nodes satisfying the condition form one group (that is, a group made up of the model parameter node and its single-input, single-output successors; there may be several single-input, single-output successors connected in series within a group), and the groups of all the model parameters form a group list.
For all the groups found, the size of the model parameters in each group is calculated, and the groups are distributed to the computing devices according to these sizes so that the amount of model parameters assigned to each computing device is as even as possible, i.e., the model parameter subsets distributed to the computing devices are load balanced. The position labels (placements) of the nodes in each group are modified according to the assignment of the model parameter subsets, and a broadcast node (e.g., broadcast component 112, 212, 312, or 412 in Fig. 1) is inserted between the last node in each group and that node's successor. On the computing device where the model parameter node is deployed, the broadcast node takes the last node of the group containing the model parameter node as its input, while the parallel broadcast nodes on the other devices take the broadcast node on that deploying device as their input. As shown in Fig. 2, for example, broadcast node 112 on computing device 1 takes operation node 111 as input, and the parallel broadcast nodes 112 on the other computing devices take broadcast node 112 on computing device 1 as input. Optionally, the topological order, within the whole computation graph, of the successor nodes of the last node of each group containing a model parameter node is obtained, all the groups are sorted in ascending order by the minimum of the topological orders of all the successors of the group's last node, and the broadcast nodes corresponding to the groups are controlled in turn according to this sorted result, so that the broadcast processes are executed in that order.
Then, the corresponding model parameter update node that has already been expanded from the model parameter node and its corresponding broadcast node is found. The model parameter update node configuration component 530 of the deployment system 500 modifies the position label of the deployed model parameter update node according to the position label of its corresponding model parameter node, so that the two labels match and both nodes are deployed on the same computing device. A gradient convergence node corresponding to the broadcast node is inserted between the first backward node (the backward node corresponding to the last forward node in the group) and that node's predecessor. If the computing device is the one where the group is located, the output of the gradient convergence node is connected to the first backward node; the parallel gradient convergence nodes on the other devices produce no output, or rather their outputs are merged, through collective communication, into the gradient convergence node corresponding to the model parameter update node. As shown in Fig. 2, for example, the gradient convergence node 122 on computing device 1 outputs to operation node 121, and the parallel gradient convergence nodes 122 on the other computing devices take the gradient convergence node 122 on computing device 1 as their output target.
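The backward-side edit just described can be sketched in the same style; again the graph helpers are assumptions, meant only to show where the gradient convergence nodes end up relative to the first backward node of each group.

```python
# Sketch only: insert a gradient convergence node in front of the first
# backward node of a group on every device; only the copy on the owning device
# feeds the update path, the others feed the collective reduce.
def insert_gradient_convergence_nodes(group, owner, devices, graph):
    first_backward = graph.backward_of(group[-1])       # assumed lookup
    for device in devices:
        sink = graph.insert_before(first_backward,
                                   kind="gradient_convergence",
                                   placement=device)    # assumed helper
        if device == owner:
            sink.connect_output_to(first_backward)      # owner-side output
        # non-owner copies contribute only via collective communication
```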
The basic principles of the present disclosure have been described in connection with specific embodiments, but it should be noted that it will be understood by those skilled in the art that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or a combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (6)

1. A system for deploying a distributed computing task, comprising:
a job description component for describing a job neural network model based on the job type and obtaining the computing resources to be used for the job, wherein the computing resources comprise a plurality of computing devices capable of executing parallel computation, and the position labels of the computing devices are given to the forward operation task nodes, broadcast nodes, backward operation task nodes and gradient convergence nodes to which the sliced data for parallel computation of the job data belong;
a model parameter node configuration component configured to acquire the model parameters for processing the job based on the description of the job's neural network model, to calculate the total amount of all the model parameters, and to divide all the model parameters into a plurality of model parameter nodes in a load-balanced manner, wherein each model parameter node is configured on only one computing device and is arranged in the neural network, through a broadcast node, before its subsequent forward computing task node; and
a model parameter update node configuration component configured to configure one model parameter update node for each model parameter node, wherein each model parameter update node is configured on only one computing device and is connected behind the gradient sink node corresponding to the broadcast node of that model parameter node, the gradient sink node being arranged behind the backward computing task node that corresponds to the forward computing task node of the model parameter node.
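As a rough illustration of the load-balanced division performed by the model parameter node configuration component of claim 1, the sketch below splits a set of named parameters across computing devices by total size. All names are hypothetical, and the greedy largest-first heuristic is only one possible balancing strategy, not necessarily the one used by the claimed system.

```python
# Hypothetical sketch: assign each named parameter to one device so that the
# per-device total parameter size stays roughly balanced.

from typing import Dict, List

def partition_parameters(param_sizes: Dict[str, int],
                         num_devices: int) -> List[List[str]]:
    """Assign each named parameter to one device, balancing total size."""
    groups: List[List[str]] = [[] for _ in range(num_devices)]
    loads = [0] * num_devices
    # Place the largest parameters first so the groups stay balanced.
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        device = loads.index(min(loads))
        groups[device].append(name)
        loads[device] += size
    return groups

# Example: four parameters split across two computing devices.
sizes = {"embedding": 400, "dense1.w": 300, "dense2.w": 200, "bias": 10}
print(partition_parameters(sizes, num_devices=2))
# -> [['embedding', 'bias'], ['dense1.w', 'dense2.w']]  (one group per device)
```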
2. The deployment system of distributed computing tasks of claim 1, further comprising:
a single subsequent computing task node configuration component that traverses each subsequent forward computing task node of the model parameter nodes, configures each single subsequent forward computing task node that consumes only the output of a model parameter node and has only a single output with the same position label as that model parameter node, connects it in series between the model parameter node and the corresponding broadcast node, and configures the backward computing task node corresponding to that single subsequent forward computing task node between the model parameter update node and the gradient sink node corresponding to the model parameter node and its broadcast node.
3. A method of deploying a distributed computing task, comprising:
describing a neural network model of the job based on the job type and acquiring the computing resources for processing the job, wherein the computing resources comprise a plurality of computing devices capable of performing parallel computing, and position labels of the computing devices are assigned to the forward computing task nodes, broadcast nodes, backward computing task nodes, and gradient sink nodes to which the fragment data used for parallel computation of the job data belong;
based on the neural network model of the job, obtaining the model parameters for processing the job, calculating the total amount of all the model parameters, dividing all the model parameters into a plurality of parts in a load-balanced manner, configuring a corresponding number of model parameter nodes, and configuring for each model parameter node the position label of only one computing device;
inserting a broadcast node between each model parameter node and its subsequent forward computing task node, and configuring it with the same position label;
assigning, to the model parameter update node corresponding to each model parameter node, the same computing-device position label as that model parameter node; and
inserting a gradient sink node corresponding to the inserted broadcast node between the model parameter update node and the corresponding backward computing task node, and configuring it with the same position label.
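The steps of claim 3 can be pictured as a small graph rewrite. The sketch below, with hypothetical node and helper names and no claim to match the actual implementation, expands one model parameter node into a broadcast node on the forward path and a gradient sink node feeding a model parameter update node on the backward path, all carrying the same position label.

```python
# Hypothetical sketch: expand one model parameter node into
# param -> broadcast -> forward (forward path) and
# backward -> grad_sink -> update (backward path), all on one device.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskNode:
    name: str
    device: Optional[int] = None     # position label (one computing device)

@dataclass
class Edge:
    src: TaskNode
    dst: TaskNode

def deploy_parameter(param: TaskNode,
                     forward: TaskNode,
                     backward: TaskNode,
                     device: int,
                     edges: List[Edge]) -> None:
    """Insert broadcast, gradient sink, and update nodes for one parameter."""
    param.device = device
    broadcast = TaskNode(f"broadcast_{param.name}", device)
    update = TaskNode(f"update_{param.name}", device)
    grad_sink = TaskNode(f"grad_sink_{param.name}", device)
    edges += [
        Edge(param, broadcast), Edge(broadcast, forward),    # forward path
        Edge(backward, grad_sink), Edge(grad_sink, update),  # backward path
    ]

edges: List[Edge] = []
deploy_parameter(TaskNode("w0"), TaskNode("matmul_fw", 0), TaskNode("matmul_bw", 0),
                 device=0, edges=edges)
for e in edges:
    print(f"{e.src.name}@{e.src.device} -> {e.dst.name}@{e.dst.device}")
```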
4. The method for deploying a distributed computing task according to claim 3, further comprising:
traversing each subsequent forward computing task node of the model parameter node, modifying the position label configured for each single subsequent forward computing task node that consumes only the output of the model parameter node and has only a single output so that it is the same as that of the model parameter node, and connecting that node in series between the model parameter node and the corresponding broadcast node; and
modifying the position label of the backward computing task node corresponding to the single subsequent forward computing task node so that it is the same as that of the model parameter node, and connecting that node in series between the model parameter update node and the gradient sink node.
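For the relabeling recited in claim 4, a sketch of the qualifying test and the label change might look as follows; the dictionaries and the `bw_` naming convention are assumptions for illustration only, not part of the claimed method.

```python
# Hypothetical sketch: pull a forward node that consumes only one parameter
# and has a single output onto that parameter's device, together with its
# backward counterpart.

from typing import Dict, List, Set

def relabel_single_successors(consumers: Dict[str, List[str]],
                              inputs: Dict[str, Set[str]],
                              labels: Dict[str, int],
                              param: str) -> None:
    """Move qualifying single successors of `param` onto `param`'s device."""
    for node in consumers.get(param, []):
        only_consumes_param = inputs.get(node) == {param}
        single_output = len(consumers.get(node, [])) == 1
        if only_consumes_param and single_output:
            labels[node] = labels[param]              # same label as the parameter
            labels["bw_" + node] = labels[param]      # its backward counterpart too

# Example: "cast" consumes only parameter "w0" and has a single output.
labels = {"w0": 2, "cast": 0, "bw_cast": 0}
relabel_single_successors(consumers={"w0": ["cast"], "cast": ["matmul"]},
                          inputs={"cast": {"w0"}},
                          labels=labels, param="w0")
print(labels)   # {'w0': 2, 'cast': 2, 'bw_cast': 2}
```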
5. A method of deploying a distributed computing task, comprising:
describing a neural network model of the job based on the job type and acquiring the computing resources for processing the job, wherein the computing resources comprise a plurality of computing devices capable of performing parallel computing, and position labels of the computing devices are assigned to the forward computing task nodes to which the fragment data used for parallel computation of the job data belong;
based on the forward part of the neural network model of the job, obtaining the model parameters for processing the job, calculating the total amount of all the model parameters, dividing all the model parameters into a plurality of parts in a load-balanced manner, configuring a corresponding number of model parameter nodes, and configuring for each model parameter node the position label of one computing device;
inserting a broadcast node between each model parameter node and its subsequent forward computing task node, and configuring it with the same position label;
configuring the model parameter update node corresponding to each model parameter node with the same position label, configuring the gradient sink node corresponding to each broadcast node with the same position label, and configuring the backward computing task node corresponding to each forward computing task node with the same position label.
6. The method for deploying a distributed computing task according to claim 5, further comprising:
traversing each subsequent forward computing task node of the model parameter node, modifying the position label configured for each single subsequent forward computing task node that consumes only the output of the model parameter node and has only a single output so that it is the same as that of the model parameter node, and connecting that node in series between the model parameter node and the corresponding broadcast node; and
modifying the position label of the backward computing task node corresponding to the single subsequent forward computing task node so that it is the same as that of the model parameter node, and connecting that node in series between the model parameter update node and the gradient sink node.
CN202010724559.4A 2020-07-24 2020-07-24 Distributed data processing system, distributed computing task deployment system and method Active CN111597055B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010724559.4A CN111597055B (en) 2020-07-24 2020-07-24 Distributed data processing system, distributed computing task deployment system and method
CN202110634765.0A CN113342525A (en) 2020-07-24 2020-07-24 Distributed data processing system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010724559.4A CN111597055B (en) 2020-07-24 2020-07-24 Distributed data processing system, distributed computing task deployment system and method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110634765.0A Division CN113342525A (en) 2020-07-24 2020-07-24 Distributed data processing system and method thereof

Publications (2)

Publication Number Publication Date
CN111597055A (en) 2020-08-28
CN111597055B (en) 2021-06-11

Family

ID=72183047

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110634765.0A Pending CN113342525A (en) 2020-07-24 2020-07-24 Distributed data processing system and method thereof
CN202010724559.4A Active CN111597055B (en) 2020-07-24 2020-07-24 Distributed data processing system, distributed computing task deployment system and method

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110634765.0A Pending CN113342525A (en) 2020-07-24 2020-07-24 Distributed data processing system and method thereof

Country Status (1)

Country Link
CN (2) CN113342525A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363248B (en) * 2020-09-29 2023-04-07 华为技术有限公司 Computing system, accelerator, switching plane and aggregation communication method
WO2022155920A1 (en) * 2021-01-22 2022-07-28 Oppo广东移动通信有限公司 Information transmission method and apparatus, and device and storage medium
CN112884086B (en) * 2021-04-06 2022-08-30 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and program product
CN115843037A (en) * 2021-08-17 2023-03-24 华为技术有限公司 Data processing method and device
CN113626652B (en) * 2021-10-11 2021-12-17 北京一流科技有限公司 Data processing network system, data processing network deployment system and method thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125316A1 (en) * 2014-10-08 2016-05-05 Nec Laboratories America, Inc. MALT: Distributed Data-Parallelism for Existing ML Applications

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5295037B2 (en) * 2009-08-11 2013-09-18 日本電信電話株式会社 Learning device using Conditional Random Fields or Global Conditional Log-linear Models, and parameter learning method and program in the learning device
CN106933669B (en) * 2015-12-29 2021-01-08 伊姆西Ip控股有限责任公司 Apparatus and method for data processing
US10509778B2 (en) * 2016-05-25 2019-12-17 Google Llc Real-time transactionally consistent change notifications
CN107578094A (en) * 2017-10-25 2018-01-12 济南浪潮高新科技投资发展有限公司 Method for realizing distributed training of neural networks based on a parameter server and FPGA
CN109144729A (en) * 2018-08-27 2019-01-04 联想(北京)有限公司 The data processing method and distributed system of distributed system
CN111275173B (en) * 2020-02-12 2023-08-04 字节跳动有限公司 Neural network training method, device and equipment thereof

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125316A1 (en) * 2014-10-08 2016-05-05 Nec Laboratories America, Inc. MALT: Distributed Data-Parallelism for Existing ML Applications

Also Published As

Publication number Publication date
CN113342525A (en) 2021-09-03
CN111597055A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111597055B (en) Distributed data processing system, distributed computing task deployment system and method
CN110134636B (en) Model training method, server, and computer-readable storage medium
EP3540652B1 (en) Method, device, chip and system for training neural network model
US10482380B2 (en) Conditional parallel processing in fully-connected neural networks
Dettmers 8-bit approximations for parallelism in deep learning
US10810492B2 (en) Memory side acceleration for deep learning parameter updates
EP3158529B1 (en) Model parallel processing method and apparatus based on multiple graphic processing units
Zhao et al. Dynamic stale synchronous parallel distributed training for deep learning
CN109597965B (en) Data processing method, system, terminal and medium based on deep neural network
CN110347636B (en) Data execution body and data processing method thereof
Zou et al. Mariana: Tencent deep learning platform and its applications
WO2019060670A1 (en) Compression of sparse deep convolutional network weights
CN109299781A (en) Distributed deep learning system based on momentum and beta pruning
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN113298222A (en) Parameter updating method based on neural network and distributed training platform system
CN110955734B (en) Distributed signature decision system and method for logic node
CN113642734A (en) Distributed training method and device for deep learning model and computing equipment
CN109409746A (en) A kind of production scheduling method and device
CN112764940A (en) Multi-stage distributed data processing and deploying system and method thereof
CN113297127A (en) Parameter updating method and platform system for large-scale distributed training cluster
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
CN112799852A (en) Multi-dimensional SBP distributed signature decision system and method for logic node
CN110852414A (en) High-precision low-order convolution neural network
US11928598B2 (en) Method and system for distributed neural network training
Esfahanizadeh et al. Stream iterative distributed coded computing for learning applications in heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220617

Address after: 1610, 16th floor, 101-2-16th floor, building 21, Rongda Road, Chaoyang District, Beijing 100012

Patentee after: Zhongguancun Technology Leasing Co.,Ltd.

Address before: 100083 5-e-1, 4th floor, building 2, yard 1, Wangzhuang Road, Haidian District, Beijing

Patentee before: Beijing First-class Technology Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240116

Address after: Room 302, 3rd Floor, Building 9, Courtyard 1, Zhongguancun East Road, Haidian District, Beijing, 100081

Patentee after: Beijing First-class Technology Co.,Ltd.

Address before: 1610, 16th floor, 101-2-16th floor, building 21, Rongda Road, Chaoyang District, Beijing 100012

Patentee before: Zhongguancun Technology Leasing Co.,Ltd.