CN113297127A - Parameter updating method and platform system for large-scale distributed training cluster

Parameter updating method and platform system for large-scale distributed training cluster

Info

Publication number
CN113297127A
CN113297127A
Authority
CN
China
Prior art keywords
parameter
aggregation
updating method
topology
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010109266.5A
Other languages
Chinese (zh)
Inventor
张翔宇
李杨
张曼妮
孙军欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd filed Critical Shenzhen Zhixing Technology Co Ltd
Priority to CN202010109266.5A priority Critical patent/CN113297127A/en
Publication of CN113297127A publication Critical patent/CN113297127A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake

Abstract

The invention provides a parameter updating method and a platform system for a large-scale distributed training cluster. Specifically, in the parameter updating method, the platform system, and the computer-readable storage medium storing the related program code, a logical topology formed by the computing modules is determined according to the physical topology of the computing modules and the differences between the communication channels of different computing modules; the logical topology comprises a plurality of levels and is aggregated in stages from the lowest level to the highest level. After aggregation completes at the highest level, the parameters are synchronized back down from the highest level to the lower levels, level by level, so that efficient parameter aggregation among the computing modules is achieved.

Description

Parameter updating method and platform system for large-scale distributed training cluster
Technical Field
The invention relates to the technical field of distributed training of deep neural network models, and in particular to a parameter updating method and a platform system for a large-scale distributed training cluster.
Background
In recent years, artificial intelligence, and deep learning in particular, has made major breakthroughs in fields such as image recognition and language processing, and has begun to be widely applied in commercial activities. One key reason deep learning can achieve such breakthroughs is that it processes a large number of samples during training and learns from them the many features they contain. To achieve this "feature learning", a deep neural network model is indispensable. A deep neural network has multiple hidden layers between its input layer and output layer and can therefore model complex nonlinear relationships. Just as most machine learning algorithms build a model that fits the training data and then use it to predict on new inputs, a deep learning algorithm must train its deep neural network to find matching model parameter values so that the model can deliver accurate predictions.
Distributed training can be broadly divided into two types: model-parallel and data-parallel. In the former, the computation graph of the model is partitioned across different computing nodes, which compute in a pipelined fashion; in the latter, the samples of a sample set are loaded onto different computing modules, and after each module computes its updates, the updated quantities are aggregated to achieve parameter updates with parameter consistency. Deep neural network models are usually trained in the data-parallel mode. In data-parallel training, the parameter update of each computing module in each iteration is usually realized by a parameter aggregation method. Existing parameter aggregation techniques mainly fall into the following two categories:
first, a centralized Parameter Server undertakes the work of collecting, averaging and distributing gradients; in a deployment based on this communication model, the access bandwidth of the parameter server easily becomes a bottleneck, which severely limits the scalability of large-scale clusters;
second, the decentralized Ring-AllReduce mode, i.e., a communication mode based on a ring topology. In Ring-AllReduce, each computing module successively passes its parameter (gradient) segments to its adjacent computing module and collects the segments passed to it, so the communication bottleneck of the parameter-server approach is avoided; however, in a very large cluster the data communication incurs a large latency overhead.
As training clusters grow larger, current parameter updating methods, and in particular parameter aggregation methods such as gradient aggregation, face challenges in scalability and efficiency. The parameter aggregation technique provided by the implementations of the invention can, to a certain extent, solve the problems that parameter updating of deep neural network models in the prior art restricts the scalability of large-scale clusters and affects the efficiency of parameter updating.
Disclosure of Invention
In view of this, the present invention provides a parameter updating method and a platform system for a large-scale distributed training cluster.
In one aspect, an embodiment of the invention provides a parameter updating method for a large-scale distributed training cluster, used for parameter aggregation among different computing modules in a large-scale cluster scenario.
The parameter updating method for the large-scale distributed training cluster comprises the following steps:
determining a logical topology formed by the computing modules according to the physical topology of the computing modules and the differences between the communication channels of different computing modules, wherein the logical topology comprises a plurality of levels and is aggregated in stages from the lowest level to the highest level; and after aggregation completes at the highest level, synchronizing the updated parameters back down from the highest level to the lower levels, level by level.
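As a minimal single-process sketch of this aggregate-then-synchronize flow (the `Module` class, the `aggregate_up`/`sync_down` helpers and the NumPy simulation are illustrative assumptions, not code from the patent): each group of a level reduces its members' parameters onto a head member, the head members form the groups of the next level, and once the highest level has aggregated, the result is copied back down level by level.

```python
import numpy as np

class Module:
    """A hypothetical computing module holding one flat parameter vector."""
    def __init__(self, params):
        self.params = np.asarray(params, dtype=float)

def aggregate_up(levels):
    """levels[k] is a list of groups (lists of Modules) at level k, ordered from
    the lowest level to the highest; the first member of each group acts as the
    head node that carries the group's partial result up to the next level."""
    for groups in levels:
        for group in groups:
            # the head node accumulates the parameters of every member of its group
            group[0].params = np.sum([m.params for m in group], axis=0)

def sync_down(levels):
    """After the highest level has aggregated, push the result back down:
    each head node copies its (now global) parameters to its group members."""
    for groups in reversed(levels):
        for group in groups:
            for m in group[1:]:
                m.params = group[0].params.copy()

# toy check: 8 modules, three levels of pairwise groups
mods = [Module([float(i), 10.0 * i]) for i in range(8)]
total = np.sum([m.params for m in mods], axis=0)
levels = [
    [[mods[0], mods[1]], [mods[2], mods[3]], [mods[4], mods[5]], [mods[6], mods[7]]],
    [[mods[0], mods[2]], [mods[4], mods[6]]],   # heads of the level below
    [[mods[0], mods[4]]],                       # highest level
]
aggregate_up(levels)
sync_down(levels)
assert all(np.array_equal(m.params, total) for m in mods)
```

Real gradient aggregation would typically also divide by the number of workers (averaging) and move data over the communication channels discussed below; the sketch only models the data flow of the staged aggregation and reverse synchronization.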
In some embodiments of the parameter updating method, the highest-level aggregation includes parameter aggregation between head node devices in different racks.
In some embodiments of the parameter updating method, the hierarchical aggregation includes parameter aggregation between different devices, in particular between devices connected to the same switch.
In some embodiments of the parameter updating method, the hierarchical aggregation includes parameter aggregation between different NUMA nodes.
In the above distributed training process, a computing module may be an entire device, a NUMA node within a device, or an individual GPU or CPU. In the parameter updating method provided in some embodiments, each computing module performs its computation on a GPU, a CPU, or the like; during training, communication between computing modules is based on DMA if they are located in the same device, and on RDMA if they are located in different devices.
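As a small illustration of this channel choice (the `ModuleLocation` descriptor and the `pick_channel` helper are hypothetical names, not from the patent), one might tag every computing module with the device it lives in and select the transport accordingly:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModuleLocation:
    host: str         # physical device (machine) the module runs on
    numa_node: int    # NUMA node inside that device
    accelerator: str  # e.g. "gpu:0" or "cpu"

def pick_channel(a: ModuleLocation, b: ModuleLocation) -> str:
    """Return the assumed transport between two computing modules:
    DMA when they share a device, RDMA when they sit in different devices."""
    return "DMA" if a.host == b.host else "RDMA"

print(pick_channel(ModuleLocation("host-0", 0, "gpu:0"),
                   ModuleLocation("host-0", 1, "gpu:2")))   # DMA (same device)
print(pick_channel(ModuleLocation("host-0", 0, "gpu:0"),
                   ModuleLocation("host-1", 0, "gpu:0")))   # RDMA (different devices)
```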
In the parameter updating method provided in some embodiments, the aggregation at any level may include an efficient AllReduce based on a matrix topology. Specifically, at that level, the nodes (computing modules) participating in the level's aggregation form a matrix topology, and a reduce-scatter operation in the horizontal direction, an all-reduce operation in the vertical direction, and an all-gather operation in the horizontal direction are executed in sequence on that topology, thereby completing the aggregation of the level.
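These three phases map naturally onto standard collective primitives over row and column process groups. The sketch below uses `torch.distributed` purely as an illustration (it is not code from the patent); it assumes `init_process_group` has already been called with a backend that supports `reduce_scatter` (e.g. NCCL), that the world size equals `rows * cols`, and that the tensor length is divisible by `cols`.

```python
import torch
import torch.distributed as dist

def matrix_allreduce(tensor: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    rank = dist.get_rank()
    r, c = divmod(rank, cols)          # position of this rank in the rows x cols matrix

    # every process must create every group, in the same order
    row_groups = [dist.new_group([i * cols + j for j in range(cols)]) for i in range(rows)]
    col_groups = [dist.new_group([i * cols + j for i in range(rows)]) for j in range(cols)]

    chunks = list(tensor.chunk(cols))                  # one parameter subset per column
    reduced = torch.empty_like(chunks[c])

    # 1) horizontal reduce-scatter: this rank ends up with the row-wide sum of subset c
    dist.reduce_scatter(reduced, chunks, op=dist.ReduceOp.SUM, group=row_groups[r])

    # 2) vertical all-reduce: subset c is now summed over the whole matrix
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=col_groups[c])

    # 3) horizontal all-gather: every rank reassembles the full aggregated tensor
    gathered = [torch.empty_like(reduced) for _ in range(cols)]
    dist.all_gather(gathered, reduced, group=row_groups[r])
    return torch.cat(gathered)
```

Every rank finishes the call holding the identical, fully reduced tensor, which is the consistent state the level-wise aggregation requires.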
In another aspect, an embodiment of the invention provides a distributed training platform system for efficient parameter aggregation among computing modules in a large-scale training cluster scenario.
With reference to the first aspect, the distributed training platform system includes:
a plurality of training clusters;
the training cluster comprises a plurality of computing modules;
in the distributed training process, the computing modules execute the parameter updating method in the first aspect to update the parameters among the computing modules of the training cluster.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium.
With reference to the first and second aspects, the computer-readable storage medium stores program code for updating the parameters of a distributed training neural network model; the program code includes instructions for performing operations that implement the parameter updating method according to the first aspect.
In the parameter updating method for a large-scale distributed training cluster, the platform system, and the computer-readable storage medium storing the related program code provided by the above embodiments, the logical topology formed by the computing modules is determined according to their physical topology and the differences between the communication channels of different computing modules; it comprises a plurality of levels and is aggregated in stages from the lowest level to the highest level. After aggregation completes at the highest level, the parameters are synchronized back down from the highest level to the lower levels, level by level, so that efficient parameter aggregation among the computing modules is achieved.
The technical solution of the present invention is further described with reference to the accompanying drawings and specific embodiments.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings needed for describing some embodiments of the present invention or the prior art are briefly introduced below.
Fig. 1 is a schematic diagram of the prior-art communication model for Parameter Server (PS)-based parameter updating;
Fig. 2 is a schematic diagram of the prior-art Ring-AllReduce communication model based on a ring topology;
Fig. 3 is a schematic diagram of parameter update communication in a large-scale distributed training cluster according to some embodiments of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention is clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of a portion of the invention and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Generally, a deep neural network model is trained in a data-parallel mode: multiple copies of the model are created, and samples are fed into the copies for distributed training. During training, the gradients produced under the different model copies for the same batch of samples must be aggregated to complete an iteration of the batch process. Mainstream deep learning frameworks such as TensorFlow and PyTorch generally support, and recommend, using multiple CPUs/GPUs (i.e., computing modules) to accelerate training.
As computer technology develops, the degree of integration keeps increasing. A typical training machine is configured with multiple CPUs/GPUs, which are connected to the machine's network devices (such as network cards) through PCIe buses, NVLink (for communication between GPUs), or buses of other standards; multiple machines are connected through switches, and when there are many machines they are distributed across multiple racks, with the machines in one rack communicating with those in other racks through switches or even higher-level switches. In addition, to avoid the bus bandwidth bottlenecks and memory access conflicts that arise as the number of CPU cores grows, some machines themselves contain several NUMA nodes; each NUMA node generally includes one CPU and the memory belonging to that node, and sometimes several GPUs. When large-scale distributed training is performed on a large cluster built from such devices, switches and racks, performing the parameter update (such as gradient aggregation) of every iteration across so many computing modules in the data-parallel mode is clearly a great challenge.
As shown in Fig. 1, in the prior-art centralized scheme the collection, averaging and distribution of gradients is undertaken exclusively by the Parameter Server, whose access bandwidth inevitably becomes a bottleneck. The other prior-art scheme, the decentralized Ring-AllReduce scheme in which every computing module participates in a ring topology, is shown in Fig. 2. Although each computing module successively passes its gradient segments to its adjacent computing module and collects the segments passed to it, which overcomes the communication bottleneck of the parameter server, in a very large training cluster the number of nodes in the ring inevitably grows, and the per-node processing time and the inter-node communication time accumulate, so data communication inevitably suffers a very large latency overhead.
Therefore, in view of the above problems, the present invention provides a parameter updating method for a large-scale distributed training cluster, a platform system, and a computer-readable storage medium storing the related program code.
The following are some preferred embodiments of the invention.
Some of these preferred embodiments provide a method of parameter updating for a large-scale distributed training cluster. The method comprises the following steps:
determining a logical topology formed by the computing modules according to the physical topology of the computing modules and the differences between the communication channels of different computing modules, wherein the logical topology comprises a plurality of levels and is aggregated in stages from the lowest level to the highest level; after aggregation completes at the highest level, synchronizing the updated parameters back down from the highest level to the lower levels, level by level;
in other words, taking into account the physical topology, the communication channels and the like, targeting the communication bottlenecks and the efficiency of parameter updating (specifically, aggregation), all computing modules are divided into a plurality of levels, and in each level a head node computing module is selected as the aggregation destination node of that level and as the node computing module participating in the aggregation of the level above; aggregation proceeds level by level from the lowest level to the highest, and after aggregation completes at the highest level, the parameters are synchronized back down from the highest level to the lower levels, level by level.
In some of the foregoing preferred embodiments of the parameter updating method for a large-scale distributed training cluster, the highest-level aggregation may determine the respective head node devices of the different racks and then perform parameter aggregation between the head node computing modules within them.
In some of the foregoing preferred embodiments of the parameter updating method, the hierarchical aggregation may include parameter aggregation of the head node computing modules of different devices, in particular of the devices connected to the same switch.
In some of the foregoing preferred embodiments of the parameter updating method, the hierarchical aggregation may further include parameter aggregation of the head node computing modules of different NUMA nodes, in particular of the NUMA nodes within the same device.
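To make the construction of the levels concrete, the following sketch (the descriptor fields and the `build_levels` helper are hypothetical and not taken from the patent) groups computing modules first by NUMA node, then by device, then by rack, and treats the first member of each group as the head node that participates in the level above:

```python
from collections import defaultdict

def build_levels(modules):
    """modules: list of dicts with 'id', 'rack', 'device' and 'numa' keys.
    Returns a list of levels (lowest first); each level is a list of groups,
    and the first module of every group serves as its head node -- the only
    member carried up into the level above."""
    levels, members = [], modules
    for key in (
        lambda m: (m["rack"], m["device"], m["numa"]),  # within one NUMA node
        lambda m: (m["rack"], m["device"]),             # between NUMA nodes of one device
        lambda m: (m["rack"],),                         # between devices of one rack
        lambda m: (),                                   # between racks (highest level)
    ):
        buckets = defaultdict(list)
        for m in members:
            buckets[key(m)].append(m)
        groups = list(buckets.values())
        levels.append(groups)
        members = [g[0] for g in groups]                # head nodes move up one level
    return levels

# 16 GPUs spread over 2 racks, 4 devices and 8 NUMA nodes
mods = [{"id": f"gpu{i}", "rack": i // 8, "device": i // 4, "numa": i // 2} for i in range(16)]
for k, level in enumerate(build_levels(mods)):
    print("level", k, [[m["id"] for m in g] for g in level])
```

The nested result has the same structure as the `levels` list consumed by the staged aggregation sketch given earlier.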
In some of the foregoing preferred embodiments of the parameter updating method, each computing module performs its computation on a GPU, a CPU, or the like; during training, communication between computing modules is based on DMA if they are located in the same device, and on RDMA if they are located in different devices.
In some of the foregoing preferred embodiments of the parameter updating method, the aggregation process at any level may use an AllReduce based on a matrix topology to achieve aggregation among the nodes of that level. Specifically, at the corresponding level, the nodes participating in the level's aggregation form a matrix topology, and a reduce-scatter operation in the horizontal direction, an all-reduce operation in the vertical direction and an all-gather operation in the horizontal direction are executed in sequence, finally bringing the parameters of all nodes to a consistent state. The matrix-topology AllReduce is illustrated by the following example, which comprises:
1) Construct a matrix topology from the nodes (i.e., the computing modules participating in the aggregation of this level).
2) Perform the reduce-scatter operation in the horizontal direction:
In the horizontal direction of the matrix, the parameter set on each node of every row is divided into n parameter subsets, where n is the number of nodes in a row, i.e., the number of columns of the matrix; the nodes of each row form a logical ring connected end to end. In the first iteration, each node of a row sends the parameter subset corresponding to its own column number to its downstream neighbour, receives the subset sent by its upstream neighbour, and merges it with its own copy of that subset. In each subsequent iteration, each node sends the subset merged in the previous iteration to its downstream neighbour, receives the merged subset sent by its upstream neighbour, and again merges it with its own copy. After n-1 iterations, each node of a row holds, for the parameter subset whose number matches its downstream adjacent column, the data of that subset merged from every node of the row, i.e., the row-merged parameter subset corresponding to its downstream adjacent column number.
3) Perform the all-reduce operation on the row-merged parameter subsets of each column in the vertical direction:
In the vertical direction of the matrix, the row-merged parameter subsets corresponding to the downstream adjacent column numbers on the nodes of each column undergo an all-reduce operation. After it completes, every node of a column holds the data of that subset from every node of the whole matrix, i.e., the matrix-merged parameter subset corresponding to its downstream adjacent column number.
4) Perform the all-gather operation in the horizontal direction:
After every node of the matrix has obtained, through the vertical all-reduce, the matrix-merged parameter subset corresponding to its downstream adjacent column number, each node successively passes and copies this subset to its neighbours along the same direction in the horizontal ring. After n-1 iterations, the parameter sets on all nodes of the matrix reach a completely consistent state, as shown in the simulation sketched below.
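A single-process NumPy simulation of these four steps may make the data movement easier to follow (the array shapes, variable names and use of NumPy are illustrative assumptions, not part of the patent); after the three communication phases, every node of the matrix holds the same fully aggregated parameter set:

```python
import numpy as np

rows, cols = 3, 4                  # the matrix topology: rows x cols nodes
rng = np.random.default_rng(0)

# params[i][j] is the parameter set of the node in row i, column j,
# already split into `cols` parameter subsets (one per column number)
params = [[list(rng.normal(size=(cols, 5))) for j in range(cols)] for i in range(rows)]
expected = np.sum([np.concatenate(params[i][j]) for i in range(rows) for j in range(cols)], axis=0)

# step 2: horizontal reduce-scatter, cols - 1 ring iterations inside every row
for i in range(rows):
    for step in range(cols - 1):
        for j in range(cols):
            k = (j - step) % cols          # subset sent downstream by node (i, j) at this iteration
            params[i][(j + 1) % cols][k] = params[i][(j + 1) % cols][k] + params[i][j][k]
# now node (i, j) holds the row-merged subset numbered (j + 1) % cols

# step 3: vertical all-reduce of the row-merged subsets of each column
for j in range(cols):
    k = (j + 1) % cols
    col_sum = np.sum([params[i][j][k] for i in range(rows)], axis=0)
    for i in range(rows):
        params[i][j][k] = col_sum          # matrix-merged subset, identical down the column

# step 4: horizontal all-gather, cols - 1 ring iterations copy the merged subsets around each row
for i in range(rows):
    for step in range(cols - 1):
        for j in range(cols):
            k = (j + 1 - step) % cols      # fully merged subset passed downstream by node (i, j)
            params[i][(j + 1) % cols][k] = params[i][j][k]

# every node now holds the same fully aggregated parameter set
for i in range(rows):
    for j in range(cols):
        assert np.allclose(np.concatenate(params[i][j]), expected)
print("all", rows * cols, "nodes hold identical aggregated parameters")
```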
Further, in some of the foregoing preferred embodiments of the parameter updating method, at every level except the highest one the all-gather operation in the horizontal direction may be omitted. Instead, the other nodes in the row of the head node (computing module) copy the matrix-merged parameter subsets corresponding to their downstream adjacent column numbers to the head node, which merges them directly; that is, the head node of this level directly performs the merging of the parameter subsets (indexed by the column numbers of the nodes in the row).
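Continuing the simulation above, this variant might look as follows (again purely illustrative; the head node is assumed to sit at row 0, column 0): the fourth step is simply not executed, and the head node instead collects the matrix-merged subsets held by the other nodes of its row, giving it the complete aggregated parameter set to carry into the next level.

```python
# Variant for levels below the top: step 4 (the horizontal all-gather) is skipped.
# The head node -- assumed here, purely for illustration, to sit at row 0,
# column 0 -- collects the matrix-merged subsets held by the other nodes of its
# row and concatenates them into the full aggregated parameter set.
head_row = 0
full = [None] * cols
for j in range(cols):
    k = (j + 1) % cols                 # subset held in matrix-merged form by column j
    full[k] = params[head_row][j][k]
head_params = np.concatenate(full)
assert np.allclose(head_params, expected)
```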
With reference to the drawings, the following embodiment further illustrates the parameter updating method for a large-scale distributed training cluster of the present invention. As shown in Fig. 3, according to the physical topology and other factors, and to take efficiency into account (blindly dividing into too many levels adds inter-level aggregation and inter-level dependencies during the reverse synchronization, weakening the acceleration effect), and in combination with the characteristics of the matrix-topology AllReduce, the logical topology of the computing modules is organized into an upper level and a lower level of appropriate scale. A lower-level head node computing module is selected in the lower level to take part in the upper-level parameter aggregation. Once parameter aggregation starts, aggregation proceeds in stages from the lower level to the upper level, with both levels using the matrix-topology AllReduce; after the upper level finishes aggregating, the parameters are synchronized back down through the lower-level head node computing modules.
Still other embodiments of the present invention provide a distributed training platform system for efficient parameter aggregation among computing modules in a large-scale training cluster scenario. The system comprises:
a plurality of training clusters;
the training clusters respectively comprise a plurality of computing modules;
in the distributed training process, the computing modules execute the parameter updating method described in any of the embodiments to update the parameters among the computing modules of the training cluster.
Further, still other embodiments of the present invention provide a computer-readable storage medium storing program code for parameter updating of a distributed training neural network model; the program code includes instructions for performing the operations of the parameter updating method described in any of the embodiments above.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto.

Claims (10)

1. A parameter updating method for a large-scale distributed training cluster, characterized in that:
a logical topology formed by the computing modules is determined according to the physical topology of the computing modules and the differences between the communication channels of different computing modules, wherein the logical topology comprises a plurality of levels and is aggregated in stages from the lowest level to the highest level;
and after aggregation completes at the highest level, the updated parameters are synchronized back down from the highest level to the lower levels, level by level.
2. The parameter updating method according to claim 1,
the highest level aggregation comprises parameter aggregation among head node devices among different racks.
3. The parameter updating method according to claim 1,
wherein the hierarchical aggregation comprises parameter aggregation among different devices.
4. The parameter updating method according to claim 1,
wherein the hierarchical aggregation comprises parameter aggregation among different NUMA nodes.
5. The parameter updating method according to claim 1,
wherein each computing module performs its computation on a GPU, a CPU, or the like;
and communication between the computing modules is
based on DMA if they are located in the same device;
and based on RDMA if they are located in different devices.
6. The parameter updating method according to claim 1,
wherein the parameter aggregation at any level may comprise an AllReduce based on a matrix topology;
the method comprises constructing a matrix topology and sequentially performing a reduce-scatter operation in the horizontal direction, an all-reduce operation in the vertical direction and an all-gather operation in the horizontal direction.
7. The parameter updating method according to claim 6,
wherein, according to the characteristics of the matrix-topology AllReduce, the physical topology of the computing modules and the differences between the communication channels of different computing modules, the logical topology of the computing modules is divided into an upper level and a lower level of appropriate scale for parameter updating, and the parameter aggregation of both the upper level and the lower level is based on the matrix-topology AllReduce.
8. The parameter updating method according to claim 6,
wherein, in the parameter aggregation of every level except the highest-level parameter aggregation, the all-gather operation in the horizontal direction is omitted and replaced by: merging the matrix-merged parameter subsets on the other nodes of its row directly on the head node.
9. A distributed training platform system, comprising:
a plurality of training clusters;
the training clusters respectively comprise a plurality of computing modules;
in the distributed training process, each computing module executes the parameter updating method of any one of claims 1 to 8.
10. A computer-readable storage medium, comprising:
program code for neural network parameter updates is stored;
the program code comprising instructions for performing operations implementing the parameter updating method of any of claims 1-8.
CN202010109266.5A 2020-02-21 2020-02-21 Parameter updating method and platform system for large-scale distributed training cluster Pending CN113297127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010109266.5A CN113297127A (en) 2020-02-21 2020-02-21 Parameter updating method and platform system for large-scale distributed training cluster

Publications (1)

Publication Number Publication Date
CN113297127A true CN113297127A (en) 2021-08-24

Family

ID=77317522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010109266.5A Pending CN113297127A (en) 2020-02-21 2020-02-21 Parameter updating method and platform system for large-scale distributed training cluster

Country Status (1)

Country Link
CN (1) CN113297127A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
CN107092755A (en) * 2017-04-26 2017-08-25 上海工程技术大学 A kind of parameter for product design relies on the update method of model
CN108024156A (en) * 2017-12-14 2018-05-11 四川大学 A kind of part reliable video transmission method based on hidden Markov model
CN109032671A (en) * 2018-06-25 2018-12-18 电子科技大学 A kind of distributed deep learning method and system based on data parallel strategy

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277454A (en) * 2022-07-28 2022-11-01 中国人民解放军国防科技大学 Aggregation communication method for distributed deep learning training
CN115277454B (en) * 2022-07-28 2023-10-24 中国人民解放军国防科技大学 Aggregation communication method for distributed deep learning training
CN115378818A (en) * 2022-10-26 2022-11-22 西南民族大学 Novel topology design method suitable for large-scale distributed machine learning
CN115378818B (en) * 2022-10-26 2023-02-24 西南民族大学 Novel topology design method suitable for large-scale distributed machine learning
CN117155929A (en) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Communication method, system, electronic equipment and readable storage medium of distributed cluster
CN117155929B (en) * 2023-10-31 2024-02-09 浪潮电子信息产业股份有限公司 Communication method, system, electronic equipment and readable storage medium of distributed cluster

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination