CN111210020A - Method and system for accelerating distributed machine learning - Google Patents

Method and system for accelerating distributed machine learning

Info

Publication number
CN111210020A
Authority
CN
China
Prior art keywords
machine learning
priority
parameters
distributed machine
connections
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911155664.4A
Other languages
Chinese (zh)
Other versions
CN111210020B (en)
Inventor
李丹 (Li Dan)
王帅 (Wang Shuai)
耿金坤 (Geng Jinkun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911155664.4A priority Critical patent/CN111210020B/en
Publication of CN111210020A publication Critical patent/CN111210020A/en
Application granted granted Critical
Publication of CN111210020B publication Critical patent/CN111210020B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2425 Traffic characterised by specific attributes, e.g. priority or QoS for supporting services specification, e.g. SLA
    • H04L47/2433 Allocation of priorities to traffic types
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H04L47/6215 Individual queue per QOS, rate or priority
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H04L47/625 Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L47/6275 Queue scheduling characterised by scheduling criteria for service slots or service orders based on priority
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/141 Setup of application sessions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/484 Precedence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5021 Priority

Abstract

Embodiments of the invention provide a method and a system for accelerating distributed machine learning. The method comprises: inputting a machine learning model into a distributed machine learning system, establishing a plurality of connections between any two nodes in the distributed machine learning system, and assigning different priorities to the connections; and distributing the parameters of the machine learning model across the plurality of connections for transmission, so that urgent parameters are transmitted as early as possible over high-priority connections, thereby accelerating distributed machine learning training. The embodiments of the invention take into account the overlap of communication and computation during forward computation: by transmitting urgent parameters first, they reduce the randomness of the parameter transmission order under existing machine learning frameworks, overlap communication with computation in the forward pass to hide communication overhead, and coordinate the communication of different nodes through network-level flow scheduling, realizing distributed communication scheduling and further accelerating the training process.

Description

Method and system for accelerating distributed machine learning
Technical Field
The invention relates to the technical field of computers, in particular to a method and a system for accelerating distributed machine learning.
Background
As the amount of data grows, distributed training based on data parallelism has become a widely adopted approach to accelerating machine learning in industry. In data parallelism, different compute nodes hold replicas of the same model; each compute node uses different training data to compute model updates, and the model updates are then aggregated across all compute nodes. After model aggregation is completed, a new round of computation begins. As model sizes increase (for example, the BERT model has up to 300 million parameters), the communication overhead introduced by model aggregation has gradually become a major factor limiting the performance of distributed machine learning.
Machine learning computation proceeds layer by layer and can be divided into two stages, forward computation and backward propagation: forward computation is responsible for computing the model loss, and backward propagation is responsible for computing the model updates. To reduce the impact of communication on distributed machine learning performance, existing schemes generally adopt "wait-free back propagation": during backward propagation, the transmission of later layers' model updates is overlapped with the computation of earlier layers' model updates, so as to hide the overhead of transmitting model updates as much as possible.
However, existing schemes do not consider overlapping communication with computation during forward computation, and the transmission order of parameters is random, which greatly increases the iteration time and results in low machine learning efficiency.
Disclosure of Invention
In order to solve the above problem, embodiments of the present invention provide a method and a system for accelerating distributed machine learning.
In a first aspect, an embodiment of the present invention provides a method for accelerating distributed machine learning, including: inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
Preferably, the allocating the parameters of the machine learning model to a plurality of connections for transmission specifically includes:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
Preferably, the method further comprises the following steps: and taking any priority as the priority of the target parameter.
Preferably, for any two target parameters with different priorities, when the two target parameters enter the network, the target parameter with high priority is transmitted first.
Preferably, the acquiring the target parameter corresponding to the connection of any priority specifically includes:
calculating the distribution proportion of the parameters of the machine learning model among different priorities through a heuristic algorithm;
and acquiring a target parameter corresponding to the connection with any priority according to the distribution proportion.
In a second aspect, an embodiment of the present invention provides a system for accelerating distributed machine learning, including: the connection module is used for inputting a machine learning model into the distributed machine learning system, and for any two nodes in the distributed machine learning system, a plurality of connections are established between any two nodes;
and the training module is used for distributing the parameters of the machine learning model to a plurality of connections for transmission and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method for accelerating distributed machine learning according to the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for accelerating distributed machine learning according to the first aspect of the present invention.
The method and system for accelerating distributed machine learning provided by the embodiments of the invention take into account the overlap of communication and computation during forward computation. By transmitting urgent parameters first, they reduce the randomness of the parameter transmission order under existing machine learning frameworks, overlap communication with computation in the forward pass to hide communication overhead, and coordinate the communication of different nodes through network-level flow scheduling, thereby realizing distributed communication scheduling and accelerating the training process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic diagram comparing random transmission and sequential transmission of parameters;
FIG. 2 is a flow chart of a method for accelerating distributed machine learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram comparing network scheduling and end-side scheduling in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for accelerating distributed machine learning according to an embodiment of the present invention;
FIG. 5 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Machine learning computation proceeds layer by layer and can be divided into two stages, forward computation and backward propagation: forward computation is responsible for computing the model loss, and backward propagation is responsible for computing the model updates. To reduce the impact of communication on distributed machine learning performance, existing schemes generally adopt "wait-free back propagation": during backward propagation, the transmission of later layers' model updates is overlapped with the computation of earlier layers' model updates, so as to hide the overhead of transmitting model updates as much as possible.
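For illustration only, the following Python sketch, which is not taken from the patent, shows the "wait-free back propagation" idea: a background sender pushes each layer's gradient to the parameter server as soon as it is produced, while back propagation continues on the earlier layers. The functions compute_gradient and push_to_ps are hypothetical stand-ins for the real per-layer computation and network transfer.

import queue
import threading
import time

def compute_gradient(layer):
    time.sleep(0.1)                  # stand-in for back-propagating one layer
    return f"grad_of_layer{layer}"

def push_to_ps(grad):
    time.sleep(0.1)                  # stand-in for the network transfer to the PS
    print(f"pushed {grad}")

def sender(q):
    while True:
        grad = q.get()
        if grad is None:             # sentinel: all gradients have been produced
            break
        push_to_ps(grad)

send_queue = queue.Queue()
t = threading.Thread(target=sender, args=(send_queue,))
t.start()

for layer in [4, 3, 2, 1]:           # back propagation runs from the last layer to the first
    grad = compute_gradient(layer)
    send_queue.put(grad)             # hand the gradient off immediately and keep computing

send_queue.put(None)
t.join()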
However, existing schemes do not consider overlapping communication with computation during forward computation, and the parameter transmission order is random, so the training performance of distributed machine learning still has room for improvement. Based on this, the invention aims to design a distributed machine learning acceleration scheme based on flow scheduling.
Fig. 1 is a schematic diagram comparing random and sequential transmission of parameters. As shown in Fig. 1, part (a) depicts random transmission of the machine learning model parameters and part (b) depicts sequential transmission. We assume the machine learning model has 4 layers, ordered from front to back as layer1, layer2, layer3 and layer4, i.e. 1, 2, 3 and 4 in the figure. In the figure, Pull denotes pulling parameters, Push denotes pushing gradients, Forward Pass denotes forward computation, Back-Propagation (BP) denotes backward propagation, and Aggregation denotes parameter aggregation. Network denotes the network, Worker denotes a worker node, and PS denotes the parameter server.
Referring to Fig. 1 (b), at the beginning of execution the machine learning model parameters are fetched first. Once the parameters of layer1 have reached the worker from the PS, the computation of layer1 can start; meanwhile, the transmission of the layer2 parameters proceeds in parallel with the computation of layer1. When the computation of layer1 finishes, the parameters of layer2 have also arrived at the worker, so the computation of layer2 can begin, and so on, until the computation of the last layer, layer4, is completed.
Then the BP process starts. BP runs in reverse order, from layer4 to layer1. After the gradient of layer4 has been computed, it can be pushed to the PS while the gradient of layer3 is computed in parallel. When the gradient of layer4 reaches the PS, the PS can aggregate the gradients from the different workers and update the parameters. Once all gradients have been aggregated, one iteration is finished and the next iteration can start.
Referring to Fig. 1 (a), when the parameters reach the worker in the order 3, 2, 4, 1, the worker can only start the computation of layer1 after receiving the layer1 parameters, and since the computation of layer2 depends on the result of layer1, it cannot start even though the layer2 parameters have already been received. The figure likewise shows one iteration.
Comparing Fig. 1 (a) and Fig. 1 (b), it can be seen that transmitting the parameters in order greatly shortens the iteration time and thus accelerates distributed machine learning training.
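As a rough worked example, the timing sketch below, which is not from the patent and assumes equal, hypothetical per-layer transfer and compute times over a single link, reproduces the effect shown in Fig. 1: with in-order arrival the forward pass finishes at time 5, while with the arrival order 3, 2, 4, 1 it finishes at time 8.

def forward_finish_time(arrival_order, transfer, compute):
    """Return when the forward pass finishes if parameters arrive in arrival_order."""
    t_link = 0.0
    arrived = {}                     # layer -> time its parameters are fully received
    for layer in arrival_order:
        t_link += transfer[layer]
        arrived[layer] = t_link
    t_compute = 0.0
    for layer in sorted(transfer):   # the forward pass must go layer 1, 2, 3, ...
        start = max(t_compute, arrived[layer])   # wait for both the dependency and the parameters
        t_compute = start + compute[layer]
    return t_compute

transfer = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # hypothetical per-layer transfer times
compute  = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # hypothetical per-layer compute times

print("sequential arrival:", forward_finish_time([1, 2, 3, 4], transfer, compute))  # 5.0
print("random arrival    :", forward_finish_time([3, 2, 4, 1], transfer, compute))  # 8.0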
The invention designs Geryon, a distributed machine learning training acceleration scheme based on flow scheduling. It overlaps communication with computation during forward computation to hide communication overhead, and coordinates the communication of different nodes through network-wide flow scheduling, realizing distributed communication scheduling and thereby accelerating the distributed machine learning training process.
The idea of the proposed method is as follows: the forward computation of machine learning proceeds layer by layer and depends on the parameters of each layer, so a well-designed scheduling strategy can fully overlap the communication and computation of different layers to hide communication overhead, transmitting urgent parameters first and improving training speed. Specifically, distributed machine learning training is accelerated with the following three strategies:
first, "multiple connection policy": the existing distributed machine learning system only has one connection between any two computing nodes, and Geryon allocates different machine learning model parameters to different connections for transmission by establishing multiple connections between the computing nodes in the distributed machine learning system, so that queuing time delay of emergency parameters in a single sending queue at the end side is reduced.
Fig. 2 is a flowchart of a method for accelerating distributed machine learning according to an embodiment of the present invention, as shown in fig. 2, the method includes:
s1, inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and S2, distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
First, the machine learning model is input into the distributed machine learning system. Any two nodes in the distributed machine learning system refer to a worker node and a parameter server node; that is, multiple connections are established between each worker node and each parameter server node, and different parameters of the machine learning model are assigned to different connections for transmission, reducing the queuing delay of urgent parameters in a single send queue on the end side.
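A minimal Python sketch of the multiple-connection idea is given below; it is not the patent's implementation, and the parameter server address, the DSCP values and the layer-to-connection mapping are illustrative assumptions. A worker opens several TCP connections to one parameter server and marks each connection with a different DSCP value, so that priority-aware network devices can tell the connections apart.

import socket

PS_ADDR = ("10.0.0.2", 50051)   # hypothetical parameter server endpoint
DSCP_VALUES = [46, 0]           # assumed: EF for the urgent connection, best effort for the other

def open_prioritized_connections(addr, dscp_values):
    """Open one TCP connection per DSCP value, highest priority first."""
    conns = []
    for dscp in dscp_values:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # DSCP occupies the upper six bits of the IP TOS byte.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.connect(addr)
        conns.append(s)
    return conns

def connection_for_layer(layer_index, split_point, conns):
    """Front layers are urgent for the next forward pass, so they use the
    high-priority connection; later layers use the low-priority one."""
    return conns[0] if layer_index < split_point else conns[1]

# Usage (requires a reachable parameter server listening on PS_ADDR):
# conns = open_prioritized_connections(PS_ADDR, DSCP_VALUES)
# layer1_conn = connection_for_layer(0, split_point=2, conns=conns)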
Specifically, the allocating the parameters of the machine learning model to multiple connections for transmission specifically includes:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
Second, the "multi-priority" strategy: Geryon assigns different priorities to different connections, giving higher priority to the connections carrying urgent parameters, and applies strict priority scheduling both on the end side and in the switches, which guarantees that the urgent parameters carried by high-priority connections are transmitted first throughout the network.
Geryon requires both the network card (i.e., the end side) and the switches to be configured with a strict priority scheduling policy. Geryon assigns different parameters to different connections, each connection has a priority, and this priority label is carried in every packet of the connection. The end side and the switches rely on the priority label on the packets to perform strict priority scheduling of these packets.
Specifically, the method further comprises the following steps: and taking any priority as the priority of the target parameter.
It should be noted that, since every packet carries a priority label, the priority of a packet is the priority of the connection transmitting it, so the switches can distinguish parameters of different urgency coming from different nodes and thereby implement distributed communication scheduling.
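The following sketch, again not the patent's code, illustrates strict priority scheduling on the end side: the sender always serves the highest-priority queued parameters first, so an urgent parameter never waits behind bulk parameters that were enqueued earlier.

import heapq

class StrictPrioritySender:
    def __init__(self):
        self._heap = []          # entries are (priority, seq, payload); lower number = higher priority
        self._seq = 0

    def enqueue(self, priority, payload):
        heapq.heappush(self._heap, (priority, self._seq, payload))
        self._seq += 1           # the sequence number keeps FIFO order within one priority

    def next_to_send(self):
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload

sender = StrictPrioritySender()
sender.enqueue(1, "layer3 parameters")   # low priority
sender.enqueue(0, "layer1 parameters")   # high priority: urgent for the next forward pass
print(sender.next_to_send())             # -> "layer1 parameters"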
On the basis of the above embodiment, preferably, the distribution ratio of the machine learning model parameters among the different priorities is calculated by a heuristic algorithm, and the target parameters corresponding to a connection of any priority are obtained according to this distribution ratio.
Third, the "adaptive parameter allocation" strategy: Geryon supports any number of priorities, requiring only a minimum of two. For a given number of priorities, a heuristic algorithm computes how the parameters are split among the priorities, i.e., which parameters are assigned to the high-priority connection and which to the low-priority one, realizing an adaptive parameter allocation strategy and improving usability.
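As an illustration only, the sketch below shows one simple way such a split could be computed for two priorities; the 25% high-priority share and the prefix-based rule are assumptions, not the patent's heuristic. It assigns the front layers, which the next forward pass needs first, to the high-priority connection until a given fraction of the total parameter volume is used.

def split_by_priority(layer_sizes, high_priority_fraction=0.25):
    """layer_sizes: parameter volume per layer, ordered front to back."""
    budget = high_priority_fraction * sum(layer_sizes)
    assignment, used, switched = [], 0, False
    for size in layer_sizes:
        if not switched and used + size <= budget:
            assignment.append("high")   # still within the high-priority share
            used += size
        else:
            assignment.append("low")    # everything from here on is low priority
            switched = True
    return assignment

# Example: a 4-layer model whose later layers hold most of the parameters.
print(split_by_priority([10, 20, 120, 250], high_priority_fraction=0.25))
# -> ['high', 'high', 'low', 'low']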
On the basis of the above embodiment, preferably, for any two target parameters with different priorities, when the two target parameters enter the network, the target parameter with high priority is transmitted first.
Fig. 3 is a schematic diagram comparing network-level scheduling with end-side scheduling in an embodiment of the present invention. As shown in Fig. 3, PS0 denotes parameter server 0, PS1 denotes parameter server 1, network fabric denotes the network structure, network-level scheduling denotes the network scheduling policy, end-host-level scheduling denotes the end-side scheduling policy, low-priority parameter in PS0 indicates that the parameters maintained on PS0 are low priority, high-priority parameter in PS1 indicates that the parameters maintained on PS1 are high priority, flow direction denotes the direction of the flows, and physical link denotes a physical connection.
A distributed machine learning system contains multiple PSs and multiple workers; for clarity of illustration, only one worker and two PSs are shown. Two schemes are then compared: controlling the parameter transmission order through flow scheduling, and controlling the parameter sending order at the application layer. Network scheduling is shown on the left and end-side scheduling on the right.
Assuming the low-priority parameters are located on PS0 and the high-priority parameters on PS1, once both sets of parameters have entered the network, end-side scheduling can no longer control their transmission order, so the two compete for network resources and the transmission takes longer.
With network scheduling, it can still be guaranteed that the high-priority parameters are transmitted first: the low-priority parameters do not take up the bandwidth and are transmitted only after the high-priority parameters have been sent. Network scheduling can therefore guarantee that urgent parameters are transmitted first across the whole network.
Therefore, in the present application, when two target parameters of different priorities enter the network, the high-priority parameter is transmitted first, and the low-priority parameter is transmitted only after the high-priority parameter has been sent. In this way, network scheduling guarantees that urgent parameters are transmitted first across the whole network.
The embodiment of the invention implements the connection-based flow scheduling mechanism on top of TensorFlow's RDMA communication, using two priority queues. Common benchmark models such as AlexNet, VGG-19, ResNet-50 and YOLO are measured on a distributed machine learning cluster of 8 servers, each with a K40c GPU, and compared against standard distributed TensorFlow training and distributed TensorFlow training improved with load balancing. All schemes use the same hyperparameter settings, and model convergence is not affected.
The comparison metric is the training throughput of the different schemes, i.e. the number of images trained per unit time.
Experimental results show that, in a 10G network environment, Geryon improves training throughput by up to 4.37 times compared with standard distributed TensorFlow training, and by 1.2 times compared with distributed TensorFlow training with the load-balancing improvement.
Further, comparing against an end-side communication scheduling scheme, the performance advantage of Geryon becomes more pronounced as the number of parameter slices increases. When the number of slices of the VGG-19 model reaches 680, the training throughput of Geryon is 1.25 times that of the end-side communication scheduling scheme.
Fig. 4 is a schematic structural diagram of a system for accelerating distributed machine learning according to an embodiment of the present invention, as shown in fig. 4, the system includes a connection module 401 and a training module 402, where:
the connection module 401 is configured to input a machine learning model into a distributed machine learning system, and for any two nodes in the distributed machine learning system, establish multiple connections between the any two nodes;
the training module 402 is configured to distribute parameters of the machine learning model over multiple connections for transmission to accelerate distributed machine learning training.
First, the connection module 401 inputs the machine learning model into the distributed machine learning system and establishes multiple connections between any two nodes in the distributed machine learning system.
The training module 402 then distributes the parameters of the machine learning model over the multiple connections for transmission and gives the connections different priorities, thereby accelerating distributed machine learning training.
The system embodiment provided in the embodiments of the present invention is for implementing the above method embodiments; for the specific process and details, reference is made to the above method embodiments, which are not repeated here.
Fig. 5 is a schematic physical structure diagram of an electronic device provided in an embodiment of the present invention. As shown in Fig. 5, the electronic device may include: a processor 501, a communication interface 502, a memory 503, and a bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other via the bus 504. The communication interface 502 may be used for information transfer of the electronic device. The processor 501 may call logic instructions in the memory 503 to perform a method comprising:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-described method embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided in the foregoing embodiments, the method including, for example:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of accelerating distributed machine learning, comprising:
inputting a machine learning model into a distributed machine learning system, and establishing a plurality of connections between any two nodes in the distributed machine learning system;
and distributing the parameters of the machine learning model to a plurality of connections for transmission, and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
2. The method for accelerating distributed machine learning according to claim 1, wherein the allocating parameters of the machine learning model to a plurality of connections for transmission comprises:
setting a priority for each connection in the distributed machine learning system;
for any priority connection, acquiring a target parameter corresponding to the connection of any priority, wherein the target parameter is the machine learning model parameter required to be transmitted by the connection of any priority;
and the connection with any priority transmits the target parameters.
3. The method of accelerating distributed machine learning according to claim 1, further comprising: and taking any priority as the priority of the target parameter.
4. The method of accelerating distributed machine learning according to claim 3, wherein for any two target parameters with different priorities, the target parameter with high priority is transmitted first when the two target parameters enter the network.
5. The method for accelerating distributed machine learning according to claim 2, wherein the acquiring the target parameter corresponding to the connection of any priority specifically includes:
calculating the distribution proportion of the parameters of the machine learning model among different priorities through a heuristic algorithm;
and acquiring a target parameter corresponding to the connection with any priority according to the distribution proportion.
6. A system for accelerating distributed machine learning, comprising:
the connection module is used for inputting a machine learning model into the distributed machine learning system, and for any two nodes in the distributed machine learning system, a plurality of connections are established between any two nodes;
and the training module is used for distributing the parameters of the machine learning model to a plurality of connections for transmission and giving different priorities to the plurality of connections so as to accelerate the distributed machine learning training.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for accelerating distributed machine learning according to any one of claims 1 to 5 are implemented when the program is executed by the processor.
8. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the method of accelerating distributed machine learning according to any one of claims 1 to 5.
CN201911155664.4A 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning Active CN111210020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911155664.4A CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911155664.4A CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Publications (2)

Publication Number Publication Date
CN111210020A true CN111210020A (en) 2020-05-29
CN111210020B CN111210020B (en) 2022-12-06

Family

ID=70789167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911155664.4A Active CN111210020B (en) 2019-11-22 2019-11-22 Method and system for accelerating distributed machine learning

Country Status (1)

Country Link
CN (1) CN111210020B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112333234A (en) * 2020-09-23 2021-02-05 清华大学 Distributed machine learning training method and device, electronic equipment and storage medium
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113537507A (en) * 2020-09-02 2021-10-22 腾讯科技(深圳)有限公司 Machine learning system, method and electronic equipment
CN114764353A (en) * 2021-01-13 2022-07-19 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) full system optimization
CN116258197A (en) * 2023-05-16 2023-06-13 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
US20190019104A1 (en) * 2017-07-12 2019-01-17 Sap Se Distributed Machine Learning On Heterogeneous Data Platforms
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
US20190171952A1 (en) * 2016-10-31 2019-06-06 Tencent Technology (Shenzhen) Company Limited Distributed machine learning method and system
US20190019104A1 (en) * 2017-07-12 2019-01-17 Sap Se Distributed Machine Learning On Heterogeneous Data Platforms
CN110135575A (en) * 2017-12-29 2019-08-16 英特尔公司 Communication optimization for distributed machines study

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGSHENG LI et al.: "HPDL: Towards a General Framework for High-performance Distributed Deep Learning", 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS) *
JINKUN GENG et al.: "Accelerating Distributed Machine Learning by Smart Parameter Server", Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019 *
SAYED HADI HASHEMI et al.: "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling", Proceedings of Machine Learning and Systems 1 (MLSys 2019) *
LI LI et al.: "Spark-KNN Parallel Pattern Recognition Method for Leakage Current Data", Journal of System Simulation *
WANG YUEQING et al.: "DLPF: A Parallel Deep Learning Programming Framework Based on Heterogeneous Architecture", Journal of Computer Research and Development *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537507A (en) * 2020-09-02 2021-10-22 腾讯科技(深圳)有限公司 Machine learning system, method and electronic equipment
CN112333234A (en) * 2020-09-23 2021-02-05 清华大学 Distributed machine learning training method and device, electronic equipment and storage medium
CN114764353A (en) * 2021-01-13 2022-07-19 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) full system optimization
CN114764353B (en) * 2021-01-13 2023-09-29 戴尔产品有限公司 ML to ML orchestration system and method for Information Handling System (IHS) all system optimization
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113485805B (en) * 2021-07-01 2024-02-06 中科曙光(南京)计算技术有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN116258197A (en) * 2023-05-16 2023-06-13 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling
CN116258197B (en) * 2023-05-16 2023-09-08 之江实验室 Distributed training acceleration method and system based on parameter calculation and communication scheduling

Also Published As

Publication number Publication date
CN111210020B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN111210020B (en) Method and system for accelerating distributed machine learning
EP3399426B1 (en) Method and device for training model in distributed system
EP3380937B1 (en) Techniques for analytics-driven hybrid concurrency control in clouds
Zhang et al. Online adaptive interference-aware VNF deployment and migration for 5G network slice
CN105511954B (en) Message processing method and device
CN109697122B (en) Task processing method, device and computer storage medium
US20160094413A1 (en) Network Resource Governance in Multi-Tenant Datacenters
CN110570075B (en) Power business edge calculation task allocation method and device
CN112346833B (en) Task processing method and processor for privacy computation and heterogeneous processing system
CN112333234B (en) Distributed machine learning training method and device, electronic equipment and storage medium
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN112148454A (en) Edge computing method supporting serial and parallel and electronic equipment
CN110780985A (en) Parallel task scheduling method and device with limited time
CN107332814A (en) A kind of request message transmission method and device
CN108028806A (en) The method and apparatus that virtual resource is distributed in network function virtualization NFV networks
CN113157465B (en) Message sending method and device based on pointer linked list
WO2022062648A1 (en) Automatic driving simulation task scheduling method and apparatus, device, and readable medium
Dong et al. TINA: A fair inter-datacenter transmission mechanism with deadline guarantee
CN113849295A (en) Model training method and device and computer readable storage medium
CN115665258B (en) Priority perception deployment method of multi-target service function chain based on deep reinforcement learning
CN114780228B (en) Hybrid cloud resource creation method and system
CN110971451A (en) NFV resource allocation method
CN114461369A (en) Adaptive data scheduling system and method for complex application scene
CN107025099B (en) Asynchronous graph calculation implementation method and system based on double-queue model
Zhang et al. RICH: Strategy-proof and efficient coflow scheduling in non-cooperative environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant