CN115660034A - Distributed model training method, device and system - Google Patents

Distributed model training method, device and system

Info

Publication number
CN115660034A
CN115660034A
Authority
CN
China
Prior art keywords
network
gradient
parameter
server
output result
Prior art date
Legal status
Granted
Application number
CN202211332354.7A
Other languages
Chinese (zh)
Other versions
CN115660034B (en)
Inventor
沈亮
于佃海
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211332354.7A priority Critical patent/CN115660034B/en
Publication of CN115660034A publication Critical patent/CN115660034A/en
Application granted granted Critical
Publication of CN115660034B publication Critical patent/CN115660034B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The disclosure provides a method, an apparatus and a system for distributed model training, and relates to the field of artificial intelligence, in particular to the field of deep learning. The specific implementation scheme is as follows: inputting sample data into a first network to execute forward calculation to obtain an intermediate result; sending the intermediate result to a server; in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result; calculating the gradient of the output result according to the loss value; sending the gradient of the output result to the server; in response to receiving the parameter gradient of a second network sent by the server, executing the reverse calculation of the first network to obtain the parameter gradient of the first network; and updating the parameters of the first network based on the parameter gradient of the first network. This implementation can reduce the number of cross-node communications and improve overall training performance.

Description

Distributed model training method, device and system
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the field of deep learning, and specifically relates to a method, a device and a system for distributed model training.
Background
In deep learning model training in recent years, the trend of using more training data and larger models has not changed. Larger models and data volumes mean more computation and storage requirements, and also longer training times. How to distribute the computation and storage requirements to multiple training devices to increase training speed is a key issue.
Data parallelism is a parallel strategy for addressing this problem: the training task is split over multiple processes (devices), and each process maintains the same model parameters and the same computational task but processes different data (its own batch). In this way, the data and computation under one global batch are split across different processes, relieving the computation and storage pressure on any single device.
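As a minimal illustration of this splitting (the function and variable names below are illustrative and not part of the present disclosure), a global batch can be sharded by process rank:

```python
# Minimal sketch of data parallelism: each process (rank) takes its own shard of
# the global batch, while model parameters are replicated on every process.
import numpy as np

def shard_global_batch(global_batch: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Return the slice of the global batch owned by the given process."""
    per_rank = len(global_batch) // world_size
    start = rank * per_rank
    return global_batch[start:start + per_rank]

# Example: a global batch of 8 samples split across 2 processes.
global_batch = np.arange(8, dtype=np.float32).reshape(8, 1)
print(shard_global_batch(global_batch, rank=0, world_size=2))  # samples 0-3
print(shard_global_batch(global_batch, rank=1, world_size=2))  # samples 4-7
```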
Distributed model training with a Mixture-of-Experts (MoE) model is one of the technical paths to very-large-scale model training. The idea of this model is to train multiple neural networks (distributed among multiple compute nodes), each of which trains on a different portion of the data set. The input data of different devices pass through their respective routing networks (for example, Gate networks) and different expert nodes are selected, so communication (namely, All-to-All communication) is required between different devices to distribute the data. When the number of devices is too large, this communication crosses nodes and the overall computing performance is reduced.
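To make the routing step concrete, the sketch below shows a top-1 gate that scores the experts for each token and routes the token to the highest-scoring one; the class name, the hidden size and the auxiliary balance loss are illustrative assumptions, not the gate defined in this disclosure.

```python
# Illustrative top-1 gating for an MoE layer: score experts per token, pick the
# best one, and compute a simple auxiliary loss that discourages collapsed routing.
import torch
import torch.nn as nn

class Top1Gate(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_experts)

    def forward(self, h: torch.Tensor):
        probs = torch.softmax(self.score(h), dim=-1)   # [tokens, num_experts]
        expert_id = probs.argmax(dim=-1)               # chosen expert per token
        # Simplified stand-in for a load-balancing (gate) loss.
        gate_loss = (probs.mean(dim=0) ** 2).sum() * probs.shape[-1]
        return expert_id, gate_loss

gate = Top1Gate(hidden_dim=16, num_experts=4)
h = torch.randn(8, 16)                                 # 8 tokens from the backbone
expert_id, gate_loss = gate(h)
print(expert_id.tolist(), gate_loss.item())
```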
Disclosure of Invention
The present disclosure provides a method, apparatus, system, device, storage medium, and computer program product for distributed model training.
According to a first aspect of the present disclosure, a method for training a distributed model is provided, which is applied to a client, and includes: inputting sample data into a first network to execute forward calculation to obtain an intermediate result; sending the intermediate result to a server; in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result; calculating the gradient of the output result according to the loss value; sending the gradient of the output result to the server; in response to receiving a parameter gradient of a second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing reverse calculation based on the gradient of the output result; updating the parameter of the first network based on the parameter gradient of the first network.
According to a second aspect of the present disclosure, there is provided a method for distributed model training, applied to a server, including: responding to the received intermediate result sent by the client, inputting the intermediate result into a second network for forward calculation to obtain an output result; sending the output result to the client; in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network; sending the parameter gradient of the second network to the client; updating the parameters of the second network based on the parameter gradient of the second network.
According to a third aspect of the present disclosure, there is provided a model training system comprising: at least one client configured to perform the method of the first aspect; a server configured to perform the method of the second aspect.
According to a fourth aspect of the present disclosure, there is provided an apparatus for distributed model training, applied to a client, including: the first forward computing unit is configured to input sample data into a first network to execute forward computing to obtain an intermediate result; a first sending unit configured to send the intermediate result to a server; a loss value calculation unit configured to calculate a loss value based on a tag corresponding to the sample data and an output result sent by the server in response to receiving the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result; a first gradient calculation unit configured to calculate a gradient of the output result from the loss value; a second transmitting unit configured to transmit the gradient of the output result to the server; a first reverse calculation unit configured to perform reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result; a first updating unit configured to update a parameter of the first network based on a parameter gradient of the first network.
According to a fifth aspect of the present disclosure, there is provided an apparatus for distributed model training, applied to a server, including: the second forward computing unit is configured to respond to the received intermediate result sent by the client and input the intermediate result into a second network for forward computing to obtain an output result; a third sending unit configured to send the output result to the client; a second reverse calculation unit configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network; a fourth sending unit configured to send the parameter gradient of the second network to the client; a second updating unit configured to update a parameter of the second network based on a parameter gradient of the second network.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to an eighth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
According to the distributed model training method, apparatus and system provided by the embodiments of the present disclosure, the parameters of the second network of the model (for example, the Gate networks and the Expert networks) are uniformly placed on the same server (server node), the other parameters are placed on the respective computing devices (client nodes), the computation of the first network is performed on each client, and the computation of the second network is performed on the server. Thus, communication and computation on different nodes can overlap, all MoE routing communication stays within a single machine, overall communication efficiency is improved, and computing and storage resources are fully utilized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIGS. 1a-1c are schematic diagrams of exemplary system architectures to which embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of distributed model training according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a method of distributed model training according to the present disclosure;
FIGS. 4a-4c are schematic diagrams of an application scenario of a method of distributed model training according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for distributed model training according to the present disclosure;
FIG. 6 is a schematic structural diagram of yet another embodiment of an apparatus for distributed model training according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1a illustrates an exemplary system architecture 100 to which embodiments of the distributed model training method or distributed model training apparatus of the present disclosure may be applied.
As shown in fig. 1a, the system architecture 100 may include a Server and a plurality of clients (Workers, i.e., compute nodes).
The system architecture 100 is a programming architecture commonly employed in the field of distributed training, and mainly solves the following two problems:
1. The model parameters are too large: single-machine memory is insufficient, so distributed storage is needed.
2. The training data is too large: single-machine training is too slow, so more training nodes are needed to increase the concurrent training speed.
As shown in fig. 1a, the system architecture 100 mainly includes two parts, Server and Worker, where the Server is responsible for storing and updating parameters and the Worker is responsible for training. In brief, the basic idea of model training based on this architecture is as follows: when there is too much training data and a single Worker is too slow, multiple Workers can be introduced to train simultaneously, and the model parameters then need to be synchronized. The intuitive idea is to introduce a Server that acts as a medium for parameter exchange between Workers. When the model parameters are so large that single-machine storage is insufficient, or there are so many Workers that one Server becomes a bottleneck, multiple Servers need to be introduced.
The specific process of model training is as follows:
1. The training data (sample set) is evenly distributed to the different Workers.
2. The model parameters are sharded and stored on different Servers.
3. Worker side: read a minibatch of training data, pull the latest parameters from the Server side, compute the gradients, and upload them to the corresponding Servers according to the shards.
4. Server side: receive the gradients uploaded by the Workers and update the parameters according to the optimization algorithm. Depending on whether the Server waits for the gradients of all Workers before each parameter update, training is divided into two mechanisms: synchronous training and asynchronous training. (An illustrative single-process sketch of this Worker/Server loop is given below.)
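The following sketch condenses the loop above into a single process for illustration; the Server and Worker classes, the least-squares objective and the learning rate are assumptions made only to keep the example self-contained, and a real deployment would exchange parameters and gradients over the network.

```python
# Single-process sketch of the parameter-server loop: the Server stores and
# updates one parameter shard, the Workers compute gradients on their minibatches.
import numpy as np

class Server:
    def __init__(self, dim: int, lr: float = 0.1):
        self.params = np.zeros(dim)          # one parameter shard
        self.lr = lr

    def pull(self) -> np.ndarray:
        return self.params.copy()

    def push(self, grad: np.ndarray) -> None:
        self.params -= self.lr * grad        # update with the received gradient

class Worker:
    def __init__(self, data: np.ndarray, labels: np.ndarray):
        self.data, self.labels = data, labels

    def compute_grad(self, params: np.ndarray) -> np.ndarray:
        # Gradient of a least-squares loss 0.5 * ||X w - y||^2 on this Worker's shard.
        residual = self.data @ params - self.labels
        return self.data.T @ residual / len(self.labels)

rng = np.random.default_rng(0)
server = Server(dim=3)
workers = [Worker(rng.normal(size=(16, 3)), rng.normal(size=16)) for _ in range(2)]
for step in range(5):                        # synchronous training
    grads = [w.compute_grad(server.pull()) for w in workers]
    server.push(np.mean(grads, axis=0))
print(server.params)
```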
The MoE model may be trained using the system architecture 100. In the MoE model, data is first processed by a Backbone network (generally several fully-connected layers) to obtain an intermediate result; after selection by a top-k Gate (gating network), k Expert networks are selected for each token (character) in the result. As shown in FIG. 1b, a top-1 Gate is used, so 1 Expert is chosen for each token. After the data H_0 is routed, Expert_0 is selected, and H_0 is then processed by that expert network to obtain an output m. Meanwhile, the Top-1 Gate produces Gate_loss, which measures the quality of this gate selection: the smaller the Gate_loss, the more reasonable and uniform the Expert routing.
The MoE + data parallel approach (MoE + DP) differs from traditional data parallelism in that the MoE model parameters are divided into two types: the Dense parameters of the Backbone layers and the Sparse parameters of the Expert layers. Under data parallelism, the Dense parameters are the same on each card (computing device, such as a GPU) and are updated exactly as in ordinary data parallelism: after the backward phase finishes, each card synchronizes the gradients of these parameters. For the Sparse parameters under MoE + DP, each card initializes different values. As shown in fig. 1c, there are 3 Experts per card and 6 Experts in total across 2 cards; the top-1 Gate selects, out of the 6 Experts, the 1 Expert with the highest score for routing, and calculation then proceeds as in single-card MoE. If the selected Expert is on another card, send/recv must be called to implement cross-card communication. As shown in fig. 1c, after passing through the Top-1 Gate, the 4th Expert is selected for the output tensor H_0 of the Backbone on Rank-0 (the card number); H_0 is then sent to Rank-1, computed by Expert_4, and sent back to Rank-0 to obtain the final Output_0.
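A hedged sketch of this Dense/Sparse split is shown below; the module layout and the rule of treating any parameter whose name contains "experts" as Sparse are assumptions for illustration only.

```python
# Sketch of the Dense/Sparse split under MoE + DP: Dense (Backbone/Gate) gradients
# are synchronized across cards, while Expert parameters differ per card and are not.
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d: int, local_experts: int):
        super().__init__()
        self.backbone = nn.Linear(d, d)                         # Dense parameters
        self.gate = nn.Linear(d, local_experts)                 # Dense parameters
        self.experts = nn.ModuleList(                           # Sparse parameters
            nn.Linear(d, d) for _ in range(local_experts))

def split_parameters(model: nn.Module):
    """Return the names of Dense parameters and of Expert (Sparse) parameters."""
    dense, sparse = [], []
    for name, _ in model.named_parameters():
        (sparse if "experts" in name else dense).append(name)
    return dense, sparse

dense, sparse = split_parameters(MoELayer(d=16, local_experts=3))
print(dense)    # backbone.* and gate.* parameters
print(sparse)   # experts.0.*, experts.1.*, experts.2.* parameters
```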
It should be noted that the method for distributed model training provided by the embodiments of the present disclosure may be executed by a server and a client. Accordingly, the apparatus for distributed model training may be disposed in the server and the client. This is not particularly limited herein.
It should be understood that the number of servers and clients in FIG. 1a is merely illustrative. There may be any number of servers and clients, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of distributed model training in accordance with the present disclosure is shown. The distributed model training method is applied to a client and comprises the following steps:
step 201, inputting sample data into the first network to execute forward calculation, and obtaining an intermediate result.
In this embodiment, an executing entity of the distributed model training method (for example, a client shown in fig. 1a) may receive sample data from a database through a wired or wireless connection; the training is supervised, so the sample data carries labels. The distributed model may include two kinds of networks: a first network and a second network, where each client is arranged with one first network and a plurality of second networks are arranged in the server. The first network essentially acts as a backbone network, and the second network may be a network that typically communicates across hardware, such as a gating network (Gate network) and an expert network (Expert network). Each first network may select one of the second networks to process the intermediate result of its forward calculation, and may receive the gradient fed back by the selected second network in the reverse direction to compute the gradient of the local first network. By uniformly placing the second networks, originally arranged at different clients, in the server, cross-hardware communication can be avoided.
As shown in fig. 4a, in the prior-art network structure, the Gate network, the Expert network and the Backbone network are all located on their own nodes. In the Mixture-of-Experts model, expanding the model scale requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
As shown in fig. 4b, since training is data parallel, each client reads different data: client-0 reads in Data_0, client-1 reads in Data_1, and client-2 reads in Data_2.
Each client node executes the forward calculation of the first network to obtain an intermediate result H. As shown in FIG. 4b, client-0 obtains H0, client-1 obtains H1, and client-2 obtains H2.
Step 202, the intermediate result is sent to the server.
In this embodiment, send and recv are called to implement cross-card communication, and the intermediate result output by the client is sent to the server. The second network in the server continues the forward calculation based on the intermediate result, obtains an output result, and returns it to the client. As shown in FIG. 4b, client-0, client-1 and client-2 send H0, H1 and H2, respectively, to the server node. On the server node, H0 is routed by the Gate_0 network, Expert_5 is selected, and Output_0 is obtained by calculation; H1 is routed by the Gate_1 network, Expert_7 is selected, and Output_1 is obtained by calculation; H2 is routed by the Gate_2 network, Expert_1 is selected, and Output_2 is obtained by calculation.
The server node then sends the output results back to the original client nodes. As shown in FIG. 4b, the server node sends Output_0 back to client-0, Output_1 back to client-1, and Output_2 back to client-2.
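For illustration, the sketch below collapses this client/server split into one process: the client computes the backbone output H, and the server routes H through its gate to one of its experts and returns the output. The module shapes and names are assumptions; a real system would transfer H and the output over the network (e.g., via send/recv) rather than a function call.

```python
# Single-process sketch of the split forward pass: backbone on the client,
# gate + experts on the server.
import torch
import torch.nn as nn

class ClientBackbone(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return self.net(x)                  # intermediate result H

class ServerMoE(nn.Module):
    def __init__(self, d: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_experts))

    def forward(self, h):
        expert_id = self.gate(h).softmax(-1).argmax(-1)    # route each sample
        out = torch.stack([self.experts[int(i)](row) for i, row in zip(expert_id, h)])
        return out                                          # Output, sent back

client, server = ClientBackbone(8), ServerMoE(8, num_experts=4)
x = torch.randn(4, 8)       # Data read by this client
h = client(x)               # forward calculation on the client
output = server(h)          # forward calculation on the server, returned to the client
```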
And step 203, responding to the received output result sent by the server, and calculating a loss value based on the label corresponding to the sample data and the output result.
In this embodiment, the client receives the output result from the server and continues the forward calculation; according to the usual loss-function calculation for labeled samples, the loss value of the whole distributed model is finally obtained from the label corresponding to the sample data and the output result.
And step 204, calculating the gradient of the output result according to the loss value.
In this embodiment, each client node performs a backward calculation to obtain the input Output@Grad of the Expert backward network (the gradient of the output result, i.e., the partial derivative of the loss value with respect to the output result). client-0 obtains the input Output_0@Grad of the Expert_5 backward network; client-1 obtains the input Output_1@Grad of the Expert_7 backward network; client-2 obtains the input Output_2@Grad of the Expert_1 backward network.
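A minimal sketch of this step, assuming a mean-squared-error loss and illustrative tensor shapes, is as follows: the client takes the gradient of the loss with respect to the output received from the server, which is exactly the Output@Grad that is then sent back.

```python
# Sketch of computing Output@Grad on the client from the loss value.
import torch
import torch.nn.functional as F

output = torch.randn(4, 8, requires_grad=True)      # Output received from the server
label = torch.randn(4, 8)                           # labels of this client's samples

loss = F.mse_loss(output, label)                    # loss value of the whole model
(output_grad,) = torch.autograd.grad(loss, output)  # Output@Grad, sent to the server
print(output_grad.shape)                            # torch.Size([4, 8])
```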
Step 205, the gradient of the output result is sent to the server.
In this embodiment, each client node sends Output@Grad to the server node. The server node receives Output@Grad and performs the reverse calculations of the selected Expert network and the Gate network, obtaining the parameter gradients of those two networks and the gradient H@Grad to be returned to the client.
As shown in fig. 4c, the server node receives Output_0@Grad and performs the reverse calculation of the Expert_5 network and the Gate_0 network; receives Output_1@Grad and performs the reverse calculation of the Expert_7 network and the Gate_1 network; and receives Output_2@Grad and performs the reverse calculation of the Expert_1 network and the Gate_2 network.
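The server-side backward step can be sketched as follows; a single linear layer stands in for the selected Expert (and Gate) networks, and the shapes are illustrative only.

```python
# Sketch of the reverse calculation on the server: feed Output@Grad backward
# through the expert that handled this request, obtaining its parameter gradients
# and the gradient H@Grad to return to the client.
import torch
import torch.nn as nn

expert = nn.Linear(8, 8)                        # the expert selected in the forward pass
h = torch.randn(4, 8, requires_grad=True)       # H received from the client (forward)
output = expert(h)                              # retained forward activation

output_grad = torch.randn(4, 8)                 # Output@Grad received from the client
output.backward(gradient=output_grad)           # reverse calculation on the server

expert_param_grads = [p.grad for p in expert.parameters()]  # used to update the expert
h_grad = h.grad                                             # H@Grad, sent back to the client
```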
And step 206, in response to receiving the parameter gradient of the second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network.
In this embodiment, after the client receives H@Grad, it executes the backward pass of the BackBone network until the backward calculation finishes. As shown in fig. 4c, client-0 obtains the parameter gradient of the BackBone_0 network; client-1 obtains the parameter gradient of the BackBone_1 network; and client-2 obtains the parameter gradient of the BackBone_2 network.
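Continuing the same illustrative shapes, the client-side backward step can be sketched by feeding the received H@Grad into the backward pass of the backbone:

```python
# Sketch of the reverse calculation on the client: propagate H@Grad through the
# backbone to obtain the backbone (first network) parameter gradients.
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)                  # stands in for the BackBone network
x = torch.randn(4, 8)                       # this client's sample data
h = backbone(x)                             # retained forward activation H

h_grad = torch.randn(4, 8)                  # H@Grad received from the server
h.backward(gradient=h_grad)                 # reverse calculation of the first network

backbone_param_grads = [p.grad for p in backbone.parameters()]
```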
Step 207, updating the parameters of the first network based on the parameter gradient of the first network.
In this embodiment, the updated parameters of the first network can be obtained by subtracting the parameter gradient (typically scaled by a learning rate) from the parameters of the first network. The network-parameter updating process itself is well known and is therefore not described in detail.
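As a minimal worked example of this update rule (the model and learning rate below are placeholders), each parameter is decreased by its gradient scaled by a learning rate:

```python
# Sketch of the plain gradient-descent update applied to the first network.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()    # stand-in backward pass producing gradients

lr = 0.01
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad                     # parameter := parameter - lr * gradient
        p.grad.zero_()
```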
The core idea of the method provided by the above embodiment of the present disclosure is to place the second networks involved in cross-machine communication together on the server node and to place the remaining first networks on the respective client nodes.
In the forward phase, after the first network on a client node finishes its calculation, the intermediate result is sent to the server node. The server node selects the corresponding second network and, after its calculation finishes, sends the result back to the client node, completing the forward calculation.
In the reverse phase, the client node sends the gradient back to the server node; the server node performs the backward calculation using the second network selected in the forward calculation and sends the resulting gradient back to the client node.
In the optimizer stage, the server node and the client nodes each optimize their own parameters.
In some optional implementations of this embodiment, before updating the parameters of the first network based on the parameter gradient of the first network, the method further includes: performing parameter gradient synchronization of the first network with the other clients. To keep the network parameters of all clients consistent, the gradients of the parameters of all clients are synchronized, ensuring parameter consistency. This gradient synchronization can be implemented with an Allreduce-sum synchronous communication operation: after the Allreduce-sum over the parameter gradients, the gradient obtained on each process is identical and equals the sum of the gradients at the corresponding positions on all processes; each process then divides the summed gradient by the number of data-parallel processes, so the resulting gradient is the average of the gradients on all processes before synchronization.
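A hedged sketch of this synchronization is given below; the process-group backend, address and single-process setup exist only to make the example self-contained and are not part of the present disclosure.

```python
# Sketch of gradient synchronization: all-reduce (sum) every first-network
# parameter gradient, then divide by the number of data-parallel processes.
import torch
import torch.distributed as dist
import torch.nn as nn

def sync_gradients(model: nn.Module) -> None:
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size            # average over all processes

if __name__ == "__main__":
    # Single-process group just to make the sketch executable end to end.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    backbone = nn.Linear(8, 8)
    backbone(torch.randn(4, 8)).sum().backward()
    sync_gradients(backbone)
    dist.destroy_process_group()
```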
In some optional implementations of this embodiment, updating the parameters of the first network based on the parameter gradient of the first network includes: updating, by an optimizer, the parameters of the first network based on the parameter gradient of the first network. The optimizer may include: 1. basic gradient descent methods, including standard Gradient Descent (GD), Stochastic Gradient Descent (SGD) and Batch Gradient Descent (BGD); 2. momentum optimization methods, including the standard momentum optimizer (MomentumOptimizer) and Nesterov Accelerated Gradient (NAG); 3. adaptive learning-rate methods, including AdaGrad (Adaptive Gradient algorithm), RMSProp (Root Mean Square Propagation) and Adam; 4. fused methods, such as Adam (AdaGrad + Momentum) and Nadam (Adam + Nesterov).
The optimizer can accelerate the convergence of the model and shorten the model training time.
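As a brief usage sketch (the Adam optimizer and the stand-in loss below are chosen only for illustration), the optimizer applies the synchronized parameter gradients to the first-network parameters:

```python
# Sketch of an optimizer-based update of the first network.
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)

loss = backbone(torch.randn(4, 8)).sum()   # stand-in for the real loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # update based on the parameter gradients
```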
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network. In the Mixture-of-Experts model, expanding the model size requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
To solve these problems, the present disclosure provides a method and an apparatus for training a MoE model based on a parameter server. All Gate and Expert parameters of the model are uniformly placed on the same node (the server node), the other parameters are placed on the respective computing devices, and the Gate and Expert networks are computed on the server node. In this way, communication and computation on different nodes can overlap, all MoE routing communication stays within the machine, overall communication efficiency is improved, and computing and storage resources are fully utilized.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method of distributed model training is illustrated. The process 300 of the distributed model training method is applied to a server, and comprises the following steps:
step 301, in response to receiving the intermediate result sent by the client, inputting the intermediate result into the second network for forward calculation, and obtaining an output result.
In this embodiment, an executing entity of the distributed model training method (e.g., the server shown in fig. 1a) may receive the intermediate result from the client through a wired or wireless connection. The intermediate result is obtained by the client performing step 201. The second network continues the forward calculation based on the intermediate result to obtain the output result of the whole model. For example, the server node receives the intermediate result H sent by each client, selects an Expert network on the server node through the corresponding Gate routing network, and executes the forward calculation to obtain the output result Output.
As shown in fig. 4b, on the server node, H0 is routed by the Gate_0 network, Expert_5 is selected, and Output_0 is obtained by calculation; H1 is routed by the Gate_1 network, Expert_7 is selected, and Output_1 is obtained by calculation; H2 is routed by the Gate_2 network, Expert_1 is selected, and Output_2 is obtained by calculation.
Step 302, sending the output result to the client.
In this embodiment, the server node sends the output result back to the original client node. The client that receives the corresponding output result executes step 203 to obtain the gradient of the output result.
As shown in FIG. 4b, the server node sends Output_0 back to client-0, Output_1 back to client-1, and Output_2 back to client-2.
Step 303, in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network.
In this embodiment, the server node receives Output@Grad and performs the reverse calculations of the selected Expert network and the Gate network, obtaining the parameter gradients of those two networks and the gradient H@Grad to be returned to the client.
As shown in fig. 4c, the server node receives Output_0@Grad and performs the reverse calculation of the Expert_5 network and the Gate_0 network; receives Output_1@Grad and performs the reverse calculation of the Expert_7 network and the Gate_1 network; and receives Output_2@Grad and performs the reverse calculation of the Expert_1 network and the Gate_2 network.
And step 304, sending the parameter gradient of the second network to the client.
In this embodiment, the server node sends H@Grad back to the corresponding client node. The client then executes step 206 to obtain the parameter gradient of the first network.
And step 305, updating the parameters of the second network based on the parameter gradient of the second network.
In this embodiment, the server and the clients each maintain the network parameters they store. The server subtracts the parameter gradient of the second network from the original parameters of the second network to obtain the updated parameters of the second network.
The core idea of the method provided by the above embodiment of the present disclosure is to place the second networks involved in cross-machine communication together on the server node and to place the remaining first networks on the respective client nodes.
In the forward phase, after the first network on a client node finishes its calculation, the intermediate result is sent to the server node. The server node selects the corresponding second network and, after its calculation finishes, sends the result back to the client node, completing the forward calculation.
In the reverse phase, the client node sends the gradient back to the server node; the server node performs the backward calculation using the second network selected in the forward calculation and sends the resulting gradient back to the client node.
In the optimizer stage, the server node and the client nodes each optimize their own parameters.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network. In the Mixture-of-Experts model, expanding the model scale requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
In some optional implementations of this embodiment, the parameter gradient of the second network includes the parameter gradient of each gating network and the parameter gradient of each expert network, and before sending the parameter gradient of the second network to the client, the method further includes: synchronizing the parameter gradients of all gating networks. To keep the network parameters of all gating networks consistent, the gradients of the parameters of all gating networks are synchronized, ensuring parameter consistency. The parameters of the expert networks are not synchronized in this way, so a MoE model with high accuracy can be obtained. This gradient synchronization can be implemented with an Allreduce-sum synchronous communication operation: after the Allreduce-sum over the parameter gradients, the gradient obtained on each process is identical and equals the sum of the gradients at the corresponding positions on all processes; each process then divides the summed gradient by the number of data-parallel processes, so the resulting gradient is the average of the gradients on all processes before synchronization.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for distributed model training, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training distributed models of the present embodiment includes: a first forward calculation unit 501, a first transmission unit 502, a loss value calculation unit 503, a first gradient calculation unit 504, a second transmission unit 505, a first reverse calculation unit 506, and a first update unit 507. The first forward computing unit 501 is configured to input sample data into a first network to perform forward computing, and obtain an intermediate result; a first sending unit 502 configured to send the intermediate result to a server; a loss value calculating unit 503 configured to, in response to receiving an output result sent by the server, calculate a loss value based on a label corresponding to the sample data and the output result, where the output result is obtained by performing forward calculation by a second network in the server based on the intermediate result; a first gradient calculation unit 504 configured to calculate a gradient of the output result from the loss value; a second transmitting unit 505 configured to transmit the gradient of the output result to the server; a first reverse calculation unit 506, configured to perform a reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, where the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result; a first updating unit 507 configured to update a parameter of the first network based on the parameter gradient of the first network.
In this embodiment, the specific processes of the first forward calculating unit 501, the first sending unit 502, the loss value calculating unit 503, the first gradient calculating unit 504, the second sending unit 505, the first backward calculating unit 506, and the first updating unit 507 of the apparatus 500 for distributed model training may refer to steps 201 to 207 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the apparatus 500 further comprises a first synchronization unit (not shown in the drawings) configured to: performing parameter gradient synchronization of the first network with other clients before said updating the parameter of the first network based on the parameter gradient of the first network.
In some optional implementations of the present embodiment, the first updating unit 507 is further configured to: updating, by an optimizer, a parameter of the first network based on a parameter gradient of the first network.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for distributed model training, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 3, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for distributed model training of the present embodiment includes: a second forward calculation unit 601, a third transmission unit 602, a second backward calculation unit 603, a fourth transmission unit 604, and a second update unit 605. The second forward computing unit 601 is configured to, in response to receiving an intermediate result sent by the client, input the intermediate result into the second network for forward computing, and obtain an output result; a third sending unit 602 configured to send the output result to the client; a second reverse calculation unit 603 configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network; a fourth sending unit 604 configured to send the parameter gradient of the second network to the client; a second updating unit 605 configured to update the parameter of the second network based on the parameter gradient of the second network.
In this embodiment, the specific processes of the second forward computing unit 601, the third sending unit 602, the second reverse calculation unit 603, the fourth sending unit 604 and the second updating unit 605 of the apparatus 600 for distributed model training may refer to steps 301 to 305 in the corresponding embodiment of fig. 3.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network.
In some optional implementations of this embodiment, the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and the apparatus 600 further comprises a second synchronization unit (not shown in the drawings) configured to: synchronizing the parameter gradients of all gated networks before sending the parameter gradients of the second network to the client.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flows 200 or 300.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flows 200 or 300.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 300.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702 and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method of distributed model training. For example, in some embodiments, the method of distributed model training may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of distributed model training described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable manner (e.g., by means of firmware) to perform the method of distributed model training.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method for training a distributed model is applied to a client and comprises the following steps:
inputting sample data into a first network to execute forward calculation to obtain an intermediate result;
sending the intermediate result to a server;
in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result;
calculating the gradient of the output result according to the loss value;
sending the gradient of the output result to the server;
in response to receiving a parameter gradient of a second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing reverse calculation based on the gradient of the output result;
updating a parameter of the first network based on the parameter gradient of the first network.
2. The method of claim 1, wherein prior to the updating the parameter of the first network based on the parameter gradient of the first network, the method further comprises:
and carrying out parameter gradient synchronization of the first network with other clients.
3. The method of claim 1, wherein the updating the parameter of the first network based on the parameter gradient of the first network comprises:
updating, by an optimizer, a parameter of the first network based on the parameter gradient of the first network.
4. The method of claim 1, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
5. A method for distributed model training is applied to a server and comprises the following steps:
responding to the received intermediate result sent by the client, inputting the intermediate result into a second network for forward calculation to obtain an output result;
sending the output result to the client;
in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network;
sending the parameter gradient of the second network to the client;
updating the parameters of the second network based on the parameter gradient of the second network.
6. The method of claim 5, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
7. The method of claim 6, wherein the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and
before the sending the parameter gradient of the second network to the client, the method further comprises:
the parameter gradients of all gating networks are synchronized.
8. A model training system, comprising:
at least one client configured to perform the method of any one of claims 1-4;
a server configured to perform the method of any one of claims 5-7.
9. A distributed model training device is applied to a client and comprises:
a first forward computing unit configured to input sample data into a first network to perform forward computing, resulting in an intermediate result;
a first sending unit configured to send the intermediate result to a server;
a loss value calculation unit configured to calculate a loss value based on a tag corresponding to the sample data and an output result sent by the server in response to receiving the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result;
a first gradient calculation unit configured to calculate a gradient of the output result from the loss value;
a second transmitting unit configured to transmit the gradient of the output result to the server;
a first reverse calculation unit configured to perform reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result;
a first updating unit configured to update a parameter of the first network based on a parameter gradient of the first network.
10. The apparatus of claim 9, wherein the apparatus further comprises a first synchronization unit configured to:
performing parameter gradient synchronization of the first network with other clients before said updating the parameter of the first network based on the parameter gradient of the first network.
11. The apparatus of claim 9, wherein the first updating unit is further configured to:
updating, by an optimizer, a parameter of the first network based on a parameter gradient of the first network.
12. The apparatus of claim 9, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
13. A distributed model training device applied to a server comprises:
the second forward computing unit is configured to respond to the received intermediate result sent by the client and input the intermediate result into a second network for forward computing to obtain an output result;
a third sending unit configured to send the output result to the client;
a second reverse calculation unit configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network;
a fourth sending unit configured to send the parameter gradient of the second network to the client;
a second updating unit configured to update a parameter of the second network based on a parameter gradient of the second network.
14. The apparatus of claim 13, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
15. The apparatus of claim 14, wherein the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and
the apparatus further comprises a second synchronization unit configured to:
synchronize the parameter gradients of all gating networks before the parameter gradients of the second network are sent to the client.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202211332354.7A 2022-10-28 2022-10-28 Distributed model training method, device and system Active CN115660034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211332354.7A CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211332354.7A CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Publications (2)

Publication Number Publication Date
CN115660034A (en) 2023-01-31
CN115660034B CN115660034B (en) 2023-08-15

Family

ID=84993134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211332354.7A Active CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Country Status (1)

Country Link
CN (1) CN115660034B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
CN112329919A (en) * 2020-11-05 2021-02-05 北京百度网讯科技有限公司 Model training method and device
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112884086A (en) * 2021-04-06 2021-06-01 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and program product
CN113240079A (en) * 2021-04-29 2021-08-10 华为技术有限公司 Model training method and device
US20210303988A1 (en) * 2020-03-30 2021-09-30 Amazon Technologies, Inc. Multi-model training pipeline in distributed systems
CN113569891A (en) * 2021-01-25 2021-10-29 腾讯科技(深圳)有限公司 Training data processing device, electronic equipment and storage medium of neural network model
US20210374503A1 (en) * 2018-10-15 2021-12-02 Board Of Trustees Of The University Of Illinois Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20220029971A1 (en) * 2019-12-13 2022-01-27 TripleBlind, Inc. Systems and Methods for Providing a Modified Loss Function in Federated-Split Learning
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114860412A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Task processing method and device, electronic equipment and medium
US20220261626A1 (en) * 2021-02-08 2022-08-18 International Business Machines Corporation Distributed Adversarial Training for Robust Deep Neural Networks
CN114996578A (en) * 2022-06-13 2022-09-02 深圳市欢太科技有限公司 Model training method, target object selection method, device and electronic equipment
US11467992B1 (en) * 2020-09-24 2022-10-11 Amazon Technologies, Inc. Memory access operation in distributed computing system

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
US20210374503A1 (en) * 2018-10-15 2021-12-02 Board Of Trustees Of The University Of Illinois Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20220029971A1 (en) * 2019-12-13 2022-01-27 TripleBlind, Inc. Systems and Methods for Providing a Modified Loss Function in Federated-Split Learning
US20210303988A1 (en) * 2020-03-30 2021-09-30 Amazon Technologies, Inc. Multi-model training pipeline in distributed systems
US11467992B1 (en) * 2020-09-24 2022-10-11 Amazon Technologies, Inc. Memory access operation in distributed computing system
CN112329919A (en) * 2020-11-05 2021-02-05 北京百度网讯科技有限公司 Model training method and device
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN113569891A (en) * 2021-01-25 2021-10-29 腾讯科技(深圳)有限公司 Training data processing device, electronic equipment and storage medium of neural network model
US20220261626A1 (en) * 2021-02-08 2022-08-18 International Business Machines Corporation Distributed Adversarial Training for Robust Deep Neural Networks
CN112884086A (en) * 2021-04-06 2021-06-01 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and program product
CN113240079A (en) * 2021-04-29 2021-08-10 华为技术有限公司 Model training method and device
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114860412A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Task processing method and device, electronic equipment and medium
CN114996578A (en) * 2022-06-13 2022-09-02 深圳市欢太科技有限公司 Model training method, target object selection method, device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUN YANG et al.: "On the Convergence of Hybrid Federated Learning with Server-Clients Collaborative Training", 2022 56th Annual Conference on Information Sciences and Systems (CISS), pages 252-257 *
AN TAO: "Research on Parallel Algorithm Optimization of Convolutional Neural Networks Based on a Distributed Environment", China Masters' Theses Full-text Database (Information Science and Technology), vol. 2022, no. 1, pages 140-183 *
YANG ZHAOYI: "Research and Implementation of an Execution Optimization System for Deep Learning Applications", China Masters' Theses Full-text Database (Information Science and Technology), vol. 2022, no. 3, pages 140-285 *

Also Published As

Publication number Publication date
CN115660034B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
EP3913545A2 (en) Method and apparatus for updating parameter of multi-task model, and electronic device
JP7454529B2 (en) Distributed model training device and method, electronic device, storage medium, and computer program
CN112529201B (en) Entangled quantum state conversion method, device, equipment, storage medium and product
WO2023201981A1 (en) Mixture-of-experts model implementation method and system, electronic device, and storage medium
CN115203126B (en) Operator fusion processing method, device, equipment and storage medium
US20230095725A1 (en) Method of processing quantum circuit, electronic device, and storage medium
CN115860128A (en) Quantum circuit operation method and device and electronic equipment
CN114580645A (en) Simulation method, device and equipment for random quantum measurement and storage medium
CN114239853A (en) Model training method, device, equipment, storage medium and program product
CN115660034B (en) Distributed model training method, device and system
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN115759232A (en) Multitask parallel processing method, device, equipment and medium of deep learning framework
CN114722048A (en) Data processing method and device, electronic equipment and storage medium
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN114429211A (en) Method, apparatus, device, medium and product for generating information
CN114626523A (en) Method, device and equipment for training deep learning model and storage medium
CN115629879A (en) Load balancing method and device for distributed model training
CN117827619B (en) Time-consuming prediction simulation method, device, equipment, medium and system for heterogeneous calculation force
JP7391127B2 (en) Point cloud data processing method, apparatus, electronic device, storage medium, and program
CN116560817B (en) Task execution method, device, electronic equipment and storage medium
CN117520461B (en) Distribution method, device, equipment and medium of logic fragments
CN117521829A (en) Quantum circuit simulation method and device and electronic equipment
CN117539602A (en) Method and device for task speculation behind, electronic equipment and storage medium
CN117669751A (en) Quantum circuit simulation method and device and electronic equipment
CN114816758A (en) Resource allocation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant