CN115660034A - Distributed model training method, device and system - Google Patents

Distributed model training method, device and system

Info

Publication number
CN115660034A
CN115660034A
Authority
CN
China
Prior art keywords
network
gradient
parameter
server
output result
Prior art date
Legal status
Granted
Application number
CN202211332354.7A
Other languages
Chinese (zh)
Other versions
CN115660034B (en)
Inventor
沈亮
于佃海
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211332354.7A priority Critical patent/CN115660034B/en
Publication of CN115660034A publication Critical patent/CN115660034A/en
Application granted granted Critical
Publication of CN115660034B publication Critical patent/CN115660034B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The disclosure provides a method, an apparatus and a system for distributed model training, and relates to the field of artificial intelligence, in particular to the field of deep learning. The specific implementation scheme is as follows: inputting sample data into a first network to execute forward calculation to obtain an intermediate result; sending the intermediate result to a server; in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result; calculating the gradient of the output result according to the loss value; sending the gradient of the output result to the server; in response to receiving the parameter gradient of a second network sent by the server, executing the reverse calculation of the first network to obtain the parameter gradient of the first network; and updating the parameters of the first network based on the parameter gradient of the first network. This implementation can reduce the number of cross-node communications and improve overall training performance.

Description

Distributed model training method, device and system
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the field of deep learning, and specifically relates to a method, a device and a system for distributed model training.
Background
In deep learning model training in recent years, the trend of using more training data and larger models has not changed. Larger models and data volumes mean more computation and storage requirements, and also longer training times. How to distribute the computation and storage requirements to multiple training devices to increase training speed is a key issue.
Data parallelism is a parallel strategy for addressing this problem: the training task is split over multiple processes (devices), and each process maintains the same model parameters and the same computational task but processes different data (its own batch). In this way, the data and computation under one global batch are split across different processes, relieving the computation and storage pressure on any single device.
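As a minimal illustration of this splitting (the function and variable names below are illustrative and not part of the present disclosure), a global batch can be sharded by process rank:

```python
# Minimal sketch of data parallelism: each process (rank) takes its own shard of
# the global batch, while model parameters are replicated on every process.
import numpy as np

def shard_global_batch(global_batch: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Return the slice of the global batch owned by the given process."""
    per_rank = len(global_batch) // world_size
    start = rank * per_rank
    return global_batch[start:start + per_rank]

# Example: a global batch of 8 samples split across 2 processes.
global_batch = np.arange(8, dtype=np.float32).reshape(8, 1)
print(shard_global_batch(global_batch, rank=0, world_size=2))  # samples 0-3
print(shard_global_batch(global_batch, rank=1, world_size=2))  # samples 4-7
```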
Distributed model training with a Mixture-of-Experts (MoE) model is one of the technical paths to very-large-scale model training. The idea of this model is to train multiple neural networks (distributed among multiple compute nodes), each of which trains on a different portion of the data set. The input data of different devices pass through their respective routing networks (for example, Gate networks) and different expert nodes are selected, so communication (namely, All-to-All communication) is required between different devices to distribute the data. When the number of devices is too large, this communication crosses nodes and the overall computing performance is reduced.
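To make the routing step concrete, the sketch below shows a top-1 gate that scores the experts for each token and routes the token to the highest-scoring one; the class name, the hidden size and the auxiliary balance loss are illustrative assumptions, not the gate defined in this disclosure.

```python
# Illustrative top-1 gating for an MoE layer: score experts per token, pick the
# best one, and compute a simple auxiliary loss that discourages collapsed routing.
import torch
import torch.nn as nn

class Top1Gate(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_experts)

    def forward(self, h: torch.Tensor):
        probs = torch.softmax(self.score(h), dim=-1)   # [tokens, num_experts]
        expert_id = probs.argmax(dim=-1)               # chosen expert per token
        # Simplified stand-in for a load-balancing (gate) loss.
        gate_loss = (probs.mean(dim=0) ** 2).sum() * probs.shape[-1]
        return expert_id, gate_loss

gate = Top1Gate(hidden_dim=16, num_experts=4)
h = torch.randn(8, 16)                                 # 8 tokens from the backbone
expert_id, gate_loss = gate(h)
print(expert_id.tolist(), gate_loss.item())
```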
Disclosure of Invention
The present disclosure provides a method, apparatus, system, device, storage medium, and computer program product for distributed model training.
According to a first aspect of the present disclosure, a method for training a distributed model is provided, which is applied to a client, and includes: inputting sample data into a first network to execute forward calculation to obtain an intermediate result; sending the intermediate result to a server; in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result; calculating the gradient of the output result according to the loss value; sending the gradient of the output result to the server; in response to receiving a parameter gradient of a second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing reverse calculation based on the gradient of the output result; updating the parameter of the first network based on the parameter gradient of the first network.
According to a second aspect of the present disclosure, there is provided a method for distributed model training, applied to a server, including: responding to the received intermediate result sent by the client, inputting the intermediate result into a second network for forward calculation to obtain an output result; sending the output result to the client; in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network; sending the parameter gradient of the second network to the client; updating the parameters of the second network based on the parameter gradient of the second network.
According to a third aspect of the present disclosure, there is provided a model training system comprising: at least one client configured to perform the method of the first aspect; a server configured to perform the method of the second aspect.
According to a fourth aspect of the present disclosure, there is provided an apparatus for distributed model training, applied to a client, including: the first forward computing unit is configured to input sample data into a first network to execute forward computing to obtain an intermediate result; a first sending unit configured to send the intermediate result to a server; a loss value calculation unit configured to calculate a loss value based on a tag corresponding to the sample data and an output result sent by the server in response to receiving the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result; a first gradient calculation unit configured to calculate a gradient of the output result from the loss value; a second transmitting unit configured to transmit the gradient of the output result to the server; a first reverse calculation unit configured to perform reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result; a first updating unit configured to update a parameter of the first network based on a parameter gradient of the first network.
According to a fifth aspect of the present disclosure, there is provided an apparatus for distributed model training, applied to a server, including: the second forward computing unit is configured to respond to the received intermediate result sent by the client and input the intermediate result into a second network for forward computing to obtain an output result; a third sending unit configured to send the output result to the client; a second reverse calculation unit configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network; a fourth sending unit configured to send the parameter gradient of the second network to the client; a second updating unit configured to update a parameter of the second network based on a parameter gradient of the second network.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to an eighth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
According to the distributed model training method, apparatus and system provided by the embodiments of the present disclosure, the parameters of the second network of the model (for example, the Gate networks and the Expert networks) are uniformly placed on the same server (server node), the other parameters are placed on the respective computing devices (client nodes), the computation of the first network is performed on each client, and the computation of the second network is performed on the server. Thus, communication and computation on different nodes can overlap, all MoE routing communication stays within a single machine, overall communication efficiency is improved, and computing and storage resources are fully utilized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIGS. 1a-1c are schematic diagrams of exemplary system architectures to which embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of distributed model training according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a method of distributed model training according to the present disclosure;
FIGS. 4a-4c are schematic diagrams of an application scenario of a method of distributed model training according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for distributed model training according to the present disclosure;
FIG. 6 is a schematic structural diagram of yet another embodiment of an apparatus for distributed model training according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1a illustrates an exemplary system architecture 100 to which embodiments of the distributed model training method or distributed model training apparatus of the present disclosure may be applied.
As shown in fig. 1a, the system architecture 100 may include a Server and a plurality of clients (Workers, i.e., compute nodes).
The system architecture 100 is a programming architecture commonly employed in the field of distributed training, and mainly solves the following two problems:
1. The model parameters are too large: single-machine memory is insufficient, so distributed storage is needed.
2. The training data is too large: single-machine training is too slow, so more training nodes are needed to increase the concurrent training speed.
As shown in fig. 1a, the system architecture 100 mainly includes two parts, Server and Worker, where the Server is responsible for storing and updating parameters and the Worker is responsible for training. In brief, the basic idea of model training based on this architecture is as follows: when there is too much training data and a single Worker is too slow, multiple Workers can be introduced to train simultaneously, and the model parameters then need to be synchronized. The intuitive idea is to introduce a Server that acts as a medium for parameter exchange between Workers. When the model parameters are so large that single-machine storage is insufficient, or there are so many Workers that one Server becomes a bottleneck, multiple Servers need to be introduced.
The specific process of model training is as follows:
1. The training data (sample set) is evenly distributed to the different Workers.
2. The model parameters are sharded and stored on different Servers.
3. Worker side: read a minibatch of training data, pull the latest parameters from the Server side, compute the gradients, and upload them to the corresponding Servers according to the shards.
4. Server side: receive the gradients uploaded by the Workers and update the parameters according to the optimization algorithm. Depending on whether the Server waits for the gradients of all Workers before each parameter update, training is divided into two mechanisms: synchronous training and asynchronous training. (An illustrative single-process sketch of this Worker/Server loop is given below.)
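The following sketch condenses the loop above into a single process for illustration; the Server and Worker classes, the least-squares objective and the learning rate are assumptions made only to keep the example self-contained, and a real deployment would exchange parameters and gradients over the network.

```python
# Single-process sketch of the parameter-server loop: the Server stores and
# updates one parameter shard, the Workers compute gradients on their minibatches.
import numpy as np

class Server:
    def __init__(self, dim: int, lr: float = 0.1):
        self.params = np.zeros(dim)          # one parameter shard
        self.lr = lr

    def pull(self) -> np.ndarray:
        return self.params.copy()

    def push(self, grad: np.ndarray) -> None:
        self.params -= self.lr * grad        # update with the received gradient

class Worker:
    def __init__(self, data: np.ndarray, labels: np.ndarray):
        self.data, self.labels = data, labels

    def compute_grad(self, params: np.ndarray) -> np.ndarray:
        # Gradient of a least-squares loss 0.5 * ||X w - y||^2 on this Worker's shard.
        residual = self.data @ params - self.labels
        return self.data.T @ residual / len(self.labels)

rng = np.random.default_rng(0)
server = Server(dim=3)
workers = [Worker(rng.normal(size=(16, 3)), rng.normal(size=16)) for _ in range(2)]
for step in range(5):                        # synchronous training
    grads = [w.compute_grad(server.pull()) for w in workers]
    server.push(np.mean(grads, axis=0))
print(server.params)
```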
The MoE model may be trained using the system architecture 100. In the MoE model, data is first processed by a Backbone network (generally several fully-connected layers) to obtain an intermediate result; after selection by a top-k Gate (gating network), k Expert networks are selected for each token (character) in the result. As shown in FIG. 1b, a top-1 Gate is used, so 1 Expert is chosen for each token. After the data H_0 is routed, Expert_0 is selected, and H_0 is then processed by that expert network to obtain an output m. Meanwhile, the Top-1 Gate produces Gate_loss, which measures the quality of this gate selection: the smaller the Gate_loss, the more reasonable and uniform the Expert routing.
The MoE + data parallel approach (MoE + DP) differs from traditional data parallelism in that the MoE model parameters are divided into two types: the Dense parameters of the Backbone layers and the Sparse parameters of the Expert layers. Under data parallelism, the Dense parameters are the same on each card (computing device, such as a GPU) and are updated exactly as in ordinary data parallelism: after the backward phase finishes, each card synchronizes the gradients of these parameters. For the Sparse parameters under MoE + DP, each card initializes different values. As shown in fig. 1c, there are 3 Experts per card and 6 Experts in total across 2 cards; the top-1 Gate selects, out of the 6 Experts, the 1 Expert with the highest score for routing, and calculation then proceeds as in single-card MoE. If the selected Expert is on another card, send/recv must be called to implement cross-card communication. As shown in fig. 1c, after passing through the Top-1 Gate, the 4th Expert is selected for the output tensor H_0 of the Backbone on Rank-0 (the card number); H_0 is then sent to Rank-1, computed by Expert_4, and sent back to Rank-0 to obtain the final Output_0.
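A hedged sketch of this Dense/Sparse split is shown below; the module layout and the rule of treating any parameter whose name contains "experts" as Sparse are assumptions for illustration only.

```python
# Sketch of the Dense/Sparse split under MoE + DP: Dense (Backbone/Gate) gradients
# are synchronized across cards, while Expert parameters differ per card and are not.
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d: int, local_experts: int):
        super().__init__()
        self.backbone = nn.Linear(d, d)                         # Dense parameters
        self.gate = nn.Linear(d, local_experts)                 # Dense parameters
        self.experts = nn.ModuleList(                           # Sparse parameters
            nn.Linear(d, d) for _ in range(local_experts))

def split_parameters(model: nn.Module):
    """Return the names of Dense parameters and of Expert (Sparse) parameters."""
    dense, sparse = [], []
    for name, _ in model.named_parameters():
        (sparse if "experts" in name else dense).append(name)
    return dense, sparse

dense, sparse = split_parameters(MoELayer(d=16, local_experts=3))
print(dense)    # backbone.* and gate.* parameters
print(sparse)   # experts.0.*, experts.1.*, experts.2.* parameters
```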
It should be noted that the method for distributed model training provided by the embodiments of the present disclosure may be executed by a server and a client. Accordingly, the apparatus for distributed model training may be disposed in the server and the client. This is not particularly limited herein.
It should be understood that the number of servers and clients in FIG. 1a is merely illustrative. There may be any number of servers and clients, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of distributed model training in accordance with the present disclosure is shown. The distributed model training method is applied to a client and comprises the following steps:
step 201, inputting sample data into the first network to execute forward calculation, and obtaining an intermediate result.
In this embodiment, an executing entity of the distributed model training method (for example, a client shown in fig. 1a) may receive sample data from a database through a wired or wireless connection; the training is supervised, so the sample data carries labels. The distributed model may include two kinds of networks: a first network and a second network, where each client is arranged with one first network and a plurality of second networks are arranged in the server. The first network essentially acts as a backbone network, and the second network may be a network that typically communicates across hardware, such as a gating network (Gate network) and an expert network (Expert network). Each first network may select one of the second networks to process the intermediate result of its forward calculation, and may receive the gradient fed back by the selected second network in the reverse direction to compute the gradient of the local first network. By uniformly placing the second networks, originally arranged at different clients, in the server, cross-hardware communication can be avoided.
As shown in fig. 4a, in the prior-art network structure, the Gate network, the Expert network and the Backbone network are all located on their own nodes. In the Mixture-of-Experts model, expanding the model scale requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
As shown in fig. 4b, since training is data parallel, each client reads different data: client-0 reads in Data_0, client-1 reads in Data_1, and client-2 reads in Data_2.
Each client node executes the forward calculation of the first network to obtain an intermediate result H. As shown in FIG. 4b, client-0 obtains H0, client-1 obtains H1, and client-2 obtains H2.
Step 202, the intermediate result is sent to the server.
In this embodiment, send and recv are called to implement cross-card communication, and the intermediate result output by the client is sent to the server. The second network in the server continues the forward calculation based on the intermediate result, obtains an output result, and returns it to the client. As shown in FIG. 4b, client-0, client-1 and client-2 send H0, H1 and H2, respectively, to the server node. On the server node, H0 is routed by the Gate_0 network, Expert_5 is selected, and Output_0 is obtained by calculation; H1 is routed by the Gate_1 network, Expert_7 is selected, and Output_1 is obtained by calculation; H2 is routed by the Gate_2 network, Expert_1 is selected, and Output_2 is obtained by calculation.
The server node then sends the output results back to the original client nodes. As shown in FIG. 4b, the server node sends Output_0 back to client-0, Output_1 back to client-1, and Output_2 back to client-2.
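For illustration, the sketch below collapses this client/server split into one process: the client computes the backbone output H, and the server routes H through its gate to one of its experts and returns the output. The module shapes and names are assumptions; a real system would transfer H and the output over the network (e.g., via send/recv) rather than a function call.

```python
# Single-process sketch of the split forward pass: backbone on the client,
# gate + experts on the server.
import torch
import torch.nn as nn

class ClientBackbone(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return self.net(x)                  # intermediate result H

class ServerMoE(nn.Module):
    def __init__(self, d: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_experts))

    def forward(self, h):
        expert_id = self.gate(h).softmax(-1).argmax(-1)    # route each sample
        out = torch.stack([self.experts[int(i)](row) for i, row in zip(expert_id, h)])
        return out                                          # Output, sent back

client, server = ClientBackbone(8), ServerMoE(8, num_experts=4)
x = torch.randn(4, 8)       # Data read by this client
h = client(x)               # forward calculation on the client
output = server(h)          # forward calculation on the server, returned to the client
```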
And step 203, responding to the received output result sent by the server, and calculating a loss value based on the label corresponding to the sample data and the output result.
In this embodiment, the client receives the output result from the server and continues the forward calculation; according to the usual loss-function calculation for labeled samples, the loss value of the whole distributed model is finally obtained from the label corresponding to the sample data and the output result.
And step 204, calculating the gradient of the output result according to the loss value.
In this embodiment, each client node performs a backward calculation to obtain the input Output@Grad of the Expert backward network (the gradient of the output result, i.e., the partial derivative of the loss value with respect to the output result). client-0 obtains the input Output_0@Grad of the Expert_5 backward network; client-1 obtains the input Output_1@Grad of the Expert_7 backward network; client-2 obtains the input Output_2@Grad of the Expert_1 backward network.
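A minimal sketch of this step, assuming a mean-squared-error loss and illustrative tensor shapes, is as follows: the client takes the gradient of the loss with respect to the output received from the server, which is exactly the Output@Grad that is then sent back.

```python
# Sketch of computing Output@Grad on the client from the loss value.
import torch
import torch.nn.functional as F

output = torch.randn(4, 8, requires_grad=True)      # Output received from the server
label = torch.randn(4, 8)                           # labels of this client's samples

loss = F.mse_loss(output, label)                    # loss value of the whole model
(output_grad,) = torch.autograd.grad(loss, output)  # Output@Grad, sent to the server
print(output_grad.shape)                            # torch.Size([4, 8])
```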
Step 205, the gradient of the output result is sent to the server.
In this embodiment, each client node sends Output@Grad to the server node. The server node receives Output@Grad and performs the reverse calculations of the selected Expert network and the Gate network, obtaining the parameter gradients of those two networks and the gradient H@Grad to be returned to the client.
As shown in fig. 4c, the server node receives Output_0@Grad and performs the reverse calculation of the Expert_5 network and the Gate_0 network; receives Output_1@Grad and performs the reverse calculation of the Expert_7 network and the Gate_1 network; and receives Output_2@Grad and performs the reverse calculation of the Expert_1 network and the Gate_2 network.
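The server-side backward step can be sketched as follows; a single linear layer stands in for the selected Expert (and Gate) networks, and the shapes are illustrative only.

```python
# Sketch of the reverse calculation on the server: feed Output@Grad backward
# through the expert that handled this request, obtaining its parameter gradients
# and the gradient H@Grad to return to the client.
import torch
import torch.nn as nn

expert = nn.Linear(8, 8)                        # the expert selected in the forward pass
h = torch.randn(4, 8, requires_grad=True)       # H received from the client (forward)
output = expert(h)                              # retained forward activation

output_grad = torch.randn(4, 8)                 # Output@Grad received from the client
output.backward(gradient=output_grad)           # reverse calculation on the server

expert_param_grads = [p.grad for p in expert.parameters()]  # used to update the expert
h_grad = h.grad                                             # H@Grad, sent back to the client
```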
And step 206, in response to receiving the parameter gradient of the second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network.
In this embodiment, after the client receives H@Grad, it executes the backward pass of the BackBone network until the backward calculation finishes. As shown in fig. 4c, client-0 obtains the parameter gradient of the BackBone_0 network; client-1 obtains the parameter gradient of the BackBone_1 network; and client-2 obtains the parameter gradient of the BackBone_2 network.
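Continuing the same illustrative shapes, the client-side backward step can be sketched by feeding the received H@Grad into the backward pass of the backbone:

```python
# Sketch of the reverse calculation on the client: propagate H@Grad through the
# backbone to obtain the backbone (first network) parameter gradients.
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)                  # stands in for the BackBone network
x = torch.randn(4, 8)                       # this client's sample data
h = backbone(x)                             # retained forward activation H

h_grad = torch.randn(4, 8)                  # H@Grad received from the server
h.backward(gradient=h_grad)                 # reverse calculation of the first network

backbone_param_grads = [p.grad for p in backbone.parameters()]
```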
Step 207, updating the parameters of the first network based on the parameter gradient of the first network.
In this embodiment, the updated parameters of the first network can be obtained by subtracting the parameter gradient (typically scaled by a learning rate) from the parameters of the first network. The network-parameter updating process itself is well known and is therefore not described in detail.
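As a minimal worked example of this update rule (the model and learning rate below are placeholders), each parameter is decreased by its gradient scaled by a learning rate:

```python
# Sketch of the plain gradient-descent update applied to the first network.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()    # stand-in backward pass producing gradients

lr = 0.01
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad                     # parameter := parameter - lr * gradient
        p.grad.zero_()
```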
The core idea of the method provided by the above embodiment of the present disclosure is to place the second networks involved in cross-machine communication together on the server node and to place the remaining first networks on the respective client nodes.
In the forward phase, after the first network on a client node finishes its calculation, the intermediate result is sent to the server node. The server node selects the corresponding second network and, after its calculation finishes, sends the result back to the client node, completing the forward calculation.
In the reverse phase, the client node sends the gradient back to the server node; the server node performs the backward calculation using the second network selected in the forward calculation and sends the resulting gradient back to the client node.
In the optimizer stage, the server node and the client nodes each optimize their own parameters.
In some optional implementations of this embodiment, before updating the parameters of the first network based on the parameter gradient of the first network, the method further includes: performing parameter gradient synchronization of the first network with the other clients. To keep the network parameters of all clients consistent, the gradients of the parameters of all clients are synchronized, ensuring parameter consistency. This gradient synchronization can be implemented with an Allreduce-sum synchronous communication operation: after the Allreduce-sum over the parameter gradients, the gradient obtained on each process is identical and equals the sum of the gradients at the corresponding positions on all processes; each process then divides the summed gradient by the number of data-parallel processes, so the resulting gradient is the average of the gradients on all processes before synchronization.
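A hedged sketch of this synchronization is given below; the process-group backend, address and single-process setup exist only to make the example self-contained and are not part of the present disclosure.

```python
# Sketch of gradient synchronization: all-reduce (sum) every first-network
# parameter gradient, then divide by the number of data-parallel processes.
import torch
import torch.distributed as dist
import torch.nn as nn

def sync_gradients(model: nn.Module) -> None:
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size            # average over all processes

if __name__ == "__main__":
    # Single-process group just to make the sketch executable end to end.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    backbone = nn.Linear(8, 8)
    backbone(torch.randn(4, 8)).sum().backward()
    sync_gradients(backbone)
    dist.destroy_process_group()
```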
In some optional implementations of this embodiment, updating the parameters of the first network based on the parameter gradient of the first network includes: updating, by an optimizer, the parameters of the first network based on the parameter gradient of the first network. The optimizer may include: 1. basic gradient descent methods, including standard Gradient Descent (GD), Stochastic Gradient Descent (SGD) and Batch Gradient Descent (BGD); 2. momentum optimization methods, including the standard momentum optimizer (MomentumOptimizer) and Nesterov Accelerated Gradient (NAG); 3. adaptive learning-rate methods, including AdaGrad (Adaptive Gradient algorithm), RMSProp (Root Mean Square Propagation) and Adam; 4. fused methods, such as Adam (AdaGrad + Momentum) and Nadam (Adam + Nesterov).
The optimizer can accelerate the convergence of the model and shorten the model training time.
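As a brief usage sketch (the Adam optimizer and the stand-in loss below are chosen only for illustration), the optimizer applies the synchronized parameter gradients to the first-network parameters:

```python
# Sketch of an optimizer-based update of the first network.
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)

loss = backbone(torch.randn(4, 8)).sum()   # stand-in for the real loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # update based on the parameter gradients
```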
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network. In the Mixture-of-Experts model, expanding the model size requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
To solve these problems, the present disclosure provides a method and an apparatus for training a MoE model based on a parameter server. All Gate and Expert parameters of the model are uniformly placed on the same node (the server node), the other parameters are placed on the respective computing devices, and the Gate and Expert networks are computed on the server node. In this way, communication and computation on different nodes can overlap, all MoE routing communication stays within the machine, overall communication efficiency is improved, and computing and storage resources are fully utilized.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method of distributed model training is illustrated. The process 300 of the distributed model training method is applied to a server, and comprises the following steps:
step 301, in response to receiving the intermediate result sent by the client, inputting the intermediate result into the second network for forward calculation, and obtaining an output result.
In this embodiment, an executing entity of the distributed model training method (e.g., the server shown in fig. 1a) may receive the intermediate result from the client through a wired or wireless connection. The intermediate result is obtained by the client performing step 201. The second network continues the forward calculation based on the intermediate result to obtain the output result of the whole model. For example, the server node receives the intermediate result H sent by each client, selects an Expert network on the server node through the corresponding Gate routing network, and executes the forward calculation to obtain the output result Output.
As shown in fig. 4b, on the server node, H0 is routed by the Gate_0 network, Expert_5 is selected, and Output_0 is obtained by calculation; H1 is routed by the Gate_1 network, Expert_7 is selected, and Output_1 is obtained by calculation; H2 is routed by the Gate_2 network, Expert_1 is selected, and Output_2 is obtained by calculation.
Step 302, sending the output result to the client.
In this embodiment, the server node sends the output result back to the original client node. The client that receives the corresponding output result executes step 203 to obtain the gradient of the output result.
As shown in FIG. 4b, the server node sends Output_0 back to client-0, Output_1 back to client-1, and Output_2 back to client-2.
Step 303, in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network.
In this embodiment, the server node receives Output@Grad and performs the reverse calculations of the selected Expert network and the Gate network, obtaining the parameter gradients of those two networks and the gradient H@Grad to be returned to the client.
As shown in fig. 4c, the server node receives Output_0@Grad and performs the reverse calculation of the Expert_5 network and the Gate_0 network; receives Output_1@Grad and performs the reverse calculation of the Expert_7 network and the Gate_1 network; and receives Output_2@Grad and performs the reverse calculation of the Expert_1 network and the Gate_2 network.
And step 304, sending the parameter gradient of the second network to the client.
In this embodiment, the server node sends H@Grad back to the corresponding client node. The client then executes step 206 to obtain the parameter gradient of the first network.
And step 305, updating the parameters of the second network based on the parameter gradient of the second network.
In this embodiment, the server and the clients each maintain the network parameters they store. The server subtracts the parameter gradient of the second network from the original parameters of the second network to obtain the updated parameters of the second network.
The core idea of the method provided by the above embodiment of the present disclosure is to place the second networks involved in cross-machine communication together on the server node and to place the remaining first networks on the respective client nodes.
In the forward phase, after the first network on a client node finishes its calculation, the intermediate result is sent to the server node. The server node selects the corresponding second network and, after its calculation finishes, sends the result back to the client node, completing the forward calculation.
In the reverse phase, the client node sends the gradient back to the server node; the server node performs the backward calculation using the second network selected in the forward calculation and sends the resulting gradient back to the client node.
In the optimizer stage, the server node and the client nodes each optimize their own parameters.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network. In the Mixture-of-Experts model, expanding the model scale requires adding multiple nodes, each owning different Experts. After each node passes its input data through a Gate network, the nodes need to communicate with each other to send the data to the node that owns the Expert it selected. This often involves a large amount of cross-machine communication, which is unfavorable for subsequent computation optimization and affects overall training efficiency.
In some optional implementations of this embodiment, the parameter gradient of the second network includes the parameter gradient of each gating network and the parameter gradient of each expert network, and before sending the parameter gradient of the second network to the client, the method further includes: synchronizing the parameter gradients of all gating networks. To keep the network parameters of all gating networks consistent, the gradients of the parameters of all gating networks are synchronized, ensuring parameter consistency. The parameters of the expert networks are not synchronized in this way, so a MoE model with high accuracy can be obtained. This gradient synchronization can be implemented with an Allreduce-sum synchronous communication operation: after the Allreduce-sum over the parameter gradients, the gradient obtained on each process is identical and equals the sum of the gradients at the corresponding positions on all processes; each process then divides the summed gradient by the number of data-parallel processes, so the resulting gradient is the average of the gradients on all processes before synchronization.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for distributed model training, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training distributed models of the present embodiment includes: a first forward calculation unit 501, a first transmission unit 502, a loss value calculation unit 503, a first gradient calculation unit 504, a second transmission unit 505, a first reverse calculation unit 506, and a first update unit 507. The first forward computing unit 501 is configured to input sample data into a first network to perform forward computing, and obtain an intermediate result; a first sending unit 502 configured to send the intermediate result to a server; a loss value calculating unit 503 configured to, in response to receiving an output result sent by the server, calculate a loss value based on a label corresponding to the sample data and the output result, where the output result is obtained by performing forward calculation by a second network in the server based on the intermediate result; a first gradient calculation unit 504 configured to calculate a gradient of the output result from the loss value; a second transmitting unit 505 configured to transmit the gradient of the output result to the server; a first reverse calculation unit 506, configured to perform a reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, where the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result; a first updating unit 507 configured to update a parameter of the first network based on the parameter gradient of the first network.
In this embodiment, the specific processes of the first forward calculating unit 501, the first sending unit 502, the loss value calculating unit 503, the first gradient calculating unit 504, the second sending unit 505, the first backward calculating unit 506, and the first updating unit 507 of the apparatus 500 for distributed model training may refer to steps 201 to 207 in the corresponding embodiment of fig. 2.
In some optional implementations of the present embodiment, the apparatus 500 further comprises a first synchronization unit (not shown in the drawings) configured to: performing parameter gradient synchronization of the first network with other clients before said updating the parameter of the first network based on the parameter gradient of the first network.
In some optional implementations of the present embodiment, the first updating unit 507 is further configured to: updating, by an optimizer, a parameter of the first network based on a parameter gradient of the first network.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of an apparatus for distributed model training, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 3, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for distributed model training of the present embodiment includes: a second forward calculation unit 601, a third transmission unit 602, a second backward calculation unit 603, a fourth transmission unit 604, and a second update unit 605. The second forward computing unit 601 is configured to, in response to receiving an intermediate result sent by the client, input the intermediate result into the second network for forward computing, and obtain an output result; a third sending unit 602 configured to send the output result to the client; a second reverse calculation unit 603 configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network; a fourth sending unit 604 configured to send the parameter gradient of the second network to the client; a second updating unit 605 configured to update the parameter of the second network based on the parameter gradient of the second network.
In this embodiment, the specific processes of the second forward computing unit 601, the third sending unit 602, the second reverse calculation unit 603, the fourth sending unit 604 and the second updating unit 605 of the apparatus 600 for distributed model training may refer to steps 301 to 305 in the corresponding embodiment of fig. 3.
In some optional implementations of this embodiment, the first network is a backbone network, and the second network includes at least one gating network and at least one expert network, where each gating network corresponds to one backbone network.
In some optional implementations of this embodiment, the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and the apparatus 600 further comprises a second synchronization unit (not shown in the drawings) configured to: synchronizing the parameter gradients of all gated networks before sending the parameter gradients of the second network to the client.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flows 200 or 300.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flows 200 or 300.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 300.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702 and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method of distributed model training. For example, in some embodiments, the method of distributed model training may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of distributed model training described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable manner (e.g., by means of firmware) to perform the method of distributed model training.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method for training a distributed model is applied to a client and comprises the following steps:
inputting sample data into a first network to execute forward calculation to obtain an intermediate result;
sending the intermediate result to a server;
in response to receiving an output result sent by the server, calculating a loss value based on a label corresponding to the sample data and the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result;
calculating the gradient of the output result according to the loss value;
sending the gradient of the output result to the server;
in response to receiving a parameter gradient of a second network sent by the server, performing reverse calculation of the first network to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing reverse calculation based on the gradient of the output result;
updating a parameter of the first network based on the parameter gradient of the first network.
2. The method of claim 1, wherein prior to the updating the parameter of the first network based on the parameter gradient of the first network, the method further comprises:
and carrying out parameter gradient synchronization of the first network with other clients.
3. The method of claim 1, wherein the updating the parameter of the first network based on the parameter gradient of the first network comprises:
updating, by an optimizer, a parameter of the first network based on the parameter gradient of the first network.
4. The method of claim 1, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
5. A method for distributed model training is applied to a server and comprises the following steps:
responding to the received intermediate result sent by the client, inputting the intermediate result into a second network for forward calculation to obtain an output result;
sending the output result to the client;
in response to receiving the gradient of the output result sent by the client, performing reverse calculation of the second network based on the gradient of the output result to obtain a parameter gradient of the second network;
sending the parameter gradient of the second network to the client;
updating the parameters of the second network based on the parameter gradient of the second network.
6. The method of claim 5, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
7. The method of claim 6, wherein the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and
before the sending the parameter gradient of the second network to the client, the method further comprises:
the parameter gradients of all gating networks are synchronized.
8. A model training system, comprising:
at least one client configured to perform the method of any one of claims 1-4;
a server configured to perform the method of any one of claims 5-7.
9. A distributed model training device is applied to a client and comprises:
a first forward computing unit configured to input sample data into a first network to perform forward computing, resulting in an intermediate result;
a first sending unit configured to send the intermediate result to a server;
a loss value calculation unit configured to calculate a loss value based on a tag corresponding to the sample data and an output result sent by the server in response to receiving the output result, wherein the output result is obtained by a second network in the server through forward calculation based on the intermediate result;
a first gradient calculation unit configured to calculate a gradient of the output result from the loss value;
a second transmitting unit configured to transmit the gradient of the output result to the server;
a first reverse calculation unit configured to perform reverse calculation of the first network in response to receiving a parameter gradient of a second network sent by the server, to obtain the parameter gradient of the first network, wherein the parameter gradient of the second network is obtained by performing the reverse calculation based on the gradient of the output result;
a first updating unit configured to update a parameter of the first network based on a parameter gradient of the first network.
10. The apparatus of claim 9, wherein the apparatus further comprises a first synchronization unit configured to:
performing parameter gradient synchronization of the first network with other clients before said updating the parameter of the first network based on the parameter gradient of the first network.
11. The apparatus of claim 9, wherein the first updating unit is further configured to:
updating, by an optimizer, a parameter of the first network based on a parameter gradient of the first network.
12. The apparatus of claim 9, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
13. A distributed model training device applied to a server comprises:
the second forward computing unit is configured to respond to the received intermediate result sent by the client and input the intermediate result into a second network for forward computing to obtain an output result;
a third sending unit configured to send the output result to the client;
a second reverse calculation unit configured to, in response to receiving the gradient of the output result sent by the client, perform reverse calculation of the second network based on the gradient of the output result, resulting in a parameter gradient of the second network;
a fourth sending unit configured to send the parameter gradient of the second network to the client;
a second updating unit configured to update a parameter of the second network based on a parameter gradient of the second network.
14. The apparatus of claim 13, wherein the first network is a backbone network and the second network comprises at least one gating network and at least one expert network, wherein each gating network corresponds to one backbone network.
15. The apparatus of claim 14, wherein the parameter gradients of the second network comprise a parameter gradient of each gating network and a parameter gradient of each expert network, and
the apparatus further comprises a second synchronization unit configured to:
synchronize the parameter gradients of all gating networks before the parameter gradients of the second network are sent to the client.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202211332354.7A 2022-10-28 2022-10-28 Distributed model training method, device and system Active CN115660034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211332354.7A CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211332354.7A CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Publications (2)

Publication Number Publication Date
CN115660034A (en) 2023-01-31
CN115660034B CN115660034B (en) 2023-08-15

Family

ID=84993134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211332354.7A Active CN115660034B (en) 2022-10-28 2022-10-28 Distributed model training method, device and system

Country Status (1)

Country Link
CN (1) CN115660034B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
CN112329919A (en) * 2020-11-05 2021-02-05 北京百度网讯科技有限公司 Model training method and device
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112884086A (en) * 2021-04-06 2021-06-01 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and program product
CN113240079A (en) * 2021-04-29 2021-08-10 华为技术有限公司 Model training method and device
US20210303988A1 (en) * 2020-03-30 2021-09-30 Amazon Technologies, Inc. Multi-model training pipeline in distributed systems
CN113569891A (en) * 2021-01-25 2021-10-29 腾讯科技(深圳)有限公司 Training data processing device, electronic equipment and storage medium of neural network model
US20210374503A1 (en) * 2018-10-15 2021-12-02 Board Of Trustees Of The University Of Illinois Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20220029971A1 (en) * 2019-12-13 2022-01-27 TripleBlind, Inc. Systems and Methods for Providing a Modified Loss Function in Federated-Split Learning
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114860412A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Task processing method and device, electronic equipment and medium
US20220261626A1 (en) * 2021-02-08 2022-08-18 International Business Machines Corporation Distributed Adversarial Training for Robust Deep Neural Networks
CN114996578A (en) * 2022-06-13 2022-09-02 深圳市欢太科技有限公司 Model training method, target object selection method, device and electronic equipment
US11467992B1 (en) * 2020-09-24 2022-10-11 Amazon Technologies, Inc. Memory access operation in distributed computing system

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
US20210374503A1 (en) * 2018-10-15 2021-12-02 Board Of Trustees Of The University Of Illinois Network-centric architecture and algorithms to accelerate distributed training of neural networks
US20220029971A1 (en) * 2019-12-13 2022-01-27 TripleBlind, Inc. Systems and Methods for Providing a Modified Loss Function in Federated-Split Learning
US20210303988A1 (en) * 2020-03-30 2021-09-30 Amazon Technologies, Inc. Multi-model training pipeline in distributed systems
US11467992B1 (en) * 2020-09-24 2022-10-11 Amazon Technologies, Inc. Memory access operation in distributed computing system
CN112329919A (en) * 2020-11-05 2021-02-05 北京百度网讯科技有限公司 Model training method and device
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN113569891A (en) * 2021-01-25 2021-10-29 腾讯科技(深圳)有限公司 Training data processing device, electronic equipment and storage medium of neural network model
US20220261626A1 (en) * 2021-02-08 2022-08-18 International Business Machines Corporation Distributed Adversarial Training for Robust Deep Neural Networks
CN112884086A (en) * 2021-04-06 2021-06-01 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and program product
CN113240079A (en) * 2021-04-29 2021-08-10 华为技术有限公司 Model training method and device
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114860412A (en) * 2022-05-19 2022-08-05 北京百度网讯科技有限公司 Task processing method and device, electronic equipment and medium
CN114996578A (en) * 2022-06-13 2022-09-02 深圳市欢太科技有限公司 Model training method, target object selection method, device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUN YANG et al.: "On the Convergence of Hybrid Federated Learning with Server-Clients Collaborative Training", 2022 56th Annual Conference on Information Sciences and Systems (CISS), pages 252-257 *
AN TAO: "Research on Parallel Algorithm Optimization of Convolutional Neural Networks Based on a Distributed Environment", China Masters' Theses Full-text Database (Information Science and Technology), vol. 2022, no. 1, pages 140-183 *
YANG ZHAOYI: "Research and Implementation of an Execution Optimization System for Deep Learning Applications", China Masters' Theses Full-text Database (Information Science and Technology), vol. 2022, no. 3, pages 140-285 *

Also Published As

Publication number Publication date
CN115660034B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
EP3913545A2 (en) Method and apparatus for updating parameter of multi-task model, and electronic device
JP7454529B2 (en) Distributed model training device and method, electronic device, storage medium, and computer program
CN112529201B (en) Entangled quantum state conversion method, device, equipment, storage medium and product
WO2023201981A1 (en) Mixture-of-experts model implementation method and system, electronic device, and storage medium
CN115203126B (en) Operator fusion processing method, device, equipment and storage medium
US20230095725A1 (en) Method of processing quantum circuit, electronic device, and storage medium
CN115860128A (en) Quantum circuit operation method and device and electronic equipment
CN114580645A (en) Simulation method, device and equipment for random quantum measurement and storage medium
CN114239853A (en) Model training method, device, equipment, storage medium and program product
CN115660034B (en) Distributed model training method, device and system
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN115759232A (en) Multitask parallel processing method, device, equipment and medium of deep learning framework
CN114722048A (en) Data processing method and device, electronic equipment and storage medium
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN114429211A (en) Method, apparatus, device, medium and product for generating information
CN114626523A (en) Method, device and equipment for training deep learning model and storage medium
CN115629879A (en) Load balancing method and device for distributed model training
CN117827619B (en) Time-consuming prediction simulation method, device, equipment, medium and system for heterogeneous calculation force
JP7391127B2 (en) Point cloud data processing method, apparatus, electronic device, storage medium, and program
CN116560817B (en) Task execution method, device, electronic equipment and storage medium
CN117520461B (en) Distribution method, device, equipment and medium of logic fragments
CN117521829A (en) Quantum circuit simulation method and device and electronic equipment
CN117539602A (en) Method and device for task speculation behind, electronic equipment and storage medium
CN117669751A (en) Quantum circuit simulation method and device and electronic equipment
CN114816758A (en) Resource allocation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant