CN115759230A - Model training and task processing method, device, system, equipment and storage medium - Google Patents

Model training and task processing method, device, system, equipment and storage medium

Info

Publication number
CN115759230A
Authority
CN
China
Prior art keywords: training, gradient, current, model, round
Prior art date
Legal status
Pending
Application number
CN202211469464.8A
Other languages
Chinese (zh)
Inventor
沈力
孙昊
陈士祥
陶大程
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202211469464.8A priority Critical patent/CN115759230A/en
Publication of CN115759230A publication Critical patent/CN115759230A/en
Pending legal-status Critical Current

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a model training and task processing method, device, system, equipment and storage medium. The model training method is applied to each node device in a distributed cluster that further comprises a central device, and comprises the following steps: determining gradient information corresponding to an initial task processing model of the current node device in the current round of training, and compressing the gradient information to obtain a node compression gradient, wherein the initial task processing model of the current round of training is the target task processing model obtained in the previous round of training; uploading the node compression gradient of the current node device to the central device, and acquiring from the central device a central compression gradient obtained by fusing and compressing the node compression gradients uploaded by the node devices; determining the gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient; and updating the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training.

Description

Model training and task processing method, device, system, equipment and storage medium
Technical Field
Embodiments of the present invention relate to computer technology, and in particular to a method, device, system, equipment and storage medium for model training and task processing.
Background
With the development of Artificial Intelligence (AI) technology, task processing models implemented based on AI algorithms (such as natural language processing models and recommendation models) have been widely adopted. Because model training on a single node has become extremely time-consuming, or even infeasible, as model sizes and data sets grow, distributed training methods have been proposed.
The gradient descent method is widely used for optimizing models, and combining it with distributed training yields the distributed gradient descent method. During model training, the distributed gradient descent method needs to transmit gradient information between nodes. In the process of implementing the invention, the inventors found that, as the model scale and the number of nodes increase, the amount of data communicated when training a model with the distributed gradient descent method grows larger and larger, the communication overhead increases significantly, the communication efficiency is low, and the model training speed is slow.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a system, a device, and a storage medium for model training and task processing, which can reduce communication overhead, improve communication efficiency, and increase model training speed.
In a first aspect, an embodiment of the present invention provides a model training method, where the model training method is applied to each node device in a distributed cluster, where the distributed cluster further includes a central device, and the model training method includes:
determining gradient information corresponding to an initial task processing model of current node equipment in the current round of training, and performing compression processing on the gradient information to obtain a node compression gradient of the current node equipment, wherein the initial task processing model of the current round of training is a target task processing model obtained in the previous round of training of the current node equipment, and the current node equipment is any one node equipment in the node equipment;
uploading the node compression gradient of the current node equipment to the central equipment, and acquiring a central compression gradient from the central equipment, wherein the central compression gradient is obtained by fusing and compressing the node compression gradient uploaded by each node equipment by the central equipment;
determining the gradient momentum used by the current training round according to the gradient momentum used by the previous training round and the central compression gradient;
and updating the initial task processing model of the current round of training based on the gradient momentum used by the current round of training to obtain the target task processing model of the current round of training.
In a second aspect, an embodiment of the present invention provides a task processing method, including:
acquiring current characteristic information of a target task;
inputting the current feature information into a global target model obtained by training with the model training method according to the embodiment of the present invention, and processing the target task based on the current feature information by using the global target model, thereby obtaining a processing result of the target task.
In a third aspect, an embodiment of the present invention provides a model training apparatus, where the model training apparatus is applied to each node device in a distributed cluster, the distributed cluster further includes a central device, and the model training apparatus includes:
the gradient compression module is used for determining gradient information corresponding to an initial task processing model of current node equipment in the current round of training and compressing the gradient information to obtain a node compression gradient of the current node equipment, wherein the initial task processing model of the current round of training is a target task processing model obtained in the previous round of training of the current node equipment, and the current node equipment is any one node equipment in the node equipment;
the gradient transmission module is used for uploading the node compression gradient of the current node equipment to the central equipment and acquiring a central compression gradient from the central equipment, wherein the central compression gradient is obtained by fusing and compressing the node compression gradient uploaded by each node equipment by the central equipment;
the momentum determining module is used for determining the gradient momentum used by the current round of training according to the gradient momentum used by the previous round of training and the central compression gradient;
and the model updating module is used for updating the initial task processing model of the current round of training based on the gradient momentum used by the current round of training to obtain the target task processing model of the current round of training.
In a fourth aspect, an embodiment of the present invention provides a task processing apparatus, including:
the characteristic acquisition module is used for acquiring the current characteristic information of the target task;
and the task processing module is used for inputting the current characteristic information into a global target model obtained by training with the model training method according to the embodiment of the present invention, so as to process the target task based on the current characteristic information by using the global target model, thereby obtaining a processing result of the target task.
In a fifth aspect, an embodiment of the present invention provides a model training system, where the model training system includes a distributed cluster, and the distributed cluster includes a central device and node devices configured to execute the model training method according to any one of the embodiments of the present invention.
In a sixth aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a model training method according to any one of the embodiments of the present invention when executing the program, or implements a task processing method according to the embodiment of the present invention when executing the program.
In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used to implement the model training method according to any one of the embodiments of the present invention when executed by a processor, or the computer program is used to implement the task processing method according to the embodiment of the present invention when executed by the processor.
In the embodiment of the invention, gradient information corresponding to the initial task processing model of the current node equipment in the distributed cluster in the current round of training can be determined, and the gradient information is compressed to obtain the node compression gradient of the current node equipment; uploading the node compression gradients of the current node equipment to central equipment, and acquiring the central compression gradients from the central equipment, wherein the central compression gradients are obtained by fusing and compressing the node compression gradients uploaded by each node equipment by the central equipment; determining the gradient momentum used in the training of the current round according to the gradient momentum used in the previous round of training and the central compression gradient; and updating the initial task processing model of the training round based on the gradient momentum used by the training round to obtain the target task processing model of the training round. In the embodiment of the invention, when the model is trained, the communication gradient between the node equipment and the central equipment is compressed and transmitted, so that the transmitted data volume is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic flow chart diagram of a model training method provided by an embodiment of the present invention;
FIG. 2 is another schematic flow chart diagram of a model training method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a task processing device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present invention, where the method may be executed by a model training apparatus according to an embodiment of the present invention, and the apparatus may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated in an electronic device, which may be, for example, a computer, a server, or the like. The following embodiment will be described by taking as an example that the apparatus is integrated in an electronic device, where the electronic device may be a current node device, the current node device may be any one of node devices in a distributed cluster, the distributed cluster further includes a central device, and each node device and the central device form a training cluster centered on the central device. Referring to fig. 1, the method may specifically include the following steps:
step 101, determining gradient information corresponding to an initial task processing model of current node equipment in the current round of training, and compressing the gradient information to obtain a node compression gradient of the current node equipment, wherein the initial task processing model of the current round of training is a target task processing model obtained in the previous round of training of the current node equipment.
In the embodiment of the present invention, the task processing model may be a model for processing a target task, the model may be a model implemented by a neural network such as a convolutional neural network or a deep neural network, and the target task may be a target detection task, an article recommendation task, a text classification task, a machine translation task, and the like, which is not specifically limited herein.
For example, the current round of training is by default a non-first round of training of the current node device, while the previous round of training may be either the first round or a non-first round of training of the current node device. When the previous round of training is the first round of training of the current node device, the original model uniformly distributed by the central device may first be obtained, and all model parameters of the original model are initialized randomly, that is, all model parameters of the original model are assigned random values; after the random assignment, the initial task processing model of the first round of training of the current node device is obtained. Next, the initially set data used for the first round of training may be obtained, which may include the initial first-order gradient momentum, the initial second-order gradient momentum, the initial accumulated compression residual, the learning rate, the first-order hyper-parameter, the second-order hyper-parameter and the like of the current node device. Then, gradient information corresponding to the initial task processing model of the first round of training is calculated based on the training data on the current node device, the gradient information is compressed based on the initial accumulated compression residual to obtain the node compression gradient of the first round of training of the current node device, and the central compression gradient of the first round of training is acquired from the central device. The gradient momentum for the first round of training is determined based on the initial first-order gradient momentum, the initial second-order gradient momentum, the first-order hyper-parameter, the second-order hyper-parameter, the central compression gradient and the like, and the initial task processing model of the first round of training is updated based on this gradient momentum to obtain the target task processing model of the first round of training. If the previous round of training is a non-first round of training of the current node device, its procedure is the same as that of the current round of training of the current node device, which will be described in detail later.
Specifically, the local training data of the current node device may be selected from the local training data set, a loss function of the initial task processing model of the current round of training is determined based on the local training data, a gradient of the loss function at each model parameter of the initial task processing model of the current round of training is calculated, and gradient information corresponding to the initial task processing model of the current round of training is obtained, that is, the gradient information may be considered as a set or a vector formed by gradients of the loss function of the initial task processing model at each model parameter. Since the neural network includes a large amount of model parameters, the data size of the gradient information is very large.
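As a rough PyTorch-style sketch of this step (the model, loss function and sample shapes below are illustrative stand-ins, not the patent's implementation), the gradient information can be collected as one flat vector with one entry per model parameter:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))   # stand-in task model
loss_fn = nn.CrossEntropyLoss()

def local_gradient_info(model, x, y):
    """Gradient of the loss at every model parameter, flattened into one vector."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

x = torch.randn(4, 32)              # a small batch of local training data
y = torch.randint(0, 10, (4,))      # the corresponding sample labels
gradient_info = local_gradient_info(model, x, y)
print(gradient_info.numel())        # one entry per model parameter
```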
The training data of the current round selected from the local training data set may be one sample data selected randomly or a batch of sample data selected randomly. The local training dataset may include a large amount of sample data obtained by analyzing and labeling locally generated and acquired data, that is, the sample data in the embodiment of the present invention is sample data with a sample label.
Illustratively, the target task is a target detection task, and the local training data set of the current node device may be constructed according to the local image database. For example, targets (e.g., people and vehicles) in original images in a local image database may be marked to obtain marked images, and the marks may be regarded as sample labels, so that a large amount of sample data is constructed, and a local training data set is constructed according to the large amount of sample data.
In compressing the gradient information, the gradient information may be compressed based on a residual using a compression function having a contraction property. For example, an accumulated compressed residual may be locally stored in the current node device, and the accumulated compressed residual is used to accumulate a loss amount of the gradient information caused by compression of the current node device in each communication process, and the current gradient information may be compressed by using a compression function after the accumulated compressed residual is added to the current gradient information, so as to reduce an influence of gradient compression on a training effect.
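As a rough sketch of this residual-corrected compression, the snippet below uses top-k sparsification merely as an example of a compression function with the contraction property; the patent does not fix a particular compressor, so treat that choice (and the vector sizes) as assumptions.

```python
import torch

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = torch.zeros_like(v)
    idx = torch.topk(v.abs(), k).indices
    out[idx] = v[idx]
    return out

def compress_with_residual(grad, residual, k=10):
    """Compress (gradient + accumulated residual); return the node compression gradient
    and the updated accumulated compression residual."""
    corrected = grad + residual
    compressed = topk_compress(corrected, k)
    return compressed, corrected - compressed     # what was lost is carried to the next round

residual = torch.zeros(1000)                      # initial accumulated compression residual
grad = torch.randn(1000)                          # gradient information of the current round
node_grad, residual = compress_with_residual(grad, residual)
```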
And 102, uploading the node compression gradient of the current node equipment to the central equipment, and acquiring the central compression gradient from the central equipment, wherein the central compression gradient is obtained by fusing and compressing the node compression gradient uploaded by each node equipment by the central equipment.
After gradient compression, the amount of data is reduced, so uploading the node compression gradient to the central device significantly reduces the amount of communicated data and improves the communication speed compared with uploading the current gradient information directly. In addition, the central compression gradient is obtained by the central device fusing and compressing the node compression gradients uploaded by the node devices, i.e. the gradient information transmitted back from the other end of the communication is also compressed, which further reduces the amount of data transmitted in communication.
And 103, determining the gradient momentum used in the training of the current round according to the gradient momentum used in the previous round of training and the central compression gradient.
In practical applications, because the number of samples in the training data set is huge, traversing the whole training data set in every round of training would consume a large amount of computing resources and make training slow. Therefore, in the embodiment of the invention, each round of training selects the training data of that round from the training data set, and the training data of a round is a small batch (or even a single item) of sample data, so the computing resources occupied by each round of training can be reduced. However, because only a small batch of sample data is used in each round, when training with the gradient descent method each descent step is not strictly in the direction of the minimum, and a left-right oscillation phenomenon may occur, so that the final model converges very slowly. To address this problem, the embodiment of the invention trains the model with a momentum gradient descent method, i.e. momentum is used instead of the raw gradient to optimize the model parameters.
With the momentum gradient descent method, the previous update velocity is taken into account each time the model parameters are updated. The update amplitude of each model parameter in each direction depends not only on the current gradient but also on whether the past gradients are consistent in that direction: if the gradient at a certain parameter keeps pointing in the current direction (i.e. the gradient directions of that parameter over the recent period are consistent), the update amplitude of that parameter grows with each update; if the gradient direction at a certain parameter keeps changing (i.e. the gradient directions of that parameter over the recent period are inconsistent), the update amplitude is attenuated. As a result, a larger learning rate can be used and the model can converge more quickly.
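A tiny numeric sketch of the effect described above (the gradient sequences and the coefficient 0.9 are arbitrary illustrative values):

```python
def momentum_trace(grads, beta=0.9):
    """Exponential moving average of a gradient sequence, as used in momentum methods."""
    m, trace = 0.0, []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        trace.append(round(m, 3))
    return trace

print(momentum_trace([1.0, 1.0, 1.0, 1.0]))     # consistent direction: the momentum builds up
print(momentum_trace([1.0, -1.0, 1.0, -1.0]))   # oscillating direction: the momentum stays small
```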
And step 104, updating the initial task processing model of the training round based on the gradient momentum used by the training round to obtain a target task processing model of the training round.
Specifically, the update direction and the update amplitude of each model parameter of the initial task processing model of the current round of training may be determined based on the gradient momentum used in the current round of training, and each model parameter of the initial task processing model of the current round of training may be updated based on the update direction and the update amplitude of each model parameter, so as to obtain the target task processing model of the current round of training.
In a specific implementation, the model parameter mentioned in the embodiment of the present invention may be a structure parameter of the model, for example, a weight, an offset, and the like of each layer of the model. For example, when the task processing model is a convolutional neural network model, the model parameters of the convolutional layer may include weights of convolutional kernels and offsets of channels, and the model parameters of the fully-connected layer may include weights of the fully-connected layer and offsets of channels.
In the embodiment of the invention, gradient information corresponding to the initial task processing model of the current node equipment in the distributed cluster in the current round of training can be determined, and the gradient information is compressed to obtain the node compression gradient of the current node equipment; uploading the node compression gradients of the current node equipment to central equipment, and acquiring the central compression gradients from the central equipment, wherein the central compression gradients are obtained by fusing and compressing the node compression gradients uploaded by each node equipment by the central equipment; determining the gradient momentum used by the training of the current round according to the gradient momentum used by the training of the previous round and the central compression gradient; and updating the initial task processing model of the training round based on the gradient momentum used by the training round to obtain the target task processing model of the training round. In the embodiment of the invention, when the model is trained, the communication gradient between the node equipment and the central equipment is compressed and transmitted, so that the transmitted data volume is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
The model training method provided in the embodiment of the present invention is further described below, and as shown in fig. 2, the method may include the following steps:
step 201, selecting training data of the current round from a local training data set.
For example, one sample data may be randomly selected from the local training data set, and the randomly selected sample data may be used as the training data of the current round, so as to reduce the computing resources used in the training process.
Step 202, determining a loss function of the initial task processing model of the training round based on the training data of the training round.
Specifically, the training data of the current round is sample data with sample labels, and a loss function of the initial task processing model can be determined based on the actual output of the initial task processing model of the current round of training and the sample labels.
And 203, calculating the gradient of the loss function at each model parameter of the initial task processing model of the training round to obtain gradient information corresponding to the initial task processing model of the training round.
That is, the gradient information can be regarded as a set or vector of gradients of the loss function of the initial task processing model at the respective model parameters. Since the neural network includes a large number of model parameters, the data size of the gradient information is very large.
Step 204, obtaining the accumulated compression residual error of the current node device.
Specifically, the accumulated compression residual may be considered as the sum of the loss amount of the gradient information due to the compression of the current node device in each communication process accumulated in the current node device until the current training round.
And step 205, compressing the gradient information by using a preset compression function based on the accumulated compression residual error to obtain a node compression gradient of the current node equipment.
Specifically, the accumulated compression residual and the gradient information may be summed, the summed data may be compressed by using a preset compression function, the preset compression function may be any compression function with a contraction property, and after compression, the node compression gradient of the current node device in the current training round is obtained.
For example, the node compression gradient of the current node device in the current round of training may be calculated by the following formula:
$$\hat{g}_{t,i} = Q_{\mathrm{worker}}\left(g_{t,i} + e_{t-1,i}\right)$$

where $t$ represents the training round, $i$ represents the index of a node device, $\hat{g}_{t,i}$ represents the node compression gradient of the $t$-th round of training on node device $i$, $g_{t,i}$ represents the gradient information corresponding to the initial task processing model of the $t$-th round of training on node device $i$, $e_{t-1,i}$ represents the accumulated compression residual after the $(t-1)$-th round of training on node device $i$, and $Q_{\mathrm{worker}}$ represents the preset compression function on the node device.
And step 206, determining the compression residual error according to the gradient information and the node compression gradient of the current node device.
For example, the node compression gradient of the current node device may be subtracted from the gradient information, so as to obtain the current compression residual.
And step 207, updating the accumulated compression residual based on the current compression residual.
Specifically, the current accumulated compressed residual may be added to the current compressed residual, so as to update the accumulated compressed residual. For example, the accumulated compression residual of the current node device may be updated according to the following formula:
$$e_{t,i} = e_{t-1,i} + g_{t,i} - \hat{g}_{t,i}$$

where $e_{t,i}$ represents the accumulated compression residual after the $t$-th round of training on node device $i$.
And step 208, uploading the node compression gradient of the current node equipment to the central equipment, and acquiring the central compression gradient from the central equipment.
Specifically, after receiving the node compression gradients sent by each node device (including the current node device), the central device may calculate an average value of the node compression gradients to obtain a compression gradient average value; the central device may also locally maintain a cumulative compression residual, where the cumulative compression residual is used to accumulate a loss amount of gradient information caused by compression of the central device in each communication process, and the central device sums the cumulative compression residual and the average compression gradient value, and compresses the summed data to obtain a central compression gradient, and the current node device obtains the central compression gradient from the central device.
For example, the central compression gradient may be calculated by the following formula:
$$\bar{g}_{t} = Q_{\mathrm{server}}\left(\tilde{g}_{t} + E_{t-1}\right), \qquad \tilde{g}_{t} = \frac{1}{N}\sum_{i=1}^{N}\hat{g}_{t,i}$$

where $\bar{g}_{t}$ represents the central compression gradient of the $t$-th round of training, $\tilde{g}_{t}$ represents the compressed gradient mean of the $t$-th round of training, $E_{t-1}$ represents the accumulated compression residual after the $(t-1)$-th round of training on the central device, $Q_{\mathrm{server}}$ represents the preset compression function on the central device, and $N$ represents the total number of node devices in the distributed cluster.
In addition, the central device may also update its accumulated compression residual in each round of training, and the updating method may be as follows:
$$E_{t} = \tilde{g}_{t} + E_{t-1} - \bar{g}_{t}$$

where $E_{t}$ represents the accumulated compression residual after the $t$-th round of training on the central device.
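As a rough illustration of the central-device computation just described, the sketch below averages the node compression gradients, adds the central accumulated residual, compresses the result again, and carries the new residual forward; top-k sparsification is only an assumed example of $Q_{\mathrm{server}}$.

```python
import torch

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest (illustrative compressor)."""
    out = torch.zeros_like(v)
    idx = torch.topk(v.abs(), k).indices
    out[idx] = v[idx]
    return out

def central_compression_gradient(node_grads, E_prev, k=10):
    """node_grads: list of node compression gradients, one tensor per node device."""
    mean_grad = torch.stack(node_grads).mean(dim=0)   # compressed-gradient mean
    corrected = mean_grad + E_prev
    central_grad = topk_compress(corrected, k)        # central compression gradient
    E_new = corrected - central_grad                  # new central accumulated residual
    return central_grad, E_new
```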
And step 209, determining the first-order gradient momentum used in the training of the current round according to the first-order hyper-parameter, the first-order gradient momentum used in the previous round of training and the central compression gradient.
For example, the first-order gradient momentum used in the training round can be calculated according to the following formula:
$$m_{t} = \beta_{1} m_{t-1} + \left(1 - \beta_{1}\right)\bar{g}_{t}$$

where $m_{t}$ represents the first-order gradient momentum used in the $t$-th round of training, $\beta_{1}$ represents the first-order hyper-parameter, and $m_{t-1}$ represents the first-order gradient momentum used in the $(t-1)$-th round of training.
And step 210, determining the second-order gradient momentum used in the training of the current round according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training and the central compression gradient.
Specifically, the candidate second-order gradient momentum used in the training of the current round can be determined according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training and the central compression gradient; the larger one of the second-order gradient momentum used in the previous round of training and the candidate second-order gradient momentum used in the current round of training is selected to obtain the second-order gradient momentum used in the current round of training, so that the model convergence can be accelerated, and the model training speed is improved.
For example, the candidate second-order gradient momentum used in the current round of training may be calculated according to the following formula:
$$v_{t} = \beta_{2}\,\hat{v}_{t-1} + \left(1 - \beta_{2}\right)\bar{g}_{t}^{2}$$

where $v_{t}$ represents the candidate second-order gradient momentum of the $t$-th round of training, $\hat{v}_{t-1}$ represents the second-order gradient momentum used in the $(t-1)$-th round of training, and $\beta_{2}$ represents the second-order hyper-parameter. The second-order gradient momentum used in the $t$-th round of training is then

$$\hat{v}_{t} = \max\left(\hat{v}_{t-1}, v_{t}\right)$$

where $\hat{v}_{t}$ represents the second-order gradient momentum used in the $t$-th round of training.
And step 211, determining a descending gradient of the training of the current round based on the first-order gradient momentum and the second-order gradient momentum used in the training of the current round.
And step 212, updating the initial task processing model of the training round according to the descending gradient of the training round to obtain a target task processing model of the training round.
Specifically, the initial task processing model is updated, that is, each model parameter of the initial task processing model is updated, and the initial task processing model of the training round may be updated according to the following formula:
$$x_{t+1} = x_{t} - \alpha\,\eta_{t}$$

where $x_{t+1}$ represents the model parameters of the initial task processing model of the $(t+1)$-th round of training, i.e. the model parameters of the target task processing model obtained by the $t$-th round of training, $x_{t}$ represents the model parameters of the initial task processing model of the $t$-th round of training, $\alpha$ represents the learning rate, and $\eta_{t}$ represents the descending gradient of the $t$-th round of training.
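The following sketch puts steps 209-212 together for one round on a node device. The descending gradient is taken here as $m_t / (\sqrt{\hat{v}_t} + \epsilon)$, an Adam/AMSGrad-style choice; the patent only states that it is determined from the two momenta, so this concrete form and the default hyper-parameters are assumptions.

```python
import torch

def adaptive_update(x, central_grad, m, v_hat, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * central_grad                 # first-order gradient momentum
    v = beta2 * v_hat + (1.0 - beta2) * central_grad.pow(2)      # candidate second-order momentum
    v_hat = torch.maximum(v_hat, v)                              # keep the larger one (step 210)
    x = x - lr * m / (v_hat.sqrt() + eps)                        # update with the descending gradient
    return x, m, v_hat
```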
Step 213, determining whether a training cutoff condition is reached, if so, executing step 214, otherwise, returning to step 201.
For example, the training cutoff condition may be a set limit on the number of training rounds or a set loss-function condition, which is not specifically limited here. Under the training-round limit, training is stopped, for example, when the number of training rounds reaches a preset number of rounds; under the loss-function condition, training is stopped when the loss function value of the target task processing model obtained by the local device is minimal or lower than a preset loss function value, or when the average loss function value of the target task processing models obtained by the devices is minimal or lower than a preset loss function value.
And step 214, determining the target task processing model obtained in the last round of training as the node target model of the current node equipment.
And step 215, uploading the node target model of the current node equipment to the central equipment.
And step 216, acquiring a global target model from the central equipment, wherein the global target model is obtained by fusing the node target models uploaded by the node equipment by the central equipment.
For example, after receiving the node target models uploaded by each node device, the central device may calculate an average value of each model parameter of each node target model, thereby obtaining a global target model.
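A rough sketch of this fusion step, averaging each model parameter over the node target models via their state dicts (an illustrative way to do it in PyTorch, not necessarily the patent's exact procedure):

```python
import torch

def average_node_models(node_state_dicts):
    """node_state_dicts: list of state_dicts of the node target models uploaded by the nodes."""
    averaged = {}
    for name in node_state_dicts[0]:
        averaged[name] = torch.stack([sd[name].float() for sd in node_state_dicts]).mean(dim=0)
    return averaged   # load with model.load_state_dict(averaged) to obtain the global target model
```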
Through analysis of the model training method provided by the embodiment of the invention, when the smoothness (continuity) assumption, the bounded variance assumption, the bounded gradient assumption and the contraction property of the compression function (with contraction constant C, a fixed constant) are satisfied, the convergence speed of the model of the embodiment of the invention can satisfy the following formula:
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right] \leq \mathcal{O}\!\left(\frac{1}{\sqrt{NT}}\right)$$

where $T$ represents the total number of training rounds, $\mathbb{E}$ denotes the expectation, $\nabla$ denotes the gradient, $f(x_{t})$ represents the loss function of the $t$-th round of training, and $\mathcal{O}(\cdot)$ represents an asymptotic upper bound on the order of the function.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of performing these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed alternately with other steps or with at least some of the sub-steps or stages of other steps.
For example, the algorithm involved in the whole model training process on a node device may include:

1. Initialize the model parameters, the first-order gradient momentum, the second-order gradient momentum, the local accumulated compression residual of each node device, the local accumulated compression residual of the central device, the learning rate, the first-order hyper-parameter and the second-order hyper-parameter;
2. Node device main process:
3. Set the total number of training rounds T;
4. For each node device, perform:
5. Calculate the gradient information $g_{t,i}$ corresponding to the initial task processing model of the current node device in the current round of training;
6. Calculate the node compression gradient of the current node device in the current round of training: $\hat{g}_{t,i} = Q_{\mathrm{worker}}\left(g_{t,i} + e_{t-1,i}\right)$;
7. Update the accumulated compression residual of the current node device: $e_{t,i} = e_{t-1,i} + g_{t,i} - \hat{g}_{t,i}$;
8. Upload the node compression gradient $\hat{g}_{t,i}$ to the central device and obtain the central compression gradient $\bar{g}_{t}$ from the central device;
9. Calculate the first-order gradient momentum used in the current round of training: $m_{t} = \beta_{1} m_{t-1} + \left(1 - \beta_{1}\right)\bar{g}_{t}$;
10. Calculate the candidate second-order gradient momentum used in the current round of training: $v_{t} = \beta_{2}\,\hat{v}_{t-1} + \left(1 - \beta_{2}\right)\bar{g}_{t}^{2}$;
11. Determine the second-order gradient momentum used in the current round of training: $\hat{v}_{t} = \max\left(\hat{v}_{t-1}, v_{t}\right)$;
12. Update the initial task processing model of the current round of training based on the first-order gradient momentum and the second-order gradient momentum used in the current round of training to obtain the target task processing model of the current round of training: $x_{t+1} = x_{t} - \alpha\,\eta_{t}$, where the descending gradient $\eta_{t}$ is determined from $m_{t}$ and $\hat{v}_{t}$.
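Tying steps 5-12 together, the following is a minimal sketch of one complete training round on a node device, under the same assumptions as the earlier snippets: `compress` is any contractive compression function (e.g. the top-k helper above), `send_to_server` / `recv_from_server` stand in for the real communication layer, and the Adam/AMSGrad-style descending gradient is an assumed concrete choice rather than a formula stated in the patent.

```python
import torch

def node_round(x, grad_info, state, compress, send_to_server, recv_from_server,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One training round on a node device; `state` holds e (residual), m, v_hat."""
    # step 6: compress (gradient information + accumulated compression residual)
    corrected = grad_info + state["e"]
    node_grad = compress(corrected)
    # step 7: update the node's accumulated compression residual
    state["e"] = corrected - node_grad
    # step 8: upload the node compression gradient, obtain the central compression gradient
    send_to_server(node_grad)
    central_grad = recv_from_server()
    # steps 9-11: first-order momentum, candidate and final second-order momentum
    state["m"] = beta1 * state["m"] + (1 - beta1) * central_grad
    v = beta2 * state["v_hat"] + (1 - beta2) * central_grad.pow(2)
    state["v_hat"] = torch.maximum(state["v_hat"], v)
    # step 12: update the model parameters with the (assumed) descending gradient
    return x - lr * state["m"] / (state["v_hat"].sqrt() + eps)
```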
The algorithm involved by the central device may include:

1. Central device main process;
2. Set the total number of training rounds T;
3. Receive the node compression gradient $\hat{g}_{t,i}$ uploaded by each node device;
4. Calculate the compressed gradient mean: $\tilde{g}_{t} = \frac{1}{N}\sum_{i=1}^{N}\hat{g}_{t,i}$;
5. Calculate the central compression gradient: $\bar{g}_{t} = Q_{\mathrm{server}}\left(\tilde{g}_{t} + E_{t-1}\right)$;
6. Update the accumulated compression residual of the central device: $E_{t} = \tilde{g}_{t} + E_{t-1} - \bar{g}_{t}$;
7. Send the central compression gradient $\bar{g}_{t}$ to each node device.
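A matching sketch of one round on the central device (steps 3-7), again with `compress` standing in for the unspecified contractive compression function and the two callbacks standing in for the real communication layer.

```python
import torch

def server_round(E, recv_from_nodes, broadcast, compress):
    """One training round on the central device; E is its accumulated compression residual."""
    node_grads = recv_from_nodes()                    # step 3: node compression gradients
    mean_grad = torch.stack(node_grads).mean(dim=0)   # step 4: compressed-gradient mean
    corrected = mean_grad + E
    central_grad = compress(corrected)                # step 5: central compression gradient
    new_E = corrected - central_grad                  # step 6: updated central residual
    broadcast(central_grad)                           # step 7: send to each node device
    return new_E
```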
In other words, the embodiment of the invention provides a distributed adaptive stochastic gradient descent model training method in which the gradients communicated between the node devices and the central device are compressed before transmission, so that the amount of transmitted data is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
Furthermore, during model training, the embodiment of the invention updates both the first-order and second-order gradient momenta to adaptively adjust the learning rate, which can accelerate model convergence and increase the model training speed.
In addition, by analyzing the training process, the embodiment of the invention provides the upper bound of the convergence speed of the model, so that the method provided by the embodiment of the invention can be ensured to obtain the linear acceleration effect along with the increase of the scale of the node equipment, and the superiority of the method in large-scale training is reflected.
Finally, in the distributed adaptive stochastic gradient descent model training method provided by the embodiment of the invention, in the training process, the compression algorithm can be constructed on not only the node equipment but also the central equipment, and the construction method is simple and flexible.
The embodiment of the invention also provides a task processing method, which comprises the following steps:
(1) And acquiring current characteristic information of the target task.
(2) The current characteristic information is input into a global target model obtained by training according to the model training method of the embodiment of the invention, so that the target task is processed by utilizing the global target model based on the current characteristic information, and the processing result of the target task is obtained.
The target tasks may include, but are not limited to, target detection tasks, item recommendation tasks, text classification tasks, machine translation tasks, and the like. Taking the target task as an example of the target detection task, the current feature information of the target task may be feature information of a current image, for example, the feature information of the current image may be input into a global target model, the output of the global target model may be a detection result of the current image, and the detection result may include, for example, a position where the target is located, a type of the target (e.g., a person, a vehicle), a confidence level, and the like. The global target model can be obtained by training with the model training method provided by the embodiment of the invention, and the specific process of model training and use is not repeated here.
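A minimal sketch of the inference side (the architecture, feature dimension and interpretation of the output below are illustrative stand-ins):

```python
import torch
import torch.nn as nn

# the global target model obtained with the training procedure above (stand-in architecture)
global_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
global_model.eval()

features = torch.randn(1, 32)            # current feature information of the target task
with torch.no_grad():
    result = global_model(features)      # processing result of the target task
print(result.argmax(dim=1))              # e.g. the predicted class for a classification-style task
```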
For the global target model used to process the target task, the gradients communicated between the node devices and the central device are compressed before transmission during training, so that the amount of transmitted data is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
Fig. 3 is a structural diagram of a model training apparatus provided in an embodiment of the present invention, where the model training apparatus is applied to each node device in a distributed cluster, and the distributed cluster further includes a central device, and the apparatus is adapted to execute the model training method provided in the embodiment of the present invention. As shown in fig. 3, the apparatus may specifically include:
a gradient compression module 301, configured to determine gradient information corresponding to an initial task processing model of a current node device in a current round of training, and perform compression processing on the gradient information to obtain a node compression gradient of the current node device, where the initial task processing model of the current round of training is a target task processing model obtained in a previous round of training of the current node device, and the current node device is any one node device in the node devices;
a gradient transmission module 302, configured to upload the node compression gradient of the current node device to the central device, and obtain a central compression gradient from the central device, where the central compression gradient is obtained by performing fusion compression on the node compression gradients uploaded by each node device by the central device;
a momentum determination module 303, configured to determine a gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient;
a model updating module 304, configured to update the initial task processing model of the current round of training based on the gradient momentum used in the current round of training, so as to obtain a target task processing model of the current round of training.
In an embodiment, the determining, by the gradient compression module 301, gradient information corresponding to an initial task processing model of the current node device in the current round of training includes:
selecting training data of the current round from a local training data set;
determining a loss function of an initial task processing model of the current round of training based on the current round of training data;
and calculating the gradient of the loss function at each model parameter of the initial task processing model of the current round of training to obtain the gradient information corresponding to the initial task processing model of the current round of training.
In an embodiment, the compressing the gradient information by the gradient compression module 301 to obtain the node compression gradient of the current node device includes:
acquiring an accumulated compression residual error of the current node equipment;
and compressing the gradient information by using a preset compression function based on the accumulated compression residual error to obtain the node compression gradient of the current node equipment.
In one embodiment, the apparatus further comprises:
a residual error updating module, configured to determine a current compression residual error according to the gradient information and the node compression gradient of the current node device; and updating the accumulated compression residual based on the compression residual of this time.
In an embodiment, the momentum determination module 303 is specifically configured to:
determining the first-order gradient momentum used by the current training according to the first-order hyper-parameter, the first-order gradient momentum used by the previous training and the central compression gradient;
and determining the second-order gradient momentum used by the training of the current round according to the second-order hyper-parameter, the second-order gradient momentum used by the training of the previous round and the central compression gradient.
In one embodiment, the momentum determination module 303 determining the second-order gradient momentum used in the current round of training according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training, and the central compression gradient includes:
determining candidate second-order gradient momentum used by the training of the current round according to the second-order hyper-parameter, the second-order gradient momentum used by the previous round of training and the central compression gradient;
and selecting the larger one of the second-order gradient momentum used in the previous round of training and the candidate second-order gradient momentum used in the current round of training to obtain the second-order gradient momentum used in the current round of training.
In an embodiment, the model update module 304 is specifically configured to:
determining a falling gradient of the current round of training based on a first order gradient momentum and a second order gradient momentum used by the current round of training;
and updating the initial task processing model of the training round according to the descending gradient of the training round to obtain the target task processing model of the training round.
In one embodiment, the apparatus further comprises:
a condition determining module for determining whether a training cutoff condition is reached;
if the training cutoff condition is not met, the process returns to the gradient compression module 301 to execute the step of determining the gradient information corresponding to the initial task processing model of the current node device in the current round of training; when the training cutoff condition is met, the model updating module 304 determines the target task processing model obtained in the last round of training as the node target model of the current node device.
In one embodiment, the apparatus further comprises:
the model uploading module is used for uploading the node target model of the current node equipment to the central equipment;
and the model acquisition module is used for acquiring a global target model from the central equipment, and the global target model is obtained by fusing the node target models uploaded by the node equipment by the central equipment.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the functional module, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
The model training device of the embodiment of the invention can determine the gradient information corresponding to the initial task processing model of the current node equipment in the distributed cluster in the current round of training, and compress the gradient information to obtain the node compression gradient of the current node equipment; uploading the node compression gradients of the current node equipment to central equipment, and acquiring a central compression gradient from the central equipment, wherein the central compression gradient is obtained by fusing and compressing the node compression gradients uploaded by each node equipment by the central equipment; determining the gradient momentum used in the training of the current round according to the gradient momentum used in the previous round of training and the central compression gradient; and updating the initial task processing model of the training round based on the gradient momentum used by the training round to obtain the target task processing model of the training round. In the embodiment of the invention, when the model is trained, the communication gradients of the node equipment and the central equipment are compressed and transmitted, so that the transmitted data volume is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
Fig. 4 is a block diagram of a task processing device according to an embodiment of the present invention, which is suitable for executing the task processing method according to the embodiment of the present invention. As shown in fig. 4, the apparatus may specifically include:
a feature obtaining module 401, configured to obtain current feature information of the target task;
a task processing module 402, configured to input the current feature information into a global target model obtained by training with the model training method according to the embodiment of the present invention, so as to process the target task based on the current feature information by using the global target model, thereby obtaining a processing result of the target task.
According to the global target model used by the task processing device, in the training process, the communication gradients of the node equipment and the central equipment are compressed and transmitted, so that the data volume of transmission is reduced, the communication overhead is reduced, the communication efficiency is improved, and the model training speed is further improved.
The embodiment of the invention also provides a model training system, which comprises a distributed cluster, wherein the distributed cluster comprises a central device and node devices for executing the model training method.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present invention may be implemented by software or by hardware. The described modules and/or units may also be provided in a processor, which may, for example, be described as: a processor comprising a gradient compression module, a gradient transmission module, a momentum determination module, and a model update module; alternatively, it may be described as: a processor comprising a feature acquisition module and a task processing module. The names of these modules do not, in some cases, constitute a limitation of the modules themselves.
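Purely as an organizational sketch, and only under the assumption of a Python implementation with the method names below (the embodiments do not mandate any particular layout), the first grouping of modules could be laid out as follows:

```python
class ModelTrainingProcessor:
    """Illustrative grouping of the four node-side modules named above (assumed layout)."""

    def gradient_compression(self, params, batch):
        """Determine this round's gradient information and compress it into a node compression gradient."""
        raise NotImplementedError

    def gradient_transmission(self, node_compression_gradient):
        """Upload the node compression gradient and fetch the central compression gradient."""
        raise NotImplementedError

    def momentum_determination(self, central_compression_gradient):
        """Combine the previous round's gradient momentum with the central compression gradient."""
        raise NotImplementedError

    def model_update(self, gradient_momentum):
        """Update this round's initial task processing model into its target task processing model."""
        raise NotImplementedError
```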
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: determine gradient information corresponding to an initial task processing model of a current node device in a current round of training, and compress the gradient information to obtain a node compression gradient of the current node device, wherein the initial task processing model of the current round of training is a target task processing model obtained in the previous round of training of the current node device, and the current node device is any one of the node devices; upload the node compression gradient of the current node device to the central device, and acquire a central compression gradient from the central device, wherein the central compression gradient is obtained by the central device fusing and compressing the node compression gradients uploaded by the node devices; determine the gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient; and update the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training.
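A minimal, non-authoritative sketch of one such training round on a node device may help make the steps concrete. It assumes NumPy, flat parameter vectors, a top-k operator as the compression function, communication hidden behind caller-supplied `send_fn`/`recv_fn`, and Adam-style hyper-parameters `beta1`, `beta2`, `lr`, `eps`; none of these choices, nor the function names, are fixed by the embodiments above:

```python
import numpy as np

def topk_compress(g, k):
    """One possible compression function: keep only the k largest-magnitude entries.
    g is assumed to be a flat (1-D) gradient vector."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def node_training_round(params, batch, grad_fn, send_fn, recv_fn,
                        residual, m, v,
                        k=100, beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    # 1. Gradient information of this round's initial task processing model
    #    (grad_fn is supplied by the caller and differentiates the loss on `batch`).
    grad = grad_fn(params, batch)

    # 2. Compress the gradient information with the accumulated compression residual
    #    folded in, then keep the dropped part as the new accumulated residual.
    corrected = grad + residual
    node_grad = topk_compress(corrected, k)
    residual = corrected - node_grad

    # 3. Upload the node compression gradient; receive the central compression
    #    gradient that the central device produced by fusing and re-compressing
    #    the gradients uploaded by all node devices.
    send_fn(node_grad)
    central_grad = recv_fn()

    # 4. Gradient momentum of the current round from the previous round's momentum
    #    and the central compression gradient (one conventional choice of update).
    m = beta1 * m + (1 - beta1) * central_grad
    v_candidate = beta2 * v + (1 - beta2) * central_grad ** 2
    v = np.maximum(v, v_candidate)  # keep the larger second-order momentum

    # 5. Update the initial task processing model to obtain this round's target model.
    params = params - lr * m / (np.sqrt(v) + eps)
    return params, residual, m, v
```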
Alternatively, the one or more programs, when executed by the apparatus, cause the apparatus to: acquire current feature information of a target task; and input the current feature information into a global target model obtained by training according to the model training method in the embodiments of the present invention, so that the global target model processes the target task based on the current feature information to obtain a processing result of the target task.
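For the task-processing programs, a correspondingly small sketch can be given; the `predict` interface on the global target model is an assumption of this sketch, not something the embodiments prescribe:

```python
def process_task(global_target_model, current_feature_info):
    # Feed the current feature information of the target task into the trained
    # global target model and return its processing result.
    return global_target_model.predict(current_feature_info)
```

Any concrete model object exposing such an interface, for example a recommendation or natural language processing model, could be substituted here.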
According to the technical solution of the embodiments of the present invention, gradient information corresponding to the initial task processing model of a current node device in the distributed cluster in the current round of training can be determined, and the gradient information can be compressed to obtain the node compression gradient of the current node device; the node compression gradient of the current node device is uploaded to the central device, and a central compression gradient is acquired from the central device, the central compression gradient being obtained by the central device fusing and compressing the node compression gradients uploaded by the node devices; the gradient momentum used in the current round of training is determined according to the gradient momentum used in the previous round of training and the central compression gradient; and the initial task processing model of the current round of training is updated based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training. In the embodiments of the present invention, the gradients communicated between the node devices and the central device are compressed before transmission during model training, which reduces the amount of data transmitted, lowers the communication overhead, improves the communication efficiency, and thus speeds up model training.
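The momentum determination and model update above are not tied to one formula in this summary; the following is a conventional instantiation consistent with the description, and an assumption of this sketch, with $\beta_1$, $\beta_2$ the first-order and second-order hyper-parameters, $\hat g_t$ the central compression gradient of round $t$, $\eta$ a step size, $\epsilon$ a small stability constant, and all operations taken element-wise:

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\,\hat g_t
  && \text{first-order gradient momentum of the current round}\\
\tilde v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\,\hat g_t^{\,2}
  && \text{candidate second-order gradient momentum}\\
v_t &= \max\!\left(v_{t-1},\, \tilde v_t\right)
  && \text{keep the larger second-order momentum}\\
\theta_t &= \theta_{t-1} - \eta\,\frac{m_t}{\sqrt{v_t}+\epsilon}
  && \text{descending gradient applied to the model parameters}
\end{aligned}
```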
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A model training method, applied to each node device in a distributed cluster, wherein the distributed cluster further comprises a central device, the model training method comprising:
determining gradient information corresponding to an initial task processing model of a current node device in a current round of training, and compressing the gradient information to obtain a node compression gradient of the current node device, wherein the initial task processing model of the current round of training is a target task processing model obtained in a previous round of training of the current node device, and the current node device is any one of the node devices;
uploading the node compression gradient of the current node device to the central device, and acquiring a central compression gradient from the central device, wherein the central compression gradient is obtained by the central device fusing and compressing the node compression gradients uploaded by the node devices;
determining the gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient;
and updating the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training.
2. The model training method according to claim 1, wherein the determining gradient information corresponding to the initial task processing model of the current node device in the current round of training comprises:
selecting training data of the current round from a local training data set;
determining a loss function of an initial task processing model of the current round of training based on the current round of training data;
and calculating the gradient of the loss function at each model parameter of the initial task processing model of the current round of training to obtain the gradient information corresponding to the initial task processing model of the current round of training.
3. The model training method according to claim 1, wherein the compressing the gradient information to obtain the node compression gradient of the current node device includes:
acquiring an accumulated compression residual of the current node device;
and compressing the gradient information by using a preset compression function based on the accumulated compression residual to obtain the node compression gradient of the current node device.
4. The model training method according to claim 3, wherein after the gradient information is compressed to obtain the node compression gradient of the current node device, the method further comprises:
determining a current compression residual according to the gradient information and the node compression gradient of the current node device;
and updating the accumulated compression residual based on the current compression residual.
5. The model training method according to claim 1, wherein the determining the gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient comprises:
determining the first-order gradient momentum used in the current round of training according to the first-order hyper-parameter, the first-order gradient momentum used in the previous round of training, and the central compression gradient;
and determining the second-order gradient momentum used in the current round of training according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training, and the central compression gradient.
6. The model training method of claim 5, wherein the determining the second-order gradient momentum used in the current round of training according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training, and the central compression gradient comprises:
determining a candidate second-order gradient momentum used in the current round of training according to the second-order hyper-parameter, the second-order gradient momentum used in the previous round of training, and the central compression gradient;
and selecting the larger one of the second-order gradient momentum used in the previous round of training and the candidate second-order gradient momentum used in the current round of training to obtain the second-order gradient momentum used in the current round of training.
7. The model training method according to claim 5 or 6, wherein the updating the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training comprises:
determining a descending gradient of the current round of training based on the first-order gradient momentum and the second-order gradient momentum used in the current round of training;
and updating the initial task processing model of the current round of training according to the descending gradient of the current round of training to obtain the target task processing model of the current round of training.
8. The model training method of claim 1, wherein after updating the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training, the method further comprises:
determining whether a training cutoff condition is met;
and if the training cutoff condition is not met, returning to the operation of determining the gradient information corresponding to the initial task processing model of the current node device in the current round of training until the training cutoff condition is met, and determining the target task processing model obtained in the last round of training as the node target model of the current node device.
9. The model training method of claim 8, further comprising:
uploading the node target model of the current node device to the central device;
and acquiring a global target model from the central device, wherein the global target model is obtained by the central device fusing the node target models uploaded by the node devices.
10. A task processing method, comprising:
acquiring current feature information of a target task;
and inputting the current feature information into a global target model obtained by training according to the model training method of claim 9, so as to process the target task based on the current feature information by using the global target model, thereby obtaining a processing result of the target task.
11. A model training device, applied to each node device in a distributed cluster, wherein the distributed cluster further comprises a central device, the model training device comprising:
a gradient compression module, configured to determine gradient information corresponding to an initial task processing model of a current node device in a current round of training and compress the gradient information to obtain a node compression gradient of the current node device, wherein the initial task processing model of the current round of training is a target task processing model obtained in a previous round of training of the current node device, and the current node device is any one of the node devices;
a gradient transmission module, configured to upload the node compression gradient of the current node device to the central device and acquire a central compression gradient from the central device, wherein the central compression gradient is obtained by the central device fusing and compressing the node compression gradients uploaded by the node devices;
a momentum determination module, configured to determine a gradient momentum used in the current round of training according to the gradient momentum used in the previous round of training and the central compression gradient;
and a model update module, configured to update the initial task processing model of the current round of training based on the gradient momentum used in the current round of training to obtain the target task processing model of the current round of training.
12. A task processing apparatus, characterized by comprising:
a feature acquisition module, configured to acquire current feature information of a target task;
and a task processing module, configured to input the current feature information into a global target model obtained by training according to the model training method of claim 9, so as to process the target task based on the current feature information by using the global target model, thereby obtaining a processing result of the target task.
13. A model training system, characterized in that the model training system comprises a distributed cluster comprising a central device and node devices for performing the model training method according to any one of claims 1 to 9.
14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the model training method according to any one of claims 1 to 9 or the task processing method according to claim 10.
15. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the model training method according to any one of claims 1 to 9 or the task processing method according to claim 10.
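Claim 1 leaves the central-device side of the exchange implicit: it only requires that the central compression gradient be obtained by fusing and compressing the node compression gradients uploaded by the node devices. As a hedged illustration only, assuming plain averaging as the fusion rule and the same top-k operator as on the nodes (both assumptions of this sketch), one round on the central device might look like:

```python
import numpy as np

def topk_compress(g, k):
    # One possible compression function: keep only the k largest-magnitude entries
    # of a flat (1-D) gradient vector.
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def central_round(node_compression_gradients, k=100):
    """Fuse the node compression gradients uploaded by all node devices and
    compress the result into the central compression gradient to broadcast back."""
    fused = np.mean(node_compression_gradients, axis=0)  # fusion rule assumed: plain average
    return topk_compress(fused, k)
```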
CN202211469464.8A 2022-11-22 2022-11-22 Model training and task processing method, device, system, equipment and storage medium Pending CN115759230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211469464.8A CN115759230A (en) 2022-11-22 2022-11-22 Model training and task processing method, device, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211469464.8A CN115759230A (en) 2022-11-22 2022-11-22 Model training and task processing method, device, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115759230A true CN115759230A (en) 2023-03-07

Family

ID=85336849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211469464.8A Pending CN115759230A (en) 2022-11-22 2022-11-22 Model training and task processing method, device, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115759230A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994590A (en) * 2023-03-23 2023-04-21 浪潮电子信息产业股份有限公司 Data processing method, system, equipment and storage medium based on distributed cluster
CN116680060A (en) * 2023-08-02 2023-09-01 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system
CN116680060B (en) * 2023-08-02 2023-11-03 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system

Similar Documents

Publication Publication Date Title
CN115759230A (en) Model training and task processing method, device, system, equipment and storage medium
CN113128419B (en) Obstacle recognition method and device, electronic equipment and storage medium
CN110149238B (en) Method and device for predicting flow
US11449731B2 (en) Update of attenuation coefficient for a model corresponding to time-series input data
CN111160191A (en) Video key frame extraction method and device and storage medium
CN113870334B (en) Depth detection method, device, equipment and storage medium
CN113361578A (en) Training method and device of image processing model, electronic equipment and storage medium
CN115063875A (en) Model training method, image processing method, device and electronic equipment
CN112464042B (en) Task label generating method and related device for convolution network according to relationship graph
CN113128478A (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN110992432A (en) Depth neural network-based minimum variance gradient quantization compression and image processing method
Zhang et al. MIPD: An adaptive gradient sparsification framework for distributed DNNs training
CN114049072B (en) Index determination method and device, electronic equipment and computer readable medium
CN114049162B (en) Model training method, demand prediction method, apparatus, device, and storage medium
CN110782016A (en) Method and apparatus for optimizing neural network architecture search
CN113448821B (en) Method and device for identifying engineering defects
JP7276483B2 (en) LEARNING DEVICE, CLASSIFIER, LEARNING METHOD AND LEARNING PROGRAM
CN115293292A (en) Training method and device for automatic driving decision model
CN114997419A (en) Updating method and device of rating card model, electronic equipment and storage medium
CN114565170A (en) Pollutant tracing method and device, equipment, medium and product
CN113111165A (en) Deep learning model-based alarm receiving warning condition category determination method and device
CN113111897A (en) Alarm receiving and warning condition type determining method and device based on support vector machine
CN112949313A (en) Information processing model training method, device, equipment and storage medium
CN113361701A (en) Quantification method and device of neural network model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination