WO2022042741A1 - Learning model training method, working node, server, device and medium - Google Patents

Learning model training method, working node, server, device and medium

Info

Publication number
WO2022042741A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
training
statistical parameters
model
current
Prior art date
Application number
PCT/CN2021/115544
Other languages
French (fr)
Chinese (zh)
Inventor
徐茂轩
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司
Publication of WO2022042741A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the invention relates to the technical field of deep learning, in particular to a deep learning model training method, a working node, a parameter server, an electronic device and a readable storage medium.
  • multiple worker nodes can usually be used to train the same model.
  • different worker nodes are responsible for training different training layers in the same model.
  • in this case, the next training layer must wait for the training of the previous training layer to complete before it can proceed.
  • this waiting time greatly increases the total time for model training, thereby reducing the efficiency of model training.
  • Embodiments of the present invention provide a deep learning model training method, a working node and a parameter server, so as to solve the problem of low model training efficiency in the process of using multiple working nodes to train the same deep learning model in the related art.
  • an embodiment of the present invention provides a deep learning model training method, which is applied to a worker node.
  • the method includes multiple training cycles, and each training cycle includes:
  • the multiple layers of the target model are trained according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization; starting from the second training cycle, training the target layer in each training cycle includes:
  • receiving the historical global statistical parameters sent by the parameter server, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the current target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
  • obtaining the target statistical parameters of the current target layer, wherein the target statistical parameters are the statistical parameters of the target batch training samples;
  • determining the actual statistical parameters of the current target layer based on the historical global statistical parameters and the target statistical parameters; and
  • performing batch normalization on the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server.
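  • As an illustration of how such a correction might look, the sketch below merges the historical global statistical parameters (mean, variance and sample count) with the target statistical parameters of the current batch to form the actual statistical parameters. The patent does not prescribe a specific correction formula, so the weighted parallel-variance merge and all function and variable names here are assumptions.

```python
# Hypothetical sketch: combining historical global statistics with the
# statistics of the current target batch. The weighting scheme is an
# assumption; the method only requires that the target statistics be
# corrected using the historical global statistics.
def merge_stats(hist_mean, hist_var, hist_count,
                batch_mean, batch_var, batch_count):
    """Combine two (mean, variance, count) summaries into one."""
    if hist_count == 0:                      # nothing to correct with yet
        return batch_mean, batch_var, batch_count
    total = hist_count + batch_count
    delta = batch_mean - hist_mean
    mean = hist_mean + delta * batch_count / total
    # parallel combination of variances (Chan et al. style)
    m2 = (hist_var * hist_count + batch_var * batch_count
          + delta ** 2 * hist_count * batch_count / total)
    return mean, m2 / total, total
```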
  • an embodiment of the present invention further provides a deep learning model training method, which is applied to a parameter server, wherein the deep learning model training method includes multiple computing cycles, and in each computing cycle the method includes:
  • receiving the target statistical parameters for the target layer in the target training model sent by multiple working nodes, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sent them when training the target layer in the training cycle corresponding to the current computing cycle, wherein a plurality of the working nodes belong to the same working network, and the target training model includes at least one target layer;
  • calculating, according to the received target statistical parameters, the historical global statistical parameters of the target layer corresponding to the current computing cycle in the target training model.
  • an embodiment of the present invention further provides a working node, including:
  • the sample acquisition module is used to acquire target batch training samples
  • a training module configured to train multiple layers of the target model according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization, and the training module includes:
  • the first receiving unit is configured to receive the historical global statistical parameters sent by the parameter server, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
  • a first obtaining unit configured to obtain the target statistical parameters of the currently trained target layer, wherein the target statistical parameters are the statistical parameters of the target batch training samples;
  • the determining unit is configured to determine the actual statistical parameters of the currently trained target layer based on the historical global statistical parameters and the target statistical parameters, to perform batch normalization on the corresponding target layer based on the actual statistical parameters and the target batch training samples, and to send the target statistical parameters to the parameter server.
  • an embodiment of the present invention further provides a parameter server, including:
  • the parameter receiving module is used to receive the target statistical parameters for the target layer in the target training model sent by the plurality of working nodes, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sends them when training the target layer in the training cycle corresponding to the current computing cycle, wherein a plurality of the working nodes belong to the same working network, and the target training model includes at least one target layer;
  • the updating module is used to calculate, according to the received target statistical parameters, the historical global statistical parameters of the target layer corresponding to the current computing cycle in the target training model.
  • an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a program or instruction stored in the memory and executable on the processor, where the program or instruction, when executed by the processor, implements the steps of the deep learning model training method described in the first aspect, or the steps of the deep learning model training method described in the second aspect.
  • an embodiment of the present invention further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the steps of the deep learning model training method described in the first aspect or the steps of the deep learning model training method described in the second aspect are implemented.
  • when calculating the historical global statistical parameters, each working node only needs to send the target statistical parameters obtained by its own calculation to the parameter server.
  • the parameter server can directly calculate and obtain the historical global statistical parameters according to the target statistical parameters sent by each working node.
  • the amount of data that needs to be transmitted is relatively small, which improves the efficiency of the deep learning model training method.
  • the result obtained after standardizing the sample data can be made more accurate.
  • Figure 1 is a schematic diagram of the system architecture of a deep learning model training method based on a mini-batch gradient descent method
  • FIG. 2 is a flowchart of an embodiment of the deep learning model training method provided by the first aspect of the present invention
  • FIG. 3 is a flowchart of another embodiment of the deep learning model training method provided by the first aspect of the present invention.
  • FIG. 4 is a flowchart of an embodiment of the deep learning model training method provided by the second aspect of the present disclosure.
  • FIG. 5 is a flowchart of another embodiment of the deep learning model training method provided by the second aspect of the present disclosure.
  • FIG. 6 is a schematic diagram of data interaction between a working node and a parameter server in a deep learning model training method provided by an embodiment of the present invention
  • FIG. 7 is a structural diagram of a working node provided by an embodiment of the present invention.
  • FIG. 8 is a structural diagram of a parameter server provided by an embodiment of the present invention.
  • the model training is usually performed using the mini-batch gradient descent method.
  • when the mini-batch gradient descent method is used for model training, each iteration uses batch size samples (batch size: the number of samples selected for one training pass) to update the parameters.
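  • As a generic, minimal sketch of one mini-batch gradient descent iteration (not code from the patent), a single parameter update over batch size samples could look as follows; the gradient function and data arrays are placeholders.

```python
import numpy as np

def minibatch_sgd_step(params, batch_x, batch_y, grad_fn, lr=0.01):
    """One mini-batch update: move params against the batch-averaged gradient."""
    return params - lr * grad_fn(params, batch_x, batch_y)

# One iteration: draw batch_size samples, then update the parameters once.
# idx = np.random.choice(len(train_x), size=batch_size, replace=False)
# params = minibatch_sgd_step(params, train_x[idx], train_y[idx], grad_fn)
```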
  • the deep learning model cannot perform small batch calculations on a single worker node, and the training process usually needs to be performed on multiple worker nodes.
  • a network composed of multiple working nodes is hereinafter referred to as a working network.
  • a data-parallel training method for a large-scale deep learning model can be used, for example, the training network as shown in Figure 1 can be used.
  • The same model is placed on each working node 10 for training; the training data set is then divided (generating mini-batch training samples), and the divided mini-batch training samples 20 are distributed to different working nodes 10, so that each working node 10 performs model training based on the mini-batch training samples 20 assigned to it.
  • Each working node 10 either exchanges data with the parameter server 30 after training is completed and reports the training results, or interacts with the parameter server 30 during the training process to standardize the batch training samples (hereinafter referred to as batch normalization), which processes each batch normalization layer in the forward propagation of the training process; the statistical parameters are also used during backpropagation.
  • a single worker node can independently train a deep learning model based on a small batch of training samples and synchronize the updated data to the parameter server; this embodiment is only applicable to small deep learning training models.
  • each worker node uses the same model, and retrieves data from the database for training, respectively.
  • Together, the worker nodes complete mini-batch training with batch size samples (that is, batch size is the sum of the numbers of training samples trained by all worker nodes in one model iteration).
  • when a network layer needs to compute global statistics over all batch size samples (the batch normalization layer, batch norm, is used as an example below), each working node needs to synchronize the data of this layer to the parameter server.
  • after the parameter server completes the calculation of the statistical values, they are synchronized to each working node.
  • the parameter server needs to collect statistics on the training data of all working nodes, so that each working node can correct and calculate the current data of this layer based on the statistical results, and batch normalize the corresponding target layer using the corrected statistical values.
  • in this case, the worker node needs to perform the following waiting process: after reporting the data of this layer, it must wait for the parameter server to collect the statistics of all working nodes and return the results before it can continue training.
  • the present application can solve the problem of low model training efficiency in the process of using the mini-batch gradient descent method for model training.
  • FIG. 2 is a flowchart of a deep learning model training method provided by an embodiment of the present invention, and the method is applied to a worker node.
  • the deep learning model training method includes multiple training cycles, as shown in Figure 2, and each training cycle may include the following steps:
  • Step S210: target batch training samples are obtained.
  • Step S220: multiple layers of the target model are trained according to the target batch training samples.
  • the multiple layers of the target model include at least one target layer that needs batch normalization. Accordingly, starting from the second training cycle, training the target layer in each training cycle includes:
  • Step S221: the historical global statistical parameters sent by the parameter server are received, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the current target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
  • Step S222: the target statistical parameters of the current target layer are obtained, wherein the target statistical parameters include the statistical parameters of the target batch training samples;
  • Step S223: the actual statistical parameters of the current target layer are determined based on the historical global statistical parameters and the target statistical parameters;
  • Step S224: batch normalization is performed on the current target layer based on the actual statistical parameters and the target batch training samples, and the target statistical parameters are sent to the parameter server.
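  • A minimal sketch of one training cycle on a worker node, following steps S210 to S224, is given below. The param_server client and its get_global_stats / push_stats methods, as well as the model and dataset interfaces, are hypothetical placeholders, and merge_stats is the correction sketch shown earlier.

```python
def run_training_cycle(model, param_server, dataset, batch_size):
    batch = dataset.sample(batch_size)                    # S210: target batch training samples
    x = batch.inputs
    for layer in model.layers:                            # S220: forward training, layer by layer
        if layer.needs_batch_norm:                        # a target layer
            hist = param_server.get_global_stats(layer.id)        # S221: historical global stats
            b_mean, b_var = x.mean(axis=0), x.var(axis=0)         # S222: target statistics
            mean, var, _ = merge_stats(hist.mean, hist.var, hist.count,
                                       b_mean, b_var, len(x))     # S223: actual statistics
            param_server.push_stats(layer.id, b_mean, b_var, len(x))  # S224: report target stats
            x = layer.batch_norm(x, mean, var)                        # S224: normalize with actual stats
        else:
            x = layer.forward(x)
    return x
```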
  • the training network for training the target model needs to perform multiple iterations of training on the target model; in each iteration of training (that is, each training cycle), when each working node in the training network trains forward to the target normalization layer (i.e., the target layer above), it performs statistical calculation on the currently used training samples to obtain the target statistical parameters, and reports the target statistical parameters to the parameter server.
  • the target statistical parameter includes the statistical value and the number of samples of the training samples used by the worker nodes that obtain the target statistical parameter.
  • the parameter server may calculate and obtain the historical global statistical parameters according to the received target statistical parameters of each working node belonging to the same working network. In this way, the historical global statistical parameters calculated and obtained by the parameter server can reflect the statistical characteristics of all historical training samples.
  • the historical global statistical parameters are the historical statistical parameters of the sample data that have been trained on the target normalization layer.
  • the working node combines the target statistical parameters of the target batch training samples of the target normalization layer with the historical global statistical parameters to obtain the actual statistical parameters of this normalization layer.
  • when calculating the historical global statistical parameters, each worker node only needs to send the target statistical parameters obtained by its own calculation to the parameter server.
  • the parameter server can directly calculate and obtain the historical global statistical parameters according to the target statistical parameters sent by each working node. During the entire computing process, the amount of data that needs to be transmitted is relatively small, which improves the efficiency of the deep learning model training method. Moreover, by using the historical global statistical parameters to correct the target statistical parameters of the current target layer, the result obtained after standardizing the sample data can be made more accurate.
  • how to train the target layer in the first training cycle is not particularly limited.
  • the target statistical parameters of the target layer in the first cycle and the target batch training samples can be directly used to batch normalize the current target layer.
  • the present disclosure is not limited to this; the target statistical parameters of the target layer in the first cycle can also be corrected using preset values to obtain the actual statistical parameters of the first training cycle, and then the current target layer can be batch normalized using the actual statistical parameters and the target batch training samples.
  • the parameter server can be used to calculate and obtain global statistical parameters according to the target batch training samples of each worker node, and then use the global statistical parameters and the target batch training samples to perform batch normalization on the current target layer.
  • the target model may include multiple normalization layers, and the parameter server may update the historical global statistical parameters of each normalization layer respectively.
  • each worker node can obtain the historical global statistical parameters corresponding to the m1 normalization layer from the parameter server when training forward to the m1 normalization layer, and can obtain the historical global statistical parameters of the m2 normalization layer from the parameter server when training forward to the m2 normalization layer, where the normalization layers in the target model include the m1 normalization layer and the m2 normalization layer.
  • the above-mentioned historical global statistical parameters may be a statistical value or a sequence of statistical values obtained by performing one or more calculations, such as variance, summation, and integration, on the historical data; the historical global statistical parameters may also include the number of training samples for which training has been completed, wherein the statistical value sequence may include multiple statistical values, for example, a variance value and a summation value.
  • the historical global statistical parameters may be the historical statistical parameters received by the working node before training the target layer (that is, received before the current training cycle); they are not statistics of the training samples currently used for the target layer.
  • the above-mentioned target layer may be a target batch normalization layer
  • the target batch training samples may be a sample set including at least one training sample, and a working node applying the deep learning model training method provided by the present application trains the target layer of the target model based on the training samples in the set; that is, the target batch training samples may also be referred to as the current training samples for the training of the target layer.
  • the above-mentioned training of the target layer of the target model according to the target batch training samples can also be understood as: the target batch training samples are trained forward to the target layer of the target model.
  • the above-mentioned target statistical parameter may be a statistical value or a series of statistical values obtained by performing statistical calculation on the target batch of training samples, and the statistical calculation method may be the same as the statistical calculation method performed by the parameter server on the historical training data. This will not be repeated here.
  • before training multiple layers of the target model according to the target batch training samples, the worker nodes also need to obtain the target batch data.
  • this ensures the globality of the target batch data, that is, global data information can be used in the training process.
  • the target batch of training samples may be obtained from a database in a random sampling manner.
  • sampling with replacement can be adopted.
  • that is, after sampling, the target batch training samples are not deleted from the database, and other worker nodes can also randomly select them.
  • in other words, repeated selection of the same training samples can be adopted.
  • the target batch training samples arranged at preset positions may also be acquired from a database, wherein the training samples stored in the database are arranged in random order.
  • for example, the target batch training samples at the preset positions can be obtained by taking the training sample arranged at the first position and the N-1 training samples arranged after it, where N represents the number of training samples included in the batch of training samples.
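  • A minimal sketch of this preset-position strategy, assuming the database exposes the randomly ordered samples as an indexable sequence; the per-worker start offset in the usage comment is an assumption used only to keep the batches of different workers disjoint.

```python
def fetch_batch_at_position(database, start, n):
    """Return the n training samples stored at positions start .. start + n - 1.

    The samples are assumed to already be stored in random order, so a
    contiguous slice behaves like a randomly drawn batch.
    """
    return database[start:start + n]

# e.g. worker k could use start = k * n so that workers read disjoint batches
```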
  • the database can allocate training samples within it to assign different training samples to different worker nodes.
  • the parameter types of the historical global statistical parameters are the same as the parameter types of the target statistical parameters.
  • the historical global statistical parameters include the statistical value of the historical training samples and the number of samples contained in the historical training samples;
  • the target statistical parameters include the statistical value of the target batch training samples and the number of samples contained in the target batch training samples.
  • the above-mentioned number of samples may refer to the number of times the training samples have been used, that is, if the same sample has been trained n times, the number of samples is n.
  • the actual statistical parameters of the target layer may include statistical parameter values.
  • the above-mentioned determination of the actual statistical parameters of the target layer based on the historical global statistical parameters and the target statistical parameters may be to correct and calculate the target statistical parameters by using the historical global statistical parameters, to obtain the actual statistical parameters of the target layer.
  • the above-mentioned batch normalization correction formula can be adjusted according to information such as the variance of the statistical value of the target layer and the degree of deviation from the first statistical value.
  • the specific process of the batch normalization is the same as the batch normalization process of the batch normalization layer in the prior art and will not be repeated here.
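  • For reference, the standard batch normalization transform is shown below; in the method described here, the mean and variance are taken to be the actual (corrected) statistical parameters of the target layer rather than the statistics of the current mini-batch alone. The symbols follow the usual batch normalization notation and are not defined in the patent text.

```latex
\hat{x}_i = \frac{x_i - \mu_{\text{actual}}}{\sqrt{\sigma^{2}_{\text{actual}} + \epsilon}},
\qquad y_i = \gamma \hat{x}_i + \beta
```

  • Here μ_actual and σ²_actual are the actual statistical parameters, ε is a small constant for numerical stability, and γ and β are the learnable scale and shift parameters of the batch normalization layer.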
  • after acquiring the target statistical parameters, the worker node also sends the target statistical parameters to the parameter server, so that the parameter server updates the historical global statistical parameters according to the target statistical parameters and sends the updated historical global statistical parameters to each working node, realizing data synchronization between the working nodes; in this way, the updated historical global statistical parameters can be used in the next iteration to correct and calculate the target statistical parameters of that iteration cycle, until the model training is complete.
  • the historical global statistical parameters and the target statistical parameters may include the training sample statistical parameter value and the number of training samples, so that the communication volume between the worker nodes and the parameter server is small and the communication efficiency is improved.
  • the order in which the working node performs the two steps of batch normalizing the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server, is not specifically limited.
  • each working node may store the actual statistical parameter in the working node, so as to use the actual statistical parameter for back propagation.
  • the working node receives the historical global statistical parameters determined by the parameter server according to the historical training data; when the working node trains the target layer of the target model based on the target batch training samples, it determines the actual statistical parameters of the target layer based on the received historical global statistical parameters and the target statistical parameters of the target batch training samples, and performs forward and back propagation training of the target model based on the actual statistical parameters.
  • in this way, the worker nodes do not need to wait for the parameter server to collect the statistical parameters of the training samples of all nodes, update the statistical parameters accordingly, and send them to each working node before they can perform forward and back propagation training; this greatly reduces the time for which worker nodes wait for the parameter server to issue statistical parameters, which can improve the training efficiency of the deep learning model.
  • the deep learning model training method is equivalent to a gradient descent method. After the data of the target layer is normalized, the parameters of the target model need to be trained by using the normalized data.
  • the model parameters obtained after training by each worker node (referred to as individual target model parameters for ease of description) are sent to the parameter server; the parameter server receives multiple sets of individual target model parameters, updates the overall parameters of the target model by combining all the received individual target model parameters, and sends the updated overall target model to each work node, so that each work node continues to train the updated target model based on different training samples.
  • each training cycle of the deep learning model training method may further include:
  • Step S230: sending the individual target model parameters obtained by training each layer at the current working node to the parameter server;
  • Step S240: before the next training cycle starts, receiving the updated target training model sent by the parameter server as the target training model for the next training cycle.
  • FIG. 4 is a flowchart of a deep learning model training method provided by the present application.
  • the deep learning model training method is applied to a parameter server.
  • the deep learning model training method includes multiple computing cycles, and in each computing cycle, the method may include the following steps:
  • Step S310: the target statistical parameters for the target layer in the target training model sent by the plurality of working nodes are received, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sent them when training the target layer in the training cycle corresponding to the current computing cycle.
  • Step S320: the historical global statistical parameters of the target layer corresponding to the current computing cycle in the target training model are calculated according to the received target statistical parameters.
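  • The sketch below illustrates one possible parameter-server side of steps S310 and S320: the server accumulates the per-node target statistics for a target layer and folds them into the historical global statistics. The class and method names are assumptions, and merge_stats is the same hypothetical merge shown for the worker side.

```python
class LayerStatsState:
    """Hypothetical per-target-layer state kept on the parameter server."""

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 0.0, 0     # historical global statistics

    def receive(self, node_mean, node_var, node_count):   # S310: one node's target statistics
        self.mean, self.var, self.count = merge_stats(
            self.mean, self.var, self.count,
            node_mean, node_var, node_count)               # S320: update the historical globals

    def global_stats(self):
        return self.mean, self.var, self.count            # values later synchronized to the nodes
```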
  • the above-mentioned working node may be a working node that trains the target model together with the parameter server, which may be a working node that executes the method shown in FIG. 2 .
  • the parameter server may send the historical global statistical parameters to all working nodes that train the target model.
  • the historical global statistical parameters have the same meaning as the historical global statistical parameters mentioned in the first aspect of the present disclosure, and are not repeated here.
  • the target statistical parameters received in the first calculation cycle are the target statistical parameters obtained by calculation in the first training cycle of the worker node.
  • the deep learning model training method provided by the second aspect of the present disclosure cooperates with the deep learning model training method provided by the first aspect of the present disclosure.
  • the first computing cycle in the deep learning model training method provided by the second aspect of the present disclosure corresponds to the first training cycle in the deep learning model training method provided by the first aspect of the present disclosure.
  • the historical global statistical parameters received in the second training cycle of the method provided by the first aspect are the historical global statistical parameters obtained in the first computing cycle of the method provided by the second aspect.
  • similarly, the second computing cycle in the method provided by the second aspect corresponds to the second training cycle in the method provided by the first aspect, and the historical global statistical parameters received in the third training cycle of the method provided by the first aspect are the historical global statistical parameters obtained in the second computing cycle of the method provided by the second aspect, and so on.
  • the target statistical parameters sent by each working node may be the statistical parameters of the target batch training samples used when training the target layer, which may have the same meaning as the target statistical parameters in the method embodiment shown in FIG. 2 . This will not be repeated here.
  • the target statistical parameters corresponding to different target layers are different, and the historical global statistical parameters corresponding to different target layers are also different.
  • the parameter server may actively deliver the historical global statistical parameters obtained by calculation to each working node; that is to say, the deep learning model training method includes, after step S320, the step of sending the historical global statistical parameters to each working node.
  • alternatively, after calculating and obtaining the historical global statistical parameters, the parameter server does not actively send the historical global statistical parameters to each working node, but sends the historical global statistical parameters only to a worker node that sends a historical global parameter acquisition request; that is to say, the deep learning model training method may further include, after step S320, the step of sending the historical global statistical parameters to a working node in response to receiving a global parameter acquisition request from that working node.
  • the above-mentioned receiving of the target statistical parameters sent by the working nodes when training the target layer of the target model based on different batches of training samples may mean that step S320 is performed only after the target statistical parameters sent by each working node when training the target layer of the target model have been received.
  • step S320 may specifically include:
  • in the case of receiving the target statistical parameters of the target layer corresponding to the current computing cycle sent by a preset number of working nodes, the historical global statistical parameters of the target layer corresponding to the current cycle are calculated based on the historical global statistical parameters obtained in the computing cycles before the current computing cycle and the preset number of received target statistical parameters, wherein the preset number is less than or equal to the total number of working nodes in the working network.
  • in this way, some margin is reserved among the working nodes: after a preset number of working nodes have submitted data to the parameter server, the model can be updated according to the submitted data, and data submitted by working nodes beyond the preset number is no longer waited for or received. This reduces stalls of the overall model training process caused by a certain worker node running slowly or crashing, which can improve the efficiency of model training.
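  • As an illustration of this partial-synchronization idea (one possible realization, not a prescribed one), the server could close a computing cycle as soon as a preset number of nodes have reported, ignoring later submissions for that cycle; the threshold handling below is an assumption.

```python
def collect_cycle(reports, preset_count, layer_state):
    """Fold the first preset_count node reports into the historical global statistics.

    reports is an iterable of (mean, var, count) tuples in arrival order;
    reports arriving after the threshold are ignored for this cycle, so one
    slow or crashed worker cannot stall the whole training process.
    """
    accepted = 0
    for mean, var, count in reports:
        if accepted >= preset_count:
            break                              # late data is no longer waited for
        layer_state.receive(mean, var, count)  # LayerStatsState from the earlier sketch
        accepted += 1
    return layer_state.global_stats()
```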
  • the above-mentioned updating of the historical global statistical parameters based on the target statistical parameters may be performing deviation correction on the historical global statistical parameters according to the target statistical parameters.
  • the parameter server may periodically execute steps S310 and S320, so as to synchronize the historical global statistical parameters of each batch normalization layer to each working node when updating the model data.
  • the parameter server may perform step S310 when obtaining the updated historical global statistical parameters.
  • the parameter server may send the updated historical global statistical parameters to each working node; specifically, when each working node runs forward to the target normalization layer, the updated historical global statistical parameters are respectively sent to that working node, wherein the target normalization layer is the normalization layer associated with the updated historical global statistical parameters.
  • in the process of training the batch normalization layer, it is not necessary for each worker node to send all training sample data to the parameter server and wait for the parameter server to perform global statistics before returning the statistical parameters.
  • instead, each worker node computes statistics on the batch normalization data it uses for the batch normalization layer and reports its own target statistical parameters, thereby reducing the amount of data interaction between the parameter server and each worker node.
  • the parameter server can send the historical global statistical parameters to each working node, so that the working node can correct the target statistical parameters according to the received historical global statistical parameters, obtain the actual statistical parameters it uses, and batch normalize the corresponding target layer using the actual statistical parameters and the target training samples, which can greatly reduce the waiting time of the worker nodes.
  • the deep model training method may further include:
  • Step S330: receiving the individual target model parameters, obtained by training each layer of the target model, sent by each working node;
  • Step S340: updating the target model of the current cycle according to the received individual target model parameters, so as to obtain the target model for the next training cycle of the working nodes.
  • the method includes the following processes:
  • Step 601: working node A obtains the target training data that currently needs to be trained from the database.
  • the target training data is the target batch training samples in the method embodiments shown in FIG. 2 and FIG. 3 .
  • Step 602: working node A trains the target model based on the target training data.
  • the target model can include multiple layers that require global batch size statistical parameters (such as batchnorm batch normalization layer, the following only takes batchnorm batch normalization layer as an example).
  • the global batch size statistical parameters include: variance, summation, average and other statistical values, which also include: the number of target training data.
  • Step 603: when working node A runs forward to the target batch normalization layer, it obtains the target statistical parameters and corrects them based on the historical global statistical parameters to obtain the actual statistical parameters.
  • working node A obtains the target training data of the target batch normalization layer, performs statistical calculation on the target training data to obtain the target statistical parameters, and then uses the historical global statistical parameters of the target batch normalization layer synchronized from the parameter server to correct the target statistical parameters of this working node, obtaining the statistical parameters currently used, that is, the actual statistical parameters.
  • these statistical parameters are retained on the local machine, as they need to be used during back propagation.
  • Step 604: working node A sends the target statistical parameters to the parameter server.
  • Step 605: the parameter server updates the stored historical global statistical parameters according to the target statistical parameters, and sends the updated historical global statistical parameters to each working node (including working node A).
  • the working node and the parameter server cooperate with each other to execute each process of the deep learning model training method as shown in FIG. 2 and FIG. 3 , and can achieve the same beneficial effect. To avoid repetition, details are not repeated here.
  • FIG. 7 is a structural diagram of a working node provided by an embodiment of the present application.
  • the working node 500 includes:
  • a sample acquisition module 510 configured to acquire target batch training samples
  • the training module 520 is configured to train multiple layers of the target model according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization.
  • Training module 520 includes
  • the first receiving unit 521 is configured to receive the historical global statistical parameters sent by the parameter server, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
  • the determining unit 522 is configured to determine the actual statistical parameters of the currently trained target layer based on the historical global statistical parameters and the target statistical parameters, and based on the actual statistical parameters and the target batch training samples, the corresponding target layer Batch normalization is performed, and the target statistical parameters are sent to the parameter server.
  • the "corresponding target layer” here refers to the target layer that belongs to the same training cycle as the actual statistical parameters and the target batch training samples and needs to use the above-mentioned actual statistical parameters.
  • the working node is used to execute the deep learning model training method provided in the first aspect of the present disclosure.
  • the principles and beneficial effects of the deep learning model training method have been described in detail above, and will not be repeated here.
  • the sample acquisition module 510 is configured to acquire the target batch training samples from the database in a random sampling manner.
  • the sample obtaining module 510 is configured to obtain the target batch training samples arranged at preset positions from a database, wherein the training samples stored in the database are arranged in random order.
  • the target statistical parameter includes a statistical value of the target batch of training samples and the number of samples of the target batch of training samples.
  • the worker node 500 may further include:
  • a sending module 530, configured to send the individual target model parameters obtained by training each layer at the current working node to the parameter server;
  • the model receiving module 540 is configured to receive the updated target training model sent by the parameter server as the target training model for the next training cycle.
  • FIG. 8 is a structural diagram of a parameter server provided by an embodiment of the present application.
  • the parameter server 600 includes:
  • the parameter receiving module 610 is configured to receive the target statistical parameters for the target layer in the target training model sent by the plurality of working nodes, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sends them when training the target layer in the training cycle corresponding to the current computing cycle.
  • the updating module 620 calculates, according to the received target statistical parameters, the historical global statistical parameters of the target layer corresponding to the current calculation period in the target training model.
  • the parameter server 600 is configured to execute the deep learning model training method provided by the second aspect of the present disclosure.
  • the beneficial effects and working principles of the deep learning model training method provided by the second aspect of the present disclosure have been described in detail above and will not be repeated here.
  • the update module 620 may be specifically configured to, in the case of receiving the target statistical parameters of the target layer corresponding to the current computing cycle sent by a preset number of working nodes, calculate the historical global statistical parameters of the target layer corresponding to the current cycle based on the historical global statistical parameters obtained in the computing cycles before the current computing cycle and the preset number of received target statistical parameters, wherein the preset number is less than or equal to the total number of working nodes in the working network.
  • the parameter receiving module 610 is further configured to receive individual target model parameters sent by each working node and obtained by training each layer of the target model.
  • the updating module 620 is further configured to update the target model of the current cycle according to the received individual target model parameters.
  • Embodiments of the invention also provide an electronic device, including a processor, a memory, and a program or an instruction stored in the memory and executable on the processor; when the program or instruction is executed by the processor, the various processes of the deep learning model training method provided by the first aspect of the present disclosure or of the deep learning model training method provided by the second aspect of the present disclosure are implemented, and the same technical effects can be achieved, which are not repeated here to avoid repetition.
  • An embodiment of the present disclosure further provides a readable storage medium, wherein a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the steps of the deep learning model training method according to the first aspect of the present disclosure or the steps of the deep learning model training method according to the second aspect of the present disclosure are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A deep learning model training method, a working node, a parameter server, an electronic device and a readable medium, the method comprising: acquiring target batch training samples (S210); and training multiple layers of a target model according to the target batch training samples, wherein, starting from the second training cycle, the training of a target layer in each training cycle comprises: receiving historical global statistical parameters sent by the parameter server (S221), wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the current target layer of the target model, and the historical training data comprises target statistical parameters acquired by the current working node in training cycles prior to the current training cycle and target statistical parameters acquired by the other working nodes in training cycles prior to the current training cycle; acquiring the target statistical parameters of the current target layer (S222); determining the actual statistical parameters of the current target layer based on the historical global statistical parameters and the target statistical parameters (S223); and performing batch standardization on the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server (S224).

Description

Learning model training method, working node, server, device and medium
Technical Field
The invention relates to the technical field of deep learning, in particular to a deep learning model training method, a working node, a parameter server, an electronic device and a readable storage medium.
Background
With the development of information technology, training deep learning models and using the trained models to predict target data has become more and more widely used. In order to further improve the accuracy of the trained models, the number of training samples has also become larger and larger, which increases the complexity of training and lengthens the training time.
In related technologies, multiple working nodes can usually be used to train the same model; for example, different working nodes are responsible for training different training layers in the same model. In this case, the next training layer must wait for the training of the previous training layer to complete before it can proceed, and this waiting time greatly increases the total time for model training, thereby reducing the efficiency of model training.
It can be seen that, in the process of using multiple working nodes to train the same model in the related art, there is a defect of low model training efficiency.
Summary of the Invention
Embodiments of the present invention provide a deep learning model training method, a working node and a parameter server, so as to solve the problem of low model training efficiency in the process of using multiple working nodes to train the same deep learning model in the related art.
In order to solve the above technical problems, the present invention is implemented as follows:
In a first aspect, an embodiment of the present invention provides a deep learning model training method, applied to a working node. The method includes multiple training cycles, and each training cycle includes:
obtaining target batch training samples;
training multiple layers of the target model according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization; starting from the second training cycle, training the target layer in each training cycle includes:
receiving the historical global statistical parameters sent by the parameter server, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the current target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
obtaining the target statistical parameters of the current target layer, wherein the target statistical parameters are the statistical parameters of the target batch training samples;
determining the actual statistical parameters of the current target layer based on the historical global statistical parameters and the target statistical parameters; and
performing batch normalization on the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server.
In a second aspect, an embodiment of the present invention further provides a deep learning model training method, applied to a parameter server, wherein the deep learning model training method includes multiple computing cycles, and in each computing cycle the method includes:
receiving the target statistical parameters for the target layer in the target training model sent by multiple working nodes, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sent them when training the target layer in the training cycle corresponding to the current computing cycle, wherein a plurality of the working nodes belong to the same working network, and the target training model includes at least one target layer;
calculating, according to the received target statistical parameters, the historical global statistical parameters of the target layer corresponding to the current computing cycle in the target training model.
In a third aspect, an embodiment of the present invention further provides a working node, including:
a sample acquisition module, used to acquire target batch training samples;
a training module, used to train multiple layers of the target model according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization, and the training module includes:
a first receiving unit, used to receive the historical global statistical parameters sent by the parameter server, wherein the historical global statistical parameters are determined by the parameter server according to the historical training data of the target layer of the target model, and the historical training data includes the target statistical parameters obtained by the current working node in the training cycles before the current training cycle, and the target statistical parameters obtained, in the training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node;
a first obtaining unit, used to obtain the target statistical parameters of the currently trained target layer, wherein the target statistical parameters are the statistical parameters of the target batch training samples;
a determining unit, used to determine the actual statistical parameters of the currently trained target layer based on the historical global statistical parameters and the target statistical parameters, to perform batch normalization on the corresponding target layer based on the actual statistical parameters and the target batch training samples, and to send the target statistical parameters to the parameter server.
In a fourth aspect, an embodiment of the present invention further provides a parameter server, including:
a parameter receiving module, used to receive the target statistical parameters for the target layer in the target training model sent by the plurality of working nodes, where the target statistical parameters of the target layer include the statistical parameters of the target batch training samples used by the working node that sends them when training the target layer in the training cycle corresponding to the current computing cycle, wherein a plurality of the working nodes belong to the same working network, and the target training model includes at least one target layer;
an updating module, used to calculate, according to the received target statistical parameters, the historical global statistical parameters of the target layer corresponding to the current computing cycle in the target training model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a program or instruction stored in the memory and executable on the processor, where the program or instruction, when executed by the processor, implements the steps of the deep learning model training method described in the first aspect, or the steps of the deep learning model training method described in the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the steps of the deep learning model training method described in the first aspect or the steps of the deep learning model training method described in the second aspect are implemented.
在本发明实施例中,在计算历史全局统计参数时,各个工作节点仅需要将各自计算获得的目标 统计参数发送至参数服务器即可。而参数服务器直接根据各个工作节点发送的目标统计参数即可计算获得历史全局统计参数。整个计算过程中,需要传输的数据量相对较少,从而提高了深度学习模型训练方法的效率。并且,利用历史全局统计参数对当前的目标层的目标统计参数进行修正,可以使得对样本数据进行标准化后获得的结果更加准确。In the embodiment of the present invention, when calculating the historical global statistical parameters, each working node only needs to send the target statistical parameters obtained by their respective calculations to the parameter server. The parameter server can directly calculate and obtain the historical global statistical parameters according to the target statistical parameters sent by each working node. During the entire computing process, the amount of data that needs to be transmitted is relatively small, which improves the efficiency of the deep learning model training method. Moreover, by using the historical global statistical parameters to correct the target statistical parameters of the current target layer, the result obtained after standardizing the sample data can be made more accurate.
Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of the system architecture of a deep learning model training method based on the mini-batch gradient descent method;
Fig. 2 is a flowchart of an embodiment of the deep learning model training method provided in the first aspect of the present invention;
Fig. 3 is a flowchart of another embodiment of the deep learning model training method provided in the first aspect of the present invention;
Fig. 4 is a flowchart of an embodiment of the deep learning model training method provided in the second aspect of the present disclosure;
Fig. 5 is a flowchart of another embodiment of the deep learning model training method provided in the second aspect of the present disclosure;
Fig. 6 is a schematic diagram of the data interaction between a working node and a parameter server in the deep learning model training method provided in an embodiment of the present invention;
Fig. 7 is a structural diagram of a working node provided in an embodiment of the present invention;
Fig. 8 is a structural diagram of a parameter server provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In the training process of a large deep learning model, in order to speed up model convergence and improve training efficiency, and considering that the total number of samples may be very large (so that all the sample data cannot be used in every model iteration), the mini-batch gradient descent method is usually used for model training. When the mini-batch gradient descent method is used, each iteration uses batch size (the number of samples selected for one training pass) samples to update the parameters. However, for a large deep learning model with many parameters or a large amount of intermediate activation data, mini-batch computation cannot be performed on a single working node, and the training process usually needs to be distributed over multiple working nodes. For ease of description, a network composed of multiple working nodes is hereinafter referred to as a working network.
For example, when the training process is distributed over multiple working nodes, a data-parallel training method for large deep learning models may be adopted. The training network may be the one shown in Fig. 1, which includes multiple working nodes 10: the same model is placed on each working node 10 for training, the training data set is divided (to generate mini-batch training samples), and the divided mini-batch training samples 20 are distributed to different working nodes 10, so that each working node 10 performs model training based on the mini-batch training samples 20 assigned to it. Each working node 10 exchanges data with a parameter server 30 after training is completed in order to report the training result, or exchanges data with the parameter server 30 during training in order to normalize the batch training samples (hereinafter referred to as batch normalization). The batch normalization processes each batch normalization layer during the forward propagation of the training process, and the actual statistical parameters described above are also used during back propagation.
In one embodiment, a single working node can independently train the deep learning model based on a mini-batch of training samples and synchronize the update data to the parameter server. This embodiment is only suitable for small deep learning models.
In another optional embodiment, in the training process of a large deep learning network model, when the mini-batch gradient descent method is used, all working nodes use the same model and respectively fetch data from the database for training, jointly completing a mini-batch training pass whose sample number is the batch size (that is, the batch size is the sum of the numbers of training samples trained by all working nodes in one model iteration). After each working node finishes running, it synchronizes the model or parameter update data to the parameter server; after the parameter server has obtained the data of all working nodes, it updates the model and synchronizes the updated model to each working node. In the process of model training on each working node, if a network layer needs global batch-size statistics over the batch size samples (a batch normalization layer (batch norm) is taken as an example below), each working node needs to synchronize the data of this layer to the parameter server when it runs to this layer, and only after the parameter server completes the calculation of the statistics are they synchronized back to each working node.
For example, when training the batch normalization layers contained in the model training network (a current model training network usually contains multiple batch normalization layers), the parameter server needs to collect statistics on the training data of all working nodes, so that each working node corrects the current data of this layer based on the statistical result and performs batch normalization on the corresponding target layer using the corrected statistics.
When a network layer needs global statistics over the batch size samples, a working node has to go through the following waiting process:
waiting for the other working nodes to train up to the same batch normalization layer, so that the parameter server can obtain the data of the batch normalization layer reported by each working node; and then waiting for the parameter server to perform statistical calculation on the data of the batch normalization layer reported by each working node and to deliver the statistical result.
This greatly increases the amount of communication between the working nodes and the parameter server, and increases the waiting time of each working node during training, which is a serious bottleneck for the training of the increasingly large models to come. The present application can solve the problem of low model training efficiency in model training using the mini-batch gradient descent method.
Referring to Fig. 2, Fig. 2 is a flowchart of a deep learning model training method provided in an embodiment of the present invention, and the method is applied to a working node. The deep learning model training method includes multiple training cycles. As shown in Fig. 2, each training cycle may include the following steps.
In step S210, target batch training samples are obtained.
In step S220, multiple layers of a target model are trained according to the target batch training samples.
The multiple layers of the target model include at least one target layer requiring batch normalization. Accordingly, starting from the second training cycle, training the target layer in each training cycle includes the following steps.
In step S221, historical global statistical parameters sent by a parameter server are received, wherein the historical global statistical parameters are determined by the parameter server according to historical training data of the current target layer of the target model, and the historical training data include target statistical parameters obtained by the current working node in training cycles before the current training cycle, as well as target statistical parameters obtained, in training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node.
In step S222, target statistical parameters of the current target layer are obtained, wherein the target statistical parameters include statistical parameters of the target batch training samples.
In step S223, actual statistical parameters of the current target layer are determined based on the historical global statistical parameters and the target statistical parameters.
In step S224, batch normalization is performed on the current target layer based on the actual statistical parameters and the target batch training samples, and the target statistical parameters are sent to the parameter server.
The training network that trains the target model needs to perform multiple iterations of model training on the target model. In each iteration (that is, each training cycle), when a working node in the training network trains forward to the target normalization layer (that is, the target layer above), it performs statistical calculation on the training samples it is currently using to obtain target statistical parameters, and reports the target statistical parameters to the parameter server. It should be pointed out that the target statistical parameters include the statistical values of the training samples used by the working node that produced them as well as the number of those samples. The parameter server can calculate the historical global statistical parameters according to the received target statistical parameters of the working nodes belonging to the same working network. In this way, the historical global statistical parameters calculated by the parameter server can reflect the statistical characteristics of all historical training samples. When a working node iteratively trains the target normalization layer in its next training cycle, the historical global statistical parameters it obtains from the parameter server are the historical statistical parameters of the sample data that has already been used to train the target normalization layer, and the working node combines the target statistical parameters of the target batch training samples of the target normalization layer with the historical global statistical parameters to obtain the actual statistical parameters of this normalization layer.
When the historical global statistical parameters are calculated, each working node only needs to send the target statistical parameters it has calculated to the parameter server, and the parameter server can calculate the historical global statistical parameters directly from the target statistical parameters sent by the working nodes. The amount of data to be transmitted during the whole calculation is relatively small, which improves the efficiency of the deep learning model training method. Moreover, correcting the target statistical parameters of the current target layer with the historical global statistical parameters makes the result obtained after normalizing the sample data more accurate.
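For illustration only, the following Python sketch shows one possible way for a working node to merge its current batch statistics with the historical global statistical parameters received from the parameter server. The function names, the choice of mean and variance as the tracked statistics, and the count-weighted merging rule are assumptions of this sketch, not limitations of the embodiment.

```python
import numpy as np

def combine_statistics(hist_mean, hist_var, hist_count, batch_mean, batch_var, batch_count):
    """Merge historical global statistics with the current target batch statistics.

    The historical values are assumed to be (mean, variance, sample count) accumulated
    by the parameter server over previous training cycles; the batch values are the
    target statistical parameters computed locally for the current target layer.
    """
    total = hist_count + batch_count
    mean = (hist_count * hist_mean + batch_count * batch_mean) / total
    # Law of total variance: within-group variance plus between-group variance.
    var = (hist_count * (hist_var + (hist_mean - mean) ** 2)
           + batch_count * (batch_var + (batch_mean - mean) ** 2)) / total
    return mean, var, total

def worker_target_layer_step(x, hist_mean, hist_var, hist_count, eps=1e-5):
    """Compute target statistics for a batch x, batch-normalize it with the actual
    (combined) statistics, and return the statistics to report to the server."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    batch_count = x.shape[0]
    mean, var, _ = combine_statistics(hist_mean, hist_var, hist_count,
                                      batch_mean, batch_var, batch_count)
    x_hat = (x - mean) / np.sqrt(var + eps)        # actual statistical parameters used here
    report = (batch_mean, batch_var, batch_count)  # sent to the parameter server
    return x_hat, report
```

In such a sketch, the report tuple corresponds to the target statistical parameters, and the combined mean and variance correspond to the actual statistical parameters kept locally for back propagation.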
In the present disclosure, how the target layer is trained in the first training cycle is not particularly limited. For example, the target statistical parameters of the target layer in the first cycle and the target batch training samples may be used directly to batch-normalize the current target layer. Of course, the present disclosure is not limited to this; the target statistical parameters of the target layer in the first cycle may also be corrected with preset values to obtain the actual statistical parameters of the first training cycle, and the actual statistical parameters and the target batch training samples are then used to batch-normalize the current target layer. Alternatively, the parameter server may calculate global statistical parameters according to the target batch training samples of all working nodes, and the global statistical parameters and the target batch training samples are then used to batch-normalize the current target layer.
It should be noted that the target model may include multiple normalization layers, and the parameter server may update the historical global statistical parameters of each normalization layer separately. When a working node trains forward to the m1 normalization layer, it may obtain the historical global statistical parameters corresponding to the m1 normalization layer from the parameter server; when a working node trains forward to the m2 normalization layer, it may obtain the historical global statistical parameters corresponding to the m2 normalization layer from the parameter server, where the normalization layers in the target model include the m1 normalization layer and the m2 normalization layer.
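As a minimal, purely illustrative sketch of this per-layer bookkeeping (the layer keys m1 and m2 follow the example above, and the flat dictionary layout is an assumption):

```python
# Each normalization layer of the target model has its own historical global
# statistical parameters on the parameter server.
layer_stats = {
    "m1": {"mean": 0.0, "var": 0.0, "count": 0},
    "m2": {"mean": 0.0, "var": 0.0, "count": 0},
}

def stats_for(layer_name):
    """Return the historical global statistical parameters that a working node should
    receive when it trains forward to the given normalization layer."""
    return layer_stats[layer_name]
```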
The historical global statistical parameters may be a statistical value or a sequence of statistical values obtained by performing one or more of calculations such as variance, summation, and integration on the historical data, or the historical global statistical parameters may further include the number of training samples for which training has been completed, where the sequence of statistical values may include multiple statistical values, for example, a variance value and a summation value.
It should be noted that the historical global statistical parameters may be the historical statistical parameters received by the working node before training the target layer (that is, the historical statistical parameters received before the current training cycle); they do not cover the training samples currently used by the target layer.
As described above, the target layer may be a target batch normalization layer, and the target batch training samples may be a sample set including at least one training sample; a working node applying the deep learning model training method provided in the present application trains the target layer of the target model based on the training samples in this sample set, that is, the target batch training samples may also be called the current training samples with which the target layer is trained. Training the target layer of the target model according to the target batch training samples may also be understood as: the target batch training samples are trained forward to the target layer of the target model.
In addition, the target statistical parameters may be a statistical value or a sequence of statistical values obtained by performing statistical calculation on the target batch training samples, and the statistical calculation may be performed in the same way as the statistical calculation performed by the parameter server on the historical training data, which is not repeated here.
It should be noted that before training the multiple layers of the target model according to the target batch training samples, the working node also needs to obtain the target batch data. For example, the globality of the target batch data obtained by each working node can be improved in the following ways, so that global data information can be used in the training process.
In an optional embodiment, the target batch training samples may be obtained from a database by random sampling.
For example, sampling with replacement may be adopted. In this embodiment, after a working node obtains the target batch training samples from the database, the target batch training samples may not be deleted, and other working nodes may also randomly select training samples from among the target batch training samples.
In another optional embodiment, the target batch training samples arranged at preset positions may also be obtained from the database, where the training samples stored in the database are arranged in random order.
The target batch training samples arranged at preset positions may be obtained starting from the training sample arranged at the first position and taking the N-1 training samples arranged after that training sample, where N represents the number of training samples contained in a batch of training samples. For example, the database may allocate the training samples within it so as to assign different training samples to different working nodes.
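A minimal sketch of the two sampling strategies described above is given below in Python; the database is modeled as a simple in-memory list, and the function names are assumptions of the sketch rather than part of the embodiment.

```python
import random

def sample_with_replacement(database, batch_size):
    """Random sampling with replacement: samples are not removed from the database,
    so other working nodes may also draw the same training samples."""
    return [random.choice(database) for _ in range(batch_size)]

def sample_at_preset_position(database, start, batch_size):
    """Take the sample at a preset position and the N-1 samples after it, assuming
    the database stores its training samples in random (shuffled) order."""
    return database[start:start + batch_size]

# Illustrative usage: two working nodes drawing from the same shuffled database.
db = list(range(1000))
random.shuffle(db)
batch_a = sample_with_replacement(db, 32)
batch_b = sample_at_preset_position(db, 0, 32)
```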
In some optional embodiments, the parameter types of the historical global statistical parameters are the same as the parameter types of the target statistical parameters. For example, the historical global statistical parameters include the statistical parameters of the historical training samples and the number of historical training samples, and the target statistical parameters include the statistical values of the target batch training samples and the number of samples contained in the target batch training samples. The number of samples may refer to the number of times training samples have been used, that is, if the same sample has been trained n times, the sample count is n.
In some optional embodiments, the actual statistical parameters of the target layer may include statistical parameter values.
In some optional embodiments, determining the actual statistical parameters of the target layer based on the historical global statistical parameters and the target statistical parameters may be: performing a correction calculation on the target statistical parameters using the historical global statistical parameters, so as to obtain the actual statistical parameters of the target layer.
In a possible implementation, the correction formula used for batch normalization may be adjusted according to information such as the variance of the statistical values of the target layer and the degree of deviation from the first statistical value. The specific batch normalization process has the same meaning as the batch normalization process of a batch normalization layer in the prior art, and is not repeated here.
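As one hedged illustration of such a correction, assuming the tracked statistics are the mean $\mu$ and variance $\sigma^2$ together with the sample counts $n$, the actual statistical parameters and the normalization of an input $x$ could take the standard form below; the count-weighted combination is only one possible correction formula, not the only one covered by the embodiment:

$$\mu_{\text{actual}} = \frac{n_{\text{hist}}\,\mu_{\text{hist}} + n_{\text{batch}}\,\mu_{\text{batch}}}{n_{\text{hist}} + n_{\text{batch}}}, \qquad \sigma^2_{\text{actual}} = \frac{n_{\text{hist}}\!\left(\sigma^2_{\text{hist}} + (\mu_{\text{hist}} - \mu_{\text{actual}})^2\right) + n_{\text{batch}}\!\left(\sigma^2_{\text{batch}} + (\mu_{\text{batch}} - \mu_{\text{actual}})^2\right)}{n_{\text{hist}} + n_{\text{batch}}},$$

$$\hat{x} = \frac{x - \mu_{\text{actual}}}{\sqrt{\sigma^2_{\text{actual}} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta,$$

where $\gamma$ and $\beta$ are the learnable scale and shift parameters of the batch normalization layer and $\epsilon$ is a small constant for numerical stability.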
In some optional embodiments, after obtaining the target statistical parameters, the working node also sends the target statistical parameters to the parameter server, so that the parameter server updates the historical global statistical parameters according to the target statistical parameters and then sends the updated historical global statistical parameters to each working node. This achieves data synchronization among the working nodes, so that the updated historical global statistical parameters can be used in the next iteration to correct the target statistical parameters of that iteration cycle, until the model training is completed. As described above, the historical global statistical parameters and the target statistical parameters may include the statistical parameter values of the training samples and the number of training samples, so that the communication traffic between a working node and the parameter server is small and the communication efficiency is improved.
It should be pointed out that this embodiment does not specifically limit the order in which the working node performs the two steps of batch-normalizing the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server.
In a possible implementation, after obtaining the actual statistical parameters, each working node may store the actual statistical parameters on this working node so as to use them for back propagation.
In the embodiments of the present invention, the working node receives the historical global statistical parameters determined by the parameter server according to the historical training data. When the working node trains the target layer of the target model based on the target batch training samples, it determines the actual statistical parameters of the target layer based on the historical global statistical parameters it has already received and the target statistical parameters of the target batch training samples, and performs the forward and back propagation training of the target model based on the actual statistical parameters. In this way, the working node does not need to wait until the parameter server has obtained the training sample statistical parameters of all working nodes training the target layer, updated the statistical parameters according to them, and delivered them to each working node before forward and back propagation training can be performed. This greatly reduces the time the working node spends waiting for the parameter server to deliver statistical parameters, and can improve the training efficiency of the deep learning model.
As described above, in the present disclosure, the deep learning model training method is equivalent to a gradient descent method. After the data of the target layer has been normalized, the normalized data also needs to be used to train the parameters of the target model.
In the same training network, multiple working nodes train the same model with different training samples. In order to obtain a more accurate model, the model parameters obtained after training by each working node (called individual target model parameters for ease of description) are all sent to the parameter server. After receiving the multiple sets of individual target model parameters, the parameter server performs an overall parameter update of the target model by combining all the received individual target model parameters, and delivers the overall updated target model to each working node, so that each working node continues to train the updated target model based on different training samples.
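A minimal sketch of this overall parameter update is given below, assuming the individual target model parameters are dictionaries of NumPy arrays and the combination rule is a simple element-wise average; both assumptions are illustrative, and other weighting schemes are equally compatible with the embodiment.

```python
import numpy as np

def aggregate_model_parameters(individual_params):
    """Combine the individual target model parameters reported by the working nodes
    into the overall updated target model parameters (here: element-wise average)."""
    keys = individual_params[0].keys()
    return {k: np.mean([p[k] for p in individual_params], axis=0) for k in keys}

# Illustrative usage: three working nodes report parameters for the same two layers.
reports = [{"layer1.weight": np.random.randn(4, 4), "layer2.weight": np.random.randn(4, 2)}
           for _ in range(3)]
updated_model = aggregate_model_parameters(reports)  # delivered back to every working node
```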
For any working node, as shown in Fig. 3, each training cycle of the deep learning model training method may further include the following steps.
In step S230, the individual target model parameters obtained by training the layers on the current working node are sent to the parameter server.
In step S240, before the next training cycle starts, the updated target training model sent by the parameter server is received as the target training model for the next training cycle.
As a second aspect of the present disclosure, reference is made to Fig. 4, which is a flowchart of a deep learning model training method provided in the present application and applied to a parameter server. In the second aspect provided in the present disclosure, the deep learning model training method includes multiple computing cycles, and in each computing cycle the method may include the following steps.
In step S310, target statistical parameters for a target layer in a target training model sent by multiple working nodes are received, wherein the target statistical parameters of the target layer include statistical parameters of the target batch training samples used by the working node that sends the target statistical parameters to train the target layer in the training cycle corresponding to the current computing cycle, the multiple working nodes belong to the same working network, and the target training model includes at least one target layer.
In step S320, historical global statistical parameters of the target layer of the target training model corresponding to the current computing cycle are calculated according to the received target statistical parameters.
In this embodiment, the working nodes may be working nodes that train the target model together with the parameter server, and they may be working nodes executing the method shown in Fig. 2.
After obtaining the historical global statistical parameters, the parameter server may send the historical global statistical parameters to all working nodes that train the target model. The historical global statistical parameters have the same meaning as the historical global statistical parameters mentioned in the first aspect of the present disclosure, and are not repeated here.
It should be pointed out that the target statistical parameters received in the first computing cycle are the target statistical parameters calculated by the working nodes in their first training cycle.
The deep learning model training method provided in the second aspect of the present disclosure cooperates with the deep learning model training method provided in the first aspect of the present disclosure. The first computing cycle in the method of the second aspect corresponds to the first training cycle in the method of the first aspect: the historical global statistical parameters received in the second training cycle of the method of the first aspect are the historical global statistical parameters calculated in the first computing cycle of the method of the second aspect. The second computing cycle in the method of the second aspect corresponds to the second training cycle in the method of the first aspect: the historical global statistical parameters received in the third training cycle of the method of the first aspect are the historical global statistical parameters calculated in the second computing cycle of the method of the second aspect, and so on.
The target statistical parameters sent by each working node may be the statistical parameters of the target batch training samples used when it trains the target layer, and they may have the same meaning as the target statistical parameters in the method embodiment shown in Fig. 2, which is not repeated here.
In addition, even for the same working node and the same training cycle, different target layers correspond to different target statistical parameters, and different target layers also correspond to different historical global statistical parameters.
In the present disclosure, how each working node obtains the historical global statistical parameters calculated by the parameter server is not particularly limited. As an optional embodiment, the parameter server may actively deliver the calculated historical global statistical parameters to each working node. That is to say, the deep learning model training method includes, after step S320:
sending the historical global statistical parameters to each working node.
Of course, the present disclosure is not limited to this. After calculating the historical global statistical parameters, the parameter server may not actively send them to the working nodes, but may instead send the historical global statistical parameters to a working node only when a global parameter acquisition request is received from that working node. That is to say, the deep learning model training method may further include, after step S320:
in response to a global parameter acquisition request, determining the identity information of the working node that sent the global parameter acquisition request; and
sending the historical global statistical parameters to the working node that sent the global parameter acquisition request.
In an optional embodiment, receiving the target statistical parameters sent by the working nodes when they train the target layer of the target model based on different batches of training samples may mean that step S320 is performed only after the target statistical parameters sent by every working node when training the target layer have been received.
In another optional embodiment, receiving the target statistical parameters sent by the working nodes when they train the target layer of the target model based on different batches of training samples may mean that, when the target statistical parameters sent by a preset number of working nodes have been received, the historical global statistical parameters are updated based on the historical global statistical parameters and the preset number of target statistical parameters, wherein the preset number is less than or equal to the total number of the first working nodes. In other words, starting from the second computing cycle, step S320 may specifically include:
in the case where the target statistical parameters of the target layer corresponding to the current computing cycle sent by a preset number of working nodes have been received, calculating the historical global statistical parameters of the target layer corresponding to the current cycle based on the historical global statistical parameters obtained in computing cycles before the current computing cycle and the preset number of target statistical parameters, wherein the preset number is less than or equal to the total number of working nodes in the working network.
In this embodiment, some working nodes are held in reserve. Once the preset number of working nodes have submitted data to the parameter server, the model can be updated according to the submitted data, and data submitted by working nodes beyond the preset number is no longer waited for or received. In this way, the overall model training process is less likely to stall because a certain working node runs slowly or crashes, which can improve the model training efficiency.
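The following Python sketch illustrates, under illustrative assumptions, how a parameter server could update the historical global statistical parameters of one target layer once a preset number of working nodes have reported; the data structures and the count-weighted update rule are assumptions of the sketch rather than the definitive implementation.

```python
class LayerStatsAggregator:
    """Tracks the historical global statistical parameters of one target layer."""

    def __init__(self, preset_number):
        self.preset_number = preset_number   # reports to wait for in each computing cycle
        self.hist_mean, self.hist_var, self.hist_count = 0.0, 0.0, 0
        self.pending = []                    # (mean, var, count) reports of the current cycle

    def receive(self, mean, var, count):
        """Called when a working node reports its target statistical parameters."""
        if len(self.pending) < self.preset_number:
            self.pending.append((mean, var, count))
        if len(self.pending) == self.preset_number:
            self._update()

    def _update(self):
        """Fold the pending reports into the historical global statistical parameters."""
        for mean, var, count in self.pending:
            total = self.hist_count + count
            new_mean = (self.hist_count * self.hist_mean + count * mean) / total
            self.hist_var = (self.hist_count * (self.hist_var + (self.hist_mean - new_mean) ** 2)
                             + count * (var + (mean - new_mean) ** 2)) / total
            self.hist_mean, self.hist_count = new_mean, total
        self.pending.clear()   # reports beyond the preset number are not waited for
```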
Updating the historical global statistical parameters based on the target statistical parameters may be: performing deviation correction on the historical global statistical parameters according to the target statistical parameters.
As described above, the parameter server may perform steps S310 and S320 periodically, so that the historical global statistical parameters of each batch normalization layer are synchronized to each working node when the model data is updated. The parameter server may perform step S310 when the updated historical global statistical parameters are obtained.
In a specific implementation, after obtaining the updated historical global statistical parameters, the parameter server may send the updated historical global statistical parameters to each working node; specifically, the updated historical global statistical parameters may be sent to a working node when that working node runs forward to the target normalization layer, where the target normalization layer is the normalization layer associated with the updated historical global statistical parameters.
With the deep learning model training method provided in the embodiments of the present application, in the process of training a batch normalization layer, batch normalization can be performed without each working node sending all of its training sample data to the parameter server and waiting for the parameter server to perform global statistics and return statistical parameters. Instead, each working node collects statistics on the batch normalization data of the batch normalization layer it uses and reports its own target statistical parameters, which reduces the amount of data interaction between the parameter server and the working nodes. Moreover, the parameter server can send the historical global statistical parameters to each working node, so that the working node can correct the target statistical parameters according to the received historical global statistical parameters to obtain the actual statistical parameters it uses, and perform batch normalization on the corresponding target layer using the actual statistical parameters and the target training samples, which can greatly reduce the waiting time of the working nodes.
In order to update the entire target model, correspondingly, as shown in Fig. 5, the deep learning model training method may further include the following steps.
In step S330, individual target model parameters, sent by each working node, obtained by training the layers of the target model are received.
In step S340, the target model of the current cycle is updated according to the received individual target model parameters, so as to obtain the target model for the next training cycle of the working nodes.
The deep learning model training method provided in the embodiments of the present application is illustrated below with reference to the data interaction process between a working node and the parameter server. As shown in Fig. 6, the method includes the following process.
Step 601: working node A obtains the target training data that currently needs to be trained from the database.
The target training data is the target batch training samples in the method embodiments shown in Fig. 2 and Fig. 3.
Step 602: working node A trains the target model based on the target training data.
The target model may include multiple layers that require global batch-size statistical parameters (such as a batchnorm batch normalization layer; only the batchnorm batch normalization layer is taken as an example below).
The global batch-size statistical parameters include statistical values such as the variance, sum, and mean, and also include the number of target training data.
Step 603: when working node A runs forward to the target batch normalization layer, it obtains the target statistical parameters and corrects them based on the historical global statistical parameters to obtain the actual statistical parameters.
In this step, working node A obtains the target training data of the target batch normalization layer and performs statistical calculation on the target training data to obtain the target statistical parameters, then uses the historical global statistical parameters of the target batch normalization layer synchronized from the parameter server to correct the target statistical parameters of this working node, obtaining the statistical parameters actually used at present, that is, the actual statistical parameters. These statistical parameters are kept on the local machine and are needed during back propagation.
Step 604: working node A sends the target statistical parameters to the parameter server.
Step 605: the parameter server updates the stored historical global statistical parameters according to the target statistical parameters, and sends the updated historical global statistical parameters to each working node (including working node A).
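A minimal end-to-end sketch of this interaction (steps 601 to 605) is given below, reusing the illustrative helper functions and aggregator class sketched earlier in this description; the message passing is modeled as plain function calls, which is an assumption made purely for readability.

```python
import numpy as np

def training_cycle(worker_data, server):
    """One training cycle of working node A for a single target batch normalization layer.

    `worker_data` is the target training data fetched from the database (step 601);
    `server` is assumed to be the LayerStatsAggregator sketched above, holding the
    historical global statistical parameters of this target layer.
    """
    # Steps 602/603: train forward to the target batch normalization layer, compute the
    # target statistical parameters, and correct them with the historical global ones.
    x_hat, (batch_mean, batch_var, batch_count) = worker_target_layer_step(
        worker_data, server.hist_mean, server.hist_var, server.hist_count)
    # Step 604: report the target statistical parameters to the parameter server.
    server.receive(batch_mean, batch_var, batch_count)
    # Step 605 happens on the server side; the actual statistical parameters are kept
    # locally by the worker for the rest of forward propagation and for back propagation.
    return x_hat

# Illustrative usage with two simulated working nodes sharing one aggregator.
server = LayerStatsAggregator(preset_number=2)
for node_data in (np.random.randn(32, 8), np.random.randn(32, 8)):
    training_cycle(node_data, server)
```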
In the embodiments of the present application, the working nodes and the parameter server cooperate with each other to execute the processes of the deep learning model training methods shown in Fig. 2 and Fig. 3, and can achieve the same beneficial effects. To avoid repetition, details are not repeated here.
Reference is made to Fig. 7, which is a structural diagram of a working node provided in an embodiment of the present application. As shown in Fig. 7, the working node 500 includes:
a sample acquisition module 510, configured to obtain target batch training samples; and
a training module 520, configured to train multiple layers of a target model according to the target batch training samples, wherein the multiple layers of the target model include at least one target layer requiring batch normalization.
The training module 520 includes:
a first receiving unit 521, configured to receive historical global statistical parameters sent by a parameter server, wherein the historical global statistical parameters are determined by the parameter server according to historical training data of a target layer of the target model, and the historical training data include target statistical parameters obtained by the current working node in training cycles before the current training cycle, as well as target statistical parameters obtained, in training cycles before the current training cycle, by other working nodes belonging to the same working network as the current working node; and
a determining unit 522, configured to determine actual statistical parameters of the currently trained target layer based on the historical global statistical parameters and the target statistical parameters, perform batch normalization on the corresponding target layer based on the actual statistical parameters and the target batch training samples, and send the target statistical parameters to the parameter server.
It should be explained that the "corresponding target layer" here refers to the target layer that belongs to the same training cycle as the actual statistical parameters and the target batch training samples and that needs to use the actual statistical parameters.
The working node is configured to execute the deep learning model training method provided in the first aspect of the present disclosure. The principle and beneficial effects of the deep learning model training method have been described in detail above and are not repeated here.
Optionally, the sample acquisition module 510 is configured to obtain the target batch training samples from a database by random sampling; or the sample acquisition module 510 is configured to obtain, from a database, the target batch training samples arranged at preset positions, wherein the training samples stored in the database are arranged in random order.
Optionally, the target statistical parameters include the statistical values of the target batch training samples and the number of samples of the target batch training samples.
Optionally, the working node 500 may further include:
a sending module 530, configured to send the individual target model parameters obtained by training the layers on the current working node to the parameter server; and
a model receiving module 540, configured to receive the updated target training model sent by the parameter server as the target training model for the next training cycle.
Reference is made to Fig. 8, which is a structural diagram of a parameter server provided in an embodiment of the present application. As shown in Fig. 8, the parameter server 600 includes:
a parameter receiving module 610, configured to receive target statistical parameters, for a target layer in a target training model, sent by multiple working nodes, wherein the target statistical parameters of the target layer include statistical parameters of the target batch training samples used by the working node that sends the target statistical parameters to train the target layer in the training cycle corresponding to the current computing cycle, the multiple working nodes belong to the same working network, and the target training model includes at least one target layer; and
an updating module 620, configured to calculate, according to the received target statistical parameters, historical global statistical parameters of the target layer of the target training model corresponding to the current computing cycle.
The parameter server 600 is configured to execute the deep learning model training method provided in the second aspect of the present disclosure. The beneficial effects and working principle of the deep learning model training method provided in the second aspect of the present disclosure have been described in detail above and are not repeated here.
Optionally, the updating module 620 may be specifically configured to, in the case where the target statistical parameters of the target layer corresponding to the current computing cycle sent by a preset number of first working nodes have been received, calculate the historical global statistical parameters of the target layer corresponding to the current cycle based on the historical global statistical parameters obtained in computing cycles before the current computing cycle and the preset number of target statistical parameters, wherein the preset number is less than or equal to the total number of working nodes in the working network.
Optionally, the parameter receiving module 610 is further configured to receive the individual target model parameters, sent by each working node, obtained by training the layers of the target model.
Correspondingly, the updating module 620 is further configured to update the target model of the current cycle according to the received individual target model parameters.
An embodiment of the present invention further provides an electronic device, including a processor, a memory, and a program or instruction stored in the memory and executable on the processor. When the program or instruction is executed by the processor, the processes of the deep learning model training method provided in the first aspect of the present disclosure or of the deep learning model training method provided in the second aspect of the present disclosure are implemented, and the same technical effects can be achieved. To avoid repetition, details are not repeated here.
An embodiment of the present disclosure further provides a readable storage medium storing a program or instruction, wherein the program or instruction, when executed by a processor, implements the steps of the deep learning model training method described in the first aspect of the present disclosure, or implements the steps of the deep learning model training method described in the second aspect of the present disclosure.
It should be noted that, herein, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or further includes elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
From the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific embodiments described above, which are merely illustrative rather than restrictive. Inspired by the present invention, a person of ordinary skill in the art can make many other forms without departing from the spirit of the present invention and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims (16)

  1. A deep learning model training method, applied to a worker node, wherein the method includes a plurality of training cycles, and each training cycle includes:
    obtaining target batch training samples; and
    training a plurality of layers of a target model according to the target batch training samples, wherein the plurality of layers of the target model include at least one target layer requiring batch normalization, and, starting from the second training cycle, training the target layer in each training cycle includes:
    receiving historical global statistical parameters sent by a parameter server, wherein the historical global statistical parameters are determined by the parameter server according to historical training data of the current target layer of the target model, and the historical training data include target statistical parameters obtained by the current worker node in training cycles before the current training cycle and target statistical parameters obtained, in training cycles before the current training cycle, by other worker nodes belonging to the same working network as the current worker node;
    obtaining target statistical parameters of the current target layer, wherein the target statistical parameters are statistical parameters of the target batch training samples;
    determining actual statistical parameters of the current target layer based on the historical global statistical parameters and the target statistical parameters; and
    performing batch normalization on the current target layer based on the actual statistical parameters and the target batch training samples, and sending the target statistical parameters to the parameter server.
  2. The deep learning model training method according to claim 1, wherein the obtaining target batch training samples includes:
    obtaining the target batch training samples from a database by random sampling;
    or,
    obtaining, from a database, the target batch training samples arranged at preset positions, wherein the training samples stored in the database are arranged in random order.
  3. The deep learning model training method according to claim 1 or 2, wherein the target statistical parameters include a statistical value of the target batch training samples and a sample quantity of the target batch training samples.
  4. The deep learning model training method according to claim 1 or 2, wherein each training cycle further includes:
    sending, to the parameter server, individual target model parameters obtained by training each layer at the current worker node; and
    before the next training cycle starts, receiving an updated target training model sent by the parameter server as the target training model for the next training cycle.
  5. A deep learning model training method, applied to a parameter server, wherein the deep learning model training method includes a plurality of computation cycles, and in each computation cycle the method includes:
    receiving target statistical parameters, sent by a plurality of worker nodes, for a target layer in a target training model, wherein the target statistical parameters of the target layer include statistical parameters of the target batch training samples used by the worker node that sent the target statistical parameters to train the target layer in the training cycle corresponding to the current computation cycle, the plurality of worker nodes belong to the same working network, and the target training model includes at least one target layer; and
    calculating, according to the received target statistical parameters, historical global statistical parameters of the target layer corresponding to the current computation cycle in the target training model.
  6. The deep learning model training method according to claim 5, wherein, starting from the second computation cycle, the calculating, according to the received target statistical parameters, historical global statistical parameters of the target layer corresponding to the current computation cycle in the target training model includes:
    in a case where target statistical parameters of the target layer corresponding to the current computation cycle are received from a preset number of worker nodes, calculating the historical global statistical parameters of the target layer corresponding to the current cycle based on the historical global statistical parameters obtained in computation cycles before the current computation cycle and the preset number of target statistical parameters, wherein the preset number is less than or equal to the total number of worker nodes in the working network.
  7. The deep learning model training method according to claim 5 or 6, wherein the deep learning model training method further includes:
    receiving individual target model parameters, sent by each worker node, obtained by training each layer of the target model; and
    updating the target model of the current cycle according to the received individual target model parameters.
  8. A worker node, comprising:
    a sample acquisition module configured to obtain target batch training samples; and
    a training module configured to train a plurality of layers of a target model according to the target batch training samples, wherein the plurality of layers of the target model include at least one target layer requiring batch normalization, and the training module includes:
    a first receiving unit configured to receive historical global statistical parameters sent by a parameter server, wherein the historical global statistical parameters are determined by the parameter server according to historical training data of a target layer of the target model, and the historical training data include target statistical parameters obtained by the current worker node in training cycles before the current training cycle and target statistical parameters obtained, in training cycles before the current training cycle, by other worker nodes belonging to the same working network as the current worker node;
    a first acquisition unit configured to obtain target statistical parameters of the target layer currently being trained, wherein the target statistical parameters are statistical parameters of the target batch training samples; and
    a determining unit configured to determine actual statistical parameters of the target layer currently being trained based on the historical global statistical parameters and the target statistical parameters, perform batch normalization on the corresponding target layer based on the actual statistical parameters and the target batch training samples, and send the target statistical parameters to the parameter server.
  9. The worker node according to claim 8, wherein the sample acquisition module is configured to obtain the target batch training samples from a database by random sampling;
    or,
    the sample acquisition module is configured to obtain, from a database, the target batch training samples arranged at preset positions, wherein the training samples stored in the database are arranged in random order.
  10. The worker node according to claim 8 or 9, wherein the target statistical parameters include a statistical value of the target batch training samples and a sample quantity of the target batch training samples.
  11. The worker node according to claim 8 or 9, wherein the worker node further includes:
    a sending module configured to send, to the parameter server, individual target model parameters obtained by training each layer at the current worker node; and
    a model receiving module configured to receive an updated target training model sent by the parameter server as the target training model for the next training cycle.
  12. A parameter server, comprising:
    a parameter receiving module configured to receive target statistical parameters, sent by a plurality of worker nodes, for a target layer in a target training model, wherein the target statistical parameters of the target layer include statistical parameters of the target batch training samples used by the worker node that sent the target statistical parameters to train the target layer in the training cycle corresponding to the current computation cycle, the plurality of worker nodes belong to the same working network, and the target training model includes at least one target layer; and
    an update module configured to calculate, according to the received target statistical parameters, historical global statistical parameters of the target layer corresponding to the current computation cycle in the target training model.
  13. The parameter server according to claim 12, wherein the update module is configured to, in a case where target statistical parameters of the target layer corresponding to the current computation cycle are received from a preset number of first worker nodes, calculate the historical global statistical parameters of the target layer corresponding to the current cycle based on the historical global statistical parameters obtained in computation cycles before the current computation cycle and the preset number of target statistical parameters, wherein the preset number is less than or equal to the total number of worker nodes in the working network.
  14. The parameter server according to claim 12 or 13, wherein the parameter receiving module is further configured to receive individual target model parameters, sent by each worker node, obtained by training each layer of the target model; and
    the update module is further configured to update the target model of the current cycle according to the received individual target model parameters.
  15. An electronic device, comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein, when the program or instructions are executed by the processor, the steps of the deep learning model training method according to any one of claims 1 to 4 are implemented, or the steps of the deep learning model training method according to any one of claims 5 to 7 are implemented.
  16. A readable storage medium, storing a program or instructions which, when executed by a processor, implement the steps of the deep learning model training method according to any one of claims 1 to 4, or the steps of the deep learning model training method according to any one of claims 5 to 7.
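The worker-side training cycle set out in claims 1 to 4 can be illustrated with a minimal sketch. The function names, the `server` client object, the momentum coefficient used to blend the historical global statistics with the current batch statistics, and the NumPy-based data loader are assumptions introduced for readability; the claims do not fix these details, and in particular they do not fix the exact rule for deriving the actual statistical parameters.

```python
# Minimal worker-side sketch (claims 1-4), assuming NumPy and a hypothetical
# `server` object that communicates with the parameter server.
import numpy as np

def get_target_batch(database, batch_size, rng):
    # Claim 2, first alternative: random sampling from the database.
    idx = rng.choice(len(database), size=batch_size, replace=False)
    return database[idx]

def train_target_layer(x, layer, server, layer_id, momentum=0.9, eps=1e-5):
    # Claim 1: receive the historical global statistical parameters for this target layer.
    hist_mean, hist_var = server.fetch_global_stats(layer_id)

    # Claim 3: target statistical parameters of the current batch
    # (a statistical value plus the sample quantity).
    batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]

    # Actual statistical parameters: an assumed momentum blend of historical and
    # current statistics; the first cycle has no history and uses the batch alone.
    if hist_mean is None:
        mean, var = batch_mean, batch_var
    else:
        mean = momentum * hist_mean + (1.0 - momentum) * batch_mean
        var = momentum * hist_var + (1.0 - momentum) * batch_var

    # Batch-normalize the target layer input with the actual statistics, then report
    # the batch statistics back to the parameter server (last step of claim 1).
    x_hat = (x - mean) / np.sqrt(var + eps)
    server.send_target_stats(layer_id, batch_mean, batch_var, n)
    return layer.forward(x_hat)
```

In the sketch, `server` stands for whatever client object the worker uses to talk to the parameter server; under claim 4 the worker would additionally send its locally trained layer parameters after the cycle and receive the updated target training model before the next cycle starts.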
PCT/CN2021/115544 2020-08-31 2021-08-31 Learning model training method, working node, server, device and medium WO2022042741A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010896348.9 2020-08-31
CN202010896348.9A CN112016699B (en) 2020-08-31 2020-08-31 Deep learning model training method, working node and parameter server

Publications (1)

Publication Number Publication Date
WO2022042741A1 (en) 2022-03-03

Family

ID=73503128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115544 WO2022042741A1 (en) 2020-08-31 2021-08-31 Learning model training method, working node, server, device and medium

Country Status (2)

Country Link
CN (1) CN112016699B (en)
WO (1) WO2022042741A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016699B (en) * 2020-08-31 2024-02-02 北京灵汐科技有限公司 Deep learning model training method, working node and parameter server
CN114004358B (en) * 2021-12-29 2022-06-14 粤港澳大湾区数字经济研究院(福田) Deep learning model training method
CN116663639B (en) * 2023-07-31 2023-11-03 浪潮电子信息产业股份有限公司 Gradient data synchronization method, system, device and medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190026657A1 (en) * 2016-03-26 2019-01-24 Alibaba Group Holding Limited Distributed Cluster Training Method and Apparatus
CN107688493A (en) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 Train the method, apparatus and system of deep neural network
CN108122032A (en) * 2016-11-29 2018-06-05 华为技术有限公司 A kind of neural network model training method, device, chip and system
CN107578094A (en) * 2017-10-25 2018-01-12 济南浪潮高新科技投资发展有限公司 The method that the distributed training of neutral net is realized based on parameter server and FPGA
CN109754060A (en) * 2017-11-06 2019-05-14 阿里巴巴集团控股有限公司 A kind of training method and device of neural network machine learning model
CN108491928A (en) * 2018-03-29 2018-09-04 腾讯科技(深圳)有限公司 Model parameter training method, device, server and storage medium
CN112016699A (en) * 2020-08-31 2020-12-01 北京灵汐科技有限公司 Deep learning model training method, working node and parameter server

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370471A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Global prediction method, device, equipment and storage medium based on pruning average
CN117370471B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Global prediction method, device, equipment and storage medium based on pruning average

Also Published As

Publication number Publication date
CN112016699B (en) 2024-02-02
CN112016699A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2022042741A1 (en) Learning model training method, working node, server, device and medium
US11687832B1 (en) Training a model using parameter server shards
US10540587B2 (en) Parallelizing the training of convolutional neural networks
CN111091199B (en) Federal learning method, device and storage medium based on differential privacy
WO2021056390A1 (en) Synchronous training method and cluster for convolutional neural network model, and readable storage medium
CN108009642B (en) Distributed machine learning method and system
CN107944566B (en) Machine learning method, main node, working node and system
CN111030861B (en) Edge calculation distributed model training method, terminal and network side equipment
US20100217734A1 (en) Method and system for calculating value of website visitor
EP3688673A1 (en) Neural architecture search
CN109816412A (en) A kind of training pattern generation method, device, equipment and computer storage medium
CN112990478A (en) Federal learning data processing system
JPWO2021119601A5 (en)
CN114760308A (en) Edge calculation unloading method and device
GB2615219A (en) Remote system update and monitoring
CN115374954A (en) Model training method based on federal learning, terminal and storage medium
CN115796289A (en) Client selection method and system for federated Bayesian learning
CN116011991A (en) Multi-user collaborative task guaranteeing method based on agent and backup technology
CN115713128A (en) Federal learning method based on equipment training time fairness
CN115987443A (en) Timing method and device and instrument equipment
CN110753366A (en) Prediction processing method and device for industry short message gateway capacity
CN116341687A (en) Federal learning self-adaptive optimization method based on number of clients and communication period
US20230145177A1 (en) Federated learning method and federated learning system based on mediation process
CN112506673B (en) Intelligent edge calculation-oriented collaborative model training task configuration method
CN115422787B (en) Engine simulation model balancing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21860573; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: PCT application non-entry in European phase (Ref document number: 21860573; Country of ref document: EP; Kind code of ref document: A1)