CN113283596A - Model parameter training method, server, system and storage medium - Google Patents

Model parameter training method, server, system and storage medium Download PDF

Info

Publication number
CN113283596A
Authority
CN
China
Prior art keywords
parameter
server
parameters
training
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110542415.1A
Other languages
Chinese (zh)
Inventor
董星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110542415.1A
Publication of CN113283596A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure relates to a model parameter training method, a server, a system, and a storage medium, and relates to the technical field of machine learning. The embodiments of the disclosure at least solve the problem in the related art that the parameter training of a prediction model takes a long time. The method is applied to a work server of a distributed system and includes the following steps: acquiring current embedding parameters corresponding to training samples of a current batch; acquiring network parameters currently stored by the work server from the work server; performing iterative training on the current embedding parameters and the network parameters currently stored by the work server based on the training samples of the current batch to obtain embedding parameter gradients and network parameter gradients; updating the current embedding parameters based on the embedding parameter gradients, and synchronizing the updated embedding parameters to a parameter server; and updating the network parameters currently stored by the work server based on the network parameter gradients.

Description

Model parameter training method, server, system and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a model parameter training method, a server, a system, and a storage medium.
Background
Currently, in the related art, a deep neural network is used as a prediction model for determining a click-through rate (CTR), and the network parameters in the prediction model are obtained by training a large-scale sparse model based on a distributed Parameter Server (PS) architecture. Specifically, the PS architecture includes a parameter server and a work server. During training, the work server obtains a training sample from the outside, obtains the model parameters from the parameter server, performs iterative training on the model parameters, and sends the gradients of the model parameters obtained by the training to the parameter server; the parameter server then updates the stored model parameters according to the gradients of the model parameters.
However, the parameter server mainly uses a Central Processing Unit (CPU) to update the model parameters, while the work server mainly uses a Graphics Processing Unit (GPU) to perform the iterative training. During the iterative training process, a large number of model parameters and gradients need to be transmitted between the parameter server and the work server, and the amount of data transmitted between the CPU and the GPU is large, so the whole training process takes a long time and the hardware resource utilization is low.
Disclosure of Invention
The present disclosure provides a model parameter training method, a server, a system, and a storage medium, to at least solve the problem in the related art that the parameter training of a prediction model takes a long time. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a parameter training method for a prediction model, applied to a work server of a distributed system, including: acquiring current embedding parameters corresponding to training samples of a current batch; acquiring network parameters currently stored by a work server from the work server; performing iterative training on the current embedding parameters and the network parameters currently stored by the working server based on the training samples of the current batch to obtain embedding parameter gradients and network parameter gradients; updating the current embedding parameters based on the embedding parameter gradient, and synchronizing the updated embedding parameters to the parameter server; and updating the currently stored network parameters of the working server based on the network parameter gradient.
The current embedding parameters are pre-stored in the work server, and the parameter training method further includes: under the condition that the embedding parameters currently stored by the work server comprise partial embedding parameters of the current embedding parameters, acquiring differential embedding parameters from a parameter server and storing the differential embedding parameters; the differential embedding parameters comprise the embedding parameters of the current embedding parameters except for the partial embedding parameters;
or, under the condition that the embedding parameters currently stored by the working server do not include the current embedding parameters, acquiring the current embedding parameters from the parameter server, and storing the current embedding parameters.
Optionally, in a case that the current batch is the first batch, before the network parameter currently stored by the work server is obtained from the work server, the parameter training method further includes: network parameters from a predictive model of a parameter server are received and stored.
Optionally, when the current batch is a batch other than the first batch, before the network parameter currently stored by the work server is obtained from the work server, the parameter training method further includes:
merging the multiple historical network parameter gradients, and storing the network parameters obtained by the merging; the plurality of historical network parameter gradients include network parameter gradients obtained by training based on the training samples of the previous batch of the current batch.
Optionally, the parameter training method further includes: and acquiring training samples of the next batch in the process of carrying out iterative training on the current embedding parameters and the network parameters currently stored by the working server, and acquiring the embedding parameters corresponding to the training samples of the next batch based on the training samples of the next batch.
Optionally, the parameter training method further includes:
acquiring target network parameters and sending the target network parameters to a parameter server; the target network parameters comprise network parameters obtained by training based on the training samples of the last batch.
According to a second aspect of the embodiments of the present disclosure, there is provided a work server, including an obtaining unit, a training unit, an updating unit, and a sending unit; the obtaining unit is configured to obtain current embedding parameters corresponding to the training samples of the current batch; the obtaining unit is further configured to obtain the network parameters currently stored by the work server from the work server; the training unit is configured to perform iterative training on the current embedding parameters obtained by the obtaining unit and the network parameters currently stored by the work server based on the training samples of the current batch, to obtain embedding parameter gradients and network parameter gradients; the updating unit is configured to update the current embedding parameters based on the embedding parameter gradients obtained by the training of the training unit; the sending unit is configured to synchronize the embedding parameters updated by the updating unit to the parameter server; and the updating unit is further configured to update the network parameters currently stored by the work server based on the network parameter gradients.
Optionally, the current embedding parameter is pre-stored in a work server, and the work server further includes a storage unit; the acquisition unit is also used for acquiring the difference embedding parameters from the parameter server under the condition that the embedding parameters currently stored by the working server comprise part of the embedding parameters of the current embedding parameters; a storage unit for storing the differential embedding parameter; the differential embedding parameters comprise embedding parameters of the current embedding parameters except for partial embedding parameters;
or the obtaining unit is further configured to obtain the current embedding parameter from the parameter server when the embedding parameter currently stored by the working server does not include the current embedding parameter; and the storage unit is used for storing the current embedding parameters.
Optionally, the work server further includes a receiving unit and a storage unit; the receiving unit is used for receiving the network parameters of the prediction model from the parameter server before the acquiring unit acquires the network parameters currently stored by the work server from the work server under the condition that the current batch is the first batch; and the storage unit is used for storing the network parameters of the prediction model received by the receiving unit from the parameter server.
Optionally, when the current batch is a batch other than the first batch, before the obtaining unit obtains the network parameters currently stored by the work server from the work server, the updating unit is specifically configured to: merge the multiple historical network parameter gradients, and store the network parameters obtained by the merging; the plurality of historical network parameter gradients include network parameter gradients obtained by training based on the training samples of the previous batch of the current batch.
Optionally, the obtaining unit is further configured to obtain a next batch of training samples during iterative training of the current embedding parameter and the network parameter currently stored by the working server, and obtain the embedding parameter corresponding to the next batch of training samples based on the next batch of training samples.
Optionally, the obtaining unit is further configured to obtain a target network parameter; the target network parameters comprise network parameters obtained by training based on the training samples of the last batch; and the sending unit is also used for sending the target network parameters to the parameter server.
According to a third aspect of the embodiments of the present disclosure, there is provided a work server, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the parameter training method of the prediction model as provided in the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform the method of parameter training of a predictive model as provided in the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a distributed parameter training system, including a plurality of parameter servers and a plurality of work servers; any one of the plurality of work servers is configured to perform the parameter training method of the predictive model of the first aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions, characterized in that the instructions, when executed by a processor, implement the method for parameter training of a prediction model of the first aspect.
The technical scheme provided by the disclosure at least brings the following beneficial effects: the network parameters required by the iterative training can be stored in advance by the work server, so that the network parameters required by the iterative training can be directly acquired from the cache of the work server when the iterative training is required. Meanwhile, for updating the current embedding parameters and the currently stored network parameters, the work server can perform the updates on the work server side, and the gradient values obtained by each training do not need to be sent to the parameter server; transmission of the training gradients can be correspondingly reduced, the iterative training time can be further reduced, and the resource utilization rate of the hardware can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating a distributed parameter training system architecture in accordance with an exemplary embodiment;
FIG. 2 is one of the flow diagrams illustrating a method for parameter training of a predictive model according to an exemplary embodiment;
FIG. 3 is a second flowchart illustrating a method for parameter training of a predictive model according to an exemplary embodiment;
FIG. 4 is a third flowchart illustrating a method of parameter training of a predictive model according to an exemplary embodiment;
FIG. 5 is a fourth flowchart illustrating a method of parameter training of a predictive model according to an exemplary embodiment;
FIG. 6 is a fifth flowchart illustrating a method of parameter training of a predictive model in accordance with an exemplary embodiment;
FIG. 7 is a sixth flowchart illustrating a method of parameter training of a predictive model according to an exemplary embodiment;
FIG. 8 is one of the schematic structural diagrams of a work server shown in accordance with an exemplary embodiment;
fig. 9 is a second schematic diagram illustrating a structure of a work server according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In addition, in the description of the embodiments of the present disclosure, "/" indicates an OR meaning, for example, A/B may indicate A or B, unless otherwise specified. "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more than two.
The parameter training method for the prediction model provided by the embodiments of the present disclosure (for convenience of description, hereinafter referred to as the parameter training method) may be applied to a distributed parameter training system (in practical applications, also referred to as a distributed system). Fig. 1 shows a schematic structural diagram of the distributed parameter training system. As shown in fig. 1, the distributed parameter training system 10 is used for training to obtain the model parameters in the prediction model, and the distributed parameter training system 10 includes a plurality of parameter servers (for example, 3 parameter servers 111, 112, and 113 are shown in fig. 1; in practical applications, a greater or smaller number of parameter servers may be used) and a plurality of work servers (for example, 3 work servers 121, 122, and 123 are shown in fig. 1; in practical applications, a greater or smaller number of work servers may be used). The parameter servers and the work servers in the distributed parameter training system adopt a distributed architecture; the parameter servers are respectively connected with the work servers, and the work servers may be connected in a bus topology or a ring topology. The parameter servers and the work servers may be connected in a wired manner or in a wireless manner, which is not limited in the embodiments of the present disclosure.
The parameter server is used for storing the full amount of model parameters and sending the model parameters to the work server when the work server initiates an iterative training requirement.
The parameter server is also used for updating the model parameters. Illustratively, the parameter server receives the model gradient sent by the work server, and updates the model parameters according to the received model gradient.
The parameter server according to the embodiments of the present disclosure may be any one of the plurality of parameter servers, or may be a main parameter server for managing the plurality of parameter servers. The parameter server mainly stores the full set of model parameters and updates the model parameters by using a CPU in the parameter server. The model parameters include embedding parameters and network parameters.
Wherein the embedding parameters comprise parameters involved in an embedding layer in the prediction model for converting a sample vector of sparse training samples into a dense vector of fixed size. The network parameters include parameters in the prediction model, such as W (weight), b (bias), and the like. Illustratively, the Parameter Server may be a Parameter Server (PS) in a distributed Parameter Server architecture.
The working server is used for obtaining the training samples from the outside and obtaining the network parameters of the prediction model from the parameter server so as to carry out iterative training.
The working server is further used for sending the model parameters obtained by the iterative training to the parameter server and storing the model parameters obtained by the iterative training.
The work server according to the embodiments of the present disclosure may be any one of the plurality of work servers, or may be a master work server for managing the plurality of work servers. The work server includes a graphics card (GPU) cache for storing model parameters and model gradients, and a computation module for iterative training.
Illustratively, the work server may be an execution server (Worker) in a distributed parameter server architecture.
The parameter training method provided by the embodiment of the present disclosure may be specifically applied to any one of the work servers in the distributed parameter training system, and the parameter training method provided by the embodiment of the present disclosure is described below with reference to fig. 1.
As shown in fig. 2, the parameter training method provided in the embodiment of the present disclosure specifically includes the following steps S201 to S206.
S201, the work server obtains current embedding parameters corresponding to training samples of the current batch.
As a possible implementation manner, after obtaining the training samples of the current batch from the external sample server, the work server may query, according to the training samples of the current batch, the parameter server for the embedded parameters corresponding to the training samples of the current batch.
As another possible implementation manner, after obtaining the training samples of the current batch from the external sample server, the work server may query, according to the training samples of the current batch, the embedding parameters corresponding to the training samples of the current batch from the work server.
It should be noted that the sample server stores a plurality of batches of training sample sets, and the training samples of the current batch include a plurality of mini-batch samples obtained by dividing the samples in the training sample set of the current batch.
The embedding parameters are used to convert the sample vector of sparse training samples into a dense vector of fixed size. The embedding parameters corresponding to the training samples may include a plurality of embedding sub-parameters, and the embedding sub-parameters in the corresponding embedding parameters are different for different batches of training samples.
The specific implementation manner of this step may refer to the subsequent description of the embodiment of the present disclosure, and is not described herein again.
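To make the role of the embedding parameters concrete, the following is a minimal Python sketch (not part of the disclosure) of an embedding table keyed by sparse feature ID; the names EMBEDDING_DIM, embedding_table, and lookup_embeddings are hypothetical, and the lazy row initialization is only one common convention for sparse models.

```python
import numpy as np

EMBEDDING_DIM = 8  # hypothetical fixed size of each dense vector

# Hypothetical embedding table: sparse feature ID -> dense vector.
# The "current embedding parameters" of a batch are the rows hit by that batch.
embedding_table = {}

def lookup_embeddings(batch_feature_ids):
    """Convert the sparse feature IDs of one mini-batch into dense vectors."""
    dense_rows = []
    for feature_id in batch_feature_ids:
        # Rows not seen before are initialized lazily, as sparse models commonly do.
        if feature_id not in embedding_table:
            embedding_table[feature_id] = np.random.normal(
                scale=0.01, size=EMBEDDING_DIM)
        dense_rows.append(embedding_table[feature_id])
    return np.stack(dense_rows)

# Two mini-batches hit different rows, i.e. different embedding sub-parameters.
batch_a = lookup_embeddings([101, 305, 512])
batch_b = lookup_embeddings([101, 777])
```

Because the two batches touch different feature IDs, their corresponding embedding parameters consist of different embedding sub-parameters, which is the situation the differential fetch described later exploits.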
S202, the work server obtains the network parameters currently stored by the work server from the work server.
As a possible implementation manner, after obtaining the training samples of the current batch from the external sample server, the work server queries the currently stored network parameters from the cache of the work server.
It should be noted that the cache of the work server is a graphics card cache and is an area used for storing data in the GPU of the work server. The network parameters include parameters in the prediction model, such as W (weight), b (bias), and the like.
S201 and S202 are not required to be executed in a particular order; in practical applications, the work server may execute S201 first and then S202, execute S202 first and then S201, or execute S201 and S202 simultaneously, which is not limited in the present disclosure.
S203, the work server carries out iterative training on the current embedding parameters and the network parameters currently stored by the work server based on the training samples of the current batch to obtain embedding parameter gradients and network parameter gradients.
As a possible implementation manner, the work server uses the training samples of the current batch, the current embedding parameters, and the network parameters currently stored by the work server as input layers, and performs forward calculation and backward propagation calculation based on the sparse model to obtain the embedding parameter gradient of the current embedding parameter and the network parameter gradient of the network parameters currently stored by the work server.
The specific implementation manner in this step may specifically refer to a training process of a worker in the existing PS framework, and is not described here again.
And S204, updating the current embedding parameters by the working server based on the embedding parameter gradient.
The specific implementation manner of this step may refer to a process in which the parameter server updates the embedded parameter according to the embedded parameter gradient in the prior art, which is not described herein again.
In one case, the work server, after updating the current embedding parameters, stores the updated current embedding parameters into the cache of the work server.
For example, in the case that the current embedding parameters corresponding to the training samples of the current batch include embedding sub-parameters L1, L3, and L5, after obtaining the updated current embedding parameters, the working server obtains updated current embedding parameters including updated L1, updated L3, and updated L5. In this case, the working server deletes L1, L3, and L5 in the cache, and stores updated L1, updated L3, and updated L5.
S205, the work server synchronizes the updated embedded parameters to the parameter server.
As a possible implementation, the work server sends the updated embedding parameters to the parameter server.
Illustratively, after the work server obtains updated L1, updated L3, and updated L5, it sends them to the parameter server so that the parameter server can update the full set of embedding parameters it stores.
S206, the work server updates the currently stored network parameters of the work server based on the network parameter gradient.
As a possible implementation manner, the work server determines the updated network parameters based on the network parameter gradient and the network parameter gradient obtained by training of other work servers in the distributed parameter training system, and updates the network parameters currently stored in the cache of the work server by using the updated network parameters.
The specific implementation manner of this step may refer to the subsequent detailed description of the embodiment of the present disclosure, and is not described herein again.
The technical scheme provided by the embodiment at least has the following beneficial effects: the network parameters required by the iterative training can be stored in advance by the work server, so that the network parameters required by the iterative training can be directly acquired from the cache of the work server when the iterative training is required. Meanwhile, for updating the current embedding parameters and the currently stored network parameters, the work server can perform the updates on the work server side, and the gradient values obtained by each training do not need to be sent to the parameter server; transmission of the training gradients can be correspondingly reduced, the iterative training time can be further reduced, and the resource utilization rate of the hardware can be improved.
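For readability, the following hedged Python sketch strings S201-S206 together as one work-server training step; WorkServer, its ps client, the model object, the fixed learning rate, and allreduce_mean are illustrative assumptions rather than elements defined by this disclosure.

```python
import numpy as np

def allreduce_mean(net_grads):
    # Stand-in for the gradient merge across work servers (see S209-S210);
    # with a single work server it simply returns the gradients unchanged.
    return net_grads

class WorkServer:
    """Hypothetical work server keeping embedding and network parameters in its GPU cache."""

    def __init__(self, ps_client, model, lr=0.01):
        self.ps = ps_client            # hypothetical parameter-server client
        self.model = model             # provides forward/backward of the prediction model
        self.lr = lr
        self.embedding_cache = {}      # locally cached embedding parameters
        self.network_params = {}       # locally cached network parameters (W, b, ...)

    def train_step(self, batch):
        # S201: current embedding parameters for this batch (missing rows are
        # fetched from the parameter server beforehand, see S301-S305).
        embeddings = {fid: self.embedding_cache[fid] for fid in batch.feature_ids}

        # S202: network parameters are read from the work server's own cache.
        net = self.network_params

        # S203: forward and backward computation yields both kinds of gradients.
        emb_grads, net_grads = self.model.forward_backward(batch, embeddings, net)

        # S204: update the current embedding parameters locally (plain SGD as a stand-in).
        for fid, grad in emb_grads.items():
            self.embedding_cache[fid] = self.embedding_cache[fid] - self.lr * grad

        # S205: synchronize only the updated embedding rows to the parameter server.
        self.ps.push_embeddings({fid: self.embedding_cache[fid] for fid in emb_grads})

        # S206: merge network gradients with the other work servers and update the
        # locally cached network parameters; no gradient is sent to the parameter server.
        for name, grad in allreduce_mean(net_grads).items():
            self.network_params[name] = self.network_params[name] - self.lr * grad
```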
In one design, the current embedding parameter corresponding to the training sample of the current batch may be an embedding parameter pre-stored in the work server, and in S201 provided in this embodiment of the disclosure, the work server may specifically obtain the current embedding parameter from a memory thereof. In order to store the current embedded parameters in the work server in advance, as shown in fig. 3, the parameter training method provided by the embodiment of the present disclosure may further include the following steps S301 to S305.
S301, the work server determines whether the currently stored embedding parameters include the current embedding parameters.
As a possible implementation manner, in the process of performing iterative training on the embedding parameters and the network parameters corresponding to the previous batch of training samples, the working server obtains the training samples of the current batch from the external sample server, and queries whether the memory stores the embedding sub-parameters included in the current embedding parameters.
S302, under the condition that the embedding parameters currently stored by the work server include partial embedding parameters of the current embedding parameters, the work server acquires the difference embedding parameters from the parameter server.
Wherein the differential embedding parameters include embedding parameters of the current embedding parameters except for the partial embedding parameters.
Illustratively, the current embedding parameters include embedding sub-parameters L1, L2, L3 and L4, and if the work server currently stores embedding parameters L1 and L3, the work server determines that L1 and L3 are partial embedding parameters in the current embedding parameters. Meanwhile, the work server determines L2 and L4 as the differential embedding parameters, and obtains the differential embedding parameters L2 and L4 from the parameter server.
And S303, the work server stores the difference embedding parameters.
As a possible implementation manner, the work server stores the obtained differential embedding parameters into the graphics card cache of the work server.
It can be understood that, after the work server stores the differential embedding parameters, the graphics card cache of the work server includes the current embedding parameters corresponding to the training samples of the current batch.
The technical scheme provided by the embodiment at least has the following beneficial effects: the work server may determine whether the currently stored embedding parameters include partial embedding parameters in the current embedding parameters, and obtain, from the parameter server, the difference embedding parameters corresponding to the training samples of the current batch under the condition that the currently stored embedding parameters include partial embedding parameters. Compared with the prior art, the method has the advantages that the stored and updated partial embedded parameters do not need to be acquired from the parameter server, the transmission quantity of the embedded parameters between the parameter server and the work server can be further reduced, and the transmission resources are saved.
S304, under the condition that the embedding parameters currently stored by the working server do not comprise the current embedding parameters, the working server acquires the current embedding parameters from the parameter server.
As a possible implementation manner, if the work server determines that any part of the current embedding parameters are not included in the currently stored embedding parameters, the work server obtains the current embedding parameters from the parameter server.
S305, the work server stores the current embedding parameters.
The specific implementation of this step may refer to the specific description in S303, which is not described herein again.
In another case, if the work server determines that the currently stored embedding parameters include the current embedding parameters, after training on the training samples of the previous batch is completed, the current embedding parameters are directly obtained from the graphics card cache, and training is performed on the current embedding parameters based on the training samples of the current batch.
Further, after the iterative training of the embedding parameters and the network parameters corresponding to the previous batch of training samples by the work server is completed, the work server obtains the embedding parameters corresponding to the training samples of the current batch from the cache of the work server according to the training samples of the current batch.
The technical scheme provided by the embodiment at least has the following beneficial effects: under the condition that the work server determines that the currently stored embedding parameters do not include the current embedding parameters, the work server can acquire the current embedding parameters from the parameter server in advance and store them in the graphics card cache, thereby achieving an "asynchronous prefetching" effect, saving the parameter training time of the work server, and improving the training efficiency.
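Continuing the sketch above, the cache check of S301-S305 might look as follows; ps.pull_embeddings is a hypothetical RPC that returns the requested rows, and everything else reuses the illustrative WorkServer fields.

```python
def prefetch_embeddings(work_server, batch_feature_ids):
    """Fetch only the embedding rows the work server does not already hold (S301-S305)."""
    needed = set(batch_feature_ids)
    cached = needed & set(work_server.embedding_cache)

    # S302/S304: the differential embedding parameters are whatever the batch needs
    # beyond the partial embedding parameters already in the graphics card cache.
    missing = needed - cached
    if missing:
        rows = work_server.ps.pull_embeddings(sorted(missing))  # hypothetical RPC
        # S303/S305: store the fetched rows so the upcoming training step finds
        # the complete current embedding parameters locally.
        work_server.embedding_cache.update(rows)

    return {fid: work_server.embedding_cache[fid] for fid in batch_feature_ids}
```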
In one design, in order to obtain and pre-store the network parameters of the prediction model from the work server, as shown in fig. 4, the parameter training method provided by the embodiment of the present disclosure further includes, before S202, the following S207-S208 in the case that the current lot is the first lot.
And S207, the work server receives the network parameters of the prediction model from the parameter server.
As a possible implementation manner, after obtaining the training samples of the current batch, the work server requests the parameter server to send the network parameters of the prediction model to the work server.
And S208, the work server stores the network parameters of the prediction model from the parameter server.
As a possible implementation manner, the work server stores the received network parameters into the graphics card cache of the work server.
The technical scheme provided by the embodiment at least has the following beneficial effects: for the training samples of the first batch, the network parameters can be obtained from the parameter server and stored, and a data basis can be provided for the subsequent iterative training in the working server. It can be understood that the work server only needs to obtain the network parameters of the prediction model from the parameter server when training for the first time, and does not need to transmit the network parameters with the parameter server when training each time, so that the transmission pressure between the parameter server and the work server can be reduced, and the parameter training efficiency is improved.
In one design, in order to obtain a network parameter corresponding to a training sample prestored in a current batch from a work server in an iterative training process of the work server, as shown in fig. 5, the parameter training method provided in the embodiment of the present disclosure further includes, before S202, the following S209 to S210 when the current batch is a batch other than a first batch.
S209, the work server merges a plurality of historical network parameter gradients to determine the network parameters corresponding to the training samples of the current batch.
The plurality of historical network parameter gradients include the network parameter gradients obtained by training based on the training samples of the previous batch of the current batch.
As a possible implementation manner, the work server performs averaging, transmission, and merging of the historical network parameter gradients with the other work servers in the distributed parameter training system based on a preset reduction (allreduce) algorithm, until each work server stores the network parameters updated based on the merged gradients.
It should be noted that, for a specific implementation manner of this step, reference may be made to the description of ring-allreduce in the prior art.
It will be appreciated that in this case, all the work servers in the distributed parameter training system are connected in a ring connection.
As another possible implementation manner, the work servers receive historical network gradient values sent by other work servers, combine and average all the received historical network gradient values to obtain updated network parameters, and send the updated network parameters to the other work servers until each work server stores the updated network parameters based on the gradient values.
It should be noted that the communication between the work server and the other work servers may use communication modes between the GPUs of the work servers, such as NVLink, Remote Direct Memory Access (RDMA), or DPDK, so as to transmit the gradient values.
The implementation manner of this step may also serve as the specific implementation manner of updating the currently stored network parameters based on the network parameter gradients in S206; the difference is that the batches of training samples corresponding to the updated network parameters are different.
S210, the work server stores the network parameters obtained by the merging.
As a possible implementation manner, the work server stores the network parameters corresponding to the training samples of the current batch into a cache of the work server.
The technical scheme provided by the embodiment at least has the following beneficial effects: compared with the prior art, the method can replace a parameter server to update the network parameters without sending the historical network gradients to the parameter server one by one, and can reduce the parameter training time of the working server. Meanwhile, the work servers adopt an allreduce algorithm to merge historical network gradients, so that the merging speed can be increased, and the time for parameter training can be further shortened.
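The following sketch (assuming the gradients are NumPy arrays and the collective is simulated in-process) illustrates the merge-and-average step of S209-S210; a production system would replace simulated_allreduce_mean with a GPU collective such as ring-allreduce over NVLink or RDMA.

```python
import numpy as np

def simulated_allreduce_mean(grads_per_worker):
    """In-process stand-in for allreduce: every worker ends up with the element-wise mean."""
    num_workers = len(grads_per_worker)
    merged = {}
    for name in grads_per_worker[0]:
        merged[name] = sum(g[name] for g in grads_per_worker) / num_workers
    # Each worker receives (a copy of) the same merged gradients.
    return [dict(merged) for _ in range(num_workers)]

def apply_merged_gradients(network_params, merged_grads, lr=0.01):
    """S210: update and store the locally cached network parameters from the merged gradients."""
    for name, grad in merged_grads.items():
        network_params[name] -= lr * grad
    return network_params

# Example with two workers holding gradients for the same parameters W and b.
worker_grads = [
    {"W": np.array([0.2, -0.4]), "b": np.array([0.1])},
    {"W": np.array([0.4, -0.2]), "b": np.array([0.3])},
]
merged = simulated_allreduce_mean(worker_grads)
params = {"W": np.array([1.0, 1.0]), "b": np.array([0.0])}
params = apply_merged_gradients(params, merged[0])
```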
In one design, in order to save the time for parameter training and improve the effect of parameter training, as shown in fig. 6, the parameter training method provided in the embodiment of the present disclosure further includes the following steps S401 to S403.
S401, the working server obtains training samples of the next batch in the process of performing iterative training on the current embedded parameters and the network parameters currently stored by the working server.
As a possible implementation manner, in the process of executing S203, the work server may start to obtain training samples of a next batch from the external sample server.
The specific implementation manner of obtaining the training samples of the next batch in this step may refer to the specific description in S201 in the embodiment of the present disclosure, and is not described herein again. The difference is that the batches of training samples obtained are different.
S402, the work server obtains the embedded parameters corresponding to the training samples of the next batch from the parameter server based on the training samples of the next batch.
The specific implementation manner of this step may refer to the specific description of any implementation manner in S201 provided in the embodiments of the present disclosure, and details are not repeated here. The difference is that the batches of training samples corresponding to the embedding parameters obtained in S402 are different.
S403, the work server stores the acquired embedding parameters.
As a possible implementation manner, the work server stores the acquired embedding parameters corresponding to the training samples of the next batch into the graphics card cache of the work server.
The technical scheme provided by the embodiment at least has the following beneficial effects: the method can asynchronously obtain the training samples of the next batch and the embedded parameters corresponding to the training samples of the next batch in the training process, realize parameter prefetching, save the time of model parameter iterative training and improve the training speed.
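As one way to realize the asynchronous prefetching of S401-S403, here is a hedged Python sketch using a background thread; sample_server.next_batch is a hypothetical interface to the external sample server, and prefetch_embeddings and train_step reuse the earlier illustrative sketches.

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(work_server, sample_server, num_batches):
    """Overlap training on the current batch with fetching the next batch and its embeddings."""
    executor = ThreadPoolExecutor(max_workers=1)

    def fetch(batch_index):
        # S401: obtain the next batch of training samples from the external sample server.
        batch = sample_server.next_batch(batch_index)
        # S402/S403: obtain and cache the corresponding embedding parameters.
        prefetch_embeddings(work_server, batch.feature_ids)
        return batch

    pending = executor.submit(fetch, 0)
    for i in range(num_batches):
        batch = pending.result()                      # wait only if the prefetch is not done yet
        if i + 1 < num_batches:
            pending = executor.submit(fetch, i + 1)   # start prefetching the next batch
        work_server.train_step(batch)                 # iterative training on the current batch (S203-S206)
    executor.shutdown()
```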
In one design, in order to enable updating of network parameters, as shown in fig. 7, the parameter training method provided in the embodiment of the present disclosure further includes the following steps S501 to S502.
S501, the work server obtains target network parameters.
And the target network parameters comprise network parameters obtained by training based on the training samples of the last batch.
As a possible implementation manner, after the work server performs iterative training and updating on the network parameters based on the training samples of the last batch, the updated network parameters are obtained as target network parameters.
S502, the working server sends the target network parameters to the parameter server.
As a possible implementation, the work server sends the target network parameters to the parameter server. Correspondingly, after receiving the target network parameters, the parameter server takes the target network parameters, together with the embedding parameters updated according to the embedding parameter gradients corresponding to the training samples of the last batch, as the model parameters of the prediction model.
The technical scheme provided by the embodiment at least has the following beneficial effects: after training of the training samples of the last batch is completed, the updated network parameters can be sent to the parameter server, and a basis is provided for subsequent training of the prediction model of the parameter server. In this way, the network parameters of the prediction model which is updated last can be stored in the parameter server.
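A small sketch of S501-S502, assuming a hypothetical ps.push_network_params RPC: once the last batch has been trained, the work server uploads its locally updated network parameters so the parameter server holds the final model.

```python
def finalize_training(work_server):
    """After the last batch, send the target network parameters to the parameter server."""
    # S501: the target network parameters are the ones trained on the last batch,
    # already held in the work server's graphics card cache.
    target_network_params = work_server.network_params

    # S502: hypothetical RPC pushing them to the parameter server, which combines
    # them with the latest embedding parameters to form the final prediction model.
    work_server.ps.push_network_params(target_network_params)
```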
In addition, the present disclosure also provides a work server applied to a distributed system, and as shown in fig. 8, the work server 60 includes an obtaining unit 601, a training unit 602, an updating unit 603, and a sending unit 604.
The obtaining unit 601 is configured to obtain a current embedding parameter corresponding to a current batch of training samples. For example, as shown in fig. 2, the obtaining unit 601 may be configured to execute S201.
The obtaining unit 601 is further configured to obtain the network parameters currently stored by the work server 60 from the work server 60. For example, as shown in fig. 2, the obtaining unit 601 may be configured to execute S202.
The training unit 602 is configured to perform iterative training on the current embedding parameter acquired by the acquisition unit 601 and the network parameter currently stored in the work server 60 based on a training sample of a current batch, so as to obtain an embedding parameter gradient and a network parameter gradient. For example, as shown in fig. 2, the training unit 602 may be configured to perform S203.
An updating unit 603, configured to update the current embedding parameter based on the embedding parameter gradient obtained through training by the training unit 602. For example, as shown in fig. 2, the updating unit 603 may be configured to execute S204.
A sending unit 604, configured to synchronize the embedding parameters updated by the updating unit 603 to the parameter server. For example, as shown in fig. 2, the sending unit 604 may be configured to execute S205.
The updating unit 603 is further configured to update the currently stored network parameters of the work server 60 based on the network parameter gradient. For example, as shown in fig. 2, the updating unit 603 may be configured to execute S206.
Optionally, as shown in fig. 8, the current embedding parameters provided by the embodiment of the present disclosure are pre-stored in the work server 60, and the work server 60 further includes a storage unit 605.
The obtaining unit 601 is further configured to obtain the differential embedding parameter from the parameter server if the embedding parameter currently stored by the work server 60 includes the partial embedding parameter of the current embedding parameter. The differential embedding parameters include embedding parameters of the current embedding parameters except for the partial embedding parameters. For example, as shown in fig. 3, the obtaining unit 601 may be configured to execute S302.
A storage unit 605 for storing the differential embedding parameter. For example, as shown in fig. 3, the storage unit 605 may be used to execute S303.
Alternatively,
the obtaining unit 601 is further configured to obtain the current embedding parameter from the parameter server if the embedding parameter currently stored by the work server 60 does not include the current embedding parameter. For example, as shown in fig. 3, the obtaining unit 601 may be configured to execute S304.
The storage unit 605 is used for storing the current embedding parameters. For example, as shown in fig. 3, the storage unit 605 may be used to execute S305.
Optionally, as shown in fig. 8, the work server 60 provided in the embodiment of the present disclosure further includes a receiving unit 606.
A receiving unit 606, configured to, in the case that the current batch is the first batch, receive the network parameters of the prediction model from the parameter server before the obtaining unit 601 obtains the network parameters currently stored by the work server 60 from the work server 60. For example, as shown in fig. 4, the receiving unit 606 may be configured to execute S207.
A storage unit 605, configured to store the network parameters of the prediction model received by the receiving unit 606 from the parameter server. For example, as shown in fig. 4, the storage unit 605 may be used to execute S208.
Optionally, as shown in fig. 8, in the work server 60 provided in the embodiment of the present disclosure, in a case that the current batch is a batch other than the first batch, before the obtaining unit 601 obtains the network parameter currently stored in the work server 60 from the work server 60, the updating unit 603 is specifically configured to:
merge the multiple historical network parameter gradients, and store the network parameters obtained by the merging. The plurality of historical network parameter gradients include the network parameter gradients obtained by training based on the training samples of the previous batch of the current batch. For example, as shown in fig. 5, the updating unit 603 may be configured to perform S209-S210.
Optionally, as shown in fig. 8, the obtaining unit 601 provided in this embodiment of the disclosure is further configured to obtain a next batch of training samples in the process of performing iterative training on the current embedding parameters and the network parameters currently stored in the work server 60, and obtain the embedding parameters corresponding to the next batch of training samples based on the next batch of training samples. For example, as shown in fig. 6, the obtaining unit 601 may be configured to perform S401-S402.
Optionally, as shown in fig. 8, the obtaining unit 601 provided in the embodiment of the present disclosure is further configured to obtain the target network parameter. The target network parameters comprise network parameters obtained by training based on the training samples of the last batch. For example, as shown in fig. 7, the obtaining unit 601 may be configured to perform S501.
The sending unit 604 is further configured to send the target network parameter to the parameter server. For example, as shown in fig. 7, the sending unit 604 may be configured to execute S502.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a schematic structural diagram of another work server provided by the present disclosure. As shown in fig. 9, the work server 70 may include at least one processor 701 and a memory 703 for storing processor-executable instructions. Wherein the processor 701 is configured to execute instructions in the memory 703 to implement the parameter training method in the above-described embodiments.
Additionally, the work server 70 may also include a communication bus 702 and at least one communication interface 704.
The processor 701 may be a GPU, a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs in accordance with the disclosed aspects.
The communication bus 702 may include a path that conveys information between the aforementioned components.
Communication interface 704, using any transceiver or the like, may be used to communicate with other devices or communication networks, such as an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The memory 703 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and connected to the processing unit by a bus. The memory may also be integrated with the processing unit as a volatile storage medium in the GPU.
The memory 703 is used for storing instructions for executing the disclosed solution, and is controlled by the processor 701. The processor 701 is configured to execute instructions stored in the memory 703 to implement the functions of the disclosed method.
In particular implementations, processor 701 may include one or more GPUs, such as GPU0 and GPU1 in fig. 9, as one embodiment.
In particular implementations, the work server 70 may include a plurality of processors, such as the processor 701 and the processor 707 in fig. 9, as an embodiment. Each of these processors may be a single-core processor or a multi-core processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In one embodiment, the work server 70 may further include an output device 705 and an input device 706. An output device 705 is in communication with the processor 701 and may display information in a variety of ways. For example, the output device 705 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 706 communicates with the processor 701 and may accept input from a user in a variety of ways. For example, the input device 706 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
Those skilled in the art will appreciate that the configuration shown in FIG. 9 does not constitute a limitation of the work server 70, and may include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
In addition, the present disclosure also provides a computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform the parameter training method of the prediction model as provided in the above embodiments.
Additionally, the present disclosure also provides a computer program product comprising instructions which, when executed by a processor, cause the processor to perform the method of parameter training of a prediction model as provided in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A parameter training method of a prediction model is applied to a work server of a distributed system, and comprises the following steps:
acquiring current embedding parameters corresponding to training samples of a current batch;
acquiring network parameters currently stored by the working server from the working server;
performing iterative training on the current embedding parameter and the network parameter currently stored by the working server based on the training samples of the current batch to obtain an embedding parameter gradient and a network parameter gradient;
updating the current embedding parameter based on the embedding parameter gradient, and synchronizing the updated embedding parameter to a parameter server;
and updating the currently stored network parameters of the working server based on the network parameter gradient.
2. The method of claim 1, wherein the current embedding parameters are pre-stored in the work server, the method further comprising:
under the condition that the embedding parameters currently stored by the working server comprise partial embedding parameters of the current embedding parameters, acquiring differential embedding parameters from a parameter server, and storing the differential embedding parameters; the differential embedding parameters include embedding parameters of the current embedding parameters other than the partial embedding parameters;
or,
and under the condition that the embedding parameters currently stored by the working server do not comprise the current embedding parameters, acquiring the current embedding parameters from the parameter server and storing the current embedding parameters.
3. The method of claim 1, wherein in the case that the current batch is the first batch, before the obtaining the network parameters currently stored by the work server from the work server, the method further comprises:
receiving and storing network parameters of the predictive model from the parameter server.
4. The method of claim 1, wherein in the case that the current batch is a batch other than the first batch, before the obtaining the currently stored network parameters of the work server from the work server, the method further comprises:
merging the multiple historical network parameter gradients, and storing the network parameters obtained by the merging; the plurality of historical network parameter gradients comprise network parameter gradients obtained by training based on training samples of a previous batch of the current batch.
5. The method of parametric training of a predictive model of claim 1, further comprising:
and acquiring training samples of the next batch in the iterative training process of the current embedding parameters and the network parameters currently stored by the working server, and acquiring the embedding parameters corresponding to the training samples of the next batch based on the training samples of the next batch.
6. A work server, comprising an acquisition unit, a training unit, an updating unit and a sending unit, wherein:
the acquisition unit is configured to acquire current embedding parameters corresponding to training samples of a current batch;
the acquisition unit is further configured to acquire, from the work server, network parameters currently stored by the work server;
the training unit is configured to perform iterative training on the current embedding parameters acquired by the acquisition unit and the network parameters currently stored by the work server based on the training samples of the current batch, to obtain an embedding parameter gradient and a network parameter gradient;
the updating unit is configured to update the current embedding parameters based on the embedding parameter gradient obtained by the training unit;
the sending unit is configured to synchronize the embedding parameters updated by the updating unit to a parameter server;
and the updating unit is further configured to update the network parameters currently stored by the work server based on the network parameter gradient.
7. A work server, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the parameter training method of a prediction model according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform the parameter training method of a prediction model according to any one of claims 1 to 5.
9. A distributed parameter training system, comprising a plurality of parameter servers and a plurality of work servers, wherein any one of the plurality of work servers is configured to perform the parameter training method of a prediction model according to any one of claims 1 to 5.
10. A computer program product comprising instructions which, when executed by a processor, implement the parameter training method of a prediction model according to any one of claims 1 to 5.
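
To make the claimed data flow easier to follow, the sketch below shows how a single training step on a work server could be organized along the lines of claims 1, 2 and 5. It is a minimal, hypothetical illustration rather than the patented implementation: the `ParameterServer` interface (`pull_network_params`, `pull_embeddings`, `push_embeddings`), the `model.compute_gradients` helper and the plain SGD updates are assumptions introduced here for readability, and a production system would add synchronization around the shared embedding cache.

```python
# Hypothetical sketch of one training step on a work server (claims 1, 2 and 5).
# All names below are illustrative assumptions, not part of the patent text.

import threading
from typing import Dict

import numpy as np


class WorkServer:
    def __init__(self, parameter_server, model):
        self.ps = parameter_server                        # remote parameter-server handle
        self.model = model                                # prediction model (network part)
        self.embedding_cache: Dict[int, np.ndarray] = {}  # embedding parameters stored locally
        self.net_params = self.ps.pull_network_params()   # claim 3: initial network parameters

    def _ensure_embeddings(self, feature_ids):
        # Claim 2: pull only the "differential" embeddings missing from the local
        # cache; if nothing relevant is cached, this degenerates to pulling all of them.
        missing = [fid for fid in feature_ids if fid not in self.embedding_cache]
        if missing:
            self.embedding_cache.update(self.ps.pull_embeddings(missing))
        return {fid: self.embedding_cache[fid] for fid in feature_ids}

    def train_step(self, batch, next_batch=None, lr=0.01):
        # Claim 1: current embedding parameters for the current batch.
        embeddings = self._ensure_embeddings(batch["feature_ids"])

        # Claim 5: prefetch the next batch's embeddings while training runs
        # (a real system would guard the cache with a lock).
        prefetcher = None
        if next_batch is not None:
            prefetcher = threading.Thread(
                target=self._ensure_embeddings, args=(next_batch["feature_ids"],))
            prefetcher.start()

        # Iterative training on the current embeddings and the locally stored
        # network parameters, yielding the two kinds of gradients.
        emb_grads, net_grads = self.model.compute_gradients(
            batch, embeddings, self.net_params)

        # Update the embeddings locally and synchronize them to the parameter server.
        for fid, grad in emb_grads.items():
            self.embedding_cache[fid] -= lr * grad
        self.ps.push_embeddings({fid: self.embedding_cache[fid] for fid in emb_grads})

        # Update the network parameters stored on the work server itself
        # (claim 4's merging of historical gradients is omitted from this sketch).
        self.net_params = {k: v - lr * net_grads[k] for k, v in self.net_params.items()}

        if prefetcher is not None:
            prefetcher.join()
```

The intent of this structure, as described in the disclosure, is to keep network parameters on the work server and overlap the next batch's embedding fetch with the current batch's computation, reducing the time the work server spends waiting on the parameter server during training.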
CN202110542415.1A 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium Pending CN113283596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110542415.1A CN113283596A (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110542415.1A CN113283596A (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Publications (1)

Publication Number Publication Date
CN113283596A true CN113283596A (en) 2021-08-20

Family

ID=77279725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110542415.1A Pending CN113283596A (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Country Status (1)

Country Link
CN (1) CN113283596A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111971694A (en) * 2018-04-04 2020-11-20 诺基亚技术有限公司 Collaborative heterogeneous processing of training data for deep neural networks
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
CN111881358A (en) * 2020-07-31 2020-11-03 北京达佳互联信息技术有限公司 Object recommendation system, method and device, electronic equipment and storage medium
CN112052950A (en) * 2020-08-24 2020-12-08 北京达佳互联信息技术有限公司 Neural network training method, model calculation server and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024017283A1 (en) * 2022-07-22 2024-01-25 华为技术有限公司 Model training system and method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination