CN109978177B - Model training method, service processing method, device and related equipment - Google Patents


Publication number: CN109978177B
Application number: CN201910211389.7A (filed by Tencent Technology Shenzhen Co Ltd)
Authority: CN (China)
Other versions: CN109978177A (Chinese (zh))
Prior art keywords: target, target feature, momentum, gradient, model
Inventors: Sun Haobo (孙浩博), Zhang Honglin (张红林)
Current assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Active (application granted)

Abstract

The embodiment of the invention discloses a model training method, a service processing method, a device, and related equipment. The model training method applied to a first node device comprises the following steps: acquiring a weight parameter of a target feature in a target model; determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature; if the current gradient accumulation amount meets a preset condition, acquiring the approximate momentum of the target feature; and transmitting the approximate momentum of the target feature to a second node device, so that the second node device updates the target model by adopting the approximate momentum of the target feature. By adopting this model training method, the embodiment of the invention can improve model training efficiency, relieve communication pressure, and avoid extra memory consumption.

Description

Model training method, service processing method, device and related equipment
Technical Field
The present invention relates to the field of internet technologies, in particular to the field of distributed machine learning technologies, and more particularly to a model training method, a service processing method, a model training device, a service processing device, a node device, and a terminal.
Background
Machine Learning (ML) is a multi-domain interdisciplinary field applicable to model training; specifically, machine learning can learn from sample data and update a model according to the learning result to obtain a model with good performance. With the development of science and technology, distributed machine learning has emerged. Distributed machine learning refers to a machine learning mode in which machine learning tasks are distributed to multiple node devices for parallel processing. The mainstream framework of distributed machine learning (e.g., the parameter server) consists of two roles: one role is the worker (working node device), which is mainly responsible for reading sample data, performing data calculation, and the like; the other role is the server (service node device), which is mainly responsible for storing and updating weight parameters and the like.
In model training based on distributed machine learning, one important factor affecting model training efficiency is the communication time between the workers and the servers. With the increase of sample data and the expansion of models, more workers and servers are required to participate in model training, so communication interactions become more frequent and communication time gradually becomes a bottleneck of model training. Currently, a GD (Gradient Dropping) scheme or a DGC (Deep Gradient Compression) scheme is generally adopted to perform gradient compression so as to reduce the number of communication interactions between the workers and the servers, thereby reducing the communication time. However, the inventors have found in practice that the current gradient compression schemes suffer from a number of problems: the model converges slowly under the GD scheme, so model training efficiency is low; the DGC scheme requires additional memory to store momentum, increasing memory overhead.
Disclosure of Invention
The embodiment of the invention provides a model training method, a service processing method, a device and related equipment, which can improve the model training efficiency, relieve the communication pressure and avoid extra memory consumption.
In one aspect, an embodiment of the present invention provides a model training method, which is applied to a first node device, where the model training method includes:
acquiring a weight parameter of a target feature in a target model;
determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature;
if the current gradient accumulation quantity meets a preset condition, acquiring the approximate momentum of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and transmitting the approximate momentum of the target feature to second node equipment, so that the second node equipment updates the target model by adopting the approximate momentum of the target feature.
In another aspect, an embodiment of the present invention provides another model training method, applied to a second node device, where the model training method includes:
receiving the approximate momentum of a target feature in a target model sent by first node equipment, wherein the approximate momentum of the target feature is acquired and sent by the first node equipment when the current gradient accumulation of the target feature meets a preset condition, and the current gradient accumulation is determined according to the weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and updating the target model using the approximate momentum of the target feature.
In still another aspect, an embodiment of the present invention provides a service processing method, where the service processing method includes:
acquiring a service request;
responding to the service request, and calling a target model to execute processing on the requested service to obtain a processing result; the target model is obtained by training by adopting the model training method;
and outputting the processing result.
In still another aspect, an embodiment of the present invention provides a model training apparatus, which operates on a first node device, where the model training apparatus includes:
the acquisition unit is used for acquiring weight parameters of target features in the target model;
a determining unit, configured to determine a current gradient accumulation amount of the target feature according to a weight parameter of the target feature;
the acquisition unit is used for acquiring the approximate momentum of the target feature if the current gradient accumulation amount meets a preset condition; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and the sending unit is used for sending the approximate momentum of the target feature to the second node equipment so that the second node equipment can update the target model by adopting the approximate momentum of the target feature.
In still another aspect, an embodiment of the present invention provides a model training apparatus, which operates on a second node device, where the model training apparatus includes:
a receiving unit, configured to receive an approximate momentum of a target feature in a target model sent by a first node device, where the approximate momentum of the target feature is acquired and sent by the first node device when a current gradient accumulation amount of the target feature meets a preset condition, and the current gradient accumulation amount is determined according to a weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and the updating unit is used for updating the target model by adopting the approximate momentum of the target feature.
In still another aspect, an embodiment of the present invention provides a service processing apparatus, including:
an acquisition unit for acquiring a service request;
the processing unit is used for responding to the service request and calling a target model to execute processing on the requested service to obtain a processing result; the target model is obtained by training by adopting the model training method;
and the output unit is used for outputting the processing result.
In still another aspect, an embodiment of the present invention provides a node device, where the node device includes a communication interface, and the node device further includes:
a processor adapted to implement one or more instructions; the method comprises the steps of,
a computer storage medium storing one or more first instructions adapted to be loaded by the processor and to perform the steps of:
acquiring a weight parameter of a target feature in a target model;
determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature;
if the current gradient accumulation quantity meets a preset condition, acquiring the approximate momentum of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and transmitting the approximate momentum of the target feature to second node equipment, so that the second node equipment updates the target model by adopting the approximate momentum of the target feature.
Alternatively, the computer storage medium stores one or more second instructions adapted to be loaded by the processor and perform the steps of:
receiving the approximate momentum of a target feature in a target model sent by a first node device, wherein the approximate momentum of the target feature is acquired and sent by the first node device when the current gradient accumulation of the target feature meets a preset condition, and the current gradient accumulation is determined according to the weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximation calculation according to the gradient accumulation of the target feature; and
updating the target model using the approximate momentum of the target feature.
In still another aspect, an embodiment of the present invention provides a terminal, where the terminal includes an input device and an output device, and the terminal further includes:
a processor adapted to implement one or more instructions; the method comprises the steps of,
a computer storage medium storing one or more third instructions adapted to be loaded by the processor and to perform the steps of:
acquiring a service request;
responding to the service request, and calling a target model to execute processing on the requested service to obtain a processing result; the target model is obtained by training by adopting the model training method;
and outputting the processing result.
In yet another aspect, embodiments of the present invention provide a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the model training method and/or the business processing method described above.
In the embodiment of the invention, the current gradient accumulation amount of the target feature can be determined according to the weight parameter of the target feature in the target model; when the current gradient accumulation amount of the target feature meets the preset condition, the first node device (serving as a worker) sends the approximate momentum of the target feature to the second node device (serving as a server), and the second node device updates the target model by adopting the approximate momentum of the target feature. In the model training process, the first node device and the second node device do not directly exchange gradients but instead exchange approximate momentum, which reduces the number of communication interactions between the first node device and the second node device, realizes a higher gradient compression rate, reduces the communication time, and relieves the communication pressure; meanwhile, no extra memory overhead is required. In addition, updating the model with the approximate momentum accelerates the convergence speed of the model, thereby shortening the training time and improving the training efficiency of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a block diagram of a model training system according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of the working principle of a model training system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a model training scheme according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a model training method according to an embodiment of the present invention;
FIG. 4 is a flow chart of a model training method according to another embodiment of the present invention;
FIG. 5 is a schematic flow chart of a depth gradient compression scheme according to an embodiment of the present invention;
FIG. 6 is an effect diagram of a scenario test provided by an embodiment of the present invention;
FIG. 7 is an effect diagram of another scenario test provided by an embodiment of the present invention;
fig. 8 is a schematic flow chart of a service processing method according to an embodiment of the present invention;
FIG. 9a is a schematic diagram of a user interface provided by an embodiment of the present invention;
fig. 9b is an application scenario diagram of a service processing method according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a model training device according to an embodiment of the present invention;
FIG. 11 is a schematic structural view of another model training apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a node device according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a service processing device according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In the embodiment of the present invention, distributed machine learning refers to: a machine learning mode in which machine learning tasks are distributed to multiple node devices for parallel processing, where the node devices may include, but are not limited to: intelligent terminals, PCs (personal computers), desktop computers, servers, and the like. The multiple node devices participating in distributed machine learning may constitute a PS (Parameter Server) based model training system. Referring to FIG. 1a, the model training system may include at least one first node device 11 and at least one second node device 12; the number of first node devices may be the same as or different from the number of second node devices. The first node device 11 may act as a worker; the second node device 12 may act as a server. Optionally, the model training system may further include a third node device 13, which may act as a scheduler and is mainly responsible for scheduling the servers and workers, and for performing management such as task distribution, heartbeat monitoring, and fault recovery on them. The servers and the workers communicate point-to-point, while no two workers communicate with each other. When physically deploying the model training system, the servers and the workers can be deployed separately, or a server and a worker can be deployed together as a pair, so as to balance the network communication of each node device.
The main process of model training using the model training system shown in fig. 1a includes the following steps, and in particular, see fig. 1 b:
1. Read: the first node device reads the sample data used to train the target model; the sample data may be read, for example, from HDFS (Hadoop Distributed File System) as shown in fig. 1b. The first node device also establishes a cache in its local space.
2. Pull: the first node equipment pulls first weight parameters of effective target features in the sample data from the second node equipment, wherein the effective target features refer to target features with feature values; correspondingly, the second node equipment responds to the pulling operation of the first node equipment and sends the first weight parameter of the target feature to the first node equipment.
3. Compute (calculation): the first node device calculates the gradient of the target feature based on the first weight parameter of the target feature.
4. Push (send): the first node device transmits the gradient of the target feature as a parameter update value to the second node device.
5. Update: the second node equipment updates the target model according to the gradient of the target characteristic; specifically, the second node device updates the first weight parameter of the target feature according to the gradient of the target feature.
Steps 1-5 are executed iteratively to obtain a fully trained target model. If the target model is an online model (i.e., a real-time training model), the update training of the target model is completed when the accuracy of the target model reaches a preset accuracy threshold; if the target model is an offline model (i.e., a pre-trained model), the update training of the target model is completed when the target model no longer changes (i.e., has converged). It should be noted that, during model training, the second node device responds to the pull operation of the first node device by sending the first weight parameter updated in the previous round of model training to the first node device.
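As an illustration, the following minimal Python sketch walks through the Read/Pull/Compute/Push/Update loop described above. All names are hypothetical, the server is modeled as an in-process dict instead of a remote device, and the gradient rule is a random placeholder rather than a real loss derivative:

```python
import random

# "Server" state: feature ID -> first weight parameter (hypothetical values).
server_weights = {0: 0.1, 1: -0.2, 2: 0.05}

def pull(feature_ids):
    # 2. Pull: the worker fetches the first weight parameters of the valid features.
    return {fid: server_weights[fid] for fid in feature_ids}

def compute_gradients(batch, weights):
    # 3. Compute: placeholder gradient rule; a real worker would derive
    # the gradient of the loss on the batch with respect to each weight.
    return {fid: random.uniform(-1.0, 1.0) * w for fid, w in weights.items()}

def push_and_update(gradients, learning_rate=0.01):
    # 4. Push + 5. Update: the server applies the received update values.
    for fid, g in gradients.items():
        server_weights[fid] -= learning_rate * g

for t in range(5):                  # steps 1-5 are executed iteratively
    batch = None                    # 1. Read: stands in for a cached sample batch
    weights = pull([0, 1, 2])
    gradients = compute_gradients(batch, weights)
    push_and_update(gradients)
```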
Practice shows that the above main process of model training has the following problems: (1) with the increase of sample data and the expansion of the target model, more first node devices and second node devices are required to participate in model training; since the first node device transmits the gradient to the second node device after each calculation of the gradient of the target feature, communication interactions between the first node device and the second node device are frequent, communication time gradually becomes a bottleneck of model training, and communication pressure is high; (2) updating the model with raw gradients makes the convergence rate of the model low, so model training efficiency is low; (3) the model training system uses a parallel machine learning mechanism to update the target model, and each first node device performs data calculation according to its own machine learning task; because the calculation progress may differ between first node devices, when a first node device whose calculation progress is slow transmits an old parameter update value (for example, a gradient of the target feature) to the second node device, the second node device may be steered in an erroneous update direction when updating the target model according to that old parameter update value, thereby impairing the model training effect.
In the related art, the following technical solutions are generally adopted at present to solve the above problems: (1) the GD scheme. This scheme mainly accumulates the gradients of the target feature in the local space of the first node device, and when the gradient accumulation satisfies the transmission condition, the first node device transmits the gradient accumulation as the parameter update value to the second node device. (2) The DGC scheme. This scheme mainly combines the GD scheme with a momentum algorithm for model training, and mainly consists of two parts: Momentum Correction and Momentum Factor Masking. Momentum correction means transferring the step of calculating the momentum from the second node device to the first node device; momentum factor masking means: after the current momentum accumulation satisfies the transmission condition, sending the current momentum accumulation as the parameter update value to the second node device and clearing the corresponding momentum and momentum accumulation; otherwise, continuing to buffer the current momentum accumulation.
The inventors found in practice that the existing schemes have the following problems: (1) the GD scheme may suffer model performance loss at a higher compression rate, and the GD scheme cannot solve the above problems (2)-(3); (2) the DGC scheme cannot solve the above problem (3), and the DGC scheme requires additional memory to store momentum. In addition, the inventors found that, in the gradient compression scenario, after the second node device in the existing schemes receives the parameter update value sent by the first node device, it directly updates the target model according to that parameter update value; this processing usually loses part of the historical gradient information of the target feature, resulting in a lower compression rate. Moreover, the GD scheme and the DGC scheme transmit the parameter update value only after the transmission condition is satisfied, which also causes the problem of parameter staleness: the first weight parameter pulled by the first node device from the second node device usually misses the information whose transmission has been delayed, so the gradient of the target feature calculated from this first weight parameter aggravates the inconsistency between the global model and the delayed updates, impairing the model training effect.
Based on this, the embodiment of the present invention proposes a model training scheme (SGC scheme for short), as shown in fig. 2, which can be applied to the above-mentioned model training system. The SGC scheme is applicable to a variety of algorithms in an algorithm library, for example: the Linear Regression algorithm, the Logistic Regression algorithm, the Factorization Machine algorithm, the Field-aware Factorization Machine algorithm, the DNN (deep neural network) algorithm, and the like. When training a model, a user only needs to set a suitable compression rate in the configuration file; the model training system then trains and updates the target model according to that compression rate, which can effectively reduce communication time and enlarge the parallel scale. The SGC scheme can be applied to different application scenarios according to service requirements, such as sparse scenarios; a sparse scenario means that the features of the model reach the order of billions, trillions, or more, while the number of valid features is small.
In fig. 2, $t$ represents the number of iterations of model training; $w_t$ represents the first weight parameter that the first node device pulls from the second node device; $g_t$ represents the target gradient of the target feature; $r_t$ represents the current gradient accumulation of the target feature, i.e., the gradient accumulation at the $t$-th model training; $r_{t-1}$ represents the historical gradient accumulation of the target feature, i.e., the gradient accumulation at the $(t-1)$-th model training; $s_t^{(i)}$ represents the approximate momentum transmitted by the $i$-th first node device at the $t$-th model training; $u_t^{(i)}$ represents the parameter update value of the target feature transmitted by the $i$-th first node device at the $t$-th model training; $S_t$ represents the mean of the approximate momenta of the target feature calculated at the $t$-th model training; $V_t$ represents the current global momentum of the target feature, i.e., the global momentum at the $t$-th model training. Referring to the flow chart shown in fig. 2, compared with the existing solutions, the SGC solution proposed in the embodiment of the present invention has at least the following three improvement points:
(1) Momentum Approximation: when the current gradient accumulation amount meets the preset condition, the first node device can acquire the approximate momentum of the target feature, so that the parameter update value pushed to the second node device is the approximate momentum; the approximate momentum here refers to the momentum obtained by performing momentum approximation calculation based on the gradient accumulation of the target feature. Exchanging approximate momentum between the first node device and the second node device reduces the number of communication interactions between them, shortens the communication time, and relieves the communication pressure; meanwhile, no extra memory overhead is required; and updating the target model with the approximate momentum can accelerate the convergence speed of the target model, shorten the training time, and improve the model training efficiency.
(2) Long-term Gradient Compensation: the historical gradient information of the target feature is preserved through the global momentum corresponding to the target feature in the second node device, so that when the target model is subsequently updated, the convergence speed can be accelerated by means of the historical gradient information; this compensates for the loss of the historical gradient information of the target feature in the existing schemes and effectively improves the compression rate.
(3) Local Update: before computing the gradient of the target feature from the first weight parameter pulled from the second node device, the first node device may locally update the first weight parameter with the historical gradient accumulation of the target feature. Therefore, the gradient calculated from the updated first weight parameter takes the delayed-transmission information into account, which solves the parameter staleness problem, alleviates the inconsistency between the global model and the delayed updates, weakens the negative influence of the gradient compression technique, and improves the model training effect.
Based on the above description, the embodiment of the present invention proposes a model training method, which can be applied to the above-mentioned model training system, and executed by any one of the first node devices and any one of the second node devices in the model training system. Referring to fig. 3, the model training method may include the following steps S301 to S305:
S301, the first node equipment acquires weight parameters of target features in the target model.
The target model can be a model designated by a user, or a model to be optimized and updated that is determined according to service requirements. A target model generally includes one or more features, where a feature refers to an expression of a salient property of the model and is the key by which the model distinguishes things/data; different models have different features. For example, target model A is a model for recommending information to a user, and the features in target model A may include: an age feature, a gender feature, a user interest feature, an information subject feature, and the like; that is, target model A mainly determines the information to be recommended for a user from dimensions such as age, gender, user interest, and information subject. As another example, target model B is a model for image processing, and the features in target model B may include: texture features, color features, shape features, and the like; that is, target model B mainly performs image processing from dimensions such as texture, color, and shape.
The target feature may refer to any feature in the target model, and the weight parameter of the target feature may include a first weight parameter or a second weight parameter. The first weight parameter is a parameter pulled by the first node device from the second node device, and the second weight parameter is a parameter obtained by the first node device after locally updating the first weight parameter by adopting the historical gradient accumulation of the target feature. Accordingly, one specific embodiment of step S301 may be: pulling first weight parameters of target features in the target model from the second node device; correspondingly, the second node device can respond to the pulling operation of the first node device and send first weight parameters of target features in the target model to the first node device; the weight parameters of the target feature in this embodiment include a first weight parameter. Optionally, another specific embodiment of step S301 may further be: pulling a first weight parameter of a target feature in the target model from second node equipment, and updating the first weight parameter by adopting the historical gradient accumulation amount of the target feature to obtain a second weight parameter; the weight parameters of the target feature in this embodiment include a second weight parameter.
S302, the first node device determines the current gradient accumulation amount of the target feature according to the weight parameter of the target feature.
In the embodiment of the invention, the current gradient accumulation amount refers to the gradient accumulated between the current model training and the model iteration training at which the approximate momentum was last acquired and sent. For example, suppose the current model training is the 10th iteration training: if the model training that last acquired and sent the approximate momentum was the 5th iteration training, then the current gradient accumulation amount is the gradient accumulated between the 6th iteration training and the 10th iteration training; if the first node device has not acquired and sent any approximate momentum before the 10th iteration training, then the current gradient accumulation amount is the gradient accumulated between the 1st iteration training and the 10th iteration training. When accumulating gradients, the embodiment of the invention uses a key-value structure to store the gradients, where the key stores the feature ID of the target feature and the value stores the gradient of the target feature.
In the implementation process, the current gradient accumulation of the target feature can be obtained by combining the target gradient calculated during the training of the current model and the historical gradient accumulation. Specifically, the first node device may perform gradient calculation according to the weight parameter of the target feature to obtain a target gradient; and then merging the historical gradient accumulation of the target feature with the target gradient to obtain the current gradient accumulation of the target feature.
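A minimal sketch of this key-value accumulation (the function name and feature IDs are assumed for illustration):

```python
def accumulate_gradient(accumulation, target_gradients):
    # Merge this round's target gradient into the historical gradient
    # accumulation; the key is the feature ID, the value is the gradient.
    for feature_id, grad in target_gradients.items():
        accumulation[feature_id] = accumulation.get(feature_id, 0.0) + grad
    return accumulation

r = {}                                          # historical gradient accumulation
r = accumulate_gradient(r, {7: 0.3})            # round t: only feature 7 is valid
r = accumulate_gradient(r, {7: -0.1, 9: 0.5})   # round t+1: features 7 and 9
```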
S303, if the current gradient accumulation amount meets the preset condition, acquiring the approximate momentum of the target feature.
In one embodiment, the preset condition may include: the value of the current gradient accumulation amount is greater than a preset threshold. In this embodiment, the first node device may detect whether the value of the current gradient accumulation amount is greater than the preset threshold; if it is, the step of acquiring the approximate momentum of the target feature can be executed.
In yet another embodiment, the first node device may also periodically acquire and send the approximate momentum of the target feature at a preset frequency; correspondingly, the preset condition may further include: the accumulation count corresponding to the current gradient accumulation amount is greater than a preset count. In this embodiment, the first node device may detect whether the accumulation count corresponding to the current gradient accumulation amount is greater than the preset count; if it is, the step of acquiring the approximate momentum of the target feature can be executed. The preset count may be determined according to the preset frequency; for example, if the preset frequency is to acquire and send 1 approximate momentum for every 5 accumulated gradients, then the preset count is equal to 5. The accumulation count corresponding to the current gradient accumulation amount may be the same as or different from the number of model iterations. For example, if the current model training is the 3rd iteration training (i.e., the number of model iterations is 3) and the current gradient accumulation amount has accumulated the gradients of these 3 model iterations, then the accumulation count corresponding to the current gradient accumulation amount is 3; in this case the accumulation count is the same as the number of model iterations. As another example, if the current model training is the 5th iteration training (i.e., the number of model iterations is 5) and the first node device acquires and sends 1 approximate momentum for every 3 accumulated gradients, then the first node device sent 1 approximate momentum at the 3rd model iteration, and the current gradient accumulation amount has only accumulated the gradients of the 4th and 5th model iterations; the accumulation count corresponding to the current gradient accumulation amount is therefore 2, and in this case the accumulation count differs from the number of model iterations.
It should be noted that the above-mentioned preset threshold and preset frequency may be set according to actual service requirements or empirical values from model training; it should also be understood that the preset condition may be adjusted according to actual service requirements. Each time the current gradient accumulation amount satisfies the preset condition, the first node device may clear the current gradient accumulation amount after acquiring and sending the approximate momentum of the target feature.
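A sketch of the two alternative preset conditions described above; the threshold and count values are illustrative assumptions, not values from the patent:

```python
def value_condition_met(current_accumulation, preset_threshold=1.0):
    # Embodiment 1: the value of the current gradient accumulation amount
    # exceeds a preset threshold.
    return abs(current_accumulation) > preset_threshold

def count_condition_met(accumulation_count, preset_count=5):
    # Embodiment 2: the accumulation count exceeds a preset count, which is
    # equivalent to sending the approximate momentum at a preset frequency.
    return accumulation_count > preset_count
```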
S304, the first node device transmits the approximate momentum of the target feature to the second node device.
After the first node device acquires the approximate momentum of the target feature, the approximate momentum may be pushed to the second node device as a parameter update value. Accordingly, the second node device may receive the approximate momentum of the target feature in the target model sent by the first node device; the approximate momentum of the target feature is acquired and transmitted by the first node device when a current gradient accumulation of the target feature, which is determined according to a weight parameter of the target feature, satisfies a preset condition.
S305, the second node device updates the target model with the approximate momentum of the target feature.
The second node device, upon receiving the approximate momentum of the target feature in the target model transmitted by the first node device, may update the target model with the approximate momentum of the target feature. In one embodiment, the second node device may update the target model directly with the approximate momentum of the target feature. Specifically, the second node device may directly update the first weight parameter of the target feature in the target model using the approximate momentum of the target feature, thereby updating the target model.
In another embodiment, the second node device may update the target model based on a long-term gradient compensation strategy and employing the approximate momentum of the target features. The long-term gradient compensation strategy refers to a strategy of compensating historical gradient information of a target feature into approximate momentum of the target feature to obtain global momentum of the target feature and updating a target model by adopting the global momentum. The second node device may update the target model based on the long-term gradient compensation strategy and employing the approximate momentum of the target feature to update the first weight parameter of the target feature in the target model. In the implementation process, the second node equipment can calculate the average value of the approximate momentum of the target feature; and calculating the current global momentum of the target feature according to the obtained mean value, and updating the first weight parameter of the target feature by adopting the current global momentum. The target model is updated based on a long-term gradient compensation strategy, so that the historical gradient information of target features can be reserved, the problem of loss of the historical gradient information is avoided, the compression rate can be effectively improved, the convergence speed of the model can be further accelerated by means of the historical gradient information, and the time consumption of model training tasks is effectively shortened.
In the embodiment of the invention, the current gradient accumulation amount of the target feature can be determined according to the weight parameter of the target feature in the target model; when the current gradient accumulation amount of the target feature meets the preset condition, the first node device (serving as a worker) sends the approximate momentum of the target feature to the second node device (serving as a server), and the second node device updates the target model by adopting the approximate momentum of the target feature. In the model training process, the first node device and the second node device do not directly exchange gradients or gradient accumulations but instead exchange approximate momentum, which reduces the number of communication interactions between the first node device and the second node device, realizes a higher gradient compression rate, reduces the communication time, and relieves the communication pressure; meanwhile, no extra memory overhead is required. In addition, updating the model with the approximate momentum accelerates the convergence speed of the model, thereby shortening the training time and improving the training efficiency of the model.
Fig. 4 is a schematic flow chart of another model training method according to an embodiment of the invention. The model training method may be applied to the above-mentioned model training system, and executed by any one of the first node devices and any one of the second node devices in the model training system. As shown in fig. 4, the model training method may include the following steps S401 to S407:
S401, the first node device pulls first weight parameters of target features in the target model from the second node device. Correspondingly, the second node equipment responds to the pulling operation of the first node equipment, and sends first weight parameters of target features in the target model to the first node equipment.
S402, the first node equipment updates the first weight parameter by using the historical gradient accumulation amount of the target feature to obtain a second weight parameter.
From the foregoing, the first weight parameter pulled by the first node device from the second node device generally misses the information whose transmission has been delayed, resulting in the parameter staleness problem. Therefore, in order to take the delayed gradient information into account, the embodiment of the present invention can update the first weight parameter with the historical gradient accumulation to obtain the second weight parameter, so that the gradient subsequently calculated from the second weight parameter takes the delayed gradient information into account, thereby solving the parameter staleness problem, alleviating the inconsistency between the global model and the delayed updates, and improving the model training effect.
Research shows that in each update training, the target model is updated along the direction opposite to the gradient; therefore, when updating the first weight parameter with the historical gradient accumulation to obtain the second weight parameter, the difference between the first weight parameter $w_t$ and the historical gradient accumulation $r_{t-1}$ can be determined as the second weight parameter $w'_t$. Specifically, see the following calculation formula 1.1:

$$w'_t = w_t - r_{t-1} \tag{1.1}$$
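A minimal sketch of the local update of formula 1.1, with weights stored as a dict keyed by feature ID (the function name is assumed):

```python
def local_update(first_weights, history_accumulation):
    # w'_t = w_t - r_{t-1}: fold the delayed (not yet sent) gradient
    # information into the weights before the next gradient calculation.
    return {fid: w - history_accumulation.get(fid, 0.0)
            for fid, w in first_weights.items()}
```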
S403, the first node equipment performs gradient calculation according to the weight parameters of the target features to obtain a target gradient, wherein the weight parameters comprise second weight parameters.
S404, the first node equipment combines the historical gradient accumulation amount of the target feature and the target gradient to obtain the current gradient accumulation amount of the target feature.
In the implementation process of steps S403 to S404, the first node device may perform gradient calculation according to the second weight parameter of the target feature by using the calculation formula shown in formula 1.2 to obtain the target gradient:

$$g_t = \nabla(w'_t) \tag{1.2}$$

The first node device may further acquire the historical gradient accumulation of the target feature from its local space, and perform step S404 after obtaining the target gradient. The specific implementation of step S404 may be: first, acquiring the gradient learning rate; second, weighting the target gradient with the gradient learning rate to obtain an intermediate gradient, as shown in formula 1.3; then, merging the intermediate gradient with the historical gradient accumulation of the target feature to obtain the current gradient accumulation of the target feature, as shown in formula 1.4:

$$g'_t = \varepsilon g_t \tag{1.3}$$

$$r_t = \mathrm{sort}(r_{t-1} + \varepsilon g_t) \tag{1.4}$$

In the above formulas 1.2 to 1.4, $g_t$ denotes the target gradient calculated from the second weight parameter at the $t$-th model training; $\nabla(\cdot)$ denotes the gradient calculation rule; $g'_t$ denotes the intermediate gradient; and $\mathrm{sort}(\cdot)$ denotes the merge operation.
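A sketch mirroring formulas 1.2 to 1.4, where gradient_fn stands in for the gradient calculation rule and epsilon is the gradient learning rate (all names are assumed):

```python
def update_accumulation(second_weights, accumulation, gradient_fn, epsilon=0.01):
    g_t = gradient_fn(second_weights)           # formula 1.2: target gradient
    for fid, g in g_t.items():
        g_intermediate = epsilon * g            # formula 1.3: intermediate gradient
        # formula 1.4: merge the intermediate gradient into the accumulation
        accumulation[fid] = accumulation.get(fid, 0.0) + g_intermediate
    return accumulation                         # current gradient accumulation r_t
```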
S405, if the current gradient accumulation amount meets the preset condition, acquiring the approximate momentum of the target feature.
The embodiment of the invention provides a momentum approximation algorithm based on gradient compression by comparing the relation between the momentum and the gradient accumulation; the momentum approximation algorithm is an algorithm that can perform momentum approximation calculation according to the gradient accumulation. Based on this, the specific embodiment of step S405 may include the following steps s11-s12:
and s11, if the current gradient accumulation amount meets the preset condition, acquiring a momentum approximation algorithm.
And s12, calculating the approximate momentum of the target feature by adopting the momentum approximation algorithm according to the historical gradient accumulation and the current gradient accumulation of the target feature.
Specifically, the gradient attenuation factor $\alpha$ may first be used to attenuate the historical gradient accumulation $r_{t-1}$, as shown in formula 1.5; then the attenuated historical gradient accumulation $r'_{t-1}$ is merged with the current gradient accumulation $r_t$ to obtain the approximate momentum $s_t$ of the target feature, as shown in formula 1.6:

$$r'_{t-1} = \alpha r_{t-1} \tag{1.5}$$

$$s_t = r'_{t-1} + r_t = \alpha r_{t-1} + r_t \tag{1.6}$$
It should be understood that the specific embodiment of step S405 is not limited to the above one, and in other embodiments, the approximate momentum of the target feature may be obtained by performing the momentum approximation calculation only according to the historical gradient accumulation amount or the current gradient accumulation amount. Specifically, the specific implementation manner of step S405 may further be: if the current gradient accumulation quantity meets a preset condition, acquiring a momentum approximation algorithm; the momentum approximation algorithm is used to calculate the approximate momentum of the target feature from the historical gradient accumulation of the target feature. Alternatively, the specific implementation manner of step S405 may be: if the current gradient accumulation quantity meets a preset condition, acquiring a momentum approximation algorithm; the momentum approximation algorithm is used to calculate the approximate momentum of the target feature based on the current gradient accumulation of the target feature.
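A minimal sketch of the momentum approximation of formulas 1.5 and 1.6 under the first embodiment above; alpha and the function name are assumed, and the two alternative embodiments would simply drop one of the two inputs:

```python
def approximate_momentum(history_accumulation, current_accumulation, alpha=0.9):
    s_t = {}
    for fid, r_t in current_accumulation.items():
        decayed = alpha * history_accumulation.get(fid, 0.0)   # formula 1.5
        s_t[fid] = decayed + r_t                               # formula 1.6
    return s_t
```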
S406, the first node equipment transmits the approximate momentum of the target feature to the second node equipment; accordingly, the second node device may receive the approximate momentum of the target feature in the target model transmitted by the first node device.
S407, the second node device updates the target model based on the long-term gradient compensation strategy and using the approximate momentum of the target feature.
In a specific implementation, step S407 may include the following steps S21-S23:
s21, obtaining the mean of the approximate momenta of the target feature. Specifically, after receiving the approximate momenta of the target feature sent by all the relevant first node devices related to the target feature, the second node device may calculate the mean $S_t$ of the received approximate momenta, as shown in formula 1.7:

$$S_t = \frac{1}{N}\sum_{i=1}^{N} s_t^{(i)} \tag{1.7}$$

where $N$ denotes the number of relevant first node devices related to the target feature; a relevant first node device related to the target feature refers to a first node device that participates in calculating the approximate momentum of the target feature. For example, suppose the model training system includes 10 first node devices, numbered 1-10; if only the first node device numbered 2 and the first node device numbered 5 participate in calculating the approximate momentum of the target feature, then the relevant first node devices are the first node device 2 and the first node device 5.

s22, calculating the current global momentum of the target feature according to the obtained mean. Specifically, the historical global momentum $V_{t-1}$ may be acquired and attenuated with the momentum attenuation factor $\alpha$, and the current global momentum $V_t$ is then determined from the attenuated historical global momentum and the mean $S_t$, as shown in formula 1.8:

$$V_t = \alpha V_{t-1} + S_t \tag{1.8}$$

s23, updating the first weight parameter $w_t$ of the target feature with the current global momentum $V_t$, as shown in formula 1.9:

$$w''_t = w_t - V_t \tag{1.9}$$

where $w''_t$ denotes the weight parameter obtained by updating the first weight parameter with the global momentum. The mean of all the received approximate momenta of the target feature is calculated only after the approximate momenta sent by all the relevant first node devices related to the target feature have been received, and this mean yields the global momentum of the target feature; the global momentum therefore preserves the historical gradient information of the target feature. Moreover, the global momentum takes into account the first node devices whose calculation progress is slower, which avoids the parameter staleness problem caused by the slower calculation progress of some first node devices, and avoids the inconsistency between the global model and stale parameters, so the model training effect and the model training efficiency can be improved. Furthermore, the second node device updates the first weight parameter of the target feature using the global momentum that retains the historical gradient information, so the convergence speed of the target model can be further accelerated by means of the historical gradient information of the target feature during the update.
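The server-side steps s21 to s23 can be sketched as follows, assuming the approximate momenta of the N relevant first node devices have already been collected into a list of dicts (all names and the value of alpha are assumed):

```python
def server_update(weights, global_momentum, worker_momenta, alpha=0.9):
    n = len(worker_momenta)                 # N relevant first node devices
    for fid in weights:
        # formula 1.7: mean of the received approximate momenta
        s_t = sum(m.get(fid, 0.0) for m in worker_momenta) / n
        # formula 1.8: decay the historical global momentum and add the mean
        v_t = alpha * global_momentum.get(fid, 0.0) + s_t
        global_momentum[fid] = v_t
        # formula 1.9: update the first weight parameter
        weights[fid] = weights[fid] - v_t
    return weights, global_momentum
```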
In order to further verify the superiority of the long-term gradient compensation strategy in the embodiment of the present invention, the inventors tested the momentum on the second node device side under limit conditions for the DGC scheme and for the scheme of the embodiment of the present invention, and obtained the following test results:

DGC scheme: $V_{KT+1} = g_{KT+1}$

The scheme of the embodiment of the invention: $V_{KT+1} = \sum_{k=1}^{K} \alpha^{K-k}\, g_{kT+1}$

Comparing the test results shows that, in the limit case, the second node device in the DGC scheme directly uses the gradient as the global momentum, which accumulates no historical gradient information, while the global momentum in the embodiment of the present invention still accumulates historical gradient information. Therefore, even in the limit case, the embodiment of the present invention can use the global momentum to retain historical gradient information, compensating for the loss of the historical gradient information of the target feature in the existing schemes; this improves the consistency between the parallel algorithm and the single-machine implementation, and further effectively improves the compression rate and the model training efficiency.
In the embodiment of the invention, the current gradient accumulation amount of the target feature can be determined according to the weight parameter of the target feature in the target model; when the current gradient accumulation amount of the target feature meets the preset condition, the first node device (serving as a worker) sends the approximate momentum of the target feature to the second node device (serving as a server), and the second node device updates the target model by adopting the approximate momentum of the target feature. In the model training process, the first node device and the second node device do not directly exchange gradients or gradient accumulations but instead exchange approximate momentum, which reduces the number of communication interactions between the first node device and the second node device, realizes a higher gradient compression rate, reduces the communication time, and relieves the communication pressure; meanwhile, no extra memory overhead is required. In addition, updating the model with the approximate momentum accelerates the convergence speed of the model, thereby shortening the training time and improving the training efficiency of the model. Moreover, the historical gradient information of the target feature can be retained based on the long-term gradient compensation strategy, which avoids the loss of historical gradient information, effectively improves the compression rate, and further accelerates the convergence speed of the model by means of the historical gradient information, effectively shortening the time consumed by model training tasks.
In order to highlight the beneficial effects of the model training method provided by the embodiment of the present invention, the inventors compare the schemes from two aspects, model calculation and model testing, as follows:
and (one) model calculation:
the conventional scheme is exemplified by a DGC scheme, a flow schematic diagram of the DGC scheme can be shown in fig. 5, and a specific model calculation flow can be shown in the following formulas 2.1-2.5:
Figure BDA0002000160920000161
v t =βv t-1 +εg to 2.2
R t =R t-1 +v t 2.3
Figure BDA0002000160920000162
Figure BDA0002000160920000163
In the above formulae 2.1 to 2.5, g to Representing the gradient calculated from the first weight parameter; beta represents a momentum decay factor; epsilon represents the gradient learning rate; v t Representing the current local momentum of the target feature, namely the local momentum during the t-th model training; v t-1 Historical local momentum representing the target characteristics, namely local momentum during the t-1 model training; r is R t Representing the current momentum accumulation of the target feature, namely the momentum accumulation during the t-th model training; r is R t-1 A historical momentum accumulation amount representing the target characteristics, namely the momentum accumulation amount during the t-1 model training; r is R t >Delta represents that the current momentum accumulation amount satisfies the transmission condition;
Figure BDA0002000160920000164
parameter update values representing target features transmitted by the ith first node device during the ith model training; w'. t Representing that the second node device adopts the parameter update value +. >
Figure BDA0002000160920000165
For the first weight parameter w t And updating the obtained weight parameters. The above formulas 1.1 to 1.5 are consolidated to obtain the following expression:
Figure BDA0002000160920000166
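For contrast, a per-feature sketch of the DGC worker flow of formulas 2.2 to 2.4 (beta, epsilon, and delta are assumed values). Note that the local momentum v must be kept in memory between iterations, which is exactly the extra storage the momentum approximation of the SGC scheme avoids:

```python
def dgc_step(v, R, gradient, beta=0.9, epsilon=0.01, delta=1.0):
    v = beta * v + epsilon * gradient     # formula 2.2: momentum correction
    R = R + v                             # formula 2.3: momentum accumulation
    if abs(R) > delta:                    # formula 2.4: transmission condition
        # Momentum factor masking: send R and clear the momentum state.
        return 0.0, 0.0, R
    return v, R, None                     # otherwise keep buffering locally
```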
in the case of considering only momentum approximation, the model calculation flow of the model training scheme proposed by the embodiment of the present invention can be seen in the following equations 2.6-2.10:
Figure BDA0002000160920000167
r t =r t-1 +εg to 2.7
Figure BDA0002000160920000168
Figure BDA0002000160920000169
Figure BDA00020001609200001610
The meaning of the symbols in the formulae 2.6 to 2.10 can be referred to the related description of the previous embodiments, and will not be repeated here. The above formulas 2.6 to 2.10 are consolidated to obtain the following expression:
w″′ t+T =w t -ε[g to+T +(1+α)g to+T-1 +…+(1+α)g to+1 ]
comparing the above-mentioned formulae 2.1 to 2.5 of the DGC scheme with the model calculation flow formulae 2.6 to 2.10 of the scheme proposed by the embodiment of the present invention when only the power approximation is considered, it can be seen that: compared with the DGC scheme, the method adopts the momentum approximation calculation, does not need to calculate and consume extra memory to store the historical momentum of the target feature, and saves the memory; the memory of the first node equipment is not a bottleneck of gradient compression in model training, and the parallel scale of model training is promoted.
(II) Model testing:
(1) Test environment setting:
Setting of the target model: the convex optimization model is an L2-regularized logistic regression model, and the non-convex optimization model is an FNN model. Here, convex optimization refers to optimization in which the objective function is a convex function and the set to which the variables belong is a convex set.
Setting of the test datasets: one public dataset (the KDD2010 dataset) and two Tencent internal datasets (the URP dataset and the CTR dataset), as shown in table 1:

TABLE 1

Dataset    Size       Dimension    #non-zero features
KDD2010    2.5GB      20.2M        19.3M
URP        169.7GB    400M         100M
CTR        3.1TB      100G         10G
Test schemes: a momentum algorithm scheme without gradient compression (the Baseline scheme), the GD scheme, the GD-LU scheme, the DGC-LU scheme, the SGC-MA scheme, the SGC-MA-LG scheme, and the SGC scheme. Here, GD-LU denotes the GD scheme with local update added; DGC-LU denotes the DGC scheme with local update added; SGC-MA denotes the scheme of the embodiment of the invention using only momentum approximation; SGC-MA-LG denotes the scheme of the embodiment of the invention using momentum approximation and long-term gradient compensation; and SGC denotes the scheme using momentum approximation, long-term gradient compensation, and local update, i.e., the complete scheme of the invention.
Evaluation indexes: AUC (an index for evaluating ranking ability) and logloss (the loss function).
Cluster environment: 128GB RAM, 48 cores, and 10GB Ethernet.
(2) Test results:
A. Model performance and convergence speed: the momentum approximation algorithm of the SGC scheme provided by the embodiment of the invention can achieve a convergence rate similar to that of the DGC scheme; the long-term gradient compensation strategy of the SGC scheme can obviously improve the convergence rate; and the local update strategy of the SGC scheme relieves the problem of parameter staleness and can further improve the model effect. Specific test result data can be seen in table 2:
TABLE 2: specific AUC and logloss results of each scheme on each test dataset (presented as an image in the original publication).
The AUC effect graphs and logloss effect graphs of the models obtained by testing the different schemes on the test datasets are shown in fig. 6. Panels (a)-(c) of fig. 6 show the model AUC obtained with the different schemes on the different datasets, respectively; the model AUC corresponding to the SGC scheme is the best. Panels (d)-(f) of fig. 6 show the model logloss obtained with the different schemes on the different datasets, respectively; the comparison shows that the curve corresponding to the SGC scheme is the lowest, i.e., the logloss effect corresponding to the SGC scheme is the best.
B. Memory usage and communication time: for a high-dimensional sparse model, the memory bottleneck caused by gradient compression is serious. The SGC scheme provided by the embodiment of the invention reduces memory usage compared with the DGC scheme, and by reducing memory usage it further reduces communication cost.
Fig. 7 (a) shows the memory consumption of the DGC scheme and the SGC scheme at gradient compression rates of 99.99%, 90.00%, and 50.00%, respectively; it can be seen that the memory consumption of the DGC scheme is much greater than that of the SGC scheme. Fig. 7 (b) compares the two schemes at a gradient compression rate of 99.99%; it can be seen that, at the same gradient compression rate, the DGC scheme requires three communication interactions while the SGC scheme requires only one. Therefore, the SGC scheme can further reduce communication interaction, relieve communication pressure, and shorten model training time. Fig. 7 (c) shows the network communication time consumed by the first node device under the uncompressed momentum algorithm scheme, the DGC scheme, and the SGC scheme; it can be seen that the SGC scheme consumes much less time than the uncompressed momentum algorithm scheme, and is also smaller than the DGC scheme during the transmission phase.
Based on the description of the above embodiments, the embodiments of the present invention also provide a service processing method, which may be performed by a terminal; the terminal may include, but is not limited to: intelligent terminals, personal computers, tablet computers, desktop computers, and the like. Referring to fig. 8, the service processing method may specifically include the following steps S801 to S803:
S801, a service request is acquired.
The service request may include, but is not limited to: web content recommendation requests, audio-video recommendation requests, image processing requests, face recognition requests, and the like. The terminal can acquire a service request after detecting a service triggering operation; the service triggering operations herein may include, but are not limited to: a refresh operation for multimedia data, a search operation for searching multimedia data, an image processing operation, and the like; wherein the multimedia data may include: web content data and/or audiovisual data.
Alternatively, the terminal may periodically acquire the service request at a fixed frequency, where the fixed frequency may be set according to actual requirements or empirical values. For example, when the service processing method is applied to a real-time information application (e.g., a browser), there is a strong demand for the terminal to frequently push the latest information to the user; in this case a higher fixed frequency, such as 5 times/min, may be set. As another example, when the service processing method is applied to a music application, music is generally updated once a day, i.e., the terminal does not need to push music to the user frequently; in this case a lower fixed frequency, such as once/day, can be set.
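As a minimal sketch of such fixed-frequency acquisition (the function names and interval values are illustrative assumptions, not part of the embodiment):

    import time

    def poll_service_requests(fetch_request, process_request, interval_seconds):
        """Periodically acquire service requests at a fixed frequency.

        fetch_request and process_request stand in for the terminal's own
        request-acquisition and service-processing logic.
        """
        while True:
            request = fetch_request()      # may return None if nothing is pending
            if request is not None:
                process_request(request)
            time.sleep(interval_seconds)   # e.g. 12.0 for 5 times/min; 86400.0 for once/day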
S802, responding to the service request, and calling a target model to execute processing on the requested service to obtain a processing result.
The target model can be obtained by training with either of the model training methods shown in fig. 3 or fig. 4. In one embodiment, the target model may be trained and updated in advance using the model training method described above. In another embodiment, the target model may also be trained and updated in real time, according to the historical portrait data of the user and using the model training method described above, when responding to the service request; the embodiment of the invention is not limited in this regard.
When the service request includes a web page content recommendation request, the specific implementation manner of step S802 may be: in response to the web page content recommendation request, at least one candidate web page content to be recommended is acquired, wherein the candidate web page content comprises one or more of the following types of content: text, audio video, and images; calling a target model to predict the click rate of each candidate webpage content to obtain the predicted click rate of each candidate webpage content; and determining target webpage contents according to the predicted click rate of each candidate webpage content, wherein the predicted click rate of the target webpage contents meets the output condition. The predicted click rate of the target webpage content meets the output condition comprises the following steps: the predicted click rate of the target webpage content is larger than a preset click rate threshold; alternatively, the predicted click rate of the target web content is greater than the predicted click rates of the remaining candidate web content other than the target web content, and so on.
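The selection logic of this step can be sketched as follows (a minimal sketch; predict_click_rate is a hypothetical interface name, not an API defined by the patent):

    def select_target_content(model, candidates, threshold=None):
        """Determine target web page content from predicted click rates.

        If a threshold is given, the output condition is "predicted click
        rate greater than the preset click rate threshold"; otherwise it is
        "greater than the predicted click rates of all remaining candidates".
        """
        scored = [(model.predict_click_rate(c), c) for c in candidates]
        if threshold is not None:
            return [c for rate, c in scored if rate > threshold]
        _, best_content = max(scored, key=lambda pair: pair[0])
        return [best_content]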
When the service request includes a face recognition request, the specific implementation manner of step S802 may be: responding to a face recognition request, and acquiring a target face image to be recognized; invoking a target model to carry out face recognition on the target face image to obtain a recognition result, wherein the recognition result can comprise at least one of the following: expression identification, identity identification and the like of the target face image.
S803, outputting a processing result.
For ease of understanding, the following describes an embodiment of the present invention by taking as an example the application of the service processing method to a web page content recommendation scenario:
In the process of browsing web page content with the terminal, the terminal can provide a service interface to the user on the user interface; the service interface may comprise a recommendation interface 11 for recommending Feeds streams and/or a search direct interface 12 for searching web page content, as shown in fig. 9a. Feeds here refers to information streams that are continuously updated and presented to the user. The user may click the recommendation interface 11 to obtain the latest web page content, or click the search direct interface 12 to search for web page content of interest.
Taking the case where the user clicks the recommendation interface 11 and the web page content currently displayed on the user interface of the terminal is web page content A as an example: after the terminal detects the user's click operation, it can obtain a web page content recommendation request; in response to the request, it acquires at least one candidate web page content to be recommended, including candidate web page content A, candidate web page content B, candidate web page content C, and candidate web page content D. The target model may then be invoked to predict the click rate of these 4 candidate web page contents, resulting in a predicted click rate of 80% for candidate web page content A, 50% for candidate web page content B, 85% for candidate web page content C, and 70% for candidate web page content D. The terminal may determine candidate web page content C, which has the highest predicted click rate, as the target web page content and output it on the user interface, as shown in fig. 9b. By invoking the target model trained with the model training method shown in fig. 3 or fig. 4 to perform service processing, the overall exposure efficiency of the web page content and the exposure efficiency of the main Feed are improved.
After acquiring the service request, the embodiment of the invention responds to the service request and invokes the target model to perform processing on the requested service to obtain a processing result, and then outputs the processing result. Because the target model is trained with the model training method shown in fig. 3 or fig. 4, its response speed is high, and invoking it for service processing can reduce processing time and improve processing efficiency.
Based on the above description of the embodiment of the model training method, the embodiment of the present invention further provides a model training apparatus, which may be a computer program (including program code) running in the first node device. The model training apparatus may perform steps S301 to S304 shown in fig. 3 or steps S401 to S406 shown in fig. 4. Referring to fig. 10, the model training apparatus may operate the following units:
an acquiring unit 101, configured to acquire a weight parameter of a target feature in a target model;
a determining unit 102, configured to determine a current gradient accumulation amount of the target feature according to a weight parameter of the target feature;
the acquiring unit 101 is configured to acquire an approximate momentum of the target feature if the current gradient accumulation amount meets a preset condition; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
A sending unit 103, configured to send the approximate momentum of the target feature to a second node device, so that the second node device updates the target model with the approximate momentum of the target feature.
In one embodiment, the obtaining unit 101 may be specifically configured to, when configured to obtain the approximate momentum of the target feature if the current gradient accumulated amount meets a preset condition:
if the current gradient accumulation quantity meets a preset condition, acquiring a momentum approximation algorithm;
and calculating the approximate momentum of the target feature by adopting the momentum approximation algorithm according to the historical gradient accumulation of the target feature and the current gradient accumulation.
In still another embodiment, the obtaining unit 101 may be specifically configured to, when calculating the approximate momentum of the target feature according to the historical gradient accumulation amount of the target feature and the current gradient accumulation amount by using the momentum approximation algorithm:
invoking a gradient attenuation factor to attenuate the historical gradient accumulation;
and combining the historical gradient accumulation after the attenuation processing with the current gradient accumulation to obtain the approximate momentum of the target feature.
In yet another embodiment, the weight parameter includes a first weight parameter; accordingly, the acquiring unit 101, when configured to acquire the weight parameters of the target feature in the target model, may be specifically configured to:
First weight parameters of target features in the target model are pulled from a second node device.
In yet another embodiment, the weight parameter includes a second weight parameter; accordingly, the acquiring unit 101, when configured to acquire the weight parameters of the target feature in the target model, may be specifically configured to:
pulling first weight parameters of target features in the target model from second node equipment;
and updating the first weight parameter by adopting the historical gradient accumulation amount of the target feature to obtain a second weight parameter.
In yet another embodiment, the determining unit 102, when configured to determine the current gradient accumulation amount of the target feature according to the weight parameter of the target feature, may be specifically configured to:
carrying out gradient calculation according to the weight parameters of the target characteristics to obtain a target gradient;
and combining the historical gradient accumulation of the target feature with the target gradient to obtain the current gradient accumulation of the target feature.
In still another embodiment, the determining unit 102 may be specifically configured to, when configured to combine the historical gradient cumulative amount of the target feature and the target gradient to obtain the current gradient cumulative amount of the target feature:
Acquiring gradient learning rate;
weighting the target gradient by adopting the gradient learning rate to obtain an intermediate gradient;
and merging the intermediate gradient and the historical gradient accumulation of the target feature to obtain the current gradient accumulation of the target feature.
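Expressed as a minimal sketch (the function and variable names are illustrative assumptions), the merge performed here is:

    def current_gradient_accumulation(target_grad, history, learning_rate):
        """Weight the target gradient by the gradient learning rate to obtain
        the intermediate gradient, then merge it with the historical gradient
        accumulation of the target feature (the worker-side step of formula 2.7)."""
        intermediate = learning_rate * target_grad
        return history + intermediate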
According to one embodiment of the invention, steps S301-S304 shown in FIG. 3 and steps S401-S406 shown in FIG. 4 may each be performed by a respective unit in the model training apparatus shown in FIG. 10. For example, steps S301 and S303 shown in fig. 3 may be performed by the acquisition unit 101 shown in fig. 10, step S302 may be performed by the determination unit 102 shown in fig. 10, and step S304 may be performed by the transmission unit 103 shown in fig. 10; as another example, steps S401 to S402 shown in fig. 4 may be performed by the acquisition unit 101 shown in fig. 10, steps S403 to S404 may be performed by the determination unit 102 shown in fig. 10, and steps S405 and S406 may be performed by the acquisition unit 101 and the transmission unit 103 shown in fig. 10, respectively.
Based on the above description of the embodiments of the model training method, in another embodiment, another model training apparatus is provided, which may be a computer program (including program code) running in the second node device. Referring to fig. 11, the model training apparatus may operate the following units:
A receiving unit 201, configured to receive an approximate momentum of a target feature in a target model sent by a first node device, where the approximate momentum of the target feature is acquired and sent by the first node device when a current gradient accumulation amount of the target feature meets a preset condition, and the current gradient accumulation amount is determined according to a weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
an updating unit 202 for updating the target model with the approximate momentum of the target feature.
In one embodiment, the model training apparatus may further include a transmitting unit 203; the sending unit 203 may be configured to, prior to said receiving the approximate momentum of the target feature in the target model sent by the first node device:
and responding to the pulling operation of the first node equipment, and sending first weight parameters of target characteristics in the target model to the first node equipment.
In yet another embodiment, the updating unit 202, when configured to update the target model with the approximate momentum of the target feature, may be specifically configured to:
the target model is updated based on a long-term gradient compensation strategy and using the approximate momentum of the target features.
In yet another embodiment, the updating unit 202, when configured to update the target model based on the long-term gradient compensation strategy and using the approximate momentum of the target feature, may be specifically configured to:
calculating the average value of the approximate momentum of the target feature;
and calculating the current global momentum of the target feature according to the obtained mean value, and updating the first weight parameter of the target feature by adopting the current global momentum.
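A hedged sketch of this server-side update follows; how the mean is folded into the current global momentum is not fixed by the text, so the momentum-style combination below is an assumption:

    def server_update(weight, approx_momenta, global_momentum, decay):
        """Long-term gradient compensation on the second node device:
        average the approximate momenta pushed by the first node devices,
        derive the current global momentum from the mean, and update the
        first weight parameter of the target feature with it.
        """
        mean = sum(approx_momenta) / len(approx_momenta)   # mean of approximate momenta
        global_momentum = decay * global_momentum + mean   # assumed combination form
        weight = weight - global_momentum                  # apply the global momentum
        return weight, global_momentum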
According to an embodiment of the present invention, step S305 shown in fig. 3 and step S407 shown in fig. 4 may be performed by the updating unit 202 in the model training apparatus shown in fig. 11.
According to another embodiment of the present invention, each unit in the model training apparatus shown in fig. 10 or fig. 11 may be separately or completely combined into one or several additional units, or some unit(s) thereof may be further split into a plurality of units with smaller functions, which can achieve the same operation without affecting the technical effects of the embodiments of the present invention. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the invention, the model training apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units or cooperatively implemented by a plurality of units.
According to another embodiment of the present invention, a model training apparatus as shown in fig. 10 or fig. 11 may be constructed, and the model training method of the embodiment of the present invention implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 3 and fig. 4 on a general-purpose computing device, such as a computer, comprising processing elements such as a Central Processing Unit (CPU) and storage elements such as a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and executed therein.
In the embodiment of the invention, the current gradient accumulation amount of a target feature can be determined according to the weight parameter of the target feature in the target model; when the current gradient accumulation amount of the target feature meets the preset condition, the first node device (serving as a worker) sends the approximate momentum of the target feature to the second node device (serving as a server), and the second node device updates the target model with the approximate momentum of the target feature. In the model training process, the first node device and the second node device do not directly exchange gradients but exchange approximate momentum, which reduces the number of communication interactions between the two devices, achieves a higher gradient compression rate, reduces communication time, and relieves communication pressure, all without additional memory overhead. In addition, updating the model with approximate momentum accelerates model convergence, shortening training time and improving model training efficiency.
Based on the description of the model training method embodiment and the model training device embodiment, the embodiment of the invention also provides node equipment. The node device comprises at least a processor 301, a communication interface 302 and a computer storage medium 303. Optionally, a radio frequency receiver and a radio frequency transmitter may also be included in the communication interface 302. The node device provided by the embodiment of the present invention may be the first node device mentioned above, or may be the second node device mentioned above. A radio frequency receiver in the first node device is operable to receive a first weight parameter of the target feature transmitted by the second node device, and a radio frequency transmitter is operable to transmit an approximate momentum of the target feature to the second node device; the radio frequency receiver in the second node device may be operable to receive the approximate momentum of the target feature transmitted by the first node device and the radio frequency transmitter may be operable to transmit a first weight parameter of the target feature to the first node device. Taking the first node device as an example, a schematic structural diagram thereof can be seen in fig. 12.
A computer storage medium 303 may be stored in a memory of a node device, the computer storage medium 303 being for storing a computer program comprising program instructions, the processor 301 being for executing the program instructions stored by the computer storage medium 303. The processor 301, or CPU (Central Processing Unit ), is a computing core as well as a control core of the node device, adapted to implement one or more instructions, in particular to load and execute one or more instructions to implement a corresponding method flow or a corresponding function. In one embodiment, when the node device is a first node device, the processor 301 according to the embodiment of the present invention may load and execute one or more first instructions stored in the computer storage medium 303 to perform a series of model training processes, including: acquiring a weight parameter of a target feature in a target model; determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature; if the current gradient accumulation quantity meets a preset condition, acquiring the approximate momentum of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature; transmitting the approximate momentum of the target feature to a second node device, causing the second node device to update the target model with the approximate momentum of the target feature, and so on. In yet another embodiment, when the node device is a second node device, the processor 301 according to the embodiment of the present invention may load and execute one or more second instructions stored in the computer storage medium 303 to perform a series of model training processes, including: receiving the approximate momentum of a target feature in a target model sent by first node equipment, wherein the approximate momentum of the target feature is acquired and sent by the first node equipment when the current gradient accumulation of the target feature meets a preset condition, and the current gradient accumulation is determined according to the weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature; updating the target model with the approximate momentum of the target features, and so on.
The embodiment of the invention also provides a computer storage medium (Memory), which is a Memory device in the node device and is used for storing programs and data. It will be appreciated that the computer storage media herein may include both built-in storage media in the node device and extended storage media supported by the node device. The computer storage media provide storage space that stores an operating system of the node device. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor 301. The computer storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; optionally, at least one computer storage medium remote from the processor may be present.
In one embodiment, one or more first instructions stored in a computer storage medium may be loaded and executed by the processor 301 to implement the respective steps of the methods described above in relation to the model training embodiments; in particular implementations, one or more first instructions in the computer storage medium are loaded by the processor 301 and perform the steps of:
Acquiring a weight parameter of a target feature in a target model;
determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature;
if the current gradient accumulation quantity meets a preset condition, acquiring the approximate momentum of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
and transmitting the approximate momentum of the target feature to second node equipment, so that the second node equipment updates the target model by adopting the approximate momentum of the target feature.
In one embodiment, when the current gradient accumulation amount satisfies a preset condition, and the approximate momentum of the target feature is obtained, the one or more first instructions are loaded and specifically executed by the processor 301:
if the current gradient accumulation quantity meets a preset condition, acquiring a momentum approximation algorithm;
and calculating the approximate momentum of the target feature by adopting the momentum approximation algorithm according to the historical gradient accumulation of the target feature and the current gradient accumulation.
In yet another embodiment, the one or more first instructions are loaded and executed in particular by the processor 301 when calculating an approximate momentum of the target feature using the momentum approximation algorithm based on the historical gradient accumulation of the target feature and the current gradient accumulation:
Invoking a gradient attenuation factor to attenuate the historical gradient accumulation;
and combining the historical gradient accumulation after the attenuation processing with the current gradient accumulation to obtain the approximate momentum of the target feature.
In yet another embodiment, the weight parameter includes a first weight parameter; accordingly, in acquiring the weight parameters of the target feature in the target model, the one or more first instructions are loaded and specifically executed by the processor 301:
first weight parameters of target features in the target model are pulled from a second node device.
In yet another embodiment, the weight parameter includes a second weight parameter; accordingly, in acquiring the weight parameters of the target feature in the target model, the one or more first instructions are loaded and specifically executed by the processor 301:
pulling first weight parameters of target features in the target model from second node equipment;
and updating the first weight parameter by adopting the historical gradient accumulation amount of the target feature to obtain a second weight parameter.
In yet another embodiment, the one or more first instructions may be loaded and executed in particular by the processor 301 when determining the current gradient aggregate amount for the target feature based on the weight parameter of the target feature:
Carrying out gradient calculation according to the weight parameters of the target characteristics to obtain a target gradient;
and combining the historical gradient accumulation of the target feature with the target gradient to obtain the current gradient accumulation of the target feature.
In yet another embodiment, the one or more first instructions may be loaded and executed specifically by the processor 301 when merging the historical gradient aggregate of the target feature and the target gradient to obtain the current gradient aggregate of the target feature:
acquiring gradient learning rate;
weighting the target gradient by adopting the gradient learning rate to obtain an intermediate gradient;
and merging the intermediate gradient and the historical gradient accumulation of the target feature to obtain the current gradient accumulation of the target feature.
In yet another embodiment, one or more second instructions stored in a computer storage medium may be loaded and executed by the processor 301 to implement the respective steps of the method described above in relation to the model training embodiment; in particular implementations, one or more second instructions in the computer storage medium are loaded by the processor 301 and perform the steps of:
receiving the approximate momentum of a target feature in a target model sent by first node equipment, wherein the approximate momentum of the target feature is acquired and sent by the first node equipment when the current gradient accumulation of the target feature meets a preset condition, and the current gradient accumulation is determined according to the weight parameter of the target feature; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
Updating the target model using the approximate momentum of the target feature.
In one embodiment, the one or more second instructions are loaded and specifically executed by the processor 301 prior to the receiving the approximate momentum of the target feature in the target model transmitted by the first node device:
and responding to the pulling operation of the first node equipment, and sending first weight parameters of target characteristics in the target model to the first node equipment.
In yet another embodiment, when updating the target model with the approximate momentum of the target feature, the one or more second instructions are loaded by the processor 301 and specifically executed to: update the target model based on a long-term gradient compensation strategy and using the approximate momentum of the target feature.

In yet another embodiment, when the target model is updated based on the long-term gradient compensation strategy and with the approximate momentum of the target feature, the one or more second instructions are loaded and specifically executed by the processor 301:
calculating the average value of the approximate momentum of the target feature;
and calculating the current global momentum of the target feature according to the obtained mean value, and updating the first weight parameter of the target feature by adopting the current global momentum.
In the embodiment of the invention, the current gradient accumulation amount of a target feature can be determined according to the weight parameter of the target feature in the target model; when the current gradient accumulation amount of the target feature meets the preset condition, the first node device (serving as a worker) sends the approximate momentum of the target feature to the second node device (serving as a server), and the second node device updates the target model with the approximate momentum of the target feature. In the model training process, the first node device and the second node device do not directly exchange gradients but exchange approximate momentum, which reduces the number of communication interactions between the two devices, achieves a higher gradient compression rate, reduces communication time, and relieves communication pressure, all without additional memory overhead. In addition, updating the model with approximate momentum accelerates model convergence, shortening training time and improving model training efficiency.
Based on the above description of the embodiments of the service processing method, the embodiments of the present invention also disclose a service processing apparatus, which may be a computer program (including program code) running in a terminal. The service processing device may perform the method shown in fig. 8. Referring to fig. 13, the service processing apparatus may operate the following units:
An acquiring unit 401, configured to acquire a service request;
a processing unit 402, configured to respond to the service request, and invoke a target model to perform processing on the requested service to obtain a processing result; the target model is obtained by training by adopting the model training method;
an output unit 403 for outputting the processing result.
In one embodiment, the service request includes: a web page content recommendation request; accordingly, when the processing unit 402 is configured to respond to the service request, call the target model to perform processing on the requested service to obtain a processing result, the processing unit may be specifically configured to:
responding to the webpage content recommendation request, and acquiring at least one candidate webpage content to be recommended, wherein the candidate webpage content comprises one or more of the following types of content: text, audio video, and images;
calling a target model to predict the click rate of each candidate webpage content to obtain the predicted click rate of each candidate webpage content;
and determining target webpage contents according to the predicted click rate of each candidate webpage content, wherein the predicted click rate of the target webpage contents meets the output condition.
According to an embodiment of the present invention, the steps involved in the method shown in fig. 8 may be performed by the units in the service processing apparatus shown in fig. 13. For example, steps S801 to S803 shown in fig. 8 may be performed by the acquisition unit 401, the processing unit 402, and the output unit 403 shown in fig. 13, respectively. According to another embodiment of the present invention, each unit in the service processing apparatus shown in fig. 13 may be separately or completely combined into one or several other units, or some unit(s) thereof may be further split into a plurality of units with smaller functions, which can achieve the same operation without affecting the technical effects of the embodiments of the present invention. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present invention, the service processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units or cooperatively implemented by a plurality of units.
According to another embodiment of the present invention, a service processing apparatus as shown in fig. 13 may be constructed, and the service processing method of the embodiment of the present invention implemented, by running a computer program (including program code) capable of executing the steps involved in the method shown in fig. 8 on a general-purpose computing device, such as a computer, comprising processing elements such as a Central Processing Unit (CPU) and storage elements such as a random access storage medium (RAM) and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and executed therein.
After acquiring the service request, the embodiment of the invention responds to the service request and invokes the target model to perform processing on the requested service to obtain a processing result, and then outputs the processing result. Because the target model is trained with the model training method shown in fig. 3 or fig. 4, its response speed is high, and invoking it for service processing can reduce processing time and improve processing efficiency.
Based on the description of the method embodiment and the device embodiment, the embodiment of the invention also provides a terminal. Referring to fig. 14, the terminal includes at least a processor 501, an input device 502, an output device 503, and a computer storage medium 504. Optionally, the input device 502 may further include a hardware device for performing man-machine interaction, such as a keyboard, a touch screen, and the like; the terminal may also include a battery for providing power to the terminal to support normal operation of the terminal, and so on.
A computer storage medium 504 may be stored in a memory of the terminal; the computer storage medium 504 is used for storing a computer program comprising program instructions, and the processor 501 is used for executing the program instructions stored by the computer storage medium 504. The processor 501 (or CPU, Central Processing Unit) is the computing core and control core of the terminal, adapted to implement one or more instructions, in particular to load and execute one or more instructions to implement a corresponding method flow or function. In one embodiment, the processor 501 according to the embodiment of the present invention may be configured to perform a series of service processing, including: acquiring a service request; responding to the service request and invoking the target model to perform processing on the requested service to obtain a processing result; outputting the processing result; and so on.
The embodiment of the invention also provides a computer storage medium (Memory), which is a Memory device in the terminal and is used for storing programs and data. It will be appreciated that the computer storage medium herein may include both a built-in storage medium in the terminal and an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores an operating system of the terminal. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor 501. The computer storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; optionally, at least one computer storage medium remote from the processor may be present.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by processor 501 to implement the respective steps of the methods described above in connection with the business process embodiments; in particular implementations, one or more third instructions in the computer storage medium are loaded by the processor 501 and perform the steps of:
acquiring a service request;
responding to the service request, and calling a target model to execute processing on the requested service to obtain a processing result; the target model is obtained by training by adopting a model training method shown in fig. 3 or fig. 4;
and outputting the processing result.
In one embodiment, the service request includes: a web page content recommendation request; accordingly, when the target model is invoked to perform processing on the requested service in response to the service request to obtain a processing result, the one or more instructions may be further loaded and specifically executed by the processor 501:
responding to the webpage content recommendation request, and acquiring at least one candidate webpage content to be recommended, wherein the candidate webpage content comprises one or more of the following types of content: text, audio video, and images;
Calling a target model to predict the click rate of each candidate webpage content to obtain the predicted click rate of each candidate webpage content;
and determining target webpage contents according to the predicted click rate of each candidate webpage content, wherein the predicted click rate of the target webpage contents meets the output condition.
After acquiring the service request, the embodiment of the invention responds to the service request and invokes the target model to perform processing on the requested service to obtain a processing result, and then outputs the processing result. Because the target model is trained with the model training method shown in fig. 3 or fig. 4, its response speed is high, and invoking it for service processing can reduce processing time and improve processing efficiency.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (13)

1. A model training method applied to a first node device, comprising:
acquiring a weight parameter of a target feature in a target model;
determining the current gradient accumulation amount of the target feature according to the weight parameter of the target feature; the current gradient accumulation amount refers to gradient accumulation amount between current model training and model iteration training of last acquired and transmitted approximate momentum;
If the current gradient accumulation amount meets a preset condition, acquiring the approximate momentum of the target feature, including: if the current gradient accumulation amount meets a preset condition, acquiring a momentum approximation algorithm, and calculating the approximate momentum of the target feature by adopting the momentum approximation algorithm according to the historical gradient accumulation amount of the target feature and the current gradient accumulation amount; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature;
transmitting the approximate momentum of the target feature to a second node device, so that the second node device updates the target model by adopting the approximate momentum of the target feature;
wherein the calculating the approximate momentum of the target feature by using the momentum approximation algorithm according to the historical gradient accumulation of the target feature and the current gradient accumulation comprises: and calling a gradient attenuation factor to carry out attenuation treatment on the historical gradient accumulation amount, and combining the historical gradient accumulation amount after the attenuation treatment with the current gradient accumulation amount to obtain the approximate momentum of the target feature.
2. The method of claim 1, wherein the weight parameter comprises a first weight parameter; the obtaining the weight parameters of the target features in the target model comprises the following steps:
First weight parameters of target features in the target model are pulled from a second node device.
3. The method of claim 1, wherein the weight parameter comprises a second weight parameter; the obtaining the weight parameters of the target features in the target model comprises the following steps:
pulling first weight parameters of target features in the target model from second node equipment;
and updating the first weight parameter by adopting the historical gradient accumulation amount of the target feature to obtain a second weight parameter.
4. The method of claim 1, wherein the determining the current gradient aggregate amount for the target feature based on the weight parameter for the target feature comprises:
carrying out gradient calculation according to the weight parameters of the target characteristics to obtain a target gradient;
and combining the historical gradient accumulation of the target feature with the target gradient to obtain the current gradient accumulation of the target feature.
5. The method of claim 4, wherein the merging the historical gradient aggregate of the target feature and the target gradient to obtain the current gradient aggregate of the target feature comprises:
acquiring gradient learning rate;
Weighting the target gradient by adopting the gradient learning rate to obtain an intermediate gradient;
and merging the intermediate gradient and the historical gradient accumulation of the target feature to obtain the current gradient accumulation of the target feature.
6. A model training method applied to a second node device, comprising:
receiving the approximate momentum of the target feature in the target model sent by the first node device, wherein the approximate momentum of the target feature is acquired and sent by the first node device when the current gradient accumulation of the target feature meets the preset condition, and the method comprises the following steps: if the current gradient accumulation amount meets a preset condition, acquiring a momentum approximation algorithm, and calculating the approximate momentum of the target feature by adopting the momentum approximation algorithm according to the historical gradient accumulation amount of the target feature and the current gradient accumulation amount; the current gradient accumulation amount is determined according to the weight parameter of the target feature; the current gradient accumulation amount refers to gradient accumulation amount between current model training and model iteration training of last acquired and transmitted approximate momentum; the approximate momentum is obtained by performing momentum approximate calculation according to the gradient accumulation of the target feature; wherein the calculating the approximate momentum of the target feature by using the momentum approximation algorithm according to the historical gradient accumulation of the target feature and the current gradient accumulation comprises: invoking a gradient attenuation factor to attenuate the historical gradient accumulation amount, and combining the attenuated historical gradient accumulation amount with the current gradient accumulation amount to obtain the approximate momentum of the target feature;
Updating the target model using the approximate momentum of the target feature.
7. The method of claim 6, further comprising, prior to said receiving the approximate momentum of the target feature in the target model transmitted by the first node device:
and responding to the pulling operation of the first node equipment, and sending first weight parameters of target characteristics in the target model to the first node equipment.
8. The method of claim 7, wherein said updating said target model with approximate momentum of said target features comprises:
the target model is updated based on a long-term gradient compensation strategy and using the approximate momentum of the target features.
9. The method of claim 8, wherein the updating the target model based on a long-term gradient compensation strategy and employing the approximate momentum of the target feature comprises:
calculating the average value of the approximate momentum of the target feature;
and calculating the current global momentum of the target feature according to the obtained mean value, and updating the first weight parameter of the target feature by adopting the current global momentum.
10. A method of service processing, the method comprising:
Acquiring a service request;
responding to the service request, and calling a target model to execute processing on the requested service to obtain a processing result; wherein the target model is obtained by training the model training method according to any one of claims 1 to 9;
and outputting the processing result.
11. The method of claim 10, wherein the service request comprises: a web page content recommendation request; and responding to the service request, calling a target model to execute processing on the requested service to obtain a processing result, wherein the processing result comprises the following steps:
responding to the webpage content recommendation request, and acquiring at least one candidate webpage content to be recommended, wherein the candidate webpage content comprises one or more of the following types of content: text, audio video, and images;
calling a target model to predict the click rate of each candidate webpage content to obtain the predicted click rate of each candidate webpage content;
and determining target webpage contents according to the predicted click rate of each candidate webpage content, wherein the predicted click rate of the target webpage contents meets the output condition.
12. A node device comprising a communication interface, further comprising:
a processor adapted to implement one or more instructions; the method comprises the steps of,
A computer storage medium storing one or more first instructions adapted to be loaded by the processor and to perform the model training method of any of claims 1-5; alternatively, the computer storage medium stores one or more second instructions adapted to be loaded by the processor and to perform the model training method of any of claims 6-9.
13. A terminal comprising an input device and an output device, further comprising:
a processor adapted to implement one or more instructions; the method comprises the steps of,
a computer storage medium storing one or more third instructions adapted to be loaded by the processor and to perform the business processing method of claim 10 or 11.
CN201910211389.7A 2019-03-19 2019-03-19 Model training method, service processing method, device and related equipment Active CN109978177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910211389.7A CN109978177B (en) 2019-03-19 2019-03-19 Model training method, service processing method, device and related equipment


Publications (2)

Publication Number Publication Date
CN109978177A CN109978177A (en) 2019-07-05
CN109978177B true CN109978177B (en) 2023-06-23

Family

ID=67079563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910211389.7A Active CN109978177B (en) 2019-03-19 2019-03-19 Model training method, service processing method, device and related equipment

Country Status (1)

Country Link
CN (1) CN109978177B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157776B2 (en) * 2019-09-20 2021-10-26 International Business Machines Corporation Systems and methods for maintaining data privacy in a shared detection model system
US11188320B2 (en) 2019-09-20 2021-11-30 International Business Machines Corporation Systems and methods for updating detection models and maintaining data privacy
US11216268B2 (en) 2019-09-20 2022-01-04 International Business Machines Corporation Systems and methods for updating detection models and maintaining data privacy
US11080352B2 (en) * 2019-09-20 2021-08-03 International Business Machines Corporation Systems and methods for maintaining data privacy in a shared detection model system
CN112104706B (en) * 2020-08-24 2022-12-20 中国银联股份有限公司 Method, device, equipment and storage medium for releasing model in distributed system
CN113095407A (en) * 2021-04-12 2021-07-09 哈尔滨理工大学 Efficient asynchronous federated learning method for reducing communication times
CN115994590B (en) * 2023-03-23 2023-07-14 浪潮电子信息产业股份有限公司 Data processing method, system, equipment and storage medium based on distributed cluster

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018187764A1 (en) * 2017-04-06 2018-10-11 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Isotropic generalized diffusion tensor mri
CN108898560A (en) * 2018-06-21 2018-11-27 四川大学 Rock core CT image super-resolution rebuilding method based on Three dimensional convolution neural network
CN109272046A (en) * 2018-09-26 2019-01-25 北京科技大学 Deep learning method based on L2 again regularization Adam switching simulated tempering SGD
WO2019042571A1 (en) * 2017-09-04 2019-03-07 Huawei Technologies Co., Ltd. Asynchronous gradient averaging distributed stochastic gradient descent

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7085426B2 (en) * 2001-10-15 2006-08-01 Jonas August Volterra filters for enhancement of contours in images
US7590589B2 (en) * 2004-09-10 2009-09-15 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US11775850B2 (en) * 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
US10572800B2 (en) * 2016-02-05 2020-02-25 Nec Corporation Accelerating deep neural network training with inconsistent stochastic gradient descent
CN106127248A (en) * 2016-06-24 2016-11-16 平安科技(深圳)有限公司 Car plate sorting technique based on degree of depth study and system
US10552292B2 (en) * 2016-08-18 2020-02-04 Proov Systems Ltd. System, method and computer product for management of proof-of-concept software pilots, including neural network-based KPI prediction
CN107622307A (en) * 2017-09-11 2018-01-23 浙江工业大学 A kind of Undirected networks based on deep learning connect side right weight Forecasting Methodology
CN109117951B (en) * 2018-01-15 2021-11-16 重庆大学 BP neural network-based probability load flow online calculation method
CN108491928B (en) * 2018-03-29 2019-10-25 腾讯科技(深圳)有限公司 Model parameter sending method, device, server and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018187764A1 (en) * 2017-04-06 2018-10-11 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Isotropic generalized diffusion tensor mri
WO2019042571A1 (en) * 2017-09-04 2019-03-07 Huawei Technologies Co., Ltd. Asynchronous gradient averaging distributed stochastic gradient descent
CN108898560A (en) * 2018-06-21 2018-11-27 四川大学 Rock core CT image super-resolution rebuilding method based on Three dimensional convolution neural network
CN109272046A (en) * 2018-09-26 2019-01-25 北京科技大学 Deep learning method based on L2 again regularization Adam switching simulated tempering SGD

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Evidence of nonlocality due to a gradient term in the optical model; Jaghoub, M. et al.; Nuclear Physics; pp. 59-69 *
A survey of deep reinforcement learning based on value function and policy gradient; Liu Jianwei et al.; Chinese Journal of Computers; Vol. 42, No. 6; pp. 1406-1438 *

Also Published As

Publication number Publication date
CN109978177A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109978177B (en) Model training method, service processing method, device and related equipment
Elgendy et al. Joint computation offloading and task caching for multi-user and multi-task MEC systems: reinforcement learning-based algorithms
KR102300077B1 (en) Optimizing user interface data caching for future actions
US20230104506A1 (en) Push notification delivery system with feedback analysis
US8909644B2 (en) Real-time adaptive binning
US11941527B2 (en) Population based training of neural networks
Zhang et al. Federated learning with adaptive communication compression under dynamic bandwidth and unreliable networks
US10516571B2 (en) Server load management
US20140280251A1 (en) Almost online large scale collaborative filtering based recommendation system
CN110633796B (en) Model updating method and device, electronic equipment and storage medium
CN113989561B (en) Parameter aggregation updating method, device and system based on asynchronous federal learning
Fan et al. PA-cache: Evolving learning-based popularity-aware content caching in edge networks
US20140258382A1 (en) Application congestion control
CN111898698B (en) Object processing method and device, storage medium and electronic equipment
CN113778691B (en) Task migration decision method, device and system
Rang et al. Data life aware model updating strategy for stream-based online deep learning
CN111898740B (en) Model parameter updating method and device of prediction model
US10644968B1 (en) Sampling in sliding windows with tight optimality and time decayed design
US10891514B2 (en) Image classification pipeline
Zheng et al. Shadowsync: Performing synchronization in the background for highly scalable distributed training
CN112218114A (en) Video cache control method, device and computer readable storage medium
Kushwaha et al. Optimal device selection in federated learning for resource-constrained edge networks
CN110399979B (en) Click rate pre-estimation system and method based on field programmable gate array
US11042538B2 (en) Predicting queries using neural networks
Zhou et al. Inference replication at edges via combinatorial multi-armed bandit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant