CN117670470A - Training method and device for recommendation model


Info

Publication number
CN117670470A
CN117670470A
Authority
CN
China
Prior art keywords
training
model
prediction network
weight prediction
data
Prior art date
Legal status
Pending
Application number
CN202311475332.0A
Other languages
Chinese (zh)
Inventor
齐盛
董辉
Current Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202311475332.0A priority Critical patent/CN117670470A/en
Publication of CN117670470A publication Critical patent/CN117670470A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure relates to the technical field of computers, and provides a training method and device for a recommendation model. The method comprises the following steps: respectively inputting the characteristic data of the training data into a multitask model and a weight prediction network to correspondingly obtain a multi-target predicted value and a weight parameter vector; obtaining dot products of the multi-target predicted values and the weight parameter vectors to serve as fusion predicted values; adjusting the multi-task model according to the multi-target predicted value and the training label of the training data, and adjusting the weight prediction network according to the fusion predicted value and the training label until the multi-task model and the weight prediction network converge; and adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to the online feedback data collected from the user online feedback data system and the fusion predicted value to obtain a recommendation model, so as to recommend goods when the user browses the goods according to the recommendation model. According to the technical scheme, the prediction accuracy of the recommendation model can be improved.

Description

Training method and device for recommendation model
Technical Field
The disclosure relates to the field of computer technology, and in particular relates to a training method and device for a recommendation model.
Background
There are many recommendation scenarios in e-commerce, such as home page recommendation, commodity detail page recommendation, and shopping cart recommendation. A recommendation algorithm and recommendation system can continuously improve the recommendation effect through various technical means, enhance user experience and platform revenue, maximize the benefits of both parties, and achieve efficient connection between users, commodities, and the platform.
Recommendation systems play an indispensable role in daily life and can be found in online shopping, news reading, and video watching. CTR (Click-Through Rate) prediction, i.e., user click prediction, is a key task in a recommendation system that estimates the probability that a user clicks on an item. As a key part of the ranking stage of the recommendation system, CTR prediction models and represents user features and commodity features so that the items a user is most likely to click are pushed to the user first, improving user satisfaction and the efficiency of the whole recommendation system.
In the process of training the recommendation model, multiple business targets such as click-through rate and conversion rate may need to be optimized simultaneously. For multi-objective score fusion, multiple objective scores may be added directly, or combined by exponential weighting. However, most of these score fusion methods rely on manual experience: they are costly, hard to tune, and poorly adaptable, and must be reset manually whenever tasks change, which makes the recommendation model difficult to train.
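Purely as an illustration of the manually tuned fusion schemes mentioned above, the following Python snippet shows direct addition and exponential weighting of two objective scores; the score values and exponents are made up and would normally have to be set and re-tuned by hand.

```python
# Hypothetical per-item predicted scores for two objectives.
ctr_score, cvr_score = 0.12, 0.04

# Direct addition of the objective scores.
additive_fusion = ctr_score + cvr_score

# Exponential weighting: the exponents come from manual experience
# and must be re-tuned whenever tasks or traffic change.
exponential_fusion = (ctr_score ** 1.0) * (cvr_score ** 0.5)
```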
In addition, in a real online shopping scenario, user interests change quickly, which lowers the prediction accuracy of the recommendation model.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a training method, device, electronic device, and readable storage medium for a recommendation model, so as to solve the technical problem in the prior art that the prediction accuracy of the recommendation model is low.
In a first aspect of an embodiment of the present disclosure, a training method for a recommendation model is provided, where the method includes: respectively inputting feature data of training data into a multi-task model and a weight prediction network to correspondingly obtain multi-target predicted values and weight parameter vectors, wherein the training data comprises data generated by browsing commodities by a user; obtaining dot products of the multi-target predicted values and the weight parameter vectors to serve as fusion predicted values; adjusting the multi-task model according to the multi-target predicted value and the training label of the training data, and adjusting the weight prediction network according to the fusion predicted value and the training label until the multi-task model and the weight prediction network converge, wherein the training label comprises candidate commodity data when a user browses commodities; and adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to the online feedback data collected from the user online feedback data system and the fusion predicted value to obtain a recommendation model, so as to recommend goods when the user browses the goods according to the recommendation model.
In a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a recommendation model, the apparatus including: the input module is used for inputting the characteristic data of the training data into the multi-task model and the weight prediction network respectively, correspondingly obtaining a multi-target predicted value and a weight parameter vector, wherein the training data comprises data generated by browsing commodities by a user; the acquisition module is used for acquiring dot products of the multi-target predicted values and the weight parameter vectors to serve as fusion predicted values; the first adjusting module is used for adjusting the multi-task model according to the multi-target predicted value and the training label of the training data, adjusting the weight predicting network according to the fusion predicted value and the training label until the multi-task model and the weight predicting network converge, wherein the training label comprises candidate commodity data when a user browses commodities; and the second adjustment module is used for adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to the online feedback data and the fusion predicted value collected from the user online feedback data system to obtain a recommendation model so as to recommend goods when the user browses the goods according to the recommendation model.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiment of the disclosure has the beneficial effects that: according to the technical scheme, the multi-task model and the weight prediction network are trained respectively, and the multi-task model and the weight prediction network are adjusted by combining the online feedback data, so that more matched weights can be set for different tasks, the training difficulty of the recommendation model is reduced, and the accuracy of recommendation of the recommendation model for multi-target tasks is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of a training method of a recommendation model according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a training process for a recommendation model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a training device for a recommendation model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
In user click prediction, the discrete features of the user and the commodity can be normalized through one-hot coding, or numerical features can be discretized through a bucketing (binning) technique, and the resulting discrete features are then input into the recommendation model of the recommendation system so that the recommendation model can be trained. After the discrete features of the user and the commodity have been one-hot encoded or the numerical features have been bucketed, an LR (Logistic Regression) model, an FM (Factorization Machine) model, a DeepFM (Deep Factorization Machine) model, a DIN (Deep Interest Network) model, and the like can be used for modeling to generate a user sequence representation.
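As an illustration of the preprocessing described above, the following is a minimal numpy sketch of one-hot encoding and bucketing; the feature values, vocabulary size, and bucket boundaries are hypothetical.

```python
import numpy as np

# Hypothetical discrete feature: a category id with a vocabulary of 5 values.
category_id = np.array([0, 3, 1])
one_hot = np.eye(5)[category_id]                 # one-hot encoding, shape (3, 5)

# Hypothetical numerical feature: item price, discretized by bucketing.
price = np.array([3.5, 42.0, 180.0])
bucket_edges = np.array([10.0, 50.0, 100.0])     # assumed bucket boundaries
price_bucket = np.digitize(price, bucket_edges)  # bucket ids: [0, 1, 3]

# The resulting discrete ids (or their one-hot forms) are what would be fed
# into the embedding layer of the recommendation model.
print(one_hot)
print(price_bucket)
```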
The LR model is a shallow model and is easy to train. The FM model can realize feature crossing of low-dimensional discrete features through embedding. The DeepFM model combines the FM model and a deep network in a two-tower fashion to perform feature crossing and generate high-dimensional features for modeling. The DIN model can model the user behavior sequence and express the user's long-term and short-term interests.
In multi-objective fusion ranking, using fixed fusion parameters is problematic. Users have different demands in different scenarios: a user may have a clear shopping intention or may merely be browsing, so fixed fusion parameters cannot optimize the overall effect of the model. Therefore, in the process of training the recommendation model, the scores of multiple objectives need to be fused; the score fusion steps are complicated and poorly adaptable, the training of the recommendation model is difficult, and the accuracy of the recommendation results can be affected.
To solve the above problems, the embodiments of the present disclosure provide a training scheme for a recommendation model that automatically fuses the different target scores of a multi-task recommendation scene and automatically outputs the fusion weights of the different targets according to the user state, so as to solve the problem of online score fusion during training of the multi-task model and thereby efficiently improve the multiple objectives to be optimized together.
The following describes in detail a training method and apparatus of a recommendation model according to an embodiment of the present disclosure with reference to the accompanying drawings.
Fig. 1 is a flowchart of a training method of a recommendation model according to an embodiment of the disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device, such as a terminal or server, having computer processing capabilities. As shown in fig. 1, the training method of the recommendation model includes:
step S101, characteristic data of training data are respectively input into a multi-task model and a weight prediction network, a multi-target predicted value and a weight parameter vector are correspondingly obtained, and the training data comprise data generated when a user browses commodities.
Step S102, obtaining dot products of the multi-target predicted values and the weight parameter vectors as fusion predicted values.
And step S103, adjusting the multi-task model according to the multi-target predicted value and the training label of the training data, and adjusting the weight prediction network according to the fusion predicted value and the training label until the multi-task model and the weight prediction network converge, wherein the training label comprises candidate commodity data when a user browses commodities.
And step S104, adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to the online feedback data and the fusion predicted value collected from the user online feedback data system to obtain a recommendation model, so as to recommend goods when the user browses the goods according to the recommendation model.
Prior to step S104, online feedback data is acquired.
Specifically, the recommendation model is a network structure composed of a multi-task model and a weight prediction network. The multi-task model is the main network and the weight prediction network is an auxiliary network; compared with a traditional multi-task learning network, the added weight prediction network can effectively improve the accuracy of multi-target tasks. In addition, in the technical scheme of the embodiments of the disclosure, the user online feedback data system records behaviors that express like or dislike, such as the user's real-time clicks and dislikes, reinforcement learning is performed through a real-time reinforcement learning module, and the parameters of the multi-task architecture and the score fusion weights are adjusted in real time, which improves the update speed of the model, effectively captures the user's real-time preferences, and improves online recommendation quality.
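The disclosure does not specify which reinforcement learning algorithm is used. Purely as an illustrative sketch, the PyTorch snippet below applies a reward-weighted, policy-gradient-style update to the weight prediction network, mapping online clicks/dislikes to a scalar reward; the function name, the reward mapping, and the decision to update only the gate are assumptions rather than the disclosed method.

```python
import torch

def reinforce_update(gate, optimizer, features, scores, feedback):
    """One hypothetical reinforcement-learning step on the weight prediction network.

    features: bottom-layer features for the impressions shown online.
    scores:   multi-target predicted values from the main network.
    feedback: +1 for a click/like, -1 for a dislike, 0 for no interaction.
    """
    weights = gate(features)                      # weight parameter vector ω (softmax-normalized)
    fused = (weights * scores.detach()).sum(dim=-1)  # fusion predicted value
    reward = feedback.float()                     # assumed reward mapping
    # Reward-weighted objective: raise the fused score of positively rewarded
    # items and lower it for negatively rewarded ones.
    loss = -(reward * torch.log(torch.sigmoid(fused))).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```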
The embodiments of the disclosure provide a technical scheme for automatically fusing the different target scores of a multi-task recommendation scene. The scheme relies entirely on score weights output end-to-end by the model rather than on manual design, effectively improves the accuracy of recommendation tasks in multi-task scenes, and can adjust the parameters of the multi-task architecture and the score fusion weights according to user online feedback data.
In an embodiment of the present disclosure, the weight prediction network is any one of the following: a multilayer perceptron, a convolutional neural network, or a recurrent neural network.
The multilayer perceptron (Multilayer Perceptron, MLP), also referred to as an artificial neural network (Artificial Neural Network, ANN), is a feed-forward neural network consisting of an input layer, a hidden layer, and an output layer. The input layer receives the input data, the hidden layer is responsible for processing the data, and the output layer outputs the processed result. A convolutional neural network (Convolutional Neural Network, CNN) is a feed-forward neural network with a deep structure that performs convolution operations, and is one of the representative algorithms of deep learning. A convolutional neural network has feature-learning capability and can perform translation-invariant classification of input information according to its hierarchical structure. A recurrent neural network (Recurrent Neural Network, RNN) is a neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all of its recurrent units in a chain.
Prior to step S104, online feedback data may be acquired. The online feedback data includes real-time feedback data recording behaviors that express like or dislike, such as the user's real-time clicks and dislikes.
In the embodiment of the present disclosure, the multi-task model may be an MMoE (Multi-gate Mixture-of-Experts) model or a PLE (Progressive Layered Extraction) model, but is not limited thereto.
As shown in fig. 2, in the embodiment of the present disclosure, the main network of the training architecture of the recommendation model, which performs task A and task B, is a conventional MMoE structure. The auxiliary network of the training architecture, which performs task C, is a weight prediction network that learns how to combine different targets through a Gate structure. The input of the weight prediction network is the same bottom-layer features as the main network, and the output is a multi-target weight parameter vector, referred to simply as the weight parameter vector, which is finally dot-multiplied with the predicted values of the different targets of the main network to serve as the final ranking value.
The dot product, also called the scalar product (quantity product), is the sum of the products of the corresponding entries of two sequences of numbers.
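As a concrete reading of the description above, here is a minimal PyTorch sketch of a weight prediction network and the dot-product fusion; the hidden layer size, the two-task setting, and the batch shapes are assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class GateX(nn.Module):
    """Weight prediction network: shared bottom features in, softmax weights out."""
    def __init__(self, feature_dim: int, num_tasks: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_tasks),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.mlp(x), dim=-1)    # weight parameter vector ω

# Fusion: dot product of the multi-target predictions with ω.
features = torch.randn(8, 32)                        # batch of shared bottom features
scores = torch.rand(8, 2)                            # e.g. ScoreA (CTR) and ScoreB (CVR)
gate = GateX(feature_dim=32, num_tasks=2)
omega = gate(features)                               # shape (8, 2), rows sum to 1
fused_score = (omega * scores).sum(dim=-1)           # final ranking value per sample
```

Because the gate output passes through a softmax, the fused score is a convex combination of the per-task scores.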
Further, the main network optimizes the overall loss function mainly by optimizing parameters of different learning objectives, while the weight prediction network mainly learns how to combine different objectives to optimize the model in the overall loss direction. The loss function of the weight prediction network is similar to that of the primary network.
In the disclosed embodiment, the loss function L_mmoe of the primary network is shown in the following formula (1):
L_mmoe = Σ_{i=1}^{n} β_i·L(y_i, ŷ_i)   (1)
wherein y_i is the multi-target predicted value, ŷ_i is the training label, β_i is the weight of the different tasks, i is the serial number of the different tasks, and n is the total number of tasks. L(·) is a function for evaluating data similarity. During training of the main network, the individual objectives can be optimized according to the loss function L_mmoe so that the main network is optimized in the overall loss direction.
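Building on the reconstructed formula (1), the sketch below computes the main-network loss as a weighted sum of per-task losses; binary cross-entropy as the similarity function L(·) and equal task weights β_i are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mmoe_loss(task_preds, task_labels, betas):
    """Weighted sum of per-task losses, in the spirit of formula (1).

    task_preds:  list of n tensors with predicted probabilities y_i.
    task_labels: list of n tensors with training labels.
    betas:       list of n task weights β_i.
    """
    total = 0.0
    for y_i, label_i, beta_i in zip(task_preds, task_labels, betas):
        total = total + beta_i * F.binary_cross_entropy(y_i, label_i)
    return total

# Example with two tasks (e.g. click and conversion) and equal weights.
preds = [torch.rand(8), torch.rand(8)]
labels = [torch.randint(0, 2, (8,)).float(), torch.randint(0, 2, (8,)).float()]
loss = mmoe_loss(preds, labels, betas=[1.0, 1.0])
```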
In step S103, when the weight prediction network is adjusted according to the fusion predicted value and the training label, the weight prediction network can be adjusted according to the loss function L_x. The loss function L_x of the weight prediction network is shown in the following formulas (2), (3) and (4):
L_x = Σ_{i=1}^{n} α_i·L(y', ŷ_i)   (2)
y' = ω·y = Σ_{i=1}^{n} ω_i·y_i   (3)
ω = softmax(GateX(X))   (4)
wherein X is the feature data, ω is the weight parameter vector, y_i is the multi-target predicted value, y' is the fusion predicted value, ŷ_i is the training label, α_i is the weight of the different tasks, i is the serial number of the different tasks, and n is the total number of tasks. The training process of the weight prediction network GateX optimizes the way in which the different targets are combined according to the loss function L_x, so that the weight prediction network is optimal in the overall loss direction.
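Similarly, a sketch of the weight-network loss in formulas (2)-(4), where the gate is assumed to output the already-normalized weight vector ω (as in the GateX sketch above) and binary cross-entropy with uniform α_i again stands in for L(·):

```python
import torch
import torch.nn.functional as F

def gate_loss(gate, features, scores, task_labels, alphas):
    """Loss of the weight prediction network, in the spirit of formulas (2)-(4)."""
    omega = gate(features)                      # formula (4): ω = softmax(GateX(X))
    fused = (omega * scores).sum(dim=-1)        # formula (3): y' = ω · y
    fused = fused.clamp(1e-6, 1 - 1e-6)         # keep probabilities strictly in (0, 1)
    total = 0.0
    for label_i, alpha_i in zip(task_labels, alphas):
        total = total + alpha_i * F.binary_cross_entropy(fused, label_i)   # formula (2)
    return total
```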
In step S101, the feature data of each batch of training data may be input to the multi-task model in turn to train the multi-task model, and the training data of one batch out of every first number of batches is input to the weight prediction network to train the weight prediction network.
Specifically, the first number may take a value of 10, but is not limited thereto; for example, the first number may also take a value of 8 or 12. In embodiments of the present disclosure, the training data may be divided into a second number of batches; for example, the second number may take a value of 800, i.e., the training data is divided into 800 batches. When the first number is 10, the training data of one batch out of every 10 batches is used to train the weight prediction network. With this method of alternately training the main network and the weight prediction network, the training data of the 1st, 11th, 21st and 31st batches, and other batches selected according to this rule, can be selected to train the weight prediction network. When training the main network, every batch of training data is used.
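A minimal sketch of this alternating schedule, assuming the interval of 10 batches given above and hypothetical train_main_step / train_gate_step helpers:

```python
def alternate_training(batches, train_main_step, train_gate_step, interval=10):
    """Train the main network on every batch; train the weight prediction
    network only on batches 1, 11, 21, ... (one batch out of every `interval`)."""
    for batch_idx, batch in enumerate(batches, start=1):
        train_main_step(batch)                  # main MMoE network: every batch
        if (batch_idx - 1) % interval == 0:     # batches 1, 11, 21, 31, ...
            train_gate_step(batch)              # weight prediction network
```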
In step S103, when the weight prediction network is adjusted according to the fusion prediction value and the training label, the gradient may be blocked in a back propagation phase of training of the weight prediction network.
Specifically, when the gradient blocking technique is adopted, the gradient of the weight prediction network is not back-propagated to the bottom-layer user behavior sequence modeling and embedding layers, and the weight prediction network can be trained independently of the main network, so that the training of the main network is not affected by the training of the weight prediction network.
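In PyTorch terms, one common way to realize such gradient blocking is to detach the shared bottom representation before it enters the weight prediction network; the tensor names below are illustrative.

```python
import torch

# shared_repr: output of the bottom embedding / user-sequence modeling layers.
shared_repr = torch.randn(8, 32, requires_grad=True)

# The main network sees the representation normally, so its gradients flow back.
main_input = shared_repr

# The weight prediction network sees a detached copy: its loss will not
# propagate gradients into the bottom layers shared with the main network.
gate_input = shared_repr.detach()
```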
In embodiments of the present disclosure, the training data may include data generated by a user browsing commodities and candidate commodity data serving as training labels. The training data is input into an embedding layer for feature extraction to obtain the user embedding. The user embedding constitutes all or part of the feature data of the training data. Further, the feature data of the training data may also include dense features and item embeddings.
In the disclosed embodiment, when the loss function L_mmoe is employed to train the initial multi-task model, the loss function value is calculated according to the multi-target predicted value y_i and the training label ŷ_i. When the multi-task model converges, the final training result, namely the recommendation model, can be obtained. When the loss function is employed to train the initial multi-task model, the convergence condition of the multi-task model may be that the value of the loss function no longer improves or that the number of iterations reaches a certain number.
Specifically, when the loss function value of the multi-task model no longer improves, training can be stopped, the loss function value at that moment is recorded, and the network parameters of the multi-task model are adjusted according to the loss function value, thereby completing one parameter adjustment pass of the multi-task model. In the actual training process, iterative parameter adjustment training is carried out multiple times according to the training data until the multi-task model converges and the final recommendation model is obtained.
In the disclosed embodiment, when the loss function L_x is employed to train the initial weight prediction network, the loss function value is calculated according to the multi-target predicted value y_i, the weight parameter vector ω and the training label ŷ_i, and the final training result is obtained when the weight prediction network converges. When the loss function is employed to train the weight prediction network, the convergence condition of the weight prediction network may be that the value of the loss function no longer improves or that the number of iterations reaches a certain number.
Specifically, when the loss function value of the weight prediction network no longer improves, training can be stopped, the loss function value at that moment is recorded, and the network parameters of the weight prediction network are adjusted according to the loss function value, thereby completing one parameter adjustment pass of the weight prediction network. In the actual training process, iterative parameter adjustment training is carried out multiple times according to the training data until the weight prediction network converges.
The recommendation model may learn the user's click-through rate and conversion rate. In the process of user behavior sequence modeling and embedding, as shown in fig. 2, the user behavior data can be characterized by an embedding layer, and a deep sequence representation of the user is then learned by DIN and input into the multi-task model. The learning tasks of the main network and the weight prediction task of the weight prediction network each go through an MLP tower structure, and the weight prediction task learns how to combine the different targets. The input of the weight prediction network is the same bottom-layer shared features as the main network, and the output is the multi-target weight parameter vector components ω1 and ω2, which are finally dot-multiplied with the predicted values ScoreA and ScoreB of the different targets of the main network to serve as the final ranking value; items are then pushed to the user according to the scoring result.
As shown in fig. 2, the feedback system, i.e., the user online feedback data system, may collect online feedback data to adjust the multitasking model and the weight prediction network.
Feature data of training data may include user embeddings, dense features, and item embeddings. And inputting the characteristic data of the training data into an expert network (expert) structure and a Gate structure of a main network for executing the task A and the task B, and combining the output data output by the expert network and the Gate structure to obtain the predicted values of different tasks.
The characteristic data of the training data is input into the Gate structure of the weight prediction network for executing the task C, so that a multi-target weight parameter vector can be obtained, and the multi-target weight parameter vector and the predicted values of different targets of the main network are subjected to dot product, so that a final sorting value can be obtained.
In the MMoE model, each expert network is a multi-layer neural network, and each expert outputs a vector after forward propagation. The Gate structure is a shallow neural network that can be understood as an attention model: it outputs multiple scalars that assign a weight parameter to each expert network, and the sum of all weight parameters equals 1. Each task has its own Gate network and learns its own weights over the experts, which essentially learns per-task sample weights. Each task obtains the weight of each expert through its Gate; during back-propagation, the parameter updates of the different experts are directly scaled by these weights, which is equivalent to applying different weights to the samples of different tasks at the expert layer.
Furthermore, each Expert is a neural network; the number of experts is chosen by weighing training and inference performance and may be equal to the number of tasks. Each Gate ends with a Softmax function; the number of gates equals the number of tasks, and the number of outputs of each Gate equals the number of experts. The Softmax function is typically used to convert an arbitrary set of real numbers into values that represent a probability distribution; it is essentially a normalization function that converts arbitrary real values into probabilities between 0 and 1.
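To make the Expert/Gate interaction above concrete, the following is a compact MMoE sketch in PyTorch; the numbers of experts and tasks and the layer sizes are illustrative, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, feature_dim=32, expert_dim=16, num_experts=2, num_tasks=2):
        super().__init__()
        # Each expert is a small neural network producing a vector.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feature_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        )
        # One gate per task; each gate outputs one weight per expert (softmax).
        self.gates = nn.ModuleList(
            nn.Linear(feature_dim, num_experts) for _ in range(num_tasks)
        )
        # One prediction tower per task.
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(expert_dim, 1), nn.Sigmoid())
            for _ in range(num_tasks)
        )

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        task_scores = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1), sums to 1
            mixed = (w * expert_out).sum(dim=1)                        # weighted sum of experts
            task_scores.append(tower(mixed).squeeze(-1))               # per-task score
        return torch.stack(task_scores, dim=-1)                        # (B, num_tasks)
```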
In the technical scheme of the embodiment of the disclosure, a method for automatically fusing and recommending different target scores of a multi-task scene is provided, and the method can automatically output the fusion weights of different tasks according to the user state, has wide model architecture adaptability, can be suitable for different types of multi-tasks, and can effectively improve the accuracy of the multi-target tasks by an added auxiliary network.
According to the training method of the recommendation model, the multi-task model and the weight prediction network are trained respectively, and the multi-task model and the weight prediction network are adjusted by combining the online feedback data, so that more matched weights can be set for different tasks, the training difficulty of the recommendation model is reduced, and the accuracy of the recommendation model for recommending the multi-target tasks is improved.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. The training device of the recommendation model described below and the training method of the recommendation model described above can be referred to correspondingly to each other. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a training device for a recommendation model according to an embodiment of the disclosure. As shown in fig. 3, the training device of the recommendation model includes:
the input module 301 is configured to input feature data of training data to the multitasking model and the weight prediction network, respectively, to obtain a multi-objective predicted value and a weight parameter vector correspondingly, where the training data includes data generated when a user browses goods.
The obtaining module 302 is configured to obtain a dot product of the multi-target predicted value and the weight parameter vector as a fusion predicted value.
The first adjustment module 303 is configured to adjust the multitasking model according to the multitasking predicted value and a training tag of the training data, and adjust the weight predicting network according to the fusion predicted value and the training tag until the multitasking model and the weight predicting network converge, where the training tag includes candidate commodity data when the user browses the commodity.
The second adjustment module 304 is configured to adjust the multitask model and the weight prediction network by using a reinforcement learning method according to online feedback data and a fusion prediction value collected from the online feedback data system of the user, so as to obtain a recommendation model, so as to recommend goods when the user browses the goods according to the recommendation model.
The embodiments of the disclosure provide a technical scheme for automatically fusing the different target scores of a multi-task recommendation scene. The scheme relies entirely on score weights output end-to-end by the model rather than on manual design, effectively improves the accuracy of recommendation tasks in multi-task scenes, and can adjust the parameters of the multi-task architecture and the score fusion weights according to user online feedback data.
In an embodiment of the present disclosure, the weight prediction network is any one of the following: multilayer perceptron, convolutional neural network and cyclic neural network.
Before the second adjustment module 304 adjusts the multi-task model and the weight prediction network by the reinforcement learning method according to the online feedback data collected from the user online feedback data system and the fusion predicted value, online feedback data needs to be acquired. The online feedback data includes real-time feedback data recording behaviors that express like or dislike, such as the user's real-time clicks and dislikes.
In the embodiment of the present disclosure, the multitasking model may be an MMoE model or a PLE model, and is not limited thereto.
Further, the main network optimizes the overall loss function mainly by optimizing parameters of different learning objectives, while the weight prediction network mainly learns how to combine different objectives to optimize the model in the overall loss direction. The loss function of the weight prediction network is similar to that of the primary network.
In the disclosed embodiment, the loss function L_mmoe of the primary network is shown in the following formula (1):
L_mmoe = Σ_{i=1}^{n} β_i·L(y_i, ŷ_i)   (1)
wherein y_i is the multi-target predicted value, ŷ_i is the training label, β_i is the weight of the different tasks, i is the serial number of the different tasks, and n is the total number of tasks. L(·) is a function for evaluating data similarity. During training of the main network, the individual objectives can be optimized according to the loss function L_mmoe so that the main network is optimized in the overall loss direction.
The first adjustment module 303 may adjust the weight prediction network according to the multi-target predicted value, the training label, and the weight parameter vector by means of the loss function L_x. The loss function L_x of the weight prediction network is shown in the following formulas (2), (3) and (4):
L_x = Σ_{i=1}^{n} α_i·L(y', ŷ_i)   (2)
y' = ω·y = Σ_{i=1}^{n} ω_i·y_i   (3)
ω = softmax(GateX(X))   (4)
wherein X is the feature data, ω is the weight parameter vector, y_i is the multi-target predicted value, y' is the fusion predicted value, ŷ_i is the training label, α_i is the weight of the different tasks, i is the serial number of the different tasks, and n is the total number of tasks. The training process of the weight prediction network GateX optimizes the way in which the different targets are combined according to the loss function L_x, so that the weight prediction network is optimal in the overall loss direction.
The input module 301 may input the feature data of each batch of training data to the multi-task model in turn to train the multi-task model, and input the training data of one batch out of every first number of batches to the weight prediction network to train the weight prediction network.
Specifically, the first number may take a value of 10, but is not limited thereto; for example, the first number may also take a value of 8 or 12. In embodiments of the present disclosure, the training data may be divided into a second number of batches; for example, the second number may take a value of 800, i.e., the training data is divided into 800 batches. When the first number is 10, the training data of one batch out of every 10 batches is used to train the weight prediction network. With this method of alternately training the main network and the weight prediction network, the training data of the 1st, 11th, 21st and 31st batches, and other batches selected according to this rule, can be selected to train the weight prediction network. When training the main network, every batch of training data is used.
The first adjustment module 303 may block the gradient during a back propagation phase of training of the weight prediction network when adjusting the weight prediction network according to the fused predicted value and the training tag.
Specifically, when the gradient blocking technique is adopted, the gradient of the weight prediction network is not back-propagated to the bottom-layer user behavior sequence modeling and embedding layers, and the weight prediction network can be trained independently of the main network, so that the training of the main network is not affected by the training of the weight prediction network.
In embodiments of the present disclosure, the training data may include data generated by a user browsing commodities and candidate commodity data serving as training labels. The training data is input into an embedding layer for feature extraction to obtain the user embedding. The user embedding constitutes all or part of the feature data of the training data. Further, the feature data of the training data may also include dense features and item embeddings.
In the disclosed embodiment, when the loss function L_mmoe is employed to train the initial multi-task model, the loss function value can be calculated according to the multi-target predicted value y_i and the training label ŷ_i. When the multi-task model converges, the final training result, namely the recommendation model, can be obtained. When the loss function is employed to train the initial multi-task model, the convergence condition of the multi-task model may be that the value of the loss function no longer improves or that the number of iterations reaches a certain number.
Specifically, when the loss function value of the multi-task model no longer improves, training can be stopped, the loss function value at that moment is recorded, and the network parameters of the multi-task model are adjusted according to the loss function value, thereby completing one parameter adjustment pass of the multi-task model. In the actual training process, iterative parameter adjustment training is carried out multiple times according to the training data until the multi-task model converges and the final recommendation model is obtained.
In the disclosed embodiment, when the loss function L_x is adopted to train the initial weight prediction network, the loss function value is calculated according to the multi-target predicted value y_i, the weight parameter vector ω and the training label ŷ_i, and the final training result is obtained when the weight prediction network converges. When the loss function is adopted to train the weight prediction network, the convergence condition of the weight prediction network may be that the value of the loss function no longer improves or that the number of iterations reaches a certain number.
Specifically, when the loss function value of the weight prediction network no longer improves, training can be stopped, the loss function value at that moment is recorded, and the network parameters of the weight prediction network are adjusted according to the loss function value, thereby completing one parameter adjustment pass of the weight prediction network. In the actual training process, iterative parameter adjustment training is carried out multiple times according to the training data until the weight prediction network converges.
Since each functional module of the training apparatus for a recommendation model in the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the training method for a recommendation model described above, for details not disclosed in the embodiment of the apparatus of the present disclosure, please refer to the embodiment of the training method for a recommendation model described above in the present disclosure.
According to the training device for the recommendation model, the multitasking model and the weight predicting network are trained respectively, and the multitasking model and the weight predicting network are adjusted by combining the online feedback data, so that more matched weights can be set for different tasks, the training difficulty of the recommendation model is reduced, and the accuracy of the recommendation model for recommending the multi-target tasks is improved.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401 may execute the computer program 403 to implement the functions of the modules in the above-described device embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules may be stored in a readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. Based on such understanding, the present disclosure may implement all or part of the flow of the method of the above-described embodiments, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a readable storage medium, where the computer program may implement the steps of the method embodiments described above when executed by a processor. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of the jurisdiction's jurisdiction and the patent practice, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to the jurisdiction and the patent practice.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A method for training a recommendation model, the method comprising:
respectively inputting feature data of training data into a multi-task model and a weight prediction network to correspondingly obtain multi-target predicted values and weight parameter vectors, wherein the training data comprises data generated by browsing commodities by a user;
obtaining dot products of the multi-target predicted values and the weight parameter vectors to serve as fusion predicted values;
the multi-task model is adjusted according to the multi-target predicted value and the training label of the training data, the weight prediction network is adjusted according to the fusion predicted value and the training label until the multi-task model and the weight prediction network converge, and the training label comprises candidate commodity data when a user browses commodities;
and adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to online feedback data collected from a user online feedback data system and the fusion predicted value to obtain the recommendation model so as to recommend goods when the user browses the goods according to the recommendation model.
2. The method of claim 1, wherein prior to adapting the multitasking model and the weight prediction network using a reinforcement learning method based on online feedback data collected from a user online feedback data system and the fused predictors, the method further comprises:
and acquiring the online feedback data.
3. The method of claim 1, wherein inputting the characteristic data of the training data into the multitasking model and the weight prediction network, respectively, comprises:
feature data of each batch of training data are sequentially input into the multi-task model to train the multi-task model;
training data for one batch every first number of batches is input to the weight prediction network to train the weight prediction network.
4. The method of claim 1, wherein adjusting the weight prediction network based on the fusion predicted value and the training tag comprises:
the gradient is blocked during a back propagation phase of training of the weight prediction network.
5. The method of claim 2, wherein adjusting the weight prediction network based on the fusion predicted value and the training tag comprises:
according to the following loss function L x Adjusting the weight prediction network:
wherein y' is a fusion predicted value,to train the tag, alpha i For the weights of different tasks, i is the serial number of the different tasks, and n is the total number of the tasks.
6. The method of claim 1, wherein the weight prediction network comprises any one of: multilayer perceptron, convolutional neural network and cyclic neural network.
7. The method of claim 1, wherein the multi-task model comprises a multi-gate mixture-of-experts model or a progressive layered extraction model.
8. A training device for a recommendation model, the device comprising:
the input module is used for inputting the characteristic data of the training data into the multi-task model and the weight prediction network respectively to correspondingly obtain a multi-target predicted value and a weight parameter vector, and the training data comprises data generated by browsing commodities by a user;
the acquisition module is used for acquiring dot products of the multi-target predicted values and the weight parameter vectors to serve as fusion predicted values;
the first adjustment module is used for adjusting the multi-task model according to the multi-target predicted value and the training label of the training data, adjusting the weight prediction network according to the fusion predicted value and the training label until the multi-task model and the weight prediction network converge, and the training label comprises candidate commodity data when a user browses commodities;
and the second adjustment module is used for adjusting the multitask model and the weight prediction network by adopting a reinforcement learning method according to the online feedback data collected from the user online feedback data system and the fusion predicted value to obtain the recommendation model so as to recommend the commodity when the user browses the commodity according to the recommendation model.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202311475332.0A 2023-11-07 2023-11-07 Training method and device for recommendation model Pending CN117670470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311475332.0A CN117670470A (en) 2023-11-07 2023-11-07 Training method and device for recommendation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311475332.0A CN117670470A (en) 2023-11-07 2023-11-07 Training method and device for recommendation model

Publications (1)

Publication Number Publication Date
CN117670470A true CN117670470A (en) 2024-03-08

Family

ID=90085349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311475332.0A Pending CN117670470A (en) 2023-11-07 2023-11-07 Training method and device for recommendation model

Country Status (1)

Country Link
CN (1) CN117670470A (en)

Similar Documents

Publication Publication Date Title
CN111681059B (en) Training method and device of behavior prediction model
CN110969516B (en) Commodity recommendation method and device
CN110717098B (en) Meta-path-based context-aware user modeling method and sequence recommendation method
CN110781409B (en) Article recommendation method based on collaborative filtering
EP4181026A1 (en) Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium
CN110717099B (en) Method and terminal for recommending film
CN111241394B (en) Data processing method, data processing device, computer readable storage medium and electronic equipment
CN111950593A (en) Method and device for recommending model training
CN111506820A (en) Recommendation model, method, device, equipment and storage medium
CN113515690A (en) Training method of content recall model, content recall method, device and equipment
Zhang et al. Recurrent convolutional neural network for session-based recommendation
Wang et al. Modeling uncertainty to improve personalized recommendations via Bayesian deep learning
Chen et al. Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network
CN113590976A (en) Recommendation method of space self-adaptive graph convolution network
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
CN115809374B (en) Method, system, device and storage medium for correcting mainstream deviation of recommendation system
CN112380427A (en) User interest prediction method based on iterative graph attention network and electronic device
CN116910357A (en) Data processing method and related device
CN116484092A (en) Hierarchical attention network sequence recommendation method based on long-short-term preference of user
CN117670470A (en) Training method and device for recommendation model
Kumar et al. Session-based recommendations with sequential context using attention-driven LSTM
CN116911955B (en) Training method and device for target recommendation model
Yi et al. DMMP: A distillation-based multi-task multi-tower learning model for personalized recommendation
CN117786234B (en) Multimode resource recommendation method based on two-stage comparison learning
CN115470397B (en) Content recommendation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination