WO2020135642A1 - A model training method and device based on a generative adversarial network - Google Patents

A model training method and device based on a generative adversarial network

Info

Publication number
WO2020135642A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2019/128917
Other languages
English (en)
French (fr)
Inventor
刘志容
董振华
张宇宙
刘明瑞
郭贵斌
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2020135642A1

Classifications

    • G06F 18/00 Pattern recognition (G Physics; G06 Computing; Calculating or counting; G06F Electric digital data processing)
    • G06N 20/00 Machine learning (G06N Computing arrangements based on specific computational models)
    • G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/08 Learning methods
    • G06Q 30/00 Commerce (G06Q Information and communication technology [ICT] specially adapted for administrative, commercial, financial, managerial or supervisory purposes)

Definitions

  • The present application relates to the field of big data, and in particular, to a model training method and device based on a generative adversarial network.
  • The information retrieval generative adversarial network (Information Retrieval GAN, IRGAN) is a model that applies the generative adversarial network (Generative Adversarial Net, GAN) model to the field of item recommendation; it trains input item data to obtain a generative model and a discriminative model.
  • The generative model is responsible for generating counterfeit items that resemble real items, and the discriminative model is responsible for distinguishing the generated counterfeit items from real samples.
  • The training of the generative model and the discriminative model depend on each other. In the item recommendation scenario, the generative model is used to generate counterfeit items and item scores, and the items are then sorted by score to obtain the recommendation result.
  • Common training methods of IRGAN include the point-wise method and the pair-wise method.
  • The point-wise method converts the recommendation problem into a classification or regression problem: assuming that the user's preference for each item is independent, features are extracted from the items the user may like and used for training.
  • The pair-wise method converts the recommendation problem into a binary classification problem.
  • During model training, pair-wise no longer makes an independence assumption on items; instead, item pairs are used as the minimum unit of training.
  • Each item pair includes one item the user likes and one item the user does not like.
  • At present, the training effect of pair-wise is not as good as that of point-wise. How to optimize pair-wise so as to improve the generative ability of the generative model and the discriminative ability of the discriminative model in recommendation scenarios is a technical problem being studied by those skilled in the art.
  • The embodiments of the present application disclose a model training method and device based on a generative adversarial network, which can improve the generative ability of the generative model and the discriminative ability of the discriminative model.
  • In a first aspect, an embodiment of the present application provides a model training method based on a generative adversarial network.
  • The method includes:
  • The device generates positive-example counterfeit items and negative-example counterfeit items for a first user through a generative model, where the negative-example counterfeit items are generated based on the positive-example counterfeit items, the positive-example counterfeit items of the first user are predicted items that the first user pays attention to, and the negative-example counterfeit items of the first user are predicted items that the first user does not pay attention to. The device trains multiple real item pairs and multiple counterfeit item pairs to obtain a discriminative model, where the discriminative model is used to distinguish the difference between the multiple real item pairs and the multiple counterfeit item pairs; each real item pair includes one positive-example real item and one negative-example real item, and each counterfeit item pair includes one positive-example counterfeit item and one negative-example counterfeit item; the positive-example real items are items determined, according to the operation behavior of the first user, to be paid attention to by the first user, and the negative-example real items are items determined, according to the operation behavior of the first user, not to be paid attention to by the first user. The device updates the generative model according to a loss function of the discriminative model.
  • By performing the above method, the negative-example counterfeit items in the counterfeit item pairs are generated based on the positive-example counterfeit items, fully considering the potential relationship between the negative-example counterfeit items and the positive-example counterfeit items, so that the counterfeit item pairs carry a richer amount of information, the training effect is improved, and the generative ability of the generative model is enhanced. Therefore, the recommendation results produced by sorting the items generated by the generative model together with the existing real items are of greater reference value to the user.
  • In a first possible implementation of the first aspect, after the device updates the generative model according to the loss function of the discriminative model, the method further includes: the device generates scores of counterfeit items through the updated generative model, the counterfeit items including the positive-example counterfeit items and negative-example counterfeit items generated for the first user; the device sorts the real items and the counterfeit items according to the scores of the counterfeit items and the scores of the existing real items, and recommends items to the first user according to the order in the sorting. It can be understood that the recommendation results produced by sorting the items generated by the generative model together with the existing real items are of greater reference value to the user.
  • In a second possible implementation of the first aspect, after the device generates the positive-example counterfeit items and negative-example counterfeit items for the first user through the generative model, and before the device trains the multiple real item pairs and the multiple counterfeit item pairs to obtain the discriminative model, the method further includes: the device matches each of multiple first positive-example counterfeit items with one first negative-example counterfeit item to form the multiple counterfeit item pairs, where the first negative-example counterfeit item is one of the top-M-scored negative-example counterfeit items among the first user's negative-example counterfeit items, M is the number of first positive-example counterfeit items, and each first positive-example counterfeit item is a positive-example counterfeit item of the first user sampled from the positive-example counterfeit items generated by the generative model. In addition, the device matches each of multiple first positive-example real items with one first negative-example real item to form the multiple real item pairs, where the first negative-example real item is one of the top-N-scored negative-example real items among the first user's negative-example real items, N is the number of first positive-example real items, and each first positive-example real item is a positive-example real item sampled from the first user's existing positive-example real items.
  • In a third possible implementation of the first aspect, the initial generative model includes a positive-example generation model, a negative-example generation model, and a score generation model; the device generating the positive-example counterfeit items and negative-example counterfeit items for the first user through the generative model includes:
  • The device generates the distribution of the first user's positive-example counterfeit items through the positive-example generation model;
  • The device generates the distribution of the first user's negative-example counterfeit items through the negative-example generation model;
  • The device generates a score for each positive-example counterfeit item and a score for each negative-example counterfeit item through the score generator;
  • where g+(f+ | u) is the distribution of the positive-example counterfeit items, e_u is the embedding vector of the first user, e_i is the embedding of the i-th positive-example counterfeit item, b represents the bias value of the first user, and g-(f- | u, f+) is the distribution of the negative-example counterfeit items.
  • In a fourth possible implementation of the first aspect, the device updating the generative model according to the loss function of the discriminative model includes: the device determines the first user's attention index on items, where the first user's attention index on items is obtained by training the first user's real item scores and counterfeit item scores using an attention network; the device obtains a reward value reward according to the loss function of the discriminative model, and optimizes the reward value reward through the first user's attention index on items to obtain a new reward value; the device updates the generative model using the new reward value.
  • It can be understood that the importance of each item pair is different.
  • By introducing the attention network, the importance weight of each item pair is obtained, which can effectively select high-quality item pairs and reduce the negative impact of low-quality item pairs, making the resulting generative model and discriminative model more robust and adaptive.
  • The item pairs here can be real item pairs or counterfeit item pairs.
  • In a fifth possible implementation of the first aspect, the device determining the first user's attention index on items includes:
  • The device uses the attention network to calculate the first user's attention index on items according to the following formula:
  • α = softmax(g(r+, r-, f+, f- | u))
  • where α is the attention index of the first user u on items, w_u represents the trained weight of the first user, and b is the bias value of the first user.
  • In a second aspect, an embodiment of the present application provides a model training device based on a generative adversarial network.
  • The device includes:
  • a generative model, configured to generate positive-example counterfeit items and negative-example counterfeit items for a first user, where the negative-example counterfeit items are generated according to the positive-example counterfeit items, the positive-example counterfeit items of the first user are predicted items that the first user pays attention to, and the negative-example counterfeit items of the first user are predicted items that the first user does not pay attention to;
  • a training model, configured to train multiple real item pairs and multiple counterfeit item pairs to obtain a discriminative model, where the discriminative model is used to distinguish the difference between the multiple real item pairs and the multiple counterfeit item pairs;
  • each real item pair includes one positive-example real item and one negative-example real item, and each counterfeit item pair includes one positive-example counterfeit item and one negative-example counterfeit item;
  • the positive-example real items are items determined, according to the operation behavior of the first user, to be paid attention to by the first user, and the negative-example real items are items determined, according to the operation behavior of the first user, not to be paid attention to by the first user;
  • the training model is further configured to update the generative model according to the loss function of the discriminative model.
  • With this device, the negative-example counterfeit items in the counterfeit item pairs are generated based on the positive-example counterfeit items, fully considering the potential relationship between the negative-example counterfeit items and the positive-example counterfeit items, so that the counterfeit item pairs carry a richer amount of information, the training effect is improved, and the generative ability of the generative model is enhanced. Therefore, the recommendation results produced by sorting the items generated by the generative model together with the existing real items are of greater reference value to the user.
  • In a first possible implementation of the second aspect, the device further includes a recommendation model, where:
  • the updated generative model is configured to generate scores of counterfeit items, the counterfeit items including the positive-example counterfeit items and negative-example counterfeit items generated for the first user;
  • the recommendation model is configured to sort the real items and the counterfeit items according to the scores of the counterfeit items and the scores of the existing real items, and recommend items to the first user according to the order in the sorting.
  • In a second possible implementation of the second aspect, after the generative model generates the positive-example counterfeit items and negative-example counterfeit items for the first user, and before the training model trains the multiple real item pairs and the multiple counterfeit item pairs to obtain the discriminative model, the training model is further configured to:
  • match each of multiple first positive-example counterfeit items with one first negative-example counterfeit item to form the multiple counterfeit item pairs, where the first negative-example counterfeit item is one of the top-M-scored negative-example counterfeit items of the first user, M is the number of first positive-example counterfeit items, and each first positive-example counterfeit item is a positive-example counterfeit item of the first user sampled from the positive-example counterfeit items generated by the generative model;
  • match each of multiple first positive-example real items with one first negative-example real item to form the multiple real item pairs, where the first negative-example real item is one of the top-N-scored negative-example real items of the first user, N is the number of first positive-example real items, and each first positive-example real item is a positive-example real item sampled from the first user's existing positive-example real items.
  • In a third possible implementation of the second aspect, the initial generative model includes a positive-example generation model, a negative-example generation model, and a score generation model; the generative model being configured to generate positive-example counterfeit items and negative-example counterfeit items for the first user is specifically:
  • the positive-example generation model is configured to generate the distribution of the first user's positive-example counterfeit items;
  • the negative-example generation model is configured to generate the distribution of the first user's negative-example counterfeit items;
  • where g+(f+ | u) is the distribution of the positive-example counterfeit items, e_u is the embedding vector of the first user, e_i is the embedding of the i-th positive-example counterfeit item, b represents the bias value of the first user, and g-(f- | u, f+) is the distribution of the negative-example counterfeit items.
  • In a fourth possible implementation of the second aspect, updating the generative model according to the loss function of the discriminative model is specifically:
  • determining the first user's attention index on items, where the first user's attention index on items is obtained by training the first user's real item scores and counterfeit item scores using an attention network; obtaining a reward value reward according to the loss function of the discriminative model, and optimizing the reward value reward through the first user's attention index on items to obtain a new reward value;
  • updating the generative model with the new reward value.
  • It can be understood that the importance of each item pair is different.
  • By introducing the attention network, the importance weight of each item pair is obtained, which can effectively select high-quality item pairs and reduce the negative impact of low-quality item pairs, making the resulting generative model and discriminative model more robust and adaptive.
  • The item pairs here can be real item pairs or counterfeit item pairs.
  • In a fifth possible implementation of the second aspect, the training model determining the first user's attention index on items is specifically:
  • using the attention network to calculate the first user's attention index on items according to the following formula:
  • α = softmax(g(r+, r-, f+, f- | u))
  • where α is the attention index of the first user u on items, w_u represents the trained weight of the first user, and b is the bias value of the first user.
  • In a sixth possible implementation of the second aspect, optimizing the reward value reward through the first user's attention index on items to obtain a new reward value is specifically: optimizing the reward value reward through the first user's attention index α on items to obtain the reward value reward_1 corresponding to the first user, where reward_1 = α * reward; and determining a new reward value according to the reward value reward_1 corresponding to the first user.
  • In a third aspect, an embodiment of the present application provides a device, which includes a processor and a memory, where the memory is used to store program instructions and the sample data required for training the models, and the processor is used to call the program instructions to execute the method described in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having program instructions stored therein, which, when run on a processor, implement the method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1A is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 1B is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 1C is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • 1D is a schematic structural diagram of a device provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a processing flow of a processor provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a model training method based on a generative adversarial network provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a training process of a discriminative model provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a scene of an attention mechanism provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a training process of a generative model provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a training scenario of a discriminative model and a generative model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • the goal of the recommendation system is to accurately predict the user's preference for specific products.
  • the recommendation effect of the recommendation system not only affects the user experience, but also directly affects the revenue of the recommendation platform. Therefore, accurate recommendation is of great significance.
  • the users illustrated in Table 1 include User A, User B, User C, User D, and User E.
  • the illustrated items include Item 101, Item 102, Item 103, Item 104, Item 105, and Item 106.
  • Table 1 also indicates the corresponding user's rating of the corresponding item;
  • the higher a certain user's rating of a certain item, the stronger the user's preference for that item. For example, user A's rating of item 101 is 5 points, indicating that user A has a very strong preference for item 101.
  • The question marks in Table 1 represent items that the user has not rated so far.
  • The goal of the recommendation system is to predict the corresponding user's preference for unrated items.
  • For example, user A's ratings of item 104, item 105, and item 106 need to be predicted,
  • and user B's ratings of item 105 and item 106 need to be predicted, and so on.
  • Based on the existing ratings, the recommendation system can fill in the user's ratings of the unrated items. As shown in Table 2, if the recommendation system wants to recommend a new item to user A, item 106 may be a better choice, because the recommendation system gives item 106 a rating of 5 points, which is higher than the other items, so user A is very likely to like item 106.
  • The model training method based on a generative adversarial network proposed in the embodiments of the present application can train a generative model with a better effect. Therefore, in item recommendation, the scores of counterfeit items generated by the generative model can be used as a basis to obtain a better recommendation effect.
  • The model training method based on a generative adversarial network in the embodiments of the present application can be applied in many scenarios, such as advertisement click prediction, recommendation of the topN items of interest, prediction of the answer most relevant to a question, and so on, which are illustrated below.
  • In the advertisement click prediction scenario, the advertisement recommendation system needs to return one or more sorted advertisement lists to display to the user.
  • The embodiments of the present application can predict the advertisements that are more popular with users, thereby improving the click-through rate of the advertisements.
  • The advertisements that have been clicked by the user and the advertisements that have not been clicked can be combined into real item pairs:
  • the advertisements that have been clicked are equivalent to positive-example real items,
  • and the advertisements that have not been clicked are equivalent to negative-example real items.
  • Through adversarial training, the user's probability of clicking on each advertisement can be estimated (equivalent to the item's rating).
  • The user's historical behavior data on advertisements can be trained through the model training method based on a generative adversarial network, and the predicted click probability of the user for each advertisement can be obtained.
  • In the topN recommendation scenario, the topN items that the user is most interested in need to be recommended to the user so as to promote the user's consumption behavior on the items, where the items may be e-commerce products, APPs in an application market, and the like.
  • The items that have been consumed or downloaded by the user and rated higher by the user, and the items that have been consumed by the user and rated lower by the user, can be combined into real item pairs:
  • the items with higher scores are equivalent to positive-example real items,
  • and the items with lower scores are equivalent to negative-example real items.
  • With IRGAN technology, counterfeit item pairs can be generated through the generative model, and the discriminative model tries to determine which are the generated item pairs and which are the real item pairs.
  • Through IRGAN adversarial training, how highly the user rates each item can be estimated, which is equivalent to the item's rating.
  • The user's historical behavior data on items is trained through the model training method based on a generative adversarial network, and the user's interest ranking of each item can be obtained, so that the topN items of interest are output to the user.
  • In the question answering scenario, the question-answering system needs to give answers to the questions raised by the user that are as close as possible to the user's expectations, thereby improving the user's friendliness towards the question-answering system.
  • The answers received by the user and rated high by the user, and the answers received by the user and rated low by the user, can be combined into real item pairs; the answers with higher scores are equivalent to positive-example real items, and the answers with lower scores are equivalent to negative-example real items.
  • With IRGAN technology, counterfeit item pairs can be generated through the generative model, and the discriminative model tries to determine which are the generated item pairs and which are the real item pairs.
  • Through IRGAN adversarial training, how highly the user rates each answer can be estimated, which is equivalent to the item's rating.
  • The user's historical behavior data on questions and answers can be trained through the model training method based on a generative adversarial network, and the user's satisfaction ranking of each answer can be obtained, thereby outputting to the user the N answers with which the user is relatively satisfied.
  • The following describes, with reference to FIG. 1D, the device that executes the model training method based on a generative adversarial network.
  • FIG. 1D is a schematic structural diagram of a device provided by an embodiment of the present application.
  • the device is used to classify items.
  • The device may be a single device, such as a server, or a cluster formed by several devices. The structure of the device is briefly introduced below, taking a server as an example.
  • the device 10 includes a processor 101, a memory 102, and a communication interface 103.
  • the processor 101, the memory 102, and the communication interface 103 are connected to each other through a bus, where:
  • the communication interface 103 is used to obtain data of existing items, for example, identification and rating of existing items, information of users who rate existing items, and so on.
  • The communication interface 103 can establish a communication connection with other devices, so it can receive data of existing items sent by other devices or read data of existing items from other devices; optionally, the communication interface 103 can be connected to an external readable storage medium, so that data of existing items can be read from the external readable storage medium; the communication interface 103 may also obtain data of existing items in other ways.
  • The memory 102 includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (read-only memory, ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM).
  • The memory 102 is used to store related program instructions and related data.
  • The related data may include data acquired through the communication interface 103, as well as new data, models, and model-based prediction results obtained after processing those data; these data may also be called samples.
  • the processor 101 may be one or more central processing units (CPUs). When the processor 101 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • The processor 101 is used to read the program stored in the memory 102 and execute the related operations involved in the model training method based on a generative adversarial network, such as training the discriminative model, training the generative model, and making score predictions for items.
  • FIG. 2 illustrates the general execution flow of the processor, which includes inputting the information of existing items, the information of the users who rated the items, and the score values of the items into the initial discriminative model 201, where the information of an existing item may include the item identification ID, and the information of a user who scored the item may include the user identification ID.
  • The generative model 202 also generates some counterfeit items and inputs the relevant information of the counterfeit items into the initial discriminative model 201 to train the discriminative model 201. The discriminative model 201 and the generative model 202 continuously confront each other; finally, a discriminative model 201 with a strong ability to distinguish real samples from counterfeit samples is obtained, together with a generative model 202 that can generate items very close to real items. The generative model 202 is then used to generate counterfeit items and the scores of the counterfeit items; the ranking prediction 203 then generates, for any user, a ranking of the user's items based on the ratings of all the items of that user, so as to obtain a recommendation list of items for the user according to the ranking.
  • the items include real items and counterfeit items.
  • The discriminative model 201 includes a discriminator and an attention network.
  • the discriminator is responsible for distinguishing between real items and counterfeit items.
  • the attention network is used to record the attention weights of different users on real items and counterfeit items.
  • This provides a reference for the generation of the generative model;
  • The generative model 202 includes an item generator and a score generator.
  • The item generator is used to generate counterfeit items, and the score generator is used to generate scores for the counterfeit items.
  • The item generator can further be divided into a positive-example generator and a negative-example generator: the positive-example generator is used to generate positive-example counterfeit items, and the negative-example generator is used to generate negative-example counterfeit items.
  • Optionally, a dynamic sampling technique is used for sampling in the item generator.
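  • For illustration only, the following is a minimal sketch of the overall adversarial flow shown in FIG. 2. The callables train_discriminator, train_generator, and rank_items, as well as the initial models gen and disc, are assumptions standing in for the discriminative model 201, the generative model 202, and the ranking prediction 203; they are not functions defined in this application.

```python
# Hypothetical sketch of the adversarial training flow of FIG. 2.
# All callables are assumed placeholders supplied by the caller.
def adversarial_training(users, real_items, ratings,
                         gen, disc,
                         train_discriminator, train_generator, rank_items,
                         n_rounds=10):
    for _ in range(n_rounds):
        # Discriminative model 201: learn to tell real item pairs from generated counterfeit pairs.
        disc = train_discriminator(disc, gen, users, real_items, ratings)
        # Generative model 202: update with the reward derived from the discriminator's loss.
        gen = train_generator(gen, disc, users, real_items, ratings)
    # Ranking prediction 203: score counterfeit and real items and build recommendation lists.
    return {u: rank_items(gen, u, real_items[u]) for u in users}
```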
  • Optionally, the device 10 may further include an output component, such as a display or a speaker, and an input component.
  • The output component is used to show the developer the parameters to be used for training the models, so that the developer can learn these parameters and can also modify them and input the modified parameters into the device 10 through the input component.
  • The input component may include a mouse, a keyboard, and the like.
  • In addition, the device 10 can display the trained models and the model-based prediction results to the developer through the output component.
  • FIG. 3 shows a model training method based on a generative adversarial network provided by an embodiment of the present application.
  • The method may be implemented based on the device 10 shown in FIG. 1D or based on other architectures.
  • The method includes the following steps:
  • Step S301: The device generates counterfeit items for the first user through the generative model.
  • The embodiments of the present application involve real items and counterfeit items, where the counterfeit items include positive-example counterfeit items and negative-example counterfeit items, and the real items include positive-example real items and negative-example real items.
  • A user's positive-example real items are items that the user has operated on and pays more attention to,
  • a user's negative-example real items are items that the user has operated on but does not pay attention to,
  • a user's positive-example counterfeit items are items that the user has not operated on but is predicted to pay more attention to,
  • and a user's negative-example counterfeit items are items that the user has not operated on and is predicted not to pay attention to.
  • the first user in the embodiment of the present application is one of multiple users. For ease of understanding, the first user is used as an example for description, and the characteristics of other users may refer to the description of the first user.
  • The operation behaviors of the first user on the items displayed on a terminal include downloading, evaluating, clicking, browsing, and the like. These behaviors are recorded by the terminal and the corresponding items are scored according to the operations; for example, the score may be a rating given by the user,
  • or the terminal or the above-mentioned device may derive a score from the user's behavior data. The score is used to measure the user's attention to an item, and a user's positive-example real items and negative-example real items can be divided according to the scores of the items the user has operated on.
  • For example, if the score range is 1-5 points, items with a score in the range of 4-5 points can be defined as the user's positive-example real items, and items with a score in the range of 1-3 points can be defined as the user's negative-example real items.
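  • A minimal sketch of this splitting step is given below; the 1-5 scale and the 4-point threshold follow the example above, while the function name and data layout are illustrative assumptions.

```python
# Sketch: split one user's rated items into positive / negative real items by score.
def split_real_items(user_ratings, positive_threshold=4.0):
    """user_ratings: dict mapping item id -> score on a 1-5 scale (assumed layout)."""
    positives, negatives = [], []
    for item_id, score in user_ratings.items():
        (positives if score >= positive_threshold else negatives).append(item_id)
    return positives, negatives

# Example: split_real_items({"I1": 5, "I2": 2}) -> (["I1"], ["I2"])
```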
  • the items here are applications (APP), or advertisements, or videos, or songs, or answers to question-answering systems, and so on.
  • The positive-example counterfeit items generated for the first user are predicted items that the first user pays attention to,
  • and the negative-example counterfeit items generated for the first user are predicted items that the first user does not pay attention to.
  • For example, the generative model generates for the first user comedy movie 1, comedy movie 2, and comedy movie 3, which the first user may pay attention to, and horror movie 1, horror movie 2, and horror movie 3, which the first user may not pay attention to;
  • comedy movie 1, comedy movie 2, and comedy movie 3 belong to the first user's positive-example counterfeit items, and horror movie 1, horror movie 2, and horror movie 3 belong to the first user's negative-example counterfeit items.
  • The generative model also generates ratings for comedy movie 1, comedy movie 2, comedy movie 3, horror movie 1, horror movie 2, and horror movie 3.
  • The generated ratings are predicted ratings and are used to indicate the first user's degree of preference for these movies.
  • For the principle by which the generative model generates positive-example counterfeit items and negative-example counterfeit items for other users, refer to the above description for the first user.
  • The positive-example counterfeit items of different users may be the same or different, and the corresponding scores may be the same or different. The generative model is described below.
  • The goal of the generative model is to generate counterfeit item pairs whose distribution approximates the correlation distribution of the real item pairs as closely as possible, where each counterfeit item pair includes one positive-example counterfeit item and one negative-example counterfeit item, and each real item pair includes one positive-example real item and one negative-example real item.
  • The correlation distribution of the generated counterfeit item pairs is shown in formula (1):
  • G((f+, f-) | u) = g+(f+ | u) g-(f- | u, f+)    (1)
  • where f represents a generated counterfeit item,
  • f+ is a generated positive-example counterfeit item,
  • and f- is a generated negative-example counterfeit item.
  • The generative model can be divided into two sub-models, a positive-example generator and a negative-example generator, where g+ represents the positive-example generator, g- represents the negative-example generator, and u represents the first user.
  • The positive-example generator g+ is used to generate the distribution of the first user u's positive-example counterfeit items,
  • and the negative-example generator g- is used to generate the distribution of the first user's negative-example counterfeit items according to the positive-example counterfeit items generated by the positive-example generator g+.
  • The distribution of positive-example counterfeit items generated by the positive-example generator g+ is shown in formula (2):
  • where e_u represents the embedding of the first user,
  • e_i is the embedding of the i-th positive-example counterfeit item,
  • and b represents the bias of the first user.
  • Optionally, the embedding vectors and the bias value bias may be configured with default values during the first initial training, and the embedding and bias are usually updated after each round of training.
  • The negative-example generator uses the inner product to calculate the relationship between the positive-example counterfeit items and the negative-example counterfeit items, thereby obtaining the distribution of the generated negative-example counterfeit items as shown in formula (3):
  • Optionally, the device generates a score for each generated positive-example counterfeit item and each generated negative-example counterfeit item through the score generation model.
  • The principle by which the score generation model generates scores can be shown as formula (4):
  • where r_{u,t} represents the generated rating of the t-th counterfeit item by the first user,
  • and e_t is the embedding of the t-th counterfeit item.
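  • Since formulas (2)-(4) appear only as images in the source, the sketch below assumes a common softmax-over-inner-product parameterisation for the two generators and an inner-product-plus-bias form for the score; it is an illustrative assumption, not the exact formulas of this application.

```python
import numpy as np

def positive_distribution(e_u, candidate_embeddings, b):
    """Assumed form of g+(f+|u): distribution over candidate positive-example counterfeit items."""
    logits = candidate_embeddings @ e_u + b          # inner product of user and item embeddings
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def negative_distribution(e_positive, candidate_embeddings, b):
    """Assumed form of g-(f-|u,f+): negatives conditioned on a sampled positive counterfeit."""
    logits = candidate_embeddings @ e_positive + b   # inner product with the positive counterfeit
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def predicted_score(e_u, e_t, b):
    """Assumed form of formula (4): generated rating r_{u,t} of counterfeit item t for user u."""
    return float(e_u @ e_t) + b
```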
  • The way of generating the multiple counterfeit item pairs can be as follows:
  • The device matches a first positive-example counterfeit item with one negative-example counterfeit item to form one counterfeit item pair, where that negative-example counterfeit item is one of the top-M-scored negative-example counterfeit items among all negative-example counterfeit items of the first user, M is the number of all first positive-example counterfeit items of the first user, the first positive-example counterfeit item is any one of the generated positive-example counterfeit items belonging to the first user, and M is a positive integer.
  • Optionally, a negative-example counterfeit item with the highest score is sampled from the generated negative-example counterfeit items to form a counterfeit item pair with the positive-example counterfeit item.
  • In this way, each sampled positive-example counterfeit item can be matched with one negative-example counterfeit item, thereby obtaining multiple counterfeit item pairs.
  • A schematic code sketch of this pair construction is given below, after the description of how real item pairs are formed.
  • Similarly, the device matches a first positive-example real item with one negative-example real item to form one real item pair, where that negative-example real item is one of the top-N-scored negative-example real items among all negative-example real items of the first user, N is the number of all first positive-example real items of the first user, the first positive-example real item is any one of the positive-example real items sampled from the first user's existing positive-example real items, and N is a positive integer.
  • Optionally, a negative-example real item with the highest score is sampled from the existing negative-example real items to form a real item pair with the positive-example real item.
  • In this way, each sampled positive-example real item can be matched with one negative-example real item, thereby obtaining multiple real item pairs.
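  • The sketch below illustrates the pair-construction step referenced above: each sampled positive item (counterfeit or real) is matched with one of the user's top-scored negative items. Function and variable names are illustrative assumptions.

```python
import numpy as np

def build_pairs(positive_items, negative_items, negative_scores):
    """Match each sampled positive item with one of the top-M-scored negative items.

    positive_items: list of sampled positive items (counterfeit or real).
    negative_items / negative_scores: candidate negative items and their scores.
    """
    m = len(positive_items)                            # M (or N) in the text above
    top_idx = np.argsort(negative_scores)[::-1][:m]    # indices of the top-M negatives by score
    top_negatives = [negative_items[i] for i in top_idx]
    # Pair the k-th positive with the k-th of the top-M negatives (one simple matching rule).
    return list(zip(positive_items, top_negatives))

# Example: build_pairs(["I1", "I3"], ["I2", "I4", "I6"], [0.9, 0.2, 0.7])
# -> [("I1", "I2"), ("I3", "I6")]
```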
  • Step S302: The device trains multiple real item pairs and multiple counterfeit item pairs with the objective of minimizing the loss function, to obtain the discriminative model.
  • In the loss function, v may be r or f;
  • one of the distributions involved is the distribution of counterfeit item pairs generated by the generative model, and the other is the distribution of real item pairs sampled from the real items;
  • e_u represents the embedding of the first user,
  • and b represents the bias of the first user.
  • The discriminative model is responsible for distinguishing the difference between the distribution of the above counterfeit item pairs and the distribution of the above real item pairs, and can be optimized using the cross-entropy loss function (6), so that the discriminative model gains a stronger ability to recognize real items and counterfeit items.
  • Optionally, the following process may be performed for each user:
  • the number of training iterations is preset to n, and the training process in this case is shown in FIG. 4.
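  • A minimal sketch of one discriminator update with a cross-entropy objective is given below. The callable discriminator_prob (the probability that an item pair is real for user u) is an assumed placeholder, since formula (6) and the discriminator's exact parameterisation appear only as images in the source.

```python
import numpy as np

def discriminator_loss(real_pairs, fake_pairs, user, params, discriminator_prob):
    """Cross-entropy style loss over sampled real pairs and generated counterfeit pairs."""
    eps = 1e-12
    loss = 0.0
    for pair in real_pairs:        # sampled real item pairs (r+, r-)
        loss -= np.log(discriminator_prob(pair, user, params) + eps)
    for pair in fake_pairs:        # generated counterfeit item pairs (f+, f-)
        loss -= np.log(1.0 - discriminator_prob(pair, user, params) + eps)
    return loss / (len(real_pairs) + len(fake_pairs))
```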
  • Step S303: The device updates the generative model according to the loss function of the discriminative model.
  • Optionally, the device updating the generative model according to the loss function of the discriminative model may include: first, the device obtains a reward value reward according to the loss function of the discriminative model, where
  • the loss function of the discriminative model is shown in formula (6), and the reward value reward can be calculated from the term D(r, f | u) in formula (6), for example reward = log(1 - D(r, f | u));
  • the generative model is then updated according to the reward, as in formula (7), where
  • E_{f~G_u}[·] is the expectation function,
  • f ~ G_u indicates that f is generated by the generator G(f | u),
  • i takes values from 1 to N,
  • f_i represents the i-th sample generated by the generator, and reward in formula (7) is the reward value obtained earlier.
  • Optionally, the device updating the generative model according to the loss function of the discriminative model to obtain a new generative model may include the following: in the first step, the device determines the first user's attention index on items, where the first user's attention index on items is obtained by training the first user's real item scores and counterfeit item scores using an attention network; in the second step, the device obtains a reward value reward according to the loss function of the discriminative model, and optimizes the reward value reward through the first user's attention index on items to obtain a new reward value; in the third step, the device updates the generative model with the new reward value. The first step, the second step, and the third step are described below.
  • Step 1: The device determines the first user's attention index on items.
  • The first user's attention index on items is obtained by training the first user's real item scores and counterfeit item scores using an attention network.
  • Considering that the first user's attention weights on real item pairs and counterfeit item pairs are different, the attention network can be used to remember the first user's weights on the real item pairs and the counterfeit item pairs.
  • The attention network can be a one-layer or multi-layer neural network, which is related to the user, the generated counterfeit item pairs, and the sampled real item pairs. Through the attention network, the first user's different weights on the two kinds of item pairs can be learned.
  • The network structure of the attention mechanism is shown in FIG. 5.
  • The first user's attention index α on items can be calculated by formula (8), as follows:
  • α = softmax(g(r+, r-, f+, f- | u))
  • where w_u represents the attention weight of the first user, the remaining weights correspond to the first user's positive-example and negative-example real and counterfeit items, b is the bias of the first user, and softmax denotes the softmax function.
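  • A minimal sketch of such an attention network is given below. The linear scoring form and the feature layout are assumptions for illustration, since the exact g(.) of formula (8) is shown only as an image in the source.

```python
import numpy as np

def attention_index(weights, bias, pair_features):
    """Compute alpha = softmax(g(r+, r-, f+, f- | u)) for one user.

    pair_features: array of shape (num_pairs, dim), one feature row per item pair,
                   built from the user and the pair's positive/negative items (assumed layout).
    weights, bias: trained per-user parameters of the one-layer attention network.
    """
    scores = pair_features @ weights + bias       # g(.) assumed to be a linear layer
    exp = np.exp(scores - scores.max())           # numerically stable softmax
    return exp / exp.sum()                        # attention index over the item pairs
```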
  • Step 2: The device obtains a reward value reward according to the loss function of the discriminative model (the way of obtaining reward has been described above), and the device optimizes the reward value reward through the first user's attention index on items to obtain a new reward value.
  • Step 3: The device updates the generative model with the new reward value.
  • Optionally, the generative model can be trained using a policy gradient to obtain a new generative model.
  • The formula of the policy gradient is shown in the following formula (9):
  • Optionally, the training process of the new generative model can include the following operations:
  • the number of training iterations is preset to m,
  • and the training process in this case is shown in FIG. 6.
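  • The sketch below puts the reward and the policy-gradient update together: the reward comes from the discriminator output, is re-weighted by the attention index α (reward_1 = α * reward), and drives a REINFORCE-style update of the generator parameters. The gradient helper grad_log_prob and the step size are illustrative assumptions, since formulas (7) and (9) appear only as images in the source.

```python
import numpy as np

def generator_update(theta, sampled_pairs, d_probs, alpha, grad_log_prob, lr=0.01):
    """One policy-gradient step for the generator.

    sampled_pairs: counterfeit item pairs sampled from the generator for one user.
    d_probs[i]:    discriminator output D(r, f | u) for the i-th sampled pair.
    alpha[i]:      attention index of the user on the i-th pair.
    grad_log_prob: assumed callable returning d/d(theta) of log G(f_i | u).
    """
    rewards = np.log(1.0 - np.asarray(d_probs) + 1e-12)   # reward derived from the discriminator
    new_rewards = np.asarray(alpha) * rewards              # reward_1 = alpha * reward
    grad = np.zeros_like(theta)
    for pair, r in zip(sampled_pairs, new_rewards):
        grad += grad_log_prob(theta, pair) * r              # policy-gradient term per sampled pair
    grad /= len(sampled_pairs)
    return theta + lr * grad                                 # gradient ascent on the expected reward
```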
  • In the embodiments of the present application, the training of the discriminative model and the training of the generative model are the key parts.
  • The training process of the discriminative model and the training process of the generative model have been introduced above; the two processes are described together below to facilitate a better understanding of the embodiments of the present application, and FIG. 7 is a schematic diagram of the corresponding process.
  • The reward value reward is calculated through the discriminative model, and the generative model is updated according to the policy gradient algorithm;
  • relative to the previous generative model, the updated generative model is embodied in the updated embedding and bias of formula (2), formula (3), and formula (4).
  • Step S304: The device generates the scores of the counterfeit items through the updated generative model.
  • The counterfeit items include the positive-example counterfeit items and negative-example counterfeit items generated for each of the multiple users; that is, after the new generative model is trained, the generative model needs to score each previously generated positive-example counterfeit item and each previously generated negative-example counterfeit item, and the scores generated by the new generative model have more reference value.
  • Step S305: The device sorts the real items and the counterfeit items according to the scores of the counterfeit items and the scores of the existing real items, and recommends items to the first user according to the order in the sorting.
  • Optionally, the device can sort the first user's real items and counterfeit items, where the sorting can follow the rule of high score to low score, or follow other rules defined in advance; items are then recommended to the user according to the order in the sorting.
  • Optionally, the device can also sort the real items and counterfeit items of other users in the same way.
  • For example, the real items of user 1 include positive-example real item 1 with a score of 4.9, positive-example real item 2 with a score of 4.5, negative-example real item 1 with a score of 3.5, negative-example real item 2 with a score of 3.3, and negative-example real item 3 with a score of 3.4; then, sorting from high score to low score, the obtained order is: positive-example real item 11, positive-example counterfeit item 01, positive-example real item 12, positive-example counterfeit item 02, negative-example counterfeit item 11, negative-example real item 03, negative-example real item 12, negative-example counterfeit item 02, negative-example counterfeit item 13, negative-example counterfeit item 01. After that, these real items and counterfeit items are recommended to user 1 in this order.
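  • The ranking step can be sketched as follows: merge the real and counterfeit items of one user with their scores, sort from high to low, and recommend in that order. The function name and the dictionary layout are illustrative assumptions.

```python
def recommend(real_scores, fake_scores, top_n=None):
    """real_scores / fake_scores: dicts mapping item id -> score for one user."""
    merged = {**real_scores, **fake_scores}          # counterfeit scores come from the generator
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n else ranked

# Example: recommend({"real_1": 4.9, "real_2": 4.5}, {"fake_1": 4.7})
# -> [("real_1", 4.9), ("fake_1", 4.7), ("real_2", 4.5)]
```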
  • The first step: data input.
  • The identification IDs of all users and the identification IDs of the items rated by each user are input as the data set. Taking item recommendation as an example, there are a total of 10 items in this embodiment.
  • The input information is shown in Table 3:
  • the first entry, with entry number 1, represents that the user with user ID U1 has rated item I1,
  • and the second entry, with entry number 2, represents that the user with user ID U1 has rated item I3, and so on.
  • The second step: initialize the parameters of the generative model and the parameters of the discriminative model, including the size of the user embedding (representation vector) and the item embedding, the size of the training batch, and the learning rate.
  • The batch is used to characterize the number of samples taken at a time.
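  • For concreteness, a sketch of this initialization is given below; the dimensions, batch size, and learning rate are arbitrary illustrative values, not the ones used in this application.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, embed_dim = 5, 10, 16            # 10 items as in this embodiment
user_embeddings = rng.normal(scale=0.01, size=(num_users, embed_dim))
item_embeddings = rng.normal(scale=0.01, size=(num_items, embed_dim))
user_bias = np.zeros(num_users)
batch_size, learning_rate = 32, 0.01
# Rated (user, item) records as in Table 3, e.g. [("U1", "I1"), ("U1", "I3"), ...]
```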
  • The third step: keep the parameters of the generative model unchanged and train the discriminative model.
  • For each user, real item pairs are sampled from the existing real items; the number of item pairs is the same as the number of positive-example real items.
  • The positive-example real items refer to the items that the user has rated with a higher score, for example 4 points and above.
  • Taking user U1 as an example, the rated items I1, I3, I5, and I8 are positive-example real items, and the items I2, I4, I6, I7, I9, and I10 that user U1 has not rated are negative-example real items. User U1 has 4 rated items, so four real item pairs are sampled, as follows:
  • The negative-example real items I2, I4, I9, and I6 are drawn from the items that user U1 has not rated, and can be drawn randomly or according to other strategies specified in advance. During training, the generative model is also needed to generate counterfeit item pairs.
  • the positive example generator in the generation module is responsible for generating positive example counterfeit items, and the negative example generator is responsible for generating negative example counterfeit items.
  • the item pair generated by the generation model may be:
  • When training the discriminative model, the real item pairs (I1, I2), (I3, I4), (I5, I9), (I8, I6) and the generated counterfeit item pairs (I1, I2), (I2, I6), (I5, I7), (I8, I9) are handed over to the discriminative model.
  • The discriminative model minimizes the loss function to distinguish the real item pairs from the counterfeit item pairs as well as possible, so as to improve its discriminative ability. The training of the discriminative model is repeated until the item pairs of every user have been fully trained.
  • The fourth step: keep the parameters of the discriminative model unchanged and train the generative model. Similar to the stage of training the discriminative model, for each user, real item pairs need to be sampled from the existing real items and counterfeit item pairs need to be generated through the generative model. Still taking user U1 as an example:
  • the real item pair for the user U1 can be as follows:
  • the pair of counterfeit items for user U1 can be as follows:
  • The difference from training the discriminative model is that the discriminative model calculates a reward value based on the two kinds of input item pairs.
  • The generative model then updates its parameters according to the new reward value obtained by updating on the basis of the reward, and the training of the generative model is repeated until the item pairs of every user have been fully trained.
  • Step 5: Repeat steps 3-4 until the discriminative model and the generative model are trained to their best.
  • Step 6: The device scores the generated counterfeit items according to the generative model obtained from the final training.
  • Step 7: The user ID of the user to be scored, for example user U1, is input into the device; the device sorts all the items for user U1 according to the ratings, where a higher score indicates a stronger preference, and all the items include the existing real items and the generated counterfeit items. Table 4 gives an example of the sorting results:
  • It can be seen that user U1's favorite item is item I7.
  • In the embodiments of the present application, the negative-example counterfeit items in the counterfeit item pairs are generated based on the positive-example counterfeit items, fully considering the potential relationship between the negative-example counterfeit items and the positive-example counterfeit items, so that the counterfeit item pairs carry a richer amount of information, the training effect is improved, and the generative ability of the generative model is enhanced. Therefore, the recommendation results produced by sorting the items generated by the generative model together with the existing real items are of greater reference value to the user.
  • An embodiment of the present application also provides a model training device 80 based on a generative adversarial network.
  • The device includes a generative model 801, a training model 802, and a discriminative model 803, where each model is introduced as follows:
  • The generative model 801 is configured to generate positive-example counterfeit items and negative-example counterfeit items for a first user, where the negative-example counterfeit items are generated according to the positive-example counterfeit items, the positive-example counterfeit items of the first user are predicted items that the first user pays attention to, and the negative-example counterfeit items of the first user are predicted items that the first user does not pay attention to;
  • the training model 802 is configured to train multiple real item pairs and multiple counterfeit item pairs to obtain the discriminative model 803, where the discriminative model is used to distinguish the difference between the multiple real item pairs and the multiple counterfeit item pairs;
  • each real item pair includes one positive-example real item and one negative-example real item, and each counterfeit item pair includes one positive-example counterfeit item and one negative-example counterfeit item;
  • the positive-example real items are items determined, according to the operation behavior of the first user, to be paid attention to by the first user, and the negative-example real items are items determined, according to the operation behavior of the first user, not to be paid attention to by the first user;
  • the training model 802 is further configured to update the generative model according to the loss function of the discriminative model.
  • With this device, the negative-example counterfeit items in the counterfeit item pairs are generated based on the positive-example counterfeit items, fully considering the potential relationship between the negative-example counterfeit items and the positive-example counterfeit items, so that the counterfeit item pairs carry a richer amount of information, the training effect is improved, and the generative ability of the generative model is enhanced. Therefore, the recommendation results produced by sorting the items generated by the generative model together with the existing real items are of greater reference value to the user.
  • Optionally, the device further includes a recommendation model, where:
  • the updated generative model is configured to generate scores of counterfeit items, the counterfeit items including the positive-example counterfeit items and negative-example counterfeit items generated for the first user;
  • the recommendation model is configured to sort the real items and the counterfeit items according to the scores of the counterfeit items and the scores of the existing real items, and recommend items to the first user according to the order in the sorting.
  • Optionally, before the training model trains the multiple real item pairs and the multiple counterfeit item pairs to obtain the discriminative model, the training model is further configured to:
  • match each of multiple first positive-example counterfeit items with one first negative-example counterfeit item to form the multiple counterfeit item pairs, where the first negative-example counterfeit item is one of the top-M-scored negative-example counterfeit items of the first user,
  • M is the number of first positive-example counterfeit items,
  • and each first positive-example counterfeit item is a positive-example counterfeit item of the first user sampled from the positive-example counterfeit items generated by the generative model;
  • match each of multiple first positive-example real items with one first negative-example real item to form the multiple real item pairs, where the first negative-example real item is one of the top-N-scored negative-example real items of the first user, N is the number of first positive-example real items,
  • and each first positive-example real item is a positive-example real item sampled from the first user's existing positive-example real items.
  • Optionally, the initial generative model includes a positive-example generation model, a negative-example generation model, and a score generation model; the generative model being configured to generate positive-example counterfeit items and negative-example counterfeit items for the first user is specifically:
  • the positive-example generation model is configured to generate the distribution of the first user's positive-example counterfeit items;
  • the negative-example generation model is configured to generate the distribution of the first user's negative-example counterfeit items;
  • where g+(f+ | u) is the distribution of the positive-example counterfeit items, e_u is the embedding vector of the first user, e_i is the embedding of the i-th positive-example counterfeit item, b represents the bias value of the first user, and g-(f- | u, f+) is the distribution of the negative-example counterfeit items.
  • Optionally, updating the generative model according to the loss function of the discriminative model is specifically: determining the first user's attention index on items, where the first user's attention index on items is obtained by training the first user's real item scores and counterfeit item scores using an attention network; obtaining a reward value reward according to the loss function of the discriminative model, and optimizing the reward value reward through the first user's attention index on items to obtain a new reward value;
  • updating the generative model with the new reward value.
  • It can be understood that the importance of each item pair is different.
  • By introducing the attention network, the importance weight of each item pair is obtained, which can effectively select high-quality item pairs and reduce the negative impact of low-quality item pairs, making the resulting generative model and discriminative model more robust and adaptive.
  • The item pairs here can be real item pairs or counterfeit item pairs.
  • Optionally, the training model determining the first user's attention index on items is specifically:
  • using the attention network to calculate the first user's attention index on items according to the following formula:
  • α = softmax(g(r+, r-, f+, f- | u))
  • where α is the attention index of the first user u on items, w_u represents the trained weight of the first user, and b is the bias value of the first user.
  • Optionally, optimizing the reward value reward through the first user's attention index on items to obtain a new reward value specifically includes:
  • optimizing the reward value reward through the first user's attention index α to obtain the reward value reward_1 corresponding to the first user, where reward_1 = α * reward, and determining a new reward value according to the reward value reward_1 corresponding to the first user.
  • It should be noted that the implementation of each unit may also correspond to the model training method based on a generative adversarial network described in the foregoing embodiments, for example, steps S301-S305.
  • Embodiments of the present application also provide a computer-readable storage medium in which instructions are stored; when the instructions run on a processor, the model training method based on a generative adversarial network described in the foregoing embodiments, for example steps S301-S305, is implemented.
  • An embodiment of the present application further provides a computer program product; when the computer program product runs on a processor, the model training method based on a generative adversarial network described in the foregoing embodiments, for example steps S301-S305, is implemented.
  • Optionally, the program may be stored in a computer-readable storage medium, and when the program is executed, the processes of the foregoing method embodiments may be included.
  • The foregoing storage media include various media that can store program code, such as a ROM, a random access memory RAM, a magnetic disk, or an optical disc.

Abstract

An embodiment of the present application provides a model training method and device based on a generative adversarial network. The method includes: a device generates positive-example counterfeit items and negative-example counterfeit items for a first user through a generative model; the device trains multiple real item pairs and multiple counterfeit item pairs to obtain a discriminative model, where the discriminative model is used to distinguish the difference between the multiple real item pairs and the multiple counterfeit item pairs; each real item pair includes one positive-example real item and one negative-example real item, and each counterfeit item pair includes one of the positive-example counterfeit items and one of the negative-example counterfeit items; the device updates the generative model according to a loss function of the discriminative model. By adopting the embodiments of the present application, the generative ability of the generative model and the discriminative ability of the discriminative model can be improved.

Description

A model training method and device based on a generative adversarial network
Technical Field
The present application relates to the field of big data, and in particular, to a model training method and device based on a generative adversarial network.
Background
With the continuous development of informatization, people face an increasingly serious problem of information overload. As an effective information filtering tool, a personalized recommendation system can provide users with various personalized recommendation services. The information retrieval generative adversarial network (Information Retrieval GAN, IRGAN) is a model that applies the generative adversarial network (Generative Adversarial Net, GAN) model to the field of item recommendation; it trains the input item data to obtain a generative model and a discriminative model. The generative model is responsible for generating counterfeit items that resemble real items, and the discriminative model is responsible for distinguishing the generated counterfeit items from real samples. The training of the generative model and the discriminative model depend on each other. In the item recommendation scenario, the generative model is used to generate counterfeit items and item scores, and the items are then sorted by score to obtain the recommendation result.
Common training methods of IRGAN include the point-wise method and the pair-wise method. The main idea of point-wise is to convert the recommendation problem into a classification or regression problem: assuming that the user's preference for each item is independent, features are extracted from the items the user may like and used for training. The main idea of pair-wise is to convert the recommendation problem into a binary classification problem; during model training, pair-wise no longer makes an independence assumption on items, but uses item pairs as the minimum unit of training, where each item pair usually includes one item the user likes and one item the user does not like. At present, the training effect of pair-wise is not as good as that of point-wise. How to optimize pair-wise so as to improve the generative ability of the generative model and the discriminative ability of the discriminative model in recommendation scenarios is a technical problem being studied by those skilled in the art.
Summary
The embodiments of the present application disclose a model training method and device based on a generative adversarial network, which can improve the generative ability of the generative model and the discriminative ability of the discriminative model.
第一方面,本申请实施例提供一种基于生成对抗网络的模型训练方法,该方法包括:
设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品,其中所述负例伪造物品为根据所述正例伪造物品生成的,所述第一用户的正例伪造物品为预测的受所述第一用户关注的物品,所述第一用户的负例伪造物品为预测的不受所述第一用户关注的物品;所述设备训练多个真实物品对和多个伪造物品对以得到判别模型,所述判别模型用于分辨所述多个真实物品对与所述多个伪造物品对之间的差异;每个真实物品对包括一个正例真实物品和一个负例真实物品,每个伪造物品对包括一个所述正例伪造物品和一个所述负例伪造物品;所述正例真实物品为根据所述第一用户的操作行为认定的受所述第一用户关注的物品,所述负例真实物品为根据所述第一用户的操作行为认定的不受所述第一用户关注的物品;所述设备根据所述判别模型的损失函数更新所述生成模型。
通过执行上述方法,伪造物品对中的负例伪造物品是依赖正例伪造物品而生成的,充 分地考虑了负例伪造物品与正例伪造物品之间的潜在关系,使得伪造物品对包含的信息量更丰富,提升了训练效果,增强了生成模型的生成能力,因此对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
结合第一方面,在第一方面的第一种可能的实现方式中,所述设备根据所述判别模型的损失函数更新所述生成模型之后,还包括:所述设备通过更新后的生成模型生成伪造物品的评分,所述伪造物品包括所述为第一用户生成的正例伪造物品和负例伪造物品;所述设备根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向所述第一用户推荐物品。可以理解的是,对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
结合第一方面,或者第一方面的上述任一可能的实现方式,在第一方面的第二种可能的实现方式中,所述设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品之后,所述设备训练多个真实物品对和多个伪造物品对以得到判别模型之前,还包括:所述设备为多个第一正例伪造物品各匹配一个第一负例伪造物品以组成所述多个伪造物品对,所述第一负例伪造物品属于所述第一用户的负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一正例伪造物品的数量,所述第一正例伪造物品为从所述生成模型生成的正例伪造物品中采样到的所述第一用户的正例伪造物品;另外,所述设备为多个第一正例真实物品各匹配一个第一负例真实物品以组成所述多个真实物品对,所述第一负例真实物品属于所述第一用户的负例真实物品中评分排在前N位的负例真实物品,N为所述第一正例真实物品的数量,所述第一正例真实物品为从所述第一用户已有的正例真实物品中采样到的一个正例真实物品。
可以理解的是,采集评分高的物品组成物品对,包括真实物品对和伪造物品对,由于评分高的物品更受用户的关注,因此其对用户而言这种方式得到的物品对包含的信息量更大且噪声更小,根据这样的物品对进行训练可以充分地分析受用户关注的特征,从而训练出生成能力更强的生成模型。
结合第一方面,或者第一方面的上述任一可能的实现方式,在第一方面的第三种可能的实现方式中,所述初始生成模型包括正例生成模型、负例生成模型和评分生成模型;所述设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品,包括:
所述设备通过正例生成模型生成第一用户的正例伪造物品的分布,所述正例生成模型为:
g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)
所述设备通过负例生成模型生成第一用户的负例伪造物品的分布,所述负例生成模型为:
g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)
所述设备通过评分生成器生成每个正例伪造物品的评分和每个负例伪造物品的评分;
其中,g+(f+|u)为所述正例伪造物品的分布,e_u为第一用户的嵌入向量embedding,e_{f+}是待生成的正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表所述第一用户的偏差值bias;g-(f-|u,f+)为所述负例伪造物品的分布,e_{f-}是待生成的负例伪造物品的embedding。
结合第一方面,或者第一方面的上述任一可能的实现方式,在第一方面的第四种可能的实现方式中,所述设备根据所述判别模型的损失函数更新所述生成模型,包括:所述设备确定所述第一用户对物品的注意力指标,所述第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;所述设备根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;所述设备采用所述新的奖励值更新所述生成模型。
可以理解的是,每个物品对的重要性是不同的,通过引入注意力网络,得到每个物品对的重要性权重,可以有效地选择优质的物品对,减少劣质物品对的负面影响,让我们得到的生成模型、判别模型更具鲁棒性与自适应性。这里的物品对可以为真实物品对,也可以为伪造物品对。
结合第一方面,或者第一方面的上述任一可能的实现方式,在第一方面的第五种可能的实现方式中,所述设备确定所述第一用户对物品的注意力指标,包括:
所述设备采用注意力网络根据如下公式计算第一用户对物品的注意力指标;
g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
α=softmax(g(r+,r-,f+,f-|u))
其中,α为所述第一用户u对物品的注意力指标,w_u表示训练出的所述第一用户的权重,w_{r+}表示训练出的第一用户的正例真实物品的权重,w_{r-}表示训练出的第一用户的负例真实物品的权重,w_{f+}表示训练出的所述第一用户的正例伪造物品的权重,w_{f-}表示训练出的所述第一用户的负例伪造物品的权重;b为所述第一用户的偏差值bias。
结合第一方面,或者第一方面的上述任一可能的实现方式,在第一方面的第六种可能的实现方式中,所述通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,包括:通过所述第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,所述第一用户对物品的注意力指标α、奖励值reward和所述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;根据所述第一用户对应的奖励值reward_1确定新的奖励值。
第二方面,本申请实施例提供一种基于生成对抗网络的模型训练设备,该设备包括:
生成模型,用于为第一用户生成正例伪造物品和负例伪造物品,其中所述负例伪造物品为根据所述正例伪造物品生成的,所述第一用户的正例伪造物品为预测的受所述第一用户关注的物品,所述第一用户的负例伪造物品为预测的不受所述第一用户关注的物品;
训练模型,用于训练多个真实物品对和多个伪造物品对以得到判别模型,所述判别模型用于分辨所述多个真实物品对与所述多个伪造物品对之间的差异;每个真实物品对包括一个正例真实物品和一个负例真实物品,每个伪造物品对包括一个所述正例伪造物品和一个所述负例伪造物品;所述正例真实物品为根据所述第一用户的操作行为认定的受所述第一用户关注的物品,所述负例真实物品为根据所述第一用户的操作行为认定的不受所述第一用户关注的物品;
所述训练模型,用于根据所述判别模型的损失函数更新所述生成模型。
通过运行上述单元,伪造物品对中的负例伪造物品是依赖正例伪造物品而生成的,充分地考虑了负例伪造物品与正例伪造物品之间的潜在关系,使得伪造物品对包含的信息量更丰富,提升了训练效果,增强了生成模型的生成能力,因此对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
结合第二方面,在第二方面的第一种可能的实现方式中,该设备还包括推荐模型,其中:
在所述训练模型根据所述判别模型的损失函数更新所述生成模型之后,更新后的生成模型用于生成伪造物品的评分,所述伪造物品包括所述为第一用户生成的正例伪造物品和负例伪造物品;
所述推荐模型,用于根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向所述第一用户推荐物品。
可以理解的是,对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
结合第二方面,或者第二方面的上述任一可能的实现方式,在第二方面的第二种可能的实现方式中,在所述生成模型为第一用户生成正例伪造物品和负例伪造物品之后,所述训练模型训练多个真实物品对和多个伪造物品对以得到判别模型之前,所述训练模型还用于:
为多个第一正例伪造物品各匹配一个第一负例伪造物品以组成所述多个伪造物品对,所述第一负例伪造物品属于所述第一用户的负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一正例伪造物品的数量,所述第一正例伪造物品为从所述生成模型生成的正例伪造物品中采样到的所述第一用户的正例伪造物品;
为多个第一正例真实物品各匹配一个第一负例真实物品以组成所述多个真实物品对,所述第一负例真实物品属于所述第一用户的负例真实物品中评分排在前N位的负例真实物品,N为所述第一正例真实物品的数量,所述第一正例真实物品为从所述第一用户已有的正例真实物品中采样到的一个正例真实物品。
可以理解的是,采集评分高的物品组成物品对,包括真实物品对和伪造物品对,由于评分高的物品更受用户的关注,因此其对用户而言这种方式得到的物品对包含的信息量更大且噪声更小,根据这样的物品对进行训练可以充分地分析受用户关注的特征,从而训练出生成能力更强的生成模型。
结合第二方面,或者第二方面的上述任一可能的实现方式,在第二方面的第三种可能的实现方式中,所述初始生成模型包括正例生成模型、负例生成模型和评分生成模型;所述生成模型,用于为第一用户生成正例伪造物品和负例伪造物品,具体为:
用于通过正例生成模型生成第一用户的正例伪造物品的分布,所述正例生成模型为:
g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)
用于通过负例生成模型生成第一用户的负例伪造物品的分布,所述负例生成模型为:
g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)
用于通过评分生成器生成每个正例伪造物品的评分和每个负例伪造物品的评分;
其中,g+(f+|u)为所述正例伪造物品的分布,e_u为第一用户的嵌入向量embedding,e_{f+}是待生成的正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表所述第一用户的偏差值bias;g-(f-|u,f+)为所述负例伪造物品的分布,e_{f-}是待生成的负例伪造物品的embedding。
结合第二方面,或者第二方面的上述任一可能的实现方式,在第二方面的第四种可能的实现方式中,用于根据所述判别模型的损失函数更新所述生成模型,具体为:
确定所述第一用户对物品的注意力指标,所述第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;
根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;
采用所述新的奖励值更新所述生成模型。
可以理解的是,每个物品对的重要性是不同的,通过引入注意力网络,得到每个物品对的重要性权重,可以有效地选择优质的物品对,减少劣质物品对的负面影响,让我们得到的生成模型、判别模型更具鲁棒性与自适应性。这里的物品对可以为真实物品对,也可以为伪造物品对。
结合第二方面,或者第二方面的上述任一可能的实现方式,在第二方面的第五种可能的实现方式中,所述训练模型确定所述第一用户对物品的注意力指标,具体为:
采用注意力网络根据如下公式计算第一用户对物品的注意力指标;
g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
α=softmax(g(r+,r-,f+,f-|u))
其中,α为所述第一用户u对物品的注意力指标,w_u表示训练出的所述第一用户的权重,w_{r+}表示训练出的第一用户的正例真实物品的权重,w_{r-}表示训练出的第一用户的负例真实物品的权重,w_{f+}表示训练出的所述第一用户的正例伪造物品的权重,w_{f-}表示训练出的所述第一用户的负例伪造物品的权重;b为所述第一用户的偏差值bias。
结合第二方面,或者第二方面的上述任一可能的实现方式,在第二方面的第六种可能的实现方式中,所述通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,具体为:
通过所述第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,所述第一用户对物品的注意力指标α、奖励值reward和所述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;
根据所述第一用户对应的奖励值reward_1确定新的奖励值。
第三方面,本申请实施例提供一种设备,该设备包括处理器和存储器,其中,存储器用于存储程序指令和训练模型所需的样本数据,处理器用于调用所述程序指令来执行第一方面或者第一方面的任一可能的实现方式所描述的方法。
第四方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质中存储有程序指令,当其在处理器上运行时,实现第一方面或者第一方面的任一可能的实现方式所描述的方法。
附图说明
以下对本申请实施例用到的附图进行介绍。
图1A是本申请实施例提供的一种应用场景示意图;
图1B是本申请实施例提供的又一种应用场景示意图;
图1C是本申请实施例提供的又一种应用场景示意图;
图1D是本申请实施例提供的一种设备的结构示意图;
图2是本申请实施例提供的一种处理器处理流程示意图;
图3是本申请实施例提供的一种基于生成对抗网络的模型训练方法;
图4是本申请实施例提供的一种判别模型的训练流程示意图;
图5是本申请实施例提供的一种注意力机制的场景示意图;
图6是本申请实施例提供的一种生成模型的训练流程示意图;
图7是本申请实施例提供的一种判别模型和生成模型整体训练的场景示意图;
图8是本申请实施例提供的一种设备的结构示意图。
具体实施方式
下面结合本申请实施例中的附图对本申请实施例进行描述。
推荐系统的目标是准确预测用户对于特定商品的喜好程度,推荐系统的推荐效果不仅影响用户体验,也直接影响到推荐平台的收益,因此准确地推荐具有重要意义。
下面结合表1推荐系统的推荐原理及目标进行简单的介绍。
表1
用户\物品 101 102 103 104 105 106
A 5 3 2.5
B 2 2.5 5 2
C 2 4 4.5 5
D 5 3 4.5 4
E 4 3 2 4 3.5 4
表1中示意的用户包括用户A、用户B、用户C、用户D和用户E,示意的物品包括物品101、物品102、物品103、物品104、物品105和物品106,另外,表1还示意了相应的用户为相应的物品的评分,某个用户对某个物品对应的评分越高代表该用户对该物品的喜好越强。例如,用户A对物品101的评分为5分,表明用户A对物品101喜好程度非常高。表1中的问号代表目前用户对该物品尚没有进行过评分,推荐系统的目标就是预测相应用户对未评价过的商品的喜好程度。例如,需要预测用户A对物品104、物品105和物品106的评分,需要预测用户B对物品105和物品106的评分,其余依此类推。经过推荐系统的推荐算法计算以后,推荐系统可以补全用户对未评分物品的评分。如表2所示,如果推荐系统想要为用户A推荐新物品,那么物品106可能是一个比较好的选择,因为推荐系统给物品106的评分是5分,高于给其他物品的评分,该用户A有很大的可能性喜欢 物品106。
表2
用户\物品 101 102 103 104 105 106
A 5 3 2.5 2 4 5
B 2 2.5 5 2 2 4
C 2 4 3 4 4.5 5
D 5 3 3 4.5 3 4
E 4 3 2 4 3.5 4
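为便于理解上述补全评分并据此推荐的过程,下面给出一段示意性的Python代码(非专利原文,其中的变量名均为本文假设的示例,数据取自表1、表2中用户A的评分):它在用户A未评分过的物品中选出预测评分最高的物品作为推荐结果。
# 用户A已评分的物品(表1)与推荐系统补全的预测评分(表2)
rated = {"101": 5.0, "102": 3.0, "103": 2.5}
predicted = {"104": 2.0, "105": 4.0, "106": 5.0}
# 仅在未评分物品中比较预测评分, 取评分最高者作为推荐结果
candidates = {item: score for item, score in predicted.items() if item not in rated}
best_item = max(candidates, key=candidates.get)
print(best_item)  # 输出 "106", 即表2中预测评分最高的未评分物品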
本申请实施例提出的基于生成对抗网络的模型训练方法能够训练出效果更好的生成模型,因此在进行物品推荐时以该生成模型对伪造物品的打分作为依据,能够得到更好的推荐效果。
本申请实施例中的基于生成对抗网络的模型训练方法能够应用在很多场景中,例如,广告点击预测、感兴趣的TopN物品推荐、与问题最相关的答案预测等等,下面进行举例说明。
在广告推荐场景中,广告推荐系统需返回一个或多个排序好的广告列表展示用户。本申请实施例可以预测较受用户欢迎的广告,从而提高广告的点击率。本申请可以将用户点击过的广告和没有点击过的广告组成真实物品对,其中,点击过的广告相当于正例真实物品,没有点击过的广告相当于负例真实物品,采用IRGAN技术,可以通过生成模型生成伪造物品对,通过判别模型尽力判别哪些是生成的物品对,哪些是真实的物品对,在IRGAN对抗式训练下,可以预估用户对每个广告的点击概率(相当于对物品的评分)。如图1A所示,通过基于生成对抗网络的模型训练方法对用户针对广告的历史行为数据进行训练,即可得到用户对各个广告的点击概率预测值。
在topN物品推荐场景中,需要向用户推荐该用户最感兴趣的topN个物品,从而促进用户对物品的消费行为,其中,物品可以为电商产品、应用市场APP等。本申请可以将用户消费过或者下载过并且用户对其评分较高的物品和用户消费过并且用户评分较低的物品组成真实物品对,其中,评分较高的物品相当于正例真实物品,评分较低的物品相当于负例真实物品,采用IRGAN技术,可以通过生成模型生成伪造物品对,通过判别模型尽力判别哪些是生成的物品对,哪些是真实的物品对,在IRGAN对抗式训练下,可以预估用户对每个物品的评价比较高,这相当于对物品的评分。如图1B所示,通过基于生成对抗网络的模型训练方法对用户针对物品的历史行为数据进行训练,即可得到用户对各个物品的感兴趣程度的排名,从而向用户输出其感兴趣的topN物品。
在问答场景中,问答系统需要针对用户的提出的问题给出尽量符合用户期望的答案,从而提高用户对问答系统的友好度。本申请可以将用户收到的并且用户对其评分较高的答案和用户收到的并且用户对其评分较低的答案组成真实物品对,其中,评分较高的答案相当于正例真实物品,评分较低的答案相当于负例真实物品,,采用IRGAN技术,可以通过生成模型生成伪造物品对,通过判别模型尽力判别哪些是生成的物品对,哪些是真实的物品对,在IRGAN对抗式训练下,可以预估用户对每个答案的评价比较高,这相当于对物品的评分。如图1C所示,通过基于生成对抗网络的模型训练方法对用户针对问题及答案的历 史行为数据进行训练,即可得到用户对各个答案的满意程度的排名,从而向用户输出其相对较满意的N个答案。
下面结合图1D对执行该基于生成对抗网络的模型训练方法的设备进行介绍。
请参见图1D,图1D是本申请实施例提供的一种设备的结构示意图,该设备用于对物品进行分类,该设备可以为一个设备,如服务器,或者好几个设备构成的一个集群,下面以该设备为一个服务器为例对该设备的结构进行简单的介绍。该设备10包括处理器101、存储器102和通信接口103,所述处理器101、存储器102和通信接口103通过总线相互连接,其中:
该通信接口103用于获取已有的物品的数据,例如,已有的物品的标识、评分,对已有的物品进行评分的用户的信息,等等。可选的,通信接口103可以与其他设备之间建立通信连接,因此可以接收其他设备发送的已有物品的数据或者从其他设备上读取已有的物品的数据;可选的,通信接口103可以连接一个外部的可读存储介质,因此可以从外部的可读存储介质上读取已有的物品的数据;该通信接口103还可能通过其他方式获取已有物品的数据。
存储器102包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM),该存储器102用于存储相关程序指令,以及存储相关数据,该相关数据可以包括通过通信接口103获取到的数据,还可以包括对这些数据进行处理之后产生的新的数据、模型、以及基于模型预测的结果,等等,该数据也可称样本。
处理器101可以是一个或多个中央处理器(central processing unit,CPU),在处理器101是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。该处理器101用于读取所述存储器102中存储的程序执行,执行一种基于生成对抗网络的模型训练方法中涉及到的相关操作,例如,判别模型的训练、生成模型的训练、对物品进行评分预测,等等。请参见图2,图2示意了处理器的大致执行流程,包括将已有的物品的信息、对物品进行评分的用户的信息、对物品的评分值等信息输入到初始的判别模型201中,其中,已有的物品的信息可以包括物品标识ID,对物品进行评分的用户的信息可以包括该用户标识ID。生成模型202也会生成一些伪造的物品并将该伪造物品的相关信息输入到该初始的判别模型201,从而对该判别模型201进行训练,该判别模型201与该生成模型202之间不断进行对抗最终得到一个辨别真实样本和伪造样本能力很强的判别模型201,以及得到一个生成的伪造物品能够非常接近真实物品的生成模型202;之后通过该生成模型202生成伪造物品以及伪造物品的评分;然后排序预测203根据任意一个用户的全部物品的评分,来生成该用户的物品的排序,从而根据排序得到针对该任意一个用户的物品推荐列表,可选的,该物品包括真实物品和伪造物品。在本申请实施例中,该判别模型201包括判别器和注意力网络,判别器负责对真实物品和伪造物品进行分辨,注意力网络用于记录不同用户对真实物品以及伪造物品的注意力权重,从而对生成模型的生成提供参考;生成模型202包括物品生成器和评分生成器,物品生成器用于生成伪造物品,评分生成器用于为该伪造物品生成评分,其中,物品生成器还可以分为负例生成器和正例生成器,正例生成器用于 生成正例伪造物品,负例生成器用于生成负例伪造物品。其中,在物品生成器中采用了动态采样技术进行采样。
可选的,该设备10还可以包括输出组件,例如,显示器、音响等,该输出组件用于向开发人员展示训练模型要用到的参数,因此开发人员可以获知这些参数,也可以对这些参数进行修改,并通过输入组件将修改后的参数输入到该设备10中,例如,输入组件可以包括鼠标、键盘等。另外,该设备10还可以通过输出组件将训练出的模型,以及基于模型预测的结果展示给开发人员。
下面结合图3对本申请实施例中的一种基于生成对抗网络的模型训练方法做更详细介绍。
请参见图3,图3是本申请实施例提供的一种基于生成对抗网络的模型训练方法,该方法可基于图1D所示的设备10来实现,也可以基于其他架构来实现,该方法包括如下步骤:
步骤S301:设备通过生成模型为第一用户生成伪造物品。
具体地,本申请实施例涉及到真实物品和伪造物品,其中,伪造物品包括正例伪造物品、负例伪造物品,真实物品包括正例真实物品和负例真实物品,多个用户中每个用户都有各自的正例伪造物品、负例伪造物品、正例真实物品和负例真实物品这几个概念,其中,对任意一个用户来说,该用户的正例真实物品为该用户有过操作行为且比较关注的物品,该用户的负例真实物品为该用户有过操作行为且不关注的物品,该用户的正例伪造物品为该用户未操作过且预测出比较关注的物品,该用户的负例伪造物品为该用户未操作过且预测出不关注的物品。本申请实施例中的第一用户为多个用户中的一个用户,为了便于理解这里以第一用户为例来进行说明,其他用户的特征可以参照对第一用户的描述。
第一用户对某个终端上展示的物品的操作行为包括下载、评价、点击、浏览等,这些行为会被终端记录下并根据操作其行为对相应的物品评分,例如,可以是用户打的分数也可以是该终端或者上述设备根据用户的行为数据打的分,评分用于衡量用户对该物品的关注程度,可以根据某个用户有操作行为的各个物品的评分来划分该某个用户的正例真实物品和负例真实物品,例如假若评分分值范围为1-5分,那么可以将评分处于4-5分范围的物品定义为该用户的正例真实物品,将评分处于1-3分范围的物品定义为该用户的负例真实物品。这里的物品为应用程序(APP)、或者广告、或者视频、或者歌曲、或者问答系统的答案等等。
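下面用一小段示意性的Python代码说明按评分划分正例真实物品与负例真实物品的做法(代码为本文假设的示例,阈值4分与1-5分的评分范围沿用上文的举例):
def split_real_items(rated_items, pos_threshold=4.0):
    """按评分阈值把某个用户已操作过的物品划分为正例真实物品与负例真实物品(示例)。"""
    positives = [item for item, score in rated_items.items() if score >= pos_threshold]
    negatives = [item for item, score in rated_items.items() if score < pos_threshold]
    return positives, negatives

# 示例: 评分处于4-5分的物品视为正例真实物品, 评分处于1-3分的物品视为负例真实物品
r_pos, r_neg = split_real_items({"I1": 5.0, "I3": 4.5, "I5": 2.0, "I8": 1.5})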
该生成模型为第一用户生成的正例伪造物品为预测的受所述第一用户关注的物品,为第一用户生成的负例伪造物品为预测的不受所述第一用户关注的物品。例如,生成模型为第一用户生成可能受第一用户关注的喜剧电影1、喜剧电影2、喜剧电影3,以及为第一用户生成可能不受第一用户关注的恐怖电影1、恐怖电影2和恐怖电影3,那么喜剧电影1、喜剧电影2、喜剧电影3就属于第一用户的正例伪造物品,恐怖电影1、恐怖电影2和恐怖电影3就属于第一用户的负例伪造物品,该生成模型还会为喜剧电影1、喜剧电影2、喜剧电影3、恐怖电影1、恐怖电影2和恐怖电影3生成评分,生成的评分属于预测的评分,用于表示第一用户对这些电影的喜好程度。该生成模型为其他用户生成正例伪造物品和负例伪造物品的原理可以参照以上针对第一用户的描述。不同用户的正例伪造物品和负例伪造 物品可能相同也可能不相同,对应的评分也可能相同也可能不同。下面对生成模型进行介绍。
具体而言,生成模块的目标是生成伪造物品对并尽可能地逼近真实物品对的相关性分布,其中伪造物品对包括一个正例伪造物品和一个负例伪造物品,真实物品对包括一个正例真实物品和一个负例真实物品。这里生成的伪造物品对的相关性分布如公式(1)所示:
G(f|u)=G((f+,f-)|u)=g+(f+|u)·g-(f-|u,f+)    (1)
在公式(1)中,f代表生成的伪造物品,f+是生成的正例伪造物品,f-是生成的负例伪造物品。生成模型可以分为正例生成器和负例生成器两个子模型,g+代表正例生成器,g-代表负例生成器,u代表第一用户。正例生成器g+用于生成该第一用户u的正例伪造物品的分布,负例生成器g-用于根据正例生成器g+生成的正例伪造物品生成该第一用户的负例伪造物品的分布,其中正例生成器g+生成的正例伪造物品的分布如公式(2)所示:
g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)    (2)
在公式(2)中,e_u表示第一用户的嵌入向量(embedding),e_{f+}是正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表着第一用户的bias。本申请实施例的嵌入向量embedding、偏差值bias可以在第一次初始训练时配置默认值,在每次训练之后embedding、bias通常会更新。
在本申请实施例中,要求生成模型生成的正例伪造物品与负例伪造物品之间存在一些潜在的关系,因此负例伪造物品的生成是在正例伪造物品生成之后。举例来说,负例生成器用内积的方式计算正例伪造物品与负例伪造物品之间的关系,从而得到生成的伪造负例物品的分布如公式(3)所示:
g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)    (3)
在公式(3)中,e_{f-}是待生成的负例伪造物品的embedding。可选的,假若一个用户喜欢喜剧片而不喜欢恐怖片,那么设备一般会训练出喜剧片与恐怖片的这一层“对立”关系,因此在通过公式(2)为用户生成一个喜剧片作为正例伪造物品之后,很有可能会生成一个与该喜剧片类型相对立的电影作为负例伪造物品,即这里的恐怖片,而不太可能生成一个喜剧片作为负例伪造物品。这里作为负例伪造物品的“恐怖片”即是根据在先生成的正例伪造物品“喜剧片”生成的,而不是独立生成的,体现了负例伪造物品对正例伪造物品的依存关系。
可以理解的是,通过上述方式可以生成一系列的正例伪造物品和负例伪造物品,接下来该设备通过评分生成模型为生成的每个正例伪造物品和负例伪造物品生成评分,可选的,评分生成模型生成评分的原理可以如公式(4)所示:
r u,t=e u·e t+b(4)
在公式(4)中,r u,t表示生成的第一用户对第t个伪造物品的评分,e t是第t个伪造物品t的embedding。
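结合公式(2)、(3)、(4),下面给出正例生成器、负例生成器与评分生成器的一个极简Python(numpy)示意实现(非专利原文,embedding、bias等均为随机初始化的假设参数,仅用于说明计算流程):
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

num_items, dim = 10, 8
item_emb = np.random.randn(num_items, dim)   # 各物品的embedding(示例随机初始化)
e_u = np.random.randn(dim)                   # 第一用户的embedding
b = 0.0                                      # 第一用户的偏差值bias

# 公式(2): 正例生成器, 按softmax(e_u·e_i + b)给出正例伪造物品的分布
g_pos = softmax(item_emb @ e_u + b)
f_pos = np.random.choice(num_items, p=g_pos)          # 采样一个正例伪造物品

# 公式(3): 负例生成器, 以正例伪造物品为条件, 用内积刻画正负例之间的依存关系
g_neg = softmax(item_emb @ item_emb[f_pos] + b)
f_neg = np.random.choice(num_items, p=g_neg)          # 采样一个与正例相依存的负例伪造物品

# 公式(4): 评分生成器, r_{u,t} = e_u·e_t + b
score = lambda t: float(e_u @ item_emb[t] + b)
print(score(f_pos), score(f_neg))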
在本申请实施例中,通过以上方式生成一系列正例伪造物品及其评分,以及一系列负例伪造物品及其评分之后,要从生成的正例伪造物品中采样部分正例伪造物品,并从生成 的负例伪造物品中采样部分负例伪造物品,使得采样得到的正例伪造物品与采样得到的负例伪造物品构成多个伪造物品对,每个所述伪造物品对包括第一用户的一个为正例伪造物品和一个为负例伪造物品,生成多个伪造物品对的方式可以如下:
所述设备为第一正例伪造物品匹配一个负例伪造物品以组成一个所述伪造物品对,所述一个负例伪造物品为所述第一用户的所有负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一用户的所有正例伪造物品的数量,所述第一正例伪造物品为生成的正例伪造物品中属于所述第一用户的任意一个采样到的正例伪造物品,M为正整数。可选的,针对一个被采样到的正例伪造物品,从生成的负例伪造物品中采集一个评分最高的负例伪造物品与该正例伪造物品构成一个伪造物品对,此时该被采样到的负例伪造物品从被采样的池子中剔除掉,然后针对下一个被采样到的正例伪造物品,从生成的负例伪造物品中采集一个评分最高的负例伪造物品与该正例伪造物品构成又一个伪造物品对,依此类推即可为采样到的每个正例伪造物品匹配一个负例伪造物品,从而得到多个伪造物品对。下面示意性的例举了一种实现代码:
Generate positive item f+
Generate an initial negative item f-
for n epochs do
    if score(f-) + ε > score(f+) then
        update negative item f-
        break
    end if
end for
可选的,该设备为第一正例真实物品匹配一个负例真实物品以组成一个所述真实物品对,所述一个负例真实物品为所述第一用户的所有负例真实物品中评分排在前N位的负例真实物品,N为所述第一用户的所有正例真实物品的数量,所述第一正例真实物品为已有的正例真实物品中任意一个被采样到的属于所述第一用户的正例真实物品,N为正整数。可选的,针对一个被采样到的正例真实物品,从已有的负例真实物品中采集一个评分最高的负例真实物品与该正例真实物品构成一个真实物品对,此时该被采样到的负例真实物品从被采样的池子中剔除掉,然后针对下一个被采样到的正例真实物品,从已有的负例真实物品中采集一个评分最高的负例真实物品与该正例真实物品构成又一个真实物品对,依此类推即可为采样到的每个正例真实物品匹配一个负例真实物品,从而得到多个真实物品对。
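上述“为每个正例物品匹配当前评分最高的负例物品”的组对过程,可以用下面这段示意性的Python代码概括(函数名与数据均为本文假设的示例,真实物品对与伪造物品对的组对方式相同):
def build_pairs(sampled_pos, candidate_neg, score):
    """为每个采样到的正例物品匹配候选池中评分最高的负例物品, 被匹配过的负例随即剔除(示例)。"""
    pool = list(candidate_neg)
    pairs = []
    for pos in sampled_pos:
        best = max(pool, key=score)      # 取候选池中评分最高的负例
        pairs.append((pos, best))
        pool.remove(best)                # 剔除已使用的负例, 避免重复匹配
    return pairs

# 示例: 伪造物品对 (f+, f-) 或真实物品对 (r+, r-) 都可以用同样的方式组对
neg_scores = {"I2": 0.5, "I4": 1.1, "I9": 1.0}
pairs = build_pairs(["I1", "I3"], list(neg_scores), score=neg_scores.get)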
步骤S302:所述设备以最小化损失函数为目标训练多个真实物品对和多个伪造物品对以获得判别模型。
具体地,训练得到的判别模型如公式(5)所示:
p(v|u) = σ(e_u·e_{v+} - e_u·e_{v-} + b), v∈{r,f}, 其中σ为sigmoid函数    (5)
在公式(5)中,v可以为r,也可以为f。当v为f时,p(f|u)代表该分布为生成模型生成的伪造物品对的分布,e_u表示第一用户的embedding,e_{f+}表示正例伪造物品的embedding,e_{f-}表示负例伪造物品的embedding,b表示第一用户的bias。当v为r时,p(r|u)代表该分布为从真实的物品中采样得到的真实物品对的分布,e_u表示第一用户的embedding,e_{r+}表示正例真实物品的embedding,e_{r-}表示负例真实物品的embedding,b表示第一用户的bias。判别模型负责分辨上述伪造物品对的分布和上述真实物品对的分布之间的差异,可以采用交叉熵(cross-entropy)损失函数(6)进行优化,使得该判别模型能够具有更高的识别真实物品和伪造物品的能力。
D(r,f|u)=cross_entropy(p(r|u),p(f|u)) (6)
可选的,在训练判别模型的过程中,可以针对每个用户执行如下流程:
1、从真实的数据集中采样真实物品对(r +,r -);
2、利用当前生成模型生成伪造物品,并从伪造的物品中采样得到伪造物品对(f +,f -);
3、将(r +,r -)和(f +,f -)一并交给判别模型进行训练,最小化判别模型的损失函数;
4、重复以上步骤直至所有用户对物品的打分都训练完毕。
可选的,将预先设置训练次数达到n次为目标,在这种情况下的训练流程如图4所示。
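以公式(5)、(6)为参照,下面给出判别模型单步损失计算的一个示意性Python(numpy)草稿(其中成对判别采用“sigmoid(正例得分减负例得分)”的常见形式,该形式以及各参数均为本文的假设,并非专利给出的具体实现;实际训练时对该损失做梯度下降即可):
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pair_prob(e_u, e_pos, e_neg, b):
    # 一种常见的成对判别形式: p(v|u) = sigmoid(e_u·e_{v+} - e_u·e_{v-} + b)
    return sigmoid(e_u @ e_pos - e_u @ e_neg + b)

def discriminator_loss(e_u, real_pair, fake_pair, b=0.0):
    """交叉熵形式的判别损失(示例): 真实物品对视为正类, 伪造物品对视为负类。"""
    p_real = pair_prob(e_u, *real_pair, b)   # p(r|u)
    p_fake = pair_prob(e_u, *fake_pair, b)   # p(f|u)
    return -(np.log(p_real + 1e-8) + np.log(1.0 - p_fake + 1e-8))

# 训练时对每个用户采样 (r+, r-) 与 (f+, f-), 最小化该损失即可提升判别能力
dim = 8
e_u = np.random.randn(dim)
real_pair = (np.random.randn(dim), np.random.randn(dim))
fake_pair = (np.random.randn(dim), np.random.randn(dim))
loss = discriminator_loss(e_u, real_pair, fake_pair)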
步骤S303:所述设备根据所述判别模型的损失函数更新所述生成模型。
在一种可选的方案中,所述设备根据所述判别模型的损失函数更新所述生成模型,可以包括:首先,所述设备根据所述判别模型的损失函数获得奖励值reward,其中,所述判别模型的损失函数如公式(6)所示,可以根据公式(6)中的参数D(r,f|u)来计算该奖励值reward,例如,reward=log(1-D(r,f|u));然后,所述设备采用所述奖励值reward更新所述生成模型,其中,该生成模型可以采用策略梯度(policy gradient)的方式来训练,从而得到更新后的生成模型,策略梯度的公式如公式(7)所示:
∇J(u) = E_{f~G(f|u)}[∇log G(f|u)·reward] ≈ (1/N)·Σ_{i=1}^{N} ∇log G(f_i|u)·reward    (7)
在公式(7)中,E_{f~Gu}[·]为期望函数,f~Gu表示f是从生成器G(f|u)中生成,另外,i从1到N取值,f_i代表生成器生成的第i个样本,公式(7)中reward即为前面得到的奖励值。
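公式(7)所示的策略梯度更新,可以用如下示意性的Python(numpy)片段近似实现(采用REINFORCE式的蒙特卡洛估计,reward沿用上文 reward=log(1-D(r,f|u)) 的举例,学习率等数值均为本文假设):
import numpy as np

def policy_gradient_estimate(log_prob_grads, rewards):
    """公式(7)的蒙特卡洛估计: (1/N) * sum_i grad(log G(f_i|u)) * reward_i (示例)。"""
    grads = [g * r for g, r in zip(log_prob_grads, rewards)]
    return sum(grads) / len(grads)

# 示例: 对N个生成样本, 累加"对数概率梯度×奖励值", 再沿该方向做梯度上升即可更新生成模型
N, dim = 4, 8
log_prob_grads = [np.random.randn(dim) for _ in range(N)]    # d log G(f_i|u) / d theta (假设已求得)
rewards = [np.log(1.0 - d) for d in (0.3, 0.5, 0.2, 0.4)]    # reward = log(1 - D(r,f|u)) 的示例取值
theta = np.zeros(dim)
theta += 0.01 * policy_gradient_estimate(log_prob_grads, rewards)  # 学习率0.01为假设值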
在又一种可选的方案中,所述设备根据所述判别模型的损失函数更新所述生成模型以得到新的生成模型,可以包括:第一步,所述设备确定第一用户对物品的注意力指标,第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;第二步,所述设备根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;第三步,所述设备采用所述新的奖励值更新所述生成模型;下面对上述第一步、第二步、第三步展开描述。
第一步:所述设备确定所述第一用户对物品的注意力指标。
具体地,第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品和伪造物品得到。在很多情况下,第一用户对真实物品对和伪造的物品对之间注意力的权重是不同的,我们可以考虑采用注意力网络记忆第一用户对真实物品对和伪造物品对之间的权重。物品对之间有很多潜在因素,以电影评分为例,一些用户喜欢对他们喜欢的电影评较高的分,而对他们不喜欢的电影评较低的分,例如正例电影为5分,负例电影为1分。一些用户喜欢评价他们喜欢和不喜欢的两部电影的中间分数,例如正例电影为4分和负例电影为3分。对于某个物品对,它们之间的电影分数的差距因不同用户而异。对于pair-wise模块,这些因素应该被关注。我们使用一种注意机制来记住这些潜在的成对因素。在这项工作中,注意力由一系列的权重向量表示,它代表了不同物品对每个用户的重要性。对于某个物品对,不同用户的注意力权重通常是不同的。注意力权重越高,他们就越重要。注意力网络可以是一层或多层的神经网络,它和用户,以及生成的伪造物品对和采样的真实物品对有关。通过该注意力网络可以学习第一用户对两对pair的不同的权重。注意力机制的网络结构如图5所示。
具体来说,第一用户对物品的注意力指标α可以通过公式(8)来计算,具体如下:
g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
α=softmax(g(r+,r-,f+,f-|u))    (8)
在公式(8)中,w_u代表第一用户的注意力权重,w_{r+}代表第一用户对正例真实物品的注意力权重,w_{r-}代表第一用户对负例真实物品的注意力权重,w_{f+}代表第一用户对正例伪造物品的注意力权重,w_{f-}代表第一用户对负例伪造物品的注意力权重,b代表第一用户的bias(偏差值),softmax为归一化指数函数。
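公式(8)描述的注意力网络可以用如下示意性的Python(numpy)片段表达(这里把g(·)实现为一层线性组合并对同一用户的多个物品对做softmax,权重、embedding、reward等均为随机或假定的示例值;片段末尾同时演示了下文第二步中 reward_1=α*reward 的加权方式):
import numpy as np

def softmax(x):
    x = x - np.max(x)
    return np.exp(x) / np.exp(x).sum()

def pair_score(e_u, e_rp, e_rn, e_fp, e_fn, W, b=0.0):
    """对一个物品对组合(r+, r-, f+, f-)计算g(r+,r-,f+,f-|u)的示例: 各embedding先加权再求和。"""
    w_u, w_rp, w_rn, w_fp, w_fn = W
    return w_u @ e_u + w_rp @ e_rp + w_rn @ e_rn + w_fp @ e_fp + w_fn @ e_fn + b

dim, n_pairs = 8, 4
W = [np.random.randn(dim) for _ in range(5)]      # 用户、r+、r-、f+、f- 对应的权重(示例)
e_u = np.random.randn(dim)
g_values = []
for _ in range(n_pairs):
    e_rp, e_rn, e_fp, e_fn = (np.random.randn(dim) for _ in range(4))
    g_values.append(pair_score(e_u, e_rp, e_rn, e_fp, e_fn, W))
alpha = softmax(np.array(g_values))               # 每个物品对一个注意力权重, 权重越高越重要

reward = np.full(n_pairs, -0.7)                   # 各物品对由判别模型得到的奖励值(示例)
reward_1 = alpha * reward                         # reward_1 = α * reward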
第二步:所述设备根据所述判别模型的损失函数获得奖励值reward(获得reward的方式前面已经有描述),所述设备通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值。
具体地,所述设备通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,可以具体为:所述设备通过第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,第一用户对物品的注意力指标α、奖励值reward和所述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;其中,所述第一用户为所述多个用户中的一个用户,所述多个用户各自对应的奖励值用于构成新的奖励值,例如,该新的奖励值可以表示为 reward0=(reward_1 1,reward_1 2,reward_1 3,……,reward_1 i,……,reward_1 n-1,reward_1 n),其中,reward_1 i为上述多个用户中第i个用户对应的奖励值。
第三步:所述设备采用所述新的奖励值更新所述生成模型。
具体地,该生成模型可以采用策略梯度(policy gradient)的方式来训练,从而得到新的生成模型,该策略梯度的公式如以下公式(9)所示:
∇J(u) = E_{f~G(f|u)}[∇log G(f|u)·reward0] ≈ (1/N)·Σ_{i=1}^{N} ∇log G(f_i|u)·reward0    (9)
公式(9)的含义可以参照公式(7),公式(9)中的reward0即为前面得到的更新后的奖励值。
新的生成模型的训练流程可以包括如下操作:
1、使用当前的生成模型生成伪造物品对(f +,f -);
2、从真实的数据集里采样真实的物品对(r +,r -);
3、将(r +,r -)和(f +,f -)喂给判别模块,计算奖励值reward;
4、计算attention网络的α;
5、更新reward值得到新的奖励值reward0;
6、利用新的奖励值reward0更新生成模型;
7、重复以上步骤。
可选的,将预先设置训练次数达到m次为目标,在这种情况下的训练流程如图6所示。
可以理解的是,每个物品对(pair)的重要性是不同的,通过引入注意力网络,得到每个pair的重要性权重,可以有效地选择优质的pair,减少劣质pair的负面影响,让我们得到的生成模型、判别模型更具鲁棒性与自适应性。
在本申请实施例中,对判别模型的训练和对生成模型的训练是比较关键的部分,以上也分别对判别模型的训练流程和生成模型的训练流程做了介绍,下面将两个流程结合起来 进行介绍,以方便更好的理解本申请实施例,图7为对应的流程示意图。
准备阶段:
1、用随机的参数θ和φ初始化生成模型和判别模型;
2、确定采用由物品构成的数据集S进行预训练;
训练阶段:
1、Repeat
//训练判别模块
For d_epoch do
2、固定生成模型参数不变;
3、从已有的真实物品构成的数据集S中采样真实物品对(r +,r -);
4、生成模型生成伪造物品并从伪造物品中采集伪造物品对(f +,f -);
5、用(r +,r -)和(f +,f -)训练判别模型;
6、End for
//训练生成模型;
For g_epoch do
7、固定判别模型参数不变;
8、生成模型生成伪造物品并从伪造物品中采集伪造物品对(f +,f -);
9、根据策略梯度算法通过判别模块计算奖励值reward;
10、根据注意力网络更新reward,并使用更新后的奖励值reward0更新生成模型;
11、Until判断模型和生成模型收敛。
在本申请实施例中,更新后的生成模型相对于生成模型而言,具体表现在更新公式(2)、公式(3)和公式(4)中的embedding、bias。
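上述“准备阶段+训练阶段”的整体对抗训练流程,可以概括为下面这段示意性的Python骨架(其中各采样、奖励、注意力函数均为占位的假设实现,仅为说明交替训练的结构,并非可直接套用的完整实现):
def train_irgan_pairwise(users, d_epochs=2, g_epochs=2, rounds=3):
    """交替固定一方参数、训练另一方的极简骨架(示例), 直至达到轮数上限或收敛。"""
    theta, phi = {"G": 0.0}, {"D": 0.0}             # 用随机/默认参数初始化生成模型与判别模型(占位)
    for _ in range(rounds):
        for _ in range(d_epochs):                    # 训练判别模型: 固定生成模型参数
            for u in users:
                real_pairs = sample_real_pairs(u)
                fake_pairs = sample_fake_pairs(u, theta)
                phi["D"] += 0.01                     # 占位: 实际应用(r+,r-)与(f+,f-)最小化交叉熵损失
        for _ in range(g_epochs):                    # 训练生成模型: 固定判别模型参数
            for u in users:
                fake_pairs = sample_fake_pairs(u, theta)
                reward = compute_reward(fake_pairs, phi)      # 由判别模型损失得到reward
                reward0 = attention_weight(u) * reward        # 注意力加权得到reward0
                theta["G"] += 0.01 * reward0         # 占位: 实际应按策略梯度用reward0更新生成模型
    return theta, phi

# 以下函数均为占位实现, 仅为让骨架可运行
sample_real_pairs = lambda u: [("r+", "r-")]
sample_fake_pairs = lambda u, theta: [("f+", "f-")]
compute_reward = lambda pairs, phi: -0.5
attention_weight = lambda u: 0.8

train_irgan_pairwise(users=["U1", "U2"])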
步骤S304:所述设备通过更新后的生成模型生成伪造物品的评分。
具体地,所述伪造物品包括为多个用户中每个用户分别生成的正例伪造物品和负例伪造物品;也即是说,在训练出新的生成模型之后需要通过该生成模型再次为之前生成的每个正例伪造物品和每个负例伪造物品打分,新的生成模型生成的打分更具有参考价值。
步骤S305:所述设备根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向第一用户推荐物品。
具体地,该设备可以对第一用户的真实物品和伪造物品进行排序,其中,排序可以按照分数由高到低的规则排序,也可以按照预先定义的其他规则进行排序;之后根据排序中的顺序向用户推荐物品。该设备还可以为其他用户的真实物品和伪造物品进行排序,例如,假若用户1的伪造物品包括正例伪造物品1且对应评分为4.7、正例伪造物品2且对应评分为4、负例伪造物品1且对应评分为0.5、负例伪造物品2且对应评分为1.1、负例伪造物品3且对应评分为1,用户1的真实物品包括正例真实物品1且对应评分为4.9、正例真实物品2且对应评分为4.5、负例真实物品1且对应评分为3.5、负例真实物品2且对应评分为3.3、负例真实物品3且对应评分为3.4;那么,按照分数从高到低的方式排序的话,得到的排序先后顺序依次为:正例真实物品1、正例伪造物品1、正例真实物品2、正例伪造物品2、负例真实物品1、负例真实物品3、负例真实物品2、负例伪造物品2、负例伪造物品3、负例伪造物品1。之后,按照这种顺序将这些真实物品和伪造物品推荐给用户1。
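步骤S305中按评分对真实物品与伪造物品统一排序并推荐的过程,可以用下面的示意性Python片段表示(物品与评分沿用上文用户1的举例,变量名为本文假设):
# 用户1的真实物品与伪造物品及其评分(沿用上文举例)
scores = {
    "正例真实物品1": 4.9, "正例真实物品2": 4.5,
    "负例真实物品1": 3.5, "负例真实物品2": 3.3, "负例真实物品3": 3.4,
    "正例伪造物品1": 4.7, "正例伪造物品2": 4.0,
    "负例伪造物品1": 0.5, "负例伪造物品2": 1.1, "负例伪造物品3": 1.0,
}
# 按评分由高到低对全部物品统一排序, 并按该顺序向用户推荐
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # 依次为: 正例真实物品1、正例伪造物品1、正例真实物品2、正例伪造物品2、负例真实物品1、...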
以上对本申请实施例的原理进行了详细介绍,下面结合一个具体的例子进行说明。
第一步:数据输入
本申请实施例向数据集中输入所有用户的身份标识ID和每个用户打分过的物品的标识ID。以物品推荐为例,本实施例一共有10个物品,输入的信息如表3所示:
表3
条目序号 用户ID 物品ID
1 U1 I1
2 U1 I3
3 U1 I5
4 U1 I8
5 U2 I2
6 U2 I3
7 U2 I4
在表3中,条目序号为1的第一条代表身份标识为U1的用户评价过物品I1,条目序号为2的第二条代表身份标识为U1的用户评价过物品I3,其余依此类推。
第二步:初始化生成模型的参数和判别模型的参数,包括用户embedding(表示向量)和物品embedding的大小,训练batch的大小,以及训练的速率,其中batch用于表征样本时一次取的样本的数量。
第三步:保持生成模型参数不变,训练判别模型。训练时对于每一个用户需要从真实的物品中采样物品对,物品对的数量与正例真实物品的数量相同,其中,正例真实物品指用户评分过的且评分较高的物品,如4分及以上的物品。在本实施例中,对于用户U1来说,其评价过的物品I1,I3,I5,I8就是正例真实物品,用户U1没评价过的物品I2,I4,I6,I7,I9,I10就是负例真实物品。用户U1有4个评价过的物品,所以采样的真实物品对是四对,具体如下:
(I1,I2),(I3,I4),(I5,I9),(I8,I6);
其中负例真实物品I2、I4、I9、I6是从该用户U1没有评价过的物品中抽取的,可以随机抽取,也可以按照预先规定的其他策略来抽取。在训练时,还需要生成模型生成伪造的物品对。生成模块中的正例生成器负责生成正例伪造物品,负例生成器负责生成负例伪造物品。
例如,对于用户U1,生成模型生成的物品对可以是:
(I1,I2),(I2,I6),(I5,I7),(I8,I9);
在训练判别模型时,需要真实物品对(I1,I2),(I3,I4),(I5,I9),(I8,I6)和生成的伪造物品对(I1,I2),(I2,I6),(I5,I7),(I8,I9)一并交给判别模型,判别模型会通过最小化损失函数来尽可能的区别真实物品对和伪造物品对,达到提升判别能力的目的。重复训练判别模型,直到每一个用户的物品对都被充分训练过。
第四步:保持判别模型参数不变,训练生成模型。和训练判别模型阶段类似,对于每 一个用户,需要从已有的真实物品中采集真实物品对,并通过生成模型生成伪造物品对,依旧以用户U1为例:
针对该用户U1的真实物品对可以如下:
(I1,I2),(I3,I4),(I5,I9),(I8,I6);
针对该用户U1的伪造物品对可以如下:
(I1,I2),(I2,I6),(I5,I7),(I8,I9)。
与训练判别模型时的不同之处在于,判别模型会根据输入的两组物品对计算出reward值。生成模块会根据在该reward的基础上更新得到的新奖励值reward0值来更新参数,重复训练生成模块,直到每一个用户的物品对都被充分训练过。
第五步:重复3-4步骤直至判断模型和生成模型训练至最佳。
第六步:设备根据最终训练得到的生成模型为生成的伪造的物品评分。
第七步:向设备中输入想要与测评分的用户ID,例如,用户U1,该设备会针对该用户U1对所有物品按照评分进行排序,评分高则喜好程度高,该所有物品包括已有的真实物品和生成的伪造物品,表4对该排序结果进行了例举性示意:
表4
用户ID 物品ID 评分
U1 I3 2.54
U1 I5 2.35
U1 I7 1.93
U1 I1 1.54
U1 I8 1.32
U1 I2 1.14
U1 I4 0.97
U1 I10 0.78
U1 I9 0.76
U1 I6 0.54
根据表4所示的推荐列表,可以获知用户U1可能最喜欢的物品是物品I7。
通过执行上述方法,伪造物品对中的负例伪造物品是依赖正例伪造物品而生成的,充分地考虑了负例伪造物品与正例伪造物品之间的潜在关系,使得伪造物品对包含的信息量更丰富,提升了训练效果,增强了生成模型的生成能力,因此对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。进一步地,采集评分高的物品组成物品对,包括真实物品对和伪造物品对,由于评分高的物品更受用户的关注,因此其对用户而言这种方式得到的物品对包含的信息量更大且噪声更小,根据这样的物品对进行训练可以充分地分析受用户关注的特征,从而训练出生成能力更强的生成模型。
以上从硬件器件的角度介绍了一种设备,在实际应用中也有完全通过功能模块对终端结构进行描述的,为了本领域的技术人员能够更好的理解本申请的思想,如图8所示,本 申请实施例还提供了一种基于生成对抗网络的模型训练设备80,该设备包括生成模型801、训练模型802和判别模型803,其中,各个模型的介绍如下:
生成模型801用于为第一用户生成正例伪造物品和负例伪造物品,其中所述负例伪造物品为根据所述正例伪造物品生成的,所述第一用户的正例伪造物品为预测的受所述第一用户关注的物品,所述第一用户的负例伪造物品为预测的不受所述第一用户关注的物品;
训练模型802用于训练多个真实物品对和多个伪造物品对以得到判别模型803,所述判别模型用于分辨所述多个真实物品对与所述多个伪造物品对之间的差异;每个真实物品对包括一个正例真实物品和一个负例真实物品,每个伪造物品对包括一个所述正例伪造物品和一个所述负例伪造物品;所述正例真实物品为根据所述第一用户的操作行为认定的受所述第一用户关注的物品,所述负例真实物品为根据所述第一用户的操作行为认定的不受所述第一用户关注的物品;
所述训练模型802用于根据所述判别模型的损失函数更新所述生成模型。
通过运行上述单元,伪造物品对中的负例伪造物品是依赖正例伪造物品而生成的,充分地考虑了负例伪造物品与正例伪造物品之间的潜在关系,使得伪造物品对包含的信息量更丰富,提升了训练效果,增强了生成模型的生成能力,因此对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
在一种可选的方案中,该设备还包括推荐模型,其中:
在所述训练模型根据所述判别模型的损失函数更新所述生成模型之后,更新后的生成模型用于生成伪造物品的评分,所述伪造物品包括所述为第一用户生成的正例伪造物品和负例伪造物品;
所述推荐模型,用于根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向所述第一用户推荐物品。
可以理解的是,对该生成模型生成的物品和已有的真实物品进行排序所产生的推荐结果对用户而言更具有参考价值。
在又一种可选的方案中,在所述生成模型为第一用户生成正例伪造物品和负例伪造物品之后,所述训练模型训练多个真实物品对和多个伪造物品对以得到判别模型之前,所述训练模型还用于:
为多个第一正例伪造物品各匹配一个第一负例伪造物品以组成所述多个伪造物品对,所述第一负例伪造物品属于所述第一用户的负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一正例伪造物品的数量,所述第一正例伪造物品为从所述生成模型生成的正例伪造物品中采样到的所述第一用户的正例伪造物品;
为多个第一正例真实物品各匹配一个第一负例真实物品以组成所述多个真实物品对,所述第一负例真实物品属于所述第一用户的负例真实物品中评分排在前N位的负例真实物品,N为所述第一正例真实物品的数量,所述第一正例真实物品为从所述第一用户已有的正例真实物品中采样到的一个正例真实物品。
可以理解的是,采集评分高的物品组成物品对,包括真实物品对和伪造物品对,由于评分高的物品更受用户的关注,因此其对用户而言这种方式得到的物品对包含的信息量更大且噪声更小,根据这样的物品对进行训练可以充分地分析受用户关注的特征,从而训练 出生成能力更强的生成模型。
在又一种可选的方案中,所述初始生成模型包括正例生成模型、负例生成模型和评分生成模型;所述生成模型,用于为第一用户生成正例伪造物品和负例伪造物品,具体为:
用于通过正例生成模型生成第一用户的正例伪造物品的分布,所述正例生成模型为:
g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)
用于通过负例生成模型生成第一用户的负例伪造物品的分布,所述负例生成模型为:
g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)
用于通过评分生成器生成每个正例伪造物品的评分和每个负例伪造物品的评分;
其中,g+(f+|u)为所述正例伪造物品的分布,e_u为第一用户的嵌入向量embedding,e_{f+}是待生成的正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表所述第一用户的偏差值bias;g-(f-|u,f+)为所述负例伪造物品的分布,e_{f-}是待生成的负例伪造物品的embedding。
在又一种可选的方案中,用于根据所述判别模型的损失函数更新所述生成模型,具体为:
确定所述第一用户对物品的注意力指标,所述第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;
根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;
采用所述新的奖励值更新所述生成模型。
可以理解的是,每个物品对的重要性是不同的,通过引入注意力网络,得到每个物品对的重要性权重,可以有效地选择优质的物品对,减少劣质物品对的负面影响,让我们得到的生成模型、判别模型更具鲁棒性与自适应性。这里的物品对可以为真实物品对,也可以为伪造物品对。
在又一种可选的方案中,所述训练模型确定所述第一用户对物品的注意力指标,具体为:
采用注意力网络根据如下公式计算第一用户对物品的注意力指标;
g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
α=softmax(g(r+,r-,f+,f-|u))
其中,α为所述第一用户u对物品的注意力指标,w_u表示训练出的所述第一用户的权重,w_{r+}表示训练出的第一用户的正例真实物品的权重,w_{r-}表示训练出的第一用户的负例真实物品的权重,w_{f+}表示训练出的所述第一用户的正例伪造物品的权重,w_{f-}表示训练出的所述第一用户的负例伪造物品的权重;b为所述第一用户的偏差值bias。
在又一种可选的方案中,所述通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,具体为:
通过所述第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,所述第一用户对物品的注意力指标α、奖励值reward和所 述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;
根据所述第一用户对应的奖励值reward_1确定新的奖励值。
需要说明的是,各个单元的实现还可以对应参照前述实施例中描述的基于生成对抗网络的模型训练方法,例如步骤S301-S305。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在处理器上运行时,实现前述实施例中描述的基于生成对抗网络的模型训练方法,例如步骤S301-S305。
本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在处理器上运行时,实现前述实施例中描述的基于生成对抗网络的模型训练方法,例如步骤S301-S305。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储程序代码的介质。

Claims (15)

  1. 一种基于生成对抗网络的模型训练方法,其特征在于,包括:
    设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品,其中所述负例伪造物品为根据所述正例伪造物品生成的,所述第一用户的正例伪造物品为预测的受所述第一用户关注的物品,所述第一用户的负例伪造物品为预测的不受所述第一用户关注的物品;
    所述设备训练多个真实物品对和多个伪造物品对以得到判别模型,所述判别模型用于分辨所述多个真实物品对与所述多个伪造物品对之间的差异;每个真实物品对包括一个正例真实物品和一个负例真实物品,每个伪造物品对包括一个所述正例伪造物品和一个所述负例伪造物品;所述正例真实物品为根据所述第一用户的操作行为认定的受所述第一用户关注的物品,所述负例真实物品为根据所述第一用户的操作行为认定的不受所述第一用户关注的物品;
    所述设备根据所述判别模型的损失函数更新所述生成模型。
  2. 根据权利要求1所述的方法,其特征在于,所述设备根据所述判别模型的损失函数更新所述生成模型之后,还包括:
    所述设备通过更新后的生成模型生成伪造物品的评分,所述伪造物品包括所述为第一用户生成的正例伪造物品和负例伪造物品;
    所述设备根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向所述第一用户推荐物品。
  3. 根据权利要求1或2所述的方法,其特征在于,所述设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品之后,所述设备训练多个真实物品对和多个伪造物品对以得到判别模型之前,还包括:
    所述设备为多个第一正例伪造物品各匹配一个第一负例伪造物品以组成所述多个伪造物品对,所述第一负例伪造物品属于所述第一用户的负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一正例伪造物品的数量,所述第一正例伪造物品为从所述生成模型生成的正例伪造物品中采样到的所述第一用户的正例伪造物品;
    所述设备为多个第一正例真实物品各匹配一个第一负例真实物品以组成所述多个真实物品对,所述第一负例真实物品属于所述第一用户的负例真实物品中评分排在前N位的负例真实物品,N为所述第一正例真实物品的数量,所述第一正例真实物品为从所述第一用户已有的正例真实物品中采样到的一个正例真实物品。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述初始生成模型包括正例生成模型、负例生成模型和评分生成模型;所述设备通过生成模型为第一用户生成正例伪造物品和负例伪造物品,包括:
    所述设备通过正例生成模型生成第一用户的正例伪造物品的分布,所述正例生成模型为:
    g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)
    所述设备通过负例生成模型生成第一用户的负例伪造物品的分布,所述负例生成模型为:
    g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)
    所述设备通过评分生成器生成每个正例伪造物品的评分和每个负例伪造物品的评分;
    其中,g+(f+|u)为所述正例伪造物品的分布,e_u为第一用户的嵌入向量embedding,e_{f+}是待生成的正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表所述第一用户的偏差值bias;g-(f-|u,f+)为所述负例伪造物品的分布,e_{f-}是待生成的负例伪造物品的embedding。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述设备根据所述判别模型的损失函数更新所述生成模型,包括:
    所述设备确定所述第一用户对物品的注意力指标,所述第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;
    所述设备根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;
    所述设备采用所述新的奖励值更新所述生成模型。
  6. 根据权利要求5所述的方法,其特征在于,所述设备确定所述第一用户对物品的注意力指标,包括:
    所述设备采用注意力网络根据如下公式计算第一用户对物品的注意力指标;
    g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
    α=softmax(g(r+,r-,f+,f-|u))
    其中,α为所述第一用户u对物品的注意力指标,w_u表示训练出的所述第一用户的权重,w_{r+}表示训练出的第一用户的正例真实物品的权重,w_{r-}表示训练出的第一用户的负例真实物品的权重,w_{f+}表示训练出的所述第一用户的正例伪造物品的权重,w_{f-}表示训练出的所述第一用户的负例伪造物品的权重;b为所述第一用户的偏差值bias。
  7. 根据权利要求5或6所述的方法,其特征在于,所述通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,包括:
    通过所述第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,所述第一用户对物品的注意力指标α、奖励值reward和所述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;
    根据所述第一用户对应的奖励值reward_1确定新的奖励值。
  8. 一种基于生成对抗网络的模型训练设备,其特征在于,包括:
    生成模型,用于为第一用户生成正例伪造物品和负例伪造物品,其中所述负例伪造物品为根据所述正例伪造物品生成的,所述第一用户的正例伪造物品为预测的受所述第一用户关注的物品,所述第一用户的负例伪造物品为预测的不受所述第一用户关注的物品;
    训练模型,用于训练多个真实物品对和多个伪造物品对以得到判别模型,所述判别模型用于分辨所述多个真实物品对与所述多个伪造物品对之间的差异;每个真实物品对包括一个正例真实物品和一个负例真实物品,每个伪造物品对包括一个所述正例伪造物品和一个所述负例伪造物品;所述正例真实物品为根据所述第一用户的操作行为认定的受所述第一用户关注的物品,所述负例真实物品为根据所述第一用户的操作行为认定的不受所述第一用户关注的物品;
    所述训练模型,用于根据所述判别模型的损失函数更新所述生成模型。
  9. 根据权利要求8所述的设备,其特征在于,还包括推荐模型,其中:
    在所述训练模型根据所述判别模型的损失函数更新所述生成模型之后,更新后的生成模型用于生成伪造物品的评分,所述伪造物品包括所述为第一用户生成的正例伪造物品和负例伪造物品;
    所述推荐模型,用于根据伪造物品的评分和已有的真实物品的评分,对所述真实物品和所述伪造物品排序,并根据排序中的顺序向所述第一用户推荐物品。
  10. 根据权利要求8或9所述的设备,其特征在于,在所述生成模型为第一用户生成正例伪造物品和负例伪造物品之后,所述训练模型训练多个真实物品对和多个伪造物品对以得到判别模型之前,所述训练模型还用于:
    为多个第一正例伪造物品各匹配一个第一负例伪造物品以组成所述多个伪造物品对,所述第一负例伪造物品属于所述第一用户的负例伪造物品中评分排在前M位的负例伪造物品,M为所述第一正例伪造物品的数量,所述第一正例伪造物品为从所述生成模型生成的正例伪造物品中采样到的所述第一用户的正例伪造物品;
    为多个第一正例真实物品各匹配一个第一负例真实物品以组成所述多个真实物品对,所述第一负例真实物品属于所述第一用户的负例真实物品中评分排在前N位的负例真实物品,N为所述第一正例真实物品的数量,所述第一正例真实物品为从所述第一用户已有的正例真实物品中采样到的一个正例真实物品。
  11. 根据权利要求8-10任一项所述的设备,其特征在于,所述初始生成模型包括正例生成模型、负例生成模型和评分生成模型;所述生成模型,用于为第一用户生成正例伪造物品和负例伪造物品,具体为:
    用于通过正例生成模型生成第一用户的正例伪造物品的分布,所述正例生成模型为:
    g+(f+|u) = exp(e_u·e_{f+} + b) / Σ_i exp(e_u·e_i + b)
    用于通过负例生成模型生成第一用户的负例伪造物品的分布,所述负例生成模型为:
    g-(f-|u,f+) = exp(e_{f+}·e_{f-} + b) / Σ_i exp(e_{f+}·e_i + b)
    用于通过评分生成器生成每个正例伪造物品的评分和每个负例伪造物品的评分;
    其中,g+(f+|u)为所述正例伪造物品的分布,e_u为第一用户的嵌入向量embedding,e_{f+}是待生成的正例伪造物品的embedding,e_i是第i个正例伪造物品的embedding,b代表所述第一用户的偏差值bias;g-(f-|u,f+)为所述负例伪造物品的分布,e_{f-}是待生成的负例伪造物品的embedding。
  12. 根据权利要求8-11任一项所述的设备,其特征在于,所述训练模型,用于根据所述判别模型的损失函数更新所述生成模型,具体为:
    确定所述第一用户对物品的注意力指标,所述第一用户对物品的注意力指标为采用注意力网络训练所述第一用户的真实物品评分和伪造物品评分得到;
    根据所述判别模型的损失函数获得奖励值reward,并通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值;
    采用所述新的奖励值更新所述生成模型。
  13. 根据权利要求12所述的设备,其特征在于,所述训练模型确定所述第一用户对物品的注意力指标,具体为:
    采用注意力网络根据如下公式计算第一用户对物品的注意力指标;
    g(r+,r-,f+,f-|u) = w_u·e_u + w_{r+}·e_{r+} + w_{r-}·e_{r-} + w_{f+}·e_{f+} + w_{f-}·e_{f-} + b
    α=softmax(g(r+,r-,f+,f-|u))
    其中,α为所述第一用户u对物品的注意力指标,w_u表示训练出的所述第一用户的权重,w_{r+}表示训练出的第一用户的正例真实物品的权重,w_{r-}表示训练出的第一用户的负例真实物品的权重,w_{f+}表示训练出的所述第一用户的正例伪造物品的权重,w_{f-}表示训练出的所述第一用户的负例伪造物品的权重;b为所述第一用户的偏差值bias。
  14. 根据权利要求12或13所述的设备,其特征在于,所述通过所述第一用户对物品的注意力指标优化所述奖励值reward以得到新的奖励值,具体为:
    通过所述第一用户对物品的注意力指标α优化所述奖励值reward以得到所述第一用户对应的奖励值reward_1,其中,所述第一用户对物品的注意力指标α、奖励值reward和所述第一用户对应的奖励值reward_1满足如下关系:reward_1=α*reward;
    根据所述第一用户对应的奖励值reward_1确定新的奖励值。
  15. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当其在处理器上运行时,实现权利要求1-8任一所述的方法。
PCT/CN2019/128917 2018-12-29 2019-12-26 一种基于生成对抗网络的模型训练方法及设备 WO2020135642A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811654623.5 2018-12-29
CN201811654623.5A CN109902823A (zh) 2018-12-29 2018-12-29 一种基于生成对抗网络的模型训练方法及设备

Publications (1)

Publication Number Publication Date
WO2020135642A1 true WO2020135642A1 (zh) 2020-07-02

Family

ID=66943487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/128917 WO2020135642A1 (zh) 2018-12-29 2019-12-26 一种基于生成对抗网络的模型训练方法及设备

Country Status (2)

Country Link
CN (1) CN109902823A (zh)
WO (1) WO2020135642A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902823A (zh) * 2018-12-29 2019-06-18 华为技术有限公司 一种基于生成对抗网络的模型训练方法及设备
CN110827120A (zh) * 2019-10-18 2020-02-21 郑州大学 基于gan网络的模糊推荐方法、装置、电子设备及存储介质
CN110929085B (zh) * 2019-11-14 2023-12-19 国家电网有限公司 基于元语义分解的电力客服留言生成模型样本处理系统及方法
CN112395494B (zh) * 2020-10-15 2022-10-14 南京邮电大学 一种基于生成对抗网络的双向动态推荐系统
CN113326400B (zh) * 2021-06-29 2024-01-12 合肥高维数据技术有限公司 基于深度伪造视频检测的模型的评价方法及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564129A (zh) * 2018-04-24 2018-09-21 电子科技大学 一种基于生成对抗网络的轨迹数据分类方法
US20180293712A1 (en) * 2017-04-06 2018-10-11 Pixar Denoising monte carlo renderings using generative adversarial neural networks
CN109902823A (zh) * 2018-12-29 2019-06-18 华为技术有限公司 一种基于生成对抗网络的模型训练方法及设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615767B (zh) * 2015-02-15 2017-12-29 百度在线网络技术(北京)有限公司 搜索排序模型的训练方法、搜索处理方法及装置
CN108875766B (zh) * 2017-11-29 2021-08-31 北京旷视科技有限公司 图像处理的方法、装置、系统及计算机存储介质
CN108595493B (zh) * 2018-03-15 2022-02-08 腾讯科技(深圳)有限公司 媒体内容的推送方法和装置、存储介质、电子装置
CN108665058B (zh) * 2018-04-11 2021-01-05 徐州工程学院 一种基于分段损失的生成对抗网络方法
CN108921220A (zh) * 2018-06-29 2018-11-30 国信优易数据有限公司 图像复原模型训练方法、装置及图像复原方法和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293712A1 (en) * 2017-04-06 2018-10-11 Pixar Denoising monte carlo renderings using generative adversarial neural networks
CN108564129A (zh) * 2018-04-24 2018-09-21 电子科技大学 一种基于生成对抗网络的轨迹数据分类方法
CN109902823A (zh) * 2018-12-29 2019-06-18 华为技术有限公司 一种基于生成对抗网络的模型训练方法及设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JUN WANG ET AL: "IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models", SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 11 August 2017 (2017-08-11), pages 1 - 12, XP081317789, DOI: 10.1145/3077136.3080786 *

Also Published As

Publication number Publication date
CN109902823A (zh) 2019-06-18

Similar Documents

Publication Publication Date Title
WO2020135642A1 (zh) 一种基于生成对抗网络的模型训练方法及设备
WO2020228514A1 (zh) 内容推荐方法、装置、设备及存储介质
WO2022041979A1 (zh) 一种信息推荐模型的训练方法和相关装置
CN111797321B (zh) 一种面向不同场景的个性化知识推荐方法及系统
Nie et al. Data-driven answer selection in community QA systems
CN102708131B (zh) 将消费者自动分类到微细分中
WO2020048084A1 (zh) 资源推荐方法、装置、计算机设备及计算机可读存储介质
Wasid et al. A particle swarm approach to collaborative filtering based recommender systems through fuzzy features
WO2019144892A1 (zh) 数据处理方法、装置、存储介质和电子装置
CN106844530A (zh) 一种问答对分类模型的训练方法和装置
CN107123057A (zh) 用户推荐方法及装置
CN108322317A (zh) 一种账号识别关联方法及服务器
Alabdulrahman et al. Catering for unique tastes: Targeting grey-sheep users recommender systems through one-class machine learning
CN106951471A (zh) 一种基于svm的标签发展趋势预测模型的构建方法
CN110288459A (zh) 贷款预测方法、装置、设备及存储介质
CN106846029B (zh) 基于遗传算法和新型相似度计算策略的协同过滤推荐算法
CN112380433A (zh) 面向冷启动用户的推荐元学习方法
CN111597446B (zh) 基于人工智能的内容推送方法、装置、服务器和存储介质
WO2023024408A1 (zh) 用户特征向量确定方法、相关设备及介质
CN111695084A (zh) 模型生成方法、信用评分生成方法、装置、设备及存储介质
CN109933720B (zh) 一种基于用户兴趣自适应演化的动态推荐方法
CN112651790B (zh) 基于快消行业用户触达的ocpx自适应学习方法和系统
CN112148994B (zh) 信息推送效果评估方法、装置、电子设备及存储介质
CN111368131B (zh) 用户关系识别方法、装置、电子设备及存储介质
Acharya et al. Using Optimal Embeddings to Learn New Intents with Few Examples: An Application in the Insurance Domain.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19903143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19903143

Country of ref document: EP

Kind code of ref document: A1