CN113129108B - Product recommendation method and device based on Double DQN algorithm - Google Patents

Product recommendation method and device based on Double DQN algorithm

Info

Publication number
CN113129108B
CN113129108B CN202110452994.0A CN202110452994A CN113129108B CN 113129108 B CN113129108 B CN 113129108B CN 202110452994 A CN202110452994 A CN 202110452994A CN 113129108 B CN113129108 B CN 113129108B
Authority
CN
China
Prior art keywords
product
historical
basic information
feature
target user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110452994.0A
Other languages
Chinese (zh)
Other versions
CN113129108A (en)
Inventor
王光臣
张衡
张盼盼
王宇
潘宇光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110452994.0A priority Critical patent/CN113129108B/en
Publication of CN113129108A publication Critical patent/CN113129108A/en
Application granted granted Critical
Publication of CN113129108B publication Critical patent/CN113129108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a product recommendation method and system based on the Double DQN algorithm, comprising the following steps: obtaining basic information of a target user; inputting the basic information of the target user into a trained Double DQN model, which outputs a predicted satisfaction degree for each product; and sorting the products in descending order of predicted satisfaction and recommending the sorted products to the target user. Not only the user's personal information, such as personal risk preference and income, but also information about the product itself, such as its historical purchase data and purchase satisfaction, is fully analyzed, so that the most suitable products are recommended to the user.

Description

Product recommendation method and device based on Double DQN algorithm
Technical Field
The invention relates to the technical field of product recommendation, in particular to a product recommendation method and device based on a Double DQN algorithm.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
In recent years, with the rapid development of internet technology, product recommendation systems have rapidly developed, and are now widely used in various services such as e-commerce services and financial product recommendation services.
Current product recommendation methods are generally based on user information: they analyze data such as the user's risk preference to obtain a similarity between users and products, and then recommend products according to that similarity. However, existing methods do not fully analyze information about the products a user purchases, such as a product's historical purchase data and price changes, and therefore cannot recommend products accurately to the clients who need them.
Therefore, the product recommendation methods and devices in the prior art are not well designed, cannot meet users' requirements, and cannot provide a satisfactory user experience.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a product recommendation method and device based on a Double DQN algorithm.
In a first aspect, the invention provides a product recommendation method based on a Double DQN algorithm;
a product recommendation method based on Double DQN algorithm comprises the following steps:
basic information of a target user is obtained;
processing the basic information of the target user and extracting its features;
inputting the features representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product;
sorting the products in descending order of predicted satisfaction degree, and recommending the sorted products to the target user;
the deep reinforcement learning model refers to a Double DQN algorithm.
In a second aspect, the invention provides a product recommendation device based on a Double DQN algorithm;
product recommendation device based on Double DQN algorithm includes:
an acquisition module configured to: obtain basic information of a target user;
a feature extraction module configured to: process the basic information of the target user and extract its features;
a prediction module configured to: input the features representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product;
a recommendation module configured to: sort the products in descending order of predicted satisfaction degree and recommend the sorted products to the target user;
the deep reinforcement learning model refers to a Double DQN algorithm.
In a third aspect, the present invention also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first aspect.
In a fourth aspect, the present invention also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects: not only the user's personal information, such as personal risk preference and income, but also information about the product itself, such as its historical purchase data and purchase satisfaction, is utilized, so that the most suitable products are recommended to the user.
The invention applies the Double DQN algorithm (a double Q-learning algorithm in deep reinforcement learning) to product recommendation and uses it to fully analyze product data, thereby recommending products with higher user satisfaction.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of an implementation of a product recommendation method based on a Double DQN algorithm provided by the invention;
FIG. 2 is a reinforcement learning framework diagram of one embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, the product recommendation method based on Double DQN algorithm includes:
s101: basic information of a target user is obtained;
s102: processing the basic information of the target user and extracting its features;
s103: inputting the features representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product;
s104: sorting the products in descending order of predicted satisfaction degree, and recommending the sorted products to the target user;
the deep reinforcement learning model refers to a Double DQN algorithm.
Further, the step S101: basic information of a target user is obtained; the method specifically comprises the following steps:
the monthly average income of the target user, the times of purchasing historical products, the frequency of purchasing the historical products, the risk level of purchasing the historical products and the price fluctuation data of purchasing the historical products are obtained.
Further, the step S102: processing basic information of a target user and extracting characteristics of the basic information; the method specifically comprises the following steps:
feature extraction is performed by convolutional neural networks.
Further, the step S103: inputting the characteristics representing the basic information of the target user into the trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product; the training steps comprise:
constructing a training set, wherein the training set consists of basic information of users whose historical purchase satisfaction with products is known;
preprocessing the basic information of the users in the training set, taking the state features of the users' basic information obtained after preprocessing together with the known historical purchase satisfaction of the products as input values of the deep reinforcement learning model, and training the model to obtain the trained deep reinforcement learning model.
Further, the preprocessing the basic information of the users in the training set specifically includes:
dividing the users' average monthly income, historical product purchase counts, historical product purchase frequency, risk level of historically purchased products, and price fluctuation data in the training set into N time units to obtain a plurality of segmented data s_t. The time units may be chosen according to the time dimension of the data, for example one time unit may be set to one month; the subscript t denotes a time point, so that the time interval of the data represented by the state is recorded;
all the data in the same time unit after segmentation are subjected to feature extraction through a convolutional neural network CNN to obtain a month average income feature, a historical product purchase frequency feature, a risk level feature of a historical purchased product and a price fluctuation data feature;
the month average income feature, the historical product purchase frequency feature, the risk level feature of the historically purchased products and the price fluctuation data feature are concatenated in series to obtain the state feature χ(s_t) corresponding to that time unit, and the state features under all time units are obtained in the same way.
It should be appreciated that, because the users' monthly average income, historical product purchase counts, historical product purchase frequency, risk level of purchased products, and price fluctuation data in the training set are very large in quantity and variety, the various input data must be preprocessed to extract the features of all the data and reduce its dimensionality. The segmented data s_t are passed through a deep neural network for feature extraction, and the extracted features are χ(s_t). There are a variety of feature extraction networks; for example, a suitable feature extraction network can be chosen so that the state feature χ(s_t) extracted from each s_t is a multidimensional vector. If several kinds of historical data are considered, the extracted features may take the form of a combination of multiple vectors, such as a matrix.
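Purely by way of illustration and not limitation, the preprocessing described above can be sketched in Python with PyTorch: each kind of data within one time unit is passed through a small one-dimensional CNN, and the resulting feature vectors are concatenated into the state feature χ(s_t). The layer sizes, channel counts, and the number of raw samples per time unit are assumptions made for the sketch, not values specified by the invention.

import torch
import torch.nn as nn

class FieldEncoder(nn.Module):
    """Small 1-D CNN mapping one kind of raw data within a time unit to a feature vector."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the samples inside the time unit
        )
        self.proj = nn.Linear(8, feat_dim)

    def forward(self, x):              # x: (batch, 1, samples_in_unit)
        h = self.conv(x).squeeze(-1)   # (batch, 8)
        return self.proj(h)            # (batch, feat_dim)

def state_feature(fields, encoders):
    """Concatenate the per-field CNN features into the state feature chi(s_t)."""
    feats = [enc(x) for enc, x in zip(encoders, fields)]
    return torch.cat(feats, dim=-1)

# One encoder per kind of data: income, purchase count, purchase frequency,
# risk level, price fluctuation (hypothetical ordering)
encoders = nn.ModuleList([FieldEncoder() for _ in range(5)])
fields = [torch.randn(1, 1, 30) for _ in range(5)]   # 30 raw samples per time unit (assumed)
chi_s_t = state_feature(fields, encoders)            # shape (1, 80): the state feature chi(s_t)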
Further, the step S103: inputting the features representing the basic information of the target user into the trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product; the method specifically comprises the following steps:
inputting the features representing the basic information of the target user into the trained deep reinforcement learning model to obtain the predicted satisfaction degree of the product, wherein the predicted satisfaction degree of the product is a value obtained through the optimal Q-value function of the Double DQN algorithm.
The training principle of the Double DQN algorithm of deep reinforcement learning is described in detail below, including how the optimal Q-value function Q*(χ(s_t), a) is obtained for all state features χ(s_t):
As shown in fig. 2, at each time point t the state in which the agent is currently located is characterized as χ(s_t); the agent then performs operation a_t, obtains a reward r_t from the environment, and observes a new state feature χ(s_{t+1}).
The goal of agent learning is to select a strategy π that maximizes the expected total reward, where π is the sequence of actions a_t taken at each time t, i.e. π = {a_t, a_{t+1}, a_{t+2}, …, a_T}, where T is the set terminal time;
maximizing the expected return means maximizing the future cumulative discounted reward, i.e. maximizing:
r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T, where 0 ≤ γ ≤ 1 is the discount rate.
The value of taking action a under the state feature χ(s) and thereafter following policy π is denoted:
Q^π(χ(s), a) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward of all possible decision sequences made according to strategy π after performing operation a, starting from the state feature χ(s).
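Purely as an illustrative aid (not part of the claimed method), the cumulative discounted reward defined above can be computed as follows in Python; the reward values and the discount rate are arbitrary example numbers.

def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: satisfaction rewards over four remaining time points
print(discounted_return([1.0, 0.5, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.45 + 0.0 + 1.458 = 2.908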
An optimal Q-value function is also defined:
Q*(χ(s), a) = max_π Q^π(χ(s), a) = max_π E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward obtained by performing operation a under the state feature χ(s) and thereafter deciding according to the optimal strategy.
The optimal Q-value function Q*(χ(s), a) under each state feature χ(s) is obtained iteratively:
From the Bellman equation:
Q*(χ(s), a) = E[r_t + γ max_{a'} Q*(χ(s'), a') | χ(s_t) = χ(s), a_t = a].
Accordingly, Q*(χ(s), a) is estimated by a function approximator Q(χ(s), a; θ), and θ is iterated by stochastic gradient descent (SGD) using the Double DQN target
y_t = r_t + γ Q(χ(s_{t+1}), argmax_{a'} Q(χ(s_{t+1}), a'; θ_t); θ_t^-)
and the update
θ_{t+1} = θ_t + α (y_t − Q(χ(s_t), a_t; θ_t)) ∇_{θ_t} Q(χ(s_t), a_t; θ_t),
where α is the learning rate and θ^- is updated every k steps, i.e. every k steps
θ_t^- = θ_t,
and at all other steps θ^- remains unchanged.
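Purely by way of illustration, the update described above can be sketched in Python with PyTorch. The network architecture, learning rate, action set size, and transition format below are assumptions made for the sketch, not values prescribed by the invention; the online network holds θ and the target network holds θ^-, which is synchronized every k steps.

import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Online Q-network Q(chi(s), a; theta); outputs one Q-value per allowed operation."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, chi_s):
        return self.net(chi_s)

state_dim, num_actions, gamma, k = 80, 3, 0.95, 100   # assumed sizes; 3 operations: buy / hold / sell
q_net = QNet(state_dim, num_actions)                  # parameters theta
target_net = copy.deepcopy(q_net)                     # parameters theta^-
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def double_dqn_step(step, chi_s, a, r, chi_s_next, done):
    """One SGD update of theta on a batch of transitions (chi(s_t), a_t, r_t, chi(s_{t+1}))."""
    with torch.no_grad():
        # action selected by the online network theta, evaluated by the target network theta^-
        best_a = q_net(chi_s_next).argmax(dim=1, keepdim=True)
        target_q = target_net(chi_s_next).gather(1, best_a).squeeze(1)
        y = r + gamma * (1.0 - done) * target_q        # y_t as in the Double DQN target
    q_sa = q_net(chi_s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)             # squared TD error minimized by SGD
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % k == 0:                                   # theta^- <- theta every k steps
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()

# Example call with a random batch of 4 transitions (shapes only, illustrative)
loss = double_dqn_step(step=1,
                       chi_s=torch.randn(4, state_dim),
                       a=torch.randint(0, num_actions, (4,)),
                       r=torch.rand(4),
                       chi_s_next=torch.randn(4, state_dim),
                       done=torch.zeros(4))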
Thus, a set of allowed operations is given; in this embodiment, the set of allowed operations a for a product may include, but is not limited to, purchasing the product, not purchasing the product, selling an already owned product, and the like. The rewards in the deep reinforcement learning model are the satisfaction levels obtained for each operation, and they can be set in a number of ways, for example according to the user's personal information such as risk preference.
Following the training principle of the deep reinforcement learning model, the features extracted from the data in the training set are used to iterate to the final θ*, and the corresponding Q(χ(s), a; θ*) = Q*(χ(s), a) is the optimal Q-value function.
Once the optimal Q-value function Q*(χ(s_t), a) corresponding to each state feature χ(s_t) is obtained, selecting under the state feature χ(s_t) the operation a with the largest value of Q*(χ(s_t), a) gives the optimal operation a* that maximizes the product's future cumulative satisfaction in that state. Adopting the optimal operation a* in every state χ(s_t) maximizes the future satisfaction of the product, and the totality of all optimal operations, called the optimal strategy π*, is denoted
π* = {a_t*, a_{t+1}*, …, a_T*}, where a_t* = argmax_a Q*(χ(s_t), a).
After the optimal strategy π* is obtained, the predicted maximum satisfaction of a product can be obtained by simulating the execution of π* on the product:
obtaining the optimal strategy π* means obtaining, for each state feature χ(s_t), the corresponding optimal operation
a_t* = argmax_a Q*(χ(s_t), a).
The operation process of each product is then simulated: for each state s_t in the simulated transaction, the optimal transaction operation a_t* corresponding to χ(s_t) is adopted, and the predicted maximum satisfaction of the product is obtained. During the simulated transaction, a deadline T, operation times, the number of operations and the like can also be set for the product; for example, the deadline T may be set to six months, i.e. the cumulative total predicted satisfaction of the product over six months is simulated. It can also be specified that an operation is performed once every five days, with the data features of the five days before the operation day serving as the current state feature χ(s_t); at each operation on state s_t, the optimal transaction operation a_t* corresponding to χ(s_t) is adopted. The predicted maximum satisfaction obtained in this way is the predicted satisfaction of the product output in step S103.
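Purely by way of illustration, the simulation of the optimal strategy can be sketched as a greedy rollout in Python; q_net is the trained network from the previous sketch, and simulate_step stands in for the embodiment's transaction simulation (its interface is an assumption, not a defined API).

import torch

def predicted_max_satisfaction(q_net, initial_state, simulate_step, horizon_steps=36):
    """Roll out the greedy (optimal) strategy pi* and accumulate the simulated satisfaction.

    horizon_steps = 36 corresponds to a six-month deadline with one operation every five days.
    """
    total, chi_s = 0.0, initial_state
    for _ in range(horizon_steps):
        with torch.no_grad():
            a_star = int(q_net(chi_s).argmax(dim=1))         # a* = argmax_a Q*(chi(s_t), a)
        reward, chi_s, done = simulate_step(chi_s, a_star)   # placeholder environment call
        total += reward
        if done:
            break
    return total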
Further, step S104, sorting the products in descending order of predicted satisfaction degree and recommending the sorted products to the target user, specifically comprises:
The ranking may be performed by directly comparing the predicted satisfaction degrees obtained in S103, or by the relative recommendation rate of each product. There are many ways to calculate the relative recommendation rate from the simulated maximum satisfaction of each product obtained in S103. For example, in this embodiment, assume that the predicted satisfaction degrees of products 1, 2, and 3 are the constants 18, 15, and 13, respectively. The relative recommendation rate may be calculated by taking the lowest simulated product satisfaction as the standard, i.e. the simulated satisfaction of product 3 is the standard with a recommendation rate of 1; the relative recommendation rate of product 1 is then 18 ÷ 13 ≈ 1.38, and the relative recommendation rate of product 2 is 15 ÷ 13 ≈ 1.15. The calculation is analogous when there are more products.
The invention aims to overcome the above-mentioned drawbacks and provides a product recommendation method and device based on the Double DQN algorithm. A deep reinforcement learning model is trained using the users' basic information, the products' historical data, and the Double DQN algorithm. Simulated operations obtained through large-scale data analysis are often more reliable and stable than operations found through manual experience, and the Double DQN algorithm overcomes the tendency of the Q-learning algorithm, the DQN algorithm and the like to overestimate. The optimal strategy found is used to simulate the predicted satisfaction of each product, and the products are ranked by predicted satisfaction, so that products with relatively high satisfaction are recommended to the user.
Example two
The invention provides a product recommendation device based on a Double DQN algorithm;
product recommendation device based on Double DQN algorithm includes:
an acquisition module configured to: obtain basic information of a target user;
a feature extraction module configured to: process the basic information of the target user and extract its features;
a prediction module configured to: input the features representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product;
a recommendation module configured to: sort the products in descending order of predicted satisfaction degree and recommend the sorted products to the target user;
the deep reinforcement learning model refers to a Double DQN algorithm.
Here, it should be noted that the acquisition module, feature extraction module, prediction module, and recommendation module described above correspond to steps S101 to S104 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
The foregoing description covers several embodiments; for details not described in one embodiment, reference may be made to the related description of another embodiment.
The proposed system may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into the modules described above is merely a division by logical function, and other divisions are possible in practice: multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware in the processor and software modules. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. The product recommendation method based on Double DQN algorithm is characterized by comprising the following steps:
basic information of a target user is obtained;
processing basic information of a target user and extracting characteristics of the basic information;
inputting the characteristics representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product, wherein the predicted satisfaction degree of the product is a value obtained through an optimal Q-value function of a Double DQN algorithm;
in particular, the method comprises the steps of,
at each time point t, the state in which the agent is currently located is characterized as χ(s_t); the agent then performs operation a_t, obtains a reward r_t from the environment, and observes a new state feature χ(s_{t+1});
the goal of agent learning is to select a strategy π that maximizes the expected total reward, where π is the sequence of actions a_t taken at each time t, i.e. π = {a_t, a_{t+1}, a_{t+2}, …, a_T}, where T is the set terminal time;
maximizing the expected return means maximizing the future cumulative discounted reward, i.e. maximizing:
r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T, where 0 ≤ γ ≤ 1 is the discount rate;
the value of taking action a under the state feature χ(s) and thereafter following policy π is denoted:
Q^π(χ(s), a) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward of all possible decision sequences made according to strategy π after performing operation a, starting from the state feature χ(s);
an optimal Q-value function is also defined:
Q*(χ(s), a) = max_π Q^π(χ(s), a) = max_π E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward obtained by performing operation a under the state feature χ(s) and thereafter deciding according to the optimal strategy;
the optimal Q-value function Q*(χ(s), a) under each state feature χ(s) is obtained iteratively:
from the Bellman equation:
Q*(χ(s), a) = E[r_t + γ max_{a'} Q*(χ(s'), a') | χ(s_t) = χ(s), a_t = a];
accordingly, Q*(χ(s), a) is estimated by a function approximator Q(χ(s), a; θ), and θ is iterated by stochastic gradient descent (SGD) using the Double DQN target
y_t = r_t + γ Q(χ(s_{t+1}), argmax_{a'} Q(χ(s_{t+1}), a'; θ_t); θ_t^-)
and the update
θ_{t+1} = θ_t + α (y_t − Q(χ(s_t), a_t; θ_t)) ∇_{θ_t} Q(χ(s_t), a_t; θ_t),
where α is the learning rate and θ^- is updated every k steps, i.e. every k steps
θ_t^- = θ_t,
and at all other steps θ^- remains unchanged;
sorting products according to the order of the predicted satisfaction degree from large to small, and recommending the sorted products to a target user;
the deep reinforcement learning model refers to a Double DQN algorithm;
preprocessing basic information of users in a training set, which specifically comprises the following steps:
dividing the users' average monthly income, historical product purchase counts, historical product purchase frequency, risk level of historically purchased products, and price fluctuation data in the training set into N time units to obtain a plurality of segmented data s_t, where the subscript t denotes a time point, so that the time interval of the data represented by the state is recorded;
all the data in the same time unit after segmentation are subjected to feature extraction through a convolutional neural network CNN to obtain a month average income feature, a historical product purchase frequency feature, a risk level feature of a historical purchased product and a price fluctuation data feature;
the month average income feature, the historical product purchase frequency feature, the risk level feature of the historically purchased products and the price fluctuation data feature are concatenated in series to obtain the state feature χ(s_t) corresponding to the same time unit, and the state features under all time units are obtained in the same way.
2. The product recommendation method based on Double DQN algorithm as claimed in claim 1, wherein the basic information of the target user is obtained; the method specifically comprises the following steps:
the monthly average income of the target user, the times of purchasing historical products, the frequency of purchasing the historical products, the risk level of purchasing the historical products and the price fluctuation data of purchasing the historical products are obtained.
3. The product recommendation method based on Double DQN algorithm as claimed in claim 1, wherein the basic information of the target user is processed and the characteristics thereof are extracted; the method specifically comprises the following steps:
feature extraction is performed by convolutional neural networks.
4. The product recommendation method based on Double DQN algorithm as claimed in claim 1, wherein the features representing the basic information of the target user are input into the trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product; the training steps comprise:
constructing a training set, wherein the training set is user basic information of known historical purchase satisfaction of products;
preprocessing the basic information of the user in the training set, taking the state characteristics of the basic information of the user obtained after preprocessing and the historical purchase satisfaction degree of the known product as input values of a deep reinforcement learning model, and training the model to obtain the trained deep reinforcement learning model.
5. Product recommendation device based on Double DQN algorithm, characterized by including:
an acquisition module configured to: basic information of a target user is obtained;
a feature extraction module configured to: processing basic information of a target user and extracting characteristics of the basic information;
a prediction module configured to: input the characteristics representing the basic information of the target user into a trained deep reinforcement learning model to obtain the predicted satisfaction degree of each product, wherein the predicted satisfaction degree of the product is a value obtained through an optimal Q-value function of a Double DQN algorithm;
in particular, the method comprises the steps of,
at each time point t, the state in which the agent is currently located is characterized as χ(s_t); the agent then performs operation a_t, obtains a reward r_t from the environment, and observes a new state feature χ(s_{t+1});
the goal of agent learning is to select a strategy π that maximizes the expected total reward, where π is the sequence of actions a_t taken at each time t, i.e. π = {a_t, a_{t+1}, a_{t+2}, …, a_T}, where T is the set terminal time;
maximizing the expected return means maximizing the future cumulative discounted reward, i.e. maximizing:
r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T, where 0 ≤ γ ≤ 1 is the discount rate;
the value of taking action a under the state feature χ(s) and thereafter following policy π is denoted:
Q^π(χ(s), a) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward of all possible decision sequences made according to strategy π after performing operation a, starting from the state feature χ(s);
an optimal Q-value function is also defined:
Q*(χ(s), a) = max_π Q^π(χ(s), a) = max_π E[r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{T−t} r_T | χ(s_t) = χ(s), a_t = a],
which represents the expected total reward obtained by performing operation a under the state feature χ(s) and thereafter deciding according to the optimal strategy;
the optimal Q-value function Q*(χ(s), a) under each state feature χ(s) is obtained iteratively:
from the Bellman equation:
Q*(χ(s), a) = E[r_t + γ max_{a'} Q*(χ(s'), a') | χ(s_t) = χ(s), a_t = a];
accordingly, Q*(χ(s), a) is estimated by a function approximator Q(χ(s), a; θ), and θ is iterated by stochastic gradient descent (SGD) using the Double DQN target
y_t = r_t + γ Q(χ(s_{t+1}), argmax_{a'} Q(χ(s_{t+1}), a'; θ_t); θ_t^-)
and the update
θ_{t+1} = θ_t + α (y_t − Q(χ(s_t), a_t; θ_t)) ∇_{θ_t} Q(χ(s_t), a_t; θ_t),
where α is the learning rate and θ^- is updated every k steps, i.e. every k steps
θ_t^- = θ_t,
and at all other steps θ^- remains unchanged;
a recommendation module configured to: sorting products according to the order of the predicted satisfaction degree from large to small, and recommending the sorted products to a target user;
the deep reinforcement learning model refers to a Double DQN algorithm;
preprocessing basic information of users in a training set, which specifically comprises the following steps:
dividing the users' average monthly income, historical product purchase counts, historical product purchase frequency, risk level of historically purchased products, and price fluctuation data in the training set into N time units to obtain a plurality of segmented data s_t, where the subscript t denotes a time point, so that the time interval of the data represented by the state is recorded;
all the data in the same time unit after segmentation are subjected to feature extraction through a convolutional neural network CNN to obtain a month average income feature, a historical product purchase frequency feature, a risk level feature of a historical purchased product and a price fluctuation data feature;
the month average income feature, the historical product purchase frequency feature, the risk level feature of the historically purchased products and the price fluctuation data feature are concatenated in series to obtain the state feature χ(s_t) corresponding to the same time unit, and the state features under all time units are obtained in the same way.
6. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of claims 1-4.
7. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-4.
CN202110452994.0A 2021-04-26 2021-04-26 Product recommendation method and device based on Double DQN algorithm Active CN113129108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110452994.0A CN113129108B (en) 2021-04-26 2021-04-26 Product recommendation method and device based on Double DQN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110452994.0A CN113129108B (en) 2021-04-26 2021-04-26 Product recommendation method and device based on Double DQN algorithm

Publications (2)

Publication Number Publication Date
CN113129108A CN113129108A (en) 2021-07-16
CN113129108B true CN113129108B (en) 2023-05-30

Family

ID=76780002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110452994.0A Active CN113129108B (en) 2021-04-26 2021-04-26 Product recommendation method and device based on Double DQN algorithm

Country Status (1)

Country Link
CN (1) CN113129108B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581249B (en) * 2022-03-22 2024-05-31 山东大学 Financial product recommendation method and system based on investment risk bearing capacity assessment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112045680A (en) * 2020-09-02 2020-12-08 山东大学 Cloth stacking robot control system and control method based on behavior cloning
CN112291284A (en) * 2019-07-22 2021-01-29 中国移动通信有限公司研究院 Content pushing method and device and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711871B (en) * 2018-12-13 2021-03-12 北京达佳互联信息技术有限公司 Potential customer determination method, device, server and readable storage medium
CN110263244B (en) * 2019-02-14 2024-02-13 深圳市雅阅科技有限公司 Content recommendation method, device, storage medium and computer equipment
CN110598120A (en) * 2019-10-16 2019-12-20 信雅达系统工程股份有限公司 Behavior data based financing recommendation method, device and equipment
CN110866791A (en) * 2019-11-25 2020-03-06 恩亿科(北京)数据科技有限公司 Commodity pushing method and device, storage medium and electronic equipment
CN111898032B (en) * 2020-08-13 2024-04-30 腾讯科技(深圳)有限公司 Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112291284A (en) * 2019-07-22 2021-01-29 中国移动通信有限公司研究院 Content pushing method and device and computer readable storage medium
CN112045680A (en) * 2020-09-02 2020-12-08 山东大学 Cloth stacking robot control system and control method based on behavior cloning

Also Published As

Publication number Publication date
CN113129108A (en) 2021-07-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant