CN115375410A - Commodity recommendation method, device and equipment based on reinforcement learning and storage medium - Google Patents

Commodity recommendation method, device and equipment based on reinforcement learning and storage medium Download PDF

Info

Publication number
CN115375410A
Authority
CN
China
Prior art keywords
commodity
purchase
initial
value
purchasing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211118887.5A
Other languages
Chinese (zh)
Inventor
Wang Shuxin (王树新)
Ma Xin (马欣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Resources Digital Technology Co Ltd
Original Assignee
China Resources Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Resources Digital Technology Co Ltd filed Critical China Resources Digital Technology Co Ltd
Priority to CN202211118887.5A priority Critical patent/CN115375410A/en
Publication of CN115375410A publication Critical patent/CN115375410A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a commodity recommendation method, apparatus, device, and storage medium based on reinforcement learning. The method includes: obtaining a commodity purchase order of a user and constructing a commodity list; for any commodity in the commodity list, determining a purchase action of the commodity and calculating an initial value and an initial purchase probability of the purchase action, so that when the calculation is complete, each commodity in the commodity list has a corresponding initial value and initial purchase probability; randomly acquiring a commodity in the commodity list as an initial commodity, calculating the return of the initial commodity, and updating the purchase value of the initial commodity based on the return; cyclically calculating the purchase values of all commodities in the commodity list, with each previous purchase value serving as the calculation base of the current commodity, to obtain a target purchase value set; and recommending commodities to the user based on the target purchase value set. The method and system accurately calculate commodity purchase values according to the relationships among commodities and improve commodity recommendation efficiency.

Description

Commodity recommendation method, device and equipment based on reinforcement learning and storage medium
Technical Field
The present application relates to the field of information recommendation technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recommending commodities based on reinforcement learning.
Background
With the spread of informatization and the rapid development of mobile shopping applications, recommendation systems provide commodity information and suggestions to customers and help users decide what products to buy, effectively preventing the steady loss of consumers under information overload. However, current recommendation systems tend to recommend repeated or similar content to users, which reduces users' interest in the same subject. Reinforcement learning describes and solves the problem of an agent learning a strategy while interacting with an environment so as to maximize return or achieve a specific goal. In advertising commodity recommendation scenarios, reinforcement learning can serve to maximize the reward of the recommendations.
Existing reinforcement learning-based commodity recommendation methods acquire the past purchasing behavior data of all users and recommend repeated or similar commodities to a user according to records of repeat purchases in that data. However, such methods fail to consider the relationships among commodities, so commodity recommendation efficiency is low.
Disclosure of Invention
The embodiment of the application aims to provide a commodity recommendation method, a commodity recommendation device, commodity recommendation equipment and a storage medium based on reinforcement learning so as to improve commodity recommendation efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a commodity recommendation method based on reinforcement learning, including:
acquiring a commodity purchase order of a user, and constructing a commodity list corresponding to the commodity purchase order;
determining a purchasing action of the commodity aiming at any commodity in the commodity list, calculating an initial value and an initial purchasing probability of the purchasing action, and obtaining the initial value and the initial purchasing probability corresponding to each commodity in the commodity list when the calculation of all commodities in the commodity list is completed;
randomly acquiring a commodity in the commodity list as an initial commodity, calculating the return of the initial commodity based on the initial value and the initial purchase probability, and updating and calculating the purchase value of the initial commodity based on the return to obtain the purchase value of the initial commodity;
circularly calculating the purchase values of all the commodities in the commodity list according to the previous purchase value as the calculation base number of the current commodity to obtain a target purchase value set;
and determining the commodity to be recommended based on the target purchase value set, and recommending the commodity to be recommended to the user.
Further, the determining, for any commodity in the commodity list, a purchasing action of the commodity, and calculating an initial value and an initial purchasing probability of the purchasing action, and obtaining an initial value and an initial purchasing probability corresponding to each commodity in the commodity list when calculation of all commodities in the commodity list is completed includes:
acquiring a preset action strategy aiming at any commodity in the commodity list, and determining a purchasing action of the commodity based on the action strategy;
calculating the initial value of the purchasing action by adopting a calculation mode of a Bellman equation, and calculating the initial purchasing probability of the purchasing action based on the initial value;
and when all the commodities in the commodity list are calculated, obtaining the initial value and the initial purchase probability corresponding to each commodity in the commodity list, and recording the number of times of cycle calculation.
Further, after the initial value of the purchase action is calculated using the Bellman equation and the initial purchase probability of the purchase action is calculated based on the initial value, the method further includes:
dividing the initial purchase probability by the preset action strategy to obtain the current sampling degree;
and acquiring the next commodity in the commodity list, and calculating the value and the purchase probability of the next commodity.
Further, the randomly acquiring a commodity in the commodity list as an initial commodity, calculating a reward of the initial commodity based on the initial value and the initial purchase probability, and performing an update calculation on the purchase value of the initial commodity based on the reward to obtain the purchase value of the initial commodity includes:
randomly acquiring one commodity in the commodity list as the initial commodity;
acquiring a purchasing action, an initial value and an initial purchasing probability corresponding to the initial commodity as a basic purchasing action, a basic value and a basic purchasing probability;
acquiring a preset value, and calculating the return of the initial commodity based on the preset value, the basic purchasing action, the basic value and the basic purchasing probability;
and updating and calculating the purchase value of the initial commodity based on the return and the preset value to obtain the purchase value of the initial commodity.
Further, the step of circularly calculating the purchase values of all the commodities in the commodity list according to the previous purchase value as the calculation base number of the current commodity to obtain a target purchase value set includes:
acquiring a purchasing action, an initial value and an initial purchasing probability corresponding to the current commodity as the current purchasing action, the current value and the current purchasing probability;
calculating the return of the current commodity according to the previous purchasing value as the calculation base number of the current commodity and based on the current purchasing action, the current value and the current purchasing probability;
and calculating the purchase value of the current commodity based on the calculation base number and the return of the current commodity, and obtaining the target purchase value set when the calculation of the purchase values of all commodities in the commodity list is completed.
Further, the determining a to-be-recommended commodity based on the target purchase price set and recommending the to-be-recommended commodity to the user includes:
arranging the target purchase values in the target purchase value set according to the numerical order to obtain a purchase value arrangement;
acquiring target purchase values of preset recommendation quantity from the purchase value arrangement to serve as purchase values to be recommended;
and obtaining the commodity corresponding to the purchase value to be recommended, obtaining the commodity to be recommended, and recommending the commodity to be recommended to the user.
Further, the obtaining a commodity purchase order of a user and constructing a commodity list corresponding to the commodity purchase order include:
acquiring a commodity purchase order of the user;
identifying the commodities in the commodity purchase order to obtain a plurality of commodities;
and performing list construction processing on the plurality of commodities to obtain the commodity list.
In order to solve the above technical problem, an embodiment of the present application provides a reinforcement learning-based product recommendation apparatus, including:
the system comprises a commodity purchase order acquisition module, a commodity purchase order acquisition module and a commodity display module, wherein the commodity purchase order acquisition module is used for acquiring a commodity purchase order of a user and constructing a commodity list corresponding to the commodity purchase order;
an initial purchase probability calculation module, configured to determine a purchase action of each commodity in the commodity list, calculate an initial value and an initial purchase probability of the purchase action, and obtain an initial value and an initial purchase probability corresponding to each commodity in the commodity list when calculation of all commodities in the commodity list is completed;
the purchase value calculation module is used for randomly acquiring one commodity in the commodity list as an initial commodity, calculating the return of the initial commodity based on the initial value and the initial purchase probability, and updating and calculating the purchase value of the initial commodity based on the return to obtain the purchase value of the initial commodity;
the target purchase price set generation module is used for circularly calculating the purchase values of all the commodities in the commodity list according to the previous purchase value as the calculation base number of the current commodity to obtain a target purchase price set;
and the to-be-recommended commodity recommending module is used for determining the to-be-recommended commodity based on the target purchase price set and recommending the to-be-recommended commodity to the user.
In order to solve the above technical problems, the invention adopts a technical scheme of providing a computer device that includes one or more processors and a memory storing one or more programs which, when executed, cause the one or more processors to implement any of the reinforcement learning-based commodity recommendation methods described above.
In order to solve the technical problems, the invention adopts a technical scheme that: a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing any one of the reinforcement learning-based item recommendation methods described above.
The embodiment of the invention provides a commodity recommendation method, apparatus, device, and storage medium based on reinforcement learning. In the embodiment, the purchase action of each commodity is determined; the initial values and purchase probabilities of all commodities in the commodity list are calculated; the return and purchase value of each commodity are calculated; and the purchase value of the previous commodity is used as the calculation base for the target purchase value of the current commodity. The commodity purchase value can therefore be accurately calculated according to the relationships among commodities, and commodity recommendation efficiency can be improved.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the description below are some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an implementation of a commodity recommendation method flow based on reinforcement learning according to an embodiment of the present application;
FIG. 2 is a flowchart of another implementation of a sub-process in a reinforcement learning-based merchandise recommendation method according to an embodiment of the present application;
FIG. 3 is a flowchart of another implementation of a sub-process in a reinforcement learning-based merchandise recommendation method according to an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in a reinforcement learning-based merchandise recommendation method according to an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in a reinforcement learning-based merchandise recommendation method according to an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in a reinforcement learning-based merchandise recommendation method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a device for recommending goods based on reinforcement learning according to an embodiment of the present application;
fig. 8 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof in the description and claims of this application and the description of the figures above, are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail below with reference to the drawings and embodiments.
The reinforcement learning-based product recommendation method provided in the embodiments of the present application is generally executed by an enterprise service bus, and accordingly, the reinforcement learning-based product recommendation apparatus is generally configured in the enterprise service bus.
Referring to fig. 1, fig. 1 shows an embodiment of a reinforcement learning-based merchandise recommendation method.
It should be noted that, if there is substantially the same result, the method of the present invention is not limited to the flow sequence shown in fig. 1, and the method includes the following steps:
s1: and acquiring a commodity purchase order of a user, and constructing a commodity list corresponding to the commodity purchase order.
Specifically, in the embodiment of the application, a commodity purchase order of a user is analyzed, and the value of the commodity the user will purchase next is obtained according to the connections among the commodities in the commodity purchase order, so that the commodities the user is most likely to purchase can be recommended.
Reinforcement learning involves several key definitions. The state is a feature representation of the information of the environment the agent currently occupies, and can be transmitted to the agent as a signal. An action is what the agent does while in a certain state. Revenue is the objective in reinforcement learning problems and is also the primary basis for changing the strategy. The reward is the sum of the revenue sequence. The policy is the behavior pattern the agent has learned at a particular time. The embodiment of the invention takes the orders of commodities purchased by a user as the basic learning data and the user's order sequence as the basic data sequence, defined as follows. State (s): an order in which the user buys a certain commodity. Action (a): the order-placing action of the user purchasing a commodity. Revenue (r): for a specific advertised commodity, +1 if purchased and -1 if not purchased. Reward (G): the sum of the revenue sequence, G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T. Policy π(a|s): the strategy by which, after the user has purchased one commodity, the user continues to purchase another commodity (including both commodities appearing in the same order).
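These definitions can be sketched in code; the function name and the sample revenue sequence below are illustrative, not taken from the patent:

```python
def compute_reward(revenues, t=0):
    """Reward G_t = R_{t+1} + R_{t+2} + ... + R_T: the plain sum of the
    revenue sequence that follows time t (no discounting in this definition)."""
    return sum(revenues[t:])

# Example: three advertised commodities; the user buys the first and third
# (revenue +1 each) but not the second (revenue -1).
revenues = [1, -1, 1]
print(compute_reward(revenues))  # 1
```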
Referring to fig. 2, fig. 2 shows an embodiment of step S1, which is described in detail as follows:
s11: and acquiring a commodity purchase order of the user.
S12: the commodities in the commodity purchase order are identified, and a plurality of commodities are obtained.
S13: and constructing a plurality of commodities to perform list construction processing to obtain a commodity list.
Specifically, the embodiment of the application analyzes the commodities in the user's commodity purchase order: the commodity purchase order of the user is acquired, and all commodities in it are identified. Each entity in the commodity purchase order can be identified by text recognition and compared against a commodity map, thereby identifying the commodities in the order. Finally, the commodities in the commodity purchase order are constructed into a commodity list to facilitate the subsequent calculation and analysis of each commodity.
S2: and determining the purchasing action of the commodities aiming at any commodity in the commodity list, calculating the initial value and the initial purchasing probability of the purchasing action, and obtaining the initial value and the initial purchasing probability corresponding to each commodity in the commodity list when the calculation of all commodities in the commodity list is completed.
Referring to fig. 3, fig. 3 shows an embodiment of step S2, which is described in detail as follows:
s21: and acquiring a preset action strategy aiming at any commodity in the commodity list, and determining the purchasing action of the commodity based on the action strategy.
Specifically, a preset action policy b(a|s) is acquired for any commodity in the commodity list, and the selected commodity is multiplied by the preset action policy b(a|s), yielding the determined purchase action a0 for that commodity. The preset action policy b(a|s) is the probability that the user, having purchased one commodity, goes on to purchase another. The embodiment of the application sets the value of the action policy b(a|s) according to the actual situation. The purchase actions of the other commodities in the commodity list are likewise determined according to the preset action policy b(a|s).
S22: and calculating the initial value of the purchasing action by adopting a calculation mode of a Bellman equation, and calculating the initial purchasing probability of the purchasing action based on the initial value.
Specifically, the embodiments of the present application provide the following formulas (1) and (2).

Formula (1):

v_π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k · R_{t+k+1} | S_t = s ]

Formula (2):

v_π(s) = Σ_a π(a|s) · Σ_{s',r} p(s', r | s, a) · [ r + γ · v_π(s') ]

wherein γ is the discount rate, v_π(s) is the state value, G_t is the return, S_t is the state, a is an action, π(a|s) is the policy, p(s', r | s, a) is the probability of transitioning to state s' with revenue r, and k is the number of states.
In the embodiment of the application, the initial value of the purchase action is calculated by means of the Bellman equation, that is, by formula (1); the initial purchase probability π(a|s) of the purchase action is then calculated by formula (2).
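As a hedged sketch of this step (the transition quantities, the softmax mapping from values to probabilities, and all numbers below are assumptions for illustration, not taken from the patent), a Bellman-style backup of the initial value and a value-to-probability conversion might look like:

```python
import math

def bellman_backup(pi_probs, rewards, gamma, next_values):
    """One Bellman expectation backup:
    v(s) = sum_a pi(a|s) * (r(s, a) + gamma * v(s'_a))."""
    return sum(p * (r + gamma * v)
               for p, r, v in zip(pi_probs, rewards, next_values))

def softmax_probs(values):
    """Map initial values to initial purchase probabilities via softmax
    (one plausible reading of 'probability calculated based on the value')."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Two candidate purchase actions with assumed policy weights, revenues,
# and next-state values.
initial_values = [bellman_backup([0.6, 0.4], [1.0, -1.0], 0.9, [0.5, 0.2]),
                  bellman_backup([0.5, 0.5], [1.0, 1.0], 0.9, [0.0, 0.1])]
probs = softmax_probs(initial_values)
assert abs(sum(probs) - 1.0) < 1e-9  # probabilities sum to 1
```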
After step S22, the present application further provides another embodiment, which specifically includes the following steps:
dividing the initial purchase probability by a preset action strategy to obtain the current sampling degree;
and acquiring the next commodity in the commodity list, and calculating the value and the purchase probability of the next commodity.
Specifically, the initial purchase probability π(a|s) is divided by the preset action policy b(a|s) to obtain the current sampling degree σ_{t+1}. Here σ_{t+1} is a number from 0 to 1 representing the degree of sampling in the current step: σ_{t+1} = 1 represents a complete sample, and σ_{t+1} = 0 represents no sample. The next commodity in the commodity list is then acquired, and its value and purchase probability are calculated. The calculation process for the next commodity is the same as steps S21-S22 and is not repeated here.
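A minimal sketch of this ratio (clipping into [0, 1] is an assumption made so the result matches the stated 0-to-1 range, not something the patent specifies):

```python
def sampling_degree(pi_prob, b_prob):
    """sigma_{t+1} = pi(a|s) / b(a|s), clipped into [0, 1]:
    1 means a complete sample, 0 means no sample."""
    if b_prob == 0:
        return 0.0
    return max(0.0, min(1.0, pi_prob / b_prob))

print(sampling_degree(0.3, 0.6))  # 0.5
```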
S23: and when all the commodities in the commodity list are calculated, obtaining the initial value and the initial purchase probability corresponding to each commodity in the commodity list, and recording the number of times of circular calculation.
Specifically, in the embodiment of the application, the initial value and the initial purchase probability of all the commodities in the commodity list are circularly calculated, when the calculation of all the commodities in the commodity list is completed, the initial value and the initial purchase probability corresponding to each commodity in the commodity list are obtained, and the number of times of circular calculation is recorded. The number of times of loop calculation is recorded so as to perform descending loop subsequently to calculate the target purchase value.
S3: randomly acquiring a commodity in the commodity list as an initial commodity, calculating the return of the initial commodity based on the initial value and the initial purchase probability, and updating and calculating the purchase value of the initial commodity based on the return to obtain the purchase value of the initial commodity.
Referring to fig. 4, fig. 4 shows an embodiment of step S3, which is described in detail as follows:
s31: and randomly acquiring one commodity in the commodity list as an initial commodity.
S32: and acquiring a purchasing action, an initial value and an initial purchasing probability corresponding to the initial commodity as a basic purchasing action, a basic value and a basic purchasing probability.
S33: and acquiring a preset value, and calculating the return of the initial commodity based on the preset value, the basic purchasing action, the basic value and the basic purchasing probability.
S34: and updating and calculating the purchase value of the initial commodity based on the return and the preset value to obtain the purchase value of the initial commodity.
Specifically, the present application also provides the following formulas:

Formula (3):

V̄_{h−1}(S_{t+1}) = Σ_a π(a | S_{t+1}) · Q_{h−1}(S_{t+1}, a)

Formula (4):

G_{t:h} = R_{t+1} + γ · σ_{t+1} · ( G_{t+1:h} − Q_{h−1}(S_{t+1}, A_{t+1}) ) + γ · V̄_{h−1}(S_{t+1})

Formula (5):

Q_{t+n}(S_t, A_t) = Q_{t+n−1}(S_t, A_t) + α · [ G_{t:t+n} − Q_{t+n−1}(S_t, A_t) ]

wherein γ is the discount rate; G_{t:h}, G_{t+1:h}, and G_{t:t+n} are returns at different times; Q_{h−1}(S_{t+1}, A_{t+1}) is the previous purchase value; σ_{t+1} is the current sampling degree; A_{t+1} is the action at time t+1; V̄_{h−1}(S_{t+1}) is the previous initial value; α is a coefficient; Q_{t+n}(S_t, A_t) is the purchase value at time t+n; Q_{t+n−1}(S_t, A_t) is the previous purchase value at time t+n; and S_{t+1} is the commodity at time t+1.
Specifically, when the purchase value of a commodity is calculated for the first time, a preset value needs to be acquired; it is set in advance according to actual conditions and can, for example, be set to 1. The purchase action, initial value, and initial purchase probability corresponding to the initial commodity are then acquired as the basic purchase action, basic value, and basic purchase probability. Next, the preset value, basic purchase action, basic value, and basic purchase probability are substituted into formula (4) to calculate the return G_{t:h} of the initial commodity. Finally, the return and the preset value are substituted into formula (5), and the purchase value of the initial commodity is updated to obtain the purchase value Q_{t+n}(S_t, A_t) of the initial commodity.
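Steps S31-S34 can be sketched as follows. The return uses an n-step Q(σ)-style backup, which is one plausible reading of formula (4); every number below is illustrative, not from the patent:

```python
def commodity_return(reward, gamma, sigma, next_return, prev_q, prev_v):
    """Formula (4)-style return:
    G_{t:h} = R_{t+1} + gamma * sigma_{t+1} * (G_{t+1:h} - Q_{h-1})
              + gamma * V_{h-1}."""
    return reward + gamma * sigma * (next_return - prev_q) + gamma * prev_v

def update_purchase_value(prev_q, alpha, g):
    """Formula (5)-style update: Q <- Q + alpha * (G - Q)."""
    return prev_q + alpha * (g - prev_q)

preset = 1.0  # preset value seeding the first calculation
g = commodity_return(reward=1.0, gamma=0.9, sigma=0.5,
                     next_return=preset, prev_q=preset, prev_v=preset)
q = update_purchase_value(preset, alpha=0.5, g=g)
print(round(q, 3))  # g = 1.0 + 0 + 0.9 = 1.9, so q = 1.0 + 0.5 * 0.9 = 1.45
```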
S4: and cyclically calculating the purchase values of all the commodities in the commodity list, with the previous purchase value serving as the calculation base of the current commodity, to obtain a target purchase value set.
Referring to fig. 5, fig. 5 shows an embodiment of step S4, which is described in detail as follows:
s41: and acquiring a purchasing action, initial value and initial purchasing probability corresponding to the current commodity as the current purchasing action, the current value and the current purchasing probability.
S42: and calculating the reward of the current commodity according to the previous purchasing value as the calculation base of the current commodity and based on the current purchasing action, the current value and the current purchasing probability.
S43: and calculating the purchase value of the current commodity based on the calculation base number and the return of the current commodity, and obtaining a target purchase value set when the calculation of the purchase values of all commodities in the commodity list is completed.
Specifically, since the purchase values of all commodities need to be calculated in the embodiment of the present application, the purchase action, initial value, and initial purchase probability corresponding to the current commodity are acquired as the current purchase action, current value, and current purchase probability. Because the purchase value of the initial commodity has already been calculated in the previous step, when calculating the purchase value of the commodity after the initial commodity, the purchase value of the initial commodity is used as the calculation base for that commodity's purchase value, that is, as the previous purchase value Q_{h−1}(S_{t+1}, A_{t+1}). The current purchase action, current value, current purchase probability, and previous purchase value are substituted into formula (4) to calculate the return of the current commodity; the calculation base and the return of the current commodity are then substituted into formula (5) to calculate the purchase value of the current commodity. The purchase value of the current commodity in turn serves as the previous purchase value for calculating the purchase value of the next commodity, and so on; when the purchase values of all commodities in the commodity list have been calculated, the target purchase value set is obtained.
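The chaining in steps S41-S43 can be sketched as a loop in which each purchase value becomes the calculation base for the next commodity. This is a simplified illustration: the per-commodity `reward` stands in for the full formula-(4) return, and the commodity names, α, γ, and preset value are all assumptions:

```python
def target_purchase_values(commodities, alpha=0.5, gamma=0.9, preset=1.0):
    """Chain the purchase-value calculation through the commodity list:
    the previous commodity's purchase value is the base for the current one."""
    prev_q = preset  # the preset value seeds the initial commodity
    targets = {}
    for name, reward in commodities:
        g = reward + gamma * prev_q        # return built on the base value
        q = prev_q + alpha * (g - prev_q)  # formula (5)-style update
        targets[name] = q
        prev_q = q                         # chain to the next commodity
    return targets

print(target_purchase_values([("milk", 1.0), ("bread", -1.0)]))
```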
S5: and determining the commodity to be recommended based on the target purchase value set, and recommending the commodity to be recommended to the user.
Specifically, the higher the target purchase value of a commodity, the more likely the user is to purchase that commodity at the next shopping; therefore, the commodities with the highest target purchase values in the target purchase value set are acquired as the commodities to be recommended and recommended to the user.
Referring to fig. 6, fig. 6 shows an embodiment of step S5, which is described in detail as follows:
S51: arranging the target purchase values in the target purchase value set in numerical order to obtain a purchase value arrangement.
S52: acquiring a preset recommendation quantity of target purchase values from the purchase value arrangement as the purchase values to be recommended.
S53: acquiring the commodities corresponding to the purchase values to be recommended to obtain the commodities to be recommended, and recommending the commodities to be recommended to the user.
In the embodiment of the application, the target purchase values in the target purchase value set are arranged in numerical order to obtain a purchase value arrangement. The higher the target purchase value of a commodity, the more likely the user is to purchase it at the next shopping. Therefore, a preset recommendation quantity of target purchase values is selected from the purchase value arrangement as the purchase values to be recommended; finally, the commodities corresponding to these purchase values are acquired as the commodities to be recommended and recommended to the user.
It should be noted that the preset recommendation quantity is set according to the actual situation and is not limited here. In one embodiment, 5 commodities are selected for recommendation to the user.
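Steps S51-S53 amount to sorting the target purchase value set and taking the top entries. A minimal sketch, where the function name and the default quantity of 5 (the example quantity mentioned above) are illustrative:

```python
def recommend_top_n(target_values, n=5):
    """S51: arrange target purchase values in (descending) numerical order;
    S52: take the preset recommendation quantity of values;
    S53: return the corresponding commodities to recommend."""
    ranked = sorted(target_values.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]
```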
In this embodiment, a commodity purchase order of a user is acquired, and a commodity list corresponding to the commodity purchase order is constructed; for any commodity in the commodity list, the purchase action of the commodity is determined, the initial value and the initial purchase probability of the purchase action are calculated, and, when all commodities in the commodity list have been processed, the initial value and the initial purchase probability corresponding to each commodity in the commodity list are obtained; one commodity in the commodity list is randomly acquired as the initial commodity, the return of the initial commodity is calculated based on the initial value and the initial purchase probability, and the purchase value of the initial commodity is obtained by an update calculation based on the return; the purchase values of all commodities in the commodity list are calculated cyclically, with the previous purchase value serving as the calculation base number of the current commodity, to obtain a target purchase value set; the commodities to be recommended are determined based on the target purchase value set and recommended to the user. By determining the purchase actions of the commodities, calculating the initial values and purchase probabilities of all commodities in the commodity list, calculating the returns and purchase values of the commodities, and using the purchase value of the previous commodity as the calculation base number for the target purchase value of the current commodity, the embodiment of the invention can calculate commodity purchase values accurately according to the relations among the commodities, thereby improving commodity recommendation efficiency.
Referring to fig. 7, as an implementation of the method shown in fig. 1, the present application provides an embodiment of a device for recommending goods based on reinforcement learning, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 1, and the device can be applied to various electronic devices.
As shown in fig. 7, the reinforcement learning-based commodity recommendation apparatus of the present embodiment includes: a commodity purchase order obtaining module 61, an initial purchase probability calculation module 62, a purchase value calculation module 63, a target purchase value set generation module 64, and a to-be-recommended commodity recommending module 65, wherein:
a commodity purchase order obtaining module 61, configured to obtain a commodity purchase order of a user, and construct a commodity list corresponding to the commodity purchase order;
an initial purchase probability calculation module 62, configured to determine a purchase action of a commodity for any commodity in the commodity list, calculate an initial value and an initial purchase probability of the purchase action, and obtain an initial value and an initial purchase probability corresponding to each commodity in the commodity list when calculation of all commodities in the commodity list is completed;
a purchase value calculation module 63, configured to randomly obtain a commodity in the commodity list as an initial commodity, calculate a return of the initial commodity based on the initial value and the initial purchase probability, and update and calculate a purchase value of the initial commodity based on the return to obtain a purchase value of the initial commodity;
a target purchase value set generation module 64, configured to cyclically calculate the purchase values of all the commodities in the commodity list, using the previous purchase value as the calculation base number of the current commodity, to obtain a target purchase value set;
and a to-be-recommended commodity recommending module 65, configured to determine the commodities to be recommended based on the target purchase value set, and recommend the commodities to be recommended to the user.
Further, the initial purchase probability calculation module 62 includes:
the purchase action determining unit is used for acquiring a preset action strategy for any commodity in the commodity list and determining the purchase action of the commodity based on the action strategy;
the initial value calculating unit is used for calculating the initial value of the purchasing action by adopting a calculation mode of a Bellman equation and calculating the initial purchasing probability of the purchasing action based on the initial value;
and the commodity calculation finishing unit is used for obtaining the initial value and the initial purchase probability corresponding to each commodity in the commodity list when the calculation of all commodities in the commodity list is finished, and recording the cycle calculation times.
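As a rough illustration of the initial value calculation unit: the exact Bellman-equation-based formula is not reproduced in this excerpt, so the sketch below uses the standard one-step Bellman expectation backup, and assumes, purely for illustration, a softmax as one way to turn initial values into purchase probabilities.

```python
import math

def bellman_initial_value(reward, next_values, transition_probs, gamma=0.9):
    """Standard one-step Bellman expectation backup (illustrative form):
    Q(s, a) = r + gamma * sum over s' of P(s' | s, a) * V(s')."""
    return reward + gamma * sum(p * v for p, v in zip(transition_probs, next_values))

def purchase_probabilities(values):
    """Softmax over initial values -- an assumed choice for deriving
    purchase probabilities; the patented formula may differ."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```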
Further, the initial value calculation unit further includes:
the current sampling degree unit is used for dividing the initial purchase probability by a preset action strategy to obtain the current sampling degree;
and the next commodity calculating unit is used for acquiring the next commodity in the commodity list and calculating the value and the purchase probability of the next commodity.
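Dividing the initial purchase probability by the probability that the preset action strategy assigns to the same action, as the current sampling degree unit does, matches the importance-sampling ratio familiar from off-policy reinforcement learning. A minimal sketch under that reading:

```python
def sampling_degree(target_prob, behavior_prob):
    """Importance-sampling-style ratio: the (target) purchase probability
    divided by the preset action strategy's probability for the action."""
    if behavior_prob == 0:
        raise ValueError("the action strategy must assign a nonzero probability")
    return target_prob / behavior_prob
```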
Further, the purchase value calculation module 63 includes:
an initial commodity determining unit, configured to randomly obtain a commodity in a commodity list as an initial commodity;
the corresponding data acquisition unit is used for acquiring the purchase action, the initial value and the initial purchase probability corresponding to the initial commodity as the basic purchase action, the basic value and the basic purchase probability;
the commodity return calculating unit is used for acquiring a preset value and calculating the return of the initial commodity based on the preset value, the basic purchase action, the basic value, and the basic purchase probability;
and the updating calculation unit is used for updating and calculating the purchase value of the initial commodity based on the return and the preset value to obtain the purchase value of the initial commodity.
Further, the target purchase value set generation module 64 includes:
the current data acquisition unit is used for acquiring the purchasing action, the initial value and the initial purchasing probability corresponding to the current commodity as the current purchasing action, the current value and the current purchasing probability;
the current commodity return calculating unit is used for calculating the return of the current commodity according to the previous purchasing value as the calculating base number of the current commodity and based on the current purchasing action, the current value and the current purchasing probability;
and the value calculation completion unit is used for calculating the purchase value of the current commodity based on the calculation base number and the return of the current commodity, and obtaining the target purchase value set when the calculation of the purchase values of all commodities in the commodity list is completed.
Further, the to-be-recommended item recommendation module 65 includes:
the purchase value arrangement generating unit is used for arranging the target purchase values in the target purchase value set according to the numerical value sequence to obtain a purchase value arrangement;
the system comprises a to-be-recommended purchase value acquisition unit, a to-be-recommended purchase value acquisition unit and a recommendation unit, wherein the to-be-recommended purchase value acquisition unit is used for acquiring target purchase values of preset recommendation quantity from purchase value arrangement to serve as to-be-recommended purchase values;
and the commodity recommending unit is used for acquiring the commodity corresponding to the purchase value to be recommended, obtaining the commodity to be recommended and recommending the commodity to be recommended to the user.
Further, the commodity purchase order obtaining module 61 includes:
the order acquisition unit is used for acquiring a commodity purchase order of a user;
the commodity identification unit is used for identifying commodities in the commodity purchase order to obtain a plurality of commodities;
and the commodity list construction unit is used for performing list construction processing on the plurality of commodities to obtain a commodity list.
In this embodiment, a commodity purchase order of a user is acquired, and a commodity list corresponding to the commodity purchase order is constructed; for any commodity in the commodity list, the purchase action of the commodity is determined, the initial value and the initial purchase probability of the purchase action are calculated, and, when all commodities in the commodity list have been processed, the initial value and the initial purchase probability corresponding to each commodity in the commodity list are obtained; one commodity in the commodity list is randomly acquired as the initial commodity, the return of the initial commodity is calculated based on the initial value and the initial purchase probability, and the purchase value of the initial commodity is obtained by an update calculation based on the return; the purchase values of all commodities in the commodity list are calculated cyclically, with the previous purchase value serving as the calculation base number of the current commodity, to obtain a target purchase value set; the commodities to be recommended are determined based on the target purchase value set and recommended to the user. By determining the purchase actions of the commodities, calculating the initial values and purchase probabilities of all commodities in the commodity list, calculating the returns and purchase values of the commodities, and using the purchase value of the previous commodity as the calculation base number for the target purchase value of the current commodity, the embodiment of the invention can calculate commodity purchase values accurately according to the relations among the commodities, thereby improving commodity recommendation efficiency.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 7 comprises a memory 71, a processor 72, and a network interface 73, which are communicatively connected to each other through a system bus. It should be noted that only a computer device 7 with the three components memory 71, processor 72, and network interface 73 is shown in the figure, but it should be understood that not all of the shown components need to be implemented; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device, or the like.
The memory 71 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as a hard disk or an internal memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) equipped on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit of the computer device 7 and an external storage device thereof. In the present embodiment, the memory 71 is generally used for storing the operating system installed on the computer device 7 and various types of application software, such as the program code of the reinforcement learning-based commodity recommendation method. Furthermore, the memory 71 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to execute the program code stored in the memory 71 or to process data, for example to execute the program code of the reinforcement learning-based commodity recommendation method described above, so as to implement the various embodiments of that method.
The network interface 73 may comprise a wireless network interface or a wired network interface, and the network interface 73 is typically used for establishing a communication connection between the computer device 7 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the reinforcement learning based merchandise recommendation method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) that includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods of the embodiments of the present application.
It should be understood that the above-described embodiments merely illustrate, and do not restrict, the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A commodity recommendation method based on reinforcement learning is characterized by comprising the following steps:
acquiring a commodity purchase order of a user, and constructing a commodity list corresponding to the commodity purchase order;
determining a purchasing action of the commodity aiming at any commodity in the commodity list, calculating an initial value and an initial purchasing probability of the purchasing action, and obtaining the initial value and the initial purchasing probability corresponding to each commodity in the commodity list when the calculation of all commodities in the commodity list is completed;
randomly acquiring a commodity in the commodity list as an initial commodity, calculating the return of the initial commodity based on the initial value and the initial purchase probability, and updating and calculating the purchase value of the initial commodity based on the return to obtain the purchase value of the initial commodity;
circularly calculating the purchase values of all the commodities in the commodity list according to the previous purchase value as the calculation base number of the current commodity to obtain a target purchase value set;
and determining the commodity to be recommended based on the target purchase value set, and recommending the commodity to be recommended to the user.
2. The reinforcement learning-based commodity recommendation method according to claim 1, wherein the determining a purchase action of the commodity for any commodity in the commodity list, and calculating an initial value and an initial purchase probability of the purchase action, and obtaining the initial value and the initial purchase probability corresponding to each commodity in the commodity list when the calculation of all commodities in the commodity list is completed, comprises:
acquiring a preset action strategy aiming at any commodity in the commodity list, and determining a purchasing action of the commodity based on the action strategy;
calculating the initial value of the purchasing action by adopting a calculation mode of a Bellman equation, and calculating the initial purchasing probability of the purchasing action based on the initial value;
and when all the commodities in the commodity list are calculated, obtaining the initial value and the initial purchase probability corresponding to each commodity in the commodity list, and recording the number of times of circular calculation.
3. The reinforcement learning-based commodity recommendation method according to claim 2, wherein after calculating an initial value of the purchase action by using the calculation method of the bellman equation and calculating an initial purchase probability of the purchase action based on the initial value, the method further comprises:
dividing the initial purchase probability by the preset action strategy to obtain the current sampling degree;
and acquiring the next commodity in the commodity list, and calculating the value and the purchase probability of the next commodity.
4. The reinforcement learning-based commodity recommendation method according to claim 1, wherein the randomly acquiring one commodity in the commodity list as an initial commodity, calculating a reward of the initial commodity based on the initial value and the initial purchase probability, and performing an update calculation on a purchase value of the initial commodity based on the reward to obtain the purchase value of the initial commodity comprises:
randomly acquiring one commodity in the commodity list as the initial commodity;
acquiring a purchasing action, an initial value and an initial purchasing probability corresponding to the initial commodity as a basic purchasing action, a basic value and a basic purchasing probability;
acquiring a preset value, and calculating the return of the initial commodity based on the preset value, the basic purchasing action, the basic value and the basic purchasing probability;
and updating and calculating the purchase value of the initial commodity based on the return and the preset value to obtain the purchase value of the initial commodity.
5. The reinforcement learning-based commodity recommendation method according to claim 1, wherein the step of circularly calculating the purchase values of all the commodities in the commodity list according to the previous purchase value as the calculation base number of the current commodity to obtain the target purchase value set comprises:
acquiring a purchasing action, an initial value and an initial purchasing probability corresponding to the current commodity as the current purchasing action, the current value and the current purchasing probability;
calculating the return of the current commodity according to the previous purchasing value as the calculation base number of the current commodity and based on the current purchasing action, the current value and the current purchasing probability;
and calculating the purchase value of the current commodity based on the calculation base number and the return of the current commodity, and obtaining the target purchase value set when the calculation of the purchase values of all commodities in the commodity list is completed.
6. The reinforcement learning-based commodity recommendation method according to claim 1, wherein the determining a commodity to be recommended based on the target purchase price set and recommending the commodity to be recommended to the user comprises:
arranging the target purchase values in the target purchase value set according to the numerical value sequence to obtain a purchase value arrangement;
acquiring target purchase values of preset recommendation quantity from the purchase value arrangement to serve as purchase values to be recommended;
and acquiring the commodity corresponding to the purchase value to be recommended, acquiring the commodity to be recommended, and recommending the commodity to be recommended to the user.
7. The reinforcement learning-based commodity recommendation method according to any one of claims 1 to 6, wherein the obtaining of a commodity purchase order of a user and the building of a commodity list corresponding to the commodity purchase order comprise:
acquiring a commodity purchase order of the user;
identifying the commodities in the commodity purchase order to obtain a plurality of commodities;
and performing list construction processing on the plurality of commodities to obtain the commodity list.
8. A reinforcement learning-based commodity recommendation device, comprising:
a commodity purchase order acquisition module, configured to acquire a commodity purchase order of a user and construct a commodity list corresponding to the commodity purchase order;
an initial purchase probability calculation module, configured to determine a purchase action of each commodity in the commodity list, calculate an initial value and an initial purchase probability of the purchase action, and obtain an initial value and an initial purchase probability corresponding to each commodity in the commodity list when calculation of all commodities in the commodity list is completed;
a purchase value calculation module, configured to randomly obtain a commodity in the commodity list, use the commodity as an initial commodity, calculate a return of the initial commodity based on the initial value and the initial purchase probability, and update and calculate a purchase value of the initial commodity based on the return to obtain a purchase value of the initial commodity;
the target purchase value set generation module is used for cyclically calculating the purchase values of all the commodities in the commodity list, using the previous purchase value as the calculation base number of the current commodity, to obtain a target purchase value set;
and the to-be-recommended commodity recommending module is used for determining the commodities to be recommended based on the target purchase value set and recommending the commodities to be recommended to the user.
9. A computer device comprising a memory in which a computer program is stored and a processor that implements the reinforcement learning-based merchandise recommendation method according to any one of claims 1 to 7 when the processor executes the computer program.
10. A computer-readable storage medium, wherein a computer program is stored thereon, and when executed by a processor, the computer program implements the reinforcement learning-based merchandise recommendation method according to any one of claims 1 to 7.
CN202211118887.5A 2022-09-13 2022-09-13 Commodity recommendation method, device and equipment based on reinforcement learning and storage medium Pending CN115375410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118887.5A CN115375410A (en) 2022-09-13 2022-09-13 Commodity recommendation method, device and equipment based on reinforcement learning and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118887.5A CN115375410A (en) 2022-09-13 2022-09-13 Commodity recommendation method, device and equipment based on reinforcement learning and storage medium

Publications (1)

Publication Number Publication Date
CN115375410A true CN115375410A (en) 2022-11-22

Family

ID=84072172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118887.5A Pending CN115375410A (en) 2022-09-13 2022-09-13 Commodity recommendation method, device and equipment based on reinforcement learning and storage medium

Country Status (1)

Country Link
CN (1) CN115375410A (en)

Similar Documents

Publication Publication Date Title
US20200065750A1 (en) Inventory management system and method thereof
CN110135951B (en) Game commodity recommendation method and device and readable storage medium
CN108269117B (en) Data pushing and determining method and device and computer terminal
JP2014512598A (en) Context-based keyboard
US20200134693A1 (en) Method, device and equipment for recommending product, and computer readable storage medium
US9916563B1 (en) Version recall for computerized catalog management
US10242381B1 (en) Optimized selection and delivery of content
WO2020221022A1 (en) Service object recommendation method
CN111612560A (en) Promotion object recommendation method and system, storage medium and electronic device
CN113781149A (en) Information recommendation method and device, computer-readable storage medium and electronic equipment
CN110874775A (en) Commodity pushing method and device, equipment and storage medium
CN115578138A (en) Marketing method, marketing device, marketing medium and computing equipment
CN111768243A (en) Sales prediction method, prediction model construction method, device, equipment and medium
CN110598094A (en) Shopping recommendation method based on matrix completion, electronic device and storage medium
CN112348590A (en) Method and device for determining value of article, electronic equipment and storage medium
CN110348947B (en) Object recommendation method and device
US20140195323A1 (en) Website and mobile app for shopping list price comparison
Chen et al. Managing the personalized order-holding problem in online retailing
CN111932319B (en) Rights and interests configuration method and device, storage medium and computer equipment
CN115375410A (en) Commodity recommendation method, device and equipment based on reinforcement learning and storage medium
CN103218726A (en) Information item recommendation method and system
CN111091218A (en) Method and device for generating bidding prediction model and automatically bidding advertisement delivery
US11176579B2 (en) System, method, and non-transitory computer-readable storage media for assigning offers to a plurality of target customers
CN110837596B (en) Intelligent recommendation method and device, computer equipment and storage medium
CN114119168A (en) Information pushing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination