CN111738787A - Information pushing method and device - Google Patents

Information pushing method and device

Info

Publication number
CN111738787A
Authority
CN
China
Prior art keywords
commodity
user
information
model
neural network
Prior art date
Legal status
Pending
Application number
CN201910512050.0A
Other languages
Chinese (zh)
Inventor
周东
雷章明
汤桢伟
兰华勇
古川
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910512050.0A priority Critical patent/CN111738787A/en
Publication of CN111738787A publication Critical patent/CN111738787A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations

Abstract

The invention discloses an information pushing method and device. The method comprises: acquiring user information, commodity information, and commodity sequence information of users' historical behaviors; training a user-commodity interest model with a reinforcement learning method on the acquired information, and training a user-commodity repurchase cycle model with a deep learning method; when a current user accesses online, looking up the current user's relevant information and obtaining, from the user-commodity interest model and the user-commodity repurchase cycle model, the current user's repurchase cycle for each commodity in a commodity list; and comparing the current user's historical purchase record time with the repurchase cycle of each commodity in the commodity list, filtering out the commodities not due for repurchase, and thereby obtaining the current user's recommended commodity list. With the method and device, the recommendation strategy can be adjusted dynamically according to user feedback, and repeated recommendation of commodities within their non-repurchase period is avoided.

Description

Information pushing method and device
Technical Field
The invention relates to the technical field of electronic commerce, in particular to an information pushing method and device.
Background
In the field of personalized commodity recommendation, current recommendation algorithms include:
1. Popularity-based recommendation, where popularity is defined by browsing volume, sales volume, social hotspots, and the like; it effectively addresses the cold-start problem.
2. Collaborative-filtering-based recommendation, which relies only on user behavior and needs no deep understanding of commodity features or metadata. It has high coverage, effectively addresses the long-tail problem, and can pleasantly surprise users with serendipitous results.
3. Content/rule/knowledge-based recommendation, which relies on a complete, well-categorized content/rule/knowledge architecture and structured data features. It is highly timely, gives reasonable explanations for its recommendations, and earns high user trust.
4. Combinations of the above algorithms, whose complementary strengths address cold start, data sparsity, and unstructured data, giving a wide range of application.
In all of these prior approaches, including popularity-based recommendation, collaborative filtering, content/rule/knowledge-based recommendation, and their combinations, the recommendation result is simply output and displayed to the user, and the recommendation strategy is not dynamically adjusted according to user feedback (such as whether the user clicked or purchased), so the recommendation effect is less than ideal. In addition, because the repurchase cycle specific to each commodity is not considered (the repurchase cycle of durable goods is generally longer than that of fast-moving consumer goods), commodities are repeatedly recommended within their non-repurchase period, wasting display slots and harming the user experience (causing annoyance, distrust, and the like).
Disclosure of Invention
The invention aims to provide an information pushing method and device that can dynamically adjust the recommendation strategy according to user feedback and avoid repeated commodity recommendation within the non-repurchase period.
In order to achieve the above object, the invention provides an information pushing method, comprising:
acquiring user information, commodity information and commodity sequence information of users' historical behaviors;
training a user-commodity interest model by a reinforcement learning method according to the acquired information, and training a user-commodity repurchase cycle model by a deep learning method;
when a current user accesses online, looking up the current user's user information, commodity information and commodity sequence information of historical behaviors, and obtaining the current user's repurchase cycle for each commodity in a commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model;
and comparing the current user's historical purchase record time with the repurchase cycle of each commodity in the commodity list, and filtering out the commodities not due for repurchase to obtain the current user's recommended commodity list.
In order to achieve the above object, the present invention further provides an information pushing apparatus, including:
the data management module, used for acquiring user information, commodity information and commodity sequence information of users' historical behaviors;
the model training module, used for training a user-commodity interest model by a reinforcement learning method according to the information acquired from the data management module, and for training a user-commodity repurchase cycle model by a deep learning method;
and the online recommendation module, used for looking up the current user's user information, commodity information and commodity sequence information of historical behaviors when the current user accesses online, obtaining the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model, comparing the current user's historical purchase record time with the repurchase cycle of each commodity, and filtering out the commodities not due for repurchase to obtain the current user's recommended commodity list.
In summary, the invention provides an improved information pushing method and device based on the combination of reinforcement learning and deep learning: user information, commodity information and commodity sequence information of users' historical behaviors are acquired; a user-commodity interest model is trained by a reinforcement learning method on the acquired information, and a user-commodity repurchase cycle model is trained by a deep learning method; when the current user accesses online, the user's relevant information is looked up, and the user's repurchase cycle for each commodity in the commodity list is obtained from the two models; the current user's historical purchase record time is then compared with the repurchase cycle of each commodity in the commodity list, and commodities not due for repurchase are filtered out to obtain the current user's recommended commodity list. This scheme solves the two problems of the prior art identified above: that the recommendation strategy could not be adjusted dynamically according to user feedback, and that commodities were pushed repeatedly within their non-repurchase period.
Drawings
Fig. 1 is a flowchart illustrating an information push method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an information pushing apparatus according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The core idea of the invention is to first train two models offline, a user-commodity interest model and a user-commodity repurchase cycle model. The user-commodity interest model is trained by a reinforcement learning method on user information, commodity information and commodity sequence information of users' historical behaviors, and outputs a commodity list for each user; the user-commodity repurchase cycle model is trained by a deep learning method on user information and the information of the commodities each user has bought, and outputs each user's commodity repurchase cycles. When a user accesses online, commodities not due for repurchase are filtered out according to the two models, yielding the current user's recommended commodity list.
Fig. 1 is a schematic flow chart of an information push method according to the present invention, where the method includes:
and 11, acquiring user information, commodity information and commodity sequence information of user historical behaviors.
In this step, the acquired user information, commodity information, and commodity sequence information of the user historical behaviors are obtained by processing the acquired information. Specifically, user information, commodity information, user behavior information and user order information of each user are collected, and then the collected information is sequentially subjected to extraction, cleaning, characterization and labeling to obtain the processed user information, commodity information and user historical behavior commodity sequence information.
The user portrait information obtained by processing the collected user information includes: user rating, gender, age, marital status, education, occupation, car ownership, first clothing purchase within the year, last year's shopping, days from the first order to the present, days from the last purchase to the present, days from the last login to the present, days from the first order of the year to the present, average order value over the last month, last two months and last three months, mother-and-baby badge level, personal-care-and-cosmetics badge level, wine badge level, user activity model, purchasing-power segment, life cycle, RFM full-category quality score, POP clothing RFM grouping and standardized score, POP supermarket RFM grouping and standardized score, large-appliance RFM grouping and standardized score, user value grouping, and user value standard score.
The commodity sequence information of user historical behaviors, obtained by processing the collected user behavior information and user order information, includes: the sequence of commodity IDs the user has historically browsed, the sequence of commodity IDs the user has historically purchased, the sequence of commodity IDs historically exposed to the user, and the like.
Processing the collected commodity information yields the commodity price, size, brand, sales volume, category, and the like. A commodity here is one that appears in the commodity sequence information of user historical behaviors.
Step 12: train a user-commodity interest model by a reinforcement learning method according to the acquired information, and train a user-commodity repurchase cycle model by a deep learning method.
Training the user-commodity interest model with the reinforcement learning method specifically comprises the following steps:
s121, constructing two identical neural networks, namely an eval neural network and a target neural network, wherein the eval neural network is used for obtaining a recommended commodity list and a predicted value of the quality degree of a recommended result, and the target neural network is used for updating parameters of the eval neural network;
in this step, the eval neural network is used to obtain a recommended goods list (action) and a predicted value Q of the quality degree of the recommended result.
S122, initializing eval neural network parameters thetaQAnd thetaμ(ii) a Initializing target neural network parameters θQ′=θQ,θμ′=θμ
S123, using the user information, commodity information and the commodity sequence information of the historical behaviors of the user as a piece of training data Si(ii) a Will(s)i,ai,ri,si+1) The ith sample is taken as the N samples of the reinforcement learning training set; wherein, aiList of recommended items for ith sample, riUser feedback for the ith sample, si+1The commodity sequence information is the commodity sequence information of the user information, commodity information and user historical behaviors of the next state of the ith sample;
wherein(s)i,ai,ri,si+1) As a sample, byiBehavioral influence is given by si+1. By analogy, the i +1 th sample is(s)i+1,ai+1,ri+1,si+2)。
S124, recommending knots according to the ith sampleConstructing a first loss function according to the target value of the result quality degree and the predicted value of the recommended result quality degree; in the eval neural network, the first loss function is optimized by taking a small value, and the parameter theta of the eval neural network is updatedQ
And calculating the gradient of the first loss function by adopting a gradient descending method, and selecting the direction with the fastest gradient descending so as to minimize the first loss function.
S125, in the eval neural network, optimizing the expectation function fed back by the user by taking a small value, and updating the parameter theta of the eval neural networkμ
S126, according to thetaQAnd thetaμUpdating the parameter of the target neural network to be thetaQ′And thetaμ′And obtaining the trained user-commodity interest model.
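These steps follow the general pattern of actor-critic reinforcement learning with separate eval and target networks (DDPG-style, matching the Actor/Critic embodiment described later in this description). The following is a minimal sketch, assuming PyTorch; the network sizes, learning rates, `gamma`, and `tau` are illustrative assumptions, not values fixed by the specification.

```python
# Minimal sketch of S121-S126, assuming PyTorch (DDPG-style actor-critic).
# All dimensions and hyperparameters below are illustrative assumptions.
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 568, 52   # assumed sizes of s_i and a_i
gamma, tau = 0.99, 0.005          # assumed discount factor and soft-update rate

# S121: two identical network pairs; eval nets are trained, target nets track them.
actor_eval = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                           nn.Linear(128, action_dim))
critic_eval = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                            nn.Linear(128, 1))
# S122: target parameters initialized equal to the eval parameters.
actor_target = copy.deepcopy(actor_eval)
critic_target = copy.deepcopy(critic_eval)

actor_opt = torch.optim.Adam(actor_eval.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic_eval.parameters(), lr=1e-3)

def train_step(s, a, r, s_next):
    """One update from a batch of samples (s_i, a_i, r_i, s_{i+1}); r has shape (batch, 1)."""
    # S124: first loss = MSE between target value y_i and prediction Q(s_i, a_i).
    with torch.no_grad():
        a_next = actor_target(s_next)                 # a_{i+1} = mu'(s_{i+1} | theta^mu')
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    q = critic_eval(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # S125: update the actor so its actions score higher under the critic
    # (gradient ascent on the expected user feedback, i.e. descent on its negative).
    actor_loss = -critic_eval(torch.cat([s, actor_eval(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # S126: soft-update the target networks from the eval networks.
    with torch.no_grad():
        for net, tgt in ((actor_eval, actor_target), (critic_eval, critic_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```

In these terms, `actor_eval` plays the role of the policy $\mu(s \mid \theta^\mu)$ producing the recommendation action, and `critic_eval` the role of $Q(s, a \mid \theta^Q)$ scoring it.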
Training the user-commodity repurchase cycle model with the deep learning method specifically comprises the following steps (a code sketch follows below):
SS121, taking each user's user information and the information of the commodities the user has bought as the $i$-th of the $M$ samples in a deep learning training set, wherein $i \in \{1, \dots, M\}$ and $M$ is a natural number;
SS122, inputting the user information and purchased-commodity information of the $i$-th sample into the user-commodity repurchase cycle model to obtain a training value of the commodity repurchase cycle;
SS123, constructing a second loss function from the repurchase cycle training value and the true repurchase cycle value of the $i$-th sample, minimizing the second loss function, and updating the network weight parameters to obtain the trained user-commodity repurchase cycle model.
The true repurchase cycle value of each sample is determined from actual purchase record times. When the training values approach the true values, the second loss function reaches its minimum, the resulting network weight parameters are optimal, and model training is finished; the trained user-commodity repurchase cycle model can then be used in step 13 below to determine, for the currently visiting user, the repurchase cycle of each commodity in the commodity list.
The gradient of the second loss function is computed by gradient descent, following the direction of steepest descent so that the second loss function is minimized. After training, the user-commodity repurchase cycle model corresponds to a set of optimal network weight parameters.
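As a concrete illustration of SS121 to SS123, the sketch below (again assuming PyTorch) trains a small fully connected regression network with a mean-squared-error second loss minimized by gradient descent. The 308-dimensional input matches the 256-dimensional user vector plus 52-dimensional commodity vector of the embodiment described later; the layer widths and learning rate are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                 # three fully connected layers
    nn.Linear(308, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),                  # predicted repurchase cycle (e.g. in days)
)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(x, y_true):
    """x: (batch, 308) user+commodity features; y_true: (batch, 1) true cycles (SS123)."""
    y_pred = model(x)                              # SS122: repurchase cycle training value
    loss = nn.functional.mse_loss(y_pred, y_true)  # SS123: second loss function
    opt.zero_grad(); loss.backward(); opt.step()   # gradient descent on the weights
    return loss.item()
```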
Step 13: when the current user accesses online, look up the current user's user information, commodity information and commodity sequence information of historical behaviors, and obtain the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model.
The specific steps are as follows: when the current user accesses, obtain the user information, commodity information and commodity sequence information of the user's historical behaviors according to the current user's identifier;
input the user information, commodity information and commodity sequence information of the user's historical behaviors into the user-commodity interest model to obtain the current user's commodity list;
and input the current user's information together with the information of each commodity in the commodity list into the user-commodity repurchase cycle model to obtain the current user's repurchase cycle for each commodity in the list.
Step 14: compare the current user's historical purchase record time with the repurchase cycle of each commodity in the commodity list, and filter out the commodities not due for repurchase to obtain the current user's recommended commodity list. A sketch of this online path follows below.
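Putting the lookups of step 13 together, the following is a hedged sketch of the online path; the data-store and model interfaces are assumptions passed in as callables, since the specification does not fix them (the step 14 filter is sketched later, in the embodiment section):

```python
from typing import Callable, Dict, List, Tuple

def online_recommend(
    user_id: str,
    lookup: Callable[[str], Tuple[dict, dict, list]],        # -> (user info, commodity info, behavior sequences)
    interest_model: Callable[[dict, dict, list], List[str]], # -> commodity list (user-commodity interest model)
    cycle_model: Callable[[dict, str], float],               # -> repurchase cycle of one commodity
) -> Dict[str, float]:
    """Step 13: fetch the current user's data by identifier, obtain the commodity
    list from the user-commodity interest model, then query the repurchase cycle
    model once per commodity in that list."""
    user_info, item_info, history = lookup(user_id)
    candidates = interest_model(user_info, item_info, history)
    return {item: cycle_model(user_info, item) for item in candidates}
```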
Thus, the information push method of the present invention is completed.
Based on the same inventive concept, the invention also discloses an information pushing apparatus, a schematic structural diagram of which is shown in Fig. 2. The apparatus comprises:
the data management module 201 is used for acquiring user information, commodity information and commodity sequence information of user historical behaviors;
the model training module 202 is used for training a user-commodity interest model by adopting a reinforcement learning method according to the information acquired from the data management module; training a user-commodity repurchase period model by adopting a deep learning method;
the online recommendation module 203 is used for searching relevant information of the current user when the current user performs online access, and obtaining a repurchase cycle of each commodity in the commodity list by the current user according to the user-commodity interest model and the user-commodity repurchase cycle model; and comparing the historical purchase recording time of the current user with the re-purchasing period of each commodity in the commodity list, and filtering the non-re-purchased commodities in the commodity list to obtain the recommended commodity list of the current user.
The model training module 202, when training the user-commodity interest model with the reinforcement learning method, is specifically configured to:
construct two identical neural networks, an eval neural network and a target neural network, wherein the eval neural network is used to obtain the recommended commodity list and the predicted value of the recommendation result quality, and the target neural network is used to update the parameters of the eval neural network;
initialize the eval network parameters $\theta^Q$ and $\theta^\mu$, and initialize the target network parameters $\theta^{Q'} = \theta^Q$, $\theta^{\mu'} = \theta^\mu$;
take the user information, commodity information and commodity sequence information of each user's historical behaviors as one piece of training data $s_i$, and take $(s_i, a_i, r_i, s_{i+1})$ as the $i$-th of the $N$ samples in the reinforcement learning training set, wherein $a_i$ is the recommended commodity list of the $i$-th sample, $r_i$ is the user feedback of the $i$-th sample, and $s_{i+1}$ is the user information, commodity information and commodity sequence information of user historical behaviors in the next state;
construct a first loss function from the target value and the predicted value of the recommendation result quality of the $i$-th sample; in the eval neural network, minimize the first loss function and update the eval network parameter $\theta^Q$;
in the eval neural network, optimize the expectation function of the user feedback and update the eval network parameter $\theta^\mu$;
and update the target network parameters $\theta^{Q'}$ and $\theta^{\mu'}$ from $\theta^Q$ and $\theta^\mu$ to obtain the trained user-commodity interest model.
The model training module 202, when training the user-commodity repurchase cycle model with the deep learning method, is specifically configured to:
take each user's user information and the information of the commodities the user has bought as the $i$-th of the $M$ samples in a deep learning training set, wherein $i \in \{1, \dots, M\}$ and $M$ is a natural number;
input the user information and purchased-commodity information of the $i$-th sample into the user-commodity repurchase cycle model to obtain a commodity repurchase cycle training value;
and construct a second loss function from the commodity repurchase cycle training value and the true repurchase cycle value of the $i$-th sample, minimize the second loss function, and update the network weight parameters to obtain the trained user-commodity repurchase cycle model.
When the current user accesses online, the online recommendation module 203 looks up the current user's user information, commodity information and commodity sequence information of historical behaviors, and obtains the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model. Specifically, it is configured to:
when the current user accesses, obtain the user information, commodity information and commodity sequence information of the user's historical behaviors according to the current user's identifier;
input the user information, commodity information and commodity sequence information of the user's historical behaviors into the user-commodity interest model to obtain the current user's commodity list;
and input the current user's information together with the information of each commodity in the commodity list into the user-commodity repurchase cycle model to obtain the current user's repurchase cycle for each commodity in the list.
In one embodiment, specific scenarios are described below to illustrate the invention clearly.
1. The data management module collects, processes and stores information by user dimension, for later use by the model training module and the online recommendation module.
2. Training the user-commodity interest model with the reinforcement learning method
1) First, the two identical neural networks, the eval network and the target network, are constructed, and the eval network parameters $\theta^Q$ and $\theta^\mu$ are initialized; the target network parameters are initialized as $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$. At the same time, a random process noise $\mathcal{N}$ is initialized for action exploration.
It should be noted that both the Actor network and the Critic network have a target-net and an eval-net; only the eval-net parameters are trained during training, while the target-net parameters are periodically copied from the corresponding eval-nets. The purpose of separating target and eval networks is to avoid updating parameters on consecutive, correlated states: the states before and after an update are correlated, so the training data are no longer independent and identically distributed, the neural network sees only a narrow slice of the problem, and the overall training may even fail to converge. The Actor eval-net outputs an action according to the state, and the Critic eval-net outputs a Q value according to the state and the action.
2) A training-data state $s_i$ fetched from the data management module is input into the Actor eval-net, and the Actor selects an action according to the current policy and exploration noise:

$$a_i = \mu(s_i \mid \theta^\mu) + \mathcal{N}_i$$

In state $s_i$, the action $a_i$ acts on the environment, which returns the user feedback (reward) $r_i$ and a new state $s_{i+1}$.
Take $(s_i, a_i, r_i, s_{i+1})$ as one of the $N$ samples in the reinforcement learning training set.
3) The Critic eval-net obtains, from $s_i$ and $a_i$, the predicted value $Q(s_i, a_i)$ of the recommendation result quality.
4) From the next state $s_{i+1}$, the Actor target-net computes the next (future) action: $a_{i+1} = \mu'(s_{i+1} \mid \theta^{\mu'})$.
5) The Critic target-net computes a new Q value: $Q' = Q'(s_{i+1}, a_{i+1} \mid \theta^{Q'})$.
6) The current user feedback $r_i$ is combined with the discounted Q value to form the target Q value, which is the target value of the recommendation result quality: $y_i = r_i + \gamma Q'$, i.e. $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$, where $\gamma$ denotes the discount factor.
7) Calculate the gradient of the Critic eval-net.
The first loss function of the Critic eval-net is defined in a manner similar to supervised learning: from the target value $y_i$ of the recommendation result quality of the $i$-th sample and the predicted value $Q(s_i, a_i)$, construct

$$L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^Q) \right)^2$$

where $y_i$ can be regarded as the label, $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$, $\gamma$ denotes the discount factor, and $N$ is the number of samples.
The Critic eval-net is updated by minimizing $L$, thereby updating $\theta^Q$.
8) Calculate the gradient of the Actor eval-net.
The expectation function of the user feedback for the Actor eval-net is defined as

$$J(\theta^\mu) = \mathbb{E}\left[ Q(s, a \mid \theta^Q) \,\big|\, s = s_i,\, a = \mu(s_i \mid \theta^\mu) \right]$$

The Actor eval-net parameter $\theta^\mu$ is computed and updated through the sampled policy gradient:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_{i} \nabla_a Q(s, a \mid \theta^Q) \big|_{s = s_i,\, a = \mu(s_i)} \; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s = s_i}$$
9) Update the target network parameters from $\theta^Q$ and $\theta^\mu$ by soft update, where $\tau$ is the update rate (see the sketch below):

$$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\, \theta^{Q'}$$

$$\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\, \theta^{\mu'}$$
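In code, the soft update of step 9) is a per-parameter exponential moving average; a minimal numpy sketch, where the value of $\tau$ is an assumption:

```python
import numpy as np

tau = 0.01  # assumed soft-update rate

def soft_update(theta_eval: np.ndarray, theta_target: np.ndarray) -> np.ndarray:
    """theta' <- tau * theta + (1 - tau) * theta', applied per parameter array."""
    return tau * theta_eval + (1 - tau) * theta_target
```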
In conclusion, through this training process the user-commodity interest model can be updated dynamically, adjusting the strategy of the recommended commodity list in real time. Given a user's user information, commodity information and commodity sequence information as input, the user-commodity interest model outputs the commodity list corresponding to that user, containing the commodity identifier of each commodity.
3. Training the user-commodity repurchase cycle model with the deep learning method
The user-commodity repurchase cycle model mainly adopts a three-layer fully connected network, and the neuron weight parameters of each layer, collectively the weight parameter $\theta$, are initialized first.
The collected and processed user information of each user and the information of the commodities the user has bought are taken as the $i$-th sample of the deep learning training set of size $M$, where $i \in \{1, \dots, M\}$;
the user information and purchased-commodity information of the $i$-th sample are input into the user-commodity repurchase cycle model to obtain the commodity repurchase cycle training value $h_\theta(x^{(i)})$, where $h_\theta(x) = \theta^T x = \theta_0 x_0 + \dots + \theta_n x_n$, $\theta$ denotes the network weight parameters, $x$ the input feature vector, and $n$ the feature dimension;
the second loss function is constructed from the commodity repurchase cycle training value and the true repurchase cycle value $y^{(i)}$ of the $i$-th sample:

$$J(\theta) = \frac{1}{2M} \sum_{i=1}^{M} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

and minimized, updating the network weight parameter $\theta$ to obtain the trained user-commodity repurchase cycle model. The gradient of the second loss function is computed by gradient descent, following the direction of steepest descent so that the loss function $J(\theta)$ is minimized.
In addition, $y^{(i)}$ is input to the user-commodity repurchase cycle model as label data for gradient training. The label data is obtained by collecting each user's real purchase record times within a preset time window, for example each user's purchase records within one year, and computing the repurchase time of the corresponding commodities. Briefly, if a user bought a mobile phone twice within a year, the true repurchase cycle value of that user for the phone is one year / 2 = 6 months; a sketch of this label computation follows below.
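The following is a hedged sketch of deriving these labels from a window of purchase records; the record format and the handling of single purchases (which yield no repurchase label here) are assumptions, since the specification does not fix them:

```python
from collections import defaultdict

def repurchase_cycle_labels(purchases, window_days=365):
    """purchases: iterable of (user_id, commodity_id) pairs observed in the window.
    Returns {(user_id, commodity_id): true repurchase cycle in days}; e.g. a phone
    bought twice in one year gives 365 / 2, i.e. about 6 months. Commodities bought
    only once yield no label (an assumption; the patent does not specify this case)."""
    counts = defaultdict(int)
    for user_id, commodity_id in purchases:
        counts[(user_id, commodity_id)] += 1
    return {key: window_days / n for key, n in counts.items() if n >= 2}
```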
In summary, given a user's information and the information of the commodities the user has bought, the user-commodity repurchase cycle model outputs the user's repurchase cycle for each corresponding commodity. With this trained model, commodities not due for repurchase can be filtered out.
4. Online recommendation process
When the current user accesses, the following are fetched from the data management module according to the user ID: the user portrait information, the user's historically browsed commodity ID sequence, historically purchased commodity ID sequence and historically exposed commodity ID sequence, and each commodity's price, size, brand, sales volume and category.
The information acquired from the data management module is input into the user-commodity interest model to obtain the current user's commodity list. For example, the commodity list of current user ID1 contains commodity ID1, commodity ID2 and commodity ID3.
For user ID1, the user portrait information and the commodity information corresponding to each of commodity ID1, commodity ID2 and commodity ID3 are then fetched from the data management module.
The user portrait information of current user ID1 and the commodity information of commodity ID1 are input into the user-commodity repurchase cycle model to obtain user ID1's repurchase cycle T1 for commodity ID1;
the user portrait information of current user ID1 and the commodity information of commodity ID2 are input into the user-commodity repurchase cycle model to obtain user ID1's repurchase cycle T2 for commodity ID2;
and the user portrait information of current user ID1 and the commodity information of commodity ID3 are input into the user-commodity repurchase cycle model to obtain user ID1's repurchase cycle T3 for commodity ID3.
For each commodity in the commodity list, the current user's historical purchase record time t (held in the data management module) is compared with the commodity's repurchase cycle T, and the commodity is written into the recommended commodity list if t > T. For example, if user ID1 last purchased commodity ID1 100 days ago, and 100 days is greater than the repurchase cycle T1, then commodity ID1 is available for repurchase and is written into the recommended commodity list, as sketched below.
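This comparison reduces to a simple filter: keep a commodity only when the time since the user last bought it exceeds its predicted repurchase cycle. A minimal sketch (treating never-purchased commodities as passing the filter, which is an assumption):

```python
def filter_repurchasable(candidates, days_since_purchase, cycles):
    """candidates: commodity IDs from the interest model;
    days_since_purchase[c]: days since the user last bought commodity c (t);
    cycles[c]: the user's predicted repurchase cycle for c (T).
    Keeps commodities with t > T; commodities with no purchase record pass
    through unfiltered (an assumption)."""
    return [c for c in candidates
            if c not in days_since_purchase
            or days_since_purchase[c] > cycles[c]]
```

For the example above, `filter_repurchasable(["ID1"], {"ID1": 100}, {"ID1": 60})` keeps commodity ID1, since 100 days exceeds its 60-day cycle.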
In another embodiment, when the current user accesses online, recommended commodities are obtained through the user-commodity interest model and the user-commodity repurchase cycle model as follows.
After the collected information is processed, the current user's user information (such as gender and age) forms a 256-dimensional user vector, and the user's historical browsing sequence contains the commodity ID, price and monthly sales of each browsed commodity. For example, suppose the current user browsed 3 commodities, a Huawei phone, a MacBook and a Sony camera, priced 5000, 15000 and 20000, with monthly sales of 3000, 2000 and 1000, respectively; the historical browsing sequence is then represented as [Huawei phone, 5000, 3000, MacBook, 15000, 2000, Sony camera, 20000, 1000], a 312-dimensional vector in total.
Inputting this 256 + 312 = 568-dimensional vector into the user-commodity interest model yields the commodity list corresponding to the user.
Inputting the 256-dimensional user information together with 52-dimensional commodity information, a 256 + 52 = 308-dimensional vector, into the user-commodity repurchase cycle model yields the repurchase cycle of each commodity in the user's commodity list (the assembly of these input vectors is sketched at the end of this embodiment).
Finally, the current user's historical purchase record time is compared with the repurchase cycle of each commodity in the commodity list, and the commodities not due for repurchase are filtered out to obtain the current user's recommended commodity list.
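The following sketch shows how the input vectors of this example might be assembled. The overall dimensions (256, 312, 568) come from the text; the split of the 312 browsing dimensions into a 102-dimensional ID embedding plus price and monthly sales per commodity (3 × 104 = 312), and the pseudo-embedding itself, are assumptions for illustration:

```python
import numpy as np

def embed_commodity_id(commodity_id: str, dim: int = 102) -> np.ndarray:
    """Assumed placeholder: a pseudo-embedding of the commodity ID
    (stable within one process; a real system would use learned embeddings)."""
    seed = abs(hash(commodity_id)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

user_vec = np.random.default_rng(0).standard_normal(256)  # assumed 256-d user vector

browsed = [("Huawei phone", 5000, 3000),
           ("MacBook", 15000, 2000),
           ("Sony camera", 20000, 1000)]
history = np.concatenate([np.concatenate([embed_commodity_id(name), [price, sales]])
                          for name, price, sales in browsed])  # 3 * (102+2) = 312 dims

interest_input = np.concatenate([user_vec, history])           # 256 + 312 = 568 dims
assert interest_input.shape == (568,)
```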
The modules of the above embodiments may be integrated into one module or deployed separately, and a module may be combined with others into one module or further split into multiple sub-modules.
In addition, an embodiment of the present application further provides an electronic device, a schematic structural diagram of which is shown in Fig. 3, comprising a memory 301, a processor 302, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above information pushing method when executing the program.
Furthermore, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the above information pushing method.
In conclusion, the scheme of the invention trains the user-commodity interest model by a reinforcement learning method, so that user feedback can be perceived and the model's recommendation strategy dynamically adjusted during training, and trains the user-commodity repurchase cycle model by a deep learning method, so that commodities not due for repurchase can be filtered out, thereby forming an accurate and objective commodity recommendation list.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An information pushing method, characterized in that the method comprises:
acquiring user information, commodity information and commodity sequence information of user historical behaviors;
training a user-commodity interest model by a reinforcement learning method according to the acquired information; training a user-commodity repurchase cycle model by a deep learning method;
when a current user accesses online, looking up the current user's user information, commodity information and commodity sequence information of historical behaviors, and obtaining the current user's repurchase cycle for each commodity in a commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model;
and comparing the current user's historical purchase record time with the repurchase cycle of each commodity in the commodity list, and filtering out the commodities not due for repurchase to obtain the current user's recommended commodity list.
2. The method of claim 1, wherein training the user-commodity interest model using reinforcement learning specifically comprises:
constructing two identical neural networks, an eval neural network and a target neural network, wherein the eval neural network is used to obtain a recommended commodity list and a predicted value of the recommendation result quality, and the target neural network is used to update the parameters of the eval neural network;
initializing the eval network parameters $\theta^Q$ and $\theta^\mu$, and initializing the target network parameters $\theta^{Q'} = \theta^Q$, $\theta^{\mu'} = \theta^\mu$;
taking the user information, commodity information and commodity sequence information of each user's historical behaviors as one piece of training data $s_i$, and taking $(s_i, a_i, r_i, s_{i+1})$ as the $i$-th of the $N$ samples in the reinforcement learning training set, wherein $a_i$ is the recommended commodity list of the $i$-th sample, $r_i$ is the user feedback of the $i$-th sample, and $s_{i+1}$ is the user information, commodity information and commodity sequence information of user historical behaviors in the next state;
constructing a first loss function from the target value and the predicted value of the recommendation result quality of the $i$-th sample; in the eval neural network, minimizing the first loss function and updating the eval network parameter $\theta^Q$;
in the eval neural network, optimizing the expectation function of the user feedback and updating the eval network parameter $\theta^\mu$;
and updating the target network parameters $\theta^{Q'}$ and $\theta^{\mu'}$ from $\theta^Q$ and $\theta^\mu$ to obtain the trained user-commodity interest model.
3. The method of claim 1, wherein training the user-commodity repurchase cycle model using the deep learning method specifically comprises:
taking each user's user information and the information of the commodities the user has bought as the $i$-th of the $M$ samples in a deep learning training set, wherein $i \in \{1, \dots, M\}$ and $M$ is a natural number;
inputting the user information and purchased-commodity information of the $i$-th sample into the user-commodity repurchase cycle model to obtain a commodity repurchase cycle training value;
and constructing a second loss function from the commodity repurchase cycle training value and the true repurchase cycle value of the $i$-th sample, minimizing the second loss function, and updating the network weight parameters to obtain a trained user-commodity repurchase cycle model.
4. The method according to claim 1, wherein looking up the current user's user information, commodity information and commodity sequence information of historical behaviors when the current user accesses online, and obtaining the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model, specifically comprises:
when the current user accesses, obtaining the user information, commodity information and commodity sequence information of the user's historical behaviors according to the current user's identifier;
inputting the user information, commodity information and commodity sequence information of the user's historical behaviors into the user-commodity interest model to obtain the current user's commodity list;
and inputting the current user's information together with the information of each commodity in the commodity list into the user-commodity repurchase cycle model to obtain the current user's repurchase cycle for each commodity in the list.
5. An information pushing apparatus, comprising:
the data management module is used for acquiring user information, commodity information and commodity sequence information of historical behaviors of the user;
the model training module is used for training a user-commodity interest model by a reinforcement learning method according to the information acquired from the data management module, and for training a user-commodity repurchase cycle model by a deep learning method;
and the online recommendation module is used for looking up the current user's user information, commodity information and commodity sequence information of historical behaviors when the current user accesses online, obtaining the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model, comparing the current user's historical purchase record time with the repurchase cycle of each commodity in the commodity list, and filtering out the commodities not due for repurchase to obtain the current user's recommended commodity list.
6. The apparatus of claim 5, wherein the model training module, when employing reinforcement learning to train the user-commodity interest model, is specifically configured to,
construct two identical neural networks, an eval neural network and a target neural network, wherein the eval neural network is used to obtain a recommended commodity list and a predicted value of the recommendation result quality, and the target neural network is used to update the parameters of the eval neural network;
initialize the eval network parameters $\theta^Q$ and $\theta^\mu$, and initialize the target network parameters $\theta^{Q'} = \theta^Q$, $\theta^{\mu'} = \theta^\mu$;
take the user information, commodity information and commodity sequence information of each user's historical behaviors as one piece of training data $s_i$, and take $(s_i, a_i, r_i, s_{i+1})$ as the $i$-th of the $N$ samples in the reinforcement learning training set, wherein $a_i$ is the recommended commodity list of the $i$-th sample, $r_i$ is the user feedback of the $i$-th sample, and $s_{i+1}$ is the user information, commodity information and commodity sequence information of user historical behaviors in the next state;
construct a first loss function from the target value and the predicted value of the recommendation result quality of the $i$-th sample; in the eval neural network, minimize the first loss function and update the eval network parameter $\theta^Q$;
in the eval neural network, optimize the expectation function of the user feedback and update the eval network parameter $\theta^\mu$;
and update the target network parameters $\theta^{Q'}$ and $\theta^{\mu'}$ from $\theta^Q$ and $\theta^\mu$ to obtain the trained user-commodity interest model.
7. The apparatus according to claim 5, wherein the model training module, when training the user-commodity repurchase cycle model using the deep learning method, is specifically configured to:
take each user's user information and the information of the commodities the user has bought as the $i$-th of the $M$ samples in a deep learning training set, wherein $i \in \{1, \dots, M\}$ and $M$ is a natural number;
input the user information and purchased-commodity information of the $i$-th sample into the user-commodity repurchase cycle model to obtain a commodity repurchase cycle training value;
and construct a second loss function from the commodity repurchase cycle training value and the true repurchase cycle value of the $i$-th sample, minimize the second loss function, and update the network weight parameters to obtain a trained user-commodity repurchase cycle model.
8. The apparatus according to claim 5, wherein the online recommendation module, when the current user accesses online, looks up the current user's user information, commodity information and commodity sequence information of historical behaviors and obtains the current user's repurchase cycle for each commodity in the commodity list according to the user-commodity interest model and the user-commodity repurchase cycle model, and is specifically configured to:
when the current user accesses, obtain the user information, commodity information and commodity sequence information of the user's historical behaviors according to the current user's identifier;
input the user information, commodity information and commodity sequence information of the user's historical behaviors into the user-commodity interest model to obtain the current user's commodity list;
and input the current user's information together with the information of each commodity in the commodity list into the user-commodity repurchase cycle model to obtain the current user's repurchase cycle for each commodity in the list.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 4 when executing the program.
10. A computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements the method of any of claims 1-4.
CN201910512050.0A 2019-06-13 2019-06-13 Information pushing method and device Pending CN111738787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910512050.0A CN111738787A (en) 2019-06-13 2019-06-13 Information pushing method and device


Publications (1)

Publication Number Publication Date
CN111738787A true CN111738787A (en) 2020-10-02

Family

ID=72646334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910512050.0A Pending CN111738787A (en) 2019-06-13 2019-06-13 Information pushing method and device

Country Status (1)

Country Link
CN (1) CN111738787A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157097A (en) * 2016-08-22 2016-11-23 北京京东尚科信息技术有限公司 Method of Commodity Recommendation and system
WO2018069817A1 (en) * 2016-10-10 2018-04-19 Tata Consultancy Services Limited System and method for predicting repeat behavior of customers
CN108038545A (en) * 2017-12-06 2018-05-15 湖北工业大学 Fast learning algorithm based on Actor-Critic neutral net continuous controls
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN109872220A (en) * 2019-01-24 2019-06-11 上海朝朝晤网络科技有限公司 A kind of commercial product recommending list method for pushing and commercial product recommending list supplying system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾宪宇; 刘淇; 赵洪科; 徐童; 王怡君; 陈恩红: "User Online Purchase Prediction: A Method Based on User Operation Sequences and Choice Models" (用户在线购买预测：一种基于用户操作序列和选择模型的方法), 计算机研究与发展 (Journal of Computer Research and Development), no. 08, 15 August 2016 *
李裕礞; 练绪宝; 徐博; 王健; 林鸿飞: "Next-Basket Recommendation Based on Users' Implicit Feedback Behavior" (基于用户隐性反馈行为的下一个购物篮推荐), 中文信息学报 (Journal of Chinese Information Processing), no. 05, 15 September 2017 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538141A (en) * 2021-07-14 2021-10-22 中数通信息有限公司 Product recommendation method based on customer information
CN115345718A (en) * 2022-10-19 2022-11-15 易商惠众(北京)科技有限公司 Exclusive-based commodity recommendation method and system

Similar Documents

Publication Publication Date Title
CN109102127B (en) Commodity recommendation method and device
CN107563841B (en) Recommendation system based on user score decomposition
US11651381B2 (en) Machine learning for marketing of branded consumer products
CN109087178B (en) Commodity recommendation method and device
CN112487278A (en) Training method of recommendation model, and method and device for predicting selection probability
Chinnamgari R Machine Learning Projects: Implement supervised, unsupervised, and reinforcement learning techniques using R 3.5
CN113508378A (en) Recommendation model training method, recommendation device and computer readable medium
CN107784390A (en) Recognition methods, device, electronic equipment and the storage medium of subscriber lifecycle
CN111242729A (en) Serialization recommendation method based on long-term and short-term interests
Rao A strategist’s guide to artificial intelligence
CN112508613A (en) Commodity recommendation method and device, electronic equipment and readable storage medium
CN113191838B (en) Shopping recommendation method and system based on heterogeneous graph neural network
CN113158024B (en) Causal reasoning method for correcting popularity deviation of recommendation system
CN111949887A (en) Item recommendation method and device and computer-readable storage medium
KR102422408B1 (en) Method and apparatus for recommending item based on collaborative filtering neural network
CN111738787A (en) Information pushing method and device
Ahamed et al. A recommender system based on deep neural network and matrix factorization for collaborative filtering
CN111680213B (en) Information recommendation method, data processing method and device
CN113850654A (en) Training method of item recommendation model, item screening method, device and equipment
CN103425791A (en) Article recommending method and device based on ant colony collaborative filtering algorithm
CN107169830A (en) A kind of personalized recommendation method based on cluster PU matrix decompositions
CN115344794A (en) Scenic spot recommendation method based on knowledge map semantic embedding
CN111681081B (en) Interactive product configuration method and system, computer readable storage medium and terminal
CN113761388A (en) Recommendation method and device, electronic equipment and storage medium
JP6686208B1 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination