CN109471963A - A recommendation algorithm based on deep reinforcement learning - Google Patents
A recommendation algorithm based on deep reinforcement learning
- Publication number
- CN109471963A (application number CN201811070447.0A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- value
- current state
- action
- next state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
Abstract
The present invention proposes a recommendation algorithm based on deep reinforcement learning. It constructs a dual-network model consisting of a MainNet neural network and a TargetNet neural network, where the MainNet is the main neural network, used to obtain the recommendation list for the user, and the TargetNet is used to train the model parameters, obtain the optimal model parameters, and continually update them. The current state used as the input of the MainNet includes not only long-term user features but also external condition features, laying a foundation for accurately predicting user shopping behavior. The invention overcomes a shortcoming of conventional machine learning in that no accumulation of historical data is needed: as long as the website has transaction activity, the algorithm can learn, optimize, and improve by itself.
Description
Technical field
The present invention relates to the field of recommendation methods, and more particularly to a recommendation algorithm based on deep reinforcement learning.
Background technique
At present, analyzing user behavior so that the system can "guess" which items interest a user, thereby improving the user experience, is a major piece of system engineering. Common recommendation algorithms include collaborative filtering, content-based recommendation, and association-rule-based recommendation, and these algorithms have the following problems: (1) the cold-start problem: it is difficult to decide which items to recommend to a new user, or to which users a new item should be recommended, so new users and new items cannot receive scientifically sound recommendations; (2) the long-tail phenomenon: popular items are recommended heavily while less popular items are recommended rarely, so popular items become ever more popular and unpopular items ever less popular; the recommender system also cannot recommend novel items and cannot bring the user pleasant surprises; (3) protection of user privacy: a recommender system needs the user's historical behavior information and even demographic attributes; a system that cannot protect user privacy well makes users feel insecure and unwilling to provide personal information, and may even become unable to produce effective recommendations; (4) common recommendation algorithms belong to the field of machine learning and require large accumulated historical data sets, which is unrealistic for an e-commerce system that has just gone online and has no user or transaction data.
Summary of the invention
To overcome at least one of the above drawbacks of the prior art, the present invention provides a recommendation algorithm based on deep reinforcement learning.
To solve the above technical problems, the technical solution of the present invention is as follows:
A recommendation algorithm based on deep reinforcement learning, characterized in that it comprises the following steps:
S1: Initialize the experience pool and set its capacity W. The experience pool is the set of product-recommendation actions and is used to store training samples; it is empty before training starts;
S2: Build the MainNet neural network and initialize it; the MainNet is the main neural network, used to obtain the recommendation list;
S3: Build the TargetNet neural network and initialize it; the TargetNet is used to train the model parameters and obtain the optimal model parameters;
S4: Set the training batch size M;
S5: Initialize the current state s_t from the N products the user browsed most recently at time t; if the user has browsed no products recently, use N popular products instead;
S6: Use the current state s_t as the input of the MainNet to obtain the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t, where s_t is the current state, a_t is the executed action, and θ_u is the parameter of the MainNet;
S7: According to the executed action a_t, the user clicks, purchases, or ignores according to his or her own interest; compute the reward r_t and the next state s_{t+1};
S8: Store the product-recommendation action tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool;
S9: Repeat steps S6-S8 until W training samples are stored in the experience pool;
S10: Randomly take M training samples from the experience pool and use the next state s_{t+1} of each sample as the input of the TargetNet to obtain the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1};
S11: Update the parameter θ_u of the MainNet;
S12: Every C rounds of iteration, where C is a preset iteration count, copy the parameters of the MainNet to the TargetNet.
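By way of illustration only (this sketch is not part of the claimed method), steps S1-S12 can be put together as the following minimal training loop. PyTorch is assumed; the state dimension, the number of optional actions, the network architecture, the small pool capacity, and the stub user-feedback function are illustrative assumptions rather than details given by the invention.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 15, 100    # assumed sizes, not fixed by the patent
W, M, C, GAMMA = 500, 64, 5, 0.9  # small pool for the sketch; the embodiment uses W = 100000

def make_net():
    # Assumed architecture: the patent only requires MainNet and TargetNet to
    # map a state to a Q-value list over the optional actions.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

main_net, target_net = make_net(), make_net()          # S2, S3
target_net.load_state_dict(main_net.state_dict())
optimizer = torch.optim.Adam(main_net.parameters())
pool = deque(maxlen=W)                                 # S1: experience pool of capacity W

def user_feedback(state, action):
    # S7 stub: on a live site this would observe the user's click/purchase/ignore
    # and return the reward r_t and the next state s_{t+1}.
    return 0.0, torch.randn(STATE_DIM)

def train_step():
    batch = random.sample(list(pool), M)               # S10: sample M tuples
    s  = torch.stack([b[0] for b in batch])
    a  = torch.tensor([b[1] for b in batch])
    r  = torch.tensor([b[2] for b in batch])
    s2 = torch.stack([b[3] for b in batch])
    with torch.no_grad():                              # TargetNet supplies TargetQ
        target_q = r + GAMMA * target_net(s2).max(dim=1).values
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s_t, a_t, θ_u)
    loss = nn.functional.mse_loss(q, target_q)                # the S112 loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()  # S11: update θ_u

state = torch.randn(STATE_DIM)                         # S5 stub initial state
for step in range(1, 2001):
    with torch.no_grad():
        q_list = main_net(state)                       # S6: Q-value list from MainNet
    action = int(q_list.argmax())                      # execute an optional action
    reward, next_state = user_feedback(state, action)  # S7
    pool.append((state, action, reward, next_state))   # S8
    if len(pool) == W:                                 # S9: pool filled with W samples
        train_step()                                   # S10-S11
    if step % C == 0:                                  # S12: copy θ_u every C rounds
        target_net.load_state_dict(main_net.state_dict())
    state = next_state
```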
Further, the initialization described in step S2 comprises: initializing the parameter θ_u; the input is the current state s_t and the output is the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t.
Further, the current state s_t is expressed as follows:
s_t = (b_t^1, b_t^2, ..., b_t^N, sex, holiday, month, day, weather)
where b_t^i is the i-th product the user browsed most recently at time t, sex is the user's gender, and holiday, month, day, and weather respectively indicate whether it is a festival or holiday, the current month, the day, and the weather.
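By way of illustration, assembling s_t might look as follows; the integer encodings of the context fields, the zero padding, and the popular-item fallback list are assumptions for the sketch, with only the overall layout and the S5 fallback rule taken from the text above.

```python
def build_state(browsed, popular, sex, holiday, month, day, weather, N=10):
    """Assemble s_t = (b_t^1, ..., b_t^N, sex, holiday, month, day, weather)."""
    items = browsed[-N:] if browsed else popular[:N]  # S5: fall back to N popular products
    items = items + [0] * (N - len(items))            # pad to length N (an assumption)
    return items + [sex, holiday, month, day, weather]

# Example: a new user with no browsing history on 2018-09-13, clear weather.
s_t = build_state(browsed=[],
                  popular=[101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
                  sex=1, holiday=0, month=9, day=13, weather=2)
```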
Further, the initialization described in step S3 comprises: initializing the parameter θ_u'; the input is the next state s_{t+1} and the output is the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1}.
Further, step S6 specifically comprises the following steps:
S61: Compute the Q values of the optional actions under the current state s_t by the following formula:
Q(s_t, a_t) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | s = s_t, a = a_t]
where γ is the discount factor, s_t is the current state, a_t is the current action, and R_{t+1}, R_{t+2}, and R_{t+3} are the reward-function values at times t+1, t+2, and t+3; E_π is the return value when Q(s, a, θ_u) is maximal and is a state decision function;
S62: Generate the Q-value list Q(s_t, a_t, θ_u) of the optional actions under the current state s_t.
Further, the executed action a_t described in step S7 is expressed as follows:
a_t = (p^1, p^2, ..., p^K)
where K is the number of products recommended to the user and p^i is the i-th product recommended to the user.
Further, the reward r_t described in step S7 is defined as follows: if a product-click action occurs under the current state, the reward is the number of products the user clicked; if a product-purchase action occurs under the current state, the reward is the price of the product the user purchased; in all other cases, the reward is 0.
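Expressed as code, this reward rule might read as follows; the feedback record and its field names are hypothetical stand-ins for whatever the site actually logs.

```python
def reward(feedback):
    """r_t per S7: click -> number of clicked products, purchase -> price, otherwise 0."""
    if feedback["type"] == "click":
        return float(len(feedback["items"]))  # number of products the user clicked
    if feedback["type"] == "purchase":
        return float(feedback["price"])       # price of the purchased product
    return 0.0                                # the user ignored the recommendation

# reward({"type": "click", "items": [12, 57]})               -> 2.0
# reward({"type": "purchase", "items": [57], "price": 19.9}) -> 19.9
```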
Further, step S10 specifically comprises the following steps:
S101: Randomly take M training samples from the experience pool;
S102: Compute the Q values of the optional actions under the next state s_{t+1} by the following formula:
Q(s_{t+1}, a_{t+1}) = E_π[R_{t+2} + γR_{t+3} + γ²R_{t+4} + ... | s = s_{t+1}, a = a_{t+1}]
where γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, and R_{t+2}, R_{t+3}, and R_{t+4} are the reward-function values at times t+2, t+3, and t+4;
S103: Generate the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under the next state s_{t+1}.
Further, step S11 specifically comprises the following steps:
S111: Compute the TargetQ value under the current state s_t by the following formula:
TargetQ = r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u')
where r_t is the reward of the current action, γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, θ_u' is the parameter of the TargetNet, and Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1} output by the TargetNet;
S112: Compute the loss function, and update the parameter θ_u of the MainNet so that the loss function reaches its minimum. The loss function is as follows:
L(θ_u) = E[(TargetQ - Q(s_t, a_t, θ_u))²]
       = E[(r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u') - Q(s_t, a_t, θ_u))²]
where E denotes the expectation, r_t is the reward of the current action, γ is the discount factor, Q(s_t, a_t, θ_u) is the Q-value list of the optional actions under the current state s_t, Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1}, s_t is the current state, a_t is the current action, s_{t+1} is the next state, a_{t+1} is the next action, θ_u is the parameter of the MainNet, and θ_u' is the parameter of the TargetNet.
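A toy numeric check of S111 and S112 for a single transition, with hand-picked Q values and an assumed discount factor γ = 0.9 (the patent does not fix γ); minimizing the expectation of this squared error over sampled batches is what drives the S11 update of θ_u.

```python
import numpy as np

gamma = 0.9
r_t = 5.0                              # e.g. the user clicked 5 products
q_next = np.array([1.2, 3.4, 2.0])     # Q(s_{t+1}, a_{t+1}, θ_u') from the TargetNet
target_q = r_t + gamma * q_next.max()  # S111: TargetQ = 5.0 + 0.9 * 3.4 = 8.06
q_main = 7.5                           # Q(s_t, a_t, θ_u) from the MainNet
sq_error = (target_q - q_main) ** 2    # one sample of the S112 loss: 0.56**2 = 0.3136
print(target_q, sq_error)
```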
Further, sex in the current state s_t is a long-term feature of the user, used to distinguish different groups; different user groups may make different choices under the same recommendation list. holiday, month, day, and weather in the current state s_t are external condition features; different external conditions can change a user's shopping behavior to a large extent, for example users are more active during festivals and holidays.
Further, when the user neither clicks nor purchases in response to the executed action a_t, the recommendation list remains unchanged; when the user clicks or purchases in response to a_t, the recommendation list changes: the earlier browsed products at the front of the list are removed and the products that have just received a click or purchase are filled in.
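A minimal sketch of this update rule, assuming the recently browsed products of the state are kept as a fixed-length list (the list and item IDs are illustrative):

```python
def update_recent(recent, acted_items, N=10):
    """Drop the oldest browsed products and append those just clicked or purchased."""
    if not acted_items:            # no click and no purchase: list unchanged
        return recent
    merged = recent + acted_items  # products that just got a click/purchase enter
    return merged[-N:]             # keep the N most recent, removing the front

# update_recent([1, 2, 3], [9], N=3) -> [2, 3, 9]
```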
Compared with the prior art, the beneficial effects of the technical solution of the present invention are: (1) on the data side, the recommendation algorithm based on deep reinforcement learning overcomes the shortcomings of conventional machine learning and needs no historical data; as long as the website has transaction activity, it can gradually learn, self-optimize, and self-improve; (2) by using the Q-value list of the optional actions, the correlation between items is fully taken into account, and the executed action is defined as the list of items recommended to the user; (3) the dual-network structure comprising the MainNet neural network and the TargetNet neural network improves the stability of the algorithm.
Description of the drawings
Fig. 1 is a flowchart of the recommendation algorithm based on deep reinforcement learning of the present invention.
Specific embodiment
The attached figures are only for illustrative purposes and cannot be understood as limiting this patent;
To better illustrate this embodiment, certain components in the figures are omitted, enlarged, or reduced, and do not represent the size of the actual product;
For those skilled in the art, it is understandable that certain known structures and their explanations may be omitted from the figures.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
In conjunction with Fig. 1, the specific implementation steps of the invention are as follows:
A recommendation algorithm based on deep reinforcement learning, characterized in that it comprises the following steps:
S1: Initialize the experience pool and set its capacity W = 100000. The experience pool is the set of product-recommendation actions and is used to store training samples; it is empty before training starts;
S2: Build the MainNet neural network and initialize it; the MainNet is the main neural network, used to obtain the recommendation list. The initialization comprises: initializing the network parameter θ_u from a standard normal distribution; the input is the current state s_t and the output is the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t;
S3: Build the TargetNet neural network and initialize it; the TargetNet is used to train the model parameters and obtain the optimal model parameters. The initialization comprises: initializing the parameter θ_u' from a standard normal distribution; the input is the next state s_{t+1} and the output is the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1};
S4: Set the training batch size M = 64;
S5: Initialize the current state s_t from the N products the user browsed most recently at time t, where N = 10; if the user has browsed no products recently, use N popular products instead;
S6: Use the current state s_t as the input of the MainNet to obtain the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t;
S7: According to the executed action a_t, the user clicks, purchases, or ignores according to his or her own interest; compute the reward r_t and the next state s_{t+1};
S8: Store the product-recommendation action tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool;
S9: Repeat steps S6-S8 until W training samples are stored in the experience pool, where W = 100000;
S10: Randomly take M training samples from the experience pool, where M = 64, and use the next state s_{t+1} of each sample as the input of the TargetNet to obtain the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1};
S11: Update the parameter θ_u of the MainNet;
S12: Every C rounds of iteration, where C = 5, copy the parameters of the MainNet to the TargetNet.
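Collected in one place, the concrete hyperparameter values of this embodiment are:

```python
# Hyperparameters stated in the embodiment above (the grouping into a dict is
# merely for readability and is not part of the invention).
CONFIG = {
    "W": 100_000,  # experience-pool capacity (S1)
    "M": 64,       # training samples drawn per batch (S4, S10)
    "N": 10,       # recently browsed products kept in the state (S5)
    "C": 5,        # rounds between MainNet-to-TargetNet parameter copies (S12)
}
```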
Specifically, the current state s_t is expressed as follows:
s_t = (b_t^1, b_t^2, ..., b_t^N, sex, holiday, month, day, weather)
where b_t^i is the i-th product the user browsed most recently at time t, sex is the user's gender, and holiday, month, day, and weather respectively indicate whether it is a festival or holiday, the current month, the day, and the weather.
Specifically, step S6 comprises the following steps:
S61: Compute the Q values of the optional actions under the current state s_t by the following formula:
Q(s_t, a_t) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | s = s_t, a = a_t]
where γ is the discount factor, s_t is the current state, a_t is the current action, and R_{t+1}, R_{t+2}, and R_{t+3} are the reward-function values at times t+1, t+2, and t+3; E_π is the return value when Q(s, a, θ_u) is maximal and is a state decision function;
S62: Generate the Q-value list Q(s_t, a_t, θ_u) of the optional actions under the current state s_t.
Specifically, the executed action a_t described in step S7 is expressed as follows:
a_t = (p^1, p^2, ..., p^K)
where K is the number of products recommended to the user and p^i is the i-th product recommended to the user.
Specifically, the reward r_t described in step S7 is defined as follows: if a product-click action occurs under the current state, the reward is the number of products the user clicked; if a product-purchase action occurs under the current state, the reward is the price of the product the user purchased; in all other cases, the reward is 0.
Specifically, step S10 comprises the following steps:
S101: Randomly take M training samples from the experience pool, where M = 64;
S102: Compute the Q values of the optional actions under the next state s_{t+1} by the following formula:
Q(s_{t+1}, a_{t+1}) = E_π[R_{t+2} + γR_{t+3} + γ²R_{t+4} + ... | s = s_{t+1}, a = a_{t+1}]
where γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, and R_{t+2}, R_{t+3}, and R_{t+4} are the reward-function values at times t+2, t+3, and t+4;
S103: Generate the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under the next state s_{t+1}.
Specifically, step S11 comprises the following steps:
S111: Compute the TargetQ value under the current state s_t by the following formula:
TargetQ = r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u')
where r_t is the reward of the current action, γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, θ_u' is the parameter of the TargetNet, and Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1} output by the TargetNet;
S112: Compute the loss function, and update the parameter θ_u of the MainNet so that the loss function reaches its minimum. The loss function is as follows:
L(θ_u) = E[(TargetQ - Q(s_t, a_t, θ_u))²]
       = E[(r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u') - Q(s_t, a_t, θ_u))²]
where E denotes the expectation, r_t is the reward of the current action, γ is the discount factor, Q(s_t, a_t, θ_u) is the Q-value list of the optional actions under the current state s_t, Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1}, s_t is the current state, a_t is the current action, s_{t+1} is the next state, a_{t+1} is the next action, θ_u is the parameter of the MainNet, and θ_u' is the parameter of the TargetNet.
Specifically, sex in the current state s_t is a long-term feature of the user, used to distinguish different groups; different user groups may make different choices under the same recommendation list. holiday, month, day, and weather in the current state s_t are external condition features; different external conditions can change a user's shopping behavior to a large extent, for example users are more active during festivals and holidays.
Specifically, when the user neither clicks nor purchases in response to the executed action a_t, the recommendation list remains unchanged; when the user clicks or purchases in response to a_t, the recommendation list changes: the earlier browsed products at the front of the list are removed and the products that have just received a click or purchase are filled in.
The same or similar reference signs correspond to the same or similar components;
The terms describing positional relationships in the figures are only for illustration and should not be understood as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the invention and is not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms may also be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.
Claims (9)
1. A recommendation algorithm based on deep reinforcement learning, characterized in that it comprises the following steps:
S1: Initialize the experience pool and set its capacity W. The experience pool is the set of product-recommendation actions and is used to store training samples; it is empty before training starts;
S2: Build the MainNet neural network and initialize it; the MainNet is the main neural network, used to obtain the recommendation list;
S3: Build the TargetNet neural network and initialize it; the TargetNet is used to train the model parameters and obtain the optimal model parameters;
S4: Set the training batch size M;
S5: Initialize the current state s_t from the N products the user browsed most recently at time t; if the user has browsed no products recently, use N popular products instead;
S6: Use the current state s_t as the input of the MainNet to obtain the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t, where s_t is the current state, a_t is the executed action, and θ_u is the parameter of the MainNet;
S7: According to the executed action a_t, the user clicks, purchases, or ignores according to his or her own interest; compute the reward r_t and the next state s_{t+1};
S8: Store the product-recommendation action tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool;
S9: Repeat steps S6-S8 until W training samples are stored in the experience pool;
S10: Randomly take M training samples from the experience pool and use the next state s_{t+1} of each sample as the input of the TargetNet to obtain the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1};
S11: Update the parameter θ_u of the MainNet;
S12: Every C rounds of iteration, where C is a preset iteration count, copy the parameters of the MainNet to the TargetNet.
2. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that the initialization described in step S2 comprises: initializing the parameter θ_u; the input is the current state s_t and the output is the Q-value list Q(s_t, a_t, θ_u) of the optional actions under s_t.
3. The recommendation algorithm based on deep reinforcement learning according to claim 2, characterized in that the current state s_t is expressed as follows:
s_t = (b_t^1, b_t^2, ..., b_t^N, sex, holiday, month, day, weather)
where b_t^i is the i-th product the user browsed most recently at time t, sex is the user's gender, and holiday, month, day, and weather respectively indicate whether it is a festival or holiday, the current month, the day, and the weather.
4. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that the initialization described in step S3 comprises: initializing the parameter θ_u'; the input is the next state s_{t+1} and the output is the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under s_{t+1}.
5. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that step S6 specifically comprises the following steps:
S61: Compute the Q values of the optional actions under the current state s_t by the following formula:
Q(s_t, a_t) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | s = s_t, a = a_t]
where γ is the discount factor, s_t is the current state, a_t is the current action, and R_{t+1}, R_{t+2}, and R_{t+3} are the reward-function values at times t+1, t+2, and t+3; E_π is the return value when Q(s, a, θ_u) is maximal and is a state decision function;
S62: Generate the Q-value list Q(s_t, a_t, θ_u) of the optional actions under the current state s_t.
6. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that the executed action a_t described in step S7 is expressed as follows:
a_t = (p^1, p^2, ..., p^K)
where K is the number of products recommended to the user and p^i is the i-th product recommended to the user.
7. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that the reward r_t described in step S7 is defined as follows: if a product-click action occurs under the current state, the reward is the number of products the user clicked; if a product-purchase action occurs under the current state, the reward is the price of the product the user purchased; in all other cases, the reward is 0.
8. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that step S10 specifically comprises the following steps:
S101: Randomly take M training samples from the experience pool;
S102: Compute the Q values of the optional actions under the next state s_{t+1} by the following formula:
Q(s_{t+1}, a_{t+1}) = E_π[R_{t+2} + γR_{t+3} + γ²R_{t+4} + ... | s = s_{t+1}, a = a_{t+1}]
where γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, and R_{t+2}, R_{t+3}, and R_{t+4} are the reward-function values at times t+2, t+3, and t+4;
S103: Generate the Q-value list Q(s_{t+1}, a_{t+1}, θ_u') of the optional actions under the next state s_{t+1}.
9. The recommendation algorithm based on deep reinforcement learning according to claim 1, characterized in that step S11 specifically comprises the following steps:
S111: Compute the TargetQ value under the current state s_t by the following formula:
TargetQ = r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u')
where r_t is the reward of the current action, γ is the discount factor, s_{t+1} is the next state, a_{t+1} is the next action, θ_u' is the parameter of the TargetNet, and Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1} output by the TargetNet;
S112: Compute the loss function, and update the parameter θ_u of the MainNet so that the loss function reaches its minimum. The loss function is as follows:
L(θ_u) = E[(TargetQ - Q(s_t, a_t, θ_u))²]
       = E[(r_t + γ max Q(s_{t+1}, a_{t+1}, θ_u') - Q(s_t, a_t, θ_u))²]
where E denotes the expectation, r_t is the reward of the current action, γ is the discount factor, Q(s_t, a_t, θ_u) is the Q-value list of the optional actions under the current state s_t, Q(s_{t+1}, a_{t+1}, θ_u') is the Q-value list of the optional actions under the next state s_{t+1}, s_t is the current state, a_t is the current action, s_{t+1} is the next state, a_{t+1} is the next action, θ_u is the parameter of the MainNet, and θ_u' is the parameter of the TargetNet.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811070447.0A | 2018-09-13 | 2018-09-13 | A recommendation algorithm based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109471963A (en) | 2019-03-15 |
Family
ID=65664609
Family Applications (1)
Application Number | Title | Priority Date | Filing Date | Status |
---|---|---|---|---|
CN201811070447.0A | A recommendation algorithm based on deep reinforcement learning | 2018-09-13 | 2018-09-13 | Pending |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109471963A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
US20170337478A1 (en) * | 2016-05-22 | 2017-11-23 | Microsoft Technology Licensing, Llc | Self-Learning Technique for Training a PDA Component and a Simulated User Component |
CN108230058A (en) * | 2016-12-09 | 2018-06-29 | 阿里巴巴集团控股有限公司 | Products Show method and system |
CN108230057A (en) * | 2016-12-09 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A kind of intelligent recommendation method and system |
CN108038545A (en) * | 2017-12-06 | 2018-05-15 | 湖北工业大学 | Fast learning algorithm based on Actor-Critic neutral net continuous controls |
Non-Patent Citations (2)
Title |
---|
Qin Xingchen (秦星辰): "Research and Design of a Recommendation System Based on the RHadoop Cloud Platform", China Master's Theses Full-text Database, Information Science and Technology (Monthly) * |
草帽B-O-Y: "Deep Reinforcement Learning - DQN", CSDN, https://blog.csdn.net/u013236946/article/details/72871858 * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109967741A (en) * | 2019-03-29 | 2019-07-05 | 贵州翰凯斯智能技术有限公司 | 3D printing process optimization method based on reinforcement learning |
CN109967741B (en) * | 2019-03-29 | 2021-02-02 | 贵州翰凯斯智能技术有限公司 | 3D printing process optimization method based on reinforcement learning |
CN110135951A (en) * | 2019-05-15 | 2019-08-16 | 网易(杭州)网络有限公司 | Recommendation method and device for game goods, and readable storage medium |
CN111738787A (en) * | 2019-06-13 | 2020-10-02 | 北京京东尚科信息技术有限公司 | Information pushing method and device |
CN110581808A (en) * | 2019-08-22 | 2019-12-17 | 武汉大学 | Congestion control method and system based on deep reinforcement learning |
CN110659947A (en) * | 2019-10-11 | 2020-01-07 | 沈阳民航东北凯亚有限公司 | Commodity recommendation method and device |
CN110838024A (en) * | 2019-10-16 | 2020-02-25 | 支付宝(杭州)信息技术有限公司 | Information pushing method, device and equipment based on deep reinforcement learning |
CN111859099B (en) * | 2019-12-05 | 2021-08-31 | 马上消费金融股份有限公司 | Recommendation method, device, terminal and storage medium based on reinforcement learning |
CN111859099A (en) * | 2019-12-05 | 2020-10-30 | 马上消费金融股份有限公司 | Recommendation method, device, terminal and storage medium based on reinforcement learning |
CN110942208B (en) * | 2019-12-10 | 2023-07-07 | 萍乡市恒升特种材料有限公司 | Method for determining optimal production conditions of silicon carbide foam ceramic |
CN110942208A (en) * | 2019-12-10 | 2020-03-31 | 萍乡市恒升特种材料有限公司 | Method for determining optimal production conditions of silicon carbide foam ceramic |
CN111159558A (en) * | 2019-12-31 | 2020-05-15 | 支付宝(杭州)信息技术有限公司 | Recommendation list generation method and device and electronic equipment |
CN111159558B (en) * | 2019-12-31 | 2023-07-18 | 支付宝(杭州)信息技术有限公司 | Recommendation list generation method and device and electronic equipment |
CN111309907A (en) * | 2020-02-10 | 2020-06-19 | 大连海事大学 | Real-time Bug assignment method based on deep reinforcement learning |
CN111401937A (en) * | 2020-02-26 | 2020-07-10 | 平安科技(深圳)有限公司 | Data pushing method and device and storage medium |
WO2021169218A1 (en) * | 2020-02-26 | 2021-09-02 | 平安科技(深圳)有限公司 | Data pushing method and system, electronic device and storage medium |
CN111339675A (en) * | 2020-03-10 | 2020-06-26 | 南栖仙策(南京)科技有限公司 | Training method for intelligent marketing strategy based on machine learning simulation environment |
CN112085524A (en) * | 2020-08-31 | 2020-12-15 | 中国人民大学 | Q learning model-based result pushing method and system |
CN112085524B (en) * | 2020-08-31 | 2022-11-15 | 中国人民大学 | Q learning model-based result pushing method and system |
CN112733004A (en) * | 2021-01-22 | 2021-04-30 | 上海交通大学 | Movie and television work recommendation method based on a multi-armed bandit algorithm |
CN112733004B (en) * | 2021-01-22 | 2022-09-30 | 上海交通大学 | Movie and television work recommendation method based on a multi-armed bandit algorithm |
CN117290609A (en) * | 2023-11-24 | 2023-12-26 | 中国科学技术大学 | Product data recommendation method and product data recommendation device |
CN117290609B (en) * | 2023-11-24 | 2024-03-29 | 中国科学技术大学 | Product data recommendation method and product data recommendation device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109471963A (en) | A recommendation algorithm based on deep reinforcement learning | |
CN103678518B (en) | Method and device for adjusting recommendation lists | |
CN108009897A (en) | Real-time commodity recommendation method, system and readable storage medium | |
CN103246980B (en) | Information output method and server | |
CN108153791B (en) | Resource recommendation method and related device | |
CN106447463A (en) | Commodity recommendation method based on Markov decision-making process model | |
CN103886001A (en) | Personalized commodity recommendation system | |
CN102479366A (en) | Commodity recommending method and system | |
CN106168980A (en) | Multimedia resource recommendation ranking method and device | |
CN107145506B (en) | Improved content-based agricultural commodity recommendation method | |
Fainmesser | Community structure and market outcomes: A repeated games-in-networks approach | |
CN104933595A (en) | Collaborative filtering recommendation method based on Markov prediction model | |
CN109034960A (en) | A method of multi-attribute inference based on user node embedding | |
US20160196579A1 (en) | Dynamic deep links based on user activity of a particular user | |
Lu et al. | Research on e-commerce customer repeat purchase behavior and purchase stickiness | |
Flajolet et al. | Real-time bidding with side information | |
Karpenko et al. | The influence of the consumer’s type–physical or digital–on their behavioral characteristics | |
Jiang et al. | Intertemporal pricing via nonparametric estimation: Integrating reference effects and consumer heterogeneity | |
Bergemann et al. | Progressive participation | |
CN110288419A (en) | E-commerce agricultural product recommendation method with dynamically updated weights | |
CN107967627A (en) | Content-based linear regression financial product recommendation method | |
CN113781134A (en) | Item recommendation method and device and computer-readable storage medium | |
Wu et al. | Design of optimal control strategies for a supply chain with competing manufacturers under consignment contract | |
Liu et al. | A semiparametric varying coefficient model of monotone auction bidding processes | |
Kota et al. | Temporal multi-hierarchy smoothing for estimating rates of rare events |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-03-15 |