CN106997488A - Action knowledge extraction method combining a Markov decision process - Google Patents

Action knowledge extraction method combining a Markov decision process Download PDF

Info

Publication number
CN106997488A
CN106997488A CN201710173631.7A CN201710173631A
Authority
CN
China
Prior art keywords
state
action
attribute
strategy
value function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710173631.7A
Other languages
Chinese (zh)
Inventor
吕强
李兆荣
李欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201710173631.7A priority Critical patent/CN106997488A/en
Publication of CN106997488A publication Critical patent/CN106997488A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an action knowledge extraction method combining a Markov decision process, comprising: training a random forest model H; defining the action knowledge extraction problem AKE: for the random forest model H, splitting the attributes, defining attribute changes and actions, and on this basis defining the action knowledge extraction problem AKE; and solving the AKE optimization problem with a Markov decision process: for any input data, defining a Markov decision process MDP and a policy, updating the policy by policy iteration, and finally solving for an optimal policy. An action as defined by the action knowledge extraction of the invention can change multiple attribute values of a state, so in practical applications it provides accurate and feasible recommendations.

Description

Action knowledge extraction method combining a Markov decision process
Technical field
The invention belongs to the field of machine learning, and in particular relates to an action knowledge extraction method combining a Markov decision process.
Background technology
In machine learning, many models such as support vector machines, random forests and deep neural networks have been proposed and have achieved great success, but in many practical applications the actionability of these models is poor.
Reinforcement learning is a special class of machine learning that learns a decision policy by autonomously interacting with its environment, so as to maximize the long-term cumulative reward received by the policy. Reinforcement learning differs from other machine learning methods in that no training data is provided in advance; the data must be generated by interacting with the environment. In the field of management science, the knowledge extraction problem uses statistical methods to analyze user behavior and discover specific rules; in the field of machine learning, the knowledge extraction problem mainly relies on post-hoc analysis of a trained model.
The main drawback of these two classes of methods is that they extract knowledge from a model built on the entire data set, rather than extracting useful knowledge for an individual record. Consequently, in many applications the actionability of these models is poor: they modify only one attribute value of a state, which causes errors in practice and prevents them from giving accurate and feasible recommendations.
Summary of the invention
The technical problem solved by the invention is to provide an action knowledge extraction method combining a Markov decision process, so as to address the problems in the prior art that knowledge is extracted from a model built on the entire data set and that only one attribute value of a state is changed, which leads to large errors in the results. The invention realizes data-driven action knowledge extraction through the Markov decision process of reinforcement learning, and provides the ability to convert the predictions of a machine learning model into action knowledge.
The technical solution for achieving the object of the invention is as follows:
An action knowledge extraction method combining a Markov decision process comprises the following steps:
Step 1: train a random forest model H;
Step 2: define the action knowledge extraction problem AKE: for the random forest model H, split the attributes, define attribute changes and actions, and on this basis define the action knowledge extraction problem AKE;
Step 3: solve the AKE optimization problem with a Markov decision process: for any input data, define a Markov decision process MDP and a policy, update the policy by policy iteration, and finally solve for an optimal policy.
Compared with the prior art, the invention has the following notable advantages:
(1) The invention proposes a method combining the classical reinforcement learning technique of Markov decision processes, providing a new method for the field of action knowledge extraction.
(2) The action knowledge extraction technique proposed by the invention efficiently improves the accuracy of finding an optimal policy within finite time. The invention is based on a random forest model, one of the best existing classification models and one that is widely used in practical problems; preprocessing with the random forest model orders and categorizes the data, which shortens the time needed by the iterations of the subsequent Markov decision process to find the optimal policy.
(3) An action as defined by the action knowledge extraction of the invention can change multiple attribute values of a state, so in practical applications it provides accurate and feasible recommendations.
(4) Because every state in the Markov decision process is fully observed, the accuracy of finding the optimal policy by iteration is guaranteed. By combining the Markov decision process, the invention does not need to build a model from the entire data set; it can extract available action knowledge for an individual record, and it can obtain a better policy by autonomously learning about the environment through interaction with it.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 is an overall flow chart of the method of the invention.
Embodiment
The action knowledge extraction method combining a Markov decision process of the invention combines machine learning and reinforcement learning and uses a Markov decision process to extract action knowledge. The specific steps are as follows:
Step 1: train a random forest model H:
Given a training data set, build a random forest model H. The training data set is defined as {X, Y}, where X is the set of input data vectors and Y is the set of output class labels. The random forest model H is built by random sampling and node splitting, and its prediction function is
p(y = c \mid \vec{x}) = \frac{\sum_{d=1}^{D} w_d \, I(o_d(\vec{x}) = c)}{\sum_{d=1}^{D} w_d}
where x is the input vector, x ∈ X, y ∈ Y, y is the prediction class output by the random forest model H for the input vector x, c is the desired target class, d indexes the d-th decision tree, D is the total number of decision trees in the random forest, w_d is the weight of the d-th decision tree, o_d(x) is the output of the d-th decision tree for input x, I(·) is the indicator function, and p(y = c | x) is the probability that the prediction output for the input data vector x is class c.
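As an illustrative sketch (not taken from the patent text), the weighted-vote prediction function above can be written in Python with scikit-learn; the toy data and the helper name p_weighted are assumptions.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training set {X, Y}; in practice X, Y come from the application data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # input data vectors
Y = (X[:, 0] + X[:, 1] > 0).astype(int)    # output class labels
H = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y)

def p_weighted(H, x, c, w=None):
    """Weighted vote p(y = c | x): sum of the weights w_d of the trees whose
    prediction o_d(x) equals c, divided by the total weight."""
    trees = H.estimators_
    w = np.ones(len(trees)) if w is None else np.asarray(w)   # w_d (uniform by default)
    votes = np.array([t.predict(x.reshape(1, -1))[0] for t in trees])
    return float(np.sum(w * (votes == c)) / np.sum(w))

print(p_weighted(H, X[0], c=1))   # probability that the prediction for X[0] is class 1
```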
Step 2: define the action knowledge extraction problem (AKE): for the random forest model H, split the attributes, define attribute changes and actions, and on this basis define the action knowledge extraction problem (AKE).
2.1 Split the attributes: given a random forest model H, each attribute x_i is divided into M intervals, as follows.
1) If attribute x_i is categorical and has n categories, then x_i is naturally divided into n intervals, and in this case M = n.
2) If attribute x_i is numerical and a branch node on a decision tree in the random forest model H tests x_i > b, then b is a split point of x_i. If attribute x_i has n split points over all decision trees, then x_i is divided into n + 1 intervals, and in this case M = n + 1.
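A minimal sketch of this splitting step, assuming the scikit-learn forest H from the previous sketch; the helper name split_intervals is an assumption, not part of the patent.
```python
import numpy as np

def split_intervals(H, i):
    """Collect the split points b of numerical attribute x_i over all trees of H
    and return the resulting n + 1 intervals as (low, high) pairs."""
    points = set()
    for tree in H.estimators_:
        t = tree.tree_
        for node in range(t.node_count):
            # Internal (non-leaf) node that tests attribute i.
            if t.children_left[node] != -1 and t.feature[node] == i:
                points.add(float(t.threshold[node]))
    b = [-np.inf] + sorted(points) + [np.inf]
    return list(zip(b[:-1], b[1:]))        # n split points give n + 1 intervals

intervals = split_intervals(H, i=0)
print(len(intervals), "intervals for attribute x_0")
```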
2.2 Define attribute changes: given a random forest model H, an attribute change τ is defined as a triple τ = (x_i, p, q), where p and q are two split intervals of attribute x_i.
An attribute change τ is executable on a given input vector x if and only if the i-th attribute x_i of that input vector lies in interval p; the attribute change τ moves attribute x_i of the input vector x from interval p to interval q.
2.3 Define actions:
An action a is defined as a set of attribute changes, i.e. a = {τ_1, ..., τ_|a|}; each action a has an immediate reward R(a).
Here |a| denotes the number of attribute changes in action a, and |a| ≥ 1, i.e. an action a contains at least one attribute change τ.
An action a is executable on an input vector x if and only if all of its attribute changes τ are executable on x.
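The attribute changes and actions defined above can be represented by the following minimal, illustrative data classes; the class names and the representative-value choice are assumptions, not part of the patent.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeChange:
    """tau = (x_i, p, q): move attribute i from interval p to interval q."""
    i: int
    p: tuple   # source interval (low, high)
    q: tuple   # target interval (low, high)

    def executable(self, x):
        return self.p[0] <= x[self.i] < self.p[1]

    def apply(self, x):
        x = x.copy()
        # Pick a representative value inside q (assumes finite interval bounds).
        x[self.i] = (self.q[0] + self.q[1]) / 2.0
        return x

@dataclass(frozen=True)
class Action:
    """a = {tau_1, ..., tau_|a|} with an immediate reward R(a)."""
    changes: tuple   # tuple of AttributeChange
    reward: float

    def executable(self, x):
        # Executable iff every attribute change is executable on x.
        return all(t.executable(x) for t in self.changes)

    def apply(self, x):
        for t in self.changes:
            x = t.apply(x)
        return x
```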
2.4 The action knowledge extraction problem (AKE) is defined as:
\max_{A_s \subseteq A} F(A_s) = \sum_{a_i \in A_s} R(a_i)
subject to p(y = c \mid x^*) > z
where A is the set of executable actions, A_s is the optimal action sequence to be found, a_i is any action in A_s, R(a_i) is the immediate reward of action a_i, F(A_s) is the total reward obtained by executing the action sequence A_s, y is the prediction class output by the random forest model H for the input vector x, z is a constant threshold, and x* is the vector obtained from the initial input vector x after executing all actions in the optimal action sequence A_s.
The AKE problem is to find an action sequence that turns the input vector into a target vector with the desired prediction class, while ensuring that the sum of rewards of the action sequence is maximal. It is therefore an optimization problem, referred to as the AKE optimization problem. In the action definition of the AKE problem an action contains at least one attribute change, so it can change multiple attribute values of a state; in practical applications this provides accurate and feasible recommendations.
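As a small illustration (assuming the H, p_weighted and Action objects sketched above, and an arbitrary threshold z), the objective and constraint of the AKE problem can be evaluated as follows:
```python
import numpy as np

def ake_objective(actions, x0, H, c, z):
    """Return (F(A_s), constraint satisfied?) for an action sequence A_s applied
    to the initial input vector x0."""
    x_star, total = x0.copy(), 0.0
    for a in actions:
        if not a.executable(x_star):
            return -np.inf, False            # the sequence is not executable
        x_star = a.apply(x_star)
        total += a.reward                    # F(A_s) = sum of R(a_i)
    # subject to p(y = c | x*) > z
    return total, p_weighted(H, x_star, c) > z
```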
Step 3: solve the AKE optimization problem with a Markov decision process: for any input data, define a Markov decision process (MDP) and a policy, update the policy by policy iteration, and finally solve for an optimal policy.
3.1 Define the Markov decision process as Π_MDP = {S, A, T, R}.
The definition procedure is standard: S denotes the state space, with states denoted s; A denotes the action space, with actions denoted a; T: S × A × S → [0, 1] is the state transition function, giving the probability of moving to another state after executing an action in a given state; R: S × A → R is the reward function, giving the immediate reward provided by the environment when a state transition occurs. Starting from state s, an action a ∈ A(s) is taken, the reward R(s, a) fed back by the environment is received, and the process moves with probability T(s, a, s′) to the next state s′ ∈ S, where A(s) denotes the set of actions that can be taken in state s.
The Markov decision process is an iterative loop that runs until the termination condition is met; when it terminates it outputs the optimal policy sequence B.
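A toy, purely illustrative encoding of the tuple Π_MDP = {S, A, T, R} as Python dictionaries; the state and action names are made up for the example and are not part of the patent.
```python
# S: states; A[s]: actions available in s; T[s][a]: list of (next_state, probability);
# R[(s, a)]: immediate reward; gamma: discount factor.
S = ["s0", "s1", "s_goal"]
A = {"s0": ["a1", "a2"], "s1": ["a1"], "s_goal": []}
T = {
    "s0": {"a1": [("s1", 0.8), ("s0", 0.2)], "a2": [("s_goal", 0.5), ("s0", 0.5)]},
    "s1": {"a1": [("s_goal", 1.0)]},
    "s_goal": {},
}
R = {("s0", "a1"): 1.0, ("s0", "a2"): 0.2, ("s1", "a1"): 2.0}
gamma = 0.9
```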
3.2 Define the policy:
A policy π is a mapping from states to actions, π: S × A → [0, 1]. The goal is to find an optimal policy π* with the largest cumulative reward R^π:
\pi^* = \arg\max_{\pi} R^{\pi}, \qquad R^{\pi} = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]
where R^π is the cumulative reward obtained by executing actions under policy π, γ^t is the t-th power of the discount factor γ, E_π[·] is the expectation under policy π, and r_t is the immediate reward of the action executed at time t.
3.3 Define the value function:
The reward function is an instantaneous evaluation of a state (or action), whereas the value function assesses the quality of a state from a long-term perspective. The state value function V(s) is used here.
Given a policy π, the state value function is defined as:
V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]
Based on the optimal policy π*, the optimal state value function V*(s) can be defined as:
V^{*}(s) = E_{\pi^{*}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]
where s_0 denotes the initial state, s_0 = s means that state s is taken as the initial state, V^π(s) is the state value function under policy π with state s as the initial state, and V*(s) is the optimal state value function with state s as the initial state.
According to the Bellman optimality equation:
V^{*}(s) = \max_{a} E\left[r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_{t} = s, a_{t} = a\right] = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a) + \gamma V^{*}(s')\right]
where r_{t+1} is the immediate reward of the action executed at time t + 1, V*(s_{t+1}) is the optimal state value function of the state s_{t+1} at time t + 1, s′ is the state at the next time step, T(s, a, s′) is the state transition probability, γ is the discount factor, R(s, a) is the reward for taking action a in state s, and V*(s′) is the optimal state value function of the next state s′.
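An illustrative one-step Bellman optimality backup over the toy MDP dictionaries above (a sketch, not the patent's own procedure):
```python
def bellman_backup(V, s, A, T, R, gamma):
    """V*(s) = max_a sum_{s'} T(s, a, s') [ R(s, a) + gamma * V*(s') ]."""
    if not A[s]:
        return 0.0                       # terminal state: no action available
    return max(
        sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[s][a])
        for a in A[s]
    )

V = {s: 0.0 for s in S}
V["s0"] = bellman_backup(V, "s0", A, T, R, gamma)
```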
3.4 Solve for an optimal policy by policy iteration:
First, a policy π_t is initialized at random and the state value function v_t under this policy is computed; from these state values a new policy π_{t+1} is obtained, and the value function v_{t+1} of each state under the new policy is computed, repeating until convergence.
Computing the value of each state under a policy is called policy evaluation; obtaining a new policy from the state values is called policy improvement.
3.4.1 Policy evaluation:
According to the Bellman equation, the value function of a state is related to the value functions of its successor states; therefore, the value function v(s) of the current state is updated using the value function v(s′) of the successor state.
Policy evaluation traverses all states and updates the state value function according to the following formula:
V^{\pi_{t}}(s) = \sum_{a \in A} \pi(s, a)\left(R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi_{t+1}}(s')\right)
After the state value function has been updated, the policy π_t is added to the optimal policy sequence B.
Here V^{π_t}(s) is the state value function of state s under policy π_t, V^{π_{t+1}}(s′) is the state value function of state s′ under policy π_{t+1}, and π(s, a) denotes the policy evaluated at state s and action a.
3.4.2 Policy improvement:
A new policy that is better than the old one is obtained from the state value function. For a state s, the policy selects the action a that maximizes the current state value R(s, a) + γ Σ_{s′} T(s, a, s′) V^π(s′), i.e.:
\pi_{t+1}(s, a) = \begin{cases} 1, & a = \arg\max_{a}\left(R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi_{t+1}}(s')\right) \\ 0, & \text{otherwise} \end{cases}
where π_{t+1} denotes the policy at time t + 1.
3.4.3 Output the optimal policy sequence B according to the result of policy improvement: determine whether the state in the policy is the target state; if it is the target state, exit the policy iteration and output the optimal policy sequence B; if it is not the target state, perform policy evaluation again until state s is the target state, and then output the optimal policy B.
The condition for judging whether the target has been reached is:
p(y = c \mid x^*) > z
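For illustration only, the policy iteration loop described in steps 3.4.1 to 3.4.3 can be sketched over the toy MDP dictionaries above (function and variable names are assumptions); it alternates policy evaluation and policy improvement until the policy no longer changes.
```python
def policy_iteration(S, A, T, R, gamma, tol=1e-6):
    """Return a deterministic policy and its state value function."""
    pi = {s: (A[s][0] if A[s] else None) for s in S}   # arbitrary initial policy
    V = {s: 0.0 for s in S}
    while True:
        # Policy evaluation: sweep the states until the values converge under pi.
        while True:
            delta = 0.0
            for s in S:
                a = pi[s]
                if a is None:
                    continue
                v_new = sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: make the policy greedy with respect to V.
        stable = True
        for s in S:
            if not A[s]:
                continue
            best = max(A[s], key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2])
                                               for s2, p in T[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V

pi_star, V_star = policy_iteration(S, A, T, R, gamma)
print(pi_star, V_star)
```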
The invention proposes a method combining the classical reinforcement learning technique of Markov decision processes, providing a new method for the field of action knowledge extraction. The invention is based on a random forest model, one of the best existing classification models and one that is widely used in practical problems. Preprocessing with the random forest model orders and categorizes the data, which shortens the time needed by the iterations of the subsequent Markov decision process to find the optimal policy; the proposed action knowledge extraction method therefore efficiently improves the accuracy of finding an optimal policy within finite time. An action as defined by the action knowledge extraction of the invention can change multiple attribute values of a state, so in practical applications it provides accurate and feasible recommendations. Because every state in the Markov decision process is fully observed, the accuracy of finding the optimal policy by iteration is guaranteed. By combining the Markov decision process, the invention does not need to build a model from the entire data set; it can extract available action knowledge for an individual record, and it can obtain a better policy by autonomously learning about the environment through interaction with it.

Claims (7)

1. An action knowledge extraction method combining a Markov decision process, characterized by comprising the following steps:
Step 1: train a random forest model H;
Step 2: define the action knowledge extraction problem AKE: for the random forest model H, split the attributes, define attribute changes and actions, and on this basis define the action knowledge extraction problem AKE;
Step 3: solve the AKE optimization problem with a Markov decision process: for any input data, define a Markov decision process MDP and a policy, update the policy by policy iteration, and finally solve for an optimal policy.
2. The action knowledge extraction method combining a Markov decision process according to claim 1, characterized in that training the random forest model H in step 1 specifically comprises:
giving a training data set and building a random forest model H; the training data set is defined as {X, Y}, where X is the set of input data vectors and Y is the set of output class labels; the random forest model H is built by random sampling and node splitting, and its prediction function is
p(y = c \mid \vec{x}) = \frac{\sum_{d=1}^{D} w_d \, I(o_d(\vec{x}) = c)}{\sum_{d=1}^{D} w_d}
where x is the input vector, x ∈ X, y ∈ Y, y is the prediction class output by the random forest model H for the input vector x, c is the desired target class, d indexes the d-th decision tree, D is the total number of decision trees in the random forest, w_d is the weight of the d-th decision tree, o_d(x) is the output of the d-th decision tree for input x, I(·) is the indicator function, and p(y = c | x) is the probability that the prediction output for the input data vector x is class c.
3. The action knowledge extraction method combining a Markov decision process according to claim 1, characterized in that defining the action knowledge extraction problem in step 2 specifically comprises the following steps:
2.1 split the attributes: given a random forest model H, each attribute x_i is divided into M intervals;
2.2 define attribute changes: given a random forest model H, an attribute change τ is defined as a triple τ = (x_i, p, q), where p and q are two split intervals of attribute x_i;
2.3 define actions:
an action a is defined as a set of attribute changes, i.e. a = {τ_1, ..., τ_|a|}; each action a has an immediate reward R(a);
where |a| denotes the number of attribute changes in action a, and |a| ≥ 1, i.e. an action a contains at least one attribute change τ;
2.4 the action knowledge extraction problem (AKE) is defined as:
\max_{A_s \subseteq A} F(A_s) = \sum_{a_i \in A_s} R(a_i)
subject to p(y = c \mid x^*) > z
where A is the set of executable actions, A_s is the optimal action sequence to be found, a_i is any action in A_s, R(a_i) is the immediate reward of action a_i, F(A_s) is the total reward obtained by executing the action sequence A_s, y is the prediction class output by the random forest model H for the input vector x, z is a constant threshold, and x* is the vector obtained from the initial input vector x after executing all actions in the optimal action sequence A_s.
4. The action knowledge extraction method combining a Markov decision process according to claim 3, characterized in that attribute x_i in step 2.1 is divided into M intervals as follows:
1) if attribute x_i is categorical and has n categories, then x_i is naturally divided into n intervals, and in this case M = n;
2) if attribute x_i is numerical and a branch node on a decision tree in the random forest model H tests x_i > b, then b is a split point of x_i; if attribute x_i has n split points over all decision trees, then x_i is divided into n + 1 intervals, and in this case M = n + 1.
5. The action knowledge extraction method combining a Markov decision process according to claim 1, characterized in that solving the AKE optimization problem with a Markov decision process in step 3 specifically comprises the following steps:
3.1 define the Markov decision process as Π_MDP = {S, A, T, R}:
S denotes the state space, with states denoted s; A denotes the action space, with actions denoted a; T: S × A × S → [0, 1] is the state transition function, giving the probability of moving to another state after executing an action in a given state; R: S × A → R is the reward function, giving the immediate reward provided by the environment when a state transition occurs; starting from state s, an action a ∈ A(s) is taken, the reward R(s, a) fed back by the environment is received, and the process moves with probability T(s, a, s′) to the next state s′ ∈ S, where A(s) denotes the set of actions that can be taken in state s;
3.2 define the policy:
a policy π is a mapping from states to actions, π: S × A → [0, 1], and the goal is to find an optimal policy π* with the largest cumulative reward R^π:
\pi^* = \arg\max_{\pi} R^{\pi}
R^{\pi} = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]
where R^π is the cumulative reward obtained by executing actions under policy π, γ^t is the t-th power of the discount factor γ, E_π[·] is the expectation under policy π, and r_t is the immediate reward of the action executed at time t;
3.3 define the value function:
given a policy π, the state value function is defined as:
V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]
based on the optimal policy π*, the optimal state value function V*(s) can be defined as:
V^{*}(s) = E_{\pi^{*}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]
where s_0 denotes the initial state, s_0 = s means that state s is taken as the initial state, V^π(s) is the state value function under policy π with state s as the initial state, and V*(s) is the optimal state value function with state s as the initial state;
according to the Bellman optimality equation:
V^{*}(s) = \max_{a} E\left[r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_{t} = s, a_{t} = a\right] = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a) + \gamma V^{*}(s')\right]
where r_{t+1} is the immediate reward of the action executed at time t + 1, V*(s_{t+1}) is the optimal state value function of the state s_{t+1} at time t + 1, s′ is the state at the next time step, T(s, a, s′) is the state transition probability, γ is the discount factor, R(s, a) is the reward for taking action a in state s, and V*(s′) is the optimal state value function of the next state s′;
3.4 solve for an optimal policy by policy iteration:
first, a policy π_t is initialized at random and the state value function v_t under this policy is computed; from these state values a new policy π_{t+1} is obtained, and the value function v_{t+1} of each state under the new policy is computed, repeating until convergence.
6. The action knowledge extraction method combining a Markov decision process according to claim 5, characterized in that solving for an optimal policy by policy iteration in step 3.4 specifically comprises the following steps:
3.4.1 policy evaluation:
according to the Bellman equation, the value function of a state is related to the value functions of its successor states; therefore, the value function v(s) of the current state is updated using the value function v(s′) of the successor state;
policy evaluation traverses all states and updates the state value function according to the following formula:
V^{\pi_{t}}(s) = \sum_{a \in A} \pi(s, a)\left(R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi_{t+1}}(s')\right)
after the state value function has been updated, the policy π_t is added to the optimal policy sequence B;
where V^{π_t}(s) is the state value function of state s under policy π_t, V^{π_{t+1}}(s′) is the state value function of state s′ under policy π_{t+1}, and π(s, a) denotes the policy evaluated at state s and action a;
3.4.2 policy improvement:
a new policy that is better than the old one is obtained from the state value function; for a state s, the policy selects the action a that maximizes the current state value R(s, a) + γ Σ_{s′} T(s, a, s′) V^π(s′), i.e.:
\pi_{t+1}(s, a) = \begin{cases} 1, & a = \arg\max_{a}\left(R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi_{t+1}}(s')\right) \\ 0, & a \neq \arg\max_{a}\left(R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi_{t+1}}(s')\right) \end{cases}
where π_{t+1} denotes the policy at time t + 1;
3.4.3 output the optimal policy sequence B according to the result of policy improvement: determine whether the state in the policy is the target state; if it is the target state, exit the policy iteration and output the optimal policy sequence B; if it is not the target state, perform policy evaluation again until state s is the target state, and then output the optimal policy B.
7. The action knowledge extraction method combining a Markov decision process according to claim 6, characterized in that the condition in step 3.4.3 for judging whether the target has been reached is:
p(y = c \mid x^*) > z
CN201710173631.7A 2017-03-22 2017-03-22 Action knowledge extraction method combining a Markov decision process Pending CN106997488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710173631.7A CN106997488A (en) 2017-03-22 2017-03-22 Action knowledge extraction method combining a Markov decision process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710173631.7A CN106997488A (en) 2017-03-22 2017-03-22 Action knowledge extraction method combining a Markov decision process

Publications (1)

Publication Number Publication Date
CN106997488A true CN106997488A (en) 2017-08-01

Family

ID=59431600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710173631.7A Pending CN106997488A (en) Action knowledge extraction method combining a Markov decision process

Country Status (1)

Country Link
CN (1) CN106997488A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376287A (en) * 2018-03-02 2018-08-07 复旦大学 Multi-valued attribute segmentation device and method based on CN-DBpedia
CN108510110A (en) * 2018-03-13 2018-09-07 浙江禹控科技有限公司 Knowledge-graph-based groundwater level trend analysis method
CN109741626A (en) * 2019-02-24 2019-05-10 苏州科技大学 Parking situation prediction method, scheduling method and system
CN110363015A (en) * 2019-07-10 2019-10-22 华东师范大学 Construction method of a Markov prefetching model based on user attribute classification
CN110378717A (en) * 2018-04-13 2019-10-25 北京京东尚科信息技术有限公司 Method and apparatus for outputting information
CN111294284A (en) * 2018-12-10 2020-06-16 华为技术有限公司 Traffic scheduling method and device
CN113112051A (en) * 2021-03-11 2021-07-13 同济大学 Production maintenance joint optimization method for serial production system based on reinforcement learning


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5806056A (en) * 1994-04-29 1998-09-08 International Business Machines Corporation Expert system and method employing hierarchical knowledge base, and interactive multimedia/hypermedia applications
CN101000624A (en) * 2007-01-10 2007-07-18 华为技术有限公司 Method, system and device for implementing data mining model conversion and application
CN102054002A (en) * 2009-10-28 2011-05-11 中国移动通信集团公司 Method and device for generating decision tree in data mining system
CN103034691A (en) * 2012-11-30 2013-04-10 南京航空航天大学 Method for getting expert system knowledge based on support vector machine
CN103258255A (en) * 2013-03-28 2013-08-21 国家电网公司 Knowledge discovery method applicable to power grid management system
CN103246991A (en) * 2013-05-28 2013-08-14 运筹信息科技(上海)有限公司 Data mining-based customer relationship management method and data mining-based customer relationship management system
CN105182988A (en) * 2015-09-11 2015-12-23 西北工业大学 Pilot operation behavior guiding method based on Markov decision-making process
CN105955921A (en) * 2016-04-18 2016-09-21 苏州大学 Robot hierarchical reinforcement learning initialization method based on automatic discovery of abstract action
CN106021377A (en) * 2016-05-11 2016-10-12 上海点荣金融信息服务有限责任公司 Information processing method and device implemented by computer
CN106156488A (en) * 2016-06-22 2016-11-23 南京邮电大学 Knowledge graph link prediction method based on Bayesian personalized ranking
CN106447463A (en) * 2016-10-21 2017-02-22 南京大学 Commodity recommendation method based on Markov decision-making process model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LONGBING CAO: "Actionable knowledge discovery and delivery", Metasynthetic Computing and Engineering of Complex Systems *
QIANG YANG et al.: "Extracting Actionable Knowledge from Decision Trees", IEEE Transactions on Knowledge and Data Engineering *
ZHICHENG CUI et al.: "Optimal Action Extraction for Random Forests and Boosted Trees", Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining *
陈兴国 et al.: "Reinforcement Learning and Its Application in Computer Go", Acta Automatica Sinica *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376287A (en) * 2018-03-02 2018-08-07 复旦大学 Multi-valued attribute segmentation device and method based on CN-DBpedia
CN108510110A (en) * 2018-03-13 2018-09-07 浙江禹控科技有限公司 Knowledge-graph-based groundwater level trend analysis method
CN110378717A (en) * 2018-04-13 2019-10-25 北京京东尚科信息技术有限公司 Method and apparatus for outputting information
CN110378717B (en) * 2018-04-13 2024-03-05 北京京东尚科信息技术有限公司 Method and device for outputting information
CN111294284A (en) * 2018-12-10 2020-06-16 华为技术有限公司 Traffic scheduling method and device
CN111294284B (en) * 2018-12-10 2022-04-26 华为技术有限公司 Traffic scheduling method and device
CN109741626A (en) * 2019-02-24 2019-05-10 苏州科技大学 Parking situation prediction method, scheduling method and system
CN109741626B (en) * 2019-02-24 2023-09-29 苏州科技大学 Parking situation prediction method, scheduling method and system for parking lot
CN110363015A (en) * 2019-07-10 2019-10-22 华东师范大学 Construction method of a Markov prefetching model based on user attribute classification
CN113112051A (en) * 2021-03-11 2021-07-13 同济大学 Production maintenance joint optimization method for serial production system based on reinforcement learning
CN113112051B (en) * 2021-03-11 2022-10-25 同济大学 Production maintenance joint optimization method for serial production system based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN106997488A (en) Action knowledge extraction method combining a Markov decision process
Al-Shabandar et al. A deep gated recurrent neural network for petroleum production forecasting
Alzubaidi et al. A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications
WO2023065545A1 (en) Risk prediction method and apparatus, and device and storage medium
CN110889556B (en) Enterprise operation risk characteristic data information extraction method and extraction system
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
Jin et al. Bayesian symbolic regression
CN104008203B (en) A user interest mining method incorporating ontology context
WO2019015631A1 (en) Method for generating combined features for machine learning samples and system
Pirani et al. A comparative analysis of ARIMA, GRU, LSTM and BiLSTM on financial time series forecasting
US11151480B1 (en) Hyperparameter tuning system results viewer
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN104798043A (en) Data processing method and computer system
WO2018133596A1 (en) Continuous feature construction method based on nominal attribute
CN103324954A (en) Image classification method based on tree structure and system using same
CN107451230A (en) Question answering method and question answering system
CN113326852A (en) Model training method, device, equipment, storage medium and program product
Patidar et al. Handling missing value in decision tree algorithm
CN116861924A (en) Project risk early warning method and system based on artificial intelligence
CN107368895A (en) Action knowledge extraction method combining machine learning and automated planning
Li A study on the influence of non-intelligence factors on college students' English learning achievement based on the C4.5 decision tree algorithm
Prudêncio et al. A modal symbolic classifier for selecting time series models
Kim et al. Knowledge extraction and representation using quantum mechanics and intelligent models
CN110310012A (en) Data analysing method, device, equipment and computer readable storage medium
CN113326884A (en) Efficient learning method and device for large-scale abnormal graph node representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20170801