CN102063640B - Robot behavior learning model based on utility differential network - Google Patents


Info

Publication number
CN102063640B
CN102063640B (granted) · CN102063640A (application) · CN 201010564142
Authority
CN
China
Prior art keywords
action
layer
input
function
network element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010564142
Other languages
Chinese (zh)
Other versions
CN102063640A (en)
Inventor
宋晓
麻士东
龚光红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN 201010564142 priority Critical patent/CN102063640B/en
Publication of CN102063640A publication Critical patent/CN102063640A/en
Application granted granted Critical
Publication of CN102063640B publication Critical patent/CN102063640B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a robot behavior learning model based on a utility differential network, which comprises a utility fitting network unit, a differential signal calculating network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The model realizes an offline learning process and an online decision process. The utility fitting network unit calculates the utility fitting value of the state after an action is executed; the differential signal calculating network unit calculates the differential signal; the confidence evaluation network unit outputs the calculated confidence to the action correction network unit; the action decision network unit outputs the action selection functions; and the action correction network unit corrects the action selection functions using the confidence, calculates the selection probability of each action and outputs the action with the largest probability to the action execution unit for execution. The invention better guarantees the completeness of the environmental knowledge acquired by the robot, and better guarantees the timeliness and effectiveness of robot behavior decisions.

Description

Robot behavior learning model based on a utility differential network
Technical field
The present invention relates to a robot behavior learning model based on a utility differential network, and belongs to the field of artificial intelligence applications.
Background technology
Intelligent robot behavior generally refers to the process in which a robot performs reasoning and decision-making on the basis of perceiving its surrounding environment, so as to reach intelligent behavior decisions. Building an intelligent behavior decision model requires knowledge acquisition, representation and inference, together with the ability to automatically evaluate robot behavior. At present, cognitive behavior models based on reinforcement learning techniques have advantages in knowledge acquisition, adaptability of policy setting and reusability, which make them the first choice for intelligent behavior modeling.
The reinforcement learning process requires exploration of the environment. It can be described as follows: in a given state, the decision maker selects and executes an action, and then perceives the next environment state and the corresponding reward. The decision maker is not directly told when to take which action; instead it revises its own behavior according to the reward so as to obtain more reward. In short, reinforcement learning lets the decision maker obtain an optimal behavior sequence through continuous trials.
At present, the behavior decision-making of robot reinforcement learning mostly uses a reactive mode based on specific knowledge or rules. The drawbacks of this mode are: first, knowledge acquisition is limited; second, the acquired knowledge is often empirical and new knowledge cannot be learned in time; third, the real-time performance of the reasoning process is not high.
Summary of the invention
Aiming at the shortcomings of the existing behavior decision-making of robot reinforcement learning, the present invention establishes a robot behavior learning model based on a utility differential network. The model is an evaluation-based learning system: through interaction with the environment, it automatically generates the control law of the system and then provides action selection. The robot behavior learning model based on a utility differential network solves the problems that general behavior decision models have limited knowledge acquisition and are overly empirical; it realizes an offline learning process and an online decision process, and solves the problem of poor real-time performance of the reasoning process.
A robot behavior learning model based on a utility differential network comprises: a utility fitting network unit, a differential signal calculating network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The utility fitting network unit computes the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is executed by the action execution unit, and outputs it to the differential signal calculating network unit. The differential signal calculating network unit computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit, the confidence evaluation network unit and the action decision network unit. The utility fitting network unit uses the differential signal ΔTD_t to update the weights of its neural network. The confidence evaluation network unit uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit, together with the differential signal, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit. The action decision network unit performs action-selection learning according to the input differential signal ΔTD_t and state space vector s_t, and outputs the action selection functions Â_k(s_t) to the action correction network unit, where j and k are integers greater than 0. The action correction network unit uses the input confidence to correct the input action selection functions Â_k(s_t), then computes the selection probability of each corrected action and outputs the action with the largest probability to the action execution unit for execution; the state space vector obtained after this action is executed is fed back again to the utility fitting network unit, the differential signal calculating network unit and the action decision network unit.
The learning model has two processes: an offline learning process and an online decision process. All of the above units participate in the offline learning process; in the online decision process, only the action decision network unit finally obtained by offline learning and the action execution unit participate. The action decision network unit in the online decision process computes the output action selection functions Â_k(s_t) from the state space vector s_t obtained after the action executed at time t, the finally selected action is output by the action selector to the action execution unit for execution, and the state space vector obtained after executing the action is input to the action decision network unit again.
The advantages and beneficial effects of the present invention are:
(1) The robot learning model of the present invention does not need to compute and generate the correct action directly; instead, the difficult problem of robot knowledge acquisition is solved in an action-learning-evaluation loop through interaction with the environment. Because the learning model does not require an explicitly specified environment model, the causal relationships of the environment are embedded implicitly in the differential feedback network, which better guarantees the completeness of the environmental knowledge acquired by the robot;
(2) The offline learning process of this model design can complete the environmental knowledge learning before the robot makes decisions, and the online decision process can further complete the robot's environmental knowledge acquisition. During runtime decision-making there are no longer exploration and learning activities; only computation and addition with the reconstructed network are needed. This offline/online design ensures that the robot's behavior decision-making has good real-time performance, and better guarantees the timeliness and effectiveness of robot behavior decisions.
Description of drawings
Fig. 1 is a schematic structural diagram of the offline learning process in the first embodiment of the learning model of the present invention;
Fig. 2 is a schematic flow diagram of the action decision network in the first embodiment of the learning model of the present invention;
Fig. 3 is a schematic diagram of the genetic-operator coding structure in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 4 is a schematic diagram of the genetic-operator crossover operation in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments. The first embodiment details the offline learning process of the learning model of the present invention; the second embodiment describes the online decision process.
As shown in Fig. 1, the learning model of the present invention comprises five parts: a utility fitting network unit 11, a differential signal calculating network unit 12, a confidence evaluation network unit 13, an action decision network unit 14 and an action correction network unit 15. All five parts participate in the offline learning process of the learning model of the present invention.
The utility fitting network unit 11 computes the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t selected at time t is executed by the action execution unit 16, and outputs the utility fitting value Û(s_t) to the differential signal calculating network unit 12; the differential signal calculating network unit 12 outputs the differential signal ΔTD_t to the confidence evaluation network unit 13 and the utility fitting network unit 11. The utility fitting network unit 11 in turn uses the differential signal ΔTD_t input from the differential signal calculating network unit 12 to update itself continuously, thereby approaching the true utility fit.
The differential signal calculating network unit 12 computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit 11, the confidence evaluation network unit 13 and the action decision network unit 14.
The confidence evaluation network unit 13 uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit 11, together with the differential signal ΔTD_t, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit 15 for adjusting the action selection.
The action decision network unit 14, according to the input differential signal ΔTD_t and state space vector s_t, optimizes its neural network with a hierarchical genetic algorithm to realize action-selection learning, and outputs the action selection functions Â_k(s_t) to the action correction network unit 15, where j and k are integers greater than 0.
The action correction network unit 15 uses the input confidence to correct the input action selection functions Â_k(s_t) and outputs the action with the largest probability. The state space vector obtained after the action is executed is fed back again to the utility fitting network unit 11, the differential signal calculating network unit 12 and the action decision network unit 14.
The utility fitting network unit 11 is used to perform utility evaluation of the state change caused by a specific behavior and to obtain the utility fitting value; it is constituted by a two-layer feedback neural network, as shown in Fig. 1. The input of the neural network is the state space vector s_t, the hidden-layer activation function is the Sigmoid function, and the output of the neural network is the utility fitting value of the state after the action is executed. The weight coefficients of the neural network are A, B and C. The neural network comprises n input vector units and h hidden units; each hidden unit receives n inputs and has n connection weights, and the output unit receives n + h inputs and has n + h weights. The value of h can be set by the user, generally 3; it is set to 2 in the embodiment of the invention.
The input vector of this neural network is x_i(t), i = 1, 2, 3, ..., n, where x_i(t) is obtained by normalizing s_t. The output vector of the hidden units is then

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where g is the Sigmoid activation function g(z) = 1/(1 + e^{-z}), and a_ij(t) is the vector of the weights A between the input layer and the hidden layer. The output of the utility fitting network 11 is the utility fitting value Û(s_t), a linear combination of the input layer and the hidden layer:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

where b_i(t) denotes the vector of the weights B between the input layer and the output layer, and c_j(t) denotes the vector of the weights C between the hidden layer and the output layer.
The weights A, B and C of the network are updated with the differential signal ΔTD_t. If the differential signal ΔTD_t is positive, the previous action has produced a positive effect, so the chance of that action being selected should be reinforced. The weights B between the input layer and the output layer and the weights C between the hidden layer and the output layer are updated according to

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ is a constant greater than zero that can be set by the user. The weights A between the input layer and the hidden layer are updated according to

a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ_h is a number greater than zero that can be set by the user, ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1, and sgn is the function

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}

with z here being the weight vector component c_j(t) of C.
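For illustration only, the following is a minimal Python sketch of a utility fitting network of the kind described above (the class name, array shapes, initial weights and learning-rate values are assumptions, not specified by the patent):

```python
import numpy as np

class UtilityFittingNetwork:
    """Sketch of the utility fitting network unit: input layer -> hidden layer (Sigmoid) -> utility output."""

    def __init__(self, n, h=2, lam=0.1, lam_h=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(n, h))  # input-hidden weights a_ij(t)
        self.B = rng.normal(scale=0.1, size=n)       # input-output weights b_i(t)
        self.C = rng.normal(scale=0.1, size=h)       # hidden-output weights c_j(t)
        self.lam = lam                               # lambda   (> 0)
        self.lam_h = lam_h                           # lambda_h (> 0)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        """x: normalized state vector x_i(t). Returns the utility fit U_hat(s_t) and the hidden output y_j(t)."""
        y = self._sigmoid(x @ self.A)                # y_j(t) = g[sum_i a_ij(t) x_i(t)]
        u_hat = self.B @ x + self.C @ y              # U_hat = sum_i b_i x_i + sum_j c_j y_j
        return u_hat, y

    def update(self, x, y, td):
        """Reinforce the weights with the differential signal DeltaTD_{t+1} (td)."""
        self.B += self.lam * td * x                  # b_i(t+1) = b_i(t) + lam * TD * x_i
        self.C += self.lam * td * y                  # c_j(t+1) = c_j(t) + lam * TD * y_j
        # a_ij(t+1) = a_ij(t) + lam_h * TD * y_j * sgn(c_j) * x_i
        self.A += self.lam_h * td * np.outer(x, y * np.sign(self.C))
```

A caller would compute `u_hat, y = net.forward(x)` after each action and later call `net.update(x, y, td)` once the differential signal for the next state is available.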
As shown in Fig. 1, the differential signal calculating network unit 12 computes the differential signal ΔTD_t from the fitted utility output by the utility fitting network unit 11 and the immediate reward function R(s_t) of the state. According to the temporal-difference (TD) method, ΔTD_t is obtained iteratively by

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of state s_t, i.e., the output of the immediate reward function, and γ is the discount factor, which can be set by the user. Û(s_{t+1}) denotes the utility fitting value of the state space vector s_{t+1} produced after the action executed at time t+1, and Û(s_t) denotes the utility fitting value of the state space vector s_t produced after the action executed at time t.
The computed differential signal ΔTD_t is used to train and update the weight coefficients of the utility fitting network unit 11 and the confidence evaluation network unit 13. If the differential signal ΔTD_t indicates a positive effect, the action should be reinforced and its confidence should also be strengthened, i.e., it is believed more strongly that this action should be selected. In addition, the differential signal ΔTD_t is also used to update the weights of the action selection functions in the action decision network unit 14, so as to guarantee the selection of the optimal action.
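The TD computation itself reduces to a one-line helper; a minimal sketch (the function and parameter names, and the default γ, are assumptions):

```python
def td_error(reward_t, u_hat_t, u_hat_t1, gamma=0.9):
    """DeltaTD_t = R(s_t) + gamma * U_hat(s_{t+1}) - U_hat(s_t)."""
    return reward_t + gamma * u_hat_t1 - u_hat_t
```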
As shown in Fig. 1, when the action decision network unit 14 outputs the action decision functions, the confidence evaluation network unit 13 computes the confidence of the output action; this confidence is used to adjust the action selection. The inputs of the confidence evaluation network unit 13 are the state vectors x_i(t) and y_j(t), which are taken from the input layer and the hidden layer of the utility fitting network unit 11.
The confidence p_0(t) is computed by

p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the weights α_i(t) and β_j(t) are updated by

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

Here λ_p denotes the learning rate, a value between 0 and 1 with an empirical value of 0.618, which the user can set according to experience. The formula above cannot guarantee that p_0(t) lies in the interval [0, 1], so the Sigmoid function is introduced to transform p_0(t) into p(t); in this way the output confidence matches a probability for the random function:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}
The confidence correction factor a plays the role of smoothing the learning process: changing a changes the extent to which learning adjusts to the environment. If a is too large, the learning system loses its regulating effect, so a suitable value of a should be set according to prior knowledge, with a > 0; in the present invention the range of a is [1, 10].
The confidence reflects the regulating effect of decision uncertainty on action selection. It can be seen that as the utility of the state gradually approaches its true value, i.e., as ΔTD_t grows, the confidence p(t) also gradually increases and the selection of the action becomes more and more definite. The output confidence p(t) is then used to correct each action selection function Â_j(s_t) output by the action decision network unit 14; the correction process is completed in the action correction network unit 15.
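A minimal sketch of the confidence evaluation unit in Python (class name, zero initialization of the weights and the default value of a are assumptions):

```python
import numpy as np

class ConfidenceEvaluator:
    """Sketch of the confidence evaluation network unit."""

    def __init__(self, n, h, lam_p=0.618, a=2.0):
        self.alpha = np.zeros(n)      # weights alpha_i(t) on the input-layer vector x_i(t)
        self.beta = np.zeros(h)       # weights beta_j(t) on the hidden-layer output y_j(t)
        self.lam_p = lam_p            # learning rate in (0, 1), empirical value 0.618
        self.a = a                    # confidence correction factor, in [1, 10]

    def confidence(self, x, y):
        p0 = self.alpha @ x + self.beta @ y           # p_0(t)
        return 1.0 / (1.0 + np.exp(-self.a * p0))     # p(t), squashed into (0, 1)

    def update(self, x, y, td):
        self.alpha += self.lam_p * td * x             # alpha_i(t+1)
        self.beta += self.lam_p * td * y              # beta_j(t+1)
```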
The action decision network unit 14 is realized with a neural network divided into four layers, as shown in Fig. 1. From the first layer to the fourth layer they are: the input layer, the fuzzy subset layer, the variable node layer and the function output layer, where the variable node layer is also called the function fitting layer. Let h = 1, 2, 3, 4 denote the four layers of the network, and let IN_i^h and O_i^h be the input and output of the i-th node of the h-th layer, where i indexes the nodes of each layer; the first layer has I nodes, the second layer has I*J nodes, the third layer has L nodes and the fourth layer has K nodes, with I, J, K, L positive integers. The mean m_ij and variance σ_ij are respectively the location parameter and width of the Gaussian membership function of the j-th node corresponding to input x_i(t) in the second layer.
The input of the input layer of the neural network of the action decision network unit 14 is x_i(t), obtained by normalizing the state space vector s_t; it characterizes the situation information of the robot at the input time. The input IN_i^1 of the i-th node of the input layer is

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I
The fuzzy subset layer performs fuzzification on the input variables of the input layer; its output is the membership degree of each input vector. Each input x_i(t) of the input layer corresponds to J inputs in the fuzzy subset layer (in Fig. 1, J is 2), where each of these inputs is one fuzzy subset of x_i(t), and the output is the membership degree of x_i(t) in that fuzzy subset. The activation function of each node is a Gaussian membership function, and the output is

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \quad j = 1, 2, 3, \ldots, J

where O^2_{x_i j} is the j-th output corresponding to input x_i(t), and exp is the exponential function with the natural base e.
To fit the action functions, the neural network needs a certain degree of adjustable output; the variable node layer realizes this regulating function through changes in the number of nodes and in the connection weights. The number of nodes and the connection weights are optimized with the hierarchical genetic algorithm, which dynamically determines their number and values so that the network fits the action functions; this is described in detail below. The activation function of the variable node layer is a Gaussian function whose location parameter and width are m_l and σ_l, respectively. The number of connections between the second layer and the third layer is also not fixed and needs to be adjusted dynamically during optimization, while the connection weights are all 1. The output of the l-th node of the third layer is

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1,j=1}^{I,J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L
The number of nodes of the function output layer equals the number of selectable actions; the output of the function output layer is the fitted value of the action functions, used to compute the selection probability of each action. The output of the k-th node of the fourth layer is

O_k^4 = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

where the fourth-layer output O_k^4 is exactly the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

Each node of the third layer is connected to the fourth layer; ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and these connection weights ω_lk also need to be adjusted dynamically during optimization.
Suppose the first layer of the network has I inputs and the i-th input has k_i fuzzy divisions in the second layer; then the second layer has k_1 + k_2 + ... + k_I nodes in total, and the node function is the membership function of each input with respect to its fuzzy subset. In summary, the network structure that needs to be adjusted and optimized dynamically is: the number of third-layer nodes and the number of connections between the second layer and the third layer. The network parameters that need to be adjusted and optimized are: the location parameters m_ij and widths σ_ij of the membership functions of the second-layer inputs, the location parameters m_l and widths σ_l of the Gaussian activation functions of the third layer (hidden layer), and the connection weights ω_lk between the third layer and the fourth layer.
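For illustration, a minimal sketch of the four-layer forward pass described above follows; the class name and shapes are assumptions, and the 0/1 connection mask `conn` is an assumed way of realizing the adjustable layer-2/layer-3 connections:

```python
import numpy as np

class ActionDecisionNetwork:
    """Sketch of the four-layer fuzzy network: input -> fuzzy subset -> variable node -> function output."""

    def __init__(self, m_ij, s_ij, m_l, s_l, conn, w_lk):
        self.m_ij, self.s_ij = m_ij, s_ij   # (I, J) membership locations / widths for layer 2
        self.m_l, self.s_l = m_l, s_l       # (L,) Gaussian locations / widths for layer 3
        self.conn = conn                    # (I*J, L) 0/1 mask: layer-2 to layer-3 connections
        self.w_lk = w_lk                    # (L, K) layer-3 to layer-4 weights omega_lk

    def forward(self, x):
        """x: normalized state x_i(t), shape (I,). Returns A_hat_k(s_t), shape (K,)."""
        # Layer 2: Gaussian memberships O2_{ij} = exp(-((x_i - m_ij) / sigma_ij)^2)
        o2 = np.exp(-(((x[:, None] - self.m_ij) / self.s_ij) ** 2))   # (I, J)
        # Layer 3: each variable node sums its connected memberships, then applies a Gaussian
        s = o2.reshape(-1) @ self.conn                                # (L,)
        o3 = np.exp(-(((s - self.m_l) / self.s_l) ** 2))              # (L,)
        # Layer 4: action selection functions A_hat_k = sum_l omega_lk * O3_l
        return o3 @ self.w_lk                                         # (K,)
```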
Here, a hybrid hierarchical genetic algorithm is used to optimize the structure and parameters of the neural network in the action decision network unit. The structure optimization of the network determines the number of third-layer nodes and the number of connections between the second layer and the third layer. The parameter optimization of the network covers the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the third-layer hidden nodes, and the connection weights ω_lk between the third layer and the fourth layer. Optimizing and adjusting the neural network with the hierarchical genetic algorithm makes the network, in every round of decision-making, continually optimize and obtain the action selection functions according to the change of the input differential signal, so as to realize the selection of actions.
The action correction network unit 15 uses the evaluation value output by the confidence evaluation network unit 13, namely the action confidence p(t), to correct the action selection functions Â_j(s_t) output by the action decision network unit 14, then computes the probability of each action being chosen and outputs the action with the largest probability.
The correction process generates a random function with Â_j(s_t) as its mean and p(t) as its probability, and takes it as the new action selection function A_j(s_t). The smaller p(t) is, the farther A_j(s_t) lies from Â_j(s_t); conversely, the closer it is to Â_j(s_t). The new A_j(s_t) then replaces Â_j(s_t). The larger the value of the action selection function A_j(s_t), the larger the probability that the corresponding action a_j is selected. The selection probability is computed as

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

and the action with the largest probability value is output.
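A minimal sketch of this correction-and-selection step in Python; the Gaussian perturbation scaled by (1 - p) is an assumed way to realize "a random function with Â as mean whose spread shrinks as the confidence grows", which the patent does not spell out:

```python
import numpy as np

def correct_and_select(a_hat, p, rng=None):
    """Sketch of the action correction network unit.

    a_hat : array of action selection functions A_hat_j(s_t)
    p     : confidence p(t) in (0, 1)
    """
    rng = np.random.default_rng() if rng is None else rng
    a_new = a_hat + (1.0 - p) * rng.normal(size=a_hat.shape)   # corrected A_j(s_t)
    probs = np.exp(a_new - a_new.max())                        # softmax P(a_j | s_t), numerically stabilized
    probs /= probs.sum()
    return int(np.argmax(probs)), probs                        # index of the most probable action
```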
In the robot behavior learning model, the action decision network unit 14 further comprises four subunits: a coding unit 141, a population initialization unit 142, a fitness function determination unit 143 and a genetic operation unit 144, as shown in Fig. 2.
The coding unit 141 determines the chromosome structure of the genetic algorithm. The hierarchical genetic algorithm is proposed according to the hierarchical structure of the biological chromosome: the genes in a biological chromosome can be divided into control genes and structural genes, and the role of a control gene is to control whether a structural gene is activated. Here, drawing on this characteristic of biological chromosome genes, the above optimization problem is encoded. Each individual in the population is composed of the two parts that determine the structure and the parameters of the network. The gene structure of a population individual adopts a two-level hierarchical coding, i.e., it is realized in two layers according to the gene hierarchy of the biological chromosome. The upper-layer genes encode the number of third-layer nodes and the membership functions of the second-layer inputs, i.e., the number of third-layer nodes and the parameters m_ij and σ_ij of the second-layer input membership functions; as shown in Fig. 3, the part that controls the number of third-layer (hidden-layer) nodes is called the control gene. The lower layer is the parameter gene, which encodes the membership functions of the third-layer (hidden-layer) nodes and the network connections, including the third-layer (hidden-layer) node membership-function parameters m_l and σ_l, the number of connections between the second layer and the third layer, and the connection weights ω_lk between the third layer and the fourth layer.
The control genes and the genes of the parameter gene that represent the network connections all adopt binary coding, with "0" and "1" representing "absent" and "present", respectively. The other genes, representing membership-function parameters and connection weights, all adopt real-valued coding, i.e., they are represented by real numbers. The third-layer structure is encoded as a binary string, with one bit per third-layer node serving as the control gene: "1" means the node is active and "0" means the node is inactive. Thus the number of "1"s in the control gene string is the actual number of active hidden nodes of the neural network. In the parameter gene, the genes for the connections between the second and third layers adopt binary coding, where "1" means the corresponding second-layer node is connected to the third layer and "0" means it is not. The genes for the weights between the third and fourth layers adopt real-valued coding and represent the connection weights between the third and fourth layers.
It can thus be seen that the control genes control the number of nodes: if a certain node is "0", the node has no connections to either adjacent layer, and correspondingly its parameter genes do not exist. The parameter genes are therefore controlled by the control genes: if a node in the upper-layer control gene does not exist, the corresponding lower-layer parameter genes are not activated. This embodies the control effect of the control genes, and this control effect corresponds to the topological structure of the network. The chromosomes formed by this encoding, one by one, constitute the population, and evolution is carried out with them.
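For illustration, a minimal sketch of such a two-level chromosome and of the random initialization described next; all field names, shapes and initial distributions are assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Chromosome:
    """Two-level hierarchical encoding: control genes (upper layer) and parameter genes (lower layer)."""
    control: np.ndarray    # (L_max,) binary: 1 = hidden node active, 0 = inactive
    m_ij: np.ndarray       # (I, J) second-layer membership locations
    s_ij: np.ndarray       # (I, J) second-layer membership widths
    m_l: np.ndarray        # (L_max,) third-layer Gaussian locations
    s_l: np.ndarray        # (L_max,) third-layer Gaussian widths
    conn: np.ndarray       # (I*J, L_max) binary layer-2 -> layer-3 connections
    w_lk: np.ndarray       # (L_max, K) real-valued layer-3 -> layer-4 weights

def random_chromosome(I, J, L_max, K, rng):
    """Random initialization of one individual (population initialization unit)."""
    return Chromosome(
        control=rng.integers(0, 2, size=L_max),
        m_ij=rng.normal(size=(I, J)), s_ij=np.abs(rng.normal(size=(I, J))) + 0.1,
        m_l=rng.normal(size=L_max), s_l=np.abs(rng.normal(size=L_max)) + 0.1,
        conn=rng.integers(0, 2, size=(I * J, L_max)),
        w_lk=rng.normal(size=(L_max, K)),
    )
```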
Further, the population initialization unit 142 initializes the chromosome population. For the genetic algorithm to run smoothly, a number of chromosome individuals need to be generated in advance, and these individuals should be generated at random so that they represent the possibilities of many network structures, i.e., a sufficiently large solution space. A suitable population size is significant for the convergence of the genetic algorithm: if the population is too small it is hard to obtain a satisfactory result, and if it is too large the computation becomes complex; the population size is generally taken as 10 to 160.
Further, the fitness function determination unit 143 determines the chromosome fitness function. The individual fitness function is expressed in terms of the individual error and the structural complexity, so that the complexity of the network is taken into account while optimizing the individual error and an optimal network structure is obtained. The fitness function of the network has the form

f(i) = \alpha \frac{1}{E(i)} + \beta \frac{1}{H(i)}, \quad i = 1, 2, \ldots, I

where E(i) and H(i) denote respectively the individual error and the structural complexity of the i-th individual, with

E(i) = \sum_{j=1}^{K} (\hat{y}_{ij} - y_{ij})^2

H(i) = 1 + \exp[-c\, N_i(0)]

Here ŷ_ij and y_ij are the j-th actual output and desired output of the i-th individual, where the desired output y_ij is the selection function of the desired action: if an action is the desired output, its expected value is set to a fixed constant, and the expected action functions of all other actions are set to 0. N_i(0) is the number of hidden nodes of the i-th individual that are zero, and c is a parameter regulating factor; b and c are constants, and α and β are constants greater than zero with α + β = 1. Using such a fitness function makes it possible to obtain a suitable neural network structure while optimizing the network weights.
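A minimal sketch of this fitness computation; the default values of α, β and c are assumptions:

```python
import numpy as np

def fitness(y_hat, y_des, n_zero_nodes, alpha=0.5, beta=0.5, c=0.1):
    """Individual fitness f(i) = alpha / E(i) + beta / H(i).

    y_hat        : actual outputs of the individual's network for the K actions
    y_des        : desired outputs (desired action set to a constant, all others 0)
    n_zero_nodes : number of inactive (zero) hidden nodes N_i(0)
    """
    E = float(np.sum((y_hat - y_des) ** 2)) + 1e-12   # individual error E(i), guarded against division by zero
    H = 1.0 + np.exp(-c * n_zero_nodes)               # structural complexity term H(i)
    return alpha / E + beta / H
```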
Further, the genetic operation unit 144 performs the genetic operations, which comprise selection, crossover and mutation. After selection, crossover and mutation, the initial population has undergone one round of genetic operations and completed one round of evolution, producing a new generation of offspring; this process is cycled so that evolution continues and the offspring converge to the optimum.
Selection means choosing, from the previous generation of the population, some good individuals to inherit into the next generation according to the fitness of the individuals and according to a certain rule or method. The algorithm adopts elitist selection, i.e., according to the magnitude of the fitness value, the best individual of each generation is kept into the next generation; this guarantees the asymptotic convergence of the algorithm. For individual i, its selection probability is

p_s(i) = \frac{f_i}{\sum_{j=1}^{N} f_j}

where f_i is the fitness of individual i and N is the number of individuals in the population.
The crossover operation exchanges corresponding gene positions of two individuals at random; this process reflects random information exchange, and its purpose is to produce new gene combinations, i.e., new individuals. When evolution has proceeded to a certain degree, in particular when most individuals in the population become identical, crossover can hardly produce new individuals, and new individuals can then only be produced by mutation. Mutation changes gene positions with a certain probability in order to open up new search space; that is to say, mutation adds a global-optimization character. In the process of crossover and mutation, randomness plays an important role; only random crossover and mutation operations can guarantee the appearance of new individuals, and this randomness is expressed through the crossover and mutation probabilities.
In the genetic operation process, the crossover probability and the mutation probability have a significant impact on the performance of the genetic algorithm. If, at the initial stage of the genetic algorithm (Genetic Algorithm, GA), the crossover probability is chosen large and the mutation probability small, the convergence speed of the algorithm can be accelerated, which is beneficial to searching for the optimal solution. However, as the search proceeds, the crossover probability needs to be reduced and the mutation probability increased, so that the algorithm does not easily fall into local extrema and can search for new solutions.
At the same time, the mutation probability must not be taken too large, otherwise the algorithm will have difficulty converging and will destroy the genes of the optimal solution. For solutions with high fitness, lower crossover and mutation probabilities are taken so that they have a larger chance to enter the next generation; for solutions with lower fitness, higher crossover and mutation probabilities should be taken so that they are eliminated as early as possible; when premature convergence occurs, the crossover and mutation probabilities should be increased to accelerate the generation of new individuals. According to the above selection principles for the crossover and mutation probabilities, an adaptive crossover probability and mutation probability method is adopted, computed as follows:
p_c = \begin{cases} \dfrac{f_{\max} - f_{avg}}{f} & (f_{\max} - f_{avg}) < f \\ 0.8 & (f_{\max} - f_{avg}) \ge f \end{cases}

p_m = \begin{cases} \dfrac{0.2 (f_{\max} - f')}{f_{\max} - f_{avg}} & (f_{\max} - f') < (f_{\max} - f_{avg}) \\ 0.2 & (f_{\max} - f') \ge (f_{\max} - f_{avg}) \end{cases}

where p_c is the crossover probability and p_m is the mutation probability, f_max is the maximum fitness in the population, f_avg is the average fitness, f is the larger fitness of the two individuals being crossed, and f' is the fitness of the mutating individual.
When the evolution space is large, this method can quickly find the optimal solution; when converging near a locally optimal solution, it increases the diversity of the population. It can be seen that the mutation probability of the individual with maximum fitness is zero and that the crossover and mutation probabilities of individuals with larger fitness are all very small, which protects excellent individuals; the crossover and mutation probabilities of individuals with smaller fitness are all very large, so those individuals are constantly disrupted.
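A minimal sketch of these adaptive probabilities; the function and parameter names are assumptions:

```python
def adaptive_probabilities(f_max, f_avg, f_cross, f_mut):
    """Adaptive crossover probability p_c and mutation probability p_m.

    f_cross : larger fitness of the two individuals selected for crossover (f)
    f_mut   : fitness of the individual selected for mutation (f')
    """
    spread = f_max - f_avg
    p_c = spread / f_cross if spread < f_cross else 0.8
    p_m = 0.2 * (f_max - f_mut) / spread if (f_max - f_mut) < spread else 0.2
    return p_c, p_m
```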
Crossover is performed between two chosen individuals according to the crossover probability; the crossover operation operates on the corresponding parts of the control genes and of the parameter genes separately, as shown in Fig. 4. Such a crossover makes the corresponding genes of the two chromosomes cross, and also guarantees that binary-coded and real-coded genes cross with their counterparts. The crossover of corresponding positions of the two chromosomes adopts single-point crossover: the same position is selected at random in the two individuals and the genes are exchanged at the chosen position.
The mutation operation covers all genes. For the binary-coded genes in the control genes and in the parameter genes, bit mutation is adopted, i.e., a logical inversion is carried out: "1" becomes "0" and "0" becomes "1". For the real-valued genes, a Gaussian mutation of the following linear-combination form is carried out:
\hat{m}_{ij} = m_{ij} + \alpha \frac{1}{f} N(0, 1)
\hat{\sigma}_{ij} = \sigma_{ij} + \alpha \frac{1}{f} N(0, 1)
\hat{m}_l = m_l + \alpha \frac{1}{f} N(0, 1)
\hat{\sigma}_l = \sigma_l + \alpha \frac{1}{f} N(0, 1)
\hat{\omega}_{lk} = \omega_{lk} + \alpha \frac{1}{f} N(0, 1)

where α is the evolution rate, f is the fitness of the individual, and N(0, 1) is a normally distributed random function with expectation 0 and standard deviation 1.
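A minimal sketch of this mutation of the real-valued genes; the function name is an assumption:

```python
import numpy as np

def gaussian_mutate(values, alpha, f, rng):
    """Linear-combination Gaussian mutation for real-valued genes.

    values : array of real-valued genes (m_ij, sigma_ij, m_l, sigma_l or omega_lk)
    alpha  : evolution rate
    f      : fitness of the individual (larger fitness -> smaller perturbation)
    """
    return values + alpha * (1.0 / f) * rng.standard_normal(values.shape)
```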
In summary, the steps of the algorithm that realizes the neural network optimization with the hierarchical genetic algorithm are as follows:
1. Encode the network structure and parameters according to the hierarchical structure to generate chromosome individuals.
2. Randomly generate an initial chromosome population of 2N individuals and set the evolution generation to t = 0.
3. Compute the fitness value of each individual and the maximum and average fitness values in the population according to the formulas.
4. Select N individuals from the population as parents according to the individual selection probabilities, and set t = t + 1.
5. Randomly select two individuals from the parents and perform crossover according to the crossover probability. If crossover takes place, first copy the two individuals and keep the originals, then perform the crossover on the copies to produce two new individuals. Repeat until the whole parent population has been crossed.
6. Perform mutation on all individuals according to the mutation probability.
7. When the fitness of the best individual and the fitness of the population reach given thresholds, or the maximum evolution generation is reached, the iterative process of the algorithm converges and the algorithm ends; otherwise go to step 3 and continue until the termination condition is satisfied.
After the optimization is finished, the network structure and parameters of the best individual are taken as the decision network and used to realize the computation of action decisions.
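For illustration, the following is a compact skeleton of such a loop, following steps 1-7 above; it is a sketch only, the caller supplies the operators `init_individual`, `evaluate`, `crossover` and `mutate`, and the population sizes, thresholds and elitism details are assumptions (fitness values are assumed positive, as in the fitness function above):

```python
import numpy as np

def hierarchical_ga(init_individual, evaluate, crossover, mutate,
                    n_parents=20, max_gen=100, f_threshold=None, seed=0):
    """Skeleton of the hierarchical-GA optimization loop.

    init_individual(rng) -> individual, evaluate(ind) -> fitness,
    crossover(a, b, p_c, rng) -> (child, child), mutate(ind, p_m, rng) -> individual.
    """
    rng = np.random.default_rng(seed)
    pop = [init_individual(rng) for _ in range(2 * n_parents)]            # steps 1-2
    for _ in range(max_gen):
        fits = np.array([evaluate(ind) for ind in pop], dtype=float)      # step 3
        f_max, f_avg = fits.max(), fits.mean()
        spread = max(f_max - f_avg, 1e-12)
        if f_threshold is not None and f_max >= f_threshold:              # step 7: converged
            break
        idx = rng.choice(len(pop), size=n_parents, p=fits / fits.sum())   # step 4: selection
        parents = [pop[i] for i in idx]
        elite = pop[int(fits.argmax())]                                   # elitist selection
        children = [elite]
        while len(children) < 2 * n_parents:                              # step 5: crossover
            a, b = rng.choice(n_parents, size=2, replace=False)
            f_c = max(evaluate(parents[a]), evaluate(parents[b]))
            p_c = spread / f_c if spread < f_c else 0.8                   # adaptive p_c
            children.extend(crossover(parents[a], parents[b], p_c, rng))
        pop = []
        for ind in children[: 2 * n_parents]:                             # step 6: mutation
            f_i = evaluate(ind)
            p_m = 0.2 * (f_max - f_i) / spread if (f_max - f_i) < spread else 0.2
            pop.append(mutate(ind, p_m, rng))
    fits = np.array([evaluate(ind) for ind in pop], dtype=float)
    return pop[int(fits.argmax())]                                        # best individual -> decision network
```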
In the action decision network unit 14, the hierarchical genetic algorithm is used to optimize the structure and parameters of the network. Whenever a new situation occurs, the differential signal ΔTD_t given by the temporal-difference (TD) method is first used to update the parameters of the action selection network, in the hope of obtaining more advantageous selectable actions. Specifically, the differential signal ΔTD_t is used to update the connection weights between the third layer and the fourth layer in the parameter genes of every chromosome in the population, and the genetic operations are carried out afterwards. In this way the whole weight space of the corresponding action function is updated, and the new weights of the corresponding action obtained through inheritance should also be larger, reflecting the learning of this optimal action. In the update of the connection weights driven by the differential signal, ω_ij is the connection weight between the i-th hidden node of the third layer and the j-th action selection function of the fourth layer, and the weighting coefficient is a value between 0 and 1 with an empirical value of 0.62.
This embodiment trains the neural network with the hierarchical genetic algorithm and realizes knowledge learning. It addresses the fact that prior-art behavior decision research mostly uses a reactive mode based on specific knowledge or rules, and it solves the knowledge acquisition and reasoning decision problems of robot behavior decision-making fairly well: by approaching the completeness of knowledge through learning from interaction with the environment, the agent possesses a higher level of learning and reasoning ability.
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention. After offline learning, the finally obtained action decision network unit 14 is optimal, and this action decision network unit 14 is used for real-time online decision-making. All other units, such as the utility fitting network unit 11, the differential signal calculating network unit 12, the confidence evaluation network unit 13 and the action correction network unit 15, are removed in the online decision process and are no longer used. The action decision network unit 14 computes the output action selection functions Â_k(s_t) from the state space vector s_t obtained after the selected action a_t is executed by the action execution unit 16; the finally selected action is output by the action selector, and the state space vector obtained after this action is executed by the action execution unit 16 is input to the action decision network unit 14 again.
This embodiment uses the trained neural network to perform real-time behavior decision-making of the robot. Separating the learning process from the decision process guarantees the efficiency of online decision-making and satisfies the needs of real-time operation.
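A minimal sketch of the online decision loop, in which only the optimized decision network and the action execution unit take part; `decision_net.forward`, `execute_action` and the min-max normalization are assumed interfaces, not defined by the patent:

```python
import numpy as np

def online_decision_loop(decision_net, execute_action, initial_state, steps=100):
    """Sketch of the online decision process: decide with the trained network only, no learning."""
    s = np.asarray(initial_state, dtype=float)
    for _ in range(steps):
        x = (s - s.min()) / (np.ptp(s) + 1e-12)          # assumed normalization of s_t into x_i(t)
        a_hat = decision_net.forward(x)                  # action selection functions A_hat_k(s_t)
        action = int(np.argmax(a_hat))                   # action selector: pick the best action
        s = np.asarray(execute_action(action), float)    # action execution unit returns the new state
    return s
```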

Claims (1)

1. A model building device for a robot behavior learning model based on a utility differential network, comprising an action execution unit (16), characterized in that the model building device further comprises: a utility fitting network unit (11), a differential signal calculating network unit (12), a confidence evaluation network unit (13), an action decision network unit (14) and an action correction network unit (15);
the utility fitting network unit (11) is used to compute the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is executed by the action execution unit (16), and to output it to the differential signal calculating network unit (12); the differential signal calculating network unit (12) computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit (11), the confidence evaluation network unit (13) and the action decision network unit (14); the utility fitting network unit (11) uses the differential signal ΔTD_t to update the weights of the neural network in the utility fitting network unit (11); the confidence evaluation network unit (13) uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit (11), together with the differential signal, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit (15); the action decision network unit (14) performs action-selection learning according to the input differential signal ΔTD_t and state space vector s_t and outputs the action selection functions Â_k(s_t) to the action correction network unit (15), where j and k are integers greater than 0; the action correction network unit (15) uses the input confidence to correct the input action selection functions Â_k(s_t), then computes the selection probability of each corrected action and outputs the action with the largest probability to the action execution unit (16) for execution; the state space vector after this action is executed is fed back again to the utility fitting network unit (11), the differential signal calculating network unit (12) and the action decision network unit (14);
the utility fitting network unit (11) is constituted by a neural network comprising an input layer, a hidden layer and an output layer, the weights of the neural network being A, B and C; the input vector x_i(t) of the input layer of the neural network is obtained by normalizing the state space vector s_t produced after the action executed at time t; the hidden-layer activation function is the Sigmoid function; and the output of the neural network is the utility fitting value of the state after the action is executed,

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

wherein b_i(t) denotes the vector of the weights B between the input layer and the output layer, c_j(t) denotes the vector of the weights C between the hidden layer and the output layer, n is the number of input-layer units, h is the number of hidden units, and y_j(t) is the output vector of the hidden units, computed with the Sigmoid function g as

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where a_ij(t) is the vector of the weights A between the input layer and the hidden layer; the weight vectors of the neural network in the utility fitting network unit (11) are specifically updated by the following formulas:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h
a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

wherein λ is a constant greater than zero, λ_h is a number greater than zero, ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1, and sgn(c_j(t)) is determined by the function sgn:

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}
the differential signal calculating network unit (12) computes the differential signal ΔTD_t according to the temporal-difference (TD) method:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

wherein R(s_t) is the immediate evaluation of the state space vector s_t, γ is the discount factor, Û(s_{t+1}) denotes the utility fitting value of the state space vector s_{t+1} produced after the action executed at time t+1, and Û(s_t) denotes the utility fitting value of the state space vector s_t produced after the action executed at time t;
the confidence p(t) finally output by the confidence evaluation network unit (13) is

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}, \quad p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

wherein the range of the confidence correction factor a is [1, 10], x_i(t) and y_j(t) are respectively the input vector of the neural network in the utility fitting network unit (11) and the output vector of its hidden units, and n and h are respectively the number of input-layer units and the number of hidden units of the neural network in the utility fitting network unit (11); the weights α_i(t+1) and β_j(t+1) corresponding to the action executed at time t+1 are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

wherein λ_p denotes the learning rate, a value between 0 and 1, and ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1;
the action decision network unit (14) is realized with a neural network comprising an input layer, a fuzzy subset layer, a variable node layer and a function output layer; the input IN_i^1 of the i-th node of the input layer is

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I

wherein I is the number of input-layer nodes and x_i(t) is the vector obtained by normalizing the state space vector s_t after the action is executed; the fuzzy subset layer performs fuzzification on the input of the input layer, and the j-th output O^2_{x_i j} corresponding to input x_i(t) is

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \quad j = 1, 2, 3, \ldots, J

wherein J is the number of inputs in the fuzzy subset layer corresponding to each x_i(t) of the input layer, and m_ij and σ_ij denote respectively the location parameter and the width of the membership function of the input vector; the activation function of the variable node layer is a Gaussian function whose location parameter and width are respectively m_l and σ_l, and the node output O_l^3 of the variable node layer is

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1,j=1}^{I,J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L

wherein L is the number of nodes of the variable node layer; the output of the function output layer is the fitted value of the action functions, namely the action selection function

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

wherein K is the number of nodes of the function output layer, ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and I, J, K, L are positive integers; the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the variable node layer, and the connection weights between the variable node layer and the function output layer are optimized and adjusted with a hierarchical genetic algorithm;
the action correction network unit (15) generates a random function with Â_j(s_t) as its mean and p(t) as its probability, takes it as the new action selection function A_j(s_t), then computes the selection probability P(a_j | s_t) and outputs the action with the largest probability value; the formula for the selection probability is

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

wherein a_j is the j-th action, s_t is the state space vector obtained after the action executed at time t, A_k(s_t) is the k-th action selection function and A_j(s_t) is the j-th action selection function;
the model building device has two processes: an offline learning process and an online decision process; all of the above units participate in the offline learning process, while in the online decision process only the action decision network unit (14) finally obtained by offline learning and the action execution unit (16) participate; the action decision network unit (14) in the online decision process computes the output action selection functions Â_k(s_t) from the state space vector s_t produced after the action execution unit (16) executes the action at time t, the finally selected action is output by the action selector to the action execution unit (16) for execution, and the state space vector obtained after the action is executed is input to the action decision network unit (14) again.
CN 201010564142 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network Expired - Fee Related CN102063640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Publications (2)

Publication Number Publication Date
CN102063640A CN102063640A (en) 2011-05-18
CN102063640B true CN102063640B (en) 2013-01-30

Family

ID=43998910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010564142 Expired - Fee Related CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Country Status (1)

Country Link
CN (1) CN102063640B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712B (en) * 2011-08-31 2014-03-05 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026B (en) * 2016-10-25 2021-05-04 河北亿超机械制造股份有限公司 Robot, mechanical arm and control method and device thereof
CN108229640B (en) * 2016-12-22 2021-08-20 山西翼天下智能科技有限公司 Emotion expression method and device and robot
CN110705682B (en) * 2019-09-30 2023-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JP特许3412700B2 2003.03.28

Also Published As

Publication number Publication date
CN102063640A (en) 2011-05-18

Similar Documents

Publication Publication Date Title
CN102063640B (en) Robot behavior learning model based on utility differential network
Liu et al. Selective ensemble learning method for belief-rule-base classification system based on PAES
CN113141012A (en) Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network
Abraham et al. Evolutionary Design of Neuro-Fuzzy Systems-A Generic Framework
Ballini et al. Learning in recurrent, hybrid neurofuzzy networks
Grosan et al. Hybrid intelligent systems
Karayiannis Learning algorithms for reformulated radial basis neural networks
Desouky et al. Learning in n-pursuer n-evader differential games
Gudino-Penaloza et al. Fuzzy hyperheuristic framework for GA parameters tuning
Angélico et al. Heuristic search applied to fuzzy cognitive maps learning
Figueiredo et al. Reinforcement learning/spl I. bar/hierarchical neuro-fuzzy politree model for control of autonomous agents
Cabrita et al. Single and multi-objective genetic programming design for B-spline neural networks and neuro-fuzzy systems
Otadi Simulation and evaluation of second-order fuzzy boundary value problems
Jacob et al. Self-reorganizing TSK fuzzy inference system with BCM theory of meta-plasticity
Ballini et al. Heuristic learning in recurrent neural fuzzy networks
Abraham Beyond integrated neuro-fuzzy systems: Reviews, prospects, perspectives and directions
Šlapák et al. Multiobjective genetic programming of agent decision strategies
Gope et al. Optimization of Fuzzy Neural Network Using Multiobjective NSGA-II
Zhang et al. Combat Decision-Making Modeling Method Based on Genetic Neural Network
Hassan et al. A multi-objective genetic type-2 fuzzy extreme learning system for the identification of nonlinear dynamic systems
Obaid et al. Study the Neural Network Algorithms of Mathematical Numerical Optimization
Talbi et al. Design of optimal fuzzy controllers for stabilization of a Helicopter Simulator using hybrid Elite Genetic Algorithm and Tabu Search
Kaul et al. Deep Learning-based Advancement in Fuzzy Logic Controller
Vítku et al. Towards evolutionary design of complex systems inspired by nature
Yang et al. Design of short-term load forecasting model based on fuzzy neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130130

Termination date: 20131129