CN102063640A - Robot behavior learning model based on utility differential network - Google Patents

Robot behavior learning model based on utility differential network

Info

Publication number
CN102063640A
CN102063640A (application CN201010564142A / CN2010105641422A)
Authority
CN
China
Prior art keywords
action
layer
utility
input
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105641422A
Other languages
Chinese (zh)
Other versions
CN102063640B (en)
Inventor
宋晓
麻士东
龚光红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN 201010564142 priority Critical patent/CN102063640B/en
Publication of CN102063640A publication Critical patent/CN102063640A/en
Application granted
Publication of CN102063640B publication Critical patent/CN102063640B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a robot behavior learning model based on a utility differential network, which comprises a utility fitting network unit, a differential signal calculation network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The model realizes an offline learning process and an online decision process. The utility fitting network unit calculates the utility fitting value of the state after an action is executed; the differential signal calculation network unit calculates the differential signal; the confidence evaluation network unit outputs the calculated confidence to the action correction network unit; the action decision network unit outputs the action selection functions; and the action correction network unit corrects the action selection functions with the confidence, calculates the probability with which each action is selected and outputs the action with the largest probability to the action execution unit for execution. The invention better guarantees the completeness of the environment knowledge acquired by the robot and better guarantees the timeliness and validity of the robot's behavior decisions.

Description

Robot behavior learning model based on a utility differential network
Technical field
The present invention relates to a robot behavior learning model based on a utility differential network, and belongs to the field of new applications of artificial intelligence.
Background technology
Intelligent robot behavior generally refers to the process in which a robot perceives its surrounding environment, reasons and makes decisions on that basis, and thereby reaches intelligent behavior decisions. Building an intelligent behavior decision model requires knowledge acquisition, representation and reasoning, as well as the ability to automatically evaluate the quality of the robot's behavior. At present, cognitive behavior models based on reinforcement learning have advantages in knowledge acquisition, adaptability of policy formulation and reusability, which make them the first choice for intelligent behavior modeling.
Reinforcement learning requires exploring the environment. The process can be described as follows: in a given state, the decision maker selects and executes an action, and then perceives the next environment state and the corresponding reward. The decision maker is not directly told when to take which action, but corrects its own behavior according to the reward in order to win more reward. In short, reinforcement learning lets the decision maker obtain an optimal behavior sequence through continual trial.
Current behavior decision making for robot reinforcement learning mostly uses a reactive mode based on specific knowledge or rules. The shortcomings of this mode are, first, that knowledge acquisition is limited; second, that the acquired knowledge is often empirical and new knowledge cannot be learned in time; and third, that the real-time performance of the reasoning process is poor.
Summary of the invention
Aiming at the shortcomings of existing behavior decision making for robot reinforcement learning, the present invention establishes a robot behavior learning model based on a utility differential network. The model is an evaluation-based learning system: through interaction with the environment it automatically generates the control law of the system, which in turn governs the selection of actions. The robot behavior learning model based on a utility differential network solves the problems that knowledge acquisition in general behavior decision models is limited and overly empirical, realizes an offline learning process and an online decision process, and solves the problem of poor real-time performance of the reasoning process.
A robot behavior learning model based on a utility differential network comprises: a utility fitting network unit, a differential signal calculation network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The utility fitting network unit calculates the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is carried out by the action execution unit, and outputs it to the differential signal calculation network unit. The differential signal calculation network unit further calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit, the confidence evaluation network unit and the action decision network unit. The utility fitting network unit uses ΔTD_t to update the weights of its neural network. The confidence evaluation network unit uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit, together with the differential signal, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit. The action decision network unit performs action-selection learning according to the input differential signal ΔTD_t and the state space vector s_t, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit. The action correction network unit corrects the input action selection functions Â_j(s_t) with the input confidence, then calculates the probability with which each action is chosen and outputs the action with the largest probability to the action execution unit for execution; the state space vector produced after this action is carried out is fed back to the utility fitting network unit, the differential signal calculation network unit and the action decision network unit.
The learning model has two processes: an offline learning process and an online decision process. All of the above units take part in the offline learning process; in the online decision process only the action decision network unit finally obtained by offline learning and the action execution unit take part. In the online decision process, the action decision network unit calculates the output action selection functions Â_j(s_t) from the state space vector s_t obtained after the action is carried out at time t; the action selector outputs the finally selected action to the action execution unit for execution, and the state space vector obtained after the action is carried out is input to the action decision network unit again.
The advantages and beneficial effects of the present invention are:
(1) The robot learning model of the present invention does not need to compute the correct action directly; instead, it solves the problem of difficult knowledge acquisition through an action-learning-evaluation loop of interaction with the environment. Because the learning model does not require an explicitly specified environment model, the cause-effect relationships of the environment are implicit in the differential feedback network, which better guarantees the completeness of the environment knowledge acquired by the robot.
(2) The offline learning process of this model completes the learning of environment knowledge before the robot makes decisions, and the online decision process can further complete the robot's acquisition of environment knowledge. During operation, decision making no longer involves exploration and learning activities and only requires calculation and addition with the reconstructed network. This offline/online design guarantees good real-time performance of the robot's behavior decisions, and thus their timeliness and validity.
Description of drawings
Fig. 1 is a schematic structural diagram of the offline learning process in the first embodiment of the learning model of the present invention;
Fig. 2 is a schematic flow diagram of the action decision network in the first embodiment of the learning model of the present invention;
Fig. 3 is a schematic diagram of the genetic-operator coding structure in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 4 is a schematic diagram of the genetic-operator crossover operation in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and embodiments. The first embodiment details the offline learning process of the learning model of the present invention; the second embodiment describes the online decision process.
As shown in Fig. 1, the learning model of the present invention comprises five parts: a utility fitting network unit 11, a differential signal calculation network unit 12, a confidence evaluation network unit 13, an action decision network unit 14 and an action correction network unit 15. All five parts take part in the offline learning process of the learning model.
The utility fitting network unit 11 calculates the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t selected at time t is carried out by the action execution unit 16, and outputs this utility fitting value to the differential signal calculation network unit 12. The differential signal calculation network unit 12 outputs the differential signal ΔTD_t to the confidence evaluation network unit 13 and the utility fitting network unit 11. The utility fitting network unit 11 is continually updated with the differential signal ΔTD_t supplied by the differential signal calculation network unit 12, so that a true utility fit is achieved.
The differential signal calculation network unit 12 calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit 11, the confidence evaluation network unit 13 and the action decision network unit 14.
The confidence evaluation network unit 13 uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit 11, together with the differential signal ΔTD_t, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit 15, where it is used to adjust the action selection.
The action decision network unit 14 optimizes its neural network with a hierarchical genetic algorithm according to the input differential signal ΔTD_t and the state space vector s_t, realizes action-selection learning, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit 15.
The action correction network unit 15 corrects the input action selection functions Â_j(s_t) with the input confidence and outputs the action with the largest probability. The state space vector produced after the action is carried out is fed back to the utility fitting network unit 11, the differential signal calculation network unit 12 and the action decision network unit 14.
The utility fitting network unit 11 evaluates the utility of the state change caused by a specific behavior and obtains the utility fitting value. It consists of a two-layer feedback neural network, as shown in Fig. 1. The input of the neural network is the state space vector s_t, the hidden-layer activation function is the Sigmoid function, and the output of the neural network is the utility fitting value of the state after the action has been carried out. The weight coefficients of the neural network are A, B and C. The network has n input units and h hidden units; each hidden unit receives n inputs and has n connection weights, and the output unit receives n + h inputs and has n + h weights. The value of h can be set by the user and is typically 3; it is set to 2 in the embodiment of the invention.
The input vector of the neural network is x_i(t), i = 1, 2, 3, ..., n, where x_i(t) is obtained by normalizing s_t. The output vector of the hidden units is then:

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h
In the above formula, a_ij(t) are the components of the weight matrix A between the input layer and the hidden layer. The output of the utility fitting network 11 is the utility fitting value Û(s_t), which is a linear combination of the input layer and the hidden layer:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)
where b_i(t) are the components of the weight vector B between the input layer and the output layer, and c_j(t) are the components of the weight vector C between the hidden layer and the output layer.
The weights A, B and C of the network are updated with the differential signal ΔTD_t. If ΔTD_t is positive, the last action produced a positive effect, so the chance of that action being selected should be reinforced. The weights B between the input layer and the output layer and the weights C between the hidden layer and the output layer are updated as follows:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ is a constant greater than zero that can be set by the user. The weights A between the input layer and the hidden layer are updated according to:

a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ_h is a number greater than zero that can be set by the user, ΔTD_{t+1} is the differential signal of the state space vector produced after the action is carried out at time t + 1, and sgn is the function

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}

with z here being the component c_j(t) of the weight vector C.
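For illustration only, the following is a minimal Python/NumPy sketch of the utility fitting network under the equations above; the class name, weight initialisation and the learning-rate values lam and lam_h are assumptions of the sketch, not values fixed by the invention.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class UtilityFitNetwork:
        """Two-layer utility fitting network: y_j = g(sum_i a_ij x_i),
        U_hat(s_t) = sum_i b_i x_i + sum_j c_j y_j."""
        def __init__(self, n, h=2, lam=0.1, lam_h=0.05, seed=0):
            rng = np.random.default_rng(seed)
            self.A = rng.normal(scale=0.1, size=(n, h))  # input-to-hidden weights a_ij
            self.B = rng.normal(scale=0.1, size=n)       # input-to-output weights b_i
            self.C = rng.normal(scale=0.1, size=h)       # hidden-to-output weights c_j
            self.lam, self.lam_h = lam, lam_h

        def forward(self, x):
            y = sigmoid(x @ self.A)          # hidden outputs y_j(t)
            u_hat = self.B @ x + self.C @ y  # utility fitting value U_hat(s_t)
            return u_hat, y

        def update(self, x, y, td):
            # reinforce the weights in the direction of the differential signal
            self.B += self.lam * td * x
            self.C += self.lam * td * y
            self.A += self.lam_h * td * np.outer(x, y * np.sign(self.C))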
As shown in Fig. 1, the differential signal calculation network unit 12 calculates the differential signal ΔTD_t from the fitted utility Û(s_t) output by the utility fitting network unit 11 and the immediate reward function R(s_t) of the state. According to the temporal-difference algorithm, ΔTD_t is obtained by the iterative computation:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of state s_t, i.e. the output of the immediate reward function, and γ is a discount factor that can be set by the user. Û(s_{t+1}) is the utility fitting value of the state space vector s_{t+1} produced after the action is carried out at time t + 1, and Û(s_t) is the utility fitting value of the state space vector s_t produced after the action is carried out at time t.
The calculated differential signal ΔTD_t is used to train and update the weight coefficients of the utility fitting network unit 11 and the confidence evaluation network unit 13. If the differential signal ΔTD_t indicates a positive effect, the action should be reinforced and its confidence should also be increased, i.e. it should be believed more strongly that this action should be selected. In addition, ΔTD_t is used to update the weights of the action selection functions in the action decision network unit 14, so as to guarantee that the optimal action is selected.
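The corresponding temporal-difference computation is a one-line sketch; the discount-factor default is an assumption.

    def td_error(reward, u_next, u_now, gamma=0.9):
        """Differential signal: Delta_TD_t = R(s_t) + gamma * U_hat(s_{t+1}) - U_hat(s_t)."""
        return reward + gamma * u_next - u_now

A typical update step would compute u_now from s_t and u_next from s_{t+1} with the utility fitting network sketched above, form the differential signal with td_error, and pass it to that network's update method.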
As shown in Fig. 1, when the action decision network unit 14 outputs the action decision functions, the confidence evaluation network unit 13 calculates the confidence of the output action, and this confidence is used to adjust the action selection. The inputs of the confidence evaluation network unit 13 are the vectors x_i(t) and y_j(t), taken from the input layer and the hidden layer of the utility fitting network unit 11.
The confidence p_0(t) is calculated by:

p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the weights α_i(t) and β_j(t) are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

Here λ_p denotes the learning rate, a value between 0 and 1 with an empirical value of 0.618, which the user may set according to experience. The above formula does not guarantee that p_0(t) lies in the interval [0, 1], so a Sigmoid function is introduced to transform p_0(t) into p(t); in this way the output confidence matches a probability for the random function:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}

The confidence correction factor a smooths the learning process. Changing a changes the range over which learning adjusts to the environment; if a is too large, the learning system loses its regulating effect, so a suitable value of a should be set according to prior knowledge, with a > 0. In the present invention, the range of a is [1, 10].
The confidence reflects the regulating effect of decision uncertainty on action selection. As the utility of the state gradually approaches its actual value, i.e. as ΔTD_t increases, the confidence p(t) also increases gradually, and the choice of action becomes more and more definite. The output confidence p(t) is then used to correct each action selection function Â_j(s_t) output by the action decision network unit 14; the correction is carried out in the action correction network unit 15.
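A possible sketch of the confidence evaluation unit, reusing the input vector x and hidden vector y of the utility fitting network; the default values of a and lam_p follow the ranges given above and are otherwise assumptions of the sketch.

    import numpy as np

    class ConfidenceEvaluator:
        """Confidence p(t) = sigmoid(a * p0(t)), with p0(t) = alpha . x + beta . y."""
        def __init__(self, n, h, a=2.0, lam_p=0.618):
            self.alpha = np.zeros(n)
            self.beta = np.zeros(h)
            self.a, self.lam_p = a, lam_p  # a in [1, 10], lam_p in (0, 1)

        def confidence(self, x, y):
            p0 = self.alpha @ x + self.beta @ y
            return 1.0 / (1.0 + np.exp(-self.a * p0))

        def update(self, x, y, td):
            self.alpha += self.lam_p * td * x
            self.beta += self.lam_p * td * y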
The action decision network unit 14 is realized by a neural network with four layers, as shown in Fig. 1. The first to fourth layers are, respectively, the input layer, the fuzzy subset layer, the variable node layer (also called the function fitting layer) and the function output layer; the layers are denoted h = 1, 2, 3, 4. Let IN_i^h and O_i^h be the input and output of the i-th node of layer h, where i indexes the nodes of each layer: the first layer has I nodes, the second layer has I*J nodes, the third layer has L nodes and the fourth layer has K nodes, with I, J, K, L positive integers. The mean m_ij and the variance σ_ij are, respectively, the location parameter and the width of the Gaussian membership function of the j-th second-layer node fed by the input x_i(t).
The input of the input layer of the neural network of the action decision network unit 14 is the vector x_i(t) obtained by normalizing the state space vector s_t; it characterizes the situation information of the robot at the input time. The input of the i-th node of the input layer is:

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I
The fuzzy subset layer fuzzifies the input variables of the input layer and outputs the membership degree of each input vector. Each input x_i(t) of the input layer corresponds to J inputs in the fuzzy subset layer (in Fig. 1, J is 2); each such input is one fuzzy subset of x_i(t), and the output is the membership degree of x_i(t) in that fuzzy subset. The activation function of each node is a Gaussian membership function, and the output is:

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \; j = 1, 2, 3, \ldots, J

where O^2_{x_i j} is the j-th output corresponding to the input x_i(t), and exp is the exponential function with base e.
To fit the action functions, the neural network needs a certain degree of adjustment of its outputs, and the variable node layer realizes this regulating function. It does so by varying the number of nodes and the connections; the node number and the connections are optimized with the hierarchical genetic algorithm, which dynamically determines their number and size so that the network fits the action functions (details are given below). The activation function of the variable node layer is a Gaussian function with location parameter m_l and width σ_l. The number of connections between the second and third layers is likewise not fixed and must be dynamically adjusted during optimization; the connection weights are all 1. The output of the l-th node of the third layer is:

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1, j=1}^{I, J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L
The number of nodes of the function output layer equals the number of optional actions. The output of the function output layer is the fitted value of the action function and is used to calculate the selection probability of each action. The output of the k-th node of the fourth layer is:

O_k^4 = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

The fourth-layer output O_k^4 is exactly the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

Each node of the third layer is connected to the fourth layer; ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and these connection weights also need to be dynamically adjusted during optimization.
Suppose the first layer of the network has I inputs and the i-th input has k_i fuzzy partitions in the second layer; then the second layer has k_1 + k_2 + ... + k_I nodes in total, and each node computes the membership of its input in the corresponding fuzzy subset. In summary, the network structure that needs to be dynamically adjusted and optimized is: the number of third-layer nodes and the connections between the second and third layers. The network parameters that need to be adjusted and optimized are: the locations m_ij and widths σ_ij of the membership functions of the second-layer inputs, the location parameters m_l and widths σ_l of the Gaussian activation functions of the third (hidden) layer, and the connection weights ω_lk between the third and fourth layers.
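For clarity, a forward pass of the four-layer action decision network can be sketched as follows; the array shapes and the 0/1 connection matrix conn between the second and third layers are assumptions used to express the formulas above, not a literal part of the design.

    import numpy as np

    def action_decision_forward(x, m_ij, s_ij, conn, m_l, s_l, w_lk):
        """x: (I,) normalized state; m_ij, s_ij: (I, J) membership centres/widths;
        conn: (I*J, L) 0/1 layer-2-to-layer-3 connections; m_l, s_l: (L,) layer-3
        Gaussian centres/widths; w_lk: (L, K) layer-3-to-layer-4 weights.
        Returns the K action selection values A_hat_k(s_t)."""
        o2 = np.exp(-((x[:, None] - m_ij) / s_ij) ** 2)  # layer 2: membership degrees
        z = conn.T @ o2.reshape(-1)                      # summed memberships per layer-3 node
        o3 = np.exp(-((z - m_l) / s_l) ** 2)             # layer 3: variable nodes
        return o3 @ w_lk                                 # layer 4: A_hat_k = sum_l w_lk * o3_l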
Here a hybrid hierarchical genetic algorithm is used to optimize the structure and parameters of the neural network in the action decision network. The structure optimization of the network determines the number of third-layer nodes and the connections between the second and third layers. The parameter optimization of the network covers the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the third-layer hidden nodes, and the connection weights ω_lk between the third and fourth layers. Optimizing and adjusting the neural network with the hierarchical genetic algorithm lets the network, in each round of decision making, continually optimize the action selection functions according to changes in the input differential signal, so as to realize the selection of actions.
The action correction network unit 15 uses the evaluation value output by the confidence evaluation network unit 13, i.e. the confidence p(t) of the action, to correct the action selection functions Â_j(s_t) output by the action decision network unit 14, then calculates the probability with which each action is chosen and outputs the action with the largest probability.
The correction process generates a random function with mean Â_j(s_t) and probability p(t) as the new action selection function A_j(s_t). The smaller p(t) is, the farther A_j(s_t) lies from Â_j(s_t); conversely, the larger p(t) is, the closer it lies to Â_j(s_t). The new A_j(s_t) replaces Â_j(s_t). The larger the value of the action selection function A_j(s_t), the larger the probability that the corresponding action a_j is selected. The selection probability is computed as:

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

The action with the largest probability value is then output.
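The correction and selection step can be sketched as follows. The description only requires that a lower confidence p(t) pushes A_j(s_t) farther from Â_j(s_t); modelling this with Gaussian noise of width proportional to 1 - p(t) is an assumption of the sketch.

    import numpy as np

    def correct_and_select(a_hat, p, noise_scale=1.0, seed=None):
        """Blur the action selection values with confidence p, then pick the action
        with the largest softmax probability P(a_j | s_t)."""
        rng = np.random.default_rng(seed)
        a = rng.normal(loc=a_hat, scale=noise_scale * (1.0 - p))  # corrected A_j(s_t)
        e = np.exp(a - a.max())                                   # numerically stable softmax
        probs = e / e.sum()
        return int(np.argmax(probs)), probs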
In the robot behavior learning model, the action decision network unit 14 further comprises four subunits: a coding unit 141, a population initialization unit 142, a fitness function determination unit 143 and a genetic operation unit 144, as shown in Fig. 2.
The coding unit 141 determines the chromosome structure of the genetic algorithm. The hierarchical genetic algorithm is proposed on the basis of the hierarchical structure of biological chromosomes: the genes of a chromosome in an organism can be divided into control genes and structural genes, and the role of the control genes is to decide whether the structural genes are activated. This characteristic of biological chromosome genes is borrowed here to encode the above optimization problem. Each individual in the population is composed of two parts, the structure and the parameters of the decision network. The gene structure of a population individual uses a two-level hierarchical coding, i.e. it is realized in two layers following the gene hierarchy of biological chromosomes. The upper-layer genes encode the number of third-layer nodes and the membership functions of the second-layer inputs, that is, the number of third-layer nodes and the parameters m_ij and σ_ij of the second-layer input membership functions. As shown in Fig. 3, the part that controls the number of third-layer (hidden) nodes is called the control gene. The lower layer is the parameter gene, which encodes the membership functions of the third-layer (hidden) nodes and the network connections, including the parameters m_l and σ_l of the third-layer node membership functions, the connections between the second and third layers, and the connection weights ω_lk between the third and fourth layers.
The control genes and the genes of the parameter gene that express the connections of the hidden nodes are binary coded, with '0' and '1' representing 'absent' and 'present' respectively. The genes expressing membership function parameters and connection weights are real-valued coded, i.e. represented by real numbers. The third-layer structure is encoded as a binary string, one bit per third-layer node, serving as the control gene: '1' means the node is active and '0' means the node is inactive, so the number of '1's in the control gene string is the actual number of active hidden nodes of the neural network. In the parameter gene, the connection genes between the second and third layers use binary coding, where '1' means the corresponding second-layer node is connected to the third layer and '0' means it is not. The weight genes between the third and fourth layers use real-valued coding and represent the connection weights between the third and fourth layers.
It follows that the control genes control the number of nodes: if a node is '0', it has no connection to either adjacent layer, and its corresponding parameter genes do not exist. The parameter genes are thus controlled by the control genes; if a node of the upper-layer control gene does not exist, the corresponding lower-layer parameter genes are not activated. This embodies the controlling role of the control genes, and this control corresponds to the topology of the network. The chromosomes formed by this coding constitute the population, which carries out the evolution.
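A possible encoding of such a hierarchical chromosome is sketched below; the field names and value ranges are assumptions, the point being that the control bits gate the layer-3 nodes while the parameter genes mix binary connection bits with real-valued parameters.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Chromosome:
        control: np.ndarray  # (L,)     control genes: layer-3 node on/off
        m_ij: np.ndarray     # (I, J)   membership centres (real-valued genes)
        s_ij: np.ndarray     # (I, J)   membership widths
        conn: np.ndarray     # (I*J, L) layer-2/layer-3 connection genes (binary)
        m_l: np.ndarray      # (L,)     layer-3 Gaussian centres
        s_l: np.ndarray      # (L,)     layer-3 Gaussian widths
        w_lk: np.ndarray     # (L, K)   layer-3/layer-4 weights (real-valued genes)

    def random_chromosome(I, J, L, K, rng):
        return Chromosome(
            control=rng.integers(0, 2, size=L),
            m_ij=rng.uniform(0.0, 1.0, size=(I, J)),
            s_ij=rng.uniform(0.1, 1.0, size=(I, J)),
            conn=rng.integers(0, 2, size=(I * J, L)),
            m_l=rng.uniform(0.0, 1.0, size=L),
            s_l=rng.uniform(0.1, 1.0, size=L),
            w_lk=rng.uniform(-1.0, 1.0, size=(L, K)),
        )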
Further, the population initialization unit 142 initializes the chromosome population. To run the genetic algorithm smoothly, a number of chromosome individuals must be generated in advance, and these individuals should be generated at random so that they represent many possible network structures, i.e. there is a sufficiently large solution space. A suitable population size is important for the convergence of the genetic algorithm: if the population is too small it is difficult to obtain a satisfactory result, and if it is too large the computation becomes complex; the population size is generally taken as 10 to 160.
Further, the fitness function determination unit 143 determines the chromosome fitness function. The fitness of an individual is expressed through the individual's error and structural complexity, so that the complexity of the network is taken into account while the individual error is optimized, and an optimal network structure is obtained. The fitness function of the network has the form:

f(i) = \alpha \frac{1}{E(i)} + \beta \frac{1}{H(i)}, \quad i = 1, 2, \ldots, I

where E(i) and H(i) are respectively the individual error and the structural complexity of the i-th individual:

E(i) = \sum_{j=1}^{K} (\hat{y}_{ij} - y_{ij})^2

H(i) = 1 + \exp[-c\, N_i(0)]

Here ŷ_ij and y_ij are the j-th output and desired output of the i-th individual, where the desired output y_ij is the selection function of the expected action Â_j(s_t): if an action is the expected output, its expected value is set to 1, and the expected action functions of all other actions are set to 0. N_i(0) is the number of hidden nodes of the i-th individual that are zero, and c is a parameter regulation factor. b and c are constants, and α and β are constants greater than zero with α + β = 1. Using such a fitness function guarantees that a suitable neural network structure is obtained while the network weights are optimized.
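A sketch of this fitness function follows; the default alpha, beta and c values are assumptions (they only need to satisfy alpha + beta = 1 with alpha, beta > 0).

    import numpy as np

    def fitness(outputs, targets, n_zero_nodes, alpha=0.7, beta=0.3, c=0.5):
        """f(i) = alpha / E(i) + beta / H(i); E is the squared output error,
        H = 1 + exp(-c * N_i(0)) rewards individuals with more unused layer-3 nodes."""
        err = float(np.sum((np.asarray(outputs) - np.asarray(targets)) ** 2))
        complexity = 1.0 + np.exp(-c * n_zero_nodes)
        return alpha / max(err, 1e-9) + beta / complexity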
Further, the genetic operation unit 144 carries out the genetic operations, which comprise selection, crossover and mutation. After selection, crossover and mutation, the initial population has undergone one round of genetic operation and completed one round of evolution, producing a new generation of offspring; this process is repeated so that evolution continues and the offspring converge to the optimum.
Selection picks, from the previous generation and according to individual fitness, a number of good individuals to be inherited by the next generation, following a certain rule or method. The algorithm uses elitist selection: according to the fitness values, the best individual of the population is always retained into the next generation, which guarantees the asymptotic convergence of the algorithm. For individual i, the selection probability is:

p_s(i) = \frac{f_i}{\sum_{j=1}^{N} f_j}

where f_i is the fitness of individual i and N is the number of individuals in the population.
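Elitist, fitness-proportional selection can be sketched as below; the function name and argument layout are assumptions of the sketch.

    import numpy as np

    def select_parents(population, fitnesses, n_parents, rng):
        """Keep the best individual, draw the rest with probability p_s(i) = f_i / sum_j f_j."""
        f = np.asarray(fitnesses, dtype=float)
        best = int(np.argmax(f))
        rest = rng.choice(len(population), size=n_parents - 1, p=f / f.sum())
        return [population[best]] + [population[i] for i in rest]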
The crossover operation randomly exchanges the positions of gene pairs of two individuals. This process reflects a random exchange of information; its purpose is to produce new gene combinations, i.e. new individuals. When evolution has proceeded to a certain degree, in particular when most individuals in the population have become identical, crossover can hardly produce new individuals, and only mutation can. Mutation changes gene positions with a certain probability so as to open new search space; that is, mutation adds a global-optimization character. Randomness plays an important role in crossover and mutation: only random crossover and mutation operations guarantee the appearance of new individuals, and this randomness is expressed through the crossover and mutation probabilities.
In the genetic operation process, the crossover probability and the mutation probability have a significant influence on the performance of the genetic algorithm (GA). In the early stage of the GA, choosing a large crossover probability and a small mutation probability accelerates convergence and helps the search for the optimal solution. As the search proceeds, however, the crossover probability needs to be reduced and the mutation probability increased, so that the algorithm does not easily fall into a local extremum and can search for new solutions.
At the same time, the mutation probability must not be too large, otherwise the algorithm has difficulty converging and destroys the genes of the optimal solution. Solutions with high fitness should get low crossover and mutation probabilities so that they have a greater chance of entering the next generation, while solutions with low fitness should get high crossover and mutation probabilities so that they are eliminated as early as possible; when premature convergence occurs, the crossover and mutation probabilities should be increased to accelerate the generation of new individuals. Following these principles for choosing the crossover and mutation probabilities, adaptive crossover and mutation probabilities are adopted, computed as:

p_c = \begin{cases} \dfrac{f_{\max} - f_{\mathrm{avg}}}{f}, & (f_{\max} - f_{\mathrm{avg}}) < f \\ 0.8, & (f_{\max} - f_{\mathrm{avg}}) \ge f \end{cases}

p_m = \begin{cases} \dfrac{0.2\,(f_{\max} - f')}{f_{\max} - f_{\mathrm{avg}}}, & (f_{\max} - f') < (f_{\max} - f_{\mathrm{avg}}) \\ 0.2, & (f_{\max} - f') \ge (f_{\max} - f_{\mathrm{avg}}) \end{cases}

where p_c is the crossover probability, p_m is the mutation probability, f_max is the maximum fitness in the population, f_avg is the average fitness, f is the larger fitness of the two individuals being crossed, and f' is the fitness of the individual being mutated.
With this method, the optimal solution can be found quickly when the evolution space is large, and the diversity of the population is increased when convergence approaches a local optimum. It can be seen that the mutation probability of the individual with the maximum fitness is zero, and individuals with larger fitness have small crossover and mutation probabilities, which protects good individuals; individuals with small fitness have large crossover and mutation probabilities and are continually broken up.
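The adaptive probabilities translate directly into a small helper; the epsilon guards against division by zero are assumptions of the sketch.

    def adaptive_probabilities(f_max, f_avg, f_cross, f_mut, eps=1e-9):
        """p_c for a pair whose larger fitness is f_cross, p_m for an individual of fitness f_mut."""
        spread = f_max - f_avg
        p_c = spread / max(f_cross, eps) if spread < f_cross else 0.8
        gap = f_max - f_mut
        p_m = 0.2 * gap / max(spread, eps) if gap < spread else 0.2
        return p_c, p_m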
The crossover operation is carried out between the two chosen individuals according to the crossover probability; it operates on the corresponding parts of the control genes and the parameter genes separately, as shown in Fig. 4. Such a crossover crosses the corresponding genes of the two chromosomes and also guarantees that binary-coded and real-coded genes are crossed with their counterparts. The crossover of corresponding chromosome segments uses single-point crossover: the same position is chosen at random in both individuals, and the genes are exchanged at the chosen position.
The mutation operation covers all genes. For the binary-coded genes in the control genes and parameter genes, bit mutation is used, i.e. a logical inversion that changes '1' to '0' and '0' to '1'. For the real-valued genes, a Gaussian mutation of linear combination is performed:

\hat{m}_{ij} = m_{ij} + \alpha \tfrac{1}{f} N(0, 1)
\hat{\sigma}_{ij} = \sigma_{ij} + \alpha \tfrac{1}{f} N(0, 1)
\hat{m}_l = m_l + \alpha \tfrac{1}{f} N(0, 1)
\hat{\sigma}_l = \sigma_l + \alpha \tfrac{1}{f} N(0, 1)
\hat{\omega}_{lk} = \omega_{lk} + \alpha \tfrac{1}{f} N(0, 1)

where α is the evolution rate, f is the fitness of the individual, and N(0, 1) is a normally distributed random function with expectation 0 and standard deviation 1.
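A mutation sketch over the Chromosome structure assumed earlier: binary genes are bit-flipped with probability p_m, real-valued genes receive the Gaussian step alpha * (1/f) * N(0, 1); applying the Gaussian step to every real gene in one call is a simplification of the sketch.

    import numpy as np

    def mutate(chrom, fitness_value, p_m, alpha=0.1, rng=None):
        rng = rng or np.random.default_rng()
        scale = alpha / max(fitness_value, 1e-9)             # fitter individuals move less
        for name in ("control", "conn"):                     # binary genes: bit flip
            g = getattr(chrom, name)
            flip = rng.random(g.shape) < p_m
            g[flip] = 1 - g[flip]
        for name in ("m_ij", "s_ij", "m_l", "s_l", "w_lk"):  # real-valued genes: Gaussian step
            g = getattr(chrom, name)
            g += scale * rng.standard_normal(g.shape)
        return chrom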
In summary, the steps by which the hierarchical genetic algorithm realizes the neural network optimization are as follows:
1. Encode the network structure and parameters according to the hierarchical structure and generate the chromosome individuals.
2. Randomly generate an initial chromosome population of 2N individuals and set the evolution generation to t = 0.
3. Calculate the fitness value of each individual and the maximum and average fitness values of the population according to the formulas.
4. Select N individuals from the population as parents according to the individual selection probabilities, and set t = t + 1.
5. Randomly select two individuals from the parents and perform the crossover operation according to the crossover probability. If they are crossed, first duplicate the two individuals and keep the originals; perform the crossover on the duplicates to produce two new individuals. Continue until the whole parent population has been crossed.
6. Perform the mutation operation on all individuals according to the mutation probability.
7. When the fitness of the best individual and the fitness of the population reach a given threshold, or the maximum number of generations is reached, the iteration has converged and the algorithm ends; otherwise go to step 3 and continue until the termination condition is satisfied.
After the optimization has finished, the structure and parameters of the best individual are taken as the decision network and used to compute the action decisions.
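Steps 1 to 7 wire together into a loop like the following sketch, which assumes the select_parents helper above and problem-specific evaluate, crossover and mutate callables; it is an illustrative outline, not a literal transcription of the patented procedure.

    import numpy as np

    def evolve(population, evaluate, crossover, mutate_op, n_parents, max_gen, f_target, rng):
        """Evaluate, select elitistically, cross and mutate until the best fitness
        reaches f_target or max_gen generations have passed; return the best individual."""
        population = list(population)
        for _ in range(max_gen):
            fitnesses = np.array([evaluate(ind) for ind in population])
            if fitnesses.max() >= f_target:
                break
            parents = select_parents(population, fitnesses, n_parents, rng)
            children = []
            while len(children) < len(population):
                i, j = rng.choice(len(parents), size=2, replace=False)
                children.extend(crossover(parents[i], parents[j]))  # originals are kept in parents
            population = [mutate_op(c) for c in children[:len(population)]]
        fitnesses = np.array([evaluate(ind) for ind in population])
        return population[int(fitnesses.argmax())]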
In the action decision network unit 14, the hierarchical genetic algorithm is used to optimize the structure and parameters of the network. Whenever a new situation appears, the differential signal ΔTD_t provided by the temporal-difference (TD) algorithm is first used to update the parameters of the action selection network, so as to obtain more favourable optional actions. Specifically, the differential signal ΔTD_t is used to update the connection weights between the third and fourth layers in the parameter genes of every chromosome in the population, after which the genetic operations are carried out. In this way the whole weight space of the action function is updated, and the new weights of the corresponding action obtained through the genetic operations are also larger, which reflects the learning of the optimal action. The update of the connection weights driven by the differential signal involves ω_ij, the connection weight between the i-th hidden node of the third layer and the j-th action selection function of the fourth layer, and a weighting coefficient, a value between 0 and 1 with an empirical value of 0.62.
This embodiment trains the neural network with the hierarchical genetic algorithm and realizes knowledge learning. It overcomes the reactive mode based on specific knowledge or rules that dominates existing behavior decision research, and better solves the knowledge acquisition and reasoning-decision problems of robot behavior decision making: by learning through interaction with the environment the agent approaches completeness of knowledge and possesses a higher level of learning and reasoning ability.
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention. After offline learning, the finally obtained action decision network unit 14 is optimal and is used for real-time online decision making. All the other units, i.e. the utility fitting network unit 11, the differential signal calculation network unit 12, the confidence evaluation network unit 13 and the action correction network unit 15, are removed and no longer used in the online decision process. The action decision network unit 14 calculates the output action selection functions Â_j(s_t) from the state space vector s_t obtained after the selected action a_t is carried out by the action execution unit 16; the action selector outputs the finally selected action, the action is carried out by the action execution unit 16, and the resulting state space vector is input to the action decision network unit 14 again.
In this embodiment, the neural network obtained by training is used to make the robot's behavior decisions in real time. Separating the learning process from the decision process guarantees the efficiency of online decision making and meets the needs of real-time operation.

Claims (7)

1. A robot behavior learning model based on a utility differential network, comprising an action execution unit (16), characterized in that the learning model further comprises: a utility fitting network unit (11), a differential signal calculation network unit (12), a confidence evaluation network unit (13), an action decision network unit (14) and an action correction network unit (15);
the utility fitting network unit (11) is used for calculating the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is carried out by the action execution unit (16), and outputs it to the differential signal calculation network unit (12); the differential signal calculation network unit (12) further calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit (11), the confidence evaluation network unit (13) and the action decision network unit (14); the utility fitting network unit (11) uses the differential signal ΔTD_t to update the weights of the neural network in the utility fitting network unit (11); the confidence evaluation network unit (13) uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit (11), together with the differential signal, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit (15); the action decision network unit (14) performs action-selection learning according to the input differential signal ΔTD_t and the state space vector s_t, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit (15); the action correction network unit (15) corrects the input action selection functions Â_j(s_t) with the input confidence, then calculates the probability with which each action is chosen, and outputs the action with the largest probability to the action execution unit (16) for execution; the state space vector produced after this action is carried out is fed back to the utility fitting network unit (11), the differential signal calculation network unit (12) and the action decision network unit (14);
the learning model has two processes: an offline learning process and an online decision process; all of the above units take part in the offline learning process, while in the online decision process only the action decision network unit (14) finally obtained by offline learning and the action execution unit (16) take part; the action decision network unit (14) in the online decision process calculates the output action selection functions Â_j(s_t) from the state space vector s_t produced after the action execution unit (16) carries out the action at time t; the action selector outputs the finally selected action to the action execution unit (16) for execution, and the state space vector obtained after the action is carried out is input to the action decision network unit (14) again.
2. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the utility fitting network unit (11) is constituted by a neural network comprising an input layer, a hidden layer and an output layer, the weights of the neural network being A, B and C; the input vector x_i(t) of the input layer is obtained by normalizing the state space vector s_t produced after the action at time t is carried out; the hidden-layer activation function is the Sigmoid function; and the output of the neural network is the utility fitting value of the state after the action is carried out:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

where b_i(t) are the components of the weight vector B between the input layer and the output layer, c_j(t) are the components of the weight vector C between the hidden layer and the output layer, n is the number of input-layer units, h is the number of hidden-layer units, and y_j(t) is the output vector of the hidden units:

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where g is the Sigmoid activation function and a_ij(t) are the components of the weight matrix A between the input layer and the hidden layer.
3. The robot behavior learning model based on a utility differential network according to claim 2, characterized in that the weight vectors of the neural network in the utility fitting network unit (11) are updated with the following formulas:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h
a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ is a constant greater than zero, λ_h is a number greater than zero, ΔTD_{t+1} is the differential signal of the state space vector produced after the action at time t + 1 is carried out, and sgn(c_j(t)) is determined by the function sgn:

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}
4. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the differential signal calculation network unit (12) calculates the differential signal ΔTD_t according to the temporal-difference algorithm:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of the state space vector s_t, γ is a discount factor, Û(s_{t+1}) is the utility fitting value of the state space vector s_{t+1} produced after the action at time t + 1 is carried out, and Û(s_t) is the utility fitting value of the state space vector s_t produced after the action at time t is carried out.
5. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the confidence p(t) finally output by the confidence evaluation network unit (13) is:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}, \quad p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the range of the confidence correction factor a is [1, 10], x_i(t) and y_j(t) are respectively the input vector and the hidden-layer output vector of the neural network in the utility fitting network unit (11), and n and h are respectively the numbers of input-layer units and hidden-layer units of that neural network; the weights α_i(t+1) and β_j(t+1) after the action at time t + 1 is carried out are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ_p denotes the learning rate, a value between 0 and 1, and ΔTD_{t+1} is the differential signal of the state space vector produced after the action at time t + 1 is carried out.
6. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the action decision network unit (14) is realized by a neural network comprising an input layer, a fuzzy subset layer, a variable node layer and a function output layer; the input IN_i^1 of the i-th node of the input layer is:

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I

where I is the number of input-layer nodes and x_i(t) is the vector obtained by normalizing the state space vector s_t after the action is carried out;
the fuzzy subset layer fuzzifies the inputs of the input layer, and the j-th output O^2_{x_i j} corresponding to the input x_i(t) is:

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \; j = 1, 2, 3, \ldots, J

where J is the number of fuzzy-subset-layer inputs corresponding to each input x_i(t) of the input layer, and m_ij and σ_ij are respectively the location parameter and the width of the membership function of the input vector;
the activation function of the variable node layer is a Gaussian function with location parameter m_l and width σ_l, and the output O_l^3 of the nodes of the variable node layer is:

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1, j=1}^{I, J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L

where L is the number of nodes of the variable node layer;
the output of the function output layer is the fitted value of the action function, i.e. the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

where K is the number of nodes of the function output layer, ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and I, J, K, L are positive integers;
the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the variable node layer, and the connection weights between the variable node layer and the function output layer are optimized and adjusted with a hierarchical genetic algorithm.
7. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the action correction network unit (15) generates a random function with mean Â_j(s_t) and probability p(t) as the new action selection function A_j(s_t), then calculates the selection probability P(a_j | s_t) and outputs the action with the largest probability value; the selection probability is given by:

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

where a_j is the j-th action, s_t is the state space vector obtained after the action at time t is carried out, A_k(s_t) is the k-th action selection function and A_j(s_t) is the j-th action selection function.
CN 201010564142 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network Expired - Fee Related CN102063640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Publications (2)

Publication Number Publication Date
CN102063640A true CN102063640A (en) 2011-05-18
CN102063640B CN102063640B (en) 2013-01-30

Family

ID=43998910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010564142 Expired - Fee Related CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Country Status (1)

Country Link
CN (1) CN102063640B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026A (en) * 2016-10-25 2018-05-01 深圳光启合众科技有限公司 Robot, mechanical arm and its control method and device
WO2018113260A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Emotional expression method and device, and robot
CN110705682A (en) * 2019-09-30 2020-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN102402712B (en) * 2011-08-31 2014-03-05 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026A (en) * 2016-10-25 2018-05-01 深圳光启合众科技有限公司 Robot, mechanical arm and its control method and device
CN107972026B (en) * 2016-10-25 2021-05-04 河北亿超机械制造股份有限公司 Robot, mechanical arm and control method and device thereof
WO2018113260A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Emotional expression method and device, and robot
CN110705682A (en) * 2019-09-30 2020-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network
CN110705682B (en) * 2019-09-30 2023-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Also Published As

Publication number Publication date
CN102063640B (en) 2013-01-30


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130130

Termination date: 20131129