CN102063640A - Robot behavior learning model based on utility differential network - Google Patents

Robot behavior learning model based on utility differential network

Info

Publication number
CN102063640A
CN102063640A (application CN201010564142A / CN2010105641422A)
Authority
CN
China
Prior art keywords
action
layer
utility
input
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105641422A
Other languages
Chinese (zh)
Other versions
CN102063640B (en)
Inventor
宋晓
麻士东
龚光红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN 201010564142 priority Critical patent/CN102063640B/en
Publication of CN102063640A publication Critical patent/CN102063640A/en
Application granted
Publication of CN102063640B publication Critical patent/CN102063640B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a robot behavior learning model based on a utility differential network, which comprises a utility fitting network unit, a differential signal calculation network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The model realizes an offline learning process and an online decision process. The utility fitting network unit calculates the utility fitting value of the state after an action is executed; the differential signal calculation network unit calculates the differential signal; the confidence evaluation network unit outputs the calculated confidence to the action correction network unit; the action decision network unit outputs the action selection functions; and the action correction network unit corrects the action selection functions with the confidence, calculates the probability with which each action is selected and outputs the action with the largest probability to the action execution unit for execution. The invention better guarantees the completeness of the environment knowledge acquired by the robot and better guarantees the timeliness and validity of the robot's behavior decisions.

Description

Robot behavior learning model based on a utility differential network
Technical field
The present invention relates to a robot behavior learning model based on a utility differential network, and belongs to the field of new applications of artificial intelligence.
Background technology
Intelligent robot behavior generally refers to the process in which a robot perceives its surrounding environment, reasons and makes decisions on that basis, and thereby reaches intelligent behavior decisions. Building an intelligent behavior decision model requires knowledge acquisition, representation and reasoning, as well as the ability to automatically evaluate the quality of the robot's behavior. At present, cognitive behavior models based on reinforcement learning have advantages in knowledge acquisition, adaptability of policy formulation and reusability, which make them the first choice for intelligent behavior modeling.
Reinforcement learning requires exploring the environment. The process can be described as follows: in a given state, the decision maker selects and executes an action, and then perceives the next environment state and the corresponding reward. The decision maker is not directly told when to take which action, but corrects its own behavior according to the reward in order to win more reward. In short, reinforcement learning lets the decision maker obtain an optimal behavior sequence through continual trial.
Current behavior decision making for robot reinforcement learning mostly uses a reactive mode based on specific knowledge or rules. The shortcomings of this mode are, first, that knowledge acquisition is limited; second, that the acquired knowledge is often empirical and new knowledge cannot be learned in time; and third, that the real-time performance of the reasoning process is poor.
Summary of the invention
Aiming at the shortcomings of existing behavior decision making for robot reinforcement learning, the present invention establishes a robot behavior learning model based on a utility differential network. The model is an evaluation-based learning system: through interaction with the environment it automatically generates the control law of the system, which in turn governs the selection of actions. The robot behavior learning model based on a utility differential network solves the problems that knowledge acquisition in general behavior decision models is limited and overly empirical, realizes an offline learning process and an online decision process, and solves the problem of poor real-time performance of the reasoning process.
A robot behavior learning model based on a utility differential network comprises: a utility fitting network unit, a differential signal calculation network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The utility fitting network unit calculates the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is carried out by the action execution unit, and outputs it to the differential signal calculation network unit. The differential signal calculation network unit further calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit, the confidence evaluation network unit and the action decision network unit. The utility fitting network unit uses ΔTD_t to update the weights of its neural network. The confidence evaluation network unit uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit, together with the differential signal, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit. The action decision network unit performs action-selection learning according to the input differential signal ΔTD_t and the state space vector s_t, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit. The action correction network unit corrects the input action selection functions Â_j(s_t) with the input confidence, then calculates the probability with which each action is chosen and outputs the action with the largest probability to the action execution unit for execution; the state space vector produced after this action is carried out is fed back to the utility fitting network unit, the differential signal calculation network unit and the action decision network unit.
The learning model has two processes: an offline learning process and an online decision process. All of the above units take part in the offline learning process; in the online decision process only the action decision network unit finally obtained by offline learning and the action execution unit take part. In the online decision process, the action decision network unit calculates the output action selection functions Â_j(s_t) from the state space vector s_t obtained after the action is carried out at time t; the action selector outputs the finally selected action to the action execution unit for execution, and the state space vector obtained after the action is carried out is input to the action decision network unit again.
The advantages and beneficial effects of the present invention are:
(1) The robot learning model of the present invention does not need to compute the correct action directly; instead, it solves the problem of difficult knowledge acquisition through an action-learning-evaluation loop of interaction with the environment. Because the learning model does not require an explicitly specified environment model, the cause-effect relationships of the environment are implicit in the differential feedback network, which better guarantees the completeness of the environment knowledge acquired by the robot.
(2) The offline learning process of this model completes the learning of environment knowledge before the robot makes decisions, and the online decision process can further complete the robot's acquisition of environment knowledge. During operation, decision making no longer involves exploration and learning activities and only requires calculation and addition with the reconstructed network. This offline/online design guarantees good real-time performance of the robot's behavior decisions, and thus their timeliness and validity.
Description of drawings
Fig. 1 is a schematic structural diagram of the offline learning process in the first embodiment of the learning model of the present invention;
Fig. 2 is a schematic flow diagram of the action decision network in the first embodiment of the learning model of the present invention;
Fig. 3 is a schematic diagram of the genetic-operator coding structure in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 4 is a schematic diagram of the genetic-operator crossover operation in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and embodiments. The first embodiment details the offline learning process of the learning model of the present invention; the second embodiment describes the online decision process.
As shown in Fig. 1, the learning model of the present invention comprises five parts: a utility fitting network unit 11, a differential signal calculation network unit 12, a confidence evaluation network unit 13, an action decision network unit 14 and an action correction network unit 15. All five parts take part in the offline learning process of the learning model.
The utility fitting network unit 11 calculates the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t selected at time t is carried out by the action execution unit 16, and outputs this utility fitting value to the differential signal calculation network unit 12. The differential signal calculation network unit 12 outputs the differential signal ΔTD_t to the confidence evaluation network unit 13 and the utility fitting network unit 11. The utility fitting network unit 11 is continually updated with the differential signal ΔTD_t supplied by the differential signal calculation network unit 12, so that a true utility fit is achieved.
The differential signal calculation network unit 12 calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit 11, the confidence evaluation network unit 13 and the action decision network unit 14.
The confidence evaluation network unit 13 uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit 11, together with the differential signal ΔTD_t, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit 15, where it is used to adjust the action selection.
The action decision network unit 14 optimizes its neural network with a hierarchical genetic algorithm according to the input differential signal ΔTD_t and the state space vector s_t, realizes action-selection learning, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit 15.
The action correction network unit 15 corrects the input action selection functions Â_j(s_t) with the input confidence and outputs the action with the largest probability. The state space vector produced after the action is carried out is fed back to the utility fitting network unit 11, the differential signal calculation network unit 12 and the action decision network unit 14.
The utility fitting network unit 11 evaluates the utility of the state change caused by a specific behavior and obtains the utility fitting value. It consists of a two-layer feedback neural network, as shown in Fig. 1. The input of the neural network is the state space vector s_t, the hidden-layer activation function is the Sigmoid function, and the output of the neural network is the utility fitting value of the state after the action has been carried out. The weight coefficients of the neural network are A, B and C. The network has n input units and h hidden units; each hidden unit receives n inputs and has n connection weights, and the output unit receives n + h inputs and has n + h weights. The value of h can be set by the user and is typically 3; it is set to 2 in the embodiment of the invention.
The input vector of the neural network is x_i(t), i = 1, 2, 3, ..., n, where x_i(t) is obtained by normalizing s_t. The output vector of the hidden units is then:

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h
In the above formula, a_ij(t) are the components of the weight matrix A between the input layer and the hidden layer. The output of the utility fitting network 11 is the utility fitting value Û(s_t), which is a linear combination of the input layer and the hidden layer:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)
where b_i(t) are the components of the weight vector B between the input layer and the output layer, and c_j(t) are the components of the weight vector C between the hidden layer and the output layer.
The weights A, B and C of the network are updated with the differential signal ΔTD_t. If ΔTD_t is positive, the last action produced a positive effect, so the chance of that action being selected should be reinforced. The weights B between the input layer and the output layer and the weights C between the hidden layer and the output layer are updated as follows:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ is a constant greater than zero that can be set by the user. The weights A between the input layer and the hidden layer are updated according to:

a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ_h is a number greater than zero that can be set by the user, ΔTD_{t+1} is the differential signal of the state space vector produced after the action is carried out at time t + 1, and sgn is the function

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}

with z here being the component c_j(t) of the weight vector C.
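For illustration only, the following is a minimal Python/NumPy sketch of the utility fitting network under the equations above; the class name, weight initialisation and the learning-rate values lam and lam_h are assumptions of the sketch, not values fixed by the invention.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class UtilityFitNetwork:
        """Two-layer utility fitting network: y_j = g(sum_i a_ij x_i),
        U_hat(s_t) = sum_i b_i x_i + sum_j c_j y_j."""
        def __init__(self, n, h=2, lam=0.1, lam_h=0.05, seed=0):
            rng = np.random.default_rng(seed)
            self.A = rng.normal(scale=0.1, size=(n, h))  # input-to-hidden weights a_ij
            self.B = rng.normal(scale=0.1, size=n)       # input-to-output weights b_i
            self.C = rng.normal(scale=0.1, size=h)       # hidden-to-output weights c_j
            self.lam, self.lam_h = lam, lam_h

        def forward(self, x):
            y = sigmoid(x @ self.A)          # hidden outputs y_j(t)
            u_hat = self.B @ x + self.C @ y  # utility fitting value U_hat(s_t)
            return u_hat, y

        def update(self, x, y, td):
            # reinforce the weights in the direction of the differential signal
            self.B += self.lam * td * x
            self.C += self.lam * td * y
            self.A += self.lam_h * td * np.outer(x, y * np.sign(self.C))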
As shown in Fig. 1, the differential signal calculation network unit 12 calculates the differential signal ΔTD_t from the fitted utility Û(s_t) output by the utility fitting network unit 11 and the immediate reward function R(s_t) of the state. According to the temporal-difference algorithm, ΔTD_t is obtained by the iterative computation:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of state s_t, i.e. the output of the immediate reward function, and γ is a discount factor that can be set by the user. Û(s_{t+1}) is the utility fitting value of the state space vector s_{t+1} produced after the action is carried out at time t + 1, and Û(s_t) is the utility fitting value of the state space vector s_t produced after the action is carried out at time t.
The calculated differential signal ΔTD_t is used to train and update the weight coefficients of the utility fitting network unit 11 and the confidence evaluation network unit 13. If the differential signal ΔTD_t indicates a positive effect, the action should be reinforced and its confidence should also be increased, i.e. it should be believed more strongly that this action should be selected. In addition, ΔTD_t is used to update the weights of the action selection functions in the action decision network unit 14, so as to guarantee that the optimal action is selected.
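The corresponding temporal-difference computation is a one-line sketch; the discount-factor default is an assumption.

    def td_error(reward, u_next, u_now, gamma=0.9):
        """Differential signal: Delta_TD_t = R(s_t) + gamma * U_hat(s_{t+1}) - U_hat(s_t)."""
        return reward + gamma * u_next - u_now

A typical update step would compute u_now from s_t and u_next from s_{t+1} with the utility fitting network sketched above, form the differential signal with td_error, and pass it to that network's update method.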
As shown in Fig. 1, when the action decision network unit 14 outputs the action decision functions, the confidence evaluation network unit 13 calculates the confidence of the output action, and this confidence is used to adjust the action selection. The inputs of the confidence evaluation network unit 13 are the vectors x_i(t) and y_j(t), taken from the input layer and the hidden layer of the utility fitting network unit 11.
The confidence p_0(t) is calculated by:

p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the weights α_i(t) and β_j(t) are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

Here λ_p denotes the learning rate, a value between 0 and 1 with an empirical value of 0.618, which the user may set according to experience. The above formula does not guarantee that p_0(t) lies in the interval [0, 1], so a Sigmoid function is introduced to transform p_0(t) into p(t); in this way the output confidence matches a probability for the random function:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}

The confidence correction factor a smooths the learning process. Changing a changes the range over which learning adjusts to the environment; if a is too large, the learning system loses its regulating effect, so a suitable value of a should be set according to prior knowledge, with a > 0. In the present invention, the range of a is [1, 10].
The confidence reflects the regulating effect of decision uncertainty on action selection. As the utility of the state gradually approaches its actual value, i.e. as ΔTD_t increases, the confidence p(t) also increases gradually, and the choice of action becomes more and more definite. The output confidence p(t) is then used to correct each action selection function Â_j(s_t) output by the action decision network unit 14; the correction is carried out in the action correction network unit 15.
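A possible sketch of the confidence evaluation unit, reusing the input vector x and hidden vector y of the utility fitting network; the default values of a and lam_p follow the ranges given above and are otherwise assumptions of the sketch.

    import numpy as np

    class ConfidenceEvaluator:
        """Confidence p(t) = sigmoid(a * p0(t)), with p0(t) = alpha . x + beta . y."""
        def __init__(self, n, h, a=2.0, lam_p=0.618):
            self.alpha = np.zeros(n)
            self.beta = np.zeros(h)
            self.a, self.lam_p = a, lam_p  # a in [1, 10], lam_p in (0, 1)

        def confidence(self, x, y):
            p0 = self.alpha @ x + self.beta @ y
            return 1.0 / (1.0 + np.exp(-self.a * p0))

        def update(self, x, y, td):
            self.alpha += self.lam_p * td * x
            self.beta += self.lam_p * td * y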
The action decision network unit 14 is realized by a neural network with four layers, as shown in Fig. 1. The first to fourth layers are, respectively, the input layer, the fuzzy subset layer, the variable node layer (also called the function fitting layer) and the function output layer; the layers are denoted h = 1, 2, 3, 4. Let IN_i^h and O_i^h be the input and output of the i-th node of layer h, where i indexes the nodes of each layer: the first layer has I nodes, the second layer has I*J nodes, the third layer has L nodes and the fourth layer has K nodes, with I, J, K, L positive integers. The mean m_ij and the variance σ_ij are, respectively, the location parameter and the width of the Gaussian membership function of the j-th second-layer node fed by the input x_i(t).
The input of the input layer of the neural network of the action decision network unit 14 is the vector x_i(t) obtained by normalizing the state space vector s_t; it characterizes the situation information of the robot at the input time. The input of the i-th node of the input layer is:

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I
The fuzzy subset layer fuzzifies the input variables of the input layer and outputs the membership degree of each input vector. Each input x_i(t) of the input layer corresponds to J inputs in the fuzzy subset layer (in Fig. 1, J is 2); each such input is one fuzzy subset of x_i(t), and the output is the membership degree of x_i(t) in that fuzzy subset. The activation function of each node is a Gaussian membership function, and the output is:

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \; j = 1, 2, 3, \ldots, J

where O^2_{x_i j} is the j-th output corresponding to the input x_i(t), and exp is the exponential function with base e.
To fit the action functions, the neural network needs a certain degree of adjustment of its outputs, and the variable node layer realizes this regulating function. It does so by varying the number of nodes and the connections; the node number and the connections are optimized with the hierarchical genetic algorithm, which dynamically determines their number and size so that the network fits the action functions (details are given below). The activation function of the variable node layer is a Gaussian function with location parameter m_l and width σ_l. The number of connections between the second and third layers is likewise not fixed and must be dynamically adjusted during optimization; the connection weights are all 1. The output of the l-th node of the third layer is:

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1, j=1}^{I, J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L
The number of nodes of the function output layer equals the number of optional actions. The output of the function output layer is the fitted value of the action function and is used to calculate the selection probability of each action. The output of the k-th node of the fourth layer is:

O_k^4 = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

The fourth-layer output O_k^4 is exactly the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

Each node of the third layer is connected to the fourth layer; ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and these connection weights also need to be dynamically adjusted during optimization.
Suppose the first layer of the network has I inputs and the i-th input has k_i fuzzy partitions in the second layer; then the second layer has k_1 + k_2 + ... + k_I nodes in total, and each node computes the membership of its input in the corresponding fuzzy subset. In summary, the network structure that needs to be dynamically adjusted and optimized is: the number of third-layer nodes and the connections between the second and third layers. The network parameters that need to be adjusted and optimized are: the locations m_ij and widths σ_ij of the membership functions of the second-layer inputs, the location parameters m_l and widths σ_l of the Gaussian activation functions of the third (hidden) layer, and the connection weights ω_lk between the third and fourth layers.
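For clarity, a forward pass of the four-layer action decision network can be sketched as follows; the array shapes and the 0/1 connection matrix conn between the second and third layers are assumptions used to express the formulas above, not a literal part of the design.

    import numpy as np

    def action_decision_forward(x, m_ij, s_ij, conn, m_l, s_l, w_lk):
        """x: (I,) normalized state; m_ij, s_ij: (I, J) membership centres/widths;
        conn: (I*J, L) 0/1 layer-2-to-layer-3 connections; m_l, s_l: (L,) layer-3
        Gaussian centres/widths; w_lk: (L, K) layer-3-to-layer-4 weights.
        Returns the K action selection values A_hat_k(s_t)."""
        o2 = np.exp(-((x[:, None] - m_ij) / s_ij) ** 2)  # layer 2: membership degrees
        z = conn.T @ o2.reshape(-1)                      # summed memberships per layer-3 node
        o3 = np.exp(-((z - m_l) / s_l) ** 2)             # layer 3: variable nodes
        return o3 @ w_lk                                 # layer 4: A_hat_k = sum_l w_lk * o3_l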
Here a hybrid hierarchical genetic algorithm is used to optimize the structure and parameters of the neural network in the action decision network. The structure optimization of the network determines the number of third-layer nodes and the connections between the second and third layers. The parameter optimization of the network covers the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the third-layer hidden nodes, and the connection weights ω_lk between the third and fourth layers. Optimizing and adjusting the neural network with the hierarchical genetic algorithm lets the network, in each round of decision making, continually optimize the action selection functions according to changes in the input differential signal, so as to realize the selection of actions.
The action correction network unit 15 uses the evaluation value output by the confidence evaluation network unit 13, i.e. the confidence p(t) of the action, to correct the action selection functions Â_j(s_t) output by the action decision network unit 14, then calculates the probability with which each action is chosen and outputs the action with the largest probability.
The correction process generates a random function with mean Â_j(s_t) and probability p(t) as the new action selection function A_j(s_t). The smaller p(t) is, the farther A_j(s_t) lies from Â_j(s_t); conversely, the larger p(t) is, the closer it lies to Â_j(s_t). The new A_j(s_t) replaces Â_j(s_t). The larger the value of the action selection function A_j(s_t), the larger the probability that the corresponding action a_j is selected. The selection probability is computed as:

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

The action with the largest probability value is then output.
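The correction and selection step can be sketched as follows. The description only requires that a lower confidence p(t) pushes A_j(s_t) farther from Â_j(s_t); modelling this with Gaussian noise of width proportional to 1 - p(t) is an assumption of the sketch.

    import numpy as np

    def correct_and_select(a_hat, p, noise_scale=1.0, seed=None):
        """Blur the action selection values with confidence p, then pick the action
        with the largest softmax probability P(a_j | s_t)."""
        rng = np.random.default_rng(seed)
        a = rng.normal(loc=a_hat, scale=noise_scale * (1.0 - p))  # corrected A_j(s_t)
        e = np.exp(a - a.max())                                   # numerically stable softmax
        probs = e / e.sum()
        return int(np.argmax(probs)), probs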
In the robot behavior learning model, the action decision network unit 14 further comprises four subunits: a coding unit 141, a population initialization unit 142, a fitness function determination unit 143 and a genetic operation unit 144, as shown in Fig. 2.
The coding unit 141 determines the chromosome structure of the genetic algorithm. The hierarchical genetic algorithm is proposed on the basis of the hierarchical structure of biological chromosomes: the genes of a chromosome in an organism can be divided into control genes and structural genes, and the role of the control genes is to decide whether the structural genes are activated. This characteristic of biological chromosome genes is borrowed here to encode the above optimization problem. Each individual in the population is composed of two parts, the structure and the parameters of the decision network. The gene structure of a population individual uses a two-level hierarchical coding, i.e. it is realized in two layers following the gene hierarchy of biological chromosomes. The upper-layer genes encode the number of third-layer nodes and the membership functions of the second-layer inputs, that is, the number of third-layer nodes and the parameters m_ij and σ_ij of the second-layer input membership functions. As shown in Fig. 3, the part that controls the number of third-layer (hidden) nodes is called the control gene. The lower layer is the parameter gene, which encodes the membership functions of the third-layer (hidden) nodes and the network connections, including the parameters m_l and σ_l of the third-layer node membership functions, the connections between the second and third layers, and the connection weights ω_lk between the third and fourth layers.
The control genes and the genes of the parameter gene that express the connections of the hidden nodes are binary coded, with '0' and '1' representing 'absent' and 'present' respectively. The genes expressing membership function parameters and connection weights are real-valued coded, i.e. represented by real numbers. The third-layer structure is encoded as a binary string, one bit per third-layer node, serving as the control gene: '1' means the node is active and '0' means the node is inactive, so the number of '1's in the control gene string is the actual number of active hidden nodes of the neural network. In the parameter gene, the connection genes between the second and third layers use binary coding, where '1' means the corresponding second-layer node is connected to the third layer and '0' means it is not. The weight genes between the third and fourth layers use real-valued coding and represent the connection weights between the third and fourth layers.
It follows that the control genes control the number of nodes: if a node is '0', it has no connection to either adjacent layer, and its corresponding parameter genes do not exist. The parameter genes are thus controlled by the control genes; if a node of the upper-layer control gene does not exist, the corresponding lower-layer parameter genes are not activated. This embodies the controlling role of the control genes, and this control corresponds to the topology of the network. The chromosomes formed by this coding constitute the population, which carries out the evolution.
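A possible encoding of such a hierarchical chromosome is sketched below; the field names and value ranges are assumptions, the point being that the control bits gate the layer-3 nodes while the parameter genes mix binary connection bits with real-valued parameters.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Chromosome:
        control: np.ndarray  # (L,)     control genes: layer-3 node on/off
        m_ij: np.ndarray     # (I, J)   membership centres (real-valued genes)
        s_ij: np.ndarray     # (I, J)   membership widths
        conn: np.ndarray     # (I*J, L) layer-2/layer-3 connection genes (binary)
        m_l: np.ndarray      # (L,)     layer-3 Gaussian centres
        s_l: np.ndarray      # (L,)     layer-3 Gaussian widths
        w_lk: np.ndarray     # (L, K)   layer-3/layer-4 weights (real-valued genes)

    def random_chromosome(I, J, L, K, rng):
        return Chromosome(
            control=rng.integers(0, 2, size=L),
            m_ij=rng.uniform(0.0, 1.0, size=(I, J)),
            s_ij=rng.uniform(0.1, 1.0, size=(I, J)),
            conn=rng.integers(0, 2, size=(I * J, L)),
            m_l=rng.uniform(0.0, 1.0, size=L),
            s_l=rng.uniform(0.1, 1.0, size=L),
            w_lk=rng.uniform(-1.0, 1.0, size=(L, K)),
        )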
Further, the population initialization unit 142 initializes the chromosome population. To run the genetic algorithm smoothly, a number of chromosome individuals must be generated in advance, and these individuals should be generated at random so that they represent many possible network structures, i.e. there is a sufficiently large solution space. A suitable population size is important for the convergence of the genetic algorithm: if the population is too small it is difficult to obtain a satisfactory result, and if it is too large the computation becomes complex; the population size is generally taken as 10 to 160.
Further, the fitness function determination unit 143 determines the chromosome fitness function. The fitness of an individual is expressed through the individual's error and structural complexity, so that the complexity of the network is taken into account while the individual error is optimized, and an optimal network structure is obtained. The fitness function of the network has the form:

f(i) = \alpha \frac{1}{E(i)} + \beta \frac{1}{H(i)}, \quad i = 1, 2, \ldots, I

where E(i) and H(i) are respectively the individual error and the structural complexity of the i-th individual:

E(i) = \sum_{j=1}^{K} (\hat{y}_{ij} - y_{ij})^2

H(i) = 1 + \exp[-c\, N_i(0)]

Here ŷ_ij and y_ij are the j-th output and desired output of the i-th individual, where the desired output y_ij is the selection function of the expected action Â_j(s_t): if an action is the expected output, its expected value is set to 1, and the expected action functions of all other actions are set to 0. N_i(0) is the number of hidden nodes of the i-th individual that are zero, and c is a parameter regulation factor. b and c are constants, and α and β are constants greater than zero with α + β = 1. Using such a fitness function guarantees that a suitable neural network structure is obtained while the network weights are optimized.
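A sketch of this fitness function follows; the default alpha, beta and c values are assumptions (they only need to satisfy alpha + beta = 1 with alpha, beta > 0).

    import numpy as np

    def fitness(outputs, targets, n_zero_nodes, alpha=0.7, beta=0.3, c=0.5):
        """f(i) = alpha / E(i) + beta / H(i); E is the squared output error,
        H = 1 + exp(-c * N_i(0)) rewards individuals with more unused layer-3 nodes."""
        err = float(np.sum((np.asarray(outputs) - np.asarray(targets)) ** 2))
        complexity = 1.0 + np.exp(-c * n_zero_nodes)
        return alpha / max(err, 1e-9) + beta / complexity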
Further, the genetic operation unit 144 carries out the genetic operations, which comprise selection, crossover and mutation. After selection, crossover and mutation, the initial population has undergone one round of genetic operation and completed one round of evolution, producing a new generation of offspring; this process is repeated so that evolution continues and the offspring converge to the optimum.
Selection picks, from the previous generation and according to individual fitness, a number of good individuals to be inherited by the next generation, following a certain rule or method. The algorithm uses elitist selection: according to the fitness values, the best individual of the population is always retained into the next generation, which guarantees the asymptotic convergence of the algorithm. For individual i, the selection probability is:

p_s(i) = \frac{f_i}{\sum_{j=1}^{N} f_j}

where f_i is the fitness of individual i and N is the number of individuals in the population.
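Elitist, fitness-proportional selection can be sketched as below; the function name and argument layout are assumptions of the sketch.

    import numpy as np

    def select_parents(population, fitnesses, n_parents, rng):
        """Keep the best individual, draw the rest with probability p_s(i) = f_i / sum_j f_j."""
        f = np.asarray(fitnesses, dtype=float)
        best = int(np.argmax(f))
        rest = rng.choice(len(population), size=n_parents - 1, p=f / f.sum())
        return [population[best]] + [population[i] for i in rest]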
The crossover operation randomly exchanges the positions of gene pairs of two individuals. This process reflects a random exchange of information; its purpose is to produce new gene combinations, i.e. new individuals. When evolution has proceeded to a certain degree, in particular when most individuals in the population have become identical, crossover can hardly produce new individuals, and only mutation can. Mutation changes gene positions with a certain probability so as to open new search space; that is, mutation adds a global-optimization character. Randomness plays an important role in crossover and mutation: only random crossover and mutation operations guarantee the appearance of new individuals, and this randomness is expressed through the crossover and mutation probabilities.
In the genetic operation process, the crossover probability and the mutation probability have a significant influence on the performance of the genetic algorithm (GA). In the early stage of the GA, choosing a large crossover probability and a small mutation probability accelerates convergence and helps the search for the optimal solution. As the search proceeds, however, the crossover probability needs to be reduced and the mutation probability increased, so that the algorithm does not easily fall into a local extremum and can search for new solutions.
At the same time, the mutation probability must not be too large, otherwise the algorithm has difficulty converging and destroys the genes of the optimal solution. Solutions with high fitness should get low crossover and mutation probabilities so that they have a greater chance of entering the next generation, while solutions with low fitness should get high crossover and mutation probabilities so that they are eliminated as early as possible; when premature convergence occurs, the crossover and mutation probabilities should be increased to accelerate the generation of new individuals. Following these principles for choosing the crossover and mutation probabilities, adaptive crossover and mutation probabilities are adopted, computed as:

p_c = \begin{cases} \dfrac{f_{\max} - f_{\mathrm{avg}}}{f}, & (f_{\max} - f_{\mathrm{avg}}) < f \\ 0.8, & (f_{\max} - f_{\mathrm{avg}}) \ge f \end{cases}

p_m = \begin{cases} \dfrac{0.2\,(f_{\max} - f')}{f_{\max} - f_{\mathrm{avg}}}, & (f_{\max} - f') < (f_{\max} - f_{\mathrm{avg}}) \\ 0.2, & (f_{\max} - f') \ge (f_{\max} - f_{\mathrm{avg}}) \end{cases}

where p_c is the crossover probability, p_m is the mutation probability, f_max is the maximum fitness in the population, f_avg is the average fitness, f is the larger fitness of the two individuals being crossed, and f' is the fitness of the individual being mutated.
With this method, the optimal solution can be found quickly when the evolution space is large, and the diversity of the population is increased when convergence approaches a local optimum. It can be seen that the mutation probability of the individual with the maximum fitness is zero, and individuals with larger fitness have small crossover and mutation probabilities, which protects good individuals; individuals with small fitness have large crossover and mutation probabilities and are continually broken up.
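The adaptive probabilities translate directly into a small helper; the epsilon guards against division by zero are assumptions of the sketch.

    def adaptive_probabilities(f_max, f_avg, f_cross, f_mut, eps=1e-9):
        """p_c for a pair whose larger fitness is f_cross, p_m for an individual of fitness f_mut."""
        spread = f_max - f_avg
        p_c = spread / max(f_cross, eps) if spread < f_cross else 0.8
        gap = f_max - f_mut
        p_m = 0.2 * gap / max(spread, eps) if gap < spread else 0.2
        return p_c, p_m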
The crossover operation is carried out between the two chosen individuals according to the crossover probability; it operates on the corresponding parts of the control genes and the parameter genes separately, as shown in Fig. 4. Such a crossover crosses the corresponding genes of the two chromosomes and also guarantees that binary-coded and real-coded genes are crossed with their counterparts. The crossover of corresponding chromosome segments uses single-point crossover: the same position is chosen at random in both individuals, and the genes are exchanged at the chosen position.
The mutation operation covers all genes. For the binary-coded genes in the control genes and parameter genes, bit mutation is used, i.e. a logical inversion that changes '1' to '0' and '0' to '1'. For the real-valued genes, a Gaussian mutation of linear combination is performed:

\hat{m}_{ij} = m_{ij} + \alpha \tfrac{1}{f} N(0, 1)
\hat{\sigma}_{ij} = \sigma_{ij} + \alpha \tfrac{1}{f} N(0, 1)
\hat{m}_l = m_l + \alpha \tfrac{1}{f} N(0, 1)
\hat{\sigma}_l = \sigma_l + \alpha \tfrac{1}{f} N(0, 1)
\hat{\omega}_{lk} = \omega_{lk} + \alpha \tfrac{1}{f} N(0, 1)

where α is the evolution rate, f is the fitness of the individual, and N(0, 1) is a normally distributed random function with expectation 0 and standard deviation 1.
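A mutation sketch over the Chromosome structure assumed earlier: binary genes are bit-flipped with probability p_m, real-valued genes receive the Gaussian step alpha * (1/f) * N(0, 1); applying the Gaussian step to every real gene in one call is a simplification of the sketch.

    import numpy as np

    def mutate(chrom, fitness_value, p_m, alpha=0.1, rng=None):
        rng = rng or np.random.default_rng()
        scale = alpha / max(fitness_value, 1e-9)             # fitter individuals move less
        for name in ("control", "conn"):                     # binary genes: bit flip
            g = getattr(chrom, name)
            flip = rng.random(g.shape) < p_m
            g[flip] = 1 - g[flip]
        for name in ("m_ij", "s_ij", "m_l", "s_l", "w_lk"):  # real-valued genes: Gaussian step
            g = getattr(chrom, name)
            g += scale * rng.standard_normal(g.shape)
        return chrom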
In summary, the steps by which the hierarchical genetic algorithm realizes the neural network optimization are as follows:
1. Encode the network structure and parameters according to the hierarchical structure and generate the chromosome individuals.
2. Randomly generate an initial chromosome population of 2N individuals and set the evolution generation to t = 0.
3. Calculate the fitness value of each individual and the maximum and average fitness values of the population according to the formulas.
4. Select N individuals from the population as parents according to the individual selection probabilities, and set t = t + 1.
5. Randomly select two individuals from the parents and perform the crossover operation according to the crossover probability. If they are crossed, first duplicate the two individuals and keep the originals; perform the crossover on the duplicates to produce two new individuals. Continue until the whole parent population has been crossed.
6. Perform the mutation operation on all individuals according to the mutation probability.
7. When the fitness of the best individual and the fitness of the population reach a given threshold, or the maximum number of generations is reached, the iteration has converged and the algorithm ends; otherwise go to step 3 and continue until the termination condition is satisfied.
After the optimization has finished, the structure and parameters of the best individual are taken as the decision network and used to compute the action decisions.
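Steps 1 to 7 wire together into a loop like the following sketch, which assumes the select_parents helper above and problem-specific evaluate, crossover and mutate callables; it is an illustrative outline, not a literal transcription of the patented procedure.

    import numpy as np

    def evolve(population, evaluate, crossover, mutate_op, n_parents, max_gen, f_target, rng):
        """Evaluate, select elitistically, cross and mutate until the best fitness
        reaches f_target or max_gen generations have passed; return the best individual."""
        population = list(population)
        for _ in range(max_gen):
            fitnesses = np.array([evaluate(ind) for ind in population])
            if fitnesses.max() >= f_target:
                break
            parents = select_parents(population, fitnesses, n_parents, rng)
            children = []
            while len(children) < len(population):
                i, j = rng.choice(len(parents), size=2, replace=False)
                children.extend(crossover(parents[i], parents[j]))  # originals are kept in parents
            population = [mutate_op(c) for c in children[:len(population)]]
        fitnesses = np.array([evaluate(ind) for ind in population])
        return population[int(fitnesses.argmax())]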
In the action decision network unit 14, the hierarchical genetic algorithm is used to optimize the structure and parameters of the network. Whenever a new situation appears, the differential signal ΔTD_t provided by the temporal-difference (TD) algorithm is first used to update the parameters of the action selection network, so as to obtain more favourable optional actions. Specifically, the differential signal ΔTD_t is used to update the connection weights between the third and fourth layers in the parameter genes of every chromosome in the population, after which the genetic operations are carried out. In this way the whole weight space of the action function is updated, and the new weights of the corresponding action obtained through the genetic operations are also larger, which reflects the learning of the optimal action. The update of the connection weights driven by the differential signal involves ω_ij, the connection weight between the i-th hidden node of the third layer and the j-th action selection function of the fourth layer, and a weighting coefficient, a value between 0 and 1 with an empirical value of 0.62.
This embodiment trains the neural network with the hierarchical genetic algorithm and realizes knowledge learning. It overcomes the reactive mode based on specific knowledge or rules that dominates existing behavior decision research, and better solves the knowledge acquisition and reasoning-decision problems of robot behavior decision making: by learning through interaction with the environment the agent approaches completeness of knowledge and possesses a higher level of learning and reasoning ability.
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention. After offline learning, the finally obtained action decision network unit 14 is optimal and is used for real-time online decision making. All the other units, i.e. the utility fitting network unit 11, the differential signal calculation network unit 12, the confidence evaluation network unit 13 and the action correction network unit 15, are removed and no longer used in the online decision process. The action decision network unit 14 calculates the output action selection functions Â_j(s_t) from the state space vector s_t obtained after the selected action a_t is carried out by the action execution unit 16; the action selector outputs the finally selected action, the action is carried out by the action execution unit 16, and the resulting state space vector is input to the action decision network unit 14 again.
In this embodiment, the neural network obtained by training is used to make the robot's behavior decisions in real time. Separating the learning process from the decision process guarantees the efficiency of online decision making and meets the needs of real-time operation.

Claims (7)

1. A robot behavior learning model based on a utility differential network, comprising an action execution unit (16), characterized in that the learning model further comprises: a utility fitting network unit (11), a differential signal calculation network unit (12), a confidence evaluation network unit (13), an action decision network unit (14) and an action correction network unit (15);
the utility fitting network unit (11) is used for calculating the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is carried out by the action execution unit (16), and outputs it to the differential signal calculation network unit (12); the differential signal calculation network unit (12) further calculates the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function computed from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit (11), the confidence evaluation network unit (13) and the action decision network unit (14); the utility fitting network unit (11) uses the differential signal ΔTD_t to update the weights of the neural network in the utility fitting network unit (11); the confidence evaluation network unit (13) uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit (11), together with the differential signal, to calculate the confidence of the action decision result, and outputs this confidence to the action correction network unit (15); the action decision network unit (14) performs action-selection learning according to the input differential signal ΔTD_t and the state space vector s_t, and outputs the action selection functions Â_j(s_t), where j, k are integers greater than 0, to the action correction network unit (15); the action correction network unit (15) corrects the input action selection functions Â_j(s_t) with the input confidence, then calculates the probability with which each action is chosen, and outputs the action with the largest probability to the action execution unit (16) for execution; the state space vector produced after this action is carried out is fed back to the utility fitting network unit (11), the differential signal calculation network unit (12) and the action decision network unit (14);
the learning model has two processes: an offline learning process and an online decision process; all of the above units take part in the offline learning process, while in the online decision process only the action decision network unit (14) finally obtained by offline learning and the action execution unit (16) take part; the action decision network unit (14) in the online decision process calculates the output action selection functions Â_j(s_t) from the state space vector s_t produced after the action execution unit (16) carries out the action at time t; the action selector outputs the finally selected action to the action execution unit (16) for execution, and the state space vector obtained after the action is carried out is input to the action decision network unit (14) again.
2. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the utility fitting network unit (11) is constituted by a neural network comprising an input layer, a hidden layer and an output layer, the weights of the neural network being A, B and C; the input vector x_i(t) of the input layer is obtained by normalizing the state space vector s_t produced after the action at time t is carried out; the hidden-layer activation function is the Sigmoid function; and the output of the neural network is the utility fitting value of the state after the action is carried out:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

where b_i(t) are the components of the weight vector B between the input layer and the output layer, c_j(t) are the components of the weight vector C between the hidden layer and the output layer, n is the number of input-layer units, h is the number of hidden-layer units, and y_j(t) is the output vector of the hidden units:

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where g is the Sigmoid activation function and a_ij(t) are the components of the weight matrix A between the input layer and the hidden layer.
3. The robot behavior learning model based on a utility differential network according to claim 2, characterized in that the weight vectors of the neural network in the utility fitting network unit (11) are updated with the following formulas:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h
a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ is a constant greater than zero, λ_h is a number greater than zero, ΔTD_{t+1} is the differential signal of the state space vector produced after the action at time t + 1 is carried out, and sgn(c_j(t)) is determined by the function sgn:

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}
4. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the differential signal calculation network unit (12) calculates the differential signal ΔTD_t according to the temporal-difference algorithm:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of the state space vector s_t, γ is a discount factor, Û(s_{t+1}) is the utility fitting value of the state space vector s_{t+1} produced after the action at time t + 1 is carried out, and Û(s_t) is the utility fitting value of the state space vector s_t produced after the action at time t is carried out.
5. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the confidence p(t) finally output by the confidence evaluation network unit (13) is:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}, \quad p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the range of the confidence correction factor a is [1, 10], x_i(t) and y_j(t) are respectively the input vector and the hidden-layer output vector of the neural network in the utility fitting network unit (11), and n and h are respectively the numbers of input-layer units and hidden-layer units of that neural network; the weights α_i(t+1) and β_j(t+1) after the action at time t + 1 is carried out are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ_p denotes the learning rate, a value between 0 and 1, and ΔTD_{t+1} is the differential signal of the state space vector produced after the action at time t + 1 is carried out.
6. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the action decision network unit (14) is realized by a neural network comprising an input layer, a fuzzy subset layer, a variable node layer and a function output layer; the input IN_i^1 of the i-th node of the input layer is:

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I

where I is the number of input-layer nodes and x_i(t) is the vector obtained by normalizing the state space vector s_t after the action is carried out;
the fuzzy subset layer fuzzifies the inputs of the input layer, and the j-th output O^2_{x_i j} corresponding to the input x_i(t) is:

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \; j = 1, 2, 3, \ldots, J

where J is the number of fuzzy-subset-layer inputs corresponding to each input x_i(t) of the input layer, and m_ij and σ_ij are respectively the location parameter and the width of the membership function of the input vector;
the activation function of the variable node layer is a Gaussian function with location parameter m_l and width σ_l, and the output O_l^3 of the nodes of the variable node layer is:

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1, j=1}^{I, J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L

where L is the number of nodes of the variable node layer;
the output of the function output layer is the fitted value of the action function, i.e. the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

where K is the number of nodes of the function output layer, ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and I, J, K, L are positive integers;
the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the variable node layer, and the connection weights between the variable node layer and the function output layer are optimized and adjusted with a hierarchical genetic algorithm.
7. The robot behavior learning model based on a utility differential network according to claim 1, characterized in that the action correction network unit (15) generates a random function with mean Â_j(s_t) and probability p(t) as the new action selection function A_j(s_t), then calculates the selection probability P(a_j | s_t) and outputs the action with the largest probability value; the selection probability is given by:

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

where a_j is the j-th action, s_t is the state space vector obtained after the action at time t is carried out, A_k(s_t) is the k-th action selection function and A_j(s_t) is the j-th action selection function.
CN 201010564142 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network Expired - Fee Related CN102063640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Publications (2)

Publication Number Publication Date
CN102063640A true CN102063640A (en) 2011-05-18
CN102063640B CN102063640B (en) 2013-01-30

Family

ID=43998910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010564142 Expired - Fee Related CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Country Status (1)

Country Link
CN (1) CN102063640B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026A (en) * 2016-10-25 2018-05-01 深圳光启合众科技有限公司 Robot, mechanical arm and its control method and device
WO2018113260A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Emotional expression method and device, and robot
CN110705682A (en) * 2019-09-30 2020-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN102402712B (en) * 2011-08-31 2014-03-05 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026A (en) * 2016-10-25 2018-05-01 深圳光启合众科技有限公司 Robot, mechanical arm and its control method and device
CN107972026B (en) * 2016-10-25 2021-05-04 河北亿超机械制造股份有限公司 Robot, mechanical arm and control method and device thereof
WO2018113260A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Emotional expression method and device, and robot
CN110705682A (en) * 2019-09-30 2020-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network
CN110705682B (en) * 2019-09-30 2023-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Also Published As

Publication number Publication date
CN102063640B (en) 2013-01-30


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130130

Termination date: 20131129