CN102063640B - Robot behavior learning model based on utility differential network - Google Patents


Info

Publication number
CN102063640B
CN102063640B (granted) · CN102063640A (application) · CN 201010564142
Authority
CN
China
Prior art keywords
action
layer
input
function
network element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010564142
Other languages
Chinese (zh)
Other versions
CN102063640A (en)
Inventor
宋晓
麻士东
龚光红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN 201010564142 priority Critical patent/CN102063640B/en
Publication of CN102063640A publication Critical patent/CN102063640A/en
Application granted granted Critical
Publication of CN102063640B publication Critical patent/CN102063640B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a robot behavior learning model based on a utility differential network, which comprises a utility fitting network unit, a differential signal calculating network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The model realizes an offline learning process and an online decision process. The utility fitting network unit calculates the utility fitting value of the state after an action is executed; the differential signal calculating network unit calculates the differential signal; the confidence evaluation network unit outputs the calculated confidence to the action correction network unit; the action decision network unit outputs the action selection functions; and the action correction network unit corrects the action selection functions using the confidence, calculates the selection probability of each action and outputs the action with the largest probability to the action execution unit for execution. The invention better guarantees the completeness of the environmental knowledge acquired by the robot, and better guarantees the timeliness and effectiveness of robot behavior decisions.

Description

Robot behavior learning model based on a utility differential network
Technical field
The present invention relates to a robot behavior learning model based on a utility differential network, and belongs to the field of artificial intelligence applications.
Background technology
Intelligent robot behavior generally refers to the process in which a robot performs reasoning and decision-making on the basis of perceiving its surrounding environment, so as to reach intelligent behavior decisions. Building an intelligent behavior decision model requires knowledge acquisition, representation and inference, together with the ability to automatically evaluate robot behavior. At present, cognitive behavior models based on reinforcement learning techniques have advantages in knowledge acquisition, adaptability of policy setting and reusability, which make them the first choice for intelligent behavior modeling.
The reinforcement learning process requires exploration of the environment. It can be described as follows: in a given state, the decision maker selects and executes an action, and then perceives the next environment state and the corresponding reward. The decision maker is not directly told when to take which action; instead it revises its own behavior according to the reward so as to obtain more reward. In short, reinforcement learning lets the decision maker obtain an optimal behavior sequence through continuous trials.
At present, the behavior decision-making of robot reinforcement learning mostly uses a reactive mode based on specific knowledge or rules. The drawbacks of this mode are: first, knowledge acquisition is limited; second, the acquired knowledge is often empirical and new knowledge cannot be learned in time; third, the real-time performance of the reasoning process is not high.
Summary of the invention
Aiming at the shortcomings of the existing behavior decision-making of robot reinforcement learning, the present invention establishes a robot behavior learning model based on a utility differential network. The model is an evaluation-based learning system: through interaction with the environment, it automatically generates the control law of the system and then provides action selection. The robot behavior learning model based on a utility differential network solves the problems that general behavior decision models have limited knowledge acquisition and are overly empirical; it realizes an offline learning process and an online decision process, and solves the problem of poor real-time performance of the reasoning process.
A robot behavior learning model based on a utility differential network comprises: a utility fitting network unit, a differential signal calculating network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit and an action execution unit. The utility fitting network unit computes the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is executed by the action execution unit, and outputs it to the differential signal calculating network unit. The differential signal calculating network unit computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs ΔTD_t to the utility fitting network unit, the confidence evaluation network unit and the action decision network unit. The utility fitting network unit uses the differential signal ΔTD_t to update the weights of its neural network. The confidence evaluation network unit uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit, together with the differential signal, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit. The action decision network unit performs action-selection learning according to the input differential signal ΔTD_t and state space vector s_t, and outputs the action selection functions Â_k(s_t) to the action correction network unit, where j and k are integers greater than 0. The action correction network unit uses the input confidence to correct the input action selection functions Â_k(s_t), then computes the selection probability of each corrected action and outputs the action with the largest probability to the action execution unit for execution; the state space vector obtained after this action is executed is fed back again to the utility fitting network unit, the differential signal calculating network unit and the action decision network unit.
The learning model has two processes: an offline learning process and an online decision process. All of the above units participate in the offline learning process; in the online decision process, only the action decision network unit finally obtained by offline learning and the action execution unit participate. The action decision network unit in the online decision process computes the output action selection functions Â_k(s_t) from the state space vector s_t obtained after the action executed at time t, the finally selected action is output by the action selector to the action execution unit for execution, and the state space vector obtained after executing the action is input to the action decision network unit again.
The advantages and beneficial effects of the present invention are:
(1) The robot learning model of the present invention does not need to compute and generate the correct action directly; instead, the difficult problem of robot knowledge acquisition is solved in an action-learning-evaluation loop through interaction with the environment. Because the learning model does not require an explicitly specified environment model, the causal relationships of the environment are embedded implicitly in the differential feedback network, which better guarantees the completeness of the environmental knowledge acquired by the robot;
(2) The offline learning process of this model design can complete the environmental knowledge learning before the robot makes decisions, and the online decision process can further complete the robot's environmental knowledge acquisition. During runtime decision-making there are no longer exploration and learning activities; only computation and addition with the reconstructed network are needed. This offline/online design ensures that the robot's behavior decision-making has good real-time performance, and better guarantees the timeliness and effectiveness of robot behavior decisions.
Description of drawings
Fig. 1 is a schematic structural diagram of the offline learning process in the first embodiment of the learning model of the present invention;
Fig. 2 is a schematic flow diagram of the action decision network in the first embodiment of the learning model of the present invention;
Fig. 3 is a schematic diagram of the genetic-operator coding structure in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 4 is a schematic diagram of the genetic-operator crossover operation in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and embodiments. The first embodiment details the offline learning process of the learning model of the present invention; the second embodiment describes the online decision process.
As shown in Fig. 1, the learning model of the present invention comprises five parts: a utility fitting network unit 11, a differential signal calculating network unit 12, a confidence evaluation network unit 13, an action decision network unit 14 and an action correction network unit 15. All five parts participate in the offline learning process of the learning model of the present invention.
The utility fitting network unit 11 computes the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t selected at time t is executed by the action execution unit 16, and outputs the utility fitting value Û(s_t) to the differential signal calculating network unit 12; the differential signal calculating network unit 12 outputs the differential signal ΔTD_t to the confidence evaluation network unit 13 and the utility fitting network unit 11. The utility fitting network unit 11 in turn uses the differential signal ΔTD_t input from the differential signal calculating network unit 12 to update itself continuously, thereby approaching the true utility fit.
The differential signal calculating network unit 12 computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit 11, the confidence evaluation network unit 13 and the action decision network unit 14.
The confidence evaluation network unit 13 uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit 11, together with the differential signal ΔTD_t, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit 15 for adjusting the action selection.
The action decision network unit 14, according to the input differential signal ΔTD_t and state space vector s_t, optimizes its neural network with a hierarchical genetic algorithm to realize action-selection learning, and outputs the action selection functions Â_k(s_t) to the action correction network unit 15, where j and k are integers greater than 0.
The action correction network unit 15 uses the input confidence to correct the input action selection functions Â_k(s_t) and outputs the action with the largest probability. The state space vector obtained after the action is executed is fed back again to the utility fitting network unit 11, the differential signal calculating network unit 12 and the action decision network unit 14.
The utility fitting network unit 11 is used to perform utility evaluation of the state change caused by a specific behavior and to obtain the utility fitting value; it is constituted by a two-layer feedback neural network, as shown in Fig. 1. The input of the neural network is the state space vector s_t, the hidden-layer activation function is the Sigmoid function, and the output of the neural network is the utility fitting value of the state after the action is executed. The weight coefficients of the neural network are A, B and C. The neural network comprises n input vector units and h hidden units; each hidden unit receives n inputs and has n connection weights, and the output unit receives n + h inputs and has n + h weights. The value of h can be set by the user, generally 3; it is set to 2 in the embodiment of the invention.
The input vector of this neural network is x_i(t), i = 1, 2, 3, ..., n, where x_i(t) is obtained by normalizing s_t. The output vector of the hidden units is then

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where g is the Sigmoid activation function g(z) = 1/(1 + e^{-z}), and a_ij(t) is the vector of the weights A between the input layer and the hidden layer. The output of the utility fitting network 11 is the utility fitting value Û(s_t), a linear combination of the input layer and the hidden layer:

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

where b_i(t) denotes the vector of the weights B between the input layer and the output layer, and c_j(t) denotes the vector of the weights C between the hidden layer and the output layer.
The weights A, B and C of the network are updated with the differential signal ΔTD_t. If the differential signal ΔTD_t is positive, the previous action has produced a positive effect, so the chance of that action being selected should be reinforced. The weights B between the input layer and the output layer and the weights C between the hidden layer and the output layer are updated according to

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

where λ is a constant greater than zero that can be set by the user. The weights A between the input layer and the hidden layer are updated according to

a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

where λ_h is a number greater than zero that can be set by the user, ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1, and sgn is the function

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}

with z here being the weight vector component c_j(t) of C.
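For illustration only, the following is a minimal Python sketch of a utility fitting network of the kind described above (the class name, array shapes, initial weights and learning-rate values are assumptions, not specified by the patent):

```python
import numpy as np

class UtilityFittingNetwork:
    """Sketch of the utility fitting network unit: input layer -> hidden layer (Sigmoid) -> utility output."""

    def __init__(self, n, h=2, lam=0.1, lam_h=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(n, h))  # input-hidden weights a_ij(t)
        self.B = rng.normal(scale=0.1, size=n)       # input-output weights b_i(t)
        self.C = rng.normal(scale=0.1, size=h)       # hidden-output weights c_j(t)
        self.lam = lam                               # lambda   (> 0)
        self.lam_h = lam_h                           # lambda_h (> 0)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        """x: normalized state vector x_i(t). Returns the utility fit U_hat(s_t) and the hidden output y_j(t)."""
        y = self._sigmoid(x @ self.A)                # y_j(t) = g[sum_i a_ij(t) x_i(t)]
        u_hat = self.B @ x + self.C @ y              # U_hat = sum_i b_i x_i + sum_j c_j y_j
        return u_hat, y

    def update(self, x, y, td):
        """Reinforce the weights with the differential signal DeltaTD_{t+1} (td)."""
        self.B += self.lam * td * x                  # b_i(t+1) = b_i(t) + lam * TD * x_i
        self.C += self.lam * td * y                  # c_j(t+1) = c_j(t) + lam * TD * y_j
        # a_ij(t+1) = a_ij(t) + lam_h * TD * y_j * sgn(c_j) * x_i
        self.A += self.lam_h * td * np.outer(x, y * np.sign(self.C))
```

A caller would compute `u_hat, y = net.forward(x)` after each action and later call `net.update(x, y, td)` once the differential signal for the next state is available.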
As shown in Fig. 1, the differential signal calculating network unit 12 computes the differential signal ΔTD_t from the fitted utility output by the utility fitting network unit 11 and the immediate reward function R(s_t) of the state. According to the temporal-difference (TD) method, ΔTD_t is obtained iteratively by

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

where R(s_t) is the immediate evaluation of state s_t, i.e., the output of the immediate reward function, and γ is the discount factor, which can be set by the user. Û(s_{t+1}) denotes the utility fitting value of the state space vector s_{t+1} produced after the action executed at time t+1, and Û(s_t) denotes the utility fitting value of the state space vector s_t produced after the action executed at time t.
The computed differential signal ΔTD_t is used to train and update the weight coefficients of the utility fitting network unit 11 and the confidence evaluation network unit 13. If the differential signal ΔTD_t indicates a positive effect, the action should be reinforced and its confidence should also be strengthened, i.e., it is believed more strongly that this action should be selected. In addition, the differential signal ΔTD_t is also used to update the weights of the action selection functions in the action decision network unit 14, so as to guarantee the selection of the optimal action.
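The TD computation itself reduces to a one-line helper; a minimal sketch (the function and parameter names, and the default γ, are assumptions):

```python
def td_error(reward_t, u_hat_t, u_hat_t1, gamma=0.9):
    """DeltaTD_t = R(s_t) + gamma * U_hat(s_{t+1}) - U_hat(s_t)."""
    return reward_t + gamma * u_hat_t1 - u_hat_t
```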
As shown in Fig. 1, when the action decision network unit 14 outputs the action decision functions, the confidence evaluation network unit 13 computes the confidence of the output action; this confidence is used to adjust the action selection. The inputs of the confidence evaluation network unit 13 are the state vectors x_i(t) and y_j(t), which are taken from the input layer and the hidden layer of the utility fitting network unit 11.
The confidence p_0(t) is computed by

p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

where the weights α_i(t) and β_j(t) are updated by

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

Here λ_p denotes the learning rate, a value between 0 and 1 with an empirical value of 0.618, which the user can set according to experience. The formula above cannot guarantee that p_0(t) lies in the interval [0, 1], so the Sigmoid function is introduced to transform p_0(t) into p(t); in this way the output confidence matches a probability for the random function:

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}
The confidence correction factor a plays the role of smoothing the learning process: changing a changes the extent to which learning adjusts to the environment. If a is too large, the learning system loses its regulating effect, so a suitable value of a should be set according to prior knowledge, with a > 0; in the present invention the range of a is [1, 10].
The confidence reflects the regulating effect of decision uncertainty on action selection. It can be seen that as the utility of the state gradually approaches its true value, i.e., as ΔTD_t grows, the confidence p(t) also gradually increases and the selection of the action becomes more and more definite. The output confidence p(t) is then used to correct each action selection function Â_j(s_t) output by the action decision network unit 14; the correction process is completed in the action correction network unit 15.
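A minimal sketch of the confidence evaluation unit in Python (class name, zero initialization of the weights and the default value of a are assumptions):

```python
import numpy as np

class ConfidenceEvaluator:
    """Sketch of the confidence evaluation network unit."""

    def __init__(self, n, h, lam_p=0.618, a=2.0):
        self.alpha = np.zeros(n)      # weights alpha_i(t) on the input-layer vector x_i(t)
        self.beta = np.zeros(h)       # weights beta_j(t) on the hidden-layer output y_j(t)
        self.lam_p = lam_p            # learning rate in (0, 1), empirical value 0.618
        self.a = a                    # confidence correction factor, in [1, 10]

    def confidence(self, x, y):
        p0 = self.alpha @ x + self.beta @ y           # p_0(t)
        return 1.0 / (1.0 + np.exp(-self.a * p0))     # p(t), squashed into (0, 1)

    def update(self, x, y, td):
        self.alpha += self.lam_p * td * x             # alpha_i(t+1)
        self.beta += self.lam_p * td * y              # beta_j(t+1)
```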
The action decision network unit 14 is realized with a neural network divided into four layers, as shown in Fig. 1. From the first layer to the fourth layer they are: the input layer, the fuzzy subset layer, the variable node layer and the function output layer, where the variable node layer is also called the function fitting layer. Let h = 1, 2, 3, 4 denote the four layers of the network, and let IN_i^h and O_i^h be the input and output of the i-th node of the h-th layer, where i indexes the nodes of each layer; the first layer has I nodes, the second layer has I*J nodes, the third layer has L nodes and the fourth layer has K nodes, with I, J, K, L positive integers. The mean m_ij and variance σ_ij are respectively the location parameter and width of the Gaussian membership function of the j-th node corresponding to input x_i(t) in the second layer.
The input of the input layer of the neural network of the action decision network unit 14 is x_i(t), obtained by normalizing the state space vector s_t; it characterizes the situation information of the robot at the input time. The input IN_i^1 of the i-th node of the input layer is

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I
The fuzzy subset layer performs fuzzification on the input variables of the input layer; its output is the membership degree of each input vector. Each input x_i(t) of the input layer corresponds to J inputs in the fuzzy subset layer (in Fig. 1, J is 2), where each of these inputs is one fuzzy subset of x_i(t), and the output is the membership degree of x_i(t) in that fuzzy subset. The activation function of each node is a Gaussian membership function, and the output is

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \quad j = 1, 2, 3, \ldots, J

where O^2_{x_i j} is the j-th output corresponding to input x_i(t), and exp is the exponential function with the natural base e.
To fit the action functions, the neural network needs a certain degree of adjustable output; the variable node layer realizes this regulating function through changes in the number of nodes and in the connection weights. The number of nodes and the connection weights are optimized with the hierarchical genetic algorithm, which dynamically determines their number and values so that the network fits the action functions; this is described in detail below. The activation function of the variable node layer is a Gaussian function whose location parameter and width are m_l and σ_l, respectively. The number of connections between the second layer and the third layer is also not fixed and needs to be adjusted dynamically during optimization, while the connection weights are all 1. The output of the l-th node of the third layer is

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1,j=1}^{I,J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L
The number of nodes of the function output layer equals the number of selectable actions; the output of the function output layer is the fitted value of the action functions, used to compute the selection probability of each action. The output of the k-th node of the fourth layer is

O_k^4 = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

where the fourth-layer output O_k^4 is exactly the action selection function Â_k(s_t):

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

Each node of the third layer is connected to the fourth layer; ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and these connection weights ω_lk also need to be adjusted dynamically during optimization.
Suppose the first layer of the network has I inputs and the i-th input has k_i fuzzy divisions in the second layer; then the second layer has k_1 + k_2 + ... + k_I nodes in total, and the node function is the membership function of each input with respect to its fuzzy subset. In summary, the network structure that needs to be adjusted and optimized dynamically is: the number of third-layer nodes and the number of connections between the second layer and the third layer. The network parameters that need to be adjusted and optimized are: the location parameters m_ij and widths σ_ij of the membership functions of the second-layer inputs, the location parameters m_l and widths σ_l of the Gaussian activation functions of the third layer (hidden layer), and the connection weights ω_lk between the third layer and the fourth layer.
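For illustration, a minimal sketch of the four-layer forward pass described above follows; the class name and shapes are assumptions, and the 0/1 connection mask `conn` is an assumed way of realizing the adjustable layer-2/layer-3 connections:

```python
import numpy as np

class ActionDecisionNetwork:
    """Sketch of the four-layer fuzzy network: input -> fuzzy subset -> variable node -> function output."""

    def __init__(self, m_ij, s_ij, m_l, s_l, conn, w_lk):
        self.m_ij, self.s_ij = m_ij, s_ij   # (I, J) membership locations / widths for layer 2
        self.m_l, self.s_l = m_l, s_l       # (L,) Gaussian locations / widths for layer 3
        self.conn = conn                    # (I*J, L) 0/1 mask: layer-2 to layer-3 connections
        self.w_lk = w_lk                    # (L, K) layer-3 to layer-4 weights omega_lk

    def forward(self, x):
        """x: normalized state x_i(t), shape (I,). Returns A_hat_k(s_t), shape (K,)."""
        # Layer 2: Gaussian memberships O2_{ij} = exp(-((x_i - m_ij) / sigma_ij)^2)
        o2 = np.exp(-(((x[:, None] - self.m_ij) / self.s_ij) ** 2))   # (I, J)
        # Layer 3: each variable node sums its connected memberships, then applies a Gaussian
        s = o2.reshape(-1) @ self.conn                                # (L,)
        o3 = np.exp(-(((s - self.m_l) / self.s_l) ** 2))              # (L,)
        # Layer 4: action selection functions A_hat_k = sum_l omega_lk * O3_l
        return o3 @ self.w_lk                                         # (K,)
```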
Here, a hybrid hierarchical genetic algorithm is used to optimize the structure and parameters of the neural network in the action decision network unit. The structure optimization of the network determines the number of third-layer nodes and the number of connections between the second layer and the third layer. The parameter optimization of the network covers the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the third-layer hidden nodes, and the connection weights ω_lk between the third layer and the fourth layer. Optimizing and adjusting the neural network with the hierarchical genetic algorithm makes the network, in every round of decision-making, continually optimize and obtain the action selection functions according to the change of the input differential signal, so as to realize the selection of actions.
The action correction network unit 15 uses the evaluation value output by the confidence evaluation network unit 13, namely the action confidence p(t), to correct the action selection functions Â_j(s_t) output by the action decision network unit 14, then computes the probability of each action being chosen and outputs the action with the largest probability.
The correction process generates a random function with Â_j(s_t) as its mean and p(t) as its probability, and takes it as the new action selection function A_j(s_t). The smaller p(t) is, the farther A_j(s_t) lies from Â_j(s_t); conversely, the closer it is to Â_j(s_t). The new A_j(s_t) then replaces Â_j(s_t). The larger the value of the action selection function A_j(s_t), the larger the probability that the corresponding action a_j is selected. The selection probability is computed as

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

and the action with the largest probability value is output.
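A minimal sketch of this correction-and-selection step in Python; the Gaussian perturbation scaled by (1 - p) is an assumed way to realize "a random function with Â as mean whose spread shrinks as the confidence grows", which the patent does not spell out:

```python
import numpy as np

def correct_and_select(a_hat, p, rng=None):
    """Sketch of the action correction network unit.

    a_hat : array of action selection functions A_hat_j(s_t)
    p     : confidence p(t) in (0, 1)
    """
    rng = np.random.default_rng() if rng is None else rng
    a_new = a_hat + (1.0 - p) * rng.normal(size=a_hat.shape)   # corrected A_j(s_t)
    probs = np.exp(a_new - a_new.max())                        # softmax P(a_j | s_t), numerically stabilized
    probs /= probs.sum()
    return int(np.argmax(probs)), probs                        # index of the most probable action
```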
In the robot behavior learning model, the action decision network unit 14 further comprises four subunits: a coding unit 141, a population initialization unit 142, a fitness function determination unit 143 and a genetic operation unit 144, as shown in Fig. 2.
The coding unit 141 determines the chromosome structure of the genetic algorithm. The hierarchical genetic algorithm is proposed according to the hierarchical structure of the biological chromosome: the genes in a biological chromosome can be divided into control genes and structural genes, and the role of a control gene is to control whether a structural gene is activated. Here, drawing on this characteristic of biological chromosome genes, the above optimization problem is encoded. Each individual in the population is composed of the two parts that determine the structure and the parameters of the network. The gene structure of a population individual adopts a two-level hierarchical coding, i.e., it is realized in two layers according to the gene hierarchy of the biological chromosome. The upper-layer genes encode the number of third-layer nodes and the membership functions of the second-layer inputs, i.e., the number of third-layer nodes and the parameters m_ij and σ_ij of the second-layer input membership functions; as shown in Fig. 3, the part that controls the number of third-layer (hidden-layer) nodes is called the control gene. The lower layer is the parameter gene, which encodes the membership functions of the third-layer (hidden-layer) nodes and the network connections, including the third-layer (hidden-layer) node membership-function parameters m_l and σ_l, the number of connections between the second layer and the third layer, and the connection weights ω_lk between the third layer and the fourth layer.
The control genes and the genes of the parameter gene that represent the network connections all adopt binary coding, with "0" and "1" representing "absent" and "present", respectively. The other genes, representing membership-function parameters and connection weights, all adopt real-valued coding, i.e., they are represented by real numbers. The third-layer structure is encoded as a binary string, with one bit per third-layer node serving as the control gene: "1" means the node is active and "0" means the node is inactive. Thus the number of "1"s in the control gene string is the actual number of active hidden nodes of the neural network. In the parameter gene, the genes for the connections between the second and third layers adopt binary coding, where "1" means the corresponding second-layer node is connected to the third layer and "0" means it is not. The genes for the weights between the third and fourth layers adopt real-valued coding and represent the connection weights between the third and fourth layers.
It can thus be seen that the control genes control the number of nodes: if a certain node is "0", the node has no connections to either adjacent layer, and correspondingly its parameter genes do not exist. The parameter genes are therefore controlled by the control genes: if a node in the upper-layer control gene does not exist, the corresponding lower-layer parameter genes are not activated. This embodies the control effect of the control genes, and this control effect corresponds to the topological structure of the network. The chromosomes formed by this encoding, one by one, constitute the population, and evolution is carried out with them.
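For illustration, a minimal sketch of such a two-level chromosome and of the random initialization described next; all field names, shapes and initial distributions are assumptions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Chromosome:
    """Two-level hierarchical encoding: control genes (upper layer) and parameter genes (lower layer)."""
    control: np.ndarray    # (L_max,) binary: 1 = hidden node active, 0 = inactive
    m_ij: np.ndarray       # (I, J) second-layer membership locations
    s_ij: np.ndarray       # (I, J) second-layer membership widths
    m_l: np.ndarray        # (L_max,) third-layer Gaussian locations
    s_l: np.ndarray        # (L_max,) third-layer Gaussian widths
    conn: np.ndarray       # (I*J, L_max) binary layer-2 -> layer-3 connections
    w_lk: np.ndarray       # (L_max, K) real-valued layer-3 -> layer-4 weights

def random_chromosome(I, J, L_max, K, rng):
    """Random initialization of one individual (population initialization unit)."""
    return Chromosome(
        control=rng.integers(0, 2, size=L_max),
        m_ij=rng.normal(size=(I, J)), s_ij=np.abs(rng.normal(size=(I, J))) + 0.1,
        m_l=rng.normal(size=L_max), s_l=np.abs(rng.normal(size=L_max)) + 0.1,
        conn=rng.integers(0, 2, size=(I * J, L_max)),
        w_lk=rng.normal(size=(L_max, K)),
    )
```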
Further, the population initialization unit 142 initializes the chromosome population. For the genetic algorithm to run smoothly, a number of chromosome individuals need to be generated in advance, and these individuals should be generated at random so that they represent the possibilities of many network structures, i.e., a sufficiently large solution space. A suitable population size is significant for the convergence of the genetic algorithm: if the population is too small it is hard to obtain a satisfactory result, and if it is too large the computation becomes complex; the population size is generally taken as 10 to 160.
Further, the fitness function determination unit 143 determines the chromosome fitness function. The individual fitness function is expressed in terms of the individual error and the structural complexity, so that the complexity of the network is taken into account while optimizing the individual error and an optimal network structure is obtained. The fitness function of the network has the form

f(i) = \alpha \frac{1}{E(i)} + \beta \frac{1}{H(i)}, \quad i = 1, 2, \ldots, I

where E(i) and H(i) denote respectively the individual error and the structural complexity of the i-th individual, with

E(i) = \sum_{j=1}^{K} (\hat{y}_{ij} - y_{ij})^2

H(i) = 1 + \exp[-c\, N_i(0)]

Here ŷ_ij and y_ij are the j-th actual output and desired output of the i-th individual, where the desired output y_ij is the selection function of the desired action: if an action is the desired output, its expected value is set to a fixed constant, and the expected action functions of all other actions are set to 0. N_i(0) is the number of hidden nodes of the i-th individual that are zero, and c is a parameter regulating factor; b and c are constants, and α and β are constants greater than zero with α + β = 1. Using such a fitness function makes it possible to obtain a suitable neural network structure while optimizing the network weights.
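A minimal sketch of this fitness computation; the default values of α, β and c are assumptions:

```python
import numpy as np

def fitness(y_hat, y_des, n_zero_nodes, alpha=0.5, beta=0.5, c=0.1):
    """Individual fitness f(i) = alpha / E(i) + beta / H(i).

    y_hat        : actual outputs of the individual's network for the K actions
    y_des        : desired outputs (desired action set to a constant, all others 0)
    n_zero_nodes : number of inactive (zero) hidden nodes N_i(0)
    """
    E = float(np.sum((y_hat - y_des) ** 2)) + 1e-12   # individual error E(i), guarded against division by zero
    H = 1.0 + np.exp(-c * n_zero_nodes)               # structural complexity term H(i)
    return alpha / E + beta / H
```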
Further, the genetic operation unit 144 performs the genetic operations, which comprise selection, crossover and mutation. After selection, crossover and mutation, the initial population has undergone one round of genetic operations and completed one round of evolution, producing a new generation of offspring; this process is cycled so that evolution continues and the offspring converge to the optimum.
Selection means choosing, from the previous generation of the population, some good individuals to inherit into the next generation according to the fitness of the individuals and according to a certain rule or method. The algorithm adopts elitist selection, i.e., according to the magnitude of the fitness value, the best individual of each generation is kept into the next generation; this guarantees the asymptotic convergence of the algorithm. For individual i, its selection probability is

p_s(i) = \frac{f_i}{\sum_{j=1}^{N} f_j}

where f_i is the fitness of individual i and N is the number of individuals in the population.
The crossover operation exchanges corresponding gene positions of two individuals at random; this process reflects random information exchange, and its purpose is to produce new gene combinations, i.e., new individuals. When evolution has proceeded to a certain degree, in particular when most individuals in the population become identical, crossover can hardly produce new individuals, and new individuals can then only be produced by mutation. Mutation changes gene positions with a certain probability in order to open up new search space; that is to say, mutation adds a global-optimization character. In the process of crossover and mutation, randomness plays an important role; only random crossover and mutation operations can guarantee the appearance of new individuals, and this randomness is expressed through the crossover and mutation probabilities.
In the genetic operation process, the crossover probability and the mutation probability have a significant impact on the performance of the genetic algorithm. If, at the initial stage of the genetic algorithm (Genetic Algorithm, GA), the crossover probability is chosen large and the mutation probability small, the convergence speed of the algorithm can be accelerated, which is beneficial to searching for the optimal solution. However, as the search proceeds, the crossover probability needs to be reduced and the mutation probability increased, so that the algorithm does not easily fall into local extrema and can search for new solutions.
At the same time, the mutation probability must not be taken too large, otherwise the algorithm will have difficulty converging and will destroy the genes of the optimal solution. For solutions with high fitness, lower crossover and mutation probabilities are taken so that they have a larger chance to enter the next generation; for solutions with lower fitness, higher crossover and mutation probabilities should be taken so that they are eliminated as early as possible; when premature convergence occurs, the crossover and mutation probabilities should be increased to accelerate the generation of new individuals. According to the above selection principles for the crossover and mutation probabilities, an adaptive crossover probability and mutation probability method is adopted, computed as follows:
p_c = \begin{cases} \dfrac{f_{\max} - f_{avg}}{f} & (f_{\max} - f_{avg}) < f \\ 0.8 & (f_{\max} - f_{avg}) \ge f \end{cases}

p_m = \begin{cases} \dfrac{0.2 (f_{\max} - f')}{f_{\max} - f_{avg}} & (f_{\max} - f') < (f_{\max} - f_{avg}) \\ 0.2 & (f_{\max} - f') \ge (f_{\max} - f_{avg}) \end{cases}

where p_c is the crossover probability and p_m is the mutation probability, f_max is the maximum fitness in the population, f_avg is the average fitness, f is the larger fitness of the two individuals being crossed, and f' is the fitness of the mutating individual.
When the evolution space is large, this method can quickly find the optimal solution; when converging near a locally optimal solution, it increases the diversity of the population. It can be seen that the mutation probability of the individual with maximum fitness is zero and that the crossover and mutation probabilities of individuals with larger fitness are all very small, which protects excellent individuals; the crossover and mutation probabilities of individuals with smaller fitness are all very large, so those individuals are constantly disrupted.
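A minimal sketch of these adaptive probabilities; the function and parameter names are assumptions:

```python
def adaptive_probabilities(f_max, f_avg, f_cross, f_mut):
    """Adaptive crossover probability p_c and mutation probability p_m.

    f_cross : larger fitness of the two individuals selected for crossover (f)
    f_mut   : fitness of the individual selected for mutation (f')
    """
    spread = f_max - f_avg
    p_c = spread / f_cross if spread < f_cross else 0.8
    p_m = 0.2 * (f_max - f_mut) / spread if (f_max - f_mut) < spread else 0.2
    return p_c, p_m
```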
Crossover is performed between two chosen individuals according to the crossover probability; the crossover operation operates on the corresponding parts of the control genes and of the parameter genes separately, as shown in Fig. 4. Such a crossover makes the corresponding genes of the two chromosomes cross, and also guarantees that binary-coded and real-coded genes cross with their counterparts. The crossover of corresponding positions of the two chromosomes adopts single-point crossover: the same position is selected at random in the two individuals and the genes are exchanged at the chosen position.
The mutation operation covers all genes. For the binary-coded genes in the control genes and in the parameter genes, bit mutation is adopted, i.e., a logical inversion is carried out: "1" becomes "0" and "0" becomes "1". For the real-valued genes, a Gaussian mutation of the following linear-combination form is carried out:
\hat{m}_{ij} = m_{ij} + \alpha \frac{1}{f} N(0, 1)
\hat{\sigma}_{ij} = \sigma_{ij} + \alpha \frac{1}{f} N(0, 1)
\hat{m}_l = m_l + \alpha \frac{1}{f} N(0, 1)
\hat{\sigma}_l = \sigma_l + \alpha \frac{1}{f} N(0, 1)
\hat{\omega}_{lk} = \omega_{lk} + \alpha \frac{1}{f} N(0, 1)

where α is the evolution rate, f is the fitness of the individual, and N(0, 1) is a normally distributed random function with expectation 0 and standard deviation 1.
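A minimal sketch of this mutation of the real-valued genes; the function name is an assumption:

```python
import numpy as np

def gaussian_mutate(values, alpha, f, rng):
    """Linear-combination Gaussian mutation for real-valued genes.

    values : array of real-valued genes (m_ij, sigma_ij, m_l, sigma_l or omega_lk)
    alpha  : evolution rate
    f      : fitness of the individual (larger fitness -> smaller perturbation)
    """
    return values + alpha * (1.0 / f) * rng.standard_normal(values.shape)
```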
In summary, the steps of the algorithm that realizes the neural network optimization with the hierarchical genetic algorithm are as follows:
1. Encode the network structure and parameters according to the hierarchical structure to generate chromosome individuals.
2. Randomly generate an initial chromosome population of 2N individuals and set the evolution generation to t = 0.
3. Compute the fitness value of each individual and the maximum and average fitness values in the population according to the formulas.
4. Select N individuals from the population as parents according to the individual selection probabilities, and set t = t + 1.
5. Randomly select two individuals from the parents and perform crossover according to the crossover probability. If crossover takes place, first copy the two individuals and keep the originals, then perform the crossover on the copies to produce two new individuals. Repeat until the whole parent population has been crossed.
6. Perform mutation on all individuals according to the mutation probability.
7. When the fitness of the best individual and the fitness of the population reach given thresholds, or the maximum evolution generation is reached, the iterative process of the algorithm converges and the algorithm ends; otherwise go to step 3 and continue until the termination condition is satisfied.
After the optimization is finished, the network structure and parameters of the best individual are taken as the decision network and used to realize the computation of action decisions.
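For illustration, the following is a compact skeleton of such a loop, following steps 1-7 above; it is a sketch only, the caller supplies the operators `init_individual`, `evaluate`, `crossover` and `mutate`, and the population sizes, thresholds and elitism details are assumptions (fitness values are assumed positive, as in the fitness function above):

```python
import numpy as np

def hierarchical_ga(init_individual, evaluate, crossover, mutate,
                    n_parents=20, max_gen=100, f_threshold=None, seed=0):
    """Skeleton of the hierarchical-GA optimization loop.

    init_individual(rng) -> individual, evaluate(ind) -> fitness,
    crossover(a, b, p_c, rng) -> (child, child), mutate(ind, p_m, rng) -> individual.
    """
    rng = np.random.default_rng(seed)
    pop = [init_individual(rng) for _ in range(2 * n_parents)]            # steps 1-2
    for _ in range(max_gen):
        fits = np.array([evaluate(ind) for ind in pop], dtype=float)      # step 3
        f_max, f_avg = fits.max(), fits.mean()
        spread = max(f_max - f_avg, 1e-12)
        if f_threshold is not None and f_max >= f_threshold:              # step 7: converged
            break
        idx = rng.choice(len(pop), size=n_parents, p=fits / fits.sum())   # step 4: selection
        parents = [pop[i] for i in idx]
        elite = pop[int(fits.argmax())]                                   # elitist selection
        children = [elite]
        while len(children) < 2 * n_parents:                              # step 5: crossover
            a, b = rng.choice(n_parents, size=2, replace=False)
            f_c = max(evaluate(parents[a]), evaluate(parents[b]))
            p_c = spread / f_c if spread < f_c else 0.8                   # adaptive p_c
            children.extend(crossover(parents[a], parents[b], p_c, rng))
        pop = []
        for ind in children[: 2 * n_parents]:                             # step 6: mutation
            f_i = evaluate(ind)
            p_m = 0.2 * (f_max - f_i) / spread if (f_max - f_i) < spread else 0.2
            pop.append(mutate(ind, p_m, rng))
    fits = np.array([evaluate(ind) for ind in pop], dtype=float)
    return pop[int(fits.argmax())]                                        # best individual -> decision network
```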
In the action decision network unit 14, the hierarchical genetic algorithm is used to optimize the structure and parameters of the network. Whenever a new situation occurs, the differential signal ΔTD_t given by the temporal-difference (TD) method is first used to update the parameters of the action selection network, in the hope of obtaining more advantageous selectable actions. Specifically, the differential signal ΔTD_t is used to update the connection weights between the third layer and the fourth layer in the parameter genes of every chromosome in the population, and the genetic operations are carried out afterwards. In this way the whole weight space of the corresponding action function is updated, and the new weights of the corresponding action obtained through inheritance should also be larger, reflecting the learning of this optimal action. In the update of the connection weights driven by the differential signal, ω_ij is the connection weight between the i-th hidden node of the third layer and the j-th action selection function of the fourth layer, and the weighting coefficient is a value between 0 and 1 with an empirical value of 0.62.
This embodiment trains the neural network with the hierarchical genetic algorithm and realizes knowledge learning. It addresses the fact that prior-art behavior decision research mostly uses a reactive mode based on specific knowledge or rules, and it solves the knowledge acquisition and reasoning decision problems of robot behavior decision-making fairly well: by approaching the completeness of knowledge through learning from interaction with the environment, the agent possesses a higher level of learning and reasoning ability.
Fig. 5 is a schematic diagram of the online decision process in the second embodiment of the learning model of the present invention. After offline learning, the finally obtained action decision network unit 14 is optimal, and this action decision network unit 14 is used for real-time online decision-making. All other units, such as the utility fitting network unit 11, the differential signal calculating network unit 12, the confidence evaluation network unit 13 and the action correction network unit 15, are removed in the online decision process and are no longer used. The action decision network unit 14 computes the output action selection functions Â_k(s_t) from the state space vector s_t obtained after the selected action a_t is executed by the action execution unit 16; the finally selected action is output by the action selector, and the state space vector obtained after this action is executed by the action execution unit 16 is input to the action decision network unit 14 again.
This embodiment uses the trained neural network to perform real-time behavior decision-making of the robot. Separating the learning process from the decision process guarantees the efficiency of online decision-making and satisfies the needs of real-time operation.
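A minimal sketch of the online decision loop, in which only the optimized decision network and the action execution unit take part; `decision_net.forward`, `execute_action` and the min-max normalization are assumed interfaces, not defined by the patent:

```python
import numpy as np

def online_decision_loop(decision_net, execute_action, initial_state, steps=100):
    """Sketch of the online decision process: decide with the trained network only, no learning."""
    s = np.asarray(initial_state, dtype=float)
    for _ in range(steps):
        x = (s - s.min()) / (np.ptp(s) + 1e-12)          # assumed normalization of s_t into x_i(t)
        a_hat = decision_net.forward(x)                  # action selection functions A_hat_k(s_t)
        action = int(np.argmax(a_hat))                   # action selector: pick the best action
        s = np.asarray(execute_action(action), float)    # action execution unit returns the new state
    return s
```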

Claims (1)

1. A model building device for a robot behavior learning model based on a utility differential network, comprising an action execution unit (16), characterized in that the model building device further comprises: a utility fitting network unit (11), a differential signal calculating network unit (12), a confidence evaluation network unit (13), an action decision network unit (14) and an action correction network unit (15);
the utility fitting network unit (11) is used to compute the utility fitting value Û(s_t) of the state space vector s_t produced after the action a_t at time t is executed by the action execution unit (16), and to output it to the differential signal calculating network unit (12); the differential signal calculating network unit (12) computes the differential signal ΔTD_t from the input utility fitting value Û(s_t) and the immediate reward function calculated from the state space vector s_t, and outputs the differential signal ΔTD_t to the utility fitting network unit (11), the confidence evaluation network unit (13) and the action decision network unit (14); the utility fitting network unit (11) uses the differential signal ΔTD_t to update the weights of the neural network in the utility fitting network unit (11); the confidence evaluation network unit (13) uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit (11), together with the differential signal, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit (15); the action decision network unit (14) performs action-selection learning according to the input differential signal ΔTD_t and state space vector s_t and outputs the action selection functions Â_k(s_t) to the action correction network unit (15), where j and k are integers greater than 0; the action correction network unit (15) uses the input confidence to correct the input action selection functions Â_k(s_t), then computes the selection probability of each corrected action and outputs the action with the largest probability to the action execution unit (16) for execution; the state space vector after this action is executed is fed back again to the utility fitting network unit (11), the differential signal calculating network unit (12) and the action decision network unit (14);
the utility fitting network unit (11) is constituted by a neural network comprising an input layer, a hidden layer and an output layer, the weights of the neural network being A, B and C; the input vector x_i(t) of the input layer of the neural network is obtained by normalizing the state space vector s_t produced after the action executed at time t; the hidden-layer activation function is the Sigmoid function; and the output of the neural network is the utility fitting value of the state after the action is executed,

\hat{U}(s_t) = \sum_{i=1}^{n} b_i(t)\, x_i(t) + \sum_{j=1}^{h} c_j(t)\, y_j(t)

wherein b_i(t) denotes the vector of the weights B between the input layer and the output layer, c_j(t) denotes the vector of the weights C between the hidden layer and the output layer, n is the number of input-layer units, h is the number of hidden units, and y_j(t) is the output vector of the hidden units, computed with the Sigmoid function g as

y_j(t) = g\left[\sum_{i=1}^{n} a_{ij}(t)\, x_i(t)\right], \quad j = 1, 2, 3, \ldots, h

where a_ij(t) is the vector of the weights A between the input layer and the hidden layer; the weight vectors of the neural network in the utility fitting network unit (11) are specifically updated by the following formulas:

b_i(t+1) = b_i(t) + \lambda \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
c_j(t+1) = c_j(t) + \lambda \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h
a_{ij}(t+1) = a_{ij}(t) + \lambda_h \cdot \Delta TD_{t+1} \cdot y_j(t) \cdot \mathrm{sgn}(c_j(t)) \cdot x_i(t)

wherein λ is a constant greater than zero, λ_h is a number greater than zero, ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1, and sgn(c_j(t)) is determined by the function sgn:

\mathrm{sgn}(z) = \begin{cases} 1 & z > 0 \\ 0 & z = 0 \\ -1 & z < 0 \end{cases}
the differential signal calculating network unit (12) computes the differential signal ΔTD_t according to the temporal-difference (TD) method:

\Delta TD_t = R(s_t) + \gamma \cdot \hat{U}(s_{t+1}) - \hat{U}(s_t)

wherein R(s_t) is the immediate evaluation of the state space vector s_t, γ is the discount factor, Û(s_{t+1}) denotes the utility fitting value of the state space vector s_{t+1} produced after the action executed at time t+1, and Û(s_t) denotes the utility fitting value of the state space vector s_t produced after the action executed at time t;
the confidence p(t) finally output by the confidence evaluation network unit (13) is

p(t) = \frac{1}{1 + e^{-a\, p_0(t)}}, \quad p_0(t) = \sum_{i=1}^{n} \alpha_i(t)\, x_i(t) + \sum_{j=1}^{h} \beta_j(t)\, y_j(t)

wherein the range of the confidence correction factor a is [1, 10], x_i(t) and y_j(t) are respectively the input vector of the neural network in the utility fitting network unit (11) and the output vector of its hidden units, and n and h are respectively the number of input-layer units and the number of hidden units of the neural network in the utility fitting network unit (11); the weights α_i(t+1) and β_j(t+1) corresponding to the action executed at time t+1 are updated as follows:

\alpha_i(t+1) = \alpha_i(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot x_i(t), \quad i = 1, 2, 3, \ldots, n
\beta_j(t+1) = \beta_j(t) + \lambda_p \cdot \Delta TD_{t+1} \cdot y_j(t), \quad j = 1, 2, 3, \ldots, h

wherein λ_p denotes the learning rate, a value between 0 and 1, and ΔTD_{t+1} denotes the differential signal of the state space vector produced after the action executed at time t+1;
the action decision network unit (14) is realized with a neural network comprising an input layer, a fuzzy subset layer, a variable node layer and a function output layer; the input IN_i^1 of the i-th node of the input layer is

IN_i^1 = x_i(t), \quad i = 1, 2, 3, \ldots, I

wherein I is the number of input-layer nodes and x_i(t) is the vector obtained by normalizing the state space vector s_t after the action is executed; the fuzzy subset layer performs fuzzification on the input of the input layer, and the j-th output O^2_{x_i j} corresponding to input x_i(t) is

O^2_{x_i j} = \exp\left[-\left(\frac{x_i(t) - m_{ij}}{\sigma_{ij}}\right)^2\right], \quad i = 1, 2, 3, \ldots, I, \quad j = 1, 2, 3, \ldots, J

wherein J is the number of inputs in the fuzzy subset layer corresponding to each x_i(t) of the input layer, and m_ij and σ_ij denote respectively the location parameter and the width of the membership function of the input vector; the activation function of the variable node layer is a Gaussian function whose location parameter and width are respectively m_l and σ_l, and the node output O_l^3 of the variable node layer is

O_l^3 = \exp\left[-\left(\frac{\sum_{i=1,j=1}^{I,J} O^2_{x_i j} - m_l}{\sigma_l}\right)^2\right], \quad l = 1, 2, 3, \ldots, L

wherein L is the number of nodes of the variable node layer; the output of the function output layer is the fitted value of the action functions, namely the action selection function

\hat{A}_k(s_t) = \sum_{l=1}^{L} \omega_{lk}\, O_l^3, \quad k = 1, 2, 3, \ldots, K

wherein K is the number of nodes of the function output layer, ω_lk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and I, J, K, L are positive integers; the location parameters m_ij and widths σ_ij of the membership functions of the input vector, the location parameters m_l and widths σ_l of the Gaussian functions of the variable node layer, and the connection weights between the variable node layer and the function output layer are optimized and adjusted with a hierarchical genetic algorithm;
the action correction network unit (15) generates a random function with Â_j(s_t) as its mean and p(t) as its probability, takes it as the new action selection function A_j(s_t), then computes the selection probability P(a_j | s_t) and outputs the action with the largest probability value; the formula for the selection probability is

P(a_j \mid s_t) = \frac{e^{A_j(s_t)}}{\sum_k e^{A_k(s_t)}}

wherein a_j is the j-th action, s_t is the state space vector obtained after the action executed at time t, A_k(s_t) is the k-th action selection function and A_j(s_t) is the j-th action selection function;
the model building device has two processes: an offline learning process and an online decision process; all of the above units participate in the offline learning process, while in the online decision process only the action decision network unit (14) finally obtained by offline learning and the action execution unit (16) participate; the action decision network unit (14) in the online decision process computes the output action selection functions Â_k(s_t) from the state space vector s_t produced after the action execution unit (16) executes the action at time t, the finally selected action is output by the action selector to the action execution unit (16) for execution, and the state space vector obtained after the action is executed is input to the action decision network unit (14) again.
CN 201010564142 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network Expired - Fee Related CN102063640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010564142 CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Publications (2)

Publication Number Publication Date
CN102063640A CN102063640A (en) 2011-05-18
CN102063640B true CN102063640B (en) 2013-01-30

Family

ID=43998910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010564142 Expired - Fee Related CN102063640B (en) 2010-11-29 2010-11-29 Robot behavior learning model based on utility differential network

Country Status (1)

Country Link
CN (1) CN102063640B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712B (en) * 2011-08-31 2014-03-05 山东大学 Robot reinforced learning initialization method based on neural network
CN107972026B (en) * 2016-10-25 2021-05-04 河北亿超机械制造股份有限公司 Robot, mechanical arm and control method and device thereof
CN108229640B (en) * 2016-12-22 2021-08-20 山西翼天下智能科技有限公司 Emotion expression method and device and robot
CN110705682B (en) * 2019-09-30 2023-01-17 北京工业大学 System and method for robot behavior prejudgment based on multilayer neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129039A (en) * 1988-09-17 1992-07-07 Sony Corporation Recurrent neural network with variable size intermediate layer
JP3412700B2 (en) * 1993-06-28 2003-06-03 日本電信電話株式会社 Neural network type pattern learning method and pattern processing device
CN1372506A (en) * 2000-03-24 2002-10-02 索尼公司 Method for determining action of robot and robot equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JP特许3412700B2 2003.03.28

Also Published As

Publication number Publication date
CN102063640A (en) 2011-05-18

Similar Documents

Publication Publication Date Title
CN102063640B (en) Robot behavior learning model based on utility differential network
Liu et al. Selective ensemble learning method for belief-rule-base classification system based on PAES
CN113141012A (en) Power grid power flow regulation and control decision reasoning method based on deep deterministic strategy gradient network
Abraham et al. Evolutionary Design of Neuro-Fuzzy Systems-A Generic Framework
Ballini et al. Learning in recurrent, hybrid neurofuzzy networks
Grosan et al. Hybrid intelligent systems
Karayiannis Learning algorithms for reformulated radial basis neural networks
Desouky et al. Learning in n-pursuer n-evader differential games
Gudino-Penaloza et al. Fuzzy hyperheuristic framework for GA parameters tuning
Angélico et al. Heuristic search applied to fuzzy cognitive maps learning
Figueiredo et al. Reinforcement learning/spl I. bar/hierarchical neuro-fuzzy politree model for control of autonomous agents
Cabrita et al. Single and multi-objective genetic programming design for B-spline neural networks and neuro-fuzzy systems
Otadi Simulation and evaluation of second-order fuzzy boundary value problems
Jacob et al. Self-reorganizing TSK fuzzy inference system with BCM theory of meta-plasticity
Ballini et al. Heuristic learning in recurrent neural fuzzy networks
Abraham Beyond integrated neuro-fuzzy systems: Reviews, prospects, perspectives and directions
Šlapák et al. Multiobjective genetic programming of agent decision strategies
Gope et al. Optimization of Fuzzy Neural Network Using Multiobjective NSGA-II
Zhang et al. Combat Decision-Making Modeling Method Based on Genetic Neural Network
Hassan et al. A multi-objective genetic type-2 fuzzy extreme learning system for the identification of nonlinear dynamic systems
Obaid et al. Study the Neural Network Algorithms of Mathematical Numerical Optimization
Talbi et al. Design of optimal fuzzy controllers for stabilization of a Helicopter Simulator using hybrid Elite Genetic Algorithm and Tabu Search
Kaul et al. Deep Learning-based Advancement in Fuzzy Logic Controller
Vítku et al. Towards evolutionary design of complex systems inspired by nature
Yang et al. Design of short-term load forecasting model based on fuzzy neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130130

Termination date: 20131129