CN102402712B - Robot reinforced learning initialization method based on neural network - Google Patents

Robot reinforced learning initialization method based on neural network

Info

Publication number
CN102402712B
CN102402712B CN201110255530.7A CN201110255530A
Authority
CN
China
Prior art keywords
state
robot
neural network
learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110255530.7A
Other languages
Chinese (zh)
Other versions
CN102402712A (en)
Inventor
李贻斌
宋勇
李彩虹
李彬
荣学文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201110255530.7A priority Critical patent/CN102402712B/en
Publication of CN102402712A publication Critical patent/CN102402712A/en
Application granted granted Critical
Publication of CN102402712B publication Critical patent/CN102402712B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention provides a neural-network-based initialization method for robot reinforcement learning. The neural network has the same topology as the robot's workspace, and each neuron corresponds to one discrete state of the state space. The method comprises the following steps: the neural network evolves according to the known partial environment information until it reaches an equilibrium state, at which point the output value of each neuron represents the maximum cumulative return obtainable from the corresponding state by following the optimal policy; the initial value of the Q function is defined as the immediate reward of the current state plus the maximum discounted cumulative return obtained by the successor state following the optimal policy; in this way the neural network maps the known environment information into the initial values of the Q function. Prior knowledge is thus incorporated into the robot's learning system, improving the robot's learning ability in the early stage of reinforcement learning. Compared with the conventional Q-learning algorithm, the method effectively improves learning efficiency in the initial stage and accelerates convergence of the algorithm.

Description

Robot reinforced learning initialization method based on neural network
Technical field
The present invention relates to a method of incorporating prior knowledge into a mobile robot's learning system by initializing the Q values of the robot's reinforcement learning process, and belongs to the field of machine learning technology.
Background technology
As the application fields of robots keep expanding, the tasks that robots face become increasingly complex. Although researchers can often pre-program the repetitive behaviors a robot may perform, designing behaviors that realize the entire expected behavior becomes more and more difficult, and a designer can rarely predict all of a robot's behaviors reasonably in advance. Therefore, an autonomous robot capable of perceiving its environment must acquire new behaviors through online learning from interaction with the environment, so that it can select the optimal actions for reaching its goal according to the specific task.
Reinforcement learning finds optimal behavior policies through a trial-and-error procedure similar to human thinking, and currently shows good learning performance in robot behavior learning. The Q-learning algorithm is a reinforcement learning method for solving Markov decision problems with incomplete information: according to the state of the environment and the immediate reward obtained in the previous learning step, it modifies the mapping policy from states to actions so as to maximize the cumulative reward the behavior obtains from the environment, thereby obtaining the optimal behavior policy. The standard Q-learning algorithm usually initializes the Q values to zero or to random numbers; the robot then has no prior knowledge of the environment and can only select actions randomly in the early stage of learning, so the algorithm converges slowly in complex environments. To speed up convergence, researchers have proposed many improvements to Q-learning that raise learning efficiency and improve learning performance.
In general, methods for accelerating the convergence of Q-learning fall into two classes: designing a suitable reward function, and initializing the Q function reasonably. Researchers have proposed many improved Q-learning algorithms that let the robot obtain more effective rewards during reinforcement learning, mainly including associative Q-learning, lazy Q-learning, and Bayesian Q-learning. Their basic purpose is to incorporate implicit information that is valuable to the robot into the reward function, thereby accelerating convergence. Associative Q-learning compares the current reward with the immediate rewards of past moments and selects the action with the larger reward value; this associative reward method improves the learning ability of the system and reduces the number of iterations needed to reach the optimum. Lazy Q-learning aims to provide a way of predicting the immediate reward of a state: during learning it exploits the message-delay principle, makes predictions for new targets when necessary, and an action comparator examines the expected reward of each situation and then executes the action with the largest expected reward. Bayesian Q-learning uses probability distributions to describe the uncertainty of the robot's estimates of state-action Q values; during learning it considers the distribution of Q values at the previous moment, updates the prior distribution with the robot's learning experience, and represents the maximum cumulative return of the current state with Bayesian variables. The Bayesian approach essentially improves the exploration strategy of Q-learning and thereby improves its performance.
Because the reinforcement signal in standard reinforcement learning is a scalar computed from the state value function, human knowledge and behavior patterns cannot be incorporated into the learning system. Yet during robot learning, people often have experience and knowledge of the relevant domain; feeding human cognition and intelligence back to the robot in the form of reinforcement signals during learning can reduce the dimension of the state space and accelerate convergence. Addressing the problems of standard reinforcement learning in interactive settings, Thomaz et al. let a human provide an external reinforcement signal in real time during robot reinforcement learning; the human adjusts the training behavior according to his or her own experience and guides the robot to explore prospectively. Arsenio proposed a learning strategy that labels training data online and automatically, acquiring training data by triggering specific events during interaction, so that the teacher is embedded in the feedback loop of reinforcement learning. Mirza et al. proposed an architecture based on interaction history, in which the robot carries out reinforcement learning using the historical experience of social interaction with people, and can gradually acquire appropriate behaviors in simple games played with a human.
Another way to improve the performance of the Q-learning algorithm is to incorporate prior knowledge into the learning system by initializing the Q values. Existing Q-value initialization methods mainly include approximate-function methods, fuzzy-rule methods, and potential-function methods. Approximate-function methods use intelligent systems such as neural networks to approximate the optimal value function and turn prior knowledge into reward values, so that the robot learns within a subset of the whole state space and convergence is accelerated. Fuzzy-rule methods build a fuzzy rule base from the initial environment information and then use fuzzy logic to initialize the Q values; because the fuzzy rules built in this way are all set manually from environment information, they often fail to reflect the robot's environment objectively, which makes the algorithm unstable. Potential-function methods define a state potential function over the whole state space, with the potential at each point corresponding to a discrete state value, and then use the state potential function to initialize the Q values; the Q value of the learning system can then be expressed as the initial value plus the change produced by each iteration. Among its various behaviors a robot must observe a set of behavior rules, and it develops the corresponding behaviors and intelligence through cognition and interaction; initializing the Q values for robot reinforcement learning is precisely the conversion of prior knowledge into corresponding robot behaviors. Therefore, how to obtain a regularized representation of prior knowledge, and in particular how to realize machine inference over domain experts' experience and common knowledge and convert human cognition and intelligence into machine computation and reasoning that integrates human and machine intelligence, is an urgent problem for robot behavior learning.
Summary of the invention
In view of the state of the art of robot reinforcement learning and its shortcomings, the present invention proposes a neural-network-based initialization method for robot reinforcement learning that can effectively improve learning efficiency in the initial stage and accelerate convergence. The method incorporates prior knowledge into the learning system through Q-value initialization and optimizes the robot's learning in the initial stage, thereby providing the robot with a good learning foundation.
In the neural-network-based robot reinforcement learning initialization method of the present invention, the neural network has the same topology as the robot's workspace, and each neuron corresponds to one discrete state of the state space. First the neural network evolves according to the known partial environment information until it reaches an equilibrium state; at that point the output value of each neuron represents the maximum cumulative return obtainable from its corresponding state. Then the immediate reward of the current state is added to the maximum discounted cumulative return obtained by the successor state following the optimal policy (the discount factor multiplied by the maximum cumulative return), which sets a reasonable initial value for every state-action pair Q(s, a). Through this Q-value initialization, prior knowledge is incorporated into the learning system and the robot's learning in the initial stage is optimized, providing the robot with a good learning foundation. The method specifically comprises the following steps:
(1) Establish the neural network model
The neural network has the same topology as the robot's workspace. Each neuron is connected only to the neurons in its local neighborhood, all connections have the same form, all connection weights are equal, and information propagates bidirectionally between neurons, so the network has a highly parallel architecture. Each neuron corresponds to one discrete state of the robot's workspace, and the whole network forms a two-dimensional topology of N × N neurons. During evolution the network updates the states in each neuron's neighborhood according to the input of every discrete state until it reaches an equilibrium state. At equilibrium the neuron output values form a single-peaked surface, and the value of every point on the surface represents the maximum cumulative return obtainable from the corresponding state;
(2) Design the reward function
During learning the robot can move in 4 directions and can select any of the 4 actions (up, down, left, right) in any state; the action is selected according to the current state. If the action brings the robot to the target, the immediate reward is 1; if the robot collides with an obstacle or another robot, the immediate reward is -0.2; if the robot moves in free space, the immediate reward is -0.1;
(3) Compute the initial maximum cumulative returns
When the neural network reaches the equilibrium state, the maximum cumulative return $V^*_{\text{Init}}(s_i)$ of the state corresponding to each neuron is defined to equal that neuron's output value $x_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i,$$

where $x_i$ is the output value of the $i$-th neuron when the network reaches equilibrium, and $V^*_{\text{Init}}(s_i)$ is the maximum cumulative return obtainable by starting from state $s_i$ and following the optimal policy;
(4) Initialize the Q values
The initial value of $Q(s_i, a)$ is defined as the immediate reward $r$ obtained by selecting action $a$ in state $s_i$ plus the maximum discounted cumulative return of the successor state:

$$Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$$

where $s_j$ is the successor state produced by the robot selecting action $a$ in state $s_i$, $Q_{\text{Init}}(s_i, a)$ is the initial Q value of the state-action pair $(s_i, a)$, and $\gamma$ is the discount factor, chosen as $\gamma = 0.95$;
(5) Steps of the neural-network-based robot reinforcement learning
(a) The neural network evolves according to the initial environment information until it reaches the equilibrium state;
(b) The initial value of the maximum cumulative return obtainable from state $s_i$ is defined as the neuron output value $x_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i$$

(c) The Q values are initialized according to the rule $Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$;
(d) Observe the current state $s_t$;
(e) Continue exploring the complex environment: select an action $a_t$ in the current state $s_t$ and execute it; the environment state is updated to the new state $s'_t$ and the immediate reward $r_t$ is received;
(f) Observe the new state $s'_t$;
(g) Update the table entry $Q(s_t, a_t)$ according to

$$Q_t(s_t, a_t) = (1 - \alpha_t) Q_{t-1}(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'_t} Q_{t-1}(s'_t, a'_t) \right)$$

where $\alpha_t$ is the learning rate, with values in (0, 1), usually 0.5, decaying as learning proceeds; $Q_{t-1}(s_t, a_t)$ and $Q_{t-1}(s'_t, a'_t)$ are the values of the state-action pairs $(s_t, a_t)$ and $(s'_t, a'_t)$ at time $t-1$; and $a'_t$ is the action selected in the new state $s'_t$;
(h) Judge whether the robot has reached the target or the learning system has reached the set maximum number of learning episodes, where the set maximum should guarantee that the learning system converges within it; if either condition is met, learning ends, otherwise return to step (d) and continue learning.
The present invention maps the known environment information into the initial values of the Q function through the neural network, thereby incorporating prior knowledge into the robot learning system and improving the robot's learning ability in the initial stage of reinforcement learning. Compared with the traditional Q-learning algorithm, it effectively improves learning efficiency in the initial stage and accelerates convergence of the algorithm.
Description of the drawings
Fig. 1 is a schematic diagram of the neighborhood structure of the i-th neuron.
Fig. 2 is a schematic diagram of the neuron output values in the neighborhood of the robot's target point.
Fig. 3 is a schematic diagram of the initial maximum cumulative return values $V^*_{\text{Init}}$.
Fig. 4 is a schematic diagram of the neuron output values when the neural network reaches the equilibrium state.
Fig. 5 is a schematic diagram of the robot path planned by existing Q-learning.
Fig. 6 is a schematic diagram of the convergence process of the existing Q-learning algorithm.
Fig. 7 is a schematic diagram of the robot path planned by the present invention.
Fig. 8 is a schematic diagram of the learning convergence process of the present invention.
Embodiment
The present invention initializes robot reinforcement learning on the basis of a neural network. The neural network has the same topology as the robot's workspace; when the network reaches an equilibrium state, each neuron's output value represents the maximum cumulative return of the corresponding state, and the initial value of the Q function is obtained from the immediate reward of the current state plus the maximum discounted cumulative return of the successor state. Through this Q-value initialization, prior knowledge is incorporated into the learning system and the robot's learning in the initial stage is optimized, providing the robot with a good learning foundation. The method specifically comprises the following steps:
1 Neural network model
The neural network has the same topology as the robot's workspace, and each neuron corresponds to one discrete state of the workspace. Every neuron is connected only to the neurons in its local neighborhood, all connections have the same form, and the whole network forms a two-dimensional topology of N × N neurons. The network has a highly parallel architecture, all connection weights are equal, and information propagates bidirectionally between neurons. During evolution the network updates the states in each neuron's neighborhood according to the input of every discrete state, so the whole network can be regarded as a discrete-time dynamical system.
During evolution, the target point and the obstacle positions are mapped onto the network topology to produce the external inputs of the network: the neurons corresponding to obstacle regions receive negative external inputs, and the target-point neuron receives a positive external input. The network evolves under these external inputs, and the positive output of the target-point neuron propagates gradually, with attenuation, through the local neuron connections to the whole state space until an equilibrium state is reached. The S-shaped activation function guarantees that the target-point neuron has the globally largest positive output value and that the outputs of the obstacle-region neurons are suppressed to zero. Once the network reaches equilibrium, the neuron output values form a single-peaked surface, and the value of each point on the surface represents the maximum cumulative return obtainable from its corresponding state.
Suppose the robot's workspace consists of 20 × 20 grid cells. The neural network has the same topology, so it also contains 20 × 20 neurons, each corresponding to one discrete state of the workspace. Each neuron is connected only to the neurons in its local neighborhood; the connections between the i-th neuron and the neurons in its neighborhood are shown in Fig. 1. The whole network forms a two-dimensional topology of 20 × 20 neurons with a highly parallel architecture and equal connection weights. During the evolution of the network every neuron is both an input neuron and an output neuron, information propagates bidirectionally between neurons, and the whole network can be regarded as a discrete-time dynamical system.
The i-th neuron of the network corresponds to the i-th discrete state of the workspace, and its discrete-time dynamics are

$$x_i(t+1) = \begin{cases} 1 & \text{if } i = i^* \\ f\!\left(\sum_{j=1}^{N} w_{ij}\, x_j(t) + I_i(t)\right) & \text{otherwise} \end{cases}$$

where $i^*$ is the index of the target neuron, $x_i(t)$ is the output value of the $i$-th neuron at time $t$, $N$ is the number of neurons in the $i$-th neuron's neighborhood, $I_i(t)$ is the external input to the $i$-th neuron at time $t$, $f$ is the activation function, and $w_{ij}$ is the connection weight from the $j$-th neuron to the $i$-th neuron, computed as

$$w_{ij} = \begin{cases} e^{-\eta |i-j|^2} & \text{if } |i-j| \le r \\ 0 & \text{if } |i-j| > r \end{cases}$$

where $|i-j|$ is the Euclidean distance between the positions $x_i$ and $x_j$ in the workspace. Because each neuron is connected only to the neurons in its local neighborhood, $r$ is taken as 1, which guarantees that the neuron outputs form a single-peaked surface when the network reaches equilibrium; $\eta$ takes values in (1, 2). Clearly $w_{ij} = w_{ji}$, i.e. the weights are symmetric. The neuron activation function is an S-shaped function defined as

$$f(x) = \begin{cases} 0 & \text{if } x \le 0 \\ kx & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1 \end{cases}$$

where $k$ is the slope of the linear segment, with values in (0, 1). This $f(x)$ guarantees that the positive output of the target-point neuron propagates gradually, with attenuation, to the whole state space, that the target point has the globally largest positive output value, and that the outputs of the obstacle-region neurons are suppressed to zero. The external input of the $i$-th neuron is produced by mapping the target point and the obstacle positions onto the network topology, and is defined as

$$I_i(t) = \begin{cases} V & \text{if state } i \text{ is the target} \\ -V & \text{if state } i \text{ is an obstacle} \\ 0 & \text{otherwise} \end{cases}$$

where $V$ is a large constant. To guarantee that the target-point neuron has the globally largest output value and the obstacle-region neurons have the globally smallest output values, the value of $V$ must exceed the sum of a neuron's inputs; it is taken as a real number greater than 4.
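The evolution of this grid network is compact enough to prototype directly. The sketch below is a minimal illustration only: the function name `evolve_network`, the default parameter values (η = 1.5, k = 0.5, V = 5) and the 4-neighborhood implied by taking r = 1 with Euclidean distance are assumptions made for the example, not values prescribed by the invention beyond the ranges stated above.

```python
import numpy as np

def evolve_network(grid_shape, target, obstacles,
                   eta=1.5, k=0.5, V=5.0, r=1.0,
                   max_iters=1000, tol=1e-6):
    """Evolve the locally connected grid network until it reaches equilibrium.

    Each grid cell is one neuron/state.  The target neuron is clamped to 1,
    the target receives external input +V, obstacles receive -V, and every
    other neuron updates through the saturating activation f.
    """
    rows, cols = grid_shape
    x = np.zeros(grid_shape)                   # neuron outputs x_i(t)
    I = np.zeros(grid_shape)                   # external inputs I_i(t)
    I[target] = V
    for obs in obstacles:
        I[obs] = -V

    # connection weights w_ij = exp(-eta*|i-j|^2) for neighbours with |i-j| <= r
    neigh = [(di, dj, np.exp(-eta * (di * di + dj * dj)))
             for di in (-1, 0, 1) for dj in (-1, 0, 1)
             if 0 < np.hypot(di, dj) <= r]

    def f(v):                                  # f(x)=0 (x<=0), kx (0<x<1), 1 (x>=1)
        if v <= 0.0:
            return 0.0
        return k * v if v < 1.0 else 1.0

    for _ in range(max_iters):
        x_new = np.zeros_like(x)
        for i in range(rows):
            for j in range(cols):
                if (i, j) == target:
                    x_new[i, j] = 1.0          # x_{i*}(t+1) = 1 for the target neuron
                    continue
                s = I[i, j]
                for di, dj, w in neigh:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        s += w * x[ni, nj]     # sum_j w_ij * x_j(t)
                x_new[i, j] = f(s)
        if np.max(np.abs(x_new - x)) < tol:    # equilibrium reached
            return x_new
        x = x_new
    return x
```

For a 20 × 20 workspace, a call such as `evolve_network((20, 20), target=(18, 18), obstacles=[(5, 5), (5, 6)])` returns the equilibrium output surface, whose peak sits at the target cell and decays toward zero at obstacles.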
2 Reward function design
During learning the robot can move in 4 directions and can select any of the 4 actions (up, down, left, right) in any state; the action is selected according to the current state. If the action brings the robot to the target, the immediate reward is 1; if the robot collides with an obstacle or another robot, the immediate reward is -0.2; if the robot moves in free space, the immediate reward is -0.1.
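Expressed as code, this reward design reduces to a small lookup. The function below is an illustrative sketch; the name `immediate_reward` and the representation of states as grid coordinates are assumptions of the example.

```python
GOAL_REWARD, COLLISION_REWARD, STEP_REWARD = 1.0, -0.2, -0.1

def immediate_reward(next_state, goal, obstacles):
    """Immediate reward r for the transition that lands in next_state."""
    if next_state == goal:
        return GOAL_REWARD          # the action brings the robot to the target
    if next_state in obstacles:
        return COLLISION_REWARD     # collision with an obstacle or another robot
    return STEP_REWARD              # ordinary move through free space
```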
3 Computing the initial maximum cumulative returns
The target point and the obstacle positions are mapped onto the network topology to produce the external inputs of the network: the neurons corresponding to obstacle regions receive negative external inputs, and the target-point neuron receives a positive external input. The network evolves under these external inputs, and the positive output of the target-point neuron propagates gradually, with attenuation, through the local neuron connections to the whole state space until an equilibrium state is reached. The S-shaped activation function guarantees that the target-point neuron has the globally largest positive output value and that the outputs of the obstacle-region neurons are suppressed to zero. After the network reaches equilibrium, the neuron output values form a single-peaked surface, as shown in Fig. 2, and the value of each point on the surface represents the maximum cumulative return obtainable from its corresponding state.
The cumulative return the robot obtains starting from an arbitrary initial state $s_t$ is defined as

$$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $\pi$ is the control policy, $r$ is the sequence of immediate rewards obtained, and $\gamma$ is the discount factor, with values in (0, 1); here $\gamma = 0.95$. The maximum cumulative return $V^*(s)$ the robot obtains by starting from state $s$ and following the optimal policy is

$$V^*(s) = \max_{\pi} V^\pi(s), \quad \forall s$$

When the neural network reaches the equilibrium state, the maximum cumulative return $V^*_{\text{Init}}(s_i)$ of the state corresponding to each neuron is defined to equal that neuron's output value $x_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i$$

where $x_i$ is the output value of the $i$-th neuron when the network reaches equilibrium, and $V^*_{\text{Init}}(s_i)$ is the maximum cumulative return obtainable by starting from state $s_i$ and following the optimal policy.
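Given the equilibrium outputs (for example those produced by the `evolve_network` sketch above), reading off the initial returns is a direct assignment. The helper below, with its illustrative name `initial_returns`, simply stores $V^*_{\text{Init}}(s_i) = x_i$ keyed by grid coordinates so it can feed the Q initialization that follows.

```python
def initial_returns(x):
    """V*_Init(s_i) <- x_i: take the equilibrium neuron outputs directly as the
    initial maximum cumulative returns of the corresponding states."""
    rows, cols = x.shape
    return {(i, j): float(x[i, j]) for i in range(rows) for j in range(cols)}
```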
4 Neural-network-based robot reinforcement learning
4.1 Traditional Q-learning algorithm
In a Markov decision process the robot perceives its surroundings through sensors, knows the current state, and selects the action to execute; the environment responds to this action, provides an immediate reward, and produces the successor state. The task of robot reinforcement learning is to obtain an optimal policy that lets the robot obtain the maximum discounted cumulative return from the current state. The cumulative return the robot obtains by following any policy $\pi$ from an arbitrary initial state is defined as

$$V^\pi(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $r_t$ is the immediate reward at time $t$ and $\gamma$ is the discount factor, with values in (0, 1); here $\gamma = 0.95$.
The optimal policy $\pi^*$ with which the robot obtains the maximum cumulative return from state $s$ is defined as

$$\pi^* \equiv \arg\max_{\pi} V^\pi(s), \quad \forall s$$

The maximum cumulative return the robot obtains by following the optimal policy $\pi^*$ from state $s$ is denoted $V^*(s)$. The value of the Q function is the immediate reward of the current state plus the maximum discounted cumulative return of the successor state, and is computed as

$$Q(s, a) \equiv (1 - \alpha_t) Q(s, a) + \alpha_t \left( r(s, a) + \gamma V^*(s') \right)$$

where $\alpha_t$ is the learning rate, with values in (0, 1); its initial value is usually 0.5, and it decays with the number of learning episodes. $V^*(s')$ and $Q(s', a')$ are related by

$$V^*(s') = \max_{a'} Q(s', a')$$

$Q(s_t, a_t)$ is updated according to

$$Q_t(s_t, a_t) = (1 - \alpha_t) Q_{t-1}(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'_t} Q_{t-1}(s'_t, a'_t) \right)$$

where $Q_{t-1}(s_t, a_t)$ and $Q_{t-1}(s'_t, a'_t)$ are the values of the state-action pairs $(s_t, a_t)$ and $(s'_t, a'_t)$ at time $t-1$, and $a'_t$ is the action selected in the new state $s'_t$.
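The update above is the standard tabular Q-learning rule. A minimal sketch, assuming the Q table is stored as a dict of per-state action dicts (an assumption of the example, not part of the invention), is:

```python
def q_update(Q, s, a, r, s_next, alpha, gamma=0.95):
    """One tabular Q-learning update:
    Q_t(s,a) = (1 - alpha) * Q_{t-1}(s,a) + alpha * (r + gamma * max_a' Q_{t-1}(s',a'))."""
    best_next = max(Q[s_next].values())            # V*(s') = max_a' Q(s', a')
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]
```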
4.2 Q-value initialization
The neural network evolves according to the known environment information until it reaches the equilibrium state; at that point the maximum cumulative return obtainable from each discrete state is defined to equal the output value of its corresponding neuron. Then the immediate reward obtained by executing the selected action from the current state is added to the maximum discounted cumulative return that the successor state obtains by following the optimal policy, which sets a reasonable initial value for every state-action pair $Q(s_i, a)$. The initial value of $Q(s_i, a)$ is computed as

$$Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$$

where $r$ is the immediate reward obtained by selecting action $a$ in state $s_i$; $\gamma$ is the discount factor, with values in (0, 1), chosen here as $\gamma = 0.95$; $s_j$ is the successor state produced by the robot selecting action $a$ in state $s_i$; and $Q_{\text{Init}}(s_i, a)$ is the initial Q value of the state-action pair $(s_i, a)$.
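A minimal sketch of this initialization step is given below. The helpers `transition(s, a)` and `reward(s, a, s_next)` are assumed to encode the deterministic grid moves and the reward design of section 2; their names are illustrative.

```python
def initialize_q(V_init, states, actions, transition, reward, gamma=0.95):
    """Q_Init(s_i, a) = r + gamma * V*_Init(s_j), where s_j = transition(s_i, a)."""
    Q = {}
    for s in states:
        Q[s] = {}
        for a in actions:
            s_next = transition(s, a)              # successor state s_j
            r = reward(s, a, s_next)               # immediate reward r
            Q[s][a] = r + gamma * V_init[s_next]   # NN-derived initial value
    return Q
```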
4.3 The neural-network-based Q-learning algorithm of the present invention
(1) The neural network evolves according to the initial environment information until it reaches the equilibrium state.
(2) The neuron output value $x_i$ is used to initialize the maximum cumulative return obtainable from state $s_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i$$

(3) The Q values are initialized according to the rule

$$Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$$

(4) Observe the current state $s_t$.
(5) Continue exploring the complex environment: select an action $a_t$ in the current state $s_t$ and execute it; the environment state is updated to the new state $s'_t$ and the immediate reward $r_t$ is received.
(6) Observe the new state $s'_t$.
(7) Update the table entry $Q(s_t, a_t)$ according to

$$Q_t(s_t, a_t) = (1 - \alpha_t) Q_{t-1}(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'_t} Q_{t-1}(s'_t, a'_t) \right)$$

(8) Judge whether the robot has reached the target or the learning system has reached the set maximum number of learning episodes (the set maximum should guarantee that the learning system converges within it; in the experimental environment of the present invention the maximum is set to 300). If either condition is met, learning ends; otherwise return to step (4) and continue learning.
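Putting steps (4)-(8) together, a compact learning loop that consumes the table produced by `initialize_q` might look as follows. This is a sketch under assumptions of the example: the epsilon-greedy exploration rule, the learning-rate decay schedule, and the per-episode step limit are not fixed by the invention, which only requires the learning rate to start around 0.5 and decay, and caps learning at 300 episodes.

```python
import random

def q_learning_from_nn_init(Q, actions, transition, reward, goal, start,
                            gamma=0.95, alpha0=0.5, epsilon=0.1,
                            max_episodes=300, max_steps=400):
    """Run steps (4)-(8) starting from the neural-network-initialised Q table."""
    for episode in range(max_episodes):
        s = start                                    # (4) observe the current state
        alpha = alpha0 / (1.0 + episode / 50.0)      # decaying learning rate (assumed schedule)
        for _ in range(max_steps):
            if random.random() < epsilon:            # (5) keep exploring ...
                a = random.choice(actions)
            else:                                    # ... or exploit current estimates
                a = max(Q[s], key=Q[s].get)
            s_next = transition(s, a)                # environment moves to s'
            r = reward(s, a, s_next)                 # immediate reward r_t
            best_next = max(Q[s_next].values())      # max_a' Q(s', a')
            Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)  # (7)
            s = s_next                               # (6) observe the new state
            if s == goal:                            # (8) goal reached: end the episode
                break
    return Q
```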
To illustrate the Q-value initialization process for robot reinforcement learning, the neighborhood of the robot's target point is used as a demonstration. When the neural network reaches the equilibrium state, the neuron output values in this neighborhood are the values shown inside the nodes of Fig. 3; each node corresponds to one discrete state, and the maximum cumulative return of each state equals the output value of that state's neuron. The red node represents the goal state and the grey nodes represent obstacles. Each arrow represents an action: if the action brings the robot to the goal state G, the immediate reward is 1; if the robot collides with an obstacle or another robot, the immediate reward is -0.2; if the robot moves in free space, the immediate reward is -0.1. With the discount factor $\gamma = 0.95$, the initial values of the Q function are obtained from the Q-value initialization formula; the initial Q value of each state-action pair is shown as the number next to the corresponding arrow in Fig. 4. After initialization, the robot can select an appropriate action from any initial state, so when it faces a relatively complex environment it already acts with a certain purpose in the initial stage of learning instead of selecting actions completely at random, which accelerates convergence of the algorithm.
Simulation experiments were carried out on the mobile-robot environment-modelling and exploration software platform built in our laboratory. Fig. 5 shows the robot path planned by the existing robot reinforcement learning method, and Fig. 6 shows the convergence process of the existing algorithm. The learning algorithm begins to converge after 145 learning episodes, and in the initial stage of learning (for example the first 20 episodes) the robot generally cannot reach the target point within the maximum number of iterations. This is because the Q values are initialized to 0, so the robot has no prior knowledge and can only select actions randomly; the learning efficiency of the initial stage is therefore low and the algorithm converges slowly.
Fig. 7 shows the robot path planned by the present invention, and Fig. 8 shows its convergence process. The learning algorithm begins to converge after 76 learning episodes, and even in the initial stage of learning the robot can generally reach the target point within the maximum number of iterations. The present invention thus effectively improves the learning efficiency of the robot's initial stage and clearly accelerates the convergence of the learning process.

Claims (1)

1. A neural-network-based robot reinforcement learning initialization method, wherein the neural network has the same topology as the robot's workspace and each neuron corresponds to one discrete state of the state space; first the neural network evolves according to the known partial environment information until it reaches an equilibrium state, at which point the output value of each neuron represents the maximum cumulative return obtainable from its corresponding state; then the immediate reward of the current state is added to the maximum discounted cumulative return obtained by the successor state following the optimal policy, which sets a reasonable initial value for every state-action pair Q(s, a); through this Q-value initialization, prior knowledge is incorporated into the learning system and the robot's learning in the initial stage is optimized, thereby providing the robot with a good learning foundation; the method specifically comprises the following steps:
(1) Establish the neural network model
The neural network has the same topology as the robot's workspace. Each neuron is connected only to the neurons in its local neighborhood, all connections have the same form, all connection weights are equal, and information propagates bidirectionally between neurons, so the network has a highly parallel architecture. Each neuron corresponds to one discrete state of the robot's workspace, and the whole network forms a two-dimensional topology of N × N neurons. During evolution the network updates the states in each neuron's neighborhood according to the input of every discrete state until it reaches an equilibrium state. At equilibrium the neuron output values form a single-peaked surface, and the value of every point on the surface represents the maximum cumulative return obtainable from the corresponding state;
(2) Design the reward function
During learning the robot can move in 4 directions and selects an action according to the current state; in any state it can select a suitable action among the 4 actions of moving forward, moving backward, turning left, and turning right. If the action brings the robot to the target, the immediate reward is 1; if the robot collides with an obstacle or another robot, the immediate reward is -0.2; if the robot moves in free space, the immediate reward is -0.1;
(3) Compute the initial maximum cumulative returns
When the neural network reaches the equilibrium state, the maximum cumulative return $V^*_{\text{Init}}(s_i)$ of the state corresponding to each neuron is defined to equal that neuron's output value $x_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i,$$

where $x_i$ is the output value of the $i$-th neuron when the network reaches equilibrium, and $V^*_{\text{Init}}(s_i)$ is the maximum cumulative return obtainable by starting from state $s_i$ and following the optimal policy;
(4) Initialize the Q values
The initial value of $Q(s_i, a)$ is defined as the immediate reward $r$ obtained by selecting action $a$ in state $s_i$ plus the maximum discounted cumulative return of the successor state:

$$Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$$

where $s_j$ is the successor state produced by the robot selecting action $a$ in state $s_i$, $Q_{\text{Init}}(s_i, a)$ is the initial Q value of the state-action pair $(s_i, a)$, and $\gamma$ is the discount factor, chosen as $\gamma = 0.95$;
(5) Steps of the neural-network-based robot reinforcement learning
(a) The neural network evolves according to the initial environment information until it reaches the equilibrium state;
(b) The initial value of the maximum cumulative return obtainable from state $s_i$ is defined as the neuron output value $x_i$:

$$V^*_{\text{Init}}(s_i) \leftarrow x_i$$

(c) The Q values are initialized according to the rule $Q_{\text{Init}}(s_i, a) = r + \gamma V^*_{\text{Init}}(s_j)$;
(d) Observe the current state $s_t$;
(e) Continue exploring the complex environment: select an action $a_t$ in the current state $s_t$ and execute it; the environment state is updated to the new state $s'_t$ and the immediate reward $r_t$ is received;
(f) Observe the new state $s'_t$;
(g) Update the table entry $Q(s_t, a_t)$ according to

$$Q_t(s_t, a_t) = (1 - \alpha_t) Q_{t-1}(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'_t} Q_{t-1}(s'_t, a'_t) \right)$$

where $\alpha_t$ is the learning rate, taking the value 0.5 and decaying as learning proceeds; $Q_{t-1}(s_t, a_t)$ and $Q_{t-1}(s'_t, a'_t)$ are the values of the state-action pairs $(s_t, a_t)$ and $(s'_t, a'_t)$ at time $t-1$; and $a'_t$ is the action selected in the new state $s'_t$;
(h) Judge whether the robot has reached the target or the learning system has reached the set maximum number of learning episodes, where the set maximum should guarantee that the learning system converges within it; if either condition is met, learning ends, otherwise return to step (d) and continue learning.
CN201110255530.7A 2011-08-31 2011-08-31 Robot reinforced learning initialization method based on neural network Expired - Fee Related CN102402712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110255530.7A CN102402712B (en) 2011-08-31 2011-08-31 Robot reinforced learning initialization method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110255530.7A CN102402712B (en) 2011-08-31 2011-08-31 Robot reinforced learning initialization method based on neural network

Publications (2)

Publication Number Publication Date
CN102402712A CN102402712A (en) 2012-04-04
CN102402712B true CN102402712B (en) 2014-03-05

Family

ID=45884895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110255530.7A Expired - Fee Related CN102402712B (en) 2011-08-31 2011-08-31 Robot reinforced learning initialization method based on neural network

Country Status (1)

Country Link
CN (1) CN102402712B (en)

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799179B (en) * 2012-07-06 2014-12-31 山东大学 Mobile robot path planning algorithm based on single-chain sequential backtracking Q-learning
CN102819264B (en) * 2012-07-30 2015-01-21 山东大学 Path planning Q-learning initial method of mobile robot
CN103218655B (en) * 2013-03-07 2016-02-24 西安理工大学 Based on the nitrification enhancement of Mechanism of immunotolerance
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
CN104317297A (en) * 2014-10-30 2015-01-28 沈阳化工大学 Robot obstacle avoidance method under unknown environment
US10628733B2 (en) * 2015-04-06 2020-04-21 Deepmind Technologies Limited Selecting reinforcement learning actions using goals and observations
CN104932264B (en) * 2015-06-03 2018-07-20 华南理工大学 The apery robot stabilized control method of Q learning frameworks based on RBF networks
CN104932267B (en) * 2015-06-04 2017-10-03 曲阜师范大学 A kind of neural network lea rning control method of use eligibility trace
CN104932847B (en) * 2015-06-08 2018-01-19 三维泰柯(厦门)电子科技有限公司 A kind of spatial network 3D printing algorithm
CN105700526B (en) * 2016-01-13 2018-07-27 华北理工大学 Online limit of sequence learning machine method with independent learning ability
CN105740644B (en) * 2016-03-24 2018-04-13 苏州大学 A kind of clean robot optimal objective paths planning method based on model learning
CN105955921B (en) * 2016-04-18 2019-03-26 苏州大学 Robot Hierarchical reinforcement learning initial method based on automatic discovery abstract action
JP6453805B2 (en) * 2016-04-25 2019-01-16 ファナック株式会社 Production system for setting judgment values for variables related to product abnormalities
CN106295637B (en) * 2016-07-29 2019-05-03 电子科技大学 A kind of vehicle identification method based on deep learning and intensified learning
WO2018053187A1 (en) 2016-09-15 2018-03-22 Google Inc. Deep reinforcement learning for robotic manipulation
JP6662746B2 (en) * 2016-10-07 2020-03-11 ファナック株式会社 Work assistance system with machine learning unit
JP6875513B2 (en) * 2016-10-10 2021-05-26 ディープマインド テクノロジーズ リミテッド Neural network for selecting actions to be performed by robot agents
KR102532658B1 (en) 2016-10-28 2023-05-15 구글 엘엘씨 Neural architecture search
CN108229640B (en) * 2016-12-22 2021-08-20 山西翼天下智能科技有限公司 Emotion expression method and device and robot
JP6603257B2 (en) 2017-03-31 2019-11-06 ファナック株式会社 Behavior information learning device, management device, robot control system, and behavior information learning method
EP3593288A1 (en) * 2017-05-26 2020-01-15 Deepmind Technologies Limited Training action selection neural networks using look-ahead search
CN107030704A (en) * 2017-06-14 2017-08-11 郝允志 Educational robot control design case based on neuroid
CN107102644B (en) * 2017-06-22 2019-12-10 华南师范大学 Underwater robot track control method and control system based on deep reinforcement learning
CN107516112A (en) * 2017-08-24 2017-12-26 北京小米移动软件有限公司 Object type recognition methods, device, equipment and storage medium
CN107688851A (en) * 2017-08-26 2018-02-13 胡明建 A kind of no aixs cylinder transmits the design method of artificial neuron entirely
CN107562053A (en) * 2017-08-30 2018-01-09 南京大学 A kind of Hexapod Robot barrier-avoiding method based on fuzzy Q-learning
CN107729953B (en) * 2017-09-18 2019-09-27 清华大学 Robot plume method for tracing based on continuous state behavior domain intensified learning
US10935982B2 (en) * 2017-10-04 2021-03-02 Huawei Technologies Co., Ltd. Method of selection of an action for an object using a neural network
CN108051999B (en) * 2017-10-31 2020-08-25 中国科学技术大学 Accelerator beam orbit control method and system based on deep reinforcement learning
US11164077B2 (en) * 2017-11-02 2021-11-02 Siemens Aktiengesellschaft Randomized reinforcement learning for control of complex systems
CN117451069A (en) * 2017-11-07 2024-01-26 金陵科技学院 Robot indoor walking reinforcement learning path navigation algorithm
CN110196587A (en) * 2018-02-27 2019-09-03 中国科学院深圳先进技术研究院 Vehicular automatic driving control strategy model generating method, device, equipment and medium
CN108594803B (en) * 2018-03-06 2020-06-12 吉林大学 Path planning method based on Q-learning algorithm
CN108427283A (en) * 2018-04-04 2018-08-21 浙江工贸职业技术学院 A kind of control method that the compartment intellect service robot based on neural network is advanced
CN108563971A (en) * 2018-04-26 2018-09-21 广西大学 The more reader anti-collision algorithms of RFID based on depth Q networks
CN109032168B (en) * 2018-05-07 2021-06-08 西安电子科技大学 DQN-based multi-unmanned aerial vehicle collaborative area monitoring airway planning method
US11734575B2 (en) 2018-07-30 2023-08-22 International Business Machines Corporation Sequential learning of constraints for hierarchical reinforcement learning
US11537872B2 (en) * 2018-07-30 2022-12-27 International Business Machines Corporation Imitation learning by action shaping with antagonist reinforcement learning
CN109663359B (en) * 2018-12-06 2022-03-25 广州多益网络股份有限公司 Game intelligent agent training optimization method and device, terminal device and storage medium
CN110070188B (en) * 2019-04-30 2021-03-30 山东大学 Incremental cognitive development system and method integrating interactive reinforcement learning
CN110307848A (en) * 2019-07-04 2019-10-08 南京大学 A kind of Mobile Robotics Navigation method
CN110333739B (en) * 2019-08-21 2020-07-31 哈尔滨工程大学 AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
CN110703792B (en) * 2019-11-07 2022-12-30 江苏科技大学 Underwater robot attitude control method based on reinforcement learning
CN111552183B (en) * 2020-05-17 2021-04-23 南京大学 Six-legged robot obstacle avoidance method based on adaptive weight reinforcement learning
CN112297005B (en) * 2020-10-10 2021-10-22 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN114310870A (en) * 2021-11-10 2022-04-12 达闼科技(北京)有限公司 Intelligent agent control method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5402521A (en) * 1990-02-28 1995-03-28 Chiyoda Corporation Method for recognition of abnormal conditions using neural networks
WO2007135723A1 (en) * 2006-05-22 2007-11-29 Fujitsu Limited Neural network learning device, method, and program
CN101320251A (en) * 2008-07-15 2008-12-10 华南理工大学 Robot ambulation control method based on confirmation learning theory
CN102063640A (en) * 2010-11-29 2011-05-18 北京航空航天大学 Robot behavior learning model based on utility differential network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5402521A (en) * 1990-02-28 1995-03-28 Chiyoda Corporation Method for recognition of abnormal conditions using neural networks
WO2007135723A1 (en) * 2006-05-22 2007-11-29 Fujitsu Limited Neural network learning device, method, and program
CN101320251A (en) * 2008-07-15 2008-12-10 华南理工大学 Robot ambulation control method based on confirmation learning theory
CN102063640A (en) * 2010-11-29 2011-05-18 北京航空航天大学 Robot behavior learning model based on utility differential network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Path planning method for mobile robots based on neural networks; 宋勇 et al.; Systems Engineering and Electronics; Feb. 2008; Vol. 30, No. 2; pp. 316-319 *

Also Published As

Publication number Publication date
CN102402712A (en) 2012-04-04

Similar Documents

Publication Publication Date Title
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN102819264B (en) Path planning Q-learning initial method of mobile robot
Yao et al. Path planning method with improved artificial potential field—a reinforcement learning perspective
Qiang et al. Reinforcement learning model, algorithms and its application
CN102799179B (en) Mobile robot path planning algorithm based on single-chain sequential backtracking Q-learning
Parhi et al. IWO-based adaptive neuro-fuzzy controller for mobile robot navigation in cluttered environments
CN110014428B (en) Sequential logic task planning method based on reinforcement learning
Ma et al. Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction
Ma et al. State-chain sequential feedback reinforcement learning for path planning of autonomous mobile robots
Wang et al. Adaptive environment modeling based reinforcement learning for collision avoidance in complex scenes
Sun et al. A Fuzzy-Based Bio-Inspired Neural Network Approach for Target Search by Multiple Autonomous Underwater Vehicles in Underwater Environments.
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
Liu et al. Autonomous exploration for mobile robot using Q-learning
Chen et al. Survey of multi-agent strategy based on reinforcement learning
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
Zhang et al. Robot path planning method based on deep reinforcement learning
Wu et al. Path planning for autonomous mobile robot using transfer learning-based Q-learning
Song et al. Towards efficient exploration in unknown spaces: A novel hierarchical approach based on intrinsic rewards
Senthilkumar et al. Hybrid genetic-fuzzy approach to autonomous mobile robot
Song et al. Research on Local Path Planning for the Mobile Robot Based on QL-anfis Algorithm
Martovytskyi et al. Approach to building a global mobile agent way based on Q-learning
Kutsuzawa et al. Motion generation considering situation with conditional generative adversarial networks for throwing robots
Ren et al. Research on Q-ELM algorithm in robot path planning
Cao et al. Threat Assessment Strategy of Human-in-the-Loop Unmanned Underwater Vehicle Under Uncertain Events
Fan et al. Rl-art2 neural network based mobile robot path planning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140305

Termination date: 20140831

EXPY Termination of patent right or utility model