CN105094124A - Method and model for performing independent path exploration based on operant conditioning - Google Patents

Method and model for performing independent path exploration based on operant conditioning

Info

Publication number
CN105094124A
CN105094124A (application CN201410215954.4A)
Authority
CN
China
Prior art keywords: robot, state, rob, obs, tar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410215954.4A
Other languages
Chinese (zh)
Inventor
蔡建羡
洪利
于瑞红
马洪蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Disaster Prevention
Original Assignee
Institute of Disaster Prevention
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Disaster Prevention
Priority to CN201410215954.4A
Publication of CN105094124A
Legal status: Pending


Abstract

The invention provides a method and a model for autonomous path exploration based on operant conditioning. The method comprises: setting initial parameter values, including a discretized environment state space and an action space; at step t, acquiring the robot's current environment parameter values through sonar sensors and determining the grade state S_i(t) to which each environment parameter value belongs; selecting the action a_k(t) with the highest probability from the action space A and executing it; and, after the selected action a_k(t) is executed and the robot's state has transitioned, evaluating the effect of performing operation a_k(t) in state S_i(t) and deciding, according to the evaluation signal V_ik(t), whether the original Q value of the state S_i(t) - operation a_k(t) pair needs to be updated. The method simulates the biological operant-conditioning mechanism, possesses bionic self-organization, self-learning and self-adaptation capabilities, effectively applies bionics, psychology and biology to the control system, and realizes bionic autonomous learning control.

Description

Method and model for autonomous path exploration based on operant conditioning
Technical field
The invention belongs to the field of path exploration technology, and specifically relates to a method and a model for autonomous path exploration based on operant conditioning.
Background technology
Skinner's operant conditioning (Operant Conditioning) theory plays a key role in the learning of humans and animals. Its basic principle is: if, under certain subjective and objective conditions, the consequence of a certain behavior satisfies the organism's orientation, then under similar conditions the probability that the organism performs a similar behavior rises. In addition, Skinner strongly emphasized the role of reinforcement; he held that whether human learning is established hinges on reinforcement. When an operation occurs and is immediately followed by a reinforcing stimulus, the strength of that operation increases. What is strengthened is not an "S-R" connection but the general tendency of the response to occur, i.e. the probability of the response is enhanced. In his view, although practice is important in learning, the key variable is reinforcement.
In an unknown environment, because the robot has no suitable teacher signal, the robot's learning ability becomes the key to the autonomous path exploration task. For example, the patent application No. 200910044273.5, entitled "Method for ensuring navigation safety of mobile robots in unknown environments", introduces fuzzy neural network technology to realize autonomous learning and exploration of a robot in an unknown environment so as to guarantee safety during motion. Although fuzzy-neural-network-based robot research links the robot's control behavior with neurophysiology and cognitive science, this link remains loose and passive: the motion-control skills of the robot body are based on declarative control laws with too much hand design, lacking the bionic autonomous learning and self-organization characteristics of biological systems, so truly autonomous path exploration is not achieved.
Operant conditioning learning has strong online adaptivity and self-learning ability for complex systems and is well suited to autonomous robot navigation learning in unknown environments. Because of this advantage, operant conditioning learning has become a focus of current research on bionic autonomous learning. In recent years, learning models designed on the operant conditioning mechanism have made some progress. For example, the patent application No. 200910086990.4, entitled "Operant conditioning automaton and its application in bionic autonomous learning control", designs an operant conditioning model with a self-organizing function based on a finite-state automaton and solves the control problem of an inverted pendulum system; the patent application No. 200910089263.3, entitled "Autonomous operant conditioning automaton and its application in realizing intelligent behavior", likewise designs an operant conditioning automaton and applies it to the balance control of a coaxial two-wheeled robot.
However, the learning mechanism of the operant conditioning automata designed in the above patent applications in fact still follows that of a general finite-state automaton, and they have only been applied to simple balance-control problems; it is difficult to apply them to the complex control task of robot path exploration.
Summary of the invention
In view of the defects of the prior art, the invention provides a method and a model for autonomous path exploration based on operant conditioning, which can be effectively applied to the field of robot path exploration.
The technical solution adopted by the invention is as follows:
The invention provides a method for autonomous path exploration based on operant conditioning, comprising the following steps:
Step 1: set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
Set the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l -- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f -- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r -- the distance state to the obstacle on the robot's right;
d̃_rob_tar -- the distance state between the robot and the target point;
θ̃ -- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
Set the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; set the safe distance d_max and the minimum danger distance d_min;
Step 2: the robot is equipped with multiple sonar sensors. While the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point;
Determine the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S;
Calculate the information entropy H_i(t) of the current state S_i(t) according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector;
Step 3: according to the Boltzmann distribution, select from the action space A the action a_k(t) with the highest probability and execute it. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
Step 4: after the chosen action a_k(t) has been executed and the robot's state has transitioned, calculate the information entropy H_i(t+1) of the new state s_i(t+1);
Judge whether the current training round number exceeds the preset training round number N; if it does, reject the states whose information entropy has remained at its maximum, and go to Step 5; otherwise, go to Step 5 directly;
Step 5: first, obtain the distances between the robot and the obstacles in the new state:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), decide whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; discard this learning experience, retain the original Q value of the pair, and go to Step 6. Otherwise, calculate the new Q value of the pair according to formula (6) and replace the original Q value with it, i.e. retain this step's learning experience; also update the probability value p_ik(s_i(t), a_k(t)) of the pair according to formula (8), and go to Step 6:
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action, and γ(p_ik) is the learning-rate function at step t, expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
In the probability update of formula (8), T is the temperature coefficient and T_0 the initial temperature; as time t increases, T decays from T_0, and the annealing parameter controls the decay speed;
Step 6: judge whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, end this round of learning. If not, set t = t + 1 and let the robot continue path exploration with the adjusted Q values and probability values, repeating Steps 2-6, until the robot reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
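For readability, the sketch below arranges Steps 1-6 as a single learning loop in Python. It is an illustrative outline only, not the patented implementation: the callables perceive, grade_state, entropy, select_action, execute, evaluate, update_q, update_probability and at_target are hypothetical names standing in for the operations described above.

```python
def run_learning_round(perceive, grade_state, entropy, select_action, execute,
                       evaluate, update_q, update_probability, at_target, t_f):
    """One round of path-exploration learning following Steps 1-6 (illustrative outline)."""
    t = 0
    while True:
        readings = perceive()                            # Step 2: sonar distances and angle to target
        s = grade_state(readings)                        # Step 2: discrete environment state S_i(t)
        entropy(s)                                       # Step 2: information entropy H_i(t), formula (3)
        a = select_action(s)                             # Step 3: Boltzmann selection of action a_k(t)
        new_readings = execute(a)                        # Step 4: the robot's state transitions
        s_next = grade_state(new_readings)
        entropy(s_next)                                  # Step 4: entropy of the new state
        v = evaluate(readings, new_readings)             # Step 5: evaluation signal V_ik(t)
        if v is not None:                                # a failed step keeps the old Q value
            update_q(s, a, s_next, v)                    # formula (6)
            update_probability(s, a)                     # formula (8)
        t += 1
        if at_target(new_readings) or t > t_f:           # Step 6: termination condition
            return t
```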
The invention also provides an autonomous exploration cognitive model for autonomous path exploration based on operant conditioning, comprising: a perceptron, an actuator, a state editor, a parameter setting module, a termination condition judging module and an exploration learning policy module, wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module.
The parameter setting module is used to set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
It sets the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
It also sets the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; it further sets the safe distance d_max and the minimum danger distance d_min.
The perceptron consists of multiple sonar sensors; while the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
The state editor determines the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
The tendency unit measures the robot's tendency degree toward an environment state by computing the entropy of that state, specifically: the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
The action selector, according to the Boltzmann distribution, selects from the action space A the action a_k(t) with the highest probability for execution. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
The actuator executes the action chosen by the action selector, so that the robot's state transitions.
The learning core evaluates the effect of the selected action and produces an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), the learning core decides whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; this learning experience is discarded and the original Q value of the pair is retained. Otherwise, the new Q value of the pair is calculated according to formula (6) and the original Q value is replaced with it, i.e. this step's learning experience is retained; in addition, the probability value p_ik(s_i(t), a_k(t)) of the pair is updated according to formula (8):
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action, and γ(p_ik) is the learning-rate function at step t, expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
In the probability update of formula (8), T is the temperature coefficient and T_0 the initial temperature; as time t increases, T decays from T_0, and the annealing parameter controls the decay speed.
The termination condition judging module judges whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, this round of learning ends. If not, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until it reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
The method and model for autonomous path exploration based on operant conditioning provided by the invention possess bionic self-organization, self-learning and self-adaptation functions, effectively apply bionics, psychology and biology to the control system, and realize bionic autonomous learning control. The invention offers a new way of thinking and a new method for research on bionic autonomous learning, and helps to advance related research on intelligent control and intelligent learning algorithms.
Brief description of the drawings
Fig. 1 is a flow chart of the method for autonomous path exploration based on operant conditioning provided by the invention;
Fig. 2 is a schematic diagram of the relationship between the robot, the obstacles and the target point provided by the invention;
Fig. 3 is a schematic diagram of the interaction between the agent and the environment;
Fig. 4 is a schematic structural diagram of the autonomous exploration cognitive model for autonomous path exploration based on operant conditioning;
Fig. 5 is a curve of training round versus number of learning steps;
Fig. 6 is a curve of training round versus number of steps to successful termination;
Fig. 7 is a curve of the information entropy of a certain state over the 20-round training process;
Fig. 8 is the path exploration trajectory of the robot in the early learning period;
Fig. 9 is the path exploration trajectory of the robot in the late learning period;
Fig. 10 is the path exploration trajectory with starting point (5, 20) and end point (55, 40);
Fig. 11 is the path exploration trajectory with starting point (30, 5) and end point (40, 42).
Embodiments
The invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a method and a model for autonomous path exploration based on operant conditioning. The model simulates the operant conditioning mechanism: a tendency unit designed with information entropy characterizes the tendency degree toward each state and automatically deletes redundant states; an improved Q-learning algorithm serves as the learning core to guide the direction of learning; and a Boltzmann machine performs annealing so that navigation actions are chosen stochastically, with the action-selection strategy tending toward the optimum as learning proceeds. Combined with the Mobotsim robot simulation software, the feasibility of using this model to realize mobile-robot path exploration in an unknown environment is verified on the autonomous path exploration problem of a mobile robot.
Specifically: during autonomous navigation in an unknown environment, while approaching the target point, once the robot encounters an obstacle it uses the information provided by the sonar sensors and the designed bionic autonomous learning algorithm to implement an obstacle-avoidance strategy. The learning algorithm includes the calculation of the tendency unit, the selection of actions and the update and adjustment of the Q values.
To simplify the study of the path exploration algorithm, it is assumed that the robot can move in any direction with a constant step length of one grid side. The robot can also turn freely in narrow environment regions without colliding with obstacles, so the turning radius of the robot need not be considered in the path exploration algorithm and the robot is reduced to a particle. The relationship between the robot, the obstacles and the target point is shown in Fig. 2, where R denotes the robot, T the target point, O an obstacle, O1, O2 and O3 three obstacles, and the arrow L the robot's motion direction.
The method specifically comprises the following steps:
Step 1: set initial parameter values, including:
(1) the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
(2) the environment state space
The robot learns continuously during path exploration; the result of learning makes the environment state it occupies transition, the learning produces new interactions, and the new interactions in turn trigger further learning. In this iteration of the two, the robot's actions tend toward the optimum for the given task and gradually adapt to the unknown environment. To facilitate interaction with the environment, the environment state is usually decomposed so that the whole environment forms a state space in which learning can proceed step by step.
The invention adopts the simplest sensor-data-fusion scheme: the detection range of the sensors is divided into three zones, namely left, front and right, so that the environment state between the robot and the obstacles can be represented by the distance states in these three directions. Considering that, for robot path exploration, the same obstacle state may contain different target states, the distance state between the robot and the target point and the angle state with respect to the target point are also added to the environment state space. Thus, the robot's environment state space is S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is:
s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃};
where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval. The way the environment state is divided directly affects the learning performance. As a concrete example, as shown in Table 1, each state variable can be divided into 5 grade states: very small, small, medium, large and very large. For example, for the distance state d̃_rob_obs_l to the obstacle on the robot's left, the very small state means the distance to the left obstacle lies in the range 100-500; the small state means it lies in 500-1500; the medium state means it lies in 1500-2500; the large state means it lies in 2500-4000; and the very large state means it lies in 4000-5000.
Table 1. Discrete division of the environment state space
With the division of Table 1, the robot's environment state space has 5^5 = 3125 states, i.e. for the expression S = {s_i | i = 1, 2, ..., n}, n = 3125. In addition, for example, the state S_6 = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} = {B1, B1, B3, B5, B4} means: the robot's distance to the left obstacle ∈ (100, 500); the distance to the front obstacle ∈ (100, 500); the distance to the right obstacle ∈ (1500, 2500); the distance between the robot and the target point ∈ (6000, ∞); and the angle between the robot's current motion direction and the target point ∈ (100, 140).
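As an illustration of the fuzzified grading in the Table 1 example, the snippet below maps a raw left-side distance reading onto the five grade states B1-B5 using the example intervals quoted above (100-500, 500-1500, 1500-2500, 2500-4000, 4000-5000); the function name and the treatment of the interval boundaries are assumptions introduced for illustration.

```python
# Example grade boundaries for the left-side obstacle distance, taken from the Table 1 example (mm).
LEFT_DISTANCE_GRADES = [
    ("B1", 100, 500),     # very small
    ("B2", 500, 1500),    # small
    ("B3", 1500, 2500),   # medium
    ("B4", 2500, 4000),   # large
    ("B5", 4000, 5000),   # very large
]

def grade_left_distance(d):
    """Return the grade state (B1-B5) for a left-side obstacle distance in mm."""
    for name, lo, hi in LEFT_DISTANCE_GRADES:
        if lo <= d < hi:
            return name
    return None  # outside the range the sensor considers

print(grade_left_distance(200))   # -> B1 (very small)
print(grade_left_distance(1800))  # -> B3 (medium)
```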
(3) the action space
In essence, the learning process that simulates the operant conditioning mechanism is a process in which the robot, by being continuously rewarded or punished, establishes a mapping from the state space to the action space. In the path planning of a robot moving toward a target point, the robot's actions mainly take the form of single-step advances, rotations through an angle, obstacle-edge following, and stopping. Therefore the action space is set as A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimum danger distance d_min are also set.
The definition of the action space must follow two principles: first, there must be enough action forms to accomplish the robot path exploration task; second, the actions should be kept simple and not too numerous, otherwise the learning burden becomes excessive. Taking both points into account, as one example (see formula (2)), the robot's action space A can be composed of seven discrete actions:
A = {a_1, a_2, a_3, a_4, a_5, a_6, a_7}    (2)
where:
a_1 --- the robot turns left 30° while advancing 100 mm;
a_2 --- the robot turns left 15° while advancing 100 mm;
a_3 --- the robot turns left 10° while advancing 100 mm;
a_4 --- the robot turns 0° while advancing 100 mm;
a_5 --- the robot turns right 10° while advancing 100 mm;
a_6 --- the robot turns right 15° while advancing 100 mm;
a_7 --- the robot turns right 30° while advancing 100 mm.
Since the maximum detection range of the sonar sensors fitted to the robot is 5000 mm and the minimum detection range is 100 mm, in this case one can define d_min = 100 mm as the minimum danger distance and d_max = 5000 mm as the safe distance, with d ∈ (d_min, d_max) being the obstacle-avoidance zone; the angle between the robot's current direction and the target point is θ ∈ [-180°, 180°]. Thus 100 mm < d_rob_obs_l, d_rob_obs_f, d_rob_obs_r < 5000 mm, i.e. obstacles farther than 5000 mm from the robot or closer than 100 mm are not considered.
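The seven-action space of formula (2) and the distance thresholds can be written down directly. The sketch below is one possible encoding, where each action is stored as (turn angle in degrees, advance in mm) and applied to a simple (x, y, heading) particle pose; the pose-update helper and the sign convention (positive = left turn) are assumptions added for illustration.

```python
import math

D_MIN = 100.0    # minimum danger distance in mm
D_MAX = 5000.0   # safe distance in mm

# Action space A = {a_1, ..., a_7}: (turn angle in degrees, advance in mm); positive turns left.
ACTIONS = [(+30, 100), (+15, 100), (+10, 100), (0, 100), (-10, 100), (-15, 100), (-30, 100)]

def apply_action(x, y, heading_deg, action):
    """Apply one discrete action to a particle-robot pose (illustrative kinematics)."""
    turn, advance = action
    heading = heading_deg + turn
    rad = math.radians(heading)
    return x + advance * math.cos(rad), y + advance * math.sin(rad), heading

print(apply_action(0.0, 0.0, 90.0, ACTIONS[0]))  # a_1: turn left 30 deg, advance 100 mm
```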
Step 2: the robot is equipped with multiple sonar sensors. While the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
In a practical application, the robot can be equipped with 16 sonar sensors with angular spacings of 20° or 40° between neighbouring sensors, so that the sonar sensors cover the full 0-360° range around the robot. The robot can also rotate freely within 0-360°.
Determine the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
For example, if at step t the robot's current environment parameter values are: distance to the left obstacle d^t_rob_obs_l = 200, distance to the right obstacle d^t_rob_obs_r = 450, distance to the front obstacle d^t_rob_obs_f = 1800, distance to the target point d^t_rob_tar = 7800, and angle between the current motion direction and the target point θ_t = 120, and S_6 = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} = {B1, B1, B3, B5, B4}, then the environment state of the robot at step t is S_6.
The complexity of bionic autonomous learning rises exponentially with the number of environment states contained in the environment state space, yet the environment states relevant to the optimal learning effect account for less than a quarter of the total; most environment states are ones the robot never experiences or experiences only rarely. In view of this, information entropy is introduced, and the robot's tendency degree toward an environment state is measured by computing the entropy of that state. In information theory, entropy can be used as a measure of the uncertainty of an event: the greater the amount of information, the more regular the system structure, the more complete its function, and the smaller the entropy. With the concept of entropy, the measurement, transmission, conversion and storage of information can be studied theoretically.
Because the robot acts somewhat blindly in the initial stage of path exploration, the probabilities of the initial actions are close to one another and the information entropy of every state is at its maximum. As learning proceeds, the probabilities of the actions under the experienced environment states change and their information entropy decreases accordingly. The more often an environment state is experienced, the higher the robot's tendency toward that state and the lower its information entropy falls; conversely, the less often a state is experienced, the lower the robot's tendency toward it and the less its information entropy changes. In practice it is not necessary to obtain a fully optimal solution; in most cases a feasible scheme is of more practical significance than the optimal one, which also agrees with the way humans explore the world and handle problems. Therefore, during path exploration learning, states whose information entropy remains at its maximum are rejected, while any state that has been experienced is retained.
Accordingly, given the robot's discretized environment state set S = {s_i | i = 1, 2, ..., n}, the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
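Formula (3) is the Shannon entropy of the action-probability vector P_i attached to state s_i. A minimal numerical check with r = 7 actions, matching the embodiment below, is sketched here; the function name is an assumption introduced for illustration.

```python
import math

def state_entropy(p_i):
    """Information entropy H_i of a state's action-probability vector (formula (3))."""
    return -sum(p * math.log2(p) for p in p_i if p > 0)

r = 7
uniform = [1.0 / r] * r
print(round(state_entropy(uniform), 2))              # ~2.81, the maximum entropy for 7 actions
print(round(state_entropy([0.94] + [0.01] * 6), 2))  # much smaller once one action dominates
```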
Step 3: according to the Boltzmann distribution, select from the action space A the action a_k(t) with the highest probability and execute it. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
Because the evaluation signals of the actions are unknown at the beginning and T is chosen large, at the start of learning the exponential function in formula (8), as t → 0, tends to 1 and the selection probabilities of all actions are essentially equal, that is:
p_ik(0) ≈ 1/r    (i = 1, 2, ..., n; k = 1, 2, ..., r)
Equal selection probabilities for all actions mean that at the start of learning the robot contains no predetermined decision and adopts every decision with equal probability.
Step 4: after the chosen action a_k(t) has been executed and the robot's state has transitioned, calculate the information entropy H_i(t+1) of the new state s_i(t+1).
Judge whether the current training round number exceeds the preset training round number N; if it does, reject the states whose information entropy has remained at its maximum, and go to Step 5; otherwise, go to Step 5 directly.
Step 5: generation of the evaluation signal
During learning, according to the current environment state and the selected action, the robot produces an evaluation signal after every action step; its purpose is to evaluate the effect of the selected action. The evaluation signal decides what the robot will learn over the whole learning process and is where the robot's intelligence shows most clearly; a reasonable distribution of reward and punishment evaluation signals helps to improve learning efficiency. Because the evaluation signal assesses the selection of each individual action step, its design should follow the separation of the states; and since mobile-robot navigation exploration in an unknown environment has two goals, namely obstacle avoidance and approaching the target point, the design of the evaluation signal must take the following aspects into account:
First, obtain the distances between the robot and the obstacles in the new state:
(1) If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, i.e. the distances between the robot and the obstacles on the left, right and front are all greater than the safe distance, the possibility of a collision between the robot and an obstacle is very small, so the robot's main task is to approach the target point; the design of the evaluation signal then focuses on the relationship between the distance from the robot to the target point and the angle θ of the line from the robot to the target point.
When the robot moves toward the target point: Δθ(t) = θ(t+1) - θ(t), with Δθ(t) < 0.
When the distance between the robot and the target point decreases: Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t), with Δd_rob_tar(t) < 0.
Therefore the evaluation signal is defined as in formula (4):
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients, generally taken as 0 < β1, β2 < 1.
(2) If d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, i.e. one of the distances between the robot and the obstacles on the left, right and front lies between the danger distance and the safe distance, a collision with that obstacle is possible. Considering that the collision will not happen immediately and that the robot should not stray from the target, the path exploration strategy is designed so that the robot avoids the obstacle while approaching the target point. The design of the evaluation signal therefore focuses on the relationship between the distance from the robot to the target point and the distance from the robot to the obstacle.
Suppose the distance between the robot and the front obstacle lies between the danger distance and the safe distance; then, when the distance between the robot and the obstacle increases: Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t) > 0.
Therefore the evaluation signal is defined as in formula (5):
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients, generally taken as 0 < β3, β4 < 1.
Note: when the distance between the robot and the left or right obstacle lies between the danger distance and the safe distance, the evaluation function is designed in the same way.
(3) If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, i.e. one of the distances between the robot and the obstacles on the left, right and front is smaller than the danger distance, the possibility of a collision with an obstacle is very high and the maximum punishment should be given immediately: the learning fails and the penalty value V_ik(t) = -1 is assigned.
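The case analysis above can be condensed into one function. The sketch below computes V_ik(t) from two consecutive sensor readings using formulas (4), (5-1)-(5-3) and the penalty rule; the reading layout (a dict of distances and the angle) and the weight values are assumptions for illustration, not the patented implementation.

```python
import math

def sign(x):
    return math.copysign(1.0, x) if x != 0 else 0.0

def evaluation_signal(prev, curr, d_min=100.0, d_max=5000.0,
                      beta1=0.65, beta2=0.36, beta3=0.72, beta4=0.44):
    """Evaluation signal V_ik(t) from consecutive readings (formulas (4), (5-x), penalty rule)."""
    d_tar = curr["d_tar"] - prev["d_tar"]        # delta d_rob_tar(t)
    d_theta = curr["theta"] - prev["theta"]      # delta theta(t)
    goal_term = -beta4 * sign(d_tar) * d_tar ** 2
    dists = ["d_obs_l", "d_obs_f", "d_obs_r"]
    if any(curr[k] < d_min for k in dists):      # closer than the danger distance: maximum punishment
        return -1.0
    if all(curr[k] > d_max for k in dists):      # free space: approach the target, formula (4)
        return -beta1 * sign(d_tar) * d_tar ** 2 - beta2 * sign(d_theta) * d_theta ** 2
    for k in dists:                              # avoidance zone: formulas (5-1)-(5-3)
        if d_min < curr[k] < d_max:
            d_obs = curr[k] - prev[k]
            return beta3 * sign(d_obs) * d_obs ** 2 + goal_term
    return None                                  # no applicable case: treat as a failed step

prev = {"d_obs_l": 1200, "d_obs_f": 3000, "d_obs_r": 2500, "d_tar": 7800, "theta": 40}
curr = {"d_obs_l": 1400, "d_obs_f": 3100, "d_obs_r": 2600, "d_tar": 7600, "theta": 35}
print(evaluation_signal(prev, curr))
```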
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), decide whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; discard this learning experience, retain the original Q value of the pair, and go to Step 6. Otherwise, calculate the new Q value of the pair according to formula (6) and replace the original Q value with it, i.e. retain this step's learning experience; also update the probability value p_ik(s_i(t), a_k(t)) of the pair according to formula (8), and go to Step 6:
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action: a smaller value means the system attends mainly to the effect of recent actions, while a larger value means actions over a long time span all receive attention. γ(p_ik) is the learning-rate function at step t and controls the learning speed: a larger γ(p_ik) gives faster convergence, but a γ(p_ik) that is too large may prevent convergence. The learning-rate function is expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
By incorporating the excitation probability p_ik of the "state s_i - action a_k" pair into the learning-rate function γ(p_ik), the Q value of each "state s_i - action a_k" pair is adjusted at its own speed. The designed learning-rate function therefore not only influences the learning speed, but also makes the adjustment of the Q values exhibit an orientation characteristic closer to animal operant conditioning. As can be seen from formula (7), the adjustment of the Q value of a "state s_i - action a_k" pair is mainly determined by the change of the excitation probability, p_ik(t) - p_ik(t+1): if p_ik(t) - p_ik(t+1) < 0, then in subsequent learning the Q value of that pair tends to be adjusted faster, i.e. the learning speed accelerates; conversely, if p_ik(t) - p_ik(t+1) > 0, the Q value of that pair tends to be adjusted more slowly, i.e. the learning speed slows down.
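Formulas (6) and (7) translate directly into code. The sketch below keeps Q values in a dictionary keyed by (state, action) pairs; the data layout, parameter names and example values are assumptions introduced for illustration.

```python
import math

def learning_rate(p_now, p_next):
    """Learning-rate function gamma(p_ik) of formula (7)."""
    return 1.0 / (1.0 + math.exp((p_now - p_next) / p_now))

def update_q(Q, s_prev, a_prev, s, a, s_next, v, p_now, p_next, eta=0.14, actions=range(7)):
    """Q-value adjustment of formula (6) for the pair (s_i(t), a_k(t))."""
    gamma = learning_rate(p_now, p_next)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = ((1.0 - gamma) * Q.get((s_prev, a_prev), 0.0)
                 + gamma * (v + eta * best_next))
    return Q[(s, a)]

Q = {}
print(update_q(Q, s_prev=5, a_prev=2, s=6, a=2, s_next=7, v=0.3, p_now=1 / 7, p_next=0.2))
```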
The action selection strategy is the core of the robot's path exploration. In the early stage of path exploration learning the main task is to explore the environment, so the randomness of action selection should be large; in the late stage the learning should converge, so the randomness of action selection should be small. Here a Boltzmann machine is used to perform the annealing computation, and the probability with which an action a_k is selected is updated by formula (8), in which T is the temperature coefficient, T_0 is the initial temperature and, as time t increases, T decays from T_0, with the annealing parameter controlling the decay speed.
In formula (8), the temperature coefficient T determines the degree of randomness of action selection: the larger T, the closer the probabilities of the different actions and the more random the selection; the smaller T, the larger the probability differences between the actions, the larger the selection probability of the action with the largest Q value, and the less random the selection. In the initial stage of learning a larger T can be set so that exploration is sufficient; as the number of learning steps increases and the system accumulates more and more experiential knowledge, T decays from T_0, and as the number of learning steps t → ∞, T decays to zero, the selection probability of the action with the largest Q value tends to 1. This shows that the system passes from initially blind incidental learning to nearly deterministic learning, and the operant conditioning reflex is formed.
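Formula (8) itself is not reproduced in this text, but the passage describes a Boltzmann (softmax) selection rule whose temperature decays from T_0 as learning proceeds. The sketch below implements that reading as an assumption: both the softmax form and the exponential decay schedule T = T_0·exp(-λ·t) are illustrative choices, not the patented formula.

```python
import math
import random

def boltzmann_probabilities(q_values, T):
    """Softmax selection probabilities over one state's Q values at temperature T (assumed form of formula (8))."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / T) for q in q_values]  # shift by the max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def select_action(q_values, t, T0=10.0, lam=0.05):
    """Choose an action index; T decays from T0 so selection moves from random to nearly greedy."""
    T = max(T0 * math.exp(-lam * t), 1e-3)
    probs = boltzmann_probabilities(q_values, T)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [0.2, 0.5, 0.1, 0.0, 0.3, 0.4, 0.6]
print(select_action(q, t=0))    # early in learning: nearly uniform choice
print(select_action(q, t=200))  # late in learning: almost always the largest-Q action
```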
Step 6: judge whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, end this round of learning. If not, set t = t + 1 and let the robot continue path exploration with the adjusted Q values and probability values, repeating Steps 2-6, until the robot reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
Convergence of the learning algorithm:
From the properties of information entropy, the operation-behavior entropy is largest when the probabilities p_k(t) of all operation behaviors a_k(t) are equal. Therefore, at the initial moment of learning, all operation behaviors a_k(t) are given the same selection probability p_ik(t). Rearranging formula (3) gives:
H_i(t) = H_i(A(t) | s_i) = -∑_{k=1}^{r} p(a_k | s_i)·log₂ p(a_k | s_i) = -[p(a_k | s_i)·log₂ p(a_k | s_i) + ∑_{k'=1, k'≠k}^{r} p(a_k' | s_i)·log₂ p(a_k' | s_i)]    (9)
As learning proceeds, the probability function p_ik(t) is updated; since lim_{t→∞} p_ik'(a_k'(t) | s_i(t)) → 0 for the non-selected actions, substituting into the above formula gives:
lim_{t→∞} H_i(t) = lim_{t→∞} [-p(a_k | s_i)·log₂ p(a_k | s_i) - ∑_{k'=1, k'≠k}^{r} p(a_k' | s_i)·log₂ p(a_k' | s_i)] → H_i^min ≈ 0    (10)
Therefore the information entropy H_i({A_i | s_i}) corresponding to each state s_i of the autonomous path exploration model converges to a minimal value as learning proceeds.
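Formulas (9) and (10) say that once the selection probability mass concentrates on one action, the state's entropy falls toward zero. A short numerical check of this limit, with the probability mass increasingly concentrated on a single action (an assumed concentration pattern, for illustration only), is sketched below.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

r = 7
for share in (1 / 7, 0.5, 0.9, 0.99, 0.9999):
    # one action takes 'share' of the probability mass; the rest is spread over the other r-1 actions
    p = [share] + [(1 - share) / (r - 1)] * (r - 1)
    print(f"dominant probability {share:.4f} -> H_i = {entropy(p):.4f}")
```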
The operant conditioning principle is significant not only for understanding the learning behavior of humans and animals, but also for "copying" that learning behavior in robots or machine systems. It is regarded as the most basic form of learning in biological systems; learning based on the operant conditioning mechanism focuses on the interaction between the agent and the environment, and its learning process is shown in Fig. 3.
An operant conditioning learning model consists of three modules: the perceptron is responsible for acquiring the environment state, the learner is responsible for the learning mechanism, and the action selector is responsible for selecting a suitable action from the action space and executing it. A loop is thus formed: the robot continuously perceives the environment and executes the selected actions, thereby changing the environment. According to the change of the environment state after an action is executed, an evaluation signal (reward or punishment) is given to the action the agent performed, and the learning process repeats round after round. From the process of learning by operant conditioning it can be seen that the learning and adaptation process of higher mammals is very similar to the interaction between a robot and its environment.
Simulating the operant conditioning learning mechanism, the autonomous exploration cognitive model for autonomous path exploration based on operant conditioning provided by the invention is shown in Fig. 4. The state editing module solves the problem of the discrete division of the environment state; the exploration learning policy module solves the problem of mapping the state space to the action space; and the environment interaction module solves the problems of evaluation-signal generation and state transition. During path exploration learning, the mobile robot keeps interacting with the environment and, through learning and training, reaches the set goal.
Specifically, the model comprises: a perceptron, an actuator, a state editor, a parameter setting module, a termination condition judging module and an exploration learning policy module, wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module.
The parameter setting module is used to set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
It sets the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
It also sets the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; it further sets the safe distance d_max and the minimum danger distance d_min.
The perceptron consists of multiple sonar sensors; while the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
The state editor determines the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
The tendency unit measures the robot's tendency degree toward an environment state by computing the entropy of that state, specifically: the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
The action selector, according to the Boltzmann distribution, selects from the action space A the action a_k(t) with the highest probability for execution. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
The actuator executes the action chosen by the action selector, so that the robot's state transitions.
The learning core evaluates the effect of the selected action and produces an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The evaluation signal V calculated ikt () essence is used for evaluating at state s iunder (t), implementation and operation a kt the action effect of (), according to evaluation signal V ikt (), determines the need of renewal " state s i(t)-operation a k(t) " former Q value, if do not needed, then show that this step learns unsuccessfully, abandon this learning experience, retain " state s i(t)-operation a k(t) " former Q value; Otherwise, according to formula (6), calculate " state s i(t)-operation a k(t) " new Q value, and by " state s i(t)-operation a k(t) " former Q value be updated to new Q value, namely remain the learning experience of this step; And, according to formula (8), upgrade " state s i(t)-operation a k(t) " probable value p ik(s i(t), a k(t)):
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
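The following Python sketch shows how the update of formula (6), with the learning rate of formula (7), might be carried out for one state–action pair. The two-dimensional Q table and the way the previous estimate is stored are assumptions for illustration; the probability update of formula (8) is not reproduced in this text and is therefore omitted from the sketch.

```python
import numpy as np

def learning_rate(p_t, p_t1):
    """Formula (7): gamma(p_ik) = 1 / (1 + exp((p_ik(t) - p_ik(t+1)) / p_ik(t)))."""
    return 1.0 / (1.0 + np.exp((p_t - p_t1) / p_t))

def update_q(Q, s, a, s_next, V, p_t, p_t1, eta=0.14):
    """Formula (6): blend the previous Q value with the evaluated return.

    Q   : 2-D array indexed by (state index, action index)
    V   : evaluation signal V_ik(t)
    eta : discount factor (0.14 in the simulation settings)
    The previous estimate Q(s_i(t-1), a_k(t-1)) is taken here to be the value
    currently stored for the same state-action pair.
    """
    gamma = learning_rate(p_t, p_t1)
    target = V + eta * np.max(Q[s_next])              # V_ik(t) + eta * max_a Q(s_i(t+1), a)
    Q[s, a] = (1.0 - gamma) * Q[s, a] + gamma * target
    return Q[s, a]

# usage with the simulation dimensions: 3125 states, 7 actions
Q = np.zeros((3125, 7))
update_q(Q, s=0, a=2, s_next=1, V=0.05, p_t=1 / 7, p_t1=0.16)
```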
The termination-condition judging module judges whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
Test example
In order to verify the feasibility of the method and model for autonomous path exploration based on operant conditioning provided by the invention, a simulation of a mobile robot performing autonomous path exploration based on operant conditioning was carried out. The simulation platform used is Mobotsim (Mobile Robot Simulator). The simulation analysis of the mobile robot exploring paths in an unknown environment shows that, through interaction with the environment, the robot finally completes the autonomous path exploration task in the unknown environment.
(1) Experiment scene
The whole navigation environment is represented by a grid map; in this experiment it is set as a 70*50 grid environment, each grid cell being 0.2*0.2 m. Static obstacles are placed in the environment; through autonomous learning the mobile robot avoids the obstacles and searches for an optimal or near-optimal path to the target point. The robot is approximated as a circular robot of diameter 0.5 m. The time interval of each time step is 0.1 s, and the speed of the robot's centre point is 0.2 m/s.
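For reference, the experiment scene above can be encoded as a simple occupancy grid; the Python sketch below is illustrative only (the obstacle layout itself is not reproduced from the figures) and shows the conversion between grid indices and metric coordinates.

```python
import numpy as np

CELL = 0.2                    # each grid cell is 0.2 m x 0.2 m
GRID_W, GRID_H = 70, 50       # 70*50 grid environment
ROBOT_DIAMETER = 0.5          # circular robot, approximately 0.5 m across
STEP_DT = 0.1                 # time interval of each step, in seconds
SPEED = 0.2                   # speed of the robot centre point, in m/s

occupancy = np.zeros((GRID_W, GRID_H), dtype=bool)   # True marks a static obstacle

def cell_centre(ix, iy):
    """Metric coordinates (in metres) of the centre of grid cell (ix, iy)."""
    return (ix + 0.5) * CELL, (iy + 0.5) * CELL

start = cell_centre(10, 10)   # starting point R at grid (10, 10), see Figure 8
goal = cell_centre(55, 35)    # target point P at grid (55, 35)
```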
(2) Parameter settings
The initial conditions of the simulation are as follows: iterative learning step t = 0; sampling time t_s = 0.1 s; weight coefficients β1 = 0.65, β2 = 0.36, β3 = 0.72, β4 = 0.44; discount factor η = 0.14; the robot state space has 5^5 = 3125 states, s = {s_i | i = 1, 2, …, 3125}; the action space has 7 actions, A = {a_k | k = 1, 2, …, 7}; the initial selection probability of each action is p_ik(0) ≈ 1/7, so the initial information entropy of each state is approximately H_i(0) = −Σ_{k=1}^{7} p_ik·log2 p_ik (with p_ik = 1/7) ≈ 2.81, which by the properties of entropy is the maximum value.
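The knowledge-free initial condition just described (uniform action probabilities and maximum entropy) can be reproduced with a few lines of Python; the array layout is an assumption for illustration.

```python
import numpy as np

N_STATES = 5 ** 5             # five grade states per quantity, five quantities: 3125 states
N_ACTIONS = 7

# every action is initially equally likely in every state: p_ik(0) = 1/7
P = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

def state_entropy(p_row):
    """Formula (3): H_i = -sum_k p_ik * log2(p_ik)."""
    return float(-(p_row * np.log2(p_row)).sum())

H0 = state_entropy(P[0])      # = log2(7), approximately 2.81, the maximum possible value
```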
(3) Experimental results and analysis
In the simulation experiment, while travelling toward the target point, the robot implements obstacle-avoidance behaviour according to the autonomous path exploration strategy whenever an obstacle is detected. The robot acquires its obstacle-avoidance rules autonomously through learning; therefore, at the beginning of learning the robot has no experience at all, can only accumulate experience in the actual obstacle-avoidance process, and realizes obstacle avoidance through repeated learning, searching out a shorter or the shortest exploration path. The simulation experiment comprises 20 rounds of training in total. When the robot reaches the target point smoothly, or the number of learning steps exceeds t_f = 100, the current round is stopped, the robot returns to the starting point, the learned experience is saved, and the next round of training starts on that basis. When the number of training rounds exceeds N = 10, the states whose information entropy remains at the maximum value are deleted. After about 10 rounds of experiments, 1952 redundant states are eliminated; after about 15 rounds, the robot navigates successfully with fewer and fewer learning iterations, and the number of steps needed for successful navigation also decreases gradually, finally tending to a stable value, as shown in Figure 5 and Figure 6: Figure 5 is the relation curve between training round and number of learning iterations, and Figure 6 is the relation curve between training round and number of steps to successful termination. The simulation results show that at the early stage of training, because the robot has no prior knowledge, the probability of failure is high. As learning proceeds, the probability of failure decreases gradually; after about 10 rounds, owing to the rejection of redundant states, the learning speed accelerates, and after about 15 rounds of training the learning not only converges faster but essentially no longer fails. Therefore, through autonomous learning the robot has learned to choose actions that avoid collisions and reach the target point along a shorter path.
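The training schedule described above (20 rounds, a step limit of t_f = 100, pruning of maximum-entropy states after N = 10 rounds) corresponds to a loop of roughly the following shape; `run_episode` and `prune_max_entropy_states` are hypothetical placeholders standing in for the per-step procedure and the state rejection described in the text.

```python
import random

T_F = 100           # learning-step limit per round
ROUNDS = 20         # total training rounds in the simulation
PRUNE_AFTER = 10    # delete maximum-entropy (redundant) states after this many rounds

def run_episode(max_steps):
    """Placeholder for one round: sense state, select action, act, evaluate, update Q and p."""
    steps = random.randint(1, max_steps)
    return steps < max_steps, steps       # (reached the target point?, steps used)

def prune_max_entropy_states():
    """Placeholder for rejecting states whose information entropy never decreased."""
    pass

for episode in range(1, ROUNDS + 1):
    reached, steps = run_episode(T_F)
    if episode > PRUNE_AFTER:
        prune_max_entropy_states()
    # the robot returns to the starting point and keeps the learned experience
```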
Figure 7 shows the change curve of the information entropy of one state during the 20 rounds of training. As can be seen from the simulation results, at the early stage of the experiment the robot has no prior information or knowledge, and the information entropy takes its maximum value. As learning proceeds, the information entropy starts to decline; after about 15 rounds of training, the information entropy essentially no longer changes and reaches its minimum value. This shows that at the early stage of training, because the robot has no prior knowledge, the probability of failure is high; as learning proceeds, the probability of failure decreases gradually, and after about 15 rounds of training failures essentially no longer occur. Therefore, through autonomous learning the robot has learned to choose actions that avoid collisions and reach the target point along a shorter path.
In order to show clearly the self-learning, self-organizing bionic learning process of robot path exploration, the situations at the initial stage of learning and at the final stage of training are extracted. Figure 8 is the path exploration trajectory of the robot at the initial learning stage, and Figure 9 is the path exploration trajectory of the robot at the final learning stage. Eleven irregularly shaped obstacles are distributed unevenly in the simulation environment, labelled 1–11 in Figure 8; the sector denotes the coverage of the robot's sensors, labelled 12 in Figure 8. The starting point R is at grid (10, 10), and the target point P is at grid (55, 35). From the path exploration trajectory shown in Figure 8 it can be seen that the trajectory at the initial learning stage is not optimal, and collisions with obstacles even occur; this illustrates that the robot has no experiential knowledge, the randomness of learning is very large, the probabilities of choosing the individual actions differ little, and effective actions cannot yet be selected for path planning. As the interaction between the robot and the environment continues to enrich, the robot begins to tend to choose actions that obtain a "reward"; in other words, the frequency of choosing "good" actions exceeds the frequency of choosing "bad" actions, and its action selection tends to stabilize. As the number of learning iterations increases, the improved Q-learning algorithm continuously improves the obstacle-avoidance performance. From the path exploration trajectory of Figure 9 it can be seen that at the final learning stage the robot can smoothly bypass the static obstacles along a trajectory that is as smooth as possible and, while avoiding obstacles, keeps moving toward the target point, achieving an effective selection of a navigable path. The robot changes from blind, random learning at the beginning to nearly deterministic learning, and the operant conditioning is formed. The autonomous navigation learning process of the robot reflects the operant conditioning process of humans or animals and shows that, through autonomous learning, the robot tends to choose the actions favourable to itself.
The learned state–action mapping relations are saved, the working environment of the robot is changed, different starting points and target-point coordinates are set, the distribution of obstacles is changed, and the experiments are re-run. Figure 10 is the path exploration trajectory with starting point (5, 20) and end point (55, 40); Figure 11 is the path exploration trajectory with starting point (30, 5) and end point (40, 42).
The simulation results show that, with the method of discretely dividing the state space and the action space, the robot, after learning the mapping relations between states and actions, can successfully avoid obstacles under various circumstances and finally reach the target. Therefore, the autonomous path exploration strategy based on bionic autonomous learning designed in the present patent is effective.
The above simulation results show that the path exploration process based on bionic autonomous learning is a process of continuous exploration and consolidation; on the basis of sufficient learning it can guarantee a reasonable and effective understanding of the environment and can complete a reasonable mapping from the state space to the action space. Therefore, imitating the way humans recognize things and handle problems, adopting a bionic autonomous learning strategy is effective and feasible for a robot to autonomously complete the path exploration task in an unknown environment.
In summary, the method and model for autonomous path exploration based on operant conditioning provided by the invention have the following advantages:
(1) The autonomous path exploration cognitive model serves as a mathematical abstraction and formalization tool for the mobile robot, used to describe the autonomous learning mechanism of the robot, and provides an effective logical structure for designing the behaviour of autonomous intelligent search-and-rescue robots.
(2) The invention simulates the operant conditioning mechanism and designs a bionic autonomous exploration cognitive model. The principal feature of the autonomous exploration cognitive model is that it simulates the biological operant conditioning mechanism, possesses bionic self-organization, self-learning and self-adaptation functions, effectively applies bionics, psychology and biology to the control system, and has the function of realizing bionic autonomous learning control. The present patent provides new ideas and new methods for research on bionic autonomous learning, and contributes to advancing related research on intelligent control and intelligent learning algorithms.
(3) Autonomous path exploration is a core technology and a difficult problem in the field of mobile robot research, involving many aspects such as the robot's perception, planning and execution. The autonomous path exploration cognitive model designed in the present invention can serve as a mathematical abstraction and formalization tool for search-and-rescue robots, used to describe the learning mechanism and cognitive behaviour of the robot, thereby providing an effective logical structure for designing robot behaviour that develops toward intelligence and autonomy.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those skilled in the art, several improvements and modifications can also be made without departing from the principles of the invention, and these improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A method for autonomous path exploration based on operant conditioning, characterized by comprising the following steps:
Step 1, setting initial parameter values, including: robot start position information; initial iterative learning step t = 0; iterative learning count t_f; sampling time t_s; learning algorithm parameters, including the weight coefficients β1, β2, β3, β4, the annealing initial temperature T_0, the annealing parameter and the discount factor η;
The environment state space of the robot is set as S = {s_i | i = 1, 2, …, n}, wherein each s_i is represented by five state quantities, that is, s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; wherein:
d̃_rob_obs_l — the distance state from the robot to the obstacle on its left;
d̃_rob_obs_f — the distance state from the robot to the obstacle in front of it;
d̃_rob_obs_r — the distance state from the robot to the obstacle on its right;
d̃_rob_tar — the distance state between the robot and the target point;
θ̃ — the angle state between the robot's current direction of motion and the target point;
Each state quantity is divided into a finite number of grade states by a fuzzification method, each grade state corresponding to a distance interval;
The action space A = {a_k | k = 1, 2, …, r} is set, wherein a_k denotes the k-th optional action of mobile robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimal risk distance d_min are set;
Step 2, the robot is equipped with multiple sonar sensors; during robot travel, at step t, the current environment parameter values of the robot are collected by the sonar sensors, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ^t between the current direction of motion and the target point;
The grade state to which each current environment parameter value of the robot belongs is judged, thereby obtaining the environment state s_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot finds itself at step t, with s_i(t) ∈ environment state space S;
The information entropy H_i(t) in the current state s_i(t) is calculated according to formula (3):
H_i(t) = H_i\bigl(A(t)\mid s_i(t)\bigr) = -\sum_{k=1}^{r} p_{ik}\log_2 p_{ik} = -\sum_{k=1}^{r} p\bigl(a_k\mid s_i(t)\bigr)\log_2 p\bigl(a_k\mid s_i(t)\bigr)    (3)
Wherein, a_k ∈ A = {a_k | k = 1, 2, …, r} denotes the k-th optional action of mobile robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of implementing action a_k under the condition of being in environment state s_i(t), also referred to as the excitation probability value of the "state s_i – action a_k" pair, satisfying 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, …, p_ir}, where P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, …, p_ir} ∈ P, and P denotes the total probability vector;
Step 3, according to the Boltzmann distribution, the action a_k(t) with the maximum probability is selected from the action space A and executed; the learning objectives of the optimum action are:
(1) If the distances between the robot and the obstacles on its left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) If any of the distances between the robot and the obstacles on its left, front and right lies between the minimal risk distance and the safe distance, that is, d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
Step 4, after the chosen action a_k(t) is implemented, the robot state transitions; the information entropy H_i(t+1) in the new state s_i(t+1) is calculated;
It is judged whether the current number of training rounds exceeds the preset number of training rounds N; if so, the states whose information entropy remains at the maximum value are rejected, and Step 5 is entered; otherwise Step 5 is entered directly;
Step 5, first, the distances between the robot and the obstacles in the new state are acquired:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (4);
If d_min < d_rob_obs_l < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-1);
If d_min < d_rob_obs_f < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-2);
If d_min < d_rob_obs_r < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning step fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_{rob\_tar}, \theta) = -\beta_1\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t) - \beta_2\,\mathrm{sign}(\Delta\theta(t))\,\Delta^2\theta(t)    (4)
Wherein, β1, β2 are weight coefficients, 0 < β1, β2 < 1; Δθ(t) = θ(t+1) − θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) − d_rob_tar(t);
V(d_{rob\_obs\_l}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_l}(t))\,\Delta^2 d_{rob\_obs\_l}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-1)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) − d_rob_obs_l(t);
V(d_{rob\_obs\_f}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_f}(t))\,\Delta^2 d_{rob\_obs\_f}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-2)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) − d_rob_obs_f(t);
V(d_{rob\_obs\_r}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_r}(t))\,\Delta^2 d_{rob\_obs\_r}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-3)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) − d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of implementing action a_k(t) in state s_i(t). According to the evaluation signal V_ik(t), it is determined whether the old Q value of "state s_i(t) – action a_k(t)" needs to be updated. If not, this learning step is regarded as a failure, the learning experience is discarded, the old Q value of "state s_i(t) – action a_k(t)" is retained, and Step 6 is executed; otherwise, the new Q value of "state s_i(t) – action a_k(t)" is calculated according to formula (6), the old Q value is replaced by the new one, i.e. the learning experience of this step is retained, the probability value p_ik(s_i(t), a_k(t)) of "state s_i(t) – action a_k(t)" is updated according to formula (8), and Step 6 is executed:
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
Step 6, it is judged whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1, the robot continues path exploration with the adjusted Q values and probability values, and steps Step 2–Step 6 are repeated until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
2. An autonomous exploration cognitive model for autonomous path exploration based on operant conditioning, characterized by comprising: a perceptron, an actuator, a state editor, a parameter setting module, a termination-condition judging module and an exploration learning policy module; wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module;
The parameter setting module is used for setting initial parameter values, including: robot start position information; initial iterative learning step t = 0; iterative learning count t_f; sampling time t_s; learning algorithm parameters, including the weight coefficients β1, β2, β3, β4, the annealing initial temperature T_0, the annealing parameter and the discount factor η;
The environment state space of the robot is set as S = {s_i | i = 1, 2, …, n}, wherein each s_i is represented by five state quantities, that is, s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; wherein:
d̃_rob_obs_l — the distance state from the robot to the obstacle on its left;
d̃_rob_obs_f — the distance state from the robot to the obstacle in front of it;
d̃_rob_obs_r — the distance state from the robot to the obstacle on its right;
d̃_rob_tar — the distance state between the robot and the target point;
θ̃ — the angle state between the robot's current direction of motion and the target point;
Each state quantity is divided into a finite number of grade states by a fuzzification method, each grade state corresponding to a distance interval;
The action space A = {a_k | k = 1, 2, …, r} is set, wherein a_k denotes the k-th optional action of mobile robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimal risk distance d_min are set;
The perceptron consists of multiple sonar sensors; during robot travel, at step t, the current environment parameter values of the robot are collected by the sonar sensors, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ^t between the current direction of motion and the target point;
The state editor is used for judging the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state s_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot finds itself at step t, with s_i(t) ∈ environment state space S;
The tendency unit is used for measuring the robot's degree of tendency toward an environment state by computing the entropy of that state, specifically, the information entropy H_i(t) in the current state s_i(t) is calculated according to formula (3):
H_i(t) = H_i\bigl(A(t)\mid s_i(t)\bigr) = -\sum_{k=1}^{r} p_{ik}\log_2 p_{ik} = -\sum_{k=1}^{r} p\bigl(a_k\mid s_i(t)\bigr)\log_2 p\bigl(a_k\mid s_i(t)\bigr)    (3)
Wherein, a_k ∈ A = {a_k | k = 1, 2, …, r} denotes the k-th optional action of mobile robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of implementing action a_k under the condition of being in environment state s_i(t), also referred to as the excitation probability value of the "state s_i – action a_k" pair, satisfying 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, …, p_ir}, where P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, …, p_ir} ∈ P, and P denotes the total probability vector;
The action selector is used for selecting, according to the Boltzmann distribution, the action a_k(t) with the maximum probability from the action space A and executing it; the learning objectives of the optimum action are:
(1) If the distances between the robot and the obstacles on its left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) If any of the distances between the robot and the obstacles on its left, front and right lies between the minimal risk distance and the safe distance, that is, d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
The actuator is used for executing the action selected by the action selector, causing the robot state to transition;
The learning core is used for evaluating the effect of the selected action and producing an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (4);
If d_min < d_rob_obs_l < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-1);
If d_min < d_rob_obs_f < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-2);
If d_min < d_rob_obs_r < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning step fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_{rob\_tar}, \theta) = -\beta_1\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t) - \beta_2\,\mathrm{sign}(\Delta\theta(t))\,\Delta^2\theta(t)    (4)
Wherein, β1, β2 are weight coefficients, 0 < β1, β2 < 1; Δθ(t) = θ(t+1) − θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) − d_rob_tar(t);
V(d_{rob\_obs\_l}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_l}(t))\,\Delta^2 d_{rob\_obs\_l}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-1)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) − d_rob_obs_l(t);
V(d_{rob\_obs\_f}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_f}(t))\,\Delta^2 d_{rob\_obs\_f}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-2)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) − d_rob_obs_f(t);
V(d_{rob\_obs\_r}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_r}(t))\,\Delta^2 d_{rob\_obs\_r}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-3)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) − d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of implementing action a_k(t) in state s_i(t). According to the evaluation signal V_ik(t), it is determined whether the old Q value of "state s_i(t) – action a_k(t)" needs to be updated. If not, this learning step is regarded as a failure, the learning experience is discarded, and the old Q value of "state s_i(t) – action a_k(t)" is retained; otherwise, the new Q value of "state s_i(t) – action a_k(t)" is calculated according to formula (6) and the old Q value is replaced by the new one, i.e. the learning experience of this step is retained; in addition, the probability value p_ik(s_i(t), a_k(t)) of "state s_i(t) – action a_k(t)" is updated according to formula (8):
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
The termination-condition judging module is used for judging whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
CN201410215954.4A 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning Pending CN105094124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410215954.4A CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410215954.4A CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Publications (1)

Publication Number Publication Date
CN105094124A true CN105094124A (en) 2015-11-25

Family

ID=54574800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410215954.4A Pending CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Country Status (1)

Country Link
CN (1) CN105094124A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN107066967A (en) * 2017-04-12 2017-08-18 清华大学 A kind of target-seeking method and device of active face using local observation information
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN108227710A (en) * 2017-12-29 2018-06-29 商汤集团有限公司 Automatic Pilot control method and device, electronic equipment, program and medium
CN108363393A (en) * 2018-02-05 2018-08-03 腾讯科技(深圳)有限公司 A kind of smart motion equipment and its air navigation aid and storage medium
CN109547505A (en) * 2019-01-26 2019-03-29 福州大学 Multipath TCP transmission dispatching method based on intensified learning
CN109778939A (en) * 2019-03-04 2019-05-21 江苏徐工工程机械研究院有限公司 It is a kind of can contexture by self track digger arm intelligence control system and method
CN110525428A (en) * 2019-08-29 2019-12-03 合肥工业大学 A kind of automatic parking method based on the study of fuzzy deeply
CN110788865A (en) * 2019-12-09 2020-02-14 中国科学院自动化研究所 Robot control method and system based on multi-brain-area collaborative conditioned reflex model
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111401564A (en) * 2019-01-02 2020-07-10 北京地平线信息技术有限公司 Model updating method and device for machine learning, electronic equipment and storage medium
CN111443711A (en) * 2020-03-27 2020-07-24 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method and device, fire-fighting robot and readable storage medium
CN111538059A (en) * 2020-05-11 2020-08-14 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN112558605A (en) * 2020-12-06 2021-03-26 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN117873118A (en) * 2024-03-11 2024-04-12 中国科学技术大学 Storage logistics robot navigation method based on SAC algorithm and controller

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599137A (en) * 2009-07-15 2009-12-09 北京工业大学 Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior
CN101673354A (en) * 2009-06-12 2010-03-17 北京工业大学 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning
JP2014002638A (en) * 2012-06-20 2014-01-09 Mitsuba Corp Autonomous travel device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cai Jianxian et al.: "Research on autonomous navigation method of robot based on bionic strategy", Computer Simulation (《计算机仿真》) *
Cai Jianxian et al.: "Skinner operant conditioning learning model based on genetic algorithm", Systems Engineering and Electronics (《系统工程与电子技术》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740644B (en) * 2016-03-24 2018-04-13 苏州大学 A kind of clean robot optimal objective paths planning method based on model learning
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN107066967B (en) * 2017-04-12 2020-06-02 清华大学 Active face searching method and device by using local observation information
CN107066967A (en) * 2017-04-12 2017-08-18 清华大学 A kind of target-seeking method and device of active face using local observation information
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107479547B (en) * 2017-08-11 2020-11-24 同济大学 Decision tree behavior decision algorithm based on teaching learning
CN108227710A (en) * 2017-12-29 2018-06-29 商汤集团有限公司 Automatic Pilot control method and device, electronic equipment, program and medium
CN108363393A (en) * 2018-02-05 2018-08-03 腾讯科技(深圳)有限公司 A kind of smart motion equipment and its air navigation aid and storage medium
US11247701B2 (en) 2018-02-05 2022-02-15 Tencent Technology (Shenzhen) Company Ltd Smart moving device, navigation method thereof, and storage medium
CN111401564A (en) * 2019-01-02 2020-07-10 北京地平线信息技术有限公司 Model updating method and device for machine learning, electronic equipment and storage medium
CN109547505B (en) * 2019-01-26 2021-05-18 福州大学 Multipath TCP transmission scheduling method based on reinforcement learning
CN109547505A (en) * 2019-01-26 2019-03-29 福州大学 Multipath TCP transmission dispatching method based on intensified learning
CN109778939A (en) * 2019-03-04 2019-05-21 江苏徐工工程机械研究院有限公司 It is a kind of can contexture by self track digger arm intelligence control system and method
CN109778939B (en) * 2019-03-04 2021-07-09 江苏徐工工程机械研究院有限公司 Excavator arm intelligent control system and method capable of automatically planning track
CN110525428A (en) * 2019-08-29 2019-12-03 合肥工业大学 A kind of automatic parking method based on the study of fuzzy deeply
CN110525428B (en) * 2019-08-29 2020-09-04 合肥工业大学 Automatic parking method based on fuzzy depth reinforcement learning
CN110788865A (en) * 2019-12-09 2020-02-14 中国科学院自动化研究所 Robot control method and system based on multi-brain-area collaborative conditioned reflex model
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111443711A (en) * 2020-03-27 2020-07-24 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method and device, fire-fighting robot and readable storage medium
CN111443711B (en) * 2020-03-27 2023-05-23 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method, device, fire-fighting robot and readable storage medium
CN111538059A (en) * 2020-05-11 2020-08-14 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN112558605A (en) * 2020-12-06 2021-03-26 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN112558605B (en) * 2020-12-06 2022-12-16 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN117873118A (en) * 2024-03-11 2024-04-12 中国科学技术大学 Storage logistics robot navigation method based on SAC algorithm and controller

Similar Documents

Publication Publication Date Title
CN105094124A (en) Method and model for performing independent path exploration based on operant conditioning
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
CN110969848B (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN113805572B (en) Method and device for motion planning
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN110745136A (en) Driving self-adaptive control method
Wang et al. Continuous control for automated lane change behavior based on deep deterministic policy gradient algorithm
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN108731684A (en) A kind of Route planner of multiple no-manned plane Cooperative Area monitoring
CN106444738A (en) Mobile robot path planning method based on dynamic motion primitive learning model
CN104571113A (en) Route planning method for mobile robot
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN113478486B (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN106598058A (en) Intrinsically motivated extreme learning machine autonomous development system and operating method thereof
CN114895707B (en) Agricultural unmanned aerial vehicle path planning method and system based on variable frequency bat algorithm
Crosato et al. Human-centric autonomous driving in an av-pedestrian interactive environment using svo
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN114077807A (en) Computer implementation method and equipment for controlling mobile robot based on semantic environment diagram
US20230166397A1 (en) Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of snn
Zijian et al. Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments
Li et al. A UAV coverage path planning algorithm based on double deep q-network
CN104570738A (en) Robot track tracing method based on Skinner operant conditioning automata
CN116243727A (en) Unmanned carrier countermeasure and obstacle avoidance method for progressive deep reinforcement learning
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151125