CN105094124A - Method and model for performing independent path exploration based on operant conditioning - Google Patents

Method and model for performing independent path exploration based on operant conditioning

Info

Publication number
CN105094124A
CN105094124A (application CN201410215954.4A)
Authority
CN
China
Prior art keywords: robot, state, rob, obs, tar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410215954.4A
Other languages
Chinese (zh)
Inventor
蔡建羡
洪利
于瑞红
马洪蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Disaster Prevention
Original Assignee
Institute of Disaster Prevention
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Disaster Prevention
Priority to CN201410215954.4A
Publication of CN105094124A
Legal status: Pending


Abstract

The invention provides a method and a model for autonomous path exploration based on operant conditioning. The method comprises: setting initial parameter values, including a discretized environment state space and an action space; at step t, acquiring the robot's current environment parameter values through sonar sensors and determining the grade state S_i(t) to which each environment parameter value belongs; selecting the action a_k(t) with the highest probability from the action space A and executing it; and, after the selected action a_k(t) is executed and the robot's state has transitioned, evaluating the effect of performing operation a_k(t) in state S_i(t) and deciding, according to the evaluation signal V_ik(t), whether the original Q value of the state S_i(t) - operation a_k(t) pair needs to be updated. The method simulates the biological operant-conditioning mechanism, possesses bionic self-organization, self-learning and self-adaptation capabilities, effectively applies bionics, psychology and biology to the control system, and realizes bionic autonomous learning control.

Description

Method and model for autonomous path exploration based on operant conditioning
Technical field
The invention belongs to the field of path exploration technology, and specifically relates to a method and a model for autonomous path exploration based on operant conditioning.
Background technology
Skinner's operant conditioning (Operant Conditioning) theory plays a key role in the learning of humans and animals. Its basic principle is: if, under certain subjective and objective conditions, the consequence of a certain behavior satisfies the organism's orientation, then under similar conditions the probability that the organism performs a similar behavior rises. In addition, Skinner strongly emphasized the role of reinforcement; he held that whether human learning is established hinges on reinforcement. When an operation occurs and is immediately followed by a reinforcing stimulus, the strength of that operation increases. What is strengthened is not an "S-R" connection but the general tendency of the response to occur, i.e. the probability of the response is enhanced. In his view, although practice is important in learning, the key variable is reinforcement.
In an unknown environment, because the robot has no suitable teacher signal, the robot's learning ability becomes the key to the autonomous path exploration task. For example, the patent application No. 200910044273.5, entitled "Method for ensuring navigation safety of mobile robots in unknown environments", introduces fuzzy neural network technology to realize autonomous learning and exploration of a robot in an unknown environment so as to guarantee safety during motion. Although fuzzy-neural-network-based robot research links the robot's control behavior with neurophysiology and cognitive science, this link remains loose and passive: the motion-control skills of the robot body are based on declarative control laws with too much hand design, lacking the bionic autonomous learning and self-organization characteristics of biological systems, so truly autonomous path exploration is not achieved.
Operant conditioning learning has strong online adaptivity and self-learning ability for complex systems and is well suited to autonomous robot navigation learning in unknown environments. Because of this advantage, operant conditioning learning has become a focus of current research on bionic autonomous learning. In recent years, learning models designed on the operant conditioning mechanism have made some progress. For example, the patent application No. 200910086990.4, entitled "Operant conditioning automaton and its application in bionic autonomous learning control", designs an operant conditioning model with a self-organizing function based on a finite-state automaton and solves the control problem of an inverted pendulum system; the patent application No. 200910089263.3, entitled "Autonomous operant conditioning automaton and its application in realizing intelligent behavior", likewise designs an operant conditioning automaton and applies it to the balance control of a coaxial two-wheeled robot.
However, the learning mechanism of the operant conditioning automata designed in the above patent applications in fact still follows that of a general finite-state automaton, and they have only been applied to simple balance-control problems; it is difficult to apply them to the complex control task of robot path exploration.
Summary of the invention
In view of the defects of the prior art, the invention provides a method and a model for autonomous path exploration based on operant conditioning, which can be effectively applied to the field of robot path exploration.
The technical solution adopted by the invention is as follows:
The invention provides a method for autonomous path exploration based on operant conditioning, comprising the following steps:
Step 1: set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
Set the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l -- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f -- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r -- the distance state to the obstacle on the robot's right;
d̃_rob_tar -- the distance state between the robot and the target point;
θ̃ -- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
Set the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; set the safe distance d_max and the minimum danger distance d_min;
Step 2: the robot is equipped with multiple sonar sensors. While the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point;
Determine the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S;
Calculate the information entropy H_i(t) of the current state S_i(t) according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector;
Step 3: according to the Boltzmann distribution, select from the action space A the action a_k(t) with the highest probability and execute it. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
Step 4: after the chosen action a_k(t) has been executed and the robot's state has transitioned, calculate the information entropy H_i(t+1) of the new state s_i(t+1);
Judge whether the current training round number exceeds the preset training round number N; if it does, reject the states whose information entropy has remained at its maximum, and go to Step 5; otherwise, go to Step 5 directly;
Step 5: first, obtain the distances between the robot and the obstacles in the new state:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), decide whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; discard this learning experience, retain the original Q value of the pair, and go to Step 6. Otherwise, calculate the new Q value of the pair according to formula (6) and replace the original Q value with it, i.e. retain this step's learning experience; also update the probability value p_ik(s_i(t), a_k(t)) of the pair according to formula (8), and go to Step 6:
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action, and γ(p_ik) is the learning-rate function at step t, expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
In the probability update of formula (8), T is the temperature coefficient and T_0 the initial temperature; as time t increases, T decays from T_0, and the annealing parameter controls the decay speed;
Step 6: judge whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, end this round of learning. If not, set t = t + 1 and let the robot continue path exploration with the adjusted Q values and probability values, repeating Steps 2-6, until the robot reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
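For readability, the sketch below arranges Steps 1-6 as a single learning loop in Python. It is an illustrative outline only, not the patented implementation: the callables perceive, grade_state, entropy, select_action, execute, evaluate, update_q, update_probability and at_target are hypothetical names standing in for the operations described above.

```python
def run_learning_round(perceive, grade_state, entropy, select_action, execute,
                       evaluate, update_q, update_probability, at_target, t_f):
    """One round of path-exploration learning following Steps 1-6 (illustrative outline)."""
    t = 0
    while True:
        readings = perceive()                            # Step 2: sonar distances and angle to target
        s = grade_state(readings)                        # Step 2: discrete environment state S_i(t)
        entropy(s)                                       # Step 2: information entropy H_i(t), formula (3)
        a = select_action(s)                             # Step 3: Boltzmann selection of action a_k(t)
        new_readings = execute(a)                        # Step 4: the robot's state transitions
        s_next = grade_state(new_readings)
        entropy(s_next)                                  # Step 4: entropy of the new state
        v = evaluate(readings, new_readings)             # Step 5: evaluation signal V_ik(t)
        if v is not None:                                # a failed step keeps the old Q value
            update_q(s, a, s_next, v)                    # formula (6)
            update_probability(s, a)                     # formula (8)
        t += 1
        if at_target(new_readings) or t > t_f:           # Step 6: termination condition
            return t
```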
The invention also provides an autonomous exploration cognitive model for autonomous path exploration based on operant conditioning, comprising: a perceptron, an actuator, a state editor, a parameter setting module, a termination condition judging module and an exploration learning policy module, wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module.
The parameter setting module is used to set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
It sets the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
It also sets the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; it further sets the safe distance d_max and the minimum danger distance d_min.
The perceptron consists of multiple sonar sensors; while the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
The state editor determines the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
The tendency unit measures the robot's tendency degree toward an environment state by computing the entropy of that state, specifically: the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
The action selector, according to the Boltzmann distribution, selects from the action space A the action a_k(t) with the highest probability for execution. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
The actuator executes the action chosen by the action selector, so that the robot's state transitions.
The learning core evaluates the effect of the selected action and produces an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), the learning core decides whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; this learning experience is discarded and the original Q value of the pair is retained. Otherwise, the new Q value of the pair is calculated according to formula (6) and the original Q value is replaced with it, i.e. this step's learning experience is retained; in addition, the probability value p_ik(s_i(t), a_k(t)) of the pair is updated according to formula (8):
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action, and γ(p_ik) is the learning-rate function at step t, expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
In the probability update of formula (8), T is the temperature coefficient and T_0 the initial temperature; as time t increases, T decays from T_0, and the annealing parameter controls the decay speed.
The termination condition judging module judges whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, this round of learning ends. If not, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until it reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
The method and model for autonomous path exploration based on operant conditioning provided by the invention possess bionic self-organization, self-learning and self-adaptation functions, effectively apply bionics, psychology and biology to the control system, and realize bionic autonomous learning control. The invention offers a new way of thinking and a new method for research on bionic autonomous learning, and helps to advance related research on intelligent control and intelligent learning algorithms.
Brief description of the drawings
Fig. 1 is a flow chart of the method for autonomous path exploration based on operant conditioning provided by the invention;
Fig. 2 is a schematic diagram of the relationship between the robot, the obstacles and the target point provided by the invention;
Fig. 3 is a schematic diagram of the interaction between the agent and the environment;
Fig. 4 is a schematic structural diagram of the autonomous exploration cognitive model for autonomous path exploration based on operant conditioning;
Fig. 5 is a curve of training round versus number of learning steps;
Fig. 6 is a curve of training round versus number of steps to successful termination;
Fig. 7 is a curve of the information entropy of a certain state over the 20-round training process;
Fig. 8 is the path exploration trajectory of the robot in the early learning period;
Fig. 9 is the path exploration trajectory of the robot in the late learning period;
Fig. 10 is the path exploration trajectory with starting point (5, 20) and end point (55, 40);
Fig. 11 is the path exploration trajectory with starting point (30, 5) and end point (40, 42).
Embodiments
The invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a method and a model for autonomous path exploration based on operant conditioning. The model simulates the operant conditioning mechanism: a tendency unit designed with information entropy characterizes the tendency degree toward each state and automatically deletes redundant states; an improved Q-learning algorithm serves as the learning core to guide the direction of learning; and a Boltzmann machine performs annealing so that navigation actions are chosen stochastically, with the action-selection strategy tending toward the optimum as learning proceeds. Combined with the Mobotsim robot simulation software, the feasibility of using this model to realize mobile-robot path exploration in an unknown environment is verified on the autonomous path exploration problem of a mobile robot.
Specifically: during autonomous navigation in an unknown environment, while approaching the target point, once the robot encounters an obstacle it uses the information provided by the sonar sensors and the designed bionic autonomous learning algorithm to implement an obstacle-avoidance strategy. The learning algorithm includes the calculation of the tendency unit, the selection of actions and the update and adjustment of the Q values.
To simplify the study of the path exploration algorithm, it is assumed that the robot can move in any direction with a constant step length of one grid side. The robot can also turn freely in narrow environment regions without colliding with obstacles, so the turning radius of the robot need not be considered in the path exploration algorithm and the robot is reduced to a particle. The relationship between the robot, the obstacles and the target point is shown in Fig. 2, where R denotes the robot, T the target point, O an obstacle, O1, O2 and O3 three obstacles, and the arrow L the robot's motion direction.
The method specifically comprises the following steps:
Step 1: set initial parameter values, including:
(1) the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
(2) the environment state space
The robot learns continuously during path exploration; the result of learning makes the environment state it occupies transition, the learning produces new interactions, and the new interactions in turn trigger further learning. In this iteration of the two, the robot's actions tend toward the optimum for the given task and gradually adapt to the unknown environment. To facilitate interaction with the environment, the environment state is usually decomposed so that the whole environment forms a state space in which learning can proceed step by step.
The invention adopts the simplest sensor-data-fusion scheme: the detection range of the sensors is divided into three zones, namely left, front and right, so that the environment state between the robot and the obstacles can be represented by the distance states in these three directions. Considering that, for robot path exploration, the same obstacle state may contain different target states, the distance state between the robot and the target point and the angle state with respect to the target point are also added to the environment state space. Thus, the robot's environment state space is S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is:
s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃};
where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval. The way the environment state is divided directly affects the learning performance. As a concrete example, as shown in Table 1, each state variable can be divided into 5 grade states: very small, small, medium, large and very large. For example, for the distance state d̃_rob_obs_l to the obstacle on the robot's left, the very small state means the distance to the left obstacle lies in the range 100-500; the small state means it lies in 500-1500; the medium state means it lies in 1500-2500; the large state means it lies in 2500-4000; and the very large state means it lies in 4000-5000.
Table 1. Discrete division of the environment state space
With the division of Table 1, the robot's environment state space has 5^5 = 3125 states, i.e. for the expression S = {s_i | i = 1, 2, ..., n}, n = 3125. In addition, for example, the state S_6 = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} = {B1, B1, B3, B5, B4} means: the robot's distance to the left obstacle ∈ (100, 500); the distance to the front obstacle ∈ (100, 500); the distance to the right obstacle ∈ (1500, 2500); the distance between the robot and the target point ∈ (6000, ∞); and the angle between the robot's current motion direction and the target point ∈ (100, 140).
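As an illustration of the fuzzified grading in the Table 1 example, the snippet below maps a raw left-side distance reading onto the five grade states B1-B5 using the example intervals quoted above (100-500, 500-1500, 1500-2500, 2500-4000, 4000-5000); the function name and the treatment of the interval boundaries are assumptions introduced for illustration.

```python
# Example grade boundaries for the left-side obstacle distance, taken from the Table 1 example (mm).
LEFT_DISTANCE_GRADES = [
    ("B1", 100, 500),     # very small
    ("B2", 500, 1500),    # small
    ("B3", 1500, 2500),   # medium
    ("B4", 2500, 4000),   # large
    ("B5", 4000, 5000),   # very large
]

def grade_left_distance(d):
    """Return the grade state (B1-B5) for a left-side obstacle distance in mm."""
    for name, lo, hi in LEFT_DISTANCE_GRADES:
        if lo <= d < hi:
            return name
    return None  # outside the range the sensor considers

print(grade_left_distance(200))   # -> B1 (very small)
print(grade_left_distance(1800))  # -> B3 (medium)
```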
(3) the action space
In essence, the learning process that simulates the operant conditioning mechanism is a process in which the robot, by being continuously rewarded or punished, establishes a mapping from the state space to the action space. In the path planning of a robot moving toward a target point, the robot's actions mainly take the form of single-step advances, rotations through an angle, obstacle-edge following, and stopping. Therefore the action space is set as A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimum danger distance d_min are also set.
The definition of the action space must follow two principles: first, there must be enough action forms to accomplish the robot path exploration task; second, the actions should be kept simple and not too numerous, otherwise the learning burden becomes excessive. Taking both points into account, as one example (see formula (2)), the robot's action space A can be composed of seven discrete actions:
A = {a_1, a_2, a_3, a_4, a_5, a_6, a_7}    (2)
where:
a_1 --- the robot turns left 30° while advancing 100 mm;
a_2 --- the robot turns left 15° while advancing 100 mm;
a_3 --- the robot turns left 10° while advancing 100 mm;
a_4 --- the robot turns 0° while advancing 100 mm;
a_5 --- the robot turns right 10° while advancing 100 mm;
a_6 --- the robot turns right 15° while advancing 100 mm;
a_7 --- the robot turns right 30° while advancing 100 mm.
Since the maximum detection range of the sonar sensors fitted to the robot is 5000 mm and the minimum detection range is 100 mm, in this case one can define d_min = 100 mm as the minimum danger distance and d_max = 5000 mm as the safe distance, with d ∈ (d_min, d_max) being the obstacle-avoidance zone; the angle between the robot's current direction and the target point is θ ∈ [-180°, 180°]. Thus 100 mm < d_rob_obs_l, d_rob_obs_f, d_rob_obs_r < 5000 mm, i.e. obstacles farther than 5000 mm from the robot or closer than 100 mm are not considered.
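The seven-action space of formula (2) and the distance thresholds can be written down directly. The sketch below is one possible encoding, where each action is stored as (turn angle in degrees, advance in mm) and applied to a simple (x, y, heading) particle pose; the pose-update helper and the sign convention (positive = left turn) are assumptions added for illustration.

```python
import math

D_MIN = 100.0    # minimum danger distance in mm
D_MAX = 5000.0   # safe distance in mm

# Action space A = {a_1, ..., a_7}: (turn angle in degrees, advance in mm); positive turns left.
ACTIONS = [(+30, 100), (+15, 100), (+10, 100), (0, 100), (-10, 100), (-15, 100), (-30, 100)]

def apply_action(x, y, heading_deg, action):
    """Apply one discrete action to a particle-robot pose (illustrative kinematics)."""
    turn, advance = action
    heading = heading_deg + turn
    rad = math.radians(heading)
    return x + advance * math.cos(rad), y + advance * math.sin(rad), heading

print(apply_action(0.0, 0.0, 90.0, ACTIONS[0]))  # a_1: turn left 30 deg, advance 100 mm
```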
Step 2: the robot is equipped with multiple sonar sensors. While the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
In a practical application, the robot can be equipped with 16 sonar sensors with angular spacings of 20° or 40° between neighbouring sensors, so that the sonar sensors cover the full 0-360° range around the robot. The robot can also rotate freely within 0-360°.
Determine the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
For example, if at step t the robot's current environment parameter values are: distance to the left obstacle d^t_rob_obs_l = 200, distance to the right obstacle d^t_rob_obs_r = 450, distance to the front obstacle d^t_rob_obs_f = 1800, distance to the target point d^t_rob_tar = 7800, and angle between the current motion direction and the target point θ_t = 120, and S_6 = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} = {B1, B1, B3, B5, B4}, then the environment state of the robot at step t is S_6.
The complexity of bionic autonomous learning rises exponentially with the number of environment states contained in the environment state space, yet the environment states relevant to the optimal learning effect account for less than a quarter of the total; most environment states are ones the robot never experiences or experiences only rarely. In view of this, information entropy is introduced, and the robot's tendency degree toward an environment state is measured by computing the entropy of that state. In information theory, entropy can be used as a measure of the uncertainty of an event: the greater the amount of information, the more regular the system structure, the more complete its function, and the smaller the entropy. With the concept of entropy, the measurement, transmission, conversion and storage of information can be studied theoretically.
Because the robot acts somewhat blindly in the initial stage of path exploration, the probabilities of the initial actions are close to one another and the information entropy of every state is at its maximum. As learning proceeds, the probabilities of the actions under the experienced environment states change and their information entropy decreases accordingly. The more often an environment state is experienced, the higher the robot's tendency toward that state and the lower its information entropy falls; conversely, the less often a state is experienced, the lower the robot's tendency toward it and the less its information entropy changes. In practice it is not necessary to obtain a fully optimal solution; in most cases a feasible scheme is of more practical significance than the optimal one, which also agrees with the way humans explore the world and handle problems. Therefore, during path exploration learning, states whose information entropy remains at its maximum are rejected, while any state that has been experienced is retained.
Accordingly, given the robot's discretized environment state set S = {s_i | i = 1, 2, ..., n}, the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
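Formula (3) is the Shannon entropy of the action-probability vector P_i attached to state s_i. A minimal numerical check with r = 7 actions, matching the embodiment below, is sketched here; the function name is an assumption introduced for illustration.

```python
import math

def state_entropy(p_i):
    """Information entropy H_i of a state's action-probability vector (formula (3))."""
    return -sum(p * math.log2(p) for p in p_i if p > 0)

r = 7
uniform = [1.0 / r] * r
print(round(state_entropy(uniform), 2))              # ~2.81, the maximum entropy for 7 actions
print(round(state_entropy([0.94] + [0.01] * 6), 2))  # much smaller once one action dominates
```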
Step 3: according to the Boltzmann distribution, select from the action space A the action a_k(t) with the highest probability and execute it. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
Because the evaluation signals of the actions are unknown at the beginning and T is chosen large, at the start of learning the exponential function in formula (8), as t → 0, tends to 1 and the selection probabilities of all actions are essentially equal, that is:
p_ik(0) ≈ 1/r    (i = 1, 2, ..., n; k = 1, 2, ..., r)
Equal selection probabilities for all actions mean that at the start of learning the robot contains no predetermined decision and adopts every decision with equal probability.
Step 4: after the chosen action a_k(t) has been executed and the robot's state has transitioned, calculate the information entropy H_i(t+1) of the new state s_i(t+1).
Judge whether the current training round number exceeds the preset training round number N; if it does, reject the states whose information entropy has remained at its maximum, and go to Step 5; otherwise, go to Step 5 directly.
Step 5: generation of the evaluation signal
During learning, according to the current environment state and the selected action, the robot produces an evaluation signal after every action step; its purpose is to evaluate the effect of the selected action. The evaluation signal decides what the robot will learn over the whole learning process and is where the robot's intelligence shows most clearly; a reasonable distribution of reward and punishment evaluation signals helps to improve learning efficiency. Because the evaluation signal assesses the selection of each individual action step, its design should follow the separation of the states; and since mobile-robot navigation exploration in an unknown environment has two goals, namely obstacle avoidance and approaching the target point, the design of the evaluation signal must take the following aspects into account:
First, obtain the distances between the robot and the obstacles in the new state:
(1) If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, i.e. the distances between the robot and the obstacles on the left, right and front are all greater than the safe distance, the possibility of a collision between the robot and an obstacle is very small, so the robot's main task is to approach the target point; the design of the evaluation signal then focuses on the relationship between the distance from the robot to the target point and the angle θ of the line from the robot to the target point.
When the robot moves toward the target point: Δθ(t) = θ(t+1) - θ(t), with Δθ(t) < 0.
When the distance between the robot and the target point decreases: Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t), with Δd_rob_tar(t) < 0.
Therefore the evaluation signal is defined as in formula (4):
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients, generally taken as 0 < β1, β2 < 1.
(2) If d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, i.e. one of the distances between the robot and the obstacles on the left, right and front lies between the danger distance and the safe distance, a collision with that obstacle is possible. Considering that the collision will not happen immediately and that the robot should not stray from the target, the path exploration strategy is designed so that the robot avoids the obstacle while approaching the target point. The design of the evaluation signal therefore focuses on the relationship between the distance from the robot to the target point and the distance from the robot to the obstacle.
Suppose the distance between the robot and the front obstacle lies between the danger distance and the safe distance; then, when the distance between the robot and the obstacle increases: Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t) > 0.
Therefore the evaluation signal is defined as in formula (5):
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients, generally taken as 0 < β3, β4 < 1.
Note: when the distance between the robot and the left or right obstacle lies between the danger distance and the safe distance, the evaluation function is designed in the same way.
(3) If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, i.e. one of the distances between the robot and the obstacles on the left, right and front is smaller than the danger distance, the possibility of a collision with an obstacle is very high and the maximum punishment should be given immediately: the learning fails and the penalty value V_ik(t) = -1 is assigned.
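The case analysis above can be condensed into one function. The sketch below computes V_ik(t) from two consecutive sensor readings using formulas (4), (5-1)-(5-3) and the penalty rule; the reading layout (a dict of distances and the angle) and the weight values are assumptions for illustration, not the patented implementation.

```python
import math

def sign(x):
    return math.copysign(1.0, x) if x != 0 else 0.0

def evaluation_signal(prev, curr, d_min=100.0, d_max=5000.0,
                      beta1=0.65, beta2=0.36, beta3=0.72, beta4=0.44):
    """Evaluation signal V_ik(t) from consecutive readings (formulas (4), (5-x), penalty rule)."""
    d_tar = curr["d_tar"] - prev["d_tar"]        # delta d_rob_tar(t)
    d_theta = curr["theta"] - prev["theta"]      # delta theta(t)
    goal_term = -beta4 * sign(d_tar) * d_tar ** 2
    dists = ["d_obs_l", "d_obs_f", "d_obs_r"]
    if any(curr[k] < d_min for k in dists):      # closer than the danger distance: maximum punishment
        return -1.0
    if all(curr[k] > d_max for k in dists):      # free space: approach the target, formula (4)
        return -beta1 * sign(d_tar) * d_tar ** 2 - beta2 * sign(d_theta) * d_theta ** 2
    for k in dists:                              # avoidance zone: formulas (5-1)-(5-3)
        if d_min < curr[k] < d_max:
            d_obs = curr[k] - prev[k]
            return beta3 * sign(d_obs) * d_obs ** 2 + goal_term
    return None                                  # no applicable case: treat as a failed step

prev = {"d_obs_l": 1200, "d_obs_f": 3000, "d_obs_r": 2500, "d_tar": 7800, "theta": 40}
curr = {"d_obs_l": 1400, "d_obs_f": 3100, "d_obs_r": 2600, "d_tar": 7600, "theta": 35}
print(evaluation_signal(prev, curr))
```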
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of performing operation a_k(t) in state s_i(t). According to V_ik(t), decide whether the original Q value of the "state s_i(t) - operation a_k(t)" pair needs to be updated. If not, this learning step has failed; discard this learning experience, retain the original Q value of the pair, and go to Step 6. Otherwise, calculate the new Q value of the pair according to formula (6) and replace the original Q value with it, i.e. retain this step's learning experience; also update the probability value p_ik(s_i(t), a_k(t)) of the pair according to formula (8), and go to Step 6:
Q(s_i(t), a_k(t)) = (1 - γ(p_ik))·Q(s_i(t-1), a_k(t-1)) + γ(p_ik)·[V_ik(t) + η·max_{a_k} Q(s_i(t+1), a_k)]    (6)
where η is the discount factor, representing how much the learning system attends to the action: a smaller value means the system attends mainly to the effect of recent actions, while a larger value means actions over a long time span all receive attention. γ(p_ik) is the learning-rate function at step t and controls the learning speed: a larger γ(p_ik) gives faster convergence, but a γ(p_ik) that is too large may prevent convergence. The learning-rate function is expressed by formula (7):
γ(p_ik) = 1 / (1 + exp[(p_ik(t) - p_ik(t+1)) / p_ik(t)])    (7)
By incorporating the excitation probability p_ik of the "state s_i - action a_k" pair into the learning-rate function γ(p_ik), the Q value of each "state s_i - action a_k" pair is adjusted at its own speed. The designed learning-rate function therefore not only influences the learning speed, but also makes the adjustment of the Q values exhibit an orientation characteristic closer to animal operant conditioning. As can be seen from formula (7), the adjustment of the Q value of a "state s_i - action a_k" pair is mainly determined by the change of the excitation probability, p_ik(t) - p_ik(t+1): if p_ik(t) - p_ik(t+1) < 0, then in subsequent learning the Q value of that pair tends to be adjusted faster, i.e. the learning speed accelerates; conversely, if p_ik(t) - p_ik(t+1) > 0, the Q value of that pair tends to be adjusted more slowly, i.e. the learning speed slows down.
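Formulas (6) and (7) translate directly into code. The sketch below keeps Q values in a dictionary keyed by (state, action) pairs; the data layout, parameter names and example values are assumptions introduced for illustration.

```python
import math

def learning_rate(p_now, p_next):
    """Learning-rate function gamma(p_ik) of formula (7)."""
    return 1.0 / (1.0 + math.exp((p_now - p_next) / p_now))

def update_q(Q, s_prev, a_prev, s, a, s_next, v, p_now, p_next, eta=0.14, actions=range(7)):
    """Q-value adjustment of formula (6) for the pair (s_i(t), a_k(t))."""
    gamma = learning_rate(p_now, p_next)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = ((1.0 - gamma) * Q.get((s_prev, a_prev), 0.0)
                 + gamma * (v + eta * best_next))
    return Q[(s, a)]

Q = {}
print(update_q(Q, s_prev=5, a_prev=2, s=6, a=2, s_next=7, v=0.3, p_now=1 / 7, p_next=0.2))
```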
The action selection strategy is the core of the robot's path exploration. In the early stage of path exploration learning the main task is to explore the environment, so the randomness of action selection should be large; in the late stage the learning should converge, so the randomness of action selection should be small. Here a Boltzmann machine is used to perform the annealing computation, and the probability with which an action a_k is selected is updated by formula (8), in which T is the temperature coefficient, T_0 is the initial temperature and, as time t increases, T decays from T_0, with the annealing parameter controlling the decay speed.
In formula (8), the temperature coefficient T determines the degree of randomness of action selection: the larger T, the closer the probabilities of the different actions and the more random the selection; the smaller T, the larger the probability differences between the actions, the larger the selection probability of the action with the largest Q value, and the less random the selection. In the initial stage of learning a larger T can be set so that exploration is sufficient; as the number of learning steps increases and the system accumulates more and more experiential knowledge, T decays from T_0, and as the number of learning steps t → ∞, T decays to zero, the selection probability of the action with the largest Q value tends to 1. This shows that the system passes from initially blind incidental learning to nearly deterministic learning, and the operant conditioning reflex is formed.
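Formula (8) itself is not reproduced in this text, but the passage describes a Boltzmann (softmax) selection rule whose temperature decays from T_0 as learning proceeds. The sketch below implements that reading as an assumption: both the softmax form and the exponential decay schedule T = T_0·exp(-λ·t) are illustrative choices, not the patented formula.

```python
import math
import random

def boltzmann_probabilities(q_values, T):
    """Softmax selection probabilities over one state's Q values at temperature T (assumed form of formula (8))."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / T) for q in q_values]  # shift by the max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def select_action(q_values, t, T0=10.0, lam=0.05):
    """Choose an action index; T decays from T0 so selection moves from random to nearly greedy."""
    T = max(T0 * math.exp(-lam * t), 1e-3)
    probs = boltzmann_probabilities(q_values, T)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [0.2, 0.5, 0.1, 0.0, 0.3, 0.4, 0.6]
print(select_action(q, t=0))    # early in learning: nearly uniform choice
print(select_action(q, t=200))  # late in learning: almost always the largest-Q action
```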
Step 6: judge whether, in the new state, the robot has reached the target point or has exceeded the initially set number of iterative learning steps t_f. If so, end this round of learning. If not, set t = t + 1 and let the robot continue path exploration with the adjusted Q values and probability values, repeating Steps 2-6, until the robot reaches the target point or exceeds the initially set number of iterative learning steps t_f, at which point this round of learning ends.
Convergence of the learning algorithm:
From the properties of information entropy, the operation-behavior entropy is largest when the probabilities p_k(t) of all operation behaviors a_k(t) are equal. Therefore, at the initial moment of learning, all operation behaviors a_k(t) are given the same selection probability p_ik(t). Rearranging formula (3) gives:
H_i(t) = H_i(A(t) | s_i) = -∑_{k=1}^{r} p(a_k | s_i)·log₂ p(a_k | s_i) = -[p(a_k | s_i)·log₂ p(a_k | s_i) + ∑_{k'=1, k'≠k}^{r} p(a_k' | s_i)·log₂ p(a_k' | s_i)]    (9)
As learning proceeds, the probability function p_ik(t) is updated; since lim_{t→∞} p_ik'(a_k'(t) | s_i(t)) → 0 for the non-selected actions, substituting into the above formula gives:
lim_{t→∞} H_i(t) = lim_{t→∞} [-p(a_k | s_i)·log₂ p(a_k | s_i) - ∑_{k'=1, k'≠k}^{r} p(a_k' | s_i)·log₂ p(a_k' | s_i)] → H_i^min ≈ 0    (10)
Therefore the information entropy H_i({A_i | s_i}) corresponding to each state s_i of the autonomous path exploration model converges to a minimal value as learning proceeds.
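Formulas (9) and (10) say that once the selection probability mass concentrates on one action, the state's entropy falls toward zero. A short numerical check of this limit, with the probability mass increasingly concentrated on a single action (an assumed concentration pattern, for illustration only), is sketched below.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

r = 7
for share in (1 / 7, 0.5, 0.9, 0.99, 0.9999):
    # one action takes 'share' of the probability mass; the rest is spread over the other r-1 actions
    p = [share] + [(1 - share) / (r - 1)] * (r - 1)
    print(f"dominant probability {share:.4f} -> H_i = {entropy(p):.4f}")
```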
The operant conditioning principle is significant not only for understanding the learning behavior of humans and animals, but also for "copying" that learning behavior in robots or machine systems. It is regarded as the most basic form of learning in biological systems; learning based on the operant conditioning mechanism focuses on the interaction between the agent and the environment, and its learning process is shown in Fig. 3.
An operant conditioning learning model consists of three modules: the perceptron is responsible for acquiring the environment state, the learner is responsible for the learning mechanism, and the action selector is responsible for selecting a suitable action from the action space and executing it. A loop is thus formed: the robot continuously perceives the environment and executes the selected actions, thereby changing the environment. According to the change of the environment state after an action is executed, an evaluation signal (reward or punishment) is given to the action the agent performed, and the learning process repeats round after round. From the process of learning by operant conditioning it can be seen that the learning and adaptation process of higher mammals is very similar to the interaction between a robot and its environment.
Simulating the operant conditioning learning mechanism, the autonomous exploration cognitive model for autonomous path exploration based on operant conditioning provided by the invention is shown in Fig. 4. The state editing module solves the problem of the discrete division of the environment state; the exploration learning policy module solves the problem of mapping the state space to the action space; and the environment interaction module solves the problems of evaluation-signal generation and state transition. During path exploration learning, the mobile robot keeps interacting with the environment and, through learning and training, reaches the set goal.
Specifically, the model comprises: a perceptron, an actuator, a state editor, a parameter setting module, a termination condition judging module and an exploration learning policy module, wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module.
The parameter setting module is used to set initial parameter values, including: the robot's starting position information; the initial iterative learning step number t = 0; the number of iterative learning steps t_f; the sampling time t_s; and the learning-algorithm parameters, including the weight coefficients β1, β2, β3, β4, the initial annealing temperature T_0, the annealing parameter, and the discount factor η;
It sets the robot's environment state space S = {s_i | i = 1, 2, ..., n}, where each s_i is represented by five state variables, that is: s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; where:
d̃_rob_obs_l --- the distance state to the obstacle on the robot's left;
d̃_rob_obs_f --- the distance state to the obstacle in front of the robot;
d̃_rob_obs_r --- the distance state to the obstacle on the robot's right;
d̃_rob_tar --- the distance state between the robot and the target point;
θ̃ --- the angle state between the robot's current motion direction and the target point;
Each state variable is divided by a fuzzification method into a finite number of grade states, each grade state corresponding to a distance interval;
It also sets the action space A = {a_k | k = 1, 2, ..., r}, where a_k denotes the k-th optional action of mobile-robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; it further sets the safe distance d_max and the minimum danger distance d_min.
The perceptron consists of multiple sonar sensors; while the robot walks, at step t the sonar sensors acquire the robot's current environment parameter values, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ_t between the current motion direction and the target point.
The state editor determines the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state S_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot is located at step t, with S_i(t) ∈ the environment state space S.
The tendency unit measures the robot's tendency degree toward an environment state by computing the entropy of that state, specifically: the information entropy H_i(t) of the current state S_i(t) is calculated according to formula (3):
H_i(t) = H_i(A(t) | s_i(t)) = -∑_{k=1}^{r} p_ik·log₂ p_ik = -∑_{k=1}^{r} p(a_k | s_i(t))·log₂ p(a_k | s_i(t))    (3)
where a_k ∈ A = {a_k | k = 1, 2, ..., r} denotes the k-th optional action of mobile-robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of performing action a_k while in environment state S_i(t), also called the excitation probability of the "state s_i - action a_k" pair, and satisfies 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, ..., p_ir}; P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, ..., p_ir} ∈ P, and P denotes the total probability vector.
The action selector, according to the Boltzmann distribution, selects from the action space A the action a_k(t) with the highest probability for execution. The learning objective of the optimal action is:
(1) if the distances between the robot and the obstacles on the left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) if any of the distances between the robot and the obstacles on the left, front and right lies between the safe distance and the minimum danger distance, i.e. d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle.
The actuator executes the action chosen by the action selector, so that the robot's state transitions.
The learning core evaluates the effect of the selected action and produces an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, calculate the evaluation signal V_ik(t) of the "state s_i(t) - operation a_k(t)" pair according to formula (4);
If d_min < d_rob_obs_l < d_max, calculate the evaluation signal V_ik(t) according to formula (5-1);
If d_min < d_rob_obs_f < d_max, calculate the evaluation signal V_ik(t) according to formula (5-2);
If d_min < d_rob_obs_r < d_max, calculate the evaluation signal V_ik(t) according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_rob_tar, θ) = -β1·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t) - β2·sign(Δθ(t))·Δ²θ(t)    (4)
where β1, β2 are weight coefficients with 0 < β1, β2 < 1; Δθ(t) = θ(t+1) - θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) - d_rob_tar(t);
V(d_rob_obs_l, d_rob_tar) = β3·sign(Δd_rob_obs_l(t))·Δ²d_rob_obs_l(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-1)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) - d_rob_obs_l(t);
V(d_rob_obs_f, d_rob_tar) = β3·sign(Δd_rob_obs_f(t))·Δ²d_rob_obs_f(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-2)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) - d_rob_obs_f(t);
V(d_rob_obs_r, d_rob_tar) = β3·sign(Δd_rob_obs_r(t))·Δ²d_rob_obs_r(t) - β4·sign(Δd_rob_tar(t))·Δ²d_rob_tar(t)    (5-3)
where β3, β4 are weight coefficients with 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) - d_rob_obs_r(t);
The evaluation signal V calculated ikt () essence is used for evaluating at state s iunder (t), implementation and operation a kt the action effect of (), according to evaluation signal V ikt (), determines the need of renewal " state s i(t)-operation a k(t) " former Q value, if do not needed, then show that this step learns unsuccessfully, abandon this learning experience, retain " state s i(t)-operation a k(t) " former Q value; Otherwise, according to formula (6), calculate " state s i(t)-operation a k(t) " new Q value, and by " state s i(t)-operation a k(t) " former Q value be updated to new Q value, namely remain the learning experience of this step; And, according to formula (8), upgrade " state s i(t)-operation a k(t) " probable value p ik(s i(t), a k(t)):
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
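The following Python sketch shows how the update of formula (6), with the learning rate of formula (7), might be carried out for one state–action pair. The two-dimensional Q table and the way the previous estimate is stored are assumptions for illustration; the probability update of formula (8) is not reproduced in this text and is therefore omitted from the sketch.

```python
import numpy as np

def learning_rate(p_t, p_t1):
    """Formula (7): gamma(p_ik) = 1 / (1 + exp((p_ik(t) - p_ik(t+1)) / p_ik(t)))."""
    return 1.0 / (1.0 + np.exp((p_t - p_t1) / p_t))

def update_q(Q, s, a, s_next, V, p_t, p_t1, eta=0.14):
    """Formula (6): blend the previous Q value with the evaluated return.

    Q   : 2-D array indexed by (state index, action index)
    V   : evaluation signal V_ik(t)
    eta : discount factor (0.14 in the simulation settings)
    The previous estimate Q(s_i(t-1), a_k(t-1)) is taken here to be the value
    currently stored for the same state-action pair.
    """
    gamma = learning_rate(p_t, p_t1)
    target = V + eta * np.max(Q[s_next])              # V_ik(t) + eta * max_a Q(s_i(t+1), a)
    Q[s, a] = (1.0 - gamma) * Q[s, a] + gamma * target
    return Q[s, a]

# usage with the simulation dimensions: 3125 states, 7 actions
Q = np.zeros((3125, 7))
update_q(Q, s=0, a=2, s_next=1, V=0.05, p_t=1 / 7, p_t1=0.16)
```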
The termination-condition judging module judges whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
Test example
In order to verify the feasibility of the method and model for autonomous path exploration based on operant conditioning provided by the invention, a simulation of a mobile robot performing autonomous path exploration based on operant conditioning was carried out. The simulation platform used is Mobotsim (Mobile Robot Simulator). The simulation analysis of the mobile robot exploring paths in an unknown environment shows that, through interaction with the environment, the robot finally completes the autonomous path exploration task in the unknown environment.
(1) Experiment scene
The whole navigation environment is represented by a grid map; in this experiment it is set as a 70*50 grid environment, each grid cell being 0.2*0.2 m. Static obstacles are placed in the environment; through autonomous learning the mobile robot avoids the obstacles and searches for an optimal or near-optimal path to the target point. The robot is approximated as a circular robot of diameter 0.5 m. The time interval of each time step is 0.1 s, and the speed of the robot's centre point is 0.2 m/s.
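For reference, the experiment scene above can be encoded as a simple occupancy grid; the Python sketch below is illustrative only (the obstacle layout itself is not reproduced from the figures) and shows the conversion between grid indices and metric coordinates.

```python
import numpy as np

CELL = 0.2                    # each grid cell is 0.2 m x 0.2 m
GRID_W, GRID_H = 70, 50       # 70*50 grid environment
ROBOT_DIAMETER = 0.5          # circular robot, approximately 0.5 m across
STEP_DT = 0.1                 # time interval of each step, in seconds
SPEED = 0.2                   # speed of the robot centre point, in m/s

occupancy = np.zeros((GRID_W, GRID_H), dtype=bool)   # True marks a static obstacle

def cell_centre(ix, iy):
    """Metric coordinates (in metres) of the centre of grid cell (ix, iy)."""
    return (ix + 0.5) * CELL, (iy + 0.5) * CELL

start = cell_centre(10, 10)   # starting point R at grid (10, 10), see Figure 8
goal = cell_centre(55, 35)    # target point P at grid (55, 35)
```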
(2) Parameter settings
The initial conditions of the simulation are as follows: iterative learning step t = 0; sampling time t_s = 0.1 s; weight coefficients β1 = 0.65, β2 = 0.36, β3 = 0.72, β4 = 0.44; discount factor η = 0.14; the robot state space has 5^5 = 3125 states, s = {s_i | i = 1, 2, …, 3125}; the action space has 7 actions, A = {a_k | k = 1, 2, …, 7}; the initial selection probability of each action is p_ik(0) ≈ 1/7, so the initial information entropy of each state is approximately H_i(0) = −Σ_{k=1}^{7} p_ik·log2 p_ik (with p_ik = 1/7) ≈ 2.81, which by the properties of entropy is the maximum value.
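The knowledge-free initial condition just described (uniform action probabilities and maximum entropy) can be reproduced with a few lines of Python; the array layout is an assumption for illustration.

```python
import numpy as np

N_STATES = 5 ** 5             # five grade states per quantity, five quantities: 3125 states
N_ACTIONS = 7

# every action is initially equally likely in every state: p_ik(0) = 1/7
P = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

def state_entropy(p_row):
    """Formula (3): H_i = -sum_k p_ik * log2(p_ik)."""
    return float(-(p_row * np.log2(p_row)).sum())

H0 = state_entropy(P[0])      # = log2(7), approximately 2.81, the maximum possible value
```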
(3) Experimental results and analysis
In the simulation experiment, while travelling toward the target point, the robot implements obstacle-avoidance behaviour according to the autonomous path exploration strategy whenever an obstacle is detected. The robot acquires its obstacle-avoidance rules autonomously through learning; therefore, at the beginning of learning the robot has no experience at all, can only accumulate experience in the actual obstacle-avoidance process, and realizes obstacle avoidance through repeated learning, searching out a shorter or the shortest exploration path. The simulation experiment comprises 20 rounds of training in total. When the robot reaches the target point smoothly, or the number of learning steps exceeds t_f = 100, the current round is stopped, the robot returns to the starting point, the learned experience is saved, and the next round of training starts on that basis. When the number of training rounds exceeds N = 10, the states whose information entropy remains at the maximum value are deleted. After about 10 rounds of experiments, 1952 redundant states are eliminated; after about 15 rounds, the robot navigates successfully with fewer and fewer learning iterations, and the number of steps needed for successful navigation also decreases gradually, finally tending to a stable value, as shown in Figure 5 and Figure 6: Figure 5 is the relation curve between training round and number of learning iterations, and Figure 6 is the relation curve between training round and number of steps to successful termination. The simulation results show that at the early stage of training, because the robot has no prior knowledge, the probability of failure is high. As learning proceeds, the probability of failure decreases gradually; after about 10 rounds, owing to the rejection of redundant states, the learning speed accelerates, and after about 15 rounds of training the learning not only converges faster but essentially no longer fails. Therefore, through autonomous learning the robot has learned to choose actions that avoid collisions and reach the target point along a shorter path.
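The training schedule described above (20 rounds, a step limit of t_f = 100, pruning of maximum-entropy states after N = 10 rounds) corresponds to a loop of roughly the following shape; `run_episode` and `prune_max_entropy_states` are hypothetical placeholders standing in for the per-step procedure and the state rejection described in the text.

```python
import random

T_F = 100           # learning-step limit per round
ROUNDS = 20         # total training rounds in the simulation
PRUNE_AFTER = 10    # delete maximum-entropy (redundant) states after this many rounds

def run_episode(max_steps):
    """Placeholder for one round: sense state, select action, act, evaluate, update Q and p."""
    steps = random.randint(1, max_steps)
    return steps < max_steps, steps       # (reached the target point?, steps used)

def prune_max_entropy_states():
    """Placeholder for rejecting states whose information entropy never decreased."""
    pass

for episode in range(1, ROUNDS + 1):
    reached, steps = run_episode(T_F)
    if episode > PRUNE_AFTER:
        prune_max_entropy_states()
    # the robot returns to the starting point and keeps the learned experience
```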
Figure 7 shows the change curve of the information entropy of one state during the 20 rounds of training. As can be seen from the simulation results, at the early stage of the experiment the robot has no prior information or knowledge, and the information entropy takes its maximum value. As learning proceeds, the information entropy starts to decline; after about 15 rounds of training, the information entropy essentially no longer changes and reaches its minimum value. This shows that at the early stage of training, because the robot has no prior knowledge, the probability of failure is high; as learning proceeds, the probability of failure decreases gradually, and after about 15 rounds of training failures essentially no longer occur. Therefore, through autonomous learning the robot has learned to choose actions that avoid collisions and reach the target point along a shorter path.
In order to show clearly the self-learning, self-organizing bionic learning process of robot path exploration, the situations at the initial stage of learning and at the final stage of training are extracted. Figure 8 is the path exploration trajectory of the robot at the initial learning stage, and Figure 9 is the path exploration trajectory of the robot at the final learning stage. Eleven irregularly shaped obstacles are distributed unevenly in the simulation environment, labelled 1–11 in Figure 8; the sector denotes the coverage of the robot's sensors, labelled 12 in Figure 8. The starting point R is at grid (10, 10), and the target point P is at grid (55, 35). From the path exploration trajectory shown in Figure 8 it can be seen that the trajectory at the initial learning stage is not optimal, and collisions with obstacles even occur; this illustrates that the robot has no experiential knowledge, the randomness of learning is very large, the probabilities of choosing the individual actions differ little, and effective actions cannot yet be selected for path planning. As the interaction between the robot and the environment continues to enrich, the robot begins to tend to choose actions that obtain a "reward"; in other words, the frequency of choosing "good" actions exceeds the frequency of choosing "bad" actions, and its action selection tends to stabilize. As the number of learning iterations increases, the improved Q-learning algorithm continuously improves the obstacle-avoidance performance. From the path exploration trajectory of Figure 9 it can be seen that at the final learning stage the robot can smoothly bypass the static obstacles along a trajectory that is as smooth as possible and, while avoiding obstacles, keeps moving toward the target point, achieving an effective selection of a navigable path. The robot changes from blind, random learning at the beginning to nearly deterministic learning, and the operant conditioning is formed. The autonomous navigation learning process of the robot reflects the operant conditioning process of humans or animals and shows that, through autonomous learning, the robot tends to choose the actions favourable to itself.
The learned state–action mapping relations are saved, the working environment of the robot is changed, different starting points and target-point coordinates are set, the distribution of obstacles is changed, and the experiments are re-run. Figure 10 is the path exploration trajectory with starting point (5, 20) and end point (55, 40); Figure 11 is the path exploration trajectory with starting point (30, 5) and end point (40, 42).
The simulation results show that, with the method of discretely dividing the state space and the action space, the robot, after learning the mapping relations between states and actions, can successfully avoid obstacles under various circumstances and finally reach the target. Therefore, the autonomous path exploration strategy based on bionic autonomous learning designed in the present patent is effective.
The above simulation results show that the path exploration process based on bionic autonomous learning is a process of continuous exploration and consolidation; on the basis of sufficient learning it can guarantee a reasonable and effective understanding of the environment and can complete a reasonable mapping from the state space to the action space. Therefore, imitating the way humans recognize things and handle problems, adopting a bionic autonomous learning strategy is effective and feasible for a robot to autonomously complete the path exploration task in an unknown environment.
In summary, the method and model for autonomous path exploration based on operant conditioning provided by the invention have the following advantages:
(1) The autonomous path exploration cognitive model serves as a mathematical abstraction and formalization tool for the mobile robot, used to describe the autonomous learning mechanism of the robot, and provides an effective logical structure for designing the behaviour of autonomous intelligent search-and-rescue robots.
(2) The invention simulates the operant conditioning mechanism and designs a bionic autonomous exploration cognitive model. The principal feature of the autonomous exploration cognitive model is that it simulates the biological operant conditioning mechanism, possesses bionic self-organization, self-learning and self-adaptation functions, effectively applies bionics, psychology and biology to the control system, and has the function of realizing bionic autonomous learning control. The present patent provides new ideas and new methods for research on bionic autonomous learning, and contributes to advancing related research on intelligent control and intelligent learning algorithms.
(3) Autonomous path exploration is a core technology and a difficult problem in the field of mobile robot research, involving many aspects such as the robot's perception, planning and execution. The autonomous path exploration cognitive model designed in the present invention can serve as a mathematical abstraction and formalization tool for search-and-rescue robots, used to describe the learning mechanism and cognitive behaviour of the robot, thereby providing an effective logical structure for designing robot behaviour that develops toward intelligence and autonomy.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those skilled in the art, several improvements and modifications can also be made without departing from the principles of the invention, and these improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A method for autonomous path exploration based on operant conditioning, characterized by comprising the following steps:
Step 1, setting initial parameter values, including: robot start position information; initial iterative learning step t = 0; iterative learning count t_f; sampling time t_s; learning algorithm parameters, including the weight coefficients β1, β2, β3, β4, the annealing initial temperature T_0, the annealing parameter and the discount factor η;
The environment state space of the robot is set as S = {s_i | i = 1, 2, …, n}, wherein each s_i is represented by five state quantities, that is, s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; wherein:
d̃_rob_obs_l — the distance state from the robot to the obstacle on its left;
d̃_rob_obs_f — the distance state from the robot to the obstacle in front of it;
d̃_rob_obs_r — the distance state from the robot to the obstacle on its right;
d̃_rob_tar — the distance state between the robot and the target point;
θ̃ — the angle state between the robot's current direction of motion and the target point;
Each state quantity is divided into a finite number of grade states by a fuzzification method, each grade state corresponding to a distance interval;
The action space A = {a_k | k = 1, 2, …, r} is set, wherein a_k denotes the k-th optional action of mobile robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimal risk distance d_min are set;
Step 2, the robot is equipped with multiple sonar sensors; during robot travel, at step t, the current environment parameter values of the robot are collected by the sonar sensors, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ^t between the current direction of motion and the target point;
The grade state to which each current environment parameter value of the robot belongs is judged, thereby obtaining the environment state s_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot finds itself at step t, with s_i(t) ∈ environment state space S;
The information entropy H_i(t) in the current state s_i(t) is calculated according to formula (3):
H_i(t) = H_i\bigl(A(t)\mid s_i(t)\bigr) = -\sum_{k=1}^{r} p_{ik}\log_2 p_{ik} = -\sum_{k=1}^{r} p\bigl(a_k\mid s_i(t)\bigr)\log_2 p\bigl(a_k\mid s_i(t)\bigr)    (3)
Wherein, a_k ∈ A = {a_k | k = 1, 2, …, r} denotes the k-th optional action of mobile robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of implementing action a_k under the condition of being in environment state s_i(t), also referred to as the excitation probability value of the "state s_i – action a_k" pair, satisfying 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, …, p_ir}, where P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, …, p_ir} ∈ P, and P denotes the total probability vector;
Step 3, according to the Boltzmann distribution, the action a_k(t) with the maximum probability is selected from the action space A and executed; the learning objectives of the optimum action are:
(1) If the distances between the robot and the obstacles on its left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) If any of the distances between the robot and the obstacles on its left, front and right lies between the minimal risk distance and the safe distance, that is, d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
Step 4, after the chosen action a_k(t) is implemented, the robot state transitions; the information entropy H_i(t+1) in the new state s_i(t+1) is calculated;
It is judged whether the current number of training rounds exceeds the preset number of training rounds N; if so, the states whose information entropy remains at the maximum value are rejected, and Step 5 is entered; otherwise Step 5 is entered directly;
Step 5, first, the distances between the robot and the obstacles in the new state are acquired:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (4);
If d_min < d_rob_obs_l < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-1);
If d_min < d_rob_obs_f < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-2);
If d_min < d_rob_obs_r < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning step fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_{rob\_tar}, \theta) = -\beta_1\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t) - \beta_2\,\mathrm{sign}(\Delta\theta(t))\,\Delta^2\theta(t)    (4)
Wherein, β1, β2 are weight coefficients, 0 < β1, β2 < 1; Δθ(t) = θ(t+1) − θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) − d_rob_tar(t);
V(d_{rob\_obs\_l}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_l}(t))\,\Delta^2 d_{rob\_obs\_l}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-1)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) − d_rob_obs_l(t);
V(d_{rob\_obs\_f}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_f}(t))\,\Delta^2 d_{rob\_obs\_f}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-2)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) − d_rob_obs_f(t);
V(d_{rob\_obs\_r}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_r}(t))\,\Delta^2 d_{rob\_obs\_r}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-3)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) − d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of implementing action a_k(t) in state s_i(t). According to the evaluation signal V_ik(t), it is determined whether the old Q value of "state s_i(t) – action a_k(t)" needs to be updated. If not, this learning step is regarded as a failure, the learning experience is discarded, the old Q value of "state s_i(t) – action a_k(t)" is retained, and Step 6 is executed; otherwise, the new Q value of "state s_i(t) – action a_k(t)" is calculated according to formula (6), the old Q value is replaced by the new one, i.e. the learning experience of this step is retained, the probability value p_ik(s_i(t), a_k(t)) of "state s_i(t) – action a_k(t)" is updated according to formula (8), and Step 6 is executed:
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
Step 6, it is judged whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1, the robot continues path exploration with the adjusted Q values and probability values, and steps Step 2–Step 6 are repeated until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
2. An autonomous exploration cognitive model for autonomous path exploration based on operant conditioning, characterized by comprising: a perceptron, an actuator, a state editor, a parameter setting module, a termination-condition judging module and an exploration learning policy module; wherein the exploration learning policy module comprises a tendency unit, a learning core and an action selection policy module;
The parameter setting module is used for setting initial parameter values, including: robot start position information; initial iterative learning step t = 0; iterative learning count t_f; sampling time t_s; learning algorithm parameters, including the weight coefficients β1, β2, β3, β4, the annealing initial temperature T_0, the annealing parameter and the discount factor η;
The environment state space of the robot is set as S = {s_i | i = 1, 2, …, n}, wherein each s_i is represented by five state quantities, that is, s_i = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃}; wherein:
d̃_rob_obs_l — the distance state from the robot to the obstacle on its left;
d̃_rob_obs_f — the distance state from the robot to the obstacle in front of it;
d̃_rob_obs_r — the distance state from the robot to the obstacle on its right;
d̃_rob_tar — the distance state between the robot and the target point;
θ̃ — the angle state between the robot's current direction of motion and the target point;
Each state quantity is divided into a finite number of grade states by a fuzzification method, each grade state corresponding to a distance interval;
The action space A = {a_k | k = 1, 2, …, r} is set, wherein a_k denotes the k-th optional action of mobile robot path exploration; initially the robot chooses any action a_k from the action space with equal probability 1/r; the safe distance d_max and the minimal risk distance d_min are set;
The perceptron consists of multiple sonar sensors; during robot travel, at step t, the current environment parameter values of the robot are collected by the sonar sensors, including: the distance to the obstacle on the left d^t_rob_obs_l, the distance to the obstacle on the right d^t_rob_obs_r, the distance to the obstacle in front d^t_rob_obs_f, the distance to the target point d^t_rob_tar, and the angle θ^t between the current direction of motion and the target point;
The state editor is used for judging the grade state to which each current environment parameter value of the robot belongs, thereby obtaining the environment state s_i(t) = {d̃_rob_obs_l, d̃_rob_obs_f, d̃_rob_obs_r, d̃_rob_tar, θ̃} in which the robot finds itself at step t, with s_i(t) ∈ environment state space S;
The tendency unit is used for measuring the robot's degree of tendency toward an environment state by computing the entropy of that state, specifically, the information entropy H_i(t) in the current state s_i(t) is calculated according to formula (3):
H_i(t) = H_i\bigl(A(t)\mid s_i(t)\bigr) = -\sum_{k=1}^{r} p_{ik}\log_2 p_{ik} = -\sum_{k=1}^{r} p\bigl(a_k\mid s_i(t)\bigr)\log_2 p\bigl(a_k\mid s_i(t)\bigr)    (3)
Wherein, a_k ∈ A = {a_k | k = 1, 2, …, r} denotes the k-th optional action of mobile robot path exploration, A denotes the set of optional actions for robot path exploration, and r is the number of optional actions; p_ik = p(a_k | s_i(t)) ∈ P_i denotes the probability of implementing action a_k under the condition of being in environment state s_i(t), also referred to as the excitation probability value of the "state s_i – action a_k" pair, satisfying 0 < p_ik < 1, p_ik ∈ P_i = {p_i1, p_i2, …, p_ir}, where P_i denotes the probability vector corresponding to the i-th state, P_i = {p_i1, p_i2, …, p_ir} ∈ P, and P denotes the total probability vector;
The action selector is used for selecting, according to the Boltzmann distribution, the action a_k(t) with the maximum probability from the action space A and executing it; the learning objectives of the optimum action are:
(1) If the distances between the robot and the obstacles on its left, front and right are all greater than the safe distance, that is, d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, then executing the action should make the robot tend to move directly toward the target point;
(2) If any of the distances between the robot and the obstacles on its left, front and right lies between the minimal risk distance and the safe distance, that is, d_min < d_rob_obs_l or d_rob_obs_f or d_rob_obs_r < d_max, then executing the corresponding action should enable the robot to avoid the obstacle;
The actuator is used for executing the action selected by the action selector, causing the robot state to transition;
The learning core is used for evaluating the effect of the selected action and producing an evaluation signal, specifically:
If d_rob_obs_l > d_max and d_rob_obs_f > d_max and d_rob_obs_r > d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (4);
If d_min < d_rob_obs_l < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-1);
If d_min < d_rob_obs_f < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-2);
If d_min < d_rob_obs_r < d_max, the evaluation signal V_ik(t) of "state s_i(t) – action a_k(t)" is calculated according to formula (5-3);
If d_rob_obs_l < d_min or d_rob_obs_f < d_min or d_rob_obs_r < d_min, the learning step fails and a penalty value is given, i.e. V_ik(t) = -1;
V(d_{rob\_tar}, \theta) = -\beta_1\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t) - \beta_2\,\mathrm{sign}(\Delta\theta(t))\,\Delta^2\theta(t)    (4)
Wherein, β1, β2 are weight coefficients, 0 < β1, β2 < 1; Δθ(t) = θ(t+1) − θ(t); Δd_rob_tar(t) = d_rob_tar(t+1) − d_rob_tar(t);
V(d_{rob\_obs\_l}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_l}(t))\,\Delta^2 d_{rob\_obs\_l}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-1)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_l(t) = d_rob_obs_l(t+1) − d_rob_obs_l(t);
V(d_{rob\_obs\_f}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_f}(t))\,\Delta^2 d_{rob\_obs\_f}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-2)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_f(t) = d_rob_obs_f(t+1) − d_rob_obs_f(t);
V(d_{rob\_obs\_r}, d_{rob\_tar}) = \beta_3\,\mathrm{sign}(\Delta d_{rob\_obs\_r}(t))\,\Delta^2 d_{rob\_obs\_r}(t) - \beta_4\,\mathrm{sign}(\Delta d_{rob\_tar}(t))\,\Delta^2 d_{rob\_tar}(t)    (5-3)
Wherein, β3, β4 are weight coefficients, 0 < β3, β4 < 1; Δd_rob_obs_r(t) = d_rob_obs_r(t+1) − d_rob_obs_r(t);
The calculated evaluation signal V_ik(t) is essentially used to evaluate the effect of implementing action a_k(t) in state s_i(t). According to the evaluation signal V_ik(t), it is determined whether the old Q value of "state s_i(t) – action a_k(t)" needs to be updated. If not, this learning step is regarded as a failure, the learning experience is discarded, and the old Q value of "state s_i(t) – action a_k(t)" is retained; otherwise, the new Q value of "state s_i(t) – action a_k(t)" is calculated according to formula (6) and the old Q value is replaced by the new one, i.e. the learning experience of this step is retained; in addition, the probability value p_ik(s_i(t), a_k(t)) of "state s_i(t) – action a_k(t)" is updated according to formula (8):
Q(s_i(t), a_k(t)) = \bigl(1-\gamma(p_{ik})\bigr)\,Q(s_i(t-1), a_k(t-1)) + \gamma(p_{ik})\Bigl[V_{ik}(t) + \eta\max_{a_k} Q(s_i(t+1), a_k(t))\Bigr]    (6)
Wherein, η is the discount factor, representing the degree of attention the learning system pays to the action; γ(p_ik) is the learning rate function at step t, expressed by formula (7):
\gamma(p_{ik}) = \dfrac{1}{1+\exp\!\Bigl[\dfrac{p_{ik}(t)-p_{ik}(t+1)}{p_{ik}(t)}\Bigr]}    (7)
Wherein, T is the temperature coefficient and T_0 is the initial temperature value; as time t increases, T decays from T_0, and the annealing parameter is used to control the speed of annealing;
The termination-condition judging module is used for judging whether, in the new state, the robot has reached the target point or has exceeded the initially set iterative learning count t_f; if the judgement is yes, the current round of learning ends; if it is no, t = t + 1 and the robot continues path exploration with the adjusted Q values and probability values, until the target point is reached or the initially set iterative learning count t_f is exceeded, whereupon the current round of learning ends.
CN201410215954.4A 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning Pending CN105094124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410215954.4A CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410215954.4A CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Publications (1)

Publication Number Publication Date
CN105094124A true CN105094124A (en) 2015-11-25

Family

ID=54574800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410215954.4A Pending CN105094124A (en) 2014-05-21 2014-05-21 Method and model for performing independent path exploration based on operant conditioning

Country Status (1)

Country Link
CN (1) CN105094124A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN107066967A (en) * 2017-04-12 2017-08-18 清华大学 A kind of target-seeking method and device of active face using local observation information
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN108227710A (en) * 2017-12-29 2018-06-29 商汤集团有限公司 Automatic Pilot control method and device, electronic equipment, program and medium
CN108363393A (en) * 2018-02-05 2018-08-03 腾讯科技(深圳)有限公司 A kind of smart motion equipment and its air navigation aid and storage medium
CN109547505A (en) * 2019-01-26 2019-03-29 福州大学 Multipath TCP transmission dispatching method based on intensified learning
CN109778939A (en) * 2019-03-04 2019-05-21 江苏徐工工程机械研究院有限公司 It is a kind of can contexture by self track digger arm intelligence control system and method
CN110525428A (en) * 2019-08-29 2019-12-03 合肥工业大学 A kind of automatic parking method based on the study of fuzzy deeply
CN110788865A (en) * 2019-12-09 2020-02-14 中国科学院自动化研究所 Robot control method and system based on multi-brain-area collaborative conditioned reflex model
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111401564A (en) * 2019-01-02 2020-07-10 北京地平线信息技术有限公司 Model updating method and device for machine learning, electronic equipment and storage medium
CN111443711A (en) * 2020-03-27 2020-07-24 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method and device, fire-fighting robot and readable storage medium
CN111538059A (en) * 2020-05-11 2020-08-14 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN112558605A (en) * 2020-12-06 2021-03-26 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN117873118A (en) * 2024-03-11 2024-04-12 中国科学技术大学 Storage logistics robot navigation method based on SAC algorithm and controller

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599137A (en) * 2009-07-15 2009-12-09 北京工业大学 Autonomous operant conditioning reflex automat and the application in realizing intelligent behavior
CN101673354A (en) * 2009-06-12 2010-03-17 北京工业大学 Operant conditioning reflex automatic machine and application thereof in control of biomimetic autonomous learning
JP2014002638A (en) * 2012-06-20 2014-01-09 Mitsuba Corp Autonomous travel device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cai Jianxian et al.: "Research on autonomous navigation method of robot based on bionic strategy", Computer Simulation (《计算机仿真》) *
Cai Jianxian et al.: "Skinner operant conditioning learning model based on genetic algorithm", Systems Engineering and Electronics (《系统工程与电子技术》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740644B (en) * 2016-03-24 2018-04-13 苏州大学 A kind of clean robot optimal objective paths planning method based on model learning
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN107066967B (en) * 2017-04-12 2020-06-02 清华大学 Active face searching method and device by using local observation information
CN107066967A (en) * 2017-04-12 2017-08-18 清华大学 A kind of target-seeking method and device of active face using local observation information
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107479547B (en) * 2017-08-11 2020-11-24 同济大学 Decision tree behavior decision algorithm based on teaching learning
CN108227710A (en) * 2017-12-29 2018-06-29 商汤集团有限公司 Automatic Pilot control method and device, electronic equipment, program and medium
CN108363393A (en) * 2018-02-05 2018-08-03 腾讯科技(深圳)有限公司 A kind of smart motion equipment and its air navigation aid and storage medium
US11247701B2 (en) 2018-02-05 2022-02-15 Tencent Technology (Shenzhen) Company Ltd Smart moving device, navigation method thereof, and storage medium
CN111401564A (en) * 2019-01-02 2020-07-10 北京地平线信息技术有限公司 Model updating method and device for machine learning, electronic equipment and storage medium
CN109547505B (en) * 2019-01-26 2021-05-18 福州大学 Multipath TCP transmission scheduling method based on reinforcement learning
CN109547505A (en) * 2019-01-26 2019-03-29 福州大学 Multipath TCP transmission dispatching method based on intensified learning
CN109778939A (en) * 2019-03-04 2019-05-21 江苏徐工工程机械研究院有限公司 It is a kind of can contexture by self track digger arm intelligence control system and method
CN109778939B (en) * 2019-03-04 2021-07-09 江苏徐工工程机械研究院有限公司 Excavator arm intelligent control system and method capable of automatically planning track
CN110525428A (en) * 2019-08-29 2019-12-03 合肥工业大学 A kind of automatic parking method based on the study of fuzzy deeply
CN110525428B (en) * 2019-08-29 2020-09-04 合肥工业大学 Automatic parking method based on fuzzy depth reinforcement learning
CN110788865A (en) * 2019-12-09 2020-02-14 中国科学院自动化研究所 Robot control method and system based on multi-brain-area collaborative conditioned reflex model
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111443711A (en) * 2020-03-27 2020-07-24 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method and device, fire-fighting robot and readable storage medium
CN111443711B (en) * 2020-03-27 2023-05-23 浙江华消科技有限公司 Fire-fighting robot obstacle avoidance method, device, fire-fighting robot and readable storage medium
CN111538059A (en) * 2020-05-11 2020-08-14 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN112558605A (en) * 2020-12-06 2021-03-26 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN112558605B (en) * 2020-12-06 2022-12-16 北京工业大学 Robot behavior learning system based on striatum structure and learning method thereof
CN117873118A (en) * 2024-03-11 2024-04-12 中国科学技术大学 Storage logistics robot navigation method based on SAC algorithm and controller

Similar Documents

Publication Publication Date Title
CN105094124A (en) Method and model for performing independent path exploration based on operant conditioning
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
CN110969848B (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN113805572B (en) Method and device for motion planning
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN110745136A (en) Driving self-adaptive control method
Wang et al. Continuous control for automated lane change behavior based on deep deterministic policy gradient algorithm
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN108731684A (en) A kind of Route planner of multiple no-manned plane Cooperative Area monitoring
CN106444738A (en) Mobile robot path planning method based on dynamic motion primitive learning model
CN104571113A (en) Route planning method for mobile robot
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN113478486B (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN106598058A (en) Intrinsically motivated extreme learning machine autonomous development system and operating method thereof
CN114895707B (en) Agricultural unmanned aerial vehicle path planning method and system based on variable frequency bat algorithm
Crosato et al. Human-centric autonomous driving in an av-pedestrian interactive environment using svo
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN114077807A (en) Computer implementation method and equipment for controlling mobile robot based on semantic environment diagram
US20230166397A1 (en) Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of snn
Zijian et al. Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments
Li et al. A UAV coverage path planning algorithm based on double deep q-network
CN104570738A (en) Robot track tracing method based on Skinner operant conditioning automata
CN116243727A (en) Unmanned carrier countermeasure and obstacle avoidance method for progressive deep reinforcement learning
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151125