CN110378439A - Single robot path planning method based on Q-Learning algorithm - Google Patents


Info

Publication number
CN110378439A
CN110378439A
Authority
CN
China
Prior art keywords
state parameter
equal
path
factor
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910737476.6A
Other languages
Chinese (zh)
Other versions
CN110378439B (en)
Inventor
李波
易洁
梁宏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201910737476.6A priority Critical patent/CN110378439B/en
Publication of CN110378439A publication Critical patent/CN110378439A/en
Application granted granted Critical
Publication of CN110378439B publication Critical patent/CN110378439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 - Control of position or course in two dimensions
    • G05D1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/29 - Graphical models, e.g. Bayesian networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 - Optimisation of routes or paths, e.g. travelling salesman problem

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The present invention relates to the technical field of robot path planning, and in particular to a single-robot path planning method based on the Q-Learning algorithm, comprising: initializing the parameters of the algorithm; choosing an action command and computing the running state parameter and reward value produced by that action command; if the running state parameter equals the final state parameter and equals the target state parameter, storing the successful path into the successful-path storage table; otherwise, when the update start time is less than or equal to the current time and the access count of the state-action pair equals the maximum count threshold, updating the action-value function and storing the running state parameter into the successful path; repeating the above steps until the maximum number of iterations is reached; and then repeatedly choosing action commands and generating state parameters according to the action-value function to obtain the optimal path of the single robot. When the Q-Learning algorithm is used for single-robot path planning, the present invention can better improve the update and learning speed of the learning system and the path planning effect.

Description

Single robot path planning method based on Q-Learning algorithm
Technical field
The present invention relates to the technical field of robot path planning, and in particular to a single-robot path planning method based on the Q-Learning algorithm.
Background art
Mobile robots are widely used; they appear in fields such as the home, agriculture, industry, and the military. The three core topics in mobile robot control research are robot localization, task allocation, and path planning. Among them, path planning is the prerequisite for a mobile robot to reach its task objective and complete its task. For example, a household cleaning robot needs a reasonable path plan through the indoor environment to complete its cleaning task; an agricultural picking robot needs path planning to move among the crops and complete the picking task; and an industrial robot also needs path planning to complete its assigned task within a shared workspace.
Single-robot systems are widely used in home services, agricultural assistance, industrial environments, and other areas. In such applications, path planning for the single-robot system is particularly important. Single-robot path planning means finding, within the robot's working environment, a path from an initial state to a target state that avoids all obstacles. This requires machine learning, and in the related art the most common learning method is reinforcement learning.
The Q-Learning algorithm is an important algorithm in reinforcement learning, and in the related art it is applied to the path planning of robot systems. The learning process of the Q-Learning algorithm is an iterative process: through continual trial and error and action selection, it progressively updates the Q value table (the action-value function). That is, a reward function is set; the robot chooses an action command according to the ε-greedy strategy (where ε is the exploration factor, 0 ≤ ε ≤ 1), executes the action command, and updates the Q value table according to the reward function; a state parameter is then generated, the next action is chosen according to the state parameter and the ε-greedy strategy, and the algorithm continues to execute action commands and update the Q value table until the final Q value table is obtained, from which the optimal path is derived.
The existing Q-Learning algorithm must constantly select and update Q values to improve action selection; that is, every exploration of the environment requires one Q value update, which makes the update and learning of the learning system slow. In the related art, in order to guarantee the update and learning speed, the value of the exploration factor ε is gradually reduced as the training time of the algorithm increases, so that the algorithm more often executes the currently optimal action and converges quickly to a corresponding solution. The cost of this approach is that the learning system may miss the optimal solution because its exploration of the environment is insufficient (the value of the exploration factor ε becomes too small); it can then only converge to a suboptimal solution, or quite possibly to an ordinary solution, and this defect impairs the path planning effect.
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is: how to better improve the update and learning speed of the learning system and the path planning effect when the Q-Learning algorithm is used for single-robot path planning.
In order to solve the above technical problem, the present invention adopts the following technical solution:
A single-robot path planning method based on the Q-Learning algorithm, comprising the following steps:
S1: initialize the exploration factor, maximum number of iterations, final state parameter, target state parameter, maximum count threshold, update start time, number of iterations, current time, action-value function, access counts of state-action pairs, successful path, and successful-path storage table of the single-robot system;
S2: judge whether the number of iterations is greater than the maximum number of iterations; if so, execute step S6; if not, initialize the current state parameter and proceed to the next step;
S3: generate a random number, compare the random number with the exploration factor and choose an action command accordingly, and compute the running state parameter and reward value produced after the robot executes that action command. Then judge whether the running state parameter equals the final state parameter. If so, further judge whether the running state parameter equals the target state parameter: if it does, store the successful path into the successful-path storage table, increment the number of iterations by one, and return to step S2; if it does not, increment the number of iterations by one and return to step S2. If the running state parameter does not equal the final state parameter, proceed to the next step;
S4: judge whether the update start time is less than or equal to the current time; if so, store the reward value, increment the access count of the state-action pair by one, and proceed to the next step; if not, judge whether the access count of the state-action pair equals the maximum count threshold; if it does, update the action-value function and proceed to the next step; if it does not, proceed to the next step;
S5: store the running state parameter into the successful path, increment the current time by one, and return to step S3;
S6: obtain the action-value function, choose an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter, and choose the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
In this way, a maximum count threshold is set in the Q-Learning algorithm, and the access count of a state-action pair is compared with the maximum count threshold to decide whether to update the action-value function (Q value); that is, the action-value function is updated only when the access count of the state-action pair reaches the maximum count threshold. First, this scheme does not reduce the exploration of the environment, so the path planning effect is guaranteed. Second, it not only reduces the computational load of the system but also greatly improves the update and learning speed. Third, this way of updating the action-value function has multi-step look-ahead ability: it takes into account the influence of multiple future state-action pairs on the action-value function, so the learned control strategy is more reasonable. Finally, since the access count of the state-action pair is chosen as the criterion for updating the action-value function, the update and learning speed is improved without affecting the preceding steps and without having to reduce the value of the exploration factor, which avoids the problem of the learning system missing the optimal solution because of insufficient exploration of the environment.
Therefore, when the Q-Learning algorithm is used for single-robot path planning, this scheme can better improve the update and learning speed of the learning system and the path planning effect.
Preferably, in step S4, the formula for updating the action-value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action-value function, U(s, a) is the stored reward, and h is the maximum count threshold.
In this way, the updated action-value function is the average of all stored rewards. On the one hand, the averaging calculation reduces the computational load and helps improve the accuracy of the action-value function; on the other hand, this calculation no longer requires an eligibility-trace matrix, which further reduces the computational complexity.
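For illustration, a minimal Python sketch of this count-triggered averaged update follows. The dictionary-based tables, the helper name, and the threshold value H are assumptions of the sketch, not a transcription of the patented implementation; the quantity accumulated into U(s, a) is the one-step target r + γ·max Q, as spelled out in step 8 of the flow in Embodiment 1 below.

```python
from collections import defaultdict

GAMMA = 0.8   # discount factor (the value used in the embodiment below)
H = 5         # maximum count threshold h (illustrative value)

Q = defaultdict(float)   # action-value function Q(s, a)
U = defaultdict(float)   # stored one-step targets U(s, a)
C = defaultdict(int)     # access counts of the state-action pair C(s, a)

def store_and_maybe_update(s, a, r, s_next, actions):
    """Accumulate the one-step target and refresh Q(s, a) only every H visits."""
    # U(s, a) += r + gamma * max_a' Q(s', a')
    U[(s, a)] += r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    C[(s, a)] += 1
    if C[(s, a)] == H:
        Q[(s, a)] = U[(s, a)] / H   # Q(s, a) = U(s, a) / h, the average of H targets
        U[(s, a)] = 0.0             # empty the stored rewards
        C[(s, a)] = 0               # empty the access count
```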
Preferably, in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, a preset success pathfinding count is incremented by one before step S2 is executed.
In this way, by recording and updating the success pathfinding count, the learning effect of the learning system can be fed back in time and the update of the exploration factor can be assisted, which helps to resolve the exploration-exploitation balance problem of the Q-Learning algorithm.
Preferably, in step S2, if the number of iterations is less than the maximum number of iterations, first judge whether the success pathfinding count is greater than a prestored maximum success pathfinding count; if so, update the value of the exploration factor and proceed to the next step; if not, proceed to the next step.
In this way, the success pathfinding count feeds back the learning effect of the learning system, and the value of the exploration factor is continuously updated according to the learning effect, so that the ε-greedy strategy is more adaptive and better matches the movement pattern.
Preferably, in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, the successful path count is incremented by one before step S2 is executed.
In this way, by recording and updating the successful path count, the learning effect of the learning system can be fed back in time and can participate in deciding subsequent updates of the exploration factor, which likewise helps to resolve the exploration-exploitation balance problem of the Q-Learning algorithm.
Preferably, in step S2, when updating the value of the exploration factor, judge whether the successful path count is less than a prestored minimum path count. If so, execute ε' = ε + eSize × (MinPathNum - PathNum), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count. If not, execute ε' = ε - eSize × (i/eCycle), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, i is the number of iterations, and eCycle is the prestored change period of the exploration factor.
In this way, the combination of the successful path count and the success pathfinding count feeds back the learning effect of the learning system more accurately and promptly, so that the value of the exploration factor is continuously updated according to the learning effect, which makes the ε-greedy strategy more adaptive and better matched to the movement pattern.
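A short Python sketch of this adaptive exploration-factor update follows, using the variable names of the description (eSize, MinPathNum, PathNum, i, eCycle); the clipping of ε to [0, 1] is an assumption added for safety rather than something stated in the text.

```python
def update_epsilon(eps, path_num, min_path_num, e_size, i, e_cycle):
    """Adaptive update of the exploration factor, following the two formulas above."""
    if path_num < min_path_num:
        # too few distinct successful paths found so far: raise epsilon to explore more
        eps = eps + e_size * (min_path_num - path_num)
    else:
        # enough successful paths found: decay epsilon with the iteration count
        eps = eps - e_size * (i / e_cycle)
    return min(max(eps, 0.0), 1.0)   # keep 0 <= eps <= 1 (clipping is our assumption)
```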
Preferably, in step S3, when the random number is compared with the exploration factor, if the random number is greater than the exploration factor, an action command is chosen according to a prestored probabilistic model; if the random number is less than or equal to the exploration factor, an action command is chosen at random from the prestored action set.
In this way, through the comparison of the random number with the exploration factor and the probabilistic model, a probability-based action selection strategy is realized in which actions with larger action-value function values are selected with higher probability, which solves the problem of biased selection of the maximum value caused by noise.
Preferably, in step S3, the formula by which the probabilistic model chooses an action command is P(s|a_k) = Q(s, a_k) / Σ_k Q(s, a_k), where P(s|a_k) is the probability of choosing action command a_k under state parameter S, Q(s, a_k) is the Q value of action command a_k under state parameter S, and Σ_k Q(s, a_k) is the sum of the Q values of all action commands under state parameter S.
In this way, through prior training and learning, the probabilistic model makes actions with larger action-value function values more likely to be selected, which helps to solve the maximization bias problem.
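A sketch of this selection rule in Python, assuming a dictionary-style Q table keyed by (state, action). Note that the normalization P(s|a_k) = Q(s, a_k)/ΣQ(s, a_k) presupposes non-negative Q values, so the sketch falls back to a uniform random choice when that does not hold; the fallback is our assumption, not part of the description.

```python
import random

def choose_action(Q, s, actions, eps):
    """Compare a random number with the exploration factor, then pick an action
    in proportion to its Q value (the probabilistic model described above)."""
    if random.random() <= eps:
        return random.choice(actions)            # explore: random action from the action set
    weights = [Q[(s, a)] for a in actions]       # P(s|a_k) = Q(s, a_k) / sum_k Q(s, a_k)
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        # the normalization presupposes positive Q values; fall back to a uniform
        # choice otherwise (this fallback is our assumption)
        return random.choice(actions)
    return random.choices(actions, weights=weights, k=1)[0]
```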
Brief description of the drawings
In order to make the objectives, technical solutions, and advantages of the invention clearer, the present invention is described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is the logic diagram of the single-robot path planning method based on the Q-Learning algorithm in Embodiment 1;
Fig. 2 is the flow chart of the single-robot path planning method based on the Q-Learning algorithm in Embodiment 1;
Fig. 3 is the flow chart of updating the Q value table in Embodiment 1;
Fig. 4 is a schematic diagram of the path planned by the traditional Q-Learning algorithm in Experiment 1 of Embodiment 2;
Fig. 5 is a schematic diagram of the path planned by the improved Q-Learning algorithm of the present invention in Experiment 1 of Embodiment 2;
Fig. 6 is a line chart of the time taken by the traditional Q-Learning algorithm to converge in Experiment 1 of Embodiment 2;
Fig. 7 is a line chart of the time taken by the improved Q-Learning algorithm to converge in Experiment 1 of Embodiment 2;
Fig. 8 is a schematic diagram of the path planned by the improved Q-Learning algorithm in Experiment 2 of Embodiment 2;
Fig. 9 is a line chart of the number of training episodes required for the traditional Q-Learning algorithm to converge in Experiment 2 of Embodiment 2;
Fig. 10 is a line chart of the number of training episodes required for the improved Q-Learning algorithm to converge in Experiment 2 of Embodiment 2.
Specific embodiment
The present invention is further explained in detail below through specific embodiments:
The Q-Learning algorithm was proposed by Watkins in 1989 and is an important algorithm in reinforcement learning.
One, the update rule of Q-Learning algorithm
A robot under the Q-Learning algorithm does not know the whole environment; it only knows the set of actions selectable in the current state. It usually needs to build an immediate reward matrix R to represent the reward value of moving from state s to the next state s'. The Q value table (or Q matrix) that guides the robot's actions is computed from R.
Accordingly, each state-action pair is denoted <S, A>, and the Q-learning algorithm estimates the value function Q(S, A) of state-action pairs in order to obtain a control strategy. The simplest form of Q learning is single-step Q learning, whose Q value correction formula is as follows:
Q(s_t, a_t) = r_{t+1} + γ·max_a Q(s_{t+1}, a)   (1-1)
Formula (1-1) holds only under the optimal policy. At the beginning of learning, the two sides of formula (1-1) are not equal, and the error is:
ΔQ_t(s_t, a_t) = r_{t+1} + γ·max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)   (1-2)
This yields the update rule:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α·ΔQ_t(s_t, a_t)   (1-3)
That is:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α·[r_{t+1} + γ·max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)]   (1-4)
where s_t is the current state; a_t is the action selected in the current state; s_{t+1} is the next state after executing action a_t;
r_{t+1} is the immediate reward after executing action a_t; and Q(s_t, a_t) is the accumulated discounted reward obtained after the robot executes action a_t in state s_t, i.e. the value function of the state-action pair.
α is the learning rate that controls convergence, 0 < α < 1; by continually trying the search space, the Q value gradually approaches the optimal value.
γ is the discount factor, 0 ≤ γ < 1. γ = 0 means only the immediate reward matters, while γ tending to 1 means future rewards matter; in other words, γ determines how strongly rewards distant in time influence the value, i.e. the degree to which the current reward is sacrificed in exchange for long-term benefit.
Two, the steps of the Q-Learning algorithm
First initialize the Q(s, a) values. In state s, the robot selects action a according to the action selection strategy π, obtains the next state s' and the reward value r, and then corrects the Q(s, a) value according to the update rule; action selection and Q(s, a) correction are repeated until learning ends.
The overall flow of a typical Q-Learning algorithm is as follows:
1. For i = 1 : n:
2.   Initialize state s
3.   For each learning cycle:
4.     Select an action a using strategy μ
5.     Execute a, obtain the immediate reward r, and transfer to the next state s'
6.     Update the Q value function of strategy π according to formula (1-4)
7.     Update the current strategy
8.     s ← s'
9.   Until s is a final state
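For concreteness, the traditional flow above can be written as the following Python sketch of plain tabular Q-Learning; the environment interface (reset, step, actions) is an assumption of the sketch and not part of the patent.

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.01, gamma=0.8, eps=0.4):
    """Plain tabular Q-Learning corresponding to steps 1-9 above. `env` is assumed
    to expose reset() -> state, step(a) -> (next_state, reward, done) and a list
    `actions`; these names are illustrative."""
    Q = defaultdict(float)
    for _ in range(episodes):                                   # 1. for i = 1 : n
        s = env.reset()                                         # 2. initialize state s
        done = False
        while not done:                                         # 3. each learning cycle
            if random.random() <= eps:                          # 4. epsilon-greedy selection
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)                       # 5. execute a, observe r and s'
            target = r + gamma * max(Q[(s_next, x)] for x in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])           # 6. single-step update, formula (1-4)
            s = s_next                                          # 8. s <- s'
        # 9. the loop ends when s is a final state (signalled here by `done`)
    return Q
```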
Three, the convergence of Q-Learning
When the following four convergence conditions are satisfied, Q(s, a) converges with probability 1 to Q*(s, a):
1) the environment has the Markov decision process property;
2) the value function is represented by a lookup table, i.e. the Q(s, a) values (the Q matrix) are stored in a table;
3) each state-action pair <S, A> can be iterated infinitely often with the Q(s, a) update formula;
4) reasonable learning rate α.
Four, the balance between exploration and exploitation
The balance between exploration and exploitation is a very basic concept in reinforcement learning: every time a choice is made, when should the best action found so far be exploited, and when should unknown actions be explored and tried? This is the exploration-exploitation balance problem.
The ε-greedy strategy is the most commonly used method for solving the exploration-exploitation balance problem and is the exploration strategy μ used in the Q-Learning algorithm. The ε-greedy strategy formula is as follows:
a_t = argmax_{a∈A} Q(s, a), if σ > ε;   a_t = a random action from A, if σ ≤ ε   (1-5)
where ε is the exploration factor (0 ≤ ε ≤ 1);
σ is a random number between 0 and 1 generated by the algorithm at each step.
It can be seen from formula (1-5) that when the exploration factor ε is larger, the learning system tends to explore the environment and try random actions; when ε is smaller, the learning system tends to select and execute the known optimal action. The choice of the value of ε is therefore very important.
Five, defects of the Q-Learning algorithm
Analysis of the existing Q-Learning algorithm shows that, when it is applied to single-robot path planning, the following problems arise:
1) the exploration-exploitation balance problem
In order to guarantee the convergence speed, the existing Q-Learning algorithm usually decreases the value of the exploration factor ε gradually as the training time increases, i.e. it executes the currently optimal action more and more often so that the algorithm converges quickly to a corresponding solution.
However, the learning system may then miss the optimal solution because its exploration of the environment is insufficient; it converges only to a suboptimal solution, or quite possibly to an ordinary solution.
2) the maximization bias problem
In the Q-Learning algorithm, the policy is updated with Q(s, a) ← Q(s, a) + α·[r + γ·max_{a'} Q(s', a') - Q(s, a)], in which the max operator constantly selects the action a with the largest Q(s, a). This way of selecting may, however, be affected by noise terms, so the final result suffers from a maximization bias problem.
Although repeatedly using the action with the largest Q(s, a) can produce an action policy that maximizes the cumulative reward, when Q(s, a) is not yet accurate this makes the performance of the Q-Learning algorithm worse and worse. Moreover, when learning keeps using the action with the largest Q(s, a), state-action pairs with high Q(s, a) values are likely to dominate early in training, so learning converges too fast and some possibly optimal policies are missed.
3) the slow update problem
The learning process of the Q-Learning algorithm is iterative; it requires continual trial and error and action selection to gradually improve the mapping policy from states to actions. This means that the learning system must, guided by feedback information, carry out repeated trial and error and correction for every possible state-action pair before a suitable control strategy is obtained.
In view of the above problems, the present invention provides a single-robot path planning method based on the Q-Learning algorithm, comprising the following steps:
S1: initialize the exploration factor, maximum number of iterations, final state parameter, target state parameter, maximum count threshold, update start time, number of iterations, current time, action-value function, access counts of state-action pairs, successful path, and successful-path storage table of the single-robot system;
S2: judge whether the number of iterations is greater than the maximum number of iterations; if so, execute step S6; if not, initialize the current state parameter and proceed to the next step;
S3: generate a random number, compare the random number with the exploration factor and choose an action command accordingly, and compute the running state parameter and reward value produced after the robot executes that action command. Then judge whether the running state parameter equals the final state parameter. If so, further judge whether the running state parameter equals the target state parameter: if it does, store the successful path into the successful-path storage table, increment the number of iterations by one, and return to step S2; if it does not, increment the number of iterations by one and return to step S2. If the running state parameter does not equal the final state parameter, proceed to the next step;
S4: judge whether the update start time is less than or equal to the current time; if so, store the reward value, increment the access count of the state-action pair by one, and proceed to the next step; if not, judge whether the access count of the state-action pair equals the maximum count threshold; if it does, update the action-value function and proceed to the next step; if it does not, proceed to the next step;
S5: store the running state parameter into the successful path, increment the current time by one, and return to step S3;
S6: obtain the action-value function, choose an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter, and choose the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
In this way, a maximum count threshold is set in the Q-Learning algorithm, and the access count of a state-action pair is compared with the maximum count threshold to decide whether to update the action-value function (Q value); that is, the action-value function is updated only when the access count of the state-action pair reaches the maximum count threshold. First, this scheme does not reduce the exploration of the environment, so the path planning effect is guaranteed. Second, it not only reduces the computational load of the system but also greatly improves the update and learning speed. Third, this way of updating the action-value function has multi-step look-ahead ability: it takes into account the influence of multiple future state-action pairs on the action-value function, so the learned control strategy is more reasonable. Finally, since the access count of the state-action pair is chosen as the criterion for updating the action-value function, the update and learning speed is improved without affecting the preceding steps and without having to reduce the value of the exploration factor, which avoids the problem of the learning system missing the optimal solution because of insufficient exploration of the environment.
Embodiment one:
As shown in Fig. 1, the single-robot path planning method based on the Q-Learning algorithm comprises the following steps:
S1: initialize the action set A, the state set S, the maximum number of iterations n, the maximum number of exploration steps m, the minimum path count MinPathNum, the maximum success pathfinding count MaxSuccessNum, the exploration factor ε, the single update step of the exploration factor eSize, the change period of the exploration factor eCycle, the maximum count threshold h, the update start time B(s, a), the update completion time, the action-value function Q(s, a), the stored reward U(s, a), the access count of the state-action pair C(s, a), the success pathfinding count SuccessNum, the successful path count PathNum, the successful path PathList, the successful-path storage table List, the number of iterations i, and the current time t.
S2: judge whether the number of iterations i is greater than the maximum number of iterations n; if so, execute step S6; if not, judge whether the success pathfinding count SuccessNum is greater than the prestored maximum success pathfinding count MaxSuccessNum; if it is, update the value of the exploration factor ε and proceed to the next step; if it is not, proceed to the next step.
When updating the value of the exploration factor, judge whether the successful path count PathNum is less than the prestored minimum path count MinPathNum. If so, execute ε' = ε + eSize × (MinPathNum - PathNum), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count. If not, execute ε' = ε - eSize × (i/eCycle), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, i is the number of iterations, and eCycle is the prestored change period of the exploration factor.
S3: generate a random number σ ∈ (0, 1), compare the value of the random number σ with the exploration factor ε, and choose an action command a_t accordingly; compute the running state parameter s_{t+1} and reward value r_{t+1} produced after the robot executes this action command. Judge whether the running state parameter s_{t+1} equals the final state parameter. If so, further judge whether s_{t+1} equals the target state parameter: if it does, store the successful path PathList into the successful-path storage table List, increment the number of iterations i by one, increment the success pathfinding count SuccessNum by one, increment the successful path count PathNum by one, and execute step S2; if it does not, increment the number of iterations i by one and execute step S2. If s_{t+1} does not equal the final state parameter, proceed to the next step.
If the value of the random number σ is greater than the exploration factor ε, the action a_t is chosen according to the prestored probabilistic model; if the value of σ is less than or equal to ε, the action a_t is chosen at random from the action set A. The formula by which the probabilistic model chooses the action command a_t is P(s|a_k) = Q(s, a_k) / Σ_k Q(s, a_k), where P(s|a_k) is the probability of choosing action command a_k under state parameter S, Q(s, a_k) is the Q value of action command a_k under state parameter S, and Σ_k Q(s, a_k) is the sum of the Q values of all action commands under state parameter S.
S4: judge whether the update start time B(s, a) is less than or equal to the current time t. If so, store the reward value r_{t+1} into the stored reward U(s, a), increment the access count C(s, a) of the state-action pair by one, and proceed to the next step. If not, judge whether the access count C(s, a) of the state-action pair equals the maximum count threshold h; if it does, update the action-value function Q(s, a) and proceed to the next step; if it does not, proceed to the next step.
The formula for updating the action-value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action-value function, U(s, a) is the stored reward, and h is the maximum count threshold.
S5: store the running state parameter s_{t+1} into the successful path PathList, increment the current time t by one, and execute step S3.
S6: obtain the action-value function, choose an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter, and choose the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
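Step S6 amounts to following the learned action-value function greedily from the initial state. A minimal sketch is given below; the deterministic transition function step_fn and the max_len safety cap are assumptions of the sketch rather than part of the method as claimed.

```python
def extract_path(Q, actions, step_fn, start, goal, max_len=400):
    """Greedy path extraction from a learned Q table (step S6). `step_fn(s, a)` is
    an assumed deterministic transition function returning the next state."""
    path, s = [start], start
    while s != goal and len(path) < max_len:
        a = max(actions, key=lambda x: Q[(s, x)])  # action with the largest Q value
        s = step_fn(s, a)
        path.append(s)
    return path
```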
In order to better introduce the path planning process, this embodiment also discloses the flow chart of the single-robot path planning method based on the Q-Learning algorithm.
As shown in Fig. 2 and Fig. 3, the single-robot path planning process based on the Q-Learning algorithm comprises the following steps:
Step 1: initialize the action-value function Q(s, a), the action set A, the state set S, the maximum number of iterations n, the maximum number of exploration steps m, the minimum path count MinPathNum, the maximum success pathfinding count MaxSuccessNum, the exploration factor ε, the single update step of the exploration factor eSize, the change period of the exploration factor eCycle, the access count of the state-action pair C(s, a), the update start time B(s, a), the update completion time E(s, a), the stored reward U(s, a), the maximum count threshold h, the learned flag L(s, a) (whether (s, a) has been learned), the success pathfinding count SuccessNum, the successful path count PathNum, the successful path PathList, the successful-path storage table List, the number of iterations i, and the current time t.
Wherein, Q (s, a)=0, C (s, a)=0, U (s, a)=0, SuccessNum=0, PathNum=0, PathList =0, List=0, i=1, t=1.
Step 2: judge whether i is greater than n; if so, end learning; if not, set t = 0, empty PathList, and then judge whether SuccessNum is greater than MaxSuccessNum. If SuccessNum is greater than MaxSuccessNum, update the value of ε; if SuccessNum is less than or equal to MaxSuccessNum, execute step 3.
When updating the value of the exploration factor ε: if PathNum is less than MinPathNum, use the formula ε + eSize × (MinPathNum - PathNum); if PathNum is greater than or equal to MinPathNum, use the formula ε - eSize × (i/eCycle). Here ε is the exploration factor, eSize is the single update step of the exploration factor, MinPathNum is the minimum path count, PathNum is the successful path count, i is the number of iterations, and eCycle is the change period of the exploration factor.
Step 3: init state s, s ∈ S.
Step 4: judge whether t is greater than m; if so, increment i by one and return to step 2; if not, generate a random number σ ∈ (0, 1) and judge whether σ is greater than ε: if it is, select the action a_t to execute in state s_t according to the probabilistic model; if it is not, randomly select an action a_t, a_t ∈ A.
The formula for selecting the action a_t according to the probabilistic model is P(s|a_k) = Q(s, a_k) / Σ_k Q(s, a_k), where P(s|a_k) is the probability of choosing action command a_k under state parameter S, Q(s, a_k) is the Q value of action command a_k under state parameter S, and Σ_k Q(s, a_k) is the sum of the Q values of all action commands under state parameter S.
Step 5: execute the action a_t and obtain the state s_{t+1} and the reward r_{t+1}.
Step 6: judge whether the state s_{t+1} is a final state. If so, further judge whether s_{t+1} is the target state. If s_{t+1} is the target state, perform the following operations: increment SuccessNum by one; determine whether the current PathList is already in List, and if it is not, add PathList to List and increment PathNum by one; increment i by one and execute step 2. If s_{t+1} is not the target state, increment i by one and return to step 2. If s_{t+1} is not a final state, execute step 7.
Step 7: judge whether B(s, a) is less than or equal to t (i.e. whether the update time of the action-value function Q(s, a) is before this step); if so, set L(s, a) = true, marking it as learned; if not, execute step 8.
Step 8: judge whether the value of L(s, a) is true. If so, judge whether C(s, a) equals 0; if it equals 0, learning starts at this moment, so set B(s, a) = t; if it is not 0, do nothing. After C(s, a) has been judged, execute C(s, a) += 1 (the access count is increased by one) and U(s, a) += r_{t+1} + γ·max_a Q(s_{t+1}, a) (the reward is stored). If L(s, a) is not true, execute step 9.
Step 9: judge whether C(s, a) equals h (i.e. whether the access count has reached the maximum count threshold). If so, execute Q(s, a) = U(s, a)/h (taking the average of the previous h stored reward values), U(s, a) = 0 (emptying the stored reward), and C(s, a) = 0 (emptying the access count); at the same time, set the update time E(s, a) = i.
Step 10: judge whether E(s, a) is greater than or equal to B(s, a); if so, set L(s, a) = true, U(s, a) = 0, and C(s, a) = 0; if not, execute step 11.
Step 11: put s_{t+1} into PathList, set s ← s_{t+1} (the current state becomes s_{t+1}), increment t by one, and execute step 4.
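The steps above can be condensed into the following self-contained Python sketch. It is an approximation of the flow chart, not a line-by-line transcription: the B/E/L bookkeeping of steps 7 to 10 is folded into the visit-count test, the clamping of the selection weights to non-negative values is our simplification, and the environment interface (reset, step, actions, goal) is assumed.

```python
import random
from collections import defaultdict

def improved_q_learning(env, n, m, h, eps, e_size, e_cycle,
                        min_path_num=2, max_success_num=10, gamma=0.8):
    """Condensed sketch of steps 1-11.  `env` is assumed to expose reset(),
    step(a) -> (next_state, reward, done), `actions` and `goal`."""
    Q = defaultdict(float)          # action-value function Q(s, a)
    U = defaultdict(float)          # stored one-step targets U(s, a)
    C = defaultdict(int)            # access counts C(s, a)
    success_num, path_store = 0, []

    for i in range(1, n + 1):                                   # step 2: iteration loop
        if success_num > max_success_num:                       # adapt the exploration factor
            if len(path_store) < min_path_num:
                eps += e_size * (min_path_num - len(path_store))
            else:
                eps -= e_size * (i / e_cycle)
            eps = min(max(eps, 0.0), 1.0)
        s, path = env.reset(), []                               # step 3: initial state
        for _ in range(m):                                      # step 4: at most m exploration steps
            if random.random() <= eps:
                a = random.choice(env.actions)                  # random exploration
            else:                                               # Q-proportional exploitation
                w = [max(Q[(s, x)], 0.0) for x in env.actions]  # clamped weights (our simplification)
                a = (random.choices(env.actions, weights=w, k=1)[0]
                     if sum(w) > 0 else random.choice(env.actions))
            s_next, r, done = env.step(a)                       # step 5: execute, observe r and s'
            U[(s, a)] += r + gamma * max(Q[(s_next, x)] for x in env.actions)
            C[(s, a)] += 1                                      # steps 7-8: store target, count visit
            if C[(s, a)] == h:                                  # step 9: update only every h visits
                Q[(s, a)] = U[(s, a)] / h
                U[(s, a)], C[(s, a)] = 0.0, 0
            if done:                                            # step 6: final state reached
                if s_next == env.goal:
                    success_num += 1
                    if path + [s_next] not in path_store:       # record new successful paths
                        path_store.append(path + [s_next])
                break
            path.append(s_next)                                 # step 11: extend the path, advance
            s = s_next
    return Q, path_store
```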
Embodiment two:
This embodiment discloses simulation experiments of single-robot path planning.
One, description of the simulation experiments
1) The simulation experiments are run on the Windows 10 operating system with an Intel Core i5-8400 CPU and 16 GB of memory. The path planning algorithm for the single-robot system is implemented in Python with the TensorFlow deep learning tool, and the multi-robot path planning algorithm is written in the MATLAB language in the MATLAB 2016a simulation software.
2) The environment is described here using the grid method: the workspace of the robot system is divided into small grid cells, and each cell represents one state of the robot system. White cells in the map indicate safe regions, and black cells indicate obstacles.
The target state and the obstacles in the environment are static, and the positions of the obstacles and boundaries are unknown to the robot. In the subsequent experiments, the workspace of the robot is a 10 × 10 or 20 × 20 grid map.
During the simulation, the moving route and initial state of the robot are indicated by 1, and the target state is indicated by 2.
3) The MDP four-tuple of the single-robot system is defined as follows:
Action set: each robot can take four actions: up, down, left, and right. In the grid map this means that the robot moves over blank cells and can move from one cell to the four adjacent cells above, below, to the left, and to the right; it cannot skip over cells and cannot move diagonally (e.g. to the upper left).
The action space of the robot system is therefore A = {0, 1, 2, 3}, where 0 represents up, 1 represents down, 2 represents left, and 3 represents right.
State set: in the grid map, each cell corresponds to a state, so the state space of the system is S = {1, 2, 3, ..., 100} or S = {1, 2, 3, ..., 400}. The grid state of the robot at any time can be expressed as S_t = (x_t, y_t).
When the robot reaches a black cell (an obstacle) or the yellow cell (the target state), it is in a final state. Once the robot reaches a final state, the current round of training ends, the robot returns to the initial state, and the next round of training begins.
Transition function: after the robot has selected an action from the action set, if the cell reached by executing the action is not an obstacle or a boundary wall, the robot moves to that cell.
The transition function of the robot's movement is therefore:
s_{t+1} = s_t + a_t, if the cell reached by action a_t is neither an obstacle nor a boundary wall; s_{t+1} = s_t, otherwise.
Reward function: in the single-robot system, each move of the robot yields an immediate reward of -1, representing the robot's movement cost and forcing it to reach the target state quickly; when the robot reaches the target state, i.e. the yellow cell, it obtains an immediate reward of 10; when the robot hits an obstacle, i.e. enters a black cell, it obtains an immediate reward of -10. The reward function of the single-robot system can therefore be defined as:
r_{t+1} = -1 for an ordinary move; r_{t+1} = 10 when the target state is reached; r_{t+1} = -10 when an obstacle is hit.
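Under one reading of the description (collision with an obstacle ends the episode with reward -10, while a move into a boundary wall leaves the robot in place at cost -1), the grid environment can be sketched as the following Python class; the class layout, method names, and coordinate convention are our own and are not prescribed by the patent.

```python
class GridWorld:
    """Minimal grid environment matching the description: 4 actions (0 up, 1 down,
    2 left, 3 right), reward -1 per move, +10 at the goal, -10 on collision."""
    MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # (dx, dy), y grows downward
    actions = [0, 1, 2, 3]

    def __init__(self, size, obstacles, start, goal):
        self.size, self.obstacles = size, set(obstacles)
        self.start, self.goal = start, goal

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, a):
        dx, dy = self.MOVES[a]
        x, y = self.state[0] + dx, self.state[1] + dy
        if not (0 <= x < self.size and 0 <= y < self.size):
            return self.state, -1, False        # boundary wall: stay put (our interpretation)
        if (x, y) in self.obstacles:
            return (x, y), -10, True            # collision with an obstacle: final state
        self.state = (x, y)
        if self.state == self.goal:
            return self.state, 10, True         # target state reached: final state
        return self.state, -1, False            # ordinary move
```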
Two, parameter initialization
In this embodiment, the parameters are set as follows: 1) learning rate α, with a value between 0 and 1 (if the learning rate is too small, convergence is slow; if it is too large, the algorithm may not converge to the optimal value; here the learning rate α is initialized to 0.01); 2) discount factor γ, with a value between 0 and 1 (it determines whether the robot values immediate or long-term interest more: if the discount factor γ approaches 0, the robot's immediate reward is more important, whereas if it approaches 1, the robot values long-term interest more; in the single-robot path planning simulation here, the discount factor γ is set to 0.8); 3) maximum number of exploration steps m: in this embodiment the maximum number of exploration steps per training round is set to 200 (if the robot has not reached the target state within 200 exploration steps, the strategy taken in this round of training is inappropriate and there is no need to continue; the round is terminated and the next round of training starts directly); 4) exploration factor ε: the initial value of the exploration factor ε is set to 0.4, the minimum path count MinPathNum is set to 2, the maximum success pathfinding count MaxSuccessNum is set to 10, and the single update step eSize of the exploration factor is set according to the complexity of the environment: if the environment is relatively simple the step can be larger, and if the environment is complex the step should be smaller.
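For reference, the initialization above can be collected into a plain Python mapping; the dictionary structure and key names are illustrative only, and eSize is omitted because the text gives no concrete value for it.

```python
# Illustrative parameter initialization mirroring the values given above.
params = {
    "alpha": 0.01,            # learning rate
    "gamma": 0.8,             # discount factor
    "max_steps": 200,         # maximum exploration steps per training round (m)
    "epsilon": 0.4,           # initial exploration factor
    "min_path_num": 2,        # MinPathNum
    "max_success_num": 10,    # MaxSuccessNum
    # eSize is tuned to the complexity of the map: larger for simple maps,
    # smaller for complex ones (no concrete value is given in the text).
}
```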
Experiment one:
Experiment 1 uses a 10 × 10 grid map with randomly placed obstacles; the initial state of the robot is (0, 0) and the target state is (7, 7). Fig. 4 shows the path planned in the simulation experiment using the traditional Q-Learning algorithm; Fig. 5 shows the path planned in the simulation experiment using the improved Q-Learning algorithm.
1) As shown in Fig. 4 and Fig. 5, the grey cells indicate the travelled path. It is evident from the figures that the traditional path (the existing path planning method) has more turning points, while the improved path (the path planning method of the present invention) is smoother, which shows that the solution obtained with the improved Q-Learning algorithm is better than the traditional one.
2) As shown in Fig. 6 and Fig. 7, the traditional robot (the existing path planning method) first finds a collision-free path to the target state at about 700 seconds, whereas the improved robot (the path planning method of the present invention) first finds a collision-free path to the target state at about 300 seconds. The traditional Q-Learning algorithm therefore essentially cannot find a path to the target state in the early stage of training, while the improved algorithm finds a path to the target state faster. It can also be seen that, as the training time increases, the probability that the robot system successfully finds a path increases for both, but after the improvement it rises considerably faster and more often. The traditional algorithm does not tend to converge until about 900 seconds, while the improved algorithm tends to converge at about 500 seconds.
These two points show that, compared with the traditional algorithm, the improved Q-Learning algorithm significantly increases the efficiency of the algorithm.
Experiment two:
In Experiment 2, a 20 × 20 grid map is used with randomly placed obstacles; the initial state of the robot is (0, 3) and the target state is (15, 15). Compared with the environment model of Experiment 1, the environment model of Experiment 2 is more complex: not only are more obstacles placed at random, but there are also many concave regions, which increases the difficulty of robot path planning.
1) As shown in Fig. 8, the grey cells indicate the travelled path. The figure shows, in this grid map, the planned path with which the single-robot path planning algorithm based on the improved Q-Learning algorithm (the path planning method of the present invention) successfully reaches the target state.
2) The environment model of Experiment 2 is more complex and contains concave regions. As shown in Fig. 9 and Fig. 10, when the traditional Q-Learning algorithm (the existing path planning method) is used for path planning, after 1000 training episodes it still has not successfully found a path to the target state. Observation of its training process shows that training with the traditional Q-Learning algorithm keeps getting stuck in the concave regions, so it cannot learn successfully. The improved Q-Learning algorithm (the path planning method of the present invention) is still feasible in this complex environment: it successfully finds a collision-free path to the target state at around step 500 and then gradually converges. Moreover, in terms of training time, although the traditional Q-Learning algorithm (the existing path planning method) does not converge, after the same 1000 training episodes the improved algorithm takes less time, which shows that its update efficiency is higher.
The above two comparison results show that the improved Q-Learning algorithm (the path planning method of the present invention) is still feasible in complex environments and updates faster, and therefore has practical application value.
The foregoing is only an embodiment of the present invention; common knowledge such as well-known specific structures and characteristics is not described in excessive detail in the solution. A person of ordinary skill in the art to which the present invention belongs knows all the common technical knowledge in the field before the filing date or the priority date, can know all the prior art in the field, and has the ability to apply routine experimental means before that date; under the enlightenment provided by this application, a person skilled in the art can improve and implement this solution in combination with their own abilities, and some typical known structures or known methods should not become obstacles to a person skilled in the art implementing this application. It should be pointed out that, for a person skilled in the art, several modifications and improvements can also be made without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention and will not affect the effect of implementing the present invention or the practicability of the patent. The scope of protection claimed by this application shall be based on the content of the claims; the specific embodiments and other records in the description may be used to interpret the content of the claims.

Claims (8)

1. A single-robot path planning method based on the Q-Learning algorithm, which comprises the following steps:
S1: initialize the exploration factor, maximum number of iterations, final state parameter, target state parameter, maximum count threshold, update start time, number of iterations, current time, action-value function, access counts of state-action pairs, successful path, and successful-path storage table of the single-robot system;
S2: judge whether the number of iterations is greater than the maximum number of iterations; if so, execute step S6; if not, initialize the current state parameter and proceed to the next step;
S3: generate a random number, compare the random number with the exploration factor and choose an action command accordingly, and compute the running state parameter and reward value produced after the robot executes that action command; then judge whether the running state parameter equals the final state parameter; if so, further judge whether the running state parameter equals the target state parameter: if it does, store the successful path into the successful-path storage table, increment the number of iterations by one, and return to step S2; if it does not, increment the number of iterations by one and return to step S2; if the running state parameter does not equal the final state parameter, proceed to the next step;
S4: judge whether the update start time is less than or equal to the current time; if so, store the reward value, increment the access count of the state-action pair by one, and proceed to the next step; if not, judge whether the access count of the state-action pair equals the maximum count threshold; if it does, update the action-value function and proceed to the next step; if it does not, proceed to the next step;
S5: store the running state parameter into the successful path, increment the current time by one, and return to step S3;
S6: obtain the action-value function, choose an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter, and choose the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
2. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S4, the formula for updating the action-value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action-value function, U(s, a) is the stored reward, and h is the maximum count threshold.
3. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, a preset success pathfinding count is incremented by one before step S2 is executed.
4. The single-robot path planning method based on the Q-Learning algorithm according to claim 3, characterized in that: in step S2, if the number of iterations is less than the maximum number of iterations, first judge whether the success pathfinding count is greater than a prestored maximum success pathfinding count; if so, update the value of the exploration factor and proceed to the next step; if not, proceed to the next step.
5. The single-robot path planning method based on the Q-Learning algorithm according to claim 4, characterized in that: in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, the successful path count is incremented by one before step S2 is executed.
6. The single-robot path planning method based on the Q-Learning algorithm according to claim 5, characterized in that: in step S2, when updating the value of the exploration factor, first judge whether the successful path count is less than a prestored minimum path count; if so, execute ε' = ε + eSize × (MinPathNum - PathNum), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count; if not, execute ε' = ε - eSize × (i/eCycle), where ε is the exploration factor before the update, ε' is the updated exploration factor, and ε' is then assigned to ε; eSize is the prestored single update step of the exploration factor, i is the number of iterations, and eCycle is the prestored change period of the exploration factor.
7. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S3, when the random number is compared with the exploration factor, if the random number is greater than the exploration factor, an action command is chosen according to a prestored probabilistic model; if the random number is less than or equal to the exploration factor, an action command is chosen at random from the prestored action set.
8. The single-robot path planning method based on the Q-Learning algorithm according to claim 7, characterized in that: in step S3, the formula by which the probabilistic model chooses an action command is P(s|a_k) = Q(s, a_k) / Σ_k Q(s, a_k), where P(s|a_k) is the probability of choosing action command a_k under state parameter S, Q(s, a_k) is the Q value of action command a_k under state parameter S, and Σ_k Q(s, a_k) is the sum of the Q values of all action commands under state parameter S.
CN201910737476.6A 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm Active CN110378439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910737476.6A CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910737476.6A CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Publications (2)

Publication Number Publication Date
CN110378439A true CN110378439A (en) 2019-10-25
CN110378439B CN110378439B (en) 2021-03-30

Family

ID=68258789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910737476.6A Active CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Country Status (1)

Country Link
CN (1) CN110378439B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
CN105137967A (en) * 2015-07-16 2015-12-09 北京工业大学 Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm
US20180354126A1 (en) * 2017-06-07 2018-12-13 Fanuc Corporation Controller and machine learning device
CN107317756A (en) * 2017-07-10 2017-11-03 北京理工大学 A kind of optimal attack paths planning method learnt based on Q
CN108594803A (en) * 2018-03-06 2018-09-28 吉林大学 Paths planning method based on Q- learning algorithms
CN108762249A (en) * 2018-04-26 2018-11-06 常熟理工学院 Clean robot optimum path planning method based on the optimization of approximate model multistep
CN109445440A (en) * 2018-12-13 2019-03-08 重庆邮电大学 The dynamic obstacle avoidance method with improvement Q learning algorithm is merged based on sensor
CN109933086A (en) * 2019-03-14 2019-06-25 天津大学 Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AMIT KONAR ET AL: "A Deterministic Improved Q-Learning for Path Planning of a Mobile Robot", 《IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS》 *
JING PENG AND RONALD J. WILLIAMS: "Incremental Multi-Step Q-Learning", 《MACHINE LEARNING》 *
GAO Le et al.: "Application of improved Q-Learning algorithm in path planning", Journal of Jilin University (Information Science Edition) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859099B (en) * 2019-12-05 2021-08-31 马上消费金融股份有限公司 Recommendation method, device, terminal and storage medium based on reinforcement learning
CN111859099A (en) * 2019-12-05 2020-10-30 马上消费金融股份有限公司 Recommendation method, device, terminal and storage medium based on reinforcement learning
CN111080013A (en) * 2019-12-18 2020-04-28 南方科技大学 Addressing way-finding prediction method, device, equipment and computer readable storage medium
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN113534826B (en) * 2020-04-15 2024-02-23 苏州宝时得电动工具有限公司 Attitude control method and device of self-mobile device and storage medium
CN113534826A (en) * 2020-04-15 2021-10-22 苏州宝时得电动工具有限公司 Attitude control method and device for self-moving equipment and storage medium
CN111594322A (en) * 2020-06-05 2020-08-28 沈阳航空航天大学 Variable-cycle aero-engine thrust control method based on Q-Learning
CN111594322B (en) * 2020-06-05 2022-06-03 沈阳航空航天大学 Variable-cycle aero-engine thrust control method based on Q-Learning
CN111649758A (en) * 2020-06-16 2020-09-11 华东师范大学 Path planning method based on reinforcement learning algorithm in dynamic environment
CN111649758B (en) * 2020-06-16 2023-09-15 华东师范大学 Path planning method based on reinforcement learning algorithm in dynamic environment
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111857081A (en) * 2020-08-10 2020-10-30 电子科技大学 Chip packaging test production line performance control method based on Q-learning reinforcement learning
CN112327890A (en) * 2020-11-10 2021-02-05 中国海洋大学 Underwater multi-robot path planning based on WHCA algorithm
CN112595326A (en) * 2020-12-25 2021-04-02 湖北汽车工业学院 Improved Q-learning path planning algorithm with fusion of priori knowledge
CN113062601A (en) * 2021-03-17 2021-07-02 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113062601B (en) * 2021-03-17 2022-05-13 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN114518758A (en) * 2022-02-08 2022-05-20 中建八局第三建设有限公司 Q learning-based indoor measuring robot multi-target-point moving path planning method
CN114518758B (en) * 2022-02-08 2023-12-12 中建八局第三建设有限公司 Indoor measurement robot multi-target point moving path planning method based on Q learning
CN117634548A (en) * 2024-01-26 2024-03-01 西南科技大学 Unmanned aerial vehicle behavior tree adjustment and optimization method and system

Also Published As

Publication number Publication date
CN110378439B (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN110378439A (en) Single robot path planning method based on Q-Learning algorithm
CN106096729B (en) A kind of depth-size strategy learning method towards complex task in extensive environment
CN108762281A (en) It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
CN105527964B (en) A kind of robot path planning method
Carmel et al. Model-based learning of interaction strategies in multi-agent systems
CN109214498A (en) Ant group algorithm optimization method based on search concentration degree and dynamic pheromone updating
CN108776483A (en) AGV paths planning methods and system based on ant group algorithm and multiple agent Q study
CN109241291A (en) Knowledge mapping optimal path inquiry system and method based on deeply study
CN105911992A (en) Automatic path programming method of mobile robot, and mobile robot
CN107253195B (en) A kind of carrying machine human arm manipulation ADAPTIVE MIXED study mapping intelligent control method and system
CN104571113A (en) Route planning method for mobile robot
Kamoshida et al. Acquisition of automated guided vehicle route planning policy using deep reinforcement learning
CN111695690A (en) Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN113296520A (en) Routing planning method for inspection robot by fusing A and improved Hui wolf algorithm
CN109726676A (en) The planing method of automated driving system
CN115129064A (en) Path planning method based on fusion of improved firefly algorithm and dynamic window method
CN117103282A (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
CN105867427B (en) Diameter On-Line Control Method is sought by a kind of robot towards dynamic environment
Zhao et al. Multi-objective reinforcement learning algorithm for MOSDMP in unknown environment
Kalyanakrishnan et al. Learning complementary multiagent behaviors: A case study
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN116592890B (en) Picking robot path planning method, picking robot path planning system, electronic equipment and medium
Li et al. Path planning of mobile robot based on dynamic chaotic ant colony optimization algorithm
McGovern et al. Hierarchical optimal control of MDPs
Chen et al. An improved bacterial foraging optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant