CN110378439A - Single robot path planning method based on Q-Learning algorithm - Google Patents
Info
- Publication number
- CN110378439A (application number CN201910737476.6A)
- Authority
- CN
- China
- Prior art keywords
- state parameter
- equal
- path
- factor
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
Abstract
The present invention relates to the technical field of robot path planning, and in particular to a single-robot path planning method based on the Q-Learning algorithm, comprising: initializing the parameters of the algorithm; choosing an action command and computing the running state parameter and reward function generated by it; if the running state parameter equals the final state parameter and equals the target state parameter, storing the successful path into the successful-path storage table; otherwise, when the update start time is less than or equal to the current time and the access count of the state-action pair equals the maximum count threshold, updating the action value function and storing the running state parameter into the successful path; repeating the above steps until the maximum number of iterations is reached; then repeatedly choosing action commands according to the action value function and generating state parameters, thereby obtaining the optimal path of the single robot. When the Q-Learning algorithm is used for single-robot path planning, the present invention improves both the update and learning speed of the learning system and the path planning effect.
Description
Technical field
The present invention relates to the technical field of robot path planning, and in particular to a single-robot path planning method based on the Q-Learning algorithm.
Background art
Mobile robots are widely used: fields as diverse as the home, agriculture, industry, and the military all make use of them. The three core problems in mobile robot control research are robot localization, task allocation, and path planning. Among these, path planning is the precondition for a mobile robot to reach its task objective and complete its task. For example, a home cleaning robot must plan a reasonable path through an indoor environment to complete its cleaning task; an agricultural picking robot needs path planning to move between crops and complete the picking task; and an industrial robot likewise requires path planning to complete its assigned task in a shared workspace.
Single-robot systems are widely applied in home service, agricultural assistance, industrial environments, and other settings. In such applications, path planning for the single-robot system is particularly important. It refers to finding, in the robot's working environment, a path from an initial state to a target state that avoids all obstacles. This requires machine learning, and in the related art the most common learning method is reinforcement learning.
The Q-Learning algorithm is an important algorithm in reinforcement learning, and in the related art it has been applied to path planning for robot systems. The learning process of the Q-Learning algorithm is iterative: through continual trial and error and action selection, it gradually updates a Q value table (the action value function). That is, a reward function is set; the robot chooses an action command according to the ε-greedy strategy (where ε is the exploration factor, 0 ≤ ε ≤ 1), executes the action command, and updates the Q value table according to the reward function; a new state parameter is then generated, the next action is chosen according to the state parameter and the ε-greedy strategy, and the cycle of executing action commands and updating the Q value table continues until the final Q value table is obtained, from which the optimal path is derived.
The existing Q-Learning algorithm must constantly select and update Q values to improve action selection; that is, every exploration of the environment requires a Q value update, which makes the update and learning of the learning system slow. In the related art, to keep up the update and learning speed, the value of the exploration factor ε is steadily decreased as training time increases, i.e., the algorithm increasingly executes the currently optimal action so that it converges quickly to a corresponding solution. The cost of this approach is that the learning system may miss the optimal solution because its exploration of the environment is insufficiently complete (the value of ε is too small), converging only to a suboptimal solution or, quite possibly, to an ordinary solution. This defect degrades the path planning effect.
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is: how to improve the update and learning speed of the learning system and the path planning effect when the Q-Learning algorithm is used for single-robot path planning.
To solve the above technical problem, the present invention adopts the following technical solution:
A single-robot path planning method based on the Q-Learning algorithm, comprising the following steps:
S1: initialize the exploration factor, maximum number of iterations, final state parameter, target state parameter, maximum count threshold, update start time, iteration count, current time, action value function, access counts of state-action pairs, successful path, and successful-path storage table of the single-robot system;
S2: judge whether the iteration count is greater than the maximum number of iterations; if yes, execute step S6; if no, initialize the current state parameter and proceed to the next step;
S3: generate a random number, compare it with the exploration factor, and choose an action command accordingly; compute the running state parameter and reward function produced after the robot executes that action command. Then judge whether the running state parameter equals the final state parameter. If yes, further judge whether the running state parameter equals the target state parameter: if equal, store the successful path into the successful-path storage table, increment the iteration count, and return to step S2; if not equal, increment the iteration count and return to step S2. If the running state parameter does not equal the final state parameter, proceed to the next step;
S4: judge whether the update start time is less than or equal to the current time; if yes, store the reward function, increment the access count of the state-action pair, and proceed to the next step; if no, judge whether the access count of the state-action pair equals the maximum count threshold: if yes, update the action value function and proceed to the next step; if no, proceed directly to the next step;
S5: store the running state parameter into the successful path, increment the current time, and return to step S3;
S6: obtain the action value function; choose an action command from the action value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter, then choose the next action command according to that state parameter. When the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
In this way, a maximum count threshold is set in the Q-Learning algorithm, and the access count of each state-action pair is compared with this threshold to decide whether to update the action value function (Q value); that is, the action value function is updated only when the access count of the state-action pair reaches the maximum count threshold. First, this scheme does not reduce exploration of the environment, so the path planning effect is guaranteed. Second, it not only reduces the computational load of the system but also greatly improves the update and learning speed. Third, this way of updating the action value function has multi-step look-ahead ability: it takes into account the influence of multiple future state-action pairs on the action value function, so the learned control strategy is more reasonable. Finally, because the access count of state-action pairs is used as the basis for updating the action value function, the update and learning speed is improved without affecting the preceding steps, and there is no need to reduce the value of the exploration factor; this avoids the problem of the learning system missing the optimal solution because of insufficiently complete exploration of the environment.
Therefore, when the Q-Learning algorithm is used for single-robot path planning, this scheme improves both the update and learning speed of the learning system and the path planning effect.
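The counted-update scheme of steps S1 to S6 can be sketched in a few lines of Python. This is an illustrative assumption-laden sketch, not code from the patent: the toy 1-D corridor environment (states 0 to 4, goal at 4), the distance-shaped reward, and the parameter values (h = 3, ε = 0.3, 200 iterations) are all chosen here for demonstration.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy 1-D corridor: states 0..4, goal at 4, actions move right/left.
GOAL = 4
ACTIONS = (1, -1)

def step(s, a):
    """Execute an action command and return the running state and reward."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, -abs(GOAL - s2)        # illustrative reward: negative distance

h = 3                                 # maximum count threshold
epsilon = 0.3                         # exploration factor
Q = defaultdict(float)                # action value function
U = defaultdict(float)                # stored reward functions U(s, a)
C = defaultdict(int)                  # access counts of state-action pairs

for i in range(200):                  # S2: iterate up to the max iterations
    s = 0                             # current state parameter
    for t in range(100):              # bounded episode for safety
        if s == GOAL:                 # running state equals final state
            break
        # S3: compare a random number with the exploration factor
        if random.random() <= epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # S4: store the reward and count the access; update the action
        # value function only when the count reaches the threshold h
        U[(s, a)] += r
        C[(s, a)] += 1
        if C[(s, a)] == h:
            Q[(s, a)] = U[(s, a)] / h      # averaged update Q = U / h
            U[(s, a)], C[(s, a)] = 0.0, 0
        s = s2                        # S5: advance to the next state

# S6: greedy rollout from the initial state yields the planned path
s, path = 0, [0]
while s != GOAL and len(path) < 10:
    s, _ = step(s, max(ACTIONS, key=lambda x: Q[(s, x)]))
    path.append(s)
print(path)
```

The point of the sketch is the deferred update: Q(s, a) changes only once per h visits, using the average of the stored rewards, rather than after every single step as in the traditional algorithm.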
Preferably, in step S4, the formula for updating the action value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
In this way, the updated action value function is the average of all stored reward functions. On the one hand, the averaging calculation reduces the computational load and helps improve the accuracy of the action value function; on the other hand, this calculation no longer requires an eligibility trace matrix, further reducing computational complexity.
Preferably, in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, a preset successful pathfinding count is incremented before step S2 is executed.
In this way, by recording and updating the successful pathfinding count, the learning effect of the learning system can be fed back in time, and the count can assist in deciding how the exploration factor is updated, which helps resolve the exploration-exploitation balance problem of the Q-Learning algorithm.
Preferably, in step S2, if the iteration count is not greater than the maximum number of iterations, first judge whether the successful pathfinding count is greater than a prestored maximum successful pathfinding count; if yes, update the value of the exploration factor and proceed to the next step; if no, proceed directly to the next step.
In this way, the successful pathfinding count feeds back the learning effect of the learning system, and the value of the exploration factor is continually updated according to that learning effect, so that the ε-greedy strategy is more adaptive and better matches the robot's motion pattern.
Preferably, in step S3, if the running state parameter equals the final state parameter and equals the target state parameter, the successful path count is incremented before step S2 is executed.
In this way, by recording and updating the successful path count, the learning effect of the learning system can be fed back in time, and the count can take part in deciding subsequent updates of the exploration factor, which likewise helps resolve the exploration-exploitation balance problem of the Q-Learning algorithm.
Preferably, in step S2, when updating the value of the exploration factor, judge whether the successful path count is less than a prestored minimum path count. If yes, execute ε' = ε + eSize × (MinPathNum − PathNum), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then let ε = ε'; in the formula, eSize is the prestored single-update step length of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count. If no, execute ε' = ε − eSize × (i/eCycle), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then let ε = ε'; in the formula, eSize is the prestored single-update step length of the exploration factor, i is the iteration count, and eCycle is the prestored change period of the exploration factor.
In this way, the combination of the successful path count and the successful pathfinding count feeds back the learning effect of the learning system more accurately and promptly, and the value of the exploration factor is continually updated according to that learning effect, so that the ε-greedy strategy is more adaptive and better matches the robot's motion pattern.
Preferably, in step S3, when comparing the random number with the exploration factor: if the random number is greater than the exploration factor, choose an action command according to a prestored probabilistic model; if the random number is less than or equal to the exploration factor, randomly choose an action command from the prestored action set.
In this way, through the comparison of the random number with the exploration factor and the probabilistic model, a probability-based action selection strategy is realized: actions with larger action-value-function values are selected with higher probability, which solves the problem of biased maximum-value selection caused by noise.
Preferably, in step S3, the formula by which the probabilistic model chooses an action command is P(s|a_k) = Q(s, a_k) / Σ_i Q(s, a_i), where P(s|a_k) is the probability of choosing action command a_k under state parameter s, Q(s, a_k) is the Q value of action command a_k under state parameter s, and Σ_i Q(s, a_i) is the sum of the Q values of all action commands under state parameter s.
In this way, through prior training and learning, the probabilistic model makes actions with larger action-value-function values more likely to be selected, which helps solve the maximization bias problem.
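The proportional selection rule above can be sketched as follows. Note that selecting in direct proportion to Q values assumes the Q values are non-negative; when they can be negative, a softmax over the Q values is a common alternative (an assumption on our part, not stated in the patent).

```python
import random

# Choose an action with probability proportional to its Q value,
# per P(s|a_k) = Q(s, a_k) / sum_i Q(s, a_i). Illustrative sketch.
def choose_action(q_row, rng=random.random):
    total = sum(q_row.values())
    threshold, cumulative = rng() * total, 0.0
    for action, q in q_row.items():
        cumulative += q
        if cumulative >= threshold:
            return action
    return action  # numerical fallback for floating-point edge cases

# Hypothetical Q values for one state: "up" holds 60% of the total mass.
q_row = {"up": 6.0, "down": 1.0, "left": 1.0, "right": 2.0}
counts = {a: 0 for a in q_row}
random.seed(1)
for _ in range(10000):
    counts[choose_action(q_row)] += 1
print(max(counts, key=counts.get))  # "up" is selected most often
```

Unlike a hard argmax, every action keeps a nonzero selection probability, which is what mitigates the noise-driven maximization bias the text describes.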
Brief description of the drawings
To make the objects, technical solutions, and advantages of the invention clearer, the present invention is described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a logic diagram of the single-robot path planning method based on the Q-Learning algorithm in embodiment one;
Fig. 2 is a flowchart of the single-robot path planning method based on the Q-Learning algorithm in embodiment one;
Fig. 3 is a flowchart of updating the Q value table in embodiment one;
Fig. 4 is a schematic diagram of the path planned by the traditional Q-Learning algorithm in experiment one of embodiment two;
Fig. 5 is a schematic diagram of the path planned by the Q-Learning algorithm improved by the present invention in experiment one of embodiment two;
Fig. 6 is a line chart of the time taken by the traditional Q-Learning algorithm to converge in experiment one of embodiment two;
Fig. 7 is a line chart of the time taken by the improved Q-Learning algorithm to converge in experiment one of embodiment two;
Fig. 8 is a schematic diagram of the path planned by the improved Q-Learning algorithm in experiment two of embodiment two;
Fig. 9 is a line chart of the number of training runs required for the traditional Q-Learning algorithm to converge in experiment two of embodiment two;
Fig. 10 is a line chart of the number of training runs required for the improved Q-Learning algorithm to converge in experiment two of embodiment two.
Specific embodiment
The present invention is explained in further detail below through specific embodiments:
The Q-Learning algorithm was proposed by Watkins in 1989 and is an important algorithm in reinforcement learning.
One, the update rule of the Q-Learning algorithm
A robot under the Q-Learning algorithm does not know the whole environment; it only knows the set of actions selectable in the current state. Usually an immediate reward matrix R must be constructed to represent the reward value of moving from state s to the next state s'. The Q value table (also called the Q matrix) that guides the robot's actions is computed from R.
Accordingly, each state-action pair is denoted <s, a>, and the Q learning algorithm estimates the value function Q(s, a) of state-action pairs in order to obtain a control strategy. The simplest form of Q learning is single-step Q learning, whose correction formula for the Q value is:
Q(s, a) = r + γ max_{a'} Q(s', a')    (1-1)
Formula (1-1) holds only under the optimal policy; at the beginning of learning, the two sides of formula (1-1) are not equal, and the error is:
ΔQ_t(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)    (1-2)
This yields the update rule:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α ΔQ_t(s_t, a_t)    (1-3)
That is:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α[r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (1-4)
where:
- s_t: the current state;
- a_t: the action selected in the current state;
- s_{t+1}: the next state after executing action a_t;
- r_{t+1}: the immediate reward after executing action a_t;
- Q(s_t, a_t): the accumulated discounted reward obtained after the robot executes action a_t in state s_t, i.e., the value function of the state-action pair;
- α: the learning rate controlling convergence, 0 < α < 1; by constantly exploring the search space, the Q value gradually approaches its optimal value;
- γ: the discount factor, 0 ≤ γ < 1. γ = 0 emphasizes the immediate reward, while γ tending to 1 emphasizes future rewards; γ thus determines how strongly the temporal distance of a reward affects its weight, i.e., the degree to which current reward is sacrificed in exchange for long-term benefit.
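Update rule (1-4) can be checked with a small worked example. The states, actions, and numbers below are illustrative, not from the patent.

```python
# One-step Q-Learning update, formula (1-4):
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    best_next = max(q.get((s2, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - q.get((s, a), 0.0)
    )

# Hypothetical Q table: the next state "s1" already has a learned value.
Q = {("s1", "right"): 5.0}
q_update(Q, "s0", "right", r=1.0, s2="s1", actions=["left", "right"])
print(Q[("s0", "right")])  # 0 + 0.5 * (1 + 0.9*5 - 0) = 2.75
```

The term γ·max Q(s', a') is what propagates value backwards from the next state, so rewards near the goal gradually influence earlier states.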
Two, the steps of the Q-Learning algorithm
First initialize the Q(s, a) values. In state s, the robot selects an action a according to the action selection strategy π, obtains the next state s' and the reward value r, and then corrects the Q(s, a) value according to the update rule. Action selection and Q(s, a) correction are repeated until learning ends.
The overall flow of a typical Q-Learning algorithm is as follows:
1. For i = 1 : n:
2. Initialize state s
3. For each learning cycle:
4. Select an action a using strategy μ
5. Execute a, obtain the immediate reward r, and transfer to the next state s'
6. Update the Q value function of strategy π according to formula (1-4)
7. Update the current strategy
8. s ← s'
9. Until s is the final state
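Steps 1 to 9 above can be sketched as a classical Q-Learning loop. The corridor environment and the parameter values (α = 0.5, γ = 0.9, ε = 0.2, 300 episodes) are illustrative assumptions for the sketch, not the patent's experimental setup.

```python
import random

random.seed(0)

# Classical Q-Learning on a toy 5-state corridor (goal at state 4),
# following steps 1-9 above.
GOAL = 4
ACTIONS = (1, -1)                              # move right / move left
alpha, gamma, eps = 0.5, 0.9, 0.2
Q = {}

def q(s, a):
    return Q.get((s, a), 0.0)

for episode in range(300):                     # step 1: for i = 1..n
    s = 0                                      # step 2: initialize state s
    while s != GOAL:                           # step 9: until s is final
        if random.random() < eps:              # step 4: strategy mu
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: q(s, x))
        s2 = min(max(s + a, 0), GOAL)          # step 5: execute a, get s'
        r = 10.0 if s2 == GOAL else -1.0       # immediate reward
        best = max(q(s2, x) for x in ACTIONS)  # step 6: update per (1-4)
        Q[(s, a)] = q(s, a) + alpha * (r + gamma * best - q(s, a))
        s = s2                                 # step 8: s <- s'

# After training, moving right dominates moving left in every state.
print(all(q(s, 1) > q(s, -1) for s in range(GOAL)))
```

Contrast this with the earlier counted-update sketch: here the Q value is corrected after every single step, which is exactly the per-step update cost the invention sets out to reduce.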
Three, convergence of Q-Learning
When the following four convergence conditions are met, Q(s, a) converges in probability to Q*(s, a):
1) the environment has the Markov decision process property;
2) the value function is represented by a lookup table, i.e., the Q(s, a) values (the Q matrix) are stored in a table;
3) each state-action pair <s, a> can be iterated infinitely often with the Q(s, a) update formula;
4) the learning rate α is reasonable.
Four, the exploration-exploitation balance problem
The balance between exploration and exploitation is a fundamental concept in reinforcement learning. Each time a selection is made: when should the best action explored so far be selected, and when should an unknown action be explored and tried? This is the exploration-exploitation balance problem.
The ε-greedy strategy (where ε is the exploration factor, 0 ≤ ε ≤ 1) is the most commonly used method for solving the exploration-exploitation balance problem, and it is the exploration strategy μ used in the Q-Learning algorithm. The ε-greedy strategy is as follows:
π(s) = a random action from the action set, if σ < ε; argmax_a Q(s, a), otherwise    (1-5)
where:
ε: the exploration factor (0 ≤ ε ≤ 1);
σ: a random number between 0 and 1 generated by the algorithm at each step.
Formula (1-5) shows that when the exploration factor ε is larger, the learning system is inclined to explore the environment and try random actions; when ε is smaller, the learning system tends to select and execute the known optimal action. The choice of the value of ε is therefore very important.
Five, defects of the Q-Learning algorithm
Analysis of the existing Q-Learning algorithm reveals the following problems when it is applied to single-robot path planning:
1) The exploration-exploitation balance problem
To guarantee convergence speed, the existing Q-Learning algorithm usually steadily decreases the value of the exploration factor ε as training time increases, i.e., it increasingly executes the currently optimal action so as to converge quickly to a corresponding solution.
However, the learning system may then miss the optimal solution because its exploration of the environment is insufficiently complete, converging to a suboptimal solution or, quite possibly, to an ordinary solution.
2) The maximization bias problem
In the Q-Learning algorithm, the update rule is Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)], in which the max operator is used; that is, the update constantly selects the action a with the maximum Q(s, a). But this selection method can be influenced by noise terms, so the final result suffers from a maximization bias.
Although repeatedly using the action with the maximum Q(s, a) can produce an action policy that maximizes the cumulative reward, when Q(s, a) is not yet accurate this makes the performance of the Q-Learning algorithm worse and worse. Moreover, if learning constantly uses the action with the maximum Q(s, a), the system may lock onto state-action pairs with high Q(s, a) values early in training, converging too fast and missing some possibly optimal policies.
3) The slow update problem
The learning process of the Q-Learning algorithm is iterative: through constant trial and error and action selection, it gradually improves the mapping policy from states to actions. This requires the learning system to perform repeated trial and error and correction, via feedback information, for every possible state-action pair before a well-suited control strategy is obtained.
To address the above problems, the present invention provides a single-robot path planning method based on the Q-Learning algorithm, comprising steps S1 to S6 as set out in the summary of the invention above; the steps and their advantages are detailed in the following embodiments.
Embodiment one:
As shown in Fig. 1, the single-robot path planning method based on the Q-Learning algorithm comprises the following steps:
S1: initialize the action set A, state set S, maximum number of iterations n, maximum number of exploration steps m, minimum path count MinPathNum, maximum successful pathfinding count MaxSuccessNum, exploration factor ε, single-update step length of the exploration factor eSize, change period of the exploration factor eCycle, maximum count threshold h, update start time B(s, a), completion update time, action value function Q(s, a), stored reward function U(s, a), access count of state-action pairs C(s, a), successful pathfinding count SuccessNum, successful path count PathNum, successful path PathList, successful-path storage table List, iteration count i, and current time t.
S2: judge whether the iteration count i is greater than the maximum iteration count n; if so, execute step S6; if not, judge whether the successful path-finding count SuccessNum is greater than the prestored maximum successful path-finding count MaxSuccessNum; if so, update the value of the exploration factor ε and proceed to the next step; if not, proceed directly to the next step.
When updating the value of the exploration factor, judge whether the successful path count PathNum is less than the prestored minimum path count MinPathNum; if so, execute ε' = ε + eSize × (MinPathNum − PathNum), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then set ε = ε'; in the formula, eSize is the prestored single-update step size of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count. If not, execute ε' = ε − eSize × (i/eCycle), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then set ε = ε'; in the formula, eSize is the prestored single-update step size of the exploration factor, i is the iteration count, and eCycle is the prestored exploration-factor change period.
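As a minimal sketch, the two update rules above can be written as a single function; the default values of min_path_num, e_size and e_cycle below are illustrative, not the patent's settings:

```python
def update_exploration_factor(eps, path_num, iteration,
                              min_path_num=2, e_size=0.05, e_cycle=100):
    """Update the exploration factor per the two rules above: raise it
    when too few distinct successful paths exist, otherwise decay it
    with the iteration count."""
    if path_num < min_path_num:
        # ε' = ε + eSize × (MinPathNum − PathNum)
        return eps + e_size * (min_path_num - path_num)
    # ε' = ε − eSize × (i / eCycle)
    return eps - e_size * (iteration / e_cycle)
```

With fewer than MinPathNum distinct successful paths the factor grows, pushing the robot to keep exploring; once enough paths have been found it decays, so the learned policy is exploited more.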
S3: generate a random number σ ∈ (0, 1), compare σ with the exploration factor ε and then select an action command a_t; from a_t, compute the running state parameter s_{t+1} and the reward function r_{t+1} obtained after the robot executes the action command. Then judge whether the running state parameter s_{t+1} equals the terminal state parameter; if so, further judge whether s_{t+1} equals the target state parameter; if equal, store the successful path PathList into the successful-path storage table List, increment the iteration count i, the successful path-finding count SuccessNum and the successful path count PathNum by one each, and execute step S2; if not equal, increment the iteration count i by one and execute step S2. If s_{t+1} does not equal the terminal state parameter, proceed to the next step.
Specifically, if the value of the random number σ is greater than the exploration factor ε, the action a_t is selected according to the prestored probabilistic model; if σ is less than or equal to ε, an action a_t is randomly selected from the action set A. The formula by which the probabilistic model selects the action command a_t is P(a_k | s) = Q(s, a_k) / Σ_j Q(s, a_j), where P(a_k | s) is the probability of selecting action command a_k under state parameter s, Q(s, a_k) is the Q value of action command a_k under state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all action commands under state parameter s.
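This selection rule can be sketched as follows. The sketch assumes non-negative Q values so that the proportions in the formula are well defined; the function name and the `rng` parameter are illustrative:

```python
import random

def choose_action(q_row, eps, rng=random):
    """Draw sigma in (0,1): if sigma > eps, pick an action with
    probability proportional to its Q value (the probabilistic model
    P(a_k|s) = Q(s,a_k) / sum_j Q(s,a_j)); otherwise pick an action
    uniformly at random from the action set."""
    sigma = rng.random()
    if sigma > eps and sum(q_row) > 0:
        r = rng.random() * sum(q_row)   # weighted draw over Q values
        acc = 0.0
        for action, q in enumerate(q_row):
            acc += q
            if r <= acc:
                return action
    return rng.randrange(len(q_row))    # uniform exploration
```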
S4: judge whether the update start time B(s, a) is less than or equal to the current time t; if so, store the reward function r_{t+1} into the reward-function store U(s, a), increment the visit count C(s, a) of the state–action pair by one, and proceed to the next step; if not, judge whether the visit count C(s, a) of the state–action pair equals the maximum count threshold h; if so, update the action-value function Q(s, a) and proceed to the next step; if not, proceed to the next step.
The formula for updating the action-value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action-value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
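A sketch of this count-gated update, with Q, U and C held as dictionaries keyed by the (state, action) pair (the helper name is an assumption):

```python
def maybe_update_q(Q, U, C, s, a, h):
    """When the visit count C(s,a) reaches the threshold h, set
    Q(s,a) = U(s,a) / h (the average of the h stored reward terms)
    and clear both accumulators; otherwise leave Q untouched.
    Returns True when an update occurred."""
    if C.get((s, a), 0) == h:
        Q[(s, a)] = U[(s, a)] / h
        U[(s, a)] = 0.0   # clear the stored reward
        C[(s, a)] = 0     # clear the visit count
        return True
    return False
```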
S5: store the running state parameter s_{t+1} into the successful path PathList, increment the current time t by one, and execute step S3.
S6: obtain the action-value function, select an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter and select the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
To better introduce the path-planning process, this embodiment also discloses a flowchart of the single-robot path planning method based on the Q-Learning algorithm.
As shown in Figures 2 and 3, the single-robot path planning flow based on the Q-Learning algorithm comprises the following steps:
Step 1: initialize the action-value function Q(s, a), action set A, state set S, maximum iteration count n, maximum exploration step count m, minimum path count MinPathNum, maximum successful path-finding count MaxSuccessNum, exploration factor ε, single-update step size eSize of the exploration factor, exploration-factor change period eCycle, visit count C(s, a) of each state–action pair, update start time B(s, a), update completion time E(s, a), reward-function store U(s, a), maximum count threshold h, learned flag L(s, a) (whether the pair has been learned), successful path-finding count SuccessNum, successful path count PathNum, successful path PathList, successful-path storage table List, iteration count i and current time t.
Initially, Q(s, a) = 0, C(s, a) = 0, U(s, a) = 0, SuccessNum = 0, PathNum = 0, PathList = 0, List = 0, i = 1 and t = 1.
Step 2: judge whether i is greater than n; if so, end learning; if not, set t = 0, empty PathList, and then judge whether SuccessNum is greater than MaxSuccessNum; if SuccessNum is greater than MaxSuccessNum, update the value of ε; if SuccessNum is less than or equal to MaxSuccessNum, execute step 3.
When updating the value of the exploration factor ε: if PathNum is less than MinPathNum, use the formula ε + eSize × (MinPathNum − PathNum); if PathNum is greater than or equal to MinPathNum, use the formula ε − eSize × (i/eCycle). In these formulas, ε is the exploration factor, eSize is the single-update step size of the exploration factor, MinPathNum is the minimum path count, PathNum is the successful path count, i is the iteration count, and eCycle is the exploration-factor change period.
Step 3: initialize the state s, s ∈ S.
Step 4: judge whether t is greater than m; if so, increment i by one and return to step 2; if not, generate a random number σ ∈ (0, 1) and judge whether σ is greater than ε; if greater, select the action a_t to execute in state s_t according to the probabilistic model; if less than or equal, randomly select an action a_t, a_t ∈ A.
The formula for selecting the action a_t according to the probabilistic model is P(a_k | s) = Q(s, a_k) / Σ_j Q(s, a_j), where P(a_k | s) is the probability of selecting action command a_k under state parameter s, Q(s, a_k) is the Q value of action command a_k under state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all action commands under state parameter s.
Step 5: execute action a_t to obtain state s_{t+1} and reward r_{t+1}.
Step 6: judge whether state s_{t+1} is a terminal state; if so, judge again whether s_{t+1} is the target state; if s_{t+1} is the target state, perform the following operations: after incrementing SuccessNum by one, determine whether the current PathList is already in List; if it is not in List, add PathList to List and increment PathNum by one; then increment i by one and execute step 2; if s_{t+1} is not the target state, increment i by one and return to step 2. If s_{t+1} is not a terminal state, execute step 7.
Step 7: judge whether B(s, a) is less than or equal to t (i.e., whether the last update time of the action-value function Q(s, a) falls before this step); if so, set L(s, a) = true, that is, mark the pair as learned; if not, execute step 8.
Step 8: judge whether the value of L(s, a) is true; if so, first judge whether C(s, a) equals 0; if it equals 0, learning starts at this moment, so set B(s, a) = t; if it does not equal 0, do nothing; after this judgment on C(s, a), execute C(s, a) += 1 (the visit count is incremented once) and U(s, a) += r_{t+1} + λ·max Q(s_{t+1}, a) (the reward is stored); if not, execute step 9.
Step 9: judge whether C(s, a) equals h (whether the visit count has reached the maximum count threshold); if so, execute Q(s, a) = U(s, a)/h (take the average of the previous h reward values), U(s, a) = 0 (clear the stored reward) and C(s, a) = 0 (clear the visit count), and at the same time set the update time E(s, a) = i.
Step 10: judge whether the update completion time E(s, a) is greater than or equal to the update start time B(s, a); if so, set L(s, a) = true, U(s, a) = 0 and C(s, a) = 0; if not, execute step 11.
Step 11: put s_{t+1} into PathList and set s ← s_{t+1}, i.e., the current state becomes s_{t+1}; increment t by one and execute step 4.
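Condensing steps 1–11 into runnable form, the core loop looks roughly as follows. This is only a sketch: a one-dimensional corridor of states stands in for the grid map, the bookkeeping for B(s, a), E(s, a), L(s, a) and the path lists is omitted, and every numeric value is illustrative rather than taken from the patent:

```python
import random

def train(n_episodes=200, m=50, h=4, eps=0.4, lam=0.8, goal=4, seed=0):
    """Toy count-gated Q-learning on a corridor of states 0..goal with
    actions {0: left, 1: right}: each step costs -1, reaching the goal
    pays 10; rewards plus lam * max next-state Q accumulate in U until
    C reaches h, and only then does Q(s, a) become U(s, a) / h."""
    rng = random.Random(seed)
    Q, U, C = {}, {}, {}
    for _ in range(n_episodes):
        s = 0
        for _ in range(m):                      # at most m steps per round
            if rng.random() > eps:              # exploit: greedy action
                a = max((0, 1), key=lambda x: Q.get((s, x), 0.0))
            else:                               # explore: random action
                a = rng.randrange(2)
            s_next = max(0, min(goal, s + (1 if a == 1 else -1)))
            r = 10.0 if s_next == goal else -1.0
            key = (s, a)
            U[key] = U.get(key, 0.0) + r + lam * max(
                Q.get((s_next, 0), 0.0), Q.get((s_next, 1), 0.0))
            C[key] = C.get(key, 0) + 1
            if C[key] == h:                     # count-gated update
                Q[key] = U[key] / h
                U[key], C[key] = 0.0, 0
            s = s_next
            if s == goal:                       # terminal state ends round
                break
    return Q

Q = train()
```

After training, the learned Q values should favour moving right (toward the goal) over moving left at the start of the corridor.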
Embodiment two:
This embodiment discloses simulation experiments of single-robot path planning.
One, simulation experiment description
1) The simulation experiments run on the Windows 10 operating system with an Intel Core i5-8400 CPU and 16 GB of RAM. The path planning algorithm of the single-robot system is simulated in Python with the TensorFlow deep-learning toolkit, while the multi-robot path planning algorithm is written in the MATLAB language in the MATLAB R2016a simulation environment.
2) The environment is described here using the grid method: the working space of the robot system is divided into small grid cells, and each cell represents one state of the robot system. In the map, white cells indicate safe regions and black cells indicate obstacles.
The target state and the obstacles in the environment are both static, and the positions of obstacles and boundaries are unknown to the robot. In the subsequent experiments, the working space of the robot is a 10 × 10 or a 20 × 20 grid map.
During the simulation, the robot's route and initial state are marked with 1, and the target state is marked with 2.
3) The MDP four-tuple of the single-robot system is defined as follows:
Action set: each robot can take four actions: up, down, left and right. In the grid map this means the robot moves over blank cells and can move from one cell to its four adjacent cells (up, down, left, right); it cannot jump across cells or move diagonally (e.g., to the upper left).
The action space of the robot system is therefore A = {0, 1, 2, 3}, where 0 represents up, 1 represents down, 2 represents left, and 3 represents right.
State set: in the grid map, each cell represents one state, so the state space of the system is S = {1, 2, 3, …, 100} or S = {1, 2, 3, …, 400}. The grid state of the robot at any time can be expressed as s_t = (x_t, y_t).
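A minimal encoding of this convention (the boundary handling and the downward-growing y axis are assumptions; the text only rules out diagonal and multi-cell moves):

```python
# Action codes per the text: 0 = up, 1 = down, 2 = left, 3 = right,
# with y assumed to grow downward as in typical grid-map indexing.
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}

def next_cell(state, action, width=10, height=10):
    """Apply one 4-neighbour move; a move that would leave the grid
    keeps the robot in place (one common convention for walls)."""
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < width and 0 <= ny < height:
        return (nx, ny)
    return (x, y)
```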
The robot reaching a black cell (an obstacle) or the yellow cell (the target state) is a terminal state. Once the robot enters a terminal state, the current round of training ends, and the robot returns to the initial state for the next round of training.
Transition function: after the robot selects an action from the action set and executes it, if the next cell is not an obstacle or a boundary wall, the robot moves to that cell; this defines the transition function of the robot's movement.
Reward function: in the single-robot system, every step the robot moves yields an immediate reward of −1, representing the robot's movement cost and pushing it to reach the target state quickly; when the robot reaches the target state, i.e., the yellow cell, it obtains an immediate reward of 10; when the robot hits an obstacle, i.e., enters a black cell, it obtains an immediate reward of −10. The reward function of the single-robot system can therefore be defined as: −1 for an ordinary move, 10 for reaching the target state, and −10 for hitting an obstacle.
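Written out directly, the reward function from this description is (the cell labels are illustrative stand-ins for the map colours):

```python
def reward(cell):
    """Immediate reward per the text: the yellow goal cell pays 10,
    a black obstacle cell pays -10, and any ordinary move costs -1."""
    if cell == "goal":
        return 10
    if cell == "obstacle":
        return -10
    return -1
```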
Two, parameter initialization
In this embodiment, the parameters are set as follows. 1) Learning rate α, valued between 0 and 1 (if the learning rate is too small, convergence is slow; if it is too large, the algorithm may fail to converge to the optimal value; here α is initialized to 0.01). 2) Discount factor γ, valued between 0 and 1 (it determines whether the robot values immediate rewards or long-term rewards more: if γ approaches 0, the robot's immediate rewards matter more; if γ approaches 1, the robot values long-term rewards more; in the single-robot path-planning simulations here, γ is set to 0.8). 3) Maximum exploration step count m: in this embodiment each training round allows at most 200 exploration steps (if the robot has not reached the target state within 200 steps, the strategy taken in this round is inappropriate and there is no need to continue training it; the round is terminated and the next round begins directly). 4) Exploration factor ε: the initial value of ε is set to 0.4, the minimum path count MinPathNum is set to 2, the maximum successful path-finding count MaxSuccessNum is set to 10, and the single-update step size eSize of the exploration factor is set according to the complexity of the environment: in a relatively simple environment the step size can be larger, and in a complex environment it should be smaller.
Experiment one:
Experiment one uses a 10 × 10 grid map with randomly placed obstacles; the initial state of the robot is (0, 0) and the target state is (7, 7). Fig. 4 shows the planned path obtained from the simulation experiment with the traditional Q-Learning algorithm; Fig. 5 shows the planned path obtained with the improved Q-Learning algorithm.
1) As shown in Fig. 4 and Fig. 5, where grey cells indicate the travel path: the traditional path (the existing path-planning method) clearly has more turning points, while the improved path (the path-planning method of the present invention) is smoother, which shows that the solution obtained with the improved Q-Learning algorithm is better than that obtained with the traditional one.
2) From Fig. 6 and Fig. 7: the traditional method (the existing path-planning method) first finds a collision-free path to the target state at about 700 seconds, whereas the improved method (the path-planning method of the present invention) first finds a collision-free path to the target state at about 300 seconds. Thus in the early training phase the traditional Q-Learning algorithm essentially cannot find a path to the target state, while the improved algorithm finds a path to the target state much faster. It can also be seen that, as training time increases, the probability of the robot system successfully finding a path grows gradually in both cases, but the rate and the number of successes rise considerably faster after the improvement. The traditional algorithm does not begin to converge until about 900 seconds, while the improved algorithm converges at about 500 seconds.
These two points show that, compared with the traditional algorithm, the improved Q-Learning algorithm significantly increases efficiency.
Experiment two:
Experiment two uses a 20 × 20 grid map with randomly placed obstacles; the initial state of the robot is (0, 3) and the target state is (15, 15). Compared with the environment model of experiment one, the environment model of experiment two is much more complex: not only are there more randomly placed obstacles, but there are also many concave regions, which increase the difficulty of robot path planning.
1) As shown in Fig. 8, where grey cells indicate the travel path: in this grid map, the single-robot path planning algorithm based on the improved Q-Learning algorithm (the path-planning method of the present invention) successfully obtains a planned path that reaches the target state.
2) The environment model of experiment two is complex and contains concave regions. As shown in Fig. 9 and Fig. 10: when the traditional Q-Learning algorithm (the existing path-planning method) is used for path planning, it still fails to find a path to the target state after 1000 training episodes. Observation of its training process shows that the traditional Q-Learning algorithm repeatedly falls into the concave regions, so it cannot learn successfully. The improved Q-Learning algorithm (the path-planning method of the present invention) remains feasible in this complex environment: at around step 500 it successfully finds a collision-free path to the target state and then gradually converges. Moreover, regarding training time, although the traditional Q-Learning algorithm does not converge, after the same 1000 training episodes the improved algorithm takes less time, which shows that its updates are faster.
Taken together, the two comparison results show that the improved Q-Learning algorithm (the path-planning method of the present invention) is still feasible in complex environments and updates faster, which gives it practical application value.
What has been described above is only an embodiment of the present invention; common knowledge such as well-known specific structures and characteristics is not described in excessive detail here. A person of ordinary skill in the art knows all the common technical knowledge in the technical field to which the invention belongs before the application date or priority date, can access all the prior art in this field, and has the ability to apply routine experimental means before that date. Under the enlightenment provided by this application, a person skilled in the art can, in combination with their own abilities, improve and implement this scheme, and some typical known structures or known methods should not become obstacles to their implementing this application. It should be pointed out that, for those skilled in the art, several modifications and improvements can be made without departing from the structure of the invention; these should also be regarded as falling within the protection scope of the present invention, and they will not affect the effect of implementing the invention or the practicability of the patent. The scope of protection of this application shall be subject to the content of the claims, and the specific embodiments and other records in the description may be used to interpret the content of the claims.
Claims (8)
1. A single-robot path planning method based on the Q-Learning algorithm, characterized by comprising the following steps:
S1: initialize the exploration factor, maximum iteration count, terminal state parameter, target state parameter, maximum count threshold, update start time, iteration count, current time, action-value function, visit count of each state–action pair, successful path and successful-path storage table of the single-robot system;
S2: judge whether the iteration count is greater than the maximum iteration count; if so, execute step S6; if not, initialize the current state parameter and proceed to the next step;
S3: generate a random number, compare the random number with the exploration factor and then select an action command, and compute from the action command the running state parameter and reward function obtained after the robot executes the action command; then judge whether the running state parameter equals the terminal state parameter; if so, continue to judge whether the running state parameter equals the target state parameter; if equal, store the successful path into the successful-path storage table, increment the iteration count by one, and return to step S2; if not equal, increment the iteration count by one and return to step S2; if the running state parameter does not equal the terminal state parameter, proceed to the next step;
S4: judge whether the update start time is less than or equal to the current time; if so, store the reward function, increment the visit count of the state–action pair by one, and proceed to the next step; if not, judge whether the visit count of the state–action pair equals the maximum count threshold; if so, update the action-value function and proceed to the next step; if not, proceed to the next step;
S5: store the running state parameter into the successful path, increment the current time by one, and return to step S3;
S6: obtain the action-value function, select an action command from the action-value function according to the preset initial state parameter, and repeat: execute the action command to generate a state parameter and select the next action command according to that state parameter; when the generated state parameter equals the preset target state parameter, the optimal path of the single-robot system is obtained.
2. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S4, the formula for updating the action-value function is Q(s, a) = U(s, a)/h, where Q(s, a) is the action-value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
3. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S3, if the running state parameter equals the terminal state parameter and also equals the target state parameter, the preset successful path-finding count is incremented by one before step S2 is executed.
4. The single-robot path planning method based on the Q-Learning algorithm according to claim 3, characterized in that: in step S2, if the iteration count is less than the maximum iteration count, first judge whether the successful path-finding count is greater than the prestored maximum successful path-finding count; if so, update the value of the exploration factor and proceed to the next step; if not, proceed directly to the next step.
5. The single-robot path planning method based on the Q-Learning algorithm according to claim 4, characterized in that: in step S3, if the running state parameter equals the terminal state parameter and also equals the target state parameter, the successful path count is incremented by one before step S2 is executed.
6. The single-robot path planning method based on the Q-Learning algorithm according to claim 5, characterized in that: in step S2, when updating the value of the exploration factor, first judge whether the successful path count is less than the prestored minimum path count; if so, execute ε' = ε + eSize × (MinPathNum − PathNum), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then set ε = ε'; in the formula, eSize is the prestored single-update step size of the exploration factor, MinPathNum is the minimum path count, and PathNum is the successful path count; if not, execute ε' = ε − eSize × (i/eCycle), where ε is the exploration factor before the update and ε' is the updated exploration factor, and then set ε = ε'; in the formula, eSize is the prestored single-update step size of the exploration factor, i is the iteration count, and eCycle is the prestored exploration-factor change period.
7. The single-robot path planning method based on the Q-Learning algorithm according to claim 1, characterized in that: in step S3, when comparing the random number with the exploration factor, if the random number is greater than the exploration factor, an action command is selected according to the prestored probabilistic model; if the random number is less than or equal to the exploration factor, an action command is randomly selected from the prestored action set.
8. The single-robot path planning method based on the Q-Learning algorithm according to claim 7, characterized in that: in step S3, the formula by which the probabilistic model selects an action command is P(a_k | s) = Q(s, a_k) / Σ_j Q(s, a_j), where P(a_k | s) is the probability of selecting action command a_k under state parameter s, Q(s, a_k) is the Q value of action command a_k under state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all action commands under state parameter s.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910737476.6A CN110378439B (en) | 2019-08-09 | 2019-08-09 | Single robot path planning method based on Q-Learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110378439A true CN110378439A (en) | 2019-10-25 |
CN110378439B CN110378439B (en) | 2021-03-30 |
Family
ID=68258789
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910737476.6A Active CN110378439B (en) | 2019-08-09 | 2019-08-09 | Single robot path planning method based on Q-Learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110378439B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111080013A (en) * | 2019-12-18 | 2020-04-28 | 南方科技大学 | Addressing way-finding prediction method, device, equipment and computer readable storage medium |
CN111594322A (en) * | 2020-06-05 | 2020-08-28 | 沈阳航空航天大学 | Variable-cycle aero-engine thrust control method based on Q-Learning |
CN111649758A (en) * | 2020-06-16 | 2020-09-11 | 华东师范大学 | Path planning method based on reinforcement learning algorithm in dynamic environment |
CN111859099A (en) * | 2019-12-05 | 2020-10-30 | 马上消费金融股份有限公司 | Recommendation method, device, terminal and storage medium based on reinforcement learning |
CN111857081A (en) * | 2020-08-10 | 2020-10-30 | 电子科技大学 | Chip packaging test production line performance control method based on Q-learning reinforcement learning |
CN112015174A (en) * | 2020-07-10 | 2020-12-01 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN112327890A (en) * | 2020-11-10 | 2021-02-05 | 中国海洋大学 | Underwater multi-robot path planning based on WHCA algorithm |
CN112595326A (en) * | 2020-12-25 | 2021-04-02 | 湖北汽车工业学院 | Improved Q-learning path planning algorithm with fusion of priori knowledge |
CN113062601A (en) * | 2021-03-17 | 2021-07-02 | 同济大学 | Q learning-based concrete distributing robot trajectory planning method |
CN113111296A (en) * | 2019-12-24 | 2021-07-13 | 浙江吉利汽车研究院有限公司 | Vehicle path planning method and device, electronic equipment and storage medium |
CN113534826A (en) * | 2020-04-15 | 2021-10-22 | 苏州宝时得电动工具有限公司 | Attitude control method and device for self-moving equipment and storage medium |
CN114518758A (en) * | 2022-02-08 | 2022-05-20 | 中建八局第三建设有限公司 | Q learning-based indoor measuring robot multi-target-point moving path planning method |
CN117634548A (en) * | 2024-01-26 | 2024-03-01 | 西南科技大学 | Unmanned aerial vehicle behavior tree adjustment and optimization method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819264A (en) * | 2012-07-30 | 2012-12-12 | 山东大学 | Path planning Q-learning initial method of mobile robot |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN107317756A (en) * | 2017-07-10 | 2017-11-03 | 北京理工大学 | A kind of optimal attack paths planning method learnt based on Q |
CN108594803A (en) * | 2018-03-06 | 2018-09-28 | 吉林大学 | Paths planning method based on Q- learning algorithms |
CN108762249A (en) * | 2018-04-26 | 2018-11-06 | 常熟理工学院 | Clean robot optimum path planning method based on the optimization of approximate model multistep |
US20180354126A1 (en) * | 2017-06-07 | 2018-12-13 | Fanuc Corporation | Controller and machine learning device |
CN109445440A (en) * | 2018-12-13 | 2019-03-08 | 重庆邮电大学 | The dynamic obstacle avoidance method with improvement Q learning algorithm is merged based on sensor |
CN109933086A (en) * | 2019-03-14 | 2019-06-25 | 天津大学 | Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study |
-
2019
- 2019-08-09 CN CN201910737476.6A patent/CN110378439B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819264A (en) * | 2012-07-30 | 2012-12-12 | 山东大学 | Path planning Q-learning initial method of mobile robot |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
US20180354126A1 (en) * | 2017-06-07 | 2018-12-13 | Fanuc Corporation | Controller and machine learning device |
CN107317756A (en) * | 2017-07-10 | 2017-11-03 | 北京理工大学 | A kind of optimal attack paths planning method learnt based on Q |
CN108594803A (en) * | 2018-03-06 | 2018-09-28 | 吉林大学 | Paths planning method based on Q- learning algorithms |
CN108762249A (en) * | 2018-04-26 | 2018-11-06 | 常熟理工学院 | Clean robot optimum path planning method based on the optimization of approximate model multistep |
CN109445440A (en) * | 2018-12-13 | 2019-03-08 | 重庆邮电大学 | The dynamic obstacle avoidance method with improvement Q learning algorithm is merged based on sensor |
CN109933086A (en) * | 2019-03-14 | 2019-06-25 | 天津大学 | Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study |
Non-Patent Citations (3)
Title |
---|
AMIT KONAR ET AL: "A Deterministic Improved Q-Learning for Path Planning of a Mobile Robot", 《IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS》 * |
JING PENG AND RONALD J. WILLIAMS: "Incremental Multi-Step Q-Learning", 《MACHINE LEARNING》 * |
高乐等: "改进Q-Learning算法在路径规划中的应用", 《吉林大学学报( 信息科学版)》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859099B (en) * | 2019-12-05 | 2021-08-31 | 马上消费金融股份有限公司 | Recommendation method, device, terminal and storage medium based on reinforcement learning |
CN111859099A (en) * | 2019-12-05 | 2020-10-30 | 马上消费金融股份有限公司 | Recommendation method, device, terminal and storage medium based on reinforcement learning |
CN111080013A (en) * | 2019-12-18 | 2020-04-28 | 南方科技大学 | Addressing way-finding prediction method, device, equipment and computer readable storage medium |
CN113111296A (en) * | 2019-12-24 | 2021-07-13 | 浙江吉利汽车研究院有限公司 | Vehicle path planning method and device, electronic equipment and storage medium |
CN113534826B (en) * | 2020-04-15 | 2024-02-23 | 苏州宝时得电动工具有限公司 | Attitude control method and device of self-mobile device and storage medium |
CN113534826A (en) * | 2020-04-15 | 2021-10-22 | 苏州宝时得电动工具有限公司 | Attitude control method and device for self-moving equipment and storage medium |
CN111594322A (en) * | 2020-06-05 | 2020-08-28 | 沈阳航空航天大学 | Variable-cycle aero-engine thrust control method based on Q-Learning |
CN111594322B (en) * | 2020-06-05 | 2022-06-03 | 沈阳航空航天大学 | Variable-cycle aero-engine thrust control method based on Q-Learning |
CN111649758A (en) * | 2020-06-16 | 2020-09-11 | 华东师范大学 | Path planning method based on reinforcement learning algorithm in dynamic environment |
CN111649758B (en) * | 2020-06-16 | 2023-09-15 | 华东师范大学 | Path planning method based on reinforcement learning algorithm in dynamic environment |
CN112015174A (en) * | 2020-07-10 | 2020-12-01 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN112015174B (en) * | 2020-07-10 | 2022-06-28 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN111857081A (en) * | 2020-08-10 | 2020-10-30 | 电子科技大学 | Chip packaging test production line performance control method based on Q-learning reinforcement learning |
CN112327890A (en) * | 2020-11-10 | 2021-02-05 | 中国海洋大学 | Underwater multi-robot path planning based on WHCA algorithm |
CN112595326A (en) * | 2020-12-25 | 2021-04-02 | 湖北汽车工业学院 | Improved Q-learning path planning algorithm fusing prior knowledge |
CN113062601A (en) * | 2021-03-17 | 2021-07-02 | 同济大学 | Q learning-based concrete distributing robot trajectory planning method |
CN113062601B (en) * | 2021-03-17 | 2022-05-13 | 同济大学 | Q learning-based concrete distributing robot trajectory planning method |
CN114518758A (en) * | 2022-02-08 | 2022-05-20 | 中建八局第三建设有限公司 | Q learning-based indoor measuring robot multi-target-point moving path planning method |
CN114518758B (en) * | 2022-02-08 | 2023-12-12 | 中建八局第三建设有限公司 | Indoor measurement robot multi-target point moving path planning method based on Q learning |
CN117634548A (en) * | 2024-01-26 | 2024-03-01 | 西南科技大学 | Unmanned aerial vehicle behavior tree adjustment and optimization method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110378439B (en) | 2021-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110378439A (en) | Single robot path planning method based on Q-Learning algorithm | |
CN106096729B (en) | Deep strategy learning method for complex tasks in large-scale environments |
CN108762281A (en) | Embedded real-time underwater intelligent robot decision-making method based on memory-association reinforcement learning |
CN105527964B (en) | Robot path planning method |
Carmel et al. | Model-based learning of interaction strategies in multi-agent systems | |
CN109214498A (en) | Ant colony algorithm optimization method based on search concentration and dynamic pheromone updating |
CN108776483A (en) | AGV path planning method and system based on ant colony algorithm and multi-agent Q-learning |
CN109241291A (en) | Knowledge graph optimal path query system and method based on deep reinforcement learning |
CN105911992A (en) | Automatic path planning method for a mobile robot, and mobile robot |
CN107253195B (en) | Adaptive hybrid learning mapping intelligent control method and system for handling-robot arm manipulation |
CN104571113A (en) | Route planning method for mobile robot | |
Kamoshida et al. | Acquisition of automated guided vehicle route planning policy using deep reinforcement learning | |
CN111695690A (en) | Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning | |
CN113296520A (en) | Path planning method for inspection robots fusing A* and an improved grey wolf algorithm |
CN109726676A (en) | Planning method for automated driving systems |
CN115129064A (en) | Path planning method based on fusion of improved firefly algorithm and dynamic window method | |
CN117103282A (en) | Double-arm robot cooperative motion control method based on MATD3 algorithm | |
CN105867427B (en) | Online path-finding control method for robots in dynamic environments |
Zhao et al. | Multi-objective reinforcement learning algorithm for MOSDMP in unknown environment | |
Kalyanakrishnan et al. | Learning complementary multiagent behaviors: A case study | |
Tong et al. | Enhancing rolling horizon evolution with policy and value networks | |
CN116592890B (en) | Picking robot path planning method, picking robot path planning system, electronic equipment and medium | |
Li et al. | Path planning of mobile robot based on dynamic chaotic ant colony optimization algorithm | |
McGovern et al. | Hierarchical optimal control of MDPs | |
Chen et al. | An improved bacterial foraging optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||