CN109059931B - A path planning method based on multi-agent reinforcement learning - Google Patents

A path planning method based on multi-agent reinforcement learning

Info

Publication number
CN109059931B
CN109059931B
Authority
CN
China
Prior art keywords
state
global
action
value
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811032979.5A
Other languages
Chinese (zh)
Other versions
CN109059931A (en)
Inventor
曹先彬
杜文博
李碧月
李宇萌
刘瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201811032979.5A priority Critical patent/CN109059931B/en
Publication of CN109059931A publication Critical patent/CN109059931A/en
Application granted granted Critical
Publication of CN109059931B publication Critical patent/CN109059931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a path planning method based on multi-agent reinforcement learning, belonging to the technical field of aircraft. A global state partition model of the aerial flight environment is first established and the global state transition table Q-Table1 is initialized; the global state of some row is randomly selected as the initial state s1. Among all columns of the current state s1, a column is selected with the ε-greedy algorithm and denoted action a1. Using the selected action a1, the next state s1' of the current state s1 is obtained from the global state transition table Q-Table1. The transition rule of the Q-Learning algorithm is used to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1. After updating s1 ← s1', the method enters the inner loop and uses the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1. The iteration counter of the outer loop is incremented by 1 until it reaches N1, which completes the global path planning of the aircraft in the air. The invention enables the aircraft to adapt to the requirements of different environments, improves the survival rate and task completion rate of the aircraft, and improves the convergence speed of reinforcement learning.

Description

A path planning method based on multi-agent reinforcement learning
Technical field
The invention belongs to the technical field of aircraft and relates to a path planning method based on multi-agent reinforcement learning.
Background art
With the continuous development and improvement of air traffic control technology, fast and accurate path planning for aircraft is an important means of guaranteeing flight safety and effectively improving air traffic efficiency in a complex airspace environment. Path planning usually means searching, in a given planning space and under a certain evaluation criterion, for an optimal or feasible trajectory from a start point to a target point that satisfies the constraints and certain performance indices, so that the moving object completes its predetermined task safely. Route selection for aircraft in the air has always been a focus of research.
The more common flight path selection methods mostly plan the path before the aircraft flies: an optimal path is selected, according to a specific criterion, from several flight paths predetermined between the departure point and the destination. However, this approach requires information about the whole network and has a large computational load. In addition, during actual flight the motion performance of the aircraft and the task environment are complex, and trajectory planning must comprehensively consider the aircraft's maneuverability, the task time, the terrain environment, hostile control regions and other factors, so it is very necessary to adjust the flight path flexibly according to the actual airspace conditions.
Researchers at home and abroad have done a great deal of work on path planning, mainly including traditional planning methods and intelligent planning methods. Traditional planning methods include dynamic programming algorithms, optimal control algorithms and derivative-related methods; such methods require large-scale iteration, lack intelligent guidance, generally have a large computational load and a long computing time, are not suitable for free flight in low-altitude airspace, and have poor robustness. Intelligent planning algorithms include search algorithms and swarm intelligence algorithms, such as the A* algorithm, Dijkstra's algorithm, the ant colony algorithm and the particle swarm algorithm; as the problem scale grows, their computational complexity increases dramatically.
In general, the aerial flight environment of an aircraft changes dynamically, and accurate global information is difficult to obtain in advance. To meet increasingly diversified flight demands, it is necessary to develop a method that quickly and dynamically selects the flight path while the aircraft is en route, so that a flight decision support system or a UAV flight management system has a certain real-time trajectory planning capability, thereby improving the throughput of the air traffic network and the utilization of airspace resources.
Reinforcement learning, one of the methods of machine learning and also known as reinforcement or enhancement learning, originates from the conditioned reflex theory in biology. Its basic idea is to reward desired results and punish undesired results, gradually forming a conditioned reflex oriented toward the desired results. As shown in Figure 1, when an agent performs a task, it first interacts with the surrounding environment through an action A; under the effect of the action A and the environment, the agent enters a new state S, and the environment gives an immediate reward R. This cycle continues, and the agent constantly interacts with the environment to generate a large amount of data. The reinforcement learning algorithm uses the generated data to modify its own action policy, interacts with the environment again, generates new data, and further improves its behavior; after several iterations of learning, the agent learns the optimal actions for completing the task, that is, the optimal policy.
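This interaction loop can be illustrated with a minimal Python sketch, assuming a toy chain environment; the names (ToyEnv, run_episode) and the reward of 100 at the goal are illustrative assumptions, not part of the method described below.

import random

# Minimal sketch of the interaction loop of Figure 1 on a toy chain environment.
class ToyEnv:
    """States 0..4; action +1/-1 moves along the chain; state 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 100 if self.state == 4 else 0      # immediate reward R from the environment
        return self.state, reward, self.state == 4

def run_episode(env, q, epsilon=0.5, gamma=0.8):
    s = env.reset()
    done = False
    while not done:
        # action A: explore with probability ε, otherwise exploit the best known action
        a = random.choice([-1, 1]) if random.random() < epsilon else max(q[s], key=q[s].get)
        s_next, r, done = env.step(a)               # new state S and reward R
        q[s][a] = r + gamma * max(q[s_next].values())   # improve the policy from the data (S, A, R, S')
        s = s_next

q = {s: {-1: 0.0, 1: 0.0} for s in range(5)}
for _ in range(50):
    run_episode(ToyEnv(), q)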
Reinforcement learning has long been applied to path planning problems, but because reinforcement learning generally adopts a state-action pair representation, the planning region is traditionally divided into a grid world and the states are defined by grid labels. For reinforcement learning, the number of states directly affects the learning speed; as the state space is expanded and subdivided, reinforcement learning runs into the "curse of dimensionality".
Summary of the invention
To address the "curse of dimensionality" problem in reinforcement learning, the present invention provides a path planning method based on multi-agent reinforcement learning. Based on the most basic reinforcement learning algorithm, Q-Learning, the state and action spaces are divided and abstracted, and two agents with complementary functions are used: a global agent and a local agent explore and exploit the flight environment and correspond to global path planning and local path planning, respectively. This effectively reduces the number of states and optimizes the actions of the aircraft in the air, so as to solve the flight path planning problem and avoid obstacles and threat sources.
The specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1.
The model includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
According to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
The global state transition table Q-Table1 is initialized as a zero matrix;
The global reward matrix R1 is established according to the structure of the state transition diagram: its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value.
A value of -1 indicates that the action between the global state of the row and the global state of the column cannot be executed, i.e. it is not a selectable action; a value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row as the initial state s1, and set the maximum number of iterations N1 of the outer loop.
The initial state s1 is the start point, a threat source or the target point.
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4.
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1.
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1.
The column index of the action a1 is the row index of the next state s1'.
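Steps 4 and 5 can be sketched in Python under the conventions above (-1 marks a non-selectable column, and the chosen column index is the row index of the next state); the table sizes and names are illustrative assumptions.

import numpy as np

# Sketch of Steps 4-5: ε-greedy selection among the executable columns of the current state.
def select_action(q_row, r_row, epsilon=0.2):
    valid = np.where(r_row != -1)[0]              # executable actions for the current state
    assert valid.size > 0, "state has no selectable action"
    if np.random.rand() < epsilon:                # explore
        return int(np.random.choice(valid))
    return int(valid[np.argmax(q_row[valid])])    # exploit: best known valid action

# Example with hypothetical 6-state tables (start, threat sources, target):
R1 = np.full((6, 6), -1.0)
Q1 = np.zeros((6, 6))
R1[0, 1] = R1[1, 0] = 0.0                         # one executable transition with reward 0
s1 = 0
a1 = select_action(Q1[s1], R1[s1])
s1_next = a1                                      # next state = column index of the chosen action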
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
The specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1'.
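A short Python sketch of this update, reusing the conventions just described; Q1, R1 and gamma1 are assumed to be the tables and discount factor defined above, and the names are illustrative.

import numpy as np

# Sketch of the Step-6 rule  Q(s1,a1) = R1(s1,a1) + γ1 · max_{a1'} Q(s1',a1').
def q_update(Q1, R1, s1, a1, gamma1=0.8):
    s1_next = a1                                   # the chosen column is the next state's row index
    valid = np.where(R1[s1_next] != -1)[0]         # selectable actions in the next state
    best_next = Q1[s1_next, valid].max() if valid.size else 0.0
    Q1[s1, a1] = R1[s1, a1] + gamma1 * best_next
    return s1_next                                 # Step 7 then sets s1 = s1_next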
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1.
The specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; the threat source and its surrounding region are divided into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time.
The rewards of the elements of the local reward matrix R2 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100.
The start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
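The reward assignment of Step 801 can be sketched as follows for a hypothetical 10×10 local grid; the grid size, threat circle and start/end cells are illustrative assumptions, and the per-cell values shown are the rewards from which the state-action matrix R2 would be filled.

import numpy as np

# Sketch of the Step-801 reward rule on an assumed 10x10 local grid.
def build_local_rewards(n=10, center=(5, 5), radius=2.5, start=(0, 0), end=(9, 9)):
    rewards = np.zeros((n, n))
    ys, xs = np.mgrid[0:n, 0:n]
    inside = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    rewards[inside] = -1.0        # cells inside the threat source and on its boundary
    rewards[start] = 100.0        # grid nearest the global start point
    rewards[end] = 100.0          # grid nearest the global target point
    return rewards

R2_grid = build_local_rewards()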
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop.
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2.
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2.
The column index of the action a2 is the row index of the next state s2'.
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2'.
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1.
According to the element values of the grids, starting from the start point, the largest element value among all adjacent grids is found, and the grid holding that largest value becomes the first grid of the path; the search then continues from that element over all adjacent grids that have not yet been visited, and the grid holding the largest value becomes the second grid of the path; and so on, until the last grid reached is the end grid. All transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air.
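A sketch of this read-out in Python, assuming `values` holds one learned value per grid cell (for example the row maxima of Q-Table2); all names are illustrative.

import numpy as np

# Sketch of the path read-out after Step 806: greedily follow the unvisited
# neighbouring cell with the largest learned value until the end cell is reached.
def extract_path(values, start, end):
    n, m = values.shape
    path, visited, cell = [start], {start}, start
    while cell != end:
        y, x = cell
        neighbours = [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= y + dy < n and 0 <= x + dx < m and (y + dy, x + dx) not in visited]
        if not neighbours:
            break                                         # no unvisited neighbour: stop
        cell = max(neighbours, key=lambda c: values[c])   # adjacent grid with the largest value
        path.append(cell)
        visited.add(cell)
    return path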
Step 9: the iteration counter of the outer loop is incremented by 1, and whether the number of iterations has reached N1 is judged; if so, the global path planning of the aircraft in the air is completed; otherwise, the initial state s1 is randomly selected again and the method returns to Step 3.
The present invention has the following advantages:
1) The path planning method based on multi-agent reinforcement learning divides and abstracts the global and local state and action spaces, which effectively resolves the "curse of dimensionality" of reinforcement learning in complex path planning problems, enables the aircraft to plan paths autonomously in different environments, adapts to the requirements of different environments, and improves the survival rate and task completion rate of the aircraft;
2) The path planning method based on multi-agent reinforcement learning uses a global agent and a local agent to perform global and local path planning respectively, which effectively reduces the time the aircraft spends exploring and learning the environment, improves the convergence speed of reinforcement learning, and completes path planning of the aircraft in an unknown environment as early as possible.
Brief description of the drawings
Fig. 1 is a schematic diagram of the reinforcement learning used by the present invention in the prior art;
Fig. 2 is a schematic diagram of the global state partition provided in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the local state partition provided in an embodiment of the present invention;
Fig. 4 is the graph structure of the global state transitions in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the global reward matrix in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the initialized global state table in an embodiment of the present invention;
Fig. 7 is a schematic diagram of a reference planned path in an embodiment of the present invention;
Fig. 8 is the flow chart of the path planning method based on multi-agent reinforcement learning of the present invention;
Fig. 9 is the flow chart of local path planning in the path planning method based on multi-agent reinforcement learning of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the drawings and embodiments.
The present invention proposes a path planning method based on multi-agent reinforcement learning, whose concern is how to search out a safe path from a certain point to the target point. First, two kinds of agents with different but complementary functions are defined for global path planning and local path planning; to address the dimension explosion problem in reinforcement learning, the flight region is partitioned into global and local states, where the global states include the start point, the end point and the different threat sources, and the local states are the specific positions the aircraft reaches within a certain global state. The reward matrix of the global agent is defined and the global state transition table is initialized. Then, according to the global state transition table, the initial state of the aircraft is randomly initialized among all global states; actions are selected according to the ε-greedy strategy to obtain the transition pattern of the aircraft between different states, and the global state transition table is learned and updated by combining the global reward matrix with the transition strategy of the Q-Learning algorithm. During this process, after the global agent enters a new state, the local agent is started to carry out local path planning, and the Q-Learning algorithm is likewise used to learn the local state transition table, until the number of learning iterations reaches the preset maximum. Finally, the optimal planned path of the aircraft is determined using the finally obtained global state matrix Q-Table1 and local state matrix Q-Table2.
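The two-level control flow just outlined can be summarised by the following self-contained Python sketch; the global tables follow the conventions of the steps below, `learn_local` stands in for the inner-loop local planner of Steps 801-806, and all names are illustrative assumptions.

import numpy as np

# Compact sketch of the outer (global) loop with a hook for the inner (local) loop.
def plan_global(R1, target, N1, gamma1=0.8, epsilon=0.2, learn_local=lambda s: None):
    Q1 = np.zeros_like(R1, dtype=float)
    for _ in range(N1):                                   # outer loop
        s1 = np.random.randint(R1.shape[0])               # random initial global state
        while s1 != target:
            valid = np.where(R1[s1] != -1)[0]             # executable actions of s1
            if valid.size == 0:
                break
            if np.random.rand() < epsilon:
                a1 = int(np.random.choice(valid))         # explore
            else:
                a1 = int(valid[np.argmax(Q1[s1, valid])]) # exploit
            s1_next = a1                                  # column index = next state's row index
            nxt_valid = np.where(R1[s1_next] != -1)[0]
            best_next = Q1[s1_next, nxt_valid].max() if nxt_valid.size else 0.0
            Q1[s1, a1] = R1[s1, a1] + gamma1 * best_next  # Q-Learning transition rule
            s1 = s1_next
            if s1 != target:
                learn_local(s1)                           # enter the inner loop (local agent)
    return Q1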
The present invention divides the work among the agents in reinforcement learning, so that different agents have different abilities and the cooperation between the agents accomplishes the path planning task together. This not only reduces the state and action space of the agent in the traditional Q-Learning algorithm and accelerates learning and training, but also guarantees the flexibility and accuracy of path planning.
The overall flow chart of the method is shown in Figure 8, and the specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1.
The model includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area, together with their positions; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
The global agent takes the threat sources as the reference for dividing the states: the region where each threat source is located, the departure point of the aircraft and the planned target point constitute the state space. If the number of threat sources is n, the total number of states in the state space is n+2, and the number of selectable actions in each state is n+1. An action here does not mean a specific manoeuvre, but specifies the state that can be transferred to in the next step. Fig. 2 is a schematic diagram of the mission planning of a simple aircraft entering an enemy position, with three threat sources; the start point corresponds to state 1 and the target point corresponds to state 5. Regions in which the aircraft may be endangered, such as surface-to-air missiles, ground radar, valley terrain and severe weather, must be avoided when planning the flight path. A circle represents a threat source, the radius of the circle is the threat radius of the threat source, and the dashed box indicates the state region of the threat source; for example, threat source 3 corresponds to state 4. If action 3 is selected at the start point 1, this means that a transfer to state 3 is attempted in the next step; if the action can be executed, the aircraft transfers to the region of state 3.
According to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
There are two important terms in the Q-Learning reinforcement learning algorithm: "state" and "action". Here each "state" corresponds to a node, and each "action" corresponds to an arrow, i.e. a transition between states. If a transition is blocked by a threat source, the state cannot be transferred; in Fig. 2, state 1 cannot be transferred to state 4. Accordingly, the graph structure representing the global state transitions is established, as shown in Fig. 4. The state numbered 5 is set as the target point, and a reward is associated with each action (i.e. each connecting edge): the reward of an action that transfers directly to the target state is 100, and the reward of the other actions is 0. According to this state transition graph structure, with "state" as the rows and "action" as the columns, the reward matrix R1 is constructed, as shown in Fig. 5; its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value.
A value of -1 represents a null value, meaning that there is no edge between the corresponding nodes; that is, the action between the global state of the row and the global state of the column cannot be executed and is not a selectable action. A value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
Similarly, the state transition table Q-Table1 of the global agent is constructed to represent the knowledge the agent has acquired from experience; the transition table Q-Table1 has the same order as the reward matrix R1. Since the agent knows nothing about the external environment at the beginning, the global state transition table Q-Table1 is initialized as a zero matrix, as shown in Fig. 6. When the number of states is unknown, the agent can start from a single element and add the corresponding row and column to the state table whenever a new state is discovered.
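For the five-state example of Fig. 2 and Figs. 4-6, R1 and Q-Table1 can be sketched as follows. The edge list is an assumed example, because the exact connectivity of Fig. 4 is not reproduced in the text; only the conventions (-1 for a missing edge, a reward of 100 for transferring into target state 5, 0 otherwise) follow the description above.

import numpy as np

# Sketch of R1 and Q-Table1 for an assumed five-state example:
# state 1 = start point, states 2-4 = threat-source regions, state 5 = target point.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]   # assumed, undirected

R1 = np.full((5, 5), -1.0)               # -1: no edge, action not selectable
for u, v in edges:
    R1[u - 1, v - 1] = 100.0 if v == 5 else 0.0
    R1[v - 1, u - 1] = 100.0 if u == 5 else 0.0

Q1 = np.zeros_like(R1)                   # Q-Table1 starts as a zero matrix (Fig. 6)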
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row among all global states as the initial state s1; enter the outer loop and set the maximum number of iterations N1 of the outer loop.
The initial state s1 is the start point, a threat source or the target point.
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4.
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1.
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1.
The column index of the action a1 is the row index of the next state s1'.
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
The specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1'.
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1.
The states and the available actions faced by the global agent and the local agent are also different. The local agent comes into play after the global agent has determined the state to transfer to in the next step. In order to plan a path that avoids the threat source (in general a state region contains one threat source), and since the local agent is located inside the state region, the grid method can be used to divide the states and actions, i.e. one grid corresponds to one state, as shown in Fig. 3. If the aircraft enters a grid occupied by the threat source, the agent receives a negative penalty; otherwise the agent receives a positive reward. In each state grid there are four selectable actions: up, down, left and right. Here the ε-greedy algorithm is chosen as the action generation strategy, so as to achieve the balance between "exploration" and "exploitation" in reinforcement learning.
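A sketch of this four-action grid world follows, with ε-greedy selection restricted to moves that stay on the grid; the grid size and the dictionary-based value store q are illustrative assumptions.

import random

# Sketch of the local grid of Fig. 3: each cell is one local state, and the
# actions are up, down, left and right.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def legal_actions(cell, n):
    y, x = cell
    return [a for a, (dy, dx) in ACTIONS.items() if 0 <= y + dy < n and 0 <= x + dx < n]

def epsilon_greedy(q, cell, n, epsilon=0.2):
    # q[(cell, action)] holds the learned value; unexplored pairs default to 0
    acts = legal_actions(cell, n)
    if random.random() < epsilon:
        return random.choice(acts)                          # exploration
    return max(acts, key=lambda a: q.get((cell, a), 0.0))   # exploitation

def step(cell, action):
    dy, dx = ACTIONS[action]
    return (cell[0] + dy, cell[1] + dx)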
The core idea of the local path planning still uses the Q-Learning algorithm, where the local state region is divided into states using a grid representation. The start point and the end point of the local region are determined according to the distances to the global start point and the global target point; the reward of a state in the threat source is negative and is otherwise positive. A local reward matrix R2 with a structure similar to the reward matrix R1 and a local state transition table Q-Table2 are constructed, and the transition rule of the Q-Learning algorithm is used to learn the local state transition table Q-Table2. The specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; the threat source and its surrounding region are divided into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time.
The rewards of the elements of the local reward matrix R2 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100.
The start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop.
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2.
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2.
The column index of the action a2 is the row index of the next state s2'.
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2'.
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1.
According to the element values of the grids, starting from the start point, the largest element value among all adjacent grids is found, and the grid holding that largest value becomes the first grid of the path; the search then continues from that element over all adjacent grids that have not yet been visited, and the grid holding the largest value becomes the second grid of the path; and so on, until the last grid reached is the end grid. All transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air.
Step 9: the iteration counter of the outer loop is incremented by 1, and whether the number of iterations has reached N1 is judged; if so, the global path planning of the aircraft in the air is completed; otherwise, the initial state s1 is randomly selected again and the method returns to Step 3.
Finally, the reference path that may be planned for the aircraft in an environment with three threat sources is shown in Fig. 7. At this point, the global and local path planning of the aircraft in the corresponding environment is completed. This method enables the aircraft to complete the given task, adapts to the requirements of different environments, and improves the survival rate and task completion rate of the aircraft; at the same time, it also greatly improves the convergence speed during training.

Claims (4)

1. A path planning method based on multi-agent reinforcement learning, characterized in that the specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1;
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row as the initial state s1, and set the maximum number of iterations N1 of the outer loop;
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4;
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1;
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1;
the column index of the action a1 is the row index of the next state s1';
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
the specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1';
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1;
the specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; divide the threat source and its surrounding region into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time;
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop;
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2;
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2;
the column index of the action a2 is the row index of the next state s2';
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2';
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1;
according to the element values of the grids, starting from the start point, find the largest element value among all adjacent grids and take the grid holding that largest value as the first grid of the path; continue from that element over all adjacent grids that have not yet been visited and take the grid holding the largest value as the second grid of the path; and so on, until the last grid reached is the end grid; all transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air;
Step 9: increment the iteration counter of the outer loop by 1 and judge whether the number of iterations has reached N1; if so, the global path planning of the aircraft in the air is completed; otherwise, randomly select the initial state s1 again and return to Step 3.
2. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the model in Step 1 includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
according to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
the global state transition table Q-Table1 is initialized as a zero matrix;
the global reward matrix R1 is established according to the structure of the state transition diagram: its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value;
a value of -1 indicates that the action between the global state of the row and the global state of the column cannot be executed, i.e. it is not a selectable action; a value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
3. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the initial state s1 is the start point, a threat source or the target point.
4. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the rewards of the elements of the local reward matrix R2 in Step 801 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100;
the start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
CN201811032979.5A 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning Active CN109059931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811032979.5A CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811032979.5A CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN109059931A CN109059931A (en) 2018-12-21
CN109059931B (en) 2019-04-26

Family

ID=64759692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811032979.5A Active CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN109059931B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352826B (en) * 2018-12-24 2024-05-03 上海云扩信息科技有限公司 Automatic interface test case generation method and tool
CN109818856B (en) * 2019-03-07 2021-07-13 北京西米兄弟未来科技有限公司 Multipath data transmission method and device
CN110081893B (en) * 2019-04-01 2020-09-25 东莞理工学院 Navigation path planning method based on strategy reuse and reinforcement learning
CN109992000B (en) * 2019-04-04 2020-07-03 北京航空航天大学 Multi-unmanned aerial vehicle path collaborative planning method and device based on hierarchical reinforcement learning
CN110134140B (en) * 2019-05-23 2022-01-11 南京航空航天大学 Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
CN110598309B (en) * 2019-09-09 2022-11-04 电子科技大学 Hardware design verification system and method based on reinforcement learning
CN110866482B (en) * 2019-11-08 2022-09-16 广东工业大学 Dynamic selection method, device and equipment for odometer data source
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112558601B (en) * 2020-11-09 2024-04-02 广东电网有限责任公司广州供电局 Robot real-time scheduling method and system based on Q-learning algorithm and water drop algorithm
JP2023059382A (en) * 2021-10-15 2023-04-27 オムロン株式会社 Route planning system, route planning method, road map construction device, model generation device and model generation method
CN114442625B (en) * 2022-01-24 2023-06-06 中国地质大学(武汉) Environment map construction method and device based on multi-strategy combined control agent

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107883962A (en) * 2017-11-08 2018-04-06 南京航空航天大学 A kind of dynamic Route planner of multi-rotor unmanned aerial vehicle under three-dimensional environment
CN107894715A (en) * 2017-11-13 2018-04-10 华北理工大学 The cognitive development method of robot pose path targetpath optimization
CN107967513B (en) * 2017-12-25 2019-02-15 徐雪松 Multirobot intensified learning collaboratively searching method and system

Also Published As

Publication number Publication date
CN109059931A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109059931B (en) A path planning method based on multi-agent reinforcement learning
CN106969778B (en) Path planning method for cooperative pesticide application of multiple unmanned aerial vehicles
CN106441308B (en) A kind of Path Planning for UAV based on adaptive weighting dove group's algorithm
CN110608743B (en) Multi-unmanned aerial vehicle collaborative route planning method based on multi-population chaotic grayling algorithm
CN106979784B (en) Non-linear track planning based on hybrid pigeon swarm algorithm
CN111722643B (en) Unmanned aerial vehicle cluster dynamic task allocation method imitating wolf colony cooperative hunting mechanism
Wu et al. Distributed trajectory optimization for multiple solar-powered UAVs target tracking in urban environment by Adaptive Grasshopper Optimization Algorithm
CN102880186B (en) flight path planning method based on sparse A* algorithm and genetic algorithm
CN104503464B (en) Computer-based convex polygon field unmanned aerial vehicle spraying operation route planning method
CN105302153B (en) The planing method for the task of beating is examined in the collaboration of isomery multiple no-manned plane
CN104573812B (en) A kind of unmanned plane air route determining method of path based on particle firefly colony optimization algorithm
CN112947592B (en) Reentry vehicle trajectory planning method based on reinforcement learning
CN102506863B (en) Universal gravitation search-based unmanned plane air route planning method
CN106705970A (en) Multi-UAV(Unmanned Aerial Vehicle) cooperation path planning method based on ant colony algorithm
CN109917815A (en) No-manned plane three-dimensional route designing method based on global optimum's brainstorming algorithm
CN101122974A (en) Un-manned plane fairway layout method based on Voronoi graph and ant colony optimization algorithm
CN111006693B (en) Intelligent aircraft track planning system and method thereof
CN109269502A (en) A kind of no-manned plane three-dimensional Route planner based on more strategic innovation particle swarm algorithms
CN108762296B (en) Unmanned aerial vehicle deception route planning method based on ant colony algorithm
CN103279793A (en) Task allocation method for formation of unmanned aerial vehicles in certain environment
CN104850009A (en) Coordination control method for multi-unmanned aerial vehicle team based on predation escape pigeon optimization
CN114840020A (en) Unmanned aerial vehicle flight path planning method based on improved whale algorithm
Lei et al. Path planning for unmanned air vehicles using an improved artificial bee colony algorithm
CN109032167A (en) Unmanned plane paths planning method based on Parallel Heuristic Algorithm
Hu et al. HG-SMA: hierarchical guided slime mould algorithm for smooth path planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant