CN109059931B - A path planning method based on multi-agent reinforcement learning - Google Patents

A path planning method based on multi-agent reinforcement learning

Info

Publication number
CN109059931B
CN109059931B
Authority
CN
China
Prior art keywords
state
global
action
value
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811032979.5A
Other languages
Chinese (zh)
Other versions
CN109059931A (en)
Inventor
曹先彬
杜文博
李碧月
李宇萌
刘瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201811032979.5A priority Critical patent/CN109059931B/en
Publication of CN109059931A publication Critical patent/CN109059931A/en
Application granted granted Critical
Publication of CN109059931B publication Critical patent/CN109059931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a path planning method based on multi-agent reinforcement learning, belonging to the technical field of aircraft. A global state partition model of the aerial flight environment is first established and the global state transition table Q-Table1 is initialized; the global state of some row is randomly selected as the initial state s1. Among all columns of the current state s1, a column is selected with the ε-greedy algorithm and denoted action a1. Using the selected action a1, the next state s1' of the current state s1 is obtained from the global state transition table Q-Table1. The transition rule of the Q-Learning algorithm is used to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1. After updating s1 ← s1', the method enters the inner loop and uses the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1. The iteration counter of the outer loop is incremented by 1 until it reaches N1, which completes the global path planning of the aircraft in the air. The invention enables the aircraft to adapt to the requirements of different environments, improves the survival rate and task completion rate of the aircraft, and improves the convergence speed of reinforcement learning.

Description

A path planning method based on multi-agent reinforcement learning
Technical field
The invention belongs to the technical field of aircraft and relates to a path planning method based on multi-agent reinforcement learning.
Background art
With the continuous development and improvement of air traffic control technology, fast and accurate path planning for aircraft is an important means of guaranteeing flight safety and effectively improving air traffic efficiency in a complex airspace environment. Path planning usually means searching, in a given planning space and under a certain evaluation criterion, for an optimal or feasible trajectory from a start point to a target point that satisfies the constraints and certain performance indices, so that the moving object completes its predetermined task safely. Route selection for aircraft in the air has always been a focus of research.
The more common flight path selection methods mostly plan the path before the aircraft flies: an optimal path is selected, according to a specific criterion, from several flight paths predetermined between the departure point and the destination. However, this approach requires information about the whole network and has a large computational load. In addition, during actual flight the motion performance of the aircraft and the task environment are complex, and trajectory planning must comprehensively consider the aircraft's maneuverability, the task time, the terrain environment, hostile control regions and other factors, so it is very necessary to adjust the flight path flexibly according to the actual airspace conditions.
Researchers at home and abroad have done a great deal of work on path planning, mainly including traditional planning methods and intelligent planning methods. Traditional planning methods include dynamic programming algorithms, optimal control algorithms and derivative-related methods; such methods require large-scale iteration, lack intelligent guidance, generally have a large computational load and a long computing time, are not suitable for free flight in low-altitude airspace, and have poor robustness. Intelligent planning algorithms include search algorithms and swarm intelligence algorithms, such as the A* algorithm, Dijkstra's algorithm, the ant colony algorithm and the particle swarm algorithm; as the problem scale grows, their computational complexity increases dramatically.
In general, the aerial flight environment of an aircraft changes dynamically, and accurate global information is difficult to obtain in advance. To meet increasingly diversified flight demands, it is necessary to develop a method that quickly and dynamically selects the flight path while the aircraft is en route, so that a flight decision support system or a UAV flight management system has a certain real-time trajectory planning capability, thereby improving the throughput of the air traffic network and the utilization of airspace resources.
Reinforcement learning, one of the methods of machine learning and also known as reinforcement or enhancement learning, originates from the conditioned reflex theory in biology. Its basic idea is to reward desired results and punish undesired results, gradually forming a conditioned reflex oriented toward the desired results. As shown in Figure 1, when an agent performs a task, it first interacts with the surrounding environment through an action A; under the effect of the action A and the environment, the agent enters a new state S, and the environment gives an immediate reward R. This cycle continues, and the agent constantly interacts with the environment to generate a large amount of data. The reinforcement learning algorithm uses the generated data to modify its own action policy, interacts with the environment again, generates new data, and further improves its behavior; after several iterations of learning, the agent learns the optimal actions for completing the task, that is, the optimal policy.
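This interaction loop can be illustrated with a minimal Python sketch, assuming a toy chain environment; the names (ToyEnv, run_episode) and the reward of 100 at the goal are illustrative assumptions, not part of the method described below.

import random

# Minimal sketch of the interaction loop of Figure 1 on a toy chain environment.
class ToyEnv:
    """States 0..4; action +1/-1 moves along the chain; state 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 100 if self.state == 4 else 0      # immediate reward R from the environment
        return self.state, reward, self.state == 4

def run_episode(env, q, epsilon=0.5, gamma=0.8):
    s = env.reset()
    done = False
    while not done:
        # action A: explore with probability ε, otherwise exploit the best known action
        a = random.choice([-1, 1]) if random.random() < epsilon else max(q[s], key=q[s].get)
        s_next, r, done = env.step(a)               # new state S and reward R
        q[s][a] = r + gamma * max(q[s_next].values())   # improve the policy from the data (S, A, R, S')
        s = s_next

q = {s: {-1: 0.0, 1: 0.0} for s in range(5)}
for _ in range(50):
    run_episode(ToyEnv(), q)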
Reinforcement learning has long been applied to path planning problems, but because reinforcement learning generally adopts a state-action pair representation, the planning region is traditionally divided into a grid world and the states are defined by grid labels. For reinforcement learning, the number of states directly affects the learning speed; as the state space is expanded and subdivided, reinforcement learning runs into the "curse of dimensionality".
Summary of the invention
To address the "curse of dimensionality" problem in reinforcement learning, the present invention provides a path planning method based on multi-agent reinforcement learning. Based on the most basic reinforcement learning algorithm, Q-Learning, the state and action spaces are divided and abstracted, and two agents with complementary functions are used: a global agent and a local agent explore and exploit the flight environment and correspond to global path planning and local path planning, respectively. This effectively reduces the number of states and optimizes the actions of the aircraft in the air, so as to solve the flight path planning problem and avoid obstacles and threat sources.
The specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1.
The model includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
According to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
The global state transition table Q-Table1 is initialized as a zero matrix;
The global reward matrix R1 is established according to the structure of the state transition diagram: its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value.
A value of -1 indicates that the action between the global state of the row and the global state of the column cannot be executed, i.e. it is not a selectable action; a value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row as the initial state s1, and set the maximum number of iterations N1 of the outer loop.
The initial state s1 is the start point, a threat source or the target point.
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4.
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1.
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1.
The column index of the action a1 is the row index of the next state s1'.
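Steps 4 and 5 can be sketched in Python under the conventions above (-1 marks a non-selectable column, and the chosen column index is the row index of the next state); the table sizes and names are illustrative assumptions.

import numpy as np

# Sketch of Steps 4-5: ε-greedy selection among the executable columns of the current state.
def select_action(q_row, r_row, epsilon=0.2):
    valid = np.where(r_row != -1)[0]              # executable actions for the current state
    assert valid.size > 0, "state has no selectable action"
    if np.random.rand() < epsilon:                # explore
        return int(np.random.choice(valid))
    return int(valid[np.argmax(q_row[valid])])    # exploit: best known valid action

# Example with hypothetical 6-state tables (start, threat sources, target):
R1 = np.full((6, 6), -1.0)
Q1 = np.zeros((6, 6))
R1[0, 1] = R1[1, 0] = 0.0                         # one executable transition with reward 0
s1 = 0
a1 = select_action(Q1[s1], R1[s1])
s1_next = a1                                      # next state = column index of the chosen action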
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
The specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1'.
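A short Python sketch of this update, reusing the conventions just described; Q1, R1 and gamma1 are assumed to be the tables and discount factor defined above, and the names are illustrative.

import numpy as np

# Sketch of the Step-6 rule  Q(s1,a1) = R1(s1,a1) + γ1 · max_{a1'} Q(s1',a1').
def q_update(Q1, R1, s1, a1, gamma1=0.8):
    s1_next = a1                                   # the chosen column is the next state's row index
    valid = np.where(R1[s1_next] != -1)[0]         # selectable actions in the next state
    best_next = Q1[s1_next, valid].max() if valid.size else 0.0
    Q1[s1, a1] = R1[s1, a1] + gamma1 * best_next
    return s1_next                                 # Step 7 then sets s1 = s1_next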
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1.
The specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; the threat source and its surrounding region are divided into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time.
The rewards of the elements of the local reward matrix R2 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100.
The start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
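The reward assignment of Step 801 can be sketched as follows for a hypothetical 10×10 local grid; the grid size, threat circle and start/end cells are illustrative assumptions, and the per-cell values shown are the rewards from which the state-action matrix R2 would be filled.

import numpy as np

# Sketch of the Step-801 reward rule on an assumed 10x10 local grid.
def build_local_rewards(n=10, center=(5, 5), radius=2.5, start=(0, 0), end=(9, 9)):
    rewards = np.zeros((n, n))
    ys, xs = np.mgrid[0:n, 0:n]
    inside = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    rewards[inside] = -1.0        # cells inside the threat source and on its boundary
    rewards[start] = 100.0        # grid nearest the global start point
    rewards[end] = 100.0          # grid nearest the global target point
    return rewards

R2_grid = build_local_rewards()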
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop.
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2.
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2.
The column index of the action a2 is the row index of the next state s2'.
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2'.
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1.
According to the element values of the grids, starting from the start point, the largest element value among all adjacent grids is found, and the grid holding that largest value becomes the first grid of the path; the search then continues from that element over all adjacent grids that have not yet been visited, and the grid holding the largest value becomes the second grid of the path; and so on, until the last grid reached is the end grid. All transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air.
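A sketch of this read-out in Python, assuming `values` holds one learned value per grid cell (for example the row maxima of Q-Table2); all names are illustrative.

import numpy as np

# Sketch of the path read-out after Step 806: greedily follow the unvisited
# neighbouring cell with the largest learned value until the end cell is reached.
def extract_path(values, start, end):
    n, m = values.shape
    path, visited, cell = [start], {start}, start
    while cell != end:
        y, x = cell
        neighbours = [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= y + dy < n and 0 <= x + dx < m and (y + dy, x + dx) not in visited]
        if not neighbours:
            break                                         # no unvisited neighbour: stop
        cell = max(neighbours, key=lambda c: values[c])   # adjacent grid with the largest value
        path.append(cell)
        visited.add(cell)
    return path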
Step 9: the iteration counter of the outer loop is incremented by 1, and whether the number of iterations has reached N1 is judged; if so, the global path planning of the aircraft in the air is completed; otherwise, the initial state s1 is randomly selected again and the method returns to Step 3.
The present invention has the following advantages:
1) The path planning method based on multi-agent reinforcement learning divides and abstracts the global and local state and action spaces, which effectively resolves the "curse of dimensionality" of reinforcement learning in complex path planning problems, enables the aircraft to plan paths autonomously in different environments, adapts to the requirements of different environments, and improves the survival rate and task completion rate of the aircraft;
2) The path planning method based on multi-agent reinforcement learning uses a global agent and a local agent to perform global and local path planning respectively, which effectively reduces the time the aircraft spends exploring and learning the environment, improves the convergence speed of reinforcement learning, and completes path planning of the aircraft in an unknown environment as early as possible.
Brief description of the drawings
Fig. 1 is a schematic diagram of the reinforcement learning used by the present invention in the prior art;
Fig. 2 is a schematic diagram of the global state partition provided in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the local state partition provided in an embodiment of the present invention;
Fig. 4 is the graph structure of the global state transitions in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the global reward matrix in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the initialized global state table in an embodiment of the present invention;
Fig. 7 is a schematic diagram of a reference planned path in an embodiment of the present invention;
Fig. 8 is the flow chart of the path planning method based on multi-agent reinforcement learning of the present invention;
Fig. 9 is the flow chart of local path planning in the path planning method based on multi-agent reinforcement learning of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the drawings and embodiments.
The present invention proposes a path planning method based on multi-agent reinforcement learning, whose concern is how to search out a safe path from a certain point to the target point. First, two kinds of agents with different but complementary functions are defined for global path planning and local path planning; to address the dimension explosion problem in reinforcement learning, the flight region is partitioned into global and local states, where the global states include the start point, the end point and the different threat sources, and the local states are the specific positions the aircraft reaches within a certain global state. The reward matrix of the global agent is defined and the global state transition table is initialized. Then, according to the global state transition table, the initial state of the aircraft is randomly initialized among all global states; actions are selected according to the ε-greedy strategy to obtain the transition pattern of the aircraft between different states, and the global state transition table is learned and updated by combining the global reward matrix with the transition strategy of the Q-Learning algorithm. During this process, after the global agent enters a new state, the local agent is started to carry out local path planning, and the Q-Learning algorithm is likewise used to learn the local state transition table, until the number of learning iterations reaches the preset maximum. Finally, the optimal planned path of the aircraft is determined using the finally obtained global state matrix Q-Table1 and local state matrix Q-Table2.
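The two-level control flow just outlined can be summarised by the following self-contained Python sketch; the global tables follow the conventions of the steps below, `learn_local` stands in for the inner-loop local planner of Steps 801-806, and all names are illustrative assumptions.

import numpy as np

# Compact sketch of the outer (global) loop with a hook for the inner (local) loop.
def plan_global(R1, target, N1, gamma1=0.8, epsilon=0.2, learn_local=lambda s: None):
    Q1 = np.zeros_like(R1, dtype=float)
    for _ in range(N1):                                   # outer loop
        s1 = np.random.randint(R1.shape[0])               # random initial global state
        while s1 != target:
            valid = np.where(R1[s1] != -1)[0]             # executable actions of s1
            if valid.size == 0:
                break
            if np.random.rand() < epsilon:
                a1 = int(np.random.choice(valid))         # explore
            else:
                a1 = int(valid[np.argmax(Q1[s1, valid])]) # exploit
            s1_next = a1                                  # column index = next state's row index
            nxt_valid = np.where(R1[s1_next] != -1)[0]
            best_next = Q1[s1_next, nxt_valid].max() if nxt_valid.size else 0.0
            Q1[s1, a1] = R1[s1, a1] + gamma1 * best_next  # Q-Learning transition rule
            s1 = s1_next
            if s1 != target:
                learn_local(s1)                           # enter the inner loop (local agent)
    return Q1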
The present invention divides the work among the agents in reinforcement learning, so that different agents have different abilities and the cooperation between the agents accomplishes the path planning task together. This not only reduces the state and action space of the agent in the traditional Q-Learning algorithm and accelerates learning and training, but also guarantees the flexibility and accuracy of path planning.
The overall flow chart of the method is shown in Figure 8, and the specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1.
The model includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area, together with their positions; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
The global agent takes the threat sources as the reference for dividing the states: the region where each threat source is located, the departure point of the aircraft and the planned target point constitute the state space. If the number of threat sources is n, the total number of states in the state space is n+2, and the number of selectable actions in each state is n+1. An action here does not mean a specific manoeuvre, but specifies the state that can be transferred to in the next step. Fig. 2 is a schematic diagram of the mission planning of a simple aircraft entering an enemy position, with three threat sources; the start point corresponds to state 1 and the target point corresponds to state 5. Regions in which the aircraft may be endangered, such as surface-to-air missiles, ground radar, valley terrain and severe weather, must be avoided when planning the flight path. A circle represents a threat source, the radius of the circle is the threat radius of the threat source, and the dashed box indicates the state region of the threat source; for example, threat source 3 corresponds to state 4. If action 3 is selected at the start point 1, this means that a transfer to state 3 is attempted in the next step; if the action can be executed, the aircraft transfers to the region of state 3.
According to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
There are two important terms in the Q-Learning reinforcement learning algorithm: "state" and "action". Here each "state" corresponds to a node, and each "action" corresponds to an arrow, i.e. a transition between states. If a transition is blocked by a threat source, the state cannot be transferred; in Fig. 2, state 1 cannot be transferred to state 4. Accordingly, the graph structure representing the global state transitions is established, as shown in Fig. 4. The state numbered 5 is set as the target point, and a reward is associated with each action (i.e. each connecting edge): the reward of an action that transfers directly to the target state is 100, and the reward of the other actions is 0. According to this state transition graph structure, with "state" as the rows and "action" as the columns, the reward matrix R1 is constructed, as shown in Fig. 5; its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value.
A value of -1 represents a null value, meaning that there is no edge between the corresponding nodes; that is, the action between the global state of the row and the global state of the column cannot be executed and is not a selectable action. A value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
Similarly, the state transition table Q-Table1 of the global agent is constructed to represent the knowledge the agent has acquired from experience; the transition table Q-Table1 has the same order as the reward matrix R1. Since the agent knows nothing about the external environment at the beginning, the global state transition table Q-Table1 is initialized as a zero matrix, as shown in Fig. 6. When the number of states is unknown, the agent can start from a single element and add the corresponding row and column to the state table whenever a new state is discovered.
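For the five-state example of Fig. 2 and Figs. 4-6, R1 and Q-Table1 can be sketched as follows. The edge list is an assumed example, because the exact connectivity of Fig. 4 is not reproduced in the text; only the conventions (-1 for a missing edge, a reward of 100 for transferring into target state 5, 0 otherwise) follow the description above.

import numpy as np

# Sketch of R1 and Q-Table1 for an assumed five-state example:
# state 1 = start point, states 2-4 = threat-source regions, state 5 = target point.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]   # assumed, undirected

R1 = np.full((5, 5), -1.0)               # -1: no edge, action not selectable
for u, v in edges:
    R1[u - 1, v - 1] = 100.0 if v == 5 else 0.0
    R1[v - 1, u - 1] = 100.0 if u == 5 else 0.0

Q1 = np.zeros_like(R1)                   # Q-Table1 starts as a zero matrix (Fig. 6)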
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row among all global states as the initial state s1; enter the outer loop and set the maximum number of iterations N1 of the outer loop.
The initial state s1 is the start point, a threat source or the target point.
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4.
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1.
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1.
The column index of the action a1 is the row index of the next state s1'.
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
The specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1'.
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1.
The states and the available actions faced by the global agent and the local agent are also different. The local agent comes into play after the global agent has determined the state to transfer to in the next step. In order to plan a path that avoids the threat source (in general a state region contains one threat source), and since the local agent is located inside the state region, the grid method can be used to divide the states and actions, i.e. one grid corresponds to one state, as shown in Fig. 3. If the aircraft enters a grid occupied by the threat source, the agent receives a negative penalty; otherwise the agent receives a positive reward. In each state grid there are four selectable actions: up, down, left and right. Here the ε-greedy algorithm is chosen as the action generation strategy, so as to achieve the balance between "exploration" and "exploitation" in reinforcement learning.
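A sketch of this four-action grid world follows, with ε-greedy selection restricted to moves that stay on the grid; the grid size and the dictionary-based value store q are illustrative assumptions.

import random

# Sketch of the local grid of Fig. 3: each cell is one local state, and the
# actions are up, down, left and right.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def legal_actions(cell, n):
    y, x = cell
    return [a for a, (dy, dx) in ACTIONS.items() if 0 <= y + dy < n and 0 <= x + dx < n]

def epsilon_greedy(q, cell, n, epsilon=0.2):
    # q[(cell, action)] holds the learned value; unexplored pairs default to 0
    acts = legal_actions(cell, n)
    if random.random() < epsilon:
        return random.choice(acts)                          # exploration
    return max(acts, key=lambda a: q.get((cell, a), 0.0))   # exploitation

def step(cell, action):
    dy, dx = ACTIONS[action]
    return (cell[0] + dy, cell[1] + dx)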
The core idea of the local path planning still uses the Q-Learning algorithm, where the local state region is divided into states using a grid representation. The start point and the end point of the local region are determined according to the distances to the global start point and the global target point; the reward of a state in the threat source is negative and is otherwise positive. A local reward matrix R2 with a structure similar to the reward matrix R1 and a local state transition table Q-Table2 are constructed, and the transition rule of the Q-Learning algorithm is used to learn the local state transition table Q-Table2. The specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; the threat source and its surrounding region are divided into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time.
The rewards of the elements of the local reward matrix R2 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100.
The start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop.
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2.
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2.
The column index of the action a2 is the row index of the next state s2'.
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2'.
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1.
According to the element values of the grids, starting from the start point, the largest element value among all adjacent grids is found, and the grid holding that largest value becomes the first grid of the path; the search then continues from that element over all adjacent grids that have not yet been visited, and the grid holding the largest value becomes the second grid of the path; and so on, until the last grid reached is the end grid. All transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air.
Step 9: the iteration counter of the outer loop is incremented by 1, and whether the number of iterations has reached N1 is judged; if so, the global path planning of the aircraft in the air is completed; otherwise, the initial state s1 is randomly selected again and the method returns to Step 3.
Finally, the reference path that may be planned for the aircraft in an environment with three threat sources is shown in Fig. 7. At this point, the global and local path planning of the aircraft in the corresponding environment is completed. This method enables the aircraft to complete the given task, adapts to the requirements of different environments, and improves the survival rate and task completion rate of the aircraft; at the same time, it also greatly improves the convergence speed during training.

Claims (4)

1. A path planning method based on multi-agent reinforcement learning, characterized in that the specific steps are as follows:
Step 1: establish the global state partition model of the aerial flight environment; initialize the global state transition table Q-Table1 according to the global state partition model; set the reward mechanism and establish the global reward matrix R1;
Step 2: according to the initialized global state transition table Q-Table1, randomly select the global state of some row as the initial state s1, and set the maximum number of iterations N1 of the outer loop;
Step 3: judge whether the initial state s1 is the target point; if so, go to Step 9; otherwise, go to Step 4;
Step 4: for the global state transition table Q-Table1, among all columns of the current state s1, select a column with the ε-greedy algorithm and denote it as action a1;
Step 5: using the selected action a1, obtain the next state s1' of the current state s1 from the global state transition table Q-Table1;
the column index of the action a1 is the row index of the next state s1';
Step 6: use the transition rule of the Q-Learning algorithm to update the element of the global state transition table Q-Table1 corresponding to the current state s1 and the action a1;
the specific formula is as follows:
Q(s1, a1) = R1(s1, a1) + γ1 · max_{a1'} Q(s1', a1')
where Q(s1, a1) is the element value of the updated global state transition table Q-Table1 for selecting action a1 in state s1; R1(s1, a1) is the immediate reward in the global reward matrix R1 for taking action a1 in the current state s1; γ1 is the discount factor of the global agent, a constant satisfying 0 ≤ γ1 < 1; and max_{a1'} Q(s1', a1') is the largest element value in the global state transition table Q-Table1 over all selectable actions a1' of the next state s1';
Step 7: update s1 ← s1'; judge whether s1 is the target point among the global states; if so, go to Step 9; otherwise, go to Step 8;
Step 8: enter the inner loop and use the Q-Learning algorithm to obtain the local planned path corresponding to the updated state s1;
the specific steps are as follows:
Step 801: the updated state s1 corresponds to a threat source; divide the threat source and its surrounding region into local state grids, each grid corresponding to one local state; all state grids constitute the local state transition table Q-Table2 to be learned, and the local reward matrix R2 is established at the same time;
Step 802: initialize the local state transition table Q-Table2 as a zero matrix, randomly select the local state of some grid row as the initial state s2, and set the maximum number of iterations N2 of the inner loop;
Step 803: among all columns of the current state s2, select a column with the ε-greedy algorithm and denote it as action a2;
Step 804: using the selected action a2, obtain the next state s2' of the current state s2 from the local state transition table Q-Table2;
the column index of the action a2 is the row index of the next state s2';
Step 805: use the Q-Learning algorithm to update the element of the local state transition table Q-Table2 corresponding to the current state s2 and the action a2;
Q(s2, a2) = R2(s2, a2) + γ2 · max_{a2'} Q(s2', a2')
where Q(s2, a2) is the element value of the updated local state transition table Q-Table2 for selecting action a2 in state s2; R2(s2, a2) is the immediate reward in the local reward matrix R2 after taking action a2 in the current state s2; γ2 is the discount factor of the local agent, a constant satisfying 0 ≤ γ2 < 1; and max_{a2'} Q(s2', a2') is the largest element value in the local state transition table Q-Table2 over all selectable actions a2' of the next state s2';
Step 806: update s2 ← s2' and return to Step 803 until the number of iterations reaches N2, which completes the inner loop and yields the local planned path corresponding to the current state s1;
according to the element values of the grids, starting from the start point, find the largest element value among all adjacent grids and take the grid holding that largest value as the first grid of the path; continue from that element over all adjacent grids that have not yet been visited and take the grid holding the largest value as the second grid of the path; and so on, until the last grid reached is the end grid; all transfer grids selected from the start point to the end point constitute the planned local path of the aircraft in the air;
Step 9: increment the iteration counter of the outer loop by 1 and judge whether the number of iterations has reached N1; if so, the global path planning of the aircraft in the air is completed; otherwise, randomly select the initial state s1 again and return to Step 3.
2. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the model in Step 1 includes the start point of the aircraft, the target point of the aircraft, and threat sources of different radii in the flight area; the start point, the target point and the threat sources represent different global states and together form a state transition diagram;
according to the state transition diagram, the global state transition table Q-Table1 and the global reward matrix R1 of the same order are constructed;
the global state transition table Q-Table1 is initialized as a zero matrix;
the global reward matrix R1 is established according to the structure of the state transition diagram: its rows and columns are indexed by the global states arranged in order; an element of R1 represents the pair formed by the global state of its row and the global state of its column, and the value of the element indicates whether the action for that pair of global states can be executed and, if so, its reward value;
a value of -1 indicates that the action between the global state of the row and the global state of the column cannot be executed, i.e. it is not a selectable action; a value other than -1 indicates that the action between the global state of the row and the global state of the column can be executed, and the reward is that value.
3. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the initial state s1 is the start point, a threat source or the target point.
4. The path planning method based on multi-agent reinforcement learning according to claim 1, characterized in that the rewards of the elements of the local reward matrix R2 in Step 801 are determined according to the following rule: the state grids inside the threat source and on its boundary have a negative reward, set to -1; the remaining state grids have a reward of 0, except that the state grids of the start point and the end point have a reward of 100;
the start point and the end point are determined according to the following rule: the grid nearest to the start point of the global state partition model is the start point of the local region, and the grid nearest to the target point of the global state partition model is the end point of the local region.
CN201811032979.5A 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning Active CN109059931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811032979.5A CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811032979.5A CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN109059931A CN109059931A (en) 2018-12-21
CN109059931B (en) 2019-04-26

Family

ID=64759692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811032979.5A Active CN109059931B (en) 2018-09-05 2018-09-05 A path planning method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN109059931B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352826B (en) * 2018-12-24 2024-05-03 上海云扩信息科技有限公司 Automatic interface test case generation method and tool
CN109818856B (en) * 2019-03-07 2021-07-13 北京西米兄弟未来科技有限公司 Multipath data transmission method and device
CN110081893B (en) * 2019-04-01 2020-09-25 东莞理工学院 Navigation path planning method based on strategy reuse and reinforcement learning
CN109992000B (en) * 2019-04-04 2020-07-03 北京航空航天大学 Multi-unmanned aerial vehicle path collaborative planning method and device based on hierarchical reinforcement learning
CN110134140B (en) * 2019-05-23 2022-01-11 南京航空航天大学 Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
CN110598309B (en) * 2019-09-09 2022-11-04 电子科技大学 Hardware design verification system and method based on reinforcement learning
CN110866482B (en) * 2019-11-08 2022-09-16 广东工业大学 Dynamic selection method, device and equipment for odometer data source
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112558601B (en) * 2020-11-09 2024-04-02 广东电网有限责任公司广州供电局 Robot real-time scheduling method and system based on Q-learning algorithm and water drop algorithm
JP2023059382A (en) * 2021-10-15 2023-04-27 オムロン株式会社 Route planning system, route planning method, road map construction device, model generation device and model generation method
CN114442625B (en) * 2022-01-24 2023-06-06 中国地质大学(武汉) Environment map construction method and device based on multi-strategy combined control agent

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106595671A (en) * 2017-02-22 2017-04-26 南方科技大学 Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107883962A (en) * 2017-11-08 2018-04-06 南京航空航天大学 A kind of dynamic Route planner of multi-rotor unmanned aerial vehicle under three-dimensional environment
CN107894715A (en) * 2017-11-13 2018-04-10 华北理工大学 The cognitive development method of robot pose path targetpath optimization
CN107967513B (en) * 2017-12-25 2019-02-15 徐雪松 Multirobot intensified learning collaboratively searching method and system

Also Published As

Publication number Publication date
CN109059931A (en) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109059931B (en) A path planning method based on multi-agent reinforcement learning
CN106969778B (en) Path planning method for cooperative pesticide application of multiple unmanned aerial vehicles
CN106441308B (en) A kind of Path Planning for UAV based on adaptive weighting dove group's algorithm
CN110608743B (en) Multi-unmanned aerial vehicle collaborative route planning method based on multi-population chaotic grayling algorithm
CN106979784B (en) Non-linear track planning based on hybrid pigeon swarm algorithm
CN111722643B (en) Unmanned aerial vehicle cluster dynamic task allocation method imitating wolf colony cooperative hunting mechanism
Wu et al. Distributed trajectory optimization for multiple solar-powered UAVs target tracking in urban environment by Adaptive Grasshopper Optimization Algorithm
CN102880186B (en) flight path planning method based on sparse A* algorithm and genetic algorithm
CN104503464B (en) Computer-based convex polygon field unmanned aerial vehicle spraying operation route planning method
CN105302153B (en) The planing method for the task of beating is examined in the collaboration of isomery multiple no-manned plane
CN104573812B (en) A kind of unmanned plane air route determining method of path based on particle firefly colony optimization algorithm
CN112947592B (en) Reentry vehicle trajectory planning method based on reinforcement learning
CN102506863B (en) Universal gravitation search-based unmanned plane air route planning method
CN106705970A (en) Multi-UAV(Unmanned Aerial Vehicle) cooperation path planning method based on ant colony algorithm
CN109917815A (en) No-manned plane three-dimensional route designing method based on global optimum's brainstorming algorithm
CN101122974A (en) Un-manned plane fairway layout method based on Voronoi graph and ant colony optimization algorithm
CN111006693B (en) Intelligent aircraft track planning system and method thereof
CN109269502A (en) A kind of no-manned plane three-dimensional Route planner based on more strategic innovation particle swarm algorithms
CN108762296B (en) Unmanned aerial vehicle deception route planning method based on ant colony algorithm
CN103279793A (en) Task allocation method for formation of unmanned aerial vehicles in certain environment
CN104850009A (en) Coordination control method for multi-unmanned aerial vehicle team based on predation escape pigeon optimization
CN114840020A (en) Unmanned aerial vehicle flight path planning method based on improved whale algorithm
Lei et al. Path planning for unmanned air vehicles using an improved artificial bee colony algorithm
CN109032167A (en) Unmanned plane paths planning method based on Parallel Heuristic Algorithm
Hu et al. HG-SMA: hierarchical guided slime mould algorithm for smooth path planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant