CN116430891A - Deep reinforcement learning method oriented to multi-agent path planning environment - Google Patents

Deep reinforcement learning method oriented to multi-agent path planning environment

Info

Publication number
CN116430891A
CN116430891A
Authority
CN
China
Prior art keywords
network
agent
curiosity
path planning
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310175856.1A
Other languages
Chinese (zh)
Inventor
陈志华
王子涵
李然
张国栋
梁磊
陈凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN202310175856.1A priority Critical patent/CN116430891A/en
Publication of CN116430891A publication Critical patent/CN116430891A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of path planning and provides a deep reinforcement learning algorithm and system for multi-agent path planning. The method comprises the following steps: building a modeling and path planning simulation system for quadrotor unmanned aerial vehicles; constructing a basic deep reinforcement learning network and initializing its basic parameters; building a non-global curiosity network to improve the exploration ability of the agents; and building an attention network to accelerate and stabilize the training process and to enhance cooperation between the agents. The proposed algorithm combines a curiosity mechanism with an attention mechanism, establishes a new reward distribution mechanism for the agents, balances exploration and cooperation, and effectively improves the stability and planning quality of multi-agent path planning.

Description

Deep reinforcement learning method oriented to multi-agent path planning environment
Technical Field
The invention relates to the path planning problem of multiple agents, and in particular to multi-agent path planning with deep reinforcement learning, addressing the problems of insufficient exploration by the agents and unreasonable reward value distribution.
Background
Path planning is a technique widely used in robotics, autonomous vehicles, virtual reality, simulation systems, and related fields. Its main objective is to find an optimal path from a start point to an end point in a given environment while satisfying specific task requirements, such as avoiding obstacles and collisions. To better meet practical demands, research on path planning continues to advance. In recent years, with the continuous development of artificial intelligence, techniques such as deep learning and reinforcement learning have gradually been introduced into the field of path planning, greatly improving its efficiency and accuracy. By exploiting the "exploration and exploitation" characteristic of reinforcement learning, good results can be obtained more quickly than with conventional methods when planning paths in complex environments. In addition, multi-agent reinforcement learning better matches the characteristics of many real-world path planning scenarios; for example, in path planning for a group of unmanned aerial vehicles, interaction and cooperation among the agents must be considered, and the agents can cooperatively control the drones so as to reach a globally optimal solution.
Deep reinforcement learning combines deep learning with reinforcement learning and can make full use of the representational power of deep learning to solve more complex problems. In conventional reinforcement learning methods, processing the environment information typically requires manually designed feature vectors. Deep reinforcement learning algorithms, however, inherit the strong environment-sensing capability of deep learning (e.g., convolutional and fully connected layers) and can directly process high-dimensional environmental observations and extract features from them.
At present, research on path planning in multi-agent environments remains scarce, and problems such as unreasonable reward distribution, difficult convergence, and complex relationships among agents still need further study.
Disclosure of Invention
The invention aims to solve the above problems and provides a deep reinforcement learning method for multi-agent path planning, addressing the difficulty that, because environmental rewards are sparse in path planning environments, agents converge slowly or converge to locally optimal solutions.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides a deep reinforcement learning method for a multi-agent path planning environment, which comprises the following steps:
step 1: constructing a three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle by means of a Pybullet development kit;
step 2: finishing a deep reinforcement learning algorithm based on a non-global curiosity network and an attention module, and initializing each intelligent agent;
step 3: constructing an environment rewarding function according to a path planning task target, and setting a target to be reached according to rules abstracted by a simulation environment;
step 4: setting the maximum iteration round and other parameters;
step 5: according to the Pybullet development kit, acquiring environment observation information in a simulation environment and communication information between the agents in the same team, processing state information, selecting actions to be executed, acquiring curiosity rewarding values of the agents, inputting the curiosity rewarding values into an attention network for further processing, and acquiring final rewarding values;
step 6: finer evaluation of parameters of the network and policy network;
step 7: acquiring new environment observation information, acquiring experience playback quadruples and storing the experience playback quadruples in a playback experience buffer;
step 8: and repeatedly executing the steps 5-7, and updating the neural network in the multi-agent reinforcement learning algorithm until the iteration number reaches the maximum iteration number, thereby realizing the path planning task in the simulation environment.
Further, the step 1 includes:
In the Pybullet simulation environment, a set of agents is defined, each of which is identical except for its initial location in the environment. The environment comprises a set of local observations, a set of actions, a set of states S, and a state transition function; each agent i obtains its own local observation o_i.
Further, the step 3 includes:
The goal to be achieved is: the unmanned aerial vehicle avoids all obstacles and successfully reaches the target position without crashing.
Further, the step 4 includes:
the attention module acts on the non-global curiosity module and is used for controlling the importance degree of the curiosity value of each agent to achieve the overall goal.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the simulation module is further configured to:
A set of agents is defined, each of which is identical except for its initial location in the environment. The environment comprises a set of local observations, a set of actions, a set of states S, and a state transition function; each agent i obtains its own local observation o_i.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the attention module is further configured to:
the attention module acts on the curiosity module, processes the curiosity reward value generated by the curiosity module and is used for improving the effect of the curiosity reward value on convergence of the whole training.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the non-global curiosity module is further configured to:
Each agent computes its exploration signal based on its own local observation, generating a curiosity reward.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the reward function construction module is further configured to:
the objects to be achieved are: any intelligent body successfully avoids various obstacles under the condition of no crash, and successfully reaches the position of the target point.
Compared with the prior art, the invention has the following advantages:
1) The non-global curiosity module adopted by the invention solves the problem that agent path planning tends to converge to a single path in complex environments, improves the exploration level of the agents, and efficiently optimizes the multi-agent game strategy;
2) The invention provides that the attention module acts on the non-global curiosity module, and the curiosity rewards acquired by a single agent are further optimized by using the attention according to global environment observation, so that the convergence stability is improved;
3) The invention aims at the cooperative multi-agent, and realizes the path planning of the multi-agent under the complex obstacle environment.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is an overall schematic of a simulation environment employed by the present invention;
FIG. 3 is a process framework diagram of a multi-agent reinforcement learning algorithm set forth in the present invention;
fig. 4 shows a path planning result diagram (top view) of the algorithm in this simulation environment.
Detailed Description
The present invention will be further described in detail with reference to the drawings and the following examples, wherein like reference numerals refer to the same or similar elements, in order to make the objects, technical solutions and advantages of the present invention more apparent. However, the following specific examples are given for the purpose of illustration only and are not intended to limit the scope of the present invention.
Referring to fig. 1, 2 and 3, the method of the embodiment of the present invention operates as follows:
step 1: the four-rotor drone was modeled using ROS. Establishing an appropriate coordinate system to describe the movement of the unmanned aerial vehicle in space, usually using an inertial coordinate system and the coordinate system of the unmanned aerial vehicle itself; describing the movement of the unmanned aerial vehicle in three directions (longitudinal, transverse and vertical), and adjusting the power output and the moment of the four motors according to the state of the unmanned aerial vehicle; describing the rotation state of the unmanned aerial vehicle by adopting a rotation matrix or quaternion according to aerodynamics; programming an ROS program to simulate the description of the four-rotor unmanned aerial vehicle; the method comprises the steps of importing a quadrotor unmanned aerial vehicle into an environment, uniformly arranging 40 radar rays around the unmanned aerial vehicle, identifying whether the surrounding environment is a target object, importing models of objects such as columns and the like into the environment, randomly generating a fixed number of models in the environment, and using the models as barriers; the sphere models with the same number as the unmanned aerial vehicles are imported as target points and randomly distributed behind the obstacle.
Step 2: the design of a non-global curiosity network and an attention module is completed, and the non-global curiosity network and the attention module are introduced into a deep reinforcement learning algorithm; according to the dimensions of the state space and the action space of the four-rotor unmanned aerial vehicle (the dimension of the observation space of each intelligent body is 40 and the dimension of the action space is 3), the input and output dimensions of an algorithm network are adjusted, and an improved deep reinforcement learning algorithm is completed; using the algorithm as intelligentInitializing respective networks by a body; according to the strategy
Figure SMS_14
Obtaining the selection action in the action space of the intelligent agent
Figure SMS_15
And obtains rewards by interacting with the simulation environment
Figure SMS_16
:
Figure SMS_17
Step 3: designing a bonus function of an environment: when four rotor unmanned aerial vehicle
Figure SMS_20
Closer to the obstacle, a negative prize will be given:
Figure SMS_26
Figure SMS_27
wherein, the method comprises the steps of, wherein,
Figure SMS_19
for the distance between the drone and the obstacle,
Figure SMS_22
is the influence range of the obstacle; when the quadrotor unmanned aerial vehicle is destroyed due to collision with an obstacle or excessive posture adjustment, a negative reward is given:
Figure SMS_24
the method comprises the steps of carrying out a first treatment on the surface of the When the agent successfully reaches the target point, a positive reward will be given:
Figure SMS_25
the method comprises the steps of carrying out a first treatment on the surface of the In addition, output from non-global curiosity network
Figure SMS_18
A prize may be awarded:
Figure SMS_21
the method comprises the steps of carrying out a first treatment on the surface of the The total prize is thus:
Figure SMS_23
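A minimal Python sketch of such a reward function is given below; the numeric penalty and bonus values and the linear shape of the obstacle-proximity penalty are assumptions, since the exact formulas are given only in the original figures.

def environment_reward(obstacle_dist, obstacle_range, crashed, reached_goal):
    # obstacle_dist: distance between the drone and the nearest obstacle
    # obstacle_range: influence range of the obstacle
    reward = 0.0
    if obstacle_dist < obstacle_range:
        # Negative reward when the drone enters the obstacle's influence range
        # (a simple linear penalty is assumed here).
        reward -= (obstacle_range - obstacle_dist) / obstacle_range
    if crashed:
        reward -= 10.0        # assumed penalty for collision or excessive attitude change
    if reached_goal:
        reward += 10.0        # assumed bonus for reaching the target point
    return reward

def total_reward(env_reward, curiosity_reward):
    # The total per-step reward is the environment reward plus the (attention-weighted)
    # curiosity reward output by the non-global curiosity network.
    return env_reward + curiosity_reward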
step 4: setting the maximum round as 1000, and setting the size of the experience playback buffer zone as
Figure SMS_28
Soft update parameters
Figure SMS_29
Figure SMS_30
Set to 256.
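For illustration, these settings could be collected in a configuration object such as the following. The buffer size, soft-update coefficient and discount factor are assumed values; attributing the stated value 256 to the batch size is likewise an interpretation.

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    max_episodes: int = 1000          # stated maximum number of rounds
    batch_size: int = 256             # the value 256 stated in step 4 (interpretation)
    replay_buffer_size: int = 100_000 # assumed
    tau: float = 0.005                # soft-update coefficient, assumed
    gamma: float = 0.99               # discount factor, assumed

cfg = TrainingConfig()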
Step 5: the algorithm takes the form of an Actor-Critic framework, which includes Actor (Actor) networks and critics (Critic) networks. The actor network is responsible for generating actions of the unmanned aerial vehicle and interacting with the environment, and the critique network is responsible for evaluating states and performances of the actions and guiding the strategy function to generate actions of the next stage; both networks adopt a dual-network structure, including a target network and an estimation network; according to the observation information of each intelligent agent at the moment, the action executed by the processing of the Actor network is obtained
Figure SMS_31
Interacting with the environment, calculating the curiosity rewards of each, inputting the curiosity rewards of all the agents into the attention network, weighting rewards, and outputting the curiosity rewards finally obtained by each agent
Figure SMS_32
The method comprises the steps of carrying out a first treatment on the surface of the Adding the curiosity rewards and the environmental rewards to obtain the final rewards of each agent in the step.
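The dual-network Actor-Critic structure described above could be sketched in PyTorch as follows. Only the observation dimension (40), the action dimension (3) and the target/estimation pairing follow from the description; the hidden-layer sizes and the tanh-bounded actions are assumptions.

import copy
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 40, 3              # per-agent dimensions stated in step 2

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, ACT_DIM), nn.Tanh())  # bounded continuous action

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 1))                   # scalar action value

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

# Dual-network structure: each agent keeps estimation networks and target copies.
actor, critic = Actor(), Critic()
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)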
In addition, the "non-global" nature of the non-global curiosity rewards module is embodied in: when calculating curiosity rewards, the single agent does not calculate all other agents as part of the environment, but selects the agent state information which has influence on the single agent as the environment information according to the distance between the agents.
The process of calculating the curiosity reward value is as follows: first, the current state s_t, the current action a_t, and the next true state s_{t+1} are input into the curiosity module. The curiosity module comprises four sub-modules: two feature-extraction network modules, which extract the features φ(s_t) and φ(s_{t+1}) of the states; a forward model (Forward Model), which predicts the feature of s_{t+1} that results from executing a_t in state s_t; and an inverse model, which estimates a_t from φ(s_t) and φ(s_{t+1}). The curiosity reward is calculated from the similarity between the predicted feature of s_{t+1} and the true feature φ(s_{t+1}).
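A compact PyTorch sketch of such a curiosity module is shown below. The feature dimension, the network sizes and the use of a squared-error dissimilarity are assumptions consistent with the general design described above.

import torch
import torch.nn as nn

class CuriosityModule(nn.Module):
    # Feature encoder + forward model + inverse model, as described above.
    def __init__(self, obs_dim=40, act_dim=3, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                                           nn.Linear(64, feat_dim))
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                           nn.Linear(64, act_dim))

    def forward(self, s_t, a_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        # Forward model: predict the feature of s_{t+1} from phi(s_t) and a_t.
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_t], dim=-1))
        # Inverse model: estimate a_t from phi(s_t) and phi(s_{t+1}).
        a_pred = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        # Curiosity reward from the dissimilarity between predicted and true features.
        r_cur = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return r_cur, a_pred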
In addition, the attention network processes the curiosity rewards as follows: first, the sequence of curiosity rewards of all agents, X = [x_1, …, x_N], is input into the attention network, and the importance of the curiosity rewards of the different agents is learned through neural network processing. Specifically, an attention variable z is used to represent, for the query variable q, the index position of the selected item; given q and X, the probability of selecting the i-th input information is:

α_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where α_i is called the attention distribution and s(x_i, q) is the attention scoring function.
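The attention weighting of the per-agent curiosity rewards could be sketched as follows. Only the softmax attention distribution is fixed by the description; the additive scoring network, the query built from a global observation, and the element-wise weighting of the rewards are assumed design choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CuriosityAttention(nn.Module):
    def __init__(self, query_dim=32):
        super().__init__()
        # Additive scoring function s(x_i, q) over (curiosity reward, query) pairs.
        self.score = nn.Sequential(nn.Linear(1 + query_dim, 64), nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, curiosity_rewards, query):
        # curiosity_rewards: tensor of shape [N], one raw reward x_i per agent
        # query: tensor of shape [query_dim], e.g. derived from the global observation
        n = curiosity_rewards.shape[0]
        x = curiosity_rewards.unsqueeze(-1)                       # [N, 1]
        q = query.unsqueeze(0).expand(n, -1)                      # [N, query_dim]
        scores = self.score(torch.cat([x, q], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=0)                          # attention distribution alpha_i
        return alpha * curiosity_rewards, alpha                   # weighted rewards per agent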
step 6, the specific updating process of each network is as follows:
critic two netsThe basis of collaterals
Figure SMS_56
Updating:
Figure SMS_57
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_58
in order to be able to sample the number of samples,
Figure SMS_59
to at the same time
Figure SMS_60
Parameter values of a function
Figure SMS_61
In the case of determination, the state-action pair is
Figure SMS_62
When the intelligent agent finishes the round, the intelligent agent can obtain expected return;
Figure SMS_63
the expression is as follows:
Figure SMS_64
wherein the method comprises the steps of
Figure SMS_65
Is the first
Figure SMS_66
The value of the prize to be awarded by the wheel,
Figure SMS_67
is a discount factor that balances future rewards against current rewards.
Figure SMS_68
And outputting an action value for the Actor network.
The parameters of the Actor network are updated in a gradient way according to the evaluation of the actions by the Critic network:
Figure SMS_69
wherein in addition to the parameters described above, there are
Figure SMS_70
The function is a function that maximizes the desirability,
Figure SMS_71
namely, is
Figure SMS_72
Function at parameterThe gradient at the time of the determination is determined,
Figure SMS_74
equivalent to a policy function
Figure SMS_75
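A condensed sketch of these updates, following the deterministic Actor-Critic form that the equations above take, is given below; the optimizer choice and learning rates are assumptions, and the networks are the ones sketched for step 5.

import torch

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # assumed learning rates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                 # tensors sampled from the replay buffer

    # Critic update: minimise the mean-squared temporal-difference error against y_i.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next)).squeeze(-1)
    critic_loss = ((y - critic(s, a).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on Q(s, mu(s)) through the Critic's evaluation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks with coefficient tau.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for param, target_param in zip(net.parameters(), target.parameters()):
            target_param.data.mul_(1.0 - tau).add_(tau * param.data)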
Step 7: The quadruple (s_t, a_t, r_t, s_{t+1}) obtained above is stored in the experience replay buffer. The experience replay process is as follows: a learning sample (s_t, a_t, r_t, s_{t+1}) is taken from the experience replay buffer pool, and the value of the temporal-difference error (TD_error) is calculated as

δ_t = r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)

the stochastic gradient is then computed from δ_t, and the network parameters are updated by the corresponding gradient formula. The algorithm adopts an experience replay strategy based on importance sampling: the replay probability of each experience is ordered in descending fashion according to its temporal-difference error, so that the larger the temporal-difference error, the larger the probability of being sampled.
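A simple replay buffer with TD-error-proportional sampling along these lines could look like the following sketch; the proportional-priority scheme and the small constant eps are assumptions.

import numpy as np

class PrioritizedReplayBuffer:
    # Replay buffer whose sampling probability grows with the temporal-difference error.
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity, self.eps = capacity, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:          # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)                 # (s, a, r, s_next) quadruple
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()                  # larger TD error -> larger probability
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps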
Step 8: the iteration is continued to the set maximum number of iterations according to steps 5-7.
Referring to fig. 4:
the three-dimensional simulation environment is projected on a two-dimensional plane, a blue square in the figure represents the position of an obstacle, a red circle represents the position of a target point, and three irregular lines represent the path planning routes of three four-rotor unmanned aerial vehicles.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood and appreciated by those skilled in the art.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A deep reinforcement learning method and system for a multi-agent path planning environment, characterized by comprising the following steps:
step 1: constructing a three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle by means of a Pybullet development kit;
step 2: finishing a deep reinforcement learning algorithm based on a non-global curiosity network and an attention module, and initializing each intelligent agent;
step 3: constructing an environment rewarding function according to a path planning task target, and setting a target to be reached according to rules abstracted by a simulation environment;
step 4: setting the maximum iteration round and other parameters;
step 5: according to the Pybullet development kit, acquiring environment observation information in a simulation environment and communication information between the agents in the same team, processing state information, selecting actions to be executed, acquiring curiosity rewarding values of the agents, inputting the curiosity rewarding values into an attention network for further processing, and acquiring final rewarding values;
step 6: finer evaluation of parameters of the network and policy network;
step 7: acquiring new environment observation information, acquiring experience playback quadruples and storing the experience playback quadruples in a playback experience buffer;
step 8: and repeatedly executing the steps 5-7, and updating the neural network in the multi-agent reinforcement learning algorithm until the iteration number reaches the maximum iteration number, thereby realizing the path planning task in the simulation environment.
2. The three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle according to claim 1, wherein the four-rotor unmanned aerial vehicle is modeled according to the attribute of the four-rotor unmanned aerial vehicle by adopting ROS software, and intelligent agents of the four-rotor unmanned aerial vehicle are added in a Pybullet simulation environment, wherein each intelligent agent is completely the same except the initial position; the target unit is defined as spherical and is located behind an obstacle.
3. The non-global curiosity network of claim 1, wherein the non-global property is embodied in that, when calculating the curiosity reward, a single agent does not treat all other agents as part of the environment, but selects, according to the distances between agents, the state information of the agents that influence it as the environment state information; first, the current state s_t, the current action a_t, and the next true state s_{t+1} are all input into the curiosity module; the curiosity module comprises four sub-modules: two feature-extraction network modules for extracting the features φ(s_t) and φ(s_{t+1}) of the states; a forward model (Forward Model) for predicting the feature of s_{t+1} obtained by executing a_t in state s_t; and an inverse model for estimating a_t from φ(s_t) and φ(s_{t+1}); the curiosity reward is calculated from the similarity between the predicted feature of s_{t+1} and the true feature φ(s_{t+1}).
4. The attention module of claim 1, wherein the sequence of curiosity rewards of all agents, X = [x_1, …, x_N], is first input into the attention network, and the importance of the curiosity rewards of the different agents is learned through neural network processing; specifically, an attention variable z is adopted to represent, for the query variable q, the index position of the selected item; given q and X, the probability of selecting the i-th input information is α_i = p(z = i | X, q) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q)), wherein α_i is called the attention distribution and s(x_i, q) is the attention scoring function.
5. The deep reinforcement learning algorithm of claim 1, comprising an Actor network and a Critic network, wherein the Critic's two networks are updated according to the temporal-difference error by minimizing L(θ^Q) = (1/N) Σ_{i=1}^{N} ( y_i − Q(s_i, a_i | θ^Q) )², wherein N is the number of samples and Q(s_i, a_i | θ^Q) is the expected return the agent can obtain from the state-action pair (s_i, a_i) until the end of the round, for the determined parameter value θ^Q; y_i is expressed as y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}) | θ^{Q′}), wherein r_i is the reward value of the i-th round and γ is the discount factor used to balance future returns against current returns; μ is the action output of the Actor network, whose parameters are updated by gradient ascent according to the Critic network's evaluation of the actions: ∇_{θ^μ} J ≈ (1/N) Σ_{i=1}^{N} ∇_a Q(s_i, a | θ^Q)|_{a=μ(s_i)} ∇_{θ^μ} μ(s_i | θ^μ), wherein, in addition to the parameters above, J is the objective to be maximized, ∇_{θ^μ} J is its gradient with respect to the determined parameters θ^μ, and μ is equivalent to the policy function π.
CN202310175856.1A 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment Pending CN116430891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310175856.1A CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310175856.1A CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Publications (1)

Publication Number Publication Date
CN116430891A true CN116430891A (en) 2023-07-14

Family

ID=87080380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310175856.1A Pending CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Country Status (1)

Country Link
CN (1) CN116430891A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117492446A (en) * 2023-12-25 2024-02-02 北京大学 Multi-agent cooperation planning method and system based on combination and mixing optimization


Similar Documents

Publication Publication Date Title
Zhu et al. A survey of deep rl and il for autonomous driving policy learning
Shi et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Chen et al. Stabilization approaches for reinforcement learning-based end-to-end autonomous driving
Das et al. A hybrid improved PSO-DV algorithm for multi-robot path planning in a clutter environment
Hong et al. Energy-efficient online path planning of multiple drones using reinforcement learning
Sun et al. Crowd navigation in an unknown and dynamic environment based on deep reinforcement learning
Rempe et al. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion
You et al. Target tracking strategy using deep deterministic policy gradient
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN112947081A (en) Distributed reinforcement learning social navigation method based on image hidden variable probability model
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Wu et al. Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation
CN116661503A (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Yan et al. Immune deep reinforcement learning-based path planning for mobile robot in unknown environment
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
Lei et al. Kb-tree: Learnable and continuous monte-carlo tree search for autonomous driving planning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
Peng et al. Moving object grasping method of mechanical arm based on deep deterministic policy gradient and hindsight experience replay
Gan et al. Multi-USV Cooperative Chasing Strategy Based on Obstacles Assistance and Deep Reinforcement Learning
CN115164890A (en) Swarm unmanned aerial vehicle autonomous motion planning method based on simulation learning
CN115016499A (en) Path planning method based on SCA-QL
Choi et al. Collision avoidance of unmanned aerial vehicles using fuzzy inference system-aided enhanced potential field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication