CN116430891A - Deep reinforcement learning method oriented to multi-agent path planning environment - Google Patents

Deep reinforcement learning method oriented to multi-agent path planning environment

Info

Publication number
CN116430891A
CN116430891A
Authority
CN
China
Prior art keywords
network
agent
curiosity
path planning
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310175856.1A
Other languages
Chinese (zh)
Inventor
陈志华
王子涵
李然
张国栋
梁磊
陈凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN202310175856.1A priority Critical patent/CN116430891A/en
Publication of CN116430891A publication Critical patent/CN116430891A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of path planning and provides a deep reinforcement learning algorithm and system for multi-agent path planning. The method comprises the following steps: building a modeling and path planning simulation system for quadrotor unmanned aerial vehicles; constructing a basic deep reinforcement learning network and initializing its basic parameters; building a non-global curiosity network to improve the exploration ability of the agents; and building an attention network to accelerate and stabilize the training process and to enhance cooperation between the agents. The proposed algorithm combines a curiosity mechanism with an attention mechanism, establishes a new reward distribution mechanism for the agents, balances exploration and cooperation, and effectively improves the stability and planning quality of multi-agent path planning.

Description

Deep reinforcement learning method oriented to multi-agent path planning environment
Technical Field
The invention relates to the path planning problem of multiple agents, and in particular to multi-agent path planning with deep reinforcement learning, addressing the problems of insufficient exploration by the agents and unreasonable reward value distribution.
Background
Path planning is a technique widely used in robotics, autonomous vehicles, virtual reality, simulation systems, and related fields. Its main objective is to find an optimal path from a start point to an end point in a given environment while satisfying specific task requirements, such as avoiding obstacles and collisions. To better meet practical demands, research on path planning continues to advance. In recent years, with the continuous development of artificial intelligence, techniques such as deep learning and reinforcement learning have gradually been introduced into the field of path planning, greatly improving its efficiency and accuracy. By exploiting the "exploration and exploitation" characteristic of reinforcement learning, good results can be obtained more quickly than with conventional methods when planning paths in complex environments. In addition, multi-agent reinforcement learning better matches the characteristics of many real-world path planning scenarios; for example, in path planning for a group of unmanned aerial vehicles, interaction and cooperation among the agents must be considered, and the agents can cooperatively control the drones so as to reach a globally optimal solution.
Deep reinforcement learning combines deep learning with reinforcement learning and can make full use of the representational power of deep learning to solve more complex problems. In conventional reinforcement learning methods, processing the environment information typically requires manually designed feature vectors. Deep reinforcement learning algorithms, however, inherit the strong environment-sensing capability of deep learning (e.g., convolutional and fully connected layers) and can directly process high-dimensional environmental observations and extract features from them.
At present, research on path planning in multi-agent environments remains scarce, and problems such as unreasonable reward distribution, difficult convergence, and complex relationships among agents still need further study.
Disclosure of Invention
The invention aims to solve the above problems and provides a deep reinforcement learning method for multi-agent path planning, addressing the difficulty that, because environmental rewards are sparse in path planning environments, agents converge slowly or converge to locally optimal solutions.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides a deep reinforcement learning method for a multi-agent path planning environment, which comprises the following steps:
step 1: constructing a three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle by means of a Pybullet development kit;
step 2: finishing a deep reinforcement learning algorithm based on a non-global curiosity network and an attention module, and initializing each intelligent agent;
step 3: constructing an environment rewarding function according to a path planning task target, and setting a target to be reached according to rules abstracted by a simulation environment;
step 4: setting the maximum iteration round and other parameters;
step 5: according to the Pybullet development kit, acquiring environment observation information in a simulation environment and communication information between the agents in the same team, processing state information, selecting actions to be executed, acquiring curiosity rewarding values of the agents, inputting the curiosity rewarding values into an attention network for further processing, and acquiring final rewarding values;
step 6: finer evaluation of parameters of the network and policy network;
step 7: acquiring new environment observation information, acquiring experience playback quadruples and storing the experience playback quadruples in a playback experience buffer;
step 8: and repeatedly executing the steps 5-7, and updating the neural network in the multi-agent reinforcement learning algorithm until the iteration number reaches the maximum iteration number, thereby realizing the path planning task in the simulation environment.
Further, the step 1 includes:
In the Pybullet simulation environment, a set of agents is defined, each of which is identical except for its initial location in the environment. The environment comprises a set of local observations, a set of actions, a set of states S, and a state transition function; each agent i obtains its own local observation o_i.
Further, the step 3 includes:
The goal to be achieved is: the unmanned aerial vehicle avoids all obstacles and successfully reaches the target position without crashing.
Further, the step 4 includes:
the attention module acts on the non-global curiosity module and is used for controlling the importance degree of the curiosity value of each agent to achieve the overall goal.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the simulation module is further configured to:
A set of agents is defined, each of which is identical except for its initial location in the environment. The environment comprises a set of local observations, a set of actions, a set of states S, and a state transition function; each agent i obtains its own local observation o_i.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the attention module is further configured to:
the attention module acts on the curiosity module, processes the curiosity reward value generated by the curiosity module and is used for improving the effect of the curiosity reward value on convergence of the whole training.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the non-global curiosity module is further configured to:
Each agent computes its exploration signal based on its own local observation, generating a curiosity reward.
According to an embodiment of the multi-agent path planning oriented deep reinforcement learning method of the present invention, the reward function construction module is further configured to:
the objects to be achieved are: any intelligent body successfully avoids various obstacles under the condition of no crash, and successfully reaches the position of the target point.
Compared with the prior art, the invention has the following advantages:
1) The non-global curiosity module adopted by the invention solves the problem that agent path planning tends to converge to a single path in complex environments, improves the exploration level of the agents, and efficiently optimizes the multi-agent game strategy;
2) The invention provides that the attention module acts on the non-global curiosity module, and the curiosity rewards acquired by a single agent are further optimized by using the attention according to global environment observation, so that the convergence stability is improved;
3) The invention aims at the cooperative multi-agent, and realizes the path planning of the multi-agent under the complex obstacle environment.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is an overall schematic of a simulation environment employed by the present invention;
FIG. 3 is a process framework diagram of a multi-agent reinforcement learning algorithm set forth in the present invention;
fig. 4 shows a path planning result diagram (top view) of the algorithm in this simulation environment.
Detailed Description
The present invention will be further described in detail with reference to the drawings and the following examples, wherein like reference numerals refer to the same or similar elements, in order to make the objects, technical solutions and advantages of the present invention more apparent. However, the following specific examples are given for the purpose of illustration only and are not intended to limit the scope of the present invention.
Referring to fig. 1, 2 and 3, the method of the embodiment of the present invention operates as follows:
step 1: the four-rotor drone was modeled using ROS. Establishing an appropriate coordinate system to describe the movement of the unmanned aerial vehicle in space, usually using an inertial coordinate system and the coordinate system of the unmanned aerial vehicle itself; describing the movement of the unmanned aerial vehicle in three directions (longitudinal, transverse and vertical), and adjusting the power output and the moment of the four motors according to the state of the unmanned aerial vehicle; describing the rotation state of the unmanned aerial vehicle by adopting a rotation matrix or quaternion according to aerodynamics; programming an ROS program to simulate the description of the four-rotor unmanned aerial vehicle; the method comprises the steps of importing a quadrotor unmanned aerial vehicle into an environment, uniformly arranging 40 radar rays around the unmanned aerial vehicle, identifying whether the surrounding environment is a target object, importing models of objects such as columns and the like into the environment, randomly generating a fixed number of models in the environment, and using the models as barriers; the sphere models with the same number as the unmanned aerial vehicles are imported as target points and randomly distributed behind the obstacle.
Step 2: the design of a non-global curiosity network and an attention module is completed, and the non-global curiosity network and the attention module are introduced into a deep reinforcement learning algorithm; according to the dimensions of the state space and the action space of the four-rotor unmanned aerial vehicle (the dimension of the observation space of each intelligent body is 40 and the dimension of the action space is 3), the input and output dimensions of an algorithm network are adjusted, and an improved deep reinforcement learning algorithm is completed; using the algorithm as intelligentInitializing respective networks by a body; according to the strategy
Figure SMS_14
Obtaining the selection action in the action space of the intelligent agent
Figure SMS_15
And obtains rewards by interacting with the simulation environment
Figure SMS_16
:
Figure SMS_17
Step 3: designing a bonus function of an environment: when four rotor unmanned aerial vehicle
Figure SMS_20
Closer to the obstacle, a negative prize will be given:
Figure SMS_26
Figure SMS_27
wherein, the method comprises the steps of, wherein,
Figure SMS_19
for the distance between the drone and the obstacle,
Figure SMS_22
is the influence range of the obstacle; when the quadrotor unmanned aerial vehicle is destroyed due to collision with an obstacle or excessive posture adjustment, a negative reward is given:
Figure SMS_24
the method comprises the steps of carrying out a first treatment on the surface of the When the agent successfully reaches the target point, a positive reward will be given:
Figure SMS_25
the method comprises the steps of carrying out a first treatment on the surface of the In addition, output from non-global curiosity network
Figure SMS_18
A prize may be awarded:
Figure SMS_21
the method comprises the steps of carrying out a first treatment on the surface of the The total prize is thus:
Figure SMS_23
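A minimal Python sketch of such a reward function is given below; the numeric penalty and bonus values and the linear shape of the obstacle-proximity penalty are assumptions, since the exact formulas are given only in the original figures.

def environment_reward(obstacle_dist, obstacle_range, crashed, reached_goal):
    # obstacle_dist: distance between the drone and the nearest obstacle
    # obstacle_range: influence range of the obstacle
    reward = 0.0
    if obstacle_dist < obstacle_range:
        # Negative reward when the drone enters the obstacle's influence range
        # (a simple linear penalty is assumed here).
        reward -= (obstacle_range - obstacle_dist) / obstacle_range
    if crashed:
        reward -= 10.0        # assumed penalty for collision or excessive attitude change
    if reached_goal:
        reward += 10.0        # assumed bonus for reaching the target point
    return reward

def total_reward(env_reward, curiosity_reward):
    # The total per-step reward is the environment reward plus the (attention-weighted)
    # curiosity reward output by the non-global curiosity network.
    return env_reward + curiosity_reward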
step 4: setting the maximum round as 1000, and setting the size of the experience playback buffer zone as
Figure SMS_28
Soft update parameters
Figure SMS_29
Figure SMS_30
Set to 256.
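For illustration, these settings could be collected in a configuration object such as the following. The buffer size, soft-update coefficient and discount factor are assumed values; attributing the stated value 256 to the batch size is likewise an interpretation.

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    max_episodes: int = 1000          # stated maximum number of rounds
    batch_size: int = 256             # the value 256 stated in step 4 (interpretation)
    replay_buffer_size: int = 100_000 # assumed
    tau: float = 0.005                # soft-update coefficient, assumed
    gamma: float = 0.99               # discount factor, assumed

cfg = TrainingConfig()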
Step 5: the algorithm takes the form of an Actor-Critic framework, which includes Actor (Actor) networks and critics (Critic) networks. The actor network is responsible for generating actions of the unmanned aerial vehicle and interacting with the environment, and the critique network is responsible for evaluating states and performances of the actions and guiding the strategy function to generate actions of the next stage; both networks adopt a dual-network structure, including a target network and an estimation network; according to the observation information of each intelligent agent at the moment, the action executed by the processing of the Actor network is obtained
Figure SMS_31
Interacting with the environment, calculating the curiosity rewards of each, inputting the curiosity rewards of all the agents into the attention network, weighting rewards, and outputting the curiosity rewards finally obtained by each agent
Figure SMS_32
The method comprises the steps of carrying out a first treatment on the surface of the Adding the curiosity rewards and the environmental rewards to obtain the final rewards of each agent in the step.
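The dual-network Actor-Critic structure described above could be sketched in PyTorch as follows. Only the observation dimension (40), the action dimension (3) and the target/estimation pairing follow from the description; the hidden-layer sizes and the tanh-bounded actions are assumptions.

import copy
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 40, 3              # per-agent dimensions stated in step 2

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, ACT_DIM), nn.Tanh())  # bounded continuous action

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 1))                   # scalar action value

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

# Dual-network structure: each agent keeps estimation networks and target copies.
actor, critic = Actor(), Critic()
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)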
In addition, the "non-global" nature of the non-global curiosity rewards module is embodied in: when calculating curiosity rewards, the single agent does not calculate all other agents as part of the environment, but selects the agent state information which has influence on the single agent as the environment information according to the distance between the agents.
The process of calculating the curiosity reward value is as follows: first, the current state s_t, the current action a_t, and the next true state s_{t+1} are input into the curiosity module. The curiosity module comprises four sub-modules: two feature-extraction network modules, which extract the features φ(s_t) and φ(s_{t+1}) of the states; a forward model (Forward Model), which predicts the feature of s_{t+1} that results from executing a_t in state s_t; and an inverse model, which estimates a_t from φ(s_t) and φ(s_{t+1}). The curiosity reward is calculated from the similarity between the predicted feature of s_{t+1} and the true feature φ(s_{t+1}).
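A compact PyTorch sketch of such a curiosity module is shown below. The feature dimension, the network sizes and the use of a squared-error dissimilarity are assumptions consistent with the general design described above.

import torch
import torch.nn as nn

class CuriosityModule(nn.Module):
    # Feature encoder + forward model + inverse model, as described above.
    def __init__(self, obs_dim=40, act_dim=3, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                                           nn.Linear(64, feat_dim))
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                           nn.Linear(64, act_dim))

    def forward(self, s_t, a_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        # Forward model: predict the feature of s_{t+1} from phi(s_t) and a_t.
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_t], dim=-1))
        # Inverse model: estimate a_t from phi(s_t) and phi(s_{t+1}).
        a_pred = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        # Curiosity reward from the dissimilarity between predicted and true features.
        r_cur = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return r_cur, a_pred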
In addition, the attention network processes the curiosity rewards as follows: first, the sequence of curiosity rewards of all agents, X = [x_1, …, x_N], is input into the attention network, and the importance of the curiosity rewards of the different agents is learned through neural network processing. Specifically, an attention variable z is used to represent, for the query variable q, the index position of the selected item; given q and X, the probability of selecting the i-th input information is:

α_i = p(z = i | X, q) = softmax(s(x_i, q)) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q))

where α_i is called the attention distribution and s(x_i, q) is the attention scoring function.
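The attention weighting of the per-agent curiosity rewards could be sketched as follows. Only the softmax attention distribution is fixed by the description; the additive scoring network, the query built from a global observation, and the element-wise weighting of the rewards are assumed design choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CuriosityAttention(nn.Module):
    def __init__(self, query_dim=32):
        super().__init__()
        # Additive scoring function s(x_i, q) over (curiosity reward, query) pairs.
        self.score = nn.Sequential(nn.Linear(1 + query_dim, 64), nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, curiosity_rewards, query):
        # curiosity_rewards: tensor of shape [N], one raw reward x_i per agent
        # query: tensor of shape [query_dim], e.g. derived from the global observation
        n = curiosity_rewards.shape[0]
        x = curiosity_rewards.unsqueeze(-1)                       # [N, 1]
        q = query.unsqueeze(0).expand(n, -1)                      # [N, query_dim]
        scores = self.score(torch.cat([x, q], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=0)                          # attention distribution alpha_i
        return alpha * curiosity_rewards, alpha                   # weighted rewards per agent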
step 6, the specific updating process of each network is as follows:
critic two netsThe basis of collaterals
Figure SMS_56
Updating:
Figure SMS_57
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_58
in order to be able to sample the number of samples,
Figure SMS_59
to at the same time
Figure SMS_60
Parameter values of a function
Figure SMS_61
In the case of determination, the state-action pair is
Figure SMS_62
When the intelligent agent finishes the round, the intelligent agent can obtain expected return;
Figure SMS_63
the expression is as follows:
Figure SMS_64
wherein the method comprises the steps of
Figure SMS_65
Is the first
Figure SMS_66
The value of the prize to be awarded by the wheel,
Figure SMS_67
is a discount factor that balances future rewards against current rewards.
Figure SMS_68
And outputting an action value for the Actor network.
The parameters of the Actor network are updated in a gradient way according to the evaluation of the actions by the Critic network:
Figure SMS_69
wherein in addition to the parameters described above, there are
Figure SMS_70
The function is a function that maximizes the desirability,
Figure SMS_71
namely, is
Figure SMS_72
Function at parameterThe gradient at the time of the determination is determined,
Figure SMS_74
equivalent to a policy function
Figure SMS_75
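A condensed sketch of these updates, following the deterministic Actor-Critic form that the equations above take, is given below; the optimizer choice and learning rates are assumptions, and the networks are the ones sketched for step 5.

import torch

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # assumed learning rates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                 # tensors sampled from the replay buffer

    # Critic update: minimise the mean-squared temporal-difference error against y_i.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next)).squeeze(-1)
    critic_loss = ((y - critic(s, a).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on Q(s, mu(s)) through the Critic's evaluation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks with coefficient tau.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for param, target_param in zip(net.parameters(), target.parameters()):
            target_param.data.mul_(1.0 - tau).add_(tau * param.data)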
Step 7: The quadruple (s_t, a_t, r_t, s_{t+1}) obtained above is stored in the experience replay buffer. The experience replay process is as follows: a learning sample (s_t, a_t, r_t, s_{t+1}) is taken from the experience replay buffer pool, and the value of the temporal-difference error (TD_error) is calculated as

δ_t = r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)

the stochastic gradient is then computed from δ_t, and the network parameters are updated by the corresponding gradient formula. The algorithm adopts an experience replay strategy based on importance sampling: the replay probability of each experience is ordered in descending fashion according to its temporal-difference error, so that the larger the temporal-difference error, the larger the probability of being sampled.
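A simple replay buffer with TD-error-proportional sampling along these lines could look like the following sketch; the proportional-priority scheme and the small constant eps are assumptions.

import numpy as np

class PrioritizedReplayBuffer:
    # Replay buffer whose sampling probability grows with the temporal-difference error.
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity, self.eps = capacity, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:          # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)                 # (s, a, r, s_next) quadruple
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()                  # larger TD error -> larger probability
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps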
Step 8: the iteration is continued to the set maximum number of iterations according to steps 5-7.
Referring to fig. 4:
the three-dimensional simulation environment is projected on a two-dimensional plane, a blue square in the figure represents the position of an obstacle, a red circle represents the position of a target point, and three irregular lines represent the path planning routes of three four-rotor unmanned aerial vehicles.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood and appreciated by those skilled in the art.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A deep reinforcement learning method and system for a multi-agent path planning environment, characterized by comprising the following steps:
step 1: constructing a three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle by means of a Pybullet development kit;
step 2: finishing a deep reinforcement learning algorithm based on a non-global curiosity network and an attention module, and initializing each intelligent agent;
step 3: constructing an environment rewarding function according to a path planning task target, and setting a target to be reached according to rules abstracted by a simulation environment;
step 4: setting the maximum iteration round and other parameters;
step 5: according to the Pybullet development kit, acquiring environment observation information in a simulation environment and communication information between the agents in the same team, processing state information, selecting actions to be executed, acquiring curiosity rewarding values of the agents, inputting the curiosity rewarding values into an attention network for further processing, and acquiring final rewarding values;
step 6: finer evaluation of parameters of the network and policy network;
step 7: acquiring new environment observation information, acquiring experience playback quadruples and storing the experience playback quadruples in a playback experience buffer;
step 8: and repeatedly executing the steps 5-7, and updating the neural network in the multi-agent reinforcement learning algorithm until the iteration number reaches the maximum iteration number, thereby realizing the path planning task in the simulation environment.
2. The three-dimensional path planning simulation system of the four-rotor unmanned aerial vehicle according to claim 1, wherein the four-rotor unmanned aerial vehicle is modeled according to the attribute of the four-rotor unmanned aerial vehicle by adopting ROS software, and intelligent agents of the four-rotor unmanned aerial vehicle are added in a Pybullet simulation environment, wherein each intelligent agent is completely the same except the initial position; the target unit is defined as spherical and is located behind an obstacle.
3. The non-global curiosity network of claim 1, wherein the non-global property is embodied in that, when calculating the curiosity reward, a single agent does not treat all other agents as part of the environment, but selects, according to the distances between agents, the state information of the agents that influence it as the environment state information; first, the current state s_t, the current action a_t, and the next true state s_{t+1} are all input into the curiosity module; the curiosity module comprises four sub-modules: two feature-extraction network modules for extracting the features φ(s_t) and φ(s_{t+1}) of the states; a forward model (Forward Model) for predicting the feature of s_{t+1} obtained by executing a_t in state s_t; and an inverse model for estimating a_t from φ(s_t) and φ(s_{t+1}); the curiosity reward is calculated from the similarity between the predicted feature of s_{t+1} and the true feature φ(s_{t+1}).
4. The attention module of claim 1, wherein the sequence of curiosity rewards of all agents, X = [x_1, …, x_N], is first input into the attention network, and the importance of the curiosity rewards of the different agents is learned through neural network processing; specifically, an attention variable z is adopted to represent, for the query variable q, the index position of the selected item; given q and X, the probability of selecting the i-th input information is α_i = p(z = i | X, q) = exp(s(x_i, q)) / Σ_{j=1}^{N} exp(s(x_j, q)), wherein α_i is called the attention distribution and s(x_i, q) is the attention scoring function.
5. The deep reinforcement learning algorithm of claim 1, comprising an Actor network and a Critic network, wherein the Critic's two networks are updated according to the temporal-difference error by minimizing L(θ^Q) = (1/N) Σ_{i=1}^{N} ( y_i − Q(s_i, a_i | θ^Q) )², wherein N is the number of samples and Q(s_i, a_i | θ^Q) is the expected return the agent can obtain from the state-action pair (s_i, a_i) until the end of the round, for the determined parameter value θ^Q; y_i is expressed as y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}) | θ^{Q′}), wherein r_i is the reward value of the i-th round and γ is the discount factor used to balance future returns against current returns; μ is the action output of the Actor network, whose parameters are updated by gradient ascent according to the Critic network's evaluation of the actions: ∇_{θ^μ} J ≈ (1/N) Σ_{i=1}^{N} ∇_a Q(s_i, a | θ^Q)|_{a=μ(s_i)} ∇_{θ^μ} μ(s_i | θ^μ), wherein, in addition to the parameters above, J is the objective to be maximized, ∇_{θ^μ} J is its gradient with respect to the determined parameters θ^μ, and μ is equivalent to the policy function π.
CN202310175856.1A 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment Pending CN116430891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310175856.1A CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310175856.1A CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Publications (1)

Publication Number Publication Date
CN116430891A true CN116430891A (en) 2023-07-14

Family

ID=87080380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310175856.1A Pending CN116430891A (en) 2023-02-28 2023-02-28 Deep reinforcement learning method oriented to multi-agent path planning environment

Country Status (1)

Country Link
CN (1) CN116430891A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117492446A (en) * 2023-12-25 2024-02-02 北京大学 Multi-agent cooperation planning method and system based on combination and mixing optimization


Similar Documents

Publication Publication Date Title
Zhu et al. A survey of deep rl and il for autonomous driving policy learning
Shi et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Chen et al. Stabilization approaches for reinforcement learning-based end-to-end autonomous driving
Das et al. A hybrid improved PSO-DV algorithm for multi-robot path planning in a clutter environment
Hong et al. Energy-efficient online path planning of multiple drones using reinforcement learning
Sun et al. Crowd navigation in an unknown and dynamic environment based on deep reinforcement learning
Rempe et al. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion
You et al. Target tracking strategy using deep deterministic policy gradient
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN112947081A (en) Distributed reinforcement learning social navigation method based on image hidden variable probability model
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Wu et al. Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation
CN116661503A (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Yan et al. Immune deep reinforcement learning-based path planning for mobile robot in unknown environment
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
Lei et al. Kb-tree: Learnable and continuous monte-carlo tree search for autonomous driving planning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
Peng et al. Moving object grasping method of mechanical arm based on deep deterministic policy gradient and hindsight experience replay
Gan et al. Multi-USV Cooperative Chasing Strategy Based on Obstacles Assistance and Deep Reinforcement Learning
CN115164890A (en) Swarm unmanned aerial vehicle autonomous motion planning method based on simulation learning
CN115016499A (en) Path planning method based on SCA-QL
Choi et al. Collision avoidance of unmanned aerial vehicles using fuzzy inference system-aided enhanced potential field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication