CN114237293A - Deep reinforcement learning formation transformation method and system based on dynamic target allocation - Google Patents

Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Info

Publication number
CN114237293A
Authority
CN
China
Prior art keywords
aircraft
target
formation
network
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111546506.9A
Other languages
Chinese (zh)
Other versions
CN114237293B (en)
Inventor
张毅
杨秀霞
高恒杰
杨林
陆巍巍
褚政
王宏
于浩
姜子劼
王晨蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202111546506.9A priority Critical patent/CN114237293B/en
Publication of CN114237293A publication Critical patent/CN114237293A/en
Application granted granted Critical
Publication of CN114237293B publication Critical patent/CN114237293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a deep reinforcement learning formation transformation method and system based on dynamic target allocation. The method comprises the following steps: determining a state space, an action space and a reward function; initializing network parameters, an experience pool and a training environment; judging whether the number of training rounds has reached the maximum; each aircraft starts in a certain initial formation; the optimal distribution target point of each aircraft is calculated, a detector detects friendly aircraft in the surroundings, and whether an aircraft needs obstacle avoidance or collision avoidance is judged according to the obstacle cone; the heading angles of the aircraft that need to avoid are calculated, each aircraft selects an action and enters the next state; a reward value is calculated; the system state, the action, the reward value and the next system state at that moment are stored in the experience pool as a group of tuple data; the network parameters are updated; whether r_s equals C_2 + C_3 is judged; training is finished and the formation transformation in the complex obstacle environment is completed. The method solves the problem that random target allocation in the formation transformation process easily produces locally optimal routes.

Description

Deep reinforcement learning formation transformation method and system based on dynamic target allocation
Technical Field
The invention relates to the field of deep reinforcement learning, in particular to a method and a system for transforming a deep reinforcement learning formation based on dynamic target allocation.
Background
In practical applications, a formation of multiple aircraft often needs to change its shape because of special tasks. Current formation transformation algorithms are mostly applied to formation changes of multiple aircraft in obstacle-free environments; when the environment becomes complicated, these algorithms suffer from low obstacle avoidance efficiency and long iteration times and easily fall into locally optimal solutions, so they are difficult to apply in complex obstacle environments.
Deep reinforcement learning algorithms are commonly used to solve intelligent decision problems in complex environments because of their excellent situation awareness and strong decision-making capability. For the formation change problem of multiple aircraft, when the number of obstacles in the environment increases, such an algorithm can make decisions quickly according to the current state, with fast reaction, strong collision avoidance capability and high flexibility; when the number of obstacles decreases, the end-to-end control mode produces small maneuvers and routes that are easier to track, and the positions of the formation transformation target points do not need to be given in advance, so the real-time performance is strong.
Therefore, on the basis of the traditional DDPG algorithm, a multi-aircraft deep reinforcement learning formation transformation algorithm based on dynamic target allocation is provided.
Disclosure of Invention
The invention aims to provide a method and a system for converting a deep reinforcement learning formation based on dynamic target allocation, which aim to solve the problem that local optimal routes are easy to generate due to random target allocation in the formation conversion process.
In order to achieve the purpose, the invention provides the following scheme:
a deep reinforcement learning formation transformation method based on dynamic target allocation, the transformation method comprises the following steps:
s1: determining a state space, an action space and a reward function;
s2: randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
s3: initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
s4: initializing an experience pool and a training environment;
s5: judging whether the number of training rounds has reached the maximum number of rounds, if so, executing step S13, and if not, proceeding to the next step;
s6: each aircraft starts in a certain initial formation and begins to change the formation at time t_0;
s7: calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, executing step S8, and otherwise returning to the previous step;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone, if so, executing the step S9, otherwise, returning to the step S7;
s9: calculating the course angle of the aircrafts needing to avoid the obstacle, selecting actions by each aircraft, and entering the next state;
s10: calculating a reward value according to the reward function in the next state;
s11: taking the system state, the action, the reward value and the next system state at the moment as a group of tuple data to be stored in an experience pool;
s12: randomly sampling a batch of tuple data from the experience pool, and sequentially updating the current Critic network, the current Actor network, the target Critic network and the target Actor network;
s13: judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to step S5, and if the condition does not hold, the process goes to step S7;
s14: and finishing training, and completing the formation transformation in the complex obstacle environment.
Optionally, the expression of the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_{i,t} is the heading angle of the ith aircraft at time t, and ψ_tar is the heading angle of the target formation;
the expression of the motion space is as follows:
a_t = [v_u, ω_u]
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping;
the expression of the reward function is as follows:
[reward function, not reproduced: a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
Optionally, the following formula is specifically adopted for calculating the optimal distribution target point of each aircraft:
max F = Σ_{i=1}^{N} ω_i·F_ii
subject to a conflict-free one-to-one assignment of aircraft to target points,
where, when aircraft U_i is successfully matched to its assigned target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0;
the efficiency function is calculated as:
[efficiency function, not reproduced: F_ij is computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
Optionally, the following formula is specifically adopted for calculating the heading angle of the aircraft required to avoid the obstacle:
[formula not reproduced: expression for the avoidance heading angle α_RVO]
where α_RVO is the heading angle required to avoid the obstacle and v_u is the speed of the aircraft that needs to avoid the obstacle.
Optionally, each aircraft selection action specifically adopts the following formula:
a_t = μ(s_t | θ^μ) + η_t
where μ(s_t | θ^μ) is the online actor network and η_t is random noise.
Optionally, the current Critic network and the current Actor network are updated specifically by the following formulas:
the online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network;
the online critic network is updated by minimizing a loss function, wherein the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where y_i is the target value of the current action;
wherein
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
Optionally, the updating of the target Critic network and the target Actor network specifically comprises:
updating the target network parameters in a soft updating mode, wherein the updating modes of the target actor network and the target critic network are respectively:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
The invention further provides a deep reinforcement learning formation transformation system based on dynamic target allocation, which is characterized by comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module, used for randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
a second initialization module, used for initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
a third initialization module, used for initializing an experience pool and a training environment;
a first judging module, used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the next module is executed;
a formation transformation module, used for each aircraft to start in a certain initial formation and to begin transforming the formation at time t_0;
an optimal distribution target point calculation module, used for calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, the second judgment module is executed, and otherwise the previous module is returned to;
the second judgment module is used for judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cone, if so, executing the course angle calculation module, and otherwise, returning to the optimal distribution target point calculation module;
the course angle calculation module is used for calculating course angles of the aircrafts needing to avoid the obstacle, and each aircraft selects an action and enters the next state;
the reward value calculation module is used for calculating a reward value according to the next state of the reward function;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module is used for randomly sampling batch metadata from the experience pool and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module, used for judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to the first judging module, and if the condition does not hold, the process goes to the optimal distribution target point calculation module; r_s is the space coordination reward and C_2, C_3 are constants;
and an output module, used for finishing training and completing the formation transformation in the complex obstacle environment.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method designs a dynamic target allocation algorithm to allocate the optimal nodes corresponding to the target formation for each aircraft, and solves the problem that the target allocation is random and local optimal routes are easy to generate in the formation conversion process; aiming at the problems that a traditional DDPG algorithm is easy to generate a local optimal path, time coordination is difficult to realize and the like, a multi-objective optimization problem of formation shape transformation of multiple aircrafts is converted into a reward function design problem, and a reward function based on comprehensive cost constraint of formation shape transformation is designed, so that the formation voyage cost marked by a calculation rule is minimum.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a horizontal in-line formation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a vertical in-line formation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an inverse triangle formation according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating triangle formation according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of obstacle avoidance by the velocity barrier method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a reciprocal velocity barrier collision method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for converting a deep reinforcement learning formation based on dynamic target allocation, which aim to solve the problem that local optimal routes are easy to generate due to random target allocation in the formation conversion process.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
According to the invention, the optimal nodes corresponding to the target formation in the target formation are distributed to each aircraft through the designed dynamic target distribution algorithm, so that the problem that the local optimal route is easily generated due to random target distribution in the formation conversion process is solved.
Aiming at the problems that the traditional DDPG algorithm easily produces locally optimal paths and has difficulty achieving time coordination, the multi-objective optimization problem of formation transformation of multiple aircraft is converted into a reward function design problem, and a reward function based on the comprehensive cost constraint of formation transformation is designed so that the formation voyage cost defined by the calculation rules is minimized. The specific method is as follows:
1. determining a kinematic model
The aircraft in the formation transformation problem is regarded as a particle motion model, and the motion process of the aircraft is controlled by its acceleration and heading angle. The equation of motion of an aircraft may be expressed as:
[formula (1), not reproduced: planar particle kinematics of the ith aircraft driven by its speed v_i, heading angle ψ_i and acceleration a]
where i = 1, 2, …, N, N is the number of aircraft, v_i denotes the speed of the ith aircraft in the XOY plane, ψ is the heading angle of the aircraft, and a denotes the acceleration of the aircraft. Considering the saturation constraints of the control inputs, the acceleration a and the heading angle ψ of the aircraft satisfy the following conditions:
[formula (2), not reproduced: saturation constraints bounding the acceleration a and the heading angle ψ]
where the specific constraint parameters depend on the aircraft model and its flight parameters.
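For illustration, a minimal discrete-time sketch of such a planar particle model is given below in Python. The explicit kinematics (ẋ = v·cosψ, ẏ = v·sinψ, v̇ = a) and the numerical saturation limits are assumptions of this sketch, since formulas (1) and (2) are given only as images in the original.

```python
import math
from dataclasses import dataclass

@dataclass
class AircraftState:
    x: float      # position in the XOY plane
    y: float
    v: float      # speed
    psi: float    # heading angle (rad)

def step(state: AircraftState, a: float, omega: float, dt: float,
         v_min=10.0, v_max=50.0, a_max=5.0, omega_max=0.5) -> AircraftState:
    """Advance the assumed particle model by one sampling period dt.

    a     : commanded acceleration, clipped to an assumed saturation limit
    omega : commanded heading angular velocity, clipped likewise
    The limits are placeholders; the patent only states that they depend on
    the aircraft model and its flight parameters.
    """
    a = max(-a_max, min(a_max, a))
    omega = max(-omega_max, min(omega_max, omega))
    v = max(v_min, min(v_max, state.v + a * dt))
    psi = state.psi + omega * dt
    x = state.x + v * math.cos(psi) * dt
    y = state.y + v * math.sin(psi) * dt
    return AircraftState(x, y, v, psi)
```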
2. Formation description
The queue form description method comprises the following steps:
B_i = {(B_mid, d_i, φ_i, v_tar, ψ_tar) | i = 1, 2, …, N}   (3)
where B_mid is the coordinate of the geometric center of the formation, d_i is the distance between the ith aircraft and the geometric center of the formation, φ_i is the bearing of the ith aircraft relative to the geometric center of the formation, v_tar is the speed of the target formation, and ψ_tar is the heading angle of the target formation.
3. Formation transformation cost constraint
(1) Aircraft kinematic constraints
Throughout the formation change, the heading angle and the heading angular velocity of the aircraft must remain within a certain range to satisfy the flight performance constraint J_uav of the aircraft, i.e.
ψ_min ≤ ψ ≤ ψ_max,  ψ̇_min ≤ ψ̇ ≤ ψ̇_max   (4)
where ψ_min and ψ_max are the minimum and maximum heading angles of the aircraft, and ψ̇_min and ψ̇_max are the minimum and maximum heading angular velocities of the aircraft.
(2) Temporal collaborative cost constraints
After the formation of the plurality of aircrafts is changed, the time for changing the formation of each aircraft is required to be the same. Thus, the cost of temporal synergy among the members of the formation can be expressed as
[formula (5), not reproduced: time coordination cost J_t, a function of the completion times Δt_i that is minimized when all aircraft complete the change simultaneously]
where J_t is the time coordination cost function, Δt_i is the time taken by the ith aircraft to complete the formation change, and t_0 is the moment at which the formation change begins.
(3) Spatial collaborative cost constraint
When the formation of the aircrafts forms the target formation after the formation is transformed, each aircraft is on a corresponding target point in the target formation, namely, the distance and the direction between each aircraft and the geometric center of the current formation meet the conditions of the target formation, and meanwhile, the speed and the course of each aircraft are consistent with the speed and the course of the target formation. Therefore, the spatial coordination cost of the formation transformation of the multiple aircraft formation is as follows:
[formula (6), not reproduced: spatial coordination cost J_s built from the speed, heading, distance and bearing errors of each aircraft with respect to the target formation]
where J_s is the spatial coordination cost function, Δv_{i,t} is the difference between the speed of the ith aircraft at time t and the speed of the target formation, Δψ_{i,t} is the difference between the heading of the ith aircraft at time t and the heading of the target formation, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation.
(4) Constraint of collision cost
The collision cost J_obs,i of the ith aircraft is divided into the static obstacle collision cost J_s_obs,i of the ith aircraft, the dynamic obstacle collision cost J_d_obs,i and the inter-aircraft collision cost J_uav,i. The overall formation collision cost constraint J_col is then:
[formula (7), not reproduced: overall collision cost J_col aggregating J_s_obs,i, J_d_obs,i and J_uav,i over all aircraft]
where the individual terms are:
[formula (8), not reproduced: piecewise definitions of J_s_obs,i, J_d_obs,i and J_uav,i in terms of the distances to static obstacles, dynamic obstacles and the other aircraft]
where d^i_{s,k} is the distance between the kth waypoint of the ith aircraft and the center of the static obstacle, R_{s,k} is the radius of the static obstacle threat zone at the kth waypoint, d^i_{d,k} is the distance between the kth waypoint of the ith aircraft and the dynamic obstacle, R_{d,k} is the radius of the dynamic obstacle threat circle at the kth waypoint, d_uav is the sum of the distances between the aircraft, (x_{d_obs}, y_{d_obs}) are the coordinates of the center of the static threat obstacle, and (x^i_k, y^i_k) are the coordinates of the kth waypoint of the ith aircraft.
(5) Minimum voyage cost constraint
The voyage flown by each aircraft in the formation to complete the formation transformation should be as small as possible, so the minimum voyage cost constraint is:
[formula (9), not reproduced: minimum voyage cost J_L summing the voyages L_i flown by the individual aircraft]
where J_L is the minimum voyage cost, L_i is the voyage flown by the ith aircraft to complete the formation change, and P^i_{k+1} is the (k+1)th waypoint of the ith aircraft.
(6) Formation transformation comprehensive cost constraint
The composite cost of the formation transformation of the multiple aircrafts is described as
J = W_1·J_t + W_2·J_s + W_3·J_col + W_4·J_L   (10)
where W_1, W_2, W_3 and W_4 are the corresponding weight coefficients, J_t is the time coordination cost, J_s is the spatial coordination cost, J_col is the overall formation collision cost, and J_L is the minimum voyage cost.
4. Dynamic target allocation algorithm design
The multi-aircraft dynamic target assignment algorithm may be described by the following model:
DTA=<B,U,T,F> (11)
where B is the set of task formations, B = (B_1, B_2, B_3, B_4); B_1 denotes the horizontal in-line formation, B_2 the longitudinal in-line formation, B_3 the inverted triangle formation and B_4 the triangle formation. U is the set of aircraft to which target points are to be assigned, U = (uav_1, uav_2, …, uav_n). T is the set of target points to be assigned under the current formation B_i, T = (T_1, T_2, …, T_n). F is the efficiency matrix of aircraft matched to their corresponding target points, of the form:
[formula (12), not reproduced: the n×n efficiency matrix F = [F_ij]]
where F_ij represents the efficiency of matching uav_i to target point T_j.
For convenience of describing the distance between members in the formation, the relative position relationship between two aircraft is represented by (dx, dy), wherein dx represents the transverse distance, and dy represents the longitudinal distance. Four formation forms in a dynamic target allocation algorithm are designed by taking formation composed of five aircrafts as an example.
(1) Horizontal in-line formation
In the horizontal in-line formation, as shown in fig. 2, the aircraft are arranged horizontally, so dy is 0. dx is adjusted according to actual needs. The horizontal in-line formation is mainly used for large-area search, the search range can be expanded by increasing the number of aircrafts or increasing the distance between aircrafts, and the efficiency of executing tasks is improved.
(2) Longitudinal in-line formation
In the vertical in-line formation, as shown in fig. 3, each aircraft is arranged vertically, so dx is 0. dy is adjusted according to actual needs. The longitudinal in-line formation is mainly used for tasks such as formation obstacle avoidance and the like.
(3) Reverse triangle formation
In the inverted triangle formation, as shown in fig. 4, the transverse distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The reverse triangle formation is mainly used for battle tasks such as interception and the like.
(4) Triangle formation
In the formation of triangles, as shown in fig. 5, the lateral distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The triangle formation is mainly used for battle tasks such as fire fighting.
In summary, the distances and orientations between the members in the formation form and the geometric center are shown in table 1.
TABLE 1 team formation library parameter settings
[Table 1, not reproduced: the distance d_i and bearing φ_i of each formation member relative to the geometric center for the four formation types]
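For concreteness, a five-aircraft formation library can be sketched as relative (dx, dy) offsets from which the distances d_i and bearings φ_i to the geometric center follow. The specific offsets below are illustrative assumptions, not the values of Table 1.

```python
import math

def formation_offsets(kind: str, dx: float = 50.0, dy: float = 50.0):
    """Return illustrative (distance, bearing) pairs of five aircraft relative
    to the formation geometric center for the formation types B1..B4.
    The spacings dx, dy are placeholders to be tuned per task."""
    if kind == "horizontal":          # B1: aircraft side by side, dy = 0
        pts = [(k * dx, 0.0) for k in (-2, -1, 0, 1, 2)]
    elif kind == "longitudinal":      # B2: aircraft in trail, dx = 0
        pts = [(0.0, k * dy) for k in (-2, -1, 0, 1, 2)]
    elif kind == "inverted_triangle": # B3: assumed wedge opening forward
        pts = [(0.0, 0.0), (-dx, -dy), (dx, -dy), (-2 * dx, -2 * dy), (2 * dx, -2 * dy)]
    elif kind == "triangle":          # B4: assumed wedge opening backward
        pts = [(0.0, 0.0), (-dx, dy), (dx, dy), (-2 * dx, 2 * dy), (2 * dx, 2 * dy)]
    else:
        raise ValueError(kind)
    # re-center so the geometric center is exactly at the origin
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    pts = [(x - cx, y - cy) for x, y in pts]
    # distance d_i and bearing phi_i of each member w.r.t. the geometric center
    return [(math.hypot(x, y), math.atan2(y, x)) for x, y in pts]
```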
To achieve formation keeping, the aircraft in a formation need to maintain communication with each other. If continuous communication cannot be maintained between aircraft, formation disorder and even collisions may occur. To achieve communication between members in a formation, a communication topology needs to be established.
In the four formation forms, each aircraft needs to communicate with other formation members to determine the position of the geometric center of the formation, so that the formations must communicate by adopting an intercommunicating structure. Based on the analysis, the formation network topology is designed to be a fully-connected topology structure.
Under the condition that no target node conflict exists between the aircraft, the target node allocation scheme is optimal when the sum of the efficiencies of all aircraft performing the formation transformation to their assigned target nodes is maximal, i.e. the optimal allocation scheme is described as:
max F = Σ_{i=1}^{N} ω_i·F_ii   (13)
where, when aircraft U_i is successfully matched to its target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0.
The efficiency function calculation formula is as follows:
[formula (14), not reproduced: efficiency function F_ij computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
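A brute-force sketch of the optimal assignment of formula (13) is shown below. The efficiency F_ij is assumed to decrease with the weighted distance and angle differences of formula (14) (whose exact form is not reproduced in the text), and the exhaustive search over permutations is only practical for small formations such as the five-aircraft example; the Hungarian algorithm would be the scalable alternative.

```python
import math
from itertools import permutations

def efficiency(uav_pos, uav_angle_to_center, target_pos, target_angle_to_center,
               xi1=1.0, xi2=1.0):
    """Assumed efficiency F_ij: larger when the aircraft is closer to the target
    point and when their bearings to the respective formation centers agree.
    The patent's formula (14) combines the same two terms with weights xi1, xi2."""
    d = math.dist(uav_pos, target_pos)
    dphi = abs(uav_angle_to_center - target_angle_to_center)
    return 1.0 / (1.0 + xi1 * d + xi2 * dphi)

def assign_targets(F):
    """Formula (13): choose the one-to-one assignment maximizing sum_i F[i][perm[i]]."""
    n = len(F)
    best, best_perm = -math.inf, None
    for perm in permutations(range(n)):
        s = sum(F[i][perm[i]] for i in range(n))
        if s > best:
            best, best_perm = s, perm
    return list(best_perm)   # best_perm[i] = index of the target assigned to aircraft i
```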
Fig. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention, where the method shown in fig. 1 includes:
S1: the formation transformation problem in an uncertain environment is modeled as a Markov decision process, and the state space (formula (15)), the action space (formula (17)) and the reward function (formula (18)) are designed. The optimal formation transformation route is obtained by solving this Markov decision process.
Wherein the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}   (15)
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar   (16)
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, and ψ_{i,t} is the heading angle of the ith aircraft at time t.
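A minimal sketch of the per-aircraft state of formulas (15)–(16) follows; encoding the formation pair B_jB_k as a numeric identifier is an assumption of this sketch.

```python
def build_state(formation_id, d_it, d_target, phi_it, phi_target,
                v_it, v_tar, psi_it, psi_tar):
    """State s_{i,t} = [BjBk, Δd, Δφ, Δv, Δψ] for the ith aircraft at time t."""
    return [formation_id,
            d_it - d_target,      # Δd_{i,t}
            phi_it - phi_target,  # Δφ_{i,t}
            v_it - v_tar,         # Δv_{i,t}
            psi_it - psi_tar]     # Δψ_{i,t}
```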
The motion space is as follows:
a_t = [v_u, ω_u]   (17)
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping.
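The normalized action of formula (17) can be sketched as a linear mapping between the physical ranges and [−1, 1]; the linear form is an assumption, since the mapping expressions are given as an image in the original.

```python
def to_physical(a_norm, lo, hi):
    """Map a normalized action in [-1, 1] back to the physical range [lo, hi]."""
    return lo + (a_norm + 1.0) * 0.5 * (hi - lo)

def decode_action(a_t, v_min, v_max, omega_min, omega_max):
    """a_t = [v_u, omega_u], both components in [-1, 1]."""
    v_u, omega_u = a_t
    return to_physical(v_u, v_min, v_max), to_physical(omega_u, omega_min, omega_max)
```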
The reward function is as follows:
[formula (18), not reproduced: the reward function, a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
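Because the piecewise expressions of formula (18) are given only as an image, the sketch below merely illustrates the structure described in the text: a space coordination reward r_s that reaches C_2 + C_3 when both the distance and the bearing conditions of the target formation are met (this is the termination test of step S13), plus an aggregation of the four reward terms. The thresholds, shapes and weights are all assumptions.

```python
def space_reward(delta_d, delta_phi, C2=10.0, C3=10.0, eps_d=1.0, eps_phi=0.05):
    """Assumed r_s: C2 once the distance error is small enough, plus C3 once the
    bearing error is small enough, so r_s == C2 + C3 signals the slot is reached."""
    r = 0.0
    if abs(delta_d) < eps_d:
        r += C2
    if abs(delta_phi) < eps_phi:
        r += C3
    return r

def total_reward(r_t, r_s, r_col, r_L, xi=(1.0, 1.0, 1.0, 1.0)):
    """Assumed aggregation of the four reward terms with weights xi1..xi4."""
    return xi[0] * r_t + xi[1] * r_s + xi[2] * r_col + xi[3] * r_L
```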
S2: Randomly initialize the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q).
Note: the DDPG network architecture consists of an online actor network, a target actor network, an online critic network and a target critic network.
The four neural network updating modes of the deep deterministic strategy gradient algorithm DDPG are as follows:
The online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}   (19)
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network.
The online critic network is updated by minimizing a loss function, where the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²   (20)
where y_i is the target value of the current action.
Here
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})   (21)
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
The DDPG algorithm updates the target network parameters in a soft updating mode; the target actor network and the target critic network are updated respectively as:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}   (22)
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
A behavior policy is introduced: random noise η_t is added to the output action of the online actor network, changing the deterministic action performed by the agent into a stochastic action a_t:
a_t = μ(s_t | θ^μ) + η_t   (23)
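The four network updates of formulas (19)–(23) correspond to a standard DDPG step. A condensed PyTorch sketch is given below; the network classes, the critic signature critic(s, a), the optimizers and the noise scale are assumptions of this sketch rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update: critic loss (20)-(21), actor gradient (19), soft update (22)."""
    s, a, r, s_next = batch  # tensors sampled from the experience pool

    # target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))          (21)
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # update the online critic by minimizing the mean squared error loss  (20)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # update the online actor along the deterministic policy gradient     (19)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft-update the target networks                                     (22)
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def explore_action(actor, s, noise_std=0.1):
    """Behavior policy of formula (23): a_t = mu(s_t) + eta_t."""
    with torch.no_grad():
        a = actor(s)
        return a + noise_std * torch.randn_like(a)
```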
S3: Initialize the target actor network μ′ and the target critic network Q′ with their weights θ^{μ′} and θ^{Q′}, and copy the parameters of the corresponding online networks to each target network.
S4: and initializing an experience pool and initializing a training environment.
S5: it is determined whether the number of training rounds has reached a maximum number of rounds and if so, processing proceeds to process 13. If not, go to S5.
S6: each aircraft starting with an initial formation, t0And the time begins to change the formation.
S7: Calculate the optimal distribution target point of each aircraft according to formulas (13) and (14); each aircraft flies toward its target point using the exploratory action of formula (23), while a detector detects friendly aircraft in the surroundings. If a friendly aircraft is detected, the process proceeds to S8; otherwise, the process repeats S7.
S8: and judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cones. If obstacle avoidance or collision avoidance is required, the process proceeds to S9, otherwise, the process proceeds to S7.
The obstacle avoidance strategy is described as follows:
As shown in fig. 6, when the relative velocity vector v_uo lies in the region ΔP_oP_uL_1, the heading angle needs to be increased within the collision avoidance time t_i so that α_i > α_o; when the relative velocity vector v_uo lies in the region ΔP_oP_uL_2, the heading angle needs to be decreased so that α_i > α_o.
Wherein, the collision avoidance time
Figure BDA0003415903140000182
αiThe relationship with α is determined by equation (11).
Figure BDA0003415903140000191
Wherein y 'and x' are
Figure BDA0003415903140000192
Wherein beta is a dynamic obstacle course angle.
Let the heading angle of an aircraft at a certain moment be α. In the next sampling period δt of the flight, the set A_α of possible heading angles is A_α = {α′ | α′ ∈ (α + ω_min·δt, α + ω_max·δt)}, and the set of relative velocity vectors generated from it is denoted V_uo. After the dynamic collision avoidance region VO has been calculated by the velocity barrier method, the relative velocity vectors in V_uo that would lead to a collision are removed, giving the relative velocities that successfully avoid the obstacle; the action is then selected from these to complete the obstacle avoidance.
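A sketch of this velocity-barrier screening follows: candidate heading angles reachable in the next sampling period are kept only if the resulting relative velocity points outside the collision cone toward the obstacle. The threat radius, the candidate discretization and the cone test details are assumptions of this sketch.

```python
import math

def heading_escapes_cone(p_uav, p_obs, v_obs, speed, heading, r_threat):
    """True if flying at `speed` along `heading` keeps the relative velocity
    outside the collision cone toward the obstacle (velocity-barrier test)."""
    dx, dy = p_obs[0] - p_uav[0], p_obs[1] - p_uav[1]
    dist = math.hypot(dx, dy)
    if dist <= r_threat:
        return False                      # already inside the threat circle
    half_angle = math.asin(r_threat / dist)
    v_rel = (speed * math.cos(heading) - v_obs[0],
             speed * math.sin(heading) - v_obs[1])
    # angle between the relative velocity and the line of sight to the obstacle
    ang = abs(math.atan2(v_rel[1], v_rel[0]) - math.atan2(dy, dx))
    ang = min(ang, 2 * math.pi - ang)
    return ang > half_angle

def admissible_headings(p_uav, p_obs, v_obs, speed, heading,
                        omega_min, omega_max, dt, r_threat, n=21):
    """Discretize the reachable heading set A_alpha and keep collision-free headings."""
    cands = [heading + omega_min * dt + k * (omega_max - omega_min) * dt / (n - 1)
             for k in range(n)]
    return [h for h in cands
            if heading_escapes_cone(p_uav, p_obs, v_obs, speed, h, r_threat)]
```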
The collision avoidance strategy is described as follows:
as shown in FIG. 6, the reciprocal velocity barrier cone RVO may be translated by the collision cone CC
Figure BDA0003415903140000193
Thus obtaining the product. In order to realize obstacle avoidance, uav is needed2Velocity vector of
Figure BDA0003415903140000194
Deflecting out a reciprocal velocity barrier cone RVO. Suppose when uav2Velocity vector of
Figure BDA0003415903140000195
Just deflecting out the reciprocal speedAngle of rotation of the right cone RVO is alphaRVOThe velocity vector is
Figure BDA0003415903140000196
Then according to the geometric relationship in the figure
Figure BDA0003415903140000197
In the formula (I), the compound is shown in the specification,
Figure BDA0003415903140000198
from the operational relationship between vectors
Figure BDA0003415903140000199
In the formula, | · the luminance | |2A two-norm representation of a is,
Figure BDA00034159031400001910
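A sketch of this reciprocal-velocity check between two friendly aircraft follows: the RVO is the collision cone translated by the average of the two velocities, and the deflection angle α_RVO is the extra rotation needed for the own velocity, seen from that apex, to just leave the cone. The apex translation by (v1 + v2)/2 follows the standard RVO construction; since the original derivation is given only as images, the details may differ.

```python
import math

def rvo_deflection_angle(p1, v1, p2, v2, r_threat):
    """Minimal extra heading rotation for aircraft 1 so that its velocity,
    seen from the RVO apex (v1 + v2)/2, leaves the reciprocal velocity cone.
    Returns 0.0 if the current velocity is already outside the cone."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    if dist <= r_threat:
        return math.pi                      # already in conflict, turn hard
    half_angle = math.asin(r_threat / dist)
    apex = ((v1[0] + v2[0]) / 2.0, (v1[1] + v2[1]) / 2.0)
    rel = (v1[0] - apex[0], v1[1] - apex[1])  # = (v1 - v2) / 2
    ang = abs(math.atan2(rel[1], rel[0]) - math.atan2(dy, dx))
    ang = min(ang, 2 * math.pi - ang)
    return max(0.0, half_angle - ang)       # alpha_RVO
```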
s9: and (4) calculating the heading angle of the aircraft needing to avoid the obstacle according to the formula (12) or the formula (28), selecting the action by each aircraft according to the formula (23), and entering the next state.
S10: in the next system state, the prize value is calculated according to equation (18).
S11: and storing the system state, the action, the reward value and the next system state at the moment into an experience pool as a group of tuple data.
S12: and randomly sampling batch tuple data from the experience pool, and updating the current criticic network, the current Actor network and the target network in sequence according to the formula (20), the formula (19) and the formula (22).
S13: judgment of rsWhether or not it is C2+C3If the condition is satisfied, the current round is ended, and the process goes to S5. If the condition is not satisfied, the flow proceeds to S7.
S14: and finishing training, and finishing team form transformation in the complex obstacle environment.
Fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention, and as shown in fig. 8, the system includes:
a state space, action space and reward function determination module 201, configured to determine a state space, an action space and a reward function;
a first initialization module 202, used for randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
a second initialization module 203, used for initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
a third initialization module 204, configured to initialize the experience pool and the training environment;
a first judging module 205, configured to judge whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the next module is executed;
a formation change module 206, used for each aircraft to start in an initial formation and to begin changing the formation at time t_0;
an optimal distribution target point calculation module 207, used for calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, the second judgment module is executed, and otherwise the previous module is returned to;
a second judgment module 208, used for judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone; if obstacle avoidance or collision avoidance is needed, the course angle calculation module is executed, and otherwise the optimal distribution target point calculation module is returned to;
a course angle calculation module 209, configured to calculate a course angle at which the aircraft needs to avoid the obstacle, select an action for each aircraft, and enter a next state;
a reward value calculation module 210, configured to calculate a reward value according to the reward function in a next state;
the storage module 211 is configured to store the system state, the action, the reward value, and the next system state at this time as a set of tuple data in the experience pool;
an updating module 212, configured to randomly sample a batch of tuple data from the experience pool and sequentially update the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module 213, used for judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to the first judging module, and if the condition does not hold, the process goes to the optimal distribution target point calculation module; r_s is the space coordination reward and C_2, C_3 are constants;
and an output module 214 finishes training and completes the formation transformation in the complex obstacle environment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A deep reinforcement learning formation transformation method based on dynamic target allocation is characterized by comprising the following steps:
s1: determining a state space, an action space and a reward function;
s2: randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
s3: initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
s4: initializing an experience pool and a training environment;
s5: judging whether the number of training rounds has reached the maximum number of rounds, if so, executing step S13, and if not, proceeding to the next step;
s6: each aircraft starts in a certain initial formation and begins to change the formation at time t_0;
s7: calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, executing step S8, and otherwise returning to the previous step;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone, if so, executing the step S9, otherwise, returning to the step S7;
s9: calculating the course angle of the aircrafts needing to avoid the obstacle, selecting actions by each aircraft, and entering the next state;
s10: calculating the reward value in the next state according to the reward function;
s11: taking the system state, the action, the reward value and the next system state at the moment as a group of tuple data to be stored in an experience pool;
s12: randomly sampling a batch of tuple data from the experience pool, and sequentially updating the current Critic network, the current Actor network, the target Critic network and the target Actor network;
s13: judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to step S5, and if the condition does not hold, the process goes to step S7;
s14: and finishing training, and completing the formation transformation in the complex obstacle environment.
2. The method for transforming the formation of deep reinforcement learning based on dynamic target allocation according to claim 1, wherein the expression of the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_{i,t} is the heading angle of the ith aircraft at time t, and ψ_tar is the heading angle of the target formation;
the expression of the motion space is as follows:
a_t = [v_u, ω_u]
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping;
the expression of the reward function is as follows:
[reward function, not reproduced: a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
3. The method for transforming the deep reinforcement learning formation based on the dynamic target allocation according to claim 1, wherein the following formula is specifically adopted for calculating the optimal allocation target point of each aircraft:
max F = Σ_{i=1}^{N} ω_i·F_ii
subject to a conflict-free one-to-one assignment of aircraft to target points,
where, when aircraft U_i is successfully matched to its assigned target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0;
the efficiency function is calculated as:
[efficiency function, not reproduced: F_ij is computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
4. The method as claimed in claim 1, wherein the calculation of the heading angle of the aircraft requiring obstacle avoidance specifically employs the following formula:
[formula not reproduced: expression for the avoidance heading angle α_RVO]
where α_RVO is the heading angle required to avoid the obstacle and v_u is the speed of the aircraft that needs to avoid the obstacle.
5. The method according to claim 1, wherein the aircraft selection actions specifically adopt the following formula:
a_t = μ(s_t | θ^μ) + η_t
where μ(s_t | θ^μ) is the online actor network and η_t is random noise.
6. The method for transforming the deep reinforcement learning formation based on dynamic target allocation according to claim 1, wherein the following formulas are specifically adopted for updating the current Critic network and the current Actor network:
the online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network;
the online critic network is updated by minimizing a loss function, wherein the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where y_i is the target value of the current action;
wherein
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
7. The method of claim 6, wherein the updating of the target Critic network and the target Actor network specifically comprises:
updating the target network parameters in a soft updating mode, wherein the updating modes of the target actor network and the target critic network are respectively:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
8. A system for deep reinforcement learning formation transformation based on dynamic target allocation, the system comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module for randomly initializing the on-line operator network Q (s, a | theta)Q) Network parameter θ ofμAnd on-line criticic network mu (s | theta)μ) Network parameter θ ofQ
a second initialization module for initializing the network parameter θ^μ′ of the target actor network and the network parameter θ^Q′ of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
the third initialization module is used for initializing an experience pool and a training environment;
the first judging module is used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the process returns to the previous module;
the formation transformation module is used for each aircraft to start from a given initial formation and begin transforming the formation at time t = 0;
the optimal allocation target point calculation module is used for calculating the optimal allocation target point of each aircraft; each aircraft explores and flies towards its target point while its detector detects own-side aircraft in the vicinity; if an own-side aircraft is detected, the second judging module is executed, and if not, the process returns to the previous module;
the second judging module is used for judging, according to the obstacle cone, whether the aircraft needs obstacle avoidance or collision avoidance; if so, the course angle calculation module is executed, and if not, the process returns to the optimal allocation target point calculation module;
the course angle calculation module is used for calculating the course angles of the aircraft requiring obstacle avoidance; each aircraft selects an action and enters the next state;
the reward value calculation module is used for calculating a reward value in the next state according to the reward function;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module is used for randomly sampling a batch of tuple data from the experience pool and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module for judging whether r_s equals C_2 + C_3; if the condition is satisfied, the current round ends and the process goes to the first judging module, and if not, the process goes to the optimal allocation target point calculation module, where r_s is the spatial cooperation reward and C_2, C_3 are constants;
and an output module for finishing the training and completing the formation transformation in a complex obstacle environment.
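Purely as an illustration of how the modules above fit together, the following skeleton runs the implied training flow: per episode the formation starts from its initial shape, targets are reallocated as the aircraft fly, avoidance overrides the course when needed, transitions are stored in the experience pool, and the networks are updated from sampled minibatches. Every helper name (env.reset, agent.assign_targets, replay_buffer.sample, ...) is a placeholder, not an interface defined by the patent.

def train(env, agent, replay_buffer, max_episodes, max_steps, batch_size=128):
    for episode in range(max_episodes):                   # first judging module
        state = env.reset()                               # formation transformation module (t = 0)
        for step in range(max_steps):
            agent.assign_targets(state)                   # optimal allocation target point module
            action = agent.select_action(state)           # course angle calculation / action selection
            next_state, reward, done = env.step(action)   # reward value calculation module
            replay_buffer.add((state, action, reward, next_state))   # storage module
            if len(replay_buffer) >= batch_size:
                agent.update(replay_buffer.sample(batch_size))        # updating module
            state = next_state
            if done:                                      # third judging module: r_s == C_2 + C_3
                break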
CN202111546506.9A 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation Active CN114237293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111546506.9A CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111546506.9A CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Publications (2)

Publication Number Publication Date
CN114237293A true CN114237293A (en) 2022-03-25
CN114237293B CN114237293B (en) 2023-08-25

Family

ID=80757404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111546506.9A Active CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Country Status (1)

Country Link
CN (1) CN114237293B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190217476A1 (en) * 2018-01-12 2019-07-18 Futurewei Technologies, Inc. Robot navigation and object tracking
US20200174471A1 (en) * 2018-11-30 2020-06-04 Denso International America, Inc. Multi-Level Collaborative Control System With Dual Neural Network Planning For Autonomous Vehicle Control In A Noisy Environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN111880563A (en) * 2020-07-17 2020-11-03 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN111880567A (en) * 2020-07-31 2020-11-03 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI YUE; HAN WEI; ZHONG WEIGUO: "A Brief Analysis of Key Technologies for Track Control of Manned/Unmanned Aircraft Cooperative Systems", Unmanned Systems Technology, no. 04 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115576353A (en) * 2022-10-20 2023-01-06 北京理工大学 Aircraft formation control method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN114237293B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN110456823B (en) Double-layer path planning method aiming at unmanned aerial vehicle calculation and storage capacity limitation
CN110134140B (en) Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
Ali et al. Cooperative path planning of multiple UAVs by using max–min ant colony optimization along with cauchy mutant operator
CN111897316B (en) Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
Li et al. Trajectory planning for UAV based on improved ACO algorithm
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111240319A (en) Outdoor multi-robot cooperative operation system and method thereof
CN108398960B (en) Multi-unmanned aerial vehicle cooperative target tracking method for improving combination of APF and segmented Bezier
CN116257082B (en) Distributed active cooperative detection method for multiple unmanned aerial vehicles
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN114625150B (en) Rapid ant colony unmanned ship dynamic obstacle avoidance method based on danger coefficient and distance function
CN113848974A (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN114237293A (en) Deep reinforcement learning formation transformation method and system based on dynamic target allocation
CN114967721A (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN112001120B (en) Spacecraft-to-multi-interceptor autonomous avoidance maneuvering method based on reinforcement learning
Liu et al. Multiple UAV formations delivery task planning based on a distributed adaptive algorithm
CN113064422A (en) Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
Zhu et al. A cooperative task assignment method of multi-UAV based on self organizing map
CN115951711A (en) Unmanned cluster multi-target searching and catching method in high sea condition environment
CN115755975A (en) Multi-unmanned aerial vehicle cooperative distributed space searching and trajectory planning method and device
CN115933637A (en) Path planning method and device for substation equipment inspection robot and storage medium
Wu et al. A multi-critic deep deterministic policy gradient UAV path planning
CN115542921A (en) Autonomous path planning method for multiple robots
Ma et al. Strategy generation based on reinforcement learning with deep deterministic policy gradient for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant