CN114237293A - Deep reinforcement learning formation transformation method and system based on dynamic target allocation - Google Patents
Deep reinforcement learning formation transformation method and system based on dynamic target allocation
- Publication number: CN114237293A
- Application number: CN202111546506.9A
- Authority: CN (China)
- Prior art keywords: aircraft, target, formation, network, ith
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a deep reinforcement learning formation transformation method and system based on dynamic target allocation, wherein the method comprises the following steps: determining a state space, an action space and a reward function; initializing network parameters, an experience pool and a training environment; judging whether the number of training rounds has reached the maximum; each aircraft starting in a given initial formation; calculating the optimal allocated target point of each aircraft, each aircraft's detector detecting the friendly aircraft around it, and judging from the obstacle cone whether the aircraft needs obstacle avoidance or collision avoidance; calculating the heading angles of the aircraft that need to avoid obstacles, each aircraft selecting an action and entering the next state; calculating a reward value; storing the current system state, the action, the reward value and the next system state in the experience pool as a group of tuple data; updating the network parameters; judging whether rs equals C2 + C3; and finishing training, completing the formation transformation in a complex obstacle environment. The method solves the problem that random target allocation during formation transformation easily produces locally optimal routes.
Description
Technical Field
The invention relates to the field of deep reinforcement learning, and in particular to a deep reinforcement learning formation transformation method and system based on dynamic target allocation.
Background
In practical applications, a formation of multiple aircraft often needs to change its shape for special tasks. Most existing formation transformation algorithms are applied to multi-aircraft formation transformation in obstacle-free environments; when the environment becomes complex, these algorithms suffer from low obstacle-avoidance efficiency and long iteration times and easily fall into locally optimal solutions, making them difficult to apply in complex obstacle environments.
Deep reinforcement learning algorithms are commonly used to solve intelligent decision problems in complex environments because of their excellent situation-awareness and strong decision-making capability. For the formation-change problem of multiple aircraft formations, when obstacles in the environment increase, such an algorithm can make decisions quickly according to the current state, with fast reaction speed, strong collision-avoidance capability and high flexibility; when obstacles in the environment decrease, the end-to-end control mode produces small maneuvers and a planned route that is easier to track, and no formation-transformation target point positions need to be given in advance, giving strong real-time performance.
Therefore, on the basis of the traditional DDPG algorithm, a multi-aircraft deep reinforcement learning formation transformation algorithm based on dynamic target allocation is proposed.
Disclosure of Invention
The invention aims to provide a deep reinforcement learning formation transformation method and system based on dynamic target allocation, so as to solve the problem that random target allocation during formation transformation easily produces locally optimal routes.
In order to achieve the purpose, the invention provides the following scheme:
a deep reinforcement learning formation transformation method based on dynamic target allocation, the transformation method comprising the following steps:
s1: determining a state space, an action space and a reward function;
s2: randomly initializing the network parameter θQ of the online critic network Q(s, a|θQ) and the network parameter θμ of the online actor network μ(s|θμ);
S3: initializing the network parameter θμ′ of the target actor network and the network parameter θQ′ of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
s4: initializing an experience pool and a training environment;
s5: judging whether the number of training rounds has reached the maximum number of rounds; if so, executing step S14, and if not, executing step S6;
s6: each aircraft starts from a certain initial formation and begins to change the formation at time t0;
s7: calculating the optimal allocated target point of each aircraft; each aircraft flies toward its target point while its detector scans for friendly aircraft around it; if a friendly aircraft is detected, executing step S8, otherwise repeating step S7;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone, if so, executing the step S9, otherwise, returning to the step S7;
s9: calculating the course angles of the aircraft that need to avoid obstacles; each aircraft selects an action and enters the next state;
s10: calculating a reward value according to the reward function in the next state;
s11: storing the current system state, the action, the reward value and the next system state in the experience pool as a group of tuple data;
s12: randomly sampling a mini-batch of tuple data from the experience pool, and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
s13: judging whether rs equals C2 + C3; if so, the current round ends and the process goes to step S5; if not, the process goes to step S7;
s14: finishing the training, completing the formation transformation in the complex obstacle environment.
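The control flow of steps S1 to S14 can be sketched as a training loop. The environment class, the random action selection and all names below are illustrative stand-ins (the actor/critic networks, obstacle cones and reward terms of the patent are omitted), not the claimed implementation:

```python
import random

class StubFormationEnv:
    """Toy stand-in environment: the state is a list of per-aircraft gaps
    to the assigned target slot; the episode ends when all gaps are small."""
    def __init__(self, n_aircraft=5):
        self.n = n_aircraft

    def reset(self):
        return [random.uniform(0.5, 1.0) for _ in range(self.n)]

    def step(self, state, action):
        # Each action component in [-1, 1] nudges one aircraft toward its slot.
        next_state = [max(0.0, s - 0.1 * (a + 1.0)) for s, a in zip(state, action)]
        reward = -sum(next_state)                 # closer to the slots -> higher reward
        done = all(s < 0.05 for s in next_state)  # stand-in for the rs = C2 + C3 test
        return next_state, reward, done

def train(episodes=3, steps=50, seed=0):
    random.seed(seed)
    env = StubFormationEnv()
    replay = []                              # experience pool (S4 / S11)
    for _ in range(episodes):                # S5: loop until the maximum round count
        state = env.reset()                  # S6: start from an initial formation
        for _ in range(steps):
            # S9: action = policy + exploration noise (random policy here)
            action = [random.uniform(-1, 1) for _ in state]
            next_state, reward, done = env.step(state, action)   # S10
            replay.append((state, action, reward, next_state))   # S11
            # S12 would sample a mini-batch here and update the four networks.
            state = next_state
            if done:                         # S13: per-round termination test
                break
    return replay

buffer = train()
```

The four-element tuples collected in `buffer` correspond to the (state, action, reward, next state) groups stored in the experience pool in step S11.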
Optionally, the expression of the state space is as follows:
In the formula, BjBk denotes the change from an initial formation Bj to a target formation Bk, and Δdi·t, Δφi·t, Δvi·t, Δψi·t are given by:
Δdi·t = di·t − di′, Δφi·t = φi·t − φi′, Δvi·t = vi·t − vtar, Δψi·t = ψi·t − ψtar
wherein di·t is the distance between the ith aircraft and the geometric center of the current formation at time t, di′ is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φi·t is the orientation of the ith aircraft relative to the geometric center of the current formation at time t, φi′ is the orientation of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, vi·t is the speed of the ith aircraft at time t, vtar is the speed of the target formation, ψi·t is the heading angle of the ith aircraft at time t, and ψtar is the heading angle of the target formation;
the expression of the motion space is as follows:
In the formula, vmax is the maximum speed of the aircraft, vmin is the minimum speed of the aircraft, ωmax is the maximum angular velocity of the aircraft, ωmin is the minimum angular velocity of the aircraft, vu, ωu are respectively the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v, ω are the speed and angular velocity of the aircraft before mapping;
the expression of the reward function is as follows:
In the formula, rt is the time-coordination reward, rs is the space-coordination reward, rcol is the collision-avoidance and obstacle-avoidance reward, rL is the minimum-voyage reward, Δti is the time for the ith aircraft to complete the formation change, ti is the moment at which the ith aircraft completes the formation change, t0 is the moment at which the formation begins to change, vi·t is the speed of the ith aircraft at time t, ψi·t is the heading of the ith aircraft at time t, vtar is the speed of the target formation, ψtar is the heading angle of the target formation, di·t is the distance between the ith aircraft and the geometric center of the current formation at time t, di is the distance between the target position of the ith aircraft and the geometric center of the target formation, φi·t is the orientation of the ith aircraft relative to the geometric center of the current formation at time t, φi is the orientation of the target position of the ith aircraft relative to the geometric center of the target formation, ψRVO is the collision-avoidance heading calculated for the ith aircraft by the reciprocal velocity obstacle method, ψVO is the obstacle-avoidance heading calculated for the ith aircraft by the velocity obstacle method, Pti is the position of the ith aircraft, Pob is the position of the obstacle, C1, C2, C3, C4 are constants, and ξ1, ξ2, ξ3, ξ4 are the corresponding weight coefficients.
Optionally, the following formula is specifically adopted for calculating the optimal distribution target point of each aircraft:
wherein, when the aircraft Ui is successfully matched to the assigned target point Ti, the efficiency function Fii is calculated by the target-node efficiency function, and the corresponding weight ωi = 1; otherwise ωi = 0;
The efficiency function calculation formula is as follows:
In the formula, ξ1, ξ2 are respectively weight coefficients, Δdijt is the distance between the current position of the ith aircraft at time t and the assigned target point Tj, Δφijt is the difference between the angle from the ith aircraft to the center point of the current formation and the angle from the target point to the center point of the target formation at time t, (xit, yit) is the position of the ith aircraft at time t, (xTj, yTj) is the coordinate of the target point Tj assigned to the ith aircraft, (xmid, ymid) is the coordinate of the center point of the current formation, Δφtu_mid is the angle between the ith aircraft and the center point of the current formation at time t, and ΔφT_mid is the angle between the target point Tj and the center point of the target formation.
Optionally, the following formula is specifically adopted for calculating the heading angle of the aircraft required to avoid the obstacle:
wherein alpha isRVOFor heading angle, v, required to avoid obstaclesuThe speed of the aircraft needing obstacle avoidance.
Optionally, each aircraft selects an action using the following formula:
at = μ(st|θμ) + ηt
wherein μ(st|θμ) is the online actor network and ηt is random noise for exploration.
Optionally, the current critic network and the current actor network are updated using the following formulas:
The online actor network is updated with the policy gradient:
∇θμ J ≈ (1/N) Σi ∇a Q(s, a|θQ)|s=si, a=μ(si) ∇θμ μ(s|θμ)|s=si
wherein N is the number of sampled tuples, Q(s, a|θQ) is the online critic network, θμ is the parameter of the online actor network, and μ(s|θμ) is the online actor network;
The online critic network is updated by minimizing the loss function:
L = (1/N) Σi (yi − Q(si, ai|θQ))²
In the formula, yi is the target value of the current action and θQ is the parameter of the online critic network, where
yi = ri + γQ′(si+1, μ′(si+1|θμ′)|θQ′)
In the formula, μ′(si+1|θμ′) is the target actor network, θμ′ is the parameter of the target actor network, θQ′ is the parameter of the target critic network, and γ is the discount factor.
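The target value yi and the critic loss above can be illustrated with a minimal numeric sketch, using scalar stand-ins for the network outputs (the functions below are illustrative, not the patent's networks):

```python
def target_value(r_i, gamma, q_target_next):
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})): bootstrap from the target networks
    return r_i + gamma * q_target_next

def critic_loss(y, q):
    # L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    n = len(y)
    return sum((yi - qi) ** 2 for yi, qi in zip(y, q)) / n

# Two transitions with stand-in rewards and target-critic outputs:
y = [target_value(1.0, 0.9, 2.0), target_value(0.0, 0.9, 1.0)]   # [2.8, 0.9]
loss = critic_loss(y, [2.5, 1.0])                                 # mean squared TD error
```

In a full implementation the loss would be minimized by gradient descent on θQ, while the actor's parameters θμ follow the policy gradient above.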
Optionally, the updating of the target critic network and the target actor network specifically comprises:
updating the target network parameters in a soft-update mode, wherein the target actor network and the target critic network are updated as follows, respectively:
θμ′ = τθμ + (1 − τ)θμ′
θQ′ = τθQ + (1 − τ)θQ′
wherein τ < 1.
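The soft update above blends each online parameter into the corresponding target parameter; a minimal sketch with plain lists of parameters (illustrative, not the patent's implementation):

```python
def soft_update(target_params, online_params, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta', applied element-wise
    return [tau * p + (1 - tau) * tp
            for tp, p in zip(target_params, online_params)]

# With tau = 0.1, a target parameter of 0.0 moves 10% of the way toward 1.0:
theta_q_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small τ keeps the target networks slowly moving, which stabilizes the bootstrap target yi.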
The invention further provides a deep reinforcement learning formation transformation system based on dynamic target allocation, which is characterized by comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module, used for randomly initializing the network parameter θQ of the online critic network Q(s, a|θQ) and the network parameter θμ of the online actor network μ(s|θμ);
A second initialization module, used for initializing the network parameter θμ′ of the target actor network and the network parameter θQ′ of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
the third initialization module is used for initializing an experience pool and a training environment;
the first judging module, used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the output module is executed, and if not, the formation transformation module is executed;
the formation transformation module, used for each aircraft to start from a certain initial formation and begin to transform the formation at time t0;
the optimal distribution target point calculation module, used for calculating the optimal allocated target point of each aircraft; each aircraft flies toward its target point while its detector scans for friendly aircraft around it; if a friendly aircraft is detected, the second judging module is executed, otherwise this module is repeated;
the second judgment module is used for judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cone, if so, executing the course angle calculation module, and otherwise, returning to the optimal distribution target point calculation module;
the course angle calculation module, used for calculating the course angles of the aircraft that need to avoid obstacles; each aircraft selects an action and enters the next state;
the reward value calculation module, used for calculating a reward value according to the reward function in the next state;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module, used for randomly sampling a mini-batch of tuple data from the experience pool and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module, used for judging whether rs equals C2 + C3; if so, the current round ends and the process passes to the first judging module; if not, the process passes to the optimal distribution target point calculation module; rs is the space-coordination reward and C2, C3 are constants;
and an output module, used for finishing the training and completing the formation transformation in the complex obstacle environment.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method designs a dynamic target allocation algorithm to allocate the optimal nodes corresponding to the target formation for each aircraft, and solves the problem that the target allocation is random and local optimal routes are easy to generate in the formation conversion process; aiming at the problems that a traditional DDPG algorithm is easy to generate a local optimal path, time coordination is difficult to realize and the like, a multi-objective optimization problem of formation shape transformation of multiple aircrafts is converted into a reward function design problem, and a reward function based on comprehensive cost constraint of formation shape transformation is designed, so that the formation voyage cost marked by a calculation rule is minimum.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a horizontal in-line formation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a vertical in-line formation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an inverse triangle formation according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating triangle formation according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of obstacle avoidance by the velocity barrier method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a reciprocal velocity barrier collision method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for converting a deep reinforcement learning formation based on dynamic target allocation, which aim to solve the problem that local optimal routes are easy to generate due to random target allocation in the formation conversion process.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
According to the invention, the optimal nodes corresponding to the target formation in the target formation are distributed to each aircraft through the designed dynamic target distribution algorithm, so that the problem that the local optimal route is easily generated due to random target distribution in the formation conversion process is solved.
Aiming at the problems that the traditional DDPG algorithm easily produces locally optimal paths and has difficulty achieving time coordination, the multi-objective optimization problem of multi-aircraft formation transformation is converted into a reward-function design problem, and a reward function based on the comprehensive cost constraint of formation transformation is designed so that the voyage cost of the formation transformation is minimized. The specific method is as follows:
1. determining a kinematic model
The aircraft in the formation transformation problem is regarded as a particle (point-mass) motion model, and the motion process of the aircraft is controlled by its acceleration and heading angle. The motion equations of an aircraft can be expressed as:
ẋi = vi cos ψi, ẏi = vi sin ψi, v̇i = ai (1)
In the formula: i = 1, 2, …, N, where N is the number of aircraft; vi represents the speed of the ith aircraft in the XOY plane, ψ is the heading angle of the aircraft, and a represents the acceleration of the aircraft. Considering the saturation constraints of the control inputs, the acceleration a and the heading angle ψ of the aircraft satisfy corresponding bound conditions:
|a| ≤ amax, ψmin ≤ ψ ≤ ψmax (2)
In the formula, the specific constraint parameters for the acceleration depend on the aircraft model and its flight parameters.
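The point-mass model above can be integrated with a simple Euler step; the time step, limit values and the clamping of the control inputs below are illustrative assumptions, not values from the patent:

```python
import math

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def step_aircraft(x, y, v, psi, a, psi_dot, dt=0.1,
                  a_max=2.0, v_min=0.0, v_max=10.0, psi_dot_max=0.5):
    """One Euler step of the point-mass model:
    x' = v*cos(psi), y' = v*sin(psi), v' = a, with saturation constraints."""
    a = clamp(a, -a_max, a_max)                 # acceleration saturation
    psi_dot = clamp(psi_dot, -psi_dot_max, psi_dot_max)  # turn-rate saturation
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    v = clamp(v + a * dt, v_min, v_max)         # speed limits
    psi += psi_dot * dt
    return x, y, v, psi

# Straight flight at 5 m/s while accelerating at 1 m/s^2 for one 0.1 s step:
state = step_aircraft(0.0, 0.0, 5.0, 0.0, a=1.0, psi_dot=0.0)
```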
2. Formation description
The queue form description method comprises the following steps:
Bi = {(Bmid, di, φi, vtar, ψtar) | i = 1, 2, …, N} (3)
In the formula, Bmid is the geometric center coordinate of the formation, di is the distance between the ith aircraft and the geometric center of the formation, φi is the orientation of the ith aircraft relative to the geometric center of the formation, vtar is the speed of the target formation, and ψtar is the heading angle of the target formation.
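Given the descriptor (Bmid, di, φi), each member's position follows from the center by polar offset; a small sketch (the coordinate convention is an assumption for illustration):

```python
import math

def member_position(b_mid, d_i, phi_i):
    """Position of formation member i from the descriptor (B_mid, d_i, phi_i):
    at distance d_i and bearing phi_i from the formation's geometric center."""
    return (b_mid[0] + d_i * math.cos(phi_i),
            b_mid[1] + d_i * math.sin(phi_i))

# A member 2 units from the center at bearing pi/2 sits directly "above" it:
pos = member_position((0.0, 0.0), 2.0, math.pi / 2)
```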
3. Formation transformation cost constraint
(1) Aircraft kinematic constraints
In the whole process of the formation change, the heading angle and the heading angular speed of each aircraft must vary within certain ranges so as to satisfy the flight performance constraint Juav of the aircraft, with the constraint
ψmin ≤ ψi ≤ ψmax, ψ̇min ≤ ψ̇i ≤ ψ̇max (4)
In the formula, ψmin, ψmax are respectively the minimum and maximum heading angles of the aircraft, and ψ̇min, ψ̇max are respectively the minimum and maximum heading angular speeds of the aircraft.
(2) Temporal collaborative cost constraints
After the formation of the multiple aircraft is changed, the time at which each aircraft completes the formation change is required to be the same. Thus, the time-coordination cost among the formation members can be expressed as
In the formula, Jt is the time-coordination cost function, Δti = ti − t0 is the time for the ith aircraft to complete the formation change, and t0 is the moment at which the formation change begins.
(3) Spatial collaborative cost constraint
When the formation of the aircrafts forms the target formation after the formation is transformed, each aircraft is on a corresponding target point in the target formation, namely, the distance and the direction between each aircraft and the geometric center of the current formation meet the conditions of the target formation, and meanwhile, the speed and the course of each aircraft are consistent with the speed and the course of the target formation. Therefore, the spatial coordination cost of the formation transformation of the multiple aircraft formation is as follows:
In the formula, Js is the spatial coordination cost function, Δvi·t is the difference between the speed of the ith aircraft at time t and the target formation speed, Δψi·t is the difference between the heading of the ith aircraft at time t and the heading of the target formation, vi·t is the speed of the ith aircraft at time t, ψi·t is the heading of the ith aircraft at time t, vtar is the speed of the target formation, ψtar is the heading angle of the target formation, di·t is the distance between the ith aircraft and the geometric center of the current formation at time t, di is the distance between the target position of the ith aircraft and the geometric center of the target formation, φi·t is the orientation of the ith aircraft relative to the geometric center of the current formation at time t, and φi is the orientation of the target position of the ith aircraft relative to the geometric center of the target formation.
(4) Constraint of collision cost
The collision cost Jobs,i of the ith aircraft is divided into the static-obstacle collision cost Js_obs,i, the dynamic-obstacle collision cost Jd_obs,i and the inter-aircraft collision cost Juav,i of the ith aircraft. The overall formation collision cost constraint Jcol is then:
wherein the variables are:
wherein ds_k^i is the distance between the kth waypoint of the ith aircraft and the center of the static obstacle, Rs_k is the radius of the static-obstacle threat zone at the kth waypoint, dd_k^i is the distance between the kth waypoint of the ith aircraft and the dynamic obstacle, Rd_k is the radius of the dynamic-obstacle threat circle at the kth waypoint, duav is the sum of the distances between the aircraft, (xd_obs, yd_obs) is the coordinate of the center of the static threat obstacle, and (xk^i, yk^i) is the kth waypoint coordinate of the ith aircraft.
(5) Minimum voyage cost constraint
The voyage used by each aircraft in the formation to complete the formation transformation should be minimal, so the minimum voyage cost constraint is:
JL = Σi Li, Li = Σk ||(xk+1^i, yk+1^i) − (xk^i, yk^i)|| (9)
In the formula, JL is the minimum voyage cost, Li is the voyage of the ith aircraft to complete the formation change, and (xk+1^i, yk+1^i) is the (k+1)th waypoint of the ith aircraft.
(6) Formation transformation comprehensive cost constraint
The comprehensive cost of the formation transformation of the multiple aircraft is described as
J = W1Jt + W2Js + W3Jcol + W4JL (10)
In the formula: W1, W2, W3, W4 are the corresponding weight coefficients, Jt is the time coordination cost, Js is the spatial coordination cost, Jcol is the formation overall collision cost, and JL is the minimum voyage cost.
4. Dynamic target allocation algorithm design
The multi-aircraft dynamic target assignment algorithm may be described by the following model:
DTA=<B,U,T,F> (11)
wherein B is the set of task formations, B = (B1, B2, B3, B4); B1 represents the horizontal in-line formation, B2 the longitudinal in-line formation, B3 the reverse-triangle formation, and B4 the triangle formation. U is the set of aircraft to which target points are to be assigned, U = (uav1, uav2, …, uavn). T is the set of target points to be allocated under the current formation Bi, T = (T1, T2, …, Tn). F is the efficiency matrix of the aircraft matched to the corresponding target points, of the form F = [Fij]n×n,
wherein Fij represents the efficiency of uavi matching the target point Tj.
For convenience of describing the distances between members in a formation, the relative position relationship between two aircraft is represented by (dx, dy), where dx represents the transverse distance and dy represents the longitudinal distance. Taking a formation composed of five aircraft as an example, four formation shapes are designed for the dynamic target allocation algorithm.
(1) Horizontal in-line formation
In the horizontal in-line formation, as shown in fig. 2, the aircraft are arranged horizontally, so dy = 0, and dx is adjusted according to actual needs. The horizontal in-line formation is mainly used for large-area search; the search range can be expanded by increasing the number of aircraft or the spacing between them, improving the efficiency of task execution.
(2) Longitudinal in-line formation
In the vertical in-line formation, as shown in fig. 3, each aircraft is arranged vertically, so dx is 0. dy is adjusted according to actual needs. The longitudinal in-line formation is mainly used for tasks such as formation obstacle avoidance and the like.
(3) Reverse triangle formation
In the inverted triangle formation, as shown in fig. 4, the transverse distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The reverse triangle formation is mainly used for battle tasks such as interception and the like.
(4) Triangle formation
In the formation of triangles, as shown in fig. 5, the lateral distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The triangle formation is mainly used for battle tasks such as fire fighting.
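The four shapes of figs. 2 to 5 can be sketched as (dx, dy) offset lists for a five-aircraft formation; the exact member layouts below are illustrative assumptions, not the parameter values of Table 1:

```python
def formation_offsets(shape, dx=1.0, dy=1.0, n=5):
    """(dx, dy) offsets of each member relative to the lead aircraft for the
    four example shapes (five-aircraft case; layouts are illustrative)."""
    if shape == "horizontal":          # one row: dy = 0 for every member
        return [(i * dx, 0.0) for i in range(n)]
    if shape == "vertical":            # one column: dx = 0 for every member
        return [(0.0, i * dy) for i in range(n)]
    if shape == "triangle":            # leader in front, wings trailing behind
        return [(0.0, 0.0), (-dx, -dy), (dx, -dy),
                (-2 * dx, -2 * dy), (2 * dx, -2 * dy)]
    if shape == "inverse_triangle":    # mirror image: wings ahead of the leader
        return [(0.0, 0.0), (-dx, dy), (dx, dy),
                (-2 * dx, 2 * dy), (2 * dx, 2 * dy)]
    raise ValueError(f"unknown shape: {shape}")

line = formation_offsets("horizontal", dx=2.0)
```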
In summary, the distances and orientations between the members in the formation form and the geometric center are shown in table 1.
TABLE 1 team formation library parameter settings
To maintain the formation, the aircraft in a formation need to keep communicating with each other. If continuous communication cannot be maintained between aircraft, the formation may become disordered and collisions may even occur. To achieve communication between the formation members, a communication topology needs to be established.
In all four formation forms, each aircraft needs to communicate with every other formation member to determine the position of the formation's geometric center, so the formation must communicate through a mutually connected structure. Based on this analysis, the formation network topology is designed as a fully connected topology.
When no target-node conflicts exist between aircraft, the optimal allocation scheme maximizes the sum of the efficiencies of all aircraft performing the formation transformation with respect to their current target nodes; that is, the optimal allocation scheme is expressed as follows:
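The expression itself is an image in the original and is missing from this text. A reconstruction consistent with the surrounding definitions (ω_i and F_ij as introduced below; this is an inference, not the original formula) is:

```latex
E^{*}=\max_{\omega}\sum_{i=1}^{n}\omega_{i}F_{ij},\qquad
\omega_{i}=
\begin{cases}
1, & \text{aircraft } U_i \text{ is matched to target point } T_j,\\
0, & \text{otherwise.}
\end{cases}
```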
where, when aircraft U_i is successfully matched to target point T_j, the efficiency function F_ij is evaluated by the target-node efficiency function and the corresponding weight is ω_i = 1; otherwise ω_i = 0.
The efficiency function calculation formula is as follows:
where ξ1 and ξ2 are weight coefficients; Δd_ijt is the distance between the position of the ith aircraft at time t and its assigned target point T_j; Δφ_ijt is the difference between the angle from the ith aircraft to the center point of the current formation and the angle from the target point to the center point of the target formation at time t; (x_it, y_it) is the position of the ith aircraft at time t; the coordinates of the target point T_j assigned to the ith aircraft also appear in the formula; (x_mid, y_mid) is the coordinate of the center point of the current formation; φ_tu_mid is the angle between the ith aircraft and the center point of the current formation at time t; and φ_T_mid is the angle between the target point T_j and the center point of the target formation.
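The efficiency-function and allocation formulas are images in the original, so the sketch below only illustrates the mechanism they describe: a conflict-free assignment optimizing a summed distance-and-bearing efficiency. The function names, the sign convention (smaller errors treated as higher utility), and the weights xi1, xi2 are assumptions.

```python
import itertools
import math

def efficiency(uav, target, center, target_center, xi1=0.5, xi2=0.5):
    """Utility of one aircraft-to-target pairing (illustrative: smaller
    distance and bearing mismatch give a larger utility)."""
    dd = math.dist(uav, target)
    phi_u = math.atan2(uav[1] - center[1], uav[0] - center[0])
    phi_t = math.atan2(target[1] - target_center[1], target[0] - target_center[0])
    # wrapped angular difference in [0, pi]
    dphi = abs(math.atan2(math.sin(phi_u - phi_t), math.cos(phi_u - phi_t)))
    return -(xi1 * dd + xi2 * dphi)

def assign(uavs, targets, center, target_center):
    """Conflict-free assignment maximizing the summed utility by exhaustive
    search over permutations (cheap for a five-aircraft formation)."""
    best, best_perm = -math.inf, None
    for perm in itertools.permutations(range(len(targets))):
        total = sum(efficiency(uavs[i], targets[j], center, target_center)
                    for i, j in enumerate(perm))
        if total > best:
            best, best_perm = total, perm
    return list(best_perm)
```

For five aircraft the exhaustive search covers only 120 permutations; at larger scale a Hungarian-algorithm solver would replace it.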
Fig. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention, where the method shown in fig. 1 includes:
s1: the form transformation problem in the uncertain environment is modeled as a Markov decision process, and a state space (formula (15)), an action space (formula (17)) and a reward function (formula (18)) are designed. And obtaining an optimal formation transformation airway by solving the Markov decision process.
Wherein the state space is as follows:
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and the expressions of Δd_i,t, Δφ_i,t, Δv_i,t and Δψ_i,t are:
where d_i,t is the distance between the ith aircraft and the geometric center of the current formation at time t; d_i′ is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation; φ_i,t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t; φ_i′ is the bearing of the corresponding target node relative to the geometric center of the target formation; v_i,t is the speed of the ith aircraft at time t; and ψ_i,t is the heading angle of the ith aircraft at time t.
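The four relative quantities can be computed directly from positions and kinematic states. The sketch below is a minimal illustration with assumed tuple layouts; the patent's formula images are not reproduced here, so the exact form of each term is an inference from the symbol definitions above.

```python
import math

def state_errors(uav, target, center, target_center, v_tar, psi_tar):
    """Per-aircraft relative terms of the MDP state (names assumed).
    uav = (x, y, v, psi); target = position of the assigned formation slot."""
    d_now = math.dist(uav[:2], center)                    # d_i,t
    d_goal = math.dist(target, target_center)             # d_i'
    phi_now = math.atan2(uav[1] - center[1], uav[0] - center[0])
    phi_goal = math.atan2(target[1] - target_center[1],
                          target[0] - target_center[0])
    return (d_now - d_goal,                               # Δd_i,t
            math.atan2(math.sin(phi_now - phi_goal),      # Δφ_i,t (wrapped)
                       math.cos(phi_now - phi_goal)),
            uav[2] - v_tar,                               # Δv_i,t
            uav[3] - psi_tar)                             # Δψ_i,t
```

All four terms vanish exactly when the aircraft sits on its slot with the target formation's speed and heading, which is what the reward function below encourages.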
The action space is as follows:
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, and v_u and ω_u are the speed and angular velocity of the aircraft mapped into the interval [-1, 1].
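A minimal sketch of such an affine mapping between the normalized interval [-1, 1] and the physical limits (function names are assumptions; the patent's mapping formulas are images):

```python
def to_physical(u, lo, hi):
    """Map a network output u in [-1, 1] to a physical command in [lo, hi]."""
    return lo + (u + 1.0) * 0.5 * (hi - lo)

def to_normalized(x, lo, hi):
    """Inverse map, e.g. for storing executed actions in normalized form."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```

The same pair serves both commands: speed with (v_min, v_max) and angular velocity with (ω_min, ω_max).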
The reward function is as follows:
where r_t is the time-coordination reward, r_s is the space-coordination reward, r_col is the collision- and obstacle-avoidance reward, and r_L is the minimum-voyage reward; Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment the ith aircraft completes the formation change, and t_0 is the moment the formation begins to change; v_i,t and ψ_i,t are the speed and heading of the ith aircraft at time t, and v_tar and ψ_tar are the speed and heading angle of the target formation; d_i,t is the distance between the ith aircraft and the geometric center of the current formation at time t, and d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation; φ_i,t is the bearing of the ith aircraft from the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft from the geometric center of the target formation; the collision-avoidance heading calculated for the ith aircraft by the reciprocal velocity obstacle method and the obstacle-avoidance heading calculated for the ith aircraft by the velocity obstacle method also appear in the formula; P_t^i is the position of the ith aircraft and P_ob is the position of the obstacle; C1, C2, C3, C4 are constants, and ξ1, ξ2, ξ3, ξ4 are the corresponding weight coefficients.
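The reward formulas themselves are images in the original, so the sketch below only illustrates the structure described above: four shaped terms combined by weights. The tolerance, weights, and constants are placeholders, not the patent's values; the r_s form is chosen so that r_s equals C2 + C3 exactly when the aircraft has reached its slot, matching the episode-end test in step S13.

```python
def space_reward(d_err, phi_err, C2=1.0, C3=1.0, tol=1e-2):
    """Space-coordination term r_s (illustrative): each sub-reward C2, C3 is
    paid only when the corresponding slot error is inside tolerance."""
    r = C2 if abs(d_err) < tol else 0.0
    r += C3 if abs(phi_err) < tol else 0.0
    return r

def total_reward(r_t, r_s, r_col, r_L, xi=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of the four shaped terms; weights are assumed."""
    return xi[0] * r_t + xi[1] * r_s + xi[2] * r_col + xi[3] * r_L
```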
S2: random initialization on-line operator network Q (s, a | theta [ ]Q) And on-line criticic network mu (s | theta)μ) Network parameter θ ofμAnd thetaQ。
Note: the DDPG network architecture consists of an online actor network, a target actor network, an online critic network and a target critic network.
The four neural networks of the deep deterministic policy gradient (DDPG) algorithm are updated as follows:
the online actor network update strategy gradient is as follows:
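The gradient expression is an image in the original and is missing here; the standard DDPG form, consistent with the symbols defined in the following paragraph, is:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}
\nabla_{a} Q\big(s,a\,\big|\,\theta^{Q}\big)\Big|_{s=s_{i},\,a=\mu(s_{i})}
\,\nabla_{\theta^{\mu}} \mu\big(s\,\big|\,\theta^{\mu}\big)\Big|_{s_{i}}
```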
where N is the number of sampled transitions in the minibatch, Q(s, a|θ^Q) is the online critic network with parameters θ^Q, and μ(s|θ^μ) is the online actor network with parameters θ^μ.
The online critic network is updated by minimizing a loss function, wherein the loss function is as follows:
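The loss itself is an image in the original; the standard mean-squared TD-error form, matching y_i as defined in equation (21), is:

```latex
L(\theta^{Q})=\frac{1}{N}\sum_{i}\Big(y_{i}-Q\big(s_{i},a_{i}\,\big|\,\theta^{Q}\big)\Big)^{2}
```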
where y_i is the target value of the current action, and θ^Q is a parameter of the online critic network.
where
yi=ri+γQ′(si+1,μ′(si+1|θμ′)|θQ′) (21)
where μ′(s_{i+1}|θ^{μ′}) is the target actor network, θ^{μ′} is a parameter of the target actor network, θ^{Q′} is a parameter of the target critic network, and γ is the discount factor.
The DDPG algorithm updates the target network parameters in a soft-update manner; the target actor network and the target critic network are updated as follows:
θμ′=τθμ+(1-τ)θμ′ (22)
θQ′=τθQ+(1-τ)θQ′
wherein τ < 1.
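A minimal sketch of the soft update in equation (22), treating each network as a flat list of parameters (the value of τ is assumed; the patent does not state one):

```python
def soft_update(target_params, online_params, tau=0.005):
    """θ' ← τθ + (1 − τ)θ', applied element-wise to each parameter."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

Because τ is small, the target networks trail the online networks slowly, which stabilizes the bootstrapped target y_i.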
A behavior policy is introduced: random noise η_t is added to the action output by the online actor network, turning the deterministic action executed by the agent into a stochastic action a_t.
at=μ(st|θμ)+ηt (23)
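A minimal sketch of the behavior policy in equation (23). The patent does not specify the distribution of η_t; Gaussian noise with an assumed scale is used here, and the result is clipped to the normalized action bounds.

```python
import random

def explore(mu_out, sigma=0.1, lo=-1.0, hi=1.0):
    """a_t = μ(s_t|θ^μ) + η_t with Gaussian η_t (distribution assumed),
    clipped to the normalized action interval [lo, hi]."""
    return [max(lo, min(hi, a + random.gauss(0.0, sigma))) for a in mu_out]
```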
S3: initializing target networks mu' and thetaQ′And their weights, and copy the parameters of each target network to the online network.
S4: and initializing an experience pool and initializing a training environment.
S5: it is determined whether the number of training rounds has reached a maximum number of rounds and if so, processing proceeds to process 13. If not, go to S5.
S6: each aircraft starting with an initial formation, t0And the time begins to change the formation.
S7: Calculate the optimal allocation target point of each aircraft according to formulas (13) and (14); each aircraft flies toward its target point according to the exploration action of formula (23), while its detector scans for surrounding friendly aircraft. If a friendly aircraft is detected, proceed to S8; otherwise, repeat S7.
S8: and judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cones. If obstacle avoidance or collision avoidance is required, the process proceeds to S9, otherwise, the process proceeds to S7.
The obstacle avoidance strategy is described as follows:
As shown in fig. 5, when the relative velocity vector v_uo lies within ΔP_oP_uL_1, the heading angle must be increased within the collision-avoidance time t_i so that α_i > α_o; when v_uo lies within ΔP_oP_uL_2, the heading angle must be reduced so that α_i > α_o.
where the collision-avoidance time is
The relationship between α_i and α is determined by equation (11).
where y′ and x′ are
where β is the heading angle of the dynamic obstacle.
Let the heading angle of an aircraft at a given moment be α. In the next sampling period δt of the flight, the set of possible heading angles is A_α = {α′ | α′ ∈ (α + ω_min·δt, α + ω_max·δt)}, and the set of relative velocity vectors generated on this basis by the velocity obstacle method is denoted V_uo. After the dynamic collision-avoidance region VO is calculated by the velocity obstacle method, the relative velocity vectors in V_uo that would cause a collision are removed, yielding the relative velocities that successfully avoid the obstacle; an action is then selected from these to complete collision avoidance.
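The steps above can be sketched as follows for the simplified case of a static obstacle, where the relative velocity direction equals the aircraft's own heading. The cone construction, the sampling resolution, and all parameter values are assumptions for illustration.

```python
import math

def admissible_headings(alpha, obstacle_rel, r_safe,
                        w_min=-0.5, w_max=0.5, dt=0.1, n=41):
    """Sample the reachable heading set A_alpha = (alpha + w_min*dt,
    alpha + w_max*dt) and keep only headings whose velocity ray misses the
    collision cone of a static obstacle."""
    ox, oy = obstacle_rel                    # obstacle position relative to uav
    half = math.asin(min(1.0, r_safe / math.hypot(ox, oy)))  # cone half-angle
    bearing = math.atan2(oy, ox)             # direction toward the obstacle
    keep = []
    for k in range(n):
        a = alpha + w_min * dt + (w_max - w_min) * dt * k / (n - 1)
        off = math.atan2(math.sin(a - bearing), math.cos(a - bearing))
        if abs(off) > half:                  # velocity ray stays outside cone
            keep.append(a)
    return keep
```

For a moving obstacle the same filter applies to the relative velocity v_uo rather than the heading itself.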
The collision avoidance strategy is described as follows:
As shown in fig. 6, the reciprocal velocity obstacle cone RVO can be obtained by translating the collision cone CC. To achieve collision avoidance, the velocity vector of uav2 must deflect out of the reciprocal velocity obstacle cone RVO. Suppose that at the moment the velocity vector of uav2 just deflects out of the RVO, the rotation angle is α_RVO and the velocity vector takes the corresponding deflected value; then, according to the geometric relationship in the figure,
where the deflected velocity vector is obtained from the operational relationship between vectors:
where ||·||_2 denotes the two-norm of a vector.
s9: and (4) calculating the heading angle of the aircraft needing to avoid the obstacle according to the formula (12) or the formula (28), selecting the action by each aircraft according to the formula (23), and entering the next state.
S10: in the next system state, the prize value is calculated according to equation (18).
S11: and storing the system state, the action, the reward value and the next system state at the moment into an experience pool as a group of tuple data.
S12: and randomly sampling batch tuple data from the experience pool, and updating the current criticic network, the current Actor network and the target network in sequence according to the formula (20), the formula (19) and the formula (22).
S13: judgment of rsWhether or not it is C2+C3If the condition is satisfied, the current round is ended, and the process goes to S5. If the condition is not satisfied, the flow proceeds to S7.
S14: and finishing training, and finishing team form transformation in the complex obstacle environment.
Fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention, and as shown in fig. 8, the system includes:
a state space, action space and reward function determination module 201, configured to determine a state space, an action space and a reward function;
a first initialization module 202, configured to randomly initialize the network parameters θ^Q of the online critic network Q(s, a|θ^Q) and θ^μ of the online actor network μ(s|θ^μ);
a second initialization module 203, configured to initialize the network parameters θ^{μ′} of the target actor network and θ^{Q′} of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
a third initialization module 204, configured to initialize the experience pool and the training environment;
a first judging module 205, configured to judge whether the number of training rounds has reached the maximum number of rounds; if so, the output module is executed; if not, the formation change module is executed;
a formation change module 206, used for each aircraft to start from an initial formation and begin changing formation at time t0;
an optimal allocation target point calculation module 207, configured to calculate the optimal allocation target point of each aircraft; each aircraft flies toward its target point in an exploratory manner while its detector scans for surrounding friendly aircraft; if a friendly aircraft is detected, the second judging module is executed; otherwise, control returns to this module;
a second judging module 208, configured to judge whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone; if so, the course angle calculation module is executed; otherwise, control returns to the optimal allocation target point calculation module;
a course angle calculation module 209, configured to calculate a course angle at which the aircraft needs to avoid the obstacle, select an action for each aircraft, and enter a next state;
a reward value calculation module 210, configured to calculate a reward value according to the reward function in a next state;
the storage module 211 is configured to store the system state, the action, the reward value, and the next system state at this time as a set of tuple data in the experience pool;
an updating module 212, configured to randomly sample a batch of tuple data from the experience pool and update the online critic network, the online actor network, the target critic network and the target actor network in sequence;
a third judging module 213, configured to judge whether r_s equals C2 + C3; if the condition is satisfied, the current round ends and control passes to the first judging module; otherwise, control passes to the optimal allocation target point calculation module; r_s is the space-coordination reward, and C2, C3 are constants;
and an output module 214, configured to finish training and complete the formation transformation in the complex obstacle environment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (8)
1. A deep reinforcement learning formation transformation method based on dynamic target allocation is characterized by comprising the following steps:
s1: determining a state space, an action space and a reward function;
s2: random initialization on-line operator network Q (s, a | theta [ ]Q) Network parameter θ ofμAnd on-line criticic network mu (s | theta)μ) Network parameter θ ofQ;
S3: initializing network parameters theta of a target actor networkμ′Network parameter theta of target critic networkQ′Copying the network parameters of the target actor network and the network parameters of the target critic network to the network parameters of the online actor network and the target critic network;
s4: initializing an experience pool and a training environment;
s5: judging whether the training round number reaches the maximum round number, if so, executing the step S13, and if not, returning to the previous step;
s6: each aircraft starting with an initial formation, t0Starting to change the formation at the moment;
s7: calculating the optimal allocation target point of each aircraft; each aircraft flies toward its target point in an exploratory manner while its detector scans for surrounding friendly aircraft; if a friendly aircraft is detected, executing step S8; otherwise, repeating step S7;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone; if so, executing step S9; otherwise, returning to step S7;
s9: calculating the course angle of the aircrafts needing to avoid the obstacle, selecting actions by each aircraft, and entering the next state;
s10: calculating the reward value in the next state according to the reward function;
s11: taking the system state, the action, the reward value and the next system state at the moment as a group of tuple data to be stored in an experience pool;
s12: randomly sampling a batch of tuple data from the experience pool and updating the online critic network, the online actor network, the target critic network and the target actor network in sequence;
s13: judgment of rsWhether or not it is C2+C3If the condition is true, the current round is ended, and the process goes to step S5, and if the condition is false, the process goes to step S7;
s14: finishing training and completing the formation transformation in the complex obstacle environment.
2. The method for transforming the formation of deep reinforcement learning based on dynamic target allocation according to claim 1, wherein the expression of the state space is as follows:
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and the expressions of Δd_i,t, Δφ_i,t, Δv_i,t and Δψ_i,t are:
where d_i,t is the distance between the ith aircraft and the geometric center of the current formation at time t; d_i′ is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation; φ_i,t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t; φ_i′ is the bearing of the corresponding target node relative to the geometric center of the target formation; v_i,t is the speed of the ith aircraft at time t; v_tar is the speed of the target formation; ψ_i,t is the heading angle of the ith aircraft at time t; and ψ_tar is the heading angle of the target formation;
the expression of the action space is as follows:
where v_max is the maximum speed of the aircraft, v_min is the minimum speed of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, and v_u and ω_u are the speed and angular velocity of the aircraft mapped into the interval [-1, 1];
the expression of the reward function is as follows:
where r_t is the time-coordination reward, r_s is the space-coordination reward, r_col is the collision- and obstacle-avoidance reward, and r_L is the minimum-voyage reward; Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment the ith aircraft completes the formation change, and t_0 is the moment the formation begins to change; v_i,t and ψ_i,t are the speed and heading of the ith aircraft at time t, and v_tar and ψ_tar are the speed and heading angle of the target formation; d_i,t is the distance between the ith aircraft and the geometric center of the current formation at time t, and d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation; φ_i,t is the bearing of the ith aircraft from the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft from the geometric center of the target formation; the collision-avoidance heading calculated for the ith aircraft by the reciprocal velocity obstacle method and the obstacle-avoidance heading calculated for the ith aircraft by the velocity obstacle method also appear in the formula; P_t^i is the position of the ith aircraft and P_ob is the position of the obstacle; C1, C2, C3, C4 are constants, and ξ1, ξ2, ξ3, ξ4 are the corresponding weight coefficients.
3. The method for transforming the deep reinforcement learning formation based on the dynamic target allocation according to claim 1, wherein the following formula is specifically adopted for calculating the optimal allocation target point of each aircraft:
wherein, when aircraft U_i is successfully matched to its assigned target point T_j, the efficiency function F_ij is evaluated by the target-node efficiency function and the corresponding weight is ω_i = 1; otherwise ω_i = 0;
The efficiency function calculation formula is as follows:
where ξ1 and ξ2 are weight coefficients; Δd_ijt is the distance between the position of the ith aircraft at time t and its assigned target point T_j; Δφ_ijt is the difference between the angle from the ith aircraft to the center point of the current formation and the angle from the target point to the center point of the target formation at time t; (x_it, y_it) is the position of the ith aircraft at time t; the coordinates of the target point T_j assigned to the ith aircraft also appear in the formula; (x_mid, y_mid) is the coordinate of the center point of the current formation; φ_tu_mid is the angle between the ith aircraft and the center point of the current formation at time t; and φ_T_mid is the angle between the target point T_j and the center point of the target formation.
5. The method according to claim 1, wherein the action selection of each aircraft specifically adopts the following formula:
at=μ(st|θμ)+ηt
wherein μ(s_t|θ^μ) is the online actor network and η_t is random noise.
6. The method for transforming the deep reinforcement learning formation based on dynamic target allocation according to claim 1, wherein the following formulas are specifically adopted for updating the current Critic network and the current Actor network:
the online actor network update strategy gradient is as follows:
wherein N is the number of sampled transitions in the minibatch, Q(s, a|θ^Q) is the online critic network with parameters θ^Q, and μ(s|θ^μ) is the online actor network with parameters θ^μ;
the online critic network is updated by minimizing a loss function, wherein the loss function is as follows:
wherein y_i is the target value of the current action, and θ^Q is a parameter of the online critic network;
wherein
yi=ri+γQ′(si+1,μ′(si+1|θμ′)|θQ′)
wherein μ′(s_{i+1}|θ^{μ′}) is the target actor network, θ^{μ′} is a parameter of the target actor network, θ^{Q′} is a parameter of the target critic network, and γ is the discount factor.
7. The method of claim 6, wherein the updating of the target Critic network and the target Actor network specifically comprises:
updating the target network parameters in a soft-update manner, wherein the target actor network and the target critic network are updated as follows:
θμ′=τθμ+(1-τ)θμ′
θQ′=τθQ+(1-τ)θQ′
wherein τ < 1.
8. A system for deep reinforcement learning formation transformation based on dynamic target allocation, the system comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module for randomly initializing the network parameters θ^Q of the online critic network Q(s, a|θ^Q) and θ^μ of the online actor network μ(s|θ^μ);
a second initialization module for initializing the network parameters θ^{μ′} of the target actor network and θ^{Q′} of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
the third initialization module is used for initializing an experience pool and a training environment;
the first judging module is used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the output module is executed; if not, the formation change module is executed;
the formation change module is used for each aircraft to start from an initial formation and begin to change formation at time t0;
the optimal allocation target point calculation module is used for calculating the optimal allocation target point of each aircraft; each aircraft flies toward its target point in an exploratory manner while its detector scans for surrounding friendly aircraft; if a friendly aircraft is detected, the second judging module is executed; otherwise, control returns to this module;
the second judging module is used for judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone; if so, the course angle calculation module is executed; otherwise, control returns to the optimal allocation target point calculation module;
the course angle calculation module is used for calculating course angles of the aircrafts needing to avoid the obstacle, and each aircraft selects an action and enters the next state;
the reward value calculation module is used for calculating a reward value in the next state according to the reward function;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module is used for randomly sampling a batch of tuple data from the experience pool and updating the online critic network, the online actor network, the target critic network and the target actor network in sequence;
the third judging module is used for judging whether r_s equals C2 + C3; if the condition is satisfied, the current round ends and control passes to the first judging module; otherwise, control passes to the optimal allocation target point calculation module; r_s is the space-coordination reward, and C2, C3 are constants;
and the output module is used for finishing training and completing the formation transformation in the complex obstacle environment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111546506.9A CN114237293B (en) | 2021-12-16 | 2021-12-16 | Deep reinforcement learning formation transformation method and system based on dynamic target allocation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114237293A true CN114237293A (en) | 2022-03-25 |
CN114237293B CN114237293B (en) | 2023-08-25 |
Family
ID=80757404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111546506.9A Active CN114237293B (en) | 2021-12-16 | 2021-12-16 | Deep reinforcement learning formation transformation method and system based on dynamic target allocation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114237293B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115576353A (en) * | 2022-10-20 | 2023-01-06 | 北京理工大学 | Aircraft formation control method based on deep reinforcement learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190217476A1 (en) * | 2018-01-12 | 2019-07-18 | Futurewei Technologies, Inc. | Robot navigation and object tracking |
US20200174471A1 (en) * | 2018-11-30 | 2020-06-04 | Denso International America, Inc. | Multi-Level Collaborative Control System With Dual Neural Network Planning For Autonomous Vehicle Control In A Noisy Environment |
CN111880563A (en) * | 2020-07-17 | 2020-11-03 | 西北工业大学 | Multi-unmanned aerial vehicle task decision method based on MADDPG |
CN111880567A (en) * | 2020-07-31 | 2020-11-03 | 中国人民解放军国防科技大学 | Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning |
CN111897316A (en) * | 2020-06-22 | 2020-11-06 | 北京航空航天大学 | Multi-aircraft autonomous decision-making method under scene fast-changing condition |
-
2021
- 2021-12-16 CN CN202111546506.9A patent/CN114237293B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190217476A1 (en) * | 2018-01-12 | 2019-07-18 | Futurewei Technologies, Inc. | Robot navigation and object tracking |
US20200174471A1 (en) * | 2018-11-30 | 2020-06-04 | Denso International America, Inc. | Multi-Level Collaborative Control System With Dual Neural Network Planning For Autonomous Vehicle Control In A Noisy Environment |
CN111897316A (en) * | 2020-06-22 | 2020-11-06 | 北京航空航天大学 | Multi-aircraft autonomous decision-making method under scene fast-changing condition |
CN111880563A (en) * | 2020-07-17 | 2020-11-03 | 西北工业大学 | Multi-unmanned aerial vehicle task decision method based on MADDPG |
CN111880567A (en) * | 2020-07-31 | 2020-11-03 | 中国人民解放军国防科技大学 | Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
李樾;韩维;仲维国;: "有人机/无人机协同系统航迹控制关键技术浅析", 无人系统技术, no. 04 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115576353A (en) * | 2022-10-20 | 2023-01-06 | 北京理工大学 | Aircraft formation control method based on deep reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN114237293B (en) | 2023-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110456823B (en) | Double-layer path planning method aiming at unmanned aerial vehicle calculation and storage capacity limitation | |
CN110134140B (en) | Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information | |
Ali et al. | Cooperative path planning of multiple UAVs by using max–min ant colony optimization along with cauchy mutant operator | |
CN111897316B (en) | Multi-aircraft autonomous decision-making method under scene fast-changing condition | |
CN111780777A (en) | Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning | |
Li et al. | Trajectory planning for UAV based on improved ACO algorithm | |
CN112180967B (en) | Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture | |
CN111240319A (en) | Outdoor multi-robot cooperative operation system and method thereof | |
CN108398960B (en) | Multi-unmanned aerial vehicle cooperative target tracking method for improving combination of APF and segmented Bezier | |
CN116257082B (en) | Distributed active cooperative detection method for multiple unmanned aerial vehicles | |
CN111240345A (en) | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework | |
CN114625150B (en) | Rapid ant colony unmanned ship dynamic obstacle avoidance method based on danger coefficient and distance function | |
CN113848974A (en) | Aircraft trajectory planning method and system based on deep reinforcement learning | |
CN114237293A (en) | Deep reinforcement learning formation transformation method and system based on dynamic target allocation | |
CN114967721A (en) | Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet | |
CN112001120B (en) | Spacecraft-to-multi-interceptor autonomous avoidance maneuvering method based on reinforcement learning | |
Liu et al. | Multiple UAV formations delivery task planning based on a distributed adaptive algorithm | |
CN113064422A (en) | Autonomous underwater vehicle path planning method based on double neural network reinforcement learning | |
Zhu et al. | A cooperative task assignment method of multi-UAV based on self organizing map | |
CN115951711A (en) | Unmanned cluster multi-target searching and catching method in high sea condition environment | |
CN115755975A (en) | Multi-unmanned aerial vehicle cooperative distributed space searching and trajectory planning method and device | |
CN115933637A (en) | Path planning method and device for substation equipment inspection robot and storage medium | |
Wu et al. | A multi-critic deep deterministic policy gradient UAV path planning | |
CN115542921A (en) | Autonomous path planning method for multiple robots | |
Ma et al. | Strategy generation based on reinforcement learning with deep deterministic policy gradient for UCAV |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||