CN114237293A - Deep reinforcement learning formation transformation method and system based on dynamic target allocation - Google Patents

Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Info

Publication number
CN114237293A
Authority
CN
China
Prior art keywords
aircraft
target
formation
network
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111546506.9A
Other languages
Chinese (zh)
Other versions
CN114237293B (en)
Inventor
张毅
杨秀霞
高恒杰
杨林
陆巍巍
褚政
王宏
于浩
姜子劼
王晨蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202111546506.9A priority Critical patent/CN114237293B/en
Publication of CN114237293A publication Critical patent/CN114237293A/en
Application granted granted Critical
Publication of CN114237293B publication Critical patent/CN114237293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a deep reinforcement learning formation transformation method and system based on dynamic target allocation. The method comprises the following steps: determining a state space, an action space and a reward function; initializing network parameters, an experience pool and a training environment; judging whether the number of training rounds has reached the maximum; each aircraft starts in a certain initial formation; the optimal distribution target point of each aircraft is calculated, a detector detects friendly aircraft in the surroundings, and whether an aircraft needs obstacle avoidance or collision avoidance is judged according to the obstacle cone; the heading angles of the aircraft that need to avoid are calculated, each aircraft selects an action and enters the next state; a reward value is calculated; the system state, the action, the reward value and the next system state at that moment are stored in the experience pool as a group of tuple data; the network parameters are updated; whether r_s equals C_2 + C_3 is judged; training is finished and the formation transformation in the complex obstacle environment is completed. The method solves the problem that random target allocation in the formation transformation process easily produces locally optimal routes.

Description

Deep reinforcement learning formation transformation method and system based on dynamic target allocation
Technical Field
The invention relates to the field of deep reinforcement learning, in particular to a method and a system for transforming a deep reinforcement learning formation based on dynamic target allocation.
Background
In practical applications, a formation of multiple aircraft often needs to change its shape because of special tasks. Current formation transformation algorithms are mostly applied to formation changes of multiple aircraft in obstacle-free environments; when the environment becomes complicated, these algorithms suffer from low obstacle avoidance efficiency and long iteration times and easily fall into locally optimal solutions, so they are difficult to apply in complex obstacle environments.
Deep reinforcement learning algorithms are commonly used to solve intelligent decision problems in complex environments because of their excellent situation awareness and strong decision-making capability. For the formation change problem of multiple aircraft, when the number of obstacles in the environment increases, such an algorithm can make decisions quickly according to the current state, with fast reaction, strong collision avoidance capability and high flexibility; when the number of obstacles decreases, the end-to-end control mode produces small maneuvers and routes that are easier to track, and the positions of the formation transformation target points do not need to be given in advance, so the real-time performance is strong.
Therefore, on the basis of the traditional DDPG algorithm, a multi-aircraft deep reinforcement learning formation transformation algorithm based on dynamic target allocation is provided.
Disclosure of Invention
The invention aims to provide a method and a system for converting a deep reinforcement learning formation based on dynamic target allocation, which aim to solve the problem that local optimal routes are easy to generate due to random target allocation in the formation conversion process.
In order to achieve the purpose, the invention provides the following scheme:
a deep reinforcement learning formation transformation method based on dynamic target allocation, the transformation method comprises the following steps:
s1: determining a state space, an action space and a reward function;
s2: randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
s3: initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
s4: initializing an experience pool and a training environment;
s5: judging whether the number of training rounds has reached the maximum number of rounds, if so, executing step S13, and if not, proceeding to the next step;
s6: each aircraft starts in a certain initial formation and begins to change the formation at time t_0;
s7: calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, executing step S8, and otherwise returning to the previous step;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone, if so, executing the step S9, otherwise, returning to the step S7;
s9: calculating the course angle of the aircrafts needing to avoid the obstacle, selecting actions by each aircraft, and entering the next state;
s10: calculating a reward value according to the reward function in the next state;
s11: taking the system state, the action, the reward value and the next system state at the moment as a group of tuple data to be stored in an experience pool;
s12: randomly sampling a batch of tuple data from the experience pool, and sequentially updating the current Critic network, the current Actor network, the target Critic network and the target Actor network;
s13: judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to step S5, and if the condition does not hold, the process goes to step S7;
s14: and finishing training, and completing the formation transformation in the complex obstacle environment.
Optionally, the expression of the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_{i,t} is the heading angle of the ith aircraft at time t, and ψ_tar is the heading angle of the target formation;
the expression of the motion space is as follows:
a_t = [v_u, ω_u]
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping;
the expression of the reward function is as follows:
[reward function, not reproduced: a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
Optionally, the following formula is specifically adopted for calculating the optimal distribution target point of each aircraft:
max F = Σ_{i=1}^{N} ω_i·F_ii
subject to a conflict-free one-to-one assignment of aircraft to target points,
where, when aircraft U_i is successfully matched to its assigned target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0;
the efficiency function is calculated as:
[efficiency function, not reproduced: F_ij is computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
Optionally, the following formula is specifically adopted for calculating the heading angle of the aircraft required to avoid the obstacle:
[formula not reproduced: expression for the avoidance heading angle α_RVO]
where α_RVO is the heading angle required to avoid the obstacle and v_u is the speed of the aircraft that needs to avoid the obstacle.
Optionally, each aircraft selection action specifically adopts the following formula:
a_t = μ(s_t | θ^μ) + η_t
where μ(s_t | θ^μ) is the online actor network and η_t is random noise.
Optionally, the current Critic network and the current Actor network are updated specifically by the following formulas:
the online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network;
the online critic network is updated by minimizing a loss function, wherein the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where y_i is the target value of the current action;
wherein
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
Optionally, the updating of the target Critic network and the target Actor network specifically comprises:
updating the target network parameters in a soft updating mode, wherein the updating modes of the target actor network and the target critic network are respectively:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
The invention further provides a deep reinforcement learning formation transformation system based on dynamic target allocation, which is characterized by comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module, used for randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
a second initialization module, used for initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
a third initialization module, used for initializing an experience pool and a training environment;
a first judging module, used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the next module is executed;
a formation transformation module, used for each aircraft to start in a certain initial formation and to begin transforming the formation at time t_0;
an optimal distribution target point calculation module, used for calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, the second judgment module is executed, and otherwise the previous module is returned to;
the second judgment module is used for judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cone, if so, executing the course angle calculation module, and otherwise, returning to the optimal distribution target point calculation module;
the course angle calculation module is used for calculating course angles of the aircrafts needing to avoid the obstacle, and each aircraft selects an action and enters the next state;
the reward value calculation module is used for calculating a reward value according to the next state of the reward function;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module is used for randomly sampling batch metadata from the experience pool and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module, used for judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to the first judging module, and if the condition does not hold, the process goes to the optimal distribution target point calculation module; r_s is the space coordination reward and C_2, C_3 are constants;
and an output module, used for finishing training and completing the formation transformation in the complex obstacle environment.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method designs a dynamic target allocation algorithm to allocate the optimal nodes corresponding to the target formation for each aircraft, and solves the problem that the target allocation is random and local optimal routes are easy to generate in the formation conversion process; aiming at the problems that a traditional DDPG algorithm is easy to generate a local optimal path, time coordination is difficult to realize and the like, a multi-objective optimization problem of formation shape transformation of multiple aircrafts is converted into a reward function design problem, and a reward function based on comprehensive cost constraint of formation shape transformation is designed, so that the formation voyage cost marked by a calculation rule is minimum.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a horizontal in-line formation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a vertical in-line formation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an inverse triangle formation according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating triangle formation according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of obstacle avoidance by the velocity barrier method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a reciprocal velocity barrier collision method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for converting a deep reinforcement learning formation based on dynamic target allocation, which aim to solve the problem that local optimal routes are easy to generate due to random target allocation in the formation conversion process.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
According to the invention, the optimal nodes corresponding to the target formation in the target formation are distributed to each aircraft through the designed dynamic target distribution algorithm, so that the problem that the local optimal route is easily generated due to random target distribution in the formation conversion process is solved.
Aiming at the problems that the traditional DDPG algorithm easily produces locally optimal paths and has difficulty achieving time coordination, the multi-objective optimization problem of formation transformation of multiple aircraft is converted into a reward function design problem, and a reward function based on the comprehensive cost constraint of formation transformation is designed so that the formation voyage cost defined by the calculation rules is minimized. The specific method is as follows:
1. determining a kinematic model
The aircraft in the formation transformation problem is regarded as a particle motion model, and the motion process of the aircraft is controlled by its acceleration and heading angle. The equation of motion of an aircraft may be expressed as:
[formula (1), not reproduced: planar particle kinematics of the ith aircraft driven by its speed v_i, heading angle ψ_i and acceleration a]
where i = 1, 2, …, N, N is the number of aircraft, v_i denotes the speed of the ith aircraft in the XOY plane, ψ is the heading angle of the aircraft, and a denotes the acceleration of the aircraft. Considering the saturation constraints of the control inputs, the acceleration a and the heading angle ψ of the aircraft satisfy the following conditions:
[formula (2), not reproduced: saturation constraints bounding the acceleration a and the heading angle ψ]
where the specific constraint parameters depend on the aircraft model and its flight parameters.
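For illustration, a minimal discrete-time sketch of such a planar particle model is given below in Python. The explicit kinematics (ẋ = v·cosψ, ẏ = v·sinψ, v̇ = a) and the numerical saturation limits are assumptions of this sketch, since formulas (1) and (2) are given only as images in the original.

```python
import math
from dataclasses import dataclass

@dataclass
class AircraftState:
    x: float      # position in the XOY plane
    y: float
    v: float      # speed
    psi: float    # heading angle (rad)

def step(state: AircraftState, a: float, omega: float, dt: float,
         v_min=10.0, v_max=50.0, a_max=5.0, omega_max=0.5) -> AircraftState:
    """Advance the assumed particle model by one sampling period dt.

    a     : commanded acceleration, clipped to an assumed saturation limit
    omega : commanded heading angular velocity, clipped likewise
    The limits are placeholders; the patent only states that they depend on
    the aircraft model and its flight parameters.
    """
    a = max(-a_max, min(a_max, a))
    omega = max(-omega_max, min(omega_max, omega))
    v = max(v_min, min(v_max, state.v + a * dt))
    psi = state.psi + omega * dt
    x = state.x + v * math.cos(psi) * dt
    y = state.y + v * math.sin(psi) * dt
    return AircraftState(x, y, v, psi)
```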
2. Formation description
The queue form description method comprises the following steps:
B_i = {(B_mid, d_i, φ_i, v_tar, ψ_tar) | i = 1, 2, …, N}   (3)
where B_mid is the coordinate of the geometric center of the formation, d_i is the distance between the ith aircraft and the geometric center of the formation, φ_i is the bearing of the ith aircraft relative to the geometric center of the formation, v_tar is the speed of the target formation, and ψ_tar is the heading angle of the target formation.
3. Formation transformation cost constraint
(1) Aircraft kinematic constraints
Throughout the formation change, the heading angle and the heading angular velocity of the aircraft must remain within a certain range to satisfy the flight performance constraint J_uav of the aircraft, i.e.
ψ_min ≤ ψ ≤ ψ_max,  ψ̇_min ≤ ψ̇ ≤ ψ̇_max   (4)
where ψ_min and ψ_max are the minimum and maximum heading angles of the aircraft, and ψ̇_min and ψ̇_max are the minimum and maximum heading angular velocities of the aircraft.
(2) Temporal collaborative cost constraints
After the formation of the plurality of aircrafts is changed, the time for changing the formation of each aircraft is required to be the same. Thus, the cost of temporal synergy among the members of the formation can be expressed as
[formula (5), not reproduced: time coordination cost J_t, a function of the completion times Δt_i that is minimized when all aircraft complete the change simultaneously]
where J_t is the time coordination cost function, Δt_i is the time taken by the ith aircraft to complete the formation change, and t_0 is the moment at which the formation change begins.
(3) Spatial collaborative cost constraint
When the formation of the aircrafts forms the target formation after the formation is transformed, each aircraft is on a corresponding target point in the target formation, namely, the distance and the direction between each aircraft and the geometric center of the current formation meet the conditions of the target formation, and meanwhile, the speed and the course of each aircraft are consistent with the speed and the course of the target formation. Therefore, the spatial coordination cost of the formation transformation of the multiple aircraft formation is as follows:
[formula (6), not reproduced: spatial coordination cost J_s built from the speed, heading, distance and bearing errors of each aircraft with respect to the target formation]
where J_s is the spatial coordination cost function, Δv_{i,t} is the difference between the speed of the ith aircraft at time t and the speed of the target formation, Δψ_{i,t} is the difference between the heading of the ith aircraft at time t and the heading of the target formation, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation.
(4) Constraint of collision cost
The collision cost J_obs,i of the ith aircraft is divided into the static obstacle collision cost J_s_obs,i of the ith aircraft, the dynamic obstacle collision cost J_d_obs,i and the inter-aircraft collision cost J_uav,i. The overall formation collision cost constraint J_col is then:
[formula (7), not reproduced: overall collision cost J_col aggregating J_s_obs,i, J_d_obs,i and J_uav,i over all aircraft]
where the individual terms are:
[formula (8), not reproduced: piecewise definitions of J_s_obs,i, J_d_obs,i and J_uav,i in terms of the distances to static obstacles, dynamic obstacles and the other aircraft]
where d^i_{s,k} is the distance between the kth waypoint of the ith aircraft and the center of the static obstacle, R_{s,k} is the radius of the static obstacle threat zone at the kth waypoint, d^i_{d,k} is the distance between the kth waypoint of the ith aircraft and the dynamic obstacle, R_{d,k} is the radius of the dynamic obstacle threat circle at the kth waypoint, d_uav is the sum of the distances between the aircraft, (x_{d_obs}, y_{d_obs}) are the coordinates of the center of the static threat obstacle, and (x^i_k, y^i_k) are the coordinates of the kth waypoint of the ith aircraft.
(5) Minimum voyage cost constraint
The voyage flown by each aircraft in the formation to complete the formation transformation should be as small as possible, so the minimum voyage cost constraint is:
[formula (9), not reproduced: minimum voyage cost J_L summing the voyages L_i flown by the individual aircraft]
where J_L is the minimum voyage cost, L_i is the voyage flown by the ith aircraft to complete the formation change, and P^i_{k+1} is the (k+1)th waypoint of the ith aircraft.
(6) Formation transformation comprehensive cost constraint
The composite cost of the formation transformation of the multiple aircrafts is described as
J = W_1·J_t + W_2·J_s + W_3·J_col + W_4·J_L   (10)
where W_1, W_2, W_3 and W_4 are the corresponding weight coefficients, J_t is the time coordination cost, J_s is the spatial coordination cost, J_col is the overall formation collision cost, and J_L is the minimum voyage cost.
4. Dynamic target allocation algorithm design
The multi-aircraft dynamic target assignment algorithm may be described by the following model:
DTA=<B,U,T,F> (11)
where B is the set of task formations, B = (B_1, B_2, B_3, B_4); B_1 denotes the horizontal in-line formation, B_2 the longitudinal in-line formation, B_3 the inverted triangle formation and B_4 the triangle formation. U is the set of aircraft to which target points are to be assigned, U = (uav_1, uav_2, …, uav_n). T is the set of target points to be assigned under the current formation B_i, T = (T_1, T_2, …, T_n). F is the efficiency matrix of aircraft matched to their corresponding target points, of the form:
[formula (12), not reproduced: the n×n efficiency matrix F = [F_ij]]
where F_ij represents the efficiency of matching uav_i to target point T_j.
For convenience of describing the distance between members in the formation, the relative position relationship between two aircraft is represented by (dx, dy), wherein dx represents the transverse distance, and dy represents the longitudinal distance. Four formation forms in a dynamic target allocation algorithm are designed by taking formation composed of five aircrafts as an example.
(1) Horizontal in-line formation
In the horizontal in-line formation, as shown in fig. 2, the aircraft are arranged horizontally, so dy is 0. dx is adjusted according to actual needs. The horizontal in-line formation is mainly used for large-area search, the search range can be expanded by increasing the number of aircrafts or increasing the distance between aircrafts, and the efficiency of executing tasks is improved.
(2) Longitudinal in-line formation
In the vertical in-line formation, as shown in fig. 3, each aircraft is arranged vertically, so dx is 0. dy is adjusted according to actual needs. The longitudinal in-line formation is mainly used for tasks such as formation obstacle avoidance and the like.
(3) Reverse triangle formation
In the inverted triangle formation, as shown in fig. 4, the transverse distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The reverse triangle formation is mainly used for battle tasks such as interception and the like.
(4) Triangle formation
In the formation of triangles, as shown in fig. 5, the lateral distance between any two adjacent aircraft is dx, and the longitudinal distance is dy. The triangle formation is mainly used for battle tasks such as fire fighting.
In summary, the distances and orientations between the members in the formation form and the geometric center are shown in table 1.
TABLE 1 team formation library parameter settings
[Table 1, not reproduced: the distance d_i and bearing φ_i of each formation member relative to the geometric center for the four formation types]
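For concreteness, a five-aircraft formation library can be sketched as relative (dx, dy) offsets from which the distances d_i and bearings φ_i to the geometric center follow. The specific offsets below are illustrative assumptions, not the values of Table 1.

```python
import math

def formation_offsets(kind: str, dx: float = 50.0, dy: float = 50.0):
    """Return illustrative (distance, bearing) pairs of five aircraft relative
    to the formation geometric center for the formation types B1..B4.
    The spacings dx, dy are placeholders to be tuned per task."""
    if kind == "horizontal":          # B1: aircraft side by side, dy = 0
        pts = [(k * dx, 0.0) for k in (-2, -1, 0, 1, 2)]
    elif kind == "longitudinal":      # B2: aircraft in trail, dx = 0
        pts = [(0.0, k * dy) for k in (-2, -1, 0, 1, 2)]
    elif kind == "inverted_triangle": # B3: assumed wedge opening forward
        pts = [(0.0, 0.0), (-dx, -dy), (dx, -dy), (-2 * dx, -2 * dy), (2 * dx, -2 * dy)]
    elif kind == "triangle":          # B4: assumed wedge opening backward
        pts = [(0.0, 0.0), (-dx, dy), (dx, dy), (-2 * dx, 2 * dy), (2 * dx, 2 * dy)]
    else:
        raise ValueError(kind)
    # re-center so the geometric center is exactly at the origin
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    pts = [(x - cx, y - cy) for x, y in pts]
    # distance d_i and bearing phi_i of each member w.r.t. the geometric center
    return [(math.hypot(x, y), math.atan2(y, x)) for x, y in pts]
```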
To achieve formation keeping, the aircraft in a formation need to maintain communication with each other. If continuous communication cannot be maintained between aircraft, formation disorder and even collisions may occur. To achieve communication between members in a formation, a communication topology needs to be established.
In the four formation forms, each aircraft needs to communicate with other formation members to determine the position of the geometric center of the formation, so that the formations must communicate by adopting an intercommunicating structure. Based on the analysis, the formation network topology is designed to be a fully-connected topology structure.
Under the condition that no target node conflict exists between the aircraft, the target node allocation scheme is optimal when the sum of the efficiencies of all aircraft performing the formation transformation to their assigned target nodes is maximal, i.e. the optimal allocation scheme is described as:
max F = Σ_{i=1}^{N} ω_i·F_ii   (13)
where, when aircraft U_i is successfully matched to its target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0.
The efficiency function calculation formula is as follows:
[formula (14), not reproduced: efficiency function F_ij computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
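A brute-force sketch of the optimal assignment of formula (13) is shown below. The efficiency F_ij is assumed to decrease with the weighted distance and angle differences of formula (14) (whose exact form is not reproduced in the text), and the exhaustive search over permutations is only practical for small formations such as the five-aircraft example; the Hungarian algorithm would be the scalable alternative.

```python
import math
from itertools import permutations

def efficiency(uav_pos, uav_angle_to_center, target_pos, target_angle_to_center,
               xi1=1.0, xi2=1.0):
    """Assumed efficiency F_ij: larger when the aircraft is closer to the target
    point and when their bearings to the respective formation centers agree.
    The patent's formula (14) combines the same two terms with weights xi1, xi2."""
    d = math.dist(uav_pos, target_pos)
    dphi = abs(uav_angle_to_center - target_angle_to_center)
    return 1.0 / (1.0 + xi1 * d + xi2 * dphi)

def assign_targets(F):
    """Formula (13): choose the one-to-one assignment maximizing sum_i F[i][perm[i]]."""
    n = len(F)
    best, best_perm = -math.inf, None
    for perm in permutations(range(n)):
        s = sum(F[i][perm[i]] for i in range(n))
        if s > best:
            best, best_perm = s, perm
    return list(best_perm)   # best_perm[i] = index of the target assigned to aircraft i
```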
Fig. 1 is a flowchart of a deep reinforcement learning formation transformation method based on dynamic target allocation according to an embodiment of the present invention, where the method shown in fig. 1 includes:
S1: the formation transformation problem in an uncertain environment is modeled as a Markov decision process, and the state space (formula (15)), the action space (formula (17)) and the reward function (formula (18)) are designed. The optimal formation transformation route is obtained by solving this Markov decision process.
Wherein the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}   (15)
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar   (16)
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, and ψ_{i,t} is the heading angle of the ith aircraft at time t.
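A minimal sketch of the per-aircraft state of formulas (15)–(16) follows; encoding the formation pair B_jB_k as a numeric identifier is an assumption of this sketch.

```python
def build_state(formation_id, d_it, d_target, phi_it, phi_target,
                v_it, v_tar, psi_it, psi_tar):
    """State s_{i,t} = [BjBk, Δd, Δφ, Δv, Δψ] for the ith aircraft at time t."""
    return [formation_id,
            d_it - d_target,      # Δd_{i,t}
            phi_it - phi_target,  # Δφ_{i,t}
            v_it - v_tar,         # Δv_{i,t}
            psi_it - psi_tar]     # Δψ_{i,t}
```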
The motion space is as follows:
a_t = [v_u, ω_u]   (17)
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping.
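The normalized action of formula (17) can be sketched as a linear mapping between the physical ranges and [−1, 1]; the linear form is an assumption, since the mapping expressions are given as an image in the original.

```python
def to_physical(a_norm, lo, hi):
    """Map a normalized action in [-1, 1] back to the physical range [lo, hi]."""
    return lo + (a_norm + 1.0) * 0.5 * (hi - lo)

def decode_action(a_t, v_min, v_max, omega_min, omega_max):
    """a_t = [v_u, omega_u], both components in [-1, 1]."""
    v_u, omega_u = a_t
    return to_physical(v_u, v_min, v_max), to_physical(omega_u, omega_min, omega_max)
```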
The reward function is as follows:
[formula (18), not reproduced: the reward function, a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
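Because the piecewise expressions of formula (18) are given only as an image, the sketch below merely illustrates the structure described in the text: a space coordination reward r_s that reaches C_2 + C_3 when both the distance and the bearing conditions of the target formation are met (this is the termination test of step S13), plus an aggregation of the four reward terms. The thresholds, shapes and weights are all assumptions.

```python
def space_reward(delta_d, delta_phi, C2=10.0, C3=10.0, eps_d=1.0, eps_phi=0.05):
    """Assumed r_s: C2 once the distance error is small enough, plus C3 once the
    bearing error is small enough, so r_s == C2 + C3 signals the slot is reached."""
    r = 0.0
    if abs(delta_d) < eps_d:
        r += C2
    if abs(delta_phi) < eps_phi:
        r += C3
    return r

def total_reward(r_t, r_s, r_col, r_L, xi=(1.0, 1.0, 1.0, 1.0)):
    """Assumed aggregation of the four reward terms with weights xi1..xi4."""
    return xi[0] * r_t + xi[1] * r_s + xi[2] * r_col + xi[3] * r_L
```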
S2: Randomly initialize the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q).
Note: the DDPG network architecture consists of an online actor network, a target actor network, an online critic network and a target critic network.
The four neural network updating modes of the deep deterministic strategy gradient algorithm DDPG are as follows:
The online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}   (19)
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network.
The online critic network is updated by minimizing a loss function, where the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²   (20)
where y_i is the target value of the current action.
Here
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})   (21)
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
The DDPG algorithm updates the target network parameters in a soft updating mode; the target actor network and the target critic network are updated respectively as:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}   (22)
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
A behavior policy is introduced: random noise η_t is added to the output action of the online actor network, changing the deterministic action performed by the agent into a stochastic action a_t:
a_t = μ(s_t | θ^μ) + η_t   (23)
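The four network updates of formulas (19)–(23) correspond to a standard DDPG step. A condensed PyTorch sketch is given below; the network classes, the critic signature critic(s, a), the optimizers and the noise scale are assumptions of this sketch rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update: critic loss (20)-(21), actor gradient (19), soft update (22)."""
    s, a, r, s_next = batch  # tensors sampled from the experience pool

    # target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))          (21)
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # update the online critic by minimizing the mean squared error loss  (20)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # update the online actor along the deterministic policy gradient     (19)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft-update the target networks                                     (22)
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def explore_action(actor, s, noise_std=0.1):
    """Behavior policy of formula (23): a_t = mu(s_t) + eta_t."""
    with torch.no_grad():
        a = actor(s)
        return a + noise_std * torch.randn_like(a)
```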
S3: Initialize the target actor network μ′ and the target critic network Q′ with their weights θ^{μ′} and θ^{Q′}, and copy the parameters of the corresponding online networks to each target network.
S4: and initializing an experience pool and initializing a training environment.
S5: it is determined whether the number of training rounds has reached a maximum number of rounds and if so, processing proceeds to process 13. If not, go to S5.
S6: each aircraft starting with an initial formation, t0And the time begins to change the formation.
S7: Calculate the optimal distribution target point of each aircraft according to formulas (13) and (14); each aircraft flies toward its target point using the exploratory action of formula (23), while a detector detects friendly aircraft in the surroundings. If a friendly aircraft is detected, the process proceeds to S8; otherwise, the process repeats S7.
S8: and judging whether the aircraft needs to avoid the obstacle or collision according to the obstacle cones. If obstacle avoidance or collision avoidance is required, the process proceeds to S9, otherwise, the process proceeds to S7.
The obstacle avoidance strategy is described as follows:
As shown in fig. 6, when the relative velocity vector v_uo lies in the region ΔP_oP_uL_1, the heading angle needs to be increased within the collision avoidance time t_i so that α_i > α_o; when the relative velocity vector v_uo lies in the region ΔP_oP_uL_2, the heading angle needs to be decreased so that α_i > α_o.
Wherein, the collision avoidance time
Figure BDA0003415903140000182
αiThe relationship with α is determined by equation (11).
Figure BDA0003415903140000191
Wherein y 'and x' are
Figure BDA0003415903140000192
Wherein beta is a dynamic obstacle course angle.
Let the heading angle of an aircraft at a certain moment be α. In the next sampling period δt of the flight, the set A_α of possible heading angles is A_α = {α′ | α′ ∈ (α + ω_min·δt, α + ω_max·δt)}, and the set of relative velocity vectors generated from it is denoted V_uo. After the dynamic collision avoidance region VO has been calculated by the velocity barrier method, the relative velocity vectors in V_uo that would lead to a collision are removed, giving the relative velocities that successfully avoid the obstacle; the action is then selected from these to complete the obstacle avoidance.
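A sketch of this velocity-barrier screening follows: candidate heading angles reachable in the next sampling period are kept only if the resulting relative velocity points outside the collision cone toward the obstacle. The threat radius, the candidate discretization and the cone test details are assumptions of this sketch.

```python
import math

def heading_escapes_cone(p_uav, p_obs, v_obs, speed, heading, r_threat):
    """True if flying at `speed` along `heading` keeps the relative velocity
    outside the collision cone toward the obstacle (velocity-barrier test)."""
    dx, dy = p_obs[0] - p_uav[0], p_obs[1] - p_uav[1]
    dist = math.hypot(dx, dy)
    if dist <= r_threat:
        return False                      # already inside the threat circle
    half_angle = math.asin(r_threat / dist)
    v_rel = (speed * math.cos(heading) - v_obs[0],
             speed * math.sin(heading) - v_obs[1])
    # angle between the relative velocity and the line of sight to the obstacle
    ang = abs(math.atan2(v_rel[1], v_rel[0]) - math.atan2(dy, dx))
    ang = min(ang, 2 * math.pi - ang)
    return ang > half_angle

def admissible_headings(p_uav, p_obs, v_obs, speed, heading,
                        omega_min, omega_max, dt, r_threat, n=21):
    """Discretize the reachable heading set A_alpha and keep collision-free headings."""
    cands = [heading + omega_min * dt + k * (omega_max - omega_min) * dt / (n - 1)
             for k in range(n)]
    return [h for h in cands
            if heading_escapes_cone(p_uav, p_obs, v_obs, speed, h, r_threat)]
```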
The collision avoidance strategy is described as follows:
as shown in FIG. 6, the reciprocal velocity barrier cone RVO may be translated by the collision cone CC
Figure BDA0003415903140000193
Thus obtaining the product. In order to realize obstacle avoidance, uav is needed2Velocity vector of
Figure BDA0003415903140000194
Deflecting out a reciprocal velocity barrier cone RVO. Suppose when uav2Velocity vector of
Figure BDA0003415903140000195
Just deflecting out the reciprocal speedAngle of rotation of the right cone RVO is alphaRVOThe velocity vector is
Figure BDA0003415903140000196
Then according to the geometric relationship in the figure
Figure BDA0003415903140000197
In the formula (I), the compound is shown in the specification,
Figure BDA0003415903140000198
from the operational relationship between vectors
Figure BDA0003415903140000199
In the formula, | · the luminance | |2A two-norm representation of a is,
Figure BDA00034159031400001910
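A sketch of this reciprocal-velocity check between two friendly aircraft follows: the RVO is the collision cone translated by the average of the two velocities, and the deflection angle α_RVO is the extra rotation needed for the own velocity, seen from that apex, to just leave the cone. The apex translation by (v1 + v2)/2 follows the standard RVO construction; since the original derivation is given only as images, the details may differ.

```python
import math

def rvo_deflection_angle(p1, v1, p2, v2, r_threat):
    """Minimal extra heading rotation for aircraft 1 so that its velocity,
    seen from the RVO apex (v1 + v2)/2, leaves the reciprocal velocity cone.
    Returns 0.0 if the current velocity is already outside the cone."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    if dist <= r_threat:
        return math.pi                      # already in conflict, turn hard
    half_angle = math.asin(r_threat / dist)
    apex = ((v1[0] + v2[0]) / 2.0, (v1[1] + v2[1]) / 2.0)
    rel = (v1[0] - apex[0], v1[1] - apex[1])  # = (v1 - v2) / 2
    ang = abs(math.atan2(rel[1], rel[0]) - math.atan2(dy, dx))
    ang = min(ang, 2 * math.pi - ang)
    return max(0.0, half_angle - ang)       # alpha_RVO
```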
s9: and (4) calculating the heading angle of the aircraft needing to avoid the obstacle according to the formula (12) or the formula (28), selecting the action by each aircraft according to the formula (23), and entering the next state.
S10: in the next system state, the prize value is calculated according to equation (18).
S11: and storing the system state, the action, the reward value and the next system state at the moment into an experience pool as a group of tuple data.
S12: and randomly sampling batch tuple data from the experience pool, and updating the current criticic network, the current Actor network and the target network in sequence according to the formula (20), the formula (19) and the formula (22).
S13: judgment of rsWhether or not it is C2+C3If the condition is satisfied, the current round is ended, and the process goes to S5. If the condition is not satisfied, the flow proceeds to S7.
S14: and finishing training, and finishing team form transformation in the complex obstacle environment.
Fig. 8 is a schematic structural diagram of a deep reinforcement learning formation transformation system based on dynamic target allocation according to an embodiment of the present invention, and as shown in fig. 8, the system includes:
a state space, action space and reward function determination module 201, configured to determine a state space, an action space and a reward function;
a first initialization module 202, used for randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
a second initialization module 203, used for initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
a third initialization module 204, configured to initialize the experience pool and the training environment;
a first judging module 205, configured to judge whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the next module is executed;
a formation change module 206, used for each aircraft to start in an initial formation and to begin changing the formation at time t_0;
an optimal distribution target point calculation module 207, used for calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, the second judgment module is executed, and otherwise the previous module is returned to;
a second judgment module 208, used for judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone; if obstacle avoidance or collision avoidance is needed, the course angle calculation module is executed, and otherwise the optimal distribution target point calculation module is returned to;
a course angle calculation module 209, configured to calculate a course angle at which the aircraft needs to avoid the obstacle, select an action for each aircraft, and enter a next state;
a reward value calculation module 210, configured to calculate a reward value according to the reward function in a next state;
the storage module 211 is configured to store the system state, the action, the reward value, and the next system state at this time as a set of tuple data in the experience pool;
an updating module 212, configured to randomly sample a batch of tuple data from the experience pool and sequentially update the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module 213, used for judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to the first judging module, and if the condition does not hold, the process goes to the optimal distribution target point calculation module; r_s is the space coordination reward and C_2, C_3 are constants;
and an output module 214 finishes training and completes the formation transformation in the complex obstacle environment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A deep reinforcement learning formation transformation method based on dynamic target allocation is characterized by comprising the following steps:
s1: determining a state space, an action space and a reward function;
s2: randomly initializing the network parameters θ^μ of the online actor network μ(s | θ^μ) and the network parameters θ^Q of the online critic network Q(s, a | θ^Q);
s3: initializing the network parameters θ^{μ′} of the target actor network and the network parameters θ^{Q′} of the target critic network, and copying the network parameters of the online actor network and of the online critic network to the target actor network and the target critic network respectively;
s4: initializing an experience pool and a training environment;
s5: judging whether the number of training rounds has reached the maximum number of rounds, if so, executing step S13, and if not, proceeding to the next step;
s6: each aircraft starts in a certain initial formation and begins to change the formation at time t_0;
s7: calculating the optimal distribution target point of each aircraft, each aircraft exploring and flying toward its target point while a detector detects friendly aircraft in the surroundings; if a friendly aircraft is detected, executing step S8, and otherwise returning to the previous step;
s8: judging whether the aircraft needs obstacle avoidance or collision avoidance according to the obstacle cone, if so, executing the step S9, otherwise, returning to the step S7;
s9: calculating the course angle of the aircrafts needing to avoid the obstacle, selecting actions by each aircraft, and entering the next state;
s10: calculating the reward value in the next state according to the reward function;
s11: taking the system state, the action, the reward value and the next system state at the moment as a group of tuple data to be stored in an experience pool;
s12: randomly sampling a batch of tuple data from the experience pool, and sequentially updating the current Critic network, the current Actor network, the target Critic network and the target Actor network;
s13: judging whether r_s equals C_2 + C_3; if the condition holds, the current round ends and the process goes to step S5, and if the condition does not hold, the process goes to step S7;
s14: and finishing training, and completing the formation transformation in the complex obstacle environment.
2. The method for transforming the formation of deep reinforcement learning based on dynamic target allocation according to claim 1, wherein the expression of the state space is as follows:
s_{i,t} = {B_jB_k, Δd_{i,t}, Δφ_{i,t}, Δv_{i,t}, Δψ_{i,t}}
where B_jB_k denotes the change from the initial formation B_j to the target formation B_k, and Δd_{i,t}, Δφ_{i,t}, Δv_{i,t} and Δψ_{i,t} are given by:
Δd_{i,t} = d_{i,t} − d′_i,  Δφ_{i,t} = φ_{i,t} − φ′_i,  Δv_{i,t} = v_{i,t} − v_tar,  Δψ_{i,t} = ψ_{i,t} − ψ_tar
where d_{i,t} is the distance between the ith aircraft and the geometric center of the current formation at time t, d′_i is the distance between the target node corresponding to the ith aircraft and the geometric center of the target formation, φ_{i,t} is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, φ′_i is the bearing of the target node corresponding to the ith aircraft relative to the geometric center of the target formation, v_{i,t} is the speed of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_{i,t} is the heading angle of the ith aircraft at time t, and ψ_tar is the heading angle of the target formation;
the expression of the motion space is as follows:
a_t = [v_u, ω_u]
where v_max and v_min are the maximum and minimum speeds of the aircraft, ω_max and ω_min are the maximum and minimum angular velocities of the aircraft, v_u and ω_u are the speed and angular velocity of the aircraft mapped to the interval [−1, 1], and v and ω are the speed and angular velocity of the aircraft before mapping;
the expression of the reward function is as follows:
[reward function, not reproduced: a piecewise expression combining the time coordination reward r_t, the space coordination reward r_s, the collision avoidance and obstacle avoidance reward r_col and the minimum voyage reward r_L, using the constants C_1–C_4 and the weight coefficients ξ_1–ξ_4]
where r_t is the time coordination reward, r_s is the space coordination reward, r_col is the collision avoidance and obstacle avoidance reward, r_L is the minimum voyage reward, Δt_i is the time taken by the ith aircraft to complete the formation change, t_i is the moment at which the ith aircraft completes the formation change, t_0 is the moment at which the formation starts to change, v^i_t is the speed of the ith aircraft at time t, ψ^i_t is the heading of the ith aircraft at time t, v_tar is the speed of the target formation, ψ_tar is the heading angle of the target formation, d^i_t is the distance between the ith aircraft and the geometric center of the current formation at time t, d_i is the distance between the target position of the ith aircraft and the geometric center of the target formation, φ^i_t is the bearing of the ith aircraft relative to the geometric center of the current formation at time t, and φ_i is the bearing of the target position of the ith aircraft relative to the geometric center of the target formation; the reward also uses the collision avoidance heading calculated for the ith aircraft by the reciprocal velocity barrier method, the obstacle avoidance heading calculated for the ith aircraft by the velocity barrier method, the position P^i_t of the ith aircraft and the position P_ob of the obstacle; C_1, C_2, C_3 and C_4 are constants, and ξ_1, ξ_2, ξ_3 and ξ_4 are the corresponding weight coefficients.
3. The method for transforming the deep reinforcement learning formation based on the dynamic target allocation according to claim 1, wherein the following formula is specifically adopted for calculating the optimal allocation target point of each aircraft:
max F = Σ_{i=1}^{N} ω_i·F_ii
subject to a conflict-free one-to-one assignment of aircraft to target points,
where, when aircraft U_i is successfully matched to its assigned target point T_i, the efficiency function F_ii is counted into the total target-node efficiency and the corresponding weight ω_i = 1, otherwise ω_i = 0;
the efficiency function is calculated as:
[efficiency function, not reproduced: F_ij is computed from the distance difference Δd_ijt and the angle difference Δφ_ijt with weight coefficients ξ_1 and ξ_2]
where ξ_1 and ξ_2 are weight coefficients, Δd_ijt is the distance between the current position (x_it, y_it) of the ith aircraft at time t and its assigned target point T_j with coordinates (x_Tj, y_Tj), Δφ_ijt is the difference between the angle Δφ_tu_mid between the ith aircraft and the center point of the current formation at time t and the angle Δφ_T_mid between the target point T_j and the center point of the target formation, and (x_mid, y_mid) are the coordinates of the center point of the current formation.
4. The method as claimed in claim 1, wherein the calculation of the heading angle of the aircraft requiring obstacle avoidance specifically employs the following formula:
[formula not reproduced: expression for the avoidance heading angle α_RVO]
where α_RVO is the heading angle required to avoid the obstacle and v_u is the speed of the aircraft that needs to avoid the obstacle.
5. The method according to claim 1, wherein the aircraft selection actions specifically adopt the following formula:
a_t = μ(s_t | θ^μ) + η_t
where μ(s_t | θ^μ) is the online actor network and η_t is random noise.
6. The method for transforming the deep reinforcement learning formation based on dynamic target allocation according to claim 1, wherein the following formulas are specifically adopted for updating the current Critic network and the current Actor network:
the online actor network is updated along the policy gradient:
∇_{θ^μ}J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where N is the number of training samples, Q(s, a | θ^Q) is the online critic network, θ^Q is the parameter of the online critic network, μ(s | θ^μ) is the online actor network and θ^μ is the parameter of the online actor network;
the online critic network is updated by minimizing a loss function, wherein the loss function is:
L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
where y_i is the target value of the current action;
wherein
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′(s_{i+1} | θ^{μ′}) is the target actor network, θ^{μ′} is the parameter of the target actor network, θ^{Q′} is the parameter of the target critic network, and γ is the discount factor.
7. The method of claim 6, wherein the updating of the target Critic network and the target Actor network specifically comprises:
updating the target network parameters in a soft updating mode, wherein the updating modes of the target actor network and the target critic network are respectively:
θ^{μ′} = τ·θ^μ + (1 − τ)·θ^{μ′}
θ^{Q′} = τ·θ^Q + (1 − τ)·θ^{Q′}
wherein τ < 1.
8. A system for deep reinforcement learning formation transformation based on dynamic target allocation, the system comprising:
the state space, action space and reward function determining module is used for determining the state space, the action space and a reward function;
a first initialization module for randomly initializing the on-line operator network Q (s, a | theta)Q) Network parameter θ ofμAnd on-line criticic network mu (s | theta)μ) Network parameter θ ofQ
a second initialization module for initializing the network parameter θ^μ′ of the target actor network and the network parameter θ^Q′ of the target critic network by copying the network parameters of the online actor network and the online critic network to the target actor network and the target critic network, respectively;
the third initialization module is used for initializing an experience pool and a training environment;
the first judging module is used for judging whether the number of training rounds has reached the maximum number of rounds; if so, the third judging module is executed, and if not, the process returns to the previous module;
the formation transformation module is used for each aircraft to start from a given initial formation and begin transforming the formation at time t = 0;
the optimal allocation target point calculation module is used for calculating the optimal allocation target point of each aircraft; each aircraft explores and flies towards its target point while its detector detects own-side aircraft in the vicinity; if an own-side aircraft is detected, the second judging module is executed, and if not, the process returns to the previous module;
the second judging module is used for judging, according to the obstacle cone, whether the aircraft needs obstacle avoidance or collision avoidance; if so, the course angle calculation module is executed, and if not, the process returns to the optimal allocation target point calculation module;
the course angle calculation module is used for calculating the course angles of the aircraft requiring obstacle avoidance; each aircraft selects an action and enters the next state;
the reward value calculation module is used for calculating a reward value in the next state according to the reward function;
the storage module is used for storing the system state, the action, the reward value and the next system state as a group of tuple data into an experience pool;
the updating module is used for randomly sampling a batch of tuple data from the experience pool and sequentially updating the current critic network, the current actor network, the target critic network and the target actor network;
a third judging module for judging whether r_s equals C_2 + C_3; if the condition is satisfied, the current round ends and the process goes to the first judging module, and if not, the process goes to the optimal allocation target point calculation module, where r_s is the spatial cooperation reward and C_2, C_3 are constants;
and an output module for finishing the training and completing the formation transformation in a complex obstacle environment.
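Purely as an illustration of how the modules above fit together, the following skeleton runs the implied training flow: per episode the formation starts from its initial shape, targets are reallocated as the aircraft fly, avoidance overrides the course when needed, transitions are stored in the experience pool, and the networks are updated from sampled minibatches. Every helper name (env.reset, agent.assign_targets, replay_buffer.sample, ...) is a placeholder, not an interface defined by the patent.

def train(env, agent, replay_buffer, max_episodes, max_steps, batch_size=128):
    for episode in range(max_episodes):                   # first judging module
        state = env.reset()                               # formation transformation module (t = 0)
        for step in range(max_steps):
            agent.assign_targets(state)                   # optimal allocation target point module
            action = agent.select_action(state)           # course angle calculation / action selection
            next_state, reward, done = env.step(action)   # reward value calculation module
            replay_buffer.add((state, action, reward, next_state))   # storage module
            if len(replay_buffer) >= batch_size:
                agent.update(replay_buffer.sample(batch_size))        # updating module
            state = next_state
            if done:                                      # third judging module: r_s == C_2 + C_3
                break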
CN202111546506.9A 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation Active CN114237293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111546506.9A CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111546506.9A CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Publications (2)

Publication Number Publication Date
CN114237293A true CN114237293A (en) 2022-03-25
CN114237293B CN114237293B (en) 2023-08-25

Family

ID=80757404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111546506.9A Active CN114237293B (en) 2021-12-16 2021-12-16 Deep reinforcement learning formation transformation method and system based on dynamic target allocation

Country Status (1)

Country Link
CN (1) CN114237293B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190217476A1 (en) * 2018-01-12 2019-07-18 Futurewei Technologies, Inc. Robot navigation and object tracking
US20200174471A1 (en) * 2018-11-30 2020-06-04 Denso International America, Inc. Multi-Level Collaborative Control System With Dual Neural Network Planning For Autonomous Vehicle Control In A Noisy Environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN111880563A (en) * 2020-07-17 2020-11-03 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN111880567A (en) * 2020-07-31 2020-11-03 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI YUE; HAN WEI; ZHONG WEIGUO: "A Brief Analysis of Key Technologies for Track Control of Manned/Unmanned Aircraft Cooperative Systems", Unmanned Systems Technology, no. 04 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115576353A (en) * 2022-10-20 2023-01-06 北京理工大学 Aircraft formation control method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN114237293B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN110456823B (en) Double-layer path planning method aiming at unmanned aerial vehicle calculation and storage capacity limitation
CN110134140B (en) Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
Ali et al. Cooperative path planning of multiple UAVs by using max–min ant colony optimization along with cauchy mutant operator
CN111897316B (en) Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
Li et al. Trajectory planning for UAV based on improved ACO algorithm
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111240319A (en) Outdoor multi-robot cooperative operation system and method thereof
CN108398960B (en) Multi-unmanned aerial vehicle cooperative target tracking method for improving combination of APF and segmented Bezier
CN116257082B (en) Distributed active cooperative detection method for multiple unmanned aerial vehicles
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN114625150B (en) Rapid ant colony unmanned ship dynamic obstacle avoidance method based on danger coefficient and distance function
CN113848974A (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN114237293A (en) Deep reinforcement learning formation transformation method and system based on dynamic target allocation
CN114967721A (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN112001120B (en) Spacecraft-to-multi-interceptor autonomous avoidance maneuvering method based on reinforcement learning
Liu et al. Multiple UAV formations delivery task planning based on a distributed adaptive algorithm
CN113064422A (en) Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
Zhu et al. A cooperative task assignment method of multi-UAV based on self organizing map
CN115951711A (en) Unmanned cluster multi-target searching and catching method in high sea condition environment
CN115755975A (en) Multi-unmanned aerial vehicle cooperative distributed space searching and trajectory planning method and device
CN115933637A (en) Path planning method and device for substation equipment inspection robot and storage medium
Wu et al. A multi-critic deep deterministic policy gradient UAV path planning
CN115542921A (en) Autonomous path planning method for multiple robots
Ma et al. Strategy generation based on reinforcement learning with deep deterministic policy gradient for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant