CN113885576A - Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning - Google Patents
- Publication number
- CN113885576A (application CN202111267805.9A)
- Authority
- CN
- China
- Prior art keywords
- plane
- environment
- unmanned aerial
- aerial vehicle
- formation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Abstract
The invention relates to the technical field of multi-agent control, and particularly discloses an unmanned aerial vehicle formation control method based on deep reinforcement learning. It mainly discloses an unmanned aerial vehicle formation environment based on deep reinforcement learning and an unmanned aerial vehicle formation controller design based on double Q-learning, characterized by the following steps: 1) establishing a relative kinematics model of the unmanned aerial vehicle formation from the kinematics equations of the lead plane and the wing plane and the small-disturbance principle; 2) establishing an unmanned aerial vehicle formation environment that accords with actual conditions, comprising a state space, a wing plane action library (covering both speed and heading actions), command conversion, a reward function, and a termination condition, so that the environment can be ported to the verification of other algorithms; 3) designing a formation controller based on double Q-learning, which controls speed and heading simultaneously so that the wing plane follows the lead plane and maintains the desired formation. In practical application, corresponding wing plane commands can be formed according to the characteristics of the unmanned aerial vehicle itself, so as to meet the requirement of precise formation control.
Description
Technical Field
The invention relates to the technical field of reinforcement learning control of multiple agents, in particular to an unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning.
Background
An unmanned aerial vehicle can complete a preset task with only remote wireless control or a pre-set control program. Because of its advantages such as low cost, high flexibility, and mobility, unmanned aerial vehicles have been widely used in civilian and military applications. However, with the increasing complexity of environments and tasks, a single unmanned aerial vehicle can no longer meet actual use requirements, while a formation of multiple unmanned aerial vehicles retains the advantages of a single vehicle and adds characteristics such as wide area coverage and high reconnaissance and attack success rates. Drone formations are therefore becoming the primary vehicle for performing such tasks.
However, common modern control methods usually require an accurate model to design a formation controller, and accurately modeling the system is very difficult in practice; in addition, the application range of such control methods is limited by sensor errors, environmental disturbances, and similar influences. A reinforcement learning method is therefore introduced to realize intelligent control of unmanned aerial vehicle formation.
Among reinforcement learning algorithms, the double Q-learning algorithm, by virtue of its simplicity, ease of use, and good convergence, is currently widely applied in fields such as trajectory planning, cooperative decision-making, and single-vehicle control.
Disclosure of Invention
The invention provides an unmanned aerial vehicle formation environment construction and a formation controller design based on deep reinforcement learning, so that the wing plane can learn to follow the lead plane's speed and maintain a desired formation distance.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses an unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning, which comprises the following steps:
step 1, assuming that the lead plane and the wing plane both adopt the same autopilot model with a first-order speed holder and a second-order heading holder, the relative kinematics model of the unmanned aerial vehicle formation is obtained by linearization according to the small-disturbance principle;
step 2, designing a state space S, a wing plane action library a, an instruction conversion function and a reward function of a formation environment;
step 3, using the formation environment of step 2, a formation controller based on double Q-learning is established and trained in the designed environment; the input is the state S of the established environment and the output is the wing plane action, after which the action output by the controller is converted into a specific command and input to the wing plane.
Further, in step 1, the relative kinematics model of the formation of the unmanned aerial vehicle has the following specific expression:
in the formula (1), an inertial north-east-ground is taken as a basic coordinate system, any point on the ground is taken as an origin, the direction pointing to the north pole is an Ox axis, the direction pointing to the east is perpendicular to the Ox axis, and the direction pointing to the east is an Oy axis; l and F respectively represent a pilot model with a first-order speed retainer and a second-order course retainer which are the same; tau isvRepresents a velocity time constant; tau isψa,τψbRepresenting a course time constant; v represents drone speed; psi represents the drone heading angle; v. ofLc,vFcRespectively representing the speed commands of a long plane and a bureaucratic plane; psiLc,ψFcRespectively representing the course commands of the Youji and the Liao plane; x represents the distance between a long plane and a bureaucratic plane in the x direction; y represents the distance from a longicorn to a bureaucratic machine in the y direction; a is arctan (y)0/x0),x0,y0The distance between the fixed plane and the wing plane is the x and y direction.
Further, in step 2, the state space S of the environment, the wing plane action library a, the command conversion, and the reward function are designed. The method specifically comprises the following steps:
step 2.1, the y-direction distance between the long plane and the wing plane, the error between the actual distance and the expected distance, the relative speed and the integral thereof, and the relative course angle and the integral thereof are selected as a joint state space S, and the corresponding expression is as follows:
in the formula (2), the reaction mixture is,ev=vL-vFthe relative speed of a long plane and a bureaucratic plane; e.g. of the typeψ=ψL-ψFIs the relative course angle of a long plane and a bureaucratic plane; e.g. of the typey=yd-y is the error of the desired y-direction distance from the actual y-direction distance; y isdA desired y-direction distance;
step 2.2, a bureaucratic action library a is established; wherein the wing-plane action library a comprises wing-plane speed actions a1And wing plane course action a2Wing plane velocity action a1Comprising deceleration, uniform speed and acceleration, wing plane course action a2Including left yaw, constant heading, and right yaw.
Establishing a bureaucratic action library a with the expression as follows:
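To make the discrete action design of step 2.2 concrete, the following sketch enumerates the nine joint actions (the expression itself appears only as an image in the original); the action names come from the description, while the tuple representation and indexing are illustrative assumptions.

```python
from itertools import product

# Wing plane speed sub-actions a1 and heading sub-actions a2, as named in step 2.2.
SPEED_ACTIONS = ("decelerate", "constant_speed", "accelerate")
HEADING_ACTIONS = ("left_yaw", "constant_heading", "right_yaw")

# The joint action library a = (a1, a2): 3 x 3 = 9 discrete actions,
# indexed so that a single Q-network output can select one pair.
ACTION_LIBRARY = list(product(SPEED_ACTIONS, HEADING_ACTIONS))
```

With this layout, the controller's output is simply an integer in [0, 8] that indexes `ACTION_LIBRARY`.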
and 2.3, converting a design command, converting the action of a bureaucratic machine in the formula (3) into a speed and course command and adding amplitude limitation.
Formulas (4) and (5) give the command conversion under the different actions: v_F represents the current speed of the wing plane, and the speed command corresponding to each action a_1 is v_d; ψ_F is the current heading angle of the wing plane, and the heading-angle command corresponding to each action a_2 is ψ_d; [v_min, v_max] represents the wing plane speed range; [−ψ_max, ψ_max] represents the wing plane heading-angle range;
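A minimal sketch of the command conversion with amplitude limiting described above. Since formulas (4) and (5) are not reproduced in the text, the per-step increments `dv` and `dpsi` and the default limit values are illustrative assumptions (the limits match the embodiment's [30, 70] m/s and [−20, 20]° ranges).

```python
def convert_command(a1, a2, v_F, psi_F,
                    dv=2.0, dpsi=2.0,
                    v_min=30.0, v_max=70.0, psi_max=20.0):
    """Convert a discrete wing plane action (a1, a2) into clipped commands.

    dv, dpsi are assumed per-step increments; the patent only states that the
    commands are amplitude-limited to [v_min, v_max] and [-psi_max, psi_max].
    """
    delta_v = {"decelerate": -dv, "constant_speed": 0.0, "accelerate": dv}[a1]
    delta_psi = {"left_yaw": -dpsi, "constant_heading": 0.0, "right_yaw": dpsi}[a2]
    v_d = min(max(v_F + delta_v, v_min), v_max)            # clipped speed command
    psi_d = min(max(psi_F + delta_psi, -psi_max), psi_max)  # clipped heading command
    return v_d, psi_d
```

For example, an "accelerate" action near the upper speed limit saturates at v_max rather than exceeding it.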
step 2.4, designing a reward function r
In the formula (6)The speed instruction at the last moment;the course angle instruction at the last moment; t isSIs the sampling time; t is the proceeding time; and designing an environment ending condition:
in the formula [ ymin,ymax]Is the formation sets the y-direction minimum and maximum distances.
Further, in step 3, the formation environment of step 2 is used to establish a formation controller based on double Q-learning, which is trained in the designed environment; the method specifically comprises the following steps:
the controller comprises a memory base and a neural network model, wherein the memory base is used for storing interactive information, the input of the neural network model is a state space S for establishing an environment, and the output of the neural network model is a wing plane action;
the neural network model comprises two networks with the same structure and different parameters, namely a main network and a target network, wherein the parameters are theta and theta respectively-(ii) a The main network outputs all action estimation values Q, and the target network outputs a target value y;
in each training, initializing the environment state of formation to obtain state S, inputting it into main network, outputting bureaucratic actions and inputting them into environment, and converting the actions output by controller into specific commands v by means of expressions (4) and (5)d,ψdThen input into the wing plane to obtain new state S _ and instant reward r of wing plane, and will<S,a,r,S_>Storing in a memory bank;
when the memory bank is full, extracting a certain amount of samples to train the neural network model, wherein the expression of the neural network target value and the loss function is as follows:
y = r + γQ(S_, argmax_a Q(S_, a|θ) | θ⁻) (8)
L(θ) = E[(y − Q(S, a|θ))²] (9)
equation (8) represents the target value, equation (9) represents the loss function, and γ represents the discount rate. The expression for training the neural network parameters by using the gradient descent method is as follows:
θ-←θ (11)
equation (10) (11) represents neural network parameter update, equation (10) represents updating the main network parameter according to the gradient descent method, and a is the learning rate; equation (11) represents that the master network parameters are copied to the target network after a certain number of steps; the above process is repeated until the training is finished.
The beneficial effects of the invention are as follows: the invention designs an unmanned aerial vehicle formation flight environment and a formation controller based on deep reinforcement learning; the controller enables the wing plane to independently learn the optimal strategy and output the optimal action, so that the wing plane follows the lead plane's speed and keeps the desired distance. The method can effectively improve the intelligence of the unmanned aerial vehicle and eliminate the formation's distance and speed errors, giving the formation good formation-keeping capability and good portability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a control scheme of the process of the present invention.
Figure 2 is a graph of the speeds of the lead plane and the wing plane in an embodiment of the invention.
Fig. 3 is a graph of the y-direction distance between the lead plane and the wing plane in an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the technical solutions of the present invention, the present invention will be further described in detail with reference to the following detailed description.
The invention discloses an unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning, wherein a control structure diagram is shown in figure 1, and the method comprises the following steps:
Step 1, assuming that the lead plane and the wing plane both adopt the same autopilot model with a first-order speed holder and a second-order heading holder, the expression is:
where v represents the drone speed; v_c represents the speed command; τ_v represents the speed time constant; ψ represents the drone heading angle; ψ_c represents the heading-angle command; τ_ψa, τ_ψb represent the heading time constants. Linearization according to the small-disturbance principle then yields the relative kinematics model of the unmanned aerial vehicle formation, with the corresponding expression:
in the formula (2), an inertial north-east-ground is used as a basic coordinate system, any point on the ground is used as an origin, the direction pointing to the north pole is an Ox axis, the direction pointing to the east on the ground is perpendicular to the Ox axis, and the direction pointing to the east is an Oy axis. L and F respectively represent a pilot model with a first-order speed retainer and a second-order course retainer which are the same; tau isvRepresents a velocity time constant; tau isψa,τψbRepresenting a course time constant; v represents drone speed; psi represents the drone heading angle; v. ofLc,vFcRespectively representing the speed commands of a long plane and a bureaucratic plane; psiLc,ψFcRespectively representing the course commands of the Youji and the Liao plane; x represents the distance between a long plane and a bureaucratic plane in the x direction; y represents the distance from a longicorn to a bureaucratic machine in the y direction; a is arctan (y)0/x0),x0,y0The distance between the fixed plane and the wing plane is the x and y direction.
Step 2, the state space of the environment, the wing plane action library, the command conversion, the reward function, and the termination condition are designed.
Step 2.1, the y-direction distance between the lead plane and the wing plane, the error between the actual and desired distance, the relative speed and its integral, and the relative heading angle and its integral are selected as the joint state space S, with the corresponding expression:
In formula (3), e_v = v_L − v_F is the relative speed of the lead plane and the wing plane; e_ψ = ψ_L − ψ_F is their relative heading angle; e_y = y_d − y is the error between the desired and actual y-direction distance; y_d is the desired y-direction distance.
Step 2.2, the wing plane action library a = (a_1, a_2) is established, with the expression:
Step 2.3, the command conversion is designed: the actions in formula (4) are converted into speed and heading commands with amplitude limiting, with the expressions:
equations (5) and (6) show the command conversion in different actions, vFRepresenting the current speed of a wing plane, different actions a1The lower corresponding speed command is vd;ψFA current course angle of a wing plane, different actions a2The lower corresponding course angle instruction is psid。[vmin,vmax]Represents a range of bureaucratic velocities; [ -psi [ -phi ]max,ψmax]Representing a range of wing aircraft course angles.
Step 2.4, the reward function r is designed:
In formula (7), the first two quantities are the speed command at the previous moment and the heading-angle command at the previous moment; T_S is the sampling time; t is the elapsed time. The environment termination condition is also designed:
in the formula [ ymin,ymax]Is the formation sets the y-direction minimum and maximum distances.
Step 3, using the formation environment of step 2, a formation controller based on the double Q-learning algorithm is established and learns in the designed environment. The controller comprises a memory bank and a neural network model; the memory bank stores the interaction information, the input of the neural network model is the state space S of the established environment, and its output is the wing plane action.
The neural network model comprises two networks with the same structure but different parameters, namely a main network and a target network, with parameters θ and θ⁻ respectively. The main network outputs the estimated values Q of all actions, and the target network outputs the target value y.
The specific process is as follows: in each training episode, the formation environment state is initialized to obtain the state S, which is input to the main network; the output wing plane action is input to the environment, where expressions (5) and (6) convert the action output by the controller into the specific commands v_d, ψ_d, which are then input to the wing plane to obtain the wing plane's new state S_ and reward r; the tuple <S, a, r, S_> is stored in the memory bank.
When the memory bank is full, extracting a certain amount of samples to train the neural network model, wherein the expression of the neural network target value and the loss function is as follows:
y = r + γQ(S_, argmax_a Q(S_, a|θ) | θ⁻) (9)
L(θ) = E[(y − Q(S, a|θ))²] (10)
equation (9) represents the target value, equation (10) represents the loss function, and the expression for training the neural network parameters by using the gradient descent method is as follows:
θ-←θ (12)
equation (11) and (12) represent neural network parameter updates, equation (11) represents updating the master network parameters according to the gradient descent method, and a is the learning rate. Equation (12) represents that the master network parameters are copied to the target network after a certain number of steps. The above process is repeated until the training is finished.
In the numerical simulation verification of the embodiment, the wing plane speed range is set to [30, 70] m/s and the heading-angle range to [−20, 20]°. At the initial moment, the speeds of the lead plane and the wing plane are both 50 m/s, the heading angle is 0°, and they fly forward keeping x- and y-direction distances of 500 m. The lead plane command and the desired formation distance are then changed, and the results are shown in fig. 2 and fig. 3.
From the above simulation results it can be seen that when the lead plane's speed changes, the wing plane follows it well, and the speed error is kept substantially within 0.1, which accords with the reward function design. At the same time, the wing plane changes the y-direction spacing by adjusting its heading angle: from the initial spacing of 500 m it tracks to the desired distance of 250 m, and after the command is changed it tracks to 300 m. The wing plane can independently learn the optimal strategy without prior knowledge, and the results show the effectiveness of the designed double Q-learning controller.
The invention establishes the unmanned aerial vehicle formation motion environment according to actual flight conditions; the environment is consistent with actual conditions and can be directly ported to other algorithms for training and learning. The invention designs an unmanned aerial vehicle formation flight environment and a formation controller based on double Q-learning; the controller controls speed and heading simultaneously, controlling the wing plane to track the lead plane and maintain the desired distance.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (4)
1. An unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, assuming that the lead plane and the wing plane both adopt the same autopilot model with a first-order speed holder and a second-order heading holder, the relative kinematics model of the unmanned aerial vehicle formation is obtained by linearization according to the small-disturbance principle;
step 2, designing a state space S, a wing plane action library a, an instruction conversion function and a reward function of a formation environment;
step 3, using the formation environment of step 2, a formation controller based on double Q-learning is established and trained in the designed environment; the input is the state S of the established environment and the output is the wing plane action, after which the action output by the controller is converted into a specific command and input to the wing plane.
2. The method for establishing the unmanned aerial vehicle formation relative kinematics model according to claim 1, wherein the specific expression is as follows:
in the formula (1), an inertial north-east-ground is taken as a basic coordinate system, any point on the ground is taken as an origin, the direction pointing to the north pole is an Ox axis, the direction pointing to the east is perpendicular to the Ox axis, and the direction pointing to the east is an Oy axis; l and F respectively represent a long plane and a wing plane, bothA driver model with the same first-order speed retainer and second-order course retainer is adopted; tau isvRepresents a velocity time constant; tau isψa,τψbRepresenting a course time constant; v represents drone speed; psi represents the drone heading angle; v. ofLc,vFcRespectively representing the speed commands of a long plane and a bureaucratic plane; psiLc,ψFcRespectively representing the course commands of the Youji and the Liao plane; x represents the distance between a long plane and a bureaucratic plane in the x direction; y represents the distance from a longicorn to a bureaucratic machine in the y direction; a is arctan (y)0/x0),x0,y0The distance between the fixed plane and the wing plane is the x and y direction.
3. The method according to claim 2, characterized in that the state space S, the wing plane action library a, the command conversion, the reward function r, and the termination condition of the formation environment are designed on the basis of the established relative kinematics model, with the corresponding expression:
In formula (2), e_v = v_L − v_F is the relative speed of the lead plane and the wing plane; e_ψ = ψ_L − ψ_F is their relative heading angle; e_y = y_d − y is the error between the desired and actual y-direction distance; y_d is the desired y-direction distance;
the wing plane action library a = (a_1, a_2) is established, with the expression:
the command conversion is designed: the wing plane action in formula (3) is converted into speed and heading commands, and amplitude limiting is added;
formulas (4) and (5) give the command conversion under the different actions: v_F represents the current speed of the wing plane, and the speed command corresponding to each action a_1 is v_d; ψ_F is the current heading angle of the wing plane, and the heading-angle command corresponding to each action a_2 is ψ_d; [v_min, v_max] represents the wing plane speed range; [−ψ_max, ψ_max] represents the wing plane heading-angle range;
the reward function r is designed:
In formula (6), the first two quantities are the speed command at the previous moment and the heading-angle command at the previous moment; T_S is the sampling time; t is the elapsed time.
Design environment end conditions:
in the formula [ ymin,ymax]Is the formation sets the y-direction minimum and maximum distances.
4. The unmanned aerial vehicle formation environment establishment method based on deep reinforcement learning according to claim 3, characterized in that a formation controller is established based on double Q-learning and trained in the designed environment, specifically comprising the following steps:
the controller comprises a memory base and a neural network model, wherein the memory base is used for storing interactive information, the input of the neural network model is a state space S for establishing an environment, and the output of the neural network model is a wing plane action;
the neural network model comprises two networks with the same structure and different parameters, namely a main network and a target network, wherein the parameters are theta and theta respectively-(ii) a The main network outputs all action estimation values Q, and the target network outputs a target value y;
in each training episode, the formation environment state is initialized to obtain the state S, which is input to the main network; the output wing plane action is input to the environment, where expressions (4) and (5) convert the action output by the controller into the specific commands v_d, ψ_d, which are then input to the wing plane to obtain the wing plane's new state S_ and reward r; the tuple <S, a, r, S_> is stored in the memory bank;
when the memory bank is full, extracting a certain amount of samples to train the neural network model, wherein the expression of the neural network target value and the loss function is as follows:
y = r + γQ(S_, argmax_a Q(S_, a|θ) | θ⁻) (8)
L(θ) = E[(y − Q(S, a|θ))²] (9)
formula (8) represents the target value and formula (9) represents the loss function, where γ represents the discount rate; the neural network parameters are trained by the gradient descent method, with the expressions:
θ⁻ ← θ (11)
formulas (10) and (11) represent the neural network parameter updates: formula (10) updates the main network parameters by gradient descent, where α is the learning rate; formula (11) copies the main network parameters to the target network after a certain number of steps; the above process is repeated until training finishes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111267805.9A CN113885576A (en) | 2021-10-29 | 2021-10-29 | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111267805.9A CN113885576A (en) | 2021-10-29 | 2021-10-29 | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113885576A true CN113885576A (en) | 2022-01-04 |
Family
ID=79014237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111267805.9A Pending CN113885576A (en) | 2021-10-29 | 2021-10-29 | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113885576A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104331548A (en) * | 2014-10-24 | 2015-02-04 | National University of Defense Technology | Method for planning flight actions of an unmanned aerial vehicle based on workflow |
CN110007688A (en) * | 2019-04-25 | 2019-07-12 | Xidian University | Distributed formation method for unmanned aerial vehicle clusters based on reinforcement learning |
CN110502034A (en) * | 2019-09-04 | 2019-11-26 | National University of Defense Technology | Fixed-wing unmanned aerial vehicle cluster control method based on deep reinforcement learning |
CN110502033A (en) * | 2019-09-04 | 2019-11-26 | National University of Defense Technology | Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning |
CN111077909A (en) * | 2019-12-31 | 2020-04-28 | Beijing Institute of Technology | Self-organizing, self-consistent optimization control method for unmanned aerial vehicles based on visual information |
US20200156243A1 (en) * | 2018-11-21 | 2020-05-21 | Amazon Technologies, Inc. | Robotics application simulation management |
CN111240353A (en) * | 2020-01-07 | 2020-06-05 | Nanjing University of Aeronautics and Astronautics | Unmanned aerial vehicle collaborative air combat decision method based on genetic fuzzy trees |
CN111857184A (en) * | 2020-07-31 | 2020-10-30 | National University of Defense Technology | Fixed-wing unmanned aerial vehicle cluster control collision avoidance method and device based on deep reinforcement learning |
CN111880563A (en) * | 2020-07-17 | 2020-11-03 | Northwestern Polytechnical University | Multi-unmanned-aerial-vehicle task decision method based on MADDPG |
CN111880565A (en) * | 2020-07-22 | 2020-11-03 | University of Electronic Science and Technology of China | Q-learning-based cluster cooperative countermeasure method |
CN111880567A (en) * | 2020-07-31 | 2020-11-03 | National University of Defense Technology | Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning |
CN112162564A (en) * | 2020-09-25 | 2021-01-01 | Nanjing University | Unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning algorithms |
Non-Patent Citations (10)
Title |
---|
HADO VAN HASSELT et al.: "Deep Reinforcement Learning with Double Q-Learning", Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 17 February 2016, pages 2094 * |
HUAN HU et al.: "Proximal policy optimization with an integral compensator for quadrotor control", Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 5, 31 May 2020, pages 777, XP037144867, DOI: 10.1631/FITEE.1900641 * |
ZEZHI SUI et al.: "Formation Control with Collision Avoidance through Deep Reinforcement Learning", IJCNN 2019 International Joint Conference on Neural Networks, 19 July 2019, pages 1, XP033621746, DOI: 10.1109/IJCNN.2019.8851906 * |
WANG Xiaoyan et al.: "Robust H∞ Controller Design for Three-Dimensional Formation Flight of Unmanned Aerial Vehicles", Control and Decision, vol. 27, no. 12, 31 December 2012, pages 1907 * |
XIANG Xiaojia et al.: "Coordinated Control Method for Fixed-Wing UAV Formations Based on Deep Reinforcement Learning", Acta Aeronautica et Astronautica Sinica, vol. 42, no. 4, 25 April 2021, pages 524009-1 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rubí et al. | A survey of path following control strategies for UAVs focused on quadrotors | |
CN110502033B (en) | Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning | |
Chen et al. | Path planning for multi-UAV formation | |
CN112162564B (en) | Unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning algorithms | |
CN106774400B (en) | Unmanned aerial vehicle three-dimensional track guidance method based on inverse dynamics | |
CN110531786B (en) | Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN | |
WO2006113173A1 (en) | Decentralized maneuver control in heterogeneous autonomous vehicle networks | |
CN112198886B (en) | Unmanned aerial vehicle control method for tracking maneuvering target | |
Nie et al. | Three-dimensional path-following control of a robotic airship with reinforcement learning | |
Stevšić et al. | Sample efficient learning of path following and obstacle avoidance behavior for quadrotors | |
CN112684781A (en) | Multi-agent distributed model prediction control method and system | |
CN114089776B (en) | Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning | |
CN116225055A (en) | Unmanned aerial vehicle autonomous flight path planning algorithm based on state decomposition in complex environment | |
CN114935943A (en) | Unmanned aerial vehicle and unmanned vehicle cluster formation tracking control method and system | |
CN113900440B (en) | Unmanned aerial vehicle control law design method and device and readable storage medium | |
Yang et al. | A decentralised control strategy for formation flight of unmanned aerial vehicles | |
Montella et al. | Reinforcement learning for autonomous dynamic soaring in shear winds | |
CN116954258A (en) | Hierarchical control method and device for multi-four-rotor unmanned aerial vehicle formation under unknown disturbance | |
Duoxiu et al. | Proximal policy optimization for multi-rotor UAV autonomous guidance, tracking and obstacle avoidance | |
CN115617039B (en) | Event triggering-based distributed affine unmanned aerial vehicle formation controller construction method and unmanned aerial vehicle formation control method | |
Cordeiro et al. | Nonlinear controller and path planner algorithm for an autonomous variable shape formation flight | |
CN113885576A (en) | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning | |
Stingu et al. | An approximate dynamic programming based controller for an underactuated 6DOF quadrotor | |
CN115237150A (en) | Fixed-wing formation obstacle avoidance method | |
Verma et al. | A novel trajectory tracking methodology using structured adaptive model inversion for uninhabited aerial vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||