CN117826860A - Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning - Google Patents


Info

Publication number
CN117826860A
CN117826860A (application CN202410239788.5A)
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
control strategy
strategy
equation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410239788.5A
Other languages
Chinese (zh)
Inventor
刘昊
刘德元
任梓铭
钟森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202410239788.5A priority Critical patent/CN117826860A/en
Publication of CN117826860A publication Critical patent/CN117826860A/en
Pending legal-status Critical Current


Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The application provides a method for determining a fixed-wing unmanned aerial vehicle control strategy based on reinforcement learning, relating to the technical field of flight control and comprising the following steps: constructing an augmentation system according to the reference signal and a dynamic model of the fixed-wing unmanned aerial vehicle; deriving the Bellman equation and an expression for the optimal control strategy from the augmentation system and the cost function; reconstructing the augmentation system based on the strategy iteration method in reinforcement learning, and determining a strategy iteration equation by combining the cost function, the reconstructed augmentation system, the Bellman equation and the expression for the optimal control strategy; applying an initial control strategy and an initial reference signal over a preset time period, and recording the tracking errors; substituting the initial control strategy and the tracking errors into the strategy iteration equation; and obtaining the optimal control strategy when the iterative solution converges. The reinforcement learning algorithm is applied to solving the control strategy of the fixed-wing unmanned aerial vehicle, and the optimal control strategy can be solved using only a set initial control strategy and measurable tracking errors, thereby improving the control effect.

Description

Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning
Technical Field
The application relates to the technical field of flight control, in particular to a method for determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning.
Background
The flight control technology of the fixed-wing unmanned aerial vehicle is one of the key technologies in an unmanned aerial vehicle system. However, the fixed-wing unmanned aerial vehicle is a complex controlled object that combines multiple variables, uncertainty, nonlinearity, fast time variation, strong coupling, static instability and underactuation, and its flight control has long been a focus and difficulty of research in the aviation field.
At present, in conventional control methods of classical control theory such as the PID control algorithm, the tuning of parameters depends on the engineer's own experience and is therefore very cumbersome. Researchers have also applied methods from modern control theory to unmanned aerial vehicle control systems. However, neither traditional linear gain control methods nor modern control methods consider the optimality of the control system, and an evaluation index for the control optimality of the flight control system is lacking, so the control effect is not ideal. Meanwhile, the unmanned aerial vehicle is a complex controlled object, and its optimal control problem still faces many challenges due to strong coupling, strongly nonlinear dynamics, uncertain model parameters and similar factors. For example, conventional optimal control methods often linearize the unmanned aerial vehicle before solving for the optimal control strategy, without studying its nonlinear dynamics, so the control strategy fails when the unmanned aerial vehicle performs a large maneuver. As another example, a Hamilton-Jacobi-Bellman (HJB) equation can be constructed and solved to obtain an optimal control strategy, but because of the nonlinear dynamics of the unmanned aerial vehicle and the uncertainty of its model parameters, direct solution is difficult to realize: only the nominal system of the unmanned aerial vehicle can be solved, which yields the optimal control of the nominal system rather than of the actual unmanned aerial vehicle system, so the control effect is likewise not ideal.
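To see concretely why direct solution of the HJB equation requires exact model knowledge, it helps to recall the linear-quadratic special case, in which the HJB equation reduces to an algebraic Riccati equation. This is standard optimal-control background, not material taken from the application:

```latex
% For linear dynamics \dot{x} = Ax + Bu and cost
% J = \int_0^\infty \left( x^\top Q x + u^\top R u \right) dt,
% try a quadratic value function V(x) = x^\top P x. The HJB equation
%   0 = \min_u \left[ x^\top Q x + u^\top R u + \nabla V^\top (Ax + Bu) \right]
% then yields the minimizing control and the algebraic Riccati equation:
u^* = -R^{-1} B^\top P x, \qquad
A^\top P + P A - P B R^{-1} B^\top P + Q = 0 .
```

Both A and B appear explicitly in the Riccati equation, which is precisely the model dependence that a data-driven reinforcement-learning formulation avoids.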
Disclosure of Invention
In view of this, the present application aims to provide a method for determining a control strategy of a fixed-wing unmanned aerial vehicle based on reinforcement learning, which combines an intelligent algorithm and an optimal control theory, applies the reinforcement learning algorithm to solving the control strategy of the fixed-wing unmanned aerial vehicle, and can solve the optimal control strategy of the unmanned aerial vehicle only by using a set initial control strategy and a measurable tracking error, thereby improving the control effect of the fixed-wing unmanned aerial vehicle.
The embodiment of the application provides a method for determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning, wherein the control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal; the determining method comprises the following steps:
constructing an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and a dynamic model of the fixed-wing unmanned aerial vehicle;
deriving the Bellman equation and an expression for the optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle; the cost function is defined according to the control target of the control strategy;
Reconstructing an augmentation system of the fixed wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning;
determining the strategy iteration equation to be solved by combining the cost function, the reconstructed augmentation system, the Bellman equation and the expression for the optimal control strategy;
applying an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system in a preset time period, and counting tracking errors of the fixed wing unmanned aerial vehicle relative to the initial reference signal in the preset time period; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy;
substituting the initial control strategy and tracking error into the strategy iteration equation, and carrying out iteration solution on the strategy iteration equation;
and when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
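The evaluate-and-improve structure behind the strategy iteration equation can be illustrated on the linear-quadratic case, where policy iteration is Kleinman's algorithm: each iteration solves a Lyapunov equation for the cost of the current policy and then improves the policy from that cost. This is a model-based sketch for intuition only; the application's method is model-free and works from measured tracking errors, and the example system and all names below are illustrative assumptions.

```python
import numpy as np

def lyap(Ac, S):
    """Solve Ac^T P + P Ac + S = 0 via the Kronecker-product identity."""
    n = Ac.shape[0]
    I = np.eye(n)
    M = np.kron(I, Ac.T) + np.kron(Ac.T, I)
    P = np.linalg.solve(M, -S.reshape(-1)).reshape(n, n)
    return (P + P.T) / 2  # symmetrize against round-off

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Kleinman policy iteration for the continuous-time LQR problem."""
    K = K0
    for _ in range(iters):
        Ac = A - B @ K                   # closed loop under current policy
        P = lyap(Ac, Q + K.T @ R @ K)    # policy evaluation (Lyapunov equation)
        K = np.linalg.solve(R, B.T @ P)  # policy improvement
    return K, P

# Double-integrator example; the known optimal gain is K = [1, sqrt(3)].
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K0 = np.array([[1.0, 1.0]])              # any stabilizing initial policy
K, P = policy_iteration(A, B, Q, R, K0)
```

As in the application's method, the iteration needs only a stabilizing initial policy; here the model (A, B) replaces the measured data that the model-free variant would use.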
The embodiment of the application also provides a fixed wing unmanned aerial vehicle control strategy determining device based on reinforcement learning, wherein the control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal; the determining device includes:
The construction module is used for constructing an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and the dynamic model of the fixed-wing unmanned aerial vehicle;
the derivation module is used for deriving the Bellman equation and an expression for the optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle; the cost function is defined according to the control target of the control strategy;
the reconstruction module is used for reconstructing an augmentation system of the fixed-wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning;
the determining module is used for determining the strategy iteration equation to be solved by combining the cost function, the reconstructed augmentation system, the Bellman equation and the expression for the optimal control strategy;
the control module is used for applying an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system in a preset time period, and counting tracking errors of the fixed wing unmanned aerial vehicle relative to the initial reference signal in the preset time period; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy;
the solving module is used for substituting the initial control strategy and the tracking error into the strategy iteration equation and carrying out iteration solving on the strategy iteration equation;
And when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
The embodiment of the application also provides electronic equipment, which comprises: the system comprises a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, the processor and the memory are communicated through the bus when the electronic device is running, and the machine-readable instructions are executed by the processor to execute the steps of the method for determining the fixed wing unmanned aerial vehicle control strategy based on reinforcement learning.
Embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a method for determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning as described above.
According to the method for determining the control strategy of the fixed-wing unmanned aerial vehicle based on reinforcement learning, provided by the embodiment of the application, the reinforcement learning algorithm is applied to solving the control strategy of the fixed-wing unmanned aerial vehicle by combining the intelligent algorithm and the optimal control theory, and the optimal control strategy of the unmanned aerial vehicle can be solved only by using the set initial control strategy and the measurable tracking error, so that the control effect on the fixed-wing unmanned aerial vehicle can be improved.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic structural diagram of a control system of a unmanned aerial vehicle according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an attitude control layer according to an embodiment of the present application;
FIG. 3 illustrates a flow chart of a method for determining a fixed-wing unmanned aerial vehicle control strategy based on reinforcement learning provided by an embodiment of the present application;
fig. 4 shows a schematic view of a flight state of a fixed wing unmanned aerial vehicle according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing a relative relationship between an airflow coordinate system and a machine body coordinate system according to an embodiment of the present disclosure;
FIGS. 6 (a) to 6 (e) are diagrams showing experimental results of a simulation experiment provided in the examples of the present application;
fig. 7 is a schematic structural diagram of a determining device of a fixed-wing unmanned aerial vehicle control strategy based on reinforcement learning according to an embodiment of the present application;
fig. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment that a person skilled in the art would obtain without making any inventive effort is within the scope of protection of the present application.
Research has shown that the flight control technology of the fixed-wing unmanned aerial vehicle is one of the key technologies in an unmanned aerial vehicle system. As described in the background above, existing classical, modern and optimal control methods either lack an evaluation index for control optimality, fail under large maneuvers after linearization, or can solve the HJB equation only for the nominal system rather than the actual unmanned aerial vehicle system, so their control effect is not ideal.
Based on the above, the embodiment of the application provides a method for determining a control strategy of a fixed-wing unmanned aerial vehicle based on reinforcement learning, which combines an intelligent algorithm and an optimal control theory, applies the reinforcement learning algorithm to solving the control strategy of the fixed-wing unmanned aerial vehicle, and can solve the optimal control strategy of the unmanned aerial vehicle only by using a set initial control strategy and a measurable tracking error, thereby improving the control effect of the fixed-wing unmanned aerial vehicle. Wherein the control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic structural diagram of an unmanned aerial vehicle control system according to an embodiment of the present application; fig. 2 is a schematic structural diagram of an attitude control layer according to an embodiment of the present application. As shown in fig. 1 and 2, the unmanned aerial vehicle control system comprises a position control layer and an attitude control layer; the position controller in the position control layer inversely solves for the attitude reference signal according to the received position reference signal, the position state quantity of the fixed-wing unmanned aerial vehicle and the position control strategy, and sends the attitude reference signal to the attitude control layer.

The attitude control layer comprises an angle controller and an angular rate controller; the angle controller, the angular rate controller and the fixed-wing unmanned aerial vehicle form a cascade attitude control loop. The two controllers jointly control the attitude of the fixed-wing unmanned aerial vehicle to follow the attitude reference signal: the angle controller acts according to the attitude reference signal, the angle state quantity of the fixed-wing unmanned aerial vehicle and the angle control strategy, and the angular rate controller acts according to the attitude reference signal, the angular rate state quantity and the angular rate control strategy.
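The cascade just described can be sketched as three nested maps, each layer consuming the reference produced by the layer above it. The proportional laws, gains and call signatures below are illustrative placeholders, not the controllers the application later learns:

```python
import numpy as np

def make_p_controller(gain):
    """A placeholder proportional law: command = gain * (reference - state)."""
    return lambda ref, state: gain * (np.asarray(ref) - np.asarray(state))

# Position layer -> attitude reference -> angular-rate reference -> moment
position_ctrl = make_p_controller(0.5)   # outputs an attitude reference
angle_ctrl    = make_p_controller(2.0)   # outputs an angular-rate reference
rate_ctrl     = make_p_controller(8.0)   # outputs a control moment

def cascade_step(pos_ref, state):
    """One pass through the position / angle / angular-rate cascade."""
    att_ref  = position_ctrl(pos_ref, state["position"])
    rate_ref = angle_ctrl(att_ref, state["angles"])
    moment   = rate_ctrl(rate_ref, state["rates"])
    return moment

state = {"position": np.zeros(3), "angles": np.zeros(3), "rates": np.zeros(3)}
moment = cascade_step(np.array([1.0, 0.0, 0.0]), state)
```

The point of the structure is that only the innermost loop issues a physical control moment; the outer layers only reshape references, which is what allows each loop's control strategy to be learned separately.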
Referring to fig. 3, fig. 3 is a flowchart of a method for determining a fixed-wing unmanned aerial vehicle control strategy based on reinforcement learning according to an embodiment of the present application. As shown in fig. 3, the determining method provided in the embodiment of the present application includes:
s101, constructing an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and a dynamic model of the fixed-wing unmanned aerial vehicle. S102, deducing an expression of a Belman equation and an optimal control strategy according to an augmentation system and a cost function of the fixed-wing unmanned aerial vehicle; the cost function is obtained according to the control target definition of the control strategy; s103, reconstructing an augmentation system of the fixed-wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning; s104, determining a strategy iteration equation to be solved by combining the cost function, the reconstructed augmentation system, the Belman equation and the expression of the optimal control strategy; s105, applying an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system in a preset time period, and counting tracking errors of the fixed wing unmanned aerial vehicle relative to the initial reference signal in the preset time period; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy; s106, substituting the initial control strategy and the tracking error into the strategy iteration equation, and carrying out iteration solution on the strategy iteration equation; and when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
Here, the dynamics model of the fixed-wing unmanned aerial vehicle is constructed as follows. In the first step, the dynamics model is constructed based on the following five basic assumptions: (1) the effects of the earth's rotation and of ground curvature on flight dynamics are neglected; (2) the effect of the wind field on flight dynamics is neglected, so the airspeed of the unmanned aerial vehicle in the embodiments of the present application equals its ground speed; (3) the layout of the drone is assumed to be a plane-symmetric configuration, and the mass distribution is also plane-symmetric; (4) the body is regarded as a rigid body whose mass does not change with time, and influences such as the elasticity of the unmanned aerial vehicle body and the vibration of parts are ignored; (5) the unmanned aerial vehicle adopts a bank-to-turn (BTT) maneuvering mode, and the lateral overload caused by the sideslip angle is ignored when calculating the position dynamics. Fig. 4 is a schematic diagram of a flight state of the fixed-wing unmanned aerial vehicle according to the embodiment of the present application, and fig. 5 is a schematic diagram of the relative relation between the airflow coordinate system and the body coordinate system according to the embodiment of the present application.
In order to conveniently express the components of vectors such as the position, velocity, acceleration and aerodynamic force of the fixed-wing unmanned aerial vehicle, four coordinate systems are constructed. (1) Ground inertial coordinate system: the origin is taken at a point on the ground in a low-latitude area and the frame is fixed to the ground, so it can be regarded as an inertial system; the z-axis points plumb down, the x-axis points north and the y-axis points east, together forming a right-handed coordinate system. (2) Body coordinate system: the origin is taken at the centroid of the unmanned aerial vehicle and the frame is fixed to the vehicle; the x-axis coincides with the fuselage axis and points to the nose, the y-axis is perpendicular to the symmetry plane of the unmanned aerial vehicle and points to the right side of the fuselage, and the z-axis lies in the symmetry plane, perpendicular to the x-axis and pointing below the fuselage. (3) Airflow coordinate system: the origin is taken at the centroid of the unmanned aerial vehicle and the frame is fixed to the vehicle; the x-axis coincides with the velocity vector of the unmanned aerial vehicle and shares its direction, the z-axis lies in the symmetry plane, perpendicular to the x-axis and pointing below the belly, and the y-axis points to the right side of the fuselage, forming a right-handed rectangular coordinate system with the other two axes.
(4) Track coordinate system: the origin is taken at the centroid of the unmanned aerial vehicle and the frame is fixed to the vehicle; the x-axis is aligned with the velocity direction of the unmanned aerial vehicle, the z-axis lies in the vertical plane containing the flight velocity vector, perpendicular to the x-axis and pointing downward, and the y-axis completes the right-handed rectangular coordinate system. Thirdly, a six-degree-of-freedom dynamics model of the fixed-wing unmanned aerial vehicle is constructed, comprising two parts: position dynamics and attitude dynamics. First, the position dynamics model of the unmanned aerial vehicle is built as follows:
$$\dot{x} = V\cos\gamma\cos\chi,\qquad \dot{y} = V\cos\gamma\sin\chi,\qquad \dot{z} = -V\sin\gamma,$$
$$\dot{V} = g\,(n_x-\sin\gamma),\qquad \dot{\gamma} = \frac{g}{V}\,(n_z\cos\mu-\cos\gamma),\qquad \dot{\chi} = \frac{g\,n_z\sin\mu}{V\cos\gamma} \tag{1}$$
where x, y are the coordinates of the unmanned aerial vehicle barycenter in the ground inertial coordinate system, γ and χ are respectively the track inclination angle and heading angle of the airflow coordinate system relative to the ground inertial coordinate system, n_z denotes the normal overload generated by the angle of attack, and n_x denotes the axial overload generated by the thrust of the unmanned aerial vehicle. It should be noted that the unmanned aerial vehicle adopts a bank-to-turn maneuvering mode, so the sideslip angle is small, and the lateral overload caused by the sideslip angle is ignored when calculating the position dynamics. μ is the roll angle, which determines the direction of the normal overload in the plane perpendicular to the velocity vector. Secondly, for the attitude dynamics of the unmanned aerial vehicle, based on the assumption that the unmanned aerial vehicle is a plane-symmetric aircraft, the moment of inertia matrix of the fixed-wing unmanned aerial vehicle in the body coordinate system can be determined as:
$$J=\begin{bmatrix} J_x & 0 & -J_{xz}\\ 0 & J_y & 0\\ -J_{xz} & 0 & J_z \end{bmatrix}$$
where, under the plane-symmetry assumption, J_xz is the only nonzero product of inertia.
The unmanned aerial vehicle attitude dynamics model is built accordingly; the attitude dynamics comprise angular rate dynamics and angle dynamics. The angular rate dynamics give the dynamic equation, in the body coordinate system, of the angular rate of rotation of the unmanned aerial vehicle relative to the inertial system:
(2)
where the three angular rate components are the projections of the rotation angular rate of the unmanned aerial vehicle on the body coordinate axes, the moment components are the projections of the control moment on the respective body axes, and the inertia coefficients are composed of the moments and products of inertia and satisfy:
the angle dynamic equation gives the attack angle of the unmanned planeSide slip angle->And roll angle->Is an expression of (2). The attack angle and the sideslip angle can be given by the relative relation between the airflow coordinate system and the machine body coordinate system, and the dynamic angle equation is as follows:
in the method, in the process of the invention,for unmanned aerial vehicle quality, +.>Acceleration of gravity, ++>Is the speed of the unmanned aerial vehicle, +.>For air density->For unmanned aerial vehicle wing reference area,/->For lift coefficient>For engine thrust +.>Is the lateral force coefficient.
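The angular rate dynamics of equation (2) correspond, in vector form, to the standard rigid-body Euler equation J ω̇ = M − ω × Jω on the body axes. A numeric sketch, with illustrative inertia values for a plane-symmetric airframe in which only the xz product of inertia is nonzero (the values and symbol names are assumptions, not taken from the application):

```python
import numpy as np

# Illustrative inertia values (kg*m^2) for a plane-symmetric airframe
Jx, Jy, Jz, Jxz = 1.2, 1.5, 2.0, 0.1
J = np.array([[  Jx, 0.0, -Jxz],
              [ 0.0,  Jy,  0.0],
              [-Jxz, 0.0,   Jz]])

def omega_dot(omega, M):
    """Angular acceleration from Euler's equation: J*w' = M - w x (J*w)."""
    omega = np.asarray(omega, dtype=float)
    M = np.asarray(M, dtype=float)
    return np.linalg.solve(J, M - np.cross(omega, J @ omega))
```

Because of the off-diagonal J_xz term, a pure rolling moment also produces yaw acceleration, which is the roll-yaw coupling the plane-symmetric inertia matrix captures.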
The normal overload n_z in the unmanned aerial vehicle position dynamic equation is related to the angle of attack α, and the relation depends on the specific configuration of the aircraft. To facilitate the study of the problem, assume that the normal overload n_z and the angle of attack α satisfy the relation:
(4)
in summary, the formula (1), the formula (2) and the formula (3) together form a six-degree-of-freedom dynamics model of the fixed-wing unmanned aerial vehicle.
The following describes the control target of controlling the fixed-wing unmanned aerial vehicle to follow the reference signal in the embodiment of the present application. Taking a one-on-one unmanned aerial vehicle air combat scene as an example, the control strategy of the fixed-wing unmanned aerial vehicle focuses on the position-following problem: the control strategy of the unmanned aerial vehicle control system needs to control the fixed-wing unmanned aerial vehicle to follow the received position reference signal. Because of the characteristics of the fixed-wing unmanned aerial vehicle, controlling its position state quantity requires adjusting the attitude state quantity, i.e. position following is realized under the cooperative control of the position control layer and the attitude control layer. Specifically, the position-following problem of the unmanned aerial vehicle in the air combat scene is considered under the condition that its flying height is kept unchanged, and the position controller is designed to perform following control so that the controlled unmanned aerial vehicle and the target unmanned aerial vehicle keep their relative positions. In this case, according to equation (1), the position kinematic model of the fixed-wing unmanned aerial vehicle may be expressed as:
(5)
where the heading change angular rate, abbreviated as the heading angular rate, satisfies the following relation obtained by combining formula (4):
(6)
It can be seen that the magnitude of the heading angular rate depends on the roll angle μ and the angle of attack α in the unmanned aerial vehicle attitude dynamics. Meanwhile, in order to keep its height unchanged, the unmanned aerial vehicle must also satisfy:
(7)
combining equation (6) with equation (7), it can be seen that when the unmanned aerial vehicle is given a heading angular rate signalAnd combining with the BTT banked hypothesis condition, the attitude reference signal of the unmanned aerial vehicle can be uniquely determined.
For the unmanned aerial vehicle position-following problem, a pilot-following method is adopted in the embodiment of the application. In the ground inertial system, the own position state is taken as the local position state, and the reference signal for the position state depends on the position state of the followed target and the desired relative position with respect to it. To facilitate the design of the controller, the above position states are transferred into the own body coordinate system, and the error between the local position state and the position reference signal can then be expressed as:
(8)
where the left-hand side is the position-following error in the body coordinate system.
Combining the unmanned aerial vehicle kinematic model (5) and differentiating equation (8) with respect to time yields the unmanned aerial vehicle position-following dynamic equation:
(9)
in the method, in the process of the invention,respectively representing the course angular velocity and the flying speed of the machine, < >>Respectively representing the course angular rate and the flying speed of the following target unmanned aerial vehicle.
In summary, the design objective of the position controller is, with the unmanned aerial vehicle model parameters unknown, to learn an optimal position control strategy using only the control input and the state information, so that the unmanned aerial vehicle position-following error e_p approaches zero. The position controller is designed later based on the reinforcement learning algorithm. For the attitude control layer of the unmanned aerial vehicle system, the control objective is to make the angle of attack α, sideslip angle β, and roll angle φ of the unmanned aerial vehicle track given attitude reference signals. The attitude control layer in the embodiment of the present application is divided into inner and outer control loops, an angular-rate control loop and an angle control loop, respectively; as shown in fig. 2, the angle controller, the angular-rate controller, and the fixed-wing unmanned aerial vehicle form a cascade attitude control loop. The controllers of the two control loops are designed separately later based on reinforcement learning algorithms.
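The cascade structure described above can be sketched as two nested maps: the outer angle controller converts an angle tracking error into an angular-rate command, and the inner angular-rate controller converts the rate tracking error into a control moment. The proportional laws in the usage example are placeholders, not the reinforcement-learned controllers of the embodiments:

```python
def cascade_attitude_step(angle_ref, angle, rate, angle_ctrl, rate_ctrl):
    """One update of a cascade attitude loop: the outer angle controller
    turns the angle tracking error into an angular-rate command, and the
    inner rate controller turns the rate tracking error into a moment."""
    rate_cmd = angle_ctrl(angle_ref - angle)   # outer loop: angle -> rate command
    moment = rate_ctrl(rate_cmd - rate)        # inner loop: rate -> control moment
    return moment
```

For example, with simple proportional laws `angle_ctrl = lambda e: 2*e` and `rate_ctrl = lambda e: 3*e`, a unit angle error with zero current rate yields a moment of 6.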
In a first embodiment, the angular-rate controller of the angular-rate control loop is designed, and the determined control strategy is the angular-rate control strategy. For step S101, the augmentation system of the fixed-wing unmanned aerial vehicle is constructed from the reference signal and the dynamic model of the fixed-wing unmanned aerial vehicle. The dynamics model of the fixed-wing unmanned aerial vehicle with respect to the angular-rate state quantity in equation (2) above can be re-expressed in the form of the following dynamic equation:
(10)  ẋ_Ω = f(x_Ω) + g(x_Ω)u_Ω,  y_Ω = x_Ω
where the angular-rate state quantity is x_Ω = [p, q, r]ᵀ, whose components are the projections of the unmanned aerial vehicle rotation angular rate ω on the axes of the body coordinate frame; u_Ω denotes the control moment; y_Ω denotes the system output; f(·) denotes the system matrix and g(·) the input matrix, whose explicit expressions follow from the rotational dynamics in equation (2).
Note that in the angular-rate dynamic model the system matrix f(·) exhibits nonlinearity and coupling, and the control matrix g(·) likewise couples the control quantities. The reference signal for the angular rate is assumed to satisfy the following dynamic equation:
(11)  ẋ_d = ζ_Ω(x_d),  y_d = x_d
where x_d denotes the reference state, ζ_Ω(·) denotes a smooth function to be determined, and y_d denotes the output of the reference signal. Combining equation (10) with equation (11) yields the angular-rate dynamic augmentation system:
(12)  Ẋ = F(X) + G(X)u_Ω
where the state quantity X = [eᵀ, x_dᵀ]ᵀ, F(·) denotes the augmentation system matrix, G(·) the augmentation input matrix, and e = x_Ω − x_d denotes the angular-rate tracking error.
The control objective of the angular-rate controller is to make the angular rate of the unmanned aerial vehicle track the reference angular rate. To drive the angular-rate tracking error toward zero and realize optimal control of the unmanned aerial vehicle rotation angular rate, the cost function is defined as:
(13)  V(X(t)) = ∫_t^∞ e^{−γ(τ−t)} [e(τ)ᵀ Q e(τ) + u_Ω(τ)ᵀ R u_Ω(τ)] dτ
where Q and R are both positive-definite real constant matrices and γ > 0 is a discount factor. The objective is to seek the optimal control strategy u_Ω* that minimizes the value of the cost function.
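For intuition, the discounted quadratic cost of equation (13) can be approximated on a sampled trajectory by a Riemann sum; the array shapes and names here are illustrative:

```python
import numpy as np

def discounted_cost(e_traj, u_traj, Q, R, gamma, dt):
    """Riemann-sum approximation of the discounted quadratic cost
    V = integral of exp(-gamma*t) * (e'Qe + u'Ru) dt over a sampled
    trajectory; e_traj and u_traj have one row per sample time."""
    t = np.arange(len(e_traj)) * dt
    stage = (np.einsum('ti,ij,tj->t', e_traj, Q, e_traj)
             + np.einsum('ti,ij,tj->t', u_traj, R, u_traj))
    return float(np.sum(np.exp(-gamma * t) * stage) * dt)
```

As a sanity check, for a scalar error e(t) = exp(-t) with zero input and Q = 1, the undiscounted cost is 1/2, and with discount γ = 2 it is 1/4.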
For step S102, according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle, deriving an expression of a bellman equation and an optimal control strategy, when the control strategy is the angular rate control strategy in the angular rate controller, differentiating a value function expression (13) corresponding to the angular rate, and simultaneously combining the expression (12) of the augmentation system, thereby obtaining the bellman equation corresponding to the angular rate:
(14)  γV(X) = ∇Vᵀ(F(X) + G(X)u_Ω) + eᵀQe + u_Ωᵀ R u_Ω
where ∇V denotes the partial derivative of the angular-rate cost function with respect to the augmented state quantity X.
Each control strategy u_Ω corresponds to a cost function V. Let u_Ω* be the optimal control strategy; the corresponding optimal cost function V* satisfies the Hamilton-Jacobi-Bellman (HJB) equation:
(15)  γV*(X) = min_{u_Ω} [∇V*ᵀ(F(X) + G(X)u_Ω) + eᵀQe + u_Ωᵀ R u_Ω]
Since the optimal control strategy u_Ω* must satisfy the first-order stationarity condition of the minimization in (15), the expression of the optimal control strategy is obtained as:
(16)  u_Ω* = −(1/2) R^{−1} G(X)ᵀ ∇V*
The following demonstrates that the optimal control strategy (16) asymptotically stabilizes the closed-loop control system of the fixed-wing unmanned aerial vehicle. Substituting equation (16) into equation (14) gives:
(17)  γV* = ∇V*ᵀ F(X) − (1/4) ∇V*ᵀ G R^{−1} Gᵀ ∇V* + eᵀQe
Multiplying both sides of the equation by e^{−γt} and rearranging gives:
(18)  d/dt (e^{−γt} V*(X)) = −e^{−γt} (eᵀQe + u_Ω*ᵀ R u_Ω*) ≤ 0
It can be seen that when the discount factor γ approaches zero, V* acts as a Lyapunov function and the augmentation system (12) is asymptotically stable. When γ is nonzero, the related literature shows that if the weight matrix Q is chosen sufficiently large relative to the discount factor γ, the augmentation system remains locally asymptotically stable.
For the solution of HJB equation (15), conventional strategy iterations can be evaluated by the bellman equation:
(19)
wherein, superscriptRepresenting the iteration number, the policy promotion takes place in the form:
(20)
and then, carrying out next round of evaluation and lifting on the lifted strategy until the convergence condition is met. It can be seen that the iterative process depends on the dynamic parameter information of the system model, such as And->. However, due to the influence of factors such as uncertain parameters, dynamic parameter information used in solving the control strategy in the prior art is actually the nominal system dynamics, so that the iterative solution realizes the learning of optimal control of the nominal system, and the determined control strategy has poor control effect on an actual unmanned aerial vehicle system. Thus (2)According to the embodiment of the application, the strategy iteration method based on reinforcement learning is considered, the different strategy reinforcement learning algorithm is designed, and the optimal angular rate control strategy can be learned under the condition that detailed unmanned aerial vehicle model dynamic parameter information is not needed.
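For a linear system with quadratic cost and zero discount, the evaluation step (19) reduces to a Lyapunov equation and the improvement step (20) to a gain update, which is the classical Kleinman iteration. The sketch below is that linear-quadratic special case, stated as an assumption for illustration, not the patent's nonlinear discounted scheme:

```python
import numpy as np

def lyap(Ac, M):
    """Solve Ac' P + P Ac = -M for P via the Kronecker/vec formulation."""
    n = Ac.shape[0]
    I = np.eye(n)
    Kr = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    return np.linalg.solve(Kr, -M.flatten()).reshape(n, n)

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Model-based policy iteration (linear-quadratic specialization of the
    evaluation/improvement steps): evaluate the current stabilizing gain K
    through a Lyapunov equation, then improve it with K <- R^{-1} B' P."""
    K = K0
    for _ in range(iters):
        P = lyap(A - B @ K, Q + K.T @ R @ K)   # policy evaluation
        K = np.linalg.solve(R, B.T @ P)        # policy improvement
    return P, K
```

Starting from any stabilizing gain, the iterates converge monotonically to the algebraic-Riccati solution; for the double integrator with Q = I and R = 1, the limit is P = [[sqrt(3), 1], [1, sqrt(3)]] and K = [1, sqrt(3)].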
Specifically, for step S103, the augmented dynamic system (12) for the angular rate is first rewritten, giving the reconstructed augmentation system:
(21)  Ẋ = F(X) + G(X)u_Ω^(i) + G(X)(u_Ω − u_Ω^(i))
where u_Ω denotes the control strategy actually applied to the system and u_Ω^(i) denotes the strategy of the i-th iteration; the reconstructed augmentation system is equivalent to the original augmentation system.
Further, in implementation, step S104 may include: differentiating the cost function along the trajectories of the reconstructed augmentation system, and substituting the Bellman equation and the expression of the optimal control strategy so as to eliminate the dynamic model parameters of the unmanned aerial vehicle control system, which yields the differential expression of the cost function:
(22)  d/dτ [e^{−γ(τ−t)} V^(i)(X)] = e^{−γ(τ−t)} [−eᵀQe − u_Ω^(i)ᵀ R u_Ω^(i) − 2u_Ω^(i+1)ᵀ R (u_Ω − u_Ω^(i))]
Integrating both sides of the equation over the interval [t, t+Δt] yields the strategy iteration equation:
(23)  e^{−γΔt} V^(i)(X(t+Δt)) − V^(i)(X(t)) = ∫_t^{t+Δt} e^{−γ(τ−t)} [−eᵀQe − u_Ω^(i)ᵀ R u_Ω^(i) − 2u_Ω^(i+1)ᵀ R (u_Ω − u_Ω^(i))] dτ
where Δt is the integration time interval.
Thus, given a fixed actual control strategy u_Ω according to the strategy iteration equation, and based on the measured state quantity X, equation (23) can be solved; that is, the i-th-generation strategy is evaluated and improved to obtain V^(i) and u_Ω^(i+1). Notably, the differential expression (22) and the strategy iteration equation (23) obtained by the above substitution eliminate the unknown dynamic model parameters of the system, namely F(X) and G(X). Therefore, the method for determining the optimal control strategy provided by the embodiment of the present application can, under uncertain model parameters, solve the optimal control strategy using only the measurable state quantity and the set input data, that is, from the measurable tracking error and the set initial control strategy.
In summary, for steps S105 and S106, the reinforcement learning algorithm for the control strategy in the fixed-wing unmanned aerial vehicle angular-rate control loop according to the embodiment of the present application is as follows. Step one: over a predetermined period, apply to the system the initial control strategy input u_Ω0 = u_b + e_n and an initial reference signal, where u_b denotes a fixed basic control strategy that can stabilize the system and e_n denotes the exploration noise. Meanwhile, collect the required data over the predetermined period, including the system tracking error and the integral values associated with the initial control strategy input. Step two: using the given strategy to be iterated u_Ω^(i) and the data collected in step one, iteratively solve the Bellman equation (23) to obtain V^(i) and u_Ω^(i+1). Step three: set i to i + 1 and return to step two for the next round of iterative solution until the iteration stopping condition is met; when the iteration converges, the optimal control strategy of the fixed-wing unmanned aerial vehicle angular rate is obtained.
In practical application of the algorithm, note that during the iterative solution the value function V^(i) and the policy function u_Ω^(i+1) to be solved in equation (23) are generally nonlinear and difficult to solve directly, so neural networks can be used for function fitting. In a specific implementation, in each iteration of the iterative solution, the expressions of the value function and the optimal control strategy are first fitted with neural networks, giving a value fitting neural network and an optimal-control-strategy fitting neural network:
(24)  V̂^(i)(X) = Ŵ_cᵀ φ(X),  û_Ω^(i+1)(X) = Ŵ_aᵀ ψ(X)
where V̂^(i) and û_Ω^(i+1) are the estimates of the neural networks; φ(X) and ψ(X) are polynomial basis functions; N_c and N_a denote the numbers of neurons of the two neural networks; and Ŵ_c and Ŵ_a are weight matrices. The unknown weights are stacked into a single parameter vector, and the constant matrix R in the performance function is assumed to be diagonal.
Next, the value fitting neural network and the optimal-control-strategy fitting neural network of equation (24) are substituted back into the strategy iteration equation (23) to obtain:
(25)  δ = e^{−γΔt} Ŵ_cᵀ φ(X(t+Δt)) − Ŵ_cᵀ φ(X(t)) + ∫_t^{t+Δt} e^{−γ(τ−t)} [eᵀQe + û_Ω^(i)ᵀ R û_Ω^(i) + 2 ψ(X)ᵀ Ŵ_a R (u_Ω − û_Ω^(i))] dτ
where δ denotes the Bellman estimation error and the columns of the weight matrix Ŵ_a correspond to the components of the control strategy.
Then the least-squares method is used to minimize the Bellman estimation error, the weight parameters of the value fitting neural network and of the optimal-control-strategy fitting neural network in equation (24) are solved, and the networks are updated for the next iteration. When δ approaches zero, equation (25) can be rewritten as:
(26)  Φ W = Θ
where Φ is the regression matrix assembled from the collected integral data, W stacks the unknown weight parameters of the two networks, and Θ is the corresponding data vector.
In the data collection stage within the predetermined period, data are collected at time points t_1 through t_M, giving M segments of Φ and Θ in total. For the equation to have a unique solution, it is implicitly required that the number of data segments be no less than the number of unknown weight parameters, a persistence-of-excitation-type condition. The specific form of the least-squares solution is:
(27)  W = (ΦᵀΦ)^{−1} Φᵀ Θ
where Φ and Θ are obtained by stacking the M segments of collected data.
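To make the least-squares identification concrete, the following self-contained sketch runs the whole off-policy loop on a scalar linear plant: data are generated once under a fixed behavior policy with exploration noise, and each iteration solves the integral Bellman relation, a scalar undiscounted analogue of equations (23) and (27), by least squares for the value parameter and the improved gain. The plant parameters, noise, and step sizes are all illustrative assumptions, and the learner never uses a_true or b_true:

```python
import numpy as np

# Scalar off-policy integral policy-iteration demo (illustrative, undiscounted,
# linear plant; the patent's scheme is the nonlinear, discounted analogue).
a_true, b_true = -1.0, 1.0          # hidden plant x' = a*x + b*u (data only)
q, r = 1.0, 1.0                     # cost weights
dt, T_seg, T_tot = 1e-3, 0.1, 10.0

# ---- data collection under a fixed behavior policy u = -k0*x + noise ----
k0 = 1.0
n = int(T_tot / dt)
x = np.empty(n + 1); x[0] = 1.0
u = np.empty(n)
t = np.arange(n) * dt
noise = 0.8 * np.sin(5.0 * t) + 0.5 * np.sin(11.3 * t)   # exploration noise
for i in range(n):
    u[i] = -k0 * x[i] + noise[i]
    x[i + 1] = x[i] + dt * (a_true * x[i] + b_true * u[i])  # Euler simulation

# ---- policy iteration using only the recorded (x, u) data ----
seg = int(T_seg / dt)
k = k0
for _ in range(8):
    rows, rhs = [], []
    for s in range(0, n, seg):
        xs, us = x[s:s + seg], u[s:s + seg]
        dx2 = x[s + seg] ** 2 - x[s] ** 2               # P * d(x^2) term
        I_cross = np.sum(xs * (us + k * xs)) * dt       # integral x*(u + k*x)
        I_cost = np.sum((q + r * k ** 2) * xs ** 2) * dt
        rows.append([dx2, -2.0 * r * I_cross])          # unknowns: [P, k_next]
        rhs.append(-I_cost)
    P, k = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
```

For this plant the analytic optimum is P* = k* = sqrt(2) − 1 ≈ 0.414, which the data-driven iteration approaches without ever using the model parameters.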
Therefore, based on the reinforcement learning algorithm for the control strategy in the angular-rate control loop, the embodiment of the present application designs a reinforcement-learning-based fixed-wing unmanned aerial vehicle angular-rate controller, which can learn the optimal control strategy using only the state quantities and input data even when the model parameters are uncertain.
Similarly, in a second embodiment, when the angle controller of the angle control loop is designed, the control strategy is the angle control strategy. For the angle dynamics model given in equation (3), in order to eliminate the influence of the gravitational acceleration term on the angle controller design, the actual angle control quantity is constructed as follows:
(28)
where ω_d is the angular-rate virtual control quantity to be designed, and the unmanned aerial vehicle angle dynamics model is converted into:
(29)
The remaining term scales inversely with the flight speed; because the unmanned aerial vehicle flies fast, it can be regarded as a small quantity, and the unmanned aerial vehicle angle dynamics model can be rewritten in the form of the following dynamic equation:
(30)  ẋ_Θ = f_Θ(x_Θ) + g_Θ(x_Θ)u_Θ,  y_Θ = x_Θ
where the unmanned aerial vehicle angle state quantity is x_Θ = [α, β, φ]ᵀ, whose components are the angle of attack, sideslip angle, and roll angle; u_Θ denotes the angular-rate control quantity; y_Θ denotes the system output; f_Θ(·) denotes the system matrix and g_Θ(·) the input matrix, whose explicit expressions follow from the angle dynamics in equation (3).
It is also noted that nonlinearity and coupling are likewise present in the attitude-angle dynamic model of the unmanned aerial vehicle, so, similarly to the angular-rate control loop, the optimal angle controller is designed below based on reinforcement learning. First, assume that the reference signal of the angle satisfies the following dynamic equation:
(31)  ẋ_Θd = ζ_Θ(x_Θd),  y_Θd = x_Θd
where ζ_Θ(·) denotes a smooth function to be determined and y_Θd denotes the output of the reference signal. It should be noted that, since fixed-wing unmanned aerial vehicles commonly use the bank-to-turn (BTT) maneuver mode, two types of angle reference signals arise in the actual learning process: in the first, the angle of attack and sideslip angle are zero while the roll angle varies over a large range; in the second, the roll angle and sideslip angle are zero while the angle of attack varies within a suitable range. Combining equation (30) with equation (31) yields the angle dynamic augmentation system:
(32)  Ẋ_Θ = F_Θ(X_Θ) + G_Θ(X_Θ)u_Θ
where the state quantity X_Θ = [e_Θᵀ, x_Θdᵀ]ᵀ, F_Θ(·) denotes the augmentation system matrix, G_Θ(·) the augmentation input matrix, and e_Θ = x_Θ − x_Θd denotes the angle tracking error.
The control objective of the angle controller is to make the angles of the unmanned aerial vehicle track the reference angles. To reduce the unmanned aerial vehicle angle tracking error and realize optimal control of the unmanned aerial vehicle rotation angles, the cost function is defined as:
(33)  V_Θ(X_Θ(t)) = ∫_t^∞ e^{−γ_Θ(τ−t)} [e_Θᵀ Q_Θ e_Θ + u_Θᵀ R_Θ u_Θ] dτ
where Q_Θ and R_Θ are positive-definite real constant matrices and γ_Θ > 0 is the discount factor.
Similarly to the angular-rate control loop, the following equation can be constructed and solved to obtain simultaneously the cost function V_Θ^(i) evaluating the current control strategy u_Θ^(i) and the iteratively improved control strategy u_Θ^(i+1):
(34)  e^{−γ_ΘΔt} V_Θ^(i)(X_Θ(t+Δt)) − V_Θ^(i)(X_Θ(t)) = ∫_t^{t+Δt} e^{−γ_Θ(τ−t)} [−e_Θᵀ Q_Θ e_Θ − u_Θ^(i)ᵀ R_Θ u_Θ^(i) − 2u_Θ^(i+1)ᵀ R_Θ (u_Θ − u_Θ^(i))] dτ
Similarly, the embodiment of the present application proposes the following reinforcement learning algorithm for the control strategy in the fixed-wing unmanned aerial vehicle angle control loop. Step one: over a predetermined period, apply to the system the initial control strategy input u_Θ0 = u_b + e_n, where u_b denotes a fixed basic control strategy that can stabilize the system and e_n denotes the exploration noise used to maintain the persistence-of-excitation (PE) condition. Meanwhile, collect the required data over the predetermined period, including the system tracking error and the integral values associated with the initial control strategy input. Step two: using the given strategy to be iterated u_Θ^(i) and the data collected in step one, iteratively solve the Bellman equation (34) to obtain V_Θ^(i) and u_Θ^(i+1). Step three: set i to i + 1 and return to step two for the next round of solving until the iteration stopping condition is met; when the iteration converges, the optimal control strategy of the fixed-wing unmanned aerial vehicle angle is obtained. In step two, neural networks are constructed to fit the cost function V_Θ^(i) and the expression of the optimal control strategy u_Θ^(i+1), and equation (34) is solved by the least-squares method to obtain the neural network parameters.
Therefore, based on the reinforcement learning algorithm for the control strategy in the angle control loop, the embodiment of the present application designs a reinforcement-learning-based fixed-wing unmanned aerial vehicle angle controller, which can learn the optimal control strategy using only the state quantities and input data even when the model parameters are uncertain.
Similarly, in a third possible embodiment, the design goal of the position controller is, based on the reinforcement learning method and with the unmanned aerial vehicle model parameters unknown, to learn the optimal position-following control strategy using only the control input and state information, so that the unmanned aerial vehicle position-following error e_p approaches zero.
Because the attitude response of the unmanned aerial vehicle is faster than its position response, the unmanned aerial vehicle attitude dynamics can be ignored in the design of the position controller, and the unmanned aerial vehicle position-following dynamic equation (9) is rewritten in the form of the following dynamic equation:
(35)  ẋ_p = f_p(x_p) + g_p(x_p)u_p,  y_p = x_p
where the relative-position state quantity of the unmanned aerial vehicle is x_p = [e_x, e_y, e_χ]ᵀ, whose components are the position coordinate errors and the heading-angle deviation between the own unmanned aerial vehicle and the target unmanned aerial vehicle in the body coordinate frame b; u_p denotes the following control quantity; y_p denotes the system output; f_p(·) denotes the system matrix and g_p(·) the input matrix, expressed according to the position-following dynamics (9):
(36)
It may be noted that strong nonlinearity and coupling are also present in the position-following dynamic model of the unmanned aerial vehicle, so, similarly to the attitude controller design, the optimal position controller is designed below based on reinforcement learning.
The control objective of the position controller is to make the position of the unmanned aerial vehicle track the reference position, reducing the position tracking error and realizing optimal control of the unmanned aerial vehicle position. The cost function is defined as:
(37)  V_p(x_p(t)) = ∫_t^∞ e^{−γ_p(τ−t)} [x_pᵀ Q_p x_p + u_pᵀ R_p u_p] dτ
where Q_p and R_p are positive-definite constant matrices and the discount factor γ_p > 0 is a constant.
From equation (35) and equation (37), the Hamiltonian equation can be derived:
(38)  γ_p V_p = ∇V_pᵀ(f_p(x_p) + g_p(x_p)u_p) + x_pᵀ Q_p x_p + u_pᵀ R_p u_p
where ∇V_p denotes the partial derivative of the cost function with respect to the state quantity x_p.
Let the optimal cost function be V_p*; the expression of the optimal position control strategy u_p* is:
(39)  u_p* = −(1/2) R_p^{−1} g_p(x_p)ᵀ ∇V_p*
Substituting the expression of u_p* into equation (38) gives:
(40)  γ_p V_p* = ∇V_p*ᵀ f_p(x_p) − (1/4) ∇V_p*ᵀ g_p R_p^{−1} g_pᵀ ∇V_p* + x_pᵀ Q_p x_p
It can be demonstrated that the optimal control strategy given by equation (39) asymptotically stabilizes the position-following dynamic system (35), provided that the weight matrix Q_p is chosen sufficiently large relative to the discount factor γ_p.
Then, equation (40) is rewritten as:
(41)  V̇_p* = γ_p V_p* − x_pᵀ Q_p x_p − u_p*ᵀ R_p u_p*
Multiplying both sides of the equation by e^{−γ_p t} gives:
(42)  d/dt (e^{−γ_p t} V_p*) = −e^{−γ_p t} (x_pᵀ Q_p x_p + u_p*ᵀ R_p u_p*) ≤ 0
As can be seen from equation (42), when the discount factor γ_p approaches zero, the unmanned aerial vehicle position-following dynamic system (35) is asymptotically stable. When γ_p is nonzero, it can be demonstrated that the position-following dynamic system remains stable if Q_p is sufficiently large relative to γ_p.
Obviously, equation (40) involves the unknown dynamics f_p and g_p and contains dynamic model parameter information of the target unmanned aerial vehicle. Therefore, similarly to the unmanned aerial vehicle attitude control, a strategy iteration equation is constructed as shown in equation (43); by solving equation (43), the cost function V_p^(i) evaluating the current strategy u_p^(i) and the improved control strategy u_p^(i+1) can be obtained simultaneously:
(43)  e^{−γ_pΔt} V_p^(i)(x_p(t+Δt)) − V_p^(i)(x_p(t)) = ∫_t^{t+Δt} e^{−γ_p(τ−t)} [−x_pᵀ Q_p x_p − u_p^(i)ᵀ R_p u_p^(i) − 2u_p^(i+1)ᵀ R_p (u_p − u_p^(i))] dτ
Before presenting the reinforcement learning algorithm for the unmanned aerial vehicle position control loop, note that when the components of the relative-position state x_p all approach zero, it becomes difficult to control the heading deviation e_χ effectively. Therefore, to ensure the stability of the initial control strategy, the relative-position state quantity is reconstructed to obtain:
(44)
It can be proved that the control strategy (45) stabilizes the position-following dynamic system:
(45)
taking the control strategy as a basic control law in strategy reinforcement learning, the embodiment of the application provides a reinforcement learning algorithm of the control strategy in a fixed-wing unmanned aerial vehicle position following control loop, which is as follows: step one: at a predetermined timeApplying initial control strategy input to system in sectionAnd an initial reference signal, wherein->The search noise for maintaining the PE condition is shown. Meanwhile, required data including the system tracking error and the initial control strategy input integration value are collected for a predetermined period of time. Step two: use of a given strategy to be iterated +.>And the data collected in step one, iteratively solving the bellman equation (43) while obtaining ++>And->. Step three: let->And returning to the second step for carrying out the next round of solving until the iteration stopping condition is met, namely, when the iteration converges, obtaining the optimal control strategy of the position of the fixed-wing unmanned aerial vehicle. In step two, similar to the attitude control loop, a value fitting neural network is constructed >Fitting neural network with optimal control strategy>Fitting the cost function ∈>Control strategy->Wherein->And->For network parameters +.>And->Is a basis function. Also using the least squares solution (43), neural network parameters can be calculated.
Therefore, based on the reinforcement learning algorithm for the control strategy in the position-following control loop, the embodiment of the present application designs a reinforcement-learning-based fixed-wing unmanned aerial vehicle position controller, which can learn the optimal control strategy using only the state quantities and input data even when the model parameters are uncertain.
Furthermore, in order to verify the effectiveness of the unmanned aerial vehicle attitude controller and the position controller based on reinforcement learning provided by the embodiment of the application, the fixed wing unmanned aerial vehicle control system is integrated and designed in Matlab/Simulink, and simulation experiments are carried out. Referring to fig. 6 (a) to 6 (e), fig. 6 (a) to 6 (e) show experimental results of a simulation experiment provided in an embodiment of the present application. Specifically, fig. 6 (a) shows a policy network parameter iteration change situation diagram of an angular rate control loop provided in an embodiment of the present application; FIG. 6 (b) is a diagram illustrating iterative changes in network parameters of an angle control loop strategy according to an embodiment of the present application; FIG. 6 (c) shows a step response curve of the attitude angle of a fixed wing unmanned aerial vehicle according to an embodiment of the present application; FIG. 6 (d) is a diagram illustrating iterative variation of network parameters of a position control loop strategy according to an embodiment of the present application; FIG. 6 (e) shows a fixed wing unmanned aerial vehicle relative position error response curve provided by an embodiment of the present application; in the simulation experiment, the main simulation process is as follows:
The basic parameters of the unmanned aerial vehicle are set as follows: the unmanned aerial vehicle mass is m, and the moment-of-inertia matrix is:
(46)
At an air temperature of 20 degrees Celsius and one standard atmosphere, the air density is ρ, the wing reference area is S, the lift coefficient is C_L, and the lateral force coefficient is C_Y. The speed of the unmanned aerial vehicle is held constant. For the angular-rate control loop, the hyperparameters of the reinforcement learning algorithm are set as the weight matrices Q and R, where I_3 denotes the third-order identity matrix, with discount factor γ, integration time interval Δt seconds, and M segments of integral data collected in total; the exploration noise is set as:
(47)
where the noise frequencies are random numbers sampled uniformly from a given interval. During data collection, the initial control strategy u_Ω0 is:
(48)
For the angle control loop, the hyperparameters of the reinforcement learning algorithm are set as the weight matrices Q_Θ and R_Θ, with discount factor γ_Θ, integration time interval Δt seconds, and M segments of integral data collected in total; the exploration noise is set as:
(49)
where the noise frequencies are random numbers sampled uniformly from a given interval. During data collection, the initial control strategy u_Θ0 is:
(50)
During the iterative learning process, the variation of the two-norm of the angular-rate control loop policy network parameters with the number of iterations is shown in fig. 6 (a), and the variation of the two-norm of the angle control loop policy network parameters with the number of iterations is shown in fig. 6 (b).
In order to check the control performance of the learned optimal attitude control strategy, step reference signals (in rad) are set for the angle of attack, sideslip angle, and roll angle, the initial attitude of the unmanned aerial vehicle is likewise given in rad, and the unmanned aerial vehicle attitude-angle step response curve obtained by numerical simulation is shown in fig. 6 (c). Meanwhile, in order to verify the effectiveness of the control strategy learned by the reinforcement learning algorithm, a robust controller is introduced for comparison; the attitude responses of the unmanned aerial vehicle under the initial control strategy and under the robust compensation controller are also shown in fig. 6 (c). It can be seen that the reinforcement-learning-based controller provided by the embodiment of the present application has the best step-response performance of the three, and its improvement over the initial control strategy is evident. Meanwhile, a simulation test is carried out on the learning process and control effect of the designed unmanned aerial vehicle position controller. The hyperparameters of the reinforcement learning algorithm are set as the weight matrices Q_p and R_p, where I_3 denotes the third-order identity matrix, with discount factor γ_p, integration time interval Δt seconds, and M segments of integral data collected in total; the exploration noise used while collecting the data is set as:
(51)
where the noise frequencies are random numbers sampled uniformly from a given interval. In the data collection phase, the specific form of the initial control strategy u_p0 of the unmanned aerial vehicle position controller is as follows:
(52)
The followed target unmanned aerial vehicle starts in the inertial coordinate frame g at a given initial position (in meters), keeps its heading angle (in rad), flight speed (in meters per second), and heading angular rate (in rad per second) constant, and the desired relative states are given in meters and rad. The own unmanned aerial vehicle starts in the inertial coordinate frame from a given initial position (in meters) with a given initial heading angle (in rad); accordingly, in the body coordinate frame b of the own unmanned aerial vehicle, the initial following errors are given in meters and rad. During the iterative learning of the neural network, the variation of the two-norm of the position controller policy network weight parameters with the number of iterations is shown in fig. 6 (d). The response curve of the unmanned aerial vehicle position tracking error obtained by computer numerical simulation is shown in fig. 6 (e); the position-following error response curves of the unmanned aerial vehicle under the learned control strategy and under the initial control strategy are shown in fig. 6 (d) and fig. 6 (e), respectively. It can be seen that the reinforcement-learning-based position controller designed in the embodiment of the present application yields a faster system response than the initial control strategy, and that a better-performing control strategy is learned by the reinforcement learning algorithm.
In summary, according to the method for determining the control strategy of the fixed-wing unmanned aerial vehicle based on reinforcement learning, provided by the embodiment of the application, the reinforcement learning algorithm is applied to solving the control strategy of the fixed-wing unmanned aerial vehicle by combining the intelligent algorithm and the optimal control theory, and the optimal control strategy of the unmanned aerial vehicle can be solved only by using the set initial control strategy and the measurable tracking error, so that the control effect on the fixed-wing unmanned aerial vehicle can be improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a determining device of a fixed-wing unmanned aerial vehicle control strategy based on reinforcement learning according to an embodiment of the present application. The control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal; as shown in fig. 7, the determining apparatus 700 includes:
a construction module 710, configured to construct an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and a dynamic model of the fixed-wing unmanned aerial vehicle;
the deriving module 720 is configured to derive an expression of a bellman equation and an optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle; the cost function is obtained according to the control target definition of the control strategy;
A reconstruction module 730, configured to reconstruct an augmentation system of the fixed-wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning;
a determining module 740, configured to determine a policy iteration equation to be solved, in combination with the cost function, the reconstructed augmentation system, the bellman equation, and the expression of the optimal control policy;
a control module 750 configured to apply an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system during a predetermined period of time, and count a tracking error of the fixed-wing unmanned aerial vehicle relative to the initial reference signal during the predetermined period of time; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy;
a solving module 760, configured to substitute the initial control strategy and tracking error into the strategy iteration equation, and perform iterative solution on the strategy iteration equation;
and when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
Further, the unmanned aerial vehicle control system comprises a position control layer and a gesture control layer; the position controller in the position control layer reversely solves the attitude reference signal according to the received position reference signal, the position state quantity of the fixed wing unmanned aerial vehicle and the position control strategy and sends the attitude reference signal to the attitude control layer;
The attitude control layer comprises an angle controller and an angular rate controller, and the angle controller, the angular rate controller and the fixed wing unmanned aerial vehicle form a cascade attitude control loop.
Further, when the control strategy is an angular rate control strategy in the angular rate controller, the dynamics model of the fixed-wing unmanned aerial vehicle with respect to the angular rate state quantity is expressed as:

$$\dot{\omega} = A\omega + Bu,\qquad y = \omega$$

wherein the angular rate state quantity is $\omega = [\omega_x,\ \omega_y,\ \omega_z]^{T}$, each component representing the projection of the angular rate on the corresponding axis of the body coordinate system; $u$ is the control moment, $y$ is the system output, $A$ is the system matrix, and $B$ is the input matrix.

If the angular rate reference signal $\omega_d$ received by the angular rate controller is expressed as the following dynamic equation:

$$\dot{\omega}_d = F\omega_d$$

the augmentation system is expressed as:

$$\dot{X} = TX + B_1 u$$

wherein $X = [e^{T},\ \omega_d^{T}]^{T}$ is the augmented angular rate state quantity, $e = \omega - \omega_d$ is the angular rate tracking error, $T$ is the augmentation system matrix, and $B_1$ is the augmented input matrix.

If the control target of the angular rate control strategy is to control the angular rate of the fixed-wing unmanned aerial vehicle to follow the angular rate reference signal, the cost function corresponding to the angular rate is expressed as:

$$V\big(X(t)\big) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}\left(e^{T}Qe + u^{T}Ru\right)\mathrm{d}\tau$$

wherein $Q$ and $R$ are both positive definite real constant matrices and $\gamma > 0$ is the discount factor.
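The augmented tracking formulation can be illustrated numerically. The sketch below assembles an augmented state from a tracking error and a reference state, and evaluates the discounted running cost; all matrices ($A$, $B$, $F$, $Q$, $R$) and the discount factor are invented placeholder values, not parameters from the patent:

```python
import numpy as np

# Hypothetical placeholder matrices (not from the patent).
A = np.array([[-1.0, 0.2, 0.0],
              [0.1, -1.5, 0.0],
              [0.0, 0.0, -0.8]])          # system matrix
B = np.eye(3)                             # input matrix (control moments)
F = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])           # reference dynamics: dw_d/dt = F w_d

# Augmented system: X = [e; w_d] with e = w - w_d, so
# de/dt = A(e + w_d) + B u - F w_d = A e + (A - F) w_d + B u
T = np.block([[A, A - F],
              [np.zeros((3, 3)), F]])     # augmentation system matrix
B1 = np.vstack([B, np.zeros((3, 3))])     # augmented input matrix

Q = np.eye(3)                             # positive definite error weight
R = 0.5 * np.eye(3)                       # positive definite control weight
gamma = 0.1                               # discount factor

def cost_integrand(X, u, tau, t):
    """Discounted running cost e^{-gamma(tau-t)} (e^T Q e + u^T R u)."""
    e = X[:3]
    return np.exp(-gamma * (tau - t)) * (e @ Q @ e + u @ R @ u)
```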
Further, when configured to derive the expression of the Bellman equation and the optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle, the deriving module 720 is configured to:

differentiate the cost function corresponding to the angular rate and combine the result with the expression of the augmentation system to obtain the Bellman equation corresponding to the angular rate:

$$\gamma V(X) = e^{T}Qe + u^{T}Ru + (\nabla V)^{T}\left(TX + B_1 u\right)$$

wherein $\nabla V = \partial V/\partial X$ represents the partial derivative of the cost function corresponding to the angular rate with respect to the angular rate state quantity of the augmentation system;

construct the Hamilton-Jacobi-Bellman equation corresponding to the angular rate according to the Bellman equation corresponding to the angular rate:

$$\gamma V^{*}(X) = \min_{u}\left[e^{T}Qe + u^{T}Ru + (\nabla V^{*})^{T}\left(TX + B_1 u\right)\right]$$

wherein $V^{*}$ represents the optimal cost function corresponding to the optimal control strategy;

according to the stationarity condition that the optimal control strategy must satisfy, $\partial H/\partial u = 0$, obtain the expression of the optimal control strategy:

$$u^{*} = -\tfrac{1}{2}R^{-1}B_1^{T}\nabla V^{*}$$

wherein $u^{*}$ represents the optimal control strategy.
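For a quadratic cost function $V(X) = X^{T}PX$ (the standard form in the linear-quadratic case, assumed here purely for illustration), the gradient is $\nabla V = 2PX$ and the stationarity condition yields $u^{*} = -R^{-1}B_1^{T}PX$. The sketch below checks numerically, with a randomly generated positive definite $P$, that this point indeed minimizes the $u$-dependent part of the Hamiltonian:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3
P = rng.normal(size=(n, n))
P = P @ P.T + n * np.eye(n)                  # symmetric positive definite (placeholder)
B1 = np.vstack([np.eye(m), np.zeros((m, m))])
R = 0.5 * np.eye(m)
X = rng.normal(size=n)

grad_V = 2.0 * P @ X                         # gradient of V(X) = X^T P X
u_star = -0.5 * np.linalg.solve(R, B1.T @ grad_V)   # u* = -1/2 R^{-1} B1^T grad V

# u* is the minimizer of the u-dependent part of the Hamiltonian: check by perturbation.
def hamiltonian_u_terms(u):
    return u @ R @ u + grad_V @ (B1 @ u)     # u-dependent terms of the HJB right-hand side

for _ in range(100):
    u_pert = u_star + 0.1 * rng.normal(size=m)
    assert hamiltonian_u_terms(u_star) <= hamiltonian_u_terms(u_pert) + 1e-9
```

Because the perturbed value exceeds the value at `u_star` by exactly the positive quantity delta^T R delta, the check passes for any perturbation when R is positive definite.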
Further, the expression of the reconstructed augmentation system is:

$$\dot{X} = TX + B_1 u_k + B_1\left(u - u_k\right)$$

wherein $u$ represents the control strategy of the actual input, and $u_k$ represents the control strategy of the $k$-th iteration, which replaces the actual input in the derivation.
Further, when configured to determine the policy iteration equation to be solved in combination with the cost function, the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy, the determining module 740 is configured to:

differentiate the cost function, and substitute the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy to eliminate the dynamics model parameters of the unmanned aerial vehicle control system, obtaining the differential expression of the cost function:

$$\dot{V}_k(X) = \gamma V_k(X) - e^{T}Qe - u_k^{T}Ru_k - 2u_{k+1}^{T}R\left(u - u_k\right)$$

multiply both sides of the differential expression of the cost function by $e^{-\gamma(\tau - t)}$ and integrate to obtain the strategy iteration equation:

$$e^{-\gamma\Delta t}V_k\big(X(t+\Delta t)\big) - V_k\big(X(t)\big) = \int_{t}^{t+\Delta t} e^{-\gamma(\tau - t)}\left[-e^{T}Qe - u_k^{T}Ru_k - 2u_{k+1}^{T}R\left(u - u_k\right)\right]\mathrm{d}\tau$$

wherein $\Delta t$ represents the integration time interval.
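The strategy iteration equation can be verified numerically along a simulated trajectory. The sketch below uses an invented placeholder augmented system, evaluates a fixed policy through a discounted Lyapunov equation (a model-based shortcut used here only to obtain $V_k$ for the check), and confirms that the left- and right-hand sides of the integral identity agree while the system is driven by an off-policy input with exploration noise:

```python
import numpy as np

# Numerical check of the integral policy-iteration identity on a small
# placeholder augmented system (all matrices invented for illustration).
n, m = 4, 2
T = np.array([[-1.0, 0.5, 0.0, 0.0],
              [0.0, -2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, -1.0, 0.0]])
B1 = np.vstack([np.eye(m), np.zeros((2, m))])
Qbar = np.diag([1.0, 1.0, 0.0, 0.0])      # Q acts on the tracking-error part only
R = np.eye(m)
gamma = 0.2

def lyap(A, W):
    """Solve A^T P + P A + W = 0 via Kronecker vectorization."""
    M = np.kron(np.eye(n), A.T) + np.kron(A.T, np.eye(n))
    return np.linalg.solve(M, -W.reshape(n * n, order="F")).reshape((n, n), order="F")

K_k = np.zeros((m, n))                    # current policy u_k = -K_k X
# Discounted policy evaluation: V_k(X) = X^T P_k X with
# (T - B1 K_k - gamma/2 I)^T P_k + P_k (T - B1 K_k - gamma/2 I) + Qbar + K_k^T R K_k = 0
Acl = T - B1 @ K_k - 0.5 * gamma * np.eye(n)
P_k = lyap(Acl, Qbar + K_k.T @ R @ K_k)
K_next = np.linalg.solve(R, B1.T @ P_k)   # improved policy u_{k+1} = -K_next X

# Simulate with a behavior input u = u_k + exploration noise and verify that
# e^{-gamma dT} V_k(X(t+dT)) - V_k(X(t)) equals the right-hand-side integral.
dt, steps = 1e-4, 20000                   # dT = 2.0 s
X = np.array([1.0, -0.5, 0.3, 0.2])
V0 = X @ P_k @ X
rhs = 0.0
for i in range(steps):
    tau = i * dt
    u_k = -K_k @ X
    u = u_k + 0.1 * np.array([np.sin(7 * tau), np.cos(11 * tau)])  # exploration
    integrand = (-X @ Qbar @ X - u_k @ R @ u_k
                 - 2 * (-K_next @ X) @ R @ (u - u_k))
    rhs += np.exp(-gamma * tau) * integrand * dt
    X = X + dt * (T @ X + B1 @ u)          # Euler step of the true dynamics
lhs = np.exp(-gamma * steps * dt) * (X @ P_k @ X) - V0
assert abs(lhs - rhs) < 1e-2               # the identity holds along the trajectory
```

Note that the behavior input `u` differs from the evaluated policy `u_k`; the identity nonetheless holds, which is what allows the later solving step to use data collected under an arbitrary exciting input.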
Further, when configured to perform the iterative solution of the strategy iteration equation, the solving module 760 is configured to:
in each iteration of the iterative solution, performing function fitting on the cost function and the expression of the optimal control strategy by using neural networks to obtain a value fitting neural network and an optimal control strategy fitting neural network;
substituting the value fitting neural network and the optimal control strategy fitting neural network into the strategy iteration equation, and constructing a Bellman estimation error;
minimizing the Bellman estimation error by using the least squares method, and solving for the weight parameters of the value fitting neural network and the weight parameters of the optimal control strategy fitting neural network;
updating the value fitting neural network according to the solved weight parameters of the value fitting neural network, and updating the optimal control strategy fitting neural network according to the weight parameters of the optimal control strategy fitting neural network to obtain an updated value fitting neural network and an updated optimal control strategy fitting neural network for the next iteration.
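The solving procedure above (fit the value and policy, minimize the Bellman estimation error by least squares, update, and repeat) can be sketched end-to-end on a scalar tracking problem. Everything numerical here is an invented placeholder; the value function is fitted with a quadratic basis and the policy with a linear-in-state form, which stand in for the fitting neural networks in the simplest case:

```python
import numpy as np

# Scalar tracking example (all numbers are invented placeholders, not from the patent).
a, b = 0.5, 1.0                       # plant: dw/dt = a*w + b*u
T = np.array([[a, a], [0.0, 0.0]])    # augmented X = [e, w_d], constant reference (F = 0)
B1 = np.array([b, 0.0])
Qbar = np.diag([1.0, 0.0])            # penalize the tracking error only
r, gamma = 1.0, 0.2

def lyap(A, W):
    """Solve A^T P + P A + W = 0 via Kronecker vectorization."""
    n = A.shape[0]
    M = np.kron(np.eye(n), A.T) + np.kron(A.T, np.eye(n))
    return np.linalg.solve(M, -W.reshape(n * n, order="F")).reshape((n, n), order="F")

def model_based_pi(K, iters=30):
    """Kleinman-style policy iteration using the model (for comparison only)."""
    for _ in range(iters):
        Acl = T - np.outer(B1, K) - 0.5 * gamma * np.eye(2)
        P = lyap(Acl, Qbar + r * np.outer(K, K))
        K = (B1 @ P) / r
    return K

# --- data collection: episodic intervals driven by an arbitrary exciting input ---
rng = np.random.default_rng(0)
dt, steps, N = 1e-4, 1000, 60         # interval length 0.1 s, 60 intervals
phi = lambda X: np.array([X[0] ** 2, X[0] * X[1], X[1] ** 2])  # quadratic value basis
data = []
for _ in range(N):
    X = rng.uniform(-1, 1, size=2)
    c0, c1 = rng.uniform(-1, 1, size=2)
    phi0, Ixx, Ixu = phi(X), np.zeros((2, 2)), np.zeros(2)
    for i in range(steps):
        tau = i * dt
        u = c0 + c1 * np.sin(20 * tau)          # behavior input (exploration)
        w = np.exp(-gamma * tau)
        Ixx += w * np.outer(X, X) * dt           # discounted data integrals
        Ixu += w * X * u * dt
        X = X + dt * (T @ X + B1 * u)
    data.append((phi0, np.exp(-gamma * steps * dt) * phi(X), Ixx, Ixu))

# --- data-driven policy iteration: least-squares Bellman-error minimization ---
K = np.array([2.0, 0.0])              # initial stabilizing policy u_0 = -K X
for _ in range(15):
    rows, rhs = [], []
    for phi0, phi1, Ixx, Ixu in data:
        # unknown theta = [value weights w, improved policy gains K_next]
        rows.append(np.concatenate([phi1 - phi0, -2 * r * (Ixu + Ixx @ K)]))
        rhs.append(-np.trace(Qbar @ Ixx) - r * K @ Ixx @ K)
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    K = theta[3:]
print(K, model_based_pi(np.array([2.0, 0.0])))  # compare learned and model-based gains
```

Only the recorded integrals of the measured state and input enter the least-squares step, so the dynamics model parameters never appear in the update, which is the point of the reconstruction of the augmentation system.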
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device 800 includes a processor 810, a memory 820, and a bus 830.
The memory 820 stores machine-readable instructions executable by the processor 810. When the electronic device 800 is running, the processor 810 and the memory 820 communicate through the bus 830, and when the machine-readable instructions are executed by the processor 810, the steps of the reinforcement-learning-based fixed-wing unmanned aerial vehicle control strategy determination method in the method embodiment shown in fig. 1 may be performed; for the specific implementation, reference may be made to the method embodiment, which is not repeated herein.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the reinforcement-learning-based fixed-wing unmanned aerial vehicle control strategy determination method in the method embodiment shown in fig. 1; for the specific implementation, reference may be made to the method embodiment, which is not repeated herein.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present application, provided to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features thereof within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning, characterized in that the control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal; the determining method comprises the following steps:
constructing an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and a dynamic model of the fixed-wing unmanned aerial vehicle;
deriving an expression of a Bellman equation and an optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle; the cost function is defined according to the control target of the control strategy;
reconstructing an augmentation system of the fixed wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning;
determining a strategy iteration equation to be solved by combining the cost function, the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy;
applying an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system in a preset time period, and counting tracking errors of the fixed wing unmanned aerial vehicle relative to the initial reference signal in the preset time period; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy;
substituting the initial control strategy and the tracking error into the strategy iteration equation, and iteratively solving the strategy iteration equation;
and when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
2. The determining method according to claim 1, wherein the unmanned aerial vehicle control system comprises a position control layer and an attitude control layer; the position controller in the position control layer inversely solves for the attitude reference signal according to the received position reference signal, the position state quantity of the fixed-wing unmanned aerial vehicle, and the position control strategy, and sends the attitude reference signal to the attitude control layer;
the attitude control layer comprises an angle controller and an angular rate controller, and the angle controller, the angular rate controller and the fixed wing unmanned aerial vehicle form a cascade attitude control loop.
3. The determining method according to claim 2, wherein when the control strategy is an angular rate control strategy in the angular rate controller, the dynamics model of the fixed-wing unmanned aerial vehicle with respect to the angular rate state quantity is expressed as:

$$\dot{\omega} = A\omega + Bu,\qquad y = \omega$$

wherein the angular rate state quantity is $\omega = [\omega_x,\ \omega_y,\ \omega_z]^{T}$, each component representing the projection of the angular rate on the corresponding axis of the body coordinate system; $u$ is the control moment, $y$ is the system output, $A$ is the system matrix, and $B$ is the input matrix;

if the angular rate reference signal $\omega_d$ received by the angular rate controller is expressed as the following dynamic equation:

$$\dot{\omega}_d = F\omega_d$$

the augmentation system is expressed as:

$$\dot{X} = TX + B_1 u$$

wherein $X = [e^{T},\ \omega_d^{T}]^{T}$ is the augmented angular rate state quantity, $e = \omega - \omega_d$ is the angular rate tracking error, $T$ is the augmentation system matrix, and $B_1$ is the augmented input matrix;

and if the control target of the angular rate control strategy is to control the angular rate of the fixed-wing unmanned aerial vehicle to follow the angular rate reference signal, the cost function corresponding to the angular rate is expressed as:

$$V\big(X(t)\big) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}\left(e^{T}Qe + u^{T}Ru\right)\mathrm{d}\tau$$

wherein $Q$ and $R$ are both positive definite real constant matrices and $\gamma > 0$ is the discount factor.
4. The determining method according to claim 3, wherein deriving the expression of the Bellman equation and the optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle comprises:

differentiating the cost function corresponding to the angular rate and combining the result with the expression of the augmentation system to obtain the Bellman equation corresponding to the angular rate:

$$\gamma V(X) = e^{T}Qe + u^{T}Ru + (\nabla V)^{T}\left(TX + B_1 u\right)$$

wherein $\nabla V = \partial V/\partial X$ represents the partial derivative of the cost function corresponding to the angular rate with respect to the angular rate state quantity of the augmentation system;

constructing the Hamilton-Jacobi-Bellman equation corresponding to the angular rate according to the Bellman equation corresponding to the angular rate:

$$\gamma V^{*}(X) = \min_{u}\left[e^{T}Qe + u^{T}Ru + (\nabla V^{*})^{T}\left(TX + B_1 u\right)\right]$$

wherein $V^{*}$ represents the optimal cost function corresponding to the optimal control strategy;

according to the stationarity condition that the optimal control strategy must satisfy, $\partial H/\partial u = 0$, obtaining the expression of the optimal control strategy:

$$u^{*} = -\tfrac{1}{2}R^{-1}B_1^{T}\nabla V^{*}$$

wherein $u^{*}$ represents the optimal control strategy.
5. The determining method according to claim 4, wherein the expression of the reconstructed augmentation system is:

$$\dot{X} = TX + B_1 u_k + B_1\left(u - u_k\right)$$

wherein $u$ represents the control strategy of the actual input, and $u_k$ represents the control strategy of the $k$-th iteration, which replaces the actual input in the derivation.
6. The determining method according to claim 5, wherein determining the strategy iteration equation to be solved in combination with the cost function, the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy comprises:

differentiating the cost function, and substituting the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy to eliminate the dynamics model parameters of the unmanned aerial vehicle control system, obtaining the differential expression of the cost function:

$$\dot{V}_k(X) = \gamma V_k(X) - e^{T}Qe - u_k^{T}Ru_k - 2u_{k+1}^{T}R\left(u - u_k\right)$$

multiplying both sides of the differential expression of the cost function by $e^{-\gamma(\tau - t)}$ and integrating to obtain the strategy iteration equation:

$$e^{-\gamma\Delta t}V_k\big(X(t+\Delta t)\big) - V_k\big(X(t)\big) = \int_{t}^{t+\Delta t} e^{-\gamma(\tau - t)}\left[-e^{T}Qe - u_k^{T}Ru_k - 2u_{k+1}^{T}R\left(u - u_k\right)\right]\mathrm{d}\tau$$

wherein $\Delta t$ represents the integration time interval.
7. The determining method according to claim 1, wherein iteratively solving the strategy iteration equation comprises:
in each iteration of the iterative solution, performing function fitting on the cost function and the expression of the optimal control strategy by using neural networks to obtain a value fitting neural network and an optimal control strategy fitting neural network;
substituting the value fitting neural network and the optimal control strategy fitting neural network into the strategy iteration equation, and constructing a Bellman estimation error;
minimizing the Bellman estimation error by using the least squares method, and solving for the weight parameters of the value fitting neural network and the weight parameters of the optimal control strategy fitting neural network;
updating the value fitting neural network according to the solved weight parameters of the value fitting neural network, and updating the optimal control strategy fitting neural network according to the weight parameters of the optimal control strategy fitting neural network to obtain an updated value fitting neural network and an updated optimal control strategy fitting neural network for the next iteration.
8. The fixed wing unmanned aerial vehicle control strategy determining device based on reinforcement learning is characterized in that the control strategy is applied to an unmanned aerial vehicle control system; the control target of the control strategy is to control the fixed wing unmanned aerial vehicle to move along with the reference signal according to the received reference signal; the determining device includes:
the construction module is used for constructing an augmentation system of the fixed-wing unmanned aerial vehicle according to the reference signal and the dynamic model of the fixed-wing unmanned aerial vehicle;
the deriving module is used for deriving the expression of the Bellman equation and the optimal control strategy according to the augmentation system and the cost function of the fixed-wing unmanned aerial vehicle; the cost function is defined according to the control target of the control strategy;
the reconstruction module is used for reconstructing an augmentation system of the fixed-wing unmanned aerial vehicle based on a strategy iteration method in reinforcement learning;
the determining module is used for determining a strategy iteration equation to be solved by combining the cost function, the reconstructed augmentation system, the Bellman equation, and the expression of the optimal control strategy;
the control module is used for applying an initial control strategy and an initial reference signal to the unmanned aerial vehicle control system in a preset time period, and counting tracking errors of the fixed wing unmanned aerial vehicle relative to the initial reference signal in the preset time period; the initial control strategy comprises a basic control strategy for controlling the stability of the unmanned aerial vehicle control system and an exploration noise strategy;
The solving module is used for substituting the initial control strategy and the tracking error into the strategy iteration equation and carrying out iteration solving on the strategy iteration equation;
and when the iterative solution of the strategy iterative equation converges, obtaining the optimal control strategy of the fixed-wing unmanned aerial vehicle.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine readable instructions when executed by the processor performing the steps of a method of determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning as claimed in any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of a method of determining a fixed wing unmanned aerial vehicle control strategy based on reinforcement learning as claimed in any of claims 1 to 7.
CN202410239788.5A 2024-03-04 2024-03-04 Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning Pending CN117826860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410239788.5A CN117826860A (en) 2024-03-04 2024-03-04 Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410239788.5A CN117826860A (en) 2024-03-04 2024-03-04 Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN117826860A true CN117826860A (en) 2024-04-05

Family

ID=90519501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410239788.5A Pending CN117826860A (en) 2024-03-04 2024-03-04 Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117826860A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110502033A (en) * 2019-09-04 2019-11-26 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN114384931A (en) * 2021-12-23 2022-04-22 同济大学 Unmanned aerial vehicle multi-target optimal control method and device based on strategy gradient
CN116360504A (en) * 2023-05-31 2023-06-30 北京航空航天大学 Unmanned aerial vehicle cluster task determining method and device, electronic equipment and storage medium
CN116430899A (en) * 2023-04-25 2023-07-14 北京理工大学 Heterogeneous cluster unmanned system event triggering cooperative control method based on reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110502033A (en) * 2019-09-04 2019-11-26 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN114384931A (en) * 2021-12-23 2022-04-22 同济大学 Unmanned aerial vehicle multi-target optimal control method and device based on strategy gradient
CN116430899A (en) * 2023-04-25 2023-07-14 北京理工大学 Heterogeneous cluster unmanned system event triggering cooperative control method based on reinforcement learning
CN116360504A (en) * 2023-05-31 2023-06-30 北京航空航天大学 Unmanned aerial vehicle cluster task determining method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dong Chao et al.: "A survey of UAV-based edge intelligent computing research", Chinese Journal of Intelligent Science and Technology, no. 03, 15 September 2020 (2020-09-15) *

Similar Documents

Publication Publication Date Title
Jia et al. Integral backstepping sliding mode control for quadrotor helicopter under external uncertain disturbances
CN110134011B (en) Inverted pendulum self-adaptive iterative learning inversion control method
Wang et al. Robust H∞ attitude tracking control of a quadrotor UAV on SO (3) via variation-based linearization and interval matrix approach
Wu et al. Modeling and sliding mode-based attitude tracking control of a quadrotor UAV with time-varying mass
CN105159305B (en) A kind of quadrotor flight control method based on sliding moding structure
CN109947126B (en) Control method, device and equipment of quad-rotor unmanned aerial vehicle and readable medium
CN106708082B (en) Aircraft pitch channel attitude command fast tracking method based on fuzzy control
Trapiello et al. Position‐heading quadrotor control using LPV techniques
CN112631335A (en) Event-triggered multi-quad-rotor unmanned aerial vehicle fixed event formation method
Shi et al. Neural observer-based quantized output feedback control for MEMS gyroscopes with guaranteed transient performance
CN104536448A (en) Backstepping based control method for unmanned-plane attitude system
CN109062242B (en) Novel rotor unmanned aerial vehicle control method
Nakamura-Zimmerer et al. Neural network optimal feedback control with guaranteed local stability
CN115129072A (en) Terminal sliding mode control method under position tracking deviation constraint of fixed wing unmanned aerial vehicle
CN116627156B (en) Four-rotor unmanned aerial vehicle attitude disturbance rejection control method
CN116954258A (en) Hierarchical control method and device for multi-four-rotor unmanned aerial vehicle formation under unknown disturbance
Zhao et al. Model-free fuzzy adaptive control of the heading angle of fixed-wing unmanned aerial vehicles
CN117826860A (en) Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning
CN108845508B (en) CMAC-sliding mode integrated control-based semi-physical simulation control method for unmanned aerial vehicle
Brahim et al. Finite Time Adaptive SMC for UAV Trajectory Tracking Under Unknown Disturbances and Actuators Constraints
CN115407661A (en) Multi-unmanned aerial vehicle system nonlinear robust tracking control method based on azimuth measurement information
CN112631320A (en) Unmanned aerial vehicle self-adaptive control method and system
Peng et al. Velocity Prediction Method of Quadrotor UAV Based on BP Neural Network
Jurado et al. Stochastic feedback controller for a quadrotor UAV with dual modified extended Kalman filter
Yu et al. A Novel Brain-inspired Architecture and Flight Experiments for Autonomous Maneuvering Flight of Unmanned Aerial Vehicles

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination