CN114942651A - Unmanned aerial vehicle autonomous control method and system based on experience pool optimization - Google Patents

Unmanned aerial vehicle autonomous control method and system based on experience pool optimization

Info

Publication number
CN114942651A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
encoder
self
autonomous control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210654543.XA
Other languages
Chinese (zh)
Inventor
韩升
林友芳
吕凯
张硕
宋明惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210654543.XA priority Critical patent/CN114942651A/en
Publication of CN114942651A publication Critical patent/CN114942651A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an unmanned aerial vehicle autonomous control method and system based on experience pool optimization, and belongs to the technical field of flight control. The method comprises the following steps: setting a simulation environment for a target unmanned aerial vehicle in an unmanned aerial vehicle simulator; establishing a state space, an action space and a reward function of the target unmanned aerial vehicle; constructing an auto-encoder for feature extraction according to the state space and the action space; constructing an unmanned aerial vehicle autonomous control task decision network model; loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data; and autonomously controlling the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model. The invention improves the diversity of experience in the experience pool.

Description

Unmanned aerial vehicle autonomous control method and system based on experience pool optimization
Technical Field
The invention relates to the technical field of flight control, in particular to an unmanned aerial vehicle autonomous control method and system based on experience pool optimization.
Background
An Unmanned Aerial Vehicle (UAV), commonly referred to as a drone, can be flown by radio from a remote control device or fly autonomously using onboard sensor devices and internal programs. With the development of related technologies, unmanned aerial vehicles have gradually become part of everyday life. Because of their low cost, small size, high flexibility, strong battlefield survivability, ease of operation and other advantages, unmanned aerial vehicles are widely applied in both military and civil fields.
With the rapid development of unmanned aerial vehicles, the autonomous flight control technology that constrains their application has also received wide attention, and many scholars and organizations have devoted themselves to its research. Over the long course of this research a number of control methods have been proposed; the existing methods can be divided into traditional linear control methods, general nonlinear control methods and learning-based control methods. However, traditional linear control algorithms cannot output precise control because of the limitations of their model structure and suffer from poor anti-interference capability, while general nonlinear methods depend too heavily on expert experience and likewise suffer from low control accuracy. Deep reinforcement learning combines the neural networks of deep learning with the ideas of reinforcement learning and has shown good performance in many sequential decision tasks. Because a neural network can approximate any continuous function arbitrarily well and is not limited by a specific controller structure, it can output high-precision control actions.
However, the autonomous control task of an unmanned aerial vehicle suffers from sparse rewards, and traditional deep reinforcement learning has difficulty learning a good strategy in such a reward-sparse environment. This is mainly because many experiences that cannot help the agent learn are stored in the experience pool, so it is difficult to obtain effective experiences during experience replay.
Disclosure of Invention
To address the above problems, the invention provides an unmanned aerial vehicle autonomous control method based on experience pool optimization, which comprises the following steps:
setting a simulation environment for a target unmanned aerial vehicle in an unmanned aerial vehicle simulator;
establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
constructing an auto-encoder for feature extraction according to the state space and the action space;
constructing an unmanned aerial vehicle autonomous control task decision network model;
loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
and autonomously controlling the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
Optionally, the method further comprises: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
Optionally, the self-encoders for feature extraction include: a self-encoder for extracting state information features and a self-encoder for extracting action information features;
the self-encoder for extracting the state information features is constructed according to a state space, and specifically comprises the following steps: establishing a simulation state data set according to a state space, and establishing a self-encoder for extracting state information characteristics according to the simulation state data set;
the self-encoder for extracting the motion information features is constructed according to a motion space, and specifically comprises the following steps: and constructing a simulated motion data set according to the motion space, and constructing a self-encoder for extracting motion information characteristics according to the simulated motion data set.
Optionally, the network model for autonomous control task decision of the unmanned aerial vehicle includes: an Actor network and a Critic network.
Optionally, loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting a feature value of the experience data through a self-encoder, screening the experience data according to the feature value, and training the autonomous control task decision network model of the unmanned aerial vehicle according to the screened experience data specifically includes:
loading a simulation environment;
making an action decision according to the current state information of the target unmanned aerial vehicle by using an unmanned aerial vehicle autonomous control task decision network model; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment;
calculating reward values generated after actions of the target unmanned aerial vehicle act on the simulation environment at all times through the reward function, and generating experience data according to the reward values, the state information, the action information and the new state information of the target unmanned aerial vehicle;
extracting characteristic values of the empirical data through a self-encoder, and screening the empirical data according to the characteristic values;
and training the unmanned aerial vehicle autonomous control task decision network model through the screened empirical data.
The invention also provides an unmanned aerial vehicle autonomous control system based on experience pool optimization, which comprises the following steps:
a simulation environment construction unit which sets a simulation environment for the target unmanned aerial vehicle in the unmanned aerial vehicle simulator;
the first calculation unit is used for establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
the second calculation unit is used for constructing a self-encoder for feature extraction according to the state space and the action space;
the model building unit is used for building an unmanned aerial vehicle autonomous control task decision network model;
the model training unit is used for loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through the self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
and the unmanned aerial vehicle autonomous control unit autonomously controls the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
Optionally, the simulation environment building unit is further configured to: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
Optionally, the self-encoders for feature extraction include: a self-encoder for extracting state information features and a self-encoder for extracting action information features;
the self-encoder for extracting the state information features is constructed according to a state space, and specifically comprises the following steps: establishing a simulation state data set according to a state space, and establishing a self-encoder for extracting state information characteristics according to the simulation state data set;
the self-encoder for extracting the motion information features is constructed according to a motion space, and specifically comprises the following steps: and constructing a simulated motion data set according to the motion space, and constructing a self-encoder for extracting the motion information characteristics according to the simulated motion data set.
Optionally, the unmanned aerial vehicle autonomous control task decision network model includes: an Actor network and a Critic network.
Optionally, loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting a feature value of the experience data through a self-encoder, screening the experience data according to the feature value, and training the autonomous control task decision network model of the unmanned aerial vehicle according to the screened experience data specifically includes:
loading a simulation environment;
making an action decision according to the current state information of the target unmanned aerial vehicle by using an unmanned aerial vehicle autonomous control task decision network model; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment;
calculating reward values generated after actions of the target unmanned aerial vehicle act on the simulation environment at all times through the reward function, and generating experience data according to the reward values, the state information and the action information of the target unmanned aerial vehicle and new state information;
extracting characteristic values of the empirical data through a self-encoder, and screening the empirical data according to the characteristic values;
and training the unmanned aerial vehicle autonomous control task decision network model through the screened empirical data.
The method can effectively reduce repeated experience stored in the experience pool, improve the diversity of the experience in the experience pool, ensure that the intelligent agent can learn various experiences as much as possible, solve the problem that the intelligent agent is difficult to learn the optimal strategy due to sparse reward, and accelerate the learning of the decision network model.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings; however, the present invention may be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that the disclosure of the present invention is thorough and complete and fully conveys the scope of the present invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The invention provides an unmanned aerial vehicle autonomous control method based on experience pool optimization, which comprises the following steps of:
setting a simulation environment for a target unmanned aerial vehicle in an unmanned aerial vehicle simulator;
establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
constructing an auto-encoder for feature extraction according to the state space and the action space;
constructing an unmanned aerial vehicle autonomous control task decision network model;
loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
and autonomously controlling the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
Wherein, the method further comprises: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
Three attitude angles describe the state of the unmanned aerial vehicle: the roll angle (Roll), the pitch angle (Pitch) and the yaw angle (Yaw). The three attitude angles are defined as (roll, pitch, yaw); the angular velocities about the three attitude angles are defined as (roll_v, pitch_v, yaw_v); the accelerations in the three attitude-angle directions are defined as (roll_a, pitch_a, yaw_a); the coordinates of the target position relative to the current position of the unmanned aerial vehicle in the three-dimensional coordinate system are defined as (point_x, point_y, point_z); the velocity of the unmanned aerial vehicle in the three-dimensional coordinate system is defined as (v_x, v_y, v_z); and the acceleration of the unmanned aerial vehicle in the three-dimensional coordinate system is (a_x, a_y, a_z);
establishing a state space, an action space and a reward function of the target unmanned aerial vehicle according to the defined parameters;
establishing a state space of the unmanned aerial vehicle, specifically as follows;
the state of each unmanned aerial vehicle comprises the attitude and the dynamic information of the unmanned aerial vehicle, and the state of the unmanned aerial vehicle at the moment t is defined as:
s_t = (roll, pitch, yaw, roll_v, pitch_v, yaw_v, roll_a, pitch_a, yaw_a, point_x, point_y, point_z, v_x, v_y, v_z, a_x, a_y, a_z)   (1)
the states of the unmanned aerial vehicle at all times form a state space of the unmanned aerial vehicle;
establishing an action space of the unmanned aerial vehicle, specifically as follows;
at time t, the state of the unmanned aerial vehicle is passed to the agent, and the agent outputs the unmanned aerial vehicle control actions (action_pitch, action_yaw, action_throttle) according to the current strategy, where action_pitch represents control of the pitch angle, action_yaw represents control of the roll angle, and action_throttle represents control of the throttle. The value ranges of the three control actions are all between -1 and 1.
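For illustration only, the following sketch (Python, not part of the patent text) shows how such a state vector and a clipped action command could be assembled; the function names build_state and clip_action are assumptions introduced here, and the field order simply follows the definitions above.

```python
import numpy as np

def build_state(attitude, angular_velocity, angular_acceleration,
                target_offset, velocity, acceleration):
    """Concatenate the state components defined above into one 18-dimensional vector.

    attitude             : (roll, pitch, yaw)
    angular_velocity     : (roll_v, pitch_v, yaw_v)
    angular_acceleration : (roll_a, pitch_a, yaw_a)
    target_offset        : (point_x, point_y, point_z) relative to the UAV
    velocity             : (v_x, v_y, v_z)
    acceleration         : (a_x, a_y, a_z)
    """
    return np.concatenate([attitude, angular_velocity, angular_acceleration,
                           target_offset, velocity, acceleration]).astype(np.float32)

def clip_action(action):
    """Keep the three control commands inside the documented range [-1, 1]."""
    return np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)
```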
Establishing an unmanned aerial vehicle reward function, which specifically comprises the following steps:
the reward function for the pitch angle is defined by the following equation (2), where P_1 denotes the current pitch angle of the drone, P_2 denotes the pitch angle at which the drone should be, and r_P denotes the reward value of the drone in the pitch attitude:
r_P = -|P_1 - P_2|   (2)
the reward function for the roll angle is defined by the following formula (3), where Roll denotes the current roll angle of the drone and r_R denotes the reward value of the roll attitude:
r_R = -|Roll|   (3)
the reward function for the heading angle is defined by the following equation (4), where Y_1 denotes the current heading angle of the drone, Y_2 denotes the heading angle at which the drone should be, and r_Y denotes the reward value of the heading attitude:
r_Y = -|Y_1 - Y_2|   (4)
the final drone reward function is set to:
r = (r_P + r_Y + r_R) * r_dis   (5)
where r_dis is the advancing distance of the unmanned aerial vehicle and r is the final reward value;
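For illustration, the reward terms of equations (2) to (5) could be computed as in the following sketch; the absolute-difference form used for the pitch and heading rewards follows the same pattern as equation (3) and is an assumption of this sketch, and all function names are hypothetical.

```python
def pitch_reward(p1, p2):
    # Eq. (2): penalise the deviation of the current pitch p1 from the desired pitch p2
    # (absolute-difference form assumed here).
    return -abs(p1 - p2)

def roll_reward(roll):
    # Eq. (3): r_R = -|Roll|
    return -abs(roll)

def yaw_reward(y1, y2):
    # Eq. (4): penalise the deviation of the current heading y1 from the desired heading y2
    # (absolute-difference form assumed here).
    return -abs(y1 - y2)

def total_reward(p1, p2, roll, y1, y2, r_dis):
    # Eq. (5): r = (r_P + r_Y + r_R) * r_dis, where r_dis is the advancing distance.
    return (pitch_reward(p1, p2) + yaw_reward(y1, y2) + roll_reward(roll)) * r_dis
```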
wherein, the self-encoder for feature extraction comprises: the self-encoder is used for extracting the state information features and the self-encoder is used for extracting the action information features;
the self-encoder for extracting the state information features is constructed according to a state space;
the self-encoder for extracting the motion information features is constructed according to the motion space.
The state space is used for constructing a simulation state data set, and a self-encoder for extracting state information features is constructed according to the simulation state data set.
The motion space is used for constructing a simulated motion data set, and a self-encoder for extracting motion information features is constructed according to the simulated motion data set.
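As an illustrative sketch only, such a feature-extraction self-encoder trained on a simulated state or action data set could look like the code below; PyTorch is assumed here (the patent does not name a framework), and the layer widths and the 4-dimensional code size are assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress a state (or action) vector to a short code used as its feature value."""
    def __init__(self, input_dim, code_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                     nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def train_autoencoder(model, dataset, epochs=50, lr=1e-3):
    """Fit the self-encoder on a simulated data set by minimising the reconstruction error."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for batch in dataset:            # dataset is assumed to yield float tensors
            recon, _ = model(batch)
            loss = loss_fn(recon, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```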
The unmanned aerial vehicle autonomous control task decision network model comprises an Actor network and a Critic network. Both the Actor network and the Critic network have a dual-network structure, each with its own target network and current network.
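A minimal sketch of this dual-network structure is given below, again in PyTorch; the hidden sizes, the 18-dimensional state (the components defined above) and the 3-dimensional action are assumptions, and only the overall shape of the Actor/Critic pair and its target copies is meant to match the description.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim), nn.Tanh())  # actions in [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Current networks and their target copies, as described above.
state_dim = 18
actor, critic = Actor(state_dim), Critic(state_dim)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```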
Loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, obtaining experience data of the simulated flight through the reward function and the self-encoders, and training the unmanned aerial vehicle autonomous control task decision network model according to the experience data specifically includes the following steps:
loading a simulation environment;
an Actor network of an unmanned aerial vehicle autonomous control task decision network model is used for making an action decision according to the current state information of the target unmanned aerial vehicle; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment; in the flight process, the state of the simulation environment changes;
through the reward function, calculating empirical data generated after actions of the target unmanned aerial vehicle at all times act on the simulation environment, specifically:
the reward function calculates the reward value produced after the action generated by the unmanned aerial vehicle at the current moment acts on the environment, and the experience data of the unmanned aerial vehicle is thereby obtained. The experience data of the unmanned aerial vehicle comprises the state of the unmanned aerial vehicle at the current moment, the action strategy, the reward value, and the state of the unmanned aerial vehicle at the next moment. One piece of experience data of the unmanned aerial vehicle is expressed as <s_t, a_t, r_t, s_{t+1}>, where s_t represents the state at the current moment in the unmanned aerial vehicle mission, a_t represents the action strategy of the unmanned aerial vehicle at the current moment, r_t represents the reward value of the action of the unmanned aerial vehicle at the current moment, and s_{t+1} represents the state at the next moment in the unmanned aerial vehicle mission;
extracting the empirical data through a self-encoder: the unmanned aerial vehicle system continuously generates experience data at every moment; the space for storing the experience data is defined as the experience pool; each piece of experience data undergoes feature extraction through the self-encoder to obtain a feature value f, and the space for storing the feature values f is defined as the feature record table.
Before each piece of experience data is stored in the experience pool, it is first judged whether the experience pool is full; if the experience pool is full, one piece of experience data is removed, feature extraction is performed on it with the self-encoder to obtain its feature value f_i, and the feature value f_i is simultaneously removed from the feature record table;
performing feature extraction on the current experience data with the self-encoder to obtain a feature value f_j, and looking up whether the feature value f_j already exists in the feature record table. If the feature value f_j already exists, the current experience data is not stored in the experience pool;
if the feature value f_j does not exist in the feature record table, the feature value is stored in the feature record table, and the current experience data is stored in the experience pool.
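The screening procedure described above can be sketched as follows; the FilteredReplayBuffer class, the rounding used to make feature values hashable, and the eviction of the oldest experience when the pool is full are implementation assumptions, not details fixed by the patent.

```python
import random
from collections import deque

class FilteredReplayBuffer:
    """Experience pool that rejects experiences whose feature value is already recorded."""

    def __init__(self, capacity, feature_fn):
        self.capacity = capacity
        self.feature_fn = feature_fn   # maps an experience to its feature value f (e.g. self-encoder codes)
        self.buffer = deque()          # stores tuples (s_t, a_t, r_t, s_next, key)
        self.features = set()          # the feature record table

    def _key(self, experience):
        f = self.feature_fn(experience)
        return tuple(round(float(v), 3) for v in f)    # rounding assumed, to make the feature hashable

    def add(self, experience):
        key = self._key(experience)
        if key in self.features:                       # duplicate feature value: do not store
            return False
        if len(self.buffer) >= self.capacity:          # pool full: remove one experience and its feature
            removed = self.buffer.popleft()
            self.features.discard(removed[-1])
        self.buffer.append((*experience, key))
        self.features.add(key)
        return True

    def sample(self, n):
        """Randomly take n experiences at different moments for training."""
        return [e[:-1] for e in random.sample(list(self.buffer), n)]
```

A buffer like this keeps at most one experience per distinct feature value, which is what improves the diversity of the experiences available for replay.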
Training an unmanned aerial vehicle autonomous control task decision network model through the extracted empirical data, and specifically comprises the following steps:
randomly taking N experiences at different moments from the experience pool to form a sampled experience data set with the structure <S, A, R, S'>, where S is the set of current-moment states of the unmanned aerial vehicle in the sampled experience data set, A is the set of current-moment action strategies of the unmanned aerial vehicle, R is the set of current-moment reward values of the unmanned aerial vehicle, and S' is the set of next-moment states of the unmanned aerial vehicle obtained after the current action set A is applied in the current state set S;
inputting S' into the target Actor network to obtain the set A' of action strategies of the unmanned aerial vehicle at the next moment, and then inputting A' and S' together into the target Critic network to obtain the target Q' value estimated for the next moment;
the loss function of the Critic network is defined as:
L(θ^Q) = (1/N) · Σ_{i=1..N} (y_i - Q(s_i, a_i | θ^Q))^2   (6)
where θ^Q is a parameter of the current Critic network and N represents the number of experiences extracted during training; Q(s_i, a_i | θ^Q) represents the Q value output by the current Critic network when s_i and a_i are the inputs;
y_i can be expressed as:
y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})   (7)
where γ is the discount factor, θ^{Q′} is a parameter of the target Critic network, and θ^{μ′} is a parameter of the target Actor network; Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}) represents the output of the target Critic network when s_{i+1} and μ′(s_{i+1} | θ^{μ′}) are taken as inputs;
taking y_i as the training label, updating the weights of the current Critic network through back propagation;
training and updating the weights of the current Actor network by adopting an off-policy method;
updating the weights of the target Critic network and the target Actor network in a soft updating mode at fixed time intervals;
and stopping training when the set training times are reached.
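One training iteration under equations (6) and (7) might be implemented as in the following sketch (a DDPG-style update). The optimizers, the discount factor gamma = 0.99, the soft-update rate tau = 0.005, and updating the target networks on every step are illustrative assumptions; the batch is assumed to come from the screened experience pool sketched earlier.

```python
import torch
import torch.nn.functional as F

def soft_update(target, source, tau=0.005):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, gamma=0.99):
    s, a, r, s_next = batch                      # tensors built from the N sampled experiences

    # Target value of Eq. (7): y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss of Eq. (6): mean squared error between y and Q(s, a)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximise Q(s, mu(s)) by minimising its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (done every step here, as an assumption)
    soft_update(target_critic, critic)
    soft_update(target_actor, actor)
```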
The invention further provides an unmanned aerial vehicle autonomous control system 200 based on experience pool optimization, as shown in fig. 2, including:
a simulation environment construction unit 201 that sets a simulation environment for a target unmanned aerial vehicle in the unmanned aerial vehicle simulator;
the first calculation unit 202 is used for establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
a second calculation unit 203, which constructs a self-encoder for feature extraction according to the state space and the action space;
the model building unit 204 is used for building an unmanned aerial vehicle autonomous control task decision network model;
the model training unit 205 is used for loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
the unmanned aerial vehicle autonomous control unit 206 autonomously controls the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
Wherein, the simulation environment construction unit 201 is further configured to: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
The self-encoders for feature extraction comprise a self-encoder for extracting state information features and a self-encoder for extracting action information features;
the self-encoder for extracting the state information features is constructed according to a state space, and specifically comprises the following steps: establishing a simulation state data set according to a state space, and establishing a self-encoder for extracting state information characteristics according to the simulation state data set;
the self-encoder for extracting the motion information features is constructed according to a motion space, and specifically comprises the following steps: and constructing a simulated motion data set according to the motion space, and constructing a self-encoder for extracting the motion information characteristics according to the simulated motion data set.
The unmanned aerial vehicle autonomous control task decision network model comprises an Actor network and a Critic network.
Loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data specifically includes:
loading a simulation environment;
making an action decision according to the current state information of the target unmanned aerial vehicle by using an unmanned aerial vehicle autonomous control task decision network model; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment;
calculating reward values generated after actions of the target unmanned aerial vehicle act on the simulation environment at all times through the reward function, and generating experience data according to the reward values, the state information and the action information of the target unmanned aerial vehicle and new state information;
extracting characteristic values of the empirical data through a self-encoder, and screening the empirical data according to the characteristic values;
and training the unmanned aerial vehicle autonomous control task decision network model through the screened empirical data.
The method can effectively reduce repeated experience stored in the experience pool, improve the diversity of the experience in the experience pool, ensure that the intelligent agent can learn various experiences as much as possible, solve the problem that the intelligent agent is difficult to learn the optimal strategy due to sparse reward, and accelerate the learning of the decision network model.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiment of the invention can be realized by adopting various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An unmanned aerial vehicle autonomous control method based on experience pool optimization is characterized by comprising the following steps:
setting a simulation environment for a target unmanned aerial vehicle in an unmanned aerial vehicle simulator;
establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
constructing an auto-encoder for feature extraction according to the state space and the action space;
constructing an unmanned aerial vehicle autonomous control task decision network model;
loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through a self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
and autonomously controlling the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
2. The method of claim 1, further comprising: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
3. The method of claim 1, wherein the self-encoders for feature extraction comprise: a self-encoder for extracting state information features and a self-encoder for extracting action information features;
the self-encoder for extracting the state information features is constructed according to a state space, and specifically comprises the following steps: establishing a simulation state data set according to a state space, and establishing a self-encoder for extracting state information characteristics according to the simulation state data set;
the self-encoder for extracting the motion information features is constructed according to a motion space, and specifically comprises the following steps: and constructing a simulated motion data set according to the motion space, and constructing a self-encoder for extracting motion information characteristics according to the simulated motion data set.
4. The method of claim 1, wherein the drone autonomous control task decision network model comprises: an Actor network and a Critic network.
5. The method according to claim 1, wherein loading the simulation environment to simulate the flight of the target drone, generating experience data through the reward function, extracting feature values of the experience data through a self-encoder, screening the experience data according to the feature values, and training the drone autonomous control task decision network model according to the screened experience data specifically comprises:
loading a simulation environment;
making an action decision according to the current state information of the target unmanned aerial vehicle by using an unmanned aerial vehicle autonomous control task decision network model; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment;
calculating reward values generated after actions of the target unmanned aerial vehicle act on the simulation environment at all times through the reward function, and generating experience data according to the reward values, the state information and the action information of the target unmanned aerial vehicle and new state information;
extracting characteristic values of the empirical data through a self-encoder, and screening the empirical data according to the characteristic values;
and training the unmanned aerial vehicle autonomous control task decision network model through the screened empirical data.
6. An autonomous unmanned aerial vehicle control system based on experience pool optimization, the system comprising:
a simulation environment building unit which sets a simulation environment for a target unmanned aerial vehicle in the unmanned aerial vehicle simulator;
the first calculation unit is used for establishing a state space, an action space and a reward function of the target unmanned aerial vehicle;
the second calculation unit is used for constructing a self-encoder for feature extraction according to the state space and the action space;
the model building unit is used for building an unmanned aerial vehicle autonomous control task decision network model;
the model training unit is used for loading the simulation environment to simulate the flight of the target unmanned aerial vehicle, generating experience data through the reward function, extracting characteristic values of the experience data through the self-encoder, screening the experience data according to the characteristic values, and training the unmanned aerial vehicle autonomous control task decision network model according to the screened experience data;
and the unmanned aerial vehicle autonomous control unit autonomously controls the target unmanned aerial vehicle through the trained unmanned aerial vehicle autonomous control task decision network model.
7. The system of claim 6, wherein the simulation environment construction unit is further configured to: in the process of setting the simulation environment, the starting position and the target position of the target unmanned aerial vehicle flying in the simulation environment are set simultaneously.
8. The system of claim 6, wherein the self-encoders for feature extraction comprise: a self-encoder for extracting state information features and a self-encoder for extracting action information features;
the self-encoder for extracting the state information features is constructed according to a state space, and specifically comprises the following steps: establishing a simulation state data set according to a state space, and establishing a self-encoder for extracting state information characteristics according to the simulation state data set;
the self-encoder for extracting the motion information features is constructed according to a motion space, and specifically comprises the following steps: and constructing a simulated motion data set according to the motion space, and constructing a self-encoder for extracting motion information characteristics according to the simulated motion data set.
9. The system of claim 6, wherein the drone autonomous control task decision network model comprises: an Actor network and a Critic network.
10. The system according to claim 6, wherein the loading of the simulated environment simulating the flight of the target drone, the generating of experience data through the reward function, the extracting of feature values of the experience data through the self-encoder, the screening of the experience data according to the feature values, and the training of the drone autonomous control task decision network model according to the screened experience data specifically include:
loading a simulation environment;
making an action decision according to the current state information of the target unmanned aerial vehicle by using an unmanned aerial vehicle autonomous control task decision network model; the action decision is used for controlling the target unmanned aerial vehicle to simulate flight in a simulation environment;
calculating reward values generated after actions of the target unmanned aerial vehicle act on the simulation environment at all times through the reward function, and generating experience data according to the reward values, the state information and the action information of the target unmanned aerial vehicle and new state information;
extracting characteristic values of the empirical data through a self-encoder, and screening the empirical data according to the characteristic values;
and training the unmanned aerial vehicle autonomous control task decision network model through the screened empirical data.
CN202210654543.XA 2022-06-10 2022-06-10 Unmanned aerial vehicle autonomous control method and system based on experience pool optimization Pending CN114942651A (en)

Priority Applications (1)

Application Number: CN202210654543.XA; Publication: CN114942651A (en); Priority Date: 2022-06-10; Filing Date: 2022-06-10; Title: Unmanned aerial vehicle autonomous control method and system based on experience pool optimization

Applications Claiming Priority (1)

Application Number: CN202210654543.XA; Publication: CN114942651A (en); Priority Date: 2022-06-10; Filing Date: 2022-06-10; Title: Unmanned aerial vehicle autonomous control method and system based on experience pool optimization

Publications (1)

Publication Number: CN114942651A; Publication Date: 2022-08-26

Family

ID=82910072

Family Applications (1)

Application Number: CN202210654543.XA; Status: Pending; Publication: CN114942651A (en); Priority Date: 2022-06-10; Filing Date: 2022-06-10; Title: Unmanned aerial vehicle autonomous control method and system based on experience pool optimization

Country Status (1)

Country Link
CN (1) CN114942651A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination