CN118034331A - Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning

Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning

Info

Publication number
CN118034331A
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, target, information, state
Legal status
Pending
Application number
CN202311801153.1A
Other languages
Chinese (zh)
Inventor
柯良军
刘子锋
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Application filed by Xi'an Jiaotong University
Priority to CN202311801153.1A
Publication of CN118034331A


Classifications

    • Y02T10/40 Engine management systems (Y02T10/10 Internal combustion engine [ICE] based vehicles; Y02T10/00 Road transport of goods or passengers; Y02T Climate change mitigation technologies related to transportation)

Landscapes

  • Navigation (AREA)

Abstract

An unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning comprises the following steps. Step 1: construct the agent in a multidimensional way through the S2MAC autonomous navigation decision algorithm. Step 2: given a global navigation target task, the environment first gives the navigation target g of the current round and the current observation o_t; the navigation target g passes through the navigation target module to obtain the navigation sub-target feature of the current time step. Step 3: o_t passes through the feature extraction module to obtain an observation feature, which is input into the state memory module to obtain a state memory feature. Step 4: the features are combined to obtain the state feature s_t of the agent; the agent takes s_t as input, outputs a continuous unmanned aerial vehicle flight action value a_t, and returns the action to the environment to obtain a reward value r_t. Step 5: the algorithm is updated using the policy solution based on prioritized experience replay to obtain the optimal policy. The invention effectively improves the success rate of navigation decisions.

Description

Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle autonomous navigation, and particularly relates to an unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning.
Background
Because the navigation target is implicitly contained in the model state, the navigation information is coupled with the model parameters, so the navigation success rate of an already trained decision model drops noticeably whenever a new navigation target appears. At the same time, because the unmanned aerial vehicle can only partially observe its environment, the decision model lacks global environment information and easily falls into local optima.
For the autonomous decision-making problem of unmanned aerial vehicles, conventional solutions fall into two types. In the first, an environmental map model is known and the drone uses it to make planning decisions. In the second, the environmental map model is unknown and an expert designs the decision strategy using corresponding expert knowledge. However, both approaches have significant limitations.
As application scenarios become more diverse and complex, it is difficult to obtain an accurate map model of the environment in practice, and for many unknown environments it is also difficult to obtain the corresponding expert knowledge.
The deep deterministic policy gradient algorithm and its variants are among the most commonly used reinforcement learning algorithms for unmanned aerial vehicle autonomous navigation decisions at present. In such algorithms, the autonomous navigation decision problem is modeled as an agent that makes a decision according to the current state and then directly outputs a continuous action value. The MPTD3 algorithm, for example, avoids a series of problems such as sampling by outputting actions directly, but this limits the selection of actions. MPTD3 also designs multiple experience pools and gradient truncation to improve convergence, which however reduces exploration, making it difficult to adapt to multi-target navigation decisions with randomly generated target positions. Another commonly used class of navigation decision algorithms is based on proximal policy optimization, such as VDAS-PPO, which adopts speed guidance to optimize the objective but still models the target implicitly and uses a clipping function to limit updates. This ignores the fact that in practical applications the navigation target point of the unmanned aerial vehicle changes: the navigation target point is implicitly contained in the decision-making agent, and the clipping function, while ensuring learning stability, also limits learning diversity, so the unmanned aerial vehicle easily falls into local optima and its navigation decision capability for newly appearing target points drops sharply. Neither MPTD3 nor VDAS-PPO takes into account that the unmanned aerial vehicle usually cannot obtain information about the entire environment, so the unmanned aerial vehicle is only partially observable and easily falls into local optima. At the same time, both sample the experience replay pool uniformly. In an autonomous navigation decision task, one training round usually requires hundreds of decision time steps; the feedback signals during stable flight are often weak, and an obvious feedback signal is obtained only when a round-ending event such as a collision occurs. For such problems, it is difficult to extract these critical state transition sequences with conventional uniform sampling, and training efficiency is low.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention aims to provide an unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning. The state memory module provided by the method can help the agent construct a spatial memory of the environment, increase decision information, and effectively improve the success rate of navigation decisions.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning comprises the following steps:
step 1: constructing reinforcement learning elements;
The agent is constructed in a multidimensional way through the S2MAC autonomous navigation decision algorithm, so that the decision information of the agent is increased; the S2MAC state memory pool can use information from the training process to expand the decision relationship into one-to-many, thereby effectively utilizing global flight information;
Step 2: given a global navigation target task, the environment first gives the navigation target g of the current round and the current observation information o_t, and the navigation target g passes through the navigation target module to obtain the navigation sub-target feature of the current time step;
step 3: o_t passes through the feature extraction module to obtain an observation feature, which is input into the state memory module to obtain a state memory feature;
Step 4: the features are then combined to obtain the state feature s_t of the agent; the agent takes the state feature s_t as input, outputs a continuous unmanned aerial vehicle flight action value a_t, and returns the action value to the environment to obtain a reward value r_t;
step 5: the algorithm is updated using the policy solution based on prioritized experience replay to obtain the optimal policy.
The step 1 specifically comprises the following steps:
The MDP problem of unmanned aerial vehicle autonomous decision-making is modeled, and the three reinforcement learning elements, namely the state space, the action space and the reward function, are constructed;
Step 1): constructing the state space:
In the unmanned aerial vehicle autonomous navigation decision task, the global environment state is unknown; the only information the unmanned aerial vehicle can use for navigation decisions is the partial information obtained from its front-mounted onboard camera and its own onboard sensors; the invention refers to this raw information as the observation o of the unmanned aerial vehicle;
The first part of the observation is the depth image information o_i acquired by the front-mounted camera of the unmanned aerial vehicle;
The second part is state information about the unmanned aerial vehicle itself acquired by its onboard sensors, called o_s, including its own position information, speed information, acceleration information and yaw angle; describing the unmanned aerial vehicle's own state requires the following information:
Position information o_p: through the position information, which includes the current position of the unmanned aerial vehicle and the position of the navigation target point, the unmanned aerial vehicle can obtain the distance and azimuth to the target position for decision-making;
Linear velocity information o_v: through the linear velocity information, the unmanned aerial vehicle knows its current flight speed and can decide the next action accordingly;
Yaw angle information o_a: through the yaw angle information, the unmanned aerial vehicle can better evaluate its current flight state. The basic state space is obtained by feature extraction from the information (o_i, o_p, o_v, o_a): the depth image information o_i collected by the camera first passes through a larger convolution layer to extract feature blocks, then a pooling operation to speed up computation and prevent parameter overfitting, then two smaller convolution layers with pooling to obtain features, and finally the features are flattened through a multi-layer perceptron to obtain the image features; after the image features are obtained, the unmanned aerial vehicle's own information is encoded, passed through a fully connected network for feature extraction, and concatenated with the image features to obtain the basic state feature f_o;
Step 2): constructing the action space:
The speeds in the x-axis and y-axis directions are bound together as one dimension of the action space, and the speed in the z-axis direction is taken as a separate dimension; meanwhile, for yaw control, an angular velocity command is used; the action space is shown in formula (1):
v = (v_xy, v_z, v_yaw) (1)
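For illustration, a minimal sketch of mapping a normalized policy output to this velocity command is given below; the [-1, 1] action range and the maximum-speed constants are assumptions of the sketch, not values specified by the invention.

```python
import numpy as np

# Assumed (hypothetical) limits; the text does not give numeric bounds.
V_XY_MAX = 5.0      # max horizontal speed, m/s
V_Z_MAX = 2.0       # max vertical speed, m/s
YAW_RATE_MAX = 0.5  # max yaw angular rate, rad/s

def action_to_velocity_command(action: np.ndarray) -> tuple:
    """Map a normalized action in [-1, 1]^3 to v = (v_xy, v_z, v_yaw)."""
    a = np.clip(action, -1.0, 1.0)
    v_xy = a[0] * V_XY_MAX        # combined horizontal speed (x/y bound together)
    v_z = a[1] * V_Z_MAX          # vertical speed
    v_yaw = a[2] * YAW_RATE_MAX   # yaw angular-rate command
    return v_xy, v_z, v_yaw
```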
Step 3): constructing the reward function:
First, consider the attraction of the navigation target point to the unmanned aerial vehicle; define the distance to the global navigation target point at the beginning of the round as dis_g, the distance to the target point at the previous time step t-1 as dis_{t-1}, and the distance at the current time step t as dis_t; then, to encourage the unmanned aerial vehicle to fly towards the navigation target point, the reward function r_dis is as shown in formula (2):
Next, consider the collision penalty for obstacles; when the unmanned aerial vehicle collides with an obstacle, a collision penalty r_coli is defined to reduce the cumulative reward of the unmanned aerial vehicle, so that obstacles are avoided; collisions should be completely avoided, since in reality a collision causes great harm, so when a collision occurs the current training round is ended; meanwhile, if the unmanned aerial vehicle flies beyond the range of the map, a penalty r_out with the same value as r_coli is given and the round of training is ended; having set penalties for collisions with obstacles and for leaving the boundary, a reward r_reach for reaching the global navigation target point should be set correspondingly; if the unmanned aerial vehicle frequently performs yaw rotation during flight, its flight becomes unstable, so a penalty r_yaw on the yaw speed of the action output is defined as shown in formula (3);
where Y is the set maximum yaw rate; finally, since the unmanned aerial vehicle is expected to reach the destination as soon as possible in actual use, a penalty r_step = c·t for the time term is also set; the reward function for each time step is then obtained as in formula (4);
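For illustration only, a hedged sketch of such a per-step reward is given below; the exact forms of formulas (2)-(4) are not reproduced in this text, so the distance term, the penalty magnitudes and the coefficient c are assumptions of the sketch, not the invention's values.

```python
def step_reward(dis_prev, dis_curr, collided, out_of_map, reached,
                v_yaw, t, yaw_max=0.5, c=0.01):
    """Illustrative per-step reward combining the terms named in the text.

    dis_prev / dis_curr: distance to the global target at steps t-1 and t.
    All coefficients below are assumed for illustration only.
    """
    r = 1.0 * (dis_prev - dis_curr)   # r_dis: positive when moving closer to the target
    if collided:
        r += -10.0                    # r_coli: collision penalty, ends the round
    if out_of_map:
        r += -10.0                    # r_out: same value as r_coli, ends the round
    if reached:
        r += 10.0                     # r_reach: reward for reaching the global target
    if abs(v_yaw) > yaw_max:
        r += -0.1                     # r_yaw: penalty for an excessive yaw rate
    r += -c * t                       # r_step: time penalty
    return r
```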
The step 2 specifically comprises the following steps:
The guiding target module decomposes the global navigation target into each training step, i.e. the navigation target learning of each round is decomposed into navigation target learning at each time step, and the sample collected by the agent in one time step changes from (s, a, s′, r) to (s, a, s′, g, r); Δt is the time step of each training iteration;
In reinforcement learning for target-oriented improvement, the objective function of the strategy solution is as shown in equation 5:
J(θ) = E_{s~ρ_π, a~π(a|s,g), g~G}[ r_g(s, a, g) ] (5)
in the objective function, the consideration of the guiding target is increased, and the guiding target is combined to obtain the whole objective function to be maximized;
In reinforcement learning of the target-oriented improvement, the rewrite of the conventional Q-value function is as shown in formula 6:
Q(s, a, g) = E_π[ Σ_{t=1}^{∞} γ^{t-1} r_g(s_t, a_t, g) | s_1 = s, a_1 = a ] (6)
In a specific implementation, the target g can take various forms; if the starting position and the navigation target position are both given in the form of images, the target g can be a low-dimensional latent variable feature generated by a variational autoencoder (VAE); for the unmanned aerial vehicle navigation decision problem, the starting position and the navigation target position are given in the form of coordinates, and the intermediate target g is defined as shown in formula (7):
g_t = (x_t, y_t, z_t, yaw_t) (7)
where: x_t represents the relative distance along the x-axis from the current position to the global navigation target position; y_t the relative distance along the y-axis; z_t the relative distance along the z-axis; and yaw_t the yaw angle of the current position relative to the global target position.
By defining the intermediate navigation targets, the global navigation targets of the unmanned aerial vehicle are unified: g_t is essentially the same for different global navigation targets, so the finally obtained state space is as shown in formula (8):
s_t = (f_t^o, f_t^g) (8)
where: f_t^o is the observation feature obtained by feature extraction and encoding of the observation (o_i, o_p, o_v, o_a); f_t^g is the navigation feature encoded from the intermediate navigation target.
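A minimal sketch of computing the intermediate target g_t from the current pose and the global target coordinates is shown below; the yaw convention (heading error from atan2 of the relative displacement) is an assumption made for illustration.

```python
import math

def intermediate_target(current_pos, current_yaw, goal_pos):
    """Compute g_t = (x_t, y_t, z_t, yaw_t) relative to the global target.

    current_pos, goal_pos: (x, y, z) tuples; current_yaw in radians.
    """
    dx = goal_pos[0] - current_pos[0]   # x_t: relative distance along the x-axis
    dy = goal_pos[1] - current_pos[1]   # y_t: relative distance along the y-axis
    dz = goal_pos[2] - current_pos[2]   # z_t: relative distance along the z-axis
    # yaw_t: assumed here as the heading error to the bearing of the target
    yaw_t = math.atan2(dy, dx) - current_yaw
    yaw_t = math.atan2(math.sin(yaw_t), math.cos(yaw_t))  # wrap to [-pi, pi]
    return dx, dy, dz, yaw_t
```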
The step 3 specifically comprises the following steps:
The features of the constructed state memory are extracted using an attention mechanism, and the constructed state memory module is integrated into the S2MAC algorithm;
Step 1): state memory construction
At time t, the constructed historical image information with length K is shown in formula (9), where the element at index t-k represents the image information at time t-k;
at time t, the constructed historical state information of the unmanned aerial vehicle with length K is shown in formula (10), where the element at index t-k represents the state information of the unmanned aerial vehicle at time t-k;
The state memory of the agent with length K at time t is defined as shown in formula (11):
When a memory unit is calculated, the feature information of the image is extracted with a convolutional network to obtain the image feature f_i^i; at the same time the state information is embedded and encoded through a fully connected layer to obtain the state feature f_i^s; the obtained image feature f_i^i and state feature f_i^s are concatenated to obtain the i-th state memory unit m_i = (f_i^i || f_i^s);
Step 2) state memory extraction:
Assuming the weight of the i-th memory unit is a_i, at time t the weights of the memory units of length K are (a_{t-K}, a_{t-K+1}, ..., a_{t-1}), and the memory feature f_t^m obtained after passing through the attention unit is shown in formula (12):
The weight of a memory unit is calculated using additive attention, as shown in formula (13):
where: W_v, W_q, W_k are parameters of the network; q is the feature of the current time step;
After the weight e_i of the i-th memory unit is obtained, a softmax normalization is performed, and the finally obtained state memory weight a_i is shown in formula (14):
step 3): integrating state memory;
Combining the obtained state memory feature f_t^m with the unmanned aerial vehicle state feature f_t^o and the target guidance feature f_t^g, the state space of the S2MAC algorithm is obtained as shown in formula (15):
f_t^s = (f_t^m || f_t^o || f_t^g) (15)
From the observation o_t at time t, the generated target guidance feature f_t^g is first obtained; the observation feature f_t^o is then obtained through fully connected layers; at the same time, k steps are traced back from the current time t to obtain the state memory of k steps, which is encoded together with the current state by the attention unit to obtain the state memory feature f_t^m; finally all the features are concatenated to obtain the reinforcement learning state feature, and the integrated state is input into the solving algorithm to compute the policy and Q value.
The step 4 specifically comprises the following steps:
A prioritized experience replay technique is employed to improve the SAC algorithm;
In the reinforcement learning algorithm, the difference between the Q value of the current network and the Q value of the target network is called TD-error, and is commonly denoted by δ. The delta is calculated as shown in formula (16):
δ = r + γ·π(·|s′)^T ( Q^π(·|s′) − α·log π(·|s′) ) − Q(s, a) (16)
where: Q represents the Q value of the current network; Q^π represents the Q value of the target network;
after the TD error is calculated, the priority of the current experience is expressed as shown in formula (17):
p=|δ|+ε (17)
where: ε is a very small positive number that prevents the priority from being 0 when the TD error is 0.
After the experience priority is obtained, normalization is performed to increase the stability of the algorithm, so the probability that an experience is sampled is obtained as shown in formula (18):
where: α is a hyperparameter of priority sampling that controls how strongly priorities influence sampling.
In S2MAC, the correction of the loss function is achieved by importance sampling: a correction coefficient ω is multiplied into the loss calculation of each state sequence, and the correction coefficient is defined as shown in formula (19):
where: N represents the number of samples in the experience replay pool and η is an importance factor;
the gradient weight calculation after importance sampling is shown in formula (20):
The step 5 specifically comprises the following steps:
In the policy solving process, samples are collected from the prioritized experience pool, and the importance weights ω of the collected samples are calculated; the loss of the policy network and the loss of the Q-value network are calculated from the sample information, the parameters of the corresponding networks are updated with these losses, and the parameters of the current Q-value network are copied to the target Q-value network at regular intervals; the specific flight action of the unmanned aerial vehicle is obtained from the action distribution output by the policy network, parameterized by μ and σ.
The invention has the beneficial effects that:
The state memory module provided by the invention can help the agent construct a spatial memory of the environment, increase decision information, and effectively improve the success rate of navigation decisions.
The addition of the target guidance module can indeed accelerate the training of the algorithm, and decomposing the global target into per-step guidance targets can indeed effectively learn the commonality among different target points. The addition of the state memory module makes the convergence value of the algorithm higher, although the convergence time increases slightly because the state memory module raises the computational complexity. Because the MSAC algorithm introduces the state memory module provided by the invention, decision information is increased, the unmanned aerial vehicle makes better decisions, and the average number of steps used for a successful decision is the smallest.
Drawings
Fig. 1 is a flow chart of the S2MAC algorithm.
Fig. 2 is a schematic diagram of image information of a state space.
Fig. 3 is a schematic diagram of an image feature extraction network.
FIG. 4 is an explicit state objective decision diagram.
FIG. 5 is a schematic diagram of a state memory construction.
FIG. 6 is a state memory extraction schematic.
Fig. 7 is a schematic diagram of a characteristic network structure of the S2MAC algorithm.
Fig. 8 is a block diagram of the S2MAC algorithm.
Fig. 9 is a schematic view of a stone column barrier environment.
Fig. 10 is a schematic of a successful flight path for the training process.
FIG. 11 is a failed flight path illustration of a training process.
FIG. 12 is a schematic diagram of a different reinforcement learning algorithm rewards curve.
FIG. 13 is a diagram of reward curves for different state memory lengths.
FIG. 14 is a schematic view of image memory visualization.
Fig. 15 is a schematic illustration of an ablation experimental reward curve.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in figs. 1-15: an unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning comprises the following steps:
step 1: constructing reinforcement learning elements;
The agent is constructed in a multidimensional way through the S2MAC autonomous navigation decision algorithm, so that the decision information of the agent is increased; the S2MAC state memory pool can use information from the training process to expand the decision relationship into one-to-many, thereby effectively utilizing global flight information;
step 2: given a global navigation target task, the environment first gives the navigation target g of the current round and the current observation information o_t, and the navigation target g passes through the navigation target module to obtain the navigation sub-target feature of the current time step;
Step 3: o_t passes through the feature extraction module to obtain an observation feature, which is input into the state memory module to obtain a state memory feature;
Step 4: the features are combined to obtain the state feature s_t of the agent; the agent takes the state feature s_t as input, outputs a continuous unmanned aerial vehicle flight action value a_t, and returns the action value to the environment to obtain a reward value r_t;
step 5: the algorithm is updated using the policy solution based on prioritized experience replay to obtain the optimal policy.
The step 1 specifically comprises the following steps:
The single-scene unmanned aerial vehicle autonomous navigation decision problem is a real-time decision control problem addressed with deep reinforcement learning. The flight decision of the unmanned aerial vehicle is a sequential decision problem, so it can be modeled as a Markov decision process (MDP) and solved with a deep reinforcement learning method. The MDP problem of unmanned aerial vehicle autonomous decision-making is modeled, and the three reinforcement learning elements of state space, action space and reward function are given.
Step 1): and (3) constructing a state space:
In the unmanned aerial vehicle autonomous navigation decision task, the global environmental state is unknown. The information that the unmanned aerial vehicle can use to navigate the decision is only some information that it obtains through the onboard camera that is itself pre-positioned and its onboard sensor, this original information is referred to as the unmanned aerial vehicle's observation o in the present invention.
The observation includes two parts, the first part is depth image information collected by the front-end onboard camera of the unmanned aerial vehicle, as shown in fig. 2, and the collected depth image information is called o i, which contains shape information and distance information of an object, so that the observation is very suitable to be used as a part of a state space of an intelligent body.
The second part is some status information about the unmanned aerial vehicle itself acquired by the onboard sensor of the unmanned aerial vehicle itself, which is called o s. Including its own position information, velocity information, acceleration information, yaw angle, etc. However, in a specific decision, some information is redundant and the drone does not need to use all of the information. Therefore, only the following information is required for the self state description:
position information o p: through the position information, the unmanned aerial vehicle can obtain the distance, the azimuth and the like from the target position for decision, and the current position of the unmanned aerial vehicle and the position of the navigation target point are included.
Linear velocity information o v: through the linear velocity information, the unmanned plane can know the current flying velocity, so that the next action execution is decided according to the current velocity.
Deflection angle information o a: through deflection angle information, unmanned aerial vehicle can be better aassessment current flight state. The basic state space is thus obtained by feature extraction of the information of (o i,op,ov,oa). Wherein the image feature extraction network is shown in fig. 3. The depth image information oi collected by the unmanned aerial vehicle camera firstly passes through a layer of larger convolution layer to extract a feature block, then carries out pooling operation to accelerate operation and prevent parameter overfitting, then passes through two layers of smaller convolution layers and pooling operation to obtain features, and finally, the features are leveled through a multi-layer perceptron to obtain the features related to the image information. After the image information characteristics are obtained, the information of the unmanned aerial vehicle is subjected to coding and full-connection network characteristic extraction and then spliced together, and then the basic state characteristics f o can be obtained.
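A minimal sketch of such an observation feature extractor is given below; the layer sizes, kernel sizes and feature dimensions are assumptions for illustration, since the text does not give the exact network configuration of fig. 3.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Encodes (o_i, o_p, o_v, o_a) into the basic state feature f_o (sketch)."""
    def __init__(self, state_dim=10, feat_dim=128):
        super().__init__()
        # One larger convolution first, then two smaller ones, each followed by pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.img_mlp = nn.LazyLinear(feat_dim)              # flatten image features
        self.state_mlp = nn.Sequential(                     # encode o_p, o_v, o_a
            nn.Linear(state_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, depth_image, uav_state):
        f_img = self.img_mlp(self.cnn(depth_image))         # image feature
        f_state = self.state_mlp(uav_state)                 # own-state feature
        return torch.cat([f_img, f_state], dim=-1)          # basic state feature f_o
```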
Step 2): and (3) constructing an action space:
In order to make the unmanned aerial vehicle fly more stably, the motion space designed by the invention is a continuous motion space. Considering that in the practical environment, the navigation target point is far away from the initial position on the xy-axis plane relative to the z-axis plane, and the unmanned plane moves on the horizontal plane for more time, the speed in the x-axis direction is more important than the speed in the y-axis direction, and the correlation of the speed in the x-axis direction and the speed in the y-axis direction is higher, so that the speed in the x-axis direction and the speed in the y-axis direction are bound together to be used as one dimension of the motion space, and the speed in the z-axis direction is independently used as one dimension to be considered. Meanwhile, for the control of the yaw path, there are two modes of an angle instruction and an angular velocity instruction, and here, in order to make the rotation smoother, the angular velocity instruction is used to control yaw. Finally, the action space of the invention is obtained as shown in formula (1):
v=(vxy,vz,vyaw) (1)
Step 3): and (3) constructing a reward function:
In unmanned autonomous navigation decision-making tasks, one simple idea is to set sparse rewards to guide their flight. When the unmanned aerial vehicle reaches the final navigation target point, a positive rewarding signal is given to the unmanned aerial vehicle, and when the unmanned aerial vehicle collides or flies out of a specified environment map, a negative rewarding signal is given to the unmanned aerial vehicle. The unmanned aerial vehicle is encouraged to approach towards the final navigation target by maximizing the unmanned aerial vehicle jackpot through a reinforcement learning algorithm. However, this method has obvious drawbacks that only when one round is finished, the unmanned aerial vehicle can get rewards, and the design can eventually converge under the condition of simple scene, but the invention discusses autonomous navigation decisions of unmanned aerial vehicles in complex scenes, in which the space that unmanned aerial vehicles can explore is huge, and if only one round is finished, rewards are given, then convergence is difficult.
Another arrangement is to consider a number of factors to design an instant prize with feedback per time step to guide the drone to fly. First, consider the attraction of a navigation target point to a drone. Defining that the distance from the global navigation target point is dis g when the unmanned aerial vehicle is at the beginning of a round, the distance from the target point at the last time step t-1 before the current time step t is dis t-1, and then the distance from the target point at the current time step is dis t, in order to encourage the unmanned aerial vehicle to fly towards the navigation target point, a reward function r dis set by the invention is as shown in formula (2):
By means of the arrangement, positive feedback can be obtained whenever the distance between the current unmanned aerial vehicle and the global navigation target point is shortened compared with the distance between the current unmanned aerial vehicle and the last time step, negative feedback can be obtained when the distance between the current unmanned aerial vehicle and the global navigation target point is increased compared with the distance between the current unmanned aerial vehicle and the last time step, and in order to maximize accumulated return, the unmanned aerial vehicle is attracted by the global navigation target.
Next, the collision penalty for the obstacle is considered. When the unmanned aerial vehicle collides with the obstacle, a collision penalty r coli is defined to achieve the purpose of reducing the cumulative rewards of the unmanned aerial vehicle, so that the obstacle is avoided. The occurrence of a collision should be completely avoided, and in reality, when the collision occurs, the collision is set to be extremely harmful, and the training round of the round is ended. Meanwhile, if the unmanned aerial vehicle exceeds the range of the map in the flight process, a penalty r out is given, the value of the penalty r out is the same as that of r coli, and the round of training is finished. Similarly, setting penalizes the case of collision to an obstacle and the case of setting a boundary, and then a prize r reach to the global navigation target point should be set accordingly. When the unmanned plane is in the flight process, if yaw rotation is frequently carried out, the unmanned plane can cause unstable flight, so the invention also designs a penalty r yaw for the yaw speed output by action, as shown in a formula (3);
Wherein: y is the set maximum yaw rate; finally, as soon as possible the drone is expected to reach the destination in practical use as mentioned before, a penalty r step = c·t is also set for the time term. Finally, a bonus function for each time step, as in equation (4);
The step 2 specifically comprises the following steps:
The guiding target module decomposes the global navigation target into each training step, i.e. the navigation target learning of each round is decomposed into navigation target learning at each time step, and the guiding target is explicitly embedded in the state of each time step, as shown in fig. 4; the sample collected by the agent in one time step changes from (s, a, s′, r) to (s, a, s′, g, r). Thus, when learning the next action, the agent no longer depends only on the current state information but must also consider the navigation target information of the current step. The idea is similar to calculus: the many global navigation targets are differentiated into small intermediate navigation targets, and a new global navigation target can then be regarded as an integration over a number of different intermediate navigation targets, so retraining is not required. Since the minimum unit of reinforcement learning training is a time step, Δt in the present invention is the time step of each training iteration.
In reinforcement learning for target-oriented improvement, the objective function of the strategy solution is as shown in equation 5:
J(θ) = E_{s~ρ_π, a~π(a|s,g), g~G}[ r_g(s, a, g) ] (5)
in the objective function, the consideration of the guiding targets is increased, and the guiding targets are combined to obtain the whole objective function to be maximized.
In reinforcement learning of the target-oriented improvement, the rewrite of the conventional Q-value function is as shown in formula 6:
Q(s, a, g) = E_π[ Σ_{t=1}^{∞} γ^{t-1} r_g(s_t, a_t, g) | s_1 = s, a_1 = a ] (6)
In a specific implementation, the target g can take various forms. If the starting position and the navigation target position are both given in the form of images, the target g may be a low-dimensional latent variable feature generated by a variational autoencoder (VAE). For the unmanned aerial vehicle navigation decision problem, the starting position and the navigation target position are given in the form of coordinates, so the intermediate target g does not need to be generated by a complex method such as a VAE. The definition of the intermediate target g is shown in formula (7):
g_t = (x_t, y_t, z_t, yaw_t) (7)
where: x_t represents the relative distance along the x-axis from the current position to the global navigation target position; y_t the relative distance along the y-axis; z_t the relative distance along the z-axis; and yaw_t the yaw angle of the current position relative to the global target position.
Through this definition of the intermediate navigation target, the global navigation targets of the unmanned aerial vehicle can be unified: g_t has no essential difference for different global navigation targets, so the finally obtained state space is as shown in formula (8):
s_t = (f_t^o, f_t^g) (8)
where: f_t^o is the observation feature obtained by feature extraction and encoding of the observation (o_i, o_p, o_v, o_a); f_t^g is the navigation feature encoded from the intermediate navigation target.
The step 3 specifically comprises the following steps:
The state memory module can effectively use the historical information from the agent's training process and increase the agent's decision information, effectively avoiding the problem of disordered learning that is difficult to converge. Using historical unmanned aerial vehicle state information also lets the unmanned aerial vehicle know its previous motion trajectory, which facilitates better decisions. The invention provides a concrete way of constructing the state memory for the unmanned aerial vehicle autonomous navigation decision problem, extracts the constructed state memory with an attention mechanism, and finally integrates the constructed state memory module into the S2MAC algorithm.
Step 1): state memory construction
At time t, the constructed historical image information with length K is shown in formula (9), where the element at index t-k represents the image information at time t-k.
At time t, the constructed historical state information of the unmanned aerial vehicle with length K is shown in formula (10), where the element at index t-k represents the state information of the unmanned aerial vehicle at time t-k.
The state memory of the agent with length K at time t is defined as shown in formula (11):
The construction of each state memory unit is shown in fig. 6. When a memory unit is calculated, the feature information of the image is extracted with a convolutional network to obtain the image feature f_i^i; at the same time the state information is embedded and encoded through a fully connected layer to obtain the state feature f_i^s. The obtained image feature f_i^i and state feature f_i^s are concatenated to obtain the i-th state memory unit m_i = (f_i^i || f_i^s).
Step 2) state memory extraction:
Based on the state memory constructed in the previous section, this section considers extracting the key information in the state memory as the decision basis for the S2MAC algorithm. Attention mechanisms are often used to distinguish the importance of different information; they emphasize useful information by giving it more weight. The calculation of attention generally comprises three stages: the first stage calculates the correlation between the query value and the key values to obtain corresponding weight coefficients s for the next stage; the second stage normalizes the coefficients s calculated in the first stage to obtain coefficients a; finally, the values are multiplied by the normalized weight coefficients a obtained in the second stage and summed to obtain the attention value.
In the S2MAC algorithm, a state memory extraction method based on an attention mechanism is proposed, and an extraction framework of the state memory is shown in fig. 7.
Assuming the weight of the i-th memory unit is a_i, at time t the weights of the memory units of length K are (a_{t-K}, a_{t-K+1}, ..., a_{t-1}), and the memory feature f_t^m obtained after passing through the attention unit is shown in formula (12):
The weight of a memory unit is calculated using additive attention, as shown in formula (13):
where: W_v, W_q, W_k are parameters of the network; q is the feature of the current time step.
After the weight e_i of the i-th memory unit is obtained, a softmax normalization is performed, and the finally obtained state memory weight a_i is shown in formula (14):
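As an illustration, a minimal sketch of this additive-attention memory extraction is given below; since formulas (12)-(14) are not reproduced in this text, the standard additive-attention form e_i = w_v^T tanh(W_q q + W_k m_i), a softmax over the K units, and a weighted sum f_t^m = Σ a_i m_i are assumed.

```python
import torch
import torch.nn as nn

class StateMemoryAttention(nn.Module):
    """Additive attention over K state memory units (illustrative sketch)."""
    def __init__(self, mem_dim, query_dim, hidden_dim=64):
        super().__init__()
        self.W_q = nn.Linear(query_dim, hidden_dim, bias=False)  # query projection
        self.W_k = nn.Linear(mem_dim, hidden_dim, bias=False)    # key projection
        self.w_v = nn.Linear(hidden_dim, 1, bias=False)          # scoring vector

    def forward(self, query, memory):
        # query: (B, query_dim) feature of the current step; memory: (B, K, mem_dim)
        scores = self.w_v(torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(memory)))
        a = torch.softmax(scores, dim=1)        # state memory weights a_i
        f_m = (a * memory).sum(dim=1)           # memory feature f_t^m
        return f_m, a.squeeze(-1)
```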
step 3): integrating state memory;
The state memory feature f_t^m obtained above is combined with the unmanned aerial vehicle state feature f_t^o and the target guidance feature f_t^g; splicing these features together finally gives the state space of the S2MAC algorithm as shown in formula (15):
f_t^s = (f_t^m || f_t^o || f_t^g) (15)
The network structure of the integrated S2MAC algorithm is shown in fig. 7. From the observation o_t, the generated target guidance feature f_t^g is obtained first; the observation feature f_t^o is then obtained through fully connected layers; at the same time, k steps are traced back from the current time t to obtain the state memory of k steps, which is encoded together with the current state by the attention unit to obtain the state memory feature f_t^m. Finally all the features are spliced together to obtain the reinforcement learning state feature, and the integrated state feature is input into the solving algorithm provided by the invention to compute the policy and Q value.
The step 4 specifically comprises the following steps:
In a deep reinforcement learning algorithm, the experience collected by the agent is stored in an experience replay pool, and the agent updates the network with information drawn from this pool. In a conventional reinforcement learning algorithm, samples are drawn from the experience pool uniformly. In the unmanned aerial vehicle autonomous navigation decision task, one training round often requires hundreds of decision time steps, and during flight an obvious feedback signal is only obtained when the unmanned aerial vehicle collides, flies out of the map range, or reaches the navigation target position.
During the training of one round, the current round ends immediately when any of these three situations occurs, so these salient state transition sequences occupy only a small part of the experience replay pool. Conventional uniform sampling is insensitive to them and has difficulty extracting the critical information, which leads to overly slow convergence. In view of this, the present invention employs a prioritized experience replay technique to improve the SAC algorithm.
In the reinforcement learning algorithm, the difference between the Q value of the current network and the Q value of the target network is called the TD error, commonly denoted δ, which is calculated as shown in formula (16):
δ = r + γ·π(·|s′)^T ( Q^π(·|s′) − α·log π(·|s′) ) − Q(s, a) (16)
where: Q represents the Q value of the current network; Q^π represents the Q value of the target network.
After the TD error is calculated, the priority of the current experience is expressed as shown in formula (17):
p=|δ|+ε (17)
where: ε is a very small positive number that prevents the priority from being 0 when the TD error is 0.
After the experience priority is obtained, normalization is performed to increase the stability of the algorithm, so the probability that an experience is sampled is obtained as shown in formula (18):
where: α is a hyperparameter of priority sampling that controls how strongly priorities influence sampling.
Using prioritized experience replay greatly increases the probability that the key sequences are sampled. At the same time, however, this sampling changes the sample distribution in the experience replay pool, which biases the target loss function. To solve this distribution-shift problem, S2MAC corrects the loss function through importance sampling: a correction coefficient ω is multiplied into the loss calculation of each state sequence. In S2MAC, the correction coefficient is defined as shown in formula (19):
where: N represents the number of samples in the experience replay pool and η is an importance factor.
The gradient weight calculation after importance sampling is shown in formula (20):
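A hedged sketch of this prioritized replay buffer is given below; since formulas (18)-(20) are not reproduced here, the standard proportional-prioritization forms P(i) = p_i^α / Σ_k p_k^α and ω_i = (N·P(i))^(−η), normalized by the maximum weight, are assumed, and the simple O(N) sampling is for illustration only.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative prioritized experience replay (proportional variant)."""
    def __init__(self, capacity, alpha=0.6, eta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.eta, self.eps = capacity, alpha, eta, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        p_max = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = p_max
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                                   # assumed form of formula (18)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-self.eta)
        weights /= weights.max()                              # importance weights omega
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps   # p = |delta| + eps
```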
The step 5 specifically comprises the following steps:
In the policy solving process, samples are collected from the prioritized experience pool, and the importance weights ω of the collected samples are calculated. The loss of the policy network and the loss of the Q-value network are calculated from the sample information, the parameters of the corresponding networks are updated with these losses, and the parameters of the current Q-value network are copied to the target Q-value network at regular intervals. The specific flight action of the unmanned aerial vehicle is obtained from the action distribution output by the policy network, parameterized by μ and σ. The S2MAC algorithm flow is shown in algorithm 1 below.
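To make the flow concrete, a minimal sketch of one such update step is shown below. It follows a standard SAC update with the importance weights ω applied to the critic loss; the network and optimizer names are assumptions, it assumes policy.sample(s) returns (action, log_prob) from the μ/σ distribution, and a Polyak soft update is used here in place of the periodic hard copy described in the text.

```python
import numpy as np
import torch

def update_step(policy, q_net, q_target, buffer, q_opt, pi_opt,
                batch_size=256, gamma=0.99, alpha=0.2, tau=0.005):
    """One illustrative S2MAC-style update with prioritized replay (sketch)."""
    batch, idx, w = buffer.sample(batch_size)
    s, a, r, s2, done = (torch.as_tensor(np.stack(x), dtype=torch.float32)
                         for x in zip(*batch))
    w = torch.as_tensor(w, dtype=torch.float32)

    # Soft TD target: r + gamma * (Q_target(s', a') - alpha * log pi(a'|s')), a' ~ pi
    with torch.no_grad():
        a2, logp2 = policy.sample(s2)
        target = r + gamma * (1 - done) * (q_target(s2, a2).squeeze(-1) - alpha * logp2)

    td_error = target - q_net(s, a).squeeze(-1)
    q_loss = (w * td_error.pow(2)).mean()            # importance-weighted critic loss
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    a_new, logp = policy.sample(s)                   # reparameterized action from (mu, sigma)
    pi_loss = (alpha * logp - q_net(s, a_new).squeeze(-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    buffer.update_priorities(idx, td_error.detach().numpy())   # refresh p = |delta| + eps
    for p_t, p_c in zip(q_target.parameters(), q_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p_c.data)  # soft update of the target Q network
    return q_loss.item(), pi_loss.item()
```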
Experimental results and analysis:
In order to verify the performance of the proposed S2MAC algorithm, the invention is tested on AirSim, an open-source simulation platform provided by Microsoft. For a single-scene unmanned aerial vehicle autonomous navigation decision algorithm, the performance comparison indicators are as follows:
Convergence value: for reinforcement learning problems, the higher the final converged prize value, the more excellent it represents the algorithm.
The convergence time is as follows: the time taken by the unmanned aerial vehicle intelligent body from training to stable convergence is indicated, and the smaller the value is, the higher the training efficiency is.
Success rate: it means that the unmanned aerial vehicle safely reaches the target position set at the beginning from the initial position without colliding with any obstacle.
Collision rate: for an unmanned aerial vehicle, the consequences of a collision are unpredictable, and a crash is very likely to endanger public safety, so avoiding collisions is essential.
Success decision step size: on the premise of meeting the requirement of successfully reaching the target position, the user generally hopes to complete the navigation task more quickly, and the number of flight steps can represent the time of completing the task by the unmanned aerial vehicle.
Experiment setting:
1) Experimental environment:
Simulation experiments were performed using Microsoft's open-source simulation interface package AirSim and the game development engine UE4 from Epic Games; the experimental scenario is an open-source stone column obstacle environment, as shown in fig. 9. In the stone column obstacle environment, the map plane is 120 m by 120 m and the map height ranges from 0 to 50 m. There are 12 stone columns in total, 8 of them cuboid and 4 cylindrical. The cuboid columns are 10 m long, 10 m wide and 12 m high; the cylindrical columns have a radius of 5 m and a height of 12 m. The starting position of each experimental round is the middle of the map, i.e. the unmanned aerial vehicle is evenly surrounded by the 12 stone columns, and the navigation target position is generated randomly on a sphere of radius 50 m centered on the starting position, excluding obstacle locations. The task of the experiment is for the unmanned aerial vehicle to reach the target position quickly without any collision.
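For reproduction purposes, a hedged sketch of interfacing with AirSim's Python API is shown below; how the bound horizontal speed v_xy is decomposed along the current heading is an assumption of this sketch, the time step is illustrative, and the calls should be checked against the installed AirSim version.

```python
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

def get_observation():
    """Return the depth image o_i and the UAV's own position and velocity."""
    responses = client.simGetImages([airsim.ImageRequest(
        "0", airsim.ImageType.DepthPerspective, pixels_as_float=True, compress=False)])
    depth = airsim.list_to_2d_float_array(
        responses[0].image_data_float, responses[0].width, responses[0].height)
    kin = client.getMultirotorState().kinematics_estimated
    pos = np.array([kin.position.x_val, kin.position.y_val, kin.position.z_val])
    vel = np.array([kin.linear_velocity.x_val, kin.linear_velocity.y_val,
                    kin.linear_velocity.z_val])
    return depth, pos, vel

def apply_action(v_xy, v_z, v_yaw, heading, dt=0.25):
    """Send the continuous velocity command (v_xy, v_z, v_yaw) for one time step."""
    client.moveByVelocityAsync(
        v_xy * np.cos(heading), v_xy * np.sin(heading), v_z, dt,
        drivetrain=airsim.DrivetrainType.MaxDegreeOfFreedom,
        yaw_mode=airsim.YawMode(is_rate=True, yaw_or_rate=np.degrees(v_yaw))).join()
    return client.simGetCollisionInfo().has_collided
```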
In this experiment, the software and hardware environments used are shown in table 1:
Table 1 training environment settings
2) Parameter setting:
In this experiment, the super-parametric design in the bonus function is shown in table 2:
TABLE 2 bonus function Supermarameter setting
In this experiment, some of the hyperparameters of the algorithm itself were set as shown in Table 3:
Table 3 Algorithm training hyper-parameter settings
Analysis of experimental results:
The S2MAC algorithm is compared with two currently mainstream deep reinforcement learning algorithms on the unmanned aerial vehicle autonomous navigation decision problem, namely the VDAS-PPO and MPTD3 algorithms. Meanwhile, in order to verify the effectiveness of the algorithmic improvements of the invention, the base reinforcement learning algorithm SAC that the invention improves upon is also included in the comparison. All algorithms are simulated and verified in a unified simulation environment, and the parameter settings of each algorithm are kept as consistent as possible.
The flight paths obtained by training agents in the simulation environment using the S2MAC algorithm are shown in figs. 10 and 11. Figures 10(a) and 10(b) illustrate flight paths that successfully navigated to the target point. It can be seen that the drone can successfully reach the target point during training despite different final navigation target points.
Figure 11 shows two cases of training failure during unmanned aerial vehicle training. The left panel of fig. 11 shows the unmanned aerial vehicle flying out of the map boundary: although no collision occurs during the flight, exceeding the map boundary means that the flight path is no longer within the scope considered by the invention, so the current round of training ends. The right panel of fig. 11 shows a collision of the unmanned aerial vehicle: once a collision occurs, the no-collision requirement is no longer met, so the current round of training ends.
The reward curves of the four different algorithms trained in the stone column environment are shown in fig. 12. It can be seen that the MPTD3, SAC and S2MAC algorithms can be trained to a converged value within 160k rounds, whereas the VDAS-PPO algorithm still shows an upward trend and no sign of convergence. The training speed of the VDAS-PPO algorithm is far slower than that of the other algorithms, and it needs more time to converge, because VDAS-PPO uses a clipping function to limit the lower bound of the action distribution and the stride of action updates; this makes the reward curve smooth but the convergence slow.
Meanwhile, compared with the MPTD3 algorithm, the SAC and S2MAC algorithms have larger converged reward values and their rewards rise faster early on. SAC and S2MAC are both maximum-entropy reinforcement learning algorithms: during training they maximize not only the reward but also the entropy of the policy, which increases the exploration capability of the policy and lets it be learned faster. At the same time, the increased exploration sometimes discovers unfavorable strategies, so the reward curve drops slightly in the middle. MPTD3 is an algorithm improved from TD3 and adopts a delayed-update policy network during training, so its training curve is more stable than those of SAC and S2MAC.
From the experimental results of table 4 in the stone column obstacle environment, the VDAS-PPO algorithm has the longest convergence time and the largest average number of steps for a successful decision; the multi-step temporal-difference learning adopted by VDAS-PPO makes its convergence more stable. The MPTD3 algorithm has the shortest convergence time and the smallest average number of steps for a successful decision, but the lowest convergence value, the lowest flight success rate and the highest flight collision rate. The convergence time of the S2MAC algorithm differs somewhat from the other algorithms, but it achieves a 100% flight success rate and a 0% flight collision rate in the stone column obstacle environment, and its average number of steps for a successful decision is only slightly worse than the best, MPTD3. Compared with the SAC algorithm, S2MAC adds the state memory module, so its training time is longer than that of SAC. The experimental results show that the proposed algorithm achieves complete obstacle avoidance in the dense stone column obstacle environment while achieving a very high flight success rate, which demonstrates the effectiveness of the proposed S2MAC algorithm.
Table 4 experimental results of different unmanned aerial vehicle navigation decision algorithms
Discussion of State memory Length
In the proposed algorithm, the choice of state memory length affects the weight the agent gives to historical states in its decisions. Therefore, the invention conducts experiments in the simulation environment with different state memory lengths to verify the influence of the memory length on the S2MAC algorithm.
As shown in fig. 13, the convergence value of the final reward curve also changes significantly when the length of the state memory changes. When the state memory length K=5, the convergence value of the reward curve is the smallest: because the unmanned aerial vehicle flies fairly fast, a short state memory adds no useful decision information but introduces redundant information that interferes with decisions, so the reward value drops. When the state memory length K=10, the convergence value of the reward curve is the largest, which shows that with this memory length the unmanned aerial vehicle can effectively extract the needed memory information, construct a spatial state and optimize its decisions; K=10 is also the memory length used by the S2MAC algorithm in the invention. When the state memory length K=20, the convergence value of the reward curve drops again, similarly to K=5: when the state memory is too long, redundant information is introduced along with the useful information and disturbs decision-making.
In order to observe more intuitively the influence of different memory lengths on the autonomous decision-making flight of the unmanned aerial vehicle, the invention visualizes 20 time steps of the unmanned aerial vehicle's image memory, as shown in fig. 14.
As can be seen from fig. 14, the images of the six time steps from fig. 14(a) to fig. 14(f) change little. Comparing fig. 14(a) with fig. 14(j), the memory image of the unmanned aerial vehicle changes greatly: the wall on the left of the field of view disappears and the middle wall occupies most of the field of view. A large change in the memory image also occurs from fig. 14(k) to fig. 14(q), while there is little change from fig. 14(r) to fig. 14(t). The visualization of the image memory shows that setting the state memory window size to around 10 in this flight environment is reasonable: when the memory window is too small, the environment changes too little during flight to yield useful memory information, and when the memory window is too large, too much redundant information is introduced into the decision and affects the decision judgment.
Ablation experiments
The effectiveness of the different modules of the method is demonstrated by an ablation experiment. Four algorithms are tested: the original SAC algorithm (denoted SAC), the SAC algorithm with the target guiding module (denoted GSAC), the SAC algorithm with a state memory of length 10 (denoted MSAC), and the proposed single-scene unmanned aerial vehicle autonomous navigation decision algorithm that improves reinforcement learning with both target guiding and state memory (denoted S2MAC). The experiments are carried out in the stone-column obstacle environment to demonstrate the effectiveness of each module.
As can be seen from fig. 15, the SAC algorithm has the smallest reward convergence value, which illustrates the effectiveness of the proposed modules. The reward curve of the GSAC algorithm converges first, which verifies the effectiveness of the proposed target guiding module: by learning from the guiding target in every round, the algorithm can learn the commonality among multiple random targets more quickly.
Fig. 15 also shows that the reward curve of the MSAC algorithm is the smoothest, which verifies that adding the state memory module enriches the decision information of the intelligent agent: its grasp of global information increases markedly, its decisions become more stable, and its final convergence value is higher than that of the SAC algorithm. The reward curve of the S2MAC algorithm has the largest convergence value but converges the slowest, with a small dip in the middle. This indicates that with both the target guiding module and the state memory module present, exploration is more aggressive, so some poor local values are explored; but thanks to that more aggressive exploration the algorithm is not trapped in local optima and finally converges to the highest reward value.
As can be seen from the experimental results in Table 5, the SAC algorithm has the smallest convergence value, the lowest flight success rate and the highest collision rate. After the proposed modules are added, the performance improves greatly. Adding the target guiding module greatly increases the reward value, reduces the convergence time and raises the success rate considerably, which shows that it indeed accelerates training and that decomposing the global target into a guiding target for each step effectively learns the commonality among different target points. Adding the state memory module raises the convergence value of the algorithm, although the convergence time increases slightly because the module adds computational complexity. Because the MSAC algorithm introduces the proposed state memory module, its decision information is richer, the unmanned aerial vehicle makes better decisions, and its average step count for successful decisions is the smallest. Finally, although the S2MAC algorithm has the longest convergence time, its success rate reaches 100% and its average step count for successful decisions is only slightly larger than the shortest one. Overall, the S2MAC algorithm has the best performance.
Table 5: Ablation experiment results of the proposed algorithm

Claims (7)

1. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning is characterized by comprising the following steps:
step 1: constructing reinforcement learning elements;
carrying out multidimensional construction on the intelligent agent through an S2MAC autonomous navigation decision algorithm;
Step 2: a global navigation target task is given; the environment first gives the navigation target g of the current round and the current observation information o_t, and the navigation target g passes through the navigation target module to obtain the navigation sub-target feature of the current time step;
Step 3: o_t passes through the feature extraction module to obtain an observation feature, and the observation feature is input into the state memory module to obtain a state memory feature;
Step 4: the features are combined to obtain the state feature s_t of the intelligent agent; the intelligent agent takes the state feature s_t as input, outputs a continuous unmanned aerial vehicle flight action value a_t, and returns the action value to the environment to obtain a reward value r_t;
Step 5: the algorithm is updated with the strategy solution based on prioritized experience replay to obtain the optimal strategy.
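The five steps above describe one interaction between the intelligent agent and the environment at each time step. The following Python sketch is only a hypothetical illustration of that per-step loop; the module and environment interfaces (goal_module, feature_extractor, memory_module, policy, env.get_observation, env.step) are placeholder names and not the patented implementation.

```python
import numpy as np

def decision_step(env, goal_module, feature_extractor, memory_module, policy, g):
    """One hypothetical agent-environment interaction (steps 2-4 above)."""
    o_t = env.get_observation()            # depth image + onboard sensor state
    f_g = goal_module(g, o_t)              # navigation sub-target feature of this time step
    f_o = feature_extractor(o_t)           # observation feature
    f_m = memory_module(f_o)               # state memory feature over the last K steps
    s_t = np.concatenate([f_m, f_o, f_g])  # combined state feature of the intelligent agent
    a_t = policy(s_t)                      # continuous flight action (v_xy, v_z, v_yaw)
    r_t, done = env.step(a_t)              # reward value returned by the environment
    return s_t, a_t, r_t, done
```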
2. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 1, wherein step 1 is specifically:
The MDP problem of unmanned aerial vehicle autonomous decision-making is modeled, and the three reinforcement learning elements, namely the state space, the action space and the reward function, are constructed;
Step 1): constructing the state space:
In the unmanned aerial vehicle autonomous navigation decision task, the global environment state is unknown; the only information the unmanned aerial vehicle can use for the navigation decision is the partial information acquired through its front-mounted camera and onboard sensors, and this raw information is called the observation o of the unmanned aerial vehicle;
The observation consists of two parts. The first part is the depth image information acquired by the front-mounted camera of the unmanned aerial vehicle, denoted o_i;
the second part is the state information of the unmanned aerial vehicle itself acquired by its onboard sensors, denoted o_s, including position information, velocity information, acceleration information and deflection angle; describing the self state requires the following information:
Position information o_p: through the position information the unmanned aerial vehicle obtains its distance and bearing to the target position for decision-making; it comprises the current position of the unmanned aerial vehicle and the position of the navigation target point;
Linear velocity information o_v: through the linear velocity information the unmanned aerial vehicle knows its current flight speed, so that the next action is decided according to the current speed;
Deflection angle information o_a: the deflection (yaw) angle of the unmanned aerial vehicle, used together with the quantities above for decision-making; the basic state is obtained by extracting features from the information (o_i, o_p, o_v, o_a): the depth image information o_i collected by the unmanned aerial vehicle camera first passes through one larger convolution layer to extract a feature block, followed by a pooling operation to speed up computation and prevent parameter over-fitting; features are then obtained through two smaller convolution layers with pooling, and finally flattened by a multi-layer perceptron to obtain the image feature; after the image feature is obtained, the state information of the unmanned aerial vehicle is encoded, passed through a fully connected network for feature extraction, and concatenated with the image feature to obtain the basic state feature f_o;
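As a rough illustration of the feature-extraction pipeline just described (one larger convolution layer, pooling, two smaller convolution layers with pooling, a multi-layer perceptron, and concatenation with an encoded self-state feature), a minimal PyTorch sketch is given below; all kernel sizes, channel counts and feature widths are assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn

class BasicStateEncoder(nn.Module):
    """Hypothetical encoder producing the basic state feature f_o."""
    def __init__(self, state_dim=10, feat_dim=128):
        super().__init__()
        self.image_net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # one larger convolution layer
            nn.MaxPool2d(2),                                        # pooling: faster and less over-fitting
            nn.Conv2d(32, 64, kernel_size=3, stride=1), nn.ReLU(),  # two smaller convolution layers
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),                     # MLP flattens to the image feature
        )
        self.state_net = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU())

    def forward(self, depth_image, drone_state):
        f_img = self.image_net(depth_image)          # feature of o_i
        f_state = self.state_net(drone_state)        # encoded feature of (o_p, o_v, o_a)
        return torch.cat([f_img, f_state], dim=-1)   # basic state feature f_o
```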
Step 2): constructing the action space:
The speeds along the x axis and the y axis are bound together as one dimension of the action space, and the speed along the z axis is taken as a separate dimension; yaw is controlled through an angular velocity command; the action space is shown in formula (1):
v = (v_xy, v_z, v_yaw)    (1)
Step 3): constructing the reward function:
Firstly, the attraction of the navigation target point to the unmanned aerial vehicle is considered; define the distance from the unmanned aerial vehicle to the global navigation target point at the initial time of the round as dis_g, the distance to the target point at the previous time step t-1 as dis_{t-1}, and the distance to the target point at the current time step t as dis_t; the reward function r_dis is then as shown in formula (2):
Next, the collision penalty for obstacles is considered: when the unmanned aerial vehicle collides with an obstacle, a collision penalty r_coli is applied to reduce the cumulative reward and thereby drive the unmanned aerial vehicle to avoid obstacles; collisions must be avoided entirely, since in reality a collision causes great harm, so when a collision occurs the current training round is ended; likewise, if the unmanned aerial vehicle flies outside the range of the map, a penalty r_out with the same value as r_coli is given and the round is ended; corresponding to the penalties for collision and for leaving the boundary, a reward r_reach for reaching the global navigation target point is also set; in addition, frequent yaw rotation during flight makes the flight unstable, so a penalty r_yaw on the yaw rate of the output action is defined as shown in formula (3);
Wherein: y is the set maximum yaw rate. Finally, a penalty r_step = c·t on the time term is set, and the reward function for each time step is given by formula (4);
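The exact forms of formulas (2)-(4) are not reproduced in the text above, so the following sketch only shows one plausible way to combine the reward terms named there (r_dis, r_coli, r_out, r_reach, r_yaw, r_step); every coefficient and threshold is an assumption.

```python
def step_reward(dis_g, dis_prev, dis_now, collided, out_of_map, reached,
                vyaw, y_max=1.0, c=0.01, t=1):
    """Assumed per-step reward combining the terms described in the claim."""
    r = (dis_prev - dis_now) / dis_g       # r_dis: progress toward the navigation target
    if collided:
        r += -10.0                         # r_coli: collision penalty (episode ends)
    if out_of_map:
        r += -10.0                         # r_out: same value as r_coli (episode ends)
    if reached:
        r += 10.0                          # r_reach: reward for reaching the global target
    if abs(vyaw) > y_max:
        r += -0.1 * (abs(vyaw) - y_max)    # r_yaw: penalty for exceeding the maximum yaw rate y
    r += -c * t                            # r_step = c*t: time penalty
    return r
```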
3. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 1, wherein step 2 is specifically:
The guiding target module decomposes the global navigation target into every training step, i.e. the navigation target learning of each round is decomposed into the navigation target learning of each time step, and the sample collected by the intelligent agent in one time step changes from (s, a, s', r) to (s, a, s', g, r); Δt is the time step of each training iteration;
In reinforcement learning improved with target guidance, the objective function of the strategy solution is as shown in formula (5):
J(θ) = E_{s~p_π, a~π(a|s,g), g~G}[ r_g(s, a, g) ]    (5)
In this objective function, the guiding target is additionally taken into account, and the overall objective function combined with the guiding target is maximized;
In reinforcement learning improved with target guidance, the conventional Q-value function is rewritten as shown in formula (6):
Q(s, a, g) = E_π[ Σ_{t=1}^{∞} γ^{t-1} r_g(s_t, a_t, g) | s_1 = s, a_1 = a ]    (6)
In a specific implementation, the target g can take various forms; if the starting position and the navigation target position are both given in the form of images, the target g can be a low-dimensional latent feature generated by a variational autoencoder (Variational Autoencoder, VAE); for the unmanned aerial vehicle navigation decision problem, the starting position and the navigation target position are given in the form of coordinates, and the intermediate target g is defined as shown in formula (7):
g_t = (x_t, y_t, z_t, yaw_t)    (7)
Wherein: x_t represents the relative distance along the x axis from the current position to the global navigation target position; y_t represents the relative distance along the y axis from the current position to the global navigation target position; z_t represents the relative distance along the z axis from the current position to the global navigation target position; yaw_t represents the deflection angle information of the current position with respect to the global target position;
The state space is shown in formula (8):
s_t = (f_t^o, f_t^g)    (8)
Wherein: f_t^o is the observation feature obtained by feature extraction and encoding of the observation (o_i, o_p, o_v, o_a); f_t^g is the navigation feature encoded from the intermediate navigation target.
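As a small illustration of the intermediate guiding target g_t of formula (7), the sketch below returns the relative x/y/z offsets from the current position to the global navigation target together with a yaw offset; the exact definition of yaw_t may differ, and the angle-to-target form used here is an assumption.

```python
import math

def intermediate_target(current_pos, current_yaw, goal_pos):
    """Assumed g_t = (x_t, y_t, z_t, yaw_t) relative to the global target."""
    dx = goal_pos[0] - current_pos[0]            # relative distance along the x axis
    dy = goal_pos[1] - current_pos[1]            # relative distance along the y axis
    dz = goal_pos[2] - current_pos[2]            # relative distance along the z axis
    yaw_t = math.atan2(dy, dx) - current_yaw     # deflection angle toward the target
    return (dx, dy, dz, yaw_t)
```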
4. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 1, wherein the step 3 is specifically:
Extracting features of the constructed state memory by using an attention mechanism, and integrating the constructed state memory module into an S2MAC algorithm;
Step 1): state memory construction
At time t, the constructed historical image information with the length of K is shown in a formula (9):
Wherein: o_{t-k}^i represents the image information at time t-k;
at time t, the constructed historical state information of the unmanned aerial vehicle with the length of K is shown in a formula (10):
Wherein: o_{t-k}^s represents the state information of the unmanned aerial vehicle at time t-k;
The definition of the state memory of the intelligent agent with the length of K at the time t is shown in a formula (11):
When a memory unit is calculated, the feature information of the image is extracted with a convolutional network to obtain the image feature f_i^i; at the same time, the state information is embedded and encoded through one fully connected layer to obtain the state feature f_i^s; the image feature f_i^i and the state feature f_i^s are then concatenated to obtain the i-th state memory unit m_i = (f_i^i || f_i^s);
Step 2) state memory extraction:
Assuming the weight of the i-th memory unit is a_i, at time t the weights of the K memory units are (a_{t-k}, a_{t-k+1}, ..., a_{t-1}); the memory feature f_t^m obtained after passing through the attention unit is shown in formula (12):
The weight of the memory cell is calculated using an additive attention method as shown in equation (13):
Wherein: W_v, W_q, W_k are parameters of the network; q is the feature of the current time step;
After the score e_i of the i-th memory unit is obtained, a softmax normalization is performed, and the finally obtained state memory weight a_i is shown in formula (14):
step 3): integrating state memory;
The obtained state memory feature f_t^m is combined with the unmanned aerial vehicle state feature f_t^o and the target guiding feature f_t^g to obtain the state space of the S2MAC algorithm, as shown in formula (15):
f_t^s = (f_t^m || f_t^o || f_t^g)    (15)
At time t, the target guiding feature f_t^g is first generated from the observation o_t; the observation feature f_t^o is then obtained from the observation through fully connected layers; at the same time, the state memory of the last k steps is traced back from the current time t and encoded together with the current state by the attention mechanism to obtain the state memory feature f_t^m; finally, all features are concatenated to obtain the reinforcement learning state feature, and the integrated state is input into the solving algorithm to compute the strategy and the Q value.
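The additive-attention read-out over the K memory units (formulas (12)-(14)) can be sketched as follows; the hidden size and the tanh nonlinearity inside the additive score follow standard additive attention and are assumptions, not details taken from this disclosure. Here memory holds the K memory units m_i and query is the current-step feature q; the returned f_m corresponds to f_t^m.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Hypothetical additive attention over K state memory units."""
    def __init__(self, mem_dim, query_dim, hidden=64):
        super().__init__()
        self.W_k = nn.Linear(mem_dim, hidden, bias=False)
        self.W_q = nn.Linear(query_dim, hidden, bias=False)
        self.W_v = nn.Linear(hidden, 1, bias=False)

    def forward(self, memory, query):
        # memory: (K, mem_dim) memory units m_i; query: (query_dim,) current-step feature q
        e = self.W_v(torch.tanh(self.W_k(memory) + self.W_q(query))).squeeze(-1)  # additive scores e_i
        a = F.softmax(e, dim=0)                       # normalized weights a_i (formula (14))
        f_m = (a.unsqueeze(-1) * memory).sum(dim=0)   # weighted sum: state memory feature f_t^m
        return f_m, a
```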
5. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 1, wherein the step 4 is specifically:
A prioritized experience replay technique is employed to improve the SAC algorithm;
In the reinforcement learning algorithm, the difference between the Q value of the current network and the Q value of the target network is called TD-error, and is commonly represented by delta; the delta is calculated as shown in formula (16):
δ = r + γ·π(·|s′)^T ( Q_π(·|s′) − α·log π(·|s′) ) − Q(s, a)    (16)
Wherein: q represents the Q value of the current network; q π represents the Q value of the target network;
after the TD error is calculated, the priority of the current experience is expressed as shown in formula (17):
p = |δ| + ε    (17)
Wherein: ε is a very small positive number that prevents the priority from being 0 when the TD error is 0;
The sampling probability of an experience sample is shown in formula (18):
Wherein: α is a hyperparameter of priority sampling that controls how strongly the priorities influence sampling.
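Since formula (18) is not reproduced in the text above, the sketch below assumes the standard proportional prioritized-replay form P(i) = p_i^α / Σ_k p_k^α, built on the priority p = |δ| + ε of formula (17).

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Assumed proportional prioritized sampling over the replay pool."""
    p = np.abs(td_errors) + eps               # priority p = |delta| + epsilon (formula (17))
    probs = p ** alpha / np.sum(p ** alpha)   # assumed sampling probability of each transition
    idx = np.random.choice(len(p), size=batch_size, p=probs)
    return idx, probs
```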
6. The unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 5, wherein in S2MAC the loss function is corrected through importance sampling: the loss of each state sequence is multiplied by a correction coefficient ω, which in S2MAC is defined as shown in formula (19):
Wherein: N represents the number of samples in the experience replay pool, and η is the importance factor;
the gradient weight calculation after importance sampling is shown in formula (20):
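Formulas (19)-(20) are likewise not reproduced above, so the sketch below assumes the usual prioritized-replay importance-sampling weight ω_i = (N·P(i))^(−η), normalized by its maximum, which matches the quantities N and η defined in this claim; the per-sample loss is then multiplied by this weight before the gradient step.

```python
import numpy as np

def is_weights(probs, indices, eta=0.4):
    """Assumed importance-sampling correction coefficients omega."""
    N = len(probs)                        # number of samples in the experience replay pool
    w = (N * probs[indices]) ** (-eta)    # omega for each sampled transition
    return w / w.max()                    # normalize so the weights only scale the loss down
```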
7. the unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning according to claim 1, wherein the step 5 is specifically:
In the strategy solving process, samples are collected from the prioritized experience pool, and the importance weight ω of each collected sample is calculated; the sample information is used to calculate the loss of the strategy network and the loss of the Q-value network, the parameters of the corresponding networks are updated with these losses, and the parameters of the current Q-value network are copied to the target Q-value network at regular intervals; the strategy network outputs the mean μ and standard deviation σ of the action distribution, from which the specific flight action of the unmanned aerial vehicle is obtained.
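As a final illustration of how a concrete flight action can be drawn from the mean μ and standard deviation σ output by the strategy network, the sketch below uses a tanh-squashed Gaussian in the spirit of SAC-family solvers; the squashing and the max_action scaling are assumptions, not details taken from this disclosure.

```python
import torch

def sample_action(mu, sigma, max_action=1.0):
    """Assumed sampling of a continuous flight action from the strategy network output."""
    dist = torch.distributions.Normal(mu, sigma)
    raw = dist.rsample()                   # reparameterized sample, keeps gradients flowing
    return max_action * torch.tanh(raw)    # bounded (v_xy, v_z, v_yaw) command
```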
CN202311801153.1A 2023-12-26 2023-12-26 Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning Pending CN118034331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311801153.1A CN118034331A (en) 2023-12-26 2023-12-26 Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311801153.1A CN118034331A (en) 2023-12-26 2023-12-26 Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning

Publications (1)

Publication Number Publication Date
CN118034331A true CN118034331A (en) 2024-05-14

Family

ID=90986682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311801153.1A Pending CN118034331A (en) 2023-12-26 2023-12-26 Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning

Country Status (1)

Country Link
CN (1) CN118034331A (en)

Similar Documents

Publication Publication Date Title
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN113589842B (en) Unmanned cluster task cooperation method based on multi-agent reinforcement learning
CN112230678B (en) Three-dimensional unmanned aerial vehicle path planning method and system based on particle swarm optimization
CN113033119B (en) Underwater vehicle target area floating control method based on double-critic reinforcement learning technology
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN112286218A (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN114741886A (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN116242364A (en) Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN113033118A (en) Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN116698037B (en) Unmanned aerial vehicle track planning method
CN115826621B (en) Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN118034331A (en) Unmanned aerial vehicle autonomous navigation decision-making method based on state memory reinforcement learning
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
Zhang et al. A state-decomposition DDPG algorithm for UAV autonomous navigation in 3D complex environments
CN114756017A (en) Navigation obstacle avoidance method combining unmanned aerial vehicle and unmanned ship
d’Apolito et al. Flight control of a multicopter using reinforcement learning
Imam et al. Autonomous driving system using proximal policy optimization in deep reinforcement learning
CN115185288B (en) Unmanned aerial vehicle layered flight decision method based on SAC algorithm
CN116804879B (en) Robot path planning framework method for improving dung beetle algorithm and fusing DWA algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination