CN114690623B - Intelligent agent efficient global exploration method and system for rapid convergence of value function

Info

Publication number
CN114690623B
CN114690623B (application CN202210421995.3A)
Authority
CN
China
Prior art keywords
network
unmanned aerial
aerial vehicle
reward
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210421995.3A
Other languages
Chinese (zh)
Other versions
CN114690623A (en)
Inventor
林旺群
李妍
徐菁
王伟
田成平
刘波
王锐华
孙鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences filed Critical Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202210421995.3A priority Critical patent/CN114690623B/en
Publication of CN114690623A publication Critical patent/CN114690623A/en
Application granted granted Critical
Publication of CN114690623B publication Critical patent/CN114690623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/024Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An agent efficient global exploration method and system for rapid convergence of a value function are disclosed. The method gives the unmanned aerial vehicle a clearer training target through an extended reward formed by combining a corrected local reward with a global reward, and adopts a universal value function approximator so that the unmanned aerial vehicle keeps exploring the environment throughout training, the initial local reward being modulated by an exploration factor to capture global correlation. As a result, training of the unmanned aerial vehicle agent is efficient and the agent can ultimately learn the optimal combat strategy. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring during the whole training process; because the corrected local reward continuously regulates the extended reward, the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and can learn the optimal strategy. Observation-visit information is correlated across different episodes, so the unmanned aerial vehicle visits more previously unvisited observations both within each episode and over the whole training process.

Description

Intelligent agent efficient global exploration method and system for rapid convergence of value function
Technical Field
The invention relates to the field of virtual-simulation intelligent confrontation, and in particular to an agent efficient global exploration method and system with a fast-converging value function, which improves the learning performance of an unmanned aerial vehicle while it defeats enemy unmanned aerial vehicles and evades their attacks.
Background
In recent years, with the increasing demand for unmanned and intelligent unmanned aerial vehicles, artificial intelligence technology has developed rapidly. Unmanned aerial vehicles have attracted wide attention in both military and civil fields, and intelligent confrontation oriented to virtual simulation has become a hot research topic.
Because traditional agent training methods suffer from sparse reward settings, the unmanned aerial vehicle often explores blindly during training; once it finds a suboptimal strategy it is likely to stop exploring and switch to exploitation, which makes it difficult to learn the optimal strategy. The limitation of this approach is that the drone must accumulate a large amount of experience through repeated blind exploration, which is inefficient and may never yield the best strategy.
Building on traditional methods, some researchers have proposed integrating a corrected local reward. This technique keeps the unmanned aerial vehicle exploring purposefully throughout a combat scenario and lets it learn an optimal strategy to some extent, but it has a limitation: the corrected local reward is not further regulated, the corrected local reward of each episode is related only to that episode, and there is no correlation across all episodes of the whole training process, so agent training is too inefficient.
Therefore, how to overcome the shortcomings of the prior art, make the corrected local reward and the global reward cooperate with each other, keep the agent continuously exploring in the combat scenario, and prevent the unmanned aerial vehicle agent from performing meaningless learning has become an urgent problem to solve.
Disclosure of Invention
The invention aims to provide an agent efficient global exploration method and system with a fast-converging value function. An extended reward formed by combining a corrected local reward with a global reward gives the unmanned aerial vehicle a clearer training target, and a universal value function approximator ensures that the unmanned aerial vehicle keeps exploring the environment throughout training, the initial local reward being modulated by an exploration factor to capture global correlation. As a result, training of the unmanned aerial vehicle agent is efficient and the agent can ultimately learn the optimal combat strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat preparation information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected local reward network, which comprises a local access frequency module and a global access frequency module, further comprising the following sub-steps:
local access frequency module construction substep S121:
the local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M and k-nearest neighbors; the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state $f(x_t)$ of the unmanned aerial vehicle agent, the controllable state $f(x_t)$ is stored in the episodic memory M, and the initial local reward $r_t^{episodic}$ harvested by the own unmanned aerial vehicle at this moment is calculated by a k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information $x_t$ of the unmanned aerial vehicle at time t is input and the exploration factor $\alpha_t$ is calculated, which modulates the initial local reward $r_t^{episodic}$ to obtain the corrected local reward $\hat{r}_t^{episodic}$; the corrected local reward makes the reward signal of the whole network dense, and with a dense reward the unmanned aerial vehicle is regulated better, so the value function in the deep Q learning network converges faster and the unmanned aerial vehicle performs better; finally the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$ harvested by the own unmanned aerial vehicle at this moment;
Embedded network training substep S123:
two fully connected layers and one softmax layer are attached after the embedded network to output the probability of each action taken when transitioning from time t to time t+1, and these probabilities form one vector; meanwhile the action output by the current Q network at time t in the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained;
random net distillation training substep S124:
the output values of the target network and the prediction network of the random network distillation are used to compute the mean square error $err(x_t)$, which is back-propagated to update the parameters of the prediction network while the parameters of the target network are kept unchanged, until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
constructing a deep Q learning network as the unmanned aerial vehicle network, comprising a current Q network and a target $\hat{Q}$ network with the same structure; the observation information $x_t$ is input, the action selected by the unmanned aerial vehicle under the observation at each moment is obtained through the current Q network of the deep Q network, the action is executed and the environment is interacted with to obtain a transition tuple $(x_t, a_t, r_t, x_{t+1})$, which is stored in a replay buffer; the transition tuples in the replay buffer are used to obtain the target Q value and compute the loss against the output of the current Q network, the current Q network is trained according to this loss and its parameter θ is updated, and after every several episodes the parameter $\theta^-$ of the target $\hat{Q}$ network is updated;
Repeating the training and exiting step S140:
repeating steps S120-S130 with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation and the deep Q learning network until the episodes are finished, and selecting the network structure that yields the maximum reward to guide the flight of the unmanned aerial vehicle.
Optionally, the step S110 specifically includes:
setting the spatial combat range of the unmanned aerial vehicles, with the movement ranges of the own and enemy unmanned aerial vehicles lying inside it; the observation information of the own unmanned aerial vehicle is set as the own position $(x_0, y_0, z_0)$, the deflection angle $\varphi_0$ relative to the horizontal xoy plane, the flip angle $\omega_0$ relative to the motion plane ($\omega_0 < 90°$), and the enemy position $(x_1, y_1, z_1)$, deflection angle $\varphi_1$ relative to the horizontal plane and flip angle $\omega_1$ relative to the motion plane ($\omega_1 < 90°$),
so the observation information $x_t$ of the own unmanned aerial vehicle is:
$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1);$$
assuming that the legal actions of the unmanned aerial vehicle of our party are set to be eastward, southward, westward and northward;
the global reward is set as follows: the own unmanned aerial vehicle receives a global reward according to whether it destroys the enemy unmanned aerial vehicle or evades the enemy unmanned aerial vehicle's attack; if the own unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1, if it evades the enemy unmanned aerial vehicle's attack the global reward is set to 0, and otherwise it is set to -1, the global reward being denoted:
$$r_t^{global} = \begin{cases} 1, & \text{enemy UAV destroyed} \\ 0, & \text{enemy attack evaded} \\ -1, & \text{otherwise;} \end{cases}$$
Optionally, in step S121, the embedded network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it extracts from the input observation information $x_t$ the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted $f(x_t)$, and the p-dimensional vector is then stored in the episodic memory M; at time t the episodic memory M stores the controllable states from time 0 to time t, expressed as $\{f(x_0), f(x_1), \ldots, f(x_t)\}$; the state-action visit count is converted into a reward, and the initial local reward $r_t^{episodic}$ is defined as:
$$r_t^{episodic} = \frac{1}{\sqrt{n(f(x_t))}},$$
wherein $n(f(x_t))$ represents the number of times the controllable state $f(x_t)$ has been visited;
an inverse kernel function $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is used to approximate the number of times the observation information has been visited at time t, where $\mathbb{R}$ denotes the real number field and the superscript p denotes the dimension; the pseudo-count $n(f(x_t))$ is approximated using the k controllable states in the episodic memory M adjacent to $f(x_t)$, denoted $N_k = \{f_i\}_{i=1}^{k}$, the k adjacent controllable states extracted from the episodic memory M, so that the initial local reward $r_t^{episodic}$ is specifically:
$$r_t^{episodic} \approx \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(x_t), f_i)} + c},$$
wherein $f_i \in N_k$ means that the controllable states are taken from $N_k$ in turn to calculate the number of times the observation information is visited at time t;
the inverse kernel function is expressed as:
$$K(x, y) = \frac{\epsilon}{\frac{d^2(x, y)}{d_m^2} + \epsilon},$$
where $\epsilon$ is taken as 0.001, d is the Euclidean distance, $d_m^2$ is a running average of the squared distances, and c is taken as 0.001.
Optionally, in step S122, the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the random network distillation, and the output error $err(x_t)$ of its two networks defines the exploration factor $\alpha_t$:
$$\alpha_t = 1 + \frac{err(x_t) - \mu_e}{\sigma_e},$$
wherein $\sigma_e$ and $\mu_e$ are the running standard deviation and running mean of $err(x_t)$, and $\alpha_t$ is the multiplicative factor applied to the initial local reward $r_t^{episodic}$; the corrected local reward $\hat{r}_t^{episodic}$ is expressed as:
$$\hat{r}_t^{episodic} = r_t^{episodic} \cdot \min\{\max\{\alpha_t, 1\}, L\},$$
wherein $\alpha_t$ is limited to values between 1 and L, L is a hyperparameter with a maximum of 5, and the minimum of $\alpha_t$ is set to 1;
finally, the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$:
$$r_t = r_t^{global} + \beta \, \hat{r}_t^{episodic},$$
where $r_t^{global}$ and $\hat{r}_t^{episodic}$ denote the global reward and the corrected local reward respectively, and β is a positive scalar ranging from 0 to 1.
Optionally, step S123 specifically comprises:
two successive observations $x_t$ and $x_{t+1}$ are respectively input into the embedded network f to extract the controllable states $f(x_t)$ and $f(x_{t+1})$; two fully connected layers and one softmax layer then output the probability of each action that transfers observation $x_t$ to observation $x_{t+1}$. The action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:
$$p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_{m-1} \mid x_t, x_{t+1}), p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}(h(f(x_t), f(x_{t+1}))),$$
wherein $p(a_1 \mid x_t, x_{t+1})$ represents the probability of taking action $a_1$ when transferring from observation $x_t$ to observation $x_{t+1}$, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the output action of the current Q network in the deep Q learning network is one-hot encoded to obtain a vector A, and the mean square error E of the P vector and the A vector is calculated as:
$$E = \frac{1}{m} \sum_{i=1}^{m} (P_i - A_i)^2,$$
where m is the number of actions that can be taken, m = 4. Finally the parameters of the embedded network f are updated by back-propagating the result E, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Optionally, step S124 specifically comprises:
the observation information $x_t$ at time t is input into the random network distillation, and the error output by the target network and the prediction network,
$$err(x_t) = \lVert \hat{g}(x_t) - g(x_t) \rVert^2,$$
is used to train the prediction network; the parameters of the prediction network $\hat{g}$ are updated by back propagation, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Optionally, step S130 specifically comprises:
constructing a deep Q learning network as the unmanned aerial vehicle network, extending the action value function by adding the parameter β, which adjusts the weight of the corrected local reward; the observation information $x_t$ is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network, and the action $a_t$ corresponding to the maximum Q value is selected from these Q values by an ε-greedy method, or an action $a_t$ is selected at random; ε takes the value 0.9, i.e. the action with the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. When the observation is $x_t$, the current action $a_t$ is executed to obtain the new observation $x_{t+1}$ and the global reward $r_t^{global}$; the global reward $r_t^{global}$ and the corrected local reward $\hat{r}_t^{episodic}$ are weighted and summed to obtain the extended reward $r_t$, and the transition tuple $(x_t, a_t, r_t, x_{t+1})$ is stored in the replay buffer; subsequently w transition tuples $(x_j, a_j, r_j, x_{j+1})$, j = 1, 2, ..., w, are sampled from the replay buffer with batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled and trained at a time, and the current target Q value $y_j$ is calculated as:
$$y_j = \begin{cases} r_j, & \text{if the episode ends at time } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-), & \text{otherwise,} \end{cases}$$
wherein $r_j$ is the extended reward harvested by the unmanned aerial vehicle at time j, $\max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-)$ represents the maximum of the Q values that the target $\hat{Q}$ network outputs for all actions given the observation $x_{j+1}$ at time j+1, and γ represents a discount factor with a value between 0 and 1;
if t = j + 1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value $y_j$ is the output of the target $\hat{Q}$ network multiplied by the discount factor and added to the extended reward; the mean square error between the target Q value $y_j$ and the output of the current Q network,
$$Loss = \frac{1}{w} \sum_{j=1}^{w} \left( y_j - Q(x_j, a_j, \beta, \theta) \right)^2,$$
is computed and the parameter θ of the current Q network is updated by gradient descent, where w represents the number of sampled transition tuples; the parameter $\theta^-$ of the target $\hat{Q}$ network is updated every 10 episodes.
The invention further discloses an intelligent agent efficient global exploration system with fast convergence of a value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer executable instructions, and when the computer executable instructions are executed by a processor, the intelligent agent efficient global exploration method for fast convergence of the value function is executed.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring throughout the whole training process: it is encouraged to visit observations it has not yet visited, and such visits yield a high reward, so that the observations obtained from interaction with the environment are visited during training and the unmanned aerial vehicle clearly knows which observations yield a higher reward. At the same time, the corrected local reward continuously regulates the extended reward, so the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and can learn the optimal strategy.
2. The number of visits to each observation over the whole training process is recorded by the random network distillation (RND) in the global access frequency module, and the visit statistics of different episodes are thereby correlated, so the unmanned aerial vehicle visits more previously unvisited observations both within each episode and over the whole training process. For example: if the initial local reward obtained from the local access frequency module is small, the observation has been visited many times within the current episode; if the exploration factor obtained from the global access frequency module is large, the observation has been visited only a few times by the unmanned aerial vehicle over the whole training process, so the corrected local reward obtained by modulating the initial local reward with the exploration factor is not small, indicating that the observation has rarely been visited in other episodes. With a large number of training iterations the unmanned aerial vehicle clearly knows which observations yield the maximum reward, and the resulting strategy is optimal.
3. The traditional action value function is improved: in addition to the original observation, action and network parameters, a weight parameter β for the corrected local reward is added to adjust the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. The balance between exploration and exploitation can be adjusted by setting different values of β; after a good strategy and good parameters have been obtained through exploration, β is set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and an even better strategy is obtained. In this way the corrected local reward improves the parameters learned by the unmanned aerial vehicle, and finally, by modulating the parameter β, training is regulated only by the global reward.
Drawings
FIG. 1 is a flow diagram of an agent efficient global exploration method for rapid convergence of a value function, in accordance with a specific embodiment of the present invention;
FIG. 2 is a flow chart of the steps of UAV corrective local reward network construction and training for a smart agent efficient global exploration method with fast convergence of value functions in accordance with an embodiment of the present invention;
FIG. 3 is an architecture diagram for correcting a local reward according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of an embedded network in accordance with a specific embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process according to an embodiment of the present invention.
Detailed Description
The following describes the terms related to the present invention:
1. deep Q learning network
Deep Q learning is a representative value-function-based method of deep reinforcement learning. It contains two neural networks, called the current Q network and the target $\hat{Q}$ network; the two networks have the same structure. In traditional deep Q learning the two networks are $Q(x_j, a_j, \theta)$ and $\hat{Q}(x_{j+1}, a_{j+1}, \theta^-)$ respectively. The invention controls the proportion of the corrected local reward in the extended reward through the parameter β and introduces β into the action value function, so $Q(x_j, a_j, \beta, \theta)$ and $\hat{Q}(x_{j+1}, a_{j+1}, \beta, \theta^-)$ respectively denote the outputs of the current Q network and the target $\hat{Q}$ network. The input of the current Q network is the observation at the current time t; the input of the target $\hat{Q}$ network is the observation at the next moment, i.e. time t+1; the output is the state-action value of every action. In the invention the current Q network of the unmanned aerial vehicle network is the network that needs to be learned and is used to control the unmanned aerial vehicle agent; the parameters of the target $\hat{Q}$ network are copied directly from the current Q network after a fixed number of episodes, and the parameter θ of the current Q network is updated by gradient descent, training by minimizing the loss function Loss:
$$y_j = r_j + \gamma \max_{a_{j+1}} \hat{Q}(x_{j+1}, a_{j+1}, \beta, \theta^-),$$
$$Loss = (y_j - Q(x_j, a_j, \beta, \theta))^2.$$
2. Episode
An episode is a sequence formed by the observations, actions and extended rewards generated while the unmanned aerial vehicle interacts with the environment, represented as a set of transition tuples formed from this experience. In the invention an episode refers to the whole process of an unmanned aerial vehicle combat from beginning to end.
3. Transition tuple
A transition tuple is the basic unit that makes up an episode: each time the unmanned aerial vehicle agent interacts with the environment it generates the observation $x_t$, the action (instruction) $a_t$, the extended reward $r_t$ and the next observation $x_{t+1}$; this quadruple $(x_t, a_t, r_t, x_{t+1})$ is called a transition tuple and is placed into the replay buffer.
4. Replay buffer
The replay buffer is a buffer area in memory or on disk used to store the sequence of transition tuples. The stored transition tuples can be used repeatedly for training the deep Q learning network. In the invention the replay buffer stores the transition tuples obtained from the interaction between the unmanned aerial vehicle and the environment; its maximum capacity is N and its structure is similar to a queue, so when the number of transition tuples exceeds N the tuples stored earliest in the replay buffer are deleted first.
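The replay-buffer behavior described above can be captured in a few lines; the following is a minimal sketch in Python, assuming a FIFO buffer of maximum capacity N, with illustrative names (`ReplayBuffer`, `push`, `sample`) that are not taken from the patent:

```python
# A minimal sketch of the replay buffer: a bounded FIFO store of transition
# tuples (x_t, a_t, r_t, x_{t+1}) that can be sampled repeatedly for training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        # deque with maxlen drops the oldest transition tuple automatically
        # once more than `capacity` tuples have been stored.
        self.buffer = deque(maxlen=capacity)

    def push(self, x_t, a_t, r_t, x_next):
        # store one transition tuple (x_t, a_t, r_t, x_{t+1})
        self.buffer.append((x_t, a_t, r_t, x_next))

    def sample(self, w: int):
        # draw w transition tuples for a training step
        return random.sample(self.buffer, min(w, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```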
5. k-nearest neighbor
Given a sample, the k training samples that are closest to it in the training set are found based on some distance metric (e.g., euclidean distance), and then prediction is performed based on the information of the k "neighbors". In the invention, the access times of certain observation information obtained by the unmanned aerial vehicle in a plot are approximately calculated by utilizing a k-neighbor idea so as to obtain the initial local reward of the unmanned aerial vehicle for the observation information. If the number of accesses of the observation information is larger, the initial local reward is smaller, and conversely, if the number of accesses of the observation information is smaller, the initial local reward is larger.
6. Random Network Distillation (RND)
Random network distillation randomly initializes two networks: the parameters of one, called the target network, are fixed, while the other, called the prediction network, is trained. In the invention the input of the RND network is the observation $x_t$ obtained after the unmanned aerial vehicle interacts with the environment; training brings the output of the prediction network close to that of the target network. The smaller the output error of the two networks, the more often the observation $x_t$ has been visited by the unmanned aerial vehicle since the start of training, which means a smaller exploration factor and therefore a smaller contribution to the corrected local reward, i.e. a smaller corrected local reward.
7. General value function approximator (UVFA)
Generally, different tasks require different action value functions, and different optimal value functions are needed to quantify the schemes for completing different tasks. In the invention the corrected local reward is weighted to represent different tasks, i.e. tasks with different degrees of exploration. Therefore the invention extends the action value function in deep Q learning from the original $Q(x_t, a_t, \theta)$ to $Q(x_t, a_t, \beta, \theta)$, where the parameter β is the weight of the corrected local reward; with different values of β the corrected local reward plays different roles, and different action value functions can be mixed together through β.
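As an illustration of the extended action value function $Q(x_t, a_t, \beta, \theta)$, the sketch below conditions a small Q network on the scalar β by concatenating it to the observation; it assumes PyTorch, and the layer sizes and class name are illustrative assumptions rather than the patent's implementation:

```python
# A minimal sketch of a beta-conditioned Q network: one network represents
# value functions for different exploration weights beta.
import torch
import torch.nn as nn

class BetaConditionedQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        # x: (batch, obs_dim), beta: (batch, 1) -> Q values for every action
        return self.net(torch.cat([x, beta], dim=-1))
```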
8. Kernel function and inverse kernel function
A kernel function computes an inner product in a high-dimensional space by means of an operation on inner products in the original feature space, so the original low-dimensional space does not need to be explicitly expanded into points of the high-dimensional space, which reduces computational complexity. The inverse kernel function works the other way round: its original feature space is the high-dimensional space, which is reduced to a low-dimensional space.
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to FIG. 1, a flow chart of a method for intelligent agent efficient global exploration with fast convergence of the value function according to an embodiment of the present invention is shown.
Unmanned aerial vehicle combat readiness information setting step S110:
and setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements.
Specifically, in the present step,
setting a space fight range of the unmanned aerial vehicle, wherein the fight range is a three-dimensional space, the moving ranges of unmanned aerial vehicles of my party and enemy are in the space fight range of the unmanned aerial vehicle, for example, the ranges of two horizontal coordinates are [ -1000m,1000m ], the range of a vertical coordinate is not restricted, and the freedom of the upper and lower moving ranges is ensured.
The observation information of the own unmanned aerial vehicle is set as the own position $(x_0, y_0, z_0)$, the deflection angle $\varphi_0$ relative to the horizontal xoy plane, the flip angle $\omega_0$ relative to the motion plane ($\omega_0 < 90°$), and the enemy position $(x_1, y_1, z_1)$, deflection angle $\varphi_1$ relative to the horizontal plane and flip angle $\omega_1$ relative to the motion plane ($\omega_1 < 90°$), so the observation information $x_t$ of the own unmanned aerial vehicle is:
$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1).$$
suppose the legal actions of my drone are set to east, south, west and north.
Global reward setting: the own unmanned aerial vehicle receives a global reward according to whether it destroys the enemy unmanned aerial vehicle or evades the enemy's attack. If the own unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1; if it evades the attack of the enemy unmanned aerial vehicle the global reward is set to 0; otherwise it is set to -1, i.e. when the unmanned aerial vehicle neither destroys the enemy unmanned aerial vehicle nor evades its attack, the more actions it takes, the more negative global reward it accumulates. The global reward is denoted:
$$r_t^{global} = \begin{cases} 1, & \text{enemy UAV destroyed} \\ 0, & \text{enemy attack evaded} \\ -1, & \text{otherwise.} \end{cases}$$
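A minimal sketch of this setup is given below, assuming the 10-dimensional observation and the {1, 0, -1} global reward described above; the function names and boolean flags are illustrative assumptions:

```python
# A minimal sketch of the combat-readiness setup of step S110: the observation
# x_t and the sparse global reward.
import numpy as np

ACTIONS = ["east", "south", "west", "north"]  # legal actions of the own UAV

def make_observation(own_pos, own_phi, own_omega, enemy_pos, enemy_phi, enemy_omega):
    # x_t = (x0, y0, z0, phi0, omega0, x1, y1, z1, phi1, omega1)
    return np.array([*own_pos, own_phi, own_omega, *enemy_pos, enemy_phi, enemy_omega],
                    dtype=np.float32)

def global_reward(enemy_destroyed: bool, attack_evaded: bool) -> float:
    # 1 if the enemy UAV is destroyed, 0 if its attack is evaded, -1 otherwise
    if enemy_destroyed:
        return 1.0
    if attack_evaded:
        return 0.0
    return -1.0
```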
unmanned aerial vehicle correction local reward network construction and training step S120:
referring to fig. 2, a flow chart of the steps of drone remediation local reward network construction and training is shown.
Referring to fig. 3, the unmanned aerial vehicle rectification local reward network is constructed and comprises a local access frequency module and a global access frequency module.
Local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M and k-nearest neighbors. The observation information $x_t$ of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state $f(x_t)$ (i.e. the controllable information) of the unmanned aerial vehicle agent, the controllable state $f(x_t)$ is stored in the episodic memory M, and the initial local reward $r_t^{episodic}$ harvested by the own unmanned aerial vehicle at this moment is calculated by a k-nearest-neighbor algorithm.
Specifically, in step S121, the embedded network f is a convolutional neural network, see fig. 4, comprising three convolutional layers and one fully connected layer; it extracts from the input observation information $x_t$ the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted $f(x_t)$, and the p-dimensional vector is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as $\{f(x_0), f(x_1), \ldots, f(x_t)\}$. Following exploration methods that convert state-action visit counts into rewards, the initial local reward $r_t^{episodic}$ is defined as:
$$r_t^{episodic} = \frac{1}{\sqrt{n(f(x_t))}},$$
wherein $n(f(x_t))$ denotes the number of times the controllable state $f(x_t)$ has been visited, i.e. the more controllable states in the episode are similar to the observation $x_t$ (the more it has been visited), the smaller the initial local reward, and vice versa.
Since the state space is continuous it is difficult to decide whether two controllable states are identical, so an inverse kernel function (equivalent to mapping a high-dimensional space to a low-dimensional space) $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is used to approximate the number of times the observation has been visited at time t, where $\mathbb{R}$ denotes the real number field and the superscript p denotes the dimension, i.e. $\mathbb{R}^p$ is the set of p-dimensional vectors over the real numbers (in particular, p = 1 gives the real numbers). Further, the pseudo-count $n(f(x_t))$ is approximated using the k controllable states in the episodic memory M that are adjacent to $f(x_t)$; with $N_k = \{f_i\}_{i=1}^{k}$ denoting the k adjacent controllable states extracted from the episodic memory M, the initial local reward $r_t^{episodic}$ is specifically:
$$r_t^{episodic} \approx \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(x_t), f_i)} + c},$$
wherein $f_i \in N_k$ means that the controllable states are taken from $N_k$ in turn to calculate the number of times the observation has been visited at time t;
the inverse kernel function is expressed as:
$$K(x, y) = \frac{\epsilon}{\frac{d^2(x, y)}{d_m^2} + \epsilon},$$
where $\epsilon$ is a very small constant (usually 0.001), d is the Euclidean distance, $d_m^2$ is a running average of the squared distances, and the constant c is a very small value (usually 0.001). The running average makes the inverse kernel more robust to the task being solved.
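The following sketch illustrates how the initial local reward could be computed from the episodic memory with the k-nearest-neighbor approximation and the inverse kernel above; the embedding $f(x_t)$ is assumed to be available as a NumPy vector, the value of k is illustrative, and the running average of squared distances is assumed to be maintained by the caller:

```python
# A minimal sketch of the local access frequency module: the initial local
# reward r_t^episodic from the k nearest controllable states in memory M.
import numpy as np

EPS = 0.001        # epsilon in the inverse kernel
C = 0.001          # pseudo-count constant c
K_NEIGHBORS = 10   # k (illustrative value, not fixed by the text)

def episodic_reward(f_xt: np.ndarray, memory: list, running_mean_d2: float) -> float:
    if not memory:
        return 1.0 / np.sqrt(C)
    # squared Euclidean distances to everything stored this episode
    d2 = np.array([np.sum((f_xt - f_i) ** 2) for f_i in memory])
    # k nearest controllable states N_k
    d2_knn = np.sort(d2)[:K_NEIGHBORS]
    # inverse kernel K(f(x_t), f_i) = eps / (d^2 / d_m^2 + eps)
    kernel = EPS / (d2_knn / max(running_mean_d2, 1e-8) + EPS)
    # pseudo-count -> initial local reward
    return 1.0 / (np.sqrt(kernel.sum()) + C)
```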
This sub-step further explains the controllable state. The embedded network is a mapping $f: x_t \mapsto f(x_t) \in \mathbb{R}^p$, i.e. the controllable state of the agent is extracted from the current observation and mapped to a p-dimensional vector. The environment may contain changes that are independent of the agent's behaviour, referred to as uncontrollable states; these are useless for the reward calculation and may even affect the accuracy of the initial local reward, so the states that are independent of the unmanned aerial vehicle's behaviour should be eliminated, leaving only the controllable state of the unmanned aerial vehicle. Therefore, to avoid meaningless exploration, given two successive observations the embedded network f predicts the action taken by the unmanned aerial vehicle to move from one observation to the next, and the accuracy of the controllable state extracted by f is judged from this prediction. For example: the position of the enemy unmanned aerial vehicle is a controllable state that the unmanned aerial vehicle needs to extract, whereas the number and positions of birds in the air need not be observed, so the bird information can be removed by the embedded network f.
Global access frequency module construction sub-step S122:
A global access frequency module is constructed by random network distillation (RND); the observation information $x_t$ of the unmanned aerial vehicle at time t is input and the exploration factor $\alpha_t$ is calculated, which modulates the initial local reward $r_t^{episodic}$ to obtain the corrected local reward $\hat{r}_t^{episodic}$. The corrected local reward makes the reward signal of the whole network dense; with a dense reward the unmanned aerial vehicle is regulated better, so the value function in the deep Q learning network converges faster and the unmanned aerial vehicle performs better. Finally the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$ harvested by the own unmanned aerial vehicle at this moment.
Specifically, in step S122, the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the random network distillation (RND), and the output error $err(x_t)$ of its two networks defines the exploration factor $\alpha_t$:
$$\alpha_t = 1 + \frac{err(x_t) - \mu_e}{\sigma_e},$$
wherein $\sigma_e$ and $\mu_e$ are the running standard deviation and running mean of $err(x_t)$, and $\alpha_t$ is the multiplicative factor applied to the initial local reward $r_t^{episodic}$; the corrected local reward $\hat{r}_t^{episodic}$ is expressed as:
$$\hat{r}_t^{episodic} = r_t^{episodic} \cdot \min\{\max\{\alpha_t, 1\}, L\},$$
wherein $\alpha_t$ is limited to values between 1 and L, L being a hyperparameter of at most 5; the minimum of $\alpha_t$ is set to 1 to avoid the situation in which an observation visited globally too many times yields a small modulation factor and hence a corrected local reward of 0.
As a modulation factor, $\alpha_t$ vanishes over time, so the initial local reward $r_t^{episodic}$ fades to a non-modulated reward.
Finally, the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$, defined as:
$$r_t = r_t^{global} + \beta \, \hat{r}_t^{episodic},$$
where $r_t^{global}$ and $\hat{r}_t^{episodic}$ denote the global reward and the corrected local reward respectively, and β is a positive scalar ranging from 0 to 1 that balances the effect of the corrected local reward.
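Putting the exploration factor, the clipping to [1, L] and the weighted sum together, a minimal sketch (assuming the running mean and standard deviation of $err(x_t)$ are tracked elsewhere, and with illustrative names) could look like this:

```python
# A minimal sketch of the global access frequency module output: the RND error
# err(x_t) is normalized, clipped to [1, L] and multiplies the initial local
# reward; the extended reward adds the global reward with weight beta.
import numpy as np

L_MAX = 5.0  # hyperparameter L

def corrected_and_extended_reward(err_t: float, mu_e: float, sigma_e: float,
                                  r_episodic: float, r_global: float,
                                  beta: float) -> float:
    alpha_t = 1.0 + (err_t - mu_e) / max(sigma_e, 1e-8)
    alpha_t = float(np.clip(alpha_t, 1.0, L_MAX))   # clip to [1, L]
    r_corrected = r_episodic * alpha_t              # corrected local reward
    return r_global + beta * r_corrected            # extended reward r_t
```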
In step S120, the local access frequency module corresponds to how often the unmanned aerial vehicle visits a given state within one episode and yields the initial local reward $r_t^{episodic}$; the two are negatively correlated, so if the local access frequency of an observation is very high the corresponding initial local reward is very small. The global access frequency module corresponds to how often the unmanned aerial vehicle visits a given state over the whole training process (i.e. over many episodes) and yields the exploration factor $\alpha_t$; these are also negatively correlated, so if the global access frequency of an observation is high the corresponding exploration factor is small.
After the local access frequency module and the global access frequency module have been constructed as in sub-steps S121 and S122, the invention trains the corresponding networks of the two modules.
Embedded network training substep S123:
Two fully connected layers and one softmax layer are attached after the embedded network to output the probability of each action taken when transitioning from time t to time t+1, and these probabilities form one vector; meanwhile the action output by the current Q network at time t in the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Further, the training of the embedded network begins after the second observation is obtained, and it lags behind the random network distillation (RND) and the deep Q learning network because the embedded network must predict, from two successive observations, the action taken in the transition between them.
In a preferred embodiment, the training of this sub-step may specifically be,
Two successive observations $x_t$ and $x_{t+1}$ are respectively input into the embedded network f to extract the controllable states $f(x_t)$ and $f(x_{t+1})$; two fully connected layers and one softmax layer then output the probability of every action that transfers observation $x_t$ to observation $x_{t+1}$. For example, in the invention the action probabilities output through the embedded network correspond to the four actions east, south, west and north and add up to 1, specifically expressed as $p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_{m-1} \mid x_t, x_{t+1}), p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}(h(f(x_t), f(x_{t+1})))$, in which $p(a_1 \mid x_t, x_{t+1})$ represents the probability of taking action $a_1$ when transferring from observation $x_t$ to observation $x_{t+1}$, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the output action of the current Q network in the deep Q learning network is one-hot encoded to obtain a vector A, and the mean square error E of the P vector and the A vector is calculated, specifically:
$$E = \frac{1}{m} \sum_{i=1}^{m} (P_i - A_i)^2,$$
where m is the number of actions that can be taken, m = 4. Finally the parameters of the embedded network f are updated by back-propagating the result E, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
It should be noted that the embedded network f itself does not include the fully connected layers and the softmax layer; they are only used during training of the embedded network to output the probability of each action. If a given output action probability is large, the embedded network f considers that the unmanned aerial vehicle most likely took that action, causing the observation to transfer from $x_t$ to $x_{t+1}$.
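A minimal sketch of one training step of this sub-step is shown below, assuming PyTorch; the embedded network f and the action-prediction head h are assumed to be `nn.Module` instances registered with the optimizer, and all names are illustrative:

```python
# A minimal sketch of sub-step S123: predicted action probabilities P from
# (f(x_t), f(x_{t+1})) are matched against the one-hot vector A of the action
# output by the current Q network, using the mean square error E.
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_train_step(f: nn.Module, h: nn.Module, optimizer: torch.optim.Optimizer,
                         x_t: torch.Tensor, x_next: torch.Tensor,
                         action_taken: torch.Tensor, n_actions: int = 4) -> float:
    # predicted probabilities P over the m = 4 actions
    p = F.softmax(h(torch.cat([f(x_t), f(x_next)], dim=-1)), dim=-1)
    # one-hot vector A of the action output by the current Q network
    a = F.one_hot(action_taken, num_classes=n_actions).float()
    loss = F.mse_loss(p, a)          # mean square error E
    optimizer.zero_grad()
    loss.backward()                  # back-propagate to update f and h
    optimizer.step()
    return loss.item()
```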
Random Network Distillation (RND) training substep S124:
Training the random network distillation (RND) in the global access frequency module only requires training its prediction network, since the parameters of the target network are randomly initialized and then kept fixed; the target network is expressed as $g: x_t \mapsto g(x_t) \in \mathbb{R}^k$, and the prediction network, whose parameters are continuously updated during training to approach the target network, is expressed as $\hat{g}: x_t \mapsto \hat{g}(x_t) \in \mathbb{R}^k$; both networks finally output k-dimensional vectors.
The output values of the target network and the prediction network of the random network distillation (RND) are used to compute the mean square error $err(x_t)$, which is back-propagated to update the parameters of the prediction network while the parameters of the target network remain unchanged, until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
In a preferred embodiment, the training of this sub-step may specifically be,
The observation information $x_t$ at time t is input into the random network distillation (RND), and the error output by the target network and the prediction network,
$$err(x_t) = \lVert \hat{g}(x_t) - g(x_t) \rVert^2,$$
is used to train the prediction network; the parameters of the prediction network $\hat{g}$ are updated by back propagation, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
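A minimal sketch of the random network distillation and its training step, assuming PyTorch, is given below; the network sizes, the output dimension k and the names are illustrative assumptions:

```python
# A minimal sketch of sub-step S124: the prediction network is trained to match
# a fixed, randomly initialized target network; err(x_t) is both the training
# loss and the novelty signal used for the exploration factor.
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim: int, k: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        for p in self.target.parameters():
            p.requires_grad_(False)   # target network stays fixed

    def error(self, x_t: torch.Tensor) -> torch.Tensor:
        # err(x_t) = || g_hat(x_t) - g(x_t) ||^2
        return ((self.predictor(x_t) - self.target(x_t)) ** 2).sum(dim=-1)

def rnd_train_step(rnd: RND, optimizer: torch.optim.Optimizer, x_t: torch.Tensor) -> float:
    loss = rnd.error(x_t).mean()
    optimizer.zero_grad()
    loss.backward()                   # only the prediction network is updated
    optimizer.step()
    return loss.item()
```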
Unmanned aerial vehicle intelligent network construction and training step S130:
A deep Q learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target $\hat{Q}$ network with the same structure. The observation information $x_t$ is input, the action selected by the unmanned aerial vehicle under the observation at each moment is obtained through the current Q network of the deep Q network, the action is executed and the environment is interacted with to obtain a transition tuple $(x_t, a_t, r_t, x_{t+1})$, which is stored in a replay buffer; the transition tuples in the replay buffer are used to train the current Q network and update its parameter θ, and after every several episodes the parameter $\theta^-$ of the target $\hat{Q}$ network is updated.
Specifically, the method comprises the following steps:
A deep Q learning network is constructed as the unmanned aerial vehicle network and the action value function is extended by adding the parameter β, which adjusts the weight of the corrected local reward; β can take different values so that the unmanned aerial vehicle network learns different strategies. The observation information $x_t$ is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network, and the action $a_t$ corresponding to the maximum Q value is selected from these Q values by an ε-greedy method, or an action $a_t$ is selected at random; generally ε takes the value 0.9, i.e. the action with the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. Then, with observation $x_t$, the current action $a_t$ is executed to obtain the new observation $x_{t+1}$ and the global reward $r_t^{global}$; the global reward $r_t^{global}$ and the corrected local reward $\hat{r}_t^{episodic}$ are weighted and summed to obtain the extended reward $r_t$, and the transition tuple $(x_t, a_t, r_t, x_{t+1})$ is stored in the replay buffer. Subsequently w transition tuples $(x_j, a_j, r_j, x_{j+1})$, j = 1, 2, ..., w, are sampled from the replay buffer with batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled and trained at a time, and the current target Q value $y_j$ is calculated as:
$$y_j = \begin{cases} r_j, & \text{if the episode ends at time } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-), & \text{otherwise,} \end{cases}$$
wherein $r_j$ is the extended reward harvested by the unmanned aerial vehicle at time j, $\max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-)$ represents the maximum of the Q values that the target $\hat{Q}$ network outputs for all actions given the observation $x_{j+1}$ at time j+1, and γ represents a discount factor with a value between 0 and 1. If t = j + 1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value $y_j$ is the output of the target $\hat{Q}$ network multiplied by the discount factor and added to the extended reward. The mean square error between the target Q value $y_j$ and the output of the current Q network,
$$Loss = \frac{1}{w} \sum_{j=1}^{w} \left( y_j - Q(x_j, a_j, \beta, \theta) \right)^2,$$
is then computed and the parameter θ of the current Q network is updated by gradient descent, where w represents the number of sampled transition tuples; after several episodes, usually 10, the parameter $\theta^-$ of the target $\hat{Q}$ network is updated once.
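A minimal sketch of this update, assuming PyTorch and a β-conditioned Q network of the kind sketched earlier, is given below; the tensor layouts, the names and the use of a `done` flag for episode termination are illustrative assumptions:

```python
# A minimal sketch of the step-S130 update: the target Q value y_j uses the
# extended reward r_j and the target Q-hat network (with the beta input), and
# the current Q network is trained on the mean square error.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, beta: float, gamma: float = 0.99):
    # batch: tensors x_j (w, obs), a_j (w,), r_j (w,), x_next (w, obs), done (w,)
    x_j, a_j, r_j, x_next, done = batch
    b = torch.full((x_j.shape[0], 1), beta)
    with torch.no_grad():
        max_next_q = target_net(x_next, b).max(dim=1).values
        # y_j = r_j at episode end, else r_j + gamma * max_a' Q_hat(x_{j+1}, a', beta, theta^-)
        y_j = r_j + gamma * (1.0 - done) * max_next_q
    q_taken = q_net(x_j, b).gather(1, a_j.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_taken, y_j)   # (1/w) * sum_j (y_j - Q(x_j, a_j, beta, theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                  # gradient step on theta
    return loss.item()
```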
Repeating the training and exiting step S140:
Steps S120-S130 are repeated with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation (RND) and the deep Q learning network until the episodes are finished; the network controlling the flight of the unmanned aerial vehicle comprises the trained embedded network, random network distillation (RND) and deep Q learning network, and the network structure that yields the maximum reward is selected to guide the flight of the unmanned aerial vehicle.
In particular, referring to fig. 5, the whole process of unmanned aerial vehicle combat training is shown.
The present invention further discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described agent-efficient global exploration method for fast convergence of value functions.
The invention also discloses an intelligent agent high-efficiency global exploration system with fast convergence of the value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer executable instructions, and when the computer executable instructions are executed by a processor, the intelligent agent efficient global exploration method for fast convergence of the value function is executed.
In summary, the invention has the following advantages:
1. by introducing the correction local reward, the unmanned aerial vehicle keeps exploring all the time in the whole training process, and the unmanned aerial vehicle is encouraged to visit the observation information which is not visited and give a very high reward, so that the observation information obtained by interaction between the unmanned aerial vehicle and the environment can be visited in the training process, and the unmanned aerial vehicle can clearly know which observation information can obtain higher reward; meanwhile, the acquired correction local reward can be regulated and controlled to extend the reward all the time, so that the unmanned aerial vehicle cannot converge to a suboptimal strategy in advance, and the unmanned aerial vehicle can learn the optimal strategy.
2. The observation information access times in the whole training process of the unmanned aerial vehicle are recorded through Random Network Distillation (RND) in the global access frequency module, and the observation information access conditions among different plots are associated, so that the unmanned aerial vehicle can access more observation information which is not accessed in the whole training process and the plot process. For example: if the initial local reward obtained by the local access frequency module is small, the observation information is accessed in the plot for a plurality of times, if the exploration factor obtained by the observation information in the global access frequency module is large, the observation information is accessed in the whole training process for the unmanned aerial vehicle for a plurality of times, the corrected local reward obtained by modulating the initial local reward and the exploration factor is not small, and the observation information is accessed in other plots. Under the condition of a large number of training times, the unmanned aerial vehicle can clearly know which observation information obtains the maximum reward, and the obtained strategy is optimal.
3. The traditional action-value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added, which adjusts the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. By setting different values of β, the balance between exploration and exploitation can be adjusted: a good strategy and good parameters are first obtained through exploration and learning, and β is then set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and an even better strategy is obtained. In this way the parameters learned by the unmanned aerial vehicle are improved by the corrected local reward, and in the final stage training is regulated only by the global reward by modulating the β parameter (see the sketch after this list).
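As referenced in advantage 3, the following short sketch illustrates how the β weight might be scheduled; the initial β value and the cut-over episode are hypothetical, since the description only states that β is eventually set to 0.

```python
# A minimal sketch of a beta schedule; beta0 and explore_until are assumptions.
def extended_reward(r_global, r_local_corrected, beta):
    # r_t = r_t^e + beta * r_t^i : beta controls how much exploration matters
    return r_global + beta * r_local_corrected

def beta_schedule(episode, explore_until=800, beta0=0.3):
    # explore with beta0 first, then train on the global reward alone (beta = 0)
    return beta0 if episode < explore_until else 0.0

for ep in (0, 799, 800, 1200):
    print(ep, extended_reward(1.0, 0.5, beta_schedule(ep)))
```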
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general-purpose computing device; they may be centralized on a single computing device; alternatively, they may be implemented with program code executable by a computing device, so that they can be stored in a memory device and executed by a computing device; or they may be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the invention with reference to specific preferred embodiments, and it should not be concluded that the specific implementation of the invention is limited to these descriptions. A person of ordinary skill in the art may make a number of simple deductions or substitutions without departing from the inventive concept, and these should all be considered to fall within the scope of protection defined by the claims as filed.

Claims (7)

1. An intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
the step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles and setting the range of motion of our unmanned aerial vehicle and the enemy unmanned aerial vehicle within that combat range; setting the observation information of our unmanned aerial vehicle to be the position (x_0, y_0, z_0) of our unmanned aerial vehicle, its deflection angle φ_0 relative to the horizontal xoy plane and its flip angle ω_0 relative to the motion plane, together with the position (x_1, y_1, z_1) of the enemy unmanned aerial vehicle, its deflection angle φ_1 relative to the horizontal plane and its flip angle ω_1 relative to the motion plane, wherein the flip angles satisfy ω_0 < 90° and ω_1 < 90°, with corresponding constraints on the deflection angles φ_0 and φ_1;
the observation information x_t of our unmanned aerial vehicle is:
x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1);
the legal actions of our unmanned aerial vehicle are set to be eastward, southward, westward and northward;
the global reward is set as follows: the global reward depends on whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or evades the attack of the enemy unmanned aerial vehicle; if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle, the global reward is set to the value 1; if our unmanned aerial vehicle evades the attack of the enemy unmanned aerial vehicle, the global reward is set to the value 0; otherwise the global reward is set to the value -1; the global reward is denoted r_t^e;
an unmanned aerial vehicle corrected-local-reward network construction and training step S120:
the unmanned aerial vehicle corrected-local-reward network comprises a local access frequency module and a global access frequency module, and its construction specifically comprises the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises an embedding network f, the controllable state f(x_t), an episodic memory M and a k-nearest-neighbour module; the observation information x_t of the unmanned aerial vehicle at time t is input into the embedding network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our unmanned aerial vehicle at this time is calculated by a k-nearest-neighbour algorithm;
Global access frequency module construction sub-step S122:
the global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input to compute an exploration factor α_t, the initial local reward r_t^episodic is modulated by α_t to obtain the corrected local reward r_t^i, and finally the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at time t;
Embedded network training substep S123:
two fully connected layers and a softmax layer are connected after the embedding network to output the probabilities of the actions taken when transferring from time t to time t+1; this group of probabilities forms one vector, while the action output at time t by the current Q network in the deep Q-learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedding network until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained;
random network distillation training substep S124:
the mean square error err(x_t) between the output values of the target network and the prediction network of the random network distillation is computed; the error is back-propagated to update the parameters of the prediction network while the parameters of the target network are kept unchanged, until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
a deep Q-learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure; the observation information x_t is input, the action selected by the unmanned aerial vehicle under the observation information at each time step is obtained through the current Q network of the deep Q-learning network, the action is executed and the interaction with the environment yields a transition tuple (x_t, a_t, r_t, x_{t+1}) which is stored in a replay buffer; using the transition tuples in the replay buffer, the target Q value is obtained through the target Q̂ network, the loss is calculated from the target Q value and the output value of the current Q network, the current Q network is trained according to this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated at intervals of several episodes;
repeated training and exit step S140:
steps S120-S130 are repeated using the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, and the embedding network, the random network distillation and the deep Q-learning network are continuously trained and updated until all episodes are finished, and the network structure that yields the largest reward is selected to guide the flight of the unmanned aerial vehicle.
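A small sketch of the combat-readiness settings of step S110 in claim 1 follows; the Observation container and the engagement-outcome flags are illustrative assumptions, while the four legal actions and the +1/0/-1 global reward follow the claim.

```python
# Sketch of the S110 observation, action and global-reward settings (illustrative).
from dataclasses import dataclass

ACTIONS = ("east", "south", "west", "north")   # legal actions of our drone

@dataclass
class Observation:
    # our drone: position and attitude
    x0: float
    y0: float
    z0: float
    phi0: float      # deflection angle w.r.t. the horizontal xoy plane
    omega0: float    # flip angle w.r.t. the motion plane (< 90 degrees)
    # enemy drone: position and attitude
    x1: float
    y1: float
    z1: float
    phi1: float
    omega1: float

    def as_vector(self):
        return [self.x0, self.y0, self.z0, self.phi0, self.omega0,
                self.x1, self.y1, self.z1, self.phi1, self.omega1]

def global_reward(destroyed_enemy: bool, evaded_attack: bool) -> int:
    if destroyed_enemy:
        return 1      # enemy drone destroyed
    if evaded_attack:
        return 0      # enemy attack evaded
    return -1         # otherwise

obs = Observation(0, 0, 100, 10, 30, 50, 50, 120, -5, 20)
print(len(obs.as_vector()), global_reward(False, True))
```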
2. The agent-efficient global exploration method according to claim 1,
in step S121, the embedding network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it takes the observation information x_t as input and extracts the controllable state of the unmanned aerial vehicle as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M; at time t the episodic memory M stores the controllable states from time 0 to time t, i.e. {f(x_0), f(x_1), ..., f(x_t)}; the state visit count is converted into a reward, and the initial local reward r_t^episodic is defined as:
r_t^episodic = 1 / √(n(f(x_t))),
wherein n(f(x_t)) denotes the number of times the controllable state f(x_t) has been visited;
an inverse kernel function K: ℝ^p × ℝ^p → ℝ is used to approximate the number of times the observation information has been visited at time t, where ℝ denotes the real number field and the superscript p denotes the dimension; the pseudo-count n(f(x_t)) is approximated using the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, so that the initial local reward r_t^episodic becomes:
r_t^episodic = 1 / ( √( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c ),
wherein f_i ∈ N_k means that the controllable states are taken from N_k one by one to compute the approximate number of visits of the observation information at time t;
the inverse kernel function is expressed as:
K(u, v) = ε / ( d²(u, v) / d_m² + ε ),
wherein ε is taken to be 0.001, d is the Euclidean distance, d_m² is the running average of the squared Euclidean distances to the k nearest neighbours, and c is taken to be 0.001.
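The episodic pseudo-count reward of claim 2 can be sketched in numpy as follows; it assumes the running average d_m² is approximated over the current k nearest neighbours only and uses plain vectors in place of the CNN features, so it is an illustration of the formula rather than the patented implementation.

```python
# Numpy sketch of the claim-2 initial local reward (assumptions noted inline).
import numpy as np

def inverse_kernel(d2, dm2, eps=0.001):
    # K(u, v) = eps / (d^2(u, v) / d_m^2 + eps)
    return eps / (d2 / dm2 + eps)

def initial_local_reward(state, episodic_memory, k=10, c=0.001):
    if len(episodic_memory) == 0:
        return 1.0                               # nothing visited yet in this episode (assumption)
    M = np.asarray(episodic_memory)
    d2 = np.sum((M - state) ** 2, axis=1)        # squared Euclidean distances to stored f(x_i)
    nearest = np.sort(d2)[:k]                    # N_k: the k nearest controllable states
    dm2 = max(float(np.mean(nearest)), 1e-8)     # running average of squared distances (assumption)
    pseudo_count = float(np.sum(inverse_kernel(nearest, dm2)))
    return 1.0 / (np.sqrt(pseudo_count) + c)     # r_t^episodic

episodic_memory = []
for t in range(5):
    f_x = np.array([0.1 * t, 0.0, 1.0])          # stand-in for the controllable state f(x_t)
    print(t, round(initial_local_reward(f_x, episodic_memory), 3))
    episodic_memory.append(f_x)
```

Revisiting nearby states shrinks the distances, raises the pseudo-count and therefore lowers the reward, which is the within-episode behaviour the claim describes.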
3. The agent-efficient global exploration method according to claim 2,
in step S122, the observation information x_t of the unmanned aerial vehicle at time t is fed into the random network distillation, and the error err(x_t) between the outputs of its two networks is used to define the exploration factor α_t:
α_t = 1 + (err(x_t) − μ_e) / σ_e,
wherein σ_e and μ_e are the running standard deviation and the running mean of err(x_t); α_t is used as a multiplicative factor of the initial local reward r_t^episodic, and the corrected local reward r_t^i is expressed as:
r_t^i = r_t^episodic · min(max(α_t, 1), L),
wherein α_t is clipped to lie between 1 and L, L being a hyperparameter whose value is at most 5; when α_t is smaller than 1 it is set to 1;
finally, the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t:
r_t = r_t^e + β · r_t^i,
wherein r_t^e and r_t^i respectively denote the global reward and the corrected local reward, and β is a positive scalar ranging from 0 to 1.
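The claim-3 modulation can be sketched as follows; the random-feature target/predictor pair standing in for the RND networks and the online mean/variance tracker are assumptions made only to keep the example self-contained.

```python
# Sketch of the exploration factor and extended reward of claim 3 (toy RND pair).
import numpy as np

rng = np.random.default_rng(1)
W_target = rng.normal(size=(10, 8))      # fixed, randomly initialised target network
W_pred = np.zeros((10, 8))               # prediction network (trained in S124)

def rnd_error(x):
    return float(np.mean((x @ W_target - x @ W_pred) ** 2))   # err(x_t)

class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, v):
        self.n += 1
        d = v - self.mean
        self.mean += d / self.n
        self.m2 += d * (v - self.mean)
    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0

stats = RunningStats()

def extended_reward(x, r_episodic, r_global=0.0, beta=0.3, L=5.0):
    err = rnd_error(x)
    stats.update(err)
    alpha = 1.0 + (err - stats.mean) / (stats.std + 1e-8)      # exploration factor
    alpha = float(np.clip(alpha, 1.0, L))                      # clipped to [1, L]
    r_i = r_episodic * alpha                                   # corrected local reward
    return r_global + beta * r_i                               # extended reward r_t

for t in range(3):
    x = rng.normal(size=10)
    print(t, round(extended_reward(x, r_episodic=0.5), 3))
```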
4. The agent-efficient global exploration method according to claim 3,
step S123 specifically includes:
two successive observations x_t and x_{t+1} are respectively input into the embedding network f to extract the controllable states f(x_t) and f(x_{t+1}), and two fully connected layers followed by one softmax layer then output the probabilities of all actions taken when transferring from observation x_t to observation x_{t+1}; the action probabilities output by the embedding network correspond to the probabilities of the four actions east, south, west and north, their sum is 1, and they are specifically expressed as:
p(a_1 | x_t, x_{t+1}), ..., p(a_m | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),
wherein p(a_1 | x_t, x_{t+1}) denotes the probability of taking action a_1 when transferring from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method; the probability of each output action forms a P vector, the action output by the current Q network in the deep Q-learning network is one-hot encoded to obtain an A vector, and the mean square error E of the P vector and the A vector is calculated:
E = (1/m) · Σ_{i=1}^{m} (P_i − A_i)²,
wherein m is the number of actions that can be taken, m = 4; finally the parameters of the embedding network f are updated by back-propagating the calculation result E, and training continues until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained.
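A forward-pass sketch of the claim-4 loss E between the action-probability vector P and the one-hot vector A follows; the linear stand-ins for f and h are assumptions, and the back-propagation update of their parameters is only indicated in a comment.

```python
# Forward pass for the claim-4 inverse-dynamics loss E (toy linear f and h).
import numpy as np

rng = np.random.default_rng(2)
M_ACTIONS = 4                                          # east, south, west, north

W_f = rng.normal(scale=0.1, size=(10, 6))              # stand-in for the embedding network f
W_h = rng.normal(scale=0.1, size=(12, M_ACTIONS))      # hidden layer h over [f(x_t), f(x_{t+1})]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def action_probabilities(x_t, x_next):
    f_t, f_next = x_t @ W_f, x_next @ W_f              # controllable states
    return softmax(np.concatenate([f_t, f_next]) @ W_h)  # p(a | x_t, x_{t+1})

def inverse_dynamics_loss(x_t, x_next, action_taken):
    P = action_probabilities(x_t, x_next)              # P vector
    A = np.eye(M_ACTIONS)[action_taken]                # one-hot A vector
    return float(np.mean((P - A) ** 2))                # E = (1/m) * sum_i (P_i - A_i)^2

x_t, x_next = rng.normal(size=10), rng.normal(size=10)
print("E =", round(inverse_dynamics_loss(x_t, x_next, action_taken=2), 4))
# In S123 this E would be back-propagated to update the parameters of f and h.
```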
5. The agent-efficient global exploration method according to claim 4,
step S124 specifically includes:
the observation information x_t at time t is input into the random network distillation, and the error between the outputs of the target network and the prediction network is computed as
err(x_t) = ‖ ĝ(x_t; θ̂) − g(x_t) ‖²,
wherein g denotes the fixed target network, ĝ denotes the prediction network and θ̂ denotes the parameters of the prediction network; the prediction network is trained by back-propagating this error with gradient descent to update its parameters θ̂, and training continues until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained.
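A minimal sketch of the claim-5 prediction-network update follows; both RND networks are reduced to single linear layers and plain SGD on the squared error stands in for the gradient-descent update, which is an assumption made for illustration.

```python
# Toy RND update: only the prediction network is trained, the target stays frozen.
import numpy as np

rng = np.random.default_rng(3)
W_target = rng.normal(size=(10, 8))              # frozen target network
W_pred = np.zeros((10, 8))                       # prediction network, to be trained
lr = 1e-2

def rnd_step(x):
    global W_pred
    target, pred = x @ W_target, x @ W_pred
    err = float(np.mean((pred - target) ** 2))   # err(x_t)
    grad = np.outer(x, 2.0 * (pred - target) / target.size)
    W_pred -= lr * grad                          # gradient-descent update of the predictor
    return err

for t in range(3):
    x = rng.normal(size=10)
    print(t, round(rnd_step(x), 4))
```

Because the predictor only fits states it has already seen, err(x_t) stays large on novel observations, which is what makes it usable as the global exploration signal of claim 3.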
6. The agent-efficient global exploration method according to claim 5,
step S130 specifically includes:
a deep Q-learning network is constructed as the unmanned aerial vehicle network; the action value function is extended by adding a β parameter that adjusts the weight of the corrected local reward; the observation information x_t is used as the input of the current Q network to obtain the Q values corresponding to all actions output by the current Q network, and the ε-greedy method is used either to select the action a_t corresponding to the maximum Q value among the Q values output by the current Q network or to select an action a_t at random; ε takes the value 0.9, i.e. the action corresponding to the maximum Q value is selected with probability 90% and a random action is selected with probability 10%; the current action a_t is executed under the observation information x_t to obtain the new observation information x_{t+1} and the global reward r_t^e; the global reward r_t^e and the corrected local reward r_t^i are then weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer; then w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled for each training step, and the current target Q value y_j is calculated as follows:
y_j = r_j, if the episode ends at time j+1,
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β; θ⁻), otherwise,
wherein r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_a Q̂(x_{j+1}, a, β; θ⁻) denotes the maximum of the Q values that the target Q̂ network outputs for all actions according to the observation information x_{j+1} at time j+1, and γ denotes a discount factor whose value lies between 0 and 1;
if t = j+1, which indicates the end of the episode, the output target Q value equals the extended reward value at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward; the mean square error between the target Q value y_j and the output value of the current Q network is then calculated as
(1/w) · Σ_{j=1}^{w} ( y_j − Q(x_j, a_j, β; θ) )²,
and the parameter θ of the current Q network is updated by gradient descent, wherein w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
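A numpy sketch of the claim-6 target computation and batch loss follows; the linear Q functions and the random batch are illustrative assumptions, and β is assumed to enter only through the extended reward r_j already stored in the transition tuples.

```python
# Toy target Q value and TD loss in the spirit of claim 6 (linear Q stand-ins).
import numpy as np

rng = np.random.default_rng(4)
N_ACTIONS, GAMMA = 4, 0.95
theta = rng.normal(scale=0.1, size=(10, N_ACTIONS))        # current Q network parameters
theta_minus = theta.copy()                                 # target Q-hat network parameters

def q(x, params):
    return x @ params                                      # Q values for all actions

def target_q(r_j, x_next, done):
    if done:                                               # episode ends at j + 1
        return r_j
    return r_j + GAMMA * float(np.max(q(x_next, theta_minus)))

def td_loss(batch):
    # batch: list of (x_j, a_j, r_j, x_{j+1}, done) transition tuples
    errs = []
    for x_j, a_j, r_j, x_next, done in batch:
        y_j = target_q(r_j, x_next, done)
        errs.append((y_j - float(q(x_j, theta)[a_j])) ** 2)
    return float(np.mean(errs))                            # (1/w) * sum_j (y_j - Q(x_j, a_j))^2

batch = [(rng.normal(size=10), int(rng.integers(N_ACTIONS)), float(rng.choice([1.0, 0.0, -1.0])),
          rng.normal(size=10), bool(rng.integers(2))) for _ in range(8)]
print("loss:", round(td_loss(batch), 4))
# theta would then be updated by gradient descent on this loss; theta_minus is
# refreshed from theta once every 10 episodes, as stated in claim 6.
```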
7. An agent-efficient global exploration system with fast convergence of value functions, comprising a storage medium,
the storage medium storing computer-executable instructions which, when executed by a processor, perform the agent-efficient global exploration method for fast convergence of value functions of any of claims 1-6.
CN202210421995.3A 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function Active CN114690623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Publications (2)

Publication Number Publication Date
CN114690623A CN114690623A (en) 2022-07-01
CN114690623B true CN114690623B (en) 2022-10-25

Family

ID=82144133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210421995.3A Active CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Country Status (1)

Country Link
CN (1) CN114690623B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761850B (en) * 2022-11-16 2024-03-22 智慧眼科技股份有限公司 Face recognition model training method, face recognition method, device and storage medium
CN115826621B (en) * 2022-12-27 2023-12-01 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115857556B (en) * 2023-01-30 2023-07-14 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
US20200134445A1 (en) * 2018-10-31 2020-04-30 Advanced Micro Devices, Inc. Architecture for deep q learning
CN114371729B (en) * 2021-12-22 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Multi-Agent Motion Prediction and Tracking Method Based on Non-Cooperative Equilibrium; Li Yan, et al.; Mathematics; 2022-01-05; full text *
An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium; Li Yan, et al.; Electronics; 2021-11-29; full text *

Also Published As

Publication number Publication date
CN114690623A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
US11150670B2 (en) Autonomous behavior generation for aircraft
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN113221444B (en) Behavior simulation training method for air intelligent game
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
Montazeri et al. Continuous state/action reinforcement learning: A growing self-organizing map approach
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
CN115018017A (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN116663637A (en) Multi-level agent synchronous nesting training method
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN110450164A (en) Robot control method, device, robot and storage medium
CN115903901A (en) Output synchronization optimization control method for unmanned cluster system with unknown internal state
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Shen et al. Theoretically principled deep RL acceleration via nearest neighbor function approximation
CN115212549A (en) Adversary model construction method under confrontation scene and storage medium
CN114814741A (en) DQN radar interference decision method and device based on priority important sampling fusion
Hachiya et al. Efficient sample reuse in EM-based policy search
Zhao et al. Convolutional fitted Q iteration for vision-based control problems
US20220404831A1 (en) Autonomous Behavior Generation for Aircraft Using Augmented and Generalized Machine Learning Inputs
CN113537318B (en) Robot behavior decision method and device simulating human brain memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant