CN114690623A - Intelligent agent efficient global exploration method and system for rapid convergence of value function - Google Patents

Intelligent agent efficient global exploration method and system for rapid convergence of value function

Info

Publication number
CN114690623A
Authority
CN
China
Prior art keywords
network
unmanned aerial vehicle
reward
global
Prior art date
Legal status
Granted
Application number
CN202210421995.3A
Other languages
Chinese (zh)
Other versions
CN114690623B (en)
Inventor
林旺群
李妍
徐菁
王伟
田成平
刘波
王锐华
孙鹏
Current Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202210421995.3A
Publication of CN114690623A
Application granted
Publication of CN114690623B
Legal status: Active

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/024 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An agent efficient global exploration method and system for rapid convergence of a value function are disclosed. The method gives the unmanned aerial vehicle (UAV) a clearer training target through an extended reward formed by combining a corrected local reward with a global reward, adopts a universal value function approximator so that the UAV keeps exploring the environment throughout training, and modulates the initial local reward to capture global correlation, so that UAV agent training is efficient and the UAV finally learns the optimal combat strategy. By introducing the corrected local reward, the UAV keeps exploring throughout training and the corrected local reward continuously regulates the extended reward, so the UAV does not converge prematurely to a suboptimal strategy and is guaranteed to learn the optimal strategy; and because the visit statistics of observations are correlated across different episodes, the UAV can visit more previously unvisited observations both within each episode and over the whole training process.

Description

Intelligent agent efficient global exploration method and system for rapid convergence of value function
Technical Field
The invention relates to the field of virtual-simulation intelligent confrontation, and in particular to an agent efficient global exploration method and system with a fast-converging value function, which improves the learning performance of our unmanned aerial vehicle (UAV) while it destroys the enemy UAV and evades the enemy UAV's attacks.
Background
In recent years, the growing demand for unmanned and intelligent unmanned aerial vehicles (UAVs) and the vigorous development of artificial intelligence have drawn wide attention to UAVs in both military and civil fields, and intelligent confrontation in the field of virtual simulation has become a hot research topic.
Because traditional agent learning and training methods suffer from sparse reward settings, the UAV often explores blindly during training; once it finds a suboptimal strategy it is likely to stop exploring and switch to exploitation, making it hard to learn the optimal strategy. The limitation of this approach is that the UAV must accumulate a large amount of experience through repeated blind exploration, which is inefficient and may never yield the optimal strategy.
Building on the traditional approach, some researchers have proposed integrating a corrected local reward, which keeps the UAV exploring purposefully throughout the combat scenario and lets it learn the optimal strategy to some extent. The limitation of this method is that the corrected local reward is not regulated: the corrected local reward of each episode is related only to that episode and is not correlated across all episodes of the whole training process, so the agent trains too slowly.
Therefore, how to overcome the shortcomings of the prior art, make the corrected local reward and the global reward cooperate with each other, keep the agent continuously exploring in the combat scenario, and prevent the UAV agent from meaningless learning has become an urgent problem to solve.
Disclosure of Invention
The invention aims to provide an agent efficient global exploration method and system with a fast-converging value function, which gives the unmanned aerial vehicle (UAV) a clearer training target based on an extended reward formed by combining a corrected local reward with a global reward, adopts a universal value function approximator so that the UAV keeps exploring the environment throughout training, and modulates the initial local reward with an exploration factor to capture global correlation, so that the UAV agent trains efficiently and finally acquires the optimal combat strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
Construct the UAV corrected local reward network, which comprises a local access frequency module and a global access frequency module; this step further comprises the following sub-steps:
local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M, and a k-nearest-neighbor module. The observation x_t of our UAV at time t is input to the embedded network f to extract the controllable state f(x_t) of the UAV agent; f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our UAV at this moment is computed with the k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
Construct the global access frequency module with random network distillation. The observation x_t of the UAV at time t is input and an exploration factor α_t is computed, which modulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^local. The corrected local reward makes the rewards received by the whole network dense; with dense rewards the UAV is regulated better, so the value function of the deep Q-learning network converges faster and the UAV performs better. Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our UAV at this moment;
Embedded network training substep S123:
Append two fully connected layers and a softmax layer to the embedded network to output the probability of each action for the transition from time t to time t+1, and assemble these probabilities into a vector; simultaneously one-hot encode the action output at time t by the current Q network of the deep Q-learning network to obtain another vector; compute the mean squared error E of the two vectors and back-propagate it to update the parameters of the embedded network, until all episodes are finished. "All episodes are finished" means that the UAV iterates over many episodes during the whole training process, and training ends once all episodes have been trained;
random net distillation training substep S124:
Compute the mean squared error err(x_t) between the outputs of the target network and the prediction network of random network distillation, and back-propagate this error to update the parameters of the prediction network while the parameters of the target network remain fixed, until all episodes are finished ("all episodes are finished" has the same meaning as above);
unmanned aerial vehicle intelligent network construction and training step S130:
Construct a deep Q-learning network as the UAV network, comprising a current Q network and a target Q̂ network with identical structure. The observation x_t is input to the current Q network to obtain the action selected by the UAV under the observation at each moment; the action is executed and the UAV interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The transition tuples in the replay buffer are used to compute the target Q value and the loss against the output of the current Q network; the current Q network is trained on this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes;
Repeating the training and exiting step S140:
Repeat steps S120 to S130 with the observations obtained from the UAV's interaction with the environment, continuously training and updating the embedded network, the random network distillation and the deep Q-learning network until the episodes are finished, and select the network structure that yields the maximum reward to guide the UAV's flight.
Optionally, step S110 specifically includes:
Set the aerial combat range of the UAVs, with the motion ranges of our UAV and the enemy UAV lying inside it. The observation of our UAV is set to our UAV's position (x_0, y_0, z_0), its deflection angle φ_0 relative to the horizontal xoy plane and its roll angle ω_0 (< 90°) relative to the motion plane, together with the enemy UAV's position (x_1, y_1, z_1), its deflection angle φ_1 relative to the horizontal plane and its roll angle ω_1 (< 90°) relative to the motion plane, so the observation x_t of our UAV is:

x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1),
assuming that the legal actions of my drone are east, south, west and north;
The global reward is set according to whether our UAV destroys the enemy UAV or evades the enemy UAV's attack: if our UAV destroys the enemy UAV the global reward is 1, if it evades the enemy UAV's attack the global reward is 0, and otherwise the global reward is -1. The global reward is denoted r_t^global.
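The mapping from combat outcome to global reward described above can be written as a few lines of code. The following is a minimal sketch; the function name and the boolean flags are illustrative assumptions, not part of the patent.

```python
def global_reward(destroyed_enemy: bool, evaded_attack: bool) -> float:
    """Global reward r_t^global: 1 for destroying the enemy UAV,
    0 for evading its attack, -1 otherwise (illustrative signature)."""
    if destroyed_enemy:
        return 1.0
    if evaded_attack:
        return 0.0
    return -1.0
```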
Optionally, in step S121, the embedded network f is a convolutional neural network with three convolutional layers and one fully connected layer; it takes the observation x_t as input and extracts the controllable state of the UAV as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. Converting the state-action visit count into a reward, the initial local reward r_t^episodic is defined as:

r_t^episodic = 1 / sqrt(n(f(x_t)))

where n(f(x_t)) is the number of times the controllable state f(x_t) has been visited;
An inverse kernel function K: R^p × R^p → R is used to approximate the number of times the observation at time t has been visited, where R denotes the real number domain and the superscript p denotes the dimension. The pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, so that the initial local reward r_t^episodic is specifically:

r_t^episodic ≈ 1 / ( sqrt( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c )

where f_i ∈ N_k means the controllable states are taken out of N_k in turn to compute the approximate visit count of the observation at time t. The inverse kernel function is:

K(x, y) = ε / ( d²(x, y) / d_m² + ε )

where ε is taken as 0.001, d is the Euclidean distance, d_m² is a running average of the squared k-nearest-neighbour distances, and c is 0.001.
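As a concrete illustration of the k-nearest-neighbour pseudo-count and inverse kernel above, the following Python sketch computes r_t^episodic from an in-memory list of controllable states. The class name, the default k and the decay rate of the running distance average are assumptions for illustration only.

```python
import numpy as np

class EpisodicRewardModule:
    """Initial local reward r_t^episodic via k-nearest-neighbour pseudo-counts
    and the inverse kernel K (a sketch of the formulas above)."""

    def __init__(self, k: int = 10, eps: float = 1e-3, c: float = 1e-3):
        self.k, self.eps, self.c = k, eps, c
        self.memory = []   # episodic memory M of controllable states f(x_t)
        self.dm2 = 1.0     # running average d_m^2 of squared k-NN distances

    def reward(self, f_xt: np.ndarray) -> float:
        if not self.memory:
            self.memory.append(f_xt)
            return 1.0
        dists2 = np.sum((np.stack(self.memory) - f_xt) ** 2, axis=1)
        nearest = np.sort(dists2)[: self.k]                             # N_k
        self.dm2 = 0.99 * self.dm2 + 0.01 * float(nearest.mean())       # update d_m^2
        kernel = self.eps / (nearest / max(self.dm2, 1e-8) + self.eps)  # K(f(x_t), f_i)
        self.memory.append(f_xt)                                        # store f(x_t) in M
        return 1.0 / (float(np.sqrt(kernel.sum())) + self.c)            # r_t^episodic
```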
Optionally, in step S122, the observation x_t of the UAV at time t is input to random network distillation, and the output error err(x_t) between its two networks defines the exploration factor α_t:

α_t = 1 + (err(x_t) − μ_e) / σ_e

where σ_e and μ_e are the running standard deviation and mean of err(x_t). α_t acts as a multiplicative factor on the initial local reward r_t^episodic, and the corrected local reward r_t^local is expressed as:

r_t^local = r_t^episodic · min{ max{α_t, 1}, L }

where the value of α_t is limited to between 1 and L, L is a hyperparameter of at most 5, and the minimum value of α_t is set to 1;
Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t:

r_t = r_t^global + β · r_t^local

where r_t^global and r_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar between 0 and 1.
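The two equations above combine into one small reward-shaping routine. The sketch below assumes the running mean and standard deviation of the RND error are tracked elsewhere and passed in; the default β = 0.3 is an illustrative choice within the stated (0, 1) range.

```python
import numpy as np

def extended_reward(r_global: float, r_episodic: float, rnd_error: float,
                    err_mean: float, err_std: float,
                    beta: float = 0.3, L: float = 5.0) -> float:
    """Extended reward r_t = r_t^global + beta * r_t^local, where the corrected
    local reward modulates r_t^episodic by the clipped exploration factor alpha_t."""
    alpha_t = 1.0 + (rnd_error - err_mean) / max(err_std, 1e-8)  # exploration factor
    r_local = r_episodic * float(np.clip(alpha_t, 1.0, L))       # corrected local reward
    return r_global + beta * r_local                             # extended reward r_t
```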
Optionally, step S123 specifically includes:
Two successive observations x_t and x_{t+1} are input separately to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in the transition from observation x_t to observation x_{t+1}. The action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:

p(a_1 | x_t, x_{t+1}), ..., p(a_{t-1} | x_t, x_{t+1}), p(a_t | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),

where p(a_1 | x_t, x_{t+1}) is the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded into a vector A, and the mean squared error E of P and A is computed as:

E = (1/m) Σ_{i=1..m} (P_i − A_i)²

where m is the number of available actions, m = 4. Finally, the parameters of the embedded network f are updated by back-propagating E, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
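The training signal E described above can be sketched as follows, assuming f and h are PyTorch modules and q_action is the (batched) integer action chosen by the current Q network; the function name and the concatenation of the two embeddings before h are assumptions about the exact wiring.

```python
import torch
import torch.nn.functional as F

def embedding_loss(f, h, x_t, x_t1, q_action, num_actions: int = 4):
    """E: mean squared error between the predicted action probabilities P for the
    transition x_t -> x_{t+1} and the one-hot encoding A of the current Q network's action."""
    probs = h(torch.cat([f(x_t), f(x_t1)], dim=-1))      # vector P (softmax output of h)
    one_hot = F.one_hot(q_action, num_actions).float()   # vector A
    return F.mse_loss(probs, one_hot)                    # E = (1/m) * sum_i (P_i - A_i)^2
```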
Optionally, step S124 specifically includes:
The observation x_t at time t is input to random network distillation, and the output error between the target network and the prediction network,

err(x_t) = || ĝ(x_t) − g(x_t) ||²,

where g denotes the target network and ĝ the prediction network, is used to train the prediction network: its parameters are updated by back-propagation, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
Optionally, step S130 specifically includes:
A deep Q-learning network is constructed as the UAV network, the action value function is extended, and a parameter β is added to adjust the weight of the corrected local reward. The observation x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; with the ε-greedy method either the action a_t with the maximum Q value is selected from the outputs of the current Q network, or an action a_t is selected at random. ε takes the value 0.9, i.e. the action with the maximum Q value is chosen with probability 90% and an action is chosen at random with probability 10%. Under observation x_t the current action a_t is executed, yielding the new observation x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r_t^local are then weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Subsequently w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all the sampled transition tuples are trained on at each sampling, and the current target Q value y_j is computed as follows:
y_j = r_j,  if the episode ends at time j+1
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻),  otherwise

where r_j is the extended reward earned by the UAV at time j, max_a Q̂(x_{j+1}, a, β, θ⁻) is the maximum of the Q values output by the target Q̂ network for all actions given the observation x_{j+1} at time j+1, and γ is a discount factor between 0 and 1.

If t equals j+1, the episode has ended and the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean squared error between the target Q values y_j and the outputs of the current Q network is then computed,

Loss = (1/w) Σ_{j=1..w} (y_j − Q(x_j, a_j, β, θ))²,

and the parameter θ of the current Q network is updated by gradient descent, where w is the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
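A minimal sketch of the target computation and loss above is given below, assuming the Q networks take (observation, β) as input and that a sampled batch is provided as tensors (x_j, a_j, r_j, x_{j+1}, done); these interface details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, beta: float, gamma: float = 0.99):
    """Loss = (1/w) * sum_j (y_j - Q(x_j, a_j, beta, theta))^2 with
    y_j = r_j + gamma * max_a Q_hat(x_{j+1}, a, beta, theta-) for non-terminal steps."""
    x_j, a_j, r_j, x_j1, done = batch
    with torch.no_grad():
        q_next = target_net(x_j1, beta).max(dim=1).values   # max_a Q_hat(x_{j+1}, a, beta, theta-)
        y_j = r_j + gamma * (1.0 - done) * q_next           # y_j = r_j at episode end
    q_taken = q_net(x_j, beta).gather(1, a_j.unsqueeze(1)).squeeze(1)  # Q(x_j, a_j, beta, theta)
    return F.mse_loss(q_taken, y_j)
```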
The invention further discloses an intelligent agent efficient global exploration system with fast convergence of a value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the UAV keeps exploring throughout training: it is encouraged to visit observations it has not visited before and is given a very high reward for doing so, so the observations obtained from the UAV's interaction with the environment can all be visited during training and the UAV clearly learns which observations yield higher rewards; at the same time, the acquired corrected local reward continuously regulates the extended reward, so the UAV does not converge prematurely to a suboptimal strategy and can learn the optimal strategy.
2. Random Network Distillation (RND) in the global access frequency module records how many times each observation has been visited over the whole training process and correlates the visit statistics of observations across different episodes, so the UAV can visit more previously unvisited observations both within each episode and over the whole training process. For example, if the initial local reward given by the local access frequency module is small, the observation has been visited many times within the episode; if the exploration factor given by the global access frequency module for the same observation is large, the observation has been visited only a few times over the whole training process, and the corrected local reward obtained by modulating the initial local reward with the exploration factor is therefore not small, indicating that the observation has not been visited in other episodes. With a large number of training iterations the UAV clearly learns which observations yield the maximum reward, and the strategy it obtains is optimal.
3. The traditional action value function is improved: a weight parameter β for the corrected local reward is added alongside the original observation, action and network parameters to adjust the importance of the corrected local reward, i.e. the UAV's degree of exploration. Setting different values of β adjusts the balance between exploration and exploitation; a good strategy and good parameters are first obtained by exploration, and then β is set to 0 so that the UAV is regulated only by the global reward and an even better strategy is obtained. In this way the corrected local reward improves the parameters the UAV learns, and by modulating the parameter β the UAV's training is finally regulated only by the global reward.
Drawings
FIG. 1 is a flow diagram of an agent efficient global exploration method with fast convergence of value functions, in accordance with a specific embodiment of the present invention;
FIG. 2 is a flow chart of the steps of UAV corrective local reward network construction and training for a smart agent efficient global exploration method with fast convergence of value functions in accordance with an embodiment of the present invention;
FIG. 3 is an architecture diagram for correcting a local reward according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of an embedded network in accordance with a specific embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process according to an embodiment of the present invention.
Detailed Description
The terms related to the present invention are described below:
1. deep Q learning network
Deep Q-learning is a representative value-function-based deep reinforcement learning method. It contains two neural networks with the same structure, called the current Q network and the target Q̂ network. In traditional deep Q-learning the two networks are Q(x_j, a_j, θ) and Q̂(x_{j+1}, a_j, θ⁻). The invention controls the proportion of the corrected local reward in the extended reward through the parameter β and introduces β into the action value function, so Q(x_j, a_j, β, θ) and Q̂(x_{j+1}, a_j, β, θ⁻) denote the outputs of the current Q network and the target Q̂ network respectively. The input of the current Q network is the observation at the current time t, the input of the target Q̂ network is the observation at the next time t+1, and the outputs are the state-action values of all actions. In the invention the current Q network of the UAV network is the network to be learned and is used to control the UAV agent, while the target Q̂ network directly copies the parameters of the current Q network after a fixed number of episodes; the parameter θ of the current Q network is updated by gradient descent and trained by minimizing the loss function Loss:

y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻)

Loss = (y_j − Q(x_j, a_j, β, θ))²
2. Episode
An episode is the sequence formed by the observations, actions and extended rewards generated while the UAV interacts with the environment, represented as a set of transition tuples built from this experience. In the invention an episode refers to the whole process of a UAV battle from start to finish.
3. Transition tuple
The transition tuple is the basic unit of an episode. Each time the UAV agent interacts with the environment it generates an observation x_t, an action (instruction) a_t, an extended reward r_t and the next observation x_{t+1}; the quadruple (x_t, a_t, r_t, x_{t+1}) is called the transition tuple and is stored in the replay buffer.
4. Playback buffer
The replay buffer is a buffer area in memory or on disk used to store the sequence of transition tuples. The stored transition tuples can be reused repeatedly to train the deep Q-learning network. In the invention the replay buffer stores the transition tuples obtained from the UAV's interaction with the environment; its maximum capacity is N, its structure resembles a queue, and when the number of transition tuples exceeds N the tuple sequences stored earliest in the replay buffer are deleted.
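The queue-like behaviour described above maps naturally onto a bounded deque. The sketch below is illustrative; the uniform sampling helper is an assumption, since the patent only specifies that w transition tuples are drawn for each update.

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay buffer of maximum capacity N: when full, the oldest transition tuples
    are discarded first, mirroring the queue-like structure described above."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, x_t, a_t, r_t, x_t1):
        self.buffer.append((x_t, a_t, r_t, x_t1))

    def sample(self, w: int):
        return random.sample(self.buffer, min(w, len(self.buffer)))
```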
5. K-nearest neighbor
Given a sample, the k training samples in the training set that are closest to the sample are found based on some distance metric (e.g., Euclidean distance), and then prediction is performed based on the information of the k "neighbors". In the invention, the access times of certain observation information obtained by the unmanned aerial vehicle in a plot are approximately calculated by utilizing a k-neighbor thought so as to obtain the initial local reward of the unmanned aerial vehicle for the observation information. If the number of accesses of the observation information is larger, the initial local reward is smaller, and conversely, if the number of accesses of the observation information is smaller, the initial local reward is larger.
6. Random Network Distillation (RND)
Random network distillation randomly initializes two networks: the parameters of one, called the target network, are fixed, while the other, called the prediction network, is trained. In the invention the input of the RND networks is the observation x_t obtained after the UAV interacts with the environment; training drives the output of the prediction network close to that of the target network, so the smaller the output error of the two networks, the more often the observation x_t has already been visited by the UAV during training, which means a smaller exploration factor, a smaller contribution to the corrected local reward, and hence a smaller corrected local reward.
7. General value function approximator (UVFA)
Generally, different tasks require different action value functions, and different optimal value functions quantify the solutions to different tasks. In the present invention the corrected local reward is weighted to represent different tasks, i.e. tasks with different degrees of exploration. Therefore the invention extends the action value function of deep Q-learning from the original Q(x_t, a_t, θ) to Q(x_t, a_t, β, θ), where the parameter β is the weight parameter of the corrected local reward; with different values of β the corrected local reward plays different roles, and different action value functions can be mixed together through the parameter β.
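One way to realize such a universal value function approximator is to feed β to the Q network alongside the observation, so a single set of weights represents the whole family Q(x, a, β, θ). The sketch below makes that concrete; the hidden sizes and the simple concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UVFAQNetwork(nn.Module):
    """Extended action value function Q(x_t, a_t, beta, theta): beta is appended to
    the observation so one network covers different exploration weights."""

    def __init__(self, obs_dim: int = 10, num_actions: int = 4, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x: torch.Tensor, beta: float) -> torch.Tensor:
        b = torch.full((x.shape[0], 1), beta, dtype=x.dtype, device=x.device)
        return self.net(torch.cat([x, b], dim=1))   # Q values for all actions
```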
8. Kernel function and inverse kernel function
A kernel function computes, through operations on the original points in the feature space, the inner product that the points would have in a high-dimensional space, so the low-dimensional data never has to be explicitly expanded into points of the high-dimensional space, which reduces computational complexity. The inverse kernel function works in the opposite direction: its original feature space is high-dimensional and it reduces the high-dimensional space to a low-dimensional one.
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to FIG. 1, a flow chart of an agent efficient global exploration method with fast convergence of the value function according to an embodiment of the present invention is shown.
Unmanned aerial vehicle combat readiness information setting step S110:
and setting observation information and legal actions of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements.
Specifically, in the present step,
the method comprises the steps of setting a space fight range of the unmanned aerial vehicle, wherein the fight range is a three-dimensional space, the moving ranges of unmanned aerial vehicles of the owner and the enemy are in the space fight range of the unmanned aerial vehicle, for example, the ranges of two horizontal coordinates are [ -1000m,1000m ], the range of a vertical coordinate is not restricted, and the freedom of the upper moving range and the lower moving range is ensured.
The observation of our UAV is set to our UAV's position (x_0, y_0, z_0), its deflection angle φ_0 relative to the horizontal xoy plane and its roll angle ω_0 (< 90°) relative to the motion plane, together with the enemy UAV's position (x_1, y_1, z_1), its deflection angle φ_1 relative to the horizontal plane and its roll angle ω_1 (< 90°) relative to the motion plane, so the observation x_t of our UAV is:

x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1)
assume that the legal actions of my drone are set to east, south, west, and north.
Global reward setting: the global reward of our UAV is set according to whether it destroys the enemy UAV or evades the enemy UAV's attack. If our UAV destroys the enemy UAV the global reward is 1; if it evades the enemy UAV's attack the global reward is 0; otherwise it is -1, i.e. the more actions our UAV takes while neither destroying the enemy UAV nor evading its attack, the more negative the accumulated global reward. The global reward is denoted r_t^global.
unmanned aerial vehicle correction local reward network construction and training step S120:
referring to fig. 2, a flow chart of the steps of drone remediation local reward network construction and training is shown.
Referring to fig. 3, the UAV corrected local reward network is constructed; it comprises a local access frequency module and a global access frequency module.
Local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M, and a k-nearest-neighbor module. The observation x_t of our UAV at time t is input to the embedded network f to extract the controllable state f(x_t) (i.e. the controllable information) of the UAV agent; f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our UAV at this moment is computed with the k-nearest-neighbor algorithm.
Specifically, in step S121, the embedded network f is a convolutional neural network (see fig. 4) with three convolutional layers and one fully connected layer; it takes the observation x_t as input and extracts the controllable state of the UAV as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. Following the count-based exploration idea of converting state-action visit counts into rewards, the initial local reward r_t^episodic is defined as:

r_t^episodic = 1 / sqrt(n(f(x_t)))

where n(f(x_t)) is the number of times the controllable state f(x_t) has been visited, i.e. the more controllable states in the episode resemble the observation x_t (the more it has been visited), the smaller the initial local reward, and vice versa.
Since the state space is continuous, it is difficult to decide whether two controllable states are identical, so an inverse kernel function (equivalent to mapping a high-dimensional space to a low-dimensional space) K: R^p × R^p → R is used to approximate the number of times the observation at time t has been visited. Here R denotes the real number domain and the superscript p denotes the dimension, i.e. R^p is the set of p-dimensional real vectors (in particular, p = 1 gives a real number). Further, the pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, from which the initial local reward r_t^episodic is derived, specifically:

r_t^episodic ≈ 1 / ( sqrt( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c )

where f_i ∈ N_k means the controllable states are taken out of N_k in turn to compute the approximate visit count of the observation at time t. The inverse kernel function is:

K(x, y) = ε / ( d²(x, y) / d_m² + ε )

where ε is a very small constant (usually 0.001), d is the Euclidean distance, d_m² is a moving average of the squared k-nearest-neighbour distances, and the constant c is a very small value (typically 0.001). The moving average makes the inverse kernel more robust to the task being solved.
This sub-step further explains the controllable state. The embedded network f maps the current observation to a p-dimensional vector:

f: x_t ↦ f(x_t) ∈ R^p

i.e. the controllable state of the agent is extracted from the current observation. Because the environment may contain changes that are independent of the agent's behaviour, referred to as uncontrollable states, which are of no use in the reward calculation and may even harm the accuracy of the initial local reward, the states independent of the UAV's behaviour should be removed so that only the UAV's controllable state remains. Therefore, to avoid meaningless exploration, given two successive observations the embedded network f predicts the action the UAV took to move from one observation to the next, and the accuracy of the controllable state extracted by f is judged from this prediction. For example, the position of the enemy UAV is a controllable state the UAV needs to extract, whereas the number and positions of birds in the air do not need to be observed, so the bird information can be removed by the embedded network f.
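For concreteness, a possible shape of the embedded network f with three convolutional layers and one fully connected layer is sketched below in PyTorch. Treating the low-dimensional observation as a 1-D signal, the channel counts, kernel sizes and output dimension p are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Embedded network f: observation x_t -> p-dimensional controllable state f(x_t),
    built from three convolutional layers and one fully connected layer."""

    def __init__(self, obs_dim: int = 10, p: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * obs_dim, p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x.unsqueeze(1))            # (batch, 1, obs_dim) -> (batch, 32, obs_dim)
        return self.fc(h.flatten(start_dim=1))   # (batch, p) controllable state f(x_t)
```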
Global access frequency module construction sub-step S122:
Construct the global access frequency module with Random Network Distillation (RND). The observation x_t of the UAV at time t is input and an exploration factor α_t is computed, which modulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^local. The corrected local reward makes the rewards received by the whole network dense; with dense rewards the UAV is regulated better, so the value function of the deep Q-learning network converges faster and the UAV performs better. Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our UAV at this moment.
Specifically, in step S122, the observation x_t of the UAV at time t is input to Random Network Distillation (RND), and the output error err(x_t) between its two networks defines the exploration factor α_t:

α_t = 1 + (err(x_t) − μ_e) / σ_e

where σ_e and μ_e are the running standard deviation and mean of err(x_t). α_t acts as a multiplicative factor on the initial local reward r_t^episodic, and the corrected local reward r_t^local is expressed as:

r_t^local = r_t^episodic · min{ max{α_t, 1}, L }

where the value of α_t is limited to between 1 and L, L is a hyperparameter of at most 5, and the minimum value of α_t is set to 1 in order to avoid the situation where an observation visited too many times globally in some episode yields a small modulation factor and hence a corrected local reward of 0.
As a modulation factor, α_t vanishes over time, so the initial local reward r_t^episodic gradually fades to a non-modulated reward.
Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t, defined as:

r_t = r_t^global + β · r_t^local

where r_t^global and r_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar between 0 and 1 that balances the effect of the corrected local reward.
In step S120, the local access frequency module reflects how often the state of the UAV at a given time has been visited within one episode and corresponds to the initial local reward r_t^episodic; the two are negatively correlated, so if the local access frequency of an observation is very high, the corresponding initial local reward is very small. The global access frequency module reflects how often the state of the UAV at a given time has been visited over the whole training process (i.e. across many episodes) and corresponds to the exploration factor α_t; these are also negatively correlated, so if the global access frequency of an observation is high, the corresponding exploration factor is small.
After the local access frequency module and the global access frequency module are constructed, as shown in substeps S121 and S122, the present invention will train the corresponding networks in the two modules.
Embedded network training substep S123:
Append two fully connected layers and a softmax layer to the embedded network to output the probability of each action for the transition from time t to time t+1, and assemble these probabilities into a vector; simultaneously one-hot encode the action output at time t by the current Q network of the deep Q-learning network to obtain another vector; compute the mean squared error E of the two vectors and back-propagate it to update the parameters of the embedded network, until all episodes are finished. "All episodes are finished" means that the UAV iterates over many episodes during the whole training process, and training ends once all episodes have been trained.
Further, the training of the embedded network begins after the second observation is obtained, and lags behind the Random Network Distillation (RND) and the deep Q learning network because the embedded network needs to predict the action taken to shift between two observations at successive times from the observations at the two times.
In a preferred embodiment, the training of this sub-step may specifically be as follows.

Two successive observations x_t and x_{t+1} are input separately to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in the transition from observation x_t to observation x_{t+1}. In the present invention the action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:

p(a_1 | x_t, x_{t+1}), ..., p(a_{t-1} | x_t, x_{t+1}), p(a_t | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),

where p(a_1 | x_t, x_{t+1}) is the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The probabilities of all output actions form a vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded into a vector A, and the mean squared error E of P and A is computed as:

E = (1/m) Σ_{i=1..m} (P_i − A_i)²

where m is the number of available actions, m = 4. Finally, the parameters of the embedded network f are updated by back-propagating the result E, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
It should be noted that the embedded network f does not include the above fully connected layers and softmax layer; they are used only to train the embedded network to output a probability for each action. A larger output probability for an action indicates that the embedded network f considers the UAV most likely to have taken that action to move the observation from x_t to x_{t+1}.
Random Network Distillation (RND) training substep S124:
Training Random Network Distillation (RND) in the global access frequency module only requires training its prediction network, since the parameters of the target network are randomly initialized and remain fixed; the target network is denoted g(x_t), while the prediction network, whose parameters are continuously updated during training to approximate the target network, is denoted ĝ(x_t). Both networks finally output k-dimensional vectors.
The mean squared error err(x_t) between the outputs of the target network and the prediction network of Random Network Distillation (RND) is computed and back-propagated to update the parameters of the prediction network, while the parameters of the target network remain fixed, until all episodes are finished ("all episodes are finished" has the same meaning as above).
In a preferred embodiment, the training of this sub-step may specifically be as follows.

The observation x_t at time t is input to Random Network Distillation (RND), and the output error between the target network and the prediction network,

err(x_t) = || ĝ(x_t) − g(x_t) ||²,

is used to train the prediction network: its parameters are updated by back-propagation, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
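A minimal sketch of this RND pair is shown below: the target network is frozen after random initialization and only the prediction network is trained to match it. Layer sizes and the output dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_rnd(obs_dim: int = 10, out_dim: int = 16):
    """Return (target, predictor): two identically shaped networks; the target's
    parameters are randomly initialized and frozen, the predictor's are trainable."""
    def mlp():
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    target, predictor = mlp(), mlp()
    for p in target.parameters():
        p.requires_grad_(False)
    return target, predictor

def rnd_error(target, predictor, x_t: torch.Tensor) -> torch.Tensor:
    """err(x_t): mean squared error between the two outputs; back-propagating it
    updates only the prediction network."""
    with torch.no_grad():
        g = target(x_t)
    return F.mse_loss(predictor(x_t), g)
```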
Unmanned aerial vehicle intelligent network construction and training step S130:
Construct a deep Q-learning network as the UAV network, comprising a current Q network and a target Q̂ network with identical structure. The observation x_t is input to the current Q network to obtain the action selected by the UAV under the observation at each moment; the action is executed and the UAV interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The current Q network is trained with the transition tuples in the replay buffer and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes.
Specifically, the steps are as follows:
A deep Q-learning network is constructed as the UAV network, the action value function is extended, and a parameter β is newly added to adjust the weight of the corrected local reward; β can take different values so that the UAV network can learn different strategies. The observation x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; with the ε-greedy method either the action a_t with the maximum Q value is selected from the outputs of the current Q network, or an action a_t is selected at random. Generally ε takes the value 0.9, i.e. the action with the maximum Q value is chosen with probability 90% and an action is chosen at random with probability 10%. Then, under observation x_t, the current action a_t is executed, yielding the new observation x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r_t^local are weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Then w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all the sampled transition tuples are trained on at each sampling, and the current target Q value y_j is computed as follows:

y_j = r_j,  if the episode ends at time j+1
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻),  otherwise

where r_j is the extended reward earned by the UAV at time j, max_a Q̂(x_{j+1}, a, β, θ⁻) is the maximum of the Q values output by the target Q̂ network for all actions given the observation x_{j+1} at time j+1, and γ is a discount factor between 0 and 1. If t equals j+1, the episode has ended and the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean squared error between the target Q values y_j and the outputs of the current Q network is then computed,

Loss = (1/w) Σ_{j=1..w} (y_j − Q(x_j, a_j, β, θ))²,

and the parameter θ of the current Q network is updated by gradient descent, where w is the number of sampled transition tuples; after several episodes, usually 10, the parameter θ⁻ of the target Q̂ network is updated once.
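The ε-greedy selection used throughout this step can be sketched in a few lines; the sketch assumes a Q network with the (observation, β) interface used above, and the default ε = 0.9 matches the value stated in the text.

```python
import random
import torch

def select_action(q_net, x_t: torch.Tensor, beta: float,
                  epsilon: float = 0.9, num_actions: int = 4) -> int:
    """epsilon-greedy: with probability epsilon take the action with the maximum
    Q value, otherwise pick one of the num_actions actions uniformly at random."""
    if random.random() < epsilon:
        with torch.no_grad():
            return int(q_net(x_t.unsqueeze(0), beta).argmax(dim=1).item())
    return random.randrange(num_actions)
```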
Repeating the training and exiting step S140:
Repeat steps S120 to S130 with the observations obtained from the UAV's interaction with the environment, continuously training and updating the embedded network, the Random Network Distillation (RND) and the deep Q-learning network until the episodes are finished. The network that controls the UAV's flight comprises the trained embedded network, Random Network Distillation (RND) and deep Q-learning network, and the network structure that yields the maximum reward is selected to guide the UAV's flight.
Specifically, referring to fig. 5, the whole process of unmanned aerial vehicle combat training is shown.
The present invention further discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described agent-efficient global exploration method for fast convergence of value functions.
The invention also discloses an intelligent agent high-efficiency global exploration system with fast convergence of the value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. by introducing the correction local reward, the unmanned aerial vehicle keeps exploring all the time in the whole training process, and the unmanned aerial vehicle is encouraged to visit the observation information which is not visited and give a very high reward, so that the observation information obtained by interaction between the unmanned aerial vehicle and the environment can be visited in the training process, and the unmanned aerial vehicle can clearly know which observation information can obtain a higher reward; meanwhile, the acquired correction local reward can be regulated and controlled to extend the reward all the time, so that the unmanned aerial vehicle cannot converge to a suboptimal strategy in advance, and the unmanned aerial vehicle can learn the optimal strategy.
2. The observation information access times in the whole training process of the unmanned aerial vehicle are recorded through Random Network Distillation (RND) in the global access frequency module, and the observation information access conditions among different plots are associated, so that the unmanned aerial vehicle can access more observation information which is not accessed in the whole training process and the plot process. For example: if the initial local reward obtained by the local access frequency module is small, the observation information is accessed in the plot for a plurality of times, if the exploration factor obtained by the observation information in the global access frequency module is large, the observation information is accessed in the whole training process for the unmanned aerial vehicle for a plurality of times, the corrected local reward obtained by modulating the initial local reward and the exploration factor is not small, and the observation information is accessed in other plots. Under the condition of a large number of training times, the unmanned aerial vehicle can clearly know which observation information obtains the maximum reward, and the obtained strategy is optimal.
3. The traditional action-value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added to adjust the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. Different values of β trade off exploration against exploitation; once a good policy and good parameters have been obtained through exploration, β is set to 0 so that the unmanned aerial vehicle is regulated by the global reward alone and obtains a better policy. In this way the corrected local reward improves the parameters learned by the unmanned aerial vehicle, and by modulating β the training is finally governed only by the global reward.
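In code, the extended reward and the β schedule described in this advantage reduce to a few lines. The sketch below assumes a linear annealing of β to 0; the initial value 0.3 and the 80%-of-training cutoff are illustrative choices, since the text only states that β lies between 0 and 1 and is eventually set to 0.

```python
def extended_reward(r_global, r_corrected_local, beta):
    """Extended reward r_t = global reward + beta * corrected local reward.
    beta in [0, 1] weights exploration; beta = 0 leaves only the global reward."""
    return r_global + beta * r_corrected_local

def beta_schedule(episode, total_episodes, beta0=0.3):
    """Illustrative annealing: decay beta linearly and switch it off for the last 20%
    of training so the UAV is driven by the global reward alone."""
    return beta0 * max(0.0, 1.0 - episode / (0.8 * total_episodes))
```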
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented with a general-purpose computing device: they may be centralized on a single computing device, or they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected-local-reward network, which comprises a local access frequency module and a global access frequency module, specifically through the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises an embedded network f, the controllable state f(x_t), an episodic memory M and a k-nearest neighbor algorithm; the observation information x_t of the unmanned aerial vehicle at time t is input to the embedded network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^local harvested by our unmanned aerial vehicle at that time is calculated with the k-nearest neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input and the exploration factor α_t is calculated; the initial local reward r_t^local is modulated by α_t to obtain the corrected local reward r̂_t^local; finally the corrected local reward r̂_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at that time;
Embedded network training substep S123:
connecting two fully connected layers and a softmax layer after the embedded network to output the probabilities of the corresponding actions when time t transitions to time t+1, the group of probabilities forming one vector; simultaneously one-hot encoding the action output by the current Q network of the deep Q-learning network at time t to obtain another vector; calculating the mean squared error E of the two vectors and back-propagating it to update the parameters of the embedded network until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained;
random network distillation training substep S124:
calculating the mean squared error err(x_t) between the output values of the target network and the prediction network of the random network distillation, back-propagating the error to update the parameters of the prediction network while keeping the parameters of the target network unchanged, until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
constructing a deep Q-learning network as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure; the input observation information x_t passes through the current Q network of the deep Q-learning network to obtain the action selected by the unmanned aerial vehicle under the observation information at each moment; the action is executed and interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in a replay buffer; using the transition tuples in the replay buffer, the target Q value is obtained through the target Q̂ network, the loss between it and the output value of the current Q network is calculated, the current Q network is trained according to the loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes;
Repeating the training and exiting step S140:
repeating steps S120-S130 with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation and the deep Q-learning network until all episodes are finished, and selecting the network structure that enables the unmanned aerial vehicle to obtain the maximum reward to guide its flight.
2. The agent-efficient global exploration method according to claim 1,
the step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles, with the activity ranges of our unmanned aerial vehicle and the enemy unmanned aerial vehicle both inside this combat range; the observation information of our unmanned aerial vehicle is set as the position (x_0, y_0, z_0) of our unmanned aerial vehicle, its deflection angle φ_0 relative to the horizontal xoy plane and the flip angle ω_0 (< 90°) of its relative motion plane, together with the position (x_1, y_1, z_1) of the enemy unmanned aerial vehicle, its deflection angle φ_1 relative to the horizontal plane and the flip angle ω_1 (< 90°) of its relative motion plane;
the observation information x_t of our unmanned aerial vehicle is:

$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1),$$
the legal actions of our unmanned aerial vehicle are set to be eastward, southward, westward and northward;
the global reward is set as follows: it is determined by whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or avoids the enemy unmanned aerial vehicle's attack; if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to the value 1, if our unmanned aerial vehicle avoids the enemy unmanned aerial vehicle's attack the global reward is set to the value 0, and otherwise the global reward is set to the value -1; the global reward is denoted r_t^global.
3. the agent-efficient global exploration method according to claim 2,
in step S121, the embedded network f is a convolutional neural network with three convolutional layers and one fully connected layer; it extracts from the input observation information x_t the controllable state of the unmanned aerial vehicle as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M; at time t the episodic memory M holds the controllable states from time 0 to time t, namely {f(x_0), f(x_1), ..., f(x_t)}; the visit count is converted into a reward according to the state-action visits, and the initial local reward r_t^local is defined as:

$$r_t^{\mathrm{local}} = \frac{1}{\sqrt{n(f(x_t))}},$$

where n(f(x_t)) denotes the number of times the controllable state f(x_t) has been visited;
an inverse kernel function K: ℝ^p × ℝ^p → ℝ is used to approximate the number of times the observation information has been visited at time t, where ℝ denotes the real number field and the superscript p denotes the dimension; the pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k = {f_i}, i = 1, ..., k, so that the initial local reward r_t^local is specifically:

$$r_t^{\mathrm{local}} = \frac{1}{\sqrt{\sum_{f_i \in N_k} K\!\left(f(x_t), f_i\right)} + c},$$

where f_i ∈ N_k means that the controllable states are taken from N_k in turn to approximate the number of times the observation information is visited at time t;
the inverse kernel function is:

$$K(x, y) = \frac{\epsilon}{\dfrac{d^2(x, y)}{d_m^2} + \epsilon},$$

where ε is taken as 0.001, d is the Euclidean distance, d_m^2 is a running average of the squared distances to the k nearest neighbors, and c is taken as 0.001.
4. The agent-efficient global exploration method according to claim 3,
in step S122, the observation information x_t of the unmanned aerial vehicle at time t is input to the random network distillation, and the error err(x_t) between the outputs of its two networks is used to define the exploration factor α_t:

$$\alpha_t = 1 + \frac{\mathrm{err}(x_t) - \mu_e}{\sigma_e},$$

where σ_e and μ_e are the running standard deviation and running mean of err(x_t); α_t acts as a multiplicative factor on the initial local reward r_t^local, and the corrected local reward r̂_t^local is expressed as:

$$\hat{r}_t^{\mathrm{local}} = r_t^{\mathrm{local}} \cdot \min\bigl\{\max\{\alpha_t, 1\},\, L\bigr\},$$

where α_t is clipped to lie between 1 and L, L being a hyperparameter taken as 5: values of α_t above L are set to L and values below 1 are set to 1;
finally the corrected local reward r̂_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t:

$$r_t = r_t^{\mathrm{global}} + \beta\, \hat{r}_t^{\mathrm{local}},$$

where r_t^global and r̂_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar with a value between 0 and 1.
5. The agent-efficient global exploration method according to claim 4,
step S123 specifically includes:
two consecutive observations x_t and x_{t+1} are each input to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in moving from observation x_t to observation x_{t+1}; the action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, and are specifically expressed as:

$$p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}\bigl(h(f(x_t), f(x_{t+1}))\bigr),$$

where p(a_1 | x_t, x_{t+1}) denotes the probability of taking action a_1 when moving from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method; the output action probabilities form the vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded to obtain the vector A, and the mean squared error E of the vectors P and A is computed as:

$$E = \frac{1}{m}\sum_{i=1}^{m}\bigl(A_i - P_i\bigr)^2,$$

finally the result E is back-propagated to update the parameters of the embedded network f, and training is repeated until all episodes are finished, where m is the number of available actions and m = 4.
6. The agent-efficient global exploration method according to claim 5,
step S124 specifically includes:
the observation information x_t at time t is input to the random network distillation, and the error between the outputs of the target network g and the prediction network ĝ,

$$\mathrm{err}(x_t) = \left\lVert \hat{g}(x_t) - g(x_t) \right\rVert^{2},$$

is used to train the prediction network; the parameters of the prediction network are updated by back-propagation with gradient descent, and training is repeated until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained.
7. The agent-efficient global exploration method according to claim 6,
step S130 specifically includes:
a deep Q-learning network is constructed as the unmanned aerial vehicle network; the action-value function is extended by newly adding the parameter β to adjust the weight of the corrected local reward; the observation information x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; the action a_t corresponding to the maximum Q value is selected from these Q values by the ε-greedy method, or an action a_t is selected at random, with ε taken as 0.9, i.e. the action corresponding to the maximum Q value is chosen with probability 90% and a random action with probability 10%; under observation information x_t the current action a_t is executed to obtain the new observation information x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r̂_t^local are weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer; then w transition tuples (x_j, a_j, r_j, x_{j+1}) are sampled from the replay buffer for batch gradient descent, where in batch gradient descent all the transition tuples in the replay buffer are sampled each time for training, and the current target Q value y_j is calculated as:

$$y_j =
\begin{cases}
r_j, & \text{if time } j+1 \text{ ends the episode},\\[2pt]
r_j + \gamma \max_{a'} \hat{Q}\bigl(x_{j+1}, a', \beta;\, \theta^{-}\bigr), & \text{otherwise},
\end{cases}$$

where r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_{a'} Q̂(x_{j+1}, a', β; θ⁻) denotes the maximum of the Q values output by the target Q̂ network for all actions given the observation information x_{j+1} at time j+1, and γ is a discount factor with a value between 0 and 1;
if time j+1 marks the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward; the mean squared error between the target Q value y_j and the output value of the current Q network is then calculated as:

$$L(\theta) = \frac{1}{w}\sum_{j=1}^{w}\bigl(y_j - Q(x_j, a_j, \beta;\, \theta)\bigr)^2,$$

and the parameter θ of the current Q network is updated by gradient descent, where w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
8. An agent-efficient global exploration system with fast convergence of value functions, comprising a storage medium,
the storage medium storing computer-executable instructions which, when executed by a processor, perform the agent-efficient global exploration method for fast convergence of value functions of any of claims 1-7.
CN202210421995.3A 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function Active CN114690623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function


Publications (2)

Publication Number Publication Date
CN114690623A true CN114690623A (en) 2022-07-01
CN114690623B CN114690623B (en) 2022-10-25

Family

ID=82144133


Country Status (1)

Country Link
CN (1) CN114690623B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190014488A1 (en) * 2017-07-06 2019-01-10 Futurewei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
US20200134445A1 (en) * 2018-10-31 2020-04-30 Advanced Micro Devices, Inc. Architecture for deep q learning
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method
CN114371729A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI YAN,ET AL.: "A Multi-Agent Motion Prediction and Tracking Method Based on Non-Cooperative Equilibrium", 《MATHEMATICS》 *
LI YAN,ET AL.: "An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium", 《ELECTRONICS》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761850A (en) * 2022-11-16 2023-03-07 智慧眼科技股份有限公司 Face recognition model training method, face recognition device and storage medium
CN115761850B (en) * 2022-11-16 2024-03-22 智慧眼科技股份有限公司 Face recognition model training method, face recognition method, device and storage medium
CN115826621A (en) * 2022-12-27 2023-03-21 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115826621B (en) * 2022-12-27 2023-12-01 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115857556A (en) * 2023-01-30 2023-03-28 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Also Published As

Publication number Publication date
CN114690623B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
US11150670B2 (en) Autonomous behavior generation for aircraft
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Li et al. F2a2: Flexible fully-decentralized approximate actor-critic for cooperative multi-agent reinforcement learning
US20220404831A1 (en) Autonomous Behavior Generation for Aircraft Using Augmented and Generalized Machine Learning Inputs
Shen et al. Theoretically principled deep RL acceleration via nearest neighbor function approximation
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN115730743A (en) Battlefield combat trend prediction method based on deep neural network
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN114814741A (en) DQN radar interference decision method and device based on priority important sampling fusion
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model
CN116880540A (en) Heterogeneous unmanned aerial vehicle group task allocation method based on alliance game formation
CN116663637A (en) Multi-level agent synchronous nesting training method
CN115994484A (en) Air combat countergame strategy optimizing system based on multi-population self-adaptive orthoevolutionary algorithm
CN115903901A (en) Output synchronization optimization control method for unmanned cluster system with unknown internal state
CN116451762A (en) Reinforced learning method based on PPO algorithm and application thereof
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN115212549A (en) Adversary model construction method under confrontation scene and storage medium
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
KR20230079804A (en) Device based on reinforcement learning to linearize state transition and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant