CN114690623B - Intelligent agent efficient global exploration method and system for rapid convergence of value function

Info

Publication number
CN114690623B
CN114690623B (application CN202210421995.3A)
Authority
CN
China
Prior art keywords
network
unmanned aerial
aerial vehicle
reward
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210421995.3A
Other languages
Chinese (zh)
Other versions
CN114690623A (en)
Inventor
林旺群
李妍
徐菁
王伟
田成平
刘波
王锐华
孙鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences filed Critical Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202210421995.3A priority Critical patent/CN114690623B/en
Publication of CN114690623A publication Critical patent/CN114690623A/en
Application granted granted Critical
Publication of CN114690623B publication Critical patent/CN114690623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/024Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An agent efficient global exploration method and system for rapid convergence of a value function are disclosed. The method gives the unmanned aerial vehicle a clearer training target through an extended reward formed by combining a corrected local reward with a global reward, and adopts a universal value function approximator so that the unmanned aerial vehicle keeps exploring the environment throughout training, the initial local reward being modulated by an exploration factor to capture global correlation. As a result, training of the unmanned aerial vehicle agent is efficient and the agent can ultimately learn the optimal combat strategy. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring during the whole training process; because the corrected local reward continuously regulates the extended reward, the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and can learn the optimal strategy. Observation-visit information is correlated across different episodes, so the unmanned aerial vehicle visits more previously unvisited observations both within each episode and over the whole training process.

Description

Intelligent agent efficient global exploration method and system for rapid convergence of value function
Technical Field
The invention relates to the field of virtual-simulation intelligent confrontation, and in particular to an agent efficient global exploration method and system with a fast-converging value function, which improves the learning performance of an unmanned aerial vehicle while it defeats enemy unmanned aerial vehicles and evades their attacks.
Background
In recent years, with the increasing demand for unmanned and intelligent unmanned aerial vehicles, artificial intelligence technology has developed rapidly. Unmanned aerial vehicles have attracted wide attention in both military and civil fields, and intelligent confrontation oriented to virtual simulation has become a hot research topic.
Because traditional agent training methods suffer from sparse reward settings, the unmanned aerial vehicle often explores blindly during training; once it finds a suboptimal strategy it is likely to stop exploring and switch to exploitation, which makes it difficult to learn the optimal strategy. The limitation of this approach is that the drone must accumulate a large amount of experience through repeated blind exploration, which is inefficient and may never yield the best strategy.
Building on traditional methods, some researchers have proposed integrating a corrected local reward. This technique keeps the unmanned aerial vehicle exploring purposefully throughout a combat scenario and lets it learn an optimal strategy to some extent, but it has a limitation: the corrected local reward is not further regulated, the corrected local reward of each episode is related only to that episode, and there is no correlation across all episodes of the whole training process, so agent training is too inefficient.
Therefore, how to overcome the shortcomings of the prior art, make the corrected local reward and the global reward cooperate with each other, keep the agent continuously exploring in the combat scenario, and prevent the unmanned aerial vehicle agent from performing meaningless learning has become an urgent problem to solve.
Disclosure of Invention
The invention aims to provide an agent efficient global exploration method and system with a fast-converging value function. An extended reward formed by combining a corrected local reward with a global reward gives the unmanned aerial vehicle a clearer training target, and a universal value function approximator ensures that the unmanned aerial vehicle keeps exploring the environment throughout training, the initial local reward being modulated by an exploration factor to capture global correlation. As a result, training of the unmanned aerial vehicle agent is efficient and the agent can ultimately learn the optimal combat strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat preparation information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected local reward network, which comprises a local access frequency module and a global access frequency module, further comprising the following sub-steps:
local access frequency module construction substep S121:
the local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M and k-nearest neighbors; the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state $f(x_t)$ of the unmanned aerial vehicle agent, the controllable state $f(x_t)$ is stored in the episodic memory M, and the initial local reward $r_t^{episodic}$ harvested by the own unmanned aerial vehicle at this moment is calculated by a k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information $x_t$ of the unmanned aerial vehicle at time t is input and the exploration factor $\alpha_t$ is calculated, which modulates the initial local reward $r_t^{episodic}$ to obtain the corrected local reward $\hat{r}_t^{episodic}$; the corrected local reward makes the reward signal of the whole network dense, and with a dense reward the unmanned aerial vehicle is regulated better, so the value function in the deep Q learning network converges faster and the unmanned aerial vehicle performs better; finally the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$ harvested by the own unmanned aerial vehicle at this moment;
Embedded network training substep S123:
two fully connected layers and one softmax layer are attached after the embedded network to output the probability of each action taken when transitioning from time t to time t+1, and these probabilities form one vector; meanwhile the action output by the current Q network at time t in the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained;
random net distillation training substep S124:
the output values of the target network and the prediction network of the random network distillation are used to compute the mean square error $err(x_t)$, which is back-propagated to update the parameters of the prediction network while the parameters of the target network are kept unchanged, until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
constructing a deep Q learning network as the unmanned aerial vehicle network, comprising a current Q network and a target $\hat{Q}$ network with the same structure; the observation information $x_t$ is input, the action selected by the unmanned aerial vehicle under the observation at each moment is obtained through the current Q network of the deep Q network, the action is executed and the environment is interacted with to obtain a transition tuple $(x_t, a_t, r_t, x_{t+1})$, which is stored in a replay buffer; the transition tuples in the replay buffer are used to obtain the target Q value and compute the loss against the output of the current Q network, the current Q network is trained according to this loss and its parameter θ is updated, and after every several episodes the parameter $\theta^-$ of the target $\hat{Q}$ network is updated;
Repeating the training and exiting step S140:
repeating steps S120-S130 with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation and the deep Q learning network until the episodes are finished, and selecting the network structure that yields the maximum reward to guide the flight of the unmanned aerial vehicle.
Optionally, the step S110 specifically includes:
setting the spatial combat range of the unmanned aerial vehicles, with the movement ranges of the own and enemy unmanned aerial vehicles lying inside it; the observation information of the own unmanned aerial vehicle is set as the own position $(x_0, y_0, z_0)$, the deflection angle $\varphi_0$ relative to the horizontal xoy plane, the flip angle $\omega_0$ relative to the motion plane ($\omega_0 < 90°$), and the enemy position $(x_1, y_1, z_1)$, deflection angle $\varphi_1$ relative to the horizontal plane and flip angle $\omega_1$ relative to the motion plane ($\omega_1 < 90°$),
so the observation information $x_t$ of the own unmanned aerial vehicle is:
$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1);$$
assuming that the legal actions of the unmanned aerial vehicle of our party are set to be eastward, southward, westward and northward;
the global reward is set as follows: the own unmanned aerial vehicle receives a global reward according to whether it destroys the enemy unmanned aerial vehicle or evades the enemy unmanned aerial vehicle's attack; if the own unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1, if it evades the enemy unmanned aerial vehicle's attack the global reward is set to 0, and otherwise it is set to -1, the global reward being denoted:
$$r_t^{global} = \begin{cases} 1, & \text{enemy UAV destroyed} \\ 0, & \text{enemy attack evaded} \\ -1, & \text{otherwise;} \end{cases}$$
Optionally, in step S121, the embedded network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it extracts from the input observation information $x_t$ the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted $f(x_t)$, and the p-dimensional vector is then stored in the episodic memory M; at time t the episodic memory M stores the controllable states from time 0 to time t, expressed as $\{f(x_0), f(x_1), \ldots, f(x_t)\}$; the state-action visit count is converted into a reward, and the initial local reward $r_t^{episodic}$ is defined as:
$$r_t^{episodic} = \frac{1}{\sqrt{n(f(x_t))}},$$
wherein $n(f(x_t))$ represents the number of times the controllable state $f(x_t)$ has been visited;
an inverse kernel function $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is used to approximate the number of times the observation information has been visited at time t, where $\mathbb{R}$ denotes the real number field and the superscript p denotes the dimension; the pseudo-count $n(f(x_t))$ is approximated using the k controllable states in the episodic memory M adjacent to $f(x_t)$, denoted $N_k = \{f_i\}_{i=1}^{k}$, the k adjacent controllable states extracted from the episodic memory M, so that the initial local reward $r_t^{episodic}$ is specifically:
$$r_t^{episodic} \approx \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(x_t), f_i)} + c},$$
wherein $f_i \in N_k$ means that the controllable states are taken from $N_k$ in turn to calculate the number of times the observation information is visited at time t;
the inverse kernel function is expressed as:
$$K(x, y) = \frac{\epsilon}{\frac{d^2(x, y)}{d_m^2} + \epsilon},$$
where $\epsilon$ is taken as 0.001, d is the Euclidean distance, $d_m^2$ is a running average of the squared distances, and c is taken as 0.001.
Optionally, in step S122, the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the random network distillation, and the output error $err(x_t)$ of its two networks defines the exploration factor $\alpha_t$:
$$\alpha_t = 1 + \frac{err(x_t) - \mu_e}{\sigma_e},$$
wherein $\sigma_e$ and $\mu_e$ are the running standard deviation and running mean of $err(x_t)$, and $\alpha_t$ is the multiplicative factor applied to the initial local reward $r_t^{episodic}$; the corrected local reward $\hat{r}_t^{episodic}$ is expressed as:
$$\hat{r}_t^{episodic} = r_t^{episodic} \cdot \min\{\max\{\alpha_t, 1\}, L\},$$
wherein $\alpha_t$ is limited to values between 1 and L, L is a hyperparameter with a maximum of 5, and the minimum of $\alpha_t$ is set to 1;
finally, the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$:
$$r_t = r_t^{global} + \beta \, \hat{r}_t^{episodic},$$
where $r_t^{global}$ and $\hat{r}_t^{episodic}$ denote the global reward and the corrected local reward respectively, and β is a positive scalar ranging from 0 to 1.
Optionally, step S123 specifically comprises:
two successive observations $x_t$ and $x_{t+1}$ are respectively input into the embedded network f to extract the controllable states $f(x_t)$ and $f(x_{t+1})$; two fully connected layers and one softmax layer then output the probability of each action that transfers observation $x_t$ to observation $x_{t+1}$. The action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:
$$p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_{m-1} \mid x_t, x_{t+1}), p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}(h(f(x_t), f(x_{t+1}))),$$
wherein $p(a_1 \mid x_t, x_{t+1})$ represents the probability of taking action $a_1$ when transferring from observation $x_t$ to observation $x_{t+1}$, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the output action of the current Q network in the deep Q learning network is one-hot encoded to obtain a vector A, and the mean square error E of the P vector and the A vector is calculated as:
$$E = \frac{1}{m} \sum_{i=1}^{m} (P_i - A_i)^2,$$
where m is the number of actions that can be taken, m = 4. Finally the parameters of the embedded network f are updated by back-propagating the result E, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Optionally, step S124 specifically comprises:
the observation information $x_t$ at time t is input into the random network distillation, and the error output by the target network and the prediction network,
$$err(x_t) = \lVert \hat{g}(x_t) - g(x_t) \rVert^2,$$
is used to train the prediction network; the parameters of the prediction network $\hat{g}$ are updated by back propagation, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Optionally, step S130 specifically comprises:
constructing a deep Q learning network as the unmanned aerial vehicle network, extending the action value function by adding the parameter β, which adjusts the weight of the corrected local reward; the observation information $x_t$ is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network, and the action $a_t$ corresponding to the maximum Q value is selected from these Q values by an ε-greedy method, or an action $a_t$ is selected at random; ε takes the value 0.9, i.e. the action with the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. When the observation is $x_t$, the current action $a_t$ is executed to obtain the new observation $x_{t+1}$ and the global reward $r_t^{global}$; the global reward $r_t^{global}$ and the corrected local reward $\hat{r}_t^{episodic}$ are weighted and summed to obtain the extended reward $r_t$, and the transition tuple $(x_t, a_t, r_t, x_{t+1})$ is stored in the replay buffer; subsequently w transition tuples $(x_j, a_j, r_j, x_{j+1})$, j = 1, 2, ..., w, are sampled from the replay buffer with batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled and trained at a time, and the current target Q value $y_j$ is calculated as:
$$y_j = \begin{cases} r_j, & \text{if the episode ends at time } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-), & \text{otherwise,} \end{cases}$$
wherein $r_j$ is the extended reward harvested by the unmanned aerial vehicle at time j, $\max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-)$ represents the maximum of the Q values that the target $\hat{Q}$ network outputs for all actions given the observation $x_{j+1}$ at time j+1, and γ represents a discount factor with a value between 0 and 1;
if t = j + 1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value $y_j$ is the output of the target $\hat{Q}$ network multiplied by the discount factor and added to the extended reward; the mean square error between the target Q value $y_j$ and the output of the current Q network,
$$Loss = \frac{1}{w} \sum_{j=1}^{w} \left( y_j - Q(x_j, a_j, \beta, \theta) \right)^2,$$
is computed and the parameter θ of the current Q network is updated by gradient descent, where w represents the number of sampled transition tuples; the parameter $\theta^-$ of the target $\hat{Q}$ network is updated every 10 episodes.
The invention further discloses an intelligent agent efficient global exploration system with fast convergence of a value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer executable instructions, and when the computer executable instructions are executed by a processor, the intelligent agent efficient global exploration method for fast convergence of the value function is executed.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring throughout the whole training process: it is encouraged to visit observations it has not yet visited, and such visits yield a high reward, so that the observations obtained from interaction with the environment are visited during training and the unmanned aerial vehicle clearly knows which observations yield a higher reward. At the same time, the corrected local reward continuously regulates the extended reward, so the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and can learn the optimal strategy.
2. The number of visits to each observation over the whole training process is recorded by the random network distillation (RND) in the global access frequency module, and the visit statistics of different episodes are thereby correlated, so the unmanned aerial vehicle visits more previously unvisited observations both within each episode and over the whole training process. For example: if the initial local reward obtained from the local access frequency module is small, the observation has been visited many times within the current episode; if the exploration factor obtained from the global access frequency module is large, the observation has been visited only a few times by the unmanned aerial vehicle over the whole training process, so the corrected local reward obtained by modulating the initial local reward with the exploration factor is not small, indicating that the observation has rarely been visited in other episodes. With a large number of training iterations the unmanned aerial vehicle clearly knows which observations yield the maximum reward, and the resulting strategy is optimal.
3. The traditional action value function is improved: in addition to the original observation, action and network parameters, a weight parameter β for the corrected local reward is added to adjust the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. The balance between exploration and exploitation can be adjusted by setting different values of β; after a good strategy and good parameters have been obtained through exploration, β is set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and an even better strategy is obtained. In this way the corrected local reward improves the parameters learned by the unmanned aerial vehicle, and finally, by modulating the parameter β, training is regulated only by the global reward.
Drawings
FIG. 1 is a flow diagram of an agent efficient global exploration method for rapid convergence of a value function, in accordance with a specific embodiment of the present invention;
FIG. 2 is a flow chart of the steps of UAV corrective local reward network construction and training for a smart agent efficient global exploration method with fast convergence of value functions in accordance with an embodiment of the present invention;
FIG. 3 is an architecture diagram for correcting a local reward according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of an embedded network in accordance with a specific embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process according to an embodiment of the present invention.
Detailed Description
The following describes the terms related to the present invention:
1. deep Q learning network
Deep Q learning is a representative value-function-based method of deep reinforcement learning. It contains two neural networks, called the current Q network and the target $\hat{Q}$ network; the two networks have the same structure. In traditional deep Q learning the two networks are $Q(x_j, a_j, \theta)$ and $\hat{Q}(x_{j+1}, a_{j+1}, \theta^-)$ respectively. The invention controls the proportion of the corrected local reward in the extended reward through the parameter β and introduces β into the action value function, so $Q(x_j, a_j, \beta, \theta)$ and $\hat{Q}(x_{j+1}, a_{j+1}, \beta, \theta^-)$ respectively denote the outputs of the current Q network and the target $\hat{Q}$ network. The input of the current Q network is the observation at the current time t; the input of the target $\hat{Q}$ network is the observation at the next moment, i.e. time t+1; the output is the state-action value of every action. In the invention the current Q network of the unmanned aerial vehicle network is the network that needs to be learned and is used to control the unmanned aerial vehicle agent; the parameters of the target $\hat{Q}$ network are copied directly from the current Q network after a fixed number of episodes, and the parameter θ of the current Q network is updated by gradient descent, training by minimizing the loss function Loss:
$$y_j = r_j + \gamma \max_{a_{j+1}} \hat{Q}(x_{j+1}, a_{j+1}, \beta, \theta^-),$$
$$Loss = (y_j - Q(x_j, a_j, \beta, \theta))^2.$$
2. Episode
An episode is a sequence formed by the observations, actions and extended rewards generated while the unmanned aerial vehicle interacts with the environment, represented as a set of transition tuples formed from this experience. In the invention an episode refers to the whole process of an unmanned aerial vehicle combat from beginning to end.
3. Transition tuple
A transition tuple is the basic unit that makes up an episode: each time the unmanned aerial vehicle agent interacts with the environment it generates the observation $x_t$, the action (instruction) $a_t$, the extended reward $r_t$ and the next observation $x_{t+1}$; this quadruple $(x_t, a_t, r_t, x_{t+1})$ is called a transition tuple and is placed into the replay buffer.
4. Replay buffer
The replay buffer is a buffer area in memory or on disk used to store the sequence of transition tuples. The stored transition tuples can be used repeatedly for training the deep Q learning network. In the invention the replay buffer stores the transition tuples obtained from the interaction between the unmanned aerial vehicle and the environment; its maximum capacity is N and its structure is similar to a queue, so when the number of transition tuples exceeds N the tuples stored earliest in the replay buffer are deleted first.
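The replay-buffer behavior described above can be captured in a few lines; the following is a minimal sketch in Python, assuming a FIFO buffer of maximum capacity N, with illustrative names (`ReplayBuffer`, `push`, `sample`) that are not taken from the patent:

```python
# A minimal sketch of the replay buffer: a bounded FIFO store of transition
# tuples (x_t, a_t, r_t, x_{t+1}) that can be sampled repeatedly for training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        # deque with maxlen drops the oldest transition tuple automatically
        # once more than `capacity` tuples have been stored.
        self.buffer = deque(maxlen=capacity)

    def push(self, x_t, a_t, r_t, x_next):
        # store one transition tuple (x_t, a_t, r_t, x_{t+1})
        self.buffer.append((x_t, a_t, r_t, x_next))

    def sample(self, w: int):
        # draw w transition tuples for a training step
        return random.sample(self.buffer, min(w, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```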
5. k-nearest neighbor
Given a sample, the k training samples that are closest to it in the training set are found based on some distance metric (e.g., euclidean distance), and then prediction is performed based on the information of the k "neighbors". In the invention, the access times of certain observation information obtained by the unmanned aerial vehicle in a plot are approximately calculated by utilizing a k-neighbor idea so as to obtain the initial local reward of the unmanned aerial vehicle for the observation information. If the number of accesses of the observation information is larger, the initial local reward is smaller, and conversely, if the number of accesses of the observation information is smaller, the initial local reward is larger.
6. Random Network Distillation (RND)
Random network distillation randomly initializes two networks: the parameters of one, called the target network, are fixed, while the other, called the prediction network, is trained. In the invention the input of the RND network is the observation $x_t$ obtained after the unmanned aerial vehicle interacts with the environment; training brings the output of the prediction network close to that of the target network. The smaller the output error of the two networks, the more often the observation $x_t$ has been visited by the unmanned aerial vehicle since the start of training, which means a smaller exploration factor and therefore a smaller contribution to the corrected local reward, i.e. a smaller corrected local reward.
7. General value function approximator (UVFA)
Generally, different tasks require different action value functions, and different optimal value functions are needed to quantify the schemes for completing different tasks. In the invention the corrected local reward is weighted to represent different tasks, i.e. tasks with different degrees of exploration. Therefore the invention extends the action value function in deep Q learning from the original $Q(x_t, a_t, \theta)$ to $Q(x_t, a_t, \beta, \theta)$, where the parameter β is the weight of the corrected local reward; with different values of β the corrected local reward plays different roles, and different action value functions can be mixed together through β.
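As an illustration of the extended action value function $Q(x_t, a_t, \beta, \theta)$, the sketch below conditions a small Q network on the scalar β by concatenating it to the observation; it assumes PyTorch, and the layer sizes and class name are illustrative assumptions rather than the patent's implementation:

```python
# A minimal sketch of a beta-conditioned Q network: one network represents
# value functions for different exploration weights beta.
import torch
import torch.nn as nn

class BetaConditionedQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        # x: (batch, obs_dim), beta: (batch, 1) -> Q values for every action
        return self.net(torch.cat([x, beta], dim=-1))
```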
8. Kernel function and inverse kernel function
A kernel function computes an inner product in a high-dimensional space by means of an operation on inner products in the original feature space, so the original low-dimensional space does not need to be explicitly expanded into points of the high-dimensional space, which reduces computational complexity. The inverse kernel function works the other way round: its original feature space is the high-dimensional space, which is reduced to a low-dimensional space.
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to FIG. 1, a flow chart of a method for intelligent agent efficient global exploration with fast convergence of the value function according to an embodiment of the present invention is shown.
Unmanned aerial vehicle combat readiness information setting step S110:
and setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements.
Specifically, in the present step,
setting a space fight range of the unmanned aerial vehicle, wherein the fight range is a three-dimensional space, the moving ranges of unmanned aerial vehicles of my party and enemy are in the space fight range of the unmanned aerial vehicle, for example, the ranges of two horizontal coordinates are [ -1000m,1000m ], the range of a vertical coordinate is not restricted, and the freedom of the upper and lower moving ranges is ensured.
The observation information of the own unmanned aerial vehicle is set as the own position $(x_0, y_0, z_0)$, the deflection angle $\varphi_0$ relative to the horizontal xoy plane, the flip angle $\omega_0$ relative to the motion plane ($\omega_0 < 90°$), and the enemy position $(x_1, y_1, z_1)$, deflection angle $\varphi_1$ relative to the horizontal plane and flip angle $\omega_1$ relative to the motion plane ($\omega_1 < 90°$), so the observation information $x_t$ of the own unmanned aerial vehicle is:
$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1).$$
suppose the legal actions of my drone are set to east, south, west and north.
Global reward setting: the own unmanned aerial vehicle receives a global reward according to whether it destroys the enemy unmanned aerial vehicle or evades the enemy's attack. If the own unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1; if it evades the attack of the enemy unmanned aerial vehicle the global reward is set to 0; otherwise it is set to -1, i.e. when the unmanned aerial vehicle neither destroys the enemy unmanned aerial vehicle nor evades its attack, the more actions it takes, the more negative global reward it accumulates. The global reward is denoted:
$$r_t^{global} = \begin{cases} 1, & \text{enemy UAV destroyed} \\ 0, & \text{enemy attack evaded} \\ -1, & \text{otherwise.} \end{cases}$$
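A minimal sketch of this setup is given below, assuming the 10-dimensional observation and the {1, 0, -1} global reward described above; the function names and boolean flags are illustrative assumptions:

```python
# A minimal sketch of the combat-readiness setup of step S110: the observation
# x_t and the sparse global reward.
import numpy as np

ACTIONS = ["east", "south", "west", "north"]  # legal actions of the own UAV

def make_observation(own_pos, own_phi, own_omega, enemy_pos, enemy_phi, enemy_omega):
    # x_t = (x0, y0, z0, phi0, omega0, x1, y1, z1, phi1, omega1)
    return np.array([*own_pos, own_phi, own_omega, *enemy_pos, enemy_phi, enemy_omega],
                    dtype=np.float32)

def global_reward(enemy_destroyed: bool, attack_evaded: bool) -> float:
    # 1 if the enemy UAV is destroyed, 0 if its attack is evaded, -1 otherwise
    if enemy_destroyed:
        return 1.0
    if attack_evaded:
        return 0.0
    return -1.0
```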
unmanned aerial vehicle correction local reward network construction and training step S120:
referring to fig. 2, a flow chart of the steps of drone remediation local reward network construction and training is shown.
Referring to fig. 3, the unmanned aerial vehicle rectification local reward network is constructed and comprises a local access frequency module and a global access frequency module.
Local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M and k-nearest neighbors. The observation information $x_t$ of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state $f(x_t)$ (i.e. the controllable information) of the unmanned aerial vehicle agent, the controllable state $f(x_t)$ is stored in the episodic memory M, and the initial local reward $r_t^{episodic}$ harvested by the own unmanned aerial vehicle at this moment is calculated by a k-nearest-neighbor algorithm.
Specifically, in step S121, the embedded network f is a convolutional neural network, see fig. 4, comprising three convolutional layers and one fully connected layer; it extracts from the input observation information $x_t$ the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted $f(x_t)$, and the p-dimensional vector is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as $\{f(x_0), f(x_1), \ldots, f(x_t)\}$. Following exploration methods that convert state-action visit counts into rewards, the initial local reward $r_t^{episodic}$ is defined as:
$$r_t^{episodic} = \frac{1}{\sqrt{n(f(x_t))}},$$
wherein $n(f(x_t))$ denotes the number of times the controllable state $f(x_t)$ has been visited, i.e. the more controllable states in the episode are similar to the observation $x_t$ (the more it has been visited), the smaller the initial local reward, and vice versa.
Since the state space is continuous it is difficult to decide whether two controllable states are identical, so an inverse kernel function (equivalent to mapping a high-dimensional space to a low-dimensional space) $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is used to approximate the number of times the observation has been visited at time t, where $\mathbb{R}$ denotes the real number field and the superscript p denotes the dimension, i.e. $\mathbb{R}^p$ is the set of p-dimensional vectors over the real numbers (in particular, p = 1 gives the real numbers). Further, the pseudo-count $n(f(x_t))$ is approximated using the k controllable states in the episodic memory M that are adjacent to $f(x_t)$; with $N_k = \{f_i\}_{i=1}^{k}$ denoting the k adjacent controllable states extracted from the episodic memory M, the initial local reward $r_t^{episodic}$ is specifically:
$$r_t^{episodic} \approx \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(x_t), f_i)} + c},$$
wherein $f_i \in N_k$ means that the controllable states are taken from $N_k$ in turn to calculate the number of times the observation has been visited at time t;
the inverse kernel function is expressed as:
$$K(x, y) = \frac{\epsilon}{\frac{d^2(x, y)}{d_m^2} + \epsilon},$$
where $\epsilon$ is a very small constant (usually 0.001), d is the Euclidean distance, $d_m^2$ is a running average of the squared distances, and the constant c is a very small value (usually 0.001). The running average makes the inverse kernel more robust to the task being solved.
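The following sketch illustrates how the initial local reward could be computed from the episodic memory with the k-nearest-neighbor approximation and the inverse kernel above; the embedding $f(x_t)$ is assumed to be available as a NumPy vector, the value of k is illustrative, and the running average of squared distances is assumed to be maintained by the caller:

```python
# A minimal sketch of the local access frequency module: the initial local
# reward r_t^episodic from the k nearest controllable states in memory M.
import numpy as np

EPS = 0.001        # epsilon in the inverse kernel
C = 0.001          # pseudo-count constant c
K_NEIGHBORS = 10   # k (illustrative value, not fixed by the text)

def episodic_reward(f_xt: np.ndarray, memory: list, running_mean_d2: float) -> float:
    if not memory:
        return 1.0 / np.sqrt(C)
    # squared Euclidean distances to everything stored this episode
    d2 = np.array([np.sum((f_xt - f_i) ** 2) for f_i in memory])
    # k nearest controllable states N_k
    d2_knn = np.sort(d2)[:K_NEIGHBORS]
    # inverse kernel K(f(x_t), f_i) = eps / (d^2 / d_m^2 + eps)
    kernel = EPS / (d2_knn / max(running_mean_d2, 1e-8) + EPS)
    # pseudo-count -> initial local reward
    return 1.0 / (np.sqrt(kernel.sum()) + C)
```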
This sub-step further explains the controllable state. The embedded network is a mapping $f: x_t \mapsto f(x_t) \in \mathbb{R}^p$, i.e. the controllable state of the agent is extracted from the current observation and mapped to a p-dimensional vector. The environment may contain changes that are independent of the agent's behaviour, referred to as uncontrollable states; these are useless for the reward calculation and may even affect the accuracy of the initial local reward, so the states that are independent of the unmanned aerial vehicle's behaviour should be eliminated, leaving only the controllable state of the unmanned aerial vehicle. Therefore, to avoid meaningless exploration, given two successive observations the embedded network f predicts the action taken by the unmanned aerial vehicle to move from one observation to the next, and the accuracy of the controllable state extracted by f is judged from this prediction. For example: the position of the enemy unmanned aerial vehicle is a controllable state that the unmanned aerial vehicle needs to extract, whereas the number and positions of birds in the air need not be observed, so the bird information can be removed by the embedded network f.
Global access frequency module construction sub-step S122:
A global access frequency module is constructed by random network distillation (RND); the observation information $x_t$ of the unmanned aerial vehicle at time t is input and the exploration factor $\alpha_t$ is calculated, which modulates the initial local reward $r_t^{episodic}$ to obtain the corrected local reward $\hat{r}_t^{episodic}$. The corrected local reward makes the reward signal of the whole network dense; with a dense reward the unmanned aerial vehicle is regulated better, so the value function in the deep Q learning network converges faster and the unmanned aerial vehicle performs better. Finally the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$ harvested by the own unmanned aerial vehicle at this moment.
Specifically, in step S122, the observation information $x_t$ of the unmanned aerial vehicle at time t is input into the random network distillation (RND), and the output error $err(x_t)$ of its two networks defines the exploration factor $\alpha_t$:
$$\alpha_t = 1 + \frac{err(x_t) - \mu_e}{\sigma_e},$$
wherein $\sigma_e$ and $\mu_e$ are the running standard deviation and running mean of $err(x_t)$, and $\alpha_t$ is the multiplicative factor applied to the initial local reward $r_t^{episodic}$; the corrected local reward $\hat{r}_t^{episodic}$ is expressed as:
$$\hat{r}_t^{episodic} = r_t^{episodic} \cdot \min\{\max\{\alpha_t, 1\}, L\},$$
wherein $\alpha_t$ is limited to values between 1 and L, L being a hyperparameter of at most 5; the minimum of $\alpha_t$ is set to 1 to avoid the situation in which an observation visited globally too many times yields a small modulation factor and hence a corrected local reward of 0.
As a modulation factor, $\alpha_t$ vanishes over time, so the initial local reward $r_t^{episodic}$ fades to a non-modulated reward.
Finally, the corrected local reward $\hat{r}_t^{episodic}$ and the global reward $r_t^{global}$ of step S110 are weighted and summed to obtain the extended reward $r_t$, defined as:
$$r_t = r_t^{global} + \beta \, \hat{r}_t^{episodic},$$
where $r_t^{global}$ and $\hat{r}_t^{episodic}$ denote the global reward and the corrected local reward respectively, and β is a positive scalar ranging from 0 to 1 that balances the effect of the corrected local reward.
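Putting the exploration factor, the clipping to [1, L] and the weighted sum together, a minimal sketch (assuming the running mean and standard deviation of $err(x_t)$ are tracked elsewhere, and with illustrative names) could look like this:

```python
# A minimal sketch of the global access frequency module output: the RND error
# err(x_t) is normalized, clipped to [1, L] and multiplies the initial local
# reward; the extended reward adds the global reward with weight beta.
import numpy as np

L_MAX = 5.0  # hyperparameter L

def corrected_and_extended_reward(err_t: float, mu_e: float, sigma_e: float,
                                  r_episodic: float, r_global: float,
                                  beta: float) -> float:
    alpha_t = 1.0 + (err_t - mu_e) / max(sigma_e, 1e-8)
    alpha_t = float(np.clip(alpha_t, 1.0, L_MAX))   # clip to [1, L]
    r_corrected = r_episodic * alpha_t              # corrected local reward
    return r_global + beta * r_corrected            # extended reward r_t
```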
In step S120, the local access frequency module corresponds to how often the unmanned aerial vehicle visits a given state within one episode and yields the initial local reward $r_t^{episodic}$; the two are negatively correlated, so if the local access frequency of an observation is very high the corresponding initial local reward is very small. The global access frequency module corresponds to how often the unmanned aerial vehicle visits a given state over the whole training process (i.e. over many episodes) and yields the exploration factor $\alpha_t$; these are also negatively correlated, so if the global access frequency of an observation is high the corresponding exploration factor is small.
After the local access frequency module and the global access frequency module have been constructed as in sub-steps S121 and S122, the invention trains the corresponding networks of the two modules.
Embedded network training substep S123:
Two fully connected layers and one softmax layer are attached after the embedded network to output the probability of each action taken when transitioning from time t to time t+1, and these probabilities form one vector; meanwhile the action output by the current Q network at time t in the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
Further, the training of the embedded network begins after the second observation is obtained, and it lags behind the random network distillation (RND) and the deep Q learning network because the embedded network must predict, from two successive observations, the action taken in the transition between them.
In a preferred embodiment, the training of this sub-step may specifically be,
Two successive observations $x_t$ and $x_{t+1}$ are respectively input into the embedded network f to extract the controllable states $f(x_t)$ and $f(x_{t+1})$; two fully connected layers and one softmax layer then output the probability of every action that transfers observation $x_t$ to observation $x_{t+1}$. For example, in the invention the action probabilities output through the embedded network correspond to the four actions east, south, west and north and add up to 1, specifically expressed as $p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_{m-1} \mid x_t, x_{t+1}), p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}(h(f(x_t), f(x_{t+1})))$, in which $p(a_1 \mid x_t, x_{t+1})$ represents the probability of taking action $a_1$ when transferring from observation $x_t$ to observation $x_{t+1}$, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the output action of the current Q network in the deep Q learning network is one-hot encoded to obtain a vector A, and the mean square error E of the P vector and the A vector is calculated, specifically:
$$E = \frac{1}{m} \sum_{i=1}^{m} (P_i - A_i)^2,$$
where m is the number of actions that can be taken, m = 4. Finally the parameters of the embedded network f are updated by back-propagating the result E, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
It should be noted that the embedded network f itself does not include the fully connected layers and the softmax layer; they are only used during training of the embedded network to output the probability of each action. If a given output action probability is large, the embedded network f considers that the unmanned aerial vehicle most likely took that action, causing the observation to transfer from $x_t$ to $x_{t+1}$.
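A minimal sketch of one training step of this sub-step is shown below, assuming PyTorch; the embedded network f and the action-prediction head h are assumed to be `nn.Module` instances registered with the optimizer, and all names are illustrative:

```python
# A minimal sketch of sub-step S123: predicted action probabilities P from
# (f(x_t), f(x_{t+1})) are matched against the one-hot vector A of the action
# output by the current Q network, using the mean square error E.
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_train_step(f: nn.Module, h: nn.Module, optimizer: torch.optim.Optimizer,
                         x_t: torch.Tensor, x_next: torch.Tensor,
                         action_taken: torch.Tensor, n_actions: int = 4) -> float:
    # predicted probabilities P over the m = 4 actions
    p = F.softmax(h(torch.cat([f(x_t), f(x_next)], dim=-1)), dim=-1)
    # one-hot vector A of the action output by the current Q network
    a = F.one_hot(action_taken, num_classes=n_actions).float()
    loss = F.mse_loss(p, a)          # mean square error E
    optimizer.zero_grad()
    loss.backward()                  # back-propagate to update f and h
    optimizer.step()
    return loss.item()
```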
Random Network Distillation (RND) training substep S124:
Training the random network distillation (RND) in the global access frequency module only requires training its prediction network, since the parameters of the target network are randomly initialized and then kept fixed; the target network is expressed as $g: x_t \mapsto g(x_t) \in \mathbb{R}^k$, and the prediction network, whose parameters are continuously updated during training to approach the target network, is expressed as $\hat{g}: x_t \mapsto \hat{g}(x_t) \in \mathbb{R}^k$; both networks finally output k-dimensional vectors.
The output values of the target network and the prediction network of the random network distillation (RND) are used to compute the mean square error $err(x_t)$, which is back-propagated to update the parameters of the prediction network while the parameters of the target network remain unchanged, until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
In a preferred embodiment, the training of this sub-step may specifically be,
The observation information $x_t$ at time t is input into the random network distillation (RND), and the error output by the target network and the prediction network,
$$err(x_t) = \lVert \hat{g}(x_t) - g(x_t) \rVert^2,$$
is used to train the prediction network; the parameters of the prediction network $\hat{g}$ are updated by back propagation, and training is repeated until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle iterates over many episodes during the whole training process, and training ends after all episodes have been trained.
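A minimal sketch of the random network distillation and its training step, assuming PyTorch, is given below; the network sizes, the output dimension k and the names are illustrative assumptions:

```python
# A minimal sketch of sub-step S124: the prediction network is trained to match
# a fixed, randomly initialized target network; err(x_t) is both the training
# loss and the novelty signal used for the exploration factor.
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim: int, k: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        for p in self.target.parameters():
            p.requires_grad_(False)   # target network stays fixed

    def error(self, x_t: torch.Tensor) -> torch.Tensor:
        # err(x_t) = || g_hat(x_t) - g(x_t) ||^2
        return ((self.predictor(x_t) - self.target(x_t)) ** 2).sum(dim=-1)

def rnd_train_step(rnd: RND, optimizer: torch.optim.Optimizer, x_t: torch.Tensor) -> float:
    loss = rnd.error(x_t).mean()
    optimizer.zero_grad()
    loss.backward()                   # only the prediction network is updated
    optimizer.step()
    return loss.item()
```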
Unmanned aerial vehicle intelligent network construction and training step S130:
A deep Q learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target $\hat{Q}$ network with the same structure. The observation information $x_t$ is input, the action selected by the unmanned aerial vehicle under the observation at each moment is obtained through the current Q network of the deep Q network, the action is executed and the environment is interacted with to obtain a transition tuple $(x_t, a_t, r_t, x_{t+1})$, which is stored in a replay buffer; the transition tuples in the replay buffer are used to train the current Q network and update its parameter θ, and after every several episodes the parameter $\theta^-$ of the target $\hat{Q}$ network is updated.
Specifically, the method comprises the following steps:
A deep Q learning network is constructed as the unmanned aerial vehicle network and the action value function is extended by adding the parameter β, which adjusts the weight of the corrected local reward; β can take different values so that the unmanned aerial vehicle network learns different strategies. The observation information $x_t$ is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network, and the action $a_t$ corresponding to the maximum Q value is selected from these Q values by an ε-greedy method, or an action $a_t$ is selected at random; generally ε takes the value 0.9, i.e. the action with the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. Then, with observation $x_t$, the current action $a_t$ is executed to obtain the new observation $x_{t+1}$ and the global reward $r_t^{global}$; the global reward $r_t^{global}$ and the corrected local reward $\hat{r}_t^{episodic}$ are weighted and summed to obtain the extended reward $r_t$, and the transition tuple $(x_t, a_t, r_t, x_{t+1})$ is stored in the replay buffer. Subsequently w transition tuples $(x_j, a_j, r_j, x_{j+1})$, j = 1, 2, ..., w, are sampled from the replay buffer with batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled and trained at a time, and the current target Q value $y_j$ is calculated as:
$$y_j = \begin{cases} r_j, & \text{if the episode ends at time } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-), & \text{otherwise,} \end{cases}$$
wherein $r_j$ is the extended reward harvested by the unmanned aerial vehicle at time j, $\max_{a'} \hat{Q}(x_{j+1}, a', \beta, \theta^-)$ represents the maximum of the Q values that the target $\hat{Q}$ network outputs for all actions given the observation $x_{j+1}$ at time j+1, and γ represents a discount factor with a value between 0 and 1. If t = j + 1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value $y_j$ is the output of the target $\hat{Q}$ network multiplied by the discount factor and added to the extended reward. The mean square error between the target Q value $y_j$ and the output of the current Q network,
$$Loss = \frac{1}{w} \sum_{j=1}^{w} \left( y_j - Q(x_j, a_j, \beta, \theta) \right)^2,$$
is then computed and the parameter θ of the current Q network is updated by gradient descent, where w represents the number of sampled transition tuples; after several episodes, usually 10, the parameter $\theta^-$ of the target $\hat{Q}$ network is updated once.
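A minimal sketch of this update, assuming PyTorch and a β-conditioned Q network of the kind sketched earlier, is given below; the tensor layouts, the names and the use of a `done` flag for episode termination are illustrative assumptions:

```python
# A minimal sketch of the step-S130 update: the target Q value y_j uses the
# extended reward r_j and the target Q-hat network (with the beta input), and
# the current Q network is trained on the mean square error.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, beta: float, gamma: float = 0.99):
    # batch: tensors x_j (w, obs), a_j (w,), r_j (w,), x_next (w, obs), done (w,)
    x_j, a_j, r_j, x_next, done = batch
    b = torch.full((x_j.shape[0], 1), beta)
    with torch.no_grad():
        max_next_q = target_net(x_next, b).max(dim=1).values
        # y_j = r_j at episode end, else r_j + gamma * max_a' Q_hat(x_{j+1}, a', beta, theta^-)
        y_j = r_j + gamma * (1.0 - done) * max_next_q
    q_taken = q_net(x_j, b).gather(1, a_j.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_taken, y_j)   # (1/w) * sum_j (y_j - Q(x_j, a_j, beta, theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                  # gradient step on theta
    return loss.item()
```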
Repeating the training and exiting step S140:
Steps S120-S130 are repeated with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation (RND) and the deep Q learning network until the episodes are finished; the network controlling the flight of the unmanned aerial vehicle comprises the trained embedded network, random network distillation (RND) and deep Q learning network, and the network structure that yields the maximum reward is selected to guide the flight of the unmanned aerial vehicle.
In particular, referring to fig. 5, the whole process of unmanned aerial vehicle combat training is shown.
The present invention further discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described agent-efficient global exploration method for fast convergence of value functions.
The invention also discloses an intelligent agent high-efficiency global exploration system with fast convergence of the value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer executable instructions, and when the computer executable instructions are executed by a processor, the intelligent agent efficient global exploration method for fast convergence of the value function is executed.
In summary, the invention has the following advantages:
1. by introducing the correction local reward, the unmanned aerial vehicle keeps exploring all the time in the whole training process, and the unmanned aerial vehicle is encouraged to visit the observation information which is not visited and give a very high reward, so that the observation information obtained by interaction between the unmanned aerial vehicle and the environment can be visited in the training process, and the unmanned aerial vehicle can clearly know which observation information can obtain higher reward; meanwhile, the acquired correction local reward can be regulated and controlled to extend the reward all the time, so that the unmanned aerial vehicle cannot converge to a suboptimal strategy in advance, and the unmanned aerial vehicle can learn the optimal strategy.
2. The observation information access times in the whole training process of the unmanned aerial vehicle are recorded through Random Network Distillation (RND) in the global access frequency module, and the observation information access conditions among different plots are associated, so that the unmanned aerial vehicle can access more observation information which is not accessed in the whole training process and the plot process. For example: if the initial local reward obtained by the local access frequency module is small, the observation information is accessed in the plot for a plurality of times, if the exploration factor obtained by the observation information in the global access frequency module is large, the observation information is accessed in the whole training process for the unmanned aerial vehicle for a plurality of times, the corrected local reward obtained by modulating the initial local reward and the exploration factor is not small, and the observation information is accessed in other plots. Under the condition of a large number of training times, the unmanned aerial vehicle can clearly know which observation information obtains the maximum reward, and the obtained strategy is optimal.
3. The traditional action-value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added, which adjusts the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. By setting different values of β, the balance between exploration and exploitation can be adjusted: a good strategy and good parameters are first obtained through exploration and learning, and β is then set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and an even better strategy is obtained. In this way the parameters learned by the unmanned aerial vehicle are improved by the corrected local reward, and in the final stage training is regulated only by the global reward by modulating the β parameter (see the sketch after this list).
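As referenced in advantage 3, the following short sketch illustrates how the β weight might be scheduled; the initial β value and the cut-over episode are hypothetical, since the description only states that β is eventually set to 0.

```python
# A minimal sketch of a beta schedule; beta0 and explore_until are assumptions.
def extended_reward(r_global, r_local_corrected, beta):
    # r_t = r_t^e + beta * r_t^i : beta controls how much exploration matters
    return r_global + beta * r_local_corrected

def beta_schedule(episode, explore_until=800, beta0=0.3):
    # explore with beta0 first, then train on the global reward alone (beta = 0)
    return beta0 if episode < explore_until else 0.0

for ep in (0, 799, 800, 1200):
    print(ep, extended_reward(1.0, 0.5, beta_schedule(ep)))
```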
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general-purpose computing device; they may be centralized on a single computing device; alternatively, they may be implemented with program code executable by a computing device, so that they can be stored in a memory device and executed by a computing device; or they may be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the invention with reference to specific preferred embodiments, and it should not be concluded that the specific implementation of the invention is limited to these descriptions. A person of ordinary skill in the art may make a number of simple deductions or substitutions without departing from the inventive concept, and these should all be considered to fall within the scope of protection defined by the claims as filed.

Claims (7)

1. An intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
the step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles and setting the range of motion of our unmanned aerial vehicle and the enemy unmanned aerial vehicle within that combat range; setting the observation information of our unmanned aerial vehicle to be the position (x_0, y_0, z_0) of our unmanned aerial vehicle, its deflection angle φ_0 relative to the horizontal xoy plane and its flip angle ω_0 relative to the motion plane, together with the position (x_1, y_1, z_1) of the enemy unmanned aerial vehicle, its deflection angle φ_1 relative to the horizontal plane and its flip angle ω_1 relative to the motion plane, wherein the flip angles satisfy ω_0 < 90° and ω_1 < 90°, with corresponding constraints on the deflection angles φ_0 and φ_1;
the observation information x_t of our unmanned aerial vehicle is:
x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1);
the legal actions of our unmanned aerial vehicle are set to be eastward, southward, westward and northward;
the global reward is set as follows: the global reward depends on whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or evades the attack of the enemy unmanned aerial vehicle; if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle, the global reward is set to the value 1; if our unmanned aerial vehicle evades the attack of the enemy unmanned aerial vehicle, the global reward is set to the value 0; otherwise the global reward is set to the value -1; the global reward is denoted r_t^e;
an unmanned aerial vehicle corrected-local-reward network construction and training step S120:
the unmanned aerial vehicle corrected-local-reward network comprises a local access frequency module and a global access frequency module, and its construction specifically comprises the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises an embedding network f, the controllable state f(x_t), an episodic memory M and a k-nearest-neighbour module; the observation information x_t of the unmanned aerial vehicle at time t is input into the embedding network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our unmanned aerial vehicle at this time is calculated by a k-nearest-neighbour algorithm;
Global access frequency module construction sub-step S122:
the global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input to compute an exploration factor α_t, the initial local reward r_t^episodic is modulated by α_t to obtain the corrected local reward r_t^i, and finally the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at time t;
Embedded network training substep S123:
two fully connected layers and a softmax layer are connected after the embedding network to output the probabilities of the actions taken when transferring from time t to time t+1; this group of probabilities forms one vector, while the action output at time t by the current Q network in the deep Q-learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedding network until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained;
random network distillation training substep S124:
the mean square error err(x_t) between the output values of the target network and the prediction network of the random network distillation is computed; the error is back-propagated to update the parameters of the prediction network while the parameters of the target network are kept unchanged, until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
a deep Q-learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure; the observation information x_t is input, the action selected by the unmanned aerial vehicle under the observation information at each time step is obtained through the current Q network of the deep Q-learning network, the action is executed and the interaction with the environment yields a transition tuple (x_t, a_t, r_t, x_{t+1}) which is stored in a replay buffer; using the transition tuples in the replay buffer, the target Q value is obtained through the target Q̂ network, the loss is calculated from the target Q value and the output value of the current Q network, the current Q network is trained according to this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated at intervals of several episodes;
repeated training and exit step S140:
steps S120-S130 are repeated using the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, and the embedding network, the random network distillation and the deep Q-learning network are continuously trained and updated until all episodes are finished, and the network structure that yields the largest reward is selected to guide the flight of the unmanned aerial vehicle.
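A small sketch of the combat-readiness settings of step S110 in claim 1 follows; the Observation container and the engagement-outcome flags are illustrative assumptions, while the four legal actions and the +1/0/-1 global reward follow the claim.

```python
# Sketch of the S110 observation, action and global-reward settings (illustrative).
from dataclasses import dataclass

ACTIONS = ("east", "south", "west", "north")   # legal actions of our drone

@dataclass
class Observation:
    # our drone: position and attitude
    x0: float
    y0: float
    z0: float
    phi0: float      # deflection angle w.r.t. the horizontal xoy plane
    omega0: float    # flip angle w.r.t. the motion plane (< 90 degrees)
    # enemy drone: position and attitude
    x1: float
    y1: float
    z1: float
    phi1: float
    omega1: float

    def as_vector(self):
        return [self.x0, self.y0, self.z0, self.phi0, self.omega0,
                self.x1, self.y1, self.z1, self.phi1, self.omega1]

def global_reward(destroyed_enemy: bool, evaded_attack: bool) -> int:
    if destroyed_enemy:
        return 1      # enemy drone destroyed
    if evaded_attack:
        return 0      # enemy attack evaded
    return -1         # otherwise

obs = Observation(0, 0, 100, 10, 30, 50, 50, 120, -5, 20)
print(len(obs.as_vector()), global_reward(False, True))
```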
2. The agent-efficient global exploration method according to claim 1,
in step S121, the embedding network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it takes the observation information x_t as input and extracts the controllable state of the unmanned aerial vehicle as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M; at time t the episodic memory M stores the controllable states from time 0 to time t, i.e. {f(x_0), f(x_1), ..., f(x_t)}; the state visit count is converted into a reward, and the initial local reward r_t^episodic is defined as:
r_t^episodic = 1 / √(n(f(x_t))),
wherein n(f(x_t)) denotes the number of times the controllable state f(x_t) has been visited;
an inverse kernel function K: ℝ^p × ℝ^p → ℝ is used to approximate the number of times the observation information has been visited at time t, where ℝ denotes the real number field and the superscript p denotes the dimension; the pseudo-count n(f(x_t)) is approximated using the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, so that the initial local reward r_t^episodic becomes:
r_t^episodic = 1 / ( √( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c ),
wherein f_i ∈ N_k means that the controllable states are taken from N_k one by one to compute the approximate number of visits of the observation information at time t;
the inverse kernel function is expressed as:
K(u, v) = ε / ( d²(u, v) / d_m² + ε ),
wherein ε is taken to be 0.001, d is the Euclidean distance, d_m² is the running average of the squared Euclidean distances to the k nearest neighbours, and c is taken to be 0.001.
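The episodic pseudo-count reward of claim 2 can be sketched in numpy as follows; it assumes the running average d_m² is approximated over the current k nearest neighbours only and uses plain vectors in place of the CNN features, so it is an illustration of the formula rather than the patented implementation.

```python
# Numpy sketch of the claim-2 initial local reward (assumptions noted inline).
import numpy as np

def inverse_kernel(d2, dm2, eps=0.001):
    # K(u, v) = eps / (d^2(u, v) / d_m^2 + eps)
    return eps / (d2 / dm2 + eps)

def initial_local_reward(state, episodic_memory, k=10, c=0.001):
    if len(episodic_memory) == 0:
        return 1.0                               # nothing visited yet in this episode (assumption)
    M = np.asarray(episodic_memory)
    d2 = np.sum((M - state) ** 2, axis=1)        # squared Euclidean distances to stored f(x_i)
    nearest = np.sort(d2)[:k]                    # N_k: the k nearest controllable states
    dm2 = max(float(np.mean(nearest)), 1e-8)     # running average of squared distances (assumption)
    pseudo_count = float(np.sum(inverse_kernel(nearest, dm2)))
    return 1.0 / (np.sqrt(pseudo_count) + c)     # r_t^episodic

episodic_memory = []
for t in range(5):
    f_x = np.array([0.1 * t, 0.0, 1.0])          # stand-in for the controllable state f(x_t)
    print(t, round(initial_local_reward(f_x, episodic_memory), 3))
    episodic_memory.append(f_x)
```

Revisiting nearby states shrinks the distances, raises the pseudo-count and therefore lowers the reward, which is the within-episode behaviour the claim describes.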
3. The agent-efficient global exploration method according to claim 2,
in step S122, the observation information x_t of the unmanned aerial vehicle at time t is fed into the random network distillation, and the error err(x_t) between the outputs of its two networks is used to define the exploration factor α_t:
α_t = 1 + (err(x_t) − μ_e) / σ_e,
wherein σ_e and μ_e are the running standard deviation and the running mean of err(x_t); α_t is used as a multiplicative factor of the initial local reward r_t^episodic, and the corrected local reward r_t^i is expressed as:
r_t^i = r_t^episodic · min(max(α_t, 1), L),
wherein α_t is clipped to lie between 1 and L, L being a hyperparameter whose value is at most 5; when α_t is smaller than 1 it is set to 1;
finally, the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t:
r_t = r_t^e + β · r_t^i,
wherein r_t^e and r_t^i respectively denote the global reward and the corrected local reward, and β is a positive scalar ranging from 0 to 1.
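The claim-3 modulation can be sketched as follows; the random-feature target/predictor pair standing in for the RND networks and the online mean/variance tracker are assumptions made only to keep the example self-contained.

```python
# Sketch of the exploration factor and extended reward of claim 3 (toy RND pair).
import numpy as np

rng = np.random.default_rng(1)
W_target = rng.normal(size=(10, 8))      # fixed, randomly initialised target network
W_pred = np.zeros((10, 8))               # prediction network (trained in S124)

def rnd_error(x):
    return float(np.mean((x @ W_target - x @ W_pred) ** 2))   # err(x_t)

class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, v):
        self.n += 1
        d = v - self.mean
        self.mean += d / self.n
        self.m2 += d * (v - self.mean)
    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0

stats = RunningStats()

def extended_reward(x, r_episodic, r_global=0.0, beta=0.3, L=5.0):
    err = rnd_error(x)
    stats.update(err)
    alpha = 1.0 + (err - stats.mean) / (stats.std + 1e-8)      # exploration factor
    alpha = float(np.clip(alpha, 1.0, L))                      # clipped to [1, L]
    r_i = r_episodic * alpha                                   # corrected local reward
    return r_global + beta * r_i                               # extended reward r_t

for t in range(3):
    x = rng.normal(size=10)
    print(t, round(extended_reward(x, r_episodic=0.5), 3))
```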
4. The agent-efficient global exploration method according to claim 3,
step S123 specifically includes:
two successive observations x_t and x_{t+1} are respectively input into the embedding network f to extract the controllable states f(x_t) and f(x_{t+1}), and two fully connected layers followed by one softmax layer then output the probabilities of all actions taken when transferring from observation x_t to observation x_{t+1}; the action probabilities output by the embedding network correspond to the probabilities of the four actions east, south, west and north, their sum is 1, and they are specifically expressed as:
p(a_1 | x_t, x_{t+1}), ..., p(a_m | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),
wherein p(a_1 | x_t, x_{t+1}) denotes the probability of taking action a_1 when transferring from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method; the probability of each output action forms a P vector, the action output by the current Q network in the deep Q-learning network is one-hot encoded to obtain an A vector, and the mean square error E of the P vector and the A vector is calculated:
E = (1/m) · Σ_{i=1}^{m} (P_i − A_i)²,
wherein m is the number of actions that can be taken, m = 4; finally the parameters of the embedding network f are updated by back-propagating the calculation result E, and training continues until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained.
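A forward-pass sketch of the claim-4 loss E between the action-probability vector P and the one-hot vector A follows; the linear stand-ins for f and h are assumptions, and the back-propagation update of their parameters is only indicated in a comment.

```python
# Forward pass for the claim-4 inverse-dynamics loss E (toy linear f and h).
import numpy as np

rng = np.random.default_rng(2)
M_ACTIONS = 4                                          # east, south, west, north

W_f = rng.normal(scale=0.1, size=(10, 6))              # stand-in for the embedding network f
W_h = rng.normal(scale=0.1, size=(12, M_ACTIONS))      # hidden layer h over [f(x_t), f(x_{t+1})]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def action_probabilities(x_t, x_next):
    f_t, f_next = x_t @ W_f, x_next @ W_f              # controllable states
    return softmax(np.concatenate([f_t, f_next]) @ W_h)  # p(a | x_t, x_{t+1})

def inverse_dynamics_loss(x_t, x_next, action_taken):
    P = action_probabilities(x_t, x_next)              # P vector
    A = np.eye(M_ACTIONS)[action_taken]                # one-hot A vector
    return float(np.mean((P - A) ** 2))                # E = (1/m) * sum_i (P_i - A_i)^2

x_t, x_next = rng.normal(size=10), rng.normal(size=10)
print("E =", round(inverse_dynamics_loss(x_t, x_next, action_taken=2), 4))
# In S123 this E would be back-propagated to update the parameters of f and h.
```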
5. The agent-efficient global exploration method according to claim 4,
step S124 specifically includes:
the observation information x_t at time t is input into the random network distillation, and the error between the outputs of the target network and the prediction network is computed as
err(x_t) = ‖ ĝ(x_t; θ̂) − g(x_t) ‖²,
wherein g denotes the fixed target network, ĝ denotes the prediction network and θ̂ denotes the parameters of the prediction network; the prediction network is trained by back-propagating this error with gradient descent to update its parameters θ̂, and training continues until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle repeatedly iterates over a number of episodes during the whole training process and training ends once all episodes have been trained.
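A minimal sketch of the claim-5 prediction-network update follows; both RND networks are reduced to single linear layers and plain SGD on the squared error stands in for the gradient-descent update, which is an assumption made for illustration.

```python
# Toy RND update: only the prediction network is trained, the target stays frozen.
import numpy as np

rng = np.random.default_rng(3)
W_target = rng.normal(size=(10, 8))              # frozen target network
W_pred = np.zeros((10, 8))                       # prediction network, to be trained
lr = 1e-2

def rnd_step(x):
    global W_pred
    target, pred = x @ W_target, x @ W_pred
    err = float(np.mean((pred - target) ** 2))   # err(x_t)
    grad = np.outer(x, 2.0 * (pred - target) / target.size)
    W_pred -= lr * grad                          # gradient-descent update of the predictor
    return err

for t in range(3):
    x = rng.normal(size=10)
    print(t, round(rnd_step(x), 4))
```

Because the predictor only fits states it has already seen, err(x_t) stays large on novel observations, which is what makes it usable as the global exploration signal of claim 3.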
6. The agent-efficient global exploration method according to claim 5,
step S130 specifically includes:
a deep Q-learning network is constructed as the unmanned aerial vehicle network; the action value function is extended by adding a β parameter that adjusts the weight of the corrected local reward; the observation information x_t is used as the input of the current Q network to obtain the Q values corresponding to all actions output by the current Q network, and the ε-greedy method is used either to select the action a_t corresponding to the maximum Q value among the Q values output by the current Q network or to select an action a_t at random; ε takes the value 0.9, i.e. the action corresponding to the maximum Q value is selected with probability 90% and a random action is selected with probability 10%; the current action a_t is executed under the observation information x_t to obtain the new observation information x_{t+1} and the global reward r_t^e; the global reward r_t^e and the corrected local reward r_t^i are then weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer; then w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all transition tuples in the replay buffer are sampled for each training step, and the current target Q value y_j is calculated as follows:
y_j = r_j, if the episode ends at time j+1,
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β; θ⁻), otherwise,
wherein r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_a Q̂(x_{j+1}, a, β; θ⁻) denotes the maximum of the Q values that the target Q̂ network outputs for all actions according to the observation information x_{j+1} at time j+1, and γ denotes a discount factor whose value lies between 0 and 1;
if t = j+1, which indicates the end of the episode, the output target Q value equals the extended reward value at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward; the mean square error between the target Q value y_j and the output value of the current Q network is then calculated as
(1/w) · Σ_{j=1}^{w} ( y_j − Q(x_j, a_j, β; θ) )²,
and the parameter θ of the current Q network is updated by gradient descent, wherein w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
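A numpy sketch of the claim-6 target computation and batch loss follows; the linear Q functions and the random batch are illustrative assumptions, and β is assumed to enter only through the extended reward r_j already stored in the transition tuples.

```python
# Toy target Q value and TD loss in the spirit of claim 6 (linear Q stand-ins).
import numpy as np

rng = np.random.default_rng(4)
N_ACTIONS, GAMMA = 4, 0.95
theta = rng.normal(scale=0.1, size=(10, N_ACTIONS))        # current Q network parameters
theta_minus = theta.copy()                                 # target Q-hat network parameters

def q(x, params):
    return x @ params                                      # Q values for all actions

def target_q(r_j, x_next, done):
    if done:                                               # episode ends at j + 1
        return r_j
    return r_j + GAMMA * float(np.max(q(x_next, theta_minus)))

def td_loss(batch):
    # batch: list of (x_j, a_j, r_j, x_{j+1}, done) transition tuples
    errs = []
    for x_j, a_j, r_j, x_next, done in batch:
        y_j = target_q(r_j, x_next, done)
        errs.append((y_j - float(q(x_j, theta)[a_j])) ** 2)
    return float(np.mean(errs))                            # (1/w) * sum_j (y_j - Q(x_j, a_j))^2

batch = [(rng.normal(size=10), int(rng.integers(N_ACTIONS)), float(rng.choice([1.0, 0.0, -1.0])),
          rng.normal(size=10), bool(rng.integers(2))) for _ in range(8)]
print("loss:", round(td_loss(batch), 4))
# theta would then be updated by gradient descent on this loss; theta_minus is
# refreshed from theta once every 10 episodes, as stated in claim 6.
```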
7. An agent-efficient global exploration system with fast convergence of value functions, comprising a storage medium,
the storage medium storing computer-executable instructions which, when executed by a processor, perform the agent-efficient global exploration method for fast convergence of value functions of any of claims 1-6.
CN202210421995.3A 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function Active CN114690623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Publications (2)

Publication Number Publication Date
CN114690623A CN114690623A (en) 2022-07-01
CN114690623B true CN114690623B (en) 2022-10-25

Family

ID=82144133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210421995.3A Active CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function

Country Status (1)

Country Link
CN (1) CN114690623B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761850B (en) * 2022-11-16 2024-03-22 智慧眼科技股份有限公司 Face recognition model training method, face recognition method, device and storage medium
CN115826621B (en) * 2022-12-27 2023-12-01 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115857556B (en) * 2023-01-30 2023-07-14 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
US20200134445A1 (en) * 2018-10-31 2020-04-30 Advanced Micro Devices, Inc. Architecture for deep q learning
CN114371729B (en) * 2021-12-22 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Multi-Agent Motion Prediction and Tracking Method Based on Non-Cooperative Equilibrium; Li Yan, et al.; Mathematics; 2022-01-05; full text *
An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium; Li Yan, et al.; Electronics; 2021-11-29; full text *

Also Published As

Publication number Publication date
CN114690623A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
US11150670B2 (en) Autonomous behavior generation for aircraft
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN113221444B (en) Behavior simulation training method for air intelligent game
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
Montazeri et al. Continuous state/action reinforcement learning: A growing self-organizing map approach
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
CN115018017A (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN116663637A (en) Multi-level agent synchronous nesting training method
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN110450164A (en) Robot control method, device, robot and storage medium
CN115903901A (en) Output synchronization optimization control method for unmanned cluster system with unknown internal state
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Shen et al. Theoretically principled deep RL acceleration via nearest neighbor function approximation
CN115212549A (en) Adversary model construction method under confrontation scene and storage medium
CN114814741A (en) DQN radar interference decision method and device based on priority important sampling fusion
Hachiya et al. Efficient sample reuse in EM-based policy search
Zhao et al. Convolutional fitted Q iteration for vision-based control problems
US20220404831A1 (en) Autonomous Behavior Generation for Aircraft Using Augmented and Generalized Machine Learning Inputs
CN113537318B (en) Robot behavior decision method and device simulating human brain memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant