CN114690623A - Intelligent agent efficient global exploration method and system for rapid convergence of value function - Google Patents

Intelligent agent efficient global exploration method and system for rapid convergence of value function

Info

Publication number
CN114690623A
Authority
CN
China
Prior art keywords
network
unmanned aerial vehicle
reward
global
Prior art date
Legal status
Granted
Application number
CN202210421995.3A
Other languages
Chinese (zh)
Other versions
CN114690623B (en)
Inventor
林旺群
李妍
徐菁
王伟
田成平
刘波
王锐华
孙鹏
Current Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202210421995.3A
Publication of CN114690623A
Application granted
Publication of CN114690623B
Legal status: Active

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/024 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An agent efficient global exploration method and system for rapid convergence of a value function are disclosed. The method gives the unmanned aerial vehicle (UAV) a clearer training target through an extended reward formed by combining a corrected local reward with a global reward, adopts a universal value function approximator so that the UAV keeps exploring the environment throughout training, and modulates the initial local reward to capture global correlation, so that UAV agent training is efficient and the UAV finally learns the optimal combat strategy. By introducing the corrected local reward, the UAV keeps exploring throughout training and the corrected local reward continuously regulates the extended reward, so the UAV does not converge prematurely to a suboptimal strategy and is guaranteed to learn the optimal strategy; and because the visit statistics of observations are correlated across different episodes, the UAV can visit more previously unvisited observations both within each episode and over the whole training process.

Description

Intelligent agent efficient global exploration method and system for rapid convergence of value function
Technical Field
The invention relates to the field of virtual-simulation intelligent confrontation, and in particular to an agent efficient global exploration method and system with a fast-converging value function, which improves the learning performance of our unmanned aerial vehicle (UAV) while it destroys the enemy UAV and evades the enemy UAV's attacks.
Background
In recent years, the growing demand for unmanned and intelligent unmanned aerial vehicles (UAVs) and the vigorous development of artificial intelligence have drawn wide attention to UAVs in both military and civil fields, and intelligent confrontation in the field of virtual simulation has become a hot research topic.
Because traditional agent learning and training methods suffer from sparse reward settings, the UAV often explores blindly during training; once it finds a suboptimal strategy it is likely to stop exploring and switch to exploitation, making it hard to learn the optimal strategy. The limitation of this approach is that the UAV must accumulate a large amount of experience through repeated blind exploration, which is inefficient and may never yield the optimal strategy.
Building on the traditional approach, some researchers have proposed integrating a corrected local reward, which keeps the UAV exploring purposefully throughout the combat scenario and lets it learn the optimal strategy to some extent. The limitation of this method is that the corrected local reward is not regulated: the corrected local reward of each episode is related only to that episode and is not correlated across all episodes of the whole training process, so the agent trains too slowly.
Therefore, how to overcome the shortcomings of the prior art, make the corrected local reward and the global reward cooperate with each other, keep the agent continuously exploring in the combat scenario, and prevent the UAV agent from meaningless learning has become an urgent problem to solve.
Disclosure of Invention
The invention aims to provide an agent efficient global exploration method and system with a fast-converging value function, which gives the unmanned aerial vehicle (UAV) a clearer training target based on an extended reward formed by combining a corrected local reward with a global reward, adopts a universal value function approximator so that the UAV keeps exploring the environment throughout training, and modulates the initial local reward with an exploration factor to capture global correlation, so that the UAV agent trains efficiently and finally acquires the optimal combat strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
Construct the UAV corrected local reward network, which comprises a local access frequency module and a global access frequency module; this step further comprises the following sub-steps:
local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M, and a k-nearest-neighbor module. The observation x_t of our UAV at time t is input to the embedded network f to extract the controllable state f(x_t) of the UAV agent; f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our UAV at this moment is computed with the k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
Construct the global access frequency module with random network distillation. The observation x_t of the UAV at time t is input and an exploration factor α_t is computed, which modulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^local. The corrected local reward makes the rewards received by the whole network dense; with dense rewards the UAV is regulated better, so the value function of the deep Q-learning network converges faster and the UAV performs better. Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our UAV at this moment;
Embedded network training substep S123:
Append two fully connected layers and a softmax layer to the embedded network to output the probability of each action for the transition from time t to time t+1, and assemble these probabilities into a vector; simultaneously one-hot encode the action output at time t by the current Q network of the deep Q-learning network to obtain another vector; compute the mean squared error E of the two vectors and back-propagate it to update the parameters of the embedded network, until all episodes are finished. "All episodes are finished" means that the UAV iterates over many episodes during the whole training process, and training ends once all episodes have been trained;
random net distillation training substep S124:
Compute the mean squared error err(x_t) between the outputs of the target network and the prediction network of random network distillation, and back-propagate this error to update the parameters of the prediction network while the parameters of the target network remain fixed, until all episodes are finished ("all episodes are finished" has the same meaning as above);
unmanned aerial vehicle intelligent network construction and training step S130:
Construct a deep Q-learning network as the UAV network, comprising a current Q network and a target Q̂ network with identical structure. The observation x_t is input to the current Q network to obtain the action selected by the UAV under the observation at each moment; the action is executed and the UAV interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The transition tuples in the replay buffer are used to compute the target Q value and the loss against the output of the current Q network; the current Q network is trained on this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes;
Repeating the training and exiting step S140:
Repeat steps S120 to S130 with the observations obtained from the UAV's interaction with the environment, continuously training and updating the embedded network, the random network distillation and the deep Q-learning network until the episodes are finished, and select the network structure that yields the maximum reward to guide the UAV's flight.
Optionally, step S110 specifically includes:
Set the aerial combat range of the UAVs, with the motion ranges of our UAV and the enemy UAV lying inside it. The observation of our UAV is set to our UAV's position (x_0, y_0, z_0), its deflection angle φ_0 relative to the horizontal xoy plane and its roll angle ω_0 (< 90°) relative to the motion plane, together with the enemy UAV's position (x_1, y_1, z_1), its deflection angle φ_1 relative to the horizontal plane and its roll angle ω_1 (< 90°) relative to the motion plane, so the observation x_t of our UAV is:

x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1),
assuming that the legal actions of my drone are east, south, west and north;
The global reward is set according to whether our UAV destroys the enemy UAV or evades the enemy UAV's attack: if our UAV destroys the enemy UAV the global reward is 1, if it evades the enemy UAV's attack the global reward is 0, and otherwise the global reward is -1. The global reward is denoted r_t^global.
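The mapping from combat outcome to global reward described above can be written as a few lines of code. The following is a minimal sketch; the function name and the boolean flags are illustrative assumptions, not part of the patent.

```python
def global_reward(destroyed_enemy: bool, evaded_attack: bool) -> float:
    """Global reward r_t^global: 1 for destroying the enemy UAV,
    0 for evading its attack, -1 otherwise (illustrative signature)."""
    if destroyed_enemy:
        return 1.0
    if evaded_attack:
        return 0.0
    return -1.0
```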
Optionally, in step S121, the embedded network f is a convolutional neural network with three convolutional layers and one fully connected layer; it takes the observation x_t as input and extracts the controllable state of the UAV as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. Converting the state-action visit count into a reward, the initial local reward r_t^episodic is defined as:

r_t^episodic = 1 / sqrt(n(f(x_t)))

where n(f(x_t)) is the number of times the controllable state f(x_t) has been visited;
An inverse kernel function K: R^p × R^p → R is used to approximate the number of times the observation at time t has been visited, where R denotes the real number domain and the superscript p denotes the dimension. The pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, so that the initial local reward r_t^episodic is specifically:

r_t^episodic ≈ 1 / ( sqrt( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c )

where f_i ∈ N_k means the controllable states are taken out of N_k in turn to compute the approximate visit count of the observation at time t. The inverse kernel function is:

K(x, y) = ε / ( d²(x, y) / d_m² + ε )

where ε is taken as 0.001, d is the Euclidean distance, d_m² is a running average of the squared k-nearest-neighbour distances, and c is 0.001.
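As a concrete illustration of the k-nearest-neighbour pseudo-count and inverse kernel above, the following Python sketch computes r_t^episodic from an in-memory list of controllable states. The class name, the default k and the decay rate of the running distance average are assumptions for illustration only.

```python
import numpy as np

class EpisodicRewardModule:
    """Initial local reward r_t^episodic via k-nearest-neighbour pseudo-counts
    and the inverse kernel K (a sketch of the formulas above)."""

    def __init__(self, k: int = 10, eps: float = 1e-3, c: float = 1e-3):
        self.k, self.eps, self.c = k, eps, c
        self.memory = []   # episodic memory M of controllable states f(x_t)
        self.dm2 = 1.0     # running average d_m^2 of squared k-NN distances

    def reward(self, f_xt: np.ndarray) -> float:
        if not self.memory:
            self.memory.append(f_xt)
            return 1.0
        dists2 = np.sum((np.stack(self.memory) - f_xt) ** 2, axis=1)
        nearest = np.sort(dists2)[: self.k]                             # N_k
        self.dm2 = 0.99 * self.dm2 + 0.01 * float(nearest.mean())       # update d_m^2
        kernel = self.eps / (nearest / max(self.dm2, 1e-8) + self.eps)  # K(f(x_t), f_i)
        self.memory.append(f_xt)                                        # store f(x_t) in M
        return 1.0 / (float(np.sqrt(kernel.sum())) + self.c)            # r_t^episodic
```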
Optionally, in step S122, the observation x_t of the UAV at time t is input to random network distillation, and the output error err(x_t) between its two networks defines the exploration factor α_t:

α_t = 1 + (err(x_t) − μ_e) / σ_e

where σ_e and μ_e are the running standard deviation and mean of err(x_t). α_t acts as a multiplicative factor on the initial local reward r_t^episodic, and the corrected local reward r_t^local is expressed as:

r_t^local = r_t^episodic · min{ max{α_t, 1}, L }

where the value of α_t is limited to between 1 and L, L is a hyperparameter of at most 5, and the minimum value of α_t is set to 1;
Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t:

r_t = r_t^global + β · r_t^local

where r_t^global and r_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar between 0 and 1.
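The two equations above combine into one small reward-shaping routine. The sketch below assumes the running mean and standard deviation of the RND error are tracked elsewhere and passed in; the default β = 0.3 is an illustrative choice within the stated (0, 1) range.

```python
import numpy as np

def extended_reward(r_global: float, r_episodic: float, rnd_error: float,
                    err_mean: float, err_std: float,
                    beta: float = 0.3, L: float = 5.0) -> float:
    """Extended reward r_t = r_t^global + beta * r_t^local, where the corrected
    local reward modulates r_t^episodic by the clipped exploration factor alpha_t."""
    alpha_t = 1.0 + (rnd_error - err_mean) / max(err_std, 1e-8)  # exploration factor
    r_local = r_episodic * float(np.clip(alpha_t, 1.0, L))       # corrected local reward
    return r_global + beta * r_local                             # extended reward r_t
```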
Optionally, step S123 specifically includes:
Two successive observations x_t and x_{t+1} are input separately to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in the transition from observation x_t to observation x_{t+1}. The action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:

p(a_1 | x_t, x_{t+1}), ..., p(a_{t-1} | x_t, x_{t+1}), p(a_t | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),

where p(a_1 | x_t, x_{t+1}) is the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The output action probabilities form a vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded into a vector A, and the mean squared error E of P and A is computed as:

E = (1/m) Σ_{i=1..m} (P_i − A_i)²

where m is the number of available actions, m = 4. Finally, the parameters of the embedded network f are updated by back-propagating E, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
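The training signal E described above can be sketched as follows, assuming f and h are PyTorch modules and q_action is the (batched) integer action chosen by the current Q network; the function name and the concatenation of the two embeddings before h are assumptions about the exact wiring.

```python
import torch
import torch.nn.functional as F

def embedding_loss(f, h, x_t, x_t1, q_action, num_actions: int = 4):
    """E: mean squared error between the predicted action probabilities P for the
    transition x_t -> x_{t+1} and the one-hot encoding A of the current Q network's action."""
    probs = h(torch.cat([f(x_t), f(x_t1)], dim=-1))      # vector P (softmax output of h)
    one_hot = F.one_hot(q_action, num_actions).float()   # vector A
    return F.mse_loss(probs, one_hot)                    # E = (1/m) * sum_i (P_i - A_i)^2
```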
Optionally, step S124 specifically includes:
The observation x_t at time t is input to random network distillation, and the output error between the target network and the prediction network,

err(x_t) = || ĝ(x_t) − g(x_t) ||²,

where g denotes the target network and ĝ the prediction network, is used to train the prediction network: its parameters are updated by back-propagation, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
Optionally, step S130 specifically includes:
A deep Q-learning network is constructed as the UAV network, the action value function is extended, and a parameter β is added to adjust the weight of the corrected local reward. The observation x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; with the ε-greedy method either the action a_t with the maximum Q value is selected from the outputs of the current Q network, or an action a_t is selected at random. ε takes the value 0.9, i.e. the action with the maximum Q value is chosen with probability 90% and an action is chosen at random with probability 10%. Under observation x_t the current action a_t is executed, yielding the new observation x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r_t^local are then weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Subsequently w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all the sampled transition tuples are trained on at each sampling, and the current target Q value y_j is computed as follows:
y_j = r_j,  if the episode ends at time j+1
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻),  otherwise

where r_j is the extended reward earned by the UAV at time j, max_a Q̂(x_{j+1}, a, β, θ⁻) is the maximum of the Q values output by the target Q̂ network for all actions given the observation x_{j+1} at time j+1, and γ is a discount factor between 0 and 1.

If t equals j+1, the episode has ended and the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean squared error between the target Q values y_j and the outputs of the current Q network is then computed,

Loss = (1/w) Σ_{j=1..w} (y_j − Q(x_j, a_j, β, θ))²,

and the parameter θ of the current Q network is updated by gradient descent, where w is the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
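A minimal sketch of the target computation and loss above is given below, assuming the Q networks take (observation, β) as input and that a sampled batch is provided as tensors (x_j, a_j, r_j, x_{j+1}, done); these interface details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, beta: float, gamma: float = 0.99):
    """Loss = (1/w) * sum_j (y_j - Q(x_j, a_j, beta, theta))^2 with
    y_j = r_j + gamma * max_a Q_hat(x_{j+1}, a, beta, theta-) for non-terminal steps."""
    x_j, a_j, r_j, x_j1, done = batch
    with torch.no_grad():
        q_next = target_net(x_j1, beta).max(dim=1).values   # max_a Q_hat(x_{j+1}, a, beta, theta-)
        y_j = r_j + gamma * (1.0 - done) * q_next           # y_j = r_j at episode end
    q_taken = q_net(x_j, beta).gather(1, a_j.unsqueeze(1)).squeeze(1)  # Q(x_j, a_j, beta, theta)
    return F.mse_loss(q_taken, y_j)
```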
The invention further discloses an intelligent agent efficient global exploration system with fast convergence of a value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the UAV keeps exploring throughout training: it is encouraged to visit observations it has not visited before and is given a very high reward for doing so, so the observations obtained from the UAV's interaction with the environment can all be visited during training and the UAV clearly learns which observations yield higher rewards; at the same time, the acquired corrected local reward continuously regulates the extended reward, so the UAV does not converge prematurely to a suboptimal strategy and can learn the optimal strategy.
2. Random Network Distillation (RND) in the global access frequency module records how many times each observation has been visited over the whole training process and correlates the visit statistics of observations across different episodes, so the UAV can visit more previously unvisited observations both within each episode and over the whole training process. For example, if the initial local reward given by the local access frequency module is small, the observation has been visited many times within the episode; if the exploration factor given by the global access frequency module for the same observation is large, the observation has been visited only a few times over the whole training process, and the corrected local reward obtained by modulating the initial local reward with the exploration factor is therefore not small, indicating that the observation has not been visited in other episodes. With a large number of training iterations the UAV clearly learns which observations yield the maximum reward, and the strategy it obtains is optimal.
3. The traditional action value function is improved: a weight parameter β for the corrected local reward is added alongside the original observation, action and network parameters to adjust the importance of the corrected local reward, i.e. the UAV's degree of exploration. Setting different values of β adjusts the balance between exploration and exploitation; a good strategy and good parameters are first obtained by exploration, and then β is set to 0 so that the UAV is regulated only by the global reward and an even better strategy is obtained. In this way the corrected local reward improves the parameters the UAV learns, and by modulating the parameter β the UAV's training is finally regulated only by the global reward.
Drawings
FIG. 1 is a flow diagram of an agent efficient global exploration method with fast convergence of value functions, in accordance with a specific embodiment of the present invention;
FIG. 2 is a flow chart of the steps of UAV corrective local reward network construction and training for a smart agent efficient global exploration method with fast convergence of value functions in accordance with an embodiment of the present invention;
FIG. 3 is an architecture diagram for correcting a local reward according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of an embedded network in accordance with a specific embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process according to an embodiment of the present invention.
Detailed Description
The terms related to the present invention are described below:
1. deep Q learning network
Deep Q-learning is a representative value-function-based deep reinforcement learning method. It contains two neural networks with the same structure, called the current Q network and the target Q̂ network. In traditional deep Q-learning the two networks are Q(x_j, a_j, θ) and Q̂(x_{j+1}, a_j, θ⁻). The invention controls the proportion of the corrected local reward in the extended reward through the parameter β and introduces β into the action value function, so Q(x_j, a_j, β, θ) and Q̂(x_{j+1}, a_j, β, θ⁻) denote the outputs of the current Q network and the target Q̂ network respectively. The input of the current Q network is the observation at the current time t, the input of the target Q̂ network is the observation at the next time t+1, and the outputs are the state-action values of all actions. In the invention the current Q network of the UAV network is the network to be learned and is used to control the UAV agent, while the target Q̂ network directly copies the parameters of the current Q network after a fixed number of episodes; the parameter θ of the current Q network is updated by gradient descent and trained by minimizing the loss function Loss:

y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻)

Loss = (y_j − Q(x_j, a_j, β, θ))²
2. Episode
An episode is the sequence formed by the observations, actions and extended rewards generated while the UAV interacts with the environment, represented as a set of transition tuples built from this experience. In the invention an episode refers to the whole process of a UAV battle from start to finish.
3. Transition tuple
The transition tuple is the basic unit of an episode. Each time the UAV agent interacts with the environment it generates an observation x_t, an action (instruction) a_t, an extended reward r_t and the next observation x_{t+1}; the quadruple (x_t, a_t, r_t, x_{t+1}) is called the transition tuple and is stored in the replay buffer.
4. Playback buffer
The replay buffer is a buffer area in memory or on disk used to store the sequence of transition tuples. The stored transition tuples can be reused repeatedly to train the deep Q-learning network. In the invention the replay buffer stores the transition tuples obtained from the UAV's interaction with the environment; its maximum capacity is N, its structure resembles a queue, and when the number of transition tuples exceeds N the tuple sequences stored earliest in the replay buffer are deleted.
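The queue-like behaviour described above maps naturally onto a bounded deque. The sketch below is illustrative; the uniform sampling helper is an assumption, since the patent only specifies that w transition tuples are drawn for each update.

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay buffer of maximum capacity N: when full, the oldest transition tuples
    are discarded first, mirroring the queue-like structure described above."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, x_t, a_t, r_t, x_t1):
        self.buffer.append((x_t, a_t, r_t, x_t1))

    def sample(self, w: int):
        return random.sample(self.buffer, min(w, len(self.buffer)))
```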
5. K-nearest neighbor
Given a sample, the k training samples in the training set that are closest to the sample are found based on some distance metric (e.g., Euclidean distance), and then prediction is performed based on the information of the k "neighbors". In the invention, the access times of certain observation information obtained by the unmanned aerial vehicle in a plot are approximately calculated by utilizing a k-neighbor thought so as to obtain the initial local reward of the unmanned aerial vehicle for the observation information. If the number of accesses of the observation information is larger, the initial local reward is smaller, and conversely, if the number of accesses of the observation information is smaller, the initial local reward is larger.
6. Random Network Distillation (RND)
Random network distillation randomly initializes two networks: the parameters of one, called the target network, are fixed, while the other, called the prediction network, is trained. In the invention the input of the RND networks is the observation x_t obtained after the UAV interacts with the environment; training drives the output of the prediction network close to that of the target network, so the smaller the output error of the two networks, the more often the observation x_t has already been visited by the UAV during training, which means a smaller exploration factor, a smaller contribution to the corrected local reward, and hence a smaller corrected local reward.
7. General value function approximator (UVFA)
Generally, different tasks require different action value functions, and different optimal value functions quantify the solutions to different tasks. In the present invention the corrected local reward is weighted to represent different tasks, i.e. tasks with different degrees of exploration. Therefore the invention extends the action value function of deep Q-learning from the original Q(x_t, a_t, θ) to Q(x_t, a_t, β, θ), where the parameter β is the weight parameter of the corrected local reward; with different values of β the corrected local reward plays different roles, and different action value functions can be mixed together through the parameter β.
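One way to realize such a universal value function approximator is to feed β to the Q network alongside the observation, so a single set of weights represents the whole family Q(x, a, β, θ). The sketch below makes that concrete; the hidden sizes and the simple concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UVFAQNetwork(nn.Module):
    """Extended action value function Q(x_t, a_t, beta, theta): beta is appended to
    the observation so one network covers different exploration weights."""

    def __init__(self, obs_dim: int = 10, num_actions: int = 4, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x: torch.Tensor, beta: float) -> torch.Tensor:
        b = torch.full((x.shape[0], 1), beta, dtype=x.dtype, device=x.device)
        return self.net(torch.cat([x, b], dim=1))   # Q values for all actions
```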
8. Kernel function and inverse kernel function
A kernel function computes, through operations on the original points in the feature space, the inner product that the points would have in a high-dimensional space, so the low-dimensional data never has to be explicitly expanded into points of the high-dimensional space, which reduces computational complexity. The inverse kernel function works in the opposite direction: its original feature space is high-dimensional and it reduces the high-dimensional space to a low-dimensional one.
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to FIG. 1, a flow chart of an agent efficient global exploration method with fast convergence of the value function according to an embodiment of the present invention is shown.
Unmanned aerial vehicle combat readiness information setting step S110:
and setting observation information and legal actions of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements.
Specifically, in the present step,
the method comprises the steps of setting a space fight range of the unmanned aerial vehicle, wherein the fight range is a three-dimensional space, the moving ranges of unmanned aerial vehicles of the owner and the enemy are in the space fight range of the unmanned aerial vehicle, for example, the ranges of two horizontal coordinates are [ -1000m,1000m ], the range of a vertical coordinate is not restricted, and the freedom of the upper moving range and the lower moving range is ensured.
The observation of our UAV is set to our UAV's position (x_0, y_0, z_0), its deflection angle φ_0 relative to the horizontal xoy plane and its roll angle ω_0 (< 90°) relative to the motion plane, together with the enemy UAV's position (x_1, y_1, z_1), its deflection angle φ_1 relative to the horizontal plane and its roll angle ω_1 (< 90°) relative to the motion plane, so the observation x_t of our UAV is:

x_t = (x_0, y_0, z_0, φ_0, ω_0, x_1, y_1, z_1, φ_1, ω_1)
assume that the legal actions of my drone are set to east, south, west, and north.
Global reward setting: the global reward of our UAV is set according to whether it destroys the enemy UAV or evades the enemy UAV's attack. If our UAV destroys the enemy UAV the global reward is 1; if it evades the enemy UAV's attack the global reward is 0; otherwise it is -1, i.e. the more actions our UAV takes while neither destroying the enemy UAV nor evading its attack, the more negative the accumulated global reward. The global reward is denoted r_t^global.
unmanned aerial vehicle correction local reward network construction and training step S120:
referring to fig. 2, a flow chart of the steps of drone remediation local reward network construction and training is shown.
Referring to fig. 3, the UAV corrected local reward network is constructed; it comprises a local access frequency module and a global access frequency module.
Local access frequency module construction substep S121:
The local access frequency module comprises four parts: an embedded network f, the controllable state, an episodic memory M, and a k-nearest-neighbor module. The observation x_t of our UAV at time t is input to the embedded network f to extract the controllable state f(x_t) (i.e. the controllable information) of the UAV agent; f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our UAV at this moment is computed with the k-nearest-neighbor algorithm.
Specifically, in step S121, the embedded network f is a convolutional neural network (see fig. 4) with three convolutional layers and one fully connected layer; it takes the observation x_t as input and extracts the controllable state of the UAV as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. Following the count-based exploration idea of converting state-action visit counts into rewards, the initial local reward r_t^episodic is defined as:

r_t^episodic = 1 / sqrt(n(f(x_t)))

where n(f(x_t)) is the number of times the controllable state f(x_t) has been visited, i.e. the more controllable states in the episode resemble the observation x_t (the more it has been visited), the smaller the initial local reward, and vice versa.
Since the state space is continuous, it is difficult to decide whether two controllable states are identical, so an inverse kernel function (equivalent to mapping a high-dimensional space to a low-dimensional space) K: R^p × R^p → R is used to approximate the number of times the observation at time t has been visited. Here R denotes the real number domain and the superscript p denotes the dimension, i.e. R^p is the set of p-dimensional real vectors (in particular, p = 1 gives a real number). Further, the pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k, from which the initial local reward r_t^episodic is derived, specifically:

r_t^episodic ≈ 1 / ( sqrt( Σ_{f_i ∈ N_k} K(f(x_t), f_i) ) + c )

where f_i ∈ N_k means the controllable states are taken out of N_k in turn to compute the approximate visit count of the observation at time t. The inverse kernel function is:

K(x, y) = ε / ( d²(x, y) / d_m² + ε )

where ε is a very small constant (usually 0.001), d is the Euclidean distance, d_m² is a moving average of the squared k-nearest-neighbour distances, and the constant c is a very small value (typically 0.001). The moving average makes the inverse kernel more robust to the task being solved.
This sub-step further explains the controllable state. The embedded network f maps the current observation to a p-dimensional vector:

f: x_t ↦ f(x_t) ∈ R^p

i.e. the controllable state of the agent is extracted from the current observation. Because the environment may contain changes that are independent of the agent's behaviour, referred to as uncontrollable states, which are of no use in the reward calculation and may even harm the accuracy of the initial local reward, the states independent of the UAV's behaviour should be removed so that only the UAV's controllable state remains. Therefore, to avoid meaningless exploration, given two successive observations the embedded network f predicts the action the UAV took to move from one observation to the next, and the accuracy of the controllable state extracted by f is judged from this prediction. For example, the position of the enemy UAV is a controllable state the UAV needs to extract, whereas the number and positions of birds in the air do not need to be observed, so the bird information can be removed by the embedded network f.
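For concreteness, a possible shape of the embedded network f with three convolutional layers and one fully connected layer is sketched below in PyTorch. Treating the low-dimensional observation as a 1-D signal, the channel counts, kernel sizes and output dimension p are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Embedded network f: observation x_t -> p-dimensional controllable state f(x_t),
    built from three convolutional layers and one fully connected layer."""

    def __init__(self, obs_dim: int = 10, p: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * obs_dim, p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x.unsqueeze(1))            # (batch, 1, obs_dim) -> (batch, 32, obs_dim)
        return self.fc(h.flatten(start_dim=1))   # (batch, p) controllable state f(x_t)
```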
Global access frequency module construction sub-step S122:
Construct the global access frequency module with Random Network Distillation (RND). The observation x_t of the UAV at time t is input and an exploration factor α_t is computed, which modulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^local. The corrected local reward makes the rewards received by the whole network dense; with dense rewards the UAV is regulated better, so the value function of the deep Q-learning network converges faster and the UAV performs better. Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our UAV at this moment.
Specifically, in step S122, the observation x_t of the UAV at time t is input to Random Network Distillation (RND), and the output error err(x_t) between its two networks defines the exploration factor α_t:

α_t = 1 + (err(x_t) − μ_e) / σ_e

where σ_e and μ_e are the running standard deviation and mean of err(x_t). α_t acts as a multiplicative factor on the initial local reward r_t^episodic, and the corrected local reward r_t^local is expressed as:

r_t^local = r_t^episodic · min{ max{α_t, 1}, L }

where the value of α_t is limited to between 1 and L, L is a hyperparameter of at most 5, and the minimum value of α_t is set to 1 in order to avoid the situation where an observation visited too many times globally in some episode yields a small modulation factor and hence a corrected local reward of 0.
As a modulation factor, α_t vanishes over time, so the initial local reward r_t^episodic gradually fades to a non-modulated reward.
Finally, the corrected local reward r_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t, defined as:

r_t = r_t^global + β · r_t^local

where r_t^global and r_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar between 0 and 1 that balances the effect of the corrected local reward.
In step S120, the local access frequency module reflects how often the state of the UAV at a given time has been visited within one episode and corresponds to the initial local reward r_t^episodic; the two are negatively correlated, so if the local access frequency of an observation is very high, the corresponding initial local reward is very small. The global access frequency module reflects how often the state of the UAV at a given time has been visited over the whole training process (i.e. across many episodes) and corresponds to the exploration factor α_t; these are also negatively correlated, so if the global access frequency of an observation is high, the corresponding exploration factor is small.
After the local access frequency module and the global access frequency module are constructed, as shown in substeps S121 and S122, the present invention will train the corresponding networks in the two modules.
Embedded network training substep S123:
Append two fully connected layers and a softmax layer to the embedded network to output the probability of each action for the transition from time t to time t+1, and assemble these probabilities into a vector; simultaneously one-hot encode the action output at time t by the current Q network of the deep Q-learning network to obtain another vector; compute the mean squared error E of the two vectors and back-propagate it to update the parameters of the embedded network, until all episodes are finished. "All episodes are finished" means that the UAV iterates over many episodes during the whole training process, and training ends once all episodes have been trained.
Further, the training of the embedded network begins after the second observation is obtained, and lags behind the Random Network Distillation (RND) and the deep Q learning network because the embedded network needs to predict the action taken to shift between two observations at successive times from the observations at the two times.
In a preferred embodiment, the training of this sub-step may specifically be as follows.

Two successive observations x_t and x_{t+1} are input separately to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in the transition from observation x_t to observation x_{t+1}. In the present invention the action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, specifically expressed as:

p(a_1 | x_t, x_{t+1}), ..., p(a_{t-1} | x_t, x_{t+1}), p(a_t | x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),

where p(a_1 | x_t, x_{t+1}) is the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by maximum likelihood. The probabilities of all output actions form a vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded into a vector A, and the mean squared error E of P and A is computed as:

E = (1/m) Σ_{i=1..m} (P_i − A_i)²

where m is the number of available actions, m = 4. Finally, the parameters of the embedded network f are updated by back-propagating the result E, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
It should be noted that the embedded network f does not include the above fully connected layers and softmax layer; they are used only to train the embedded network to output a probability for each action. A larger output probability for an action indicates that the embedded network f considers the UAV most likely to have taken that action to move the observation from x_t to x_{t+1}.
Random Network Distillation (RND) training substep S124:
Training Random Network Distillation (RND) in the global access frequency module only requires training its prediction network, since the parameters of the target network are randomly initialized and remain fixed; the target network is denoted g(x_t), while the prediction network, whose parameters are continuously updated during training to approximate the target network, is denoted ĝ(x_t). Both networks finally output k-dimensional vectors.
The mean squared error err(x_t) between the outputs of the target network and the prediction network of Random Network Distillation (RND) is computed and back-propagated to update the parameters of the prediction network, while the parameters of the target network remain fixed, until all episodes are finished ("all episodes are finished" has the same meaning as above).
In a preferred embodiment, the training of this sub-step may specifically be as follows.

The observation x_t at time t is input to Random Network Distillation (RND), and the output error between the target network and the prediction network,

err(x_t) = || ĝ(x_t) − g(x_t) ||²,

is used to train the prediction network: its parameters are updated by back-propagation, and training continues until all episodes are finished ("all episodes are finished" has the same meaning as above).
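A minimal sketch of this RND pair is shown below: the target network is frozen after random initialization and only the prediction network is trained to match it. Layer sizes and the output dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_rnd(obs_dim: int = 10, out_dim: int = 16):
    """Return (target, predictor): two identically shaped networks; the target's
    parameters are randomly initialized and frozen, the predictor's are trainable."""
    def mlp():
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    target, predictor = mlp(), mlp()
    for p in target.parameters():
        p.requires_grad_(False)
    return target, predictor

def rnd_error(target, predictor, x_t: torch.Tensor) -> torch.Tensor:
    """err(x_t): mean squared error between the two outputs; back-propagating it
    updates only the prediction network."""
    with torch.no_grad():
        g = target(x_t)
    return F.mse_loss(predictor(x_t), g)
```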
Unmanned aerial vehicle intelligent network construction and training step S130:
Construct a deep Q-learning network as the UAV network, comprising a current Q network and a target Q̂ network with identical structure. The observation x_t is input to the current Q network to obtain the action selected by the UAV under the observation at each moment; the action is executed and the UAV interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The current Q network is trained with the transition tuples in the replay buffer and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes.
Specifically, the steps are as follows:
A deep Q-learning network is constructed as the UAV network, the action value function is extended, and a parameter β is newly added to adjust the weight of the corrected local reward; β can take different values so that the UAV network can learn different strategies. The observation x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; with the ε-greedy method either the action a_t with the maximum Q value is selected from the outputs of the current Q network, or an action a_t is selected at random. Generally ε takes the value 0.9, i.e. the action with the maximum Q value is chosen with probability 90% and an action is chosen at random with probability 10%. Then, under observation x_t, the current action a_t is executed, yielding the new observation x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r_t^local are weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Then w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all the sampled transition tuples are trained on at each sampling, and the current target Q value y_j is computed as follows:

y_j = r_j,  if the episode ends at time j+1
y_j = r_j + γ · max_a Q̂(x_{j+1}, a, β, θ⁻),  otherwise

where r_j is the extended reward earned by the UAV at time j, max_a Q̂(x_{j+1}, a, β, θ⁻) is the maximum of the Q values output by the target Q̂ network for all actions given the observation x_{j+1} at time j+1, and γ is a discount factor between 0 and 1. If t equals j+1, the episode has ended and the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean squared error between the target Q values y_j and the outputs of the current Q network is then computed,

Loss = (1/w) Σ_{j=1..w} (y_j − Q(x_j, a_j, β, θ))²,

and the parameter θ of the current Q network is updated by gradient descent, where w is the number of sampled transition tuples; after several episodes, usually 10, the parameter θ⁻ of the target Q̂ network is updated once.
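The ε-greedy selection used throughout this step can be sketched in a few lines; the sketch assumes a Q network with the (observation, β) interface used above, and the default ε = 0.9 matches the value stated in the text.

```python
import random
import torch

def select_action(q_net, x_t: torch.Tensor, beta: float,
                  epsilon: float = 0.9, num_actions: int = 4) -> int:
    """epsilon-greedy: with probability epsilon take the action with the maximum
    Q value, otherwise pick one of the num_actions actions uniformly at random."""
    if random.random() < epsilon:
        with torch.no_grad():
            return int(q_net(x_t.unsqueeze(0), beta).argmax(dim=1).item())
    return random.randrange(num_actions)
```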
Repeating the training and exiting step S140:
Repeat steps S120 to S130 with the observations obtained from the UAV's interaction with the environment, continuously training and updating the embedded network, the Random Network Distillation (RND) and the deep Q-learning network until the episodes are finished. The network that controls the UAV's flight comprises the trained embedded network, Random Network Distillation (RND) and deep Q-learning network, and the network structure that yields the maximum reward is selected to guide the UAV's flight.
Specifically, referring to fig. 5, the whole process of unmanned aerial vehicle combat training is shown.
The present invention further discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described agent-efficient global exploration method for fast convergence of value functions.
The invention also discloses an intelligent agent high-efficiency global exploration system with fast convergence of the value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. by introducing the correction local reward, the unmanned aerial vehicle keeps exploring all the time in the whole training process, and the unmanned aerial vehicle is encouraged to visit the observation information which is not visited and give a very high reward, so that the observation information obtained by interaction between the unmanned aerial vehicle and the environment can be visited in the training process, and the unmanned aerial vehicle can clearly know which observation information can obtain a higher reward; meanwhile, the acquired correction local reward can be regulated and controlled to extend the reward all the time, so that the unmanned aerial vehicle cannot converge to a suboptimal strategy in advance, and the unmanned aerial vehicle can learn the optimal strategy.
2. The observation information access times in the whole training process of the unmanned aerial vehicle are recorded through Random Network Distillation (RND) in the global access frequency module, and the observation information access conditions among different plots are associated, so that the unmanned aerial vehicle can access more observation information which is not accessed in the whole training process and the plot process. For example: if the initial local reward obtained by the local access frequency module is small, the observation information is accessed in the plot for a plurality of times, if the exploration factor obtained by the observation information in the global access frequency module is large, the observation information is accessed in the whole training process for the unmanned aerial vehicle for a plurality of times, the corrected local reward obtained by modulating the initial local reward and the exploration factor is not small, and the observation information is accessed in other plots. Under the condition of a large number of training times, the unmanned aerial vehicle can clearly know which observation information obtains the maximum reward, and the obtained strategy is optimal.
3. The traditional action-value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added to adjust the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. Different values of β trade off exploration against exploitation; once a good policy and good parameters have been obtained through exploration, β is set to 0 so that the unmanned aerial vehicle is regulated by the global reward alone and obtains a better policy. In this way the corrected local reward improves the parameters learned by the unmanned aerial vehicle, and by modulating β the training is finally governed only by the global reward.
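In code, the extended reward and the β schedule described in this advantage reduce to a few lines. The sketch below assumes a linear annealing of β to 0; the initial value 0.3 and the 80%-of-training cutoff are illustrative choices, since the text only states that β lies between 0 and 1 and is eventually set to 0.

```python
def extended_reward(r_global, r_corrected_local, beta):
    """Extended reward r_t = global reward + beta * corrected local reward.
    beta in [0, 1] weights exploration; beta = 0 leaves only the global reward."""
    return r_global + beta * r_corrected_local

def beta_schedule(episode, total_episodes, beta0=0.3):
    """Illustrative annealing: decay beta linearly and switch it off for the last 20%
    of training so the UAV is driven by the global reward alone."""
    return beta0 * max(0.0, 1.0 - episode / (0.8 * total_episodes))
```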
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented with a general-purpose computing device: they may be centralized on a single computing device, or they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle correction local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected-local-reward network, which comprises a local access frequency module and a global access frequency module, specifically through the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises an embedded network f, the controllable state f(x_t), an episodic memory M and a k-nearest neighbor algorithm; the observation information x_t of the unmanned aerial vehicle at time t is input to the embedded network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^local harvested by our unmanned aerial vehicle at that time is calculated with the k-nearest neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input and the exploration factor α_t is calculated; the initial local reward r_t^local is modulated by α_t to obtain the corrected local reward r̂_t^local; finally the corrected local reward r̂_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at that time;
Embedded network training substep S123:
connecting two fully connected layers and a softmax layer after the embedded network to output the probabilities of the corresponding actions when time t transitions to time t+1, the group of probabilities forming one vector; simultaneously one-hot encoding the action output by the current Q network of the deep Q-learning network at time t to obtain another vector; calculating the mean squared error E of the two vectors and back-propagating it to update the parameters of the embedded network until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained;
random network distillation training substep S124:
calculating the mean squared error err(x_t) between the output values of the target network and the prediction network of the random network distillation, back-propagating the error to update the parameters of the prediction network while keeping the parameters of the target network unchanged, until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
constructing a deep Q-learning network as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure; the input observation information x_t passes through the current Q network of the deep Q-learning network to obtain the action selected by the unmanned aerial vehicle under the observation information at each moment; the action is executed and interacts with the environment to obtain the transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in a replay buffer; using the transition tuples in the replay buffer, the target Q value is obtained through the target Q̂ network, the loss between it and the output value of the current Q network is calculated, the current Q network is trained according to the loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes;
Repeating the training and exiting step S140:
repeating steps S120-S130 with the observation information obtained from the interaction between the unmanned aerial vehicle and the environment, continuously training and updating the embedded network, the random network distillation and the deep Q-learning network until all episodes are finished, and selecting the network structure that enables the unmanned aerial vehicle to obtain the maximum reward to guide its flight.
2. The agent-efficient global exploration method according to claim 1,
the step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles, with the activity ranges of our unmanned aerial vehicle and the enemy unmanned aerial vehicle both inside this combat range; the observation information of our unmanned aerial vehicle is set as the position (x_0, y_0, z_0) of our unmanned aerial vehicle, its deflection angle φ_0 relative to the horizontal xoy plane and the flip angle ω_0 (< 90°) of its relative motion plane, together with the position (x_1, y_1, z_1) of the enemy unmanned aerial vehicle, its deflection angle φ_1 relative to the horizontal plane and the flip angle ω_1 (< 90°) of its relative motion plane;
the observation information x_t of our unmanned aerial vehicle is:

$$x_t = (x_0, y_0, z_0, \varphi_0, \omega_0, x_1, y_1, z_1, \varphi_1, \omega_1),$$
the legal actions of our unmanned aerial vehicle are set to be eastward, southward, westward and northward;
the global reward is set as follows: it is determined by whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or avoids the enemy unmanned aerial vehicle's attack; if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to the value 1, if our unmanned aerial vehicle avoids the enemy unmanned aerial vehicle's attack the global reward is set to the value 0, and otherwise the global reward is set to the value -1; the global reward is denoted r_t^global.
3. the agent-efficient global exploration method according to claim 2,
in step S121, the embedded network f is a convolutional neural network with three convolutional layers and one fully connected layer; it extracts from the input observation information x_t the controllable state of the unmanned aerial vehicle as a p-dimensional vector, denoted f(x_t), which is then stored in the episodic memory M; at time t the episodic memory M holds the controllable states from time 0 to time t, namely {f(x_0), f(x_1), ..., f(x_t)}; the visit count is converted into a reward according to the state-action visits, and the initial local reward r_t^local is defined as:

$$r_t^{\mathrm{local}} = \frac{1}{\sqrt{n(f(x_t))}},$$

where n(f(x_t)) denotes the number of times the controllable state f(x_t) has been visited;
an inverse kernel function K: ℝ^p × ℝ^p → ℝ is used to approximate the number of times the observation information has been visited at time t, where ℝ denotes the real number field and the superscript p denotes the dimension; the pseudo-count n(f(x_t)) is approximated with the k controllable states in the episodic memory M that are nearest to f(x_t), denoted N_k = {f_i}, i = 1, ..., k, so that the initial local reward r_t^local is specifically:

$$r_t^{\mathrm{local}} = \frac{1}{\sqrt{\sum_{f_i \in N_k} K\!\left(f(x_t), f_i\right)} + c},$$

where f_i ∈ N_k means that the controllable states are taken from N_k in turn to approximate the number of times the observation information is visited at time t;
the inverse kernel function is:

$$K(x, y) = \frac{\epsilon}{\dfrac{d^2(x, y)}{d_m^2} + \epsilon},$$

where ε is taken as 0.001, d is the Euclidean distance, d_m^2 is a running average of the squared distances to the k nearest neighbors, and c is taken as 0.001.
4. The agent-efficient global exploration method according to claim 3,
in step S122, the observation information x_t of the unmanned aerial vehicle at time t is input to the random network distillation, and the error err(x_t) between the outputs of its two networks is used to define the exploration factor α_t:

$$\alpha_t = 1 + \frac{\mathrm{err}(x_t) - \mu_e}{\sigma_e},$$

where σ_e and μ_e are the running standard deviation and running mean of err(x_t); α_t acts as a multiplicative factor on the initial local reward r_t^local, and the corrected local reward r̂_t^local is expressed as:

$$\hat{r}_t^{\mathrm{local}} = r_t^{\mathrm{local}} \cdot \min\bigl\{\max\{\alpha_t, 1\},\, L\bigr\},$$

where α_t is clipped to lie between 1 and L, L being a hyperparameter taken as 5: values of α_t above L are set to L and values below 1 are set to 1;
finally the corrected local reward r̂_t^local and the global reward r_t^global of step S110 are weighted and summed to obtain the extended reward r_t:

$$r_t = r_t^{\mathrm{global}} + \beta\, \hat{r}_t^{\mathrm{local}},$$

where r_t^global and r̂_t^local denote the global reward and the corrected local reward respectively, and β is a positive scalar with a value between 0 and 1.
5. The agent-efficient global exploration method according to claim 4,
step S123 specifically includes:
two consecutive observations x_t and x_{t+1} are each input to the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}); two fully connected layers and one softmax layer then output the probabilities of all actions taken in moving from observation x_t to observation x_{t+1}; the action probabilities output by the embedded network correspond to the four actions east, south, west and north and sum to 1, and are specifically expressed as:

$$p(a_1 \mid x_t, x_{t+1}), \ldots, p(a_m \mid x_t, x_{t+1}) = \mathrm{softmax}\bigl(h(f(x_t), f(x_{t+1}))\bigr),$$

where p(a_1 | x_t, x_{t+1}) denotes the probability of taking action a_1 when moving from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method; the output action probabilities form the vector P, the action output by the current Q network of the deep Q-learning network is one-hot encoded to obtain the vector A, and the mean squared error E of the vectors P and A is computed as:

$$E = \frac{1}{m}\sum_{i=1}^{m}\bigl(A_i - P_i\bigr)^2,$$

finally the result E is back-propagated to update the parameters of the embedded network f, and training is repeated until all episodes are finished, where m is the number of available actions and m = 4.
6. The agent-efficient global exploration method according to claim 5,
step S124 specifically includes:
the observation information x_t at time t is input to the random network distillation, and the error between the outputs of the target network g and the prediction network ĝ,

$$\mathrm{err}(x_t) = \left\lVert \hat{g}(x_t) - g(x_t) \right\rVert^{2},$$

is used to train the prediction network; the parameters of the prediction network are updated by back-propagation with gradient descent, and training is repeated until all episodes are finished, where the completion of all episodes means that the unmanned aerial vehicle iterates over many episodes during the whole training process and training ends once all episodes have been trained.
7. The agent-efficient global exploration method according to claim 6,
step S130 specifically includes:
a deep Q-learning network is constructed as the unmanned aerial vehicle network; the action-value function is extended by newly adding the parameter β to adjust the weight of the corrected local reward; the observation information x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; the action a_t corresponding to the maximum Q value is selected from these Q values by the ε-greedy method, or an action a_t is selected at random, with ε taken as 0.9, i.e. the action corresponding to the maximum Q value is chosen with probability 90% and a random action with probability 10%; under observation information x_t the current action a_t is executed to obtain the new observation information x_{t+1} and the global reward r_t^global; the global reward r_t^global and the corrected local reward r̂_t^local are weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer; then w transition tuples (x_j, a_j, r_j, x_{j+1}) are sampled from the replay buffer for batch gradient descent, where in batch gradient descent all the transition tuples in the replay buffer are sampled each time for training, and the current target Q value y_j is calculated as:

$$y_j =
\begin{cases}
r_j, & \text{if time } j+1 \text{ ends the episode},\\[2pt]
r_j + \gamma \max_{a'} \hat{Q}\bigl(x_{j+1}, a', \beta;\, \theta^{-}\bigr), & \text{otherwise},
\end{cases}$$

where r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_{a'} Q̂(x_{j+1}, a', β; θ⁻) denotes the maximum of the Q values output by the target Q̂ network for all actions given the observation information x_{j+1} at time j+1, and γ is a discount factor with a value between 0 and 1;
if time j+1 marks the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward; the mean squared error between the target Q value y_j and the output value of the current Q network is then calculated as:

$$L(\theta) = \frac{1}{w}\sum_{j=1}^{w}\bigl(y_j - Q(x_j, a_j, \beta;\, \theta)\bigr)^2,$$

and the parameter θ of the current Q network is updated by gradient descent, where w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated once every 10 episodes.
8. An agent-efficient global exploration system with fast convergence of value functions, comprising a storage medium,
the storage medium storing computer-executable instructions which, when executed by a processor, perform the agent-efficient global exploration method for fast convergence of value functions of any of claims 1-7.
CN202210421995.3A 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function Active CN114690623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210421995.3A CN114690623B (en) 2022-04-21 2022-04-21 Intelligent agent efficient global exploration method and system for rapid convergence of value function


Publications (2)

Publication Number Publication Date
CN114690623A true CN114690623A (en) 2022-07-01
CN114690623B CN114690623B (en) 2022-10-25

Family

ID=82144133


Country Status (1)

Country Link
CN (1) CN114690623B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190014488A1 (en) * 2017-07-06 2019-01-10 Futurewei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
US20200134445A1 (en) * 2018-10-31 2020-04-30 Advanced Micro Devices, Inc. Architecture for deep q learning
CN112434130A (en) * 2020-11-24 2021-03-02 南京邮电大学 Multi-task label embedded emotion analysis neural network model construction method
CN113780576A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114281103A (en) * 2021-12-14 2022-04-05 中国运载火箭技术研究院 Zero-interaction communication aircraft cluster collaborative search method
CN114371729A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI YAN,ET AL.: "A Multi-Agent Motion Prediction and Tracking Method Based on Non-Cooperative Equilibrium", 《MATHEMATICS》 *
LI YAN,ET AL.: "An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium", 《ELECTRONICS》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761850A (en) * 2022-11-16 2023-03-07 智慧眼科技股份有限公司 Face recognition model training method, face recognition device and storage medium
CN115761850B (en) * 2022-11-16 2024-03-22 智慧眼科技股份有限公司 Face recognition model training method, face recognition method, device and storage medium
CN115826621A (en) * 2022-12-27 2023-03-21 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115826621B (en) * 2022-12-27 2023-12-01 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115857556A (en) * 2023-01-30 2023-03-28 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Also Published As

Publication number Publication date
CN114690623B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
US11150670B2 (en) Autonomous behavior generation for aircraft
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Li et al. F2a2: Flexible fully-decentralized approximate actor-critic for cooperative multi-agent reinforcement learning
US20220404831A1 (en) Autonomous Behavior Generation for Aircraft Using Augmented and Generalized Machine Learning Inputs
Shen et al. Theoretically principled deep RL acceleration via nearest neighbor function approximation
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN115730743A (en) Battlefield combat trend prediction method based on deep neural network
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN114290339A (en) Robot reality migration system and method based on reinforcement learning and residual modeling
CN114814741A (en) DQN radar interference decision method and device based on priority important sampling fusion
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model
CN116880540A (en) Heterogeneous unmanned aerial vehicle group task allocation method based on alliance game formation
CN116663637A (en) Multi-level agent synchronous nesting training method
CN115994484A (en) Air combat countergame strategy optimizing system based on multi-population self-adaptive orthoevolutionary algorithm
CN115903901A (en) Output synchronization optimization control method for unmanned cluster system with unknown internal state
CN116451762A (en) Reinforced learning method based on PPO algorithm and application thereof
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN115212549A (en) Adversary model construction method under confrontation scene and storage medium
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
KR20230079804A (en) Device based on reinforcement learning to linearize state transition and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant