CN114690623A - Intelligent agent efficient global exploration method and system for rapid convergence of value function - Google Patents
Intelligent agent efficient global exploration method and system for rapid convergence of value function
- Publication number
- CN114690623A (application CN202210421995.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- unmanned aerial vehicle
- reward
- global
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/0205—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
- G05B13/024—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An intelligent agent efficient global exploration method and system for rapid convergence of a value function are disclosed. In the method, an extended reward formed by combining a corrected local reward with a global reward gives the unmanned aerial vehicle a clearer training objective; a universal value function approximator is adopted, the unmanned aerial vehicle keeps exploring the environment throughout the training process, and the initial local reward is modulated by an exploration factor to capture global correlations. As a result, the unmanned aerial vehicle agent trains efficiently and can ultimately learn the optimal combat strategy. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring at all times during training, and the corrected local reward continuously regulates the extended reward, so that the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and is guaranteed to learn the optimal strategy; the observation-access statistics of different episodes are correlated, so that the unmanned aerial vehicle visits more previously unvisited observations both over the whole training process and within each episode.
Description
Technical Field
The invention relates to the field of intelligent adversarial combat in virtual simulation, and in particular to an intelligent agent efficient global exploration method and system with a rapidly converging value function. The learning performance of our unmanned aerial vehicle is improved while it defeats the enemy unmanned aerial vehicle and evades the enemy unmanned aerial vehicle's attacks.
Background
In recent years, with the growing demand for unmanned and intelligent systems, artificial intelligence technology has developed rapidly, unmanned aerial vehicles have attracted wide attention in both military and civilian fields, and intelligent adversarial combat in the field of virtual simulation has become a focus of current research.
Because traditional intelligent learning and training methods suffer from sparse reward settings, the unmanned aerial vehicle often explores blindly during learning and training; once it discovers a suboptimal strategy it is very likely to stop exploring and switch to exploitation, making it difficult to learn the optimal strategy. The limitation of this approach is that the drone must accumulate a large amount of experience through repeated blind exploration to learn a strategy, which is inefficient, and it may never learn the optimal strategy.
Building on the traditional improvements, researchers have proposed a technique that incorporates a corrected local reward. It keeps the unmanned aerial vehicle exploring purposefully throughout the combat scenario and allows it, to some extent, to learn an optimal strategy. The limitation of this method is that the corrected local reward is not further regulated: the corrected local reward of each episode depends only on that episode and is not correlated across all episodes of the training process, so the agent trains far too slowly.
Therefore, how to overcome the shortcomings of the prior art, make the corrected local reward and the global reward cooperate with each other, keep the agent exploring continuously in the combat scenario, and prevent the unmanned aerial vehicle agent from carrying out meaningless learning has become a difficult problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to provide an intelligent agent efficient global exploration method and system with a rapidly converging value function. Based on an extended reward formed by combining a corrected local reward with a global reward, the unmanned aerial vehicle is given a clearer training objective; a universal value function approximator is adopted so that the unmanned aerial vehicle keeps exploring the environment throughout the training process, and the initial local reward is modulated by an exploration factor to capture global correlations. As a result, the unmanned aerial vehicle agent trains efficiently and can ultimately acquire the optimal combat strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting the observation information and legal actions of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to the task requirements;
unmanned aerial vehicle corrected local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected local reward network, which comprises a local access frequency module and a global access frequency module, with the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises four parts, namely an embedded network f, a controllable state, an episodic memory M and a k-nearest-neighbor search; the observation information x_t of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our unmanned aerial vehicle at this moment is calculated by the k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input and an exploration factor α_t is calculated, which regulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^i. The corrected local reward makes the rewards received across the whole network dense; with dense rewards the unmanned aerial vehicle is regulated better, so the value function of the deep Q learning network converges faster and the unmanned aerial vehicle performs better. Finally, the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at this moment;
Embedded network training substep S123:
two fully connected layers and a softmax layer are connected after the embedded network to output the probability of each action for the transition from time t to time t+1; this set of probabilities forms one vector, while the action output at time t by the current Q network of the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained;
random network distillation training substep S124:
the mean square error err(x_t) between the output values of the target network and the prediction network of the random network distillation is calculated and back-propagated to update the parameters of the prediction network, while the parameters of the target network remain unchanged, until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained;
unmanned aerial vehicle intelligent network construction and training step S130:
a deep Q learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure. The input observation information x_t is passed through the current Q network of the deep Q learning network to obtain the action selected by the unmanned aerial vehicle under the observation at each moment; the action is executed and the interaction with the environment yields a transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The transition tuples in the replay buffer are used to obtain the target Q value and compute the loss against the output value of the current Q network; the current Q network is trained according to this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes;
Repeating the training and exiting step S140:
steps S120-S130 are repeated using the observation information obtained from the interaction of the unmanned aerial vehicle with the environment, and the embedded network, the random network distillation and the deep Q learning network are continuously trained and updated until the episodes are finished; the network structure that gives the unmanned aerial vehicle the maximum reward is selected to guide the flight of the unmanned aerial vehicle.
Optionally, step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles, with the ranges of motion of our unmanned aerial vehicle and the enemy unmanned aerial vehicle lying within this combat range, and setting the observation information of our unmanned aerial vehicle as: the position (x0, y0, z0) of our unmanned aerial vehicle, its deflection angle φ0 relative to the horizontal xoy plane and its flip angle ω0 (<90°) relative to the motion plane, and the position (x1, y1, z1) of the enemy unmanned aerial vehicle, its deflection angle φ1 relative to the horizontal plane and its flip angle ω1 (<90°) relative to the motion plane,
so that the observation information x_t of our unmanned aerial vehicle is: x_t = (x0, y0, z0, φ0, ω0, x1, y1, z1, φ1, ω1);
assuming that the legal actions of our unmanned aerial vehicle are flying east, south, west and north;
the global reward is set as follows: the global reward is determined by whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or evades the enemy unmanned aerial vehicle's attack; if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1, if it evades the enemy unmanned aerial vehicle's attack the global reward is set to 0, and otherwise it is set to -1; the global reward is denoted r_t^e.
optionally, in step S121, the embedded network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it extracts from the input observation information x_t the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted f(x_t), and this p-dimensional vector is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. The access count is converted into a reward according to the state-action, and the initial local reward is defined as:
r_t^episodic = 1 / sqrt(n(f(x_t))),
wherein n(f(x_t)) denotes the number of times the controllable state f(x_t) has been accessed.
An inverse kernel function K: R^p × R^p → R is used to approximate the number of times the observation information at time t has been accessed, where R denotes the real number field and the superscript p denotes the dimension. The pseudo-count n(f(x_t)) is approximated using the k nearest controllable states of f(x_t) in the episodic memory M, denoted N_k, i.e. the k neighboring controllable states extracted from the episodic memory M, so that the initial local reward becomes:
r_t^episodic = 1 / (sqrt(Σ_{f_i ∈ N_k} K(f(x_t), f_i)) + c),
wherein f_i ∈ N_k means that the controllable states are taken from N_k in turn and used to approximate the number of times the observation information at time t has been accessed, and c is a small constant.
The inverse kernel function expression is:
K(x, y) = ε / (d²(x, y) / d_m² + ε),
where ε is a small constant, d is the Euclidean distance and d_m² is a running average of the squared distances of the k nearest neighbors.
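For readability, the following is a minimal Python sketch (not part of the patent) of the episodic memory and the k-nearest-neighbor initial local reward described above; the class name EpisodicMemory, the choices k = 10, ε = c = 0.001 and the exponential update of the running mean d_m² are illustrative assumptions.

```python
import numpy as np

class EpisodicMemory:
    """Minimal sketch of the episodic memory M and the k-NN initial local reward.
    The constants k, eps and c are illustrative choices, not values fixed by the patent."""

    def __init__(self, k=10, eps=1e-3, c=1e-3):
        self.states, self.k, self.eps, self.c = [], k, eps, c
        self.dm2 = 1.0  # running average of the squared k-NN distances

    def reward(self, f_xt):
        self.states.append(f_xt)
        if len(self.states) <= self.k:
            return 1.0  # too few neighbours yet: treat the controllable state as novel
        # squared Euclidean distances to the k nearest stored controllable states
        d2 = np.sort([np.sum((f_xt - s) ** 2) for s in self.states[:-1]])[: self.k]
        self.dm2 = 0.99 * self.dm2 + 0.01 * d2.mean()
        # inverse kernel K(x, y) = eps / (d^2 / d_m^2 + eps) approximates visit counts
        kernel = self.eps / (d2 / max(self.dm2, 1e-8) + self.eps)
        # initial local reward r_t^episodic = 1 / (sqrt(sum K) + c)
        return 1.0 / (np.sqrt(kernel.sum()) + self.c)
```

In this sketch the memory is cleared at the start of each episode, so the reward reflects only within-episode visit counts, as in substep S121.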
Optionally, in step S122, observing information x of the unmanned aerial vehicle at time t is obtainedtInput to random net distillation, output error err (x) through two of the netst) To define an exploration factor alphat,Wherein sigmaeAnd mueIs err (x)t) Run-time standard deviation and mean, alphatAs an initial local rewardMultiplicative factor of, corrected local rewardThe expression of (a) is:
wherein alpha istIs between 1 and L, L being a hyperparametric, at most 5, alphatIs set to 1;
finally, using the corrected local rewardGlobal reward of step S110Weighted summation to obtain extended reward rt,Andrespectively represent global awardsAnd correcting for local rewardsBeta is a positive scalar quantity and ranges from 0 to 1.
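As a reading aid, here is a minimal Python sketch of the exploration factor, corrected local reward and extended reward defined above; the class name, the running-statistics bookkeeping and the defaults L = 5 and β = 0.3 are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

class ExtendedReward:
    """Sketch of the corrected local reward and extended reward of step S122."""

    def __init__(self, L=5.0, beta=0.3):
        self.L, self.beta = L, beta
        self.errs = []  # history of RND errors, used for the running mean / std

    def __call__(self, r_episodic, r_global, rnd_error):
        self.errs.append(rnd_error)
        mu, sigma = np.mean(self.errs), np.std(self.errs) + 1e-8
        alpha = 1.0 + (rnd_error - mu) / sigma      # exploration factor alpha_t
        alpha = min(max(alpha, 1.0), self.L)        # clip to the range [1, L]
        r_corrected = r_episodic * alpha            # corrected local reward r_t^i
        return r_global + self.beta * r_corrected   # extended reward r_t
```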
Optionally, step S123 specifically includes:
two consecutive observations x_t and x_{t+1} are each input into the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}), which are then passed through two fully connected layers and one softmax layer to output the probabilities of all actions that could be taken in the transition from observation x_t to observation x_{t+1}. The action probabilities output by the embedded network correspond to the four actions east, south, west and north, and they sum to 1; they are expressed as:
p(a_1|x_t, x_{t+1}), ..., p(a_{t−1}|x_t, x_{t+1}), p(a_t|x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),
wherein p(a_1|x_t, x_{t+1}) denotes the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method. The output action probabilities form the vector P, the action output by the current Q network of the deep Q learning network is one-hot encoded to obtain the vector A, and the mean square error E of the vectors P and A is calculated as:
E = (1/m) Σ_{i=1}^{m} (P_i − A_i)²,
where m is the number of actions that can be taken, here m = 4. Finally the parameters of the embedded network f are updated by back-propagating the result E, and training is performed again until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained.
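The following PyTorch sketch is not from the patent; it illustrates the action-prediction training of the embedded network described above. The patent specifies three convolutional layers, whereas this sketch uses small fully connected layers because the observation here is a low-dimensional vector; the names EmbeddingNet and embedding_loss and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Sketch of the embedded network f plus the action-prediction head h used only for training."""

    def __init__(self, obs_dim=10, p=32, n_actions=4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, p))
        self.h = nn.Sequential(nn.Linear(2 * p, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def action_probs(self, x_t, x_t1):
        # probability of each action for the transition x_t -> x_{t+1}
        z = torch.cat([self.f(x_t), self.f(x_t1)], dim=-1)
        return F.softmax(self.h(z), dim=-1)

def embedding_loss(net, x_t, x_t1, a_t, n_actions=4):
    """Mean square error E between the predicted action probabilities P
    and the one-hot encoding A of the action output by the current Q network."""
    P = net.action_probs(x_t, x_t1)
    A = F.one_hot(a_t, num_classes=n_actions).float()
    return F.mse_loss(P, A)
```

Back-propagating embedding_loss updates both f and h, but only f is used afterwards to produce controllable states.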
Optionally, step S124 specifically includes:
the observation information x_t at time t is input into the random network distillation, and the output error err(x_t) between the target network and the prediction network is used to train the prediction network; the parameters of the prediction network are updated by back propagation, and training is performed again until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained.
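Below is a minimal PyTorch sketch, not taken from the patent, of the random network distillation module: a fixed, randomly initialized target network and a trainable prediction network whose mean-square output error serves both as the training loss and as the basis of the exploration factor. The layer sizes and the output dimension are assumptions.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Sketch of random network distillation: fixed target network, trained prediction network."""

    def __init__(self, obs_dim=10, out_dim=16):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        for param in self.target.parameters():
            param.requires_grad = False  # the target network stays fixed

    def error(self, x_t):
        # err(x_t): mean square error between the two outputs; it is both the
        # predictor's training loss and the quantity that defines alpha_t
        with torch.no_grad():
            target_out = self.target(x_t)
        return ((self.predictor(x_t) - target_out) ** 2).mean()
```

Back-propagating the returned error updates only the predictor, since the target parameters are frozen.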
Optionally, step S130 specifically includes:
a deep Q learning network is constructed as the unmanned aerial vehicle network, the action value function is extended, and a parameter β is added to adjust the weight of the corrected local reward. The observation information x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; by the ε-greedy method the action a_t corresponding to the maximum Q value is selected from the Q values output by the current Q network, or an action a_t is selected at random, with ε taken as 0.9, i.e. the action corresponding to the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. When the observation information is x_t, the current action a_t is executed to obtain the new observation information x_{t+1} and the global reward r_t^e; the global reward r_t^e and the corrected local reward r_t^i are then weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Subsequently w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all sampled transition tuples are used for training at each sampling, and the current target Q value y_j is calculated as follows:
y_j = r_j if the episode ends at time j+1, and otherwise y_j = r_j + γ · max_{a'} Q̂(x_{j+1}, a', β, θ⁻),
wherein r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_{a'} Q̂(x_{j+1}, a', β, θ⁻) denotes the maximum of the Q values of all actions output by the target Q̂ network for the observation information x_{j+1} at time j+1, and γ denotes the discount factor, taking a value between 0 and 1;
if t equals j+1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean square error between the target Q value y_j and the output value of the current Q network is then calculated as
Loss = (1/w) Σ_{j=1}^{w} (y_j − Q(x_j, a_j, β, θ))²,
and the parameter θ of the current Q network is updated by the gradient descent method, where w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q̂ network is updated every 10 episodes.
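The update of the current Q network can be summarized by the following PyTorch sketch, which is an illustration rather than the patent's implementation; it assumes the sampled batch is already converted to tensors with an episode-termination flag, and that q_net(x, β) returns the Q values of all actions of the extended value function Q(x, a, β, θ).

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, batch, beta, optimizer, gamma=0.99):
    """Sketch of one update of the current Q network from a batch of transition tuples
    (x_j, a_j, r_j, x_{j+1}, done_j), each element given as a stacked tensor."""
    x, a, r, x_next, done = batch
    with torch.no_grad():
        # y_j = r_j at episode end, otherwise r_j + gamma * max_a' Q_target(x_{j+1}, a', beta)
        max_next_q = q_target(x_next, beta).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next_q
    q_taken = q_net(x, beta).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_taken, y)   # Loss = mean_j (y_j - Q(x_j, a_j, beta, theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                # gradient descent on theta
    return loss.item()
```

Copying q_net's parameters into q_target every fixed number of episodes (10 in the text above) completes the target-network update.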
The invention further discloses an intelligent agent efficient global exploration system with fast convergence of a value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring at all times during the whole training process; it is encouraged to visit observations that have not yet been visited and receives a very high reward for doing so, so that the observations obtained from the interaction between the unmanned aerial vehicle and the environment are all visited during training and the unmanned aerial vehicle knows clearly which observations yield higher rewards. At the same time, the corrected local reward continuously regulates the extended reward, so that the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and is able to learn the optimal strategy.
2. Random network distillation (RND) in the global access frequency module records the number of times each observation is accessed over the whole training process and correlates the access statistics of different episodes, so that the unmanned aerial vehicle visits more previously unvisited observations both over the whole training process and within each episode. For example: if the initial local reward obtained from the local access frequency module is small, the observation has been visited many times within the current episode; if the exploration factor obtained for this observation from the global access frequency module is large, the observation has been visited only a few times by the unmanned aerial vehicle over the whole training process; the corrected local reward obtained by modulating the initial local reward with the exploration factor is then not small, reflecting that the observation has rarely been visited in other episodes. With a large number of training episodes, the unmanned aerial vehicle knows clearly which observations yield the maximum reward, and the strategy obtained is optimal.
3. The traditional action value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added, which adjusts the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. Setting different values of β adjusts the balance between exploration and exploitation: a good strategy and good parameters are first obtained through exploration, and the parameter β is then set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and obtains an even better strategy. In this way the corrected local reward makes the parameters learned by the unmanned aerial vehicle better, and finally, by modulating the parameter β, the training of the unmanned aerial vehicle is regulated only by the global reward.
Drawings
FIG. 1 is a flow diagram of an agent efficient global exploration method with fast convergence of value functions, in accordance with a specific embodiment of the present invention;
FIG. 2 is a flow chart of the unmanned aerial vehicle corrected local reward network construction and training step of the agent efficient global exploration method with fast convergence of the value function, in accordance with an embodiment of the present invention;
FIG. 3 is an architecture diagram for correcting a local reward according to an embodiment of the present invention;
FIG. 4 is an architecture diagram of an embedded network in accordance with a specific embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process according to an embodiment of the present invention.
Detailed Description
The following explains terms related to the present invention:
1. deep Q learning network
Deep Q learning is a representative value-function-based method of deep reinforcement learning. It contains two neural networks of identical structure, called the current Q network and the target Q̂ network. In traditional deep Q learning the two networks are Q(x_j, a_j, θ) and Q̂(x_{j+1}, a', θ⁻) respectively. The invention controls the proportion of the corrected local reward in the extended reward through the parameter β and introduces β into the action value function, using Q(x_j, a_j, β, θ) and Q̂(x_{j+1}, a', β, θ⁻) to denote the outputs of the current Q network and the target Q̂ network respectively. The input of the current Q network is the observation information at the current time t, the input of the target Q̂ network is the observation information at the next time t+1, and the outputs are the state-action values of all actions. In the invention, the current Q network in the unmanned aerial vehicle network is the network to be learned and is used to control the unmanned aerial vehicle agent; the parameters of the target Q̂ network are copied directly from the current Q network after a fixed number of episodes, and the parameter θ of the current Q network is updated by the gradient descent method, training by minimizing the loss function Loss:
Loss = (y_j − Q(x_j, a_j, β, θ))²
2. Episode
An episode is the sequence formed by the observation information, actions and extended rewards generated while the unmanned aerial vehicle interacts with the environment, represented as a set of transition tuples formed from this experience. In the invention an episode refers to the whole process of an unmanned aerial vehicle engagement from beginning to end.
3. Transition tuple
A transition tuple is the basic unit of an episode. Each time the unmanned aerial vehicle agent interacts with the environment it generates the observation information x_t, the action (instruction) a_t, the extended reward r_t and the next observation information x_{t+1}; this quadruple (x_t, a_t, r_t, x_{t+1}) is called a transition tuple and is stored in the replay buffer.
4. Replay buffer
The replay buffer is a buffer area in memory or on the hard disk used to store the sequence of transition tuples. The stored transition tuples can be used repeatedly for training the deep Q learning network. In the invention, the replay buffer stores the transition tuples obtained from the interaction of the unmanned aerial vehicle with the environment; its maximum capacity is N, its structure is similar to a queue, and when the number of transition tuples exceeds N the tuple sequences stored earliest in the replay buffer are deleted.
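A minimal Python sketch of such a bounded, queue-like replay buffer is given below for illustration; the class name, the default capacity and the uniform sampling are assumptions, not details fixed by the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sketch of the bounded replay buffer holding transition tuples (x_t, a_t, r_t, x_{t+1})."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # the oldest tuples are dropped first

    def push(self, x_t, a_t, r_t, x_t1):
        self.buffer.append((x_t, a_t, r_t, x_t1))

    def sample(self, w):
        # draw w transition tuples for one batch update
        return random.sample(self.buffer, w)

    def __len__(self):
        return len(self.buffer)
```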
5. k-nearest neighbor
Given a sample, the k training samples in the training set that are closest to it are found according to some distance metric (e.g., Euclidean distance), and prediction is then performed based on the information of these k "neighbors". In the invention, the k-nearest-neighbor idea is used to approximately calculate the number of times an observation obtained by the unmanned aerial vehicle has been accessed within an episode, so as to obtain the initial local reward of the unmanned aerial vehicle for that observation. The more often the observation has been accessed, the smaller the initial local reward; conversely, the less often it has been accessed, the larger the initial local reward.
6. Random Network Distillation (RND)
Random network distillation randomly initializes two networks and fixes the parameters of one of them, which is called the target network; the other network, whose parameters are trained, is called the prediction network. In the invention, the input of the RND networks is the observation information x_t obtained after the unmanned aerial vehicle interacts with the environment, and the prediction network is trained so that its output approaches that of the target network. The smaller the output error of the two networks, the more often the observation information x_t has already been accessed during the unmanned aerial vehicle's training, which means the exploration factor is smaller and its contribution to the corrected local reward is smaller, i.e. the corrected local reward is smaller.
7. Universal value function approximator (UVFA)
Generally, different tasks require different action value functions, and different optimal value functions are required to quantify the schemes for completing different tasks. In the invention, the corrected local reward is weighted to represent different tasks, i.e. tasks with different degrees of exploration. Therefore the invention extends the action value function of deep Q learning from the original Q(x_t, a_t, θ) to Q(x_t, a_t, β, θ), wherein the parameter β is the weight parameter of the corrected local reward; for different values of β the corrected local reward plays a different role, and different action value functions can be mixed together through the parameter β.
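One common way to realize such a β-conditioned value function is to feed β to the network as an extra input; the PyTorch sketch below illustrates this idea and is an assumption of this text, not the architecture specified by the patent (the class name and layer sizes are illustrative).

```python
import torch
import torch.nn as nn

class BetaConditionedQNet(nn.Module):
    """Sketch of the extended action value function Q(x_t, a_t, beta, theta):
    beta is appended to the observation so one network covers several exploration weights."""

    def __init__(self, obs_dim=10, n_actions=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x, beta):
        # beta: scalar exploration weight, broadcast to one column per observation
        beta_col = torch.full((x.shape[0], 1), float(beta), dtype=x.dtype, device=x.device)
        return self.net(torch.cat([x, beta_col], dim=-1))  # Q values of all actions
```

Setting β = 0 at evaluation time recovers a value function driven only by the global reward, which matches the usage described in advantage 3 above.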
8. Kernel function and inverse kernel function
A kernel function computes the inner product of points in a high-dimensional space by operating directly on the original feature space, so that the original low-dimensional space does not have to be explicitly expanded into points of the high-dimensional space, thereby reducing the computational complexity. The inverse kernel function works in the opposite direction: its original feature space is a high-dimensional space, which it reduces to a low-dimensional space.
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to FIG. 1, a flow chart of an intelligent agent efficient global exploration method with fast convergence of the value function according to an embodiment of the present invention is shown.
Unmanned aerial vehicle combat readiness information setting step S110:
and setting observation information and legal actions of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements.
Specifically, in the present step,
the method comprises the steps of setting a space fight range of the unmanned aerial vehicle, wherein the fight range is a three-dimensional space, the moving ranges of unmanned aerial vehicles of the owner and the enemy are in the space fight range of the unmanned aerial vehicle, for example, the ranges of two horizontal coordinates are [ -1000m,1000m ], the range of a vertical coordinate is not restricted, and the freedom of the upper moving range and the lower moving range is ensured.
The observation information of our unmanned aerial vehicle is set as the position (x0, y0, z0) of our unmanned aerial vehicle, its deflection angle φ0 relative to the horizontal xoy plane and its flip angle ω0 (<90°) relative to the motion plane, and the position (x1, y1, z1) of the enemy unmanned aerial vehicle, its deflection angle φ1 relative to the horizontal plane and its flip angle ω1 (<90°) relative to the motion plane, so the observation information x_t of our unmanned aerial vehicle is: x_t = (x0, y0, z0, φ0, ω0, x1, y1, z1, φ1, ω1).
Assume that the legal actions of our unmanned aerial vehicle are set to flying east, south, west and north.
Global reward setting: our unmanned aerial vehicle sets the global reward according to whether it destroys the enemy unmanned aerial vehicle or evades the enemy unmanned aerial vehicle's attack. If our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle the global reward is set to 1; if it evades the enemy unmanned aerial vehicle's attack the global reward is set to 0; otherwise it is set to -1, i.e. the more actions the unmanned aerial vehicle takes while neither destroying the enemy unmanned aerial vehicle nor evading its attack, the more negative the accumulated global reward becomes. The global reward is denoted r_t^e.
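For illustration only, this reward rule can be written as the following small Python function; the function name and the boolean arguments are assumptions of this text, not identifiers used by the patent.

```python
def global_reward(destroyed_enemy: bool, evaded_attack: bool) -> float:
    """Sketch of the global reward r_t^e described above: 1 for destroying the
    enemy drone, 0 for evading its attack, -1 otherwise."""
    if destroyed_enemy:
        return 1.0
    if evaded_attack:
        return 0.0
    return -1.0
```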
Unmanned aerial vehicle corrected local reward network construction and training step S120:
referring to fig. 2, a flow chart of the steps of drone remediation local reward network construction and training is shown.
Referring to fig. 3, the unmanned aerial vehicle rectification local reward network is constructed and comprises a local access frequency module and a global access frequency module.
Local access frequency module construction substep S121:
The local access frequency module comprises four parts, namely an embedded network f, a controllable state, an episodic memory M and a k-nearest-neighbor search; the observation information x_t of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state f(x_t) (i.e. the controllable information) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our unmanned aerial vehicle at this moment is calculated by the k-nearest-neighbor algorithm.
Specifically, in step S121, the embedded network f is a convolutional neural network, see fig. 4, comprising three convolutional layers and one fully connected layer; it extracts from the input observation information x_t the controllable state of the unmanned aerial vehicle, represented as a p-dimensional vector and denoted f(x_t), and this p-dimensional vector is then stored in the episodic memory M. At time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}. Following exploration methods that convert state-action access counts into rewards, the initial local reward is defined as:
r_t^episodic = 1 / sqrt(n(f(x_t))),
wherein n(f(x_t)) denotes the number of times the controllable state f(x_t) has been accessed; that is, the more controllable states within the episode are similar to the one for observation x_t, i.e. the more it has been accessed, the smaller the initial local reward, and vice versa.
Since the state space is a continuous space, it is difficult to determine whether two controllable states are the same, so an inverse kernel function (equivalent to mapping a high-dimensional space to a low-dimensional space) K: R^p × R^p → R is used to approximate the number of times the observation information at time t has been accessed, where R denotes the real number field and the superscript p denotes the dimension, i.e. R^p denotes a p-dimensional real vector; in particular, p = 1 denotes a real number. Further, the pseudo-count n(f(x_t)) is approximated using the k nearest controllable states of f(x_t) in the episodic memory M, denoted N_k, i.e. the k neighboring controllable states extracted from the episodic memory M, so that the initial local reward becomes:
r_t^episodic = 1 / (sqrt(Σ_{f_i ∈ N_k} K(f(x_t), f_i)) + c),
wherein f_i ∈ N_k means that the controllable states are taken from N_k in turn and used to approximate the number of times the observation information at time t has been accessed.
The inverse kernel function expression is:
K(x, y) = ε / (d²(x, y) / d_m² + ε),
where ε is a very small constant (usually 0.001), d is the Euclidean distance, d_m² is a running average of the squared distances of the k nearest neighbors, and the constant c is a very small value (typically 0.001). The running average makes the inverse kernel more robust to the task being solved.
This substep further explains the controllable state. The embedded network f maps the current observation information to a p-dimensional vector, f: x_t ↦ f(x_t) ∈ R^p, extracting the controllable state of the agent. Because the environment may contain changes that are independent of the agent's behavior, referred to as uncontrollable states, which are of no use in the reward calculation and may even affect the accuracy of the initial local reward calculation, the states that are independent of the unmanned aerial vehicle's behavior should be removed, leaving only the controllable state of the unmanned aerial vehicle. Therefore, to avoid meaningless exploration, given two successive observations the embedded network f predicts the action taken by the unmanned aerial vehicle to move from one observation to the next, and the accuracy of the controllable state extracted by the embedded network f is judged from this prediction result. For example: the position of the enemy unmanned aerial vehicle is a controllable state that our unmanned aerial vehicle needs to extract, while the number and positions of birds in the air do not need to be observed by the unmanned aerial vehicle, so the bird information can be removed by the embedded network f.
Global access frequency module construction sub-step S122:
A global access frequency module is constructed by random network distillation (RND); the observation information x_t of the unmanned aerial vehicle at time t is input and an exploration factor α_t is calculated, which regulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^i. The corrected local reward makes the rewards received across the whole network dense; with dense rewards the unmanned aerial vehicle is regulated better, so the value function of the deep Q learning network converges faster and the unmanned aerial vehicle performs better. Finally, the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at this moment.
Specifically, in step S122, the observation information x_t of the unmanned aerial vehicle at time t is input into the random network distillation (RND), and the output error err(x_t) of its two networks is used to define the exploration factor
α_t = 1 + (err(x_t) − μ_e) / σ_e,
wherein σ_e and μ_e are the running standard deviation and running mean of err(x_t). α_t serves as a multiplicative factor on the initial local reward, and the corrected local reward is expressed as:
r_t^i = r_t^episodic · min{max{α_t, 1}, L},
wherein the value of α_t is limited to between 1 and L, L is a hyperparameter of at most 5, and the minimum value of α_t is set to 1 in order to avoid the situation where a state visited globally too many times in some episode yields a small modulation factor and hence a corrected local reward of 0.
As a modulation factor, α_t diminishes over time, causing the initial local reward to fade to an unmodulated reward.
Finally, the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward, defined as
r_t = r_t^e + β · r_t^i,
wherein r_t^e and r_t^i denote the global reward and the corrected local reward respectively, and β is a positive scalar ranging from 0 to 1 that balances the effect of the corrected local reward.
In step S120, the local access frequency module corresponds to the frequency with which the state of the unmanned aerial vehicle at a certain time is accessed within one episode and corresponds to the initial local reward r_t^episodic; they are negatively correlated, so if the local access frequency of an observation is very high, the corresponding initial local reward is very small. The global access frequency module corresponds to the frequency with which the state of the unmanned aerial vehicle at a certain time is accessed over the whole training process (which consists of many episodes) and corresponds to the exploration factor α_t; these are also negatively correlated, so if the global access frequency of an observation is high, the corresponding exploration factor is small.
After the local access frequency module and the global access frequency module are constructed, as in substeps S121 and S122, the invention trains the corresponding networks in the two modules.
Embedded network training substep S123:
Two fully connected layers and a softmax layer are connected after the embedded network to output the probability of each action for the transition from time t to time t+1; this set of probabilities forms one vector, while the action output at time t by the current Q network of the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained.
Further, the training of the embedded network begins after the second observation is obtained, and it therefore lags behind the random network distillation (RND) and the deep Q learning network, because the embedded network needs to predict, from the observations at two successive times, the action taken in the transition between them.
In a preferred embodiment, the training of this substep may specifically be as follows.
Two consecutive observations x_t and x_{t+1} are each input into the embedded network f to extract the controllable states f(x_t) and f(x_{t+1}), which are then passed through two fully connected layers and one softmax layer to output the probabilities of all actions that could be taken in the transition from observation x_t to observation x_{t+1}. For example, in the present invention the action probabilities output by the embedded network correspond to the four actions east, south, west and north, and they sum to 1; they are expressed as: p(a_1|x_t, x_{t+1}), ..., p(a_{t−1}|x_t, x_{t+1}), p(a_t|x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))), wherein p(a_1|x_t, x_{t+1}) denotes the probability of taking action a_1 in the transition from observation x_t to observation x_{t+1}, h is a hidden layer with a softmax function, and the parameters of h and f are trained by the maximum likelihood method. The probabilities of all output actions form the vector P, the action output by the current Q network of the deep Q learning network is one-hot encoded to obtain the vector A, and the mean square error E of the vectors P and A is calculated as: E = (1/m) Σ_{i=1}^{m} (P_i − A_i)², where m is the number of actions that can be taken, here m = 4. Finally the parameters of the embedded network f are updated by back-propagating the calculation result E, and training is performed again until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained.
It should be noted that the embedded network f does not include the above-mentioned fully connected layers and softmax layer; they are only used to train the embedded network by outputting the probability corresponding to each action. If the probability of a certain output action is larger, the embedded network f considers it most likely that the unmanned aerial vehicle took that action to move the observation from x_t to x_{t+1}.
Random Network Distillation (RND) training substep S124:
Training the random network distillation (RND) in the global access frequency module only requires training its prediction network, since the parameters of the target network are randomly initialized and remain unchanged; the parameters of the prediction network are continuously updated during training to approximate the target network. Both networks ultimately output k-dimensional vectors.
Solving the output values of the target network and the prediction network of Random Network Distillation (RND) into the mean square error err (x)t) And updating the parameters in the prediction network by utilizing the error back propagation, wherein the parameters in the target network are kept unchanged until all the plots are finished. And the fact that all plots are finished means that the unmanned aerial vehicle can repeatedly iterate a plurality of plots to train in the whole training process, and the training is finished after all plots are trained.
In a preferred embodiment, the training of this substep may specifically be as follows.
The observation information x_t at time t is input into the random network distillation (RND), and the output error err(x_t) between the target network and the prediction network is used to train the prediction network; the parameters of the prediction network are updated by back propagation, and training is performed again until all episodes are finished. "All episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process, and training ends only after all episodes have been trained.
Unmanned aerial vehicle intelligent network construction and training step S130:
A deep Q learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target Q̂ network with the same structure. The input observation information x_t is passed through the current Q network of the deep Q learning network to obtain the action selected by the unmanned aerial vehicle under the observation at each moment; the action is executed and the interaction with the environment yields a transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in the replay buffer. The transition tuples in the replay buffer are used to train the current Q network, the parameter θ of the current Q network is updated, and the parameter θ⁻ of the target Q̂ network is updated once every several episodes.
Specifically, the steps are as follows:
A deep Q learning network is constructed as the unmanned aerial vehicle network, the action value function is extended, and a new parameter β is added to adjust the weight of the corrected local reward; the parameter β can take different values so that the unmanned aerial vehicle network can learn different strategies. The observation information x_t is used as the input of the current Q network to obtain the Q values of all actions output by the current Q network; by the ε-greedy method the action a_t corresponding to the maximum Q value is selected from the Q values output by the current Q network, or an action a_t is selected at random. Generally ε is taken as 0.9, i.e. the action corresponding to the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%. Then, under the observation information x_t, the current action a_t is executed to obtain the new observation information x_{t+1} and the global reward r_t^e; the global reward r_t^e and the corrected local reward r_t^i are weighted and summed to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer. Subsequently w transition tuples (x_j, a_j, r_j, x_{j+1}), j = 1, 2, ..., w, are sampled from the replay buffer for batch gradient descent, where batch gradient descent means that all sampled transition tuples in the replay buffer are used for training at each sampling, and the current target Q value y_j is calculated as follows:
y_j = r_j if the episode ends at time j+1, and otherwise y_j = r_j + γ · max_{a'} Q̂(x_{j+1}, a', β, θ⁻),
wherein r_j is the extended reward obtained by the unmanned aerial vehicle at time j, max_{a'} Q̂(x_{j+1}, a', β, θ⁻) denotes the maximum of the Q values of all actions output by the target Q̂ network for the observation information x_{j+1} at time j+1, and γ denotes the discount factor, taking a value between 0 and 1. If t equals j+1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output value of the target Q̂ network multiplied by the discount factor plus the extended reward. The mean square error between the target Q value y_j and the output value of the current Q network is then calculated as
Loss = (1/w) Σ_{j=1}^{w} (y_j − Q(x_j, a_j, β, θ))²,
and the parameter θ of the current Q network is updated by the gradient descent method, where w denotes the number of sampled transition tuples; after several episodes, usually 10 episodes, the parameter θ⁻ of the target Q̂ network is updated once.
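To make the ε-greedy rule above concrete, the following Python sketch is offered as an illustration (not the patent's code); it assumes x_t is a 1-D observation tensor and that q_net follows the β-conditioned interface sketched earlier.

```python
import random
import torch

def select_action(q_net, x_t, beta, epsilon=0.9, n_actions=4):
    """Sketch of the epsilon-greedy rule of step S130: with probability epsilon
    (here 0.9) take the greedy action, otherwise a random one."""
    if random.random() < epsilon:
        with torch.no_grad():
            q_values = q_net(x_t.unsqueeze(0), beta)  # Q values of all actions
        return int(q_values.argmax(dim=1).item())
    return random.randrange(n_actions)
```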
Repeating the training and exiting step S140:
Steps S120-S130 are repeated using the observation information obtained from the interaction of the unmanned aerial vehicle with the environment, and the embedded network, the random network distillation (RND) and the deep Q learning network are continuously trained and updated until the episodes are finished. The network controlling the flight of the unmanned aerial vehicle comprises the trained embedded network, random network distillation (RND) and deep Q learning network, and the network structure that gives the unmanned aerial vehicle the maximum reward is selected to guide the flight of the unmanned aerial vehicle.
Specifically, referring to fig. 5, the whole process of unmanned aerial vehicle combat training is shown.
The present invention further discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described agent-efficient global exploration method for fast convergence of value functions.
The invention also discloses an intelligent agent high-efficiency global exploration system with fast convergence of the value function, which is characterized by comprising a storage medium,
the storage medium is used for storing computer-executable instructions which, when executed by a processor, perform the above intelligent agent efficient global exploration method for fast convergence of value functions.
In summary, the invention has the following advantages:
1. By introducing the corrected local reward, the unmanned aerial vehicle keeps exploring at all times during the whole training process; it is encouraged to visit observations that have not yet been visited and receives a very high reward for doing so, so that the observations obtained from the interaction between the unmanned aerial vehicle and the environment are all visited during training and the unmanned aerial vehicle knows clearly which observations yield higher rewards. At the same time, the corrected local reward continuously regulates the extended reward, so that the unmanned aerial vehicle does not converge prematurely to a suboptimal strategy and is able to learn the optimal strategy.
2. Random network distillation (RND) in the global access frequency module records the number of times each observation is accessed over the whole training process and correlates the access statistics of different episodes, so that the unmanned aerial vehicle visits more previously unvisited observations both over the whole training process and within each episode. For example: if the initial local reward obtained from the local access frequency module is small, the observation has been visited many times within the current episode; if the exploration factor obtained for this observation from the global access frequency module is large, the observation has been visited only a few times by the unmanned aerial vehicle over the whole training process; the corrected local reward obtained by modulating the initial local reward with the exploration factor is then not small, reflecting that the observation has rarely been visited in other episodes. With a large number of training episodes, the unmanned aerial vehicle knows clearly which observations yield the maximum reward, and the strategy obtained is optimal.
3. The traditional action value function is improved: in addition to the original observation information, action and network parameters, a weight parameter β for the corrected local reward is added, which adjusts the importance of the corrected local reward, i.e. the degree of exploration of the unmanned aerial vehicle. Setting different values of β adjusts the balance between exploration and exploitation: a good strategy and good parameters are first obtained through exploration, and the parameter β is then set to 0 so that the unmanned aerial vehicle is regulated only by the global reward and obtains an even better strategy. In this way the corrected local reward makes the parameters learned by the unmanned aerial vehicle better, and finally, by modulating the parameter β, the training of the unmanned aerial vehicle is regulated only by the global reward.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. An intelligent agent efficient global exploration method for rapid convergence of a value function is characterized by comprising the following steps:
unmanned aerial vehicle combat readiness information setting step S110:
setting observation information and legal action of the simulated flight of the unmanned aerial vehicle, and setting a global reward function according to task requirements;
unmanned aerial vehicle corrected local reward network construction and training step S120:
constructing the unmanned aerial vehicle corrected local reward network, which comprises a local access frequency module and a global access frequency module, specifically comprising the following substeps:
local access frequency module construction substep S121:
the local access frequency module comprises an embedded network f, a controllable state f(x_t), an episodic memory M and a k-nearest-neighbor search; the observation information x_t of the unmanned aerial vehicle at time t is input into the embedded network f to extract the controllable state f(x_t) of the unmanned aerial vehicle agent, the controllable state f(x_t) is stored in the episodic memory M, and the initial local reward r_t^episodic harvested by our unmanned aerial vehicle at this moment is calculated by the k-nearest-neighbor algorithm;
Global access frequency module construction sub-step S122:
a global access frequency module is constructed by random network distillation; the observation information x_t of the unmanned aerial vehicle at time t is input and an exploration factor α_t is calculated, which regulates the initial local reward r_t^episodic to obtain the corrected local reward r_t^i; finally the corrected local reward r_t^i and the global reward r_t^e of step S110 are weighted and summed to obtain the extended reward r_t of our unmanned aerial vehicle at this moment;
Embedded network training substep S123:
two fully connected layers and a softmax layer are connected after the embedded network to output the probability of each action for the transition from time t to time t+1; this set of probabilities forms one vector, while the action output at time t by the current Q network of the deep Q learning network is one-hot encoded to obtain another vector; the mean square error E of the two vectors is calculated and back-propagated to update the parameters of the embedded network until all episodes are finished, wherein "all episodes are finished" means that the unmanned aerial vehicle repeatedly iterates over many episodes during the whole training process and training ends only after all episodes have been trained;
random network distillation training substep S124:
the mean square error err(x_t) between the output values of the target network and the prediction network of the random network distillation is calculated; the parameters of the prediction network are updated by back-propagating this error while the parameters of the target network are kept unchanged, until all episodes are finished, where "all episodes are finished" means that the unmanned aerial vehicle iterates over a number of episodes during the whole training process and training ends once every episode has been trained;
unmanned aerial vehicle agent network construction and training step S130:
a deep Q learning network is constructed as the unmanned aerial vehicle network, comprising a current Q network and a target Q network with the same structure; the observation information x_t is input into the current Q network of the deep Q learning network to obtain the action selected by the unmanned aerial vehicle under the observation at each moment, the action is executed and the environment is interacted with to obtain a transition tuple (x_t, a_t, r_t, x_{t+1}), which is stored in a replay buffer; using the transition tuples in the replay buffer, a target Q value is obtained through the target Q network, the loss between the target Q value and the output value of the current Q network is calculated, the current Q network is trained according to this loss and its parameter θ is updated, and the parameter θ⁻ of the target Q network is updated once every several episodes;
Repeating the training and exiting step S140:
repeating steps S120-S130 with the observation information obtained from the interaction of the unmanned aerial vehicle with the environment, continuously training and updating the embedding network, the random network distillation and the deep Q learning network until the episodes are finished, and selecting the network structure that gives the unmanned aerial vehicle the maximum reward to guide the flight of the unmanned aerial vehicle.
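Illustrative sketch (not part of the claims): the extended reward of steps S120-S130 is a weighted sum of the global reward and the corrected local reward. A minimal Python sketch, assuming a scalar weight here named beta (the claims introduce β only in claim 7) and an assumed value of 0.3:

```python
def extended_reward(global_reward: float, corrected_local_reward: float, beta: float = 0.3) -> float:
    """Weighted sum of the sparse global reward and the corrected local (exploration)
    reward, giving the extended reward r_t used to train the drone agent.
    The weight beta is a hyperparameter; 0.3 is an assumed value."""
    return global_reward + beta * corrected_local_reward

# Example: a destroyed enemy (global reward 1) plus a small exploration bonus.
r_t = extended_reward(global_reward=1.0, corrected_local_reward=0.08)
print(r_t)  # 1.024
```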
2. The intelligent agent efficient global exploration method according to claim 1,
the step S110 specifically includes:
setting the aerial combat range of the unmanned aerial vehicles, wherein the activity ranges of our unmanned aerial vehicle and the enemy unmanned aerial vehicle both lie within this combat range; the observation information of our unmanned aerial vehicle is set as the position (x_0, y_0, z_0) of our unmanned aerial vehicle, its deflection angle relative to the horizontal xoy plane and its flip angle ω_0 (< 90°) relative to the motion plane, together with the position (x_1, y_1, z_1) of the enemy unmanned aerial vehicle, its deflection angle relative to the horizontal plane and its flip angle ω_1 (< 90°) relative to the motion plane,
the observation information x_t of our unmanned aerial vehicle is composed of the above quantities;
the legal actions of our unmanned aerial vehicle are set to be eastward, southward, westward and northward;
the global reward is set according to whether our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle or avoids the attack of the enemy unmanned aerial vehicle: if our unmanned aerial vehicle destroys the enemy unmanned aerial vehicle, the global reward is set to 1; if our unmanned aerial vehicle avoids the attack of the enemy unmanned aerial vehicle, the global reward is set to 0; otherwise, the global reward is set to -1, and the global reward is recorded with its corresponding symbol.
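For illustration only (outside the claims), a minimal Python sketch of the combat-readiness setting of claim 2. The DroneState container, field names and the observation tuple layout are assumptions; the claim fixes only the observed quantities, the four legal actions and the {1, 0, -1} global reward values:

```python
from dataclasses import dataclass

@dataclass
class DroneState:
    x: float
    y: float
    z: float
    deflection: float   # deflection angle from the horizontal xoy plane
    flip: float         # flip angle relative to the motion plane, assumed < 90 degrees

ACTIONS = ("east", "south", "west", "north")   # legal actions of our drone

def observation(own: DroneState, enemy: DroneState) -> tuple:
    """Observation x_t of our drone: own and enemy positions plus attitude angles."""
    return (own.x, own.y, own.z, own.deflection, own.flip,
            enemy.x, enemy.y, enemy.z, enemy.deflection, enemy.flip)

def global_reward(enemy_destroyed: bool, attack_avoided: bool) -> float:
    """Global reward of step S110: 1 if the enemy drone is destroyed,
    0 if our drone only avoids the enemy attack, -1 otherwise."""
    if enemy_destroyed:
        return 1.0
    if attack_avoided:
        return 0.0
    return -1.0
```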
3. The intelligent agent efficient global exploration method according to claim 2,
in step S121, the embedding network f is a convolutional neural network comprising three convolutional layers and one fully connected layer; it takes the observation information x_t as input and extracts the controllable state of the unmanned aerial vehicle as a p-dimensional vector, recorded as f(x_t), which is then stored in the episodic memory M; at time t the episodic memory M stores the controllable states from time 0 to time t, expressed as {f(x_0), f(x_1), ..., f(x_t)}; the state-action access count is converted into a reward, and the initial local reward is defined as:
wherein n(f(x_t)) represents the number of times the controllable state f(x_t) has been accessed;
an inverse kernel function K: R^p × R^p → R is used to approximate the number of times the observation information has been accessed at time t, where R denotes the real number domain and the superscript p denotes the dimension; the pseudo-count n(f(x_t)) is approximated using the k controllable states in the episodic memory M adjacent to f(x_t), with N_k denoting the k adjacent controllable states extracted from the episodic memory M, so that the initial local reward is specifically:
wherein f_i ∈ N_k represents taking the controllable states out of N_k in turn to calculate the number of times the observation information has been accessed at time t,
the inverse kernel function expression is:
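A sketch of the k-nearest-neighbour pseudo-count reward of claim 3 (illustrative only, not part of the claims). The claimed kernel and reward formulas appear only in the drawings; the inverse kernel K(x, y) = eps / (d(x, y)^2 / d_m^2 + eps) and the reward 1 / (sqrt(sum of kernel values) + c) used below are assumptions in the style of "Never Give Up"-type episodic bonuses:

```python
import numpy as np

def initial_local_reward(f_xt: np.ndarray, episodic_memory: list[np.ndarray],
                         k: int = 10, eps: float = 1e-3, c: float = 1e-3) -> float:
    """Approximate 1 / sqrt(n(f(x_t))) with an inverse kernel over the k controllable
    states in the episodic memory M that are nearest to f(x_t) (the set N_k)."""
    if not episodic_memory:
        return 1.0                                     # first visit of the episode: assume maximal bonus
    mem = np.stack(episodic_memory)                    # (N, p) stored controllable states
    d2 = np.sum((mem - f_xt) ** 2, axis=1)             # squared distances to f(x_t)
    nn = np.sort(d2)[:k]                               # k nearest neighbours N_k
    d2_mean = nn.mean() + 1e-8                         # normalising scale d_m^2 (assumed: mean of the batch)
    kernel = eps / (nn / d2_mean + eps)                # inverse kernel K(f(x_t), f_i)
    return float(1.0 / (np.sqrt(kernel.sum()) + c))    # pseudo-count turned into the initial local reward
```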
4. The intelligent agent efficient global exploration method according to claim 3,
in step S122, the observation information x_t of the drone at time t is input into the random network distillation, and the output error err(x_t) of its two networks is used to define an exploration factor α_t, wherein σ_e and μ_e are the running standard deviation and running mean of err(x_t); α_t serves as a multiplicative factor on the initial local reward, and the expression of the corrected local reward is:
wherein α_t lies between 1 and L, L being a hyperparameter taking a value of at most 5, and the minimum value of α_t is set to 1.
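A sketch of the exploration factor and corrected local reward of claim 4 (illustrative only, not part of the claims). The definition α_t = 1 + (err(x_t) − μ_e)/σ_e, clipped to [1, L], is an assumption consistent with the running mean and standard deviation named in the claim:

```python
import numpy as np

class RunningStats:
    """Running mean and standard deviation of the random-network-distillation error err(x_t)."""
    def __init__(self):
        self.errors: list[float] = []
    def update(self, err: float) -> None:
        self.errors.append(err)
    @property
    def mean(self) -> float:
        return float(np.mean(self.errors))
    @property
    def std(self) -> float:
        return float(np.std(self.errors)) + 1e-8   # avoid division by zero early in training

def corrected_local_reward(initial_reward: float, err_xt: float,
                           stats: RunningStats, L: float = 5.0) -> float:
    """Modulate the initial local reward by the exploration factor alpha_t (assumed definition)."""
    stats.update(err_xt)
    alpha_t = 1.0 + (err_xt - stats.mean) / stats.std   # assumed NGU-style alpha_t
    alpha_t = float(np.clip(alpha_t, 1.0, L))           # kept between 1 and the hyperparameter L
    return alpha_t * initial_reward
```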
5. The intelligent agent efficient global exploration method according to claim 4,
step S123 specifically includes:
two successive observations x_t and x_{t+1} are respectively input into the embedding network f to extract the controllable states f(x_t) and f(x_{t+1}); these are then passed through two fully connected layers and one softmax layer to output the probabilities of all actions taken when transferring from observation x_t to observation x_{t+1}; the action probabilities output by the embedding network correspond to the four actions east, south, west and north, their sum is 1, and they are specifically expressed as:
p(a_1|x_t, x_{t+1}), ..., p(a_{m-1}|x_t, x_{t+1}), p(a_m|x_t, x_{t+1}) = softmax(h(f(x_t), f(x_{t+1}))),
wherein p(a_1|x_t, x_{t+1}) represents the probability of taking action a_1 when transferring from observation x_t to observation x_{t+1}; h is a hidden layer followed by the softmax function, and the parameters of h and f are trained by the maximum likelihood method; the probabilities of the output actions form a vector P, one-hot coding is applied to the action output by the current Q network in the deep Q learning network to obtain a vector A, and the mean square error E of the vectors P and A is calculated; finally, the calculation result E is back-propagated to update the parameters of the embedding network f, and training continues until all episodes are finished, wherein m is the number of actions that can be taken and m = 4.
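A PyTorch sketch of the embedding-network training of claim 5 (illustrative only, not part of the claims). The claimed embedding f has three convolutional layers and a fully connected layer; here a small fully connected stand-in is used so the example stays self-contained, and all layer sizes are assumptions. The mean square error E between the predicted action probabilities P and the one-hot coded DQN action A is back-propagated into both h and f:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, EMB_DIM, N_ACTIONS = 10, 32, 4          # assumed sizes; N_ACTIONS = m = 4

class Embedding(nn.Module):
    """Stand-in for the claimed convolutional embedding network f."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, EMB_DIM))
    def forward(self, x):
        return self.net(x)                       # controllable state f(x_t), a p-dimensional vector

class InverseHead(nn.Module):
    """h: two fully connected layers followed by a softmax over the m actions."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * EMB_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    def forward(self, f_t, f_t1):
        return torch.softmax(self.fc(torch.cat([f_t, f_t1], dim=-1)), dim=-1)

f, h = Embedding(), InverseHead()
opt = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=1e-3)

x_t, x_t1 = torch.randn(1, OBS_DIM), torch.randn(1, OBS_DIM)   # two successive observations
dqn_action = torch.tensor([2])                                 # action output by the current Q network at time t

P = h(f(x_t), f(x_t1))                                         # probabilities of the m actions, summing to 1
A = F.one_hot(dqn_action, N_ACTIONS).float()                   # one-hot coded action vector
E = torch.mean((P - A) ** 2)                                   # mean square error E of the vectors P and A
opt.zero_grad(); E.backward(); opt.step()                      # back-propagate E to update both f and h
```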
6. The intelligent agent efficient global exploration method according to claim 5,
step S124 specifically includes:
the observation information x_t at time t is input into the random network distillation, the error between the outputs of the target network and the prediction network is obtained, the prediction network is trained on this error, and the parameters of the prediction network are updated by back-propagation with a gradient descent method; training continues until all episodes are finished, where "all episodes are finished" means that the unmanned aerial vehicle iterates over a number of episodes during the whole training process and training ends once every episode has been trained.
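A PyTorch sketch of the random-network-distillation step of claim 6 (illustrative only, not part of the claims; network widths are assumptions). Only the prediction network is trained, while the randomly initialised target network stays frozen, so err(x_t) shrinks for frequently visited observations:

```python
import torch
import torch.nn as nn

OBS_DIM, OUT_DIM = 10, 16                                    # assumed sizes

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, OUT_DIM))

target_net = make_net()                                      # randomly initialised, never trained
for p in target_net.parameters():
    p.requires_grad_(False)                                  # target network parameters stay unchanged
prediction_net = make_net()
opt = torch.optim.Adam(prediction_net.parameters(), lr=1e-3)

def rnd_step(x_t: torch.Tensor) -> float:
    """Return err(x_t) and update the prediction network by back-propagating this error."""
    err = torch.mean((prediction_net(x_t) - target_net(x_t)) ** 2)   # MSE between the two outputs
    opt.zero_grad()
    err.backward()
    opt.step()
    return float(err.detach())

err_xt = rnd_step(torch.randn(1, OBS_DIM))   # a large err_xt indicates a rarely visited observation
```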
7. The intelligent agent efficient global exploration method according to claim 6,
step S130 specifically includes:
a deep Q learning network is constructed as the unmanned aerial vehicle network; the action value function is extended and a new parameter β is added to adjust the weight of the corrected local reward; the observation information x_t is used as the input of the current Q network to obtain the Q values corresponding to all actions output by the current Q network, and the action a_t corresponding to the maximum Q value is selected from these Q values by the ε-greedy method, or an action a_t is selected at random; ε takes the value 0.9, i.e. the action corresponding to the maximum Q value is selected with probability 90% and an action is selected at random with probability 10%; under observation x_t the current action a_t is executed to obtain the new observation x_{t+1} and the global reward; the global reward and the corrected local reward are then summed with weights to obtain the extended reward r_t, and the transition tuple (x_t, a_t, r_t, x_{t+1}) is stored in the replay buffer; w transition tuples (x_j, a_j, r_j, x_{j+1}) are then sampled from the replay buffer for batch gradient descent, where in batch gradient descent all the transition tuples in the replay buffer are sampled each time for training, and the current target Q value y_j is calculated as follows:
wherein r_j is the extended reward earned by the drone at time j, the maximum Q value is taken among the Q values of all actions output by the target Q network given the observation information x_{j+1} at time j+1, and γ denotes a discount factor with a value between 0 and 1;
if t equals j+1, indicating the end of the episode, the output target Q value equals the extended reward at time j; otherwise the output target Q value y_j is the output value of the target Q network multiplied by the discount factor plus the extended reward; the mean square error between the target Q value y_j and the output value of the current Q network is then calculated, and the parameter θ of the current Q network is updated by a gradient descent method, wherein w denotes the number of sampled transition tuples; the parameter θ⁻ of the target Q network is updated once every 10 episodes.
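A PyTorch sketch of the deep Q learning update of claim 7 (illustrative only, not part of the claims). The ε-greedy rule with ε = 0.9, the terminal/non-terminal target y_j, and the periodic copy of θ into θ⁻ follow the prose of the claim; the replay-buffer handling, the done flag and the network sizes are assumptions:

```python
import random
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, GAMMA, EPSILON = 10, 4, 0.95, 0.9        # gamma in (0, 1); epsilon = 0.9

def make_q_net() -> nn.Module:
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net, target_q_net = make_q_net(), make_q_net()             # current Q network (theta) and target Q network (theta minus)
target_q_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(x_t: torch.Tensor) -> int:
    """epsilon-greedy: greedy action with probability 0.9, random action with probability 0.1."""
    if random.random() < EPSILON:
        return int(q_net(x_t).argmax(dim=-1))
    return random.randrange(N_ACTIONS)

def dqn_update(batch) -> None:
    """batch: list of transition tuples (x_j, a_j, r_j, x_{j+1}, done) sampled from the replay buffer."""
    x_j  = torch.stack([b[0] for b in batch])
    a_j  = torch.tensor([b[1] for b in batch])
    r_j  = torch.tensor([b[2] for b in batch])                    # extended rewards
    x_j1 = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    with torch.no_grad():                                         # target y_j from the target Q network
        y_j = r_j + GAMMA * (1.0 - done) * target_q_net(x_j1).max(dim=-1).values
    q_j = q_net(x_j).gather(1, a_j.unsqueeze(1)).squeeze(1)       # Q(x_j, a_j; theta)
    loss = torch.mean((y_j - q_j) ** 2)                           # mean square error over the w sampled tuples
    opt.zero_grad(); loss.backward(); opt.step()                  # gradient descent on theta

# Example transition and update; the target parameters are refreshed periodically:
x_t = torch.randn(OBS_DIM)
dqn_update([(x_t, select_action(x_t), 1.024, torch.randn(OBS_DIM), False)])
target_q_net.load_state_dict(q_net.state_dict())                 # theta minus <- theta (once every 10 episodes)
```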
8. An intelligent agent efficient global exploration system for rapid convergence of a value function, comprising a storage medium,
the storage medium storing computer-executable instructions which, when executed by a processor, perform the intelligent agent efficient global exploration method for rapid convergence of a value function according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210421995.3A CN114690623B (en) | 2022-04-21 | 2022-04-21 | Intelligent agent efficient global exploration method and system for rapid convergence of value function |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114690623A true CN114690623A (en) | 2022-07-01 |
CN114690623B CN114690623B (en) | 2022-10-25 |
Family
ID=82144133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210421995.3A Active CN114690623B (en) | 2022-04-21 | 2022-04-21 | Intelligent agent efficient global exploration method and system for rapid convergence of value function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114690623B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190014488A1 (en) * | 2017-07-06 | 2019-01-10 | Futurewei Technologies, Inc. | System and method for deep learning and wireless network optimization using deep learning |
US20200134445A1 (en) * | 2018-10-31 | 2020-04-30 | Advanced Micro Devices, Inc. | Architecture for deep q learning |
CN112434130A (en) * | 2020-11-24 | 2021-03-02 | 南京邮电大学 | Multi-task label embedded emotion analysis neural network model construction method |
CN113780576A (en) * | 2021-09-07 | 2021-12-10 | 中国船舶重工集团公司第七0九研究所 | Cooperative multi-agent reinforcement learning method based on reward self-adaptive distribution |
CN113723013A (en) * | 2021-09-10 | 2021-11-30 | 中国人民解放军国防科技大学 | Multi-agent decision method for continuous space chess deduction |
CN114281103A (en) * | 2021-12-14 | 2022-04-05 | 中国运载火箭技术研究院 | Zero-interaction communication aircraft cluster collaborative search method |
CN114371729A (en) * | 2021-12-22 | 2022-04-19 | 中国人民解放军军事科学院战略评估咨询中心 | Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback |
Non-Patent Citations (2)
Title |
---|
LI YAN,ET AL.: "A Multi-Agent Motion Prediction and Tracking Method Based on Non-Cooperative Equilibrium", 《MATHEMATICS》 * |
LI YAN,ET AL.: "An Interactive Self-Learning Game and Evolutionary Approach Based on Non-Cooperative Equilibrium", 《ELECTRONICS》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115761850A (en) * | 2022-11-16 | 2023-03-07 | 智慧眼科技股份有限公司 | Face recognition model training method, face recognition device and storage medium |
CN115761850B (en) * | 2022-11-16 | 2024-03-22 | 智慧眼科技股份有限公司 | Face recognition model training method, face recognition method, device and storage medium |
CN115826621A (en) * | 2022-12-27 | 2023-03-21 | 山西大学 | Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning |
CN115826621B (en) * | 2022-12-27 | 2023-12-01 | 山西大学 | Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning |
CN115857556A (en) * | 2023-01-30 | 2023-03-28 | 中国人民解放军96901部队 | Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN114690623B (en) | 2022-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114690623B (en) | Intelligent agent efficient global exploration method and system for rapid convergence of value function | |
US11150670B2 (en) | Autonomous behavior generation for aircraft | |
WO2021017227A1 (en) | Path optimization method and device for unmanned aerial vehicle, and storage medium | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN114281103B (en) | Aircraft cluster collaborative search method with zero interaction communication | |
CN114839884B (en) | Underwater vehicle bottom layer control method and system based on deep reinforcement learning | |
CN113382060B (en) | Unmanned aerial vehicle track optimization method and system in Internet of things data collection | |
Li et al. | F2a2: Flexible fully-decentralized approximate actor-critic for cooperative multi-agent reinforcement learning | |
US20220404831A1 (en) | Autonomous Behavior Generation for Aircraft Using Augmented and Generalized Machine Learning Inputs | |
Shen et al. | Theoretically principled deep RL acceleration via nearest neighbor function approximation | |
CN114371729B (en) | Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback | |
CN115730743A (en) | Battlefield combat trend prediction method based on deep neural network | |
Han et al. | Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c | |
CN114290339A (en) | Robot reality migration system and method based on reinforcement learning and residual modeling | |
CN114814741A (en) | DQN radar interference decision method and device based on priority important sampling fusion | |
CN114037048A (en) | Belief consistency multi-agent reinforcement learning method based on variational cycle network model | |
CN116880540A (en) | Heterogeneous unmanned aerial vehicle group task allocation method based on alliance game formation | |
CN116663637A (en) | Multi-level agent synchronous nesting training method | |
CN115994484A (en) | Air combat countergame strategy optimizing system based on multi-population self-adaptive orthoevolutionary algorithm | |
CN115903901A (en) | Output synchronization optimization control method for unmanned cluster system with unknown internal state | |
CN116451762A (en) | Reinforced learning method based on PPO algorithm and application thereof | |
Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
CN115212549A (en) | Adversary model construction method under confrontation scene and storage medium | |
CN114757092A (en) | System and method for training multi-agent cooperative communication strategy based on teammate perception | |
KR20230079804A (en) | Device based on reinforcement learning to linearize state transition and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||