CN109523029B - Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method - Google Patents

Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Info

Publication number
CN109523029B
CN109523029B (Application CN201811144686.6A)
Authority
CN
China
Prior art keywords
head
action
value
kth
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811144686.6A
Other languages
Chinese (zh)
Other versions
CN109523029A (en)
Inventor
袁春
郑卓彬
朱新瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201811144686.6A priority Critical patent/CN109523029B/en
Publication of CN109523029A publication Critical patent/CN109523029A/en
Application granted Critical
Publication of CN109523029B publication Critical patent/CN109523029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent. A multi-head self-driven architecture improves both the evaluation performance of the judge and the efficiency with which the executor explores the environment. The method thereby optimizes the deep deterministic policy gradient (DDPG) algorithm to a certain extent, mitigates the adverse effects of environmental complexity and randomness, accelerates the convergence of DDPG, and improves performance while keeping training stable. Experiments show that on the experimental data set (a simulation environment) the invention attains the fastest training speed, the best performance and the best stability, with specific figures exceeding those of known solutions.

Description

Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
Technical Field
The invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent.
Background art:
Deep reinforcement learning has enjoyed great success on a series of challenging problems, such as autonomous driving, robotic control and intelligent voice dialog systems. Deep deterministic policy gradient (DDPG), an off-policy, model-free reinforcement learning algorithm, achieves higher sample efficiency than traditional methods by using an executor-judge (actor-critic) architecture with experience replay, and is seeing increasingly wide use because it attains state-of-the-art performance on continuous control tasks. However, DDPG is susceptible to environmental complexity and randomness, which can make its performance unstable and leaves convergence of training unguaranteed. In practice this means a large amount of hyper-parameter tuning is needed to obtain good results.
To improve DDPG, MA-BDDPG in the prior art uses a multi-head self-driven DQN as the judge to improve the utilization efficiency of experience-replay samples ([Kalweit and Boedecker, 2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195-206, 2017.), but because MA-BDDPG introduces only a single self-driven multi-head judge, it easily suffers from insufficient exploration of the environment. Multi-DDPG adopts a single self-driven multi-head executor architecture to improve the adaptability of DDPG to multiple tasks ([Yang et al., 2017] Zhaoyang Yang, Kathryn Merrick, Hussein Abbass and Lianwen Jin. Multi-task deep reinforcement learning for continuous action control. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3301-3307, 2017.), but because it introduces only the multi-head executor, its single judge evaluates the actions of the multiple executors inaccurately.
Furthermore, MA-BDDPG and Multi-DDPG, while alleviating to some extent the above-mentioned problems of susceptibility to environmental complexity and randomness, introduce new problems and drawbacks, respectively.
Disclosure of Invention
The invention aims to solve the problem that the DDPG algorithm in prior-art deep reinforcement learning is easily affected by environmental complexity and randomness, so that its performance is unstable and it does not converge easily.
Therefore, the invention provides an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, which adopts a plurality of judges and a plurality of executors and comprises the following steps. When a state is observed, each executor head generates an action vector, forming a set of K action vectors. Given the same state, each judge head splices every action vector into its own shared hidden layer and produces Q values one by one, yielding an intermediate K x K Q value matrix; at the same time, the confidence module outputs a confidence vector c of dimension K. The E-judge layer performs a weighting operation on these two tensors (the Q matrix and the confidence vector) to generate an E-Q vector of dimension K, which represents the potential value of each action vector. Finally, the E-executor layer selects the E-action corresponding to the maximum E-Q value, that is, the action with the greatest potential to obtain the maximum reward, interacts with the current state of the environment, and then receives the reward used to train the agent.
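By way of illustration only, the forward pass just described can be sketched in Python/PyTorch roughly as follows; the names (select_action, actor_heads, critic_heads, confidence_net) are illustrative and not part of the invention, and each judge head is assumed to return a (near-)scalar Q value:

```python
import torch

def select_action(state, actor_heads, critic_heads, confidence_net, noise_std=0.1):
    """Sketch of the E-judge / E-executor selection for one observed state.

    actor_heads:    list of K executor heads, each mapping state -> action vector
    critic_heads:   list of K judge heads, each mapping (state, action) -> Q value
    confidence_net: maps state -> confidence vector c of length K
    """
    with torch.no_grad():
        # K candidate actions, one per executor head
        actions = torch.stack([mu(state) for mu in actor_heads])            # (K, action_dim)
        # Q value matrix V: entry (i, k) = Q_i(state, a_k)
        V = torch.stack([
            torch.stack([Q(state, a).squeeze() for a in actions])
            for Q in critic_heads
        ])                                                                   # (K, K)
        c = confidence_net(state)                                            # (K,)
        # E-Q vector: confidence-weighted sum over judge heads for each action
        e_q = c @ V                                                          # (K,)
        best = torch.argmax(e_q)                                             # index of the E-action
        action = actions[best] + noise_std * torch.randn_like(actions[best])
    return action
```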
In some embodiments, the method may further comprise the following steps:
and a presetting step, namely setting the number K of heads, the number n of batch training samples, the maximum training round number E and mask distribution M.
Randomly initialize the judge and executor networks with K heads, with parameters $\theta^{Q}_k$ and $\theta^{\mu}_k$ ($k = 1, \dots, K$), and copy the weights to the respective target network parameters $\theta^{Q'}_k$ and $\theta^{\mu'}_k$, namely

$$\theta^{Q'}_k \leftarrow \theta^{Q}_k, \qquad \theta^{\mu'}_k \leftarrow \theta^{\mu}_k,$$

where θ denotes the parameters of a model (for example, all parameters of a neural network), and the superscripts Q, μ, Q', μ' denote the judge, the executor, the target judge and the target executor, respectively.
Initialize the experience replay pool R and the confidence network $\theta^{C}$.
Select an action according to the following formula:

$$a_t = \arg\max_{a^k_t}\ \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t \mid \theta^{Q}_i\right) + \mathcal{N}_t, \qquad a^k_t = \mu_k\!\left(s_t \mid \theta^{\mu}_k\right),$$

where $a_t$ is the action actually selected and executed at time t; $c^i_t$ is the confidence of the ith judge head at time t; $Q_i$ is the evaluation (Q value) of the ith judge head, i.e. the output of a function with parameters $\theta^{Q}_i$ whose inputs are a state and an action; $s_t$ is the state of the environment at time t; the candidate actions come from the K executor heads, the kth head $\mu_k$ likewise being the output of a function with parameters $\theta^{\mu}_k$ whose input is a state; and $\mathcal{N}_t$ is random noise at time t.
After performing the selected action, receive an immediate reward $r_t$ and the new state $s_{t+1}$.

Sample a self-driven mask $m_t \sim M$; store the transition tuple $(s_t, a_t, r_t, s_{t+1}, m_t)$ in the experience pool R; and randomly sample n transition tuples as a batch of training data.
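By way of illustration, a minimal sketch of this storage-and-sampling step is given below, assuming a Bernoulli mask distribution M (one common choice for self-driven heads); the class and parameter names are illustrative, not part of the patent:

```python
import random
from collections import deque

import numpy as np

class BootstrappedReplay:
    """Experience pool R storing transition tuples (s, a, r, s_next, m)."""

    def __init__(self, capacity=1_000_000, num_heads=5, mask_prob=0.5):
        self.buffer = deque(maxlen=capacity)
        self.num_heads = num_heads
        self.mask_prob = mask_prob

    def sample_mask(self):
        # m_t ~ M: one {0, 1} flag per head, deciding which heads train on this tuple
        return np.random.binomial(1, self.mask_prob, size=self.num_heads)

    def store(self, s, a, r, s_next):
        m = self.sample_mask()
        self.buffer.append((s, a, r, s_next, m))

    def sample_batch(self, n=1024):
        # randomly sample n transition tuples as one batch of training data
        batch = random.sample(self.buffer, n)
        s, a, r, s_next, m = map(np.asarray, zip(*batch))
        return s, a, r, s_next, m
```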
Update the kth judge head $Q_k$ by minimizing the following loss function:

$$L_k = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - Q_k\!\left(s_i, a_i \mid \theta^{Q}_k\right) \right)^{2},$$

where $L_k$ is the loss value of the kth judge head, used for training optimization; $\frac{1}{n}\sum_{i}$ averages the computed values over the batch of n transition tuples; $y_i$ is the target of the Q value; and $Q_k(s_i, a_i \mid \theta^{Q}_k)$ is the evaluation (Q value) of the kth judge head, whose inputs are the state and action of the ith transition tuple. The target is

$$y_i = r_i + \gamma\, Q'_k\!\left(s_{i+1},\, \mu'_k\!\left(s_{i+1} \mid \theta^{\mu'}_k\right) \mid \theta^{Q'}_k\right),$$

where $r_i$ is the reward of the ith transition tuple and γ is the discount factor. Two functions are nested: the outer one is the Q function of the kth target judge head, whose inputs are the next state and an action; that action is produced by the kth target executor head (the inner function), whose input is the next state.
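A minimal PyTorch sketch of this per-head update follows; gating the loss with the self-driven mask of head k is one reasonable reading of how the mask is used, and all names and tensor shapes are illustrative assumptions:

```python
import torch

def update_judge_head(k, batch, judge_heads, target_judge_heads,
                      target_executor_heads, judge_optims, gamma=0.99):
    # batch tensors: s (n, s_dim), a (n, a_dim), r (n, 1), s_next (n, s_dim), m (n, K)
    s, a, r, s_next, m = batch
    with torch.no_grad():
        # y_i = r_i + gamma * Q'_k(s_{i+1}, mu'_k(s_{i+1}))
        a_next = target_executor_heads[k](s_next)
        y = r + gamma * target_judge_heads[k](s_next, a_next)
    q = judge_heads[k](s, a)                       # Q_k(s_i, a_i | theta^Q_k)
    mask_k = m[:, k:k + 1]                         # self-driven mask for head k
    # masked mean-squared error over the batch of n transition tuples
    loss = (mask_k * (y - q).pow(2)).sum() / mask_k.sum().clamp(min=1.0)
    judge_optims[k].zero_grad()
    loss.backward()
    judge_optims[k].step()
    return loss.item()
```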
Update the kth executor head $\mu_k$ using the policy gradient:

$$\nabla_{\theta^{\mu}_k} J \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{a} Q_k\!\left(s, a \mid \theta^{Q}_k\right)\Big|_{s=s_i,\, a=\mu_k(s_i)}\ \nabla_{\theta^{\mu}_k}\, \mu_k\!\left(s \mid \theta^{\mu}_k\right)\Big|_{s=s_i},$$

where $\nabla_{\theta^{\mu}_k} J$ is the gradient with respect to the kth executor head's model parameters; $\nabla_{a} Q_k(s, a \mid \theta^{Q}_k)$ is the gradient of the kth judge head's Q value with respect to the action a, the action being generated by $\mu_k(s_i \mid \theta^{\mu}_k)$, i.e. by the kth executor head; and $\nabla_{\theta^{\mu}_k} \mu_k(s \mid \theta^{\mu}_k)$ is the gradient of the kth executor head with respect to its model parameters. The two gradients in the formula are multiplied.
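Illustratively, this update can be sketched as follows, relying on automatic differentiation to chain the two gradients of the formula above (names are illustrative assumptions):

```python
def update_executor_head(k, batch, executor_heads, judge_heads, executor_optims):
    s, _, _, _, _ = batch
    # Deterministic policy gradient: ascend Q_k(s, mu_k(s)) with respect to theta^mu_k.
    # Autograd composes grad_a Q_k and grad_theta mu_k, i.e. the product of the two
    # gradients in the formula, so maximizing Q is done by minimizing -Q.
    actor_loss = -judge_heads[k](s, executor_heads[k](s)).mean()
    executor_optims[k].zero_grad()
    actor_loss.backward()
    executor_optims[k].step()
    return actor_loss.item()
```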
Update the kth pair of target network parameters according to:

$$\theta^{Q'}_k \leftarrow \tau\, \theta^{Q}_k + (1-\tau)\, \theta^{Q'}_k, \qquad \theta^{\mu'}_k \leftarrow \tau\, \theta^{\mu}_k + (1-\tau)\, \theta^{\mu'}_k,$$

where $\theta^{Q'}_k$ and $\theta^{\mu'}_k$ are the parameters of the kth target judge head and the kth target executor head, $\theta^{Q}_k$ and $\theta^{\mu}_k$ are the parameters of the kth judge head and the kth executor head, and τ is the update scale parameter.
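A minimal sketch of this soft update, applied to each of the K pairs of target networks (the function name is illustrative):

```python
def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```

It would be called once per head, e.g. soft_update(target_judge_heads[k], judge_heads[k]) and soft_update(target_executor_heads[k], executor_heads[k]).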
Update the confidence network according to the policy gradient:

$$\theta^{C} \leftarrow \theta^{C} + \alpha\, \nabla_{\theta^{C}} \log \pi_{\theta^{C}}\!\left(s_i, a_i\right) Q^{\pi}\!\left(s_i, a_i\right),$$

where $\theta^{C}$ denotes the parameters of the confidence network; α is the learning rate; $\nabla_{\theta^{C}}$ is the gradient with respect to the confidence network; $\pi_{\theta^{C}}(s_i, a_i)$ is the parameterized policy output, i.e. the output of the confidence network; and $Q^{\pi}(s_i, a_i)$ is the evaluated Q value.
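One possible reading of this update, sketched below, treats the confidence vector as a softmax-normalized categorical policy over the K heads and applies a REINFORCE-style step weighted by the evaluated Q value; this interpretation and all names are assumptions for illustration only:

```python
import torch

def update_confidence(confidence_net, conf_optim, s, chosen_head, q_eval):
    # s:           (n, state_dim) batch of states
    # chosen_head: (n,) long tensor, index of the head whose action was executed
    # q_eval:      (n,) evaluated Q values Q^pi(s_i, a_i)
    pi = confidence_net(s)                                   # (n, K) confidence vector
    log_pi = torch.log(pi.clamp(min=1e-8))                   # log pi_theta_C
    log_pi_taken = log_pi.gather(1, chosen_head.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_taken * q_eval.detach()).mean()          # ascend log(pi) * Q
    conf_optim.zero_grad()
    loss.backward()
    conf_optim.step()
    return loss.item()
```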
The method builds on the working principle of DDPG. The original DDPG uses a single judge and a single executor: the executor observes the state and generates an action; the state is fed to the judge, the action is spliced into a hidden layer of the judge, and a Q value is output that evaluates the potential value of the action. The method of the invention combines the advantages of MA-BDDPG and Multi-DDPG while overcoming their drawbacks: a multi-head self-driven architecture improves the evaluation performance of the judge and the efficiency with which the executor explores the environment, which optimizes the algorithm to a certain extent, mitigates the adverse effects of environmental complexity and randomness, accelerates the convergence of the DDPG algorithm, and improves performance while keeping training stable.
Experiments show that on the experimental data set (a simulation environment) the invention attains the fastest training speed, the best performance and the best stability, with specific figures exceeding those of known solutions.
In some embodiments, to address the uneven evaluation capability of the multi-head judge, the invention introduces an adaptive confidence strategy to resolve this issue and further optimize DDPG.
Drawings
Fig. 1 is a schematic diagram of the architecture and operation flow of an adaptive double self-driven DDPG according to an embodiment of the present invention.
FIGS. 2A and 2B are graphs showing the results of experiment one in the example of the present invention.
FIGS. 3A and 3B are graphs showing the results of experiment two in the example of the present invention.
FIGS. 4A and 4B are graphs showing the results of experiment three in the example of the present invention.
FIG. 5 is a diagram illustrating the effect of self-tuning of an adaptive confidence policy in a training process according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic diagram of the architecture and operation flow of the adaptive double self-driven DDPG of the invention.
Double self-driven means that, on the one hand, the judge uses a multi-head self-driven architecture to improve evaluation performance and, on the other hand, the executor uses a multi-head self-driven architecture to improve the efficiency of environment exploration.
As shown in fig. 1, the environment first produces a state s (the situation, provided by the environment, in which the agent currently has to make a decision; the state dimension depends on the environment). When the executor (ii) observes the state, each executor head (there are generally K of them, K being a natural number greater than 2) generates an action vector a (the form, prescribed by the environment, in which the agent interacts with the state; the vector dimension depends on the environment), forming an action vector set A of K vectors in total.
Given the same state (each judge head receives the same state at the same time), the judge (i) splices the action vectors in the action vector set A, one by one, into its own shared hidden layer (the shared layer in the figure) and produces Q values one by one, yielding an intermediate Q value matrix V of dimension K x K (the Q value matrix V is only an intermediate result; its real role is in the subsequent generation of the E-Q value vector). Meanwhile, the confidence module (iii) outputs a confidence vector c of dimension K. The Ensemble-judge layer (E-judge layer) performs a weighting operation on these two tensors (the Q matrix and the confidence vector; "tensor" here covers both vectors and matrices) to generate an Ensemble-Q (E-Q) vector of dimension K, which represents the potential value of each action vector.
Finally, the Ensemble-executor layer (E-executor layer) selects, according to the E-Q value vector, the Ensemble-action (E-action) corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, and interacts with the current state of the environment to obtain a reward. The resulting <state, action, reward> transition tuples can then be used to train the agent with a conventional executor-judge (actor-critic) algorithm and to train the confidence module with a policy gradient algorithm.
The process by which the environment interacts with the agent is: the environment provides a state, the agent responds with an action, the environment produces a new state, and the agent receives a reward. This process and mechanism are common to reinforcement learning, are well known in the art, and are described in the prior art. DDPG, with its single pair of executor and judge, is likewise described in the prior art.
The embodiment of the invention is improved on the basis of the prior art as follows:
(1) expanding the DDPG into a double self-driven multi-head architecture;
(2) a flow in which a plurality of executors generate a plurality of actions at the same time, a plurality of judges weight and score them, and the optimal action is screened out;
(3) an adaptive confidence module is added.
The specific training algorithm may be described as table 1 below:
[Table 1: presetting and training steps of the adaptive double self-driven DDPG algorithm, presented as pseudo code; rendered as images in the original publication and not reproduced here.]
Note: FIG. 1 shows only the core model architecture and a simplified operation flow; it omits details such as the presetting part and the specific parameter training. The algorithm of Table 1 gives the presetting and training steps in more detail: FIG. 1 aids visual understanding, while the Table 1 algorithm supplements it with further detail.
Further details are given below.
Splicing (see the word "splicing" in the middle of FIG. 1)

The judge network consists of multiple fully connected layers. In operation, the action vector (of dimension X, say) is concatenated with the input vector of a hidden layer (of dimension Y, say), finally giving a vector of dimension X + Y that is fed into that hidden layer. This is what was meant above by "the action vectors in the action vector set A are spliced into its shared layer one by one and Q values are generated one by one": the "shared layer" is in fact this hidden layer, the earlier wording "spliced to the shared layer" corresponds to the actual operation "concatenated with the input vector of the hidden layer", and the whole operation can be summarized simply as vector concatenation.
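By way of illustration, one judge head with this concatenation could be sketched as follows (the class name, layer sizes and attribute names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class JudgeHead(nn.Module):
    """One judge (critic) head: the state passes through a shared layer, the action
    vector (dimension X) is concatenated with the hidden input (dimension Y), and a
    single Q value is produced."""

    def __init__(self, state_dim, action_dim, hidden=400, hidden2=300):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden)          # shared hidden layer
        self.fc = nn.Linear(hidden + action_dim, hidden2)   # takes the X + Y dimensional vector
        self.q_out = nn.Linear(hidden2, 1)

    def forward(self, state, action):
        h = torch.relu(self.shared(state))
        h = torch.cat([h, action], dim=-1)   # "splicing": concatenate action and hidden input
        h = torch.relu(self.fc(h))
        return self.q_out(h)
```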
Q value matrix (assuming K heads in total)

$$V_t \in \mathbb{R}^{K \times K},$$

where $V_t$ is the Q value matrix at time t and $\mathbb{R}^{K \times K}$ denotes a real matrix of dimension K x K. This matrix is merely an intermediate result used to produce the E-Q value vector.
Confidence vector (dimension K)

The confidence vector is output by the confidence module (a network):

$$c_t = \left(c^1_t, \dots, c^K_t\right),$$

where $c_t$ is the confidence vector at time t, consisting of K values, and $c^k_t$ is the confidence of the kth judge head at time t, a value greater than 0 and at most 1. The confidence module is a neural network and corresponds to a function $c_t = f(s_t)$: it takes the state as input and outputs the confidence vector.
E-judge execution step

The E-judge multiplies the confidence vector by the Q value matrix, a weighted-sum operation that yields the E-Q vector of dimension K:

$$\text{E-}Q_t = c_t\, V_t \in \mathbb{R}^{K}, \qquad \text{E-}Q^k_t = \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t\right).$$

This step corresponds to the E-judge module in FIG. 1 and to the generation of the E-Q value vector; it corresponds to step 10 of the algorithm.
E-executor execution step

Based on the E-Q vector from the formula above, this layer selects, via an argmax operation, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, to interact with the current state of the environment. This step corresponds to the E-executor module in FIG. 1 and to the generation of the E-action; it corresponds to step 10 of the algorithm.

For more specific training details, refer to the algorithm pseudo code (i.e. Table 1 above).
Among existing methods, MA-BDDPG introduces only a single self-driven multi-head judge, which causes insufficient exploration (it omits the multi-head executor part and the confidence module of FIG. 1, as well as the integrated-evaluation action-selection process); Multi-DDPG introduces only a single self-driven multi-head executor (it omits the multi-head judge part and the confidence module of FIG. 1, as well as the integrated-evaluation action-selection process), which leads to inaccurate evaluation.
The present method combines the advantages of both. At the same time, a problem of uneven evaluation capability arises (some heads may fall into local optima during training, or their training direction may drift, so that the heads end up with unequal capability), so the method introduces an adaptive confidence strategy to address it. The method extends the executors and judges of DDPG into self-driven multi-head networks for exploration. On top of this multi-head architecture, the judge uses integrated (ensemble) Q value evaluation to increase the probability that potentially optimal actions are explored into experience replay. Meanwhile, an adaptive confidence strategy (generated automatically by the confidence module) automatically calibrates the weights of the weighting operation, resolving the inaccurate evaluation caused by the differing evaluation capabilities of the judges.
In embodiments of the invention a large number of experiments were carried out. On Hopper/Walker in the MuJoCo experimental environment, training speed improves by 45% while training remains stable, and average performance (reward) improves by 44%. The specific experiments are described below.
Experiment of
We tested the method in OpenAI's MuJoCo simulator environments, mainly Hopper-v1 and Walker-2d.
Hopper-v1 is an environment in which a one-legged robot learns to hop (the state is a vector of 11 real numbers, i.e. an 11-dimensional vector with one real number per dimension, and similarly below; the action is a vector of 3 real numbers);
Walker-2d is an environment in which a biped robot learns to walk (the state is a vector of 17 real numbers, and the action is a vector of 6 real numbers).
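For reference, the two environments can be instantiated with the OpenAI Gym API roughly as follows (the exact environment IDs and API details depend on the installed gym and MuJoCo versions, so this is only a sketch):

```python
import gym

# Hopper: 11-dimensional observation, 3-dimensional action
hopper = gym.make("Hopper-v1")
# Walker2d: 17-dimensional observation, 6-dimensional action
walker = gym.make("Walker2d-v1")

for env in (hopper, walker):
    print(env.observation_space.shape, env.action_space.shape)
```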
Based on these two environments we performed the following comparisons:
1. our model (with and without the adaptive confidence strategy) against other models;
2. the performance of different confidence strategies;
3. the performance obtained with different numbers of heads in the self-driven multi-head architecture.
In all experiments, the number of training episodes was set to 10000, the experience replay pool size to 1000000, and the batch size to 1024.
Experiment one
Models compared: DDPG (original model), MA-BDDPG (single self-driven multi-judge-head architecture), Multi-DDPG (single self-driven multi-executor-head architecture), DBDDPG (our model, double self-driven multi-head architecture without adaptive confidence), SOUP (our model, double self-driven multi-head architecture with adaptive confidence).
As can be seen from fig. 2A and 2B, our method achieves the highest average reward performance (the solid line reaches the highest value on the vertical axis), the fastest speed (the solid line rises fastest, i.e. has the largest slope), and the best stability (the shaded band is thinnest).
Experiment two
Our approach was compared under different confidence strategies: No Confidence, Fixed Confidence, Decayed Confidence, Self-adaptive Confidence.
As can be seen from FIGS. 3A and 3B, the adaptive confidence strategy achieves the highest average reward performance (the solid line reaches the highest value on the vertical axis), the fastest speed (the solid line rises fastest, largest slope), and the best stability (the shaded band is thinnest).
Experiment three
Our approach is compared with different numbers of heads in the multi-head architecture: DDPG (original model) and SOUP (our model) with 3, 5 and 10 heads.
As can be seen from FIGS. 4A and 4B, as the number of heads increases, the average reward performance becomes higher (the solid line reaches a higher value on the vertical axis), the speed becomes faster (the solid line rises faster, larger slope), and the stability becomes better (the shaded band is thinner).
FIG. 5 shows the self-adjustment of the adaptive confidence strategy during training: the confidence of each head is adjusted dynamically according to the rewards obtained, trained by the policy gradient method.
The foregoing is illustrative of the present invention and is not to be construed as limiting it. Those skilled in the art may derive modified solutions in light of this application, and these also fall within the protective scope of the invention. For example, the invention may also adopt the following modifications:
1. training multiple independent DDPGs (with no shared network) simultaneously and finally fusing their decisions with confidence weights (rather than using multiple heads);
2. extending DDPG with the double self-driven multi-head architecture only, without adding a confidence network for optimization;
3. extending DDPG with a single self-driven architecture and rebalancing it with confidence.
The invention can also be applied in the following technical fields:
1. intelligent autonomous driving systems: accelerating the self-learning of a vehicle (as an agent) in a simulation environment and making it more stable when transferred to the real environment;
2. game AI: a trained agent can keep evolving by interacting with players or with the game, learning by itself to obtain higher rewards and scores in the game;
3. intelligent robotics and related fields: equipped with the algorithm, a robot arm or robot can adapt to the real environment more quickly, meet basic task requirements rapidly, and complete tasks accurately (such as grasping, distinguishing and sorting objects).

Claims (1)

1. An adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, characterized in that a plurality of judges and a plurality of executors are adopted, the agent being a robot, and the operation process comprising the following steps: the agent receives a state provided by the environment, and when the executors in the agent observe the state, each executor head generates an action vector, forming a set of K action vectors; given the same state, the judges in the agent splice each action vector into their own shared hidden layer and generate Q values one by one, thereby producing an intermediate Q value matrix of dimension K x K; meanwhile, a confidence module in the agent outputs a confidence vector c of dimension K; an E-judge layer performs a weighting operation combining the Q value matrix and the confidence vector to generate an E-Q value vector of dimension K, which represents the potential value of each action vector; finally, an E-executor layer in the agent selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, interacts with the current state of the environment to obtain a reward, and the environment produces a new state supplied to the agent, whereby the agent is trained, K being a natural number greater than 2; the specific training algorithm comprises the following steps:
a presetting step, setting the number K of heads;
randomly initializing assessor and executor networks with K headers
Figure FDA0002659594860000011
And copy the weights to respective target network parameters
Figure FDA0002659594860000012
Namely, it is
Figure FDA0002659594860000013
Wherein theta refers to all parameters of the neural network, and the upper right marks Q, mu, Q ', mu' respectively represent an evaluator, an executor, a target evaluator and a target executor;
initialize confidence module θC
selecting an action according to the following formula:

$$a_t = \arg\max_{a^k_t}\ \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t \mid \theta^{Q}_i\right) + \mathcal{N}_t, \qquad a^k_t = \mu_k\!\left(s_t \mid \theta^{\mu}_k\right),$$

wherein $a_t$ is the action actually selected and executed at time t; $c^i_t$ is the confidence of the ith judge head at time t; $Q_i$ is the evaluation (Q value) of the ith judge head, being the output of a function with parameters $\theta^{Q}_i$ whose inputs are a state and an action; $s_t$ is the state of the environment at time t; the candidate action comes from the kth executor head $\mu_k$, likewise the output of a function with parameters $\theta^{\mu}_k$ whose input is a state; and $\mathcal{N}_t$ is random noise at time t;
sampling a self-driven mask $m_t \sim M$, and storing the transition tuple $(s_t, a_t, r_t, s_{t+1}, m_t)$ in the experience pool R, wherein $m_t$ is the self-driven mask and M is the mask distribution; $a_t$ is the action at time t, $r_t$ is the reward obtained by executing action $a_t$ in state $s_t$ at time t, and $s_{t+1}$ is the state at time t+1;
updating the kth judge head $Q_k$ by minimizing the following loss function:

$$L_k = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - Q_k\!\left(s_i, a_i \mid \theta^{Q}_k\right) \right)^{2},$$

wherein $L_k$ represents the loss function value of the kth judge head, used for training optimization; $\frac{1}{n}\sum_i$ denotes averaging the computed values over the batch of n transition tuples; $y_i$ is the target of the Q value; $Q_k(s_i, a_i \mid \theta^{Q}_k)$ is the evaluation (Q value) of the kth judge head, whose inputs are the state and action of the ith transition tuple; $Q_k(\cdot \mid \theta^{Q}_k)$ is the evaluation network of the kth head, $s_i$ is the current state of the ith transition tuple, and $a_i$ is the action executed in the current state $s_i$ of the ith transition tuple;

wherein

$$y_i = r_i + \gamma\, Q'_k\!\left(s_{i+1},\, \mu'_k\!\left(s_{i+1} \mid \theta^{\mu'}_k\right) \mid \theta^{Q'}_k\right),$$

$Q'_k(\cdot \mid \theta^{Q'}_k)$ being the judge target network of the kth head and $\mu'_k(\cdot \mid \theta^{\mu'}_k)$ the executor target network of the kth head; $y_i$ is the target of the Q value; $r_i$ is the reward value of the ith transition tuple; γ is the discount factor; two functions are nested, the outer layer being the Q value function produced by the kth target judge head, whose inputs are the next state and an action, that action being produced by the kth target executor head, which is the inner function and whose input is the next state; $s_{i+1}$ is the next state produced after executing action $a_i$ in the current state $s_i$ of the ith transition tuple;
updating the kth executor head $\mu_k$ using the policy gradient:

$$\nabla_{\theta^{\mu}_k} J \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{a} Q_k\!\left(s, a \mid \theta^{Q}_k\right)\Big|_{s=s_i,\, a=\mu_k(s_i)}\ \nabla_{\theta^{\mu}_k}\, \mu_k\!\left(s \mid \theta^{\mu}_k\right)\Big|_{s=s_i},$$

wherein $\mu_k(\cdot \mid \theta^{\mu}_k)$ is the executor network of the kth head, whose output range is [0, 1], i.e. 0 to 1; $\nabla_{\theta^{\mu}_k} J$ represents the gradient with respect to the kth executor head model parameters; $\nabla_a Q_k(s, a \mid \theta^{Q}_k)$ is the gradient of the kth judge head's Q value with respect to the action a, the action being generated by $\mu_k(s_i \mid \theta^{\mu}_k)$, i.e. by the kth executor head; and $\nabla_{\theta^{\mu}_k} \mu_k(s \mid \theta^{\mu}_k)$ is the gradient of the kth executor head with respect to its model parameters;
updating the kth pair of target network parameters according to the formulas:

$$\theta^{Q'}_k \leftarrow \tau\, \theta^{Q}_k + (1-\tau)\, \theta^{Q'}_k, \qquad \theta^{\mu'}_k \leftarrow \tau\, \theta^{\mu}_k + (1-\tau)\, \theta^{\mu'}_k,$$

wherein $\theta^{Q'}_k$ is the parameter of the kth target judge head; $\theta^{Q}_k$ is the parameter of the kth judge head; $\theta^{\mu'}_k$ is the parameter of the kth target executor head; $\theta^{\mu}_k$ is the parameter of the kth executor head; and τ is the update scale parameter;
updating the confidence module according to the policy gradient:

$$\theta^{C} \leftarrow \theta^{C} + \alpha\, \nabla_{\theta^{C}} \log \pi_{\theta^{C}}\!\left(s_i, a_i\right) Q^{\pi}\!\left(s_i, a_i\right),$$

wherein $\theta^{C}$ represents its parameters; α represents the learning rate; $\nabla_{\theta^{C}}$ is the gradient with respect to the confidence module; $\pi_{\theta^{C}}(s_i, a_i)$ is the parameterized policy output value, i.e. the output value of the confidence module; and $Q^{\pi}(s_i, a_i)$ is the evaluated Q value.
CN201811144686.6A 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method Active CN109523029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Publications (2)

Publication Number Publication Date
CN109523029A CN109523029A (en) 2019-03-26
CN109523029B true CN109523029B (en) 2020-11-03

Family

ID=65771996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811144686.6A Active CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Country Status (1)

Country Link
CN (1) CN109523029B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110442129B (en) * 2019-07-26 2021-10-22 中南大学 Control method and system for multi-agent formation
CN110502721B (en) * 2019-08-02 2021-04-06 上海大学 Continuity reinforcement learning system and method based on random differential equation
CN112782973B (en) * 2019-11-07 2022-10-18 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN111245008B (en) * 2020-01-14 2021-07-16 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249B (en) * 2020-10-22 2021-02-19 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112418436B (en) * 2020-11-19 2022-06-21 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503B (en) * 2020-11-19 2022-06-21 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN114202229B (en) * 2021-12-20 2023-06-30 南方电网数字电网研究院有限公司 Determining method of energy management strategy of micro-grid based on deep reinforcement learning
CN114371634B (en) * 2021-12-22 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Value-function reinforcement learning in Markov games; Michael L. Littman; Cognitive Systems Research; 2001-12-31; pages 1-11 *
Adaptive pairs trading model based on reinforcement learning algorithms; Hu Wenwei; Management Science; 2017-03-31 (No. 2); pages 148-158 *
A survey of deep reinforcement learning; Liu Quan et al.; Chinese Journal of Computers; 2018-01-31; Vol. 41, No. 1; pages 1-22 *

Also Published As

Publication number Publication date
CN109523029A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109523029B (en) Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110515303B (en) DDQN-based self-adaptive dynamic path planning method
CN110794842A (en) Reinforced learning path planning algorithm based on potential field
Whiteson Evolutionary function approximation for reinforcement learning
CN112809689B (en) Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN111856925B (en) State trajectory-based confrontation type imitation learning method and device
CN110442129A (en) A kind of control method and system that multiple agent is formed into columns
JP2020123345A (en) Learning method and learning device for generating training data acquired from virtual data on virtual world by using generative adversarial network (gan), to thereby reduce annotation cost required in learning processes of neural network for autonomous driving, and testing method and testing device using the same
CN115099606A (en) Training method and terminal for power grid dispatching model
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
Yang et al. DDPG with meta-learning-based experience replay separation for robot trajectory planning
CN116050505A (en) Partner network-based intelligent agent deep reinforcement learning method
CN112989017B (en) Method for generating high-quality simulation experience for dialogue strategy learning
KR100850914B1 (en) method for controlling game character
Yang et al. Adaptive inner-reward shaping in sparse reward games
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
Ti et al. Dynamic movement primitives for movement generation using GMM-GMR analytical method
Chen et al. Self-imitation learning for robot tasks with sparse and delayed rewards
CN113255883A (en) Weight initialization method based on power law distribution
Zhao et al. Convolutional fitted Q iteration for vision-based control problems
CN109829490A (en) Modification vector searching method, objective classification method and equipment
Zhao et al. An improved extreme learning machine with adaptive growth of hidden nodes based on particle swarm optimization
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant