CN109523029A - Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent - Google Patents

Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent

Info

Publication number
CN109523029A
Authority
CN
China
Prior art keywords
value
head
actor
action
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811144686.6A
Other languages
Chinese (zh)
Other versions
CN109523029B (en)
Inventor
袁春
郑卓彬
朱新瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201811144686.6A priority Critical patent/CN109523029B/en
Publication of CN109523029A publication Critical patent/CN109523029A/en
Application granted granted Critical
Publication of CN109523029B publication Critical patent/CN109523029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent. A multi-head bootstrapped critic improves value-estimation performance while a multi-head bootstrapped actor improves the efficiency with which the actor explores the environment. The method thereby optimizes the deep deterministic policy gradient (DDPG) algorithm to a certain extent, alleviates the adverse effects of environmental complexity and randomness, accelerates the convergence of DDPG, and improves performance while keeping training stable. Experiments show that on the experimental data sets (simulated environments) the method achieves the fastest training speed, the best performance and the best stability, exceeding known solutions on the specific metrics.

Description

Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent
Technical field
The present invention relates to a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent.
Background art:
Deep reinforcement learning has achieved great success on a series of challenging problems, such as autonomous driving, automated robotics and intelligent spoken dialogue systems. The deep deterministic policy gradient (DDPG) algorithm, an off-policy reinforcement learning algorithm that does not rely on a model of the environment, achieves higher sample efficiency than conventional methods by using an actor-critic architecture with experience replay, and is increasingly widely applied because of its strong performance on continuous control tasks. However, DDPG is easily affected by environmental complexity and randomness, which can make its performance unstable and leave the convergence of training unguaranteed. This means that a large amount of hyperparameter tuning is needed before good results can be obtained.
To improve DDPG, existing methods have proposed MA-BDDPG, which uses a multi-head bootstrapped DQN as the critic to improve the sample efficiency of experience replay (source: [Kalweit and Boedecker, 2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195-206, 2017). However, because MA-BDDPG only introduces a single bootstrapped multi-head critic, it easily leads to insufficient exploration of the environment. Multi-DDPG uses a single bootstrapped multi-head actor architecture to improve the adaptability of DDPG to multiple tasks (source: [Yang et al., 2017] Zhaoyang Yang, Kathryn Merrick, Hussein Abbass, and Lianwen Jin. Multi-task deep reinforcement learning for continuous action control. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3301-3307, 2017). However, because Multi-DDPG only introduces a multi-head actor, its single critic evaluates the actions of the multiple actors inaccurately.
Moreover, MA-BDDPG and Multi-DDPG can only partially alleviate the above susceptibility to environmental complexity and randomness, and each introduces new problems and defects of its own.
Summary of the invention
The present invention aims to solve the problem that the DDPG algorithm in prior-art deep reinforcement learning is easily affected by environmental complexity and randomness, which makes its performance unstable and its training hard to converge.
To this end, the present invention proposes a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent, which uses multiple critic heads and multiple actor heads and includes the following steps: when a state is observed, each actor head generates an action vector, and the K action vectors form an action set. Given the same state, the critic concatenates each action vector one by one into its shared hidden layer and generates a Q value for it, producing an intermediate Q-value matrix (dimension K x K); at the same time a confidence module outputs a confidence vector c (dimension K). An E-critic layer combines these two tensors (the Q-value matrix and the confidence vector) in a weighted operation and generates an E-Q value vector (dimension K) that represents the potential value of each action vector. Finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward; this action interacts with the environment in the current state, after which a reward is received, and the agent is trained accordingly.
In some embodiments, the method may further include the following steps.
Preset step: set the number of heads K, the batch training sample number n, the maximum number of training episodes E, and the mask distribution M.
Randomly initialize the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ), each with K heads, and copy their weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ. Here θ denotes the parameters of a model, for example all parameters of a neural network, and the superscripts Q, μ, Q' and μ' denote the critic, the actor, the target critic and the target actor respectively.
Initialize the experience replay pool R and the confidence network θ^C.
Select an action according to the following formula:

a_t = μ_{k*}(s_t | θ^{μ_{k*}}) + N_t,  with  k* = argmax_k Σ_{i=1..K} c_t^i · Q_i(s_t, μ_k(s_t | θ^{μ_k}) | θ^{Q_i})

where a_t is the action that is actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and the action; s_t is the state of the environment at time t; the candidate actions come from the actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; and N_t is the random noise at time t.
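As an illustration of this selection rule, here is a minimal NumPy sketch, assuming the K actor heads have already proposed their actions and the K critic heads have scored every proposal; the function name, the index convention V_t[i, j] = Q_i(s_t, a_j) and the Gaussian noise model are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def select_action(actions, q_matrix, confidence, noise_scale=0.1, rng=None):
    """Ensemble action selection sketch.

    actions:    (K, action_dim) array, one proposal per actor head.
    q_matrix:   (K, K) array V_t, with V_t[i, j] = Q_i(s_t, a_j) from critic head i.
    confidence: (K,) array c_t, one confidence weight per critic head.
    """
    rng = rng or np.random.default_rng()
    # E-Q value vector: confidence-weighted sum of every critic head's score
    # for each proposed action (dimension K).
    e_q = confidence @ q_matrix
    # E-action: the proposal with the largest ensemble Q value.
    best = int(np.argmax(e_q))
    # Exploration noise N_t added to the chosen action before execution.
    return actions[best] + noise_scale * rng.standard_normal(actions.shape[1])

# Toy usage with K = 3 heads and a 2-dimensional action space.
K, action_dim = 3, 2
rng = np.random.default_rng(0)
actions = rng.uniform(-1, 1, size=(K, action_dim))
q_matrix = rng.normal(size=(K, K))
confidence = rng.uniform(0.1, 1.0, size=K)
print(select_action(actions, q_matrix, confidence, rng=rng))
```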
Execute the selected action and receive the immediate reward r_t and the new state s_{t+1}.
Sample a bootstrap mask m_t ~ M; store the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) in the experience pool R; randomly sample n transition tuples as a batch of training data.
Minimize the following loss function to update the k-th critic head Q_k:

L(θ^{Q_k}) = (1/n) Σ_i ( y_i − Q_k(s_i, a_i | θ^{Q_k}) )^2

where L(θ^{Q_k}) is the loss value of the k-th critic head, used for training optimization; the sum averages the quantity over the batch of n transition tuples; y_i is the target of the Q value; and Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head for the state and action of the i-th transition tuple,

with

y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k})

where y_i is the target of the Q value; r_i is the reward of the i-th transition tuple; γ is the discount factor; and the remaining term is a nesting of two functions: the outer one is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is generated by the k-th target actor head, the inner function, whose input is the next state.
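A minimal PyTorch sketch of this loss, assuming each head is an ordinary module that maps (state, action) batches to Q values of shape (n, 1); the function name and tensor shapes are illustrative, not part of the patented implementation.

```python
import torch
import torch.nn.functional as F

def critic_head_loss(critic_k, target_critic_k, target_actor_k,
                     states, actions, rewards, next_states, gamma=0.99):
    """One-step TD loss for the k-th critic head (sketch).

    critic_k(s, a) and target_critic_k(s, a) return Q values of shape (n, 1);
    target_actor_k(s) returns actions; rewards has shape (n, 1).
    """
    with torch.no_grad():
        # y_i = r_i + gamma * Q'_k(s_{i+1}, mu'_k(s_{i+1}))
        next_actions = target_actor_k(next_states)
        y = rewards + gamma * target_critic_k(next_states, next_actions)
    # L = (1/n) * sum_i (y_i - Q_k(s_i, a_i))^2
    return F.mse_loss(critic_k(states, actions), y)
```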
Update the k-th actor head μ_k using the policy gradient:

∇_{θ^{μ_k}} J ≈ (1/n) Σ_i ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i | θ^{μ_k})} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k})

where ∇_{θ^{μ_k}} J is the gradient with respect to the parameters of the k-th actor head model; ∇_a Q_k(s_i, a | θ^{Q_k}) is the gradient of the Q value of the k-th critic head with respect to the action a, the action being generated by μ_k(s_i | θ^{μ_k}), i.e. the action produced by the k-th actor head; ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k}) is the gradient of the k-th actor head model with respect to its parameters; and the two gradients in the formula are multiplied.
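A hedged PyTorch sketch of this update: as is usual with automatic differentiation, the chain-rule product above is obtained by back-propagating through the critic head, so the step is written as minimizing the negated Q value. Function and optimizer names are illustrative.

```python
def actor_head_update(actor_k, critic_k, actor_optimizer, states):
    """Deterministic policy gradient step for the k-th actor head (sketch).

    Maximizing the critic's score of the actor's own action is implemented
    by minimizing its negative, which autograd expands into exactly the
    product of the two gradients in the formula above.
    """
    actor_loss = -critic_k(states, actor_k(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```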
Update the k-th pair of target network parameters according to the formulas:

θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k}
θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}

where θ^{Q'_k} is the parameter of the k-th target critic head; θ^{Q_k} is the parameter of the k-th critic head; θ^{μ'_k} is the parameter of the k-th target actor head; θ^{μ_k} is the parameter of the k-th actor head; and τ is the update scale parameter.
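The soft update can be sketched in PyTorch as follows; the per-parameter formula matches the one above, and the default value of τ is only an illustrative assumption.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```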
Update the confidence network according to the policy gradient:

θ^C ← θ^C + α ∇_{θ^C} log π_{θ^C} · Q^π(s_i, a_i)

where θ^C denotes its parameters; α denotes the learning rate; ∇_{θ^C} is the gradient with respect to the confidence network; π_{θ^C} is the output value of the parameterized policy, i.e. the output value of the confidence network; and Q^π(s_i, a_i) is the evaluated Q value.
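The patent gives this update only in the generic policy-gradient form, so the following PyTorch sketch rests on an explicit assumption: the normalized confidence vector is read as a categorical policy over the K critic heads and updated REINFORCE-style, with the evaluated Q value as the weight. All names here are hypothetical.

```python
import torch

def confidence_update(confidence_net, conf_optimizer, states, chosen_heads, q_values):
    """REINFORCE-style update of the confidence network (illustrative sketch only).

    `chosen_heads` is a LongTensor of shape (n,) holding the index of the head
    whose action was executed; `q_values` has shape (n, 1).
    """
    conf = confidence_net(states).clamp_min(1e-6)          # (n, K), each value in (0, 1]
    probs = conf / conf.sum(dim=1, keepdim=True)           # normalize to a distribution over heads
    log_pi = torch.log(probs.gather(1, chosen_heads.view(-1, 1)))
    loss = -(log_pi * q_values.detach()).mean()            # gradient ascent on log(pi) * Q
    conf_optimizer.zero_grad()
    loss.backward()
    conf_optimizer.step()
    return loss.item()
```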
The present invention builds on the working principle of DDPG. The original DDPG uses a single critic and a single actor: the actor observes the state and generates an action, the state is the input of the critic, the action is concatenated into the critic's hidden layer, and the critic outputs a Q value that evaluates the potential value of the action. The method of the invention combines the advantages of MA-BDDPG and Multi-DDPG while resolving the shortcomings of both: the multi-head bootstrapped structure improves the critic's evaluation performance and, at the same time, improves the efficiency with which the actor explores the environment. The algorithm is thereby optimized to a certain extent, the adverse effects of environmental complexity and randomness mentioned above are alleviated, the convergence of the DDPG algorithm is accelerated, and performance is improved while training remains stable.
Experiments show that on the experimental data sets (simulated environments) the present invention achieves the fastest training speed, the best performance and the best stability, exceeding known solutions on the specific metrics.
In some embodiments, to address the problem that the evaluation abilities of the multiple critic heads are uneven, the present invention introduces a self-adaptive confidence strategy, further optimizing DDPG.
Detailed description of the invention
Fig. 1 is a schematic diagram of the architecture and operational flow of the self-adaptive double bootstrapped DDPG according to an embodiment of the present invention.
Figs. 2A and 2B are schematic diagrams of the results of experiment one in an embodiment of the present invention.
Figs. 3A and 3B are schematic diagrams of the results of experiment two in an embodiment of the present invention.
Figs. 4A and 4B are schematic diagrams of the results of experiment three in an embodiment of the present invention.
Fig. 5 is a schematic diagram of the self-adjustment of the self-adaptive confidence strategy during training in an embodiment of the present invention.
Specific embodiment
Fig. 1 is a schematic diagram of the architecture and operational flow of our self-adaptive double bootstrapped DDPG.
The term "double bootstrapped" means: on the one hand, the critic uses the multi-head bootstrap technique to improve evaluation performance; on the other hand, the actor uses the multi-head bootstrap structure to improve the efficiency of exploring the environment.
As shown in Fig. 1, the environment first generates a state s (the current situation, provided by the environment, in which the agent needs to make a decision; the dimension of the state depends on the environment). When the actor (2) observes the state, each actor head (there are generally K heads, K being a natural number greater than 2) generates an action vector a (the form of interaction with the state required by the environment; the vector dimension depends on the environment), and these form the action set A (K vectors in total).
Given the same state (every critic head receives the same state at the same time), the critic (1) concatenates the action vectors of the action set A one by one into its own shared hidden layer (the "shared layer" in the figure) and generates Q values one by one, producing the intermediate Q-value matrix V (dimension K x K; V is only an intermediate result, and what really matters is the E-Q value vector generated from it afterwards). At the same time, the confidence module (3) outputs the confidence vector c (dimension K). The Ensemble-critic layer (E-critic layer) combines these two tensors (the Q-value matrix and the confidence vector; "tensor" here covers both vectors and matrices) in a weighted operation and generates an Ensemble-Q value vector (E-Q value, dimension K), which represents the potential value of each action vector.
Finally, the Ensemble-actor layer (E-actor layer) selects, according to the E-Q value vector, the Ensemble-action (E-action) corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, and uses it to interact with the environment in the current state, receiving a reward afterwards. With the resulting <state, action, reward> transition tuple, the agent can be trained with the traditional actor-critic algorithm, and the confidence module is trained with a policy gradient algorithm.
The process by which the environment interacts with the agent is as follows: the environment provides a state, the agent interacts with it through an action, and the environment produces a new state and gives the agent a reward. This process and mechanism are common to reinforcement learning, exist in the prior art, and can be found in the related descriptions of the prior art. DDPG's architecture of a single actor-critic pair can likewise be found in descriptions in the prior art.
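For orientation, such an interaction loop might look as follows (a sketch assuming a gym-style environment API; `agent.act`, `agent.train_step` and `replay_buffer.store` are placeholder names, not part of the patent).

```python
def run_episode(env, agent, replay_buffer, max_steps=1000):
    """Generic agent-environment interaction loop (sketch)."""
    state = env.reset()
    episode_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                      # ensemble action selection
        next_state, reward, done, _ = env.step(action)
        replay_buffer.store(state, action, reward, next_state)
        agent.train_step(replay_buffer)                # sample a batch and update the heads
        episode_reward += reward
        state = next_state
        if done:
            break
    return episode_reward
```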
On the basis of the prior art, the embodiment of the present invention makes the following improvements:
(1) extending DDPG to a double bootstrapped multi-head architecture;
(2) having multiple actors generate multiple actions that multiple critics weight and score simultaneously, so that the optimal action is selected;
(3) adding a self-adaptive confidence module.
The specific training algorithm can be described as Table 1 below.
Note: Fig. 1 only shows the core model architecture and a simplified operational flow; it does not include some details, such as the preset part and the training of specific parameters, whereas the algorithm shown in Table 1 gives the preset and training steps in more detail. Fig. 1 is for convenient, intuitive understanding, and the algorithm of Table 1 is the more detailed supplement.
Some details are further explained below.
Concatenation (see the "concatenate" label in the middle of Fig. 1)
The critic network consists of multiple fully connected layers. During operation, the action vector (say of dimension X) is concatenated with the input vector of the hidden layer (say of dimension Y). (The statement above, "concatenates the action vectors of the action set A one by one into its own shared layer and generates Q values one by one", means the same thing: the "shared layer" is in fact the hidden layer, and "concatenated into the shared layer" actually means "concatenated with the input vector of the hidden layer"; the present passage is the more detailed expression.) The dimension of the final hidden-layer input is therefore X + Y (an X-dimensional vector concatenated with a Y-dimensional vector gives an (X + Y)-dimensional vector), which can simply be regarded as vector concatenation.
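A minimal PyTorch sketch of such a critic, assuming one shared encoder/hidden layer and K linear Q heads; the class name, layer sizes and activation choices are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """Critic with a shared hidden layer and K bootstrapped Q heads (sketch).

    The action vector (dimension X) is concatenated with the hidden-layer
    input derived from the state (dimension Y), so the hidden layer sees an
    (X + Y)-dimensional vector.
    """

    def __init__(self, state_dim, action_dim, num_heads=3, hidden=128):
        super().__init__()
        self.state_encoder = nn.Linear(state_dim, hidden)          # produces the Y-dimensional part
        self.shared = nn.Sequential(                               # shared hidden layer
            nn.Linear(hidden + action_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(hidden, 1) for _ in range(num_heads))        # one Q head per bootstrap head

    def forward(self, state, action):
        h = torch.relu(self.state_encoder(state))
        h = self.shared(torch.cat([h, action], dim=-1))            # the concatenation step
        return torch.cat([head(h) for head in self.heads], dim=-1)  # (batch, K) Q values

# Example: scoring one proposed action with all K critic heads yields one column of the Q matrix.
critic = MultiHeadCritic(state_dim=11, action_dim=3, num_heads=3)
q_column = critic(torch.randn(1, 11), torch.randn(1, 3))  # shape (1, 3)
```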
Q-value matrix expression (assuming K heads in total):
V_t ∈ R^{K×K}
where V_t is the Q-value matrix at time t, "∈" means "belongs to" in mathematical notation, and R^{K×K} denotes a real matrix of dimension K by K. This matrix is merely an intermediate result for creating the E-Q value vector.
Confidence vector (dimension K)
It is output by the confidence module (network):
c_t = f(s_t)
where c_t is the confidence vector at time t, composed of K values, and c_t^k is the confidence of the k-th critic head at time t, a value greater than 0 and less than or equal to 1. The confidence module is a neural network, equivalent to a function c_t = f(s_t): the input is the state and the output is the confidence vector.
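A minimal PyTorch sketch of such a confidence module, assuming a small fully connected network with a sigmoid output so that every confidence lies in (0, 1); the architecture details are assumptions, since the patent only specifies the input, the output and the value range.

```python
import torch
import torch.nn as nn

class ConfidenceModule(nn.Module):
    """Confidence network c_t = f(s_t) (sketch): state in, K confidences out."""

    def __init__(self, state_dim, num_heads=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_heads), nn.Sigmoid())

    def forward(self, state):
        return self.net(state)   # (batch, K) confidence vector

# Example: confidence weights for a Hopper-v1 style 11-dimensional state.
conf = ConfidenceModule(state_dim=11, num_heads=3)(torch.randn(1, 11))
```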
The E-critic executes the step given by the following formula; the product c_t · V_t is exactly the operation performed by the E-critic, a weighted-sum process in which the confidence vector is multiplied with the Q-value matrix to generate the E-Q value vector of dimension K:
EQ_t = c_t · V_t,  i.e.  EQ_t(j) = Σ_{i=1..K} c_t^i · V_t(i, j)
(writing V_t(i, j) for the Q value that the i-th critic head assigns to the j-th proposed action).
This step corresponds to the E-critic model (1) in the figure and to the step of generating the E-Q value vector; it corresponds to algorithm step 10.
The E-actor executes the step
a_t = the action corresponding to argmax_j EQ_t(j)
As in the formula above, this layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value (the argmax operation), i.e. the action with the greatest potential to obtain the maximum reward, which interacts with the environment in the current state.
This step corresponds to the E-actor model (2) in the figure and to the step of generating the E-action; it corresponds to algorithm step 10.
More specific training details can be found in the algorithm pseudo-code (i.e. Table 1 above).
Among existing methods, MA-BDDPG only introduces a single bootstrapped multi-head critic, which leads to insufficient exploration (compared with our Fig. 1 it lacks the multi-head actor part, the confidence module and the process of integrated evaluation for action selection); Multi-DDPG only introduces a single bootstrapped multi-head actor (compared with our Fig. 1 it lacks the multi-head critic part, the confidence module and the process of integrated evaluation for action selection), which leads to inaccurate evaluation.
Our method combines the advantages of both, but the problem of uneven evaluation ability then appears (during training some heads may fall into local optima or drift in their training direction, so that the final abilities of the heads are uneven), and the self-adaptive confidence strategy is introduced to solve it. The method extends both the actor and the critic of DDPG to bootstrapped multi-head networks for exploration. Building on the multi-head architecture, the critic performs evaluation with an integrated Q value, which raises the probability that potentially optimal actions are explored through experience replay. At the same time, the self-adaptive confidence strategy (generated automatically by the confidence module) is proposed to calibrate the weights of the weighted-sum operation automatically, solving the inaccurate evaluation caused by the uneven evaluation abilities of the different critic heads.
In an embodiment of the present invention, we conducted extensive experiments with this technique. On Hopper and Walker in the MuJoCo experimental environment, training speed is improved by 45% while training remains stable, and average performance (reward) is improved by 44%. The specific experiments are introduced below.
Experiment
We tested our method in the MuJoCo simulator environments of OpenAI, mainly Hopper-v1 and Walker2d.
Hopper-v1 is an environment in which a one-legged robot learns to hop (the state is a vector of 11 real numbers, i.e. an 11-dimensional vector in which each dimension is a real number, and likewise below; the action is a vector of 3 real numbers).
Walker2d is an environment in which a two-legged robot learns to walk (the state is a vector of 17 real numbers; the action is a vector of 6 real numbers).
Based on these two environments, we carried out the following experimental comparisons:
1. comparing our model (with and without the self-adaptive confidence strategy) against other models;
2. comparing the performance of different confidence strategies;
3. using bootstrapped multi-head architectures with different numbers of heads and comparing the performance gains.
In all experiments, the number of training episodes was set to 10000, the experience replay buffer size to 1000000, and the batch size to 1024.
Experiment one
Models compared: DDPG (the original model), MA-BDDPG (single bootstrapped multi-critic-head architecture), Multi-DDPG (single bootstrapped multi-actor-head architecture), DBDDPG (our model, double bootstrapped multi-head architecture, without self-adaptive confidence) and SOUP (our model, double bootstrapped multi-head architecture, with self-adaptive confidence).
As can be seen from Figs. 2A and 2B, our method obtains the highest average reward (the solid line is highest on the vertical axis), is the fastest (the solid line rises fastest, with the largest slope), and has the best stability (the shaded band is thinnest).
Experiment two
We compare our method paired with different confidence strategies: No Confidence, Fixed Confidence, Decayed Confidence, and Self-Adaptive Confidence (our method).
As can be seen from Figs. 3A and 3B, the self-adaptive confidence strategy obtains the highest average reward (the solid line is highest on the vertical axis), is the fastest (the solid line rises fastest, with the largest slope), and has the best stability (the shaded band is thinnest).
Experiment three
We compare our method with multi-head architectures of different sizes: DDPG (the original model) and SOUP (our model, with 3, 5 and 10 heads).
As can be seen from Figs. 4A and 4B, as the number of heads increases, the average reward obtained is higher (the solid line is higher on the vertical axis), the speed is faster (the solid line rises faster, with a larger slope), and the stability is better (the shaded band is thinner).
Fig. 5 shows the self-adjustment of our self-adaptive confidence strategy during training. The confidence of each head adjusts dynamically with differences in reward and is trained by the policy gradient method.
The above merely illustrates examples of the present invention and is not to be taken as limiting it. Those skilled in the art can also devise variants of the scheme inspired by this application, and such variants likewise fall within the protection scope of the present invention. For example, the present invention may also use the following variants:
1. training multiple DDPG instances (without shared networks) simultaneously and finally fusing their decisions with confidences (instead of using multiple heads);
2. extending DDPG with the double bootstrapped multi-head structure but without adding the confidence network;
3. applying a single bootstrapped extension to DDPG and using confidences for balancing.
The present invention can also be applied in the following technical fields:
1. intelligent driving assistance: enabling a vehicle (as an agent) to learn by itself in a simulated environment with improved speed and to transfer relatively stably to the real environment;
2. game AI: a trained agent can interact with players, or with the game itself, and keep evolving, learning by itself to obtain higher rewards and scores in the game;
3. intelligent robotics and similar fields: equipped with our algorithm, a robotic arm or robot can adapt to the real environment faster, quickly meet basic task requirements, and complete tasks accurately (such as gripping objects, distinguishing objects, screening objects, and so on).

Claims (10)

1. A self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent, using multiple critic heads and multiple actor heads, the operational flow comprising the following steps: when a state is observed, each actor head generates an action vector, and the K action vectors form an action set; given the same state, the critic concatenates each action vector into its own shared hidden layer and generates Q values one by one, producing an intermediate Q-value matrix of dimension K x K; at the same time, a confidence module outputs a confidence vector c of dimension K; an E-critic layer combines the two tensors, the Q-value matrix and the confidence vector, in a weighted operation to generate an E-Q value vector of dimension K, which represents the potential value of each action vector; finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, which interacts with the environment in the current state, after which a reward is received, so as to train the agent; wherein K is a natural number greater than 2.
2. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising a preset step of setting the number of heads K.
3. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising the following step: randomly initializing the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ), each with K heads, and copying their weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ, wherein θ denotes the parameters of a model, for example all parameters of a neural network, and the superscripts Q, μ, Q' and μ' denote the critic, the actor, the target critic and the target actor respectively.
4. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising the following step: initializing the confidence network θ^C.
5. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, comprising the following step: selecting an action according to the following formula:

a_t = μ_{k*}(s_t | θ^{μ_{k*}}) + N_t,  with  k* = argmax_k Σ_{i=1..K} c_t^i · Q_i(s_t, μ_k(s_t | θ^{μ_k}) | θ^{Q_i})

wherein a_t is the action that is actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and the action; s_t is the state of the environment at time t; the candidate actions come from the actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; and N_t is the random noise at time t.
6. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 2, wherein a bootstrap mask m_t ~ M is sampled and the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) is stored in the experience pool R.
7. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, comprising the following step: minimizing the following loss function to update the k-th critic head Q_k:

L(θ^{Q_k}) = (1/n) Σ_i ( y_i − Q_k(s_i, a_i | θ^{Q_k}) )^2

wherein L(θ^{Q_k}) is the loss value of the k-th critic head, used for training optimization; the sum averages the quantity over the batch of n transition tuples; y_i is the target of the Q value; and Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head for the state and action of the i-th transition tuple,

with

y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k})

wherein y_i is the target of the Q value; r_i is the reward of the i-th transition tuple; γ is the discount factor; and the remaining term is a nesting of two functions: the outer one is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is generated by the k-th target actor head, the inner function, whose input is the next state.
8. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 7, further comprising the following step: updating the k-th actor head μ_k using the policy gradient:

∇_{θ^{μ_k}} J ≈ (1/n) Σ_i ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i | θ^{μ_k})} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k})

wherein ∇_{θ^{μ_k}} J is the gradient with respect to the parameters of the k-th actor head model; ∇_a Q_k(s_i, a | θ^{Q_k}) is the gradient of the Q value of the k-th critic head with respect to the action a, the action being generated by μ_k(s_i | θ^{μ_k}), i.e. the action produced by the k-th actor head; ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k}) is the gradient of the k-th actor head model with respect to its parameters; and the two gradients in the formula are multiplied.
9. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 8, further comprising the following step: updating the k-th pair of target network parameters according to the formulas:

θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k}
θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}

wherein θ^{Q'_k} is the parameter of the k-th target critic head; θ^{Q_k} is the parameter of the k-th critic head; θ^{μ'_k} is the parameter of the k-th target actor head; θ^{μ_k} is the parameter of the k-th actor head; and τ is the update scale parameter.
10. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 9, further comprising the following step: updating the confidence network according to the policy gradient:

θ^C ← θ^C + α ∇_{θ^C} log π_{θ^C} · Q^π(s_i, a_i)

wherein θ^C denotes its parameters; α denotes the learning rate; ∇_{θ^C} is the gradient with respect to the confidence network; π_{θ^C} is the output value of the parameterized policy, i.e. the output value of the confidence network; and Q^π(s_i, a_i) is the evaluated Q value.
CN201811144686.6A 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method Active CN109523029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Publications (2)

Publication Number Publication Date
CN109523029A true CN109523029A (en) 2019-03-26
CN109523029B CN109523029B (en) 2020-11-03

Family

ID=65771996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811144686.6A Active CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Country Status (1)

Country Link
CN (1) CN109523029B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110502721A (en) * 2019-08-02 2019-11-26 上海大学 A kind of continuity reinforcement learning system and method based on stochastic differential equation
CN111245008A (en) * 2020-01-14 2020-06-05 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112418436A (en) * 2020-11-19 2021-02-26 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503A (en) * 2020-11-19 2021-03-05 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN114202229A (en) * 2021-12-20 2022-03-18 南方电网数字电网研究院有限公司 Method and device for determining energy management strategy, computer equipment and storage medium
CN114371634A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after experience playback

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MICHAEL L. LITTMAN: "Value-function reinforcement learning in Markov games", 《COGNITIVE SYSTEMS RESEARCH》 *
Liu Quan et al.: "A Survey of Deep Reinforcement Learning", Chinese Journal of Computers *
Hu Wenwei: "Adaptive Pairs Trading Model Based on Reinforcement Learning Algorithms", Journal of Management Science *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110442129B (en) * 2019-07-26 2021-10-22 中南大学 Control method and system for multi-agent formation
CN110502721B (en) * 2019-08-02 2021-04-06 上海大学 Continuity reinforcement learning system and method based on random differential equation
CN110502721A (en) * 2019-08-02 2019-11-26 上海大学 A kind of continuity reinforcement learning system and method based on stochastic differential equation
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN111245008A (en) * 2020-01-14 2020-06-05 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111310384B (en) * 2020-01-16 2024-05-21 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
WO2021239069A1 (en) * 2020-05-28 2021-12-02 平安科技(深圳)有限公司 Multi-round dialogue management method and apparatus, and computer device
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN111899728B (en) * 2020-07-23 2024-05-28 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112446503A (en) * 2020-11-19 2021-03-05 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112418436A (en) * 2020-11-19 2021-02-26 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503B (en) * 2020-11-19 2022-06-21 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112418436B (en) * 2020-11-19 2022-06-21 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN114202229B (en) * 2021-12-20 2023-06-30 南方电网数字电网研究院有限公司 Determining method of energy management strategy of micro-grid based on deep reinforcement learning
CN114202229A (en) * 2021-12-20 2022-03-18 南方电网数字电网研究院有限公司 Method and device for determining energy management strategy, computer equipment and storage medium
CN114371634A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after experience playback

Also Published As

Publication number Publication date
CN109523029B (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
Knox et al. Tamer: Training an agent manually via evaluative reinforcement
CN110321666A (en) Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111695690A (en) Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN109740741B (en) Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles
CN111856925B (en) State trajectory-based confrontation type imitation learning method and device
CN114952828A (en) Mechanical arm motion planning method and system based on deep reinforcement learning
Wang et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models
CN109858574A (en) The autonomous learning method and system of intelligent body towards man-machine coordination work
Toubman et al. Modeling behavior of computer generated forces with machine learning techniques, the nato task group approach
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
KR100850914B1 (en) method for controlling game character
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Knox et al. Understanding human teaching modalities in reinforcement learning environments: A preliminary report
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
Cheng et al. An autonomous inter-task mapping learning method via artificial neural network for transfer learning
CN110070185A (en) A method of feedback, which is assessed, from demonstration and the mankind interacts intensified learning
Huang Fetching Policy of Intelligent Robotic Arm Based on Multiple-agents Reinforcement Learning Method
CN116540535A (en) Progressive strategy migration method based on self-adaptive dynamics model
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN114770497B (en) Search and rescue method and device of search and rescue robot and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant