CN109523029A - Adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent - Google Patents
- Publication number
- CN109523029A (application CN201811144686.6A)
- Authority
- CN
- China
- Prior art keywords
- value
- head
- actor
- action
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent. Using multi-head self-driven structures, it improves the evaluation performance of the critic while improving the efficiency with which the actor explores the environment. It can thereby optimize the deep deterministic policy gradient (DDPG) algorithm to a certain extent, mitigate the adverse effects of environmental complexity and randomness, accelerate the convergence of DDPG, and improve performance while keeping training stable. Experiments show that on the experimental data sets (simulated environments) the invention achieves three advantages: the fastest training speed, the best performance, and the best stability, exceeding known solutions in the specific measured values.
Description
Technical field
The present invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent.
Background technique:
Deep reinforcement learning has achieved immense success on a series of challenging problems, such as autonomous driving, robot automation, and intelligent voice dialogue systems. Deep deterministic policy gradient (DDPG), a model-free off-policy reinforcement learning algorithm, achieves higher sample efficiency than conventional methods by using an actor-critic architecture with experience replay, and is increasingly widely applied because of its strong performance on continuous control tasks. However, DDPG is easily affected by environmental complexity and randomness, which may make its performance unstable and leave convergence of the training result unguaranteed. This means a large amount of hyperparameter tuning is needed to obtain good results.
To improve the effectiveness of DDPG, an existing method, MA-BDDPG, uses a multi-head self-driven (bootstrapped) DQN as the critic to improve the sample efficiency of experience replay (source: [Kalweit and Boedecker, 2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.). However, because MA-BDDPG introduces only a single self-driven multi-head critic, it easily leads to insufficient exploration of the environment.
Multi-DDPG uses a single self-driven multi-head actor architecture to improve the adaptability of DDPG to multiple tasks (source: [Yang et al., 2017] Zhaoyang Yang, Kathryn Merrick, Hussein Abbass, and Lianwen Jin. Multi-task deep reinforcement learning for continuous action control. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3301–3307, 2017.). However, because Multi-DDPG introduces only a multi-head actor, its single critic evaluates the actions of the multiple actors inaccurately.
Moreover, MA-BDDPG and Multi-DDPG can only partially alleviate the above susceptibility to environmental complexity and randomness, and each introduces new problems and defects of its own.
Summary of the invention
The purpose of the present invention is to solve the problem in prior-art deep reinforcement learning that the DDPG algorithm is easily affected by environmental complexity and randomness, which makes its performance unstable and convergence difficult.
To this end, the present invention proposes an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, using multiple critic heads and multiple actor heads. The method includes the following steps: when a state is observed, each actor head generates an action vector, forming a set of K action vectors. Given the same state, the critic concatenates each action vector into its own shared hidden layer and generates Q values one by one, producing an intermediate Q-value matrix (dimension K × K); at the same time, a confidence module outputs a confidence vector c (dimension K). An E-critic layer combines these two tensors (the Q-value matrix and the confidence vector) in a weighted operation and generates an E-Q value vector (dimension K) that represents the potential value of each action vector. Finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e., the action most likely to obtain the maximum reward; this action interacts with the environment in the current state, after which a reward is received, thereby training the agent.
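The pipeline of this step (K actor heads, a K × K Q-value matrix, confidence weighting, then an argmax) can be sketched in plain Python. The stand-in functions below for the actor heads, critic heads, and confidence module are illustrative assumptions, not the trained networks of the invention:

```python
# Toy sketch of the E-critic / E-actor selection step. The actor, critic,
# and confidence "networks" are hand-written stand-ins for illustration.
K = 3  # number of heads (a natural number greater than 2 in the invention)

def actor_head(k, state):          # k-th actor head: state -> action vector
    return [state[0] * (k + 1), state[1]]

def critic_head(i, state, action): # i-th critic head: (state, action) -> Q value
    return sum(state) + sum(action) * (0.5 + 0.1 * i)

def confidence(state):             # confidence module: state -> K confidences in (0, 1]
    return [0.5, 0.3, 0.2]

state = [1.0, 2.0]
actions = [actor_head(k, state) for k in range(K)]                    # action set A
V = [[critic_head(i, state, a) for a in actions] for i in range(K)]   # Q matrix, K x K
c = confidence(state)
# E-Q vector: confidence-weighted sum over critic heads, one value per action
EQ = [sum(c[i] * V[i][j] for i in range(K)) for j in range(K)]
best = max(range(K), key=lambda j: EQ[j])                             # argmax (E-actor layer)
e_action = actions[best]                                              # the E-action
```

With these toy functions the third actor head's action gets the highest E-Q value and is selected; in the invention the same weighted argmax is performed over the outputs of the trained networks.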
In some embodiments, the method may further include the following steps:
A presetting step: set the number of heads K, the batch size n, the maximum number of training episodes E, and the mask distribution M.
Randomly initialize the critic network and actor network with K heads, Q(s, a | θ^Q) and μ(s | θ^μ), and copy the weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ. Here θ refers to the parameters of a model, for example all parameters of a neural network; the superscripts Q, μ, Q', μ' denote the critic, the actor, the target critic, and the target actor, respectively.
Initialize the experience replay pool R and the confidence network θ_C.
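These presetting and initialization steps can be sketched as follows; the toy weight vectors standing in for the network parameters θ are illustrative assumptions:

```python
import random, copy
from collections import deque

K, n, E = 3, 1024, 10000          # number of heads, batch size, max training episodes
random.seed(0)

# Randomly initialize K critic heads and K actor heads (toy 4-weight vectors
# standing in for theta^Q and theta^mu).
theta_Q  = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(K)]
theta_mu = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(K)]
# Copy the weights to the target networks: theta^Q' <- theta^Q, theta^mu' <- theta^mu.
theta_Q_target  = copy.deepcopy(theta_Q)
theta_mu_target = copy.deepcopy(theta_mu)
# Initialize the experience replay pool R and the confidence network theta_C.
R = deque(maxlen=1000000)
theta_C = [0.0] * 4
```

The deep copy matters: the targets start equal to the online networks but are separate objects, so the later soft updates change them independently.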
Select an action according to the following formula:
a_t = argmax_{a_k} Σ_{i=1}^{K} c_t^i Q_i(s_t, a_k | θ^{Q_i}), with candidate actions a_k = μ_k(s_t | θ^{μ_k}) + N_t, k = 1, …, K.
Here a_t is the action actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and an action; s_t is the state of the environment at time t; the candidate actions come from the K actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; N_t is random noise at time t.
After the selected action is executed, an immediate reward r_t and a new state s_{t+1} are received.
Sample a self-driving (bootstrap) mask m_t ~ M; store the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) in the experience pool R; randomly sample n transition tuples as one batch of training data.
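A sketch of this replay step; since the mask distribution M is not specified here, a per-head Bernoulli mask (as commonly used in bootstrapped multi-head training) is assumed for illustration:

```python
import random
from collections import deque

random.seed(0)
K, n = 3, 4                     # heads; small batch size for illustration
R = deque(maxlen=1000000)       # experience replay pool

def sample_mask():              # m_t ~ M: which heads train on this transition
    return [1 if random.random() < 0.5 else 0 for _ in range(K)]

# Store some transitions (s_t, a_t, r_t, s_{t+1}, m_t) with toy values.
for t in range(10):
    s, a, r, s_next = [float(t)], [0.1 * t], 1.0, [float(t + 1)]
    R.append((s, a, r, s_next, sample_mask()))

batch = random.sample(list(R), n)   # n transition tuples as one training batch
```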
Minimize the following loss function to update the k-th critic head Q_k:
L_k = (1/n) Σ_{i=1}^{n} (y_i − Q_k(s_i, a_i | θ^{Q_k}))²
Here L_k denotes the loss value of the k-th critic head, used for training; the sum averages the values computed over the n transition tuples of a batch; y_i is the target for the Q value; Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head, whose inputs are the state and action of the i-th transition tuple,
where
y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k}).
Here y_i is the target for the Q value; r_i is the reward of the i-th transition tuple; γ is the discount factor; the last term is a nesting of two functions: the outer one is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is produced by the k-th target actor head, the inner function, whose input is the next state.
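A numeric sketch of this loss computation, with toy linear stand-ins for the critic head, the target critic head, and the target actor head:

```python
# Toy TD target: y_i = r_i + gamma * Q'_k(s_{i+1}, mu'_k(s_{i+1}))
gamma = 0.99

def mu_target(s):              # k-th target actor head (toy stand-in)
    return [0.5 * s[0]]

def Q_target(s, a):            # k-th target critic head (toy stand-in)
    return s[0] + a[0]

def Q_k(s, a, w):              # k-th critic head with scalar parameter w
    return w * (s[0] + a[0])

batch = [([1.0], [0.2], 1.0, [2.0]),   # (s_i, a_i, r_i, s_{i+1})
         ([2.0], [0.4], 0.0, [3.0])]
w = 0.8
ys, qs = [], []
for s, a, r, s_next in batch:
    y = r + gamma * Q_target(s_next, mu_target(s_next))  # TD target y_i
    ys.append(y)
    qs.append(Q_k(s, a, w))
loss = sum((y - q) ** 2 for y, q in zip(ys, qs)) / len(batch)  # mean squared error L_k
```

In the invention the loss is minimized by gradient descent on θ^{Q_k}; here it is only evaluated to make the formula concrete.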
Update the k-th actor head μ_k using the policy gradient:
∇_{θ^{μ_k}} J ≈ (1/n) Σ_{i=1}^{n} ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i)} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k})
Here ∇_{θ^{μ_k}} J denotes the gradient with respect to the parameters of the k-th actor head; ∇_a Q_k is the gradient of the Q value of the k-th critic head with respect to the action a; the action is generated by μ_k(s_i | θ^{μ_k}), i.e. the k-th actor head; ∇_{θ^{μ_k}} μ_k is the gradient of the k-th actor head with respect to its model parameters; the two gradients at the end of the formula are multiplied together.
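The chain rule in this update (the gradient of Q_k with respect to the action, multiplied by the gradient of the actor head with respect to its parameters) can be checked on one-dimensional toy functions whose gradients are known in closed form; the functions below are illustrative assumptions:

```python
# Toy deterministic policy gradient:
#   mu_k(s) = theta * s          (actor head with scalar parameter theta)
#   Q_k(s, a) = s * a - 0.5*a^2  (critic head; dQ/da = s - a)
theta, s = 0.3, 2.0
a = theta * s                       # action generated by the actor head
dQ_da = s - a                       # gradient of Q w.r.t. the action
dmu_dtheta = s                      # gradient of the actor head w.r.t. theta
grad_theta = dQ_da * dmu_dtheta     # the two gradients are multiplied
lr = 0.1
theta_new = theta + lr * grad_theta # gradient ascent on J
```

As a check, J(θ) = Q(s, θs) = 4θ − 2θ² for these stand-ins, so dJ/dθ = 4 − 4θ = 2.8 at θ = 0.3, matching the chain-rule product above.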
Update the k-th pair of target network parameters according to the formulas:
θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k}
θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}
Here θ^{Q'_k} is the parameter of the k-th target critic head; θ^{Q_k} is the parameter of the k-th critic head; θ^{μ'_k} is the parameter of the k-th target actor head; θ^{μ_k} is the parameter of the k-th actor head; τ is the update ratio parameter.
Update the confidence network according to the policy gradient:
θ_C ← θ_C + α ∇_{θ_C} c(s_i | θ_C) · Q^π(s_i, a_i)
Here θ_C denotes the parameters of the confidence network; α denotes the learning rate; ∇_{θ_C} is the gradient with respect to the confidence network; c(s_i | θ_C) is the parameterized policy output, i.e. the output of the confidence network; Q^π(s_i, a_i) is the evaluated Q value.
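Taking the symbols here at face value, the update is gradient ascent on the confidence output weighted by the assessed Q value. A scalar sketch under that reading, with a linear stand-in for the confidence network:

```python
# Toy confidence network: c(s) = theta_C * s (scalar), updated by gradient
# ascent so that confidence grows when the assessed Q value is positive.
theta_C, alpha = 0.5, 0.01
s, Q_pi = 2.0, 3.0             # state feature and assessed Q value Q^pi(s_i, a_i)
grad = s                       # d c(s) / d theta_C for the linear stand-in
theta_C = theta_C + alpha * grad * Q_pi
```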
The present invention builds on the working principle of DDPG. The original DDPG uses a single critic and a single actor: the actor observes the state and generates an action; the state serves as the input of the critic, the action is concatenated in its hidden layer, and the critic outputs a Q value evaluating the potential value of the action. The method of the invention combines the advantages of both MA-BDDPG and Multi-DDPG while resolving the deficiencies of both: using multi-head self-driven structures, it improves the evaluation performance of the critic and, at the same time, the efficiency with which the actor explores the environment, and it can optimize the algorithm to a certain extent, mitigating the adverse effects of environmental complexity and randomness described above, accelerating the convergence of the DDPG algorithm, and improving performance while keeping training stable.
Experiments show that on the experimental data sets (simulated environments) the invention achieves three advantages: the fastest training speed, the best performance, and the best stability, exceeding known solutions in the specific measured values.
In some embodiments, to address the problem that the evaluation abilities of the multiple critic heads are uneven, the present invention introduces an adaptive confidence strategy as a solution, further optimizing DDPG.
Detailed description of the invention
Fig. 1 is a schematic diagram of the architecture and operating process of the adaptive double self-driven DDPG of an embodiment of the present invention.
Figs. 2A and 2B are schematic result diagrams of experiment one in the embodiment of the present invention.
Figs. 3A and 3B are schematic result diagrams of experiment two in the embodiment of the present invention.
Figs. 4A and 4B are schematic result diagrams of experiment three in the embodiment of the present invention.
Fig. 5 is a schematic diagram of the self-adjustment of the adaptive confidence strategy of the embodiment of the present invention during training.
Specific embodiment
Fig. 1 is a schematic diagram of the architecture and operating process of our adaptive double self-driven DDPG.
"Double self-driven" means that, on the one hand, the critic uses a multi-head self-driven technique to improve evaluation performance, and on the other hand, the actor uses a multi-head self-driven structure to improve the efficiency of exploring the environment.
As in Fig. 1, the environment first generates a state s (the current situation, provided by the environment, on which the agent needs to make a decision; the dimension of the state depends on the environment's attributes). When the actor (2) observes the state, each actor head (there are generally K of them, K being a natural number greater than 2) generates an action vector a (in the form of interaction with the state required by the environment; the vector dimension depends on the environment's attributes), forming an action vector set A (K vectors in total).
Given the same state (every critic head receives the same state at the same time), the critic (1) concatenates the action vectors in the action vector set A one by one into its own shared hidden layer (the shared layer in the figure) and generates Q values one by one, producing the intermediate Q-value matrix V (dimension K × K; the Q-value matrix V is an intermediate result, and what really plays the important role is the subsequently generated E-Q value vector). At the same time, the confidence module (3) outputs the confidence vector c (dimension K).
The Ensemble-critic layer (E-critic layer) combines these two tensors (the Q-value matrix and the confidence vector; "tensor" here covers both vectors and matrices) in a weighted operation and generates an Ensemble-Q value vector (E-Q value, dimension K) that represents the potential value of each action vector.
Finally, the Ensemble-actor layer (E-actor layer) selects, according to the E-Q value vector, the Ensemble-action (E-action) corresponding to the maximum E-Q value, i.e., the action most likely to obtain the maximum reward. This action interacts with the environment in the current state, and a reward is received afterwards. With the resulting <state, action, reward> transition tuple, the conventional actor-critic algorithm can be used to train the agent, and the confidence module is trained with a policy gradient algorithm.
The process by which the environment interacts with the agent is as follows: the environment provides a state, the agent interacts with an action, and the environment generates a new state and gives the agent a reward. This process and mechanism are common to reinforcement learning and exist in the prior art; reference can be made to the related descriptions of the prior art. DDPG's architecture of a single actor-critic pair is likewise described in the prior art.
On the basis of the prior art, the embodiment of the present invention makes the following improvements:
(1) extending DDPG into a double self-driven multi-head architecture;
(2) a process in which multiple actors generate multiple actions while multiple critics score them with weights and the optimal action is selected;
(3) adding an adaptive confidence module.
The specific training algorithm can be described as in Table 1 below.
Note: Fig. 1 shows only the core model architecture and a simplified operating process; it omits some details, such as the presetting part and the training of specific parameters, whereas the algorithm shown in Table 1 gives the presetting and training steps in more detail. Fig. 1 serves intuitive understanding; the algorithm of Table 1 is the more detailed supplement.
Below we further explain some of the details.
Concatenation (see the word "splicing" in the middle of Fig. 1)
The critic network consists of multiple fully connected layers. During operation it concatenates an action vector (say of dimension X) with the input vector of a hidden layer (say of dimension Y). The earlier phrase "concatenates the action vectors in the action vector set A one by one into its own shared layer and generates Q values one by one" means the same thing: the "shared layer" is in fact the "hidden layer", and "concatenated into the shared layer" in actual operation means "concatenated with the input vector of the hidden layer". The final input of the hidden layer is thus a vector of dimension X + Y (i.e., an X-dimensional vector concatenated with a Y-dimensional vector gives an (X+Y)-dimensional vector); the operation can be simplified as vector concatenation.
Q-value matrix expression (assuming K heads in total):
V_t ∈ R^{K×K}
Here V_t is the Q-value matrix at time t, ∈ is the mathematical symbol "belongs to", and R^{K×K} denotes the set of real matrices of dimension K by K. This matrix is merely an intermediate result used to create the E-Q value vector.
Confidence vector (dimension K)
It is output by the confidence module (a network).
Here c_t is the confidence vector at time t, composed of K values; c_t^k is the confidence of the k-th critic head at time t, a value greater than 0 and less than or equal to 1. The confidence module is a neural network, equivalent to a function c_t = f(s_t): it takes the state as input and outputs the confidence vector.
The E-critic executes the following step: the weighted-sum operation performed by the E-critic takes the product of the confidence vector and the Q-value matrix, which generates an E-Q value vector of dimension K (each component is the confidence-weighted sum of the K critic heads' Q values for one candidate action).
This step corresponds to the E-critic model (1) in the figure and to the step of generating the E-Q value vector; it corresponds to algorithm step 10.
The E-actor executes the following step: according to the E-Q value vector, this layer selects the E-action corresponding to the maximum E-Q value (an argmax operation), i.e., the action most likely to obtain the maximum reward, which interacts with the environment in the current state.
This step corresponds to the E-actor model (2) in the figure and to the step of generating the E-action; it corresponds to algorithm step 10.
For more specific training details, refer to the algorithm pseudocode (Table 1 above).
Among existing methods, MA-BDDPG introduces only a single self-driven multi-head critic, which causes insufficient exploration (compared with our Fig. 1, it lacks the multi-head actor part, the confidence module, and the process of integrated evaluation and action selection); Multi-DDPG introduces only a single self-driven multi-head actor (compared with our Fig. 1, it lacks the multi-head critic part, the confidence module, and the process of integrated evaluation and action selection), which causes inaccurate evaluation.
Our method combines the advantages of both, but the problem of uneven evaluation ability then appears (during training, some heads may fall into local optima or deviate in training direction, making the final trained abilities uneven), so the adaptive confidence strategy is introduced to solve it. The method extends both the actor and the critic of DDPG into multi-head networks for self-driven exploration. Based on the multi-head architecture, the critic evaluates with an integrated Q value, which raises the probability that potentially optimal actions are explored in experience replay. At the same time, the proposed adaptive confidence strategy (generated automatically by the confidence module) automatically calibrates the weights of the weighted-sum operation, solving the inaccurate evaluation caused by the uneven evaluation abilities of the different critic heads.
In embodiments of the present invention, we conducted a large number of experiments on the technique. On Hopper/Walker in the Mujoco experimental environment, while training remains stable, speed improves by 45% and average performance (reward) improves by 44%. The specific experiments are introduced below.
Experiments
We test our method in the Mujoco simulator environment of OpenAI, mainly on Hopper-v1 and Walker2d.
Hopper-v1 is an environment in which a one-legged robot learns to hop (the state consists of a vector of 11 real numbers, i.e., an 11-dimensional vector in which each dimension is a real number, and similarly below; the action consists of a vector of 3 real numbers).
Walker2d is an environment in which a two-legged robot learns to walk (the state consists of a vector of 17 real numbers, and the action consists of a vector of 6 real numbers).
Based on these two environments, we conducted the following controlled experiments:
1. comparing our model (in two configurations, with and without the adaptive confidence strategy) against other models;
2. comparing the performance of different confidence strategies;
3. using different numbers of self-driven multi-head structures and comparing the performance gains.
In all experiments, the number of training episodes is set to 10000, the experience replay pool size is 1000000, and the batch size is set to 1024.
Experiment one
Compared models: DDPG (the original model), MA-BDDPG (single self-driven multi-head critic architecture), Multi-DDPG (single self-driven multi-head actor architecture), DBDDPG (our model, double self-driven multi-head architecture, without adaptive confidence), and SOUP (our model, double self-driven multi-head architecture, with adaptive confidence).
From Figs. 2A and 2B it can be seen that our method obtains the highest average reward performance (solid line highest on the vertical axis), the fastest speed (solid line rises fastest, largest slope), and the best stability (shaded band thinnest).
Experiment two
Our method is paired with different confidence strategies: No Confidence, Fixed Confidence, Decayed Confidence, and Self-Adaptive Confidence (our method).
From Figs. 3A and 3B it can be seen that the adaptive confidence strategy obtains the highest average reward performance (solid line highest on the vertical axis), the fastest speed (solid line rises fastest, largest slope), and the best stability (shaded band thinnest).
Experiment three
Our method is paired with multi-head architectures of different sizes: DDPG (the original model) and SOUP (our model, with 3, 5, and 10 heads).
From Figs. 4A and 4B it can be seen that as the number of heads increases, the average reward performance obtained is higher (solid line higher on the vertical axis), the speed is faster (solid line rises faster, larger slope), and the stability is better (shaded band thinner).
Fig. 5 shows the self-adjustment of our adaptive confidence strategy during training. The confidence of each head adjusts dynamically with the difference in reward and is trained by the policy gradient method.
The above is only an illustration of examples of the invention and is not to be considered as limiting the invention. Those skilled in the art may, inspired by this application, devise variant schemes that also fall within the protection scope of the invention. For example, the invention may also adopt the following variants:
1. training multiple DDPGs simultaneously (without shared networks, rather than multiple heads) and finally fusing their decisions with confidences;
2. performing the double self-driven multi-head extension of DDPG without adding the confidence network;
3. performing a single self-driven extension of DDPG and balancing it with confidences.
The present invention can also be applied in the following technical fields:
1. intelligent driver assistance: enabling a vehicle (as an agent) to learn by itself in a simulated environment at improved speed, then transfer relatively stably to the real environment;
2. game AI: a trained agent can interact with players, or with the game itself, evolving continuously and autonomously obtaining higher rewards and scores in the game;
3. intelligent robotics and similar fields: equipped with our algorithm, a robotic arm or robot can adapt to the real environment faster, quickly meeting basic task requirements and completing tasks accurately (such as gripping objects, distinguishing objects, and screening objects).
Claims (10)
1. An adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, using multiple critics and multiple actors, the operating process comprising the following steps: when a state is observed, each actor head generates an action vector, forming a set of K action vectors; given the same state, the critic concatenates each action vector into its own shared hidden layer and generates Q values one by one, producing an intermediate Q-value matrix of dimension K × K; at the same time, a confidence module outputs a confidence vector c of dimension K; an E-critic layer combines the Q-value matrix and the confidence vector, the two tensors, in a weighted operation, generating an E-Q value vector of dimension K that represents the potential value of each action vector; finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e., the action most likely to obtain the maximum reward, which interacts with the environment in the current state, after which a reward is received, thereby training the agent; wherein K is a natural number greater than 2.
2. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 1, characterized by further comprising the following step: a presetting step, setting the number of heads K.
3. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 1, characterized by further comprising the following step: randomly initializing the critic network and actor network with K heads, Q(s, a | θ^Q) and μ(s | θ^μ), and copying the weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ, where θ refers to the parameters of a model, such as all parameters of a neural network, and the superscripts Q, μ, Q', μ' denote the critic, the actor, the target critic, and the target actor, respectively.
4. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 1, characterized by further comprising the following step: initializing the confidence network θ_C.
5. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 1, characterized by comprising the following step: selecting an action according to the following formula: a_t = argmax_{a_k} Σ_{i=1}^{K} c_t^i Q_i(s_t, a_k | θ^{Q_i}), with candidate actions a_k = μ_k(s_t | θ^{μ_k}) + N_t; here a_t is the action actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and an action; s_t is the state of the environment at time t; the candidate actions come from the K actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; N_t is random noise at time t.
6. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 2, characterized in that a self-driving (bootstrap) mask m_t ~ M is sampled and the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) is stored in the experience pool R.
7. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 1, characterized by comprising the following step: minimizing the following loss function to update the k-th critic head Q_k: L_k = (1/n) Σ_{i=1}^{n} (y_i − Q_k(s_i, a_i | θ^{Q_k}))², where L_k denotes the loss value of the k-th critic head, used for training; the sum averages the values computed over the n transition tuples of a batch; y_i is the target for the Q value; Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head, whose inputs are the state and action of the i-th transition tuple; and y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k}), where r_i is the reward of the i-th transition tuple, γ is the discount factor, and the last term is a nesting of two functions: the outer is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is produced by the k-th target actor head, the inner function, whose input is the next state.
8. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 7, characterized by further comprising the following step: updating the k-th actor head μ_k using the policy gradient: ∇_{θ^{μ_k}} J ≈ (1/n) Σ_{i=1}^{n} ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i)} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k}); here ∇_{θ^{μ_k}} J denotes the gradient with respect to the parameters of the k-th actor head; ∇_a Q_k is the gradient of the Q value of the k-th critic head with respect to the action a; the action is generated by μ_k(s_i | θ^{μ_k}), i.e. the k-th actor head; ∇_{θ^{μ_k}} μ_k is the gradient of the k-th actor head with respect to its model parameters; and the two gradients at the end of the formula are multiplied together.
9. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 8, characterized by further comprising the following step: updating the k-th pair of target network parameters according to the formulas θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k} and θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}, where θ^{Q'_k} is the parameter of the k-th target critic head, θ^{Q_k} is the parameter of the k-th critic head, θ^{μ'_k} is the parameter of the k-th target actor head, θ^{μ_k} is the parameter of the k-th actor head, and τ is the update ratio parameter.
10. The adaptive double self-driven deep deterministic policy gradient reinforcement learning method of claim 9, characterized by further comprising the following step: updating the confidence network according to the policy gradient θ_C ← θ_C + α ∇_{θ_C} c(s_i | θ_C) · Q^π(s_i, a_i), where θ_C denotes the parameters of the confidence network, α denotes the learning rate, ∇_{θ_C} is the gradient with respect to the confidence network, c(s_i | θ_C) is the parameterized policy output, i.e. the output of the confidence network, and Q^π(s_i, a_i) is the evaluated Q value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811144686.6A CN109523029B (en) | 2018-09-28 | 2018-09-28 | Adaptive double self-driven deep deterministic policy gradient reinforcement learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109523029A true CN109523029A (en) | 2019-03-26 |
CN109523029B CN109523029B (en) | 2020-11-03 |
Family
ID=65771996
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811144686.6A Active CN109523029B (en) | Adaptive double self-driven deep deterministic policy gradient reinforcement learning method | 2018-09-28 | 2018-09-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109523029B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020156973A1 (en) * | 2001-01-29 | 2002-10-24 | Ulrich Thomas R. | Enhanced disk array |
US20020138559A1 (en) * | 2001-01-29 | 2002-09-26 | Ulrich Thomas R. | Dynamically distributed file system |
WO2007029516A1 (en) * | 2005-09-02 | 2007-03-15 | National University Corporation Yokohama National University | Reinforcement learning value function expressing method and device using this |
CN103514371A (en) * | 2013-09-22 | 2014-01-15 | 宁波开世通信息科技有限公司 | Method for measuring and risk assessment of scheduled-task execution capability |
CN103496368A (en) * | 2013-09-25 | 2014-01-08 | 吉林大学 | Cooperative adaptive cruise control system and method with learning capability for automobiles |
WO2017037859A1 (en) * | 2015-08-31 | 2017-03-09 | 株式会社日立製作所 | Information processing device and method |
CN105850901A (en) * | 2016-04-18 | 2016-08-17 | 华南农业大学 | Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | Robot adaptive grasping method based on deep reinforcement learning |
CN106970615A (en) * | 2017-03-21 | 2017-07-21 | 西北工业大学 | Real-time online path planning method based on deep reinforcement learning |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on multi-agent reinforcement learning with a time-warp concept |
CN107020636A (en) * | 2017-05-09 | 2017-08-08 | 重庆大学 | Robot learning control method based on policy gradient |
CN108108822A (en) * | 2018-01-16 | 2018-06-01 | 中国科学技术大学 | Off-policy deep reinforcement learning method with parallel training |
CN108321795A (en) * | 2018-01-19 | 2018-07-24 | 上海交通大学 | Generator set start-stop configuration method and system based on a deep deterministic policy algorithm |
CN108563112A (en) * | 2018-03-30 | 2018-09-21 | 南京邮电大学 | Ball-handling control method for simulated soccer robots |
Non-Patent Citations (3)
Title |
---|
MICHAEL L. LITTMAN: "Value-function reinforcement learning in Markov games", Cognitive Systems Research * |
LIU Quan et al.: "A survey of deep reinforcement learning", Chinese Journal of Computers * |
HU Wenwei: "An adaptive pairs-trading model based on reinforcement learning algorithms", Management Science (China) * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110363295A (en) * | 2019-06-28 | 2019-10-22 | 电子科技大学 | DQN-based multi-lane lane-changing method for intelligent vehicles |
CN110428615A (en) * | 2019-07-12 | 2019-11-08 | 中国科学院自动化研究所 | Isolated intersection traffic signal control method, system and device based on deep reinforcement learning |
CN110428615B (en) * | 2019-07-12 | 2021-06-22 | 中国科学院自动化研究所 | Single intersection traffic signal control method, system and device based on deep reinforcement learning |
CN110442129A (en) * | 2019-07-26 | 2019-11-12 | 中南大学 | Control method and system for multi-agent formation |
CN110442129B (en) * | 2019-07-26 | 2021-10-22 | 中南大学 | Control method and system for multi-agent formation |
CN110502721B (en) * | 2019-08-02 | 2021-04-06 | 上海大学 | Continuous reinforcement learning system and method based on stochastic differential equations |
CN110502721A (en) * | 2019-08-02 | 2019-11-26 | 上海大学 | Continuous reinforcement learning system and method based on stochastic differential equations |
CN112782973A (en) * | 2019-11-07 | 2021-05-11 | 四川省桑瑞光辉标识系统股份有限公司 | Biped robot walking control method and system based on a dual-agent cooperative game |
CN111245008A (en) * | 2020-01-14 | 2020-06-05 | 香港中文大学(深圳) | Wind field cooperative control method and device |
CN111310384A (en) * | 2020-01-16 | 2020-06-19 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
CN111310384B (en) * | 2020-01-16 | 2024-05-21 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
CN111813904A (en) * | 2020-05-28 | 2020-10-23 | 平安科技(深圳)有限公司 | Multi-turn conversation management method and device and computer equipment |
WO2021239069A1 (en) * | 2020-05-28 | 2021-12-02 | 平安科技(深圳)有限公司 | Multi-round dialogue management method and apparatus, and computer device |
CN111899728A (en) * | 2020-07-23 | 2020-11-06 | 海信电子科技(武汉)有限公司 | Training method and device for intelligent voice assistant decision strategy |
CN111899728B (en) * | 2020-07-23 | 2024-05-28 | 海信电子科技(武汉)有限公司 | Training method and device for intelligent voice assistant decision strategy |
CN112019249A (en) * | 2020-10-22 | 2020-12-01 | 中山大学 | Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning |
CN112446503A (en) * | 2020-11-19 | 2021-03-05 | 华南师范大学 | Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot |
CN112418436A (en) * | 2020-11-19 | 2021-02-26 | 华南师范大学 | Artificial intelligence ethical virtual simulation experiment method based on human decision and robot |
CN112446503B (en) * | 2020-11-19 | 2022-06-21 | 华南师范大学 | Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot |
CN112418436B (en) * | 2020-11-19 | 2022-06-21 | 华南师范大学 | Artificial intelligence ethical virtual simulation experiment method based on human decision and robot |
CN112668235A (en) * | 2020-12-07 | 2021-04-16 | 中原工学院 | Robot control method using a DDPG algorithm with offline-model pre-training |
CN112668235B (en) * | 2020-12-07 | 2022-12-09 | 中原工学院 | Robot control method using a DDPG algorithm with offline-model pre-training |
CN114202229B (en) * | 2021-12-20 | 2023-06-30 | 南方电网数字电网研究院有限公司 | Method for determining a microgrid energy management strategy based on deep reinforcement learning |
CN114202229A (en) * | 2021-12-20 | 2022-03-18 | 南方电网数字电网研究院有限公司 | Method and device for determining energy management strategy, computer equipment and storage medium |
CN114371634A (en) * | 2021-12-22 | 2022-04-19 | 中国人民解放军军事科学院战略评估咨询中心 | Unmanned aerial vehicle combat simulation method based on multi-stage hindsight experience replay |
Also Published As
Publication number | Publication date |
---|---|
CN109523029B (en) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109523029A (en) | Adaptive dual self-driven deep deterministic policy gradient reinforcement learning method for training agents | |
CN110991545B (en) | Multi-agent confrontation oriented reinforcement learning training optimization method and device | |
Knox et al. | Tamer: Training an agent manually via evaluative reinforcement | |
CN110321666A (en) | Multi-robot path planning method based on prior knowledge and the DQN algorithm | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN111026272B (en) | Training method and device for virtual object behavior strategy, electronic equipment and storage medium | |
CN111695690A (en) | Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning | |
CN109740741B (en) | Reinforcement learning method combining knowledge transfer, applied to autonomous skill learning of unmanned vehicles | |
CN111856925B (en) | Adversarial imitation learning method and device based on state trajectories | |
CN114952828A (en) | Mechanical arm motion planning method and system based on deep reinforcement learning | |
Wang et al. | Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models | |
CN109858574A (en) | Autonomous learning method and system for agents oriented to human-machine collaborative work | |
Toubman et al. | Modeling behavior of computer generated forces with machine learning techniques, the nato task group approach | |
CN114290339B (en) | Robot realistic migration method based on reinforcement learning and residual modeling | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
KR100850914B1 (en) | Method for controlling game character | |
Hilleli et al. | Toward deep reinforcement learning without a simulator: An autonomous steering example | |
Knox et al. | Understanding human teaching modalities in reinforcement learning environments: A preliminary report | |
CN113919475B (en) | Robot skill learning method and device, electronic equipment and storage medium | |
Cheng et al. | An autonomous inter-task mapping learning method via artificial neural network for transfer learning | |
CN110070185A (en) | Interactive reinforcement learning method based on demonstrations and human evaluative feedback | |
Huang | Fetching Policy of Intelligent Robotic Arm Based on Multiple-agents Reinforcement Learning Method | |
CN116540535A (en) | Progressive strategy migration method based on self-adaptive dynamics model | |
CN113485107B (en) | Reinforced learning robot control method and system based on consistency constraint modeling | |
CN114770497B (en) | Search and rescue method and device of search and rescue robot and storage medium |
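Many of the similar documents above, like the present application, build on the deep deterministic policy gradient (DDPG) algorithm. For orientation, a minimal sketch of the two updates every DDPG variant shares — the critic's Bellman target and the soft (Polyak) target-network update — is given below. It is an illustration only, not the patent's multi-head critic or self-driven exploration mechanism; the function names and hyperparameter values (`gamma`, `tau`) are assumptions chosen for demonstration.

```python
import numpy as np

# Illustrative sketch only: a generic DDPG Bellman target and soft target-network
# update, NOT the patent's specific multi-head ("double self-driven") variant.

def ddpg_target(reward, q_next, gamma=0.99, done=False):
    """Critic target y = r + gamma * Q'(s', mu'(s')); the bootstrap term is
    dropped at terminal states."""
    return reward + (0.0 if done else gamma * q_next)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging theta' <- tau * theta + (1 - tau) * theta', applied
    elementwise to each parameter array."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Example: one target computation and one soft update.
y = ddpg_target(reward=1.0, q_next=2.0, gamma=0.99)   # 1.0 + 0.99 * 2.0 = 2.98
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(y)              # 2.98
print(new_target[0])  # [0.1 0.1]
```

The small `tau` keeps the target networks slowly trailing the online networks, which is what stabilizes the bootstrapped critic target in DDPG-style training.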
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||