CN109523029A - Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent - Google Patents

Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent

Info

Publication number
CN109523029A
Authority
CN
China
Prior art keywords
value
head
actor
action
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811144686.6A
Other languages
Chinese (zh)
Other versions
CN109523029B (en)
Inventor
袁春
郑卓彬
朱新瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201811144686.6A priority Critical patent/CN109523029B/en
Publication of CN109523029A publication Critical patent/CN109523029A/en
Application granted granted Critical
Publication of CN109523029B publication Critical patent/CN109523029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent. A multi-head bootstrapped critic improves value-estimation performance while a multi-head bootstrapped actor improves the efficiency with which the actor explores the environment. The method thereby optimizes the deep deterministic policy gradient (DDPG) algorithm to a certain extent, alleviates the adverse effects of environmental complexity and randomness, accelerates the convergence of DDPG, and improves performance while keeping training stable. Experiments show that on the experimental data sets (simulated environments) the method achieves the fastest training speed, the best performance and the best stability, exceeding known solutions on the specific metrics.

Description

Self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent
Technical field
The present invention relates to a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent.
Background art:
Deep reinforcement learning has achieved great success on a series of challenging problems, such as autonomous driving, automated robotics and intelligent spoken dialogue systems. The deep deterministic policy gradient (DDPG) algorithm, an off-policy reinforcement learning algorithm that does not rely on a model of the environment, achieves higher sample efficiency than conventional methods by using an actor-critic architecture with experience replay, and is increasingly widely applied because of its strong performance on continuous control tasks. However, DDPG is easily affected by environmental complexity and randomness, which can make its performance unstable and leave the convergence of training unguaranteed. This means that a large amount of hyperparameter tuning is needed before good results can be obtained.
To improve DDPG, existing methods have proposed MA-BDDPG, which uses a multi-head bootstrapped DQN as the critic to improve the sample efficiency of experience replay (source: [Kalweit and Boedecker, 2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195-206, 2017). However, because MA-BDDPG only introduces a single bootstrapped multi-head critic, it easily leads to insufficient exploration of the environment. Multi-DDPG uses a single bootstrapped multi-head actor architecture to improve the adaptability of DDPG to multiple tasks (source: [Yang et al., 2017] Zhaoyang Yang, Kathryn Merrick, Hussein Abbass, and Lianwen Jin. Multi-task deep reinforcement learning for continuous action control. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3301-3307, 2017). However, because Multi-DDPG only introduces a multi-head actor, its single critic evaluates the actions of the multiple actors inaccurately.
Moreover, MA-BDDPG and Multi-DDPG can only partially alleviate the above susceptibility to environmental complexity and randomness, and each introduces new problems and defects of its own.
Summary of the invention
The present invention aims to solve the problem that the DDPG algorithm in prior-art deep reinforcement learning is easily affected by environmental complexity and randomness, which makes its performance unstable and its training hard to converge.
To this end, the present invention proposes a self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent, which uses multiple critic heads and multiple actor heads and includes the following steps: when a state is observed, each actor head generates an action vector, and the K action vectors form an action set. Given the same state, the critic concatenates each action vector one by one into its shared hidden layer and generates a Q value for it, producing an intermediate Q-value matrix (dimension K x K); at the same time a confidence module outputs a confidence vector c (dimension K). An E-critic layer combines these two tensors (the Q-value matrix and the confidence vector) in a weighted operation and generates an E-Q value vector (dimension K) that represents the potential value of each action vector. Finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward; this action interacts with the environment in the current state, after which a reward is received, and the agent is trained accordingly.
In some embodiments, the method may further include the following steps.
Preset step: set the number of heads K, the batch training sample number n, the maximum number of training episodes E, and the mask distribution M.
Randomly initialize the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ), each with K heads, and copy their weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ. Here θ denotes the parameters of a model, for example all parameters of a neural network, and the superscripts Q, μ, Q' and μ' denote the critic, the actor, the target critic and the target actor respectively.
Initialize the experience replay pool R and the confidence network θ^C.
Select an action according to the following formula:

a_t = μ_{k*}(s_t | θ^{μ_{k*}}) + N_t,  with  k* = argmax_k Σ_{i=1..K} c_t^i · Q_i(s_t, μ_k(s_t | θ^{μ_k}) | θ^{Q_i})

where a_t is the action that is actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and the action; s_t is the state of the environment at time t; the candidate actions come from the actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; and N_t is the random noise at time t.
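As an illustration of this selection rule, here is a minimal NumPy sketch, assuming the K actor heads have already proposed their actions and the K critic heads have scored every proposal; the function name, the index convention V_t[i, j] = Q_i(s_t, a_j) and the Gaussian noise model are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def select_action(actions, q_matrix, confidence, noise_scale=0.1, rng=None):
    """Ensemble action selection sketch.

    actions:    (K, action_dim) array, one proposal per actor head.
    q_matrix:   (K, K) array V_t, with V_t[i, j] = Q_i(s_t, a_j) from critic head i.
    confidence: (K,) array c_t, one confidence weight per critic head.
    """
    rng = rng or np.random.default_rng()
    # E-Q value vector: confidence-weighted sum of every critic head's score
    # for each proposed action (dimension K).
    e_q = confidence @ q_matrix
    # E-action: the proposal with the largest ensemble Q value.
    best = int(np.argmax(e_q))
    # Exploration noise N_t added to the chosen action before execution.
    return actions[best] + noise_scale * rng.standard_normal(actions.shape[1])

# Toy usage with K = 3 heads and a 2-dimensional action space.
K, action_dim = 3, 2
rng = np.random.default_rng(0)
actions = rng.uniform(-1, 1, size=(K, action_dim))
q_matrix = rng.normal(size=(K, K))
confidence = rng.uniform(0.1, 1.0, size=K)
print(select_action(actions, q_matrix, confidence, rng=rng))
```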
Execute the selected action and receive the immediate reward r_t and the new state s_{t+1}.
Sample a bootstrap mask m_t ~ M; store the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) in the experience pool R; randomly sample n transition tuples as a batch of training data.
Minimize the following loss function to update the k-th critic head Q_k:

L(θ^{Q_k}) = (1/n) Σ_i ( y_i − Q_k(s_i, a_i | θ^{Q_k}) )^2

where L(θ^{Q_k}) is the loss value of the k-th critic head, used for training optimization; the sum averages the quantity over the batch of n transition tuples; y_i is the target of the Q value; and Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head for the state and action of the i-th transition tuple,

with

y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k})

where y_i is the target of the Q value; r_i is the reward of the i-th transition tuple; γ is the discount factor; and the remaining term is a nesting of two functions: the outer one is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is generated by the k-th target actor head, the inner function, whose input is the next state.
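A minimal PyTorch sketch of this loss, assuming each head is an ordinary module that maps (state, action) batches to Q values of shape (n, 1); the function name and tensor shapes are illustrative, not part of the patented implementation.

```python
import torch
import torch.nn.functional as F

def critic_head_loss(critic_k, target_critic_k, target_actor_k,
                     states, actions, rewards, next_states, gamma=0.99):
    """One-step TD loss for the k-th critic head (sketch).

    critic_k(s, a) and target_critic_k(s, a) return Q values of shape (n, 1);
    target_actor_k(s) returns actions; rewards has shape (n, 1).
    """
    with torch.no_grad():
        # y_i = r_i + gamma * Q'_k(s_{i+1}, mu'_k(s_{i+1}))
        next_actions = target_actor_k(next_states)
        y = rewards + gamma * target_critic_k(next_states, next_actions)
    # L = (1/n) * sum_i (y_i - Q_k(s_i, a_i))^2
    return F.mse_loss(critic_k(states, actions), y)
```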
Update the k-th actor head μ_k using the policy gradient:

∇_{θ^{μ_k}} J ≈ (1/n) Σ_i ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i | θ^{μ_k})} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k})

where ∇_{θ^{μ_k}} J is the gradient with respect to the parameters of the k-th actor head model; ∇_a Q_k(s_i, a | θ^{Q_k}) is the gradient of the Q value of the k-th critic head with respect to the action a, the action being generated by μ_k(s_i | θ^{μ_k}), i.e. the action produced by the k-th actor head; ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k}) is the gradient of the k-th actor head model with respect to its parameters; and the two gradients in the formula are multiplied.
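A hedged PyTorch sketch of this update: as is usual with automatic differentiation, the chain-rule product above is obtained by back-propagating through the critic head, so the step is written as minimizing the negated Q value. Function and optimizer names are illustrative.

```python
def actor_head_update(actor_k, critic_k, actor_optimizer, states):
    """Deterministic policy gradient step for the k-th actor head (sketch).

    Maximizing the critic's score of the actor's own action is implemented
    by minimizing its negative, which autograd expands into exactly the
    product of the two gradients in the formula above.
    """
    actor_loss = -critic_k(states, actor_k(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```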
Update the k-th pair of target network parameters according to the formulas:

θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k}
θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}

where θ^{Q'_k} is the parameter of the k-th target critic head; θ^{Q_k} is the parameter of the k-th critic head; θ^{μ'_k} is the parameter of the k-th target actor head; θ^{μ_k} is the parameter of the k-th actor head; and τ is the update scale parameter.
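The soft update can be sketched in PyTorch as follows; the per-parameter formula matches the one above, and the default value of τ is only an illustrative assumption.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```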
Update the confidence network according to the policy gradient:

θ^C ← θ^C + α ∇_{θ^C} log π_{θ^C} · Q^π(s_i, a_i)

where θ^C denotes its parameters; α denotes the learning rate; ∇_{θ^C} is the gradient with respect to the confidence network; π_{θ^C} is the output value of the parameterized policy, i.e. the output value of the confidence network; and Q^π(s_i, a_i) is the evaluated Q value.
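The patent gives this update only in the generic policy-gradient form, so the following PyTorch sketch rests on an explicit assumption: the normalized confidence vector is read as a categorical policy over the K critic heads and updated REINFORCE-style, with the evaluated Q value as the weight. All names here are hypothetical.

```python
import torch

def confidence_update(confidence_net, conf_optimizer, states, chosen_heads, q_values):
    """REINFORCE-style update of the confidence network (illustrative sketch only).

    `chosen_heads` is a LongTensor of shape (n,) holding the index of the head
    whose action was executed; `q_values` has shape (n, 1).
    """
    conf = confidence_net(states).clamp_min(1e-6)          # (n, K), each value in (0, 1]
    probs = conf / conf.sum(dim=1, keepdim=True)           # normalize to a distribution over heads
    log_pi = torch.log(probs.gather(1, chosen_heads.view(-1, 1)))
    loss = -(log_pi * q_values.detach()).mean()            # gradient ascent on log(pi) * Q
    conf_optimizer.zero_grad()
    loss.backward()
    conf_optimizer.step()
    return loss.item()
```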
The present invention builds on the working principle of DDPG. The original DDPG uses a single critic and a single actor: the actor observes the state and generates an action, the state is the input of the critic, the action is concatenated into the critic's hidden layer, and the critic outputs a Q value that evaluates the potential value of the action. The method of the invention combines the advantages of MA-BDDPG and Multi-DDPG while resolving the shortcomings of both: the multi-head bootstrapped structure improves the critic's evaluation performance and, at the same time, improves the efficiency with which the actor explores the environment. The algorithm is thereby optimized to a certain extent, the adverse effects of environmental complexity and randomness mentioned above are alleviated, the convergence of the DDPG algorithm is accelerated, and performance is improved while training remains stable.
Experiments show that on the experimental data sets (simulated environments) the present invention achieves the fastest training speed, the best performance and the best stability, exceeding known solutions on the specific metrics.
In some embodiments, to address the problem that the evaluation abilities of the multiple critic heads are uneven, the present invention introduces a self-adaptive confidence strategy, further optimizing DDPG.
Detailed description of the invention
Fig. 1 is a schematic diagram of the architecture and operational flow of the self-adaptive double bootstrapped DDPG according to an embodiment of the present invention.
Figs. 2A and 2B are schematic diagrams of the results of experiment one in an embodiment of the present invention.
Figs. 3A and 3B are schematic diagrams of the results of experiment two in an embodiment of the present invention.
Figs. 4A and 4B are schematic diagrams of the results of experiment three in an embodiment of the present invention.
Fig. 5 is a schematic diagram of the self-adjustment of the self-adaptive confidence strategy during training in an embodiment of the present invention.
Specific embodiment
Fig. 1 is a schematic diagram of the architecture and operational flow of our self-adaptive double bootstrapped DDPG.
The term "double bootstrapped" means: on the one hand, the critic uses the multi-head bootstrap technique to improve evaluation performance; on the other hand, the actor uses the multi-head bootstrap structure to improve the efficiency of exploring the environment.
As shown in Fig. 1, the environment first generates a state s (the current situation, provided by the environment, in which the agent needs to make a decision; the dimension of the state depends on the environment). When the actor (2) observes the state, each actor head (there are generally K heads, K being a natural number greater than 2) generates an action vector a (the form of interaction with the state required by the environment; the vector dimension depends on the environment), and these form the action set A (K vectors in total).
Given the same state (every critic head receives the same state at the same time), the critic (1) concatenates the action vectors of the action set A one by one into its own shared hidden layer (the "shared layer" in the figure) and generates Q values one by one, producing the intermediate Q-value matrix V (dimension K x K; V is only an intermediate result, and what really matters is the E-Q value vector generated from it afterwards). At the same time, the confidence module (3) outputs the confidence vector c (dimension K). The Ensemble-critic layer (E-critic layer) combines these two tensors (the Q-value matrix and the confidence vector; "tensor" here covers both vectors and matrices) in a weighted operation and generates an Ensemble-Q value vector (E-Q value, dimension K), which represents the potential value of each action vector.
Finally, the Ensemble-actor layer (E-actor layer) selects, according to the E-Q value vector, the Ensemble-action (E-action) corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, and uses it to interact with the environment in the current state, receiving a reward afterwards. With the resulting <state, action, reward> transition tuple, the agent can be trained with the traditional actor-critic algorithm, and the confidence module is trained with a policy gradient algorithm.
The process by which the environment interacts with the agent is as follows: the environment provides a state, the agent interacts with it through an action, and the environment produces a new state and gives the agent a reward. This process and mechanism are common to reinforcement learning, exist in the prior art, and can be found in the related descriptions of the prior art. DDPG's architecture of a single actor-critic pair can likewise be found in descriptions in the prior art.
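For orientation, such an interaction loop might look as follows (a sketch assuming a gym-style environment API; `agent.act`, `agent.train_step` and `replay_buffer.store` are placeholder names, not part of the patent).

```python
def run_episode(env, agent, replay_buffer, max_steps=1000):
    """Generic agent-environment interaction loop (sketch)."""
    state = env.reset()
    episode_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                      # ensemble action selection
        next_state, reward, done, _ = env.step(action)
        replay_buffer.store(state, action, reward, next_state)
        agent.train_step(replay_buffer)                # sample a batch and update the heads
        episode_reward += reward
        state = next_state
        if done:
            break
    return episode_reward
```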
On the basis of the prior art, the embodiment of the present invention makes the following improvements:
(1) extending DDPG to a double bootstrapped multi-head architecture;
(2) having multiple actors generate multiple actions that multiple critics weight and score simultaneously, so that the optimal action is selected;
(3) adding a self-adaptive confidence module.
The specific training algorithm can be described as Table 1 below.
Note: Fig. 1 only shows the core model architecture and a simplified operational flow; it does not include some details, such as the preset part and the training of specific parameters, whereas the algorithm shown in Table 1 gives the preset and training steps in more detail. Fig. 1 is for convenient, intuitive understanding, and the algorithm of Table 1 is the more detailed supplement.
Some details are further explained below.
Concatenation (see the "concatenate" label in the middle of Fig. 1)
The critic network consists of multiple fully connected layers. During operation, the action vector (say of dimension X) is concatenated with the input vector of the hidden layer (say of dimension Y). (The statement above, "concatenates the action vectors of the action set A one by one into its own shared layer and generates Q values one by one", means the same thing: the "shared layer" is in fact the hidden layer, and "concatenated into the shared layer" actually means "concatenated with the input vector of the hidden layer"; the present passage is the more detailed expression.) The dimension of the final hidden-layer input is therefore X + Y (an X-dimensional vector concatenated with a Y-dimensional vector gives an (X + Y)-dimensional vector), which can simply be regarded as vector concatenation.
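A minimal PyTorch sketch of such a critic, assuming one shared encoder/hidden layer and K linear Q heads; the class name, layer sizes and activation choices are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """Critic with a shared hidden layer and K bootstrapped Q heads (sketch).

    The action vector (dimension X) is concatenated with the hidden-layer
    input derived from the state (dimension Y), so the hidden layer sees an
    (X + Y)-dimensional vector.
    """

    def __init__(self, state_dim, action_dim, num_heads=3, hidden=128):
        super().__init__()
        self.state_encoder = nn.Linear(state_dim, hidden)          # produces the Y-dimensional part
        self.shared = nn.Sequential(                               # shared hidden layer
            nn.Linear(hidden + action_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(hidden, 1) for _ in range(num_heads))        # one Q head per bootstrap head

    def forward(self, state, action):
        h = torch.relu(self.state_encoder(state))
        h = self.shared(torch.cat([h, action], dim=-1))            # the concatenation step
        return torch.cat([head(h) for head in self.heads], dim=-1)  # (batch, K) Q values

# Example: scoring one proposed action with all K critic heads yields one column of the Q matrix.
critic = MultiHeadCritic(state_dim=11, action_dim=3, num_heads=3)
q_column = critic(torch.randn(1, 11), torch.randn(1, 3))  # shape (1, 3)
```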
Q-value matrix expression (assuming K heads in total):
V_t ∈ R^{K×K}
where V_t is the Q-value matrix at time t, "∈" means "belongs to" in mathematical notation, and R^{K×K} denotes a real matrix of dimension K by K. This matrix is merely an intermediate result for creating the E-Q value vector.
Confidence vector (dimension K)
It is output by the confidence module (network):
c_t = f(s_t)
where c_t is the confidence vector at time t, composed of K values, and c_t^k is the confidence of the k-th critic head at time t, a value greater than 0 and less than or equal to 1. The confidence module is a neural network, equivalent to a function c_t = f(s_t): the input is the state and the output is the confidence vector.
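A minimal PyTorch sketch of such a confidence module, assuming a small fully connected network with a sigmoid output so that every confidence lies in (0, 1); the architecture details are assumptions, since the patent only specifies the input, the output and the value range.

```python
import torch
import torch.nn as nn

class ConfidenceModule(nn.Module):
    """Confidence network c_t = f(s_t) (sketch): state in, K confidences out."""

    def __init__(self, state_dim, num_heads=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_heads), nn.Sigmoid())

    def forward(self, state):
        return self.net(state)   # (batch, K) confidence vector

# Example: confidence weights for a Hopper-v1 style 11-dimensional state.
conf = ConfidenceModule(state_dim=11, num_heads=3)(torch.randn(1, 11))
```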
The E-critic executes the step given by the following formula; the product c_t · V_t is exactly the operation performed by the E-critic, a weighted-sum process in which the confidence vector is multiplied with the Q-value matrix to generate the E-Q value vector of dimension K:
EQ_t = c_t · V_t,  i.e.  EQ_t(j) = Σ_{i=1..K} c_t^i · V_t(i, j)
(writing V_t(i, j) for the Q value that the i-th critic head assigns to the j-th proposed action).
This step corresponds to the E-critic model (1) in the figure and to the step of generating the E-Q value vector; it corresponds to algorithm step 10.
The E-actor executes the step
a_t = the action corresponding to argmax_j EQ_t(j)
As in the formula above, this layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value (the argmax operation), i.e. the action with the greatest potential to obtain the maximum reward, which interacts with the environment in the current state.
This step corresponds to the E-actor model (2) in the figure and to the step of generating the E-action; it corresponds to algorithm step 10.
More specific training details can be found in the algorithm pseudo-code (i.e. Table 1 above).
Among existing methods, MA-BDDPG only introduces a single bootstrapped multi-head critic, which leads to insufficient exploration (compared with our Fig. 1 it lacks the multi-head actor part, the confidence module and the process of integrated evaluation for action selection); Multi-DDPG only introduces a single bootstrapped multi-head actor (compared with our Fig. 1 it lacks the multi-head critic part, the confidence module and the process of integrated evaluation for action selection), which leads to inaccurate evaluation.
Our method combines the advantages of both, but the problem of uneven evaluation ability then appears (during training some heads may fall into local optima or drift in their training direction, so that the final abilities of the heads are uneven), and the self-adaptive confidence strategy is introduced to solve it. The method extends both the actor and the critic of DDPG to bootstrapped multi-head networks for exploration. Building on the multi-head architecture, the critic performs evaluation with an integrated Q value, which raises the probability that potentially optimal actions are explored through experience replay. At the same time, the self-adaptive confidence strategy (generated automatically by the confidence module) is proposed to calibrate the weights of the weighted-sum operation automatically, solving the inaccurate evaluation caused by the uneven evaluation abilities of the different critic heads.
In an embodiment of the present invention, we conducted extensive experiments with this technique. On Hopper and Walker in the MuJoCo experimental environment, training speed is improved by 45% while training remains stable, and average performance (reward) is improved by 44%. The specific experiments are introduced below.
Experiment
We tested our method in the MuJoCo simulator environments of OpenAI, mainly Hopper-v1 and Walker2d.
Hopper-v1 is an environment in which a one-legged robot learns to hop (the state is a vector of 11 real numbers, i.e. an 11-dimensional vector in which each dimension is a real number, and likewise below; the action is a vector of 3 real numbers).
Walker2d is an environment in which a two-legged robot learns to walk (the state is a vector of 17 real numbers; the action is a vector of 6 real numbers).
Based on these two environments, we carried out the following experimental comparisons:
1. comparing our model (with and without the self-adaptive confidence strategy) against other models;
2. comparing the performance of different confidence strategies;
3. using bootstrapped multi-head architectures with different numbers of heads and comparing the performance gains.
In all experiments, the number of training episodes was set to 10000, the experience replay buffer size to 1000000, and the batch size to 1024.
Experiment one
Models compared: DDPG (the original model), MA-BDDPG (single bootstrapped multi-critic-head architecture), Multi-DDPG (single bootstrapped multi-actor-head architecture), DBDDPG (our model, double bootstrapped multi-head architecture, without self-adaptive confidence) and SOUP (our model, double bootstrapped multi-head architecture, with self-adaptive confidence).
As can be seen from Figs. 2A and 2B, our method obtains the highest average reward (the solid line is highest on the vertical axis), is the fastest (the solid line rises fastest, with the largest slope), and has the best stability (the shaded band is thinnest).
Experiment two
We compare our method paired with different confidence strategies: No Confidence, Fixed Confidence, Decayed Confidence, and Self-Adaptive Confidence (our method).
As can be seen from Figs. 3A and 3B, the self-adaptive confidence strategy obtains the highest average reward (the solid line is highest on the vertical axis), is the fastest (the solid line rises fastest, with the largest slope), and has the best stability (the shaded band is thinnest).
Experiment three
We compare our method with multi-head architectures of different sizes: DDPG (the original model) and SOUP (our model, with 3, 5 and 10 heads).
As can be seen from Figs. 4A and 4B, as the number of heads increases, the average reward obtained is higher (the solid line is higher on the vertical axis), the speed is faster (the solid line rises faster, with a larger slope), and the stability is better (the shaded band is thinner).
Fig. 5 shows the self-adjustment of our self-adaptive confidence strategy during training. The confidence of each head adjusts dynamically with differences in reward and is trained by the policy gradient method.
The above merely illustrates examples of the present invention and is not to be taken as limiting it. Those skilled in the art can also devise variants of the scheme inspired by this application, and such variants likewise fall within the protection scope of the present invention. For example, the present invention may also use the following variants:
1. training multiple DDPG instances (without shared networks) simultaneously and finally fusing their decisions with confidences (instead of using multiple heads);
2. extending DDPG with the double bootstrapped multi-head structure but without adding the confidence network;
3. applying a single bootstrapped extension to DDPG and using confidences for balancing.
The present invention can also be applied in the following technical fields:
1. intelligent driving assistance: enabling a vehicle (as an agent) to learn by itself in a simulated environment with improved speed and to transfer relatively stably to the real environment;
2. game AI: a trained agent can interact with players, or with the game itself, and keep evolving, learning by itself to obtain higher rewards and scores in the game;
3. intelligent robotics and similar fields: equipped with our algorithm, a robotic arm or robot can adapt to the real environment faster, quickly meet basic task requirements, and complete tasks accurately (such as gripping objects, distinguishing objects, screening objects, and so on).

Claims (10)

1. A self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method for training an agent, using multiple critic heads and multiple actor heads, the operational flow comprising the following steps: when a state is observed, each actor head generates an action vector, and the K action vectors form an action set; given the same state, the critic concatenates each action vector into its own shared hidden layer and generates Q values one by one, producing an intermediate Q-value matrix of dimension K x K; at the same time, a confidence module outputs a confidence vector c of dimension K; an E-critic layer combines the two tensors, the Q-value matrix and the confidence vector, in a weighted operation to generate an E-Q value vector of dimension K, which represents the potential value of each action vector; finally, an E-actor layer selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, which interacts with the environment in the current state, after which a reward is received, so as to train the agent; wherein K is a natural number greater than 2.
2. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising a preset step of setting the number of heads K.
3. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising the following step: randomly initializing the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ), each with K heads, and copying their weights to the respective target network parameters θ^{Q'} and θ^{μ'}, i.e. θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ, wherein θ denotes the parameters of a model, for example all parameters of a neural network, and the superscripts Q, μ, Q' and μ' denote the critic, the actor, the target critic and the target actor respectively.
4. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, further comprising the following step: initializing the confidence network θ^C.
5. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, comprising the following step: selecting an action according to the following formula:

a_t = μ_{k*}(s_t | θ^{μ_{k*}}) + N_t,  with  k* = argmax_k Σ_{i=1..K} c_t^i · Q_i(s_t, μ_k(s_t | θ^{μ_k}) | θ^{Q_i})

wherein a_t is the action that is actually selected and executed at time t; c_t^i is the confidence of the i-th critic head at time t; Q_i is the evaluation (Q value) of the i-th critic head, the output of a function with parameters θ^{Q_i} whose inputs are the state and the action; s_t is the state of the environment at time t; the candidate actions come from the actor heads μ_k, each the output of a function with parameters θ^{μ_k} whose input is the state; and N_t is the random noise at time t.
6. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 2, wherein a bootstrap mask m_t ~ M is sampled and the transition tuple (s_t, a_t, r_t, s_{t+1}, m_t) is stored in the experience pool R.
7. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 1, comprising the following step: minimizing the following loss function to update the k-th critic head Q_k:

L(θ^{Q_k}) = (1/n) Σ_i ( y_i − Q_k(s_i, a_i | θ^{Q_k}) )^2

wherein L(θ^{Q_k}) is the loss value of the k-th critic head, used for training optimization; the sum averages the quantity over the batch of n transition tuples; y_i is the target of the Q value; and Q_k(s_i, a_i | θ^{Q_k}) is the evaluation (Q value) of the k-th critic head for the state and action of the i-th transition tuple,

with

y_i = r_i + γ Q'_k(s_{i+1}, μ'_k(s_{i+1} | θ^{μ'_k}) | θ^{Q'_k})

wherein y_i is the target of the Q value; r_i is the reward of the i-th transition tuple; γ is the discount factor; and the remaining term is a nesting of two functions: the outer one is the Q-value function produced by the k-th target critic head, whose inputs are the next state and an action, and that action is generated by the k-th target actor head, the inner function, whose input is the next state.
8. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 7, further comprising the following step: updating the k-th actor head μ_k using the policy gradient:

∇_{θ^{μ_k}} J ≈ (1/n) Σ_i ∇_a Q_k(s_i, a | θ^{Q_k}) |_{a = μ_k(s_i | θ^{μ_k})} · ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k})

wherein ∇_{θ^{μ_k}} J is the gradient with respect to the parameters of the k-th actor head model; ∇_a Q_k(s_i, a | θ^{Q_k}) is the gradient of the Q value of the k-th critic head with respect to the action a, the action being generated by μ_k(s_i | θ^{μ_k}), i.e. the action produced by the k-th actor head; ∇_{θ^{μ_k}} μ_k(s_i | θ^{μ_k}) is the gradient of the k-th actor head model with respect to its parameters; and the two gradients in the formula are multiplied.
9. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 8, further comprising the following step: updating the k-th pair of target network parameters according to the formulas:

θ^{Q'_k} ← τ θ^{Q_k} + (1 − τ) θ^{Q'_k}
θ^{μ'_k} ← τ θ^{μ_k} + (1 − τ) θ^{μ'_k}

wherein θ^{Q'_k} is the parameter of the k-th target critic head; θ^{Q_k} is the parameter of the k-th critic head; θ^{μ'_k} is the parameter of the k-th target actor head; θ^{μ_k} is the parameter of the k-th actor head; and τ is the update scale parameter.
10. The self-adaptive double bootstrapped deep deterministic policy gradient reinforcement learning method according to claim 9, further comprising the following step: updating the confidence network according to the policy gradient:

θ^C ← θ^C + α ∇_{θ^C} log π_{θ^C} · Q^π(s_i, a_i)

wherein θ^C denotes its parameters; α denotes the learning rate; ∇_{θ^C} is the gradient with respect to the confidence network; π_{θ^C} is the output value of the parameterized policy, i.e. the output value of the confidence network; and Q^π(s_i, a_i) is the evaluated Q value.
CN201811144686.6A 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method Active CN109523029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Publications (2)

Publication Number Publication Date
CN109523029A true CN109523029A (en) 2019-03-26
CN109523029B CN109523029B (en) 2020-11-03

Family

ID=65771996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811144686.6A Active CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Country Status (1)

Country Link
CN (1) CN109523029B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110502721A (en) * 2019-08-02 2019-11-26 上海大学 A kind of continuity reinforcement learning system and method based on stochastic differential equation
CN111245008A (en) * 2020-01-14 2020-06-05 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112418436A (en) * 2020-11-19 2021-02-26 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503A (en) * 2020-11-19 2021-03-05 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN114202229A (en) * 2021-12-20 2022-03-18 南方电网数字电网研究院有限公司 Method and device for determining energy management strategy, computer equipment and storage medium
CN114371634A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after experience playback

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MICHAEL L. LITTMAN: "Value-function reinforcement learning in Markov games", 《COGNITIVE SYSTEMS RESEARCH》 *
Liu Quan et al.: "A Survey of Deep Reinforcement Learning", Chinese Journal of Computers *
Hu Wenwei: "Adaptive Pairs Trading Model Based on Reinforcement Learning Algorithms", Journal of Management Science *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110442129B (en) * 2019-07-26 2021-10-22 中南大学 Control method and system for multi-agent formation
CN110502721B (en) * 2019-08-02 2021-04-06 上海大学 Continuity reinforcement learning system and method based on random differential equation
CN110502721A (en) * 2019-08-02 2019-11-26 上海大学 A kind of continuity reinforcement learning system and method based on stochastic differential equation
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN111245008A (en) * 2020-01-14 2020-06-05 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111310384B (en) * 2020-01-16 2024-05-21 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
WO2021239069A1 (en) * 2020-05-28 2021-12-02 平安科技(深圳)有限公司 Multi-round dialogue management method and apparatus, and computer device
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN111899728B (en) * 2020-07-23 2024-05-28 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112446503A (en) * 2020-11-19 2021-03-05 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112418436A (en) * 2020-11-19 2021-02-26 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503B (en) * 2020-11-19 2022-06-21 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112418436B (en) * 2020-11-19 2022-06-21 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN114202229B (en) * 2021-12-20 2023-06-30 南方电网数字电网研究院有限公司 Determining method of energy management strategy of micro-grid based on deep reinforcement learning
CN114202229A (en) * 2021-12-20 2022-03-18 南方电网数字电网研究院有限公司 Method and device for determining energy management strategy, computer equipment and storage medium
CN114371634A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after experience playback

Also Published As

Publication number Publication date
CN109523029B (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
Knox et al. Tamer: Training an agent manually via evaluative reinforcement
CN110321666A (en) Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111695690A (en) Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN109740741B (en) Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles
CN111856925B (en) State trajectory-based confrontation type imitation learning method and device
CN114952828A (en) Mechanical arm motion planning method and system based on deep reinforcement learning
Wang et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models
CN109858574A (en) The autonomous learning method and system of intelligent body towards man-machine coordination work
Toubman et al. Modeling behavior of computer generated forces with machine learning techniques, the nato task group approach
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
KR100850914B1 (en) method for controlling game character
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Knox et al. Understanding human teaching modalities in reinforcement learning environments: A preliminary report
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
Cheng et al. An autonomous inter-task mapping learning method via artificial neural network for transfer learning
CN110070185A (en) A method of feedback, which is assessed, from demonstration and the mankind interacts intensified learning
Huang Fetching Policy of Intelligent Robotic Arm Based on Multiple-agents Reinforcement Learning Method
CN116540535A (en) Progressive strategy migration method based on self-adaptive dynamics model
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN114770497B (en) Search and rescue method and device of search and rescue robot and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant