CN109523029B - Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method - Google Patents

Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Info

Publication number
CN109523029B
CN109523029B (Application CN201811144686.6A)
Authority
CN
China
Prior art keywords
head
action
value
kth
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811144686.6A
Other languages
Chinese (zh)
Other versions
CN109523029A (en)
Inventor
袁春
郑卓彬
朱新瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201811144686.6A priority Critical patent/CN109523029B/en
Publication of CN109523029A publication Critical patent/CN109523029A/en
Application granted Critical
Publication of CN109523029B publication Critical patent/CN109523029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent. A multi-head self-driven architecture improves both the evaluation performance of the judge and the efficiency with which the executor explores the environment. The method thereby optimizes the deep deterministic policy gradient (DDPG) algorithm to a certain extent, mitigates the adverse effects of environmental complexity and randomness, accelerates the convergence of DDPG, and improves performance while keeping training stable. Experiments show that on the experimental data set (a simulation environment) the invention attains the fastest training speed, the best performance and the best stability, with specific figures exceeding those of known solutions.

Description

Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
Technical Field
The invention relates to an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent.
Background art:
Deep reinforcement learning has enjoyed great success on a series of challenging problems, such as autonomous driving, robotic control and intelligent voice dialog systems. Deep deterministic policy gradient (DDPG), an off-policy, model-free reinforcement learning algorithm, achieves higher sample efficiency than traditional methods by using an executor-judge (actor-critic) architecture with experience replay, and is seeing increasingly wide use because it attains state-of-the-art performance on continuous control tasks. However, DDPG is susceptible to environmental complexity and randomness, which can make its performance unstable and leaves convergence of training unguaranteed. In practice this means a large amount of hyper-parameter tuning is needed to obtain good results.
To improve DDPG, MA-BDDPG in the prior art uses a multi-head self-driven DQN as the judge to improve the utilization efficiency of experience-replay samples ([Kalweit and Boedecker, 2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195-206, 2017.), but because MA-BDDPG introduces only a single self-driven multi-head judge, it easily suffers from insufficient exploration of the environment. Multi-DDPG adopts a single self-driven multi-head executor architecture to improve the adaptability of DDPG to multiple tasks ([Yang et al., 2017] Zhaoyang Yang, Kathryn Merrick, Hussein Abbass and Lianwen Jin. Multi-task deep reinforcement learning for continuous action control. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3301-3307, 2017.), but because it introduces only the multi-head executor, its single judge evaluates the actions of the multiple executors inaccurately.
Furthermore, MA-BDDPG and Multi-DDPG, while alleviating to some extent the above-mentioned problems of susceptibility to environmental complexity and randomness, introduce new problems and drawbacks, respectively.
Disclosure of Invention
The invention aims to solve the problem that the DDPG algorithm in prior-art deep reinforcement learning is easily affected by environmental complexity and randomness, so that its performance is unstable and it does not converge easily.
Therefore, the invention provides an adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, which adopts a plurality of judges and a plurality of executors and comprises the following steps. When a state is observed, each executor head generates an action vector, forming a set of K action vectors. Given the same state, each judge head splices every action vector into its own shared hidden layer and produces Q values one by one, yielding an intermediate K x K Q value matrix; at the same time, the confidence module outputs a confidence vector c of dimension K. The E-judge layer performs a weighting operation on these two tensors (the Q matrix and the confidence vector) to generate an E-Q vector of dimension K, which represents the potential value of each action vector. Finally, the E-executor layer selects the E-action corresponding to the maximum E-Q value, that is, the action with the greatest potential to obtain the maximum reward, interacts with the current state of the environment, and then receives the reward used to train the agent.
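By way of illustration only, the forward pass just described can be sketched in Python/PyTorch roughly as follows; the names (select_action, actor_heads, critic_heads, confidence_net) are illustrative and not part of the invention, and each judge head is assumed to return a (near-)scalar Q value:

```python
import torch

def select_action(state, actor_heads, critic_heads, confidence_net, noise_std=0.1):
    """Sketch of the E-judge / E-executor selection for one observed state.

    actor_heads:    list of K executor heads, each mapping state -> action vector
    critic_heads:   list of K judge heads, each mapping (state, action) -> Q value
    confidence_net: maps state -> confidence vector c of length K
    """
    with torch.no_grad():
        # K candidate actions, one per executor head
        actions = torch.stack([mu(state) for mu in actor_heads])            # (K, action_dim)
        # Q value matrix V: entry (i, k) = Q_i(state, a_k)
        V = torch.stack([
            torch.stack([Q(state, a).squeeze() for a in actions])
            for Q in critic_heads
        ])                                                                   # (K, K)
        c = confidence_net(state)                                            # (K,)
        # E-Q vector: confidence-weighted sum over judge heads for each action
        e_q = c @ V                                                          # (K,)
        best = torch.argmax(e_q)                                             # index of the E-action
        action = actions[best] + noise_std * torch.randn_like(actions[best])
    return action
```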
In some embodiments, the method may further comprise the following steps:
and a presetting step, namely setting the number K of heads, the number n of batch training samples, the maximum training round number E and mask distribution M.
Randomly initialize the judge and executor networks with K heads, with parameters $\theta^{Q}_k$ and $\theta^{\mu}_k$ ($k = 1, \dots, K$), and copy the weights to the respective target network parameters $\theta^{Q'}_k$ and $\theta^{\mu'}_k$, namely

$$\theta^{Q'}_k \leftarrow \theta^{Q}_k, \qquad \theta^{\mu'}_k \leftarrow \theta^{\mu}_k,$$

where θ denotes the parameters of a model (for example, all parameters of a neural network), and the superscripts Q, μ, Q', μ' denote the judge, the executor, the target judge and the target executor, respectively.
Initialize the experience replay pool R and the confidence network $\theta^{C}$.
Select an action according to the following formula:

$$a_t = \arg\max_{a^k_t}\ \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t \mid \theta^{Q}_i\right) + \mathcal{N}_t, \qquad a^k_t = \mu_k\!\left(s_t \mid \theta^{\mu}_k\right),$$

where $a_t$ is the action actually selected and executed at time t; $c^i_t$ is the confidence of the ith judge head at time t; $Q_i$ is the evaluation (Q value) of the ith judge head, i.e. the output of a function with parameters $\theta^{Q}_i$ whose inputs are a state and an action; $s_t$ is the state of the environment at time t; the candidate actions come from the K executor heads, the kth head $\mu_k$ likewise being the output of a function with parameters $\theta^{\mu}_k$ whose input is a state; and $\mathcal{N}_t$ is random noise at time t.
After performing the selected action, receive an immediate reward $r_t$ and the new state $s_{t+1}$.

Sample a self-driven mask $m_t \sim M$; store the transition tuple $(s_t, a_t, r_t, s_{t+1}, m_t)$ in the experience pool R; and randomly sample n transition tuples as a batch of training data.
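By way of illustration, a minimal sketch of this storage-and-sampling step is given below, assuming a Bernoulli mask distribution M (one common choice for self-driven heads); the class and parameter names are illustrative, not part of the patent:

```python
import random
from collections import deque

import numpy as np

class BootstrappedReplay:
    """Experience pool R storing transition tuples (s, a, r, s_next, m)."""

    def __init__(self, capacity=1_000_000, num_heads=5, mask_prob=0.5):
        self.buffer = deque(maxlen=capacity)
        self.num_heads = num_heads
        self.mask_prob = mask_prob

    def sample_mask(self):
        # m_t ~ M: one {0, 1} flag per head, deciding which heads train on this tuple
        return np.random.binomial(1, self.mask_prob, size=self.num_heads)

    def store(self, s, a, r, s_next):
        m = self.sample_mask()
        self.buffer.append((s, a, r, s_next, m))

    def sample_batch(self, n=1024):
        # randomly sample n transition tuples as one batch of training data
        batch = random.sample(self.buffer, n)
        s, a, r, s_next, m = map(np.asarray, zip(*batch))
        return s, a, r, s_next, m
```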
Update the kth judge head $Q_k$ by minimizing the following loss function:

$$L_k = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - Q_k\!\left(s_i, a_i \mid \theta^{Q}_k\right) \right)^{2},$$

where $L_k$ is the loss value of the kth judge head, used for training optimization; $\frac{1}{n}\sum_{i}$ averages the computed values over the batch of n transition tuples; $y_i$ is the target of the Q value; and $Q_k(s_i, a_i \mid \theta^{Q}_k)$ is the evaluation (Q value) of the kth judge head, whose inputs are the state and action of the ith transition tuple. The target is

$$y_i = r_i + \gamma\, Q'_k\!\left(s_{i+1},\, \mu'_k\!\left(s_{i+1} \mid \theta^{\mu'}_k\right) \mid \theta^{Q'}_k\right),$$

where $r_i$ is the reward of the ith transition tuple and γ is the discount factor. Two functions are nested: the outer one is the Q function of the kth target judge head, whose inputs are the next state and an action; that action is produced by the kth target executor head (the inner function), whose input is the next state.
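A minimal PyTorch sketch of this per-head update follows; gating the loss with the self-driven mask of head k is one reasonable reading of how the mask is used, and all names and tensor shapes are illustrative assumptions:

```python
import torch

def update_judge_head(k, batch, judge_heads, target_judge_heads,
                      target_executor_heads, judge_optims, gamma=0.99):
    # batch tensors: s (n, s_dim), a (n, a_dim), r (n, 1), s_next (n, s_dim), m (n, K)
    s, a, r, s_next, m = batch
    with torch.no_grad():
        # y_i = r_i + gamma * Q'_k(s_{i+1}, mu'_k(s_{i+1}))
        a_next = target_executor_heads[k](s_next)
        y = r + gamma * target_judge_heads[k](s_next, a_next)
    q = judge_heads[k](s, a)                       # Q_k(s_i, a_i | theta^Q_k)
    mask_k = m[:, k:k + 1]                         # self-driven mask for head k
    # masked mean-squared error over the batch of n transition tuples
    loss = (mask_k * (y - q).pow(2)).sum() / mask_k.sum().clamp(min=1.0)
    judge_optims[k].zero_grad()
    loss.backward()
    judge_optims[k].step()
    return loss.item()
```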
Update the kth executor head $\mu_k$ using the policy gradient:

$$\nabla_{\theta^{\mu}_k} J \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{a} Q_k\!\left(s, a \mid \theta^{Q}_k\right)\Big|_{s=s_i,\, a=\mu_k(s_i)}\ \nabla_{\theta^{\mu}_k}\, \mu_k\!\left(s \mid \theta^{\mu}_k\right)\Big|_{s=s_i},$$

where $\nabla_{\theta^{\mu}_k} J$ is the gradient with respect to the kth executor head's model parameters; $\nabla_{a} Q_k(s, a \mid \theta^{Q}_k)$ is the gradient of the kth judge head's Q value with respect to the action a, the action being generated by $\mu_k(s_i \mid \theta^{\mu}_k)$, i.e. by the kth executor head; and $\nabla_{\theta^{\mu}_k} \mu_k(s \mid \theta^{\mu}_k)$ is the gradient of the kth executor head with respect to its model parameters. The two gradients in the formula are multiplied.
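Illustratively, this update can be sketched as follows, relying on automatic differentiation to chain the two gradients of the formula above (names are illustrative assumptions):

```python
def update_executor_head(k, batch, executor_heads, judge_heads, executor_optims):
    s, _, _, _, _ = batch
    # Deterministic policy gradient: ascend Q_k(s, mu_k(s)) with respect to theta^mu_k.
    # Autograd composes grad_a Q_k and grad_theta mu_k, i.e. the product of the two
    # gradients in the formula, so maximizing Q is done by minimizing -Q.
    actor_loss = -judge_heads[k](s, executor_heads[k](s)).mean()
    executor_optims[k].zero_grad()
    actor_loss.backward()
    executor_optims[k].step()
    return actor_loss.item()
```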
Update the kth pair of target network parameters according to:

$$\theta^{Q'}_k \leftarrow \tau\, \theta^{Q}_k + (1-\tau)\, \theta^{Q'}_k, \qquad \theta^{\mu'}_k \leftarrow \tau\, \theta^{\mu}_k + (1-\tau)\, \theta^{\mu'}_k,$$

where $\theta^{Q'}_k$ and $\theta^{\mu'}_k$ are the parameters of the kth target judge head and the kth target executor head, $\theta^{Q}_k$ and $\theta^{\mu}_k$ are the parameters of the kth judge head and the kth executor head, and τ is the update scale parameter.
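A minimal sketch of this soft update, applied to each of the K pairs of target networks (the function name is illustrative):

```python
def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```

It would be called once per head, e.g. soft_update(target_judge_heads[k], judge_heads[k]) and soft_update(target_executor_heads[k], executor_heads[k]).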
Update the confidence network according to the policy gradient:

$$\theta^{C} \leftarrow \theta^{C} + \alpha\, \nabla_{\theta^{C}} \log \pi_{\theta^{C}}\!\left(s_i, a_i\right) Q^{\pi}\!\left(s_i, a_i\right),$$

where $\theta^{C}$ denotes the parameters of the confidence network; α is the learning rate; $\nabla_{\theta^{C}}$ is the gradient with respect to the confidence network; $\pi_{\theta^{C}}(s_i, a_i)$ is the parameterized policy output, i.e. the output of the confidence network; and $Q^{\pi}(s_i, a_i)$ is the evaluated Q value.
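One possible reading of this update, sketched below, treats the confidence vector as a softmax-normalized categorical policy over the K heads and applies a REINFORCE-style step weighted by the evaluated Q value; this interpretation and all names are assumptions for illustration only:

```python
import torch

def update_confidence(confidence_net, conf_optim, s, chosen_head, q_eval):
    # s:           (n, state_dim) batch of states
    # chosen_head: (n,) long tensor, index of the head whose action was executed
    # q_eval:      (n,) evaluated Q values Q^pi(s_i, a_i)
    pi = confidence_net(s)                                   # (n, K) confidence vector
    log_pi = torch.log(pi.clamp(min=1e-8))                   # log pi_theta_C
    log_pi_taken = log_pi.gather(1, chosen_head.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_taken * q_eval.detach()).mean()          # ascend log(pi) * Q
    conf_optim.zero_grad()
    loss.backward()
    conf_optim.step()
    return loss.item()
```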
The method builds on the working principle of DDPG. The original DDPG uses a single judge and a single executor: the executor observes the state and generates an action; the state is fed to the judge, the action is spliced into a hidden layer of the judge, and a Q value is output that evaluates the potential value of the action. The method of the invention combines the advantages of MA-BDDPG and Multi-DDPG while overcoming their drawbacks: a multi-head self-driven architecture improves the evaluation performance of the judge and the efficiency with which the executor explores the environment, which optimizes the algorithm to a certain extent, mitigates the adverse effects of environmental complexity and randomness, accelerates the convergence of the DDPG algorithm, and improves performance while keeping training stable.
Experiments show that on the experimental data set (a simulation environment) the invention attains the fastest training speed, the best performance and the best stability, with specific figures exceeding those of known solutions.
In some embodiments, to address the uneven evaluation capability of the multi-head judge, the invention introduces an adaptive confidence strategy to resolve this issue and further optimize DDPG.
Drawings
Fig. 1 is a schematic diagram of the architecture and operation flow of an adaptive double self-driven DDPG according to an embodiment of the present invention.
FIGS. 2A and 2B are graphs showing the results of experiment one in the example of the present invention.
FIGS. 3A and 3B are graphs showing the results of experiment two in the example of the present invention.
FIGS. 4A and 4B are graphs showing the results of experiment three in the example of the present invention.
FIG. 5 is a diagram illustrating the effect of self-tuning of an adaptive confidence policy in a training process according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic diagram of the architecture and operation flow of the adaptive double self-driven DDPG of the invention.
Double self-driven means that, on the one hand, the judge uses a multi-head self-driven architecture to improve evaluation performance and, on the other hand, the executor uses a multi-head self-driven architecture to improve the efficiency of environment exploration.
As shown in fig. 1, the environment first produces a state s (the situation, provided by the environment, in which the agent currently has to make a decision; the state dimension depends on the environment). When the executor (ii) observes the state, each executor head (there are generally K of them, K being a natural number greater than 2) generates an action vector a (the form, prescribed by the environment, in which the agent interacts with the state; the vector dimension depends on the environment), forming an action vector set A of K vectors in total.
Given the same state (each judge head receives the same state at the same time), the judge (i) splices the action vectors in the action vector set A, one by one, into its own shared hidden layer (the shared layer in the figure) and produces Q values one by one, yielding an intermediate Q value matrix V of dimension K x K (the Q value matrix V is only an intermediate result; its real role is in the subsequent generation of the E-Q value vector). Meanwhile, the confidence module (iii) outputs a confidence vector c of dimension K. The Ensemble-judge layer (E-judge layer) performs a weighting operation on these two tensors (the Q matrix and the confidence vector; "tensor" here covers both vectors and matrices) to generate an Ensemble-Q (E-Q) vector of dimension K, which represents the potential value of each action vector.
Finally, the Ensemble-executor layer (E-executor layer) selects, according to the E-Q value vector, the Ensemble-action (E-action) corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, and interacts with the current state of the environment to obtain a reward. The resulting <state, action, reward> transition tuples can then be used to train the agent with a conventional executor-judge (actor-critic) algorithm and to train the confidence module with a policy gradient algorithm.
The process by which the environment interacts with the agent is: the environment provides a state, the agent responds with an action, the environment produces a new state, and the agent receives a reward. This process and mechanism are common to reinforcement learning, are well known in the art, and are described in the prior art. DDPG, with its single pair of executor and judge, is likewise described in the prior art.
The embodiment of the invention is improved on the basis of the prior art as follows:
(1) expanding the DDPG into a double self-driven multi-head architecture;
(2) a flow in which a plurality of executors generate a plurality of actions at the same time, a plurality of judges weight and score them, and the optimal action is screened out;
(3) an adaptive confidence module is added.
The specific training algorithm may be described as table 1 below:
[Table 1: presetting and training steps of the adaptive double self-driven DDPG algorithm, presented as pseudo code; rendered as images in the original publication and not reproduced here.]
Note: FIG. 1 shows only the core model architecture and a simplified operation flow; it omits details such as the presetting part and the specific parameter training. The algorithm of Table 1 gives the presetting and training steps in more detail: FIG. 1 aids visual understanding, while the Table 1 algorithm supplements it with further detail.
Further details are given below.
Splicing (see the word "splicing" in the middle of FIG. 1)

The judge network consists of multiple fully connected layers. In operation, the action vector (of dimension X, say) is concatenated with the input vector of a hidden layer (of dimension Y, say), finally giving a vector of dimension X + Y that is fed into that hidden layer. This is what was meant above by "the action vectors in the action vector set A are spliced into its shared layer one by one and Q values are generated one by one": the "shared layer" is in fact this hidden layer, the earlier wording "spliced to the shared layer" corresponds to the actual operation "concatenated with the input vector of the hidden layer", and the whole operation can be summarized simply as vector concatenation.
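By way of illustration, one judge head with this concatenation could be sketched as follows (the class name, layer sizes and attribute names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class JudgeHead(nn.Module):
    """One judge (critic) head: the state passes through a shared layer, the action
    vector (dimension X) is concatenated with the hidden input (dimension Y), and a
    single Q value is produced."""

    def __init__(self, state_dim, action_dim, hidden=400, hidden2=300):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden)          # shared hidden layer
        self.fc = nn.Linear(hidden + action_dim, hidden2)   # takes the X + Y dimensional vector
        self.q_out = nn.Linear(hidden2, 1)

    def forward(self, state, action):
        h = torch.relu(self.shared(state))
        h = torch.cat([h, action], dim=-1)   # "splicing": concatenate action and hidden input
        h = torch.relu(self.fc(h))
        return self.q_out(h)
```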
Q value matrix (assuming K heads in total)

$$V_t \in \mathbb{R}^{K \times K},$$

where $V_t$ is the Q value matrix at time t and $\mathbb{R}^{K \times K}$ denotes a real matrix of dimension K x K. This matrix is merely an intermediate result used to produce the E-Q value vector.
Confidence vector (dimension K)

The confidence vector is output by the confidence module (a network):

$$c_t = \left(c^1_t, \dots, c^K_t\right),$$

where $c_t$ is the confidence vector at time t, consisting of K values, and $c^k_t$ is the confidence of the kth judge head at time t, a value greater than 0 and at most 1. The confidence module is a neural network and corresponds to a function $c_t = f(s_t)$: it takes the state as input and outputs the confidence vector.
E-judge execution step

The E-judge multiplies the confidence vector by the Q value matrix, a weighted-sum operation that yields the E-Q vector of dimension K:

$$\text{E-}Q_t = c_t\, V_t \in \mathbb{R}^{K}, \qquad \text{E-}Q^k_t = \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t\right).$$

This step corresponds to the E-judge module in FIG. 1 and to the generation of the E-Q value vector; it corresponds to step 10 of the algorithm.
E-executor execution step

Based on the E-Q vector from the formula above, this layer selects, via an argmax operation, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, to interact with the current state of the environment. This step corresponds to the E-executor module in FIG. 1 and to the generation of the E-action; it corresponds to step 10 of the algorithm.

For more specific training details, refer to the algorithm pseudo code (i.e. Table 1 above).
Among existing methods, MA-BDDPG introduces only a single self-driven multi-head judge, which causes insufficient exploration (it omits the multi-head executor part and the confidence module of FIG. 1, as well as the integrated-evaluation action-selection process); Multi-DDPG introduces only a single self-driven multi-head executor (it omits the multi-head judge part and the confidence module of FIG. 1, as well as the integrated-evaluation action-selection process), which leads to inaccurate evaluation.
The present method combines the advantages of both. At the same time, a problem of uneven evaluation capability arises (some heads may fall into local optima during training, or their training direction may drift, so that the heads end up with unequal capability), so the method introduces an adaptive confidence strategy to address it. The method extends the executors and judges of DDPG into self-driven multi-head networks for exploration. On top of this multi-head architecture, the judge uses integrated (ensemble) Q value evaluation to increase the probability that potentially optimal actions are explored into experience replay. Meanwhile, an adaptive confidence strategy (generated automatically by the confidence module) automatically calibrates the weights of the weighting operation, resolving the inaccurate evaluation caused by the differing evaluation capabilities of the judges.
In embodiments of the invention a large number of experiments were carried out. On Hopper/Walker in the MuJoCo experimental environment, training speed improves by 45% while training remains stable, and average performance (reward) improves by 44%. The specific experiments are described below.
Experiment of
We tested the method in OpenAI's MuJoCo simulator environments, mainly Hopper-v1 and Walker-2d.
Hopper-v1 is an environment in which a one-legged robot learns to hop (the state is a vector of 11 real numbers, i.e. an 11-dimensional vector with one real number per dimension, and similarly below; the action is a vector of 3 real numbers);
Walker-2d is an environment in which a biped robot learns to walk (the state is a vector of 17 real numbers, and the action is a vector of 6 real numbers).
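For reference, the two environments can be instantiated with the OpenAI Gym API roughly as follows (the exact environment IDs and API details depend on the installed gym and MuJoCo versions, so this is only a sketch):

```python
import gym

# Hopper: 11-dimensional observation, 3-dimensional action
hopper = gym.make("Hopper-v1")
# Walker2d: 17-dimensional observation, 6-dimensional action
walker = gym.make("Walker2d-v1")

for env in (hopper, walker):
    print(env.observation_space.shape, env.action_space.shape)
```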
Based on these two environments we performed the following comparisons:
1. our model (with and without the adaptive confidence strategy) against other models;
2. the performance of different confidence strategies;
3. the performance obtained with different numbers of heads in the self-driven multi-head architecture.
In all experiments, the number of training episodes was set to 10000, the experience replay pool size to 1000000, and the batch size to 1024.
Experiment one
Models compared: DDPG (original model), MA-BDDPG (single self-driven multi-judge-head architecture), Multi-DDPG (single self-driven multi-executor-head architecture), DBDDPG (our model, double self-driven multi-head architecture without adaptive confidence), SOUP (our model, double self-driven multi-head architecture with adaptive confidence).
As can be seen from fig. 2A and 2B, our method achieves the highest average reward performance (the solid line reaches the highest value on the vertical axis), the fastest speed (the solid line rises fastest, i.e. has the largest slope), and the best stability (the shaded band is thinnest).
Experiment two
Our approach was compared under different confidence strategies: No Confidence, Fixed Confidence, Decayed Confidence, Self-adaptive Confidence.
As can be seen from FIGS. 3A and 3B, the adaptive confidence strategy achieves the highest average reward performance (the solid line reaches the highest value on the vertical axis), the fastest speed (the solid line rises fastest, largest slope), and the best stability (the shaded band is thinnest).
Experiment three
Our approach is compared with different numbers of heads in the multi-head architecture: DDPG (original model) and SOUP (our model) with 3, 5 and 10 heads.
As can be seen from FIGS. 4A and 4B, as the number of heads increases, the average reward performance becomes higher (the solid line reaches a higher value on the vertical axis), the speed becomes faster (the solid line rises faster, larger slope), and the stability becomes better (the shaded band is thinner).
FIG. 5 shows the self-adjustment of the adaptive confidence strategy during training: the confidence of each head is adjusted dynamically according to the rewards obtained, trained by the policy gradient method.
The foregoing is illustrative of the present invention and is not to be construed as limiting it. Those skilled in the art may derive modified solutions in light of this application, and these also fall within the protective scope of the invention. For example, the invention may also adopt the following modifications:
1. training multiple independent DDPGs (with no shared network) simultaneously and finally fusing their decisions with confidence weights (rather than using multiple heads);
2. extending DDPG with the double self-driven multi-head architecture only, without adding a confidence network for optimization;
3. extending DDPG with a single self-driven architecture and rebalancing it with confidence.
The invention can also be applied in the following technical fields:
1. intelligent autonomous driving systems: accelerating the self-learning of a vehicle (as an agent) in a simulation environment and making it more stable when transferred to the real environment;
2. game AI: a trained agent can keep evolving by interacting with players or with the game, learning by itself to obtain higher rewards and scores in the game;
3. intelligent robotics and related fields: equipped with the algorithm, a robot arm or robot can adapt to the real environment more quickly, meet basic task requirements rapidly, and complete tasks accurately (such as grasping, distinguishing and sorting objects).

Claims (1)

1. An adaptive double self-driven deep deterministic policy gradient reinforcement learning method for training an agent, characterized in that a plurality of judges and a plurality of executors are adopted, the agent being a robot, and the operation process comprising the following steps: the agent receives a state provided by the environment, and when the executors in the agent observe the state, each executor head generates an action vector, forming a set of K action vectors; given the same state, the judges in the agent splice each action vector into their own shared hidden layer and generate Q values one by one, thereby producing an intermediate Q value matrix of dimension K x K; meanwhile, a confidence module in the agent outputs a confidence vector c of dimension K; an E-judge layer performs a weighting operation combining the Q value matrix and the confidence vector to generate an E-Q value vector of dimension K, which represents the potential value of each action vector; finally, an E-executor layer in the agent selects, according to the E-Q value vector, the E-action corresponding to the maximum E-Q value, i.e. the action with the greatest potential to obtain the maximum reward, interacts with the current state of the environment to obtain a reward, and the environment produces a new state supplied to the agent, whereby the agent is trained, K being a natural number greater than 2; the specific training algorithm comprises the following steps:
a presetting step, setting the number K of heads;
randomly initializing assessor and executor networks with K headers
Figure FDA0002659594860000011
And copy the weights to respective target network parameters
Figure FDA0002659594860000012
Namely, it is
Figure FDA0002659594860000013
Wherein theta refers to all parameters of the neural network, and the upper right marks Q, mu, Q ', mu' respectively represent an evaluator, an executor, a target evaluator and a target executor;
initialize confidence module θC
selecting an action according to the following formula:

$$a_t = \arg\max_{a^k_t}\ \sum_{i=1}^{K} c^i_t\, Q_i\!\left(s_t, a^k_t \mid \theta^{Q}_i\right) + \mathcal{N}_t, \qquad a^k_t = \mu_k\!\left(s_t \mid \theta^{\mu}_k\right),$$

wherein $a_t$ is the action actually selected and executed at time t; $c^i_t$ is the confidence of the ith judge head at time t; $Q_i$ is the evaluation (Q value) of the ith judge head, being the output of a function with parameters $\theta^{Q}_i$ whose inputs are a state and an action; $s_t$ is the state of the environment at time t; the candidate action comes from the kth executor head $\mu_k$, likewise the output of a function with parameters $\theta^{\mu}_k$ whose input is a state; and $\mathcal{N}_t$ is random noise at time t;
sampling a self-driven mask $m_t \sim M$, and storing the transition tuple $(s_t, a_t, r_t, s_{t+1}, m_t)$ in the experience pool R, wherein $m_t$ is the self-driven mask and M is the mask distribution; $a_t$ is the action at time t, $r_t$ is the reward obtained by executing action $a_t$ in state $s_t$ at time t, and $s_{t+1}$ is the state at time t+1;
updating the kth judge head $Q_k$ by minimizing the following loss function:

$$L_k = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - Q_k\!\left(s_i, a_i \mid \theta^{Q}_k\right) \right)^{2},$$

wherein $L_k$ represents the loss function value of the kth judge head, used for training optimization; $\frac{1}{n}\sum_i$ denotes averaging the computed values over the batch of n transition tuples; $y_i$ is the target of the Q value; $Q_k(s_i, a_i \mid \theta^{Q}_k)$ is the evaluation (Q value) of the kth judge head, whose inputs are the state and action of the ith transition tuple; $Q_k(\cdot \mid \theta^{Q}_k)$ is the evaluation network of the kth head, $s_i$ is the current state of the ith transition tuple, and $a_i$ is the action executed in the current state $s_i$ of the ith transition tuple;

wherein

$$y_i = r_i + \gamma\, Q'_k\!\left(s_{i+1},\, \mu'_k\!\left(s_{i+1} \mid \theta^{\mu'}_k\right) \mid \theta^{Q'}_k\right),$$

$Q'_k(\cdot \mid \theta^{Q'}_k)$ being the judge target network of the kth head and $\mu'_k(\cdot \mid \theta^{\mu'}_k)$ the executor target network of the kth head; $y_i$ is the target of the Q value; $r_i$ is the reward value of the ith transition tuple; γ is the discount factor; two functions are nested, the outer layer being the Q value function produced by the kth target judge head, whose inputs are the next state and an action, that action being produced by the kth target executor head, which is the inner function and whose input is the next state; $s_{i+1}$ is the next state produced after executing action $a_i$ in the current state $s_i$ of the ith transition tuple;
updating the kth executor head $\mu_k$ using the policy gradient:

$$\nabla_{\theta^{\mu}_k} J \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{a} Q_k\!\left(s, a \mid \theta^{Q}_k\right)\Big|_{s=s_i,\, a=\mu_k(s_i)}\ \nabla_{\theta^{\mu}_k}\, \mu_k\!\left(s \mid \theta^{\mu}_k\right)\Big|_{s=s_i},$$

wherein $\mu_k(\cdot \mid \theta^{\mu}_k)$ is the executor network of the kth head, whose output range is [0, 1], i.e. 0 to 1; $\nabla_{\theta^{\mu}_k} J$ represents the gradient with respect to the kth executor head model parameters; $\nabla_a Q_k(s, a \mid \theta^{Q}_k)$ is the gradient of the kth judge head's Q value with respect to the action a, the action being generated by $\mu_k(s_i \mid \theta^{\mu}_k)$, i.e. by the kth executor head; and $\nabla_{\theta^{\mu}_k} \mu_k(s \mid \theta^{\mu}_k)$ is the gradient of the kth executor head with respect to its model parameters;
updating the kth pair of target network parameters according to the formulas:

$$\theta^{Q'}_k \leftarrow \tau\, \theta^{Q}_k + (1-\tau)\, \theta^{Q'}_k, \qquad \theta^{\mu'}_k \leftarrow \tau\, \theta^{\mu}_k + (1-\tau)\, \theta^{\mu'}_k,$$

wherein $\theta^{Q'}_k$ is the parameter of the kth target judge head; $\theta^{Q}_k$ is the parameter of the kth judge head; $\theta^{\mu'}_k$ is the parameter of the kth target executor head; $\theta^{\mu}_k$ is the parameter of the kth executor head; and τ is the update scale parameter;
updating the confidence module according to the policy gradient:

$$\theta^{C} \leftarrow \theta^{C} + \alpha\, \nabla_{\theta^{C}} \log \pi_{\theta^{C}}\!\left(s_i, a_i\right) Q^{\pi}\!\left(s_i, a_i\right),$$

wherein $\theta^{C}$ represents its parameters; α represents the learning rate; $\nabla_{\theta^{C}}$ is the gradient with respect to the confidence module; $\pi_{\theta^{C}}(s_i, a_i)$ is the parameterized policy output value, i.e. the output value of the confidence module; and $Q^{\pi}(s_i, a_i)$ is the evaluated Q value.
CN201811144686.6A 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method Active CN109523029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811144686.6A CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Publications (2)

Publication Number Publication Date
CN109523029A CN109523029A (en) 2019-03-26
CN109523029B true CN109523029B (en) 2020-11-03

Family

ID=65771996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811144686.6A Active CN109523029B (en) 2018-09-28 2018-09-28 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method

Country Status (1)

Country Link
CN (1) CN109523029B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363295A (en) * 2019-06-28 2019-10-22 电子科技大学 A kind of intelligent vehicle multilane lane-change method based on DQN
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110442129B (en) * 2019-07-26 2021-10-22 中南大学 Control method and system for multi-agent formation
CN110502721B (en) * 2019-08-02 2021-04-06 上海大学 Continuity reinforcement learning system and method based on random differential equation
CN112782973B (en) * 2019-11-07 2022-10-18 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN111245008B (en) * 2020-01-14 2021-07-16 香港中文大学(深圳) Wind field cooperative control method and device
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
CN111899728A (en) * 2020-07-23 2020-11-06 海信电子科技(武汉)有限公司 Training method and device for intelligent voice assistant decision strategy
CN112019249B (en) * 2020-10-22 2021-02-19 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112418436B (en) * 2020-11-19 2022-06-21 华南师范大学 Artificial intelligence ethical virtual simulation experiment method based on human decision and robot
CN112446503B (en) * 2020-11-19 2022-06-21 华南师范大学 Multi-person decision-making and potential ethical risk prevention virtual experiment method and robot
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN114202229B (en) * 2021-12-20 2023-06-30 南方电网数字电网研究院有限公司 Determining method of energy management strategy of micro-grid based on deep reinforcement learning
CN114371634B (en) * 2021-12-22 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020156973A1 (en) * 2001-01-29 2002-10-24 Ulrich Thomas R. Enhanced disk array
US20020138559A1 (en) * 2001-01-29 2002-09-26 Ulrich Thomas R. Dynamically distributed file system

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007029516A1 (en) * 2005-09-02 2007-03-15 National University Corporation Yokohama National University Reinforcement learning value function expressing method and device using this
CN103514371A (en) * 2013-09-22 2014-01-15 宁波开世通信息科技有限公司 Measuring and risk evaluation method of executive capability of scheduled task
CN103496368A (en) * 2013-09-25 2014-01-08 吉林大学 Automobile cooperative type self-adaptive cruise control system and method with learning ability
WO2017037859A1 (en) * 2015-08-31 2017-03-09 株式会社日立製作所 Information processing device and method
CN105850901A (en) * 2016-04-18 2016-08-17 华南农业大学 Detection of concentration of ammonia in breeding environment and application thereof in establishing silkworm growth and development judgment system
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107020636A (en) * 2017-05-09 2017-08-08 重庆大学 A kind of Learning Control Method for Robot based on Policy-Gradient
CN108108822A (en) * 2018-01-16 2018-06-01 中国科学技术大学 The different tactful deeply learning method of parallel training
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Value-function reinforcement learning in Markov games; Michael L. Littman; Cognitive Systems Research; 2001-12-31; pages 1-11 *
Adaptive pairs trading model based on reinforcement learning algorithms; Hu Wenwei; Management Science; 2017-03-31 (No. 2); pages 148-158 *
A survey of deep reinforcement learning; Liu Quan et al.; Chinese Journal of Computers; 2018-01-31; Vol. 41, No. 1; pages 1-22 *

Also Published As

Publication number Publication date
CN109523029A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109523029B (en) Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110515303B (en) DDQN-based self-adaptive dynamic path planning method
CN110794842A (en) Reinforced learning path planning algorithm based on potential field
Whiteson Evolutionary function approximation for reinforcement learning
CN112809689B (en) Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN111856925B (en) State trajectory-based confrontation type imitation learning method and device
CN110442129A (en) A kind of control method and system that multiple agent is formed into columns
JP2020123345A (en) Learning method and learning device for generating training data acquired from virtual data on virtual world by using generative adversarial network (gan), to thereby reduce annotation cost required in learning processes of neural network for autonomous driving, and testing method and testing device using the same
CN115099606A (en) Training method and terminal for power grid dispatching model
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
Yang et al. DDPG with meta-learning-based experience replay separation for robot trajectory planning
CN116050505A (en) Partner network-based intelligent agent deep reinforcement learning method
CN112989017B (en) Method for generating high-quality simulation experience for dialogue strategy learning
KR100850914B1 (en) method for controlling game character
Yang et al. Adaptive inner-reward shaping in sparse reward games
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
Ti et al. Dynamic movement primitives for movement generation using GMM-GMR analytical method
Chen et al. Self-imitation learning for robot tasks with sparse and delayed rewards
CN113255883A (en) Weight initialization method based on power law distribution
Zhao et al. Convolutional fitted Q iteration for vision-based control problems
CN109829490A (en) Modification vector searching method, objective classification method and equipment
Zhao et al. An improved extreme learning machine with adaptive growth of hidden nodes based on particle swarm optimization
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant