CN112668235B - Robot control method based on off-line model pre-training learning DDPG algorithm - Google Patents

Robot control method based on off-line model pre-training learning DDPG algorithm

Info

Publication number
CN112668235B
CN112668235B (application CN202011429368.1A)
Authority
CN
China
Prior art keywords
network
state
action
training
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011429368.1A
Other languages
Chinese (zh)
Other versions
CN112668235A (en)
Inventor
张茜
王洪格
姚中原
戚续博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan University of Technology
Original Assignee
Zhongyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan University of Technology filed Critical Zhongyuan University of Technology
Priority to CN202011429368.1A priority Critical patent/CN112668235B/en
Publication of CN112668235A publication Critical patent/CN112668235A/en
Application granted granted Critical
Publication of CN112668235B publication Critical patent/CN112668235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a robot control method based on a DDPG algorithm with offline model pre-training learning, which comprises the following steps: collecting training data of a 2D dummy in an offline environment and preprocessing it to obtain a training data set; constructing and initializing the artificial neural networks and their parameters; pre-training the evaluation network and the action network offline with the training data set; initializing the target networks from the pre-trained evaluation network, with the agent storing state-transition data in a storage buffer as the online data set for training the online networks; training the online policy network and the online Q network on the online data set and updating them with a DDQN structure; and performing soft updates and controlling the state of the 2D dummy. The invention learns more efficiently, generates more accurate Q values, achieves a higher average reward and a more stable and reliable learning strategy, improves the convergence speed, raises the accumulated reward to a higher level, and enables the robot to reach its destination quickly.

Description

Robot control method based on off-line model pre-training learning DDPG algorithm
Technical Field
The invention relates to the technical field of robot control, in particular to a robot control method based on an off-line model pre-training learning DDPG algorithm.
Background
Reinforcement learning is an important branch of machine learning in which an agent learns behavior in an environment by performing certain operations and observing the rewards or outcomes derived from those operations. It mainly comprises four elements: agent, environment state, action and reward. The goal of reinforcement learning is for the agent to perform actions in the positive direction as much as possible based on the positive feedback of the environment, so as to obtain the largest accumulated reward.
At present, deep reinforcement learning has had an important influence on robot simulation control, motion control, indoor and outdoor navigation, simultaneous localization and other directions, enabling robots to learn automatically through experience and interaction with the environment, in simulation and even in the real world, so as to maximize return or achieve a specific target.
DDPG (Deep Deterministic Policy Gradient) is suitable for tasks with continuous action spaces and continuous state spaces. As a classic algorithm for continuous action control, DDPG trains stably but learns slowly, and the target Q value is generally obtained directly by a greedy method, so there is a high estimation bias in the Q value; when the accumulated error reaches a certain degree, it can cause updates toward a suboptimal policy and divergent behavior, leaving the final algorithm model with a large deviation.
In addition, online reinforcement learning requires online processing of the state data and feedback reward at each moment in the environment, and after applying an action the agent must wait for the next feedback reward from the environment, which is time-consuming and costly. Moreover, in the initial training stage of reinforcement learning, the generalization ability of the action network and the evaluation network is weak, producing a large number of redundant trial-and-error actions and invalid data and wasting online computing resources to a certain extent.
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning and is widely applied to robot manipulation tasks. Dylan P. Losey et al. proposed a globally optimal guided artificial bee colony algorithm for updating robot path trajectories, and L. Tai et al. realized model-free obstacle-avoidance behavior that lets a mobile robot explore unknown environments without colliding with other objects, but discrete classification limits the accuracy of decisions in a continuous state space.
Volodymyr Mnih et al. proposed the Deep Q Network (DQN) to obtain an effective representation of the environment from high-dimensional sensory input and to generalize past experience to new situations. However, for physical control tasks with continuous, high-dimensional action spaces, DQN cannot be applied directly to continuous domains because it relies on finding the action that maximizes the value function. Timothy P. Lillicrap et al. proposed the deep deterministic policy gradient (DDPG), which solves the problems that DQN cannot handle a large continuous action space and that Actor-Critic is difficult to converge; the DDPG algorithm is widely used to solve obstacle avoidance, path planning and similar problems, and it can learn policies in high-dimensional continuous action spaces. However, like most model-free reinforcement learning methods, the DDPG algorithm requires a significant amount of training to find a solution, and since sample data acquisition is limited by real-time operation, model-based algorithms are generally superior to model-free learners in terms of sample complexity. Pfeiffer M. et al. proposed a model that can learn a collision-avoidance strategy and safely guide the robot through an obstacle environment to a specified target, but the model is trained from perfect simulation data and its navigation performance is limited.
Disclosure of Invention
Aiming at the technical problems that existing control methods using the DDPG algorithm can fall into local minima during online training and generate a large number of trial-and-error actions and invalid data when the DDPG networks are initially trained, the invention provides a robot control method based on a DDPG algorithm with offline model pre-training learning.
In order to achieve the purpose, the technical scheme of the invention is realized as follows: a robot control method based on DDPG algorithm of off-line model pre-training learning comprises the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set;
step two: establishing and initializing an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network, and initializing respective parameters; utilizing the training data set obtained in the step one to pre-train an evaluation network and an action network in an off-line manner;
step three: initializing a target evaluation network and an action network by using the pre-trained evaluation network in the step two, initializing a storage buffer R and a current first state, and storing state conversion data into the storage buffer R by the agent to serve as an online data set of the training online network;
step four: training an online strategy network and an online Q network by using the online data set obtained in the step three, and updating the online strategy network and the online Q network by using a DDQN structure;
step five: soft updating: and updating parameters in the target evaluation network and the target action network by using the online strategy network and the online Q network, and controlling the state of the 2D dummy by using the states output by the target evaluation network and the target action network.
The training data is obtained, while the 2D dummy walks from the starting point to the end point, by learning behavior in the environment through traveling and observing the rewards or results obtained while traveling, and performing actions in the positive direction according to the feedback of the environment. The state data, actions, corresponding value rewards and next states of the training data are generated randomly within the environment state and action ranges; that is, environment sample data of the 2D dummy are collected from a system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data. The data format is (S_i, A_i, R_i, S_{i+1}), where S_i is the environment state value, A_i is the action the agent performs in response to the incoming environment state value S_i, R_i is the feedback value or value-reward value, and S_{i+1} is the next environment state value. In a random environment state S_i, the agent randomly selects a behavior action A_i and performs it; after performing the action it receives the reward R_i and a new environment state S_{i+1}, and the round of data (S_i, A_i, R_i, S_{i+1}) is stored in a database.
The method for preprocessing the training data comprises: removing null values and abnormal values, normalizing the data format, adding zero-mean Gaussian noise to the actions, and storing the processed data in the training data set.
On the structure of the action network and value evaluation network of the original DDPG, 2 fully-connected artificial neural networks with similar structures are newly constructed: an object state prediction network predictNN and a value reward network ValueNN, with a similar number of artificial neurons in each layer of the network. The newly constructed object state prediction network predictNN is used to predict the state at the next moment; its inputs are the current state and the executed action, its output layer is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function. The newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is executed in the current state; its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
The method for off-line pre-training the evaluation network and the action network by using the training data set obtained in the first step comprises the following steps:
step 1, constructing and initializing the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initializing their respective parameters, and randomly selecting N samples from the training data set to train the object state model network and the value reward network;
step 2, pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
The object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) are trained in step 1 by:
minimizing the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at time i, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, given by the predictNN neural network in the state prediction network module; p is the state value of the agent and θ_p is the parameter that adjusts the object state network;
minimizing the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental return for the current state and behavior, given by the ValueNN neural network in the value reward network module; L2 is the loss value and θ_r is the parameter of the value reward network.
The evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_i, A_i) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r);
the next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p).
The method for initializing the target action network and the storage buffer R is: randomly initialize the values Q corresponding to all states and actions, randomly initialize all parameters θ of the networks, and empty the experience replay set R;
the construction method of the online data set comprises the following steps:
step 31, using a random initialization distribution N1 for action exploration, and initializing S_i as the current first state;
Step 32, the agent selects an action according to the action strategy and sends the action to the environment to execute the action;
step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks.
The method for updating the online policy network and the online Q network by using the DDQN structure in the fourth step comprises the following steps:
step 41, randomly sampling N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network, with (S_i, A_i, R_i, S_{i+1}) representing a single transition in the small batch;
step 42, predicting the next action a_i = μ′(s|θ_{μ′}) through the target action network; state s is mapped to a specific action a to maintain the specified current policy θ_μ, where μ′ denotes the policy learned by the parameterized action network μ(s|θ_μ), a policy function between a state and a specific action; and comparing Q values by using the DDQN structure;
step 43, calculating the policy gradient of the online Q network;
step 44, updating the online policy network: updating θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
In step 42, the Q value of the next step is obtained through the target evaluation network: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_Q) is the Q value for the current action and the state at the next moment, and S_{i+1} is the next state value;
the Q value is compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′})
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′);
where γ ∈ [0,1] is a decay factor that balances the importance of instant and future rewards.
The loss function of the Q network is:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²
where Q is equivalent to an evaluator, the parameter in the Q network is defined as θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q;
θ_μ and the target action network μ(s|θ_μ) are updated with the Adam optimizer: a_i = μ(s_i|θ_μ);
minimizing the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)
to obtain the optimized weights θ_μ and θ_Q.
The soft update updates the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network and the Critic target network, and τ is the update coefficient.
Compared with the prior art, the invention has the following beneficial effects: the invention uses real offline training data to pre-train the action network and the evaluation network and to construct the object prediction model network and the value reward network, which helps the robot learn more efficiently from the known environment and reach the destination quickly. It also alleviates the shortcomings of the DDPG algorithm, namely that its function approximation is not flexible enough, that noise is present, that known Q values are overestimated, and that an optimal strategy cannot be generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of the present invention.
Fig. 2 is a schematic diagram of the process of the dummy's traveling, wherein (a) is the dummy at the starting point, (b) is the dummy walking, and (c) is the dummy at the ending point.
FIG. 3 is a loss curve of the prediction model training of the present invention.
FIG. 4 is a loss curve of the reward function training of the present invention.
Fig. 5 is a training reward curve of a conventional DDPG algorithm.
FIG. 6 is a graph of improved training rewards after pre-training of the model of the invention.
FIG. 7 is a comparison graph of the evaluated rewards of the present invention and a conventional DDPG algorithm in a noise-free environment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, a robot control method based on DDPG algorithm of offline model pre-training learning includes the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set.
The experimental environment is Windows 10 + PaddlePaddle 1.7 + PARL 1.3.1 + CUDA 10.0. The hardware is a Core i8-8300 CPU with a GTX 1060 graphics card, and the simulation platform is BipedalWalker-v2. The DDPG algorithm and the improved DDPG algorithm based on the offline model are each trained for 4000 rounds, and the relationship between the feedback reward value of the robot (the 2D dummy) from the starting point to the end point and the number of training rounds is analyzed.
BipedalWalker-v2 is an open-source simulator whose terrain is generated completely at random; its task is to make the 2D dummy walk from the starting point to the end point. The robot has four controllable joints, connected respectively to the hips of the left and right legs and to the knees of the left and right legs, and it simulates the forward walking of a bipedal animal. The farther forward the robot travels, the more points are scored, and points are deducted if it falls, so the trained model must be very robust to obtain a high average score. As shown in fig. 2, as the robot moves from the starting point to the end point, it learns behavior in the environment by traveling and observing the rewards or results obtained while traveling, and performs actions in the positive direction as much as possible according to the environment's feedback, thereby learning a good policy. The training data are obtained by randomly generating state data, actions, corresponding value rewards and next states within the environment state and action ranges.
Environment sample data of the 2D dummy are collected from the system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data. The data format is (S_t, A_t, R_t, S_{t+1}), where S_t is the environment state value, A_t is the action the agent performs in response to the incoming environment state value S_t, R_t is the feedback value or value-reward value, and S_{t+1} is the next environment state value; after performing the action the return value R_t is obtained and the new state S_{t+1} is updated.
Collecting training data: in a random environment state S_t, the agent randomly selects a behavior action A_t and performs it; after performing the action it receives the reward R_t and a new environment state S_{t+1}. The round of data (S_t, A_t, R_t, S_{t+1}) is then stored in a database. At this stage the data are only collected; the data can also be acquired in other ways.
Data preprocessing: much of the collected data is so-called dirty data because of incompleteness, inconsistency between earlier and later records, and similar characteristics; if such data are used directly for model pre-training without considering their intrinsic characteristics, the final result has a large error and the overall effect suffers. Therefore, before the data are used, null values and abnormal values must be removed and the data format normalized, which reduces interference and improves prediction accuracy. In addition, zero-mean Gaussian noise is added to the actions to improve the robustness of the model, and the processed data are finally stored in the training data set.
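The preprocessing described above could be sketched as follows; the zero-mean/unit-variance scaling of the states and the noise scale are assumptions, since the text only specifies removing null and abnormal values, normalizing the format, and adding zero-mean Gaussian noise to the actions:

```python
import numpy as np


def preprocess(transitions, noise_std=0.05):
    """Clean offline transitions: drop null/abnormal rows, normalize states, add Gaussian noise to actions."""
    S  = np.asarray([t[0] for t in transitions], dtype=np.float64)
    A  = np.asarray([t[1] for t in transitions], dtype=np.float64)
    R  = np.asarray([t[2] for t in transitions], dtype=np.float64)
    S1 = np.asarray([t[3] for t in transitions], dtype=np.float64)

    # remove rows containing NaN or inf (null and abnormal values)
    keep = np.all(np.isfinite(np.hstack([S, A, R[:, None], S1])), axis=1)
    S, A, R, S1 = S[keep], A[keep], R[keep], S1[keep]

    # normalization: zero mean, unit variance for the states (one possible choice)
    mean, std = S.mean(axis=0), S.std(axis=0) + 1e-8
    S, S1 = (S - mean) / std, (S1 - mean) / std

    # zero-mean Gaussian noise on the actions to improve robustness
    A = A + np.random.normal(0.0, noise_std, size=A.shape)
    return S, A, R, S1
```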
Step two: an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network are constructed and initialized, and respective parameters are initialized; and (4) utilizing the training data set obtained in the step one to pre-train the evaluation network and the action network in an off-line manner.
Offline pre-training of the evaluation network and the action network: N sample data are extracted from the preprocessed training data set, the object state model network and the value reward network are trained offline, and these two offline networks are used to simulate the online training process in advance so as to pre-train the action network and the evaluation network in DDPG, which reduces a large amount of early trial-and-error work and improves the efficiency and quality of online learning.
In 2016, DDPG was proposed by DeepMind as a combination of the Actor-Critic framework and the DQN algorithm: an off-policy, model-free deep reinforcement learning algorithm for continuous action spaces. The DDPG network is based on the Actor-Critic method, so it contains both a policy-based neural network and a value-based neural network, namely a policy network that generates actions and an evaluation network that judges them, and it absorbs the strengths of DQN by using a sample experience replay pool and fixed target networks. On top of the DPG algorithm, DDPG simulates the policy function and the Q function with convolutional neural networks and trains them by deep learning instead of linear regression, which demonstrates the accuracy, high performance and convergence of nonlinear simulation functions in the reinforcement learning method.
The DDPG algorithm structure comprises an action network with parameters θ_π and an evaluation network with parameters θ_Q, which compute the deterministic policy a = π(s|θ_π) and the action value function Q(s, a|θ_Q) respectively. Since the learning process of a single network is not stable, the successful experience of DQN's fixed target network is borrowed, and the action network and the evaluation network are each subdivided into a real network and an estimation network. The real network and the estimation network have the same structure, and the estimation network parameters are soft-updated by the real network parameters at a certain frequency. The action estimation network outputs the real-time action the agent executes in the real environment, and the action real network is used to update the evaluation network. Likewise, the value evaluation network is subdivided into a real network and an estimation network for outputting the value reward of each state; their inputs differ: the state real network analyzes the observed values of the actions and states supplied by the action real network, while the state estimation network takes the action applied by the agent as input. The value of evaluating an action is called the Q value: it represents the expectation that the agent selects this action and the total reward obtained up to the final state.
DDPG is a data-driven control method that can learn a generative model from the input/output state data of the 2D dummy and realize an optimal strategy for the 2D dummy to reach the destination according to a given reward. In the real world the collection of sample data is limited by real-time operation, so the invention performs preprocessing on offline data, trains the object's state prediction model and value reward prediction model offline, uses these two models to train the action network and evaluation network in reinforcement learning to complete the offline pre-learning, and then lets the action network and evaluation network learn on the actual object, which greatly reduces the agent's workload and helps it complete tasks more efficiently.
The invention constructs 2 fully-connected artificial neural networks with similar structures, an object state prediction network predictNN and a value reward network ValueNN, on top of the structure of the original action network and value evaluation network of DDPG, with a basically similar number of artificial neurons in each layer of the network. The newly constructed object state prediction network predictNN is used to predict the state at the next moment; its inputs are the current state and the executed action, its last layer (the output layer) is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function. The newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is performed in the current state; its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
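As an illustration only, the two newly constructed networks might look like the sketch below (the patent's experiments use PaddlePaddle/PARL; PyTorch, the hidden sizes and the layer count here are assumptions, while the inputs, ReLU hidden activations and linear output layers follow the description above):

```python
import torch
import torch.nn as nn


class PredictNN(nn.Module):
    """Object state prediction network: (state, action) -> predicted next state, linear output layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),   # linear output: predicted S_{i+1}
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class ValueNN(nn.Module):
    """Value reward network: (state, action) -> predicted feedback reward, linear output layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),           # linear output: scalar reward R_i
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```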
The method for off-line pre-training the evaluation network and the action network by utilizing the training data set obtained in the step one comprises the following steps:
Step 1, construct and initialize the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initialize their respective parameters, and randomly select N samples from the training data set to train the object state model network and the value reward network.
The method for training the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) is: preprocess the offline data, randomly initialize the weights, and verify that the loss function value is minimized to obtain the accuracy of the network.
The newly constructed object state model network and value reward network have different functions and structures, so their training methods differ and different loss functions are used. Minimize the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²    (1)
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at the i-th moment, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, whose model is obtained after network training and given by the predictNN neural network in the state prediction network module. P(s, a|θ_p) represents the state value after performing action a_i in state S_i; a_i is the action performed by the agent, p is the agent's state value, and θ_p is the parameter that adjusts the object state network. The Q network, for example, is the evaluation network that evaluates and scores the action output by the agent at each step, and the θ_Q parameters of its neural network are adjusted according to the feedback reward of the audience, i.e. the environment.
Minimize the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²    (2)
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental reward for the current state and behavior, given by the ValueNN neural network in the value reward network module. L2 is the loss value used to express the degree of difference between the predicted and actual data (the smaller the loss, the better the model's prediction), and θ_r is the parameter of the value reward network.
Step 2, pre-train the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
The evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_t, A_t) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r)    (3)
The next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p)    (4)
The action network and the evaluation network in DDPG are pre-trained by using the two constructed models, the object state model and the value reward model, to simulate the online training process in advance, with the two models simulating the environment to feed back the reward value and the next state value.
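A sketch of this pre-training stage under the definitions above: L1 and L2 are minimized on the offline batch, after which the trained P and r networks stand in for the environment (Eqs. (3) and (4)) when pre-training the actor and critic. PredictNN and ValueNN are the illustrative classes sketched earlier; the optimizer, learning rate and epoch count are assumptions:

```python
import torch
import torch.nn.functional as F


def train_model_networks(predict_nn, value_nn, S, A, R, S1, epochs=200, lr=1e-3):
    """Minimize L1 = MSE(s_{i+1}, P(s_i, a_i)) and L2 = MSE(r_i, r(s_i, a_i)) on the offline data."""
    opt_p = torch.optim.Adam(predict_nn.parameters(), lr=lr)
    opt_r = torch.optim.Adam(value_nn.parameters(), lr=lr)
    S, A, R, S1 = (torch.as_tensor(x, dtype=torch.float32) for x in (S, A, R, S1))
    for _ in range(epochs):
        loss_p = F.mse_loss(predict_nn(S, A), S1)           # L1: object state model loss
        loss_r = F.mse_loss(value_nn(S, A).squeeze(-1), R)  # L2: value reward loss
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
        opt_r.zero_grad(); loss_r.backward(); opt_r.step()
    return predict_nn, value_nn


def simulated_step(predict_nn, value_nn, s, a):
    """Use the trained models in place of the environment while pre-training the DDPG networks offline."""
    with torch.no_grad():
        r_pred = value_nn(s, a).squeeze(-1)   # R_i = r(s_i, a_i | theta_r), Eq. (3)
        s_next = predict_nn(s, a)             # S_{i+1} = P(s_i, a_i | theta_p), Eq. (4)
    return r_pred, s_next
```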
Step three: initialize the target evaluation network and the target action network with the evaluation network pre-trained in step two, initialize the storage buffer R and the current first state, and let the agent store state-transition data in the storage buffer R as the online data set for training the online networks.
Initializing a target network corresponding to the evaluation network and the action network, and initializing the storage buffer R; randomly initializing values Q corresponding to all states and actions, randomly initializing all parameters theta of the network, and emptying the set R of experience playback.
Step 31, a random initialization distribution N1 is used for action exploration, and S_i is initialized as the current first state.
Step 32, the intelligent agent selects an action according to the action strategy and sends the action to the environment to execute the action;
Step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
Step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks.
Step four: train the online policy network and the online Q network with the online data set obtained in step three, and update the online policy network and the online Q network using a DDQN structure.
Any kind of estimation error causes an upward bias, whether the errors come from environmental noise, function approximation, non-stationarity or any other source. Therefore, the invention adds a DDQN structure to the processing of the Q network, separates the action-selection network from the evaluation network, and learns two value functions by randomly assigning each experience to update one of them, thereby obtaining two sets of weights θ and θ′. First, the action corresponding to the maximum Q value is found in the action network; then the selected action is used to calculate the Q value in the evaluation network, and the smaller of the two values is used to calculate the target Q value in the target network. For each update, one set of weights θ is used to determine the greedy policy and the other set θ′ is used to determine its value. The value of the greedy policy is estimated according to the current value defined by the weights θ, and the second set of weights θ′ evaluates the value of that policy fairly. No additional network is introduced; instead the target network is used to evaluate the update of the target value, and the update of the target network remains the same as in DQN. This moves DQN as close as possible to double Q-learning, obtains a Q value that is as accurate as possible, and generates a better policy.
Step 41, randomly sample N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network. In the present invention (S_i, A_i, R_i, S_{i+1}) represents a single transition in the small batch.
DDPG has four networks: the action networks, namely the Actor current network (policy network) and the Actor target network, and the evaluation networks, namely the Critic current network (current Q network) and the Critic target network. The Actor current network is responsible for the iterative update of the policy network parameters θ, for selecting the current action A according to the current state S, and for generating S_{i+1} by interacting with the environment. The Actor target network is responsible for selecting the optimal next action A_{i+1} for the next state S_{i+1} sampled from the experience replay pool. The Critic current network is responsible for the iterative update of the evaluation network parameters θ_Q and for calculating the current Q value. The Critic target network is responsible for calculating Q′ in the target Q value (distinguished from the current Q value and representing the next Q value); its network parameters θ_{Q′} are periodically copied from θ_Q.
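For illustration, the four networks could be set up as below; the current networks are built first and the target networks are initialized as copies of them (PyTorch, the hidden sizes and the 24/4 BipedalWalker state/action dimensions are assumptions):

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Actor current network: state -> deterministic action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Critic current network: (state, action) -> Q value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


# the four DDPG networks: current Actor/Critic plus their target copies
actor, critic = Actor(24, 4), Critic(24, 4)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```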
Step 42, predict the next action a_i = μ′(s|θ_{μ′}) through the target action network. State s is mapped to a specific action a to maintain the specified current policy θ_μ; the parameterization μ′ corresponds to μ at step i+1. This does not mean that the current μ corresponds to an optimal behavior policy, but that the parameterized action network μ(s|θ_μ) learns a specific policy and builds a policy function between the state and the specific action.
The Q value of the next step is obtained through the target evaluation network: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_{Q′}) is the Q value (evaluation value) obtained for the current action and the state at the next moment, and S_{i+1} is the next state value.
The Q values are compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′})    (11)
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′)    (12)
where γ ∈ [0,1] is a decay factor that balances the importance of instant and future rewards.
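One reading of Eqs. (11) and (12) is a clipped double-Q target: the target actor picks the next action, two Q estimates are computed with the two sets of weights, and the smaller one is discounted and added to the reward. The sketch below follows that reading; which pair of critics supplies the two estimates is an interpretation of the text, not a quotation of it:

```python
import torch


def ddqn_target(reward, next_state, actor_target, critic, critic_target, gamma=0.99):
    """Clipped double-Q target: Q_i = r_i + gamma * min(Q_{i+1}, Q_{i+1}')."""
    with torch.no_grad():
        a_next = actor_target(next_state)        # a_i = mu'(s | theta_mu')
        q1 = critic(next_state, a_next)          # estimate with one set of weights (theta_Q)
        q2 = critic_target(next_state, a_next)   # estimate with the other set (theta_Q')
        q_next = torch.min(q1, q2)               # Eq. (12): take the smaller estimate
        return reward.unsqueeze(-1) + gamma * q_next
```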
Step 43, calculate the policy gradient of the online Q network, where the loss function of the Q network is defined as:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²    (13)
where Q is equivalent to an evaluator, the parameter in the Q network is defined as θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q.
Step 44, update the online policy network: update θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
a_i = μ(s_i|θ_μ)    (14)
Minimize the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)    (15)
to obtain the optimized policies θ_μ and θ_Q.
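Steps 43 and 44 could then be implemented as below, reusing the ddqn_target sketch above: the critic minimizes the squared error of Eq. (13) against the target Q_i, and the actor is updated by minimizing the negative of Q(s_i, μ(s_i|θ_μ)), which matches Eq. (15). The Adam optimizers follow the text; everything else (tensor batch handling, hyperparameters) is an assumption:

```python
import torch
import torch.nn.functional as F


def update_networks(batch, actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, gamma=0.99):
    """One online update of the Critic (Eq. (13)) and the Actor (Eqs. (14)-(15)); a sketch, not the patented code."""
    s, a, r, s_next = batch                                   # tensors sampled from the storage buffer R

    # Critic: minimize (Q_i - Q(s_i, a_i | theta_Q))^2 with the clipped double-Q target
    target = ddqn_target(r, s_next, actor_target, critic, critic_target, gamma)
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s_i, mu(s_i | theta_mu)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()


# e.g. actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
#      critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```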
Step five: soft update: update the parameters in the target evaluation network and the target action network with the online policy network and the online Q network. Through reinforcement learning the invention is equivalent to a trained brain; the state values are the states of the 2D dummy, including position, posture, velocity, acceleration and the angles of the foot joints, and the action values are the velocities of the two joints of the two legs.
Perform the soft update, updating the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network (current Q network) and the Critic target network, and τ is the update coefficient, which ranges from 0.01 to 0.1 to avoid excessively large parameter changes. A soft update is used, i.e. each parameter is updated only a little at a time instead of being directly assigned, and each network corresponds to its own parameters.
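A per-parameter sketch of this soft update (the function name and the use of PyTorch parameter iterators are illustrative):

```python
def soft_update(target_net, online_net, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta', applied to every parameter of the target network."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)


# applied to both pairs of networks after each update step:
# soft_update(actor_target, actor, tau=0.01)
# soft_update(critic_target, critic, tau=0.01)
```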
The loss curve of the prediction model training is shown in fig. 3 and the loss curve of the reward function training in fig. 4. As can be seen from figs. 3 and 4, the training loss curves generally show a downward trend, the fluctuation of loss values between adjacent points is small, the convergence rate is fast and the variation of the data is small, so the prediction model describes the experimental data with good accuracy; the final models reach a converged state and the error of the model predictions is reduced.
Fig. 5 is the training reward curve of the conventional DDPG algorithm and fig. 6 is the training reward curve of the invention, with the abscissa being the number of training rounds and the ordinate the reward per round. The improved DDPG algorithm of the invention obtains more reward per 500 rounds on average than the DDPG algorithm, which shows that the model-based improvement of the DDPG algorithm can effectively improve its performance: pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) with the prediction model network and the feedback reward network better determines how many times each action needs to be repeated in each state, saving the agent most of the repeated executions of multiple actions and improving its decision-making ability.
Fig. 7 shows the average cumulative reward per 100 rounds obtained in a noise-free environment; the higher the cumulative reward, the better the robot performs with respect to the expected target. Both curves in fig. 7 show an increasing trend, and once the number of training rounds exceeds a certain value, the average cumulative reward of the improved DDPG algorithm of the invention has already become stable overall at around round 2600, with a value of about 300, while the original DDPG algorithm only begins to stabilize at around round 3600; clearly the former stabilizes earlier than the latter, and its convergence rate is better.
It is apparent from figure 7 that in the first 1500 rounds the cumulative reward of the improved DDPG algorithm of the invention is lower than that of the DDPG algorithm, while in later training the improved DDPG algorithm is higher overall than the average cumulative reward of the DDPG algorithm. In rounds 0-4000 the average cumulative reward of the invention is 82.3, the maximum is 142 and the minimum is -58; the original DDPG algorithm has an average reward of 75.4, a maximum of 118 and a minimum of -66. In the test environment the former has an average reward of 198.2, a highest reward of 302 and a lowest of -198; the latter has an average reward of 189.6, a highest of 281 and a lowest of -186.4.
The invention constructs networks that learn an object state model and a value reward model from offline sample data, and pre-trains the action network and evaluation network through these models (the two constructed models simulate the environment to train the action network and value network of the original DDPG in advance), which saves the cost of online learning and improves its quality and efficiency. In addition, a DDQN network structure is added: the maximization over actions in the target is decomposed into action evaluation and action selection to reduce the overestimation of Q values, the greedy policy is evaluated according to the online network, and the value of the target network is evaluated so as to reach a strategy as close to optimal as possible, giving a more stable and reliable learning process.
Simulation results on the BipedalWalker-v2 platform show that the maximum cumulative reward obtained by the improved DDPG algorithm reaches a higher level, a stable state is reached more quickly, and the 2D dummy reaches the destination more efficiently and quickly during operation.
According to the method, a large amount of offline data is used to train the object state model and the reward model, and the DDPG networks are then pre-trained offline by a model-based reinforcement learning method, improving the decision-making ability of the networks offline and thereby improving the efficiency and performance of online learning; at the same time, the double-Q-value network structure of the DDQN algorithm is used to prevent the Q value from being overestimated during online training, eliminating the situation of Q overestimation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (10)

1. A robot control method based on DDPG algorithm of off-line model pre-training learning is characterized by comprising the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set;
step two: establishing and initializing an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network, and initializing respective parameters; utilizing the training data set obtained in the step one to pre-train an evaluation network and an action network in an off-line manner;
an off-line pre-training evaluation network and an action network: extracting N sample data from the preprocessed training data set, training an object state model network and a value reward network in an off-line manner, and simulating an on-line training process in advance by utilizing the two off-line object state model networks and the value reward network to pre-train and learn an action network and an evaluation network in the DDPG;
step three: initializing a target evaluation network and an action network by using the pre-trained evaluation network in the second step, initializing a storage buffer R and a current first state, and storing state conversion data into the storage buffer R by the agent to serve as an online data set of the training online network;
step four: training an online strategy network and an online Q network by using the online data set obtained in the step three, and updating the online strategy network and the online Q network by using a DDQN structure;
step five: soft updating: and updating parameters in the target evaluation network and the target action network by using the online policy network and the online Q network, and controlling the state of the 2D dummy by using the target evaluation network and the target action network.
2. The robot control method based on the DDPG algorithm of offline model pre-training learning according to claim 1, characterized in that the training data is obtained, while the 2D dummy walks from the starting point to the end point, by learning behavior in the environment through traveling and observing the rewards or results obtained while traveling, and performing actions in the positive direction according to the feedback of the environment; the state data, actions, corresponding value rewards and next states of the training data are generated randomly within the environment state and action ranges, namely, environment sample data of the 2D dummy are collected from a system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data; the data format is (S_i, A_i, R_i, S_{i+1}), where S_i is the environment state value, A_i is the action the agent performs in response to the incoming environment state value S_i, R_i is the feedback value or value-reward value, and S_{i+1} is the next environment state value; in a random environment state S_i, the agent randomly selects a behavior action A_i and performs it, receives the reward R_i and a new environment state S_{i+1} after performing the action, and then stores the round of data (S_i, A_i, R_i, S_{i+1}) in a database.
3. The robot control method based on the DDPG algorithm of offline model pre-training learning of claim 2, wherein the method of the pre-processing of the training data is as follows: processing to remove null values and abnormal values, and carrying out normalization conversion on the format of the data; zero-mean gaussian noise is added to the action and the processed data is stored in a training data set.
4. The robot control method based on the off-line model pre-training learning DDPG algorithm according to claim 1 or 2, characterized in that 2 fully connected artificial neural networks with similar structures, an object state prediction network predictNN and a value reward network ValueNN, are newly constructed on the action network and evaluation network structure of the original DDPG, with a similar number of artificial neurons in each network; the newly constructed object state prediction network predictNN is used to predict the state at the next moment, its inputs are the current state and the executed action, its output layer is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function; the newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is performed in the current state, its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
5. The robot control method based on the DDPG algorithm of offline model pre-training learning of claim 4, characterized in that the method for offline pre-training the evaluation network and the action network by using the training data set obtained in step one is:
step 1, constructing and initializing the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initializing their respective parameters, and randomly selecting N samples from the training data set to train the object state model network and the value reward network;
step 2, pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
6. The robot control method for the DDPG algorithm based on offline model pre-training learning according to claim 5, wherein the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) are trained in step 1 by:
minimizing the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at time i, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, given by the predictNN neural network in the state prediction network module; p is the state value of the agent and θ_p is the parameter that adjusts the object state network;
minimizing the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental return for the current state and behavior, given by the ValueNN neural network in the value reward network module; L2 is the loss value and θ_r is the parameter of the value reward network;
the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_i, A_i) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r);
the next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p).
7. The robot control method based on the DDPG algorithm of offline model pre-training learning according to claim 6, characterized in that the target action network is initialized and the storage buffer R is initialized by: randomly initializing the values Q corresponding to all states and actions, randomly initializing all parameters θ of the networks, and emptying the experience replay set R;
the online data set is constructed by:
step 31, using a random initialization distribution N1 for action exploration, and initializing S_i as the current first state;
step 32, the agent selects an action according to the action policy and sends it to the environment for execution;
step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks, where A_i is the action and R_i is the feedback value.
8. The robot control method based on the offline model pre-training learning DDPG algorithm according to claim 7, wherein the online policy network and the online Q network are updated by using the DDQN structure in step four by:
step 41, randomly sampling N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network, with (S_i, A_i, R_i, S_{i+1}) representing a single transition in the small batch;
step 42, predicting the next action a_i = μ′(s|θ_{μ′}) through the target action network; state s is mapped to a specific action a to maintain the specified current policy θ_μ, where μ′ denotes the policy learned by the parameterized action network μ(s|θ_μ), a policy function between a state and a specific action; and comparing Q values by using the DDQN structure;
step 43, calculating the policy gradient of the online Q network;
step 44, updating the online policy network: updating θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
9. The robot control method based on the offline model pre-training learning DDPG algorithm according to claim 8, characterized in that the Q value of the next step is obtained through the target evaluation network in step 42: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_Q) is the Q value for the current action and the state at the next moment, and S_{i+1} is the next state value;
the Q value is compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′}),
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′);
where γ ∈ [0,1] is a decay factor;
the loss function of the Q network is:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²
where the parameter in the Q network is θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q;
θ_μ and the target action network μ(s|θ_μ) are updated with the Adam optimizer: a_i = μ(s_i|θ_μ);
minimizing the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)
to obtain the optimized policies θ_μ and θ_Q.
10. The method of robot control based on the DDPG algorithm of offline model pre-training learning according to claim 9, wherein the soft update updates the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network and the Critic target network, and τ is the update coefficient.
CN202011429368.1A 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm Active CN112668235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011429368.1A CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011429368.1A CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Publications (2)

Publication Number Publication Date
CN112668235A CN112668235A (en) 2021-04-16
CN112668235B true CN112668235B (en) 2022-12-09

Family

ID=75401628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011429368.1A Active CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Country Status (1)

Country Link
CN (1) CN112668235B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112018B (en) * 2021-04-27 2023-10-31 清华大学深圳国际研究生院 Batch limitation reinforcement learning method
CN113128689A (en) * 2021-04-27 2021-07-16 中国电力科学研究院有限公司 Entity relationship path reasoning method and system for regulating knowledge graph
CN113191487B (en) * 2021-04-28 2023-04-07 重庆邮电大学 Self-adaptive continuous power control method based on distributed PPO algorithm
CN113408782B (en) * 2021-05-11 2023-01-31 山东师范大学 Robot path navigation method and system based on improved DDPG algorithm
CN113240118B (en) * 2021-05-18 2023-05-09 中国科学院自动化研究所 Dominance estimation method, dominance estimation device, electronic device, and storage medium
CN113364712B (en) * 2021-05-19 2022-06-14 电子科技大学 DDPG network-based mixed radiation source signal separation method
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113290557A (en) * 2021-05-21 2021-08-24 南京信息工程大学 Snake-shaped robot control method based on data driving
CN113282705B (en) * 2021-05-24 2022-01-28 暨南大学 Case pre-judgment intelligent body training method and system capable of being automatically updated
CN113340324B (en) * 2021-05-27 2022-04-29 东南大学 Visual inertia self-calibration method based on depth certainty strategy gradient
CN113312874B (en) * 2021-06-04 2022-12-06 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113532457B (en) * 2021-06-07 2024-02-02 山东师范大学 Robot path navigation method, system, equipment and storage medium
CN113341972A (en) * 2021-06-07 2021-09-03 沈阳理工大学 Robot path optimization planning method based on deep reinforcement learning
CN113219997B (en) * 2021-06-08 2022-08-30 河北师范大学 TPR-DDPG-based mobile robot path planning method
CN113392396B (en) * 2021-06-11 2022-10-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113554166A (en) * 2021-06-16 2021-10-26 中国人民解放军国防科技大学 Deep Q network reinforcement learning method and equipment for accelerating cognitive behavior model
CN113868113A (en) * 2021-06-22 2021-12-31 中国矿业大学 Class integration test sequence generation method based on Actor-Critic algorithm
CN113361132B (en) * 2021-06-28 2022-03-15 浩鲸云计算科技股份有限公司 Air-cooled data center energy-saving method based on deep Q learning block network
CN113204061B (en) * 2021-07-06 2021-10-08 中国气象局公共气象服务中心(国家预警信息发布中心) Method and device for constructing lattice point wind speed correction model
CN113568954B (en) * 2021-08-02 2024-03-19 湖北工业大学 Parameter optimization method and system for preprocessing stage of network flow prediction data
CN113821045B (en) * 2021-08-12 2023-07-07 浙江大学 Reinforced learning action generating system of leg-foot robot
CN113744719A (en) * 2021-09-03 2021-12-03 清华大学 Voice extraction method, device and equipment
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113741525B (en) * 2021-09-10 2024-02-06 南京航空航天大学 Policy set-based MADDPG multi-unmanned aerial vehicle cooperative attack and defense countermeasure method
CN113904948B (en) * 2021-11-12 2023-11-03 福州大学 5G network bandwidth prediction system and method based on cross-layer multidimensional parameters
CN114692890A (en) * 2021-12-24 2022-07-01 中国人民解放军军事科学院战争研究院 Model-based weight combination planning value extension method
CN114943278B (en) * 2022-04-27 2023-09-12 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN114697394B (en) * 2022-05-27 2022-08-16 合肥工业大学 Edge cache decision model, method and system based on discrete MADDPG
CN114844822A (en) * 2022-06-02 2022-08-02 广东电网有限责任公司 Networking method, device and equipment of power line carrier network and storage medium
CN114771783B (en) * 2022-06-02 2023-08-22 浙江大学 Control method and system for submarine stratum space robot
CN114708568B (en) * 2022-06-07 2022-10-04 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN115319741B (en) * 2022-08-05 2023-10-10 美的集团(上海)有限公司 Robot control model training method and robot control method
CN115128960B (en) * 2022-08-30 2022-12-16 齐鲁工业大学 Method and system for controlling motion of biped robot based on deep reinforcement learning
CN115475036A (en) * 2022-08-31 2022-12-16 上海电机学院 Adaptive control method, equipment and storage medium for intelligent artificial limb shoulder joint
CN115145592A (en) * 2022-09-01 2022-10-04 新华三技术有限公司 Offline model deployment method and device, network equipment and analyzer
CN115758705B (en) * 2022-11-10 2023-05-05 北京航天驭星科技有限公司 Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115837677B (en) * 2023-02-24 2023-04-28 深圳育智科创科技有限公司 Robot intelligent control method
CN116430860A (en) * 2023-03-28 2023-07-14 兰州大学 Off-line reinforcement learning-based automatic driving training and control method for locomotive
CN117237720B (en) * 2023-09-18 2024-04-12 大连理工大学 Label noise correction image classification method based on reinforcement learning
CN117313826B (en) * 2023-11-30 2024-02-23 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning
CN117850244B (en) * 2024-03-04 2024-05-07 海克斯康制造智能技术(青岛)有限公司 Visual measurement control system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11002202B2 (en) * 2018-08-21 2021-05-11 Cummins Inc. Deep reinforcement learning for air handling control

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019040901A1 (en) * 2017-08-25 2019-02-28 Google Llc Batched reinforcement learning
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN110414725A (en) * 2019-07-11 2019-11-05 山东大学 The integrated wind power plant energy-storage system dispatching method of forecast and decision and device
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning
CN111652371A (en) * 2020-05-29 2020-09-11 京东城市(北京)数字科技有限公司 Offline reinforcement learning network training method, device, system and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning with Double Q-learning;Hado van Hasselt et al.;《Association for the Advancement of Artificial Intelligence》;20161231;全文 *
The Path Planning of Mobile Robot by Neural Networks and Hierarchical Reinforcement Learning;Jinglun Yu et al.;《Frontiers in Neurorobotics》;20201002;第14卷;全文 *
UAV landing trajectory tracking control based on deep reinforcement learning; Song Xinyu et al.; Aeronautical Science & Technology; 2020-01-25; Vol. 31, No. 1; full text *
Application of reinforcement learning in obstacle avoidance for mobile robots; Tang Peng et al.; Scientist; 2016-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN112668235A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
Whiteson et al. Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
CN113947022B (en) Near-end strategy optimization method based on model
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN113743603A (en) Control method, control device, storage medium and electronic equipment
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN115765050A (en) Power system safety correction control method, system, equipment and storage medium
CN113807005A (en) Bearing residual life prediction method based on improved FPA-DBN
Morales Deep Reinforcement Learning
Raza et al. Policy reuse in reinforcement learning for modular agents
CN114114911B (en) Automatic super-parameter adjusting method based on model reinforcement learning
Gao Soft computing methods for control and instrumentation
CN112008734B (en) Robot control method and device based on component interaction degree
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario
Li et al. Proximal policy optimization with model-based methods
Liu et al. Improving learning from demonstrations by learning from experience
Norouzzadeh et al. Efficient Knowledge Transfer in Shaping Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant