CN112668235B - Robot control method based on off-line model pre-training learning DDPG algorithm - Google Patents

Robot control method based on off-line model pre-training learning DDPG algorithm

Info

Publication number
CN112668235B
CN112668235B (application CN202011429368.1A)
Authority
CN
China
Prior art keywords
network
state
action
training
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011429368.1A
Other languages
Chinese (zh)
Other versions
CN112668235A (en)
Inventor
张茜
王洪格
姚中原
戚续博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan University of Technology
Original Assignee
Zhongyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan University of Technology filed Critical Zhongyuan University of Technology
Priority to CN202011429368.1A priority Critical patent/CN112668235B/en
Publication of CN112668235A publication Critical patent/CN112668235A/en
Application granted granted Critical
Publication of CN112668235B publication Critical patent/CN112668235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a robot control method based on a DDPG algorithm with offline model pre-training learning, which comprises the following steps: collecting training data of a 2D dummy in an offline environment and preprocessing it to obtain a training data set; constructing and initializing the artificial neural networks and their parameters; pre-training the evaluation network and the action network offline with the training data set; initializing the target networks from the pre-trained evaluation network, with the agent storing state-transition data in a storage buffer as the online data set for training the online networks; training the online policy network and the online Q network on the online data set and updating them with a DDQN structure; and performing soft updates and controlling the state of the 2D dummy. The invention learns more efficiently, generates more accurate Q values, achieves a higher average reward and a more stable and reliable learning strategy, improves the convergence speed, raises the accumulated reward to a higher level, and enables the robot to reach its destination quickly.

Description

Robot control method based on off-line model pre-training learning DDPG algorithm
Technical Field
The invention relates to the technical field of robot control, in particular to a robot control method based on an off-line model pre-training learning DDPG algorithm.
Background
Reinforcement learning is an important branch of machine learning in which an agent learns behavior in an environment by performing certain operations and observing the rewards or outcomes derived from those operations. It mainly comprises four elements: agent, environment state, action and reward. The goal of reinforcement learning is for the agent to perform actions in the positive direction as much as possible based on the positive feedback of the environment, so as to obtain the largest accumulated reward.
At present, deep reinforcement learning has had an important influence on robot simulation control, motion control, indoor and outdoor navigation, simultaneous localization and other directions, enabling robots to learn automatically through experience and interaction with the environment, in simulation and even in the real world, so as to maximize return or achieve a specific target.
DDPG (Deep Deterministic Policy Gradient) is suitable for tasks with continuous action spaces and continuous state spaces. As a classic algorithm for continuous action control, DDPG trains stably but learns slowly, and the target Q value is generally obtained directly by a greedy method, so there is a high estimation bias in the Q value; when the accumulated error reaches a certain degree, it can cause updates toward a suboptimal policy and divergent behavior, leaving the final algorithm model with a large deviation.
In addition, online reinforcement learning requires online processing of the state data and feedback reward at each moment in the environment, and after applying an action the agent must wait for the next feedback reward from the environment, which is time-consuming and costly. Moreover, in the initial training stage of reinforcement learning, the generalization ability of the action network and the evaluation network is weak, producing a large number of redundant trial-and-error actions and invalid data and wasting online computing resources to a certain extent.
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning and is widely applied to robot manipulation tasks. Dylan P. Losey et al. proposed a globally optimal guided artificial bee colony algorithm for updating robot path trajectories, and L. Tai et al. realized model-free obstacle-avoidance behavior that lets a mobile robot explore unknown environments without colliding with other objects, but discrete classification limits the accuracy of decisions in a continuous state space.
Volodymyr Mnih et al. proposed the Deep Q Network (DQN) to obtain an effective representation of the environment from high-dimensional sensory input and to generalize past experience to new situations. However, for physical control tasks with continuous, high-dimensional action spaces, DQN cannot be applied directly to continuous domains because it relies on finding the action that maximizes the value function. Timothy P. Lillicrap et al. proposed the deep deterministic policy gradient (DDPG), which solves the problems that DQN cannot handle a large continuous action space and that Actor-Critic is difficult to converge; the DDPG algorithm is widely used to solve obstacle avoidance, path planning and similar problems, and it can learn policies in high-dimensional continuous action spaces. However, like most model-free reinforcement learning methods, the DDPG algorithm requires a significant amount of training to find a solution, and since sample data acquisition is limited by real-time operation, model-based algorithms are generally superior to model-free learners in terms of sample complexity. Pfeiffer M. et al. proposed a model that can learn a collision-avoidance strategy and safely guide the robot through an obstacle environment to a specified target, but the model is trained from perfect simulation data and its navigation performance is limited.
Disclosure of Invention
Aiming at the technical problems that existing control methods using the DDPG algorithm can fall into local minima during online training and generate a large number of trial-and-error actions and invalid data when the DDPG networks are initially trained, the invention provides a robot control method based on a DDPG algorithm with offline model pre-training learning.
In order to achieve the purpose, the technical scheme of the invention is realized as follows: a robot control method based on DDPG algorithm of off-line model pre-training learning comprises the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set;
step two: establishing and initializing an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network, and initializing respective parameters; utilizing the training data set obtained in the step one to pre-train an evaluation network and an action network in an off-line manner;
step three: initializing a target evaluation network and an action network by using the pre-trained evaluation network in the step two, initializing a storage buffer R and a current first state, and storing state conversion data into the storage buffer R by the agent to serve as an online data set of the training online network;
step four: training an online strategy network and an online Q network by using the online data set obtained in the step three, and updating the online strategy network and the online Q network by using a DDQN structure;
step five: soft updating: and updating parameters in the target evaluation network and the target action network by using the online strategy network and the online Q network, and controlling the state of the 2D dummy by using the states output by the target evaluation network and the target action network.
The training data is obtained, while the 2D dummy walks from the starting point to the end point, by learning behavior in the environment through traveling and observing the rewards or results obtained while traveling, and performing actions in the positive direction according to the feedback of the environment. The state data, actions, corresponding value rewards and next states of the training data are generated randomly within the environment state and action ranges; that is, environment sample data of the 2D dummy are collected from a system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data. The data format is (S_i, A_i, R_i, S_{i+1}), where S_i is the environment state value, A_i is the action the agent performs in response to the incoming environment state value S_i, R_i is the feedback value or value-reward value, and S_{i+1} is the next environment state value. In a random environment state S_i, the agent randomly selects a behavior action A_i and performs it; after performing the action it receives the reward R_i and a new environment state S_{i+1}, and the round of data (S_i, A_i, R_i, S_{i+1}) is stored in a database.
The method for preprocessing the training data comprises: removing null values and abnormal values, normalizing the data format, adding zero-mean Gaussian noise to the actions, and storing the processed data in the training data set.
On the structure of the action network and value evaluation network of the original DDPG, 2 fully-connected artificial neural networks with similar structures are newly constructed: an object state prediction network predictNN and a value reward network ValueNN, with a similar number of artificial neurons in each layer of the network. The newly constructed object state prediction network predictNN is used to predict the state at the next moment; its inputs are the current state and the executed action, its output layer is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function. The newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is executed in the current state; its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
The method for off-line pre-training the evaluation network and the action network by using the training data set obtained in the first step comprises the following steps:
step 1, constructing and initializing the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initializing their respective parameters, and randomly selecting N samples from the training data set to train the object state model network and the value reward network;
step 2, pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
The object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) are trained in step 1 by:
minimizing the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at time i, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, given by the predictNN neural network in the state prediction network module; p is the state value of the agent and θ_p is the parameter that adjusts the object state network;
minimizing the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental return for the current state and behavior, given by the ValueNN neural network in the value reward network module; L2 is the loss value and θ_r is the parameter of the value reward network.
The evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_i, A_i) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r);
the next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p).
The method for initializing the target action network and the storage buffer R is: randomly initialize the values Q corresponding to all states and actions, randomly initialize all parameters θ of the networks, and empty the experience replay set R;
the construction method of the online data set comprises the following steps:
step 31, using a random initialization distribution N1 for action exploration, and initializing S_i as the current first state;
Step 32, the agent selects an action according to the action strategy and sends the action to the environment to execute the action;
step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks.
The method for updating the online policy network and the online Q network by using the DDQN structure in the fourth step comprises the following steps:
step 41, randomly sampling N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network, with (S_i, A_i, R_i, S_{i+1}) representing a single transition in the small batch;
step 42, predicting the next action a_i = μ′(s|θ_{μ′}) through the target action network; state s is mapped to a specific action a to maintain the specified current policy θ_μ, where μ′ denotes the policy learned by the parameterized action network μ(s|θ_μ), a policy function between a state and a specific action; and comparing Q values by using the DDQN structure;
step 43, calculating the policy gradient of the online Q network;
step 44, updating the online policy network: updating θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
In step 42, the Q value of the next step is obtained through the target evaluation network: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_Q) is the Q value for the current action and the state at the next moment, and S_{i+1} is the next state value;
the Q value is compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′})
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′);
where γ ∈ [0,1] is a decay factor that balances the importance of instant and future rewards.
The loss function of the Q network is:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²
where Q is equivalent to an evaluator, the parameter in the Q network is defined as θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q;
θ_μ and the target action network μ(s|θ_μ) are updated with the Adam optimizer: a_i = μ(s_i|θ_μ);
minimizing the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)
to obtain the optimized weights θ_μ and θ_Q.
The soft update updates the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network and the Critic target network, and τ is the update coefficient.
Compared with the prior art, the invention has the following beneficial effects: the invention uses real offline training data to pre-train the action network and the evaluation network and to construct the object prediction model network and the value reward network, which helps the robot learn more efficiently from the known environment and reach the destination quickly. It also alleviates the shortcomings of the DDPG algorithm, namely that its function approximation is not flexible enough, that noise is present, that known Q values are overestimated, and that an optimal strategy cannot be generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of the present invention.
Fig. 2 is a schematic diagram of the process of the dummy's traveling, wherein (a) is the dummy at the starting point, (b) is the dummy walking, and (c) is the dummy at the ending point.
FIG. 3 is a loss curve of the prediction model training of the present invention.
FIG. 4 is a loss curve of the reward function training of the present invention.
Fig. 5 is a training reward curve of a conventional DDPG algorithm.
FIG. 6 is a graph of improved training rewards after pre-training of the model of the invention.
FIG. 7 is a comparison graph of the evaluated rewards of the present invention and a conventional DDPG algorithm in a noise-free environment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, a robot control method based on DDPG algorithm of offline model pre-training learning includes the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set.
The experimental environment is Windows 10 + PaddlePaddle 1.7 + PARL 1.3.1 + CUDA 10.0. The hardware is a Core i8-8300 CPU with a GTX 1060 graphics card, and the simulation platform is BipedalWalker-v2. The DDPG algorithm and the improved DDPG algorithm based on the offline model are each trained for 4000 rounds, and the relationship between the feedback reward value of the robot (the 2D dummy) from the starting point to the end point and the number of training rounds is analyzed.
BipedalWalker-v2 is an open-source simulator whose terrain is generated completely at random; its task is to make the 2D dummy walk from the starting point to the end point. The robot has four controllable joints, connected respectively to the hips of the left and right legs and to the knees of the left and right legs, and it simulates the forward walking of a bipedal animal. The farther forward the robot travels, the more points are scored, and points are deducted if it falls, so the trained model must be very robust to obtain a high average score. As shown in fig. 2, as the robot moves from the starting point to the end point, it learns behavior in the environment by traveling and observing the rewards or results obtained while traveling, and performs actions in the positive direction as much as possible according to the environment's feedback, thereby learning a good policy. The training data are obtained by randomly generating state data, actions, corresponding value rewards and next states within the environment state and action ranges.
Environment sample data of the 2D dummy are collected from the system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data. The data format is (S_t, A_t, R_t, S_{t+1}), where S_t is the environment state value, A_t is the action the agent performs in response to the incoming environment state value S_t, R_t is the feedback value or value-reward value, and S_{t+1} is the next environment state value; after performing the action the return value R_t is obtained and the new state S_{t+1} is updated.
Collecting training data: in a random environment state S_t, the agent randomly selects a behavior action A_t and performs it; after performing the action it receives the reward R_t and a new environment state S_{t+1}. The round of data (S_t, A_t, R_t, S_{t+1}) is then stored in a database. At this stage the data are only collected; the data can also be acquired in other ways.
Data preprocessing: much of the collected data is so-called dirty data because of incompleteness, inconsistency between earlier and later records, and similar characteristics; if such data are used directly for model pre-training without considering their intrinsic characteristics, the final result has a large error and the overall effect suffers. Therefore, before the data are used, null values and abnormal values must be removed and the data format normalized, which reduces interference and improves prediction accuracy. In addition, zero-mean Gaussian noise is added to the actions to improve the robustness of the model, and the processed data are finally stored in the training data set.
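The preprocessing described above could be sketched as follows; the zero-mean/unit-variance scaling of the states and the noise scale are assumptions, since the text only specifies removing null and abnormal values, normalizing the format, and adding zero-mean Gaussian noise to the actions:

```python
import numpy as np


def preprocess(transitions, noise_std=0.05):
    """Clean offline transitions: drop null/abnormal rows, normalize states, add Gaussian noise to actions."""
    S  = np.asarray([t[0] for t in transitions], dtype=np.float64)
    A  = np.asarray([t[1] for t in transitions], dtype=np.float64)
    R  = np.asarray([t[2] for t in transitions], dtype=np.float64)
    S1 = np.asarray([t[3] for t in transitions], dtype=np.float64)

    # remove rows containing NaN or inf (null and abnormal values)
    keep = np.all(np.isfinite(np.hstack([S, A, R[:, None], S1])), axis=1)
    S, A, R, S1 = S[keep], A[keep], R[keep], S1[keep]

    # normalization: zero mean, unit variance for the states (one possible choice)
    mean, std = S.mean(axis=0), S.std(axis=0) + 1e-8
    S, S1 = (S - mean) / std, (S1 - mean) / std

    # zero-mean Gaussian noise on the actions to improve robustness
    A = A + np.random.normal(0.0, noise_std, size=A.shape)
    return S, A, R, S1
```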
Step two: an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network are constructed and initialized, and respective parameters are initialized; and (4) utilizing the training data set obtained in the step one to pre-train the evaluation network and the action network in an off-line manner.
Offline pre-training of the evaluation network and the action network: N sample data are extracted from the preprocessed training data set, the object state model network and the value reward network are trained offline, and these two offline networks are used to simulate the online training process in advance so as to pre-train the action network and the evaluation network in DDPG, which reduces a large amount of early trial-and-error work and improves the efficiency and quality of online learning.
In 2016, DDPG was proposed by DeepMind as a combination of the Actor-Critic framework and the DQN algorithm: an off-policy, model-free deep reinforcement learning algorithm for continuous action spaces. The DDPG network is based on the Actor-Critic method, so it contains both a policy-based neural network and a value-based neural network, namely a policy network that generates actions and an evaluation network that judges them, and it absorbs the strengths of DQN by using a sample experience replay pool and fixed target networks. On top of the DPG algorithm, DDPG simulates the policy function and the Q function with convolutional neural networks and trains them by deep learning instead of linear regression, which demonstrates the accuracy, high performance and convergence of nonlinear simulation functions in the reinforcement learning method.
The DDPG algorithm structure comprises an action network with parameters θ_π and an evaluation network with parameters θ_Q, which compute the deterministic policy a = π(s|θ_π) and the action value function Q(s, a|θ_Q) respectively. Since the learning process of a single network is not stable, the successful experience of DQN's fixed target network is borrowed, and the action network and the evaluation network are each subdivided into a real network and an estimation network. The real network and the estimation network have the same structure, and the estimation network parameters are soft-updated by the real network parameters at a certain frequency. The action estimation network outputs the real-time action the agent executes in the real environment, and the action real network is used to update the evaluation network. Likewise, the value evaluation network is subdivided into a real network and an estimation network for outputting the value reward of each state; their inputs differ: the state real network analyzes the observed values of the actions and states supplied by the action real network, while the state estimation network takes the action applied by the agent as input. The value of evaluating an action is called the Q value: it represents the expectation that the agent selects this action and the total reward obtained up to the final state.
DDPG is a data-driven control method that can learn a generative model from the input/output state data of the 2D dummy and realize an optimal strategy for the 2D dummy to reach the destination according to a given reward. In the real world the collection of sample data is limited by real-time operation, so the invention performs preprocessing on offline data, trains the object's state prediction model and value reward prediction model offline, uses these two models to train the action network and evaluation network in reinforcement learning to complete the offline pre-learning, and then lets the action network and evaluation network learn on the actual object, which greatly reduces the agent's workload and helps it complete tasks more efficiently.
The invention constructs 2 fully-connected artificial neural networks with similar structures, an object state prediction network predictNN and a value reward network ValueNN, on top of the structure of the original action network and value evaluation network of DDPG, with a basically similar number of artificial neurons in each layer of the network. The newly constructed object state prediction network predictNN is used to predict the state at the next moment; its inputs are the current state and the executed action, its last layer (the output layer) is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function. The newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is performed in the current state; its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
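As an illustration only, the two newly constructed networks might look like the sketch below (the patent's experiments use PaddlePaddle/PARL; PyTorch, the hidden sizes and the layer count here are assumptions, while the inputs, ReLU hidden activations and linear output layers follow the description above):

```python
import torch
import torch.nn as nn


class PredictNN(nn.Module):
    """Object state prediction network: (state, action) -> predicted next state, linear output layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),   # linear output: predicted S_{i+1}
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class ValueNN(nn.Module):
    """Value reward network: (state, action) -> predicted feedback reward, linear output layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),           # linear output: scalar reward R_i
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```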
The method for off-line pre-training the evaluation network and the action network by utilizing the training data set obtained in the step one comprises the following steps:
Step 1, construct and initialize the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initialize their respective parameters, and randomly select N samples from the training data set to train the object state model network and the value reward network.
The method for training the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) is: preprocess the offline data, randomly initialize the weights, and verify that the loss function value is minimized to obtain the accuracy of the network.
The newly constructed object state model network and value reward network have different functions and structures, so their training methods differ and different loss functions are used. Minimize the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²    (1)
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at the i-th moment, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, whose model is obtained after network training and given by the predictNN neural network in the state prediction network module. P(s, a|θ_p) represents the state value after performing action a_i in state S_i; a_i is the action performed by the agent, p is the agent's state value, and θ_p is the parameter that adjusts the object state network. The Q network, for example, is the evaluation network that evaluates and scores the action output by the agent at each step, and the θ_Q parameters of its neural network are adjusted according to the feedback reward of the audience, i.e. the environment.
Minimize the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²    (2)
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental reward for the current state and behavior, given by the ValueNN neural network in the value reward network module. L2 is the loss value used to express the degree of difference between the predicted and actual data (the smaller the loss, the better the model's prediction), and θ_r is the parameter of the value reward network.
Step 2, pre-train the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
The evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_t, A_t) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r)    (3)
The next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p)    (4)
The action network and the evaluation network in DDPG are pre-trained by using the two constructed models, the object state model and the value reward model, to simulate the online training process in advance, with the two models simulating the environment to feed back the reward value and the next state value.
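A sketch of this pre-training stage under the definitions above: L1 and L2 are minimized on the offline batch, after which the trained P and r networks stand in for the environment (Eqs. (3) and (4)) when pre-training the actor and critic. PredictNN and ValueNN are the illustrative classes sketched earlier; the optimizer, learning rate and epoch count are assumptions:

```python
import torch
import torch.nn.functional as F


def train_model_networks(predict_nn, value_nn, S, A, R, S1, epochs=200, lr=1e-3):
    """Minimize L1 = MSE(s_{i+1}, P(s_i, a_i)) and L2 = MSE(r_i, r(s_i, a_i)) on the offline data."""
    opt_p = torch.optim.Adam(predict_nn.parameters(), lr=lr)
    opt_r = torch.optim.Adam(value_nn.parameters(), lr=lr)
    S, A, R, S1 = (torch.as_tensor(x, dtype=torch.float32) for x in (S, A, R, S1))
    for _ in range(epochs):
        loss_p = F.mse_loss(predict_nn(S, A), S1)           # L1: object state model loss
        loss_r = F.mse_loss(value_nn(S, A).squeeze(-1), R)  # L2: value reward loss
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
        opt_r.zero_grad(); loss_r.backward(); opt_r.step()
    return predict_nn, value_nn


def simulated_step(predict_nn, value_nn, s, a):
    """Use the trained models in place of the environment while pre-training the DDPG networks offline."""
    with torch.no_grad():
        r_pred = value_nn(s, a).squeeze(-1)   # R_i = r(s_i, a_i | theta_r), Eq. (3)
        s_next = predict_nn(s, a)             # S_{i+1} = P(s_i, a_i | theta_p), Eq. (4)
    return r_pred, s_next
```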
Step three: initialize the target evaluation network and the target action network with the evaluation network pre-trained in step two, initialize the storage buffer R and the current first state, and let the agent store state-transition data in the storage buffer R as the online data set for training the online networks.
Initializing a target network corresponding to the evaluation network and the action network, and initializing the storage buffer R; randomly initializing values Q corresponding to all states and actions, randomly initializing all parameters theta of the network, and emptying the set R of experience playback.
Step 31, a random initialization distribution N1 is used for action exploration, and S_i is initialized as the current first state.
Step 32, the intelligent agent selects an action according to the action strategy and sends the action to the environment to execute the action;
Step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
Step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks.
Step four: train the online policy network and the online Q network with the online data set obtained in step three, and update the online policy network and the online Q network using a DDQN structure.
Any kind of estimation error causes an upward bias, whether the errors come from environmental noise, function approximation, non-stationarity or any other source. Therefore, the invention adds a DDQN structure to the processing of the Q network, separates the action-selection network from the evaluation network, and learns two value functions by randomly assigning each experience to update one of them, thereby obtaining two sets of weights θ and θ′. First, the action corresponding to the maximum Q value is found in the action network; then the selected action is used to calculate the Q value in the evaluation network, and the smaller of the two values is used to calculate the target Q value in the target network. For each update, one set of weights θ is used to determine the greedy policy and the other set θ′ is used to determine its value. The value of the greedy policy is estimated according to the current value defined by the weights θ, and the second set of weights θ′ evaluates the value of that policy fairly. No additional network is introduced; instead the target network is used to evaluate the update of the target value, and the update of the target network remains the same as in DQN. This moves DQN as close as possible to double Q-learning, obtains a Q value that is as accurate as possible, and generates a better policy.
Step 41, randomly sample N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network. In the present invention (S_i, A_i, R_i, S_{i+1}) represents a single transition in the small batch.
DDPG has four networks: the action networks, namely the Actor current network (policy network) and the Actor target network, and the evaluation networks, namely the Critic current network (current Q network) and the Critic target network. The Actor current network is responsible for the iterative update of the policy network parameters θ, for selecting the current action A according to the current state S, and for generating S_{i+1} by interacting with the environment. The Actor target network is responsible for selecting the optimal next action A_{i+1} for the next state S_{i+1} sampled from the experience replay pool. The Critic current network is responsible for the iterative update of the evaluation network parameters θ_Q and for calculating the current Q value. The Critic target network is responsible for calculating Q′ in the target Q value (distinguished from the current Q value and representing the next Q value); its network parameters θ_{Q′} are periodically copied from θ_Q.
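For illustration, the four networks could be set up as below; the current networks are built first and the target networks are initialized as copies of them (PyTorch, the hidden sizes and the 24/4 BipedalWalker state/action dimensions are assumptions):

```python
import copy

import torch
import torch.nn as nn


class Actor(nn.Module):
    """Actor current network: state -> deterministic action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Critic current network: (state, action) -> Q value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


# the four DDPG networks: current Actor/Critic plus their target copies
actor, critic = Actor(24, 4), Critic(24, 4)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```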
Step 42, predict the next action a_i = μ′(s|θ_{μ′}) through the target action network. State s is mapped to a specific action a to maintain the specified current policy θ_μ; the parameterization μ′ corresponds to μ at step i+1. This does not mean that the current μ corresponds to an optimal behavior policy, but that the parameterized action network μ(s|θ_μ) learns a specific policy and builds a policy function between the state and the specific action.
The Q value of the next step is obtained through the target evaluation network: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_{Q′}) is the Q value (evaluation value) obtained for the current action and the state at the next moment, and S_{i+1} is the next state value.
The Q values are compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′})    (11)
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′)    (12)
where γ ∈ [0,1] is a decay factor that balances the importance of instant and future rewards.
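One reading of Eqs. (11) and (12) is a clipped double-Q target: the target actor picks the next action, two Q estimates are computed with the two sets of weights, and the smaller one is discounted and added to the reward. The sketch below follows that reading; which pair of critics supplies the two estimates is an interpretation of the text, not a quotation of it:

```python
import torch


def ddqn_target(reward, next_state, actor_target, critic, critic_target, gamma=0.99):
    """Clipped double-Q target: Q_i = r_i + gamma * min(Q_{i+1}, Q_{i+1}')."""
    with torch.no_grad():
        a_next = actor_target(next_state)        # a_i = mu'(s | theta_mu')
        q1 = critic(next_state, a_next)          # estimate with one set of weights (theta_Q)
        q2 = critic_target(next_state, a_next)   # estimate with the other set (theta_Q')
        q_next = torch.min(q1, q2)               # Eq. (12): take the smaller estimate
        return reward.unsqueeze(-1) + gamma * q_next
```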
Step 43, calculate the policy gradient of the online Q network, where the loss function of the Q network is defined as:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²    (13)
where Q is equivalent to an evaluator, the parameter in the Q network is defined as θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q.
Step 44, update the online policy network: update θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
a_i = μ(s_i|θ_μ)    (14)
Minimize the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)    (15)
to obtain the optimized policies θ_μ and θ_Q.
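Steps 43 and 44 could then be implemented as below, reusing the ddqn_target sketch above: the critic minimizes the squared error of Eq. (13) against the target Q_i, and the actor is updated by minimizing the negative of Q(s_i, μ(s_i|θ_μ)), which matches Eq. (15). The Adam optimizers follow the text; everything else (tensor batch handling, hyperparameters) is an assumption:

```python
import torch
import torch.nn.functional as F


def update_networks(batch, actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, gamma=0.99):
    """One online update of the Critic (Eq. (13)) and the Actor (Eqs. (14)-(15)); a sketch, not the patented code."""
    s, a, r, s_next = batch                                   # tensors sampled from the storage buffer R

    # Critic: minimize (Q_i - Q(s_i, a_i | theta_Q))^2 with the clipped double-Q target
    target = ddqn_target(r, s_next, actor_target, critic, critic_target, gamma)
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s_i, mu(s_i | theta_mu)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()


# e.g. actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
#      critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```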
Step five: soft update: update the parameters in the target evaluation network and the target action network with the online policy network and the online Q network. Through reinforcement learning the invention is equivalent to a trained brain; the state values are the states of the 2D dummy, including position, posture, velocity, acceleration and the angles of the foot joints, and the action values are the velocities of the two joints of the two legs.
Perform the soft update, updating the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network (current Q network) and the Critic target network, and τ is the update coefficient, which ranges from 0.01 to 0.1 to avoid excessively large parameter changes. A soft update is used, i.e. each parameter is updated only a little at a time instead of being directly assigned, and each network corresponds to its own parameters.
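A per-parameter sketch of this soft update (the function name and the use of PyTorch parameter iterators are illustrative):

```python
def soft_update(target_net, online_net, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta', applied to every parameter of the target network."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)


# applied to both pairs of networks after each update step:
# soft_update(actor_target, actor, tau=0.01)
# soft_update(critic_target, critic, tau=0.01)
```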
The loss curve of the prediction model training is shown in fig. 3 and the loss curve of the reward function training in fig. 4. As can be seen from figs. 3 and 4, the training loss curves generally show a downward trend, the fluctuation of loss values between adjacent points is small, the convergence rate is fast and the variation of the data is small, so the prediction model describes the experimental data with good accuracy; the final models reach a converged state and the error of the model predictions is reduced.
Fig. 5 is the training reward curve of the conventional DDPG algorithm and fig. 6 is the training reward curve of the invention, with the abscissa being the number of training rounds and the ordinate the reward per round. The improved DDPG algorithm of the invention obtains more reward per 500 rounds on average than the DDPG algorithm, which shows that the model-based improvement of the DDPG algorithm can effectively improve its performance: pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) with the prediction model network and the feedback reward network better determines how many times each action needs to be repeated in each state, saving the agent most of the repeated executions of multiple actions and improving its decision-making ability.
Fig. 7 shows the average cumulative reward per 100 rounds obtained in a noise-free environment; the higher the cumulative reward, the better the robot performs with respect to the expected target. Both curves in fig. 7 show an increasing trend, and once the number of training rounds exceeds a certain value, the average cumulative reward of the improved DDPG algorithm of the invention has already become stable overall at around round 2600, with a value of about 300, while the original DDPG algorithm only begins to stabilize at around round 3600; clearly the former stabilizes earlier than the latter, and its convergence rate is better.
It is apparent from figure 7 that in the first 1500 rounds the cumulative reward of the improved DDPG algorithm of the invention is lower than that of the DDPG algorithm, while in later training the improved DDPG algorithm is higher overall than the average cumulative reward of the DDPG algorithm. In rounds 0-4000 the average cumulative reward of the invention is 82.3, the maximum is 142 and the minimum is -58; the original DDPG algorithm has an average reward of 75.4, a maximum of 118 and a minimum of -66. In the test environment the former has an average reward of 198.2, a highest reward of 302 and a lowest of -198; the latter has an average reward of 189.6, a highest of 281 and a lowest of -186.4.
The invention constructs networks that learn an object state model and a value reward model from offline sample data, and pre-trains the action network and evaluation network through these models (the two constructed models simulate the environment to train the action network and value network of the original DDPG in advance), which saves the cost of online learning and improves its quality and efficiency. In addition, a DDQN network structure is added: the maximization over actions in the target is decomposed into action evaluation and action selection to reduce the overestimation of Q values, the greedy policy is evaluated according to the online network, and the value of the target network is evaluated so as to reach a strategy as close to optimal as possible, giving a more stable and reliable learning process.
Simulation results on the BipedalWalker-v2 platform show that the maximum cumulative reward obtained by the improved DDPG algorithm reaches a higher level, a stable state is reached more quickly, and the 2D dummy reaches the destination more efficiently and quickly during operation.
According to the method, a large amount of offline data is used to train the object state model and the reward model, and the DDPG networks are then pre-trained offline by a model-based reinforcement learning method, improving the decision-making ability of the networks offline and thereby improving the efficiency and performance of online learning; at the same time, the double-Q-value network structure of the DDQN algorithm is used to prevent the Q value from being overestimated during online training, eliminating the situation of Q overestimation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (10)

1. A robot control method based on DDPG algorithm of off-line model pre-training learning is characterized by comprising the following steps:
the method comprises the following steps: collecting training data of the 2D dummy in an off-line environment, and preprocessing the training data to obtain a training data set;
step two: establishing and initializing an evaluation network, an action network, an object state model network and a value reward network of the artificial neural network, and initializing respective parameters; utilizing the training data set obtained in the step one to pre-train an evaluation network and an action network in an off-line manner;
an off-line pre-training evaluation network and an action network: extracting N sample data from the preprocessed training data set, training an object state model network and a value reward network in an off-line manner, and simulating an on-line training process in advance by utilizing the two off-line object state model networks and the value reward network to pre-train and learn an action network and an evaluation network in the DDPG;
step three: initializing a target evaluation network and an action network by using the pre-trained evaluation network in the second step, initializing a storage buffer R and a current first state, and storing state conversion data into the storage buffer R by the agent to serve as an online data set of the training online network;
step four: training an online strategy network and an online Q network by using the online data set obtained in the step three, and updating the online strategy network and the online Q network by using a DDQN structure;
step five: soft updating: and updating parameters in the target evaluation network and the target action network by using the online policy network and the online Q network, and controlling the state of the 2D dummy by using the target evaluation network and the target action network.
2. The robot control method based on the DDPG algorithm of offline model pre-training learning according to claim 1, characterized in that the training data is obtained, while the 2D dummy walks from the starting point to the end point, by learning behavior in the environment through traveling and observing the rewards or results obtained while traveling, and performing actions in the positive direction according to the feedback of the environment; the state data, actions, corresponding value rewards and next states of the training data are generated randomly within the environment state and action ranges, namely, environment sample data of the 2D dummy are collected from a system history data table, or random actions are generated in an offline environment to obtain the corresponding reward values and feedback reward data; the data format is (S_i, A_i, R_i, S_{i+1}), where S_i is the environment state value, A_i is the action the agent performs in response to the incoming environment state value S_i, R_i is the feedback value or value-reward value, and S_{i+1} is the next environment state value; in a random environment state S_i, the agent randomly selects a behavior action A_i and performs it, receives the reward R_i and a new environment state S_{i+1} after performing the action, and then stores the round of data (S_i, A_i, R_i, S_{i+1}) in a database.
3. The robot control method based on the DDPG algorithm of offline model pre-training learning of claim 2, wherein the method of the pre-processing of the training data is as follows: processing to remove null values and abnormal values, and carrying out normalization conversion on the format of the data; zero-mean gaussian noise is added to the action and the processed data is stored in a training data set.
4. The robot control method based on the off-line model pre-training learning DDPG algorithm according to claim 1 or 2, characterized in that 2 fully connected artificial neural networks with similar structures, an object state prediction network predictNN and a value reward network ValueNN, are newly constructed on the action network and evaluation network structure of the original DDPG, with a similar number of artificial neurons in each network; the newly constructed object state prediction network predictNN is used to predict the state at the next moment, its inputs are the current state and the executed action, its output layer is a linear output that gives the predicted next state, and the neurons of the other layers use ReLU as the activation function; the newly constructed value reward network ValueNN is used to calculate the feedback reward after an action is performed in the current state, its inputs are the current state and the action, and its final layer is a linear output that gives the reward feedback value.
5. The robot control method based on the DDPG algorithm of offline model pre-training learning of claim 4, characterized in that the method for offline pre-training the evaluation network and the action network by using the training data set obtained in step one is:
step 1, constructing and initializing the artificial neural networks: the evaluation network Q(s, a|θ_Q), the action network μ(s|θ_μ), the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r), initializing their respective parameters, and randomly selecting N samples from the training data set to train the object state model network and the value reward network;
step 2, pre-training the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) by using the trained object state model network and value reward network.
6. The robot control method for the DDPG algorithm based on offline model pre-training learning according to claim 5, wherein the object state model network P(s, a|θ_p) and the value reward network r(s, a|θ_r) are trained in step 1 by:
minimizing the loss function of the object state model network:
L1 = (1/N) Σ_{i=1}^{N} (s_{i+1} - P(s_i, a_i|θ_p))²
where L1 is the loss function of the object state model network, N is the number of samples randomly extracted from the training data set, s_{i+1} is the environment state obtained by the agent at time i+1, s_i is the environment state at time i, and P(s_i, a_i|θ_p) is the object state prediction network for the current state and behavior, given by the predictNN neural network in the state prediction network module; p is the state value of the agent and θ_p is the parameter that adjusts the object state network;
minimizing the loss function of the value reward network:
L2 = (1/N) Σ_{i=1}^{N} (r_i - r(s_i, a_i|θ_r))²
where r_i represents the sum of all the reward values earned by actions from the current state up to some future state, and r(s_i, a_i|θ_r) is the environmental return for the current state and behavior, given by the ValueNN neural network in the value reward network module; L2 is the loss value and θ_r is the parameter of the value reward network;
the evaluation network Q(s, a|θ_Q) and the action network μ(s|θ_μ) are pre-trained on the basis of the trained object state model network and value reward network: N samples (S_i, A_i) are selected from the training data set to train the value reward network, and the feedback reward R_i after executing the action in the current state is predicted through the value reward function:
R_i = r(s_i, a_i|θ_r);
the next state S_{i+1} is predicted through the object state model network:
S_{i+1} = P(s_i, a_i|θ_p).
7. The robot control method based on the DDPG algorithm of offline model pre-training learning according to claim 6, characterized in that the target action network is initialized and the storage buffer R is initialized by: randomly initializing the values Q corresponding to all states and actions, randomly initializing all parameters θ of the networks, and emptying the experience replay set R;
the online data set is constructed by:
step 31, using a random initialization distribution N1 for action exploration, and initializing S_i as the current first state;
step 32, the agent selects an action according to the action policy and sends it to the environment for execution;
step 33, after the agent performs the action, the environment returns the reward for executing the action in the current state and the new state S_{i+1};
step 34, the agent stores the state-transition data (S_i, A_i, R_i, S_{i+1}) in the storage buffer R as the online data set for training the online networks, where A_i is the action and R_i is the feedback value.
8. The robot control method based on the offline model pre-training learning DDPG algorithm according to claim 7, wherein the online policy network and the online Q network are updated by using the DDQN structure in step four by:
step 41, randomly sampling N state-transition data from the storage buffer R as a small batch of training data for the online policy network and the online Q network, with (S_i, A_i, R_i, S_{i+1}) representing a single transition in the small batch;
step 42, predicting the next action a_i = μ′(s|θ_{μ′}) through the target action network; state s is mapped to a specific action a to maintain the specified current policy θ_μ, where μ′ denotes the policy learned by the parameterized action network μ(s|θ_μ), a policy function between a state and a specific action; and comparing Q values by using the DDQN structure;
step 43, calculating the policy gradient of the online Q network;
step 44, updating the online policy network: updating θ_μ and the target action network μ(s|θ_μ) with the Adam optimizer.
9. The robot control method based on the offline model pre-training learning DDPG algorithm according to claim 8, characterized in that the Q value of the next step is obtained through the target evaluation network in step 42: Q_{i+1} = Q′(S_{i+1}, a_i|θ_Q);
where Q_{i+1} is the next Q value, Q′(S_{i+1}, a_i|θ_Q) is the Q value for the current action and the state at the next moment, and S_{i+1} is the next state value;
the Q value is compared using the DDQN structure Q_i = r_i + γQ_{i+1}:
Q_{i+1}′ = Q′(S_{i+1}, a_i|θ_{Q′}),
Q_{i+1} = min(Q_{i+1}, Q_{i+1}′);
where γ ∈ [0,1] is a decay factor;
the loss function of the Q network is:
L(θ_Q) = (1/N) Σ_{i=1}^{N} (Q_i - Q(s_i, a_i|θ_Q))²
where the parameter in the Q network is θ_Q, and Q(s_i, a_i|θ_Q) denotes the expected return obtained by selecting action a_i in state s_i using the policy θ_Q;
θ_μ and the target action network μ(s|θ_μ) are updated with the Adam optimizer: a_i = μ(s_i|θ_μ);
minimizing the loss function of the target action network:
J(θ_μ) = -(1/N) Σ_{i=1}^{N} Q(s_i, μ(s_i|θ_μ)|θ_Q)
to obtain the optimized policies θ_μ and θ_Q.
10. The method of robot control based on the DDPG algorithm of offline model pre-training learning according to claim 9, wherein the soft update updates the parameters μ′ and Q′ of the target evaluation network and the target action network:
θ_{μ′} ← τθ_μ + (1-τ)θ_{μ′}
θ_{Q′} ← τθ_Q + (1-τ)θ_{Q′}
where θ_μ, θ_{μ′}, θ_Q and θ_{Q′} respectively represent the parameters of the Actor current network, the Actor target network, the Critic current network and the Critic target network, and τ is the update coefficient.
CN202011429368.1A 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm Active CN112668235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011429368.1A CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011429368.1A CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Publications (2)

Publication Number Publication Date
CN112668235A CN112668235A (en) 2021-04-16
CN112668235B true CN112668235B (en) 2022-12-09

Family

ID=75401628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011429368.1A Active CN112668235B (en) 2020-12-07 2020-12-07 Robot control method based on off-line model pre-training learning DDPG algorithm

Country Status (1)

Country Link
CN (1) CN112668235B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112018B (en) * 2021-04-27 2023-10-31 清华大学深圳国际研究生院 Batch limitation reinforcement learning method
CN113128689A (en) * 2021-04-27 2021-07-16 中国电力科学研究院有限公司 Entity relationship path reasoning method and system for regulating knowledge graph
CN113191487B (en) * 2021-04-28 2023-04-07 重庆邮电大学 Self-adaptive continuous power control method based on distributed PPO algorithm
CN113408782B (en) * 2021-05-11 2023-01-31 山东师范大学 Robot path navigation method and system based on improved DDPG algorithm
CN113240118B (en) * 2021-05-18 2023-05-09 中国科学院自动化研究所 Dominance estimation method, dominance estimation device, electronic device, and storage medium
CN113364712B (en) * 2021-05-19 2022-06-14 电子科技大学 DDPG network-based mixed radiation source signal separation method
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113290557A (en) * 2021-05-21 2021-08-24 南京信息工程大学 Snake-shaped robot control method based on data driving
CN113282705B (en) * 2021-05-24 2022-01-28 暨南大学 Case pre-judgment intelligent body training method and system capable of being automatically updated
CN113340324B (en) * 2021-05-27 2022-04-29 东南大学 Visual inertia self-calibration method based on depth certainty strategy gradient
CN113312874B (en) * 2021-06-04 2022-12-06 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113532457B (en) * 2021-06-07 2024-02-02 山东师范大学 Robot path navigation method, system, equipment and storage medium
CN113341972A (en) * 2021-06-07 2021-09-03 沈阳理工大学 Robot path optimization planning method based on deep reinforcement learning
CN113219997B (en) * 2021-06-08 2022-08-30 河北师范大学 TPR-DDPG-based mobile robot path planning method
CN113392396B (en) * 2021-06-11 2022-10-14 浙江工业大学 Strategy protection defense method for deep reinforcement learning
CN113554166A (en) * 2021-06-16 2021-10-26 中国人民解放军国防科技大学 Deep Q network reinforcement learning method and equipment for accelerating cognitive behavior model
CN113868113A (en) * 2021-06-22 2021-12-31 中国矿业大学 Class integration test sequence generation method based on Actor-Critic algorithm
CN113361132B (en) * 2021-06-28 2022-03-15 浩鲸云计算科技股份有限公司 Air-cooled data center energy-saving method based on deep Q learning block network
CN113204061B (en) * 2021-07-06 2021-10-08 中国气象局公共气象服务中心(国家预警信息发布中心) Method and device for constructing lattice point wind speed correction model
CN113568954B (en) * 2021-08-02 2024-03-19 湖北工业大学 Parameter optimization method and system for preprocessing stage of network flow prediction data
CN113821045B (en) * 2021-08-12 2023-07-07 浙江大学 Reinforced learning action generating system of leg-foot robot
CN113744719A (en) * 2021-09-03 2021-12-03 清华大学 Voice extraction method, device and equipment
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113741525B (en) * 2021-09-10 2024-02-06 南京航空航天大学 Policy set-based MADDPG multi-unmanned aerial vehicle cooperative attack and defense countermeasure method
CN113904948B (en) * 2021-11-12 2023-11-03 福州大学 5G network bandwidth prediction system and method based on cross-layer multidimensional parameters
CN114692890A (en) * 2021-12-24 2022-07-01 中国人民解放军军事科学院战争研究院 Model-based weight combination planning value extension method
CN114943278B (en) * 2022-04-27 2023-09-12 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN114697394B (en) * 2022-05-27 2022-08-16 合肥工业大学 Edge cache decision model, method and system based on discrete MADDPG
CN114844822A (en) * 2022-06-02 2022-08-02 广东电网有限责任公司 Networking method, device and equipment of power line carrier network and storage medium
CN114771783B (en) * 2022-06-02 2023-08-22 浙江大学 Control method and system for submarine stratum space robot
CN114708568B (en) * 2022-06-07 2022-10-04 东北大学 Pure vision automatic driving control system, method and medium based on improved RTFNet
CN115319741B (en) * 2022-08-05 2023-10-10 美的集团(上海)有限公司 Robot control model training method and robot control method
CN115128960B (en) * 2022-08-30 2022-12-16 齐鲁工业大学 Method and system for controlling motion of biped robot based on deep reinforcement learning
CN115475036A (en) * 2022-08-31 2022-12-16 上海电机学院 Adaptive control method, equipment and storage medium for intelligent artificial limb shoulder joint
CN115145592A (en) * 2022-09-01 2022-10-04 新华三技术有限公司 Offline model deployment method and device, network equipment and analyzer
CN115758705B (en) * 2022-11-10 2023-05-05 北京航天驭星科技有限公司 Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115837677B (en) * 2023-02-24 2023-04-28 深圳育智科创科技有限公司 Robot intelligent control method
CN116430860A (en) * 2023-03-28 2023-07-14 兰州大学 Off-line reinforcement learning-based automatic driving training and control method for locomotive
CN117237720B (en) * 2023-09-18 2024-04-12 大连理工大学 Label noise correction image classification method based on reinforcement learning
CN117313826B (en) * 2023-11-30 2024-02-23 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning
CN117850244B (en) * 2024-03-04 2024-05-07 海克斯康制造智能技术(青岛)有限公司 Visual measurement control system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11002202B2 (en) * 2018-08-21 2021-05-11 Cummins Inc. Deep reinforcement learning for air handling control

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019040901A1 (en) * 2017-08-25 2019-02-28 Google Llc Batched reinforcement learning
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN110414725A (en) * 2019-07-11 2019-11-05 山东大学 The integrated wind power plant energy-storage system dispatching method of forecast and decision and device
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning
CN111652371A (en) * 2020-05-29 2020-09-11 京东城市(北京)数字科技有限公司 Offline reinforcement learning network training method, device, system and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning with Double Q-learning;Hado van Hasselt et al.;《Association for the Advancement of Artificial Intelligence》;20161231;全文 *
The Path Planning of Mobile Robot by Neural Networks and Hierarchical Reinforcement Learning;Jinglun Yu et al.;《Frontiers in Neurorobotics》;20201002;第14卷;全文 *
UAV landing trajectory tracking control based on deep reinforcement learning; Song Xinyu et al.; Aeronautical Science & Technology; 2020-01-25; Vol. 31, No. 1; full text *
Application of reinforcement learning in obstacle avoidance for mobile robots; Tang Peng et al.; Scientist; 2016-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN112668235A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
Whiteson et al. Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
CN113947022B (en) Near-end strategy optimization method based on model
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN113743603A (en) Control method, control device, storage medium and electronic equipment
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN115765050A (en) Power system safety correction control method, system, equipment and storage medium
CN113807005A (en) Bearing residual life prediction method based on improved FPA-DBN
Morales Deep Reinforcement Learning
Raza et al. Policy reuse in reinforcement learning for modular agents
CN114114911B (en) Automatic super-parameter adjusting method based on model reinforcement learning
Gao Soft computing methods for control and instrumentation
CN112008734B (en) Robot control method and device based on component interaction degree
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario
Li et al. Proximal policy optimization with model-based methods
Liu et al. Improving learning from demonstrations by learning from experience
Norouzzadeh et al. Efficient Knowledge Transfer in Shaping Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant