CN113510704A - Industrial mechanical arm motion planning method based on reinforcement learning algorithm - Google Patents
Info
- Publication number
- CN113510704A CN113510704A CN202110709508.9A CN202110709508A CN113510704A CN 113510704 A CN113510704 A CN 113510704A CN 202110709508 A CN202110709508 A CN 202110709508A CN 113510704 A CN113510704 A CN 113510704A
- Authority
- CN
- China
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by motion, path, trajectory planning
- B25J18/00—Arms
Abstract
The invention discloses an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, belonging to the field of applying reinforcement learning to mechanical arm motion planning. The invention applies the reinforcement learning Actor-Critic algorithm to the motion planning of the mechanical arm, so that an interactive relationship is established between the mechanical arm and its environment; through real-time interaction with the environment during training, the adaptive capacity of the mechanical arm to the environment is improved, thereby realizing autonomous learning control. First, a simulation environment of the mechanical arm hand-eye system is built; then a reinforcement learning algorithm model is established according to the simulation environment; finally, motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.
Description
Technical Field
The invention belongs to the field of application of reinforcement learning to mechanical arm motion planning, and particularly relates to an industrial mechanical arm motion planning method based on a reinforcement learning algorithm.
Background
Robotic arm motion planning for accomplishing a particular complex task in an uncertain environment has long been a very challenging problem. Traditional control methods usually depend on a system model; however, such models are high-order, nonlinear, multivariable, and strongly coupled, making it difficult for the mechanical arm system to achieve good adaptability and a degree of autonomy. In recent years, artificial intelligence technology has developed vigorously and offers a new approach to autonomous learning control of the mechanical arm. Its core idea is to introduce an online learning mechanism into the planning and control of the mechanical arm, so that an interactive relationship is established between the mechanical arm and the environment; through real-time interaction with the environment during training, the adaptive capacity of the mechanical arm to the environment is improved, thereby realizing autonomous learning control. At present, the genetic algorithm has seen preliminary development in the field of path planning. Applied to robotics, it has good global search capability, but its local search capability is poor, so a plain genetic algorithm becomes inefficient in the later stages of search. The genetic algorithm also suffers from premature convergence and has shortcomings in practical engineering applications, so other machine learning algorithms must be sought. Reinforcement learning treats the sequential decision problem on the basis of the Markov decision process; its advantage is that it takes a long-term view of long-term return, alleviating the problem of the value function prematurely converging to a local optimum.
Through reinforcement learning, the problem is converted into a time-based sequential decision problem, which facilitates the design of automated and optimized grasping strategies for the mechanical arm; the reinforcement learning state, action, and reward are defined for the mechanical arm model. Value-based reinforcement learning rests on value iteration and helps the value function converge to the optimum, but it is ill-suited to continuous motion; policy-based methods, built on parameter optimization, are better suited to high-dimensional, continuous motion control and have better convergence properties, but evaluating a single policy easily falls into a local optimum. The Actor-Critic algorithm combines the value-based and policy-based methods with the aim of integrating the advantages of both: it reduces the variance of the loss function and mitigates the local-optimum problem, so it can be applied more effectively to the motion planning control of the mechanical arm.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which is reasonable in design, overcomes the defects of the prior art, and has a good effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
an industrial mechanical arm motion planning method based on a reinforcement learning algorithm comprises the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for an optimal strategy;
step 4: establishing a Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing intelligent control of the mechanical arm.
Preferably, in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), and a loop sequence as shown below is formed in the order of state s, action a, and feedback r:
S1,A1,R1,S2,A2,R2…St,At,Rt… (1);
the set of states S consists of all environmental states si (i = 1, 2, …, t);
the set of actions A consists of all actions ai (i = 1, 2, …, t) executed at the corresponding times;
the set of feedbacks R consists of all environmental feedbacks ri (i = 1, 2, …, t), with S × A → R;
the set P of all state transition probability distributions, S × A → S;
p denotes the transition probability p(s' | s, a) that selecting action a in state s transitions the environmental state to state s';
the method specifically comprises the following steps:
step 1.1: initialize, set the target state and the time step length, and observe to obtain the current environment state si;
step 1.2: input the current environment state si into the strategy π* to obtain action information a;
step 1.3: execute the action information a, and observe the environment state at the next moment;
step 1.4: judge whether the environment state in step 1.3 has reached the target state; if so, end; otherwise, return to step 1.2.
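The step 1.1-1.4 interaction loop can be sketched in miniature as follows. The one-dimensional environment and the strategy function `pi_star` here are illustrative stand-ins, not the patent's hand-eye simulation:

```python
# Illustrative stand-in for the arm's environment: the "state" is a 1-D
# end-effector coordinate, and the assumed strategy simply steps toward
# the target. All names here are hypothetical.
def pi_star(state, target):
    # Step 1.2: map the observed state to an action.
    return 1.0 if state < target else -1.0

def run_episode(start=0.0, target=5.0, step_len=1.0, max_steps=100):
    s = start                          # Step 1.1: observe initial state s_i
    for t in range(max_steps):
        if abs(s - target) < 1e-9:     # Step 1.4: target reached, end
            return s, t
        a = pi_star(s, target)         # Step 1.2: obtain action information a
        s = s + a * step_len           # Step 1.3: execute a, observe next state
    return s, max_steps

final_state, steps_used = run_episode()
```

With the assumed unit step length, the loop terminates after five actions.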
Preferably, in step 2, the method specifically comprises the following steps:
step 2.1: input the current environment information si into the action strategy function neural network, output the probability distribution of action information a, select the action with the larger probability and execute it, and obtain the environment feedback r and the environment state si+1 at the next moment;
step 2.2: input the current environment information si and the next-moment environment information si+1 respectively into the value function neural network to obtain the value vi of the current state and the value vi+1 of the state at the next moment;
step 2.3: calculate the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculate the loss function of the action strategy function neural network from the temporal-difference error, and back-propagate to update the action strategy function neural network;
step 2.5: calculate the loss function of the value function neural network, and back-propagate to update the value function neural network;
step 2.6: and finishing the training process of the reinforcement learning algorithm.
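A single pass of steps 2.1-2.5 can be illustrated with scalar stand-ins for the two networks' outputs; the numeric values and the log-probability are assumed for illustration only:

```python
# One pass of steps 2.1-2.5 in miniature, with scalar stand-ins for the
# network outputs (illustrative values, not from the patent).
gamma = 0.9               # reward discount coefficient
r = 1.0                   # environment feedback for the chosen action (step 2.1)
v_i, v_next = 0.5, 0.8    # critic outputs for s_i and s_{i+1} (step 2.2)

# Step 2.3: temporal-difference error delta = r + gamma * v(s') - v(s)
delta = r + gamma * v_next - v_i

# Step 2.4: the actor loss is -delta * log pi(a|s); log-prob -0.1 assumed
log_prob = -0.1
actor_loss = -delta * log_prob

# Step 2.5: the critic loss is the squared TD error
critic_loss = delta ** 2
```

In a real implementation both losses would be back-propagated through their respective networks; here they are left as scalars to show the arithmetic.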
Preferably, the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on a Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
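Gaussian action selection can be sketched as follows, assuming the Actor outputs a mean and standard deviation and the sampled action is clipped to an actuator range (the bounds and values are illustrative):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def gaussian_action(mean, std, low=-1.0, high=1.0):
    # Sample an action from N(mean, std^2) and clip it to the
    # assumed actuator range [low, high].
    a = random.gauss(mean, std)
    return max(low, min(high, a))

a = gaussian_action(mean=0.2, std=0.1)
```

The clipping keeps exploration noise from commanding motions outside the joint limits.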
Preferably, the value function of the Critic part has the Markov property, and a temporal-difference method is adopted to ensure that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
Preferably, the state value function is as shown in equation (2):
V(St)←V(St)+α[Rt+1+γV(St+1)-V(St)] (2);
wherein V(St) and V(St+1) are state value functions representing the expectation of the sum of rewards from the agent's state at time t and time t+1, respectively, to the final state; α is the learning rate of reinforcement learning, representing the efficiency with which the mechanical arm learns from environmental feedback; St is the environment state at time node t; after state St interacts with the environment, the environment transitions to state St+1 at the next time node and an instant reward Rt+1 is received; the update direction of the state value function at each step is Rt+1 + γV(St+1) - V(St).
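Equation (2) can be written as a minimal tabular TD(0) update; the dictionary-based value table and the example states are illustrative:

```python
def td0_update(V, s, s_next, reward, alpha=0.1, gamma=0.9):
    # Equation (2): V(S_t) <- V(S_t) + alpha*[R_{t+1} + gamma*V(S_{t+1}) - V(S_t)]
    V[s] = V[s] + alpha * (reward + gamma * V[s_next] - V[s])
    return V

V = {"s0": 0.0, "s1": 1.0}          # assumed two-state value table
V = td0_update(V, "s0", "s1", reward=1.0)
```

After one update, V("s0") moves toward the TD target 1.0 + 0.9·1.0 = 1.9 by a fraction α.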
Preferably, the definition of the action value function is similar to that of the state value function; introducing action information into the state value function yields the action value function Q(St, At), wherein At is the action set at time node t. The specific update form of the action value function is shown in formula (3):
Q(St,At)←Q(St,At)+α[Rt+1+γQ(St+1,At+1)-Q(St,At)] (3);
wherein Q(St, At) represents the expected value of the sum of rewards to the final state after the mechanical arm selects its action, indicating the long-term influence of taking action a in the current state s on the action strategy function π(s); α is the learning rate of reinforcement learning and γ is the reward discount coefficient; when state St+1 is a terminal state, the corresponding action value function Q(St+1, At+1) is zero; each action value update requires the five elements (St, At, Rt+1, St+1, At+1). Taking maxa Q(St+1, a) as the action value of the next time node at update time, i.e., taking the best past action value maxa Q(St+1, a) as the learning target, accelerates the exploration process and yields the following formula:
Q(St,At)←Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)-Q(St,At)] (4)。
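Formula (4) corresponds to the tabular Q-learning update; this minimal sketch uses illustrative states, actions, and initial values:

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    # Equation (4): bootstrap from max_a Q(S_{t+1}, a) rather than the
    # action actually taken at the next step.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q

actions = ["left", "right"]                      # assumed action set
Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in actions}
Q[("s1", "right")] = 2.0                         # assumed prior estimate
Q = q_learning_update(Q, "s0", "right", reward=1.0, s_next="s1", actions=actions)
```

The update pulls Q(s0, right) toward 1.0 + 0.9·max(0, 2.0) = 2.8 by the step α.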
preferably, the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s, and the action a, and represents the probability of taking each action; an optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ. A strategy objective function is established, and the gradient of the action strategy parameters is obtained from the strategy objective function, giving the optimization direction of the action strategy parameters; the goal of action strategy parameter optimization is to increase the initial state value V(st);
Defining the strategy objective function as:
J(θ) = E_πθ[r] = Σs d(s) Σa π(s, a, θ) R(s, a)   (5);
wherein J(θ) denotes the defined strategy objective function, used to measure the quality of a strategy, and θ is the parameter of the Actor-part strategy function being updated; st denotes the state of the agent at time t and rt denotes the environmental feedback at time t; d(s) denotes the stationary distribution of the Markov decision process over states under the current strategy, and the state s and the chosen action a are obtained by sampling from d(s); E_πθ[r] denotes the instant reward, i.e., the environmental feedback over all actions taken in each state within a time step, and is used to compute the expectation of the environmental feedback;
differentiating the strategy objective function by using the differentiability of the strategy, the gradient function of the objective function is obtained as shown in formula (6):
∇θ J(θ) = E_πθ[∇θ log π(s, a, θ) · r]   (6);
wherein ∇θ log π(s, a, θ) is the score function, which smooths the action strategy parameter θ for use in action decisions. Formula (6) evaluates the action strategy from the environmental feedback r, but an objective-function gradient based on the value function Qw(s, a) is better suited to evaluating the quality of action decisions; therefore the value function Qw(s, a) is used in place of the environmental feedback r, and the gradient function of the objective function is obtained as shown in formula (7):
∇θ J(θ) = E_πθ[∇θ log π(s, a, θ) · Qw(s, a)]   (7);
The strategy gradient algorithm finds the maximum of the strategy objective function J(θ) through gradient ascent, and the strategy gradient is defined as:
Δθ = α ∇θ J(θ)   (8);
wherein α is the training step length, Δθ denotes the ascent direction of the action strategy parameter θ obtained under that step length, and ∇θ J(θ) is the gradient function obtained by differentiating the objective function J(θ).
The ascent direction of the action strategy parameter θ is obtained through formula (8), and the parameter θ is updated in the increasing direction:
θt+1 = θt + Δθ   (9).
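Equations (8)-(9) can be sketched for a one-dimensional Gaussian policy whose mean is θ·s; this parameterization is an assumption for illustration, under which the score function is (a - θs)·s/σ²:

```python
# Gradient-ascent step of equations (8)-(9) for a 1-D Gaussian policy
# with assumed mean theta * s and standard deviation sigma.
def score(theta, s, a, sigma=1.0):
    # d/d_theta log pi(a|s) = (a - theta*s) * s / sigma^2
    return (a - theta * s) * s / sigma**2

def policy_gradient_step(theta, s, a, reward, alpha=0.05):
    delta_theta = alpha * score(theta, s, a) * reward   # equation (8)
    return theta + delta_theta                          # equation (9)

theta = 0.0
theta = policy_gradient_step(theta, s=1.0, a=0.5, reward=2.0)
```

A positive reward for an action above the current mean pushes θ upward, as expected from gradient ascent.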
preferably, the value function of the Critic part is parameterized to approximate the action value function under the current strategy, as shown in formula (10):
Qw(s, a) ≈ Q_πθ(s, a)   (10);
wherein w is the update parameter of the action value function Qw(s, a) in the Critic part, and Qw(s, a) denotes the parameterized value function of the Critic part; θ is the update parameter of the objective strategy function J(θ) in the Actor part, and Q_πθ(s, a) denotes the value function under the current strategy parameter θ. The Actor part uses the value function Qw(s, a) of the Critic part to update the parameter θ of the action strategy function, and thereby updates the action strategy function π(s, a, θ) and the value function Q_πθ(s, a). The strategy gradient of the Critic part is obtained as shown in formula (11):
∇θ J(θ) ≈ E_πθ[∇θ log π(s, a, θ) · Qw(s, a)]   (11);
this gives the parameter update direction of the gradient part of the reinforcement learning Actor-Critic strategy. The algorithm still needs to determine the update direction of the parameters of the Critic-part value function; the value function of the Critic part is updated with a temporal-difference method similar to formula (3), giving the temporal-difference error of the Critic-part value function:
δ=r+γQw(st+1,at+1)-Qw(st,at) (12);
wherein Qw(st, at) and Qw(st+1, at+1) are the value functions corresponding to time t and time t+1, respectively.
Finally, the strategy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α ∇θ log π(s, a, θ) · Qw(s, a),  w ← w + β δ ∇w Qw(s, a)   (13);
wherein α and β are the parameter update step lengths of the action strategy function (Actor) and the value function (Critic), respectively.
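A joint Actor-Critic step in the spirit of equations (12)-(13) can be sketched with linear stand-ins, assuming Qw(s, a) = w·s·a and a unit-variance Gaussian actor with mean θ·s (both parameterizations are assumptions for illustration):

```python
# One joint Actor-Critic update with assumed linear approximators:
#   Q_w(s, a) = w * s * a   and   grad_theta log pi = (a - theta*s) * s.
def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      alpha=0.01, beta=0.1, gamma=0.9):
    phi = s * a                                  # assumed feature for Q_w(s, a)
    phi_next = s_next * a_next
    delta = r + gamma * w * phi_next - w * phi   # TD error, equation (12)
    score = (a - theta * s) * s                  # grad of log pi, sigma = 1
    theta = theta + alpha * score * (w * phi)    # actor step of equation (13)
    w = w + beta * delta * phi                   # critic step of equation (13)
    return theta, w

theta, w = actor_critic_step(theta=0.0, w=0.5, s=1.0, a=1.0,
                             r=1.0, s_next=1.0, a_next=0.5)
```

The actor moves along the score weighted by the critic's estimate, while the critic moves along the TD error, mirroring the two coupled updates of formula (13).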
The invention has the following beneficial technical effects:
the mechanical arm control model is high-order, nonlinear, multivariable, and strongly coupled, which makes it difficult for the mechanical arm system to achieve good adaptability and a degree of autonomy. To free mechanical arm control from its dependence on a system model, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which removes the limitation of the system model and facilitates the design of automation and optimization strategies. The invention applies the reinforcement learning Actor-Critic algorithm to the motion planning of the mechanical arm, so that an interactive relationship is established between the mechanical arm and its environment; through real-time interaction with the environment during training, the adaptive capacity of the mechanical arm to the environment is improved, thereby realizing autonomous learning control. First, a simulation environment of the mechanical arm hand-eye system is built; then a reinforcement learning algorithm model is established according to the simulation environment; finally, motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.
Drawings
FIG. 1 is a view of the 6-dof robot arm.
Fig. 2 is a reinforcement learning framework diagram.
FIG. 3 is a flow chart of the Actor-Critic algorithm.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
mechanical arms are machine equipment commonly used in industrial production. The six-degree-of-freedom full-rotation joint mechanical arm is a common mechanical arm structure in an actual production environment, and the mechanical arm can basically meet the requirements of general industrial production. Fig. 1 shows a typical industrial mechanical watch structure. Wherein 6 joints are all rotary joints and the axes of the rear three shafts intersect at one point. The structure has stronger kinematics solvability. Moreover, the mechanical arm structure can basically meet the positioning and grabbing tasks of three-dimensional space in industrial production. Therefore, the mechanical arm with the classical structure is widely applied to industrial production.
As shown in fig. 2, reinforcement learning is based on a Markov decision process; likewise, the Actor-Critic algorithm is modeled on a Markov decision process, and the model is established on the assumption that the environment has the Markov property:
S1,A1,R1,S2,A2,R2…St,At,Rt… (1);
(1) the set of states S consists of all environmental states si (i = 1, 2, …, t);
(2) the set of actions A consists of all actions ai (i = 1, 2, …, t) executed at the corresponding times;
(3) the set of feedbacks R consists of all environmental feedbacks ri (i = 1, 2, …, t), with S × A → R;
(4) the set P of all state transition probability distributions, S × A → S, where p denotes the transition probability p(s' | s, a) of selecting action a in state s and transitioning the environmental state to state s'.
This can be expressed as follows: assuming the environment si (i = 1, 2, …, t) is fully observable, at any time t the agent is in state st; the agent takes action at and transitions to the next state st+1 according to p(s' | s, a), at the same time obtaining an environmental reward r. The goal of reinforcement learning is to learn, by exploring different states of the environment, a strategy at = π(st) that maximizes the cumulative feedback. When reinforcement learning is applied to the mechanical arm system, the environment state s is built from the state observation information collected by the sensors of the mechanical arm hand-eye system, including the end position of the mechanical arm, the target point position, the obstacles, the distance from the end of the mechanical arm to the target point, and so on.
The action space of mechanical arm motion planning comprises three dimensions in three-dimensional Cartesian space; action control is closely related to changes in the environment state, and the reinforcement learning action a completes the motion of the mechanical arm by controlling the position of the arm's end. The state space under the mechanical arm motion planning task is generally a continuous space. The Markov process state space S and action space A are correlated, and the state observation information includes the state transition information generated by strategy interaction over the time series. A reinforcement learning strategy obtained in a simulation environment is generally difficult to apply in the real environment, one main reason being that the simulation environment differs from the real environment: physical effects such as acceleration, gravity, rigid-body density, and object surface friction carry certain errors between virtual and real environments. To ensure that the strategy obtained by reinforcement learning training in simulation retains some applicability in the real environment, the method avoids, as far as possible, factors with large virtual-real differences when designing the state information. Spatial coordinates have the same properties in the simulation environment as in the real environment; when the error between the simulated and the real mechanical arm model is small, a state space using position coordinates as its description information is more likely to be applicable to both simulation and reality.
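A position-coordinate state vector of the kind described above can be sketched as follows; the exact observation layout (end position, goal position, and their distance) is an assumption for illustration:

```python
import math

# Assumed observation layout for the hand-eye system: end-effector
# position, target position, and their Euclidean distance.
def build_state(end_xyz, goal_xyz):
    d = math.dist(end_xyz, goal_xyz)
    return list(end_xyz) + list(goal_xyz) + [d]

s = build_state((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

Because only Cartesian coordinates appear, the same vector can be produced from either the simulated or the real sensor data.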
Another core problem of the strategy control of the mechanical arm during exploration is how to obtain a reward value from the current observed state during training. When the end of the mechanical arm is at the target point, the reward value r is 1; in all other cases, when the mechanical arm is not at the target point, the reward function is designed as a continuous reward value. Let the initial end position of the mechanical arm be [x0, y0, z0], the current end coordinate be [xT, yT, zT], and the target point coordinate be [xg, yg, zg]; the reward value r is kept within the range [-1, 1].
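Such a continuous reward can be sketched as below, pinned to 1 at the target and bounded in [-1, 1] elsewhere; the specific shaping (normalizing the current distance by the initial distance) is an assumption, since the text fixes only the range:

```python
import math

def reward(end_xyz, goal_xyz, start_xyz):
    # Return 1 at the target; otherwise a continuous negative value
    # proportional to the remaining fraction of the initial distance,
    # clipped so that r stays in [-1, 1]. The shaping is illustrative.
    d = math.dist(end_xyz, goal_xyz)
    if d < 1e-6:
        return 1.0
    d0 = max(math.dist(start_xyz, goal_xyz), 1e-6)
    return max(-1.0, -d / d0)

r_goal = reward([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
r_half = reward([0.5, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

The negative gradient toward the goal gives the agent a dense learning signal even before it ever reaches the target point.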
The motion control process applied to the mechanical arm based on the reinforcement learning Actor-Critic algorithm can be divided into the following steps:
1. Initialize, set the target state and the time step length, and observe to obtain the current environment state si.
2. Input the state si into the strategy π* to obtain the action information a.
3. Execute the action a, and observe the environment state at the next time step.
4. Judge whether the target state has been reached; if so, end; if not, return to step 2.
Mechanical arm motion planning based on reinforcement learning is similar to closed-loop control: the control strategy outputs the action information a; after the agent executes action a, the environment state transitions; the control strategy then decides a new action based on the new environment state, and the cycle continues until the goal is achieved.
The update scheme of the Actor-Critic algorithm combines value-based updating and policy-based updating. Value-based reinforcement learning is somewhat inadequate for continuous action spaces, while policy-based reinforcement learning suffers from slow convergence; the Actor-Critic framework integrates the value method and the policy method. The algorithm can generally be divided into a decision part and a judgment part: the policy part makes decisions from the state, similarly to a policy gradient method, while the other part can be understood as improving the value part of the policy gradient with a value-based method. The Critic part of the algorithm estimates the action value function:
Qw(s, a) ≈ Q_π(s, a);
where the value function Qw(s, a) is parameterized by the parameter w. Combining this with the strategy gradient under the Actor-Critic framework:
∇θ J(θ) ≈ E_πθ[∇θ log π(s, a, θ) · Qw(s, a)];
gives the parameter update direction of the Actor-Critic strategy gradient part. The algorithm still needs to determine the update direction of the parameters of the Critic part. The Critic part is updated with a TD method, giving the TD error of the Critic part:
δ=r+γQw(s',a')-Qw(s,a) (5);
and the parameter update form of the strategy gradient method under the Actor-Critic framework is:
θ ← θ + α ∇θ log π(s, a, θ) · Qw(s, a),  w ← w + β δ ∇w Qw(s, a);
wherein α and β are the parameter update step lengths of the Actor and Critic parts, respectively.
The operation process of the Actor-Critic algorithm is shown in fig. 3: the Actor network receives the input state and selects and outputs an action variable, while the Critic evaluates the quality of the selected action in order to compute the learning signal. The specific flow is as follows:
1. Input the current environment information si into the action strategy function neural network, output the probability distribution of action information a, select the action with the larger probability and execute it, and obtain the environment feedback r and the next-moment environment state si+1.
2. Input the current environment information si and the next-moment environment information si+1 respectively into the value function neural network to obtain the value vi of the current state and the value vi+1 of the state at the next moment.
3. Calculate the temporal-difference error from the environment feedback r and the state values v.
4. Calculate the loss function of the action strategy function neural network from the temporal-difference error, and back-propagate to update the action strategy function neural network.
5. Calculate the loss function of the value function neural network, and back-propagate to update the value function neural network.
6. Repeat the above process to complete the Actor-Critic algorithm training.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (9)
1. An industrial mechanical arm motion planning method based on a reinforcement learning algorithm, characterized by comprising the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for an optimal strategy;
step 4: establishing a Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing intelligent control of the mechanical arm.
2. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), forming a cyclic sequence in the order of state s, action a, and feedback r:
S1, A1, R1, S2, A2, R2, …, St, At, Rt, … (1);
the state set S consists of all environment states si (i = 1, 2, …, t) up to time t;
the action set A consists of all actions ai (i = 1, 2, …, t) executed up to time t;
the feedback set R consists of all environment feedback ri (i = 1, 2, …, t) up to time t, with S × A → R;
P is the set of all state transition probability distributions, with S × A → S;
p denotes the transition probability p(s′|s, a) that selecting action a in state s transitions the environment to state s′;
the method specifically comprises the following steps:
step 1.1: initialize, set the target state and the time step, and observe the current environment state si;
step 1.2: input the current environment state si into the strategy π* to obtain action information a;
step 1.3: execute the action information a, and observe the environment state at the next moment;
step 1.4: judge whether the environment state in step 1.3 has reached the target state; if so, end; otherwise, return to step 1.2.
3. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein step 2 specifically comprises the following steps:
step 2.1: input the current environment information si into the action strategy function neural network, output the probability distribution of the action information a, select and execute the action with the higher probability, and obtain the environment feedback r and the next-moment environment state si+1;
step 2.2: input the current environment information si and the next-moment environment information si+1 into the value function neural network to obtain the value vi of the current state and the value vi+1 of the next state;
step 2.3: calculate the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculate the loss function of the action strategy function neural network from the temporal-difference error, and update the action strategy function neural network by back-propagation;
step 2.5: calculate the loss function of the value function network, and update the value function neural network by back-propagation;
step 2.6: complete the training process of the reinforcement learning algorithm.
4. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
5. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the value function of the Critic part has the Markov property, and a temporal-difference method is adopted so that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
6. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the state value function is shown in equation (2):
V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)] (2);
wherein V(St) and V(St+1) are the state value functions, representing the expectation of the sum of rewards from the agent's state at time t and at time t+1, respectively, to the final state; α is the learning rate of reinforcement learning, representing the learning efficiency of the mechanical arm from environmental feedback; γ is the discount coefficient of reward; St is the environment state at time node t; after the state St interacts with the environment, it transitions to the state St+1 at the next time node and receives the instant reward Rt+1; the direction of each state value function update is Rt+1 + γV(St+1) − V(St).
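Equation (2) is a single assignment per transition. A tabular sketch (the dictionary-backed value table and all names are assumptions for illustration):

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(St) <- V(St) + alpha * [Rt+1 + gamma * V(St+1) - V(St)]."""
    v_s = V.get(s, 0.0)        # V(St), defaulting unseen states to zero
    v_next = V.get(s_next, 0.0)  # V(St+1)
    # the update direction is Rt+1 + gamma*V(St+1) - V(St), scaled by the learning rate
    V[s] = v_s + alpha * (r_next + gamma * v_next - v_s)
```

For example, starting from an empty table, one transition with reward 1.0 moves V("s0") from 0 to 0.1 at α = 0.1.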
7. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the action value function is defined similarly to the state value function; introducing action information into the state value function gives the action value function Q(St, At), where At is the action of time node t; the specific update form of the action value function is shown in equation (3):
Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)] (3);
wherein Q(St, At) represents the expectation of the sum of rewards from the mechanical arm's selection of the action to the final state, indicating the long-term influence of taking action a in the current state s on the action strategy function π(s); α is the learning rate of reinforcement learning and γ is the discount coefficient of reward; when the state St+1 is the final state, the corresponding action value function Q(St+1, At+1) is zero; each action value update requires the five elements (St, At, Rt+1, St+1, At+1); when iterating, maxa Q(St+1, a) is taken as the action value of the next time node at update time, i.e. the optimal action value maxa Q(St+1, a) is taken as the learning target, which accelerates the exploration process and gives the following formula:
Q(St, At) ← Q(St, At) + α[Rt+1 + γ maxa Q(St+1, a) − Q(St, At)] (4).
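Equation (4) differs from equation (3) only in taking the best next action value as the learning target. A tabular sketch (dictionary-backed Q-table; names are assumptions for illustration):

```python
def q_learning_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Eq. (4): Q(St,At) <- Q(St,At) + alpha*[Rt+1 + gamma*max_a Q(St+1,a) - Q(St,At)]."""
    q_sa = Q.get((s, a), 0.0)
    # learning target uses the best available action value at the next state (max over a)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (r_next + gamma * best_next - q_sa)
```

For example, if Q("s1", "left") = 2.0, an update for ("s0", "right") with reward 1.0 uses the target 1.0 + 0.9 × 2.0 = 2.8.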
8. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s, and the action a, and expresses the probability of taking any action; the optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ; a strategy objective function is established, and the gradient of the action strategy parameter is derived from the strategy objective function to obtain the optimization direction of the action strategy parameter; the goal of action strategy parameter optimization is to increase the initial state value V(st);
the strategy objective function is defined as:
J(θ) = Eπθ[r] = Σs d(s) Σa πθ(s, a) R(s, a) (5);
wherein J(θ) represents the defined strategy objective function, used to measure the quality of a strategy, and θ is the parameter of the strategy function updated by the Actor part; st denotes the state of the agent at time t and rt denotes the environmental feedback at time t; Eπθ[·] denotes the expectation over the trajectories the agent generates from st under the strategy πθ; d(s) denotes the stationary distribution of the Markov decision process over states under the current strategy, and the state s and the chosen action a of the decision are obtained by sampling from d(s); R(s, a) denotes the instant reward, i.e. the environmental feedback for each action taken in each state at a time step, used to compute the expectation of the environmental feedback;
the differentiable property of the strategy is used to differentiate the strategy objective function, giving the gradient function of the objective function shown in equation (6):
∇θJ(θ) = Eπθ[∇θ log πθ(s, a) · r] (6);
wherein ∇θ log πθ(s, a) is the score function, which smooths the action strategy parameter θ for use in action decisions; equation (6) evaluates the action strategy by the environmental feedback r, but an objective function gradient based on the value function Qπθ(s, a) is more suitable for evaluating the quality of action decisions, so the value function Qπθ(s, a) is used instead of the environmental feedback r, giving the gradient function of the objective function shown in equation (7):
∇θJ(θ) = Eπθ[∇θ log πθ(s, a) · Qπθ(s, a)] (7);
the strategy gradient algorithm finds the maximum of the strategy objective function J(θ) through gradient ascent, with the strategy gradient defined as:
Δθ = α∇θJ(θ) (8);
wherein α is the training step length, Δθ represents the ascent direction of the action strategy parameter θ obtained under that training step length, and ∇θJ(θ) is the gradient function of the objective function obtained by differentiating the objective function J(θ);
the ascent direction of the action strategy parameter θ is obtained through equation (8), and the parameter θ is updated in the increasing direction:
θt+1 = θt + Δθ (9).
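Equations (8)–(9) amount to one gradient-ascent step on θ. A sketch for the Gaussian-policy case of claim 4 with a linear mean μ = θ·s, where the score function has the closed form (a − μ)/σ² · s; the linear parameterization and all names are assumptions for illustration:

```python
def gaussian_score(theta, s, a, sigma=1.0):
    """Score function d/dtheta log N(a; mu=theta*s, sigma) for a linear-mean Gaussian policy."""
    mu = theta * s
    return (a - mu) / sigma ** 2 * s

def ascend(theta, s, a, q_value, alpha=0.05):
    """Eqs. (8)-(9): delta_theta = alpha * grad_theta J;  theta <- theta + delta_theta."""
    delta_theta = alpha * gaussian_score(theta, s, a) * q_value  # sample-based gradient
    return theta + delta_theta
```

Here the sampled estimate of ∇θJ(θ) is the score times the action's value, matching the expectation inside equation (7).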
9. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the update form of the value function of the Critic part is shown in equation (10):
Qw(s, a) ≈ Qπθ(s, a) (10);
wherein w is the update parameter of the action value function Qw(s, a) of the Critic part, and Qw(s, a) represents the parameterized value function of the Critic part; θ is the update parameter of the objective strategy function J(θ) of the Actor part, and Qπθ(s, a) represents the value function under the current strategy parameter θ; the Actor part uses the value function Qw(s, a) of the Critic part to update the parameter θ of the action strategy function, thereby updating the action strategy function π(s, a, θ) and the value function Qπθ(s, a); the strategy gradient using the Critic part's value function is obtained as shown in equation (11):
∇θJ(θ) = Eπθ[∇θ log πθ(s, a) · Qw(s, a)] (11);
this gives the parameter update direction for the strategy gradient part of the reinforcement learning Actor-Critic algorithm; however, the algorithm still needs to determine the update direction of the parameters of the Critic part's value function; the Critic part's value function is updated with a temporal-difference method similar to equation (3), giving the temporal-difference error of the Critic part's value function:
δ = r + γQw(st+1, at+1) − Qw(st, at) (12);
wherein Qw(st, at) and Qw(st+1, at+1) are the value functions at time t and time t+1, respectively.
Finally, the strategy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α∇θ log πθ(st, at) Qw(st, at),  w ← w + βδ∇wQw(st, at) (13);
wherein α and β are the parameter update step lengths of the Actor's action strategy function and the Critic's value function, respectively.
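Equations (11)–(13) combine into one update per transition: the TD error δ of equation (12) drives the Critic, and the Critic's value Qw drives the Actor. A sketch with a single linear feature per parameter (the feature values, step sizes, and function name are assumptions for illustration):

```python
def actor_critic_step(theta, w, phi_sa, phi_next, r, score,
                      alpha=0.01, beta=0.05, gamma=0.9):
    """One Actor-Critic parameter update (eqs. (11)-(13), 1-feature linear sketch).
    theta, w          : Actor / Critic parameters (floats)
    phi_sa, phi_next  : Critic features of (st, at) and (st+1, at+1)
    score             : grad_theta log pi_theta(st, at), supplied by the policy
    """
    q_sa = w * phi_sa                        # Q_w(st, at)
    q_next = w * phi_next                    # Q_w(st+1, at+1)
    delta = r + gamma * q_next - q_sa        # eq. (12): temporal-difference error
    theta = theta + alpha * score * q_sa     # Actor step along eq. (11)'s gradient sample
    w = w + beta * delta * phi_sa            # Critic step: eq. (13); grad_w Q_w = phi_sa
    return theta, w
```

With θ = 0, w = 1, φ(st, at) = 1, φ(st+1, at+1) = 0.5, r = 1, and score = 2, one call yields δ = 0.45, θ = 0.02, w = 1.0225.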
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110709508.9A CN113510704A (en) | 2021-06-25 | 2021-06-25 | Industrial mechanical arm motion planning method based on reinforcement learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113510704A true CN113510704A (en) | 2021-10-19 |
Family
ID=78065896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110709508.9A Pending CN113510704A (en) | 2021-06-25 | 2021-06-25 | Industrial mechanical arm motion planning method based on reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113510704A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109884897A (en) * | 2019-03-21 | 2019-06-14 | 中山大学 | A kind of matching of unmanned plane task and computation migration method based on deeply study |
CN112060082A (en) * | 2020-08-19 | 2020-12-11 | 大连理工大学 | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model |
Non-Patent Citations (3)
Title |
---|
刘建平: "强化学习(十四)" [Reinforcement Learning (XIV)], Retrieved from the Internet <URL:https://www.cnblogs.com/pinard/p/10272023.html> *
方丹: "方差相关的策略梯度方法研究" [Research on Variance-Related Policy Gradient Methods] *
李娟: "自主操作机器人的运动规划控制与抓取策略研究" [Research on Motion Planning Control and Grasping Strategy of Autonomous Manipulation Robots], pages 35 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114139472B (en) * | 2021-11-04 | 2023-05-02 | 江阴市智行工控科技有限公司 | Integrated circuit direct current analysis method and system based on reinforcement learning dual-mode structure |
CN114139472A (en) * | 2021-11-04 | 2022-03-04 | 江阴市智行工控科技有限公司 | Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure |
CN114012735A (en) * | 2021-12-06 | 2022-02-08 | 山西大学 | Mechanical arm control method and system based on deep reinforcement learning |
CN114012735B (en) * | 2021-12-06 | 2022-08-05 | 山西大学 | Mechanical arm control method and system based on deep reinforcement learning |
WO2023116742A1 (en) * | 2021-12-21 | 2023-06-29 | 清华大学 | Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet |
CN114378820A (en) * | 2022-01-18 | 2022-04-22 | 中山大学 | Robot impedance learning method based on safety reinforcement learning |
CN115319759A (en) * | 2022-09-21 | 2022-11-11 | 上海摩马智能科技有限公司 | Intelligent planning algorithm for tail end control track of mechanical arm |
CN116476042A (en) * | 2022-12-31 | 2023-07-25 | 中国科学院长春光学精密机械与物理研究所 | Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning |
CN116476042B (en) * | 2022-12-31 | 2024-01-12 | 中国科学院长春光学精密机械与物理研究所 | Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning |
CN116803635A (en) * | 2023-08-21 | 2023-09-26 | 南京邮电大学 | Near-end strategy optimization training acceleration method based on Gaussian kernel loss function |
CN116803635B (en) * | 2023-08-21 | 2023-12-22 | 南京邮电大学 | Near-end strategy optimization training acceleration method based on Gaussian kernel loss function |
CN117283565A (en) * | 2023-11-03 | 2023-12-26 | 安徽大学 | Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback |
CN117283565B (en) * | 2023-11-03 | 2024-03-22 | 安徽大学 | Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113510704A (en) | Industrial mechanical arm motion planning method based on reinforcement learning algorithm | |
CN109948642B (en) | Multi-agent cross-modal depth certainty strategy gradient training method based on image input | |
CN108161934B (en) | Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning | |
CN110238839B (en) | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction | |
Zhang et al. | Deep interactive reinforcement learning for path following of autonomous underwater vehicle | |
Leottau et al. | Decentralized reinforcement learning of robot behaviors | |
WO2022012265A1 (en) | Robot learning from demonstration via meta-imitation learning | |
CN111881772A (en) | Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning | |
CN112338921A (en) | Mechanical arm intelligent control rapid training method based on deep reinforcement learning | |
CN113093526B (en) | Overshoot-free PID controller parameter setting method based on reinforcement learning | |
CN116460860B (en) | Model-based robot offline reinforcement learning control method | |
CN113821045A (en) | Leg and foot robot reinforcement learning action generation system | |
CN116700327A (en) | Unmanned aerial vehicle track planning method based on continuous action dominant function learning | |
Jauhri et al. | Interactive imitation learning in state-space | |
CN114083539B (en) | Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning | |
CN115416024A (en) | Moment-controlled mechanical arm autonomous trajectory planning method and system | |
Li et al. | Navigation of mobile robots based on deep reinforcement learning: Reward function optimization and knowledge transfer | |
Xu et al. | Learning strategy for continuous robot visual control: A multi-objective perspective | |
Yan et al. | Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning | |
CN117086877A (en) | Industrial robot shaft hole assembly method, device and equipment based on deep reinforcement learning | |
CN117207186A (en) | Assembly line double-mechanical-arm collaborative grabbing method based on reinforcement learning | |
CN114779661B (en) | Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm | |
CN115674204A (en) | Robot shaft hole assembling method based on deep reinforcement learning and admittance control | |
CN115373415A (en) | Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning | |
CN112264995B (en) | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||