CN113510704A - Industrial mechanical arm motion planning method based on reinforcement learning algorithm - Google Patents

Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Info

Publication number
CN113510704A
CN113510704A (application CN202110709508.9A)
Authority
CN
China
Prior art keywords
action
function
strategy
state
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110709508.9A
Other languages
Chinese (zh)
Inventor
聂君
李强
卢晓
盛春阳
张治国
宋诗斌
梁笑
张焕水
王倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Bosheng Youkong Intelligent Technology Co ltd
Original Assignee
Qingdao Bosheng Youkong Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Bosheng Youkong Intelligent Technology Co ltd filed Critical Qingdao Bosheng Youkong Intelligent Technology Co ltd
Priority to CN202110709508.9A
Publication of CN113510704A
Pending legal-status Critical Current

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J18/00Arms

Abstract

The invention discloses an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, and belongs to the field of applying reinforcement learning to mechanical arm motion planning. In the invention, the reinforcement learning Actor-Critic algorithm is applied to the motion planning of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. A simulation environment of the mechanical arm hand-eye system is first built, a reinforcement learning algorithm model is then established according to the simulation environment, and finally the motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.

Description

Industrial mechanical arm motion planning method based on reinforcement learning algorithm
Technical Field
The invention belongs to the field of application of reinforcement learning to mechanical arm motion planning, and particularly relates to an industrial mechanical arm motion planning method based on a reinforcement learning algorithm.
Background
Mechanical arm motion planning for accomplishing complex tasks in an uncertain environment has long been a very challenging problem. Traditional control methods usually depend on a system model; however, such models have high-order, nonlinear, multivariable and strongly coupled characteristics, which makes it difficult for the mechanical arm system to have good adaptability and a degree of autonomy. In recent years, artificial intelligence technology has developed vigorously and provides new ideas for the autonomous learning control of the mechanical arm. Its core idea is to introduce an online learning mechanism into the planning and control of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. At present, the genetic algorithm has been applied preliminarily in the field of path planning. It has good global search capability in robotics applications, but its local search capability is poor, so the efficiency of the simple genetic algorithm is low in the later stages of the search. The genetic algorithm also suffers from premature convergence and has drawbacks in practical engineering applications, so other machine learning algorithms need to be explored. Reinforcement learning treats the task as a sequential decision problem based on the Markov decision process; its advantage is that long-term return can be taken into account, which alleviates the problem of premature convergence into a local optimum. Converting the task into a time-based sequential decision problem through reinforcement learning facilitates the design of automated and optimized grasping strategies for the mechanical arm, for which a reinforcement learning state, action and reward model is defined. Value-based reinforcement learning algorithms rely on value iteration, which helps the value function converge to the optimum but is not well suited to continuous motion; policy-based algorithms, which rely on parameter optimization, are more suitable for high-dimensional and continuous motion control and have better convergence properties, but the evaluation of a single policy easily falls into a local optimum. The Actor-Critic algorithm combines the value-based and policy-based methods in order to integrate the advantages of both, reducing the variance of the loss function and mitigating the local-optimum problem, so it can be better applied to the motion planning control of the mechanical arm.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which is reasonable in design, overcomes the defects of the prior art, and has a good effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
an industrial mechanical arm motion planning method based on a reinforcement learning algorithm comprises the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for the optimal strategy;
step 4: establishing the Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing the intelligent control of the mechanical arm.
Preferably, in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), and a loop sequence shown below is formed in the order of state s, action a and feedback r:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
the set P of all state transition probability distributions, S × A → S;
p denotes the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s';
the method specifically comprises the following steps:
step 1.1: initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i;
step 1.2: inputting the current environment state s_i into the strategy π* to obtain action information a;
step 1.3: executing the action information a, and observing the environment state at the next moment;
step 1.4: judging whether the environment state in step 1.3 has reached the target state; if so, ending, otherwise returning to step 1.2 (a code sketch of this loop is given below).
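A minimal sketch of this observe-act loop in Python is given below; the environment object env with observe(), step() and reached_target() methods and the trained strategy pi_star are illustrative assumptions, not part of the disclosed method.

```python
def run_motion_planning_episode(env, pi_star, max_steps=200):
    """Observe-act loop of steps 1.1-1.4 (env and pi_star are illustrative stand-ins)."""
    s_i = env.observe()                 # step 1.1: observe the current environment state s_i
    for _ in range(max_steps):          # bounded by the configured time-step budget
        a = pi_star(s_i)                # step 1.2: the strategy pi* outputs action information a
        s_i = env.step(a)               # step 1.3: execute a and observe the next environment state
        if env.reached_target(s_i):     # step 1.4: stop once the target state is reached
            return True
    return False                        # target not reached within the step budget
```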
Preferably, in step 2, the method specifically comprises the following steps:
step 2.1: inputting the current environment information s_i into the action strategy function neural network, outputting the probability distribution of the action information a, selecting the action with the higher probability, executing the action, and obtaining the environment feedback r and the environment state s_{i+1} at the next moment;
step 2.2: inputting the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment;
step 2.3: calculating the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculating the loss function of the action strategy function neural network through the temporal-difference error, and updating the action strategy function neural network by back propagation;
step 2.5: calculating the loss function of the value function network, and updating the value function neural network by back propagation;
step 2.6: completing the training process of the reinforcement learning algorithm (one training iteration is sketched in code below).
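The following sketch illustrates one such training iteration with PyTorch; the network sizes, optimizers, learning rates and the Gaussian action head are assumptions made only for this example and are not specified by the patent.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 10, 3, 0.99          # assumed dimensions and discount factor
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = torch.zeros(act_dim, requires_grad=True)
opt_actor = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def train_step(s_i, a, r, s_next):
    """One Actor-Critic update for the transition (s_i, a, r, s_next), following steps 2.1-2.6."""
    dist = torch.distributions.Normal(actor(s_i), log_std.exp())   # step 2.1: action distribution
    v_i, v_next = critic(s_i), critic(s_next)                      # step 2.2: state values v_i, v_{i+1}
    td_error = r + gamma * v_next.detach() - v_i                   # step 2.3: temporal-difference error
    actor_loss = -(dist.log_prob(a).sum() * td_error.detach())     # step 2.4: policy loss
    critic_loss = td_error.pow(2).mean()                           # step 2.5: value loss
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```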
Preferably, the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on a Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
Preferably, the value function of the Critic part has the Markov property, and a temporal-difference method is adopted to ensure that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
Preferably, the state cost function is as shown in equation (2):
V(St)←V(St)+α[Rt+1+γV(St+1)-V(St)] (2);
wherein, V (S)t) And V (S)t+1) A state cost function for representing the expectation of the reward sum of the state of the agent at the time t and the time t +1 to the final state; alpha is the learning rate of reinforcement learning, and represents the learning efficiency of the mechanical arm according to the environmental feedback, StIs a set of environmental states, state S, at the t time nodetNext time node context transfer to S after interaction with contextt+1Status, and receive an instant prize Rt+1(ii) a The direction of each update state cost function is Rt+1+γV(St+1)-V(St)。
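An illustrative tabular implementation of the update in equation (2); the discretization of states into hashable keys is an assumption made only for this sketch.

```python
def td0_update(V, s_t, r_next, s_next, alpha=0.1, gamma=0.9):
    """Equation (2): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)].
    V is a dict mapping a (discretized) state to its estimated value."""
    target = r_next + gamma * V.get(s_next, 0.0)
    V[s_t] = V.get(s_t, 0.0) + alpha * (target - V.get(s_t, 0.0))
    return V
```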
Preferably, the definition of the action cost function is associated with the state cost functionThe similarity of numbers, the introduction of action information in the state cost function obtains the action cost function Q (S)t,At) Wherein A istFor the action set of the t time node, the specific update form of the action cost function is shown in formula (3):
Q(St,At)←Q(St,At)+α[Rt+1+γQ(St+1,At+1)-Q(St,At)] (3);
wherein, Q (S)t,At) After the representative mechanical arm selects the action, the expected value of the final state reward sum indicates the long-term influence on the action strategy function pi(s) when the action a is taken in the current state s; alpha is the learning rate of reinforcement learning, gamma is the discount coefficient of reward, and the current state St+1Is the corresponding action merit function Q (S) in the end statet+1,At+1) Is zero; each time the action value updates the need (S)t,At,Rt+1,St+1,At+1) Five elements; take max when iteratingaQ(St+1A) as the action value of the next time node at the time of update, the action value max optimal at the past time is setaQ(St+1And a) as a learning target, accelerating the exploration process to obtain the following formula:
Q(St,At)←Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)-Q(St,At)] (4)。
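A corresponding tabular sketch of the update in equation (4); the finite, discretized action set actions is an assumption of this example.

```python
def q_update(Q, s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Equation (4): the learning target uses max_a Q(S_{t+1}, a) over the action set.
    Q is a dict mapping a (state, action) pair to its estimated value."""
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    td_target = r_next + gamma * best_next
    Q[(s_t, a_t)] = Q.get((s_t, a_t), 0.0) + alpha * (td_target - Q.get((s_t, a_t), 0.0))
    return Q
```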
Preferably, the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s and the action a, and is used to represent the probability of taking any action; the optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ. A policy objective function is established, and the gradient of the action strategy parameter is solved from the policy objective function to obtain the optimization direction of the action strategy parameter; the goal of the action strategy parameter optimization is to increase the initial state value V(s_t);
The policy objective function is defined as:
J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) R_{s,a}   (5);
where J(θ) represents the defined policy objective function and is used to measure the quality of a strategy, and θ is the parameter of the strategy function updated by the Actor part; s_t denotes the state of the agent at time t and r_t denotes the environment feedback at time t; E_{π_θ}[·] denotes the expectation for the agent starting from s_t under the current strategy; d(s) represents the stationary distribution of the Markov decision process over the states under the current strategy, and the state s and the chosen action a are obtained by sampling from d(s); R_{s,a} represents the instant reward, i.e. the environment feedback obtained when each action is taken in each state at a time step, and is used to compute the expected environment feedback;
the policy objective function is differentiated using the differentiability of the policy, and the gradient of the objective function is obtained as shown in equation (6):
∇_θ J(θ) = Σ_s d(s) Σ_a π_θ(s, a) ∇_θ log π_θ(s, a) R_{s,a}   (6);
where ∇_θ log π_θ(s, a) is the score function, which smooths the action strategy parameter θ for use in action decisions. The above formula evaluates the action strategy on the basis of the environment feedback r; an objective function gradient based on the value function Q^{π_θ}(s, a) is more suitable for evaluating the quality of action decisions, so the value function Q^{π_θ}(s, a) is used instead of the environment feedback r, and the gradient of the objective function is obtained as shown in equation (7):
∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]   (7);
The policy gradient algorithm finds the maximum of the policy objective function J(θ) through gradient ascent, and the policy gradient is defined as:
Δθ = α ∇_θ J(θ)   (8);
where α is the training step size, Δθ represents the ascent direction of the action strategy parameter θ obtained under this training step size, and ∇_θ J(θ) is the gradient of the objective function obtained by differentiating the objective function J(θ).
The ascent direction of the action strategy parameter θ is obtained through equation (8), and the parameter θ is updated in the increasing direction:
θ_{t+1} = θ_t + Δθ   (9).
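The gradient-ascent update of equations (7)-(9) can be sketched for a Gaussian action strategy with a linear mean μ(s) = θ·s and a fixed standard deviation σ; this particular parameterization is an assumption of the example, for which the score function is (a - μ(s))·s / σ².

```python
import numpy as np

def policy_gradient_step(theta, s, a, q_value, alpha=1e-3, sigma=0.1):
    """One ascent step theta <- theta + alpha * grad_theta log pi_theta(s, a) * Q(s, a)."""
    mu = theta @ s                               # mean action under the current parameters
    score = np.outer(a - mu, s) / sigma ** 2     # score function grad_theta log pi_theta(s, a)
    delta_theta = alpha * score * q_value        # equation (8): ascent direction under step size alpha
    return theta + delta_theta                   # equation (9): update theta in the increasing direction
```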
Preferably, the update form of the value function of the Critic part is shown in equation (10):
Q_w(s, a) ≈ Q^{π_θ}(s, a)   (10);
where w is the update parameter of the action value function Q_w(s, a) in the Critic part, and Q_w(s, a) represents the parameterized value function of the Critic part; θ is the update parameter of the objective policy function J(θ) in the Actor part, and Q^{π_θ}(s, a) represents the value function under the current strategy parameter θ; the Actor part uses the value function Q_w(s, a) of the Critic part to update the parameter θ of the action strategy function, and thereby updates the action strategy function π(s, a, θ) and the value function Q^{π_θ}(s, a).
The policy gradient combined with the Critic part is obtained as shown in equation (11):
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)]   (11);
this gives the parameter update direction of the policy-gradient part of the reinforcement learning Actor-Critic algorithm; the algorithm still needs to determine the update direction of the parameters of the Critic-part value function. The value function of the Critic part is updated by a temporal-difference method similar to equation (3), giving the temporal-difference error of the Critic-part value function:
δ = r + γQ_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)   (12);
where Q_w(s_t, a_t) and Q_w(s_{t+1}, a_{t+1}) are the value functions corresponding to time t and time t+1, respectively.
Finally, the policy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a)   (13);
where α and β are the parameter update step sizes of the action strategy function (Actor) and the value function (Critic), respectively.
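A sketch of the coupled update in equations (12) and (13), assuming a linear Critic Q_w(s, a) = w·φ(s, a) (so that the gradient of Q_w with respect to w is the feature vector φ(s, a)) and the Gaussian Actor from the previous sketch; the feature map phi and sigma are assumptions of this example.

```python
import numpy as np

def actor_critic_update(theta, w, phi, s, a, r, s_next, a_next,
                        alpha=1e-3, beta=1e-2, gamma=0.99, sigma=0.1):
    """One Actor-Critic step: Actor step size alpha, Critic step size beta."""
    q, q_next = w @ phi(s, a), w @ phi(s_next, a_next)
    delta = r + gamma * q_next - q                       # equation (12): temporal-difference error
    score = np.outer(a - theta @ s, s) / sigma ** 2      # grad_theta log pi_theta(s, a)
    theta = theta + alpha * score * q                    # Actor update from equation (13)
    w = w + beta * delta * phi(s, a)                     # Critic update from equation (13)
    return theta, w
```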
The invention has the following beneficial technical effects:
The mechanical arm control model has high-order, nonlinear, multivariable and strongly coupled characteristics, which makes it difficult for the mechanical arm system to have good adaptability and a degree of autonomy. To free mechanical arm control from the limitation of the system model, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which removes the dependence on the system model and facilitates the design of automated and optimized strategies. In the invention, the reinforcement learning Actor-Critic algorithm is applied to the motion planning of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. A simulation environment of the mechanical arm hand-eye system is first built, a reinforcement learning algorithm model is then established according to the simulation environment, and finally the motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.
Drawings
FIG. 1 is a view of the 6-dof robot arm.
Fig. 2 is a reinforcement learning framework diagram.
FIG. 3 is a flow chart of the Actor-Critic algorithm.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
Mechanical arms are machine equipment commonly used in industrial production. The six-degree-of-freedom full-rotation-joint mechanical arm is a common mechanical arm structure in actual production environments, and such a mechanical arm can basically meet the requirements of general industrial production. Fig. 1 shows a typical industrial mechanical arm structure, in which all six joints are rotary joints and the axes of the last three joints intersect at one point. This structure has good kinematic solvability. Moreover, this mechanical arm structure can basically meet the positioning and grasping tasks of three-dimensional space in industrial production. Therefore, the mechanical arm with this classical structure is widely applied in industrial production.
As shown in fig. 2, reinforcement learning is based on the Markov decision process; similarly, the Actor-Critic algorithm is modeled on the Markov decision process, and the model is established under the assumption that the environment has the Markov property:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
(1) the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
(2) the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
(3) the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
(4) the set P of all state transition probability distributions, S × A → S, where p represents the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s'.
This can be expressed as follows: assuming the environment s_i (i = 1, 2, …, t) is fully observable, at any time t the agent is in state s_t; the agent then takes action a_t and transitions to the next state s_{t+1} according to p(s' | s, a), while obtaining an environment reward r. The goal of reinforcement learning is, by exploring different states of the environment, to finally learn a strategy a_t = π(s_t) that maximizes the cumulative feedback. The environment state s of reinforcement learning applied in the mechanical arm system is set as follows: the state observation information collected by the sensors is used as the reinforcement learning environment s of the mechanical arm hand-eye system, including the end position of the mechanical arm, the target point position, the obstacle, the distance from the end of the mechanical arm to the target point, and so on.
The action space of the mechanical arm motion planning comprises three dimensions in a three-dimensional Cartesian space, the action control is closely related to the change of the environment state, and the reinforcement learning action a completes the motion of the mechanical arm by controlling the tail end position of the mechanical arm. The state space under the robot arm motion planning task is generally a continuous space. The Markov process state space S and the action space A have correlation, and the state observation information comprises state transition information generated by strategy interaction in time series. The reinforcement learning strategy obtained in the simulation environment generally has the problem of difficult application in the real environment, and one of the main reasons is that the simulation environment is different from the real environment, and physical effects such as acceleration, gravity, rigid body density, object surface friction force and the like have certain errors in the virtual and real environments. In order to ensure that the strategy obtained by reinforcement learning training in the simulation environment has certain applicability in the real environment, the method avoids factors with larger difference between virtual and real conditions as much as possible when designing state information. The spatial coordinates have the same properties in the simulated environment as in the real environment. When the error between the mechanical arm model in the simulation environment and the mechanical arm model in the real environment is small, the state space using the position coordinates as the description information is more likely to be simultaneously applicable to the simulation and the real environment.
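One possible way to assemble the observation vector described above is sketched below; the exact fields, their ordering and the single-obstacle assumption are illustrative choices, not specified by the patent.

```python
import numpy as np

def build_state(end_effector_xyz, target_xyz, obstacle_xyz):
    """Concatenate end-effector position, target position, obstacle position
    and the end-to-target distance into one reinforcement-learning state s."""
    end_effector_xyz = np.asarray(end_effector_xyz, dtype=float)
    target_xyz = np.asarray(target_xyz, dtype=float)
    obstacle_xyz = np.asarray(obstacle_xyz, dtype=float)
    distance = np.linalg.norm(target_xyz - end_effector_xyz)
    return np.concatenate([end_effector_xyz, target_xyz, obstacle_xyz, [distance]])
```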
Another core problem of the strategy control of the mechanical arm in the exploration process is how to obtain a reward value from the current observed state during training. The reward value r is 1 when the end of the mechanical arm is at the target point; in all other cases, when the mechanical arm is not at the target point, the reward function is designed as a continuous reward value. Given the initial end position of the mechanical arm [x_0, y_0, z_0], the current end coordinates of the mechanical arm [x_T, y_T, z_T] and the coordinates of the target point [x_g, y_g, z_g], the reward value r is kept in the range [-1, 1]:
[equation shown as an image in the original filing: the continuous reward r computed from the initial, current and target end coordinates, bounded to [-1, 1]]
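Because the continuous reward formula appears only as an image in the original filing, the sketch below is just one plausible realization consistent with the surrounding text: r = 1 at the target point, otherwise a negative distance ratio clipped to [-1, 1]; the tolerance and the exact ratio are assumptions of this example.

```python
import numpy as np

def reward(p0, pT, pg, tolerance=0.01):
    """p0: initial end position [x_0, y_0, z_0], pT: current end position, pg: target point."""
    p0, pT, pg = (np.asarray(p, dtype=float) for p in (p0, pT, pg))
    dist_now = np.linalg.norm(pg - pT)
    if dist_now < tolerance:
        return 1.0                                   # the arm end has reached the target point
    dist_init = np.linalg.norm(pg - p0) + 1e-8       # avoid division by zero
    return float(np.clip(-dist_now / dist_init, -1.0, 1.0))
```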
The motion control process applied to the mechanical arm based on the reinforcement learning Actor-Critic algorithm can be divided into the following steps:
1. Initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i.
2. Inputting the state s_i into the strategy π* to obtain the action information a.
3. Executing the action a, and observing the environment state in the next time step.
4. Judging whether the target state is reached; if so, ending, and if not, returning to step 2.
The mechanical arm motion planning based on reinforcement learning is similar to closed-loop control: the control strategy is responsible for outputting the action information a; after the agent executes the action a, the environment state transitions, and the control strategy makes a decision on a new action based on the new environment state, cycling continuously until the goal is achieved.
The update mode of the Actor-Critic algorithm comprises value-evaluation updating and strategy-evaluation updating. Reinforcement learning methods based on value evaluation are somewhat inadequate for continuous action spaces, while reinforcement learning methods based on strategy evaluation suffer from slow convergence. The Actor-Critic algorithm framework integrates the value-based method and the policy-based method. The algorithm can generally be divided into a decision part and an evaluation part: the strategy part makes decisions according to the state, similarly to the policy gradient method, while the other part can be understood as improving the value part of the policy gradient with a value-based method. The Critic part of the algorithm estimates the action value function:
Q_w(s, a) ≈ Q^{π_θ}(s, a);
the value function Q_w(s, a) is a function composed of the parameter w. Combining the policy gradient under the Actor-Critic framework:
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)];
the parameter updating direction of the gradient part of the Actor-Critic strategy is obtained. The algorithm still needs to determine the update direction of the parameters of the Critic part. Updating the Critic part by adopting a TD method to obtain a TD error of the Critic part:
δ = r + γQ_w(s', a') - Q_w(s, a)   (5);
the strategy gradient method parameter updating form based on the Actor-Critic framework is as follows:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a);
wherein, α and β are parameter update step lengths of the Actor and Critic parts respectively.
The operation flow of the Actor-Critic algorithm is shown in fig. 3: the Actor network receives the input state and selects and outputs the action variable, and the Critic network evaluates the quality of the selected action and computes the reward value. The specific flow, as shown in fig. 3, is:
1. Input the current environment information s_i into the action strategy function neural network, output the probability distribution of the action information a, select the action with the higher probability, execute the action, and obtain the environment feedback r and the environment state s_{i+1} at the next moment.
2. Input the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment.
3. Calculate the temporal-difference error from the environment feedback r and the state values v.
4. Calculate the loss function of the action strategy function neural network through the temporal-difference error, and update the action strategy function neural network by back propagation.
5. Calculate the loss function of the value function network, and update the value function neural network by back propagation.
6. Repeat the above process to complete the Actor-Critic algorithm training process.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (9)

1. An industrial mechanical arm motion planning method based on a reinforcement learning algorithm, characterized by comprising the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for the optimal strategy;
step 4: establishing the Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing the intelligent control of the mechanical arm.
2. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), and a loop sequence shown below is formed in the order of state s, action a and feedback r:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
the set P of all state transition probability distributions, S × A → S;
p denotes the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s';
the method specifically comprises the following steps:
step 1.1: initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i;
step 1.2: inputting the current environment state s_i into the strategy π* to obtain action information a;
step 1.3: executing the action information a, and observing the environment state at the next moment;
step 1.4: judging whether the environment state in step 1.3 has reached the target state; if so, ending, otherwise returning to step 1.2.
3. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: in the step 2, the method specifically comprises the following steps:
step 2.1: inputting the current environment information s_i into the action strategy function neural network, outputting the probability distribution of the action information a, selecting the action with the higher probability, executing the action, and obtaining the environment feedback r and the environment state s_{i+1} at the next moment;
step 2.2: inputting the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment;
step 2.3: calculating the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculating the loss function of the action strategy function neural network through the temporal-difference error, and updating the action strategy function neural network by back propagation;
step 2.5: calculating the loss function of the value function network, and updating the value function neural network by back propagation;
step 2.6: completing the training process of the reinforcement learning algorithm.
4. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
5. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the value function of the Critic part has the Markov property, and a temporal-difference method is adopted to ensure that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
6. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the state value function is as shown in equation (2):
V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]   (2);
where V(S_t) and V(S_{t+1}) are the state value functions representing the expectation of the sum of rewards from the state of the agent at time t and at time t+1, respectively, to the final state; α is the learning rate of reinforcement learning and represents the learning efficiency of the mechanical arm according to the environment feedback; S_t is the set of environment states at the t-th time node; after the state S_t interacts with the environment, the environment transitions at the next time node to the state S_{t+1} and the instant reward R_{t+1} is received; the direction of each update of the state value function is R_{t+1} + γV(S_{t+1}) - V(S_t).
7. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the action value function is defined similarly to the state value function; introducing the action information into the state value function gives the action value function Q(S_t, A_t), where A_t is the action set at the t-th time node. The specific update form of the action value function is shown in equation (3):
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]   (3);
where Q(S_t, A_t) represents the expectation of the sum of rewards, up to the final state, after the mechanical arm selects the action, and indicates the long-term influence on the action strategy function π(s) when action a is taken in the current state s; α is the learning rate of reinforcement learning and γ is the discount coefficient of the reward; when the state S_{t+1} is the end state, the corresponding action value function Q(S_{t+1}, A_{t+1}) is zero; each update of the action value requires the five elements (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}); taking max_a Q(S_{t+1}, a) as the action value of the next time node at update time, i.e. taking the optimal past action value max_a Q(S_{t+1}, a) as the learning target, accelerates the exploration process and gives the following formula:
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]   (4).
8. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s and the action a, and is used to represent the probability of taking any action; the optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ. A policy objective function is established, and the gradient of the action strategy parameter is solved from the policy objective function to obtain the optimization direction of the action strategy parameter; the goal of the action strategy parameter optimization is to increase the initial state value V(s_t);
the policy objective function is defined as:
J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) R_{s,a}   (5);
where J(θ) represents the defined policy objective function and is used to measure the quality of a strategy, and θ is the parameter of the strategy function updated by the Actor part; s_t denotes the state of the agent at time t and r_t denotes the environment feedback at time t; E_{π_θ}[·] denotes the expectation for the agent starting from s_t under the current strategy; d(s) represents the stationary distribution of the Markov decision process over the states under the current strategy, and the state s and the chosen action a are obtained by sampling from d(s); R_{s,a} represents the instant reward, i.e. the environment feedback obtained when each action is taken in each state at a time step, and is used to compute the expected environment feedback;
the policy objective function is differentiated using the differentiability of the policy, and the gradient of the objective function is obtained as shown in equation (6):
∇_θ J(θ) = Σ_s d(s) Σ_a π_θ(s, a) ∇_θ log π_θ(s, a) R_{s,a}   (6);
where ∇_θ log π_θ(s, a) is the score function, which smooths the action strategy parameter θ for use in action decisions. The above formula evaluates the action strategy on the basis of the environment feedback r; an objective function gradient based on the value function Q^{π_θ}(s, a) is more suitable for evaluating the quality of action decisions, so the value function Q^{π_θ}(s, a) is used instead of the environment feedback r, and the gradient of the objective function is obtained as shown in equation (7):
∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]   (7);
the policy gradient algorithm finds the maximum of the policy objective function J(θ) through gradient ascent, and the policy gradient is defined as:
Δθ = α ∇_θ J(θ)   (8);
where α is the training step size, Δθ represents the ascent direction of the action strategy parameter θ obtained under this training step size, and ∇_θ J(θ) is the gradient of the objective function obtained by differentiating the objective function J(θ).
The ascent direction of the action strategy parameter θ is obtained through equation (8), and the parameter θ is updated in the increasing direction:
θ_{t+1} = θ_t + Δθ   (9).
9. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the update form of the value function of the Critic part is shown in equation (10):
Q_w(s, a) ≈ Q^{π_θ}(s, a)   (10);
where w is the update parameter of the action value function Q_w(s, a) in the Critic part, and Q_w(s, a) represents the parameterized value function of the Critic part; θ is the update parameter of the objective policy function J(θ) in the Actor part, and Q^{π_θ}(s, a) represents the value function under the current strategy parameter θ; the Actor part uses the value function Q_w(s, a) of the Critic part to update the parameter θ of the action strategy function, and thereby updates the action strategy function π(s, a, θ) and the value function Q^{π_θ}(s, a).
The policy gradient combined with the Critic part is obtained as shown in equation (11):
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)]   (11);
this gives the parameter update direction of the policy-gradient part of the reinforcement learning Actor-Critic algorithm; the algorithm still needs to determine the update direction of the parameters of the Critic-part value function. The value function of the Critic part is updated by a temporal-difference method similar to equation (3), giving the temporal-difference error of the Critic-part value function:
δ = r + γQ_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)   (12);
where Q_w(s_t, a_t) and Q_w(s_{t+1}, a_{t+1}) are the value functions corresponding to time t and time t+1, respectively.
Finally, the policy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a)   (13);
where α and β are the parameter update step sizes of the action strategy function (Actor) and the value function (Critic), respectively.
CN202110709508.9A 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm Pending CN113510704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709508.9A CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110709508.9A CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN113510704A true CN113510704A (en) 2021-10-19

Family

ID=78065896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709508.9A Pending CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN113510704A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114012735A (en) * 2021-12-06 2022-02-08 山西大学 Mechanical arm control method and system based on deep reinforcement learning
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116803635A (en) * 2023-08-21 2023-09-26 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN117283565A (en) * 2023-11-03 2023-12-26 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109884897A (en) * 2019-03-21 2019-06-14 中山大学 A kind of matching of unmanned plane task and computation migration method based on deeply study
CN112060082A (en) * 2020-08-19 2020-12-11 大连理工大学 Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109884897A (en) * 2019-03-21 2019-06-14 中山大学 A kind of matching of unmanned plane task and computation migration method based on deeply study
CN112060082A (en) * 2020-08-19 2020-12-11 大连理工大学 Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘建平: "Reinforcement Learning (XIV)", Retrieved from the Internet <URL:https://www.cnblogs.com/pinard/p/10272023.html> *
方丹: "Research on variance-related policy gradient methods" *
李娟: "Research on motion planning control and grasping strategies of autonomous manipulation robots", pages 35 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139472B (en) * 2021-11-04 2023-05-02 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-mode structure
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114012735A (en) * 2021-12-06 2022-02-08 山西大学 Mechanical arm control method and system based on deep reinforcement learning
CN114012735B (en) * 2021-12-06 2022-08-05 山西大学 Mechanical arm control method and system based on deep reinforcement learning
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116476042B (en) * 2022-12-31 2024-01-12 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116803635A (en) * 2023-08-21 2023-09-26 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN116803635B (en) * 2023-08-21 2023-12-22 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN117283565A (en) * 2023-11-03 2023-12-26 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback
CN117283565B (en) * 2023-11-03 2024-03-22 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback

Similar Documents

Publication Publication Date Title
CN113510704A (en) Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN109948642B (en) Multi-agent cross-modal depth certainty strategy gradient training method based on image input
CN108161934B (en) Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
Zhang et al. Deep interactive reinforcement learning for path following of autonomous underwater vehicle
Leottau et al. Decentralized reinforcement learning of robot behaviors
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN111881772A (en) Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN112338921A (en) Mechanical arm intelligent control rapid training method based on deep reinforcement learning
CN113093526B (en) Overshoot-free PID controller parameter setting method based on reinforcement learning
CN116460860B (en) Model-based robot offline reinforcement learning control method
CN113821045A (en) Leg and foot robot reinforcement learning action generation system
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
Jauhri et al. Interactive imitation learning in state-space
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN115416024A (en) Moment-controlled mechanical arm autonomous trajectory planning method and system
Li et al. Navigation of mobile robots based on deep reinforcement learning: Reward function optimization and knowledge transfer
Xu et al. Learning strategy for continuous robot visual control: A multi-objective perspective
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
CN117086877A (en) Industrial robot shaft hole assembly method, device and equipment based on deep reinforcement learning
CN117207186A (en) Assembly line double-mechanical-arm collaborative grabbing method based on reinforcement learning
CN114779661B (en) Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN115674204A (en) Robot shaft hole assembling method based on deep reinforcement learning and admittance control
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN112264995B (en) Robot double-shaft hole assembling method based on hierarchical reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination