CN113510704A - Industrial mechanical arm motion planning method based on reinforcement learning algorithm - Google Patents

Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Info

Publication number
CN113510704A
CN113510704A (application CN202110709508.9A)
Authority
CN
China
Prior art keywords
action
function
strategy
state
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110709508.9A
Other languages
Chinese (zh)
Inventor
聂君
李强
卢晓
盛春阳
张治国
宋诗斌
梁笑
张焕水
王倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Bosheng Youkong Intelligent Technology Co ltd
Original Assignee
Qingdao Bosheng Youkong Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Bosheng Youkong Intelligent Technology Co ltd filed Critical Qingdao Bosheng Youkong Intelligent Technology Co ltd
Priority to CN202110709508.9A
Publication of CN113510704A
Pending legal-status Critical Current

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J18/00Arms

Abstract

The invention discloses an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, and belongs to the field of applying reinforcement learning to mechanical arm motion planning. In the invention, the reinforcement learning Actor-Critic algorithm is applied to the motion planning of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. A simulation environment of the mechanical arm hand-eye system is first built, a reinforcement learning algorithm model is then established according to the simulation environment, and finally the motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.

Description

Industrial mechanical arm motion planning method based on reinforcement learning algorithm
Technical Field
The invention belongs to the field of application of reinforcement learning to mechanical arm motion planning, and particularly relates to an industrial mechanical arm motion planning method based on a reinforcement learning algorithm.
Background
Mechanical arm motion planning for accomplishing complex tasks in an uncertain environment has long been a very challenging problem. Traditional control methods usually depend on a system model; however, such models have high-order, nonlinear, multivariable and strongly coupled characteristics, which makes it difficult for the mechanical arm system to have good adaptability and a degree of autonomy. In recent years, artificial intelligence technology has developed vigorously and provides new ideas for the autonomous learning control of the mechanical arm. Its core idea is to introduce an online learning mechanism into the planning and control of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. At present, the genetic algorithm has been applied preliminarily in the field of path planning. It has good global search capability in robotics applications, but its local search capability is poor, so the efficiency of the simple genetic algorithm is low in the later stages of the search. The genetic algorithm also suffers from premature convergence and has drawbacks in practical engineering applications, so other machine learning algorithms need to be explored. Reinforcement learning treats the task as a sequential decision problem based on the Markov decision process; its advantage is that long-term return can be taken into account, which alleviates the problem of premature convergence into a local optimum. Converting the task into a time-based sequential decision problem through reinforcement learning facilitates the design of automated and optimized grasping strategies for the mechanical arm, for which a reinforcement learning state, action and reward model is defined. Value-based reinforcement learning algorithms rely on value iteration, which helps the value function converge to the optimum but is not well suited to continuous motion; policy-based algorithms, which rely on parameter optimization, are more suitable for high-dimensional and continuous motion control and have better convergence properties, but the evaluation of a single policy easily falls into a local optimum. The Actor-Critic algorithm combines the value-based and policy-based methods in order to integrate the advantages of both, reducing the variance of the loss function and mitigating the local-optimum problem, so it can be better applied to the motion planning control of the mechanical arm.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which is reasonable in design, overcomes the defects of the prior art, and has a good effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
an industrial mechanical arm motion planning method based on a reinforcement learning algorithm comprises the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for the optimal strategy;
step 4: establishing the Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing the intelligent control of the mechanical arm.
Preferably, in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), and a loop sequence shown below is formed in the order of state s, action a and feedback r:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
the set P of all state transition probability distributions, S × A → S;
p denotes the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s';
the method specifically comprises the following steps:
step 1.1: initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i;
step 1.2: inputting the current environment state s_i into the strategy π* to obtain action information a;
step 1.3: executing the action information a, and observing the environment state at the next moment;
step 1.4: judging whether the environment state in step 1.3 has reached the target state; if so, ending, otherwise returning to step 1.2 (a code sketch of this loop is given below).
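A minimal sketch of this observe-act loop in Python is given below; the environment object env with observe(), step() and reached_target() methods and the trained strategy pi_star are illustrative assumptions, not part of the disclosed method.

```python
def run_motion_planning_episode(env, pi_star, max_steps=200):
    """Observe-act loop of steps 1.1-1.4 (env and pi_star are illustrative stand-ins)."""
    s_i = env.observe()                 # step 1.1: observe the current environment state s_i
    for _ in range(max_steps):          # bounded by the configured time-step budget
        a = pi_star(s_i)                # step 1.2: the strategy pi* outputs action information a
        s_i = env.step(a)               # step 1.3: execute a and observe the next environment state
        if env.reached_target(s_i):     # step 1.4: stop once the target state is reached
            return True
    return False                        # target not reached within the step budget
```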
Preferably, in step 2, the method specifically comprises the following steps:
step 2.1: inputting the current environment information s_i into the action strategy function neural network, outputting the probability distribution of the action information a, selecting the action with the higher probability, executing the action, and obtaining the environment feedback r and the environment state s_{i+1} at the next moment;
step 2.2: inputting the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment;
step 2.3: calculating the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculating the loss function of the action strategy function neural network through the temporal-difference error, and updating the action strategy function neural network by back propagation;
step 2.5: calculating the loss function of the value function network, and updating the value function neural network by back propagation;
step 2.6: completing the training process of the reinforcement learning algorithm (one training iteration is sketched in code below).
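The following sketch illustrates one such training iteration with PyTorch; the network sizes, optimizers, learning rates and the Gaussian action head are assumptions made only for this example and are not specified by the patent.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 10, 3, 0.99          # assumed dimensions and discount factor
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = torch.zeros(act_dim, requires_grad=True)
opt_actor = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def train_step(s_i, a, r, s_next):
    """One Actor-Critic update for the transition (s_i, a, r, s_next), following steps 2.1-2.6."""
    dist = torch.distributions.Normal(actor(s_i), log_std.exp())   # step 2.1: action distribution
    v_i, v_next = critic(s_i), critic(s_next)                      # step 2.2: state values v_i, v_{i+1}
    td_error = r + gamma * v_next.detach() - v_i                   # step 2.3: temporal-difference error
    actor_loss = -(dist.log_prob(a).sum() * td_error.detach())     # step 2.4: policy loss
    critic_loss = td_error.pow(2).mean()                           # step 2.5: value loss
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```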
Preferably, the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on a Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
Preferably, the value function of the Critic part has the Markov property, and a temporal-difference method is adopted to ensure that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
Preferably, the state cost function is as shown in equation (2):
V(St)←V(St)+α[Rt+1+γV(St+1)-V(St)] (2);
wherein, V (S)t) And V (S)t+1) A state cost function for representing the expectation of the reward sum of the state of the agent at the time t and the time t +1 to the final state; alpha is the learning rate of reinforcement learning, and represents the learning efficiency of the mechanical arm according to the environmental feedback, StIs a set of environmental states, state S, at the t time nodetNext time node context transfer to S after interaction with contextt+1Status, and receive an instant prize Rt+1(ii) a The direction of each update state cost function is Rt+1+γV(St+1)-V(St)。
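An illustrative tabular implementation of the update in equation (2); the discretization of states into hashable keys is an assumption made only for this sketch.

```python
def td0_update(V, s_t, r_next, s_next, alpha=0.1, gamma=0.9):
    """Equation (2): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)].
    V is a dict mapping a (discretized) state to its estimated value."""
    target = r_next + gamma * V.get(s_next, 0.0)
    V[s_t] = V.get(s_t, 0.0) + alpha * (target - V.get(s_t, 0.0))
    return V
```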
Preferably, the definition of the action cost function is associated with the state cost functionThe similarity of numbers, the introduction of action information in the state cost function obtains the action cost function Q (S)t,At) Wherein A istFor the action set of the t time node, the specific update form of the action cost function is shown in formula (3):
Q(St,At)←Q(St,At)+α[Rt+1+γQ(St+1,At+1)-Q(St,At)] (3);
wherein, Q (S)t,At) After the representative mechanical arm selects the action, the expected value of the final state reward sum indicates the long-term influence on the action strategy function pi(s) when the action a is taken in the current state s; alpha is the learning rate of reinforcement learning, gamma is the discount coefficient of reward, and the current state St+1Is the corresponding action merit function Q (S) in the end statet+1,At+1) Is zero; each time the action value updates the need (S)t,At,Rt+1,St+1,At+1) Five elements; take max when iteratingaQ(St+1A) as the action value of the next time node at the time of update, the action value max optimal at the past time is setaQ(St+1And a) as a learning target, accelerating the exploration process to obtain the following formula:
Q(St,At)←Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)-Q(St,At)] (4)。
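A corresponding tabular sketch of the update in equation (4); the finite, discretized action set actions is an assumption of this example.

```python
def q_update(Q, s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Equation (4): the learning target uses max_a Q(S_{t+1}, a) over the action set.
    Q is a dict mapping a (state, action) pair to its estimated value."""
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    td_target = r_next + gamma * best_next
    Q[(s_t, a_t)] = Q.get((s_t, a_t), 0.0) + alpha * (td_target - Q.get((s_t, a_t), 0.0))
    return Q
```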
Preferably, the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s and the action a, and is used to represent the probability of taking any action; the optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ. A policy objective function is established, and the gradient of the action strategy parameter is solved from the policy objective function to obtain the optimization direction of the action strategy parameter; the goal of the action strategy parameter optimization is to increase the initial state value V(s_t);
The policy objective function is defined as:
J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) R_{s,a}   (5);
where J(θ) represents the defined policy objective function and is used to measure the quality of a strategy, and θ is the parameter of the strategy function updated by the Actor part; s_t denotes the state of the agent at time t and r_t denotes the environment feedback at time t; E_{π_θ}[·] denotes the expectation for the agent starting from s_t under the current strategy; d(s) represents the stationary distribution of the Markov decision process over the states under the current strategy, and the state s and the chosen action a are obtained by sampling from d(s); R_{s,a} represents the instant reward, i.e. the environment feedback obtained when each action is taken in each state at a time step, and is used to compute the expected environment feedback;
the policy objective function is differentiated using the differentiability of the policy, and the gradient of the objective function is obtained as shown in equation (6):
∇_θ J(θ) = Σ_s d(s) Σ_a π_θ(s, a) ∇_θ log π_θ(s, a) R_{s,a}   (6);
where ∇_θ log π_θ(s, a) is the score function, which smooths the action strategy parameter θ for use in action decisions. The above formula evaluates the action strategy on the basis of the environment feedback r; an objective function gradient based on the value function Q^{π_θ}(s, a) is more suitable for evaluating the quality of action decisions, so the value function Q^{π_θ}(s, a) is used instead of the environment feedback r, and the gradient of the objective function is obtained as shown in equation (7):
∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]   (7);
The policy gradient algorithm finds the maximum of the policy objective function J(θ) through gradient ascent, and the policy gradient is defined as:
Δθ = α ∇_θ J(θ)   (8);
where α is the training step size, Δθ represents the ascent direction of the action strategy parameter θ obtained under this training step size, and ∇_θ J(θ) is the gradient of the objective function obtained by differentiating the objective function J(θ).
The ascent direction of the action strategy parameter θ is obtained through equation (8), and the parameter θ is updated in the increasing direction:
θ_{t+1} = θ_t + Δθ   (9).
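The gradient-ascent update of equations (7)-(9) can be sketched for a Gaussian action strategy with a linear mean μ(s) = θ·s and a fixed standard deviation σ; this particular parameterization is an assumption of the example, for which the score function is (a - μ(s))·s / σ².

```python
import numpy as np

def policy_gradient_step(theta, s, a, q_value, alpha=1e-3, sigma=0.1):
    """One ascent step theta <- theta + alpha * grad_theta log pi_theta(s, a) * Q(s, a)."""
    mu = theta @ s                               # mean action under the current parameters
    score = np.outer(a - mu, s) / sigma ** 2     # score function grad_theta log pi_theta(s, a)
    delta_theta = alpha * score * q_value        # equation (8): ascent direction under step size alpha
    return theta + delta_theta                   # equation (9): update theta in the increasing direction
```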
Preferably, the update form of the value function of the Critic part is shown in equation (10):
Q_w(s, a) ≈ Q^{π_θ}(s, a)   (10);
where w is the update parameter of the action value function Q_w(s, a) in the Critic part, and Q_w(s, a) represents the parameterized value function of the Critic part; θ is the update parameter of the objective policy function J(θ) in the Actor part, and Q^{π_θ}(s, a) represents the value function under the current strategy parameter θ; the Actor part uses the value function Q_w(s, a) of the Critic part to update the parameter θ of the action strategy function, and thereby updates the action strategy function π(s, a, θ) and the value function Q^{π_θ}(s, a).
The policy gradient combined with the Critic part is obtained as shown in equation (11):
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)]   (11);
this gives the parameter update direction of the policy-gradient part of the reinforcement learning Actor-Critic algorithm; the algorithm still needs to determine the update direction of the parameters of the Critic-part value function. The value function of the Critic part is updated by a temporal-difference method similar to equation (3), giving the temporal-difference error of the Critic-part value function:
δ = r + γQ_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)   (12);
where Q_w(s_t, a_t) and Q_w(s_{t+1}, a_{t+1}) are the value functions corresponding to time t and time t+1, respectively.
Finally, the policy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a)   (13);
where α and β are the parameter update step sizes of the action strategy function (Actor) and the value function (Critic), respectively.
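A sketch of the coupled update in equations (12) and (13), assuming a linear Critic Q_w(s, a) = w·φ(s, a) (so that the gradient of Q_w with respect to w is the feature vector φ(s, a)) and the Gaussian Actor from the previous sketch; the feature map phi and sigma are assumptions of this example.

```python
import numpy as np

def actor_critic_update(theta, w, phi, s, a, r, s_next, a_next,
                        alpha=1e-3, beta=1e-2, gamma=0.99, sigma=0.1):
    """One Actor-Critic step: Actor step size alpha, Critic step size beta."""
    q, q_next = w @ phi(s, a), w @ phi(s_next, a_next)
    delta = r + gamma * q_next - q                       # equation (12): temporal-difference error
    score = np.outer(a - theta @ s, s) / sigma ** 2      # grad_theta log pi_theta(s, a)
    theta = theta + alpha * score * q                    # Actor update from equation (13)
    w = w + beta * delta * phi(s, a)                     # Critic update from equation (13)
    return theta, w
```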
The invention has the following beneficial technical effects:
The mechanical arm control model has high-order, nonlinear, multivariable and strongly coupled characteristics, which makes it difficult for the mechanical arm system to have good adaptability and a degree of autonomy. To free mechanical arm control from the limitation of the system model, the invention provides an industrial mechanical arm motion planning method based on a reinforcement learning algorithm, which removes the dependence on the system model and facilitates the design of automated and optimized strategies. In the invention, the reinforcement learning Actor-Critic algorithm is applied to the motion planning of the mechanical arm, so that an interactive relation is established between the mechanical arm and the environment; training through real-time interaction with the environment improves the adaptive capacity of the mechanical arm to the environment, thereby realizing autonomous learning control. A simulation environment of the mechanical arm hand-eye system is first built, a reinforcement learning algorithm model is then established according to the simulation environment, and finally the motion planning training of the mechanical arm is completed to realize intelligent control of the mechanical arm. The mechanical arm motion planning algorithm based on reinforcement learning has good environmental adaptability and stability.
Drawings
FIG. 1 is a view of the 6-dof robot arm.
Fig. 2 is a reinforcement learning framework diagram.
FIG. 3 is a flow chart of the Actor-Critic algorithm.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
Mechanical arms are machine equipment commonly used in industrial production. The six-degree-of-freedom full-rotation-joint mechanical arm is a common mechanical arm structure in actual production environments, and such a mechanical arm can basically meet the requirements of general industrial production. Fig. 1 shows a typical industrial mechanical arm structure, in which all six joints are rotary joints and the axes of the last three joints intersect at one point. This structure has good kinematic solvability. Moreover, this mechanical arm structure can basically meet the positioning and grasping tasks of three-dimensional space in industrial production. Therefore, the mechanical arm with this classical structure is widely applied in industrial production.
As shown in fig. 2, reinforcement learning is based on the Markov decision process; similarly, the Actor-Critic algorithm is modeled on the Markov decision process, and the model is established under the assumption that the environment has the Markov property:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
(1) the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
(2) the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
(3) the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
(4) the set P of all state transition probability distributions, S × A → S, where p represents the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s'.
This can be expressed as follows: assuming the environment s_i (i = 1, 2, …, t) is fully observable, at any time t the agent is in state s_t; the agent then takes action a_t and transitions to the next state s_{t+1} according to p(s' | s, a), while obtaining an environment reward r. The goal of reinforcement learning is, by exploring different states of the environment, to finally learn a strategy a_t = π(s_t) that maximizes the cumulative feedback. The environment state s of reinforcement learning applied in the mechanical arm system is set as follows: the state observation information collected by the sensors is used as the reinforcement learning environment s of the mechanical arm hand-eye system, including the end position of the mechanical arm, the target point position, the obstacle, the distance from the end of the mechanical arm to the target point, and so on.
The action space of the mechanical arm motion planning comprises three dimensions in a three-dimensional Cartesian space, the action control is closely related to the change of the environment state, and the reinforcement learning action a completes the motion of the mechanical arm by controlling the tail end position of the mechanical arm. The state space under the robot arm motion planning task is generally a continuous space. The Markov process state space S and the action space A have correlation, and the state observation information comprises state transition information generated by strategy interaction in time series. The reinforcement learning strategy obtained in the simulation environment generally has the problem of difficult application in the real environment, and one of the main reasons is that the simulation environment is different from the real environment, and physical effects such as acceleration, gravity, rigid body density, object surface friction force and the like have certain errors in the virtual and real environments. In order to ensure that the strategy obtained by reinforcement learning training in the simulation environment has certain applicability in the real environment, the method avoids factors with larger difference between virtual and real conditions as much as possible when designing state information. The spatial coordinates have the same properties in the simulated environment as in the real environment. When the error between the mechanical arm model in the simulation environment and the mechanical arm model in the real environment is small, the state space using the position coordinates as the description information is more likely to be simultaneously applicable to the simulation and the real environment.
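One possible way to assemble the observation vector described above is sketched below; the exact fields, their ordering and the single-obstacle assumption are illustrative choices, not specified by the patent.

```python
import numpy as np

def build_state(end_effector_xyz, target_xyz, obstacle_xyz):
    """Concatenate end-effector position, target position, obstacle position
    and the end-to-target distance into one reinforcement-learning state s."""
    end_effector_xyz = np.asarray(end_effector_xyz, dtype=float)
    target_xyz = np.asarray(target_xyz, dtype=float)
    obstacle_xyz = np.asarray(obstacle_xyz, dtype=float)
    distance = np.linalg.norm(target_xyz - end_effector_xyz)
    return np.concatenate([end_effector_xyz, target_xyz, obstacle_xyz, [distance]])
```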
Another core problem of the strategy control of the mechanical arm in the exploration process is how to obtain a reward value from the current observed state during training. The reward value r is 1 when the end of the mechanical arm is at the target point; in all other cases, when the mechanical arm is not at the target point, the reward function is designed as a continuous reward value. Given the initial end position of the mechanical arm [x_0, y_0, z_0], the current end coordinates of the mechanical arm [x_T, y_T, z_T] and the coordinates of the target point [x_g, y_g, z_g], the reward value r is kept in the range [-1, 1]:
[equation shown as an image in the original filing: the continuous reward r computed from the initial, current and target end coordinates, bounded to [-1, 1]]
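Because the continuous reward formula appears only as an image in the original filing, the sketch below is just one plausible realization consistent with the surrounding text: r = 1 at the target point, otherwise a negative distance ratio clipped to [-1, 1]; the tolerance and the exact ratio are assumptions of this example.

```python
import numpy as np

def reward(p0, pT, pg, tolerance=0.01):
    """p0: initial end position [x_0, y_0, z_0], pT: current end position, pg: target point."""
    p0, pT, pg = (np.asarray(p, dtype=float) for p in (p0, pT, pg))
    dist_now = np.linalg.norm(pg - pT)
    if dist_now < tolerance:
        return 1.0                                   # the arm end has reached the target point
    dist_init = np.linalg.norm(pg - p0) + 1e-8       # avoid division by zero
    return float(np.clip(-dist_now / dist_init, -1.0, 1.0))
```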
The motion control process applied to the mechanical arm based on the reinforcement learning Actor-Critic algorithm can be divided into the following steps:
1. Initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i.
2. Inputting the state s_i into the strategy π* to obtain the action information a.
3. Executing the action a, and observing the environment state in the next time step.
4. Judging whether the target state is reached; if so, ending, and if not, returning to step 2.
The mechanical arm motion planning based on reinforcement learning is similar to closed-loop control: the control strategy is responsible for outputting the action information a; after the agent executes the action a, the environment state transitions, and the control strategy makes a decision on a new action based on the new environment state, cycling continuously until the goal is achieved.
The update mode of the Actor-Critic algorithm comprises value-evaluation updating and strategy-evaluation updating. Reinforcement learning methods based on value evaluation are somewhat inadequate for continuous action spaces, while reinforcement learning methods based on strategy evaluation suffer from slow convergence. The Actor-Critic algorithm framework integrates the value-based method and the policy-based method. The algorithm can generally be divided into a decision part and an evaluation part: the strategy part makes decisions according to the state, similarly to the policy gradient method, while the other part can be understood as improving the value part of the policy gradient with a value-based method. The Critic part of the algorithm estimates the action value function:
Q_w(s, a) ≈ Q^{π_θ}(s, a);
the value function Q_w(s, a) is a function composed of the parameter w. Combining the policy gradient under the Actor-Critic framework:
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)];
the parameter updating direction of the gradient part of the Actor-Critic strategy is obtained. The algorithm still needs to determine the update direction of the parameters of the Critic part. Updating the Critic part by adopting a TD method to obtain a TD error of the Critic part:
δ = r + γQ_w(s', a') - Q_w(s, a)   (5);
the strategy gradient method parameter updating form based on the Actor-Critic framework is as follows:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a);
wherein, α and β are parameter update step lengths of the Actor and Critic parts respectively.
The operation flow of the Actor-Critic algorithm is shown in fig. 3: the Actor network receives the input state and selects and outputs the action variable, and the Critic network evaluates the quality of the selected action and computes the reward value. The specific flow, as shown in fig. 3, is:
1. Input the current environment information s_i into the action strategy function neural network, output the probability distribution of the action information a, select the action with the higher probability, execute the action, and obtain the environment feedback r and the environment state s_{i+1} at the next moment.
2. Input the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment.
3. Calculate the temporal-difference error from the environment feedback r and the state values v.
4. Calculate the loss function of the action strategy function neural network through the temporal-difference error, and update the action strategy function neural network by back propagation.
5. Calculate the loss function of the value function network, and update the value function neural network by back propagation.
6. Repeat the above process to complete the Actor-Critic algorithm training process.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (9)

1. An industrial mechanical arm motion planning method based on a reinforcement learning algorithm, characterized by comprising the following steps:
step 1: building a simulation environment of the mechanical arm hand-eye system;
step 2: establishing a reinforcement learning Actor-Critic algorithm model framework;
step 3: based on the Actor-Critic algorithm model framework established in step 2, perfecting the strategy function algorithm model of the Actor part, establishing a strategy function neural network, optimizing the strategy parameters, and searching for the optimal strategy;
step 4: establishing the Critic-part value function algorithm model for reinforcement learning according to the strategy function algorithm model in step 3, establishing a value function neural network, and evaluating the quality of the actions selected by the strategy function;
step 5: completing the motion planning training of the mechanical arm and realizing the intelligent control of the mechanical arm.
2. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: in step 1, the simulation environment based on the reinforcement learning algorithm has the Markov property; the mathematical description of the Markov decision process is shown in equation (1), and a loop sequence shown below is formed in the order of state s, action a and feedback r:
S_1, A_1, R_1, S_2, A_2, R_2, …, S_t, A_t, R_t, …   (1);
the set of states S consists of all environment states s_i (i = 1, 2, …, t) up to time t;
the set of actions A consists of all executed actions a_i (i = 1, 2, …, t) up to time t;
the set of feedback R consists of all environment feedback r_i (i = 1, 2, …, t) up to time t, with S × A → R;
the set P of all state transition probability distributions, S × A → S;
p denotes the transition probability p(s' | s, a), i.e. the probability that selecting action a in state s transitions the environment state to state s';
the method specifically comprises the following steps:
step 1.1: initializing, setting the target state and the time step length, and observing to obtain the current environment state s_i;
step 1.2: inputting the current environment state s_i into the strategy π* to obtain action information a;
step 1.3: executing the action information a, and observing the environment state at the next moment;
step 1.4: judging whether the environment state in step 1.3 has reached the target state; if so, ending, otherwise returning to step 1.2.
3. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: in the step 2, the method specifically comprises the following steps:
step 2.1: inputting the current environment information s_i into the action strategy function neural network, outputting the probability distribution of the action information a, selecting the action with the higher probability, executing the action, and obtaining the environment feedback r and the environment state s_{i+1} at the next moment;
step 2.2: inputting the current environment information s_i and the next-moment environment information s_{i+1} into the value function neural network respectively, obtaining the value v_i of the current state and the value v_{i+1} of the state at the next moment;
step 2.3: calculating the temporal-difference error from the environment feedback r and the state values v;
step 2.4: calculating the loss function of the action strategy function neural network through the temporal-difference error, and updating the action strategy function neural network by back propagation;
step 2.5: calculating the loss function of the value function network, and updating the value function neural network by back propagation;
step 2.6: completing the training process of the reinforcement learning algorithm.
4. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 1, wherein: the reinforcement learning algorithm, namely the Actor-Critic algorithm, adopts an action strategy function and a value function; the action strategy function of the Actor part selects actions based on Gaussian distribution, and the value function of the Critic part evaluates the actions selected by the action strategy function.
5. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the value function of the Critic part has the Markov property, and a temporal-difference method is adopted to ensure that the value function is updated in real time; commonly used value functions include the state value function and the action value function.
6. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the state value function is as shown in equation (2):
V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]   (2);
where V(S_t) and V(S_{t+1}) are the state value functions representing the expectation of the sum of rewards from the state of the agent at time t and at time t+1, respectively, to the final state; α is the learning rate of reinforcement learning and represents the learning efficiency of the mechanical arm according to the environment feedback; S_t is the set of environment states at the t-th time node; after the state S_t interacts with the environment, the environment transitions at the next time node to the state S_{t+1} and the instant reward R_{t+1} is received; the direction of each update of the state value function is R_{t+1} + γV(S_{t+1}) - V(S_t).
7. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 5, wherein: the action value function is defined similarly to the state value function; introducing the action information into the state value function gives the action value function Q(S_t, A_t), where A_t is the action set at the t-th time node. The specific update form of the action value function is shown in equation (3):
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]   (3);
where Q(S_t, A_t) represents the expectation of the sum of rewards, up to the final state, after the mechanical arm selects the action, and indicates the long-term influence on the action strategy function π(s) when action a is taken in the current state s; α is the learning rate of reinforcement learning and γ is the discount coefficient of the reward; when the state S_{t+1} is the end state, the corresponding action value function Q(S_{t+1}, A_{t+1}) is zero; each update of the action value requires the five elements (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}); taking max_a Q(S_{t+1}, a) as the action value of the next time node at update time, i.e. taking the optimal past action value max_a Q(S_{t+1}, a) as the learning target, accelerates the exploration process and gives the following formula:
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]   (4).
8. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the action strategy function π(s, a, θ) of the Actor part consists of the action strategy parameter θ, the state s and the action a, and is used to represent the probability of taking any action; the optimal strategy is obtained by continuously optimizing and adjusting the action strategy parameter θ. A policy objective function is established, and the gradient of the action strategy parameter is solved from the policy objective function to obtain the optimization direction of the action strategy parameter; the goal of the action strategy parameter optimization is to increase the initial state value V(s_t);
the policy objective function is defined as:
J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) R_{s,a}   (5);
where J(θ) represents the defined policy objective function and is used to measure the quality of a strategy, and θ is the parameter of the strategy function updated by the Actor part; s_t denotes the state of the agent at time t and r_t denotes the environment feedback at time t; E_{π_θ}[·] denotes the expectation for the agent starting from s_t under the current strategy; d(s) represents the stationary distribution of the Markov decision process over the states under the current strategy, and the state s and the chosen action a are obtained by sampling from d(s); R_{s,a} represents the instant reward, i.e. the environment feedback obtained when each action is taken in each state at a time step, and is used to compute the expected environment feedback;
the policy objective function is differentiated using the differentiability of the policy, and the gradient of the objective function is obtained as shown in equation (6):
∇_θ J(θ) = Σ_s d(s) Σ_a π_θ(s, a) ∇_θ log π_θ(s, a) R_{s,a}   (6);
where ∇_θ log π_θ(s, a) is the score function, which smooths the action strategy parameter θ for use in action decisions. The above formula evaluates the action strategy on the basis of the environment feedback r; an objective function gradient based on the value function Q^{π_θ}(s, a) is more suitable for evaluating the quality of action decisions, so the value function Q^{π_θ}(s, a) is used instead of the environment feedback r, and the gradient of the objective function is obtained as shown in equation (7):
∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]   (7);
the policy gradient algorithm finds the maximum of the policy objective function J(θ) through gradient ascent, and the policy gradient is defined as:
Δθ = α ∇_θ J(θ)   (8);
where α is the training step size, Δθ represents the ascent direction of the action strategy parameter θ obtained under this training step size, and ∇_θ J(θ) is the gradient of the objective function obtained by differentiating the objective function J(θ).
The ascent direction of the action strategy parameter θ is obtained through equation (8), and the parameter θ is updated in the increasing direction:
θ_{t+1} = θ_t + Δθ   (9).
9. The industrial mechanical arm motion planning method based on the reinforcement learning algorithm as claimed in claim 4, wherein: the update form of the value function of the Critic part is shown in equation (10):
Q_w(s, a) ≈ Q^{π_θ}(s, a)   (10);
where w is the update parameter of the action value function Q_w(s, a) in the Critic part, and Q_w(s, a) represents the parameterized value function of the Critic part; θ is the update parameter of the objective policy function J(θ) in the Actor part, and Q^{π_θ}(s, a) represents the value function under the current strategy parameter θ; the Actor part uses the value function Q_w(s, a) of the Critic part to update the parameter θ of the action strategy function, and thereby updates the action strategy function π(s, a, θ) and the value function Q^{π_θ}(s, a).
The policy gradient combined with the Critic part is obtained as shown in equation (11):
∇_θ J(θ) ≈ E_{π_θ}[∇_θ log π_θ(s, a) Q_w(s, a)]   (11);
this gives the parameter update direction of the policy-gradient part of the reinforcement learning Actor-Critic algorithm; the algorithm still needs to determine the update direction of the parameters of the Critic-part value function. The value function of the Critic part is updated by a temporal-difference method similar to equation (3), giving the temporal-difference error of the Critic-part value function:
δ = r + γQ_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)   (12);
where Q_w(s_t, a_t) and Q_w(s_{t+1}, a_{t+1}) are the value functions corresponding to time t and time t+1, respectively.
Finally, the policy gradient parameter update form based on the Actor-Critic algorithm is:
θ ← θ + α ∇_θ log π_θ(s, a) Q_w(s, a),   w ← w + β δ ∇_w Q_w(s, a)   (13);
where α and β are the parameter update step sizes of the action strategy function (Actor) and the value function (Critic), respectively.
CN202110709508.9A 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm Pending CN113510704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709508.9A CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110709508.9A CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN113510704A true CN113510704A (en) 2021-10-19

Family

ID=78065896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709508.9A Pending CN113510704A (en) 2021-06-25 2021-06-25 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN113510704A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114012735A (en) * 2021-12-06 2022-02-08 山西大学 Mechanical arm control method and system based on deep reinforcement learning
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116803635A (en) * 2023-08-21 2023-09-26 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN117283565A (en) * 2023-11-03 2023-12-26 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109884897A (en) * 2019-03-21 2019-06-14 中山大学 A kind of matching of unmanned plane task and computation migration method based on deeply study
CN112060082A (en) * 2020-08-19 2020-12-11 大连理工大学 Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109884897A (en) * 2019-03-21 2019-06-14 中山大学 A kind of matching of unmanned plane task and computation migration method based on deeply study
CN112060082A (en) * 2020-08-19 2020-12-11 大连理工大学 Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘建平: "Reinforcement Learning (XIV)", Retrieved from the Internet <URL:https://www.cnblogs.com/pinard/p/10272023.html> *
方丹: "Research on variance-related policy gradient methods" *
李娟: "Research on motion planning control and grasping strategies of autonomous manipulation robots", pages 35 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139472B (en) * 2021-11-04 2023-05-02 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-mode structure
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114012735A (en) * 2021-12-06 2022-02-08 山西大学 Mechanical arm control method and system based on deep reinforcement learning
CN114012735B (en) * 2021-12-06 2022-08-05 山西大学 Mechanical arm control method and system based on deep reinforcement learning
WO2023116742A1 (en) * 2021-12-21 2023-06-29 清华大学 Energy-saving optimization method and apparatus for terminal air conditioning system of integrated data center cabinet
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116476042B (en) * 2022-12-31 2024-01-12 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116803635A (en) * 2023-08-21 2023-09-26 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN116803635B (en) * 2023-08-21 2023-12-22 南京邮电大学 Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
CN117283565A (en) * 2023-11-03 2023-12-26 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback
CN117283565B (en) * 2023-11-03 2024-03-22 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback

Similar Documents

Publication Publication Date Title
CN113510704A (en) Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN109948642B (en) Multi-agent cross-modal depth certainty strategy gradient training method based on image input
CN108161934B (en) Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
Zhang et al. Deep interactive reinforcement learning for path following of autonomous underwater vehicle
Leottau et al. Decentralized reinforcement learning of robot behaviors
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN111881772A (en) Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN112338921A (en) Mechanical arm intelligent control rapid training method based on deep reinforcement learning
CN113093526B (en) Overshoot-free PID controller parameter setting method based on reinforcement learning
CN116460860B (en) Model-based robot offline reinforcement learning control method
CN113821045A (en) Leg and foot robot reinforcement learning action generation system
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
Jauhri et al. Interactive imitation learning in state-space
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN115416024A (en) Moment-controlled mechanical arm autonomous trajectory planning method and system
Li et al. Navigation of mobile robots based on deep reinforcement learning: Reward function optimization and knowledge transfer
Xu et al. Learning strategy for continuous robot visual control: A multi-objective perspective
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
CN117086877A (en) Industrial robot shaft hole assembly method, device and equipment based on deep reinforcement learning
CN117207186A (en) Assembly line double-mechanical-arm collaborative grabbing method based on reinforcement learning
CN114779661B (en) Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
CN115674204A (en) Robot shaft hole assembling method based on deep reinforcement learning and admittance control
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN112264995B (en) Robot double-shaft hole assembling method based on hierarchical reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination