CN108052004B - Industrial mechanical arm automatic control method based on deep reinforcement learning - Google Patents

Industrial mechanical arm automatic control method based on deep reinforcement learning Download PDF

Info

Publication number
CN108052004B
CN108052004B (application CN201711275146.7A)
Authority
CN
China
Prior art keywords
network
state
value
target
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711275146.7A
Other languages
Chinese (zh)
Other versions
CN108052004A (en)
Inventor
柯丰恺
周唯倜
赵大兴
孙国栋
许万
丁国龙
吴震宇
赵迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201711275146.7A priority Critical patent/CN108052004B/en
Publication of CN108052004A publication Critical patent/CN108052004A/en
Application granted granted Critical
Publication of CN108052004B publication Critical patent/CN108052004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G05B13/045Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance using a perturbation signal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Automation & Control Theory (AREA)
  • Mechanical Engineering (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to an automatic control method for an industrial mechanical arm based on deep reinforcement learning, which constructs a deep reinforcement learning model, constructs output interference, establishes a reward r_t calculation model, builds a simulation environment, accumulates an experience pool, trains the deep reinforcement learning neural network, and uses the trained deep reinforcement learning model to control the motion of the mechanical arm in practice. By adding the deep reinforcement learning network, the problem of automatically controlling the mechanical arm in a complex environment is solved, automatic control of the mechanical arm is completed, and after training the mechanical arm runs quickly and with high precision.

Description

Industrial mechanical arm automatic control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of reinforcement learning algorithms, and particularly relates to an automatic control method of an industrial mechanical arm based on deep reinforcement learning.
Background
Compared with manual labour, an industrial mechanical arm can complete simple, repetitive and heavy operations more efficiently, greatly improving production efficiency, reducing labour cost and labour intensity, and lowering the probability of workplace accidents while guaranteeing production quality. In harsh environments such as high temperature, high pressure, low temperature, low pressure, dust, flammability and explosiveness, replacing manual operation with a mechanical arm prevents accidents caused by operator negligence and is therefore of great significance.
The motion-solving process of a mechanical arm first obtains the pose information of the grasped target, then obtains the pose information of the mechanical arm, and solves the rotation angle of each axis through inverse dynamics. Because the joints and links flex during motion, the structure deforms and precision drops, so controlling a flexible mechanical arm is a major challenge. Common control methods include PID control, force-feedback control, adaptive control, and fuzzy and neural-network control. Among them, neural-network control has the distinct advantage of not requiring a mathematical model of the controlled object; in the era of artificial intelligence, automatic control based on neural networks will become the mainstream.
Disclosure of Invention
The invention aims to provide an automatic control method of an industrial mechanical arm based on deep reinforcement learning, which solves the problem of automatic control of the mechanical arm in a complex environment by adding a deep reinforcement learning network and completes the automatic control of the mechanical arm.
In order to achieve the purpose, the invention provides an industrial mechanical arm automatic control method based on deep reinforcement learning, which is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
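As a concrete illustration of step 1), the following sketch initializes the experience pool as an m × n zero matrix and builds the four networks with PyTorch. The layer sizes, the capacity m and the dimensions state_dim and action_dim are assumptions chosen for illustration, not the architecture prescribed by the patent.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim = 14, 3               # assumed dimensions, for illustration only
m = 10000                                    # experience-pool capacity (rows)
n = 2 * state_dim + action_dim + 1           # s_t, a_t, r_t, s_{t+1} per row
memory = np.zeros((m, n))                    # 1.1) experience pool initialised to 0

class Mu(nn.Module):                         # behavior network mu(s | theta_mu)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Q(nn.Module):                          # evaluation network Q(s, a | theta_Q)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

mu_eval, q_eval = Mu(), Q()                  # randomly initialised estimation networks
mu_target = copy.deepcopy(mu_eval)           # theta_mu' <- theta_mu
q_target = copy.deepcopy(q_eval)             # theta_Q'  <- theta_Q
```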
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
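A minimal sketch of step 2), reusing mu_eval from the step-1 sketch: the network action a_t′ is perturbed by drawing the actual action a_t from N(a_t′, var²); the initial value of var is an assumption.

```python
import numpy as np
import torch

var = 3.0                                    # initial exploration variance (assumed)

def select_action(mu_eval, s_t, var):
    """Draw a_t from N(a_t', var^2), where a_t' = mu(s_t | theta_mu_t)."""
    with torch.no_grad():
        a_prime = mu_eval(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    return np.random.normal(loc=a_prime, scale=var)   # actual output action a_t
```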
step 3) establishing a reward rtCalculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
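A sketch of how step 5) can fill the m × n experience pool from step 1); the environment interface in the commented usage is a placeholder for the V-REP simulation.

```python
import numpy as np

def store_transition(memory, counter, s_t, a_t, r_t, s_next):
    """Write one transition (s_t, a_t, r_t, s_{t+1}) into the m x n experience pool."""
    row = np.hstack([s_t, a_t, [r_t], s_next])     # length n = 2*state_dim + action_dim + 1
    memory[counter % memory.shape[0], :] = row     # oldest rows are overwritten once full
    return counter + 1

# usage sketch (env is a placeholder for the V-REP simulation wrapper):
# counter, s_t = 0, env.reset()
# while counter < memory.shape[0]:
#     a_t = select_action(mu_eval, s_t, var)
#     s_next, r_t = env.step(a_t)
#     counter = store_transition(memory, counter, s_t, a_t, r_t, s_next)
#     s_t = s_next
```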
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
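The outer loop of step 6) can be sketched as follows; xm = 600 and xn = 200 mirror the 600 rounds of 200 steps reported in the experimental data, while the decay factor γ is an assumed value.

```python
xm, xn = 600, 200          # rounds and repetitions per round, as in the experiments
gamma_decay = 0.9995       # gamma in var <- max{0.1, var * gamma} (assumed value)
var = 3.0                  # initial exploration variance (assumed)

for episode in range(xm):
    for step in range(xn):
        # 6.1)-6.4): sample a batch, update the evaluation, behavior and target
        # networks (see the sketches given with steps 6.2), 6.3) and 6.4))
        var = max(0.1, var * gamma_decay)
```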
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
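A hedged sketch of step 7): the camera picture is Gaussian-filtered with OpenCV and fed to the trained behavior network; encode, send_joint_angles and target_grasped are placeholders for the image-to-state encoding, the robot interface and the success test, which the patent does not specify.

```python
import cv2
import torch

def get_state(cap):
    """7.1) Gaussian-filtered camera picture used as the state."""
    ok, frame = cap.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

def control_loop(cap, mu_eval, encode, send_joint_angles, target_grasped):
    """7.2) closed loop: state -> policy action -> arm rotation -> next state."""
    s_t = get_state(cap)
    while not target_grasped():
        with torch.no_grad():
            a_t = mu_eval(torch.as_tensor(encode(s_t), dtype=torch.float32)).numpy()
        send_joint_angles(a_t)      # placeholder robot interface: rotate each axis by a_t
        s_t = get_state(cap)        # the subsequent state becomes the current state
```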
Further, in step 3), the specific process of establishing the reward r_t calculation model is as follows:
at time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t (i.e. the rotation angle of each axis of the arm) is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
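The patent gives the reward formula only as an image; the sketch below uses the negative Euclidean distance between the arm end position and the target position, which is one common choice consistent with the quantities named in step 3) but is an assumption rather than the patented formula.

```python
import numpy as np

def reward(tip_xyz, target_xyz):
    """Assumed reward: negative distance between arm tip and target."""
    x1, y1, z1 = tip_xyz        # (x1_t, y1_t, z1_t): arm end position
    x0, y0, z0 = target_xyz     # (x0_t, y0_t, z0_t): target position
    return -np.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2)
```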
Further, in step 6.2), the specific process of updating the evaluation network parameters is as follows:
for the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss.
Further, in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
for s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent.
Further, in step 6.4), the specific process of updating the target network parameters is as follows:
the network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K.
Compared with the prior art, the invention has the following advantages: the industrial mechanical arm automatic control method based on deep reinforcement learning solves the problem of automatically controlling the mechanical arm in a complex environment by adding a deep reinforcement learning network and completes the automatic control of the mechanical arm; after training is completed, the mechanical arm runs quickly and with high precision.
Drawings
Fig. 1 is a flow chart of the industrial robot automatic control method based on deep reinforcement learning of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Fig. 1 is a schematic flow chart of an industrial robot arm automatic control method based on deep reinforcement learning, which includes the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
step 3) establishing a reward rtCalculation model
At time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t (i.e. the rotation angle of each axis of the arm) is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
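A hedged sketch of wrapping the V-REP scene through the legacy Python remote API; the module name vrep, the port and the object names in the scene are assumptions that depend on the local installation, and the six-joint count is for illustration only.

```python
import vrep   # legacy V-REP remote API bindings (shipped with V-REP; name may differ locally)

client = vrep.simxStart('127.0.0.1', 19997, True, True, 5000, 5)   # port is an assumption

def get_handle(name):
    _, handle = vrep.simxGetObjectHandle(client, name, vrep.simx_opmode_blocking)
    return handle

joints = [get_handle('joint%d' % i) for i in range(1, 7)]   # hypothetical joint names in the scene
tip, target = get_handle('tip'), get_handle('target')       # hypothetical dummy names

def apply_action(angles):
    """Rotate each axis to the angle chosen by the policy."""
    for h, ang in zip(joints, angles):
        vrep.simxSetJointTargetPosition(client, h, float(ang), vrep.simx_opmode_oneshot)

def tip_and_target_positions():
    """Absolute positions used by the reward model of step 3)."""
    _, tip_xyz = vrep.simxGetObjectPosition(client, tip, -1, vrep.simx_opmode_blocking)
    _, tgt_xyz = vrep.simxGetObjectPosition(client, target, -1, vrep.simx_opmode_blocking)
    return tip_xyz, tgt_xyz
```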
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
For the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss;
6.3) updating the behavior estimation network parameters
For s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent;
6.4) updating the target network parameters
The network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K;
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, i.e. var becomes the larger of 0.1 and the decayed value of var from the previous moment, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
Experimental data
The experimental goal is to control the mechanical arm, in a simulation environment of a SCARA robot, to automatically position and grasp the target through the deep reinforcement learning neural network. The experiment was set to 600 training rounds of 200 steps each. After training, the target can be grasped within 20-30 steps, which meets the requirements of modern industrial production lines. Traditional mechanical arm control, by contrast, requires an explicit mathematical model and a large amount of computation to solve the inverse dynamics in real time.

Claims (5)

1. An industrial mechanical arm automatic control method based on deep reinforcement learning is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
step 3) establishing a reward r_t calculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
2. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 3), the specific process of establishing the reward r_t calculation model is as follows:
at time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
3. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.2), the specific process of updating the evaluation network parameters is as follows:
for the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss.
4. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
for s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent.
5. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.4), the specific process of updating the target network parameters is as follows:
the network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K, J and K being numbers of training rounds.
CN201711275146.7A 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning Active CN108052004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711275146.7A CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711275146.7A CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108052004A CN108052004A (en) 2018-05-18
CN108052004B true CN108052004B (en) 2020-11-10

Family

ID=62121722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711275146.7A Active CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108052004B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803615B (en) * 2018-07-03 2021-03-23 东南大学 Virtual human unknown environment navigation algorithm based on deep reinforcement learning
CN109240280B (en) * 2018-07-05 2021-09-07 上海交通大学 Anchoring auxiliary power positioning system control method based on reinforcement learning
CN109242099B (en) * 2018-08-07 2020-11-10 中国科学院深圳先进技术研究院 Training method and device of reinforcement learning network, training equipment and storage medium
CN108927806A (en) * 2018-08-13 2018-12-04 哈尔滨工业大学(深圳) A kind of industrial robot learning method applied to high-volume repeatability processing
CN109379752B (en) * 2018-09-10 2021-09-24 中国移动通信集团江苏有限公司 Massive MIMO optimization method, device, equipment and medium
CN109352648B (en) * 2018-10-12 2021-03-02 北京地平线机器人技术研发有限公司 Mechanical mechanism control method and device and electronic equipment
CN109352649B (en) * 2018-10-15 2021-07-20 同济大学 Manipulator control method and system based on deep learning
CN109614631B (en) * 2018-10-18 2022-10-14 清华大学 Aircraft full-automatic pneumatic optimization method based on reinforcement learning and transfer learning
CN109483534B (en) * 2018-11-08 2022-08-02 腾讯科技(深圳)有限公司 Object grabbing method, device and system
CN109948642B (en) * 2019-01-18 2023-03-28 中山大学 Multi-agent cross-modal depth certainty strategy gradient training method based on image input
CN109800864B (en) * 2019-01-18 2023-05-30 中山大学 Robot active learning method based on image input
CN109605377B (en) * 2019-01-21 2020-05-22 厦门大学 Robot joint motion control method and system based on reinforcement learning
CN111476257A (en) * 2019-01-24 2020-07-31 富士通株式会社 Information processing method and information processing apparatus
CN110070099A (en) * 2019-02-20 2019-07-30 北京航空航天大学 A kind of industrial data feature structure method based on intensified learning
CN110238839B (en) * 2019-04-11 2020-10-20 清华大学 Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN110053034A (en) * 2019-05-23 2019-07-26 哈尔滨工业大学 A kind of multi purpose space cellular machineries people's device of view-based access control model
CN110125939B (en) * 2019-06-03 2020-10-20 湖南工学院 Virtual visual control method for robot
CN110053053B (en) * 2019-06-14 2022-04-12 西南科技大学 Self-adaptive method of mechanical arm screwing valve based on deep reinforcement learning
DE102019209616A1 (en) * 2019-07-01 2021-01-07 Kuka Deutschland Gmbh Carrying out a given task with the aid of at least one robot
US20220339787A1 (en) * 2019-07-01 2022-10-27 Kuka Deutschland Gmbh Carrying out an application using at least one robot
CN110370295B (en) * 2019-07-02 2020-12-18 浙江大学 Small-sized football robot active control ball suction method based on deep reinforcement learning
CN110400345B (en) * 2019-07-24 2021-06-15 西南科技大学 Deep reinforcement learning-based radioactive waste push-grab cooperative sorting method
CN110900601B (en) * 2019-11-15 2022-06-03 武汉理工大学 Robot operation autonomous control method for human-robot cooperation safety guarantee
CN110826701A (en) * 2019-11-15 2020-02-21 北京邮电大学 Method for carrying out system identification on two-degree-of-freedom flexible leg based on BP neural network algorithm
TWI790408B (en) * 2019-11-19 2023-01-21 財團法人工業技術研究院 Gripping device and gripping method
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110909859B (en) * 2019-11-29 2023-03-24 中国科学院自动化研究所 Bionic robot fish motion control method and system based on antagonistic structured control
CN111223141B (en) * 2019-12-31 2023-10-24 东华大学 Automatic pipeline work efficiency optimization system and method based on reinforcement learning
CN111360834B (en) * 2020-03-25 2023-04-07 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111461325B (en) * 2020-03-30 2023-06-20 华南理工大学 Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem
CN111487863B (en) * 2020-04-14 2022-06-17 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111618847B (en) * 2020-04-22 2022-06-21 南通大学 Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements
CN111644398A (en) * 2020-05-28 2020-09-11 华中科技大学 Push-grab cooperative sorting network based on double viewing angles and sorting method and system thereof
CN111515961B (en) * 2020-06-02 2022-06-21 南京大学 Reinforcement learning reward method suitable for mobile mechanical arm
CN111881772B (en) * 2020-07-06 2023-11-07 上海交通大学 Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN112506044A (en) * 2020-09-10 2021-03-16 上海交通大学 Flexible arm control and planning method based on visual feedback and reinforcement learning
CN112434464B (en) * 2020-11-09 2021-09-10 中国船舶重工集团公司第七一六研究所 Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG algorithm
CN112338921A (en) * 2020-11-16 2021-02-09 西华师范大学 Mechanical arm intelligent control rapid training method based on deep reinforcement learning
CN112405543B (en) * 2020-11-23 2022-05-06 长沙理工大学 Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning
CN112643668B (en) * 2020-12-01 2022-05-24 浙江工业大学 Mechanical arm pushing and grabbing cooperation method suitable for intensive environment
CN112809696B (en) * 2020-12-31 2022-03-15 山东大学 Omnibearing intelligent nursing system and method for high-infectivity isolated disease area
CN113159410B (en) * 2021-04-14 2024-02-27 北京百度网讯科技有限公司 Training method of automatic control model and fluid supply system control method
CN113283167A (en) * 2021-05-24 2021-08-20 暨南大学 Special equipment production line optimization method and system based on safety reinforcement learning
CN113510709B (en) * 2021-07-28 2022-08-19 北京航空航天大学 Industrial robot pose precision online compensation method based on deep reinforcement learning
CN113843802B (en) * 2021-10-18 2023-09-05 南京理工大学 Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm
CN114789444B (en) * 2022-05-05 2022-12-16 山东省人工智能研究院 Compliant human-computer contact method based on deep reinforcement learning and impedance control
CN115464659B (en) * 2022-10-05 2023-10-24 哈尔滨理工大学 Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm
CN117618125A (en) * 2024-01-25 2024-03-01 科弛医疗科技(北京)有限公司 Image trolley

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017083772A1 (en) * 2015-11-12 2017-05-18 Google Inc. Asynchronous deep reinforcement learning
CN105690392A (en) * 2016-04-14 2016-06-22 苏州大学 Robot motion control method and device based on actor-critic method
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning State Representation for Deep Actor-Critic Control; Jelle Munk et al.; 2016 IEEE 55th Conference on Decision and Control; 20161229; pp. 4667-4673 *
Research on Learning Algorithms for Robot Soccer Behavior Control; Tang Peng; China Master's Theses Full-text Database, Information Science and Technology; 20160815 (No. 08); full text *

Also Published As

Publication number Publication date
CN108052004A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108052004B (en) Industrial mechanical arm automatic control method based on deep reinforcement learning
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN111515961B (en) Reinforcement learning reward method suitable for mobile mechanical arm
Laskey et al. Robot grasping in clutter: Using a hierarchy of supervisors for learning from demonstrations
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN109978176B (en) Multi-agent cooperative learning method based on state dynamic perception
CN109782600A (en) A method of autonomous mobile robot navigation system is established by virtual environment
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN110076772A (en) A kind of grasping means of mechanical arm and device
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN109407644A (en) One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN113076615B (en) High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning
CN107992939B (en) Equal cutting force gear machining method based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113524186B (en) Deep reinforcement learning double-arm robot control method and system based on demonstration examples
CN113232019A (en) Mechanical arm control method and device, electronic equipment and storage medium
CN116038691A (en) Continuous mechanical arm motion control method based on deep reinforcement learning
CN114888801A (en) Mechanical arm control method and system based on offline strategy reinforcement learning
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
CN116352715A (en) Double-arm robot cooperative motion control method based on deep reinforcement learning
Zakaria et al. Robotic control of the deformation of soft linear objects using deep reinforcement learning
CN114077258B (en) Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
Jiang et al. Mastering the complex assembly task with a dual-arm robot based on deep reinforcement learning: A novel reinforcement learning method
CN116533234A (en) Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN116541701A (en) Training data generation method, intelligent body training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant