CN108052004B - Industrial mechanical arm automatic control method based on deep reinforcement learning - Google Patents

Industrial mechanical arm automatic control method based on deep reinforcement learning Download PDF

Info

Publication number
CN108052004B
CN108052004B (application CN201711275146.7A)
Authority
CN
China
Prior art keywords
network
state
value
target
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711275146.7A
Other languages
Chinese (zh)
Other versions
CN108052004A (en)
Inventor
柯丰恺
周唯倜
赵大兴
孙国栋
许万
丁国龙
吴震宇
赵迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201711275146.7A priority Critical patent/CN108052004B/en
Publication of CN108052004A publication Critical patent/CN108052004A/en
Application granted granted Critical
Publication of CN108052004B publication Critical patent/CN108052004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G05B13/045Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance using a perturbation signal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Automation & Control Theory (AREA)
  • Mechanical Engineering (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to an automatic control method for an industrial mechanical arm based on deep reinforcement learning, which constructs a deep reinforcement learning model, constructs output interference, establishes a reward r_t calculation model, builds a simulation environment, accumulates an experience pool, trains the deep reinforcement learning neural network, and uses the trained deep reinforcement learning model to control the motion of the mechanical arm in practice. By adding the deep reinforcement learning network, the problem of automatically controlling the mechanical arm in a complex environment is solved, automatic control of the mechanical arm is completed, and after training the mechanical arm runs quickly and with high precision.

Description

Industrial mechanical arm automatic control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of reinforcement learning algorithms, and particularly relates to an automatic control method of an industrial mechanical arm based on deep reinforcement learning.
Background
Compared with manual labour, an industrial mechanical arm can complete simple, repetitive and heavy operations more efficiently, greatly improving production efficiency, reducing labour cost and labour intensity, and lowering the probability of workplace accidents while guaranteeing production quality. In harsh environments such as high temperature, high pressure, low temperature, low pressure, dust, flammability and explosiveness, replacing manual operation with a mechanical arm prevents accidents caused by operator negligence and is therefore of great significance.
The motion-solving process of a mechanical arm first obtains the pose information of the grasped target, then obtains the pose information of the mechanical arm, and solves the rotation angle of each axis through inverse dynamics. Because the joints and links flex during motion, the structure deforms and precision drops, so controlling a flexible mechanical arm is a major challenge. Common control methods include PID control, force-feedback control, adaptive control, and fuzzy and neural-network control. Among them, neural-network control has the distinct advantage of not requiring a mathematical model of the controlled object; in the era of artificial intelligence, automatic control based on neural networks will become the mainstream.
Disclosure of Invention
The invention aims to provide an automatic control method of an industrial mechanical arm based on deep reinforcement learning, which solves the problem of automatic control of the mechanical arm in a complex environment by adding a deep reinforcement learning network and completes the automatic control of the mechanical arm.
In order to achieve the purpose, the invention provides an industrial mechanical arm automatic control method based on deep reinforcement learning, which is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
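As a concrete illustration of step 1), the following sketch initializes the experience pool as an m × n zero matrix and builds the four networks with PyTorch. The layer sizes, the capacity m and the dimensions state_dim and action_dim are assumptions chosen for illustration, not the architecture prescribed by the patent.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim = 14, 3               # assumed dimensions, for illustration only
m = 10000                                    # experience-pool capacity (rows)
n = 2 * state_dim + action_dim + 1           # s_t, a_t, r_t, s_{t+1} per row
memory = np.zeros((m, n))                    # 1.1) experience pool initialised to 0

class Mu(nn.Module):                         # behavior network mu(s | theta_mu)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Q(nn.Module):                          # evaluation network Q(s, a | theta_Q)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

mu_eval, q_eval = Mu(), Q()                  # randomly initialised estimation networks
mu_target = copy.deepcopy(mu_eval)           # theta_mu' <- theta_mu
q_target = copy.deepcopy(q_eval)             # theta_Q'  <- theta_Q
```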
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
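A minimal sketch of step 2), reusing mu_eval from the step-1 sketch: the network action a_t′ is perturbed by drawing the actual action a_t from N(a_t′, var²); the initial value of var is an assumption.

```python
import numpy as np
import torch

var = 3.0                                    # initial exploration variance (assumed)

def select_action(mu_eval, s_t, var):
    """Draw a_t from N(a_t', var^2), where a_t' = mu(s_t | theta_mu_t)."""
    with torch.no_grad():
        a_prime = mu_eval(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    return np.random.normal(loc=a_prime, scale=var)   # actual output action a_t
```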
step 3) establishing a reward rtCalculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
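A sketch of how step 5) can fill the m × n experience pool from step 1); the environment interface in the commented usage is a placeholder for the V-REP simulation.

```python
import numpy as np

def store_transition(memory, counter, s_t, a_t, r_t, s_next):
    """Write one transition (s_t, a_t, r_t, s_{t+1}) into the m x n experience pool."""
    row = np.hstack([s_t, a_t, [r_t], s_next])     # length n = 2*state_dim + action_dim + 1
    memory[counter % memory.shape[0], :] = row     # oldest rows are overwritten once full
    return counter + 1

# usage sketch (env is a placeholder for the V-REP simulation wrapper):
# counter, s_t = 0, env.reset()
# while counter < memory.shape[0]:
#     a_t = select_action(mu_eval, s_t, var)
#     s_next, r_t = env.step(a_t)
#     counter = store_transition(memory, counter, s_t, a_t, r_t, s_next)
#     s_t = s_next
```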
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
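The outer loop of step 6) can be sketched as follows; xm = 600 and xn = 200 mirror the 600 rounds of 200 steps reported in the experimental data, while the decay factor γ is an assumed value.

```python
xm, xn = 600, 200          # rounds and repetitions per round, as in the experiments
gamma_decay = 0.9995       # gamma in var <- max{0.1, var * gamma} (assumed value)
var = 3.0                  # initial exploration variance (assumed)

for episode in range(xm):
    for step in range(xn):
        # 6.1)-6.4): sample a batch, update the evaluation, behavior and target
        # networks (see the sketches given with steps 6.2), 6.3) and 6.4))
        var = max(0.1, var * gamma_decay)
```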
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
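A hedged sketch of step 7): the camera picture is Gaussian-filtered with OpenCV and fed to the trained behavior network; encode, send_joint_angles and target_grasped are placeholders for the image-to-state encoding, the robot interface and the success test, which the patent does not specify.

```python
import cv2
import torch

def get_state(cap):
    """7.1) Gaussian-filtered camera picture used as the state."""
    ok, frame = cap.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

def control_loop(cap, mu_eval, encode, send_joint_angles, target_grasped):
    """7.2) closed loop: state -> policy action -> arm rotation -> next state."""
    s_t = get_state(cap)
    while not target_grasped():
        with torch.no_grad():
            a_t = mu_eval(torch.as_tensor(encode(s_t), dtype=torch.float32)).numpy()
        send_joint_angles(a_t)      # placeholder robot interface: rotate each axis by a_t
        s_t = get_state(cap)        # the subsequent state becomes the current state
```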
Further, in step 3), the specific process of establishing the reward r_t calculation model is as follows:
at time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t (i.e. the rotation angle of each axis of the arm) is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
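The patent gives the reward formula only as an image; the sketch below uses the negative Euclidean distance between the arm end position and the target position, which is one common choice consistent with the quantities named in step 3) but is an assumption rather than the patented formula.

```python
import numpy as np

def reward(tip_xyz, target_xyz):
    """Assumed reward: negative distance between arm tip and target."""
    x1, y1, z1 = tip_xyz        # (x1_t, y1_t, z1_t): arm end position
    x0, y0, z0 = target_xyz     # (x0_t, y0_t, z0_t): target position
    return -np.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2)
```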
Further, in step 6.2), the specific process of updating the evaluation network parameters is as follows:
for the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss.
Further, in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
for s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent.
Further, in step 6.4), the specific process of updating the target network parameters is as follows:
the network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K.
Compared with the prior art, the invention has the following advantages: the industrial mechanical arm automatic control method based on deep reinforcement learning solves the problem of automatically controlling the mechanical arm in a complex environment by adding a deep reinforcement learning network and completes the automatic control of the mechanical arm; after training is completed, the mechanical arm runs quickly and with high precision.
Drawings
Fig. 1 is a flow chart of the industrial robot automatic control method based on deep reinforcement learning of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Fig. 1 is a schematic flow chart of an industrial robot arm automatic control method based on deep reinforcement learning, which includes the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
step 3) establishing a reward rtCalculation model
At time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t (i.e. the rotation angle of each axis of the arm) is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
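A hedged sketch of wrapping the V-REP scene through the legacy Python remote API; the module name vrep, the port and the object names in the scene are assumptions that depend on the local installation, and the six-joint count is for illustration only.

```python
import vrep   # legacy V-REP remote API bindings (shipped with V-REP; name may differ locally)

client = vrep.simxStart('127.0.0.1', 19997, True, True, 5000, 5)   # port is an assumption

def get_handle(name):
    _, handle = vrep.simxGetObjectHandle(client, name, vrep.simx_opmode_blocking)
    return handle

joints = [get_handle('joint%d' % i) for i in range(1, 7)]   # hypothetical joint names in the scene
tip, target = get_handle('tip'), get_handle('target')       # hypothetical dummy names

def apply_action(angles):
    """Rotate each axis to the angle chosen by the policy."""
    for h, ang in zip(joints, angles):
        vrep.simxSetJointTargetPosition(client, h, float(ang), vrep.simx_opmode_oneshot)

def tip_and_target_positions():
    """Absolute positions used by the reward model of step 3)."""
    _, tip_xyz = vrep.simxGetObjectPosition(client, tip, -1, vrep.simx_opmode_blocking)
    _, tgt_xyz = vrep.simxGetObjectPosition(client, target, -1, vrep.simx_opmode_blocking)
    return tip_xyz, tgt_xyz
```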
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
For the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss;
6.3) updating the behavior estimation network parameters
For s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent;
6.4) updating the target network parameters
The network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K;
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, i.e. var becomes the larger of 0.1 and the decayed value of var from the previous moment, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
Experimental data
The experimental goal is to control the mechanical arm, in a simulation environment of a SCARA robot, to automatically position and grasp the target through the deep reinforcement learning neural network. The experiment was set to 600 training rounds of 200 steps each. After training, the target can be grasped within 20-30 steps, which meets the requirements of modern industrial production lines. Traditional mechanical arm control, by contrast, requires an explicit mathematical model and a large amount of computation to solve the inverse dynamics in real time.

Claims (5)

1. An industrial mechanical arm automatic control method based on deep reinforcement learning is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience pool initialization: setting the experience pool as a two-dimensional matrix with m rows and n columns and initializing every element of the matrix to 0, where m is the sample capacity, n is the number of values stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state and action_dim is the dimension of the action; a space for storing the reward information is reserved in the experience pool, and the "+1" in n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network; each part constructs an eval net and a target net with the same structure but different parameters, the eval net being the estimation network and the target net being the target network, so that four networks are formed: the μ(s|θ^μ) network, the μ(s|θ^μ′) network, the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network, where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network; randomly initializing the parameters θ^μ of the μ(s|θ^μ) network and the parameters θ^Q of the Q(s,a|θ^Q) network, then assigning the parameters θ^μ of the μ(s|θ^μ) network to the behavior target network, i.e. θ^μ′ ← θ^μ, and assigning the parameters θ^Q of the Q(s,a|θ^Q) network to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, the action value a_t′ is obtained through the μ(s_t|θ^μ_t) network; a random normal distribution N(a_t′, var²) with mean a_t′ and variance var² is then set, and the actual output action value a_t is drawn randomly from N(a_t′, var²); the random normal distribution N(a_t′, var²) applies a disturbance to the action value a_t′ in order to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t and t is the time of the current input state;
step 3) establishing a reward r_t calculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the major industrial robots; building on these models greatly reduces the difficulty of constructing the mechanical arm simulation environment, so a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtaining the action value a_t′ through the μ(s_t|θ^μ_t) network, obtaining the actual output action value a_t according to the output interference constructed in step 2), and receiving the reward r_t and the subsequent input state s_{t+1} from the environment; storing the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool, these four items being collectively referred to as the state transition information transition;
5.2) taking the subsequent input state s_{t+1} as the current input state s_t, repeating step 5.1), and storing the resulting state transition information transition in the experience pool;
5.3) repeating step 5.2) until the experience pool is full; after the experience pool is full, executing step 5.2) once more and then jumping to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Taking a batch of samples from the experience pool for neural network learning, where batch denotes a natural number (the number of samples drawn);
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) dividing the training into xm rounds, repeating steps 6.1)-6.4) xn times in each round, and after each repetition of steps 6.1)-6.4) updating the var value of the output interference to var = max{0.1, var × γ}, where xm and xn denote natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocessing the input of the industrial CCD camera, and taking the picture at time t, after Gaussian filtering, as the state for neural network processing;
7.2) obtaining the current input state s_t of the real environment through the camera, the deep reinforcement learning network controlling the rotation of the mechanical arm according to the current input state s_t to obtain the subsequent input state s_{t+1}; taking the subsequent input state s_{t+1} as the current input state s_t and repeating the above steps until the deep reinforcement learning model controls the mechanical arm to grasp the target.
2. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 3), the specific process of establishing the reward r_t calculation model is as follows:
at time t the mechanical arm obtains image information through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; in this state the actual output action value a_t is drawn randomly from the random normal distribution N(a_t′, var²) of step 2); with the end position coordinates of the mechanical arm being (x1_t, y1_t, z1_t) and the target position being (x0_t, y0_t, z0_t), the reward r_t is computed from these two positions.
3. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.2), the specific process of updating the evaluation network parameters is as follows:
for the state transition information transition of each of the batch samples taken in step 6.1), the estimated Q′ value eval_Q′ and the target Q′ value target_Q′ corresponding to each group of state transition information are obtained through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively, and the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is then obtained; t′ is the input-state time of each execution of step 5.2) after the experience pool has been filled in step 5.3);
the loss function Loss is constructed from the temporal-difference error TD_error′ as Loss = Σ TD_error′ / batch;
the evaluation estimation network parameters θ^Q are updated by gradient descent according to the loss function Loss.
4. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
for s_t in the state transition information transition of each of the batch samples, the corresponding actual output action value a_t is obtained through the μ(s|θ^μ) network and the output interference; the gradient ∇_{a_t} eval_Q′ of the estimated Q′ value with respect to the actual output action value a_t is obtained by differentiating the estimated Q′ value eval_Q′ of the Q(s,a|θ^Q) network with respect to a_t, where ∇_{a_t} denotes differentiation with respect to the actual output action value a_t; the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the parameters of the μ(s|θ^μ) network is obtained by differentiating the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters, where ∇_{θ^μ} denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_{a_t} eval_Q′ of the estimated Q value with respect to the actual output action value a_t and the gradient ∇_{θ^μ} a_t of the actual output action value a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are updated using gradient ascent.
5. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 6.4), the specific process of updating the target network parameters is as follows:
the network parameters of actor_eval are assigned to actor_target every J rounds, and the network parameters of critic_eval are assigned to critic_target every K rounds, where J ≠ K, J and K being numbers of training rounds.
CN201711275146.7A 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning Active CN108052004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711275146.7A CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711275146.7A CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108052004A CN108052004A (en) 2018-05-18
CN108052004B true CN108052004B (en) 2020-11-10

Family

ID=62121722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711275146.7A Active CN108052004B (en) 2017-12-06 2017-12-06 Industrial mechanical arm automatic control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108052004B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803615B (en) * 2018-07-03 2021-03-23 东南大学 Virtual human unknown environment navigation algorithm based on deep reinforcement learning
CN109240280B (en) * 2018-07-05 2021-09-07 上海交通大学 Anchoring auxiliary power positioning system control method based on reinforcement learning
CN109242099B (en) * 2018-08-07 2020-11-10 中国科学院深圳先进技术研究院 Training method and device of reinforcement learning network, training equipment and storage medium
CN108927806A (en) * 2018-08-13 2018-12-04 哈尔滨工业大学(深圳) A kind of industrial robot learning method applied to high-volume repeatability processing
CN109379752B (en) * 2018-09-10 2021-09-24 中国移动通信集团江苏有限公司 Massive MIMO optimization method, device, equipment and medium
CN109352648B (en) * 2018-10-12 2021-03-02 北京地平线机器人技术研发有限公司 Mechanical mechanism control method and device and electronic equipment
CN109352649B (en) * 2018-10-15 2021-07-20 同济大学 Manipulator control method and system based on deep learning
CN109614631B (en) * 2018-10-18 2022-10-14 清华大学 Aircraft full-automatic pneumatic optimization method based on reinforcement learning and transfer learning
CN109483534B (en) * 2018-11-08 2022-08-02 腾讯科技(深圳)有限公司 Object grabbing method, device and system
CN109948642B (en) * 2019-01-18 2023-03-28 中山大学 Multi-agent cross-modal depth certainty strategy gradient training method based on image input
CN109800864B (en) * 2019-01-18 2023-05-30 中山大学 Robot active learning method based on image input
CN109605377B (en) * 2019-01-21 2020-05-22 厦门大学 Robot joint motion control method and system based on reinforcement learning
CN111476257A (en) * 2019-01-24 2020-07-31 富士通株式会社 Information processing method and information processing apparatus
CN110070099A (en) * 2019-02-20 2019-07-30 北京航空航天大学 A kind of industrial data feature structure method based on intensified learning
CN110238839B (en) * 2019-04-11 2020-10-20 清华大学 Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN110053034A (en) * 2019-05-23 2019-07-26 哈尔滨工业大学 A kind of multi purpose space cellular machineries people's device of view-based access control model
CN110125939B (en) * 2019-06-03 2020-10-20 湖南工学院 Virtual visual control method for robot
CN110053053B (en) * 2019-06-14 2022-04-12 西南科技大学 Self-adaptive method of mechanical arm screwing valve based on deep reinforcement learning
DE102019209616A1 (en) * 2019-07-01 2021-01-07 Kuka Deutschland Gmbh Carrying out a given task with the aid of at least one robot
US20220339787A1 (en) * 2019-07-01 2022-10-27 Kuka Deutschland Gmbh Carrying out an application using at least one robot
CN110370295B (en) * 2019-07-02 2020-12-18 浙江大学 Small-sized football robot active control ball suction method based on deep reinforcement learning
CN110400345B (en) * 2019-07-24 2021-06-15 西南科技大学 Deep reinforcement learning-based radioactive waste push-grab cooperative sorting method
CN110900601B (en) * 2019-11-15 2022-06-03 武汉理工大学 Robot operation autonomous control method for human-robot cooperation safety guarantee
CN110826701A (en) * 2019-11-15 2020-02-21 北京邮电大学 Method for carrying out system identification on two-degree-of-freedom flexible leg based on BP neural network algorithm
TWI790408B (en) * 2019-11-19 2023-01-21 財團法人工業技術研究院 Gripping device and gripping method
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110909859B (en) * 2019-11-29 2023-03-24 中国科学院自动化研究所 Bionic robot fish motion control method and system based on antagonistic structured control
CN111223141B (en) * 2019-12-31 2023-10-24 东华大学 Automatic pipeline work efficiency optimization system and method based on reinforcement learning
CN111360834B (en) * 2020-03-25 2023-04-07 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111461325B (en) * 2020-03-30 2023-06-20 华南理工大学 Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem
CN111487863B (en) * 2020-04-14 2022-06-17 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111618847B (en) * 2020-04-22 2022-06-21 南通大学 Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements
CN111644398A (en) * 2020-05-28 2020-09-11 华中科技大学 Push-grab cooperative sorting network based on double viewing angles and sorting method and system thereof
CN111515961B (en) * 2020-06-02 2022-06-21 南京大学 Reinforcement learning reward method suitable for mobile mechanical arm
CN111881772B (en) * 2020-07-06 2023-11-07 上海交通大学 Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN112506044A (en) * 2020-09-10 2021-03-16 上海交通大学 Flexible arm control and planning method based on visual feedback and reinforcement learning
CN112434464B (en) * 2020-11-09 2021-09-10 中国船舶重工集团公司第七一六研究所 Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG algorithm
CN112338921A (en) * 2020-11-16 2021-02-09 西华师范大学 Mechanical arm intelligent control rapid training method based on deep reinforcement learning
CN112405543B (en) * 2020-11-23 2022-05-06 长沙理工大学 Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning
CN112643668B (en) * 2020-12-01 2022-05-24 浙江工业大学 Mechanical arm pushing and grabbing cooperation method suitable for intensive environment
CN112809696B (en) * 2020-12-31 2022-03-15 山东大学 Omnibearing intelligent nursing system and method for high-infectivity isolated disease area
CN113159410B (en) * 2021-04-14 2024-02-27 北京百度网讯科技有限公司 Training method of automatic control model and fluid supply system control method
CN113283167A (en) * 2021-05-24 2021-08-20 暨南大学 Special equipment production line optimization method and system based on safety reinforcement learning
CN113510709B (en) * 2021-07-28 2022-08-19 北京航空航天大学 Industrial robot pose precision online compensation method based on deep reinforcement learning
CN113843802B (en) * 2021-10-18 2023-09-05 南京理工大学 Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm
CN114789444B (en) * 2022-05-05 2022-12-16 山东省人工智能研究院 Compliant human-computer contact method based on deep reinforcement learning and impedance control
CN115464659B (en) * 2022-10-05 2023-10-24 哈尔滨理工大学 Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm
CN117618125A (en) * 2024-01-25 2024-03-01 科弛医疗科技(北京)有限公司 Image trolley

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017083772A1 (en) * 2015-11-12 2017-05-18 Google Inc. Asynchronous deep reinforcement learning
CN105690392A (en) * 2016-04-14 2016-06-22 苏州大学 Robot motion control method and device based on actor-critic method
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning State Representation for Deep Actor-Critic Control; Jelle Munk et al.; 2016 IEEE 55th Conference on Decision and Control; 20161229; pp. 4667-4673 *
Research on Learning Algorithms for Robot Soccer Behavior Control; Tang Peng; China Master's Theses Full-text Database, Information Science and Technology; 20160815 (No. 08); full text *

Also Published As

Publication number Publication date
CN108052004A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108052004B (en) Industrial mechanical arm automatic control method based on deep reinforcement learning
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN111515961B (en) Reinforcement learning reward method suitable for mobile mechanical arm
Laskey et al. Robot grasping in clutter: Using a hierarchy of supervisors for learning from demonstrations
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN109978176B (en) Multi-agent cooperative learning method based on state dynamic perception
CN109782600A (en) A method of autonomous mobile robot navigation system is established by virtual environment
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN110076772A (en) A kind of grasping means of mechanical arm and device
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN109407644A (en) One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN113076615B (en) High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning
CN107992939B (en) Equal cutting force gear machining method based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113524186B (en) Deep reinforcement learning double-arm robot control method and system based on demonstration examples
CN113232019A (en) Mechanical arm control method and device, electronic equipment and storage medium
CN116038691A (en) Continuous mechanical arm motion control method based on deep reinforcement learning
CN114888801A (en) Mechanical arm control method and system based on offline strategy reinforcement learning
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
CN116352715A (en) Double-arm robot cooperative motion control method based on deep reinforcement learning
Zakaria et al. Robotic control of the deformation of soft linear objects using deep reinforcement learning
CN114077258B (en) Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
Jiang et al. Mastering the complex assembly task with a dual-arm robot based on deep reinforcement learning: A novel reinforcement learning method
CN116533234A (en) Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN116541701A (en) Training data generation method, intelligent body training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant