CN108052004B - Industrial mechanical arm automatic control method based on deep reinforcement learning - Google Patents
- Publication number: CN108052004B (application CN201711275146.7A)
- Authority
- CN
- China
- Prior art keywords
- network
- state
- value
- target
- mechanical arm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/0265—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
- G05B13/027—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
- G05B13/045—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance using a perturbation signal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to an automatic control method for an industrial mechanical arm based on deep reinforcement learning. The method constructs a deep reinforcement learning model, constructs an output disturbance, establishes a reward r_t calculation model, builds a simulation environment, accumulates an experience pool, trains the deep reinforcement learning neural network, and uses the trained model to control the motion of the mechanical arm in practice. By adding the deep reinforcement learning network, the method solves the problem of automatically controlling the mechanical arm in a complex environment and completes automatic control of the arm; after training, the arm runs quickly and with high precision.
Description
Technical Field
The invention belongs to the technical field of reinforcement learning algorithms, and particularly relates to an automatic control method of an industrial mechanical arm based on deep reinforcement learning.
Background
Compared with manpower, the industrial mechanical arm can more efficiently finish simple, repeated and heavy operations, greatly improves the production efficiency, reduces the labor cost and the labor intensity, and can reduce the probability of occurrence of manual accidents while ensuring the production quality. In some severe environments, such as high temperature, high pressure, low temperature, low pressure, dust, flammability, explosiveness and the like, manual operation is replaced by the mechanical arm, so that manual accidents caused by negligence in operation can be prevented, and the method has great significance.
The motion-solving process of a mechanical arm first obtains the pose of the grasped target, then obtains the pose of the arm itself, and solves the rotation angle of each axis through inverse dynamics. Flexibility in the joints and links deforms the structure during motion and reduces precision, so controlling a flexible mechanical arm remains a difficult problem. Common control methods include PID control, force-feedback control, adaptive control, and fuzzy and neural-network control. Among these, neural-network control has the distinct advantage of requiring no mathematical model of the controlled object; as society moves toward artificial intelligence, automatic control based on neural networks is likely to become mainstream.
Disclosure of Invention
The invention aims to provide an automatic control method of an industrial mechanical arm based on deep reinforcement learning, which solves the problem of automatic control of the mechanical arm in a complex environment by adding a deep reinforcement learning network and completes the automatic control of the mechanical arm.
In order to achieve the purpose, the invention provides an industrial mechanical arm automatic control method based on deep reinforcement learning, which is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience-pool initialization: set the experience pool as a two-dimensional matrix with m rows and n columns and initialize every element of the matrix to 0, where m is the sample capacity and n is the number of items of information stored in each sample, n = 2 × state_dim + action_dim + 1; state_dim is the dimension of the state and action_dim is the dimension of the action. A space for storing the reward information is reserved in the experience pool: the 1 in the formula n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural-network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network. Each part constructs an eval net and a target net with identical structure but different parameters, the eval net being the estimation network and the target net the target network. This forms four networks in total: μ(s|θ^μ), μ(s|θ^μ′), Q(s,a|θ^Q) and Q(s,a|θ^Q′), where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network. Randomly initialize the parameter θ^μ of the μ(s|θ^μ) network and the parameter θ^Q of the Q(s,a|θ^Q) network, then assign θ^μ to the behavior target network, i.e. θ^μ′ ← θ^μ, and assign θ^Q to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network; then set up a random normal distribution N(a_t′, var²) with mean a_t′ and variance var², and draw the actual output action value a_t from it at random. The random normal distribution applies a disturbance to the action value a_t′ so as to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t, and t is the time of the current input state;
step 3) establishing a reward r_t calculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) provides models of the major industrial robots in the world, which reduces the difficulty of building a mechanical-arm simulation environment; on this basis, a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network, obtain the actual output action value a_t through the output disturbance constructed in step 2), and receive the reward r_t and the subsequent input state s_{t+1} from the environment; store the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool; these four items together are called the state-transition information (transition);
5.2) take the subsequent input state s_{t+1} as the new current input state s_t, repeat step 5.1), and store the resulting state-transition information in the experience pool;
5.3) repeat step 5.2) until the experience-pool space is full; once it is full, execute step 5.2) once more and then jump to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Take batch samples from the experience pool for neural-network learning, where batch is a natural number;
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) divide training into xm rounds; in each round repeat steps 6.1)–6.4) xn times, and after each repetition of steps 6.1)–6.4) update the variance of the output disturbance as var ← max{0.1, var × γ}, where xm and xn are natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocess the input of the industrial CCD camera: after Gaussian filtering, the picture taken at time t is used as the state for neural-network processing;
7.2) obtain the current input state s_t of the real environment through the camera; the deep reinforcement learning network controls the rotation of the mechanical arm according to the current input state s_t and obtains the subsequent input state s_{t+1}. Take s_{t+1} as the new current input state s_t and repeat until the deep reinforcement learning model controls the mechanical arm to grasp the target.
Further, in step 3), the specific process of establishing the reward r_t calculation model is as follows:
At time t the mechanical arm obtains image information from the environment through an industrial CCD camera, and Gaussian noise is added to obtain the current input state s_t. In state s_t, the actual output action value a_t (i.e. the rotation angle of each axis of the mechanical arm) is drawn at random from the random normal distribution of step 2); with the arm-tip position coordinates denoted (x1_t, y1_t, z1_t) and the target position (x0_t, y0_t, z0_t), the reward r_t is calculated from these two positions.
Further, in step 6.2), the specific process of updating the evaluation network parameters is as follows:
Pass the state-transition information of the batch samples taken in step 6.1) through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network to obtain, for each group of state-transition information, the estimated Q value eval_Q′ and the target Q value target_Q′, and then the temporal-difference error TD_error′ = target_Q′ − eval_Q′; t′ is the input-state time of the step 5.2) executed after the experience-pool space has been filled in step 5.3), that is, each execution of step 5.2) after the pool is full has input-state time t′;
Construct the loss function Loss from the temporal-difference error TD_error′: Loss = Σ(TD_error′)²/batch;
Update the evaluation estimation network parameter θ^Q by gradient descent according to the loss function Loss.
Further, in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
For s_t in the state-transition information of each batch sample, obtain the corresponding actual output action value a_t through the μ(s|θ^μ) network and the output disturbance. Take the derivative of the estimated Q value eval_Q′ of the Q(s,a|θ^Q) network with respect to the actual output action value a_t to obtain the gradient ∇_a Q of the estimated Q value with respect to a_t; take the derivative of the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters θ^μ to obtain the gradient ∇_θ a of a_t with respect to the network parameters;
The product of the gradient ∇_a Q and the gradient ∇_θ a is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
Update the behavior estimation network parameters by gradient ascent.
Further, in step 6.4), the specific process of updating the target network parameters is as follows:
Assign the network parameters of actor_eval to actor_target every J rounds, and assign the network parameters of critic_eval to critic_target every K rounds, where J ≠ K.
Compared with the prior art, the invention has the following advantage: by adding the deep reinforcement learning network, the method solves the problem of automatically controlling the mechanical arm in a complex environment and completes automatic control of the arm; after training, the arm runs quickly and with high precision.
Drawings
Fig. 1 is a flow chart of the industrial robot automatic control method based on deep reinforcement learning of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Fig. 1 is a schematic flow chart of an industrial robot arm automatic control method based on deep reinforcement learning, which includes the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience-pool initialization: set the experience pool as a two-dimensional matrix with m rows and n columns and initialize every element of the matrix to 0, where m is the sample capacity and n is the number of items of information stored in each sample, n = 2 × state_dim + action_dim + 1; state_dim is the dimension of the state and action_dim is the dimension of the action. A space for storing the reward information is reserved in the experience pool: the 1 in the formula n = 2 × state_dim + action_dim + 1 is that reserved space;
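The pool layout of step 1.1) can be sketched as follows; `state_dim`, `action_dim`, the capacity `m`, and the helper name `store` are illustrative assumptions, not values from the patent:

```python
import numpy as np

state_dim, action_dim = 3, 2          # hypothetical dimensions
m = 1000                              # sample capacity (rows)
n = 2 * state_dim + action_dim + 1    # s_t, a_t, r_t (the "+1"), s_{t+1}
pool = np.zeros((m, n))               # every element initialised to 0

def store(pool, idx, s, a, r, s_next):
    """Pack one transition (s_t, a_t, r_t, s_{t+1}) into row idx, wrapping
    around when the pool is full so old samples are overwritten."""
    pool[idx % len(pool)] = np.hstack([s, a, [r], s_next])
```

Each row therefore holds exactly one state-transition information record in the order s_t, a_t, r_t, s_{t+1}.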
1.2) neural-network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network. Each part constructs an eval net and a target net with identical structure but different parameters, the eval net being the estimation network and the target net the target network. This forms four networks in total: μ(s|θ^μ), μ(s|θ^μ′), Q(s,a|θ^Q) and Q(s,a|θ^Q′), where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network. Randomly initialize the parameter θ^μ of the μ(s|θ^μ) network and the parameter θ^Q of the Q(s,a|θ^Q) network, then assign θ^μ to the behavior target network, i.e. θ^μ′ ← θ^μ, and assign θ^Q to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network; then set up a random normal distribution N(a_t′, var²) with mean a_t′ and variance var², and draw the actual output action value a_t from it at random. The random normal distribution applies a disturbance to the action value a_t′ so as to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t, and t is the time of the current input state;
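The output disturbance of step 2) is a draw from N(a_t′, var²); a minimal sketch, where the policy output `a_prime` and the starting variance are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(a_prime, var):
    """Draw the actual action a_t from the normal distribution with mean a_t'
    and variance var**2 (numpy's scale parameter is the standard deviation)."""
    return rng.normal(loc=a_prime, scale=var)

a_prime = np.array([0.1, -0.4])   # stand-in for the mu(s_t | theta^mu_t) output
a_t = perturb(a_prime, var=3.0)   # disturbed action used to explore the environment
```

A large initial `var` makes early actions nearly random; step 6.5) later decays it toward a floor.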
step 3) establishing a reward r_t calculation model
At time t the mechanical arm obtains image information from the environment through an industrial CCD camera, and Gaussian noise is added to obtain the current input state s_t. In state s_t, the actual output action value a_t (i.e. the rotation angle of each axis of the mechanical arm) is drawn at random from the random normal distribution of step 2); with the arm-tip position coordinates denoted (x1_t, y1_t, z1_t) and the target position (x0_t, y0_t, z0_t), the reward r_t is calculated from these two positions;
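The concrete reward formula appears only as an image in the original publication. A common choice consistent with the surrounding description — assumed here, not taken from the patent — is the negative Euclidean distance between the arm tip (x1_t, y1_t, z1_t) and the target (x0_t, y0_t, z0_t):

```python
import numpy as np

def reward(tip, target):
    """Assumed reward model: negative Euclidean distance from the arm tip to
    the target, so the reward increases toward 0 as the gripper closes in."""
    return -float(np.linalg.norm(np.asarray(tip) - np.asarray(target)))
```

Under this assumption, maximizing the expected reward is equivalent to minimizing the tip-to-target distance.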
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) provides models of the major industrial robots in the world, which reduces the difficulty of building a mechanical-arm simulation environment; on this basis, a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network, obtain the actual output action value a_t through the output disturbance constructed in step 2), and receive the reward r_t and the subsequent input state s_{t+1} from the environment; store the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool; these four items together are called the state-transition information (transition);
5.2) take the subsequent input state s_{t+1} as the new current input state s_t, repeat step 5.1), and store the resulting state-transition information in the experience pool;
5.3) repeat step 5.2) until the experience-pool space is full; once it is full, execute step 5.2) once more and then jump to step 6);
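Steps 5.1)–5.3) can be sketched as a loop that fills the pool row by row; the `policy` and `env_step` functions below are toy stand-ins (the patent's μ network and the V-REP scene are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim, m = 3, 2, 64        # hypothetical sizes
pool = np.zeros((m, 2 * state_dim + action_dim + 1))

def policy(s):
    """Stand-in for the mu(s | theta^mu) behavior estimation network."""
    return np.tanh(s[:action_dim])

def env_step(s, a):
    """Stand-in environment (ignores a for brevity): the state drifts randomly
    and the toy reward is the negative distance of the new state from origin."""
    s_next = s + 0.1 * rng.standard_normal(state_dim)
    return -np.linalg.norm(s_next), s_next

s = rng.standard_normal(state_dim)
count = 0
while count < m:                                     # step 5.3: fill the pool
    a = policy(s) + rng.normal(0, 1.0, action_dim)   # step 2 output disturbance
    r, s_next = env_step(s, a)
    pool[count] = np.hstack([s, a, [r], s_next])     # store one transition
    s, count = s_next, count + 1                     # step 5.2: s_{t+1} -> s_t
```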
step 6) training deep reinforcement learning neural network
6.1) sampling
Take batch samples from the experience pool for neural-network learning, where batch is a natural number;
6.2) updating the evaluation network parameters
Pass the state-transition information of the batch samples taken in step 6.1) through the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network to obtain, for each group of state-transition information, the estimated Q value eval_Q′ and the target Q value target_Q′, and then the temporal-difference error TD_error′ = target_Q′ − eval_Q′; t′ is the input-state time of the step 5.2) executed after the experience-pool space has been filled in step 5.3), that is, each execution of step 5.2) after the pool is full has input-state time t′;
Construct the loss function Loss from the temporal-difference error TD_error′: Loss = Σ(TD_error′)²/batch;
Update the evaluation estimation network parameter θ^Q by gradient descent according to the loss function Loss;
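Step 6.2) is the standard critic update of deep deterministic policy gradient. The linear function approximators below are a toy stand-in for the patent's networks, and the mean-squared form of the loss is an assumption (the usual DDPG convention), since the formula is garbled in this text:

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, action_dim, batch, gamma = 3, 2, 32, 0.9   # hypothetical sizes

# Linear stand-ins for Q(s,a|theta^Q) (eval) and Q(s,a|theta^Q') (target).
theta_Q = rng.standard_normal(state_dim + action_dim) * 0.1
theta_Q_tgt = theta_Q.copy()

def q(theta, s, a):
    return np.concatenate([s, a], axis=1) @ theta

# A sampled batch of transitions (random placeholders for pool rows).
s  = rng.standard_normal((batch, state_dim))
a  = rng.standard_normal((batch, action_dim))
r  = rng.standard_normal(batch)
s2 = rng.standard_normal((batch, state_dim))
a2 = rng.standard_normal((batch, action_dim))    # mu'(s_{t+1}) stand-in

eval_q   = q(theta_Q, s, a)                      # eval_Q'
target_q = r + gamma * q(theta_Q_tgt, s2, a2)    # target_Q'
td_error = target_q - eval_q                     # TD_error'
loss = np.mean(td_error ** 2)                    # assumed: sum of squares / batch

# One gradient-descent step on the eval critic parameters theta^Q.
x = np.concatenate([s, a], axis=1)
grad = -2.0 * x.T @ td_error / batch
theta_Q -= 0.05 * grad
new_loss = np.mean((r + gamma * q(theta_Q_tgt, s2, a2) - q(theta_Q, s, a)) ** 2)
```

One step of descent lowers the loss while the target network stays frozen, which is the point of separating eval and target parameters.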
6.3) updating the behavior estimation network parameters
For s_t in the state-transition information of each batch sample, obtain the corresponding actual output action value a_t through the μ(s|θ^μ) network and the output disturbance. Take the derivative of the estimated Q value eval_Q′ of the Q(s,a|θ^Q) network with respect to the actual output action value a_t to obtain the gradient ∇_a Q of the estimated Q value with respect to a_t; take the derivative of the actual output action value a_t of the μ(s|θ^μ) network with respect to the network parameters θ^μ to obtain the gradient ∇_θ a of a_t with respect to the network parameters;
The product of the gradient ∇_a Q and the gradient ∇_θ a is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
Update the behavior estimation network parameters by gradient ascent;
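Step 6.3) chains ∇_a Q with ∇_θ a (the deterministic policy gradient) and moves the actor uphill on Q. The linear actor and critic below are illustrative stand-ins so the chain rule can be written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(3)
state_dim, action_dim, batch = 3, 2, 32   # hypothetical sizes

W_mu = rng.standard_normal((action_dim, state_dim)) * 0.1  # linear actor: mu(s) = W s
w_q  = rng.standard_normal(state_dim + action_dim) * 0.1   # linear critic weights

s = rng.standard_normal((batch, state_dim))                # batch of states s_t
a = s @ W_mu.T                                             # actions the actor takes

# dQ/da for this linear critic is just the action block of its weight vector.
dq_da = np.tile(w_q[state_dim:], (batch, 1))               # shape (batch, action_dim)

# Chain rule: dQ/dW_mu = dQ/da * da/dW_mu; for mu(s) = W s, da/dW is the state.
grad_W = dq_da.T @ s / batch                               # shape (action_dim, state_dim)

q_before = np.mean(np.concatenate([s, s @ W_mu.T], axis=1) @ w_q)
W_mu += 0.1 * grad_W                                       # gradient *ascent* on Q
q_after = np.mean(np.concatenate([s, s @ W_mu.T], axis=1) @ w_q)
```

Because the update follows the gradient of Q, the mean estimated Q value over the batch increases after the step.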
6.4) updating the target network parameters
Assign the network parameters of actor_eval to actor_target every J rounds, and assign the network parameters of critic_eval to critic_target every K rounds, where J ≠ K;
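Step 6.4) is a periodic hard copy of eval parameters into the target networks; J and K below are hypothetical periods satisfying the J ≠ K requirement:

```python
import numpy as np

rng = np.random.default_rng(4)

actor_eval,  actor_target  = rng.standard_normal(8), np.zeros(8)
critic_eval, critic_target = rng.standard_normal(8), np.zeros(8)

J, K = 10, 15          # hypothetical update periods, J != K as the patent requires

def maybe_sync(step, eval_p, target_p, period):
    """Hard-copy eval parameters into the target network every `period` rounds."""
    if step % period == 0:
        target_p[:] = eval_p

for step in range(1, 31):
    maybe_sync(step, actor_eval, actor_target, J)
    maybe_sync(step, critic_eval, critic_target, K)
```

Using different periods for actor and critic staggers the two synchronizations, so both targets never jump in the same round except at common multiples of J and K.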
6.5) divide training into xm rounds; in each round repeat steps 6.1)–6.4) xn times, and after each repetition of steps 6.1)–6.4) update the variance of the output disturbance as var ← max{0.1, var × γ}, i.e. take the larger of 0.1 and the decayed previous var value, where xm and xn are natural numbers and γ is a rational number greater than zero and smaller than 1;
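The var update in step 6.5) gradually narrows exploration while keeping a floor of 0.1; a sketch with arbitrary starting values:

```python
var, gamma = 3.0, 0.995     # hypothetical starting variance and decay factor

history = []
for _ in range(500):
    var = max(0.1, var * gamma)   # var <- max{0.1, var * gamma}: decay with a floor
    history.append(var)
```

Early in training the wide distribution explores the workspace; as var shrinks, actions concentrate around the policy output, but the floor keeps a small amount of exploration alive.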
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocess the input of the industrial CCD camera: after Gaussian filtering, the picture taken at time t is used as the state for neural-network processing;
7.2) obtain the current input state s_t of the real environment through the camera; the deep reinforcement learning network controls the rotation of the mechanical arm according to the current input state s_t and obtains the subsequent input state s_{t+1}. Take s_{t+1} as the new current input state s_t and repeat until the deep reinforcement learning model controls the mechanical arm to grasp the target.
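Step 7) runs the trained policy in closed loop with no exploration noise; the camera and arm below are replaced by a trivial stand-in dynamics purely to show the loop structure:

```python
import numpy as np

rng = np.random.default_rng(5)
state_dim, action_dim = 3, 2        # hypothetical sizes

def policy(s):
    """Stand-in for the trained mu(s | theta^mu) network."""
    return np.tanh(s[:action_dim])

def env_step(s, a):
    """Stand-in for camera + arm (ignores a): the state contracts toward the
    goal at the origin, mimicking the arm converging on the target."""
    return s * 0.5

s = rng.standard_normal(state_dim)  # current input state s_t from the camera
steps = 0
while np.linalg.norm(s) > 1e-3 and steps < 100:
    a = policy(s)                   # no disturbance added at deployment time
    s = env_step(s, a)              # s_{t+1} becomes the new current state s_t
    steps += 1                      # loop until the target is "grasped"
```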
Experimental data
The experimental goal was, in a simulation environment of a SCARA robot, to control the mechanical arm through the deep reinforcement learning neural network so that it automatically positions itself on the target and grasps it. The experiment was set to 600 training rounds of 200 steps each. After training, the arm can grasp the target within 20–30 steps of operation, which meets the requirements of modern industrial production-line work. By contrast, traditional mechanical-arm control requires building a mathematical model and involves a large amount of computation for real-time inverse-dynamics solving.
Claims (5)
1. An industrial mechanical arm automatic control method based on deep reinforcement learning is characterized in that: the control method comprises the following steps:
step 1) constructing a deep reinforcement learning model
1.1) experience-pool initialization: set the experience pool as a two-dimensional matrix with m rows and n columns and initialize every element of the matrix to 0, where m is the sample capacity and n is the number of items of information stored in each sample, n = 2 × state_dim + action_dim + 1; state_dim is the dimension of the state and action_dim is the dimension of the action. A space for storing the reward information is reserved in the experience pool: the 1 in the formula n = 2 × state_dim + action_dim + 1 is that reserved space;
1.2) neural-network initialization: the neural network is divided into an Actor network and a Critic network; the Actor network is the behavior network and the Critic network is the evaluation network. Each part constructs an eval net and a target net with identical structure but different parameters, the eval net being the estimation network and the target net the target network. This forms four networks in total: μ(s|θ^μ), μ(s|θ^μ′), Q(s,a|θ^Q) and Q(s,a|θ^Q′), where the μ(s|θ^μ) network is the behavior estimation network, the μ(s|θ^μ′) network is the behavior target network, the Q(s,a|θ^Q) network is the evaluation estimation network and the Q(s,a|θ^Q′) network is the evaluation target network. Randomly initialize the parameter θ^μ of the μ(s|θ^μ) network and the parameter θ^Q of the Q(s,a|θ^Q) network, then assign θ^μ to the behavior target network, i.e. θ^μ′ ← θ^μ, and assign θ^Q to the evaluation target network, i.e. θ^Q′ ← θ^Q;
Step 2) constructing output interference
According to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network; then set up a random normal distribution N(a_t′, var²) with mean a_t′ and variance var², and draw the actual output action value a_t from it at random. The random normal distribution applies a disturbance to the action value a_t′ so as to explore the environment, where θ^μ_t denotes the parameters of the behavior estimation network at time t, and t is the time of the current input state;
step 3) establishing a reward r_t calculation model
Step 4) establishing a simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) provides models of the major industrial robots in the world, which reduces the difficulty of building a mechanical-arm simulation environment; on this basis, a simulation environment consistent with the actual application is built in V-REP;
step 5) accumulating experience pools
5.1) according to the current input state s_t, obtain the action value a_t′ through the μ(s|θ^μ_t) network, obtain the actual output action value a_t through the output disturbance constructed in step 2), and receive the reward r_t and the subsequent input state s_{t+1} from the environment; store the current input state s_t, the actual output action value a_t, the reward r_t and the subsequent input state s_{t+1} in the experience pool; these four items together are called the state-transition information (transition);
5.2) take the subsequent input state s_{t+1} as the new current input state s_t, repeat step 5.1), and store the resulting state-transition information in the experience pool;
5.3) repeat step 5.2) until the experience-pool space is full; once it is full, execute step 5.2) once more and then jump to step 6);
step 6) training deep reinforcement learning neural network
6.1) sampling
Take batch samples from the experience pool for neural-network learning, where batch is a natural number;
6.2) updating the evaluation network parameters
6.3) updating the behavior estimation network parameters
6.4) updating the target network parameters
6.5) divide training into xm rounds; in each round repeat steps 6.1)–6.4) xn times, and after each repetition of steps 6.1)–6.4) update the variance of the output disturbance as var ← max{0.1, var × γ}, where xm and xn are natural numbers and γ is a rational number greater than zero and smaller than 1;
step 7) controlling the motion of the mechanical arm in practice by using the deep reinforcement learning model trained in the step 6)
7.1) in the real environment, preprocess the input of the industrial CCD camera: after Gaussian filtering, the picture taken at time t is used as the state for neural-network processing;
7.2) obtain the current input state s_t of the real environment through the camera; the deep reinforcement learning network controls the rotation of the mechanical arm according to the current input state s_t and obtains the subsequent input state s_{t+1}; take s_{t+1} as the new current input state s_t and repeat until the deep reinforcement learning model controls the mechanical arm to grasp the target.
2. The industrial mechanical arm automatic control method based on the deep reinforcement learning as claimed in claim 1, characterized in that: in the step 3), a reward r is establishedtThe specific process of calculating the model is as follows:
in the environment at time t, the mechanical arm obtains image information through the industrial CCD camera, and Gaussian noise is added to obtain the current input state s_t. For the current input state s_t, an actual output action value a_t is drawn at random from the normal distribution of step 2). With the end position of the mechanical arm at (x1_t, y1_t, z1_t) and the target position at (x0_t, y0_t, z0_t), the reward r_t is computed from the distance between these two positions.
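One common distance-based form of such a reward is the negative Euclidean distance between the end position and the target position; the sketch below assumes that form (the patent's exact formula may add scaling or bonus terms):

```python
import math

def reward(end_pos, target_pos):
    """r_t = -distance: the closer the arm's end (x1, y1, z1) is to the
    target (x0, y0, z0), the larger (less negative) the reward."""
    (x1, y1, z1), (x0, y0, z0) = end_pos, target_pos
    return -math.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2)
```

A reward that grows smoothly as the distance shrinks gives continuous-control learners a dense gradient toward the target, rather than a sparse success/failure signal.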
3. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that in step 6.2), the specific process of updating the evaluation network parameters is as follows:
passing the state-transition information of the batch samples taken in step 6.1) through the critic_eval network and the critic_target network yields, for each group of state-transition information, an estimated Q' value eval_Q' and a target Q' value target_Q' respectively, and hence a temporal-difference error TD_error', TD_error' = target_Q' - eval_Q'; here t' is the input-state time of each execution of step 5.2) performed after the experience pool is full in step 5.3);
constructing a loss function Loss from the temporal-difference error TD_error', Loss = Σ(TD_error')² / batch;
updating the evaluation network parameters θ_Q by a gradient descent method according to the loss function Loss.
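The critic update of claim 3 — TD_error' = target_Q' − eval_Q', a mean-squared loss over the mini-batch, and a gradient-descent step — can be sketched with a toy linear critic (the features, dimensions and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, gamma = 32, 0.9

# toy state features for s_{t'} and s_{t'+1}, plus rewards, from one mini-batch
phi = rng.normal(size=(batch, 4))
phi_next = rng.normal(size=(batch, 4))
r = rng.normal(size=batch)

theta_Q = np.zeros(4)         # critic_eval parameters (linear Q approximator)
theta_Q_target = np.zeros(4)  # critic_target parameters, held fixed here

def critic_step(theta_Q, lr=0.05):
    eval_Q = phi @ theta_Q                               # eval_Q'
    target_Q = r + gamma * (phi_next @ theta_Q_target)   # target_Q'
    td_error = target_Q - eval_Q                         # TD_error'
    loss = np.mean(td_error ** 2)                        # Loss = sum(TD_error'^2)/batch
    grad = -2.0 * phi.T @ td_error / batch               # dLoss/dtheta_Q
    return theta_Q - lr * grad, loss                     # gradient descent step

losses = []
for _ in range(200):
    theta_Q, loss = critic_step(theta_Q)
    losses.append(loss)
```

With the target parameters frozen, each descent step shrinks the mean-squared TD error, which is exactly what minimising Loss is meant to achieve.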
4. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that in step 6.3), the specific process of updating the behavior estimation network parameters is as follows:
for s_t in each batch sample's state-transition information, the actor_eval network together with the output interference gives the corresponding actual output action value a_t. Differentiating the critic_eval network's estimated Q' value eval_Q' with respect to the actual output action value a_t gives the gradient ∇_a Q of the estimated Q' value with respect to a_t, where ∇_a denotes differentiation with respect to the actual output action value a_t. Differentiating the actor_eval network's actual output action value a_t with respect to the actor_eval network parameters gives the gradient ∇_θ a_t of a_t with respect to the network parameters, where ∇_θ denotes differentiation with respect to the parameters of the behavior estimation network;
the product of the gradient ∇_a Q of the estimated Q value with respect to a_t and the gradient ∇_θ a_t of a_t with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
the behavior estimation network parameters are then updated by a gradient ascent method.
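The chain rule of claim 4 — ∇_θ Q = (∇_a Q)·(∇_θ a_t), followed by gradient ascent — can be sketched with a one-parameter actor and a toy quadratic critic (the critic shape, w_true, and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = 0.7                         # the Q-maximising action for state s is w_true * s

def dQ_da(s, a):
    """Gradient of a toy critic_eval Q(s, a) = -(a - w_true*s)^2 w.r.t. a_t."""
    return -2.0 * (a - w_true * s)

theta = 0.0                          # actor_eval parameter: a_t = theta * s_t
lr = 0.05
for _ in range(500):
    s = rng.normal()                 # one sampled state s_t
    a = theta * s                    # actual output action value a_t
    # chain rule: dQ/dtheta = (dQ/da_t) * (da_t/dtheta), with da_t/dtheta = s
    theta += lr * dQ_da(s, a) * s    # gradient ASCENT on the estimated Q value
```

Ascending the critic's estimated Q value drives the actor parameter toward the policy the critic rates highest — here, theta converges to w_true.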
5. The industrial mechanical arm automatic control method based on deep reinforcement learning as claimed in claim 1, characterized in that in step 6.4), the specific process of updating the target network parameters is as follows:
assigning the network parameters of actor_eval to actor_target every J rounds, and assigning the network parameters of critic_eval to critic_target every K rounds, wherein J is not equal to K;
J and K are numbers of training rounds.
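The periodic hard-copy updates of claim 5 can be sketched as follows (scalar "parameters" and a simple counter stand in for real network weights; the values of steps, J and K are illustrative and, as the claim requires, unequal):

```python
def train_with_periodic_sync(steps, J, K):
    """Step 6.4): copy actor_eval -> actor_target every J rounds and
    critic_eval -> critic_target every K rounds, with J != K."""
    assert J != K
    actor_eval = critic_eval = 0.0
    actor_target = critic_target = 0.0
    for step in range(1, steps + 1):
        actor_eval += 1.0            # stand-in for one actor_eval update
        critic_eval += 1.0           # stand-in for one critic_eval update
        if step % J == 0:
            actor_target = actor_eval     # hard copy of actor parameters
        if step % K == 0:
            critic_target = critic_eval   # hard copy of critic parameters
    return actor_target, critic_target

at, ct = train_with_periodic_sync(steps=97, J=10, K=25)
```

Freezing the target networks between copies keeps target_Q' stable while critic_eval chases it, which is what makes the TD target of claim 3 well-posed.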
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711275146.7A CN108052004B (en) | 2017-12-06 | 2017-12-06 | Industrial mechanical arm automatic control method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711275146.7A CN108052004B (en) | 2017-12-06 | 2017-12-06 | Industrial mechanical arm automatic control method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108052004A CN108052004A (en) | 2018-05-18 |
CN108052004B true CN108052004B (en) | 2020-11-10 |
Family
ID=62121722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711275146.7A Active CN108052004B (en) | 2017-12-06 | 2017-12-06 | Industrial mechanical arm automatic control method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108052004B (en) |
Families Citing this family (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108803615B (en) * | 2018-07-03 | 2021-03-23 | 东南大学 | Virtual human unknown environment navigation algorithm based on deep reinforcement learning |
CN109240280B (en) * | 2018-07-05 | 2021-09-07 | 上海交通大学 | Anchoring auxiliary power positioning system control method based on reinforcement learning |
CN109242099B (en) * | 2018-08-07 | 2020-11-10 | 中国科学院深圳先进技术研究院 | Training method and device of reinforcement learning network, training equipment and storage medium |
CN108927806A (en) * | 2018-08-13 | 2018-12-04 | 哈尔滨工业大学(深圳) | A kind of industrial robot learning method applied to high-volume repeatability processing |
CN109379752B (en) * | 2018-09-10 | 2021-09-24 | 中国移动通信集团江苏有限公司 | Massive MIMO optimization method, device, equipment and medium |
CN109352648B (en) * | 2018-10-12 | 2021-03-02 | 北京地平线机器人技术研发有限公司 | Mechanical mechanism control method and device and electronic equipment |
CN109352649B (en) * | 2018-10-15 | 2021-07-20 | 同济大学 | Manipulator control method and system based on deep learning |
CN109614631B (en) * | 2018-10-18 | 2022-10-14 | 清华大学 | Aircraft full-automatic pneumatic optimization method based on reinforcement learning and transfer learning |
CN109483534B (en) * | 2018-11-08 | 2022-08-02 | 腾讯科技(深圳)有限公司 | Object grabbing method, device and system |
CN109948642B (en) * | 2019-01-18 | 2023-03-28 | 中山大学 | Multi-agent cross-modal depth certainty strategy gradient training method based on image input |
CN109800864B (en) * | 2019-01-18 | 2023-05-30 | 中山大学 | Robot active learning method based on image input |
CN109605377B (en) * | 2019-01-21 | 2020-05-22 | 厦门大学 | Robot joint motion control method and system based on reinforcement learning |
CN111476257A (en) * | 2019-01-24 | 2020-07-31 | 富士通株式会社 | Information processing method and information processing apparatus |
CN110070099A (en) * | 2019-02-20 | 2019-07-30 | 北京航空航天大学 | A kind of industrial data feature structure method based on intensified learning |
CN110238839B (en) * | 2019-04-11 | 2020-10-20 | 清华大学 | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction |
CN110053034A (en) * | 2019-05-23 | 2019-07-26 | 哈尔滨工业大学 | A kind of multi purpose space cellular machineries people's device of view-based access control model |
CN110125939B (en) * | 2019-06-03 | 2020-10-20 | 湖南工学院 | Virtual visual control method for robot |
CN110053053B (en) * | 2019-06-14 | 2022-04-12 | 西南科技大学 | Self-adaptive method of mechanical arm screwing valve based on deep reinforcement learning |
DE102019209616A1 (en) * | 2019-07-01 | 2021-01-07 | Kuka Deutschland Gmbh | Carrying out a given task with the aid of at least one robot |
US20220339787A1 (en) * | 2019-07-01 | 2022-10-27 | Kuka Deutschland Gmbh | Carrying out an application using at least one robot |
CN110370295B (en) * | 2019-07-02 | 2020-12-18 | 浙江大学 | Small-sized football robot active control ball suction method based on deep reinforcement learning |
CN110400345B (en) * | 2019-07-24 | 2021-06-15 | 西南科技大学 | Deep reinforcement learning-based radioactive waste push-grab cooperative sorting method |
CN110900601B (en) * | 2019-11-15 | 2022-06-03 | 武汉理工大学 | Robot operation autonomous control method for human-robot cooperation safety guarantee |
CN110826701A (en) * | 2019-11-15 | 2020-02-21 | 北京邮电大学 | Method for carrying out system identification on two-degree-of-freedom flexible leg based on BP neural network algorithm |
TWI790408B (en) * | 2019-11-19 | 2023-01-21 | 財團法人工業技術研究院 | Gripping device and gripping method |
CN110879595A (en) * | 2019-11-29 | 2020-03-13 | 江苏徐工工程机械研究院有限公司 | Unmanned mine card tracking control system and method based on deep reinforcement learning |
CN110909859B (en) * | 2019-11-29 | 2023-03-24 | 中国科学院自动化研究所 | Bionic robot fish motion control method and system based on antagonistic structured control |
CN111223141B (en) * | 2019-12-31 | 2023-10-24 | 东华大学 | Automatic pipeline work efficiency optimization system and method based on reinforcement learning |
CN111360834B (en) * | 2020-03-25 | 2023-04-07 | 中南大学 | Humanoid robot motion control method and system based on deep reinforcement learning |
CN111461325B (en) * | 2020-03-30 | 2023-06-20 | 华南理工大学 | Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem |
CN111487863B (en) * | 2020-04-14 | 2022-06-17 | 东南大学 | Active suspension reinforcement learning control method based on deep Q neural network |
CN111618847B (en) * | 2020-04-22 | 2022-06-21 | 南通大学 | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements |
CN111644398A (en) * | 2020-05-28 | 2020-09-11 | 华中科技大学 | Push-grab cooperative sorting network based on double viewing angles and sorting method and system thereof |
CN111515961B (en) * | 2020-06-02 | 2022-06-21 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
CN111881772B (en) * | 2020-07-06 | 2023-11-07 | 上海交通大学 | Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning |
CN112506044A (en) * | 2020-09-10 | 2021-03-16 | 上海交通大学 | Flexible arm control and planning method based on visual feedback and reinforcement learning |
CN112434464B (en) * | 2020-11-09 | 2021-09-10 | 中国船舶重工集团公司第七一六研究所 | Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG algorithm |
CN112338921A (en) * | 2020-11-16 | 2021-02-09 | 西华师范大学 | Mechanical arm intelligent control rapid training method based on deep reinforcement learning |
CN112405543B (en) * | 2020-11-23 | 2022-05-06 | 长沙理工大学 | Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning |
CN112643668B (en) * | 2020-12-01 | 2022-05-24 | 浙江工业大学 | Mechanical arm pushing and grabbing cooperation method suitable for intensive environment |
CN112809696B (en) * | 2020-12-31 | 2022-03-15 | 山东大学 | Omnibearing intelligent nursing system and method for high-infectivity isolated disease area |
CN113159410B (en) * | 2021-04-14 | 2024-02-27 | 北京百度网讯科技有限公司 | Training method of automatic control model and fluid supply system control method |
CN113283167A (en) * | 2021-05-24 | 2021-08-20 | 暨南大学 | Special equipment production line optimization method and system based on safety reinforcement learning |
CN113510709B (en) * | 2021-07-28 | 2022-08-19 | 北京航空航天大学 | Industrial robot pose precision online compensation method based on deep reinforcement learning |
CN113843802B (en) * | 2021-10-18 | 2023-09-05 | 南京理工大学 | Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm |
CN114789444B (en) * | 2022-05-05 | 2022-12-16 | 山东省人工智能研究院 | Compliant human-computer contact method based on deep reinforcement learning and impedance control |
CN115464659B (en) * | 2022-10-05 | 2023-10-24 | 哈尔滨理工大学 | Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm |
CN117618125A (en) * | 2024-01-25 | 2024-03-01 | 科弛医疗科技(北京)有限公司 | Image trolley |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105690392A (en) * | 2016-04-14 | 2016-06-22 | 苏州大学 | Robot motion control method and device based on actor-critic method |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
WO2017083772A1 (en) * | 2015-11-12 | 2017-05-18 | Google Inc. | Asynchronous deep reinforcement learning |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017083772A1 (en) * | 2015-11-12 | 2017-05-18 | Google Inc. | Asynchronous deep reinforcement learning |
CN105690392A (en) * | 2016-04-14 | 2016-06-22 | 苏州大学 | Robot motion control method and device based on actor-critic method |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
Non-Patent Citations (2)
Title |
---|
Learning State Representation for Deep Actor-Critic Control; Jelle Munk et al.; 2016 IEEE 55th Conference on Decision and Control; 20161229; pp. 4667-4673 *
Research on Learning Algorithms for Robot Soccer Behavior Control; Tang Peng; China Master's Theses Full-text Database, Information Science and Technology; 20160815 (No. 08); full text *
Also Published As
Publication number | Publication date |
---|---|
CN108052004A (en) | 2018-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052004B (en) | Industrial mechanical arm automatic control method based on deep reinforcement learning | |
CN110238839B (en) | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction | |
CN111515961B (en) | Reinforcement learning reward method suitable for mobile mechanical arm | |
Laskey et al. | Robot grasping in clutter: Using a hierarchy of supervisors for learning from demonstrations | |
CN110000785B (en) | Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment | |
CN109978176B (en) | Multi-agent cooperative learning method based on state dynamic perception | |
CN109782600A (en) | A method of autonomous mobile robot navigation system is established by virtual environment | |
CN111240356B (en) | Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning | |
CN110076772A (en) | A kind of grasping means of mechanical arm and device | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN109407644A (en) | One kind being used for manufacturing enterprise's Multi-Agent model control method and system | |
CN113076615B (en) | High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning | |
CN107992939B (en) | Equal cutting force gear machining method based on deep reinforcement learning | |
CN111783994A (en) | Training method and device for reinforcement learning | |
CN113524186B (en) | Deep reinforcement learning double-arm robot control method and system based on demonstration examples | |
CN113232019A (en) | Mechanical arm control method and device, electronic equipment and storage medium | |
CN116038691A (en) | Continuous mechanical arm motion control method based on deep reinforcement learning | |
CN114888801A (en) | Mechanical arm control method and system based on offline strategy reinforcement learning | |
CN115464659A (en) | Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information | |
CN116352715A (en) | Double-arm robot cooperative motion control method based on deep reinforcement learning | |
Zakaria et al. | Robotic control of the deformation of soft linear objects using deep reinforcement learning | |
CN114077258B (en) | Unmanned ship pose control method based on reinforcement learning PPO2 algorithm | |
Jiang et al. | Mastering the complex assembly task with a dual-arm robot based on deep reinforcement learning: A novel reinforcement learning method | |
CN116533234A (en) | Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning | |
CN116541701A (en) | Training data generation method, intelligent body training device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||