CN108052004A - Industrial robotic arm automatic control method based on deep reinforcement learning - Google Patents
Industrial robotic arm automatic control method based on deep reinforcement learning
- Publication number
- CN108052004A CN108052004A CN201711275146.7A CN201711275146A CN108052004A CN 108052004 A CN108052004 A CN 108052004A CN 201711275146 A CN201711275146 A CN 201711275146A CN 108052004 A CN108052004 A CN 108052004A
- Authority
- CN
- China
- Prior art keywords
- network
- input state
- state
- parameter
- network parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/0265—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
- G05B13/027—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
- G05B13/045—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance using a perturbation signal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Robotics (AREA)
- Automation & Control Theory (AREA)
- Mechanical Engineering (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biomedical Technology (AREA)
- Geometry (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Feedback Control In General (AREA)
Abstract
The present invention relates to an industrial robotic arm automatic control method based on deep reinforcement learning. The method builds a deep reinforcement learning model, constructs an output disturbance, establishes a reward r_t computation model, builds a simulation environment, accumulates an experience pool, trains the deep reinforcement learning neural network, and uses the trained model to control the motion of the robotic arm in practice. By incorporating a deep reinforcement learning network, the method solves the problem of automatically controlling a robotic arm in a complex environment and accomplishes the arm's automatic control; once training is complete, the arm runs quickly and with high precision.
Description
Technical field
The invention belongs to the technical field of reinforcement learning algorithms, and in particular relates to an industrial robotic arm automatic control method based on deep reinforcement learning.
Background technology
Compared with human labour, an industrial robotic arm can complete simple, repetitive, and heavy operations more efficiently. While greatly improving production efficiency, it also reduces labour cost and labour intensity, and, while guaranteeing production quality, lowers the probability of workplace accidents. In harsh environments such as high temperature, high pressure, low temperature, low pressure, dust, or flammable and explosive surroundings, replacing manual work with a robotic arm prevents accidents caused by operator negligence, and is therefore of great significance.
The motion-solving procedure of a robotic arm first obtains the pose information of the grasping target, then obtains the arm's own pose information, and solves the rotation angle of each axis by inverse dynamics. Because of the flexibility of the joints and links during motion, the structure deforms and precision drops, so controlling a flexible robotic arm is a hard problem. Common control methods include PID control, force-feedback control, adaptive control, fuzzy control, and neural-network control. Neural-network control has the clear advantage of not requiring a mathematical model of the controlled plant, and in a future society built on artificial intelligence, automatic control based on neural networks will be the mainstream.
Summary of the invention
The object of the present invention is to provide an industrial robotic arm automatic control method based on deep reinforcement learning which, by incorporating a deep reinforcement learning network, solves the problem of automatically controlling a robotic arm in a complex environment and accomplishes the arm's automatic control.
To achieve the above object, the industrial robotic arm automatic control method based on deep reinforcement learning designed by the present invention is characterised in that the control method comprises the following steps:
Step 1) Build the deep reinforcement learning model
1.1) Initialise the experience pool: the experience pool is an m-row, n-column two-dimensional matrix with every element initialised to 0, where m is the sample capacity, n is the amount of information stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state, and action_dim is the dimension of the action. The "+1" in n = 2 × state_dim + action_dim + 1 is the space reserved in the experience pool for storing the reward information.
1.2) Initialise the neural networks: the neural network is divided into an Actor part and a Critic part; the Actor network is the behaviour network and the Critic network is the evaluation network. Each part is further built as two networks with identical structure but different parameters, an eval net (estimation network) and a target net (target network), giving four networks in total: μ(s|θ^μ), μ(s|θ^μ′), Q(s,a|θ^Q), and Q(s,a|θ^Q′). Here μ(s|θ^μ) is the behaviour estimation network, μ(s|θ^μ′) the behaviour target network, Q(s,a|θ^Q) the evaluation estimation network, and Q(s,a|θ^Q′) the evaluation target network. Randomly initialise the parameters θ^μ of μ(s|θ^μ) and θ^Q of Q(s,a|θ^Q), then copy θ^μ to the behaviour target network, i.e. θ^μ′ ← θ^μ, and copy θ^Q to the evaluation target network, i.e. θ^Q′ ← θ^Q.
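The four-network initialisation can be sketched as follows, with each parameter set reduced to one weight matrix for clarity; the function names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(in_dim, out_dim):
    """Random initialisation of one network's parameters (a stand-in for θ)."""
    return rng.normal(size=(in_dim, out_dim))

state_dim, action_dim = 4, 3
# Behaviour (Actor) networks: μ(s|θ^μ) eval net and μ(s|θ^μ') target net.
theta_mu = init_params(state_dim, action_dim)
theta_mu_target = theta_mu.copy()          # θ^μ' ← θ^μ
# Evaluation (Critic) networks: Q(s,a|θ^Q) eval net and Q(s,a|θ^Q') target net.
theta_q = init_params(state_dim + action_dim, 1)
theta_q_target = theta_q.copy()            # θ^Q' ← θ^Q
```

After initialisation the eval and target copies are identical; they diverge once training updates the eval parameters while the targets are only refreshed periodically (step 6.4).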
Step 2) Construct the output disturbance
Given the current input state s_t, obtain an action a_t′ from the μ(s|θ^μ) network. Build a random normal distribution N(a_t′, var²) with mean a_t′ and variance var², and randomly draw an actual output action value a_t from it. The distribution N(a_t′, var²) thus applies a disturbance to the action a_t′ that is used to explore the environment; θ^μ denotes the parameters of the behaviour estimation network at time t, and t is the time of the current input state.
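The exploration step above reduces to one normal-distribution draw; the stand-in `actor` function and the state values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
var = 3.0                       # standard deviation; the distribution's variance is var²

def actor(s):
    """Stand-in for μ(s|θ^μ): any deterministic state-to-action map works here."""
    return np.tanh(s.sum()) * np.ones(3)

s_t = np.array([0.2, -0.1, 0.4, 0.0])
a_prime = actor(s_t)                           # a_t' = μ(s_t|θ^μ)
a_t = rng.normal(loc=a_prime, scale=var)       # draw a_t from N(a_t', var²)
```

Because a_t is a noisy sample around the deterministic policy output, the agent visits actions it would not otherwise take; shrinking var over training (step 6.5) gradually turns exploration off.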
Step 3) Establish the reward r_t computation model
Step 4) Build the simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the world's major industrial robots, which greatly lowers the difficulty of building a robotic-arm simulation environment. Using V-REP, a simulation environment consistent with the practical application is built.
Step 5) Accumulate the experience pool
5.1) Given the current input state s_t, obtain the action a_t′ from the μ(s|θ^μ) network, then apply the output disturbance established in step 2) to obtain the actual output action a_t, and receive the reward r_t and the subsequent input state s_{t+1} from the environment. Store the current input state s_t, the actual output action a_t, the reward r_t, and the subsequent input state s_{t+1} in the experience pool; together these four items are called a state transition record ("transition").
5.2) Take the subsequent input state s_{t+1} as the new current input state s_t, repeat step 5.1), and store the resulting transition in the experience pool.
5.3) Repeat step 5.2) until the experience pool is full; once the pool is full, each execution of step 5.2) is followed by one jump to step 6).
Step 6) Train the deep reinforcement learning neural network
6.1) Sampling
Take batch groups of samples from the experience pool for neural-network learning, where batch is a natural number.
6.2) Update the evaluation network parameters
6.3) Update the behaviour estimation network parameters
6.4) Update the target network parameters
6.5) Training is divided into xm episodes, and in each episode steps 6.1)–6.4) are repeated xn times. After each pass over 6.1)–6.4), the var value of the output disturbance is updated to var = max{0.1, var × gamma}, where xm and xn are natural numbers and gamma is a rational number greater than zero and less than 1.
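The variance-decay rule of step 6.5) can be written in two lines; the gamma value, initial var, and iteration count below are illustrative.

```python
gamma = 0.995        # decay factor, 0 < gamma < 1
var = 3.0            # initial exploration standard deviation

for _ in range(2000):            # once per pass over steps 6.1)-6.4)
    var = max(0.1, var * gamma)  # decay, but never let exploration fall below 0.1
```

The `max` clamp keeps a small residual amount of exploration for the whole training run instead of letting the noise vanish entirely.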
Step 7) Use the deep reinforcement learning model trained in step 6) to control the robotic arm in practice
7.1) In the real environment, the input of the industrial CCD camera is pre-processed: the picture at time t is Gaussian-filtered and then used as the state for neural-network processing.
7.2) Obtain the current input state s_t of the real environment through the camera; the deep reinforcement learning network controls the rotation of the robotic arm according to s_t and obtains the subsequent input state s_{t+1}. Take s_{t+1} as the new current input state s_t, and loop in this way until the deep reinforcement learning model controls the arm to grasp the target.
Further, in step 3), the detailed process of establishing the reward r_t computation model is:
At time t the robotic arm obtains the image information of the moment through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t. Given the state s_t, an actual output action value a_t (i.e. the rotation angle of each axis of the robotic arm) is randomly drawn from the random normal distribution N(a_t′, var²) of step 2). The end-effector position coordinates of the robotic arm are (x1_t, y1_t, z1_t) and the target position is (x0_t, y0_t, z0_t); the reward r_t is computed from these coordinates.
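The source does not reproduce the reward formula itself (it appears as an image in the original patent). As a sketch only, one common choice with the right shape — an ASSUMPTION, not the patent's exact formula — is the negative Euclidean distance between the end-effector and target coordinates defined above:

```python
import math

def reward(end_effector, target):
    # ASSUMPTION: the patent's exact formula is not reproduced in the source.
    # Negative Euclidean distance between (x1_t, y1_t, z1_t) and
    # (x0_t, y0_t, z0_t) is one plausible reward of this form: it is 0 at the
    # target and increasingly negative as the gripper moves away.
    return -math.dist(end_effector, target)

r_t = reward((1.0, 2.0, 2.0), (0.0, 0.0, 0.0))  # -3.0
```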
Further, in step 6.2), the detailed process of updating the evaluation network parameters is:
From the batch groups of transitions taken in step 6.1), the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively produce, for each group, the estimated value eval_Q′ and the target value target_Q′, from which the temporal-difference error TD_error′ is obtained: TD_error′ = target_Q′ − eval_Q′. Here t′ is the input-state time of an execution of step 5.2) after the experience pool has been filled in step 5.3); that is, each time step 5.2) is executed once the pool is full, the corresponding input-state time is t′.
The loss function Loss is constructed from the temporal-difference error TD_error′: Loss = Σ TD_error′ / batch.
The evaluation estimation network parameters θ^Q are updated by gradient descent on the loss function Loss.
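The TD-error and loss computation can be sketched directly; the Q values below are illustrative placeholders for network outputs. Note that the text's literal formula is the mean of the TD errors (Σ TD_error′ / batch), whereas standard DDPG uses the mean *squared* TD error; the sketch follows the text.

```python
import numpy as np

target_Q = np.array([1.2, 0.8, 1.0, 0.5])   # target_Q' from the target networks
eval_Q = np.array([1.0, 1.0, 0.9, 0.4])     # eval_Q' from the evaluation estimation network

td_error = target_Q - eval_Q                # TD_error' = target_Q' - eval_Q'
batch = len(td_error)
loss = td_error.sum() / batch               # Loss = Σ TD_error' / batch, as in the text
```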
Further, in step 6.3), the detailed process of updating the behaviour estimation network parameters is:
For the state s_t in each of the batch transitions, the μ(s|θ^μ) network together with the output disturbance yields the corresponding actual output action a_t. Differentiating the estimated value eval_Q′ of the Q(s,a|θ^Q) network with respect to the actual output action a_t gives the gradient ∇_a Q of the estimated Q value with respect to a_t, where ∇_a denotes differentiation with respect to the actual output action a_t. Differentiating the value of the actual output action a_t of the μ(s|θ^μ) network with respect to the network parameters gives the gradient ∇_{θ^μ} a of a_t with respect to the parameters, where ∇_{θ^μ} denotes differentiation with respect to the behaviour estimation network parameters.
The product of the gradient ∇_a Q of the estimated Q value with respect to a_t and the gradient ∇_{θ^μ} a of a_t with respect to the behaviour estimation network parameters is the gradient of the estimated Q value with respect to the behaviour estimation network parameters.
The behaviour estimation network parameters are updated by gradient ascent.
Further, in step 6.4), the detailed process of updating the target network parameters is:
Every J episodes, the network parameters of actor_eval are assigned to actor_target; every K episodes, the network parameters of critic_eval are assigned to critic_target, where J ≠ K.
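The periodic hard copy can be sketched in a few lines; using J ≠ K staggers the actor and critic target refreshes. The helper name and values are illustrative.

```python
import numpy as np

def hard_update(eval_params, target_params, episode, interval):
    """Copy eval-network parameters into the target network every `interval` episodes."""
    if episode % interval == 0:
        target_params[:] = eval_params
    return target_params

actor_eval = np.array([1.0, 2.0])
actor_target = np.zeros(2)
J = 5
actor_target = hard_update(actor_eval, actor_target, episode=10, interval=J)
```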
Compared with the prior art, the present invention has the following advantages: by incorporating a deep reinforcement learning network, the industrial robotic arm automatic control method based on deep reinforcement learning solves the problem of automatically controlling a robotic arm in a complex environment and accomplishes the arm's automatic control; once training is complete, the arm runs quickly and with high precision.
Description of the drawings
Fig. 1 is a flow diagram of the industrial robotic arm automatic control method based on deep reinforcement learning of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the flow of the industrial robotic arm automatic control method based on deep reinforcement learning comprises the following steps:
Step 1) Build the deep reinforcement learning model
1.1) Initialise the experience pool: the experience pool is an m-row, n-column two-dimensional matrix with every element initialised to 0, where m is the sample capacity, n is the amount of information stored per sample, n = 2 × state_dim + action_dim + 1, state_dim is the dimension of the state, and action_dim is the dimension of the action. The "+1" in n = 2 × state_dim + action_dim + 1 is the space reserved in the experience pool for storing the reward information.
1.2) Initialise the neural networks: the neural network is divided into an Actor part and a Critic part; the Actor network is the behaviour network and the Critic network is the evaluation network. Each part is further built as two networks with identical structure but different parameters, an eval net (estimation network) and a target net (target network), giving four networks in total: μ(s|θ^μ), μ(s|θ^μ′), Q(s,a|θ^Q), and Q(s,a|θ^Q′). Here μ(s|θ^μ) is the behaviour estimation network, μ(s|θ^μ′) the behaviour target network, Q(s,a|θ^Q) the evaluation estimation network, and Q(s,a|θ^Q′) the evaluation target network. Randomly initialise the parameters θ^μ of μ(s|θ^μ) and θ^Q of Q(s,a|θ^Q), then copy θ^μ to the behaviour target network, i.e. θ^μ′ ← θ^μ, and copy θ^Q to the evaluation target network, i.e. θ^Q′ ← θ^Q.
Step 2) Construct the output disturbance
Given the current input state s_t, obtain an action a_t′ from the μ(s|θ^μ) network. Build a random normal distribution N(a_t′, var²) with mean a_t′ and variance var², and randomly draw an actual output action value a_t from it. The distribution N(a_t′, var²) thus applies a disturbance to the action a_t′ that is used to explore the environment; θ^μ denotes the parameters of the behaviour estimation network at time t, and t is the time of the current input state.
Step 3) Establish the reward r_t computation model
At time t the robotic arm obtains the image information of the moment through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t. Given the state s_t, an actual output action value a_t (i.e. the rotation angle of each axis of the robotic arm) is randomly drawn from the random normal distribution N(a_t′, var²) of step 2). The end-effector position coordinates of the robotic arm are (x1_t, y1_t, z1_t) and the target position is (x0_t, y0_t, z0_t); the reward r_t is computed from these coordinates.
Step 4) Build the simulation environment
The robot simulation software V-REP (Virtual Robot Experimentation Platform) includes models of the world's major industrial robots, which greatly lowers the difficulty of building a robotic-arm simulation environment. Using V-REP, a simulation environment consistent with the practical application is built.
Step 5) Accumulate the experience pool
5.1) Given the current input state s_t, obtain the action a_t′ from the μ(s|θ^μ) network, then apply the output disturbance established in step 2) to obtain the actual output action a_t, and receive the reward r_t and the subsequent input state s_{t+1} from the environment. Store the current input state s_t, the actual output action a_t, the reward r_t, and the subsequent input state s_{t+1} in the experience pool; together these four items are called a state transition record ("transition").
5.2) Take the subsequent input state s_{t+1} as the new current input state s_t, repeat step 5.1), and store the resulting transition in the experience pool.
5.3) Repeat step 5.2) until the experience pool is full; once the pool is full, each execution of step 5.2) is followed by one jump to step 6).
Step 6) Train the deep reinforcement learning neural network
6.1) Sampling
Take batch groups of samples from the experience pool for neural-network learning, where batch is a natural number.
6.2) Update the evaluation network parameters
From the batch groups of transitions taken in step 6.1), the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively produce, for each group, the estimated value eval_Q′ and the target value target_Q′, from which the temporal-difference error TD_error′ is obtained: TD_error′ = target_Q′ − eval_Q′. Here t′ is the input-state time of an execution of step 5.2) after the experience pool has been filled in step 5.3); that is, each time step 5.2) is executed once the pool is full, the corresponding input-state time is t′.
The loss function Loss is constructed from the temporal-difference error TD_error′: Loss = Σ TD_error′ / batch.
The evaluation estimation network parameters θ^Q are updated by gradient descent on the loss function Loss.
6.3) Update the behaviour estimation network parameters
For the state s_t in each of the batch transitions, the μ(s|θ^μ) network together with the output disturbance yields the corresponding actual output action a_t. Differentiating the estimated value eval_Q′ of the Q(s,a|θ^Q) network with respect to the actual output action a_t gives the gradient ∇_a Q of the estimated Q value with respect to a_t, where ∇_a denotes differentiation with respect to the actual output action a_t. Differentiating the value of the actual output action a_t of the μ(s|θ^μ) network with respect to the network parameters gives the gradient ∇_{θ^μ} a of a_t with respect to the parameters, where ∇_{θ^μ} denotes differentiation with respect to the behaviour estimation network parameters.
The product of the gradient ∇_a Q of the estimated Q value with respect to a_t and the gradient ∇_{θ^μ} a of a_t with respect to the behaviour estimation network parameters is the gradient of the estimated Q value with respect to the behaviour estimation network parameters.
The behaviour estimation network parameters are updated by gradient ascent.
6.4) Update the target network parameters
Every J episodes, the network parameters of actor_eval are assigned to actor_target; every K episodes, the network parameters of critic_eval are assigned to critic_target, where J ≠ K.
6.5) Training is divided into xm episodes, and in each episode steps 6.1)–6.4) are repeated xn times. After each pass over 6.1)–6.4), the var value of the output disturbance is updated to var = max{0.1, var × gamma}; that is, var takes the larger of 0.1 and the previous var value after decay. Here xm and xn are natural numbers, and gamma is a rational number greater than zero and less than 1.
Step 7) Use the deep reinforcement learning model trained in step 6) to control the robotic arm in practice
7.1) In the real environment, the input of the industrial CCD camera is pre-processed: the picture at time t is Gaussian-filtered and then used as the state for neural-network processing.
7.2) Obtain the current input state s_t of the real environment through the camera; the deep reinforcement learning network controls the rotation of the robotic arm according to s_t and obtains the subsequent input state s_{t+1}. Take s_{t+1} as the new current input state s_t, and loop in this way until the deep reinforcement learning model controls the arm to grasp the target.
Experimental data
The experimental object is a SCARA robot in the simulation environment; through the deep reinforcement learning neural network, the robotic arm automatically locates the target and grasps it. The experiment was set up to train 600 episodes of 200 steps each. After training, the arm can grasp the target within 20–30 steps of running, which meets the requirements of modern industrial assembly-line production, whereas traditional robotic-arm control requires building a mathematical model and solving the inverse dynamics in real time, which is computationally expensive.
Claims (5)
1. a kind of industrial machinery arm autocontrol method based on depth enhancing study, it is characterised in that:The control method bag
Include following steps:
Step 1) structure depth enhancing learning model
1.1) experience pond initializes:Experience pond is set as m rows, the two-dimensional matrix of n row, the value of each element is initial in two-dimensional matrix
0 is turned to, wherein, the information content that m is sample size size, n is each sample storage, n=2 × state_dim+action_
The dimension that dim+1, state_dim are the dimension of state, action_dim is action;Meanwhile it reserves and is used in experience pond
The space of incentive message 1 is stored, 1 in this formula of n=2 × state_dim+action_dim+1 is storage incentive message
Headspace;
1.2) neutral net initializes:Neutral net is divided into two parts of Actor networks and Critic networks, and Actor networks are
Behavior network, Critic networks for evaluation network, be each partly divided into not Gou Jian two structure is identical and parameter is different
Eval net and target net, eval net are that estimation network, target net are objective network, so as to formed μ (s | θμ)
Network, μ (s | θμ′) network, Q (s, a | θQ) network and Q (s, a | θQ′) network totally four networks, i.e. μ (s | θμ) network estimates for behavior
Count network, μ (s | θμ′) network for performance-based objective network, Q (s, a | θQ) network for evaluation estimation network, Q (s, a | θQ′) network is
Evaluate objective network;Random initializtion μ (s | θμ) network parameter θμWith random initializtion Q (s, a | θQ) network parameter θQ, so
Afterwards by μ (s | θμ) network parameter θμValue assigns performance-based objective network, i.e. θμ′←θμ, by Q (s, a | θQ) network parameter θQValue is assigned
Give evaluation objective network, i.e. θQ′←θQ;
Step 2) construction output interference
According to current input state st, pass throughNetwork obtains action at', an average is reset as at', variance be
var2Random normal distributionIt is distributed from random normalIn be randomly derived a reality output working value at, random normal
DistributionTo acting at' interference is applied with, for exploring environment, wherein,Represent the parameter of t moment evaluation estimation network, t
For current input state at the time of;
Step 3) establishes reward rtComputation model
Step 4) builds simulated environment
Robot simulation simulation softward V-REP have the major industrial robot in the world model, based on this, the emulation ring of robotic arm
Difficulty reduction is built in border, by V-REP (Virtual Robot Experimentation Platform) software, is built and real
The simulated environment that border application is consistent;
Step 5) is accumulated experience pond
5.1) according to current input state st, pass throughNetwork obtains action at', the output established further according to step 2) is done
It disturbs to obtain reality output action at, and the r that receives awards from environmenttWith follow-up input state st+1, by current input state st, it is real
Border output action at, reward rtWith follow-up input state st+1It is stored in experience pond, and by current input state st, reality output
Act at, reward rt, follow-up input state st+1It is referred to as state transinformation transition;
5.2) by follow-up input state st+1As present current input state st, repeat step 5.1), the shape that will be calculated
State transinformation transition is stored in experience pond;
5.3) until the space in experience pond is full by storage, the space in experience pond is often performed once repetition step 5.2) after storage completely
Step 5.2), which just redirects, performs a step 6);
Step 6) training deeply learning neural network
6.1) sample
Batch groups sample is taken out from experience pond for neural network learning, batch represents natural number;
6.2) evaluation network parameter is updated
6.3) behavior estimation network parameter is updated
6.4) objective network parameter is updated
6.5) it is divided into xm bouts, each bout repeats step 6.1)~6.4) xn times, every time repeatedly 6.1)~6.4) after, output is dry
The var values disturbed are updated to var=max { 0.1, var=var × gamma }, and wherein xm, xn represents natural number, gamma be more than
Zero is less than 1 rational;
Step 7) utilizes trained depth enhancing learning model control machinery arm movement in practice in step 6)
7.1) in true environment, the input of industrial ccd cameras pre-processes, and the picture of t moment after gaussian filtering by being used as
For the state of Processing with Neural Network;
7.2) the current input state s of true environment is obtained by camerat, depth enhancing learning network is according to current input state
stControl machinery arm rotates, and obtains follow-up input state st+1.By follow-up input state st+1As current input state st, so
Xun Huan, until depth enhancing learning model control machinery arm grabs target.
2. The industrial robotic arm automatic control method based on deep reinforcement learning according to claim 1, characterised in that in step 3), the detailed process of establishing the reward r_t computation model is: at time t the robotic arm obtains the image information of the moment through the industrial CCD camera in the environment, and Gaussian noise is added to obtain the current input state s_t; given the state s_t, an actual output action value a_t (i.e. the rotation angle of each axis of the robotic arm) is randomly drawn from the random normal distribution N(a_t′, var²) of step 2); the end-effector position coordinates of the robotic arm are (x1_t, y1_t, z1_t) and the target position is (x0_t, y0_t, z0_t), and the reward r_t is computed from these coordinates.
3. The industrial robotic arm automatic control method based on deep reinforcement learning according to claim 1, characterised in that in step 6.2), the detailed process of updating the evaluation network parameters is: from the batch groups of transitions taken in step 6.1), the Q(s,a|θ^Q) network and the Q(s,a|θ^Q′) network respectively produce, for each group, the estimated value eval_Q′ and the target value target_Q′, from which the temporal-difference error TD_error′ = target_Q′ − eval_Q′ is obtained; t′ is the input-state time of an execution of step 5.2) after the experience pool has been filled in step 5.3), that is, each time step 5.2) is executed once the pool is full, the corresponding input-state time is t′; the loss function Loss = Σ TD_error′ / batch is constructed from the temporal-difference error TD_error′; the evaluation estimation network parameters θ^Q are updated by gradient descent on the loss function Loss.
4. The industrial mechanical arm automatic control method based on deep reinforcement learning according to claim 1, characterized in that: in the step 6.3), the detailed process of updating the behavior estimation network parameters is:
Each s_t in the batch of sample state-transition tuples is passed through the actor_eval network, and output noise is added to obtain the corresponding actual output action a_t; the estimated Q′ value eval_Q′ of the critic_eval network is differentiated with respect to the actual output action a_t, giving the gradient of the estimated Q′ value with respect to a_t, i.e., ∂Q/∂a_t; the actual output action a_t of the actor_eval network is differentiated with respect to the network parameters, giving the gradient ∂a_t/∂θ^μ, where θ^μ denotes the parameters of the behavior estimation network;
The product of the gradient ∂Q/∂a_t of the estimated Q value with respect to the actual output action a_t and the gradient ∂a_t/∂θ^μ of the actual output action with respect to the behavior estimation network parameters is the gradient of the estimated Q value with respect to the behavior estimation network parameters;
The behavior estimation network parameters are updated by gradient ascent.
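The chain rule in this claim, grad_θ Q = (∂Q/∂a_t) · (∂a_t/∂θ^μ) applied by gradient ascent, can be illustrated with a deliberately tiny example: a hypothetical 1-D linear actor and quadratic critic (not the patent's networks):

```python
# Toy illustration of the actor update: actor a = theta * s,
# critic Q(s, a) = -(a - a_star)^2, so the critic prefers action a_star.
s, a_star = 2.0, 3.0        # fixed input state and the critic's preferred action
theta, lr = 0.1, 0.05       # actor parameter theta^mu and learning rate

for _ in range(200):
    a = theta * s                      # actor_eval output a_t
    dQ_da = -2.0 * (a - a_star)        # gradient of Q w.r.t. the action
    da_dtheta = s                      # gradient of the action w.r.t. theta
    theta += lr * dQ_da * da_dtheta    # gradient ascent on Q via the chain rule
# theta converges so that the actor's output approaches a_star
```

The update drives theta toward the value whose action maximizes Q, which is exactly what the product-of-gradients rule in the claim achieves for the real networks.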
5. The industrial mechanical arm automatic control method based on deep reinforcement learning according to claim 1, characterized in that: in the step 6.4), the detailed process of updating the target network parameters is:
Every J episodes, the network parameters of actor_eval are assigned to actor_target; every K episodes, the network parameters of critic_eval are assigned to critic_target, where J ≠ K.
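The staggered hard-copy schedule in this claim can be sketched as a small helper; the function name and dict-based parameter containers are illustrative, not the patent's:

```python
def sync_target(episode: int, eval_params: dict, target_params: dict,
                interval: int) -> bool:
    """Hard-copy evaluation-network parameters into the target network once
    every `interval` episodes (the claim uses interval J for the actor pair
    and K for the critic pair, with J != K so the copies are staggered)."""
    if episode % interval == 0:
        target_params.clear()
        target_params.update(eval_params)
        return True
    return False
```

In a training loop one would call this twice per episode, e.g. `sync_target(ep, actor_eval, actor_target, J)` and `sync_target(ep, critic_eval, critic_target, K)`, so the two target networks lag their evaluation networks by different amounts.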
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711275146.7A CN108052004B (en) | 2017-12-06 | 2017-12-06 | Industrial mechanical arm automatic control method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108052004A true CN108052004A (en) | 2018-05-18 |
CN108052004B CN108052004B (en) | 2020-11-10 |
Family
ID=62121722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711275146.7A Active CN108052004B (en) | 2017-12-06 | 2017-12-06 | Industrial mechanical arm automatic control method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108052004B (en) |
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A kind of visual human's circumstances not known navigation algorithm based on deeply study |
CN108927806A (en) * | 2018-08-13 | 2018-12-04 | 哈尔滨工业大学(深圳) | A kind of industrial robot learning method applied to high-volume repeatability processing |
CN109242099A (en) * | 2018-08-07 | 2019-01-18 | 中国科学院深圳先进技术研究院 | Training method, device, training equipment and the storage medium of intensified learning network |
CN109240280A (en) * | 2018-07-05 | 2019-01-18 | 上海交通大学 | Anchoring auxiliary power positioning system control method based on intensified learning |
CN109352648A (en) * | 2018-10-12 | 2019-02-19 | 北京地平线机器人技术研发有限公司 | Control method, device and the electronic equipment of mechanical mechanism |
CN109352649A (en) * | 2018-10-15 | 2019-02-19 | 同济大学 | A kind of method for controlling robot and system based on deep learning |
CN109379752A (en) * | 2018-09-10 | 2019-02-22 | 中国移动通信集团江苏有限公司 | Optimization method, device, equipment and the medium of Massive MIMO |
CN109483534A (en) * | 2018-11-08 | 2019-03-19 | 腾讯科技(深圳)有限公司 | A kind of grasping body methods, devices and systems |
CN109614631A (en) * | 2018-10-18 | 2019-04-12 | 清华大学 | Aircraft Full automatic penumatic optimization method based on intensified learning and transfer learning |
CN109605377A (en) * | 2019-01-21 | 2019-04-12 | 厦门大学 | A kind of joint of robot motion control method and system based on intensified learning |
CN109800864A (en) * | 2019-01-18 | 2019-05-24 | 中山大学 | A kind of robot Active Learning Method based on image input |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN110053053A (en) * | 2019-06-14 | 2019-07-26 | 西南科技大学 | Mechanical arm based on deeply study screws the adaptive approach of valve |
CN110053034A (en) * | 2019-05-23 | 2019-07-26 | 哈尔滨工业大学 | A kind of multi purpose space cellular machineries people's device of view-based access control model |
CN110070099A (en) * | 2019-02-20 | 2019-07-30 | 北京航空航天大学 | A kind of industrial data feature structure method based on intensified learning |
CN110125939A (en) * | 2019-06-03 | 2019-08-16 | 湖南工学院 | A kind of method of Robot Virtual visualization control |
CN110238839A (en) * | 2019-04-11 | 2019-09-17 | 清华大学 | It is a kind of to optimize non-molding machine people multi peg-in-hole control method using environmental forecasting |
CN110370295A (en) * | 2019-07-02 | 2019-10-25 | 浙江大学 | Soccer robot active control suction ball method based on deeply study |
CN110400345A (en) * | 2019-07-24 | 2019-11-01 | 西南科技大学 | Radioactive waste based on deeply study, which pushes away, grabs collaboration method for sorting |
CN110826701A (en) * | 2019-11-15 | 2020-02-21 | 北京邮电大学 | Method for carrying out system identification on two-degree-of-freedom flexible leg based on BP neural network algorithm |
CN110879595A (en) * | 2019-11-29 | 2020-03-13 | 江苏徐工工程机械研究院有限公司 | Unmanned mine card tracking control system and method based on deep reinforcement learning |
CN110900601A (en) * | 2019-11-15 | 2020-03-24 | 武汉理工大学 | Robot operation autonomous control method for human-robot cooperation safety guarantee |
CN110909859A (en) * | 2019-11-29 | 2020-03-24 | 中国科学院自动化研究所 | Bionic robot fish motion control method and system based on antagonistic structured control |
CN111223141A (en) * | 2019-12-31 | 2020-06-02 | 东华大学 | Automatic assembly line work efficiency optimization system and method based on reinforcement learning |
CN111360834A (en) * | 2020-03-25 | 2020-07-03 | 中南大学 | Humanoid robot motion control method and system based on deep reinforcement learning |
CN111461325A (en) * | 2020-03-30 | 2020-07-28 | 华南理工大学 | Multi-target layered reinforcement learning algorithm for sparse rewarding environment problem |
CN111476257A (en) * | 2019-01-24 | 2020-07-31 | 富士通株式会社 | Information processing method and information processing apparatus |
CN111487863A (en) * | 2020-04-14 | 2020-08-04 | 东南大学 | Active suspension reinforcement learning control method based on deep Q neural network |
CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
CN111618847A (en) * | 2020-04-22 | 2020-09-04 | 南通大学 | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements |
CN111644398A (en) * | 2020-05-28 | 2020-09-11 | 华中科技大学 | Push-grab cooperative sorting network based on double viewing angles and sorting method and system thereof |
CN111881772A (en) * | 2020-07-06 | 2020-11-03 | 上海交通大学 | Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning |
EP3760390A1 (en) * | 2019-07-01 | 2021-01-06 | KUKA Deutschland GmbH | Performance of a predetermined task using at least one robot |
WO2021001312A1 (en) * | 2019-07-01 | 2021-01-07 | Kuka Deutschland Gmbh | Carrying out an application using at least one robot |
CN112338921A (en) * | 2020-11-16 | 2021-02-09 | 西华师范大学 | Mechanical arm intelligent control rapid training method based on deep reinforcement learning |
CN112405543A (en) * | 2020-11-23 | 2021-02-26 | 长沙理工大学 | Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning |
CN112434464A (en) * | 2020-11-09 | 2021-03-02 | 中国船舶重工集团公司第七一六研究所 | Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG reinforcement learning algorithm |
CN112506044A (en) * | 2020-09-10 | 2021-03-16 | 上海交通大学 | Flexible arm control and planning method based on visual feedback and reinforcement learning |
CN112528552A (en) * | 2020-10-23 | 2021-03-19 | 洛阳银杏科技有限公司 | Mechanical arm control model construction method based on deep reinforcement learning |
CN112643668A (en) * | 2020-12-01 | 2021-04-13 | 浙江工业大学 | Mechanical arm pushing and grabbing cooperation method suitable for intensive environment |
CN112894796A (en) * | 2019-11-19 | 2021-06-04 | 财团法人工业技术研究院 | Gripping device and gripping method |
CN113159410A (en) * | 2021-04-14 | 2021-07-23 | 北京百度网讯科技有限公司 | Training method for automatic control model and fluid supply system control method |
CN113283167A (en) * | 2021-05-24 | 2021-08-20 | 暨南大学 | Special equipment production line optimization method and system based on safety reinforcement learning |
CN113510709A (en) * | 2021-07-28 | 2021-10-19 | 北京航空航天大学 | Industrial robot pose precision online compensation method based on deep reinforcement learning |
CN113843802A (en) * | 2021-10-18 | 2021-12-28 | 南京理工大学 | Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm |
WO2022142271A1 (en) * | 2020-12-31 | 2022-07-07 | 山东大学 | Comprehensive intelligent nursing system and method for high infectiousness isolation ward |
CN114789444A (en) * | 2022-05-05 | 2022-07-26 | 山东省人工智能研究院 | Compliant human-computer contact method based on deep reinforcement learning and impedance control |
CN115464659A (en) * | 2022-10-05 | 2022-12-13 | 哈尔滨理工大学 | Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information |
CN117618125A (en) * | 2024-01-25 | 2024-03-01 | 科弛医疗科技(北京)有限公司 | Image trolley |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105690392A (en) * | 2016-04-14 | 2016-06-22 | 苏州大学 | Robot motion control method and device based on actor-critic method |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
WO2017083772A1 (en) * | 2015-11-12 | 2017-05-18 | Google Inc. | Asynchronous deep reinforcement learning |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
Non-Patent Citations (2)
Title |
---|
JELLE MUNK et al.: "Learning State Representation for Deep Actor-Critic Control", 2016 IEEE 55th Conference on Decision and Control * |
TANG Peng: "Research on Learning Algorithms for Robot Soccer Behavior Control", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A kind of visual human's circumstances not known navigation algorithm based on deeply study |
CN109240280A (en) * | 2018-07-05 | 2019-01-18 | 上海交通大学 | Anchoring auxiliary power positioning system control method based on intensified learning |
CN109240280B (en) * | 2018-07-05 | 2021-09-07 | 上海交通大学 | Anchoring auxiliary power positioning system control method based on reinforcement learning |
CN109242099A (en) * | 2018-08-07 | 2019-01-18 | 中国科学院深圳先进技术研究院 | Training method, device, training equipment and the storage medium of intensified learning network |
CN109242099B (en) * | 2018-08-07 | 2020-11-10 | 中国科学院深圳先进技术研究院 | Training method and device of reinforcement learning network, training equipment and storage medium |
CN108927806A (en) * | 2018-08-13 | 2018-12-04 | 哈尔滨工业大学(深圳) | A kind of industrial robot learning method applied to high-volume repeatability processing |
CN109379752B (en) * | 2018-09-10 | 2021-09-24 | 中国移动通信集团江苏有限公司 | Massive MIMO optimization method, device, equipment and medium |
CN109379752A (en) * | 2018-09-10 | 2019-02-22 | 中国移动通信集团江苏有限公司 | Optimization method, device, equipment and the medium of Massive MIMO |
CN109352648A (en) * | 2018-10-12 | 2019-02-19 | 北京地平线机器人技术研发有限公司 | Control method, device and the electronic equipment of mechanical mechanism |
CN109352649A (en) * | 2018-10-15 | 2019-02-19 | 同济大学 | A kind of method for controlling robot and system based on deep learning |
CN109352649B (en) * | 2018-10-15 | 2021-07-20 | 同济大学 | Manipulator control method and system based on deep learning |
CN109614631A (en) * | 2018-10-18 | 2019-04-12 | 清华大学 | Aircraft Full automatic penumatic optimization method based on intensified learning and transfer learning |
CN109483534A (en) * | 2018-11-08 | 2019-03-19 | 腾讯科技(深圳)有限公司 | A kind of grasping body methods, devices and systems |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN109800864A (en) * | 2019-01-18 | 2019-05-24 | 中山大学 | A kind of robot Active Learning Method based on image input |
CN109605377A (en) * | 2019-01-21 | 2019-04-12 | 厦门大学 | A kind of joint of robot motion control method and system based on intensified learning |
CN111476257A (en) * | 2019-01-24 | 2020-07-31 | 富士通株式会社 | Information processing method and information processing apparatus |
CN110070099A (en) * | 2019-02-20 | 2019-07-30 | 北京航空航天大学 | A kind of industrial data feature structure method based on intensified learning |
CN110238839A (en) * | 2019-04-11 | 2019-09-17 | 清华大学 | It is a kind of to optimize non-molding machine people multi peg-in-hole control method using environmental forecasting |
CN110238839B (en) * | 2019-04-11 | 2020-10-20 | 清华大学 | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction |
CN110053034A (en) * | 2019-05-23 | 2019-07-26 | 哈尔滨工业大学 | A kind of multi purpose space cellular machineries people's device of view-based access control model |
CN110125939A (en) * | 2019-06-03 | 2019-08-16 | 湖南工学院 | A kind of method of Robot Virtual visualization control |
CN110125939B (en) * | 2019-06-03 | 2020-10-20 | 湖南工学院 | Virtual visual control method for robot |
CN110053053B (en) * | 2019-06-14 | 2022-04-12 | 西南科技大学 | Self-adaptive method of mechanical arm screwing valve based on deep reinforcement learning |
CN110053053A (en) * | 2019-06-14 | 2019-07-26 | 西南科技大学 | Mechanical arm based on deeply study screws the adaptive approach of valve |
WO2021001312A1 (en) * | 2019-07-01 | 2021-01-07 | Kuka Deutschland Gmbh | Carrying out an application using at least one robot |
CN114051444B (en) * | 2019-07-01 | 2024-04-26 | 库卡德国有限公司 | Executing an application by means of at least one robot |
EP3760390A1 (en) * | 2019-07-01 | 2021-01-06 | KUKA Deutschland GmbH | Performance of a predetermined task using at least one robot |
CN114051444A (en) * | 2019-07-01 | 2022-02-15 | 库卡德国有限公司 | Executing an application by means of at least one robot |
CN110370295B (en) * | 2019-07-02 | 2020-12-18 | 浙江大学 | Small-sized football robot active control ball suction method based on deep reinforcement learning |
CN110370295A (en) * | 2019-07-02 | 2019-10-25 | 浙江大学 | Soccer robot active control suction ball method based on deeply study |
CN110400345B (en) * | 2019-07-24 | 2021-06-15 | 西南科技大学 | Deep reinforcement learning-based radioactive waste push-grab cooperative sorting method |
CN110400345A (en) * | 2019-07-24 | 2019-11-01 | 西南科技大学 | Radioactive waste based on deeply study, which pushes away, grabs collaboration method for sorting |
CN110900601B (en) * | 2019-11-15 | 2022-06-03 | 武汉理工大学 | Robot operation autonomous control method for human-robot cooperation safety guarantee |
CN110826701A (en) * | 2019-11-15 | 2020-02-21 | 北京邮电大学 | Method for carrying out system identification on two-degree-of-freedom flexible leg based on BP neural network algorithm |
CN110900601A (en) * | 2019-11-15 | 2020-03-24 | 武汉理工大学 | Robot operation autonomous control method for human-robot cooperation safety guarantee |
CN112894796A (en) * | 2019-11-19 | 2021-06-04 | 财团法人工业技术研究院 | Gripping device and gripping method |
TWI790408B (en) * | 2019-11-19 | 2023-01-21 | 財團法人工業技術研究院 | Gripping device and gripping method |
CN112894796B (en) * | 2019-11-19 | 2023-09-05 | 财团法人工业技术研究院 | Grabbing device and grabbing method |
CN110909859B (en) * | 2019-11-29 | 2023-03-24 | 中国科学院自动化研究所 | Bionic robot fish motion control method and system based on antagonistic structured control |
CN110909859A (en) * | 2019-11-29 | 2020-03-24 | 中国科学院自动化研究所 | Bionic robot fish motion control method and system based on antagonistic structured control |
CN110879595A (en) * | 2019-11-29 | 2020-03-13 | 江苏徐工工程机械研究院有限公司 | Unmanned mine card tracking control system and method based on deep reinforcement learning |
CN111223141A (en) * | 2019-12-31 | 2020-06-02 | 东华大学 | Automatic assembly line work efficiency optimization system and method based on reinforcement learning |
CN111223141B (en) * | 2019-12-31 | 2023-10-24 | 东华大学 | Automatic pipeline work efficiency optimization system and method based on reinforcement learning |
CN111360834A (en) * | 2020-03-25 | 2020-07-03 | 中南大学 | Humanoid robot motion control method and system based on deep reinforcement learning |
CN111461325B (en) * | 2020-03-30 | 2023-06-20 | 华南理工大学 | Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem |
CN111461325A (en) * | 2020-03-30 | 2020-07-28 | 华南理工大学 | Multi-target layered reinforcement learning algorithm for sparse rewarding environment problem |
CN111487863B (en) * | 2020-04-14 | 2022-06-17 | 东南大学 | Active suspension reinforcement learning control method based on deep Q neural network |
CN111487863A (en) * | 2020-04-14 | 2020-08-04 | 东南大学 | Active suspension reinforcement learning control method based on deep Q neural network |
CN111618847A (en) * | 2020-04-22 | 2020-09-04 | 南通大学 | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements |
CN111618847B (en) * | 2020-04-22 | 2022-06-21 | 南通大学 | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements |
CN111644398A (en) * | 2020-05-28 | 2020-09-11 | 华中科技大学 | Push-grab cooperative sorting network based on double viewing angles and sorting method and system thereof |
CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
CN111515961B (en) * | 2020-06-02 | 2022-06-21 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
CN111881772B (en) * | 2020-07-06 | 2023-11-07 | 上海交通大学 | Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning |
CN111881772A (en) * | 2020-07-06 | 2020-11-03 | 上海交通大学 | Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning |
CN112506044A (en) * | 2020-09-10 | 2021-03-16 | 上海交通大学 | Flexible arm control and planning method based on visual feedback and reinforcement learning |
CN112528552A (en) * | 2020-10-23 | 2021-03-19 | 洛阳银杏科技有限公司 | Mechanical arm control model construction method based on deep reinforcement learning |
CN112434464B (en) * | 2020-11-09 | 2021-09-10 | 中国船舶重工集团公司第七一六研究所 | Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG algorithm |
CN112434464A (en) * | 2020-11-09 | 2021-03-02 | 中国船舶重工集团公司第七一六研究所 | Arc welding cooperative welding method for multiple mechanical arms of ship based on MADDPG reinforcement learning algorithm |
CN112338921A (en) * | 2020-11-16 | 2021-02-09 | 西华师范大学 | Mechanical arm intelligent control rapid training method based on deep reinforcement learning |
CN112405543A (en) * | 2020-11-23 | 2021-02-26 | 长沙理工大学 | Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning |
CN112405543B (en) * | 2020-11-23 | 2022-05-06 | 长沙理工大学 | Mechanical arm dense object temperature-first grabbing method based on deep reinforcement learning |
CN112643668A (en) * | 2020-12-01 | 2021-04-13 | 浙江工业大学 | Mechanical arm pushing and grabbing cooperation method suitable for intensive environment |
CN112643668B (en) * | 2020-12-01 | 2022-05-24 | 浙江工业大学 | Mechanical arm pushing and grabbing cooperation method suitable for intensive environment |
WO2022142271A1 (en) * | 2020-12-31 | 2022-07-07 | 山东大学 | Comprehensive intelligent nursing system and method for high infectiousness isolation ward |
CN113159410B (en) * | 2021-04-14 | 2024-02-27 | 北京百度网讯科技有限公司 | Training method of automatic control model and fluid supply system control method |
CN113159410A (en) * | 2021-04-14 | 2021-07-23 | 北京百度网讯科技有限公司 | Training method for automatic control model and fluid supply system control method |
CN113283167A (en) * | 2021-05-24 | 2021-08-20 | 暨南大学 | Special equipment production line optimization method and system based on safety reinforcement learning |
CN113510709A (en) * | 2021-07-28 | 2021-10-19 | 北京航空航天大学 | Industrial robot pose precision online compensation method based on deep reinforcement learning |
CN113510709B (en) * | 2021-07-28 | 2022-08-19 | 北京航空航天大学 | Industrial robot pose precision online compensation method based on deep reinforcement learning |
CN113843802A (en) * | 2021-10-18 | 2021-12-28 | 南京理工大学 | Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm |
CN113843802B (en) * | 2021-10-18 | 2023-09-05 | 南京理工大学 | Mechanical arm motion control method based on deep reinforcement learning TD3 algorithm |
CN114789444B (en) * | 2022-05-05 | 2022-12-16 | 山东省人工智能研究院 | Compliant human-computer contact method based on deep reinforcement learning and impedance control |
CN114789444A (en) * | 2022-05-05 | 2022-07-26 | 山东省人工智能研究院 | Compliant human-computer contact method based on deep reinforcement learning and impedance control |
CN115464659A (en) * | 2022-10-05 | 2022-12-13 | 哈尔滨理工大学 | Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information |
CN115464659B (en) * | 2022-10-05 | 2023-10-24 | 哈尔滨理工大学 | Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm |
CN117618125A (en) * | 2024-01-25 | 2024-03-01 | 科弛医疗科技(北京)有限公司 | Image trolley |
Also Published As
Publication number | Publication date |
---|---|
CN108052004B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052004A (en) | Industrial machinery arm autocontrol method based on depth enhancing study | |
Chen et al. | A system for general in-hand object re-orientation | |
US11928765B2 (en) | Animation implementation method and apparatus, electronic device, and storage medium | |
Guo et al. | Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning | |
EP3825962A3 (en) | Virtual object driving method, apparatus, electronic device, and readable storage medium | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN109782600A (en) | A method of autonomous mobile robot navigation system is established by virtual environment | |
Kusuma | FIBROUS ROOT MODEL IN BATIK PATTERN GENERATION. | |
CN109800864A (en) | A kind of robot Active Learning Method based on image input | |
CN107679522A (en) | Action identification method based on multithread LSTM | |
CN110315544B (en) | Robot operation learning method based on video image demonstration | |
CN108229678A (en) | Network training method, method of controlling operation thereof, device, storage medium and equipment | |
Jiang et al. | Mastering the complex assembly task with a dual-arm robot: A novel reinforcement learning method | |
Vacaro et al. | Sim-to-real in reinforcement learning for everyone | |
Zakaria et al. | Robotic control of the deformation of soft linear objects using deep reinforcement learning | |
Zhang et al. | Reinforcement learning based pushing and grasping objects from ungraspable poses | |
Lv et al. | Sam-rl: Sensing-aware model-based reinforcement learning via differentiable physics-based simulation and rendering | |
Kim et al. | Pre-and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer | |
CN108944940A (en) | Driving behavior modeling method neural network based | |
Chen et al. | A simple method for complex in-hand manipulation | |
WO2021100267A1 (en) | Information processing device and information processing method | |
Sanchez et al. | Towards advanced robotic manipulation | |
CN110751869B (en) | Simulated environment and battlefield situation strategy transfer technology based on countermeasure discrimination migration method | |
CN109635942B (en) | Brain excitation state and inhibition state imitation working state neural network circuit structure and method | |
Li et al. | Learning a skill-sequence-dependent policy for long-horizon manipulation tasks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||