CN106094516A

CN106094516A - A kind of robot self-adapting grasping method based on deeply study

Info

Publication number: CN106094516A
Application number: CN201610402319.6A
Authority: CN
Inventors: 陈春林; 侯跃南; 刘力锋; 魏青; 徐旭东; 朱张青; 辛博; 马海兰
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-06-08
Filing date: 2016-06-08
Publication date: 2016-11-09

Abstract

The invention provides a kind of robot self-adapting grasping method based on deeply study, step includes: in distance in time capturing target certain distance, robot obtains the photo of target by anterior photographic head, utilize binocular distance-finding method to calculate the positional information of target further according to photo, and the positional information calculated is used for robot navigation；When target enters in the range of mechanical arm is grabbed, then the photo by anterior photographic head photographic subjects, and the deeply learning network based on DDPG utilizing training in advance to cross carries out Data Dimensionality Reduction feature extraction to photo；Draw the control strategy of robot according to feature extraction result, robot utilizes control strategy to control the pose of motion path and mechanical arm, thus realizes the self-adapting grasping of target.This grasping means can realize self-adapting grasping to size shape difference, the unfixed object in position, has good market application foreground.

Description

A kind of robot self-adapting grasping method based on deeply study

Technical field

The present invention relates to a kind of method that robot captures object, a kind of robot based on deeply study Self-adapting grasping method.

Background technology

Autonomous robot is the most intelligentized service type robot, has the learning functionality of environment to external world.For reality The function of existing various basic activities (such as location, mobile, crawl), needs robot be furnished with mechanical arm and mechanical paw and merge The information of multisensor carries out machine learning (such as degree of depth study and intensified learning), interacts with external environment, it is achieved its The various functions such as perception, decision-making and action.Now most capture humanoid robots be operated in object size to be captured, shape and The relatively-stationary situation in position, and the technology that captures is mainly based upon the sensors such as ultrasound wave, infrared and laser ranging, therefore makes The most restricted by scope, it is impossible to adaptation crawl environment is increasingly complex, capture object size, the unfixed situation of shape and position； At present, existing optic type robotics is difficult to solve visual information dimension is high, data volume is big " dimension disaster " of input Problem；Further, the neutral net utilizing machine learning to train also is difficult to convergence, it is impossible to directly process the image information of input.Always For body, present optic type captures the control technology of service robot and not yet reaches gratifying result, especially in practicality In also need to optimize further.

Summary of the invention

The technical problem to be solved in the present invention be existing cannot adapt to capture environment increasingly complex, capture object size, The unfixed situation of shape and position.

In order to solve above-mentioned technical problem, the invention provides a kind of robot self adaptation based on deeply study and grab Access method, comprises the steps:

Step 1, in distance in time capturing target certain distance, robot obtains the photograph of target by anterior photographic head Sheet, utilizes binocular distance-finding method to calculate the positional information of target further according to photo, and the positional information calculated is used for machine Device people navigates；

Step 2, robot moves according to navigation, when target enters in the range of mechanical arm is grabbed, then by front portion The photo of photographic head photographic subjects, and photo carries out by the deeply learning network based on DDPG utilizing training in advance to cross Data Dimensionality Reduction feature extraction；

Step 3, draws the control strategy of robot according to feature extraction result, and robot utilizes control strategy to control fortune Dynamic path and the pose of mechanical arm, thus realize the self-adapting grasping of target.

Binocular distance-finding method is utilized to calculate target as limiting further in scheme, step 1 of the present invention according to photo The concretely comprising the following steps of positional information:

Step 1.1, obtains the focal distance f of photographic head, centre-to-centre spacing T of two photographic head in left and right_xAnd impact point is in left and right two The subpoint of the image plane of individual photographic head is to physical distance x of the respective image plane leftmost side^lAnd x^r, two, left and right photographic head correspondence The image plane in left side and the image plane on right side be rectangle plane, and be positioned on same imaging plane, two, left and right photographic head Photocentre projection lay respectively at the center of corresponding image plane, then parallax d is:

D=x^l-x^r (1)

Step 1.2, utilizes Similar Principle of Triangle to set up Q matrix to be:

Q = [\begin{matrix} 1 & 0 & 0 & - c_{x} \\ 0 & 1 & 0 & - c_{y} \\ 0 & 0 & 0 & f \\ 0 & 0 & - \frac{1}{T_{x}} & \frac{c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] - - - (2)

Q [\begin{matrix} x \\ y \\ d \\ 1 \end{matrix}] = [\begin{matrix} x - c_{x} \\ y - c_{y} \\ f \\ \frac{- d + c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] = [\begin{matrix} X \\ Y \\ Z \\ W \end{matrix}] - - - (3)

In formula (2) and (3), (X, Y, Z) is impact point seat in the three-dimensional coordinate system with left photographic head photocentre as initial point Mark, W is for rotating translation conversion ratio example coefficient, and (x, y) is impact point coordinate in the image plane in left side, c_xAnd c_yIt is respectively a left side The coordinate system of the image plane of side and the image plane on right side and the side-play amount of initial point, c in three-dimensional coordinate system_x' for c_xCorrection value；

Step 1.3, being calculated impact point to the space length of imaging plane is:

Z = \frac{- T_{x} f}{d - (c_{x} - {c_{x}}^{'})} - - - (4)

Using the photocentre position of left photographic head as robot position, by the co-ordinate position information of impact point (X, Y, Z) carry out robot navigation as navigation purpose.

The deeply based on DDPG utilizing training in advance to cross in scheme, step 2 is limited further as the present invention Learning network carries out Data Dimensionality Reduction feature extraction to photo and concretely comprises the following steps:

Step 2.1, utilizes target to capture process and meets intensified learning and meet the condition of Markov character, when calculating t Observed quantity and the collection of action before quarter are combined into:

s_t=(x₁,a₁,...,a_t-1,x_t)=x_t (5)

In formula (5), x_tAnd a_tThe observed quantity being respectively t and the action taked；

Step 2.2, Utilization strategies value function describes the prospective earnings of crawl process and is:

Q^π(s_t,a_t)=E [R_t|s_t,a_t] (6)

In formula (6),The future profits summation that discount is later, γ was beaten for what moment t obtained ∈ [0,1] is discount factor, r (s_t,a_t) it is the revenue function of moment t, T is to capture the moment terminated, and π is for capturing strategy；

Target strategy π owing to capturing presets and determines, being designated as function mu: S ← A, S are state space, A is N-dimensional degree Motion space, utilize Bellman equation to process formula (6) has simultaneously:

Q^{μ} (s_{t}, a_{t}) = E_{s_{t + 1} ~ E} [r (s_{t}, a_{t}) + {γQ}^{μ} (s_{t + 1}, μ (s_{t + 1}))] - - - (7)

In formula (7), s_t+1～E represents that the observed quantity in t+1 moment obtains from environment E, μ (s_t+1) represent the t+1 moment The action being be mapped to by function mu from observed quantity；

Step 2.3, utilizes the principle of maximal possibility estimation, by minimizing loss function and updating network weight parameter is θ^QPolicy evaluation network Q (s, a | θ^Q), the loss function used is:

L(θ^Q)=E_μ'[(Q(s_t,a_t|θ^Q)-y_t)²] (8)

In formula (8), y_t=r (s_t,a_t)+γQ(s_t+1,μ(s_t+1)|θ^Q) it is that target strategy assesses network, μ ' is target plan Slightly；

Step 2.4, is θ for actual parameter^μStrategic function μ (s | θ^μ), the gradient utilizing chain method to obtain is:

\begin{matrix} {&dtri;}_{θ^{μ}} μ \approx E_{μ^{'}} [{&dtri;}_{θ^{μ}} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t} | θ^{μ})}] \\ = E_{μ^{'}} [{&dtri;}_{a} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t})} {&dtri;}_{θ^{μ}} (s | θ^{μ}) |_{s = s_{t}}] \end{matrix} - - - (9)

Be Policy-Gradient by formula (9) calculated gradient, recycling Policy-Gradient update strategic function μ (s | θ^μ)；

Step 2.5, utilizes and carrys out training network from policing algorithm, and the sample data used in network training is from same sample Relief area obtains, to minimize the relatedness between sample, trains neutral net, i.e. with a target Q value network simultaneously Using experience replay mechanism and target Q value network method for the renewal of objective network, the slow more New Policy used is:

θ^Q'←τθ^Q+(1-τ)θ^Q' (10)

θ^μ'←τθ^μ+(1-τ)θ^μ' (11)

In formula (10) and (11), τ is turnover rate, and τ ＜＜ 1 the most just constructs a deeply based on DDPG Practise network, and be the neutral net of convergence；

Step 2.6, utilizes the deeply learning network built that photo is carried out Data Dimensionality Reduction feature extraction, it is thus achieved that machine The control strategy of device people.

The deeply learning network limited further in scheme, step 2.6 as the present invention is inputted by an image Layer, two convolutional layers, two full articulamentums and an output layer are constituted, and image input layer comprises object to be captured for input Image；Convolutional layer is used for extracting feature, i.e. the deep layer form of expression of an image；Full articulamentum and output layer are for composition one Individual deep layer network, after training, input feature vector information, to this most exportable control instruction of deep layer network, i.e. controls robot Mechanical arm steering wheel angle and control carry dolly DC motor speed.By selected convolutional layer and the number of full articulamentum Amount be the purpose of two be both can effectively to have extracted characteristics of image, again so that neutral net training time be easy to convergence.

The beneficial effects of the present invention is: during (1) pre-training neutral net, use experience replay mechanism and stochastical sampling true Before and after the image information of fixed input can effectively solve photo, degree of association is unsatisfactory for more greatly neutral net for input data each other The problem of demand for independence；(2) realize Data Dimensionality Reduction by degree of depth study, use target Q value network technique constantly to adjust nerve net The weight matrix of network, can ensure the neutral net convergence of training as much as possible；(3) degree of depth based on DDPG trained Intensified learning neutral net can realize Data Dimensionality Reduction and object feature extraction, and directly gives the motor control plan of robot Slightly, " dimension disaster " problem is effectively solved.

Accompanying drawing explanation

Fig. 1 is the system structure schematic diagram of the present invention；

Fig. 2 is the method flow diagram of the present invention；

Fig. 3 is the binocular distance-finding method floor map of the present invention；

Fig. 4 is the binocular ranging technology schematic perspective view of the present invention；

Fig. 5 is the composition schematic diagram of the deeply learning network based on DDPG of the present invention.

Detailed description of the invention

As it is shown in figure 1, the system bag of a kind of based on deeply learning method the robot self-adapting grasping of the present invention Include: image processing system, wireless telecommunication system and robot motion's system.

Wherein, image processing system mainly has photographic head and the matlab software sharing being arranged on robot front portion；Wireless Communication system is mainly made up of WIFI module；Robot motion's system is mainly made up of base dolly and mechanical arm；First need Will be by dynamics simulation platform pre-training deeply learning network based on DDPG (degree of depth deterministic policy gradient), at this During generally use experience replay mechanism and target Q value network both approaches to guarantee that deeply based on DDPG learns Network can be restrained during pre-training, and then image processing system obtains the image of target object, passes through wireless telecommunication system Image information is passed to computer, robot distance wait capture object farther out time, use binocular ranging technology, to obtain object The positional information of body also uses it for the navigation of robot.

When robot moves and can catch object to mechanical arm, shoot object picture the most again and utilize the most trained Good deeply learning network based on DDPG realizes Data Dimensionality Reduction and extracts feature and provide the control strategy of robot, finally Send control strategy to robot motion system by wireless telecommunication system and control the kinestate of robot, it is achieved target The accurate crawl of object.

First with matlab software, the RGB image of target object is converted into gray level image during pre-training, then uses warp Test playback mechanism so that before and after photo degree of association the least with meet neutral net for input data independent of each other want Ask, obtain the image of input neural network finally by stochastical sampling；Realize Data Dimensionality Reduction by degree of depth study, use target Q-value network technique constantly adjusts the weight matrix of neutral net, finally gives the neutral net of convergence.

The control of robot Arduino plate realizes, and plate has carried WIFI module, and mechanical arm is made up of 4 steering wheels, Realizing 4 degree of freedom altogether, base dolly is by DC motor Driver；Image processing system is mainly soft by photographic head and image transmitting thereof Part and matlab are main；The photo of the target object that photographic head photographs will be transferred to electricity by the WIFI module on Arduino plate Brain, and transfer to matlab process.

Operationally, step is as follows for system:

Step 1, it is necessary first to based on DDPG (degree of depth deterministic policy gradient) by dynamics simulation platform pre-training Deeply learning network, the most generally uses experience replay mechanism and target Q value network both approaches to guarantee Deeply learning network based on DDPG can restrain during pre-training；

Step 2, obtains the image of target object, utilizes WIFI module by image with the photographic head being arranged on robot front portion Information passes to computer；

Step 3, robot distance wait capture object farther out time, use binocular ranging technology, to obtain target object Positional information also uses it for the navigation of robot；

Step 4, when robot moves and can catch object to mechanical arm, shoots object picture the most again and utilizes Trained good deeply learning network based on DDPG realizes Data Dimensionality Reduction and extracts feature and provide the control plan of robot Slightly；

Step 5, utilizes WIFI module to send control information to robot motion system, it is achieved accurately grabbing of target object Take；

As shown in Figure 3 and Figure 4, binocular ranging technology mainly make use of impact point imaging horizontal on the width view of left and right two The difference (i.e. parallax) that coordinate directly exists also exists inversely proportional relation with the distance of impact point to imaging plane.Ordinary circumstance Under, the dimension of focal length is pixel, and the dimension of photographic head centre-to-centre spacing is by the tessellated actual size of calibration plate and our input Value determines, is usually in units of millimeter (in order to improve precision, we are set to 0.1 millimeter of magnitude), and the dimension of parallax is also picture Vegetarian refreshments.Therefore molecule denominator is divided out, and the dimension of the distance of impact point to imaging plane is identical with photographic head centre-to-centre spacing.

As it is shown in figure 5, deeply learning network based on DDPG mainly by an image input layer, two convolutional layers, Two full articulamentums, output layers are constituted.The degree of depth network architecture is used for realizing Data Dimensionality Reduction, and convolutional layer is used for extracting feature, Output layer output control information.

As in figure 2 it is shown, the invention provides a kind of robot self-adapting grasping method based on deeply study, including Following steps:

Wherein, step 1 utilizes binocular distance-finding method to calculate the concretely comprising the following steps of positional information of target according to photo:

Step 1.1, obtains the focal distance f of photographic head, centre-to-centre spacing T of two photographic head in left and right_xAnd impact point is in left and right two The subpoint of the image plane of individual photographic head is to physical distance x of the respective image plane leftmost side^lAnd x^r, two, left and right photographic head correspondence The image plane in left side and the image plane on right side be rectangle plane, and be positioned on same imaging plane, two, left and right photographic head Photocentre projection lay respectively at the center of corresponding image plane, i.e. O_l、O_rAt the subpoint of imaging plane, then parallax d is:

D=x^l-x^r (1)

Step 1.2, utilizes Similar Principle of Triangle to set up Q matrix to be:

Q = [\begin{matrix} 1 & 0 & 0 & - c_{x} \\ 0 & 1 & 0 & - c_{y} \\ 0 & 0 & 0 & f \\ 0 & 0 & - \frac{1}{T_{x}} & \frac{c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] - - - (2)

Q [\begin{matrix} x \\ y \\ d \\ 1 \end{matrix}] = [\begin{matrix} x - c_{x} \\ y - c_{y} \\ f \\ \frac{- d + c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] = [\begin{matrix} X \\ Y \\ Z \\ W \end{matrix}] - - - (3)

In formula (2) and (3), (X, Y, Z) is impact point seat in the three-dimensional coordinate system with left photographic head photocentre as initial point Mark, W is for rotating translation conversion ratio example coefficient, and (x, y) is impact point coordinate in the image plane in left side, c_xAnd c_yIt is respectively a left side The coordinate system of the image plane of side and the image plane on right side and the side-play amount of initial point, c in three-dimensional coordinate system_x' for c_xCorrection value (two Person's numerical value is typically more or less the same, in the present invention it is believed that both approximately equals)；

Z = \frac{- T_{x} f}{d - (c_{x} - {c_{x}}^{'})} - - - (4)

It is special that the deeply learning network based on DDPG utilizing training in advance to cross in step 2 carries out Data Dimensionality Reduction to photo Levy concretely comprising the following steps of extraction:

s_t=(x₁,a₁,...,a_t-1,x_t)=x_t (5)

Q^π(s_t,a_t)=E [R_t|s_t,a_t] (6)

Q^{μ} (s_{t}, a_{t}) = E_{s_{t + 1} ~ E} [r (s_{t}, a_{t}) + {γQ}^{μ} (s_{t + 1}, μ (s_{t + 1}))] - - - (7)

In formula (7), s_t+1～E represents that the observed quantity in t+1 moment obtains from environment E, μ (s_t+1) represent t+1

The action that moment is be mapped to by function mu from observed quantity；

L(θ^Q)=E_μ'[(Q(s_t,a_t|θ^Q)-y_t)²] (8)

\begin{matrix} {&dtri;}_{θ^{μ}} μ \approx E_{μ^{'}} [{&dtri;}_{θ^{μ}} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t} | θ^{μ})}] \\ = E_{μ^{'}} [{&dtri;}_{a} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t})} {&dtri;}_{θ^{μ}} (s | θ^{μ}) |_{s = s_{t}}] \end{matrix} - - - (9)

θ^Q'←τθ^Q+(1-τ)θ^Q' (10)

θ^μ'←τθ^μ+(1-τ)θ^μ' (11)

Step 2.6, utilizes the deeply learning network built that photo is carried out Data Dimensionality Reduction feature extraction, it is thus achieved that machine The control strategy of device people；Deeply learning network is by an image input layer, two convolutional layers, two full articulamentums and Individual output layer constitute, wherein, the quantity of selected convolutional layer and full articulamentum be the purpose of two be both can effectively to have extracted Characteristics of image, again so that neutral net is easy to convergence when training；Image input layer comprises object to be captured for input Image；Convolutional layer is used for extracting feature, i.e. the deep layer form of expression of an image, such as some lines, limit, camber line etc.；Quan Lian Connect layer and output layer for constitute a deep layer network, by training after, input feature vector information can export control to this network System instruction, i.e. controls the mechanical arm steering wheel angle of robot and controls to carry the DC motor speed of dolly.

Experience replay mechanism and stochastical sampling is used to determine that the image information of input can during pre-training neutral net of the present invention The problem that neutral net requires independently of one another it is unsatisfactory for more greatly for input data with degree of association before and after effectively solving photo；Pass through Degree of depth study realizes Data Dimensionality Reduction, uses target Q value network technique constantly to adjust the weight matrix of neutral net, can be as far as possible Ground ensures the neutral net convergence of training；The deeply learning neural network based on DDPG trained can realize number According to dimensionality reduction and object feature extraction, and directly give the Motion Control Strategies of robot, effectively solve " dimension disaster " problem.

Claims

1. a robot self-adapting grasping method based on deeply study, it is characterised in that comprise the steps:

Step 1, in distance in time capturing target certain distance, robot obtains the photo of target by anterior photographic head, then Utilize binocular distance-finding method to calculate the positional information of target according to photo, and the positional information calculated is used for robot leads Boat；

Step 2, robot moves according to navigation, when target enters in the range of mechanical arm is grabbed, then is taken the photograph by anterior As the photo of head photographic subjects, and the deeply learning network based on DDPG utilizing training in advance to cross carries out data to photo Dimensionality reduction feature extraction；

Step 3, draws the control strategy of robot according to feature extraction result, and robot utilizes control strategy to control road of moving Footpath and the pose of mechanical arm, thus realize the self-adapting grasping of target.

Robot self-adapting grasping method based on deeply study the most according to claim 1, it is characterised in that step Binocular distance-finding method is utilized to calculate the concretely comprising the following steps of positional information of target according to photo in rapid 1:

Step 1.1, obtains the focal distance f of photographic head, centre-to-centre spacing T of two photographic head in left and right_xAnd impact point is in the shooting of two, left and right The subpoint of the image plane of head is to physical distance x of the respective image plane leftmost side^lAnd x^r, left side that two, left and right photographic head is corresponding Image plane and the image plane on right side be rectangle plane, and be positioned on same imaging plane, the photocentre of two photographic head in left and right Projection lays respectively at the center of corresponding image plane, then parallax d is:

D=x^l-x^r (1)

Step 1.2, utilizes Similar Principle of Triangle to set up Q matrix to be:

Q = [\begin{matrix} 1 & 0 & 0 & - c_{x} \\ 0 & 1 & 0 & - c_{y} \\ 0 & 0 & 0 & f \\ 0 & 0 & - \frac{1}{T_{x}} & \frac{c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] - - - (2)

Q [\begin{matrix} x \\ y \\ d \\ 1 \end{matrix}] = [\begin{matrix} x - c_{x} \\ y - c_{y} \\ f \\ \frac{- d + c_{x} - {c_{x}}^{'}}{T_{x}} \end{matrix}] = [\begin{matrix} X \\ Y \\ Z \\ W \end{matrix}] - - - (3)

In formula (2) and (3), (X, Y, Z) is impact point coordinate in the three-dimensional coordinate system with left photographic head photocentre as initial point, W For rotating translation conversion ratio example coefficient, (x y) is impact point coordinate in the image plane in left side, c_xAnd c_yIt is respectively left side The coordinate system of the image plane on image plane and right side and the side-play amount of initial point, c in three-dimensional coordinate system_x' for c_xCorrection value；

Z = \frac{- T_{x} f}{d - (c_{x} - {c_{x}}^{'})} - - - (4)

Using the photocentre position of left photographic head as robot position, by the co-ordinate position information (X, Y, Z) of impact point Robot navigation is carried out as navigation purpose.

Robot self-adapting grasping method based on deeply study the most according to claim 1 and 2, its feature exists In, the deeply learning network based on DDPG utilizing training in advance to cross in step 2 carries out Data Dimensionality Reduction feature to photo and carries Take concretely comprises the following steps:

Step 2.1, utilize target capture process meet intensified learning and meet the condition of Markov character, calculate t it Front observed quantity and the collection of action are combined into:

s_t=(x₁,a₁,...,a_t-1,x_t)=x_t (5)

Q^π(s_t,a_t)=E [R_t|s_t,a_t] (6)

In formula (6),The future profits summation that discount is later, γ ∈ was beaten for what moment t obtained [0,1] is discount factor, r (s_t,a_t) it is the revenue function of moment t, T is to capture the moment terminated, and π is for capturing strategy；

Target strategy π owing to capturing presets and determines, being designated as function mu: S ← A, S are state space, A is the action of N-dimensional degree Space, utilize Bellman equation to process formula (6) has simultaneously:

Q^{μ} (s_{t}, a_{t}) = E_{s_{t + 1} ~ E} [r (s_{t}, a_{t}) + {γQ}^{μ} (s_{t + 1}, μ (s_{t + 1}))] - - - (7)

In formula (7), s_t+1～E represents that the observed quantity in t+1 moment obtains from environment E, μ (s_t+1) represent that the t+1 moment is from sight The action that the amount of examining is be mapped to by function mu；

Step 2.3, utilizes the principle of maximal possibility estimation, is θ by minimize loss function updating network weight parameter^Q's Policy evaluation network Q (s, a | θ^Q), the loss function used is:

L(θ^Q)=E_μ'[(Q(s_t,a_t|θ^Q)-y_t)²] (8)

In formula (8), y_t=r (s_t,a_t)+γQ(s_t+1,μ(s_t+1)|θ^Q) it is that target strategy assesses network, μ ' is target strategy；

\begin{matrix} {&dtri;}_{θ^{μ}} μ \approx E_{μ^{'}} [{&dtri;}_{θ^{μ}} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t} | θ^{μ})}] \\ = E_{μ^{'}} [{&dtri;}_{a} Q (s, a | θ^{Q}) |_{s = s_{t}, a = μ (s_{t})} {&dtri;}_{θ^{μ}} μ (s | θ^{μ}) |_{s = s_{t}}] \end{matrix} - - - (9)

Step 2.5, utilizes and carrys out training network from policing algorithm, and the sample data used in network training is from same Sample Buffer District obtains, to minimize the relatedness between sample, trains neutral net with a target Q value network simultaneously, i.e. use Experience replay mechanism and target Q value network method are for the renewal of objective network, and the slow more New Policy used is:

θ^Q'←τθ^Q+(1-τ)θ^Q' (10)

θ^μ'←τθ^μ+(1-τ)θ^μ' (11)

In formula (10) and (11), τ is turnover rate, τ ＜＜ 1, the most just constructs deeply based on a DDPG study net Network, and be the neutral net of convergence；

Step 2.6, utilizes the deeply learning network built that photo is carried out Data Dimensionality Reduction feature extraction, it is thus achieved that robot Control strategy.

Robot self-adapting grasping method based on deeply study the most according to claim 3, it is characterised in that step Deeply learning network in rapid 2.6 is by an image input layer, two convolutional layers, two full articulamentums and an output Layer is constituted, and image input layer comprises the image of object to be captured for input；Convolutional layer is used for extracting feature, i.e. an image The deep layer form of expression；Full articulamentum and output layer are for constituting a deep layer network, and after training, input feature vector information arrives This most exportable control instruction of deep layer network, i.e. controls the mechanical arm steering wheel angle of robot and controls to carry the direct current of dolly Motor speed.