CN112734727A - Apple picking method based on improved deep neural network - Google Patents
- Publication number
- CN112734727A (application CN202110031817.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- grabbing
- dimensional
- layer
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- A—HUMAN NECESSITIES
- A01—AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING
- A01D—HARVESTING; MOWING
- A01D91/00—Methods for harvesting agricultural products
- A01D91/04—Products growing above the soil
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a fruit-grasping method for an apple picking robot based on a deep neural network. It improves on conventional neural networks by constructing a grasp proposal network and a three-dimensional reconstruction network from a residual network and a recurrent neural network (LSTM). The method comprises: collecting pictures of apple fruit in the field; extracting sample pictures and preprocessing them; constructing the grasp proposal network and the three-dimensional reconstruction network; training both networks on a training data set; evaluating the grasp proposal network on a test data set; and using the trained networks to pick apples in real time. In this way the fruit targets are identified and located in three-dimensional space, parameters are provided for the movement of the manipulator, and accurate picking of agricultural fruit is achieved. Using convolutional neural networks improves grasping performance, greatly raises apple picking efficiency, and provides a new solution for robot-based apple picking.
Description
Technical Field
The invention provides a grasping method for an apple picking robot based on an improved convolutional neural network image recognition technology.
Background
At present, fruit picking requires a large amount of labor. As China's population continues to age, the rural labor force keeps shrinking and production costs keep rising, which limits the development of the whole fruit industry; moreover, manual picking is affected by subjective human factors that can reduce picking quality. With the rapid development of computer and image processing technologies, intelligent fruit picking machines reduce the labor intensity and production costs of fruit growers, effectively increase the production efficiency of the whole industry, and meet the needs of the industrialization of fruit production. In picking robots, digital image processing is mainly used to identify fruit targets and locate them in three-dimensional space, providing parameters for the movement of the manipulator and thereby enabling accurate picking of agricultural fruit. Aiming at the heavy labor requirement and low efficiency of current picking, the invention studies an intelligent fruit picking method based on deep learning, effectively improves fruit picking efficiency, and contributes to agricultural production.
Disclosure of Invention
The invention provides an apple picking method based on a deep neural network, which uses image processing and convolutional neural networks to accurately grasp ripe apples in the field. The convolutional networks are used to construct a grasp proposal network and a three-dimensional reconstruction network for apple fruit, providing parameters for the movement of the manipulator and thereby enabling accurate picking of agricultural fruit. Labor intensity and production costs are reduced, and the production efficiency of the whole industry is effectively increased.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the apple picking method based on the deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) carrying out image segmentation by a color space reference table method and removing the background to obtain an image containing the apple region. Since the picking robot is strongly affected by changes in the light intensity of the working environment, the HSV, YCbCr and L*a*b* color models are considered, and from each the two luminance-independent components are taken (H and S; Cr and Cb; a* and b*), each pair forming a two-dimensional color space. The reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (such as HSV or L*a*b*) and mapping each component to the range 0–255;
(2f) counting the sample pixels in the specified two-dimensional color space (e.g. H–S or a*–b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding it to obtain a binary image;
(2h) performing a series of mathematical morphology operations on the binary image obtained in step (2g) using improved dilation and erosion algorithms; the two-dimensional array corresponding to the resulting binary image is the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), count the number of target pixels in its 5×5 neighborhood; if the count exceeds half, treat the pixel as a target pixel, otherwise as a non-target pixel. Then use a region labeling algorithm to find the region of each fruit and compute the bounding rectangle of each region, completing the extraction of the apple targets;
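The table-building and classification procedure of steps (2d)–(2i) can be sketched in a few lines. This is a minimal pure-Python sketch with illustrative function names; a real pipeline would use OpenCV for the RGB-to-HSV conversion, the morphology of step (2h), and the region labeling of step (2i):

```python
# Sketch of the color-space reference-table segmentation, steps (2d)-(2g).
# Components are assumed already mapped to 0..255 as in step (2e).

def build_reference_table(sample_pixels, threshold):
    """Build the 256x256 lookup table from (c1, c2) sample pixels,
    e.g. the (H, S) components of pixels known to belong to apples."""
    density = [[0] * 256 for _ in range(256)]        # step (2d): zeroed array
    for c1, c2 in sample_pixels:                     # step (2f): 2-D histogram
        density[c1][c2] += 1
    # step (2g): binarize the density map with a fixed threshold
    return [[1 if v >= threshold else 0 for v in row] for row in density]

def classify_pixel(table, c1, c2):
    """Core test of step (2i): is this pixel an apple (target) pixel?"""
    return table[c1][c2] == 1

table = build_reference_table([(10, 20)] * 5 + [(200, 200)], threshold=2)
```

A pixel whose (H, S) pair falls on a high-density table entry is then counted as a target pixel in the 5×5 neighborhood vote of step (2i).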
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) constructing the convolutional neural networks and training them with the training data set. The method comprises two neural networks whose outputs are combined by an optimization module to plan grasping. The grasp proposal network GPNet outputs a grasp pose relative to the camera frame, cTζ ∈ SE(3) (the superscript c denoting the camera frame). The three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts. The outputs of the two networks are combined by projecting the grasp proposal cTζ onto the nearest point of the reconstructed point cloud, yielding a refined grasp proposal cTζ+. Since the pose of the camera relative to the manipulator is known, a grasp in the camera frame can be transformed into the robot frame for the robot to execute:
the architecture of GPNet consists of parallel ResNet-34 modules, whose input is a pair of aligned gray scale and depth images, then followed by two fully connected layers, the output regresses to a vector,this represents a homogeneous transformationIt is a (relative to the camera) grabbing gestureAn estimate of the potential. The first 3 values (t1, t2, t3) represent the desired (x, y, z) position of the gripper in the camera coordinate frame. The last 9 values represent a serialized three-dimensional rotation matrix. ResNet-34 is mainly composed of L residual blocks, 1 average pooling layer, 1 maximum pooling layer and 1 full-link layer. Each residual block is composed of 2 3 × 3 convolutional layers and 2 ReLU units, and each residual unit can be expressed as:wherein x islAnd xl+1Respectively representing the input and output of the lth residual unit, F is the residual function, representing the learned residual, and h (X)L)=XLRepresenting an identity map, f is the ReLu activation function, so that the learning features from the shallow layer L to the deep layer L areAfter GPNet is constructed, a training data set is used for carrying out grabbing and recommending network training, and the training steps are as follows:
(4a) inputting picture samples into the GPNet network; the input of the network is (I_g, I_d), i.e. a gray-scale and a depth image, and each sample is paired with its corresponding ground-truth grasp t*;
(4c) computing the loss against the ground-truth grasp t*; the loss reflects the closeness of the prediction to the ground truth as a weighted sum of translational and rotational components, L = λ_T·L_T + λ_R·L_R, where L_T = ||t − t*||² is the squared Euclidean distance between the predicted and ground-truth translations, and L_R is the squared deviation from the identity of the product of the predicted rotation matrix and the transpose of the ground-truth rotation matrix. GPNet is trained on the data set with equal weights λ_T = λ_R, using an Adam optimizer with learning rate 1×10⁻⁴;
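A minimal numerical sketch of this loss, under the assumption that the rotational term measures the squared deviation of R·R*ᵀ from the identity (the exact metric used by the invention may differ):

```python
import numpy as np

# Sketch of the GPNet training loss of step (4c): a weighted sum of a
# translational term (squared Euclidean distance) and a rotational term.
# lam_t = lam_r mirrors the text's lambda_T = lambda_R.

def grasp_loss(t_pred, R_pred, t_true, R_true, lam_t=1.0, lam_r=1.0):
    loss_t = float(np.sum((t_pred - t_true) ** 2))            # ||t - t*||^2
    # squared deviation of R_pred @ R_true^T from the identity (assumption)
    loss_r = float(np.sum((R_pred @ R_true.T - np.eye(3)) ** 2))
    return lam_t * loss_t + lam_r * loss_r
```

For a perfect prediction both terms vanish; a 0.1 m translation error alone contributes 0.01 to the loss.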
(5) the invention constructs a three-dimensional reconstruction network SRNet using a three-dimensional recursive reconstruction neural network (3D-R2N2), which can build three-dimensional reconstructions of object instances from different viewpoints. Each 3D-R2N2 comprises an encoder, a 3D-LSTM and a decoder. The network works as follows:
(5a) giving a sample image to an input layer;
(5b) encoding the image into features using a CNN; the encoder consists of 2D convolutional layers, pooling layers, leaky rectified linear units, and a fully connected layer. The input image is learned layer by layer by the encoder to obtain its low-dimensional features. To improve the optimization of deeper networks, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the output is then flattened and passed to the fully connected layer. The encoder thus encodes the input image x into the low-dimensional features T(x);
(5c) inputting the feature vector from the encoder to the 3D-LSTM. The 3D-LSTM is composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighbors as input, and each cell is responsible for reconstructing a specific part of the final output. After passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it unchanged by closing the input gate. The three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, and forward propagation at each index position is: f_t = σ(W_f T(x_t) + U_f * h_{t−1} + b_f), i_t = σ(W_i T(x_t) + U_i * h_{t−1} + b_i), s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh(W_s T(x_t) + U_s * h_{t−1} + b_s), h_t = tanh(s_t)
where i_t and f_t denote the input gate and forget gate, s_t and h_t denote the memory cell and hidden state, ⊙ denotes element-wise multiplication, and the subscript t denotes the activation at time t. W(·) and U(·) are the weights transforming the current input x_t and the previous hidden state h_{t−1}, b(·) denotes the bias, and * denotes the convolution operation. Unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end; removing the redundant output gates reduces the number of parameters;
(5d) decoding the hidden states of the LSTM units with a decoder to generate a 3D probabilistic voxel reconstruction. The decoder is a three-dimensional deconvolutional neural network: a simple 5-layer convolutional decoder network augmented with a deep residual network of 4 residual connections. The hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution, nonlinear rectification and unpooling layers; a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) the loss function of the network is defined as the sum of voxel-wise cross-entropies. Let the final output at each voxel (i, j, k) be the Bernoulli distribution [1 − p(i,j,k), p(i,j,k)], conditioned on the input sequence χ = {x_t}, t ∈ {1, …, T}, and let the corresponding ground-truth occupancy be y(i,j,k) ∈ {0, 1}; then L(χ, y) = −Σ [ y(i,j,k) log p(i,j,k) + (1 − y(i,j,k)) log(1 − p(i,j,k)) ];
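The voxel-wise cross-entropy of step (5e) can be sketched directly; the voxel grid is flattened to a list here for simplicity, and the probability clamp is an added numerical-safety assumption:

```python
import math

# Sketch of the voxel cross-entropy loss of step (5e): p is the predicted
# occupancy probability per voxel, y the ground-truth {0,1} occupancy.

def voxel_cross_entropy(preds, targets, eps=1e-12):
    """Negative log-likelihood summed over all voxels (i, j, k)."""
    total = 0.0
    for p, y in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

A confident, correct prediction drives the loss toward zero, while p = 0.5 on an occupied voxel contributes log 2.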
(6) optimizing the grasp. To let the robot grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is:
(6a) take a point set p_i ∈ P from the target point cloud P;
(6b) find the corresponding points q_i ∈ Q in the source point cloud Q such that ||q_i − p_i|| is minimal;
(6c) compute the rotation matrix R and translation vector t that minimize the error function;
(6d) apply the rotation R and translation t obtained in the previous step to the points p_i to obtain the new point set P' = {p_i' = R p_i + t, p_i ∈ P};
(6e) compute the mean distance d between P' and the corresponding points in Q;
(6f) if d is less than a given threshold, or the number of iterations exceeds a preset maximum, stop the iteration; otherwise return to step (6b) until the convergence condition is met;
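The ICP loop above can be sketched as follows, under two assumptions not fixed by the text: brute-force nearest-neighbour matching for the correspondence step, and the closed-form SVD (Kabsch) solution for the rotation and translation:

```python
import numpy as np

# Sketch of the ICP refinement loop, following the steps above. R and t
# are the closed-form minimizers of sum ||R p_i + t - q_i||^2 (Kabsch).

def icp(P, Q, max_iters=50, tol=1e-6):
    P = P.copy()
    d = np.inf
    for _ in range(max_iters):
        # nearest neighbour in Q for every p_i
        idx = np.argmin(((P[:, None] - Q[None]) ** 2).sum(-1), axis=1)
        Qm = Q[idx]
        # closed-form R, t from the centered cross-covariance
        cp, cq = P.mean(0), Qm.mean(0)
        U, _, Vt = np.linalg.svd((P - cp).T @ (Qm - cq))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:           # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = cq - R @ cp
        P = P @ R.T + t                     # transform the point set
        d = np.mean(((P - Qm) ** 2).sum(-1))  # mean squared distance
        if d < tol:                         # convergence test
            break
    return P, d
```

With a small initial misalignment, a single iteration already recovers the rigid transform exactly, since the nearest-neighbour correspondences are then correct.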
(7) testing and verifying the network with the 100 pictures obtained in step (3);
(8) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
The invention has the beneficial effects that:
A new robot grasp planning method is provided that uses a learned grasp proposal network together with a learned three-dimensional shape reconstruction network, each built on a different improved convolutional neural network, exploiting their representation capability and generalization performance. Compared with pixel-space methods, it outputs a complete 6-degree-of-freedom grasp pose and can grasp occluded parts of an object that are not visible in the input image. The three-dimensional reconstruction network infers the shape of the invisible part of the object, and the grasp points are then projected onto the reconstructed point cloud, improving the grasp and its performance, so that the robot can accurately grasp both the hidden and the visible portions of an object. Such a combined system produces more accurate grasps than an image-based grasp proposal network alone. Apple picking efficiency is thereby greatly improved, and a new solution is provided for robot-based apple picking.
Drawings
FIG. 1 is a block diagram of the overall process of the method of the present invention.
Fig. 2 is a diagram of a grab recommendation network architecture.
Fig. 3 is a diagram of a depth residual network architecture.
Fig. 4 is a diagram of a three-dimensional reconstruction network structure.
FIG. 5 is a diagram of a three-dimensional convolution LSTM structure.
Detailed Description
As shown in fig. 1, the flow of the convolutional-network-based robotic apple picking method is as follows:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample;
(3) dividing the preprocessed picture sample into a training data set and a testing data set;
(4) constructing a convolution network and training by using a training data set;
(5) evaluating the trained convolutional network by using a test data set;
(6) the trained network is used for grabbing the apples in real time, so that the ripe fruits are accurately positioned, and grabbing performance is improved.
As shown in fig. 2, the grasp proposal network of the method of the present invention is as follows:
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers. The output is regressed to a 12-dimensional vector representing a homogeneous transformation, an estimate of the grasp pose relative to the camera. The first 3 values (t1, t2, t3) represent the desired (x, y, z) position of the gripper in the camera coordinate frame; the last 9 values represent a serialized three-dimensional rotation matrix. The training process of GPNet is as follows:
(4a) inputting picture samples into the GPNet network; the input of the network is (I_g, I_d), i.e. a gray-scale and a depth image, and each sample is paired with its corresponding ground-truth grasp t*;
(4c) computing the loss against the ground-truth grasp t*; the loss reflects the closeness of the prediction to the ground truth as a weighted sum of translational and rotational components, L = λ_T·L_T + λ_R·L_R, where L_T = ||t − t*||² is the squared Euclidean distance between the predicted and ground-truth translations, and L_R is the squared deviation from the identity of the product of the predicted rotation matrix and the transpose of the ground-truth rotation matrix. GPNet is trained on the data set with equal weights λ_T = λ_R, using an Adam optimizer with learning rate 1×10⁻⁴.
As shown in FIG. 3, the deep residual block structure of the method of the present invention is as follows:
ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 maximum pooling layer and 1 fully connected layer. Each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l), where x_l and x_{l+1} are the input and output of the l-th residual unit, F is the residual function and y_l denotes the learned residual, h(x_l) = x_l is the identity map, and f is the ReLU activation function, so that the features learned from a shallow layer l to a deep layer L are x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i).
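The residual-unit recursion can be illustrated numerically. This toy sketch replaces the two-layer convolutional branch F with a single linear map and uses the identity activation, so the telescoping sum x_L = x_l + Σ F(x_i, W_i) is easy to verify; all names are illustrative:

```python
import numpy as np

# Sketch of the residual unit x_{l+1} = f(h(x_l) + F(x_l, W_l)) with the
# identity shortcut h(x_l) = x_l. A toy linear branch stands in for the
# two 3x3 convolutional layers of a real residual block.

def residual_unit(x, W, f=lambda v: v):
    """One residual unit; f is the activation (identity here, ReLU in ResNet)."""
    return f(x + W @ x)      # shortcut x plus residual branch F(x, W) = W x

def stack(x, weights):
    """Chain several residual units, as in one ResNet stage."""
    for W in weights:
        x = residual_unit(x, W)
    return x
```

With identity activation, the output of two stacked units equals the input plus the sum of the residual-branch outputs computed along the way, which is exactly the shallow-to-deep feature formula above.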
As shown in fig. 4, the three-dimensional reconstruction network of the method of the present invention:
The invention uses a three-dimensional recursive reconstruction neural network (3D-R2N2), which can construct three-dimensional reconstructions of object instances from different viewpoints. Each 3D-R2N2 is composed of an encoder, a recursion unit and a decoder. The network works as follows:
(5a) giving a sample image to an input layer;
(5b) encoding the image into features using a CNN; the encoder consists of 2D convolutional layers, pooling layers, leaky rectified linear units, and a fully connected layer. The input image is learned layer by layer by the encoder to obtain its low-dimensional features. To improve the optimization of deeper networks, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the output is then flattened and passed to the fully connected layer. The encoder thus encodes the input image x into the low-dimensional features T(x);
(5c) inputting the feature vector from the encoder to the 3D-LSTM. The 3D-LSTM is composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighbors as input, and each cell is responsible for reconstructing a specific part of the final output. After passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it unchanged by closing the input gate. The three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, and forward propagation at each index position is: f_t = σ(W_f T(x_t) + U_f * h_{t−1} + b_f), i_t = σ(W_i T(x_t) + U_i * h_{t−1} + b_i), s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh(W_s T(x_t) + U_s * h_{t−1} + b_s), h_t = tanh(s_t);
where i_t and f_t denote the input gate and forget gate, and s_t and h_t denote the memory cell and hidden state. ⊙ denotes element-wise multiplication, and the subscript t denotes the activation at time t. W(·) and U(·) are the weights transforming the current input x_t and the previous hidden state h_{t−1}, b(·) denotes the bias, and * denotes the convolution operation. Unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end; removing the redundant output gates reduces the number of parameters;
(5d) decoding the hidden states of the LSTM units with a decoder to generate a 3D probabilistic voxel reconstruction. The decoder is a three-dimensional deconvolutional neural network: a simple 5-layer convolutional decoder network augmented with a deep residual network of 4 residual connections. The hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution, nonlinear rectification and unpooling layers; a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) the loss function of the network is defined as the sum of voxel-wise cross-entropies. Let the final output at each voxel (i, j, k) be the Bernoulli distribution [1 − p(i,j,k), p(i,j,k)], conditioned on the input sequence χ = {x_t}, t ∈ {1, …, T}, and let the corresponding ground-truth occupancy be y(i,j,k) ∈ {0, 1}; then L(χ, y) = −Σ [ y(i,j,k) log p(i,j,k) + (1 − y(i,j,k)) log(1 − p(i,j,k)) ].
As shown in FIG. 5, the 3D-LSTM network architecture of the present invention is as follows:
The 3D-LSTM is composed of a set of structured LSTM cells with constrained connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighbors as input. Each cell is responsible for reconstructing a specific part of the final output; after passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it unchanged by closing the input gate. The three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, and forward propagation at each index position is: f_t = σ(W_f T(x_t) + U_f * h_{t−1} + b_f), i_t = σ(W_i T(x_t) + U_i * h_{t−1} + b_i), s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh(W_s T(x_t) + U_s * h_{t−1} + b_s), h_t = tanh(s_t)
where i_t and f_t denote the input gate and forget gate, and s_t and h_t denote the memory cell and hidden state. ⊙ denotes element-wise multiplication, and the subscript t denotes the activation at time t. W(·) and U(·) are the weights transforming the current input x_t and the previous hidden state h_{t−1}, b(·) denotes the bias, and * denotes the convolution operation. Unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end. Removing the redundant output gates reduces the number of parameters.
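A single-cell numerical sketch of the gate equations above; for one cell the W T(x_t) and U * h_{t−1} convolutions reduce to matrix products, which is a simplifying assumption, and all names are illustrative:

```python
import numpy as np

# Sketch of one 3D-LSTM cell update without an output gate, following the
# forward-propagation equations above: h_t = tanh(s_t) directly.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm3d_step(x, h_prev, s_prev, Wf, Uf, bf, Wi, Ui, bi, Ws, Us, bs):
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)                 # forget gate f_t
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)                 # input gate i_t
    s = f * s_prev + i * np.tanh(Ws @ x + Us @ h_prev + bs)  # memory s_t
    h = np.tanh(s)                                         # no output gate
    return h, s
```

With all weights and biases zero, both gates are σ(0) = 0.5, so the cell simply halves its previous memory, which makes the update easy to check by hand.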
Claims (1)
1. An apple picking method based on an improved deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) performing image segmentation by a color space reference table method and removing the background to obtain an image containing the apple region. Since the picking robot is strongly affected by changes in the light intensity of the working environment, the HSV, YCbCr and L*a*b* color models are considered, and from each the two luminance-independent components are taken (H and S; Cr and Cb; a* and b*), each pair forming a two-dimensional color space. The reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (e.g. HSV or L*a*b*) and mapping each component to the range 0–255;
(2f) counting the sample pixels in the specified two-dimensional color space (e.g. H–S or a*–b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding it to obtain a binary image;
(2h) performing a series of mathematical morphology operations on the binary image obtained in step (2g) using improved dilation and erosion algorithms; the two-dimensional array corresponding to the resulting binary image is the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), count the number of target pixels in its 5×5 neighborhood; if the count exceeds half, treat the pixel as a target pixel, otherwise as a non-target pixel; then use a region labeling algorithm to find the region of each fruit and compute the bounding rectangle of each region, completing the extraction of the apple targets;
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) the method of the present invention comprises two neural networks whose outputs are combined by an optimization module to plan grasping. The grasp proposal network GPNet outputs a grasp pose relative to the camera frame, cTζ ∈ SE(3); the three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts. The outputs of the two networks are combined by projecting the grasp proposal cTζ onto the nearest point of the reconstructed point cloud, yielding a refined grasp proposal cTζ+. Since the pose of the camera relative to the manipulator is known, a grasp in the camera frame can be transformed into the robot frame for the robot to execute:
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned grayscale and depth images, followed by two fully connected layers. The output regresses to a 12-dimensional vector representing a homogeneous transformation, an estimate of the grasp pose relative to the camera: the first 3 values (t1, t2, t3) give the (x, y, z) position of the gripper in the camera coordinate frame, and the last 9 values give a serialized three-dimensional rotation matrix. ResNet-34 consists essentially of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer; each residual block consists of two 3×3 convolutional layers and two ReLU units. Each residual unit can be expressed as x_{l+1} = f(h(x_l) + F(x_l, W_l)), where x_l and x_{l+1} denote the input and output of the l-th residual unit, F is the residual function denoting the learned residual, h(x_l) = x_l denotes the identity mapping, and f is the ReLU activation function; the feature learned from a shallow layer l to a deep layer L is therefore x_L = x_l + Σ_{i=l}^{L−1} F(x_i, W_i).
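The residual unit described above can be illustrated numerically. A minimal NumPy sketch, with fully connected weights standing in for the 3×3 convolutions (names illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, W1, W2):
    """x_{l+1} = f(h(x_l) + F(x_l, W_l)): h is the identity map,
    F is two weight layers with a ReLU in between, f a final ReLU."""
    residual = W2 @ relu(W1 @ x)   # F(x_l, W_l): the learned residual
    return relu(x + residual)      # identity shortcut + residual, then f

# With zero weights the residual vanishes and the unit reduces to the
# identity (for non-negative inputs), which is why deep stacks of such
# units remain easy to optimize.
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))
out = residual_unit(x, W, W)
```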
After GPNet is constructed, the grasp proposal network is trained with the training data set; the training steps are as follows:
(4a) inputting the picture samples into the GPNet network; the input of the network is (Ig, Id), i.e. the grayscale and depth images, and each sample carries the corresponding ground-truth grasp t*;
(4c) according to the ground-truth grasp t*, calculating the loss, which measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components: L = λT·LT + λR·LR, where LT = ||t − t*||² is the squared Euclidean distance loss and LR = ||R(R*)ᵀ − I||² is the squared deviation from the identity of the product of the predicted rotation matrix and the transpose of the ground-truth rotation matrix. GPNet is trained on the data set with the weights λT = λR and an Adam optimizer with learning rate 1×10⁻⁴;
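A sketch of the weighted grasp loss, assuming the rotational term measures the deviation of the product of the predicted rotation and the transposed ground-truth rotation from the identity (an interpretation of the description above), with λT = λR = 1:

```python
import numpy as np

def grasp_loss(t_pred, R_pred, t_gt, R_gt, lam_t=1.0, lam_r=1.0):
    """Weighted sum of translational and rotational loss components."""
    loss_t = np.sum((t_pred - t_gt) ** 2)      # squared Euclidean distance
    dev = R_pred @ R_gt.T - np.eye(3)          # deviation from identity
    loss_r = np.sum(dev ** 2)                  # squared Frobenius norm
    return lam_t * loss_t + lam_r * loss_r

# Demo: a perfect prediction gives zero loss; a 1-unit offset on each
# translation axis contributes 3.0 to the translational term.
t_star = np.array([0.1, 0.2, 0.3])
R_star = np.eye(3)
perfect = grasp_loss(t_star, R_star, t_star, R_star)
off = grasp_loss(t_star + 1.0, R_star, t_star, R_star)
```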
(5) the invention constructs a three-dimensional reconstruction network SRNet using a three-dimensional recursive reconstruction neural network (3D-R2N2), which can build three-dimensional reconstructions of object instances viewed from different angles. Each 3D-R2N2 comprises an encoder, a 3D-LSTM and a decoder; the working process of the network is as follows:
(5a) giving a sample image to an input layer;
(5b) encoding the image into features with a CNN: the encoder consists of 2D convolutional layers, pooling layers, leaky rectified linear units and a fully connected layer, and the input image is learned layer by layer through the encoder to obtain its low-dimensional features. To improve the optimization of the deeper network, residual connections are added between the standard layers of the encoder; to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network. The flattened output is then passed to the fully connected layer, so that the encoder encodes the input image x into a low-dimensional feature T(x);
(5c) inputting the feature map from the encoder into the 3D-LSTM. The 3D-LSTM is composed of a set of structured LSTM cells with constrained connections: each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighbors as input, and each cell is responsible for reconstructing a specific part of the final output. After passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or hold it by closing the input gate. The cells form a three-dimensional grid of N×N×N 3D-LSTM units, where N is the spatial resolution of the 3D-LSTM grid. Forward propagation at each index position is: f_t = σ(W_f T(x_t) + U_f ∗ h_{t−1} + b_f), i_t = σ(W_i T(x_t) + U_i ∗ h_{t−1} + b_i), s_t = f_t ⊙ s_{t−1} + i_t ⊙ tanh(W_s T(x_t) + U_s ∗ h_{t−1} + b_s), h_t = tanh(s_t);
where i_t and f_t denote the input gate and forget gate respectively, s_t and h_t denote the memory cell and the hidden state respectively, ⊙ denotes element-wise multiplication, the subscript t denotes the activation at time t, W(·) and U(·) transform the current input x_t and the previous hidden state h_{t−1} respectively, b(·) denotes the bias, and ∗ denotes the convolution operation. Unlike a standard LSTM, this network has no output gate, since the output is only extracted at the end; removing the redundant output gate reduces the number of parameters;
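One forward step of a single 3D-LSTM cell can be sketched as follows, with plain matrix products standing in for the convolutions over neighboring hidden states, and no output gate, as described above (all names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm3d_step(x_feat, h_prev, s_prev, p):
    """One update of a 3D-LSTM cell without an output gate:
    f_t and i_t gate the memory s_t; the hidden state is h_t = tanh(s_t)."""
    f = sigmoid(p['Wf'] @ x_feat + p['Uf'] @ h_prev + p['bf'])  # forget gate
    i = sigmoid(p['Wi'] @ x_feat + p['Ui'] @ h_prev + p['bi'])  # input gate
    s = f * s_prev + i * np.tanh(p['Ws'] @ x_feat + p['Us'] @ h_prev + p['bs'])
    h = np.tanh(s)                                              # no output gate
    return h, s

# Demo: with all-zero parameters, both gates sit at sigmoid(0) = 0.5,
# so the memory simply halves: s_t = 0.5 * s_{t-1}.
n = 4
p = {k: np.zeros((n, n)) for k in ('Wf', 'Uf', 'Wi', 'Ui', 'Ws', 'Us')}
p.update({k: np.zeros(n) for k in ('bf', 'bi', 'bs')})
h, s = lstm3d_step(np.ones(n), np.zeros(n), np.ones(n), p)
```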
(5d) the hidden state of the LSTM units is decoded by a decoder, generating a 3D probabilistic voxel reconstruction. The decoder is a three-dimensional deconvolutional neural network: a simple 5-layer convolutional decoder network is used, with a deep residual network of 4 residual connections added. The hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution layers, nonlinear rectification layers and unpooling layers; a final activation layer converts the output, via an activation function, into the occupancy probability of the voxel at each position;
(5e) the loss function of the network is defined as the sum of voxel-wise cross-entropies. The final output at each voxel (i, j, k) is a Bernoulli distribution [1 − p(i,j,k), p(i,j,k)] conditioned on the input X = {x_t}, t ∈ {1, ..., T}. Letting the corresponding ground-truth occupancy be y(i,j,k) ∈ {0, 1}, the loss is L(X, y) = −Σ_{i,j,k} [y(i,j,k) log(p(i,j,k)) + (1 − y(i,j,k)) log(1 − p(i,j,k))];
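The voxel-wise cross-entropy can be sketched directly, written with the conventional leading minus sign so that better predictions yield a lower loss:

```python
import numpy as np

def voxel_cross_entropy(p, y, eps=1e-7):
    """Sum over voxels (i, j, k) of the Bernoulli cross-entropy between
    predicted occupancy p and ground-truth occupancy y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)   # numerical safety near 0 and 1
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Demo: a perfect prediction gives (numerically) zero loss, while an
# uninformative uniform prediction gives a strictly larger one.
y = np.array([[[1.0, 0.0], [0.0, 1.0]]])
perfect = voxel_cross_entropy(y, y)
uniform = voxel_cross_entropy(np.full_like(y, 0.5), y)
```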
(6) optimizing the grasp: in order for the robot to grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is as follows:
(6a) taking a point set p_i ∈ P in the target point cloud P;
(6b) finding the corresponding point set q_i ∈ Q in the source point cloud Q such that ||q_i − p_i|| = min;
(6c) Calculating a rotation matrix R and a translation matrix t to minimize an error function;
(6d) applying the rotation matrix R and translation matrix t obtained in the previous step to p_i, obtaining a new corresponding point set p_i' = {p_i' = R·p_i + t, p_i ∈ P};
(6f) if d is smaller than a given threshold, or the number of iterations exceeds the preset maximum, stopping the iterative computation; otherwise, returning to step (6b) until the convergence condition is met;
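Steps (6a)–(6f) can be sketched as a standard SVD-based ICP loop (brute-force nearest neighbors; the convergence quantity d is assumed here to be the mean point-pair distance, since its definition is elided in the text; all names are illustrative):

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t aligning P onto Q (Kabsch/SVD)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                            # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def icp(P, Q, max_iter=50, tol=1e-6):
    """Iterate: match nearest points (6b), estimate R, t (6c), transform P (6d),
    and stop when the mean distance d falls below tol or iterations run out (6f)."""
    P = P.copy()
    d = np.inf
    for _ in range(max_iter):
        idx = np.argmin(((P[:, None] - Q[None]) ** 2).sum(-1), axis=1)
        R, t = best_rigid_transform(P, Q[idx])
        P = P @ R.T + t
        d = np.linalg.norm(P - Q[idx], axis=1).mean()    # assumed metric for d
        if d < tol:
            break
    return P, d

# Demo: a grid cloud displaced by a small rigid motion is recovered exactly.
grid = np.array(np.meshgrid(range(3), range(3), range(3))).reshape(3, -1).T.astype(float)
a = 0.05
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
moved = grid @ Rz.T + np.array([0.03, -0.02, 0.01])
aligned, d = icp(moved, grid)
```

Because the displacement is small relative to the grid spacing, the nearest-neighbor matching is correct on the first pass and the Kabsch step aligns the clouds in one iteration.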
(5) testing the network with the 100 pictures obtained in step (3) and verifying it;
(6) using the trained network to grasp apples in real time, so that ripe fruit is accurately located and grasping performance is improved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110031817.5A CN112734727A (en) | 2021-01-11 | 2021-01-11 | Apple picking method based on improved deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112734727A true CN112734727A (en) | 2021-04-30 |
Family
ID=75590438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110031817.5A Pending CN112734727A (en) | 2021-01-11 | 2021-01-11 | Apple picking method based on improved deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112734727A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102914967A (en) * | 2012-09-21 | 2013-02-06 | 浙江工业大学 | Autonomous navigation and man-machine coordination picking operating system of picking robot |
CN105279789A (en) * | 2015-11-18 | 2016-01-27 | 中国兵器工业计算机应用技术研究所 | A three-dimensional reconstruction method based on image sequences |
CN108247637A (en) * | 2018-01-24 | 2018-07-06 | 中南大学 | A kind of industrial machine human arm vision anticollision control method |
CN108510062A (en) * | 2018-03-29 | 2018-09-07 | 东南大学 | A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network |
CN109241964A (en) * | 2018-08-17 | 2019-01-18 | 上海非夕机器人科技有限公司 | The acquisition methods and equipment of the crawl point of mechanical arm |
CN109702741A (en) * | 2018-12-26 | 2019-05-03 | 中国科学院电子学研究所 | Mechanical arm visual grasping system and method based on self-supervisory learning neural network |
CN109934864A (en) * | 2019-03-14 | 2019-06-25 | 东北大学 | Residual error network depth learning method towards mechanical arm crawl pose estimation |
CN109948514A (en) * | 2019-03-15 | 2019-06-28 | 中国科学院宁波材料技术与工程研究所 | Workpiece based on single goal three-dimensional reconstruction quickly identifies and localization method |
CN110509273A (en) * | 2019-08-16 | 2019-11-29 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | The robot mechanical arm of view-based access control model deep learning feature detects and grasping means |
CN110910452A (en) * | 2019-11-26 | 2020-03-24 | 上海交通大学 | Low-texture industrial part pose estimation method based on deep learning |
Non-Patent Citations (4)
Title |
---|
C. B. CHOY et al.: "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction", 《EUROPEAN CONFERENCE ON COMPUTER VISION》 *
DANIEL YANG et al.: "Robotic Grasping through Combined Image-Based Grasp Proposal and 3D Reconstruction", 《ARXIV:2003.01649V3 [CS.RO] 6 NOV 2020》 *
ZHANG Tiezhong et al.: "Target extraction for the vision system of fruit-picking robots", 《Journal of China Agricultural University》 *
WU Yuwei (ed.): "Fundamentals and Applications of Deep Learning", 30 April 2020, Beijing: Beijing Institute of Technology Press *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160392A (en) * | 2021-05-14 | 2021-07-23 | 电子科技大学成都学院 | Optical building target three-dimensional reconstruction method based on deep neural network |
CN113160392B (en) * | 2021-05-14 | 2022-03-01 | 电子科技大学成都学院 | Optical building target three-dimensional reconstruction method based on deep neural network |
CN113099847A (en) * | 2021-05-25 | 2021-07-13 | 广东技术师范大学 | Fruit picking method based on fruit three-dimensional parameter prediction model |
CN113743287A (en) * | 2021-08-31 | 2021-12-03 | 之江实验室 | Robot self-adaptive grabbing control method and system based on impulse neural network |
CN113743287B (en) * | 2021-08-31 | 2024-03-26 | 之江实验室 | Robot self-adaptive grabbing control method and system based on impulse neural network |
CN113575111A (en) * | 2021-09-01 | 2021-11-02 | 南京农业大学 | Real-time identification positioning and intelligent picking device for greenhouse tomatoes |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112734727A (en) | Apple picking method based on improved deep neural network | |
CN111950649B (en) | Attention mechanism and capsule network-based low-illumination image classification method | |
WO2022036777A1 (en) | Method and device for intelligent estimation of human body movement posture based on convolutional neural network | |
CN110032925B (en) | Gesture image segmentation and recognition method based on improved capsule network and algorithm | |
CN108304826A (en) | Facial expression recognizing method based on convolutional neural networks | |
CN111161364B (en) | Real-time shape completion and attitude estimation method for single-view depth map | |
CN108388896A (en) | A kind of licence plate recognition method based on dynamic time sequence convolutional neural networks | |
CN110880165A (en) | Image defogging method based on contour and color feature fusion coding | |
CN110580472B (en) | Video foreground detection method based on full convolution network and conditional countermeasure network | |
CN111489394B (en) | Object posture estimation model training method, system, device and medium | |
CN112257766A (en) | Shadow recognition detection method under natural scene based on frequency domain filtering processing | |
CN108876907A (en) | A kind of active three-dimensional rebuilding method of object-oriented object | |
CN104408697B (en) | Image Super-resolution Reconstruction method based on genetic algorithm and canonical prior model | |
CN115588237A (en) | Three-dimensional hand posture estimation method based on monocular RGB image | |
CN114049314A (en) | Medical image segmentation method based on feature rearrangement and gated axial attention | |
CN116258757A (en) | Monocular image depth estimation method based on multi-scale cross attention | |
CN113822825B (en) | Optical building target three-dimensional reconstruction method based on 3D-R2N2 | |
Hirner et al. | FC-DCNN: A densely connected neural network for stereo estimation | |
Li et al. | RoadFormer: Duplex Transformer for RGB-normal semantic road scene parsing | |
Basak et al. | Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image | |
CN114972794A (en) | Three-dimensional object recognition method based on multi-view Pooll transducer | |
Yan et al. | Cascaded transformer U-net for image restoration | |
CN114022392A (en) | Serial attention-enhancing UNet + + defogging network for defogging single image | |
CN114170304A (en) | Camera positioning method based on multi-head self-attention and replacement attention | |
Oza et al. | Semi-supervised image-to-image translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210430 |