CN112734727A - Apple picking method based on improved deep neural network - Google Patents

Apple picking method based on improved deep neural network

Info

Publication number
CN112734727A
CN112734727A
Authority
CN
China
Prior art keywords
network
grabbing
dimensional
layer
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110031817.5A
Other languages
Chinese (zh)
Inventor
李静
黄友锐
韩涛
兰世豪
江灵雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202110031817.5A priority Critical patent/CN112734727A/en
Publication of CN112734727A publication Critical patent/CN112734727A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • A01D 91/04: Methods for harvesting agricultural products growing above the soil
    • G06N 3/048: Neural networks; activation functions
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural networks; learning methods
    • G06T 7/11: Region-based segmentation
    • G06T 7/194: Segmentation involving foreground-background segmentation
    • G06T 7/55: Depth or shape recovery from multiple images
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/10024: Image acquisition modality: color image
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20084: Special algorithmic details: artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Environmental Sciences (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fruit grasping method for an apple picking robot based on a deep neural network. It improves on traditional neural networks by constructing a grasp proposal network and a three-dimensional reconstruction network built from a residual network and a recurrent neural network (LSTM). The method comprises: collecting pictures of apple fruits in the field; extracting sample pictures from them and preprocessing the samples; constructing the grasp proposal network and the three-dimensional reconstruction network; training both networks with a training data set; evaluating the grasp proposal network with a test data set; and using the trained networks to pick apples in real time. The fruit targets are thereby identified and located in three-dimensional space, motion parameters are provided for the manipulator, and accurate picking of agricultural fruits is achieved. Refining the grasp with the convolutional neural networks improves grasping performance, greatly increases apple picking efficiency, and provides a new solution for robot-based apple picking.

Description

Apple picking method based on improved deep neural network
Technical Field
The invention relates to a grasping method for an apple picking robot based on improved convolutional neural network image recognition technology.
Background
At present, fruit picking requires a large amount of labor. As China's population ages, the rural labor force keeps shrinking and production costs keep rising, which limits the development of the whole fruit industry; in addition, manual picking is affected by subjective human factors, which can reduce picking quality. With the rapid development of computers and image processing technology, intelligent fruit picking machines reduce the labor intensity and production costs of fruit growers, can effectively increase the production efficiency of the whole industry, and meet the needs of the fruit industry's development. In picking robots, digital image processing is applied mainly to identify fruit targets and locate them in three-dimensional space, providing motion parameters for the manipulator and thereby enabling accurate picking of agricultural fruits. Aiming at the large labor demand and low efficiency of existing picking, the invention develops an intelligent fruit picking method based on deep learning that effectively improves fruit picking efficiency and contributes to agricultural production.
Disclosure of Invention
The invention provides an apple picking method based on a deep neural network, which uses image processing technology and convolutional neural networks to accurately grasp ripe apples in the field. The convolutional neural networks are used to construct a grasp proposal network and a three-dimensional reconstruction network for apple fruits, which provide motion parameters for the manipulator and thereby enable accurate picking of agricultural fruits. Labor intensity and production costs are reduced, and the production efficiency of the whole industry can be effectively increased.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the apple picking method based on the deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) performing image segmentation with a color space reference table method and removing the background in the image to obtain an image containing the apple region. Since the picking robot is strongly affected by changes in the light intensity of the working environment, for each of the three color models HSV, YCbCr and L*a*b*, the two luminance-independent components (H and S; Cr and Cb; a* and b*) are taken, forming a two-dimensional color space. The reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (such as HSV or L*a*b*), and mapping each component to the range 0-255;
(2f) counting the sample pixels in the specified two-dimensional color space (such as H-S or a*-b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding the map to obtain a binary image;
(2h) applying a series of mathematical morphology operations (an improved dilation and erosion algorithm) to the binary image obtained in step (2g); the two-dimensional array corresponding to the resulting binary image is the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), count the number of target pixels in its 5×5 neighborhood; if the count exceeds half, the pixel is treated as a target pixel, otherwise as a non-target pixel. Then a region labeling algorithm is used to find the region of each fruit, and the bounding rectangle of each region is obtained, completing the extraction of the apple targets (a minimal code sketch of this segmentation pipeline is given below);
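The color-reference-table segmentation and target extraction above can be sketched as follows. This is a minimal sketch using OpenCV and NumPy; the use of the H-S plane, the density threshold, the 3×3 structuring element and all parameter values are illustrative assumptions rather than values prescribed by the invention:

```python
import cv2
import numpy as np

def build_reference_table(samples_bgr: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Build a 256x256 H-S reference table from an image of apple-colored sample pixels."""
    table = np.zeros((256, 256), dtype=np.uint8)                 # step (2d)
    hsv = cv2.cvtColor(samples_bgr, cv2.COLOR_BGR2HSV_FULL)      # step (2e): H, S in 0..255
    h, s = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    density = np.zeros((256, 256), dtype=np.int32)
    np.add.at(density, (h, s), 1)                                # step (2f): 2D color histogram
    table[density > thresh] = 1                                  # step (2g): threshold
    kernel = np.ones((3, 3), np.uint8)
    table = cv2.morphologyEx(table, cv2.MORPH_CLOSE, kernel)     # step (2h): dilation/erosion
    return table

def segment_apples(image_bgr: np.ndarray, table: np.ndarray):
    """Look up each pixel in the table, apply the 5x5 majority vote, and box each fruit."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV_FULL)
    mask = table[hsv[..., 0], hsv[..., 1]]                       # per-pixel table lookup
    votes = cv2.boxFilter(mask.astype(np.float32), -1, (5, 5), normalize=True)
    mask = (votes > 0.5).astype(np.uint8)                        # step (2i): majority vote
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask) # region labeling
    return [tuple(stats[i, :4]) for i in range(1, n)]            # bounding boxes (x, y, w, h)
```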
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) constructing the convolutional neural networks and training them with the training data set. The method comprises two neural networks whose outputs are combined by an optimization module to plan the grasp. The grasp proposal network GPNet outputs a grasp pose relative to the camera frame, $^{c}T_{\zeta} \in SE(3)$. The three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts. The two outputs are combined by projecting the grasp proposal $^{c}T_{\zeta}$ onto the nearest point of the reconstructed point cloud, which yields the refined grasp proposal $^{c}T_{\zeta^{+}}$. Since the pose of the camera relative to the manipulator, $^{r}T_{c}$, is known, the camera-frame grasp can be converted into the robot frame for the robot to execute:

$$^{r}T_{\zeta^{+}} = {}^{r}T_{c}\;{}^{c}T_{\zeta^{+}}$$
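To illustrate this frame conversion, a minimal NumPy sketch follows; the 4×4 homogeneous-matrix layout and the example poses are assumptions for illustration, not values from the invention:

```python
import numpy as np

def make_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# rTc: camera pose in the robot frame (known from calibration; example values)
rTc = make_transform(np.eye(3), np.array([0.1, 0.0, 0.5]))
# cTg: grasp pose proposed by GPNet in the camera frame (example values)
cTg = make_transform(np.eye(3), np.array([0.0, 0.05, 0.4]))
# Convert the grasp into the robot frame: rTg = rTc @ cTg
rTg = rTc @ cTg
```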
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers. The output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera. The first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame; the last 9 values represent a serialized three-dimensional rotation matrix. ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer. Each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:
$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i).$$
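This residual unit can be sketched in code as follows; a minimal PyTorch sketch of the identity-mapping form above, with channel counts and layer choices as illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual unit: x_{l+1} = ReLU(x_l + F(x_l)), with F = two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x_l, W_l)
        return self.relu(x + residual)                   # f(h(x_l) + F(x_l, W_l))
```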
After GPNet is constructed, the grasp proposal network is trained with the training data set; the training steps are as follows:
(4a) input the picture samples into the GPNet network; the network input is $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtain the predicted value $\hat{t}$ by forward computation through the network;
(4c) compute the loss with respect to the ground-truth grasp $t^*$; it measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity.
In training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$;
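The weighted pose loss of step (4c) can be sketched as follows; a minimal PyTorch sketch under the reconstruction above, where the 12-value vector layout follows the text and the default weights are assumptions:

```python
import torch

def grasp_loss(t_pred: torch.Tensor, t_true: torch.Tensor,
               lam_t: float = 1.0, lam_r: float = 1.0) -> torch.Tensor:
    """Weighted pose loss on a 12-d grasp vector: 3 translation values + 9 rotation values."""
    trans_pred, trans_true = t_pred[:3], t_true[:3]
    R_pred = t_pred[3:].reshape(3, 3)
    R_true = t_true[3:].reshape(3, 3)
    loss_t = torch.sum((trans_true - trans_pred) ** 2)       # squared Euclidean distance
    eye = torch.eye(3, dtype=t_pred.dtype, device=t_pred.device)
    loss_r = torch.sum((R_pred @ R_true.T - eye) ** 2)       # rotation deviation from identity
    return lam_t * loss_t + lam_r * loss_r
```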
(5) the invention constructs a three-dimensional reconstruction network, SRNet, using a three-dimensional recursive reconstruction neural network (3D-R2N2) that can build three-dimensional reconstructions of object instances viewed from different angles. Each 3D-R2N2 comprises an encoder, a 3D-LSTM and a decoder, and the network works as follows:
(5a) a sample image is given to the input layer;
(5b) the image is encoded into features using a CNN. The encoder consists of 2D convolutional layers, pooling layers, leaky rectified linear units and a fully connected layer. The input image is learned layer by layer through the encoder to obtain its low-dimensional features; to improve the optimization of the deeper network, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the flattened output is then passed to the fully connected layer. The encoder thus encodes the input image $X$ into the low-dimensional features $T(X)$;
(5c) the feature map from the encoder is input to a 3D-LSTM. The 3D-LSTM is composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighborhood as input, and each cell is responsible for reconstructing a specific part of the final output. After passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it by closing the input gate. The three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid; the forward propagation at each index position is

$$f_t = \sigma(W_f T(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i T(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s T(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

where $i_t$ and $f_t$ denote the input gate and the forget gate respectively, $s_t$ and $h_t$ denote the memory cell and the hidden state respectively, $\odot$ denotes element-wise multiplication, and the subscript $t$ denotes the activation at time $t$; $W(\cdot)$ and $U(\cdot)$ are the weights transforming the current input $x_t$ and the previous hidden state $h_{t-1}$ respectively, $b(\cdot)$ denotes the bias, and $*$ denotes the convolution operation. Unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end; removing the redundant output gates reduces the number of parameters;
(5d) the hidden state of the LSTM units is decoded by a decoder to generate a 3D probabilistic voxel reconstruction. The decoder is a three-dimensional deconvolutional neural network: a simple 5-convolution decoder network augmented with a depth residual network of 4 residual connections. The hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution layers, nonlinear rectification layers and unpooling layers, and a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) the loss function of the network is defined as the sum of voxel cross-entropies. Let the final output at each voxel $(i, j, k)$ be the Bernoulli distribution $[1 - p_{(i,j,k)},\ p_{(i,j,k)}]$, where the dependence on the input $\mathcal{X} = \{x_t\}_{t \in \{1, \dots, T\}}$ is omitted, and let the corresponding ground-truth occupancy be $y_{(i,j,k)} \in \{0, 1\}$; then

$$L(\mathcal{X}, y) = \sum_{i,j,k} \left[ y_{(i,j,k)} \log p_{(i,j,k)} + (1 - y_{(i,j,k)}) \log(1 - p_{(i,j,k)}) \right]$$

(a minimal sketch of this voxel loss follows below);
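A minimal sketch of the voxel loss, written as the negative log-likelihood that would actually be minimized in training; the tensor shapes are assumptions:

```python
import torch

def voxel_cross_entropy(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sum of per-voxel binary cross-entropies.

    p: predicted occupancy probabilities, shape (N, N, N)
    y: ground-truth occupancy in {0, 1},  shape (N, N, N)
    """
    p = p.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(y * torch.log(p) + (1.0 - y) * torch.log(1.0 - p)).sum()
```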
(6) optimizing the grasp: to let the robot grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is as follows (a minimal code sketch follows the steps below):
(6a) take a point set $p_i \in P$ from the target point cloud $P$;
(6b) find the corresponding point set $q_i \in Q$ in the source point cloud $Q$ such that $\|q_i - p_i\| = \min$;
(6c) compute the rotation matrix $R$ and translation vector $t$ that minimize the error function;
(6d) rotate and translate $p_i$ with the $R$ and $t$ obtained in the previous step to obtain the new corresponding point set $p_i' = \{p_i' = R p_i + t,\ p_i \in P\}$;
(6e) compute the average distance between $p_i'$ and the corresponding point set $q_i$:

$$d = \frac{1}{n} \sum_{i=1}^{n} \|p_i' - q_i\|^2;$$

(6f) if $d$ is smaller than a given threshold, or the number of iterations exceeds the preset maximum, stop the iteration; otherwise return to step (6b) until the convergence condition is met;
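A minimal NumPy sketch of this ICP loop, using brute-force nearest-neighbour correspondence and an SVD-based rigid-transform step; the array shapes, iteration cap and tolerance are assumptions:

```python
import numpy as np

def icp(P: np.ndarray, Q: np.ndarray, max_iter: int = 50, tol: float = 1e-6):
    """Align target points P (n,3) to source cloud Q (m,3); returns rotation R and translation t."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(max_iter):
        P_trans = P @ R.T + t
        # (6b) nearest neighbour of each transformed point in Q
        dists = np.linalg.norm(P_trans[:, None, :] - Q[None, :, :], axis=2)
        q = Q[np.argmin(dists, axis=1)]
        # (6c) best rigid transform via SVD of the cross-covariance
        p_c, q_c = P_trans.mean(0), q.mean(0)
        U, _, Vt = np.linalg.svd((P_trans - p_c).T @ (q - q_c))
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:           # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = q_c - R_step @ p_c
        R, t = R_step @ R, R_step @ t + t_step  # (6d) accumulate the transform
        # (6e)/(6f) mean squared distance as the stopping criterion
        d = np.mean(np.sum((P @ R.T + t - q) ** 2, axis=1))
        if d < tol:
            break
    return R, t
```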
(7) testing the network with the 100 pictures obtained in step (3) to verify it;
(8) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
The invention has the beneficial effects that:
A new robot grasp planning method is provided that uses a learned grasp proposal network and a learned three-dimensional shape reconstruction network simultaneously. The two networks use two different improved convolutional neural networks, exploiting their representational capacity and generalization performance. Compared with pixel-space methods, the method outputs a complete 6-degree-of-freedom grasp pose and can grasp occluded parts of an object that are not visible in the input image: the three-dimensional reconstruction network infers the shape of the invisible part of the object, and the grasp points are then projected onto the reconstructed point cloud, refining the grasp and improving grasping performance. The robot can thus accurately grasp both the hidden and the visible portions of an object. Such a combined system produces more accurate grasps than an image-based grasp proposal network alone. Apple picking efficiency is thereby greatly improved, and a new solution is provided for robot-based apple picking.
Drawings
FIG. 1 is a block diagram of the overall process of the method of the present invention.
Fig. 2 is a diagram of the grasp proposal network architecture.
Fig. 3 is a diagram of a depth residual network architecture.
Fig. 4 is a diagram of a three-dimensional reconstruction network structure.
FIG. 5 is a diagram of the three-dimensional convolutional LSTM structure.
Detailed Description
As shown in FIG. 1, the flow of the apple picking method based on the convolutional neural network is as follows:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample;
(3) dividing the preprocessed picture sample into a training data set and a testing data set;
(4) constructing a convolution network and training by using a training data set;
(5) evaluating the trained convolutional network by using a test data set;
(6) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
As shown in FIG. 2, the grasp proposal network of the method of the present invention is as follows:
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers. The output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera. The first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame, and the last 9 values represent a serialized three-dimensional rotation matrix. The training process of GPNet is as follows:
(4a) input the picture samples into the GPNet network; the network input is $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtain the predicted value $\hat{t}$ by forward computation through the network;
(4c) compute the loss with respect to the ground-truth grasp $t^*$; it measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity.
In training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$.
As shown in FIG. 3, the depth residual block structure of the method of the present invention is as follows:
ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer. Each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i).$$
As shown in FIG. 4, the three-dimensional reconstruction network of the method of the present invention is as follows:
the invention uses a three-dimensional recursive reconstruction neural network (3D-R2N2) that is capable of constructing three-dimensional reconstructions of object instances from different angular viewpoints. Each 3D-R2N2 is composed of an encoder, a recursion unit, and a decoder. The working process of the network is as follows:
(5a) giving a sample image to an input layer;
(5b) the image is coded using CNN as a feature, the encoder consists of a 2D convolutional layer, a pooling layer, a leakage correction linear unit, and a fully connected layer. The input image is learnt layer by layer through an encoder to obtain the low-dimensional characteristics of the image, residual connection is added between standard layers of the encoder in order to improve the optimization performance of a deeper network, meanwhile, in order to match the number of channels after convolution, 1X1 convolution is applied to a depth residual error network, and then output flattening is transmitted to a complete connection layer. The encoder thus encodes the input image X into the low-dimensional features t (X);
(5c) inputting the feature map from the encoder to a 3D-LSTM, the 3D-LSTM being composed of a set of structured LSTM cells with restricted connections, each cell receiving the same feature vector from the encoder by a 3 x 3 convolution and receiving as input a hidden state from its neighborhood, each cell being responsible for reconstructing a specific part of the final output, the encoding features and the hidden state having selectively updated cell states after passing through the 3D-LSTM or maintaining the states by closing the input gate. In a three-dimensional grid, with NxNxN 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, the process of forward propagation at each index position is ft=σ(WfT(xt)+Uf*ht-1+bf),it=σ(WiT(xt)+Ui*ht-1+bi),St=ft⊙st-1+it⊙tanh(WsT(xt)+Us*ht-1+bs),ht=tanh(st);
Wherein it,ftThe input gate and the forgetting gate are respectively indicated. stAnd htIndicating respectively the memory cell and the hidden state, with |, indicating the element multiplication and the subscript t indicating the activation at time t. W (-), U (-), respectively, are the transformed current inputs xtAnd previous hidden state ht-1B (-) represents the deviation. Denotes convolution operation. Unlike standard LSTM, this network has no output gates, since the output is only extracted at the end. By removing redundant output gates, the number of parameters can be reduced;
(5d) decoding the hidden state of the LSTM unit and generating a 3D probabilistic voxel reconstruction by a decoder; the decoder is a three-dimensional deconvolution neural network (3D-DNCC), a simple 5-convolution decoder network is used, a 4-residue connected depth residual error network is added, the hidden state from the 3D-LSTM is learned layer by layer through a deconvolution layer, a nonlinear correction layer and an inverse pooling layer of the decoder, and finally an activation layer is used, and the final output is converted into the occupation probability of a voxel at a certain position by using an activation function;
(5e) the loss function of the network is defined as the sum of voxel cross entropies. Let the final output of each voxel (i, j, k) be the Bernoulli distribution [1-p (i, j, k), p (i, j, k)]Wherein the input x ═ x is omittedt}t∈{1,...,T}Let the corresponding basic true-value occupancy be yi,j,k)E {0,1}, i.e., L (χ, y) ═ Σ y(i,j,k)log(p(i,j,k))+(1-y(i,j,k))log(1-p(i,j,k))。
As shown in FIG. 5, the 3D-LSTM network architecture of the invention is as follows:
the 3D-LSTM is composed of a set of structured LSTM cells with constrained connections, each receiving the same eigenvector from the encoder by a 3 × 3 × 3 convolution and from its neighborhoodAnd (3) taking hidden states as input, wherein each unit is responsible for reconstructing a specific part of final output, and the coding characteristics and the hidden states can selectively update the state of the unit after passing through the 3D-LSTM or keep the state by closing an input gate. In a three-dimensional grid, with NxNxN 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, the process of forward propagation at each index position is ft=σ(WfT(xt)+Uf*ht-1+bf),it=σ(WiT(xt)+Ui*ht-1+bi),St=ft⊙st-1+it⊙tanh(WsT(xt)+Us*ht-1+bs),ht=tanh(st)
Wherein it,ftThe input gate and the forgetting gate are respectively indicated. stAnd htRespectively, memory cell and hidden state. An element multiplication is indicated by [ < u >, and the subscript t indicates activation at time t. W (-), U (-), respectively, are the transformed current inputs xtAnd previous hidden state ht-1B (-) represents the deviation. Denotes convolution operation. Unlike standard LSTM, this network has no output gates, since the output is only extracted at the end. By removing redundant output gates, the number of parameters can be reduced.
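One update step of such a convolutional 3D-LSTM grid can be sketched as follows; a minimal PyTorch sketch with no output gate, as described above, where the use of nn.Conv3d for the neighbourhood connections, the channel counts and the grid handling are assumptions:

```python
import torch
import torch.nn as nn

class Conv3DLSTMCell(nn.Module):
    """One update of an N x N x N convolutional LSTM grid without an output gate."""
    def __init__(self, feat_dim: int, hidden_ch: int, grid: int):
        super().__init__()
        self.grid = grid
        self.hidden_ch = hidden_ch
        # W(.): map the shared image feature T(x_t) to per-cell gate inputs
        self.W_f = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        self.W_i = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        self.W_s = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        # U(.) * h_{t-1}: 3x3x3 convolutions over the neighbouring hidden states
        self.U_f = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.U_i = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.U_s = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)

    def forward(self, feat, h_prev, s_prev):
        shape = (-1, self.hidden_ch, self.grid, self.grid, self.grid)
        f = torch.sigmoid(self.W_f(feat).view(shape) + self.U_f(h_prev))  # forget gate
        i = torch.sigmoid(self.W_i(feat).view(shape) + self.U_i(h_prev))  # input gate
        s = f * s_prev + i * torch.tanh(self.W_s(feat).view(shape) + self.U_s(h_prev))
        h = torch.tanh(s)                                                 # no output gate
        return h, s
```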

Claims (1)

1. An apple picking method based on an improved deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) performing image segmentation with a color space reference table method and removing the background in the image to obtain an image containing the apple region; since the picking robot is strongly affected by changes in the light intensity of the working environment, for each of the three color models HSV, YCbCr and L*a*b*, the two luminance-independent components (H and S; Cr and Cb; a* and b*) are taken, forming a two-dimensional color space, and the reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (such as HSV or L*a*b*), and mapping each component to the range 0-255;
(2f) counting the sample pixels in the specified two-dimensional color space (such as H-S or a*-b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding the map to obtain a binary image;
(2h) applying a series of mathematical morphology operations (an improved dilation and erosion algorithm) to the binary image obtained in step (2g), the two-dimensional array corresponding to the resulting binary image being the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), counting the number of target pixels in its 5×5 neighborhood; if the count exceeds half, the pixel is treated as a target pixel, otherwise as a non-target pixel; then a region labeling algorithm is used to find the region of each fruit, and the bounding rectangle of each region is obtained, thereby completing the extraction of the apple targets;
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) the method comprises two neural networks whose outputs are combined by an optimization module to plan the grasp; the grasp proposal network GPNet outputs a grasp pose relative to the camera frame, $^{c}T_{\zeta} \in SE(3)$; the three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts; the two outputs are combined by projecting the grasp proposal $^{c}T_{\zeta}$ onto the nearest point of the reconstructed point cloud, yielding the refined grasp proposal $^{c}T_{\zeta^{+}}$; since the pose of the camera relative to the manipulator, $^{r}T_{c}$, is known, the camera-frame grasp can be converted into the robot frame for the robot to execute:

$$^{r}T_{\zeta^{+}} = {}^{r}T_{c}\;{}^{c}T_{\zeta^{+}}$$

the architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers; the output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera; the first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame, and the last 9 values represent a serialized three-dimensional rotation matrix; ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer; each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i);$$
after GPNet is constructed, the grasp proposal network is trained with the training data set; the training steps are as follows:
(4a) inputting the picture samples into the GPNet network, the network input being $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtaining the predicted value $\hat{t}$ by forward computation through the network;
(4c) computing the loss with respect to the ground-truth grasp $t^*$, which measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity;
in training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$;
(5) the invention constructs a three-dimensional reconstruction network, SRNet, using a three-dimensional recursive reconstruction neural network (3D-R2N2) that can build three-dimensional reconstructions of object instances viewed from different angles, each 3D-R2N2 comprising an encoder, a 3D-LSTM and a decoder, and the network working as follows:
(5a) giving a sample image to the input layer;
(5b) encoding the image into features using a CNN, the encoder consisting of 2D convolutional layers, pooling layers, leaky rectified linear units and a fully connected layer; the input image is learned layer by layer through the encoder to obtain its low-dimensional features; to improve the optimization of the deeper network, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the flattened output is then passed to the fully connected layer, so that the encoder encodes the input image $X$ into the low-dimensional features $T(X)$;
(5c) inputting the feature map from the encoder to a 3D-LSTM, the 3D-LSTM being composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighborhood as input, and each cell is responsible for reconstructing a specific part of the final output; after passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it by closing the input gate; the three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, and the forward propagation at each index position is

$$f_t = \sigma(W_f T(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i T(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s T(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

where $i_t$ and $f_t$ denote the input gate and the forget gate respectively, $s_t$ and $h_t$ denote the memory cell and the hidden state respectively, $\odot$ denotes element-wise multiplication, and the subscript $t$ denotes the activation at time $t$; $W(\cdot)$ and $U(\cdot)$ are the weights transforming the current input $x_t$ and the previous hidden state $h_{t-1}$ respectively, $b(\cdot)$ denotes the bias, and $*$ denotes the convolution operation; unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end, and removing the redundant output gates reduces the number of parameters;
(5d) decoding the hidden state of the LSTM units with a decoder to generate a 3D probabilistic voxel reconstruction; the decoder is a three-dimensional deconvolutional neural network: a simple 5-convolution decoder network augmented with a depth residual network of 4 residual connections; the hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution layers, nonlinear rectification layers and unpooling layers, and a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) defining the loss function of the network as the sum of voxel cross-entropies; letting the final output at each voxel $(i, j, k)$ be the Bernoulli distribution $[1 - p_{(i,j,k)},\ p_{(i,j,k)}]$, where the dependence on the input $\mathcal{X} = \{x_t\}_{t \in \{1, \dots, T\}}$ is omitted, and letting the corresponding ground-truth occupancy be $y_{(i,j,k)} \in \{0, 1\}$, then

$$L(\mathcal{X}, y) = \sum_{i,j,k} \left[ y_{(i,j,k)} \log p_{(i,j,k)} + (1 - y_{(i,j,k)}) \log(1 - p_{(i,j,k)}) \right];$$
(6) optimizing the grasp: to let the robot grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is:
(6a) taking a point set $p_i \in P$ from the target point cloud $P$;
(6b) finding the corresponding point set $q_i \in Q$ in the source point cloud $Q$ such that $\|q_i - p_i\| = \min$;
(6c) computing the rotation matrix $R$ and translation vector $t$ that minimize the error function;
(6d) rotating and translating $p_i$ with the $R$ and $t$ obtained in the previous step to obtain the new corresponding point set $p_i' = \{p_i' = R p_i + t,\ p_i \in P\}$;
(6e) computing the average distance between $p_i'$ and the corresponding point set $q_i$:

$$d = \frac{1}{n} \sum_{i=1}^{n} \|p_i' - q_i\|^2;$$

(6f) if $d$ is smaller than a given threshold, or the number of iterations exceeds the preset maximum, stopping the iteration, otherwise returning to step (6b) until the convergence condition is met;
(7) testing the network with the 100 pictures obtained in step (3) to verify it;
(8) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
CN202110031817.5A 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network Pending CN112734727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110031817.5A CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110031817.5A CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Publications (1)

Publication Number Publication Date
CN112734727A true CN112734727A (en) 2021-04-30

Family

ID=75590438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110031817.5A Pending CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Country Status (1)

Country Link
CN (1) CN112734727A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102914967A (en) * 2012-09-21 2013-02-06 浙江工业大学 Autonomous navigation and man-machine coordination picking operating system of picking robot
CN105279789A (en) * 2015-11-18 2016-01-27 中国兵器工业计算机应用技术研究所 A three-dimensional reconstruction method based on image sequences
CN108247637A (en) * 2018-01-24 2018-07-06 中南大学 A kind of industrial machine human arm vision anticollision control method
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
CN109241964A (en) * 2018-08-17 2019-01-18 上海非夕机器人科技有限公司 The acquisition methods and equipment of the crawl point of mechanical arm
CN109702741A (en) * 2018-12-26 2019-05-03 中国科学院电子学研究所 Mechanical arm visual grasping system and method based on self-supervisory learning neural network
CN109934864A (en) * 2019-03-14 2019-06-25 东北大学 Residual error network depth learning method towards mechanical arm crawl pose estimation
CN109948514A (en) * 2019-03-15 2019-06-28 中国科学院宁波材料技术与工程研究所 Workpiece based on single goal three-dimensional reconstruction quickly identifies and localization method
CN110509273A (en) * 2019-08-16 2019-11-29 天津职业技术师范大学(中国职业培训指导教师进修中心) The robot mechanical arm of view-based access control model deep learning feature detects and grasping means
CN110910452A (en) * 2019-11-26 2020-03-24 上海交通大学 Low-texture industrial part pose estimation method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
C. B. CHOY et al.: "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction", European Conference on Computer Vision *
DANIEL YANG et al.: "Robotic Grasping through Combined Image-Based Grasp Proposal and 3D Reconstruction", arXiv:2003.01649v3 [cs.RO], 6 Nov 2020 *
ZHANG Tiezhong et al.: "Target extraction for the vision system of a fruit-picking robot", Journal of China Agricultural University *
WU Yuwei (ed.): "Fundamentals and Applications of Deep Learning", Beijing Institute of Technology Press, Beijing, 30 April 2020 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160392A (en) * 2021-05-14 2021-07-23 电子科技大学成都学院 Optical building target three-dimensional reconstruction method based on deep neural network
CN113160392B (en) * 2021-05-14 2022-03-01 电子科技大学成都学院 Optical building target three-dimensional reconstruction method based on deep neural network
CN113099847A (en) * 2021-05-25 2021-07-13 广东技术师范大学 Fruit picking method based on fruit three-dimensional parameter prediction model
CN113743287A (en) * 2021-08-31 2021-12-03 之江实验室 Robot self-adaptive grabbing control method and system based on impulse neural network
CN113743287B (en) * 2021-08-31 2024-03-26 之江实验室 Robot self-adaptive grabbing control method and system based on impulse neural network
CN113575111A (en) * 2021-09-01 2021-11-02 南京农业大学 Real-time identification positioning and intelligent picking device for greenhouse tomatoes

Similar Documents

Publication Publication Date Title
CN112734727A (en) Apple picking method based on improved deep neural network
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN108388896A (en) A kind of licence plate recognition method based on dynamic time sequence convolutional neural networks
CN110880165A (en) Image defogging method based on contour and color feature fusion coding
CN110580472B (en) Video foreground detection method based on full convolution network and conditional countermeasure network
CN111489394B (en) Object posture estimation model training method, system, device and medium
CN112257766A (en) Shadow recognition detection method under natural scene based on frequency domain filtering processing
CN108876907A (en) A kind of active three-dimensional rebuilding method of object-oriented object
CN104408697B (en) Image Super-resolution Reconstruction method based on genetic algorithm and canonical prior model
CN115588237A (en) Three-dimensional hand posture estimation method based on monocular RGB image
CN114049314A (en) Medical image segmentation method based on feature rearrangement and gated axial attention
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN113822825B (en) Optical building target three-dimensional reconstruction method based on 3D-R2N2
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
Li et al. RoadFormer: Duplex Transformer for RGB-normal semantic road scene parsing
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
CN114972794A Three-dimensional object recognition method based on multi-view pooling Transformer
Yan et al. Cascaded transformer U-net for image restoration
CN114022392A Serial attention-enhanced UNet++ defogging network for single image defogging
CN114170304A (en) Camera positioning method based on multi-head self-attention and replacement attention
Oza et al. Semi-supervised image-to-image translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210430)