CN112734727A - Apple picking method based on improved deep neural network - Google Patents

Apple picking method based on improved deep neural network

Info

Publication number
CN112734727A
CN112734727A
Authority
CN
China
Prior art keywords
network
grabbing
dimensional
layer
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110031817.5A
Other languages
Chinese (zh)
Inventor
李静
黄友锐
韩涛
兰世豪
江灵雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202110031817.5A priority Critical patent/CN112734727A/en
Publication of CN112734727A publication Critical patent/CN112734727A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • A01D 91/04: Methods for harvesting agricultural products growing above the soil
    • G06N 3/048: Neural networks; activation functions
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural networks; learning methods
    • G06T 7/11: Region-based segmentation
    • G06T 7/194: Segmentation involving foreground-background segmentation
    • G06T 7/55: Depth or shape recovery from multiple images
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/10024: Image acquisition modality: color image
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20084: Special algorithmic details: artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Environmental Sciences (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fruit grasping method for an apple picking robot based on a deep neural network. It improves on traditional neural networks by constructing a grasp proposal network and a three-dimensional reconstruction network built from a residual network and a recurrent neural network (LSTM). The method comprises: collecting pictures of apple fruits in the field; extracting sample pictures from them and preprocessing the samples; constructing the grasp proposal network and the three-dimensional reconstruction network; training both networks with a training data set; evaluating the grasp proposal network with a test data set; and using the trained networks to pick apples in real time. The fruit targets are thereby identified and located in three-dimensional space, motion parameters are provided for the manipulator, and accurate picking of agricultural fruits is achieved. Refining the grasp with the convolutional neural networks improves grasping performance, greatly increases apple picking efficiency, and provides a new solution for robot-based apple picking.

Description

Apple picking method based on improved deep neural network
Technical Field
The invention relates to a grasping method for an apple picking robot based on improved convolutional neural network image recognition technology.
Background
At present, fruit picking requires a large amount of labor. As China's population ages, the rural labor force keeps shrinking and production costs keep rising, which limits the development of the whole fruit industry; in addition, manual picking is affected by subjective human factors, which can reduce picking quality. With the rapid development of computers and image processing technology, intelligent fruit picking machines reduce the labor intensity and production costs of fruit growers, can effectively increase the production efficiency of the whole industry, and meet the needs of the fruit industry's development. In picking robots, digital image processing is applied mainly to identify fruit targets and locate them in three-dimensional space, providing motion parameters for the manipulator and thereby enabling accurate picking of agricultural fruits. Aiming at the large labor demand and low efficiency of existing picking, the invention develops an intelligent fruit picking method based on deep learning that effectively improves fruit picking efficiency and contributes to agricultural production.
Disclosure of Invention
The invention provides an apple picking method based on a deep neural network, which uses image processing technology and convolutional neural networks to accurately grasp ripe apples in the field. The convolutional neural networks are used to construct a grasp proposal network and a three-dimensional reconstruction network for apple fruits, which provide motion parameters for the manipulator and thereby enable accurate picking of agricultural fruits. Labor intensity and production costs are reduced, and the production efficiency of the whole industry can be effectively increased.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the apple picking method based on the deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) performing image segmentation with a color space reference table method and removing the background in the image to obtain an image containing the apple region. Since the picking robot is strongly affected by changes in the light intensity of the working environment, for each of the three color models HSV, YCbCr and L*a*b*, the two luminance-independent components (H and S; Cr and Cb; a* and b*) are taken, forming a two-dimensional color space. The reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (such as HSV or L*a*b*), and mapping each component to the range 0-255;
(2f) counting the sample pixels in the specified two-dimensional color space (such as H-S or a*-b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding the map to obtain a binary image;
(2h) applying a series of mathematical morphology operations (an improved dilation and erosion algorithm) to the binary image obtained in step (2g); the two-dimensional array corresponding to the resulting binary image is the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), count the number of target pixels in its 5×5 neighborhood; if the count exceeds half, the pixel is treated as a target pixel, otherwise as a non-target pixel. Then a region labeling algorithm is used to find the region of each fruit, and the bounding rectangle of each region is obtained, completing the extraction of the apple targets (a minimal code sketch of this segmentation pipeline is given below);
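The color-reference-table segmentation and target extraction above can be sketched as follows. This is a minimal sketch using OpenCV and NumPy; the use of the H-S plane, the density threshold, the 3×3 structuring element and all parameter values are illustrative assumptions rather than values prescribed by the invention:

```python
import cv2
import numpy as np

def build_reference_table(samples_bgr: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Build a 256x256 H-S reference table from an image of apple-colored sample pixels."""
    table = np.zeros((256, 256), dtype=np.uint8)                 # step (2d)
    hsv = cv2.cvtColor(samples_bgr, cv2.COLOR_BGR2HSV_FULL)      # step (2e): H, S in 0..255
    h, s = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    density = np.zeros((256, 256), dtype=np.int32)
    np.add.at(density, (h, s), 1)                                # step (2f): 2D color histogram
    table[density > thresh] = 1                                  # step (2g): threshold
    kernel = np.ones((3, 3), np.uint8)
    table = cv2.morphologyEx(table, cv2.MORPH_CLOSE, kernel)     # step (2h): dilation/erosion
    return table

def segment_apples(image_bgr: np.ndarray, table: np.ndarray):
    """Look up each pixel in the table, apply the 5x5 majority vote, and box each fruit."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV_FULL)
    mask = table[hsv[..., 0], hsv[..., 1]]                       # per-pixel table lookup
    votes = cv2.boxFilter(mask.astype(np.float32), -1, (5, 5), normalize=True)
    mask = (votes > 0.5).astype(np.uint8)                        # step (2i): majority vote
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask) # region labeling
    return [tuple(stats[i, :4]) for i in range(1, n)]            # bounding boxes (x, y, w, h)
```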
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) constructing the convolutional neural networks and training them with the training data set. The method comprises two neural networks whose outputs are combined by an optimization module to plan the grasp. The grasp proposal network GPNet outputs a grasp pose relative to the camera frame, $^{c}T_{\zeta} \in SE(3)$. The three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts. The two outputs are combined by projecting the grasp proposal $^{c}T_{\zeta}$ onto the nearest point of the reconstructed point cloud, which yields the refined grasp proposal $^{c}T_{\zeta^{+}}$. Since the pose of the camera relative to the manipulator, $^{r}T_{c}$, is known, the camera-frame grasp can be converted into the robot frame for the robot to execute:

$$^{r}T_{\zeta^{+}} = {}^{r}T_{c}\;{}^{c}T_{\zeta^{+}}$$
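To illustrate this frame conversion, a minimal NumPy sketch follows; the 4×4 homogeneous-matrix layout and the example poses are assumptions for illustration, not values from the invention:

```python
import numpy as np

def make_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# rTc: camera pose in the robot frame (known from calibration; example values)
rTc = make_transform(np.eye(3), np.array([0.1, 0.0, 0.5]))
# cTg: grasp pose proposed by GPNet in the camera frame (example values)
cTg = make_transform(np.eye(3), np.array([0.0, 0.05, 0.4]))
# Convert the grasp into the robot frame: rTg = rTc @ cTg
rTg = rTc @ cTg
```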
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers. The output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera. The first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame; the last 9 values represent a serialized three-dimensional rotation matrix. ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer. Each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:
$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i).$$
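This residual unit can be sketched in code as follows; a minimal PyTorch sketch of the identity-mapping form above, with channel counts and layer choices as illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual unit: x_{l+1} = ReLU(x_l + F(x_l)), with F = two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x_l, W_l)
        return self.relu(x + residual)                   # f(h(x_l) + F(x_l, W_l))
```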
After GPNet is constructed, the grasp proposal network is trained with the training data set; the training steps are as follows:
(4a) input the picture samples into the GPNet network; the network input is $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtain the predicted value $\hat{t}$ by forward computation through the network;
(4c) compute the loss with respect to the ground-truth grasp $t^*$; it measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity.
In training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$;
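The weighted pose loss of step (4c) can be sketched as follows; a minimal PyTorch sketch under the reconstruction above, where the 12-value vector layout follows the text and the default weights are assumptions:

```python
import torch

def grasp_loss(t_pred: torch.Tensor, t_true: torch.Tensor,
               lam_t: float = 1.0, lam_r: float = 1.0) -> torch.Tensor:
    """Weighted pose loss on a 12-d grasp vector: 3 translation values + 9 rotation values."""
    trans_pred, trans_true = t_pred[:3], t_true[:3]
    R_pred = t_pred[3:].reshape(3, 3)
    R_true = t_true[3:].reshape(3, 3)
    loss_t = torch.sum((trans_true - trans_pred) ** 2)       # squared Euclidean distance
    eye = torch.eye(3, dtype=t_pred.dtype, device=t_pred.device)
    loss_r = torch.sum((R_pred @ R_true.T - eye) ** 2)       # rotation deviation from identity
    return lam_t * loss_t + lam_r * loss_r
```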
(5) the invention constructs a three-dimensional reconstruction network, SRNet, using a three-dimensional recursive reconstruction neural network (3D-R2N2) that can build three-dimensional reconstructions of object instances viewed from different angles. Each 3D-R2N2 comprises an encoder, a 3D-LSTM and a decoder, and the network works as follows:
(5a) a sample image is given to the input layer;
(5b) the image is encoded into features using a CNN. The encoder consists of 2D convolutional layers, pooling layers, leaky rectified linear units and a fully connected layer. The input image is learned layer by layer through the encoder to obtain its low-dimensional features; to improve the optimization of the deeper network, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the flattened output is then passed to the fully connected layer. The encoder thus encodes the input image $X$ into the low-dimensional features $T(X)$;
(5c) the feature map from the encoder is input to a 3D-LSTM. The 3D-LSTM is composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighborhood as input, and each cell is responsible for reconstructing a specific part of the final output. After passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it by closing the input gate. The three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid; the forward propagation at each index position is

$$f_t = \sigma(W_f T(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i T(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s T(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

where $i_t$ and $f_t$ denote the input gate and the forget gate respectively, $s_t$ and $h_t$ denote the memory cell and the hidden state respectively, $\odot$ denotes element-wise multiplication, and the subscript $t$ denotes the activation at time $t$; $W(\cdot)$ and $U(\cdot)$ are the weights transforming the current input $x_t$ and the previous hidden state $h_{t-1}$ respectively, $b(\cdot)$ denotes the bias, and $*$ denotes the convolution operation. Unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end; removing the redundant output gates reduces the number of parameters;
(5d) the hidden state of the LSTM units is decoded by a decoder to generate a 3D probabilistic voxel reconstruction. The decoder is a three-dimensional deconvolutional neural network: a simple 5-convolution decoder network augmented with a depth residual network of 4 residual connections. The hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution layers, nonlinear rectification layers and unpooling layers, and a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) the loss function of the network is defined as the sum of voxel cross-entropies. Let the final output at each voxel $(i, j, k)$ be the Bernoulli distribution $[1 - p_{(i,j,k)},\ p_{(i,j,k)}]$, where the dependence on the input $\mathcal{X} = \{x_t\}_{t \in \{1, \dots, T\}}$ is omitted, and let the corresponding ground-truth occupancy be $y_{(i,j,k)} \in \{0, 1\}$; then

$$L(\mathcal{X}, y) = \sum_{i,j,k} \left[ y_{(i,j,k)} \log p_{(i,j,k)} + (1 - y_{(i,j,k)}) \log(1 - p_{(i,j,k)}) \right]$$

(a minimal sketch of this voxel loss follows below);
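A minimal sketch of the voxel loss, written as the negative log-likelihood that would actually be minimized in training; the tensor shapes are assumptions:

```python
import torch

def voxel_cross_entropy(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sum of per-voxel binary cross-entropies.

    p: predicted occupancy probabilities, shape (N, N, N)
    y: ground-truth occupancy in {0, 1},  shape (N, N, N)
    """
    p = p.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(y * torch.log(p) + (1.0 - y) * torch.log(1.0 - p)).sum()
```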
(6) optimizing the grasp: to let the robot grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is as follows (a minimal code sketch follows the steps below):
(6a) take a point set $p_i \in P$ from the target point cloud $P$;
(6b) find the corresponding point set $q_i \in Q$ in the source point cloud $Q$ such that $\|q_i - p_i\| = \min$;
(6c) compute the rotation matrix $R$ and translation vector $t$ that minimize the error function;
(6d) rotate and translate $p_i$ with the $R$ and $t$ obtained in the previous step to obtain the new corresponding point set $p_i' = \{p_i' = R p_i + t,\ p_i \in P\}$;
(6e) compute the average distance between $p_i'$ and the corresponding point set $q_i$:

$$d = \frac{1}{n} \sum_{i=1}^{n} \|p_i' - q_i\|^2;$$

(6f) if $d$ is smaller than a given threshold, or the number of iterations exceeds the preset maximum, stop the iteration; otherwise return to step (6b) until the convergence condition is met;
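A minimal NumPy sketch of this ICP loop, using brute-force nearest-neighbour correspondence and an SVD-based rigid-transform step; the array shapes, iteration cap and tolerance are assumptions:

```python
import numpy as np

def icp(P: np.ndarray, Q: np.ndarray, max_iter: int = 50, tol: float = 1e-6):
    """Align target points P (n,3) to source cloud Q (m,3); returns rotation R and translation t."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(max_iter):
        P_trans = P @ R.T + t
        # (6b) nearest neighbour of each transformed point in Q
        dists = np.linalg.norm(P_trans[:, None, :] - Q[None, :, :], axis=2)
        q = Q[np.argmin(dists, axis=1)]
        # (6c) best rigid transform via SVD of the cross-covariance
        p_c, q_c = P_trans.mean(0), q.mean(0)
        U, _, Vt = np.linalg.svd((P_trans - p_c).T @ (q - q_c))
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:           # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = q_c - R_step @ p_c
        R, t = R_step @ R, R_step @ t + t_step  # (6d) accumulate the transform
        # (6e)/(6f) mean squared distance as the stopping criterion
        d = np.mean(np.sum((P @ R.T + t - q) ** 2, axis=1))
        if d < tol:
            break
    return R, t
```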
(7) testing the network with the 100 pictures obtained in step (3) to verify it;
(8) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
The invention has the beneficial effects that:
A new robot grasp planning method is provided that uses a learned grasp proposal network and a learned three-dimensional shape reconstruction network simultaneously. The two networks use two different improved convolutional neural networks, exploiting their representational capacity and generalization performance. Compared with pixel-space methods, the method outputs a complete 6-degree-of-freedom grasp pose and can grasp occluded parts of an object that are not visible in the input image: the three-dimensional reconstruction network infers the shape of the invisible part of the object, and the grasp points are then projected onto the reconstructed point cloud, refining the grasp and improving grasping performance. The robot can thus accurately grasp both the hidden and the visible portions of an object. Such a combined system produces more accurate grasps than an image-based grasp proposal network alone. Apple picking efficiency is thereby greatly improved, and a new solution is provided for robot-based apple picking.
Drawings
FIG. 1 is a block diagram of the overall process of the method of the present invention.
Fig. 2 is a diagram of the grasp proposal network architecture.
Fig. 3 is a diagram of a depth residual network architecture.
Fig. 4 is a diagram of a three-dimensional reconstruction network structure.
FIG. 5 is a diagram of the three-dimensional convolutional LSTM structure.
Detailed Description
As shown in FIG. 1, the flow of the apple picking method based on the convolutional neural network is as follows:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample;
(3) dividing the preprocessed picture sample into a training data set and a testing data set;
(4) constructing a convolution network and training by using a training data set;
(5) evaluating the trained convolutional network by using a test data set;
(6) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
As shown in FIG. 2, the grasp proposal network of the method of the present invention is as follows:
The architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers. The output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera. The first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame, and the last 9 values represent a serialized three-dimensional rotation matrix. The training process of GPNet is as follows:
(4a) input the picture samples into the GPNet network; the network input is $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtain the predicted value $\hat{t}$ by forward computation through the network;
(4c) compute the loss with respect to the ground-truth grasp $t^*$; it measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity.
In training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$.
As shown in FIG. 3, the depth residual block structure of the method of the present invention is as follows:
ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer. Each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i).$$
As shown in FIG. 4, the three-dimensional reconstruction network of the method of the present invention is as follows:
the invention uses a three-dimensional recursive reconstruction neural network (3D-R2N2) that is capable of constructing three-dimensional reconstructions of object instances from different angular viewpoints. Each 3D-R2N2 is composed of an encoder, a recursion unit, and a decoder. The working process of the network is as follows:
(5a) giving a sample image to an input layer;
(5b) the image is coded using CNN as a feature, the encoder consists of a 2D convolutional layer, a pooling layer, a leakage correction linear unit, and a fully connected layer. The input image is learnt layer by layer through an encoder to obtain the low-dimensional characteristics of the image, residual connection is added between standard layers of the encoder in order to improve the optimization performance of a deeper network, meanwhile, in order to match the number of channels after convolution, 1X1 convolution is applied to a depth residual error network, and then output flattening is transmitted to a complete connection layer. The encoder thus encodes the input image X into the low-dimensional features t (X);
(5c) inputting the feature map from the encoder to a 3D-LSTM, the 3D-LSTM being composed of a set of structured LSTM cells with restricted connections, each cell receiving the same feature vector from the encoder by a 3 x 3 convolution and receiving as input a hidden state from its neighborhood, each cell being responsible for reconstructing a specific part of the final output, the encoding features and the hidden state having selectively updated cell states after passing through the 3D-LSTM or maintaining the states by closing the input gate. In a three-dimensional grid, with NxNxN 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, the process of forward propagation at each index position is ft=σ(WfT(xt)+Uf*ht-1+bf),it=σ(WiT(xt)+Ui*ht-1+bi),St=ft⊙st-1+it⊙tanh(WsT(xt)+Us*ht-1+bs),ht=tanh(st);
Wherein it,ftThe input gate and the forgetting gate are respectively indicated. stAnd htIndicating respectively the memory cell and the hidden state, with |, indicating the element multiplication and the subscript t indicating the activation at time t. W (-), U (-), respectively, are the transformed current inputs xtAnd previous hidden state ht-1B (-) represents the deviation. Denotes convolution operation. Unlike standard LSTM, this network has no output gates, since the output is only extracted at the end. By removing redundant output gates, the number of parameters can be reduced;
(5d) decoding the hidden state of the LSTM unit and generating a 3D probabilistic voxel reconstruction by a decoder; the decoder is a three-dimensional deconvolution neural network (3D-DNCC), a simple 5-convolution decoder network is used, a 4-residue connected depth residual error network is added, the hidden state from the 3D-LSTM is learned layer by layer through a deconvolution layer, a nonlinear correction layer and an inverse pooling layer of the decoder, and finally an activation layer is used, and the final output is converted into the occupation probability of a voxel at a certain position by using an activation function;
(5e) the loss function of the network is defined as the sum of voxel cross entropies. Let the final output of each voxel (i, j, k) be the Bernoulli distribution [1-p (i, j, k), p (i, j, k)]Wherein the input x ═ x is omittedt}t∈{1,...,T}Let the corresponding basic true-value occupancy be yi,j,k)E {0,1}, i.e., L (χ, y) ═ Σ y(i,j,k)log(p(i,j,k))+(1-y(i,j,k))log(1-p(i,j,k))。
As shown in FIG. 5, the 3D-LSTM network architecture of the invention is as follows:
the 3D-LSTM is composed of a set of structured LSTM cells with constrained connections, each receiving the same eigenvector from the encoder by a 3 × 3 × 3 convolution and from its neighborhoodAnd (3) taking hidden states as input, wherein each unit is responsible for reconstructing a specific part of final output, and the coding characteristics and the hidden states can selectively update the state of the unit after passing through the 3D-LSTM or keep the state by closing an input gate. In a three-dimensional grid, with NxNxN 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, the process of forward propagation at each index position is ft=σ(WfT(xt)+Uf*ht-1+bf),it=σ(WiT(xt)+Ui*ht-1+bi),St=ft⊙st-1+it⊙tanh(WsT(xt)+Us*ht-1+bs),ht=tanh(st)
Wherein it,ftThe input gate and the forgetting gate are respectively indicated. stAnd htRespectively, memory cell and hidden state. An element multiplication is indicated by [ < u >, and the subscript t indicates activation at time t. W (-), U (-), respectively, are the transformed current inputs xtAnd previous hidden state ht-1B (-) represents the deviation. Denotes convolution operation. Unlike standard LSTM, this network has no output gates, since the output is only extracted at the end. By removing redundant output gates, the number of parameters can be reduced.
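One update step of such a convolutional 3D-LSTM grid can be sketched as follows; a minimal PyTorch sketch with no output gate, as described above, where the use of nn.Conv3d for the neighbourhood connections, the channel counts and the grid handling are assumptions:

```python
import torch
import torch.nn as nn

class Conv3DLSTMCell(nn.Module):
    """One update of an N x N x N convolutional LSTM grid without an output gate."""
    def __init__(self, feat_dim: int, hidden_ch: int, grid: int):
        super().__init__()
        self.grid = grid
        self.hidden_ch = hidden_ch
        # W(.): map the shared image feature T(x_t) to per-cell gate inputs
        self.W_f = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        self.W_i = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        self.W_s = nn.Linear(feat_dim, hidden_ch * grid ** 3)
        # U(.) * h_{t-1}: 3x3x3 convolutions over the neighbouring hidden states
        self.U_f = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.U_i = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.U_s = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)

    def forward(self, feat, h_prev, s_prev):
        shape = (-1, self.hidden_ch, self.grid, self.grid, self.grid)
        f = torch.sigmoid(self.W_f(feat).view(shape) + self.U_f(h_prev))  # forget gate
        i = torch.sigmoid(self.W_i(feat).view(shape) + self.U_i(h_prev))  # input gate
        s = f * s_prev + i * torch.tanh(self.W_s(feat).view(shape) + self.U_s(h_prev))
        h = torch.tanh(s)                                                 # no output gate
        return h, s
```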

Claims (1)

1. An apple picking method based on an improved deep neural network is characterized by comprising the following steps:
(1) taking pictures from the field by adopting a binocular depth camera, and acquiring the pictures of the apples in different time periods and under different illumination conditions;
(2) extracting a picture sample from a shot apple picture and preprocessing the picture sample:
(2a) carrying out Gaussian filtering on the collected picture, smoothing the image and removing noise;
(2b) carrying out image enhancement processing on the picture to highlight the characteristics of the apple and improve the definition of the image;
(2c) performing image segmentation with a color space reference table method and removing the background in the image to obtain an image containing the apple region; since the picking robot is strongly affected by changes in the light intensity of the working environment, for each of the three color models HSV, YCbCr and L*a*b*, the two luminance-independent components (H and S; Cr and Cb; a* and b*) are taken, forming a two-dimensional color space, and the reference table is built as follows:
(2d) establishing a 256×256 integer array (corresponding to the two-dimensional color space) and initializing it to zero;
(2e) converting the sample pixels from the RGB color space to the specified color space (such as HSV or L*a*b*), and mapping each component to the range 0-255;
(2f) counting the sample pixels in the specified two-dimensional color space (such as H-S or a*-b*, i.e. the array established in step (2d)) to obtain a two-dimensional color distribution density map, which is the extension of the gray-level histogram to a two-dimensional color space;
(2g) treating the two-dimensional color distribution density map obtained in step (2f) as a gray-scale image, choosing a suitable threshold, and thresholding the map to obtain a binary image;
(2h) applying a series of mathematical morphology operations (an improved dilation and erosion algorithm) to the binary image obtained in step (2g), the two-dimensional array corresponding to the resulting binary image being the required color space reference table;
(2i) extraction of fruit targets: for each pixel (whether or not it is itself a target pixel), counting the number of target pixels in its 5×5 neighborhood; if the count exceeds half, the pixel is treated as a target pixel, otherwise as a non-target pixel; then a region labeling algorithm is used to find the region of each fruit, and the bounding rectangle of each region is obtained, thereby completing the extraction of the apple targets;
(3) collecting 500 images of apples with different sizes, different angles and different illumination, selecting 400 images as neural network training samples, and selecting the remaining 100 images as test samples;
(4) the method comprises two neural networks whose outputs are combined by an optimization module to plan the grasp; the grasp proposal network GPNet outputs a grasp pose relative to the camera frame, $^{c}T_{\zeta} \in SE(3)$; the three-dimensional recursive reconstruction neural network outputs a three-dimensional reconstruction of the object, providing a reasonable estimate of the shape of its occluded parts; the two outputs are combined by projecting the grasp proposal $^{c}T_{\zeta}$ onto the nearest point of the reconstructed point cloud, yielding the refined grasp proposal $^{c}T_{\zeta^{+}}$; since the pose of the camera relative to the manipulator, $^{r}T_{c}$, is known, the camera-frame grasp can be converted into the robot frame for the robot to execute:

$$^{r}T_{\zeta^{+}} = {}^{r}T_{c}\;{}^{c}T_{\zeta^{+}}$$

the architecture of GPNet consists of parallel ResNet-34 modules whose input is a pair of aligned gray-scale and depth images, followed by two fully connected layers; the output regresses to a vector

$$t = (t_1, t_2, t_3, R_{11}, R_{12}, \dots, R_{33}) \in \mathbb{R}^{12},$$

which represents a homogeneous transformation $^{c}T_{\zeta} \in SE(3)$, i.e. an estimate of the grasp pose relative to the camera; the first 3 values $(t_1, t_2, t_3)$ represent the desired $(x, y, z)$ position of the gripper in the camera coordinate frame, and the last 9 values represent a serialized three-dimensional rotation matrix; ResNet-34 consists mainly of L residual blocks, 1 average pooling layer, 1 max pooling layer and 1 fully connected layer; each residual block is composed of two 3×3 convolutional layers and two ReLU units, and each residual unit can be expressed as:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th residual unit, $F$ is the residual function, $y_l$ denotes the learned residual, $h(x_l) = x_l$ is the identity map, and $f$ is the ReLU activation function, so that the features learned from a shallow layer $l$ to a deep layer $L$ are

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i);$$
after GPNet is constructed, the grasp proposal network is trained with the training data set; the training steps are as follows:
(4a) inputting the picture samples into the GPNet network, the network input being $(I_g, I_d)$, i.e. gray-scale and depth images, with the corresponding ground-truth grasp $t^*$ as the target output;
(4b) obtaining the predicted value $\hat{t}$ by forward computation through the network;
(4c) computing the loss with respect to the ground-truth grasp $t^*$, which measures the proximity of the prediction to the ground truth as a weighted sum of translational and rotational components:

$$\mathcal{L}(t^*, \hat{t}) = \lambda_T \mathcal{L}_T + \lambda_R \mathcal{L}_R$$

where $\mathcal{L}_T = \|t^*_{1:3} - \hat{t}_{1:3}\|_2^2$ is the squared Euclidean distance loss on the translation, and $\mathcal{L}_R = \|\hat{R}(R^*)^\top - I\|_F^2$ is the squared deviation of the product of the predicted rotation matrix and the transposed ground-truth rotation matrix from the identity;
in training GPNet, weights $\lambda_T = \lambda_R$ are used and the network is trained on the data set with the Adam optimizer at a learning rate of $1 \times 10^{-4}$;
(5) the invention constructs a three-dimensional reconstruction network, SRNet, using a three-dimensional recursive reconstruction neural network (3D-R2N2) that can build three-dimensional reconstructions of object instances viewed from different angles, each 3D-R2N2 comprising an encoder, a 3D-LSTM and a decoder, and the network working as follows:
(5a) giving a sample image to the input layer;
(5b) encoding the image into features using a CNN, the encoder consisting of 2D convolutional layers, pooling layers, leaky rectified linear units and a fully connected layer; the input image is learned layer by layer through the encoder to obtain its low-dimensional features; to improve the optimization of the deeper network, residual connections are added between the standard layers of the encoder, and to match the number of channels after convolution, 1×1 convolutions are applied in the deep residual network; the flattened output is then passed to the fully connected layer, so that the encoder encodes the input image $X$ into the low-dimensional features $T(X)$;
(5c) inputting the feature map from the encoder to a 3D-LSTM, the 3D-LSTM being composed of a set of structured LSTM cells with restricted connections; each cell receives the same feature vector from the encoder and, through a 3×3×3 convolution, the hidden states of its neighborhood as input, and each cell is responsible for reconstructing a specific part of the final output; after passing through the 3D-LSTM, the encoded features and hidden states either selectively update the cell state or keep it by closing the input gate; the three-dimensional grid contains N×N×N 3D-LSTM cells, where N is the spatial resolution of the 3D-LSTM grid, and the forward propagation at each index position is

$$f_t = \sigma(W_f T(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i T(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s T(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

where $i_t$ and $f_t$ denote the input gate and the forget gate respectively, $s_t$ and $h_t$ denote the memory cell and the hidden state respectively, $\odot$ denotes element-wise multiplication, and the subscript $t$ denotes the activation at time $t$; $W(\cdot)$ and $U(\cdot)$ are the weights transforming the current input $x_t$ and the previous hidden state $h_{t-1}$ respectively, $b(\cdot)$ denotes the bias, and $*$ denotes the convolution operation; unlike a standard LSTM, this network has no output gates, since the output is only extracted at the end, and removing the redundant output gates reduces the number of parameters;
(5d) decoding the hidden state of the LSTM units with a decoder to generate a 3D probabilistic voxel reconstruction; the decoder is a three-dimensional deconvolutional neural network: a simple 5-convolution decoder network augmented with a depth residual network of 4 residual connections; the hidden state from the 3D-LSTM is learned layer by layer through the decoder's deconvolution layers, nonlinear rectification layers and unpooling layers, and a final activation layer converts the output into the occupancy probability of the voxel at each position;
(5e) defining the loss function of the network as the sum of voxel cross-entropies; letting the final output at each voxel $(i, j, k)$ be the Bernoulli distribution $[1 - p_{(i,j,k)},\ p_{(i,j,k)}]$, where the dependence on the input $\mathcal{X} = \{x_t\}_{t \in \{1, \dots, T\}}$ is omitted, and letting the corresponding ground-truth occupancy be $y_{(i,j,k)} \in \{0, 1\}$, then

$$L(\mathcal{X}, y) = \sum_{i,j,k} \left[ y_{(i,j,k)} \log p_{(i,j,k)} + (1 - y_{(i,j,k)}) \log(1 - p_{(i,j,k)}) \right];$$
(6) optimizing the grasp: to let the robot grasp the fruit accurately, the proposed grasp is projected onto the reconstructed surface using the Iterative Closest Point (ICP) algorithm, whose flow is:
(6a) taking a point set $p_i \in P$ from the target point cloud $P$;
(6b) finding the corresponding point set $q_i \in Q$ in the source point cloud $Q$ such that $\|q_i - p_i\| = \min$;
(6c) computing the rotation matrix $R$ and translation vector $t$ that minimize the error function;
(6d) rotating and translating $p_i$ with the $R$ and $t$ obtained in the previous step to obtain the new corresponding point set $p_i' = \{p_i' = R p_i + t,\ p_i \in P\}$;
(6e) computing the average distance between $p_i'$ and the corresponding point set $q_i$:

$$d = \frac{1}{n} \sum_{i=1}^{n} \|p_i' - q_i\|^2;$$

(6f) if $d$ is smaller than a given threshold, or the number of iterations exceeds the preset maximum, stopping the iteration, otherwise returning to step (6b) until the convergence condition is met;
(7) testing the network with the 100 pictures obtained in step (3) to verify it;
(8) using the trained network to grasp apples in real time, accurately locating ripe fruit and improving grasping performance.
CN202110031817.5A 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network Pending CN112734727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110031817.5A CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110031817.5A CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Publications (1)

Publication Number Publication Date
CN112734727A true CN112734727A (en) 2021-04-30

Family

ID=75590438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110031817.5A Pending CN112734727A (en) 2021-01-11 2021-01-11 Apple picking method based on improved deep neural network

Country Status (1)

Country Link
CN (1) CN112734727A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102914967A (en) * 2012-09-21 2013-02-06 浙江工业大学 Autonomous navigation and man-machine coordination picking operating system of picking robot
CN105279789A (en) * 2015-11-18 2016-01-27 中国兵器工业计算机应用技术研究所 A three-dimensional reconstruction method based on image sequences
CN108247637A (en) * 2018-01-24 2018-07-06 中南大学 A kind of industrial machine human arm vision anticollision control method
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
CN109241964A (en) * 2018-08-17 2019-01-18 上海非夕机器人科技有限公司 The acquisition methods and equipment of the crawl point of mechanical arm
CN109702741A (en) * 2018-12-26 2019-05-03 中国科学院电子学研究所 Mechanical arm visual grasping system and method based on self-supervisory learning neural network
CN109934864A (en) * 2019-03-14 2019-06-25 东北大学 Residual error network depth learning method towards mechanical arm crawl pose estimation
CN109948514A (en) * 2019-03-15 2019-06-28 中国科学院宁波材料技术与工程研究所 Workpiece based on single goal three-dimensional reconstruction quickly identifies and localization method
CN110509273A (en) * 2019-08-16 2019-11-29 天津职业技术师范大学(中国职业培训指导教师进修中心) The robot mechanical arm of view-based access control model deep learning feature detects and grasping means
CN110910452A (en) * 2019-11-26 2020-03-24 上海交通大学 Low-texture industrial part pose estimation method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
C. B. CHOY et al.: "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction", European Conference on Computer Vision *
DANIEL YANG et al.: "Robotic Grasping through Combined Image-Based Grasp Proposal and 3D Reconstruction", arXiv:2003.01649v3 [cs.RO], 6 Nov 2020 *
ZHANG Tiezhong et al.: "Target extraction for the vision system of a fruit-picking robot", Journal of China Agricultural University *
WU Yuwei (ed.): "Fundamentals and Applications of Deep Learning", Beijing Institute of Technology Press, Beijing, 30 April 2020 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160392A (en) * 2021-05-14 2021-07-23 电子科技大学成都学院 Optical building target three-dimensional reconstruction method based on deep neural network
CN113160392B (en) * 2021-05-14 2022-03-01 电子科技大学成都学院 Optical building target three-dimensional reconstruction method based on deep neural network
CN113099847A (en) * 2021-05-25 2021-07-13 广东技术师范大学 Fruit picking method based on fruit three-dimensional parameter prediction model
CN113743287A (en) * 2021-08-31 2021-12-03 之江实验室 Robot self-adaptive grabbing control method and system based on impulse neural network
CN113743287B (en) * 2021-08-31 2024-03-26 之江实验室 Robot self-adaptive grabbing control method and system based on impulse neural network
CN113575111A (en) * 2021-09-01 2021-11-02 南京农业大学 Real-time identification positioning and intelligent picking device for greenhouse tomatoes

Similar Documents

Publication Publication Date Title
CN112734727A (en) Apple picking method based on improved deep neural network
CN111950649B (en) Attention mechanism and capsule network-based low-illumination image classification method
WO2022036777A1 (en) Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN108388896A (en) A kind of licence plate recognition method based on dynamic time sequence convolutional neural networks
CN110880165A (en) Image defogging method based on contour and color feature fusion coding
CN110580472B (en) Video foreground detection method based on full convolution network and conditional countermeasure network
CN111489394B (en) Object posture estimation model training method, system, device and medium
CN112257766A (en) Shadow recognition detection method under natural scene based on frequency domain filtering processing
CN108876907A (en) A kind of active three-dimensional rebuilding method of object-oriented object
CN104408697B (en) Image Super-resolution Reconstruction method based on genetic algorithm and canonical prior model
CN115588237A (en) Three-dimensional hand posture estimation method based on monocular RGB image
CN114049314A (en) Medical image segmentation method based on feature rearrangement and gated axial attention
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN113822825B (en) Optical building target three-dimensional reconstruction method based on 3D-R2N2
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
Li et al. RoadFormer: Duplex Transformer for RGB-normal semantic road scene parsing
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
CN114972794A Three-dimensional object recognition method based on multi-view pooling Transformer
Yan et al. Cascaded transformer U-net for image restoration
CN114022392A Serial attention-enhanced UNet++ defogging network for single image defogging
CN114170304A (en) Camera positioning method based on multi-head self-attention and replacement attention
Oza et al. Semi-supervised image-to-image translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210430)