CN116310622A - Method and system for accurately identifying tray based on deep learning - Google Patents

Method and system for accurately identifying tray based on deep learning Download PDF

Info

Publication number
CN116310622A
CN116310622A (application number CN202211616543.7A)
Authority
CN
China
Prior art keywords
tray
layer
point cloud
deep learning
trays
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211616543.7A
Other languages
Chinese (zh)
Inventor
邹家帅
昝学彦
李发频
李飞军
张四龙
李家钧
蒋干胜
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Makerwit Technology Co ltd
Original Assignee
Zhuhai Makerwit Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Makerwit Technology Co ltd filed Critical Zhuhai Makerwit Technology Co ltd
Priority to CN202211616543.7A priority Critical patent/CN116310622A/en
Publication of CN116310622A publication Critical patent/CN116310622A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for accurately identifying a tray based on deep learning. The method comprises: collecting depth images and color images of a plurality of trays with an image acquisition device and aligning the depth images with the color images; marking the position of the tray in each color image and feeding the marked images to a neural network as a deep learning training data set; having the neural network recognize the coordinates of the tray in the color image by deep learning, locating the tray in the depth image from those coordinates, and constructing a standard tray point cloud set at that position from the overall dimensions of the tray; and performing ICP point cloud matching between the actual tray point cloud set currently acquired by the image acquisition device and the standard tray point cloud set to obtain the position and angle of the target tray relative to the virtual tray, and hence the pose of the target tray relative to the image acquisition device. The method addresses the technical problem that misjudgment easily occurs when a tray is identified with a detection method based on point cloud plane contour matching.

Description

Method and system for accurately identifying tray based on deep learning
Technical Field
The invention relates to the technical field of image recognition, in particular to a method and a system for accurately recognizing a tray based on deep learning.
Background
Tray detection is a key step when a warehouse robot carries goods. To address problems of current detection methods, such as poor robustness to illumination and constraints imposed by the relative pose between the tray and the sensor, a detection method based on point cloud plane contour matching has been proposed. In this method, a TOF (Time-of-Flight) camera collects a point cloud, the point cloud is preprocessed, plane segmentation is performed with a region-growing algorithm constrained by normals, and a grid map is generated by projection along the direction of the principal normal of the point cloud, which removes the constraint of relative pose. Finally, after contour extraction on the grid map, the target is matched against a template using contour features that fuse Hu invariant moments and scale-ratio features, achieving tray detection.
However, because a TOF camera outputs depth point cloud data, the constructed image is only a gray/black image, and relying on the depth camera alone easily leads to misjudgment. For example, if a person stands at the edge of a tray, the person's leg may be identified together with the tray so that an incorrect tray pose is calculated, and objects other than trays may be identified as trays.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for accurately identifying a tray based on deep learning, which solve the technical problem that misjudgment easily occurs when a tray is identified with a detection method based on point cloud plane contour matching, thereby improving the accuracy of tray identification.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
a method for accurately identifying a tray based on deep learning comprises the following steps:
acquiring the trays by using an image acquisition device to obtain depth images and color images of a plurality of trays, and aligning the depth images and the color images;
after marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
the neural network identifies the coordinates of the tray in the color image in a deep learning mode;
obtaining the position of the tray in the depth image according to the coordinates of the tray, and inputting the overall dimensions of the tray at that position to construct a standard tray point cloud set;
performing ICP point cloud matching on the actual tray point cloud set and the standard tray point cloud set which are currently acquired by the image acquisition device, and acquiring the position and the angle of a target tray relative to a virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
As a preferred embodiment of the present invention, when the neural network recognizes the coordinates of the tray in the color image by means of deep learning, the method includes:
aligning the input color image into a 640 × 640 RGB image through an input layer and feeding it to a backbone layer;
the backbone layer performs feature extraction on the RGB image and outputs three feature maps of different sizes to a head layer;
the head layer performs feature extraction and detection again on the three feature maps of different sizes to obtain the coordinates of the target tray;
the neural network comprises the input layer, the backbone layer and the head layer.
As a preferred embodiment of the present invention, when the input layer aligns an input color image, it includes:
performing adaptive size processing on the input deep learning training data set, adjusting each image to a 1280 × 1280 RGB image, reducing it to 640 × 640 with a 16-layer convolution module, performing normalization and alignment, activating it with an activation function, and then sending it to the backbone layer.
As a preferred embodiment of the present invention, when the backbone layer performs feature extraction on the RGB image, the method includes:
a BConv layer receives the RGB image, performs feature extraction through a convolution layer, accelerates convergence with a BN layer, activates the result with an activation function, and feeds it into alternating E-ELAN and MPConv layers, which output three feature maps of different sizes;
the backbone layer comprises BConv layers, E-ELAN layers and MPConv layers, wherein a BConv layer consists of a convolution layer, a BN layer and an activation function.
As a preferred embodiment of the present invention, when the head layer performs feature extraction and detection, the method includes:
the head layer performs feature extraction again on the three feature maps of different sizes output by the backbone layer through an SPPCSPC layer, several BConv layers, several MPConv layers and several Catconv layers, outputs three feature maps of different sizes again, and obtains the coordinates of the target tray after detection through three RepVGG block layers and three conv layers.
As a preferred embodiment of the present invention, when acquiring a position of a target tray with respect to a virtual tray, the method includes:
the standard tray point cloud set and the actual tray point cloud set are constrained according to a certain constraint condition, and the constraint method is specifically shown as a formula 1 and a formula 2:
Figure BDA0004000425510000031
Figure BDA0004000425510000032
in the method, in the process of the invention,
Figure BDA0004000425510000041
for a single point of the standard tray point cloud, +.>
Figure BDA0004000425510000042
For standard tray point clouds, +.>
Figure BDA0004000425510000043
Is->
Figure BDA0004000425510000044
Centroid of->
Figure BDA0004000425510000045
For a single point of the actual tray point cloud, +.>
Figure BDA0004000425510000046
For the actual tray point cloud, +.>
Figure BDA0004000425510000047
Is->
Figure BDA0004000425510000048
Is a centroid of (c).
In a preferred embodiment of the present invention, when acquiring the position of the target tray relative to the virtual tray, the method further includes:
according to the constraint condition, a first loss function equation is established, as shown in equation 3:

E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \left\| p_t^i - \left( R p_s^i + t \right) \right\|^2 \quad (3)

where R is a rotation matrix and t is a translation vector;

letting N = |P_s| be the total number of points, the first loss function equation is differentiated with respect to t and the derivative is set to 0, giving the coordinate equation shown in equation 4:

t = \bar{p}_t - R \bar{p}_s \quad (4)

the optimal t, i.e. the coordinates (X, Y, Z) of the target tray relative to the virtual tray, is obtained from the coordinate equation.
As a preferred embodiment of the present invention, when acquiring an angle of a target tray with respect to a virtual tray, the method includes:
without considering translation, a second loss function equation is established, as shown in equation 5:

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left\| \left( p_t^i - \bar{p}_t \right) - R \left( p_s^i - \bar{p}_s \right) \right\|^2 \quad (5)

where R is a rotation matrix, \bar{p}_s is the centroid of the standard tray point cloud, and \bar{p}_t is the centroid of the actual tray point cloud;

the second loss function equation is simplified using relation 6 and relation 7, giving the simplified expression shown in equation 8:

R^{T} R = I \quad (6)

\left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \left( p_s^i - \bar{p}_s \right)^{T} R^{T} \left( p_t^i - \bar{p}_t \right) \quad (7)

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left( \left\| p_t^i - \bar{p}_t \right\|^2 + \left\| p_s^i - \bar{p}_s \right\|^2 - 2 \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \right) \quad (8)

where the superscript T denotes the matrix transpose and I is the identity matrix;

since the coordinates (X, Y, Z) of the tray are determined independently of R and the first two terms of equation 8 do not depend on R, minimizing the second loss function equation is equivalent to maximizing the remaining term, as shown in equation 9:

R^{*} = \arg\max_{R} \sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \quad (9)
as a preferred embodiment of the present invention, when acquiring the angle of the target tray with respect to the virtual tray, further comprising:
equation 9 is transformed according to relation 10, giving equation 11:

\sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (10)

R^{*} = \arg\max_{R} \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (11)

where P_s and P_t here denote the matrices whose columns are the centered points p_s^i - \bar{p}_s and p_t^i - \bar{p}_t;

using the properties of the trace, \operatorname{trace}\left( P_t^{T} R P_s \right) is converted as shown in equation 12:

\operatorname{trace}\left( P_t^{T} R P_s \right) = \operatorname{trace}\left( R P_s P_t^{T} \right) = \operatorname{trace}\left( R U \Sigma V^{T} \right) = \operatorname{trace}\left( \Sigma V^{T} R U \right) \quad (12)

where U \Sigma V^{T} is the singular value decomposition of P_s P_t^{T}, U and V are orthogonal matrices, \Sigma is the diagonal matrix of singular values, and V^{T} R U is itself an orthogonal matrix;

equation 12 is then rewritten using the matrix relation shown in equation 13, and the conversion process is shown in equation 14:

M = V^{T} R U, \quad \Sigma = \operatorname{diag}\left( \sigma_1, \sigma_2, \sigma_3 \right) \quad (13)

\operatorname{trace}\left( \Sigma V^{T} R U \right) = \operatorname{trace}\left( \Sigma M \right) = \sigma_1 m_{11} + \sigma_2 m_{22} + \sigma_3 m_{33} \quad (14)

where M is an orthogonal matrix with entries m_{ij};

letting M be the identity matrix maximizes \operatorname{trace}\left( \Sigma M \right) and yields the angle of the target tray relative to the virtual tray, as shown in equations 15, 16 and 17:

V^{T} R U = I \quad (15)

R = V U^{T} \quad (16)

R^{*} = V U^{T} \quad (17)

where R^{*} is the rotation matrix that gives the angle of the target tray relative to the virtual tray.
A system for accurately identifying a tray based on deep learning, comprising:
training data set construction unit: used for collecting depth images and color images of a plurality of trays with the image acquisition device and aligning the depth images with the color images; and for marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
standard tray point cloud set construction unit: used for identifying the coordinates of the tray in the color image through the neural network by deep learning; and for obtaining the position of the tray in the depth image according to the coordinates of the tray and inputting the overall dimensions of the tray at that position to construct a standard tray point cloud set;
tray identification unit: used for performing ICP point cloud matching between the actual tray point cloud set currently acquired by the image acquisition device and the standard tray point cloud set, and acquiring the position and angle of the target tray relative to the virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
Compared with the prior art, the invention has the beneficial effects that:
(1) By using a neural network for deep learning, the invention eliminates the false tray identifications that arise from relying on the depth map alone;
(2) In the process of identifying and positioning the tray, the coordinates of the tray in the color image are identified from the color image by deep learning, which eliminates misidentification caused by people or other objects; once the position in the color image is known, the approximate position of the tray in the depth image is also known, which improves the accuracy of tray identification.
The invention is described in further detail below with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a diagram of the steps of a method for accurately identifying a tray based on deep learning in accordance with an embodiment of the present invention;
FIG. 2 is a network architecture diagram of a YOLOv7 neural network according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of the backbone layer of the YOLOv7 neural network according to an embodiment of the present invention;
FIG. 4 is a network structure diagram of the BConv layer of the backbone layer according to an embodiment of the present invention;
FIG. 5 is a network structure diagram of the E-ELAN layer of the backbone layer according to an embodiment of the present invention;
FIG. 6 is a network structure diagram of the MPConv layer of the backbone layer according to an embodiment of the present invention;
FIG. 7 is a network architecture diagram of the head layer of the YOLOv7 neural network according to an embodiment of the invention;
FIG. 8 is a network structure diagram of the SPPCSPC layer of the head layer according to an embodiment of the present invention;
FIG. 9 is a network structure diagram of the Catconv layer of the head layer according to an embodiment of the present invention;
FIG. 10 is a network structure diagram of the RepVGG block layer of the head layer according to an embodiment of the invention.
Detailed Description
The method for accurately identifying the tray based on the deep learning provided by the invention, as shown in fig. 1, comprises the following steps:
step S1: acquiring the trays by using an image acquisition device to obtain depth images and color images of a plurality of trays, and aligning the depth images and the color images;
step S2: after marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
step S3: the neural network identifies the coordinates of the tray in the color image in a deep learning mode;
step S4: obtaining the position of the tray in the depth image according to the coordinates of the tray, and inputting the overall dimensions of the tray at that position in the depth image to construct a standard tray point cloud set;
step S5: performing ICP point cloud matching on the actual tray point cloud set and the standard tray point cloud set which are currently acquired by the image acquisition device, and acquiring the position and the angle of the target tray relative to the virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
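As an illustration of step S5, a minimal sketch of the ICP matching step using the Open3D library is shown below. The patent does not name a particular library; the 5 cm correspondence distance and the identity initialization are illustrative assumptions, and the two point clouds are assumed to be given as (N, 3) NumPy arrays.

```python
import numpy as np
import open3d as o3d

def match_tray(standard_points: np.ndarray, actual_points: np.ndarray):
    """Register the standard (virtual) tray point cloud onto the actual tray
    point cloud with point-to-point ICP and return rotation R and translation t."""
    source = o3d.geometry.PointCloud()
    source.points = o3d.utility.Vector3dVector(standard_points)   # virtual tray
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(actual_points)     # currently acquired tray

    result = o3d.pipelines.registration.registration_icp(
        source, target,
        0.05,                 # max correspondence distance in meters (illustrative)
        np.eye(4),            # initial transform: identity
        o3d.pipelines.registration.TransformationEstimationPointToPoint())

    T = result.transformation          # 4x4 homogeneous transform of the target tray
    R, t = T[:3, :3], T[:3, 3]         # rotation (angle) and translation (position)
    return R, t
```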
In the above steps S1 and S5, the image acquisition device is a depth camera; identifying the tray with a depth camera makes it possible to output both depth image data and ordinary color image data.
Further, the depth camera is an Intel RealSense D455 depth camera.
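A minimal acquisition sketch for obtaining an aligned depth and color frame with the pyrealsense2 SDK might look as follows; the stream resolutions and frame rate are illustrative choices, not values fixed by the patent.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)          # map depth pixels onto the color frame

try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth_image = np.asanyarray(aligned.get_depth_frame().get_data())  # uint16, scaled by the device depth_scale
    color_image = np.asanyarray(aligned.get_color_frame().get_data())  # BGR color image
finally:
    pipeline.stop()
```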
In the above step S2, marking the position of the tray in the color image includes: manually driving the forklift to insert into and pick up the tray while the depth camera records the whole insertion process as samples, and then using labeling software to mark each frame of the samples; the marked position of the tray in each frame forms the deep learning training data set.
Further, the labeling software is labelImg.
In the above step S3, as shown in fig. 2, when the neural network recognizes the coordinates of the tray in the color image by the deep learning method, the method includes:
aligning the input color image into a 640 × 640 RGB image through the input layer and feeding it to the backbone layer;
the backbone layer performs feature extraction on the RGB image and outputs three feature maps of different sizes to the head layer;
the head layer performs feature extraction and detection again on the three feature maps of different sizes to obtain the coordinates of the target tray;
the neural network comprises the input layer, the backbone layer and the head layer.
Further, the neural network is a YOLOv7 neural network.
Further, when the input layer aligns the input color image, the method includes:
performing adaptive size processing on the input deep learning training data set, adjusting each image to a 1280 × 1280 RGB image, reducing it to 640 × 640 with a 16-layer convolution module, performing normalization and alignment, activating it with an activation function, and then sending it to the backbone layer.
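A minimal preprocessing sketch consistent with this step is shown below; the gray padding value, the BGR input format and the [0, 1] normalization are assumptions for illustration, and the convolutional reduction to 640 × 640 described above happens inside the network rather than in this helper.

```python
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize a color frame into a size x size letterboxed RGB tensor
    (CHW, float32 in [0, 1]) so it can be fed to the detector."""
    h, w = image_bgr.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(image_bgr, (int(round(w * scale)), int(round(h * scale))))

    canvas = np.full((size, size, 3), 114, dtype=np.uint8)   # gray letterbox padding
    canvas[:resized.shape[0], :resized.shape[1]] = resized

    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return rgb.transpose(2, 0, 1)                            # HWC -> CHW
```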
Further, when the backbone layer performs feature extraction on the RGB image, the method includes:
after a BConv layer receives the RGB image, feature extraction is performed through its convolution layer, convergence is accelerated with its BN layer, and the result is activated with an activation function and fed into alternating E-ELAN and MPConv layers, which output three feature maps of different sizes;
the backbone layer comprises BConv layers, E-ELAN layers and MPConv layers, wherein a BConv layer consists of a convolution layer, a BN layer and an activation function.
Still further, the activation function is LeakyReLU.
Specifically, the backbone layer of YOLOv7 is shown in fig. 3 and consists of several BConv layers, E-ELAN layers and MPConv layers, where a BConv layer is composed of a convolution layer, a BN layer and an activation function, as shown in fig. 4.
In fig. 4, BConv blocks of different colors indicate different convolutions (k is the kernel size, s is the stride, o is out_channel, i is in_channel; o = i means out_channel equals in_channel, and o ≠ i means out_channel is unrelated to in_channel and not necessarily equal to it): the first is a convolution with k = 1, s = 1, which leaves the input length and width unchanged; the second is a convolution with k = 3, s = 1, which also leaves the length and width unchanged; and the third has s = 2, so the output length and width are half of the input. The different colors of BConv mainly distinguish k and s, not the input and output channels.
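A minimal PyTorch sketch of such a BConv block (convolution + BN + LeakyReLU, using the activation named above) is given below; the padding choice k // 2 is an assumption that keeps the spatial size unchanged when s = 1.

```python
import torch
import torch.nn as nn

class BConv(nn.Module):
    """Conv + BatchNorm + LeakyReLU; k and s follow the patent's notation."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```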
Specifically, the E-ELAN layer is likewise built by splicing different convolutions, as shown in fig. 5. The input and output length and width of the whole E-ELAN layer are unchanged, and on the channel dimension o = 2i, where the 2i channels are obtained by concatenating the outputs of 4 conv branches whose output channels are each i/2.
Specifically, as shown in fig. 6, the MPConv layer has the same number of input and output channels while the output length and width are half of the input: the upper branch halves the length and width by max pooling and halves the channels with a BConv layer; the lower branch halves the channels with a first BConv layer and halves the length and width with a second BConv layer with k = 3, s = 2; the upper and lower branches are then concatenated to obtain an output with halved length and width and o = i.
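A simplified PyTorch sketch of the MPConv structure just described follows; the bconv helper mirrors the BConv sketch above, and the exact YOLOv7 implementation differs in details.

```python
import torch
import torch.nn as nn

def bconv(in_ch: int, out_ch: int, k: int, s: int) -> nn.Sequential:
    # Conv + BN + LeakyReLU helper, as in the BConv sketch above.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True))

class MPConvSketch(nn.Module):
    """Upper max-pool branch and lower strided-conv branch, each halving H, W and
    the channel count, concatenated so that out_channels == in_channels."""
    def __init__(self, ch: int):
        super().__init__()
        half = ch // 2
        self.upper = nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=2),
                                   bconv(ch, half, k=1, s=1))
        self.lower = nn.Sequential(bconv(ch, half, k=1, s=1),
                                   bconv(half, half, k=3, s=2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.upper(x), self.lower(x)], dim=1)
```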
In overview, the entire backbone layer alternately halves the length and width and doubles the channels while extracting features through the BConv, E-ELAN and MPConv layers.
Further, as shown in fig. 7, when the head layer performs feature extraction and detection, the method includes:
the head layer performs feature extraction again on the three feature maps of different sizes output by the backbone layer through an SPPCSPC layer, several BConv layers, several MPConv layers and several Catconv layers, outputs three feature maps of different sizes again, and obtains the coordinates of the target tray after detection through three RepVGG block layers and three conv layers.
Specifically, as shown in fig. 8, the output channel count of the whole SPPCSPC layer is out_channel; during the computation a hidden channel count hidden_channel = int(2 × e × out_channel) (hereinafter hc) is used for channel expansion, and in general e = 0.5 is taken, in which case hc = out_channel.
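As a small illustration of the channel computation just described:

```python
def sppcspc_hidden_channels(out_channel: int, e: float = 0.5) -> int:
    """Hidden channel width hc used inside the SPPCSPC layer; with e = 0.5,
    hc equals out_channel, as noted above."""
    return int(2 * e * out_channel)
```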
Specifically, the Catconv layer operates in substantially the same way as the E-ELAN layer, as shown in fig. 9. The input and output length and width of the whole Catconv layer are unchanged, and on the channel dimension o = 2i, where the 2i channels are obtained by concatenating the outputs of 6 conv branches whose output channels are each i/2.
Specifically, the RepVGG block (REP) layer is shown in fig. 10. REP has different structures during training and deployment: during training a 1 × 1 convolution branch is added alongside the 3 × 3 convolution, and if the input and output channels and the height and width are consistent, a BN identity branch is also added, the three branches being summed at the output; during deployment, the parameters of the branches are re-parameterized into the main branch for convenience, and only the 3 × 3 main-branch convolution output is taken.
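The re-parameterization idea can be illustrated with a short PyTorch sketch; BN folding and the identity branch are omitted, so this shows only how the 1 × 1 branch is merged into the 3 × 3 branch.

```python
import torch
import torch.nn.functional as F

def merge_1x1_into_3x3(k3: torch.Tensor, k1: torch.Tensor) -> torch.Tensor:
    """Zero-pad the 1x1 branch kernel to 3x3 and add it to the 3x3 branch kernel,
    so a single 3x3 convolution reproduces the sum of both branches at deployment."""
    return k3 + F.pad(k1, [1, 1, 1, 1])

# The merged kernel applied once equals the two branches applied separately.
x = torch.randn(1, 8, 16, 16)
k3 = torch.randn(16, 8, 3, 3)
k1 = torch.randn(16, 8, 1, 1)
two_branches = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1)
merged = F.conv2d(x, merge_1x1_into_3x3(k3, k1), padding=1)
assert torch.allclose(two_branches, merged, atol=1e-4)
```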
In the step S5, when the position of the target tray with respect to the virtual tray is acquired, the method includes:
the standard tray point cloud set and the actual tray point cloud set are constrained by computing their centroids, as shown in equation 1 and equation 2:

\bar{p}_s = \frac{1}{|P_s|} \sum_{i=1}^{|P_s|} p_s^i \quad (1)

\bar{p}_t = \frac{1}{|P_t|} \sum_{i=1}^{|P_t|} p_t^i \quad (2)

where p_s^i is a single point of the standard tray point cloud, P_s is the standard tray point cloud, \bar{p}_s is the centroid of P_s, p_t^i is a single point of the actual tray point cloud, P_t is the actual tray point cloud, and \bar{p}_t is the centroid of P_t.
Further, when the position of the target tray relative to the virtual tray is acquired, the method further comprises:
according to the constraint condition, a first loss function equation is established, as shown in equation 3:

E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \left\| p_t^i - \left( R p_s^i + t \right) \right\|^2 \quad (3)

where R is a rotation matrix and t is a translation vector;

letting N = |P_s| be the total number of points, the first loss function equation is differentiated with respect to t and the derivative is set to 0, giving the coordinate equation shown in equation 4:

t = \bar{p}_t - R \bar{p}_s \quad (4)

the optimal t, i.e. the coordinates (X, Y, Z) of the target tray relative to the virtual tray, is obtained from the coordinate equation, at which point the first loss function equation is also minimal.
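A small NumPy sketch of this closed-form translation (equation 4), assuming the two point clouds are given as (N, 3) arrays of corresponding points:

```python
import numpy as np

def optimal_translation(R: np.ndarray, P_s: np.ndarray, P_t: np.ndarray) -> np.ndarray:
    """With the rotation R fixed, the loss in equation 3 is minimized by
    t = centroid(P_t) - R @ centroid(P_s), giving the tray coordinates (X, Y, Z)."""
    return P_t.mean(axis=0) - R @ P_s.mean(axis=0)
```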
Specifically, the z-axis height of the tray is equal to the ground height plus the tray height, the tray X, Y axis is near the center of the depth camera, the tray is placed on the front face, and the tray is 0.6-2.2 meters away from the depth camera.
In the step S5, when acquiring the angle of the target tray with respect to the virtual tray, the method includes:
without considering translation, a second loss function equation is established, as shown in equation 5:

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left\| \left( p_t^i - \bar{p}_t \right) - R \left( p_s^i - \bar{p}_s \right) \right\|^2 \quad (5)

where R is a rotation matrix, \bar{p}_s is the centroid of the standard tray point cloud, and \bar{p}_t is the centroid of the actual tray point cloud;

the second loss function equation is simplified using relation 6 and relation 7, giving the simplified expression shown in equation 8:

R^{T} R = I \quad (6)

\left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \left( p_s^i - \bar{p}_s \right)^{T} R^{T} \left( p_t^i - \bar{p}_t \right) \quad (7)

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left( \left\| p_t^i - \bar{p}_t \right\|^2 + \left\| p_s^i - \bar{p}_s \right\|^2 - 2 \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \right) \quad (8)

where the superscript T denotes the matrix transpose and I is the identity matrix;

since the coordinates (X, Y, Z) of the tray are determined independently of R and the first two terms of equation 8 do not depend on R, minimizing the second loss function equation is equivalent to maximizing the remaining term, as shown in equation 9:

R^{*} = \arg\max_{R} \sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \quad (9)
specifically, relation 6 above follows from the orthogonality of the rotation matrix, and relation 7 follows from the property that the transpose of a scalar equals the scalar itself.
Further, when acquiring the angle of the target tray relative to the virtual tray, the method further comprises:
equation 9 is transformed according to relation 10, giving equation 11:

\sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (10)

R^{*} = \arg\max_{R} \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (11)

where P_s and P_t here denote the matrices whose columns are the centered points p_s^i - \bar{p}_s and p_t^i - \bar{p}_t;

using the properties of the trace, \operatorname{trace}\left( P_t^{T} R P_s \right) is converted as shown in equation 12:

\operatorname{trace}\left( P_t^{T} R P_s \right) = \operatorname{trace}\left( R P_s P_t^{T} \right) = \operatorname{trace}\left( R U \Sigma V^{T} \right) = \operatorname{trace}\left( \Sigma V^{T} R U \right) \quad (12)

where U \Sigma V^{T} is the singular value decomposition of P_s P_t^{T}, U and V are orthogonal matrices, \Sigma is the diagonal matrix of singular values, and V^{T} R U is itself an orthogonal matrix;

equation 12 is then rewritten using the matrix relation shown in equation 13, and the conversion process is shown in equation 14:

M = V^{T} R U, \quad \Sigma = \operatorname{diag}\left( \sigma_1, \sigma_2, \sigma_3 \right) \quad (13)

\operatorname{trace}\left( \Sigma V^{T} R U \right) = \operatorname{trace}\left( \Sigma M \right) = \sigma_1 m_{11} + \sigma_2 m_{22} + \sigma_3 m_{33} \quad (14)

where M is an orthogonal matrix with entries m_{ij};

letting M be the identity matrix maximizes \operatorname{trace}\left( \Sigma M \right) and yields the angle of the target tray relative to the virtual tray, as shown in equations 15, 16 and 17:

V^{T} R U = I \quad (15)

R = V U^{T} \quad (16)

R^{*} = V U^{T} \quad (17)

where R^{*} is the rotation matrix that gives the angle of the target tray relative to the virtual tray.
Specifically, the above relation 10 is obtained by matrix multiplication and the definition of trace, and the property of trace is specifically shown in the formula 18:
trace(AB)=trace(BA) (18)。
specifically, from the non-negative nature of the singular value and the nature of the orthogonal matrix (the absolute value of the element in the orthogonal matrix is not more than 1), it is easy to prove that trace (Σm) is maximum only when M is the unit matrix.
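Putting equations 1 to 17 together, one SVD-based alignment step can be sketched in NumPy as follows. Correspondences between the two (N, 3) point sets are assumed to be given (a full ICP re-estimates them each iteration), and the reflection guard D is a standard safeguard that the patent does not spell out.

```python
import numpy as np

def tray_pose_from_point_clouds(P_s: np.ndarray, P_t: np.ndarray):
    """Estimate the rotation R* and translation t mapping the standard (virtual)
    tray points P_s onto the actual tray points P_t, following equations 1-17."""
    mu_s, mu_t = P_s.mean(axis=0), P_t.mean(axis=0)        # centroids (equations 1 and 2)
    Q_s, Q_t = P_s - mu_s, P_t - mu_t                      # centered point sets

    H = Q_s.T @ Q_t                                        # 3x3 matrix P_s P_t^T
    U, S, Vt = np.linalg.svd(H)                            # H = U diag(S) Vt
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T                                     # R* = V U^T (equation 17)
    t = mu_t - R @ mu_s                                    # translation (equation 4)
    return R, t
```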
The invention provides a system for accurately identifying a tray based on deep learning, which comprises the following components:
training data set construction unit: used for collecting depth images and color images of a plurality of trays with the image acquisition device and aligning the depth images with the color images; and for marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
standard tray point cloud set construction unit: used for identifying the coordinates of the tray in the color image through the neural network by deep learning; and for obtaining the position of the tray in the depth image according to the coordinates of the tray and inputting the overall dimensions of the tray at that position to construct a standard tray point cloud set;
tray identification unit: used for performing ICP point cloud matching between the actual tray point cloud set currently acquired by the image acquisition device and the standard tray point cloud set, and acquiring the position and angle of the target tray relative to the virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
Compared with the prior art, the invention has the beneficial effects that:
(1) By using a neural network for deep learning, the invention eliminates the false tray identifications that arise from relying on the depth map alone;
(2) In the process of identifying and positioning the tray, the coordinates of the tray in the color image are identified from the color image by deep learning, which eliminates misidentification caused by people or other objects; once the position in the color image is known, the approximate position of the tray in the depth image is also known, which improves the accuracy of tray identification.
The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but any insubstantial changes and substitutions made by those skilled in the art on the basis of the present invention are intended to be within the scope of the present invention as claimed.

Claims (10)

1. The method for accurately identifying the tray based on deep learning is characterized by comprising the following steps of:
acquiring the trays by using an image acquisition device to obtain depth images and color images of a plurality of trays, and aligning the depth images and the color images;
after marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
the neural network identifies the coordinates of the tray in the color image in a deep learning mode;
obtaining the position of the tray in the depth image according to the coordinates of the tray, and inputting the overall dimensions of the tray at that position to construct a standard tray point cloud set;
performing ICP point cloud matching on the actual tray point cloud set and the standard tray point cloud set which are currently acquired by the image acquisition device, and acquiring the position and the angle of a target tray relative to a virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
2. The method for accurately identifying the tray based on the deep learning according to claim 1, wherein when the neural network identifies the coordinates of the tray in the color image by means of the deep learning, the method comprises:
aligning the input color image into a 640 × 640 RGB image through an input layer and feeding it to a backbone layer;
the backbone layer performs feature extraction on the RGB image and outputs three feature maps of different sizes to a head layer;
the head layer performs feature extraction and detection again on the three feature maps of different sizes to obtain the coordinates of the target tray;
the neural network comprises the input layer, the backbone layer and the head layer.
3. The method for accurately recognizing a tray based on deep learning according to claim 2, wherein when the input layer aligns the input color image, comprising:
performing adaptive size processing on the input deep learning training data set, adjusting each image to a 1280 × 1280 RGB image, reducing it to 640 × 640 with a 16-layer convolution module, performing normalization and alignment, activating it with an activation function, and then sending it to the backbone layer.
4. The method for accurately identifying a tray based on deep learning according to claim 2, wherein when the back plane layer performs feature extraction on the RGB picture, the method comprises:
a BConv layer receives the RGB image, performs feature extraction through a convolution layer, accelerates convergence with a BN layer, activates the result with an activation function, and feeds it into alternating E-ELAN and MPConv layers, which output three feature maps of different sizes;
the backbone layer comprises BConv layers, E-ELAN layers and MPConv layers, wherein a BConv layer consists of a convolution layer, a BN layer and an activation function.
5. The method for accurately identifying a tray based on deep learning according to claim 2, wherein when the head layer performs feature extraction and detection, the method comprises:
the head layer performs feature extraction again on the three feature maps of different sizes output by the backbone layer through an SPPCSPC layer, several BConv layers, several MPConv layers and several Catconv layers, outputs three feature maps of different sizes again, and obtains the coordinates of the target tray after detection through three RepVGG block layers and three conv layers.
6. The method for accurately identifying a tray based on deep learning according to claim 1, wherein when acquiring the position of a target tray with respect to a virtual tray, comprising:
the standard tray point cloud set and the actual tray point cloud set are constrained by computing their centroids, as shown in equation 1 and equation 2:

\bar{p}_s = \frac{1}{|P_s|} \sum_{i=1}^{|P_s|} p_s^i \quad (1)

\bar{p}_t = \frac{1}{|P_t|} \sum_{i=1}^{|P_t|} p_t^i \quad (2)

where p_s^i is a single point of the standard tray point cloud, P_s is the standard tray point cloud, \bar{p}_s is the centroid of P_s, p_t^i is a single point of the actual tray point cloud, P_t is the actual tray point cloud, and \bar{p}_t is the centroid of P_t.
7. The method for accurately identifying a tray based on deep learning of claim 6, further comprising, when acquiring the position of the target tray relative to the virtual tray:
according to the constraint condition, a first loss function equation is established, as shown in equation 3:

E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \left\| p_t^i - \left( R p_s^i + t \right) \right\|^2 \quad (3)

where R is a rotation matrix and t is a translation vector;

letting N = |P_s| be the total number of points, the first loss function equation is differentiated with respect to t and the derivative is set to 0, giving the coordinate equation shown in equation 4:

t = \bar{p}_t - R \bar{p}_s \quad (4)

the optimal t, i.e. the coordinates (X, Y, Z) of the target tray relative to the virtual tray, is obtained from the coordinate equation.
8. The method for accurately identifying a tray based on deep learning according to claim 1, wherein when acquiring an angle of a target tray with respect to a virtual tray, comprising:
without considering translation, a second loss function equation is established, as shown in equation 5:

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left\| \left( p_t^i - \bar{p}_t \right) - R \left( p_s^i - \bar{p}_s \right) \right\|^2 \quad (5)

where R is a rotation matrix, \bar{p}_s is the centroid of the standard tray point cloud, and \bar{p}_t is the centroid of the actual tray point cloud;

the second loss function equation is simplified using relation 6 and relation 7, giving the simplified expression shown in equation 8:

R^{T} R = I \quad (6)

\left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \left( p_s^i - \bar{p}_s \right)^{T} R^{T} \left( p_t^i - \bar{p}_t \right) \quad (7)

E(R) = \frac{1}{N} \sum_{i=1}^{N} \left( \left\| p_t^i - \bar{p}_t \right\|^2 + \left\| p_s^i - \bar{p}_s \right\|^2 - 2 \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \right) \quad (8)

where the superscript T denotes the matrix transpose and I is the identity matrix;

since the coordinates (X, Y, Z) of the tray are determined independently of R and the first two terms of equation 8 do not depend on R, minimizing the second loss function equation is equivalent to maximizing the remaining term, as shown in equation 9:

R^{*} = \arg\max_{R} \sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) \quad (9)
9. the method for accurately identifying a tray based on deep learning of claim 8, further comprising, when acquiring an angle of a target tray with respect to a virtual tray:
equation 9 is transformed according to relation 10, giving equation 11:

\sum_{i=1}^{N} \left( p_t^i - \bar{p}_t \right)^{T} R \left( p_s^i - \bar{p}_s \right) = \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (10)

R^{*} = \arg\max_{R} \operatorname{trace}\left( P_t^{T} R P_s \right) \quad (11)

where P_s and P_t here denote the matrices whose columns are the centered points p_s^i - \bar{p}_s and p_t^i - \bar{p}_t;

using the properties of the trace, \operatorname{trace}\left( P_t^{T} R P_s \right) is converted as shown in equation 12:

\operatorname{trace}\left( P_t^{T} R P_s \right) = \operatorname{trace}\left( R P_s P_t^{T} \right) = \operatorname{trace}\left( R U \Sigma V^{T} \right) = \operatorname{trace}\left( \Sigma V^{T} R U \right) \quad (12)

where U \Sigma V^{T} is the singular value decomposition of P_s P_t^{T}, U and V are orthogonal matrices, \Sigma is the diagonal matrix of singular values, and V^{T} R U is itself an orthogonal matrix;

equation 12 is then rewritten using the matrix relation shown in equation 13, and the conversion process is shown in equation 14:

M = V^{T} R U, \quad \Sigma = \operatorname{diag}\left( \sigma_1, \sigma_2, \sigma_3 \right) \quad (13)

\operatorname{trace}\left( \Sigma V^{T} R U \right) = \operatorname{trace}\left( \Sigma M \right) = \sigma_1 m_{11} + \sigma_2 m_{22} + \sigma_3 m_{33} \quad (14)

where M is an orthogonal matrix with entries m_{ij};

letting M be the identity matrix maximizes \operatorname{trace}\left( \Sigma M \right) and yields the angle of the target tray relative to the virtual tray, as shown in equations 15, 16 and 17:

V^{T} R U = I \quad (15)

R = V U^{T} \quad (16)

R^{*} = V U^{T} \quad (17)

where R^{*} is the rotation matrix that gives the angle of the target tray relative to the virtual tray.
10. A system for accurately identifying a tray based on deep learning, comprising:
training data set construction unit: used for collecting depth images and color images of a plurality of trays with the image acquisition device and aligning the depth images with the color images; and for marking the positions of the trays in the color images, taking the marked color images as a deep learning training data set, and inputting the deep learning training data set into a neural network;
standard tray point cloud set construction unit: used for identifying the coordinates of the tray in the color image through the neural network by deep learning; and for obtaining the position of the tray in the depth image according to the coordinates of the tray and inputting the overall dimensions of the tray at that position to construct a standard tray point cloud set;
tray identification unit: used for performing ICP point cloud matching between the actual tray point cloud set currently acquired by the image acquisition device and the standard tray point cloud set, and acquiring the position and angle of the target tray relative to the virtual tray, so as to obtain the pose of the target tray relative to the image acquisition device;
the actual tray point cloud set comprises target trays to be identified, and the standard tray point cloud set comprises virtual trays constructed according to the positions of the trays and the outline dimensions of the trays.
CN202211616543.7A 2022-12-15 2022-12-15 Method and system for accurately identifying tray based on deep learning Pending CN116310622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211616543.7A CN116310622A (en) 2022-12-15 2022-12-15 Method and system for accurately identifying tray based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211616543.7A CN116310622A (en) 2022-12-15 2022-12-15 Method and system for accurately identifying tray based on deep learning

Publications (1)

Publication Number Publication Date
CN116310622A true CN116310622A (en) 2023-06-23

Family

ID=86815565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211616543.7A Pending CN116310622A (en) 2022-12-15 2022-12-15 Method and system for accurately identifying tray based on deep learning

Country Status (1)

Country Link
CN (1) CN116310622A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612357A (en) * 2023-07-11 2023-08-18 睿尔曼智能科技(北京)有限公司 Method, system and storage medium for constructing unsupervised RGBD multi-mode data set

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989366A (en) * 2021-12-27 2022-01-28 机科发展科技股份有限公司 Tray positioning method and device
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN114694134A (en) * 2022-03-23 2022-07-01 成都睿芯行科技有限公司 Tray identification and positioning method based on depth camera point cloud data
CN114972968A (en) * 2022-05-19 2022-08-30 长春市大众物流装配有限责任公司 Tray identification and pose estimation method based on multiple neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989366A (en) * 2021-12-27 2022-01-28 机科发展科技股份有限公司 Tray positioning method and device
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN114694134A (en) * 2022-03-23 2022-07-01 成都睿芯行科技有限公司 Tray identification and positioning method based on depth camera point cloud data
CN114972968A (en) * 2022-05-19 2022-08-30 长春市大众物流装配有限责任公司 Tray identification and pose estimation method based on multiple neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
徐斌 et al., "Pallet positioning system integrating image and point cloud processing", Manufacturing Automation, pages 2-3 *
问夏, "YOLOv7: fast and accurate, a major work by the YOLOv4 team", zhuanlan.zhihu.com/p/554769215?UTM_ID=0, pages 3-9 *
陈学坤, "Research on sphere-feature-based point cloud registration technology in part accuracy inspection", Wanfang Master's Degree Thesis Database, pages 42-44 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612357A (en) * 2023-07-11 2023-08-18 睿尔曼智能科技(北京)有限公司 Method, system and storage medium for constructing unsupervised RGBD multi-mode data set
CN116612357B (en) * 2023-07-11 2023-11-24 睿尔曼智能科技(北京)有限公司 Method, system and storage medium for constructing unsupervised RGBD multi-mode data set

Similar Documents

Publication Publication Date Title
US10198623B2 (en) Three-dimensional facial recognition method and system
US8374422B2 (en) Face expressions identification
KR102667740B1 (en) Device and method for matching image
US9928405B2 (en) System and method for detecting and tracking facial features in images
US8467596B2 (en) Method and apparatus for object pose estimation
CN105740780B (en) Method and device for detecting living human face
Keller et al. A new benchmark for stereo-based pedestrian detection
Zhu et al. Discriminative 3D morphable model fitting
US20220157047A1 (en) Feature Point Detection
WO2019071664A1 (en) Human face recognition method and apparatus combined with depth information, and storage medium
US20130202161A1 (en) Enhanced face detection using depth information
US20110227923A1 (en) Image synthesis method
US9767383B2 (en) Method and apparatus for detecting incorrect associations between keypoints of a first image and keypoints of a second image
US20150324659A1 (en) Method for detecting objects in stereo images
JP2010267231A (en) Device and method for estimating positional orientation
US11380010B2 (en) Image processing device, image processing method, and image processing program
KR101865253B1 (en) Apparatus for age and gender estimation using region-sift and discriminant svm classifier and method thereof
CN110458041A (en) A kind of face identification method and system based on RGB-D camera
CN116310622A (en) Method and system for accurately identifying tray based on deep learning
CN109919128A (en) Acquisition methods, device and the electronic equipment of control instruction
Segundo et al. Real-time scale-invariant face detection on range images
CN110956664A (en) Real-time camera position repositioning method for handheld three-dimensional scanning system
CN112084840A (en) Finger vein identification method based on three-dimensional NMI
CN108694348B (en) Tracking registration method and device based on natural features
CN110070490A (en) Image split-joint method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination