CN115471542A - Packaging object binocular recognition and positioning method based on YOLO v5 - Google Patents

Packaging object binocular recognition and positioning method based on YOLO v5

Info

Publication number
CN115471542A
CN115471542A
Authority
CN
China
Prior art keywords
yolo
binocular
images
positioning
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210479182.XA
Other languages
Chinese (zh)
Inventor
艾长胜
张传斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202210479182.XA priority Critical patent/CN115471542A/en
Publication of CN115471542A publication Critical patent/CN115471542A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85 Stereo camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a binocular recognition and positioning method for packaged objects based on YOLO v5, which mainly comprises the following steps: A. calibrating a binocular camera; B. image acquisition and stereo rectification; C. package identification and positioning using YOLO v5; D. stereo matching; E. calculating the three-dimensional coordinates of the packaged objects. Images are first collected with the calibrated binocular camera, and the resulting left and right images are rectified into the binocular images of an ideal imaging system. The rectified images are input into the target detection model, the recognized target packages are matched between the left and right images with a stereo matching algorithm, and the distance of the target package is calculated from the disparity between the left and right images. Finally, the three-dimensional coordinates of the package are sent to the robot, where they can guide the robot to grab and sort packages such as packaging bags or packaging boxes in an automated scene.

Description

Packaging object binocular recognition and positioning method based on YOLO v5
Technical Field
The invention relates to the field of machine vision detection, in particular to a packaging object binocular recognition and positioning method based on YOLO v5.
Background
With the development of modern industry and the maturing of automation technology, scenarios in which robots replace manual labour are increasingly common. Sorting packages such as boxes and bags on an assembly line, and carrying such packages inside containers, are tedious tasks that consume a large amount of human resources over long periods.
The development of artificial intelligence and deep learning has driven rapid progress in target detection. YOLO v5 is an end-to-end single-stage target detection algorithm that directly regresses the category and position of a target. By iteratively training YOLO v5 on a large amount of data, accurate identification and positioning of packaged objects can be realized.
Acquiring the spatial position coordinates of the packages relative to the robot is extremely important for robotic sorting and handling. Binocular vision positioning imitates the visual distance estimation and scene reconstruction of biological systems: two colour cameras separated by a fixed baseline photograph the same object, the imaging pixels of that object in the two cameras correspond to one another, and the spatial position of the target can be recovered from the imaging transformation matrices and the positions of the corresponding pixel points in image space.
Disclosure of Invention
In order to solve the problem that the work efficiency of manual sorting and carrying of packaged objects is low, the invention provides a packaged object binocular recognition and positioning method based on YOLO v 5.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention comprises the following steps:
A. calibrating a binocular camera;
B. collecting images and carrying out stereo correction;
C. using YOLO v5 to identify and position the package;
D. carrying out stereo matching on the left and right images;
E. and calculating the three-dimensional coordinates of the packaged objects.
Further, the specific operation of step A is to capture about 20 checkerboard images with each of the left and right cameras of the binocular camera, calibrate the two cameras with Zhang's calibration method, and obtain the internal and external parameters of the left and right cameras together with the transformation matrix of their relative positions; the spatial position of an object can then be obtained from the transformation matrix and the positions of the corresponding pixel points in image space.
Further, the specific operation of step B is to acquire left and right images with the binocular camera, then perform stereo rectification on them according to the camera parameters obtained in step A, and finally use the rectified left and right images as the input of the target detection model; the stereo rectification is illustrated in fig. 3.
Further, the specific operation of step C is to first acquire an image data set of the packages to train a YOLO v5 detection model, and then detect the bounding box of the packages in the image (the centre coordinates, width, height, and category of the prediction box) with the trained model.
Further, YOLO v5 in step C is a single-stage target detection model that identifies and positions the package directly by regression, meeting the real-time requirement.
Further, the specific operation of step D is to match the package targets detected in the left and right images with a stereo matching algorithm to obtain a disparity map; the SGBM algorithm is adopted because it balances precision and speed better than other matching algorithms.
Further, the specific operation of step E is to calculate the spatial position coordinates of the package from the camera transformation matrix and the positions of the corresponding package pixel points in image space.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a packaging object binocular recognition and positioning method based on YOLO v5 that uses a deep-learning target detection model to recognize packages. Being data-driven, the model can reach higher precision than traditional target detection given sufficient training data, and the features extracted by the convolutional neural network are more robust. Moreover, the single-stage YOLO v5 model can be deployed to embedded and mobile devices, lowering the threshold for using deep learning models.
Drawings
Fig. 1 is a binocular camera ranging diagram.
Fig. 2 is a flow chart of binocular camera calibration.
Fig. 3 is a schematic view of image stereo correction.
Fig. 4 is a binocular camera obstacle detection flowchart.
Fig. 5 is a flow chart of the YOLOv5 package identification and positioning.
FIG. 6 is a structural diagram of YOLOv5.
FIG. 7 is a CSP1_X structural diagram.
FIG. 8 is a CSP2_X structural diagram.
Detailed Description
In order to make the technical solution of the present invention better understood, the technical solution of the present invention will be fully and thoroughly described with reference to the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
A. Calibrating the binocular camera.
B. Acquiring images and performing stereo correction.
C. Performing package identification and positioning using YOLO v5.
D. Performing stereo matching on the left and right images.
E. Calculating the three-dimensional coordinates of the packages.
The invention is described in further detail below with reference to the accompanying drawings: a packaging object binocular identification and positioning method based on YOLO v5 comprises the following steps.
A. Calibrating a camera: in image measurement and machine vision applications, a geometric model of camera imaging must be established to determine the relation between the three-dimensional position of a point on an object surface in space and the corresponding point in the image; the parameters of this geometric model are the camera parameters. The aim of camera calibration is to obtain the internal and external parameters and the distortion parameters of the camera, and solving these parameters is equivalent to finding the mapping model from three dimensions to two dimensions. The invention adopts the traditional calibration mode: a two-dimensional calibration plate is placed in front of the camera at various angles, and the captured plate images are input into a calibration program to obtain the internal parameters and distortion parameters of the left and right cameras, together with the translation matrix and rotation matrix between them; the calibration process is shown in fig. 2.
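As an illustrative aside, this calibration step can be sketched with OpenCV, whose calibrateCamera and stereoCalibrate functions implement Zhang's method; the board geometry, square size, and image paths below are assumptions for the sketch, not values fixed by the patent.

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)     # inner-corner count of the assumed checkerboard
SQUARE = 0.025       # assumed square edge length in metres

objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, PATTERN)
    okr, cr = cv2.findChessboardCorners(gr, PATTERN)
    if okl and okr:                          # keep pairs seen by both cameras
        obj_pts.append(objp)
        left_pts.append(cl)
        right_pts.append(cr)

size = gl.shape[::-1]                        # (width, height) of the images
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
# R, T: rotation and translation of the right camera relative to the left.
_, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```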
B. Image correction: images are collected by the binocular camera as the input of the package identification and positioning system, and are then preprocessed to remove distortion. Because of the inherent perspective distortion of the optical lens, images shot by the camera exhibit radial and tangential distortion, which can be alleviated using the distortion parameters obtained during camera calibration.
The main task of the binocular camera is to measure depth, and the disparity ranging formula is derived under the ideal binocular configuration, so the actual binocular system must be rectified to the ideal one: the image planes of the two cameras are parallel, the optical axes are perpendicular to the image planes, and the epipoles lie at infinity. The invention specifically adopts the Bouguet stereo rectification algorithm, whose core principle is to apply perspective transformations to the pixel planes of the left and right images so that the reprojection error is minimized and the binocular system is closest to the ideal state; the final effect of stereo rectification is shown in fig. 3.
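A minimal sketch of this rectification step, assuming OpenCV (whose stereoRectify implements the Bouguet algorithm) and reusing K1, D1, K2, D2, R, T from the calibration sketch above; img_l and img_r stand for one captured image pair.

```python
import cv2

size = img_l.shape[1], img_l.shape[0]        # (width, height)
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, size, R, T,
                                                  alpha=0)  # crop to valid pixels
map1l, map2l = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
map1r, map2r = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
# Corresponding points now lie on the same image row, the ideal binocular
# geometry assumed by the disparity ranging formula.
```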
C. The Yolo v5 model is used for package identification and location.
Object detection generally combines two tasks: image classification and image localization. YOLO is a single-stage target detection algorithm: it extracts features from the input image with a convolutional neural network (CNN) and then infers the category of each target and its position in the image by direct regression. Unlike two-stage target detection algorithms it is faster and has fewer parameters, and compared with the manually designed feature extractors of traditional methods, the features extracted by a CNN are more robust. The invention therefore adopts single-stage YOLO v5 as the target detection framework to identify and position the package.
YOLO works by dividing the input image into a grid of cells; if the centre point of an object falls within a cell, that cell predicts the object. A problem arises when two objects of different classes fall into the same cell: the cell does not know which object to predict. The Anchor mechanism is therefore introduced into YOLO to resolve the case of two object classes in one cell. The YOLO v5 algorithm ships with anchor boxes whose widths and heights are preset for common data sets, but to fit the application scenario of the invention new anchor sizes must be calculated, which can be obtained by K-means clustering. The anchor mechanism improves the recall rate of targets and the stability of training. Although YOLO v5 provides an adaptive anchor mechanism that generates anchors automatically from the data set, the automatically generated sizes are not always accurate, so automatic generation is disabled here and the K-means algorithm is used to generate the anchors.
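A hedged sketch of the K-means anchor computation described above, using the customary 1 - IoU clustering distance; wh is assumed to be an (N, 2) array of labelled box widths and heights, and k = 9 (three anchors per detection scale) is an assumed choice, not a value from the patent.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster labelled box sizes into k anchors with a 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU computed on sizes only, as if all boxes shared one corner
        inter = np.minimum(wh[:, None, :], centers[None, :, :]).prod(-1)
        union = wh.prod(-1)[:, None] + centers.prod(-1)[None, :] - inter
        assign = (1 - inter / union).argmin(1)   # nearest anchor per box
        for j in range(k):
            if (assign == j).any():
                centers[j] = wh[assign == j].mean(0)
    return centers[np.argsort(centers.prod(-1))]  # sorted small to large
```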
In the invention, the packages are divided into three categories: box packages, bag packages, and bottle packages. According to the multi-scale detection structure of YOLO v5, the input picture is downsampled by 8, 16, and 32 times respectively. Assuming a 608 × 608 training picture, the image is divided into 76 × 76, 38 × 38, and 19 × 19 grids; the 76 × 76 grid predicts small targets, the 38 × 38 grid medium targets, and the 19 × 19 grid large targets. Each grid cell assumes B preset anchors, and each anchor yields the predicted centre coordinates (p_x, p_y), the predicted box width and height (p_w, p_h), and the confidence c, five values in total, plus the scores of the three classes. Thus, according to the above analysis, the tensor output by the 76 × 76 grid is 76 × 76 × B × (5 + 3), by the 38 × 38 grid 38 × 38 × B × (5 + 3), and by the 19 × 19 grid 19 × 19 × B × (5 + 3).
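As a quick arithmetic check of these shapes (B = 3 anchors and the three package classes are assumed):

```python
num_anchors, num_classes = 3, 3   # B = 3 anchors, 3 package classes (assumed)
for stride, scale in ((8, "small"), (16, "medium"), (32, "large")):
    g = 608 // stride
    print(f"{g} x {g} grid ({scale} targets): "
          f"output {g} x {g} x {num_anchors} x (5 + {num_classes})")
```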
The network structure of YOLO v5 follows that of common target detection algorithms: the whole network is divided into an input end, a feature extraction layer (Backbone), a feature fusion layer (Neck), and a target detection layer (Head).
The input end of YOLO v5 adopts Mosaic data enhancement: input images are randomly scaled, randomly cropped, and randomly arranged before being spliced together, which works particularly well for small targets. This enhancement both enriches the data set and reduces the computational load.
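A minimal sketch of the four-image Mosaic splicing under these assumptions; labels are omitted for brevity, and a real pipeline must also remap the bounding boxes of each tile.

```python
import numpy as np

def mosaic(imgs, s=608, seed=None):
    """Splice four s x s images into one 2s x 2s mosaic canvas."""
    rng = np.random.default_rng(seed)
    canvas = np.full((2 * s, 2 * s, 3), 114, np.uint8)   # grey letterbox fill
    cx, cy = rng.integers(s // 2, 3 * s // 2, size=2)    # random split point
    corners = [(0, 0, cx, cy), (cx, 0, 2 * s, cy),
               (0, cy, cx, 2 * s), (cx, cy, 2 * s, 2 * s)]
    for img, (x1, y1, x2, y2) in zip(imgs, corners):
        h = min(y2 - y1, img.shape[0])                   # crop tiles that overrun
        w = min(x2 - x1, img.shape[1])
        canvas[y1:y1 + h, x1:x1 + w] = img[:h, :w]
    return canvas
```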
Before entering the Backbone, the input image first passes through a Focus structure, which slices the input image: a 608 × 608 × 3 input, for example, first becomes a 304 × 304 × 12 feature map after the slicing operation, and then becomes a 304 × 304 × 32 feature map after a convolution with 32 kernels.
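The Focus slicing can be sketched in PyTorch as follows; the shapes in the comments mirror the 608 × 608 × 3 to 304 × 304 × 12 to 304 × 304 × 32 example above.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, k, padding=k // 2, bias=False)

    def forward(self, x):                       # x: (N, 3, 608, 608)
        # sample every second pixel at four phase offsets, stack on channels
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)                     # (N, 12, 304, 304) -> (N, 32, 304, 304)

out = Focus()(torch.zeros(1, 3, 608, 608))
print(out.shape)  # torch.Size([1, 32, 304, 304])
```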
YOLO v5 borrows the CSPNet network structure, applying CSP1_X in the Backbone and CSP2_X in the Neck, as shown in figs. 7 and 8. CSPNet addresses the heavy computation of inference from the perspective of network structure design: a CSP module splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchical structure, which reduces the amount of computation while maintaining accuracy. The main advantages of using the CSP structure in YOLO v5 are that it enhances the learning ability of the CNN, keeping accuracy while making the network lighter, and that it reduces the computing bottleneck and the memory cost.
The loss function of YOLO v5 consists of the target confidence loss L_obj (objectness loss), the target classification loss L_cls (class loss), and the target positioning loss L_box (bounding box loss). YOLO v5 calculates the target confidence loss and the target classification loss with the binary cross-entropy (BCE with logits) loss function, and the target positioning loss with the CIoU loss, where λ_1, λ_2, λ_3 are balance coefficients:

L = λ_1 L_obj + λ_2 L_cls + λ_3 L_box
The target confidence is the probability that a target is present in the prediction box. YOLO v5 adopts a binary cross-entropy loss function, where y_i ∈ {0, 1} indicates whether a target really exists in prediction box i (0 for absent, 1 for present) and p_i is the sigmoid probability that a target exists in prediction box i:

L_obj = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

p_i = sigmoid(w^T x + b)
The target classification loss is likewise implemented with a binary cross-entropy loss function, where y_i ∈ {0, 1} indicates whether the class target really exists in prediction box i and p_i is the corresponding sigmoid probability:

L_cls = -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

p_i = sigmoid(w^T x + b)
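Both loss terms above are binary cross-entropy over sigmoid outputs; in PyTorch this corresponds to BCEWithLogitsLoss, which fuses p_i = sigmoid(w^T x + b) with the log terms for numerical stability. The shapes below are illustrative.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()       # mean of -[y*log(p) + (1-y)*log(1-p)]
logits = torch.randn(8, 1)         # raw scores w^T x + b for 8 boxes
targets = torch.tensor([[1.], [0.], [1.], [1.], [0.], [0.], [1.], [0.]])
loss = bce(logits, targets)
print(loss.item())
```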
The target positioning loss adopts CIoU as the localization loss function. CIoU_Loss considers the overlapping area and the centre-point distance between the prediction box and the ground-truth box, and still measures their separation directly when the ground-truth box encloses the prediction box; it thereby accounts for the centre-point distance, the scale of the boxes, and the consistency of the aspect ratios of the prediction box and the target box, giving better bounding-box regression:

L_CIoU = 1 - IoU + ρ²(b, b_gt) / c² + α ν

where b denotes the centre point of the prediction box, b_gt the centre point of the ground-truth box, ρ the Euclidean distance, c the diagonal length of the smallest rectangle enclosing both the prediction box and the ground-truth box, α a weight coefficient, and ν the aspect-ratio consistency term, computed as:

ν = (4 / π²) ( arctan(w_gt / h_gt) - arctan(w / h) )²

α = ν / ( (1 - IoU) + ν )
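A sketch of the CIoU loss as defined above, for boxes given in (centre x, centre y, width, height) form; the eps guard is an implementation convenience, not part of the formula.

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """CIoU loss for boxes in (cx, cy, w, h) form; returns per-box loss."""
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    gx1, gy1 = gt[..., 0] - gt[..., 2] / 2, gt[..., 1] - gt[..., 3] / 2
    gx2, gy2 = gt[..., 0] + gt[..., 2] / 2, gt[..., 1] + gt[..., 3] / 2
    inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0)
             * (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0))
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter + eps
    iou = inter / union
    rho2 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)   # smallest enclosing box
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps                     # its diagonal, squared
    v = (4 / math.pi ** 2) * (torch.atan(gt[..., 2] / (gt[..., 3] + eps))
                              - torch.atan(pred[..., 2] / (pred[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# e.g. ciou_loss(torch.tensor([[50., 50., 20., 40.]]),
#                torch.tensor([[55., 48., 22., 36.]]))
```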
the identification and positioning of the wrappage are divided into two parts of the training of a YOLOv5 model and the identification and positioning of the wrappage, and the specific scheme is shown in a flow chart 4.
Model training: a data set of 2000 package images is collected with a camera under different scenes, different angles, and different illumination to improve the generalization ability of the model, and training is accelerated with a GPU to obtain the trained model.
Package identification and positioning: the images acquired by the binocular camera are rectified to obtain row-aligned left and right images, which are input into the YOLO v5 model. Inference on the left image yields the category of each target and its corresponding bounding box (x_L, y_L, w_L, h_L); inference on the right image yields the category of each target and its corresponding bounding box (x_R, y_R, w_R, h_R).
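As a hedged illustration of this inference step, custom-trained YOLO v5 weights can be loaded through the torch.hub interface provided by the ultralytics/yolov5 repository; best.pt is an assumed weights path, and rect_l, rect_r are the rectified pair from step B.

```python
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
# OpenCV images are BGR; the hub model expects RGB numpy input.
det_l = model(cv2.cvtColor(rect_l, cv2.COLOR_BGR2RGB)).xywh[0]  # x_c, y_c, w, h, conf, cls
det_r = model(cv2.cvtColor(rect_r, cv2.COLOR_BGR2RGB)).xywh[0]
```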
D. And (5) stereo matching.
Using the model, the position and category of the packages in the left and right images can be calculated, but according to the ranging principle of a binocular camera the disparity between the left and right images must also be known. The left and right images are therefore stereo-matched: the same features are searched for in both images to generate the disparity map of the current pair, in which each pixel holds the disparity between the left and right images. From these disparity values, the distance between a package and the camera follows from the binocular ranging principle.
Furthermore, because package identification and positioning must run in real time, the stereo matching algorithm needs to satisfy both high precision and high speed, so the SGBM algorithm is adopted for stereo matching of the left and right images.
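A sketch of the SGBM matching with OpenCV's StereoSGBM on the rectified pair; the parameter values are common starting points, not tuned values from the patent.

```python
import cv2

block = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128,        # must be divisible by 16
    blockSize=block,
    P1=8 * 3 * block ** 2, P2=32 * 3 * block ** 2,
    uniquenessRatio=10, speckleWindowSize=100, speckleRange=2,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY)
disp = sgbm.compute(cv2.cvtColor(rect_l, cv2.COLOR_BGR2GRAY),
                    cv2.cvtColor(rect_r, cv2.COLOR_BGR2GRAY))
disp = disp.astype("float32") / 16.0           # OpenCV returns fixed-point x16
# disp[v, u] now holds the left-right disparity d used by the ranging formula.
```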
E. And calculating the three-dimensional coordinates of the packages.
The centre-point coordinates of all packages and their width and height values are known from the model inference stage. The pixel value at the corresponding centre-point position in the disparity map is the disparity of that package, and substituting this disparity into the following formulas yields the three-dimensional coordinates (X, Y, Z) of the package.
X = B · X_L / d

Y = B · Y_L / d

Z = B · f / d

where B is the baseline distance of the camera, X_L and Y_L are the abscissa and ordinate of the package in the left image under the pixel coordinate system (taken relative to the principal point), f is the focal length of the binocular camera, and d is the disparity of the package between the left and right images.
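Putting the formulas together, a small helper (with illustrative names: disp is the SGBM disparity map, f and B come from calibration, and (cx, cy) is the principal point of the left camera) recovers the package coordinates:

```python
def package_xyz(u, v, disp, f, B, cx, cy):
    """Recover (X, Y, Z) of a package centre (u, v) from the disparity map."""
    d = float(disp[int(v), int(u)])
    if d <= 0:                      # no valid match at this pixel
        return None
    Z = B * f / d                   # depth along the optical axis
    X = B * (u - cx) / d            # X_L taken relative to the principal point
    Y = B * (v - cy) / d
    return X, Y, Z
```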
In the invention, the left and right images collected by the binocular camera are first epipolar-rectified so that they correspond to an ideal binocular imaging system, which facilitates the subsequent stereo matching. The rectified images are then input into the YOLO v5 target detection model, which gives the centre coordinates of each package and the width and height of its prediction box in the left image, and likewise in the right image. Next, the SGBM algorithm produces the disparity map of the left and right images, in which each pixel value is the left-right disparity. For a package inferred by YOLO v5 in the previous step, the pixel value at its centre-point coordinates in the disparity map is its disparity X_dis = X_L - X_R; substituting this disparity into the formulas above yields the distance value.

Claims (6)

1. A packaging object binocular recognition and positioning method based on YOLO v5 is characterized by comprising the following steps:
A. calibrating a binocular camera;
B. image acquisition and stereo correction;
C. carrying out package identification and positioning by using YOLO v5;
D. stereo matching;
E. and acquiring the three-dimensional coordinates of the packaged object.
2. The binocular recognition and positioning method for packages based on YOLO v5 as claimed in claim 1, wherein the specific operation of step A is to capture about 20 checkerboard images with each of the left and right cameras of the binocular camera, calibrate the left and right cameras with Zhang's calibration method, and obtain the internal and external parameters of the left and right cameras together with the transformation matrix of their relative positions; the spatial position coordinates of the object are then calculated from the transformation matrix and the positions of corresponding pixel points in image space.
3. The method for binocular identification and positioning of packages based on YOLO v5 as claimed in claim 1, wherein the specific operation of step B is to acquire left and right images with a binocular camera, then perform stereo rectification on them according to the camera parameters acquired in step A, and finally use the rectified left and right images as the input of the target detection model.
4. The method for binocular identification and positioning of packages based on YOLO v5 as claimed in claim 1, wherein the specific operation of step C is to first train a YOLO v5 detection model and then use the trained model to detect the centre-point coordinates of the packages in the image and the width and height of their bounding boxes.
5. The binocular recognition and positioning method for packages based on YOLO v5 as claimed in claim 1, wherein the specific operation of step D is to match the packages detected in the left and right images with a stereo matching algorithm according to the target detection results, so as to obtain the disparity of each matched package between the left and right images.
6. The YOLO v5-based binocular identification and positioning method for packages as claimed in claim 1, wherein the specific operation of step E is to calculate the spatial position coordinates of the packages from the camera transformation matrix and the positions of the corresponding package pixel points in image space.
CN202210479182.XA 2022-05-05 2022-05-05 Packaging object binocular recognition and positioning method based on YOLO v5 Pending CN115471542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210479182.XA CN115471542A (en) 2022-05-05 2022-05-05 Packaging object binocular recognition and positioning method based on YOLO v5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210479182.XA CN115471542A (en) 2022-05-05 2022-05-05 Packaging object binocular recognition and positioning method based on YOLO v5

Publications (1)

Publication Number Publication Date
CN115471542A (en) 2022-12-13

Family

ID=84364702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210479182.XA Pending CN115471542A (en) 2022-05-05 2022-05-05 Packaging object binocular recognition and positioning method based on YOLO v5

Country Status (1)

Country Link
CN (1) CN115471542A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681778A (en) * 2023-06-06 2023-09-01 固安信通信号技术股份有限公司 Distance measurement method based on monocular camera
CN116681778B (en) * 2023-06-06 2024-01-09 固安信通信号技术股份有限公司 Distance measurement method based on monocular camera
CN116740334A (en) * 2023-06-23 2023-09-12 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO
CN116740334B (en) * 2023-06-23 2024-02-06 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO

Similar Documents

Publication Publication Date Title
CN111462135B (en) Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
US11205298B2 (en) Method and system for creating a virtual 3D model
CN103959307B (en) The method of detection and Expressive Features from gray level image
CN113537208A (en) Visual positioning method and system based on semantic ORB-SLAM technology
CN113436258B (en) Marine pontoon detection method and system based on vision and laser radar fusion
CN115471542A (en) Packaging object binocular recognition and positioning method based on YOLO v5
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
US20220319146A1 (en) Object detection method, object detection device, terminal device, and medium
Shen A survey of object classification and detection based on 2d/3d data
CN114693661A (en) Rapid sorting method based on deep learning
CN112580434B (en) Face false detection optimization method and system based on depth camera and face detection equipment
CN111239684A (en) Binocular fast distance measurement method based on YoloV3 deep learning
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN114399675A (en) Target detection method and device based on machine vision and laser radar fusion
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometrically constrained keypoints in real-time
CN114494248B (en) Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN111626241A (en) Face detection method and device
CN114298151A (en) 3D target detection method based on point cloud data and image data fusion
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometry constrained keypoints in real-time
CN112529917A (en) Three-dimensional target segmentation method, device, equipment and storage medium
CN117079125A (en) Kiwi fruit pollination flower identification method based on improved YOLOv5
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
Hu et al. Automatic detection of pecan fruits based on Faster RCNN with FPN in orchard

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination