CN113932712A - Melon and fruit vegetable size measuring method based on depth camera and key points - Google Patents

Melon and fruit vegetable size measuring method based on depth camera and key points

Info

Publication number
CN113932712A
Authority
CN
China
Prior art keywords
depth
camera
image
key point
vegetables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111189990.4A
Other languages
Chinese (zh)
Inventor
孙桂玲
郑博文
孟兆南
南瑞丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202111189990.4A priority Critical patent/CN113932712A/en
Publication of CN113932712A publication Critical patent/CN113932712A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B 11/00 Measuring arrangements characterised by the use of optical techniques
    • G01B 11/02 Measuring arrangements characterised by the use of optical techniques for measuring length, width or thickness
    • G01B 11/08 Measuring arrangements characterised by the use of optical techniques for measuring diameters
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Geometry (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing, and particularly relates to a non-contact method for measuring the size of melon and fruit vegetables based on a depth camera and key point detection. The method uses a binocular depth camera to acquire color and depth images of the vegetables. A key point detection network takes the color image as input, identifies the type of melon or fruit vegetable, and locates six key points: handle, top, left, bottom, right and center. A size calculation module fuses the positional relations and depth values of the key points to estimate the sizes of various fruits and vegetables. Experimental results show that the method accurately identifies 4 types of melon and fruit vegetables and measures their diameters and lengths with high precision. The invention provides a new approach to the non-contact measurement of vegetables using vision, and promotes the application of computer vision in agricultural automation.

Description

Melon and fruit vegetable size measuring method based on depth camera and key points
The invention discloses a melon and fruit vegetable size measuring method based on a depth camera and key points, and belongs to the technical field of image processing.
Agricultural automation greatly improves the efficiency of agricultural production through intelligent perception and automation technology. In recent years, computer vision and deep learning have been widely applied to many links of agricultural production, such as pest detection, weed identification, yield prediction and non-contact crop measurement. Through computer vision, producers can automatically acquire large numbers of parameters from images without touching the measured object. Compared with manual inspection, such methods offer a unified standard, high efficiency and good real-time performance. Non-contact crop measurement based on deep learning and computer vision is therefore an important research direction.
Existing vegetable and fruit size detection systems mainly rely on simple image processing methods such as edge detection and image masking. Such methods work well on round fruits but are difficult to apply to fruits with complex shapes. They also typically require the camera-to-fruit distance as prior information and are therefore only suitable for post-harvest statistics and classification. Some work that introduces depth values achieves size measurement of fruit still on the tree, but such methods usually handle only a single fruit and require complex equipment to deploy. Detection systems based on deep neural networks are mainly used to identify fruit types and detect lesions, not to measure fruit size.
In the field of machine vision, the depth camera is an important sensor, sometimes called the eye of the robot. Compared with an ordinary 2D camera, a depth camera can not only capture RGB images but also measure the depth of the imaged scene. An ordinary color camera records every object within its viewing angle but cannot capture the distance of those objects from the camera. A depth camera compensates for this defect: it records the distance from each point in the image to the camera, and combining that distance with the point's (x, y) coordinates in the 2D image yields the point's position in three-dimensional space. The RealSense D415 camera comprises an infrared laser projector, two infrared sensors and an RGB camera. Thanks to the infrared laser projector, RealSense binocular cameras can obtain accurate depth maps even in low-light environments. In addition, Intel provides a Python version of the RealSense software development kit, pyrealsense2, which greatly accelerates development.
Detectron2 is a target detection platform released by Facebook AI Research that contains state-of-the-art detection algorithms. Detectron2 is built on the PyTorch framework and offers an intuitive imperative programming model, letting researchers iterate on model designs and experiments more quickly. It supports multiple tasks, including target detection, instance segmentation, semantic segmentation, panoptic segmentation and pose detection, and includes advanced models such as Faster R-CNN, Mask R-CNN, RetinaNet, DensePose, Cascade R-CNN, Panoptic FPN and TensorMask. Detectron2 also adopts a modular design that lets users plug custom modules into any part of the detection system, making task development on the platform more flexible.
Since its release, Detectron2 has been widely adopted by researchers and applied across the field of target detection. The invention mainly uses the Keypoint R-CNN key point detection model of Detectron2. The model uses a Feature Pyramid Network (FPN) as its backbone to extract feature maps of different scales from the input image. The FPN is an advanced image feature extraction framework built on deep residual networks (ResNet). It is a pyramid-structured network: bottom-level features facilitate detection of small targets with simple features, while high-level features facilitate detection of large targets with complex features. Through a top-down pathway and lateral connections, the FPN compensates for the weak semantics of low-level features, improving detection accuracy while retaining the low-level features' advantage in detecting small targets.
The invention provides a melon and fruit vegetable size measuring method based on a depth camera and key points, which aims to solve the problem of automatic measurement of the size of melon and fruit vegetables under the non-contact condition by using visual information.
The invention is based on a RealSense depth camera and the Detectron2 target detection platform, and completes image acquisition, identification and size measurement of 4 kinds of melon and fruit vegetables when the distance between the target and the camera is unknown. Figure 1 shows the general structure of the invention. The color and depth maps of the target are captured and aligned by the RealSense camera. In the color image processing flow, the Detectron2 platform identifies the type of the target and detects six key points (handle, top, left, bottom, right and center) to obtain the pixel coordinates of each key point. In the depth map processing flow, the holes in the original depth map are filled by several filters. After the key points are identified, their depth information is obtained by querying the depth of the corresponding points in the depth map. Finally, the pixel coordinates and depths of the key points are fused to obtain the size of the target. The method can be divided into four modules: color and depth image acquisition, key point detection network, multi-scale target detection, and size calculation.
The measuring method uses a RealSense D415 camera to obtain RGB and depth images of the target. The RealSense D415 measures 99 mm × 20 mm × 23 mm, weighs about 75 g, and uses active IR stereo depth technology. The depth camera has a 65° × 40° field of view and a resolution of 1280 × 720 at frame rates up to 90 fps, with a detection range of 0.3 m to 10 m. The RGB camera has a 69° × 42° field of view, reaches 30 fps at a resolution of 1920 × 1080, has a focal length of 1.88 mm, and a sensor size of 2.73 mm × 1.55 mm.
Fig. 2 is a flow chart of acquiring and processing the color and depth maps. First, the resolution of both the color map and the depth map is set to 640 × 480. The invention does not use the maximum resolution of 1280 × 720 supported by the RealSense D415, because a smaller resolution allows a shorter minimum detection distance: at 1280 × 720 the depth camera cannot measure depth within 450 mm, while at 640 × 480 the minimum depth is 310 mm. When the resolution is too small, however, the imaging quality of the RGB image degrades, which hinders subsequent recognition. 640 × 480 is therefore a suitable resolution.
After the resolution is set, RGB and depth frames can be acquired. The lens positions of the RGB camera and the depth camera differ, so the resulting depth map and RGB map use different reference frames: the origin of the depth map is the infrared camera, while the origin of the RGB map is the RGB camera. The position of the detection target in the depth map and the color map is therefore inconsistent, and the two must be aligned before use. The RealSense camera aligns the depth map to the RGB map as follows: 2D points in the depth map are first deprojected into 3D space, and the 3D points are then projected onto the plane of the RGB camera. Finally, for each point in the RGB image, its depth is queried at the corresponding position in the depth map.
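This capture-and-align flow can be sketched with the pyrealsense2 toolkit roughly as follows. The 640 × 480 stream settings follow the text; the 30 fps frame rate and variable names are illustrative assumptions, not the patent's exact code.

```python
import numpy as np
import pyrealsense2 as rs

# Configure both streams at 640 x 480 as described above.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align depth frames to the RGB camera's reference frame.
align = rs.align(rs.stream.color)

frames = pipeline.wait_for_frames()
aligned = align.process(frames)
depth_frame = aligned.get_depth_frame()
color_frame = aligned.get_color_frame()

color_image = np.asanyarray(color_frame.get_data())
# Query the depth (in metres) of the pixel at (x, y) in the aligned maps.
d = depth_frame.get_distance(320, 240)
```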
Depth calculation using the left camera as the matching reference loses the depth of targets visible to the left camera but not to the right camera, so every frame lacks depth data at the left edge of the target object. Querying the depth of fruit edge key points would then return 0, so the depth map must be filled. The invention repairs the depth map with two filters from the pyrealsense2 toolkit: spatial_filter and hole_filling_filter.
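A minimal sketch of this repair step, assuming default filter options (the patent does not specify them); the two filter classes named in the text exist in pyrealsense2 as rs.spatial_filter and rs.hole_filling_filter.

```python
import pyrealsense2 as rs

spatial = rs.spatial_filter()            # edge-preserving depth smoothing
hole_filling = rs.hole_filling_filter()  # fills the missing left-edge depth

def repair_depth(depth_frame):
    """Smooth the depth frame, then fill the remaining holes."""
    frame = spatial.process(depth_frame)
    frame = hole_filling.process(frame)
    return frame.as_depth_frame()
```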
The invention realizes identification and key point extraction of melon and fruit vegetables with the Keypoint R-CNN provided by Detectron2. The structure of the network is shown in fig. 3. The model uses an FPN to extract feature maps of the input image at different scales. First, the backbone applies a stem block (7 × 7 kernel, stride 2) to the RGB three-channel image to extract preliminary features, outputting a 64-channel feature map. Four residual stages (res2, res3, res4, res5) then extract features at successively coarser scales. The final outputs of the FPN are P2 (1/4 scale), P3 (1/8 scale), P4 (1/16 scale), P5 (1/32 scale) and P6 (1/64 scale), each with 256 channels.
The five feature maps at different scales are then input into the Region Proposal Network (RPN). The RPN detects object regions in the multi-scale features and computes the objectness, i.e. the probability that a region contains an object, together with anchor boxes that indicate each region's location on the original image. The RPN finally passes the 1000 proposal boxes with the highest objectness to the ROI Heads.
The ROI Heads used in the invention comprise a Keypoint Head and a Box Head; their inputs are P2, P3, P4, P5 and the proposal boxes. The Keypoint Head first uses ROIAlign to obtain a fixed 14 × 14 feature map, derives a key point heat map through several convolution layers and one deconvolution layer, and finally computes the key point coordinates from the heat map. The Box Head first uses ROIAlign to obtain a fixed 7 × 7 feature map, which is then flattened and passed through several fully connected layers that output the position and classification of each box. Candidate boxes with classification scores below 0.6 are discarded, and the network finally outputs all target boxes and key points detected in the image.
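As a hedged configuration sketch (not the authors' released code), such a Keypoint R-CNN could be set up in Detectron2 as follows. The COCO keypoint base config, the 0.6 score threshold, the 4 classes and the 6 key points follow the text; the weights path and input image are illustrative assumptions.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4            # cucumber, eggplant, tomato, sweet pepper
cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = 6  # handle, top, left, bottom, right, center
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.6    # discard boxes scoring below 0.6
cfg.MODEL.WEIGHTS = "model_final.pth"          # assumed path to the trained weights

predictor = DefaultPredictor(cfg)
color_image = cv2.imread("vegetable.jpg")      # illustrative input (BGR)
instances = predictor(color_image)["instances"].to("cpu")

boxes = instances.pred_boxes.tensor.numpy()    # (N, 4) target boxes
keypoints = instances.pred_keypoints.numpy()   # (N, 6, 3): x, y, score per key point
classes = instances.pred_classes.numpy()       # (N,) predicted vegetable types
```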
The key point detection network extracts multi-scale image features and can in theory detect small targets in an image. In practice, however, if an image contains large background areas and the target to be detected is small, it is usually difficult to accurately predict a classification box containing the target and its key points, because small-target detection relies only on low-level feature maps and cannot reach high classification scores with a small training set. Moreover, in practical applications only the vegetable at the center of the field of view usually matters, so accurately predicting the type and key points of the central target is particularly important.
To solve this problem, the present invention proposes a multi-scale scaling detection module. Implementation details of multi-scale scaling detection are shown in fig. 4:
(1) Input the image to be detected and the amplification parameter α.
(2) Set the magnification M = 1 and the maximum score MS = 0.
(3) Execute the detection loop: amplify the original image by a factor of M with the Zoom-in method to obtain a new image Img. The Zoom-in method amplifies the original image M times using bilinear interpolation, then cuts a new image of the same size as the original out of the center of the magnified image.
(4) Feed Img into the pre-trained key point detection network to obtain the scores, boxes and key points of the targets contained in the image.
(5) If the sum of target scores is greater than MS, update the maximum score MS and record the current magnification M as BestM; otherwise, proceed directly to the next step.
(6) Update the magnification as M = M × α. If M > 5, the module jumps out of the detection loop to (7); otherwise it returns to (3).
(7) Use the Revert method to map the target boxes and key points on the amplified image back to the original image according to the magnification, obtaining the final outputs FinBoxes and FinPoints. The Revert method first computes the center point of the image from the size of the original image, then shrinks the distance between the coordinates in Boxes and Points and the center point by a factor of BestM (a sketch of this loop follows).
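The loop in steps (1)-(7) might be sketched as follows; `detect` stands for a wrapper around the pre-trained key point detection network, and OpenCV's bilinear resize is assumed for the Zoom-in step.

```python
import cv2
import numpy as np

def zoom_in(image, m):
    """Magnify by a factor of m with bilinear interpolation, then crop
    the centre back to the original size (the Zoom-in method)."""
    h, w = image.shape[:2]
    big = cv2.resize(image, (int(w * m), int(h * m)),
                     interpolation=cv2.INTER_LINEAR)
    y0, x0 = (big.shape[0] - h) // 2, (big.shape[1] - w) // 2
    return big[y0:y0 + h, x0:x0 + w]

def revert(boxes, points, best_m, shape):
    """The Revert method: shrink every coordinate's distance to the image
    centre by a factor of best_m to map it back onto the original image."""
    h, w = shape[:2]
    cx, cy = w / 2.0, h / 2.0
    fin_boxes = boxes.copy()
    fin_boxes[:, 0::2] = cx + (boxes[:, 0::2] - cx) / best_m   # x1, x2
    fin_boxes[:, 1::2] = cy + (boxes[:, 1::2] - cy) / best_m   # y1, y2
    fin_points = points.copy()
    fin_points[..., 0] = cx + (points[..., 0] - cx) / best_m
    fin_points[..., 1] = cy + (points[..., 1] - cy) / best_m
    return fin_boxes, fin_points

def multiscale_detect(image, detect, alpha=1.5):
    """Steps (2)-(7): try magnifications 1, alpha, alpha^2, ... up to 5,
    keep the scale whose detections score highest, then map the boxes
    and key points back to the original image."""
    m, best_m, max_score, best = 1.0, 1.0, 0.0, None
    while m <= 5.0:
        boxes, points, scores = detect(zoom_in(image, m))
        if scores.sum() > max_score:                # step (5)
            max_score, best_m, best = float(scores.sum()), m, (boxes, points)
        m *= alpha                                  # step (6): M = M * alpha
    if best is None:                                # nothing detected at any scale
        return np.empty((0, 4)), np.empty((0, 6, 3))
    return revert(best[0], best[1], best_m, image.shape)
```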
The pixel coordinates of the key points on the color image can be obtained through the key point detection network. The distance from the point to the plane where the camera is located can be obtained by searching the depth map. The idea of calculating the distance in the three-dimensional space according to the key points and the depth map is as follows: firstly, calculating the pixel distance between two points in a color map, then mapping the pixel distance to a plane where a target is located according to the depth of a key point in the center of the target, and simultaneously compensating the distance error caused by the depth difference between different key points by using a mathematical method.
Let A and B denote two key points in the color image, with pixel coordinates $(x_a, y_a)$ and $(x_b, y_b)$ respectively. The pixel distance between the two points is then:

$$d_{pixel} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$
An image with a resolution of 640 × 480 corresponds to an area of 2.07 mm × 1.55 mm on the actual image sensor. The actual distance between the two points on the image sensor is therefore:

$$d_s = \frac{2.07\,\text{mm}}{640}\, d_{pixel}$$
Fig. 5 shows the principle of calculating the actual distance from the distance between two points on the image sensor, where O is the optical center of the RGB camera, A″ and B″ are two points in space, and A′ and B′ are the projections of the two key points on the image sensor. f is the focal length of the RGB camera and d is the distance from the target to the plane of the camera. From the similarity of triangles OA′B′ and OA″B″, the actual distance between the two points is:

$$D_{actual} = \frac{d}{f}\, d_s$$
Vegetables are three-dimensional objects with thickness, and the edge key points usually lie deeper than the center key point, so this depth difference must be compensated when calculating the target diameter and length. Fortunately, the cross-section of melon and fruit vegetables is usually circular, i.e. the depth difference between the center and the edge points equals the radius. The invention takes the distance between the left and right key points as the diameter D of the vegetable; let $D_s$ be the distance between the projections of the left and right key points on the image sensor and d the depth of the center key point. Then:

$$\frac{D_s}{f} = \frac{D}{d + D/2}$$

and, solving for the diameter:

$$D = \frac{2 D_s d}{2f - D_s}$$
The invention takes the distance between the top and bottom key points as the length L of the vegetable, with $L_s$ the distance between the projections of the top and bottom key points on the image sensor. By the same principle:

$$L = \frac{L_s\,(2d + D)}{2f}$$
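Putting the formulas together, a minimal sketch of the size calculation module: the focal length (1.88 mm) and the 2.07 mm / 640 pixel pitch come from the camera parameters above, and all quantities are in millimetres; function names are illustrative.

```python
import math

F_MM = 1.88                # RGB camera focal length (mm)
MM_PER_PIXEL = 2.07 / 640  # sensor width (mm) over horizontal resolution

def sensor_distance(p, q):
    """Distance between the projections of two key points on the sensor (mm)."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) * MM_PER_PIXEL

def vegetable_size(left, right, top, bottom, d):
    """Diameter D and length L from key point pixel coordinates and the
    depth d (mm) of the centre key point, compensating the radius offset."""
    ds = sensor_distance(left, right)
    ls = sensor_distance(top, bottom)
    diameter = 2 * ds * d / (2 * F_MM - ds)        # D = 2*Ds*d / (2f - Ds)
    length = ls * (2 * d + diameter) / (2 * F_MM)  # L = Ls*(2d + D) / (2f)
    return diameter, length
```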
Compared with the prior art, the invention has the following advantages and positive effects:
1. The invention provides a non-contact size measurement method for melon and fruit vegetables based on a depth camera and a key point detection network. The method automatically completes the identification and size measurement of 4 kinds of melon and fruit vegetables using only a RealSense depth camera and a processor; it is easy to deploy and low in cost. In addition, the system never touches the target being measured, so vegetables and fruits can be measured completely non-destructively;
2. The invention introduces depth information into vision-based melon, fruit and vegetable size measurement, so the target size can be measured adaptively without the target-to-camera distance as prior information. The measuring system has high precision within 60 cm and fully meets the requirements of melon and fruit size classification. Furthermore, the invention can serve as the vision system of robots that automatically pick and classify melons, fruits and vegetables.
3. The invention introduces a key point detection network into non-contact size measurement of melon and fruit vegetables. Methods based on deep neural networks improve in detection precision as the training set grows, so the proposed method can keep improving: as the training set is enlarged, the measurement accuracy of vegetable size will rise substantially.
Fig. 1 is a general structure diagram of a melon, fruit and vegetable size measurement method based on a depth camera and key points, which is provided by the invention;
FIG. 2 is a flow chart of the present invention for acquiring and processing color and depth maps;
FIG. 3 is a diagram of a key point detection network used by the present invention;
FIG. 4 is a flow chart of the multi-scale scaling detection module proposed by the invention;
FIG. 5 is a schematic diagram of calculating the actual distance from the distance between two points on an image sensor.
Embodiments and advantages of the present invention will be apparent from the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
Figure 1 shows the general structure of the invention. The color and depth maps of the target are captured and aligned by the RealSense camera. In the color image processing flow, the Detectron2 platform identifies the type of the target and detects six key points (handle, top, left, bottom, right and center) to obtain the pixel coordinates of each key point. In the depth map processing flow, the holes in the original depth map are filled by several filters. After the key points are identified, their depth information is obtained by querying the depth of the corresponding points in the depth map. Finally, the pixel coordinates and depths of the key points are fused to obtain the size of the target. The method can be divided into four modules: color and depth image acquisition, key point detection network, multi-scale target detection, and size calculation.
Fig. 2 is a flow chart of acquiring and processing the color and depth maps. First, the resolution of both the color map and the depth map is set to 640 × 480, after which RGB and depth frames can be acquired. Because the lens positions of the RGB camera and the depth camera differ, the depth map and the RGB map use different reference frames: the origin of the depth map is the infrared camera, while the origin of the RGB map is the RGB camera. The position of the detection target in the depth map and the color map is therefore inconsistent, and the two must be aligned before use. Finally, for each point in the RGB image, its depth is queried at the corresponding position in the depth map.
Depth calculation using the left camera as the matching reference loses the depth of targets visible to the left camera but not to the right camera, so every frame lacks depth data at the left edge of the target object. Querying the depth of vegetable edge key points would then return 0, so the depth map must be filled. The depth map is repaired with two filters from the pyrealsense2 toolkit: spatial_filter and hole_filling_filter.
The identification and key point extraction of melon and fruit vegetables are realized with the Keypoint R-CNN provided by Detectron2, and a multi-scale scaling detection module is used during identification and key point extraction. The implementation details of multi-scale scaling detection are shown in fig. 4; the flow is as follows:
(1) Input the image to be detected and the amplification parameter α.
(2) Set the magnification M = 1 and the maximum score MS = 0.
(3) Execute the detection loop: amplify the original image by a factor of M with the Zoom-in method to obtain a new image Img. The Zoom-in method amplifies the original image M times using bilinear interpolation, then cuts a new image of the same size as the original out of the center of the magnified image.
(4) Feed Img into the pre-trained key point detection network to obtain the scores, boxes and key points of the targets contained in the image.
(5) If the sum of target scores is greater than MS, update the maximum score MS and record the current magnification M as BestM; otherwise, proceed directly to the next step.
(6) Update the magnification as M = M × α. If M > 5, the module jumps out of the detection loop to (7); otherwise it returns to (3).
(7) Use the Revert method to map the target boxes and key points on the amplified image back to the original image according to the magnification, obtaining the final outputs FinBoxes and FinPoints. The Revert method first computes the center point of the image from the size of the original image, then shrinks the distance between the coordinates in Boxes and Points and the center point by a factor of BestM.
The pixel coordinates of the key points on the color image can be obtained through the key point detection network. The distance from the point to the plane where the camera is located can be obtained by searching the depth map. The idea of calculating the distance in the three-dimensional space according to the key points and the depth map is as follows: firstly, calculating the pixel distance between two points in a color map, then mapping the pixel distance to a plane where a target is located according to the depth of a key point in the center of the target, and simultaneously compensating the distance error caused by the depth difference between different key points by using a mathematical method.
Let A and B denote two key points in the color image, with pixel coordinates $(x_a, y_a)$ and $(x_b, y_b)$ respectively. The pixel distance between the two points is then:

$$d_{pixel} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$
An image with a resolution of 640 × 480 corresponds to an area of 2.07 mm × 1.55 mm on the actual image sensor. The actual distance between the two points on the image sensor is therefore:

$$d_s = \frac{2.07\,\text{mm}}{640}\, d_{pixel}$$
Fig. 5 shows the principle of calculating the actual distance from the distance between two points on the image sensor, where O is the optical center of the RGB camera, A″ and B″ are two points in space, and A′ and B′ are the projections of the two key points on the image sensor. f is the focal length of the RGB camera and d is the distance from the target to the plane of the camera. From the similarity of triangles OA′B′ and OA″B″, the actual distance between the two points is:

$$D_{actual} = \frac{d}{f}\, d_s$$
Vegetables are three-dimensional objects with thickness, and the edge key points usually lie deeper than the center key point, so this depth difference must be compensated when calculating the target diameter and length. Fortunately, the cross-section of melon and fruit vegetables is usually circular, i.e. the depth difference between the center and the edge points equals the radius. The invention takes the distance between the left and right key points as the diameter D of the vegetable; let $D_s$ be the distance between the projections of the left and right key points on the image sensor and d the depth of the center key point. Then:

$$\frac{D_s}{f} = \frac{D}{d + D/2}$$

and, solving for the diameter:

$$D = \frac{2 D_s d}{2f - D_s}$$
The invention takes the distance between the top and bottom key points as the length L of the vegetable, with $L_s$ the distance between the projections of the top and bottom key points on the image sensor. By the same principle:

$$L = \frac{L_s\,(2d + D)}{2f}$$
the hardware configuration of the simulation experiment computer is as follows: intel (R) Xeon (R) W-2145@3, 70 GHzCPU; 64.0GB DDR4 memory; NVIDIA Quadro RTX4000 GPU.
The simulation experiment software of the invention is configured as follows: ubuntu 20.04 operating system, simulation language Python.
The data set used in the simulation experiments is a self-collected and annotated key point data set of melon and fruit vegetables. The data set has 1600 pictures covering four common melon and fruit vegetables: cucumber, eggplant, tomato and sweet pepper. Each category contains 400 pictures, of which 320 form the training set and 80 the test set. Each picture contains one object, and each object is annotated with a box and six key points: handle, top, left, bottom, right and center. Handle is a point on the stalk of the vegetable; it is reserved for automatic picking of melons and fruits and is not used in the size prediction method. Top is the point on the fruit closest to the stalk, and Bottom is the point on the fruit farthest from the stalk. Left and Right are the left and right endpoints of the widest line segment perpendicular to the fruit's central axis. Center is the visual center point of the fruit. The invention calls the distance between Top and Bottom the length and the distance between Left and Right the diameter. The annotation file is in standard COCO format; COCO (Common Objects in Context) is a data set published by Microsoft for image recognition.
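For illustration, one annotation in such a COCO-format file might look like the following sketch; the numeric values are invented, but the (x, y, visibility) triplet layout and the six-key-point order follow the description above.

```python
annotation = {
    "image_id": 1,
    "category_id": 3,                      # e.g. tomato (invented id)
    "bbox": [210.0, 140.0, 180.0, 175.0],  # x, y, width, height
    "num_keypoints": 6,
    "keypoints": [                         # x, y, visibility triplets
        300.0, 130.0, 2,                   # handle
        300.0, 150.0, 2,                   # top
        212.0, 230.0, 2,                   # left
        300.0, 312.0, 2,                   # bottom
        388.0, 230.0, 2,                   # right
        300.0, 230.0, 2,                   # center
    ],
}
```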
The dimensional measurement performance of the invention was verified using vegetable simulation models. Three different models were used for each vegetable; Table 1 lists the standard parameters of each model, measured with a vernier caliper.
Table 1 testing parameters of vegetable models
[Table 1 is reproduced as an image in the original publication.]
TABLE 2 relative error of measurement of four vegetable sizes
[Table 2 is reproduced as an image in the original publication.]
Table 2 shows the relative errors of the diameter and length predictions for the four melon and fruit vegetables at different depths. The target size estimation performance of the invention was evaluated using the Mean Relative Error (MRE). The diameter MRE is calculated as:

$$MRE_D = \frac{1}{TP} \sum \frac{|D_p - D_r|}{D_r} \times 100\%$$
where TP is the number of successfully identified vegetables of the category, $D_p$ is the diameter predicted by the invention, and $D_r$ is the actual diameter of the vegetable model. The length MRE is calculated as:

$$MRE_L = \frac{1}{TP} \sum \frac{|L_p - L_r|}{L_r} \times 100\%$$
where $L_p$ is the length predicted by the invention and $L_r$ is the actual length of the target model. The method predicts tomato size most accurately and cucumber size worst, and the mean relative error of the predicted size increases with depth. Since the type recognition capability of the invention drops rapidly at depths beyond 100 cm, size measurement there is of little significance, so size estimation performance was tested only within 100 cm. Table 2 shows that the method works very well for tomato size estimation: at a depth of 40 cm, the relative error of the diameter estimates for the three targets is only 2%; with a true diameter of about 80 mm, the error is under 2 mm. At 100 cm the relative error stays below 8%, an absolute error of only 6 mm. This accuracy is sufficient as a reference index for size-based classification of individual fruits. The size estimation error for the other three targets is slightly larger, because the key points of irregularly shaped vegetables are harder to identify than those of regular targets such as tomatoes. Nevertheless, within 60 cm the relative error of the size estimates for all four melon and fruit vegetables stays within 8%, which is sufficient to guide individual fruit size classification.
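A minimal sketch of the MRE metric as defined above, computed over the TP successfully identified targets of one category; the sample values are illustrative.

```python
import numpy as np

def mean_relative_error(predicted, actual):
    """MRE = (1/TP) * sum(|pred - true| / true) * 100%, over the TP
    successfully identified targets of one vegetable category."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 100.0 * float(np.mean(np.abs(predicted - actual) / actual))

# Example: three diameter predictions against an 80 mm model.
print(mean_relative_error([81.5, 78.9, 82.0], [80.0, 80.0, 80.0]))
```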
The size estimation performance of the invention also tends to decrease with increasing depth, and the degradation is particularly pronounced for irregularly shaped targets. This problem can be mitigated with a higher-definition camera. The data set also has a large impact on key point identification: providing a more diverse training set would increase key point identification accuracy.
The invention accurately identifies the types of 4 melon and fruit vegetables within 100 cm and measures target size with high precision within 60 cm. In addition, the system never touches the target during operation, achieving completely non-destructive measurement of vegetables and fruits. The invention has broad application prospects in automatic picking and classification of vegetables and promotes the application of deep learning and machine vision in smart agriculture.

Claims (1)

1. A melon and fruit vegetable size measuring method based on a depth camera and key points. The method is characterized in that:
(1) Based on a RealSense depth camera and the Detectron2 target detection platform, the method completes image acquisition, type identification and size measurement of 4 kinds of melon and fruit vegetables when the distance between the target and the camera is unknown.
(2) The color and depth maps of the target were captured and aligned by a RealSense binocular depth camera with an image resolution of 640 x 480, and then fed into different processing flows.
(3) In the color image processing flow, a Detectron2 platform is used for identifying the type of the target, six key points of a handle, a top, a left, a bottom, a right and a center are detected, and the pixel coordinate of each key point is obtained.
(4) In order to improve the prediction accuracy of the central target of the visual field, a scaling module is introduced in the identification process. The specific operation is as follows: before each recognition, the original image is amplified in proportion, a new image with the same size as the original image is cut out from the center of the amplified image and is sent to a recognition network, and after the recognition is finished, the coordinates of the key points are converted to the original image according to the scaling.
(5) In the depth map processing flow, the original depth map with the holes is filled by a plurality of filters.
(6) After the key point detection network obtains the pixel coordinates of a key point, the distance between the point and the plane of the camera can be obtained by querying the depth map. With the camera focal length and the image sensor width as constants, the diameter and length of the vegetable are obtained from the distances between key points in the image and the depth of the central key point; the diameter is the distance between the left and right key points, and the length is the distance between the top and bottom key points.
CN202111189990.4A 2021-10-13 2021-10-13 Melon and fruit vegetable size measuring method based on depth camera and key points Pending CN113932712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111189990.4A CN113932712A (en) 2021-10-13 2021-10-13 Melon and fruit vegetable size measuring method based on depth camera and key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111189990.4A CN113932712A (en) 2021-10-13 2021-10-13 Melon and fruit vegetable size measuring method based on depth camera and key points

Publications (1)

Publication Number Publication Date
CN113932712A true CN113932712A (en) 2022-01-14

Family

ID=79278597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111189990.4A Pending CN113932712A (en) 2021-10-13 2021-10-13 Melon and fruit vegetable size measuring method based on depth camera and key points

Country Status (1)

Country Link
CN (1) CN113932712A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114902872A (en) * 2022-04-26 2022-08-16 华南理工大学 Visual guidance method for picking fruits by robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination