CN114581523A - Method and device for determining labeling data for monocular 3D target detection - Google Patents


Info

Publication number
CN114581523A
Authority
CN
China
Prior art keywords
frame information
data
frame
information
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210212889.4A
Other languages
Chinese (zh)
Inventor
安耀祖
许新玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd filed Critical Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202210212889.4A
Publication of CN114581523A
Legal status: Pending

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T7/00: Image analysis
                    • G06T7/70: Determining position or orientation of objects or cameras
                        • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
                • G06T19/00: Manipulating 3D models or images for computer graphics
                    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
                • G06T2200/00: Indexing scheme for image data processing or generation, in general
                    • G06T2200/04: Indexing scheme for image data processing or generation, in general, involving 3D image data
                • G06T2207/00: Indexing scheme for image analysis or image enhancement
                    • G06T2207/10: Image acquisition modality
                        • G06T2207/10028: Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention discloses a method and a device for determining labeling data for monocular 3D target detection, and relates to the technical field of computers. One embodiment of the method comprises: receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera; processing the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; processing the plurality of image data by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and calculating the intersection ratio (intersection-over-union) between the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the labeling data according to the intersection ratio and the first threshold. The embodiment reduces the cost of determining the annotation data and can determine large batches of annotation data quickly and efficiently.

Description

Method and device for determining labeling data for monocular 3D target detection
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for determining labeling data for monocular 3D target detection.
Background
Monocular 3D Object Detection is one of the most widely used computer vision technologies and can be applied to fields such as automatic driving systems for vehicles, intelligent robots, and intelligent transportation. In the field of autonomous driving perception in particular, compared with 3D object detection based on a multi-line laser radar, monocular detection has the significant advantages of dense information and low cost.
Performing monocular 3D target detection requires a large amount of annotation data describing attribute information (such as length, width, height, position, etc.) of an annotated object in different scenes. Based on a 2D image alone, only two-dimensional information of the annotated object can be acquired, such as its length, width, and position in the image coordinate system; it is difficult to determine the three-dimensional information of the annotated object in the real-world coordinate system, such as the real length, width, height, deflection (yaw) angle, and distance of the object.
The related art has at least the following problems:
in the related art, annotation data determined from 2D images alone contain only the two-dimensional information of the annotated object, so the accuracy of the annotation data is low and it is difficult to obtain annotation data in large quantities.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for determining annotation data for monocular 3D target detection, which can combine three-dimensional information and two-dimensional information of an annotation object according to point cloud data acquired by a laser radar and image data acquired by a camera, reduce the cost for determining the annotation data, and can determine large batches of annotation data quickly and efficiently.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an annotation data determination method for monocular 3D object detection, including:
receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera;
processing the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of the tagged object;
processing the plurality of image data by using the 2D target detection model to obtain a plurality of second 2D frame information;
and calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the labeling data according to the intersection ratio and the first threshold.
Further, sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of first 2D frame information, which may include:
respectively converting the plurality of 3D detection results to obtain a plurality of 3D frame information under a camera coordinate system; wherein the 3D frame information indicates three-dimensional information of the tagged object;
and mapping the plurality of pieces of 3D frame information to obtain a plurality of pieces of first 2D frame information corresponding to the plurality of pieces of 3D frame information.
Further, determining the annotation data according to the intersection ratio and the first threshold, which may further include:
judging whether the intersection ratio is smaller than a first threshold value;
if so, deleting the 3D frame information corresponding to the first 2D frame information, reserving the second 2D frame information, and taking the second 2D frame information as the marking data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, reserving the 3D frame information corresponding to the matched first 2D frame information and the unmatched second 2D frame information, and taking the reserved information as the marking data.
Further, if the intersection ratio is greater than or equal to the first threshold, the method further includes:
determining a 3D projection center point corresponding to the 3D frame information in the reserved information and a 2D frame center point corresponding to the second 2D frame information in the reserved information;
calculating the Euclidean distance between the 3D projection center point and the 2D frame center point, and calculating the frame diagonal distance of second 2D frame information in the reserved information;
calculating the labeling error of the labeled object according to the Euclidean distance and the frame diagonal distance;
and updating the labeling data according to the labeling error and the second threshold value.
Further, the reserved information comprises one or more marked objects; if the number of the marked objects is one, the step of calculating the marking error of the image according to the Euclidean distance and the frame diagonal distance comprises the following steps:
calculating a normalized distance between the 3D projection center point and the 2D frame center point according to the Euclidean distance and the frame diagonal distance;
and calculating the labeling error according to the normalized distance.
Further, if the number of the labeled objects is multiple, the step of calculating the labeling error of the image according to the euclidean distance and the frame diagonal distance includes:
calculating the average value of the normalized distance between the 3D projection center point and the 2D frame center point according to the marked quantity, the Euclidean distance and the frame diagonal distance;
and calculating the labeling error according to the average value of the normalized distance and the labeling quantity.
According to another aspect of the embodiments of the present invention, there is provided an annotation data determination apparatus for monocular 3D object detection, including:
a data receiving module to: receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera;
a point cloud data processing module for: processing the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of the tagged object;
an image data processing module to: processing the plurality of image data by using the 2D target detection model to obtain a plurality of second 2D frame information;
an annotation data determination module to: and calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the labeling data according to the intersection ratio and the first threshold.
Further, the annotation data determination module is further configured to:
judging whether the intersection ratio is smaller than a first threshold value;
if so, deleting the 3D frame information corresponding to the first 2D frame information, reserving the second 2D frame information, and taking the second 2D frame information as the marking data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, reserving the 3D frame information corresponding to the matched first 2D frame information and the second 2D frame information which is not matched, and taking the reserved information as the marking data.
According to another aspect of the embodiments of the present invention, there is provided an electronic device for annotation data determination for monocular 3D object detection, including:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, causing the one or more processors to implement any one of the annotation data determination methods for monocular 3D object detection described above.
According to another aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the above mentioned annotation data determination methods for monocular 3D object detection.
One embodiment of the above invention has the following advantages or benefits: a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera are received; the plurality of 3D point cloud data are processed by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and the plurality of 3D detection results are sequentially subjected to conversion processing and mapping processing to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; the plurality of image data are processed by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and the intersection ratios between the first 2D frame information and the second 2D frame information are calculated, and the annotation data are determined according to the intersection ratio and the first threshold. These technical means overcome the technical problem in the related art that annotation data determined only from 2D images contain only two-dimensional information of the annotated object, so that the accuracy of the annotation data is low and large batches of annotation data are difficult to obtain. As a result, by combining the three-dimensional and two-dimensional information of the annotated object on the basis of the point cloud data acquired by the laser radar and the image data acquired by the camera, the cost of determining annotation data is reduced and large batches of annotation data can be determined quickly and efficiently.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic diagram of a main flow of an annotation data determination method for monocular 3D object detection according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main flow of an annotation data determination method for monocular 3D object detection according to yet another embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of an annotation data determination apparatus for monocular 3D object detection according to another embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of an annotation data determination method for monocular 3D object detection according to an embodiment of the present invention; as shown in fig. 1, a method for determining labeling data for monocular 3D target detection provided in an embodiment of the present invention mainly includes:
step S101, receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera.
Specifically, calibration parameters and conversion matrix parameters between vehicle-mounted sensors can be preset, and point cloud data and image data of the surrounding environment are synchronously acquired (with synchronized acquisition timestamps) by the laser radar and the camera while the vehicle is running. Although the point cloud collected by a laser radar with fewer beams is sparse, and a distant annotated object may receive no scanning points at all, combining it with the image data collected by the camera improves the accuracy of the determined annotation data while keeping the cost low.
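As an illustration of the synchronous acquisition described above, the following minimal Python sketch pairs lidar sweeps with the camera frames closest in time; the timestamp lists, the helper name, and the tolerance value are hypothetical and not part of the disclosed embodiment.

    from bisect import bisect_left

    def pair_by_timestamp(cloud_stamps, image_stamps, max_gap=0.05):
        """Pair each lidar sweep with the camera frame closest in time (hypothetical helper).

        cloud_stamps, image_stamps: sorted acquisition timestamps in seconds.
        max_gap: largest time offset still treated as "synchronous".
        Returns a list of (cloud_index, image_index) pairs.
        """
        pairs = []
        for ci, t in enumerate(cloud_stamps):
            j = bisect_left(image_stamps, t)
            # candidate image frames just before and just after the lidar timestamp
            candidates = [k for k in (j - 1, j) if 0 <= k < len(image_stamps)]
            if not candidates:
                continue
            best = min(candidates, key=lambda k: abs(image_stamps[k] - t))
            if abs(image_stamps[best] - t) <= max_gap:
                pairs.append((ci, best))
        return pairs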
Step S102, processing a plurality of 3D point cloud data by using a 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of the tagged object, which is further determined based on the 3D point cloud data.
Specifically, according to the embodiment of the invention, all 3D point cloud data can be processed by an existing, mature 3D point cloud detection model with high detection accuracy (such as PointPillars) to obtain a 3D detection result corresponding to each point cloud frame, where the 3D detection result indicates the three-dimensional information of the annotated object in the point cloud coordinate system. The 3D detection result is then sequentially subjected to conversion processing and mapping processing to obtain the first 2D frame information, which indicates the two-dimensional information of the annotated object in the camera coordinate system, so that it can be combined with the two-dimensional information of the annotated object obtained from the image data (the second 2D frame information) to determine annotation data with higher accuracy.
Further, according to an embodiment of the present invention, the sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information includes:
respectively converting the plurality of 3D detection results to obtain a plurality of 3D frame information under a camera coordinate system; wherein the 3D frame information indicates three-dimensional information of the tagged object;
and mapping the plurality of pieces of 3D frame information to obtain a plurality of pieces of first 2D frame information corresponding to the plurality of pieces of 3D frame information.
Through the arrangement, the 3D detection results under the point cloud coordinate systems are respectively converted to obtain 3D frame information under the camera coordinate system, and the 3D frame information is used for indicating the three-dimensional information of the marked object. And then mapping the 3D frame information to obtain the two-dimensional information of the annotation object under the camera coordinate system.
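The conversion and mapping described above can be sketched as follows. The sketch assumes a 4x4 lidar-to-camera extrinsic matrix T_cam_lidar and a 3x3 camera intrinsic matrix K taken from the preset calibration parameters, together with a (center, size, yaw) box parameterization as produced by common 3D detectors; these assumptions are illustrative and not mandated by the embodiment.

    import numpy as np

    def box3d_corners(center, size, yaw):
        """Eight corners of a 3D box given its center (x, y, z), size (l, w, h) and yaw about z."""
        l, w, h = size
        x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
        y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
        z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
        rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                        [np.sin(yaw),  np.cos(yaw), 0.0],
                        [0.0,          0.0,         1.0]])
        return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

    def project_box_to_image(center, size, yaw, T_cam_lidar, K):
        """Convert a lidar-frame 3D detection to the camera frame and map it to a 2D box."""
        corners = box3d_corners(center, size, yaw)                 # 3 x 8, lidar coordinates
        corners_h = np.vstack([corners, np.ones((1, 8))])          # homogeneous coordinates
        cam = (T_cam_lidar @ corners_h)[:3]                        # 3D frame info, camera frame
        uv = K @ cam
        uv = uv[:2] / uv[2:3]                                      # perspective division -> pixels
        x_min, y_min = uv.min(axis=1)
        x_max, y_max = uv.max(axis=1)
        return (float(x_min), float(y_min), float(x_max), float(y_max))  # first 2D frame info

A production pipeline would additionally discard boxes that lie behind the camera or entirely outside the image before taking the enclosing rectangle.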
Step S103, processing the plurality of image data by using the 2D object detection model to obtain a plurality of second 2D frame information.
Specifically, according to the embodiment of the present invention, the second 2D frame information may be obtained by processing the plurality of image data collected by the camera with an existing 2D target detection model (e.g., SSD, YOLO, etc.). The intersection ratio of the annotated object between the second 2D frame information obtained from the image data and the first 2D frame information obtained from the 3D point cloud data is then computed and compared with the first threshold, so that large batches of highly accurate annotation data can be determined quickly for monocular 3D target detection. The second 2D frame information indicates the two-dimensional information of the annotated object determined based on the image data.
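As a concrete illustration of this step, the sketch below uses an off-the-shelf SSD detector from torchvision to produce second 2D frame information; the choice of torchvision, the score threshold, and the helper name are assumptions made purely for illustration, since the embodiment only requires some existing 2D target detection model.

    import torch
    import torchvision

    # A pretrained SSD300 as one example of an "existing 2D target detection model"
    model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
    model.eval()

    def detect_2d(image_tensor, score_thresh=0.5):
        """image_tensor: float tensor of shape (3, H, W) with values in [0, 1]."""
        with torch.no_grad():
            output = model([image_tensor])[0]
        keep = output["scores"] >= score_thresh
        # each retained box is (x_min, y_min, x_max, y_max) in pixel coordinates
        return output["boxes"][keep].tolist()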
And step S104, calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the marking data according to the intersection ratio and the first threshold.
Specifically, according to the embodiment of the present invention, determining the annotation data according to the intersection ratio and the first threshold may further include:
judging whether the intersection ratio is smaller than a first threshold value;
if so, deleting the 3D frame information corresponding to the first 2D frame information, reserving the second 2D frame information, and taking the second 2D frame information as the marking data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, reserving the 3D frame information corresponding to the matched first 2D frame information and the second 2D frame information which is not matched, and taking the reserved information as the marking data.
With the above setting, the intersection ratio is compared with the first threshold to determine the two types of annotation data. According to a specific implementation of the embodiment of the present invention, the first threshold may be set to 0.5 (the first threshold takes a value between 0 and 1). That is, if at least half of the two-dimensional information displayed by the annotated object in the first 2D frame information is successfully matched with the two-dimensional information displayed in the second 2D frame information, the 3D frame information corresponding to the successfully matched first 2D frame information is retained together with the second 2D frame information that was not matched, and the retained information is used as the annotation data; if less than half is successfully matched, only the second 2D frame information is retained and used as the annotation data. It should be noted that this setting of the first threshold is only an example; different first thresholds may be set according to the accuracy actually required and the amount of annotation data to be determined. A higher first threshold may be set if higher accuracy is required, and a lower first threshold may be set if more annotation data is to be acquired.
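A minimal sketch of the intersection ratio (intersection-over-union) computation used in this comparison is given below; the axis-aligned (x_min, y_min, x_max, y_max) box representation and the constant name are assumptions.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    FIRST_THRESHOLD = 0.5  # example value from the description; adjust for accuracy vs. volume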
Further, according to the embodiment of the present invention, if the intersection ratio is greater than or equal to the first threshold, the method may further include:
determining a 3D projection center point corresponding to the 3D frame information in the reserved information and a 2D frame center point corresponding to the second 2D frame information in the reserved information;
calculating the Euclidean distance between the 3D projection center point and the 2D frame center point, and calculating the frame diagonal distance of second 2D frame information in the reserved information;
calculating the labeling error of the labeled object according to the Euclidean distance and the frame diagonal distance;
and updating the labeling data according to the labeling error and the second threshold value.
With the above setting, for the case where both the first 2D frame information and the second 2D frame information are retained, a labeling error needs to be calculated in order to improve the accuracy of the determined annotation data, and the data are cleaned using the labeling error and the second threshold. Specifically, the 3D projection center point of the 3D frame information corresponding to the first 2D frame information in the retained information and the 2D frame center point corresponding to the second 2D frame information in the retained information are determined; the Euclidean distance between the two center points and the frame diagonal distance of the second 2D frame information are calculated; the labeling error of the annotated object is calculated from the Euclidean distance and the frame diagonal distance; and the annotation data are updated according to the labeling error and the second threshold.
Illustratively, according to the embodiment of the present invention, the reserved information includes one or more tagged objects; if the number of the marked objects is one, the step of calculating the marking error of the image according to the Euclidean distance and the frame diagonal distance comprises the following steps:
calculating a normalized distance between the 3D projection center point and the 2D frame center point according to the Euclidean distance and the frame diagonal distance; and calculating the labeling error according to the normalized distance.
Preferably, according to an embodiment of the present invention, if the number of the objects to be labeled is plural, the step of calculating the labeling error of the image according to the euclidean distance and the frame diagonal distance may include:
calculating the average value of the normalized distance between the 3D projection center point and the 2D frame center point according to the marked quantity, the Euclidean distance and the frame diagonal distance;
and calculating the labeling error according to the average value of the normalized distance and the labeling quantity.
It can be understood that the more annotated objects the retained information contains, the more accurate the finally determined labeling error obtained by averaging the normalized distances, and hence the higher the accuracy of the determined annotation data.
According to the technical scheme of the embodiment of the invention, a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera are received; the plurality of 3D point cloud data are processed by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and the plurality of 3D detection results are sequentially subjected to conversion processing and mapping processing to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; the plurality of image data are processed by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and the intersection ratios between the first 2D frame information and the second 2D frame information are calculated, and the annotation data are determined according to the intersection ratio and the first threshold. These technical means overcome the technical problem in the related art that annotation data containing only the two-dimensional information of the annotated object have low accuracy and are difficult to obtain in large batches. By combining the three-dimensional and two-dimensional information of the annotated object on the basis of the point cloud data acquired by the laser radar and the image data acquired by the camera, the cost of determining annotation data is reduced and large batches of annotation data can be determined quickly and efficiently.
FIG. 2 is a schematic diagram of a main flow of an annotation data determination method for monocular 3D object detection according to yet another embodiment of the present invention; as shown in fig. 2, the method for determining labeling data for monocular 3D target detection provided in the embodiment of the present invention mainly includes:
step S201, receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera.
Specifically, calibration parameters and conversion matrix parameters between vehicle-mounted sensors can be preset, and point cloud data and image data of the surrounding environment are synchronously acquired (acquisition timestamp synchronization) by using a laser radar and a camera in the vehicle running process.
Step S202, processing the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results.
Specifically, according to the embodiment of the invention, all 3D point cloud data can be processed by an existing, mature 3D point cloud detection model with high detection accuracy (such as PointPillars) to obtain a 3D detection result corresponding to each point cloud frame, where the 3D detection result indicates the three-dimensional information of the annotated object in the point cloud coordinate system.
Step S203, respectively carrying out conversion processing on a plurality of 3D detection results to obtain a plurality of 3D frame information under a camera coordinate system; wherein the 3D frame information indicates three-dimensional information of the tagged object; and mapping the plurality of pieces of 3D frame information to obtain a plurality of pieces of first 2D frame information corresponding to the plurality of pieces of 3D frame information.
The first 2D frame information, obtained by sequentially performing conversion processing and mapping processing on the 3D detection results, indicates the two-dimensional information of the annotated object in the camera coordinate system, so that it can be combined with the two-dimensional information of the annotated object obtained from the image data (the second 2D frame information) to determine annotation data with high accuracy.
Step S204, processing the image data by using the 2D target detection model to obtain a plurality of second 2D frame information.
Specifically, according to the embodiment of the present invention, the second 2D frame information may be obtained by processing the plurality of image data collected by the camera with an existing 2D target detection model (e.g., SSD, YOLO, etc.). The intersection ratio of the annotated object between the second 2D frame information obtained from the image data and the first 2D frame information obtained from the 3D point cloud data is then computed and compared with the first threshold, so that large batches of highly accurate annotation data can be determined quickly for monocular 3D target detection.
Step S205, calculating an intersection ratio of the plurality of first 2D frame information and the plurality of second 2D frame information, respectively.
With this setting, the intersection ratio reflects the degree of matching between the two-dimensional information displayed by the annotated object in the first 2D frame information and that displayed in the second 2D frame information. Comparing the intersection ratio with the first threshold to determine the two types of annotation data helps further improve the accuracy of the determined annotation data.
Step S206, judging whether the intersection ratio is smaller than a first threshold value.
Specifically, if the intersection ratio is smaller than the first threshold, step S207 is executed; otherwise, that is, if the intersection ratio is greater than or equal to the first threshold, step S208 is executed.
Step S207, delete the 3D frame information corresponding to the first 2D frame information, retain the second 2D frame information, and use the second 2D frame information as the annotation data.
According to the embodiment of the invention, an intersection ratio smaller than the first threshold indicates that the two-dimensional information of the annotated object displayed in the current first 2D frame information matches the two-dimensional information embodied in the image data poorly, meaning that, owing to point cloud sparsity or similar causes, the point cloud data cannot represent the three-dimensional information of the annotated object well. Therefore, the 3D frame information corresponding to the current first 2D frame information is deleted, only the second 2D frame information is retained, and the two-dimensional information of the annotated object indicated by the second 2D frame information is used as the annotation data.
Step S208, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, reserving the 3D frame information corresponding to the matched first 2D frame information and the unmatched second 2D frame information, and using the reserved information as the labeling data.
According to the embodiment of the invention, an intersection ratio greater than or equal to the first threshold indicates that the two-dimensional information of the annotated object displayed in the current first 2D frame information matches the two-dimensional information embodied in the image data well, so the 3D frame information corresponding to the matched first 2D frame information and the unmatched second 2D frame information are retained. The 3D frame information embodies the three-dimensional information of the annotated object and is more comprehensive and accurate than the two-dimensional information embodied in the image data, which is why the 3D frame information corresponding to the matched first 2D frame information is retained. The unmatched first 2D frame information is deleted as in step S207, leaving only the corresponding (i.e., unmatched) second 2D frame information.
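Steps S206 to S208 could be realized, for example, by the greedy matching sketch below, which reuses the iou helper from the earlier sketch; the greedy strategy and the dictionary record format are assumptions, since the embodiment only specifies which frame information is kept or deleted.

    def build_annotations(first_2d, boxes_3d, second_2d, first_threshold=0.5):
        """Keep 3D frame info for matched projections; keep unmatched image boxes as 2D labels.

        first_2d:  projected 2D boxes, one per 3D detection (same order as boxes_3d)
        boxes_3d:  3D frame information in the camera coordinate system
        second_2d: 2D boxes produced by the image detector
        """
        annotations, used = [], set()
        for box_first, box3d in zip(first_2d, boxes_3d):
            best_j, best_iou = -1, 0.0
            for j, box_second in enumerate(second_2d):
                if j in used:
                    continue
                value = iou(box_first, box_second)
                if value > best_iou:
                    best_j, best_iou = j, value
            if best_iou >= first_threshold:
                used.add(best_j)
                annotations.append({"type": "3d", "box3d": box3d, "box2d": second_2d[best_j]})
            # otherwise the 3D frame information is deleted (step S207)
        for j, box_second in enumerate(second_2d):
            if j not in used:
                annotations.append({"type": "2d", "box2d": box_second})  # unmatched, step S208
        return annotations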
Step S209, determining a 3D projection center point corresponding to the 3D frame information in the reserved information and a 2D frame center point corresponding to the second 2D frame information in the reserved information; calculating the Euclidean distance between the 3D projection center point and the 2D frame center point, and calculating the frame diagonal distance of second 2D frame information in the reserved information; calculating the labeling error of the labeled object according to the Euclidean distance and the frame diagonal distance; and updating the labeling data according to the labeling error and the second threshold value.
Specifically, according to the embodiment of the present invention, the Euclidean distance D between the 3D projection center point and the 2D frame center point is calculated as:

D = \sqrt{(c_x - c'_x)^2 + (c_y - c'_y)^2}

where (c_x, c_y) denotes the 2D frame center point and (c'_x, c'_y) denotes the 3D projection center point.

The frame diagonal distance l of the second 2D frame information is calculated as:

l = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

where (x_1, y_1) and (x_2, y_2) are the diagonal corner coordinates of the 2D frame.
Further, according to the embodiment of the present invention, the reserved information includes one or more tagged objects; if the number of the marked objects is one, the step of calculating the marking error of the image according to the Euclidean distance and the frame diagonal distance comprises the following steps: calculating a normalized distance between the 3D projection center point and the 2D frame center point according to the Euclidean distance and the frame diagonal distance; and calculating the labeling error according to the normalized distance.
Preferably, according to an embodiment of the present invention, if the number of the objects to be labeled is plural, the step of calculating the labeling error of the image according to the euclidean distance and the frame diagonal distance includes:
calculating the average value of the normalized distance between the 3D projection center point and the 2D frame center point according to the marked quantity, the Euclidean distance and the frame diagonal distance;
and calculating the labeling error according to the average value of the normalized distance and the labeling quantity.
It can be understood that the larger the number of the annotation objects included in the included information is, the more accurate the finally determined annotation error is by calculating the average value of the normalized distances, thereby making the accuracy of the determined annotation data higher.
Specifically, according to the embodiment of the present invention, the labeling error (metric) is calculated as the average normalized distance over the retained annotated objects:

\mathrm{metric} = \frac{1}{N}\sum_{i=1}^{N}\frac{D_i}{l_i}

where N denotes the number of annotated objects, and D_i and l_i are the Euclidean distance and the frame diagonal distance of the i-th annotated object, respectively.
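Under the definitions above, the labeling error and the subsequent cleaning with the second threshold might be computed as in the sketch below; the pair structure and the example threshold value are assumptions made for illustration.

    import math

    def labeling_error(pairs):
        """Average normalized center distance over the N retained annotated objects.

        pairs: list of ((cx, cy), (cpx, cpy), (x1, y1, x2, y2)) holding the 2D frame
               center, the 3D projection center, and the second 2D frame's diagonal corners.
        """
        if not pairs:
            return 0.0
        total = 0.0
        for (cx, cy), (cpx, cpy), (x1, y1, x2, y2) in pairs:
            d = math.hypot(cx - cpx, cy - cpy)      # Euclidean distance D
            diag = math.hypot(x1 - x2, y1 - y2)     # frame diagonal distance l
            total += d / diag                       # normalized distance
        return total / len(pairs)                   # metric = (1/N) * sum(D_i / l_i)

    SECOND_THRESHOLD = 0.1  # hypothetical value; frames exceeding it could be discarded or re-checked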
According to the technical scheme of the embodiment of the invention, a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera are received; the plurality of 3D point cloud data are processed by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and the plurality of 3D detection results are sequentially subjected to conversion processing and mapping processing to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; the plurality of image data are processed by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and the intersection ratios between the first 2D frame information and the second 2D frame information are calculated, and the annotation data are determined according to the intersection ratio and the first threshold. These technical means overcome the technical problem in the related art that annotation data containing only the two-dimensional information of the annotated object have low accuracy and are difficult to obtain in large batches. By combining the three-dimensional and two-dimensional information of the annotated object on the basis of the point cloud data acquired by the laser radar and the image data acquired by the camera, the cost of determining annotation data is reduced and large batches of annotation data can be determined quickly and efficiently.
FIG. 3 is a schematic diagram of the main modules of an annotation data determination apparatus for monocular 3D object detection according to another embodiment of the present invention; as shown in fig. 3, an annotation data determination apparatus 300 for monocular 3D object detection according to an embodiment of the present invention mainly includes:
a data receiving module 301, configured to: and receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by the laser radar and the camera.
Specifically, calibration parameters and conversion matrix parameters between vehicle-mounted sensors can be preset, and point cloud data and image data of the surrounding environment while the vehicle is running are synchronously acquired by the laser radar and the camera. Although the point cloud collected by a laser radar with fewer beams is sparse, and a distant annotated object may receive no scanning points at all, combining it with the image data collected by the camera improves the accuracy of the determined annotation data while keeping the cost low.
A point cloud data processing module 302 for: processing the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of the tagged object, which is further determined based on the 3D point cloud data.
Specifically, according to the embodiment of the present invention, all 3D point cloud data can be processed by an existing, mature 3D point cloud detection model with high detection accuracy (such as PointPillars) to obtain a 3D detection result corresponding to each 3D point cloud frame, where the 3D detection result indicates the three-dimensional information of the annotated object in the point cloud coordinate system. The 3D detection result is then sequentially subjected to conversion processing and mapping processing to obtain the first 2D frame information, which indicates the two-dimensional information of the annotated object in the camera coordinate system, so that it can be combined with the two-dimensional information of the annotated object obtained from the image data (the second 2D frame information) to determine annotation data with higher accuracy.
Further, according to an embodiment of the present invention, the point cloud data processing module 302 is further configured to: respectively converting the plurality of 3D detection results to obtain a plurality of 3D frame information under a camera coordinate system; wherein the 3D frame information indicates three-dimensional information of the tagged object; and mapping the plurality of pieces of 3D frame information to obtain a plurality of pieces of first 2D frame information corresponding to the plurality of pieces of 3D frame information.
Through the arrangement, the 3D detection results under the point cloud coordinate systems are respectively converted to obtain 3D frame information under the camera coordinate system, and the 3D frame information is used for indicating the three-dimensional information of the marked object. And then mapping the 3D frame information to obtain the two-dimensional information of the annotation object under the camera coordinate system.
An image data processing module 303 for: and processing the plurality of image data by using the 2D target detection model to obtain a plurality of second 2D frame information.
Specifically, according to the embodiment of the present invention, the second 2D frame information may be obtained by processing the plurality of image data collected by the camera with an existing 2D target detection model (e.g., SSD, YOLO, etc.). The intersection ratio of the annotated object between the second 2D frame information obtained from the image data and the first 2D frame information obtained from the 3D point cloud data is then computed and compared with the first threshold, so that large batches of highly accurate annotation data can be determined quickly for monocular 3D target detection. The second 2D frame information indicates the two-dimensional information of the annotated object determined based on the image data.
An annotation data determination module 304 for: and calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the labeling data according to the intersection ratio and the first threshold.
Specifically, according to the embodiment of the present invention, the above-mentioned labeled data determining module 304 is further configured to:
judging whether the intersection ratio is smaller than a first threshold value;
if so, deleting the 3D frame information corresponding to the first 2D frame information, reserving the second 2D frame information, and taking the second 2D frame information as the marking data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, reserving the 3D frame information corresponding to the matched first 2D frame information and the second 2D frame information which is not matched, and taking the reserved information as the marking data.
With the above setting, the intersection ratio is compared with the first threshold to determine the two types of annotation data. According to a specific implementation of the embodiment of the present invention, the first threshold may be set to 0.5 (the first threshold takes a value between 0 and 1). That is, if at least half of the two-dimensional information displayed by the annotated object in the first 2D frame information is successfully matched with the two-dimensional information displayed in the second 2D frame information, the 3D frame information corresponding to the successfully matched first 2D frame information is retained together with the second 2D frame information that was not matched, and the retained information is used as the annotation data; if less than half is successfully matched, only the second 2D frame information is retained and used as the annotation data. It should be noted that this setting of the first threshold is only an example; different first thresholds may be set according to the accuracy actually required and the amount of annotation data to be determined. A higher first threshold may be set if higher accuracy is required, and a lower first threshold may be set if more annotation data is to be acquired.
Further, according to an embodiment of the present invention, if the intersection ratio is greater than or equal to the first threshold, the annotation data determining module 304 is further configured to:
determining a 3D projection center point corresponding to the 3D frame information in the reserved information and a 2D frame center point corresponding to the second 2D frame information in the reserved information;
calculating the Euclidean distance between the 3D projection center point and the 2D frame center point, and calculating the frame diagonal distance of second 2D frame information in the reserved information;
calculating the labeling error of the labeled object according to the Euclidean distance and the frame diagonal distance;
and updating the labeling data according to the labeling error and the second threshold value.
With the above setting, for the case where both the first 2D frame information and the second 2D frame information are retained, a labeling error needs to be calculated in order to improve the accuracy of the determined annotation data, and the data are cleaned using the labeling error and the second threshold. Specifically, the 3D projection center point of the 3D frame information corresponding to the first 2D frame information in the retained information and the 2D frame center point corresponding to the second 2D frame information in the retained information are determined; the Euclidean distance between the two center points and the frame diagonal distance of the second 2D frame information are calculated; the labeling error of the annotated object is calculated from the Euclidean distance and the frame diagonal distance; and the annotation data are updated according to the labeling error and the second threshold.
Illustratively, according to the embodiment of the present invention, the reserved information includes one or more tagged objects; if the number of the tagged objects is one, the tagged data determining module 304 is further configured to:
calculating a normalized distance between the 3D projection center point and the 2D frame center point according to the Euclidean distance and the frame diagonal distance; and calculating the labeling error according to the normalized distance.
Preferably, according to an embodiment of the present invention, if the number of the tagged objects is multiple, the tagged data determining module 304 is further configured to:
calculating the average value of the normalized distance between the 3D projection center point and the 2D frame center point according to the marked quantity, the Euclidean distance and the frame diagonal distance;
and calculating the labeling error according to the average value of the normalized distance and the labeling quantity.
It can be understood that the more annotated objects the retained information contains, the more accurate the finally determined labeling error obtained by averaging the normalized distances, and hence the higher the accuracy of the determined annotation data.
According to the technical scheme of the embodiment of the invention, a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera are received; the plurality of 3D point cloud data are processed by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and the plurality of 3D detection results are sequentially subjected to conversion processing and mapping processing to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; the plurality of image data are processed by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and the intersection ratios between the first 2D frame information and the second 2D frame information are calculated, and the annotation data are determined according to the intersection ratio and the first threshold. These technical means overcome the technical problem in the related art that annotation data containing only the two-dimensional information of the annotated object have low accuracy and are difficult to obtain in large batches. By combining the three-dimensional and two-dimensional information of the annotated object on the basis of the point cloud data acquired by the laser radar and the image data acquired by the camera, the cost of determining annotation data is reduced and large batches of annotation data can be determined quickly and efficiently.
Fig. 4 shows an exemplary system architecture 400 to which the annotation data determination method for monocular 3D object detection or the annotation data determination apparatus for monocular 3D object detection of embodiments of the present invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405 (this architecture is merely an example, and the components included in a particular architecture may be adapted according to application specific circumstances). The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as a data processing application, a tagging data determination application, a search application, an instant messaging tool, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server that provides various services, for example a server (for example only) that provides data processing support for users of the terminal devices 401, 402, 403. The server may analyze and otherwise process the received data, such as the 3D point cloud data and the image data, and feed back a processing result (e.g., annotation data, for example only) to the terminal device.
It should be noted that the annotation data determination method for monocular 3D object detection provided in the embodiment of the present invention is generally executed by the server 405, and accordingly, the annotation data determination device for monocular 3D object detection is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a data receiving module, a point cloud data processing module, an image data processing module and an annotation data determination module. The names of these modules do not, in some cases, limit the modules themselves; for example, the data receiving module may also be described as a "module for receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera".
As another aspect, the present invention also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: receive a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera; process the plurality of 3D point cloud data by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially perform conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information, wherein the first 2D frame information indicates two-dimensional information of the annotated object; process the plurality of image data by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; and calculate the intersection ratio of the pieces of first 2D frame information and the pieces of second 2D frame information, and determine the annotation data according to the intersection ratio and the first threshold.
According to the technical solution of the embodiment of the present invention, a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera are received; the plurality of 3D point cloud data are processed by using the 3D point cloud detection model to obtain a plurality of 3D detection results, and conversion processing and mapping processing are sequentially performed on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information, where the first 2D frame information indicates two-dimensional information of the annotated object; the plurality of image data are processed by using the 2D target detection model to obtain a plurality of pieces of second 2D frame information; the intersection ratio of the pieces of first 2D frame information and the pieces of second 2D frame information is calculated, and the annotation data is determined according to the intersection ratio and the first threshold. These technical means overcome the problems in the related art that the accuracy of annotation data is low and that large batches of annotation data are difficult to obtain when only the two-dimensional information of the annotated object is used. By combining the point cloud data collected by the laser radar with the image data collected by the camera, the three-dimensional information and the two-dimensional information of the annotated object are fused, the cost of determining the annotation data is reduced, and large batches of annotation data can be determined quickly and efficiently.
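By way of illustration only, the following Python sketch shows one possible arrangement of this flow. The detection models, the projection step and the fusion step are passed in as callables because their concrete form is not fixed here; all function and parameter names are assumptions introduced for explanation and are not part of the claimed subject matter.

from typing import Callable, List, Sequence, Tuple

Box2D = Tuple[float, float, float, float]  # assumed box format: (x_min, y_min, x_max, y_max)

def build_annotations(point_clouds: Sequence,
                      images: Sequence,
                      detect_3d: Callable,   # point cloud -> list of 3D detection results
                      detect_2d: Callable,   # image -> list of Box2D (second 2D frame information)
                      project: Callable,     # 3D detection result -> Box2D (first 2D frame information)
                      fuse: Callable) -> List:
    # Process each synchronized lidar/camera frame and fuse the two detection
    # sources into annotation data.
    annotations = []
    for cloud, image in zip(point_clouds, images):
        boxes_3d = detect_3d(cloud)                  # 3D detection results from the point cloud
        first_2d = [project(b) for b in boxes_3d]    # conversion + mapping onto the image plane
        second_2d = detect_2d(image)                 # 2D detections on the synchronized image
        annotations.append(fuse(boxes_3d, first_2d, second_2d))
    return annotations
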
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An annotation data determination method for monocular 3D object detection, characterized by comprising:
receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera;
processing the plurality of 3D point cloud data by using a 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of an annotated object;
processing the plurality of image data by using a 2D target detection model to obtain a plurality of pieces of second 2D frame information;
and calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the annotation data according to the intersection ratio and a first threshold.
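By way of illustration only, the intersection ratio recited above is commonly computed as the intersection over union (IoU) of two axis-aligned boxes. A minimal Python sketch, assuming boxes in (x_min, y_min, x_max, y_max) form, is:

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes; the (x_min, y_min,
    # x_max, y_max) format is an assumption, not recited in the claim.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
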
2. The method according to claim 1, wherein the sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information includes:
converting the plurality of 3D detection results respectively to obtain a plurality of pieces of 3D frame information in a camera coordinate system; wherein the 3D frame information indicates three-dimensional information of the annotated object;
and mapping the plurality of pieces of 3D frame information respectively to obtain the plurality of pieces of first 2D frame information corresponding to the plurality of pieces of 3D frame information.
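By way of illustration only, the conversion and mapping of claim 2 can be sketched numerically as below, assuming an eight-corner box representation, a 4x4 lidar-to-camera extrinsic matrix and a 3x3 intrinsic matrix; these calibration inputs and names are assumptions for explanation, not limitations.

import numpy as np

def first_2d_frame_info(corners_lidar: np.ndarray,
                        T_lidar_to_cam: np.ndarray,
                        K: np.ndarray):
    # corners_lidar: (8, 3) corners of one 3D detection result in lidar coordinates
    # T_lidar_to_cam: (4, 4) extrinsic transform; K: (3, 3) camera intrinsics
    # Assumes all corners lie in front of the camera; real use would filter or clip.
    homogeneous = np.hstack([corners_lidar, np.ones((8, 1))])   # (8, 4)
    corners_cam = (T_lidar_to_cam @ homogeneous.T)[:3]          # 3D frame information in camera coordinates
    pixels = K @ corners_cam                                    # perspective projection
    pixels = pixels[:2] / pixels[2:3]                           # divide by depth
    x_min, y_min = pixels.min(axis=1)
    x_max, y_max = pixels.max(axis=1)
    return float(x_min), float(y_min), float(x_max), float(y_max)  # first 2D frame information
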
3. The method according to claim 1, wherein the determining the annotation data according to the intersection ratio and the first threshold further comprises:
determining whether the intersection ratio is smaller than the first threshold;
if so, deleting the 3D frame information corresponding to the first 2D frame information, retaining the second 2D frame information, and taking the second 2D frame information as the annotation data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, retaining the 3D frame information corresponding to the matched first 2D frame information together with the unmatched second 2D frame information, and taking the retained information as the annotation data.
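By way of illustration only, the decision of claim 3 can be sketched in Python as follows, reusing the iou helper sketched under claim 1. The simple greedy best-match pairing and the threshold value are assumptions; the claim does not prescribe a particular matching scheme.

def fuse_by_iou(boxes_3d, first_2d, second_2d, first_threshold=0.5):
    # Pair each projected first 2D box with its best-overlapping, still-unmatched
    # second 2D box. At or above the threshold the corresponding 3D frame
    # information is retained; below it the 3D box is dropped. Unmatched second
    # 2D boxes are kept as 2D-only annotation data.
    retained_3d, matched = [], set()
    for box_3d, projected in zip(boxes_3d, first_2d):
        best_j, best_overlap = -1, 0.0
        for j, detected in enumerate(second_2d):
            if j in matched:
                continue
            overlap = iou(projected, detected)   # iou as sketched under claim 1
            if overlap > best_overlap:
                best_j, best_overlap = j, overlap
        if best_overlap >= first_threshold:
            retained_3d.append(box_3d)
            matched.add(best_j)
    retained_2d = [d for j, d in enumerate(second_2d) if j not in matched]
    return retained_3d, retained_2d
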
4. The method of claim 3, wherein if the intersection ratio is greater than or equal to the first threshold, the method further comprises:
determining a 3D projection center point corresponding to the 3D frame information in the retained information and a 2D frame center point corresponding to the second 2D frame information in the retained information;
calculating the Euclidean distance between the 3D projection center point and the 2D frame center point, and calculating the frame diagonal distance of the second 2D frame information in the retained information;
calculating the annotation error of the annotated object according to the Euclidean distance and the frame diagonal distance;
and updating the annotation data according to the annotation error and a second threshold.
5. The method according to claim 4, wherein the retained information includes one or more annotated objects; if the number of annotated objects is one, the step of calculating the annotation error according to the Euclidean distance and the frame diagonal distance comprises:
calculating a normalized distance between the 3D projection center point and the 2D frame center point according to the Euclidean distance and the frame diagonal distance;
and calculating the annotation error according to the normalized distance.
6. The method according to claim 5, wherein if there are a plurality of annotated objects, the step of calculating the annotation error according to the Euclidean distance and the frame diagonal distance comprises:
calculating an average value of the normalized distances between the 3D projection center points and the 2D frame center points according to the number of annotated objects, the Euclidean distances and the frame diagonal distances;
and calculating the annotation error according to the average value of the normalized distances and the number of annotated objects.
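By way of illustration only, claims 4 to 6 can be read together as a normalized center-offset check, with a single annotated object being the one-element case of the average. In the Python sketch below the input lists and the final comparison against the second threshold are assumptions shown only as a usage hint.

import math

def annotation_error(projected_centers, box_centers, box_diagonals):
    # projected_centers: (x, y) projections of the retained 3D box centers
    # box_centers:       (x, y) centers of the matched second 2D boxes
    # box_diagonals:     diagonal lengths of those 2D boxes
    # Each per-object term is the Euclidean center distance normalized by the
    # box diagonal; the error is the mean over all annotated objects.
    total = 0.0
    for (px, py), (cx, cy), diag in zip(projected_centers, box_centers, box_diagonals):
        total += math.hypot(px - cx, py - cy) / diag
    return total / len(box_centers)

# Usage hint (second_threshold is an assumed name):
# if annotation_error(p, c, d) > second_threshold:
#     ...  # flag or update the annotation data for this object set
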
7. An annotation data determination apparatus for monocular 3D object detection, characterized by comprising:
a data receiving module to: receiving a plurality of 3D point cloud data and a plurality of image data synchronously acquired by a laser radar and a camera;
a point cloud data processing module for: processing the plurality of 3D point cloud data by using a 3D point cloud detection model to obtain a plurality of 3D detection results, and sequentially performing conversion processing and mapping processing on the plurality of 3D detection results to obtain a plurality of pieces of first 2D frame information; wherein the first 2D frame information indicates two-dimensional information of an annotated object;
an image data processing module to: processing the plurality of image data by using a 2D target detection model to obtain a plurality of second 2D frame information;
an annotation data determination module to: calculating the intersection ratio of the plurality of pieces of first 2D frame information and the plurality of pieces of second 2D frame information, and determining the annotation data according to the intersection ratio and a first threshold.
8. The apparatus of claim 7, wherein the annotation data determination module is further configured to:
determining whether the intersection ratio is smaller than the first threshold;
if so, deleting the 3D frame information corresponding to the first 2D frame information, retaining the second 2D frame information, and taking the second 2D frame information as the annotation data;
if not, determining a matching result of the first 2D frame information and the second 2D frame information according to the intersection ratio, retaining the 3D frame information corresponding to the matched first 2D frame information together with the unmatched second 2D frame information, and taking the retained information as the annotation data.
9. An electronic device for annotation data determination for monocular 3D object detection, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202210212889.4A 2022-03-04 2022-03-04 Method and device for determining labeling data for monocular 3D target detection Pending CN114581523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210212889.4A CN114581523A (en) 2022-03-04 2022-03-04 Method and device for determining labeling data for monocular 3D target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210212889.4A CN114581523A (en) 2022-03-04 2022-03-04 Method and device for determining labeling data for monocular 3D target detection

Publications (1)

Publication Number Publication Date
CN114581523A true CN114581523A (en) 2022-06-03

Family

ID=81773898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210212889.4A Pending CN114581523A (en) 2022-03-04 2022-03-04 Method and device for determining labeling data for monocular 3D target detection

Country Status (1)

Country Link
CN (1) CN114581523A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972490A (en) * 2022-07-29 2022-08-30 苏州魔视智能科技有限公司 Automatic data labeling method, device, equipment and storage medium
CN114972490B (en) * 2022-07-29 2022-12-20 苏州魔视智能科技有限公司 Automatic data labeling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110632608B (en) Target detection method and device based on laser point cloud
CN109255337B (en) Face key point detection method and device
CN110263209B (en) Method and apparatus for generating information
CN112634343A (en) Training method of image depth estimation model and processing method of image depth information
CN109118456B (en) Image processing method and device
CN110619807B (en) Method and device for generating global thermodynamic diagram
CN113077548A (en) Collision detection method, device, equipment and storage medium for object
CN114627239B (en) Bounding box generation method, device, equipment and storage medium
WO2023165220A1 (en) Target object detection method and apparatus
CN111815738A (en) Map construction method and device
CN110633716A (en) Target object detection method and device
CN113325381B (en) Method, apparatus, device and storage medium for processing data
CN113591580B (en) Image annotation method and device, electronic equipment and storage medium
CN113126120B (en) Data labeling method, device, equipment, storage medium and computer program product
CN114581523A (en) Method and device for determining labeling data for monocular 3D target detection
CN110363847B (en) Map model construction method and device based on point cloud data
CN110377776B (en) Method and device for generating point cloud data
CN110634159A (en) Target detection method and device
CN112907164A (en) Object positioning method and device
CN112241977A (en) Depth estimation method and device for feature points
CN113379884B (en) Map rendering method, map rendering device, electronic device, storage medium and vehicle
CN113761090B (en) Positioning method and device based on point cloud map
CN112988932B (en) High-precision map labeling method, device, equipment, readable storage medium and product
CN110375752B (en) Method and device for generating navigation points
CN110389349B (en) Positioning method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination