CN113281780B - Method and device for marking image data and electronic equipment - Google Patents
Method and device for labeling image data and electronic equipment
- Publication number
- CN113281780B CN113281780B CN202110583714.XA CN202110583714A CN113281780B CN 113281780 B CN113281780 B CN 113281780B CN 202110583714 A CN202110583714 A CN 202110583714A CN 113281780 B CN113281780 B CN 113281780B
- Authority
- CN
- China
- Prior art keywords
- data set
- point cloud
- cloud data
- target object
- image data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/88—Lidar systems specially adapted for specific applications
- G01S17/89—Lidar systems specially adapted for specific applications for mapping or imaging
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/86—Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S7/00—Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
- G01S7/48—Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S17/00
- G01S7/4802—Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S17/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Data Mining & Analysis (AREA)
- Computer Networks & Wireless Communication (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Electromagnetism (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a method and a device for labeling image data. The method comprises the following steps: determining an image data set to be labeled and a first point cloud data set that is time-synchronized with the image data set, wherein the first point cloud data set is acquired by a laser radar; performing multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set; densifying the second point cloud data set to obtain an enhanced third point cloud data set; performing test data enhancement and target frame weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set; and labeling a second target object in the image data set based on the first target object. According to the technical scheme provided by the disclosure, the second target object in the image data set can be determined automatically, so that automatically labeled data are obtained without manual participation, and the labeling cost can be reduced.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for labeling image data, and an electronic device.
Background
Monocular camera 3D perception algorithms are becoming an integral part of the advanced driving assistance systems (Advanced Driving Assistance System, ADAS) of various vehicles. Mainstream monocular camera 3D perception algorithms are based on deep learning and, like monocular camera 2D perception algorithms, require labeled data to train the model. Compared with labeling for monocular camera 2D perception algorithms, labeling for monocular camera 3D perception algorithms is more difficult, and 3D perception algorithm labeling data are therefore harder to obtain.
Disclosure of Invention
In order to solve the technical problems, the embodiment of the application provides a method and a device for labeling image data and electronic equipment.
According to one aspect of the present application, there is provided a method of annotating image data, comprising: determining an image data set to be labeled and a first point cloud data set that is time-synchronized with the image data set, wherein the first point cloud data set is acquired by a laser radar; performing multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set; densifying the second point cloud data set to obtain an enhanced third point cloud data set; performing test data enhancement and target frame weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set; and labeling a second target object in the image data set based on the first target object.
According to another aspect of the present application, there is provided an apparatus for labeling image data, comprising: the marking data determining module is used for determining an image data set to be marked and a first point cloud data set which is synchronous in time with the image data set, wherein the first point cloud data set is acquired through a laser radar; the multi-sensor fusion module is used for carrying out multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set; the densification module is used for densifying the second point cloud data set to obtain an enhanced third point cloud data set; the data enhancement weighted fusion module is used for carrying out test data enhancement and target frame weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set; and the target object determining module is used for determining a second target object in the image data set based on the first target object.
According to another aspect of the present application, there is provided a computer readable storage medium storing a computer program for performing any one of the methods described above.
According to another aspect of the present application, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to perform any of the methods described above.
According to the method for labeling image data provided by the embodiment of the application, the first target object in the first point cloud data set is determined by performing multi-sensor fusion, densification, test data enhancement and target frame weighted fusion on the image data set to be labeled and the first point cloud data set. Because the first point cloud data set is acquired by the laser radar, determining the second target object in the image data set based on the first target object obtained from the first point cloud data set greatly improves the accuracy of the labeling data. Moreover, since the second target object in the image data set is determined from the first target object in the first point cloud data set, the second target object can be determined automatically, so automatically labeled data are obtained without manual participation and the labeling cost can be reduced.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not limit the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flowchart of a method for labeling image data according to an exemplary embodiment of the present application.
FIG. 2 is a flow chart of a method for determining an image dataset that needs to be annotated and a first point cloud dataset that is synchronized in time with the image dataset, according to an exemplary embodiment of the application.
Fig. 3-1 is a schematic flow chart of a process for performing multi-sensor fusion on an image dataset and a point cloud dataset to obtain a fused second point cloud dataset according to an exemplary embodiment of the present application.
Fig. 3-2 are schematic diagrams of a fusion process according to an exemplary embodiment of the present application.
Fig. 4-1 is a schematic diagram of a second point cloud dataset before densification, in accordance with an exemplary embodiment of the application.
Fig. 4-2 is a schematic diagram of a densified second point cloud data set (i.e., an enhanced third point cloud data set) provided by an exemplary embodiment of the present application.
Fig. 5-1 is a schematic flow chart of performing test data enhancement and target frame weighted fusion on a third point cloud data set to obtain a first target object in a first point cloud data set according to an exemplary embodiment of the present application.
Fig. 5-2 is a schematic diagram of 30 test data enhancement results for the same target object according to an exemplary embodiment of the present application.
Fig. 5-3 are effect diagrams of a test data enhancement implementation provided in an exemplary embodiment of the present application.
Fig. 5-4 are schematic diagrams of a weighted fusion result of target frames of the same target object according to an exemplary embodiment of the present application.
Fig. 5-5 are effect diagrams of a practical application of weighted fusion of object frames according to an exemplary embodiment of the present application.
FIG. 6-1 is a flow chart of another method for labeling image data according to an exemplary embodiment of the present application.
Fig. 6-2 is a schematic diagram of an output optimized target object provided by an exemplary embodiment of the present application.
Fig. 7 is a flowchart of yet another method for labeling image data according to an exemplary embodiment of the present application.
Fig. 8 is a schematic diagram of an overall architecture for labeling image data according to an exemplary embodiment of the present application.
FIG. 9 is a schematic diagram of an automatic labeling process for labeling image data according to an exemplary embodiment of the present application.
Fig. 10 is a schematic structural diagram of an apparatus for labeling image data according to an exemplary embodiment of the present application.
Fig. 11 is a schematic structural diagram of a marking data determining module in an apparatus for marking image data according to an exemplary embodiment of the present application.
Fig. 12 is a schematic structural diagram of a multi-sensor fusion module in an apparatus for labeling image data according to an exemplary embodiment of the present application.
Fig. 13 is a schematic structural diagram of a data enhancement weighted fusion module in a device for labeling image data according to an exemplary embodiment of the present application.
Fig. 14 is a schematic structural diagram of another apparatus for labeling image data according to an exemplary embodiment of the present application.
Fig. 15 is a schematic structural diagram of still another apparatus for labeling image data according to an exemplary embodiment of the present application.
Fig. 16 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, exemplary embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
Accurate 3D spatial information is difficult to obtain from a monocular camera, so monocular camera 3D perception algorithm labeling data of sufficiently high precision cannot be obtained from a monocular image acquired by the camera alone. At present, monocular camera 3D perception algorithm labeling data are mainly obtained by manual labeling with the monocular camera combined with other sensors; combining the monocular camera with other sensors involves synchronizing the multiple sensors, establishing the mapping relationships among them, and labeling 3D position information and semantic information. The existing process of obtaining monocular camera 3D perception algorithm labeling data therefore consumes a great amount of manpower and material resources.
Aiming at the technical problems, the basic idea of the application is to provide a method and a device for marking image data and electronic equipment.
Various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary method
Fig. 1 is a flowchart of a method for labeling image data according to an exemplary embodiment of the present application. The method for labeling the image data, which is provided by the embodiment of the application, can be applied to the technical field of computer vision. As shown in fig. 1, the method for labeling image data provided by the embodiment of the application includes the following steps:
step 101, determining an image dataset to be annotated and a first point cloud dataset that is time synchronized with the image dataset.
In an embodiment, the image dataset to be marked may be acquired by a Camera, and the first point cloud dataset synchronized in time with the image dataset may be acquired by a laser radar.
And 102, performing multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set.
Multi-sensor fusion is an information processing process in which information and data from multiple sensors or sources are analyzed and integrated by computer according to certain criteria in order to complete the required decision-making and estimation.
Step 103, densifying (Densify) the fused second point cloud data set to obtain an enhanced third point cloud data set (Augmented point cloud).
Here, densification means transforming the second point cloud data of the frames before the current frame onto the current frame through projection. Since the second point cloud data are sets of points in three-dimensional space, after the projection transformation they can be merged directly with the second point cloud data of the current frame, so that the fused second point cloud data set is densified.
Step 104, performing test data enhancement (Test Time Augmentation, TTA) and target frame weighted fusion (Weighted Box Fusion) on the third point cloud data set to obtain a first target object in the first point cloud data set.
The test data enhancement applies multiple perturbation transformations to the third point cloud data set to obtain multiple groups of data, one perturbation corresponding to one group, which increases the amount of data used for labeling the target object and improves the stability and accuracy of the labeling result. The target frame weighted fusion computes a weighted average over the multiple versions of the data and uses the averaged output as the final result; the weights used in the weighting can be set empirically or determined by grid search, a parameter-tuning method. Grid search trains a model for each possible combination of the parameters (weights), evaluates it, and selects the parameters (weights) according to the evaluation results.
The first target object in the first point cloud data set includes information such as a center point coordinate of the first target object, a length, a width and a height of the first target object, a yaw angle of the first target object, and the first target object may be an automobile, a pedestrian, and the like.
Step 105, annotating a second target object in the image dataset based on the first target object.
Specifically, based on information such as the center point coordinates, the length, width and height, and the yaw angle of the first target object in the first point cloud data set, the center point coordinates, the length, width and height, and the yaw angle of the second target object in the image data set corresponding to the first point cloud data set are determined. After the second target object in the image data set is determined, it may be presented in the form of a detection box.
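As an illustration of how the 3D information of the first target object could be turned into a detection box on the image, the following Python sketch (not part of the patent) projects the eight corners of the 3D box and takes their bounding rectangle; the 3×4 lidar-to-image projection matrix P and the box layout of center, length/width/height and yaw about the z-axis are assumptions.

```python
import numpy as np

def box3d_to_image_box(center, lwh, yaw, P):
    """Project a 3D box (center xyz, length/width/height, yaw about z) to a 2D image box via a 3x4 matrix P.

    Assumes the object lies in front of the camera (positive depth for every corner).
    """
    l, w, h = lwh
    # Eight box corners in the object frame (all sign combinations of +-l/2, +-w/2, +-h/2).
    x = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2.0
    y = np.array([ w,  w, -w, -w,  w,  w, -w, -w]) / 2.0
    z = np.array([ h, -h,  h, -h,  h, -h,  h, -h]) / 2.0
    rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                    [np.sin(yaw),  np.cos(yaw), 0.0],
                    [0.0,          0.0,         1.0]])
    corners = rot @ np.vstack([x, y, z]) + np.asarray(center, dtype=float).reshape(3, 1)  # 3 x 8
    img = P @ np.vstack([corners, np.ones((1, 8))])           # 3 x 8 homogeneous pixel coordinates
    uv = img[:2] / img[2:3]                                   # divide by depth to get (u, v) per corner
    u_min, v_min = uv.min(axis=1)
    u_max, v_max = uv.max(axis=1)
    return u_min, v_min, u_max, v_max                         # axis-aligned 2D detection box
```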
It should be noted that step 102 is preferably performed before step 103, but step 103 may also be performed before step 102.
According to the method for labeling image data provided by the embodiment of the application, the first target object in the first point cloud data set is determined by performing multi-sensor fusion, densification, test data enhancement and target frame weighted fusion on the image data set to be labeled and the first point cloud data set. Because the first point cloud data set is acquired by the laser radar, determining the second target object in the image data set based on the first target object obtained from the first point cloud data set greatly improves the accuracy of the labeling data. Moreover, since the second target object in the image data set is determined from the first target object in the first point cloud data set, the second target object can be determined automatically, so automatically labeled data are obtained without manual participation and the labeling cost can be reduced.
FIG. 2 is a flow chart of a method for determining an image dataset that needs to be annotated and a first point cloud dataset that is synchronized in time with the image dataset, according to an exemplary embodiment of the application. The embodiment of fig. 2 of the present application is extended from the embodiment of fig. 1 of the present application, and differences between the embodiment of fig. 2 and the embodiment of fig. 1 are mainly described below, which will not be repeated.
As shown in fig. 2, in the method for labeling image data provided by the embodiment of the present application, determining an image dataset to be labeled and a first point cloud dataset that is time-synchronized with the image dataset (i.e. step 101) includes:
and step 1011, controlling the time synchronization of the camera and the laser radar.
Specifically, time synchronization between the camera and the laser radar is performed, and calibration between the camera and the laser radar is performed, so that the geometric mapping between the camera and the laser radar can be obtained. Compared with a camera, the laser radar has an inherent advantage in 3D detection, and its high-accuracy detection result can be mapped into the local three-dimensional coordinate system where the camera is located and used as the ground truth (Ground Truth, GT) for the camera's 3D detection.
In addition, it should be noted that in practical applications, one camera and one laser radar may be used, or multiple cameras with one laser radar, or multiple cameras with multiple laser radars; the specific implementation with multiple cameras and one or more laser radars is similar to that with one camera and one laser radar.
Step 1012, determining an original image dataset acquired by the camera and an original point cloud dataset acquired by the lidar.
In one embodiment, the collected raw image data set may be files with the suffix .jpg, and the collected raw point cloud data set may be data with the suffix .bin.
In step 1013, frames are extracted from the original image dataset and the original point cloud dataset according to a preset time interval, so as to obtain the image dataset and the first point cloud dataset which need to be marked.
Specifically, the preset time interval may be set according to the practical application. For example, if the preset time interval is 5 seconds, one frame of data is extracted from the original image data set and the original point cloud data set every 5 seconds, so as to obtain the image data set and the first point cloud data set that need to be labeled.
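A minimal sketch of this frame-extraction step, assuming both synchronized streams carry per-frame timestamps; the function name and the 5-second default are illustrative, not from the patent.

```python
def extract_frames(frames, interval_s=5.0):
    """Keep one frame per interval_s seconds from a list of (timestamp, payload) pairs sorted by time."""
    kept, next_t = [], None
    for ts, payload in frames:
        if next_t is None or ts >= next_t:
            kept.append((ts, payload))
            next_t = ts + interval_s
    return kept

# Applying the same selection to the time-synchronized camera and lidar streams keeps matching
# timestamps from both, yielding the image data set and the first point cloud data set to be labeled.
```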
According to the method for labeling image data provided by the embodiment of the application, the camera and the laser radar are controlled to be time-synchronized, the original image data set collected by the camera and the original point cloud data set collected by the laser radar are determined, and frames are extracted from the original image data set and the original point cloud data set at a preset time interval to obtain the image data set and the first point cloud data set that need to be labeled. This implementation is simple and fast, so the labeling data can be obtained quickly and the labeling efficiency is improved.
Fig. 3-1 is a schematic flow chart of a process for performing multi-sensor fusion on an image dataset and a first point cloud dataset to obtain a fused second point cloud dataset according to an exemplary embodiment of the present application. The embodiment of the present application shown in fig. 3-1 is extended from the embodiment of the present application shown in fig. 1, and differences between the embodiment of the present application shown in fig. 3-1 and the embodiment shown in fig. 1 are highlighted below, and the details of the differences are not repeated.
The embodiment of the application shown in fig. 3-1 mainly adopts a Point Painting multi-sensor fusion method, and the embodiment of the application shown in fig. 3-1 carries out multi-sensor fusion on the image dataset and the first point cloud dataset to obtain a fused second point cloud dataset (i.e. step 102), which specifically comprises the following steps:
Step 1021, determining a projection matrix between a camera for acquiring the image dataset and a lidar for acquiring the first point cloud dataset.
In one embodiment, the projection relationship between a point in the laser radar (world) coordinate system and a pixel in the image is:

Z_c · [u, v, 1]^T = M_1 · M_2 · [X_w, Y_w, Z_w, 1]^T = M · [X_w, Y_w, Z_w, 1]^T

where u and v are the horizontal and vertical pixel coordinates of a point in the image data set collected by the camera; X_w, Y_w and Z_w are the coordinates of the corresponding point in the three-dimensional space collected by the laser radar; Z_c is a scale coefficient; (u_0, v_0) is the origin of the u-v coordinate system; d_x and d_y are the physical sizes of one pixel along the horizontal axis x and the vertical axis y, respectively; R is a 3×3 orthogonal rotation matrix; t is a 3×1 three-dimensional translation vector; f is the focal length of the camera; a_x = f/d_x is the scale factor on the u axis and a_y = f/d_y is the scale factor on the v axis; and M is a 3×4 matrix called the projection matrix. M_1 = [[a_x, 0, u_0, 0], [0, a_y, v_0, 0], [0, 0, 1, 0]] is determined by a_x, a_y, u_0 and v_0, which depend only on the camera itself, so M_1 contains the internal (intrinsic) parameters of the camera. M_2 = [[R, t], [0^T, 1]] describes the orientation and position of the camera with respect to the world coordinate system, so M_2 contains the external (extrinsic) parameters of the camera.
Step 1022, performing semantic segmentation on the image dataset (Semantic Segmentation) to obtain a one-hot vector for each point in the image dataset.
Semantic segmentation is a fundamental task in computer vision, in which the visual input is divided into different semantically interpretable categories; interpretability means that the classification categories are meaningful in the real world. For example, all pixels belonging to cars in the image data set are distinguished from the rest.
Step 1023, determining a mapping relationship between the first point cloud data set and the image data set based on the projection matrix.
Wherein, according to the projection matrix of step 1021, a mapping relationship between the first point cloud dataset (X w,Yw,Zw) and the image dataset (u, v) may be determined.
Step 1024, determining the one-hot encoding of each point in the first point cloud data set based on the mapping relationship between the first point cloud data set and the image data set and the one-hot encoding of each point in the image data set.
Based on the mapping relationship between the first point cloud data set and the image data set, the first point cloud data set is projected from 3D space onto the image data set (picture), and the one-hot encoding of the pixel onto which each point falls is taken as the one-hot encoding of the corresponding point in the first point cloud data set.
Step 1025, concatenating the original attributes of each point in the first point cloud data set with the one-hot encoding of each point in the first point cloud data set to obtain the fused second point cloud data set.
Here, the original attributes of each point in the first point cloud data set are the spatial position (X_w, Y_w, Z_w) and the reflectivity of that point. Fig. 3-2 is a schematic diagram of the fusion process according to an exemplary embodiment of the present application. As can be seen from Fig. 3-2, the attributes of the fused second point cloud data set are richer than those of the first point cloud data set before fusion: the first point cloud data set has only 4 attribute dimensions, while the fused second point cloud data set has 14 attribute dimensions.
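The painting step of steps 1023-1025 can be sketched as follows; the helper name and the dense H×W×C one-hot segmentation map are assumptions, and the class count is only inferred from the 4-to-14 dimension change mentioned above.

```python
import numpy as np

def paint_points(points, seg_onehot, M1, M2, img_h, img_w):
    """Concatenate each point's original attributes with the one-hot vector of the pixel it maps to.

    points:      N x 4 array (Xw, Yw, Zw, reflectivity) from the first point cloud data set
    seg_onehot:  H x W x C one-hot semantic segmentation of the synchronized image
    """
    homo = np.hstack([points[:, :3], np.ones((len(points), 1))])     # N x 4 homogeneous coordinates
    cam = (M1 @ M2 @ homo.T).T                                       # N x 3 projected coordinates
    uv = np.round(cam[:, :2] / cam[:, 2:3]).astype(int)              # pixel each point projects onto
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    onehot = np.zeros((len(points), seg_onehot.shape[2]))
    onehot[valid] = seg_onehot[uv[valid, 1], uv[valid, 0]]           # look up pixel (row=v, col=u)
    return np.hstack([points, onehot])                               # fused second point cloud data set
```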
According to the method for labeling image data provided by the embodiment of the application, the one-hot encoding of each point in the first point cloud data set is determined based on the mapping relationship between the first point cloud data set and the image data set and the one-hot encoding of each point in the image data set, and the original attributes of each point in the first point cloud data set are concatenated with this one-hot encoding to obtain the fused second point cloud data set. The attributes of the fused second point cloud data set are therefore richer, which can improve the precision of labeling the image data.
An exemplary embodiment of the present application provides for densifying the second point cloud data set resulting in an enhanced third point cloud data set. The embodiment of the present application extends from the embodiment of fig. 1, and differences between the embodiment of the present application and the embodiment of fig. 1 are mainly described below, which are not repeated.
The densification of the second point cloud data set provided in the embodiment of the present application, to obtain an enhanced third point cloud data set (i.e. step 103) includes:
And projecting a second point cloud data set of a preset frame number before the current frame to obtain an enhanced third point cloud data set.
In an embodiment, the second point cloud data set is densified by using information in the time dimension: the n frames (n may be chosen according to the practical application) of the second point cloud data set before the current frame are projected into the current frame using the M_2 matrix (M_2 in step 1021) and combined with the second point cloud data set of the current frame, so that the second point cloud data set of the current frame is densified. A schematic diagram of the second point cloud data set before densification is shown in Fig. 4-1, and a schematic diagram after densification (i.e. the enhanced third point cloud data set) is shown in Fig. 4-2; it can be seen from Figs. 4-1 and 4-2 that the amount of data in Fig. 4-2 is significantly higher than that in Fig. 4-1.
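A hedged sketch of this densification step, assuming a 4×4 homogeneous transform from each past frame into the current frame is available; the transform list here is an illustrative stand-in for whatever pose information the system provides.

```python
import numpy as np

def densify(current_points, past_frames, past_to_current):
    """Project the previous n frames into the current frame and merge them with it.

    past_frames:      list of N_i x D fused point arrays (the n frames before the current one)
    past_to_current:  list of 4 x 4 homogeneous transforms mapping each past frame into the current frame
    """
    merged = [current_points]
    for pts, T in zip(past_frames, past_to_current):
        homo = np.hstack([pts[:, :3], np.ones((len(pts), 1))])   # homogeneous xyz of the past frame
        xyz = (T @ homo.T).T[:, :3]                               # coordinates in the current frame
        merged.append(np.hstack([xyz, pts[:, 3:]]))               # keep the remaining attributes
    return np.vstack(merged)                                      # enhanced third point cloud data set
```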
According to the method for labeling the image data, disclosed by the embodiment of the application, the second point cloud data set of the preset frame number before the current frame is projected to the current frame to obtain the enhanced third point cloud data set, so that the data volume is increased, and the accuracy of labeling the image data can be improved.
Fig. 5-1 is a schematic flow chart of performing test data enhancement and target frame weighted fusion on a third point cloud data set to obtain a first target object in a first point cloud data set according to an exemplary embodiment of the present application. The embodiment of the present application shown in fig. 5-1 is extended from any of the above embodiments of the present application, and differences between the embodiment shown in fig. 5-1 and any of the above embodiments are described below for the sake of brevity.
As shown in fig. 5-1, in the method for performing test data enhancement and target frame weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set (i.e. step 104), the method includes:
step 1041, performing perturbation of a preset degree on the third point cloud data set to obtain a preset group of perturbation point cloud data sets.
In an embodiment, the preset-degree perturbation of the third point cloud data set may be a rotation of the third point cloud data set around the z-axis by preset angles. For example, the third point cloud data set is rotated around the z-axis by 10 angles (e.g. -pi/4, -pi/8, 0, pi/8, pi/4, pi×3/4, pi×7/8, pi, pi×9/8, pi×5/4); each rotation by one angle yields one group of perturbation point cloud data, so 10 groups of perturbation point cloud data sets are obtained.
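A small sketch of this test-data-enhancement perturbation; the dummy input array is only a stand-in for the enhanced third point cloud data set.

```python
import numpy as np

ANGLES = [-np.pi/4, -np.pi/8, 0.0, np.pi/8, np.pi/4,
          np.pi*3/4, np.pi*7/8, np.pi, np.pi*9/8, np.pi*5/4]      # the 10 example perturbation angles

def rotate_about_z(points, angle):
    """Rotate the xyz part of an N x D point array about the z-axis; the other attributes are unchanged."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ rot.T
    return out

third_point_cloud = np.random.rand(1000, 14)                       # dummy stand-in for the third point cloud
perturbed_sets = [rotate_about_z(third_point_cloud, a) for a in ANGLES]   # 10 perturbation point cloud sets
```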
Step 1042, pre-labeling the first target object in the first point cloud data set based on the preset laser radar deep learning model and the preset group perturbation point cloud data set to obtain a preset group initial target object.
Specifically, the preset lidar deep learning model may employ a CNN (Convolutional Neural Networks, convolutional neural network) model, which serves as the skeleton of the whole algorithm for feature extraction and is therefore also called a Backbone (backbone network) model. In an embodiment, the preset lidar deep learning model may be implemented using 3 backbones, denoted Backbone1, Backbone2 and Backbone3. Before the preset lidar deep learning model is used, it can be trained with existing labeled data, so as to improve its accuracy. The 10 groups of perturbation point cloud data sets are respectively input into Backbone1, Backbone2 and Backbone3; each backbone produces 10 groups of initial target objects, so the 3 backbones produce 30 groups of initial target objects in total. Fig. 5-2 is a schematic diagram of the 30 test data enhancement results for the same target object, and Fig. 5-3 is an effect diagram of test data enhancement in practical application.
Step 1043, matching the preset group of initial target objects, and performing weighted fusion on the successfully matched initial target objects in the preset group to obtain the first target object in the first point cloud data set.
In an embodiment, a successful match among the preset group of initial target objects means that the similarity of the initial labeling data reaches a preset similarity, where the value of the preset similarity may be set according to the actual application, for example 80%. When the initial labeling data of the preset group are initial detection frames, the match is successful when the initial detection frames reach a preset degree of coincidence, where the value of the preset degree of coincidence may also be set according to the practical application, for example 80%.
In an embodiment, the successfully matched initial target objects in the preset group of initial target objects are weighted and fused according to preset weights to obtain the first target object in the first point cloud data set, where the preset weights may be set according to the actual application, for example, the weights of the successfully matched initial target objects may all be set to 0.7. Fig. 5-4 is a schematic diagram of the target frame weighted fusion result for the same target object, and Fig. 5-5 is an effect diagram of target frame weighted fusion in practical application.
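A simplified sketch of the target frame weighted fusion for one matched group of boxes; the [x, y, z, l, w, h, yaw] layout and the circular-mean handling of the yaw angle are illustrative assumptions, and the weights could equally come from grid search as described above.

```python
import numpy as np

def fuse_matched_boxes(boxes, weights=None):
    """Weighted fusion of boxes matched to the same target; each box is [x, y, z, l, w, h, yaw].

    boxes:   K x 7 array of matched initial target boxes (e.g. from the 30 TTA/backbone outputs)
    weights: K preset weights (e.g. all 0.7, or tuned by grid search); uniform if None
    """
    boxes = np.asarray(boxes, dtype=float)
    w = np.ones(len(boxes)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = (boxes[:, :6] * w[:, None]).sum(axis=0)               # weighted average of center and size
    yaw = np.arctan2((w * np.sin(boxes[:, 6])).sum(),             # circular (angle-aware) mean of the yaw
                     (w * np.cos(boxes[:, 6])).sum())
    return np.append(fused, yaw)                                  # fused first target object box
```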
According to the method for labeling image data disclosed by the embodiment of the application, the third point cloud data set is perturbed to a preset degree to obtain the preset group of perturbation point cloud data sets, and the first target object in the first point cloud data set is pre-labeled based on the preset lidar deep learning model and the preset group of perturbation point cloud data sets to obtain the preset group of initial target objects; perturbing the third point cloud data set to a preset degree increases the amount of data, which can further improve the accuracy of labeling the image data. The preset group of initial target objects are then matched, and the successfully matched initial target objects are weighted and fused to obtain the first target object in the first point cloud data set; the matching yields more accurate data, further improving the accuracy of labeling the image data.
FIG. 6-1 is a flow chart of another method for labeling image data according to an exemplary embodiment of the present application. The embodiment of the present application shown in fig. 6-1 is extended from any of the above embodiments of the present application, and differences between the embodiment of fig. 6-1 and any of the above embodiments are described below for the sake of brevity.
As shown in fig. 6-1, after performing test data enhancement and target frame weighted fusion on the third point cloud data set to obtain the first target object in the first point cloud data set (i.e. step 105), the method further includes:
Step 106, determining an occlusion relationship of the first target object in the first point cloud data set.
Specifically, mapping the high-accuracy detection result of the laser radar into the camera coordinate system as the ground truth for camera 3D detection still has certain defects; for example, the laser radar does not account for occlusion in the image. In this embodiment, the occlusion relationship of the first target object in the first point cloud data set is determined using the principle of ray casting (Ray Casting). The specific process is as follows: a top view of the first point cloud data set is obtained (by discarding the z-axis data of the first point cloud data set), and the occlusion relationship is calculated from the origin of the top view; if another object lies on the line connecting the first target object and the origin, the first target object is occluded, and if no other object lies on that line, the first target object is not occluded. Each first target object in the top view is traversed to obtain the occlusion relationships of all first target objects in the first point cloud data set.
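A rough top-view sketch of this ray-casting idea: a target counts as occluded if another object's center lies near the segment from the origin to the target and closer to the origin; the lateral tolerance is an illustrative parameter, not a value from the patent.

```python
import numpy as np

def is_occluded(target_xy, other_centers_xy, tol=1.0):
    """Top-view test: True if another object lies near the origin-to-target line and closer to the origin."""
    t = np.asarray(target_xy, dtype=float)
    t_dist = np.linalg.norm(t)
    if t_dist == 0.0:
        return False
    direction = t / t_dist
    for o in np.asarray(other_centers_xy, dtype=float):
        along = float(o @ direction)                       # distance of the other object along the ray
        lateral = np.linalg.norm(o - along * direction)    # distance of the other object off the ray
        if 0.0 < along < t_dist and lateral < tol:
            return True
    return False

# Traversing every first target object in the top view with this test yields the occlusion relationships.
```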
Step 107, determining a matching relationship of the first target object in the first point cloud data set.
In an embodiment, the image data set to be labeled is input into a preset camera deep learning model, and the image data set is labeled by the preset camera deep learning model to obtain camera-predicted target objects. The preset camera deep learning model may be implemented with a CNN model, and a suitable model may be chosen according to the practical application, which is not limited here. The camera-predicted target objects are matched with the second target objects in the image data set (i.e. the projections of the first target objects in the first point cloud data set onto the picture, obtained through the projection matrix of step 1021), and the matching relationship of the first target objects in the first point cloud data set is determined according to the degree of matching between the camera-predicted target objects and the second target objects. A matching-degree threshold may be set according to the actual application: when the degree of matching between a camera-predicted target object and a second target object in the image data set is greater than or equal to the threshold, they are determined to match; when it is smaller than the threshold, they are determined not to match. If the camera-predicted target object and the second target object in the image data set are both displayed as detection frames, the degree of matching can be determined by the degree of coincidence of the detection frames.
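A sketch of this matching step, using the 2D IoU of the projected lidar box and the camera-predicted box as the degree of coincidence; the 0.5 threshold is an illustrative value, since the patent leaves the threshold to the application.

```python
def iou_2d(a, b):
    """Intersection over union of two boxes given as (u_min, v_min, u_max, v_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_matched(projected_box, camera_boxes, thresh=0.5):
    """A projected lidar target matches if some camera-predicted box coincides with it enough."""
    return any(iou_2d(projected_box, cb) >= thresh for cb in camera_boxes)
```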
And step 108, filtering the first target object in the first point cloud data set based on the shielding relation and the matching relation to obtain an optimized target object.
Specifically, if the first target object is occluded or the first target object does not match, the occluded first target object or the first target object that does not match is filtered out.
It should be noted that, the steps 106 and 107 are not limited in sequence, and may be set according to actual application situations.
Referring to fig. 6-2, for a schematic view of an output optimized target object, in which an object in a white frame represents a second target object in the image dataset (i.e., a projection of a first target object in the first point cloud dataset onto a picture), and an object in a black frame represents a camera predicted target object.
According to the method for labeling image data provided by the embodiment of the application, the occlusion relationship and the matching relationship of the first target objects in the first point cloud data set are determined, and the first target objects are filtered based on the occlusion relationship and the matching relationship to obtain the optimized target objects. Occluded or unmatched first target objects can thus be filtered out, which can further improve the accuracy of labeling the image data.
Fig. 7 is a flowchart of yet another method for labeling image data according to an exemplary embodiment of the present application. The embodiment of the present application shown in fig. 7 is extended from the embodiment of fig. 6-1, and differences between the embodiment of fig. 7 and the embodiment of fig. 6-1 are emphasized below, which are not repeated.
As shown in fig. 7, after filtering the first target object in the first point cloud data set based on the occlusion relationship and the matching relationship to obtain the optimized target object (i.e. step 108), the method further includes:
and step 109, training a preset camera deep learning model by using the optimized target object.
The preset camera deep learning model may be a CNN model, preferably the same model as in step 107. The optimized target objects have high accuracy, are obtained by automatic labeling, and are available in large quantity, so training the preset camera deep learning model with the optimized target objects can improve the robustness and accuracy of the model.
And 110, marking the image data by using a preset camera deep learning model to obtain a camera target object.
Specifically, an image dataset to be marked is input into a trained preset camera deep learning model, and the image dataset is marked through the preset camera deep learning model to obtain a camera target object.
Step 111, selecting a target object of a difficult scene by using an active learning method based on the camera target object and the optimized target object.
Here, a difficult scene is a special scene that occurs less frequently than normal scenes, for example: police inspecting vehicles, congested traffic, and the like. Based on the projection matrix between the camera used to acquire the image data set and the laser radar used to acquire the point cloud data set, a correspondence between the camera target objects and the optimized target objects is established. The camera target objects are then compared with the corresponding optimized target objects using an active learning (Active Learning) method; if the difference between the camera target object and the optimized target object in a certain frame is larger than a preset difference threshold (whose value may be set according to the actual application), the target object in that frame is taken as a target object of a difficult scene. The target objects of difficult scenes are subsequently used as samples to train the camera deep learning model and the lidar deep learning model, improving the accuracy of the models.
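A hedged sketch of this difficult-scene selection: frames whose camera predictions and optimized labels disagree beyond a threshold are flagged; the disagreement measure (mean 1 - IoU over corresponding 2D boxes) and the threshold are assumptions for illustration only.

```python
def _iou(a, b):
    """2D IoU of boxes (u_min, v_min, u_max, v_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def select_difficult_frames(frames, diff_thresh=0.5):
    """frames: list of (frame_id, camera_boxes, optimized_boxes) with boxes already put in correspondence."""
    difficult = []
    for frame_id, cam_boxes, opt_boxes in frames:
        if not cam_boxes or not opt_boxes:
            continue
        # Mean disagreement over corresponding pairs; 1 - IoU is the illustrative difference measure.
        diffs = [1.0 - _iou(c, o) for c, o in zip(cam_boxes, opt_boxes)]
        if sum(diffs) / len(diffs) > diff_thresh:
            difficult.append(frame_id)
    return difficult
```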
According to the method for labeling image data, the optimized target objects are used to train the preset camera deep learning model, the preset camera deep learning model is used to label the image data to obtain the camera target objects, and the target objects of difficult scenes are selected based on the camera target objects and the optimized target objects using an active learning method. The target objects of difficult scenes can subsequently be used to train the camera deep learning model and the lidar deep learning model, thereby improving the accuracy of the models.
In order to further improve the labeling accuracy, the target objects of difficult scenes can be manually inspected (reviewed), and the camera deep learning model and the lidar deep learning model can be trained with the manually reviewed labeling data, further improving the accuracy of the models and realizing a closed data loop. Fig. 8 is a schematic diagram of the overall architecture for labeling image data according to an exemplary embodiment of the present application, where module 10 represents training the preset lidar deep learning model, module 11 represents automatically labeling the image data (corresponding to steps 101-105), module 12 represents post-processing the labeling results (corresponding to steps 106-108), module 13 represents training the preset camera deep learning model (corresponding to step 109), module 14 represents data filtering (corresponding to steps 110-111), and module 15 represents manual review (optional). From Fig. 8 it can be seen that module 10 can use manually labeled data at the initial stage of training the preset lidar deep learning model and later directly use the labeling data manually reviewed by module 15; if module 15 is not provided, the labeling data filtered by module 14 can also be used directly, realizing a closed data loop.
Referring to fig. 9, a schematic diagram of an automatic labeling process for labeling image data (refinement of block 11 in fig. 8) according to an exemplary embodiment of the present application, block 20 represents a process of extracting frames from an original image dataset and an original point cloud dataset to obtain the image dataset and the first point cloud dataset, block 21 represents a process of performing multi-sensor fusion (using Point Painting) on the image dataset and the first point cloud dataset to obtain a fused second point cloud dataset, block 22 represents a process of performing densification on the fused second point cloud dataset to obtain an enhanced third point cloud dataset (block 23), block 24 represents a process of enhancing test data (Test Time Augmentation, TTA), block 25 represents a process of weighted fusion of target frames (Weighted Box Fusion), and block 26 represents a 3D automatic labeling result (first target object in the first point cloud dataset).
Exemplary apparatus
Fig. 10 is a schematic structural diagram of an apparatus for labeling image data according to an exemplary embodiment of the present application. The device for labeling image data provided by the embodiment of the application can be applied to the technical field of computer vision, as shown in fig. 10, and the device for labeling image data provided by the embodiment of the application comprises:
The labeling data determining module 201 is configured to determine an image dataset to be labeled and a first point cloud dataset that is time-synchronized with the image dataset, where the first point cloud dataset is acquired by a laser radar;
the multi-sensor fusion module 202 is configured to perform multi-sensor fusion on the image dataset and the first point cloud dataset to obtain a fused second point cloud dataset;
A densification module 203, configured to densify the second point cloud data set to obtain an enhanced third point cloud data set;
The data enhancement weighted fusion module 204 is configured to perform test data enhancement and target frame weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set;
The target object determination module 205 is configured to determine a second target object in the image dataset based on the first target object.
Fig. 11 is a schematic structural diagram of a marking data determining module 201 in an apparatus for marking image data according to an exemplary embodiment of the present application. The embodiment of fig. 11 of the present application is extended from the embodiment of fig. 10 of the present application, and differences between the embodiment of fig. 11 and the embodiment of fig. 10 are emphasized below, and are not repeated.
As shown in fig. 11, in the apparatus for labeling image data provided in the embodiment of the present application, the labeling data determining module 201 includes:
A time synchronization control unit 2011, configured to control time synchronization of the camera and the laser radar;
An original data set determining unit 2012 configured to determine an original image data set collected by the camera and an original point cloud data set collected by the laser radar;
A data set determining unit 2013, configured to extract frames from the original image data set and the original point cloud data set at preset time intervals, and obtain an image data set and a first point cloud data set.
Fig. 12 is a schematic structural diagram of a multi-sensor fusion module 202 in an apparatus for labeling image data according to an exemplary embodiment of the present application. The embodiment of fig. 12 of the present application is extended from the embodiment of fig. 10 of the present application, and differences between the embodiment of fig. 12 and the embodiment of fig. 10 are emphasized below, and are not repeated.
As shown in fig. 12, in the apparatus for labeling image data provided by the embodiment of the present application, the multi-sensor fusion module 202 includes:
a projection matrix determination unit 2021 for determining a projection matrix between a camera for acquiring an image dataset and a lidar for acquiring a point cloud dataset;
a semantic segmentation unit 2022, configured to perform semantic segmentation on the image dataset to obtain the one-hot encoding of each point in the image dataset;
A mapping relation determining unit 2023 for determining a mapping relation between the first point cloud data set and the image data set;
a one-hot encoding determination unit 2024 for determining the one-hot encoding of each point in the first point cloud data set based on the mapping relationship between the first point cloud data set and the image data set, and the one-hot encoding of each point in the image data set;
And a linking unit 2025, configured to concatenate the original attributes of each point in the first point cloud data set with the one-hot encoding of each point in the first point cloud data set, so as to obtain the fused second point cloud data set.
An exemplary embodiment of the present application provides a schematic structural diagram of the densification module 203 in the apparatus for labeling image data. The embodiment of the present application extends from the embodiment of fig. 10, and differences between the embodiment of the present application and the embodiment of fig. 10 are mainly described below, which are not repeated.
In the device for labeling image data provided by the embodiment of the present application, the densification module 203 is specifically configured to project the second point cloud data set of the preset frame number before the current frame to the current frame, so as to obtain the enhanced third point cloud data set.
Fig. 13 is a schematic structural diagram of a data enhancement weighted fusion module 204 in an apparatus for labeling image data according to an exemplary embodiment of the present application. The embodiment of fig. 13 of the present application extends from any of the above embodiments of the present application, and differences between the embodiment of fig. 13 and any of the above embodiments are described below with emphasis, and the details of the differences are not repeated.
As shown in fig. 13, in the apparatus for labeling image data provided in the embodiment of the present application, the data enhancement weighted fusion module 204 includes:
a perturbation unit 2041, configured to perturb the third point cloud data set to a preset degree, so as to obtain preset groups of perturbed point cloud data sets;
a pre-labeling unit 2042, configured to pre-label the first target object in the first point cloud data set based on a preset lidar deep learning model and the preset groups of perturbed point cloud data sets, so as to obtain preset groups of initial target objects;
and a weighted fusion unit 2043, configured to match the preset groups of initial target objects and perform weighted fusion on the successfully matched initial target objects, so as to obtain the first target object in the first point cloud data set.
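For illustration, the sketch below walks through units 2041-2043: the third point cloud data set is perturbed several times (here, by small rotations about the vertical axis), the lidar deep learning model pre-labels each perturbed copy, the resulting boxes are mapped back and matched by center distance, and the matched boxes are fused by a score-weighted average. The perturbation set, the matching rule, and the box layout [x, y, z, l, w, h, yaw, score] are assumptions of this sketch rather than details from the application; the `detector` callable stands in for the preset lidar deep learning model.

```python
import numpy as np

def tta_weighted_fusion(point_cloud, detector,
                        yaw_perturbations=(-0.05, 0.0, 0.05), match_dist=1.0):
    """Test-time augmentation with weighted box fusion (a sketch of units
    2041-2043; perturbations, matching rule and box format are assumptions)."""
    groups = []
    for yaw in yaw_perturbations:
        # Perturb the point cloud by a small rotation about the z axis (unit 2041).
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        perturbed = point_cloud.copy()
        perturbed[:, :3] = perturbed[:, :3] @ rot.T

        # Pre-label with the lidar deep learning model (unit 2042).
        boxes = detector(perturbed).copy()    # (M, 8) boxes in the perturbed frame
        boxes[:, :3] = boxes[:, :3] @ rot     # undo the rotation of the box centers
        boxes[:, 6] -= yaw                    # undo the rotation of the headings
        groups.append(boxes)

    # Match boxes across groups by center distance and fuse them (unit 2043).
    fused = []
    for box in groups[0]:
        matched = [box]
        for other in groups[1:]:
            if len(other) == 0:
                continue
            d = np.linalg.norm(other[:, :3] - box[:3], axis=1)
            if d.min() < match_dist:
                matched.append(other[d.argmin()])
        matched = np.stack(matched)
        weights = matched[:, 7] / max(matched[:, 7].sum(), 1e-6)  # score-weighted average
        fused.append(weights @ matched)
    return np.array(fused)
```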
Fig. 14 is a schematic structural diagram of another apparatus for labeling image data according to an exemplary embodiment of the present application. The embodiment of Fig. 14 extends any of the above embodiments; the description below focuses on the differences, and content common to the embodiments is not repeated.
As shown in fig. 14, in the apparatus for labeling image data provided by the embodiment of the present application, the apparatus further includes:
An occlusion relationship determination module 206, configured to determine an occlusion relationship of a first target object in the first point cloud data set;
A matching relationship determining module 207, configured to determine a matching relationship of the first target object in the first point cloud data set;
The filtering module 208 is configured to filter the first target object in the first point cloud data set based on the occlusion relationship and the matching relationship, so as to obtain an optimized target object.
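As a hedged illustration of modules 206-208, the sketch below keeps only those pre-labeled targets that are not heavily occluded and that have a successful match. The occlusion ratio, the threshold, and the dictionary layout are assumptions introduced for this sketch; the application does not define the occlusion or matching criteria at this level of detail.

```python
def filter_targets(targets, occlusion, matched_ids, max_occlusion=0.7):
    """Filter pre-labeled targets using occlusion and matching relationships
    (a sketch of modules 206-208; threshold and data layout are assumptions).

    targets:     dict of target_id -> box
    occlusion:   dict of target_id -> occlusion ratio in [0, 1]
    matched_ids: set of target ids that were successfully matched
    """
    optimized = {}
    for tid, box in targets.items():
        # Drop heavily occluded targets and targets without a match.
        if occlusion.get(tid, 0.0) > max_occlusion:
            continue
        if tid not in matched_ids:
            continue
        optimized[tid] = box
    return optimized
```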
Fig. 15 is a schematic structural diagram of still another apparatus for labeling image data according to an exemplary embodiment of the present application. The embodiment of Fig. 15 extends the embodiment of Fig. 14; the description below focuses on their differences, and content common to both embodiments is not repeated.
As shown in fig. 15, in the apparatus for labeling image data provided in the embodiment of the present application, the apparatus further includes:
The training module 209 is configured to train a preset camera deep learning model by using the optimized target object;
The labeling module 210 is configured to label the image data by using the preset camera deep learning model, so as to obtain a camera target object;
The selection module 211 is configured to select a target object of the difficult scene by using an active learning method based on the camera target object and the optimized target object.
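For illustration, the sketch below shows one way the selection module 211 could realize the active-learning step: frames are scored by how strongly the camera target objects disagree with the optimized target objects, and the highest-scoring frames are selected as difficult scenes. The disagreement score and the IoU-based matching are assumptions of this sketch, not a description of the application's actual selection rule.

```python
def select_hard_scenes(camera_targets, optimized_targets, iou_fn,
                       iou_thresh=0.5, top_k=100):
    """Pick frames where the camera model disagrees most with the optimized
    labels (a sketch of module 211; the disagreement score is an assumption).

    camera_targets / optimized_targets: dict of frame_id -> list of boxes
    iou_fn: callable returning the IoU of two boxes
    """
    scored = []
    for frame_id, labels in optimized_targets.items():
        detections = camera_targets.get(frame_id, [])
        # Count optimized labels the camera model missed (no detection above the IoU threshold).
        missed = sum(
            1 for gt in labels
            if not any(iou_fn(gt, det) >= iou_thresh for det in detections)
        )
        disagreement = missed + max(0, len(detections) - len(labels))  # misses + spurious boxes
        scored.append((disagreement, frame_id))

    scored.sort(key=lambda item: item[0], reverse=True)
    return [frame_id for _, frame_id in scored[:top_k]]
```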
It should be understood that, for the operations and functions of the labeling data determining module 201, the multi-sensor fusion module 202, the densification module 203, the data enhancement weighted fusion module 204, the target object determining module 205, the occlusion relation determining module 206, the matching relation determining module 207, the filtering module 208, the training module 209, the labeling module 210, and the selecting module 211 in the apparatus for labeling image data provided in figs. 10 to 15, of the time synchronization control unit 2011, the raw dataset determining unit 2012, and the dataset determining unit 2013 in the labeling data determining module 201, of the projection matrix determining unit 2021, the semantic segmentation unit 2022, the mapping relation determining unit 2023, the one-hot encoding determination unit 2024, and the linking unit 2025 in the multi-sensor fusion module 202, and of the perturbation unit 2041, the pre-labeling unit 2042, and the weighted fusion unit 2043 in the data enhancement weighted fusion module 204, reference may be made to the method for labeling image data provided in figs. 1 to 9, which is not repeated here.
Exemplary electronic device
Fig. 16 illustrates a block diagram of an electronic device of an embodiment of the application.
As shown in fig. 16, the electronic device 31 includes one or more processors 311 and memory 312.
The processor 311 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 31 to perform desired functions.
Memory 312 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 311 to implement the methods of annotating image data of the various embodiments of the present application described above and/or other desired functions. Various contents, such as input signals and operation results, may also be stored in the computer-readable storage medium.
In one example, the electronic device 31 may further include: an input device 313 and an output device 314, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, the input device 313 may be a camera, a microphone, or a microphone array, for capturing image or sound input signals. When the electronic device is a stand-alone device, the input device 313 may be a communication network connector for receiving the acquired input signals from a network processor.
In addition, the input device 313 may also include, for example, a keyboard, a mouse, and the like.
The output device 314 may output various kinds of information to the outside, including the determined output voltage and output current information, and the like. The output device 314 may include, for example, a display, speakers, a printer, a communication network and remote output devices connected thereto, and the like.
Of course, only some of the components of the electronic device 31 relevant to the present application are shown in fig. 16 for simplicity, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 31 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in a method of annotating image data described in the "exemplary methods" section of the present description.
Program code for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of a method of annotating image data according to the various embodiments of the present application described in the above "exemplary methods" section of the present specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be construed as necessarily possessed by the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in the present application are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that, in the apparatuses, devices, and methods of the present application, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
Claims (9)
1. A method of annotating image data, comprising:
Determining an image data set to be marked and a first point cloud data set which is synchronous in time with the image data set, wherein the first point cloud data set is acquired through a laser radar;
performing multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set;
Densifying the second point cloud data set to obtain an enhanced third point cloud data set;
Performing test-time data augmentation and target box weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set;
Annotating a second target object in the image dataset based on the first target object;
wherein performing test-time data augmentation and target box weighted fusion on the third point cloud data set to obtain the first target object in the first point cloud data set comprises the following steps:
Performing a perturbation of a preset degree on the third point cloud data set to obtain preset groups of perturbed point cloud data sets;
Pre-labeling a first target object in the first point cloud data set based on a preset laser radar deep learning model and the preset groups of perturbed point cloud data sets to obtain preset groups of initial target objects;
And matching the preset groups of initial target objects, and performing weighted fusion on the successfully matched initial target objects, to obtain the first target object in the first point cloud data set.
2. The method of claim 1, wherein determining an image dataset that needs to be annotated and a first point cloud dataset that is synchronized in time with the image dataset comprises:
controlling the camera to be time-synchronous with the laser radar;
determining an original image data set acquired by the camera and an original point cloud data set acquired by the laser radar;
and extracting frames from the original image data set and the original point cloud data set according to a preset time interval to obtain the image data set and the first point cloud data set.
3. The method of claim 1, wherein multi-sensor fusion of the image dataset and the first point cloud dataset to obtain a fused second point cloud dataset, comprising:
Determining a projection matrix between a camera for acquiring the image dataset and the lidar;
Performing semantic segmentation on the image data set to obtain a one-hot encoding of each point in the image data set;
determining a mapping relationship between the first point cloud dataset and the image dataset based on the projection matrix;
Determining the one-hot encoding of each point in the first point cloud data set based on the mapping relationship between the first point cloud data set and the image data set and the one-hot encoding of each point in the image data set;
And concatenating the original attributes of each point in the first point cloud data set with the one-hot encoding of that point to obtain the fused second point cloud data set.
4. The method of claim 1, wherein densifying the second point cloud data set results in an enhanced third point cloud data set, comprising:
And projecting the second point cloud data sets of a preset number of frames before the current frame onto the current frame to obtain the enhanced third point cloud data set.
5. The method of any of claims 1-4, further comprising, after performing test-time data augmentation and target box weighted fusion on the third point cloud data set to obtain the first target object in the first point cloud data set:
determining an occlusion relationship of a first target object in the first point cloud dataset;
determining a matching relationship of a first target object in the first point cloud data set;
And filtering the first target object in the first point cloud data set based on the shielding relation and the matching relation to obtain an optimized target object.
6. The method of claim 5, further comprising, after filtering the first target object in the first point cloud data set based on the occlusion relationship and the matching relationship to obtain the optimized target object:
Training a preset camera deep learning model by using the optimized target object;
labeling the image data by using the preset camera deep learning model to obtain a camera target object;
and selecting a target object of the difficult scene by using an active learning method based on the camera target object and the optimized target object.
7. An apparatus for labeling image data, comprising:
the marking data determining module is used for determining an image data set to be marked and a first point cloud data set which is synchronous in time with the image data set, wherein the first point cloud data set is acquired through a laser radar;
The multi-sensor fusion module is used for carrying out multi-sensor fusion on the image data set and the first point cloud data set to obtain a fused second point cloud data set;
the densification module is used for densifying the second point cloud data set to obtain an enhanced third point cloud data set;
The data enhancement weighted fusion module is used for performing test-time data augmentation and target box weighted fusion on the third point cloud data set to obtain a first target object in the first point cloud data set;
A target object determination module for determining a second target object in the image dataset based on the first target object;
The data enhancement weighted fusion module comprises:
The perturbation unit is used for performing a perturbation of a preset degree on the third point cloud data set to obtain preset groups of perturbed point cloud data sets;
The pre-labeling unit is used for pre-labeling the first target object in the first point cloud data set based on a preset laser radar deep learning model and the preset groups of perturbed point cloud data sets to obtain preset groups of initial target objects;
and the weighted fusion unit is used for matching the preset groups of initial target objects and performing weighted fusion on the successfully matched initial target objects, to obtain the first target object in the first point cloud data set.
8. A computer readable storage medium storing a computer program for performing the method of annotating image data of any of the previous claims 1-6.
9. An electronic device, the electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
The processor being configured to perform the method of annotating image data as claimed in any one of the preceding claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110583714.XA CN113281780B (en) | 2021-05-27 | 2021-05-27 | Method and device for marking image data and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113281780A CN113281780A (en) | 2021-08-20 |
CN113281780B true CN113281780B (en) | 2024-04-30 |
Family
ID=77281966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110583714.XA Active CN113281780B (en) | 2021-05-27 | 2021-05-27 | Method and device for marking image data and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113281780B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115098713A (en) * | 2022-06-28 | 2022-09-23 | 重庆长安汽车股份有限公司 | Data labeling method, system, device and medium |
CN115880536B (en) * | 2023-02-15 | 2023-09-01 | 北京百度网讯科技有限公司 | Data processing method, training method, target object detection method and device |
CN117671320B (en) * | 2023-05-30 | 2024-10-15 | 北京辉羲智能信息技术有限公司 | Point cloud three-dimensional target automatic labeling method and system based on multi-model fusion |
CN116665212B (en) * | 2023-07-31 | 2023-10-13 | 福思(杭州)智能科技有限公司 | Data labeling method, device, processing equipment and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5579639B2 (en) * | 2011-02-26 | 2014-08-27 | ジーイー・メディカル・システムズ・グローバル・テクノロジー・カンパニー・エルエルシー | Image processing apparatus, program, and image diagnostic apparatus |
US20180136332A1 (en) * | 2016-11-15 | 2018-05-17 | Wheego Electric Cars, Inc. | Method and system to annotate objects and determine distances to objects in an image |
CN108198145B (en) * | 2017-12-29 | 2020-08-28 | 百度在线网络技术(北京)有限公司 | Method and device for point cloud data restoration |
US11067693B2 (en) * | 2018-07-12 | 2021-07-20 | Toyota Research Institute, Inc. | System and method for calibrating a LIDAR and a camera together using semantic segmentation |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108573279A (en) * | 2018-03-19 | 2018-09-25 | 精锐视觉智能科技(深圳)有限公司 | Image labeling method and terminal device |
CN109003325A (en) * | 2018-06-01 | 2018-12-14 | 网易(杭州)网络有限公司 | A kind of method of three-dimensional reconstruction, medium, device and calculate equipment |
CN109544456A (en) * | 2018-11-26 | 2019-03-29 | 湖南科技大学 | The panorama environment perception method merged based on two dimensional image and three dimensional point cloud |
CN109978063A (en) * | 2019-03-28 | 2019-07-05 | 厦门美图之家科技有限公司 | A method of generating the alignment model of target object |
CN110378196A (en) * | 2019-05-29 | 2019-10-25 | 电子科技大学 | A kind of road vision detection method of combination laser point cloud data |
WO2021016891A1 (en) * | 2019-07-30 | 2021-02-04 | 深圳市大疆创新科技有限公司 | Method and apparatus for processing point cloud |
CN110598743A (en) * | 2019-08-12 | 2019-12-20 | 北京三快在线科技有限公司 | Target object labeling method and device |
WO2021041854A1 (en) * | 2019-08-30 | 2021-03-04 | Nvidia Corporation | Object detection and classification using lidar range images for autonomous machine applications |
GB202019873D0 (en) * | 2019-12-17 | 2021-01-27 | Motional Ad Llc | Automated object annotation using fused camera/lidar data points |
CN112232349A (en) * | 2020-09-23 | 2021-01-15 | 成都佳华物链云科技有限公司 | Model training method, image segmentation method and device |
CN112232422A (en) * | 2020-10-20 | 2021-01-15 | 北京大学 | Target pedestrian re-identification method and device, electronic equipment and storage medium |
CN112381873A (en) * | 2020-10-23 | 2021-02-19 | 北京亮道智能汽车技术有限公司 | Data labeling method and device |
CN112395962A (en) * | 2020-11-03 | 2021-02-23 | 北京京东乾石科技有限公司 | Data augmentation method and device, and object identification method and system |
CN112836768A (en) * | 2021-03-08 | 2021-05-25 | 北京电子工程总体研究所 | Data balancing method and system, computer equipment and medium |
Non-Patent Citations (3)
Title |
---|
Three-dimensional object recognition and model segmentation method based on point cloud data; Niu Chengeng; Liu Yujie; Li Zongmin; Li Hua; Journal of Graphics; 2019-04-15 (Issue 02); full text *
Yang Zhen. Image Feature Processing Technology and Applications. Scientific and Technical Documentation Press, 2020, p. 160. *
Saliency detection fusing object enhancement and sparse reconstruction; Guo Pengfei; Jin Qiu; Liu Wanjun; Journal of Image and Graphics; 2017-09-16 (Issue 09); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113281780A (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113281780B (en) | Method and device for marking image data and electronic equipment | |
CN111222395B (en) | Target detection method and device and electronic equipment | |
CN109376667B (en) | Target detection method and device and electronic equipment | |
Zhang et al. | Robust metric reconstruction from challenging video sequences | |
Biasutti et al. | Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net | |
CN113052066B (en) | Multi-mode fusion method based on multi-view and image segmentation in three-dimensional target detection | |
CN110853085B (en) | Semantic SLAM-based mapping method and device and electronic equipment | |
CN115797736B (en) | Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium | |
CN111383204A (en) | Video image fusion method, fusion device, panoramic monitoring system and storage medium | |
CN107993256A (en) | Dynamic target tracking method, apparatus and storage medium | |
Farag | A lightweight vehicle detection and tracking technique for advanced driving assistance systems | |
CN112562093A (en) | Object detection method, electronic medium, and computer storage medium | |
CN115719436A (en) | Model training method, target detection method, device, equipment and storage medium | |
CN118071999B (en) | Multi-view 3D target detection method based on sampling self-adaption continuous NeRF | |
CN116012712A (en) | Object general feature-based target detection method, device, equipment and medium | |
CN108229281B (en) | Neural network generation method, face detection device and electronic equipment | |
CN111914841B (en) | CT image processing method and device | |
CN117789160A (en) | Multi-mode fusion target detection method and system based on cluster optimization | |
CN110849380B (en) | Map alignment method and system based on collaborative VSLAM | |
CN116343143A (en) | Target detection method, storage medium, road side equipment and automatic driving system | |
CN113869163B (en) | Target tracking method and device, electronic equipment and storage medium | |
CN114359891A (en) | Three-dimensional vehicle detection method, system, device and medium | |
CN117975383B (en) | Vehicle positioning and identifying method based on multi-mode image fusion technology | |
KR101517815B1 (en) | Method for Real Time Extracting Object and Surveillance System using the same | |
CN114396911B (en) | Obstacle ranging method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||