CN113808186B - Training data generation method and device and electronic equipment

Training data generation method and device and electronic equipment

Info

Publication number
CN113808186B
Authority
CN
China
Prior art keywords
frame
dimensional
projection
labeling
target
Legal status
Active
Application number
CN202110238897.1A
Other languages
Chinese (zh)
Other versions
CN113808186A (en)
Inventor
安耀祖
许新玉
孔旗
Current Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202110238897.1A
Publication of CN113808186A
Application granted
Publication of CN113808186B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/521 Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Abstract

The disclosure provides a training data generation method, a training data generation device and electronic equipment. The training data generation method comprises the following steps: acquiring target three-dimensional point cloud data corresponding to a target image, wherein the target three-dimensional point cloud data comprises three-dimensional annotation frames of a plurality of objects and annotation results corresponding to each three-dimensional annotation frame; acquiring a first three-dimensional annotation frame in the identification range of the target image; processing two or more first three-dimensional annotation frames of the same object to obtain a preset number of second three-dimensional annotation frames of the same object; and labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, wherein the training data are used for training the target detection model. The method and the device can automatically and efficiently generate the training data for training the monocular three-dimensional target detection model.

Description

Training data generation method and device and electronic equipment
Technical Field
The disclosure relates to the field of information technology, and in particular relates to a training data generation method, a training data generation device and electronic equipment for generating training data of a monocular three-dimensional target detection model.
Background
Monocular three-dimensional object detection (Monocular 3D Object Detection) is a technology that outputs the category of a target object and its precise length, width, height, rotation angle and other information in three-dimensional space using only the images or video sequences captured by a monocular camera, and it is widely applied in fields such as vehicle autonomous driving systems, intelligent robots, intelligent video surveillance and intelligent transportation. Because only one visual sensor is needed, it has a simple structure and simple camera calibration, and, compared with three-dimensional object detection realized with multi-line lidar in the field of autonomous driving perception, it offers the great advantages of dense information and low cost.
However, an excellent and stable monocular three-dimensional object detection model requires a large amount of scene-rich training data. In the related art, the ability of a monocular three-dimensional object detection model to recognize three-dimensional information from a two-dimensional image is usually trained by constructing a three-dimensional model (for example, a CAD model) of an object, so the training data is limited and the cost of generating and using it is high.
Therefore, a method capable of producing training data of a monocular three-dimensional object detection model in a large scale at low cost is required.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a training data generating method, apparatus and electronic device for generating training data of a monocular three-dimensional object detection model, which overcome, at least in part, the problem of scarcity of training data of a monocular three-dimensional object detection model due to limitations and drawbacks of the related art.
According to a first aspect of an embodiment of the present disclosure, there is provided a training data generating method, including: acquiring target three-dimensional point cloud data corresponding to a target image, wherein the target three-dimensional point cloud data comprises three-dimensional annotation frames of a plurality of objects and annotation results corresponding to each three-dimensional annotation frame; acquiring a first three-dimensional annotation frame in the identification range of the target image; processing two or more first three-dimensional annotation frames of the same object to obtain a preset number of second three-dimensional annotation frames of the same object; and labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, wherein the training data are used for training the target detection model.
In an exemplary embodiment of the present disclosure, the acquiring a first three-dimensional annotation frame within the recognition range of the target image includes: acquiring a projection frame of each three-dimensional annotation frame in the target three-dimensional point cloud data on a first projection plane and the center point coordinates of the projection frame, wherein the first projection plane is the shooting plane corresponding to the target image; if a center point coordinate is outside the display coordinate range of the target image, deleting the three-dimensional annotation frame corresponding to that center point coordinate; if the overlapping degree of two projection frames is larger than a first preset value, deleting, of the two three-dimensional annotation frames corresponding to the two projection frames, the one whose three-dimensional center point is farther from the first projection plane; and determining the remaining three-dimensional annotation frames as the first three-dimensional annotation frames.
In an exemplary embodiment of the present disclosure, the processing two or more first three-dimensional labeling frames of the same object to obtain a preset number of second three-dimensional labeling frames of the same object includes: acquiring a first projection frame and a second projection frame of two first three-dimensional labeling frames on a second projection plane, wherein the second projection plane is the shooting top view plane corresponding to the target image; when the overlapping degree of the first projection frame and the second projection frame is larger than a second preset value and smaller than a third preset value, deleting the first three-dimensional labeling frame corresponding to the smaller of the first projection frame and the second projection frame; when the overlapping degree of the first projection frame and the second projection frame is larger than or equal to the third preset value, acquiring a third projection frame and a fourth projection frame of the two first three-dimensional labeling frames on a first projection plane, the first projection plane being the shooting plane corresponding to the target image; and deleting the first three-dimensional labeling frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition.
In an exemplary embodiment of the present disclosure, the deleting the first three-dimensional labeling frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition includes: determining a first center point height and a first height corresponding to the third projection frame, and determining a first vertex height of the third projection frame according to the first center point height and the first height; determining a second center point height and a second height corresponding to the fourth projection frame, and determining a second vertex height of the fourth projection frame according to the second center point height and the second height; deleting the first three-dimensional labeling frame corresponding to the third projection frame when the first center point height is larger than the second vertex height; and deleting the first three-dimensional labeling frame corresponding to the fourth projection frame when the second center point height is larger than the first vertex height.
In an exemplary embodiment of the present disclosure, after the obtaining a preset number of second three-dimensional labeling frames of the same object, the method further includes: and when the labeling result of the second three-dimensional labeling frame is matched with a target object, updating the labeling result according to the size of the second three-dimensional labeling frame.
In an exemplary embodiment of the disclosure, the updating the labeling result according to the size of the second three-dimensional labeling frame includes: acquiring a target projection frame of the second three-dimensional annotation frame on a second projection surface, wherein the second projection surface is a shooting overlook surface corresponding to the target image; if the length of the target projection frame in the normal direction of the second projection surface is smaller than a fourth preset value, updating the labeling result of a second three-dimensional labeling frame corresponding to the target projection frame into a first object; if the length of the target projection frame in the normal direction of the second projection surface is larger than a fifth preset value, updating the labeling result of a second three-dimensional labeling frame corresponding to the target projection frame into a second object; and if the length of the target projection frame in the normal direction of the second projection surface is larger than or equal to the fourth preset value and smaller than or equal to the fifth preset value, updating the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame into a third object.
In an exemplary embodiment of the disclosure, the labeling the target image according to the labeling result of each of the second three-dimensional labeling frames to generate training data includes: obtaining a target object corresponding to each second three-dimensional annotation frame and preset parameters, wherein the preset parameters at least comprise height information and distance information; determining the position of each target object in the target image; and generating the training data according to the target image, the position of each target object and the preset parameters of each target object.
According to a second aspect of the embodiments of the present disclosure, there is provided a training data generating apparatus, including: the point cloud data acquisition module is used for acquiring target three-dimensional point cloud data corresponding to a target image, wherein the target three-dimensional point cloud data comprise three-dimensional annotation frames of a plurality of objects and annotation results corresponding to the three-dimensional annotation frames; the visual frame screening module is used for acquiring a first three-dimensional annotation frame in the identification range of the target image; the repeated frame processing module is used for processing two or more first three-dimensional annotation frames of the same object to obtain a preset number of second three-dimensional annotation frames of the same object; the data labeling module is used for labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, and the training data are used for training the target detection model.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of any of the above based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the training data generation method as set forth in any one of the above.
According to the embodiment of the disclosure, the existing laser point cloud data and the corresponding shooting images are processed to obtain the training data for training the monocular three-dimensional target detection model, and the corresponding relation between the three-dimensional information and the two-dimensional information of various objects can be obtained without modeling, so that the cost for generating the training data of the monocular three-dimensional target detection model is greatly reduced; in addition, the labeling frame of the laser point cloud data is directly and simply processed, so that the processing efficiency is high, the problems of high cost and low efficiency of training data for constructing the monocular three-dimensional target detection model in the related technology can be solved, and the training data of the monocular three-dimensional target detection model can be generated in a large scale at low cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 is a flowchart of a training data generation method in an exemplary embodiment of the present disclosure.
Fig. 2 is a sub-flowchart of step S2 in one embodiment of the present disclosure.
Fig. 3 is a schematic diagram of step S22 in one embodiment of the present disclosure.
FIG. 4 is a flow chart of determining a first three-dimensional annotation box in one embodiment of the disclosure.
FIG. 5 is a schematic diagram of a first three-dimensional annotation box corresponding to the embodiment shown in FIG. 3, according to one embodiment of the disclosure.
Fig. 6 is a sub-flowchart of step S3 in one embodiment of the present disclosure.
FIG. 7 is a schematic view of a projection frame of a first three-dimensional annotation frame onto a second projection surface according to one embodiment of the disclosure.
FIG. 8 is a schematic view of a projection frame of the first three-dimensional labeling frame of the embodiment of FIG. 7 on a first projection surface.
FIG. 9 is a flow chart of determining a second three-dimensional annotation box in one embodiment of the disclosure.
FIG. 10 is a schematic view of a second three-dimensional annotation frame on a second projection surface, corresponding to the embodiment shown in FIG. 7, after operation with the embodiment shown in FIG. 9.
FIG. 11 is a flow chart of updating labeling results according to the dimensions of a second three-dimensional labeling frame in one embodiment of the disclosure.
Fig. 12 is another flow chart of the embodiment shown in fig. 11.
Fig. 13 is a block diagram of a training data generation apparatus in an exemplary embodiment of the present disclosure.
Fig. 14 is a block diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are only schematic illustrations of the present disclosure, in which the same reference numerals denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The following describes example embodiments of the present disclosure in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a training data generation method in an exemplary embodiment of the present disclosure.
Referring to fig. 1, the training data generation method 100 may include:
step S1, target three-dimensional point cloud data corresponding to a target image are obtained, wherein the target three-dimensional point cloud data comprise three-dimensional labeling frames of a plurality of objects and labeling results corresponding to the three-dimensional labeling frames;
s2, acquiring a first three-dimensional annotation frame in the identification range of the target image;
step S3, processing two or more first three-dimensional labeling frames of the same object to obtain a preset number of second three-dimensional labeling frames of the same object;
and S4, marking the target image according to the marking result of each second three-dimensional marking frame to generate training data, wherein the training data are used for training the target detection model.
According to the embodiment of the disclosure, the existing laser point cloud data and the corresponding shooting images are processed to obtain the training data for training the monocular three-dimensional target detection model, and the corresponding relation between the three-dimensional information and the two-dimensional information of various objects can be obtained without modeling, so that the cost for generating the training data of the monocular three-dimensional target detection model is greatly reduced; in addition, the labeling frame of the laser point cloud data is directly and simply processed, so that the processing efficiency is high, the problems of high cost and low efficiency of training data for constructing the monocular three-dimensional target detection model in the related technology can be solved, and the training data of the monocular three-dimensional target detection model can be generated in a large scale at low cost.
Next, each step of the training data generation method 100 will be described in detail.
In step S1, target three-dimensional point cloud data corresponding to a target image is obtained, where the target three-dimensional point cloud data includes three-dimensional labeling frames of a plurality of objects and labeling results corresponding to each three-dimensional labeling frame.
Because the technology of acquiring three-dimensional point cloud data with lidar is mature, a large number of three-dimensional point cloud data sets have been made public. In the embodiment of the present disclosure, the three-dimensional point cloud data in a public three-dimensional point cloud data set can be processed directly, so that one's own training database is expanded with externally published data, which reduces the cost of generating training data and improves the efficiency of generating it. Note that a three-dimensional point cloud data set that includes two-dimensional image information needs to be selected.
Of course, when a public three-dimensional point cloud data set is expensive, or lacks the target objects to be identified, a vehicle with laser point cloud acquisition and identification capability can also be used, and two-dimensional images and three-dimensional point cloud data of the target objects to be identified can be collected in real time with a lidar and a camera. Because the cost of the lidar and the camera is limited and acquisition and labeling are fast, even if those skilled in the art collect three-dimensional point cloud data and target images on site to build a monocular three-dimensional object detection model, the cost is still far lower than that of the prior-art approach of building CAD models, and the efficiency and quantity of training data can be greatly improved.
Because existing three-dimensional point cloud data sets are built for training lidar-based models, their labeling frames are three-dimensional, and problems such as labeling frames occluding one another and different parts of one object being labeled separately are common, so the data cannot be applied directly to training a monocular three-dimensional object detection model. The embodiment of the present disclosure therefore processes the existing three-dimensional point cloud data as follows.
In some cases, the three-dimensional point cloud data set may be a set of data corresponding to a plurality of two-dimensional images, which may be, for example, photographs, video frames, and the like. At this time, the target three-dimensional point cloud data corresponding to each image is the same set of three-dimensional point cloud data. In other cases, for example, in a scenario where three-dimensional point cloud data is collected in a custom manner, the obtained two-dimensional image may correspond to a set of three-dimensional point cloud data sets, and the target three-dimensional point cloud data corresponding to the target image may be a specific set of three-dimensional point cloud data. The correspondence between the target image and the target three-dimensional point cloud data may be different according to different data acquisition manners, which is not limited in this disclosure.
In step S2, a first three-dimensional annotation frame within the recognition range of the target image is acquired.
Fig. 2 is a sub-flowchart of step S2 in one embodiment of the present disclosure.
Referring to fig. 2, in one embodiment, step S2 may include:
step S21, a projection frame of a three-dimensional annotation frame in the target three-dimensional point cloud data on a first projection surface and a center point coordinate of the projection frame are obtained, wherein the first projection surface is a shooting surface corresponding to the target image;
step S22, deleting the three-dimensional annotation frame corresponding to the center point coordinate if one center point coordinate is out of the display coordinate range of the target image;
step S23, if the overlapping degree of two projection frames is larger than a first preset value, deleting, of the two three-dimensional labeling frames corresponding to the two projection frames, the one whose three-dimensional center point is farther from the first projection plane;
and S24, determining the rest three-dimensional annotation frames as the first three-dimensional annotation frame.
Because the display range of the target image is limited while the spatial range of the three-dimensional point cloud data is generally larger, the embodiment of the present disclosure deletes the three-dimensional annotation frames that cannot be displayed in the two-dimensional target image.
Fig. 3 is a schematic diagram of step S22 in one embodiment of the present disclosure.
Referring to fig. 3, the first projection plane is a photographing plane of a target image having a rectangular display range 300, and a plurality of three-dimensional labeling frames have projection frames 31 to 39 each having center point coordinates on the first projection plane.
In the calculation process, let a three-dimensional labeling frame have dimensions (length, width, height) and three-dimensional center point coordinates (x, y, z). Using the projection matrix, the coordinates of its projection frame on the shooting plane of the target image are computed as (x0, y0, x1, y1), the center point coordinates of the projection frame are (center_x, center_y), and the size of the target image is (image_width, image_height). A projection frame center point whose coordinates satisfy the following formula (1) is determined to be within the display coordinate range of the target image:

0 ≤ center_x ≤ image_width and 0 ≤ center_y ≤ image_height (1)

Center point coordinates that do not satisfy formula (1) are determined to be outside the display coordinate range of the target image, and the three-dimensional labeling frame corresponding to that projection frame is deleted. In the embodiment shown in fig. 3, the three-dimensional annotation boxes corresponding to the projection boxes 31 and 39 are deleted.
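As an illustration only, a minimal Python sketch of the check in steps S21 and S22 might look like the following; the function name, the use of a 3x4 projection matrix and the coordinate conventions are assumptions made for this sketch rather than details taken from the embodiment.

    import numpy as np

    def center_in_image(center_xyz, proj_matrix, image_width, image_height):
        """Project a 3D labeling-frame center onto the shooting plane and apply formula (1).

        center_xyz: (x, y, z) center of a three-dimensional labeling frame
        proj_matrix: assumed 3x4 projection matrix mapping that frame to pixel coordinates
        """
        pt = np.append(np.asarray(center_xyz, dtype=float), 1.0)  # homogeneous coordinates
        u, v, w = proj_matrix @ pt
        if w <= 0:  # behind the shooting plane, so it cannot appear in the image
            return False
        center_x, center_y = u / w, v / w
        # formula (1): the projected center must fall inside the display range
        return 0 <= center_x <= image_width and 0 <= center_y <= image_height

A three-dimensional labeling frame whose projected center fails this check would be deleted, as for projection boxes 31 and 39 in fig. 3.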
Because visual imaging is unidirectional, a distant object in a two-dimensional image is very easily blocked by a near object, so in step S23 the embodiment of the present disclosure deletes the three-dimensional annotation frames corresponding to objects that are occluded and invisible in the target image.
In the embodiment of the present disclosure, the degree of overlap (Intersection over Union, IOU) between projection frames is used to determine the occlusion relationships between objects. The IOU, also called the intersection-over-union ratio, is commonly computed as the ratio of the intersection to the union of a predicted bounding box and a ground-truth bounding box. Taking two figures as an example, first compute the area of their intersection, then the area of their union, and finally take the ratio of the two areas as the degree of overlap (IOU) between the two figures.
With continued reference to fig. 3, it can be observed that there is a high degree of overlap between the projection frame 33 and the projection frame 34. If two projection frames overlap on the shooting plane of the target image, the two corresponding target objects have an occlusion relationship. Since an object located behind cannot be identified from the target image when the occlusion is severe, a first preset value, for example 0.7, can be set to judge whether the occlusion relationship affects how the object is displayed in the target image and therefore whether the object can be identified. If the degree of overlap exceeds the first preset value, the rear object is severely blocked by the front object and does not need to be identified, so the three-dimensional labeling frame corresponding to the rear object can be deleted, which increases the proportion of effective training data among all the training data. The front-rear order of the two objects can be determined from the positions of the three-dimensional center points of the two three-dimensional labeling frames relative to the first projection plane; when the three-dimensional labeling frames carry depth information relative to the image shooting plane, the depth information of the three-dimensional labeling frames corresponding to the two projection frames whose degree of overlap exceeds the first preset value can be read directly, and the three-dimensional labeling frame with the larger depth relative to the shooting plane is deleted.
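The occlusion filtering described above can be sketched as follows; the dictionary keys 'proj_box' and 'depth' and the 0.7 default are assumptions chosen to match the example value given in the text, not a definitive implementation.

    def iou(box_a, box_b):
        """Degree of overlap between two axis-aligned projection frames given as (x0, y0, x1, y1)."""
        ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def drop_occluded(frames, first_preset_value=0.7):
        """For every heavily overlapping pair, keep only the labeling frame closer to the shooting plane.

        frames: list of dicts, each with a 2D 'proj_box' on the first projection plane and a
        'depth' measured from that plane (key names are assumptions for this sketch).
        """
        keep = [True] * len(frames)
        for i in range(len(frames)):
            for j in range(i + 1, len(frames)):
                if keep[i] and keep[j] and iou(frames[i]["proj_box"], frames[j]["proj_box"]) > first_preset_value:
                    # delete the labeling frame whose center is farther from the first projection plane
                    farther = i if frames[i]["depth"] > frames[j]["depth"] else j
                    keep[farther] = False
        return [f for f, k in zip(frames, keep) if k]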
Finally, the remaining three-dimensional annotation frames are determined as the first three-dimensional annotation frames, that is, the three-dimensional annotation frames of the objects that can be accurately identified from the target picture.
FIG. 4 is a flow chart of determining a first three-dimensional annotation box in one embodiment of the disclosure.
Referring to FIG. 4, in one embodiment, determining a particular flow of a first three-dimensional annotation box can comprise:
step S400, a three-dimensional point cloud data set is obtained, wherein the three-dimensional point cloud data set comprises a plurality of three-dimensional annotation frames, an annotation result corresponding to each three-dimensional annotation frame and a plurality of pictures corresponding to the three-dimensional point cloud data set.
In step S401, the N projection frames T1 of the N three-dimensional labeling frames in the three-dimensional point cloud data corresponding to the target picture on the first projection plane, together with the coordinates of the N projection center points, are obtained, and i=0 and M=N are set.
The images in the three-dimensional point cloud data set can be ordered and labeled one by one, and the image currently being labeled is called the target image. In this embodiment, i is a parameter used to record the sequence numbers of the three-dimensional labeling frames, and a three-dimensional labeling frame, its corresponding projection frame and its projection center point all share the same sequence number; M is a parameter used to record the number of three-dimensional annotation boxes within the display range of the target picture.
Step S402, judging whether the coordinates of the ith projection center point exceed the display area of the target image, if so, entering step S403, deleting the ith three-dimensional labeling frame, subtracting one from M, and entering step S404; if not, go to step S404;
step S404, judging whether i is equal to the total number N of the three-dimensional labeling frames, if not, entering step S405 to add one to i, and returning to step S402 to process the next projection frame; if yes, go to step S406;
in step S406, the i value is reset and p=m is set, where P is a parameter used to record the number of first three-dimensional label frames in this embodiment. The step is used for setting a serial number for the three-dimensional labeling frame in the display range of the target image.
It should be noted that, because there may be a layer-by-layer shielding problem between the three-dimensional labeling frames, in step S406, each three-dimensional labeling frame may be numbered sequentially from large to small according to the distance between the three-dimensional center point and the first projection plane, so that in the subsequent steps S408 to S410, the three-dimensional labeling frame corresponding to the shielded object is deleted from back to front, and it is avoided that after the three-dimensional labeling frame located in the middle of the shielding queue is deleted first, the object located at the forefront of the shielding queue and the object located at the rearmost of the shielding queue are judged as having no shielding relationship.
Step S407, judging whether the degree of overlap IOU(T1_i, T1_{i+1}) between the i-th projection frame T1_i and the (i+1)-th projection frame T1_{i+1} is greater than or equal to the first preset value Vth_1; if yes, go to step S408, delete whichever of the i-th and the (i+1)-th three-dimensional labeling frames has its three-dimensional center point farther from the first projection plane, subtract one from P, and go to step S409; if not, go to step S409;
step S409, judging whether i+1 is equal to M, if not, entering step S410 to add one to i, returning to step S407 to judge the next group of projection frames; if so, step S411 is entered to determine the remaining P three-dimensional annotation frames as the first three-dimensional annotation frame.
FIG. 5 is a schematic diagram of a first three-dimensional annotation box corresponding to the embodiment shown in FIG. 3 in one embodiment of the disclosure.
Referring to fig. 5, after deleting the three-dimensional annotation frame corresponding to the projection frames 31, 39 exceeding the target image display range 300 and the blocked following three-dimensional annotation frame (corresponding to the projection frame 33), the first three-dimensional annotation frame includes the three-dimensional annotation frames corresponding to the projection frames 32, 34, 35, 36, 37, 38.
In step S3, two or more first three-dimensional labeling frames of the same object are processed, so as to obtain a preset number of second three-dimensional labeling frames of the same object.
The disclosed embodiments may be used to generate training data for monocular three-dimensional object detection models installed on autonomous vehicles. Because three-dimensional point cloud data generally involves multiple labeling practices (for example, different parts of an object may be labeled separately), while autonomous driving and other monocular three-dimensional object detection scenarios mainly need to attend to the object body, multiple labeling frames of the same object can be processed in order to increase the proportion of effective training data.
Fig. 6 is a sub-flowchart of step S3 in one embodiment of the present disclosure.
Referring to fig. 6, in one embodiment, step S3 may include:
step S31, a first projection frame and a second projection frame of two first three-dimensional labeling frames on a second projection plane are obtained, wherein the second projection plane is a shooting overlook plane corresponding to the target image;
step S32, deleting a first three-dimensional labeling frame corresponding to the smaller area in the first projection frame and the second projection frame when the overlapping degree of the first projection frame and the second projection frame is larger than a second preset value and smaller than a third preset value;
step S33, when the overlapping degree of the first projection frame and the second projection frame is larger than or equal to the third preset value, a third projection frame and a fourth projection frame of the two first three-dimensional labeling frames on a first projection plane are obtained, wherein the first projection plane is a shooting plane corresponding to the target image;
Step S34, deleting the first three-dimensional annotation frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition.
FIG. 7 is a schematic view of a projection frame of a first three-dimensional annotation frame onto a second projection surface according to one embodiment of the disclosure.
Referring to fig. 7, the projection frames of the first three-dimensional labeling frames on the second projection plane may include, for example, the projection frames 71 to 76. The inventors have observed that, because the second projection plane is the shooting top view plane corresponding to the target image, two three-dimensional labeling frames that overlap over a large range on the top view plane may label parts of the same object at different heights (such as the hanger and the body of a crane) or different components of the same object (such as the shovel and the body of a bulldozer). In the embodiment of the present disclosure, two projection frames with a higher degree of overlap (larger than the third preset value) are judged to be projection frames of parts of the same object at different heights, and projection frames with a lower degree of overlap (between the second preset value and the third preset value) are judged to be projection frames of different components of the same object. Different methods are then adopted to handle the two situations.
For projection frames of different components, only the main part is kept, so the embodiment of the present disclosure deletes, of the two overlapping projection frames, the first three-dimensional labeling frame corresponding to the smaller area on the second projection plane. However, in some cases the main component may be thin and tall relative to the ground while the secondary component is flat, so the first three-dimensional labeling frame corresponding to the main component could be deleted by mistake. Therefore, the projection frames judged to belong to different components of the same object can be further screened according to the heights of the corresponding first three-dimensional labeling frames' projections on the first projection plane, keeping the first three-dimensional labeling frames whose heights meet a preset condition. In some embodiments of the present disclosure, only the one three-dimensional labeling frame closest to the ground is kept for an object, which facilitates applying the generated training data in the unmanned driving field; in other embodiments of the present disclosure, one or more labeling frames meeting preset conditions may be kept for an object, so that the training data can be applied to other target technical fields. The preset conditions can be set by a person skilled in the art according to the intended use of the training data, which is not limited by the present disclosure.
FIG. 8 is a schematic view of a projection frame of the first three-dimensional labeling frame of the embodiment of FIG. 7 on a first projection surface.
Referring to fig. 8, the projection frames 71 to 76 are the projection frames of the first three-dimensional labeling frames on the second projection plane, and the projection frames 81 to 86 are their projection frames on the first projection plane; the projection frames 81 to 86 correspond to the projection frames 71 to 76, respectively. As can be seen from fig. 8, the projection frames 73 and 74, whose degree of overlap on the second projection plane is larger than the third preset value (for example, 0.9), differ in height. If only the one three-dimensional labeling frame closest to the ground is kept for an object, to facilitate applying the generated training data in the unmanned driving field, the first three-dimensional labeling frame corresponding to the higher projection frame 84 may be deleted. Similarly, the degree of overlap of the projection frames 71 and 72 on the second projection plane is greater than the second preset value (for example, 0.1) but less than the third preset value (for example, 0.9), and the heights of the corresponding projection frames 81 and 82 on the first projection plane differ, so the first three-dimensional labeling frame corresponding to the higher projection frame 81 can be deleted. Finally, the remaining first three-dimensional labeling frames are taken as the second three-dimensional labeling frames used to generate the annotation information.
In one embodiment, the method for deleting the higher first three-dimensional annotation frame in fig. 8 by step S34 may include: determining a first center point height and a first height corresponding to the third projection frame, and determining a first vertex height of the third projection frame according to the first center point height and the first height; determining a second center point height and a second height corresponding to the fourth projection frame, and determining a second vertex height of the fourth projection frame according to the second center point height and the second height; deleting a first three-dimensional annotation frame corresponding to the third projection frame when the first center point height is larger than the second vertex height; and deleting the first three-dimensional annotation frame corresponding to the fourth projection frame when the height of the second center point is larger than that of the first vertex.
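A compact sketch of the pairwise decision in steps S31 to S34 (including the height comparison just described) could look like this; the key names, the reuse of the iou() helper sketched earlier, and the thresholds 0.1 and 0.9 are assumptions taken only from the illustrative values mentioned above.

    def resolve_duplicate_pair(frame_a, frame_b, second_preset=0.1, third_preset=0.9):
        """Return the first three-dimensional labeling frame of a pair that should be deleted, or None.

        Each frame is assumed to carry 'bev_box' and 'bev_area' (its top-view projection and area),
        plus 'center_h' (center point height) and 'top_h' (vertex height) of its projection on the
        first projection plane; these key names are illustrative only.
        """
        overlap = iou(frame_a["bev_box"], frame_b["bev_box"])  # reuses the iou() sketch above
        if second_preset < overlap < third_preset:
            # different components of one object: delete the one with the smaller top-view area
            return frame_a if frame_a["bev_area"] < frame_b["bev_area"] else frame_b
        if overlap >= third_preset:
            # parts of one object at different heights: delete the frame farther from the ground
            if frame_a["center_h"] > frame_b["top_h"]:
                return frame_a
            if frame_b["center_h"] > frame_a["top_h"]:
                return frame_b
        return None  # no deletion needed for this pair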
FIG. 9 is a flow chart of determining a second three-dimensional annotation box in one embodiment of the disclosure.
Referring to FIG. 9, in one embodiment, the process of determining the second three-dimensional annotation box can include:
step S901, obtaining P projection frames T2 of the P first three-dimensional labeling frames on the second projection surface, where j=0 and q=p are set; the present embodiment records the number of second three-dimensional projection frames using Q.
Step S902, judging the projection frame j T2 of the jth first three-dimensional labeling frame on the second projection surface j And the j+1th projection frame T2 j+1 Overlap IOU (T2) j ,T2 j+1 ) Whether or not to be at the second preset value Vth 2 And a third preset value Vth 3 If yes, enter step S903 and delete the first three-dimensional label frame corresponding to the j-th projection frame and the smaller area in the j+1-th projection frame, enter step S911 after subtracting one to Q; if not, go to step S904;
step S904, determine IOU (T2 j ,T2 j+1 ) Whether or not to be greater than or equal to a third preset value Vth 3 If yes, go to step S905, if no, go to step S911;
step S905, obtaining the j-th first three-dimensional labeling frame and the projection frame T1 of the j+1th first three-dimensional labeling frame on the first projection surface j And projection frame T1 j+1
Step S906, the acquired projection frame T1 j Is used for obtaining a projection frame T1 by the first center point height and the first peak point height j+1 A second center point height and a second vertex height;
step S907, judging whether the first center point height is larger than the second vertex height, if so, entering step S908 to delete the j-th first three-dimensional labeling frame, and entering step S911 after subtracting one operation for Q; if not, go to step S909;
step S909, judging whether the second center point height is greater than the first vertex height, if so, entering step S910 to delete the j+1th first three-dimensional labeling frame, and entering step S911 after subtracting one from Q; if not, go to step S911;
step S911, judging whether j+1 is equal to P, if not, proceeding to step S912 to add one operation to j, returning to step S902 to judge the next group of projection frames; if so, the process proceeds to step S913 to determine the remaining Q first three-dimensional annotation frames as second three-dimensional annotation frames.
FIG. 10 is a schematic view of a second three-dimensional annotation frame on a second projection surface, corresponding to the embodiment shown in FIG. 7, after operation with the embodiment shown in FIG. 9.
Referring to fig. 10, after deleting the first three-dimensional label frame corresponding to the smaller projection frame 71 and the first three-dimensional label frame corresponding to the higher projection frame 74, the second three-dimensional label frame is the three-dimensional label frame corresponding to the projection frames 72, 73, 75, 76.
In another embodiment of the present disclosure, if the three-dimensional point cloud data comes from a public data set, the labeling categories are relatively coarse, and the labeling result of each second three-dimensional labeling frame can be updated to obtain labels better suited to the detection purpose of the monocular three-dimensional target detection model.
For example, the labeling result may be adjusted when the labeling result of one of the second three-dimensional labeling frames matches the target object. In some embodiments, the public data set labels all vehicles simply as vehicles; in this case the labeling result may be refined into "bicycle", "small motor vehicle", "large motor vehicle", etc. according to the size of the three-dimensional labeling frame. In other embodiments, if the labeling result is "large motor vehicle", a classification model (e.g., a trained neural network model) may further be used to classify the object corresponding to the second three-dimensional labeling frame into "bus", "truck", and so on.
FIG. 11 is a flow chart of updating labeling results according to the dimensions of a second three-dimensional labeling frame in one embodiment of the disclosure.
Referring to FIG. 11, in one embodiment, the process of updating the annotation result based on the size of the second three-dimensional annotation frame may comprise:
Step S111, obtaining a target projection frame of the second three-dimensional labeling frame on a second projection surface, wherein the second projection surface is a shooting overlook surface corresponding to the target image;
step S112, if the length of the target projection frame in the normal direction of the second projection surface is smaller than a fourth preset value, updating the labeling result of a second three-dimensional labeling frame corresponding to the target projection frame to a first object;
step S113, if the length of the target projection frame in the normal direction of the second projection surface is greater than a fifth preset value, updating the labeling result of a second three-dimensional labeling frame corresponding to the target projection frame to a second object;
step S114, if the length of the target projection frame in the normal direction of the second projection surface is greater than or equal to the fourth preset value and less than or equal to the fifth preset value, updating the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame to a third object.
Fig. 12 is another flow chart of the embodiment shown in fig. 11.
Referring to fig. 12, first, k=0 is set at step S121, k being a sequence number representing a second three-dimensional label frame.
Step S122, determining whether the labeling result of the kth second three-dimensional labeling frame is matched with the target object, if not, entering step S128; if yes, go to step S123;
Step S123, judging whether the length of the kth second three-dimensional labeling frame in the normal direction of the second projection surface is smaller than a fourth preset value, if so, entering step S124, updating the labeling result of the kth second three-dimensional labeling frame into a first object, and entering step S128; if not, go to step S125;
step S125, judging whether the length of the kth second three-dimensional labeling frame in the normal direction of the second projection surface is larger than a fifth preset value, if so, entering step S126, updating the labeling result of the kth second three-dimensional labeling frame into a second object, and entering step S128; if not, the step S127 is entered, and the labeling result of the kth second three-dimensional labeling frame is updated to a third object, and then the step S128 is entered;
step S128, judging whether k is equal to the number Q of the second three-dimensional labeling frames, if not, entering step S129 to add one to k and returning to step S122 to judge the next second three-dimensional labeling frame; if yes, entering step S4.
The target object in the embodiment shown in fig. 11 and 12 may be, for example, a vehicle, the first object may be, for example, a bicycle, the second object may be, for example, a large motor vehicle, and the third object may be, for example, a small motor vehicle. After updating the labeling result of the target object according to the size, a classification model may be further used to classify a part or all of the second three-dimensional labeling frames in a finer granularity, and update the labeling result again, which is not particularly limited in the present disclosure.
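For illustration, the size-based relabeling of figs. 11 and 12 might be sketched as below; the 1.0 and 6.0 thresholds, the category strings and the dictionary keys are assumptions standing in for the fourth and fifth preset values, which the embodiment does not fix.

    def refine_vehicle_label(frame, fourth_preset=1.0, fifth_preset=6.0):
        """Update the label of a second 3D labeling frame whose labeling result matches the target object.

        'length' is assumed to be the extent of the frame's top-view projection along the normal
        direction of the second projection plane; the unit and thresholds are illustrative only.
        """
        if frame.get("label") != "vehicle":  # labeling result does not match the target object
            return frame
        length = frame["length"]
        if length < fourth_preset:
            frame["label"] = "bicycle"               # first object
        elif length > fifth_preset:
            frame["label"] = "large motor vehicle"   # second object
        else:
            frame["label"] = "small motor vehicle"   # third object
        return frame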
And in step S4, labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, wherein the training data is used for training the target detection model.
The process of labeling the target image according to the labeling result of each of the second three-dimensional labeling frames in the embodiments of the present disclosure may, for example, include: obtaining a target object corresponding to each second three-dimensional annotation frame and preset parameters, wherein the preset parameters at least comprise height information and distance information; determining the position of each target object in the target image; and generating the training data according to the target image, the position of each target object and the preset parameters of each target object.
Through the labeling process, each identifiable object in each image in the generated training data has the height information and the distance information, so that the monocular three-dimensional target detection model trained by using the training data can judge the three-dimensional information of the target object according to the two-dimensional image, for example, the distance of the object is estimated according to the height information of the object, and then corresponding operation is realized according to the distance of the object.
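A minimal sketch of assembling one training sample from a labeled target image is shown below; the record layout and all key names are assumptions for illustration, not a format prescribed by the embodiment or by any particular detector.

    def build_training_sample(image_path, second_frames):
        """Assemble one training record from a target image and its second three-dimensional labeling frames.

        Each frame is assumed to provide a 'label', its 2D position in the image ('proj_box'), and
        preset parameters that at least include 'height' and 'distance' information.
        """
        return {
            "image": image_path,
            "objects": [
                {
                    "category": frame["label"],
                    "box_2d": frame["proj_box"],    # position of the target object in the target image
                    "height": frame["height"],      # height information of the labeling frame
                    "distance": frame["distance"],  # distance information relative to the shooting plane
                }
                for frame in second_frames
            ],
        }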
The embodiment of the disclosure can be used for directly cleaning the public data set to generate the training data of the monocular three-dimensional target detection model in a large quantity at low cost, so that a large quantity of training data meeting the training requirements can be obtained rapidly, the generalization capability of the training data set and the model can be enhanced, and the labor and financial cost can be saved greatly.
Corresponding to the above method embodiments, the present disclosure further provides a training data generating device, which may be used to perform the above method embodiments.
Fig. 13 is a block diagram of a training data generation apparatus in an exemplary embodiment of the present disclosure.
Referring to fig. 13, the training data generating apparatus 1300 may include:
the point cloud data acquisition module 131 is configured to acquire target three-dimensional point cloud data corresponding to a target image, where the target three-dimensional point cloud data includes three-dimensional labeling frames of a plurality of objects and labeling results corresponding to each of the three-dimensional labeling frames;
a visual frame screening module 132 configured to acquire a first three-dimensional annotation frame within an identification range of the target image;
the repeated frame processing module 133 is configured to process two or more first three-dimensional labeling frames of the same object to obtain a preset number of second three-dimensional labeling frames of the same object;
The data labeling module 134 is configured to label the target image according to the labeling result of each of the second three-dimensional labeling frames to generate training data, where the training data is used to train the target detection model.
In one exemplary embodiment of the present disclosure, the visual box screening module 132 is configured to: acquiring a projection frame of each three-dimensional annotation frame in the target three-dimensional point cloud data on a first projection plane and the center point coordinates of the projection frame, wherein the first projection plane is the shooting plane corresponding to the target image; if a center point coordinate is outside the display coordinate range of the target image, deleting the three-dimensional annotation frame corresponding to that center point coordinate; if the overlapping degree of two projection frames is larger than a first preset value, deleting, of the two three-dimensional annotation frames corresponding to the two projection frames, the one whose three-dimensional center point is farther from the first projection plane; and determining the remaining three-dimensional annotation frames as the first three-dimensional annotation frames.
In one exemplary embodiment of the present disclosure, the repeated frame processing module 133 is configured to: acquiring a first projection frame and a second projection frame of two first three-dimensional labeling frames on a second projection plane, wherein the second projection plane is the shooting top view plane corresponding to the target image; when the overlapping degree of the first projection frame and the second projection frame is larger than a second preset value and smaller than a third preset value, deleting the first three-dimensional labeling frame corresponding to the smaller of the first projection frame and the second projection frame; when the overlapping degree of the first projection frame and the second projection frame is larger than or equal to the third preset value, acquiring a third projection frame and a fourth projection frame of the two first three-dimensional labeling frames on a first projection plane, the first projection plane being the shooting plane corresponding to the target image; and deleting the first three-dimensional labeling frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition.
In one exemplary embodiment of the present disclosure, the repeated frame processing module 133 is configured to: determining a first center point height and a first height corresponding to the third projection frame, and determining a first vertex height of the third projection frame according to the first center point height and the first height; determining a second center point height and a second height corresponding to the fourth projection frame, and determining a second vertex height of the fourth projection frame according to the second center point height and the second height; deleting the first three-dimensional annotation frame corresponding to the third projection frame when the first center point height is larger than the second vertex height; and deleting the first three-dimensional annotation frame corresponding to the fourth projection frame when the second center point height is larger than the first vertex height.
In an exemplary embodiment of the present disclosure, the apparatus 1300 further includes a classification module 135 configured to: when the labeling result of a second three-dimensional labeling frame matches a target object, update the labeling result according to the size of that second three-dimensional labeling frame.
In one exemplary embodiment of the present disclosure, the classification module 135 is configured to: acquire a target projection frame of the second three-dimensional annotation frame on the second projection plane, where the second projection plane is the top-view shooting plane corresponding to the target image; if the length of the target projection frame in the normal direction of the second projection plane is less than a fourth preset value, update the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame to a first object; if that length is greater than a fifth preset value, update the labeling result to a second object; and if that length is greater than or equal to the fourth preset value and less than or equal to the fifth preset value, update the labeling result to a third object.
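The size-based re-labeling can be summarized by a short sketch; the "bev_length" field, the placeholder class names, and the example values used for the fourth and fifth preset values are assumptions for illustration only.

```python
def refine_label_by_size(frame, fourth_preset=1.0, fifth_preset=6.0):
    """Re-label a second 3D labeling frame that matched the target object,
    based on the extent of its top-view projection frame."""
    length = frame["bev_length"]            # extent of the top-view projection frame (assumed field)
    if length < fourth_preset:
        frame["label"] = "first_object"     # below the fourth preset value
    elif length > fifth_preset:
        frame["label"] = "second_object"    # above the fifth preset value
    else:
        frame["label"] = "third_object"     # between the two preset values
    return frame
```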
In one exemplary embodiment of the present disclosure, the data annotation module 134 is configured to: acquire the target object corresponding to each second three-dimensional annotation frame together with its preset parameters, where the preset parameters include at least height information and distance information; determine the position of each target object in the target image; and generate the training data from the target image, the position of each target object, and the preset parameters of each target object.
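Assembling a training record from the labeled frames might then look like the following sketch; the dictionary field names are assumed for the example, since the disclosure only requires the class, the position in the image, and the height and distance information.

```python
def build_training_sample(image_path, second_frames):
    """Assemble one training record for the monocular 3D detection model from
    the target image and its second 3D labeling frames."""
    annotations = []
    for frame in second_frames:
        annotations.append({
            "category": frame["label"],        # labeling result of the frame
            "position": frame["image_box"],    # position of the target object in the image
            "height": frame["height"],         # preset parameter: height information
            "distance": frame["distance"],     # preset parameter: distance information
        })
    return {"image": image_path, "annotations": annotations}
```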
Since the functions of the apparatus 1300 have been described in detail in the corresponding method embodiments, they are not repeated here.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in a single module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into, and embodied by, a plurality of modules or units.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, a method, or a program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may generally be referred to herein as a "circuit," a "module," or a "system."
An electronic device 1400 according to such an embodiment of the invention is described below with reference to fig. 14. The electronic device 1400 shown in fig. 14 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 14, the electronic device 1400 is embodied in the form of a general purpose computing device. Components of electronic device 1400 may include, but are not limited to: the at least one processing unit 1410, the at least one memory unit 1420, and a bus 1430 connecting the different system components (including the memory unit 1420 and the processing unit 1410).
Wherein the storage unit stores program code that is executable by the processing unit 1410, such that the processing unit 1410 performs the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification. For example, the processing unit 1410 may perform the methods shown in the embodiments of the present disclosure.
The memory unit 1420 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 14201 and/or cache memory 14202, and may further include Read Only Memory (ROM) 14203.
The memory unit 1420 may also include a program/utility 14204 having a set (at least one) of program modules 14205, such program modules 14205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1430 represents one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1400 may also communicate with one or more external devices 1500 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1400, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 1400 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1450. Moreover, the electronic device 1400 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through the network adapter 1460. As shown, the network adapter 1460 communicates with the other modules of the electronic device 1400 via the bus 1430. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 1400, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a portable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
The program product for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow, in general, the principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (9)

1. A training data generation method for a monocular three-dimensional object detection model, comprising:
acquiring target three-dimensional point cloud data corresponding to a target image, wherein the target three-dimensional point cloud data comprises three-dimensional annotation frames of a plurality of objects and annotation results corresponding to each three-dimensional annotation frame;
acquiring a first three-dimensional annotation frame in the identification range of the target image;
processing two or more first three-dimensional annotation frames of the same object to obtain a preset number of second three-dimensional annotation frames of the same object;
labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, wherein the training data are used for training the target detection model;
the processing the two or more first three-dimensional labeling frames of the same object to obtain a preset number of second three-dimensional labeling frames of the same object comprises the following steps:
acquiring a first projection frame and a second projection frame of two first three-dimensional labeling frames on a second projection plane, wherein the second projection plane is the top-view shooting plane corresponding to the target image;
when the overlapping degree of the first projection frame and the second projection frame is greater than a second preset value and less than a third preset value, deleting the first three-dimensional labeling frame corresponding to whichever of the first projection frame and the second projection frame has the smaller area;
when the overlapping degree of the first projection frame and the second projection frame is greater than or equal to the third preset value, acquiring a third projection frame and a fourth projection frame of the two first three-dimensional labeling frames on a first projection plane, wherein the first projection plane is the shooting plane corresponding to the target image;
and deleting the first three-dimensional annotation frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition.
2. The training data generation method of claim 1, wherein the acquiring a first three-dimensional annotation frame in the identification range of the target image comprises:
acquiring a projection frame of a three-dimensional annotation frame in the target three-dimensional point cloud data on a first projection plane and a central point coordinate of the projection frame, wherein the first projection plane is a shooting plane corresponding to the target image;
if one of the center point coordinates is out of the display coordinate range of the target image, deleting the three-dimensional annotation frame corresponding to the center point coordinate;
if the overlapping degree of the two projection frames is greater than a first preset value, deleting, of the two three-dimensional labeling frames corresponding to the two projection frames, the one whose three-dimensional center point is farther from the first projection plane;
and determining the rest three-dimensional annotation frames as the first three-dimensional annotation frame.
3. The training data generation method of claim 1, wherein the deleting the first three-dimensional labeling frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition comprises:
determining a first center point height and a first height corresponding to the third projection frame, and determining a first vertex height of the third projection frame according to the first center point height and the first height;
determining a second center point height and a second height corresponding to the fourth projection frame, and determining a second vertex height of the fourth projection frame according to the second center point height and the second height;
deleting a first three-dimensional annotation frame corresponding to the third projection frame when the first center point height is larger than the second vertex height;
and deleting the first three-dimensional annotation frame corresponding to the fourth projection frame when the height of the second center point is larger than that of the first vertex.
4. The training data generation method of claim 1, further comprising, after obtaining the preset number of second three-dimensional annotation frames of the same object:
and when the labeling result of the second three-dimensional labeling frame is matched with a target object, updating the labeling result according to the size of the second three-dimensional labeling frame.
5. The training data generation method of claim 4, wherein the updating the annotation result based on the size of the second three-dimensional annotation frame comprises:
acquiring a target projection frame of the second three-dimensional annotation frame on a second projection plane, wherein the second projection plane is the top-view shooting plane corresponding to the target image;
if the length of the target projection frame in the normal direction of the second projection plane is less than a fourth preset value, updating the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame to a first object;
if the length of the target projection frame in the normal direction of the second projection plane is greater than a fifth preset value, updating the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame to a second object;
and if the length of the target projection frame in the normal direction of the second projection plane is greater than or equal to the fourth preset value and less than or equal to the fifth preset value, updating the labeling result of the second three-dimensional labeling frame corresponding to the target projection frame to a third object.
6. The training data generation method of claim 1, wherein labeling the target image according to the labeling result of each of the second three-dimensional labeling frames to generate training data comprises:
obtaining a target object corresponding to each second three-dimensional annotation frame and preset parameters, wherein the preset parameters at least comprise height information and distance information;
determining the position of each target object in the target image;
and generating the training data according to the target image, the position of each target object and the preset parameters of each target object.
7. A training data generation apparatus for a monocular three-dimensional object detection model, comprising:
the point cloud data acquisition module is used for acquiring target three-dimensional point cloud data corresponding to a target image, wherein the target three-dimensional point cloud data comprise three-dimensional annotation frames of a plurality of objects and annotation results corresponding to the three-dimensional annotation frames;
the visual frame screening module is used for acquiring a first three-dimensional annotation frame in the identification range of the target image;
the repeated frame processing module is used for processing two or more first three-dimensional annotation frames of the same object to obtain a preset number of second three-dimensional annotation frames of the same object;
the data labeling module is used for labeling the target image according to the labeling result of each second three-dimensional labeling frame to generate training data, and the training data are used for training the target detection model;
wherein the repeated frame processing module is configured to: acquire a first projection frame and a second projection frame of two first three-dimensional labeling frames on a second projection plane, wherein the second projection plane is the top-view shooting plane corresponding to the target image; when the overlapping degree of the first projection frame and the second projection frame is greater than a second preset value and less than a third preset value, delete the first three-dimensional labeling frame corresponding to whichever of the first projection frame and the second projection frame has the smaller area; when the overlapping degree of the first projection frame and the second projection frame is greater than or equal to the third preset value, acquire a third projection frame and a fourth projection frame of the two first three-dimensional labeling frames on a first projection plane, wherein the first projection plane is the shooting plane corresponding to the target image; and delete the first three-dimensional annotation frame corresponding to whichever of the third projection frame and the fourth projection frame meets the preset height condition.
8. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the training data generation method of any of claims 1-6 based on instructions stored in the memory.
9. A computer readable storage medium having stored thereon a program which, when executed by a processor, implements the training data generation method of any of claims 1-6.
CN202110238897.1A 2021-03-04 2021-03-04 Training data generation method and device and electronic equipment Active CN113808186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238897.1A CN113808186B (en) 2021-03-04 2021-03-04 Training data generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113808186A CN113808186A (en) 2021-12-17
CN113808186B true CN113808186B (en) 2024-01-16

Family

ID=78892886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110238897.1A Active CN113808186B (en) 2021-03-04 2021-03-04 Training data generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113808186B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116469025B (en) * 2022-12-30 2023-11-24 以萨技术股份有限公司 Processing method for identifying task, electronic equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960174A (en) * 2018-07-12 2018-12-07 广东工业大学 A kind of object detection results optimization method and device
CN110163904A (en) * 2018-09-11 2019-08-23 腾讯大地通途(北京)科技有限公司 Object marking method, control method for movement, device, equipment and storage medium
WO2019196130A1 (en) * 2018-04-12 2019-10-17 广州飒特红外股份有限公司 Classifier training method and device for vehicle-mounted thermal imaging pedestrian detection
CN110443212A (en) * 2019-08-12 2019-11-12 睿魔智能科技(深圳)有限公司 Positive sample acquisition methods, device, equipment and storage medium for target detection
CN110796201A (en) * 2019-10-31 2020-02-14 深圳前海达闼云端智能科技有限公司 Method for correcting label frame, electronic equipment and storage medium
WO2020102944A1 (en) * 2018-11-19 2020-05-28 深圳市大疆创新科技有限公司 Point cloud processing method and device and storage medium
CN111310667A (en) * 2020-02-18 2020-06-19 北京小马慧行科技有限公司 Method, device, storage medium and processor for determining whether annotation is accurate
CN111523390A (en) * 2020-03-25 2020-08-11 杭州易现先进科技有限公司 Image recognition method and augmented reality AR icon recognition system
CN111563450A (en) * 2020-04-30 2020-08-21 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN111583663A (en) * 2020-04-26 2020-08-25 宁波吉利汽车研究开发有限公司 Monocular perception correction method and device based on sparse point cloud and storage medium
CN111797734A (en) * 2020-06-22 2020-10-20 广州视源电子科技股份有限公司 Vehicle point cloud data processing method, device, equipment and storage medium
CN112183180A (en) * 2019-07-02 2021-01-05 通用汽车环球科技运作有限责任公司 Method and apparatus for three-dimensional object bounding of two-dimensional image data
CN112287860A (en) * 2020-11-03 2021-01-29 北京京东乾石科技有限公司 Training method and device of object recognition model, and object recognition method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264468B (en) * 2019-08-14 2019-11-19 长沙智能驾驶研究院有限公司 Point cloud data mark, parted pattern determination, object detection method and relevant device
CN111652113B (en) * 2020-05-29 2023-07-25 阿波罗智联(北京)科技有限公司 Obstacle detection method, device, equipment and storage medium
CN112329846A (en) * 2020-11-03 2021-02-05 武汉光庭信息技术股份有限公司 Laser point cloud data high-precision marking method and system, server and medium

Also Published As

Publication number Publication date
CN113808186A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
Sahu et al. Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review
CN108830894B (en) Remote guidance method, device, terminal and storage medium based on augmented reality
US10854006B2 (en) AR-enabled labeling using aligned CAD models
CN108335353B (en) Three-dimensional reconstruction method, device and system of dynamic scene, server and medium
JP2021515939A (en) Monocular depth estimation method and its devices, equipment and storage media
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
US20220076072A1 (en) System and method using augmented reality for efficient collection of training data for machine learning
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
JP7422105B2 (en) Obtaining method, device, electronic device, computer-readable storage medium, and computer program for obtaining three-dimensional position of an obstacle for use in roadside computing device
CN111666876B (en) Method and device for detecting obstacle, electronic equipment and road side equipment
CN111695497B (en) Pedestrian recognition method, medium, terminal and device based on motion information
CN114565916A (en) Target detection model training method, target detection method and electronic equipment
CN113808186B (en) Training data generation method and device and electronic equipment
CN111401190A (en) Vehicle detection method, device, computer equipment and storage medium
CN111784842B (en) Three-dimensional reconstruction method, device, equipment and readable storage medium
CN113378605B (en) Multi-source information fusion method and device, electronic equipment and storage medium
CN116978010A (en) Image labeling method and device, storage medium and electronic equipment
CN117036607A (en) Automatic driving scene data generation method and system based on implicit neural rendering
CN114648639B (en) Target vehicle detection method, system and device
WO2023283929A1 (en) Method and apparatus for calibrating external parameters of binocular camera
CN115222815A (en) Obstacle distance detection method, obstacle distance detection device, computer device, and storage medium
CN112381873A (en) Data labeling method and device
Syntakas et al. Object Detection and Navigation of a Mobile Robot by Fusing Laser and Camera Information
CN115410012B (en) Method and system for detecting infrared small target in night airport clear airspace and application
WO2024087962A1 (en) Truck bed orientation recognition system and method, and electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant