CN113887290A - Monocular 3D detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113887290A
CN113887290A
Authority
CN
China
Prior art keywords
detection result; projected; frame; pos; neg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111013236.5A
Other languages
Chinese (zh)
Inventor
安建平
郝雨萌
程新景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202111013236.5A priority Critical patent/CN113887290A/en
Publication of CN113887290A publication Critical patent/CN113887290A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/06 Topological mapping of higher dimensional structures onto lower dimensional surfaces
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a monocular 3D detection method and device, an electronic device, and a storage medium. The method includes: acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected; projecting the 3D detection result onto the picture to be detected to obtain a projected 2D detection result; iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result; and performing 2D frame labeling and 3D frame labeling of the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result. The method effectively optimizes the output 3D detection result, improves its reasonableness and stability, and improves the overall precision of monocular 3D detection.

Description

Monocular 3D detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of image recognition, and in particular to a monocular 3D detection method and device, an electronic device, and a storage medium.
Background
The current mainstream monocular 3D detection methods are end-to-end schemes represented by CenterNet, which regress the 3D attributes of a target while completing the 2D target detection task on the image, and finally output a 3D detection result.
However, the requirements on the stability and accuracy of the 3D detection result are high, and current network models cannot guarantee a lower bound on the quality of their results: erroneous detections, or detections with large errors, often occur.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a monocular 3D detection method and device, an electronic device, and a storage medium.
In a first aspect, the present invention provides a monocular 3D detection method, including:
acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected;
projecting the 3D detection result to the picture to be detected to obtain a projected 2D detection result;
iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
and performing 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
In one embodiment, the iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result includes:
determining loss information according to the 2D detection result and the projected 2D detection result;
and adjusting the target parameters in the 3D detection result according to the loss information and the preset iteration step length and the preset termination step length to obtain the adjusted 3D detection result.
In one embodiment, determining loss information according to the 2D detection result and the post-projection 2D detection result includes:
determining a 2D frame and a projection 2D frame according to the 2D detection result and the projected 2D detection result;
and determining a loss value according to the coordinate point of the 2D frame and the coordinate point of the projection 2D frame.
In one embodiment, the target parameters in the 3D detection result include depth values and orientation angles, and accordingly, the adjusting the target parameters in the 3D detection result according to the loss information and preset iteration step length and end step length to obtain an adjusted 3D detection result includes:
setting an initial iteration step step_d and a termination step step_d_end for the depth value depth, an initial iteration step step_r and a termination step step_r_end for the orientation angle rot, and an iteration step decay coefficient η; L is the loss value determined from the 2D frame and the initial projected 2D frame;
1) if step_d > step_d_end, let depth_neg = depth - step_d and depth_pos = depth + step_d, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
2) if L_neg ≥ L and L_pos ≥ L, let step_d = step_d × η and jump to 4); otherwise go to 3);
3) if L_pos < L and L_pos < L_neg, let depth = depth_pos and L = L_pos; otherwise let depth = depth_neg and L = L_neg;
4) if step_r > step_r_end, let rot_neg = rot - step_r and rot_pos = rot + step_r, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
5) if L_neg ≥ L and L_pos ≥ L, let step_r = step_r × η and jump to 1); otherwise go to 6);
6) if L_pos < L and L_pos < L_neg, let rot = rot_pos and L = L_pos; otherwise let rot = rot_neg and L = L_neg;
wherein depth_neg and depth_pos are the adjusted depth values in the two adjustment directions, rot_neg and rot_pos are the adjusted orientation angles in the two adjustment directions, and L_neg and L_pos are the loss values between the 2D frame and the projected 2D frames corresponding to the adjusted values of the depth or orientation angle in the two directions.
In one embodiment, the projecting the 3D detection result onto the to-be-detected picture to obtain a projected 2D detection result includes:
and acquiring internal parameters of the picture acquisition equipment, projecting the 3D detection result to a picture to be detected according to the internal parameters, and acquiring a projected 2D detection result.
In one embodiment, the projecting the 3D detection result to a to-be-detected picture according to the internal parameters to obtain a projected 2D detection result includes:
obtaining a 3D frame according to the 3D detection result, calculating the projected coordinates of each coordinate point according to the coordinates and internal parameters of the coordinate points of the 3D frame by adopting a projection formula, and forming the projected 2D detection result by the projected coordinates of each coordinate point;
the projection formula is:
u = fx · X / Z + px, v = fy · Y / Z + py
wherein the intrinsic parameters are (fx, fy, px, py), the coordinates of a coordinate point of the 3D frame are (X, Y, Z), and the projected coordinates are (u, v).
In a second aspect, the present invention provides a monocular 3D detection device, comprising:
the identification module is used for acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected;
the projection module is used for projecting the 3D detection result onto the picture to be detected to obtain a projected 2D detection result;
the adjusting module is used for performing iterative adjustment on the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
and the processing module is used for carrying out 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
In a third aspect, the present invention provides an electronic device, comprising a processor and a memory storing a computer program, wherein the processor implements the steps of the monocular 3D detection method of the first aspect when executing the program.
In a fourth aspect, the present invention provides a processor-readable storage medium storing a computer program for causing a processor to perform the steps of the monocular 3D detection method of the first aspect.
According to the monocular 3D detection method and device, the electronic device, and the storage medium, the projected 2D detection result is obtained by projecting the 3D detection result onto the picture to be detected, and the 3D detection result is iteratively adjusted according to the 2D detection result and the projected 2D detection result to obtain the adjusted 3D detection result, so that 2D frame labeling and 3D frame labeling of the target are performed on the picture to be detected according to the 2D detection result and the adjusted 3D detection result. The output 3D detection result is thereby effectively optimized, its reasonableness and stability are improved, and the overall precision of monocular 3D detection is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a monocular 3D detection method provided by the present invention;
FIG. 2 is a schematic diagram of frame labeling of a target on a picture to be detected according to the present invention;
FIG. 3 is a schematic structural diagram of a monocular 3D detection device provided by the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The monocular 3D detection method and device, electronic device, and storage medium of the present invention are described below with reference to fig. 1 to 4.
Fig. 1 shows a schematic flow chart of a monocular 3D detection method of the present invention, and referring to fig. 1, the method includes:
11. acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected;
12. projecting the 3D detection result to a picture to be detected to obtain a projected 2D detection result;
13. iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
14. and performing 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
For steps 11 to 14, it should be noted that, in the present invention, the trained picture target detection model is used to detect the target in the picture to be detected, so as to obtain a 2D detection result and a 3D detection result.
The image target detection model is obtained by machine learning training by taking the image characteristics of the target and the detection results corresponding to the image characteristics as input, and is used for positioning the target in the video image acquired by the monocular camera.
In the invention, the target in the video picture can be labeled in a picture frame (2D frame or 3D frame) form according to the 2D detection result and the 3D detection result, so as to realize the positioning of the target in the video picture.
However, the 3D detection result may not be accurate enough, and thus, the 3D detection result needs to be adjusted.
In the invention, the 3D detection result needs to be projected onto the picture to be detected to obtain the projected 2D detection result. The projected 2D detection result therefore also corresponds to a frame (the projected 2D frame) on the picture.
Then, the 3D detection result is iteratively adjusted by means of the constraint between the 2D detection result and the projected 2D detection result. After every adjustment, the corresponding projected 2D detection result is regenerated from the adjusted 3D detection result, and the constraint is applied again. When the iteration terminates, the adjusted 3D detection result is obtained.
And finally, carrying out 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result, thereby realizing the accurate labeling of the target in the picture.
According to the monocular 3D detection method provided by the invention, the projected 2D detection result is obtained by projecting the 3D detection result on the picture to be detected, the 3D detection result is iteratively adjusted according to the 2D detection result and the projected 2D detection result, and the adjusted 3D detection result is obtained, so that the 2D frame labeling and the 3D frame labeling are carried out on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result, the output 3D detection result is effectively optimized, the rationality and the stability of the 3D detection result are improved, and the overall precision of monocular 3D detection is improved.
In the further explanation of the above method, the explanation of the processing procedure for iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain the adjusted 3D detection result is mainly as follows:
determining loss information according to the 2D detection result and the projected 2D detection result;
and adjusting the target parameters in the 3D detection result according to the loss information and the preset iteration step length and the preset termination step length to obtain the adjusted 3D detection result.
In this regard, it should be noted that, in the present invention, the loss information determined from the 2D detection result and the projected 2D detection result is a loss value between the 2D frame corresponding to the 2D detection result and the projected 2D frame corresponding to the projected 2D detection result; this loss value characterizes the degree of coincidence between the 2D frame and the projected 2D frame.
When the loss value between the 2D frame corresponding to the 2D detection result and the projected 2D frame corresponding to the projected 2D detection result is large, the two frames do not coincide well, and the target parameters in the 3D detection result need to be adjusted. In the invention, the parameters are adjusted iteratively, and the loss value is compared in each iteration.
In the iteration process, an iteration step length and a termination step length are configured, and after the iteration is finished, the adjusted 3D detection result can be determined. And performing 3D frame labeling on the target on the picture according to the 3D detection result.
Referring to fig. 2, 21 is a target on the picture to be detected, 22 is the 2D frame of the target, 23 is the 3D frame of the target, and 24 is the projected 2D frame of the target.
In the invention, the 3D detection result contains a center point (Amodal Center offset), a depth value (Object Depth), an orientation angle (Object Orientation) and an object size (Object Dimension), all of which are parameters that could be optimized further. In practice, however, the Amodal Center offset and the Object Dimension are generally predicted accurately and need no additional optimization, so it is mainly the Object Depth and the Object Orientation that need to be optimized. Accordingly, the target parameters in the 3D detection result include the depth value and the orientation angle.
The method further adjusts and constrains the target parameters by configuring the iteration step and the termination step, which ensures gradual adjustment and makes the adjusted 3D detection result more reasonable and stable.
In the further explanation of the above method, the explanation of the processing procedure for determining the loss information according to the 2D detection result and the projected 2D detection result is mainly as follows:
determining a 2D frame and a projected 2D frame according to the 2D detection result and the projected 2D detection result;
and determining a loss value according to the coordinate point of the 2D frame and the coordinate point of the projected 2D frame.
In this regard, it should be noted that, in the present invention, the 2D frame and the projected 2D frame are determined from the 2D detection result and the projected 2D detection result. Let the coordinates of the 2D frame be (x1, y1, x2, y2) and the coordinates of the projected 2D frame be (x1', y1', x2', y2'); the loss value is then computed as L = abs(x1 - x1') + abs(y1 - y1') + abs(x2 - x2') + abs(y2 - y2'), where abs denotes the absolute value.
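As a minimal sketch, the corner-wise absolute-difference loss above can be written directly in Python (the box coordinates are ordered (x1, y1, x2, y2); the function name is illustrative, not from the patent):

```python
def box_loss(box, proj_box):
    """Sum of absolute differences between corresponding corner
    coordinates of the 2D frame and the projected 2D frame,
    both given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box, proj_box))
```

A perfect overlap gives L = 0; the worse the agreement between the detected 2D frame and the projection of the 3D frame, the larger L grows.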
In the further explanation of the above method, the processing procedure of adjusting the target parameter in the 3D detection result according to the loss information and the preset iteration step length and the end step length to obtain the adjusted 3D detection result is explained as follows:
setting an initial iteration step step_d (e.g., 10 m) and a termination step step_d_end (e.g., 0.1 m) for the depth value depth, an initial iteration step step_r (e.g., 0.3 × π) and a termination step step_r_end (e.g., 0.01 rad) for the orientation angle rot, and an iteration step decay coefficient η (e.g., 0.5); L is the loss value determined from the 2D frame and the initial projected 2D frame.
1) If step_d > step_d_end, let depth_neg = depth - step_d and depth_pos = depth + step_d, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
2) if L_neg ≥ L and L_pos ≥ L, let step_d = step_d × η and jump to 4); otherwise go to 3);
3) if L_pos < L and L_pos < L_neg, let depth = depth_pos and L = L_pos; otherwise let depth = depth_neg and L = L_neg;
4) if step_r > step_r_end, let rot_neg = rot - step_r and rot_pos = rot + step_r, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
5) if L_neg ≥ L and L_pos ≥ L, let step_r = step_r × η and jump to 1); otherwise go to 6);
6) if L_pos < L and L_pos < L_neg, let rot = rot_pos and L = L_pos; otherwise let rot = rot_neg and L = L_neg.
In this regard, it should be noted that depth_neg and depth_pos are the adjusted depth values in the two adjustment directions, and rot_neg and rot_pos are the adjusted orientation angles in the two adjustment directions. L_neg and L_pos are the loss values between the 2D frame and the projected 2D frames corresponding to those adjusted values. Once both iteration steps have decayed to their termination steps, the iteration ends, and the current depth and rot constitute the adjusted 3D detection result.
The method further adjusts and constrains the target parameters by configuring the iteration step and the termination step, which ensures gradual adjustment and makes the adjusted 3D detection result more reasonable and stable.
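The numbered procedure amounts to a derivative-free alternating search over depth and orientation. The sketch below is one reading of it, under the assumption that the loss is minimized: each variable is probed in both directions, the probe that lowers the loss is kept, and a step with no improving probe is decayed by η. Here `loss_fn` stands in for "project the adjusted 3D frame and compute the loss against the 2D frame", and all names are illustrative:

```python
import math

def refine(depth, rot, loss_fn,
           step_d=10.0, step_d_end=0.1,
           step_r=0.3 * math.pi, step_r_end=0.01, eta=0.5):
    """Alternately probe depth and orientation in both directions,
    keep whichever probe lowers the loss, and shrink a step by eta
    once neither of its probes improves (steps 1)-6) above)."""
    L = loss_fn(depth, rot)
    while step_d > step_d_end or step_r > step_r_end:
        if step_d > step_d_end:                       # steps 1)-3): depth
            L_neg = loss_fn(depth - step_d, rot)
            L_pos = loss_fn(depth + step_d, rot)
            if L_neg >= L and L_pos >= L:
                step_d *= eta                         # no improvement: decay step
            elif L_pos < L_neg:
                depth, L = depth + step_d, L_pos
            else:
                depth, L = depth - step_d, L_neg
        if step_r > step_r_end:                       # steps 4)-6): orientation
            L_neg = loss_fn(depth, rot - step_r)
            L_pos = loss_fn(depth, rot + step_r)
            if L_neg >= L and L_pos >= L:
                step_r *= eta
            elif L_pos < L_neg:
                rot, L = rot + step_r, L_pos
            else:
                rot, L = rot - step_r, L_neg
    return depth, rot, L
```

Because only the loss values are compared, the search needs no gradients of the projection, which keeps the post-processing independent of the detection network.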
In the further explanation of the above method, the processing procedure of projecting the 3D detection result onto the picture to be detected to obtain the projected 2D detection result mainly includes:
and acquiring internal parameters of the picture acquisition equipment, projecting the 3D detection result to the picture to be detected according to the internal parameters, and acquiring a projected 2D detection result.
In this regard, the internal parameters of the picture acquisition device include the focal length in the x-axis direction, the focal length in the y-axis direction, and the coordinates of the principal point.
In the present invention, projecting the 3D detection result to a to-be-detected picture according to the internal parameters to obtain a projected 2D detection result, including:
obtaining a 3D frame according to the 3D detection result, calculating the projected coordinates of each coordinate point according to the coordinates and internal parameters of the coordinate points of the 3D frame by adopting a projection formula, and forming the projected 2D detection result by the projected coordinates of each coordinate point;
the projection formula is:
u = fx · X / Z + px, v = fy · Y / Z + py
wherein the intrinsic parameters are (fx, fy, px, py), the coordinates of a coordinate point of the 3D frame are (X, Y, Z), and the projected coordinates are (u, v).
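As an illustration, the pinhole projection with intrinsics (fx, fy, px, py) can be implemented as below. Note that taking the axis-aligned bounding box of the projected corners as the projected 2D frame is an assumption on our part; the patent only states that the projected coordinates of the coordinate points form the projected 2D detection result. All function names are illustrative:

```python
def project_point(X, Y, Z, fx, fy, px, py):
    """Pinhole projection of one camera-frame point (X, Y, Z), Z > 0:
    u = fx * X / Z + px, v = fy * Y / Z + py."""
    return fx * X / Z + px, fy * Y / Z + py

def projected_2d_frame(corners_3d, fx, fy, px, py):
    """Project the corner points of the 3D frame and bound them with an
    axis-aligned box (x1, y1, x2, y2); using this bounding box as the
    projected 2D frame is our assumption, not stated in the patent."""
    uv = [project_point(X, Y, Z, fx, fy, px, py) for X, Y, Z in corners_3d]
    us = [u for u, _ in uv]
    vs = [v for _, v in uv]
    return min(us), min(vs), max(us), max(vs)
```

A point on the optical axis projects exactly onto the principal point (px, py), which is a quick sanity check for the intrinsics.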
In the following, the monocular 3D detecting device provided in the present invention is described, and the monocular 3D detecting device described below and the monocular 3D detecting method described above may be referred to correspondingly.
Fig. 3 shows a schematic structural diagram of a monocular 3D detecting device provided by the present invention, referring to fig. 3, the device includes an identification module 31, a projection module 32, an adjustment module 33, and a processing module 34, wherein:
the identification module 31 is configured to obtain a 2D detection result and a 3D detection result of the target in the to-be-detected picture;
the projection module 32 is configured to project the 3D detection result onto a to-be-detected picture to obtain a projected 2D detection result;
the adjusting module 33 is configured to perform iterative adjustment on the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
and the processing module 34 is configured to perform 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
In a further description of the above apparatus, the adjusting module is specifically configured to:
determining loss information according to the 2D detection result and the projected 2D detection result;
and adjusting the target parameters in the 3D detection result according to the loss information and the preset iteration step length and the preset termination step length to obtain the adjusted 3D detection result.
In a further description of the above apparatus, the adjusting module is specifically configured to, during a process of determining loss information according to the 2D detection result and the post-projection 2D detection result:
determining a 2D frame and a projected 2D frame according to the 2D detection result and the projected 2D detection result;
and determining a loss value according to the coordinate point of the 2D frame and the coordinate point of the projected 2D frame.
In a further description of the above apparatus, the target parameters in the 3D detection result include a depth value and an orientation angle, and accordingly, the adjusting module is specifically configured to, during a processing procedure of adjusting the target parameters in the 3D detection result according to the loss information and a preset iteration step length and a preset termination step length to obtain an adjusted 3D detection result:
setting an initial iteration step step_d and a termination step step_d_end for the depth value depth, an initial iteration step step_r and a termination step step_r_end for the orientation angle rot, and an iteration step decay coefficient η; L is the loss value determined from the 2D frame and the initial projected 2D frame;
1) if step_d > step_d_end, let depth_neg = depth - step_d and depth_pos = depth + step_d, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
2) if L_neg ≥ L and L_pos ≥ L, let step_d = step_d × η and jump to 4); otherwise go to 3);
3) if L_pos < L and L_pos < L_neg, let depth = depth_pos and L = L_pos; otherwise let depth = depth_neg and L = L_neg;
4) if step_r > step_r_end, let rot_neg = rot - step_r and rot_pos = rot + step_r, recompute the projection of the 3D detection result for each, and compute the loss values L_neg and L_pos between the resulting projected 2D frames and the 2D frame;
5) if L_neg ≥ L and L_pos ≥ L, let step_r = step_r × η and jump to 1); otherwise go to 6);
6) if L_pos < L and L_pos < L_neg, let rot = rot_pos and L = L_pos; otherwise let rot = rot_neg and L = L_neg;
wherein depth_neg and depth_pos are the adjusted depth values in the two adjustment directions, rot_neg and rot_pos are the adjusted orientation angles in the two adjustment directions, and L_neg and L_pos are the loss values between the 2D frame and the projected 2D frames corresponding to the adjusted values of the depth or orientation angle in the two directions.
In a further description of the above apparatus, the projection module is specifically configured to:
and acquiring internal parameters of the picture acquisition equipment, projecting the 3D detection result to the picture to be detected according to the internal parameters, and acquiring a projected 2D detection result.
In further description of the above apparatus, the projection module is specifically configured to, in a process of projecting the 3D detection result onto the to-be-detected picture according to the internal parameters to obtain the projected 2D detection result:
obtaining a 3D frame according to the 3D detection result, calculating the projected coordinates of each coordinate point according to the coordinates and internal parameters of the coordinate points of the 3D frame by adopting a projection formula, and forming the projected 2D detection result by the projected coordinates of each coordinate point;
the projection formula is:
u = fx · X / Z + px, v = fy · Y / Z + py
wherein the intrinsic parameters are (fx, fy, px, py), the coordinates of a coordinate point of the 3D frame are (X, Y, Z), and the projected coordinates are (u, v).
Since the principle of the device according to the embodiment of the present invention is the same as that of the method according to the above embodiment, the details are not repeated here.
It should be noted that, in the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
According to the monocular 3D detection device, the projected 2D detection result is obtained by projecting the 3D detection result onto the picture to be detected, the 3D detection result is iteratively adjusted according to the 2D detection result and the projected 2D detection result, and the adjusted 3D detection result is obtained, so that the 2D frame labeling and the 3D frame labeling are carried out on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result, the output 3D detection result is effectively optimized, the rationality and the stability of the 3D detection result are improved, and the overall precision of monocular 3D detection is improved.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor 41, a communication interface 42, a memory 43 and a communication bus 44, wherein the processor 41, the communication interface 42 and the memory 43 communicate with each other through the communication bus 44. The processor 41 may call a computer program in the memory 43 to perform the steps of the monocular 3D detection method, for example comprising: acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected; projecting the 3D detection result onto the picture to be detected to obtain a projected 2D detection result; iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result; and performing 2D frame labeling and 3D frame labeling of the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
Furthermore, the logic instructions in the memory 43 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the monocular 3D detection method provided by the above methods, the method comprising: acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected; projecting the 3D detection result to a picture to be detected to obtain a projected 2D detection result; iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result; and performing 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
On the other hand, an embodiment of the present application further provides a processor-readable storage medium storing a computer program, where the computer program is configured to cause the processor to execute the monocular 3D detection method provided in each of the above embodiments, the method comprising, for example: acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected; projecting the 3D detection result onto the picture to be detected to obtain a projected 2D detection result; iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result; and performing 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
The processor-readable storage medium can be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A monocular 3D detection method, comprising:
acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected;
projecting the 3D detection result to the picture to be detected to obtain a projected 2D detection result;
iteratively adjusting the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
and performing 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
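As an editorial sketch only, the four steps of the claimed method can be orchestrated as below. Every name is illustrative and not from the patent; the concrete 2D detector, 3D detector, projection, and iterative refinement are injected as callables so the control flow alone is shown:

```python
def monocular_3d_pipeline(detect_2d, detect_3d, project, refine, picture):
    """Orchestrate the four claimed steps with injected callables.

    1. obtain the 2D and 3D detection results for the picture;
    2. project the 3D result onto the picture to get a projected 2D result;
    3. iteratively refine the 3D result against the 2D result and its projection;
    4. return both results for 2D-frame and 3D-frame labeling.
    All names here are illustrative assumptions, not the patent's API.
    """
    box_2d = detect_2d(picture)                  # step 1: 2D detection result
    box_3d = detect_3d(picture)                  # step 1: 3D detection result
    box_3d_proj = project(box_3d)                # step 2: projected 2D result
    box_3d_adj = refine(box_3d, box_2d, box_3d_proj)  # step 3: adjustment
    return box_2d, box_3d_adj                    # step 4: inputs for labeling
```

The callables would in practice be a trained detector, the claim-6 projection, and the claim-4 iterative search.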
2. The monocular 3D detecting method according to claim 1, wherein the iteratively adjusting the 3D detecting result according to the 2D detecting result and the projected 2D detecting result to obtain an adjusted 3D detecting result comprises:
determining loss information according to the 2D detection result and the projected 2D detection result;
and adjusting the target parameters in the 3D detection result according to the loss information, the preset iteration step and the preset termination step, to obtain the adjusted 3D detection result.
3. The monocular 3D detecting method according to claim 2, wherein determining loss information according to the 2D detecting result and the post-projection 2D detecting result comprises:
determining a 2D frame and a projection 2D frame according to the 2D detection result and the projected 2D detection result;
and determining a loss value according to the coordinate point of the 2D frame and the coordinate point of the projection 2D frame.
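The claim leaves the exact loss formula open. One plausible choice, shown purely as an assumption, is the L1 distance between corresponding corner coordinates of the detected 2D frame and the projected 2D frame:

```python
def frame_loss(box_2d, box_proj):
    """Loss between the detected 2D frame and the projected 2D frame.

    Both frames are given as corner coordinates (u_min, v_min, u_max, v_max).
    The patent only states that the loss is computed from the frames'
    coordinate points; this L1 formula is an illustrative assumption.
    """
    return sum(abs(a - b) for a, b in zip(box_2d, box_proj))
```

Any other coordinate-point discrepancy (e.g. an overlap-based measure) would fit the claim equally well.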
4. The monocular 3D detecting method according to claim 3, wherein the target parameters in the 3D detection result include a depth value and an orientation angle, and accordingly the adjusting of the target parameters in the 3D detection result according to the loss information, the preset iteration step and the preset termination step to obtain the adjusted 3D detection result includes:
setting an initial iteration step step_d and a termination step step_d_end for the depth value depth, an initial iteration step step_r and a termination step step_r_end for the orientation angle rot, and an iteration-step decay coefficient η; L is the loss value determined from the 2D frame and the initial projected 2D frame;
1) if step_d > step_d_end, let depth_neg = depth - step_d and depth_pos = depth + step_d, recalculate the 3D detection result projection, and calculate the loss values L_neg and L_pos between the 2D frame and the projected 2D frames;
2) if L_neg ≤ L and L_pos ≤ L, let step_d = step_d * η and jump to 4); otherwise go to 3);
3) if L_pos > L and L_pos > L_neg, let depth = depth + step_d and L = L_pos; otherwise let depth = depth - step_d and L = L_neg;
4) if step_r > step_r_end, let rot_neg = rot - step_r and rot_pos = rot + step_r, recalculate the 3D detection result projection, and calculate the loss values L_neg and L_pos between the 2D frame and the projected 2D frames;
5) if L_neg ≤ L and L_pos ≤ L, let step_r = step_r * η and jump to 1); otherwise go to 6);
6) if L_pos > L and L_pos > L_neg, let rot = rot + step_r and L = L_pos; otherwise let rot = rot - step_r and L = L_neg;
wherein depth_neg and depth_pos are the adjustment values of the depth value in its two adjustment directions, rot_neg and rot_pos are the adjustment values of the orientation angle in its two adjustment directions, and L_neg and L_pos are the loss values between the 2D frame and the projected 2D frame corresponding to the adjustment values of the depth value or orientation angle in the two adjustment directions, respectively.
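The stepwise procedure above interleaves adjustments of depth and rot; for a single parameter it reduces to a step-decay line search. The sketch below is illustrative only: the names are not from the patent, and because the translated comparisons move toward the larger value of L, the sketch treats L as a quantity the search improves. Depending on how the loss is actually defined, the comparisons may need to be inverted.

```python
def adjust_parameter(value, objective, step, step_end, eta):
    """Step-decay search over a single 3D-box parameter (depth or rot).

    Follows the claimed control flow: probe value - step and value + step,
    decay the step by eta when neither probe improves the current L,
    otherwise move to the side with the larger L. Handles one parameter
    for clarity; the claim alternates between depth and rot.
    """
    L = objective(value)
    while step > step_end:
        L_neg = objective(value - step)   # probe the negative direction
        L_pos = objective(value + step)   # probe the positive direction
        if L_neg <= L and L_pos <= L:     # no improvement: shrink the step
            step *= eta
            continue
        if L_pos > L and L_pos > L_neg:   # positive direction is best
            value, L = value + step, L_pos
        else:                             # otherwise take the negative step
            value, L = value - step, L_neg
    return value, L
```

With a smooth objective the search first climbs in whole steps, then the decaying step refines the estimate until it falls below the termination step.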
5. The monocular 3D detecting method according to claim 1, wherein the projecting the 3D detecting result onto the frame to be detected to obtain a projected 2D detecting result includes:
and acquiring internal parameters of the picture acquisition equipment, projecting the 3D detection result to a picture to be detected according to the internal parameters, and acquiring a projected 2D detection result.
6. The monocular 3D detecting method according to claim 5, wherein the projecting the 3D detecting result to a picture to be detected according to the internal reference to obtain a projected 2D detecting result comprises:
obtaining a 3D frame according to the 3D detection result, calculating the projected coordinates of each coordinate point according to the coordinates and internal parameters of the coordinate points of the 3D frame by adopting a projection formula, and forming the projected 2D detection result by the projected coordinates of each coordinate point;
the projection formula includes:
Figure FDA0003239580240000031
wherein the internal reference is (f)x,fy,px,py) The coordinates of the coordinate point of the 3D frame are (X, Y, Z), and the coordinates after projection are (u, v).
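The formula of claim 6 is the standard pinhole projection. A minimal sketch (helper names are illustrative, not from the patent) projects one camera-frame point and derives a tight 2D frame from a box's projected corners:

```python
def project_point(X, Y, Z, fx, fy, px, py):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates.

    Implements u = fx * X / Z + px and v = fy * Y / Z + py, matching
    the internal reference (fx, fy, px, py) of claim 6.
    """
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return fx * X / Z + px, fy * Y / Z + py


def project_box_corners(corners, fx, fy, px, py):
    """Project 3D box corners and return the tight 2D frame around them."""
    uv = [project_point(X, Y, Z, fx, fy, px, py) for X, Y, Z in corners]
    us, vs = zip(*uv)
    return min(us), min(vs), max(us), max(vs)  # (u_min, v_min, u_max, v_max)
```

The bounding frame of the eight projected corners is one natural way to obtain the "projected 2D detection result" compared against the 2D frame in claim 3.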
7. A monocular 3D detecting device, comprising:
the identification module is used for acquiring a 2D detection result and a 3D detection result of a target in a picture to be detected;
the projection module is used for projecting the 3D detection result onto the picture to be detected to obtain a projected 2D detection result;
the adjusting module is used for performing iterative adjustment on the 3D detection result according to the 2D detection result and the projected 2D detection result to obtain an adjusted 3D detection result;
and the processing module is used for carrying out 2D frame labeling and 3D frame labeling on the target on the picture to be detected according to the 2D detection result and the adjusted 3D detection result.
8. The monocular 3D detecting device according to claim 7, wherein the adjusting module is specifically configured to:
determining loss information according to the 2D detection result and the projected 2D detection result;
and adjusting the target parameters in the 3D detection result according to the loss information, the preset iteration step and the preset termination step, to obtain the adjusted 3D detection result.
9. An electronic device comprising a processor and a memory storing a computer program, wherein the steps of the monocular 3D detection method of any one of claims 1 to 6 are implemented when the processor executes the computer program.
10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program for causing a processor to perform the steps of the monocular 3D detection method of any one of claims 1 to 6.
CN202111013236.5A 2021-08-31 2021-08-31 Monocular 3D detection method and device, electronic equipment and storage medium Pending CN113887290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111013236.5A CN113887290A (en) 2021-08-31 2021-08-31 Monocular 3D detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113887290A true CN113887290A (en) 2022-01-04

Family

ID=79011422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111013236.5A Pending CN113887290A (en) 2021-08-31 2021-08-31 Monocular 3D detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113887290A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN114842287B (en) * 2022-03-25 2022-12-06 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination