CN115359326A - Monocular 3D target detection method and device - Google Patents

Monocular 3D target detection method and device

Info

Publication number
CN115359326A
Authority
CN
China
Prior art keywords
monocular
target
image
target detection
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210933560.7A
Other languages
Chinese (zh)
Inventor
陆强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inceptio Star Intelligent Technology Shanghai Co Ltd
Original Assignee
Inceptio Star Intelligent Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inceptio Star Intelligent Technology Shanghai Co Ltd filed Critical Inceptio Star Intelligent Technology Shanghai Co Ltd
Priority to CN202210933560.7A priority Critical patent/CN115359326A/en
Publication of CN115359326A publication Critical patent/CN115359326A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a monocular 3D target detection method and a monocular 3D target detection device, wherein the method comprises the following steps: acquiring an image to be detected in a target time period; inputting the image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model performs feature fusion on adjacent-frame image features extracted from adjacent frames of the image to be detected and performs monocular 3D target detection on the fused features. By fusing the image features extracted from adjacent frames of the image to be detected, the method improves the robustness of feature extraction, which in turn facilitates improving the precision of subsequent monocular 3D target detection; in addition, redundant information produced by the correlation between different frames is eliminated, reducing the model's computation load.

Description

Monocular 3D target detection method and device
Technical Field
The invention relates to the technical field of image recognition, in particular to a monocular 3D target detection method and device.
Background
In an automatic driving system, 3D object detection (Monocular 3D Object Detection) is a very important task in the perception module; the downstream prediction, planning and motion-control modules all depend on the detection results for objects of specific classes around the host vehicle. Monocular 3D target detection performs category estimation and 3D bounding box regression for surrounding targets based on single-frame images, and its inherent low cost gives it wide application prospects and commercial value in the fields of automatic driving and robotics.
However, existing monocular 3D detection identifies foreground targets in a 2D visual image from the image viewpoint alone and outputs the category, position and posture of each target for the automatic driving perception task. Only two-dimensional information of a labeled object, such as its length, width and position in the image coordinate system, can be acquired from the 2D image; it is difficult to determine the three-dimensional information of the object in the real-world coordinate system, such as its real length, width, height, yaw angle and distance. As a result, the rationality of a 3D box cannot be judged and the accuracy of the 3D detection result cannot be ensured.
Disclosure of Invention
The invention provides a monocular 3D target detection method and device, which are used for solving the defect that the accuracy of a 3D detection result cannot be ensured based on a 2D image in the prior art, effectively optimizing the output 3D detection result, improving the rationality and stability of the 3D detection result and improving the overall accuracy of monocular 3D detection.
The invention provides a monocular 3D target detection method, which comprises the following steps: acquiring an image to be detected in a target time period; inputting the image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
According to the monocular 3D target detection method provided by the present invention, the monocular 3D target detection model includes: the characteristic extraction layer is used for extracting the characteristics of the image to be detected to obtain the corresponding image characteristics; the characteristic fusion layer aligns the prior image characteristics corresponding to the prior image to be detected in the adjacent frame image characteristics to the current image characteristics corresponding to the current image to be detected, and performs characteristic fusion to obtain fusion characteristics; and the monocular 3D target detection layer is used for performing monocular 3D detection on the fusion characteristics to obtain a target detection result.
According to the monocular 3D target detection method provided by the invention, the method for aligning the prior image characteristics corresponding to the prior image to be detected in the adjacent frame image characteristics to the current image characteristics corresponding to the current image to be detected and performing characteristic fusion to obtain fusion characteristics comprises the following steps: converting the prior image features into a global coordinate system of the current image to be detected to obtain the prior image global features; converting the global feature of the prior image into a pixel coordinate system of the feature of the current image to obtain an alignment feature; and performing feature fusion on the alignment feature and the current image feature to obtain a fusion feature.
According to the monocular 3D target detection method provided by the present invention, the monocular 3D target detection model further includes: a characteristic flattening layer, which is used for flattening the image characteristics extracted by the characteristic extraction layer; and the full connection layer is used for integrating the image characteristics subjected to the flattening operation to obtain the image characteristics subjected to corresponding integration.
According to the monocular 3D object detection method provided by the invention, before integrating the image features after the flattening operation, the method comprises the following steps: and converting the image characteristics into a pixel coordinate system corresponding to the image to be detected based on a preset camera external parameter matrix.
According to the monocular 3D target detection method provided by the invention, the training of the monocular 3D target detection model comprises the following steps: acquiring a training picture and a target truth value corresponding to the training picture; inputting the training picture into a monocular 3D target detection model to be trained to obtain a first target prediction result and a second target prediction result output by the monocular 3D target detection model to be trained; the first target prediction result is obtained by performing monocular 3D target detection on the model to be trained based on training picture features extracted from the training pictures, and the second target prediction result is obtained by performing feature fusion on the model to be trained based on adjacent frame training picture features extracted from adjacent frame training pictures and performing monocular 3D target detection by using the fusion training features after feature fusion; constructing a first loss function based on the first target prediction result and the target truth value, and constructing a second loss function based on the second target prediction result and the target truth value; and obtaining a total loss function based on the first loss function and the second loss function, and judging to finish training based on the convergence of the total loss function.
According to the monocular 3D object detection method provided by the present invention, the object true value includes a first object true value and a second object true value, the first object true value includes a velocity true value obtained based on an absolute motion velocity of the training picture, and the second object true value includes a velocity true value obtained based on an inter-frame relative displacement of an adjacent frame training picture;
in constructing the first loss function, further comprising: constructing the first loss function based on the first target prediction result and the first target truth value;
in constructing the second loss function, the method further includes: constructing the second loss function based on the second target prediction result and the second target truth value.
The present invention also provides a monocular 3D target detecting device, comprising: the image acquisition module is used for acquiring an image to be detected in a target time period; the monocular 3D target detection module is used for inputting the image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on fusion features after the feature fusion.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the monocular 3D object detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the monocular 3D object detecting method as any one of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the monocular 3D object detecting method as described in any one of the above.
According to the monocular 3D target detection method and device provided by the invention, feature fusion is performed on the image features extracted from adjacent frames of the image to be detected, so that the robustness of feature extraction is improved, which further improves the precision of subsequent monocular 3D target detection; in addition, redundant information produced by the correlation between different frames is eliminated, and the model computation load is reduced.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a monocular 3D object detection method provided by the present invention;
FIG. 2 is a schematic diagram of a monocular 3D object detection model according to one embodiment of the present invention;
FIG. 3 is a second schematic structural diagram of a monocular 3D object detection model provided in the present invention;
FIG. 4 is a schematic flow chart of training a monocular 3D object detection model provided by the present invention;
FIG. 5 is a schematic structural diagram of a monocular 3D object detection model to be trained provided by the present invention;
FIG. 6 is a schematic structural diagram of a monocular 3D object detecting device provided in the present invention;
FIG. 7 is a schematic diagram of a training module according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a schematic flow chart of a monocular 3D object detection method of the present invention, which includes:
s11, acquiring an image to be detected in a target time period;
s12, inputting the image to be detected into the monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
It should be noted that the step numbering S11, S12 in this specification does not indicate the execution order of the monocular 3D object detection method; the method of the present invention is described below with reference to fig. 2 to fig. 3.
And S11, acquiring an image to be detected in a target time period.
In this embodiment, acquiring the image to be detected in the target time period includes: acquiring a video stream for the target area to be detected; and extracting the image to be detected in the target time period from the video stream, where the image to be detected is a sequence of image frames. Further, extracting the image to be detected in the target time period includes: acquiring, from the video stream, a certain number of frames before and after the current frame to be detected, for example five frames before and five frames after the current frame, so that the subsequent monocular 3D target detection model can fuse the image features extracted from two consecutive frames to be detected, improving target detection precision.
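For illustration only, the following is a minimal sketch of extracting the current frame plus a fixed window of neighbouring frames from a video stream with OpenCV; the five-frame window mirrors the example above, while the function name and the frame-index interface are assumptions, not details of this application.

```python
import cv2

def extract_frames(video_path: str, current_idx: int, window: int = 5):
    """Return the frames in [current_idx - window, current_idx + window] as a list."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    start = max(current_idx - window, 0)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)          # seek to the first frame of the window
    for _ in range(start, current_idx + window + 1):
        ok, frame = cap.read()
        if not ok:                                   # stream ended before the window was filled
            break
        frames.append(frame)
    cap.release()
    return frames
```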
In an optional embodiment, acquiring the image to be detected in the target time period includes: and acquiring a plurality of frames of continuously shot images to be detected in a target time period.
It should be noted that the image to be detected is captured by a shooting device on the vehicle, where the shooting device may be a radar, a sensor, or a camera on the vehicle body; the source of the image to be detected is not further limited here. In addition, the vehicle may be any vehicle for carrying people or goods, such as an automobile, a ship or an airplane, where the automobile may be a private car or an operating vehicle, such as a shared car, a ride-hailing car, a taxi, a bus, a school bus, a truck, a passenger car, a train, a subway or a tram.
S12, inputting an image to be detected into the monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
In this embodiment, referring to fig. 2, the monocular 3D object detection model includes: a feature extraction layer that extracts features from the image to be detected to obtain the corresponding image features; a feature fusion layer that aligns the prior image features, corresponding to the prior image to be detected, among the adjacent-frame image features to the current image features corresponding to the current image to be detected, and performs feature fusion to obtain fused features; and a monocular 3D target detection layer that performs monocular 3D detection on the fused features to obtain the target detection result. It should be noted that the target detection result includes: a target center point detection result (hm), a target box width and height detection result (wh), a center point offset detection result (offset), a direction angle detection result (rot), a 3D size detection result (dim), a speed detection result (vel) and a depth detection result (depth).
It should be noted that the feature extraction layer may adopt a backbone network; after features are extracted with the feature extraction layer, they are decoded by a decoder, and the decoded image features are fed into the feature fusion layer. The monocular 3D target detection layer may adopt a detection head; to make better use of the features extracted by the backbone network, a detection neck is further arranged between the backbone network and the detection head to extract more complex features, and the detection head then performs detection on the features extracted by the neck and the backbone.
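As a hedged illustration, the sketch below shows one way the detection-head branches named above (hm, wh, offset, rot, dim, vel, depth) could be organized in PyTorch; the CenterNet-style branch structure, the channel counts and the single-class heat map are assumptions not specified in this application.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """One 3x3 conv + 1x1 conv branch per detection output named in the text."""
    def __init__(self, in_channels: int = 64, num_classes: int = 1):
        super().__init__()
        # assumed output channels per branch
        branch_channels = {"hm": num_classes, "wh": 2, "offset": 2,
                           "rot": 2, "dim": 3, "vel": 2, "depth": 1}
        self.branches = nn.ModuleDict({
            name: nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_ch, 1))
            for name, out_ch in branch_channels.items()})

    def forward(self, fused_feature: torch.Tensor) -> dict:
        # each branch predicts its own map over the fused feature
        return {name: branch(fused_feature) for name, branch in self.branches.items()}
```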
Furthermore, aligning the prior image features corresponding to the prior image to be detected, among the adjacent-frame image features, to the current image features corresponding to the current image to be detected, and performing feature fusion to obtain fused features includes: converting the prior image features into the global coordinate system of the current image to be detected to obtain prior-image global features; converting the prior-image global features into the pixel coordinate system of the current image features to obtain aligned features; and performing feature fusion on the aligned features and the current image features to obtain the fused features. It should be noted that feature fusion is performed by aligning the prior image features to the current image features, so that the difference information between the prior image and the current image is fused; this improves the robustness of feature extraction and thereby the accuracy of subsequent monocular 3D target detection. In addition, redundant information produced by the correlation between different frames is eliminated, and the model computation load is reduced.
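For illustration of the final fusion step only, the sketch below concatenates the aligned prior-frame feature with the current-frame feature and mixes them with a 1x1 convolution; concatenation followed by a convolution is one possible fusion scheme assumed here, since this application does not fix the fusion operator.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse the aligned prior-frame feature with the current-frame feature."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # 1x1 conv mixes the concatenated channels back to the original width
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, aligned_prev: torch.Tensor, current: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([aligned_prev, current], dim=1))
```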
It should be added that converting the prior image features into the global coordinate system of the current image to be detected includes: converting the prior image features into the global coordinate system of the prior image to be detected based on pre-constructed transformation matrices, namely a vehicle-to-global coordinate system transformation matrix, a radar-to-vehicle coordinate system transformation matrix, a camera-to-radar coordinate system transformation matrix and a camera-to-pixel coordinate system transformation matrix, to obtain first prior image features; and converting the first prior image features into the global coordinate system of the current image to be detected based on pre-acquired sensor information to obtain the prior-image global features. The sensor information may be obtained from sensors such as a Global Positioning System (GPS) and an Inertial Measurement Unit (IMU).
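The following sketch shows, under simplifying assumptions, how a pixel location with a known depth could be lifted into the global frame by chaining the transformation matrices listed above (pixel -> camera -> radar -> vehicle -> global); the per-location depth and the homogeneous 4x4 matrix representation are illustrative assumptions, not details given in this application.

```python
import numpy as np

def lift_pixel_to_global(u: float, v: float, depth: float,
                         K: np.ndarray,                 # 3x3 intrinsics (camera -> pixel)
                         T_cam_to_radar: np.ndarray,    # 4x4 homogeneous transforms
                         T_radar_to_vehicle: np.ndarray,
                         T_vehicle_to_global: np.ndarray) -> np.ndarray:
    """Back-project a pixel with known depth into the global coordinate system."""
    # pixel -> camera: invert the intrinsic projection and scale by depth
    xyz_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    p_cam = np.append(xyz_cam, 1.0)                     # homogeneous coordinates
    # camera -> radar -> vehicle -> global, chained left to right
    p_global = T_vehicle_to_global @ T_radar_to_vehicle @ T_cam_to_radar @ p_cam
    return p_global[:3]
```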
In an alternative embodiment, referring to fig. 3, in order to reduce the noise introduced by feature fusion, the monocular 3D object detection model further includes: a feature flattening (flatten) layer that flattens the image features extracted by the feature extraction layer; and a fully connected layer that integrates the flattened image features to obtain the corresponding integrated image features. It should be noted that the fully connected layer integrates the features flattened by the flattening layer to obtain dimension-reconstructed image features, which reduces the difference between adjacent frames of the image to be detected, prevents noise from being introduced during feature fusion, and reduces negative effects.
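A minimal sketch of the flattening and fully connected integration described above is given below, assuming the integrated features are reshaped back to the original (C, H, W) layout; the hidden size equal to the input size and the reshape back to a feature map are assumptions.

```python
import torch
import torch.nn as nn

class FlattenAndIntegrate(nn.Module):
    """Flatten the feature map, integrate it with a fully connected layer,
    and reshape back to the original spatial layout (dimension reconstruction)."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.shape = (channels, height, width)
        n = channels * height * width
        self.flatten = nn.Flatten(start_dim=1)   # keep the batch dimension
        self.fc = nn.Linear(n, n)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        out = self.fc(self.flatten(feat))
        return out.view(feat.size(0), *self.shape)
```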
In an alternative embodiment, before integrating the image features after the flattening operation, the method comprises the following steps: and converting the image characteristics into a pixel coordinate system corresponding to the image to be detected based on a preset camera external parameter matrix. The camera external reference matrix is preset so as to conveniently convert the image characteristics of the image to be detected from the world coordinate system to the camera coordinate system, so that the image characteristics are conveniently analyzed by combining the pose of the camera, and the influence of the pose of the camera on the subsequent target detection result is avoided.
In an alternative embodiment, referring to fig. 4, before inputting the image to be detected into the monocular 3D object detecting model, the method further includes: and training a monocular 3D target detection model. The method specifically comprises the following steps:
s41, acquiring a training picture and a target truth value corresponding to the training picture;
s42, inputting a training picture into the monocular 3D target detection model to be trained to obtain a first target prediction result and a second target prediction result output by the monocular 3D target detection model to be trained; the first target prediction result is obtained by performing monocular 3D target detection on the model to be trained based on training picture features extracted from training pictures, and the second target prediction result is obtained by performing feature fusion on the model to be trained based on adjacent frame training picture features extracted from adjacent frame training pictures and performing monocular 3D target detection by using the fusion training features after feature fusion;
s43, constructing a first loss function based on the first target prediction result and the target true value, and constructing a second loss function based on the second target prediction result and the target true value;
and S44, obtaining a total loss function based on the first loss function and the second loss function, and determining that training is complete when the total loss function converges.
It should be noted that the step numbering S41 to S44 in this specification does not indicate the execution order; the training of the monocular 3D object detection model of the present invention is described below with reference to fig. 5.
Step S41, a training picture and a target true value corresponding to the training picture are obtained.
In this embodiment, obtaining the training picture and the target truth value corresponding to the training picture includes: collecting training videos or pictures, and screening out videos or pictures containing target information as effective training pictures; and marking the effective training picture to obtain a target truth value.
In an alternative embodiment, the target truth values include a first target truth value and a second target truth value; the first target truth value includes a speed truth value obtained from the absolute motion speed in the training picture, and the second target truth value includes a speed truth value obtained from the inter-frame relative displacement between adjacent training pictures. It should be noted that using the inter-frame relative displacement between adjacent training pictures as the speed truth value narrows the range of the regressed speed when the monocular 3D target detection model to be trained subsequently makes predictions, which improves the accuracy of the prediction result and facilitates model training; using the absolute motion speed as the speed truth value improves the detection precision when the model subsequently performs monocular 3D target detection directly on the training picture.
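For illustration, the sketch below derives the two kinds of speed truth value described above from the object centre positions of two adjacent frames; the variable names, the 2D (x, y) representation and the frame interval dt are assumptions used only for this example.

```python
def speed_truth_values(pos_prev, pos_curr, dt):
    """pos_prev / pos_curr: (x, y) object centres in the global frame; dt: frame interval (s).
    Returns the absolute-speed truth (first target truth) and the inter-frame
    relative-displacement truth (second target truth)."""
    dx = pos_curr[0] - pos_prev[0]
    dy = pos_curr[1] - pos_prev[1]
    relative_displacement = (dx, dy)            # truth for the fused (second) branch
    absolute_speed = (dx / dt, dy / dt)         # truth for the single-frame (first) branch
    return absolute_speed, relative_displacement
```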
In addition, when collecting training videos or pictures, targets can be recorded in different vehicle driving environments, and videos of different targets can be recorded. Target pictures may also be downloaded from the Internet, or photographs of different targets may be used as training pictures.
In an optional embodiment, after obtaining the training picture and the corresponding target truth value, the method further includes: performing data enhancement on the training samples using a data enhancement strategy. The data enhancement strategy includes image scaling, horizontal mirror flipping, random brightness and hue adjustment, and the like; the label information of each target is kept unchanged, while the bounding-box coordinate information is updated according to the corresponding geometric transformation, as in the sketch below.
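The sketch illustrates only one of the augmentations listed above (horizontal mirror flip with the corresponding bounding-box update); the [x_min, y_min, x_max, y_max] box format is an assumption, and scaling and brightness/hue jitter are omitted for brevity.

```python
import random
import cv2

def augment(image, boxes):
    """Randomly mirror the image horizontally and mirror the box x coordinates."""
    if random.random() < 0.5:
        h, w = image.shape[:2]
        image = cv2.flip(image, 1)                        # flip around the vertical axis
        boxes = [[w - x_max, y_min, w - x_min, y_max]     # mirror x coordinates only
                 for x_min, y_min, x_max, y_max in boxes]
    return image, boxes
```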
S42, inputting a training picture into the monocular 3D target detection model to be trained to obtain a first target prediction result and a second target prediction result output by the monocular 3D target detection model to be trained; the first target prediction result is obtained by performing monocular 3D target detection on the model to be trained based on training picture features obtained by extracting training pictures, and the second target prediction result is obtained by performing feature fusion on the model to be trained based on adjacent frame training picture features obtained by extracting adjacent frame training pictures and performing monocular 3D target detection by using the fusion training features after feature fusion.
In this embodiment, referring to fig. 5, the monocular 3D object detection model to be trained includes: the characteristic extraction layer is used for extracting the characteristics of the training pictures to obtain the characteristics of the corresponding training pictures; the feature fusion layer aligns the prior training picture features in the training picture features of the adjacent frames to the current training picture features, and performs feature fusion to obtain fusion training features; the first monocular 3D target detection layer is used for carrying out 3D monocular detection on the training picture features extracted by the feature extraction layer to obtain a first target prediction result; and the second monocular 3D target detection layer performs monocular 3D detection on the fusion training characteristics to obtain a second target prediction result.
In an optional embodiment, the monocular 3D object detection model to be trained further includes: a feature flattening (flattening) layer, which is used for flattening the training picture features extracted by the feature extraction layer; and the full connection layer is used for integrating the training picture characteristics after flattening operation to obtain the corresponding integrated training picture characteristics.
In an optional embodiment, before integrating the training picture features after the flattening operation, the method includes: and converting the training picture characteristics into a pixel coordinate system corresponding to the training picture based on the preset camera external parameter matrix.
Step S43, construct a first loss function based on the first target prediction result and the target true value, and construct a second loss function based on the second target prediction result and the target true value.
It should be noted that when the target truth values include the first target truth value and the second target truth value, when constructing the first loss function, further including: constructing a first loss function based on the first target prediction result and the first target truth value; when constructing the second loss function, the method further comprises: and constructing a second loss function based on the second target prediction result and the second target truth value.
In this embodiment, the first loss function is expressed as:
L1 = (l_wh + l_offset + l_rot + l_dim + l_vel + l_depth) * weight_reg + l_hm * weight_hm
where weight_reg and weight_hm are loss weights and can be set according to prior experience or actual design requirements; for example, weight_reg may be 1 and weight_hm may be 2. The loss formulas for l_wh, l_offset, l_rot, l_dim, l_vel and l_depth may follow the smooth L1 regression loss, and the loss formula for l_hm is a focal loss, which will not be further described here.
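For illustration, the sketch below assembles the first loss L1 exactly as in the formula above; using PyTorch's smooth_l1_loss for the regression terms and torchvision's sigmoid_focal_loss for the heat map is an assumption consistent with the loss types named in the text, not a statement of the exact losses used in this application.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def first_loss(pred: dict, truth: dict,
               weight_reg: float = 1.0, weight_hm: float = 2.0) -> torch.Tensor:
    """Compute L1 = sum(regression losses) * weight_reg + heat-map loss * weight_hm."""
    reg_terms = ["wh", "offset", "rot", "dim", "vel", "depth"]
    l_reg = sum(F.smooth_l1_loss(pred[k], truth[k]) for k in reg_terms)
    l_hm = sigmoid_focal_loss(pred["hm"], truth["hm"], reduction="mean")
    return l_reg * weight_reg + l_hm * weight_hm
```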
It should be noted that the second loss function may be set with reference to the first loss function, and will not be described repeatedly here.
And S44, obtaining a total loss function based on the first loss function and the second loss function, converging based on the total loss function, and judging to finish training.
In this embodiment, the total loss function is expressed as:
Total loss: L = λ * L1 + L2
where λ is the weight of the first loss function L1 and may take a value of, for example, 0.5, and L2 represents the second loss function.
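As a hedged illustration, one training iteration combining the two branches according to the total loss above might look as follows; the model returning a pair of predictions, and reusing the same loss form (the first_loss sketch above) for the second branch with the second target truth values, are assumptions suggested by, but not fixed in, the text.

```python
def training_step(model, images, truth_first, truth_second, optimizer, lam: float = 0.5):
    """One training iteration for the two-branch model of steps S42-S44."""
    pred_first, pred_second = model(images)      # single-frame and fused-feature predictions
    loss_1 = first_loss(pred_first, truth_first)
    loss_2 = first_loss(pred_second, truth_second)
    loss = lam * loss_1 + loss_2                 # total loss L = λ * L1 + L2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```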
In summary, the embodiment of the invention performs feature fusion on the image features extracted based on the to-be-detected image of the adjacent frame, so as to improve the robustness of feature extraction, and further facilitate the improvement of the precision of subsequent monocular 3D target detection; in addition, redundant information generated by correlation among different frame images is eliminated, and the model calculation amount is reduced.
In the following, the monocular 3D object detecting device provided in the present invention is described, and the monocular 3D object detecting device described below and the monocular 3D object detecting method described above may be referred to in correspondence with each other.
Fig. 6 shows a schematic structural diagram of a monocular 3D object detecting device, which includes:
the image acquisition module 61 is used for acquiring an image to be detected in a target time period;
the monocular 3D target detecting module 62 inputs the image to be detected into the monocular 3D target detecting model to obtain a target detecting result output by the monocular 3D target detecting model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture;
the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
In this embodiment, the image acquisition module 61 includes: a video acquisition unit that acquires a video stream for the target area to be detected; and an image extraction unit that extracts the image to be detected in the target time period from the video stream, where the image to be detected is a sequence of image frames. Still further, the image extraction unit includes: an image acquisition subunit that acquires, from the video stream, a certain number of frames before and after the current frame to be detected, for example five frames before and five frames after, so that the subsequent monocular 3D target detection model can fuse the image features extracted from two consecutive frames to be detected, improving target detection precision.
In an alternative embodiment, the image obtaining module 61 further includes: the image acquisition unit acquires multiple frames of images to be detected which are continuously shot in a target time period.
Monocular 3D object detection module 62, including data input unit, model detection unit and data output unit, wherein: the data input unit is used for inputting the image to be detected into the monocular 3D target detection model; the model detection unit is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting input adjacent frame images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion to obtain a target detection result; and a data output unit outputting a target detection result.
A model detection unit comprising: the characteristic extraction subunit is used for extracting the characteristics of the image to be detected to obtain the characteristics of the corresponding image; the feature fusion subunit aligns the prior image features corresponding to the prior image to be detected in the adjacent frame image features to the current image features corresponding to the current image to be detected, and performs feature fusion to obtain fusion features; and the monocular 3D target detection subunit performs monocular 3D detection on the fusion characteristics to obtain a target detection result.
Still further, the feature fusion subunit includes: a coordinate conversion grandchild unit that converts the prior image features into the global coordinate system of the current image to be detected to obtain prior-image global features; a feature alignment grandchild unit that converts the prior-image global features into the pixel coordinate system of the current image features to obtain aligned features; and a feature fusion grandchild unit that performs feature fusion on the aligned features and the current image features to obtain the fused features.
Additionally, the coordinate conversion grandchild unit includes: a first coordinate conversion great-grandchild unit that converts the prior image features into the global coordinate system of the prior image to be detected based on pre-constructed transformation matrices, namely a vehicle-to-global coordinate system transformation matrix, a radar-to-vehicle coordinate system transformation matrix, a camera-to-radar coordinate system transformation matrix and a camera-to-pixel coordinate system transformation matrix, to obtain first prior image features; and a second coordinate conversion great-grandchild unit that converts the first prior image features into the global coordinate system of the current image to be detected based on pre-acquired sensor information to obtain the prior-image global features.
In an optional embodiment, the model detection unit further includes: a feature flattening (flatten) subunit that flattens the image features extracted by the feature extraction layer; and a fully connected subunit that integrates the flattened image features to obtain the corresponding integrated image features. It should be noted that the fully connected subunit integrates the features flattened by the flattening subunit to obtain dimension-reconstructed image features, which reduces the difference between adjacent frames of the image to be detected, prevents noise from being introduced during feature fusion, and reduces negative effects.
In an optional embodiment, the model detection unit further includes: and the first camera external reference matrix conversion subunit converts the image characteristics into a pixel coordinate system corresponding to the image to be detected based on a preset camera external reference matrix.
In an alternative embodiment, referring to fig. 7, the apparatus further comprises: and the training module is used for training the monocular 3D target detection model before inputting the image to be detected into the monocular 3D target detection model. Specifically, a training module, comprising:
a picture acquiring unit 71, configured to acquire a training picture and a target truth value corresponding to the training picture;
the model to be trained unit 72 inputs the training picture into the monocular 3D target detection model to be trained to obtain a first target prediction result and a second target prediction result output by the monocular 3D target detection model to be trained; the first target prediction result is obtained by performing monocular 3D target detection on the model to be trained based on training picture features obtained by extracting training pictures, and the second target prediction result is obtained by performing feature fusion on the model to be trained based on adjacent frame training picture features obtained by extracting adjacent frame training pictures and performing monocular 3D target detection by using the fusion training features after feature fusion;
a loss function constructing unit 73 that constructs a first loss function based on the first target prediction result and the target true value, and constructs a second loss function based on the second target prediction result and the target true value;
the training end determination unit 74 obtains a total loss function based on the first loss function and the second loss function, and determines to end training based on the total loss function convergence.
In this embodiment, the picture acquiring unit 71 includes: the data acquisition subunit acquires the training videos or pictures and screens out the videos or pictures containing the target information as effective training pictures; and the marking subunit marks the effective training picture to obtain a target truth value.
In an alternative embodiment, the target true values include a first target true value including a true velocity value obtained based on an absolute motion velocity of the training picture and a second target true value including a true velocity value obtained based on an inter-frame relative displacement of an adjacent frame of the training picture.
In an optional embodiment, the training module further comprises: and the data enhancement unit is used for enhancing the data of the training sample by using a data enhancement strategy. The data enhancement strategy comprises image scaling, horizontal mirror image turning, random brightness and tone adjustment and the like, and the label information of each target is kept unchanged, and meanwhile, the coordinate information of the bounding box is updated according to a corresponding geometric transformation method.
The model unit to be trained 72 includes: the characteristic extraction subunit is used for extracting the characteristics of the training pictures to obtain the characteristics of the corresponding training pictures; the feature fusion subunit aligns the prior training picture features in the training picture features of the adjacent frames to the current training picture features, and performs feature fusion to obtain fusion training features; the first monocular 3D target detection subunit is used for carrying out 3D monocular detection on the training picture features extracted by the feature extraction layer to obtain a first target prediction result; and the second monocular 3D target detection subunit performs monocular 3D detection on the fusion training characteristics to obtain a second target prediction result.
In an alternative embodiment, the model unit to be trained 72 further includes: a feature flattening (flattening) subunit, which is used for flattening the training picture features extracted by the feature extraction layer; and the full-connection subunit integrates the training picture characteristics after the flattening operation to obtain the corresponding integrated training picture characteristics.
In an alternative embodiment, the model unit to be trained 72 further includes: and the second camera external reference matrix converting subunit converts the training picture characteristics into a pixel coordinate system corresponding to the training picture based on the preset camera external reference matrix.
In summary, the monocular 3D target detection module performs feature fusion on the image features extracted based on the adjacent frames of the image to be detected, so that the robustness of feature extraction is improved, and the accuracy of subsequent monocular 3D target detection is further improved; in addition, redundant information generated by correlation between different frame images is eliminated through the monocular 3D object detection module, and the model calculation amount is reduced.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor) 81, a communication Interface (Communications Interface) 82, a memory (memory) 83 and a communication bus 84, wherein the processor 81, the communication Interface 82 and the memory 83 complete communication with each other through the communication bus 84. Processor 81 may invoke logic instructions in memory 83 to perform a monocular 3D object detection method comprising: acquiring an image to be detected in a target time period; inputting an image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
In addition, the logic instructions in the memory 83 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the monocular 3D object detecting method provided by the above methods, the method comprising: acquiring an image to be detected in a target time period; inputting an image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the monocular 3D object detecting method provided by the above methods, the method comprising: acquiring an image to be detected in a target time period; inputting an image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture; the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on fusion features after the feature fusion.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on this understanding, the part of the above technical solutions that in essence contributes to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A monocular 3D object detection method is characterized by comprising the following steps:
acquiring an image to be detected in a target time period;
inputting the image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture;
the monocular 3D target detection model is used for carrying out feature fusion on adjacent frame image features obtained by extracting images to be detected based on adjacent frames and carrying out monocular 3D target detection on fusion features generated after feature fusion.
2. The monocular 3D object detecting method of claim 1, wherein the monocular 3D object detecting model comprises:
the characteristic extraction layer is used for extracting the characteristics of the image to be detected to obtain the corresponding image characteristics;
the characteristic fusion layer is used for aligning the prior image characteristics corresponding to the prior image to be detected in the adjacent frame image characteristics to the current image characteristics corresponding to the current image to be detected and carrying out characteristic fusion to obtain fusion characteristics;
and the monocular 3D target detection layer is used for performing monocular 3D detection on the fusion characteristics to obtain a target detection result.
3. The monocular 3D target detection method according to claim 2, wherein aligning a previous image feature corresponding to a previous image to be detected in the adjacent frame image features to a current image feature corresponding to a current image to be detected, and performing feature fusion to obtain a fusion feature, comprises:
converting the prior image features into a global coordinate system of the current image to be detected to obtain the prior image global features;
converting the global feature of the prior image into a pixel coordinate system of the feature of the current image to obtain an alignment feature;
and performing feature fusion on the alignment features and the current image features to obtain fusion features.
4. The monocular 3D object detecting method of claim 2, wherein the monocular 3D object detecting model further comprises:
a characteristic flattening layer, which is used for flattening the image characteristics extracted by the characteristic extraction layer;
and the full connection layer is used for integrating the image characteristics subjected to the flattening operation to obtain the image characteristics subjected to corresponding integration.
5. The monocular 3D object detecting method according to claim 4, comprising, before integrating the image features after the flattening operation:
and converting the image characteristics into a pixel coordinate system corresponding to the image to be detected based on a preset camera external parameter matrix.
6. The monocular 3D object detecting method of claim 1, wherein training the monocular 3D object detecting model comprises:
acquiring a training picture and a target truth value corresponding to the training picture;
inputting the training picture into a monocular 3D target detection model to be trained to obtain a first target prediction result and a second target prediction result output by the monocular 3D target detection model to be trained; the first target prediction result is obtained by performing monocular 3D target detection on the model to be trained based on training picture features extracted from the training pictures, and the second target prediction result is obtained by performing feature fusion on the model to be trained based on adjacent frame training picture features extracted from adjacent frame training pictures and performing monocular 3D target detection by using the fusion training features after feature fusion;
constructing a first loss function based on the first target prediction result and the target truth value, and constructing a second loss function based on the second target prediction result and the target truth value;
and obtaining a total loss function based on the first loss function and the second loss function, and judging to finish training based on the convergence of the total loss function.
7. The monocular 3D object detecting method of claim 6, wherein the target true value comprises a first target true value and a second target true value, the first target true value comprises a true speed value obtained based on an absolute motion speed of the training picture, and the second target true value comprises a true speed value obtained based on an inter-frame relative displacement of an adjacent frame of the training picture;
in constructing the first loss function, further comprising:
constructing the first loss function based on the first target prediction result and the first target truth value;
in constructing the second loss function, the method further includes:
constructing the second loss function based on the second target prediction result and the second target truth value.
8. A monocular 3D object detecting device, comprising:
the image acquisition module is used for acquiring an image to be detected in a target time period;
the monocular 3D target detection module is used for inputting the image to be detected into a monocular 3D target detection model to obtain a target detection result output by the monocular 3D target detection model; the monocular 3D target detection model is obtained by training based on a training picture and a target truth value corresponding to the training picture;
the monocular 3D target detection model is used for carrying out feature fusion on the basis of adjacent frame image features obtained by extracting adjacent frames of images to be detected and carrying out monocular 3D target detection on the fusion features after feature fusion.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the monocular 3D object detecting method according to any one of claims 1 to 7 are implemented when the processor executes the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the monocular 3D object detecting method according to any one of claims 1 to 7.
CN202210933560.7A 2022-08-04 2022-08-04 Monocular 3D target detection method and device Pending CN115359326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210933560.7A CN115359326A (en) 2022-08-04 2022-08-04 Monocular 3D target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210933560.7A CN115359326A (en) 2022-08-04 2022-08-04 Monocular 3D target detection method and device

Publications (1)

Publication Number Publication Date
CN115359326A true CN115359326A (en) 2022-11-18

Family

ID=84033396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210933560.7A Pending CN115359326A (en) 2022-08-04 2022-08-04 Monocular 3D target detection method and device

Country Status (1)

Country Link
CN (1) CN115359326A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN116189150B (en) * 2023-03-02 2024-05-17 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN116295356A (en) * 2023-03-31 2023-06-23 国广顺能(上海)能源科技有限公司 Monocular detection and ranging method, electronic equipment and storage medium
CN117422692A (en) * 2023-11-02 2024-01-19 华润数字科技有限公司 Visual image detection method and training method of image measurement model

Similar Documents

Publication Publication Date Title
US11968479B2 (en) Method of tracking a mobile device and method of generating a geometrical model of a real environment using a camera of a mobile device
US10991156B2 (en) Multi-modal data fusion for enhanced 3D perception for platforms
CN111108507B (en) Generating a three-dimensional bounding box from two-dimensional image and point cloud data
CN110136199B (en) Camera-based vehicle positioning and mapping method and device
US11064178B2 (en) Deep virtual stereo odometry
US11120280B2 (en) Geometry-aware instance segmentation in stereo image capture processes
KR101359660B1 (en) Augmented reality system for head-up display
CN115359326A (en) Monocular 3D target detection method and device
EP2874097A2 (en) Automatic scene parsing
JP2019211900A (en) Object identification device, system for moving object, object identification method, learning method of object identification model and learning device for object identification model
US10679369B2 (en) System and method for object recognition using depth mapping
JP2023539865A (en) Real-time cross-spectral object association and depth estimation
US20220207897A1 (en) Systems and methods for automatic labeling of objects in 3d point clouds
US20220254008A1 (en) Multi-view interactive digital media representation capture
CN115330809A (en) Indoor dynamic vision SLAM algorithm based on line feature optimization
WO2022170327A1 (en) Multi-view interactive digital media representation viewer
CN118429558A (en) Scene map reconstruction method, device, equipment and storage medium
CN117036403A (en) Image processing method, device, electronic equipment and storage medium
CN114743079A (en) 3D target detection method and device based on sparse radar and binocular stereo image fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination