CN112733672A - Monocular camera-based three-dimensional target detection method and device and computer equipment - Google Patents

Monocular camera-based three-dimensional target detection method and device and computer equipment Download PDF

Info

Publication number
CN112733672A
CN112733672A (application CN202011631597.1A)
Authority
CN
China
Prior art keywords
target object
image
feature map
detection model
upsampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011631597.1A
Other languages
Chinese (zh)
Inventor
刘明
廖毅雄
马福龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yiqing Innovation Technology Co ltd
Original Assignee
Shenzhen Yiqing Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yiqing Innovation Technology Co ltd filed Critical Shenzhen Yiqing Innovation Technology Co ltd
Priority to CN202011631597.1A priority Critical patent/CN112733672A/en
Publication of CN112733672A publication Critical patent/CN112733672A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The application relates to a monocular camera-based three-dimensional target detection method and apparatus, computer device, and storage medium. The method includes: acquiring an image captured by a monocular camera in an autonomous driving scenario; inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer; performing feature enhancement on the fused feature map; regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object. The method can improve the accuracy of three-dimensional target detection.

Description

Monocular camera-based three-dimensional target detection method and device and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a monocular camera-based three-dimensional target detection method, apparatus, and computer device.
Background
With the development of computer technology, autonomous driving has become a research hotspot. In an autonomous driving scenario, accurately detecting surrounding objects is essential. To reduce cost, mass-produced systems mainly capture images of surrounding obstacles with a camera and detect surrounding objects from the captured images.
However, conventional methods extract features from the captured image and detect objects on the output feature map. Because the receptive field of a feature map obtained by direct feature extraction is small, and the captured image is distorted relative to the real scene, detecting directly on such a low-receptive-field feature map cannot accurately detect surrounding objects in an autonomous driving scenario.
Disclosure of Invention
In view of the above, it is necessary to provide a monocular camera-based three-dimensional target detection method, apparatus, computer device, and storage medium capable of improving detection accuracy.
A monocular camera-based three-dimensional target detection method, the method comprising:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further includes:
inputting the image into a split-attention residual network in a trained target object detection model;
in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and
upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
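The core mechanism of the deformable convolution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: a learned per-tap offset shifts each sampling location of a 3x3 kernel, and bilinear interpolation handles the fractional positions that result.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a (H, W) feature map at a fractional location (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def deformable_conv_point(feat, weight, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centered at (cy, cx).
    offsets has shape (3, 3, 2): a learned (dy, dx) shift per kernel tap,
    which lets the kernel deform to the target object's geometry."""
    out = 0.0
    for i, ky in enumerate((-1, 0, 1)):
        for j, kx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i, j]
            out += weight[i, j] * bilinear_sample(feat, cy + ky + dy, cx + kx + dx)
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
weight = np.full((3, 3), 1.0 / 9.0)                 # simple averaging kernel
zero = np.zeros((3, 3, 2))                          # zero offsets: ordinary conv
shift = np.zeros((3, 3, 2)); shift[..., 1] = 1.0    # every tap shifted right by 1

regular = deformable_conv_point(feat, weight, zero, 2, 2)
deformed = deformable_conv_point(feat, weight, shift, 2, 2)
```

With zero offsets the result equals an ordinary convolution; a non-zero offset field moves the sampling grid, which is what allows the receptive field to follow the object's shape.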
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further includes:
upsampling the image multiple times using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer;
where the feature map of the preceding lower layer is the corresponding feature map output by the preceding upsampling that has not yet been fused before the current upsampling.
In one embodiment, the step of regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point further includes:
adding prior frames to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame from the prior frames;
regressing the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame according to the distance between the target object's center point and the monocular camera; and
obtaining the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
In one embodiment, the method further comprises:
labeling a sample image with label information according to point cloud coordinates, where the point cloud coordinates come from a point cloud collected by a LiDAR for the sample scene;
converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information;
converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and
training the target object detection model on the sample image converted into the pixel coordinate system.
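The two coordinate conversions in this embodiment can be sketched as follows. The calibration values (R, t, K) below are hypothetical placeholders for illustration, not values from the patent: the extrinsics (R, t) map a LiDAR point into the camera frame, and the intrinsic matrix K projects camera-frame points onto the image plane.

```python
import numpy as np

# Hypothetical calibration (not from the patent):
R = np.eye(3)                       # assume LiDAR and camera axes are aligned
t = np.array([0.0, -1.2, 0.0])      # camera mounted 1.2 m above the LiDAR
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])

def lidar_to_pixel(p_lidar):
    """Project one LiDAR point to pixel coordinates (u, v)."""
    p_cam = R @ p_lidar + t          # point cloud -> camera coordinate system
    u, v, z = K @ p_cam              # camera coordinates -> image plane
    return np.array([u / z, v / z])  # perspective division yields pixels

point = np.array([2.0, 1.2, 10.0])   # a labeled point 10 m in front of the camera
pixel = lidar_to_pixel(point)
```

The same two-step transform, applied to every labeled point, converts point-cloud labels into pixel-coordinate labels for training.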
In one embodiment, the method further comprises:
when using the trained target object detection model, quantizing the 32-bit or 16-bit data to be computed in the model into 8-bit integer form using an inference optimizer in the model, by minimizing KL divergence; and
performing the computation of the target object detection model on the data quantized into 8-bit integer form.
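A simplified sketch of KL-divergence-based INT8 calibration follows. Real inference optimizers (e.g., TensorRT, which this embodiment resembles) use a more elaborate histogram procedure, so treat this as an illustration of the idea only: among candidate saturation thresholds, pick the one whose quantize-dequantize distribution stays closest (in KL divergence) to the original activation distribution.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL divergence between two (unnormalized) histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def calibrate_int8_scale(activations, num_candidates=50, bins=128):
    """Pick the saturation threshold whose INT8 quantization minimizes the
    KL divergence from the original activation distribution, and return the
    resulting per-tensor scale (threshold / 127)."""
    a = np.abs(activations)
    max_val = a.max()
    ref_hist, _ = np.histogram(a, bins=bins, range=(0, max_val))
    best_t, best_kl = max_val, np.inf
    for t in np.linspace(max_val / num_candidates, max_val, num_candidates):
        scale = t / 127.0
        q = np.clip(np.round(a / scale), 0, 127) * scale   # quantize-dequantize
        q_hist, _ = np.histogram(q, bins=bins, range=(0, max_val))
        kl = kl_divergence(ref_hist.astype(float), q_hist.astype(float))
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t / 127.0

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10000).astype(np.float32)  # stand-in fp32 activations
scale = calibrate_int8_scale(acts)
```

With the calibrated scale, each 32-bit value x is stored as the 8-bit integer round(x / scale), clipped to [-127, 127].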
A monocular camera-based three-dimensional target detection apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image captured by a monocular camera in an autonomous driving scenario;
a feature extraction module, configured to input the image into a split-attention residual network in a trained target object detection model, upsample the image using deformable convolution, and fuse each feature map obtained after upsampling with the feature map of the preceding lower layer;
an enhancement module, configured to perform feature enhancement on the fused feature map;
a regression module, configured to regress, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
a detection module, configured to adjust the position of the three-dimensional frame according to the offset and obtain a target detection result for the target object.
In one embodiment, the feature extraction module is further configured to input the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deform a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
acquiring an image captured by a monocular camera in an autonomous driving scenario;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, the orientation of the target object, and the offset of the target object's center point; and
adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result for the target object.
According to the monocular camera-based three-dimensional target detection method and apparatus, computer device, and storage medium, an image captured by a monocular camera in an autonomous driving scenario is acquired; compared with a binocular camera, this halves the cost. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature extraction network; compared with a plain residual network, the added split-attention module lets the network focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the result is further enhanced, so the output feature map has a relatively large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed from this feature map with a larger receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from extracting features with a residual network combined with a split-attention module, to enhancing the feature map obtained through multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved overall.
Drawings
FIG. 1 is a diagram of an application environment of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 2 is a schematic flow chart illustrating a monocular camera-based three-dimensional object detection method according to an embodiment;
FIG. 3 is a diagram illustrating the results of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 4 is a block diagram of a monocular camera-based three-dimensional object detection device in one embodiment;
FIG. 5 is a block diagram of a monocular camera-based three-dimensional object detection device in another embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The monocular camera-based three-dimensional target detection method provided by this application can be applied in the environment shown in FIG. 1, in which the monocular camera 102 is connected to the vehicle 104 via a network. The vehicle 104 is provided with a target detection device. The monocular camera 102 passes the images captured in the autonomous driving scenario to the target detection device of the vehicle 104 for detection.
In one embodiment, as shown in fig. 2, a monocular camera-based three-dimensional object detection method is provided, which is described by taking the application of the method to the object detection device in fig. 1 as an example, and includes the following steps:
step 202, acquiring an image acquired by a monocular camera under an automatic driving scene.
The monocular camera is a camera. It is understood that a binocular camera or a multi-view camera uses two or more cameras. The automatic driving scene is a scene in which the vehicle automatically travels.
In one embodiment, the vehicle in the autonomous driving scenario may or may not have a driver.
In one embodiment, pictures of different automatic driving scenes, including pictures of various automatic driving scenes such as daytime, backlight, night, rainy days and foggy days, can be collected through the monocular camera.
Specifically, in a vehicle automatic driving scene, a monocular camera is arranged on the vehicle, and the monocular camera can shoot the automatic driving scene so as to acquire an image in the automatic driving scene.
Step 204, inputting the image into a split-attention residual network in the trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer.
The target object detection model is a model for detecting a target object in an image. It comprises a split-attention residual network, a detection head, and a task head. The split-attention residual network is a residual network to which a split-attention module is added; it extracts features of the target object from the image and may also be called a feature extractor. A deformable convolution is a convolution whose kernel can change shape. A feature map is a map carrying features of the target object. The target object is the object to be detected.
Specifically, the target detection device feeds the image captured by the monocular camera into the trained target object detection model. In the split-attention residual network of the trained model, the target detection device downsamples the image by a preset stride and outputs a feature map. The target detection device then upsamples the downsampled feature map using deformable convolution in the split-attention residual network, outputs each feature map obtained after upsampling through different numbers of channels, and superimposes them into the upsampled feature map. In the split-attention residual network, the target detection device fuses the upsampled feature map with the feature map of the preceding lower layer.
In one embodiment, the step size of the down-sampling may be 32.
Step 206, performing feature enhancement on the fused feature map.
The detection head is the module that enhances the features of the target object.
Specifically, the target detection device convolves the fused feature map through the detection head in the target object detection model to enhance the features of the fused feature map.
Step 208, regressing, based on the enhanced feature map, a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the target object's center point.
The three-dimensional frame is a solid geometric frame containing three-dimensional information.
Specifically, based on the enhanced feature map, the target detection device uses the task head in the target object detection model to regress the length, width, and height of a three-dimensional frame for identifying the target object in the image, the orientation of the target object, and the offset of the identified target object's center point. The task head is the module that regresses the three-dimensional frame and the offset of the target object's center point.
Step 210, adjusting the position of the three-dimensional frame according to the offset, and obtaining a target detection result for the target object.
The target detection result includes the category, orientation, position, and length, width, and height of the target object.
Specifically, during model training, a user may manually adjust the camera parameters and thereby adjust the offset of the camera coordinates relative to the world coordinates; the offsets are stored as parameters of the trained target object detection model. The target detection device adjusts the position of the three-dimensional frame according to the stored offset through the task head in the target object detection model, thereby detecting the category, orientation, position, and length, width, and height of the three-dimensional target object.
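The adjustment in step 210 can be illustrated with a small sketch. The stride, coarse center, and offset values here are hypothetical illustrations, not from the patent: a coarse box center predicted on the downsampled feature map is shifted by the regressed sub-cell offset before being mapped back to image pixels.

```python
import numpy as np

stride = 4                                  # feature-map-to-image stride (assumed)
coarse_center = np.array([30.0, 45.0])      # (x, y) cell of the box center
predicted_offset = np.array([0.25, -0.4])   # regressed center-point offset

def refine_center(cell, offset, stride):
    """Shift the coarse center by the regressed offset, then map to pixels."""
    return (cell + offset) * stride

center_px = refine_center(coarse_center, predicted_offset, stride)
```

Without the offset the center would snap to the cell grid; applying it recovers the sub-cell position, which is what keeps the three-dimensional frame from drifting off the object.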
According to the above monocular camera-based three-dimensional target detection method, an image captured by a monocular camera in an autonomous driving scenario is acquired; compared with a binocular camera, this halves the cost. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature extraction network; compared with a plain residual network, the added split-attention module lets the network focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the result is further enhanced, so the output feature map has a relatively large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed from this feature map with a larger receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from extracting features with a residual network combined with a split-attention module, to enhancing the feature map obtained through multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved overall.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further comprises: inputting the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to that geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A convolution kernel is a matrix convolved with the pixel matrix corresponding to the image. The receptive field describes how much of the input the feature map perceives when identifying the target object: if the target object can be easily identified from the feature map, the receptive field is said to be large; otherwise it is said to be small.
Specifically, the target detection device upsamples the feature map of the downsampled image multiple times using deformable convolution in the split-attention residual network of the trained target object detection model. In each upsampling, a convolution kernel deformed according to the geometric shape of the target object is convolved with the pixel matrix corresponding to the feature map, so as to obtain a feature map whose receptive field matches the size of the target object.
In this embodiment, upsampling with deformable convolution enhances the target object detection model's adaptability to changes in the geometric shape of the target object in the image.
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further comprises: fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer, where the feature map of the preceding lower layer is the corresponding unfused feature map output by the preceding upsampling, before the current upsampling.
Here, each convolutional layer in the split-attention residual network corresponds to a layer of neurons. It can be understood that each layer of the network convolves the image input to it; that input may be the original image or the feature map output by the convolution of the layer above.
Specifically, through the split-attention residual network, the target detection device fuses each feature map obtained by upsampling in each convolutional layer with the not-yet-fused feature map obtained by upsampling in the preceding lower convolutional layer.
In one embodiment, the number of upsamplings may be three. In the split-attention residual network of the trained target object detection model, the target detection device first upsamples the feature map output after downsampling using deformable convolution, outputs the feature map through 256 channels, and superimposes the outputs to obtain the first unfused upsampled feature map; this is fused with the feature map output after downsampling to obtain the first fused feature map. The target detection device then performs a second upsampling on the first fused feature map using deformable convolution through the split-attention residual network, outputs the feature map through 128 channels, and superimposes the outputs to obtain the second unfused upsampled feature map, which is fused with the first fused feature map to obtain the second fused feature map. Finally, the target detection device performs a third upsampling on the second fused feature map using deformable convolution through the split-attention residual network, outputs the feature maps through 64 channels, and superimposes them to obtain the third unfused upsampled feature map, which is fused with the second fused feature map to obtain the third fused feature map.
In this embodiment, the receptive field of the downsampled feature map is small, which is unfavorable for target detection; upsampling the output feature map improves its expressive power. Deformable convolution adapts to changes in the target's geometric shape and improves the generalization of the convolutional network.
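The three upsampling-and-fusion stages with 256/128/64 channels can be sketched as a toy decoder. This NumPy sketch substitutes nearest-neighbour upsampling and random 1x1-style channel projections for the patent's deformable convolutions, and uses random stand-ins for the unfused preceding lower-layer feature maps; only the shapes and the fuse-after-each-upsampling structure are meant to match the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def project(x, c_out):
    """1x1-convolution-like channel projection with random weights."""
    w = rng.normal(size=(c_out, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum('oc,chw->ohw', w, x)

# Backbone output after 32x downsampling of a hypothetical 512x512 image.
fused = rng.normal(size=(512, 16, 16))
# Stand-ins for the unfused preceding lower-layer feature maps.
laterals = [rng.normal(size=(256, 32, 32)),
            rng.normal(size=(128, 64, 64)),
            rng.normal(size=(64, 128, 128))]

for channels, lateral in zip((256, 128, 64), laterals):
    up = project(upsample2x(fused), channels)  # upsampled, not yet fused
    fused = up + lateral                       # fuse with preceding lower layer
```

After the three stages the fused map has 64 channels at 8x the backbone's output resolution, mirroring the 256 -> 128 -> 64 channel schedule described above.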
In one embodiment, the step of regressing, based on the enhanced feature map, the three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the target object's center point further comprises: adding prior frames to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame from the prior frames; regressing the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame according to the distance between the target object's center point and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
A prior frame is an initial two-dimensional geometric frame; a two-dimensional geometric frame has only a length and a width. There may be multiple prior frames, and an accurate detection frame is finally obtained through continual regression.
Specifically, based on the feature map enhanced by the detection head, the target detection device adds multiple prior frames to the image through the task head in the target object detection model, and obtains the optimal prior frame, that is, regresses the length and width of the three-dimensional frame, by continually comparing a threshold computed from the prior frames with a reference threshold in the trained target object detection model. According to the distance from the target object's center point to the monocular camera, the target detection device regresses, through the task head, the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame, and obtains the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
In one embodiment, as shown in FIG. 3, from the image captured by the monocular camera, a monocular camera-based three-dimensional object detection device on the vehicle may identify the target object on the image with a three-dimensional frame.
In this embodiment, based on the enhanced feature map, the task head in the target object detection model regresses the offset of the target object's center point to adjust the three-dimensional frame used to identify the target object in the image, reducing cases where the three-dimensional frame extends beyond the image, that is, cases where the frame is truncated, thereby avoiding truncation of the identified target object and improving the accuracy of the three-dimensional frame.
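The prior-frame matching in this embodiment can be sketched with a simple IoU comparison. The boxes, threshold, and selection rule below are illustrative assumptions, since the patent does not spell out how the "threshold computed from the prior frames" is defined; IoU against a candidate region is a common choice for such a score.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical prior frames and a candidate object region, in pixels.
priors = np.array([[0, 0, 40, 40], [10, 10, 70, 70], [20, 20, 60, 100]], dtype=float)
target = np.array([12, 12, 68, 72], dtype=float)
threshold = 0.5                              # assumed reference threshold

scores = np.array([iou(p, target) for p in priors])
best = priors[scores.argmax()]               # prior kept for length/width regression
kept = scores.max() >= threshold
```

The best-matching prior then serves as the starting point from which the length and width of the three-dimensional frame are regressed.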
In one embodiment, the method further comprises: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
The sample image is an image of a sample used for detection. The point cloud coordinates are coordinates annotated on a plurality of cloud-like points. The camera coordinate system is a coordinate system defined from the viewpoint of the camera. The camera external parameters comprise the rotation parameters about the three coordinate axes of the camera and the translation parameters along those three axes. The camera internal parameters comprise the radial distortion coefficients and tangential distortion coefficients of the camera. The sample image converted into the pixel coordinate system is a pixel matrix.
Specifically, the user annotates the sample images acquired by the monocular camera. In labeling software, the user can input a sample image acquired by the monocular camera, adjust the point cloud coordinates added to the target object, add the values corresponding to those coordinates, and attach label information to the labeled target object. The label information includes: a picture identifier, the image category, the camera internal parameters, the two-dimensional frame, whether the object is truncated, the degree of occlusion, the orientation, and the three-dimensional dimensions. Truncation may be represented by "0" for not truncated and "1" for truncated. Sample images in an automatic driving scene are acquired by the monocular camera, and the user adds point cloud coordinates to the sample target object in each sample image and labels it with label information. The target detection device converts the point cloud coordinates into coordinates in the camera coordinate system through the camera external parameters in the label information, and then converts the coordinates in the camera coordinate system into the pixel coordinate system through the camera internal parameters in the label information. The user then trains the target object detection model with the sample images converted into the pixel coordinate system.
In one embodiment, fine-tuning the camera external parameters may adjust the three rotational degrees of freedom of the camera coordinate system, namely pitch, yaw, and roll, and translate the coordinate axes of the camera coordinate system, namely the x, y, and z axes. The target detection device can convert the point cloud coordinates into coordinates in the camera coordinate system through the offsets obtained from this fine-tuning in the label information. The target detection device then converts the coordinates in the camera coordinate system into the pixel coordinate system through the radial distortion coefficients, the tangential distortion coefficients, and the pixel scale of the camera.
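Under standard pinhole-camera assumptions, the two conversions above can be sketched as follows; `R`, `t`, and `K` stand for the extrinsic rotation/translation and the intrinsic matrix taken from the label information, and lens distortion is omitted for brevity:

```python
import numpy as np

def lidar_to_pixel(pts_lidar, R, t, K):
    """Project Nx3 lidar points to pixel coordinates.

    R (3x3), t (3,): camera extrinsic parameters (lidar -> camera frame).
    K (3x3): camera intrinsic matrix (distortion ignored in this sketch).
    """
    pts_cam = pts_lidar @ R.T + t   # point cloud -> camera coordinate system
    uvw = pts_cam @ K.T             # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]   # perspective division -> pixel coordinate system
    return uv, pts_cam[:, 2]        # pixel coordinates and depth
```

For instance, a point 10 m straight ahead of a camera with focal length 700 and principal point (320, 240) projects to the principal point itself.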
In this embodiment, after sample images in different automatic driving scenes are labeled, they are fed into the target object detection model for training, which can improve the detection accuracy of the target object detection model.
In one embodiment, the method further comprises: in the process of using the trained target object detection model, 32-bit or 16-bit data to be calculated in the target object detection model is quantized into data in an 8-bit integer form by using an inference optimizer in the target object detection model in a mode of minimizing KL divergence (Kullback-Leibler divergence); and calculating the data quantized into the form of 8-bit integers according to the target object detection model.
Specifically, the target detection device quantizes 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of an 8-bit integer by using an inference optimizer in the target object detection model in a manner of minimizing the KL divergence, and then performs calculation.
In one embodiment, for example, before upsampling, the 32-bit or 16-bit data corresponding to the feature map to be subjected to deformable convolution may be quantized into data in the form of 8-bit integers, and the deformable convolution is then performed.
In one embodiment, one tenth of the training data may be taken as a calibration data set; the target objects in the images are first inferred with 32-bit data in the target object detection model, a histogram of the activation values of each layer is then collected, the saturated quantization distributions under different thresholds are computed, and finally the threshold that minimizes the KL divergence is selected.
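The calibration loop can be sketched roughly as below. This is a simplified version of the entropy-calibration idea only; the bin counts and the bin-merging scheme are illustrative assumptions, not the patent's or TensorRT's exact procedure:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence between two (unnormalized) histograms."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

def find_calibration_threshold(activations, num_bins=1024, num_quant_bins=128):
    """Pick a saturation threshold for INT8 quantization by minimizing the KL
    divergence between the FP32 activation histogram and its quantized,
    saturated counterpart (illustrative sketch)."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_quant_bins, num_bins + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()  # saturate the clipped tail into the last bin
        # collapse the first i bins into num_quant_bins levels, then expand back
        quant = np.zeros(i)
        for chunk in np.array_split(np.arange(i), num_quant_bins):
            total = hist[chunk].sum()
            nonzero = np.count_nonzero(hist[chunk])
            if nonzero:
                quant[chunk] = np.where(hist[chunk] > 0, total / nonzero, 0.0)
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t
```

Activations beyond the returned threshold are clipped before the linear mapping to the 8-bit integer range.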
In one embodiment, the inference optimizer may be "TensorRT," a forward-propagation-only deep learning framework.
In this embodiment, quantizing the 32-bit or 16-bit data to be calculated into data in the form of 8-bit integers increases the operation speed of the CPU or GPU of the target detection device, and thereby increases the speed at which the target object detection model detects the target object; specifically, the speed can be increased by at least 1.5 times compared with the speed before acceleration.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown, and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a monocular camera-based three-dimensional object detecting device 400, comprising: an image acquisition module 402, a feature extraction module 404, an enhancement module 406, a regression module 408, and a detection module 410, wherein:
The image acquisition module 402 is configured to acquire an image captured by a monocular camera in an automatic driving scene.
The feature extraction module 404 is configured to input the image into the discrete attention residual network in the trained target object detection model, perform upsampling on the image by using deformable convolution, and fuse each feature map obtained after upsampling with the feature map of the previous lower layer.
The enhancement module 406 is configured to perform feature enhancement on the fused feature map.
The regression module 408 is configured to regress, based on the enhanced feature map, a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object.
The detection module 410 is configured to adjust the position of the three-dimensional frame according to the offset, and obtain a target detection result of the target object.
In one embodiment, the feature extraction module 404 is further configured to: input the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deform the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to that geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
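As an illustration of the idea only (not the model's actual layer), a single output position of a 3x3 deformable convolution can be computed by bilinearly sampling the feature map at offset-shifted tap locations:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    h, w = feat.shape
    y0 = int(np.clip(np.floor(y), 0, h - 1))
    x0 = int(np.clip(np.floor(x), 0, w - 1))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def deformable_conv_at(feat, weight, offsets, cy, cx):
    """3x3 deformable convolution at one output position (cy, cx): every
    kernel tap is shifted by a learned (dy, dx) offset before sampling,
    which lets the receptive field follow the shape of the target object."""
    out, k = 0.0, 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += weight[i + 1, j + 1] * bilinear_sample(feat, cy + i + dy, cx + j + dx)
            k += 1
    return out
```

With all offsets at zero this reduces to an ordinary 3x3 convolution; the learned offsets are what adapt the kernel to the object's geometry.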
In one embodiment, the feature extraction module 404 is further configured to perform upsampling on the image multiple times by using deformable convolution, and to fuse the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
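The repeated upsample-then-fuse step can be sketched as a top-down pass over a feature pyramid. Nearest-neighbor upsampling stands in for the deformable-convolution upsampling and element-wise addition stands in for the fusion; both are simplifying assumptions for illustration:

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling (placeholder for deformable-conv upsampling)."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fuse(pyramid):
    """pyramid: feature maps from deepest (smallest) to shallowest (largest),
    each level twice the spatial size of the previous one. The map obtained
    after each upsampling is fused with the not-yet-fused map of the
    previous lower layer."""
    fused = pyramid[0]
    outputs = [fused]
    for lower in pyramid[1:]:
        fused = upsample2x(fused) + lower  # fuse with the previous lower layer
        outputs.append(fused)
    return outputs
```

The final, highest-resolution output is the fused feature map that feature enhancement is then applied to.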
In one embodiment, the regression module 408 is further configured to add prior frames to the image based on the enhanced feature map and regress the length and the width of the three-dimensional frame according to the prior frames; regress the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtain the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the apparatus further comprises:
The training module 401 is configured to label the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; convert the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; convert the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and train the target object detection model according to the sample image converted into the pixel coordinate system.
As shown in fig. 5, in one embodiment, the apparatus further comprises: a training module 401 and an acceleration module 412;
The acceleration module 412 is configured to, in the process of using the trained target object detection model, quantize the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence, and to perform calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
For specific limitations of the monocular camera-based three-dimensional object detection device, reference may be made to the above limitations of the monocular camera-based three-dimensional object detection method, which are not described herein again. The modules in the monocular camera-based three-dimensional object detecting device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be an object detection device on a vehicle in an autonomous driving scenario, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with external target detection equipment, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a monocular camera-based three-dimensional object detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring an image acquired by a monocular camera in an automatic driving scene; inputting the image into the discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the previous lower layer; performing feature enhancement on the fused feature map; regressing a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object based on the enhanced feature map; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deforming the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the processor, when executing the computer program, further performs the steps of: performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
In one embodiment, the processor, when executing the computer program, further performs the steps of: adding prior frames to the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frames; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the processor, when executing the computer program, further performs the steps of: in the process of using the trained target object detection model, quantizing the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence; and performing calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring an image acquired by a monocular camera in an automatic driving scene; inputting the image into the discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the previous lower layer; performing feature enhancement on the fused feature map; regressing a three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object based on the enhanced feature map; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: inputting the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deforming the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer, where the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: adding prior frames to the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frames; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, the width, and the height of the three-dimensional frame.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: labeling the sample image with label information according to point cloud coordinates, the point cloud coordinates being formed from sample images collected by a laser radar; converting the point cloud coordinates into coordinates in a camera coordinate system through the camera external parameters in the label information; converting the coordinates in the camera coordinate system into a pixel coordinate system through the camera internal parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: in the process of using the trained target object detection model, quantizing the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers by using the inference optimizer in the target object detection model in a manner of minimizing the KL divergence; and performing calculation on the data quantized into the form of 8-bit integers according to the target object detection model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A monocular camera-based three-dimensional target detection method is characterized by comprising the following steps:
acquiring an image acquired by a monocular camera under an automatic driving scene;
inputting the image into a discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with a feature map of the previous lower layer;
performing feature enhancement on the fused feature map;
regressing a three-dimensional frame for identifying a target object, the orientation of the target object and the offset of the central point of the target object based on the enhanced feature map;
and adjusting the position of the three-dimensional frame according to the offset, and obtaining a target detection result of the target object.
2. The method of claim 1, wherein the step of inputting the image into a discrete attention residual network in a trained target object detection model and upsampling the image using a deformable convolution further comprises:
inputting the image into the discrete attention residual network in the trained target object detection model;
in the process of performing upsampling on the image multiple times by using deformable convolution, deforming a convolution kernel according to the geometric shape of a target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape;
and upsampling the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
3. The method according to claim 1, wherein the step of upsampling the image by using deformable convolution and fusing each feature map obtained after the upsampling with the feature map of the previous lower layer further comprises:
performing upsampling on the image multiple times by using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the previous lower layer;
wherein the feature map of the previous lower layer is the corresponding feature map that was output after the previous upsampling and had not yet been fused before the current upsampling.
4. The method of claim 1, wherein the step of regressing a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object based on the enhanced feature map further comprises:
adding a prior frame in the image based on the enhanced feature map, and regressing the length and the width of the three-dimensional frame according to the prior frame;
regressing the orientation of the target object, the offset of the central point of the target object and the height of the three-dimensional frame according to the distance between the central point of the target object and the monocular camera;
and obtaining a three-dimensional frame for identifying the target object in the image according to the length and the width of the three-dimensional frame and the height of the three-dimensional frame.
5. The method of claim 1, wherein the target object detection model is obtained by a model training step comprising:
labeling label information on the sample image according to the point cloud coordinates; the point cloud coordinates are formed by sample images collected by a laser radar;
converting the point cloud coordinates into coordinates under a camera coordinate system through camera external parameters of the label information;
converting the coordinates under the camera coordinate system to a pixel coordinate system through the camera internal parameters of the label information;
and training a target object detection model according to the sample image converted into the pixel coordinate system.
6. The method according to any one of claims 1 to 5, further comprising:
in the process of using the trained target object detection model, quantizing 32-bit or 16-bit data to be calculated in the target object detection model into data in an 8-bit integer form by using an inference optimizer in the target object detection model in a mode of minimizing KL divergence;
and calculating the data quantized into the form of 8-bit integers according to the target object detection model.
7. A monocular camera-based three-dimensional object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image acquired by the monocular camera under an automatic driving scene;
the feature extraction module is used for inputting the image into a discrete attention residual network in a trained target object detection model, performing upsampling on the image by using deformable convolution, and fusing each feature map obtained after upsampling with a feature map of the previous lower layer;
the enhancement module is used for carrying out feature enhancement on the fused feature map;
a regression module for regressing a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object based on the enhanced feature map;
and the detection module is used for adjusting the position of the three-dimensional frame according to the offset and obtaining a target detection result of the target object.
8. The apparatus of claim 7, wherein the feature extraction module is further configured to input the image into the discrete attention residual network in the trained target object detection model; in the process of performing upsampling on the image multiple times by using deformable convolution, deform the convolution kernel according to the geometric shape of the target object in the image for each upsampling, to obtain a convolution kernel adapted to the geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape, to obtain a feature map whose receptive field matches the size of the target object.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011631597.1A 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment Pending CN112733672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631597.1A CN112733672A (en) 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment


Publications (1)

Publication Number Publication Date
CN112733672A true CN112733672A (en) 2021-04-30

Family

ID=75609926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011631597.1A Pending CN112733672A (en) 2020-12-31 2020-12-31 Monocular camera-based three-dimensional target detection method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112733672A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837977A (en) * 2021-09-22 2021-12-24 马上消费金融股份有限公司 Object tracking method, multi-target tracking model training method and related equipment
CN114550163A (en) * 2022-02-25 2022-05-27 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
WO2023050810A1 (en) * 2021-09-30 2023-04-06 上海商汤智能科技有限公司 Target detection method and apparatus, electronic device, storage medium, and computer program product
CN117336459A (en) * 2023-10-10 2024-01-02 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020042345A1 (en) * 2018-08-28 2020-03-05 初速度(苏州)科技有限公司 Method and system for acquiring line-of-sight direction of human eyes by means of single camera
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
WO2020221990A1 (en) * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Facial localisation in images
JP2020205048A (en) * 2019-06-18 2020-12-24 富士通株式会社 Object detection method based on deep learning network, apparatus, and electronic device
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN110674866A (en) * 2019-09-23 2020-01-10 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN110717905A (en) * 2019-09-30 2020-01-21 上海联影智能医疗科技有限公司 Brain image detection method, computer device, and storage medium
CN111126385A (en) * 2019-12-13 2020-05-08 哈尔滨工程大学 Deep learning intelligent identification method for deformable living body small target
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111047516A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111415342A (en) * 2020-03-18 2020-07-14 北京工业大学 Attention mechanism fused automatic detection method for pulmonary nodule image of three-dimensional convolutional neural network
CN111539484A (en) * 2020-04-29 2020-08-14 北京市商汤科技开发有限公司 Method and device for training neural network
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion
CN111695448A (en) * 2020-05-27 2020-09-22 东南大学 Roadside vehicle identification method based on visual sensor
CN111932550A (en) * 2020-07-01 2020-11-13 浙江大学 3D ventricle nuclear magnetic resonance video segmentation system based on deep learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LONG CHEN: "Semantic Segmentation via Structured Refined Prediction and Dual Global Priors", 2019 IEEE 4th International Conference on Advanced Robotics and Mechatronics (ICARM) *
YUNJI ZHAO: "Fault Diagnosis Based on Space Mapping and Deformable Convolution Networks", IEEE Access, vol. 8 *
YAN JUAN; FANG ZHIJUN; GAO YONGBIN: "3D object detection combining mixed-domain attention and dilated convolution" (in Chinese), Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) *
HOU XIANGDAN; ZHAO YIHAO; LIU HONGPU; GUO HONGYONG; YU XIXIN; DING MENGYUAN: "UNet optic disc segmentation with a fused residual attention mechanism" (in Chinese), Journal of Image and Graphics (中国图象图形学报), no. 09, 16 September 2020 (2020-09-16) *
WANG WENCHAO: "Sketch image retrieval based on deformable convolution" (in Chinese), Computer Systems & Applications (计算机系统应用), no. 07, 15 July 2020 (2020-07-15) *
SU JUNXIONG; JIAN XUETING; LIU WEI; HUA JUNDA; ZHANG SHENGXIANG: "Gesture recognition method based on deformable convolutional neural networks" (in Chinese), Computer and Modernization (计算机与现代化), no. 04, 20 April 2018 (2018-04-20) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837977A (en) * 2021-09-22 2021-12-24 马上消费金融股份有限公司 Object tracking method, multi-target tracking model training method and related equipment
WO2023050810A1 (en) * 2021-09-30 2023-04-06 上海商汤智能科技有限公司 Target detection method and apparatus, electronic device, storage medium, and computer program product
CN114550163A (en) * 2022-02-25 2022-05-27 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN114550163B (en) * 2022-02-25 2023-02-03 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN117336459A (en) * 2023-10-10 2024-01-02 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112733672A (en) Monocular camera-based three-dimensional target detection method and device and computer equipment
CN111797657A (en) Vehicle peripheral obstacle detection method, device, storage medium, and electronic apparatus
CN110889464B (en) Neural network training method for detecting target object, and target object detection method and device
US20160360186A1 (en) Methods and systems for human action recognition using 3d integral imaging
CN110827202A (en) Target detection method, target detection device, computer equipment and storage medium
CN112036455B (en) Image identification method, intelligent terminal and storage medium
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
CN111222387B (en) System and method for object detection
CN114998856B (en) 3D target detection method, device, equipment and medium for multi-camera image
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
WO2023284255A1 (en) Systems and methods for processing images
US11605220B2 (en) Systems and methods for video surveillance
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
CN113496472A (en) Image defogging model construction method, road image defogging device and vehicle
CN112241963A (en) Lane line identification method and system based on vehicle-mounted video and electronic equipment
CN112132753A (en) Infrared image super-resolution method and system for multi-scale structure guide image
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN110633692A (en) Pedestrian identification method and related device for unmanned aerial vehicle aerial photography
CN116543143A (en) Training method of target detection model, target detection method and device
CN111382654A (en) Image processing method and apparatus, and storage medium
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN115222621A (en) Image correction method, electronic device, storage medium, and computer program product
CN114882465A (en) Visual perception method and device, storage medium and electronic equipment
CN113963060A (en) Vehicle information image processing method and device based on artificial intelligence and electronic equipment
CN112634331A (en) Optical flow prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination